Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pluggable and customizable cache design #907

Draft
wants to merge 23 commits into
base: master
Choose a base branch
from

Conversation

team172011
Copy link
Member

@team172011 team172011 commented Oct 26, 2022

Suggestion on how to use a third party library (caffeine) as cache

  • This approach allows to use a default cache, to enable or to disable caching for specific indicator instances and to be able to provide a custom cache implementation for specific indicator instance.

  • This approach is using the endTime (ZonedDateTime) as key instead of the index. This would avoid issues with limited bar series where the indices of the underlying array may be reused.

  • The CachedIndicator and the RecursiveCachedIndicator have been removed. They will be replaced by an external cache of type IndicatorValueCache that can be configured (maxSize at the moment) and will be used in the AbstractIndicator. Current cache implementations are:

    • BaseIndicatorValueCache
    • NoIndicatorValueCache
    • NativeIndicatorValueCache the former cache using custom circular array implementation for caching
    • NativeRecursiveIndicatorValueCache the former recursive cache extending the NativeIndicatorValueCache

Caching in AbstractIndicator looks like this now:

    @Override
    public T getValue(final int index) {
        final BarSeries series = getBarSeries();
        if (series == null || index >= series.getEndIndex()) {
            // Don't cache result if last bar or no available bar
            return calculate(index);
        }

        if (index < series.getBeginIndex()) {
            return calculate(0);
        }

        final Bar bar = series.getBar(index);
        return cache.get(bar.getEndTime(), (endTime) -> calculate(index));
    }

Usage example

      SMAIndicator smaIndicator = new SMAIndicator(new ClosePriceIndicator(barSeries), 5); // default with an BaseIndicatorValueCache
      
      smaIndicator.clearCache(); // clear the cache
      smaIndicator.disableCache(); // disable caching for this indicator instance
      smaIndicator.setCache(new BaseIndicatorValueCache<>(new IndicatorValueCacheConfig(300))); // set cache with custom config
      smaIndicator.setCache(new MyCustomCache()); // set custom cache implementation
      Map<ZonedDateTime, Num> values = smaIndicator.getCache().getValues(); // get all cached values

Performance tests
Executed the example MovingAverageCrossOverRangeBacktest.java. 100 Runs with different subsets of the provided BarSeries, each run creates 2145 Strategies and executes them with a BacktestExecutor:

Test1

BarSeries: ETH-USD-PT5M-2023-3-13_2023-3-15.json
99 Runs, 2145 Strategies per run, avg in ms:
New Caching Time: 662
Old Caching Time: 694
No Caching Time: 809

Test 2

BarSeries: bitstamp_trades_from_20131125_usd.csv
99 Runs, 2145 Strategies per run, avg in ms:
New Caching Time: 3663
Old Caching Time: 3797
No Caching Time: 4485

Test 3

BarSeries: appleinc_bars_from_20130101_usd.csv
99 Runs, 2145 Strategies per run, avg in ms:
New Caching Time: 231
Old Caching Time: 226
No Caching Time: 258

Test1.txt
Test2.txt
Test3.txt

@team172011 team172011 marked this pull request as ready for review October 29, 2022 10:04
@TheCookieLab
Copy link
Member

I thought I read in a thread somewhere that the bar series size management functionality (i.e. setting a max bar count, removing old bars, etc) was a blocker for this feature. Is that still the case here or is there something else?

@team172011
Copy link
Member Author

Is that still the case here or is there something else

@TheCookieLab Usually it is possible to set a max cache size in third party caching libraries, especially in coffeine cache.
So far this is just a suggestion from me on how we could replace the custom solution by a third party library. This is a big change and so far there as not been an appropriate discussion about this topic

@nimo23
Copy link
Contributor

nimo23 commented Jun 18, 2023

This approach is also using the endTime (ZonedDateTime) as key instead of the index. This would avoid issues with limited bar series where the indices of the underlying array may be reused

@team172011 Would it make sense as a first step (before replacing the actual caching system) to create a new PR that just replaces the index with the endTime in our existing cache?

@team172011 team172011 self-assigned this Jul 12, 2023
@team172011
Copy link
Member Author

team172011 commented Jul 13, 2023

@nimo23

What about the old caching implementations. If you include the old caching logic into another IndicatorValueCache-implementation within

caching/NativeIndicatorValueCache caching/NativeRecursiveIndicatorValueCache (with changes as needed, e.g. endTime instead of index)

I added two further cache implementations:

  • NativeIndicatorValueCache the former cache using custom circular array implementation for caching
  • NativeRecursiveIndicatorValueCache the former recursive cache extending the NativeIndicatorValueCache

and added the NativeRecursiveIndicatorValueCache to all indicators that where extending the RecursiveCachedIndicator. All tests are passing now

* their values based on the values of other indicators. Such nested indicators
* can call {@link #getValue(int)} multiple times without the need to
* {@link #calculate(int)} again.
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not remove the existing javadocs of this class.

* Constructor.
*
* @param series the bar series
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not remove the javadocs of this class.

* that may be thrown on the first getValue(int) call of a recursive indicator.
* Concretely when an index value is asked, if the last cached value is too
* old/far, the computation of all the values between the last cached and the
* asked one is executed iteratively.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not remove the javadocs of this class.

/**
* The recursion threshold for which an iterative calculation is executed.
*
* TODO: Should be variable (depending on the sub-indicators used in this
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not remove the javadocs of this class.

@TheCookieLab
Copy link
Member

TheCookieLab commented Jul 14, 2023 via email

@nimo23
Copy link
Contributor

nimo23 commented Jul 14, 2023

Let’s not normalize raising concerns on blank lines in between imports,
comment style, etc in code reviews, that’s frankly anal retentive and
causes more harm than good. There’s a reason code style is typically
considered off-limits in code reviews for anything not explicitly defined
upfront as a universal style guide.

No, there is a good reason for maintaining code style! The end users of this library are API end users, i.e. in order not to diminish the developer's joy, the coding style must be homogeneous, among other things! if everyone does their gibberish and sloppily codes into it, you end up with poorly maintainable and readable code in the long run.

Off-topic: Maybe we should provide an automated task like "spotless" or "gradle-imports" into our github build?

@TheCookieLab
Copy link
Member

TheCookieLab commented Jul 14, 2023 via email

import org.ta4j.core.BarSeries;
import org.ta4j.core.BaseTradingRecord;
import org.ta4j.core.Position;
import org.ta4j.core.*;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the "*" (asterisk) and import only required classes.

@team172011
Copy link
Member Author

Let’s not normalize raising concerns on blank lines in between imports, comment style, etc in code reviews, that’s frankly anal retentive and causes more harm than good. There’s a reason code style is typically considered off-limits in code reviews for anything not explicitly defined upfront as a universal style guide.

@TheCookieLab I aggree, especially in PRs like this where the focus is on the general design and solving different challenges to reach the objectives.
But we should also ensure not to touch hundreds of unrelated files, so I am okay with having to correct those import issues

@team172011
Copy link
Member Author

team172011 commented Jul 14, 2023

Current problem I face ist that one goal of this PR is to eliminate alle the difficulties we have with the indexed based caching and the sync between the circular array in CachedIndicators and in the BarSeries. Still having
the NativeIndicatorValueCache, NativeRecursiveIndicatorValueCache is good for comparison, but If we still have those indexed based caches nothing has been really fixed regarding this problem. So I do not want to have them in the final version of this PR.

This leads to a second challenge: How to replace the RecursiveCachedIndicator with a version that is not index based. The RecursiveCachedIndicator needs to now the latest available index to decided if it should iteratively fill the cache. From a specific index to another index.

Only idea I have so far is to create a cache that always checks/calculates if every previous index is already in the cache
-> like suggested in #973 (#975) this would be a "cache" that is always iteratively calculating all values from a (configurable?) threshold to the current index if the current index is not already in the cache

@nimo23
Copy link
Contributor

nimo23 commented Jul 14, 2023

I aggree, especially in PRs like this where the focus is on the general design and solving different challenges to reach the objectives.

@team172011 I don't agree. It does not depend on any kind of PR!

Things like clean code, etc. should be understandable and not just a social expression within the community. If you both agree, then please answer me: who feels responsible for cleaning up your dirt afterwards? I? If not me, no one will - history has proven that! The result today would be very different, not good code! So please think twice before anyone here seems to agree. I have more and more fear that the project will be steered in the wrong direction in the future if such fundamental, actually completely self-evident things have to be said here several times. That's tedious.

@nimo23
Copy link
Contributor

nimo23 commented Jul 15, 2023

..one goal of this PR is to eliminate alle the difficulties we have with the indexed based caching and the sync between the circular array in CachedIndicators and in the BarSeries. Still having the NativeIndicatorValueCache, NativeRecursiveIndicatorValueCache is good for comparison, but If we still have those indexed based caches nothing has been really fixed regarding this problem.

1. Solution: If I understand it correctly, one potential solution to this problem to remove the "indexed based caching":
The problem can be solved if we look at this purely relationally, i.e. the "sync between" can be done in constant time (rather than iterating over n elements) if we have a primary key that we can look up in a map (e.g. "ConcurrentHashMap<String,ValueObject >") ). The primary key is a unique identifier (e.g. a property or combination of properties of the bar data that makes it unique, which is called, for example, BarId). The primary key of the indicator value is the foreign key in the cache, i.e. the value in the cache can be retrieved in constant time by its foreign key BarId. If no such foreign key exists, then we know, that we have to calculate it and put the calculated value into the cache (e.g. by something like var cachedValue = map.putIfAbsent(myBarId, myIndicatorCalculator);). With such a solution there are no more indexes and no more iterating in the cache.

So I do not want to have them in the final version of this PR.

Hmm... but if we can solve this with "solution 1" then it would make sense to keep using it, or?

This leads to a second challenge: How to replace the RecursiveCachedIndicator with a version that is not index based. The RecursiveCachedIndicator needs to now the latest available index to decided if it should iteratively fill the cache. From a specific index to another index.

Hmm..the latest could be the last element in a sorted map or alternativly, the latest one with a time based BarId (e.g. a prefix within BarId consisting of the latest bar.beginTime or a timebased UUID which is created for each bar on construction time). For example:

class Bar {
    String getBarId();
}

class BaseBar {
    // will be assigned on each bar creation
    private final String barId;

// can be used internally to create the barId for a bar
private String createBarId(){
       // either by properties of bar which makes a barId (nearly) unique
      // the "time based part" can then be extracted for e.g. RecursiveCachedIndicator
       StringBuilder sb = new StringBuilder();
       sb.append(beginTime);
       sb.append("/");
       sb.append(endTime):
       sb.append(openPrice);
return sb.toString();

// or by returning a timebased (nearly unique) UUID
return UUID.nameUUIDFromBytes(java.nio.ByteBuffer.allocate(16).putLong(System.currentTimeMillis()).array()).toString();

// another alternative:
// return UUID.randomUUID().toString() + "_TS_" + String.valueOf(beginTime.toEpochSecond());

}
}

Something like this would also generate a random String really fast, nearly unique and is also thread safe (the parameter appendTimestamp can be replaced by simply using the bar.beginTime if we wish:

With such a (time based) barId, the indicator value can then be referenced within our caches.

@team172011
Copy link
Member Author

With such a solution there are no more indexes and no more iterating in the cache.

But isn't this exactly the same what I already did in this PR? I just used the endTime of each bar as primary key, it is mandatory and it makes no sense to have two bars for the same endTime -> primary key.

And how would the sync work in constant time? At the moment, the BarSeries and CachedIndicator are using the maximumBarCount and removedBarCount property to calculate the "real" index. If we would only store a unique key (like my PR already does with endTime) we do not know how many bars or which bars may have been removed from the BarSeries. We would have to iterate over the whole cache values and check if they are still in the barSeries

However the concept or the RecursiveCachedIndicator would not work without known and storing indices of values

@nimo23
Copy link
Contributor

nimo23 commented Jul 15, 2023

I just used the endTime of each bar as primary key, it is mandatory and it makes no sense to have two bars for the same endTime -> primary key.

Yes, you are right - using endTime as primary key makes totally sense for the current kind of solution where we extend an indicator (instead of composing the (global) cache into an indicator). Well, we have to think about if extending an indicator would decrease the flexibility of the cache and the indicator (e.g. indicators with extending a cache cannot be stateless, or?) For example, if an Indicator is created as a Function (which is a little off-topic, but should be considered here also to find a more future-proof solution for our caching system), then the only way to pass its cache is by a parameter (= composition). Am I right? The inheritance structure would perhaps make some possibilities for the future impossible in the first place - I am trying to find a solution that doesn't have these shortcomings in the first place. But it seems more complex to find such a global solution. One benefit of "extending" is that the cache is cleared automatically as soon as the indicator class is collected by the garbage collector. I know I'm digressing, but we should find a future-proof solution..

And how would the sync work in constant time? At the moment, the BarSeries and CachedIndicator are using the maximumBarCount and removedBarCount property to calculate the "real" index. If we would only store a unique key (like my PR already does with endTime) we do not know how many bars or which bars may have been removed from the BarSeries. We would have to iterate over the whole cache values and check if they are still in the barSeries

Hmm..yes, that's a problem with the current caching design and how we connect the things together: we have the relation Bar<->Indicator<->Cache and the caching must actually compute the "meta information" (how many or which bars removed, etc.) as we do not store these information at the place where it is actually needed. The problem of determining "how many or which bars were removed" is only due to our current caching design. The solution actually sounds very simple: The indicator calculates a value for a bar and then saves this value in its cache. The next time the bar asks the indicator for the value, the indicator looks in its cache to see if it has already been calculated. Unfortunately, the cache has to determine the missing meta information itself, since this meta information cannot be called up constantly at runtime. Now how to connect the cache to the bar in a way that eliminates this problem - maybe just by inverting and relocating the caching, i.e. a bar needs a cache and not necessarily the indicator - that sounds unusual at first, because we have been used to our actual caching system for so long. But think about it carefully: This also has the advantage that if the bar is no longer in the list, its cache is also implicitly emptied. This would also solve the “extends” problem (described above) and would pave the way for us to consider indicators as stateless. The only stateful thing should be the Bar and nothing else.

@team172011
Copy link
Member Author

team172011 commented Jul 15, 2023

The indicator calculates a value for a bar and then saves this value in its cache. The next time the bar asks the indicator for the value, the indicator looks in its cache to see if it has already been calculated.

This idea sounds very interesting. This would allow an indicator not just to cache the final value for index n but also use it for meta information like the ParabolicSarIndicator would need them for the next index n+1 (isUptrend, lastExtreme, acceleration factor, see #975).

That means we would change the Indicator interface to something like:

public interface Indicator<T> {

    T getValue(Bar bar);
    
    // ...
}

And the Bar interface to something like

public interface Bar extends Serializable {
    
    IndicatorValueCache<T> getCache(IndicatorInstanceKey key) // PROBLEM: T is unknown can only be Object

    // ...
}

The API of the IndicatorValueCache could look like this, providing also a possibility to store meta information required for
further calculations

public interface IndicatorValueCache<T> {
    
    Optional<T> getValue(); // returns the value for this indicator if already calculated and stored
    
    void putValue(T result); // stores the value
    
    Map<String, Object> getMetaValues(); // possibility to add meta information
    
    putMetaValues(Map<String, Object>); // possibility to get meta information
     
     // and/or: 
     Optional <Object> getMetaValue(String key);
     
    void putMetaValue(String key, Object value);

}

And of course each indicator instance needs to have its own key for the cache in the Bar

public interface IndicatorInstanceKey {
     
   // needs to be stored in each indicator instance
   default String getKey() {
       return new UUID().toString();
  }
}

EDIT 1: Added comment to Bar example to describe the problem with generic T

@nimo23
Copy link
Contributor

nimo23 commented Jul 15, 2023

@team172011 very good, you understood exactly what I meant. And the solution you provided is very promising 👍

@team172011
Copy link
Member Author

team172011 commented Jul 15, 2023

But I remember one issue I had in a previous try to refactor the caching:

As far as you leave the Indicator<T> scope (by not extending it or putting the cache in another class like Bar) you are loosing the information about the generic class parameter T. And that means it gets very ugly because you can handle the results only as type Object and need to do a lot of casting and type checking.

So this suggestion would not work, because the Bar does not know which Indicator with wich kind of type T is using it.

Because of that, when playing around with implementing some concepts in swift I completely dropped the concept of having a generic class parameter for Indicators.

My personal attitude is also to avoid using generics or just use them in cases where you really are able to generify something in a usefull way.

But beside of my personal flavor I really thing that having a generic class parameter T in Indicator is useless for this lib. At a first look ist sounds great to have the possibility to create indicators that return any type T. But when having a closer look, the ta4j library really only supports having Boolean and Num type for T:
For example almost all rules expect Indicator<Num> as parameters. Also almost all the criterions returning Num and working with Num results. Analysis classes like CashFlow and Returns are also implementing the Indicator<Num> interface.

As far as I can see, there is no reasonable way to use this library with Indicator<T> where T is something else than Boolean or Num.

So one option could also be to remove the generic class parameter T by introducing two interfaces (for the time being ignoring DateTimeIndicator and I think one or two indicators returning a long or so):

  • NumIndicator with getValue -> Num
  • BooleanIndicator with getValue -> Boolean

@nimo23
Copy link
Contributor

nimo23 commented Jul 15, 2023

you are loosing the information about the generic class parameter T.

The past has shown that we really don't need such kind of abstraction level. Your solution:

    NumIndicator with getValue -> Num
    BooleanIndicator with getValue -> Boolean

is absolutely fine and completely sufficient. If in the future there is a need for another indicator type, then let's just add it in this lib, for example:

    DateTimeIndicator with getValue -> LocalDateTime

You could do it like that (an interface extending another interface).

/**
 * Indicator over a {@link BarSeries bar series}.
 * 
 * <p>
 * Returns a value of type {@link Num}. for each index of the bar series.
 */
public interface NumIndicator extends Indicator<Num>{

    /**
     * @param index the bar index
     * @return the Num value of the indicator
     */
    Num getValue(int index);

}
/**
 * Indicator over a {@link BarSeries bar series}.
 * 
 * <p>
 * Returns a value of type {@link Boolean} for each index of the bar series.
 */
public interface BooleanIndicator extends Indicator<Boolean>{

    /**
     * @param index the bar index
     * @return the Boolean value of the indicator
     */
    Boolean getValue(int index);

}

and so on and you can put these indicator types into the package indicators/types.

@TheCookieLab
Copy link
Member

TheCookieLab commented Jul 15, 2023

Current problem I face ist that one goal of this PR is to eliminate alle the difficulties we have with the indexed based caching and the sync between the circular array in CachedIndicators and in the BarSeries. Still having the NativeIndicatorValueCache, NativeRecursiveIndicatorValueCache is good for comparison, but If we still have those indexed based caches nothing has been really fixed regarding this problem. So I do not want to have them in the final version of this PR.

This leads to a second challenge: How to replace the RecursiveCachedIndicator with a version that is not index based. The RecursiveCachedIndicator needs to now the latest available index to decided if it should iteratively fill the cache. From a specific index to another index.

Only idea I have so far is to create a cache that always checks/calculates if every previous index is already in the cache -> like suggested in #973 (#975) this would be a "cache" that is always iteratively calculating all values from a (configurable?) threshold to the current index if the current index is not already in the cache

@team172011 Are we sure we still need the RecursiveCachedIndicator? I modified ZLEMAIndicator to extend CachedIndicator and then ran CachedIndicatorTest.recursiveCachedIndicatorOnMovingBarSeriesShouldNotCauseStackOverflow which passed successfully. I'm not terribly familiar with the origin story here so may be missing some important context. Can you confirm?

@team172011
Copy link
Member Author

team172011 commented Jul 15, 2023

@TheCookieLab it depends on the stack size of your jvm.For me the org.ta4j.core.indicators.volume.OnBalanceVolumeIndicatorTest#stackOverflowError test fails, when only extending the cached indicator in OnBalanceVolumeIndicator.

I am mixing up some thing but somehow it is all related in the end. The OnBalanceVolumeIndicator is also a good example for our "unstableBars" feature:

    @Override
    protected Num calculate(int index) {
        if (index == 0) {
            return zero();
        }
        final Num prevClose = getBarSeries().getBar(index - 1).getClosePrice();
        final Num currentClose = getBarSeries().getBar(index).getClosePrice();

        final Num obvPrev = getValue(index - 1);
        if (prevClose.isGreaterThan(currentClose)) {
            return obvPrev.minus(getBarSeries().getBar(index).getVolume());
        } else if (prevClose.isLessThan(currentClose)) {
            return obvPrev.plus(getBarSeries().getBar(index).getVolume());
        } else {
            return obvPrev;
        }
    }

    @Override
    public int getUnstableBars() {
        return 0;
    }

Every calculation is based on the previous close and previous indicator value. So what is the unstable bars period? When is it returning "correct" values? If you call the indicator for index n and you have only three previous values the result would differ for the same index n with thirty previous values for the same BarSeries.
We could eliminate the recursive cache problem, if we are sure that we have the correct warm up period x for each indicator that depends on previous values. That means with this warm up period of x once could check for every call if the last x values are available and only use those last x values for the current calculation. So the OnBalanceVolumeIndicator would not calculate back to index = 0 for calculating the value for index 999 but for example only back to index 100 because we know the last 100 bars are "enough" for a valid result

@nimo23
Copy link
Contributor

nimo23 commented Jul 16, 2023

We could eliminate the recursive cache problem, if we are sure that we have the correct warm up period x for each indicator that depends on previous values.

Yes, good point. But it will probably take a while until #1051 has been solved.

@team172011
Copy link
Member Author

So far this approach does not really solve all mentioned problems, especially the question regarding recursive calculations is still unsolved. I will revert this to a draft PR

@team172011 team172011 marked this pull request as draft August 6, 2023 12:11
@team172011 team172011 mentioned this pull request Aug 6, 2023
@sgflt
Copy link
Contributor

sgflt commented Apr 19, 2024

May I ask, what access patterns does solve caching?

From my research, I have found a few main scenarios.

LiveTrading
There we are interested in last state. We usually want to update state of indicators on external event (tick, candle, other).
In this case caching does a little job. Only multiple indicators with the same parameters may benefit from it.

new SMA(1 ,2, close) for entry strategy and new SMA(1, 2, close) for exit may reuse same values. But we can workaround it and reuse instance of SMA(1,2,close) indicator.

Also array may be replaced with only pointer to last bar.

Backtesting
In this case I usually run optimization algorihms that estimates the best parameters of strategy. It is done in serial manner, so actually indexed array is superior also as cache. I have done little research experiments with caffeine and as cache for calculated values it was musch slower, because hashCode and equals are of similar costf of coputation as cached value. Adding locks and synchronized blocks it was less performant than indexed array.

Secondly each optimization run have different parameters so it can not benefit from cache calculated in previous run.

RecursiveIndicator in serial acces does not need to be recursive, it depends only on previous value that may be stored inside instance. See approach in #1140

I have dropped all "caching indicators" in #1148 and after completion I am going to investigate caching needs of each indicator separately to prove that global cache does not make real benefit. Only optimization of each indicator does matter (I think).

@TheCookieLab
Copy link
Member

Sure, your willingness to take on the effort is appreciated, and if you can show comparable performance without the cache then we may not need it after all.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants