
HIBF creates a very large index #370

Open
genomewalker opened this issue Sep 1, 2023 · 3 comments


@genomewalker

Hi

I have been trying to build an index of a large collection of microbial genomes (102,999) using the HIBF, and the resulting index is far larger than the one I get when building the same index with the IBF.

The raptor version I used:

VERSION
    Last update: 2023-08-30
    Raptor version: 3.1.0-rc.1 (raptor-v3.0.0-146-gedec71b5a2c19a2203278db814b3362ddb98e9e6)
    Sharg version: 1.1.1
    SeqAn version: 3.4.0-rc.1

The layout stat file:

## ### Parameters ###
## number of user bins = 102999
## number of hash functions = 2
## false positive rate = 0.05
## ### Notation ###
## X-IBF = An IBF with X number of bins.
## X-HIBF = An HIBF with tmax = X, e.g a maximum of X technical bins on each level.
## ### Column Description ###
## tmax : The maximum number of technical bin on each level
## c_tmax : The technical extra cost of querying an tmax-IBF, compared to 64-IBF
## l_tmax : The estimated query cost for an tmax-HIBF, compared to an 64-HIBF
## m_tmax : The estimated memory consumption for an tmax-HIBF, compared to an 64-HIBF
## (l*m)_tmax : Computed by l_tmax * m_tmax
## size : The expected total size of an tmax-HIBF
# tmax  c_tmax  l_tmax  m_tmax  (l*m)_tmax      size
64      1.00    0.00    1.00    0.00    424.3GiB
384     1.51    3.34    1.48    4.96    630.0GiB
# Best t_max (regarding expected query runtime): 64
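For context, the parameters in this stat file can be plugged into the standard Bloom filter sizing formula. This is a sketch under the assumption that raptor sizes each bin with the textbook formula; I have not verified this against the source:

```python
import math

def bits_per_element(fpr: float, hash_funs: int) -> float:
    """Bits needed per k-mer for a Bloom filter with the given
    false positive rate and number of hash functions (textbook formula)."""
    return -hash_funs / math.log(1.0 - fpr ** (1.0 / hash_funs))

# Parameters from the layout stat file above: fpr = 0.05, 2 hash functions.
print(bits_per_element(0.05, 2))  # ~7.9 bits per k-mer
```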

The prepare, layout, and build commands I used:

raptor prepare --input genomes.lst --output genomes_k20_w20 --kmer 20 --window 20 --threads 32
raptor layout --input-file genomes_k20_w20/minimiser.list --output-sketches-to genomes_k20_w20 \
    --determine-best-tmax --kmer-size 20 --false-positive-rate 0.05 --threads 32 \
    --output-filename genomes_k20_w20_binning
raptor build --input genomes_k20_w20_binning --output genomes_k20_w20.index --threads 32

The final index is ~1 TiB. These are the timings for building the index, which had a peak memory usage of ~3 TiB:

============= Timings =============
Wall clock time [s]: 40397.13
Peak memory usage [TiB]: 2.9
Index allocation [s]: 0.00
User bin I/O avg per thread [s]: 0.00
User bin I/O sum [s]: 0.00
Merge kmer sets avg per thread [s]: 0.00
Merge kmer sets sum [s]: 0.00
Fill IBF avg per thread [s]: 0.00
Fill IBF sum [s]: 0.00
Store index [s]: 0.00

The IBF index is ~750 GiB and required a fraction of the memory to build. Shouldn't the HIBF be smaller than the IBF index? Any suggestions are much appreciated :-)

Thanks
Antonio

@eseiler
Member

eseiler commented Sep 1, 2023

Hey there!

Version

The version you are using includes some major refactorings. That's also why the timings show 0.00 seconds for most of the statistics.

The results should be the same (unit tests are fine), but I haven't benchmarked the performance yet.
You could use the latest release (3.0.1), but I don't think that the results would be different.

EDIT: One bug that I just encountered, and that will be fixed soon, is that raptor build will always use the same number of threads as raptor layout did, ignoring the raptor build --threads option.

Layout

It looks like it will use t_max = 64, so the HIBF will have at least 3 levels (log_64(102999) is about 2.8).
This may result in a bigger index size than using only 2 levels (t_max = 384).
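The level counts can be double-checked with a quick calculation (hibf_levels is a hypothetical helper, just the ceiling of the base-tmax logarithm):

```python
import math

def hibf_levels(user_bins: int, tmax: int) -> int:
    # Each level multiplies capacity by tmax, so covering all user bins
    # needs ceil(log_tmax(user_bins)) levels.
    return math.ceil(math.log(user_bins) / math.log(tmax))

print(hibf_levels(102999, 64))   # 3 levels
print(hibf_levels(102999, 384))  # 2 levels
```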

We will have to investigate why the estimation of the size (424.3GiB) is so far off the actual size.

Building RAM

The memory usage looks way too high. This might be due to t_max = 64. When building in parallel, we store the k-mers that we insert into lower levels so they can be reused in upper levels. With a small t_max, we have to store more content, which increases memory usage.
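As an illustrative model only (this is my assumption about the effect, not raptor's actual memory accounting): if each k-mer can be buffered once per level during a parallel build, the level count gives a rough multiplier for the buffered content:

```python
import math

def levels(user_bins: int, tmax: int) -> int:
    # ceil(log_tmax(user_bins)): number of HIBF levels.
    return math.ceil(math.log(user_bins) / math.log(tmax))

# Toy model: peak k-mer buffering scales with the number of levels,
# i.e. each k-mer is held once per level while building.
ratio = levels(102999, 64) / levels(102999, 384)
print(ratio)  # 1.5 -> t_max = 64 buffers ~1.5x the k-mers of t_max = 384
```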

Index Size

Whether the HIBF is smaller than the IBF depends on the data and t_max.
The worst case for the HIBF is when all the genomes are equally sized (size = number of unique k-mers).
Let's say all genomes are equally sized, and we have 4096 genomes. Then an HIBF with t_max = 64 would have two layers. The top level has 64 bins available, and each of these bins would contain 64 of the original genomes (64*64=4096). So we would store all k-mers in the top level, and then the k-mers of 64 genomes for each of the 64 lower-level IBFs. Long story short, in this worst case we would have an index twice the size of the IBF.
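The worst case described above can be sketched numerically (the genome count matches the example; the k-mer count per genome is an arbitrary toy value):

```python
# Worst-case HIBF vs. IBF size with equal-sized user bins (toy model).
genomes = 4096           # user bins, as in the example
kmers_per_genome = 1000  # unique k-mers per genome (arbitrary)
total = genomes * kmers_per_genome

ibf_kmers = total                          # a flat IBF stores every k-mer once
top_level = total                          # 64 merged bins covering all genomes
lower_levels = 64 * 64 * kmers_per_genome  # 64 IBFs with 64 genomes each
hibf_kmers = top_level + lower_levels

print(hibf_kmers / ibf_kmers)  # 2.0 -> twice the stored k-mers
```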

When using 3 levels, this might get worse, depending on the data.
It looks like your data is quite unevenly sized (750 GiB vs. 1 TiB, even though there are 3 layers).
This might also improve when using t_max = 384.

Questions/Suggestions

  • Try running the layout without --determine-best-tmax. It should then default to using t_max = 384.
  • Note: If you have exactly one file per genome, you can also skip raptor prepare. But since you've already run it, you can just reuse the minimiser.list for raptor layout.
  • Is the list of genomes something that you can share; and are the genomes freely available? Then we could also try it ourselves.
  • Can you share the layout file? You should be able to attach a gzipped file to a GitHub comment.

@genomewalker
Author

Hi @eseiler

Thank you very much for your prompt answer; it is very useful. I will try your recommendations :-)

You can get the genome FASTA files from here and the layout files here.

@genomewalker
Author

An update on this: without specifying --determine-best-tmax, the index is now only 588 GiB, and the peak memory usage was 590 GiB.
