Skip to content

gsufsort: building suffix arrays, LCP-arrays and BWTs for string collections [AMB 2020]

License

Notifications You must be signed in to change notification settings

felipelouza/gsufsort

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

gsufsort

gsufsort [1] is a fast, portable and lightweight tool for constructing the suffix array and related data structures for string collections.

gsufsort runs in internal memory and data structures are written to disk.

For a string collection, gsufsort can compute the following data structures:

  • Suffix array (SA)
  • Inverse suffix array (ISA)
  • LCP-array (LCP)
  • k-truncated LCP-array (k-LCP)
  • Document array (DA)
  • Burrows-Wheeler transform (BWT)
  • Inverse BWT
  • Generalized suffix array (GSA)
  • Generalized enhanced suffix array (GESA)

Compilation and installation

gsufsort will compile in systems with a standard C compiler (like gcc) and make.

git clone https://github.com/felipelouza/gsufsort.git
cd gsufsort
make 

Issuing these commands will build executables gsufsort and gsufsort-64.

For inputs larger than 2GB, gsufsort-64 must be used.

To enable support to compressed files, zlib is required. If zlib is not installed in you system, build with option make GZ=0.

Execution

./gsufsort INPUT [options]

where INPUT is a single file or directory with a string collection.

Construction options:

--build	              (default)
--sa    [w]           compute the SA using w bytes (default 4), write to INPUT.w.sa
--isa   [w]           compute the ISA, write to INPUT.w.isa
--lcp   [w]           compute the LCP, write to INPUT.w.lcp
--trlcp k             compute the k-truncated LCP array, write to INPUT.w.lcp
--da    [w]           compute the DA, write to INPUT.w.da
--gsa   [w1][w2]      compute the GSA=(string,suffix) as pairs of w1+w2 bytes, write to INPUT.w1.w2.gsa
--gesa  [w1][w2][w3]  compute the GESA=(GSA,LCP,BWT), write to INPUT.w1.w2.w3.1.gesa
--bwt                 compute the BWT using 1 byte per symbol, write to INPUT.bwt
--docs  d             process only the first d strings in the collection
--light               run the lightweight algorithm to compute DA, GSA and GESA
--output DIR/NAME     write output files to DIR and use NAME as a prefix for file names

Loading options:

--load                load data-structures from disk INPUT[.sa][.da][.lcp][.gsa][.str]
--ibwt                invert the BWT, given INPUT.bwt, write output to INPUT.bwt.ibwt

Input options:

--txt                 handle input as raw text files (one string per line)
--fasta               handle input as fasta files 
--fastq               handle input as fastq files
--dir                 include all files at the input directory
--lower               convert input to lowercase before data structures construction
--upper               convert input to uppercase before data structures construction

Output options:

--str                 write the collection concatenation (T^{cat}) to INPUT.1.str
--print [p]           print the first p elements of arrays to stdout, defaults to the collection length
--qs                  write QS sequences in fastq permuted according to the BWT to INPUT.bwt.qs
--lcp_max             write the maximum LCP value, write to INPUT.w.lcp_max
--lcp_max_text        write the maximum LCP value text, write to INPUT.lcp_max.txt
--lcp_avg             write the average LCP value, write to INPUT.w.lcp_avg
--time                print the running time in seconds
--verbose             verbose output
--help                this help message

Input files

  • File types (text, fasta or fastq) will be selected by extensions: .txt, .fasta (or .fa, .fna.) and .fastq (or .fq).

  • Options --txt, --fasta and --fastq enable loading file disregarding extensions.

  • In txt files, each line is taken as a strings in the collection. In fasta and fastq files, each sequence is taken as a string in the collection.

  • gsufsort supports the ASCII alphabet, but values 0 and 255 are reserved and must not occur in the input.

  • IUPAC symbols and 'N' are not handled as special symbols in fasta or fastq files.

  • gzipped input files (with .gz extension) are supported using zlib and kseq libraries. If zlib is not installed in your system, build gsufsort with the option make GZ=0. If zlib is not available and a gzipped file is given as input, a runtime error will occur.

  • A directory may be given as input, selecting option --dir. Every file with expected extensions in the directory will be processed to compose the collection, and the default output file prefix will be all. See also Wiki.

Output files

  • Output files are written by default in the current directory.

  • If option --output DIR/ is set, files are written to directory DIR. Setting --output DIR/NAME will make files be written to directory DIR with suffix NAME.

  • Output files format is discussed below.

quick test

To run a test with all strings from dataset/example.txt, type:

./gsufsort dataset/example.txt --sa --bwt
## gsufsort ##
## store_to_disk ##
example.txt.4.sa	76 bytes (n = 19)
example.txt.bwt	19 bytes (n = 19)

To see the result (option --print) stored in disk INPUT.4.sa and INPUT.bwt, use --load option:

./gsufsort example.txt --sa --bwt --load --print
## load_from_disk ##
example.txt.4.sa	76 bytes (n = 19)
example.txt.bwt	19 bytes (n = 19)
i	SA	BWT	suffixes
0	18	$	#
1	6	a	$
2	12	a	$
3	17	n	$
4	5	n	a$
5	11	b	a$
6	9	n	aba$
7	15	n	an$
8	3	n	ana$
9	7	$	anaba$
10	13	$	anan$
11	1	b	anana$
12	10	a	ba$
13	0	#	banana$
14	16	a	n$
15	4	a	na$
16	8	a	naba$
17	14	a	nan$
18	2	a	nana$

output format

  • SA, ISA, LCP, k-LCP and DA are each written sequentially to a binary file. The file has no header and every integer takes w bytes. The default value of w is 4.

  • BWT and iBWT are written in ASCII format, using 1 byte per input symbol.

  • The GSA is written sequentially to a binary file, with no headers. The values of DA and SA are intercalated throughout the file with w1 and w2 bytes respectively: DA[0],SA[0],DA[1],SA[1],...

  • The GESA is written sequentially to a binary file, with no headers. The values of DA and SA, LCP and BWT are intercalated throughout the file with w1, w2, w3 and 1 bytes respectively: DA[0],SA[0],LCP[0],BWT[0],DA[1],SA[1],LCP[1],BWT[1]...

wiki

See more details and additional features in Wiki.

authors

thanks

We thank to Antonis Maronikolakis for his helpful suggestions.

references

[1] Louza, F.A., Telles, G.P., Gog, S., Prezza, N., Rosone, G.. gsufsort: constructing suffix arrays, LCP arrays and BWTs for string collections. Algorithms Mol Biol 15, 18 (2020). https://doi.org/10.1186/s13015-020-00177-y


Supported by the project Italian MIUR-SIR CMACBioSeq ("Combinatorial methods for analysis and compression of biological sequences") grant n.~RBSI146R5L. P.I. Giovanna Rosone