pygds: Python Interface to CoreArray Genomic Data Structure (GDS) Files

GNU General Public License, GPLv3 (2017)

Pre-release version: v0.1

Features

This package provides a high-level Python interface to CoreArray Genomic Data Structure (GDS) files. GDS files are portable across platforms and use a hierarchical structure to store multiple scalable, array-oriented data sets together with their metadata. The format is suited to large-scale datasets, especially data much larger than the available random-access memory. The pygds package offers efficient operations designed specifically for integers of fewer than 8 bits, since a diploid genotype, such as one at a single-nucleotide polymorphism (SNP), usually occupies less than a byte. Data compression and decompression are available with relatively efficient random access.
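To illustrate why sub-byte integers matter, here is a minimal pure-Python sketch of packing 2-bit values four to a byte. The function names `pack_2bit` and `unpack_2bit` and the bit order within each byte are assumptions made for illustration. This is not the pygds API, and it is not necessarily the on-disk layout of a GDS file.

```python
def pack_2bit(values):
    """Pack integers in the range 0..3 into bytes, four values per byte.

    Assumed layout: the first value occupies the two lowest bits of a byte.
    """
    out = bytearray((len(values) + 3) // 4)  # round up to whole bytes
    for i, v in enumerate(values):
        if not 0 <= v <= 3:
            raise ValueError("each value must fit in 2 bits (0..3)")
        out[i // 4] |= v << (2 * (i % 4))
    return bytes(out)


def unpack_2bit(packed, n):
    """Recover the first n 2-bit values from a packed byte string."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 3 for i in range(n)]


genotypes = [0, 1, 2, 3, 2, 1]      # six hypothetical 2-bit codes
packed = pack_2bit(genotypes)       # stored in 2 bytes instead of 6
restored = unpack_2bit(packed, len(genotypes))
```

With one byte per value the six codes would need 6 bytes; packed at 2 bits each they need 2 bytes, a fourfold saving before any compression is applied.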

Package Maintainer

Dr. Xiuwen Zheng (zhengxwen@gmail.com)

Prerequisites

Python 2 (2.6-2.7) or Python 3 (3.3-3.6)

NumPy 1.6.0 or later

liblzma (part of XZ Utils)

Installation

pip install git+https://github.com/CoreArray/pygds.git

Citation

Original papers (implemented in R/Bioconductor):

gdsfmt, SeqArray

Zheng X, Levine D, Shen J, Gogarten SM, Laurie C, Weir BS (2012). A High-performance Computing Toolset for Relatedness and Principal Component Analysis of SNP Data. Bioinformatics. DOI: 10.1093/bioinformatics/bts606.

Zheng X, Gogarten S, Lawrence M, Stilp A, Conomos M, Weir BS, Laurie C, Levine D (2017). SeqArray -- A storage-efficient high-performance data format for WGS variant calls. Bioinformatics. DOI: 10.1093/bioinformatics/btx145.

Examples

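import pygds as gds

# locate the example GDS file bundled with the package
fn = gds.get_example_path('ceu_exon.gds')
f = gds.gdsfile()
f.open(fn)
# print the file's hierarchical structure (output shown below)
f.show()
# release the file handle
f.close()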

File: pygds/data/ceu_exon.gds (32.5K)
+    [  ] *
|--+ description   [  ] *
|--+ sample.id   { Str8 90 LZMA_ra(35.8%), 258B } *
|--+ variant.id   { Int32 1348 LZMA_ra(16.8%), 906B } *
|--+ position   { Int32 1348 LZMA_ra(64.6%), 3.4K } *
|--+ chromosome   { Str8 1348 LZMA_ra(4.63%), 158B } *
|--+ allele   { Str8 1348 LZMA_ra(16.7%), 902B } *
|--+ genotype   [  ] *
|  |--+ data   { Bit2 1348x90x2 LZMA_ra(26.3%), 15.6K } *
|  |--+ extra.index   { Int32 0x3 LZMA_ra, 19B } *
|  \--+ extra   { Int16 0 LZMA_ra, 19B }
|--+ phase   [  ]
|  |--+ data   { Bit1 1348x90 LZMA_ra(0.91%), 138B } *
|  |--+ extra.index   { Int32 0x3 LZMA_ra, 19B } *
|  \--+ extra   { Bit1 0 LZMA_ra, 19B }
|--+ annotation   [  ]
|  |--+ id   { Str8 1348 LZMA_ra(38.4%), 5.5K } *
|  |--+ qual   { Float32 1348 LZMA_ra(2.26%), 122B } *
|  \--+ filter   { Int32,factor 1348 LZMA_ra(2.26%), 122B } *
\--+ sample.annotation   [  ]
   \--+ family   { Str8 90 LZMA_ra(57.1%), 222B }
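The sizes in the listing above can be checked with a little arithmetic. The sketch below assumes that the percentage in `LZMA_ra(...)` is the ratio of compressed size to uncompressed bit-packed size, and that "K" means 1024 bytes. Both readings are assumptions that happen to agree with the numbers shown. The helper `packed_size_bytes` is hypothetical and not part of pygds.

```python
from math import prod


def packed_size_bytes(dims, bits_per_value):
    """Uncompressed size in bytes of an array stored at a fixed bit width."""
    total_bits = prod(dims) * bits_per_value
    return (total_bits + 7) // 8  # round up to whole bytes


# genotype/data: Bit2, 1348 x 90 x 2
raw_genotype = packed_size_bytes((1348, 90, 2), 2)
est_genotype = raw_genotype * 0.263     # listing shows LZMA_ra(26.3%), 15.6K

# phase/data: Bit1, 1348 x 90
raw_phase = packed_size_bytes((1348, 90), 1)
est_phase = raw_phase * 0.0091          # listing shows LZMA_ra(0.91%), 138B

print(raw_genotype, round(est_genotype / 1024, 1))
print(raw_phase, round(est_phase))
```

Under these assumptions the genotype array takes 60660 bytes uncompressed and about 15.6K compressed, and the phase array takes 15165 bytes uncompressed and about 138 bytes compressed, matching the listing. Stored as one byte per value, the genotype array would occupy 242640 bytes before compression.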

Also See

PySeqArray: data manipulation of whole-genome sequencing variants in Python
