cramjaco/Nyvac_096_Microbiome

Nyvac096Microbiome

Jacob A. Cram jcram@umces.edu cramjaco@gmail.com

This directory accompanies the manuscript "The human gut microbiota associated with baseline immune status and response to HIV vaccines". As of this writing, the manuscript has been accepted but not yet published at PLOS ONE. This directory contains materials to run both the upstream processing of the microbiome data (demultiplexing with QIIME, sequence variant (SV) assignment with DADA2, phylogenetic tree construction with phangorn, and SV taxonomic assignment with DADA2) and the downstream analysis (statistics in an R notebook file).

The downstream analysis can be run without redoing the upstream portion. By default, we use the files generated by the upstream analysis from our initial run, for consistency between runs, because results appear to vary somewhat from one upstream run to the next.

Dependencies:

R version 3.6.1.

Notes on R: I have not had success with all of the subsequent dependencies when using conda to install R. For installing R in your own directory, see https://unix.stackexchange.com/questions/149451/install-r-in-my-own-directory. When I tested these scripts, I built R 3.6.1 from source on a clean VirtualBox machine running Ubuntu 18.04.

The following packages were required for my R build (I installed them with apt): build-essential fort77 xorg-dev liblzma-dev libblas-dev gfortran gcc-multilib gobjc++ libreadline-dev libbz2-dev libcurl4-openssl-dev texlive-fonts-extra texinfo default-jdk libssl-dev libxml2-dev t1-xfree86-nonfree ttf-xfree86-nonfree ttf-xfree86-nonfree-syriac xfonts-75dpi xfonts-100dpi libcairo2-dev

wget http://cran.rstudio.com/src/base/R-3/R-3.6.1.tar.gz
tar xvf R-3.6.1.tar.gz
cd R-3.6.1
./configure --prefix=$HOME/R
make && make install

And of course, add R to your PATH. I did this by adding

export PATH=$PATH:$HOME/R-3.6.1/bin

to my .bashrc file and rebooting.
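After editing .bashrc, it can help to confirm that the R found on PATH is really the 3.6.1 build. The following is a minimal sketch, not part of this repository: the function names r_version_from_banner and check_r_version are hypothetical, and the sketch assumes that the first line of the R --version output starts with "R version". Note that with --prefix=$HOME/R, make install places the R launcher under $HOME/R/bin, whereas the PATH entry above points at the build directory (assuming the tarball was unpacked in $HOME), so it is worth checking which copy is picked up.

```shell
#!/bin/sh
# Hypothetical helper: reads the output of `R --version` on stdin and prints
# only the version number found on the first line.
r_version_from_banner() {
    sed -n '1s/^R version \([0-9][0-9.]*\).*/\1/p'
}

# Hypothetical helper: compares the R on PATH against an expected version.
check_r_version() {
    expected="$1"
    if ! command -v R >/dev/null 2>&1; then
        echo "R not found on PATH"
        return 1
    fi
    found="$(R --version | r_version_from_banner)"
    if [ "$found" = "$expected" ]; then
        echo "OK: R $found at $(command -v R)"
    else
        echo "Expected R $expected but found R ${found:-unknown}"
        return 1
    fi
}
```

After sourcing these functions in a new shell, run check_r_version 3.6.1 to see which R is picked up.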

Anaconda

Jupyter Notebook is no longer required. Anaconda is required for the Python scripts in the upstream analysis.

RStudio

I have used both local RStudio and RStudio Server for this, most recently RStudio Server 1.2.1335.

For upstream analysis.

  • To run the demultiplexing, you will need to install QIIME 1. I recommend using Anaconda to set up the following environment:

conda create -n qiime1 numpy=1.10 python=2.7 qiime matplotlib=1.4.3 mock nose -c bioconda

When I ran the upstream analysis, all work was done in fall 2017 on R version 3.4.1. I have not re-run this upstream analysis since then.

For downstream analysis:

RStudio or RStudio Server.

I have been using the R package packrat to keep track of packages.

Some dependencies that were required on my system -- I have root access and so used sudo apt install. If you are doing this on a cluster, you may need to install many of these locally or get your system administrator to do it for you.

No longer a problem, but in case it re-surfaces...

To make the igraph R package run, you need to modify your Anaconda directory slightly, as per this GitHub issue: igraph/rigraph#275 (comment)

To do this, navigate in the terminal to your Anaconda directory. In my case this is done with

cd ~/anaconda3

and then deactivate all local copies of libgfortran.so.4.0.0 by renaming them:

find . -name "libgfortran.so.4.0.0" -execdir mv {} {}_off ';'
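Because this renames files inside the Anaconda installation, it may be useful to preview the matches first and to have a way to undo the change. The sketch below is an illustration only: the function names list_libgfortran and restore_libgfortran are hypothetical, and the restore step assumes the files were renamed with the _off suffix used above.

```shell
#!/bin/sh
# Hypothetical helper: lists the copies of libgfortran.so.4.0.0 under a
# directory without changing anything (a dry run for the rename).
list_libgfortran() {
    find "$1" -name "libgfortran.so.4.0.0" -print
}

# Hypothetical helper: undoes the rename by stripping the _off suffix from
# every deactivated copy under a directory.
restore_libgfortran() {
    find "$1" -name "libgfortran.so.4.0.0_off" -exec sh -c '
        for f in "$@"; do
            mv "$f" "${f%_off}"   # drop the trailing _off
        done
    ' sh {} +
}
```

For example, list_libgfortran ~/anaconda3 shows what the rename would touch, and restore_libgfortran ~/anaconda3 puts the files back.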

Now you are ready to install R packages. I've set up packrat to do this for you. In theory, all you have to do is navigate to the project directory

cd ~/Nyvac_096_Microbiome

And then run R from the terminal.

The packrat library should bootstrap itself and then install all of the necessary R packages.

If that doesn't happen, try running install.packages('packrat') and then restoring from snapshot packrat::restore()

There are some packages that I don't load with library(); instead, I call their functions by specifying the package name, e.g. rsample::bootstraps(). These may need to be installed manually, unless packrat starts tracking them. You may need to run the following:

install.packages(c('rsample'))

I'm still looking for other packages like this.

Note: I had been trying to use conda to install R packages, but did not have success. Activate IRkernel from within R to connect it to Jupyter Notebook. Jupyter Notebook must be installed, and the system restarted, before this command will work:

IRkernel::installspec()

How to run analyses

Upstream analysis

Upstream analysis is not necessary to redo the downstream analysis.

The order for this analysis is:

  • demultiplex
  • call SVs
  • make tree
  • generate taxonomic information.

These scripts can be run in order by calling all_upstream.sh from the scripts/ directory.

On systems running slurm (such as Fred Hutch's rhino cluster), you can call sbatch scripts/upstream.sbatch in order to request a 16 node cluster. This should take about 8 hours to run. The slow step is remaking the phylogenetic tree. If I was going to do this again from scratch, I'd probably use raxml.

all_upstream.sh just calls other scripts. The individual pieces can be run as follows:

  • To demultiplex, run sh scripts/demultBothPlates.sh
  • The next three scripts must be run inside of the nyvac-lab-2 environment: source attach nyvac-lab-2
  • To remake the DADA2 sequence variants, run Rscript dada2work-March2018Run.R. One can also open the R script and run it in any R interpreter. (This is true of all of the subsequent R steps. Such a process makes for substantially easier debugging.)
  • To make the phylogenetic tree Rscript makeTree.R
  • To generate taxonomic information first acquire necessary training data by running sh pull_training.sh. Then run Rscript dada2taxonomy-March2018Run.R.
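Since each step depends on the output of the one before it, a driver that stops at the first failure avoids running later steps on incomplete data. The following is a minimal sketch of that idea and is not the contents of all_upstream.sh: the function name run_in_order is hypothetical, and the commented usage simply repeats the commands listed above, assuming they are run from a directory where those paths resolve and with the environment already activated.

```shell
#!/bin/sh
# Hypothetical helper: runs each command string in order and stops at the
# first one that exits with a non-zero status.
run_in_order() {
    for step in "$@"; do
        echo ">> $step"
        sh -c "$step" || { echo "FAILED: $step" >&2; return 1; }
    done
}

# Example usage with the commands from the list above (not executed here):
# run_in_order \
#     "sh scripts/demultBothPlates.sh" \
#     "Rscript dada2work-March2018Run.R" \
#     "Rscript makeTree.R" \
#     "sh pull_training.sh" \
#     "Rscript dada2taxonomy-March2018Run.R"
```

If a step fails, the function reports which one and returns 1, so the remaining steps are skipped.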

Downstream analysis.

This can be run independently of the upstream analysis. It defaults to using data from the data/ directory. In theory, all one should need to do is open the Mar2018_096.ipynb file in Jupyter Notebook or JupyterLab and run all of the cells.

If you want to run it on re-analyzed data, find and comment out the line upOriginal <- TRUE (so that it reads #upOriginal <- TRUE) and uncomment the line upOriginal <- FALSE.

If you want the script to run faster, set jnperm <- 9999; the cost is that the p-values are calculated less precisely. If you want p-values that don't fluctuate from run to run, set jnperm <- 99999 and maybe go get lunch while the file is running.
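To get a feel for the trade-off, it helps to look at the resolution of a permutation p-value. The sketch below assumes the common convention in which a p-value is computed as (b + 1) / (n + 1), where n is the number of permutations and b is the number of permuted statistics at least as extreme as the observed one; whether the notebook uses exactly this convention is an assumption. Under it, the smallest attainable p-value is 1 / (n + 1). The function name min_perm_p is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints the smallest p-value attainable with n
# permutations, assuming p = (b + 1) / (n + 1).
min_perm_p() {
    awk -v n="$1" 'BEGIN { printf "%g\n", 1 / (n + 1) }'
}

min_perm_p 9999    # prints 0.0001
```

More permutations give a finer grid of possible p-values and less run-to-run fluctuation, at the price of proportionally longer run time.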

Changes since earlier versions:

In order to use the breakaway package by adw36, which I need to calculate richness (and confidence intervals, and to run appropriate statistics), I need an R version > 3.5. This branch is for the newest version, 3.6.1.

This change led to new bugs, now resolved, and took care of some old bugs. I have rewritten the readme to accommodate these changes.

About

Full stack version of 096 Nivac and Microbiome study
