Copyright © 2011 by Cornell University and the Cornell Research Foundation, Inc. All Rights Reserved.
Permission to use, copy, modify and distribute any part of GobyWeb web application for next-generation sequencing data alignment and analysis, officially docketed at Cornell as D-5061 (“WORK”) and its associated copyrights for educational, research and non-profit purposes, without fee, and without a written agreement is hereby granted, provided that the above copyright notice, this paragraph and the following three paragraphs appear in all copies.
Those desiring to incorporate WORK into commercial products or use WORK and its associated copyrights for commercial purposes should contact the Cornell Center for Technology Enterprise and Commercialization at 395 Pine Tree Road, Suite 310, Ithaca, NY 14850; email:cctecconnect@cornell.edu; Tel: 607-254-4698; FAX: 607-254-5454 for a commercial license.
IN NO EVENT SHALL THE CORNELL RESEARCH FOUNDATION, INC. AND CORNELL UNIVERSITY BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF WORK AND ITS ASSOCIATED COPYRIGHTS, EVEN IF THE CORNELL RESEARCH FOUNDATION, INC. AND CORNELL UNIVERSITY MAY HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
THE WORK PROVIDED HEREIN IS ON AN “AS IS” BASIS, AND THE CORNELL RESEARCH FOUNDATION, INC. AND CORNELL UNIVERSITY HAVE NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. THE CORNELL RESEARCH FOUNDATION, INC. AND CORNELL UNIVERSITY MAKE NO REPRESENTATIONS AND EXTEND NO WARRANTIES OF ANY KIND, EITHER IMPLIED OR EXPRESS, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, OR THAT THE USE OF WORK AND ITS ASSOCIATED COPYRIGHTS WILL NOT INFRINGE ANY PATENT, TRADEMARK OR OTHER RIGHTS.
Download
Overview
For an overview of this software, please see the project home page. GobyWeb is under constant development and this distribution is in binary form only (yet portable thanks to Java and Groovy). Please make sure you read the copyright notice found at the top of this document before downloading the distribution. A copy of this non-commercial license is also packaged with the distribution.
GobyWeb Binary Distributions are available for download:
- Latest: February 21, 2012 (118 MB)
- Older: November 15, 2011 (90 MB)
Installation of the Binary Distribution
Hardware and Software Requirements
Because next-gen sample alignment and analysis are computationally intensive, GobyWeb is designed to run on more than one machine to increase throughput. Figure 1 shows the architecture GobyWeb is designed to use. A single front-end web server runs the GobyWeb web application. The backend consists of a single database (we have tested with Oracle) and multiple servers (compute nodes) running in an Oracle Grid Engine (OGE) cluster.
Hardware requirements, database server
- No specific hardware requirements from the GobyWeb standpoint.
Hardware requirements, web server
- Linux-based web server running Apache and Tomcat. In our setup, Apache receives all web traffic and uses mod_jk to direct it to the appropriate Tomcat. As Tomcat can also act as the web server, Apache is an optional component. We recommend a web server with 4+ cores and at least 8GB of memory. The web server should have an associated filesystem to store:
- Samples, as they are being uploaded but before moving to the cluster’s shared filesystem.
- Results of alignments, differential expression, etc.
Hardware requirements, OGE compute nodes
- One or more Linux-based OGE compute nodes and an associated OGE queue. Alignment and alignment analysis is a computationally and memory intensive task, so we recommend multiple OGE compute nodes, each with 4+ cores and at least 32GB of memory. The OGE compute nodes should have the following filesystems:
- A filesystem shared across all compute nodes (such as via NFS mount) to store uploaded samples.
- Filesystems, preferably local to each compute node (for speed), containing the references indexed with each supported aligner and various per-reference support files. These filesystems local to each compute node are where most per-job execution occurs to increase speed and decrease network traffic.
Software requirements, database server
- We have tested with Oracle 10.2. Although untested, other database servers, such as MySQL, should theoretically work.
The database stores all meta-data for GobyWeb objects (Samples, Alignment Jobs, Alignments, etc.). As all the GobyWeb item meta-data is stored in the Oracle database, we strongly urge you to have a verified backup and restore scheme in place.
Software requirements and accounts, web server
- The bash shell installed to /bin/bash.
- A user account. We recommend using the name gobyweb. This user’s default shell should be bash.
- The home directory for user (gobyweb) on the web server should not be the same as the home directories for the OGE compute nodes.
- Tomcat, we are running 7.0.22.
- The normal complement of Linux/UNIX utilities (sed, cut, etc.).
- SSH
- Java, preferably the latest version (1.6_29 as of this writing).
- Environment variable JAVA_HOME pointing to your installation of Java.
- $JAVA_HOME/bin in your PATH.
- Groovy, preferably the latest version (1.8.3 as of this writing).
- Environment variable GROOVY_HOME pointing to your installation of Groovy.
- $GROOVY_HOME/bin in your PATH.
Software requirements and accounts, OGE compute nodes
- The bash shell installed to /bin/bash.
- A user account. We recommend using the name gobyweb. This user’s default shell should be bash.
- The home directories for the user (gobyweb) on all of the OGE compute nodes should be shared.
- Oracle Grid Engine (OGE) / Sun Grid Engine (SGE). We are running SGE 6.2u5.
- The normal complement of Linux/UNIX utilities (sed, cut, etc.).
- SSH
- The “R” programming language. We use version 2.12.2.
- OPTIONAL: cURL. This can be used during index creation to assist with downloading annotation files from MartService.
- Java, preferably the latest version (1.6_29 as of this writing).
- Environment variable JAVA_HOME pointing to your installation of Java.
- $JAVA_HOME/bin in your PATH.
- Groovy, preferably the latest version (1.8.3 as of this writing).
- Environment variable GROOVY_HOME pointing to your installation of Groovy.
- $GROOVY_HOME/bin in your PATH.
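The software requirements above can be checked before going further. The following is a minimal sketch, not part of the GobyWeb distribution: the function names missing_commands and missing_variables are hypothetical, and the list of tools is only the subset named in the requirements (add R on the compute nodes).

```shell
# Print the name of every command in the argument list that is not on PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Print the name of every listed environment variable that is unset or empty.
missing_variables() {
    local name value
    for name in "$@"; do
        eval "value=\${$name:-}"
        [ -n "$value" ] || printf '%s\n' "$name"
    done
}

# Tools and variables named in the requirements lists above.
missing_commands bash sed cut ssh java groovy
missing_variables JAVA_HOME GROOVY_HOME
```

If both calls print nothing, the listed tools and variables were found for the current user. Run the check as the gobyweb user on the web server and on a compute node, since the two environments are separate.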
Firewall and SSH considerations
- The web server needs the ability to communicate with the database server.
- The compute nodes need to be able to communicate with the web application via http on the same port / urls that users use to access GobyWeb.
- The compute nodes and the web server need to be able to communicate bidirectionally, without a password. This includes but may not be limited to ssh, scp, and rsync.
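One way to confirm password-less access is to attempt a non-interactive SSH login to each host. This is a sketch under assumptions: the function names are hypothetical, SSH_BIN is an override added only so the check can be dry-run, and the host names in the usage note are placeholders.

```shell
# Return 0 if a non-interactive SSH login to the given host succeeds.
# BatchMode=yes makes ssh fail instead of prompting for a password.
# SSH_BIN is a hypothetical override, useful for dry runs.
can_ssh_without_password() {
    "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$1" true
}

# Report the status of every host passed as an argument.
report_ssh_access() {
    local host
    for host in "$@"; do
        if can_ssh_without_password "$host" >/dev/null 2>&1; then
            printf 'OK   %s\n' "$host"
        else
            printf 'FAIL %s\n' "$host"
        fi
    done
}
```

For example, run `report_ssh_access gobyweb@node01 gobyweb@node02` from the web server, then run the same check from a compute node against the web server, since access must work in both directions.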
Disc Space Requirements
Next-gen alignment is a computationally, memory, and storage intensive process. The various filesystems, outlined above, need to have a fair bit of space. It is not uncommon for a single Sample (reads file) to be larger than a gigabyte and for the resultant alignment from that Sample to be over 150 megabytes. Additionally, each experiment often requires multiple Samples. The quantity of required disc space can grow rapidly.
As the database is only used to store the meta-data about Samples, Alignments, etc., the amount of space required by the Oracle database is much more modest. An Oracle Tablespace on the order of 500 megabytes should be more than sufficient for a great number of Samples and Alignments.
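To get a feel for how quickly file storage grows, the figures above can be turned into a rough estimate. The sketch below assumes 1024 MB of reads and 150 MB of alignment output per Sample, which are the lower bounds quoted above rather than measured values; the function name is hypothetical.

```shell
# Estimate storage in megabytes for a number of Samples.
# Assumption: 1024 MB of reads plus 150 MB of alignment output per Sample.
# Real files are often larger, so treat the result as a floor.
estimate_storage_mb() {
    local samples="$1"
    local reads_mb=1024
    local alignment_mb=150
    echo $(( $samples * ($reads_mb + $alignment_mb) ))
}

estimate_storage_mb 24   # 24 Samples
```

With these assumptions, 24 Samples come to 28176 MB, or roughly 28 GB, before counting differential expression results or repeated alignments of the same Sample.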
Cluster-Side Installation
Before progressing, please make sure the above hardware, software, and configuration requirements have been met.
We will call the collection of OGE compute nodes the “cluster”.
Extract the gobyweb binary distribution to the home directory of the gobyweb account. This will create the following files and directories:
~gobyweb/
webapp.tgz [this will be moved to the webserver]
goby/
goby.jar
log4j.properties
goby.properties
*.R
*k
*.jsap
*.bin
lib/
sqlite-jdbc-3.7.2.jar
nextgen-tools/
[BWA, GSNAP, and SAMTOOLS go here]
index-creation/
reference-db/
[this is where indexed references will go]
create-indexes.sh
Biomart.groovy
goby-1.9.8-cpp.tgz
Next, in the ~gobyweb/ directory on the cluster make the following additional directories.
~gobyweb/
GOBYWEB_SGE_JOBS
GOBYWEB_FILES
Obtaining the adapters.txt file
The adapters.txt file contains sequences for paired-end adapters used in Illumina sequencing protocols. You can obtain these sequences from support@illumina.com and put them in a plain text file, one sequence per line. If sequences contain Xs, expand each X into every possibility so that each sequence is fully specified. For instance, expand AXT to AAT, ACT, ATT, and AGT, and put each one on its own line.
Once you have obtained this file, install it on the cluster at ~gobyweb/goby/adapters.txt, where GobyWeb expects to find it.
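Expanding the Xs by hand is error-prone when a sequence contains several of them. The following sketch automates the expansion described above, assuming that X stands for any of A, C, G, or T as in the AXT example; the function name expand_adapter is hypothetical.

```shell
# Print every fully specified sequence obtained by replacing each X
# with A, C, G and T, one sequence per line.
expand_adapter() {
    local seq="$1" prefix suffix base
    case "$seq" in
        *X*) ;;                                   # still has an X to expand
        *) printf '%s\n' "$seq"; return 0 ;;      # fully specified
    esac
    prefix="${seq%%X*}"   # text before the first X
    suffix="${seq#*X}"    # text after the first X
    for base in A C G T; do
        expand_adapter "${prefix}${base}${suffix}"
    done
}

expand_adapter AXT
```

The call above prints AAT, ACT, AGT, and ATT on separate lines. To process a whole file, a loop such as `while IFS= read -r line; do expand_adapter "$line"; done < raw-adapters.txt > adapters.txt` could be used, where raw-adapters.txt is a placeholder name for the unexpanded list.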
Installing the Goby and SQLite JDBC Driver Jars for Groovy
The distribution contains the SQLite JDBC Driver sqlite-jdbc-3.7.2.jar. This should be installed for use with Groovy using the following commands:
- mkdir -p ~/.groovy/lib
- ln -s ~/goby/lib/sqlite-jdbc-3.7.2.jar ~/.groovy/lib
- ln -s ~/goby/goby.jar ~/.groovy/lib
Building and installing the goby-1.9.8-cpp Library
The goby-1.9.8-cpp.tgz file is included in the distribution and is required to build the aligners that have native Goby file format support. To build this library, use the following steps:
Unpack the library tar file
- cd ~
- tar zxvf goby-1.9.8-cpp.tgz
Follow the instructions found in ~/goby-1.9.8-cpp/README.txt
Configuring R
Assuming you have R installed, we’ll need to configure R and install several R packages for GobyWeb. First, let’s add a few variables to ~gobyweb/.bash_profile (these assume you are running an R 2.12 version)
export R_HOME=`R RHOME | /bin/grep --invert-match WARNING`
export R_LIB1=${R_HOME}/lib
export R_LIB2=${HOME}/R/x86_64-unknown-linux-gnu-library/2.12/rJava/jri
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${R_LIB1}:${R_LIB2}
You should logout then login so these environment variables take effect.
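One detail worth knowing: if LD_LIBRARY_PATH is empty when the export above runs, the result starts with a colon, and the dynamic loader treats an empty entry as the current directory. The sketch below shows one way to avoid that; append_path is a hypothetical helper, not something GobyWeb provides.

```shell
# Append directories to a colon-separated path without leaving an empty entry.
# Usage: append_path CURRENT_VALUE DIR [DIR...]
append_path() {
    local current="$1" dir
    shift
    for dir in "$@"; do
        if [ -z "$current" ]; then
            current="$dir"
        else
            current="$current:$dir"
        fi
    done
    printf '%s\n' "$current"
}
```

In ~gobyweb/.bash_profile this could replace the last export with `export LD_LIBRARY_PATH="$(append_path "${LD_LIBRARY_PATH:-}" "$R_LIB1" "$R_LIB2")"`.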
Secondly, we need to install several R packages. Execute the following commands (the first command is to start R, the subsequent commands are R commands). It should be noted that the first time you execute install.packages below, it will ask about creating a local R library directory. In our case this was in ${HOME}/R/x86_64-unknown-linux-gnu-library/2.12, as seen above in R_LIB2. If the directory it uses is different in your installation, you should change R_LIB2 above to reflect the actual directory.
- R
- install.packages('Rserve',,'http://www.rforge.net/')
- install.packages('ROCR', dependencies=TRUE)
- source("http://bioconductor.org/biocLite.R")
- biocLite()
- biocLite("DESeq")
- install.packages("Cairo", dependencies=TRUE)
- install.packages("rJava", dependencies=TRUE)
Building the Aligners
This distribution of GobyWeb only supports the BWA and GSNAP aligners and requires these aligners be built with native Goby file format support. In the very near future, the next version of GobyWeb will be distributed and it will support additional aligners such as Last and have greater configuration flexibility supported by a plug-in architecture.
Building and installing the BWA aligner with Goby support
GobyWeb depends on a version of BWA that has been modified to directly support Goby file formats.
To build and install this version of BWA, execute the following commands:
- cd ~
- wget http://campagnelab.org/files/20110428-bwa-0.5.9-icb-beta.tgz
- tar zxvf 20110428-bwa-0.5.9-icb-beta.tgz
- cd bwa-0.5.9-icb-beta
- chmod +x autogen.sh
- ./configure --with-goby
- make
- If at this point you get a compilation error similar to
Supported emulations: elf_x86_64 elf_i386 i386linux
collect2: ld returned 1 exit status
you should edit the Makefile, look for the line goby_LIBS, change the “-m” to “-mt”, and then run make again.
- Install the newly built version of bwa-icb
- mkdir -p ${HOME}/goby/nextgen-tools/bwa/
- cp bwa ${HOME}/goby/nextgen-tools/bwa/bwa-icb
Building and installing the GSNAP aligner with Goby support
The current version of GMAP/GSNAP (version 2011-10-16 as of this writing) supports Goby file formats. The instructions for building and installing GSNAP are
- wget http://research-pub.gene.com/gmap/src/gmap-gsnap-2011-10-16.tar.gz
- tar zxvf gmap-gsnap-2011-10-16.tar.gz
- cd gmap-2011-10-16
- ./configure --with-goby=${LOCAL_LIB}
- make
- mkdir -p ${HOME}/goby/nextgen-tools/gsnap/
- cp src/gsnap ${HOME}/goby/nextgen-tools/gsnap/gsnap-icb
- cp src/cmetindex \
src/gmap \
src/gmapindex \
src/iit_store \
util/dbsnp_iit \
util/fa_coords \
util/gmap_build \
util/gmap_process \
util/gmap_setup \
util/gmap_compress \
util/gmap_reassemble \
util/gmap_uncompress \
util/md_coords \
util/psl_genes \
util/psl_introns \
util/psl_splicesites ${HOME}/goby/nextgen-tools/gsnap/
Installing samtools
The instructions for downloading, building, and installing samtools are
- wget http://downloads.sourceforge.net/project/samtools/samtools/0.1.14/samtools-0.1.14.tar.bz2
- tar jxvf samtools-0.1.14.tar.bz2
- cd samtools-0.1.14
- make
- mkdir -p ${HOME}/goby/nextgen-tools/samtools/
- cp samtools ${HOME}/goby/nextgen-tools/samtools/
- cp bcftools/bcftools ${HOME}/goby/nextgen-tools/samtools/
Installing tabix
The instructions for downloading, building, and installing tabix are
- wget http://downloads.sourceforge.net/project/samtools/tabix/tabix-0.2.3.tar.bz2
- tar jxvf tabix-0.2.3.tar.bz2
- cd tabix-0.2.3
- make
- mkdir -p ${HOME}/goby/nextgen-tools/tabix/
- cp bgzip ${HOME}/goby/nextgen-tools/tabix/
- cp tabix ${HOME}/goby/nextgen-tools/tabix/
Installing vcftools
The instructions for downloading, building, and installing vcftools are
- wget http://downloads.sourceforge.net/project/vcftools/vcftools_v0.1.4a.tar.gz
- tar zxvf vcftools_v0.1.4a.tar.gz
- cd vcftools_0.1.4a
- make
- mkdir -p ${HOME}/goby/nextgen-tools/vcftools/
- cp perl/* ${HOME}/goby/nextgen-tools/vcftools/
- cp cpp/vcftools ${HOME}/goby/nextgen-tools/vcftools/
Edit your ~gobyweb/.bash_profile to include the following lines
# the following is required by VCFTOOLS perl scripts:
export PERL5LIB=${PERL5LIB}:${HOME}/goby/nextgen-tools/vcftools/
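After the build steps, it can help to confirm that every executable landed where the cp commands above put it. The following is a minimal sketch; the function name missing_tools is hypothetical, and the list covers only the main binaries copied in the steps above (not the GSNAP utilities or the vcftools perl scripts).

```shell
# List the expected executables (relative to the nextgen-tools directory)
# that are missing or not executable.
missing_tools() {
    local root="$1" tool
    for tool in bwa/bwa-icb gsnap/gsnap-icb samtools/samtools \
                samtools/bcftools tabix/bgzip tabix/tabix vcftools/vcftools; do
        [ -x "$root/$tool" ] || printf '%s\n' "$tool"
    done
}
```

Running `missing_tools "$HOME/goby/nextgen-tools"` as the gobyweb user should print nothing when all of the listed tools are installed.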
Creating reference indexes
One of the primary jobs of GobyWeb is to align fasta/fastq samples to a reference. In order to do this, we will need to index these references using the various aligners. Ultimately we’ll recommend you store these indexed references on the filesystems that are local to each OGE compute node. To assist with the index creation, we’ve created some scripts. You should find these scripts in ~gobyweb/index-creation/.
You will first need to obtain a reference file. The references we use generally come from ftp://ftp.ensembl.org/pub.
- First, download the single “toplevel” “dna” fa.gz reference file. For this example we are going to download the following file to the ~gobyweb/index-creation/ directory:
ftp://ftp.ensembl.org/pub/release-64/fasta/mus_musculus/dna/Mus_musculus.NCBIM37.64.dna.toplevel.fa.gz
- Second, download the “cdna” “all” fa.gz file, again into the ~gobyweb/index-creation/ directory:
ftp://ftp.ensembl.org/pub/release-64/fasta/mus_musculus/cdna/Mus_musculus.NCBIM37.64.cdna.all.fa.gz
For this reference, the key information is
INPUT_FASTA=Mus_musculus.NCBIM37.64.dna.toplevel.fa.gz
INPUT_CDNA_FASTA=Mus_musculus.NCBIM37.64.cdna.all.fa.gz
ORGANISM=mus_musculus
VERSION=NCBIM37.64
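In this example the ORGANISM and VERSION values can be read straight off the reference file name. The sketch below derives them with shell parameter expansion, assuming the file follows the Organism.Version.dna.toplevel.fa.gz pattern seen here; other Ensembl files may be named differently, so check the result against the examples in create-indexes.sh. The function name is hypothetical.

```shell
# Derive ORGANISM and VERSION from an Ensembl "dna.toplevel" file name.
# Assumption: the name looks like Organism_name.VERSION.dna.toplevel.fa.gz
parse_ensembl_name() {
    local file="$1" stem organism version
    stem="${file%.dna.toplevel.fa.gz}"   # drop the fixed suffix
    organism="${stem%%.*}"               # text before the first dot
    version="${stem#*.}"                 # text after the first dot
    organism="$(printf '%s' "$organism" | tr '[:upper:]' '[:lower:]')"
    printf 'ORGANISM=%s\nVERSION=%s\n' "$organism" "$version"
}

parse_ensembl_name Mus_musculus.NCBIM37.64.dna.toplevel.fa.gz
```

For the mouse reference used here, this prints ORGANISM=mus_musculus and VERSION=NCBIM37.64.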
Edit the file create-indexes.sh. Near the top, you will find lots of examples / configurations for references you might want to build, but only leave one set of configuration values uncommented. Further down, you will find variables for GSNAP_DIR, BWA_DIR, and SAMTOOLS_DIR - you probably don’t need to change these. Just below that you will find the variables ALIGNERS and SPACES. These define which aligners you want to create references with and if you want to create “basespace” and/or “colorspace” indexes (not every aligner supports both).
OPTIONAL STEP: cURL can optionally be used by the script Biomart.groovy. By default, cURL isn’t used, but if you have a relatively recent version of cURL (7.21.6 or later, which can be checked with the “curl --version” command) you can enable downloading using cURL. The benefit of using cURL is that it gives a clearer visual indication of the status of the file transfers. To enable this, edit Biomart.groovy and change the downloadWithCurl option
boolean downloadWithCurl = true
Once you have configured create-indexes.sh, go ahead and run it. Index creation can take a long time (easily hours if you are creating indexes for multiple aligners). During index creation, a few annotation files will be fetched from BioMart. The process of downloading annotations can be very time consuming.
Once the indexes have been created, if you indexed for basespace with the BWA aligner, you will have the following directory structure
~gobyweb/
index-creation/
reference-db/
NCBIM37.64/
mus_musculus/
reference/ [original reference, annotation files, etc.]
basespace/
bwa/
index* [the actual bwa index files]
Once you have built the references you need, you’ll want to install them to the compute nodes. While near-term future versions of GobyWeb will be more flexible regarding reference directory locations, the current version of GobyWeb requires they be located using this exact directory structure on the OGE cluster nodes.
/scratchLocal/
gobyweb/
input-data/
reference-db/
VERSION/
ORGANISM/
reference/
basespace|colorspace/
ALIGNER_NAME/
index*
Make sure /scratchLocal/gobyweb/input-data/reference-db/ exists and copy the contents of ~gobyweb/index-creation/reference-db/ into it.
Assuming you are placing these directories of indexes and support files on the local disc of each OGE compute node, you can now mirror or rsync the files to all of the OGE compute nodes. It might be tempting to store these references on a single shared network resource, but this would likely be very detrimental to the overall throughput of the alignment.
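The mirroring step can be scripted. The sketch below only prints one rsync command per compute node so the commands can be reviewed before anything is copied. It assumes it is run as the gobyweb user (so $HOME is ~gobyweb); the node names in the usage note are placeholders and the function name is hypothetical.

```shell
# Source and destination taken from the directory layout described above.
SRC="$HOME/index-creation/reference-db/"
DEST="/scratchLocal/gobyweb/input-data/reference-db/"

# Print one rsync command per compute node; nothing is executed.
print_rsync_commands() {
    local node
    for node in "$@"; do
        printf 'rsync -a %s %s:%s\n' "$SRC" "$node" "$DEST"
    done
}
```

For example, `print_rsync_commands node01 node02` prints two commands. The trailing slash on the source means the contents of reference-db are copied rather than the directory itself. Once the output looks right, the commands can be run by hand or piped to `sh`.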
Setting up and configuring the GobyWeb web application
Extract the web application, make directories
Move the “webapp.tgz” file from the distribution to the ~gobyweb/ directory on the web server, extract the files from the archive, make support directories, and create the configuration files from the SAMPLE versions.
- cd ~gobyweb/
- tar zxvf webapp.tgz
- mkdir GOBYWEB_FILES
- mkdir GOBYWEB_RESULTS
- mkdir GOBYWEB_UPLOADS
- cd webapp/conf
- cp DataSource.groovy-SAMPLE DataSource.groovy
- cp Config.groovy-SAMPLE Config.groovy
The files in the gobyweb account should now be similar to
~gobyweb/
GOBYWEB_FILES/
GOBYWEB_RESULTS/
GOBYWEB_UPLOADS/
apache-tomcat-7.0.22/
webapp/
tomcat-cmd
temp/
conf/
Config.groovy
Config.groovy-SAMPLE
DataSource.groovy
DataSource.groovy-SAMPLE
server.xml
tomcat-users.xml
web.xml
webapps/
ROOT/
index.html
gobyweb.war
logs/
If you are familiar with using Tomcat, you might notice that we do not use the “webapps” directory within the ~gobyweb/apache-tomcat directory, but instead we are storing the GobyWeb web application (gobyweb.war) in ~gobyweb/webapp/webapps/ directory.
Configure GobyWeb
To configure your instance of GobyWeb, edit the two files
~gobyweb/webapp/conf/Config.groovy
~gobyweb/webapp/conf/DataSource.groovy
For each of these files, we’ve provided a “-SAMPLE” version that contains documentation describing the configuration options. These files are written in Groovy. Groovy is a JVM-based language that is very similar to Java. You can learn more about Groovy at
http://groovy.codehaus.org/
http://groovy.codehaus.org/Beginners+Tutorial
but you shouldn’t need any Groovy experience to edit these files.
Config.groovy configures application options, directories, etc. DataSource.groovy configures how to connect to the Oracle database.
If you intend to use a database server other than Oracle, you will need to install that server’s JDBC .jar file. This jar file should be placed in
~gobyweb/webapp/lib/
Configure Tomcat
To configure the instance of Tomcat, you will need to edit the files
~gobyweb/webapp/conf/server.xml
~gobyweb/webapp/conf/tomcat-users.xml
The server.xml that comes with the GobyWeb distribution specifies that GobyWeb will run on port 8106. Change this port to any port number suitable for your local system.
As far as GobyWeb is concerned, there is very little that needs to be different from a “stock” Tomcat server.xml file, so you may prefer to use the server.xml that comes from the Tomcat distribution. One possible exception is the line
<Context path="" docBase="/home/gobyweb/webapp-dev/webapps/ROOT" reloadable="true" />
This line specifies the location of this instance’s ROOT directory, which, in the GobyWeb distribution, contains the file index.html. This file ensures that if a user goes to
http://your_server/
they will be redirected to the running GobyWeb application at
http://your_server/gobyweb
The last file you need to edit is ~/webapp/tomcat-cmd. Look for the line
CATALINA_HOME=${HOME}/apache-tomcat-7.0.22
and edit it so it points to the directory where you installed Tomcat.
Starting GobyWeb
To start GobyWeb, execute the following
- cd ~gobyweb/webapp
- ./tomcat_cmd start
You can monitor the status and activity of GobyWeb by watching the Tomcat log
- tail -f ~gobyweb/webapp/logs/catalina.out
Logging Into GobyWeb
The first time you run a new instance of GobyWeb, an “administrator” account will be created. The default login for this account is
Username: admin
Password: default_password
It is recommended that you change the default password immediately (you can do this from the Account tab in the deployed web application).
Stopping GobyWeb
To stop GobyWeb, execute the following
- cd ~gobyweb/webapp
- ./tomcat_cmd stop

