Copyright © 2011 by Cornell University and the Cornell Research
Foundation, Inc. All Rights Reserved.

Permission to use, copy, modify and distribute any part of GobyWeb web application for next-generation sequencing data alignment and analysis, officially docketed at Cornell as D-5061 (“WORK”) and its associated copyrights for educational, research and non-profit purposes, without fee, and without a written agreement is hereby granted, provided that the above copyright notice, this paragraph and the following three paragraphs appear in all copies.

Those desiring to incorporate WORK into commercial products or use WORK and its associated copyrights for commercial purposes should contact the Cornell Center for Technology Enterprise and Commercialization at 395 Pine Tree Road, Suite 310, Ithaca, NY 14850; email:cctecconnect@cornell.edu; Tel: 607-254-4698; FAX: 607-254-5454 for a commercial license.

IN NO EVENT SHALL THE CORNELL RESEARCH FOUNDATION, INC. AND CORNELL UNIVERSITY BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF WORK AND ITS ASSOCIATED COPYRIGHTS, EVEN IF THE CORNELL RESEARCH FOUNDATION, INC. AND CORNELL UNIVERSITY MAY HAVE BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

THE WORK PROVIDED HEREIN IS ON AN “AS IS” BASIS, AND THE CORNELL RESEARCH FOUNDATION, INC. AND CORNELL UNIVERSITY HAVE NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS. THE CORNELL RESEARCH FOUNDATION, INC. AND CORNELL UNIVERSITY MAKE NO REPRESENTATIONS AND EXTEND NO WARRANTIES OF ANY KIND, EITHER IMPLIED OR EXPRESS, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE, OR THAT THE USE OF WORK AND ITS ASSOCIATED COPYRIGHTS WILL NOT INFRINGE ANY PATENT, TRADEMARK OR OTHER RIGHTS.

Download

Overview

For an overview of this software, please see the project home page. GobyWeb is under constant development and this distribution is in binary form only (yet portable thanks to Java and Groovy). Please make sure you read the copyright notice found at the top of this document before downloading the distribution. A copy of this non-commercial license is also packaged with the distribution.

These instructions are valid for the following public GobyWeb Binary Distributions:

  • 1.7.1 (118 MB, February 23, 2012)
  • 1.6.1 (90 MB, November 15, 2011)

Installation of the Binary Distribution

Hardware and Software Requirements

Because next-gen sample alignment and analysis are computationally intensive, GobyWeb is designed to run on more than one machine to increase throughput. Figure 1 shows the architecture GobyWeb is designed to use. A single front-end web server runs the GobyWeb web application. The backend consists of a single database (we have tested with Oracle) and multiple servers (compute nodes) running in an Oracle Grid Engine (OGE) cluster.

Figure 1, Infrastructure Overview

Hardware requirements, database server

  • No specific hardware requirements from the GobyWeb standpoint.

Hardware requirements, web server

  • Linux-based web server running Apache and Tomcat. In our setup, Apache receives all web traffic and uses mod_jk to direct traffic to the appropriate Tomcat. As Tomcat can also act as the web server, Apache is an optional component. We recommend a web server with 4+ cores and at least 8GB of memory. The web server should have an associated filesystem to store:
    • Samples, as they are being uploaded but before moving to the cluster’s shared filesystem.
    • Results of alignments, differential expression, etc.

Hardware requirements, OGE compute nodes

  • One or more Linux-based OGE compute nodes and an associated OGE queue. Alignment and alignment analysis is a computationally and memory intensive task, so we recommend multiple OGE compute nodes, each with 4+ cores and at least 32GB of memory. The OGE compute nodes should have the following filesystems:
    • A filesystem shared across all compute nodes (such as via NFS mount) to store uploaded samples.
    • Filesystems, preferably local to each compute node (for speed), containing the references indexed with each supported aligner and various per-reference support files. These filesystems local to each compute node are where most per-job execution occurs to increase speed and decrease network traffic.

Software requirements, database server

  • We have tested with Oracle 10.2. Although untested, other database servers, such as MySQL, should theoretically work.
    The database stores all meta-data for GobyWeb objects (Samples, Alignment Jobs, Alignments, etc.). As all the GobyWeb item meta-data is stored in the Oracle database, we strongly urge you to have a verified backup and restore scheme in place.

Software requirements and accounts, web server

  • The bash shell installed to /bin/bash.
  • A user account. We recommend using the name gobyweb. This user’s default shell should be bash.
  • The home directory for user (gobyweb) on the web server should not be the same as the home directories for the OGE compute nodes.
  • Tomcat, we are running 7.0.22.
  • The normal complement of Linux/UNIX utilities (sed, cut, etc.).
  • SSH
  • Java, preferably the latest version (1.6_29 as of this writing).
    Environment variable JAVA_HOME pointing to your installation of Java.
    $JAVA_HOME/bin in your PATH.
  • Groovy, preferably the latest version (1.8.3 as of this writing)
    Environment variable GROOVY_HOME pointing to your installation of Groovy.
    $GROOVY_HOME/bin in your PATH

Software requirements and accounts, OGE compute nodes

  • The bash shell installed to /bin/bash.
  • A user account. We recommend using the name gobyweb. This user’s default shell should be bash.
  • The home directories for the user (gobyweb) on all of the OGE compute nodes should be shared.
  • Oracle Grid Engine (OGE) / Sun Grid Engine (SGE). We are running SGE 6.2u5.
  • The normal complement of Linux/UNIX utilities (sed, cut, etc.).
  • SSH
  • The “R” programming language. We use version 2.12.2.
  • OPTIONAL: cURL. This can be used during index creation to assist with downloading annotation files from MartService.
  • Java, preferably the latest version (1.6_29 as of this writing).
    The Environment variable JAVA_HOME pointing to your installation of Java.
    $JAVA_HOME/bin in your PATH
  • Groovy, preferably the latest version (1.8.3 as of this writing)
    The Environment variable GROOVY_HOME pointing to your installation of Groovy.
    $GROOVY_HOME/bin in your PATH
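
The software requirements above can be checked with a short script before continuing. The following is a minimal sketch, not part of the GobyWeb distribution: the function names and the REQUIRED_COMMANDS override are hypothetical, and the default command list only covers the items named in the lists above.

```shell
# Print one line per missing prerequisite; return non-zero if anything is missing.
check_home_var() {
    # $1 = variable name, $2 = its current value
    if [ -z "$2" ] || [ ! -d "$2" ]; then
        echo "unset or invalid: $1"
        return 1
    fi
}

check_gobyweb_prereqs() {
    local missing=0
    local cmd
    # bash must be installed at /bin/bash
    [ -x /bin/bash ] || { echo "missing: /bin/bash"; missing=1; }
    # REQUIRED_COMMANDS is a hypothetical override of the default command list
    for cmd in ${REQUIRED_COMMANDS:-sed cut ssh java groovy}; do
        command -v "$cmd" >/dev/null 2>&1 || { echo "missing command: $cmd"; missing=1; }
    done
    # JAVA_HOME and GROOVY_HOME must point at existing directories
    check_home_var JAVA_HOME "${JAVA_HOME:-}" || missing=1
    check_home_var GROOVY_HOME "${GROOVY_HOME:-}" || missing=1
    return "$missing"
}
```

Run check_gobyweb_prereqs as the gobyweb user on the web server and on each compute node. On compute nodes, the list could be extended, for example with REQUIRED_COMMANDS="sed cut ssh java groovy R".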

Firewall and SSH considerations

  • The web server needs the ability to communicate with the database server.
  • The compute nodes need to be able to communicate with the web application via http on the same port / urls that users use to access GobyWeb.
  • The compute nodes and the web server need to be able to communicate bidirectionally, without a password. This includes but may not be limited to ssh, scp, and rsync.
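
One way to confirm password-less SSH is to attempt a connection in batch mode, which fails instead of prompting. This is a sketch under assumptions: the function name, the SSH_CMD override, and any host names you pass are hypothetical and not part of GobyWeb.

```shell
# Try a password-less login to each host given as an argument.
# SSH_CMD is a hypothetical override, useful for previewing or testing.
check_passwordless_ssh() {
    local host
    local failed=0
    for host in "$@"; do
        # BatchMode=yes makes ssh fail rather than prompt for a password
        if ${SSH_CMD:-ssh} -o BatchMode=yes -o ConnectTimeout=5 "$host" true; then
            echo "ok: $host"
        else
            echo "FAILED: $host"
            failed=1
        fi
    done
    return "$failed"
}
```

Because communication must be bidirectional, run it as the gobyweb user on the web server with the compute node names as arguments, and again on a compute node with the web server name.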

Disc Space Requirements

Next-gen alignment is a computationally, memory, and storage intensive process. The various filesystems, outlined above, need to have a fair bit of space. It is not uncommon for a single Sample (reads file) to be larger than a gigabyte and for the resultant alignment from that Sample to be over 150 megabytes. Additionally, each experiment often requires multiple Samples. The quantity of required disc space can grow rapidly.

As the database is only used to store the meta-data about Samples, Alignments, etc. the amount of space required by the Oracle database is much more modest. An Oracle Tablespace on the order of 500 megabytes should be more than sufficient for a great number of Samples and Alignments.
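
As a rough illustration of how quickly storage needs grow, the figures above can be turned into a lower-bound estimate. This sketch assumes exactly one alignment per Sample and uses 1024 MB and 150 MB as minimum sizes; real files are often larger, and the function name is hypothetical.

```shell
# Lower-bound estimate, in megabytes, for N Samples plus one alignment each.
estimate_storage_mb() {
    local samples="$1"
    local sample_mb=1024      # a Sample is often larger than a gigabyte
    local alignment_mb=150    # an alignment is often over 150 megabytes
    echo $(( samples * (sample_mb + alignment_mb) ))
}
```

For example, estimate_storage_mb 10 prints 11740, so ten Samples already call for more than 11 gigabytes before any further analysis results are stored.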

Cluster-Side Installation

Before progressing, please make sure the above hardware, software, and configuration requirements have been met.

We will call the collection of OGE compute nodes the “cluster”.

Extract the gobyweb binary distribution to the home directory of the gobyweb account. This will make the following files and directories

~gobyweb/
     webapp.tgz 		[this will be moved to the webserver]
     goby/
          goby.jar
          log4j.properties
          goby.properties
          *.R
          *k
          *.jsap
          *.bin
          lib/
               sqlite-jdbc-3.7.2.jar
          nextgen-tools/
               [BWA, GSNAP, and SAMTOOLS go here]
     index-creation/
          reference-db/
               [this is where indexed references will go]
          create-indexes.sh
          Biomart.groovy
     goby-1.9.8-cpp.tgz

Next, in the ~gobyweb/ directory on the cluster make the following additional directories.

~gobyweb/
     GOBYWEB_SGE_JOBS
     GOBYWEB_FILES

Obtaining the adapters.txt file

The adapters.txt file contains sequences for paired-end adapters used in Illumina sequencing protocols. You can obtain these sequences from support@illumina.com and put them in a plain-text file, one sequence per line. If sequences contain Xs, expand each possibility to represent a fully specified sequence. For instance, expand AXT to AAT, ACT, ATT, and AGT, and put each possibility on its own line.

Once you have obtained this file, install it on the cluster where GobyWeb expects to find it: ~gobyweb/goby/adapters.txt.
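
The expansion of Xs can be automated. The following sketch is not part of the GobyWeb distribution and the function name is hypothetical. It replaces the first X with each of A, C, G, and T and recurses until no X remains.

```shell
# Print every fully specified sequence obtained by replacing each X
# in the argument with A, C, G, and T, one sequence per line.
expand_adapter() {
    local seq="$1"
    local prefix suffix base
    case "$seq" in
        *X*) ;;                                   # still contains an X, keep expanding
        *) printf '%s\n' "$seq"; return 0 ;;      # fully specified, print it
    esac
    prefix="${seq%%X*}"   # text before the first X
    suffix="${seq#*X}"    # text after the first X
    for base in A C G T; do
        expand_adapter "${prefix}${base}${suffix}"
    done
}
```

Calling expand_adapter AXT prints AAT, ACT, AGT, and ATT on separate lines. To expand a whole file (raw_adapters.txt is a hypothetical name), read it line by line and redirect the output to ~gobyweb/goby/adapters.txt:

    while IFS= read -r line; do expand_adapter "$line"; done < raw_adapters.txt > ~gobyweb/goby/adapters.txt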

Installing the Goby and SQLite JDBC Driver Jars for Groovy

The distribution contains the SQLite JDBC Driver sqlite-jdbc-3.7.2.jar. This should be installed for use with Groovy using the following commands

  • mkdir -p ~/.groovy/lib
  • ln -s ~/goby/lib/sqlite-jdbc-3.7.2.jar ~/.groovy/lib
  • ln -s ~/goby/goby.jar ~/.groovy/lib

Building and installing the goby-1.9.8-cpp Library

The goby-1.9.8-cpp.tgz file is included in the distribution and is required to build the aligners that have native Goby file format support. To build this library, use the following steps

Unpack the library tar file

  • cd ~
  • tar zxvf goby-1.9.8-cpp.tgz

Follow the instructions found in ~/goby-1.9.8-cpp/README.txt

Configuring R

Assuming you have R installed, we’ll need to configure R and install several R packages for GobyWeb. First, let’s add a few variables to ~gobyweb/.bash_profile (these assume you are running an R 2.12 version)

export R_HOME=`R RHOME | /bin/grep --invert-match WARNING`
export R_LIB1=${R_HOME}/lib
export R_LIB2=${HOME}/R/x86_64-unknown-linux-gnu-library/2.12/rJava/jri
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:${R_LIB1}:${R_LIB2}

You should log out and then log back in so these environment variables take effect.

Secondly, we need to install several R packages. Execute the following commands (the first command is to start R, the subsequent commands are R commands). It should be noted that the first time you execute install.packages below, it will ask about creating a local R library directory. In our case this was in ${HOME}/R/x86_64-unknown-linux-gnu-library/2.12, as seen above in R_LIB2. If the directory it uses is different in your installation, you should change R_LIB2 above to reflect the actual directory.

  • R
  • install.packages('Rserve',,'http://www.rforge.net/')
  • install.packages('ROCR', dependencies=TRUE)
  • source("http://bioconductor.org/biocLite.R")
  • biocLite()
  • biocLite("DESeq")
  • install.packages("Cairo", dependencies=TRUE)
  • install.packages("rJava", dependencies=TRUE)
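
After the installs finish, it can help to confirm that each package loads. This is a minimal sketch that assumes Rscript (shipped with R) is on the PATH; the function name and the RSCRIPT override are hypothetical.

```shell
# Try to load each R package GobyWeb needs and report the result.
# RSCRIPT is a hypothetical override for the Rscript executable.
check_r_packages() {
    local pkg
    local failed=0
    for pkg in Rserve ROCR DESeq Cairo rJava; do
        if ${RSCRIPT:-Rscript} -e "library($pkg)" >/dev/null 2>&1; then
            echo "ok: $pkg"
        else
            echo "MISSING: $pkg"
            failed=1
        fi
    done
    return "$failed"
}
```

Run check_r_packages as the gobyweb user on a compute node. Any package reported as MISSING should be reinstalled from within R.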

Building the Aligners

This distribution of GobyWeb only supports the BWA and GSNAP aligners and requires these aligners be built with native Goby file format support. In the very near future, the next version of GobyWeb will be distributed and it will support additional aligners such as Last and have greater configuration flexibility supported by a plug-in architecture.

Building and installing the BWA aligner with Goby support

GobyWeb depends on a version of BWA that has been modified to directly support Goby file formats.
To build and install this version of BWA, execute the following commands

  • cd ~
  • wget \
    http://campagnelab.org/files/20110428-bwa-0.5.9-icb-beta.tgz
  • tar zxvf 20110428-bwa-0.5.9-icb-beta.tgz
  • cd bwa-0.5.9-icb-beta
  • chmod +x autogen.sh
  • ./configure --with-goby
  • make
  1. If at this point you get a compilation error similar to
    Supported emulations: elf_x86_64 elf_i386 i386linux
    collect2: ld returned 1 exit status

    you should edit the Makefile and look for the line goby_LIBS and change the
    “-m” to “-mt” then run make again

  2. Install the newly built version of bwa-icb
  • mkdir -p ${HOME}/goby/nextgen-tools/bwa/
  • cp bwa ${HOME}/goby/nextgen-tools/bwa/bwa-icb

Building and installing the GSNAP aligner with Goby support

The current version of GMAP/GSNAP (version 2011-10-16 as of this writing) supports Goby file formats. The instructions for building and installing GSNAP are

  • wget \
    http://research-pub.gene.com/gmap/src/gmap-gsnap-2011-10-16.tar.gz
  • tar zxvf gmap-gsnap-2011-10-16.tar.gz
  • cd gmap-2011-10-16
  • ./configure --with-goby=${LOCAL_LIB}
  • make
  • mkdir -p ${HOME}/goby/nextgen-tools/gsnap/
  • cp src/gsnap ${HOME}/goby/nextgen-tools/gsnap/gsnap-icb
  • cp src/cmetindex \
    src/gmap \
    src/gmapindex \
    src/iit_store \
    util/dbsnp_iit \
    util/fa_coords \
    util/gmap_build \
    util/gmap_process \
    util/gmap_setup \
    util/gmap_compress \
    util/gmap_reassemble \
    util/gmap_uncompress \
    util/md_coords \
    util/psl_genes \
    util/psl_introns \
    util/psl_splicesites ${HOME}/goby/nextgen-tools/gsnap/

Installing samtools

The instructions for downloading, building, and installing samtools are

  • wget \
    http://downloads.sourceforge.net/project/samtools/samtools/0.1.14/samtools-0.1.14.tar.bz2
  • tar jxvf samtools-0.1.14.tar.bz2
  • cd samtools-0.1.14
  • make
  • mkdir -p ${HOME}/goby/nextgen-tools/samtools/
  • cp samtools ${HOME}/goby/nextgen-tools/samtools/
  • cp bcftools/bcftools ${HOME}/goby/nextgen-tools/samtools/

Installing tabix

The instructions for downloading, building, and installing tabix are

  • wget \
    http://downloads.sourceforge.net/project/samtools/tabix/tabix-0.2.3.tar.bz2
  • tar jxvf tabix-0.2.3.tar.bz2
  • cd tabix-0.2.3
  • make
  • mkdir -p ${HOME}/goby/nextgen-tools/tabix/
  • cp bgzip ${HOME}/goby/nextgen-tools/tabix/
  • cp tabix ${HOME}/goby/nextgen-tools/tabix/

Installing vcftools

The instructions for downloading, building, and installing vcftools are

  • wget \
    http://downloads.sourceforge.net/project/vcftools/vcftools_v0.1.4a.tar.gz
  • tar zxvf vcftools_v0.1.4a.tar.gz
  • cd vcftools_0.1.4a
  • make
  • mkdir -p ${HOME}/goby/nextgen-tools/vcftools/
  • cp perl/* ${HOME}/goby/nextgen-tools/vcftools/
  • cp cpp/vcftools ${HOME}/goby/nextgen-tools/vcftools/

Edit your ~gobyweb/.bash_profile to include the following lines

# the following is required by VCFTOOLS perl scripts:
export PERL5LIB=${PERL5LIB}:${HOME}/goby/nextgen-tools/vcftools/
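
Once the aligners and tools are built, a quick check can confirm that each executable landed where the steps above copy it. This sketch is not part of the distribution. The function name is hypothetical, and the list covers only the main executables named in the build steps.

```shell
# Verify that the executables copied in the build steps exist under the
# nextgen-tools directory (default: $HOME/goby/nextgen-tools).
check_nextgen_tools() {
    local base="${1:-$HOME/goby/nextgen-tools}"
    local tool
    local missing=0
    for tool in bwa/bwa-icb gsnap/gsnap-icb samtools/samtools \
                samtools/bcftools tabix/bgzip tabix/tabix vcftools/vcftools; do
        if [ -x "$base/$tool" ]; then
            echo "ok: $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    return "$missing"
}
```

Run check_nextgen_tools with no arguments as the gobyweb user on the cluster. A MISSING line points to the build section that needs to be repeated.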

Creating reference indexes

One of the primary jobs of GobyWeb is to align fasta/fastq samples to a reference. In order to do this, we will need to index these references using the various aligners. Ultimately we’ll recommend you store these indexed references on the filesystems that are local to each OGE compute node. To assist with the index creation, we’ve created some scripts. You should find these scripts in ~gobyweb/index-creation/.

You will first need to obtain a reference file. The references we use generally come from ftp://ftp.ensembl.org/pub.

For example, for the mouse (Mus musculus) reference NCBIM37.64, the key information is

INPUT_FASTA=Mus_musculus.NCBIM37.64.dna.toplevel.fa.gz
INPUT_CDNA_FASTA=Mus_musculus.NCBIM37.64.cdna.all.fa.gz
ORGANISM=mus_musculus
VERSION=NCBIM37.64

Edit the file create-indexes.sh. Near the top, you will find lots of examples / configurations for references you might want to build, but only leave one set of configuration values uncommented. Further down, you will find variables for GSNAP_DIR, BWA_DIR, and SAMTOOLS_DIR – you probably don’t need to change these. Just below that you will find the variables ALIGNERS and SPACES. These define which aligners you want to create references with and if you want to create “basespace” and/or “colorspace” indexes (not every aligner supports both).

OPTIONAL STEP: cURL can optionally be used by the script Biomart.groovy. By default, cURL isn’t used, but if you have a relatively recent version of cURL (7.21.6 or later, which can be checked with the “curl --version” command) you can enable downloading using cURL. The benefit of using cURL is that it gives a more visual indication of the status of the file transfers. To enable this, edit Biomart.groovy and change the downloadWithCurl option to

boolean downloadWithCurl = true

Once you have configured create-indexes.sh, go ahead and run it. Index creation can take a long time (easily hours if you are creating indexes for multiple aligners). During the process of creating indexes, a few annotation files will be fetched from BioMart. The process of downloading annotations can be very time consuming.

Once the indexes have been created, if you created basespace indexes with the BWA aligner, you will have the following directory structure

~gobyweb/
   index-creation/
     reference-db/
       NCBIM37.64/
         mus_musculus/
           reference/	[original reference, annotation files, etc.]
           basespace/
              bwa/
                 index*    [the actual bwa index files]
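
To see which indexes have been built so far, the directory layout above can be walked with a few globs. This is a sketch, not part of the distribution. The function name is hypothetical, and it assumes the VERSION/ORGANISM/basespace|colorspace/ALIGNER layout shown above.

```shell
# List built indexes as "VERSION ORGANISM SPACE ALIGNER", one per line.
list_indexes() {
    local root="${1:-$HOME/index-creation/reference-db}"
    local dir rel
    for dir in "$root"/*/*/basespace/* "$root"/*/*/colorspace/*; do
        [ -d "$dir" ] || continue      # skip glob patterns that matched nothing
        rel="${dir#"$root"/}"          # path relative to the reference-db root
        echo "$rel" | tr '/' ' '
    done
}
```

With the example above, list_indexes would print a line such as "NCBIM37.64 mus_musculus basespace bwa".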

Once you have built the references you need, you’ll want to install them to the compute nodes. While near-term future versions of GobyWeb will be more flexible regarding reference directory locations, the current version of GobyWeb requires they be located using this exact directory structure on the OGE cluster nodes.

/scratchLocal/
   gobyweb/
      input-data/
         reference-db/
            VERSION/
               ORGANISM/
                  reference/
                  basespace|colorspace/
                     ALIGNER_NAME/
                        index*

Make sure /scratchLocal/gobyweb/input-data/reference-db/ exists and copy the contents of ~gobyweb/index-creation/reference-db/ into it.

Assuming you are placing these directories of indexes and support files on local disc of each OGE compute node, you can now mirror or rsync the files to all of the OGE compute nodes. It might be tempting to store these references on a single shared network resource, but this would likely be very detrimental to the overall throughput of the alignment.
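
Mirroring the references to each compute node could look like the following sketch. The function name, the RSYNC override, and the node names you pass are hypothetical. It assumes the destination directory already exists on each node and that password-less SSH is in place.

```shell
# Copy the built references to the local disc of each compute node.
# RSYNC is a hypothetical override, e.g. RSYNC=echo to preview the commands.
sync_references() {
    local src="$HOME/index-creation/reference-db/"
    local dest="/scratchLocal/gobyweb/input-data/reference-db/"
    local node
    for node in "$@"; do
        # -a preserves permissions and timestamps; the trailing slash on
        # the source copies its contents rather than the directory itself
        ${RSYNC:-rsync} -a "$src" "$node:$dest" || return 1
    done
}
```

For example, sync_references node1 node2 copies the references to two nodes in turn. Because rsync only transfers changed files, the same command can be repeated after new indexes are built.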

Setting up and configuring the GobyWeb web application

Extract the web application, make directories

Move the “webapp.tgz” file from the distribution to the ~gobyweb/ directory on the web server, extract the files from the archive, make support directories, and create the configuration files from the SAMPLE versions.

  • cd ~gobyweb/
  • tar zxvf webapp.tgz
  • mkdir GOBYWEB_FILES
  • mkdir GOBYWEB_RESULTS
  • mkdir GOBYWEB_UPLOADS
  • cd webapp/conf
  • cp DataSource.groovy-SAMPLE DataSource.groovy
  • cp Config.groovy-SAMPLE Config.groovy

The files in the gobyweb account should now be similar to

~gobyweb/
     GOBYWEB_FILES/
     GOBYWEB_RESULTS/
     GOBYWEB_UPLOADS/
     apache-tomcat-7.0.22/
     webapp/
          tomcat-cmd
          temp/
          conf/
              Config.groovy
              Config.groovy-SAMPLE
              DataSource.groovy
              DataSource.groovy-SAMPLE
              server.xml
              tomcat-users.xml
              web.xml
          webapps/
               ROOT/
                    index.html
               gobyweb.war
          logs/

If you are familiar with using Tomcat, you might notice that we do not use the “webapps” directory within the ~gobyweb/apache-tomcat directory, but instead store the GobyWeb web application (gobyweb.war) in the ~gobyweb/webapp/webapps/ directory.

Configure GobyWeb

To configure your instance of GobyWeb, edit the two files

~gobyweb/webapp/conf/Config.groovy
~gobyweb/webapp/conf/DataSource.groovy

For each of these files, we’ve provided a “-SAMPLE” version that contains documentation describing the configuration options. These files are written in Groovy. Groovy is a JVM-based language that is very similar to Java. You can learn more about Groovy at

http://groovy.codehaus.org/
http://groovy.codehaus.org/Beginners+Tutorial

but you shouldn’t need any Groovy experience to edit these files.

Config.groovy configures application options, directories, etc. DataSource.groovy configures how to connect to the Oracle database.

If you intend to use a database server other than Oracle, you will need to install that server’s JDBC .jar file. This jar file should be placed in

~gobyweb/webapp/lib/

Configure Tomcat

To configure the instance of Tomcat, you will need to edit the files

~gobyweb/webapp/conf/server.xml
~gobyweb/webapp/conf/tomcat-users.xml

The server.xml that comes with the GobyWeb distribution specifies that GobyWeb will run on port 8106. Change this port to any port number suitable for your local system.
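
The port change can also be scripted. This sketch assumes the port appears in server.xml as the attribute port="8106", which you should confirm in your copy of the file before relying on it. The function name is hypothetical.

```shell
# Replace port="8106" in a server.xml with a new port, keeping a backup
# of the original as <file>.bak.
set_gobyweb_port() {
    local file="$1"
    local new_port="$2"
    cp "$file" "$file.bak" || return 1
    sed "s/port=\"8106\"/port=\"$new_port\"/" "$file.bak" > "$file"
}
```

For example, set_gobyweb_port ~gobyweb/webapp/conf/server.xml 8080 would switch the instance to port 8080 and leave the original in server.xml.bak.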

As far as GobyWeb is concerned, there is very little that needs to be different from a “stock” Tomcat server.xml file, so you may prefer to use the server.xml that comes from the Tomcat distribution. One possible exception is the line

<Context path="" docBase="/home/gobyweb/webapp-dev/webapps/ROOT" reloadable="true" />

This line specifies the location of this instance’s ROOT directory, which, in the GobyWeb distribution, contains the file index.html. This file ensures that if a user goes to

http://your_server/

they will be redirected to the running GobyWeb application at

http://your_server/gobyweb

The last file you need to edit is ~/webapp/tomcat-cmd. Look for the line

CATALINA_HOME=${HOME}/apache-tomcat-7.0.22

and edit it so it points to the directory where you installed Tomcat.

Starting GobyWeb

To start GobyWeb, execute the following

  • cd ~gobyweb/webapp
  • ./tomcat_cmd start

You can monitor the status and activity of GobyWeb by monitoring the Tomcat log

  • tail -f ~gobyweb/webapp/logs/catalina.out
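
Tomcat can take a while to deploy the web application, so a script that starts GobyWeb may want to wait until it answers HTTP requests. This sketch assumes cURL is installed. The function name and the CURL and WAIT_SECONDS overrides are hypothetical, and the URL must match your own server and port.

```shell
# Poll a URL until it responds successfully or the attempts run out.
# Usage: wait_for_gobyweb URL [ATTEMPTS]
wait_for_gobyweb() {
    local url="$1"
    local attempts="${2:-30}"
    local i=0
    while [ "$i" -lt "$attempts" ]; do
        # -f: fail on HTTP errors, -s: silent, -o /dev/null: discard the page
        if ${CURL:-curl} -f -s -o /dev/null "$url"; then
            echo "GobyWeb is responding at $url"
            return 0
        fi
        i=$((i + 1))
        sleep "${WAIT_SECONDS:-10}"
    done
    echo "GobyWeb did not respond at $url"
    return 1
}
```

For example, wait_for_gobyweb http://your_server:8106/gobyweb polls the default port up to 30 times, 10 seconds apart.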

Logging Into GobyWeb

The first time you run a new instance of GobyWeb, an “administrator” account will be created. The default login for this account is

Username: admin
Password: default_password

It is recommended that you change the default password immediately (you can do this from the Account tab in the deployed web application).

Stopping GobyWeb

To stop GobyWeb, execute the following

  • cd ~gobyweb/webapp
  • ./tomcat_cmd stop