Documentation


How was PossumBase created?

Tissue samples

AgResearch collected tissue samples for the following 8 libraries:

  • POSSUM_01-C-POSSUM-IMMUNE-2KB (pooled tissue from Spleen, Lymph node, Splenocytes stimulated with ConA for 2 hours, Splenocytes stimulated with ConA for 19 hours)
    Vector: PDNRLIB; Approximately equal quantities of RNA from 4 tissues were pooled for cDNA library generation. The tissues were: 1. Spleen tissue, 2. Lymph node tissue, 3. Splenocytes stimulated with ConA for 2 hours, 4. Splenocytes stimulated with ConA for 19 hours.
     
  • POSSUM_01-POSSUM-C-BRAIN-2KB (hypothalamus and whole pituitary glands)
    Vector: PDNRLIB; Approximately equal quantities of RNA from: 1. brain tissue from the hypothalamus region from 3 individuals. 2. whole pituitary glands collected from 18 individuals.
     
  • POSSUM_01-POSSUM-C-OVARY-2KB (ovaries)
    Vector: PDNRLIB; Approximately equal quantities of RNA collected from ovaries at the following 8 stages of development: 1. Late follicular (3 tissue samples). 2. Mid luteal (3 tissue samples). 3. Juvenile (2 tissue samples). 4. Mid pregnancy (2 tissue samples). 5. Early luteal (2 tissue samples). 6. Early/Mid follicular (2 tissue samples). 7. Anaestrous (2 tissue samples). 8. Lactating (2 tissue samples).
     
  • POSSUM_01-POSSUM-GUT-2KB (pooled epidermal cells scraped from Duodenum, Ileum, Jejunum, Proximal Colon, Distal Colon, Caecum, Peyer's patches)
    Vector: PDNRLIB; Approximately equal quantities of RNA from epidermal cells scraped from the following 7 regions of the possum gut were pooled for cDNA library generation: 1. Duodenum, 2. Ileum, 3. Jejunum, 4. Proximal Colon, 5. Distal Colon, 6. Caecum, 7. Peyer's patches.
     
  • POSSUM_01-POSSUM-KIDNEY-2KB (Approximately equal quantities of RNA from Kidney Cortex and Kidney Medulla were pooled for cDNA library generation)
    Vector: PDNRLIB;
     
  • POSSUM_01-POSSUM-LIVER-2KB (RNA prepared from Liver was used for cDNA library generation)
    Vector: PDNRLIB;
     
  • POSSUM_01-POSSUM-C-EMBRYO-2KB (mixed)
    Vector: PDNRLIB; Approximately equal quantities of RNA from the following 8 sources: 1. Whole embryos (2 tissue samples). 2. Whole joeys 8-11 days old (8 tissue samples). 3. Early male reproductive tract, day 9 to day 57 (13 tissue samples). 4. Early female reproductive tract, day 22 to day 66 (13 tissue samples). 5. Mid male reproductive tract, day 73 to day 113 (10 tissue samples). 6. Mid female reproductive tract, day 78 to day 115 (7 tissue samples). 7. Late male reproductive tract, day 119 to day 168 (7 tissue samples). 8. Late female reproductive tract, day 119 to day 136 (3 tissue samples).
     
  • POSSUM_01-POSSUM-C-REPROTRACT-2KB (Oviduct, Cul-de-sac and Uterus)
    Vector: PDNRLIB; Approximately equal quantities of RNA from the Oviduct, Cul-de-sac and Uterus. Tissue from each of these three regions of the reproductive tract was collected at 8 different physiological states: Late follicular, Mid luteal, Juvenile, Mid pregnancy, Early luteal, Early/Mid follicular, Anaestrous, Lactating. Each state contributed approximately equal amounts (by weight) of tissue to the tissue pool from which each of the 3 RNA preparations was purified.

cDNA libraries

For each tissue sample a normalised cDNA library was prepared by Clontech from RNA supplied by AgResearch.

cDNA sequences

The cDNA clones were sequenced by TIGR.

Sequence processing

TIGR provided the raw sequence data (trace files) together with the nucleotide sequences and their quality values.

  • Quality filtering
    Low quality sequence ends have been removed using a custom script.
     
  • Vector trimming (seqclean)
    Vector sequences and polyA tails have been removed using TIGR's seqclean.
     
  • Preclustering (BLAST)
    An all-against-all BLAST was performed to find ESTs that might belong to the same group. This step was necessary to avoid overloading the clustering tool (cap3) used in the subsequent step. Preclustering only determined which ESTs were submitted to cap3 together, so that ESTs that might belong to the same group were submitted in the same chunk.
     
  • Clustering (cap3)
    ESTs were clustered (grouped) into contigs using cap3.
     
  • Some statistics from contig build CS40:
    111,767 ESTs clustered into 12,013 contigs and 55,749 singletons (that's 67,762 sequences in total).
    Contigs are on average 1,244.25 bp long and 4.66 sequences deep and contain 0.45% ambiguous bases (Ns).
    Singletons are on average 751.44 bp long and contain 0.72% ambiguous bases (Ns).
     
  • Mapping (BLAST)
    EST clusters were mapped (compared) against the genomic sequence of the closest available relative, the gray short-tailed opossum Monodelphis domestica.
     
  • Annotation (BLAST)
    In July 2006 EST clusters were compared against the non-redundant protein database nr.
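
The preclustering step described above can be illustrated with a minimal sketch. It assumes the all-against-all BLAST hits have already been reduced to pairs of EST identifiers, and it treats a chunk as a set of ESTs connected by hits, directly or through other ESTs. The function name `precluster` and the identifiers `est1` to `est5` are hypothetical; the method actually used to build CS40 is not documented here and may have differed.

```python
from collections import defaultdict

def precluster(est_ids, hit_pairs):
    """Group ESTs into chunks: ESTs linked by a BLAST hit, directly or
    transitively, end up in the same chunk (connected components)."""
    parent = {est: est for est in est_ids}

    def find(est):
        # Follow parent links to the representative, shortening the path.
        while parent[est] != est:
            parent[est] = parent[parent[est]]
            est = parent[est]
        return est

    for a, b in hit_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    chunks = defaultdict(list)
    for est in est_ids:
        chunks[find(est)].append(est)
    return sorted(chunks.values(), key=lambda chunk: chunk[0])

# Hypothetical input: est1-est3 are linked by hits, est4 and est5 have none.
ests = ["est1", "est2", "est3", "est4", "est5"]
hits = [("est1", "est2"), ("est2", "est3")]
chunks = precluster(ests, hits)
print(chunks)
```

Each resulting chunk could then be passed to the clustering tool on its own, which keeps the input per run small.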



Tools | GBrowse

The idea

Brushtail possum (Trichosurus vulpecula) ESTs were compared (mapped) against the genomic sequence of the closest available relative, the gray short-tailed opossum Monodelphis domestica.
The generic genome browser (GBrowse) is a genome browser which you can use to visually inspect the mapping and compare sequence annotation across a number of species. GBrowse opens in a separate window which is divided into the following parts:

  • Instructions
    Of particular interest are the Examples and [Help]. Selecting chr2 from the Examples will show chr2 in the genome browser. Clicking [Help] will bring up general help for GBrowse.
  • Search
    Here you can search for your favourite region or keyword.
    Use Scroll/Zoom to select the region and the level of detail shown.
  • Overview
    This gives you an overview of the complete region (e.g. a complete chromosome).
    Click on the axis to change position (symbolized by the red rectangle).
  • Details
    Here you can see all the information in various groups / datasets for a selected region. You can't display more than 5 Mbp in this detailed view.
  • Tracks
    Here you can select the datasets / tracks you are interested in by checking or unchecking the checkbox in front of a track name.
    Clicking on any of the track names will bring up information about this specific dataset. 
  • Display Settings
    Here you can adjust the display.
  • Add your own tracks
    You are not restricted to our tracks; you can add your own. Instructions are available by clicking on [Help].

The content

  • Opossum Assembly Contigs
    The topmost line (track) of the genome browser shows the opossum genome. The opossum contigs have been sequenced and assembled by the Broad Institute. The current build is called monDom4 (January 2006).

Other available tracks are:

  • Human RefSeq Proteins (Release 18, July 2006)
    This track shows known protein-coding genes from human, taken from the NCBI reference sequences collection (RefSeq). Human RefSeq proteins were compared against the opossum contigs using TBLASTN.
  • Ensembl Genes (Genebuild, June 2006)
    This track shows the opossum genome as annotated by Ensembl. Ensembl Genes were compared against the opossum contigs using BLAT. The Ensembl team has the following to say about their genebuild: "The gene set for Opossum was built using a modified version of the standard Ensembl genebuild pipeline. The species-specific sequence resources (opossum cDNA and protein) are very limited, so the vast majority of gene models are based on genewise alignments of proteins from other species. Most of the proteins being aligned were from species genetically distant to opossum. To improve the accuracy of models generated from these proteins, the genewise alignments were made to stretches of genomic sequence rather than to 'miniseqs'. Opossum and human cDNAs were aligned and used to add UTRs to the genewise predictions where possible. The gene models were assessed by generating sets of potential orthologs to genes from other mammalian species. Potentially missing predictions and partial gene predictions were identified by examining the orthologs, and exonerate used to build new gene models for these based on the human ortholog peptide sequence."
  • AgResearch Possum ESTs
    This track shows all of AgResearch's possum ESTs mapped against the opossum contigs using GMAP.
  • Chicken Reference mRNAs (Release 18, July 2006)
    This track shows known chicken mRNAs, taken from the NCBI reference sequences collection (RefSeq). Chicken mRNAs were compared against the opossum contigs using GMAP. Chicken-human alignments can be used to detect exons (Thomas et al., 2003. Comparative analyses of multi-species sequences from targeted genomic regions. Nature 424: 788-93).

    The following outline gives a taxonomic overview:

    Amniota
      Theria
        Marsupials (Metatheria)
          Brushtail possum (Trichosurus vulpecula)
          Gray short-tailed opossum (Monodelphis domestica)
        Human (Homo sapiens)
      Chicken (Gallus gallus)


  • All Marsupial mRNAs except possum
    This track shows all known marsupial mRNAs, taken from the NCBI reference sequences collection (RefSeq). Marsupial mRNAs were compared against the opossum contigs using GMAP.
  • AgResearch CS40 Possum Contigs
  • Non-Opossum RefSeq Genes
    Alignments provided by UCSC. This track shows known protein-coding genes from organisms other than opossum, taken from the NCBI mRNA reference sequences collection (RefSeq). The mRNAs were aligned against the opossum genome using blat; those with an alignment of less than 15% were discarded. When a single mRNA aligned in multiple places, the alignment having the highest base identity was identified. Only alignments having a base identity level within 0.5% of the best and at least 25% base identity with the genomic sequence were kept.
  • Non-Opossum mRNAs from Genbank
    Alignments provided by UCSC. This track displays translated blat alignments of vertebrate and invertebrate mRNA in GenBank from organisms other than opossum. The mRNAs were aligned against the opossum genome using translated blat. When a single mRNA aligned in multiple places, the alignment having the highest base identity was found. Only those alignments having a base identity level within 1% of the best and at least 25% base identity with the genomic sequence were kept.
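
The "near-best" filtering rule quoted for the two UCSC tracks can be sketched as follows. This is an illustration only, not UCSC's code: it assumes each alignment is reduced to a hypothetical `(mRNA name, percent base identity)` pair, and it reads "within 0.5% of the best" as 0.5 percentage points. The initial step that discards short alignments is not modelled.

```python
def filter_alignments(alignments, window=0.5, min_identity=25.0):
    """Keep, per mRNA, the alignments within `window` percentage points of
    that mRNA's best base identity and at or above `min_identity` percent."""
    best = {}
    for mrna, identity in alignments:
        best[mrna] = max(identity, best.get(mrna, 0.0))
    return [
        (mrna, identity)
        for mrna, identity in alignments
        if identity >= best[mrna] - window and identity >= min_identity
    ]

# Hypothetical alignments: mrnaA aligns in three places, mrnaB only weakly.
alignments = [("mrnaA", 92.0), ("mrnaA", 91.7), ("mrnaA", 80.0), ("mrnaB", 20.0)]
kept = filter_alignments(alignments)
print(kept)
```

For the GenBank mRNA track, which uses a 1% window, the same sketch would be called with `window=1.0`.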

Try GBrowse.



Tools | AgResearch BLAST

  1. Select the Program
    The NCBI offers a concise program selection table. Read the remainder of the NCBI document for more info on BLAST.
     
  2. Select the Database
    Depending on your choice of Program, all valid databases will be displayed, e.g.:

    A) nucleotide databases
    • Possum GenBank ESTs (as at 6/2006)
    • CS40 Possum EST contigs (as at 7/2006)
    • Human RefSeq mRNAs (as at 4/2005)
    • Mouse RefSeq mRNAs (as at)

    B) protein databases

    • All non-redundant protein sequences (as at)
      GenBank CDS translations+PDB+SwissProt+PIR+PRF
    • Non-redundant Swissprot sequences (as at)
      Swissprot without swissnew
    • Mouse RefSeq proteins (as at 4/2005)
    • Human RefSeq proteins (as at)
       
  3. Select E-value.
    The E-value (expectation value) tells you how many database hits of the same computationally judged quality you can expect to occur simply by chance. In a sense it is a measure of false positives. An E-value of 0 means that no false positives are predicted, so the smaller the E-value, the better. An E-value of e-04 or smaller (e.g. e-10) is considered significant. e-04 translates to 0.000 1, e-10 translates to 0.000 000 000 01.
     
  4. Select max hits
    Limit the number of database hits to be reported. If you search with a very common motif you can choose not to get hundreds of database hits reported but just the top 5.
     
  5. Paste your FASTA sequence into the large text field.
    Sequences in FASTA format start with a one line header beginning with a '>'. Following the header are any number of lines with just the sequence data (no numbers). See the following example:
    >My test-sequence
    CATGCAGACGATGCTAGCTAGCTGATCGATCGATGCTAGCATGCATGCTAGTAC
     
  6. Press the "Run Blast" button to perform a search against all sequences in the database.
    Your sequences will be submitted to the AgResearch BLAST server and returned when completed.
    Your results are stored on the server for 1 week to enable you to retrieve them at a later date.
    You need to take note of the URL to your results (e.g. http://www.possumbase.org.nz/cgi-bin/blast_results.py?filename=wpVuSjHQYi7zExVB2CXjhCKAo) as we cannot identify them due to the randomly assigned file-names.
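
Before pasting a sequence it can help to check that it really is in FASTA format, and after a search to check hits against the e-04 threshold mentioned in step 3. The following is a minimal sketch; the function names `parse_fasta` and `is_significant` are hypothetical and are not part of the AgResearch BLAST server.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is None:
            raise ValueError("sequence data found before a '>' header line")
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

def is_significant(evalue, threshold=1e-4):
    """True if the E-value is at or below the threshold (e-04 in the text)."""
    return float(evalue) <= threshold

query = """>My test-sequence
CATGCAGACGATGCTAGCTAGCTGATCGATCGATGCTAGCATGCATGCTAGTAC
"""
records = parse_fasta(query)
print(records[0][0], len(records[0][1]))
print(is_significant("8.00E-57"), is_significant("0.5"))
```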
     



Tools | Data Downloads

Here you can download the complete set of annotated ESTs (assembled into contigs) as a compressed file cs40annotated.zip.

What does the data look like?

cs40annotated.seq is a FASTA file containing all Trichosurus vulpecula mRNA as at June 2006, assembled into contigs and annotated via BLAST against all non-redundant protein sequences (nr protein). Each description line includes the DNA composition, the library expression of the contig, and the top hit against nr protein with its E-value.

Let's look at an example:

>GI|108827484|GB|EC281509.1|EC281509_CS40 DNA=(1%N 19%C 19%G 32%A 29%T ) Expr= 1=POSSUM_01-POSSUM-C-BRAIN-2KB(1) NR protein=ref|XP_534617.2| PREDICTED: similar to NEDD8-conjugating enzyme [Canis familiaris(eval= 8.00E-57)
CTGGGGAATGGTATGAGGCTCCCCAGGGTAAAAGTTGTAGTAATGTTCAC
ATTGGAGAGCAAATTGAAGAAGGATGAGCACCTTAAAGGATCCCATCTGT
GTGGCCCAGCCTCAAACTCCACATGGAAAGTATCTGTGAGGGATAAACTG
CTTATTAAAGAGGTTGAAGAGCTCGAAGCCAATTTACCTTGTACACGTAA
AGTGAATTTTCCTGATCCAAACAAGCTTCACTACTTTCAACTAAGAGTCA
CTCCAGATGAGGGTCACTACCAGGGTGGAAAATTTTGGTTTAGATAGAAG
TCCCTGATGCTTACAAAATGCTGCCTCCCAAAGTAAAATGTTTGACTAGA
ATCTGGCACCCTAACATCATAGTGATGGGGGAAATCTTTCTGAGCTTACT
AAGAGAACATTCATTGGACAGCACTGGATGGCCTTCCACAAGAACATTAA
GGGATATTGTATAGGAATTAAACTTTTTTTATTTACCAACCTTTTGAATT
CCTAACCCACAGTTTCTTGCATATAGTATCAGTTCCCTAAACACTTCTGA
ATGTATACTACTCATGAATAACACTTTTAACTTGCATTGGTATAGACACT
AGATTAATTAAAGTGTAGAAGTCCTGGATTAAAANNAACTCCAAGCCAAT
GCAGATNNNCATCTTGCTANATTTAGTCACCAAAATAATTGAAATTGTTT
TCTGTTACATCACCTTTGAAATGNCTC

  • name:
    GI|108827484|GB|EC281509.1|EC281509_CS40
    GI number: 108827484
    Accession number: EC281509, version: 1, contig build: CS40
     
  • DNA composition:
    DNA=(1%N 19%C 19%G 32%A 29%T )
     
  • library expression of contig:
    Expr= 1=POSSUM_01-POSSUM-C-BRAIN-2KB(1)
    The first number is the total number of ESTs comprising the contig (contig depth). It is followed by all libraries in which member ESTs are expressed, each with its individual count.
     
  • top hit against nr protein:
    NR protein=ref|XP_534617.2| PREDICTED: similar to NEDD8-conjugating enzyme [Canis familiaris
     
  • E-value:
    (eval= 8.00E-57)
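
The fields above can be pulled out of a description line with a regular expression. The sketch below is inferred from the single example header shown here, so it is an assumption that every line in the file follows the same layout; in particular, how several libraries are separated in the Expr field and how contigs without an nr hit are written are not documented. The names `HEADER_RE` and `parse_header` are hypothetical.

```python
import re

# Layout inferred from the one example header in this document (an assumption).
HEADER_RE = re.compile(
    r"^>(?P<name>\S+)\s+"
    r"DNA=\((?P<dna>[^)]*)\)\s+"
    r"Expr=\s*(?P<depth>\d+)=(?P<libs>.*?)\s+"
    r"NR protein=(?P<hit>.*)\(eval=\s*(?P<evalue>[^)]+)\)\s*$"
)

def parse_header(line):
    """Split a cs40annotated.seq description line into its fields."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError("description line does not match the expected layout")
    # Percentages per base, e.g. "19%C" becomes {"C": 19}.
    composition = {
        base: int(pct) for pct, base in re.findall(r"(\d+)%([NACGT])", match["dna"])
    }
    # Library names with their EST counts, e.g. "LIB(1)" becomes {"LIB": 1}.
    libraries = {
        name: int(count)
        for name, count in re.findall(r"([^\s,;()]+)\((\d+)\)", match["libs"])
    }
    return {
        "name": match["name"],
        "composition": composition,
        "depth": int(match["depth"]),
        "libraries": libraries,
        "top_hit": match["hit"].strip(),
        "evalue": float(match["evalue"]),
    }

example = (
    ">GI|108827484|GB|EC281509.1|EC281509_CS40 "
    "DNA=(1%N 19%C 19%G 32%A 29%T ) "
    "Expr= 1=POSSUM_01-POSSUM-C-BRAIN-2KB(1) "
    "NR protein=ref|XP_534617.2| PREDICTED: similar to "
    "NEDD8-conjugating enzyme [Canis familiaris(eval= 8.00E-57)"
)
info = parse_header(example)
for field, value in info.items():
    print(field, "=", value)
```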

Naming scheme

cs40annotated.seq contains 67,762 sequences (note: these sequences are contigs and singletons) with the following naming scheme:

  • Singleton (sequence does not belong to any group as determined through preclustering)
    e.g. >GI|108827484|GB|EC281509.1|EC281509_CS40 (given is GI number: 108827484, accession number: EC281509, version: 1, contig build: CS40)
     
  • Singleton (sequence does not belong to any group as determined through clustering)
    e.g. >050706CS40000705FFFFF (this AgResearch internal numbering scheme means that this sequence was preclustered into the precluster 050706CS40000705; singletons are then counted using the hexadecimal system, e.g. FFFFF)
     
  • Contig (a number of sequences could be clustered to one group)
    e.g. >050706CS4000070500001 (this AgResearch internal numbering scheme means that various sequences were clustered into cluster 00001 of precluster 050706CS40000705)
