Last updated: 2016.3.7.

What is TSA? – Transcriptome Shotgun Assembly

Since 2008, INSDC (DDBJ/EMBL-Bank/GenBank) has accepted "Transcriptome Shotgun Assembly (TSA)" sequence data, that is, assembled cDNA sequences, and classifies them into the TSA division.

With the spread of new sequencing technologies, INSDC has received many requests to accept assembled EST sequences. Such sequences have become more useful than they used to be, although they may be incorrectly assembled and may not exist in nature. DDBJ/EMBL-Bank/GenBank therefore decided to collect assembled EST sequences and classify them into the TSA division.

Before submitting TSA data, you must register with the BioProject Database and the BioSample Database. The original sequence data of the primary transcripts (primary entries) underlying the TSA submission must also be deposited in the EST division of DDBJ/EMBL-Bank/GenBank, DDBJ Trace Archive, or DDBJ Read Archive.
See also the steps of a transcriptome project, the categories of sequence data, and their correspondences.

If a primary entry belongs to a submitter other than the TSA submitter, the TSA entry is classified into the TPA category.

You can submit TSA data to DDBJ through the Mass Submission System (MSS).

Definition of primary entry for TSA

Primary entries used to build a TSA sequence are RNA sequences that have been experimentally determined by their submitters and are publicly available on INSDC, Trace Archive, or Sequence Read Archive.

Primary entries need not be public at the time of the TSA submission. However, they must be made public by the time the corresponding TSA entry is released to the public.

Notes on the TSA submission

The sequence alignment rules between TSA and primary entries

  • A region of a TSA entry may be assembled from a single EST or read, so that its coverage is only 1x.
  • When the assembled sequence includes a gap region supported by some evidence (paired-end sequences, etc.), you can represent the gap as a run of consecutive n's in the sequence. The gap region must be annotated with an assembly_gap feature.
  • Outside the locations described by assembly_gap features, ambiguity in the sequence is limited as follows:

   [1] fewer than 5% of the bases may be 'n', and
   [2] a TSA entry may contain no more than 5 n's in a row.
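
The two limits on 'n' can be checked before submission with a short script. The following is only a minimal sketch, not an official DDBJ validator: the function name check_ambiguity and its gaps argument are hypothetical, gap locations are assumed to be 1-based inclusive (start, end) pairs taken from the assembly_gap features, and the 5% limit is assumed to be measured against the bases outside those gaps.

```python
import re


def check_ambiguity(seq, gaps=()):
    """Check the two 'n' limits outside assembly_gap regions.

    seq  : nucleotide sequence as a string
    gaps : iterable of (start, end) pairs, 1-based inclusive (assumed convention)
    """
    masked = list(seq.lower())
    # Mask assembly_gap regions so that their n's are not counted.
    for start, end in gaps:
        for i in range(start - 1, min(end, len(masked))):
            masked[i] = "-"
    masked = "".join(masked)

    # Assumption: the percentage is taken over bases outside the gaps.
    outside = masked.replace("-", "")
    n_fraction = outside.count("n") / len(outside) if outside else 0.0

    # Longest run of consecutive n's outside the gaps.
    longest_run = max((len(run) for run in re.findall(r"n+", masked)), default=0)

    return {
        "n_fraction": n_fraction,
        "longest_n_run": longest_run,
        "ok": n_fraction < 0.05 and longest_run <= 5,
    }
```

For example, a run of six n's fails the check unless its location is passed in gaps, in which case the run is treated as a declared assembly_gap and ignored.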

Aspects of TSA on DDBJ flat file

LOCUS line provides the division name, "TSA".
"TSA:" is shown at the beginning of DEFINITION line.
"TSA" and "Transcriptome Shotgun Assembly" are indicated in KEYWORDS line.
PRIMARY line provides the base spans of primary entry sequences that contribute to regions of the TSA sequence.
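
The first three markers above can be looked for in a flat file with a few lines of code. This is a rough sketch under simplifying assumptions: the function name check_tsa_markers is hypothetical, the division name is assumed to be the token just before the date on the LOCUS line, and DEFINITION and KEYWORDS are each assumed to fit on one line.

```python
def check_tsa_markers(flatfile_text):
    """Report which TSA markers are present in a flat file (simplified)."""
    found = {"locus_division": False, "definition_prefix": False, "keywords": False}
    for line in flatfile_text.splitlines():
        if line.startswith("LOCUS"):
            tokens = line.split()
            # Assumption: the division name precedes the date at the line end.
            found["locus_division"] = len(tokens) >= 2 and tokens[-2] == "TSA"
        elif line.startswith("DEFINITION"):
            value = line[len("DEFINITION"):].strip()
            found["definition_prefix"] = value.startswith("TSA:")
        elif line.startswith("KEYWORDS"):
            value = line[len("KEYWORDS"):].strip().rstrip(".")
            keywords = {k.strip() for k in value.split(";")}
            found["keywords"] = {"TSA", "Transcriptome Shotgun Assembly"} <= keywords
    return found
```

The function returns a dictionary of three booleans, so a caller can report exactly which marker is missing instead of getting a single pass or fail result.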

TSA accession number

Since October 2015, each TSA sequence submitted to DDBJ has been assigned an accession number that consists of 4 letters + 8 digits (9 or 10 digits if necessary).

Example: ZZZZ01000001

4 letters -- Prefix to distinguish each project
2 digits -- Version number of the data set
6 digits -- ID of each individual sequence (this may be 7 or 8 digits, depending on the number of entries.)
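
The accession layout described above can be split into its three parts with a regular expression. The sketch below is illustrative only; the names TSA_ACCESSION and parse_tsa_accession are hypothetical, and the pattern covers just the 4-letter, 2-digit, 6-to-8-digit layout described here, without the ".1" style suffix shown on the VERSION line.

```python
import re

# 4 letters (project prefix) + 2 digits (data set version) + 6 to 8 digits (sequence ID)
TSA_ACCESSION = re.compile(r"^([A-Z]{4})(\d{2})(\d{6,8})$")


def parse_tsa_accession(accession):
    """Split a TSA accession number into its parts, or return None if it does not match."""
    match = TSA_ACCESSION.match(accession)
    if match is None:
        return None
    prefix, set_version, serial = match.groups()
    return {
        "prefix": prefix,                 # distinguishes each project
        "set_version": int(set_version),  # version number of the data set
        "serial": int(serial),            # ID of the individual sequence
    }
```

With the example above, parse_tsa_accession("ZZZZ01000001") gives the prefix "ZZZZ", data set version 1, and sequence ID 1.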

Sample of TSA flat file

When citing DDBJ Read Archive

LOCUS       IZZY01000001             800 bp   mRNA     linear   TSA 15-OCT-2015
DEFINITION  TSA: Mus musculus RNA, contig: 1_1.
ACCESSION   IZZY01000001
VERSION     IZZY01000001.1
DBLINK      BioProject:PRJDA43210
            Sequence Read Archive: DRR900001
            BioSample: SAMD98765431
KEYWORDS    TSA; Transcriptome Shotgun Assembly.
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus; Mus.
REFERENCE   1  (bases 1 to 800)
  AUTHORS   Mishima,H. and Shizuoka,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-SEP-2008) to the DDBJ/EMBL/GenBank databases.
            Contact:Hanako Mishima
            National Institute of Genetics, DNA Data Bank of Japan; Yata 1111,
            Mishima, Shizuoka 411-8540, Japan
REFERENCE   2  
  AUTHORS   Mishima,H., Shizuoka,T. and Fuji,I.
  TITLE     Transcriptome shotgun assembly of mouse
  JOURNAL   TSA Biol 12, 61-70 (2015)
COMMENT     ##Assembly-Data-START##
            Assembly Method       :: Velvet v.1.1.05
            Sequencing Technology :: Illumina GAIIx
            ##Assembly-Data-END##
FEATURES             Location/Qualifiers
     source          1..800
                     /db_xref="taxon:10090"
                     /mol_type="transcribed RNA"
                     /note="contig: 1_1"
                     /organism="Mus musculus"
BASE COUNT          199 a          203 c          198 g          200 t
ORIGIN      
        1 attaatataa gctaaatatg tttttcaata tatattgata atagaatatc aacaatttgg
        :
        -- The rest of nucleotide sequence is omitted --
        :
// 

When citing EST

LOCUS       IZZZ01000001             800 bp   mRNA     linear   TSA 15-OCT-2008
DEFINITION  TSA: Homo sapiens GAPD mRNA for glyceraldehyde-3-phosphate
            dehydrogenase, complete cds.
ACCESSION   IZZZ01000001
VERSION     IZZZ01000001.1
DBLINK      BioProject:PRJDA43211
            BioSample: SAMD98765433
KEYWORDS    TSA; Transcriptome Shotgun Assembly.
SOURCE      Homo sapiens (human)
  ORGANISM  Homo sapiens
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
            Catarrhini; Hominidae; Homo.
REFERENCE   1  (bases 1 to 800)
  AUTHORS   Mishima,H. and Shizuoka,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-SEP-2008) to the DDBJ/EMBL/GenBank databases.
            Contact:Hanako Mishima
            National Institute of Genetics, DNA Data Bank of Japan; Yata 1111,
            Mishima, Shizuoka 411-8540, Japan
REFERENCE   2  
  AUTHORS   Mishima,H., Shizuoka,T. and Fuji,I.
  TITLE     EST assembly of human
  JOURNAL   TSA Biol 12, 61-70 (2008)
COMMENT  
PRIMARY     TSA_SPAN            PRIMARY_IDENTIFIER PRIMARY_SPAN        COMP
            1-599               ZZ000004.1         2-598
            1-669               ZZ000005.1         11-679
            2-596               ZZ000006.1         1-595
            2-575               ZZ000007.1         1-574
            5-676               ZZ000008.1         1-672
            6-725               ZZ000009.1         1-720
            59-369              ZZ000010.1         13-322
            605-800             ZZ000011.1         1-196               c
FEATURES             Location/Qualifiers
     source          1..800
                     /db_xref="taxon:9606"
                     /mol_type="transcribed RNA"
                     /organism="Homo sapiens"
     CDS             73..669
                     /codon_start=1
                     /gene="GAPD"
                     /product="glyceraldehyde-3-phosphate dehydrogenase"
                     /protein_id="LZZ00001.1"
                     /transl_table=1
                     /translation="MWYQSLVIIEKLNLEANIGKLINTKDNINIRCRLSHTEEHSWHS
                     -- The rest of amino acid sequence is omitted -- "
BASE COUNT          199 a          203 c          198 g          200 t
ORIGIN      
        1 attaatataa gctaaatatg tttttcaata tatattgata atagaatatc aacaatttgg
        :
        -- The rest of nucleotide sequence is omitted --
        :
// 
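
The PRIMARY block in the EST sample above is a small table, and it can be read into structured rows as sketched here. This is an assumption-laden illustration rather than a reference parser: the function name parse_primary is hypothetical, the columns are assumed to be separated by whitespace, and a trailing "c" is taken to be the COMP column flag.

```python
def parse_primary(flatfile_text):
    """Extract the rows of the PRIMARY block from a flat file (simplified)."""
    rows = []
    in_primary = False
    for line in flatfile_text.splitlines():
        if line.startswith("PRIMARY"):
            in_primary = True  # this line holds the column names; skip it
            continue
        if not in_primary:
            continue
        if not line.startswith(" "):
            break  # the next top-level keyword ends the PRIMARY block
        fields = line.split()
        if len(fields) < 3:
            continue  # ignore blank or incomplete lines
        tsa_start, tsa_end = (int(x) for x in fields[0].split("-"))
        prim_start, prim_end = (int(x) for x in fields[2].split("-"))
        rows.append({
            "tsa_span": (tsa_start, tsa_end),
            "primary_identifier": fields[1],
            "primary_span": (prim_start, prim_end),
            # Assumption: "c" in the COMP column marks a complementary strand.
            "complement": len(fields) > 3 and fields[3] == "c",
        })
    return rows
```

Each returned row pairs a span on the TSA sequence with the identifier and span of the primary entry that supports it.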