Last updated:2014.9.8.

About Mass sequence for Genome Annotation (MGA) entry

[Caution] DDBJ has stopped accepting new submissions of MGA data.

To accept large-scale sequence data that provide useful information for annotating genome assemblies and sequences, the International Nucleotide Sequence Database Collaboration (INSDC; DDBJ/EMBL-Bank/GenBank) has created a new category named Mass sequence for Genome Annotation (MGA). MGA data are defined as follows.

The definition of MGA data
MGA is defined as sequences that are produced in large quantity for the purpose of genome annotation.
Data acceptable to the MGA category of INSDC are those that include biological features useful for genome annotation (e.g. the start or end terminus of a transcript).
"Large quantity" here means that a single resource contains 10,000 or more sequences.

Composition of accession number of MGA

The accession number assigned to each MGA entry consists of 12 characters (five letters followed by seven digits). Its structure is as follows:

Example: ZZZZZ0000001

5 letters -- project identifier.
    first two letters -- identifier of the project (*1).
    third to fifth letters -- identifier of the resource (*2) within the project.
7 digits -- serial number of each sequence entry within a resource.
    *1 Information about each project ID is available on the project_index page.
    *2 "Resource" here means a unit of identical origin, such as a tissue or cells, from which the sequences are obtained.
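
The composition described above can be illustrated with a short Python sketch that splits an accession number into its three parts. The function and class names are hypothetical and not part of any DDBJ tool, and the pattern assumes upper-case ASCII letters, as in the example.

```python
import re
from typing import NamedTuple

# Five letters followed by seven digits, as described in the text.
# Assumption: letters are upper-case ASCII, as in the example ZZZZZ0000001.
_MGA_ACCESSION = re.compile(r"([A-Z]{2})([A-Z]{3})([0-9]{7})")


class MgaAccession(NamedTuple):
    project: str   # first two letters
    resource: str  # third to fifth letters
    serial: int    # seven-digit entry number within the resource


def parse_mga_accession(text: str) -> MgaAccession:
    """Split an MGA accession number into project, resource and serial number."""
    match = _MGA_ACCESSION.fullmatch(text)
    if match is None:
        raise ValueError(f"not an MGA accession number: {text!r}")
    project, resource, serial = match.groups()
    return MgaAccession(project, resource, int(serial))


print(parse_mga_accession("ZZZZZ0000001"))
```

A string with the wrong number of letters or digits raises ValueError instead of being split silently.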

Publication format of MGA data

MGA data are published with one resource as the unit. For each resource, the data consist of a "Master record" and a "Variable record".

Master record
Information common to the whole resource, such as submitters, keywords (MGA and others), references, and comments. A Master record is provided for every resource unit.
Variable record
All nucleotide sequences of the resource unit described in the Master record, together with the items specific to each sequence, such as map location, sequence count, and db_xrefs.

Sample of Master record

LOCUS       ZZZZZ0000000                       mRNA    linear   ROD 24-JAN-2005
DEFINITION  Mus musculus 1 month adult cerebellum short transcripts tag.
ACCESSION   ZZZZZ0000000
VERSION     ZZZZZ0000000.1
KEYWORDS    MGA; CAGE (Cap Analysis Gene Expression).
SOURCE      Mus musculus (house mouse)
  ORGANISM  Mus musculus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Mus.
REFERENCE   1
  AUTHORS   Mishima,H. and Shizuoka,T.
  TITLE     Direct Submission
  JOURNAL   Submitted (30-NOV-2009) to the DDBJ/EMBL/GenBank databases. 
            Contact:Hanako Mishima
            National Institute of Genetics, DNA Data Bank of Japan; Yata 1111,
            Mishima, Shizuoka 411-8540, Japan
REFERENCE   2
  AUTHORS   Mishima,H., Shizuoka,T. and Fuji,I
  TITLE     The gene expression analysis of short transcripts tags
  JOURNAL   Unpublished (2010)
COMMENT     The CAGE (cap analysis gene expression) is based on preparation
            and sequencing of concatamers of DNA tags deriving from the
            initial 20/21 nucleotides from 5' end mRNAs.
            Full-length cDNAs were at first selected with the Cap-Trapper
            method. Then, a specific linker (Linker1, some linker contain 5 bp
            sequences that have 15 variations for each rna sample) containing
            the ClassIIs restriction enzyme site MmeI was then ligated to the
            single-strand cDNA and then the second strand of cDNA synthesized.
            (skip the rest in COMMENT field)
FEATURES             Location/Qualifiers
     source          
                     /db_xref="taxon:10090"
                     /dev_stage="1 month adult"
                     /mol_type="mRNA"
                     /organism="Mus musculus"
                     /strain="C57BL/6J"
                     /tissue_type="cerebellum"
MGA         ZZZZZ0000001-ZZZZZ0340780
            total number of count : 856609
            Header Format
            >[ACC#]|[submitter's identifier]|[number of sequence
            count]|[map]|[free text]|[db_xref1(,db_xref2,...)]|
// 
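
The MGA line of the sample appears to give the first and last accession numbers of the entries in the Variable record. As a minimal sketch under that assumption, the number of entries covered by such a range can be computed as follows. The function name is hypothetical, and both ends of the range are assumed to be full 12-character accession numbers sharing the same five-letter prefix.

```python
import re

# Assumption: the range is written as <first accession>-<last accession>,
# each made of five upper-case letters followed by seven digits.
_MGA_RANGE = re.compile(r"([A-Z]{5})([0-9]{7})-([A-Z]{5})([0-9]{7})")


def entries_in_range(mga_range: str) -> int:
    """Return how many accession numbers an MGA range covers (inclusive)."""
    match = _MGA_RANGE.fullmatch(mga_range.strip())
    if match is None:
        raise ValueError(f"not an MGA accession range: {mga_range!r}")
    first_prefix, first_no, last_prefix, last_no = match.groups()
    if first_prefix != last_prefix:
        raise ValueError("both ends must belong to the same resource")
    count = int(last_no) - int(first_no) + 1
    if count < 1:
        raise ValueError("the last accession precedes the first")
    return count


print(entries_in_range("ZZZZZ0000001-ZZZZZ0340780"))
```

Note that this figure counts entries, which differs from the "total number of count" value shown in the sample.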

Variable record

See also the detailed explanation of the Variable record.

>ZZZZZ0000001|ABC1004AA60F1902|10|9B|lipidosis-related protein Lipidosin|GI:2385656|
gactgtcttcggtgaatgca
>ZZZZZ0000002|ABC1003AE78G1607|5||||
gcggaagtcggaccggtcgca
>ZZZZZ0000003|ABC1003AE72P1806|6||||
gggagaccgatccgggatct
>ZZZZZ0000004|ABC1003AE30G1801|91||||
gagtcgggtcggtggggctgt
>ZZZZZ0000005|ABC1003AA45J1501|55||||
ggggaatctgcagcctgggc
>ZZZZZ0000006|ABC1003AE67B0902|152||||
gagccgtccccgacgccgcca
(skip the rest)
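
The Header Format given in the Master record can be read with a small parser. The following Python sketch is illustrative only: the class and function names are hypothetical, each header is assumed to occupy a single line, and the free-text field is assumed not to contain the "|" character.

```python
from typing import Iterator, List, NamedTuple


class VariableRecord(NamedTuple):
    accession: str
    submitter_id: str
    count: int            # number of sequence count
    map_location: str     # empty string when absent
    free_text: str        # empty string when absent
    db_xrefs: List[str]   # empty list when absent
    sequence: str


def _build(header: str, seq_lines: List[str]) -> VariableRecord:
    # The header ends with "|", so splitting yields a trailing empty item
    # after the six documented fields.
    fields = header.split("|")
    if len(fields) < 6:
        raise ValueError(f"unexpected header: {header!r}")
    accession, submitter_id, count, map_location, free_text, xrefs = fields[:6]
    return VariableRecord(
        accession,
        submitter_id,
        int(count),
        map_location,
        free_text.strip(),
        [x.strip() for x in xrefs.split(",") if x.strip()],
        "".join(seq_lines),
    )


def parse_variable_records(text: str) -> Iterator[VariableRecord]:
    """Yield one VariableRecord per '>' header and its sequence lines."""
    header = None
    seq_lines: List[str] = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _build(header, seq_lines)
            header, seq_lines = line[1:], []
        elif header is not None:
            seq_lines.append(line)
    if header is not None:
        yield _build(header, seq_lines)


SAMPLE = """\
>ZZZZZ0000001|ABC1004AA60F1902|10|9B|lipidosis-related protein Lipidosin|GI:2385656|
gactgtcttcggtgaatgca
>ZZZZZ0000002|ABC1003AE78G1607|5||||
gcggaagtcggaccggtcgca
"""

records = list(parse_variable_records(SAMPLE))
for record in records:
    print(record.accession, record.count, record.db_xrefs, record.sequence)
```

Empty fields such as the map location of the second entry are kept as empty strings, so every record has the same shape.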
