No.39    Sep 4, 2008
As we welcome the coming autumn, let us enjoy the rest of this summer with energy.
  The Report for the 21st International Collaborators Meeting
To operate and develop the international nucleotide sequence database collaboratively, the three data banks, DDBJ, EMBL-Bank/EBI and GenBank/NCBI, hold an international collaborators meeting every May. In 2008, the meeting was held at DDBJ in Japan on 20-22 May.
DDBJ, EMBL-Bank and GenBank each reported their activities over the past year and discussed practical matters for maintaining and developing the nucleotide sequence database, as follows.
Items Discussed and To Be Studied

- A new division, TSA (Transcriptome Shotgun Assembly)
From June 2008, INSDC introduces a new division, TSA, for assembled mRNA sequences. Note that the original sequence data of the primary transcripts underlying a TSA submission must be registered in the EST division of INSDC, the Trace Archive, or the Short Read Archive. More information about how to submit TSA entries will be provided via the DDBJ website.
- Sequence data from next generation sequencing
In principle, raw reads from next generation sequencing should be registered in the Short Read Archive. Following the workshop on MINSEQE (Minimal Information about a High Throughput Sequencing Experiment), data from next generation sequencing that was not initially intended for INSD submission might lead to discoveries of variation or to re-annotation that could be submitted to INSDC as TPA or TSA entries. The number of TPA entries is not expected to grow rapidly.
- Representative submissions of identical sequences for variation studies
INSDC basically accepts all sequence data, regardless of source and sequence identity. However, in order to take advantage of normalisation for variation studies, a single submission representing multiple identical sequences is also acceptable, with the frequency and total sample number described by the /frequency qualifier of the source feature.
- Removal of the flag for electronic publication, "(er)", in REFERENCE/JOURNAL lines
The electronic publication token in REFERENCE/JOURNAL lines, "(er)", will be removed. Old records will be retrofitted to conventional article citations where possible.
Changes to the Feature Table Document: Features and Qualifiers

The following changes will apply from October 2008 with the revision of the Feature Table Definition, unless otherwise specified.
- Modification of controlled vocabulary for /mol_type qualifier
The /mol_type qualifier is used to indicate the in vivo, synthetic or hypothetical molecule type in the source feature. The vocabulary list for the /mol_type qualifier will be modified as follows:
  • Addition: "transcribed RNA"
  • Removal: "snoRNA", "snRNA", "scRNA", "pre-RNA" and "tmRNA"
- The value, "chromatophore", will be legal for /organelle qualifier
- Modification of controlled vocabulary for /ncRNA_class qualifier
The ncRNA feature utilizes a /ncRNA_class qualifier with a controlled vocabulary to indicate what type of non-protein-coding feature is being represented. The list for controlled vocabulary of /ncRNA_class qualifier will be modified as follows;
  • Addition: "6S/SsrS", "SraD RNA", "DsrA RNA", "SroC"
  • Change: "hammerhead ribozyme" --> "ribozyme"
    (See also Controlled vocabulary for ncRNA classes)
- A new qualifier, /satellite, will be legal for repeat_region feature.
Format: "<satellite_type>[:<class>][ <identifier>]", where satellite_type is one of the following:
   "satellite", "microsatellite", "minisatellite"
         /satellite="satellite: S1a"
         /satellite="satellite: gamma III"
         /satellite="microsatellite: DC130"
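The format above can be checked mechanically. The following is a minimal sketch, not an official INSDC validator: the function name `parse_satellite` is hypothetical, and because the examples above do not make the boundary between class and identifier unambiguous, everything after the satellite type is kept as a single detail string.

```python
import re

# Controlled vocabulary for <satellite_type>, taken from the text above.
SATELLITE_TYPES = ("satellite", "microsatellite", "minisatellite")

# Satellite type, optionally followed by ":" or whitespace and free text.
_SATELLITE = re.compile(r"^(?P<type>[a-z]+)(?:[:\s]\s*(?P<detail>.+))?$")


def parse_satellite(value):
    """Split a /satellite value into (satellite_type, detail).

    detail holds the optional class and/or identifier as one string,
    or None when only the satellite type is given.
    """
    match = _SATELLITE.match(value.strip())
    if match is None or match.group("type") not in SATELLITE_TYPES:
        raise ValueError("not a legal /satellite value: %r" % value)
    detail = match.group("detail")
    if detail is not None:
        detail = detail.strip()
    return match.group("type"), detail


for example in ("satellite: S1a", "satellite: gamma III", "microsatellite: DC130"):
    print(parse_satellite(example))
```

A value whose first token is outside the controlled vocabulary raises `ValueError`.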
- Improvement of the format of the /frequency qualifier
In order to represent a sample size, the following descriptions will also be legal as value formats for the /frequency qualifier, in addition to decimal fractions:
"[m] in [n]" or "[m] / [n]".
         /frequency="1 in 12"
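As an illustration of how the three value formats could be read by a program, here is a small sketch. The function name `parse_frequency` and the range checks are assumptions made for this example, not part of the Feature Table Definition.

```python
import re
from fractions import Fraction

# "[m] in [n]" or "[m] / [n]", with optional spaces around the separator.
_COUNT_FORM = re.compile(r"^(\d+)\s*(?:in|/)\s*(\d+)$")


def parse_frequency(value):
    """Return (frequency, sample_size) for a /frequency value.

    sample_size is None when the value is a plain decimal fraction,
    because that form carries no information about the sample size.
    """
    text = value.strip()
    match = _COUNT_FORM.match(text)
    if match:
        observed, total = int(match.group(1)), int(match.group(2))
        if total == 0 or observed > total:
            raise ValueError("inconsistent counts in %r" % value)
        return Fraction(observed, total), total
    frequency = Fraction(text)  # raises ValueError for non-numeric text
    if not 0 <= frequency <= 1:
        raise ValueError("frequency out of range in %r" % value)
    return frequency, None


print(parse_frequency("1 in 12"))
print(parse_frequency("0.85"))
```

Using `Fraction` rather than `float` keeps "1 in 12" exact, so that values in either form can be compared without rounding error.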
- /specific_host qualifier will become /host qualifier.
Both /host and /lab_host should be described with a binominal scientific name, if possible.
         /lab_host="Gallus gallus"
         /lab_host="Gallus gallus embryo"
         /lab_host="Escherichia coli strain DH5 alpha"
         /lab_host="Homo sapiens HeLa cells"
- Removal of /virion qualifier
Note: The /proviral qualifier will remain in use.
- Removal of /cons_splice qualifier
- Improvement of validation for both /rearranged and /germline qualifiers
Basically, the /rearranged and /germline qualifiers should be used to indicate whether or not the sequence has undergone somatic rearrangement as part of an adaptive immune response. However, since they have been used incorrectly in many entries, we will correct those entries.
- A new qualifier, /gene_synonym, will be legal for features that can use /gene qualifier.
We also expect further minor changes in the usage of /gene qualifier. Details of changes will be made available shortly.
- Improvement of the format of /inference qualifier
In order to describe inferential support more effectively, the format of the /inference qualifier will be improved. Details of the changes will be made available shortly.
- A new qualifier, /mating_type, will be legal for source feature.
The /sex qualifier will also remain in use. Guidelines of descriptions for both /mating_type and /sex will be made available shortly.
  DDBJ starts accepting Trace Archive data
Trace Archive is defined by NCBI as a permanent repository of DNA sequence chromatograms (traces), base calls, and quality estimates for single-pass reads from various large-scale sequencing projects.
DDBJ completed its first registration of Trace Archive data in July 2008, supported by the national project for integrating life science databases.
1. Trace data of Oryzias latipes WGS sequences determined by National Institute of Genetics (NIG):
TI numbers are as follows:
  • 2095022956-2095389675
  • 2095396176-2096435759
  • 2096858496-2096933759
* Relevant announcement: Release of WGS 134,429 entries and CON 6,928 entries for Medaka strain Hd-rR, and WGS 346,141 entries and CON 38,235 entries for strain HNI.
2. Trace data from human gut metagenome project by University of Tokyo, the Center for Omics and Bioinformatics (UTCOB):
TI numbers are as follows:
  • 2097946941-2099007079
* Relevant announcement: Release of new human gut metagenome WGS data, 353,805 entries.
(1) Assembly of trace data into WGS entries
The sizes of these trace data are as follows:
  • (a) about 50 GB (from NIG; gzipped tar files including .qual, peak, .seq and .scf)
  • (b) about 40 GB (from UTCOB; gzipped tar files including .scf)
Both sets of trace data, (a) and (b), were assembled into WGS entries:
Trace data (a) was first assembled into part of the BAAF WGS entries (about 309 MB; gzipped tar file in flat file format) and then further assembled into the DG000001-DG000024 chromosome CON entries. The Medaka genome sequencing project web site provides more details.
Trace data (b) was assembled into the BAAU-BABG WGS entries (about 272 MB; gzipped tar file in flat file format).
(2) Transfer of the files from DDBJ to NCBI
We first uploaded test data to the NCBI Trace Archive by conventional FTP, but the transfer took an intolerably long time. After investigating several alternative file transfer protocols and applications, we succeeded by transferring multiple files in parallel over conventional FTP. The transfer was completed in several hours, whereas a sequential FTP transfer had been expected to take two whole days.
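The idea of transferring multiple files in parallel over ordinary FTP can be sketched as follows. This is only an illustration using the Python standard library, not the tool DDBJ actually used; the host, account, directory and worker count are placeholders, and `make_connector`, `upload_one` and `upload_parallel` are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP


def make_connector(host, user, password, remote_dir):
    """Return a function that opens a fresh, logged-in FTP session."""
    def connect():
        ftp = FTP(host)
        ftp.login(user, password)
        ftp.cwd(remote_dir)
        return ftp
    return connect


def upload_one(path, connect):
    """Upload one file over its own FTP connection."""
    ftp = connect()
    try:
        with open(path, "rb") as handle:
            ftp.storbinary("STOR " + os.path.basename(path), handle)
    finally:
        ftp.quit()
    return path


def upload_parallel(paths, connect, workers=4):
    """Upload several files at once, one FTP session per worker thread."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: upload_one(p, connect), paths))


# Example call (placeholder values, assumed for illustration only):
# connect = make_connector("ftp.example.org", "user", "secret", "/incoming")
# upload_parallel(["a.tar.gz", "b.tar.gz"], connect, workers=8)
```

Each worker opens its own session because a single FTP control connection handles one transfer at a time. The achievable speed-up depends on the network and on how many simultaneous sessions the receiving server allows.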
These data are now retrievable at the NCBI Trace Archive. DDBJ has started preparing its own web page for retrieving trace data.
  DDBJ starts providing Trace Archive data by FTP
DDBJ has started accepting Trace Archive data. Trace Archive data submitted to DDBJ is now available by FTP from the FTP/Web API page.
The following data are now available.
- Trace data of Oryzias latipes WGS sequences determined by National Institute of Genetics (NIG);
  TI numbers are as follows:
- Trace data from human gut metagenome project by University of Tokyo, the Center for Omics and Bioinformatics (UTCOB);
  TI numbers are as follows:
 Termination of a part of DDBJ services
DDBJ will discontinue the following services provided by WWW and by E-mail according to the schedule below. We apologize for the inconvenience. Thank you for your understanding and cooperation.
Service                                Termination   Substitute DDBJ service
Search & Analysis by E-mail server
  getentry                             2008.9.12     getentry (WWW *1 / WABI *2)
  get-version                          2008.9.12     getentry (WWW *1 / WABI *2)
  FASTA                                2008.9.12     FASTA (WWW *1 / WABI *2)
  BLAST                                2008.9.12     BLAST (WWW *1 / WABI *2)
  SSEARCH                              2008.9.12     SSEARCH (WWW *1)
  ClustalW                             2008.9.12     ClustalW (WWW *1 / WABI *2)
  HMMPFAM                              2008.9.12     HMMPFAM (WWW *1)
Keyword Search
  SRS                                  2008.12.26    ARSA ==> Termination of SRS
Sequence Pattern Match
  SQmatch                              2008.11.14
Protein Compatibility Analysis, etc.
  PDB Retriever                        2008.11.14
  Libra                                2008.11.14    GTOP
  Lib score                            2008.11.14
*: Receiving results by E-mail after sending queries via WWW remains available.
Termination of SRS (Sequence Retrieval System) services
DDBJ will terminate SRS services (by WWW/WABI) as of December 26, 2008. We greatly appreciate your use of SRS since it was first launched on the WWW in 1999. We apologize for the trouble of switching your keyword search system from SRS to ARSA, but we hope ARSA will be useful for all DDBJ users. Please refer to the descriptions of ARSA and SRS below.
(1) ARSA and SRS Search functions
ARSA: The high-speed XML database search engine "Interstage Shunsaku" is applied to nucleotide sequence database retrieval.
SRS: A keyword search system developed by EBI (European Bioinformatics Institute), adapted and rebuilt for DDBJ database retrieval.
Key Features
  • Rapid response: even for complex search keys, results are returned within 5 seconds
  • In DDBJ database retrieval, any combination of the features/qualifiers used in the DDBJ flat file format can be specified as search keys
  • A wide variety of Web APIs such as SOAP/REST is available
  • Multistage search conditions can be selected, from Quick Search (simple keyword search) to Advanced Search (complex search queries)
  • A wide variety of Web APIs such as SOAP/REST is available
Function                                        ARSA   SRS
Cross search of multiple DBs                    Yes    Yes
Specifying the search field                     Yes    Yes
Use of DDBJ flat file Feature/Qualifier         Yes    No
Keyword search using the frequently             Yes    Yes
Multiple keywords connection using and/or/not   Yes*   Yes
Relation to analysis tools                      Yes    Yes
Speed                                           fast   slow
*: Use of "not" in Quick Search is being enhanced (as of Aug. 2008)
(2) ARSA Improvement History

2004.12  Trial operation started at the DDBJ HP ("DDBJ" and "DDBJNEW" only)
2005.12  Upgraded; full-time operation started
2007.02  21 DBs other than DDBJ were added (No. of DBs: 23)
2007.07  Extensive enhancement and transfer to official operation
<Main new functions>
  • The search results can be downloaded across databases
  • The items displayed in the result view screen can be changed
  • API (Application Program Interface) for Java/Perl was enhanced
2007.10  DDBJ HP design renewal (ARSA is available at the search window of the DDBJ HP)
2007.10  Sequence search program linked to TX Search was changed from SRS to ARSA
2007.11  Further upgrade following the enhancement conducted in July
<Main new functions>
  • Addition of KEGG Pathway Database
  • Specification of detailed search conditions in all databases, as well as DDBJ
  • Cross search by common search queries
2008.05  Removal of redundant DBs (4 DBs related to PFAM were removed)
(3) Growth of Number of ARSA and SRS Users
According to the usage graph for ARSA and SRS from January 2007 to June 2008, the number of unique users of ARSA has increased continuously, whereas that of SRS has been flat or declining. In page view counts, ARSA exceeded SRS in March 2008.
Unique User: an individual who has visited a Web site or received specific content during a specified period of time, such as a day or a month. Multiple visits by the same user are counted as one.
Page View: a request to load a single page of an Internet site. This number is widely used as an index of access to a site.
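The difference between the two measures can be shown with a short sketch. The log format used here, one (user, page) pair per request, and the function name `summarize_access` are assumptions made for illustration; they do not describe DDBJ's actual access logs.

```python
def summarize_access(requests):
    """Return (unique_users, page_views) for one reporting period.

    requests is an iterable of (user_id, page) pairs, one per page load.
    """
    records = list(requests)
    page_views = len(records)                           # every request counts
    unique_users = len({user for user, _ in records})   # each user counts once
    return unique_users, page_views


# Hypothetical log: user "a" loads two pages, user "b" loads one.
log = [("a", "/search"), ("a", "/result"), ("b", "/search")]
print(summarize_access(log))
```

A service can therefore gain page views without gaining unique users, which is why both figures are reported.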
(4) Conclusion
Since starting the ARSA service as a trial operation in December 2004, DDBJ has continuously improved it into a more convenient keyword search system. Meanwhile, DDBJ has introduced ARSA at scientific meetings and DDBJing tutorials. The increase in ARSA users reflects these efforts, and as a result, DDBJ decided to terminate the SRS service.
DDBJ will make even greater efforts in developing ARSA. If you have any comments on ARSA, please send them via "Your Comment", located in the upper blue zone.
 DDBJ Rel. 74 Completed
The nucleotide sequence database collected and maintained by DDBJ is released online to the public quarterly. We completed DDBJ Release 74 on June 24, 2008. DDBJ Release 74 consists of 87,903,140 entries, and the number of bases reached 91,294,770,939.
The periodical release and the new data are available by FTP download from the "FTP/Web API" page.
 Release of sequence data from DDBJ
Release of new Drosophila EST 190,096 entries
DDBJ released 190,096 Drosophila EST entries, which had been submitted by Kyoto Institute of Technology.
The accession numbers are as follows:
  • Drosophila auraria 5'-EST : DK265854-DK284650 (18,797 entries)
  • Drosophila auraria 3'-EST : DK284651-DK303963 (19,313 entries)
  • Drosophila sechellia 5'-EST : DK303964-DK322998 (19,035 entries)
  • Drosophila sechellia 3'-EST : DK322999-DK342220 (19,222 entries)
  • Drosophila simulans adult female 5'-EST : DK342221-DK360662 (18,442 entries)
  • Drosophila simulans adult female 3'-EST : DK360663-DK379473 (18,811 entries)
  • Drosophila simulans larvae 5'-EST : DK379474-DK398612 (19,139 entries)
  • Drosophila simulans larvae 3'-EST : DK398613-DK417792 (19,180 entries)
  • Drosophila simulans adult male 5'-EST : DK417793-DK436903 (19,111 entries)
  • Drosophila simulans adult male 3'-EST : DK436904-DK455949 (19,046 entries)
These entries were released as DDBJ daily updates on Jun. 14.
FTP site for DB download : Drosophila_EST_080614_1.seq.gz
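The entry counts listed above follow directly from the accession ranges, since accession numbers within a range are consecutive. A minimal sketch of that arithmetic, with the hypothetical helper `entries_in_range`:

```python
import re

# Accession number: an alphabetic prefix followed by digits, e.g. DK265854.
_ACCESSION = re.compile(r"^([A-Z]+)(\d+)$")


def entries_in_range(first, last):
    """Count the accession numbers from first to last, both inclusive."""
    m_first, m_last = _ACCESSION.match(first), _ACCESSION.match(last)
    if m_first is None or m_last is None:
        raise ValueError("unrecognized accession number")
    if m_first.group(1) != m_last.group(1):
        raise ValueError("accession prefixes differ")
    start, end = int(m_first.group(2)), int(m_last.group(2))
    if end < start:
        raise ValueError("range is reversed")
    return end - start + 1


print(entries_in_range("DK265854", "DK284650"))  # Drosophila auraria 5'-EST
print(entries_in_range("DK265854", "DK455949"))  # all ten ranges together
```

The first call gives 18,797 and the second 190,096, matching the figures in the announcement.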
 BLAST and PSI-BLAST upgraded
DDBJ provides several homology search services through WWW. Among them, BLAST (by WWW) and PSI-BLAST (by WWW) have been upgraded from Ver. 2.2.15 to Ver. 2.2.18.
The BLAST program provided through the NIG supernig server has also been upgraded from Ver. 2.2.15 to Ver. 2.2.18 (Jun. 11, 2008).
 Apology for the defective results of CON entries in getentry
getentry is the entry retrieval system by accession numbers etc., which is provided by DDBJ via WWW server.
We found two problems in displaying the sequences of CON entries.
Details are as follows:
  • Condition :
    1. When a piece entry is cited with the "complement" operator in a CON entry, the sequence span cited from the piece entry is processed as the forward strand. It should be processed as the complementary sequence.
      Fixed ( Jul. 3, 2008 )
    2. When the piece entries of the retrieved CON entry include a CON entry, it cannot be processed correctly.
  • Affected parts : We provide two methods of getting the combined sequences of CON entries: display in the browser and FTP. However, neither method works properly when CON entries are retrieved under the following conditions.
    1. When you specify "total nt seq FASTA" as output format, and "www" or "FTP" as Results.
    2. When you specify "Flat file (DDBJ)" as output format, and "www" as Results.
    Then, click each of following links in the result;
      a) transfer all the DNA sequences separately in FASTA format
      b) display all the DNA sequences separately in FASTA format
      c) transfer the combined DNA sequence in FASTA format (Constructed sequence according to CONTIG line)
      d) display the combined DNA sequence in FASTA format (constructed sequence according to CONTIG line)
  • Affected services : getentry, ARSA, and Web API (GetEntry)
  • Measure : Recovery work takes a little longer. We will provide an announcement on this page as soon as the problems are solved.
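To illustrate the first condition, the sketch below builds a combined sequence from piece entries and takes the reverse complement of any span cited with the "complement" operator. It is a simplified illustration, not the getentry implementation: the accession names and the segment tuples are hypothetical, gaps are omitted, and piece entries that are themselves CON entries (the second condition) are not handled.

```python
# Complement table for unambiguous bases; other symbols pass through unchanged.
_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(sequence):
    """Return the complementary strand, read in its own 5'->3' direction."""
    return sequence.translate(_COMPLEMENT)[::-1]


def build_con_sequence(pieces, segments):
    """Join spans of piece entries into one combined sequence.

    pieces   : dict mapping accession -> sequence of the piece entry
    segments : list of (accession, start, end, is_complement), with
               1-based inclusive positions as in flat file locations
    """
    parts = []
    for accession, start, end, is_complement in segments:
        span = pieces[accession][start - 1:end]
        # The defect described above amounted to skipping this step.
        parts.append(reverse_complement(span) if is_complement else span)
    return "".join(parts)


pieces = {"PIECE1": "aaccggtt", "PIECE2": "acgtacgg"}
segments = [("PIECE1", 1, 4, False), ("PIECE2", 5, 8, True)]
print(build_con_sequence(pieces, segments))
```

Here the span 5..8 of PIECE2 is "acgg", so its reverse complement "ccgt" is appended rather than the forward strand.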
We sincerely apologize for the inconvenience.
  Apology for the defect of some search results in getentry
getentry is the entry retrieval system by accession numbers etc., which is provided by DDBJ via WWW server. When DAD (DDBJ Amino Sequence Database) was searched by Protein ID in getentry, some search results did not display part of the entry properly during the following period.
Details are as follows:
  • Condition : Unsuitable results were displayed under the following conditions. When "Protein ID" was specified as "ID" and "DAD (DDBJ Amino Sequence Database)" as "Database" in getentry, some entries had a CDS feature with a "/translation" qualifier located below a CDS feature with a "/pseudo" qualifier within the range of that CDS feature. The hyperlink on the Protein ID in DDBJ flat files containing such a "/translation" qualifier did not work properly. Furthermore, DAD retrieval with those Protein IDs did not display suitable results.
  • Affected Period : Feb. 27, 2007 to May. 22, 2008
  • Accession numbers of DDBJ entries corresponding to the affected Protein IDs : (-> See the list )
  • Measure : The defect has already been fixed and the service works normally.
We kindly request users who have used the corresponding entries to conduct their searches again. We sincerely apologize for the inconvenience.

