


If you have any questions or suggestions about DDBJmag, please do not hesitate to write to us.
We would like to hear from you!
To operate and implement the collaborative construction of the international nucleotide sequence database, the three data banks, DDBJ, EMBL-Bank/EBI and GenBank/NCBI, hold an international collaborators meeting every May. In 2008, the meeting was held at DDBJ in Japan on 20-22 May.
DDBJ, EMBL-Bank and GenBank each reported on their activities during the past year and discussed the following practical matters for maintaining and developing the nucleotide sequence database.
The Items; Discussed and To Be Studied
- A new division, TSA (Transcriptome Shotgun Assembly)
From June 2008, INSDC introduces a new division, TSA, for assembled mRNA sequences. Note that the original sequence data of the primary transcripts underlying a TSA submission must be registered in the EST division of INSDC, Trace Archive, or Short Read Archive. More information about how to submit TSA entries will be provided via the DDBJ website.
- Sequence data from next generation sequencing
In principle, raw reads from next generation sequencing should be registered to Short Read Archive. Following the workshop on MINSEQE (Minimal Information about a High Throughput Sequencing Experiment), data from next generation sequencing not initially intended for INSD submissions might result in discoveries of variation or re-annotation that could be submitted to INSDC as TPA or TSA entries. The number of TPA entries is not expected to grow rapidly.
- Representative submissions of identical sequences for variation studies
INSDC basically accepts all sequence data, regardless of source and sequence identity. However, in order to take advantage of normalisation for variation studies, a single submission representing multiple identical sequences is also acceptable, with the frequency and total sample number described by the /frequency qualifier of the source feature.
- Removal of the flag for electronic publication, "(er)", in REFERENCE/JOURNAL lines
The electronic publication token in REFERENCE/JOURNAL lines, "(er)", will be removed. Old records will be retrofitted to conventional article citations where possible.
Changes to the Feature Table Document: Features and Qualifiers
The following items will be applied from October 2008 with the revision of the Feature Table Definition, unless otherwise specified.
- Modification of controlled vocabulary for /mol_type qualifier
The /mol_type qualifier is used to indicate the in vivo, synthetic or hypothetical molecule type in the source feature. The vocabulary list for the /mol_type qualifier will be modified as follows:
Addition: "transcribed RNA"
Removal: "snoRNA", "snRNA", "scRNA", "pre-RNA" and "tmRNA"
- The value, "chromatophore", will be legal for /organelle qualifier
- Modification of controlled vocabulary for /ncRNA_class qualifier
The ncRNA feature utilizes a /ncRNA_class qualifier with a controlled vocabulary to indicate what type of non-protein-coding feature is being represented. The controlled vocabulary list for the /ncRNA_class qualifier will be modified as follows:
Addition: "6S/SsrS", "SraD RNA", "DsrA RNA", "SroC"
Change: "hammerhead ribozyme" --> "ribozyme"
(See also Controlled vocabulary for ncRNA classes)
- A new qualifier, /satellite, will be legal for repeat_region feature.
Format "<satellite_type>[:<class>][ <identifier>]" where satellite_type is one of the following:
"satellite", "microsatellite", "minisatellite"
Example
/satellite="satellite: S1a"
/satellite="satellite: gamma III"
/satellite="minisatellite"
/satellite="microsatellite: DC130"
- Improvement of the format of the /frequency qualifier
In order to represent a sample size, the following descriptions will also be legal as value formats of the /frequency qualifier, in addition to decimal fractions:
"[m] in [n]" or "[m] / [n]".
Example
/frequency="23/108"
/frequency="1 in 12"
- /specific_host qualifier will become /host qualifier.
Both /host and /lab_host should be described with a binominal scientific name, if possible.
Example
/lab_host="Gallus gallus"
/lab_host="Gallus gallus embryo"
/lab_host="Escherichia coli strain DH5 alpha"
/lab_host="Homo sapiens HeLa cells"
- Removal of /virion qualifier
Note: The /proviral qualifier will remain in use.
- Removal of /cons_splice qualifier
- Improvement of validation for both /rearranged and /germline qualifiers
Basically, the /rearranged and /germline qualifiers should be used to indicate whether or not the sequence has undergone somatic rearrangement as part of an adaptive immune response. However, since many of them have been used incorrectly, we will correct them.
- A new qualifier, /gene_synonym, will be legal for features that can use /gene qualifier.
We also expect further minor changes in the usage of the /gene qualifier. Details of the changes will be made available shortly.
- Improvement of the format of /inference qualifier
In order to describe inferential support more effectively, the format of the /inference qualifier will be improved. Details of the changes will be made available shortly.
- A new qualifier, /mating_type, will be legal for source feature.
The /sex qualifier will also remain in use. Guidelines for describing both /mating_type and /sex will be made available shortly.
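For readers who check qualifier values in their own submission scripts, the /satellite and /frequency value formats described above can be sketched roughly as follows. This is only an illustrative sketch, not an INSDC or DDBJ validator: the function names `parse_satellite` and `parse_frequency` are hypothetical, and the optional class and identifier of /satellite are kept together as one free-text remainder because the announcement does not say how to tell them apart.

```python
import re
from fractions import Fraction

# The three satellite types named in the announcement, optionally followed
# by ":" and free text (class and/or identifier, not separated here).
_SATELLITE_RE = re.compile(
    r"^(satellite|microsatellite|minisatellite)(?::\s*(.+))?$"
)

# "[m] in [n]" or "[m] / [n]" forms of the /frequency value.
_COUNT_RE = re.compile(r"^(\d+)\s*(?:/|in)\s*(\d+)$")


def parse_satellite(value):
    """Return (satellite_type, remainder); remainder is None when absent."""
    match = _SATELLITE_RE.match(value.strip())
    if match is None:
        raise ValueError(f"not a valid /satellite value: {value!r}")
    return match.group(1), match.group(2)


def parse_frequency(value):
    """Return (frequency, sample_size); sample_size is None for decimals."""
    text = value.strip()
    match = _COUNT_RE.match(text)
    if match:
        m, n = int(match.group(1)), int(match.group(2))
        if n == 0 or m > n:
            raise ValueError(f"inconsistent counts in /frequency: {value!r}")
        return Fraction(m, n), n
    freq = Fraction(text)  # raises ValueError for non-numeric text
    if not 0 <= freq <= 1:
        raise ValueError(f"/frequency out of range: {value!r}")
    return freq, None


print(parse_satellite("satellite: gamma III"))
print(parse_frequency("1 in 12"))
```

Returning the sample size separately keeps the total sample number available even when the fraction itself reduces to lower terms.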
Trace Archive is defined by NCBI as a permanent repository of DNA sequence chromatograms (traces), base calls, and quality estimates for single-pass reads from various large-scale sequencing projects.
DDBJ completed its first registration of Trace Archive data in July 2008, supported by the national project for integrating life science databases.
1. Trace data of Oryzias latipes WGS sequences determined by National Institute of Genetics (NIG):
TI numbers are as follows:
- 2095022956-2095389675
- 2095396176-2096435759
- 2096858496-2096933759
2. Trace data from UTCOB (human gut metagenome):
TI numbers are as follows:
- 2097946941-2099007079
The sizes of these trace data are as follows:
- (a) about 50G bytes (from NIG, gzipped tar files including .qual, peak, .seq and .scf)
- (b) about 40G bytes (from UTCOB, gzipped tar files including .scf)
Trace data (a) was first assembled into part of the BAAF WGS entries (about 309M bytes, gzipped tar file including Flat File format) and was further assembled into the DG000001-DG000024 chromosome /CON entries. The Medaka genome sequencing project web site provides more details.
Trace data (b) was assembled into the BAAU-BABG WGS entries (about 272M bytes, gzipped tar file including Flat File format).
(2) Transfer the file from DDBJ to NCBI
We first uploaded test data to NCBI Trace Archive by conventional FTP, but the transfer took an intolerably long time. After investigating several alternative file transfer protocols and applications, we succeeded by transferring multiple files in parallel over conventional FTP. The transfer was completed in several hours, whereas a sequential FTP transfer was expected to take two whole days.
These data are now retrievable at NCBI Trace Archive. DDBJ is starting to prepare its own web page for the retrieval of trace data.
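The idea of transferring multiple files in parallel over conventional FTP can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure DDBJ actually used: the helper names, the host and credential parameters, and the worker count are all hypothetical, and each worker simply opens its own FTP connection with Python's standard `ftplib`.

```python
from concurrent.futures import ThreadPoolExecutor
from ftplib import FTP
from pathlib import Path


def upload_one(host, user, password, path):
    """Upload a single file over its own FTP connection (hypothetical helper)."""
    with FTP(host) as ftp:
        ftp.login(user, password)
        with Path(path).open("rb") as handle:
            ftp.storbinary(f"STOR {Path(path).name}", handle)
    return Path(path).name


def upload_parallel(paths, upload, workers=4):
    """Run `upload` on every path using a pool of worker threads.

    `upload` is any callable taking one path, so the transfer method can be
    swapped or replaced by a stub. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(upload, paths))


if __name__ == "__main__":
    # Dry run with a stub uploader so that no network access is needed.
    print(upload_parallel(["a.tar.gz", "b.tar.gz"], lambda p: Path(p).name))
```

For a real transfer, `upload` would be something like `functools.partial(upload_one, host, user, password)`; how many parallel connections a server tolerates is site-specific and has to be checked with the receiving side.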
DDBJ has accepted Trace Archive data. The Trace Archive data submitted to DDBJ is now available by FTP from the "FTP/Web API" page.
The following data are now available:
- Trace data of Oryzias latipes WGS sequences determined by National Institute of Genetics (NIG)
TI numbers are as follows:
- 2095022956-2095389675
- 2095396176-2096435759
- 2096858496-2096933759
FTP site for DB download : NIG (Oryzias latipes)
TI numbers are as follows:
- 2097946941-2099007079
FTP site for DB download : UTCOB (human gut metagenome)
DDBJ will discontinue the following services provided via WWW and E-mail according to the schedule below. We apologize for the inconvenience and thank you for your understanding and cooperation.
| Services | Date of termination | Substitute services of DDBJ |
| Search & Analysis by E-mail server | | |
| getentry | 2008.9.12 | getentry (WWW *1 / WABI *2) |
| get-version | 2008.9.12 | getentry (WWW *1 / WABI *2) |
| FASTA | 2008.9.12 | FASTA (WWW *1 / WABI *2) |
| BLAST | 2008.9.12 | BLAST (WWW *1 / WABI *2) |
| SSEARCH | 2008.9.12 | SSEARCH (WWW *1) |
| ClustalW | 2008.9.12 | ClustalW (WWW *1 / WABI *2) |
| HMMPFAM | 2008.9.12 | HMMPFAM (WWW *1) |
| Keyword Search | | |
| SRS | 2008.12.26 | ARSA (see "Termination of providing SRS services") |
| Sequence Pattern Match | | |
| SQmatch | 2008.11.14 | |
| Protein Compatibility Analysis, etc. | | |
| PDB Retriever | 2008.11.14 | |
| Libra | 2008.11.14 | GTOP |
| Lib score | 2008.11.14 | |
*: Receiving the results by E-mail after sending queries via WWW remains available.
Termination of providing SRS (Sequence Retrieval System) services
DDBJ will terminate its SRS services (via WWW/WABI) as of December 26, 2008. We greatly appreciate your use of SRS since it was first launched on the WWW in 1999. We apologize for the trouble of switching your keyword search system from SRS to ARSA, and we hope ARSA will be useful for all DDBJ users. Please refer to the description of ARSA and SRS below.
(1) ARSA and SRS Search functions
| | ARSA | SRS |
| Search Engine | The high-speed XML database search engine "Interstage Shunsaku" was applied to nucleotide sequence database retrieval | The keyword search system developed by EBI (European Bioinformatics Institute) was applied and reconstructed for DDBJ database retrieval |
| Key Features | | |
| Functions | ARSA | SRS |
| Cross search of multiple DB | Yes | Yes |
| Specifying the search field | Yes | Yes |
| Use of DDBJ flat file Feature/Qualifier | Yes | No |
| Keyword search using the frequently | Yes | Yes |
| Multiple keywords connection using and/or/not | Yes* | Yes |
| Relation to analysis tools | Yes | Yes |
| Speed | fast | slow |
*: Use of "not" in Quick Search is being enhanced (as of Aug. 2008)
(2) ARSA Improvement History
| Date | Improvement | No. of DBs |
| 2004.12 | Trial operation started at the DDBJ HP ("DDBJ" and "DDBJNEW" only) | 2 |
| 2005.12 | Upgraded; full-time operation started | |
| 2007.02 | 21 DBs other than DDBJ were added | 23 |
| 2007.07 | Extensive enhancement and transfer to official operation <Main New Functions> | |
| 2007.10 | DDBJ HP design renewal (ARSA is available at the search window of DDBJ HP) | |
| 2007.10 | Sequence search program linked to TX Search was changed from SRS to ARSA | |
| 2007.11 | Further upgrade subsequent to the enhancement conducted in July <Main new functions> | 24 |
| 2008.05 | Removal of redundant DBs: 4 DBs related to PFAM were removed | 20 |
(3) Growth of Number of ARSA and SRS Users
The graph of ARSA and SRS users from January 2007 to June 2008 shows that the number of unique users of ARSA has increased continuously, whereas that of SRS has been declining or flat.
In the pageview count, ARSA exceeded SRS in March 2008.
| Unique User: | The number of individuals who visited a Web site or received specific content within a specified period such as a day or month. Multiple visits by the same user are counted as 1. |
| Page View: | A request to load a single page of an Internet site. This number is widely used as an index of access to the site. |
(4) Conclusion
Since starting the ARSA service as a trial operation in December 2004, DDBJ has continuously improved it toward a more convenient keyword search system. Meanwhile, DDBJ has introduced ARSA at scientific meetings and DDBJing tutorials. The increase in ARSA users reflects these efforts, and as a result, DDBJ decided to terminate the SRS service.
DDBJ will make even greater efforts toward the development of ARSA. If you have any comments on ARSA, please send them via "Your Comment", located in the upper blue zone.
The nucleotide sequence database collected and maintained by DDBJ is released online to the public quarterly. We completed DDBJ Release 74 on June 24, 2008. DDBJ Release 74 consists of 87,903,140 entries, and the number of bases reached 91,294,770,939.
The periodical release and the new data are available by FTP download from the "FTP/Web API" page.
Release of new Drosophila EST 190,096 entries
DDBJ released 190,096 Drosophila EST entries, which had been submitted by Kyoto Institute of Technology. The accession numbers are as follows:
- Drosophila auraria 5'-EST : DK265854-DK284650 (18,797 entries)
- Drosophila auraria 3'-EST : DK284651-DK303963 (19,313 entries)
- Drosophila sechellia 5'-EST : DK303964-DK322998 (19,035 entries)
- Drosophila sechellia 3'-EST : DK322999-DK342220 (19,222 entries)
- Drosophila simulans adult female 5'-EST : DK342221-DK360662 (18,442 entries)
- Drosophila simulans adult female 3'-EST : DK360663-DK379473 (18,811 entries)
- Drosophila simulans larvae 5'-EST : DK379474-DK398612 (19,139 entries)
- Drosophila simulans larvae 3'-EST : DK398613-DK417792 (19,180 entries)
- Drosophila simulans adult male 5'-EST : DK417793-DK436903 (19,111 entries)
- Drosophila simulans adult male 3'-EST : DK436904-DK455949 (19,046 entries)
FTP site for DB download : Drosophila_EST_080614_1.seq.gz
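The entry counts listed above follow directly from the accession ranges, since the numeric parts are consecutive. The short sketch below recomputes them; the helper names `split_accession` and `range_size` are hypothetical and only assume accessions made of an uppercase prefix followed by digits.

```python
import re

_ACCESSION_RE = re.compile(r"^([A-Z]+)(\d+)$")


def split_accession(accession):
    """Split an accession such as "DK265854" into ("DK", 265854)."""
    match = _ACCESSION_RE.match(accession)
    if match is None:
        raise ValueError(f"unexpected accession format: {accession!r}")
    return match.group(1), int(match.group(2))


def range_size(span):
    """Number of entries in an inclusive range like "DK265854-DK284650"."""
    first, last = span.split("-")
    prefix_a, number_a = split_accession(first)
    prefix_b, number_b = split_accession(last)
    if prefix_a != prefix_b or number_b < number_a:
        raise ValueError(f"inconsistent accession range: {span!r}")
    return number_b - number_a + 1


# Ranges copied from the list above.
DROSOPHILA_EST_RANGES = [
    "DK265854-DK284650", "DK284651-DK303963",
    "DK303964-DK322998", "DK322999-DK342220",
    "DK342221-DK360662", "DK360663-DK379473",
    "DK379474-DK398612", "DK398613-DK417792",
    "DK417793-DK436903", "DK436904-DK455949",
]

total = sum(range_size(span) for span in DROSOPHILA_EST_RANGES)
print(total)  # 190096, matching the announced number of entries
```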
getentry is the entry retrieval system by accession numbers etc., which is provided by DDBJ via WWW server.
We found two problems in displaying the sequences of CON entries.
Details are as follows:
- Condition :
1. When a cited piece entry is described with the "complement" operator in a CON entry, the sequence span cited from the piece entry was processed as the forward strand. It should be processed into the complementary sequence.
Fixed ( Jul. 3, 2008 )
2. When a piece entry of the retrieved CON entry contains a CON entry, it cannot be processed correctly.
- Affected parts : We provide two methods to get the joined sequences of CON entries: display on the browser and FTP. However, neither method works properly when retrieving CON entries under the following conditions.
1. When you specify "total nt seq FASTA" as the output format, and "www" or "FTP" as Results.
2. When you specify "Flat file (DDBJ)" as the output format and "www" as Results, and then click any of the following links in the result:
a) transfer all the DNA sequences separately in FASTA format
b) display all the DNA sequences separately in FASTA format
c) transfer the combined DNA sequence in FASTA format (constructed sequence according to CONTIG line)
d) display the combined DNA sequence in FASTA format (constructed sequence according to CONTIG line)
- Affected services : getentry, ARSA, and Web API (GetEntry)
- Measure : Recovery work will take a little longer. We will post an announcement on this page as soon as the problems are solved.
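The first condition above concerns how a span cited with the "complement" operator should be joined into the combined sequence. The following sketch illustrates the expected behaviour only; it is not DDBJ's implementation. The `Segment` type and function names are hypothetical, only the four unambiguous bases are handled, and nested CON entries (the second condition) are not covered.

```python
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    """One piece cited by a CON entry (hypothetical representation)."""
    accession: str
    start: int             # 1-based, inclusive
    end: int               # 1-based, inclusive
    complement: bool = False


_COMPLEMENT = str.maketrans("acgtACGT", "tgcaTGCA")


def reverse_complement(sequence: str) -> str:
    """Complement each base, then reverse the strand direction."""
    return sequence.translate(_COMPLEMENT)[::-1]


def build_con_sequence(segments: List[Segment], pieces: Dict[str, str]) -> str:
    """Join the cited spans in order, honouring the complement operator."""
    parts = []
    for segment in segments:
        span = pieces[segment.accession][segment.start - 1:segment.end]
        parts.append(reverse_complement(span) if segment.complement else span)
    return "".join(parts)


# Toy piece sequences; real piece entries would be fetched by accession.
pieces = {"P1": "aaccgg", "P2": "tttggg"}
segments = [Segment("P1", 1, 4), Segment("P2", 2, 5, complement=True)]
print(build_con_sequence(segments, pieces))  # aaccccaa
```

In the reported defect, the second span would have been appended as "ttgg" (forward strand) instead of "ccaa".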
getentry is the entry retrieval system by accession numbers etc., which is provided by DDBJ via WWW server.
When retrieving DAD ( DDBJ Amino Sequence Database ) entries by Protein ID in getentry, some search results did not display part of the entry properly during the period given below.
Details are as follows:
- Condition : Unsuitable results were displayed under the following conditions. When " Protein ID " was specified as " ID " and " DAD ( DDBJ Amino Sequence Database )" as " Database " in getentry, some entries had a CDS feature with the " /translation " qualifier located below a CDS feature with the " /pseudo " qualifier, within the range of that CDS feature. For DDBJ flat files containing such a " /translation " qualifier, the hyperlink on the Protein ID did not work properly, and DAD retrieval with those Protein IDs did not display suitable results.
- Affected Period : Feb. 27, 2007 to May. 22, 2008
- Accession numbers of DDBJ entries corresponding to the affected Protein IDs : (see the list)
- Measure : The defect has already been fixed and the service works normally.
- Published by:
- DNA Data Bank of Japan (DDBJ)
Center for Information Biology and DNA Data Bank of Japan (CIB-DDBJ)
National Institute of Genetics (NIG)
Research Organization of Information and Systems
1111 Yata, Mishima, Shizuoka 411-8540, JAPAN
