DDBJ Mail Magazine
No.38    Jun 27, 2008
Published by DDBJ

  The 21st ICM in Mishima
[Photos: meeting room, group photo]
A typhoon was forecast to be approaching, and we were very worried about travel from abroad. Fortunately, the meeting went ahead without any problems, and we completed three exciting days. We can say that the meeting was productive and successful again this year. Next year it will be held in the USA. We look forward to exciting discussions with the ICM members. (See also the third article.)
If you have any questions or suggestions about DDBJmag, please do not hesitate to write to ddbjmag@ddbj.nig.ac.jp. We would like to hear from you!!
  The number of nucleotides released from INSD has reached 200 G bases
In May 2008, the total number of bases (DNA and RNA) collected and distributed by INSD (International Nucleotide Sequence Database: DDBJ/EMBL/GenBank) reached 200 G bases (200,000,000,000 bases; the 'letters' of the genetic code). It took only about three years after we reached 100 G bases in August 2005.
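As a rough check on this pace, the two milestones (100 G bases in August 2005 and 200 G bases in May 2008) can be turned into an average growth rate. The sketch below assumes smooth exponential growth between the two dates, which is a simplification made here and not something INSD states; the function name is illustrative.

```python
import math

def annual_growth_factor(start_bases, end_bases, years):
    """Average yearly multiplication factor, assuming exponential growth."""
    return (end_bases / start_bases) ** (1.0 / years)

# Milestones quoted in the article, counted from calendar month to calendar month.
months = (2008 - 2005) * 12 + (5 - 8)   # August 2005 -> May 2008
years = months / 12

factor = annual_growth_factor(100e9, 200e9, years)
# Time needed to double at that average rate (equals `years` here by construction).
doubling_time = math.log(2) / math.log(factor)

print(f"elapsed: {years:.2f} years")
print(f"average growth: about {100 * (factor - 1):.0f}% per year")
print(f"doubling time: {doubling_time:.2f} years")
```

Under that assumption, doubling in 33 months corresponds to an average growth of a little under 30% per year.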

INSD has expanded its specifications to accept data submissions from large-scale sequencing projects. For example, we have accepted sequence data from EST (Expressed Sequence Tag) projects into the EST division since 1993. In 2002, we created a new category for WGS (Whole Genome Shotgun) data to accept submissions of draft genome and metagenome sequences.

The left figure shows the relative proportions of the three categories. The right figure shows the numbers of bases.

[Figure: base growth]
As you can see in the right figure, the base counts of both WGS and EST/GSS (Genome Survey Sequences) data approximately doubled in three years. The base count of WGS data has increased by 100 G bases in only the past six years.
-What is WGS data?
In general, the term WGS ("Whole Genome Shotgun") originally refers to an approach for sequencing genomic DNA in which the molecules are first fragmented into millions of pieces, which are then sequenced and reassembled to produce a series of sequences. The strategy was first adopted to sequence the complete genome of Haemophilus influenzae in 1995. In INSD, a large set of contigs, or finished sequences without annotation, from an ongoing genome project can be accepted as WGS data.
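The fragment-and-reassemble idea can be illustrated with a toy example. The sketch below is a minimal greedy overlap merger written for illustration only: it assumes error-free reads from a single strand and no repeats, and its function names are hypothetical. It is not the method used by any real assembler or by INSD.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b, or 0."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the longest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    # Drop reads that are fully contained in another read.
    reads = [r for r in reads if not any(r != s and r in s for s in reads)]
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            break  # no remaining overlaps: what is left are the contigs
        merged = reads[best_i] + reads[best_j][best_len:]
        reads = [r for idx, r in enumerate(reads)
                 if idx not in (best_i, best_j)] + [merged]
    return reads

# Three overlapping fragments of the made-up sequence ATGGCGTACGTTAGC.
fragments = ["ACGTTAGC", "ATGGCGTA", "GCGTACGTT"]
contigs = greedy_assemble(fragments)
print(contigs)
```

When the fragments do not all overlap, the function returns several sequences rather than one, which corresponds to the "series of sequences" (contigs) mentioned above.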
-Future improvement
High-throughput sequencing (HTS) technology (also called next-generation sequencing) is becoming increasingly popular. It will accelerate the rate at which sequence data are submitted to DDBJ. To handle the vast quantities of future submissions, DDBJ has to improve its database systems.

NCBI has started building SRA (Short Read Archive) to collect the raw short-read output of next-generation sequencers such as 454, Solexa, and SOLiD. Like EBI, DDBJ is planning to collaborate on SRA.
In connection with the new sequencing technologies, INSD has received many requests to deposit assembled EST (or EST-like short) sequences. We will therefore prepare a new division, TSA (Transcriptome Shotgun Assembly), to accept assembled sequences derived from transcriptome projects.
-The day we reach 300 G bases
In February 2008, NIH announced the 1000 Genomes Project, which by its own count has already produced between 200 and 300 G bases of human genome sequence. For projects of this kind, it is not enough for DDBJ simply to accept nucleotide sequences on a huge scale.

Research communities require not only text-based nucleotide sequences but also the raw trace data behind them, for purposes such as assessing the reliability of sequence data and evaluating polymorphisms among individuals. The Trace Archive has accepted a huge number of trace data registrations, totaling more than tens of terabytes (on the order of 10^13 bytes). Last year, DDBJ started to accept some trace data from Japanese researchers as a first trial. This is supported by the Integrated Database Project.

Analyses of large-scale sequence data will move toward the petabyte scale (on the order of 10^15 bytes). Considering only the sequence data in INSD (i.e., excluding SRA and the Trace Archive), the number of nucleotides could reach 400 G bases or more within a year and a half.
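To put the 300 G and 400 G figures in perspective, the sketch below estimates how long it takes to grow from 200 G bases to a target size for a given doubling time. Exponential growth is an assumption made here for illustration. The 2.75-year doubling time is derived from the August 2005 and May 2008 milestones, and the 1.5-year value is what reaching 400 G bases in a year and a half would imply. The function name is hypothetical.

```python
import math

def years_to_reach(current_bases, target_bases, doubling_time_years):
    """Years needed to grow from current to target, assuming exponential growth."""
    return doubling_time_years * math.log2(target_bases / current_bases)

current = 200e9  # total bases in INSD as of May 2008, per the article

scenarios = [
    ("historical pace (2.75-year doubling)", 2.75),
    ("projected pace (1.5-year doubling)", 1.5),
]
for label, doubling in scenarios:
    t300 = years_to_reach(current, 300e9, doubling)
    t400 = years_to_reach(current, 400e9, doubling)
    print(f"{label}: 300 G after {t300:.2f} years, 400 G after {t400:.2f} years")
```

The gap between the two scenarios shows how much faster than the 2005-2008 average the projected growth would be.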

  The 21st International Collaborators Meeting in Mishima (May 20-22, 2008)
DDBJ is one of the three members of the INSDC (International Nucleotide Sequence Database Collaboration), collaborating with EMBL-Bank at EBI in Europe and GenBank at NCBI in the USA. The three databanks have collaborated to construct the International Nucleotide Sequence Database for more than 20 years.
To operate and implement the collaborative construction of the International Nucleotide Sequence Database, the three databanks have held the International Collaborators Meeting (ICM) every year since the first ICM in Heidelberg, Germany, in July 1988. The ICM rotates annually among the three banks as hosts. This time, we completed the 21st ICM (May 20-22, 2008), three exciting days of intensive discussion at the National Institute of Genetics in Mishima.
Thanks to modern digital technologies, many video conferences are now organized for international communication and discussion; attendees join the virtual meeting from their offices or even from home. At an ICM, however, members of the three banks meet face to face and discuss various issues in the same place for three whole days. This is one of the secrets of INSDC's success over two decades.
The ICM has always spotlighted topics that reflect the progress of research and development in biology. Consequently, many important matters have been decided at past meetings. They include the abolition of the length limit on submitted sequences, Whole Genome Shotgun (WGS) and Third Party Annotation (TPA) data submissions, and the establishment of new sequence divisions such as CON, GSS, and ENV, among others. Non-protein-coding sequences are one of the recent issues. Following the decisions at each ICM, the common annotation manual of the three banks, the Feature Table Definition Document (FT-Doc), is updated every year.
DDBJ will report the decisions of the 21st ICM that are useful for data submitters and users of DDBJ on the DDBJ home page.
 Release of sequence data from DDBJ
Release of Lotus japonicus genome data
DDBJ released Lotus japonicus genome data, which had been submitted by Kazusa DNA Research Institute.

The accession numbers are as follows;
These entries were released as DDBJ daily updates on May 28 and 29.
FTP site for DB download :
Release of 265,853 new medaka (Oryzias latipes) EST entries
DDBJ newly released 265,853 medaka (Oryzias latipes) EST entries, which had been submitted by the National Institute for Basic Biology.

The accession numbers are as follows;
  • DK000001-DK265853 (265,853 entries)
These entries were released as DDBJ daily updates on May 2.
FTP site for DB download : Oryzias_latipes_EST_080502_1.seq.gz
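The entry count quoted for the medaka release can be checked directly from the accession range. The sketch below assumes that a range such as DK000001-DK265853 denotes consecutive numbers under one shared prefix with a fixed zero-padded width; the helper names are hypothetical and are not part of any DDBJ tool.

```python
import re

def parse_accession_range(text):
    """Split 'DK000001-DK265853' into (prefix, first_number, last_number, width)."""
    match = re.fullmatch(r"([A-Z]+)(\d+)-([A-Z]+)(\d+)", text.strip())
    if match is None:
        raise ValueError(f"not an accession range: {text!r}")
    prefix_a, first, prefix_b, last = match.groups()
    if prefix_a != prefix_b or len(first) != len(last) or int(first) > int(last):
        raise ValueError(f"inconsistent accession range: {text!r}")
    return prefix_a, int(first), int(last), len(first)

def count_entries(text):
    """Number of accessions in the range, endpoints included."""
    _, first, last, _ = parse_accession_range(text)
    return last - first + 1

def accessions(text):
    """Yield every accession in the range, zero-padded like the endpoints."""
    prefix, first, last, width = parse_accession_range(text)
    for number in range(first, last + 1):
        yield f"{prefix}{number:0{width}d}"

print(count_entries("DK000001-DK265853"))  # prints 265853
```

The result matches the 265,853 entries stated in the announcement.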
 Genomic Standards Consortium created a new genome description guideline
The Genomic Standards Consortium (GSC) has created a new guideline entitled "MIGS" for the description of genome (and metagenome) information and published it in "Nature Biotechnology", in the issue that appeared on May 9. To coincide with the publication of the paper, the GSC issued a press release.
Press release - 9 May 2008
Issued by the Centre for Ecology & Hydrology, UK
Nature Biotechnology 26, 541 - 547 (2008)
about GSC
 Apology for the defect of some search results in getentry
getentry is an entry retrieval system that searches by accession number and other identifiers, provided by DDBJ via WWW and e-mail servers.
In DAD ( DDBJ Amino Sequence Database ) retrieval by Protein ID in getentry, some search results did not display part of the entry properly during the period given below.
Details are as follows:
- Condition : Unsuitable results were displayed under the following conditions.
When "Protein ID" was specified as "ID" and "DAD ( DDBJ Amino Sequence Database )" as "Database" in getentry, the problem affected entries in which a CDS feature with the qualifier "/translation" was placed below a CDS feature with the qualifier "/pseudo" within the range of CDS features. The hyperlink on the Protein ID in a DDBJ flat file containing such a "/translation" qualifier did not work properly. Furthermore, DAD retrieval with those Protein IDs did not display suitable results.
- Affected Period : Feb. 27, 2007 to May 22, 2008
- Accession numbers of the DDBJ entries corresponding to the affected Protein IDs : (-> See the list )
- Measure : The defect has already been fixed and the service works normally.

We kindly ask users who have used the corresponding entries to conduct their searches again.
We sincerely apologize for the inconvenience.
  Removal of some databases from ARSA
ARSA ( All-round Retrieval of Sequence and Annotation ) is a high-speed retrieval system provided by DDBJ via WWW.
Currently, ARSA provides 24 searchable databases including the following 6 PFAM related databases.
Among them, 4 databases ( PFAMHMMLS, PFAMHMMFS, SWISSPFAM, PFAMSEED ) were removed from ARSA as of May 28, 2008, because they are redundant subsets of PFAMA. PFAMA and PFAMB remain available.

Thank you very much for your understanding and cooperation.
 Report of the 7th Japan-Korea-China Bioinformatics training course
The 7th Japan-Korea-China Bioinformatics training course was held at Jeju University, South Korea, from Mar 18 to 21, 2008. Ten young researchers from Japan attended the training, and one of the participants kindly contributed this report.
Robert P Olinski
JSPS Post-doctoral fellow of Molecular Evolution,
Tokyo Institute of Technology
My name is Robert and I come from Poland, a country located in Central Europe, 13 hours by plane from Japan.
I arrived in Japan in November 2007, soon after obtaining a PhD degree in Neuroscience at Uppsala University in Sweden. In Japan, I started my project under the guidance of Dr Nishihara Hidenori. I was required to master new in silico approaches to sequence analysis in order to process large amounts of data automatically, thus saving time and decreasing the possibility of error. The opportunity to participate in the JKC Bioinformatics course organized in Korea came to me as the right solution for the tasks set for my post-doctoral studies. Since I am a foreigner (often called here henna gaijin due to my appreciation of Japanese fermented soybeans - natto), I felt much honored to be selected for this course via an open competition. A few weeks after submitting my application, I received a phone call from Professor Tateno Yoshio with the news that I was among the few scientists who had passed the selection and would go to Korea. People in the lab were all pleased and seemed to support my desire to learn new approaches and discover a new part of the world.
On Jeju Island, every day starting from 9 am, we gathered in the computer hall to learn how to decode the information embedded in nucleotide and amino acid sequences. The course covered a very extensive range of topics, from molecular phylogenetics through protein modeling, mass spectroscopy, and principles of population genetics to novel advances in sequencing strategies. I especially enjoyed the lecture given by Dr Daron Standley from DDBJ, who introduced us to the resources of the PDBJ database. Moreover, Professor Yang Zhong reviewed major topics in molecular evolution, focusing on the popular algorithms and software used in phylogeny reconstruction, statistical tests of the robustness of a tree with a given topology, formats of multiple sequence alignments, and pitfalls such as long branch attraction that make a tree unreliable. Professor Namshin Kim introduced a new tool for large-scale sequence data analysis called “PYGR” that I am considering using in my studies in Japan.
The course generated a new scientific network for young researchers in bioinformatics in Japan. The idea of a bioinformatics community came from Joshua S Yang of the Korean Bioinformation Center, who is in charge of such a group in Korea. Following Joshua’s example, Mori Akihiro, a PhD student from the National Institute of Genetics in Mishima, decided to establish a similar community in Japan. In an era where genome information is generated and processed so rapidly, I think it is necessary to establish a network where one can learn from peers about new technologies and recent publications and exchange opinions freely. I hope that this new initiative will also be accessible to foreign researchers who have decided to come to Japan for their projects. I would also like to write a few lines about the informal part of the meeting, which is very relevant and close to me.
The organizers of the course welcomed us warm-heartedly. Very special thanks need to be given to Jeongheui Lim, the lady who coordinated the entire meeting, emailed us the latest updates on our travel arrangements and took care of all issues during our stay on Jeju Island. During the meeting, we were given all possible kinds of facilities, including personal computers, high-speed internet connections, time to rest and regenerate our mental forces and good, traditional Korean and, of course, hot food. These made my stay on Jeju a very rewarding experience. The course was given in a casual atmosphere that facilitated discussion between Master and PhD students, post-docs and lecturers.
We were accommodated in a house designated for students of the local university of Jeju. The hotel was situated on a cliff with a stunning view over a bay of the Korean sea. I shared my room with a Japanese colleague, Kuroda Daisuke, from Osaka University, who became a good friend of mine.
I think that it was a very good tactic to mix participants of different nationalities and affiliations in the hotel rooms. After long hours of sitting in front of computers, when the sessions were over, I learned from Daisuke not only about his scientific expertise in in silico antibody design, but I also had an opportunity to discover his culture and a different way of approaching life. During the evenings, we shared many cups of Korean green tea that stimulated long discussions about our dreams and the directions life can take us in the future. As young scientists, we are facing some “not-easy-to-answer” questions about the next steps of our professional careers; uncertainty of funding, difficulties in securing a tenure-track position, and constant moving from one university or even country to another are among them. Fortunately, the course overlapped with my birthday and, to my great surprise, I was given a birthday cake by a few Japanese scientists. It is still a great mystery to me how these kind people found an Italian cake on this isolated piece of land in the middle of the sea.
It is also worthwhile to note that the organizers of the course invited us to prolong our stay on the island in order to appreciate the overwhelming serenity and beauty of its nature. Therefore, during the last day on Jeju, we took the opportunity to explore the beaches, nature reserves, waterfalls, tea plantations and gardens of this island, which is often described as a subtropical paradise.
Knowing that the next, 8th JKC Bioinformatics course will be organized in Japan, I would like to strongly encourage young scientists to take the chance and apply for it. Sequence data constitute a rich source of information and many hidden answers; however, one needs to master the skills necessary to extract, interpret and utilize this information. The JKC Bioinformatics course is a place where you can learn these skills and be ignited with curiosity for using the databases hooked up to your computer in your daily research.
Once again, I wish to thank all the organizers and speakers for their terrific sessions, for sharing their knowledge thoughtfully and for having us on Jeju Island. Kamsahamnida! (“thank you” in Korean)

Published by:
DNA Data Bank of Japan (DDBJ)
Center for Information Biology and DNA Data Bank of Japan (CIB-DDBJ)
National Institute of Genetics (NIG)
Research Organization of Information and Systems
1111 Yata, Mishima, Shizuoka 411-8540, JAPAN