DDBJ Data Analysis Challenge
“DDBJ Data Analysis Challenge” is a machine learning competition that uses “International Nucleotide Sequence Database” data, one of the life science big data resources provided by DDBJ. Participants submit their machine learning models to a collaborative website, UnivOfBigData. The Challenge gives college students and researchers outside the life science field an opportunity to study machine learning and data mining. To make participation easier, DDBJ provides the NIG Supercomputer System as a computing resource.
DDBJ Data Analysis Challenge 2016
| Date | Content | 
|---|---|
| Jun 27, 2016 | Start Accepting Applications: NIG Supercomputer System application; NIG Supercomputer System OSS Installation request |
| Jul 6, 2016 | Start Date | 
| Aug 21, 2016 | Deadline for Applications: NIG Supercomputer System application; NIG Supercomputer System OSS Installation request |
| Aug 31, 2016 | End Date | 
| Sep 30, 2016 | Result | 
Challenge Task
DNA Data Bank of Japan (DDBJ) supports a big data resource called the DDBJ Sequence Read Archive (DDBJ SRA), which contains DNA sequences generated by high-throughput DNA sequencers. A secondary analytical database, the ChIP-Atlas database (Dr. Oki of Kyushu Univ.), provides annotation data for chromatin feature regions on genome sequences.
In this challenge task, predict whether the genomic regions corresponding to the input DNA sequences include chromatin feature regions. A chromatin feature region is related to the on-off switching of gene expression and corresponds to a peak region on a genome sequence in the ChIP-Atlas database.
The target species of the Challenge is a plant. The number of conditions available for the target plant is often over 100. For the Challenge, this is reduced to eight conditions to save time spent on trial and error in data modelling.
- Input training data: 60,000 DNA sequences
- Input test data: 10,000 DNA sequences
- Output training data: correct answer sets for 8 conditions
[Input]
One input sequence is a 200-base ACGT fragment. Each sequence is encoded as a 01 code [Example: AATGC … = 10001000000100100100 …], so the encoded sequence is 800 digits long.
Corresponding code: A = 1000, C = 0100, G = 0010, T = 0001, any other character = 0000
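The encoding described above can be sketched in a few lines of Python. This is only an illustration of the mapping given in the task description, not the organizers' own tooling; the function names `encode_sequence`, `decode_sequence`, and `to_matrix` are hypothetical, and decoding `0000` back to `N` is an assumption made here for readability.

```python
# Mapping from the task description; any other character maps to 0000.
BASE_CODE = {"A": "1000", "C": "0100", "G": "0010", "T": "0001"}
OTHER_CODE = "0000"


def encode_sequence(seq):
    """Encode a DNA string as a 01 string, four digits per base."""
    return "".join(BASE_CODE.get(base, OTHER_CODE) for base in seq)


def decode_sequence(code):
    """Invert encode_sequence; unrecognized 4-digit groups become 'N' (assumption)."""
    if len(code) % 4 != 0:
        raise ValueError("code length must be a multiple of 4")
    inverse = {digits: base for base, digits in BASE_CODE.items()}
    groups = (code[i:i + 4] for i in range(0, len(code), 4))
    return "".join(inverse.get(group, "N") for group in groups)


def to_matrix(code):
    """Split a 01 string into rows of 4 ints, one row per base."""
    return [[int(d) for d in code[i:i + 4]] for i in range(0, len(code), 4)]


print(encode_sequence("AATGC"))              # 10001000000100100100
print(len(encode_sequence("ACGT" * 50)))     # 800
```

The row-per-base form returned by `to_matrix` (200 rows of 4 values for one challenge sequence) is a common input layout for convolutional models, while the flat 800-digit string suits classifiers that expect a single feature vector.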
[Output]
The correct answer sets for the 8 conditions are also encoded as a 01 code. One is a true answer, meaning that the input DNA sequence contains a chromatin feature region. Likewise, zero is a false answer, meaning that it does not contain a chromatin feature region.
[Subject]
At the submission stage, please submit the predicted probability of a true answer as a table with 10,000 rows (test axis) and 8 columns (condition axis) on the BigData University website.
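Before uploading, it may help to check that a prediction table has the required shape. The sketch below validates a 10,000 x 8 table of probabilities and writes it out. The file format expected by the submission site is not stated here, so CSV is an assumption, and the names `validate_predictions` and `write_predictions` are hypothetical.

```python
import csv
import io

N_TEST = 10_000      # test sequences (rows), from the task description
N_CONDITIONS = 8     # conditions (columns), from the task description


def validate_predictions(rows, n_rows=N_TEST, n_cols=N_CONDITIONS):
    """Raise ValueError unless rows is an n_rows x n_cols table of probabilities."""
    if len(rows) != n_rows:
        raise ValueError(f"expected {n_rows} rows, got {len(rows)}")
    for i, row in enumerate(rows):
        if len(row) != n_cols:
            raise ValueError(f"row {i}: expected {n_cols} columns, got {len(row)}")
        for p in row:
            if not 0.0 <= p <= 1.0:
                raise ValueError(f"row {i}: {p} is not a probability")


def write_predictions(rows, handle):
    """Write one line per test sequence; CSV is an assumed format."""
    validate_predictions(rows)
    csv.writer(handle).writerows(rows)


# Placeholder predictions (0.5 everywhere) standing in for real model output.
dummy = [[0.5] * N_CONDITIONS for _ in range(N_TEST)]
buffer = io.StringIO()
write_predictions(dummy, buffer)
print(len(buffer.getvalue().splitlines()))  # 10000
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as `handle`.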
Award
| Award | Affiliation | Winner |
|---|---|---|
| 1st Prize of DDBJ Challenge Awards 2016 | Information and Mathematical Science and Bioinformatics Co., Ltd. | MOCHIZUKI Masahiro |
| 2nd Prize of DDBJ Challenge Awards 2016 | RIKEN ACCC Bioinformatics Research Unit | MATSUMOTO Hirotaka (representative), OZAKI Haruka (participated in this Challenge as a team) |
| 3rd Prize of DDBJ Challenge Awards 2016 | BITS Co., Ltd. | OKAYAMA Toshitsugu |
| Student Prize of DDBJ Challenge Awards 2016 | Master’s Degree Program 1, Graduate School of Information Science and Technology, The University of Tokyo | KATO Takuya |
Result
| DDBJ Challenge Award | AUC | Model Design | Tool Version | 
|---|---|---|---|
| 1st Prize | 0.94564 | 2 Classifiers (Extremely Randomized Trees, CNN); Ensemble Learning (Stacking); External Data (Genomic Position, Gene Structure Annotation) | python=3.5, scikit-learn=0.17.1, chainer=1.10.0 |
| 2nd Prize | 0.89859 | 2 Classifiers (CNN, Product of Genomic Distance Decay Parameter and Nearest Training Data Output); Ensemble Learning (Averaged); External Data (Genomic Position) | julia=0.4.6, python=2.7.10, skflow (tensorflow=0.8.0) |
| 3rd Prize | 0.85428 | 7 Classifiers (Naive Bayes for Multivariate Bernoulli Models, Logistic Regression, Random Forest, Gradient Boosting, Extremely Randomized Trees, eXtreme Gradient Boosting, CNN); Ensemble Learning (Stacking) | python=2.7.11, numpy=1.10.4, scikit-learn=0.17, chainer=1.11.0, xgboost=0.4a30 |
| Student Prize | 0.84318 | 3 Classifiers (LeNet-like CNN, DeepBind-like CNN, Variable-filter DeepBind-like CNN); Ensemble Learning (Soft Voting) | python=2.7, lasagne=0.2.dev1 |
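The AUC column can be made concrete with a small sketch. The code below computes ROC AUC for one condition from the rank-sum (Mann-Whitney U) statistic, with tied scores given average ranks, and then averages it over the condition columns. How the Challenge combined the 8 conditions into a single AUC is not stated here, so the plain mean in `mean_auc` is an assumption, and both function names are hypothetical.

```python
def roc_auc(labels, scores):
    """ROC AUC via the Mann-Whitney U statistic; ties get average ranks."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    order = sorted(range(len(scores)), key=lambda idx: scores[idx])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the group of tied scores starting at i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tied group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)


def mean_auc(label_rows, score_rows):
    """Mean of per-condition AUCs over columns (the averaging is an assumption)."""
    n_cols = len(label_rows[0])
    aucs = [
        roc_auc([row[c] for row in label_rows], [row[c] for row in score_rows])
        for c in range(n_cols)
    ]
    return sum(aucs) / n_cols


print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # 0.75
```

Because AUC depends only on the ranking of the scores, submitted probabilities do not need to be calibrated to score well on this metric.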
Citation
- DDBJ Data Analysis Challenge: a machine learning competition to predict Arabidopsis chromatin feature annotations from DNA sequences. Kaminuma E, Baba Y, Mochizuki M, Matsumoto H, Ozaki H, Okayama T, Kato T, Oki S, Fujisawa T, Nakamura Y, et al. Genes Genet Syst. 2020 Mar 26. PubMed: 32213716
DDBJ Challenge Committee
- Eli Kaminuma, PhD: Center for Information Biology, National Institute of Genetics, Assistant Professor
- Hisashi Kashima, PhD: Department of Intelligence Science and Technology, Kyoto University, Professor
- Toshihisa Takagi, PhD: Center for Information Biology, National Institute of Genetics, Professor
The DDBJ Data Analysis Challenge has been approved through ethical review by the NIG Institutional Review Board (IRB).