Representative submissions of identical sequences for variation studies

Variation studies based on re-sequencing projects have increased recently, and so has the volume of sequence data they produce. In principle, DDBJ (INSDC) accepts all sequence data regardless of source and sequence identity. If this policy were applied strictly, however, some of the data would be highly redundant.

To let variation studies benefit from normalisation, DDBJ also accepts a single submission that represents multiple identical sequences, with the frequency and the total sample number described by the /frequency qualifier of the variation feature.

Representative submission for variation studies does NOT mean that all identical (or similar) sequences derived from the same species are represented by a single sequence entry. So that research data can be evaluated properly, DDBJ recommends normalising the data for variation studies into an appropriate set of entries; basically, the number of entries should equal the number of sequence polymorphisms multiplied by the number of sampled populations.

sequence polymorphism
a unit of sequence variation that can carry a unique description of haplotype, allele and/or other qualifiers.
sampled population
a unit of observed samples that can carry a unique description of country, lat_lon, host and/or other qualifiers.

For example, suppose a study of a locus in cat genomes comparing Japan with the USA finds three haplotypes, as shown in the table below, and that the sequences within each haplotype are identical. DDBJ can accept these results as a submission of 231 sequence entries, one for each individual; however, such a set would be highly redundant for both submitters and users.

polymorphism (haplotype)    A     B     C     total
Japan                       75    38    0     113
USA                         26    32    60    118
total                       101   70    60    231
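
As a rough check of the arithmetic behind this table, the sketch below counts how many entries each submission scheme would need. The names `counts` and `entry_counts` are hypothetical and chosen only for illustration; the numbers are copied from the Japan and USA rows of the table.

```python
# Observed counts per country and haplotype, copied from the table rows.
counts = {
    "Japan": {"A": 75, "B": 38, "C": 0},
    "USA": {"A": 26, "B": 32, "C": 60},
}


def entry_counts(counts):
    """Count the entries needed under each submission scheme (illustrative only)."""
    haplotypes = {h for row in counts.values() for h in row}
    return {
        # one entry per sampled individual
        "per_individual": sum(sum(row.values()) for row in counts.values()),
        # one entry per haplotype, ignoring the sampled population
        "per_haplotype": len(haplotypes),
        # every haplotype x country combination, observed or not
        "full_product": len(haplotypes) * len(counts),
        # only the combinations seen at least once
        "observed": sum(1 for row in counts.values() for n in row.values() if n > 0),
    }


print(entry_counts(counts))
```

The "observed" count differs from the full product only because one haplotype-country cell in the table is empty.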

Since there are three types of identical sequence, it would be possible to submit only three representative sequence entries to DDBJ for the publication of this study. If so, however, users would have difficulty understanding what kinds of samples were used in the study. It is therefore strongly recommended to submit five representative entries to DDBJ (of the 6 possible patterns, i.e. 3 haplotypes x 2 countries, haplotype C is not observed in Japan), with the source features of each entry described as follows.

    source        1..365
                  /mol_type="genomic DNA"
                  /organism="Felis catus"
    variation     124
                  /frequency="75 in 113"
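
The /frequency value shown above, "75 in 113", matches the count of haplotype A in Japan over the total number of Japanese samples. Assuming the other representative entries follow the same pattern (an inference from this single example, not a documented rule), the values could be derived as in the sketch below. `counts` and `frequency_values` are hypothetical names used only for illustration.

```python
# Observed counts per country and haplotype, copied from the table rows.
counts = {
    "Japan": {"A": 75, "B": 38, "C": 0},
    "USA": {"A": 26, "B": 32, "C": 60},
}


def frequency_values(counts):
    """Map each observed (country, haplotype) pair to a "<n> in <total>" string.

    Assumption: the denominator is the total number of samples from that
    country, as in the "75 in 113" example.
    """
    values = {}
    for country, row in counts.items():
        total = sum(row.values())
        for haplotype, n in row.items():
            if n > 0:  # unobserved combinations get no entry
                values[(country, haplotype)] = f"{n} in {total}"
    return values


for (country, haplotype), value in frequency_values(counts).items():
    print(f'{country} {haplotype}: /frequency="{value}"')
```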