DDBJ Annotated/Assembled Sequences
Representative submissions of identical sequences for variation studies
Variation studies based on re-sequencing projects have recently become more common, and the volume of sequence data from these projects has grown accordingly.
DDBJ (INSDC) in principle accepts all sequence data, regardless of source and sequence identity. However,
if this policy is applied strictly, some of the data would be highly redundant.
To allow normalisation for variation studies, a single submission representing multiple identical sequences is also acceptable, provided that the frequency and total sample number are described by the /haplotype qualifier of the source feature and/or the /frequency qualifier of the variation feature.
Representative submission for variation studies does
NOT mean that all identical (or similar) sequences derived from the same species
should be represented by a single sequence entry.
So that research data can be evaluated properly, DDBJ recommends normalising the data
for variation studies into an appropriate set of entries. In principle, the number of entries
should equal the number of sequence polymorphisms multiplied by the number of sampled populations.
- sequence polymorphism
  - a unit of sequence variation that can be described uniquely by /haplotype, /allele and/or other qualifiers.
- sampled population
  - a unit of observed samples that can be described uniquely by /geo_loc_name, /lat_lon, /collection_date, /host and/or other qualifiers.
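The normalisation rule can be sketched in a few lines of Python. This is only an illustration, not a DDBJ tool: the `samples` records and the `representative_entries` helper are hypothetical, and a real data set would carry more qualifiers than haplotype and country. The product of the two counts is an upper bound on the number of entries; combinations that were never observed need no entry.

```python
from collections import Counter

# Hypothetical per-individual records: (haplotype, country).
samples = [
    ("A", "Japan"), ("A", "Japan"), ("B", "Japan"),
    ("A", "USA"), ("C", "USA"), ("C", "USA"),
]

def representative_entries(records):
    """Count individuals per observed (polymorphism, population) pair."""
    return Counter(records)

entries = representative_entries(samples)

# Upper bound: number of polymorphisms x number of sampled populations.
haplotypes = {h for h, _ in samples}
populations = {p for _, p in samples}
upper_bound = len(haplotypes) * len(populations)

print(len(entries), "entries; upper bound", upper_bound)
```

Each key of `entries` corresponds to one representative submission, and its count is the number of identical sequences that entry stands for.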
For example, suppose a study of a locus in cat genomes, comparing Japan with the USA, finds three haplotypes as shown in the table below, and the sequences within each haplotype are identical. DDBJ can accept these results as a submission of 231 sequence entries, one for each individual; however, such a set would be highly redundant for both submitters and users.
polymorphism(haplotype) | A | B | C | total |
---|---|---|---|---|
Japan | 75 | 38 | 0 | 113 |
USA | 26 | 32 | 60 | 118 |
total | 101 | 70 | 60 | 231 |
Since only three distinct sequences were observed, it would be possible to submit
just three representative sequence entries to DDBJ for the publication of this study.
However, users would then find it difficult to understand what kinds of samples were used.
Therefore, it is strongly recommended to submit five representative entries to DDBJ,
each with a source feature described as in the example below.
(There are 6 possible patterns, i.e. 3 haplotypes x 2 countries, but haplotype C was not observed in Japan.)
Furthermore, if samples were collected at different points in time, you may also want to distinguish them with the /collection_date qualifier.
     source          1..365
                     /collection_date="2007"
                     /geo_loc_name="Japan"
                     /haplotype="A [75 in 113]"
                     /mol_type="genomic DNA"
                     /organism="Felis catus"
     variation       124
                     /frequency="75 in 113"
                     /inference="similar to DNA sequence (same species):INSD:AB012345.1"
                     /replace="t"
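
The /haplotype values for all five representative entries can be derived mechanically from the counts in the table. The following Python sketch shows one way to do so. The `counts` dictionary layout and the `source_qualifiers` function are assumptions made for this illustration and are not part of any DDBJ submission tool; /collection_date and the variation feature are left out for brevity.

```python
# Counts copied from the table: country -> haplotype -> number of individuals.
counts = {
    "Japan": {"A": 75, "B": 38, "C": 0},
    "USA": {"A": 26, "B": 32, "C": 60},
}

def source_qualifiers(counts, organism="Felis catus", mol_type="genomic DNA"):
    """Return one list of qualifier lines per observed haplotype/country pair."""
    entries = []
    for country, by_haplotype in counts.items():
        total = sum(by_haplotype.values())  # individuals sampled in this country
        for haplotype, n in by_haplotype.items():
            if n == 0:
                continue  # combination not observed, so no entry is needed
            entries.append([
                f'/geo_loc_name="{country}"',
                f'/haplotype="{haplotype} [{n} in {total}]"',
                f'/mol_type="{mol_type}"',
                f'/organism="{organism}"',
            ])
    return entries

entries = source_qualifiers(counts)
for qualifiers in entries:
    print("\n".join(qualifiers), end="\n\n")
```

Skipping zero counts is what reduces the six possible haplotype and country combinations to the five entries recommended above.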