Genomic Expression Archive
Accepted Data File Formats
Important notes on file preparation:
- Submit raw or raw matrix file(s) for every sample/hybridization of your experiment.
- Make sure file names consist only of alphanumeric characters [A-Z, a-z, 0-9], underscores [_], hyphens [-] and dots [.], with no whitespace, brackets, or other punctuation or symbols.
- Any spreadsheet/matrix file should be saved in tab-delimited text (*.txt) format, not in Excel format (*.xls, *.xlsx).
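The naming and format rules above can be checked before upload. The following is a minimal sketch, not part of the GEA submission system; the function name `check_file_name` is hypothetical and chosen only for illustration.

```python
import re

# Allowed characters per the naming rule above: A-Z, a-z, 0-9, underscore, hyphen, dot.
ALLOWED_NAME = re.compile(r"[A-Za-z0-9_.-]+")
EXCEL_EXTENSIONS = (".xls", ".xlsx")


def check_file_name(name):
    """Return a list of problems found in a submission file name (empty if none)."""
    problems = []
    if not ALLOWED_NAME.fullmatch(name):
        problems.append("contains characters outside A-Z, a-z, 0-9, '_', '-', '.'")
    if name.lower().endswith(EXCEL_EXTENSIONS):
        problems.append("Excel format; save as tab-delimited text (.txt) instead")
    return problems


if __name__ == "__main__":
    for candidate in ["sample_1.raw-data.txt", "sample 1 (rep).txt", "matrix.xlsx"]:
        print(candidate, check_file_name(candidate))
```

The check looks only at the file name; it does not open the file to confirm that the content really is tab-delimited text.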
Microarray file formats
Raw data file formats
Per assay raw file (recommended):
The “native” files generated by the microarray scanner software. Do not change or edit the native files in any way, and submit one raw file per hybridization assay. One assay can consist of a single channel, as in Affymetrix experiments, or two channels, as in spotted arrays, which often have red and green channels from two different dyes/fluorophores.
Commercial microarray manufacturers have developed different raw data file formats over the years. If you are unsure about whether your raw files are in an accepted format, please check the list below.
Raw Matrix (not recommended):
A raw file in tab-delimited text (.txt) format that contains data from more than one hybridization assay (probes in rows and assays in columns). The format requirements are strict (except for Illumina GenomeStudio data files). See matrix guidelines and examples.
Accepted formats by platform
The raw data file platform is recognized by using column headings in the file’s header:
- Affymetrix
- Our system recognizes .CEL and .EXP files using both the old GDAC formats and the newer GCOS/XDA formats.
- Agilent
- A file containing these headings is recognized as an Agilent format file:
Row | Col | PositionX | PositionY |
- Illumina
- Illumina raw data files are usually in either plain text or binary format. Plain text files are generated by the Illumina GenomeStudio software. The binary “IDAT” files (short for “intensity data file”) are generated by the scanner and can be parsed using R/Bioconductor packages such as illuminaio. IDAT is the preferred file format because it is a binary format that contains all the information required to analyse the data. In contrast, plain-text files can be missing information, such as which probes are the control probes; this is sometimes provided in a separate file, but not always, and heterogeneity in raw data file formats makes systematic analysis of data difficult. Another disadvantage of plain-text files is that they are susceptible to human-introduced errors, as it is easy to open the file in a text editor or spreadsheet program and accidentally change its content. If you are submitting a GenomeStudio text file, below is an example of the expected column headings:
PROBE_ID | Assay_Name_1.QT1 | Assay_Name_1.QT2 | Assay_Name_2.QT1 | Assay_Name_2.QT2 |
PROBE_IDs are always in the format of “ILMN_123456”. QT stands for quantitation type, i.e. the type of measurement recorded in the column, e.g. AVGSignal. You can have as many quantitation types as required. Order the columns by sample names, then by quantitation types.
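The header layout described above can be checked with a short script. This is a sketch under stated assumptions, not the validation that GEA itself performs: it assumes the file is tab-delimited, that each data column is named `<assay name>.<quantitation type>` with the last dot as the separator, and that a probe identifier is `ILMN_` followed by digits. The function names are hypothetical.

```python
import re

# Assumption: probe identifiers are "ILMN_" followed by one or more digits.
PROBE_ID_PATTERN = re.compile(r"ILMN_\d+")


def is_probe_id(value):
    """True if the value looks like an Illumina probe identifier."""
    return PROBE_ID_PATTERN.fullmatch(value) is not None


def check_genomestudio_header(header_line):
    """Check a tab-delimited GenomeStudio header row; return a list of problems."""
    columns = header_line.rstrip("\r\n").split("\t")
    problems = []
    if columns[0] != "PROBE_ID":
        problems.append("first column should be PROBE_ID")
    qts_per_sample = {}
    previous = None
    for col in columns[1:]:
        # Assumption: the last dot separates assay name from quantitation type.
        sample, sep, qt = col.rpartition(".")
        if not sep or not sample or not qt:
            problems.append(f"column {col!r} is not <assay name>.<quantitation type>")
            continue
        if sample != previous and sample in qts_per_sample:
            problems.append(f"columns for {sample!r} are not grouped together")
        qts_per_sample.setdefault(sample, []).append(qt)
        previous = sample
    # Every assay should list the same quantitation types in the same order.
    if len({tuple(qts) for qts in qts_per_sample.values()}) > 1:
        problems.append("quantitation types differ in set or order between assays")
    return problems
```

A header such as `PROBE_ID`, `Assay_Name_1.QT1`, `Assay_Name_1.QT2`, `Assay_Name_2.QT1`, `Assay_Name_2.QT2` passes, whereas one that interleaves assays (all QT1 columns first, then all QT2 columns) is reported as not grouped.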
- GenePix
- GenePix format files (usually with file extension .gpr or .txt) are recognised using the following column headings:
Block | Column | Row | X | Y |
PROBE_ID | X | Y |
GRID | COL | ROW | LEFT | TOP | RIGHT | BOT |
Array Column | Array Row | Spot Column | Spot Row | X | Y |
while the older QuantArray format has these headings:
Array Column | Array Row | Column | Row |
Primary | Secondary |
Newer “lg2” ArrayVision files are identified by the following column headings:
Spot labels |
MC | MR | SC | SR |
COL | ROW | SUBGRIDCOL | SUBGRIDROW |
Arr-colx | Arr-coly | Spot-colx | Spot-coly |
Probe_ID | Gene_ID |
Logical_row | Logical_col | Center_X | Center_Y |
Meta Column | Meta Row | Column | Row | Field | Gene ID |
The ImaGene 3.0 format is also supported:
Meta_col | Meta_row | Sub_col | Sub_row | Name | Selected |
grid_c | grid_r | spot_c | spot_r | indexs |
- Generic (for spotted arrays, non-platform specific)
- If your raw data file contains BlockColumn/BlockRow/Column/Row fields denoting probe location on a spotted array, you can use this generic format with four columns followed by columns of data:
MetaColumn | MetaRow | Column | Row |
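To illustrate recognition by column headings, here is a minimal sketch that guesses the platform of a tab-delimited raw file. It is not the GEA implementation. It covers only three heading sets from the lists above (Agilent, the first GenePix set, and the generic format), and it assumes that a file matches when a single line contains all headings of a set, in any position. The names `KNOWN_HEADINGS` and `guess_platform` are hypothetical.

```python
# Heading sets copied from the lists above; deliberately a small subset.
KNOWN_HEADINGS = {
    "Agilent": ("Row", "Col", "PositionX", "PositionY"),
    "GenePix": ("Block", "Column", "Row", "X", "Y"),
    "Generic": ("MetaColumn", "MetaRow", "Column", "Row"),
}


def guess_platform(lines, max_lines=200):
    """Return the first platform whose headings all occur on one line, else None.

    `lines` is any iterable of text lines, for example an open file handle.
    Assumption: matching is by presence of all headings, not by their order.
    """
    for line_number, line in enumerate(lines):
        if line_number >= max_lines:
            break
        # Split on tabs and drop surrounding whitespace and double quotes.
        fields = {field.strip().strip('"') for field in line.rstrip("\r\n").split("\t")}
        for platform, headings in KNOWN_HEADINGS.items():
            if set(headings) <= fields:
                return platform
    return None
```

Scanning a limited number of lines rather than only the first one allows for files that carry a metadata block before the column headings; the limit of 200 is an arbitrary choice.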
Processed data files
Processed files are generated from raw files by procedures such as background correction, normalization, and further statistical analyses (e.g. calculating fold-changes and associated p-values). The final processed data are defined as the data on which the conclusions in the related manuscript are based. We accept either “native” processed files from microarray scanner software (e.g. “.chp” files from Affymetrix scanners, output files from GenomeStudio software for Illumina BeadChip), or two-dimensional spreadsheet files in tab-delimited text (.txt) format. For the latter, the probes/probesets/gene names are in rows, and data from one or more hybridizations are in columns. We accept processed files from the following scenarios:
- one processed file per hybridization (recommended), i.e. you have a series of processed files.
- one spreadsheet (“matrix”) file containing normalised data from all hybridizations (not recommended).
- several spreadsheet (“matrix”) files containing normalised data from different stages of data processing, e.g. one file containing normalized probe intensities and another containing fold-change data summarized at the gene level.
Processed text file format
In the two-dimensional table, you should have probes/genes in rows and samples/data in columns:
Probes/genes in rows: Where possible, as row headers, you should use official probe names/identifiers, matching those in the array design file, so one can map each row of data to the correct probe. Put the probe identifiers in the first column under a heading Reporter Identifier (for probes) or CompositeSequence Identifier (for “composite” collation of probes, most common example being Affymetrix probe sets). If probe identifiers are not available, try to use proper gene symbols or other identifiers (e.g. GenBank cDNA accession, UniProt protein accession).
Samples/Data in columns: Where possible, label each data column with the same sample names that you declare in the SDRF. This allows each column of data to be mapped to the correct sample(s).
A processed .txt file containing data from one single hybridization should look like this:
Reporter Identifier | sample 1 normalised intensity | sample 1 background |
---|---|---|
probe_name_1 | 233.5 | 69.1 |
probe_name_2 | 129.4 | 27.6 |
And here is an example where gene names are used as row headings:
Human HGNC gene name | sample 1 normalised intensity | sample 1 background |
---|---|---|
CDKN2A | 233.5 | 69.1 |
BRCA2 | 129.4 | 27.6 |
Processed “matrices” summarising data from multiple hybridizations should look like the following. As with per-hybridization processed files, if probe identifiers are not available, try to use proper gene symbols or other identifiers (e.g. GenBank cDNA accession, UniProt protein accession).
Matrix of normalised values per sample:
Reporter Identifier | sample 1 normalised | sample 2 normalised | sample 3 normalised | sample 4 normalised |
---|---|---|---|---|
probe_name_1 | 26.9 | 44.3 | 62.3 | 58.5 |
probe_name_2 | 22.9 | 43.7 | 58.2 | 67.4 |
GenBank accession | sample 1 normalised | sample 2 normalised | sample 3 normalised | sample 4 normalised |
---|---|---|---|---|
BC000578 | 26.9 | 44.3 | 62.3 | 58.5 |
M31642 | 22.9 | 43.7 | 58.2 | 67.4 |
Matrix of summarised values (one column of data maps to multiple samples):
Reporter Identifier | drug A treated average | drug B treated average | untreated control average |
---|---|---|---|
probe_name_1 | 44.6 | 89.3 | 290.15 |
probe_name_2 | 98.3 | 36.7 | 100.52 |
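The table layouts above can be produced as tab-delimited text with Python's standard `csv` module. This is a minimal sketch; the function name `write_processed_matrix` and its parameters are hypothetical, and the values in the usage example are taken from the tables above.

```python
import csv


def write_processed_matrix(path, id_heading, sample_headings, rows):
    """Write a processed data matrix as tab-delimited text.

    `rows` maps each probe or gene identifier to a list of values,
    one value per entry in `sample_headings`.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow([id_heading, *sample_headings])
        for identifier, values in rows.items():
            if len(values) != len(sample_headings):
                raise ValueError(
                    f"{identifier}: expected {len(sample_headings)} values, got {len(values)}"
                )
            writer.writerow([identifier, *values])


if __name__ == "__main__":
    write_processed_matrix(
        "processed_matrix.txt",
        "Reporter Identifier",
        ["sample 1 normalised", "sample 2 normalised"],
        {"probe_name_1": [26.9, 44.3], "probe_name_2": [22.9, 43.7]},
    )
```

The same function covers the single-hybridization case by passing the column headings of one sample, for example a normalised intensity column and a background column.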
Additional files
A spike-in list for single-cell analysis or supplementary files for data analysis can be attached to a GEA experiment as “additional files” (for example, E-MTAB-3624). Please contact the GEA team to submit additional files.
Sequencing data
Raw data files
Raw sequencing data files must be registered with the DDBJ Sequence Read Archive (DRA) in advance. Please see the accepted data files for DRA.
Processed data files
The final processed data are defined as the data on which the conclusions in the related manuscript are based. We do not expect standard alignment files (e.g., BAM, SAM, BED) as processed data since conclusions are expected to be based on further-processed data. When standard alignments are the only processed data available, please write to us to inquire about whether your data are suitable for submission to GEA. Requirements for processed data files are not fully standardized and will depend on the nature of the experiment.
Expression profiling analysis usually generates quantitative data for features of interest. Features of interest may be genes, transcripts, exons, miRNA, or some other genetic entity. Two levels of data are often generated:
- raw counts of sequencing reads for the features of interest, and/or
- normalized abundance measurements, e.g., output from Cufflinks, Cuffdiff, DESeq, edgeR, etc.
Either or both of these data types may be supplied as processed data. They may be formatted either as a matrix table or individual files for each sample (recommended). Provide complete data with values for all features (e.g., genes) and all samples, not only lists of differentially-expressed genes.
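One way to confirm that a matrix table carries a value for every feature in every sample is sketched below. It assumes a tab-delimited file whose first row holds the column headings and whose first column holds the feature identifiers; the function name `find_incomplete_features` is hypothetical. It cannot tell whether features were left out of the file altogether, for example when only differentially expressed genes were exported.

```python
import csv


def find_incomplete_features(lines):
    """Return identifiers of rows that lack a value for at least one sample.

    `lines` is any iterable of text lines, for example an open file handle.
    Assumption: first row is the header, first column is the feature identifier.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    n_samples = len(header) - 1
    incomplete = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        values = row[1:]
        if len(values) != n_samples or any(value.strip() == "" for value in values):
            incomplete.append(row[0])
    return incomplete
```

With a file on disk, the function can be called as `find_incomplete_features(open("counts.txt", newline="", encoding="utf-8"))`, where `counts.txt` is a placeholder name.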
ChIP-Seq data might include peak files with quantitative data, tag density files, etc. Common formats include WIG, bigWig, bedGraph.
Features (e.g., genes, transcripts) in processed data files should be traceable using public accession numbers or chromosome coordinates. The reference assembly used (e.g., hg19, mm9, GCF_000001405.13) should be stated in the normalization data transformation protocol and/or the high throughput sequence alignment protocol. These protocols should also describe the format and content of the processed data files.
If you provide WIG, bedGraph, GFF, or GTF files, please refer to the UCSC file format FAQ for format requirements.
Processed matrix files (for advanced users)
For submitters who are familiar with the MAGE-TAB specification, we also accept matrix files in strict MAGE-TAB format. This format allows each data point in the file (in a given row and a given column) to be mapped, both by a human reader and programmatically, to a particular assay in the experiment and to a particular probe/probe set in the array design file. See the guide on the strict matrix format for more information.
Additional files
A spike-in list for single-cell analysis or supplementary files for data analysis can be attached to a GEA experiment as “additional files” (for example, E-MTAB-3624). Please contact the GEA team to submit additional files.