Accepted Data File Formats

Important notes on file preparation:

  1. Submit raw or raw matrix file(s) for every sample/hybridization of your experiment.
  2. Make sure file names consist only of alphanumeric characters [A-Z, a-z, 0-9], underscores [_], hyphens [-] and dots [.], with no whitespace, brackets, or other punctuation or symbols.
  3. Any spreadsheet/matrix file should be saved in tab-delimited text (*.txt) format and not Excel format (*.xls, *.xlsx).
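
Notes 2 and 3 can be checked before upload with a few lines of Python. The sketch below is only an illustration of the rules as stated above, not the archive's own validation; the function name check_file_name is hypothetical.

```python
import re

# Characters allowed by note 2: letters, digits, underscore, hyphen, dot.
_ALLOWED_NAME = re.compile(r"[A-Za-z0-9_.-]+")
# Spreadsheet extensions that note 3 asks submitters to avoid.
_EXCEL_EXTENSIONS = (".xls", ".xlsx")


def check_file_name(name):
    """Return a list of problems found in a file name (empty if none)."""
    problems = []
    if not _ALLOWED_NAME.fullmatch(name):
        problems.append("only A-Z, a-z, 0-9, '_', '-' and '.' are allowed")
    if name.lower().endswith(_EXCEL_EXTENSIONS):
        problems.append("Excel format; save as tab-delimited text (.txt)")
    return problems
```

For example, check_file_name("sample 1 (rep2).txt") reports the disallowed characters, while check_file_name("sample_1-rep2.txt") returns an empty list.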

Microarray file formats

Raw data file formats

Per assay raw file (recommended):

The "native" files generated by the microarray scanner software. Make sure you do not change/edit the native files in any way, and submit one raw file per hybridization assay. One assay can consist of just one channel, as in Affymetrix experiments, or two channels, as in spotted arrays often with red and green channels from two different dyes/fluorophores.

Commercial microarray manufacturers have developed different raw data file formats over the years. If you are unsure about whether your raw files are in an accepted format, please check the list below.

Raw Matrix (not recommended):

A raw file in tab-delimited text (.txt) format that contains data from more than one hybridization assay (probes in rows and assays in columns). The format requirements are strict (except for Illumina GenomeStudio data files). See matrix guidelines and examples.

Accepted formats by platform

The platform of a raw data file is recognized from the column headings in the file's header:

Affymetrix
Our system recognizes .CEL and .EXP files using both the old GDAC formats and the newer GCOS/XDA formats.

Agilent
A file containing these headings is recognized as Agilent format file:
Row Col PositionX PositionY
Illumina
Illumina raw data files are usually either in plain text or binary format. Plain text files are generated by the Illumina GenomeStudio software. The binary "IDAT" files (IDAT stands for "intensity data file") are generated by the scanner and can be parsed using R/Bioconductor packages such as illuminaio. IDAT is the preferred file format because it is binary and contains all the information required to analyse the data. In contrast, plain-text files can be missing information such as which probes are the control probes; this is sometimes provided in a separate file, but not always, and heterogeneity in raw data file formats makes systematic analysis of data difficult. Another disadvantage of plain-text files is that they are susceptible to human-introduced errors, as it is easy for someone to open the file in a text editor or spreadsheet program and accidentally change its content. If you're submitting a GenomeStudio text file, below is an example of the expected column headings:
PROBE_ID Assay_Name_1.QT1 Assay_Name_1.QT2 Assay_Name_2.QT1 Assay_Name_2.QT2
PROBE_IDs are always in the format of "ILMN_123456". QT stands for quantitation type, i.e. the type of measurement recorded in the column, e.g. AVGSignal. You can have as many quantitation types as required. Order the columns by sample names, then by quantitation types.
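
The column layout described above can be checked mechanically. The following is a minimal sketch, assuming the header is tab-delimited and that the last dot in each column name separates the assay name from the quantitation type; the function name is hypothetical and this is not the archive's own parser.

```python
def check_genomestudio_header(header_line):
    """Return a list of problems found in a GenomeStudio-style header line."""
    columns = header_line.rstrip("\r\n").split("\t")
    if columns[0] != "PROBE_ID":
        return ["first column must be PROBE_ID"]
    problems = []
    quant_types = {}  # assay name -> quantitation types, in column order
    previous = None   # assay name of the previous data column
    for column in columns[1:]:
        # Assumption: "<assay name>.<quantitation type>", split at the last dot.
        assay, dot, quant_type = column.rpartition(".")
        if not dot or not assay or not quant_type:
            problems.append(f"{column!r} is not <assay name>.<quantitation type>")
            continue
        if assay in quant_types and assay != previous:
            problems.append(f"columns for {assay!r} are not grouped together")
        quant_types.setdefault(assay, []).append(quant_type)
        previous = assay
    # Every assay should list the same quantitation types in the same order.
    assays = list(quant_types)
    for assay in assays[1:]:
        if quant_types[assay] != quant_types[assays[0]]:
            problems.append(f"{assay!r} and {assays[0]!r} differ in quantitation types")
    return problems
```

A header ordered by sample and then by quantitation type, like the example above, yields an empty list; interleaving the samples yields a "not grouped together" message.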

GenePix
GenePix format files (usually with file extension .gpr or .txt) are recognised using the following column headings:
Block Column Row X Y
NimbleScan
NimbleScan files (Feature, Probe and Pair) all contain the following headings:
PROBE_ID X Y
ScanAlyze
The following column headings are recognized as being from a ScanAlyze format file:
GRID COL ROW LEFT TOP RIGHT BOT
ScanArray/QuantArray
ScanArray Express files are recognized from the following headings:
Array Column Array Row Spot Column Spot Row X Y
while the older QuantArray format has these headings:
Array Column Array Row Column Row
ArrayVision
The following column headings are recognized as indicating an ArrayVision format file:
Primary Secondary
Newer "lg2" ArrayVision files are identified by the following column headings:
Spot labels
Spotfinder
Spotfinder files are recognized by the following column headings:
MC MR SC SR
BlueFuse
A file containing the following headings is recognized as a BlueFuse file:
COL ROW SUBGRIDCOL SUBGRIDROW
UCSF Spot
UCSF Spot files are recognized by the following column headings:
Arr-colx Arr-coly Spot-colx Spot-coly
Applied Biosystems
Files generated by Applied Biosystems software have the following headings:
Probe_ID Gene_ID
Logical_row Logical_col Center_X Center_Y
ImaGene
ImaGene files are recognized using the following columns:
Meta Column Meta Row Column Row Field Gene ID
The ImaGene 3.0 format is also supported:
Meta_col Meta_row Sub_col Sub_row Name Selected
CSIRO Spot
CSIRO Spot files contain the following columns:
grid_c grid_r spot_c spot_r indexs
Generic (for spotted arrays, non-platform specific)
If your raw data file contains BlockColumn/BlockRow/Column/Row fields denoting probe location on a spotted array, you can use this generic format with four columns followed by columns of data:
MetaColumn MetaRow Column Row
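
To see which of the formats above a text file would match, the heading sets can be compared against the file's first lines. This is a rough sketch, not the archive's actual recognition code: it assumes tab-delimited headings, case-sensitive matching, and a header row within the first 50 lines, and it covers only some of the platforms listed above. The names SIGNATURES and guess_platform are hypothetical.

```python
# Partial signature table copied from the headings listed above; the
# remaining platforms could be added in the same way.
SIGNATURES = {
    "Agilent": {"Row", "Col", "PositionX", "PositionY"},
    "GenePix": {"Block", "Column", "Row", "X", "Y"},
    "ScanAlyze": {"GRID", "COL", "ROW", "LEFT", "TOP", "RIGHT", "BOT"},
    "Spotfinder": {"MC", "MR", "SC", "SR"},
    "BlueFuse": {"COL", "ROW", "SUBGRIDCOL", "SUBGRIDROW"},
    "UCSF Spot": {"Arr-colx", "Arr-coly", "Spot-colx", "Spot-coly"},
    "CSIRO Spot": {"grid_c", "grid_r", "spot_c", "spot_r", "indexs"},
    "Generic": {"MetaColumn", "MetaRow", "Column", "Row"},
}


def guess_platform(lines, max_lines=50):
    """Return the first platform whose headings all occur on one line, else None."""
    for line_number, line in enumerate(lines):
        if line_number >= max_lines:
            break
        # Split on tabs and drop surrounding whitespace and double quotes.
        fields = {field.strip().strip('"') for field in line.split("\t")}
        for platform, required in SIGNATURES.items():
            if required <= fields:
                return platform
    return None
```

Because an open file iterates line by line, a call such as guess_platform(open("my_raw_file.txt")) works directly (the file name here is a placeholder).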

Processed data files

Processed files are generated from raw files by procedures such as background correction, normalization, and further statistical analyses (e.g. calculating fold-changes and associated p-values). The final processed data are defined as the data on which the conclusions in the related manuscript are based. We accept either "native" processed files from microarray scanner software (e.g. ".chp" files from Affymetrix scanners, output files from GenomeStudio software for Illumina BeadChip), or two-dimensional spreadsheet files in tab-delimited text (.txt) format. For the latter, the probes/probesets/gene names are in rows, and data from one or more hybridizations are in columns. We accept processed files from the following scenarios:

  • one processed file per hybridization (recommended), i.e. you have a series of processed files.
  • one spreadsheet ("matrix") file containing normalised data from all hybridizations (not recommended).
  • several spreadsheet ("matrix") files containing normalised data from different stages of data processing, e.g. one file containing normalized probe intensities and another containing fold-change data summarized at the gene level.

Processed text file format

In the two-dimensional table, you should have probes/genes in rows and samples/data in columns:

Probes/genes in rows: Where possible, as row headers, you should use official probe names/identifiers, matching those in the array design file, so one can map each row of data to the correct probe. Put the probe identifiers in the first column under a heading Reporter Identifier (for probes) or CompositeSequence Identifier (for "composite" collation of probes, most common example being Affymetrix probe sets). If probe identifiers are not available, try to use proper gene symbols or other identifiers (e.g. GenBank cDNA accession, UniProt protein accession).

Samples/Data in columns: Where possible, label each data column with the same sample names as you declare in the SDRF. This allows each column of data to be mapped to the correct sample(s).

A processed .txt file containing data from one single hybridization should look like this:

Reporter Identifier sample 1 normalised intensity sample 1 background
probe_name_1 233.5 69.1
probe_name_2 129.4 27.6

And here is an example where gene names are used as row headings:

Human HGNC gene name sample 1 normalised intensity sample 1 background
CDKN2A 233.5 69.1
BRCA2 129.4 27.6

Processed "matrices" summarising data from multiple hybridizations should look like the following. Again, as for per-hybridization processed files, if probe identifiers are not available, try to use proper gene symbols or other identifiers (e.g. GenBank cDNA accession, UniProt protein accession).

Matrix of normalised values per sample:

Reporter Identifier sample 1 normalised sample 2 normalised sample 3 normalised sample 4 normalised
probe_name_1 26.9 44.3 62.3 58.5
probe_name_2 22.9 43.7 58.2 67.4

GenBank accession sample 1 normalised sample 2 normalised sample 3 normalised sample 4 normalised
BC000578 26.9 44.3 62.3 58.5
M31642 22.9 43.7 58.2 67.4

Matrix of summarised values (one column of data maps to multiple samples):

Reporter Identifier drug A treated average drug B treated average untreated control average
probe_name_1 44.6 89.3 290.15
probe_name_2 98.3 36.7 100.52
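
The layout described in this section (identifiers in the first column, one data column per sample or group, tab-delimited) can be sanity-checked by reading the file back. The sketch below uses only the Python standard library and assumes every data cell is numeric; the function name is hypothetical and the checks are illustrative, not the archive's own.

```python
import csv
import io


def read_processed_matrix(text):
    """Parse tab-delimited text into (row_heading, column_names, data).

    data maps each row identifier to its list of values as floats.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader, None)
    if not header:
        raise ValueError("file is empty")
    row_heading, column_names = header[0], header[1:]
    data = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(header):
            raise ValueError(f"line {reader.line_num}: expected {len(header)} columns")
        identifier = row[0]
        if identifier in data:
            raise ValueError(f"line {reader.line_num}: duplicate identifier {identifier!r}")
        # float() raises ValueError for non-numeric cells.
        data[identifier] = [float(value) for value in row[1:]]
    return row_heading, column_names, data
```

A file with a ragged row, a repeated identifier, or a non-numeric cell raises ValueError, which usually points to a problem introduced when exporting from a spreadsheet program.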

Additional files

A spike-in list for single-cell analysis or supplementary files for data analysis can be attached to a GEA experiment as "additional files" (for example, E-MTAB-3624). Please contact the GEA team to submit additional files.

Sequencing data

Raw data files

Sequencing raw data files must first be registered with the DDBJ Sequence Read Archive (DRA). Please see the accepted data files for DRA.

Processed data files

The final processed data are defined as the data on which the conclusions in the related manuscript are based. We do not expect standard alignment files (e.g., BAM, SAM, BED) as processed data since conclusions are expected to be based on further-processed data. When standard alignments are the only processed data available, please write to us to inquire about whether your data are suitable for submission to GEA. Requirements for processed data files are not fully standardized and will depend on the nature of the experiment.

Expression profiling analysis usually generates quantitative data for features of interest. Features of interest may be genes, transcripts, exons, miRNA, or some other genetic entity. Two levels of data are often generated:

  • raw counts of sequencing reads for the features of interest, and/or
  • normalized abundance measurements, e.g., output from Cufflinks, Cuffdiff, DESeq, edgeR, etc.

Either or both of these data types may be supplied as processed data. They may be formatted either as a matrix table or individual files for each sample (recommended). Provide complete data with values for all features (e.g., genes) and all samples, not only lists of differentially-expressed genes.
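
When data are supplied as one file per sample, it is easy to end up with samples that cover different feature sets. The sketch below shows one way to check completeness, assuming the per-sample values have already been loaded into dictionaries; the function name and the feature and sample names in the usage note are hypothetical.

```python
def missing_features(per_sample_values):
    """Map each incomplete sample to the sorted features it lacks.

    per_sample_values maps sample name -> {feature identifier: value}.
    A feature counts as missing from a sample if any other sample has it.
    """
    all_features = set()
    for values in per_sample_values.values():
        all_features.update(values)
    report = {}
    for sample, values in per_sample_values.items():
        absent = all_features - set(values)
        if absent:
            report[sample] = sorted(absent)
    return report
```

For instance, if sample_1 has values for gene_a and gene_b but sample_2 only for gene_a, the result is {"sample_2": ["gene_b"]}. An empty dictionary means every sample covers the same features.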

ChIP-Seq data might include peak files with quantitative data, tag density files, etc. Common formats include WIG, bigWig, bedGraph.

Features (e.g., genes, transcripts) in processed data files should be traceable using public accession numbers or chromosome coordinates. The reference assembly used (e.g., hg19, mm9, GCF_000001405.13) should be provided in the normalization data transformation protocol and/or the high throughput sequence alignment protocol. In addition, a description of the format and content of processed data files should be provided in these protocols.

If you provide WIG, bedGraph, GFF, or GTF files, please refer to the UCSC file format FAQ for format requirements.

Processed matrix files (for advanced users)

For submitters who are familiar with the MAGE-TAB specification, we also accept matrix files in strict MAGE-TAB format. This format allows each data point in the file (in a given row and a given column) to be mapped, both programmatically and in a human-readable way, to a particular assay in the experiment and to a particular probe/probe set in the array design file. Check out this guide on the strict matrix format for more information.

Additional files

A spike-in list for single-cell analysis or supplementary files for data analysis can be attached to a GEA experiment as "additional files" (for example, E-MTAB-3624). Please contact the GEA team to submit additional files.