Accepted Data File Formats

Important notes on file preparation:

Submit raw or raw matrix file(s) for every sample/hybridization of your experiment.
Make sure file names consist only of alphanumeric characters [A-Z, a-z, 0-9], underscores [_], hyphens [-] and dots [.], with no whitespace, brackets, or other punctuation or symbols.
Save any spreadsheet/matrix file in tab-delimited text (.txt) format, not Excel format (.xls, .xlsx).
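
As a rough illustration of the file-naming rule above, the following Python sketch checks a name against the permitted character set and proposes a cleaned-up alternative. The function names are hypothetical, and this is not the validation that GEA itself performs.

```python
import re

# Characters permitted by the naming rule: letters, digits, "_", "-" and "."
VALID_NAME = re.compile(r"[A-Za-z0-9_.-]+")


def is_valid_file_name(name: str) -> bool:
    """Return True if the name uses only the permitted characters."""
    return VALID_NAME.fullmatch(name) is not None


def suggest_file_name(name: str) -> str:
    """Replace each run of disallowed characters with a single underscore."""
    return re.sub(r"[^A-Za-z0-9_.-]+", "_", name)
```

Renaming is best done before submission, so that the file names listed in your metadata match the uploaded files exactly.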

Microarray file formats

Raw data file formats

Per assay raw file (recommended):

The “native” files generated by the microarray scanner software. Make sure you do not change/edit the native files in any way, and submit one raw file per hybridization assay. One assay can consist of just one channel, as in Affymetrix experiments, or two channels, as in spotted arrays often with red and green channels from two different dyes/fluorophores.

Commercial microarray manufacturers have developed different raw data file formats over the years. If you are unsure about whether your raw files are in an accepted format, please check the list below.

Raw Matrix (not recommended):

A raw file in tab-delimited text (.txt) format that contains data from more than one hybridization assay (probes in rows and assays in columns). The format requirements are strict (except for Illumina GenomeStudio data files). See matrix guidelines and examples.

Accepted formats by platform

The platform of a raw data file is recognized from the column headings in the file’s header:

Common platforms:
- Affymetrix
- Agilent
- Illumina
- GenePix
- NimbleScan

Others:
- ScanAlyze
- ScanArray
- QuantArray
- ArrayVision
- Spotfinder
- BlueFuse
- UCSF Spot
- Applied Biosystems
- CodeLink
- ImaGene
- CSIRO Spot
- Generic (for spotted arrays, non-platform specific)

Affymetrix: Our system recognizes .CEL and .EXP files using both the old GDAC formats and the newer GCOS/XDA formats.
Agilent: A file containing these headings is recognized as an Agilent format file:

Row	Col	PositionX	PositionY

Illumina: Illumina raw data files are usually either in plain text or binary format. Plain text files are generated by the Illumina GenomeStudio software. The binary “IDAT” files (short for “intensity data file”) are generated by the scanner and can be parsed using R/Bioconductor packages such as illuminaio. IDAT is the preferred file format, as it is a binary format containing all the information required to analyse the data. In contrast, plain-text files can be missing information such as which probes are controls; this is sometimes provided in a separate file, but not always, and heterogeneity in raw data file formats makes systematic analysis of data difficult. Another disadvantage of plain-text files is that they are susceptible to human-introduced errors, as it is easy for someone to open the file in a text editor or spreadsheet program and accidentally change its content. If you’re submitting a GenomeStudio text file, below is an example of the expected column headings:

PROBE_ID	Assay_Name_1.QT1	Assay_Name_1.QT2	Assay_Name_2.QT1	Assay_Name_2.QT2

PROBE_IDs are always in the format of “ILMN_123456”. QT stands for quantitation type, i.e. the type of measurement recorded in the column, e.g. AVGSignal. You can have as many quantitation types as required. Order the columns by sample names, then by quantitation types.
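
The column-ordering rule above can be sketched as a small header check in Python. This is only an illustration under stated assumptions: the function names are hypothetical, probe identifiers are assumed to be "ILMN_" followed by digits, column names are split on their last dot, and every assay is assumed to carry the same set of quantitation types.

```python
import re

# Assumption: a probe identifier is "ILMN_" followed by one or more digits.
PROBE_ID_PATTERN = re.compile(r"ILMN_\d+")


def is_probe_id(value: str) -> bool:
    """Return True if the value looks like an Illumina probe identifier."""
    return PROBE_ID_PATTERN.fullmatch(value) is not None


def check_genomestudio_header(header: list[str]) -> list[str]:
    """Return a list of problems found in a GenomeStudio-style header row."""
    problems = []
    if not header or header[0] != "PROBE_ID":
        problems.append("first column should be PROBE_ID")

    pairs = []
    for column in header[1:]:
        # Split "<assay>.<QT>" on the last dot, so assay names may contain dots.
        assay, dot, qt = column.rpartition(".")
        if not (assay and dot and qt):
            problems.append(f"column {column!r} is not of the form <assay>.<QT>")
        else:
            pairs.append((assay, qt))

    # Expected layout: all QT columns of the first assay, then the second, etc.
    assays = list(dict.fromkeys(assay for assay, _ in pairs))
    qts = list(dict.fromkeys(qt for _, qt in pairs))
    expected = [(assay, qt) for assay in assays for qt in qts]
    if pairs != expected:
        problems.append("columns are not grouped by assay, then by quantitation type")
    return problems
```

An empty list means the header matches the layout shown in the example; any other result lists what to correct before submission.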

GenePix: GenePix format files (usually with file extension .gpr or .txt) are recognised using the following column headings:

Block	Column	Row

NimbleScan: NimbleScan files (Feature, Probe and Pair) all contain the following headings:

PROBE_ID

ScanAlyze: The following column headings are recognized as being from a ScanAlyze format file:

GRID	COL	ROW	LEFT	TOP	RIGHT	BOT

ScanArray/QuantArray: ScanArray Express files are recognized from the following headings:

Array Column	Array Row	Spot Column	Spot Row

while the older QuantArray format has these headings:

Array Column	Array Row	Column	Row

ArrayVision: The following column headings are recognized as indicating an ArrayVision format file:

Primary	Secondary

Newer “lg2” ArrayVision files are identified by the following column headings:

Spot labels

Spotfinder: Spotfinder files are recognized by the following column headings:

BlueFuse: A file containing the following headings is recognized as a BlueFuse file:

COL	ROW	SUBGRIDCOL	SUBGRIDROW

UCSF Spot: UCSF Spot files are recognized by the following column headings:

Arr-colx	Arr-coly	Spot-colx	Spot-coly

Applied Biosystems: Files generated by Applied Biosystems software have the following headings:

Probe_ID	Gene_ID

CodeLink: CodeLink Expression Analysis files are identified using the following:

Logical_row	Logical_col	Center_X	Center_Y

ImaGene: ImaGene files are recognized using the following columns:

Meta Column	Meta Row	Column	Row	Field	Gene ID

The ImaGene 3.0 format is also supported:

Meta_col	Meta_row	Sub_col	Sub_row	Name	Selected

CSIRO Spot: CSIRO Spot files contain the following columns:

grid_c	grid_r	spot_c	spot_r	indexs

Generic (for spotted arrays, non-platform specific): If your raw data file contains BlockColumn/BlockRow/Column/Row fields denoting probe location on a spotted array, you can use this generic format with four columns followed by columns of data:

MetaColumn	MetaRow	Column	Row
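
To illustrate how heading-based recognition can work, here is a minimal Python sketch that compares the column headings of a tab-delimited header line against some of the heading sets listed above. It is not GEA's actual detection logic: the function name, the exact case-sensitive matching, and the choice to return every matching platform are assumptions made for the example.

```python
# Heading sets copied from the platform descriptions above (a subset only).
PLATFORM_SIGNATURES = {
    "Agilent": {"Row", "Col", "PositionX", "PositionY"},
    "GenePix": {"Block", "Column", "Row"},
    "ScanAlyze": {"GRID", "COL", "ROW", "LEFT", "TOP", "RIGHT", "BOT"},
    "ScanArray Express": {"Array Column", "Array Row", "Spot Column", "Spot Row"},
    "QuantArray": {"Array Column", "Array Row", "Column", "Row"},
    "BlueFuse": {"COL", "ROW", "SUBGRIDCOL", "SUBGRIDROW"},
    "CodeLink": {"Logical_row", "Logical_col", "Center_X", "Center_Y"},
    "Generic": {"MetaColumn", "MetaRow", "Column", "Row"},
}


def guess_platforms(header_line: str) -> list[str]:
    """Return the platforms whose headings all appear in the header line."""
    columns = {cell.strip() for cell in header_line.rstrip("\r\n").split("\t")}
    # Assumption: a platform matches when its heading set is a subset of the
    # file's columns, compared exactly (case-sensitive).
    return [
        platform
        for platform, signature in PLATFORM_SIGNATURES.items()
        if signature <= columns
    ]
```

For example, `guess_platforms("Block\tColumn\tRow\tName")` matches only the GenePix heading set, while a header containing none of the listed sets yields an empty list, which would suggest checking the file against the accepted formats by hand.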

Processed data files

Processed files are generated from raw files by procedures such as background correction, normalization, and further statistical analyses (e.g. calculating fold-changes and associated p-values). The final processed data are defined as the data on which the conclusions in the related manuscript are based. We accept either “native” processed files from microarray scanner software (e.g. “.chp” files from Affymetrix scanners, output files from GenomeStudio software for Illumina BeadChip), or two-dimensional spreadsheet files in tab-delimited text (.txt) format. For the latter, the probes/probesets/gene names are in rows, and data from one or more hybridizations are in columns. We accept processed files from the following scenarios:

one processed file per hybridization (recommended), i.e. you have a series of processed files.
one spreadsheet (“matrix”) file containing normalised data from all hybridizations (not recommended).
several spreadsheet (“matrix”) files containing normalised data from different stages of data processing, e.g. one file containing normalized probe intensities and another containing fold-change data summarized at the gene level.

Processed text file format

In the two-dimensional table, you should have probes/genes in rows and samples/data in columns:

Probes/genes in rows: Where possible, as row headers, you should use official probe names/identifiers, matching those in the array design file, so one can map each row of data to the correct probe. Put the probe identifiers in the first column under a heading Reporter Identifier (for probes) or CompositeSequence Identifier (for “composite” collation of probes, most common example being Affymetrix probe sets). If probe identifiers are not available, try to use proper gene symbols or other identifiers (e.g. GenBank cDNA accession, UniProt protein accession).

Samples/Data in columns: Where possible, label each data column with the same sample names as you declare in the SDRF. This allows each column of data to be mapped to the correct sample(s).

A processed .txt file containing data from one single hybridization should look like this:

Reporter Identifier	sample 1 normalised intensity	sample 1 background
probe_name_1	233.5	69.1
probe_name_2	129.4	27.6

And here is an example where gene names are used as row headings:

Human HGNC gene name	sample 1 normalised intensity	sample 1 background
CDKN2A	233.5	69.1
BRCA2	129.4	27.6

Processed “matrices” summarising data from multiple hybridizations should look like the following. Again, as for per-hybridization processed files, if probe identifiers are not available, try to use proper gene symbols or other identifiers (e.g. GenBank cDNA accession, UniProt protein accession).

Matrix of normalised values per sample:

Reporter Identifier	sample 1 normalised	sample 2 normalised	sample 3 normalised	sample 4 normalised
probe_name_1	26.9	44.3	62.3	58.5
probe_name_2	22.9	43.7	58.2	67.4

GenBank accession	sample 1 normalised	sample 2 normalised	sample 3 normalised	sample 4 normalised
BC000578	26.9	44.3	62.3	58.5
M31642	22.9	43.7	58.2	67.4

Matrix of summarised values (one column of data maps to multiple samples):

Reporter Identifier	drug A treated average	drug B treated average	untreated control average
probe_name_1	44.6	89.3	290.15
probe_name_2	98.3	36.7	100.52
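
If your processed values live in a script rather than a spreadsheet, a tab-delimited matrix like the ones above can be written with Python's standard csv module. The sketch below is illustrative only; the function name and its arguments are hypothetical, and it simply refuses to write a row whose number of values differs from the number of data columns.

```python
import csv
import io


def write_matrix(handle, id_heading, column_names, rows):
    """Write a tab-delimited matrix: identifiers in rows, samples in columns.

    rows maps each probe/gene identifier to a list of values, one per column.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow([id_heading, *column_names])
    for identifier, values in rows.items():
        if len(values) != len(column_names):
            raise ValueError(
                f"{identifier}: expected {len(column_names)} values, got {len(values)}"
            )
        writer.writerow([identifier, *values])


# Usage with an in-memory buffer; pass an open text file to write to disk.
buffer = io.StringIO()
write_matrix(
    buffer,
    "Reporter Identifier",
    ["sample 1 normalised", "sample 2 normalised"],
    {"probe_name_1": [26.9, 44.3], "probe_name_2": [22.9, 43.7]},
)
```

When writing to a real file, open it with `newline=""` as the csv module documentation recommends, and give it a .txt extension.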

Additional files

A spike-in list for single-cell analysis or supplementary files for data analysis can be attached to a GEA experiment as “additional files” (for example, E-MTAB-3624). Please contact the GEA team to submit additional files.

Raw data files

Sequencing raw data files must first be registered with the DDBJ Sequence Read Archive (DRA). Please see the accepted data files for DRA.

Processed data files

The final processed data are defined as the data on which the conclusions in the related manuscript are based. We do not expect standard alignment files (e.g., BAM, SAM, BED) as processed data since conclusions are expected to be based on further-processed data. When standard alignments are the only processed data available, please write to us to inquire about whether your data are suitable for submission to GEA. Requirements for processed data files are not fully standardized and will depend on the nature of the experiment.

Expression profiling analysis usually generates quantitative data for features of interest. Features of interest may be genes, transcripts, exons, miRNA, or some other genetic entity. Two levels of data are often generated:

raw counts of sequencing reads for the features of interest, and/or
normalized abundance measurements, e.g., output from Cufflinks, Cuffdiff, DESeq, edgeR, etc.

Either or both of these data types may be supplied as processed data. They may be formatted either as a matrix table or individual files for each sample (recommended). Provide complete data with values for all features (e.g., genes) and all samples, not only lists of differentially-expressed genes.
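
The completeness requirement above (values for all features and all samples) can be checked while combining per-sample counts into a matrix. The following Python sketch assumes the counts are already loaded as one dictionary per sample; the function name and the "feature" column heading are hypothetical choices for the example.

```python
def build_count_matrix(per_sample_counts):
    """Combine per-sample {feature: count} dicts into a header and rows.

    Raises ValueError if any sample lacks a value for any feature, so that
    an incomplete matrix is never produced silently.
    """
    samples = list(per_sample_counts)
    # Union of every feature seen in any sample, in a stable sorted order.
    features = sorted(
        set().union(*(counts.keys() for counts in per_sample_counts.values()))
    )
    missing = [
        (sample, feature)
        for sample in samples
        for feature in features
        if feature not in per_sample_counts[sample]
    ]
    if missing:
        raise ValueError(f"incomplete data, missing (sample, feature) pairs: {missing}")

    header = ["feature", *samples]
    rows = [
        [feature, *(per_sample_counts[sample][feature] for sample in samples)]
        for feature in features
    ]
    return header, rows
```

A feature with zero reads should appear with an explicit 0 rather than being left out, which is why missing entries are treated as an error here instead of being filled in.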

ChIP-Seq data might include peak files with quantitative data, tag density files, etc. Common formats include WIG, bigWig, bedGraph.

Features (e.g., genes, transcripts) in processed data files should be traceable using public accession numbers or chromosome coordinates. The reference assembly used (e.g., hg19, mm9, GCF_000001405.13) should be provided in the normalization data transformation protocol and/or the high throughput sequence alignment protocol. In addition, a description of the format and content of the processed data files should be provided in these protocols.

If you provide WIG, bedGraph, GFF, or GTF files, please refer to the UCSC file format FAQ for format requirements.

Processed matrix files (for advanced users)

For submitters who are familiar with the MAGE-TAB specification, we also accept matrix files in strict MAGE-TAB format. This format allows each data point in the file (in a given row and a given column) to be mapped, both by a human reader and programmatically, to a particular assay in the experiment and to a particular probe/probe set in the array design file. See the guide on the strict matrix format for more information.