Description of DNA repair datasets
==================================

This file describes the contents of 16 datasets about DNA repair genes and ageing which were used in the research reported in the following paper:
A. Freitas, O. Vasieva, J.P. de Magalhaes. (2011) "A data mining approach for classifying DNA repair genes into ageing-related or non-ageing-related." BMC Genomics, 12:27.

These 16 datasets are included in the compressed file "DNA-repair-datasets.zip".

Each of the 16 datasets represents a classification problem where the goal is to find a classification model that can predict the class (ageing-related or non-ageing-related) of a DNA repair gene based on the values of predictor attributes describing that gene. All datasets contain essentially the same two classes of DNA repair genes, but they differ in the types of predictor attributes they contain, as explained below.

Each dataset is provided in a separate file. All files are in the format used by the data mining tool WEKA (indicated by the file type extension ".arff"). This tool is freely available from: http://www.cs.waikato.ac.nz/ml/weka/

In each dataset, each instance (row) corresponds to a gene or protein, which is described by a number of attributes (columns). The last attribute is the class to which the gene or protein belongs.
The lines containing the actual data start after the line with the keyword "@DATA". Before that line, each file contains a line with the name of the dataset (starting with the keyword "@relation") and many lines starting with the keyword "@attribute". The "@attribute" lines describe the names of the predictor attributes and their possible values, as well as the classes to be predicted.
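
As an illustration of this layout, the following is a minimal Python sketch of a reader for such files. It is not part of WEKA and handles only the simple case described above (one "@relation" line, "@attribute" lines, then comma-separated rows after "@DATA"). It assumes that values contain no quoted commas and that the sparse ARFF format is not used. The function name read_arff, the toy relation name, the gene names, and the name and labels of the class attribute are made up for the example and are not taken from the datasets.

```python
import io


def read_arff(lines):
    """Parse a simple ARFF stream into (relation, attributes, rows).

    Simplified sketch: no quoted commas, no sparse rows.
    """
    relation = None
    attributes = []  # (name, type or value list) in file order
    rows = []
    in_data = False
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            rows.append([v.strip().strip("'\"") for v in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()  # keywords are case-insensitive
        if keyword == "@relation":
            relation = line.split(None, 1)[1].strip().strip("'\"")
        elif keyword == "@attribute":
            _, name, spec = line.split(None, 2)
            attributes.append((name.strip("'\""), spec.strip()))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


# Toy file content: all names and values below are hypothetical.
sample = """@relation toy-example
@attribute Protein_name string
@attribute Length numeric
@attribute GO:0000012 {0,1}
@attribute class {ageing-related,non-ageing-related}
@DATA
GENE_A,250,1,ageing-related
GENE_B,410,0,non-ageing-related
"""

relation, attributes, rows = read_arff(io.StringIO(sample))
print(relation)
print([name for name, _ in attributes])
print(rows[0][-1])  # the last attribute is the class
```

To read one of the real files, the same function could be given an open file object instead of the in-memory sample.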

These datasets were prepared specifically for the classification task of data mining; therefore, in principle, any classification algorithm available in WEKA can be run on them. For more information about WEKA's ".arff" file format, as well as about the classification algorithms available in WEKA and data mining in general, the reader is referred to the following book:
I.H. Witten and E. Frank. Data Mining: practical machine learning tools and techniques. 2nd Ed. Morgan Kaufmann, 2005.

The 16 datasets can be divided into two groups. The first group consists of 15 datasets, each of them using multiple types of predictor attributes, but not using gene expression attributes. The second group consists of a single dataset, using only gene expression attributes as predictor attributes.

All 16 file names start with "DNA-repair-2-classes", and the following abbreviations are used in the remainder of the file names for datasets in the first group (for more details about the terms below, please see the above-cited paper by Freitas et al.):

"no-PPI-attribute" means the dataset contains no Protein-Protein Interaction attribute.

"BP-GO-attributes-occur-threshold-X" means the dataset contains Biological Process Gene Ontology attributes with Occurrence Threshold set to X, where X can take the value 3, 7 or 11. Each of these GO attributes is a binary attribute, taking the value "1" or "0" to indicate whether or not (respectively) a gene is annotated with the corresponding GO term. Each of these attributes is identified in a file by the corresponding GO term id, e.g. "GO:0000012". 

"Num-Prot-Int-attribute" means the dataset contains an attribute whose value is the Number of Proteins Interacting with the current protein (referred to as the '#partners' attribute in the above-cited paper). This attribute is named "NumInter" in the datasets.

"X-PPI-attributes" means the dataset contains X Protein-Protein Interaction attributes, where X can take the value 10, 20 or 30. Each of these attributes is a binary attribute, taking the value "1" or "0" to indicate whether or not (respectively) a gene's protein product interacts with the corresponding protein. Each of these attributes has a name of the form "X_interaction", where X is the name of a protein, e.g. "ATM_interaction", which is an attribute indicating whether or not a gene's protein product interacts with the ATM protein.

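The file-name abbreviations above can also be decoded programmatically. The following Python sketch looks for those abbreviations as substrings of a file name. The function name describe_dataset and the keys of the returned dictionary are hypothetical, and the first example file name is made up to show the idea: the exact way the abbreviations are joined in the real file names is not specified here, so the names in the zip file should be checked.

```python
import re


def describe_dataset(file_name):
    """Report which attribute groups a first-group file name advertises."""
    info = {
        "ppi_attributes": 0,              # number of binary "X_interaction" attributes
        "go_occurrence_threshold": None,  # 3, 7 or 11 when GO attributes are present
        "has_num_inter": "Num-Prot-Int-attribute" in file_name,
    }
    m = re.search(r"BP-GO-attributes-occur-threshold-(\d+)", file_name)
    if m:
        info["go_occurrence_threshold"] = int(m.group(1))
    # "no-PPI-attribute" does not match this pattern, so the count stays 0.
    m = re.search(r"(\d+)-PPI-attributes", file_name)
    if m:
        info["ppi_attributes"] = int(m.group(1))
    return info


# Hypothetical file name, assembled only from the abbreviations listed above.
example = "DNA-repair-2-classes-BP-GO-attributes-occur-threshold-7-20-PPI-attributes.arff"
info = describe_dataset(example)
print(info)

no_ppi = describe_dataset("DNA-repair-2-classes-no-PPI-attribute.arff")
print(no_ppi)
```
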
In addition, all 15 datasets in the first group contain the following 4 attributes, which are not explicitly indicated in the names of the files:
"Protein_name" is the name of the gene or protein.
"Length" is the length of the protein's sequence of amino acids.
"Gene_Wood_type" is the main type of DNA repair pathway associated with the gene, according to a somewhat simplified version of the categorization of DNA repair genes proposed by Wood.
"KaKi" is an attribute whose value is the Ka/Ks ratio, a measure of a gene's rate of evolutionary change.

Important Notes: 
----------------
(1) The attribute "Protein_name" was included in the datasets only to allow the interpretation of the results, e.g. to allow us to identify all the genes/proteins covered by a classification rule or a decision tree leaf node. However, this attribute should NOT be given as input to a classification algorithm, since it has no generalization power. (It has a unique value for each gene/protein in the dataset, so a classification algorithm cannot use it to discover generic patterns that cover two or more genes/proteins.)
Hence, when applying a classification algorithm to any of these datasets, the attribute "Protein_name" should be removed from the data before running the classification algorithm. This can be easily done in the WEKA Explorer tool, by checking the box beside the attribute name and clicking on the button "Remove". This will not physically remove the attribute from the file, but it will have the desired effect of removing the attribute from the set of attributes being considered by WEKA for data mining purposes.
(2) The attribute "Length" can of course be used by classification algorithms, but it should be noted that this attribute was not used in the experiments reported in the above-cited paper by Freitas et al.
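
For users who process the data outside the WEKA Explorer, note (1) can be followed in code as well. The Python sketch below drops an identifier attribute from already-parsed rows and returns the identifiers separately, so that results can still be traced back to individual genes/proteins. The function name drop_attribute, the toy gene names, the attribute values and the class attribute name are hypothetical and serve only as an illustration.

```python
def drop_attribute(attribute_names, rows, name="Protein_name"):
    """Remove one attribute from every row and return its values separately."""
    idx = attribute_names.index(name)  # raises ValueError if the attribute is absent
    kept_names = attribute_names[:idx] + attribute_names[idx + 1:]
    kept_rows = [row[:idx] + row[idx + 1:] for row in rows]
    identifiers = [row[idx] for row in rows]  # kept aside for interpreting results
    return kept_names, kept_rows, identifiers


# Toy data: values are made up and are not taken from the datasets.
names = ["Protein_name", "Length", "KaKi", "class"]
data = [
    ["GENE_A", "250", "0.12", "ageing-related"],
    ["GENE_B", "410", "0.30", "non-ageing-related"],
]

kept_names, kept_rows, identifiers = drop_attribute(names, data)
print(kept_names)
print(kept_rows)
print(identifiers)
```

The same helper could be called with name="Gene_name" for the gene expression dataset described below.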

Concerning the single dataset belonging to the second group, its file is named:
"DNA-repair-2-classes-gene-expression-attributes.arff".
This dataset contains only one type of predictor attribute, namely gene expression attributes.
The file also contains one attribute called "Gene_name".

Important Note:
---------------
Analogously to the attribute "Protein_name" in the other datasets, the attribute "Gene_name" in the gene expression dataset should NOT be given as input to a classification algorithm, since it has no generalization power.