edu.cmu.cs.sb.core
Class DataSetCore

java.lang.Object
  extended by edu.cmu.cs.sb.core.DataSetCore
Direct Known Subclasses:
DREM_DataSet, STEM_DataSet

public class DataSetCore
extends java.lang.Object

The class encapsulates a set of gene expression data


Field Summary
 boolean badd0
          True if a column of initial 0's should be added to the data file, false otherwise
 boolean bfullrepeat
          True if the repeat data comes from distinct time series, that is, each repeat is a longitudinal time series; false if each column of the same time point is interchangeable between data sets
 boolean bmaxminval
          True if the gene change threshold for filtering is based on the max-min difference; false if it is based on the absolute difference
 boolean bspotincluded
          True if the spot column was included in the data file
 boolean btakelog
          True if the log-ratio should be taken; false if the data is already in log space
 double[][] data
          The expression data; rows are genes, columns are time points in experiments
 double dmincorrelation
          Minimum average pairwise correlation a gene must have between full repeats if bfullrepeat is true
 java.lang.String[] dsamplemins
          The time points at which the expression data was sampled
 double dthresholdvalue
          The threshold value for required change
 java.lang.String[] genenames
          The list of gene names for the current data set
 double[][][][] generepeatspottimedata
          Expression data for several data sets. First dimension is gene, second dimension is repeat, third dimension is spot, fourth dimension is expression value (in log ratio form against time zero)
 int[][][][] generepeatspottimepma
          Present/missing data for several data sets. First dimension is gene, second dimension is repeat, third dimension is spot, fourth dimension is present/missing value
 double[][][] genespottimedata
          Expression data for one data set. First dimension is gene, second dimension is spot, third dimension is expression value (in log ratio form against time zero)
 int[][][] genespottimepma
          Present/missing data for one data set. First dimension is gene, second dimension is spot, third dimension is present/missing value
 java.util.HashMap htFiltered
          Contains the genes that were filtered.
 int nmaxmissing
          The maximum number of missing values a gene may have without being filtered.
 int numcols
          Number of columns in the data matrix.
 int numrows
          Number of rows in the data matrix.
 java.lang.String[] otherInputFiles
          The names of the other repeat files
 int[][] pmavalues
          0 if data value is missing, non-zero if present
 java.lang.String[] probenames
          The list of probe IDs in the current data set
 double[] sortedcorrvals
          The distribution of all the average pairwise correlations of genes across full repeats
 java.lang.String szGeneHeader
          The header string for the gene name column
 java.lang.String szInputFile
          The designated main data file associated with the set; the others are repeats
 java.lang.String szProbeHeader
          The header string for the spot ID column
 
Constructor Summary
DataSetCore()
          Empty constructor
DataSetCore(DataSetCore theDataSetCore)
          Constructor copies each field
 
Method Summary
 void addExtraToFilter(GoAnnotations tga)
          Adds those genes from tga.extragenes that were filtered to htFiltered
 DataSetCore averageAndFilterDuplicates()
          Removes duplicate gene rows in the data file and combines their values using the median
protected  void dataSetReader(java.lang.String szInputFile, int nmaxmissing, double dthresholdvalue, double dmincorrelation, boolean btakelog, boolean bspotincluded, boolean brepeatset, boolean badd0)
          Reads in the datafile stored in szInputFile
 DataSetCore filterdistprofiles(DataSetCore theDataSet1, DataSetCore[] RepeatSet)
          Computes the average pairwise correlation between gene repeats and stores it in sortedcorrvals.
 DataSetCore filterDuplicates()
          Removes those rows which are the duplicate of another row.
 DataSetCore filtergenesgeneral(boolean[] keepgene, int nkeep, boolean bstore)
          Filters those rows which do not have a true in keepgene. nkeep is the number of true entries in keepgene. If bstore is true and a gene is filtered, then the gene and probe list for it are stored in htFiltered. Returns a new DataSetCore object with those rows filtered.
 DataSetCore filtergenesthreshold1point()
          Filters those genes with expression below dthresholdvalue
 DataSetCore filtergenesthreshold2()
          If bmaxminval is true, then filters those genes for which the difference between the max and min value is less than dthresholdvalue. If bmaxminval is false, then filters those genes for which the absolute expression change is less than dmaxval.
 DataSetCore filterMissing()
          Filters those rows which have nmaxmissing or more missing values or are missing the first time point value
 DataSetCore filterMissing1point()
          Filters those rows that have a missing value at the first time point
 DataSetCore logratio2()
          Converts data into log-ratio versus the first time point.
 DataSetCore mergeDataSets(DataSetCore[] otherDataSets)
          Given an array of otherDataSets, merges it with the current data set by storing in data the median of the values.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Field Detail

szInputFile

public java.lang.String szInputFile
The designated main data file associated with the set; the others are repeats


bfullrepeat

public boolean bfullrepeat
True if the repeat data comes from distinct time series, that is, each repeat is a longitudinal time series; false if each column of the same time point is interchangeable between data sets


otherInputFiles

public java.lang.String[] otherInputFiles
The names of the other repeat files


htFiltered

public java.util.HashMap htFiltered
Contains the genes that were filtered. Maps each gene name to the list of probe IDs associated with it


nmaxmissing

public int nmaxmissing
The maximum number of missing values a gene may have without being filtered.


dmincorrelation

public double dmincorrelation
Minimum average pairwise correlation a gene must have between full repeats if bfullrepeat is true


numrows

public int numrows
Number of rows in the data matrix. This corresponds to the number of genes.


numcols

public int numcols
Number of columns in the data matrix. This corresponds to the number of time points


data

public double[][] data
The expression data; rows are genes, columns are time points in experiments


pmavalues

public int[][] pmavalues
0 if data value is missing, non-zero if present


bspotincluded

public boolean bspotincluded
True if the spot column was included in the data file


genespottimepma

public int[][][] genespottimepma
Present/missing data for one data set. First dimension is gene, second dimension is spot, third dimension is present/missing value


generepeatspottimepma

public int[][][][] generepeatspottimepma
Present/missing data for several data sets. First dimension is gene, second dimension is repeat, third dimension is spot, fourth dimension is present/missing value


genespottimedata

public double[][][] genespottimedata
Expression data for one data set. First dimension is gene, second dimension is spot, third dimension is expression value (in log ratio form against time zero)


generepeatspottimedata

public double[][][][] generepeatspottimedata
Expression data for several data sets. First dimension is gene, second dimension is repeat, third dimension is spot, fourth dimension is expression value (in log ratio form against time zero)


btakelog

public boolean btakelog
True if the log-ratio should be taken; false if the data is already in log space


probenames

public java.lang.String[] probenames
The list of probe IDs in the current data set


genenames

public java.lang.String[] genenames
The list of gene names for the current data set


sortedcorrvals

public double[] sortedcorrvals
The distribution of all the average pairwise correlations of genes across full repeats


dsamplemins

public java.lang.String[] dsamplemins
The time points at which the expression data was sampled


dthresholdvalue

public double dthresholdvalue
The threshold value for required change


bmaxminval

public boolean bmaxminval
True if the gene change threshold for filtering is based on the max-min difference; false if it is based on the absolute difference


badd0

public boolean badd0
True if a column of initial 0's should be added to the data file, false otherwise


szProbeHeader

public java.lang.String szProbeHeader
The header string for the spot ID column


szGeneHeader

public java.lang.String szGeneHeader
The header string for the gene name column

Constructor Detail

DataSetCore

public DataSetCore()
Empty constructor


DataSetCore

public DataSetCore(DataSetCore theDataSetCore)
Constructor copies each field

Method Detail

addExtraToFilter

public void addExtraToFilter(GoAnnotations tga)
Adds those genes from tga.extragenes that were filtered to htFiltered


dataSetReader

protected void dataSetReader(java.lang.String szInputFile,
                             int nmaxmissing,
                             double dthresholdvalue,
                             double dmincorrelation,
                             boolean btakelog,
                             boolean bspotincluded,
                             boolean brepeatset,
                             boolean badd0)
                      throws java.io.IOException,
                             java.io.FileNotFoundException,
                             java.lang.IllegalArgumentException
Reads in the datafile stored in szInputFile

Throws:
java.io.IOException
java.io.FileNotFoundException
java.lang.IllegalArgumentException

filterMissing1point

public DataSetCore filterMissing1point()
Filters those rows that have a missing value at the first time point


filterMissing

public DataSetCore filterMissing()
Filters those rows which have nmaxmissing or more missing values or are missing the first time point value
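
The row test described above can be sketched as follows. This is a minimal illustration, not the class's actual implementation. It assumes the pmavalues convention (0 means missing) and reads the description literally: a row is dropped when it has nmaxmissing or more missing values, or when its first time point is missing. The names MissingFilterSketch and keepRows are hypothetical.

```java
// Hypothetical helper for illustration; not part of edu.cmu.cs.sb.core.
final class MissingFilterSketch {
    /**
     * Returns a mask with true for every row that would be kept.
     * pma[row][col] == 0 marks a missing value, as in pmavalues.
     */
    static boolean[] keepRows(int[][] pma, int nmaxmissing) {
        boolean[] keep = new boolean[pma.length];
        for (int row = 0; row < pma.length; row++) {
            int missing = 0;
            for (int value : pma[row]) {
                if (value == 0) {
                    missing++;
                }
            }
            boolean firstPresent = pma[row].length > 0 && pma[row][0] != 0;
            // Assumption: "nmaxmissing or more missing values" is taken literally.
            keep[row] = firstPresent && missing < nmaxmissing;
        }
        return keep;
    }
}
```

A mask of this shape could then be handed to a row filter such as filtergenesgeneral, together with the count of true entries.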


filterDuplicates

public DataSetCore filterDuplicates()
Removes those rows which are the duplicate of another row. For each gene name, stores in htgenenames the indices of all rows associated with it


averageAndFilterDuplicates

public DataSetCore averageAndFilterDuplicates()
Removes duplicate gene rows in the data file and combines their values using the median
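
As a rough illustration of combining duplicate rows by the median, the sketch below groups rows by gene name and takes the column-wise median within each group. It is not the library's code: it assumes every value is present (missing-value handling is ignored), and the names DuplicateMedianSketch, median, and combine are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not part of edu.cmu.cs.sb.core.
final class DuplicateMedianSketch {
    /** Median of a non-empty array; the input is left unmodified. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return (sorted.length % 2 == 1) ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Collapses rows sharing a gene name into one row of column-wise medians. */
    static Map<String, double[]> combine(String[] genenames, double[][] data) {
        // Group row indices by gene name, preserving first-seen order.
        Map<String, List<Integer>> rowsByGene = new LinkedHashMap<>();
        for (int row = 0; row < genenames.length; row++) {
            rowsByGene.computeIfAbsent(genenames[row], k -> new ArrayList<>()).add(row);
        }
        Map<String, double[]> combined = new LinkedHashMap<>();
        for (Map.Entry<String, List<Integer>> entry : rowsByGene.entrySet()) {
            List<Integer> rows = entry.getValue();
            int numcols = data[rows.get(0)].length;
            double[] merged = new double[numcols];
            for (int col = 0; col < numcols; col++) {
                double[] column = new double[rows.size()];
                for (int i = 0; i < rows.size(); i++) {
                    column[i] = data[rows.get(i)][col];
                }
                merged[col] = median(column);
            }
            combined.put(entry.getKey(), merged);
        }
        return combined;
    }
}
```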


logratio2

public DataSetCore logratio2()
Converts data into log-ratio versus the first time point. If btakelog is true then this is the log base 2 of a value over the time point 0 value. If it is false then this is the difference with the time point 0 value.
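
The two cases described above can be written out as a short sketch. It assumes a plain double[][] matrix with genes as rows and the first column as time point 0, and it ignores missing values. LogRatioSketch and toLogRatio are hypothetical names, not part of the class.

```java
// Hypothetical helper for illustration; not part of edu.cmu.cs.sb.core.
final class LogRatioSketch {
    /**
     * Returns a new matrix in which each value is expressed relative to the
     * first time point of its row.
     */
    static double[][] toLogRatio(double[][] data, boolean btakelog) {
        double[][] out = new double[data.length][];
        for (int row = 0; row < data.length; row++) {
            out[row] = new double[data[row].length];
            if (data[row].length == 0) {
                continue; // nothing to normalize in an empty row
            }
            double timeZero = data[row][0];
            for (int col = 0; col < data[row].length; col++) {
                if (btakelog) {
                    // log base 2 of the value over the time point 0 value
                    out[row][col] = Math.log(data[row][col] / timeZero) / Math.log(2.0);
                } else {
                    // data already in log space: subtract the time point 0 value
                    out[row][col] = data[row][col] - timeZero;
                }
            }
        }
        return out;
    }
}
```

In both cases the first column of the result is 0, since each row is measured against its own first time point.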


mergeDataSets

public DataSetCore mergeDataSets(DataSetCore[] otherDataSets)
Given an array of otherDataSets, merges them with the current data set by storing in data the median of the values. If bfullrepeat is true then also stores the repeat and missing data in generepeatspottimedata and generepeatspottimepma
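
One way to picture the median merge is the sketch below, which takes the element-wise median across repeats. How the real method treats missing values is not stated here, so two things are assumptions of this sketch: values flagged missing (pma == 0) are skipped, and a cell with no present value becomes NaN. The names MergeRepeatsSketch and mergeByMedian are hypothetical, and all repeats are assumed to share the same gene order and dimensions.

```java
import java.util.Arrays;

// Hypothetical helper for illustration; not part of edu.cmu.cs.sb.core.
final class MergeRepeatsSketch {
    /**
     * data[repeat][gene][time] and pma[repeat][gene][time] hold the values and
     * present/missing flags of each repeat. Returns the element-wise median
     * over the present values.
     */
    static double[][] mergeByMedian(double[][][] data, int[][][] pma) {
        int numrows = data[0].length;
        int numcols = data[0][0].length;
        double[][] merged = new double[numrows][numcols];
        for (int row = 0; row < numrows; row++) {
            for (int col = 0; col < numcols; col++) {
                double[] present = new double[data.length];
                int n = 0;
                for (int rep = 0; rep < data.length; rep++) {
                    if (pma[rep][row][col] != 0) {
                        present[n++] = data[rep][row][col];
                    }
                }
                if (n == 0) {
                    merged[row][col] = Double.NaN; // assumption: no value available
                    continue;
                }
                Arrays.sort(present, 0, n);
                merged[row][col] = (n % 2 == 1)
                        ? present[n / 2]
                        : (present[n / 2 - 1] + present[n / 2]) / 2.0;
            }
        }
        return merged;
    }
}
```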


filterdistprofiles

public DataSetCore filterdistprofiles(DataSetCore theDataSet1,
                                      DataSetCore[] RepeatSet)
Computes the average pairwise correlation between gene repeats and stores it in sortedcorrvals. Filters those genes whose average pairwise correlation does not exceed dmincorrelation
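
The quantity compared against dmincorrelation can be illustrated as below. The sketch assumes the correlation is the Pearson coefficient, which this page does not confirm, and it ignores missing values. RepeatCorrelationSketch, pearson, and averagePairwise are hypothetical names.

```java
// Hypothetical helper for illustration; not part of edu.cmu.cs.sb.core.
final class RepeatCorrelationSketch {
    /** Pearson correlation of two equal-length profiles (NaN if either is constant). */
    static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;
        double sxy = 0.0;
        double sxx = 0.0;
        double syy = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            sxy += dx * dy;
            sxx += dx * dx;
            syy += dy * dy;
        }
        return sxy / Math.sqrt(sxx * syy);
    }

    /** Average correlation over all unordered pairs of one gene's repeat profiles. */
    static double averagePairwise(double[][] profiles) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < profiles.length; i++) {
            for (int j = i + 1; j < profiles.length; j++) {
                sum += pearson(profiles[i], profiles[j]);
                pairs++;
            }
        }
        return sum / pairs; // requires at least two profiles
    }
}
```

Under this reading, a gene would be kept when averagePairwise(profiles) exceeds dmincorrelation.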


filtergenesgeneral

public DataSetCore filtergenesgeneral(boolean[] keepgene,
                                      int nkeep,
                                      boolean bstore)
Filters those rows which do not have a true in keepgene. nkeep is the number of true entries in keepgene. If bstore is true and a gene is filtered, then the gene and probe list for it are stored in htFiltered. Returns a new DataSetCore object with those rows filtered.
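
The mask-based filtering described above might look like the following sketch. It works on bare arrays rather than a DataSetCore object, and it records a single probe ID string per filtered gene, which is a simplification of whatever htFiltered actually stores. MaskFilterSketch and filterRows are hypothetical names.

```java
import java.util.Map;

// Hypothetical helper for illustration; not part of edu.cmu.cs.sb.core.
final class MaskFilterSketch {
    /**
     * Returns copies of the rows whose keepgene entry is true. nkeep must equal
     * the number of true entries. When bstore is true, each dropped gene is
     * recorded in the supplied map (gene name to probe ID).
     */
    static double[][] filterRows(double[][] data, String[] genenames, String[] probenames,
                                 boolean[] keepgene, int nkeep, boolean bstore,
                                 Map<String, String> filtered) {
        double[][] kept = new double[nkeep][];
        int next = 0;
        for (int row = 0; row < data.length; row++) {
            if (keepgene[row]) {
                kept[next++] = data[row].clone();
            } else if (bstore) {
                filtered.put(genenames[row], probenames[row]);
            }
        }
        return kept;
    }
}
```

Passing nkeep explicitly lets the result array be allocated once, at the cost of requiring the caller to keep the count and the mask consistent.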


filtergenesthreshold2

public DataSetCore filtergenesthreshold2()
If bmaxminval is true, then filters those genes for which the difference between the max and min value is less than dthresholdvalue. If bmaxminval is false, then filters those genes for which the absolute expression change is less than dmaxval.
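
The difference between the two modes can be seen in a small sketch. It assumes that "absolute expression change" means the largest absolute value in a row that is already expressed relative to time point 0, and that the same threshold, dthresholdvalue, applies in both modes. Neither assumption is confirmed by this page. ChangeThresholdSketch and keepGene are hypothetical names.

```java
// Hypothetical helper for illustration; not part of edu.cmu.cs.sb.core.
final class ChangeThresholdSketch {
    /** Returns true if the row shows enough change to be kept. */
    static boolean keepGene(double[] row, double dthresholdvalue, boolean bmaxminval) {
        double max = Double.NEGATIVE_INFINITY;
        double min = Double.POSITIVE_INFINITY;
        double maxAbs = 0.0;
        for (double value : row) {
            max = Math.max(max, value);
            min = Math.min(min, value);
            maxAbs = Math.max(maxAbs, Math.abs(value));
        }
        // Max-min mode measures the spread; the other mode measures the
        // largest departure from zero (assumed to be the time point 0 level).
        double change = bmaxminval ? (max - min) : maxAbs;
        return change >= dthresholdvalue;
    }
}
```

For a row such as {0, -0.6, 0.6} and a threshold of 1.0, the max-min mode keeps the gene (spread 1.2) while the absolute mode filters it (largest absolute value 0.6).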


filtergenesthreshold1point

public DataSetCore filtergenesthreshold1point()
Filters those genes with expression below dthresholdvalue