SCIRAD - Data
Technologies - Cyberinfrastructure Laboratory
for Environmental Observing Systems
SKIDLkit Data Mining Toolkit
DOWNLOAD
A Quick Summary of SKIDLkit
Introduction
SKIDLkit is a toolkit written at the San Diego
Supercomputer Center, Cyberinfrastructure Laboratory for Environmental
Observing Systems (CLEOS). It was designed to simplify the selection
of key features (variables) in high-dimensional datasets in a data
mining context. This matters because not all features are relevant
to the underlying classification task, and including irrelevant
features usually degrades the results. Identifying the most important
features for classification can also shed light on the underlying
causes of the scientific phenomena being examined.
SKIDLkit is especially useful for mining datasets
of high dimensionality with a small number of available samples,
where an extensive parameter search is typically done and full
n-fold cross-validation is necessary. A single command line can
specify the normalization procedure, the feature selection formula,
the induction algorithm (Naive Bayesian Classifier or Support Vector
Machine), and the number of features to use. A full n-fold
cross-validation is performed on the training data, and a separate
test file can be scored.
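As a rough illustration of the n-fold cross-validation described above,
the following Python sketch scores a minimal Gaussian naive Bayes
classifier on held-out folds. It is not SKIDLkit's code or its command-line
interface: the function names (fold_indices, fit_gaussian_nb,
predict_gaussian_nb, cross_validate), the interleaved fold assignment,
the Gaussian likelihood, and the toy data are all assumptions made for
the example.

```python
import math
from statistics import mean, pvariance


def fold_indices(n_samples, n_folds):
    """Assign sample indices to n_folds folds in an interleaved fashion."""
    return [list(range(i, n_samples, n_folds)) for i in range(n_folds)]


def fit_gaussian_nb(X, y):
    """Per class, store the prior and per-feature mean and variance."""
    model = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        cols = list(zip(*rows))
        model[label] = (
            len(rows) / len(X),
            [mean(c) for c in cols],
            [pvariance(c) + 1e-9 for c in cols],  # small floor avoids division by zero
        )
    return model


def predict_gaussian_nb(model, x):
    """Return the class with the highest log posterior for sample x."""
    def log_posterior(label):
        prior, mus, variances = model[label]
        total = math.log(prior)
        for v, mu, var in zip(x, mus, variances):
            total += -0.5 * math.log(2 * math.pi * var) - (v - mu) ** 2 / (2 * var)
        return total
    return max(model, key=log_posterior)


def cross_validate(X, y, n_folds):
    """Train on n_folds - 1 folds, score the held-out fold, return accuracy."""
    correct = 0
    for test_idx in fold_indices(len(X), n_folds):
        held_out = set(test_idx)
        train_X = [x for i, x in enumerate(X) if i not in held_out]
        train_y = [t for i, t in enumerate(y) if i not in held_out]
        model = fit_gaussian_nb(train_X, train_y)
        correct += sum(predict_gaussian_nb(model, X[i]) == y[i] for i in test_idx)
    return correct / len(X)


# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X = [[0.0, 5.1], [0.2, 4.9], [0.1, 5.0], [0.3, 5.2],
     [5.0, 5.0], [5.2, 5.1], [4.9, 4.8], [5.1, 5.2]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Using as many folds as samples holds out one sample at a time.
accuracy = cross_validate(X, y, n_folds=len(X))
```

On this toy data every held-out sample is classified correctly. With
real data, any normalization or feature selection would normally be
fitted inside each training fold so that the held-out samples do not
influence it.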
Motivation
The development of this toolkit is motivated by
hyperspectral image analysis as well as microarray analysis. A hyperspectral
image is an image of the Earth's surface consisting of more
than a hundred layers (high-dimensional features), each layer measuring
the intensity of the image at a certain wavelength. Thus, each pixel
(location) has more than a hundred values, forming a spectral signature
that describes the spectral intensity of the location from infrared
to ultraviolet. To identify new pixels, the spectral signature
is fed into classification methods. The whole spectrum is often
unnecessary, and choosing several key bands, i.e. reducing the dimensionality,
brings three benefits: computational advantage, ease of analysis,
and stability. Removing unnecessary and redundant features
leads to simpler and more intuitive analysis, and a smaller set
of features takes less time to compute. In addition, feature
selection helps avoid overfitting, a situation where the identification
works well on the data on which the model was built but not on a new
image, because the model is too specifically tailored to that data. Microarray
analysis is a similar task, in which each sample typically has thousands
of features (corresponding to gene expression values), but usually no more
than a dozen are relevant for classification.
Feature selection methods implemented
Two ways to select important bands are implemented
in SKIDLkit. The first is the filter method. In filter methods, all
the bands are first ranked according to a distance measure
(the ability to discriminate the target material from the others),
and then only the high-ranked bands become input to the classification
algorithms. SKIDLkit implements three types of distance measures:
t-test, prediction strength and Bhattacharyya distance. SKIDLkit
also enables the selected features to be used with either a Support
Vector Machine (SVM) or a Naive Bayesian classifier (NBC). An alternative
to this method is the wrapper method, which includes the induction
algorithm itself as a part of the feature selection process. The
induction algorithm (such as SVM or NBC) is trained on different
subsets of features, and a good subset is chosen according to the
accuracy of the resulting classifiers. SKIDLkit implements three kinds of
wrapper methods: SVM wrapped in a genetic algorithm (GA + SVM),
NBC wrapped in a GA (GA + NBC), and recursive feature elimination
(RFE) with a Support Vector Machine. Both GA + SVM and GA + NBC evolve
a subset of features to be a good input to the induction algorithms
(SVM and NBC), and for these two methods the n-fold cross-validation
(specifically, jack-knife) accuracy measure is implemented. RFE,
on the other hand, starts with the whole feature set and recursively
eliminates a small set of features in a greedy way. After the features
are selected, a model is created, and new, unseen samples can be
classified using the model. In addition, domain scientists can gain
insight into the domain by studying the selected features.
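The filter method described above can be sketched in a few lines of
Python. This is an illustration under stated assumptions, not SKIDLkit's
implementation: the function names (welch_t, bhattacharyya,
rank_features) and the toy data are hypothetical, the t-test score is
taken to be the absolute Welch t-statistic, and the Bhattacharyya
distance is computed between univariate Gaussians fitted to each class.
Prediction strength is left out because the text does not give its
formula.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Absolute Welch t-statistic for one feature across two classes."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / standard_error


def bhattacharyya(a, b):
    """Bhattacharyya distance between Gaussians fitted to a and b.

    Assumes both classes have nonzero variance for this feature.
    """
    va, vb = variance(a), variance(b)
    mean_term = 0.25 * (mean(a) - mean(b)) ** 2 / (va + vb)
    variance_term = 0.5 * math.log((va + vb) / (2 * math.sqrt(va * vb)))
    return mean_term + variance_term


def rank_features(X, y, score, top_k):
    """Score every feature for a two-class problem and keep the top_k indices."""
    labels = sorted(set(y))
    if len(labels) != 2:
        raise ValueError("this sketch handles exactly two classes")
    scored = []
    for j in range(len(X[0])):
        a = [x[j] for x, t in zip(X, y) if t == labels[0]]
        b = [x[j] for x, t in zip(X, y) if t == labels[1]]
        scored.append((score(a, b), j))
    scored.sort(reverse=True)  # highest score (most discriminative) first
    return [j for _, j in scored[:top_k]]


# Toy data (made up): only feature 1 separates the two classes.
X = [[5.1, 0.0, 1.0], [4.9, 0.2, 1.2], [5.0, 0.1, 0.9], [5.2, 0.3, 1.1],
     [5.0, 5.0, 1.1], [5.1, 5.2, 1.0], [4.8, 4.9, 1.2], [5.2, 5.1, 0.9]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

selected_by_t = rank_features(X, y, welch_t, top_k=1)
selected_by_bd = rank_features(X, y, bhattacharyya, top_k=1)
```

Both scores pick feature 1 on this toy data. Only the selected columns
would then be passed on to the classifier. A wrapper method would
instead train the classifier on candidate subsets and compare their
cross-validated accuracy.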