SKIDLkit Data Mining Toolkit

DOWNLOAD

A Quick Summary of SKIDLkit

Introduction

SKIDLkit is a toolkit written at the San Diego Supercomputer Center's Cyberinfrastructure Laboratory for Environmental Observing Systems (CLEOS). It is designed to simplify the selection of key features (variables) in high-dimensional datasets in a data mining context. This matters because usually not all features are relevant to the underlying classification task, and including irrelevant features tends to degrade the results. Identifying the most important features for classification can also shed light on the underlying causes of the scientific phenomena being examined.

SKIDLkit is especially useful for mining datasets with high dimensionality and a small number of available samples, where an extensive parameter search is typically done and full n-fold cross-validation is necessary. A single command line specifies the normalization procedure, the feature selection formula, the induction algorithm (Naive Bayesian Classifier or Support Vector Machine), and the number of features to use. A full n-fold cross-validation is then performed on the training data, and a separate test file can be scored.
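The cross-validation step can be illustrated with a minimal Python sketch. This is not SKIDLkit's code or its command-line interface: the function names (fit_nbc, predict_nbc, cross_validate), the Gaussian form of the Naive Bayesian classifier, and the interleaved fold assignment are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def fit_nbc(X, y):
    """Fit a Gaussian naive Bayes model (assumed form of the NBC).

    Returns {label: (log_prior, [(mean, variance), ...] per feature)}.
    """
    rows_by_class = defaultdict(list)
    for row, label in zip(X, y):
        rows_by_class[label].append(row)
    model = {}
    for label, rows in rows_by_class.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            # Small constant keeps the variance strictly positive.
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-9
            stats.append((mean, var))
        model[label] = (math.log(len(rows) / len(X)), stats)
    return model


def predict_nbc(model, row):
    """Return the label with the highest log-posterior for one sample."""
    def log_posterior(label):
        log_prior, stats = model[label]
        return log_prior + sum(
            -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
            for v, (mean, var) in zip(row, stats)
        )
    return max(model, key=log_posterior)


def cross_validate(X, y, n_folds):
    """n-fold cross-validation accuracy; every sample is held out once."""
    n = len(X)
    correct = 0
    for fold in range(n_folds):
        # Interleaved assignment: sample i belongs to fold i % n_folds.
        test_idx = set(range(fold, n, n_folds))
        train_X = [X[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        model = fit_nbc(train_X, train_y)
        correct += sum(predict_nbc(model, X[i]) == y[i] for i in test_idx)
    return correct / n
```

Any normalization or feature selection would be fitted inside the loop, on the training part of each fold only, so that the held-out samples never influence the model that scores them. Setting n_folds equal to the number of samples gives leave-one-out validation.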

Motivation

The development of this toolkit is motivated by hyperspectral image analysis as well as microarray analysis. A hyperspectral image is an image of the earth's surface consisting of more than a hundred layers (high-dimensional features), each layer measuring the intensity of the image at a certain wavelength. Thus each pixel (location) has more than a hundred values, which together form a spectral signature describing the spectral intensity of the location from infrared to ultraviolet. To identify new pixels, the spectral signature is fed into classification methods. The whole spectrum is often unnecessary, and choosing several key bands, i.e. reducing the dimensionality, has positive effects: computational advantage, ease of analysis, and stability. Removing unnecessary and redundant features leads to simpler and more intuitive analysis, and a smaller feature set takes less time to compute. In addition, feature selection helps avoid overfitting, a situation where the identification works well on the data on which the model was built but not on a new image, because the model is too specifically tailored to that data. Microarray analysis is a similar task: each sample typically has thousands of features (corresponding to gene expression values), but usually no more than a dozen are relevant for classification.

Feature selection methods implemented

Two ways to select important bands are implemented in SKIDLkit. The first is the filter method. In filter methods, all the bands are first ranked by a distance measure (the ability to discriminate the target material from the others), and only the high-ranked bands are then passed as input to the classification algorithm. SKIDLkit implements three distance measures: t-test, prediction strength, and Bhattacharyya distance. The selected features can be used with either a Support Vector Machine (SVM) or a Naive Bayesian classifier (NBC).

The alternative is the wrapper method, which includes the induction algorithm itself as part of the feature selection process. The induction algorithm (such as SVM or NBC) is trained on different subsets of features, and a good subset is chosen according to the accuracy of the resulting classifiers. SKIDLkit implements three wrapper methods: SVM wrapped in a genetic algorithm (GA + SVM), NBC wrapped in a GA (GA + NBC), and recursive feature elimination (RFE) with a Support Vector Machine. Both GA + SVM and GA + NBC evolve a subset of features that serves as good input to the induction algorithm (SVM or NBC), and for both methods the n-fold cross-validation (specifically, jack-knife) accuracy measure is implemented. RFE, on the other hand, starts with the whole feature set and recursively eliminates a small set of features in a greedy way.

After the features are selected, a model is created, and new, unseen samples can be classified using the model. In addition, domain scientists can gain insight into the domain by studying the selected features.
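The filter method described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not SKIDLkit's implementation: the function names (t_score, rank_features) are hypothetical, and the unequal-variance (Welch) form of the t statistic is an assumed choice, since the exact formula used by the toolkit is not given here.

```python
import math


def t_score(a, b):
    """Absolute two-sample t statistic (Welch form, an assumed choice).

    Each group needs at least two values for the sample variance.
    """
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    denom = math.sqrt(var_a / len(a) + var_b / len(b))
    if denom == 0:
        # Constant within both groups: perfectly separating or useless.
        return math.inf if mean_a != mean_b else 0.0
    return abs(mean_a - mean_b) / denom


def rank_features(X, y, positive):
    """Return feature indices ordered from most to least discriminative.

    X is a list of samples (rows), y the labels, and `positive` the
    label of the target class; all other labels form the second group.
    """
    scores = []
    for column in zip(*X):
        target = [v for v, label in zip(column, y) if label == positive]
        others = [v for v, label in zip(column, y) if label != positive]
        scores.append(t_score(target, others))
    # sorted() is stable, so tied features keep their original order.
    return sorted(range(len(scores)), key=lambda j: -scores[j])


def select_top(X, ranking, k):
    """Keep only the k highest-ranked features of every sample."""
    keep = ranking[:k]
    return [[row[j] for j in keep] for row in X]
```

The reduced samples returned by select_top would then be handed to the classifier. A wrapper method differs in that the ranking step is replaced by repeated training of the classifier itself on candidate feature subsets.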



       SDSC — UC San Diego, MC 0505 — 9500 Gilman Drive — La Jolla, CA 92093-0505