CMSB tutorial 8: Machine Learning - WEKA

David Gilbert and José Antonio Reyes

The aim of this lab is to give you practical experience in using WEKA for the machine learning applications covered in the lecture on machine learning for microarray classification.

Resources:

Useful resources about WEKA are available at www.cs.waikato.ac.nz/ml/weka

The WEKA datafiles for this tutorial can be found here.

Exercises:

  1. Practise using WEKA with the Play Golf classification example.

  2. Classification of breast cancer examples.
    Download the file Breast_Cancer.arff, which contains a set of 699 cases with 9 attributes plus a class attribute giving the type of cancer cell (in this dataset class 4 denotes malignant cells and class 2 denotes benign cells). This dataset is from the Wisconsin Breast Cancer Database (January 8, 1991). You can find this and other example datasets at this link

    Classify the examples in the "Breast_Cancer.arff" dataset (benign and malignant cells) using the four classifiers mentioned in exercise 1, and compare the results.

    NOTE: This dataset contains numerical data, so you cannot use the Id3 classifier (Id3 only supports nominal attributes). Instead, build a decision tree with the J48 classifier using the following command:

    java weka.classifiers.trees.J48 -t PATH/Breast_Cancer.arff
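
To see why Id3 rejects this file, it can help to check which attribute types the ARFF header declares. The following shell sketch counts the nominal and numeric attributes. The function name count_arff_attribute_types is made up for this illustration, and the parsing is deliberately simple: it assumes one @attribute declaration per line and treats any declaration containing a brace as nominal.

```shell
#!/bin/sh
# Hypothetical helper (not part of WEKA): summarise the attribute types
# declared in an ARFF header. Reading stops at the @data line, so the
# data rows themselves are never scanned.
count_arff_attribute_types() {
    awk '
        { line = tolower($0) }              # ARFF keywords are case-insensitive
        line ~ /^[ \t]*@data/ { exit }      # end of header; END still runs
        line ~ /^[ \t]*@attribute/ {
            if (index(line, "{") > 0)
                nominal++                   # e.g. @attribute class {2,4}
            else if (line ~ /(numeric|real|integer)[ \t\r]*$/)
                numeric++                   # e.g. @attribute x numeric
            else
                other++                     # string, date, ...
        }
        END { printf "nominal=%d numeric=%d other=%d\n", nominal, numeric, other }
    ' "$1"
}

# Example call (PATH is the same placeholder used in the command above):
#   count_arff_attribute_types PATH/Breast_Cancer.arff
```

If the numeric count is greater than zero, a classifier restricted to nominal attributes, such as Id3, cannot be applied to the file as it stands.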

  3. Classification of Gene expression data.
    Download the file ALLAML.arff (Golub et al. 1999), a gene expression dataset that contains 72 examples, 7129 genes (attributes) and 2 classes: "acute myeloid leukemia (AML)" and "acute lymphoblastic leukemia (ALL)". For more information, see the gene list in the file ALLAML.gene_names.txt and the paper Golub et al. 1999

    Classify the examples in this dataset (ALL or AML class) using the four classifiers mentioned in exercise 1, and compare the results.

    Interpretation: Go to PubMed and search for the selected genes. Do they have any biological meaning? Can you identify the function of the unknown genes? (Try using other bioinformatics tools.)
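
The "classify and compare" step in exercises 2 and 3 can be scripted so that every classifier is run on the same file with the same settings. The sketch below is one possible way to do this and rests on several assumptions that are not part of this tutorial: the location of weka.jar, the 1g heap size (a guess at what a file with 7129 attributes might need), and the function names. The classifier class names are passed as arguments, so substitute the four classifiers from exercise 1. The -x option sets the number of cross-validation folds explicitly.

```shell
#!/bin/sh
# Assumed settings (adjust to your installation); none come from the tutorial.
WEKA_JAR="${WEKA_JAR:-weka.jar}"   # path to the WEKA jar file
HEAP="${HEAP:-1g}"                 # JVM heap size passed via -Xmx
FOLDS="${FOLDS:-10}"               # number of cross-validation folds (-x)

# Print the command line for one classifier class ($1) and one ARFF file ($2).
weka_command() {
    printf 'java -Xmx%s -cp %s %s -t %s -x %s\n' \
        "$HEAP" "$WEKA_JAR" "$1" "$2" "$FOLDS"
}

# Run each classifier on the dataset in turn.
# With DRY_RUN=1 the commands are only printed, not executed.
compare_classifiers() {
    dataset="$1"
    shift
    for classifier in "$@"; do
        echo "== $classifier"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            weka_command "$classifier" "$dataset"
        else
            java "-Xmx$HEAP" -cp "$WEKA_JAR" "$classifier" -t "$dataset" -x "$FOLDS"
        fi
    done
}

# Example call (the two class names are only illustrations):
#   compare_classifiers PATH/ALLAML.arff \
#       weka.classifiers.trees.J48 weka.classifiers.bayes.NaiveBayes
```

Keeping the fold count identical across runs makes the reported accuracies and confusion matrices easier to compare side by side.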