In my previous blog post I introduced the Weka GUI, a data mining and machine learning tool written in Java by the University of Waikato in New Zealand. If you read that post, I hope you got a good overview of Weka's capabilities and are now curious to see how you can use it in a Java project. I encourage you to read the previous blog post if you have not heard of Weka yet, even though I will repeat some of the general principles of working with Weka. Whether you have an existing Java project and want to analyse and predict with the data you already deal with, or just want to try out data mining and machine learning in Java, this blog post should help you over the first few obstacles.
Injecting the Weka dependency in Java
The most straightforward way to use the Weka library in Java is to include the corresponding Maven dependency in your pom.xml file. You can grab the latest stable version from here.
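For example, a minimal dependency entry could look like the following; the version number is just a placeholder, so check the link above for the latest stable release:

<dependency>
    <groupId>nz.ac.waikato.cms.weka</groupId>
    <artifactId>weka-stable</artifactId>
    <version>3.8.0</version>
</dependency>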
Loading a data set
To get started with some data analysis tasks you need to get some data processable by Weka into memory. The recommended way to do this is via the DataSource type. You can read the typical Weka ARFF file format or even traditional CSV files.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

DataSource source = new DataSource("/some/where/data.arff");
Instances data = source.getDataSet();
The Instances data type now provides you a convenient way to start with all kinds of data preprocessing and analysis later on.
Preprocessing pipeline
In my previous blog post I already mentioned that data preprocessing is one of the most essential parts of data mining and machine learning tasks. Using Weka directly from Java code enables you to automate this preprocessing, which makes it much faster for a developer or data scientist than manually applying filters over and over again. I want to show you some handy examples you could implement in your preprocessing pipeline to prepare your data for further analysis.
Adding a class attribute
If you want to apply supervised machine learning and your data set has an attribute you want to predict for unseen observations, your first task should be telling Weka about the special role of this attribute. In Weka terms this is called defining it as the class attribute. In Java this can be done with the following line.
data.setClassIndex(data.numAttributes() - 1);
The previous statement assumes that the class attribute is represented by the last column in the file. Even though you have to pass the index of the attribute explicitly here, there is no need to know it by heart, since you can also look up an attribute by name and get its index. The best source for a comprehensive overview of the provided functionality is the Weka Javadocs reference.
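For example, a small sketch of setting the class attribute by looking it up by name instead; the attribute name "label" is just a hypothetical placeholder for one of your own attributes:

data.setClassIndex(data.attribute("label").index());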
Filtering
I want to show you some useful filters I discovered myself; you will probably use them to preprocess your data set too.
Remove attributes

A step you often perform in preprocessing is analysing which attributes are influential for the class attribute and which are not. You can use the Remove filter to get rid of redundant attributes.
Remove remove = new Remove();
remove.setAttributeIndicesArray(attributeIndices);
remove.setInputFormat(data); // the filter needs to know the input format before it is applied

To apply a filter use

Instances newData = Filter.useFilter(data, remove);
Get rid of missing data

It's a common case that your data set contains missing values for some attributes. How these are represented depends on the domain of the data, but in the end you have to deal with them. Weka provides a convenient method, deleteWithMissing(Attribute att), which you can call on an Instances variable to automatically remove all instances with a missing value for the given attribute.
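A minimal usage sketch; the attribute name "age" is a hypothetical placeholder:

data.deleteWithMissing(data.attribute("age"));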
Normalizing data

A nearly inevitable step if you have to deal with data on different scales. Imagine some distances are given in kilometers while another attribute measures distance in millimeters. To make sure a classification algorithm produces promising results, the data has to be normalized to a common range. To do this manually, first iterate over the instances and then over the numeric attributes, setting each value to its normalized version. The following code snippet shows what these iterations and the mutation of the attribute value for the current instance can look like. The calculation of the normalized value itself is omitted here.
Collections.list(dataSet.enumerateInstances())
    .forEach(instance -> Collections.list(instance.enumerateAttributes())
        .stream()
        .filter(Attribute::isNumeric)
        .forEach(attribute -> {
            double normalizedAttributeValue = getNormalizedAttributeValue(...);
            instance.setValue(attribute, normalizedAttributeValue);
        }));
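If you do not need a custom normalization scheme, Weka also ships a ready-made filter for this. A minimal sketch using the Normalize filter from weka.filters.unsupervised.attribute, which follows the same pattern as the Remove filter above and rescales all numeric attributes (by default to the range [0,1]):

Normalize normalize = new Normalize();
normalize.setInputFormat(dataSet); // again, set the input format before applying
Instances normalizedData = Filter.useFilter(dataSet, normalize);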
I hope this gave you a first impression of what working with Weka in Java looks like. Now we will move on to training and evaluating a classifier.
Classification
After preprocessing the data set we can now apply a machine learning algorithm to it. In the case of classification we choose a classifier and train it on the instances we have. A popular method if we only have a training set is cross-validation, which I explained in more detail in my previous post. Training and evaluating the classifier can be as simple as writing a few lines of code.
J48 tree = new J48();
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(tree, data, 10, new Random(1));
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
This code trains a decision tree algorithm on data and evaluates it with a 10-fold cross-validation. The last line prints a statistics summary of the evaluation like the one in the Weka GUI. The second parameter of toSummaryString indicates whether Weka should print some more advanced statistics, which you probably don't need in the beginning.
If you want to evaluate your classifier on a separate test set, that is also possible and can be done the following way.
Instances train = ...
Instances test = ...
Classifier tree = new J48();
tree.buildClassifier(train);
Evaluation eval = new Evaluation(train);
eval.evaluateModel(tree, test);
System.out.println(eval.toSummaryString("\nResults\n======\n", false));
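Once the classifier is trained, you can also use it to predict the class of new, unseen observations. A minimal sketch, where unlabeled is a hypothetical Instances object with the same structure as the training set and its class index already set:

// returns the index of the predicted class value for the first instance
double predictedIndex = tree.classifyInstance(unlabeled.instance(0));
System.out.println(unlabeled.classAttribute().value((int) predictedIndex));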
Conclusion
I intended to keep this blog post a bit shorter and just show some of the fundamental concepts of using Weka in a Java project. You can basically do everything you can do in the GUI, but with the ability to automate things better, as I showed in the preprocessing part. Since data mining and machine learning tasks usually require a lot of experimenting on the way to the optimal result, the Weka GUI can turn out to be a bit tedious. Furthermore, you have all the advantages of the Java programming environment and can include machine learning tasks in existing projects as well. However, to get started very quickly, the Weka GUI introduced last time is a nice tool, and you may consider using it for your first hands-on experience with Weka.