Wednesday, May 21, 2014

Data Mining with Weka

Another skill I'm going to have to pick back up over the course of my graduate career is data mining with pre-built informatics systems. I can write as many data mining algorithms as I want, but engineering them to scale to big datasets is something I would not have much fun doing.

So I can settle for using a pre-built system. For the data analysis in my current project, I will be using Weka.

Weka can be run from the command line or through a GUI. The GUI is simple enough, though, that the command line is not usually the better way to go about these tasks, especially since a lot of data mining results rely on visualizations. For example, actually viewing a decision tree after creating it is far more useful than just reading information gains and pruning decisions off of a screen.
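
For anyone curious about the programmatic route, here is a minimal sketch using Weka's Java API (assuming weka.jar is on the classpath; the iris.arff path is a placeholder for whatever ARFF file you actually have). It trains a decision tree with J48, Weka's tree learner, and prints the textual dump of the tree - exactly the kind of output the GUI's tree visualizer saves you from squinting at:

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TreeDump {
    public static void main(String[] args) throws Exception {
        // Load the dataset; the path is a placeholder for your own ARFF file
        Instances data = DataSource.read("iris.arff");
        data.setClassIndex(data.numAttributes() - 1); // class label is the last attribute

        // Build the decision tree on the full dataset
        J48 tree = new J48();
        tree.buildClassifier(data);

        // toString() gives the plain-text tree you would otherwise read off the screen
        System.out.println(tree);
    }
}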

Weka provides another piece of built-in functionality that is immensely useful - the Experimenter. A lot of times in data mining, you don't know ahead of time which algorithm will perform best... and you don't know this for good reason. If you were an expert on the dataset and knew exactly how the features interact, you could form a great hypothesis about which classifiers would do well, but even that is never a guarantee. With Weka's Experimenter, there is no need to worry about this. The Experimenter lets you select multiple algorithms to run on a dataset (or datasets - a many-to-many relationship can be established). The results can then be viewed in a nice table format with sorting capabilities:


[Screenshot: Experimenter results table for three algorithms on the iris dataset]

Above is the result of running the Experimenter with three algorithms on Fisher's famous iris dataset. The ZeroR algorithm is essentially a baseline: a classifier that does not rely on any of the features for prediction. Effectively, the class label that appears most often in the training set is predicted for every datapoint that comes through in the test set.

This may seem silly, but there are situations where this baseline is crucial. For example, suppose you have a dataset classifying whether or not someone has pancreatic cancer, and 98% of the training labels are No (98% of the people in the training set do not have pancreatic cancer). The baseline classifier will then predict No for every single test instance. Every actual positive becomes a false negative, but... it will be correct a LOT of the time. So any classifier you throw at that dataset will need to outperform 98% accuracy - this is really a situation where more data is needed, but being able to recognize that is vital to creating a good classifier.
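
To make that concrete, here is a hedged sketch (the cancer.arff path is an assumption, standing in for an imbalanced dataset like the one described above) that cross-validates ZeroR and prints the confusion matrix - the summary accuracy looks great, while the matrix shows every actual positive landing in the false-negative cell:

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BaselineCheck {
    public static void main(String[] args) throws Exception {
        // Placeholder path - imagine a dataset where 98% of the labels are "No"
        Instances data = DataSource.read("cancer.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of the majority-class baseline
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new ZeroR(), data, 10, new Random(1));

        // Accuracy sits near the majority-class proportion (~98% here),
        // but the confusion matrix reveals all the positives being missed
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());
    }
}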

The next two classifiers are J48 (Weka's decision tree algorithm, an implementation of C4.5) and Naive Bayes. These are both classic algorithms in data science, so I will not dive too deeply into them for now.
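
For a rough feel of what the Experimenter automates, here is a sketch comparing all three classifiers through the Java API, each with 10-fold cross-validation (the dataset path is again a placeholder; the real Experimenter also handles repeated runs and paired significance tests, which this skips):

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MiniExperiment {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("iris.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Same three classifiers as the Experimenter run above
        Classifier[] classifiers = { new ZeroR(), new NaiveBayes(), new J48() };
        for (Classifier c : classifiers) {
            // Fresh Evaluation per classifier; fixed seed keeps the runs comparable
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-12s %.2f%% correct%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}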

Music listened to while blogging: Mac Miller
