Wednesday, May 21, 2014

Data Mining with Weka

Another skill I'm going to have to pick back up over the course of my graduate career is data mining with pre-built informatics systems. I can write as many data mining algorithms as I want, but engineering them to scale to big datasets is not something I would have much fun doing.

So I can settle for using a pre-built system. For the data analysis in my current project, I will be using Weka.

Weka can be run from the command line or through a GUI. The GUI is simple enough, though, that the command line is rarely the better way to go about these tasks, especially since a lot of the results of data mining algorithms rely on visualizations. For example, actually viewing a decision tree after creating it is far more useful than just reading information gains and prunes off of a screen.

Weka provides another piece of built-in functionality which is immensely useful - the Experimenter. A lot of the time in data mining, you don't know ahead of time which algorithm will perform best, and for good reason: if you were an expert on the dataset and knew exactly how the features interacted, you could form a strong hypothesis about which classifiers would do well, but even that is never a guaranteed answer. With Weka's Experimenter, there is no need to worry about this. The Experimenter allows you to select multiple algorithms to run on a dataset (or datasets - a many-to-many relationship can be established). The results can then be viewed in a nice table format with sorting capabilities:


Above is the result of running the Experimenter with three algorithms on Fisher's famous iris dataset. The ZeroR algorithm is essentially a baseline: it is a classifier that does not rely on any of the features for prediction. Effectively, the class label that appears most often in the training set is predicted for every datapoint that comes through in the test set.

This may seem silly, but there does come a point where this is a crucial baseline. For example, if you have a dataset classifying whether or not someone has pancreatic cancer and your training set has 98% No's for the label (98% of the people in the training set do not have pancreatic cancer), then ZeroR will just predict No for every single test instance. This results in a lot of false negatives, but... it will be correct a LOT of the time. So any classifier that you throw at that dataset will need to beat roughly 98% accuracy - this is really a situation where more data is needed, but being able to recognize that is vital to creating a good classifier.
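Just to make that concrete, here is a minimal sketch in plain Java (no Weka involved, and the 98/2 label split is made up to mirror the example above) of what a majority-class baseline boils down to: count the training labels, always predict the most common one, and treat that label's frequency as the accuracy any real classifier has to beat.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MajorityBaseline {
    public static void main(String[] args) {
        // Hypothetical training labels: 98% "No", 2% "Yes"
        List<String> trainLabels = new ArrayList<>();
        for (int i = 0; i < 98; i++) trainLabels.add("No");
        for (int i = 0; i < 2; i++) trainLabels.add("Yes");

        // Count how often each label appears in the training set
        Map<String, Integer> counts = new HashMap<>();
        for (String label : trainLabels) {
            Integer current = counts.get(label);
            counts.put(label, current == null ? 1 : current + 1);
        }

        // The most frequent label is the prediction for every test instance
        String majority = null;
        int best = 0;
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() > best) {
                best = entry.getValue();
                majority = entry.getKey();
            }
        }

        // Its relative frequency is the baseline accuracy to beat
        double baselineAccuracy = best / (double) trainLabels.size();
        System.out.println("Predict \"" + majority + "\" for everything; baseline accuracy ~ " + baselineAccuracy);
    }
}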

The next two classifiers are J48 (Weka's default decision tree algorithm, an implementation of C4.5) and Naive Bayes. These are both classic algorithms in data science, so I will not dive too much into them for now.
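For anyone who prefers code to the GUI, here is a rough sketch of the same comparison using Weka's Java API - a plain 10-fold cross-validation of ZeroR, J48, and Naive Bayes on one ARFF file, rather than the Experimenter's full repeated-runs setup. The file path is a placeholder (iris.arff ships in Weka's data folder).

import java.util.Random;

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.rules.ZeroR;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        // Load the iris dataset (placeholder path) and mark the last attribute as the class
        Instances data = DataSource.read("data/iris.arff");
        data.setClassIndex(data.numAttributes() - 1);

        Classifier[] classifiers = { new ZeroR(), new J48(), new NaiveBayes() };
        for (Classifier c : classifiers) {
            // 10-fold cross-validation with a fixed seed so runs are repeatable
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(c, data, 10, new Random(1));
            System.out.printf("%-12s %.2f%% correct%n", c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}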

Music listened to while blogging: Mac Miller

Friday, May 16, 2014

A Refreshing SQL

So becoming part of a bigger project has meant I need to re-learn some old skills.

I have not written my own SQL queries since my DATA 210 class 2½ years ago, so a crash course was needed. I started by visiting the w3schools tutorial, which worked great for me since I am re-learning rather than learning from scratch. Additionally, I needed to review regular expressions and how to actually program with them (I have worked through regexes by hand in the past, but I had never actually coded with them). I found a good reference site through Oracle.
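The Oracle site I used is, presumably, the java.util.regex documentation, so here is a small sketch of actually programming a regex instead of working it out by hand. The pattern and the strings are just made-up illustrations:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexExample {
    public static void main(String[] args) {
        // Made-up pattern: match a course code like "DATA 210" or "CSCI 310"
        Pattern courseCode = Pattern.compile("([A-Z]{4})\\s+(\\d{3})");

        String text = "I took DATA 210 before CSCI 310.";
        Matcher matcher = courseCode.matcher(text);

        // Walk through every match and pull out the captured groups
        while (matcher.find()) {
            System.out.println("department=" + matcher.group(1) + ", number=" + matcher.group(2));
        }
    }
}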

Some of the main things that I took away were as follows:


  • SQL Joins
Joins are probably one of the harder techniques for people to grasp when it comes to query languages. Inner joins are the most common form (in SQL, the JOIN keyword defaults to an inner join). An inner join on two tables produces a single result with columns from both tables, keeping only the rows where the join condition matches - typically a field from each table being equal, which is the column the tables are joined "ON". So if a table has a foreign key, you know those two tables can be joined on it because of the constraints imposed on foreign/primary keys. Outer joins differ from inner joins in that rows without a match are kept anyway, with empty (NULL) values filling in the missing side. There are three types of outer joins: left, right, and full. A left join keeps every row from the "left" table, a right join keeps every row from the "right" table, and a full join keeps rows from both. So let's say we have a Persons table and an Amazon Orders table. We could easily have people in the Persons table who have not placed an Amazon Order, while everyone in the Amazon Orders table also appears in the Persons table. If we do a left join, we get one big result that lists each person with their Amazon Orders attributes appended to the end; for people with no Amazon Orders, those attributes come back as NULL or empty values (see the first sketch after this list).
  • Altering Tables
Altering tables is a simple, yet vital functionality in database usage and administration. With ALTER TABLE this can range from deleting attributes (dropping columns) to adding new ones or designating new keys for a table; dropping an entire table is its own command, DROP TABLE. There is not much more explanation needed to understand this.
  • Populating Tables with the INSERT INTO command
Creating and filling new tables is insanely important when dealing with databases. Typically, though, you are not populating a table from scratch. You may want to take features from many tables, perhaps unstructured with a bunch of redundancies, and fill a new, normalized table that prevents wasted disk space and allows for easy querying. With the INSERT INTO ... SELECT command you designate a table to grab information from (choosing the specific attributes you want, if not all of them) and an existing destination table to send that information to (see the second sketch after this list).
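As promised in the joins bullet, here is a sketch of the Persons/Orders left join in code, going through JDBC. The connection URL, table names, and column names are all hypothetical; the point is the shape of the LEFT JOIN and the NULLs that come back for people with no orders.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class LeftJoinExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string and credentials; substitute your own database
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/shop", "user", "password");
             Statement stmt = conn.createStatement()) {

            // Every person appears in the result; order columns are NULL for people with no orders
            String sql = "SELECT p.name, o.order_id, o.total "
                       + "FROM Persons p "
                       + "LEFT JOIN Orders o ON p.person_id = o.person_id";

            try (ResultSet rs = stmt.executeQuery(sql)) {
                while (rs.next()) {
                    // getObject returns null when the LEFT JOIN found no matching order
                    System.out.println(rs.getString("name") + " -> "
                            + rs.getObject("order_id") + " / " + rs.getObject("total"));
                }
            }
        }
    }
}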
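And here is the INSERT INTO ... SELECT pattern from the last bullet, again with hypothetical table and column names. The destination table is assumed to already exist; the statement just copies the chosen attributes across.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class InsertSelectExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string and credentials; substitute your own database
        try (Connection conn = DriverManager.getConnection("jdbc:mysql://localhost/shop", "user", "password");
             Statement stmt = conn.createStatement()) {

            // Copy just the attributes we care about from a messy source table
            // into an already-created, normalized destination table
            String sql = "INSERT INTO CustomerSummary (person_id, name, city) "
                       + "SELECT person_id, name, city FROM RawCustomerDump";

            int rowsCopied = stmt.executeUpdate(sql);
            System.out.println("Copied " + rowsCopied + " rows");
        }
    }
}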

Music listened to while blogging: Schoolboy Q
I'm going to start linking the artist's name directly to a song of theirs, to add a little interactivity to my music updates.

Wednesday, May 14, 2014

A Personal Hurrah and the Road Ahead

So I'm using this post to explain my hiatus from blogging.

I just recently graduated (May 10th) from The College of Charleston with B.S. degrees in Data Science (with a cognate, effectively a minor, in molecular biology) and Computer Science, with a minor in mathematics.

I am now pursuing a two-year Master's degree at The College of Charleston in the Computer and Information Sciences program. The concentration I am following is the Computer Science track, which takes a more classical, theory-based approach to Computer Science. While doing this, I will be conducting research, hopefully as a GRA (graduate research assistant).

Currently, I am doing research for the summer between these programs with Dr. Anderson and others, working on building and modifying machine learning algorithms. A lot of the specifics cannot really be elaborated upon at this point in time, but hopefully I will be able to share that information at some point, because this work is going to be great and is quite the departure from my normal research routines. We're just waiting on IRB approval so I can get going with the data for the project.

Also, I have recently shown an undergraduate student the basic development environment for working with Learn2Mine. I am still one of the lead developers on the project, but a lot of the maintenance will slowly be transitioned to this undergraduate student as he becomes comfortable with the system.