Wednesday, September 24, 2014

Short Update

It's been a while since my last post because I have been busy.

A quick personal update:
1) I am now attending the Master's program at CofC and am taking 2 classes currently (CSIS 602 - Software Engineering & CSIS 604 - Distributed Systems)
2) I am now, officially, a Graduate Research Assistant doing research with CofC (with MUSC collaborators)
3) I also have a web development job for Innovative Resource Management
4) I have a publication coming out in the FIE (Frontiers in Education) 2014 conference proceedings [should be out around October]

So, for my research, I need to learn (or partially review) certain topics. I am choosing to use Python for the data munging and analysis in my research, which naturally requires the numpy and pandas libraries. I have used numpy in the past when building decision trees, Naive Bayes models, neural networks, etc., so I am mostly familiar with it and am merely reviewing. Pandas, however, is a library I have never dealt with, so I am working through tutorials to get up to speed.

Music listened to while blogging: Bibio

Thursday, July 17, 2014

NLP Paper Review

For this post, I will be sharing a Prezi presentation that I recently presented on a paper from the NLP conference I attended in Baltimore.

http://prezi.com/zac-hm_osqwo/?utm_campaign=share&utm_medium=copy

The presentation really speaks for itself.

Music listened to while blogging: GEMS

Wednesday, July 2, 2014

ACL 2014: BioNLP Conference

So I recently came back from the BioNLP (Biomedical Natural Language Processing) workshop at the Association for Computational Linguistics' annual conference in Baltimore, Maryland (I am now an ACL member). Firstly, I stayed in a hotel across from the Baltimore Orioles' stadium and, not being the biggest fan of baseball, I definitely got my fill of baseball fans, hat-sellers, and hot dog vendors constantly yelling about their hatred for the Yankees.

Since I only attended the two-day workshop, this was a little different from my normal conference travel. I attended sessions where people were doing molecular NLP tasks (such as querying PubMed and other journals) in order to gather data and conduct metadata or real data analysis. These researchers typically utilized SVMs in their algorithmic analysis, which gives me good ideas about where to take my own research. Unfortunately, most of the sessions were oriented toward molecular NLP tasks, whereas my focus is more on clinical NLP, which is a different type of problem by nature: scientific, structured writing is a lot easier to parse than unstructured notes written by different medical professionals.

No one at the conference is using the NLP system that I am using, which was a disappointment, but I was able to broaden my horizons to other systems such as i2b2 and BioCreative. In my own research we are utilizing cTAKES/ytex. The conference included a panel of scientists who helped create these newer systems, so it was a nice surprise to hear what it's like on the other side of research.

I will elaborate more on these systems when we decide if we want to steer away from the usage of cTAKES for one of these newer systems or if we decide to keep going down the road with which we are familiar.

Music listened to while blogging: Hellyeah

Tuesday, June 24, 2014

ACL 2014 and Journal Acceptance

So for this post, I will give a quick update on what I've been up to since the Summer started.

First, I am going to Baltimore tomorrow to attend a workshop 6/26-6/27 on Biomedical Natural Language Processing (BioNLP). The workshop, part of the Association for Computational Linguistics 2014 annual conference hosted by Johns Hopkins University, includes presentations on the creation of NLP techniques for parsing, the analysis of NLP-parsed data (specifically biomedical), and the utilization of tools/resources such as the Unified Medical Language System (UMLS) and Systematized Nomenclature of Medicine (SNOMED), among many others.

Recently, a paper we submitted to the Frontiers in Education 2014 conference was accepted. I'll talk more about the conference and our paper when it is time for the conference. I really hope I can attend because it is in Madrid, Spain, which would be a great place to visit. Additionally, the conference could help the Learn2Mine team garner new ideas to incorporate into our own application.

Music listened to while blogging: Sublime

Wednesday, May 21, 2014

Data Mining with Weka

Another skill I'm going to have to pick back up over the course of my graduate career is data mining with pre-built informatics systems. I can write as many data mining algorithms as I want, but engineering them to scale to big datasets is not something I would have much fun doing.

So I can settle for using a pre-built system. For the data analysis in my current project, I will be using Weka.

Weka can be run from the command line or through a GUI. The GUI is simple enough that the command line is rarely the better way to go about these tasks, especially since many data mining results rely on visualizations. For example, actually viewing a decision tree after creating it is far more useful than just reading information gains and prunes off of a screen.

Weka provides another piece of built-in functionality which is immensely useful: the Experimenter. A lot of times in data mining you are faced with not knowing which algorithm will perform best until you have run it... and you don't know this for good reason. If you were an expert on the dataset and knew exactly how the features interact, then you could form a strong hypothesis about which classifiers would perform best, but that is never a guaranteed solution. With Weka's Experimenter, there is no need to worry about this. The Experimenter allows you to select multiple algorithms to be run on a dataset (or datasets - a many-to-many relationship can be established). The results can then be viewed in a nice table format with sorting capabilities:


Above is the result of running the Experimenter with three algorithms on Fisher's famous iris dataset. The ZeroR algorithm is essentially a baseline, as it is a classifier that does not rely on any of the features for prediction. So, effectively, the class label that appears the most in the training set will be predicted for every datapoint that comes through the testing set.

This may seem silly, but there does come a point where this is a crucial baseline. For example, if you have a dataset classifying whether or not someone has pancreatic cancer and your training set has 98% No's for the label (98% of the people in the training set do not have pancreatic cancer), then the baseline will just predict No for every single test instance. This can result in a high number of false negatives, but... it will be correct a LOT of the time. So any classifier that you throw at that dataset will need to outperform 98% accuracy - this is really a situation where more (and more balanced) data is needed, but being able to recognize that is vital to creating a good classifier.
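To make the baseline idea concrete, here is a quick ZeroR-style sketch in plain Python (not Weka itself, and the labels are made up):

from collections import Counter

# Made-up labels: 98% "No" in training, like the pancreatic cancer example above.
train_labels = ["No"] * 98 + ["Yes"] * 2
test_labels = ["No"] * 47 + ["Yes"] * 3

# ZeroR-style baseline: always predict whichever class is most common in training.
majority_class = Counter(train_labels).most_common(1)[0][0]
predictions = [majority_class for _ in test_labels]

accuracy = sum(p == t for p, t in zip(predictions, test_labels)) / float(len(test_labels))
print("Baseline predicts '{0}' every time; accuracy = {1:.0%}".format(majority_class, accuracy))
# Any real classifier has to beat this number before it is worth anything on this data.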

The next two classifiers are J48 (Weka's default decision tree algorithm) and Naive Bayes. These are both classic algorithms in data science, so I will not dive too deeply into them for now.

Music listened to while blogging: Mac Miller

Friday, May 16, 2014

A Refreshing SQL

So becoming part of a bigger project has meant re-learning some old skills.

I have not written my own SQL queries since my DATA 210 class 2½ years ago, so a crash course was needed. I started with the w3schools tutorial, which worked great for me since I am re-learning as opposed to learning from scratch. Additionally, I needed to review regular expressions and how to actually write code with them (I have worked through regexes by hand in the past, but I never actually programmed with them). I found a good reference site through Oracle.

Some of the main things that I took away were as follows:


  • SQL Joins
Joins are probably one of the harder techniques for people to grasp when it comes to query languages. Inner joins are the most common form (in SQL, the JOIN keyword defaults to an inner join). An inner join on 2 tables results in a single result set with data from both tables; the join requires a field from each table to match, as that is the column being joined "ON". So if you have a foreign key in a table, then you know those two tables can be joined because of the constraints imposed on foreign/primary keys. Outer joins differ from inner joins in that unmatched rows are kept, so their missing columns come back empty. There are different types of outer joins: left, right, and full. A left join keeps every row of the "left" table, while a right join keeps every row of the "right" one. So let's say we have a Persons table and an Amazon Orders table. We could easily have people in the Persons table who have not placed an Amazon order, while everyone in the Amazon Orders table is also in the Persons table. If we do a left join, then we get one big result set listing each person with their Amazon Orders attributes appended to the end. For people who have no Amazon orders, those attributes come back as NULL or empty values (a runnable sketch of this follows this list).
  • Altering Tables
Altering tables is a simple, yet vital functionality in database usage and administration. This can range from dropping tables or deleting attributes to designating new keys and features for a table. There is not much explanation needed to understand this.
  • Populating Tables with the INSERT INTO ... SELECT command
Creating new tables is insanely important when dealing with databases. Typically, though, you are not populating a table from scratch. You may want to take features from many tables, perhaps unstructured with a bunch of redundancies, and create a new, normalized table that prevents wasted disk space and allows for easy querying. With INSERT INTO ... SELECT you designate a table from which to grab information (selecting the specific attributes you want, if not all of them) and the table into which that information should go (this command also appears in the sketch after this list).
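To make the join and INSERT INTO ... SELECT ideas concrete, here is a minimal Python sketch using the built-in sqlite3 module with a made-up Persons/Orders schema (not the tables from my actual project):

import sqlite3

# Toy Persons / Orders schema, made up for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Persons (person_id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE Orders (order_id INTEGER PRIMARY KEY, person_id INTEGER, item TEXT)")
cur.executemany("INSERT INTO Persons VALUES (?, ?)",
                [(1, "Alice"), (2, "Bob"), (3, "Carol")])
cur.executemany("INSERT INTO Orders VALUES (?, ?, ?)",
                [(10, 1, "book"), (11, 1, "lamp"), (12, 3, "pens")])

# LEFT JOIN: every person shows up; Bob has no orders, so his item comes back NULL (None).
for row in cur.execute("SELECT p.name, o.item FROM Persons p "
                       "LEFT JOIN Orders o ON p.person_id = o.person_id"):
    print(row)

# INSERT INTO ... SELECT: populate a new table from the result of a query.
cur.execute("CREATE TABLE OrderSummary (name TEXT, item TEXT)")
cur.execute("INSERT INTO OrderSummary (name, item) "
            "SELECT p.name, o.item FROM Persons p JOIN Orders o ON p.person_id = o.person_id")
conn.commit()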

Music listened to while blogging: Schoolboy Q
I'm going to start linking the artist's name directly to a song of theirs, to add a little interactivity to my music updates.

Wednesday, May 14, 2014

A Personal Hurrah and the Road Ahead

So I'm using this post to explain my hiatus from blogging.

I just recently graduated (May 10th) from the College of Charleston with B.S. degrees in Data Science (with a cognate - effectively a minor - in molecular biology) and Computer Science, along with a minor in mathematics.

I am now entering the two-year Master's program at The College of Charleston. This is the Computer and Information Sciences program; the concentration I am following is the Computer Science track, which takes a more classical, theory-based approach to Computer Science. While doing this, I will be conducting research, hopefully as a graduate research assistant (GRA).

Currently, I am doing research for the summer between these programs with Dr. Anderson and others, working on building and modifying machine learning algorithms. A lot of the specifics cannot really be elaborated upon at this point, but hopefully I will be able to share that information at some point, because this work is going to be great and is quite the departure from my normal research routines. We're just waiting on IRB approval so I can get going with the data for the project.

Also, I have recently shown an undergraduate student the basic development environment for working with Learn2Mine. I am still one of the lead developers on the project, but a lot of the maintenance will slowly transition to this student as he becomes comfortable with the system.

Wednesday, April 23, 2014

Capstone: Upcoming Presentation

This post will be brief as it will be outlining my upcoming poster presentation for the Data Science and Computing in the Arts symposium at the College of Charleston, as well as my Capstone technical report.

A look at my poster:
I chose to go with a brief overview on the poster and I can elaborate more if people ask. So the poster does not go too much into the technical detail.

Below is a flier advertising the presentation:

Monday, April 14, 2014

RMH Homebase - Chapter 7 of Software Development: An Open Source Approach

This post will be composed of mostly responding to exercises found in Chapter 7 of Software Development: An Open Source Approach.

Chapter 7 is about the development of database modules and the chapter uses RMH Homebase (as I have referenced in numerous posts prior to this) as the example with which to conduct exercises.

The first exercise relates to database normalization criteria.

First, I would like to start by outlining the six database normalization criteria (directly taken from the text):


  1. The rows can be rearranged without changing the meaning of the table (i.e., there's no implicit ordering or functional interdependency among the rows).
  2. The columns can be rearranged without changing the meaning of the table (i.e., there's no implicit ordering or functional interdependency among the columns).
  3. No two rows of a table are identical.  This is often accomplished by defining one column whose values are mutually unique.  This column is known as the table's primary key.
  4. No row has any hidden components, such as an object id or a timestamp.
  5. Every entry in the table has exactly one value of the appropriate type.
  6. No attribute in the table is redundant with (i.e., appears as an explicit substring of) the primary key.

It is given that certain tables in RMH Homebase violate criteria 5 and 6; dbDates, for instance, satisfies neither criterion.

Another table that violates criterion 5 is dbSchedules. To recapitulate, criterion 5 states that every entry has exactly one value; dbSchedules sometimes misuses the Persons field. Sometimes there is only one person in the field (or it is null), but there are also times when multiple people are listed in the field for a single record. Because a single field in a record can hold multiple values, criterion 5 is violated.

The same table also violates criterion 6. Typically, databases have a primary key in the form of some unique id. Other times, however, compound keys are used (or created). A compound key uses multiple fields of a table together to uniquely identify a record. For example, in a Person table we could potentially have two people with the same name, so we might identify them by their name and address together. dbSchedules uses name and phone number as a compound key, which is not a problem in itself, but the key is then stored as its own new field. So that key field is a redundant copy of both the name and phone number in the same table, which is a clear violation of criterion 6.
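For illustration, here is a tiny sqlite3 sketch (with a made-up Volunteers table, not RMH Homebase's actual schema) showing how a compound key uniquely identifies a record on its own, without a separate key column:

import sqlite3

# Made-up schema: the pair (name, phone) is the compound primary key.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Volunteers ("
             "  name  TEXT,"
             "  phone TEXT,"
             "  city  TEXT,"
             "  PRIMARY KEY (name, phone))")
conn.execute("INSERT INTO Volunteers VALUES ('Jane Doe', '843-555-0100', 'Charleston')")
try:
    # Re-using the same (name, phone) pair violates the compound key constraint.
    conn.execute("INSERT INTO Volunteers VALUES ('Jane Doe', '843-555-0100', 'Mt. Pleasant')")
except sqlite3.IntegrityError as err:
    print("Rejected duplicate compound key: {0}".format(err))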

The next exercise is asking for me to develop and unit test specific functions for the dbShifts.php module.

So the getters can be written very easily since the id is a compound of all the other needed attributes (delimited by '-'s).

function get_shift_ABC($key) {
    // Split the compound id into its individual attributes.
    $attribute = explode('-', $key);
    // Replace fieldNum with the index for this getter's attribute (see the table below).
    $ABC = $attribute[fieldNum];
    return $ABC;
}

So here I am showing a very generic version of the getters I would write, given the exercise specifications. I changed the incoming parameter name from "$id" to "$key" merely because I like key better as a name, but that is just personal preference. The explode call is equivalent to calling ".split()" on a string in Python, so "$attribute" becomes a list of all the attributes, in order. Now, you may have noticed that I named the function ..._ABC(...) and named a variable $ABC. This is because each getter would use a more descriptive name there (for readability). For example, you would replace "ABC" with "month" if you were writing the get_shift_month(...) function. The only thing that changes between the different getters is the index into the $attribute array. Below I have listed which field each index corresponds to:

fieldNum    Field
0           Month
1           Day
2           Year
3           Start
4           End

So now all 5 getters are, effectively, written.

The last exercise asks me to design and implement the changes to the database modules required by a new feature - Item 4: Calendar Month View - in the "wish list" prescribed in Appendix B. The quickest way to do this is to copy an entire database module through PHP and then refactor it (this may not be the cleanest approach, but it is quick, effective, and gets the job done). The PHP class I refactored was dbWeeks - I chose dbWeeks because the two classes are organized similarly (the book even hints at doing this on page 194). Refactoring let me change the fields to the necessary values, so now each row in the calendar month view represents a month, active or archived. The unit tests were also refactored easily and all of them passed with no apparent issues.

Penultimately, I would like to give an update about Team Rocket and our work with Galaxy this semester. We have finished our poster and got it printed off. Below is a (low quality) picture of our poster that we will be presenting at the College of Charleston School of Science and Math Poster Session on Thursday, April 17, 2014. 



Lastly, I would like to give an update about my plans to "Meet Charleston" - for me that was attending an "Agile User Group" meeting. The next meeting will be taking place April 24, 2014, from 11:00am-1:00pm and will be hosted by Life Cycle Engineering. I am really excited to make it to this meeting.

Monday, April 7, 2014

Software Development: An Open Source Approach <-> RMH Homebase (Developing the Domain Classes)

For this post I will be going through Chapter 6 of Software Development: An Open Source Approach and reflecting upon my experience with some of the exercises at the end of the chapter. All of the exercises focus on dealing with the open source software RMH Homebase (which I have mentioned in posts in the past).

The previous blog post in which I talked about the installation and usage of RMH Homebase can be found here.

The RMH Homebase release 1.5 code base can be downloaded from myopensoftware.org/textbook

Unfortunately, all the work I did on my previous post is for naught for the purposes of doing these exercises since I am on a different computer. So let's run through some commands and dependencies with which I had to deal:

% sudo apt-get install mysql-client mysql-server
This command didn't work as I had hoped because it kept complaining about out-of-date dependencies and missing dependencies. And these errors kept continuing.

% sudo apt-get install mysql-server-5.5 mysql-client-5.5
led to
% sudo apt-get install libdbd-mysql-perl
led to
% sudo apt-get install libmysqlclient18
which ultimately led to an error with glusterfs-server, which is, I believe, something extremely specific to my lab computer.

Experience-wise, it is good that I can note what is going on. Practicality-wise, this is frustrating because I do not want to tinker with the cluster, as a lot of projects rely on the cluster not being tinkered with (including my own). So while this was valuable, I will continue on a different computer at a later point. Also, the gluster error persisted for every package I wanted to install (apache, php, etc.).

More commands:
% sudo apt-get install apache2
% sudo apt-get install php5
% sudo apt-get install phpmyadmin

For reference, an easier way to get all of the MySQL stuff installed can be done through:
% sudo apt-get install mysql-client-core-5.5 mysql-server-core-5.5

Everything that these commands install and do can be found in my older blog post I mentioned earlier.

The exercises are very basic in nature: defining new functions to set and retrieve the values of the variables $employer, $contact_person, and $contact_phone.

So, pretty much, encapsulate the Person class with getters and setters so someone can call "get_employer()" or "set_employer($newEmployer)". There's not really any work here other than writing getters and setters for all 3 of the features.

Example:
function set_employer($newEmployer){
     $this->employer = $newEmployer;
}
function get_employer(){
    return $this->employer;
}

The next step is to create a more well-defined constructor. This goes as follows:

function __construct($f, $l, $a, $c, $s, $z, $p1, $p2, $e, $t, $status, $employer, $contact, $contact_phone){
     $this->first_name = $f;
     $this->last_name = $l;
     .
     .
     .
}

It's obvious that you would just use typical assignment throughout this constructor, like I started to outline. When handling the password field you would probably want to throw in your favorite flavor of password hashing so you are not storing raw passwords, because that's a security no-no.

The next question pretty much asks you to rewrite the set_status function, because currently the $value that is passed in is never checked for validity (it should be either "active" or "inactive"). Right now, if something else is passed in (let's say "cheese"), then someone's status could be set to "cheese" rather than the code actually catching the problem. So let's rewrite the function below:

function set_status($value){
     if ($value == "active" or $value == "inactive"){
          $this->status = $value;
     }
     else{
          echo "Your input for set_status was invalid. 'active' and 'inactive' are the only valid options.";
     }
}

So now the user knows if they picked a wrong value for set_status, and the status does not get changed if an invalid input is given.

The last exercise is to refactor the Person class (where we've been working this entire time) by removing all the mutators that aren't called from anywhere in the code base. This can be done by deleting or commenting out (commenting out would be preferable in my eyes) the unused mutators.

Music listened to while blogging: Childish Gambino

Monday, March 31, 2014

Developing the User Interface



By this time, I was hoping to have had a meeting with Agile Charleston, but their group seems to be pretty inactive on LinkedIn and there is no semblance of a meeting/event schedule. Perhaps I will meet with another group and then talk about that experience, even though I am heavily interested in the Agile Charleston group. So, rather, I will focus mostly on Chapter 8 of Software Development: An Open Source Approach.

This chapter focuses on the development of user interfaces. So I'll start by just following the breakdown that the book does because it is a really good breakdown of what makes a user interface a solid user interface:

Completeness - All the steps of every use case in the design must appear on a page or group of related pages in the user interface, but no more.
Language - The language of the interface must be consistent with the language of the domain and user. All labels and user options that appear on individual pages must be consistent with their counterparts in the design document and the application's domain.
Simplicity - No page should contain too much information or too little. Each page should be pleasant to view, yet its functionality should not be buried by excessive detail or elaborate stylistics.
Navigability - Navigation within a page and among different pages should be simple, explicit, and intuitive.
Feedback and recovery - Each page must provide the user with a clear indication of what has just been done, what can be done next, and how to undo what has just been done.
Data integrity - The types and valid values for individual user data entries must be clearly indicated. The software should validity-check all data at the time of entry and the user should be required to correct errors before the data are kept in the database.
Client-server integrity - The activities of several different users who happen to be using the system at the same time must be kept separate and independent.
Security - An individual user should have access to the system's functionality, but only that functionality for which he/she is authorized.
Documentation - Every page or group of pages in the user interface should be linked to a step-by-step on-line instruction that teaches a user how that page can be used to accomplish a task.

So you may have glanced over this list and thought "well, duh" - why wouldn't you do all of that whenever developing an application? No one makes an application and is happy if it has security holes or if it is not easily navigable, etc. But these things are not always easy to do. For example, in my own research we have been slowly locking down the security of our program because we had a few security holes which were problematic, and those holes were part of the reason that we had not completely open sourced it and released it to the public earlier on.

To make these issues easier to tackle, it is vital to adopt a policy for development. One common policy is to adopt a stringent design pattern and follow it throughout the entirety of the project. Arguably the most common design pattern utilized whenever developing an application is the model-view-controller (or MVC, for short) pattern. An image can be seen below which depicts the basic strategy for implementing MVC (image adapted from Stack Overflow):

So if a user visits your site, all they ever see is the "View". The "View" is just a way to represent all the backend information of an application in an easily comprehensible manner - an abstraction. The "View" is typically created with some markup language (e.g. HTML) and is typically altered with some scripting language (e.g. JavaScript). The creation of the view is crucial because users can easily be turned off of an application if the web frontend seems shoddily made - and this goes back to several of the aforementioned points about user interface design.

So let's say a user interacts with the view and is expecting some change or update; that request goes to the "Controller", which then communicates with the "Model". The "Model" is what contains the state of the application, as well as storing information in databases (hence the "MySQL" image on top of the "Model" in the picture). The "Model" does not actually alter the data it contains, though - that is the job of the "Controller". When a user interacts with the application, the controller performs some operation (or operations) based upon the user's input. This can result in changes in the model, as the state of the program may have changed. The changes that were made in the model can then effectively be "read" by the "Controller" and output can be sent back to the user's view.

But what about the bottom half of that image? First, I'd like to mention that it is not always present when talking about MVC schemes; this image is merely an example of MVC applied to a web-based application. That part is mostly the handling of HTTP requests, responses, POSTs, etc. It acts somewhat as an adapter so that requests and responses are handled consistently no matter what browser you are using (unless it is some old version of IE, which fails to even do HTML5). If you would like to read more about this, then go here.
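To make that separation concrete, here is a toy MVC sketch in plain Python - the class and method names are mine and there is no web framework involved, but the flow of user input -> controller -> model -> view is the same idea:

class Model(object):
    """Holds application state; knows nothing about how it is displayed."""
    def __init__(self):
        self.items = []

    def add_item(self, item):
        self.items.append(item)


class View(object):
    """Renders state for the user; here it just prints to the console."""
    def render(self, items):
        print("Current items: " + (", ".join(items) if items else "(none)"))


class Controller(object):
    """Turns user input into model updates, then refreshes the view."""
    def __init__(self, model, view):
        self.model = model
        self.view = view

    def handle_add(self, user_input):
        self.model.add_item(user_input)     # change the application state
        self.view.render(self.model.items)  # push the new state back to the user


controller = Controller(Model(), View())
controller.handle_add("first order")   # user action -> controller -> model -> view
controller.handle_add("second order")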

I'd like to take some time now to talk about security. A user should never be allowed to access information that is not specifically needed for them. This can be done by using the browser's session variable and cookies as a place of storage - or, the approach I initially took with my own research: we matched users in a datastore by using their email as a unique identifier. Before that, we had created a random, unique session ID, but we were able to work around that and stop using it. Regardless, we were doing all we could to prevent users from entering input into forms that could be used for code injection. We implemented OAuth2 and linked it to Google APIs so we know we have a secure and genuine Google account, use the email associated with it, and pull down information that we have created in the database, rather than having a user type their email (or any kind of malicious code) into a form that we then use to grab information from the datastore. Hypothetically, if we allowed users to enter their own email, they could write a SQL injection and destroy our entire database, or retrieve all the information out of it. If they retrieved it all, that could be bad in case we had sensitive information stored - encrypted or not, that is bad.
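As a concrete illustration of the injection risk, here is a small Python/sqlite3 sketch (made-up table and inputs, not our actual datastore) comparing string concatenation with a parameterized query:

import sqlite3

# Made-up users table; the point is the difference between concatenation and parameters.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, badge TEXT)")
conn.execute("INSERT INTO users VALUES ('student@example.com', 'kNN Master')")

user_supplied_email = "student@example.com' OR '1'='1"  # classic injection attempt

# Dangerous: the attacker's OR clause becomes part of the SQL and matches every row.
dangerous = "SELECT badge FROM users WHERE email = '%s'" % user_supplied_email
print(conn.execute(dangerous).fetchall())

# Safe: the driver treats the whole input as a literal value, not as SQL.
safe = "SELECT badge FROM users WHERE email = ?"
print(conn.execute(safe, (user_supplied_email,)).fetchall())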

So you can easily see why a lot of forethought and a lot of time has to be dedicated to user interface design. You have to design in such a way that everything is easy to understand while also balancing security issues and usability issues. It's a game of tug-of-war that will never stop as you constantly maintain and evolve your application.

Music listened to while blogging: Kendrick Lamar and Tech N9ne

Monday, March 24, 2014

Capstone: Gamification of the UI of Learn2Mine

Learn2Mine has been deployed in the AI and Data Mining (CSCI 470 and CSCI 334) classes here at the College of Charleston. In the past, we deployed Learn2Mine in Data Science 101 classes so the students that will be using Learn2Mine should be far more advanced than the users in the past, by just sheer experience in the field.

One thing became very clear, very quickly, during this second piloting of Learn2Mine - students do not like having to use multiple tabs or windows when using Learn2Mine. To me, this seems like a very innocuous problem that really should be overlooked because of the complex nature of the backend of this system. But at the end of the day, your users are your life when it comes to applications such as this. So we decided to take an evolutionary step with our software.

Now, Learn2Mine is primarily being used as a teaching and learning tool, which is great - so let's simplify the interface that students have to use when actually learning. You may have read earlier posts where I've talked about Galaxy being built into Learn2Mine. If you're a student, you never even have to know about Galaxy. Galaxy is strictly back-end now, and the interface students were using with it in the past will be completely unbeknownst to them. Dr. Anderson worked furiously to get an API call set up to run Galaxy jobs programmatically (through a Python call).

So, if you have read and become versed with the basics of how Learn2Mine works, you may be wondering how exactly students can get their work graded and receive feedback. Below you will see a screenshot of where students submit their code:

Now there are a few things that will jump out at you if you have been keeping up with the way Learn2Mine works.

First, it can be seen that there is now compatibility for Python 2.7, in addition to R. The initial need for this came about through the AI class since the work they are doing will all be in python. But what really sparked the "need" for this came from the data mining class. A lot of the students are not comfortable working in R and some of them feel that they can work a lot better, quicker, and easier in python, even if it means utilizing multiple third party libraries. So now we have to code solutions in two different ways in order to allow students to code in either language when trying to progress through our system. 

Second, it looks like there is just a button that grades your code. So what happens when you submit? 

So I inserted some text and clicked "Grade R Code" - obviously this should raise an error, but that is part of what I want to show you here. The output here is the exact output you would get if you were to try and run my text through an R interpreter. So you can get the exact trace. Also, though, if you submit code that is syntactically correct, then the feedback box will contain feedback pertaining to why your submission was wrong -> maybe you didn't define the function the way we specified, maybe your signature for your function was incorrect, maybe you were not able to produce the correct matrix, etc. Whatever your problem is, we tell you.

Alright, so we let you submit your code in bulk. Big whoop. That is only going to promote the idea of people trying to code entire solutions at once. Well, to combat this, I have been working on an interface upgrade for Learn2Mine to go side-by-side with Dr Anderson's R/Python upgrades in order to promote the breakdown of coding problems into more atomic pieces.

So the page I have been screenshotting and posting here is for the "Empirical Naive Bayes" lesson. Learn2Mine is set up so the last problem on a page is effectively a "Boss Fight" (we're trying to come up with better nomenclature for this, but that's on the backburner) and you really only get credit for the lesson if you defeat the boss. But what are the problems before it? Well, in the same vein as video games, we give students the opportunity to practice and hone their skills in order to take on the boss. For example, in the Empirical Naive Bayes lesson there are two problems before the boss problem that break the process down into 2 smaller steps. In this Naive Bayes problem your ultimate goal is to compute the posterior probabilities, but in doing that you will compute prior probabilities and densities. So we have abstracted out the computing of the prior probabilities and densities. If students write these as separate functions, then they can verify that they have those steps correct by running their code through the non-boss forms. Then they can just add those functions to their code for the boss and call them when needed, rather than having errors in all kinds of crazy places.
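For a rough idea of that breakdown, here is a small numpy sketch of Gaussian Naive Bayes split into priors, densities, and posteriors - this is not the Learn2Mine lesson code, just the decomposition the lesson encourages, on a tiny made-up dataset:

import numpy as np

def priors(labels):
    # Prior probability of each class, straight from its frequency in the training labels.
    classes, counts = np.unique(labels, return_counts=True)
    return dict(zip(classes, counts / float(counts.sum())))

def gaussian_density(x, mean, std):
    # Likelihood of a feature value under a normal distribution.
    return np.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (std * np.sqrt(2 * np.pi))

def posterior_scores(train_X, train_y, test_row):
    # Unnormalized posterior per class: prior times the product of the feature densities.
    scores = {}
    for cls, prior in priors(train_y).items():
        rows = train_X[train_y == cls]
        dens = gaussian_density(test_row, rows.mean(axis=0), rows.std(axis=0, ddof=1))
        scores[cls] = prior * dens.prod()
    return scores

train_X = np.array([[1.0, 2.1], [1.2, 1.9], [3.0, 3.9], [3.2, 4.1]])
train_y = np.array(["A", "A", "B", "B"])
print(posterior_scores(train_X, train_y, np.array([1.1, 2.0])))  # class "A" should win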

Lastly, as a motivational tool for doing these more atomic parts of the problem before working toward the boss, I have created a visual progress bar. If you complete the earlier problems, this bar will start to fill up and give the user a sense of achievement (hopefully). Of course, if a user goes straight to the boss problem and conquers it, the bar fills all the way up, because you shouldn't be punished for already being able to do the final part of the problem. Below you can see the bar after completing the 2 non-boss problems for the Empirical Naive Bayes lesson:



Music listened to while blogging: Kanye West

Tuesday, March 18, 2014

Capstone: Documentation and Database Migration -> Countdown to Open Source

This post will focus primarily on the documentation that we have been adding to Learn2Mine in order to prepare it for the day we decide to open source it (the light at the end of this tunnel can be seen). Additionally, I will talk about some database issues with which I had to deal.

So documentation: it's something that I talked about in one of my posts for software engineering last week. No one likes writing it, but everyone needs it. Jake and I will be handing Learn2Mine off to some new students in the Anderson Lab at the end of this semester, and we do not want to waste those students' time by making them comb through code just to understand it. So we have put together a Google document which is slowly growing. This document will, hopefully, morph into a README in the future, as that is something vital for a project that wants to be a successful open source project.

So right now the README includes details about how to modify and add to the continuously growing components of Learn2Mine (e.g. adding a node/badge to the skill tree). Right now the language used throughout is fairly colloquial and written for the Learn2Mine team, but we will clean it up and make it more formal in the future (like whenever we add installation and development instructions).

Learn2Mine is (as of today) being used in the data mining class here at the College of Charleston. Recently, we reinstalled Galaxy because we had a bunch of issues, and this ended up also resetting the database that we had set up for Galaxy, a Postgres database. In our recent development we had not noticed a difference in Galaxy's performance even though we had fallen back to Galaxy's initial SQLite database. When dozens of students started to work on Galaxy at the same time, however, we ran into concurrency issues. Galaxy just was not allowing students to run jobs or submit lessons because too many people were trying to interact with the database at one time, a problem common in SQLite - this is why SQLite is not meant to be used for large servers. So I stopped Galaxy, effectively taking the site down momentarily, delved into the universe_wsgi.ini file, and pointed the database location to where the Postgres database exists on our virtual machine (the learn2mine-server that hosts Galaxy and RStudio). I then had to run "sh manage_db.sh upgrade", a script that looks at the database location and updates Galaxy's database setup to point to the new database.

So that handled the migration and created a new Postgres database. Unfortunately, all the users that existed in the SQLite database are not in the Postgres database. Dr. Anderson and I sat down and tried to use a dump of the SQLite database to effectively merge it into the Postgres database. After about 20 minutes of trying different approaches, it really seemed like way too much work to migrate a database with only a day's worth of information. All in all, for the users that just started today, it is not much hassle to recreate their accounts. It will create some errors on the backend - for instance, the users' RStudio accounts will already exist and will hit an error for trying to create an account that is already in existence - but that error will not stop the flow of Galaxy or RStudio, and the users will never see it. So there is no real drawback, especially since this only affects the users that created accounts on Galaxy today and only matters the next time they try logging in.

Music listened to while blogging: Hopsin

Planning to Meet Charleston

So my group met up today and we examined some Python documentation and reviewers' thoughts on the use of specific file I/O in Python with regard to garbage collection. A Stack Overflow post was very insightful for us. I previously posted about how I was curious whether Python's garbage collector can keep up when reading files line-by-line. If you use the "with" statement in Python to read a file line-by-line, then cleanup is taken care of automatically (either the buffer will fill and the collector will be spawned then, or the collector will simply keep up with the line-by-line reading). So our code replacement for the map function works the way the Galaxy developers want it to work.
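For reference, this is the basic pattern in question - a minimal sketch with a made-up filename, where only the current line needs to live in memory and the file is closed automatically when the block exits:

# huge_tabular_file.tsv is a made-up filename standing in for one of Galaxy's
# big tab-delimited datasets.
total_columns = 0
with open("huge_tabular_file.tsv") as infile:
    for line in infile:                      # one buffered line at a time
        fields = line.rstrip("\n").split("\t")
        total_columns = max(total_columns, len(fields))
# The file handle is closed here, even if an exception was raised inside the block.
print("Widest row has {0} columns".format(total_columns))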

So now we need to add our code to the Galaxy toolshed. I found wiki documentation here which explains the purpose of the toolshed and how to utilize it and add to it. The actual Galaxy toolshed is located here. Effectively, this is a list of repositories that Galaxy users can clone and import into their own Galaxy instance in order to utilize the tool(s) found in the toolshed. A member of Team Rocket, Jacob, uploaded the files to the Galaxy toolshed and the repository can be found here. The Python and XML files are both present in the repository as well as the functional tests I wrote for it, initially. As far as we know, we are done with our second Galaxy addition. So now we need to find a new bug or feature to tackle.

Now you may be wondering about the title of this blog post because it just seems like an experience report so far. Well, since POSSCON will not be happening this year, the software engineering practicum class is being tasked with visiting and attending a meeting of a group listed here. After perusing most of the groups I found that I am most interested in attending a meeting/event for the Agile Charleston group. To join the group to find out about times and more information I have to wait for my request to join the LinkedIn group to be accepted. So more on this later.

Music listened to while blogging: All Time Low

Monday, March 17, 2014

Capstone: Walkthrough of the Tutorial for Learn2Mine

So Jake and I have cranked a lot of work out for Learn2Mine recently. The bulk of this work has been related to finishing the tutorial for Learn2Mine and getting any issues with it settled.

When you navigate to Learn2Mine's home page you can select "Take the Tutorial!" and jump straight into the tutorial (as I've explained in previous posts). The tutorial introduces users to the three components of Learn2Mine, which includes creating a Galaxy/RStudio account.

So let's stop right here. RStudio accounts used to require a separate tool for creation; now users do not have that extra level of confusion when creating an RStudio account - their Galaxy account is their RStudio account. The only issue we currently have is that users cannot change their passwords on either component of Learn2Mine - something that will be solved soon.

So back to the tutorial. Jake did a great job making images that explain the different sections of Learn2Mine and Galaxy and these snapshots can be seen below:


So now the users of the site can get started with the tutorial. The first tutorial section is the "Basic R Tutorial" section. Here, users are asked very simplistic questions programming-wise. For example, the first problem is asking users to create 2 integers (and type declaration is not important in R so users can simply say "x = 1234" or "x <- 1234" if they desire) and perform some mathematical operations on those variables. The users are to do this in RStudio.

They have the option to submit this first problem right away if they want, and we give them a quick how-to on submitting via the "How Do I Submit?" button in the bottom left of the tutorial pages. Let's say the user has moved on to one of the harder tutorial lessons, though. What if the user is having a hard time because of a lack of exposure to R and just needs a little jumpstart on how to start coding?

Well, I worked hard on developing JavaScript code (using jQuery) that takes example code I created and prints it out to users (either all at once or line-by-line). This is especially useful in the last coding section of the tutorial. Users are asked the following (page available here):

Write your own kNN function called my.knn that takes 3 arguments: a training file, a testing file, and a k value. The function should use the training file to compute a set of Euclidean distances in order to find the nearest neighbors for the records (rows) in your testing data file. Finally, you want to return a 1-column matrix which contains the correct labels for the testing data. The nth row in the test labels matrix corresponds to the nth row in the testing dataset. Though your kNN algorithm should be generalizable to many different datasets, you may use the training and testing datasets we provide for you below in order to test your function.

Below this we give users a training set and a testing set for this problem (which I completely made up - it is meant to be simplistic). We also give users the signature for the function (as our automatic, instant grading is based on matching the signature first and foremost). So users can either click the Hint button to get insight into how to tackle the problem, or they can reveal line-by-line or full answers to it. Now, I made a note on the page that the code we provide is not the most efficient - and it was never meant to be. Rather, the code was written so that users should be able to read and understand it without the need for a lot of comments. If users can do that, then the tutorial was successful, because the user then understands R code and can actually get their hands dirty with the actual R lessons on the site.
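Purely for illustration, here is a rough numpy sketch of the same kNN idea - this is not the hint code from the site (which is written in R, as the tutorial expects), and it assumes the last column of the training data holds the label:

import numpy as np
from collections import Counter

def my_knn(train, test, k):
    # Last column of the training array is assumed to hold the class label.
    train_X = train[:, :-1].astype(float)
    train_y = train[:, -1]
    labels = []
    for row in test.astype(float):
        dists = np.sqrt(((train_X - row) ** 2).sum(axis=1))   # Euclidean distances
        nearest = train_y[np.argsort(dists)[:k]]               # labels of the k closest rows
        labels.append(Counter(nearest).most_common(1)[0][0])   # majority vote
    return np.array(labels).reshape(-1, 1)                     # one-column label matrix

train = np.array([[1.0, 1.0, "A"], [1.1, 0.9, "A"], [5.0, 5.2, "B"], [4.9, 5.1, "B"]], dtype=object)
test = np.array([[1.05, 0.95], [5.0, 5.0]])
print(my_knn(train, test, 3))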

Let's say the users have finished those R lessons, though; then they can move on to the tutorial tool lesson. Here, users perform k-NN with our built-in k-NN tool on Galaxy. Users just have to provide a training dataset, a testing dataset, and a k value (the same inputs as the function we had them write). The tool is then run, gives users an HTML output, and automatically grades the lesson upon submission.

Music listened to while blogging: Childish Gambino & Tech N9ne

Release Early and Often

For this post I will be reflecting upon my team's progress thus far and chapter 9 of Teaching Open Source.

Much to my surprise, chapter 9 of Teaching Open Source is merely a one page chapter talking about how the oft-espoused "Release Early, Release Often" motto in FOSS should be applied to more than just software for FOSS projects. The textbook itself is actually released early and released often, as it is an experimental, open source textbook. The book even has its own mailing list, like the one I've mentioned for Galaxy a multitude of times in the past.

So let's get into my team's work on Galaxy. So last time, I left off with a code segment that I thought could be improved upon in order to become more efficient than the Python map(...) function which I originally used to create the Transpose tool for Galaxy. Jake and Jacob, of Team Rocket, took the lead after I set up the framework (with the functional testing, map function, and xml markup). The code below is what they came up with.


So the logic behind this is pretty simplistic. Effectively, we want anyone that uses Galaxy to be able to transpose tab-delimited data (if the data is not tab-delimited, Galaxy provides conversion tools). A problem that often occurs is that someone is using lots of genetic data that is just way too big to be loaded into memory all at once. So we take advantage of for-each loops: once Python has made a pass through a for-each loop, where the for-each here is "for each line in an input file", Python no longer needs to access previous lines. We still need to confirm this with documentation, but the built-in garbage collector for Python should scrap the previous lines since they are not stored in variables of any sort.

On my end, I have just been keeping up with syncing our team's forked repository of the galaxy-central master branch, that way when we want to push some changes in we will be able to without any hassle.

Music listened to while blogging: Schoolboy Q

Tuesday, March 11, 2014

The Doc Is In!

For this post I will be blogging about Chapter 8 of Teaching Open Source and responding to a couple of the exercises.

So documentation... no one is a fan of making it, but everyone should be doing it and doing it well. You never know when you might have to go back to code written years back and perhaps written by someone else. Would you rather read a description of what is happening with that code and see a shorthand example or would you rather have to inject your own testing code to see what is happening? Not sold? Would you rather take about a minute or two to figure out how code works or take hours to figure out what is going on? The key to making code the best it can be is this documentation.

So it is evident that documentation is crucial when you are working with others on a project, or if there is a chance that you will leave the project and have someone else take the lead on your section of it - but what if the project you are working on is done just by you? Well, as aforementioned, you will need to go back and change code at some point. As I have reflected upon in the past, degradation is going to take hold of your project if you are not constantly maintaining and improving. When you go to maintain and improve this code, it is much nicer to just navigate comments to figure out what the code is doing so you can modify only what is intended.

So there is an exercise in Chapter 8 of Teaching Open Source which is asking me to write thorough comments in all of my source code, make sense of source code through documentation alone, and write at least one wiki page of developer documentation for each program I am working on. Galaxy has a standard for writing documentation for code that is created for it.

Writing comments for all of the Galaxy source code would be a foolish undertaking, as this is a project that has been built up over a long period of time (since 2005). The code that I have contributed has also been marked up with comments (available here, here, and here). Now, you may navigate to those pages and see that there are no new Python comments; the comments already existed for the Group tool and I was merely improving the code for it. This is because the Galaxy community has effectively perfected the art of documentation (enough to grasp what is going on and where, but not so much as to detract from the code itself). I was able to go directly to the section in the code where I needed to add my code. There also exists example usage documentation within the XML of the tool itself (so when someone is using it they can see what is supposed to happen). So if you couple that markup with the code itself and the code comments, it is evident what the code is supposed to do.

To demonstrate making sense of source code through documentation alone, I will present a tool in Galaxy which conducts some sort of biological analysis with which I am unfamiliar. So I randomly picked a toolset (phenotype_association) and randomly picked a Python file from inside this toolset (senatag.py). The comments for this code are different: instead of comments scattered through different parts of the code, there is a large block of comments at the beginning of the file which explains every piece of the code. Overall, senatag takes a file with identifiers for SNPs (single nucleotide polymorphisms) and a comma-separated file which has identifiers for different SNPs. Senatag then outputs a set of tag SNPs for the dataset provided (the comma-separated list). The comment markup can be seen below (as well as the breakdown of the step-by-step code):


To contribute to the wiki I was planning on writing a step-by-step tutorial on how to add tools to Galaxy (as I have outlined this in previous posts and it is something I know well enough to write a structured wiki page on). After navigating the "all wiki pages" section of Galaxy, I realized that adding a page would be extremely difficult because the Galaxy community has done just about everything when it comes to wiki documentation. For reference, the add tool page I was referring to can be found here.

The next exercise I am asked to perform is to pick a feature that sounds tantalizing but is not clearly documented. Using Galaxy for this, I realized that I do not have any experience actually tinkering with the graphical user interface (barring simple XML markups). So first I tried navigating through Galaxy's wiki to find information on how to understand or maybe customize the interface (as that documentation would effectively explain how to manipulate all parts of the GUI). Much to my disappointment, I was unable to find any information on this topic, but that does mean I can potentially add this documentation (and it really helps for this exercise). There is a .jar file that looks like it is what assembles graphics on the client side through the usage and calling of dozens of Python and Bash scripts. So here I will focus on one very simple part of the interaction with the user interface. If you deploy your own version of Galaxy, then you are required to have Python 2.6 or 2.7 installed and set in the environment path. Rather than having crazy errors occur because someone does not have the correct version of Python installed, the Galaxy developers have created a simple "check_python.py" which prints a message to the user explaining why they cannot run anything. Additionally, I learned something which has always puzzled me about Python. In Python you can use triple quotes to write large block comments; in this file, the message that gets printed to the user is written in that block style. So I did some tinkering in my own Python shell and have now learned that you can assign strings in this manner, adding a ton of readability to long strings (rather than dealing with a ridiculously long word-wrapped string).
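Here is my own toy version of that idea (not the actual check_python.py from Galaxy): the long message is assigned with triple quotes and only printed when the version check fails:

import sys

# My own toy example, not the actual check_python.py shipped with Galaxy.
WRONG_VERSION_MESSAGE = """
This application was written for Python 2.6/2.7. Please install a supported
interpreter and make sure it appears first on your PATH, then try again.
"""

def check_python():
    if sys.version_info[:2] < (2, 6):
        print(WRONG_VERSION_MESSAGE)  # the whole block prints exactly as written above
        raise SystemExit(1)

check_python()
print("Python version looks fine.")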

Music listened to while blogging: Childish Gambino

Thursday, March 6, 2014

Capstone: Introduction to a Tutorial and Prep Work for Showcasing

So for the past fortnight (yep, I just successfully used that word) the Learn2Mine team has been hard at work on creating a tutorial for Learn2Mine - really just a simple 30 minute introduction as to what we aim to have users achieve within our system and effectively giving them a taste of the two different types of lessons we currently offer. So let's break down the steps we had to take in order to prepare for a tutorial (next post will focus on the tutorial itself).

For starters, the tutorial really did not stick out to users first entering the site, so a lot of work was done on the frontend of Learn2Mine in order to give it a sleeker feel and make it more intuitive for the less-experienced user. There now exists a splash page where users that are not logged in can see a little bit of what Learn2Mine does, rather than immediately being asked to authorize our program with their Google account. An image of this screen can be viewed below or by navigating directly to the site.



So once users are logged in they are directed to the Learn2Mine home page and it is from here that they can jump into the tutorial. Effectively, there is a button much like the Login button that says "Take the Tutorial!" (no need for a screenshot).

Now the tutorial has to teach users about the different sides of our software. I have gone over this in past posts: the Virtual Portfolio is where you start, RStudio is where coding, testing, and debugging are performed, and Galaxy is the software effectively used for submissions and instant feedback. So the tutorial teaches users about the different components of each of these sides of the software and how to use them.

So now that I have brought up RStudio and Galaxy, I would like to showcase another hurdle that the Learn2Mine team has overcome. In the past, we used randomly generated keys to give users a way to link their Learn2Mine account to Galaxy and RStudio. This meant that users used that key to submit work in Galaxy and used the same key to log in to RStudio. So users really just had to keep copy/pasting it into these different fields (except in Galaxy, since it retains a job history if you create a Galaxy account). Obviously this is not preferable. So we went digging through the Galaxy-Dev mailing lists and found that if a user is logged in to a Galaxy account, we can get a handle on the email they used when signing in to Galaxy. This eliminates the need for users to keep passing keys around to the different parts of Learn2Mine.

This created a new issue for us, though. Previously, whenever a user submitted their key in Galaxy, we created a user account on our virtual machine - this is how users were able to log in to RStudio. So now we had to create a separate tool that users had to run in order to use RStudio, yet another annoyance. Sherlock Holmes style, more investigative work was done and we were able to put our own code into Galaxy. This code takes the Galaxy signup process and also fires off our addUser.sh script. So now the credentials a user enters into Galaxy are the same credentials used for signing in to RStudio, which is a lot better than what we had in the past. What is next is allowing users to change their password, because there currently is no implementation for that. So now users are not perplexed when going to the different parts of Learn2Mine - in the past there was a bit of a curve when figuring it all out. So why did I go off on this tangent? First, it is awesome to get this working. Second, and more importantly, users can seamlessly get involved with solving problems on Learn2Mine, whether in the tutorial or not, more quickly and efficiently.

Music listened to while blogging: Blue Scholars

Thursday, February 27, 2014

Galaxy Pull Request #355 Update

So in a previous post I mentioned my team's latest pull request where we were adding a tool that can transpose data.

Because there seemed to be some confusion about its uses, I am going to elaborate upon them here for a moment.

Users are never forced to transpose their data; this feature was requested to be added to Galaxy. One reason a user may want to transpose their data is to use Galaxy's column filtering tool for statistical analysis, or simply to group data by values rather than by the features that exist within the data. Additionally, the tool can transpose tabular data that is not square (different numbers of rows and columns), though the examples we gave happened to be square.
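To picture what the tool is doing, transposition just flips rows and columns. A toy version of that operation (the file names are placeholders, and this is the naive read-everything approach - more on why that is a problem below):

# A tab-delimited table with, say, 2 rows and 3 columns becomes 3 rows and 2 columns.
with open('input.tabular') as infile:
    rows = [line.rstrip('\n').split('\t') for line in infile]
with open('output.tabular', 'w') as outfile:
    for column in zip(*rows):
        outfile.write('\t'.join(column) + '\n')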

So let's get to the update about the actual pull request itself. John Chilton, the same developer who responded last time, responded to my pull request:



Flipping over to the activity section, he left a comment when declining the pull request: "Would love to see this in the tool shed!" That comment has Team Rocket now looking at and experimenting with adding the tool to the tool shed.

So why didn't we add our tool directly to the tool shed before? It would make sense to go straight there, right? Well, I made the decision to submit our pull request the same way as last time because the tool we were developing went hand-in-hand with other tools located in the core section of Galaxy (it even belongs to the same toolset as some of those core tools). As you can read, tools developed by the core team are now being moved to the tool shed themselves, so this is not an issue.

There is an issue, though, and it was something I had worried about when first submitting the pull request. The way we transpose data reads the entire file into memory at one time. For Galaxy, this just cannot happen, because Galaxy users are typically dealing with genomic data that can be upwards of 50 GB per file. Reading all of that into memory at once really is not feasible, even with the nicest of server stacks. So we are going to have to brainstorm a method for cutting the data into chunks and writing it out incrementally. I imagine the code will become less readable, but it will be far more efficient when working with big data. I look forward to tinkering with this over the next few weeks. We have a break from classes coming up, so I am unsure if I will be able to keep up my regular posting, but I will definitely try if I have the time.

I'd like to close with my initial idea for updating the tool to handle big data:

# inputFile / outputFile are the tool's input and output file paths
outLineNum = 0
transposedRows = []   # transposedRows[i] collects output line i (one per input column)
with open(inputFile) as infile:
    for line in infile:                  # only one input line is parsed at a time
        items = line.rstrip('\n').split('\t')
        for item in items:
            if outLineNum >= len(transposedRows):
                transposedRows.append([])
            transposedRows[outLineNum].append(item)
            outLineNum += 1
        outLineNum = 0
with open(outputFile, 'w') as outfile:
    for row in transposedRows:
        outfile.write('\t'.join(row) + '\n')

So only one input line is parsed at a time, and previous lines can be garbage collected; the catch is that the transposed rows still accumulate in memory. This may still pose issues, as some datasets have thousands of features, or more, which would result in a lot of data being held in memory anyway. Perhaps I could take a different route and read in individual values one at a time, writing them into the output file as needed.
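One possible shape for that different route, sketched here as a rough idea rather than the version we will actually submit: make one pass over the input per output line and pull out a single column each time, so memory stays flat no matter how large the file is, traded for re-reading the input once per column.

def transpose_streaming(in_path, out_path):
    # One pass over the input per output line: only a single value is held in
    # memory at a time, at the cost of re-reading the input once per column.
    with open(in_path) as probe:
        num_cols = len(probe.readline().rstrip('\n').split('\t'))
    with open(out_path, 'w') as out:
        for col in range(num_cols):
            with open(in_path) as infile:
                for row_num, line in enumerate(infile):
                    value = line.rstrip('\n').split('\t')[col]
                    out.write(('\t' if row_num else '') + value)
            out.write('\n')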

One last mention I would like to make: the first pull request we made to Galaxy is now ready to be versioned into Galaxy and will land in the Galaxy Central branch in the next update. It was effectively all of this before, but now it's "official" with this Trello Card

Music listened to while blogging: Kendrick Lamar

Wednesday, February 26, 2014

Capstone: Galaxy and RStudio

So this post is going to focus on the integration of Galaxy and RStudio into Learn2Mine. The last post focused on the virtual portfolio, which is just one of the three core parts of Learn2Mine.

Galaxy is an open source project that I have described in great detail in past non-capstone-related posts, so I'll just do a quick summary here. Galaxy is primarily a bioinformatics analysis platform that specializes in working with genomics data. It abstracts the command line away from users with a JavaScript interface and wraps Python/Perl/etc. scripts for converting data, running programs like TopHat and the rest of the Tuxedo suite, and even running some statistical analyses. Galaxy allows users to create workflows by tying jobs together - think of it as a set of directions: first I want to upload these genetic datasets, then run them together through one tool that aligns them with a specific algorithm, then send that result to a visualization tool that creates an HTML output telling me the score of the alignment and giving me the option to download the alignment file. If this were a workflow, I could hand it my files and it would do everything else by itself through scheduling within the job manager. All that really matters here for Learn2Mine is that you can use the output of jobs within Galaxy as inputs for other jobs. So I can upload datasets and use them in Learn2Mine tools. I can use the output of code I run, or of other tools, as my submission for grading. I can perform scaling or filtering on my data and then use the new scaled/filtered version with a tool. This stream-of-consciousness description of Galaxy is probably the most watered-down version I've given, but my past blog posts cover Galaxy in more depth if you really want to read more.

So how does Learn2Mine take advantage of Galaxy? The tools I mentioned in my last post use Galaxy's interface to allow less-experienced programmers to run these algorithms without having to know all the specifics. On the right you will see the rendered result of the XML markup for Learn2Mine's neural network tool. The very first input (at the top) lets users select a dataset they have previously uploaded to Galaxy as the dataset for the algorithm; even data that has been altered after the upload step can be used here. The rest of the inputs do not rely on past jobs in Galaxy but are, rather, an abstraction of the inputs you would normally feed into a neural network. Take the hidden layers input, for example: it takes a comma-separated list of values, with one hidden layer per item, and each number represents how many nodes exist in that respective layer. Concepts like this perpetuate throughout all of the built-in tools for Learn2Mine.
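For example, a value of "10,5,2" means three hidden layers with 10, 5, and 2 nodes; parsing it boils down to something like this toy function (the parameter name is made up for illustration - this is just the idea, not the tool's actual wrapper code):

def parse_hidden_layers(hidden_layers_field):
    # "10,5,2" -> [10, 5, 2]: one hidden layer per comma-separated item,
    # where each number is the node count for that layer.
    return [int(n) for n in hidden_layers_field.split(',') if n.strip()]

# e.g. parse_hidden_layers("10,5,2") == [10, 5, 2]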

Alternatively, there is a section of Galaxy tools we have built referred to as "Learning R" tools. The only jobs that can be run from those tools are "Create RStudio Account", "Get Personalized Dataset", and "Submit R Lessons to Learn2Mine". The "Create RStudio Account" tool was made recently. This step was completely hidden earlier because, in order to communicate from Galaxy to Learn2Mine, we were forcing users to pass a unique key associated with their account around Galaxy; when users submitted their key to Galaxy, we created their RStudio account behind the scenes. Until we find a way to automate RStudio account creation as part of a Learn2Mine signup, we will have to make users run this tool if they want to use our cloud-based R IDE. The "Submit R Lessons to Learn2Mine" tool can be run whenever you want to submit an R-based lesson to Learn2Mine for grading/badge-earning (it is analogous to the Submit Learn2Mine Tool Lesson tool in the Learn2Mine_Toolset section). It allows users to submit their code/answers as Galaxy output or to copy/paste their answer into a text box - some users prefer one way and some prefer the other, and it was not difficult to allow either. The "Get Personalized Dataset" tool is one we hope to use more in the future. Right now it is only used for the advanced R lessons. It takes a user's information and gives them a personalized dataset for use in the lessons - so no two users will have the same dataset to analyze and be graded on. We would like this to become the standard for all lessons.
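The general trick behind a personalized-dataset tool is simple, even if the real tool has more to it: seed a random number generator from the user's identity, so the same user always gets the same data back but no two users get identical datasets. An illustrative sketch (not our actual tool code):

import hashlib
import random

def personalized_dataset(user_email, n_rows=100):
    # Derive a per-user seed so the dataset is reproducible for that user
    # but differs between users.
    seed = int(hashlib.sha1(user_email.encode('utf-8')).hexdigest(), 16) % (2 ** 32)
    rng = random.Random(seed)
    # Toy two-column dataset; the real lesson data is richer than this.
    return [(rng.gauss(0, 1), rng.choice(['A', 'B'])) for _ in range(n_rows)]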

As I mentioned in the previous paragraph, RStudio is the part of Learn2Mine for which users have to have Galaxy create their account. This is because, currently, our RStudio server authenticates against accounts located on our Learn2Mine server, so there is no current way to tie Google authentication into that form of login. RStudio is a cloud-based IDE that lets users run R code through an interpreter, or run entire files, much like the R IDEs that require local installation. RStudio allows users to install any third-party R packages they desire, which is especially useful for visualization tasks. Typically, we want users to come to RStudio to write their code and then submit their code/answer through Galaxy. It would be wonderful if we could tie RStudio and Galaxy together even further by just pointing Galaxy at the file a user is working on for a lesson, but that is beyond the scope of this Spring semester.

Music listened to while blogging: Kanye West and Lily Allen

Capstone: The Gamification of Learn2Mine

In my previous post I mentioned Learn2Mine's gamification ideas and how it uses this approach to teach data science. The easiest way to explain this is to talk about each gamified component individually and how they relate to one another as I go.

Skill Trees
At the heart of Learn2Mine is the virtual portfolio (the Google App Engine side of Learn2Mine). The virtual portfolio contains many things, but I want to focus on the skill tree. This is located in the profile section of the site and is unique in that it shows a user's individual progress. When you first go to the site, you will have a skill tree that is all gray except for the root of the tree. At the root is a basic intro-to-data-science badge, which you get just for joining the site - a little motivation to start you off on your badge-earning journey. The badge directly under the root is currently an "Uploading" badge. That badge is being scrapped in favor of the tutorial we are working on, so a "Tutorial" badge will take its place - more on the tutorial in a later post, though. The skill tree branches off into two main sections as of right now, and this will definitely be expanding in the coming months. The left side of the tree is all about learning R programming (anything from basic skills, to file I/O, to writing classifiers). The right side of the tree is about using the built-in machine learning algorithms we have inserted into the Galaxy side of Learn2Mine. A depiction of the R subtree as it stands right now can be seen below. Note that gray badges are unearned and colored badges have been earned. Green badges are basic learning badges - typically easier lessons. Blue badges are mastery badges, a step above the green badges. Finally, the gold badges are advanced badges, a step up from the blue badges.


Achievements/Badges
So I just mentioned badges; you may be wondering where those came from. Badges are a way to incentivize users to complete lessons - effectively, achievements for completing certain lessons. When you earn a badge, it takes its place on the skill tree and, additionally, a fully fleshed-out list of your badges appears on your Learn2Mine home page. But so what? Having badges that only show up on Learn2Mine is kind of boring. Well, we are in the early stages of building a search feature so you can compare yourself with your friends and try to out-compete them, but, more importantly, we have integrated our badges with Mozilla Open Badges. Mozilla Open Badges is an online standard for recognizing and verifying learning. You are able to show your badges off on LinkedIn, put them on a resume - really, you can do anything with them. What gives them credibility, though, is the JSON that backs the badge images. It is there that proof exists of when you earned the badge, where you earned it (for now this is always the College of Charleston and Learn2Mine), and all the other metadata that is useful to have. So this has much more credibility than all your friends endorsing you on LinkedIn for skills you may not even possess. Eventually, we want other institutions and teachers to use Learn2Mine to create their own lessons, so there would be new institutions backing certain badges, which would help the site flourish and help in the resume process.
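To make that concrete, the data behind a badge looks roughly like the following; the field names follow the Open Badges assertion format as I understand it, and every value and URL here is made up for the example.

import json

# Illustrative shape of the assertion data behind one badge.
assertion = {
    "uid": "l2m-neural-networks-0001",
    "recipient": {"type": "email", "hashed": False, "identity": "student@example.com"},
    "badge": "http://learn2mine.example.org/badges/neural-networks.json",
    "issuedOn": "2014-02-26",
    "verify": {"type": "hosted", "url": "http://learn2mine.example.org/assertions/0001.json"},
}
print(json.dumps(assertion, indent=2))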

Leaderboards
The leaderboards we have built into Learn2Mine are very primitive as of right now. Currently, we have leaderboards for three of the built-in Learn2Mine tool lessons - k-Nearest Neighbors, Neural Networks, and Partial Least Squares Regression. The idea is that users will compete against each other to get better scores on certain lessons. For example, the k-Nearest Neighbors lesson judges a user's score by the number of classifications they get correct on our pre-built test set. This is something that desperately needs reworking, but it is not as important as the rest of the site. One reason reworking is needed is that someone clever enough with R programming can currently manipulate their output to get 100% of the test set correct, though that would be a case of either cheating or overfitting. Our k-Nearest Neighbors leaderboard is shown below (and you can tell it really hasn't been used):


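As for how those k-Nearest Neighbors scores are computed, it conceptually boils down to something like this simplified sketch (not the grader's actual code):

def knn_leaderboard_score(predicted_labels, true_labels):
    # Percentage of the held-out test set classified correctly - the number
    # that gets posted to the leaderboard.
    correct = sum(1 for p, t in zip(predicted_labels, true_labels) if p == t)
    return 100.0 * correct / len(true_labels)

# e.g. knn_leaderboard_score(['A', 'B', 'B'], ['A', 'B', 'A']) -> 66.66...
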
Game Over? Instant Feedback
When you submit a lesson to Learn2Mine through Galaxy, you are able to submit as many times as you want. Let's take the standard R lesson as an example. That lesson has 12 questions that you have to answer with R output. On a quiz in a class, if you missed 4 of 12 questions, you would end up with a 66, assuming equal weight; on Learn2Mine, we just let you know which 4 questions you missed and then allow you to retry - as many times as you want. We let you know exactly what you missed and, in certain cases, we may even provide hints to keep in mind when coding up your answers.
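Conceptually, the feedback side of the grader is doing little more than this (again, an illustrative sketch rather than the real implementation):

def grade_lesson(submitted_answers, expected_answers):
    # Compare each submitted answer with the expected output and report the
    # question numbers that were missed, rather than just a single score.
    return [i + 1 for i, (s, e) in enumerate(zip(submitted_answers, expected_answers)) if s != e]

# e.g. missing questions 3, 7, 9, and 12 out of 12 reports [3, 7, 9, 12]
# instead of handing back a 66.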

Music listened to while blogging: Childish Gambino and Tech N9ne

Capstone: Motivations for Learn2Mine & Related Works

For this post I will be going over the motivations for the creation of Learn2Mine and some related works. I touched on related works last time, but I'll expand that list and give a clearer vision of what Learn2Mine is aiming to do.

For starters, there is not an effective interactive site to learn data science and perform data science algorithms all in one place. Many people have used programs such as Weka or RapidMiner in the past to run algorithms and take the results away for their own use, but those results are often confusing and require a large amount of computer science expertise to use and understand. Weka's outputs do not contain much information and are not meaningful unless you are an expert in the algorithm you are running. RapidMiner has a workflow interface that may confuse new users, leading to a very steep learning curve for the software. Neither of these informatics platforms, though, bothers teaching users about the algorithms or how they work - they merely give a basic introduction to how to use the software. It is in the name: RapidMiner - rapid mining. It really is just used to mine information. The name of my software, however, is Learn2Mine - you can learn to mine data, or you can just strictly mine if you want. The options are open, and that is one of the crucial aspects of Learn2Mine - freedom of usability and pedagogical ability.

A lot of programs, though, are pretty effective at actually teaching concepts to students. I used Rosalind in the past to learn bioinformatics concepts and apply my programming knowledge toward conducting basic bioinformatics algorithms. It is an effective program, but it has pigeonholed itself by catering only to computer scientists who have a specialized interest in biology. Learn2Mine aims to take this idea and expand it to any and all domains. This has been pioneered to a small extent already: there currently exist three case-study lessons where students have to fill in missing code in order to finish problems relating to algal bloom classification, stock market investments, and fraudulent transactions. This will be expanded in the coming months as lessons are rolled out for bioinformatics, artificial intelligence, and data mining. The bioinformatics lessons are listed because it is a specialization I have adopted at the College of Charleston by taking multiple bioinformatics classes and by having my data science concentration be in molecular biology. The artificial intelligence and data mining lessons will be included because there are classes at the College of Charleston which will utilize those lessons toward the end of the semester as a way to evaluate students.

Learn2Mine has other aspects that make it stand out from other programs. It is not just about being able to learn and perform; Learn2Mine takes the next step and is a completely cloud-based technology. You need not worry about installing Learn2Mine on any machine or about any kind of dependency. If you want to submit a lesson at the library and then do another one at home, you are free to do that because of our cloud-based nature. Below is an image that shows everything that goes into Learn2Mine:



So Learn2Mine can teach data science and perform the related algorithms, but what is going to keep people motivated to use it? Interdisciplinary fields need new ways to approach their teaching, so Learn2Mine has coupled its development with gamification. Gamification is not to be confused with "edutainment," which is a video game with a bonus educational goal (e.g. you beat the boss, here's a fact about programming languages); gamification means building lessons that students complete with motivation stemming from techniques inspired by video games. In Learn2Mine, the techniques currently used are skill trees, leaderboards, and achievements (in the form of badges). My next post will focus on the gamification elements of Learn2Mine, how they have been implemented, and what is next to implement.

Music listened to while blogging: Ghostland Observatory and Nine Inch Nails

Monday, February 24, 2014

Capstone: Introduction to Learn2Mine

I'd like to open this post by making a note about the state of my blog over the next several weeks. For my capstone, I am required to keep up with a blog and create posts specifically about work on my research and the capstone paper itself. Any post prefaced with "Capstone:" will be in direct reference to that. So anyone who wants to skip over those posts can just skip on by them, and those who are interested can read them if they so wish. This is mainly to have a compiled listing of my work through software engineering and my work through my own research in one place, rather than managing multiple blogs.

So my project is Learn2Mine. But before I even tell you what that is about, you need to have some prerequisite knowledge, or at least an inkling of an idea about certain topics. So let's get to it.

Data Science is the first of these topics. Data science is an interdisciplinary field which crosses the realms of Statistics, Computer Science, and a domain field (e.g. Biology, Business, Geology, etc.). To the right is a very popular image which really highlights the cross-discipline nature of Data Science.

Data science is not taught to its fullest nowadays, though. People who are data scientists tend to be primarily mathematicians (i.e. statisticians), computer scientists, or people with substantive expertise in a scientific or business-related field (see Best Source of New Data Science Talent below). If you have traditional training in one of these fields, you tend to try to teach yourself the important skills of the other fields. So maybe a biologist will try to learn the algorithms (e.g. the Smith-Waterman algorithm) that conduct the alignment of nucleotides or amino acids in gene sequences. But being able to use these algorithms at the most primal of levels, without really understanding how to tweak them or knowing what is going on, means that the computer science expertise you have is not enough to really mold you into a data scientist - or at least what we would like data scientists to be. A better representation of this can be seen in the image below.


So what are we going to do? How do we make sure that the influx of data scientists we desperately need in academia and industry can get the proper training? The answer to that question is one that has been under development for quite some time now: Learn2Mine. You may be wondering what Learn2Mine is or how it is going to achieve this incredible goal. Well, you may have heard of sites like Codecademy, Rosalind, and/or O'Reilly. Soon Learn2Mine will be among this list as the preferred source for students and scientists alike to learn and master the skills and techniques of a data scientist.