Monday, March 31, 2014

Developing the User Interface



By this time, I was hoping to have had a meeting with Agile Charleston, but their group seems to be pretty inactive on LinkedIn and there is no semblance of a meeting/event schedule. Perhaps I will meet with another group and then talk about that experience, even though I am heavily interested in the Agile Charleston group. So, rather, I will focus mostly on Chapter 8 of Software Development: An Open Source Approach.

This chapter focuses on the development of user interfaces. I'll start by following the book's breakdown, because it is a really good summary of what makes a user interface solid:

Completeness - All the steps of every use case in the design must appear on a page or group of related pages in the user interface, but no more.
Language - The language of the interface must be consistent with the language of the domain and user. All labels and user options that appear on individual pages must be consistent with their counterparts in the design document and the application's domain.
Simplicity - No page should contain too much information or too little. Each page should be pleasant to view, yet its functionality should not be buried by excessive detail or elaborate stylistics.
Navigability - Navigation within a page and among different pages should be simple, explicit, and intuitive.
Feedback and recovery - Each page must provide the user with a clear indication of what has just been done, what can be done next, and how to undo what has just been done.
Data integrity - The types and valid values for individual user data entries must be clearly indicated. The software should validity-check all data at the time of entry and the user should be required to correct errors before the data are kept in the database.
Client-server integrity - The activities of several different users who happen to be using the system at the same time must be kept separate and independent.
Security - An individual user should have access to the system's functionality, but only that functionality for which he/she is authorized.
Documentation - Every page or group of pages in the user interface should be linked to a step-by-step on-line instruction that teaches a user how that page can be used to accomplish a task.

So you may have glanced over this and thought, "Well, duh, why wouldn't you do all of that when developing an application?" No one makes an application and is happy if it has security holes or if it is hard to navigate. But these things are not always easy to do. For example, in my own research we have been slowly locking down the security of our program because we had a few problematic security holes, and those holes were part of the reason we had not open sourced it and released it to the public earlier.

To make these issues easier to tackle, it is vital to adopt a policy for development. One common policy is to adopt a stringent design pattern and follow it throughout the entirety of the project. Arguably the most common design pattern utilized whenever developing an application is the model-view-controller (or MVC, for short) pattern. An image can be seen below which depicts the basic strategy for implementing MVC (image adapted from Stack Overflow):

So if a user visits your site, all they ever see is the "View". The "View" is just a way to represent all the backend information of an application in an easily comprehensible manner - an abstraction. The "View" is typically created with some markup language (e.g. HTML) and altered with some scripting language (e.g. JavaScript). The creation of the view is crucial because users can easily be turned off of an application if the web frontend seems shoddily made - and this goes back to several of the aforementioned points about user interface design.

So let's say a user interacts with the view and expects some change or update. That request goes to the "Controller", which I will talk about in a bit, and the "Controller" then communicates with the "Model". The "Model" contains the state of the application and stores information in databases (hence the "MySQL" image on top of the "Model" in the image). The "Model" does not actually alter the data it contains, though - that is the job of the "Controller". When a user interacts with the application, the controller performs some operation (or operations) based on the user's input. This can result in changes to the model, as the state of the program may have changed. The changes made in the model are then effectively "read" by the "Controller" and output is sent to the user's view.

But what about the bottom half of that image? First, I'd like to mention that it is not always present when talking about MVC schemes - this image is merely an example of MVC applied to a web-based application. That half mostly handles HTTP POSTs, requests, responses, etc. It acts somewhat as an adapter so that posts, requests, and responses are handled consistently no matter what browser you are using (unless it is some old version of IE that fails to even support HTML5). If you would like to read more about this, then go here.
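To make the separation of roles concrete, here is a minimal, hypothetical sketch of the MVC flow described above - the class and method names are invented for illustration and do not come from any real framework:

```python
class Model:
    """Holds application state; does not alter it itself."""
    def __init__(self):
        self.items = []

class View:
    """Renders model state for the user (here, as plain text)."""
    def render(self, model):
        return "Items: " + ", ".join(model.items)

class Controller:
    """Interprets user actions and updates the model."""
    def __init__(self, model, view):
        self.model = model
        self.view = view

    def add_item(self, name):
        self.model.items.append(name)        # change the model's state
        return self.view.render(self.model)  # hand the new state to the view

controller = Controller(Model(), View())
controller.add_item("first")   # "Items: first"
```

The key point is that the view never touches the data directly and the model never formats anything for display; the controller is the only piece that coordinates the two.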

I'd like to take some time now to talk about security. A user should never be allowed to access information that is not specifically needed for them. This can be done by using a browser's session variables and cookies as a place of storage - or, the approach I initially took with my own research, by matching users in a datastore using their email as a unique identifier. Before that, we had created a random, unique session ID, but we were able to work around that and drop it. Regardless, we were doing all we could to keep users from entering input into forms that could be used for code injection. We implemented OAuth2 and linked it to the Google APIs, so we know we have a genuine, authenticated Google account; we use the email associated with it to pull down the information we have created in the database, rather than having a user type their email (or any kind of malicious code) into a form that we then use to query the datastore. Hypothetically, if we allowed users to enter their own email, they could write a SQL injection and destroy our entire database, or retrieve all the information out of it. If they retrieved it all, that could be disastrous if we had sensitive information stored - encrypted or not, that is bad.
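The standard defense when you do have to accept user-supplied values in queries is parameterization. A small sketch with Python's built-in sqlite3 module (the table and emails here are made up, and our datastore is not SQLite - this just shows the general technique):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice@example.com', 'Alice')")

def lookup_user(email):
    # The ? placeholder lets the driver escape the value, so input like
    # "' OR '1'='1" is treated as data, never as SQL.
    cur = conn.execute("SELECT name FROM users WHERE email = ?", (email,))
    return cur.fetchall()

lookup_user("alice@example.com")   # [('Alice',)]
lookup_user("' OR '1'='1")         # [] - the injection attempt matches nothing
```

Had the query been built with string concatenation instead, that second call could have returned every row in the table.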

So you can easily see why a lot of forethought and a lot of time has to be dedicated to user interface design. You have to design in such a way that everything is easy to understand while also balancing security issues and usability issues. It's a game of tug-of-war that will never stop as you constantly maintain and evolve your application.

Music listened to while blogging: Kendrick Lamar and Tech N9ne

Monday, March 24, 2014

Capstone: Gamification of the UI of Learn2Mine

Learn2Mine has been deployed in the AI and Data Mining (CSCI 470 and CSCI 334) classes here at the College of Charleston. In the past, we deployed Learn2Mine in Data Science 101 classes, so the students using it now should be far more advanced than past users, by sheer experience in the field.

One thing became very clear, very quickly, during this second piloting of Learn2Mine - students do not like having to use multiple tabs or windows when using Learn2Mine. To me, this seems like a very innocuous problem that could easily be overlooked, given the complex nature of the system's backend. But at the end of the day, your users are your life when it comes to applications like this. So we decided to take an evolutionary step with our software.

Now Learn2Mine is primarily being used as a teaching and learning tool, which is great - so let's simplify the interface that students have to use when actually learning. You may have read earlier posts where I've talked about Galaxy being built into Learn2Mine. If you're a student, you never even have to know about Galaxy. Galaxy is strictly back-end now, and the interface students used with it in the past is completely hidden from them. Dr Anderson worked furiously to get an API call set up to run Galaxy jobs programmatically (through a Python call).
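As a rough sketch of what such a programmatic call can look like: Galaxy exposes a REST API (including a tools endpoint for launching jobs), but the URL, tool id, history id, and key below are invented for illustration, and this is not necessarily how Dr Anderson's code is structured:

```python
# Hypothetical sketch of submitting a Galaxy job through its REST API.
GALAXY_URL = "https://learn2mine.example.edu/galaxy"   # made-up host

def build_tool_payload(tool_id, history_id, inputs, api_key):
    """Assemble the JSON body for a POST to {GALAXY_URL}/api/tools."""
    return {
        "key": api_key,
        "tool_id": tool_id,
        "history_id": history_id,
        "inputs": inputs,
    }

payload = build_tool_payload("grade_r_code", "abc123",
                             {"code": "x <- 1"}, "fake-key")
# The actual call would be something like:
# requests.post(GALAXY_URL + "/api/tools", json=payload)
```

Driving Galaxy this way is what lets the grading run behind the scenes without students ever seeing the Galaxy interface.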

So, if you have read and become versed with the basics of how Learn2Mine works, you may be wondering how exactly students can get their work graded and receive feedback. Below you will see a screenshot of where students submit their code:

Now there are a few things that will jump out at you if you have been keeping up with the way Learn2Mine works.

First, it can be seen that there is now compatibility for Python 2.7, in addition to R. The initial need for this came about through the AI class since the work they are doing will all be in python. But what really sparked the "need" for this came from the data mining class. A lot of the students are not comfortable working in R and some of them feel that they can work a lot better, quicker, and easier in python, even if it means utilizing multiple third party libraries. So now we have to code solutions in two different ways in order to allow students to code in either language when trying to progress through our system. 

Second, it looks like there is just a button that grades your code. So what happens when you submit? 

So I inserted some text and clicked "Grade R Code" - obviously this should raise an error, but that is part of what I want to show you here. The output here is the exact output you would get if you ran my text through an R interpreter, so you can get the exact trace. If you submit code that is syntactically correct, though, then the feedback box will contain feedback about why your submission was wrong: maybe you didn't define the function the way we specified, maybe your function's signature was incorrect, maybe you were not able to produce the correct matrix, etc. Whatever your problem is, we tell you.

Alright, so we let you submit your code in bulk. Big whoop. That is only going to promote the idea of people trying to code entire solutions at once. Well, to combat this, I have been working on an interface upgrade for Learn2Mine to go side-by-side with Dr Anderson's R/Python upgrades in order to promote the breakdown of coding problems into more atomic pieces.

So the page I have been screenshotting and posting here is for the "Empirical Naive Bayes" lesson. Learn2Mine is set up so the last problem on a page is effectively a "Boss Fight" (we're trying to come up with better nomenclature for this, but that's on the backburner), and you only really get credit for the lesson if you defeat the boss. But what are the problems before it? Well, in the same vein as video games, we give students the opportunity to practice and hone their skills before taking on the boss. For example, the Empirical Naive Bayes lesson has two problems before the boss problem that break the process into two smaller steps. Your ultimate goal in this Naive Bayes problem is to compute the posterior probabilities, but along the way you will compute prior probabilities and densities. So we have abstracted out the computation of the prior probabilities and densities. If students write these as separate functions, they can verify those steps are correct by running their code through the non-boss problems. Then they can just add those functions to their code for the boss and call them as needed, rather than having errors pop up in all kinds of crazy places.
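The decomposition described above might look like this in code - a hedged sketch in Python rather than the R students actually use, with invented function names, where the priors and densities are standalone helpers that the "boss" posterior function simply calls:

```python
import math

def priors(labels):
    """P(class), estimated from label frequencies (a non-boss step)."""
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n = len(labels)
    return {y: c / n for y, c in counts.items()}

def gaussian_density(x, mean, var):
    """Normal density for a continuous feature (the other non-boss step)."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def posterior(x, labels, stats):
    """Boss step: unnormalized P(class | x) = P(class) * P(x | class)."""
    p = priors(labels)
    return {y: p[y] * gaussian_density(x, m, v) for y, (m, v) in stats.items()}

labels = ["a", "a", "b"]
stats = {"a": (0.0, 1.0), "b": (5.0, 1.0)}   # per-class (mean, variance)
result = posterior(0.1, labels, stats)       # class "a" dominates for x near 0
```

Because `priors` and `gaussian_density` can be checked on their own, a bug in the final answer is immediately localized to the one remaining piece.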

Lastly, as a motivation tool for doing these more atomic parts of the problem before working toward the boss, I have created a visual progress bar. If you complete the earlier problems, this bar will start to fill up and give the user a sense of achievement (hopefully). Of course, if a user goes straight to the boss problem and conquers it, the bar fills all the way up, because you shouldn't be punished for already being able to do the final part of the problem. Below you can see the bar after having completed the 2 non-boss problems for the Empirical Naive Bayes lesson:



Music listened to while blogging: Kanye West

Tuesday, March 18, 2014

Capstone: Documentation and Database Migration -> Countdown to Open Source

This post will focus primarily on the documentation that we have been adding to Learn2Mine in order to prepare it for the day we decide to open source it (the light at the end of this tunnel can be seen). Additionally, I will talk about some database issues with which I had to deal.

So, documentation: it's something I talked about in one of my posts for software engineering last week. No one likes doing it, but everyone needs it. Jake and I will be handing Learn2Mine off to some new students in the Anderson Lab at the end of this semester, and we do not want to waste those students' time by making them comb through code just to understand it. So we have put together a Google document which is slowly growing. Hopefully this document will morph into a README in the future, as a README is vital for a project that wants to be a successful open source project.

So right now the README includes details about how to modify and add to the continuously growing components of Learn2Mine (e.g. adding a node/badge to the skill tree). Right now the language used throughout is fairly colloquial and written for the Learn2Mine team, but we will clean it up and make it more formal in the future (like whenever we add installation and development instructions).

Learn2Mine is (as of today) being used in the data mining class here at the College of Charleston. Recently, we reinstalled Galaxy after a bunch of issues, and this ended up also resetting the database we had set up for Galaxy, a Postgres database. In our recent development we had not noticed any difference in Galaxy's performance even though we had fallen back to Galaxy's initial SQLite database. When dozens of students started to work on Galaxy at the same time, however, we ran into concurrency issues. Galaxy just was not allowing students to run jobs or submit lessons because too many people were trying to interact with the database at one time, a problem common with SQLite - this is why SQLite is not meant to be used for large servers. So I stopped Galaxy from running, effectively taking the site down momentarily. I delved into the universe_wsgi.ini file and pointed the database location to where the Postgres database existed on our virtual machine (the learn2mine-server that hosts Galaxy and RStudio). I then had to run "sh manage_db.sh upgrade", a script that looks at the configured database location and upgrades Galaxy's database schema to point at that new database.
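For reference, the change in universe_wsgi.ini amounts to swapping the connection string - this is a hypothetical excerpt, where `database_connection` is Galaxy's setting but the host, database name, and credentials below are made up:

```ini
[app:main]
# Galaxy's default (SQLite, prone to locking under concurrent use):
#database_connection = sqlite:///./database/universe.sqlite

# Pointed at the Postgres database on the learn2mine-server VM:
database_connection = postgres://galaxy_user:secret@localhost:5432/galaxy_db
```

After editing, `sh manage_db.sh upgrade` brings the schema in the target database up to date.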

So that migration created a new Postgres database. Unfortunately, all the users that existed in the SQLite database are not in the Postgres database. Dr Anderson and I sat down and tried to use a dump of the SQLite database to effectively merge it into the Postgres database. After about 20 minutes of trying different approaches, it seemed like way too much work to migrate a database with only a day's worth of information. All in all, for the users that just started today, it is not much hassle to recreate their accounts. It will create some errors on the backend - for example, the users' RStudio accounts already exist, so account creation will hit an error for trying to create an account that is already in existence - but that error does not stop the flow of Galaxy or RStudio, and the users never see it. So there is no real drawback, especially since this only affects the users that created accounts on Galaxy today, and only the next time they try logging in.

Music listened to while blogging: Hopsin

Planning to Meet Charleston

So my group met up today and we examined some Python documentation and reviewers' thoughts on the use of specific file I/O in Python with regard to garbage collection. A Stack Overflow post was very insightful for us. I previously posted that I was curious whether Python's garbage collector gets a chance to run when reading files line-by-line. If you use the "with" statement in Python to read a file line-by-line, then garbage collection is taken care of automatically (either the buffer will fill and the collector will run then, or the collector will simply keep up with the line-by-line reading). So our replacement for the map function works the way the Galaxy developers want it to work.
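The pattern in question is the standard one below: "with" closes the file automatically when the block exits, and iterating the file object yields one line at a time instead of loading the whole file into memory (the throwaway file here is just for demonstration):

```python
import os
import tempfile

# Write a small throwaway file to read back.
tmp = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
tmp.write("line1\nline2\nline3\n")
tmp.close()

lines_seen = 0
with open(tmp.name) as f:   # file is closed automatically on exit
    for line in f:          # buffered, line-by-line read
        lines_seen += 1     # earlier lines are free to be collected

os.remove(tmp.name)
lines_seen                  # 3
```

Since no reference to a previous line survives past each loop iteration, memory usage stays flat no matter how large the input file is.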

So now we need to add our code to the Galaxy toolshed. I found wiki documentation here which explains the purpose of the toolshed and how to utilize it and add to it. The actual Galaxy toolshed is located here. Effectively, this is a list of repositories that Galaxy users can clone and import into their own Galaxy instance in order to utilize the tool(s) found in the toolshed. A member of Team Rocket, Jacob, uploaded the files to the Galaxy toolshed and the repository can be found here. The Python and XML files are both present in the repository as well as the functional tests I wrote for it, initially. As far as we know, we are done with our second Galaxy addition. So now we need to find a new bug or feature to tackle.

Now you may be wondering about the title of this blog post because it just seems like an experience report so far. Well, since POSSCON will not be happening this year, the software engineering practicum class is being tasked with visiting and attending a meeting of a group listed here. After perusing most of the groups I found that I am most interested in attending a meeting/event for the Agile Charleston group. To join the group to find out about times and more information I have to wait for my request to join the LinkedIn group to be accepted. So more on this later.

Music listened to while blogging: All Time Low

Monday, March 17, 2014

Capstone: Walkthrough of the Tutorial for Learn2Mine

So Jake and I have cranked a lot of work out for Learn2Mine recently. The bulk of this work has been related to finishing the tutorial for Learn2Mine and getting any issues with it settled.

When you navigate to Learn2Mine's home page you can select "Take the Tutorial!" and jump straight into the tutorial (as I've explained in previous posts). The tutorial introduces users to the three components of Learn2Mine, which includes creating a Galaxy/RStudio account.

So let's stop right here. RStudio accounts used to require running a separate tool to create; now users no longer have that extra level of confusion. Their Galaxy account is their RStudio account. The only issue we currently have is that users cannot change their passwords on either component of Learn2Mine - something that will be solved soon.

So back to the tutorial. Jake did a great job making images that explain the different sections of Learn2Mine and Galaxy and these snapshots can be seen below:


So now the users of the site can get started with the tutorial. The first section is the "Basic R Tutorial". Here, users are asked very simple programming questions. For example, the first problem asks users to create 2 integers (type declaration is not important in R, so users can simply say "x = 1234" or "x <- 1234") and perform some mathematical operations on those variables. Users are to do this in RStudio.

Now they have the option to submit this first problem now if they want and we give them a quick how-to on submitting by clicking the "How Do I Submit?" button in the bottom left of the tutorial pages. Let's say the user has moved on to one of the harder tutorial lessons, though. What if the user is having a hard time because of a lack of exposure to R and needs just a little jumpstart in how they should start their coding?

Well, I worked hard on developing JavaScript code (using jQuery) that takes example code I created and prints it out to users (either all at once or line-by-line). This is especially useful in the last coding section of the tutorial. Users are asked the following (page available here):

Write your own kNN function called my.knn that takes 3 arguments: a training file, a testing file, and a k value. The function should use the training file to compute a set of Euclidean distances in order to find the nearest neighbors for the records (rows) in your testing data file. Finally, you want to return a 1-column matrix which contains the correct labels for the testing data. The nth row in the test labels matrix corresponds to the nth row in the testing dataset. Though your kNN algorithm should be generalizable to many different datasets, you may use the training and testing datasets we provide for you below in order to test your function.

Below this we give users a training set and testing set for this problem (which I completely made up - it is meant to be simplistic). We also give users the signature for the function (our automatic, instant grading matches the signature first and foremost). Users can either click the Hint button for insight on how to tackle the problem, or reveal the answer line-by-line or in full. Now, I made a note on the page that the code we provide is not the most efficient - and it was never meant to be. Rather, the code was written so users can read and understand it without needing a lot of comments. If users can do that, then the tutorial was successful, because they then understand R code and can get their hands dirty with actual R lessons on the site.
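For readers unfamiliar with the algorithm itself, here is an illustrative analogue of the function the exercise describes - in Python rather than the R the tutorial uses, with in-memory lists standing in for the training and testing files:

```python
import math
from collections import Counter

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def my_knn(train, train_labels, test, k):
    """Predict a label for each row of `test` by majority vote of the
    k nearest training rows (by Euclidean distance)."""
    preds = []
    for row in test:
        nearest = sorted(range(len(train)),
                         key=lambda i: euclidean(train[i], row))[:k]
        votes = Counter(train_labels[i] for i in nearest)
        preds.append(votes.most_common(1)[0][0])
    return preds

train = [[0, 0], [0, 1], [5, 5], [6, 5]]   # made-up training rows
labels = ["low", "low", "high", "high"]
my_knn(train, labels, [[0.2, 0.1], [5.5, 5.1]], 3)   # ["low", "high"]
```

The real exercise additionally involves reading the two files and returning a 1-column matrix, but the distance-then-vote core is the same.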

Let's say the users have finished those R lessons though, then they can move on to the tutorial tool lesson. Here, users will be performing k-NN with our built-in k-NN tool on Galaxy. Users just have to provide a training dataset, a testing dataset, and a k-value (the same as the signature that we had them write a function for). The tool is then run and gives users an HTML output and automatically grades the lesson upon submission.

Music listened to while blogging: Childish Gambino & Tech N9ne

Release Early and Often

For this post I will be reflecting upon my team's progress thus far and chapter 9 of Teaching Open Source.

Much to my surprise, chapter 9 of Teaching Open Source is merely a one page chapter talking about how the oft-espoused "Release Early, Release Often" motto in FOSS should be applied to more than just software for FOSS projects. The textbook itself is actually released early and released often, as it is an experimental, open source textbook. The book even has its own mailing list, like the one I've mentioned for Galaxy a multitude of times in the past.

So let's get into my team's work on Galaxy. So last time, I left off with a code segment that I thought could be improved upon in order to become more efficient than the Python map(...) function which I originally used to create the Transpose tool for Galaxy. Jake and Jacob, of Team Rocket, took the lead after I set up the framework (with the functional testing, map function, and xml markup). The code below is what they came up with.


So the logic behind this is pretty simple. Effectively, we want anyone who uses Galaxy to be able to transpose tab-delimited data (if the data is not tab-delimited, Galaxy provides conversion tools). A problem that often occurs is that someone is working with genetic data that is way too big to load into memory all at once. So we take advantage of for-each loops: once Python has made a pass through a for-each loop - here, "for each line in an input file" - Python no longer needs to access previous lines. We still need to confirm this with documentation, but Python's built-in garbage collector should scrap the previous lines since they are not stored in variables of any sort.
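A sketch of that line-by-line transpose logic, assuming the function and variable names here are mine and not the team's (a full transpose still has to hold all columns in memory at once, but reading with a for-each loop at least avoids also holding the raw input lines):

```python
import io

def transpose_tabular(infile, outfile):
    """Transpose tab-delimited rows into tab-delimited columns."""
    columns = []
    for line in infile:   # one line at a time; prior lines can be collected
        for i, field in enumerate(line.rstrip("\n").split("\t")):
            if i >= len(columns):
                columns.append([])
            columns[i].append(field)
    for col in columns:
        outfile.write("\t".join(col) + "\n")

src = io.StringIO("a\tb\nc\td\n")
dst = io.StringIO()
transpose_tabular(src, dst)
dst.getvalue()   # "a\tc\nb\td\n"
```

The in-memory StringIO objects stand in for the input and output files Galaxy would hand the tool.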

On my end, I have just been keeping up with syncing our team's forked repository of the galaxy-central master branch, that way when we want to push some changes in we will be able to without any hassle.

Music listened to while blogging: Schoolboy Q

Tuesday, March 11, 2014

The Doc Is In!

For this post I will be blogging about Chapter 8 of Teaching Open Source and responding to a couple of the exercises.

So documentation... no one is a fan of writing it, but everyone should be doing it, and doing it well. You never know when you might have to go back to code written years ago, perhaps by someone else. Would you rather read a description of what that code is doing, with a shorthand example, or have to inject your own testing code to see what is happening? Not sold? Would you rather take a minute or two to figure out how code works, or hours? Documentation is the key to making code the best it can be.

So it is evident that documentation is crucial when you are working with others on a project, or when there is a chance you will leave the project and have someone else take the lead on your section - but what if the project you are working on is done just by you? Well, as mentioned before, you will need to go back and change code at some point. As I have reflected on in the past, degradation will take hold of your project if you are not constantly maintaining and improving it. When you go to maintain and improve that code, it is much nicer to navigate comments to figure out what the code is doing, so you can modify only what you intend.

So there is an exercise in Chapter 8 of Teaching Open Source which is asking me to write thorough comments in all of my source code, make sense of source code through documentation alone, and write at least one wiki page of developer documentation for each program I am working on. Galaxy has a standard for writing documentation for code that is created for it.

Writing comments for all of the Galaxy source code would be a foolish process, as this is a project that has been built up over a long period of time (since 2005). The code I have contributed has also been marked up with comments (available here, here, and here). Now, you may navigate to the page and see that there are no Python comments of mine; the comments already existed for the Group tool, and I was merely improving its code. The Galaxy community has effectively perfected the art of documentation - enough to grasp what is going on and where, but not so much as to detract from the code itself - so I was able to go directly to the section of the code where I needed to add mine. There is also example usage documentation within the XML of the tool itself (so when someone is using it, they can see what is supposed to happen). Couple that markup with the code and its comments, and it is evident what the code is supposed to do.

To demonstrate making sense of source code through documentation alone, I will present a tool in Galaxy which conducts a biological analysis I am unfamiliar with. I randomly picked a toolset (phenotype_association) and randomly picked a Python file from inside it (senatag.py). The comments for this code are different: instead of comments scattered through the code, there is a large block of comments at the beginning of the file which explains every piece of it. Overall, senatag takes a file with identifiers for SNPs (single nucleotide polymorphisms) and a comma-separated file which has identifiers for different SNPs. Senatag then outputs a set of tag SNPs for the dataset provided (the comma-separated list). The comment markup can be seen below (as well as the breakdown of the step-by-step code):


To contribute to the wiki I was planning on writing a step-by-step tutorial on how to add tools to Galaxy (as I have outlined this in previous posts and it is something I know well enough to write a structured wiki page on). After navigating the "all wiki pages" section of Galaxy, I realized that adding a page would be extremely difficult because the Galaxy community has done just about everything when it comes to wiki documentation. For reference, the add tool page I was referring to can be found here.

The next exercise I am asked to perform is to pick a feature that sounds tantalizing but is not clearly documented. Using Galaxy for this, I realized that I do not have any experience actually tinkering with the graphical user interface (barring simple XML markup). So first I tried navigating Galaxy's wiki to find information on how to understand or maybe customize the interface (as that documentation would effectively explain how to manipulate all parts of the GUI). Much to my disappointment, I was unable to find any information on this topic, but that does mean I can potentially add this documentation (and it really helps for this exercise). There is a .jar file that looks like it assembles graphics on the client side through the use of dozens of Python and Bash scripts.

So here I will focus on one very simple part of the interaction with the user interface. If you deploy your own version of Galaxy, you are required to have Python 2.6 or 2.7 installed and set in the environment path. Rather than letting crazy errors occur because someone does not have the correct version of Python installed, the Galaxy developers created a simple "check_python.py" which prints a message to the user explaining why they cannot run anything.

Additionally, I learned something that has always puzzled me about Python. In Python you can use triple quotes for large block comments, and in this file the message that gets printed to the user is written in that block style. I did some tinkering in my own Python shell and have now learned that you can assign strings in this manner, adding a ton of readability to long strings (rather than dealing with a ridiculously long word-wrapped string).
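The triple-quote pattern looks like this - the message text below is invented, but the structure mirrors the check_python.py style of assigning a long, user-facing string in one readable block:

```python
import sys

# A triple-quoted string assigns a multi-line value; the trailing
# backslash after the opening quotes suppresses the leading newline.
msg = """\
ERROR: Python %s is not supported.
Please install Python 2.6 or 2.7 and place it on your PATH.
""" % ".".join(map(str, sys.version_info[:2]))

# print(msg)  # the real script would print this and bail out
```

Compared to concatenating short quoted fragments line by line, the string reads on screen exactly as it will print.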

Music listened to while blogging: Childish Gambino

Thursday, March 6, 2014

Capstone: Introduction to a Tutorial and Prep Work for Showcasing

So for the past fortnight (yep, I just successfully used that word) the Learn2Mine team has been hard at work on creating a tutorial for Learn2Mine - really just a simple 30 minute introduction as to what we aim to have users achieve within our system and effectively giving them a taste of the two different types of lessons we currently offer. So let's break down the steps we had to take in order to prepare for a tutorial (next post will focus on the tutorial itself).

For starters, the tutorial really did not stick out to users first entering the site so a lot of work was done on the frontend of Learn2Mine in order to give it a sleeker feel and make it more intuitive for the less-experienced user. There now exists a splash page for the site where users that are not logged in can actually see just a little bit of what Learn2Mine does, rather than quickly asking users to authorize our program with their Google account. An image of this screen can be viewed below or by navigating directly to the site.



So once users are logged in they are directed to the Learn2Mine home page and it is from here that they can jump into the tutorial. Effectively, there is a button much like the Login button that says "Take the Tutorial!" (no need for a screenshot).

Now the tutorial has to teach users about the different sides of our software. I have gone over this in past posts: the Virtual Portfolio is where you start, RStudio is where coding, testing, and debugging are performed, and Galaxy is effectively used for submissions and instant feedback. So the tutorial teaches users about the components of each of these sides of the software and how to use them.

So now that I have brought up RStudio and Galaxy, I would like to showcase another achievement the Learn2Mine team has pulled off. In the past, we used randomly generated keys to give users a way to link their Learn2Mine account to Galaxy and RStudio. Users used that key to submit work in Galaxy and to log in to RStudio, so they really just had to keep copy/pasting it into these different fields (except in Galaxy, which retains a job history if you create a Galaxy account). Obviously this is not preferable.

So we went digging through the Galaxy-Dev mailing lists and found that if a user is logged in to a Galaxy account, we can get a handle on the email they used when signing in. This eliminates the need for users to keep passing keys around to the different parts of Learn2Mine. It created a new issue for us, though. Previously, whenever a user submitted their key in Galaxy, we created a user account on our virtual machine - this is how users were able to authorize into RStudio. Now we had to create a separate tool that users had to run in order to use RStudio, yet another annoyance. Sherlock Holmes style, more investigative work was done, and we were able to put our own code into Galaxy: the Galaxy signup process now also fires off our addUser.sh script. So the credentials a user enters into Galaxy are the same credentials used for signing in to RStudio, which is a lot better than what we had in the past. What is next to be done is allowing users to change their password, because there currently is no implementation for that.

So now users are not perplexed when going to the different parts of Learn2Mine - in the past there was a bit of a curve in figuring it all out. Why did I go off on this tangent? First, it is awesome to get this working. Second, and more importantly, users can seamlessly get involved with solving problems on Learn2Mine, whether in the tutorial or not, more quickly and efficiently.
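A hedged sketch of what hooking the signup could look like - the script path and argument order below are assumptions, not the team's actual code, but the idea is the same: after Galaxy creates the account, fire addUser.sh so RStudio shares the credentials:

```python
import subprocess

ADD_USER_SCRIPT = "/opt/learn2mine/addUser.sh"   # hypothetical path

def add_user_command(email, password):
    """Build the shell command that creates the VM account for RStudio."""
    return ["sh", ADD_USER_SCRIPT, email, password]

def on_galaxy_signup(email, password, runner=subprocess.run):
    """Called from the Galaxy account-creation code after signup succeeds."""
    result = runner(add_user_command(email, password),
                    capture_output=True, text=True)
    if result.returncode != 0:
        # e.g. the account already exists on the VM; log it, never show the user
        print("addUser.sh failed:", result.stderr)
    return result.returncode == 0
```

Passing the runner in as a parameter keeps the hook testable without actually touching system accounts.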

Music listened to while blogging: Blue Scholars