Monday, February 10, 2014

Squashed?

The first key point in patching a bug is understanding how it works and being able to reproduce it.

The Galaxy tool with the bug is called "Group". It is a data manipulation tool found in the "Join, Subtract and Group" section of Galaxy's toolbox. The description for "Group" is as follows:

"Group data by a column and perform aggregate operation on other columns."

To show how the XML maps the tool to the GUI, I have provided a screenshot below:


The "Select data" list lets you pick a dataset that you have uploaded to Galaxy. In accordance with the tip at the bottom, Galaxy can convert any delimited file into a tab-delimited file. For future reference, we could improve the Group tool to detect the delimiter automatically (a feature upgrade). "Group by column" lets you pick the number of the column to group by (numbering starts at 1, not 0).

Rather than provide my own example of how this works, I'll point to the excellent example that the Galaxy developers have written into the tool's XML (seen below):


So, effectively, you can group an input dataset in order to perform aggregate functions (finding mean, median, mode, sum, etc.).
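
To make the idea concrete, here is a minimal Python sketch of grouping tab-delimited rows by one column and aggregating another. It is not the code from Galaxy's grouping tool; the function name group_and_aggregate and the sample rows are made up for illustration, and it assumes the value column holds numbers.

```python
from collections import defaultdict
from statistics import mean


def group_and_aggregate(lines, group_col, value_col, func=mean):
    """Group tab-delimited lines by a 1-based column and aggregate another.

    Hypothetical helper for illustration only, not Galaxy's implementation.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # The tool numbers columns from 1, so shift to 0-based indices.
        groups[fields[group_col - 1]].append(float(fields[value_col - 1]))
    # Apply the aggregate function (mean, sum, max, ...) to each group.
    return {key: func(values) for key, values in groups.items()}


rows = ["chr1\t10", "chr1\t20", "chr2\t5"]
result = group_and_aggregate(rows, group_col=1, value_col=2)
print(result)
```

Passing func=sum or func=max instead of the default swaps in a different aggregate without touching the grouping step.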

After amassing this information, I decided it was time to fork the development branch of the repository and clone my fork. This was puzzling because I could not just clone my fork through Git like I normally do. I read some documentation and tried a few things, but I ended up having to use Mercurial to clone my forked Galaxy.

The tool's name is listed as "Group" within Galaxy and is in the "Join, Subtract and Group" toolset, but there was no file named group.py anywhere. I had to keep searching different toolset folders within Galaxy and eventually found the tool I was looking for in the "stats" folder within "tools", where it was named "grouping" instead of "group". I knew they were the same tool because the example in the XML matched the GUI version of the tool.

There was a suggestion on the Trello card that this bug could be fixed by adding an "ignore these lines" checkbox, as had already been done for the trim tool. So I went through the aforementioned steps to find the trim tool (named "trimmer" and located in "filters"). The XML markup was the same, so I grabbed the "param" element for the checkbox and added this ignore option to the command line arguments in the XML for grouping. I then went to the Python file and adjusted the command line argument indices accordingly. The next trick was to see how a checkbox selection would be translated before being sent to Python, so I did some classic debugging with print statements, since I know how they show up in Galaxy.

For clarification, I'll explain why this is a bug fix as opposed to a feature addition. People often keep metadata in their files, or perhaps just comments to make a note of something. This is especially common in the sciences (specifically biology here). When someone groups data like this, they get incorrect results because the comments and metadata are interpreted as actual data. What we are doing is allowing comments and metadata to be present in a file while grouping ignores them. The original file does not have to be altered, and the user gets what they expect when grouping their data.

Doing the actual coding was the typical "code, debug, bang head in frustration, epiphany, code, debug, etc." for a little while. Eventually, I got it working, fleshed out all the use cases and made sure it did not crash. The part that really started to mess with me was getting a pull request to work. Initially, it was a bit of a hassle to even get my own work committed. Abandoning Git in favor of Mercurial put some hoops in front of me to jump through, but it was a good learning experience. I really enjoy using Mercurial from the command line, and it seems pretty simple and easy to understand after some initial hurdles. For example, I had to go into my repository's hidden Mercurial directory (./.hg from my repository home) and open the hgrc file inside it. There I had to specify the user making the changes, so I had to add:
[ui]
username = forename surname <email>

where I filled in my name and email. I had to do this because I could not commit to my own repository without this information. Upon searching, I was able to deduce that it had something to do with a buggy update in Mercurial, where one thing was changed in the initial install but some complementary portion of the code was not updated, leading to this fault. After that, I was finally able to commit to my own repository.

A snapshot of the code we are adding can be found below:
Here, sys.argv[5] holds the value passed in from the XML that was edited, which is depicted below:
Effectively, we are ignoring lines that start with any of the characters listed above, which are referred to by their ASCII values.
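Since the screenshots only show so much, here is a rough Python sketch of the idea rather than the exact code from the pull request. It assumes the checkbox selection reaches the script as a comma-separated string of ASCII codes (for example "35,62" for '#' and '>') and as the string "None" when nothing is checked; both of those formats and the function names parse_ignore_codes and keep_data_lines are my assumptions for illustration.

```python
def parse_ignore_codes(arg):
    """Turn a value such as "35,62" into the characters ('#', '>').

    Assumes an empty selection arrives as "" or "None" (an assumption
    about how the checkbox value is passed, not confirmed behavior).
    """
    if not arg or arg == "None":
        return ()
    return tuple(chr(int(code)) for code in arg.split(","))


def keep_data_lines(lines, ignore_chars):
    """Drop lines that start with any of the characters to ignore."""
    if not ignore_chars:
        return list(lines)
    # str.startswith accepts a tuple of prefixes.
    return [line for line in lines if not line.startswith(ignore_chars)]


lines = ["# sample metadata", "chr1\t10", ">header", "chr2\t5"]
ignore = parse_ignore_codes("35,62")
kept = keep_data_lines(lines, ignore)
print(kept)
```

In a script like the grouping tool, the argument would come from something like sys.argv[5], and the filtering would run before any grouping is done, so the comments never reach the aggregate functions.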

Then, I decided it was time to make my pull request. I went to this site to learn how to create a pull request on Bitbucket. It seemed relatively straightforward, except that Bitbucket was missing the button that would let me actually submit a pull request. I went into the Galaxy IRC channel and asked about the issue. A user, dannon, gave me information about it. For one, my changes had not even been pushed up to my own repository; there was an outgoing change, which I was able to check with the terminal command "hg outgoing". Upon further conversation with dannon, I realized that I had cloned the wrong repository (galaxy-dist instead of galaxy-central). Galaxy-central is where development occurs, and galaxy-dist is the latest working version. I would want to send my changes to galaxy-central so the core developers of Galaxy could review them and accept them into the next working version of Galaxy, which is currently being worked on.

So I cloned the galaxy-central repository and then ran "hg pull -u ../galaxy-dist" to pull the changes I had made into my galaxy-central clone. Dannon said I should try to push my changes up, but there was a mishap with the branch that was originally forked. So I needed to run "hg update default" and then "hg merge". I then had to commit the merge with the stable version using "hg ci -m "Merge from stable"". I ran into the hgrc issue again, so it seems I'll have to set that up for each repository, but at least I know what the issue is. Next, an "hg push" sent my work to my fork, and I was able to submit a pull request through Bitbucket. I communicated with dannon one last time and he said "Great, thanks! We'll take a look".

A link to the pull request can be found here: https://bitbucket.org/galaxy/galaxy-central/pull-request/322/added-ignore-lines-starting-with-specific/diff

And that's the story of my first contribution of a bug fix to an open source project.

Additionally, I will be blogging about two articles from http://opensource.com/ over two blog posts: one now and one in my next post.
Top court decisions to come from US public policy in 2014
One of the issues being tackled is "can an abstract idea be patented?" Before even reading the rest of the article, my mind was screaming "NO!" because the implications of that would be largely negative. If you thought Apple trying to patent rounded corners was bad, imagine if a company were able to patent a vaguely described design pattern for a website or application. Or what if a company could patent a specific niche of video game? Granted, we would no longer see the video game clones that Zynga and other relatively disliked companies push out all the time, but that would restrict innovation and the freedom to create whatever you, as a developer, want. There are a lot of other implications, but the article has more to it than just that frightening quote.

Another court case mentioned is whether inducing infringement should be considered infringement in and of itself. This made me think of torrenting programs, like uTorrent or BitTorrent. The programs themselves are not built for infringement; they are just peer-to-peer file sharing programs. But some of the people who use them have a malicious or, at least, illegal intention. Should the programmers be held liable because they are allowing this to go on (even though it is pretty much impossible to stop without constant patrolling), or should the uploaders/seeders/leechers be the ones considered at fault? Or both? The problem is figuring out where to draw a line separating black and white when the entire area is just an amorphous, gray blob.

Lastly, there is a court decision coming up that could establish that a patent claim can be too ambiguous to be upheld. This is a case that I am pretty excited about, because it is downright obvious that when patent claims clash with software there is going to be a large amount of ambiguity, and this would be the first step toward defining a fair patent system within our constantly growing technological society.

Music listened to while blogging: Lily Allen & Metallica
