In a previous post I mentioned my team's latest pull request, which adds a tool that can transpose data.
Because there seemed to be some confusion about its uses, I am going to elaborate on them here for a moment.
Users are never forced to transpose their data; this is a feature that was requested for Galaxy. One reason a user may want to transpose their data is to use it with Galaxy's column filtering tool, either for statistical analysis or simply to group data by data values rather than by the features that exist within the data. The tool also works on tabular data that is not square, though the examples we gave were of square data.
So let's get to the update about the actual pull request itself. John Chilton, the same developer who responded last time, responded to my pull request:
Flipping over to the activity section, he left a comment when declining the pull request: "Would love to see this in the tool shed!" That sums up his main comment, and it has Team Rocket now looking at and experimenting with adding the tool to the tool shed.
So why didn't we add our tool directly to the tool shed before? It would make sense to go straight there, right? Well, I decided to submit our pull request the same way as last time because the tool we were developing goes hand-in-hand with other tools in the core section of Galaxy (it even lives in the same toolset as other core tools). As you can read, the tools developed by the core team are now being moved to the tool shed as well, so this is no issue.
There is an issue, though, and it is something I had worried about when I first submitted the pull request. The way we are transposing data reads the entire file into memory at one time. For Galaxy, this just cannot happen, because Galaxy users are typically dealing with genomic data that can be upwards of 50 GB per file at times. Reading all of that into memory at once really is not feasible, even with the nicest of server stacks. So we are going to have to brainstorm a way to cut the data up into chunks and write the output incrementally. I imagine the code will become less readable, but it will be far more efficient when working with big data.
I look forward to tinkering and trying to get this to work over the next few weeks. We have a break from classes coming up, so I am unsure whether I will be able to keep up my regular posting, but I will definitely try if I have the time.
I'd like to close with my initial idea of how to update the tool to reflect the needs for big data usage:
$ outLineNum = 0
$ with open(inputFile) as infile:
$     for line in infile:
$         items = line.rstrip('\n').split('\t')
$         for item in items:
$             # TODO: write item to line outLineNum of outputFile
$             # (still thinking about the most efficient way)
$             outLineNum += 1
$         outLineNum = 0
So only one line will be read into memory at a time, and previous lines will be garbage collected. This may still pose issues, as some datasets have thousands of features or more, which would still result in a lot of data being read into memory. Perhaps I could take a different route and read in one datum at a time, putting each into the output file as needed.
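To make the "one datum at a time" idea a bit more concrete, here is a rough sketch of one way it could go. This is not the code in our pull request, just an illustration under some assumptions: the input is tab-delimited, every row has the same number of fields, and the function names (read_field, transpose_by_offsets) are made up for this post. The idea is to remember the byte offset where each row starts, then build each output line by reading the next unread field from every row in turn.

```python
def read_field(f, start):
    """Read one field beginning at byte offset `start`.

    Returns (field_bytes, offset_after_field, hit_end_of_row).
    """
    f.seek(start)
    field = bytearray()
    while True:
        byte = f.read(1)
        if byte == b"\t":
            return bytes(field), f.tell(), False
        if byte in (b"", b"\n"):  # end of file or end of row
            return bytes(field.rstrip(b"\r")), f.tell(), True
        field += byte


def transpose_by_offsets(in_path, out_path):
    """Transpose a rectangular tab-delimited file (hypothetical helper).

    Only one byte offset per input row and one field at a time are
    held in memory; the data itself is never loaded as a whole.
    """
    with open(in_path, "rb") as infile, open(out_path, "wb") as outfile:
        # Pass 1: record where each non-blank row starts.
        offsets = []
        position = 0
        for line in infile:
            if line.strip(b"\r\n"):
                offsets.append(position)
            position += len(line)

        # Pass 2: each output line takes the next field from every row.
        finished = not offsets
        while not finished:
            row_ends = set()
            for i, start in enumerate(offsets):
                field, offsets[i], at_end = read_field(infile, start)
                if i:
                    outfile.write(b"\t")
                outfile.write(field)
                row_ends.add(at_end)
            outfile.write(b"\n")
            if len(row_ends) > 1:
                raise ValueError("rows have differing numbers of fields")
            finished = row_ends.pop()
```

The trade-off is speed: seeking around the file for every single field is much slower than reading it straight through, and the list of offsets still grows with the number of rows. Reading fields in larger buffered chunks would probably be the next thing to try.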
One last thing I would like to mention: our first pull request to Galaxy is now ready to be versioned into Galaxy and is in the Galaxy Central branch for the next update. It was all of this before, but now it's "official" with this Trello card.
Music listened to while blogging: Kendrick Lamar