Wednesday, February 26, 2014

Capstone: Galaxy and RStudio

This post focuses on the integration of Galaxy and RStudio into Learn2Mine. The last post focused on the virtual portfolio, which is just 1 of the 3 core parts of Learn2Mine.

Galaxy is an open source project that I have described in great detail in past non-capstone posts, so I'll just do a quick summary here. Galaxy is primarily a bioinformatics analysis tool that specializes in working with genomics data. It abstracts the command line away from users with a JavaScript interface and wraps Python/Perl/etc. scripts that users can run for converting data, running programs like TopHat or the Tuxedo suite, and even some statistical analyses. Galaxy allows users to create workflows by tying jobs together - think of a workflow as a set of directions. For example: first I upload some genetic datasets, then I run them together through a tool that aligns them with a specific algorithm, and then I send that result to a visualization tool which creates an HTML output that tells me the score of the alignment and gives me the option to download the alignment file. If this were a workflow, I could provide it with my files and it would do everything else by itself through scheduling within the job manager.

All that really matters here for Learn2Mine is that you can use the output of jobs within Galaxy as inputs for other jobs. So I can upload datasets and use them in Learn2Mine tools. I can use the output of code or tools that I run as my answer when submitting for grading. I can perform scaling or filtering on my data and then use the new scaled/filtered version with a tool. This stream-of-consciousness description of Galaxy is probably the most watered-down version I've given, but my past blog posts cover Galaxy in more depth if you really want to read more.
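To make the "output of one job is the input of the next" idea concrete, here is a minimal Python sketch. It is not Galaxy code and does not use Galaxy's API; `run_workflow`, `scale`, and `filter_small` are hypothetical names I made up purely to illustrate the chaining.

```python
def run_workflow(steps, initial_input):
    """Run each step in order, feeding each step's output to the next.

    Returns the full history: the initial input followed by every
    intermediate result, much like a list of datasets in a history panel.
    """
    history = [initial_input]
    for step in steps:
        history.append(step(history[-1]))
    return history


def scale(values):
    """Hypothetical step: divide every value by the largest one."""
    top = max(values)
    return [v / top for v in values]


def filter_small(values):
    """Hypothetical step: keep only values of at least 0.5."""
    return [v for v in values if v >= 0.5]


if __name__ == "__main__":
    # Upload -> scale -> filter, each step consuming the previous result.
    for dataset in run_workflow([scale, filter_small], [2.0, 4.0, 8.0]):
        print(dataset)
```

Because every intermediate result is kept, any of them could be handed to a different tool later, which is the property Learn2Mine relies on.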

So how does Learn2Mine take advantage of Galaxy? The tools I mentioned in my last post use Galaxy's interface to let less-experienced programmers run algorithms without having to know all the specifics. On the right you will see the form that results from the XML markup of Learn2Mine's neural network tool. The very first input (at the top) lets users pick a dataset they have previously uploaded to Galaxy as the dataset to use for the algorithm. It is worth noting that data that has been altered after the upload step in Galaxy can also be used here. The rest of the inputs do not rely on past jobs in Galaxy but, rather, are abstractions of the inputs that you would normally feed into a neural network. Take the hidden layers input, for example. It takes a comma-separated list of values. Each item in the list is one hidden layer, and the number given is how many nodes exist in that layer. Concepts like this carry through all of the built-in tools for Learn2Mine.
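The hidden layers input described above could be parsed along these lines. This is only a sketch of the idea in Python, not the code behind the actual Learn2Mine tool; the function name `parse_hidden_layers` and the validation rules are my own assumptions.

```python
def parse_hidden_layers(spec):
    """Turn a comma-separated string into a list of layer sizes.

    Each comma-separated item is one hidden layer, and its value is
    the number of nodes in that layer.
    """
    sizes = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            raise ValueError("empty layer size in %r" % spec)
        size = int(item)  # raises ValueError for non-numeric text
        if size <= 0:
            raise ValueError("layer sizes must be positive, got %d" % size)
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    layers = parse_hidden_layers("5,3")
    print("%d hidden layers with sizes %s" % (len(layers), layers))
```

So an entry of 5,3 would mean two hidden layers, the first with 5 nodes and the second with 3.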

Alternatively, there is a section of Galaxy tools that we have built referred to as "Learning R" tools. The only jobs that can be run from those tools are "Create RStudio Account", "Get Personalized Dataset", and "Submit R Lessons to Learn2Mine". The "Create RStudio Account" tool was made recently. Earlier, this step was completely hidden from users: in order to communicate from Galaxy to Learn2Mine, we were forcing users to pass a unique key that was associated with their account around Galaxy, and when users submitted their key to Galaxy, we created their RStudio account behind the scenes. Until we find a way to automate the creation of an RStudio account with a Learn2Mine signup, we will have to make users run this tool if they want to use our cloud-based R IDE. The "Submit R Lessons to Learn2Mine" tool is one that you can run whenever you want to submit an R-based lesson to Learn2Mine for grading/badge-earning (this tool is analogous to the Submit Learn2Mine Tool Lesson tool in the Learn2Mine_Toolset section). It allows users to submit code/answers as Galaxy output or to copy/paste their answer into a text box - some users prefer one way and some prefer the other, and it was not difficult to allow either. The "Get Personalized Dataset" tool is one that we hope to use more in the future. Right now it is only used for the advanced R lessons. It takes a user's information and gives them a personalized dataset for use in lessons - so no 2 users will have the same dataset on which to perform analysis and be graded. We would like this to become the standard for all lessons.
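One way a personalized dataset could be produced is to seed a random sample from something unique to the user. The Python sketch below is purely an illustration under that assumption; it is not how the "Get Personalized Dataset" tool is actually implemented, and `personalized_rows` and the use of an email address as the identifier are hypothetical.

```python
import hashlib
import random


def personalized_rows(user_id, rows, sample_size):
    """Pick a repeatable, user-specific sample of rows.

    The user identifier is hashed into a seed, so the same user always
    gets the same sample and a grader can regenerate it later.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    rng = random.Random(int(digest, 16))  # private generator, seeded per user
    return rng.sample(rows, sample_size)


if __name__ == "__main__":
    pool = list(range(100))  # stand-in for the rows of a master dataset
    print(personalized_rows("student@example.com", pool, 5))
```

The nice part of seeding from the user is that nothing has to be stored: the grader can rebuild the exact dataset a user was given. Two users could still draw the same sample by chance from a small pool, so a real version would want a large enough master dataset.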

As I mentioned in the previous paragraph, RStudio is the part of Learn2Mine for which users have to have Galaxy create their account. This is because, currently, our RStudio server authenticates against accounts located on our Learn2Mine server - so there is no current way to tie Google authentication into that form of login. RStudio is a cloud-based IDE which allows users to run R code through an interpreter or run entire files - much like R IDEs that require local installation. RStudio allows users to install any third-party R packages that they desire. This is especially useful for visualization tasks. Typically, we want users to come to RStudio to write their code and then submit their code/answer on Galaxy. It would be wonderful if we could tie RStudio and Galaxy together even further by just pointing Galaxy to a file that a user is working on for a lesson, but that is beyond the scope of this Spring semester.

Music listened to while blogging: Kanye West and Lily Allen
