I have recently been using a Windows Server 2008 R2 VM, and I wanted to use Theano in some of my scripts. Rather than creating a virtual machine within my virtual machine (which I could not even do, due to exceptions in the virtual environment), I set out to find the quickest way to get Theano running directly.
First, install the 32-bit build of Anaconda for Python 2.7 (even if you have a 64-bit system), not any version of Python 3. I recommend version 2.1 of Anaconda, which can be found via zip files on their site.
Once installed, run "conda install mingw libpython" from the command line. This pulls in the MinGW compiler toolchain and linking library that Theano needs on Windows.
Before installing Theano, I suggest installing the LAPACK resources. A Windows tutorial that describes the process better than I could is linked under Sources below (LAPACK).
Theano can be installed in two ways. "pip install Theano" installs the current stable release, but that branch sometimes updates in ways that break Windows installs. If you try "pip install Theano" and cannot import Theano without errors or crashes, then I suggest the alternate method: "pip install git+https://github.com/Theano/Theano.git" (note the "git+" prefix that pip requires for VCS URLs). This installs the bleeding-edge development version of Theano, which may not be perfect - which is why it is the alternate method. Before installing the bleeding-edge version, either uninstall the Theano you have or upgrade it in place: "pip uninstall theano" followed by "pip install git+https://github.com/Theano/Theano.git", or "pip install git+https://github.com/Theano/Theano.git --upgrade", respectively.
That should be all you need to get going with Theano.
Why Anaconda? I love Anaconda because I'm a command-line fanatic due to constant SSH-ing (most of my editing is done through Vim - shoutout to my automated vimrc creation tool), so I wanted to avoid IDEs. Anaconda packages nice machine learning utilities such as NumPy, SciPy, and scikit-learn. Additionally, the "conda install" feature is clean and useful.
Sources:
Theano site: http://deeplearning.net/software/theano/install.html
Why 2.1 Anaconda: http://stackoverflow.com/questions/31050976/python-exe-crashes-when-importing-theano
The issue in action: https://groups.google.com/forum/#!topic/theano-users/p77HXTvjNxc
LAPACK resources: http://icl.cs.utk.edu/lapack-for-windows/lapack/index.html#libraries
Theano Github: https://github.com/Theano/Theano
Clayton Turner: An Introductory Guide
A blog dedicated to my experiences and development as a Data Science and Computer Science researcher.
Thursday, December 10, 2015
Tuesday, November 10, 2015
ICDM 2015
This coming weekend, Cassios Marques and I will be attending ICDM 2015 (the IEEE International Conference on Data Mining) in Atlantic City, NJ. I am attending the conference rather than presenting at it, which is actually out of the ordinary for me. Information is available at: http://icdm2015.stonybrook.edu/
We arrive the night of November 14th (the conference runs from the 15th to the 17th) and will need to head to sleep right away, as technically our arrival time will be on the 15th.
Tentative Schedule:
Saturday, November 14th Schedule
9:00am - 12:30pm ~ Morning Workshop: The 2015 IEEE ICDM Workshop on Data Mining in Biomedical Informatics and Healthcare (DMBIH)
2:00pm - 6:00pm ~ Afternoon Workshop: The 2015 IEEE ICDM Workshop on Sentiment Elicitation from Natural Text for Information Retrieval and Extraction (SENTIRE)
Sunday, November 15th Schedule
9:00am - 10:15am ~ Keynote 1 (Robert F. Engle): Dynamic Conditional Beta and Global Financial Instability
10:30am - 12:30pm ~ Session 1A: Deep Learning and Representation Learning
2:00pm - 3:15pm ~ Keynote 2 (Michael I. Jordan): On Computational Thinking, Inferential Thinking and Big Data
3:30pm - 4:40pm ~ Session 2A: Big Data 1
4:50pm - 6:00pm ~ Session 3A: Big Data 2 (Not sure if this is a repeat or extension; Clustering 2 is another option)
Monday, November 16th Schedule
9:00am - 10:15am ~ Session 4C: Dimension Reduction and Feature Selection
10:30am - 12:30pm ~ One of:
- Session 5A: Ensemble Methods
- Session 5B: Applications 2
- Session 5C: Network Mining 1
This day is short, as there is an excursion from 2:00pm to 7:00pm.
Tuesday, November 17th Schedule
9:00am - 10:15am ~ Keynote 3 (Lada Adamic): Information in Social Network
10:30am - 12:30pm ~ One of:
- Session 6B: Graph Mining
- Session 6C: Mining Sequential Data
2:00pm - 3:15pm ~ Session 7B: Mining Text and Unstructured Data
There is one last session from 3:30pm to 5:00pm, but we will have to leave early to catch a taxi, then a train, then our flight, so we need to leave room for any holdups in that process.
Wednesday, June 10, 2015
Numpy ValueError: Output array is read-only
I recently received this cryptic error while working on a NumPy implementation of a neural network, and I had trouble finding a ready-made solution for the problem.
This error can be seen in a similar fashion if you run commands such as:
import numpy as np
a = np.arange(6)
a.setflags(write=False)
a[2] = 42
ValueError: assignment destination is read-only
This is intended behavior.
I am currently working on a neural network that uses gradient descent and receives its updates through a client-server relationship. When processing a new mini-batch, the network started throwing the titular ValueError while updating the weights on the second mini-batch.
I have seen many posts on Stack Overflow and other sites attributing this error to the array being non-contiguous in memory. This appears to originate in the underlying C implementation, whether in Python itself or in NumPy.
If you receive an error like this, the easiest way to work around the non-contiguous memory is simply to make it contiguous again:
my_discontiguous_array = np.array(my_discontiguous_array)
np.array() already returns a fresh copy, so this performs an effective deep copy, re-loading the data into contiguous (and writable) memory for your use. The old, unreferenceable version of the array is left for the garbage collector to clean up.
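To see both the failure and the fix in one place, here is a minimal sketch (nothing specific to the neural network, just NumPy):

```python
import numpy as np

a = np.arange(6)
a.setflags(write=False)      # simulate the read-only array
try:
    a[2] = 42
except ValueError as err:
    print(err)               # assignment destination is read-only

b = np.array(a)              # fresh, contiguous, writable copy
b[2] = 42                    # now succeeds
```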
Monday, March 23, 2015
AMIA 2015 Joint Summits on Translational Science
I previously mentioned that the research I have been working on was accepted to a conference. In all of my busyness I have not been able to update this blog effectively (being a GRA is a bit of a time-sink). I'm attaching the poster presented at the AMIA 2015 Joint Summits on Translational Science to this post as a reference to the work we have been conducting for roughly the past year. It is being presented this afternoon. The poster is set up in landscape format, so it has been shrunk to fit this blog. If you're interested and would like to see a better version of it, please contact me.
Thursday, February 5, 2015
Deep Learning and Natural Language Processing
First off, I would like to say that an abstract I co-authored, titled Improving Lupus Phenotyping Using Natural Language Processing, has been accepted to the 2015 Summit on Translational Bioinformatics. The conference takes place in San Francisco in late March. I will not be attending, as I will be busy with classes (two of the PIs with whom I am working will be attending), but I am still working heavily on material to be presented at the poster symposium.
My most recent advances in this research have brought me to unraveling the intricacies of Deep Learning. Our goal is to classify patients' Lupus status (effectively, present or not) based on the digitized doctors' notes for those patients. Such research could make recruitment for clinical trials quicker, easier, and more accurate, and could outperform classification based solely on ICD-9 billing codes.
Some sources consulted for research:
- http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/
- http://deeplearning.stanford.edu/tutorial/
- https://www.youtube.com/watch?v=n1ViNeWhC24
- http://arxiv.org/pdf/1206.5533.pdf
- http://www.socher.org/uploads/Main/PaulusSocherManning_NIPS2014.pdf
- http://nlp.stanford.edu/~socherr/thesis.pdf
- http://nlp.stanford.edu/~socherr/SocherChenManningNg_NIPS2013.pdf
- http://www.aclweb.org/anthology/P/P12/P12-1092.pdf
- http://www.aaai.org/Papers/JAIR/Vol37/JAIR-3705.pdf
Wednesday, September 24, 2014
Short Update
It's been a while since my last post because I have been busy.
A quick personal update:
1) I am now attending the Master's program at CofC and am taking 2 classes currently (CSIS 602 - Software Engineering & CSIS 604 - Distributed Systems)
2) I am now, officially, a Graduate Research Assistant doing research with CofC (with MUSC collaborators)
3) I also have a web development job for Innovative Resource Management
4) I have a publication coming out in the FIE (Frontiers in Education) 2014 journal [should be out around October]
So, for my research, I am having to learn (or partially review) certain topics. I am choosing Python for the data munging and analysis in my research, which naturally requires the NumPy and pandas libraries. I have used NumPy in the past when building decision trees, naive Bayes models, neural networks, etc., so I am mostly familiar with it and merely reviewing. Pandas, however, is a library I have never dealt with, so I am working through tutorials.
Music listened to while blogging: Bibio
Thursday, July 17, 2014
NLP Paper Review
For this post, I will be sharing a Prezi presentation that I recently presented on a paper from the NLP conference I attended in Baltimore.
http://prezi.com/zac-hm_osqwo/?utm_campaign=share&utm_medium=copy
The presentation really speaks for itself.
Music listened to while blogging: GEMS
Wednesday, July 2, 2014
ACL 2014: BioNLP Conference
So I recently came back from the annual BioNLP (Biomedical Natural Language Processing) workshop of the Association for Computational Linguistics (of which I am now a member) in Baltimore, Maryland. Firstly, I stayed in a hotel across from the Baltimore Orioles' stadium, and, not being the biggest fan of baseball, I definitely got my fill of baseball fans, hat-sellers, and hot dog vendors constantly yelling about their hatred for the Yankees.
Since I only attended the two-day workshop, this was a little different from my normal conference travel. I attended sessions where people were doing molecular NLP tasks (such as querying PubMed and other journals) to gather data and conduct metadata or primary data analysis. These researchers typically used SVMs in their algorithmic analysis, which gives me good ideas about where to take my own research. Unfortunately, most of the sessions were oriented toward molecular NLP, whereas my focus is on clinical NLP, which is a different type of problem by nature: scientific, structured writing is much easier to parse than unstructured notes written by different medical professionals.
No one at the conference was using the NLP system I am using, which was a disappointment, but I was able to broaden my horizons to other systems such as i2b2 and BioCreative. In my own research we are using cTAKES/YTEX. The conference included a panel of scientists who helped create these newer systems, so it was a nice surprise to hear what it's like on the other side of that research.
I will elaborate more on these systems once we decide whether to steer away from cTAKES toward one of these newer systems or to keep going down the road with which we are familiar.
Music listened to while blogging: Hellyeah
Since I only attended the 2 day workshop this was a little different than my normal conference travel. I attended sessions where people were doing molecular NLP tasks (such as querying Pubmed and other journals) in order to garner data and conduct metadata or real data analysis. These researchers typically utilized SVMs in their algorithmic analysis for their results, which gives me good ideas about where to take my own research. Unfortunately, most of the sessions were molecular NLP-oriented tasks, whereas my focus is more on clinical NLP, which is a different type of problem, by nature. Namely, scientific/structured writing is a lot easier to parse rather than unstructured notes written by different medical professionals.
No one at the conference is using the NLP system that I am using, which was a disappointment, but I was able to broaden my horizons to other systems such as i2b2 and biocreative. In my own research we are utilizing cTAKES/ytex.The conference included a panel of scientists that helped create these newer systems so that was a nice surprise to hear what it's like on the other side of research.
I will elaborate more on these systems when we decide if we want to steer away from the usage of cTAKES for one of these newer systems or if we decide to keep going down the road with which we are familiar.
Music listened to while blogging: Hellyeah
Tuesday, June 24, 2014
ACL 2014 and Journal Acceptance
So for this post, I will give a quick update on what I've been up to since the Summer started.
First, I am going to Baltimore tomorrow to attend a workshop (6/26-6/27) on Biomedical Natural Language Processing (BioNLP). The workshop, part of the Association for Computational Linguistics 2014 annual conference hosted by Johns Hopkins University, includes presentations on the creation of NLP techniques for parsing, the analysis of NLP-parsed data (specifically biomedical), and the utilization of tools and resources such as the Unified Medical Language System (UMLS) and the Systematized Nomenclature of Medicine (SNOMED), among many others.
Recently, a paper we submitted to the Frontiers in Education 2014 conference was accepted. I'll talk more about the conference and our paper when the conference draws near. I really hope I can attend, because it is in Madrid, Spain, which would be a great place to visit. Additionally, the conference could help the Learn2Mine team garner new ideas to incorporate into our own application.
Music listened to while blogging: Sublime
Wednesday, May 21, 2014
Data Mining with Weka
Another skill I'm going to have to pick back up over the course of my graduate career is data mining with pre-built informatics systems. I can write as many data mining algorithms as I want, but engineering the scalability required for big datasets is something I would not have much fun doing myself.
So I can settle for using a pre-built system. For data analysis for my most current project, I will be using Weka.
Weka can be run from the command line or through a GUI. The GUI is simple enough that the command line is not really the best way to go about these tasks, especially since many data mining results rely on visualizations. For example, actually viewing a decision tree after creating it is far more useful than just reading information gains and prunes off a screen.
Weka provides another immensely useful piece of built-in functionality: the Experimenter. In data mining, you are often faced with not knowing in advance which algorithm will perform best - and for good reason. If you were an expert on the dataset and knew exactly how the features interacted, you could form a strong hypothesis about which classifiers would do best, but that is never guaranteed. With Weka's Experimenter there is no need to worry about this: it lets you run multiple algorithms on one or more datasets (a many-to-many relationship can be established), and the results can then be viewed in a nicely sortable table:
Above is the result of running the Experimenter with three algorithms on Fisher's famous iris dataset. The ZeroR algorithm is essentially a baseline, as it is a classifier that does not rely on any of the features for prediction: the class label that appears most often in the training set is predicted for every datapoint in the testing set.
This may seem silly, but there are situations where this baseline is crucial. For example, if you have a dataset classifying whether someone has pancreatic cancer and 98% of the training labels are No (98% of the people in the training set do not have pancreatic cancer), then ZeroR will just say No for every single test instance. This produces a false negative for every true case, but it will be correct a LOT of the time. So any classifier you throw at that dataset needs to outperform 98% - this is really a situation where more data is needed, but being able to recognize that is vital to building a good classifier.
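The ZeroR idea is simple enough to sketch in a few lines of plain Python (the label counts here are hypothetical, just mirroring the pancreatic-cancer example above):

```python
from collections import Counter

def zero_r(train_labels):
    # Majority-class baseline: always predict the most common
    # label seen in training, ignoring all features.
    return Counter(train_labels).most_common(1)[0][0]

# Hypothetical labels: 98% "No" in training, similar skew in test.
train_labels = ["No"] * 98 + ["Yes"] * 2
prediction = zero_r(train_labels)

test_labels = ["No"] * 49 + ["Yes"] * 1
accuracy = sum(prediction == y for y in test_labels) / len(test_labels)
print(prediction, accuracy)
```

Any real classifier's accuracy on such a dataset should be compared against this baseline number before drawing conclusions.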
The next two classifiers are J48 (Weka's implementation of the C4.5 decision tree algorithm) and naive Bayes. These are both classic algorithms in data science, so I will not dive too much into them for now.
Music listened to while blogging: Mac Miller