Serverless Distributed Decision Forests with AWS Lambda

Within the team in GE Digital, we have monthly “edu-hackdays” where the entire tech team spends the whole day trying to learn and implement promising new approaches to some portion of our machine-learning based workflow. In the past, we worked on algorithm hacks and on methods for distributed featurization. Some of what we start on those days eventually goes into production, but most does not. The main goal (apart from the team building that comes with the fun and pain of all-day hacks) is to create collective knowledge and experience around important components of our stack. Recently we had an edu-hackday on strategies for distributed learning. This post captures (and hopefully provides some motivation for) the work I did at that hackday in April.


GE Acquires to Accelerate Machine Learning Efforts

On Tuesday, GE Digital announced that it had entered into a definitive agreement to acquire machine learning technology company  The acquisition will enable GE Digital to further accelerate development of advanced machine learning and data science offerings on the Predix platform. The technology deepens GE’s machine learning stack and the team will spearhead innovative solutions in GE’s vertical markets to develop intelligent systems offerings.


Berkeley-based was founded in 2013 to bring high-performance machine learning to the business world.  The company recognized the challenges companies faced as a result of the limited number of qualified data scientists and the struggle to take data-driven processes from “whiteboard to production.”  These struggles were the result of an over-reliance on the development of algorithmic toolkits as opposed to the unique systems engineering challenges associated with the development of production grade intelligent applications.  Most recently, the company applied its technology to deliver intelligent customer support applications to innovative enterprises such as Pinterest, Thumbtack, and ThredUp, enabling these organizations to radically improve efficiency while simultaneously delivering a better customer experience.

Industrial machine learning is critical to GE’s development of scalable Digital Twin solutions and data-intensive industrial computing challenges. Running on Predix, Digital Twins are virtual twins of an industrial asset – a jet engine, a wind turbine or an entire power plant. Twins continuously collect data from physical and virtual sensors and rely on advanced machine learning techniques to analyze the data to gain insights about performance and operation. According to Harel Kodesh, CTO of GE Digital, “’s deep machine learning expertise – combined with GE Digital’s existing data science talent and massive portfolio of industrial assets – will advance GE’s Digital Twin capabilities and solidify its role as a leader in industrial machine learning.”

“One of the foundational principles of Wise is that there is tremendous untapped value in the repetitive, mundane workflows that exist everywhere in business, and that tightly coupled access to the underlying data-sources is crucial to automating these workflows in a robust and scalable way,” said Jeff Erhardt, CEO.  “To participate in bringing this capability to the world as part of such a venerable and pervasive enterprise is an incredibly exciting opportunity.”

“For years, machine learning talent has been drawn to a limited number of consumer facing companies where people are the primary source of data,” said Dr. Joshua Bloom, CTO and Professor of Astronomy at UC Berkeley.  “Instead, the new machine learning horizon is deriving insights from vast quantities of industrial data. We are excited to have the opportunity to play a leading role in pushing these technological boundaries in ways that create value for GE, its customers, and the broader society.”

Towards Cost-Optimized Artificial Intelligence

If accuracy improves with more computation, why not throw in more time, people, hardware, and the concomitant energy costs? It seems reasonable, but this approach misses the fundamental point of doing machine learning (and more broadly, AI): it is a means to an end. And so we need to have a little talk about cost-optimization, encompassing a much wider set of cost-assignable components than is usually discussed in academia, industry, and the press. Viewing AI as a global optimization over cost (i.e., dollars) puts the work throughout all parts of the value chain in perspective (including the driving origins of new specialized chips, like IBM’s TrueNorth and Google’s Tensor Processing Unit). Done right it will lead to, by definition, better outcomes.


Make Docker images Smaller with This Trick

The architectural and organizational/process advantages of containerization (e.g., via Docker) are commonly known. However, in constructing images, especially those that serve as the base for other images, adding functionality via package installation is a double-edged sword. On one hand we want our images to be as useful as possible for the purposes for which they are built. On the other, as images are downloaded, moved around our networks, and live in our production environments, we pay a real speed and cost price for bloated image sizes. The obvious onus on image creators is to make them as small as practically possible without sacrificing efficacy and extensibility. This blog shows how we shrank our images with a pretty simple trick…


Asking RNNs+LSTMs: What Would Mozart Write?

Preamble: A natural progression beyond artificial intelligence is artificial creativity. I’ve been interested in AC for a while and started learning about the various criteria that the scholarly community has devised to test AC in art, music, writing, etc. (I think crosswords might present an interesting Turing-like test for AC). In music, a machine-generated score that is deemed interesting, challenging, and unique (and indistinguishable from the real work of a great master) would be a major accomplishment. Machine-generated music has a long history (cf. Computer Models of Musical Creativity by D. Cope; Cambridge, MA: MIT Press, 2006).


Deep Learning at the character level: With the resurgence of interest in Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM), I thought it would be interesting to see how far we could go in autogenerating music. RNNs have actually been around in music generation for a while (even with LSTM; see this site and this 2014 paper from Liu & Ramakrishnan and references therein), but we’re now getting into an era where we can train on a big corpus and thus train a big, complex model. Andrej Karpathy’s recent blog showed how training a character-level model on Shakespeare and Paul Graham essays could yield interesting, albeit fairly garbled, text that seems to mimic the flow and usage of English in those contexts. Perhaps more interesting was his ability to get nearly perfectly compilable LaTeX, HTML, and C. A strong conclusion is that character-level RNNs + LSTMs get pretty good at learning structure even if the sense of the expression seems like nonsense. This is an important conclusion related to mine (if you keep reading).

Fig 1: Image of a piano roll. Can a machine generate a score of interest?

Data Prep for training

While we’re using advanced Natural Language Processing (NLP) in my company, using RNNs is still very early days for us (hence the experimentation here). Also, I should say that I am nowhere near an expert in deep learning and so, for me, a critical contribution from Andrej Karpathy is the availability of code that I could actually get to run and understand. Likewise, I am nowhere near an expert on music theory or practice. I dabbled in piano about 30 years ago and can read music, but that’s about it. So I’m starting this project on a few shaky legs to be sure.

Which music? To train a model using Karpathy’s codebase on GitHub, I had to create a suitable corpus. Sticking to a single musical genre and composer makes sense. Using the fewest number of instruments seemed sensible too. So I chose piano sonatas for two hands from Mozart and Beethoven.

Which format? As I learned, there are over a dozen different digital formats used in music, some of them more versatile than others, some of them focused on enabling complex visual score representation. It became pretty clear that one of the preferred modern markup formats for sophisticated music manipulation codebases (like music21 from MIT) is the XML-based MusicXML (ref). The other is humdrum **kern. The two are readily convertible to each other, although humdrum appears to be more compact (and less oriented towards the visual representation of scores).

Let’s see the differences between the two formats. There’s a huge corpus of classical music at (“7,866,496 notes in 108,703 files”).

Let’s get Mozart’s Piano Sonata No. 7 in C major (sonata07-1.krn)

!curl -o sonata07-1.krn ""
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 22098    0 22098    0     0   3288      0 --:--:--  0:00:06 --:--:--  5133

We can convert this into musicXML using music21:

import music21
m = music21.converter.parse("ana-music/corpus/mozart/sonata07-1.krn")
m.show("musicxml")

The first note in musicXML takes 11 lines, with a length of 256 relevant characters:

   <part id="P1">
           <note default-x="70.79" default-y="-50.00">
        <beam number="1">begin</beam>

whereas the first note in humdrum format is a single line with just 14 characters:


Based on this compactness (without any apparent sacrifice of expressiveness) I chose to use the humdrum **kern format for this experiment. Rather than digging around and scraping the site, I instead dug around GitHub and found a project that had already compiled a tidy little corpus. The project is called ana-music (“Automatic analysis of classical music for generative composition”). It compiled 32 Sonatas by Beethoven in 102 movements and 17 Sonatas by Mozart in 51 movements (and others).

The **kern format starts with a bunch of metadata:

!!!COM: Mozart, Wolfgang Amadeus
!!!CDT: 1756/01/27/-1791/12/05/
!!!CNT: German
!!!OTL: Piano Sonata No. 7 in C major
!!!SCT1: K<sup>1</sup> 309
!!!SCT2: K<sup>6</sup> 284b
!!!OMV: Mvmt. 1
!!!OMD: Allegro con spirito
!!!ODT: 1777///
**kern    **kern    **dynam
*staff2    *staff1    *staff1/2
*>[A,A,B]    *>[A,A,B]    *>[A,A,B]
*>norep[A,B]    *>norep[A,B]    *>norep[A,B]
*>A    *>A    *>A
*clefF4    *clefG2    *clefG2
*k[]    *k[]    *k[]
*C:    *C:    *C:
*met(c)    *met(c)    *met(c)
*M4/4    *M4/4    *M4/4
*MM160    *MM160    *MM160
=1-    =1-    =1-

After the structured metadata about the composer and the song (lines starting with !), three staffs/voices are defined, along with the repeat schedule (i.e., dc al coda), the key, the tempo, etc. The first measure starts with the line:

=1-    =1-    =1-
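
To make the structure of these records concrete, here is a minimal sketch (not part of the original experiment) that classifies a single line of a **kern file. It assumes the standard Humdrum conventions: spines are separated by tabs, global comments start with "!!", interpretations start with "*", and barlines start with "=". The function name classify_kern_record is hypothetical, introduced only for illustration.

```python
def classify_kern_record(line):
    """Return (kind, tokens) for one line of a Humdrum **kern file.

    Assumes tab-separated spines, as in the Humdrum file format.
    """
    line = line.rstrip("\n")
    if line.startswith("!!"):
        # Global comments and reference records are not split into spines.
        return "global_comment", [line]
    tokens = line.split("\t")
    first = tokens[0]
    if first.startswith("!"):
        kind = "local_comment"
    elif first.startswith("**"):
        kind = "exclusive_interpretation"
    elif first.startswith("*"):
        kind = "interpretation"
    elif first.startswith("="):
        kind = "barline"
    else:
        kind = "data"
    return kind, tokens


print(classify_kern_record("=1-\t=1-\t=1-\n"))
print(classify_kern_record("*clefF4\t*clefG2\t*clefG2\n"))
```

Seen this way, the preprocessing described next amounts to dropping every record before the first barline, dropping comments, and replacing each barline with a single marker character.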

In my first experiment, I stripped away only the ! lines and kept everything in the preamble. Since there is little preamble text to train on, I got mostly incorrect preambles. So, in order to have our model build solely on the notes in each measure, I chose to strip away the metadata, the preamble, and the measure numbers.

import glob

base_dir = "."  ## assumed location of the ana-music checkout (the original `dir` was not defined)
composers = ["mozart","beethoven"]
for composer in composers:
    comp_txt = open(composer + ".txt","w")
    ll = glob.glob(base_dir + "/ana-music/corpus/{composer}/*.krn".format(composer=composer))
    for song in ll:
        lines = open(song,"r").readlines()
        out = []
        found_first = False
        for l in lines:
            if l.startswith("="):
                ## new measure, replace the measure with the @ sign, not part of humdrum
                out.append("@\n")
                found_first = True
                continue
            if not found_first:
                ## keep going until we find the end of the header and metadata
                continue
            if l.startswith("!"):
                ## ignore comments
                continue
            ## everything else is note data for the current measure
            out.append(l)
        comp_txt.writelines(out)
    comp_txt.close()
From this, I got two corpora, mozart.txt and beethoven.txt, and was ready to train.
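
A quick sanity check on a corpus like this is to count its distinct characters, since that set is the entire vocabulary a character-level model has to learn. The sketch below is an illustration rather than code from the original experiment; the helper name char_vocabulary is hypothetical, and the sample string is a made-up fragment in the stripped format.

```python
from collections import Counter


def char_vocabulary(text):
    """Return (sorted list of distinct characters, per-character counts)."""
    counts = Counter(text)
    return sorted(counts), counts


# A tiny made-up fragment in the stripped format ("@" marks a new measure).
sample = "@\n8r\t16cc#\\LL\t.\n.\t16ee\\\t.\n"
vocab, counts = char_vocabulary(sample)
print(len(vocab), "distinct characters")
print(counts.most_common(3))

# For a real corpus file, something like this would apply:
#   vocab, counts = char_vocabulary(open("mozart.txt").read())
```

A small vocabulary keeps the model's input and output layers small, which is part of why stripping rarely seen preamble tokens helps.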


To learn, I stood up a small Ubuntu machine on (In case you don’t know already, is a user-friendly software layer on top of AWS where you can provision machines and dynamically change the effective size of the machine [CPUs and RAM]. There’s also a GPU capability. This allows you to write code and test without burning serious cash until you’re ready to crank.)

On my terminal, I installed Torch, nngraph, and optim. And then started training:

th train.lua -data_dir data/beethoven -rnn_size 128 \
             -num_layers 3 -dropout 0.3 \
             -eval_val_every 100 \
             -checkpoint_dir cv/beethoven -gpuid -1
th train.lua -data_dir data/mozart -rnn_size 128 -num_layers 3 \
             -dropout 0.05 -eval_val_every 100 \
             -checkpoint_dir cv/mozart -gpuid -1

There were only 71 distinct characters in the preamble-less training sets. After seeing that it was working, I cranked up my system to a faster machine. (I couldn’t get it to work after switching to GPU mode, which was my original intention. My guess is that it has something to do with compiling Torch without a GPU. I burned too many hours trying to fix this, so instead I kept the training in non-GPU mode.) And so, after about 18 hours it finished:

5159/5160 (epoch 29.994), train_loss = 0.47868490, grad/param norm = 4.2709
evaluating loss over split index 2
saving checkpoint to cv/mozart/lm_lstm_epoch30.00_0.5468.t7
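
The checkpoint name carries useful information. As far as I understand Karpathy's training script, the two numbers in a name like lm_lstm_epoch30.00_0.5468.t7 are the epoch and the validation loss at save time; treat that reading as an assumption. Under it, a small sketch for picking the checkpoint with the lowest validation loss could look like this (parse_checkpoint and best_checkpoint are hypothetical helper names):

```python
import re

# Matches the trailing "epoch<epoch>_<loss>.t7" part of a checkpoint path.
CKPT_RE = re.compile(r"epoch(\d+\.\d+)_(\d+\.\d+)\.t7$")


def parse_checkpoint(path):
    """Return (epoch, loss) parsed from a checkpoint file name."""
    m = CKPT_RE.search(path)
    if m is None:
        raise ValueError("unrecognized checkpoint name: " + path)
    return float(m.group(1)), float(m.group(2))


def best_checkpoint(paths):
    """Return the path whose encoded loss is smallest."""
    return min(paths, key=lambda p: parse_checkpoint(p)[1])


print(parse_checkpoint("cv/mozart/lm_lstm_epoch30.00_0.5468.t7"))
```

Picking by validation loss rather than by latest epoch is a simple guard against using an overfit model for sampling.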

Note: I played only a little bit with dropout rates (for regularization) so there's obviously a lot more to try here. 

Sampling with the model

With the models built, it was time to sample. Here’s an example:

th sample.lua cv/beethoven/lm_lstm_epoch12.53_0.6175.t7 \
               -temperature 0.8 -seed 1 -primetext "@" \
               -sample 1 -length 15000 -gpuid -1 > b5_0.8.txt

The -primetext @ basically says “start the measure”. The -length 15000 requests 15k characters (a somewhat lengthy score). The -temperature 0.8 leads to a conservative exploration. The first two measures from the Mozart sample yield:

8r      16cc#\LL        .
.       16ee\   .
8r      16cc#\  .
.       16dd\JJ .
4r      16gg#\LL        .
.       16aa\   .
.       16dd\   .
.       16ff#\JJ)       .
.       16dd\LL .
.       16gg'\  .
.       16ee\   .
.       16gg#\JJ        .
*clefF4 *       *
8.GG#\L 16bb\LL .
.       16ee\   .
8F#\    16gg\   .
.       16ee#\JJ        .
8G\     16ff#\LL        .
.       16bb\   .
8G#\    16ff#\  .
.       16gg#\  .
8G#\J   16ff#\  .
.       16ccc#\JJ       .

This isn’t exactly **kern format…it’s missing a preamble and the measures are not numbered. So I added back some sensible preamble and closed the file properly:

!!! m1a.krn - josh bloom - AC mozart
**kern  **kern  **dynam 
*staff2 *staff1 *staff1/2
*>[A,A,B,B]     *>[A,A,B,B]     *>[A,A,B,B]
*>norep[A,B]    *>norep[A,B]    *>norep[A,B]
*>A     *>A     *>A
*clefF4 *clefG2 *clefG2
*k[]    *k[]    *k[]
*C:     *C:     *C:
*M4/4   *M4/4   *M4/4
*met(c) *met(c) *met(c)
*MM80  *MM80  *MM80
==   ==  ==
*-   *-  *-

and fixed up the measure numbering:

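f = open("m1a.krn","r").readlines()
r = []
bar = 1
for l in f:
    if l.startswith("@"):
        ## replace the "@" marker with a numbered barline in all three spines
        ## (the first barline is written "=1-", as in the corpus files)
        tok = "=1-" if bar == 1 else "={}".format(bar)
        r.append("\t".join([tok] * 3) + "\n")
        bar += 1
    else:
        r.append(l)
open("m1a-bar.krn","w").writelines(r)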

Now the moment to see what we got.

from music21 import *
m1 = converter.parse("m1a-bar.krn")

First of all (wow) this is read and accepted by music21 as valid music. I did nothing else to the notes themselves (I actually cannot write **kern so I can’t cheat; in other scores I had to edit the parser in music21 to replace the occasional “unknown dynamic tags” with a rest).

m1.show("musicxml")

reveals the score:


Here’s the PDF, kern, and midi file. Click on the midi file to listen to it (you might first need to download it and then use QuickTime Player or the like).

I created a few different instantiations from Beethoven and Mozart (happy to send to anyone interested).

b5_0.8.txt = beethoven with temp 0.8 (sample = cv/beethoven/lm_lstm_epoch12.53_0.6175.t7)

b4.txt = beethoven with temp 1 (sample = cv/beethoven/lm_lstm_epoch12.53_0.6175.t7)

b3.txt = beethoven with temp 1 (sample = cv/beethoven/lm_lstm_epoch24.40_0.5743.t7)

b2.txt = beethoven with temp 1 (sample = cv/beethoven/lm_lstm_epoch30.00_0.5574.t7)

b1.txt = beethoven with temp 0.95 (sample = cv/beethoven/lm_lstm_epoch30.00_0.5574.t7)
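
For intuition about what the temperature setting in these samples does, here is a minimal, self-contained sketch of temperature-scaled sampling. It is an illustration in Python, not the Torch code used above: the model's unnormalized scores (the logits list below is made up) are divided by the temperature before being turned into probabilities, so values below 1 sharpen the distribution toward the most likely character and values near 1 leave it as learned.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Convert scores to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(logits, temperature, rng=random):
    """Draw one index according to the temperature-scaled probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate characters
for t in (0.5, 0.8, 1.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```

Lower temperatures trade variety for safety, which is why a setting like 0.8 tends to produce more conventional, less surprising output than 1.0.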


This music does not sound all that good. But if you listen to it, to the very naive ear it sounds like the phrasing of Mozart. There are rests, accelerations, and changes of intensity. But the chord progressions are weird and the melody is far from memorable. Still, this is a whole lot better than 1000 monkeys throwing darts at rolls of player piano tape.

My conclusion at this early stage is that RNNs+LSTM did a fairly decent job at learning the expression of a musical style via **kern but did a fairly poor job at learning anything about consonance. There’s a bunch of possible directions to explore in the short term:

a) increase the dataset size
b) combine composers and genres
c) play with the hyperparameters of the model (how many layers? how much dropout? etc.)

I’m excited to start engaging a music theory friend in this and hopefully will get to some non-trivial results (in all my spare time). This is the start obviously. Chappie want to learn more from the internet…

Edit 1: looks like a few others have started training RNNs for music as well (link | link)

-Josh Bloom (Berkeley, June 2015)

Five Takeaways on the State of Natural Language Processing

Thoughts following the 2015 “Text By The Bay” Conference

The first “Text By the Bay” conference, a new natural language processing (NLP) event from the “Scala bythebay” organizers, just wrapped up tonight. In bringing together practitioners and theorists from academia and industry I’d call it a success, save one significant and glaring problem. 

1. word2vec and doc2vec appear to be pervasive

Mikolov et al.’s work on embedding words as real-numbered vectors using a skip-gram, negative-sampling model (word2vec code) was mentioned in nearly every talk I attended. Either companies are using various word2vec implementations directly or they are building variations on the basic framework. Trained on large corpora, the vector representations encode concepts in a high-dimensional space (usually 200-300 dim). Beyond the “king – man = queen – woman” analogy party trick, such embeddings are finding real-world applications throughout NLP. For example, Mike Tamir (“Classifying Text without (many) Labels”; slide shown below) discussed how he is using the average representation over entire docs as features for text classification, outperforming other bag-of-words (BoW) techniques by a large measure with heavily imbalanced classes. Marek Kolodziej (“Unsupervised NLP Tutorial using Apache Spark”) gave a wonderful talk about the long history of concept embeddings along with technical details of most of the salient papers. Chris Moody (“A Word is Worth a Thousand Vectors”) showed how word2vec was being used in conjunction with topic modeling for improved recommendation over standard cohort analysis. He ended his talk by discussing how word2vec can be extended beyond NLP to machine translation and graph analysis.

Fig 1. word2vec description slide from Mike Tamir.
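
To make the two ideas mentioned above concrete (the analogy arithmetic and averaging word vectors into a document feature), here is a small sketch using hand-made toy vectors. Everything in it is hypothetical and chosen for illustration: real word2vec embeddings are learned from a corpus and have hundreds of dimensions, and this is not the code any of the speakers showed.

```python
import math

# Hypothetical 3-dimensional toy embeddings, invented for this example.
VECS = {
    "king":  [1.0, 1.0, 0.0],
    "queen": [1.0, 0.0, 1.0],
    "man":   [0.0, 1.0, 0.0],
    "woman": [0.0, 0.0, 1.0],
    "apple": [0.2, 0.1, 0.1],
}


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c, vecs=VECS):
    """Word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vecs[a], vecs[b], vecs[c])]
    candidates = [w for w in vecs if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vecs[w], target))


def doc_vector(words, vecs=VECS):
    """Average the vectors of the known words in a document."""
    known = [vecs[w] for w in words if w in vecs]
    if not known:
        raise ValueError("no known words in document")
    return [sum(col) / len(known) for col in zip(*known)]


print(analogy("king", "man", "woman"))
print(doc_vector(["king", "queen", "unknownword"]))
```

The averaged vector can then be fed to any standard classifier as a fixed-length feature, which is the appeal of the approach when labeled examples are scarce.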

2. Production-grade NLP is Spreading in Industry

For years, the most obvious users of natural language processing were those involved in search, tagging, sentiment on social graphs, and recommendations. And there are clear applications to voice recognition. What was most exciting to see at #tbtb, however, were the companies making use of NLP in production for core product enhancements that stretched beyond traditional uses.
   Sudeep Das gave a great talk about the places within OpenTable where NLP is improving their customers’ experience in subtle but measurable ways. By creating topics around word embeddings of customer reviews, they can get much richer insights about a restaurant than what appears in the metadata of that restaurant’s website. And in showing reviews they can then bias towards (longer) reviews that hit on all the important topics for that restaurant. Das showed an auto-discovered review for a restaurant (one of my favorites, La Mar!) that spoke to specific famous dishes, the view of the Bay, and the proximity to the Ferry Building. Also impressive was that when the data science team discovered an explosion of cauliflower-infused dishes in New York City (yes, that’s a thing apparently), the marketing team was then able to capitalize on the trend by sending out a timely email campaign.
   One of my favorite talks was by Chris Moody from Stitch Fix. The company sends 5 fashion items at a time to women, who send back what they don’t want. He showed how, using word2vec, they combine user comments on their products with buying behavior to enrich the suggestions of new items. These are then used by personal fashion consultants as an augmentative tool in their decisions about what to send next. They train the word2vec embedding on the Wikipedia corpus and augment that with training on existing reviews and comments.
(Note: both Sudeep and Chris are former astronomers, but that has little bearing on my glowing reviews of their talks!)
Fig. 2. Sudeep Das, “Learning from the Diner’s Experience”. Via Twitter.

3. Open tools are being used but probably not compensated in the way they should

For training models, a number of large open datasets and code are used by virtually everyone (including by us). Freebase (“wikipedia for structured data”), Wikidata, and Common Crawl were mentioned throughout the conference, in talks from folks at Crunchbase and Klout for example. The most commonly used implementation of word2vec is in the open-source gensim project (with some growing interest in the Spark implementation as well). Most of these projects are just scraping by without a stable source of funding, which seems ridiculous. It seems that few of these open data and software communities are compensated by the (large) corporations that use these tools. This is perfectly legal of course given the licensing, but one wonders if there isn’t a better funding model for all of us to consider in the future (like at bountysource).

4. “RNNs for X”

It’s an exciting time for deep learning and NLP, evident throughout the conference but highlighted in the talk by Richard Socher, a co-founder and CTO of MetaMind. Based on work he did with recursive neural networks at Stanford, Socher discussed how tree-based learning is performing exceedingly well on sentiment (recent work using Long Short-Term Memory Networks [LSTM] here). Jeremy Howard, founder of Enlitic (using deep learning for medical diagnosis intelligence), discussed using recurrent neural nets to upend long-standing industries. In his panel discussion with Pete Skomoroch he touted the power of RNNs, likening this moment in history to the early days of the web, when Internet + (anything) changed (anything) forever. Will RNN + (anything) disrupt (anything) again? We’ll see!
Fig. 3. Richard Socher, “Deep Learning for Natural Language Processing”

5. A Big Problem: Massive Gender Imbalance

Out of 57 speakers and panelists who spoke at the conference, there were exactly four (*) women, among them Vita Markman and Katrin Tomanek. In two full days. I really don’t know how this is possible in a day and age where there are so many outstanding female machine learning experts and NLP practitioners. (A few of us had a mini-tweetstorm about this where a number of top female speakers in the Bay Area were named.) I don’t want to speculate as to why this happened at this particular conference but it’s clearly not a positive thing for anyone involved, including the sponsors and, frankly, the participants. Charles Covey-Brandt (at Disqus) has a great rule, which is that he will refuse to serve on a panel or give a talk at a conference that does not achieve fair representation. If all of us did the same thing, conferences would be better off and we’d be done with this awful foolishness.
(*) Edit after original post: Alexy Khrabrov noted in the comments that two other women spoke at the conference, Diana Hu of Verizon and Katelyn Lyster. Neither is listed in the published schedule. So a total of 4 out of 59 spoke. Alexy also notes efforts that the organizers took to solicit broader participation.

That’s it for my wrap-up summary. Feel free to comment if I missed anything important (or even not-so-important). Caveat: I attended both days but, given the multitrack talk schedule, I was unable to see all the talks.