<h1 id="trainingdata">Training Data</h1>
<p>In the <a href="http://labs.wot.io/machine-learning-with-the-wot-io-data-service-exchange/">last part</a>, we created an environment in which we could deploy a <a href="http://tensorflow.org">Tensorflow</a> application within the <a href="http://wot.io/">wot.io Data Service Exchange</a> (DSE). Building on that work, this post will cover distributing training data for our Tensorflow models using the wot.io Data Bus.</p>
<p>The training data set we're going to use is the <a href="http://yann.lecun.com/exdb/mnist/">MNIST Database</a> maintained by Yann LeCun. This will allow us to build on Tensorflow's tutorials using the MNIST data:</p>
<ul>
<li><a href="https://www.tensorflow.org/versions/master/tutorials/mnist/beginners/index.html">MNIST for Beginners</a></li>
<li><a href="https://www.tensorflow.org/versions/master/tutorials/mnist/pros/index.html">Deep MNIST for Experts</a></li>
<li><a href="https://www.tensorflow.org/versions/master/tutorials/mnist/tf/index.html">TensorFlow Mechanics 101</a></li>
</ul>
<p>As this is not going to be a Tensorflow tutorial, I highly recommend you read all three at some point. Let's look at how we're going to use this data.</p>
<h3 id="architectureofasolution">Architecture of a Solution</h3>
<p><img src="http://idfiles.leveelabs.com/55bd0288af0b0930ba599bd0c4b7ca38/resources/img_new/labs_wot_io/mlarch.png" alt="" /></p>
<p>The system that we're going to build consists of a number of components:</p>
<ul>
<li>a <em>Training Data Generator</em></li>
<li>two <em>Production Data Sources</em></li>
<li>four <em>Machine Learning Models</em></li>
<li>three <em>Consumer Applications</em></li>
</ul>
<p>Between the components, we will use the wot.io Data Bus to distribute data from both the training data set and the production data sources to the different models, and then selectively route the model output to the consumers in real time. Due to the nature of the wot.io DSE, we can either build these applications inside of the DSE security context, or host them externally going through one of the authenticated protocol adapters. For the purposes of this article, we will leave this design decision as an exercise for the reader.</p>
<p>For my sample code, I'm going to use the AMQP protocol adapter for all of the components with the <a href="https://github.com/wotio/wot-python">wot-python SDK</a>. This will make it easy to integrate with the Tensorflow framework, and will make it possible to reuse code explained elsewhere.</p>
<h3 id="trainingdatagenerator">Training Data Generator</h3>
<p>The first component we need to build is a Training Data Generator. This application will read a set of data files and then send individual messages to the wot.io Data Bus for each piece of training data. The wot.io Data Bus will then distribute it to each of our machine learning models.</p>
<p>As our ML models will be built in <a href="https://docker.io">Docker</a> containers in the wot.io DSE, we can treat each instance of a model as a disposable resource. We will be able to dynamically spin them up and down with wild abandon, and just throw away our failed experiments. The wot.io DSE will manage our resources for us, and clean up after our mess. The Training Data Generator will allow us to share the same training data with as many models as we want to deploy, and we don't have to worry about making sure each model gets the same or similar data.</p>
<p>We can do our development of the application inside of a container instance of the <em>wotio/tensorflow</em> container we made in the last tutorial.</p>
<p><code>docker run -i -t wotio/tensorflow</code></p>
<p>This will drop us in a bash prompt, which we can then use to develop our training data generator. Next we'll setup an isolated Python environment using virtualenv so that while we're developing our solution we don't pollute the system python. It will also make it easier to capture all of the dependencies we added when creating a new Dockerfile.</p>
<p><code>virtualenv training</code></p>
<p>We can select this environment by sourcing the training/bin/activate file:</p>
<p><code>. training/bin/activate</code></p>
<p>We'll build the rest of our application within the <em>training</em> directory, which will keep our code contained as well. You can checkout the code from GitHub using:</p>
<p><code>git clone https://github.com/wotio/wot-tensorflow-example.git</code></p>
<p>The MNIST data is contained in a couple of gzipped archives:</p>
<ul>
<li>train-images.idx3-ubyte.gz</li>
<li>train-labels.idx1-ubyte.gz</li>
</ul>
<p>You can think of these files as a pair of parallel arrays, one containing image data, and the other containing an identifier for each image. The images contain pictures of the numbers 0 through 9, and the labels take on those same values. Each training file begins with a short header:</p>
<p><strong>Image data file</strong> <img src="http://idfiles.leveelabs.com/55bd0288af0b0930ba599bd0c4b7ca38/resources/img_new/labs_wot_io/Canvas-2-1.png" alt="" /></p>
<p><strong>Label data file</strong> <img src="http://idfiles.leveelabs.com/55bd0288af0b0930ba599bd0c4b7ca38/resources/img_new/labs_wot_io/Canvas-3.png" alt="" /></p>
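<p>Concretely, both headers are just sequences of big-endian 32-bit integers in the layout shown above. As a rough sketch (the functions in the gists that follow may differ in detail), parsing them with Python's standard <code>struct</code> module might look like:</p>

```python
import struct

def read_image_header(stream):
    # idx3-ubyte header: magic (2051), image count, rows, columns,
    # all big-endian unsigned 32-bit integers
    magic, count, rows, cols = struct.unpack(">IIII", stream.read(16))
    assert magic == 2051, "not an idx3-ubyte image file"
    return (count, rows, cols)

def read_label_header(stream):
    # idx1-ubyte header: magic (2049), label count
    magic, count = struct.unpack(">II", stream.read(8))
    assert magic == 2049, "not an idx1-ubyte label file"
    return (count,)
```

<p>Each function consumes only the header bytes, leaving the stream positioned at the start of the pixel or label data, so the header values come back as a tuple (minus the magic) while the stream remains usable for reading the data itself.</p>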
<p>The goal will be to load both files and then generate a sequence of messages from the images, selected at random and sent with the label as a meta-data attribute of the image data. The models will interpret messages carrying meta-data as training data, and will invoke their training routine on each such message. If a message doesn't have a meta-data label, it will instead be run through the model, and the result will be forwarded to the consumer with the most likely label attached in the meta-data field. In this way, we can simulate a system in which production data is augmented by machine learning, and then passed on to another layer of applications for further processing.</p>
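<p>That routing rule can be sketched as a small dispatch function. The message shape, field names, and model interface here are illustrative assumptions, not the wot.io wire format:</p>

```python
def handle_message(model, message, metadata, forward):
    """Train on labeled messages; classify and forward unlabeled ones.

    model    -- object with train(data, label) and predict(data) (assumed interface)
    message  -- raw image bytes
    metadata -- dict of meta-data attributes; a "label" key marks training data
    forward  -- callable used to pass augmented messages downstream
    """
    label = metadata.get("label")
    if label is not None:
        model.train(message, label)           # training datum: update the model
    else:
        guess = model.predict(message)        # production datum: classify it
        forward(message, {"label": guess})    # forward with the inferred label
```

<p>The point of the sketch is the asymmetry: labeled traffic terminates at the model, while unlabeled traffic flows through it and continues on to the consumers.</p>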
<p>To read the image file header we'll use a function like:</p>
<script src="https://gist.github.com/cthulhuology/9642b406adf4cacb9921.js"></script>
<p>And to read the label file header we'll use:</p>
<script src="https://gist.github.com/cthulhuology/663080f2d44070190e3e.js"></script>
<p>Both of these functions take a stream, and return a tuple with the values contained in the header (minus the magic). We can then use the associated streams to read the data into numpy arrays:</p>
<script src="https://gist.github.com/cthulhuology/2a719541fd479410ec2d.js"></script>
<p>By passing in the respective streams (as returned from the prior functions), we can read the data into two parallel arrays. We'll randomize our output data by taking the number of elements in both arrays and shuffling the indexes like a pack of cards:</p>
<script src="https://gist.github.com/cthulhuology/8985414d4970da4716a8.js"></script>
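<p>A minimal version of that card-shuffle iterator, using only the standard library (the gist may use numpy instead), might read:</p>

```python
import random

def shuffled_indexes(count, seed=None):
    """Yield each index in [0, count) exactly once, in random order."""
    indexes = list(range(count))
    random.Random(seed).shuffle(indexes)   # shuffle like a pack of cards
    for i in indexes:
        yield i
```

<p>Because every index is emitted exactly once, iterating the generator to exhaustion visits the whole training set with no repeats, which is exactly the guarantee the next paragraph relies on.</p>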
<p>With this iterator, we are guaranteed not to repeat any image, and will exhaust the entire training set. We'll then use it to drive our generator in a helper function:</p>
<script src="https://gist.github.com/cthulhuology/6b32b7b222462a3b08e3.js"></script>
<p>Now we come to the tricky bit. The implementation of the <a href="https://github.com/wotio/wot-python">wot-python SDK</a> is built on top of Pika, which has a main program loop. Under the hood, we have a large number of asynchronous calls that are driven by the underlying messaging. Rather than modeling this in a continuation passing style (CPS), the wot-python SDK adopts a simple indirect threading model for its state machine:</p>
<p><img src="http://idfiles.leveelabs.com/55bd0288af0b0930ba599bd0c4b7ca38/resources/img_new/labs_wot_io/callflow.png" alt="" /></p>
<p>Using this interpreter we'll store our program as a sequence of function calls, modeled as tuples stored in an array. <code>Start</code> injects the initial state of our finite state machine into a hidden variable by calling <code>eval</code>. <code>Eval</code> prepends the passed array to the beginning of the hidden <em>fsm</em> deque, which we can exploit to mimic subroutine calls. The <code>eval</code> function then passes control to the <code>_next</code> function, which removes the head from the <em>fsm</em> deque and calls apply on the contents of the tuple, if any.</p>
<p>The user supplied function is then invoked, and one of 3 scenarios can happen:</p>
<ul>
<li>the function calls <code>eval</code> to run a subroutine</li>
<li>the function calls <code>_next</code> to move on to the next instruction</li>
<li>the function registers an asynchronous callback which will in turn call <code>eval</code> or <code>_next</code></li>
</ul>
<p>Should the hidden <em>fsm</em> deque become empty, processing will terminate, as no further states exist in our finite state model.</p>
<p>This technique for programming via a series of events is particularly powerful when we have lots of nested callbacks. For example, take the definition of the function <code>step</code> in the training program:</p>
<script src="https://gist.github.com/cthulhuology/d299ebdcc5b35e498051.js"></script>
<p>It grabs the next index from our randomized list of indexes, and if there is one, it schedules a write to a wot.io Data Bus resource followed by a recursive call to itself. Should we run out of indexes, it schedules an exit from the program with status 0.</p>
<p>The <code>write_resource</code> method is itself defined as a series of high level events:</p>
<script src="https://gist.github.com/cthulhuology/cb9548589c8d12efc038.js"></script>
<p>wherein it first ensures the existence of the desired resource, and then sends the data to that resource. The other methods are likewise defined as sequences of high-level events evaluated by the state machine, with the lowest levels being asynchronous calls whose callbacks invoke <code>_next</code> to resume evaluation of our hidden <em>fsm</em>.</p>
<p>As such, our top level application is just an array of events passed to the <code>start</code> method:</p>
<script src="https://gist.github.com/cthulhuology/653f3a743323906e69b8.js"></script>
<p>By linearizing the states in this fashion, we don't need to pass lots of different callbacks, and our intended flow is described as data in the program. It doesn't hurt that the resulting Python looks a lot like LISP, a favorite of ML researchers of ages past, either.</p>
<h3 id="asimpleconsumer">A Simple Consumer</h3>
<p>To test the code, we need a simple consumer that will simply echo out what we got from the wot.io Data Bus:</p>
<script src="https://gist.github.com/cthulhuology/ac8bd827a08f22018e31.js"></script>
<p>You can see the same pattern as with the generator above, wherein we pass a finite state machine model to the <code>start</code> method. In this case, the <code>stream_resource</code> method takes a resource name and a function as arguments, and invokes that function on each message it receives from the given resource. The callback simply echoes the message and its label to stdout.</p>
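<p>Stubbing out the Data Bus, the consumer's callback pattern might look like the following; <code>stream_resource</code> and the message shape here are simplified stand-ins for the wot-python SDK, not its actual API:</p>

```python
def echo(message, metadata):
    """Print each message and its label as received from the Data Bus."""
    line = "label=%s message=%r" % (metadata.get("label"), message)
    print(line)
    return line

def stream_resource(resource, callback, messages):
    """Simplified stand-in: deliver each (message, metadata) pair published
    on `resource` to `callback`, the way the SDK streams Data Bus messages."""
    for message, metadata in messages:
        callback(message, metadata)
```

<p>Wiring <code>echo</code> up as the callback reproduces the behavior described above: every labeled image the generator publishes comes straight out on stdout.</p>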
<p>With this consumer and generator we can shovel image and label data over the wot.io Data Bus, and see it come out the other end. In the next part of this series, we will modify the consumer application to process the training data and build four different machine learning models with Tensorflow.</p>
<p>Machine learning with the wot.io Data Service Exchange: Training Data (part 2)<br />Dec 2015 / Posted By: wotio team</p>