Machine learning with the Data Service Exchange: Training Data (part 2)

December 8, 2015 / Posted By: wotio team

Training Data

In the last part, we created an environment in which we could deploy a Tensorflow application within the Data Service Exchange (DSE). Building on that work, this post will cover distributing training data for our Tensorflow models using the Data Bus.

The training data set we're going to use the the MNIST Database maintained by Yann LeCun. This will allow us to build on Tensorflow's tutorials using the MNIST data:

And as this is not going to be a Tensorflow tutorial, I highly recommend you read all three at some point. Let's look at how we're going to use this data.

Architecture of a Solution

The system that we're going to build consists of a number of components:

  • a Training Data Generator
  • two Production Data Sources
  • four Machine Learning Models
  • three Consumer Applications

Between the components, we will use the Data Bus to distribute data from each of the training data set and the production data sources to the different models, and then selectively route the model output to the consumers in real time. Due to the nature of the DSE, we can either build these applications inside of the DSE security context, or host the applications externally going through one of the authenticated protocol adapters. For purposes of this article, we will treat this design decision as an exercise up to the reader.

For my sample code, I'm going to use the AMQP protocol adapter for all of the components with the wot-python SDK. This will make it easy to integrate with the Tensorflow framework, and will make it possible to reuse code explained elsewhere.

Training Data Generator

The first component we need to build is a Train Data Generator. This application will read a set of data files and then send individual messages to the Data Bus for each piece of training data. The Data Bus will then distribute it to each of our machine learning models.

As our ML models will be built in Docker containers in the DSE, we can treat each instance of a model as a disposable resource. We will be able to dynamically spin them up and down with wild abandon, and just throw away our failed experiments. The DSE will manage our resources for us, and clean up after our mess. The Training Data Generator will allow us to share the same training data with as many models as we want to deploy, and we don't have to worry about making sure each model gets the same or similar data.

We can do our development of the application inside of a container instance of the wotio/tensorflow container we made in the last tutorial.

docker run -i -t wotio/tensorflow

This will drop us in a bash prompt, which we can then use to develop our training data generator. Next we'll setup an isolated Python environment using virtualenv so that while we're developing our solution we don't pollute the system python. It will also make it easier to capture all of the dependencies we added when creating a new Dockerfile.

virtualenv training

We can select this environment by sourcing the training/bin/activate file:

. training/bin/activate

We'll build the rest of our application within the training directory, which will keep our code contained as well. You can checkout the code from GitHub using:

git clone

The MNIST data in contained in a couple of gzipped archives:

  • train-images.idx3-ubyte.gz
  • train-labels.idx1-ubyte.gz

You can think of these files a pair of parallel arrays, one containing image data, and then an identifier for each image in the other. The images contain pictures of the numbers 0 through 9, and the labels take on those same values. Each training file has a header of some sort:

Image data file

Label data file

The goal will be to load both files, and then generate a sequence of messages from the images selected at random, and sent with the label as a meta-data attribute of the image data. The models will interpret the messages with meta-data as training data, and will invoke their training routine on the message. If the message doesn't have a meta-data label, it will instead be run through the model it will forward the result to the consumer with the most likely label attached in the meta-data field. In this way, we can simulate a system in which production data is augmented by machine learning, and then passed on to another layer of applications for further processing.

To read the image file header we'll use a function like:

And to read the label file header we'll use:

Both of these functions take a stream, and return a tuple with the values contained in the header (minus the magic). We can then use the associated streams to read the data into numpy arrays:

By passing in the respective streams (as returned from prior functions), we can read the data into two parallel arrays. We'll randomize our output data by taking the number of elements in both arrays and shuffling the indexes like a pack of card:

With this iterator, we are guaranteed not to repeat any image, and will exhaust the entire training set. We'll then use it to drive our generator in a helper function:

Now we come to the tricky bit. The implementation of wot-python SDK is built on top of Pika, which has a main program loop. Under the hood, we have a large number of asynchronous calls that are driven by the underlying messaging. Rather than modeling this in a continuation passing style (CPS), the wot-python SDK adopts a simple indirect threading model for it's state machine:

Using this interpreter we'll store our program as a sequence of function calls modeled as tuples stored in an array. Start will inject our initial state of our finite state machine into a hidden variable by calling eval. Eval prepends the passed array to the beginning of the hidden fsm deque which we can exploit to mimic subroutine calls. The eval function passes control to the _next function which removes the head form the the fsm deque, and calls apply on the contents of the tuple if any.

The user supplied function is then invoked, and one of 3 scenarios can happen:

  • the function calls eval to run a subroutine
  • the function calls _next to move on to the next instruction
  • the function registers an asynchronous callback which will in turn call eval or _next

Should the hidden fsm deque empty, then processing will terminate, as no further states exist in our finite state model.

This technique for programming via a series of events is particularly powerful when we have lots of nested callbacks. For example, take the definition of the function step in the training program:

It grabs the next index from our randomized list of indexes, and if there is one it schedules a write to a Data Bus resource followed by a call to recuse. Should we run out of indexes, it schedules an exit from the program with status 0.

The write_resource method is itself defined as a series of high level events:

wherein it first ensures the existence of the desired resource, and then sends the data to that resource. The definition of the others are too high level events evaluated by the state machine, with the lowest levels being asynchronous calls whose callbacks invoke the _next to resume evaluation of our hidden fsm.

As such, our top level application is just an array of events passed to the start method:

By linearizing the states in this fashion, we don't need to pass lots of different callbacks, and our intended flow is described in data as program. It doesn't hurt that the resulting python looks a lot like LISP, a favorite of ML researches of ages past, either.

A Simple Consumer

To test the code, we need a simple consumer that will simply echo out what we got from the Data Bus:

You can see the same pattern as with the generator above, wherein we pass a finite state machine model to the start method. In this case, the stream_resource method takes a resource name and a function as an argument, which it will invoke on each message it receives from the given resource. The callback simply echoes the message and it's label to stdout.

With this consumer and generator we can shovel image and label data over the Data Bus, and see it come out the other end. In the next part of this series, we will modify the consumer application to process the training data and build four different machine learning models with Tensorflow.