Welcome to Part 3 of a blog series that introduces TensorFlow Datasets and Estimators. Part 1 focused on pre-made Estimators, while Part 2 discussed feature columns. Here in Part 3, you'll learn how to create your own custom Estimators. In particular, we're going to demonstrate how to create a custom Estimator that mimics DNNClassifier's behavior when solving the Iris problem.
If you are feeling impatient, feel free to compare and contrast the following full programs:
As Figure 1 shows, pre-made Estimators are subclasses of the tf.estimator.Estimator base class, while custom Estimators are an instantiation of tf.estimator.Estimator:
Pre-made Estimators are fully-baked. Sometimes though, you need more control over an Estimator's behavior. That's where custom Estimators come in.
You can create a custom Estimator to do just about anything. If you want hidden layers connected in some unusual fashion, write a custom Estimator. If you want to calculate a unique metric for your model, write a custom Estimator. Basically, if you want an Estimator optimized for your specific problem, write a custom Estimator.
A model function (model_fn) implements your model. The only difference between working with pre-made Estimators and custom Estimators is:
Your model function could implement a wide range of algorithms, defining all sorts of hidden layers and metrics. Like input functions, all model functions must accept a standard group of input parameters and return a standard group of output values. Just as input functions can leverage the Dataset API, model functions can leverage the Layers API and the Metrics API.
Before demonstrating how to implement Iris as a custom Estimator, we wanted to remind you how we implemented Iris as a pre-made Estimator in Part 1 of this series. In that Part, we created a fully connected, deep neural network for the Iris dataset simply by instantiating a pre-made Estimator as follows:
# Instantiate a deep neural network classifier.
classifier = tf.estimator.DNNClassifier(
    feature_columns=feature_columns,  # The input features to our model.
    hidden_units=[10, 10],  # Two layers, each with 10 neurons.
    n_classes=3,  # The number of output classes (three Iris species).
    model_dir=PATH)  # Pathname of directory where checkpoints, etc. are stored.
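The preceding code creates a deep neural network with the following characteristics: the features listed in feature_columns as input, two fully connected hidden layers of 10 neurons each, an output layer with three classes, and a directory (PATH) in which checkpoints are stored.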
Figure 2 illustrates the input layer, hidden layers, and output layer of the Iris model. For clarity, we've only drawn 4 of the nodes in each hidden layer.
Let's see how to solve the same Iris problem with a custom Estimator.
One of the biggest advantages of the Estimator framework is that you can experiment with different algorithms without changing your data pipeline. We will therefore reuse much of the input function from Part 1:
def my_input_fn(file_path, repeat_count=1, shuffle_count=1):
    def decode_csv(line):
        parsed_line = tf.decode_csv(line, [[0.], [0.], [0.], [0.], [0]])
        label = parsed_line[-1]  # Last element is the label
        del parsed_line[-1]  # Delete last element
        features = parsed_line  # Everything but last elements are the features
        d = dict(zip(feature_names, features)), label
        return d

    dataset = (tf.data.TextLineDataset(file_path)  # Read text file
               .skip(1)  # Skip header row
               .map(decode_csv, num_parallel_calls=4)  # Decode each line
               .cache()  # Warning: Caches entire dataset, can cause out of memory
               .shuffle(shuffle_count)  # Randomize elems (1 == no operation)
               .repeat(repeat_count)  # Repeats dataset this # times
               .batch(32)
               .prefetch(1)  # Make sure you always have 1 batch ready to serve
               )
    iterator = dataset.make_one_shot_iterator()
    batch_features, batch_labels = iterator.get_next()
    return batch_features, batch_labels
Notice that the input function returns the following two values:
batch_features
batch_labels
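To make the shape of those two return values concrete, here is a minimal plain-Python sketch of what decoding and batching do conceptually. It is not TensorFlow code: the helper names decode_csv_line and make_batches and the three sample rows are hypothetical, and the feature names are assumed to match the Iris columns used in this series.

```python
# Assumed feature names for the four Iris measurements.
FEATURE_NAMES = ["SepalLength", "SepalWidth", "PetalLength", "PetalWidth"]


def decode_csv_line(line):
    """Split one CSV row into a (features dict, integer label) pair."""
    fields = line.strip().split(",")
    features = [float(value) for value in fields[:-1]]
    label = int(fields[-1])  # Last element is the label
    return dict(zip(FEATURE_NAMES, features)), label


def make_batches(lines, batch_size=32):
    """Yield (batch_features, batch_labels) pairs of at most batch_size rows."""
    examples = [decode_csv_line(line) for line in lines]
    for start in range(0, len(examples), batch_size):
        chunk = examples[start:start + batch_size]
        batch_features = {
            name: [features[name] for features, _ in chunk]
            for name in FEATURE_NAMES
        }
        batch_labels = [label for _, label in chunk]
        yield batch_features, batch_labels


# Hypothetical sample rows (no header row).
SAMPLE_LINES = [
    "5.1,3.5,1.4,0.2,0",
    "7.0,3.2,4.7,1.4,1",
    "6.3,3.3,6.0,2.5,2",
]
batches = list(make_batches(SAMPLE_LINES, batch_size=2))
```

Each batch pairs a dictionary that maps a feature name to a list of values with a parallel list of labels, which mirrors the structure the model function receives as features and labels.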
Refer to Part 1 for full details on input functions.
As detailed in Part 2 of our series, you must define your model's feature columns to specify the representation of each feature. Whether working with pre-made Estimators or custom Estimators, you define feature columns in the same fashion. For example, the following code creates feature columns representing the four features (all numerical) in the Iris dataset:
feature_columns = [
    tf.feature_column.numeric_column(feature_names[0]),
    tf.feature_column.numeric_column(feature_names[1]),
    tf.feature_column.numeric_column(feature_names[2]),
    tf.feature_column.numeric_column(feature_names[3])
]
We are now ready to write the model_fn for our custom Estimator. Let's start with the function declaration:
def my_model_fn(
    features,  # This is batch_features from input_fn
    labels,    # This is batch_labels from input_fn
    mode):     # Instance of tf.estimator.ModeKeys, see below
The first two arguments are the features and labels returned from the input function; that is, features and labels are the handles to the data your model will use. The mode argument indicates whether the caller is requesting training, predicting, or evaluating.
To implement a typical model function, you must do the following:
If your custom Estimator generates a deep neural network, you must define the following three layers:
Use the Layers API (tf.layers) to define hidden and output layers.
If your custom Estimator generates a linear model, then you only have to generate a single layer, which we'll describe in the next section.
Call tf.feature_column.input_layer to define the input layer for a deep neural network. For example:
# Create the layer of input
input_layer = tf.feature_column.input_layer(features, feature_columns)
The preceding line creates our input layer, reading our features through the input function and filtering them through the feature_columns defined earlier. See Part 2 for details on various ways to represent data through feature columns.
To create the input layer for a linear model, call tf.feature_column.linear_model instead of tf.feature_column.input_layer. Since a linear model has no hidden layers, the returned value from tf.feature_column.linear_model serves as both the input layer and output layer. In other words, the returned value from this function is the prediction.
If you are creating a deep neural network, you must define one or more hidden layers. The Layers API provides a rich set of functions to define all types of hidden layers, including convolutional, pooling, and dropout layers. For Iris, we're simply going to call tf.layers.Dense twice to create two dense hidden layers, each with 10 neurons. By "dense," we mean that each neuron in the first hidden layer is connected to each neuron in the second hidden layer. Here's the relevant code:
# Definition of hidden layer: h1
# (Dense returns a Callable so we can provide input_layer as argument to it)
h1 = tf.layers.Dense(10, activation=tf.nn.relu)(input_layer)

# Definition of hidden layer: h2
# (Dense returns a Callable so we can provide h1 as argument to it)
h2 = tf.layers.Dense(10, activation=tf.nn.relu)(h1)
The inputs argument, passed when calling the tf.layers.Dense object, identifies the preceding layer. The layer preceding h1 is the input layer.
Figure 3. The input layer feeds into hidden layer 1.
The preceding layer to h2 is h1. So, the string of layers now looks like this:
Figure 4. Hidden layer 1 feeds into hidden layer 2.
The first argument to tf.layers.Dense defines the number of its output neurons—10 in this case.
The activation parameter defines the activation function (ReLU in this case).
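As a rough illustration of what one dense layer computes for a single example, the following plain-Python sketch multiplies the inputs by a weight matrix, adds a bias, and applies ReLU. The weights, biases, and the helper name dense are made up for illustration; tf.layers.Dense learns its weights during training and operates on whole batches.

```python
def relu(x):
    """Rectified linear unit: negative values are clipped to zero."""
    return max(0.0, x)


def dense(inputs, weights, biases, activation=None):
    """Compute one fully connected layer for a single example.

    weights[j][i] is the weight from input i to output neuron j, so every
    input is connected to every output neuron.
    """
    outputs = []
    for weight_row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(weight_row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


# Hypothetical example: 4 Iris measurements feeding 3 neurons.
example = [5.1, 3.5, 1.4, 0.2]
weights = [
    [0.1, -0.2, 0.3, 0.0],
    [-0.5, 0.1, 0.0, 0.2],
    [0.0, 0.0, 1.0, -1.0],
]
biases = [0.0, 0.1, -0.5]

h = dense(example, weights, biases, activation=relu)
```

The second neuron's weighted sum is negative here, so ReLU clips it to zero, while the other two pass through unchanged.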
Note that tf.layers.Dense provides many additional capabilities, including the ability to set a multitude of regularization parameters. For the sake of simplicity, though, we're going to simply accept the default values of the other parameters. Also, when looking at tf.layers you may encounter lower-case versions (e.g. tf.layers.dense). As a general rule, you should use the class versions which start with a capital letter (tf.layers.Dense).
We'll define the output layer by calling tf.layers.Dense yet again:
# Output 'logits' layer is three numbers: one unnormalized score per class
# (Dense returns a Callable so we can provide h2 as argument to it)
logits = tf.layers.Dense(3)(h2)
Notice that the output layer receives its input from h2. Therefore, the full set of layers is now connected as follows:
Figure 5. Hidden layer 2 feeds into the output layer.
When defining an output layer, the units parameter specifies the number of possible output values. So, by setting units to 3, the tf.layers.Dense function establishes a three-element logits vector. Each cell of the logits vector contains an unnormalized score for the Iris being Setosa, Versicolor, or Virginica, respectively; applying softmax to the logits converts these scores into probabilities.
Since the output layer is a final layer, the call to tf.layers.Dense omits the optional activation parameter.
The final step in creating a model function is to write branching code that implements prediction, evaluation, and training.
The model function gets invoked whenever someone calls the Estimator's train, evaluate, or predict methods. Recall that the signature for the model function is my_model_fn(features, labels, mode).
Focus on that third argument, mode. As the following table shows, when someone calls train, evaluate, or predict, the Estimator framework invokes your model function with the mode parameter set as follows:
Estimator method | mode parameter
train() | ModeKeys.TRAIN
evaluate() | ModeKeys.EVAL
predict() | ModeKeys.PREDICT
For example, suppose you instantiate a custom Estimator to generate an object named classifier. Then, you might make the following call (never mind the parameters to my_input_fn at this time):
classifier.train( input_fn=lambda: my_input_fn(FILE_TRAIN, repeat_count=500, shuffle_count=256))
The Estimator framework then calls your model function with mode set to ModeKeys.TRAIN.
Your model function must provide code to handle all three of the mode values. For each mode value, your code must return an instance of tf.estimator.EstimatorSpec, which contains the information the caller requires. Let's examine each mode.
When model_fn is called with mode == ModeKeys.PREDICT, the model function must return a tf.estimator.EstimatorSpec containing the following information: the mode, which is tf.estimator.ModeKeys.PREDICT, and the predictions.
The model must have been trained prior to making a prediction. The trained model is stored on disk in the directory established when you instantiated the Estimator.
For our case, the code to generate the prediction looks as follows:
# class_ids will be the model prediction for the class (Iris flower type)
# The output node with the highest value is our prediction
predictions = { 'class_ids': tf.argmax(input=logits, axis=1) }

# Return our prediction
if mode == tf.estimator.ModeKeys.PREDICT:
    return tf.estimator.EstimatorSpec(mode, predictions=predictions)
The block is surprisingly brief--the lines of code are simply the bucket at the end of a long hose that catches the falling predictions. After all, the Estimator has already done all the heavy lifting to make a prediction:
The output layer is a logits vector that contains one score for each of the three Iris species; the higher the score, the more likely it is that the input flower belongs to that species. The tf.argmax function returns the index of the highest value in that logits vector, which identifies the predicted species.
Notice that the index of the highest value is assigned to a dictionary key named class_ids. We return that dictionary through the predictions parameter of tf.estimator.EstimatorSpec. The caller can then retrieve the prediction by examining the dictionary passed back to the Estimator's predict method.
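To see how a logits vector relates to a predicted class, consider this small plain-Python sketch of softmax and argmax for one example. The logits values are hypothetical, and the helper functions are illustrative stand-ins rather than TensorFlow's implementations.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into probabilities that sum to 1."""
    largest = max(logits)  # Subtract the max for numerical stability
    exps = [math.exp(value - largest) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Return the index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical logits for one flower: Setosa, Versicolor, Virginica.
logits = [1.2, 3.4, -0.5]
probabilities = softmax(logits)
class_id = argmax(logits)
```

Because softmax preserves the ordering of the scores, taking the argmax of the logits selects the same class as taking the argmax of the probabilities, which is why the prediction code can skip the softmax step.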
When model_fn is called with mode == ModeKeys.EVAL, the model function must evaluate the model, returning loss and possibly one or more metrics.
We can calculate loss by calling tf.losses.sparse_softmax_cross_entropy. Here's the complete code:
# To calculate the loss, we need to convert our labels
# Our input labels have shape: [batch_size, 1]
labels = tf.squeeze(labels, 1)  # Convert to shape [batch_size]
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
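Conceptually, sparse softmax cross-entropy is the negative log of the softmax probability that the model assigns to the true class. The plain-Python sketch below computes that quantity per example and then averages it; the logits and labels are hypothetical, and averaging over the batch is an assumption about the reduction made for illustration.

```python
import math


def sparse_softmax_cross_entropy(logits, label):
    """Loss for one example: -log(softmax(logits)[label]).

    'Sparse' means the label is a class index rather than a one-hot vector.
    """
    largest = max(logits)
    log_sum_exp = largest + math.log(
        sum(math.exp(value - largest) for value in logits))
    return log_sum_exp - logits[label]


# Hypothetical batch of two examples with three classes each.
batch_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
batch_labels = [0, 2]

losses = [
    sparse_softmax_cross_entropy(logits, label)
    for logits, label in zip(batch_logits, batch_labels)
]
mean_loss = sum(losses) / len(losses)
```

The first example scores its true class highest and therefore has a small loss, while the second example spreads its scores evenly, so its loss equals log(3).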
Now let's turn our attention to metrics. Although returning metrics is optional, most custom Estimators return at least one metric. TensorFlow provides a Metrics API (tf.metrics) to calculate different kinds of metrics. For brevity's sake, we'll only return accuracy. The tf.metrics.accuracy function compares our predictions against the "true labels", that is, against the labels provided by the input function. It requires the labels and predictions to have the same shape (which we ensured earlier by squeezing the labels). Here's the call to tf.metrics.accuracy:
# Calculate the accuracy between the true labels, and our predictions
accuracy = tf.metrics.accuracy(labels, predictions['class_ids'])
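Accuracy in this setting is accumulated across batches rather than computed once, and tf.metrics.accuracy returns a pair consisting of the current value and an update operation. The following plain-Python sketch shows the running-count idea; the class name StreamingAccuracy and the sample labels are hypothetical and are not part of the Metrics API.

```python
class StreamingAccuracy:
    """Running accuracy: correct predictions divided by examples seen."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, labels, predictions):
        """Fold one batch into the running counts and return the new value."""
        self.correct += sum(
            1 for label, pred in zip(labels, predictions) if label == pred)
        self.total += len(labels)
        return self.value()

    def value(self):
        return self.correct / self.total if self.total else 0.0


metric = StreamingAccuracy()
first = metric.update([0, 1, 2, 2], [0, 1, 1, 2])  # 3 of 4 correct
second = metric.update([1, 1], [1, 0])             # 4 of 6 correct overall
```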
When the model is called with mode == ModeKeys.EVAL, the model function returns a tf.estimator.EstimatorSpec containing the following information: the mode, which is tf.estimator.ModeKeys.EVAL, the model's loss, and, optionally, one or more metrics in a Python dictionary.
So, we'll create a dictionary containing our sole metric (my_accuracy). If we had calculated other metrics, we would have added them as additional key/value pairs to that same dictionary. Then, we'll pass that dictionary in the eval_metric_ops argument of tf.estimator.EstimatorSpec. Here's the block:
# Return our loss (which is used to evaluate our model)
# Set the TensorBoard scalar my_accuracy to the accuracy
# Note: This function only sets the value during mode == ModeKeys.EVAL
# To set values during training, see tf.summary.scalar
if mode == tf.estimator.ModeKeys.EVAL:
    return tf.estimator.EstimatorSpec(
        mode,
        loss=loss,
        eval_metric_ops={'my_accuracy': accuracy})
When model_fn is called with mode == ModeKeys.TRAIN, the model function must train the model.
We must first instantiate an optimizer object. We picked Adagrad (tf.train.AdagradOptimizer) in the following code block only because we're mimicking the DNNClassifier, which also uses Adagrad. The tf.train package provides many other optimizers—feel free to experiment with them.
Next, we train the model by establishing an objective on the optimizer, which is simply to minimize its loss. To establish that objective, we call the minimize method.
In the code below, the optional global_step argument specifies the variable that TensorFlow uses to count the number of batches that have been processed. Passing the value returned by tf.train.get_global_step() works well. We also call tf.summary.scalar to report my_accuracy to TensorBoard during training. For both of these notes, please see the section on TensorBoard below for further explanation.
optimizer = tf.train.AdagradOptimizer(0.05)
train_op = optimizer.minimize(
    loss,
    global_step=tf.train.get_global_step())

# Set the TensorBoard scalar my_accuracy to the accuracy
tf.summary.scalar('my_accuracy', accuracy[1])
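For intuition about what the optimizer does on each step, here is a minimal sketch of the Adagrad update rule on a one-variable toy problem. The toy loss (w - 3)^2, the starting point, the step count, and the initial accumulator value of 0.1 are assumptions chosen for illustration; they are not taken from this post, and the sketch is not TensorFlow's implementation.

```python
import math


def adagrad_minimize(grad_fn, w, learning_rate=0.05, steps=200,
                     initial_accumulator=0.1):
    """Run Adagrad on a single scalar parameter w.

    Adagrad accumulates squared gradients and divides each step by the
    square root of that sum, so the effective step size shrinks over time.
    """
    accumulator = initial_accumulator
    for _ in range(steps):
        gradient = grad_fn(w)
        accumulator += gradient * gradient
        w -= learning_rate * gradient / math.sqrt(accumulator)
    return w


def toy_loss(w):
    return (w - 3.0) ** 2


def toy_grad(w):
    return 2.0 * (w - 3.0)


w_start = 0.0
w_end = adagrad_minimize(toy_grad, w_start)
```

Each call to the train op in the real model performs one such update for every trainable weight, using the gradient of the loss computed on the current batch.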
When the model is called with mode == ModeKeys.TRAIN, the model function must return a tf.estimator.EstimatorSpec containing the following information: the mode, which is tf.estimator.ModeKeys.TRAIN, the loss, and the training op.
Here's the code:
# Return training operations: loss and train_op
return tf.estimator.EstimatorSpec(
    mode,
    loss=loss,
    train_op=train_op)
Our model function is now complete!
After creating your new custom Estimator, you'll want to take it for a ride. Start by instantiating the custom Estimator through the Estimator base class as follows:
classifier = tf.estimator.Estimator(
    model_fn=my_model_fn,
    model_dir=PATH)  # Path to where checkpoints etc are stored
The rest of the code to train, evaluate, and predict using our estimator is the same as for the pre-made DNNClassifier described in Part 1. For example, the classifier.train call shown earlier triggers training the model.
As in Part 1, we can view some training results in TensorBoard. To see this reporting, start TensorBoard from your command-line as follows:
# Replace PATH with the actual path passed as model_dir
tensorboard --logdir=PATH
Then browse to the following URL:
localhost:6006
All the pre-made Estimators automatically log a lot of information to TensorBoard. With custom Estimators, however, TensorBoard only provides one default log (a graph of loss) plus the information we explicitly tell TensorBoard to log. Therefore, TensorBoard generates the following from our custom Estimator:
Figure 6. TensorBoard displays three graphs.
In brief, here's what the three graphs tell you:
tf.train.get_global_step()
eval_metric_ops={'my_accuracy': accuracy})
EVAL
EstimatorSpec
tf.summary.scalar('my_accuracy', accuracy[1])
TRAIN
Note the following in the my_accuracy and loss graphs:
During TRAIN, orange values are recorded continuously as batches are processed, which is why the orange plot is a curve spanning the x-axis range. By contrast, EVAL produces only a single value from processing all the evaluation steps.
As suggested in Figure 7, you may view, and selectively enable or disable, the reporting for training and evaluation on the left side. (Figure 7 shows that we kept reporting on for both.)
Figure 7. Enable or disable reporting.
In order to see the orange graph, you must specify a global step. This, in combination with getting global_steps/sec reported, makes it a best practice to always register a global step by passing tf.train.get_global_step() as an argument to the optimizer.minimize call.
Although pre-made Estimators can be an effective way to quickly create new models, you will often need the additional flexibility that custom Estimators provide. Fortunately, pre-made and custom Estimators follow the same programming model. The only practical difference is that you must write a model function for custom Estimators. Everything else is the same!
For more details, be sure to check out:
input_layer
Until next time - Happy TensorFlow coding!
On November 14th, we announced the developer preview of TensorFlow Lite, TensorFlow's lightweight solution for mobile and embedded devices.
Today, in collaboration with Apple, we are happy to announce support for Core ML! With this announcement, iOS developers can leverage the strengths of Core ML for deploying TensorFlow models. In addition, TensorFlow Lite will continue to support cross-platform deployment, including iOS, through the TensorFlow Lite format (.tflite) as described in the original announcement.
Support for Core ML is provided through a tool that takes a TensorFlow model and converts it to the Core ML Model Format (.mlmodel).
For more information, check out the TensorFlow Lite documentation pages, and the Core ML converter. The pypi pip installable package is available here: https://pypi.python.org/pypi/tfcoreml/0.1.0.
Stay tuned for more updates.
Happy TensorFlow Lite coding!
Welcome to Part 2 of a blog series that introduces TensorFlow Datasets and Estimators. We're devoting this article to feature columns—a data structure describing the features that an Estimator requires for training and inference. As you'll see, feature columns are very rich, enabling you to represent a diverse range of data.
In Part 1, we used the pre-made Estimator DNNClassifier to train a model to predict different types of Iris flowers from four input features. That example created only numerical feature columns (of type tf.feature_column.numeric_column). Although those feature columns were sufficient to model the lengths of petals and sepals, real world data sets contain all kinds of non-numerical features. For example:
How can we represent non-numerical feature types? That's exactly what this blogpost is all about.
Let's start by asking what kind of data we can actually feed into a deep neural network. The answer is, of course, numbers (for example, tf.float32). After all, every neuron in a neural network performs multiplication and addition operations on weights and input data. Real-life input data, however, often contains non-numerical (categorical) data. For example, consider a product_class feature that can contain the following three non-numerical values:
kitchenware
electronics
sports
ML models generally represent categorical values as simple vectors in which a 1 represents the presence of a value and a 0 represents the absence of a value. For example, when product_class is set to sports, an ML model would usually represent product_class as [0, 0, 1], meaning:
So, although raw data can be numerical or categorical, an ML model represents all features as either a number or a vector of numbers.
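The one-hot representation described above can be sketched in a few lines of plain Python. The helper name one_hot is hypothetical, and the vocabulary order is assumed to follow the order in which the three product classes are listed.

```python
# Assumed vocabulary order for the product_class feature.
VOCABULARY = ["kitchenware", "electronics", "sports"]


def one_hot(value, vocabulary):
    """Return a vector with a 1 at the position of value and 0 elsewhere."""
    return [1 if value == entry else 0 for entry in vocabulary]


sports_vector = one_hot("sports", VOCABULARY)
kitchenware_vector = one_hot("kitchenware", VOCABULARY)
```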
As Figure 2 suggests, you specify the input to a model through the feature_columns argument of an Estimator (DNNClassifier for Iris). Feature Columns bridge input data (as returned by input_fn) with your model.
To represent features as a feature column, call functions of the tf.feature_column package. This blogpost explains nine of the functions in this package. As Figure 3 shows, all nine functions return either a Categorical-Column or a Dense-Column object, except bucketized_column which inherits from both classes:
Let's look at these functions in more detail.
The Iris classifier called tf.feature_column.numeric_column() for all input features: SepalLength, SepalWidth, PetalLength, PetalWidth. Although tf.feature_column.numeric_column() provides optional arguments, calling the function with only the required key argument is a perfectly fine way to specify a numerical value with the default data type (tf.float32) as input to your model. For example:
# Defaults to a tf.float32 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength")
Use the dtype argument to specify a non-default numerical data type. For example:
# Represent a tf.float64 scalar.
numeric_feature_column = tf.feature_column.numeric_column(key="SepalLength",
                                                          dtype=tf.float64)
By default, a numeric column creates a single value (scalar). Use the shape argument to specify another shape. For example:
# Represent a 10-element vector in which each cell contains a tf.float32.
vector_feature_column = tf.feature_column.numeric_column(key="Bowling",
                                                         shape=10)

# Represent a 10x5 matrix in which each cell contains a tf.float32.
matrix_feature_column = tf.feature_column.numeric_column(key="MyMatrix",
                                                         shape=[10,5])
Often, you don't want to feed a number directly into the model, but instead split its value into different categories based on numerical ranges. To do so, create a bucketized column. For example, consider raw data that represents the year a house was built. Instead of representing that year as a scalar numeric column, we could split year into the following four buckets:
The model will represent the buckets as follows:
Why would you want to split a number, which is a perfectly valid input to our model, into a categorical value like this? Well, notice that the categorization splits a single input number into a four-element vector. Therefore, the model now can learn four individual weights rather than just one. Four weights create a richer model than one. More importantly, bucketizing enables the model to clearly distinguish between different year categories since only one of the elements is set (1) and the other three elements are cleared (0). When we just use a single number (a year) as input, the model can't distinguish categories. So, bucketing provides the model with additional important information that it can use to learn.
The following code demonstrates how to create a bucketized feature:
# A numeric column for the raw input.
numeric_feature_column = tf.feature_column.numeric_column("Year")

# Bucketize the numeric column on the years 1960, 1980, and 2000
bucketized_feature_column = tf.feature_column.bucketized_column(
    source_column = numeric_feature_column,
    boundaries = [1960, 1980, 2000])
Note the following:
tf.feature_column.bucketized_column()
boundaries
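To illustrate how three boundaries yield four buckets, here is a plain-Python sketch that uses bisect to find the bucket for a year and then one-hot encodes it. The helper names are hypothetical; the sketch assumes each bucket includes its lower boundary and excludes its upper boundary.

```python
import bisect

BOUNDARIES = [1960, 1980, 2000]  # Three boundaries produce four buckets


def bucket_index(year, boundaries):
    """Return the bucket for year: <1960 -> 0, [1960, 1980) -> 1, and so on."""
    return bisect.bisect_right(boundaries, year)


def bucketized_one_hot(year, boundaries):
    """One-hot vector with one element per bucket."""
    index = bucket_index(year, boundaries)
    return [1 if i == index else 0 for i in range(len(boundaries) + 1)]


vector_1985 = bucketized_one_hot(1985, BOUNDARIES)
```

A year exactly on a boundary, such as 1960, falls into the bucket that starts at that boundary.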
Categorical identity columns are a special case of bucketized columns. In traditional bucketized columns, each bucket represents a range of values (for example, from 1960 to 1979). In a categorical identity column, each bucket represents a single, unique integer. For example, let's say you want to represent the integer range [0, 4). (That is, you want to represent the integers 0, 1, 2, or 3.) In this case, the categorical identity mapping looks like this:
So, why would you want to represent values as categorical identity columns? As with bucketized columns, a model can learn a separate weight for each class in a categorical identity column. For example, instead of using a string to represent the product_class, let's represent each class with a unique integer value. That is:
0="kitchenware"
1="electronics"
2="sports"
Call tf.feature_column.categorical_column_with_identity() to implement a categorical identity column. For example:
# Create a categorical output for input "feature_name_from_input_fn",
# which must be of integer type. Value is expected to be >= 0 and < num_buckets
identity_feature_column = tf.feature_column.categorical_column_with_identity(
    key='feature_name_from_input_fn',
    num_buckets=4)  # Values [0, 4)

# The 'feature_name_from_input_fn' above needs to match an integer key that is
# returned from input_fn (see below). So for this case, 'Integer_1' or
# 'Integer_2' would be valid strings instead of 'feature_name_from_input_fn'.
# For more information, please check out Part 1 of this blog series.
def input_fn():
    ...<code>...
    return ({ 'Integer_1':[values], ..<etc>.., 'Integer_2':[values] },
            [Label_values])
We cannot input strings directly to a model. Instead, we must first map strings to numeric or categorical values. Categorical vocabulary columns provide a good way to represent strings as a one-hot vector. For example:
As you can see, categorical vocabulary columns are kind of an enum version of categorical identity columns. TensorFlow provides two different functions to create categorical vocabulary columns:
tf.feature_column.categorical_column_with_vocabulary_list()
tf.feature_column.categorical_column_with_vocabulary_file()
The tf.feature_column.categorical_column_with_vocabulary_list() function maps each string to an integer based on an explicit vocabulary list. For example:
# Given input "feature_name_from_input_fn" which is a string,
# create a categorical feature to our model by mapping the input to one of
# the elements in the vocabulary list.
vocabulary_feature_column =
    tf.feature_column.categorical_column_with_vocabulary_list(
        key="feature_name_from_input_fn",
        vocabulary_list=["kitchenware", "electronics", "sports"])
The preceding function has a significant drawback; namely, there's way too much typing when the vocabulary list is long. For these cases, call tf.feature_column.categorical_column_with_vocabulary_file() instead, which lets you place the vocabulary words in a separate file. For example:
# Given input "feature_name_from_input_fn" which is a string,
# create a categorical feature to our model by mapping the input to one of
# the elements in the vocabulary file
vocabulary_feature_column =
    tf.feature_column.categorical_column_with_vocabulary_file(
        key="feature_name_from_input_fn",
        vocabulary_file="product_class.txt",
        vocabulary_size=3)

# product_class.txt should have one line per vocabulary element, in our case:
#   kitchenware
#   electronics
#   sports
So far, we've worked with a naively small number of categories. For example, our product_class example has only 3 categories. Often though, the number of categories can be so big that it's not possible to have individual categories for each vocabulary word or integer because that would consume too much memory. For these cases, we can instead turn the question around and ask, "How many categories am I willing to have for my input?" In fact, the tf.feature_column.categorical_column_with_hash_bucket() function enables you to specify the number of categories. For example, the following code shows how this function calculates a hash value of the input, then puts it into one of the hash_bucket_size categories using the modulo operator:
# Create categorical output for input "feature_name_from_input_fn".
# Category becomes: hash_value(<value of the input feature>) % hash_bucket_size
hashed_feature_column = tf.feature_column.categorical_column_with_hash_bucket(
    key = "feature_name_from_input_fn",
    hash_bucket_size = 100)  # The number of categories
At this point, you might rightfully think: "This is crazy!" After all, we are forcing the different input values into a smaller set of categories. This means that two inputs, probably completely unrelated, will be mapped to the same category and consequently mean the same thing to the neural network. Figure 7 illustrates this dilemma, showing that kitchenware and sports both get assigned to category (hash bucket) 12:
As with many counterintuitive phenomena in machine learning, it turns out that hashing often works well in practice. That's because hash categories provide the model with some separation. The model can use additional features to further separate kitchenware from sports.
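The hash-then-modulo idea, and the collisions it causes, can be sketched in plain Python. This is only an illustration: it uses MD5 from hashlib as a stand-in hash so the result is repeatable, which is an assumption and not the hash function TensorFlow uses, so the bucket numbers will not match those in the figure.

```python
import hashlib


def hash_bucket(value, hash_bucket_size):
    """Map a string to a bucket in [0, hash_bucket_size) via hash and modulo."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % hash_bucket_size


categories = ["kitchenware", "electronics", "sports"]

# With plenty of buckets, collisions are unlikely but still possible.
buckets_large = {name: hash_bucket(name, 100) for name in categories}

# With fewer buckets than categories, at least two categories must collide.
buckets_small = {name: hash_bucket(name, 2) for name in categories}
```

Shrinking the number of buckets below the number of distinct values forces collisions, which is the trade-off between memory use and separation discussed above.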
The last categorical column we'll cover allows us to combine multiple input features into a single one. Combining features, better known as feature crosses, enables the model to learn separate weights specifically for whatever that feature combination means.
More concretely, suppose we want our model to calculate real estate prices in Atlanta, GA. Real-estate prices within this city vary greatly depending on location. Representing latitude and longitude as separate features isn't very useful in identifying real-estate location dependencies; however, crossing latitude and longitude into a single feature can pinpoint locations. Suppose we represent Atlanta as a grid of 100x100 rectangular sections, identifying each of the 10,000 sections by a cross of its latitude and longitude. This cross enables the model to pick up on pricing conditions related to each individual section, which is a much stronger signal than latitude and longitude alone.
Figure 8 shows our plan, with the latitude & longitude values for the corners of the city:
For the solution, we used a combination of some feature columns we've looked at before, as well as the tf.feature_column.crossed_column() function.
import numpy as np

# In our input_fn, we read the raw longitude and latitude values. The
# bucketized columns below convert them to integer buckets in the range [0, 100)
def input_fn():
    # Using Datasets, read the input values for longitude and latitude
    latitude = ...   # A tf.float32 value
    longitude = ...  # A tf.float32 value

    # In our example we just return our latitude and longitude features.
    # The dictionary of a complete program would probably have more keys.
    return { "latitude": latitude, "longitude": longitude, ...}, labels

# As can be seen from the map, we want to split the latitude range
# [33.641336, 33.887157] into 100 buckets. To do this we use np.linspace
# to get a list of 99 numbers between min and max of this range.
# Using this list we can bucketize latitude into 100 buckets.
latitude_buckets = list(np.linspace(33.641336, 33.887157, 99))
latitude_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('latitude'),
    latitude_buckets)

# Do the same bucketization for longitude as done for latitude.
longitude_buckets = list(np.linspace(-84.558798, -84.287259, 99))
longitude_fc = tf.feature_column.bucketized_column(
    tf.feature_column.numeric_column('longitude'),
    longitude_buckets)

# Create a feature cross of latitude_fc x longitude_fc.
fc_san_francisco_boxed = tf.feature_column.crossed_column(
    keys=[latitude_fc, longitude_fc],
    hash_bucket_size=1000)  # No precise rule, maybe 1000 buckets will be good?
You may create a feature cross from either of the following:
dict
categorical_column_with_hash_bucket
When feature columns latitude_fc and longitude_fc are crossed, TensorFlow will create 10,000 combinations of (latitude_fc, longitude_fc) organized as follows:
(0,0),  (0,1)  ... (0,99)
(1,0),  (1,1)  ... (1,99)
 ...,    ...,  ...
(99,0), (99,1) ... (99,99)
The function tf.feature_column.crossed_column performs a hash calculation on these combinations and then slots the result into a category by performing a modulo operation with hash_bucket_size. As discussed before, performing the hash and modulo function will probably result in category collisions; that is, multiple (latitude, longitude) feature crosses will end up in the same hash bucket. In practice though, performing feature crosses still provides significant value to the learning capability of your models.
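To make the bucketize, hash, and modulo steps concrete, here is a minimal sketch in plain Python. It is not how TensorFlow implements crossed_column: the helper names (make_boundaries, bucketize, cross_bucket) are made up for this illustration, and MD5 is only a deterministic stand-in for TensorFlow's internal hash function.

```python
import bisect
import hashlib


def make_boundaries(lo, hi, num_boundaries=99):
    """Return evenly spaced cut points; 99 boundaries yield 100 buckets."""
    step = (hi - lo) / (num_boundaries - 1)
    return [lo + i * step for i in range(num_boundaries)]


def bucketize(value, boundaries):
    """Return the bucket index (0..len(boundaries)) that `value` falls into."""
    return bisect.bisect_right(boundaries, value)


def cross_bucket(lat_bucket, lon_bucket, hash_bucket_size):
    """Hash the (lat, lon) bucket pair, then take it modulo hash_bucket_size.

    MD5 is an assumption made for determinism; it is NOT TensorFlow's hash.
    """
    key = "{}_X_{}".format(lat_bucket, lon_bucket).encode("utf-8")
    return int(hashlib.md5(key).hexdigest(), 16) % hash_bucket_size


lat_boundaries = make_boundaries(33.641336, 33.887157)
lon_boundaries = make_boundaries(-84.558798, -84.287259)

lat_id = bucketize(33.75, lat_boundaries)
lon_id = bucketize(-84.39, lon_boundaries)
print(lat_id, lon_id, cross_bucket(lat_id, lon_id, hash_bucket_size=1000))

# 10,000 possible crosses squeezed into 1,000 hash buckets must collide.
used = {cross_bucket(i, j, 1000) for i in range(100) for j in range(100)}
print(len(used))  # never more than 1000 distinct buckets
```

Because 10,000 crosses share at most 1,000 buckets, on average about ten grid sections land in each bucket, which is the collision effect described above.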
Somewhat counterintuitively, when creating feature crosses, you typically still should include the original (uncrossed) features in your model. For example, provide not only the (latitude, longitude) feature cross but also latitude and longitude as separate features. The separate latitude and longitude features help the model separate the contents of hash buckets containing different feature crosses.
See this link for a full code example. Also, see the reference section at the end of this post for many more examples of feature crossing.
Indicator columns and embedding columns never work on features directly, but instead take categorical columns as input.
When using an indicator column, we're telling TensorFlow to do exactly what we've seen in our categorical product_class example. That is, an indicator column treats each category as an element in a one-hot vector, where the matching category has value 1 and the rest have 0s:
Here's how you create an indicator column:
categorical_column = ... # Create any type of categorical column, see Figure 3

# Represent the categorical column as an indicator column.
# This means creating a one-hot vector with one element for each category.
indicator_column = tf.feature_column.indicator_column(categorical_column)
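If it helps to see the one-hot idea without any TensorFlow machinery, here is a tiny plain-Python sketch. The one_hot helper is hypothetical and only mimics what an indicator column produces for a single categorical id.

```python
def one_hot(category_id, num_categories):
    """Return a list with a 1 at `category_id` and 0s everywhere else."""
    vec = [0] * num_categories
    vec[category_id] = 1
    return vec


# With three possible classes, category 1 becomes [0, 1, 0].
print(one_hot(1, 3))  # [0, 1, 0]
```

Note that the vector length always equals the number of categories, which is why this representation becomes impractical when the number of categories is very large.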
Now, suppose instead of having just three possible classes, we have a million. Or maybe a billion. For a number of reasons (too technical to cover here), as the number of categories grows large, it becomes infeasible to train a neural network using indicator columns.
We can use an embedding column to overcome this limitation. Instead of representing the data as a one-hot vector of many dimensions, an embedding column represents that data as a lower-dimensional, ordinary vector in which each cell can contain any number, not just 0 or 1. By permitting a richer palette of numbers for every cell, an embedding column contains far fewer cells than an indicator column.
Let's look at an example comparing indicator and embedding columns. Suppose our input examples consist of different words from a limited palette of only 81 words. Further suppose that the data set provides the following input words in 4 separate examples:
In that case, Figure 10 illustrates the processing path for embedding columns or indicator columns.
When an example is processed, one of the categorical_column_with... functions maps the example string to a numerical categorical value. For example, a function maps "spoon" to [32]. (The 32 comes from our imagination—the actual values depend on the mapping function.) You may then represent these numerical categorical values in either of the following two ways:
How do the values in the embedding vectors magically get assigned? Actually, the assignments happen during training. That is, the model learns the best way to map your input numeric categorical values to embedding vector values in order to solve your problem. Embedding columns increase your model's capabilities, since an embedding vector learns new relationships between categories from the training data.
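To see the difference in size between the two representations, here is a rough plain-Python sketch. It assumes the 81-word vocabulary and 3-dimensional embedding from our example; the table is filled with random starting values, whereas a real model would adjust them during training. The names embedding_table, embed, and one_hot are made up for this illustration.

```python
import random

VOCAB_SIZE = 81     # number of categories in our example
EMBEDDING_DIM = 3   # size of each embedding vector

rng = random.Random(0)
# One row per category. Training would update these values; in this sketch
# they simply keep their random starting values.
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(category_id):
    """Look up the dense vector for a categorical id (a plain row selection)."""
    return embedding_table[category_id]


def one_hot(category_id):
    """The indicator-column view of the same id, for comparison."""
    vec = [0] * VOCAB_SIZE
    vec[category_id] = 1
    return vec


spoon_id = 32  # the imaginary id for "spoon" used in the text
print(len(one_hot(spoon_id)))  # 81
print(len(embed(spoon_id)))    # 3
```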
Why is the embedding vector size 3 in our example? Well, the following "formula" provides a general rule of thumb about the number of embedding dimensions:
embedding_dimensions = number_of_categories**0.25
That is, the embedding vector dimension should be the 4th root of the number of categories. Since our vocabulary size in this example is 81, the recommended number of dimensions is 3:
3 = 81**0.25
Note that this is just a general guideline; you can set the number of embedding dimensions as you please.
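As a quick sanity check of the rule of thumb, the following sketch computes the suggested dimension for a few vocabulary sizes. The function name suggested_embedding_dim is hypothetical, and rounding to the nearest integer (with a floor of 1) is our own assumption, since the formula itself does not say how to round.

```python
def suggested_embedding_dim(number_of_categories):
    """Fourth root of the category count, rounded to the nearest integer."""
    return max(1, round(number_of_categories ** 0.25))


print(suggested_embedding_dim(81))         # 3
print(suggested_embedding_dim(10000))      # 10
print(suggested_embedding_dim(1000000))    # 32
```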
Call tf.feature_column.embedding_column to create an embedding_column. The dimension of the embedding vector depends on the problem at hand as described above, but common values go as low as 3 all the way to 300 or even beyond:
categorical_column = ... # Create any categorical column shown in Figure 3.

# Represent the categorical column as an embedding column.
# This means creating a lookup table that maps each category to a dense
# vector of the given dimension, with values learned during training.
embedding_column = tf.feature_column.embedding_column(
    categorical_column=categorical_column,
    dimension=dimension_of_embedding_vector)
Embeddings are a big topic within machine learning. This information was just to get you started using them as feature columns. Please see the end of this post for more information.
Still there? I hope so, because we only have a tiny bit left before you've graduated from the basics of feature columns.
As we saw in Figure 1, feature columns map your input data (described by the feature dictionary returned from input_fn) to values fed to your model. You specify feature columns as a list to a feature_columns argument of an estimator. Note that the feature_columns argument(s) vary depending on the Estimator:
LinearClassifier
LinearRegressor
indicator_column
embedding_column
DNNLinearCombinedClassifier
DNNLinearCombinedRegressor
linear_feature_columns
dnn_feature_columns
DNNRegressor
The reasons for the above rules are beyond the scope of this introductory post, but we will make sure to cover them in a future blog post.
Use feature columns to map your input data to the representations you feed your model. We only used numeric_column in Part 1 of this series, but working with the other functions described in this post, you can easily create other feature columns.
For more details on feature columns, be sure to check out:
If you want to learn more about embeddings:
TensorFlow release 1.4 is now public - and this is a big one! So we're happy to announce a number of new and exciting features we hope everyone will enjoy.
In 1.4, Keras has graduated from tf.contrib.keras to core package tf.keras. Keras is a hugely popular machine learning framework, consisting of high-level APIs to minimize the time between your ideas and working implementations. Keras integrates smoothly with other core TensorFlow functionality, including the Estimator API. In fact, you may construct an Estimator directly from any Keras model by calling the tf.keras.estimator.model_to_estimator function. With Keras now in TensorFlow core, you can rely on it for your production workflows.
To get started with Keras, please read:
To get started with Estimators, please read:
We're pleased to announce that the Dataset API has graduated to core package tf.data (from tf.contrib.data). The 1.4 version of the Dataset API also adds support for Python generators. We strongly recommend using the Dataset API to create input pipelines for TensorFlow models because:
feed_dict
We're going to focus future development on the Dataset API rather than the older APIs.
To get started with Datasets, please read:
Release 1.4 also introduces the utility function tf.estimator.train_and_evaluate, which simplifies training, evaluation, and exporting Estimator models. This function enables distributed execution for training and evaluation, while still supporting local execution.
Beyond the features called out in this announcement, 1.4 also introduces a number of additional enhancements, which are described in the Release Notes.
TensorFlow release 1.4 is now available using standard pip installation.
# Note: the following command will overwrite any existing TensorFlow
# installation.
$ pip install --ignore-installed --upgrade tensorflow
# Use pip for Python 2.7
# Use pip3 instead of pip for Python 3.x
We've updated the documentation on tensorflow.org to 1.4.
TensorFlow depends on contributors for enhancements. A big thank you to everyone helping to develop TensorFlow! Don't hesitate to join the community and become a contributor by developing the source code on GitHub or by answering questions on Stack Overflow.
We hope you enjoy all the features in this release.
Happy TensorFlow Coding!
Today, we introduce eager execution for TensorFlow.
Eager execution is an imperative, define-by-run interface where operations are executed immediately as they are called from Python. This makes it easier to get started with TensorFlow, and can make research and development more intuitive.
The benefits of eager execution include:
Eager execution is available now as an experimental feature, so we're looking for feedback from the community to guide our direction.
To understand this all better, let's look at some code. This gets pretty technical; familiarity with TensorFlow will help.
When you enable eager execution, operations execute immediately and return their values to Python without requiring a Session.run(). For example, to multiply two matrices together, we write this:
import tensorflow as tf
import tensorflow.contrib.eager as tfe

tfe.enable_eager_execution()

x = [[2.]]
m = tf.matmul(x, x)
It's straightforward to inspect intermediate results with print or the Python debugger.
print(m) # The 1x1 matrix [[4.]]
Dynamic models can be built with Python flow control. Here's an example of the Collatz conjecture using TensorFlow's arithmetic operations:
a = tf.constant(12)
counter = 0
while not tf.equal(a, 1):
  if tf.equal(a % 2, 0):
    a = a // 2  # Integer division keeps `a` an integer tensor.
  else:
    a = 3 * a + 1
  counter += 1  # Count the steps taken so far.
  print(a)
Here, the use of the tf.constant(12) Tensor object will promote all math operations to tensor operations, and as such all return values will be tensors.
Most TensorFlow users are interested in automatic differentiation. Because different operations can occur during each call, we record all forward operations to a tape, which is then played backwards when computing gradients. After we've computed the gradients, we discard the tape.
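To give a feel for the record-then-play-backwards idea, here is a toy tape in plain Python. It is only a sketch of the general technique (reverse-mode differentiation over a recorded list of operations), not TensorFlow's implementation; the Tape and Var classes and the mul, add, and gradient helpers are all made up for this illustration.

```python
class Var:
    """A scalar value with a unique id so the tape can refer to it."""
    _next_id = 0

    def __init__(self, value):
        self.value = value
        self.id = Var._next_id
        Var._next_id += 1


class Tape:
    """Records each forward operation in the order it ran."""

    def __init__(self):
        # Each entry: (output_id, [(input_id, local_gradient), ...])
        self.entries = []


def mul(tape, a, b):
    out = Var(a.value * b.value)
    tape.entries.append((out.id, [(a.id, b.value), (b.id, a.value)]))
    return out


def add(tape, a, b):
    out = Var(a.value + b.value)
    tape.entries.append((out.id, [(a.id, 1.0), (b.id, 1.0)]))
    return out


def gradient(tape, output, wrt):
    """Play the tape backwards, accumulating d(output)/d(each variable)."""
    grads = {output.id: 1.0}
    for out_id, inputs in reversed(tape.entries):
        upstream = grads.get(out_id, 0.0)
        for in_id, local in inputs:
            grads[in_id] = grads.get(in_id, 0.0) + upstream * local
    return grads.get(wrt.id, 0.0)


tape = Tape()
x = Var(3.0)
y = mul(tape, x, x)            # y = x * x
print(y.value)                 # 9.0
print(gradient(tape, y, x))    # 6.0
```

Because the tape is just a list built during the forward pass, a different branch taken by Python control flow simply produces a different list, and the backward pass follows whatever was recorded.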
If you're familiar with the autograd package, the API is very similar. For example:
def square(x):
  return tf.multiply(x, x)

grad = tfe.gradients_function(square)

print(square(3.))  # [9.]
print(grad(3.))    # [6.]
The gradients_function call takes a Python function square() as an argument and returns a Python callable that computes the partial derivatives of square() with respect to its inputs. So, to get the derivative of square() at 3.0, invoke grad(3.0), which is 6.
The same gradients_function call can be used to get the second derivative of square:
gradgrad = tfe.gradients_function(lambda x: grad(x)[0])

print(gradgrad(3.))  # [2.]
As we noted, control flow can cause different operations to run, such as in this example.
def abs(x):
  return x if x > 0. else -x

grad = tfe.gradients_function(abs)

print(grad(2.0))   # [1.]
print(grad(-2.0))  # [-1.]
Users may want to define custom gradients for an operation, or for a function. This may be useful for multiple reasons, including providing a more efficient or more numerically stable gradient for a sequence of operations.
Here is an example that illustrates the use of custom gradients. Let's start by looking at the function log(1 + e^x), which commonly occurs in the computation of cross entropy and log likelihoods.
def log1pexp(x):
  return tf.log(1 + tf.exp(x))

grad_log1pexp = tfe.gradients_function(log1pexp)

# The gradient computation works fine at x = 0.
print(grad_log1pexp(0.))
# [0.5]
# However it returns a `nan` at x = 100 due to numerical instability.
print(grad_log1pexp(100.))
# [nan]
We can use a custom gradient for the above function that analytically simplifies the gradient expression. Notice how the gradient function implementation below reuses an expression (tf.exp(x)) that was computed during the forward pass, making the gradient computation more efficient by avoiding redundant computation.
@tfe.custom_gradient
def log1pexp(x):
  e = tf.exp(x)
  def grad(dy):
    return dy * (1 - 1 / (1 + e))
  return tf.log(1 + e), grad

grad_log1pexp = tfe.gradients_function(log1pexp)

# Gradient at x = 0 works as before.
print(grad_log1pexp(0.))
# [0.5]
# And now gradient computation at x=100 works as well.
print(grad_log1pexp(100.))
# [1.0]
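Why does the simplified expression survive where the mechanical chain rule does not? The plain-Python sketch below reproduces the effect with ordinary floats. Two assumptions to keep in mind: the helper exp_ieee is made up to return inf on overflow (the way tensor math does, instead of raising like math.exp), and we use x = 1000 rather than 100 because Python floats are 64-bit and only overflow at much larger inputs than 32-bit floats.

```python
import math


def exp_ieee(x):
    """Like math.exp, but returns inf on overflow instead of raising."""
    try:
        return math.exp(x)
    except OverflowError:
        return math.inf


def naive_grad(x):
    # Chain rule applied mechanically to log(1 + e^x): (1 / (1 + e^x)) * e^x
    e = exp_ieee(x)
    return (1.0 / (1.0 + e)) * e


def simplified_grad(x):
    # Algebraically equal form used by the custom gradient: 1 - 1 / (1 + e^x)
    e = exp_ieee(x)
    return 1.0 - 1.0 / (1.0 + e)


print(naive_grad(0.0), simplified_grad(0.0))        # 0.5 0.5
print(naive_grad(1000.0), simplified_grad(1000.0))  # nan 1.0
```

Once e^x overflows to inf, the naive form computes 0 * inf, which is nan, while the simplified form computes 1 - 0, which is the correct limit.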
Models can be organized in classes. Here's a model class that creates a (simple) two-layer network that can classify the standard MNIST handwritten digits.
class MNISTModel(tfe.Network):
  def __init__(self):
    super(MNISTModel, self).__init__()
    self.layer1 = self.track_layer(tf.layers.Dense(units=10))
    self.layer2 = self.track_layer(tf.layers.Dense(units=10))

  def call(self, input):
    """Actually runs the model."""
    result = self.layer1(input)
    result = self.layer2(result)
    return result
We recommend using the classes (not the functions) in tf.layers since they create and contain model parameters (variables). Variable lifetimes are tied to the lifetime of the layer objects, so be sure to keep track of them.
Why are we using tfe.Network? A Network is a container for layers and is a tf.layers.Layer itself, allowing Network objects to be embedded in other Network objects. It also contains utilities to assist with inspection, saving, and restoring.
Even without training the model, we can imperatively call it and inspect the output:
# Let's make up a blank input image
model = MNISTModel()
batch = tf.zeros([1, 1, 784])
print(batch.shape)
# (1, 1, 784)
result = model(batch)
print(result)
# tf.Tensor([[[ 0.  0., ...., 0.]]], shape=(1, 1, 10), dtype=float32)
Note that we do not need any placeholders or sessions. The first time we pass in the input, the sizes of the layers' parameters are set.
To train any model, we define a loss function to optimize, calculate gradients, and use an optimizer to update the variables. First, here's a loss function:
def loss_function(model, x, y):
  y_ = model(x)
  return tf.nn.softmax_cross_entropy_with_logits(labels=y, logits=y_)
And then, our training loop:
optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.001)
for (x, y) in tfe.Iterator(dataset):
  grads = tfe.implicit_gradients(loss_function)(model, x, y)
  optimizer.apply_gradients(grads)
implicit_gradients() calculates the derivatives of loss_function with respect to all the TensorFlow variables used during its computation.
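If the loss, gradient, update cycle is new to you, here is the same loop shape stripped down to plain Python with a single parameter. Everything in it is an assumption for illustration: the toy data follows y = 3 * x, the gradient is written out by hand rather than computed automatically, and the update line stands in for what a gradient descent optimizer does.

```python
# Toy data generated from y = 3 * x.
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]


def loss(w, x, y):
    """Squared error of the one-parameter model y_hat = w * x."""
    return (w * x - y) ** 2


def grad_loss(w, x, y):
    """Hand-derived gradient: d/dw (w*x - y)^2 = 2 * (w*x - y) * x."""
    return 2.0 * (w * x - y) * x


w = 0.0
learning_rate = 0.01
for epoch in range(200):
    for x, y in data:
        # The optimizer step: move the variable against its gradient.
        w -= learning_rate * grad_loss(w, x, y)

print(round(w, 3))  # 3.0
```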
We can move computation to a GPU the same way we've always done with TensorFlow:
with tf.device("/gpu:0"):
  for (x, y) in tfe.Iterator(dataset):
    optimizer.minimize(lambda: loss_function(model, x, y))
(Note: We skip storing the loss and call optimizer.minimize directly, but you could also use the apply_gradients() method shown above; the two are equivalent.)
Eager execution makes development and debugging far more interactive, but TensorFlow graphs have a lot of advantages with respect to distributed training, performance optimizations, and production deployment.
The same code that executes operations when eager execution is enabled will construct a graph describing the computation when it is not. To convert your models to graphs, simply run the same code in a new Python session where eager execution hasn't been enabled, as seen, for example, in the MNIST example. The value of model variables can be saved and restored from checkpoints, allowing us to move between eager (imperative) and graph (declarative) programming easily. With this, models developed with eager execution enabled can be easily exported for production deployment.
In the near future, we will provide utilities to selectively convert portions of your model to graphs. In this way, you can fuse parts of your computation (such as internals of a custom RNN cell) for high performance, but also keep the flexibility and readability of eager execution.
Using eager execution should be intuitive to current TensorFlow users. There are only a handful of eager-specific APIs; most of the existing APIs and operations work with eager enabled. Some notes to keep in mind:
tf.layer.Conv2D()
tfe.enable_eager_execution()
This is still a preview release, so you may hit some rough edges. To get started today:
There's a lot more to talk about with eager execution and we're excited… or, rather, we're eager for you to try it today! Feedback is absolutely welcome.
Coca-Cola's core loyalty program launched in 2006 as MyCokeRewards.com. The "MCR.com" platform included the creation of unique product codes for every Coca-Cola, Sprite, Fanta, and Powerade product sold in 20oz bottles and cardboard "fridge-packs" purchasable at grocery stores and other retail outlets. Users could enter these product codes at MyCokeRewards.com to participate in promotional campaigns.
Fast-forward to 2016: Coke's loyalty programs are still hugely popular with millions of product codes having been entered for promotions and sweepstakes. However, mobile browsing went from non-existent in 2006 to over 50% share by the end of 2016. The launch of Coke.com as a mobile-first web experience (replacing MCR.com) was a response to these changes in browsing behavior. Thumb-entering 14-character codes into a mobile device could be a difficult enough user experience to impact the success of our programs. We want to provide our mobile audience the best possible experience, and recent advances in artificial intelligence opened new opportunities.
For years Coke attempted to use off-the-shelf optical character recognition (OCR) libraries and services to read product codes with little success. Our printing process typically uses low-resolution dot-matrix fonts with the cap or fridge-pack media running under the printhead at very high speeds. All of this translates into a low-fidelity string of characters that defeats off-the-shelf OCR offerings (and can sometimes be hard to read with the human eye as well). OCR is critical to simplifying the code-entry process for mobile users: they should be able to take a picture of a code and automatically have the purchase registered for a promotional entry. We needed a purpose-built OCR system to recognize our product codes.
Our research led us to a promising solution: Convolutional Neural Networks. CNNs are one of a family of "deep learning" neural networks that are at the heart of modern artificial intelligence products. Google has used CNNs to extract street address numbers from StreetView images. CNNs also perform remarkably well at recognizing handwritten digits. These number-recognition use-cases were a perfect proxy for the type of problem we were trying to solve: extracting strings from images that contain small character sets with lots of variance in the appearance of the characters.
In the past, developing deep neural networks like CNNs has been a challenge because of the complexity of available training and inference libraries. TensorFlow, a machine learning framework that was open sourced by Google in November 2015, is designed to simplify the development of deep neural networks.
TensorFlow provides high-level interfaces to different kinds of neuron layers and popular loss functions, which makes it easier to implement different CNN model architectures. The ability to rapidly iterate over different model architectures dramatically reduced the time required to build Coke's custom OCR solution because different models could be developed, trained, and tested in a matter of days. TensorFlow models are also portable: the framework supports model execution natively on mobile devices ("AI on the edge") or in servers hosted remotely in the cloud. This enables a "create once, run anywhere" approach for model execution across many different platforms, including web-based and mobile.
Any neural network is only as good as the data used to train it. We knew that we needed a large set of labeled product-code images to train a CNN that would achieve our performance goals. Our training set would be built in three phases:
The pre-launch training phase began by programmatically generating millions of simulated product-code images. These simulated images included variations in tilt, lighting, shadows, and blurriness. The prediction accuracy (i.e. how often all 14 characters were correctly predicted within the top-10 predictions) was at 50% against real-world images when the model was trained using only simulated images. This provided a baseline for transfer-learning: a model initially trained with simulated images was the foundation for a more accurate model that would be trained against real-world images.
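The accuracy measure described above (the true code appearing anywhere in the top-10 predictions) can be sketched in a few lines of Python. The function name top_k_accuracy and the sample codes are made up for illustration; they are not taken from the production pipeline.

```python
def top_k_accuracy(ranked_predictions, true_codes, k=10):
    """Fraction of examples whose true code appears among the first k guesses."""
    hits = sum(
        1
        for ranked, truth in zip(ranked_predictions, true_codes)
        if truth in ranked[:k]
    )
    return hits / len(true_codes)


# Made-up data: two examples, each with a ranked list of candidate codes.
ranked_predictions = [
    ["ABC123", "ABC128", "A8C123"],  # the true code is ranked first
    ["XYZ789", "XY2789", "XYZ7B9"],  # the true code is missing
]
true_codes = ["ABC123", "XYZ788"]

print(top_k_accuracy(ranked_predictions, true_codes, k=10))  # 0.5
```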
The challenge now turned to enriching the simulated images with enough real-world images to hit our performance goals. We created a purpose-built training app for iOS and Android devices that "trainers" could use to take pictures of codes and label them; these labeled images were then transferred to cloud storage for training. We did a production run of several thousand product codes on bottle caps and fridge-packs and distributed these to multiple suppliers who used the app to create the initial real-world training set.
Even with an augmented and enriched training set, there is no substitute for images created by end-users in a variety of environmental conditions. We knew that scans would sometimes result in an inaccurate code prediction, so we needed to provide a user-experience that would allow users to quickly correct these predictions. Two components are essential to delivering this experience: a product-code validation service that has been in use since the launch of our original loyalty platform in 2006 (to verify that a predicted code is an actual code) and a prediction algorithm that performs a regression to determine a per-character confidence at each one of the 14 character positions. If a predicted code is invalid, the top prediction as well as the confidence levels for each character are returned to the user interface. Low-confidence characters are visually highlighted to guide the user to update characters that need attention.
This user interface innovation enables an active learning process: a feedback loop allows the model to gradually improve by returning corrected predictions to the training pipeline. In this way, our users organically improve the accuracy of the character recognition model over time.
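The low-confidence highlighting described above might look roughly like the following Python sketch. It is a simplification under stated assumptions: we treat the highest per-position probability as the confidence (the post describes a regression for this step), and the 0.8 threshold, the function names, and the three-character example are all hypothetical.

```python
CONFIDENCE_THRESHOLD = 0.8  # hypothetical cut-off, not a value from the post


def decode_with_confidence(position_probs):
    """Pick the most likely character at each position.

    position_probs holds one dict per character position, mapping each
    candidate character to its predicted probability.
    """
    chars, confidences = [], []
    for probs in position_probs:
        best_char = max(probs, key=probs.get)
        chars.append(best_char)
        confidences.append(probs[best_char])
    return "".join(chars), confidences


def low_confidence_positions(confidences, threshold=CONFIDENCE_THRESHOLD):
    """Return the positions the user interface should highlight."""
    return [i for i, c in enumerate(confidences) if c < threshold]


# A made-up three-character prediction (real codes have 14 positions).
example = [
    {"A": 0.97, "4": 0.03},
    {"B": 0.55, "8": 0.45},
    {"7": 0.91, "1": 0.09},
]
code, conf = decode_with_confidence(example)
print(code)                            # AB7
print(low_confidence_positions(conf))  # [1]
```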
To meet user expectations around performance, we established a few ambitious requirements for the product-code OCR pipeline:
We initially explored an architecture that used a single CNN for all product-code media. This approach created a model that was too large to be distributed to mobile apps and the execution time was longer than desired. Our applied-AI partners at Quantiphi, Inc. began iterating on different model architectures, eventually landing on one that used multiple CNNs.
This new architecture reduced the model size dramatically without sacrificing accuracy, but it was still on the high end of what we needed in order to support over-the-air updates to mobile apps. We next used TensorFlow's prebuilt quantization module to reduce the model size by reducing the fidelity of the weights between connected neurons. Quantization reduced the model size by a factor of 4, but a dramatic reduction in model size occurred when Quantiphi had a breakthrough using a new approach called SqueezeNet.
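To see where a factor of 4 can come from, consider storing each weight as an 8-bit code instead of a 32-bit float. The sketch below uses simple min/max linear quantization in plain Python; that choice is our assumption for illustration and is not necessarily what TensorFlow's quantization module does. The quantize and dequantize helpers and the sample weights are made up.

```python
from array import array


def quantize(weights):
    """Map float weights onto 256 evenly spaced levels between min and max."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 255.0 or 1.0  # fall back to 1.0 if all weights are equal
    codes = array("B", (min(255, round((w - lo) / scale)) for w in weights))
    return codes, lo, scale


def dequantize(codes, lo, scale):
    """Recover approximate float weights from the 8-bit codes."""
    return [lo + c * scale for c in codes]


weights = array("f", [-0.5, -0.1, 0.0, 0.25, 0.5, 1.0])
codes, lo, scale = quantize(weights)

print(weights.itemsize * len(weights))  # 24 bytes as 32-bit floats
print(codes.itemsize * len(codes))      # 6 bytes as 8-bit codes

restored = dequantize(codes, lo, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))
print(worst_error <= scale / 2 + 1e-9)  # True
```

The price of the smaller size is a rounding error of at most half a quantization step per weight, which is the reduced fidelity mentioned above.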
The SqueezeNet model was published by a team of researchers from UC Berkeley and Stanford in November of 2016. It uses a small but highly complex design to achieve accuracy levels on par with much larger models against popular benchmarks such as ImageNet. After re-architecting our character recognition models to use a SqueezeNet CNN, Quantiphi was able to reduce the model size of certain media types by a factor of 100. Since the SqueezeNet model was inherently smaller, a richer feature detection architecture could be constructed, achieving much higher accuracy at much smaller sizes compared to our first batch of models trained without SqueezeNet. We now have a highly accurate model that can be easily updated on remote devices; the recognition success rate of our final model before active learning was close to 96%, which translates into a 99.7% character recognition accuracy (just 3 misses for every 1000 character predictions).
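The relationship between the two accuracy figures is easy to check. The short Python sketch below assumes that errors at the 14 character positions are independent, which is a simplification we introduce for the arithmetic rather than something stated about the model.

```python
CODE_LENGTH = 14  # characters per product code

# From per-character accuracy to whole-code accuracy.
per_character_accuracy = 0.997
whole_code_accuracy = per_character_accuracy ** CODE_LENGTH
print(round(whole_code_accuracy, 3))  # 0.959

# And back: the per-character accuracy implied by a 96% whole-code rate.
implied_per_character = 0.96 ** (1.0 / CODE_LENGTH)
print(round(implied_per_character, 3))  # 0.997
```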
Advances in artificial intelligence and the maturity of TensorFlow enabled us to finally achieve a long-sought proof-of-purchase capability. Since launching in late February 2017, our product code recognition platform has fueled more than a dozen promotions and resulted in over 180,000 scanned codes; it is now a core component for all of Coca-Cola North America's web-based promotions.
Moving to an AI-enabled product-code recognition platform has been valuable for two key reasons:
Our product-code recognition platform is the first execution of new AI-enabled capabilities at scale within Coca-Cola. We're now exploring AI applications across multiple lines of business, from new product development to ecommerce retail optimization.