Posted by Shanqing Cai, Software
Engineer, Tools and Infrastructure.
We are excited to share TensorFlow Debugger (tfdbg), a tool that makes
debugging machine learning (ML) models in TensorFlow easier.
TensorFlow, Google's open-source ML library, is based on dataflow graphs. A
typical TensorFlow ML program consists of two separate stages:
1. Setting up the ML model as a dataflow graph by using the library's Python API.
2. Training or performing inference on the graph by using the Session.run()
   method.
If errors and bugs occur during the second stage (i.e., the TensorFlow runtime),
they are difficult to debug.
To understand why that is the case, note that to standard Python debuggers, the
Session.run() call is effectively a single statement and does not
expose the running graph's internal structure (nodes and their connections) and
state (output arrays or tensors of the nodes). Lower-level debuggers
such as gdb cannot organize stack
frames and variable values in a way relevant to TensorFlow graph operations. A
specialized runtime debugger has been among the most frequently raised feature
requests from TensorFlow users.
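To make the opacity problem concrete, here is a minimal sketch of a hand-rolled dataflow graph in plain Python. It is a toy model, not TensorFlow's runtime: the Node class, the run() function and its trace argument are all hypothetical names invented for this illustration. It shows that intermediate node outputs never appear as Python variables unless the code executing the graph records them.

```python
# Toy illustration only: a hand-rolled dataflow graph, not TensorFlow's runtime.
# Node, run() and the `trace` argument are hypothetical names for this sketch.


class Node:
    """A graph node: an operation applied to the outputs of its input nodes."""

    def __init__(self, name, op, inputs=()):
        self.name = name
        self.op = op  # None marks a placeholder whose value is fed at run time
        self.inputs = list(inputs)


def run(fetch, feed, trace=None):
    """Evaluate `fetch` in one call, loosely mimicking an opaque run() call.

    `feed` maps placeholder names to values. If `trace` is a list, every
    (node name, output) pair is appended to it in execution order.
    """
    cache = {}

    def evaluate(node):
        if node.name in cache:
            return cache[node.name]
        if node.op is None:
            value = feed[node.name]
        else:
            value = node.op(*[evaluate(i) for i in node.inputs])
        cache[node.name] = value
        if trace is not None:
            trace.append((node.name, value))
        return value

    return evaluate(fetch)


# Stage 1: build the graph. Nothing is computed here.
x = Node("x", None)
k = Node("k", lambda: 3.0)
y_hat = Node("y_hat", lambda a, b: a * b, [k, x])
err = Node("err", lambda a: a - 12.0, [y_hat])

# Stage 2: run it. A Python debugger stepping over this line sees one call
# and one result; the value of y_hat is never bound to a Python variable.
result = run(err, {"x": 2.0})

# A graph-aware tool has to hook into the run itself to see intermediates.
trace = []
run(err, {"x": 2.0}, trace=trace)
for name, value in trace:
    print(name, value)
```

The second run() call is the part a specialized debugger has to provide: visibility has to come from inside the code that executes the graph, not from the Python statement that triggers it.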
tfdbg addresses this runtime debugging need. Let's see tfdbg in action with a
short snippet of code that sets up and runs a simple TensorFlow graph to fit a
simple linear equation through gradient descent.
import numpy as np
import tensorflow as tf
import tensorflow.python.debug as tf_debug
xs = np.linspace(-0.5, 0.49, 100)
x = tf.placeholder(tf.float32, shape=[None], name="x")
y = tf.placeholder(tf.float32, shape=[None], name="y")
k = tf.Variable([0.0], name="k")
y_hat = tf.multiply(k, x, name="y_hat")
sse = tf.reduce_sum((y - y_hat) * (y - y_hat), name="sse")
train_op = tf.train.GradientDescentOptimizer(learning_rate=0.02).minimize(sse)
sess = tf.Session()
sess.run(tf.global_variables_initializer())
# Wrap the session so that each run() call opens the tfdbg CLI.
sess = tf_debug.LocalCLIDebugWrapperSession(sess)
for _ in range(10):
    sess.run(train_op, feed_dict={x: xs, y: 42 * xs})
As the line that calls tf_debug.LocalCLIDebugWrapperSession in this example
shows, the session object is wrapped in a class for debugging
(LocalCLIDebugWrapperSession), so calling the run() method launches the
command-line interface (CLI) of tfdbg. Using mouse clicks or commands, you can
proceed through the successive run calls, inspect the graph's nodes and their
attributes, and visualize the complete history of the execution of all relevant
nodes in the graph through the list of intermediate tensors. By using the
invoke_stepper command, you can let the Session.run() call execute in the
"stepper mode", in which you can step to nodes of your choice, observe and
modify their outputs, and then take further stepping actions, in a way
analogous to debugging procedural languages (e.g., in gdb or pdb).
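When you inspect k, y_hat or sse in the debugger, it helps to know roughly what values to expect. The sketch below restates the arithmetic of the example graph in plain Python, without TensorFlow. It assumes that the optimizer applies the textbook gradient descent update to k and that the 100 sample points are spaced 0.01 apart; the history list is a name introduced here for illustration.

```python
# Plain-Python restatement of the example graph's arithmetic (no TensorFlow).
# Assumption: the optimizer applies the textbook update k -= lr * d(sse)/dk.

# Approximately the same grid as np.linspace(-0.5, 0.49, 100).
xs = [-0.5 + 0.01 * i for i in range(100)]
ys = [42.0 * v for v in xs]  # target line y = 42 * x

k = 0.0
learning_rate = 0.02
history = []  # (k, sse) as seen at the start of each run

for _ in range(10):
    residuals = [y - k * x for x, y in zip(xs, ys)]
    sse = sum(r * r for r in residuals)
    history.append((k, sse))
    # d(sse)/dk = -2 * sum(x * (y - k * x))
    grad = -2.0 * sum(x * r for x, r in zip(xs, residuals))
    k -= learning_rate * grad

for step, (k_before, sse_before) in enumerate(history):
    print("run %2d: k = %8.4f  sse = %12.4f" % (step, k_before, sse_before))
print("final k = %.4f" % k)
```

Under these assumptions, k moves steadily from 0 toward 42 and sse shrinks on every run, which is the pattern you would expect to see when printing the corresponding tensors between run calls.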
A frequently encountered class of issues in developing TensorFlow ML models is
the appearance of bad numerical values (infinities and NaNs) due to overflow,
division by zero, log of zero, etc. In large TensorFlow graphs, finding the
source of such values can be tedious and time-consuming. With the help of the
tfdbg CLI and its conditional breakpoint support, you can quickly identify the
culprit node. The video below demonstrates how to debug infinity/NaN issues in
a neural network with tfdbg:
A screencast of the TensorFlow Debugger in action, from this tutorial.
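The idea behind such a conditional breakpoint can be sketched in a few lines of plain Python: a predicate is applied to every intermediate output in execution order, and the first node that satisfies it is the culprit. The functions and the dump structure below are hypothetical stand-ins written for this illustration, not tfdbg's implementation. In tfdbg itself, as described in the tutorial, a predicate of this kind (tf_debug.has_inf_or_nan) is registered on the wrapped session with add_tensor_filter and then used from the CLI with run -f has_inf_or_nan.

```python
import math

# Illustrative sketch only: these helpers and the `dump` list are hypothetical
# stand-ins, not part of tfdbg.


def has_inf_or_nan(values):
    """Return True if any element of a flat list of floats is inf or NaN."""
    return any(math.isinf(v) or math.isnan(v) for v in values)


def first_bad_node(dump):
    """Return the first node, in execution order, whose output is bad.

    `dump` is a list of (node_name, output_values) pairs recorded during one
    run. Returns None if every output is finite.
    """
    for node_name, values in dump:
        if has_inf_or_nan(values):
            return node_name
    return None


# Hypothetical dump: the log of a zero probability is modeled as -inf, and
# multiplying 0 by -inf then produces NaN in the nodes downstream.
probs = [0.5, 0.0]
log_probs = [math.log(p) if p > 0 else -math.inf for p in probs]
products = [p * lp for p, lp in zip(probs, log_probs)]
dump = [
    ("softmax", probs),
    ("log", log_probs),
    ("mul", products),
    ("loss", [-sum(products)]),
]

culprit = first_bad_node(dump)
print("first node with inf/NaN:", culprit)
```

Scanning in execution order matters: the final loss is also NaN, but the earliest offending node is the one that points at the root cause.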
Compared with alternative debugging options such as Print Ops, tfdbg requires
fewer code changes, provides more comprehensive coverage of the graphs, and
offers a more interactive debugging experience. It will speed up your model
development and debugging workflows. It also offers additional features such as
offline debugging of dumped tensors from server environments and integration
with tf.contrib.learn.
To get started, please visit this documentation.
This research paper
lays out the design of tfdbg in greater
detail.
The minimum required TensorFlow version for tfdbg is 0.12.1. To report bugs,
please open issues on TensorFlow's GitHub Issues Page. For general usage help,
please post questions on StackOverflow using the tag tensorflow.
Acknowledgements
This project would not have been possible without the help and feedback from members of
the Google TensorFlow Core/API Team and the Applied Machine Intelligence Team.