Debug TensorFlow Models with tfdbg

FEB. 17, 2017

Posted by Shanqing Cai, Software Engineer, Tools and Infrastructure.

We are excited to share TensorFlow Debugger (tfdbg), a tool that makes debugging of machine learning models (ML) in TensorFlow easier.
TensorFlow, Google's open-source ML library, is based on dataflow graphs. A typical TensorFlow ML program consists of two separate stages:

Setting up the ML model as a dataflow graph by using the library's Python API,
Training or performing inference on the graph by using the Session.run() method.

If errors and bugs occur during the second stage (i.e., the TensorFlow runtime), they are difficult to debug.

To understand why that is the case, note that to standard Python debuggers, the Session.run() call is effectively a single statement and does not exposes the running graph's internal structure (nodes and their connections) and state (output arrays or tensors of the nodes). Lower-level debuggers such as gdb cannot organize stack frames and variable values in a way relevant to TensorFlow graph operations. A specialized runtime debugger has been among the most frequently raised feature requests from TensorFlow users.

tfdbg addresses this runtime debugging need. Let's see tfdbg in action with a short snippet of code that sets up and runs a simple TensorFlow graph to fit a simple linear equation through gradient descent.

import numpy as np
      import tensorflow as tf
      import tensorflow.python.debug as tf_debug
      xs = np.linspace(-0.5, 0.49, 100)
      x = tf.placeholder(tf.float32, shape=[None], name="x")
      y = tf.placeholder(tf.float32, shape=[None], name="y")
      k = tf.Variable([0.0], name="k")
      y_hat = tf.multiply(k, x, name="y_hat")
      sse = tf.reduce_sum((y - y_hat) * (y - y_hat), name="sse")
      train_op = tf.train.GradientDescentOptimizer(learning_rate=0.02).minimize(sse)

      sess = tf.Session()
      sess.run(tf.global_variables_initializer())

      sess = tf_debug.LocalCLIDebugWrapperSession(sess)
      for _ in range(10):
      sess.run(train_op, feed_dict={x: xs, y: 42 * xs})

As the highlighted line in this example shows, the session object is wrapped as a class for debugging (LocalCLIDebugWrapperSession), so the calling the run() method will launch the command-line interface (CLI) of tfdbg. Using mouse clicks or commands, you can proceed through the successive run calls, inspect the graph's nodes and their attributes, visualize the complete history of the execution of all relevant nodes in the graph through the list of intermediate tensors. By using the invoke_stepper command, you can let the Session.run() call execute in the "stepper mode", in which you can step to nodes of your choice, observe and modify their outputs, followed by further stepping actions, in a way analogous to debugging procedural languages (e.g., in gdb or pdb).

A class of frequently encountered issue in developing TensorFlow ML models is the appearance of bad numerical values (infinities and NaNs) due to overflow, division by zero, log of zero, etc. In large TensorFlow graphs, finding the source of such nodes can be tedious and time-consuming. With the help of tfdbg CLI and its conditional breakpoint support, you can quickly identify the culprit node. The video below demonstrates how to debug infinity/NaN issues in a neural network with tfdbg:

A screencast of the TensorFlow Debugger in action, from this tutorial.

Compared with alternative debugging options such as Print Ops, tfdbg requires fewer lines of code change, provides more comprehensive coverage of the graphs, and offers a more interactive debugging experience. It will speed up your model development and debugging workflows. It offers additional features such as offline debugging of dumped tensors from server environments and integration with tf.contrib.learn. To get started, please visit this documentation. This research paper lays out the design of tfdbg in greater detail.

The minimum required TensorFlow version for tfdbg is 0.12.1. To report bugs, please open issues on TensorFlow's GitHub Issues Page. For general usage help, please post questions on StackOverflow using the tag tensorflow.

Acknowledgements