Video-Touch: Multi-User Remote Robot Control in Google Meet call by DNN-based Gesture Recognition

AUG 19, 2021

A guest post by the Engineering team at Video-Touch

Please note that the information, uses, and applications expressed in the below post are solely those of our guest author, Video-Touch.

A guest post by Video-Touch

You may have watched some science fiction movies where people could control robots with the movements of their bodies. Modern computer vision and robotics approaches allow us to make such an experience real, but no less exciting and incredible.

Inspired by the idea to make remote control and teleoperation practical and accessible during such a hard period of coronavirus, we came up with a VideoTouch project.

Video-Touch is the first robot-human interaction system that allows multi-user control via video calls application (e.g. Google Meet, Zoom, Skype) from anywhere in the world.

The Video-Touch system in action.

Figure 1: The Video-Touch system in action: single user controls a robot during a Video-Touch call. Note the sensors’ feedback when grasping a tube [source video].

We were wondering if it is even possible to control a robot remotely using only your own hands - without any additional devices like gloves or a joystick - not suffering from a significant delay. We decided to use computer vision to recognize movements in real-time and instantly pass them to the robot. Thanks to MediaPipe now it is possible.

Our system looks as follows:

  1. Video conference application gets a webcam video on the user device and sends it to the robot computer (“server”);
  2. User webcam video stream is being captured on the robot's computer display via OBS virtual camera tool;
  3. The recognition module reads user movements and gestures with the help of MediaPipe and sends it to the next module via ZeroMQ;
  4. The robotic arm and its gripper are being controlled from Python, given the motion capture data.

Figure 2: Overall scheme of the Video-Touch system: from users webcam to the robot control module [source video].

As it clearly follows from the scheme, all the user needs to operate a robot is a stable internet connection and a video conferencing app. All the computation, such as screen capture, hand tracking, gesture recognition, and robot control, is being carried on a separate device (just another laptop) connected to the robot via Wi-Fi. Next, we describe each part of the pipeline in detail.

Video stream and screen capture

One can use any software that sends a video from one computer to another. In our experiments, we used the video conference desktop application. A user calls from its device to a computer with a display connected to the robot. Thus it can see the video stream from the user's webcam.

Now we need some mechanism to pass the user's video from the video conference to the Recognition module. We use Open Broadcaster Software (OBS) and its virtual camera tool to capture the open video conference window. We get a virtual camera that now has frames from the users' webcam and its own unique device index that can be further used in the Recognition module.

Recognition module

The role of the Recognition module is to capture a users' movements and pass them to the Robot control module. Here is where the MediaPipe comes in. We searched for the most efficient and precise computer vision software for hand motion capture. We found many exciting solutions, but MediaPipe turned out to be the only suitable tool for such a challenging task - real-time on-device fine-grained hand movement recognition.

We made two key modifications to the MediaPipe Hand Tracking module: added gesture recognition calculator and integrated ZeroMQ message passing mechanism.

At the moment of our previous publication we had two versions of the gesture recognition implementation. The first version is depicted in Figure 3 below and does all the computation inside the Hand Gesture Recognition calculator. The calculator has scaled landmarks as its input, i.e. these landmarks are normalized on the size of the hand bounding box, not on the whole image size. Next it recognizes one of 4 gestures (see also Figure 4): “move”, “angle”, “grab” and “no gesture” (“finger distance” gesture from the paper was an experimental one and was not included in the final demo) and outputs the gesture class name. Despite this version being quite robust and useful, it is based only on simple heuristic rules like: “if this landmark[i].x < landmark[j].x then it is a `move` gesture”, and is failing for some real-life cases like hand rotation.

Modified MediaPipe Hand Landmark CPU subgraph.

Figure 3: Modified MediaPipe Hand Landmark CPU subgraph. Note the HandGestureRecognition calculator

To alleviate the problem of bad generalization, we implemented the second version. We trained the Gradient Boosting classifier from scikit-learn on a manually collected and labeled dataset of 1000 keypoints: 200 per “move”, “angle” and “grab” classes, and 400 for “no gesture” class. By the way, today such a dataset could be easily obtained using the recently released Jesture AI SDK repo (note: another project of some of our team members).

We used scaled landmarks, angles between joints, and pairwise landmark distances as an input to the model to predict the gesture class. Next, we tried to pass only the scaled landmarks without any angles and distances, and it resulted in similar multi-class accuracy of 91% on a local validation set of 200 keypoints. One more point about this version of gesture classifier is that we were not able to run the scikit-learn model from C++ directly, so we implemented it in Python as a part of the Robot control module.

Figure 4: Gesture classes recognized by our model (“no gesture” class is not shown).

Figure 4: Gesture classes recognized by our model (“no gesture” class is not shown).

Right after the publication, we came up with a fully-connected neural network trained in Keras on just the same dataset as the Gradient Boosting model, and it gave an even better result of 93%. We converted this model to the TensorFlow Lite format, and now we are able to run the gesture recognition ML model right inside the Hand Gesture Recognition calculator.

Figure 5: Fully-connected network for gesture recognition converted to TFLite model

Figure 5: Fully-connected network for gesture recognition converted to TFLite model format.

When we get the current hand location and current gesture class, we need to pass it to the Robot control module. We do this with the help of the high-performance asynchronous messaging library ZeroMQ. To implement this in C++, we used the libzmq library and the cppzmq headers. We utilized the request-reply scenario: REP (server) in C++ code of the Recognition module and REQ (client) in Python code of the Robot control module.

So using the hand tracking module with our modifications, we are now able to pass the motion capture information to the robot in real-time.

Robot control module

A robot control module is a Python script that takes hand landmarks and gesture class as its input and outputs a robot movement command (on each frame). The script runs on a computer connected to the robot via Wi-Fi. In our experiments we used MSI laptop with the Nvidia GTX 1050 Ti GPU. We tried to run the whole system on Intel Core i7 CPU and it was also real-time with a negligible delay, thanks to the highly optimized MediaPipe compute graph implementation.

We use the 6DoF UR10 robot by Universal Robotics in our current pipeline. Since the gripper we are using is a two-finger one, we do not need a complete mapping of each landmark to the robots’ finger keypoint, but only the location of the hands’ center. Using this center coordinates and python-urx package, we are now able to change the robots’ velocity in a desired direction and orientation: on each frame, we calculate the difference between the current hand center coordinate and the one from the previous frame, which gives us a velocity change vector or angle. Finally, all this mechanism looks very similar to how one would control a robot with a joystick.

Hand-robot control logic follows the idea of a joystick with pre-defined movement

Figure 6: Hand-robot control logic follows the idea of a joystick with pre-defined movement directions [source video].

Tactile perception with high-density tactile sensors

Dexterous manipulation requires a high spatial resolution and high-fidelity tactile perception of objects and environments. The newest sensor arrays are well suited for robotic manipulation as they can be easily attached to any robotic end effector and adapted at any contact surface.

High fidelity tactile sensor array

Figure 7: High fidelity tactile sensor array: a) Array placement on the gripper. b) Sensor data when the gripper takes a pipette. c) Sensor data when the gripper takes a tube [source publication].

Video-Touch is embedded with a kind of high-density tactile sensor array. They are installed in the two-fingers robotic gripper. One sensor array is attached to each fingertip. A single electrode array can sense a frame area of 5.8 [cm2] with a resolution of 100 points per frame. The sensing frequency equals 120 [Hz]. The range of force detection per point is of 1-9 [N]. Thus, the robot detects the pressure applied to solid or flexible objects grasped by the robotic fingers with a resolution of 200 points (100 points per finger).

The data collected from the sensor arrays are processed and displayed to the user as dynamic finger-contact maps. The pressure sensor arrays allow the user to perceive the grasped object's physical properties such as compliance, hardness, roughness, shape, and orientation.

Multi-user robotic arm control feature.

Figure 8: Multi-user robotic arm control feature. The users are able to perform a COVID-19 test during a regular video call [source video].


Thus by using MediaPipe and a robot we built an effective, multi-user robot teleoperation system. Potential future uses of teleoperation systems include medical testing and experiments in difficult-to-access environments like outer space. Multi-user functionality of the system addresses an actual problem of effective remote collaboration, allowing to work on projects which need manual remote control in a group of several people.

Another nice feature of our pipeline is that one could control the robot using any device with a camera, e.g. a mobile phone. One also could operate another hardware form factor, such as edge devices, mobile robots, or drones instead of a robotic arm. Of course, the current solution has some limitations: latency, the utilization of z-coordinate (depth), and the convenience of the gesture types could be improved. We can’t wait for the updates from the MediaPipe team to try them out, and looking forward to trying new types of the gripper (with fingers), two-hand control, or even a whole-body control (hello, “Real Steel”!).

We hope the post was useful for you and your work. Keep coding and stay healthy. Thank you very much for your attention!

This blog post is curated by Igor Kibalchich, ML Research Product Manager at Google AI