Interactive Transcripts and Automatic Captions for Developer Videos

JUNE 21, 2010

Did you notice the new Interactive Transcript feature that lets you scan quickly through the full text of any owner-captioned video that you’re watching on YouTube? For videos from I/O, that means you can quickly scan through a 60-minute talk to find just the part that you need to see. Or use your browser search with the Interactive Transcript to find a mention of an API call, and then click on a word in the transcript to jump straight to that part of the video.

Because developers don’t all speak English (and because some developers speak really fast when presenting), we caption every video that we post to http://www.youtube.com/googledevelopers. Most of the year, that’s a pretty easy thing to keep up with. But last year, when we posted all the videos for Google I/O 2009, it took us months to get everything done.

This year, we captioned everything within 24 hours of the videos going live. I’m excited about that, because it wouldn’t have been possible without the new auto-caption and auto-timing features in YouTube. We also did something a little nerdy -- we used four different methods of captioning.

If you use YouTube to share talks from your own developer events, you might find this summary useful.

The two fastest options for producing and cleaning up our captions used auto-timing. We uploaded a transcript and had YouTube’s speech recognition calculate the timecodes for us.

The two auto-timing methods were:

CART live real-time transcript + auto-timing
Because we had professional real-time transcriptionists at I/O, we could instantly caption anything that had a live session transcript. That’s how we got the keynotes captioned on the day of the event. We also used this method for the Android talks.
Professional transcription + auto-timing
This was less expensive than CART, and faster than full captions with timecodes, but slower than real-time transcription because we had to get video files to the transcribers.

Although these methods were fastest, auto-timing turned out not to be perfect for all videos. When mic quality varied, or we had too many speaker changes in a short period of time (e.g., panel discussions or fireside chats), the timing sometimes slipped out of sync. You can still use the Interactive Transcript to see what was said, but it’s not ideal.

The two slower methods that we used were:

Pure 'traditional' captioning
This is what we did last year for Google I/O 2009 videos. It’s slower, and more expensive, because you have to transcribe and set all the timecodes correctly. But the end result is 100% accurately timed. We did this to fix a video that the auto-timing had a lot of difficulty with.
Speech recognition (auto-captions) with human cleanup and editing
This gave us perfect timecodes, just like traditional captions, and took less time than traditional captioning. It took slightly longer than auto-timing alone because we had to download the machine-generated auto-captions from YouTube to do the edits.
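
To make the difference between a plain transcript and a timecoded caption track concrete, here is a minimal Python sketch that turns a list of timed cues into a caption file. It is only an illustration: the SubRip (.srt) layout is used because it is a widely known caption format, not because this post says which format was uploaded, and the `Cue` class, the helper names, and the sample cues are all hypothetical. With auto-timing, you supply only the text and the start and end times are calculated for you; with traditional captioning, someone sets every one of these times by hand.

```python
from dataclasses import dataclass


@dataclass
class Cue:
    """One caption: hypothetical structure, for illustration only."""
    start: float  # seconds from the start of the video
    end: float    # seconds from the start of the video
    text: str


def format_timestamp(seconds: float) -> str:
    """Format seconds as an SRT-style timestamp, HH:MM:SS,mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def to_srt(cues: list[Cue]) -> str:
    """Render cues as numbered SRT blocks separated by blank lines."""
    blocks = []
    for index, cue in enumerate(cues, start=1):
        blocks.append(
            f"{index}\n"
            f"{format_timestamp(cue.start)} --> {format_timestamp(cue.end)}\n"
            f"{cue.text}\n"
        )
    return "\n".join(blocks)


# Sample cues invented for this sketch; not taken from any real video.
cues = [
    Cue(0.0, 2.5, "Welcome to the session."),
    Cue(2.5, 6.0, "Let's start with the API overview."),
]
print(to_srt(cues))
```

A plain transcript is just the `text` fields in order. Everything else in the file is timing, which is the part that auto-timing calculates and that traditional captioning sets manually.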

Automatic captions are fantastic if you don't have time or budget to put any work into your captioning. But for I/O, we wanted our captions to be perfect on technical terms, so fully automatic captions weren't the best fit.

Not all of these methods are equal in terms of quality, but it’s interesting to compare. To see which method was used on a video, look for the track name in the caption menu. To compare owner-uploaded captions with pure machine-generated auto-captions, you can always choose ‘Transcribe Audio’ from the caption menu for our videos.

If you’d like to help improve caption quality, please watch a video and fill out our caption survey to tell us what you think of these captions! We know some of them are going to be a little off -- if you report issues, we’ll fix them.

By Naomi Bilodeau, Google Developer Programs