By the end of the summer, we expect the Spanish in Texas corpus will include 100 videos with a total running time of more than 50 hours. Fortunately, there are a range of services and tools to expedite the process of transcribing and captioning all those hours of video.
YouTube began offering automated captioning for videos a few years ago. Using Google’s voice recognition technology, a transcript is automatically generated for any video in one of the supported languages. As of today those languages include English, Japanese, Korean and Spanish, German, Italian, French, Portuguese, Russian and Dutch. The result of the automated transcription is still very much inferior to human transcription and is not usable for our purposes. However, YouTube also allows the option of uploading your own transcript as the basis for generating the synchronized captions. When a transcript is provided, the syncing process is very effective at creating accurate closed captions synchronized to a video. In addition, YouTube offers a Captioning API, which allows programmers to access the caption syncing service from within other applications.
Automatic Sync Technologies is a commercial provider of human transcription services as well as a technology for automatically syncing transcripts with media to produce closed captions in a variety of formats. Automatic Sync recently expanded their service to include Spanish as well as mixed Spanish/English content. An advantage of using their service is that they have the ability to create custom output formats (requires a one-time fee). For instance, we worked with them to create a custom output file that included the start and end time for each word in the transcript and was formatted as a tab-delimited text file.
There are also online platforms for manually transcribing and captioning videos in a user-friendly web interface. DotSub leverages a crowd-sourcing model for creating subtitles and then translating the subtitles into many different languages. Another option in this category is Universal Subtitles, which is the platform used to subtitle and translate the popular TED Video series. These can be a good option if resources aren’t available to hire transcribers and/or translators.
While developing the SPinTX corpus we have used all of the solutions mentioned above, but we have now settled on a standard process that works best for us. First, we pay a transcription service to transcribe the video files in mixed Spanish / English and provide us with a plain text file, at a cost of approximately $70 per hour of video. Then, we use the YouTube API to sync the transcripts with the videos and retrieve a caption file. This process works for us because our transcripts often need a lot of revisions, and we can sync as many times as we need at no cost. The caption file is then integrated into our annotation process, so when users get search results they can jump directly to the place it occurs in the video. In a later post, we will go into more detail about how we are implementing the free YouTube API and how you can adapt this process for your own video content!