Text is the New Timecode
Although I’ve shamelessly stolen the title from Joe B (@zbutcher on Twitter), I think it represents a real shift in the way we work with our source media.
Now, before I start let me be clear. I am NOT saying timecode is unimportant. I’m NOT saying that timecode is passé and suddenly irrelevant. Timecode remains incredibly important for any tape-based access.
What I am saying is that text search – or phonetic search derived from text – is becoming a highly viable, and in many ways superior, way to search and find content. Timecode’s primary role was in identifying any given frame from a tape by tape and frame number. There’s nothing wrong with that approach, but as humans we don’t think in “reel and timecode”, which is why text is a superior option.
In this technology summary, I’m going to consider:
- What tools are at our fingertips right now and their relative merits,
- Why speech transcription is ultimately more valuable than phonetic search (long term),
- Developments in speech transcription, and
- Transcription technologies.
Software and tools available now
There are two broad approaches to text search in the marketplace today: those that transcribe the speech into text and those that use phonetic search. In the first category we have Adobe Premiere/Audition/Soundbooth using Autonomy’s technology to transcribe speech to text. In the second we have Avid and Boris using Nexidia’s phonetic search technology in Media Composer (for Phrasefind and Script Sync) and Soundbite (formerly Get!) respectively. Our own prEdit edits video using a text transcript.
Phonetic search makes no attempt to understand meaning. Essentially, Nexidia’s technology “understands” what an audio waveform would look like for a given piece of text. Nexidia scans all the audio files ahead of time and indexes them for search. When you input a word, it estimates the corresponding waveform, then finds matches by comparing that target against the index. This technique is great when you don’t have a transcript, or have a transcript without any time stamps.
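To make the index-then-match idea concrete, here’s a toy sketch. Real engines like Nexidia’s work on acoustic features extracted from the audio itself, not on text; this stand-in uses a tiny hand-made phoneme dictionary, and the clip streams and pronunciations are invented purely for illustration:

```python
# Toy illustration of phonetic search. A real engine derives phoneme-like
# features from the audio waveform; here we pretend that step is done.

# Tiny hand-made pronunciation dictionary (ARPAbet-style phonemes).
PHONEMES = {
    "time":   ["T", "AY", "M"],
    "code":   ["K", "OW", "D"],
    "text":   ["T", "EH", "K", "S", "T"],
    "search": ["S", "ER", "CH"],
}

def to_phonemes(phrase):
    """Convert a text query into one flat phoneme sequence."""
    seq = []
    for word in phrase.lower().split():
        seq.extend(PHONEMES[word])
    return seq

def find(index, query):
    """Return (clip, offset) for every place the query's phonemes occur.

    index: {clip_name: phoneme stream}, built once, ahead of any search.
    """
    target = to_phonemes(query)
    hits = []
    for clip, stream in index.items():
        for i in range(len(stream) - len(target) + 1):
            if stream[i:i + len(target)] == target:
                hits.append((clip, i))
    return hits

# Pretend these phoneme streams were indexed from two interview clips.
index = {
    "interview_01": ["S", "ER", "CH", "T", "AY", "M", "K", "OW", "D"],
    "interview_02": ["T", "EH", "K", "S", "T"],
}
print(find(index, "time code"))   # [('interview_01', 3)]
```

Note that nothing here knows what “time code” means – it only knows what the phrase sounds like, which is exactly the strength and the limitation of the phonetic approach.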
This technology is also used in Avid’s optional ScriptSync package to align a text file (script) with the audio in audio and video files. It works well if you have a transcript of interviews and want to align the audio in the video with that transcript. However, if you have a transcript available, working with prEdit may be faster, manipulating the audio and video by editing and modifying text.
Going the other way – from speech to text – provides an additional advantage in that we have meaning associated with the text, and we have a transcript that carries through production into distribution. If you want searchable text for distribution then Adobe’s professional tools are the only automated tools: phonetic search isn’t viable because it would require shipping the Nexidia engine with the distributed content (and with their licensing model, I don’t see that happening any time soon).
However, the speech-to-text engine used by Adobe (licensed from Autonomy) is still a work in progress: results can be quite good, but the average result is less spectacular, sometimes completely useless. That is why Adobe have added, and strongly pushed, the ability to provide the speech transcription engine with a guide script. Surprisingly, a guide script alone – even one that is an exact transcript – does not transcribe perfectly. The best results require a trip via Adobe Story. This has the advantage of keeping all punctuation and paragraph breaks (and automatic subclipping into paragraph subclips in prEdit). For a comparison of accuracy you might be interested in Colin Brougham’s comparisons.
The big disadvantage to this approach is that it requires a transcript, the very thing most people want to automate because of the cost.
Why speech transcription is more valuable than phonetic search
Speech transcription carries meaning. Phonetic search does not carry meaning. This is an important distinction, because it means that speech transcription is valuable metadata as well as a production tool, while phonetic search is a useful production tool, but has no value as metadata away from the Nexidia engine.
Speech transcription can be carried with the media throughout its life, even into edited versions. Indeed, this is Adobe’s intent. They tend to focus on the speech transcript metadata as part of a distribution strategy more than its use in postproduction. The speech transcript metadata is carried inside media files as XMP metadata. (There are other alternatives for speech transcript metadata in distribution files, but I’ll discuss that further down.)
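As a rough illustration of what “transcript as metadata” looks like in practice, here’s a sketch that pulls words and time stamps out of an XMP-style packet. The `speechAnalysis`/`marker` element names below are simplified assumptions for illustration, not the exact xmpDM schema Premiere Pro writes:

```python
# Sketch: extracting a speech transcript from an XMP packet embedded in a
# media file. The marker layout is a simplified stand-in, NOT Adobe's real
# xmpDM schema -- it only shows the shape of the idea.
import xml.etree.ElementTree as ET

SAMPLE_XMP = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <speechAnalysis>
    <marker startTime="0.0" word="text"/>
    <marker startTime="0.4" word="is"/>
    <marker startTime="0.6" word="the"/>
    <marker startTime="0.8" word="new"/>
    <marker startTime="1.1" word="timecode"/>
  </speechAnalysis>
</x:xmpmeta>"""

def transcript_from_xmp(xmp_string):
    """Return (word, start_time_in_seconds) pairs from the packet."""
    root = ET.fromstring(xmp_string)
    return [(m.get("word"), float(m.get("startTime")))
            for m in root.iter("marker")]

words = transcript_from_xmp(SAMPLE_XMP)
print(" ".join(w for w, _ in words))   # text is the new timecode
```

Because the packet is plain XML travelling inside the media file, any tool downstream can read the transcript without the engine that created it – which is the whole point of the transcription approach.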
Speech transcription can be searched by anyone, at any stage, without needing a proprietary engine.
Speech transcription can be used to derive keywords and other expressions of meaning, which is valuable not only for automating some types of production (some types folks, only some) but extremely valuable as metadata for later finding content.
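A crude sketch of that keyword derivation: count the words in a transcript and filter out common “stop words”. Real systems use far more sophisticated language analysis, but the principle is the same (the transcript text and stop-word list here are invented for illustration):

```python
# Minimal keyword derivation from a transcript: frequency counting with
# common stop words removed. Illustrative only -- production systems use
# proper linguistic analysis.
from collections import Counter
import re

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
              "that", "we", "was", "for", "on", "with", "into"}

def keywords(transcript, top_n=5):
    """Return the top_n most frequent non-stop-words in the transcript."""
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(top_n)]

text = ("Timecode identifies frames on tape. Text search finds meaning. "
        "Text carries meaning through production and into distribution.")
print(keywords(text, 3))   # 'text' and 'meaning' lead with two hits each
```

Even something this naive yields metadata that survives away from any proprietary engine – a phonetic index can’t offer that.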
Transcribed speech is the input to prEdit. Briefly, prEdit allows you to easily add metadata (log notes), break interviews into thought segments and eliminate less useful material, before searching and building a story by dragging and dropping text blocks. Editing can continue through the story-building process, and narration can be added (converted to voice instantly). At any time you can preview a clip or clips, the full story, or any selected part of the story before exporting to Final Cut Pro 7 or Premiere Pro CS 5.5 or later.
So, while phonetic search is a great post-production tool, transcription into real text has a wider range of uses, both inside and outside post-production. The only trouble is, speech transcription is expensive!
Developments in speech transcription
It’s a great time to be alive if you’re interested in speech transcription. The most significant developments have been hybrid computer-human approaches used by 3PlayMedia, SpeakerText and SpeechPad to reduce the cost and time of transcription. These companies are particularly focused on the need for transcription for video in distribution, but are excellent choices for transcriptions for prEdit or other postproduction needs.
Until we get fully automatic speech transcription with adequate accuracy, these will help. Even with human correction, some uncommon words or names will not be transcribed accurately.
I’ve talked about Nexidia and the phonetic search technology and why my long term preference is for speech transcription, so it’s no surprise that I spend some time following what’s happening with the technology. What is interesting to me is that the two companies generally recognized as having the most accurate technology under the widest range of conditions are not (yet) available for postproduction work.
Google has been amassing huge numbers of examples of speech for recognition – an important first step to accurate speech recognition – via the (now defunct) GOOG-411 initiative. Google have been using this technology for automated captioning of YouTube videos and for voicemail transcriptions within Google Voice. As a Google Voice customer I’d say that the results are definitely much better than the attempts by Vonage (laughable) and comparable to, or better than, Premiere Pro’s use of Autonomy.
More information on Google’s voice recognition plans can be found in this TechCrunch article with Mike Cohen, head of Google’s voice recognition efforts. Also interesting to note is that Google is slowly opening up an API for their speech recognition efforts, starting with Chrome version 11. How open that will be for third-party developers remains unknown, but it’s an interesting direction from the search giant. If the API became open, and this is one of the two most accurate speech transcription technologies, why wouldn’t savvy developers like us start to use it and integrate it into our software?
Equally prominent in the speech transcription/recognition community is Nuance, probably best known for powering Dragon Dictate, Dragon NaturallySpeaking (and the variations) and the speech recognition component of Apple’s Siri technology. (Siri adds a lot of powerful tools on top of this basic recognition layer, but if the speech isn’t recognized accurately nothing good can come from it downstream.) There is no public API for any of Nuance’s technology (nor Siri for that matter). Nuance tends to do direct deals with companies who want to license its technology – a fairly standard practice in the technology world.
My fondest hope is that Apple’s license from Nuance will be extended to OS X and that a speech recognition framework will be included in OS X for developers. It’s a fond hope, not anything real!
Google and Nuance have the most accurate technologies that do not require speech training. I’m a Dictate user, but that product uses training to obtain very high accuracy in transcription. Accuracy without training is what we need for interview transcription in post.
Tip: If you do have Dictate or any of the PC variants of Nuance’s products, one technique is to listen to the interview and speak it with your own (trained) voice. I’m not yet there, but it is possible to speak-as-you-hear (the basis of the common ear-bud presenter trick) for fast, accurate transcription.
Beyond the giants, there are many other technology companies, and open source projects, in the speech recognition field that are worth mentioning. One thing that should be noted is that most of these technologies are cloud based (as is Apple’s Siri). They only work as long as they have a connection to their primary servers.
I’ll let these folk speak for themselves. Whichever technologies prevail, we’re definitely seeing a surge in the accuracy and flexibility of speech recognition that is going to factor into post-production in the coming years, beyond where we are right now.
Speech Recognition for your iPhone application. The claim is that the technology is:
- currently unavailable from Apple APIs
- easy to add to your application
- convenient for all types of apps: games, fun and promo applications or utilities
- suitable for the iPhone and the iPod Touch
- cheaper than you might expect.
Sphinx-4 is a state-of-the-art speech recognition system written entirely in the Java™ programming language. It was created via a joint collaboration between the Sphinx group at Carnegie Mellon University, Sun Microsystems Laboratories, Mitsubishi Electric Research Labs (MERL), and Hewlett Packard (HP), with contributions from the University of California at Santa Cruz (UCSC) and the Massachusetts Institute of Technology (MIT).
Sphinx-4 started out as a port of Sphinx-3 to the Java programming language, but evolved into a recognizer designed to be much more flexible than Sphinx-3, thus becoming an excellent platform for speech research.
A combination of several technologies and open source tools make this possible. In the browser, Flash is used to access the microphone and stream the audio to an RTMP server. Red5 is used because it’s a versatile media server that has the benefit of being open source and free.
“Julius” is a high-performance, two-pass large vocabulary continuous speech recognition (LVCSR) decoder software for speech-related researchers and developers. Based on word N-gram and context-dependent HMM, it can perform almost real-time decoding on most current PCs in 60k word dictation task. Major search techniques are fully incorporated such as tree lexicon, N-gram factoring, cross-word context dependency handling, enveloped beam search, Gaussian pruning, Gaussian selection, etc. Besides search efficiency, it is also modularized carefully to be independent from model structures, and various HMM types are supported such as shared-state triphones and tied-mixture models, with any number of mixtures, states, or phones. Standard formats are adopted to cope with other free modeling toolkits such as HTK, CMU-Cam SLM toolkit, etc.
I have no idea what that means either!
Scripto is a light-weight, open source tool that will allow users to contribute transcriptions to online documentary projects. The tool will include a versioning history and a full set of editorial controls, so that project staff can manage public contributions. The design and development of the tool is being supported by grant funding from the National Endowment for the Humanities, Office of Digital Humanities, and the National Historical Publications and Records Commission.
From audio or video media, transcription can automatically generate a subtitle file, a keyword list in XML format and the entire plain text. (That’s the translation from the French – speech recognition is not limited to English!)
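Generating a subtitle file is straightforward once the transcript carries time stamps. Here’s a minimal sketch that writes SRT-format output from a list of time-stamped segments (the segment data is invented for illustration):

```python
# Sketch: turning time-stamped transcript segments into SRT subtitles --
# the kind of output a transcription tool can generate automatically.

def srt_time(seconds):
    """Format a time in seconds as SRT's HH:MM:SS,mmm."""
    ms = int(round(seconds * 1000))
    h, rem = divmod(ms, 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def to_srt(segments):
    """segments: list of (start_seconds, end_seconds, text) tuples."""
    blocks = []
    for i, (start, end, text) in enumerate(segments, 1):
        blocks.append(f"{i}\n{srt_time(start)} --> {srt_time(end)}\n{text}\n")
    return "\n".join(blocks)

segments = [
    (0.0, 2.5, "Text is the new timecode."),
    (2.5, 5.0, "Speech transcription carries meaning."),
]
print(to_srt(segments))
```

The same time-stamped segments could just as easily feed a keyword list or a plain-text export – the transcript plus timing data is the common raw material.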
And interestingly, although not currently speech recognition:
Currently, SoundHound’s specialty is delivering information about music. Users can sing or hum a tune into its SoundHound app and the app returns the song name, as well as other information. Last week, the company released its Hound app, which can identify when a user says the name of an artist or album.
The slightly frivolous-seeming “name that tune” aspects of SoundHound’s applications belie the seriousness of the technology and business underneath it all. SoundHound has raised $16 million in venture capital and currently has 55 full-time employees. Investors have been attracted to the company by the future potential of SoundHound’s core technology, Mohajer told me. “We own all of our technology, while a lot of other apps in this space license their core technology,” he said. “We built everything in-house and we own all of our intellectual property.”