The present and future of post production business and technology

Transcription Services: State of Play

A few years ago, we considered supporting transcripts in Lumberjack System. At the time our goal was to quickly prepare for an edit, and transcriptions took days and cost serious money.

Two years ago we supported the alignment of time-stamped transcripts to Final Cut Pro X clips, and a year ago introduced “magic” keywords, derived by a cognitive service. Since Lumberjack doesn’t (yet, I might emphasize) support a speech-to-text service internally, what are the options, and what do they tell us about the state of play for transcription in April 2017?
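As a rough illustration of what “alignment” involves, here’s a minimal Python sketch that parses a time-stamped transcript into (seconds, text) pairs, which could then be matched against clip ranges. The bracketed HH:MM:SS line format here is an assumption for illustration, not Lumberjack’s actual specification:

```python
import re

def timecode_to_seconds(tc: str) -> float:
    """Convert an HH:MM:SS timestamp into seconds."""
    hours, minutes, seconds = (float(part) for part in tc.split(":"))
    return hours * 3600 + minutes * 60 + seconds

def parse_transcript(text: str) -> list:
    """Parse lines like '[00:00:05] Speaker: words...' into
    (seconds, text) pairs ready for matching against clip ranges."""
    pattern = re.compile(r"\[(\d{2}:\d{2}:\d{2})\]\s*(.*)")
    entries = []
    for line in text.splitlines():
        match = pattern.match(line.strip())
        if match:
            entries.append((timecode_to_seconds(match.group(1)), match.group(2)))
    return entries

sample = """[00:00:05] Interviewer: Tell us how you started.
[00:00:12] Subject: It began about ten years ago."""
print(parse_transcript(sample))
```

Once the transcript is in that form, each paragraph’s start time can be compared against a clip’s in and out points to decide where it belongs.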

Traditional Transcription

Until recently, the only way a transcription was created was by a trained human listening to the audio and typing the transcript. These services tended to have multi-day turnarounds and to be expensive. There are hundreds of smaller shops, but Take1 and Rev are prominent examples.

Take1 service the Hollywood Media and Entertainment industry, as well as productions in their home of the UK, with the advantage of being in a different time zone. Take1 customize their output to the needs of the client, and will prepare Lumberjack ready transcripts upon request. Time stamps are part of their standard service. Because of the high level of human attention, Take1 are at the higher end of the spectrum at $2 a minute (basic service).

Rev (like most services now) use a computer transcription as the base and a human editor to correct it. Time stamps are not standard, and they do not produce Lumberjack-ready transcripts, but their pricing makes them attractive. Although turnaround is officially 24 hours, experience lately has been closer to 3–4 hours, for $1 a minute. If you need time-stamped transcriptions, it will require an extra pass by you or someone on your team.

Hybrid Services

These services make no pretense of their roots as machine-driven Cognitive Services. Some make it known which underlying service is being used; others prefer not to say, but there are many, as I have recently written. While suitable for programmers, hooking in directly isn’t for beginners, which is why several companies have wrapped these speech-to-text APIs in a user-friendly interface.
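To give a sense of what “hooking in directly” involves, here’s a minimal Python sketch. The endpoint URL and JSON response shape below are placeholders for illustration – each vendor’s actual API (Watson, Speechmatics, and so on) differs, so treat every name here as an assumption:

```python
import json

# Placeholder endpoint – a stand-in, not any vendor's actual URL.
API_URL = "https://speech.example.com/v1/recognize"

def build_request(audio_path: str, language: str = "en-US") -> dict:
    """Assemble the metadata that typically accompanies an audio upload
    to a cloud speech-to-text service."""
    return {
        "url": API_URL,
        "headers": {"Content-Type": "audio/wav"},
        "params": {"language": language, "timestamps": True},
        "audio_file": audio_path,
    }

def parse_response(raw: str) -> list:
    """Flatten a hypothetical JSON response into (start_seconds, word) pairs."""
    data = json.loads(raw)
    return [(word["start"], word["word"]) for word in data["words"]]

# A simulated response in the assumed shape:
raw = '{"words": [{"start": 0.4, "word": "hello"}, {"start": 0.9, "word": "world"}]}'
print(parse_response(raw))
```

Even in this simplified form you can see why wrapping it in an interface matters: the raw result is word-level data that still needs formatting, speaker labels and correction before it’s a usable transcript.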

I suppose I should mention that the developers of all these apps, and some of the people involved with the services previously mentioned, are friends of mine.

The first I was aware of is SpeedScriber, now in beta for nearly a year (but relatively easy to join via the website), where the Cognitive Service is hidden within an excellent editing and management interface. The developer appears focused on creating the most accurate transcriptions with minimal effort, and on getting them aligned with clips in FCP X. SpeedScriber requires minimal correction – usually just speaker identification – and is priced at 50c per minute.

Discussed over various beverages during the FCP X Creative Summit 2016, and now publicly announced are two more upcoming releases from Core Melt and Digital Anarchy.

Scribeomatic focuses more on making media searchable (functionally similar to PhraseFind and its ilk) with basic storyboarding tools, and is well integrated with FCP X.

Transcriptive is the other new tool, from Digital Anarchy. It’s the only one that offers a choice of transcription engine: either IBM Watson or Cambridge, UK-based Speechmatics. Speechmatics offers a lot more languages than IBM’s Watson does, so it’s a smart option for the European market. Again the focus is on search – as well as subtitles – in an Adobe Premiere Pro CC panel integrated in the app.

Shortly after NAB finished, Wired published an article on speech-to-text, highlighting the Trint service. Trint would also seem to be using Speechmatics, based on language selection (and proximity to Cambridge). At 25c per minute, it’s the least expensive option with time stamps. While the editing interface isn’t as comprehensive as SpeedScriber’s, it’s fairly easy to use and functional – not bad for a browser-based app. Trint is also the least expensive way to get Lumberjack-ready transcripts.

You could go cheaper than that at the wholesale rates for the Cognitive Services, but you’ll be writing a lot of code. The inevitable conclusion, though, is that transcription – and the other Cognitive Services – will become commodities.
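To put the per-minute rates quoted above in perspective, here’s a quick sketch of what a project’s footage would cost at each service (rates as quoted in this article; the 40 hours of footage is an arbitrary documentary-sized example):

```python
# Per-minute rates quoted in the article (USD).
rates = {"Take1": 2.00, "Rev": 1.00, "SpeedScriber": 0.50, "Trint": 0.25}

hours_of_footage = 40  # an arbitrary example project
for service, rate in sorted(rates.items(), key=lambda kv: kv[1]):
    cost = rate * hours_of_footage * 60
    print(f"{service}: ${cost:,.0f}")
```

At 40 hours that’s $600 at Trint versus $4,800 for Take1’s basic service – an eight-fold spread that explains why the commodity end of the market is so attractive.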

But we have to rethink our workflows to take advantage of commodity transcription. Simply commodifying transcripts isn’t enough.




9 responses to “Transcription Services: State of Play”

  1. John Bertram

    Hi Philip —

    Many thanks for this update.

    True after-the-fact Speech-to-Text technology (as opposed to Live-Dictation-to-Text) is something I could desperately use, as I continue to acquire hundreds of hours of interview material for a pair of interrelated, micro-budget doc projects. But I’m wondering how far along are the services & software you describe above? Do they all presume files with close-up mic’d and cleanly captured, studio-quality audio? Much of what I have is In-the-Field, Run ’n’ Gun — sometimes at its Runniest and Gunniest. Okay, maybe not THAT bad, but rarely pristine (some bits even including live event sound heard over a PA system).

    And what I need are text file transcriptions which include (in an easily accessible way) the Creation Date&Time plus elapsed RunningTime metadata for each clip, to which I can then add whatever speaker IDs I choose and format to taste – but reasonably accurate, and NOT costing the thousands of dollars currently failing to burn a hole in my pocket.

    So — any recommendations as to how many more years I should keep waiting, or do you think the time has come for me to at least try out one or more of these services — given that their most bargain basement rate of around 25c/minute would probably represent my budget’s ceiling?

    John B,

    1. Philip

      I don’t think I know of any service that gives you that now. Creation Date and Time come from the file. Elapsed time from the keyword range assigned to the paragraph. Speaker ID is part of speech to text (although not that accurate right now).

      But, the quality of a transcription – human or otherwise – will depend on the quality of the audio.

  2. mark Raudonis

    The Achilles heel for all of these services is ACCURACY! Successful auto transcription is very dependent on the condition of the audio. The “cleaner” the audio, the higher the accuracy rate. While I could tolerate a relatively high error rate as a reader, if you want to take advantage of these transcripts and use them for word searches, then the accuracy has to be very high.

    Currently, no AI-generated transcript that I’m aware of is higher than 90–95% accurate. Therefore, when using these files for searches, the results will always leave you wondering if indeed you found EVERY instance of a particular word.
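That worry can be put in numbers: if each word is transcribed correctly with probability p, then the chance that every one of n spoken occurrences of a search term made it into the transcript is p^n (assuming independent per-word errors, which is a simplification):

```python
def full_recall_probability(accuracy: float, occurrences: int) -> float:
    """Chance that every spoken occurrence of a term was transcribed
    correctly, assuming independent per-word errors (a simplification)."""
    return accuracy ** occurrences

# At 95% word accuracy, a term spoken 20 times:
print(round(full_recall_probability(0.95, 20), 3))  # 0.358
```

So at 95% accuracy a term spoken 20 times has only about a one-in-three chance of surfacing in every search – which is exactly why the nagging doubt never goes away.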

    I can’t wait for AI produced transcripts to become a commodity.

    1. Philip

      The standard for human transcription is about 97%. Both Microsoft (Cortana) and Google claim to have met or exceeded the accuracy of a human. I would rather have everything transcribed to 95% or better.

      But you’re once again not skating to where the puck will be. The transcript is only the first step to automatic keywords, concept extraction, emotion highlighting and story building. This is where we’re heading.

    2. Interesting comment, but the other editors I’ve spoken to have told me that 95% accuracy is still useful for text-based search. And of course you can have a human go over the result; even human-checked, it will still save time over transcribing by hand.

  3. Karsten Schlüter

    > … results will always leave you wondering if indeed you found EVERY instance of a particular word.… could be, the words ’10’, ‘life’ nor ‘ago’ never are spoken out loud.

    Voice Recog is no ‘dictation’ replacement – in my understanding for usage in an NLE. It helps to find certain ranges…and then, an accuracy of 90% is a lot.

    My 5¢ 🙂

    1. Karsten Schlüter

      oops, the board doesn’t like some signs, now my full reply:

      quote: … results will always leave you wondering if indeed you found EVERY instance of a particular word.

      When you edit video, do you really search for SINGLE words? Or, more for ranges of words, aka sentences?

      Ok, when you search for a person’s or product’s name… Then I would like to see those transcribers frankly tell you when they fail – a simple ‘unintelligible’ marker…

      The real challenge is ‘meaning’, smart VoiceRecog:
      “Where is the scene where he’s talking about his life ten years ago?” => it could be that the words ‘ten’, ‘life’ and ‘ago’ are never spoken out loud.

      Voice Recog is no ‘dictation’ replacement – in my understanding for usage in an NLE. It helps to find certain ranges…and then, an accuracy of 90% is a lot.

      My 5¢ 🙂

      1. Philip

        I look for concepts that build a story, not words, which is why the transcript is only the beginning. If you’re focusing on the transcript as the end result, you’re missing the point. You don’t get that from current transcription, so why would you expect an automated transcription to be different?

        You’re working in the past if you think 90% is the current state of the art. As I put in another response, Cortana and Google are already more accurate than a human, according to the companies.

  4. Just a minor point: our Scribeomatic beta solution also talks to multiple services. At the moment we are concentrating on having the software automatically work out which service would be best to send it to, depending on the content, but we can also add the ability to manually select if people want that.