The present and future of post production business and technology | Philip Hodgetts


Web APIs (Application Programming Interfaces) allow us to send data to a remote service and get a result back. Machine learning tools and cognitive services like speech-to-text and image recognition are mostly delivered as online APIs. Trained machines can be integrated into apps, but in general these services operate through an API.

The big advantage is that they keep getting better, without the local developer getting involved.
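These services generally follow an upload-and-poll pattern: submit the media, then check back until the result is ready. A minimal sketch of that pattern, with a stubbed in-memory service standing in for a real endpoint (the class and method names here are hypothetical, not any vendor's actual API):

```python
import time

# Hypothetical stub standing in for a remote speech-to-text service.
# A real service would be reached over HTTPS, but the
# submit / poll / result shape below is the common pattern.
class FakeTranscriptionService:
    def __init__(self):
        self._jobs = {}

    def submit(self, audio_bytes):
        """Accept an upload and return a job identifier."""
        job_id = f"job-{len(self._jobs) + 1}"
        # A real service would queue the audio for processing here.
        self._jobs[job_id] = "hello world"
        return job_id

    def poll(self, job_id):
        """Report job status; real services return 'running' until done."""
        return {"status": "done", "transcript": self._jobs[job_id]}

def transcribe(service, audio_bytes, interval=0.01):
    """Upload audio, then poll until the transcript comes back."""
    job_id = service.submit(audio_bytes)
    while True:
        result = service.poll(job_id)
        if result["status"] == "done":
            return result["transcript"]
        time.sleep(interval)

print(transcribe(FakeTranscriptionService(), b"\x00" * 16))  # prints "hello world"
```

The key point for developers is that everything inside `submit` and `poll` lives on the service side, so improvements arrive without touching the client.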

Nearly two years ago I wrote of my experience with SpeedScriber*, which was the first of the machine-learning-based transcription apps on the market. At the time I was impressed that I could get the results of a 16-minute interview back in less than 16 minutes, including prep and upload time. Usually the overall time was around the run time of the file.

Upload time is the downside of web-based APIs and is significantly holding back image recognition on video. That is why audio to be transcribed is first converted to high-quality proxy files, which reduces upload time.

My most recent example, sourced from a 36-minute WAV, took around one minute to convert to an archival-quality M4A, which reduced the file size from 419 MB to 71 MB. The five-times-faster upload – now 2’15” compared with more than 12 minutes to upload the original – more than compensates for the small prep time for the M4A.

The result was emailed back to me in 2’30”. That’s 36 minutes of speech transcribed with about 98% accuracy in 2.5 minutes – more than 14x real time. The entire time from instigating the upload to finished transcript was 5’45” for 36 minutes of interview.
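The arithmetic behind those figures is easy to check:

```python
# Figures from the example above.
original_mb, proxy_mb = 419, 71
original_upload_s = 12 * 60           # "more than 12 minutes"
proxy_upload_s = 2 * 60 + 15          # 2'15"
transcribe_s = 2 * 60 + 30            # 2'30"
interview_min = 36

compression = original_mb / proxy_mb                    # ~5.9x smaller file
upload_speedup = original_upload_s / proxy_upload_s     # ~5.3x faster upload
realtime_factor = (interview_min * 60) / transcribe_s   # ~14.4x real time

print(f"{compression:.1f}x smaller, {upload_speedup:.1f}x faster upload, "
      f"{realtime_factor:.1f}x real time")
```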

These APIs keep getting faster and can run on much “heavier iron” than my local iMac, which is no doubt part of the reason they are so fast – but that’s just another reason they’re good for developers. Plus, every time the speech-to-text algorithm is improved, every app that calls on the API gets the improvement for free.

*I haven’t used SpeedScriber recently, but I would expect that it has similarly benefited from improvements on the service side of the API it works with.

July 19, 2018

Speech-to-Text: Recent Example

For a book project I recorded a 46-minute interview and had it transcribed by Speechmatics.com (as part of our testing for Lumberjack Builder). The interview was about 8600 words raw.

The good news is that it was over 99.8% accurate: I corrected 15 words out of a final 8100. The interview had good audio. I’m sure an audio perfectionist would have made it better, as would recording in a perfect environment, but this was pretty typical of most interview setups. It was recorded to a Zoom H1n as a WAV file. No compression.
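Counting each corrected word as an error, the word-level accuracy works out like this:

```python
def word_accuracy(corrections, total_words):
    """Simple word-level accuracy: each corrected word counts as one error."""
    return 1 - corrections / total_words

# 15 corrections out of a final 8100 words.
acc = word_accuracy(15, 8100)
print(f"{acc:.2%}")  # prints "99.81%"
```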

Naturally, my off-mic questions and commentary were not transcribed accurately, but they were never expected or intended to be. Although, to be fair, the audio was clear enough that a human transcriber would probably have got closer.

The less good news: my one female speaker was identified as about 15 different people! If I wanted a perfect transcript I probably would have cleaned up the punctuation, as it wasn’t completely clean. But the reality is that people do not speak in nice, neat sentences.

But neither the speaker identification nor the punctuation matters for the uses I’m going to make of it. I recognize that accurate punctuation would be needed for closed (or open) captioning output, but for production purposes perfect reproduction of the words is enough.

Multiple speakers will be handled in Builder’s Keyword Manager and reduced to one there. SpeedScriber has a feature to eliminate the speaker ID totally, which I would have used if a perfect output was my goal. For this project I simply eliminated any speaker ID.
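Collapsing the mis-identified speakers down to one is a simple transform. A sketch of the idea (not Builder's actual implementation – the segment structure and names here are illustrative):

```python
# Hypothetical transcript segments as (speaker_label, text) pairs;
# the service split one real speaker across many labels.
segments = [
    ("S1", "I started the project a few years ago."),
    ("S7", "It grew out of an earlier tool."),
    ("S12", "We shipped the first version a year later."),
]

def collapse_speakers(segments, name="Interviewee"):
    """Map every speaker label to a single name, keeping text and order."""
    return [(name, text) for _, text in segments]

merged = collapse_speakers(segments)
print({speaker for speaker, _ in merged})  # prints {'Interviewee'}
```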

The punctuation would also not be an issue in Builder, where we break on periods, but you can combine and break paragraphs with simple keystrokes. It’s not a problem for the book project as it will mostly be rewritten from spoken form to a more formal written style.
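Breaking a transcript on periods is equally mechanical. A minimal sketch of the default segmentation described above (again illustrative, not Builder's code):

```python
import re

def break_on_periods(text):
    """Naive paragraph breaking: split on whitespace that follows
    a sentence-ending period."""
    sentences = re.split(r"(?<=\.)\s+", text.strip())
    return [s for s in sentences if s]

text = "We break on periods. Each sentence becomes a paragraph. Simple."
print(break_on_periods(text))
```

A real implementation would also need to handle abbreviations and decimal points, which is partly why hand keystrokes to combine and break paragraphs remain useful.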

Most importantly for our needs, near perfect text is the perfect input for keyword, concept and emotion extraction.
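As a toy illustration of why clean text matters downstream, even the crudest frequency-based keyword pass works directly on accurate words (this is a stand-in for the much richer concept and emotion extraction mentioned above):

```python
import re
from collections import Counter

# A small, illustrative stopword list.
STOPWORDS = {"the", "a", "and", "of", "to", "is", "in", "for", "it"}

def top_keywords(text, n=3):
    """Crude keyword extraction by word frequency over non-stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [word for word, _ in counts.most_common(n)]

sample = ("Machine learning transcription makes transcription fast. "
          "Fast transcription feeds keyword extraction.")
print(top_keywords(sample))
```

Any transcription error feeds straight through to the extracted terms, which is why near-perfect text is the ideal input.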

On the night of the Supermeet 2011 Final Cut Pro X preview I was told that this was the “foundation for the next 10 years.” Well, as of last week, seven of the ten have elapsed. I do not, for one minute, think that Apple intended to convey a ten-year limit to Final Cut Pro X’s ongoing development, but maybe it’s smart to plan obsolescence: to limit the time an app continues to be developed before its suitability for the task is re-evaluated.


As someone who’s watched the development of machine learning, and who is in the business of providing tools for post production workflows that “take the boring out of post,” you’d think I’d be full of ideas about how post can be enhanced by machine learning.

I’m not.


June 11, 2018

IBM Watson is a Sports Guru?

Two recent announcements place IBM’s Artificial Intelligence play, Watson, right in the sports spotlight.

Watson is being used for tagging World Cup coverage, and its relationship with Wimbledon has grown from picking highlights and enhancing the user experience to, this year, designing the poster!


Engadget recently had an article on transferring facial movements from a person in one video to a different person in a different video. Unlike previous approaches, this latest development requires only a few minutes of the target person’s video, and correctly handles shadows.

Combine that with other research that allows us to literally “put words in people’s mouths”: type the words and have them generated in the voice of a person who never said them, completely synthesized and indistinguishable from the real thing.

Add transferred facial movements to synthesized words in that person’s voice, and it will take a forensic operation to determine whether the result is “genuine” or created.

In some ways I guess this is another example of Artificial Intelligence (by which we mean machine learning) taking work away from skilled technicians. Human recall was replaced with facial identification at the recent Royal Wedding in the UK, where Amazon’s facial recognition technology was used to identify guests arriving at the wedding.

Users of Sky News’ livestream were able to use a “Who’s Who Live” function:

As guests arrived at St. George’s Chapel at Windsor Castle, the function identified royals and other notable guests through on-screen captions, interesting information about each celebrity and how they are connected to Prince Harry and Meghan Markle.

The function was made possible by Amazon Rekognition, a cloud-based technology that uses AI to recognize and analyze faces, as well as objects, scenes and activities in images and video. And Sky News isn’t the first to use it: C-SPAN utilizes Rekognition to tag people speaking on camera.

Rekognition is also being used by law enforcement.

Facial recognition and identification would obviously be useful for logging in reality and documentary production.

I was privileged to be invited to 2018 HPA TR-X – Everything You Thought You Knew About Artificial Intelligence in M&E and What You Didn’t Know You Didn’t – on a panel titled AI Vision Panel: What Will Happen and When?

It was an interesting topic. Our panel got quite pessimistic about the future of society if we’re not very careful about how AI/machine learning comes into our lives, but that’s a blog post for another time.

What has really set me thinking was a comment by John Motz, CTO of GrayMeta, that his 12-year-old and her friends spend more time creating media for each other than simply consuming it.


A simple indicator of the growing influence and impact of Artificial Intelligence and Machine Learning is their inclusion in the annual retreat of the Hollywood Professional Association (part of SMPTE).

For the 2016 Retreat I proposed a presentation on AI & ML that wasn’t deemed relevant at that time.

For the 2017 Retreat I made pretty much the same proposal, which led to a panel discussion at the Retreat that I was pleased to contribute to.

For the 2018 Retreat, the half-day tech focus the day before the main conference is:

A half day of targeted panels, speakers and interaction, this year focusing on one of the most important topics our industry faces, Artificial Intelligence and Machine Learning.

Two years ago it wasn’t relevant.

A year ago it was one panel in three days of conference.

This year it’s the tech focus day ahead of the conference!

I’ll be back on a panel this year – Novel AI Implementations: Real World AI Case Studies – and hope to see you there. The HPA Retreat is the closest thing our industry has to TED talks, and it discusses the topics that will be crucial to film and television production in 2-3 years. Get a heads up before everyone else.

Well, this is only from a couple of days of reading and email newsletters, but there is quite a focus.

MESA Alliance quotes Deluxe Entertainment Services Group chief product officer Andy Shenkler as saying:

“AI is obviously playing a fairly broad role, especially with the areas that we at Deluxe are working on,” he told the Media & Entertainment Services Alliance (MESA) in a phone interview. That includes “everything from the post-production creation process, localization” around advanced language detection and auto translation – “and then even down into the distribution side of things,” he said, noting the latter was “probably the least well-known and discussed” part of the equation.

That article goes on to talk about whose technologies they use and how they use them to assign metadata to incoming assets. Speaking of content metadata (in this case about finished content, not for use in production), Roz Ho, senior vice president and general manager, consumer and metadata at TiVo, writes in a guest blog at Multichannel News:

Not only does machine learning help companies keep up with the tsunami of content, it can better enrich metadata and enable distributors to get the right entertainment in front of the right viewers at the right time.

Machine learning takes metadata beyond cast, title and descriptions, and enables content to be enhanced with many new data descriptors such as keywords, dynamic popularity ratings, and moods, to name a few.

Roz finishes with a short dissertation on how these machines, and people enhanced by them, will be the direction we take in the future.

And out of CES some headlines:

CES 2018: Consumer Technology Association Expects Major Growth for AI in 2018

CES 2018: AI Touted Heavily by LG, Samsung, Byton on Eve of CES

It seems like every day there is news of yet another application of Machine Learning (AI) in the Media and Entertainment space, either in production – where it is helping decide what goes into production as well as helping in production – or in helping people find more appropriate content.

