With the release of Final Cut Pro 10.6, Apple introduced a new XML format. The original FCP XML – fcpxml – was a single text file, as all XML is. The new format – fcpxmld – is not a single file but a package, known in macOS as a Bundle. A Bundle appears as a single file in the Finder, but is a special type of folder that is read as a single item. Most macOS apps are packages, and the FCP Library is a Bundle.
Being a folder, a Bundle can carry other files within it, known as sidecar files. With fcpxmld the files inside the package include an fcpxml file exactly the same as the existing fcpxml files, which is useful if you run into an app that doesn’t (yet) read the new format.
While Apple haven’t explained their reasons for changing the format, the new Bundle format allows the XML to remain straightforward while carrying additional data, like tracking or stabilization data, alongside the XML metadata. In the original fcpxml, additional data like tracking or stabilization has to be converted from binary data into a form that can be contained in a text file. This makes the resulting text XML files very large and unwieldy: developers have to read in all the data, hold it in memory and write it back out. With all the additional data in the XML it becomes very memory intensive, very quickly.
By separating out the text-based XML from the various data files, developers can read in the XML without the memory challenges of the combined files. Developers who want to work with tracking or stabilization data could read that directly, without needing to extract and convert it out of the XML.
It’s important to note that XML is always, and only, about metadata. With the new Bundle format that does not change. The Bundle is not designed to carry media, as is optionally the case with AAF.
The bottom line is that nothing will change unless you’re a developer. FCP remains compatible with the original single-file format. In fact, you can read the Bundle’s contents by right-clicking on the Bundle and selecting “Show Package Contents.” Inside you’ll find a regular fcpxml file that can be imported into apps that don’t support fcpxmld. This also allows apps, like those from Intelligent Assistance and Lumberjack System, to ignore all that extra metadata and return a classic fcpxml file.
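For the technically curious, here is a minimal sketch in Python of what “Show Package Contents” does programmatically: treat the .fcpxmld bundle as the folder it really is, and copy out the plain fcpxml document for an app that only reads the older single-file format. The assumption that the sidecar inside the bundle carries a .fcpxml extension is mine; check a real export before relying on it.

```python
# Minimal sketch: pull the classic fcpxml document out of an fcpxmld bundle.
# The bundle is just a macOS folder, so ordinary file tools work on it.
import shutil
from pathlib import Path

def extract_fcpxml(bundle_path: str, destination: str) -> Path:
    bundle = Path(bundle_path)                 # e.g. "My Project.fcpxmld"
    # Assumption: the text XML sidecar inside the package ends in .fcpxml
    xml_files = list(bundle.glob("*.fcpxml"))
    if not xml_files:
        raise FileNotFoundError(f"No .fcpxml found inside {bundle}")
    return Path(shutil.copy(xml_files[0], destination))

# Usage: hand the copied file to an app that doesn't read fcpxmld yet.
# extract_fcpxml("My Project.fcpxmld", "/tmp/My Project.fcpxml")
```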
Beyond specific NLE implementations, ML is making its way into almost every part of the production process: storyboarding, production breakdowns, voice casting, digital sets, smart cameras, synthetic presenters, digital humans, voice cloning, music composition, voice overs, colorizing, image upscaling, frame rate upscaling, rotoscoping, background fill, intelligent reframing, aging, de-aging and digital makeup, “created” images and action, logging and organization, automatic editing (of sorts), templatization of production, analytics and personalization, storytelling, and directing.
That’s a lot, and it’s only the examples that I’ve kept records of! I’m sure there are many more I’ve missed here.
Based on recent headlines making all sorts of claims about Artificial Intelligence (AI), it’s reasonable to wonder if your job is going to be taken over by AI. Can any sort of machine do creative work? What sort of workplace will it be if AI takes over? In this article I take an in-depth look at all the ways that AI and Machine Learning (ML) are affecting every aspect of production, from storytelling to visual effects.
A shorter overview of the material is over at the Frame.io blog, as Here to Help: Machine Learning, AI, and the Future of Video Creation, which is an excellent resource. For the full story, keep reading.
Over recent years we’ve seen headlines claiming AI “edited a movie trailer,” cut narrative content in different styles, “makes a movie,” and many more like them, that have to have us all wondering if we’re going to be replaced. It doesn’t help when Wired magazine claims Robots Will Take Jobs From Men, the Young, and Minorities.
Machine Learning (ML) – where AI’s rubber hits the road – is creating a new crop of super-smart assistants we can use to amplify our personal creativity. It’s already providing (in at least some NLEs) improved retiming, facial recognition, color matching, color balance, colorizing, visual and audio noise reduction, and upscaling.
I’ve been writing about the effect of AI on the production industries since my first post in July 2010, Letting the Machines Decide. That was the first time I wondered what effect the burgeoning field of AI might have on creative production. It remains a strong interest, because the more of the routine we can take out of the production cycle – by whatever means we can without compromising the results – the more time we can spend on the creative parts of the job. In that article I revealed my bias:
So, with that background, I believe that a lot of editing is not overtly creative (not you of course dear reader, your work is supremely creative, but those other folk, not so much!). It can be somewhat repetitive, with a lot of similarities.
In 2017 I wrote a series of articles on Artificial Intelligence and Production: an overview, then in part two the 2017 “now,” and in part three my expectations of the future. I wrote a lot more on AI on that blog. Search for “Artificial Intelligence” for the full list. In that article I concluded:
Change is inevitable. Our response to it is where we have control. We can ignore or fight off the incursion of AI and ML into our world, or we can embrace the increase in productivity and how we can focus on the truly creative – imaginative & original – parts of what we do.
In the relatively short four years since I wrote those articles, the field has exploded. My personal interest in the field overlaps with my professional goal to free as much time for creative work as possible. We’ve created a lot of software tools (Intelligent Assistance Software, Lumberjack System) and use some Machine Learning tools in some of them.
We’ve even been “accused” of creating an AI editing tool in First Cuts for FCP back in 2006. First Cuts did a very good job of creating an automated first string-out with story arc, appropriate b-roll and even lower thirds. It seemed like magic!
It was a technical triumph and a business failure! The perception was that entering the metadata took more time than was saved with the automated tool, totally overlooking the creative inspiration that comes from being able to explore five different string-outs in as long as it takes to play them.
The need to enter metadata was its Achilles’ heel, so I assumed that we would soon have useful metadata automatically derived from the media, to feed into the First Cuts algorithm. I saw it as a way of creating the perfect metadata tool. As we’ll read in the section on Amplifying Metadata, ten years later we’re no closer!
We are reaching the tipping point where AI, or more accurately Machine Learning, is already in some of the tools we use every day and in new tools available now, or coming fast enough to affect your and my future careers.
The biggest winners from technology developments over the last 20-30 years have been individual creatives and small teams. For example, the DV format empowered thousands of creative people who had previously been held back by the cost of conventional (at the time) production crews.
As an example, I connected with Kay Stammers shortly after the release of FCP 1 via 2-pop.com. Kay and her husband Tristan ran a small independent production company that had been limited by the cost of quality production at that time. By adopting FCP and DV cameras they amplified their ability to produce their stories many times over.
AI and ML are the same. ML-based smart assistants will amplify the ability of individual creative teams to amp up their output. It’s like a super-sized version of my goals. If you embrace it, you’ll become an Amplified Creative, multiplying your output and/or upping your quality.
The headlines I referred to at the start of the article are designed to sell the story, so it’s not surprising that most are simply not true – at best, exaggerations. No machine edited a trailer, or created a movie.
ML has been used to write scripts, with laughable results. While tools like Magisto create seemingly magic results, they are essentially smart templates with some good Machine Learning models* that determine which parts of a shot to choose. More later.
*The result of training a Machine Learning tool is to create a Model. The Model then continues to do the job it was trained to do. Models learn and evolve over time.
Even though the headlines are misleading, what is developing is exciting. ML models have pulled selects, extracted metadata, and performed dozens of smart assistant tasks – color grading, automatically ducking music under dialog, converting speech to text, and many others you’ll read about in this article, right through to synthetic humans.
It’s true that ML-driven assistants will change the employment landscape, but that is the way it has always been. The Wired article I linked to at the start does suggest that more jobs will be created as some are eliminated. Technology has always been a fundamental part of visual storytelling, from the Lumière Brothers onward, and the staffing needs have always trended down. Filmmaking Magazine tells us:
Machines have aided and enabled filmmaking since the advent of the camera, followed shortly thereafter by innovations in sound. Later, computers would disrupt the entire process, from preproduction through post and onward to distribution and discovery. Today, a new wave of intelligent machines waits in the wings, ready to dramatically transform how stories are told and ultimately sold.
“I’ve thought a lot about the invention of photography — before that, creating photorealistic images required talent and training. Then, the camera came along. It didn’t make painting irrelevant. The camera set painting free. One of the things we’re doing is setting writing free.”
The natural response to “all of this” is to confirm that “no computer can do my job.” You’d be completely right…until the day you are not. While it’s the rare job that will be completely replaced by ML-based tools, there are many time-consuming tasks within the creative process that can be improved or streamlined by ML, just as many have been improved or streamlined by technology in the past. Who off-lines tape-to-tape any more?
The many non-creative parts of the process get in the way of creativity, and these smart tools are going to free us to be Amplified Creatives. Freed of the boring, Amplified Creatives have more time to be creative. For example, production tasks like color grading will be simplified with smart grading software like Colorlab.ai, or Adobe’s Sensei-powered color matching in Premiere Pro.
Machine Learning has certainly been very good for the audio and visual processing side of the business, providing some very exciting tools and workflows.
It has also become very good at extracting all kinds of metadata from footage – text from signage, content descriptions, brand recognition, famous people, etc. – and, not least, speech to text and semantic extraction have improved to the point of usefulness. That’s very good for media libraries where you don’t know what you’ll need to search on until the time comes. In Media Asset Management, more metadata is a good thing. It is less useful for extracting the organizational metadata needed for documentary and reality productions.
While ML has become good at extracting keywords, it’s still terrible at providing useful keywords – keywords relevant to the needs of the production, not generic metadata. I’ll explore this in more detail in the section on Amplifying Metadata.
The danger of ignoring technology and workflow shifts is that others will take the Amplified Creative approach and embrace every smart assistant they can to enhance their creativity and productivity. Those that don’t will appear unproductive.
Before I discuss why the Amplified Creative approach is the right one, what are AI and ML?
What are Artificial Intelligence and Machine Learning?
We’re not there yet. We may never create a true AI. AlphaGo, Sophia, self-driving cars, and to a lesser extent personal assistants like Alexa, Siri, et al. are as close as we get to Artificial General Intelligence, and they are not close!
What we do have are incredible advances in Machine Learning, where a ‘machine’ (technically a bunch of Neural Networks, but we don’t really care) is typically trained on a very large number of graded examples, and then “learns” what it is supposed to learn. (There are other ML approaches, and I’ll mention them later.)
For example: Stanford researchers have trained one of Google’s deep neural networks to recognize skin lesions in photographs. By the end of training, the neural network was competitive with dermatologists when it came to diagnosing cancers from images. While still not perfect, it’s an impressive result. It won’t take the job of a dermatologist, but it’s a valuable tool for maximizing the dermatologist’s productivity. The machine does the pre-scan, and anything slightly abnormal is referred to the dermatologist for final diagnosis. An Amplified Dermatologist, if I may.
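For readers who like to see the mechanics, here is a toy sketch of that “training on graded examples” process using scikit-learn. The data is fabricated on the spot purely to show the shape of the workflow; a real diagnostic model uses deep networks and vastly more (real) labeled images.

```python
# Toy supervised-learning sketch: learn from labeled ("graded") examples,
# then measure how well the model does on examples it has never seen.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
features = rng.normal(size=(1000, 32))      # stand-in for 1,000 "photos", 32 features each
labels = (features[:, 0] + features[:, 1] > 0).astype(int)  # stand-in for expert grading: 0 = benign, 1 = refer

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=200)
model.fit(X_train, y_train)                 # the "learning" on graded examples

# Held-back examples measure how well the model generalizes to new cases.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))
```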
The ML approach differs from traditional apps by learning rather than being programmed. Although people think that First Cuts for FCP was some sort of ML or AI, it was most certainly not. It quite cleverly modeled and embedded a version of my approach to editing interviews into stories, in classic programming style.
While we’ll see ML doing many, many exciting things (here’s my four-year-old list) I don’t see it even reaching the capability of First Cuts in my lifetime. Better extraction of metadata to drive that programmed algorithm, for sure, because I – along with a lot of other people – believe the future of Machine Learning in post production is symbiotic: the Amplified Creative concept.
“So, as AI enables these things to become more spontaneous, we’ll have a larger army of people developing creative work and there will be more demand.”
I’m not entirely convinced that demand will grow that dramatically, but then I’m not of a generation making videos for their circle of friends.
For the current part of my career, my business goal is to “take the boring out of post,” which is why I’m so bullish on ML in production. Our goal has always been to automate the boring or repetitive parts of post, so that creative people can focus on being creative. The next generation of smart assistants will process most of the repetitive work so more time can be spent on storytelling and the look and feel of the finished product.
Tools are not competition. In an article at PolSci.com summarizing a World Economic Forum panel on AI, titled The Future won’t be “Man or Machine”, it will be Symbiotic, IBM CEO Ginni Rometty – a member of a five-person panel in Davos offering their views on AI – said that technology should augment human intelligence.
“For most of our businesses and companies, it will not be ‘man or machine,’” Rometty said. “…It’s a very symbiotic relationship. Our purpose is to augment and really be in service of what humans do.”
“With a greater computational information processing capacity and an analytical approach, AI can extend humans’ cognition when addressing complexity, whereas humans can still offer a more holistic, intuitive approach in dealing with uncertainty and equivocality in organizational decision making.”
This is a best of both worlds approach. The best of what humans will continue to bring to the table, with ever smarter assistants filling in. Machine Learning is good at a lot of things, but since it is effectively “rote learning” it’s hard to imagine a spark of complete originality arising. Humans, particularly when freed of repetitive drudgery, are very good at original thinking.
“Well, very roughly and with many variations and nuances, it’s the idea that although automation – increasingly that automation facilitated by AI and machine learning – can do many tasks previously carried out by people much better than people can, automation plus people providing cognitive skills at the right point will outdo pure automation by itself. That’s your centaur – part horsepower part personpower.”
He cites chess as an example. Computers now beat grandmasters, but a grandmaster plus computer beats computers (until it reaches the point where every game ends in a draw, which then takes the human benefit out of the equation). According to Wooton:
“In this model, humans are seen to add value to the process of automation. They perform a role of bringing expertise, experience, imagination and insight to prioritise and pick from the heavy processing of large amounts of data and algorithmic pattern-finding. In this model, Kahneman’s System 2 thinking brings value to big data crunching. In the knowledge economy especially, humans only bring value when placed at the right point in the automation chain.”
As creatives, we have to constantly adapt to new tools. Digital non-linear editing was so much more efficient than tape-to-tape offline, even when both still needed a high-end edit bay to conform in!
We’ve gone from an era of 40 lb “portable” standard definition cameras and recorders to a pocketable device that shoots at least eight times the resolution with better color fidelity, at frame rates unheard of at the time. That same device runs NLE software – LumaFusion – to edit, title, grade, add effects, encode and publish to streaming platforms with little setup or effort.
Other apps on that same device can switch multiple cameras and live stream them to the Internet. Even 20 years ago that was a truck, a couple of hours of setup, and an expensive uplink. Now your ‘truck’ also doubles as your phone!
The tools evolve, and smart creatives take advantage of the way the symbiotic model frees them to spend more time on the “personpower” that Wooton described. These digital tools will enable individual creatives to produce high quality finished stories.
When everyone can attain a certain standard of visual quality, it will force those who are truly gifted to set a new standard of creativity and originality. And I think the world will be even more beautiful and entertaining as a result.
Now, in 2021, there is a wide range of ML-based tools available across the whole spread of production: from indicating which scripts are likely to succeed, to micro-targeting audiences, to digital humans, to AI color grading, and so, so much more.
I’ll discuss some of the specific smart assistants that are available now, or soon will be, across the entire production process, that will turn you into an Amplified Creative.
Not all production is “Hollywood” and the vast majority of production that’s outside “Film and TV” will benefit from amplified production. Solo producers and small groups will benefit most from the smart camera mounts (and software), autonomous drone camera platforms, and automatic switching tools.
Let’s take a look at how ML is now, or soon will be, affecting each area of production in turn.
I frequently get asked if I think there will ever be AI “Automated Editing.” It’s somewhat of a moot point, as there is already a whole lot of automated editing going on! It’s been going on for a long time.
Wibbitz, a tech startup in New York, has developed a program to create videos automatically by adding footage and stills to an audio track that it generates. Link
With directr you choose the template and the app then tells you what shots to take. It then builds them into a video based on the template. Link
SundaySky create up to 1.4 million unique videos a month: SundaySky, a Tel Aviv/New York-based start-up with 50 employees, with no video camera or production staff, will produce 1.4 million video clips this month for a range of big retail and real estate corporate customers including Overstock.com and the History Channel. The company pulls customer data into a customized template which creates videos with movement, music, narration, graphics and video. Link
Of those five-year-old examples, directr is no longer with us, but Wibbitz is a current competitor to Magisto. Technically none were AI or ML based back then.
If your source material is highly structured, then template-driven editing is fairly straightforward. Our company, Intelligent Assistance Software, created a custom app for a client that built FCP Classic sequences based on a template XML and their repository of highly structured files. I imagine SundaySky were doing something similar.
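As an illustration of how straightforward template-driven editing can be with highly structured sources, here is a hedged sketch. The template format and clip records are invented for this example; the client app described above generated real FCP sequence XML, which this sketch doesn’t attempt to reproduce.

```python
# Hedged sketch of template-driven editing: fill each slot of a fixed template
# from a repository of structured clips, producing an ordered cut list.
TEMPLATE = [
    {"slot": "intro",   "max_seconds": 5},
    {"slot": "feature", "max_seconds": 20},
    {"slot": "details", "max_seconds": 15},
    {"slot": "outro",   "max_seconds": 5},
]

def build_sequence(template, clips_by_slot):
    """Return an ordered cut list by filling each template slot from structured source clips."""
    sequence = []
    for slot in template:
        candidates = clips_by_slot.get(slot["slot"], [])
        if not candidates:
            continue                          # skip slots we have no material for
        clip = candidates[0]                  # highly structured sources = trivial selection
        sequence.append({
            "clip": clip["name"],
            "in": clip["in"],
            "out": min(clip["out"], clip["in"] + slot["max_seconds"]),
        })
    return sequence

clips = {
    "intro":   [{"name": "logo_sting.mov", "in": 0, "out": 4}],
    "feature": [{"name": "property_walkthrough.mov", "in": 12, "out": 45}],
    "outro":   [{"name": "call_to_action.mov", "in": 0, "out": 6}],
}
print(build_sequence(TEMPLATE, clips))
```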
“Apple’s “Memories” movie feature (in their iOS “Photos” App) provides a great glimpse where AI is taking editing. In very little time, you can create and select a photo album, add a music style, determine the video length (Short, Medium or Long) and – PRESTO – your video is created in just a few seconds”
Automated personalized videos. Of course, Apple now have the Clips app, which I confess I’d never used until now. At a minimum, Clips is using ML for speech-to-titles, and for scanning for Augmented Reality effects, like the light-up dance floor I’ve always wanted in my office!
Clips is targeted directly at consumers, while Wibbitz and Magisto are focused on a corporate market, but open to individuals. My associate Cirina Catania (a producer-editor herself) was very impressed with the results from Magisto:
I did nothing except shoot/pick the clips, graphic style and music – they cut!
Magisto – using ML models – chose the parts of the uploaded clips with the most action, and matched them into templates. Cirina and I were both impressed with the results. As recently as 2016, choosing clips based on action (and emotion) was a co-operative research project between IBM Watson and 20th Century Fox to pull selects for the Morgan trailer. Now it’s part of a mainstream app creating hundreds of videos a day.
Magisto, and its competitors, are certainly templates at their core, but intelligently selecting the shots that go into those templates is where ML is adding significant value.
Basing creative endeavors on templates has been a trend for many years, culminating in Hollywood’s use of its own history as templates for its current production.
Whether this is a good or bad thing depends on whether you value your personal creativity, or you’re pushing a budget to get a project finished.
Honestly, if you’ve watched any significant amount of the House Hunters franchise on HGTV, the template would be obvious. All that’s needed is a little real-time logging and the right ML models, and at least the first assembly is done, ready for final shot selection and trimming.
Sports is another area where automated editing is making strong inroads into finding and building custom highlight or recap packages. Asset Management specialist Evolphin added the ability to “automatically edit videos using AI-generated metadata”. Back in 2019 Evolphin VP of Video Product Management Evan Michals told StudioDaily:
“The killer app, of course, is AI video editing — the ability to automatically assemble exactly the clips that are needed for a given recap or highlight reel. Instead of having an editorial staffer pore over footage and manually select relevant clips, Evolphin says, the system can be used to search for the face of a given player at specific moments in a game, such as goals or penalties, or for types of action, like cars drifting on a race track.
With the correct filters applied, Zoom will retrieve the corresponding moments and generate a sequence that can be saved as a video file or exported as XML for use with any video editing software that recognizes XML files.”
Video intelligence platforms using machine learning help marketers gain an increasingly granular understanding of where their audience is watching and why they’re watching. Such insights enhance a marketer’s ability to interpret and act on their audience’s wants, needs, and goals.
That ability to speak directly to people’s priorities will only increase as hyper-personalization becomes more routine — in future years, marketers may be able create AI-driven video content targeted to an individual viewer.
Other future potentialities in AI for video include gesture control, use- and context-specific tagging to improve product discovery, and neuromarketing and biometric sensing to monitor viewer response. Context-aware marketing, which is also gaining traction, uses natural language processing to better place video ads against relevant video content.
At this point I can hear voices (and I can identify some of them) and they’re saying “but that’s not real editing!” In the sense that those voices mean it, I will agree. While all these types of automated editing have their place, they are not creative editing in any form.
Clips, Magisto, Memories, etc. have their uses. If I’d used Magisto, maybe the video from our last Zion National Park trip – about eight years ago – would have been edited before we went back again. This editor never got to the personal project, so the automated tool would have given a better result, because it would be a result!
Birthday party videos would be edited promptly, instead of when someone can “get around to it,” or in time for their 21st birthday compilation video!
Super smart templatorization could conceivably assemble the basic cut of a highly formulaic show like House Hunters, ready for a pass by the “craft” editor.
Outside the highly formulaic content that lends itself to templatorization, there’s no suggestion, or research, that creative editors and creative editing will become obsolete in the foreseeable future, but a studio camera switcher might feel a little threatened, particularly after seeing the demise of studio camera operators!
Back in 2014, before the modern ML era, Disney Researchers “…have developed a groundbreaking program that delivers automated edits from multi-camera footage based on cinematic criteria.”
An interesting research project, but not in the least “real” editing, and also not particularly practical.
It’s going to take a couple of quantum leaps forward for any sort of AI to do the creative editing of even a junior-level editor. Thirteen years later, nothing comes close to our own First Cuts, which was most definitely not AI.
The obstacles to an autonomous creative editor are phenomenal. The sheer complexity of the task, the lack of a training set, and the lack of any other suitable training method push the likelihood way off into the future. In the present, we can use all the organizational advantages of visual and speech metadata to amplify our personal creativity as editors.
If you know me, you’ll know I’m a metadata nut. I’ve never met metadata I didn’t like! For me this is both an extremely exciting time and yet somewhat disappointing.
We now have the tools to transcribe accurately, extract text from moving images, identify objects in the video, and extract keywords, concepts and even emotion, and that is all good.
In fact, for all types of searching, whether it’s in a project or any form of asset management system, this metadata is awesome. Tools like FCP Video Tag make it possible to find that perfect b-roll shot in your FCP Library, or, with AxelAI, in your asset library.
Metadata is information about the media files. It can be technical metadata describing the file format, codec, size, etc, or it can be information about the content – what we also call Logging!
For discovery and organization we need the logging metadata to be concise and associated only with the range in the media file that it’s relevant to. One way to achieve that is to isolate subclip ranges and organize them in Bins associated with topics, or in Markers with duration. For me the perfect embodiment of logging metadata is Final Cut Pro’s Keyword Ranges, which self-organize into collections.
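For the more technically inclined, here is a small sketch of what range-based logging metadata looks like as data. The field names are mine, not Final Cut Pro’s XML schema; the point is simply that each keyword applies only to the range it describes, and the ranges self-organize into collections.

```python
# Sketch of range-based logging metadata, loosely modeled on FCP Keyword Ranges.
from dataclasses import dataclass

@dataclass
class KeywordRange:
    clip: str          # which media file
    start: float       # seconds into the clip where the keyword applies
    duration: float    # how long it applies
    keyword: str       # the logging term, e.g. "Climate Change"

ranges = [
    KeywordRange("interview_01.mov", 65.0, 42.0, "Climate Change"),
    KeywordRange("interview_01.mov", 310.5, 18.0, "Family History"),
]

# "Self-organizing collections": group every range by its keyword.
collections = {}
for r in ranges:
    collections.setdefault(r.keyword, []).append(r)
print(sorted(collections))   # ['Climate Change', 'Family History']
```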
I’ll be focusing on Visual Search and Natural Language Processing, but there are many commercial and open source tools for extracting or embedding technical metadata, including Synopsis.video. It also allows semantic searching of movies using terms like “an interior, closeup shot outside, in a vineyard with shallow depth of field”; CinemaNet (part of the Synopsis set of tools) will understand and match, because it has been taught those visual concepts.
At the last IBC we were able to attend, we saw an IBM Watson pod that was extracting metadata about the content of a fast moving car race, on the fly. The car ID, any advertising on the car, any text on signs in the background, description of the background, etc. There was an avalanche of information being extracted in real time from this race footage. More on the avalanche in a minute.
The challenge with visual search is that most of these ML services are on the web, and uploading video is challenging, especially when it’s only for extracting metadata. Audio for transcription is practical, as the bandwidth is a fraction of video’s. In fact, some services require the media to already be in ‘the cloud.’ For example, Google’s video search requires the video to be stored in Google’s cloud. From there it can be indexed, even programmatically from 3rd party apps, but first there’s the upload.
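As an illustration of the “already in the cloud” pattern, here is a hedged sketch using Google’s Video Intelligence client, assuming the clip has already been uploaded to a Cloud Storage bucket; the bucket and file names are placeholders, and the exact shape of the results varies with client library version.

```python
# Hedged sketch: ask a cloud service to label a video that is already uploaded.
# Assumes google-cloud-videointelligence is installed and credentials are set up.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.LABEL_DETECTION],
        "input_uri": "gs://my-footage-bucket/broll/vineyard_drone.mp4",  # placeholder
    }
)
result = operation.result(timeout=600)       # long videos take a while

# Print the labels detected across the video's segments.
for label in result.annotation_results[0].segment_label_annotations:
    print(label.entity.description)
```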
To be really useful, Visual Search has to be integrated with the NLE and not require an upload. Face detection and identification has been in Blackmagic Resolve since version 16, but not generalized video search, yet. I’ve seen technology previews of Premiere Pro that included integration with IBM Watson visual search that have never been in a released version. Technology previews aren’t a sign of future product, but they do point to the direction of a company’s thinking.
In the near future, that may be time you get back. AI is already learning how to “view” images, recognize notable elements and layouts, and automatically tag and describe them. Taking this a step further, Concept Canvas, one of our Adobe Research projects, will allow users to search images based on spatial relationships between concepts
Until visual search is integrated into your favorite NLE, there are some interim solutions. Most major Media Asset Management systems include some form of visual indexing and searching. AxelAI, for example, performs all the analysis on the local network.
FCP Video Tag uses a number of different analysis engines to create Keyword Ranges for FCP in a stand-alone app.
Visual search is great when the primary thing we’re interested in is visual. When it comes to interviews, the only visual metadata that would be available for this image:
would be: Philip Hodgetts (although more likely “Middle Aged White Male,” which is a whole lot less useful); Medium Wide Shot; Living Room. We would have no useful information about the content of that 30-minute interview. That is where Natural Language Processing takes over. Because it starts with transcribing speech, it could be thought of as the speech equivalent of visual search!
Transcription, or speech-to-text, is now a mature technology. I’ve watched our transcription provider for Lumberjack Builder NLE improve significantly in flexibility and accuracy since my original testing in 2018. Back then punctuation was perfunctory at best – normally a bad guess – and speaker identification was way off, but word accuracy was as high as 99.97% in one of my examples.
It’s now three years later and a recent test, ahead of re-introducing speaker identification to Builder NLE, accurately identified two very similar sounding female voices throughout the interview, with punctuation that would make an English teacher proud.
It’s just three years since we had unacceptable punctuation and speaker identification. Since then, punctuation and speaker identification have become as accurate as the word transcription, which has also improved in that time and become easier to use.
Accurate transcription arrived in a tidal wave of ongoing improvements, so when you are inclined to dismiss some of the technologies I’ve been introducing you to, remember it’s not where the technology is right now, it’s where it’s going to be in two to three years that will affect your career.
If a research project could lead to changes in the way your job will be performed, it’s good to know far enough ahead that you can decide on the best way it can be used to amplify your creativity, and employability.
Accurate transcription in most major languages is now a given. It is what we can do with that metadata that becomes more interesting.
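As a taste of how routine the transcription step itself has become, here is a hedged sketch using the open source SpeechRecognition package, with Google’s free web recognizer standing in for a commercial provider. The file name is a placeholder, and a production service would also return punctuation, word timings and speaker labels.

```python
# Hedged sketch: basic speech-to-text with the SpeechRecognition package.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("interview_01.wav") as source:   # placeholder: 16-bit PCM WAV
    audio = recognizer.record(source)              # read the whole file

try:
    transcript = recognizer.recognize_google(audio)  # free web recognizer as stand-in
    print(transcript)
except sr.UnknownValueError:
    print("Speech was unintelligible")
```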
In order to improve speaker identification, our provider has done some intricate work with spectral tone maps. In the absence of any visuals, it’s the only option.
A Google Research project in 2018 attempted to “better identify who is speaking with a noisy environment,” specifically a noisy environment of other human voices at the same frequencies and approximate levels! This is a much more challenging task than identifying when a voice changes in a two or three person interview in a (relatively) quiet environment.
They solved the problem the way humans do! In that situation humans are able to focus their attention on one or two speakers in a crowd using what is called the “cocktail party effect.” You can easily work out who’s speaking and what they’re saying; a microphone, not so much. Google’s researchers got their machine to look at the video and see whose mouth was moving! Easy for us to do; again, much more challenging for the machine.
The point of accurately detecting speaker changes and identifying speakers is to produce interview “chunks” that make sense for keyword and emotion extraction.
We’ve been waiting for a good keyword extraction tool for a long time. In an earlier version of the Lumberjack Lumberyard app, we used to extract keywords from transcripts. We pulled the feature because the keywords were rarely useful. To a machine a keyword is, literally, a key word, and that’s a popularity contest! The words that appear most frequently become keywords, but may not represent the concept accurately.
They are rarely the Keyword that I would use to describe the content. In my Builder NLE demo I work with media from EditStock where one of the major themes is Climate Change. Those words rarely appear, and almost none of the discussion about Climate Change is tagged by automatic keywording.
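To see why frequency-based keywords disappoint, here is a minimal sketch of that “popularity contest” in Python, with an invented snippet of transcript. The topic is clearly climate change, but the words “climate” and “change” never appear, so no frequency count can surface the concept.

```python
# Why machine "keywords" are a popularity contest: a naive frequency-based extractor.
from collections import Counter
import re

STOP = {"the", "a", "and", "of", "to", "in", "we", "it", "that", "is", "was", "on"}

transcript = (
    "The summers were never this hot when we were kids. The river dried up "
    "and the farm lost the whole crop two years running. Everything we knew "
    "about the seasons just stopped being true."
)

words = [w for w in re.findall(r"[a-z']+", transcript.lower()) if w not in STOP]
print(Counter(words).most_common(5))
# -> frequent words like "were", not the concept "Climate Change"
```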
MonkeyLearn specializes in extracting keywords from a transcript and has no input into the transcription process itself. You have to create a free account to access it, but their Keyword Extraction tool gave these results from a paragraph of that Climate Change project.
For reference, my Keyword was Climate Change, which was certainly in the results, along with another nine ‘keywords’ that weren’t useful. A 10% accurate tool, where you can’t predict which 10% is right, is unlikely to be a useful production tool.
IBM Watson does a little better with Keywords by providing a relevancy ranking. If you are a Lumberjack Builder NLE user, you might have noticed a relevancy column in the Keyword Manager. We included that with the expectation we’d be offering automatic (useful) Keywords by now where the relevancy would help rank (and remove) Keywords. It remains unused.
Watson also has the ability to extract Concepts, which is much closer to what we need for organizing and analyzing interviews. While Climate is discovered as a concept, Climate Change (the Keyword I used) does not appear. As with Keywords, there are nine or ten unhelpful Concepts in with the useful one(s).
As far as automated extraction of useful keywords for organization goes (as opposed to searching, where Visual Search is very useful), it’s simply not here. Yet! Until the technology moves forward, the most efficient way of logging/keywording is to do it in real time during the shoot with Lumberjack System’s iOS Logger, or with any of the other tools in the Lumberjack Suite. The second fastest way to log or keyword is using the transcripts in Lumberjack Builder NLE.
I thought that Tone Analysis – how happy or angry the speaker is – would be a useful tool, particularly for Reality TV, where high emotion is an essential ingredient! Unfortunately Tone Analysis has the same need for human curation, which makes it a net negative in production.
Working with Transcripts
No mainstream NLE is a good environment for working with transcripts. There are many ways to get transcripts into various NLEs, but none of them are a good environment to work with transcripts, because in none of them can you actually write the story.
Before the digital era, and well into it, a story writer would take all the text documents of the transcripts and copy/paste into a story document that some unfortunate editor had to conform. There’s nothing akin to a word processor for text in NLEs. I include Avid’s ScriptSync in this blanket condemnation! ScriptSync is a useful tool for finding and comparing takes when editing narrative (scripted) content, but it pales next to Builder NLE for editing with transcripts.
There are only two apps that take working with transcripts seriously: Descript and Lumberjack Builder NLE. Descript is positioned directly at podcasters and single take presenters, with a suite of tools that suit that market very well.
Builder NLE is the result of our frustration with the best we could do bringing transcripts into Final Cut Pro Classic back in 2010, and in Final Cut Pro X in 2014. While there is some utility in finding words in transcripts in the NLE, we hated the experience and knew we could do better since we are in the software business.
That is why we created Builder NLE as a hybrid word processor and video editor to take advantage of the abundance of accurate and inexpensive transcripts.
Maybe we will have automated, useful keyword extraction from transcripts for organizing one day. I expected we would have by now, but I remain disappointed. Transcripts in the right environment are the second easiest way to add Keywords, and the foundation of amplified storytelling, but for the moment those Keywords are going to be manually entered.
Visual search will become ubiquitous in NLEs over time and is an incredibly valuable tool for finding specific visuals.
The magic of our industry has always been to create something that isn’t real, out of some real, and some created, elements. Well, imagine if those elements were just a bunch of pixels, then you have a grasp of where this is going: creating video out of a description alone.
Another necessity for Mike Pearl’s ‘Black Box’, it would also be a great tool for educational and corporate production. Instead of Alton Brown needing elaborate props and sets to illustrate the inner workings of culinary concoctions, simply describe the example you want and your Black Box will create it for you. In full 8K HDR!
Okay, we are nowhere near that yet, but the research is surprisingly advanced. Keep in mind how much advancement we’ve seen over two years in other examples, like Jukebox between 2014 and 2016: from 80’s video game to senior living commercial backing track!
These may be research projects now, but I expect them to be very exciting when I update this article in 2023!
Let’s start back in 2016, when AI researchers at MIT used a GAN similar to what we talk about in the Digital Humans section. They pitted a Neural Network generating video based on a still frame against another Neural Network trained to pick fake video, with a feedback loop between them.
In the video linked above, you can see (tiny) examples of the input frame, and the next 1-2 seconds of video, as the Neural Network expects it to be, that is good enough to fool the fake video detector Neural Network! They don’t quite pass the human eye, yet, but the thing to remember is that these technologies iterate extremely fast.
Google have been busy, with a project from 2019 that can “create videos from start and end frames only” of very simple action, like a person walking across a simple background. According to the article:
The most impressive video clip happened to be one in which the AI system generated the next one second of crashing waves in the ocean.
…novel networks that are able to produce “diverse” and “surprisingly realistic” frames from open source video data sets at scale. They describe their method in a newly published paper on the preprint server Arxiv.org (“Scaling Autoregressive Video Models“), and on a webpage containing selected samples of the model’s outputs.
“[We] find that our [AI] models are capable of producing diverse and surprisingly realistic continuations on a subset of videos from Kinetics, a large scale action recognition data set of … videos exhibiting phenomena such as camera movement, complex object interactions, and diverse human movement,” wrote the coauthors. “To our knowledge, this is the first promising application of video-generation models to videos of this complexity.”
Nobody is predicting fully finished scenes for that Black Box in the next few years, but when they arrive, the tools will be able to create scenes and backgrounds limited only by our ability to describe them.
I can imagine an interface like Metahumans’ where there are sliders for hills, foliage, water, etc. that interact in real time, until that is supplanted by voice control: describe the scene, get the scene! Or, simply pluck it out of my brain.
Using AI to generate 3D holograms in real-time
Is a Holodeck program a production? When it comes to generating new pixels representing fully synthetic images, perhaps the distinction is unimportant. In May 2021 MIT published a paper:
A new method called tensor holography could enable the creation of holograms for virtual reality, 3D printing, medical imaging, and more — and it can run on a smartphone.
Technology that was expected to be more than “10 years out” looks like becoming a reality on smartphones in the future. Not quite the Holodeck, but pretty cool.
I am uncertain whether digital actors and sets belong in production or post production! If there’s no location or actors to shoot, is it production at all? That philosophical discussion is for a future generation to decide, but digital actors in digital sets are here now.
We already have tools like Massive for simulating large crowds and background talent. Blue/green screen keying is being partially supplanted by the LED stage developed for The Mandalorian. Although not directly AI/ML related, we’re at the tipping point between the keying techniques pioneered decades ago with Mary Poppins’ sodium vapor screen technique, and LED screens powered by tools like the Unreal Engine.
The future may in fact be Unreal! The Unreal Engine is the foundation of the Mandalorian sets and environments, but it’s also the engine at the heart of Metahumans – very realistic, fully synthetic humans.
Actors have breathed life into ‘synthetic’ characters for years in animation. I expect we’ll see another generation of actors who power synthetic, but visually complete, “human” actors, thanks to the Unreal Engine. Metahumans inherit all the character animation tools in Unreal Engine. Check out the videos on the product page.
Unreal’s Metahumans aren’t the only synthetic humans on the horizon. While aimed at different markets, Neon are also creating realistic humans.
At a minimum, an increase in the use of Metahumans and similar developments, will allow actors to amplify their career beyond physical or age limitations. Actors could work from home by having a simple performance capture rig in their home office!
Synthetic actors will require human driven performances: for now. While I’m not aware of any research teaching a machine how to create an emotionally real performance, at the spokesmodel level, we are already there. Synthesia generates an “AI presenter” video from the text you type in!
There are digital presenters already in use, covered later in this article. A related field, also covered further into the article, is foreign language dubbing that modifies the original actor’s mouth to match the new language, in the original actor’s voice!
Thanks to ML we have a BBC newsreader delivering the news in English, Spanish, Mandarin and Hindi, even though Matthew Amroliwala only speaks English! Japanese network NHK has gone one step further by dispensing with the newsreader entirely, to create “an artificially intelligent news anchor.”
Another obvious application is the translation of movies into additional languages for foreign markets. Instead of human dubbing, the translation is automatic, the voice is cloned in the new language, and the lips are morphed for the new lip sync! It’s not quite living up to that hype yet, with a prototype that only works with a still image of the face, albeit with good lip sync. The time will come when an evolved version makes any movie available in any language, believably – even if you’re sending home movies back to family in the “old country.”
That leads us to deepfakes, as the examples so far have relied on some of the same techniques. Deepfakes can raise dead actors to star again, or convey birthday greetings from a long dead father. Of all the technology we’ve talked about, deepfakes are the most controversial.
Our first example is from the BBC, where London-based start-up Synthesia created a tool that allows newsreader Matthew Amroliwala to deliver the news in Spanish, Mandarin and Hindi, as well as his native English. Amroliwala only speaks English.
The technique is known as deepfake, where software replaces an original face with a computer-generated face of another. (There’s a whole section on deep fakes coming up.)
Amroliwala was asked to read a script in BBC Click’s film studio and the same phrases were also read by speakers of other languages.
The software, created by London based start-up Synthesia, then mapped and manipulated Amroliwala’s lips to mouth the different languages.
Researchers for the Japanese TV Network NHK have gone further and created a fully artificial presenter for news and live anchor roles! NHK’s application only creates a monolingual presenter, but it’s inevitable the two streams of research will merge. For now, here’s the summary of a research paper AI News Anchor with Deep Learning-based Speech Synthesis,
Kiyoshi Kurihara et al. present a detailed discussion of how they use a deep neural network and speech synthesis to create an artificially intelligent news anchor. Their system is being used to read news automatically and for AI anchors of live programs on NHK. It is in use now and they hope it will serve as a model for other applications.
Translating and Dubbing Lip Sync
When I presented at an event in Barcelona in November 2016 I relied on professional translators to “dub” my presentation into Spanish for the majority of the audience. (They had reverse translators for us English speakers during the Spanish presentations.)
Translation and lip-sync issues plague movies and television shows as they move into new markets with a new language. The classic solution has been to overdub the voices with local actors, attempting their best at lip synching with the original, and failing terribly most of the time. Everyone is thankful my translators didn’t try to get any lip synch!
Researchers from the International Institute of Information Technology from the southern city of Hyderabad, India have developed a new AI model that translates and lip-syncs a video from one language to another with great accuracy.
While it can match lip movements of translated text in an original video, it can also correct lip movements in a dubbed movie.
Another research project has Robert De Niro delivering his famous line from Taxi Driver in flawless German, with matching lip movements and facial expressions! Single lines for now, but in the course of writing this article, that became so “last week” (literally). Then I discovered a link to a company claiming to do seamless translation and lip sync across languages: FlawlessAI’s TrueSync.
The short sample on that page supports their assertion:
TrueSync is the world’s first system that uses Artificial Intelligence to create perfectly lip-synced visualisations in multiple languages. At the heart of the system is a performance preservation engine which captures all the nuance and emotions of the original material.
I finished my first draft of this article in May of 2021, where I wrote:
Before you get to that (fully lip synched translation), you’ll see an example of a fully computer generated voice (from text) and performance of a digital actor to match. In this example, there is no performance capture required.
As an indication of just how fast these fields are developing, just four months later we have Synthesia. Synthesia generates a fully artificial “AI presenter” video from the text you type in! With forty “presenters” – avatars, at Synthesia – and over 50 supported languages, it’s a bigger casting pool than I had available for my entire career in Newcastle, Australia!
… synthetic media in which a person in an existing image or video is replaced with someone else’s likeness. … deepfakes leverage powerful techniques from machine learning and artificial intelligence to manipulate or generate visual and audio content with a high potential to deceive.
Early in 2021, deepfakes of Tom Cruise started circulating on YouTube that are very believable. Vlogger Tom Scott also demonstrated just how easy it was to create a deepfake of himself for under $100!
They needed to print the 3D model to shoot on video, because these types of deepfakes usually require large numbers of samples of the target face. Samples of Tom Cruise are available in almost countless movies and TV appearances. Those samples did not exist for Joaquin Oliver, hence the 3D printed workaround.
Creating a deepfake used to require deep pockets, as significant work is required, but it apparently doesn’t have to. Educational blogger Tom Scott set out to create a deepfake Tom Scott. It certainly lacks the fidelity of the other examples on YouTube, but keep in mind his budget was $100. And now DeepFakeLive will deepfake your Zoom or TikTok!
Disney Research Studios is spearheading a high-resolution face-swapping project, which they call “the first method capable of rendering photo-realistic and temporally coherent results at megapixel resolution.”
Furthermore, this algorithm also will make “any performance transferable between any two identities” in any given database.
It’s a fascinating read, and you know if Disney Research is working on it, there are labs in every other major studio conducting technological seances to get dead actors back to work. Not to mention that the body double actors probably don’t command the same premium salary as the long dead star.
What is most intriguing about this project is that the “algorithm” makes performances transferable between any actors – living or dead! A modern actor’s performance would give life to the resurrected actor, much as Metahumans do.
In the short period between draft and final, the field of deepfakes has moved on to the point that DeepFakeLive will do real-time deepfakes for your Zoom, Twitch or TikTok. From expensive lab project to a real-time toy for your Zoom meetings in around five years should be an indicator of the rapid pace of development with these tools.
However these tools end up being deployed, I feel confident that the foreign language experience is going to improve in the future. Perhaps I could finally watch Drag Race Holland without the distraction of subtitles.
Digital Humans and Beyond
I chose “humans” over actors deliberately, because while Metahumans look real, and are fully animatable in the Unreal Engine, they are still powered by a human performance, much the way actors have breathed life into animated characters for generations.
While fully synthetic performances are fine for Synthesia’s spokesmodels mentioned above, they fall short of the emotional impact of talented actors, even when the performance is tied to a non-human ‘actor,’ as we see in the digital Andy Serkis performance mentioned below.
Early attempts at animated ‘human’ characters, like in The Polar Express, got lost in the ‘uncanny valley’ effect, where the more human the character is supposed to be, the more unreal it becomes. However, the ability to synthesize human faces has made massive strides in recent years thanks to Generative Adversarial Networks (GANs), which are one of the ways that ML can train itself.
Simply put, a GAN pits two machines against each other, so instead of requiring a huge hand-labeled training set, the adversarial machines train each other. For example, one machine will attempt to generate a human face, while the adversary tries to detect which faces are fake. Success (or not) is fed back to the original machine and the cycle repeats – millions of times a second! You’ll find a much more detailed explanation at KDnuggets, both about GANs and specifically about generating realistic human faces.
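To make the adversarial loop concrete, here is a deliberately tiny sketch in PyTorch. The network sizes, image size and learning rates are arbitrary placeholder choices, not taken from any published face-generation model; a real face GAN uses convolutional networks and far more capacity.

```python
# Tiny GAN sketch: a generator and a discriminator training each other.
import torch
import torch.nn as nn

LATENT = 100          # size of the random "idea" the generator starts from
IMG = 64 * 64 * 3     # a tiny 64x64 RGB "face", flattened (placeholder size)

generator = nn.Sequential(
    nn.Linear(LATENT, 1024), nn.ReLU(),
    nn.Linear(1024, IMG), nn.Tanh(),          # outputs a candidate "face"
)
discriminator = nn.Sequential(
    nn.Linear(IMG, 1024), nn.LeakyReLU(0.2),
    nn.Linear(1024, 1), nn.Sigmoid(),         # outputs "probability this is real"
)

opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
loss = nn.BCELoss()

def train_step(real_faces):
    batch = real_faces.size(0)
    real_labels = torch.ones(batch, 1)
    fake_labels = torch.zeros(batch, 1)

    # 1. The adversary learns to tell real faces from the generator's fakes.
    fakes = generator(torch.randn(batch, LATENT))
    d_loss = loss(discriminator(real_faces), real_labels) + \
             loss(discriminator(fakes.detach()), fake_labels)
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # 2. The generator learns from how well it fooled the adversary.
    fakes = generator(torch.randn(batch, LATENT))
    g_loss = loss(discriminator(fakes), real_labels)  # "success is fed back"
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```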
Metahumans are the next advancement of that work. With ‘design your human’ tools that work by dragging sliders for how each feature should appear, they are extremely flexible and very realistic. The Unreal Engine gives each Metahuman full animation capabilities driven by an actor’s performance, although keep in mind that the performance and the character are completely independent of each other.
On the Metahumans page at Unreal, one example is a digital Andy Serkis performed by the actor Andy Serkis. The digital version is very convincing, but click through to the details and you’ll find the same performance applied to 3Lateral’s Osiris Black!
This, like animation acting, should bring a lot more freedom to actors. 2021 Michael Horton, for example, could still star in a remake of Murder, She Wrote as his younger self! The way Hollywood is, though, it will probably be used to impose new performers on the identity and image of long-dead stars.
While creating new performances for these digital humans is going to be a much bigger challenge than the digital spokesmodels, I expect that performances will be sampled in the way visuals are sampled for deepfakes, then used to create new performances. It won’t be Academy Award worthy, but for Mike Pearl’s ‘Black Box’, it will be an essential component.
If your needs are more toward the spokesmodel than the actor, then Synthesia lets you choose your spokesmodel and then type in your text for a moderately natural performance, quite good enough for corporate work. Synthesia’s spokesmodels are likely more natural than those on local, late night TV ads!
While ML is making inroads into every aspect of production, it certainly seems like Visual Processing and effects show the most impressive results.
ML is already in upscaling, noise reduction, frame rate conversion, colorization (of monochrome footage), intelligent reframing, rotoscoping and object fill, color grading, aging and de-aging people, digital makeup, facial emotion manipulation, and I’m sure there’s more that I’ve missed.
Although they could be grouped under Visual Processing and Effects, I’ve chosen to group digital humans and deep fakes in another section. Similarly, there’s a separate section devoted to fully synthetic image creation.
If you think we’ve been able to manipulate ‘reality’ to create something new with traditional compositing and effects tools, just wait until you see what’s coming.
Noise Reduction, Upscaling, Frame Rate Conversion and Colorization
As these tools tend to be used together for processing old footage into a condition usable in a modern production, I’ve bundled them under the one heading.
Colorizing monochrome footage isn’t anything new. I recall the horror that arose when they first started colorizing Black and White movies. In fact hand coloring of movies had been done way before then, but it was very manual and extremely labor intensive.
A new AI colorizing process takes the characteristics of the original film stock into account, and then introduces sub-surface scatter to make the image look like it was taken with a modern camera. That’s a little more sophisticated than a color wash!
Or, at a simpler level, DeepAI will colorize film frame-by-frame from monochrome images, without the sophistication of imitating sub-surface light scatter. I had not realized how much my aunt (in this picture, on the right) resembles my niece of a generation later, until I saw it in color. DeepAI reminds me more of the very old process of hand tinting black and white images with water color paints.
Upscaling/frame rate conversion
Frame rate conversion isn’t anything new. We’ve been using “optical flow” technologies for years to create in-between frames that were never shot. The technology just keeps getting better, as demonstrated in this presentation of DAIN: Depth-Aware video frame INterpolation. In the demonstration they take 15 fps stop motion animation up to a very smooth 60 fps. It’s another open source project and you can download it for a Patreon donation.
Intelligent Reframing comes into play when we need to convert between video formats. Framings that work well for widescreen 16:9 aren’t always going to work in a square or vertical format. We’ve seen ML-driven examples in both Premiere Pro (Sensei) and in Final Cut Pro, but if you’d rather roll your own, Google published an Open Source project: AutoFlip: An Open Source Framework for Intelligent Video Reframing.
AutoFlip provides a fully automatic solution to smart video reframing, making use of state-of-the-art ML-enabled object detection and tracking technologies to intelligently understand video content.
I don’t plan on writing my own reframing software, but the existence of an Open Source project strongly indicates that the use of ML for reframing is a known and mature technology.
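Still, the core idea is easy to sketch. The snippet below isn’t AutoFlip; it’s the simplest possible version of the same concept in Python/OpenCV: find the most prominent face, then cut a vertical 9:16 window that keeps it centered.

```python
# Not AutoFlip itself, just the core idea in miniature: detect the subject,
# then crop a vertical 9:16 window around them. Real reframers track the
# subject over time and smooth the crop path.
import cv2

def reframe_vertical(frame, face_cascade):
    h, w = frame.shape[:2]
    crop_w = int(h * 9 / 16)                 # width of a 9:16 crop at full height
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) > 0:
        x, y, fw, fh = max(faces, key=lambda f: f[2] * f[3])  # largest face
        center_x = x + fw // 2
    else:
        center_x = w // 2                     # fall back to a center crop
    left = min(max(center_x - crop_w // 2, 0), w - crop_w)
    return frame[:, left:left + crop_w]

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
```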
Rotoscoping and Image Fill
While there are many tedious jobs in the field of production, surely rotoscoping for object removal, or cloning for image fill, must be at the top of the list. I remember the excitement surrounding Puffin Designs’ Commotion when it was released with the ability to automatically clone from the same place on other frames. It was a huge step forward that made a very, very tedious job just plain tedious!
It’s an area where research has resulted in practical solutions. Adobe’s entry into the “AI Powered” world of rotoscoping, Rotobrush 2, was a leap forward in reducing the tediousness of the job.
Keep in mind that ML models being used now will continually improve. Right now we have the ability to click on the object and it will be isolated, or removed and back filled with real imagery.
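Under the hood, “remove and back-fill” is a segmentation mask plus an inpainting pass. The sketch below uses classical single-frame inpainting in OpenCV, with a hypothetical pre-made mask standing in for the ML-generated one; real video tools also keep the fill consistent from frame to frame, which this doesn’t attempt.

```python
# A rough stand-in for "click to remove and back-fill": given a mask of the
# unwanted object (assumed to come from a segmentation model or a hand-drawn
# matte), classical inpainting fills the hole from surrounding pixels.
import cv2

frame = cv2.imread("frame_0042.png")                         # hypothetical source frame
mask = cv2.imread("object_mask.png", cv2.IMREAD_GRAYSCALE)   # white = remove
filled = cv2.inpaint(frame, mask, inpaintRadius=5, flags=cv2.INPAINT_TELEA)
cv2.imwrite("frame_0042_filled.png", filled)
```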
Within a week of my first draft including a reference to RunwayML’s ability to rotoscope and infill, they announced they were adding Smart Mask, which uses text to select objects.
It seems like magic, but their (experimental) demo shows selecting a red car in the image by typing ‘red’ and ‘car’ in the text box! Want to select the sweater? Click on the word ‘sweater’ and it’s done. There doesn’t seem to be a way to link to the video, but it was posted in their Twitter feed.
I’m not sure if Adobe have competing AI teams, but the same blog reveals SceneStitch, where you can select an area of an image to remove, and “AI” will search for a new image to replace the area, then blend everything seamlessly. Similarly, with SkyReplace, ML removes the old sky, searches a host of possible replacement skies, and blends the new one into the existing image.
Adobe’s philosophy behind these tools sounds a lot like Amplified Creativity!
With applications like these, it’s not hard to see how AI could become not only the ultimate time-saver for creatives, but a powerful, virtual creative agent.
Aging, De-aging and Digital Makeup
Not long ago, aging or de-aging an actor wasn’t possible. Then it became available in the rarified realm of high-end film technology like Flame, which uses ML to create pixel correction and aging/de-aging tools. An article at FXguide outlines the many steps needed to create these sorts of manipulations.
If you read through that in-depth article, consider that consumer apps like Snapchat and FaceApp are doing all the processing in real time, on a hand held device. There’s an amazing amount of power in identifying and tracking the facial features, then applying ML Models to age and de-age faces.
While it’s for still images in Photoshop, Adobe’s new NVIDIA and AI-powered Neural Filters include emotion editing! The new Smart Portrait feature:
…takes headshots into the Snapchat zone, giving you a bunch of sliders for things like happiness, anger, surprise, age, hair thickness, direction of gaze, angle of head and the direction of light on the face.
Many features appear in single image processing apps first, because of the challenges of keeping results coherent across many frames, but eventually the temporal coherency problems are solved and the tool moves from still to video. We can probably start manipulating emotions in video before 2024!
Being able to easily modify the appearance of faces in real time has serious consequences, and not just on dating apps! Magda Skrzypecka wrote an article for ediblorial.com on the effects of face enhancing technology on social media and movies.
Arguably, up until now, most of the examples of ML in production have been to automate relatively mundane tasks. Where it has been most successful is in the symbiotic meshing of machine and man that I’ve tagged Amplified Creativity.
Now we move on to an application of ML that’s moving right into the creative realm. Colourlab Ai is AI-powered Color Grading. It would seem that color grading requires the human eye, and it does. Like every other tool I’ve discussed, the developers position Colourlab Ai as an amplifier:
Colourlab Ai enables content creators to achieve stunning results in less time and focus on the creative. Less tweaking, more creating.
Although the overall job is highly creative, the camera matching and color matching parts are not. Once again we see the machine enable the human to be an amplified creative.
Audio Post, along with Visual Effects, has benefited greatly from innovative ML-based tools. There are tools on the horizon for voice cloning, isolating voices from the sound of a crowd, and automated mixing. More challenging are automated music composition and voice overs, which are rapidly getting ready for Prime Time. They’re not there yet, but as I mentioned in Amplifying Production, automated voice overs are ready for education/training and even corporate production.
Fully automated “radio” (audio only) news is on the horizon. These ML tools will take a basic data feed from a sports-ball game, for example, and format it into an article that is then “read” by an ML voice over. I doubt it will be long before that research and BBC Research’s synthetic newsreader merge! Listeners, or viewers, would never know there wasn’t a human involved.
With all the attendant ethical complications, voice cloning will forever end the need for “frankenbites.” Voice cloning once needed large samples of a voice before it could synthesize new words, but now requires less than a minute of the sample voice to accurately create new words in that voice. It doesn’t provide any visuals, but that’s only a temporary setback.
Much more complex than the technology are the ethical issues. The ability to reliably create words in someone’s voice, words they never said, is very open to abuse. Even in the context of Frankenbites, how far is “fixing” the line, and where does outright fake take over?
It’s not only audio voices that are being faked; deepfakes create a compelling, but fake, visual of a person. Much more on deepfakes later.
Also interesting to note is that many of these technologies are now mature enough that there are open source versions, so programmers can add them to their apps.
No one is suggesting that any machine is going to be finalizing the mix on any production, but there is a field of research studying Intelligent Mixing Systems that seeks to understand how a mixer authors content, and to decide what parts can be automated to some extent to “improve the efficiency of the people involved in content creation in terms of leveraging some of their production tools using AI.” Translate that to the language of this article and they’re talking about Amplifying the mixer.
Sunil Bharitkar, principal research scientist for AI research at Samsung Electronics, points to a recent article titled “Context-Aware Intelligent Mixing Systems” in the Journal of the Audio Engineering Society (AES) penned by European researchers that addresses so-called Intelligent Mixing Systems (IMS). The article suggests that human creative skill and AI tools could potentially co-exist nicely as long as context is factored into the collaboration, meaning human decision-making needs to be essentially the controlling factor in how, when, and to what degree IMS technology is used in the creative process.
The same group penned a second article on one approach to intelligent music production tools aided by ML—in this case, the potential for a deep neural network to automate much of the process of creating drum mixes.
It’s only a research project, but – once again – the goal is to amplify the creative, not replace them.
One part of ML’s inroads into post production that is threatening employment is the explosion of text-to-speech generators being pitched as human replacements. The technology grew from the more familiar text-to-speech we hear every day on our phones.
Long-term iPhone users will appreciate the progressive improvements over the years of the Siri voice(s). The most recent versions are good, but easily spotted as text-to-speech. We use the macOS System voices for temp voiceover in the Lumberjack Builder NLE, but we have no pretensions they’re anything but an alternative to recording your own voice.
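If you want the same kind of temp voiceover for your own rough cuts, the macOS System voices are scriptable. A minimal sketch, assuming you’re on macOS and happy with an AIFF file:

```python
# Render temp voiceover with the built-in macOS text-to-speech engine
# (the same System voices), driven from Python via the 'say' command.
import subprocess

def temp_voiceover(text: str, out_path: str, voice: str = "Samantha") -> None:
    """Render text to an AIFF file with the macOS 'say' command."""
    subprocess.run(["say", "-v", voice, "-o", out_path, text], check=True)

temp_voiceover("Scene twelve, take three: temp narration only.", "temp_vo.aiff")
```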
The Siri voices are built from samples of the human voice extracted from hours of recordings. You know what else is good at examining lots of examples and creating a model? If you said “Machine Learning” you’d be right.
These days my Facebook feed is full of ads for Talkia, Speechelo, Speechor, and others. The results of some are far better than expected. Of those I listened to, I found Speechelo the most natural, and I could definitely use that quality for short training videos. I’m sure everyone would agree that it would be an improvement on recording myself!
Natural Reader was highly regarded in one article I read, but I didn’t find it all that natural. To me, Speechelo is the least obviously fake.
These synthetic voiceovers are only going to get better as research continues. The current state of the art would be sufficient for product demos and training videos. Combine it with a synthetic human and you have Synthesia: perfectly useful in the corporate world.
Voice Cloning generates a synthetic human voice that has very little difference from the human it was sampled from. In practical terms, that means someone could take a few seconds of my recorded voice, and then put whatever words they like ‘in my mouth’. It’s very useful for correcting simple errors, as in the Descript editor, but has some serious ethical implications when you’re cloning someone else’s voice.
It’s such a useful tool that Descript spent a good chunk of their $15 million Series A funding on buying Lyrebird, which was incorporated as Overdub in Descript’s online editor. In the Descript editor you can re-voice yourself as simply as typing in the replacement words.
If you combine Voice Cloning, mouth morphing (already used in several examples yet to come) and some existing video of me, I could seem to be saying the most outrageous things, even if not a word of it ever left my mouth under my control! Until recently, the ethical issues were diminished by the need to have hours of recordings of the source voice in order to successfully clone it.
Modern voice cloning requires only a few seconds sample, which is why it can work in Descript.
No mainstream NLE has included an Overdub-like feature, but if you feel like creating your own, you’ll need a few seconds of the voice to sample and this GitHub project.
Real-Time-Voice-Cloning SV2TTS is a three-stage deep learning framework that allows to create a numerical representation of a voice from a few seconds of audio, and to use it to condition a text-to-speech model trained to generalize to new voices.
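For the curious, the project’s own demo boils down to three calls, one per stage. A condensed sketch follows; the module and function names are taken from that repository’s demo script at the time of writing, and the model paths are illustrative, so check the current README before copying:

```python
# Condensed sketch of the three SV2TTS stages, following the repo's demo
# script (names and pretrained-model paths may have changed since writing).
from encoder import inference as encoder          # 1. speaker encoder
from synthesizer.inference import Synthesizer     # 2. text-to-spectrogram
from vocoder import inference as vocoder          # 3. spectrogram-to-audio

encoder.load_model("encoder/saved_models/pretrained.pt")          # illustrative path
synthesizer = Synthesizer("synthesizer/saved_models/pretrained/")  # illustrative path
vocoder.load_model("vocoder/saved_models/pretrained/pretrained.pt")

# Stage 1: embed a few seconds of the target voice into a fixed-size vector.
reference = encoder.preprocess_wav("my_voice_sample.wav")
embedding = encoder.embed_utterance(reference)

# Stage 2: condition the synthesizer on that embedding to speak new words.
spectrograms = synthesizer.synthesize_spectrograms(
    ["Words the speaker never actually said."], [embedding])

# Stage 3: turn the spectrogram into an audio waveform.
waveform = vocoder.infer_waveform(spectrograms[0])
```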
Research continually finds ways to improve cloned voices. In January 2021, Cornell published a research paper on more expressive voice cloning.
Any attempt at automatic translation of movies into foreign languages – a subject covered in more detail later in the article – has to involve voice cloning.
In my 2017 AI and Production: Overview article I wrote of Jukedeck’s music composition learning machine. Most ML examines large numbers of examples and “works out for itself” how to do what you want it to do. Their methodology is hidden behind a paywall, but Jukedeck progressed dramatically between this example from 2014 and this one from just two years later in 2016. From bad ’80s game music to music good enough for background tracks, or maybe a health or retirement commercial!
Jukedeck 2.0 became a commercial product offering custom music tracks, but is sadly no longer with us. Research goals have moved from simply wanting the machine to compose music, to wanting Aiva to become one of the greatest composers in history! Aiva stands for Artificial Intelligence Virtual Artist.
Amper is a new startup that recently raised $4 million in funding and also offers music created by AI. Amper’s focus is to empower users to “instantly create and customize original music for your content.”
Per their website, Amper claims that “Your music is uniquely crafted with no risk of it being used by someone else.” Even better, the site also states that “Amper provides you with a global, perpetual-use, and royalty-free license, with no conditional or unexpected financial expenses.”
While the best attempt at an AI musical was “as pleasant as a milky drink,” background music composition by machine now equals typical music library fare. And almost as if I had written their website, Aiva refers to itself as “A Creative Assistant for Creative People.”
I think it’s important to reiterate that “production” is an umbrella term covering everything from big budget feature films to corporate, education, YouTube and TikTok. Tools designed for high-end feature film production, usually with large crews, are probably not going to be at all relevant to a Tom Scott producing informational videos on YouTube! Solo and small production crews will benefit most from having their efforts amplified by ML, while the immediate effect on the next Marvel blockbuster will be negligible.
In between there are lots of exciting developments. Google had a research project in 2017 that automatically picked the best angle from a multicam shoot, or directed a single camera to stay on the subject. Carnegie Mellon and Disney Research have a project that edits from among the many cameras at an event – “social cameras.”
More practically, there’s a slew of new smart camera mounts for consumers that track an individual and keep them on camera. One of the new features of Apple’s iPad Pro, announced in April 2021, is Center Stage: the camera automatically tracks one or two people in the image and frames accordingly.
There are smart drones being trained to better follow performers, and even autonomous drones doing all the filming!
Then there is the question of what exactly is “production?” If we can create synthetic actors in synthetic sets, is that still production?
Even though production has shown the lowest infiltration by ML, there are still a lot of exciting tools coming.
Automated Action tracking and camera switching
The first time I heard of ML being used to direct and switch a multi-camera live event was at an Entertainment Technology Center special event on AI in Film and TV Production back in 2017. There, Google presented a research project that automatically edited live video by choosing the best angle, or directed the view of a single camera based on audience or player attention. Unfortunately I cannot find a direct link to this research.
The same ML Model can customize trailers for specific audiences, or even create personalized versions. They used Google’s TensorFlow, Cloud Vision and Video Intelligence. No single ML Model achieved the total goal, but several working together are able to handle some seemingly human tasks.
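Developers can experiment with the same building blocks today. As one small, hypothetical example, Google’s Video Intelligence API will return shot boundaries for a clip, which is exactly the kind of raw material an automated editor works from (the bucket path is made up, and field types vary a little between client library versions):

```python
# Ask the Google Cloud Video Intelligence API for shot-change boundaries.
# Requires the google-cloud-videointelligence client and valid credentials;
# the input_uri below is a placeholder.
from google.cloud import videointelligence

client = videointelligence.VideoIntelligenceServiceClient()
operation = client.annotate_video(
    request={
        "features": [videointelligence.Feature.SHOT_CHANGE_DETECTION],
        "input_uri": "gs://my-bucket/game_footage.mp4",
    }
)
result = operation.result(timeout=300)

# In recent client versions the offsets come back as timedeltas.
for shot in result.annotation_results[0].shot_annotations:
    start = shot.start_time_offset.total_seconds()
    end = shot.end_time_offset.total_seconds()
    print(f"Shot from {start:.2f}s to {end:.2f}s")
```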
In the world of YouTube and TikTok, or other smaller scale or independent production, this is a huge amplification, particularly for those projects that are out of the lab and available.
Smart Camera Mounts
While you may not have been watching, smart camera mounts like Pivo or Pixo/Pixem have hit the market. Face and body tracking in the apps associated with the mounts uses the phone’s camera and processing power to auto track a presenter as they walk around. They also add special effect modes so talent can replicate themselves, etc.
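In miniature, the tracking loop inside these mounts looks something like the sketch below: detect the face, measure how far it sits from the center of frame, and turn that error into a pan command. The send_pan_command function is a hypothetical stand-in for whatever motor API a real mount exposes.

```python
# Toy smart-mount tracking loop: find the face, compute how far it is from
# the frame center, and emit a normalized pan command. Real products add
# body tracking, smoothing, and gimbal control.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
capture = cv2.VideoCapture(0)

def send_pan_command(speed: float) -> None:
    print(f"pan speed: {speed:+.2f}")       # stand-in for the real mount API

while True:
    ok, frame = capture.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) > 0:
        x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # largest face
        error = (x + w / 2) - frame.shape[1] / 2             # pixels off-center
        send_pan_command(error / frame.shape[1])             # normalized pan speed
```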
My experience with a Pivo has been positive. This and similar mounts are very useful tools for solo presenters like myself, making it easier for one person to create content. It’s an amplification of the individual creative, which is the core of Amplified Creativity. The “machine” empowers human creativity that might otherwise lie dormant, particularly during a lockdown when we couldn’t get together to shoot.
Without Pivo I’d have to find someone to shoot for me, communicate to them the shots that I want, then review to ensure we got what I want. A smart camera mount has freed me to be more creative, more often.
Although not part of any shipping Adobe app, a blog post The State of AI in Video from January 2019 reveals the company’s thinking on the subject.
AI is also useful on the production side of video, with automated camera technology improving filming options and video quality. AI-enabled camera equipment reacts to gestures, recognizes and tracks subjects, and swivels to follow action or oral commands.
OBSBOT is an “AI-Powered. Auto-Tracking Phone mount for use with any app.” It’s going to become a crowded market very quickly. As I said above, Pivo uses the camera and processing power in the phone, which is why it can do the many special effects modes. OBSBOT has its own camera and AI processing so it is fully self contained. The ability to work with any app – particularly Filmic Pro – made it attractive enough that I backed it while doing the research for this article. It’s working very well, until it loses sight of my face!
Even without the smart mounts, we can have automated camera tracking of one or more presenters. Apple’s iPad Pro, announced April 2021, has a feature integrated into the camera called Center Stage. From the press release:
“The new ultra wide camera on the iPad Pro pans automatically to follow you as you move around—and widens out if a second person enters.”
While training to fly a drone for the ill-fated 2012 Solar Odyssey project, I saw a whiteboard in the drone company’s headquarters that laid out their software roadmap. It was obvious that everything I was training to do was going to be redundant within a few short years. Autonomous and flock-controlled drones are now a reality, and they’re getting involved in production:
New developments in AI technology are making it much easier to maintain framing (camera angle and distance) while filming movie sequences from the skies. Those in the know have already seen drones which can follow a person as they jog, climb or go biking but if you turn 180 degrees away from the camera, it will film your back.
This impressive drone-specific camera technology allows a director to determine a shot’s framing (for instance: a profile shot, straight on, over the shoulder, etc.). Incredibly, these framing directions are preserved by the drone when the actors move around.
A sci-fi film called In the Robot Skies was made entirely with autonomous drones filming. Facial tracking and object avoidance using ML make autonomous camera drones much more accessible than the days when I had to be the brains of the drone! In the Robot Skies was pitched as an example of how solo filmmakers can amplify their production:
This example may provide a glimpse into the future where solo filmmakers on tight budgets can shoot entire projects without needing human camera operator.
You can see the In the Robot Skies teaser on Vimeo. This is another example of collaboration between man and machine to create an amplified creative.
Research into autonomous camera drones goes further still; one team describes their approach this way:

In this work, we propose a deep reinforcement learning agent which supervises motion planning of a filming drone by making desirable shot mode selections based on aesthetical values of video shots. Unlike most of the current state-of-the-art approaches that require explicit guidance by a human expert, our drone learns how to make favorable viewpoint selections by experience. We propose a learning scheme that exploits aesthetical features of retrospective shots in order to extract a desirable policy for better prospective shots. We train our agent in realistic AirSim simulations using both a hand-crafted reward function as well as reward from direct human input. We then deploy the same agent on a real DJI M210 drone in order to test the generalization capability of our approach to real world conditions. To evaluate the success of our approach in the end, we conduct a comprehensive user study in which participants rate the shot quality of our methods. Videos of the system in action can be seen here.
Before a single frame of a feature film or television series is shot, it is likely ML has been involved. Writers focus on whether or not an AI can write a script, but that’s probably the wrong question. Attempts at storytelling are the subject of the last section, but we can safely say that script writing will – at a very minimum – need input from humans for a very long time to come.
The right question would be whether there are ML based smart assistants that are in use, or proposed, for pre-production. None of these write scripts, but like all creative amplifications, these tools are freeing creatives to spend more time on what they alone can do. Whether it’s helping decide which projects to green light, to automated storyboarding tools in development, to breakdowns and budgets, or voice casting, ML is already creating Amplified Creatives.
“Netflix uses machine learning to determine expected hours of viewing for each piece of content, estimate the cost per hour viewed, and compare it with that of similar content deals. Additionally, the firm uses predictive models to understand customers, such that there is a large enough set of content that meets their preferences without necessitating the renewal of any specific title. This cost-effectiveness is particularly important as increased competition bids up licensing and renewal agreements.
“They use similar techniques to predictive marketing in planning scripts, to see what’s being talked about, what’s popular. Trying to identify for scripts what might resonate with the audience,” he explains. “But we haven’t quite got to the stage yet where we can replace scriptwriters and even really good actors.”
Here again is the Amplified Creative approach using the best of human and machine together.
Movies and TV projects have always tailored themselves to the audience. It is, after all, the hallmark of The Hallmark Channel. It’s also the point of focus group advance screenings. Movies have had entirely new endings added after that feedback, notably Sweet Charity. When projects meet their audiences’ needs, they make money. Creatives who make money often get the opportunity to do it again!
Tailoring scripts to customize them to audiences is ubiquitous. That we do it better with the help of machines is no surprise.
While essential to the creative process, storyboards are challenging because they entail a lot of work that won’t ever be seen outside the production. What would be helpful is for a smart tool to analyze the script and automatically generate (useful) storyboards.
“Disney Research and Rutgers take the idea one step further with an end-to-end model that can create a rough storyboard and video depicting text from movie screenplays. Specifically, their text-to-animation model produces animations without the need for annotated data or a pre-training step, given input text describing certain activities.”
Not yet a production tool, but a whole lot closer than in 2008. The article is worth reading because it explains some of the challenges of extracting the right information from the script. This type of useful smart tool takes multiple layers of analysis, understanding and visualizing, all incredibly complex in themselves.
It’s also worth reading just to note how quickly these fields are advancing. In an aside at the start of the article, they quote a precedent to the current automatic tool from 2018, just a year earlier:
“Last year, researchers detailed a system that tapped a pair of neural networks — layers of mathematical functions modeled after biological neurons — to generate videos 32 frames long and 64×64 pixels in size from descriptions like “playing golf on the grass.”
This is a computer generating imagery and animations from text descriptions (now) parsed from Disney’s scripts. We still need the human to write the script, but I wouldn’t invest a lot in a storyboarding career!
Breakdowns and Budgeting
Another task that seems well suited to automation is the process of breaking down a script into cast, locations, crew, etc. It is an important adjunct to the creative production process that is essential for the financial well-being of the project.
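To see why this lends itself to automation, consider how much of a breakdown starts as plain text parsing. The toy sketch below pulls scene headings and character cues out of a plain-text screenplay; it’s nothing like a commercial tool, and the regular expressions are simplistic assumptions about formatting, but it shows where the machine can start.

```python
# Toy illustration of one slice of a script breakdown: count scene locations
# (from INT./EXT. sluglines) and character cues (centered ALL-CAPS names).
# Real breakdown tools go far beyond this simple text parsing.
import re
from collections import Counter

SLUGLINE = re.compile(r"^(INT|EXT|INT/EXT)\.\s+(.+?)(?:\s+-\s+(DAY|NIGHT))?$")
CHARACTER_CUE = re.compile(r"^[A-Z][A-Z .'\-]+$")

def breakdown(script_text: str):
    locations, characters = Counter(), Counter()
    for raw_line in script_text.splitlines():
        line = raw_line.strip()
        slug = SLUGLINE.match(line)
        if slug:
            locations[slug.group(2)] += 1
        elif CHARACTER_CUE.match(line) and len(line.split()) <= 3:
            characters[line] += 1
    return locations, characters
```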
Startup RivetAI develops ML-infused moviemaking tools designed to streamline preproduction. One of its three cofounders, Debajyoti Ray, explains:
“It helps people to create content much faster,” Ray told VentureBeat in a video conference, “by using AI to augment creativity. Everything starts with data.”
Once again, the goal is to augment creativity, not to replace it. Anyone who has broken down a script into storyboards, shot list generation, optimizing schedules, and creating budgets knows that these are tedious tasks that are essential, but not at the core of creativity. Necessary to create the framework where creativity happens, but in the way of creativity.
RivetAI automates that process from the finished script, changing a multi-week task to one done in minutes, with the option of human override because it’s still not perfect.
From the breakdowns RivetAI calculates a budget:
“With no more than a script, it can estimate the required number of shoot days and prep days (down to the length of scenes), predict a project’s total budget, and spit out a line-item, department-by-department bill of materials.”
The article makes it clear that this is intended as a ballpark figure, not one to raise funds on, but automating so much tedious work out of the process frees resources to go into the production itself. The co-founders are also:
“… adamant that RivetAI’s suite of tools only aid, rather than interfere with, the creative process.”
You can find a discussion of the data science and philosophy of RivetAI by Debajyoti Ray on Medium.
Voice casting is another area getting ML attention. This example isn’t a generalized voice casting agent replacement, but rather a tool for matching the talent recording localized versions of movies and television to the vocal qualities of the original voice.
“Exploring Automated Voice Casting for Content Localization using Deep Learning” by Aansh Malik is a technical paper that explores the use of deep learning to automate what is now a largely manual workflow for casting voiceover talent across languages and cultures. … Malik considers ways to leverage developments in deep learning for text-independent speaker verification (TI-SV) to enable computer-aided voice casting. https://ieeexplore.ieee.org/document/9395671
In what I expect will become a common story, by the time automatic voice casting for localization becomes a readily available tool, it will have become irrelevant. Given advances in automatic translation, voice synthesis and facial manipulation, the more likely future is a suite of automated tools that translate the script, synthesize the new language in the voice of the original actor, and manipulate the mouth and face of the actor from the original shoot so they mouth the newly translated performance. You’ll visit some of the research projects attempting to make that a reality in a later part of this article.