
Can a computer really recognize an individual face, or a car?

In this attempt to summarize the state of a technology and its application to production and postproduction, my focus is on image recognition, including facial detection and recognition. We’re exposed to facial recognition/detection technology in some current apps – Premiere Pro CS5 onward, iPhoto, Final Cut Pro X, Picasa and Facebook – with mixed success.

Firstly, the distinction between facial detection and facial recognition is the difference between recognizing “oh, that’s a face” and “oh, that’s Philip Hodgetts’ face”. That’s a huge distinction. Most digital still cameras sold today recognize faces in the image and attempt to lock focus on them. Heck, Smile Click recognizes that a subject is smiling before taking a picture, and that’s a 99c app! Facial detection is a developer framework in iOS 5, making it easy to add facial detection to any app – for tracking, changing or modifying! Sanyo have facial detection (they call it Face Chaser) in a video camera. In professional cameras, Fujifilm have released a lens with TRACE technology that tracks faces in shot to maintain focus and exposure – up to 12 faces in a shot.

In postproduction software, both Premiere Pro (CS5 and later) and Final Cut Pro X attempt facial detection. Neither attempts to identify the individual, just to detect that there are one or more people in the image. Subjectively, it’s still a work in progress: when it works it’s great, but even facial detection isn’t perfect with moving video. Final Cut Pro X further takes this useful metadata (how many people are in a picture) and attempts to infer the shot type from the size of the detected faces: big faces = closeup; many small faces = wide, as the sketch below illustrates. (I’m sure that’s a gross simplification of someone’s very hard work, but you get the point.)
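
To make that concrete, here’s a minimal sketch of that kind of shot-type inference, using OpenCV’s stock Haar cascade face detector. The thresholds and the filename are my own guesses for illustration – not anything Apple or Adobe has published:

```python
# Sketch: detect faces in a frame and infer shot type from face size.
# Uses OpenCV's stock Haar cascade; the thresholds below are invented.
import cv2

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def shot_type(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return "no faces"
    frame_area = frame.shape[0] * frame.shape[1]
    # Largest face as a fraction of the frame drives the guess.
    largest = max(w * h for (x, y, w, h) in faces) / frame_area
    if largest > 0.15:
        return "close-up"
    if len(faces) >= 3 and largest < 0.02:
        return "wide"
    return "medium"

frame = cv2.imread("frame.jpg")   # one extracted video frame (placeholder)
print(shot_type(frame))
```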

Facial recognition is valuable because it allows us to quickly group shots with the same person (or character) in them without any additional work. Taking the boring out of post, as I like to say. We’d have to identify the face once, and probably deal with some false identifications, but if it gets more accurate than iPhoto currently is, it could be very useful.

I’d also suggest that it’s likely to be more useful, and more accurate, in a video project than in a general iPhoto library because of the more limited set of examples for each person. iPhoto (pre-Polar Rose technology) does a good job when it’s presented with a set of images of the same person from roughly the same time. The more the examples spread across a lifetime of photos, the more the accuracy is reduced. So, with the more limited time scale of the typical video project, I’d expect better accuracy.
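
Here’s one way that identify-once, group-the-rest workflow could look, sketched with OpenCV’s LBPH face recognizer (from the opencv-contrib package). To be clear, this is one generic classical technique, not what iPhoto or any NLE actually uses, and the filenames and distance threshold are invented:

```python
# Sketch: name a face once, then group new face crops by predicted person.
# LBPH is one classical approach, not iPhoto's or any NLE's method.
import cv2
import numpy as np

recognizer = cv2.face.LBPHFaceRecognizer_create()

# A few hand-labeled face crops per person: label 0 = "Philip", 1 = "Greg".
training = [("philip_1.png", 0), ("philip_2.png", 0), ("greg_1.png", 1)]
images = [cv2.imread(path, cv2.IMREAD_GRAYSCALE) for path, _ in training]
labels = np.array([label for _, label in training], dtype=np.int32)
recognizer.train(images, labels)

# Group unlabeled face crops from the rest of the footage.
groups = {}
for path in ["shot_012_face.png", "shot_047_face.png"]:
    face = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    label, distance = recognizer.predict(face)  # lower distance = better match
    if distance < 80:                           # threshold is a guess; tune it
        groups.setdefault(label, []).append(path)
print(groups)
```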

Facial recognition is still very much a work in progress. In September 2010 Apple purchased Swedish facial recognition company Polar Rose, presumably to boost the facial recognition technology in iPhoto and across their entire product range (I hope!). Full facial recognition probably won’t make it to Final Cut Pro X this year, but you do have to wonder what they have planned for a “full revision” release (one we might pay for) after all the catch-up features are added. (I have absolutely no idea; this is pure hypothesis/wishful thinking on my part.)

However, if we broaden out a little and consider the research that’s being done – the existing integration of facial recognition in social media and photo sharing sites, and how object detection and computer understanding of what it’s seeing are developing – then it’s obvious that computer-recognized metadata will start to be a viable alternative.

Facial recognition

I mentioned Polar Rose, now integrated into Apple, where I’d expect it will be used to improve the accuracy of facial recognition in iPhoto and Aperture, but also likely to be added to the facial detection framework now in iOS 5. As iOS and OS X are merging, I would expect the same frameworks to become available to OS X developers in due course. From GigaOM – Apple’s iOS facial recognition could lead to Kinect-like interaction:

The unearthed APIs are described as “highly sophisticated,” and can determine where a user’s mouth, and left and right eyes are located, as well as process images taken by the iPhone for face detection. Aside from providing Apple an easy way to introduce Faces (which recognizes specific people in iPhoto) to both its own Photos app and any third-party apps that access that library, it should also open the door for much more advanced facial recognition applications.
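
Apple’s framework aside, you can get a feel for that kind of landmark location with OpenCV’s cascades – find the face, then search inside it for the eyes. This is an independent sketch of the general idea, not Apple’s implementation, and the filename is a placeholder:

```python
# Rough analogue of the described capability: find a face, then locate
# the eyes inside the face region only.
import cv2

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_eye.xml")

img = cv2.imread("portrait.jpg")   # placeholder image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
for (x, y, w, h) in face_cascade.detectMultiScale(gray, 1.1, 5):
    roi = gray[y:y + h, x:x + w]   # search for eyes only inside the face
    for (ex, ey, ew, eh) in eye_cascade.detectMultiScale(roi):
        print("eye at", (x + ex, y + ey), "size", (ew, eh))
```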

Beyond Apple, facial recognition is becoming ubiquitous, to the point where Slate’s Farhad Manjoo wrote in his “Smile, You’re on Everyone’s Camera”:

Soon, face recognition will be ubiquitous. While the police may promise to tread lightly, the technology is likely to become so good, so quickly that officers will find themselves reaching for their cameras in all kinds of situations. The police will still likely use traditional ID technologies like fingerprinting—or even iris scanning—as these are generally more accurate than face-scanning, but face-scanning has an obvious advantage over fingerprints: It works from far away. Bunch of guys loitering on the corner? Scantily clad woman hanging around that run-down motel? Two dudes who look like they’re smoking a funny-looking cigarette? Why not snap them all just to make sure they’re on the up-and-up?

This is absolutely a technology whose time is coming. Further on in the Slate article:

In 2006, Google acquired the biometric recognition company Neven Vision, and Hartmut Neven, one of the world’s experts in computer vision, is a respected engineer at the company. A Microsoft research team in Israel has built a fantastic app that uses face-recognition systems to search the Web for pictures of people who are in your photo album. And last year Facebook rolled out a tool that automatically suggests names of people to tag in your pictures.

The Google acquisition has now rolled out as facial recognition in Picasa.

In late 2011 the New York Times ran a feature article, Face Recognition Makes the Leap From Sci-Fi, in its business section. In the article they list these examples:

  • SceneTap, a new app for smart phones, uses cameras with facial detection software to scout bar scenes. Without identifying specific bar patrons, it posts information like the average age of a crowd and the ratio of men to women, helping bar-hoppers decide where to go. More than 50 bars in Chicago participate.
  • Immersive Labs, a company in Manhattan, has developed software for digital billboards using cameras to gauge the age range, sex and attention level of a passer-by.
  • Those endeavors pale next to the photo-tagging suggestion tool introduced by Facebook this year. When a person uploads photos to the site, the “Tag Suggestions” feature uses facial recognition to identify that user’s friends in those photos and automatically suggests name tags for them. And apparently you don’t need a person’s permission to tag them on Facebook.
  • “Millions of people are using it to add hundreds of millions of tags,” says Simon Axten, a Facebook spokesman. Other well-known programs like Picasa, the photo editing software from Google, and third-party apps like PhotoTagger, from face.com, work similarly.

Powered by technology from Face.com (one of the key players in white-label face detection), Facebook processed 400 million photographs in 30 days!

So, back to my dream of having all the people in my source video recognized and grouped. Perhaps for manual naming (once), but more likely we’ll be able to use existing resources to match the face somewhere and associate the name with it. From a study reported in the Cloud-Powered Facial Recognition Is Terrifying article at The Atlantic:

Unlike Groucho Marx, unfortunately, the cloud never forgets. That’s the logic behind a new application developed by Carnegie Mellon University’s Heinz College that’s designed to take a photograph of a total stranger and, using the facial recognition software PittPatt, track down their real identity in a matter of minutes. Facial recognition isn’t that new — the rudimentary technology has been around since the late 1960s — but this system is faster, more efficient, and more thorough than any other system ever used. Why? Because it’s powered by the cloud.

….

With Carnegie Mellon’s cloud-centric new mobile app, the process of matching a casual snapshot with a person’s online identity takes less than a minute. Tools like PittPatt and other cloud-based facial recognition services rely on finding publicly available pictures of you online, whether it’s a profile image for social networks like Facebook and Google Plus or from something more official from a company website or a college athletic portrait. In their most recent round of facial recognition studies, researchers at Carnegie Mellon were able to not only match unidentified profile photos from a dating website (where the vast majority of users operate pseudonymously) with positively identified Facebook photos, but also match pedestrians on a North American college campus with their online identities.

You need at most one photograph of the individual, plus good matching algorithms. CNET News has a background article that gives another take on the experiment of identifying random people in public spaces.

It’s getting quite common:

Facial Recognition Technologies

Before moving on to other ways computers are “seeing” images, here’s a short summary of the primary technology providers.

  • Google purchased PittPatt and are integrating the technology into Picasa.
  • Apple purchased Polar Rose and are integrating the technology widely through their OS and applications.
  • Facebook relies on Face.com technology. In fact, Face.com is one of the major providers of facial recognition technology, having processed 7 billion photos in the last year through its Facebook apps. The Face.com API is now available for developers (such as Facebook and, well, even Assisted Editing) to use – see the sketch after this list:
  • Developers who are interested in building their own facial recognition apps can now get full access to the open Face.com API, free of charge. That basically means developers can tap into Face.com’s face detection and recognition technology and create brand new ways for friends to engage through photos at zero cost. Hard to beat that offer.
  • It is also well funded: “Yandex, operator of Russia’s largest search engine, has invested in Tel-Aviv based facial recognition technology startup Face.com, marking its first investment in an Israeli company. In total, Face.com has raised $4.3 million in Series B funding in a round led by previous investor Rhodium.”
  • DigitalSmiths.com offer facial recognition as part of its suite of metadata-generation tools, but I’ll expand on that in the next section.
  • The University of Illinois is developing a face recognition system that is remarkably accurate in realistic situations.
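
As promised, here’s a sketch of what calling a Face.com-style REST detection API looks like. The endpoint and parameter names are reconstructed from my memory of the era’s documentation, so treat them as assumptions rather than gospel:

```python
# Sketch of a Face.com-style REST detection call. Endpoint and parameter
# names are assumptions reconstructed from memory of the era's docs.
import requests

resp = requests.get("http://api.face.com/faces/detect.json", params={
    "api_key": "YOUR_KEY",       # placeholder credentials
    "api_secret": "YOUR_SECRET",
    "urls": "http://example.com/photo.jpg",
})
for photo in resp.json().get("photos", []):
    for tag in photo.get("tags", []):   # one tag per detected face
        print(tag.get("center"), tag.get("attributes"))
```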

I want to draw special attention to Affectiva, who use the tag line “Respectful emotional measurement and communication.” Yes, they read emotion from video.

The software “makes it possible to measure audience response with a scene-by-scene granularity that the current survey-and-questionnaire approach cannot,” Mr. Hamilton said. A director, he added, could find out, for example, that although audience members liked a movie over all, they did not like two or three scenes. Or he could learn that a particular character did not inspire the intended emotional response.

I fully expect facial recognition to come into the postproduction world rapidly and provide useful metadata. Automatically identifying and labeling people in video content will be very empowering. I do foresee an issue with *automatic* identification of actors versus their roles in narrative work, but the software would surely have the ability to manually tag every instance of “this identified face” with a character name instead of the actor’s name sourced from IMDb!

If software can detect emotional responses to movies, it can detect emotional performances and – for documentary/reality/news – detect the emotion in the face to help drive editing.
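
As a taste of how crude emotion tagging could start, OpenCV even ships a smile cascade; the sketch below flags frames where a detected face appears to be smiling, as markers an editor could search. Real emotion measurement, Affectiva-style, is far more sophisticated, and the filename and cascade parameters here are my own choices:

```python
# Sketch: flag frames where a detected face is smiling, as crude
# "emotion" markers for an editor to search.
import cv2

face_c = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
smile_c = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_smile.xml")

cap = cv2.VideoCapture("interview.mov")   # placeholder clip
frame_no = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    for (x, y, w, h) in face_c.detectMultiScale(gray, 1.1, 5):
        roi = gray[y:y + h, x:x + w]
        # minNeighbors is high because the smile cascade false-positives a lot
        if len(smile_c.detectMultiScale(roi, 1.7, 22)) > 0:
            print("smile at frame", frame_no)
    frame_no += 1
cap.release()
```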

Oh, if you really want to avoid facial detection? Social Beat has a few tips.

Metadata for asset management

Both DigitalSmiths and Asterpix have tools specifically intended to create visual metadata automatically for asset management and exploitation.

Asterpix

The actual process of machine-tagging involves pulling in imaging data from the video clip and matching it up to whatever text was included by the video’s creator. Nat Kausik, CEO of Asterpix, tells me the process is a little similar to Google’s search algorithm in creating relevancy based on which bits of the video get the most screen time. For example, in a video of someone walking through a grocery store there would be a wealth of information about other products and people, but if the focus is on that one person for the majority of the clip, the engine will pick up on it and react accordingly.
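
The ranking step itself is simple enough to sketch: given per-frame detections from some upstream tracker (assumed here – the example labels are mine), rank each object by how much of the clip it’s on screen:

```python
# Sketch of the screen-time idea: rank tags by how much of the clip each
# detected object is visible. The detection step is assumed upstream.
from collections import Counter

# frame_labels[i] = set of object labels visible in frame i (example data)
frame_labels = [
    {"person", "shopping cart"},
    {"person"},
    {"person", "cereal box"},
]

counts = Counter(label for labels in frame_labels for label in labels)
total = len(frame_labels)
for label, n in counts.most_common():
    print(f"{label}: on screen {n / total:.0%} of the clip")
```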

DigitalSmiths

Digitalsmiths is the technology leader in the rapidly growing segment of video search and recommendation. Digitalsmiths Seamless Discovery™ has revolutionized the accuracy and ease with which end-users find relevant, personalized entertainment across multiple channels and devices.

From powering tablet-enabled set-top-boxes, to live integration of sporting events, to first-run premium web-content, Digitalsmiths has deep experience and proven results for increasing engagement and viewership through our unique, holistic solutions. Our solutions serve customers that span across all media channels and devices and reach a combined audience of millions of consumers.

Generating other visual metadata

While we’re only just seeing facial detection and recognition rolling into useful applications, computers are already being taught to read the text in images, and to process the images themselves to recognize people, places and things.

Text recognition isn’t exactly new, but:

Google Goggles Can Now Read Print Ads. Oh, And Play Freaking Sudoku!

OCRKit 1.2 – The simplest Text Recognition for the Mac

Beyond Text

Computers That See You and Keep Watch Over You

A New York Times article outlining how computer vision is being used:

Perched above the prison yard, five cameras tracked the play-acting prisoners, and artificial-intelligence software analyzed the images to recognize faces, gestures and patterns of group behavior. When two groups of inmates moved toward each other, the experimental computer system sent an alert — a text message — to a corrections officer that warned of a potential incident and gave the location.

“Machines will definitely be able to observe us and understand us better,” said Hartmut Neven, a computer scientist and vision expert at Google. “Where that leads is uncertain.”

The applications are quite amazing. From observing prisoners, to reminding hospital staff to wash their hands when it’s detected they haven’t, to a host of other uses, smart computer software is watching us and making accurate observations.
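
The prison-yard alert logic, at its core, is something like the sketch below: track each group’s centroid over time and flag when two groups are closing on each other. The real system layers face, gesture and behavior recognition on top of this; my thresholds and sample data are invented:

```python
# Sketch of the prison-yard alert: flag when two tracked groups' centroids
# are closing on each other. Thresholds and sample data are invented.
import numpy as np

def closing(group_a, group_b, threshold=5.0):
    """group_a/group_b: sequences of (frame, x, y) centroids over time."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    dists = np.hypot(a[:, 1] - b[:, 1], a[:, 2] - b[:, 2])
    # Alert if the gap is shrinking and already inside the threshold.
    return dists[-1] < threshold and dists[-1] < dists[0]

a = [(0, 0, 0), (1, 2, 0), (2, 4, 0)]    # group A moving right
b = [(0, 10, 0), (1, 8, 0), (2, 6, 0)]   # group B moving left
print("potential incident!" if closing(a, b) else "all quiet")
```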

Google Experiments With Next Generation Image Search

Notably, the new image search technology doesn’t just index text associated with an image in determining what’s in it. Google is now talking about using computers to analyze the stuff in photos, and using that to associate it in a ranked way with keyword queries. In effect, they’re talking about something similar to PageRank for images (but without the linking behavior).

Teaching Google To See Images

Nuno Vasconcelos, a professor of electrical engineering at the UCSD Jacobs School of Engineering, discusses the approach, called Supervised Multiclass Labeling (SML), in a recent news release from the school (hat tip to Threadwatch for the pointer). Though SML sounds like a mouthful of jargon, what it really amounts to is systematically training a computer to recognize statistically similar objects, and teaching it to differentiate them from other objects that have similar characteristics.
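
A toy version of that training loop – a multiclass classifier over simple image features – might look like this. SML itself is a different, far more principled method; the features below are random stand-ins for real image descriptors:

```python
# Toy multiclass training: learn to separate labelled object classes from
# feature vectors. Random data stands in for real image descriptors.
import numpy as np
from sklearn.svm import SVC

X_train = np.random.rand(60, 48)               # stand-in feature vectors
y_train = np.repeat(["car", "face", "tree"], 20)  # 20 examples per class

clf = SVC(kernel="rbf").fit(X_train, y_train)
print(clf.predict(np.random.rand(1, 48)))      # label for a "new" image
```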

Google Researchers Teach Computers How To Recognize Images Of Famous Landmarks

In the experiment, the researchers fed the system “an unnamed, untagged picture of a landmark” found on the Internet, and it would spit back the name and location of the landmark, such as the Acropolis in Greece. Each untagged photo was compared to 40 million GPS-tagged images on Picasa and Panoramio (both owned by Google), as well as related photos found through Google Image Search. Using clustering and new image indexing techniques, the Google researchers were able to identify untagged photos of the same landmarks from different angles and under various lighting conditions.

The researchers report that their system can identify 50,000 landmarks with 80 percent accuracy. I’m not sure that’s quite good enough even to roll out in a beta product, but if Google can get it to 90 or 95 percent, that would start to be consumer-friendly.

Researchers from MIT’s CSAIL teach computers to recognize objects

The system uses a modified version of a so-called motion estimation algorithm, a type of algorithm common in video processing. Since consecutive frames of video usually change very little, data compression schemes often store the unchanging aspects of a scene once, updating only the positions of moving objects. The motion estimation algorithm determines which objects have moved from one frame to the next. In a video, that’s usually fairly easy to do: most objects don’t move very far in one-30th of a second. Nor does the algorithm need to know what the object is; it just has to recognize, say, corners and edges, and how their appearance typically changes under different perspectives.

The MIT researchers’ new system essentially treats unrelated images as if they were consecutive frames in a video sequence. When the modified motion estimation algorithm tries to determine which objects have “moved” between one image and the next, it usually picks out objects of the same type: it will guess, for instance, that the 2006 Infiniti in image two is the same object as the 1965 Chevy in image one.
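
The underlying block-matching idea is easy to sketch: for each block in the first image, find the best-matching region in the second. This is my illustration of the general technique, not MIT’s modified algorithm, and the filenames and confidence cut-off are placeholders:

```python
# Sketch of block matching between two (here unrelated) images: for each
# block in image one, find the best-matching region in image two.
import cv2

img1 = cv2.imread("chevy.jpg", cv2.IMREAD_GRAYSCALE)     # placeholder
img2 = cv2.imread("infiniti.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder

block = 32  # assumes img2 is at least as large as one block
for y in range(0, img1.shape[0] - block, block):
    for x in range(0, img1.shape[1] - block, block):
        patch = img1[y:y + block, x:x + block]
        # Normalized cross-correlation of this patch against all of img2.
        scores = cv2.matchTemplate(img2, patch, cv2.TM_CCOEFF_NORMED)
        _, best, _, loc = cv2.minMaxLoc(scores)
        if best > 0.6:   # arbitrary confidence cut-off
            print(f"block at {(x, y)} 'moved' to {loc} (score {best:.2f})")
```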

SeeIT.com

SeeIT.com is in beta while the company is scaling the index from millions to hundreds of millions of images. You can try it by clicking here, then entering the user name picture and password picture93AE (exclusive access for Search Engine Land readers). See this information for new users for more information, including some of the limitations of the current beta release.

Riya

Riya started out focusing primarily on facial recognition, but now has a beta visual search that lets you find similar faces and objects on many images across the web and then refine your results, using color, shape and texture.

Riya also powers the visually oriented product search service Like.com that lets you find clothing and a few home furnishing items based on visual similarity. Like also has a “celebrity” search that lets you see what the stars are currently wearing and find similar accoutrements for your own adornment.

DARPA building search engine for video surveillance footage

According to a prospectus written in March but released only this month, the Video and Image Retrieval and Analysis Tool (VIRAT) will enable intel analysts to “rapidly find video content of interest from archives and provide alerts to the analyst of events of interest during live operations,” taking both conventional video and footage from infrared scanners as input. The VIRAT project is an effort to cope with a growing data glut that has taxed intelligence resources because of the need to have trained human personnel perform time- and labor-intensive review of recorded video.

Check out the “simple” diagram accompanying the article. This stuff isn’t simple.

VideoSurf: New, Genuinely Radical Video Search

VideoSurf is a computer vision search engine that processes all of the kinds of information most video search services do, but then goes a step further, applying a proprietary process using “multigrid fast computation” and some heavy-duty computer processing power to analyze videos, identify people, and extract all kinds of additional information directly from the video itself. Until I saw the demo, I thought this type of technology was still years away.

Apple wins patent on 3D object-recognition technology

 The USPTO has awarded Apple a patent on 3D object-recognition technology that goes well beyond the current face recognition already included in apps such as iPhoto and the iOS 5 camera application, allowing a device to “build” a 3D face or object by analyzing the curves, contours and shadows of a 2D image. Such technology would give Kinect-like detection and recognition capabilities to cameras such as those found in iOS and Mac devices.

This is another benefit of the Polar Rose acquisition – that’s where the technology behind this patent was developed.

Developer APIs

Kooaba

The Swiss company aims to unlock its library of over 10 million images, ranging from album covers to books and movie posters, and provide access to all that precious data via the cloud.

Kooaba hopes that the launch of the API will trigger third-party developers to build more mobile applications – iPhone and Android versions exist already – or tools that tap into social networking services like Facebook and Twitter, etcetera.

OpenCV

OpenCV (Open Source Computer Vision) is a library of programming functions for real time computer vision. It has C++, C, Python and soon Java interfaces running on Windows, Linux, Android and Mac. The library has >2500 optimized algorithms.
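
A minimal taste of the Python interface – grab frames from a camera and run Canny edge detection in real time (the camera index and thresholds are the usual defaults, my choices):

```python
# Minimal OpenCV example: live Canny edge detection from a camera.
import cv2

cap = cv2.VideoCapture(0)   # default camera
while True:
    ok, frame = cap.read()
    if not ok:
        break
    edges = cv2.Canny(cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY), 100, 200)
    cv2.imshow("edges", edges)
    if cv2.waitKey(1) & 0xFF == ord("q"):   # press q to quit
        break
cap.release()
cv2.destroyAllWindows()
```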

Out at the Leading Edge of technology

New Image-Recognition Software Could Let Computers ‘See’ Like Humans Do

Using such small amounts of data per image makes it possible to search for similar pictures through millions of images in a database, using an ordinary PC, in less than a second, Torralba says. And unlike other methods that require first breaking down an image into sections containing different objects, this method uses the entire image, making it simple to apply to large datasets without human intervention.
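
The tiny-descriptor idea is easy to approximate: shrink every image to a small thumbnail, flatten it, and find lookalikes by nearest neighbour. The sizes, filenames and similarity metric below are illustrative, not the paper’s exact method:

```python
# Sketch of tiny-descriptor image search: shrink, flatten, normalize,
# then compare by cosine similarity. Sizes and metric are illustrative.
import cv2
import numpy as np

def descriptor(path, size=16):
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    tiny = cv2.resize(img, (size, size)).astype(np.float32).ravel()
    return tiny / (np.linalg.norm(tiny) + 1e-9)

# A few hundred bytes per image is what makes million-image scans
# feasible on an ordinary PC.
database = {p: descriptor(p) for p in ["a.jpg", "b.jpg", "c.jpg"]}
query = descriptor("query.jpg")
best = max(database, key=lambda p: float(database[p] @ query))
print("most similar:", best)
```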

Developing artificial intelligence systems that can interpret images

 Torralba is also attempting to develop systems that can scan a short video clip and predict what is likely to happen next, based on what people or objects are in the scene. To do this, the systems will need to understand what actions each object or person in the scene is capable of making, and what their limitations are. This will allow the systems to make predictions about what each of these entities is likely to do in the near future.

We already have facial detection in our software, and identifying the person is definitely coming. There are technologies that recognize a smile in a cheap iPhone app, and others that read human emotion, currently exploited for focus-group work. There are computers patrolling prison yards and making sure doctors and nurses wash their hands between patients. There is no doubt in my mind that pre-edit processing will give editors name, context and emotion metadata. And smart companies, like us, will exploit that as input for automating certain editing tasks.

