Synthesizing human emotions

From: The Baltimore Sun - 11/29/2004
By: Michael Stroh

Speech: Melding acoustics, psychology and linguistics, researchers teach computers to laugh and sigh, express joy and anger.

Shiva Sundaram spends his days listening to his computer laugh at him. Someday, you may know how it feels.

The University of Southern California engineer is one of a growing number of researchers trying to crack the next barrier in computer speech synthesis -- emotion. In labs around the world, computers are starting to laugh and sigh, express joy and anger, and even hesitate with natural ums and ahs.

Called expressive speech synthesis, "it's the hot area" in the field today, says Ellen Eide of IBM's T.J. Watson Research Center in Yorktown Heights, N.Y., which plans to introduce a version of its commercial speech synthesizer that incorporates the new technology. It is also one of the hardest problems to solve, says Sundaram, who has spent months tweaking his laugh synthesizer. And the sound? Mirthful, but still machine-made. "Laughter," he says, "is a very, very complex process."

The quest for expressive speech synthesis -- melding acoustics, psychology, linguistics and computer science -- is driven primarily by a grim fact of electronic life: The computers that millions of us talk to every day as we look up phone numbers, check portfolio balances or book airline flights might be convenient but, boy, can they be annoying. Commercial voice synthesizers speak in the same perpetually upbeat tone whether they're announcing the time of day or telling you that your retirement account has just tanked.

David Nahamoo, overseer of voice synthesis research at IBM, says businesses are concerned that as the technology spreads, customers will be turned off. "We all go crazy when we get some chipper voice telling us bad news," he says.

And so, in the coming months, IBM plans to roll out a new commercial speech synthesizer that feels your pain. The Expressive Text-to-Speech Engine took two years to develop and is designed to strike the appropriate tone when delivering good and bad news. The goal, says Nahamoo, is "to really show there is some sort of feeling there." To make it sound more natural, the system is also capable of clearing its throat, coughing and pausing for a breath.

Scientist Juergen Schroeter, who oversees speech synthesis research at AT&T Labs, says his organization wants not only to generate emotional speech but to detect it, too. "Everybody wants to be able to recognize anger and frustration automatically," says Julia Hirschberg, a former AT&T researcher now at Columbia University in New York. For example, an automated system that senses stress or anger in a caller's voice could automatically transfer a customer to a human for help, she says. The technology also could power a smart voice mail system that prioritizes messages based on how urgent they sound.

Hirschberg is developing tutoring software that can recognize frustration and stress in a student's voice and react by adopting a more soothing tone or by restating a problem. "Sometimes, just by addressing the emotion, it makes people feel better," says Hirschberg, who is collaborating with researchers at the University of Pittsburgh.

So, how do you make a machine sound emotional? Nick Campbell, a speech synthesis researcher at the Advanced Telecommunications Research Institute in Kyoto, Japan, says it first helps to understand how the speech synthesis technology most people encounter today is created.

The technique, known as "concatenative synthesis," works like this: Engineers hire human actors to read into a microphone for several hours. Then they dice the recording into short segments. Each segment, measured in milliseconds, is often barely the length of a single vowel. When it's time to talk, the computer picks through this audio database for the right vocal elements and stitches them together, digitally smoothing any rough transitions.
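That pick-and-stitch step can be illustrated with a short sketch. The Python below is only a toy version of the idea described above, not any vendor's engine: the "database" holds synthetic tones rather than slices of recorded speech, and the names (toy_unit, UNIT_DATABASE, concatenate) are invented for the example.

    # A minimal, illustrative sketch of concatenative synthesis: pick prerecorded
    # unit waveforms from a database and join them with short crossfades.
    # The "database" below is a toy stand-in (synthetic vowel-like tones),
    # not a real recorded-voice inventory.
    import numpy as np

    SAMPLE_RATE = 16_000  # samples per second

    def toy_unit(f0_hz: float, dur_s: float) -> np.ndarray:
        """Stand-in for a recorded speech segment: a decaying tone at the given pitch."""
        t = np.arange(int(dur_s * SAMPLE_RATE)) / SAMPLE_RATE
        return np.sin(2 * np.pi * f0_hz * t) * np.exp(-3 * t)

    # In a real system each entry would be a millisecond-scale slice of studio recordings.
    UNIT_DATABASE = {
        "ah": toy_unit(150.0, 0.12),
        "oh": toy_unit(120.0, 0.12),
        "ee": toy_unit(220.0, 0.10),
    }

    def concatenate(unit_names, crossfade_s=0.01):
        """Stitch units together, smoothing each join with a short linear crossfade."""
        fade = int(crossfade_s * SAMPLE_RATE)
        out = UNIT_DATABASE[unit_names[0]].copy()
        for name in unit_names[1:]:
            nxt = UNIT_DATABASE[name].copy()
            ramp = np.linspace(0.0, 1.0, fade)
            # Overlap-add join: fade the old unit out while the new one fades in.
            out[-fade:] = out[-fade:] * (1 - ramp) + nxt[:fade] * ramp
            out = np.concatenate([out, nxt[fade:]])
        return out

    if __name__ == "__main__":
        audio = concatenate(["ah", "oh", "ee", "ah"])
        print(f"Synthesized {len(audio) / SAMPLE_RATE:.2f} s of audio from 4 units")

The hard part that the sketch skips is the search itself: picking, from hours of recordings, the segments whose joins will be least audible once smoothed.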
Commercialized in the 1990s, concatenative synthesis has greatly improved the quality of computer speech, says Campbell. And some companies, such as IBM, are going back to the studio and creating new databases of emotional speech from which to work.

But not Campbell. "We wanted real happiness, real fear, real anger, not an actor in the studio," he says. So, under a government-funded project, he has spent the past four years recording Japanese volunteers as they go about their daily lives. "It's like people donating their organs to science," he says. His audio archive, with about 5,000 hours of recorded speech, holds samples of subjects experiencing everything from earthquakes to childbirth, from arguments to friendly phone chat.

The next step will be using those sounds in a software-based concatenative speech engine. If he succeeds, the first customers are likely to be Japanese auto and toy makers, who want to make their cars, robots and other gadgets more expressive. As Campbell puts it, "Instead of saying, 'You've exceeded the speed limit,' they want the car to go, 'Oy! Watch it!'"

Some researchers, though, don't want to depend on real speech. Instead, they want to create expressive speech from scratch using mathematical models. That's the approach Sundaram uses for his laugh synthesizer, which made its debut this month at the annual meeting of the Acoustical Society of America in San Diego.

Sundaram started by recording the giggles and guffaws of colleagues. When he ran them through his computer to see the sound waves represented graphically, he noticed that the waveforms trailed off as the person's lungs ran out of air. It reminded him of how a weight behaves as it bounces to a stop on the end of a spring. Sundaram adopted the mathematical equations that explain that action for his laugh synthesizer.
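To make the spring analogy concrete, here is a minimal sketch (not Sundaram's actual model) of a laugh built from a train of short, vowel-like bursts whose loudness dies away the way a bouncing weight settles. Every name and number in it is invented for illustration; the note length and spacing are simply chosen to be in the neighborhood of the laugh-note timings Provine reports later in the article.

    # An illustrative sketch of the "damped spring" idea behind laugh synthesis:
    # a train of short, vowel-like bursts whose amplitude decays exponentially,
    # like a weight bouncing to rest on a spring. Parameters are invented for
    # the example; this is not Sundaram's model.
    import numpy as np

    SAMPLE_RATE = 16_000

    def laugh_note(pitch_hz: float, dur_s: float) -> np.ndarray:
        """One 'ha': a pitched burst shaped by an envelope that rises and falls."""
        t = np.arange(int(dur_s * SAMPLE_RATE)) / SAMPLE_RATE
        envelope = np.sin(np.pi * t / dur_s)  # smooth rise and fall within the note
        return envelope * np.sin(2 * np.pi * pitch_hz * t)

    def synth_laugh(n_notes=6, pitch_hz=280.0, note_s=1 / 15, spacing_s=1 / 5,
                    damping=0.8):
        """Each successive note is quieter than the last, spring-style."""
        gap = np.zeros(int((spacing_s - note_s) * SAMPLE_RATE))
        pieces = []
        for k in range(n_notes):
            amplitude = np.exp(-damping * k)  # exponential decay of loudness
            pieces.append(amplitude * laugh_note(pitch_hz, note_s))
            pieces.append(gap)
        return np.concatenate(pieces)

    if __name__ == "__main__":
        laugh = synth_laugh()
        print(f"Generated a {len(laugh) / SAMPLE_RATE:.2f} s synthetic laugh")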
But Sundaram and others know that synthesizing emotional speech is only part of the challenge. Another is determining when and how to use it. "You would not like to be embarrassing," says Jurgen Trouvain, a linguist at Saarland University in Germany who is working on laughter synthesis.

Researchers are turning to psychology for clues. Robert R. Provine, a psychologist at the University of Maryland, Baltimore County who pioneered modern laughter research, says the truth is sometimes counterintuitive. In one experiment, Provine and his students listened in on discussions to find out when people laughed. The big surprise? "Only 10 to 15 percent of laughter followed something that's remotely jokey," says Provine, who summarized his findings in his book Laughter: A Scientific Investigation. The one-liners that elicited the most laughter were phrases such as "I see your point" or "I think I'm done" or "I'll see you guys later." Provine argues that laughter is an unconscious reaction that has more to do with smoothing relationships than with stand-up comedy.

Provine recorded 51 samples of natural laughter and studied them with a sound spectrograph. He found that a typical laugh is composed of expelled breaths chopped into short, vowel-like "laugh notes": ha, ho and he. Each laugh note lasted about one-fifteenth of a second, and the notes were spaced one-fifth of a second apart.

In 2001, psychologists Jo-Anne Bachorowski of Vanderbilt University and Michael Owren of Cornell found more surprises when they recorded 1,024 laughter episodes from college students watching the films Monty Python and the Holy Grail and When Harry Met Sally. Men tended to grunt and snort, while women generated more songlike laughter. When some subjects cracked up, they hit pitches in excess of 1,000 hertz, roughly high C for a soprano. And those were just the men.

Even if scientists can make machines laugh, the larger question is how humans will react to machines capable of mirth and other emotions. "Laughter is such a powerful signal that you need to be cautious about its use," says Provine. "It's fun to laugh with your friends, but I don't think I'd like to have a machine laughing at me."

To hear clips of synthesized laughter and speech, visit: http://www.baltimoresun.com/computer

The first computer speech synthesizer was created in the late 1960s by Japanese researchers. AT&T wasn't far behind. To hear how the technology sounded in its infancy, visit: http://sal.shs.arizona.edu/~asaspeechcom/PartD.html

Today's most natural sounding speech synthesizers are created using a technique called "concatenative synthesis," which starts with a prerecorded human voice that is chopped up into short segments and reassembled to form speech. To hear an example of what today's speech synthesizers can do, all you need to do is dial 411. Or visit this AT&T demo of its commercial speech synthesizer: http://www.naturalvoices.com/demos/

Many researchers are now working on the next wave of voice technology, called expressive speech synthesis. Their goal: to make machines that can sound emotional. In the coming months, IBM will roll out a new expressive speech technology. To hear an early demo, visit: http://www.research.ibm.com/tts/

For general information on speech synthesis research, visit: http://www.aaai.org/AITopics/html/speech.html

http://www.baltimoresun.com/news/health/bal-te.voice29nov29,1,550833.story?coll=bal-news-nation

Contributed by Alan Cantor