KINDA LIKE SIRI, YA KNOW?
That’s what I tell people who ask me about the voiceover project I recorded in Scotland in the summer of 2015. In our business, it’s pretty unusual to be flown to a studio to narrate a project, because with today’s technology we can connect to our clients around the world from our own studios, with no degradation of sound quality.
But this client prefers to bring talent to their studio — thereby controlling the recording environment, the scripts, and the actual digital files (for security reasons). The company, CereProc (short for Cerebral Processing), creates synthesized speech, commonly known as Text-to-Speech (TTS), for individuals and corporations around the globe.
What is text-to-speech? It’s software that creates a spoken version of the text in a digital document, such as a help file or a web page.
Not to confuse you, but the reverse of that is Speech-to-Text (STT), which is what Siri does first when she ‘hears’ your commands. Once your spoken words are transcribed into written text, Siri can locate the answer and use TTS to ‘tell’ you.
WHAT WAS THE PROCESS LIKE?
Long and arduous! We recorded for 30 hours (3 hours per day, 5 days a week, for 2 weeks), during which I sat and read more than 6,000 phrases from a computer screen. And unlike most voice acting, this work required a “flat” read — the acting and emotion had to be kept out of the voice.
I had to sit on my hands so my voice stayed calm. Usually, the more movement in the body, the more authentic the voice — and whereas I’d normally remove my jewelry to prevent jingling and jangling, in this case it wasn’t necessary!
Sentence after sentence had to be read with the same tone, pacing, intonation, and emphasis. I was beeped in to the recording, and if my read was perfect (correct pronunciation, correct volume, no emotion, no mouth noise, no exterior noise), I got beeped out, indicating that specific phrase was done. Then came a new beep to cue me to read the next line, which appeared on the computer screen and was controlled by the sound engineer.
When there were tricky foreign names, I hungered for a pencil to mark up the script with pronunciation notes; but that wasn’t possible, so I had to keep the sounds in my head as I reread the sentences with the challenging words. Not easy when a sentence contained several unfamiliar words.
1,025,109 AND COUNTING.
That’s how many words are in the English language!
In every other kind of voiceover job, the script provides valuable clues — story, arc, conflict, resolution, characters, intention, hopes, aspirations, information, and more. But in TTS recordings, the script is drawn from news sources: The New York Times, The Washington Post, the BBC, etc. A variety of topics are included, e.g. general news, politics, business, international news, sports, weather, and arts and entertainment.
The sentences aren’t always used as they appeared in print; sometimes they’re combined in ways that are nonsensical. They’re constructed to capture all the sound combinations in the language being recorded: syllables, phonemes, diphones, morphemes, and lexemes. These linguistic components are then parsed into units, which build up a database of sounds.
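To give a rough sense of what “capturing all the sound combinations” means, here’s a toy sketch of checking diphone coverage for a set of prompts. Everything in it is invented for illustration — the “phonemes” are just letters, and a real system would work from phonetic transcriptions in a pronunciation lexicon:

```python
# Toy sketch: measure how well a set of prompts covers a target
# inventory of diphones (adjacent phoneme pairs). The phoneme
# sequences here are invented; real systems use a phonetic lexicon.

def diphones(phonemes):
    """Return the set of adjacent phoneme pairs in a sequence."""
    return {(a, b) for a, b in zip(phonemes, phonemes[1:])}

def coverage(prompts, target):
    """Fraction of the target diphones that appear in the prompts."""
    covered = set()
    for prompt in prompts:
        covered |= diphones(prompt)
    return len(covered & target) / len(target)

target = {("h", "e"), ("e", "l"), ("l", "o"), ("o", "w")}
prompts = [["h", "e", "l", "l", "o"], ["l", "o", "w"]]
print(coverage(prompts, target))  # → 1.0 (all four target diphones appear)
```

A script-preparation tool can use a measure like this to decide which oddly stitched-together sentences are still worth recording: a nonsensical sentence earns its place if it fills gaps in the coverage.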
The software converts written text into phonetic text, selects the prosody (the rhythm and intonation of a sentence), and picks the best units in the database to reconstruct the words. When synthesizing a word or phrase, the program can select units of speech from any of the recordings used to build the voice, which is one reason the recordings have to be consistent. Ironically, if there’s too much emotion in the voice during the recordings, the concatenated (linked-together) units make the voice sound extremely robotic!
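The unit-selection step can be sketched in miniature. This is a drastic simplification under invented assumptions — the database values and the single “pitch” cost are made up, and real systems (CereProc’s included) weigh many acoustic and prosodic join costs over actual audio — but it shows why consistency matters: the selector tries to minimize the jump between neighbouring units, and erratic recordings make every join audible.

```python
# Toy unit-selection synthesis: for each phoneme, pick the recorded
# unit whose pitch best matches the previously chosen unit, then
# concatenate. Database values are invented; real units store audio.

database = {
    "h": [{"id": "h_1", "pitch": 110}, {"id": "h_2", "pitch": 130}],
    "i": [{"id": "i_1", "pitch": 112}, {"id": "i_2", "pitch": 150}],
}

def synthesize(phonemes):
    """Greedy unit selection: minimise the pitch jump at each join."""
    chosen = []
    prev_pitch = None
    for ph in phonemes:
        candidates = database[ph]
        if prev_pitch is None:
            best = candidates[0]
        else:
            best = min(candidates, key=lambda u: abs(u["pitch"] - prev_pitch))
        chosen.append(best["id"])
        prev_pitch = best["pitch"]
    return chosen

print(synthesize(["h", "i"]))  # → ['h_1', 'i_1'] (110 joins smoothly with 112)
```

If the recordings were emotional, the pitch values in the database would be scattered, no candidate would join smoothly, and the concatenated result would sound jumpy — exactly the “robotic” effect described above.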
BUT IT IS ABOUT YOU!
You may think that TTS doesn’t really touch your life, but that’s not true. It’s used in education, healthcare, call centers, alert systems, chatbots, web reading, robotics, marketing, and voice branding.
CereProc has recreated voices for some high-profile individuals, among them film critic Roger Ebert, who lost his voice to cancer, and football player Steve Gleason.
Research and technology in this industry keep advancing. For people with speech challenges, Northeastern University professor Rupal Patel founded VocalID, “a company that creates customized synthetic voices by combining the speaker’s residual voice with an anatomically similar voice selected from a donor database. The result is a reverse-engineered voice that approximates how a person may sound if not limited by a speech impairment.”
I pride myself on helping my clients #BeHeard, but I must commend Rupal Patel for taking voice to another level. If you’d like to watch her TEDMED talk, CLICK HERE. If you’d like to donate your voice to her project, CLICK HERE.
Think about it, the voice is everything — it’s a person, a brand, a feeling. Our personalities come through our voices. What does yours say about you?
This piece was originally written for my #VOnow blog series, which you can read here.