To paraphrase the title of Raymond Carver’s short story collection: What do we talk about when we talk about Voice AI? Are we talking about talking? About sounds? About the spaces between sounds, the silence without which sounds might have no meaning or structure at all? Why isn’t it called Sound AI? Or why aren’t there further divisions: Voice AI, Sound AI, Conversation AI, Sound Effects AI, and on and on? Don’t scoff: the Voice AI world could split off into just such categories, and more, in the near future. After all, conversation designers didn’t even exist as a job, or a job title, ten years ago, five years ago. Which is why Voice AI is the Wild West of the AI world: it’s wide open. It’s the frontier, maybe not the final frontier, but definitely a frontier. One that’s constantly branching out, learning more about itself: its (self-imposed, preconceived) limitations, its possibilities. Its heroes and villains.
Herein, then, the first of many conversations among people in the ever-expanding Voice AI multiverse. This one conjoining Naïma van Esch, a UX consultant who’s done plenty of work in AI and Voice AI, and Benjamin McCulloch, a freelance conversation designer and audio specialist (and also a human intelligence trainer—to his son) affiliated with Conch Design. McCulloch lives in Czechia, van Esch is based in the Netherlands.
Before we get to the questions, though, there’s the matter of sound itself: What is it? What is it in the context of AI? Again, if voices are just sounds, AI should be a walk in the park.
McCulloch: Sound isn’t understood very well. If you want to be an audio engineer you need to train for years and get thousands of hours of work experience before you’re ready to get paid for your skills. From the outside it looks like we just repeat processes like a robot, but we listen all the time! Even in my daily life I’m always listening; every sound tells a story. I’ve worked in many fascinating fields: music production, TV, videogames, localization, podcasting, film restoration. I’ve even been a voice artist for brands like McDonald’s and Müller. In every case I had to learn how things should sound in that context. For example, voiceover is performed differently for dramatic material, the news, tutorials, or commercials. There isn’t a single piece of software or hardware in the world that could automatically do it for me; I needed to train my ears, because they’re my main instrument. I think that is where the disconnect lies: anybody working in Voice AI is making an audio product, but many haven’t trained their ears. Sound engineers have done that training, and they are the people who can analyze the audio and suggest improvements. It’s still a relatively young industry, and I’m confident that over time we’ll see people realize that any team of coders is going to get much better results with sound engineers helping them refine the output. And it has to be said: linguists and other voice specialists are also vital.
TtC: What, then, are the challenges audio engineers face when working with Voice AI coders?
McCulloch: This doesn’t just relate to coders specifically; it relates to everyone working with synthetic voices. Most people don’t think about what they hear, and that includes many people developing synthetic voices. Our ears are extremely good at spotting a “fake,” because one of their purposes is to detect threats. We hear everything that can be heard, from every direction around us, and our ears filter it all to pick out the sound that is most relevant in each moment. We can hear around corners, our ears have evolved to amplify the sound of babies crying, and you can hear what your friend is saying into your ear in a noisy music club while your body is literally shaking from the vibrations of the sound system. However, I get the impression that many groups working in voice have never really analyzed what they hear in everyday life. This is problematic because any product that sounds fake won’t be very impressive on the market: consumers will know something sounds wrong even when they can’t say why, because they’re used to hearing human voices 24/7.
Van Esch: The main challenge I faced was realizing that conversation design is not a one-size-fits-all phenomenon. I spent a lot of time on research and on user-testing sentences that should be relatively “general,” but after all these tests I realized: people are different, and their unique experiences shape the way they speak throughout life. For example, if you’re trying to figure out the command for “Call [person],” a millennial might say “Call [name],” while an elderly person might say, “Can you check whether [name] is home? I want to speak to him.” Clearly, this elderly person grew up when there were only “house phones,” rotary phones, and you therefore needed to call the “landline.”
The challenge was to do a lot of user testing to figure out how different people phrase the same verbal command. Another challenge was to get stakeholders on board for all that user testing. After every test (such as testing a new feature), we always, always, always discovered new sentences people use for the same command. So the advice is to keep track of the wording. And I always bring a developer with me so they can experience the user (and their struggles) firsthand, gain intrinsic motivation, and develop a deeper understanding of the issues.
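The variation van Esch describes, where many different phrasings map to one command, is the core of intent matching. Here is a minimal sketch in Python, using hypothetical keyword patterns rather than a trained language model; the intent name, patterns, and example utterances are invented for illustration:

```python
# Minimal sketch: many user phrasings map to one "call_person" intent.
# The keyword patterns are hypothetical examples, not a real NLU model,
# which is why user testing keeps surfacing phrasings the list misses.
import re

INTENT_PATTERNS = {
    "call_person": [
        r"\bcall (?P<name>\w+)\b",               # "Call Anna"
        r"\bphone (?P<name>\w+)\b",              # "Phone Anna"
        r"check whether (?P<name>\w+) is home",  # older phrasing, per the interview
        r"i want to speak to (?P<name>\w+)",
    ],
}

def match_intent(utterance: str):
    """Return (intent, slots) for the first matching pattern, else None."""
    text = utterance.lower()
    for intent, patterns in INTENT_PATTERNS.items():
        for pattern in patterns:
            m = re.search(pattern, text)
            if m:
                return intent, m.groupdict()
    return None

print(match_intent("Call Anna"))
# → ('call_person', {'name': 'anna'})
print(match_intent("Can you check whether Anna is home?"))
# → ('call_person', {'name': 'anna'})
```

Each new phrasing discovered in testing becomes another pattern, which is exactly the “keep track of the wording” discipline described above.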
McCulloch: From my experience, coders often approach synthetic voice from a mathematical point of view, but this ignores the cultural and linguistic aspects. I’ve seen coders try to randomize the pitch of a synthetic voice to make it sound more organic rather than robotic. It’s nonsense, because the pitch we use while speaking is as learned as the words (we actually learned pitch before words, when we were babies mimicking the voices we heard around us). Recently there have been attempts to create non-binary, genderless voices, but I’ve seen them focus on the fundamental pitch of the voice, because generally speaking women have higher-pitched voices than men. Unfortunately, this simplified understanding of gendered voices falls apart very quickly, because young men have higher-pitched voices and older women (or women who have had illnesses that affected their speaking voice) have lower-pitched ones. Their pitch ranges get so close that it’s hard to tell them apart.
David Beckham and James Earl Jones are both men, but the fundamental pitch of their voices is definitely not the same. It’s too simplistic to say “women are high, men are low,” as if they’re two clear and distinct groups. There are ways of talking that are considered feminine or masculine. For example, women generally use a wider range of pitches while talking than men do. It’s more about culture than math or physiology: researchers discovered that Margaret Thatcher adopted a lower-pitched voice when she became Prime Minister. I could give many examples, but the truth is it’s much more complex than the fundamental pitch alone. Gender in voice can’t be understood with such a simple mathematical formula.
Van Esch: Working with coders is crucial. Besides letting them understand my work as a researcher and designer, I need to have a deeper understanding of their work, too. It was developers who taught me to read Python and to use their programs—to take basic tasks off their hands and to avoid “designing unicorn features”—features that seem fantastic but are impossible to build well in a short amount of time.
Having a background in front-end also helps me in my collaboration with engineers and coders. I know how they think and what types of solutions are available. It’s crucial as a designer to have trust between you and your engineers and coders. This way, you can communicate transparently and clearly.
McCulloch: I’m not a coder, so cooperation is key. It’s definitely a challenge to translate what you have in mind. I think the most important thing to watch out for is when the coder might make assumptions about voice in order to speed up development—often those assumptions can lead to issues, because they’ve focused on their own understanding of language, or they’ve omitted all the natural sounds voices make (breaths, sighs, laughs, cries, etc.). They’ve interpreted “voice” to mean “words.”
Van Esch: As a conversation designer, you cannot be left alone. You can write by yourself, but at the end of the day, you always need to test it with users, and then the developer needs to implement the changes in the code. So it’s a combined effort and partnership between designers and developers.
Even if you work freelance, it will benefit you greatly to have a good and trusting relationship with the development team. The development team needs to trust you and your skills for them to be able to implement your changes without resistance. And if they trust you, they will also ask you critical questions to understand why you designed something the way you did—for them to back you up on your design fully.
We’re still figuring out these new technologies, and therefore are still figuring out conversation design. The combination of these technologies will also change the way you design for certain conversations. You cannot design a chatbot conversation with text the same way you would design for a robot using voice.
TtC: Which group is more important—coders, engineers, designers?
McCulloch: Voice AI needs coders to build the underlying structure and systems, but it needs sound engineers to make sure it sounds right for the design—I’m not saying “sounds natural,” because there is an argument against making synthetic voices that sound too human, as that would mislead consumers. Then there is also a need for linguists, voice specialists, and so on. Personally, I find the idea that coders alone can replicate the voice hard to believe. It’s only as we’re delving into Voice AI that I’m reminded how incredible our voices are. We can go from a whisper to a shout in the space of one word, a sigh can be more expressive than 1,000 words. We adapt our voices depending on who we’re talking to. We do all of this naturally and in learned contexts. This isn’t just about obtaining data and training machine learning models. It needs people who know what’s going on when we talk and who can guide the coders.
TtC: What is the distinction, then, between a conversation designer and an audio/sound engineer—and coders, who seem to be the forgotten soldiers of Voice AI? Is there overlap among them? Or are they very distinct?
Van Esch: In my opinion, an audio/sound engineer is someone who works with sounds to communicate actions, like the “whooshing” sound whenever you send an email from an Apple device. That’s something audio/sound engineers created, tested, and refined before sending it over to developers to implement. Conversation designers are mostly designers who write sentences to “keep the conversation going.” However, a designer can have both skills, and sound design can be embedded in good conversation design. Conversation design can exist without sound design, sound design can exist without conversation design, and the two can be stronger together, depending on the functionality and the user interface used.
McCulloch: One very general observation is that coders tend to become a bit of a clan. They stick together. Often I’ll see jobs advertised with machine-learning startups where they ask for a sound engineer but they frame it within the job of a coder. Like, “You must know Python and GitHub, and also know how to edit sound for natural results.” Those are two different jobs! I know there are audio programmers out there, but the chances that they can do both sound engineering tasks and write code to a high standard are very slim. Both tasks require years of study and they’re different areas. Usually they end up with a coder who knows sound from a theoretical point of view, but that’s got very little to do with the skills and techniques of editing and designing sound.
TtC: How vital are conversation designers to the future of Voice AI?
Van Esch: Conversation design exists because of the adoption of robotics, chatbots, and voice-user-interface design, and the push for improved user experiences. Before these technologies were adopted, conversation design didn’t exist. UX designers can be good conversation designers because it’s embedded in us to do user research and a lot of user testing. The dream vision of a lot of companies is a user interface without a device. So even if there are still “buttons” you can press, I assume most of these interfaces will also be voice-controlled. So conversation designers are essential to the future of Voice AI.
McCulloch: I’ve often had to say “That can’t be automated,” because a programmer intended to apply the same batch process to multiple sounds, and all those sounds had unique qualities. You have to analyze each sound by listening to it to know what needs to be fixed. For example, a recording of a voice might have background noises, such as a door banging. If you automatically try to remove all the banging doors, the system is likely to think that all the loud percussive vocal sounds (‘p, b, t’, etc.) are also doors banging and remove them too, which results in butchered vocal recordings!
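McCulloch’s banging-door example can be made concrete with a toy sketch. Assuming a crude amplitude-threshold filter (the waveform values and threshold below are invented for illustration, not real audio processing), the same batch pass that removes the bang also removes the plosive:

```python
# Sketch of the pitfall described above: a naive "remove loud transients"
# batch process cannot tell a banging door from a plosive ('p', 'b', 't').

def remove_loud_transients(samples, threshold=0.8):
    """Zero out any sample louder than the threshold -- a crude 'de-bang' pass."""
    return [0.0 if abs(s) > threshold else s for s in samples]

# Toy waveform: quiet speech (0.3), a plosive burst (0.9), a door bang (0.95).
speech  = [0.3, 0.3, 0.3]
plosive = [0.9, 0.85]   # the 'p' in "park" -- part of the voice!
door    = [0.95]

recording = speech + plosive + door
cleaned = remove_loud_transients(recording)

# The door bang is gone, but so is the plosive: the word is butchered.
print(cleaned)  # → [0.3, 0.3, 0.3, 0.0, 0.0, 0.0]
```

Deciding which loud transient is noise and which is voice requires listening to each sound, which is exactly why this step resists automation.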
TtC: What might be some of the things that conversation designers are doing, or working on for the near future, that are new and exciting or unexpected?
McCulloch: It’s no surprise that the BBC is getting great results. Their synthetic voice is one of the best I’ve heard. They’ve worked in voice production for so long—for TV and radio—and you can hear that their synthetic voice is useable. Another great company I’ve heard is Sonantic. The emotional quality of their voices is very impressive.
Van Esch: As a conversation designer, you have opportunities to work with other disruptive technologies like robotics, Virtual Reality, Augmented Reality, and so on. I believe the opportunities for combining these technologies with conversation design are endless. We could have so many new inventions that change our lives. Who would have thought I’d want to tell my stereo to answer my questions? Or have my robot vacuum cleaner say “Welcome home” to me? The world can be our oyster.