We had the pleasure of speaking with Oren Jacob, Co-Founder and CEO of PullString. In Part 1 of this series, we discussed what makes a great computer conversation.
RB: Welcome to the Rapportboost podcast series, where we interview thought leaders applying AI to conversational platforms. For today’s podcast, I am happy to introduce Oren Jacob, cofounder and CEO of PullString. Oren, can you please start by providing a brief overview of your company and product offerings?
Oren Jacob: PullString offers a product called Converse, along with our conversation cloud platform, to enterprises that want to build great voice experiences for Amazon Alexa, Google Assistant, and IoT devices.
We primarily work with verticals like entertainment, media, gaming, pharma, and financial services. Large companies that want to build fantastic computer-based conversations between their brands and end consumers (people who have an Alexa in their house, for example) use Converse to build those Alexa skills, and do it very successfully.
We’re an enterprise software company that provides a voice technology platform. I want to make the point that PullString Converse, along with our conversation cloud, is a complete end-to-end solution: from the design and prototyping of voice experiences at the very beginning, through development, submission to the skill store, and operation in the market, to the ongoing care, feeding, and improvement of skills. All of that is within reach of what we offer today. So we’re very unique in the space. For large companies that really want a full solution for bringing their brands into voice technology at the highest fidelity possible, we’re the folks to call, and we’ll happily help along their voice journey.
RB: Any particular size and scale of organizations you like to work with? Or is it anybody and everybody?
Oren Jacob: I think that most of the folks we work with have at least a small team that is responsible for their voice-skill work. Sometimes, because our customers want to, we work with a third-party voice agency that does the actual production work for a skill. And oftentimes the companies want to do that work themselves. So we’re happy to go either way, depending on what’s best for the particular task at hand for a specific customer.
As far as company size goes: certainly Fortune 1000 firms, and large and medium-sized companies. A few small businesses do work with us, but that’s unusual, because they typically have someone in-house who will code against the Alexa APIs directly. Our customers tend instead to have a team, often in the marketing, innovation, or sales product groups, that wants to bring voice in as one of many channels and extend the brand’s strategy for reaching customers.
RB: So given your market and what you’ve seen, including best practices in designing high-fidelity voice experiences: to start, what makes a great computer conversation?
Oren Jacob: I’m going to answer that question by going back to the first part of the previous one, about how the company got started, and drive that into what I think is strong advocacy for how the best voice experiences get built.
The company was started, if I go back to 2011 — just a quick 30-second story here. My daughter Toby at the time was seven, and there was an iPhone 2 or 3, and she was on a Skype call with my grandma, down in Irvine, down by you and the beach. The Skype call ended and Toby hung up the Skype call and looked at me and said, “Daddy, can I use this to talk to that?” “This” was the iPhone and “that” was her stuffed bunny, Tutu.
Oren Jacob: “I don’t know, Toby. Give me a couple million in venture capital in a few years. We’ll give it a go.”
Oren Jacob: But that evening, I relayed that question to Martin, now the CTO here at PullString, and we contemplated it for a couple of weeks. And we came to two conclusions from that process. One was, we were not sure we actually could do that. By “that” I mean build a high-fidelity, credible, synthetic two-way conversation. Can we use a computer to talk back to someone who wants to talk to a computer? And do that well? By “well,” I mean interesting, not artificial: conversation that feels connected, that follows along when the person wants to stay on topic and can adjust when they don’t. And then we thought, if we could even do that, then what would happen? I could quickly imagine 472 different applications for that technology in a second. But in practice, what would actually get built and used? And how would that change the landscape of how we relate to technology, broadly speaking?
We talk to each other all the time. That’s what you and I are doing now, here, as we record this podcast, and as I go home with my family and I’m here at work with my coworkers. But what would it be like to be able to use language? The language — our language, not Python and C and Java, but English and Spanish and French, and Mandarin and Hebrew and the rest. What would it be like to use our languages of humans to interact with computers in a more meaningful way? And that seemed like a very fruitful place to go. That was sort of the original story of the company. And that wraps now into some of the best practices about computer conversation.
I think maybe I’ll warm up this question by offering a few observations about what the best practices are not.
Some of the folks in this space are trying to map their mobile app or their website, and the flows designed for those experiences, directly onto voice. And that has not worked well. Voice is more different from mobile, and more different from web, than mobile and web are from each other. Yes, those two are different: websites are typically mouse- or touch-panel-based, while mobile is typically a touch screen on a handheld device. But both are visual GUI interfaces, with text and imagery together and buttons you can press for OK and Cancel and Check Out from your shopping cart.
Zero of those things exist in voice. Voice is an audio experience. So it’s linear by nature. It’s interactive by nature in a turn-based way, which is much more like video game design than mobile app or website design: I say, then you say, then I say, then you say.
Because it’s audio, a lot of things do not work well, like a long list of options. Past option 7, I’m taking a nap already, if I even made it past option 3. But on a website you can see a whole tableau of shoes on Zappos, 100 things at a time. That’s impossible in voice.
On the other hand, there are things voice does well, because it is the native form of communication that we as humans develop early on: at age two, three, four, and five we start to talk. The kind of connection, the compactness and efficiency of communication available, is very, very high. There’s also the chance to do things like, to take one example off the top of my head, following the instructions in a recipe: an ordered list of steps that take time to do, where you want to ask questions along the way. That’s really well suited to voice, and really badly suited to something like YouTube. I’m constantly pausing recipe videos on YouTube; when I want to mix my eggs, I have to go watch the video instead. I’d much rather talk through that with a voice assistant.
The second thing I want to mention about best practices has to do with the turn-based nature of voice: I say, you say, I say, you say. When your Alexa skill or voice app asks a question, about half the time the user will give you back what you’re hoping for. Do you want to confirm your checkout? Yes, I do. Do you want to add that to your cart? Yes, I do. What color do you want? Purple. The other half of the time, the user will say something you don’t expect: they answer the question in an off-axis way, they need more clarification, or perhaps they’re just hungry and want a snack, and they tell you that too. When designing and building for voice, the process that yields the best experience really does distribute the effort evenly: half into the things you expect the user to say back, and half into the things you don’t. How do you handle those? How do you redirect the conversation, or do you not? How do you ask for further clarification, or do you not? Do you allow the conversation to meander or change topic, or do you keep it right on rails?
A lot of folks only design for the positive case, because in a visual GUI with just OK, Cancel, and Close Window, the user can only do three things. But with an open mic, the user can say anything back to you. So spending as much time thinking about when they say what you don’t expect as you do thinking about when they say what you expect is really important. Because both will happen to you all the time.
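The even split Jacob describes, budgeting as much handling for off-axis replies as for expected ones, can be sketched as a minimal turn handler. Everything below is a hypothetical illustration: the intent names, the `classify` heuristic, and the reprompt wording are invented for this sketch and are not PullString's or Amazon's API.

```python
# Hypothetical sketch of one conversational turn. Half the code paths
# handle expected answers; the other half handle the unexpected reply,
# by reprompting rather than failing silently.

EXPECTED = {"yes", "no", "purple", "red", "blue"}

def classify(utterance: str) -> str:
    """Very rough stand-in for real intent recognition."""
    word = utterance.strip().lower()
    return word if word in EXPECTED else "unexpected"

def handle_turn(question: str, utterance: str) -> str:
    intent = classify(utterance)
    if intent == "yes":
        return "Great, adding that to your cart."
    if intent == "no":
        return "No problem. Anything else?"
    if intent in {"purple", "red", "blue"}:
        return f"Okay, {intent} it is."
    # The off-axis half: redirect the conversation with a clarifying
    # reprompt instead of erroring out.
    return f"Sorry, I didn't catch that. {question}"

print(handle_turn("What color do you want?", "Purple"))
print(handle_turn("What color do you want?", "I'm hungry"))
```

In a production skill the fallback branch would be far richer (clarification, topic changes, graceful exits), which is exactly why it deserves half the design effort.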