Guess which one I am…

Globally, enormous sums of money are being thrown at generative AI, including methods of creation and of detection – but it’s fair to say that the majority is being steered towards, if not explicitly ring-fenced for, STEM (science, technology, engineering, maths). Natural language processing and machine learning. Audio-engineering. Software and hardware development. Data science. If you’re in one of these fields, life probably looks very busy right now. Possibly even a little daunting.

However, HackaCon’s core point is that generating spontaneous, human-like conversations using AI fundamentally requires SHAPE (social sciences, humanities, and the arts for people and the economy). You’re going to need linguists. Creative writers. Psychologists. Sociologists. Philosophers. Historians.

Thankfully we have the inimitable Katherine Parkinson on hand to help illustrate this.

I’m a big fan…

Firstly, let’s take a look at about thirty seconds of dialogue from season two’s final episode of the BBC TV series, Sherlock. (It seemed an appropriate choice somehow.) Katherine is playing Kitty Riley (KR), a wily journalist looking for a big break but masquerading as an obsessed fan. She has just appeared behind Sherlock Holmes (SH) in the public toilets:

KR: You’re him

SH: Wrong toilet

KR: I’m a big fan

SH: [Quietly, to himself.] Evidently

KR: I read your cases; follow them all. [Undoes top buttons and pulls open her jacket to reveal her cleavage.] Sign my shirt would you

SH: There are two types of fans

KR: Oh?

SH: Catch me before I kill again, Type A

KR: Uhuh. What’s Type B-

SH: [Instantly.] -Your bedroom’s just a taxi ride away

KR: [Smiling.] Hnnnh. Guess which one I am

SH: [Looks her up and down.] Neither

What we get here is a really classic example of fictional dialogue, whether that’s from books or movies or TV shows. Everyone takes turns. People rarely speak over the top of each other. The closest we get to overlap above is an instance of latching where Sherlock leaves no gap after Kitty asks, “What’s Type B?” There are no hesitations, no interruptions, no repetitions, no grammatical mishaps. It’s all a seamless execution designed to fulfil the triple televisual requirements of being intelligible, gripping, and instrumental in developing the story in some way.

That’s not to say that there are no speech-like features whatsoever. We get grammatical shorthand (follow them all) and even some speech-like sounds (hnnnh) that are difficult to convey in text, but that’s about as far as most fictional scripted dialogues go. We could transcribe the whole series and find that, with occasional exceptions for dramatic effect, most, if not all, of the character interactions look like this.

In turn, you might reasonably say, “Yes, but… isn’t this how we actually talk?” And the answer is: absolutely not.

The milk rounds

Let’s take the very same Katherine Parkinson (KP), but this time in conversation with Helen Mountfield (HM) about why she became an actress:

HM: [Laughter.] No, no, no. I’ve seen the IT Crowd, don’t do it

KP: [Laughter.] Yeah

Both: [Overlapping talk, undecipherable.]

KP: …destroys… some businesses. Um, I don’t know what the um, kind of, fashionable job is at the moment but when I was an undergraduate in the… nineties… the… milk… rounds – I don’t know if they still call them that

HM: [Quietly.] Mmm

KP: That’s kind of what, the sort of- the coolest job that you could aspire to

HM: [Quietly.] Mmm

KP: certainly amongst me and my kind of

HM: [Quietly.] Mmm

KP: undergraduate friends was being a management consultant so that’s what I decided [laughter voice, undecipherable] without actually knowing what it was

When first reading spoken words that have been transcribed much more accurately like this, people tend to react with dismay. In this format, speech looks almost impossible to follow, and yet if you listen to this conversation, I suspect you’ll find it relatively straightforward to understand. She sounds hesitant, certainly, and seems to be thinking her way through her answer live, but in spite of that, what she means remains highly intelligible.

That enormous difference between the more accurate transcript above that seems so inaccessible and the live interaction that our brains process with so little effort occurs because we’re remarkably tolerant of, or even oblivious to, all those disfluencies. We unconsciously filter out all the “noise” and focus on processing the “signal”.

Advantage, Katherine

An extra consideration with this example is that Katherine was almost certainly provided with at least a flavour of the questions before she took part, giving her the chance to have some thoughts ready in advance. She has also almost certainly been asked these kinds of questions many times before, and has therefore practised versions of these answers with many different audiences, both in private and in public. Moreover, though I’ve tried to find something “conversational”, in reality this is an interview, and however informal, it inevitably comes with a predictable structure. Katherine expects to be asked relevant, interesting questions, and to provide relevant, interesting answers. This takes away a lot of the mystery, which can significantly reduce the necessary processing power. Increased spontaneity and decreased thinking time – or in other words, greater cognitive load – can hugely amplify disfluencies and other speech-like features as we try both to process and to answer on the spot.

Despite Katherine’s conversational advantages in this particular instance, she still displays a very typical array of hesitations, false starts, grammatically orphaned phrases, overlaps that make it impossible to follow what was actually said, and far more besides. We also get a series of supportive back-channel markers from Helen (mmm) to show engagement and offer encouragement, and these slot with extraordinary precision into millisecond-long gaps not just between Katherine’s phrases, where those spaces can be fractionally longer, but even between words within phrases where gaps are vanishingly small.

All this to say…

This is one of the key things HackaCon is interested in. We believe that rising to this challenge will require uniting both STEM and SHAPE.

Our little snippet is an extremely normal example of natural, spontaneous, human speech. Conversation is somehow both messy and exceptionally precise at the same time. It looks disjointed, but it’s highly collaborative. Those tiny supportive noises (mmm, yeah) often fall below the radar of conscious acknowledgement yet we miss them extremely quickly if we don’t hear them when we think we should. We say all kinds of apparently empty things (you know, like, erm) but those supposedly meaningless additions communicate entire additional layers of humour, uncertainty, thoughtfulness, politeness, emotion, attitude, and more.

Like so many natural human behaviours, conversation is a masterpiece of many seemingly chaotic moves that unite into an astonishingly coherent whole, and that’s why STEM alone is unlikely to solve this. You might computationally engineer the most perfect clone voices of Agent Luke and Chris Nemesis, for instance, but can you then make those voices undertake a conversational choreography that sounds as authentically spontaneous and natural as two humans talking about a 1990s milk round? Or will they end up sounding more like a dramatic cameo in a hit BBC series?

We’d love to find out.
