JavaScript Speech Recognition

Who is this guy?

works at Adobe

Apache Cordova core contributor

nutty about speech recognition

The future won't be like Star Trek.

Scott Adams, creator of Dilbert

Why do I care about speech rec?

Cape Breton Island

= Cape Bretoner

Here's a conversation between two Cape Bretoners

P1: jeet?

P2: naw, jew?

P1: naw, t'rly t'eet bye.

And here's the translation

P1: jeet?

P1: Did you eat?

P2: naw, jew?

P2: No, did you?

P1: naw, t'rly t'eet bye.

P1: No, it's too early to eat buddy.

Regular Alphabet

26 letters

Cape Breton Alphabet

12 letters!

Alright, enough about me

What is speech recognition?

Speech recognition is the process of translating the spoken word into text.

The process of speech rec includes...

Record and digitize the audio data

Perform end pointing (trimming)
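
If you're wondering what end pointing might look like, here is a minimal sketch. It assumes the audio is already an array of samples between -1 and 1, and it just trims the quiet parts off both ends. The function name `trimSilence` and the threshold value are made up for illustration. Real recognizers typically look at energy over short frames rather than single samples.

```javascript
// Trim leading and trailing "silence" from an array of samples in [-1, 1].
// The default threshold is an illustrative assumption, not a tuned constant.
function trimSilence(samples, threshold = 0.02) {
  let start = 0;
  let end = samples.length;
  // Walk in from the front until a sample is loud enough.
  while (start < end && Math.abs(samples[start]) < threshold) start++;
  // Walk in from the back until a sample is loud enough.
  while (end > start && Math.abs(samples[end - 1]) < threshold) end--;
  return samples.slice(start, end);
}

const audio = [0.0, 0.01, -0.005, 0.4, -0.3, 0.01, 0.25, 0.0, 0.004];
const trimmed = trimSilence(audio);
console.log(trimmed.length); // quiet samples in the middle are kept
```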

Split data into phonemes

What is a phoneme?

A phoneme is a perceptually distinct unit of sound in a specified language that distinguishes one word from another.

The English language has 44 distinct sounds

Source: English language phoneme chart

By comparison, the Rotokas speakers in Papua New Guinea have 11 phonemes.

But the !Xóõ speakers who mostly live in Botswana have 112 phonemes.

Apply the phonemes to the recognition model. This is a massive lexicon which takes into account all of the different ways words can be pronounced.

Analyze the results against the grammar

Return a confidence-weighted result

[
  {
    "confidence": 0.97335243225098,
    "transcript": "hello"
  },
  {
    "confidence": 0.19940405040800,
    "transcript": "hell low"
  },
  {
    "confidence": 0.19910827091000,
    "transcript": "how low"
  }
]
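
Here is one way an app might use a list like that: take the alternative with the highest confidence, and ignore it if even the best guess is weak. This is only a sketch. The helper name `pickBest` and the 0.5 cut-off are assumptions for illustration, not part of any spec.

```javascript
// Return the highest-confidence alternative, or null if none is
// confident enough. minConfidence is an arbitrary illustrative cut-off.
function pickBest(alternatives, minConfidence = 0.5) {
  let best = null;
  for (const alt of alternatives) {
    if (best === null || alt.confidence > best.confidence) best = alt;
  }
  return best !== null && best.confidence >= minConfidence ? best : null;
}

const alternatives = [
  { confidence: 0.97335243225098, transcript: 'hello' },
  { confidence: 0.199404050408, transcript: 'hell low' },
  { confidence: 0.19910827091, transcript: 'how low' }
];
const best = pickBest(alternatives);
console.log(best ? best.transcript : 'Sorry, I did not catch that');
```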

Basically...

We want it to be like this

but more often than not...

Why is that?

When two people talk, comprehension rates are better than 97%

A really good English language speech recognition system is correct 92% of the time

Where does that extra 5% in error rate come from?

Vocabulary size and confusability
Speaker dependence vs independence
Isolated or continuous speech
Initiated vs spontaneous speech
Adverse conditions
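
A common way to put a number on how often a recognizer is right is word error rate: the number of word substitutions, insertions and deletions needed to turn the recognized text into the reference text, divided by the number of reference words. Whether the percentages quoted above were measured this way is an assumption on my part. The function name `wordErrorRate` is hypothetical.

```javascript
// Word error rate = word-level edit distance / number of reference words.
function wordErrorRate(reference, hypothesis) {
  const ref = reference.trim().split(/\s+/);
  const hyp = hypothesis.trim().split(/\s+/);
  // prev[j] holds the edit distance between the reference words seen so far
  // and the first j hypothesis words.
  let prev = Array.from({ length: hyp.length + 1 }, (_, j) => j);
  for (let i = 1; i <= ref.length; i++) {
    const curr = [i];
    for (let j = 1; j <= hyp.length; j++) {
      const cost = ref[i - 1] === hyp[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // match or substitution
      );
    }
    prev = curr;
  }
  return prev[hyp.length] / ref.length;
}

// One wrong word out of four reference words.
console.log(wordErrorRate('did you eat yet', 'did you meet yet'));
```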

Mobile Speech Recognition

OS	Application	SDK
Android	Google Now	Java API
iOS	Siri	Many 3rd party Obj-C SDKs, SFSpeechRecognizer (iOS 10+)

So how do we add speech rec to our app?

You may look at the W3C Speech API Specification

but only Chrome and Firefox have implemented that spec

The spec looks like this:

interface SpeechRecognition : EventTarget {
    // recognition parameters
    attribute SpeechGrammarList grammars;
    attribute DOMString lang;
    attribute boolean continuous;
    attribute boolean interimResults;
    attribute unsigned long maxAlternatives;
    attribute DOMString serviceURI;

    // methods to drive the speech interaction
    void start();
    void stop();
    void abort();
};

With additional event handler attributes to control behaviour:

attribute EventHandler onstart;
attribute EventHandler onaudiostart;
attribute EventHandler onsoundstart;
attribute EventHandler onspeechstart;
attribute EventHandler onspeechend;
attribute EventHandler onsoundend;
attribute EventHandler onaudioend;
attribute EventHandler onend;

attribute EventHandler onresult;
attribute EventHandler onnomatch;
attribute EventHandler onerror;
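
To make those attributes and handlers a bit more concrete, here is a hedged sketch of setting up a recognizer. The helper `createRecognizer` and its `root` parameter are my own invention; passing in the global object just makes it easy to check for the API (including Chrome's `webkit` prefix) before using it. The attribute values are example choices.

```javascript
// root is the global object (window in a browser). Returns null when the
// Speech API is not available.
function createRecognizer(root, handlers) {
  const Recognition = root.SpeechRecognition || root.webkitSpeechRecognition;
  if (!Recognition) return null;

  const recognition = new Recognition();
  recognition.lang = 'en-US';          // language to recognize
  recognition.continuous = false;      // stop after one utterance
  recognition.interimResults = false;  // only deliver final results
  recognition.maxAlternatives = 3;     // ask for up to three guesses

  recognition.onresult = handlers.onresult;
  recognition.onnomatch = handlers.onnomatch;
  recognition.onerror = handlers.onerror;
  return recognition;
}

// Only runs in a browser; skipped elsewhere.
if (typeof window !== 'undefined') {
  const recognizer = createRecognizer(window, {
    onresult: function (event) { console.log(event.results[0][0].transcript); },
    onnomatch: function () { console.log('No match'); },
    onerror: function (event) { console.log('Error: ' + event.error); }
  });
  if (recognizer) recognizer.start();
}
```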

Let's recognize some speech


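// Chrome exposes the constructor with a webkit prefix.
var SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition;
var recognition = new SpeechRecognition();
recognition.onresult = function(event) {
  if (event.results.length > 0) {
    // Show the top alternative of the first result.
    var test1 = document.getElementById("test1");
    test1.textContent = event.results[0][0].transcript;
  }
};
recognition.start();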
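
The double index in `event.results[0][0]` is easy to trip over: `results` is a list of results, and each result is itself a list of alternatives with a `transcript` and a `confidence`. Here is a small sketch that flattens everything into a plain array. The helper name `listAlternatives` is hypothetical, and the sample event is a hand-made stand-in for what a browser would deliver.

```javascript
// Flatten every alternative of every result into one plain array.
function listAlternatives(event) {
  const out = [];
  for (let i = 0; i < event.results.length; i++) {
    const result = event.results[i];
    for (let j = 0; j < result.length; j++) {
      out.push({
        transcript: result[j].transcript,
        confidence: result[j].confidence
      });
    }
  }
  return out;
}

// A hand-made event with one result holding two alternatives.
const sampleEvent = {
  results: [[
    { transcript: 'hello', confidence: 0.97 },
    { transcript: 'hell low', confidence: 0.2 }
  ]]
};
const flattened = listAlternatives(sampleEvent);
console.log(flattened.map(function (alt) { return alt.transcript; }));
```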

Replace me...

So that's pretty cool...

But I want to do something more exciting with the result

Let's ask the web a question

Works pretty good...

...but ugly!

Let's style our button with some CSS

<a class="speechinput">
    <img src="images/mic.png">
</a>

.speechinput {
	cursor:pointer;
	margin:15px;
	color:transparent;
	background-color:transparent;
	border:5px;
	width:15px;
	-webkit-transform: scale(3.0, 3.0);
	transform: scale(3.0, 3.0);
}

And we'll add some color using

Speech

Bubbles

Pure-CSS-Speech-Bubbles by Nicholas Gallagher

Then pull it all together!

But wait, why am I using my eyes like a sucker?

We'll output the answer using SpeechSynthesis

Pretty much all browsers have implemented this spec

The SpeechSynthesis spec looks like this:

interface SpeechSynthesis {
    readonly attribute boolean pending;
    readonly attribute boolean speaking;
    readonly attribute boolean paused;

    void speak(SpeechSynthesisUtterance utterance);
    void cancel();
    void pause();
    void resume();
    SpeechSynthesisVoiceList getVoices();
};

The SpeechSynthesisUtterance spec looks like this:

interface SpeechSynthesisUtterance : EventTarget {
    attribute DOMString text;
    attribute DOMString lang;
    attribute DOMString voiceURI;
    attribute float volume;
    attribute float rate;
    attribute float pitch;
};

With additional event handler attributes to control behaviour:


attribute EventHandler onstart;
attribute EventHandler onend;
attribute EventHandler onerror;
attribute EventHandler onpause;
attribute EventHandler onresume;
attribute EventHandler onmark;
attribute EventHandler onboundary;
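
Putting those two interfaces together, speaking a sentence might look roughly like this. It's a sketch: the helper `speak`, its `root` parameter and the default values are assumptions for illustration. The browser pieces it relies on are `speechSynthesis.speak()` and the `SpeechSynthesisUtterance` constructor.

```javascript
// root is the global object (window in a browser). Returns the utterance,
// or null when speech synthesis is not available.
function speak(root, text, options) {
  if (!root.speechSynthesis || !root.SpeechSynthesisUtterance) return null;

  // Illustrative defaults; any of them can be overridden via options.
  const settings = Object.assign(
    { lang: 'en-US', volume: 1, rate: 1, pitch: 1 },
    options
  );

  const utterance = new root.SpeechSynthesisUtterance(text);
  utterance.lang = settings.lang;
  utterance.volume = settings.volume;
  utterance.rate = settings.rate;
  utterance.pitch = settings.pitch;
  if (settings.onend) utterance.onend = settings.onend;

  root.speechSynthesis.speak(utterance);
  return utterance;
}

// Only runs in a browser; skipped elsewhere.
if (typeof window !== 'undefined') {
  speak(window, 'No, it is too early to eat, buddy.', { rate: 0.9 });
}
```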

But wait, one more thing...

What if I want continuous speech rec?

Use Annyang!

An incredible library by Tal Ater

2 KB and no dependencies

Setup looks like this:

<script src="//cdnjs.cloudflare.com/ajax/libs/annyang/2.5.0/annyang.min.js"></script>
<script>
if (annyang) {
  // Let's define a command.
  var commands = {
    'hello': function() { alert('Hello world!'); }
  };

  // Add our commands to annyang
  annyang.addCommands(commands);

  // Start listening.
  annyang.start();
}
</script>

The real genius is in the command grammar

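var commands = {
  'show tps report': function() { alert('TPS Report!'); },
  // *color is a splat: it captures the rest of the phrase
  'turn the background color *color': function(color) {
    document.body.style.backgroundColor = color;
  },
  // words in parentheses are optional; sayHello is a function defined elsewhere
  'say hello (to the attendees) computer': sayHello
};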
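
To get a feel for how a grammar like that can work, here is a rough sketch of turning a command string into a regular expression. This is not annyang's actual code. The function name `commandToRegExp` is my own, and it assumes the optional part contains only plain words.

```javascript
// Convert a command pattern into a case-insensitive RegExp:
//   (words)  -> optional text
//   *name    -> captures the rest of the phrase
//   :name    -> captures a single word
function commandToRegExp(command) {
  const source = command.replace(
    /\s*\(([^)]*)\)\s*|\*\w+|:\w+|[.+?^${}|[\]\\]/g,
    function (match, optional) {
      if (optional !== undefined) return '\\s*(?:' + optional + ')?\\s*';
      if (match[0] === '*') return '(.*)';
      if (match[0] === ':') return '(\\S+)';
      return '\\' + match; // escape a regex metacharacter
    }
  );
  return new RegExp('^' + source + '$', 'i');
}

const colorCommand = commandToRegExp('turn the background color *color');
const helloCommand = commandToRegExp('say hello (to the attendees) computer');

const colorMatch = colorCommand.exec('Turn the background color dark blue');
console.log(colorMatch[1]);
console.log(helloCommand.test('say hello computer'));
console.log(helloCommand.test('say hello to the attendees computer'));
```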

And if I want to develop hybrid apps using Apache Cordova/PhoneGap/Ionic?

Plugin repos

SpeechRecognitionPlugin - https://github.com/macdonst/SpeechRecognitionPlugin
SpeechSynthesisPlugin - https://github.com/macdonst/SpeechSynthesisPlugin

Availability

OS	Recognition	Synthesis
Android	✓	✓
iOS*	✓	Native to iOS 7.0+

* Thanks to Julio César (@jcesarmobile) for his work on iOS

Getting started

              
phonegap create speech com.example.speech speech
cd speech
phonegap platform add android
phonegap plugin add https://github.com/macdonst/SpeechRecognitionPlugin
phonegap plugin add https://github.com/macdonst/SpeechSynthesisPlugin
phonegap run android
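
Once the plugins are installed, the app code can look a lot like the browser version, except that you wait for Cordova's `deviceready` event first. This sketch assumes the recognition plugin exposes a W3C-style `SpeechRecognition` constructor on the global object. The helper `startOnDeviceReady` and its parameters are hypothetical.

```javascript
// doc is the document, root is the global object (window in the webview).
// Assumes the plugin provides root.SpeechRecognition once the device is ready.
function startOnDeviceReady(doc, root, onTranscript) {
  doc.addEventListener('deviceready', function () {
    const recognition = new root.SpeechRecognition();
    recognition.onresult = function (event) {
      if (event.results.length > 0) {
        onTranscript(event.results[0][0].transcript);
      }
    };
    recognition.start();
  }, false);
}

// Only runs inside a webview or browser; skipped elsewhere.
if (typeof document !== 'undefined' && typeof window !== 'undefined') {
  startOnDeviceReady(document, window, function (text) {
    console.log('Heard: ' + text);
  });
}
```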

For more information on hybrid applications, seek me out during the conference. I can talk your ear off.

Types of Speech Recognition Applications

Voice Web Search
Speech Command Interface
Continuous Recognition of Open Dialog
Domain Specific Grammars Filling Multiple Input Fields
Speech UI present when no visible UI need be present
Voice Activity Detection
Speech Translation
Multimodal Interaction
Speech Driving Directions

THE END