Vocal Search
People are getting used to treating their devices as personal assistants. Personally, I often rely on my smartphone or my Outlook calendar to remind me of things: birthdays, appointments, meetings and deadlines.
However, typing in all that information (appointments, favorite restaurants, airlines) is definitely a tedious task. That is probably why Apple decided to integrate Siri into iOS back in 2011. This way users can treat their iPhone as a secretary, as long as they don’t mind being considered idiots by others staring at them while they’re having a pleasant conversation with their phone.
Recently the W3C released a new specification: the Web Speech API. This opens the door to entirely new ways for users to interact with web applications. It will no longer be just keyboard and mouse: you’ll be able to talk to the browser, issue commands and dictate text. Developers now have the opportunity to build voice-driven personal assistant applications that work in the browser. The API is currently implemented only in the most recent release of Google Chrome (version 25), and the implementation is incomplete.
Only a few days after my Twitter stream notified me about the Web Speech API becoming available in Chrome, Nokia Berlin was hosting the first HERE Hackathon. We had several developers coming over to play with our location APIs. Web Speech API + maps was a perfect little project for a hackathon.
So there I was, hacking Vocal Search together. Vocal Search is a simple little application that takes the user’s voice input, understands it and searches for nearby places matching the user’s request. A request can be something like “Find me a restaurant” and, hopefully, even something more complex like “I am hungry, find a french restaurant around here”.

The first version of Vocal Search I built at the hackathon was very basic: a full-screen map centered on the current location, a search box in case the user wants to type the search terms, and an icon to enable voice capturing. Users could issue commands like “look for a restaurant” or “find a supermarket” and the application would use the HERE Places API to search for relevant places around the user’s position matching the given search terms.

With the current, partial implementation, the Web Speech API works as follows: the browser requests permission to access the microphone. When the user accepts, audio capturing starts. As audio is being captured, it is sent to a server at Google, where speech-to-text is performed. The result is sent back to the browser where the request originated and becomes available as a parameter of a JavaScript callback.
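Since only Chrome ships the API for now, it is probably worth checking that it exists before wiring anything up. Here is a minimal sketch of how that check could look. The helper name getSpeechRecognition is my own invention for this example, and the unprefixed SpeechRecognition name is only what I would expect browsers to expose once the prefix is dropped.

```javascript
// Returns the speech recognition constructor exposed by the given global
// object, or null when the API is not implemented.
// getSpeechRecognition is a hypothetical helper, not part of the API.
function getSpeechRecognition(globalObject) {
    return globalObject.SpeechRecognition ||      // unprefixed name (assumed for the future)
        globalObject.webkitSpeechRecognition ||   // prefixed name used by Chrome
        null;
}

// In a browser this inspects `window`; elsewhere it inspects an empty
// object, so the sketch also runs outside a browser and finds nothing.
var globalObject = typeof window !== 'undefined' ? window : {};
var Recognition = getSpeechRecognition(globalObject);

if (Recognition === null) {
    console.log('Web Speech API not available, falling back to the search box');
}
```

When the check fails the user can still type into the search box, so nothing breaks in other browsers.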

A simplified version of the speech recognition code for Vocal Search is something like this:
(function(){
    var recognizing = false,
        transcript = '',
        speechRecognitionHandler = new webkitSpeechRecognition();
    //Keep listening until recognition is stopped explicitly
    //https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#dfn-continuous
    speechRecognitionHandler.continuous = true;
    //Also report interim (non-final) results while the user is speaking
    //https://dvcs.w3.org/hg/speech-api/raw-file/tip/speechapi.html#dfn-interimresults
    speechRecognitionHandler.interimResults = true;
    speechRecognitionHandler.onstart = function(){
        recognizing = true;
    };
    speechRecognitionHandler.onerror = function(e){
        recognizing = false;
        console.log('Something wrong happened', e.error);
    };
    speechRecognitionHandler.onresult = function(e){
        //We've got something here: append only the final results
        for (var i = e.resultIndex; i < e.results.length; ++i) {
            if (e.results[i].isFinal) {
                transcript += e.results[i][0].transcript;
            }
        }
    };
    speechRecognitionHandler.onend = function(e){
        //Ok, done
        recognizing = false;
        //This should be everything the user said:
        console.log(transcript);
        //Do something interesting with transcript
    };
    //Capturing only begins once speechRecognitionHandler.start() is called
})();
As you can see, in its current state the implementation of the Web Speech API is nothing particularly smart: it is only a speech-to-text service accessible from JavaScript. In the future, however, it will be much more than that. The specification defines (roughly for now) the possibility of defining grammars. This means Web Speech will become a complete NLU (Natural Language Understanding) API directly available in the browser. The diagram above will then become something like this:

Unfortunately, for now you’ll have to parse the content of that transcript variable yourself.
During the hackathon, the smartest thing I could think of was a regular-expression-based command parser. If transcript contains “search for”, “look for” or “find”, then the user definitely wanted to search for something. Right after the command, transcript probably contains the search terms. Pass the search terms to the Places REST API and boom, pins appear on the map.
This worked fine, although it was quite a big limitation. Sure, you could define thousands of regular expressions, but that is totally not the way to go. However, I considered that enough for a day of hacking, and promised myself I would revisit the code later on, hoping that one day the Web Speech API will be completely implemented by all browser vendors.
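To give an idea of what such a parser can look like, here is a minimal sketch. It is not the code I wrote at the hackathon: the function name parseSearchCommand, the two patterns and the list of filler words are all made up for this example.

```javascript
// Matches "search for", "look for" or "find" as whole words, then captures
// everything that follows the command as the raw search terms.
var COMMAND_PATTERN = /\b(?:search for|look for|find)\b\s+(.+)$/i;

// Filler words that often follow the command ("find me a restaurant").
// The list is an assumption, picked only for illustration.
var LEADING_FILLER = /^(?:(?:me|us|a|an|the|some)\s+)+/i;

// Returns the search terms found in the transcript, or null when the
// transcript contains no known command.
function parseSearchCommand(transcript) {
    var match = COMMAND_PATTERN.exec(transcript.trim());
    if (!match) {
        return null;
    }
    var terms = match[1].replace(LEADING_FILLER, '').trim();
    return terms.length > 0 ? terms : null;
}

console.log(parseSearchCommand('Find me a restaurant'));   // restaurant
console.log(parseSearchCommand('look for a supermarket')); // supermarket
console.log(parseSearchCommand('What a nice day'));        // null
```

Note that “I am hungry, find a french restaurant around here” comes out as “french restaurant around here”: the parser has no idea that “around here” is not part of the search terms, which is exactly the kind of limitation regular expressions run into.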
Just a few days ago I was googling for I-don’t-remember-what, and I stumbled upon something incredibly awesome: a REST API that given a piece of text or a phrase returns a JSON object containing the following:
- Category: a high-level identification of the subject of the sentence.
- Action: a more specific type of category, which provides some information about the intent of the text.
- Entities: important pieces of information from the sentence in a normalized form.
Isn’t that great? It is exactly what I needed to make Vocal Search better and smarter. Have a look at Maluuba’s nAPI, it is a really nice API to play with. They have libraries for Ruby, Python and Java, but in the end it is just an HTTP GET request with the API key in the query string, so it can easily be done via AJAX in the browser. Unfortunately, they currently don’t have CORS enabled and there is no JSONP support, so you’ll have to proxy the call (for non-business-critical apps you can use corsproxy).
So here is what my onend callback does now (more or less, the actual code is a little more complex than this):
speechRecognitionHandler.onend = function(e){
    //Ok, done
    recognizing = false;
    //This should be everything the user said:
    console.log(transcript);
    //Ask the NLU service to interpret the transcript
    $.ajax({
        url: 'http://www.corsproxy.com/napi.maluuba.com/v0/interpret',
        data: {
            apikey: 'myapikey',
            phrase: transcript
        }
    }).done(function(data){
        if(data.action === 'BUSINESS_SEARCH') {
            //The user wants to search for something, cool!
            //data.entities.searchTerm should now be an array of search terms
            var searchTerms = (data.entities && data.entities.searchTerm) || [];
            firePlacesSearch(searchTerms.join(' '));
        }
    }).fail(function(){
        alert('Oh no! Something must have gone wrong.');
    });
};
Here we go, super easy, works great. Now I can say things like “I feel like having a beer, find me a pub”. I get the transcript back from Google, I send it to Maluuba’s nAPI, and I get back a response that tells me it is a business search and that the search terms are beer and pub. I pass that to the Places API and I get pubs highlighted on the map with colorful pins. It is a really nice toy.
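If you want to play with the response handling without hitting the network, the interesting part can be pulled out into a small function. This is only a sketch: extractSearchTerms is a name I made up, and the sample response contains just the two fields my callback relies on (action and entities.searchTerm), so the real response most likely carries more than this.

```javascript
// Returns the search terms from an interpretation response as a single
// string, or null when the response is not a business search or carries
// no usable terms. extractSearchTerms is a hypothetical helper.
function extractSearchTerms(response) {
    if (!response || response.action !== 'BUSINESS_SEARCH') {
        return null;
    }
    var terms = response.entities && response.entities.searchTerm;
    if (!Array.isArray(terms) || terms.length === 0) {
        return null;
    }
    return terms.join(' ');
}

// Hand-written stand-in for the response to
// "I feel like having a beer, find me a pub" (assumed shape).
var sampleResponse = {
    action: 'BUSINESS_SEARCH',
    entities: { searchTerm: ['beer', 'pub'] }
};

console.log(extractSearchTerms(sampleResponse)); // beer pub
```

Returning null for anything that is not a business search leaves the caller free to decide what to do with other kinds of requests.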
Now, demo time. Vocal Search is available at vocalsearch.marcon.me and all the code is on GitHub. To start recording, click the microphone icon or press s when the focus is on the map. Do the same to stop recording. Note that since Vocal Search runs on http:// and not on https://, the browser will ask for permission to record every single time, and the preference cannot be saved.
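The start/stop behaviour boils down to a toggle around the recognizer’s start() and stop() methods. The sketch below shows the idea; createRecordingToggle is a made-up name and the code on GitHub may well do it differently.

```javascript
// Wraps a recognizer (anything with start() and stop() methods, such as
// a webkitSpeechRecognition instance) in a toggle function.
// createRecordingToggle is a hypothetical helper for this example.
function createRecordingToggle(recognizer) {
    var recording = false;
    return function toggle() {
        if (recording) {
            recognizer.stop();
        } else {
            recognizer.start();
        }
        recording = !recording;
        return recording; // true when recording has just been started
    };
}

// Stand-in recognizer so the sketch runs without a browser.
var fakeRecognizer = {
    start: function () { console.log('recording started'); },
    stop: function () { console.log('recording stopped'); }
};

var toggleRecording = createRecordingToggle(fakeRecognizer);
toggleRecording(); // logs "recording started"
toggleRecording(); // logs "recording stopped"
```

In the page, the same toggle function would be bound to both the click on the microphone icon and the s key. Recognition can also end on its own, so a real implementation should reset the flag from the onend callback rather than trust the toggle alone.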
Should the demo not work, corsproxy is probably down as it is now for me. Please try again later. I am working on setting up my own CORS proxy service.
Update: the supernice guys at Maluuba actually read my feedback email and enabled CORS for their nAPI. This means proxying the AJAX calls to http://napi.maluuba.com/v0/interpret is no longer necessary, and everything just works. Good job guys!
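With CORS enabled the request can go straight to the endpoint. As a rough sketch, this is how the request URL can be put together by hand. buildInterpretUrl is a name I made up, and apikey and phrase are the same two parameters used in the callback above.

```javascript
// Endpoint taken from the update above; no proxy in front of it anymore.
var INTERPRET_ENDPOINT = 'http://napi.maluuba.com/v0/interpret';

// Builds the GET URL for a phrase. buildInterpretUrl is a hypothetical
// helper; both values are escaped for use in a query string.
function buildInterpretUrl(apiKey, phrase) {
    return INTERPRET_ENDPOINT +
        '?apikey=' + encodeURIComponent(apiKey) +
        '&phrase=' + encodeURIComponent(phrase);
}

console.log(buildInterpretUrl('myapikey', 'find me a pub'));
// http://napi.maluuba.com/v0/interpret?apikey=myapikey&phrase=find%20me%20a%20pub
```

jQuery builds the same kind of query string from the data option, so in the callback it is enough to drop the corsproxy prefix from the url.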













