Phonetics vs. LVCSR: Under the Hood of Speech Analytics
Speech analytics allows enterprises to leverage the voice of the customer as a business asset. It can be used to diagnose and address customer service issues, improve contact center efficiency, guide agent training, and more. Two approaches to speech recognition are commonly used in conversational analytics solutions: phonetics and LVCSR, each with its own pros and cons. So which is better for mining intelligence from contact centers?
Phonetic Speech Analytics
The basic recognition unit of a phonetic system is the phoneme, the smallest unit of sound that distinguishes one word from another. A phonetic speech analytics engine preprocesses the audio into phonemes and encodes the result as a lattice of possible phoneme sequences. Any search terms entered later by call center agents or managers are also translated into phonemes, and the search looks for that same sequence in the existing lattice.
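The search step can be sketched in a few lines of Python. This is a simplified illustration, not how any particular product works: the lattice is reduced to a single best-path list of phonemes, and the PRONUNCIATIONS table, to_phonemes, and find_term are hypothetical names invented for the example.

```python
# Hypothetical pronunciation table using ARPAbet-like symbols.
# A real system would use a full grapheme-to-phoneme model instead.
PRONUNCIATIONS = {
    "refund": ["R", "IY", "F", "AH", "N", "D"],
    "want": ["W", "AA", "N", "T"],
}

def to_phonemes(term):
    """Translate a search term into a flat list of phonemes."""
    phonemes = []
    for word in term.lower().split():
        phonemes.extend(PRONUNCIATIONS[word])
    return phonemes

def find_term(term, audio_phonemes):
    """Return every start offset where the term's phoneme sequence occurs."""
    target = to_phonemes(term)
    n = len(target)
    return [i for i in range(len(audio_phonemes) - n + 1)
            if audio_phonemes[i:i + n] == target]

# Assumed best-path phoneme stream for the utterance "I want a refund".
call = ["AY", "W", "AA", "N", "T", "AH", "R", "IY", "F", "AH", "N", "D"]

print(find_term("refund", call))  # [6]
```

A production lattice stores several competing phoneme hypotheses per time slice, so a real search walks many paths rather than one list.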
Pros and Cons of Phonetics Systems
The greatest advantage of a phonetic recognition system is that words that are not in a predefined vocabulary can still be found, provided the phonemes are recognizable. For instance, when searching for the drug name “cialis”, the term can still be found in the audio if its sequence of phonemes, “S IY AH L IH S”, exists in the lattice. The disadvantage is that, because the lattice contains many possible sequences, the term may be found in many places where it was never actually said, for instance if the actual words were “see a list.” The phonemes are very similar, but the match is a false positive. Phonetic approaches generally have a higher recall rate (a measure of how many of the items that were actually in the documents being searched were found), but this is offset by low precision, because those false positives have to be manually filtered out of the result set.
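The trade-off between recall and precision can be made concrete with a small calculation. The sketch below uses three made-up calls, each reduced to a single phoneme list and hand-labelled with whether the term was really spoken; the data and the helper name contains are assumptions for illustration only.

```python
def contains(stream, target):
    """True if the target phoneme sequence occurs anywhere in the stream."""
    n = len(target)
    return any(stream[i:i + n] == target for i in range(len(stream) - n + 1))

CIALIS = ["S", "IY", "AH", "L", "IH", "S"]

# Hypothetical labelled calls: (phoneme stream, was the term really spoken?)
calls = {
    "call_1": (["AY", "T", "EY", "K"] + CIALIS, True),          # "I take Cialis"
    "call_2": (["S", "IY", "AH", "L", "IH", "S", "T"], False),  # "see a list"
    "call_3": (["R", "IY", "F", "AH", "N", "D"], False),        # "refund"
}

hits = {cid for cid, (stream, _) in calls.items() if contains(stream, CIALIS)}
relevant = {cid for cid, (_, spoken) in calls.items() if spoken}

true_positives = len(hits & relevant)
precision = true_positives / len(hits)      # share of hits that are correct
recall = true_positives / len(relevant)     # share of real mentions found

print(sorted(hits))        # ['call_1', 'call_2']
print(precision, recall)   # 0.5 1.0
```

Every real mention is found (recall of 1.0), but half of the hits are false positives that someone has to review by hand (precision of 0.5).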
Phonetics systems are also typically faster at processing (turning the phone calls into data), mostly because the “vocabulary” being used is very small: phonetics relies only on the sounds of the language, and most languages have only a few dozen unique phonemes. However, the search process itself is much slower, since phonemes cannot be indexed as efficiently as whole words. Phonetics also requires a larger storage footprint, since a word has an average of 4 phonemes, which may be an issue for large-scale projects.
While phonetic approaches do take into account which sequences of sounds are possible and how frequent they are (for example, consonant clusters such as “stldr” never occur in English), they do not use any higher-level knowledge of the language. As a result, they cannot rank candidate matches by how likely a phrase is to have been spoken, which means more manual work later.
LVCSR Speech Analytics
LVCSR (Large Vocabulary Continuous Speech Recognition) begins by recognizing phonemes much like a phonetic system, but then applies a dictionary or language model of potentially 50,000 to 100,000 words and phrases to produce a full transcript. In LVCSR every word is recognized and nothing is thrown away or skipped. While recognizing the full transcript, and not just the individual phonemes, requires more processing power than phonetic-only recognition, the resulting transcript makes it much easier and faster for contact centers to search and use the gathered information.
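One reason a transcript is fast to search is that whole words can be placed in an inverted index, which maps each word to the calls that contain it. The following is a minimal sketch under that assumption; the transcripts and the function names build_index and search are invented for the example.

```python
from collections import defaultdict

def build_index(transcripts):
    """Map each word to the set of call ids whose transcript contains it."""
    index = defaultdict(set)
    for call_id, text in transcripts.items():
        for word in text.lower().split():
            index[word].add(call_id)
    return index

def search(index, phrase):
    """Return the call ids containing every word of the phrase (order ignored)."""
    words = phrase.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical transcripts produced by the recognizer.
transcripts = {
    "call_1": "i want to cancel my subscription",
    "call_2": "my bill is wrong",
    "call_3": "please cancel the order",
}
index = build_index(transcripts)

print(sorted(search(index, "cancel")))     # ['call_1', 'call_3']
print(sorted(search(index, "cancel my")))  # ['call_1']
```

Each lookup is a dictionary access followed by set intersections, so the cost depends on the number of query words rather than on the total amount of audio.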
Pros and Cons of LVCSR Speech Analytics
LVCSR uses statistical methods to estimate the likelihood of different word sequences (like “the side effects of cialis include” or “cialis pills”). Its accuracy is therefore much higher than the single-word lookup of a phonetic approach, and it is more likely that if a word is found, it was really spoken. The disadvantage, however, is that the words in the search terms need to be in the dictionary in order to be found by the transcription search engine. In the rare cases when a word is not in the 50,000 to 100,000 word vocabulary, the word or phrase of interest can often be found by combining words (e.g. “see Alice” for “cialis”) or by using “sounds-like” approaches to word matching (e.g. Horizon and Verizon).
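One plausible way to implement “sounds-like” matching is to compare pronunciations by edit distance and accept dictionary words that differ by only a phoneme or so. The sketch below takes that approach as an assumption; the article does not say which method a given product uses, and the pronunciations and the names edit_distance and sounds_like are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

# Hypothetical in-vocabulary words with ARPAbet-like pronunciations.
VOCABULARY = {
    "horizon": ["HH", "ER", "AY", "Z", "AH", "N"],
    "rising": ["R", "AY", "Z", "IH", "NG"],
    "arizona": ["AE", "R", "IH", "Z", "OW", "N", "AH"],
}

def sounds_like(query_phonemes, vocabulary, max_distance=1):
    """Return vocabulary words whose pronunciation is close to the query."""
    return sorted(word for word, phonemes in vocabulary.items()
                  if edit_distance(query_phonemes, phonemes) <= max_distance)

# Assumed pronunciation of the out-of-vocabulary word "Verizon".
verizon = ["V", "ER", "AY", "Z", "AH", "N"]

print(sounds_like(verizon, VOCABULARY))  # ['horizon']
```

Searching the transcripts for “horizon” would then surface calls where “Verizon” was spoken but transcribed as the nearest in-vocabulary word.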
The initial processing of the audio also takes a bit longer than with a phonetic approach because of the large vocabulary that has to be analyzed, but search is then much faster and more accurate. Imagine being a call center manager trying to uncover the root causes of a sudden spike in call traffic. If there is an issue with one of your self-service channels (or perhaps a product issue), it’s imperative that you find that out as soon as possible. Faster search speeds mean you can comb through more customer calls in a shorter amount of time and get to the heart of the issue quickly, at a lower staffing cost.
With an LVCSR approach, the larger contexts that the sounds occur in are taken into account as well. This compensates for the fact that some sounds are very ambiguous and tend to merge with neighboring sounds (e.g. “dish soap”) and the same sequence of sounds can be different word sequences: “let us pray” vs. “lettuce spray”. The LVCSR approach algorithmically determines which alternatives are more likely, letting the computer do most of the work.
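A toy bigram language model shows how this choice between alternatives can be made. The counts below are invented for illustration, and add-one smoothing is just one simple way to handle unseen word pairs; real LVCSR engines use far larger models.

```python
import math

# Hypothetical counts from a training corpus; "<s>" marks the sentence start.
UNIGRAM_COUNTS = {"<s>": 1000, "let": 120, "us": 200, "lettuce": 4}
BIGRAM_COUNTS = {
    ("<s>", "let"): 60,
    ("let", "us"): 90,
    ("us", "pray"): 12,
    ("<s>", "lettuce"): 1,
}
VOCAB_SIZE = 6  # <s>, let, us, pray, lettuce, spray

def log_probability(sentence):
    """Sum of add-one smoothed bigram log probabilities for the sentence."""
    words = ["<s>"] + sentence.lower().split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        pair_count = BIGRAM_COUNTS.get((prev, word), 0)
        prev_count = UNIGRAM_COUNTS.get(prev, 0)
        total += math.log((pair_count + 1) / (prev_count + VOCAB_SIZE))
    return total

# Two word sequences that fit the same sounds.
candidates = ["let us pray", "lettuce spray"]
best = max(candidates, key=log_probability)

print(best)  # let us pray
```

Because “let us” and “us pray” were seen often in the assumed training data while “lettuce spray” was never seen, the model prefers the first reading without any human review.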
LVCSR systems typically have much higher precision, since the results are more likely to contain the words that were actually said, but lower recall due to unusual words or recognition errors. To compensate for this, an LVCSR approach provides a transcript of the words around the key term, allowing users to visually skim the “snippet” and determine whether it is relevant. In addition, the fact that there is an actual transcript of the conversation allows for automated analysis of the frequency of various words and phrases. This can reveal trends and metrics that the system hasn’t been specifically told to look for.
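Both ideas, the snippet around a key term and automated frequency analysis, are straightforward once a transcript exists. The sketch below uses made-up transcripts and a tiny hand-picked stopword list; the function names snippet and top_terms are assumptions for the example.

```python
from collections import Counter

STOPWORDS = frozenset({"the", "is", "and", "my", "i", "on", "was"})

def snippet(transcript, term, window=2):
    """Return the words surrounding each occurrence of the term."""
    words = transcript.split()
    results = []
    for i, word in enumerate(words):
        if word.lower() == term.lower():
            start = max(0, i - window)
            results.append(" ".join(words[start:i + window + 1]))
    return results

def top_terms(transcripts, n=2):
    """Count non-stopword terms across all transcripts, most frequent first."""
    counts = Counter()
    for text in transcripts:
        counts.update(w for w in text.lower().split() if w not in STOPWORDS)
    return counts.most_common(n)

# Hypothetical call transcripts.
transcripts = [
    "my router keeps dropping the connection every evening",
    "the router light is red and the connection is down",
    "i was charged twice on my bill",
]

print(snippet(transcripts[0], "connection"))
# ['dropping the connection every evening']
print(top_terms(transcripts))
# [('router', 2), ('connection', 2)]
```

Nobody asked the system to look for “router”, yet it surfaces as a frequent term, which is the kind of unprompted trend the transcript makes visible.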
So Which Speech Recognition System is the Best?
The final decision comes down to what the “best fit” is for your business problem in terms of cost, value, and manual effort. In general, phonetics is an appropriate fit for applications that are searched only occasionally. For larger enterprise applications like speech analytics and business intelligence mining, the LVCSR approach is better suited. In the grand scheme of things, using phonetics is a bit like using cassette tapes in an era of MP3s. It still works and has niche benefits, but for the most part the technology is outdated and there are more powerful options available to contact centers.