Automatic Speech Recognition - SkillBakery Studios



Wednesday, September 30, 2020

Automatic Speech Recognition

Speech is the primary mode of communication among human beings, whereas computers typically take input through a keyboard or a mouse.

What if computers could listen to human speech (commands) and carry out the task?

Automatic speech recognition (ASR), also called computer speech recognition or speech-to-text (STT), is the technology and set of methods that enable computers to understand spoken language and convert it into text.

Speech recognition applications include voice user interfaces such as voice dialing (e.g. "call home"), call routing (e.g. "I would like to make a collect call"), domotic appliance control, keyword search (e.g. finding a podcast where particular words were spoken), simple data entry (e.g. entering a credit card number), preparation of structured documents (e.g. a radiology report), determining speaker characteristics, speech-to-text processing (e.g. word processors or emails), and aircraft control (usually termed direct voice input).


Some of the uses of automatic speech recognition are given below:


In-car systems- Typically a manual control input, for example a finger control on the steering wheel, enables the speech recognition system, and this is signaled to the driver by an audio prompt. Following the audio prompt, the system has a "listening window" during which it may accept a speech input for recognition.

Simple voice commands may be used to initiate phone calls, select radio stations, or play music from a compatible smartphone, MP3 player, or music-loaded flash drive. Voice recognition capabilities vary between car make and model.


Health care- In the health care sector, speech recognition can be implemented in the front end or the back end of the medical documentation process. Front-end speech recognition is where the provider dictates into a speech-recognition engine, the recognized words are displayed as they are spoken, and the dictator is responsible for editing and signing off on the document.


Defense- Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note have been the US program in speech recognition for Advanced Fighter Technology, a program in France for fighter aircraft, and other programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays.


Communications- ASR is now commonplace in mobile technology and is becoming more widespread in computer gaming and simulation. In telephony systems, ASR is used predominantly in contact centers, where it is integrated with IVR systems. In document production, however, ASR has not seen the expected increase in use, despite its high level of integration with word processing in general personal computing.

Improvements in mobile processor speeds have made speech recognition practical on smartphones. Speech is used mostly as part of a user interface, for creating predefined or custom speech commands.


Education- Speech recognition can be useful for learning. It can teach proper pronunciation, in addition to helping a person develop fluency in their speaking skills.

Students who are blind or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as operate a computer with voice commands instead of having to look at the screen and keyboard.

The major service providers in ASR are Google and AWS (Amazon Web Services). We have compared the two to help you choose the one that suits you best.


Below are the important points on which the two are compared:


1.   Speed. The speed of a transcription platform is a crucial factor. Given enough time, anyone could transcribe multimedia content by hand, but the point of platforms like these is to make that time as short as possible. In some cases, though, speed may not be the deciding factor. Some companies will be better off with a slower but more accurate solution.


2.   Accuracy. The value of a transcription platform is measured by its accuracy. If the platform gives you a transcription that needs additional edits to punctuation and speaker labels, then that platform, my friend, hasn’t done much of the job for you. But again, companies that have large amounts of transcripts may be better off with a slightly less accurate but much cheaper solution.


3.   Price. Whether you are a small company or a well-established vendor moving the market, everyone cares about costs.
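Accuracy (point 2 above) is often quantified as word error rate (WER): the number of word-level insertions, deletions, and substitutions needed to turn the platform's output into a reference transcript, divided by the number of words in the reference. The post does not say which metric it has in mind, so the sketch below is only one common way to measure it. The function name `word_error_rate` and the sample sentences are illustrative, not taken from either platform.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must not be empty")

    # prev[j] holds the edit distance between the reference words seen so far
    # (all but the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical human reference versus a hypothetical platform output.
    print(word_error_rate("call home now", "call hone"))
```

A lower value means a more accurate transcript, and 0.0 means the output matches the reference word for word. To compare two platforms fairly, run both on the same audio and score them against the same human-made reference.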


Amazon Web Services

The best thing about Amazon Transcribe is the accuracy of its transcriptions. AWS has been the world’s most comprehensive and broadly adopted cloud platform for the last 12 years, and that experience shows in the accuracy of Amazon Transcribe's results.

Namely, unlike other transcription services, the Amazon Transcribe platform produces text that is ready to use, without the need for further editing. To achieve this, AWS Transcribe pays special attention to:

Punctuation- The Amazon Transcribe platform adds appropriate punctuation to the text as it goes and formats the text automatically. This produces intelligible output that can be used without further editing.

Confidence score- AWS Transcribe provides a confidence score that shows how confident the platform is in its transcription.
This means you can always check the confidence score to see whether a particular part of the transcript needs alterations.

Possible alternatives- The platform can also return alternative transcriptions, which give you a starting point for alterations in cases where you are not completely satisfied with the results.

Timestamp Generation- Powered by deep learning technologies, AWS Transcribe automatically generates time-stamped text transcripts.
This feature provides timestamps for every word which makes locating the audio in the original recording very easy by searching for the text.

Custom Vocabulary- AWS Transcribe allows you to create your own custom vocabulary. By creating and managing a custom vocabulary you expand and customize the speech recognition of AWS Transcribe.
Basically, custom vocabulary gives AWS Transcribe more information about how to process speech in the multimedia file.
This feature is very important for achieving high accuracy when transcribing specialized domains such as Engineering, Medical, Law Enforcement, Legal, etc.

Multiple Speakers- The AWS Transcribe platform can identify different speakers in a multimedia file. The platform can recognize when the speaker changes and attribute the transcribed text accordingly. This is handy when transcribing multimedia content that involves multiple speakers (such as telephone calls, meetings, etc.).

The Amazon Transcribe API is billed monthly at a rate of $0.00056 per second.
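The confidence scores and per-word timestamps described above can be used together when reviewing a transcript. The sketch below assumes the result is a JSON document in which each entry under `results.items` has a `type`, string-valued `start_time` and `end_time`, and a list of `alternatives` carrying `content` and `confidence`. The sample data is hand-written for illustration, and the helper names (`spoken_words`, `low_confidence_words`, `find_word`) are hypothetical, so check the shape against a real output file before relying on it.

```python
import json

# Hand-written sample in the assumed shape of a transcription result.
SAMPLE = json.loads("""
{
  "results": {
    "transcripts": [{"transcript": "Call home now."}],
    "items": [
      {"type": "pronunciation", "start_time": "0.04", "end_time": "0.31",
       "alternatives": [{"confidence": "0.99", "content": "Call"}]},
      {"type": "pronunciation", "start_time": "0.31", "end_time": "0.70",
       "alternatives": [{"confidence": "0.62", "content": "home"}]},
      {"type": "pronunciation", "start_time": "0.70", "end_time": "1.02",
       "alternatives": [{"confidence": "0.95", "content": "now"}]},
      {"type": "punctuation",
       "alternatives": [{"confidence": "0.0", "content": "."}]}
    ]
  }
}
""")


def spoken_words(result):
    """Yield (word, start_seconds, end_seconds, confidence) for spoken items."""
    for item in result["results"]["items"]:
        if item.get("type") != "pronunciation":
            continue  # punctuation entries in the sample carry no timestamps
        best = item["alternatives"][0]
        yield (best["content"], float(item["start_time"]),
               float(item["end_time"]), float(best["confidence"]))


def low_confidence_words(result, threshold=0.8):
    """Return (word, confidence) pairs that fall below the review threshold."""
    return [(w, c) for w, _s, _e, c in spoken_words(result) if c < threshold]


def find_word(result, query):
    """Return (start, end) times, in seconds, of every occurrence of a word."""
    q = query.lower()
    return [(s, e) for w, s, e, _c in spoken_words(result) if w.lower() == q]


if __name__ == "__main__":
    print(low_confidence_words(SAMPLE))
    print(find_word(SAMPLE, "now"))
```

The 0.8 threshold is an arbitrary starting point. In practice you would tune it by checking how often words below the threshold actually turn out to be wrong in your own recordings.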

Google Speech-to-Text

Google Speech-to-Text handles multimedia content of different lengths and durations and returns results immediately. Thanks to Google’s Machine Learning technology, the platform can process real-time streaming or pre-recorded audio content in encodings including FLAC, AMR, PCMU, and Linear-16.

The platform recognizes 120 languages, which puts it well ahead of the Amazon Transcribe platform in language coverage.

Despite this, Google still falls short on accuracy and price compared to the Amazon Transcribe platform.

Google Speech-to-Text accuracy improves over time as Google improves the internal speech recognition technology used by Google products. Its features include:

1.     Automatic identification of the spoken language. Google uses this feature to automatically identify the language spoken in the multimedia content (out of 4 selected languages) without any additional alterations.

2.     Automatic recognition of proper nouns and context-specific formatting. Google Speech-to-Text works well with real-life speech. It can accurately transcribe proper nouns and appropriately format language (such as dates, phone numbers).

3.     Phrase hints. Almost identical to Amazon’s Custom Vocabulary, this feature lets you customize the context by providing a set of words and phrases that are likely to occur in the transcription.

4.     Noise robustness. This feature of Google Speech-to-Text allows for noisy multimedia to be handled without additional noise cancellation.

5.     Inappropriate content filtering. Google Speech-to-Text is capable of filtering inappropriate content in text results for some languages.

6.     Automatic punctuation. Like Amazon Transcribe, this platform also adds punctuation to transcriptions.

7.     Speaker recognition. This feature is similar to Amazon’s recognition of multiple speakers. It makes automatic predictions about which of the speakers in a conversation spoke which part of the text.

Google Speech-to-Text costs $0.006 per 15 seconds, while the video model costs twice as much, at $0.012 per 15 seconds.
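To compare the two price lists on the same footing, the rates quoted in this post can be turned into a cost per recording. This is a rough sketch: it assumes Google usage is rounded up to whole 15-second blocks and that Amazon charges exactly per second with no minimum, neither of which the post states. The function names are illustrative, and current pricing pages should be checked before budgeting.

```python
from decimal import Decimal

# Rates as quoted in this post (US dollars).
AMAZON_PER_SECOND = Decimal("0.00056")
GOOGLE_STANDARD_PER_15S = Decimal("0.006")
GOOGLE_VIDEO_PER_15S = Decimal("0.012")


def amazon_cost(seconds: int) -> Decimal:
    """Cost of transcribing `seconds` of audio at the quoted per-second rate."""
    return AMAZON_PER_SECOND * seconds


def google_cost(seconds: int, video_model: bool = False) -> Decimal:
    """Cost at the quoted per-15-second rate.

    Assumption: usage is rounded up to whole 15-second blocks.
    """
    blocks = -(-seconds // 15)  # integer ceiling division
    rate = GOOGLE_VIDEO_PER_15S if video_model else GOOGLE_STANDARD_PER_15S
    return rate * blocks


if __name__ == "__main__":
    one_hour = 60 * 60
    print("Amazon Transcribe:", amazon_cost(one_hour))
    print("Google standard:  ", google_cost(one_hour))
    print("Google video:     ", google_cost(one_hour, video_model=True))
```

With the rates quoted here, one hour of audio comes to $2.016 on Amazon Transcribe, $1.44 on the standard Google model, and $2.88 on the Google video model, so which service is cheaper depends on the model you need.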
