Speech is the primary mode of communication among human beings, whereas computer input typically comes through a keyboard or a mouse. What if computers could listen to human speech (commands) and perform the requested task?
Automatic speech recognition (ASR), also called computer speech recognition or speech-to-text (STT), is the set of technologies and methods that enable computers to understand spoken language and convert it into text.
Speech recognition applications include voice user interfaces such as voice dialing (e.g., "call home"), call routing (e.g., "I would like to make a collect call"), domotic appliance control, keyword search (e.g., finding a podcast where particular words were spoken), simple data entry (e.g., entering a credit card number), preparation of structured documents (e.g., a radiology report), determining speaker characteristics, speech-to-text processing (e.g., word processors or emails), and aircraft control (usually termed direct voice input).
Some of the uses of automatic speech recognition are given below:
In-car systems-Typically, a manual control input, for example a finger control on the steering wheel, enables the speech recognition system, and this is signaled to the driver by an audio prompt. Following the audio prompt, the system has a "listening window" during which it may accept a speech input for recognition. Simple voice commands may be used to initiate phone calls, select radio stations, or play music from a compatible smartphone, MP3 player, or music-loaded flash drive. Voice recognition capabilities vary between car make and model.
Health care-In the health care sector, speech recognition can be implemented in the front end or back end of the medical documentation process. Front-end speech recognition is where the provider
dictates into a speech-recognition engine, the recognized words are displayed
as they are spoken, and the dictator is responsible for editing and signing off
on the document.
Defense-Substantial efforts have been devoted in the last decade to the test and evaluation of speech recognition in fighter aircraft. Of particular note have been the US Advanced Fighter Technology Integration (AFTI) program in speech recognition, a similar program in France, and other programs in the UK dealing with a variety of aircraft platforms. In these programs, speech recognizers have been operated successfully in fighter aircraft, with applications including setting radio frequencies, commanding an autopilot system, setting steer-point coordinates and weapons release parameters, and controlling flight displays.
Communications-ASR is now commonplace in the field of mobile technology and is becoming more widespread in the
field of computer gaming and
simulation. In telephony systems, ASR is now being predominantly used in
contact centers by integrating it with IVR systems. Despite the high level of
integration with word processing in general personal computing, in the field of
document production, ASR has not seen the expected increases in use.
The improvement of mobile processor speeds has made speech recognition practical in smartphones. Speech is used mostly as part of a user interface, for creating predefined or custom speech commands.
Education-Speech recognition can be useful for learning. It can teach proper pronunciation and help a person develop fluency in their speaking skills. Students who are blind or have very low vision can benefit from using the technology to convey words and then hear the computer recite them, as well as use a computer by commanding it with their voice, instead of having to look at the screen and keyboard.
Below are the important points on which two popular transcription services, Amazon Transcribe and Google Speech-to-Text, are compared:
1. Speed. The speed of a transcription platform is a crucial factor. Given enough time, anyone could transcribe multimedia content, but the point of platforms like these is to make that time as short as possible. In some cases, though, speed may not be the ultimate deciding factor: some companies will be better off with a slower but more accurate solution.
2. Accuracy. Accuracy is very important to a transcription platform; the value of the platform is measured by it. If the platform gives you a transcription that needs additional edits to punctuation and speaker attribution, then that platform hasn't done much of the job for you. Then again, companies that produce large volumes of transcripts may be better off with a slightly less accurate but much cheaper solution.
3. Price. No matter whether you are a small company or a well-established vendor moving the market, everyone cares about costs.
Amazon
Web Services
The best thing about Amazon
Transcribe is the accuracy of transcriptions. AWS has been the world’s most
comprehensive and broadly adopted cloud platform for the last 12 years. This
experience can be seen in the accuracy Amazon Transcribe shows in their
results.
Namely, unlike other transcription services, the Amazon Transcribe platform produces texts that are ready to use, without a need for further editing. To achieve this, Amazon Transcribe pays special attention to:
Punctuation- The Amazon Transcribe platform is capable of adding appropriate punctuation to the text as it goes and formats the text automatically, producing an intelligible output that can be used without further editing.
Confidence score- AWS Transcribe makes sure to provide a confidence
score which shows how confident the platform is with the transcription.
This means you can always check the confidence score to see whether a particular line of the transcript needs alterations.
Possible alternatives- The platform also gives you an opportunity to
make some alterations in cases where you are not completely satisfied with the
results.
Timestamp
Generation- Powered by deep learning technologies, AWS
Transcribe automatically generates time-stamped text transcripts.
This feature provides timestamps for every word which makes locating the audio
in the original recording very easy by searching for the text.
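The confidence scores and word-level timestamps described above both live in the items list of a Transcribe result. As a sketch, the snippet below walks a small sample in the shape of that JSON output (the words and values are made up for illustration) and flags low-confidence words together with the timestamp where each occurs:

```python
# Sample in the shape of an Amazon Transcribe result JSON
# (structure follows the service's output; values here are invented).
sample = {
    "results": {
        "transcripts": [{"transcript": "hello world"}],
        "items": [
            {"type": "pronunciation", "start_time": "0.04", "end_time": "0.52",
             "alternatives": [{"confidence": "0.99", "content": "hello"}]},
            {"type": "pronunciation", "start_time": "0.58", "end_time": "1.10",
             "alternatives": [{"confidence": "0.41", "content": "world"}]},
        ],
    }
}

def low_confidence_words(result, threshold=0.9):
    """Return (word, start_time) pairs whose top alternative scores below threshold."""
    flagged = []
    for item in result["results"]["items"]:
        if item["type"] != "pronunciation":  # punctuation items carry no timestamps
            continue
        best = item["alternatives"][0]
        if float(best["confidence"]) < threshold:
            flagged.append((best["content"], float(item["start_time"])))
    return flagged

print(low_confidence_words(sample))  # → [('world', 0.58)]
```

A pass like this makes it easy to jump straight to the few seconds of audio that actually need a human review.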
Custom Vocabulary- AWS Transcribe allows you to create your own
custom vocabulary. By creating and managing a custom vocabulary you expand and
customize the speech recognition of AWS Transcribe.
Basically, custom vocabulary gives AWS Transcribe more information about how to
process speech in the multimedia file.
This feature is very important for achieving high accuracy in transcriptions for specific domains such as engineering, medicine, law enforcement, legal, etc.
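A custom vocabulary is just a named list of domain phrases registered with the service. As a minimal sketch, the snippet below builds the request parameters (the vocabulary name and phrases are hypothetical) and shows, commented out, how they would be submitted through boto3's Transcribe client:

```python
# Parameters for registering a custom vocabulary with Amazon Transcribe.
# The name and phrases below are hypothetical, domain-specific examples.
vocabulary_request = {
    "VocabularyName": "radiology-terms",
    "LanguageCode": "en-US",
    "Phrases": ["angiography", "echocardiogram", "stenosis"],
}

# With AWS credentials configured, the request would be sent like this:
# import boto3
# transcribe = boto3.client("transcribe")
# transcribe.create_vocabulary(**vocabulary_request)

print(vocabulary_request["VocabularyName"])  # → radiology-terms
```

Once the vocabulary is created, transcription jobs can reference it by name so the recognizer favors these terms.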
Multiple Speakers-AWS Transcribe platform can identify different
speakers in a multimedia file. The platform can recognize when the speaker
changes and attribute the transcribed text accordingly. Recognition of multiple
speakers is handy when transcribing multimedia content that involves multiple
speakers (such as telephone calls, meetings, etc.).
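When speaker identification is enabled, the output includes labeled time segments, one per stretch of speech by a single speaker. The sketch below uses a simplified sample of that structure (real output also lists word-level items per segment) to answer "who was talking at second t":

```python
# Simplified sample in the shape of Amazon Transcribe's speaker-label segments.
segments = [
    {"speaker_label": "spk_0", "start_time": "0.0", "end_time": "4.2"},
    {"speaker_label": "spk_1", "start_time": "4.2", "end_time": "7.9"},
    {"speaker_label": "spk_0", "start_time": "7.9", "end_time": "9.5"},
]

def speaker_at(segments, t):
    """Return the speaker label active at time t (seconds), or None."""
    for seg in segments:
        if float(seg["start_time"]) <= t < float(seg["end_time"]):
            return seg["speaker_label"]
    return None

print(speaker_at(segments, 5.0))  # → spk_1
```

Joining these segments with the word timestamps is how transcribed text gets attributed to the right participant in a call or meeting.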
The Amazon Transcribe API is billed monthly at a rate of $0.00056 per second.
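At the per-second rate quoted above, estimating a transcription bill is simple arithmetic:

```python
RATE_PER_SECOND = 0.00056  # USD, the Amazon Transcribe rate quoted above

def transcribe_cost(duration_seconds):
    """Estimated cost of transcribing one audio file at the quoted rate."""
    return round(duration_seconds * RATE_PER_SECOND, 4)

# A one-hour recording:
print(transcribe_cost(3600))  # → 2.016
```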
Google Speech-to-Text
Google Speech-to-Text accepts multimedia content of different lengths and durations and returns transcripts promptly. Thanks to Google's machine learning technology, the platform can also process real-time streaming or pre-recorded audio content, including FLAC, AMR, PCMU, and Linear-16 formats.
The platform recognizes 120 languages, which makes it much more advanced than the Amazon Transcribe platform in this respect. However, despite this, Google still falls short on accuracy and price compared to the Amazon Transcribe platform.
Google Speech-to-Text accuracy improves over time as Google improves the internal speech recognition technology used by Google products. Its notable features include:
1. Automatic identification of the spoken language. Google employs this feature to automatically identify the language spoken in the multimedia content (out of 4 selected languages) without any additional alterations.
2. Automatic recognition of proper nouns and context-specific formatting. Google Speech-to-Text works well with real-life speech. It can accurately transcribe proper nouns and appropriately format language (such as dates and phone numbers).
3. Phrase hints. Almost identical to Amazon's Custom Vocabulary, Google Speech-to-Text allows customization of context by providing a set of words and phrases that are likely to appear in the transcription.
4. Noise robustness. This feature of Google Speech-to-Text allows noisy multimedia to be handled without additional noise cancellation.
5. Inappropriate content filtering. Google Speech-to-Text is capable of filtering inappropriate content in text results for some languages.
6. Automatic punctuation. Like Amazon Transcribe, this platform also adds punctuation to transcriptions.
7. Speaker recognition. This feature is similar to Amazon's recognition of multiple speakers. It makes automatic predictions about which of the speakers in a conversation spoke which part of the text.
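Most of the features listed above are switched on through fields of the recognition request. As a sketch, the snippet below builds a request body in the shape of the Speech-to-Text REST API's recognize method (the bucket URI and phrase list are placeholder examples) without sending it anywhere:

```python
# Request body in the shape of Google Speech-to-Text's recognize call.
# The gs:// URI and phrases are placeholders for illustration.
request_body = {
    "config": {
        "languageCode": "en-US",
        "alternativeLanguageCodes": ["es-ES", "fr-FR"],  # language identification
        "enableAutomaticPunctuation": True,              # automatic punctuation
        "profanityFilter": True,                         # inappropriate-content filtering
        "speechContexts": [                              # phrase hints
            {"phrases": ["Amazon Transcribe", "Speech-to-Text"]}
        ],
        "diarizationConfig": {                           # speaker recognition
            "enableSpeakerDiarization": True,
            "minSpeakerCount": 2,
        },
    },
    "audio": {"uri": "gs://example-bucket/recording.flac"},
}

print("speechContexts" in request_body["config"])  # → True
```

With credentials in place, a body like this would be posted to the service (or passed, in equivalent form, to the google-cloud-speech client library).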
Google Speech-to-Text costs $0.006 per 15 seconds, while the video model costs twice as much, at $0.012 per 15 seconds.
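Since the rate is quoted per 15 seconds, a reasonable assumption is that billing rounds duration up to 15-second increments; under that assumption, the cost works out as:

```python
import math

STANDARD_RATE = 0.006  # USD per 15 seconds, standard model (figures quoted above)
VIDEO_RATE = 0.012     # USD per 15 seconds, video model

def google_stt_cost(duration_seconds, rate=STANDARD_RATE):
    """Estimated cost, assuming billing in 15-second increments rounded up."""
    return round(math.ceil(duration_seconds / 15) * rate, 3)

# A one-hour recording:
print(google_stt_cost(3600))              # → 1.44
print(google_stt_cost(3600, VIDEO_RATE))  # → 2.88
```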
https://www.armedia.com/blog/transcription-services-aws-google-ibm-nuance/