Almost Human (But Not Quite): Evaluating Text-to-Speech for eLearning Narration
In our previous article we discussed the use of audio narration for our online courses. One potential voice source is text-to-speech, or TTS. We want to report what we learned about its viability from our view as internal eLearning developers.
Using live narrators for eLearning can be a lengthy and resource-intensive process, both for initial production and for subsequent revisions. TTS, if viable, could make our production more efficient. The trade-off, of course, is voice quality. Can TTS really sound human enough to be practical? After spot-checking the TTS market for over a year, we recently took a more in-depth look.
A variety of considerations and opinions
On one hand, we learned that some learners, our own employees included, can accommodate TTS as long as they don’t have to strain to understand it. One source found that after several minutes, learners came to regard it as similar to listening to someone with an accent. On the other hand, cost, suitability of voices, and ease of use all have to be weighed.
Posts on an ASTD eLearning discussion group were unanimously against using TTS. Other articles were generally in favor of it under certain circumstances. We think some of the disparity stems from the wide variance of quality not only between TTS engine manufacturers, but even between different voices that use the same TTS engine.
TTS engines we reviewed
We evaluated TTS engines and voices from the following TTS engine manufacturers:
In addition to these companies, all of which specialize in TTS services, we evaluated the voices that come with Adobe Captivate.
Typical options include male and female personalities along with accents such as American, British, and Australian. All voices were judged using the same passage from a script in one of our eLearning courses. Voice quality ranged from highly robotic to amazingly human-like. Diction differed considerably from one voice to another, and we also found that a single voice could vary depending on the passage.
We found that the same package was priced quite differently depending on whether it was licensed for individual use, for internal distribution on an intranet, or for commercial use.
TTS manufacturers seem to use one of two general business models. One is a hosted model. Text is entered on the host website and read by the selected voice. The user adjusts pronunciation and inflection until the sound is satisfactory. (See note under Ease of Use.) The user then downloads the finished product as an audio file. Most manufacturers who use this model base their fee on the number of finished minutes of audio. In our sample, fees for this kind of service ran between $7.50 and $11.00 per finished minute.
The other model is based on licensed downloads of the engine and voices. Fees for this kind of service varied from $2,500 per year for the engine and three voices to a one-time fee of $1,100 for the engine and two voices. Either way, additional voices are available for an additional fee.
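To compare the two models, it can help to work out how many finished minutes of audio a license fee would buy on a hosted service. The following is a minimal sketch in Python that uses only the fees quoted above from our sample. The function names are our own, and the calculation deliberately ignores staff time, additional voices, and other costs.

```python
# Fees quoted in the article's sample; nothing here is a vendor price list.
HOSTED_LOW = 7.50           # dollars per finished minute (low end of sample)
HOSTED_HIGH = 11.00         # dollars per finished minute (high end of sample)
LICENSE_ANNUAL = 2500.00    # dollars per year, engine plus three voices
LICENSE_ONE_TIME = 1100.00  # one-time fee, engine plus two voices


def hosted_cost(minutes, rate):
    """Cost of producing `minutes` of finished audio on a hosted service."""
    return minutes * rate


def break_even_minutes(license_fee, rate):
    """Finished minutes at which the hosted cost equals the license fee."""
    return license_fee / rate


if __name__ == "__main__":
    licenses = [("annual license", LICENSE_ANNUAL),
                ("one-time license", LICENSE_ONE_TIME)]
    for name, fee in licenses:
        for rate in (HOSTED_LOW, HOSTED_HIGH):
            minutes = break_even_minutes(fee, rate)
            print(f"{name} (${fee:,.2f}) vs hosted at ${rate:.2f}/min: "
                  f"break-even at about {minutes:.0f} finished minutes")
```

With these figures, the one-time license matches the hosted cost at roughly 100 to 147 finished minutes, and the annual license at roughly 227 to 333 finished minutes per year. A shop that produces less audio than that may find the hosted model cheaper.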
Ease of use
In our small eLearning shop, no one specializes in a particular skill or tool. For TTS to work for us, tweaking a voice’s inflection and pronunciation must therefore be very intuitive. We found that several TTS products do not have a graphical user interface. Instead, some of them are delivered as an SDK (software development kit) and are intended for use by developers only. Note: We were able to adjust some pronunciation by changing spelling and punctuation in a trial-and-error fashion.
Finally, a critical factor in anyone’s use of TTS is technical support. Judging by the responsiveness to our inquiries, the quality of technical support varies widely. We urge anyone considering TTS to check this carefully.
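The trial-and-error respelling mentioned in the note can be kept in one place so that a script does not have to be edited by hand each time. Below is a minimal sketch in Python, assuming the narration script is available as plain text. The respellings shown are hypothetical placeholders, since the spelling that works depends on the engine and voice, and the function name is our own rather than part of any TTS product.

```python
import re

# Hypothetical respellings found by trial and error for one particular voice.
# Keys are the words as written; values are what the voice reads correctly.
RESPELLINGS = {
    "eLearning": "e-learning",
    "TTS": "T T S",
}


def apply_respellings(text, respellings):
    """Replace whole-word, case-sensitive matches with their respellings."""
    for word, spoken in respellings.items():
        pattern = r"\b" + re.escape(word) + r"\b"
        # A function replacement avoids backslash handling in `spoken`.
        text = re.sub(pattern, lambda _match, s=spoken: s, text)
    return text


if __name__ == "__main__":
    script = "TTS for eLearning narration"
    print(apply_respellings(script, RESPELLINGS))
```

Keeping the original script unchanged and applying the respellings only to the copy sent to the voice means the on-screen text and any captions keep their normal spelling.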
We believe the quality, price, and ease of use are reaching a point where text-to-speech is becoming a viable alternative to recording human voices for certain narration. After evaluating a variety of sources and voices, we feel the ones that ship with Adobe Captivate are acceptable for short passages. Some others are getting close to human-sounding. In our sample, which is not comprehensive, we found the following products to be viable based on quality of voices, price, and ease of use:
- Virtual Speaker and Acapela Box by the Acapela Group
- Studio Two by Ivona
However, because we found there can be noticeable variation between voices using the same engine, and even within the same voice from one passage to the next, we urge anyone considering TTS to evaluate the product thoroughly, across a wide sample of phrases.
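One way to make such an evaluation concrete is to have listeners rate each voice on several passages and then look at the spread as well as the average. The sketch below, in Python, assumes a simple 1 (robotic) to 5 (human-like) rating scale. The voice names, passages, and ratings are made-up placeholders, not results from our evaluation.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings: (voice, passage, rating on a 1-5 scale).
RATINGS = [
    ("voice_a", "passage_1", 4),
    ("voice_a", "passage_2", 2),
    ("voice_a", "passage_3", 4),
    ("voice_b", "passage_1", 3),
    ("voice_b", "passage_2", 3),
    ("voice_b", "passage_3", 4),
]


def summarize(ratings):
    """Return {voice: (mean rating, worst rating, best rating)}."""
    by_voice = defaultdict(list)
    for voice, _passage, rating in ratings:
        by_voice[voice].append(rating)
    return {voice: (mean(scores), min(scores), max(scores))
            for voice, scores in by_voice.items()}


if __name__ == "__main__":
    for voice, (avg, worst, best) in summarize(RATINGS).items():
        print(f"{voice}: mean {avg:.2f}, worst {worst}, best {best}")
```

In this placeholder data the two voices have the same average, but voice_a drops to a 2 on one passage. A single test passage would hide exactly that kind of inconsistency.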
“Text-to-Speech vs Human Narration for eLearning.” eLearning Technology, Tony Karrer, September 14, 2010, downloaded from http://elearningtech.blogspot.com/2010/09/text-to-speech-vs-human-narration-for.html