Synthesized speech is now as natural as human speech

In the second half of the 2010s, computer-based text-to-speech systems came into use in smart speakers and web-based news programs. Many people may have noticed that this synthesized speech sounds quite different from the stiff, mechanical voice usually associated with machines.

As a matter of fact, speech synthesis technology has already moved beyond the stage where the main goal was to make synthesized speech sound as natural as human speech.

Let’s start with an example of traditional speech synthesis technology. Suppose you are on a bus and hear an on-board announcement in a synthetic voice that says, “Tsugi wa XX desu [the next stop will be XX].”

If the announcement uses traditional speech synthesis technology, you notice immediately that the voice is synthesized. This is because the words tsugi, wa, XX, and desu are each recorded separately in advance and simply played back in sequence as the bus approaches stop XX. This is how traditional synthesized speech works.

With this approach, the intonation of the postposition wa may sound slightly unnatural. In natural speech, the intonation of wa varies subtly depending on the words that come before and after it, but simple sequential playback does not take this into account. Wa is pronounced as if it stood alone, which makes it sound unnatural.
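The limitation described above can be sketched in a few lines of Python. This is only an illustration: the names `RECORDED_WORDS` and `announce` and the pitch numbers are hypothetical, standing in for real audio clips.

```python
# Hypothetical pre-recorded clips. Each word has ONE fixed pitch contour
# (illustrative numbers in Hz, not measured data).
RECORDED_WORDS = {
    "tsugi": [120, 140],
    "wa": [130],
    "desu": [110, 90],
}

def announce(stop_name, stop_contour):
    """Play back the fixed clips in order: tsugi, wa, <stop>, desu."""
    clips = dict(RECORDED_WORDS)
    clips[stop_name] = stop_contour
    sequence = ["tsugi", "wa", stop_name, "desu"]
    return [(word, clips[word]) for word in sequence]

# The clip for "wa" is identical whatever stop name follows it,
# which is why its intonation cannot adapt to the context.
first = announce("StopA", [100, 100])
second = announce("StopB", [150, 160, 140])
print(first[1] == second[1])
```

Because the clip for wa is looked up without regard to its neighbors, the last line compares two identical entries.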

With the latest technology, however, the sounds of Japanese are stored in advance as phonemes, the smallest units of speech, and these phonemes are linked in the right sequence to pronounce the word in question. Take the word eki (train station), for example. To pronounce eki, the phonemes e, k, and i need to be linked correctly. To make such linking as natural as possible, it is important to collect as many recorded examples as possible.

When a sentence to be read aloud is entered into a speech synthesis system, the system breaks the sentence down into parts of speech such as nouns, postpositions, and verbs, and converts each of them into the corresponding sequence of phonemes.

The system then refers to a dictionary that stores the approximate accent of each word. In the process, the system distinguishes between words that share the same sequence of phonemes but have different accents, such as hashi (bridge) and hashi (chopsticks).
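A minimal sketch of such an accent dictionary might look like the following. The dictionary name, the function name, and the H/L (high/low) notation are assumptions made for illustration. The patterns shown follow the usual Tokyo-dialect description of these two words and are not taken from any particular system.

```python
# Hypothetical accent dictionary keyed by written form.
# Each entry: (phoneme sequence, pitch per mora: H = high, L = low).
ACCENT_DICT = {
    "橋": (["h", "a", "sh", "i"], "LH"),  # hashi (bridge)
    "箸": (["h", "a", "sh", "i"], "HL"),  # hashi (chopsticks)
}

def look_up(word):
    """Return the phonemes and the rough accent pattern for a word."""
    return ACCENT_DICT[word]

bridge_phonemes, bridge_accent = look_up("橋")
chopsticks_phonemes, chopsticks_accent = look_up("箸")

# Same phonemes, different accent.
print(bridge_phonemes == chopsticks_phonemes)
print(bridge_accent, chopsticks_accent)
```

Keying the dictionary by the written word is one simple way to tell the two words apart, since their phoneme sequences alone are identical.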

Then the speech synthesis system generates an intonation that links the accents of words naturally and picks out the necessary phonemes from the database and arranges them so as to produce the natural sound of the sentence.

Roughly speaking, the system goes through these complex processes in an instant. The result is the synthesis of natural speech that sounds as if it is spoken by a human voice.
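The steps described above (analysis, accent lookup, intonation, and phoneme selection) can be put together in a toy sketch. Everything here is a simplifying assumption: the lexicon, the unit database, the unit names, and the accent values are placeholders, and real systems choose among many recorded candidates for each phoneme.

```python
# Hypothetical lexicon: part of speech, phonemes grouped by mora,
# and a placeholder accent pattern (one H/L symbol per mora).
LEXICON = {
    "eki": {"pos": "noun", "moras": [["e"], ["k", "i"]], "accent": "HL"},
    "wa": {"pos": "postposition", "moras": [["w", "a"]], "accent": "L"},
}

# Hypothetical database of recorded units, keyed by (phoneme, pitch).
UNIT_DB = {
    ("e", "H"): "e_H_001",
    ("k", "L"): "k_L_001",
    ("i", "L"): "i_L_001",
    ("w", "L"): "w_L_001",
    ("a", "L"): "a_L_001",
}

def analyze(words):
    """Step 1: look up each word's part of speech, phonemes, and accent."""
    return [LEXICON[word] for word in words]

def plan_intonation(entries):
    """Step 2: give every phoneme the pitch of the mora it belongs to."""
    targets = []
    for entry in entries:
        for mora, pitch in zip(entry["moras"], entry["accent"]):
            for phoneme in mora:
                targets.append((phoneme, pitch))
    return targets

def select_units(targets):
    """Step 3: pick a recorded unit matching each (phoneme, pitch) target."""
    return [UNIT_DB[target] for target in targets]

def synthesize(words):
    return select_units(plan_intonation(analyze(words)))

print(synthesize(["eki", "wa"]))
```

The output is only a list of unit names. A real system would go on to join the corresponding recordings into one waveform.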

Diverse applications of speech synthesis technology

Innovations in speech synthesis technology do not stop there. Some technologies can now make synthesized speech sound as if it is spoken by a particular person.

One such technology divides speech into pitches and timbres, edits these elements on a computer, and generates new speech waveforms from these edited elements.
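To make the idea of editing pitch and timbre separately more concrete, here is a toy Python sketch. It does not analyze recorded speech as a real system would. It assumes that the pitch (a fundamental frequency, `f0_hz`) and the timbre (a list of harmonic amplitudes) are already known, and shows only the final step of generating a new waveform from edited elements. All names and values are hypothetical.

```python
import math

def generate_waveform(f0_hz, harmonic_amps, n_samples=160, sample_rate=16000):
    """Build a waveform as a sum of harmonics of the fundamental frequency."""
    return [
        sum(
            amp * math.sin(2 * math.pi * (k + 1) * f0_hz * t / sample_rate)
            for k, amp in enumerate(harmonic_amps)
        )
        for t in range(n_samples)
    ]

# Assumed elements of the original voice (illustrative values).
original = {"f0_hz": 120.0, "harmonic_amps": [1.0, 0.5, 0.25]}

# Edit only the pitch; the timbre (harmonic amplitudes) is left unchanged.
edited = dict(original, f0_hz=original["f0_hz"] * 1.5)

new_wave = generate_waveform(**edited)
print(len(new_wave), edited["f0_hz"])
```

Keeping the two elements in separate fields is what makes it possible to change one without touching the other.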

A more recent technology processes original speech waveforms without dividing speech into these different elements.

At any rate, it is no longer difficult to generate speech waveforms that are similar to those of a particular person.

Simply put, the latest speech synthesis technology can generate the voice of person B from the voice of person A, given just one hour’s worth of speech data from person A and a few minutes’ worth from person B.

Work is underway to apply these technologies for the good of society. One example is the medical application of the technology to help laryngeal cancer patients and others whose vocal cords have been surgically removed.

Such patients have to go through tough training to communicate by voice. This may involve speaking while holding a device that produces a buzzing sound against the throat, or vocalizing with another organ in place of the vocal cords.

Their lives can be made easier with a new system whereby the patients’ voice is recorded before their vocal cords are surgically removed and their voice is synthesized based on this prerecorded data.

After the removal surgery, the patients enter text on a tablet computer or a smartphone, which reads it aloud through its speaker in their synthesized voice.

This new technology allows the patients to continue communication in their own voice without going through rigorous training.

The technology has already been commercialized but is still expensive. Various efforts are underway to make it more affordable.

Moreover, technological research is being conducted to convert the sound generated by a buzzing-sound-producing device into the synthesized voice of the patient.

This technology, if it can be put to practical use, will allow patients to produce their synthesized voice by moving their mouth, which will give them the feel of speaking in a normal way.

Progress has also been made in collaboration with psychology.

Psychologists are studying how a person’s voice influences the impression that others form of that person’s personality. Technology that can freely synthesize the pitch and timbre of the human voice lends itself to experiments for this research.

This research is expected to shed light on what types of voice and tone give people a good impression. This in turn will allow people to train themselves to produce such a voice and tone.

A system can be developed that analyzes speech recorded on a PC or smartphone and gives marks and advice on how to improve the voice and tone.

Such a system should be useful for job-hunters who want to give a good impression to their job interviewers. It should also be effective for social skills training for those who are not good at talking to people.

On the downside, however, speech synthesis technology can be abused for fraud. In fact, such a case happened overseas.

In this case, a company employee received a call from a man who pretended to be his boss. The employee had no doubt that the caller was his boss and sent money to the specified account as requested. The voice of “his boss” turned out to be synthesized.

To prevent such crimes, efforts are now being made to develop a technology that can distinguish between real human speech and synthesized speech.

When putting the latest technology to practical use, it is important for us researchers to study its security aspects as well, so that possible abuse can be avoided.

The availability of a speech-deforming tool will help create new speech expressions

The past few years have seen astonishingly rapid progress in speech synthesis technology. This technology will be put to wider practical use in many different fields.

I study a “voice that surpasses humanity.”

For example, painters have developed different modalities of expression beyond a realistic rendering of what is actually seen. These include techniques that express the artist’s ideas symbolically, as well as pictures that deform human figures to exaggerate their robustness or depict movements beyond human capacity.

Speech synthesis technology, which has succeeded in accurately reproducing the human voice, has the same kind of potential for development as painting.

One possibility is to produce a charming voice that a human would not be capable of.

More broadly speaking, it is possible to synthesize voices that boost human abilities: for example, a voice that makes people feel vigorous, cheerful, or drowsy, or a voice that wakes people up gently and helps them get off to a good start in the morning.

In fact, some anime works use synthesized speech to make their characters stand out. Some songs in which the singers’ voices are processed with speech synthesis technology have hit the charts.

Thus, speech synthesis technology has vast potential to play an important role in entertainment content.

Another possibility is to compress synthesized text-to-speech so that, in a fixed period of time, it conveys five times as much information as speech read aloud at normal speed.

Humans cannot speak at the rate of compressed speech. Yet it may be possible to deliver compressed speech intelligibly using speech synthesis technology. That would be a “voice that surpasses humanity” for sure.
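The arithmetic behind fivefold compression is simple, as the small Python sketch below shows. The normal reading rate of 300 characters per minute is an assumed figure chosen only for illustration, and the function names are hypothetical.

```python
NORMAL_RATE = 300  # characters per minute; assumed, for illustration only

def compressed_duration(normal_seconds, factor=5.0):
    """Time needed to deliver the same content at `factor` times the speed."""
    return normal_seconds / factor

def content_in_window(rate_per_minute, minutes, factor=5.0):
    """Amount of content delivered in a fixed window at `factor` times the speed."""
    return rate_per_minute * minutes * factor

# A one-minute announcement would fit into 12 seconds...
print(compressed_duration(60))
# ...or one minute could carry five minutes' worth of text.
print(content_in_window(NORMAL_RATE, 1))
```

Whether listeners can still follow speech at that rate is a separate question, and it is the one that the research has to answer.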

Speech synthesis technology has all of this potential.

This potential can be explored by making speech deformation technology readily available.

As I said earlier, it is no longer difficult to process the pitches and timbres of voices.

Now, software and machines that can deform images are readily available and many people enjoy using them. Likewise, if a full-fledged tool that allows users to deform speech easily becomes readily available, new expressions will emerge from it.

In other words, the potential of speech synthesis technology can be expanded by a broad range of people who create new expressions with a free mind, not by researchers who develop such tools.

This is the very reason why I want young people to take good care of their ears.

The ear is in fact subject to wear: it degenerates with age. If it is exposed to loud sound for a long time, whether in a concert hall or through headphones, it may degenerate even at a young age, and in the worst case this can lead to deafness.

Once the ear is damaged, it may be difficult to heal the damage completely with current medical technology.

I want you, the readers, to enjoy music at an appropriate volume. If you do expose your ears to loud music, don’t keep doing so for long. Give your ears a rest for a while. In a nutshell, don’t strain your ears.

Going forward, speech synthesis technology will expand the potential for speech expression. Take good care of your ears!

* The information contained herein is current as of November 2019.
* The contents of the articles are based on the personal ideas and opinions of the author and do not indicate the official opinion of Meiji University.
* I work to achieve SDGs related to the educational and research themes that I am currently engaged in.

Information noted in the articles and videos, such as positions and affiliations, is current as of the time of production.