Obsidian/Speech synthesis.md

Goal

The goal is to write a text-to-speech program in which every sound is created purely from mathematical equations, giving me my own truly 100% machine-made speech synthesis.

Inspirations

These inspirations include some newer products and some very old software that is practically impossible to obtain and has therefore been almost completely forgotten.

MUSIC-N

It was used to create this famous piece of music. If my guess is correct, the piece was made using formant synthesis. Since MUSIC-N is a product of Bell Labs, and because of its age, it probably lives in an archive somewhere or may even have been lost.

gnuspeech

It uses articulatory synthesis and is an open source project. Sadly, it only runs on extremely old versions of Mac OS X and on the BSD-based NeXTSTEP, which is closed source and therefore practically impossible to get. These factors have led to gnuspeech being unmaintained and mostly forgotten.

UTAU

I really like Utane Uta (Defoko) and Adachi Rei. Both of them are completely synthesised and use no voicebanks.

Vocaloid

Not the kind of speech synthesis I am going for, but I still really like them.

SAM

I cannot yet tell which type of synthesis it uses, even after taking a short look at the source code of this newer implementation. My current guess is that it uses some form of concatenative synthesis, since it combines phonemes to create words.

Planning

Spoken Language

The language is one of the most important aspects of this program and needs to be chosen carefully. The languages I wanted to choose from were German, English and Japanese. To help me decide between those three, I came up with these requirements.

  • Small alphabet in order to minimise the amount of work required
  • Individual letters stitched together into words should require no further processing in order to be understandable
  • The words need to be understandable even with incorrect emphasis or none at all

Since I speak German and English fluently and both have a relatively small alphabet, I chose to do both, but I am going to start with English.

Programming Language

There are many options, but to help narrow them down, I came up with a few requirements.

  • should be functional (Haskell for example) or procedural (C for example)
  • compiles to a binary (should require no interpreter, so no Python or Java for example)

What type of speech synthesis

When I looked on Wikipedia I saw that there are multiple types of speech synthesis. Since I am restricting myself to pure machine synthesis, I can only choose between two types.

  • Formant synthesis
  • Articulatory synthesis

I really want to do both, but I am going to start with articulatory synthesis. I am doing it first only because it seems way cooler to me.