small progress on speech synthesis research

AustrianToast 2024-10-25 00:58:30 +02:00
parent f7c6d54376
commit 2b20e5ece9
Signed by: AustrianToast
GPG Key ID: 1B4D0AAF6E558816
2 changed files with 32 additions and 11 deletions

@@ -13,12 +13,12 @@
"state": { "state": {
"type": "markdown", "type": "markdown",
"state": { "state": {
"file": "Reasons why I use Obsidian.md", "file": "Speech synthesis.md",
"mode": "source", "mode": "source",
"source": false "source": false
}, },
"icon": "lucide-file", "icon": "lucide-file",
"title": "Reasons why I use Obsidian" "title": "Speech synthesis"
} }
} }
] ]
@@ -163,12 +163,12 @@
}, },
"active": "8b669fe838f51401", "active": "8b669fe838f51401",
"lastOpenFiles": [ "lastOpenFiles": [
"Satan.md",
"Framework.md", "Framework.md",
"Reasons why I use Obsidian.md",
"Beehive.md", "Beehive.md",
"Speech synthesis.md", "Speech synthesis.md",
"Updating.md", "Updating.md",
"Satan.md",
"Reasons why I use Obsidian.md",
"Knowledge Base/labwc.md", "Knowledge Base/labwc.md",
"Knowledge Base", "Knowledge Base",
"Hentai.md", "Hentai.md",

@@ -1,16 +1,37 @@
## Goal ## Goal
The goal is to write a text to speech program, where all sounds are purely created through mathematical equations, in order to create my own true 100% machine speech synthesis. The goal is to write a text to speech program, where all sounds are purely created through mathematical equations, in order to create my own true 100% machine speech synthesis.
## Inspirations
These inspirations include some newer products and some very old software solutions which are practically impossible to obtain and as such have been almost completely forgotten.
### [MUSIC-N](https://en.wikipedia.org/wiki/MUSIC-N)
It was used to create [this famous piece of music](https://www.youtube.com/watch?v=41U78QP8nBk). If my guess is correct, this was made using Formant synthesis. Since this is a product of Bell Labs and because of its age, it probably lives in an archive somewhere or may even have been lost.
### [gnuspeech](https://savannah.gnu.org/projects/gnuspeech)
It makes use of Articulatory synthesis and is an open source project. Sadly it only runs on extremely old versions of Mac OS X and on the BSD-based NeXTSTEP, which is closed source and as such practically impossible to get. These factors have led to gnuspeech being unmaintained and mostly forgotten.
### [UTAU](http://utau2008.web.fc2.com/)
I really like Utane Uta (Defoko) and Adachi Rei. Both of them are completely synthesised and make use of no Voicebanks.
### [Vocaloid](https://www.vocaloid.com/en/)
Not the kind of speech synthesis I am going for, but I still really like them.
### SAM
I have not yet been able to figure out which type of synthesis it uses, even after taking a short look at the source code of [this newer implementation](https://github.com/s-macke/SAM). My current guess is that it uses some form of Concatenation synthesis, since it combines phonemes to create words.
## Planning ## Planning
### Language ### Spoken Language
The language is one of the most important aspects of this program and as such needs to be carefully chosen. The languages I considered were German, English or Japanese. The language is one of the most important aspects of this program and as such needs to be carefully chosen. The languages I considered were German, English or Japanese.
To help me decide which of those three languages to choose, I came up with these requirements. To help me decide which of those three languages to choose, I came up with these requirements.
- Small alphabet in order to minimise the amount of code required - Small alphabet in order to minimise the amount of work required
- Individual letters stitched into words together require no further processing in order to be understandable - Individual letters stitched into words together should require no further processing in order to be understandable
- The words need to be understandable even without proper or any emphasising at all - The words need to be understandable even without proper or any emphasising at all
I speak German and English fluently and they have a relatively small alphabet. In the end, I chose Japanese, because it may have a way bigger alphabet, but at least it is consistent in the way different combinations of letters create words. Since I speak German and English fluently and they have a relatively small alphabet, I chose to do both, but I am going to start with English first.
### Programming Language ### Programming Language
There are many options, but to help narrow it down, I came up with these requirements. There are many options, but to help narrow it down, I came up with a few requirements.
- functional programming (pure functions) - should be functional (Haskell for example) or procedural (C for example)
- compiles to ELF - compiles to binary (should require no interpreter, so no Python or Java for example)
### What type of speech synthesis
When I looked on [Wikipedia](https://en.wikipedia.org/wiki/Speech_synthesis#Synthesizer_technologies) I saw that there are multiple types of speech synthesis.
Since I am restricting myself to pure machine synthesis, I can only choose between two types.
- Formant synthesis
- Articulatory synthesis
I really wanna do both, but I am going to start with Articulatory synthesis. I am only doing it first, because it seems way cooler to me.