Over the years, the important sounds that make up the languages of the world have been studied, and an International Phonetic Alphabet (Courtesy of the International Phonetic Association) has been developed.
Consonants can be characterized by point of articulation and manner of articulation.
They can be voiced or unvoiced, and nasal or non-nasal.
Vowels are characterized as high or low, front or back.
State of the Art gives this chart describing how different voice recognition systems differ from each other.
NOTE: This discussion leans heavily on an excellent article in The Linguistics Encyclopedia
[ Up to Extract a spectrogram from the raw signal]
Sound is analyzed in terms of sine waves.
A tuning fork produces a single sine wave
If three notes are sounded at the same time (possibly at different intensities), the resulting waveform is the sum of each of them.
This could also be represented as a line spectrum
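As a rough illustration of the two views above (summed waveform and line spectrum), here is a minimal sketch. The three frequencies, their amplitudes, and the sample rate are made-up values chosen so that each DFT bin lines up with a whole number of Hz; none of them come from the original figures.

```python
import cmath
import math

SAMPLE_RATE = 800  # samples per second (assumed for this sketch)
N = 800            # one second of signal, so DFT bin k corresponds to k Hz

# Three hypothetical "notes": (frequency in Hz, amplitude)
notes = [(100, 1.0), (150, 0.5), (200, 0.25)]

# The combined waveform is the sample-by-sample sum of the three sine waves.
signal = [
    sum(amp * math.sin(2 * math.pi * freq * n / SAMPLE_RATE) for freq, amp in notes)
    for n in range(N)
]

def amplitude_at(samples, k):
    """Amplitude of the k-th DFT bin, computed with a naive single-bin DFT."""
    count = len(samples)
    total = sum(x * cmath.exp(-2j * math.pi * k * n / count)
                for n, x in enumerate(samples))
    return 2 * abs(total) / count

# Line spectrum: one amplitude per frequency of interest.
line_spectrum = {freq: round(amplitude_at(signal, freq), 3)
                 for freq in (50, 100, 150, 200, 250)}
print(line_spectrum)
```

The frequencies that were actually sounded come back with their original amplitudes, while the others come back as approximately zero.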
Most sounds are more complex than musical chords
By Fourier's Theorem, any complex waveform can be expressed as a sum of sine waves
When some sounds are broken down into their components, their intensities form a pattern. (Others are noisy).
The pattern that evolves is called an 'envelope'
The envelope has humps.
Local maxima are called formants
Formants go a long way toward characterizing vowels
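Picking formants out of an envelope amounts to finding its local maxima. The sketch below assumes the envelope is already available as a list of (frequency, intensity) pairs; the values are invented for illustration and do not describe any real vowel.

```python
def find_formants(envelope):
    """Return the (frequency, intensity) pairs that are local maxima."""
    peaks = []
    for i in range(1, len(envelope) - 1):
        freq, level = envelope[i]
        # A hump: strictly higher than both neighbours.
        if level > envelope[i - 1][1] and level > envelope[i + 1][1]:
            peaks.append((freq, level))
    return peaks

# Hypothetical envelope: (frequency in Hz, relative intensity).
envelope = [
    (250, 0.2), (500, 0.9), (750, 0.4), (1000, 0.3),
    (1500, 0.7), (2000, 0.2), (2500, 0.5), (3000, 0.1),
]

formants = find_formants(envelope)
print(formants[:3])  # the first three humps
```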
The first three formants of a speech signal can be used to characterize vowel sounds.
Formants map onto the high/low and front/back distinctions quite well
This is a spectrogram: 'Bab/dad/gag'.
Note the analogy to color/prisms.
It plots the changing frequency content of the waveform against time
Instead of indicating the intensity of each frequency with a line, it is marked by the darkness of the mark at each point on the page.
Here's a schematic of the same spectrogram.
Bab, dad, gag
Note the shapes of the formants as they change over time.
This characterizes 'stops'
Thaw, saw, shaw, chaw
Fricatives don't have formants.
[ Up to Map the spectrogram to a stream of sub-phonetic symbols ('codes')]
About vectors:
Vectors represent a 'space'
Typical example is [x,y,z] coordinates from cartesian space.
Each position in the sequence has a specific canonical meaning
Each position represents a 'dimension'
Vectors can extend for many dimensions
We can measure the 'distance' between vectors
Each vector in the spectrogram is a very quick sample of the waveform
Each dimension of the spectrogram represents the amplitude of a particular frequency at the point the sample was taken
We can gather several subsequent samples together into segments which represent the dynamic context of the signal.
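The ideas above can be sketched in a few lines. The frames are hypothetical four-band amplitude vectors, Euclidean distance is just one reasonable choice of 'distance', and the overlapping-window grouping is an assumption about how segments might be formed.

```python
import math

def distance(u, v):
    """Euclidean distance between two vectors of equal dimension."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def segments(frames, size):
    """Group consecutive frames into overlapping windows of `size` frames."""
    return [frames[i:i + size] for i in range(len(frames) - size + 1)]

# Hypothetical spectrogram frames: one amplitude per frequency band.
frames = [
    [0.1, 0.8, 0.3, 0.0],
    [0.1, 0.7, 0.4, 0.0],
    [0.0, 0.2, 0.9, 0.1],
]

print(distance([0, 0, 0], [1, 2, 2]))                                   # 3.0
print(distance(frames[0], frames[1]) < distance(frames[1], frames[2]))  # True
print(len(segments(frames, 2)))                                         # 2
```

The first two frames are close together and the third is far from both, which is the kind of abrupt change that marks a boundary in the signal.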
Possible features:
Overall amplitude
Amplitude in various frequency ranges.
1st,2nd,3rd formants
Whether formants are increasing/decreasing/steady
Distances between formants
Noisiness
We can compare this vector to some (empirically determined) templates, and match it to the closest one.
The result is a stream of symbols.
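Matching each feature vector to its closest template can be sketched as follows. The code symbols, the two features (overall amplitude and noisiness), and all template values are hypothetical; real templates would be determined empirically.

```python
import math

# Hypothetical templates: code symbol -> [overall amplitude, noisiness].
templates = {
    "A": [0.9, 0.1],
    "B": [0.5, 0.8],
    "C": [0.0, 0.0],
}

def euclidean(u, v):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_code(vector, codebook):
    """Return the symbol whose template lies closest to the vector."""
    return min(codebook, key=lambda code: euclidean(vector, codebook[code]))

def quantize(vectors, codebook):
    """Map a sequence of feature vectors to a stream of code symbols."""
    return "".join(nearest_code(v, codebook) for v in vectors)

# Hypothetical feature vectors, one per segment of the signal.
features = [[0.05, 0.0], [0.8, 0.2], [0.85, 0.15], [0.4, 0.9], [0.1, 0.1]]
codes = quantize(features, templates)
print(codes)  # CAABC
```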
[ Up to Example: a toy 'dad' recognizer]
Codebook alphabet:{0,-,/,\,?}
Our 2-d vector is [x,y], where
Each code has a template
Sample outputs for 'dad':
Finite-state (count 'em!)
Characters either return us to the same state or move us to the next one.
We're always in just one state at any one time. (It's deterministic.)
Problem: we're not dealing with all the characters at each juncture.
A better solution is to be probabilistic, i.e. after each transition we assign a probability to each state.
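For comparison with the probabilistic approach, here is a minimal sketch of a deterministic recognizer of this kind. The state names follow the tables later in these notes, and the rules for D1 and A mirror them; the rules for Start, D2 and Finish, and the sample inputs, are assumptions made only so the sketch runs.

```python
# States in order. Each state maps to (codes that stay, codes that advance).
STATES = ["Start", "D1", "A", "D2", "Finish"]
RULES = {
    "Start": (set(), {"0", "/"}),   # assumed
    "D1": ({"0", "/"}, {"-"}),      # stay on '0' or '/', advance on '-'
    "A": ({"-"}, {"\\"}),           # stay on '-', advance on '\'
    "D2": ({"\\"}, {"0"}),          # assumed
    "Finish": ({"0"}, set()),       # assumed
}

def recognize(codes):
    """Return True if the code string drives the machine to Finish."""
    state = 0
    for code in codes:
        stay, advance = RULES[STATES[state]]
        if code in stay:
            continue
        if code in advance:
            state += 1
        else:
            # No transition is defined for this code in this state.
            return False
    return STATES[state] == "Finish"

print(recognize("0//---\\\\0"))  # True: every code has a transition
print(recognize("0/?-\\0"))      # False: '?' is not handled in D1
```

The second input shows the problem: a single unexpected code makes the deterministic machine reject the whole string.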
The symbols we get from the codebook are sub-phonemic representations.
We need to model the phonemes/syllables/words that make up the language.
This is complicated by the fact that the same sound can vary from instance to instance, especially in length.
[ Up to Feed the codes into a statistically trained model]
Researchers have had good results modeling this with a Hidden Markov Model (HMM)
Finite-state
We know in advance how many states will represent the sound-unit we want to model.
What we don't know is how well a given sequence of codes will map onto those states.
We address this by progressively hedging our bets as to which state the model is in at any given time.
Each time we read another character, we recalculate the probabilities as to which state we're in
Assessing the probability of 'final states' gives us our current hypothesis for any given input string.
We come up with our probabilities by consulting a corpus of data, and 'training' the model.
[ Up to Hidden Markov Models (HMM)]
As a demonstration, let's say that we're in the middle of the process, which has somehow assigned a probability to each state: [start] = 0.1, D1 = 0.8, A = 0.1, D2= 0.0, [end]= 0.0. Note that the total of all the probabilities is 1.0. Let's say the next code we receive is a '-'. This is how we'd calculate the probability that the next state is 'A'.
Each state needs to have a transition table which can accept any possible code and assign a probability to what the next state might be. Since we're only interested in calculating state 'A', there are only two states which have important transitions to A: D1, and A itself. The transition table for state D1 gives, for each incoming code, the probability of each possible next state:
| Code-> | / | - | \ | 0 | ? |
| Start | 0.0 | 0.0 | 0.2 | 0.0 | 0.2 |
| D1 | 0.9 | 0.1 | 0.2 | 0.9 | 0.2 |
| A | 0.1 | 0.9 | 0.2 | 0.1 | 0.2 |
| D2 | 0.0 | 0.0 | 0.2 | 0.0 | 0.2 |
| Finish | 0.0 | 0.0 | 0.2 | 0.0 | 0.2 |
...We're not really expecting a '\' or a '?' while in state D1, so let's say for purposes of our 'toy' example that the probabilities are evenly distributed among all states in those cases (we have to be able to deal with all possibilities). Otherwise the weights strongly reflect the deterministic model above. It'll probably stay in the same state on '0' or '/', and transition to 'A' on '-'.
Notice that all the columns add up to 1.0.
For state A, the transition table also closely resembles the deterministic version:
| Code-> | / | - | \ | 0 | ? |
| Start | 0.2 | 0.0 | 0.0 | 0.2 | 0.2 |
| D1 | 0.2 | 0.0 | 0.0 | 0.2 | 0.2 |
| A | 0.2 | 0.9 | 0.1 | 0.2 | 0.2 |
| D2 | 0.2 | 0.1 | 0.9 | 0.2 | 0.2 |
| Finish | 0.2 | 0.0 | 0.0 | 0.2 | 0.2 |
So if the current probability of state D1 is 0.8 and that of state A is 0.1, and our incoming code is '-', the probability that we're in state 'A' after the next transition is calculated as (the probability that we're now in state D1) x (the probability in the 'A' row of the D1 table) + (the probability that we're now in state A) x (the probability in the 'A' row of the 'A' table), or 0.8 x 0.9 + 0.1 x 0.9 = 0.72 + 0.09 = 0.81. So the model determines that we're 81% sure we're now in state 'A' after receiving the '-'.
Over the whole model, each state has a transition table, and each state recalculates its probability with each new code in the same way we just calculated it for state 'A'.
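The calculation for state 'A' can be reproduced with a short sketch. The two tables are copied from above. The tables for Start, D2 and Finish are not given in these notes, so the helper simply skips states without a table, just as the worked example does; the function and variable names are illustrative.

```python
CODES = ["/", "-", "\\", "0", "?"]
STATES = ["Start", "D1", "A", "D2", "Finish"]

def table(rows):
    """Build {next_state: {code: probability}} from rows in STATES order."""
    return {state: dict(zip(CODES, row)) for state, row in zip(STATES, rows)}

# TABLES[current_state][next_state][code]
TABLES = {
    "D1": table([
        [0.0, 0.0, 0.2, 0.0, 0.2],
        [0.9, 0.1, 0.2, 0.9, 0.2],
        [0.1, 0.9, 0.2, 0.1, 0.2],
        [0.0, 0.0, 0.2, 0.0, 0.2],
        [0.0, 0.0, 0.2, 0.0, 0.2],
    ]),
    "A": table([
        [0.2, 0.0, 0.0, 0.2, 0.2],
        [0.2, 0.0, 0.0, 0.2, 0.2],
        [0.2, 0.9, 0.1, 0.2, 0.2],
        [0.2, 0.1, 0.9, 0.2, 0.2],
        [0.2, 0.0, 0.0, 0.2, 0.2],
    ]),
}

# Current belief about which state the model is in.
current = {"Start": 0.1, "D1": 0.8, "A": 0.1, "D2": 0.0, "Finish": 0.0}

def next_probability(target, code, belief, tables):
    """Sum over current states: P(now in state) x P(target | state, code)."""
    return sum(p * tables[state][target][code]
               for state, p in belief.items() if state in tables)

p_a = next_probability("A", "-", current, TABLES)
print(round(p_a, 2))  # 0.81
```

Running the same function for every target state, with a full set of tables, would give the complete updated distribution after each code.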
Our model yields probabilistic estimates of which unit of speech has just been recognized. (Maybe he said [b], maybe [p])
We can correct our model by applying a similar modeling process to the next linguistic level up. (Which is more likely: 'beer' or 'peer'?)
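A much simpler stand-in for that higher-level model is to weight each candidate word's acoustic score by how likely the word is in context. This is plain Bayesian rescoring rather than a second HMM, and every number below is made up for illustration.

```python
def rescore(acoustic, prior):
    """Combine acoustic scores with word priors and renormalize."""
    joint = {word: acoustic[word] * prior[word] for word in acoustic}
    total = sum(joint.values())
    return {word: p / total for word, p in joint.items()}

# Hypothetical scores: the sound model slightly prefers [p] over [b] ...
acoustic = {"beer": 0.4, "peer": 0.6}
# ... but in a hypothetical context, 'beer' is the far more likely word.
prior = {"beer": 0.9, "peer": 0.1}

posterior = rescore(acoustic, prior)
best = max(posterior, key=posterior.get)
print(best)  # beer
```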
In this assignment you will analyze this spectrogram.
It is a spectrogram of 'funny money'. Note that since it rhymes, the 'unny' sound occurs twice, while there's a contrast between the f- and m- sounds.
Print out the file on a laser printer. You may want to make several copies.
Use a highlighter pen to mark off alternate segments where the spectrogram changes abruptly.
Indicate the probable boundaries between the phonemes on the bottom of the transcript. Do this by drawing brackets with a pencil along the bottom of the spectrogram.
Look here for an example transcript of 'Jack Sprat'.
For each of the phonemes in your sample (F, M, U, N, Y), describe the segments which map onto it.