Outline - Stochastic pronunciation modelling from hand-labelled phonetic corpora

📑 Outline of Riley et al. (1999)

Stochastic pronunciation modelling from hand-labelled phonetic corpora

@article{riley1999pronunciation,
title = {Stochastic pronunciation modelling from hand-labelled phonetic corpora},
journal = {Speech Communication},
volume = {29},
number = {2},
pages = {209-224},
year = {1999},
issn = {0167-6393},
doi = {https://doi.org/10.1016/S0167-6393(99)00037-0},
url = {https://www.sciencedirect.com/science/article/pii/S0167639399000370},
author = {Michael Riley and William Byrne and Michael Finke and Sanjeev Khudanpur and Andrej Ljolje and John McDonough and Harriet Nock and Murat Saraclar and Charles Wooters and George Zavaliagkos},
}

📑 Outline of Riley et al. (1999)

1. Introduction

Motivation for modeling pronunciation variation
Challenges of canonical dictionaries in ASR
Advantages of learning from phonetically transcribed speech corpora

2. Pronunciation Modeling Approach

Use of decision trees to learn mappings between:
- Lexical (canonical) phoneme sequences
- Observed surface phone sequences
Model representation:
- Input: phoneme with context (e.g., word position, neighboring phonemes)
- Output: probabilistic distribution over phones
Description of how variants are extracted from hand-labeled data

3. Corpora Used

Description of three phonetic corpora:
1. TIMIT: Read speech, clean recordings
2. ICSI Switchboard corpus: Conversational telephone speech
3. CSLU Spontaneous Speech corpus
Details of:
- Lexicon preparation
- Forced alignment and phonetic transcription alignment methods

4. Data Preparation

Canonical pronunciation lexicon derivation
Aligning dictionary forms with hand-labeled surface realizations
Mapping phones to phonemes (many-to-one and one-to-many mappings)

5. Decision Tree Training

Learning pronunciation rules using CART-style decision trees
Feature set for conditioning:
- Left/right phonemic context
- Word position
- Stress markers
- Part-of-speech tags (optional)
Smoothing and pruning strategies

6. Experimental Results

Evaluation metrics:
- Word error rate (WER) improvement in ASR
- Lexicon coverage increase
- Comparison of baseline vs. expanded pronunciation lexicons
Results on:
- TIMIT (limited gain, due to low variation)
- ICSI (more natural variation, larger gains)
- CSLU (high variability, challenging data)

7. Analysis and Discussion

Interpretation of decision tree outputs
Nature of learned pronunciation rules:
- Context-dependent deletions, insertions, substitutions
Examples of learned patterns:
- Flapping, vowel reduction, glottalization
Observations on overfitting and rule generalization

8. Comparison with Other Approaches

Rule-based vs. data-driven methods
Relation to maximum entropy models
Contrast with G2P models trained only on dictionaries

9. Conclusion

Summary of findings:
- Data-driven pronunciation modeling from hand-labeled corpora is effective
- Learns interpretable and generalizable variation
Limitations:
- Dependency on high-quality phonetic corpora
Future work:
- Integration with G2P systems
- Use of more diverse corpora
- Better handling of spontaneous speech variation