Text normalization for building language models (LMs) for ASR involves several steps:
- data cleaning [remove XML or HTML markup]
- removing punctuation [might consider keeping dashes]
- lowercasing or uppercasing [depends on the lexicon used]
- converting text into its spoken form [specific to LMs for ASR, and the hardest part of normalization]
Examples:
- google.com -> "google dot com"
- $25 -> "twenty five dollars" [notice that the dollar sign, spoken as "dollars", moves to the end]
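
The steps above can be sketched as a small Python pipeline. This is only a rough illustration, not a production normalizer: the function names (`strip_markup`, `verbalize`, `normalize`) are made up for this post, the number speller only covers 0 to 999, and the rules only handle whole-dollar amounts and simple domain-like tokens. Real systems usually rely on much larger hand-written grammars. Note that the spoken-form step has to run before punctuation removal, because the `$` and `.` characters carry the information we need.

```python
import html
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]


def number_to_words(n: int) -> str:
    """Spell out an integer in the range 0..999 (enough for this sketch)."""
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"
    hundreds, rest = divmod(n, 100)
    head = f"{ONES[hundreds]} hundred"
    return head if rest == 0 else f"{head} {number_to_words(rest)}"


def strip_markup(text: str) -> str:
    """Data cleaning: drop XML/HTML tags and decode entities like &amp;."""
    text = re.sub(r"<[^>]+>", " ", text)
    return html.unescape(text)


def verbalize(text: str) -> str:
    """Convert a few written forms into their spoken form."""
    def money(match: re.Match) -> str:
        amount = int(match.group(1))
        unit = "dollar" if amount == 1 else "dollars"
        # The currency word moves after the number: $25 -> twenty five dollars
        return f"{number_to_words(amount)} {unit}"

    # Whole-dollar amounts up to $999 (assumption: no cents, no thousands).
    text = re.sub(r"\$(\d{1,3})\b", money, text)
    # A dot between an alphanumeric and a letter is read aloud: google.com
    text = re.sub(r"(?<=[A-Za-z0-9])\.(?=[A-Za-z])", " dot ", text)
    # Any remaining standalone numbers of up to three digits.
    text = re.sub(r"\b\d{1,3}\b",
                  lambda m: number_to_words(int(m.group())), text)
    return text


def normalize(text: str) -> str:
    """Run all steps: clean, verbalize, strip punctuation, lowercase."""
    text = strip_markup(text)
    text = verbalize(text)
    # Remove punctuation but keep dashes and apostrophes inside words.
    text = re.sub(r"[^\w\s'-]", " ", text)
    # Lowercasing here is an assumption; match whatever the lexicon uses.
    text = text.lower()
    return " ".join(text.split())


print(normalize("<p>Visit google.com, it costs $25!</p>"))
# visit google dot com it costs twenty five dollars
```

Abbreviations such as "e.g." or amounts such as "$25.50" would be mishandled by these simple rules, which is part of why the spoken-form step is the hardest one.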