Text Normalization for Building LMs for ASR

Normalizing text to build language models (LMs) for automatic speech recognition (ASR) involves several steps:

- data cleaning [remove XML or HTML markup]

- removing punctuation [consider keeping dashes]

- lowercasing or uppercasing [whichever matches the lexicon in use]

- converting text into spoken form [specific to LMs for ASR, and the hardest part of normalization]
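
The first three steps in the list above (cleaning, punctuation removal, casing) can be sketched in a few lines of Python. This is only a minimal illustration, not a production pipeline: the function name clean_line and its keep_dashes and upper parameters are hypothetical, stripping tags with a regular expression is a rough stand-in for a real HTML/XML parser, and keeping apostrophes (so that "it's" survives) is an assumption not stated above.

```python
import html
import re

# Rough tag stripper; an assumption for illustration, not a full HTML/XML parser.
TAG_RE = re.compile(r"<[^>]+>")


def clean_line(text, keep_dashes=True, upper=False):
    """Strip markup, remove punctuation, and normalize case (hypothetical helper)."""
    # 1. Data cleaning: drop tags, then decode entities such as &amp;
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)

    # 2. Punctuation removal. Apostrophes are kept (assumption) so that
    #    contractions still match lexicon entries.
    if keep_dashes:
        text = re.sub(r"[^\w\s'-]|_", " ", text)
        # Keep only dashes that sit inside a word, e.g. "state-of-the-art".
        text = re.sub(r"(?<!\w)-|-(?!\w)", " ", text)
    else:
        text = re.sub(r"[^\w\s']|_", " ", text)

    # Collapse the whitespace left behind by the substitutions.
    text = " ".join(text.split())

    # 3. Casing: pick whichever matches the lexicon.
    return text.upper() if upper else text.lower()


print(clean_line("<p>Hello, <b>World</b>! It's state-of-the-art.</p>"))
print(clean_line("<p>Hello, <b>World</b>!</p>", keep_dashes=False, upper=True))
```

Whether to keep dashes and which case to use are left as parameters because, as noted above, both depend on the lexicon.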

Examples of spoken-form conversion:

- google.com -> "google dot com"

- $25 -> "twenty five dollars" [note that the currency is written first but spoken last]
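
The two examples above can be reproduced with a small rule-based sketch. It is deliberately narrow and rests on several assumptions: the names number_to_words and to_spoken_form are hypothetical, only whole-dollar amounts below one million are handled (amounts with cents such as $25.50 are left untouched), and the domain pattern is a simple heuristic. Real spoken-form conversion needs many more rules, which is part of why it is the hardest step.

```python
import re

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999999 (hypothetical helper)."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, rest = divmod(n, 10)
        return TENS[tens] + (" " + ONES[rest] if rest else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        return ONES[hundreds] + " hundred" + (" " + number_to_words(rest) if rest else "")
    thousands, rest = divmod(n, 1000)
    return number_to_words(thousands) + " thousand" + (" " + number_to_words(rest) if rest else "")


# Whole-dollar amounts only; the lookahead skips amounts like $25.50 or $1,000.
DOLLAR_RE = re.compile(r"\$(\d+)(?![.,]?\d)")
# Heuristic: dot-separated labels ending in an alphabetic label of 2+ letters.
DOMAIN_RE = re.compile(r"\b[a-z0-9-]+(?:\.[a-z0-9-]+)*\.[a-z]{2,}\b", re.IGNORECASE)


def _spell_dollars(match):
    n = int(match.group(1))
    if n >= 1_000_000:
        return match.group(0)  # out of range for this sketch: leave as written
    # The currency word moves after the number, as in the $25 example.
    return f"{number_to_words(n)} {'dollar' if n == 1 else 'dollars'}"


def to_spoken_form(text):
    """Rewrite dollar amounts and domain names the way they are spoken."""
    text = DOLLAR_RE.sub(_spell_dollars, text)
    text = DOMAIN_RE.sub(lambda m: m.group(0).replace(".", " dot "), text)
    return text


print(to_spoken_form("google.com"))
print(to_spoken_form("$25"))
```

A step like this has to run before punctuation removal, since the rules rely on the dollar sign and the dots still being present.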

Nadira Povey