How Data is Held in TM: Translation Units and Tokens

The core information in a TM is the set of source segments of text and their translations. Each pair of source and translated text segments is called a translation unit (TU). Except in the case of a perfect match, a TM does not hold any relationship between TUs. Each TU has associated data (such as the author); this data is held in a structure called a field. Some text is stored verbatim, while other text, such as dates, is usually stored in tokenized form. When text is stored as tokens, the TM can match source texts that are essentially the same but have different values for the tokens, for example sentences that are identical except that they mention different dates.

Text stored as tokens

When text is recognized as a token, the TM stores it in tokenized form. Using tokens makes it easy for the TM to match the following two segments (assuming the TM recognizes numbers as tokens):

I bought 5 apples. (in the TM)

I bought 10 apples. (in the source text)
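
The matching described above can be illustrated with a minimal sketch. It assumes that numbers are the only recognized tokens and that a token is replaced by a placeholder before segments are compared. The names NUMBER, PLACEHOLDER and tokenize are hypothetical and chosen for illustration; they do not describe how the TM is actually implemented.

```python
import re

# Assumption: a "number token" is a run of digits, optionally with
# internal "." or "," separators (for example 10, 3.5 or 1,000).
NUMBER = re.compile(r"\d+(?:[.,]\d+)*")
PLACEHOLDER = "<NUM>"  # hypothetical placeholder, not the TM's real format


def tokenize(segment):
    """Return (template, values) for a segment.

    The template is the segment with every number replaced by the
    placeholder; values lists the numbers that were replaced, in order.
    """
    values = NUMBER.findall(segment)
    template = NUMBER.sub(PLACEHOLDER, segment)
    return template, values


tm_template, tm_values = tokenize("I bought 5 apples.")
src_template, src_values = tokenize("I bought 10 apples.")

# The two segments differ only in the token value, so their templates agree.
same_template = tm_template == src_template
```

Comparing templates rather than raw text is what lets the two segments be treated as the same sentence, while the extracted values remain available so that the number in the stored translation could be replaced with the one from the new source text.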

Use of trigram indexes for concordance and languages without word breaks

For languages that may not have word breaks, or for any language if concordance searching is selected, the TM indexes every sequence of three consecutive characters (a trigram, or tri-gram) and uses these index entries to find matches.

For example if the source text is:

The cat sat on the mat

The TM will create index entries for:

The

e c

cat

and so on.
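
As a rough sketch of the idea, the following code builds a trigram index over a few segments and uses it to find candidate matches for a search string. It assumes a sliding window that moves one character at a time and counts spaces as characters, so it produces every overlapping trigram; the TM's actual indexing scheme may differ. The function names trigrams, build_index and candidates are hypothetical.

```python
from collections import defaultdict


def trigrams(text):
    """Return every run of three consecutive characters, spaces included."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_index(segments):
    """Map each trigram to the set of segment numbers that contain it."""
    index = defaultdict(set)
    for seg_id, text in enumerate(segments):
        for gram in trigrams(text):
            index[gram].add(seg_id)
    return index


def candidates(index, query):
    """Return segment numbers ordered by how many query trigrams they share."""
    counts = defaultdict(int)
    for gram in set(trigrams(query)):
        for seg_id in index.get(gram, ()):
            counts[seg_id] += 1
    # Most shared trigrams first; ties broken by segment number.
    return sorted(counts, key=lambda s: (-counts[s], s))


segments = ["The cat sat on the mat", "A dog barked"]
index = build_index(segments)
hits = candidates(index, "cat sat")
```

Because the index is built from characters rather than words, a lookup of this kind does not depend on the text containing word breaks, which is why the same approach can serve both concordance searches and languages written without spaces.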