About Segmentation Settings: How a TM Segments Text

Segmentation settings define how a TM or a project divides source text into segments.

Segmentation rules are defined in the Language Resources section of TM settings. You can also define the segmentation rules that SDL Studio GroupShare uses when there is no applicable TM: these rules are defined in a Language Resources template, whose location is specified under project settings.

Segmentation Rules

Segmentation rules are defined by the regular expressions that specify a segment.

Often a segment is identical to a sentence, in which case the regular expression specifies the text patterns that constitute a sentence.

In any one project, for the same language pair, you can use multiple main TMs with different segmentation rules.

Rules specifying exceptions

List of abbreviations. This contains abbreviations that end with a period (.), for example, etc. The period at the end of etc. does not necessarily mark the end of a sentence, although it does when the abbreviation happens to be the last word of the sentence.
List of ordinal followers. Like abbreviations, ordinal followers provide cases where a period does not necessarily mark the end of a segment: when followed by certain nouns, a set of digits followed by a period (for example, 23.) signifies an ordinal (23rd), not the end of a sentence. For example, 23. April can mean 23rd April. The list of ordinal followers is the list of such nouns.
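The effect of the two exception lists can be sketched in Python. This is only an illustration, not the product's implementation: the function name is_segment_break and the sample list contents are hypothetical, and the sketch simply suppresses the break for every listed abbreviation, although in real text such a period can still end a sentence.

```python
import re

# Hypothetical sample lists; the real lists are configured in the
# language resources of a TM or template.
ABBREVIATIONS = {"etc.", "e.g.", "Dr."}
ORDINAL_FOLLOWERS = {"April", "Mai", "Juni"}


def is_segment_break(text, period_index):
    """Decide whether the period at period_index ends a segment."""
    before = text[:period_index + 1].split()
    after = text[period_index + 1:].split()
    last_token = before[-1] if before else ""
    next_token = after[0] if after else ""

    # Exception 1: the period belongs to a listed abbreviation.
    # Simplification: the break is always suppressed here.
    if last_token in ABBREVIATIONS:
        return False

    # Exception 2: digits plus a period, followed by a listed ordinal follower.
    is_number = re.fullmatch(r"\d+\.", last_token) is not None
    if is_number and next_token.strip(".,;:") in ORDINAL_FOLLOWERS:
        return False

    # No exception applies, so the period closes the segment.
    return True
```

With these sample lists, the period in "Der 23. April ist ein Feiertag." after 23 is not treated as a break, while the final period is.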

Example: A simple segmentation rule

\.+[\p{Pe}\p{Pf}\p{Po}"]*

This regular expression identifies the end of a segment in a rather simplistic manner. It matches one or more periods, optionally followed by further punctuation that closes the segment, such as a closing bracket or quotation mark.

Close, final quote, and other punctuation are Unicode categories, specified by the following codes:

\p{Pe} specifies close punctuation.

\p{Pf} specifies final quote punctuation.

\p{Po} specifies other punctuation.

For more information about Unicode categories, see, for example, http://msdn.microsoft.com/en-us/library/system.globalization.unicodecategory.aspx.
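As a rough illustration of how the example pattern behaves as written, the following Python sketch emulates it by hand. Python's standard re module does not support \p{...} classes, so the sketch checks categories with unicodedata instead. The function names are hypothetical, and the requirement that whitespace (or the end of the text) follows the match is an assumption added here so that numbers such as 3.14 are not split; it is not part of the pattern shown above.

```python
import unicodedata

# Unicode general categories named in the example pattern:
# close (Pe), final quote (Pf), and other (Po) punctuation.
TRAILING_CATEGORIES = {"Pe", "Pf", "Po"}


def is_trailing_punctuation(ch):
    """True if ch is matched by the class [\\p{Pe}\\p{Pf}\\p{Po}"]."""
    return ch == '"' or unicodedata.category(ch) in TRAILING_CATEGORIES


def split_segments(text):
    """Split text after each run of periods plus trailing punctuation.

    Assumption: a break is accepted only when whitespace or the end
    of the text follows the matched punctuation.
    """
    segments = []
    start = 0
    i = 0
    n = len(text)
    while i < n:
        if text[i] != ".":
            i += 1
            continue
        j = i
        # \.+ : consume one or more periods
        while j < n and text[j] == ".":
            j += 1
        # [...]* : consume any trailing punctuation
        while j < n and is_trailing_punctuation(text[j]):
            j += 1
        if j == n or text[j].isspace():
            segments.append(text[start:j])
            # Skip the whitespace between segments.
            while j < n and text[j].isspace():
                j += 1
            start = j
        i = j
    # Keep any remaining text that has no closing period.
    if start < n:
        segments.append(text[start:])
    return segments
```

For instance, split_segments('(See above.) Next.') yields the two segments '(See above.)' and 'Next.', because the closing bracket belongs to category Pe and is kept with the preceding period.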

Topic: Published: 27-Jun-2012