Using the Intl segmenter API

Created on November 12, 2023 at 10:22 am


The Intl API in browsers has a ton of functionality around the formatting of text and numbers. While filling in the State of HTML survey I came across one feature I hadn’t seen before: the segmenter API. Upon looking into it I found out that it lets you break up text into segments based on the language of the text.

This caught my attention because I recently implemented a text measurement feature and I was very annoyed that I only managed to support Western languages, using Latin characters and spaces to separate words.

I spent a lot of time with npm packages, regexes and a lot of trial and error to try and support other languages but didn’t end up with a solution for them.

Enter the segmenter API.

How we use the segmenter API in Polypane

Whenever you select text in Polypane and right-click to open the context menu, we’ll show you the number of characters, words and emoji in the text selection.

Use this to quickly check if you’ve written the right number of characters for a tweet, or if you’ve written the right number of words for a blog post.
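Polypane’s actual implementation isn’t public, but a minimal sketch of how such counts could be computed with the segmenter API might look like the following. The `countSelection` helper and its emoji regex are my own hypothetical additions, not Polypane’s code:

```js
// Hypothetical helper (not Polypane's actual code): count characters,
// words and emoji in a text selection using Intl.Segmenter.
function countSelection(text, locale = 'en') {
  // Graphemes are the characters as a reader perceives them.
  const graphemes = [...new Intl.Segmenter(locale, { granularity: 'grapheme' }).segment(text)]
    .map((s) => s.segment);
  // Only segments flagged as word-like count as words.
  const words = [...new Intl.Segmenter(locale, { granularity: 'word' }).segment(text)]
    .filter((s) => s.isWordLike);
  // Extended_Pictographic matches emoji-like graphemes.
  const emoji = graphemes.filter((g) => /\p{Extended_Pictographic}/u.test(g));
  return { characters: graphemes.length, words: words.length, emoji: emoji.length };
}

countSelection('Hi there 👋'); // → { characters: 10, words: 2, emoji: 1 }
```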

Setting up the segmenter API

First, the bad news. The segmenter API is available in Safari from v14.1 and Chromium from v87, but it’s not in Firefox yet. There is a polyfill, but it seems pretty heavy.
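Given the patchy support, it’s worth feature-detecting before use. Here’s a minimal sketch; the `getWords` helper and its naive fallback are assumptions of mine, not part of the API:

```js
// Feature-detect Intl.Segmenter, and fall back to a naive split otherwise.
function getWords(text, locale = 'en') {
  if (typeof Intl !== 'undefined' && 'Segmenter' in Intl) {
    const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
    return [...segmenter.segment(text)]
      .filter((s) => s.isWordLike)
      .map((s) => s.segment);
  }
  // Naive fallback: only works for space-separated languages.
  return text.split(/\s+/).filter(Boolean);
}
```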

To use it, you first create a new segmenter. That takes two arguments: a language and a set of options.

```js
const segmenter = new Intl.Segmenter('en', { granularity: 'word' });
```

The options object has two properties, both of which can be omitted:

- localeMatcher has a value of ‘lookup’ or ‘best fit’ (default), with ‘lookup’ using a specific matching algorithm (BCP 47) and ‘best fit’ using a looser algorithm that tries to match the language as closely as possible from what’s available in the browser or OS.
- granularity has a value of ‘grapheme’ (default), ‘word’ or ‘sentence’. This determines what types of segments are returned. Graphemes are the characters as you see them.
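The difference between graphemes and what `String.prototype.length` counts matters for emoji and combined characters. A quick sketch, assuming an ICU-backed runtime:

```js
// '👨‍👩‍👧' is a single grapheme built from several code points joined with
// zero-width joiners, but String.length counts UTF-16 code units.
const segmenter = new Intl.Segmenter('en', { granularity: 'grapheme' });
const family = '👨‍👩‍👧';
const graphemes = [...segmenter.segment(family)].map((s) => s.segment);

family.length;    // 8 — code units, not what a reader sees
graphemes.length; // 1 — one perceived character
```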

Then, you call segmenter.segment with the text you want to segment. This returns an iterator, which you can convert to an array using Array.from:

```js
const segments = segmenter.segment('This has four words!');
Array.from(segments).map((segment) => segment.segment);
// → ['This', ' ', 'has', ' ', 'four', ' ', 'words', '!']
```

Each item in the iterator is an object. It has different values depending on the granularity, but at least:

- segment is the actual text of the segment.
- index is the index of the segment in the original text, i.e. where it starts.
- input is the original text.

For the word granularity, it also has:

- isWordLike is a boolean that is true if the segment is a word or word-like.

So you might have looked at the array and said "well, those white spaces aren’t words, are they?" And you’d be right. To get the words, we need to filter the array:

```js
const segments = segmenter.segment('This has four words!');
Array.from(segments)
  .filter((segment) => segment.isWordLike)
  .map((segment) => segment.segment);
// → ['This', 'has', 'four', 'words']
```

And now we know how many actual words that sentence has.

Instead of words like the example above, you can also ask it to segment sentences, and it knows which periods end sentences and which are used for e.g. thousands separators in numbers:

```js
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' });
const segments = segmenter.segment('This is a sentence with the value 100.000 in it. This is another sentence.');
Array.from(segments).map((segment) => segment.segment);
```

There are no isSentenceLike or isGraphemeLike properties; instead, you can count the number of items directly.
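So counting sentences is just counting segments. A quick sketch:

```js
// Count sentences by counting sentence-granularity segments.
const segmenter = new Intl.Segmenter('en', { granularity: 'sentence' });
const sentenceCount = [...segmenter.segment('One sentence. A second? A third!')].length;
// → 3
```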

Support for other languages

The real power is that the API doesn’t just split strings on spaces or periods, but uses specific algorithms for each language. That means the API can also segment words in languages that do not use spaces, like Japanese:

```js
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });
const segments = segmenter.segment('これは日本語のテキストです');
Array.from(segments).map((segment) => segment.segment);
```

Now, to be fair, I cannot read Japanese so I have no idea if this segmentation is correct, but it sure looks convincing.

Compared to string splitting or regexes, this is a lot more accurate, handles non-Latin characters and punctuation correctly, needs less code and doesn’t need additional checks and tests. Browsers ship with an understanding of a large number of languages, so you don’t have to worry about that either.
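To make the contrast concrete, here’s a sketch comparing a naive space split with the segmenter on the Japanese text from above:

```js
// Splitting on spaces finds no word boundaries in Japanese text,
// while the segmenter's language-specific algorithm does.
const text = 'これは日本語のテキストです';
const naive = text.split(' ');
const segmenter = new Intl.Segmenter('ja', { granularity: 'word' });
const segmented = [...segmenter.segment(text)].map((s) => s.segment);

naive.length;     // 1 — the whole string, no boundaries found
segmented.length; // several word segments
```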

Dig into the Intl API

The Intl API has an incredible number of useful tools for creating natural language constructs like lists, dates, money, numbers and more. By and large it’s easy to work with, but it uses a lot of very precise terminology that you might not be familiar with (for example, "grapheme" instead of "character", because the latter is too ambiguous), so it can be a bit daunting to get started with. If you’ve never used it, check out the MDN page on Intl.
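For a taste of those other tools, here are two of them, with the outputs you’d get for an `en` locale:

```js
// Format a list of items as natural English.
new Intl.ListFormat('en', { style: 'long', type: 'conjunction' })
  .format(['lists', 'dates', 'numbers']);
// → 'lists, dates, and numbers'

// Format a number as currency.
new Intl.NumberFormat('en', { style: 'currency', currency: 'USD' }).format(1234.5);
// → '$1,234.50'
```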

As for Polypane, you can bet the next version will have a text measurement option that works across languages and also tells you the number of sentences in any text you’ve selected.
