Unicode Character Properties in EcmaScript

Spec: https://srl295.github.io/es-unicode-properties/

Proposal for stage 0

Adds a function that returns the Unicode character properties of a code point.

Applications can use it to answer questions directly, such as “What kind of script is 𞤘?”, “Is ġ lowercase?”, or “What is the numeric value of ५?”.

For feature implementers, this is a required building block for a wide array of higher-level features, such as number parsing, segmentation, regular expressions, and much more.

Definitions

A property (for this purpose) is a string (including enumerated types), a number, or a boolean.

Code Point: a String containing a single Unicode code point (1 or 2 UTF-16 code units).
Name: A Unicode Property Alias as given in UAX44. As an input, this may be either short or long form, producing identical results.
Value: A Unicode Property Value or Property Value Alias. As an output, the caller must explicitly or implicitly select an “abbreviated” or a “long” alias.
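The Code Point definition above implies that a valid input can have a `length` of 1 or 2. A minimal sketch of how a caller might check this follows; `isSingleCodePoint` is a hypothetical helper name, not part of the proposal.

```javascript
// Hypothetical helper: true when `s` holds exactly one Unicode code point.
function isSingleCodePoint(s) {
  // One code point occupies 1 or 2 UTF-16 code units.
  if (typeof s !== "string" || s.length === 0 || s.length > 2) return false;
  // Spreading a string iterates by code point, so a surrogate pair counts once.
  return [...s].length === 1;
}

console.log(isSingleCodePoint("\u0438"));    // и, one code unit
console.log(isSingleCodePoint("\u{1E918}")); // 𞤘, two code units, one code point
console.log(isSingleCodePoint("ab"));        // two code points
```

Iterating the string rather than checking `length` alone is what separates a supplementary character such as 𞤘 from two separate BMP characters.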

Examples

и (U+0438) ICU UBrowse
𞤘 (U+1E918) ICU UBrowse
ġ (U+0121)
५ (U+096B)

CP	Name	Long Name	Value	Long Value	Comments
`и`	`Gc`	`General_Category`	`Ll`	`Lowercase_Letter`	Enumeration
`𞤘`	`sc`	`Script`	`Adlm`	`Adlam`	Enumeration
`ġ`	`Lower`	`Lowercase`	true	true	Boolean
`५`	`nv`	`Numeric_Value`	5	5	Number

API Brainstorm For Discussion

Note: see Issues for further discussion.

"и".getUnicodeProperty("Gc", {type: "short"}) // "Ll"
"и".getUnicodeProperty("Gc", {type: "long"}) // "Lowercase_Letter"
"и".getUnicodeProperty("Gc") // "Lowercase_Letter"  - type:long is default
"и".getUnicodeProperty("General_Category") // "Lowercase_Letter" - "Gc" ≈ "General_Category"
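To make the intended behaviour concrete, here is a rough standalone sketch of the calls above. It is not the proposed implementation. `getUnicodePropertySketch`, `PROPERTY_ALIASES`, and `SAMPLE_DATA` are hypothetical names, and the data covers only the four rows of the Examples table. A real engine would consult its full Unicode data instead.

```javascript
// Short property name -> long property name (only the names used in the Examples table).
const PROPERTY_ALIASES = new Map([
  ["Gc", "General_Category"],
  ["sc", "Script"],
  ["Lower", "Lowercase"],
  ["nv", "Numeric_Value"],
]);
const LONG_NAMES = new Set(PROPERTY_ALIASES.values());

// Illustrative data only: one property per sample code point.
const SAMPLE_DATA = new Map([
  [0x0438, { General_Category: { short: "Ll", long: "Lowercase_Letter" } }],
  [0x1e918, { Script: { short: "Adlm", long: "Adlam" } }],
  [0x0121, { Lowercase: true }],
  [0x096b, { Numeric_Value: 5 }],
]);

function getUnicodePropertySketch(codePoint, name, { type = "long" } = {}) {
  // Short and long property names produce identical results.
  const longName = LONG_NAMES.has(name) ? name : PROPERTY_ALIASES.get(name);
  if (longName === undefined) throw new RangeError(`Unknown property: ${name}`);
  const entry = SAMPLE_DATA.get(codePoint.codePointAt(0));
  if (entry === undefined || !(longName in entry)) return undefined; // not in the sample
  const value = entry[longName];
  // Enumerated values carry both aliases; booleans and numbers are returned as-is.
  return typeof value === "object" ? value[type] : value;
}

console.log(getUnicodePropertySketch("\u0438", "Gc", { type: "short" }));
console.log(getUnicodePropertySketch("\u0438", "General_Category"));
```

The sketch is a free function rather than a `String.prototype` method only to avoid patching built-ins in an example.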

FAQ

Why should this be in EcmaScript?

Data Size, Complexity, Performance, Updates.

As of Unicode 13, there are nearly 150_000 characters encoded across the 1_114_112 code points (U+0000 to U+10FFFF) of the 21-bit code space. There are over 80 character properties. Storing and accessing this data in an efficient and up-to-date way is not trivial. However, any conformant implementation, especially one that includes Unicode regular expressions, already has all of this data, available via implementations such as ICU.

Why not just use RegEx?

In a way, getting a property is the inverse of Unicode Regular Expressions.

/\p{gc=Lowercase_Letter}/u.test('и')
// implies:
"и".getUnicodeProperty("Gc") === 'Lowercase_Letter'

If all that is needed is matching, certainly a regex could be used, especially for a boolean operation.

/\p{Lower}/u.test('e') === "e".getUnicodeProperty("Lower") // both true 
/\p{Lower}/u.test('E') === "E".getUnicodeProperty("Lower") // both false

However, for classifying (as in segmentation) or analyzing (as in number parsing), this becomes unwieldy.
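For a property that current engines do expose through `\p{…}`, the trial-and-error approach can be written out in full. The following sketch emulates a General_Category lookup by testing one regular expression per value; `generalCategoryByRegex` is a hypothetical name used only for illustration.

```javascript
// Short aliases of the 30 General_Category values from UAX #44.
const GC_VALUES = [
  "Lu", "Ll", "Lt", "Lm", "Lo", "Mn", "Mc", "Me", "Nd", "Nl", "No",
  "Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po", "Sm", "Sc", "Sk", "So",
  "Zs", "Zl", "Zp", "Cc", "Cf", "Cs", "Co", "Cn",
];

// Compile one anchored regex per value, once, so each lookup only runs tests.
const GC_TESTERS = GC_VALUES.map((v) => [v, new RegExp(`^\\p{gc=${v}}$`, "u")]);

// Emulates a General_Category lookup by trying each value in turn.
function generalCategoryByRegex(ch) {
  for (const [value, re] of GC_TESTERS) {
    if (re.test(ch)) return value;
  }
  return undefined; // the input was not a single code point
}

console.log(generalCategoryByRegex("\u0438")); // и
console.log(generalCategoryByRegex("\u096B")); // ५
```

Each lookup costs up to 30 regex tests, and the list of values must be maintained by hand, which is the overhead a direct property getter would remove.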

“Parse ١٢٣٬٤٥٦” into numeric form:

     if(/\p{NumericValue=0}/u.test('٢')) { digit = 0; } // false
else if(/\p{NumericValue=1}/u.test('٢')) { digit = 1; } // false
else if(/\p{NumericValue=2}/u.test('٢')) { digit = 2; } // true
else if(/\p{NumericValue=3}/u.test('٢')) { digit = 3; } // false
…

// vs:
digit = '٢'.getUnicodeProperty('nv') // 2

This could be used to convert ١٢٣٬٤٥٦ into Number(123456), since ٬ (U+066C) is the Arabic thousands separator.

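(This property was not supported by the JS engine I tested; ECMAScript property escapes cover only General_Category, Script, Script_Extensions, and binary properties.)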
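A sketch of the parsing loop this would enable is shown below. Since no numeric-value getter exists today, `numericValueSketch` is a hypothetical stand-in that only knows the Arabic-Indic digits U+0660 to U+0669. `parseDigitsSketch` is likewise an illustrative name. The sketch assumes U+066C (ARABIC THOUSANDS SEPARATOR) is the only grouping mark that needs to be skipped.

```javascript
// Stand-in for the proposed 'nv' lookup: knows only ARABIC-INDIC DIGIT ZERO..NINE.
function numericValueSketch(ch) {
  const cp = ch.codePointAt(0);
  return cp >= 0x0660 && cp <= 0x0669 ? cp - 0x0660 : undefined;
}

const ARABIC_THOUSANDS_SEPARATOR = "\u066C";

// Accumulates decimal digits left to right, skipping the grouping separator.
function parseDigitsSketch(text, numericValue = numericValueSketch) {
  let result = 0;
  for (const ch of text) { // for...of walks the string by code point
    if (ch === ARABIC_THOUSANDS_SEPARATOR) continue;
    const digit = numericValue(ch);
    if (digit === undefined) throw new SyntaxError(`Unexpected character: ${ch}`);
    result = result * 10 + digit;
  }
  return result;
}

// ١٢٣٬٤٥٦ written with escapes so the code point order is unambiguous.
console.log(parseDigitsSketch("\u0661\u0662\u0663\u066C\u0664\u0665\u0666"));
```

With a real property getter, `numericValueSketch` could be swapped for a call that works for every script's digits without any per-script table.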

Implement UAX29 Sentence Break Segmentation

Need to calculate the Sentence_Break property value for each character:

     if(/\p{Sentence_Break=Extend}/u .test('q')) { … } // false
else if(/\p{Sentence_Break=Lower}/u  .test('q')) { … } // true
else if(/\p{Sentence_Break=OLetter}/u.test('q')) { … } // false
else if(/\p{Sentence_Break=STerm}/u  .test('q')) { … } // false
…
// vs:
'q'.getUnicodeProperty('Sentence_Break') // 'Lower'

(This property was also not supported by the JS engine I tested.)

(For performance reasons, an application may want to get the properties of every code point in a string at once, rather than make multiple calls. See the issues for discussion.)
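One possible shape for such a per-string call is sketched below, purely as an illustration. `mapCodePoints` is a hypothetical name, and `isLowercase` uses the existing `\p{Lowercase}` escape as a stand-in for a real property getter.

```javascript
// Applies `lookup` to every code point of `text` in a single pass.
function mapCodePoints(text, lookup) {
  // Array.from iterates the string by code point, so surrogate pairs stay together.
  return Array.from(text, (ch) => ({ char: ch, value: lookup(ch) }));
}

// Stand-in lookup built on a binary property that regexes already support.
const isLowercase = (ch) => /^\p{Lowercase}$/u.test(ch);

// "a", "B", and 𞤘 (U+1E918, an Adlam capital letter).
console.log(mapCodePoints("aB\u{1E918}", isLowercase));
```

Returning one entry per code point, rather than per UTF-16 code unit, keeps the result aligned with the Code Point definition used by the single-character call.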

History

tc39/ecma402#90
