Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLDR-16941 Improve coverage for non-latn default/native number symbols/patterns #3690

Conversation

pedberg-icu
Copy link
Contributor

@pedberg-icu pedberg-icu commented May 6, 2024

CLDR-16941

  • This PR completes the ticket.
  1. Improve coverage for number system symbols and patterns belonging to non-latn default, native, and finance numbering systems in all locales. In some cases we select coverage by script instead of or in addition to language, to disambiguate cases where languages are written in multiple scripts.
  2. Add a unit test to verify that various number symbols and patterns are at the desired coverage level for each of the two or three number systems (default, native, and possibly finance) specified in a locale, for all locales. This currently takes some time (25 sec on my machine), maybe could be sped up.
  3. Whet this does not do is group collections of number symbols or patterns into sets. That would require coverage variables with alternate values for elements in paths, which is currently not supported. We can discuss for a future enhancement, this would have performance impact.

ALLOW_MANY_COMMITS=true

@AEApple
Copy link
Contributor

AEApple commented May 6, 2024

Are we also tracking whether the equivalent number formats to those currently in basic where the locale uses native digits by default in a separate ticket for resolution in esub or a future release instead (since they are available at a higher coverage level if we do need to change them in v46)?

@pedberg-icu
Copy link
Contributor Author

Are we also tracking whether the equivalent number formats to those currently in basic where the locale uses native digits by default in a separate ticket for resolution in esub or a future release instead (since they are available at a higher coverage level if we do need to change them in v46)?

@AEApple Not sure I understand the question. We had coverage in certain locales for the number symbols and patterns for default and native number systems, at either moderate or modern coverage depending on the item. What this ticket does is ensure that we do that across all locales for all of the number systems that are default, native, or finance systems in those locales.

@@ -1278,7 +1278,7 @@ CLDR data files are interpreted according to the LDML specification (http://unic
<!ATTLIST coverageLevel inLanguage CDATA #IMPLIED >
<!--@MATCH:any-->
<!ATTLIST coverageLevel inScript CDATA #IMPLIED >
<!--@MATCH:validity/script-->
<!--@MATCH:any-->
Copy link
Contributor Author

@pedberg-icu pedberg-icu May 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so we can use coverage variables for inScript, as we already do for inLanguage and inTerritory.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ideally @MATCH:validity/script-with-variables or something

Larger question is whether coverage level belongs in "supplemental data" when, like casing, it's primarily inward tool-focused data. Perhaps we should split some of these out -- and give them their own DTD.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@macchiati macchiati left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great. A lot of work, though! Longer term, I think we need to set aside some time to review the structure of the coverage levels, and some up with a more concise format to lower the maintenance burden. We already talked about expanding elements; we also might have more predefined variables (that we can construct at boot time based on what's in a particular locale). And other ideas.

<coverageVariable key="%guruNativeLanguages" value="(pa)"/>
<coverageVariable key="%hanidecNativeLanguages" value="(yue|zh)"/>
<!-- Number system coverage by language and/or script; default number system items at various levels, native & finance system items all at modern -->
<coverageVariable key="%adlmDefaultScripts" value="(Adlm)"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't think any of this belongs in data at all. It should be read from the locale files themselves and cached. But.. it's an improvement.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -265,29 +283,34 @@ For terms of use, see http://www.unicode.org/copyright.html
<coverageLevel value="moderate" inLanguage="%traditionalCollationLanguages" match="localeDisplayNames/types/type[@key='collation'][@type='traditional']"/>
<coverageLevel value="moderate" inLanguage="%CJK_Languages" match="localeDisplayNames/types/type[@key='collation'][@type='unihan']"/>
<coverageLevel value="moderate" match="localeDisplayNames/types/type[@key='numbers'][@type='latn']"/>
<coverageLevel value="moderate" inLanguage="zh" match="localeDisplayNames/types/type[@key='numbers'][@type='(han(s|t)(fin)?|hanidec)']"/>
<coverageLevel value="moderate" inLanguage="%hanidecNativeLanguages" match="localeDisplayNames/types/type[@key='numbers'][@type='(han(s|t)(fin)?|hanidec)']"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
<coverageLevel value="moderate" inLanguage="%hanidecNativeLanguages" match="localeDisplayNames/types/type[@key='numbers'][@type='(han(s|t)(fin)?|hanidec)']"/>
<coverageLevel value="moderate" inLanguage="%hanidecNativeLanguages" match="localeDisplayNames/types/type[@key='numbers'][@type='(hans|hant)(fin)?|hanidec)']"/>

more readable.. or, spell all of them out.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed this in the follow-on PR.

<coverageLevel value="moderate" inScript="%mteiDefaultScripts" match="numbers/symbols[@numberSystem='mtei']/group"/>
<coverageLevel value="moderate" inScript="%mteiDefaultScripts" match="numbers/symbols[@numberSystem='mtei']/minusSign"/>
<coverageLevel value="moderate" inScript="%mteiDefaultScripts" match="numbers/symbols[@numberSystem='mtei']/percentSign"/>
<coverageLevel value="moderate" inScript="%mteiDefaultScripts" match="numbers/symbols[@numberSystem='mtei']/plusSign"/>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'll say again, I appreciate this being done.

@srl295
Copy link
Member

srl295 commented May 7, 2024

looks great. A lot of work, though! Longer term, I think we need to set aside some time to review the structure of the coverage levels, and some up with a more concise format to lower the maintenance burden. We already talked about expanding elements; we also might have more predefined variables (that we can construct at boot time based on what's in a particular locale). And other ideas.

drop the data from coverage*.xml entirely and derive it from the locale files

@pedberg-icu pedberg-icu merged commit d2123f4 into unicode-org:main May 7, 2024
10 checks passed
@pedberg-icu pedberg-icu deleted the CLDR-16941-better-coverage-for-number-systems branch May 7, 2024 18:07
@pedberg-icu
Copy link
Contributor Author

looks great. A lot of work, though! Longer term, I think we need to set aside some time to review the structure of the coverage levels, and some up with a more concise format to lower the maintenance burden. We already talked about expanding elements; we also might have more predefined variables (that we can construct at boot time based on what's in a particular locale). And other ideas.

drop the data from coverage*.xml entirely and derive it from the locale files

Covered by https://unicode-org.atlassian.net/browse/CLDR-17618

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants