CLDR-16720 json: add transforms #4036

Merged: 9 commits merged on Sep 21, 2024

Member

@srl295 srl295 commented Sep 10, 2024

  • new package cldr-transforms
  • add manifest file transforms.json at the top level
  • each transform has a metadata file (transforms/ID.json) and a raw text file (transforms/ID.txt).
  • metadata has all of the keys from the transform rule
  • the _rulesFile key formally indicates the textfile's name (in case we need to massage the id for some reason in the future).

Sample data available at this branch: https://github.com/unicode-org/cldr-json/tree/cldr-16720/transforms/cldr-json/cldr-transforms

CLDR-16720

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

@srl295 srl295 self-assigned this Sep 10, 2024
@srl295 srl295 changed the title from "CLDR-17620 json: add transforms" to "CLDR-16720 json: add transforms" Sep 10, 2024
@srl295 srl295 force-pushed the cldr-16720/json-xlit branch from 7bdacce to 6d40472 on September 10, 2024 21:37
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@srl295 srl295 force-pushed the cldr-16720/json-xlit branch from f19ce43 to 4750f88 on September 10, 2024 21:43
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@srl295
Member Author

srl295 commented Sep 10, 2024

Deploy will fail because I used my personal fork, so no preview URL.

Please review the Markdown change carefully.

@sffc
Member

sffc commented Sep 11, 2024

The mix of JSON and TXT files in the same directory might make it hard to parse. The file extension is the only way to tell the difference, which usually isn't ideal. Can you split them into separate directories?

Also, since the JSON files are basically metadata on the TXT files, I have a mild expectation that there would be a single index.json with all the transform metadata in one place.

@robertbastian
Member

Also, since the JSON files are basically metadata on the TXT files, I have a mild expectation that there would be a single index.json with all the transform metadata in one place.

This is per-locale data. Nowhere in CLDR-JSON are multiple locales merged into an index file; they always use the file system structure.

@srl295
Member Author

srl295 commented Sep 11, 2024

The mix of JSON and TXT files in the same directory might make it hard to parse. The file extension is the only way to tell the difference, which usually isn't ideal. Can you split them into separate directories?

It's designed so clients don't need to crawl the directory or parse any filenames.

  1. Look at the transforms.json file. It has a list of ids, with no extension.
{
  "transforms": {
    "available": [
      "InterIndic-Bengali",
      "Oriya-Arabic",
      "my-t-my-d0-zawgyi",
      "tlh-am",
  2. For each id, there is transforms/id.json with the metadata.
{
  "transforms": {
    "BGN": {
      "_visibility": "external",
      "_alias": "Amharic-Latin/BGN am-Latn-t-am-m0-bgn",
      "_source": "am",
      "_target": "am_Latn",
      "_direction": "forward",
      "_rulesFile": "Amharic-Latin-BGN.txt"
    }
  }
}
  3. One of the metadata items is _rulesFile, which holds the path to the .txt file.
# Originally prepared by Michael Everson <[email protected]>
########################################################################
# MINIMAL FILTER: Amharic-Latin
:: [ሀ-᎙] ;
:: NFD (NFC) ;
$ejective = ’;
$glottal  = ’;
…

Also, since the JSON files are basically metadata on the TXT files, I have a mild expectation that there would be a single index.json with all the transform metadata in one place.

There is, it's transforms.json.

@srl295
Member Author

srl295 commented Sep 11, 2024

I see a couple of bugs:

  • _source and _target need to be bcp47, not old IDs.
  • There's a bug in _alias that has some corruption.
  • the 2nd level key in the metadata .json has a problem (because that ID might have slashes in it)

@sffc
Member

sffc commented Sep 11, 2024

Also, since the JSON files are basically metadata on the TXT files, I have a mild expectation that there would be a single index.json with all the transform metadata in one place.

This is per-locale data. Nowhere in CLDR-JSON multiple locales are merged in an index file, they always use the file system structure.

https://github.com/unicode-org/cldr-json/blob/main/cldr-json/cldr-core/supplemental/plurals.json

Look at the transforms.json file. It has a list of ids, with no extension

OK, that looks cool, thanks! I didn't see it the first time.

I still think I mildly favor not putting the JSON and TXT in the same directory, but I'll leave that to @robertbastian to weigh in on.

@srl295
Member Author

srl295 commented Sep 11, 2024

I don't see why being in the same directory would be a problem.

As Mark suggested, we need docs on the formats. That will be a separate effort.

Member

@macchiati macchiati left a comment

I had approved this, but Steven mentioned 3 outstanding bugs:

  • _source and _target need to be bcp47, not old IDs.
  • There's a bug in _alias that has some corruption.
  • the 2nd level key in the metadata .json has a problem (because that ID might have slashes in it)

@macchiati
Member

Shane, are there any other blockers?

- properly use BCP47 for source/target
- fix corruption in alias and slashes in output
- back out bcp47 - broke some source/target ids
@srl295 srl295 requested a review from macchiati September 17, 2024 23:00
@srl295
Member Author

srl295 commented Sep 17, 2024

Please review sample data in https://github.com/unicode-org/cldr-json/tree/cldr-16720/transforms/cldr-json/cldr-transforms

I now think source and target should not be bcp47 as they aren't always locale IDs. The alias field contains a bcp47 alias.

@macchiati
Member

macchiati commented Sep 18, 2024 via email

@sffc
Member

sffc commented Sep 18, 2024

What is the name of the second level key? In Amharic-Latin-BGN.json it is BGN

{
  "transforms": {
    "BGN": {
      "_value": "Amharic-Latin-BGN.txt",
      "_visibility": "external",
      "_alias": "Amharic-Latin/BGN am-Latn-t-am-m0-bgn",
      "_source": "am",
      "_target": "am_Latn",
      "_direction": "forward"
    }
  }
}

But in most other files it is just "transform".

I agree that it would be nice to name the files consistently with either the alias name or the BCP-47 name but not a mix as is currently in the branch.

@robertbastian
Member

I'm not a fan of the nesting in the JSON files.

{
  "transforms": {
    "BGN": {
      "_value": "Arabic-Latin-BGN.txt",
      "_visibility": "external",
      "_alias": "Arabic-Latin/BGN ar-Latn-t-ar-m0-bgn",
      "_source": "ar",
      "_target": "ar_Latn",
      "_direction": "forward"
    }
  }
}

could be represented as

{
  "_value": "Arabic-Latin-BGN.txt",
  "_visibility": "external",
  "_alias": "Arabic-Latin/BGN ar-Latn-t-ar-m0-bgn",
  "_source": "ar",
  "_target": "ar_Latn",
  "_variant": "BGN",
  "_direction": "forward"
}

As far as I can tell no JSON file will have multiple values in the transforms map.
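
The flattening proposed in this comment can be sketched as a small helper. This is only an illustration: `flatten_transform` is a hypothetical name, and it assumes each metadata file holds exactly one entry under `transforms`, as the comment suggests.

```python
def flatten_transform(doc: dict) -> dict:
    """Hoist the single nested entry to the top level (hypothetical helper).

    Assumes the nested shape shown in this comment:
    {"transforms": {"<variant>": {...fields...}}}
    """
    entries = doc["transforms"]
    if len(entries) != 1:
        # The proposal relies on there being exactly one entry per file.
        raise ValueError(f"expected exactly one transform, found {len(entries)}")
    (variant, fields), = entries.items()
    flat = dict(fields)          # copy so the input is not mutated
    flat["_variant"] = variant   # keep the former second-level key as a field
    return flat


nested = {
    "transforms": {
        "BGN": {
            "_value": "Arabic-Latin-BGN.txt",
            "_source": "ar",
            "_target": "ar_Latn",
            "_direction": "forward",
        }
    }
}
flat = flatten_transform(nested)
```

The former second-level key survives as `_variant`, so nothing is lost when the wrapper objects are dropped.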

@macchiati
Member

macchiati commented Sep 18, 2024 via email

@srl295
Member Author

srl295 commented Sep 19, 2024

What is the name of the second level key? In Amharic-Latin-BGN.json it is BGN

That's a bug. Will fix.

{
  "transforms": {
    "BGN": {
      "_value": "Amharic-Latin-BGN.txt",
      "_visibility": "external",
      "_alias": "Amharic-Latin/BGN am-Latn-t-am-m0-bgn",
      "_source": "am",
      "_target": "am_Latn",
      "_direction": "forward"
    }
  }
}

But in most other files it is just "transform".

I agree that it would be nice to name the files consistently with either the alias name or the BCP-47 name but not a mix as is currently in the branch.

The files are named by the id, which is neither the bcp47 nor the alias name, as explained.

@srl295
Member Author

srl295 commented Sep 19, 2024

Agh. Another bug. _value is supposed to be _rulesFile.

@srl295
Member Author

srl295 commented Sep 19, 2024

I'll change it to:

{
  "transform":  { 
       "_source": "am",
       ...
  }
}

All of the JSON files have similarly duck-typed content.

@srl295 srl295 marked this pull request as draft September 19, 2024 05:10
- hoist json content up 2 levels
- fix 'BGN' in path
@srl295 srl295 marked this pull request as ready for review September 19, 2024 20:53
@srl295
Member Author

srl295 commented Sep 19, 2024

@srl295 srl295 dismissed macchiati’s stale review September 19, 2024 20:54

addressed issues

@sffc
Member

sffc commented Sep 19, 2024

I still don't understand why half of these are identified by BCP-47 and half are identified by their alias.

Armenian-Latin-BGN.json
{
  "_visibility": "external",
  "_alias": "Armenian-Latin/BGN hy-Latn-t-hy-m0-bgn",
  "_source": "hy",
  "_target": "hy_Latn",
  "_direction": "forward",
  "_rulesFile": "Armenian-Latin-BGN.txt"
}

am-Ethi-t-am-brai.json
{
  "_backwardAlias": "Braille-Ethiopic/Amharic am-Ethi-t-am-brai",
  "_visibility": "external",
  "_alias": "Ethiopic-Braille/Amharic am-Brai-t-am-ethi",
  "_source": "am_Ethi",
  "_target": "am_Brai",
  "_direction": "both",
  "_rulesFile": "am-Ethi-t-am-brai.txt"
}

@srl295
Member Author

srl295 commented Sep 19, 2024

I still don't understand why half of these are identified by BCP-47 and half are identified by their alias.

Armenian-Latin-BGN.json
{
  "_visibility": "external",
  "_alias": "Armenian-Latin/BGN hy-Latn-t-hy-m0-bgn",
  "_source": "hy",
  "_target": "hy_Latn",
  "_direction": "forward",
  "_rulesFile": "Armenian-Latin-BGN.txt"
}

am-Ethi-t-am-brai.json
{
  "_backwardAlias": "Braille-Ethiopic/Amharic am-Ethi-t-am-brai",
  "_visibility": "external",
  "_alias": "Ethiopic-Braille/Amharic am-Brai-t-am-ethi",
  "_source": "am_Ethi",
  "_target": "am_Brai",
  "_direction": "both",
  "_rulesFile": "am-Ethi-t-am-brai.txt"
}

that's how they are identified in the source data.

@macchiati
Member

In CLDR, the BCP47 version of the ID is in the _alias list (resp. _backwardAlias).
Those alias lists also contain the non-BCP47 alias.

In JSON, we could filter these to break:

  "_backwardAlias": "Braille-Ethiopic/Amharic am-Ethi-t-am-brai",

  "_alias": "Ethiopic-Braille/Amharic am-Brai-t-am-ethi",

into

  "_backwardAlias": "Braille-Ethiopic/Amharic",
  "_backwardAliasBcp47": "am-Ethi-t-am-brai",

  "_alias": "Ethiopic-Braille/Amharic",
  "_aliasBcp47": "am-Brai-t-am-ethi",

Even better would be for us to do this in the XML source, but that's not something we could do in v46

@srl295
Member Author

srl295 commented Sep 19, 2024

@macchiati filter how? How do I know which alias is which?

Maybe we should go with this format, and we can add bcp47?

@macchiati
Member

macchiati commented Sep 19, 2024 via email

@srl295
Member Author

srl295 commented Sep 19, 2024

I think it is sufficient to parse with a strict BCP47 parser. If it succeeds without error, it is BCP47, otherwise legacy.

seems a bit imprecise/mixed but OK

@macchiati
Member

It will work, but as I wrote earlier we should add better structure
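
The alias split discussed above might be sketched as follows. This is not the code used in the PR, and it is not a strict BCP47 parser: it is a heuristic that assumes every BCP47 transform ID starts with a 2-3 letter language subtag and carries the `-t-` extension, as all the examples in this thread do. The function names are hypothetical.

```python
import re

# A subtag is 1-8 ASCII letters or digits; "/" and "_" never appear in one.
_SUBTAG = re.compile(r"[A-Za-z0-9]{1,8}")
_LANGUAGE = re.compile(r"[A-Za-z]{2,3}")


def looks_like_bcp47_transform_id(alias: str) -> bool:
    """Heuristic check (an assumption, not a strict BCP47 validation)."""
    subtags = alias.split("-")
    if not all(_SUBTAG.fullmatch(s) for s in subtags):
        return False
    if not _LANGUAGE.fullmatch(subtags[0]):
        return False
    # Require the "t" (transform) extension singleton after the language.
    return "t" in (s.lower() for s in subtags[1:])


def split_aliases(raw: str) -> tuple[list[str], list[str]]:
    """Split a space-separated alias list into (legacy, bcp47) lists."""
    legacy, bcp47 = [], []
    for alias in raw.split():
        (bcp47 if looks_like_bcp47_transform_id(alias) else legacy).append(alias)
    return legacy, bcp47


legacy, bcp47 = split_aliases("Braille-Ethiopic/Amharic am-Ethi-t-am-brai")
# legacy == ["Braille-Ethiopic/Amharic"], bcp47 == ["am-Ethi-t-am-brai"]
```

Requiring the `t` singleton is what keeps a legacy ID such as `Any-Latin` out of the BCP47 list, since on subtag shape alone it would pass.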

@srl295
Member Author

srl295 commented Sep 20, 2024

Can you recommend a parser and options?

- split bcp47 and non-bcp47 aliases.
@srl295
Member Author

srl295 commented Sep 21, 2024

OK, try the latest, same review link.

Attributes are only present if non-empty.

{
  "_backwardAlias": "Latin-Ethiopic/Tekie_Alibekit",
  "_visibility": "external",
  "_backwardAliasBcp47": "byn-Ethi-t-byn-latn-m0-tekieali",
  "_alias": "Ethiopic-Latin/Tekie_Alibekit",
  "_aliasBcp47": "byn-Latn-t-byn-ethi-m0-tekieali",
  "_source": "byn_Ethi",
  "_direction": "both",
  "_target": "byn_Latn",
  "_rulesFile": "byn-Ethi-t-byn-latn-m0-tekie-alibekit.txt"
}
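
With this flat layout, the lookup flow described earlier in the thread (manifest, then per-id metadata, then `_rulesFile`) might be read as in the sketch below. It assumes the package layout from the PR description (`transforms.json` at the top level, `transforms/ID.json` next to the rules files) and that `_rulesFile` is relative to the metadata file's directory. `load_transforms` is a hypothetical name, not part of any CLDR tooling.

```python
import json
from pathlib import Path


def load_transforms(package_dir: str) -> dict:
    """Return {id: (metadata, rules_text)} for every id in the manifest.

    Assumed layout (from the PR description, not verified against a release):
      <package_dir>/transforms.json         manifest listing the available ids
      <package_dir>/transforms/<id>.json    flat metadata, includes _rulesFile
      <package_dir>/transforms/<rulesFile>  raw transform rules
    """
    root = Path(package_dir)
    manifest = json.loads((root / "transforms.json").read_text(encoding="utf-8"))
    loaded = {}
    for transform_id in manifest["transforms"]["available"]:
        meta_path = root / "transforms" / f"{transform_id}.json"
        meta = json.loads(meta_path.read_text(encoding="utf-8"))
        # Follow _rulesFile instead of deriving the name from the id,
        # so the id and the file name are free to differ.
        rules_path = meta_path.parent / meta["_rulesFile"]
        loaded[transform_id] = (meta, rules_path.read_text(encoding="utf-8"))
    return loaded
```

Because the client follows `_rulesFile` rather than guessing a file name, it never has to crawl the directory or parse file names, which is the design goal stated earlier in the thread.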

Member

@macchiati macchiati left a comment

Respot-checked data, looks good to me.

@srl295 srl295 merged commit 8a22f67 into unicode-org:main Sep 21, 2024
12 of 13 checks passed
@srl295 srl295 deleted the cldr-16720/json-xlit branch September 21, 2024 21:05
conradarcturus pushed a commit that referenced this pull request Sep 25, 2024
4 participants