No UTF32 Characters in the Regex for Strings #362
Comments
@sebbader-sap we transpiled the patterns into UTF-16 since most JSON schema engines we tested operated on UTF-16 and could not handle UTF-32. It is a trade-off between correctness and practicality -- if we put UTF-32 in JSON schema (e.g., the pattern you mentioned:
You can test it online. The first answer on Google for "JSON schema validator" for me is https://www.jsonschemavalidator.net/. This validator does support UTF-32:
{
  "$schema": "https://json-schema.org/draft/2019-09/schema",
  "title": "AssetAdministrationShellEnvironment",
  "type": "string",
  "pattern": "^[\\x09\\x0A\\x0D\\x20-\\uD7FF\\uE000-\\uFFFD\\U00010000-\\U0010FFFF]*$"
}
The following value passes:
This is not a big change in aas-core-codegen, so whatever the decision, it shouldn't be hard to fix. |
Please consider also the SHACL -- I think the same issue appears there as well. |
The original pattern from the constraint has the same problems with OpenAPI-based validators, as they usually translate the YAML into JSON Schema --> then using the same JSON Schema Validation libraries with the same UTF-32 problems. I am uncertain how to proceed now.
Which then means that my actually used OpenAPI file is not string-equals to the IDTA published one anymore... |
@BirgitBoss I think we need a formal decision for all parts. Either way, the Part 2 Domain must go the same way as the Part 1 Domain & the schemas. |
Because some clarification was needed in the taskforce:
^: Asserts the start of the string.
[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]: Defines a character class that allows various Unicode characters.
\x09: ASCII horizontal tab.
$: Asserts the end of the string. |
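As a rough illustration of how this character class behaves on an engine that works with full code points, here is a minimal Python sketch. The names `AASD_130` and `is_xml_representable` are made up for this example. Python's `re` needs the eight-digit `\U` escape for code points above U+FFFF, so the last range is written that way, as in the JSON schema example earlier in this thread.

```python
import re

# Hypothetical name; the character class follows the ranges discussed above.
# Python requires \UXXXXXXXX (eight hex digits) for code points above U+FFFF.
AASD_130 = re.compile(
    r"[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]*"
)


def is_xml_representable(text: str) -> bool:
    """Return True if every code point of `text` lies in the allowed ranges."""
    # fullmatch anchors at both ends, so ^ and $ are not needed here.
    return AASD_130.fullmatch(text) is not None


print(is_xml_representable("tab\there"))    # True: \x09 is allowed
print(is_xml_representable("\U0001F600"))   # True: inside U+10000..U+10FFFF
print(is_xml_representable("\x00"))         # False: NUL is outside all ranges
```

Other engines may spell the escapes differently, so this only shows the intended semantics, not a portable pattern.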
Maybe we should change the Constraint AASd-130 in the following way:
|
I think that the most important bit is missing: this constraint is required, so that the text can be represented in XML. For example, you can not represent |
Here is a short example as illustration. The smiley "😀" is represented as character code 128512. This code is encoded in UTF-32 as "\U0001F600". In UTF-16, this is encoded as "\ud83d\ude00". Hence, if you want to match this smiley with a regex engine that uses UTF-16, you have to write the pattern "\ud83d\ude00" even though it is a single character. If your regex engine operates on UTF-32, you can simply write "\U0001F600". |
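The arithmetic behind the smiley example can be checked with a few lines of Python. This is only a sketch of the standard UTF-16 surrogate-pair computation; the helper name `to_surrogate_pair` is invented for illustration.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point into its UTF-16 high/low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # upper 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # lower 10 bits of the offset
    return high, low


smiley = "\U0001F600"
print(ord(smiley))                                      # 128512
high, low = to_surrogate_pair(ord(smiley))
print(f"\\u{high:04x}\\u{low:04x}")                     # \ud83d\ude00
# Cross-check against Python's own UTF-16 encoder.
print(smiley.encode("utf-16-be").hex())                 # d83dde00
```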
For example, this schema works on an on-line JSON schema tester:
This matches "😀" on https://www.jsonschemavalidator.net/. The smiley does not test with:
|
Update from the latest state of Part 1 V3.1.0: Description for AASd-130 is already extended: aas-specs/documentation/IDTA-01001/modules/ROOT/pages/Spec/IDTA-01001_Metamodel_Constraints.adoc Line 73 in c9d6c3b
|
Proposal from a meeting of us (@mristin, @g1zzm0, and myself):
Therefore, the following activities are needed:
|
Side-effect: Example:
|
Decision Proposal TF Metamodel AAS 2024-03-27 Change formulation of Constraint AASd-130 from
to Constraint AASd-130 ensures that encoding and interoperability between different serializations is possible. See https://www.w3.org/TR/xml/#charsets for more information on XML Schema 1.0 string handling.
@g1zzm0 : please check |
[...] This representation causes problems in the Swagger representation. How about this RegEx (see \ instead of \ before first ud7ff and before ue000 and ufffd at the beginning): @mristin could you please have a look at whether the regex we are using is really OK? Thank you! |
@BirgitBoss wrote:
There is no single standard syntax for regular expressions. It all depends on the engine that you plan to use and support. Best you fix the engine & test against it, and then also document somewhere why you picked that engine and not another one. Whatever engine you pick, the particular syntax will be incompatible with some other engine. |
But independent of the engine, having for some unicode characters one backslash (e.g. "\ud7ff") but for the others two (e.g. "\\ud800") in the same pattern seems pretty strange. |
Ah, I haven't even noticed that -- yes, it should be consistent. |
Thank you @sebbader-sap and @mristin for reviewing, so we change to to have a consistent way and it would also be supported by swagger. |
We document in the JSON and RDF schema `README` files that we deviate from the pattern in the specification of AASd-130, because most schema engines operate on UTF-16 rather than on the UTF-32 used in the specification. For the full discussion, refer to #362
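To make the deviation concrete, here is a hedged Python sketch of what a UTF-16-oriented variant of the pattern could look like. It is not necessarily the pattern that aas-core-codegen emits: `UTF16_PATTERN`, `utf16_units`, and `matches_on_utf16` are hypothetical names, and the UTF-16 engine is only simulated by turning each 16-bit code unit into its own character.

```python
import re

# Assumed UTF-16 formulation: the supplementary range U+10000..U+10FFFF is
# expressed as a high surrogate followed by a low surrogate.
UTF16_PATTERN = re.compile(
    r"(?:[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD]"
    r"|[\uD800-\uDBFF][\uDC00-\uDFFF])*"
)


def utf16_units(text: str) -> str:
    """Re-express `text` as one character per UTF-16 code unit."""
    data = text.encode("utf-16-be", "surrogatepass")
    return "".join(
        chr(int.from_bytes(data[i:i + 2], "big"))
        for i in range(0, len(data), 2)
    )


def matches_on_utf16(text: str) -> bool:
    """Apply the UTF-16 style pattern to the code units of `text`."""
    return UTF16_PATTERN.fullmatch(utf16_units(text)) is not None


print(matches_on_utf16("\U0001F600"))  # True: seen as the pair \ud83d\ude00
print(matches_on_utf16("\x00"))        # False: NUL is still rejected
```

Under these assumptions the two formulations accept the same strings; they differ only in how code points above U+FFFF are spelled.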
Workstream AAS Specs |
Describe the bug
The regex pattern in the JSON Schema has only UTF-16 characters, while constraint AASd-130 demands the following:
^[\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\u00010000-\u0010FFFF]*$
Where
JSON Schema, e.g., https://github.com/admin-shell-io/aas-specs/blob/2ab08f92bdd1d44edc1cfee52552fe5429d2178e/schemas/json/aas.json#L44C22-L44C36
Additional context
Needs to be adopted in the SwaggerHub Domains for Part 1 and Part 2.