diff --git a/eeps/eep-0068.md b/eeps/eep-0068.md new file mode 100644 index 0000000..3453a6a --- /dev/null +++ b/eeps/eep-0068.md @@ -0,0 +1,387 @@ + Author: Michał Muskała + Status: Draft + Type: Standards Track + Created: 12-02-2024 + Erlang-Version: + Post-History: +**** +EEP 68: JSON library +---- + +Abstract +======== + +This EEP proposes introducing a module `json` to the Erlang standard +library with support for encoding and decoding [JSON][1] documents +from and to Erlang data structures. The main reason is to cover +a gap in the Erlang standard library with regards to such a vastly +popular and widespread data format. + +Rationale +========= + +JSON is commonly used in many different use-cases: + +* by web services as a lightweight and human-readable data interchange format; +* as a configuration language in static files; +* as data interchange format by developer tooling; +* and more. + +There are many existing JSON libraries for Erlang and other BEAM languages, +however adding such a support to standard library would offer unique benefits. +Most notably being able to use it in situations where leveraging third-party +libraries is complex or cumbersome -- such as stand-alone escripts or +fundamental tooling like a build system, or inside OTP itself. + +There have been previous attempts to bring JSON support into OTP, most notably +[EEP 18][EEP], which ultimately weren't adopted previously for various reasons. +However, I believe the time is right to revisit this subject with a fresh +take on an interface such support could take. + +JSON is a well defined format specified in parallel in [RFC 8259][RFC] and +[ECMA 404][ECMA], however how this representation should be translated +into Erlang is not fully clear since the data structures don't present +a direct, 1:1 mapping. To help with this, this EEP proposes an interface +that presents both a convenient and "canonical" simple API, as well +as an extensible and highly-customisable API with common underlying +implementation. + +This EEP proposes a JSON library which: + +* should be easy to adopt in large codebases using one of the popular, + existing, open-source JSON libraries; +* will allow the existing open-source libraries with custom features + (like support for Elixir protocols) to become thin wrappers around + this library; +* will improve, or at least not regress, performance compared to + leading open-source JSON libraries. + +The proposed JSON library will provide: + +* JSON encoding, allowing for single-pass encoding of custom data types –- + in particular, for Elixir, integrating with a protocol through a thin layer + (implemented outside of OTP); +* JSON decoding with some streaming support allowing to decode messages that + don't fully fit into memory; +* JSON decoding with support for decoding values split across separate + messages without fully concatenating them upfront; +* focus on high-performance encoding and decoding; +* full conformance to [RFC 8259][RFC] and [ECMA 404][ECMA] standards, + the decoder should pass the entire [JSONTestSuite][JSONTestSuite]; +* simple API for common use-cases with canonical data type mapping. + +Design choices +============== + +Data mapping +------------ + +We propose, in the "canonical" API to map JSON data structues to +Erlang and back in the following way: + +| **Decoding from JSON** | **Erlang** | **Encoding into JSON** | +|------------------------|----------------------|------------------------| +| Number | integer() \| float() | Number | +| Boolean | true \| false | Boolean | +| Null | null | Null | +| String | binary() | String | +| | atom() | String | +| Array | list() | Array | +| Object | #{binary() => _} | Object | +| | #{atom() => _} | Object | +| | #{integer() => _} | Object | + +Erlang has generally a richer value system than JSON, therefore +there's generally more types that can be encoded into JSON, +even if they can never be produced directly by the decoder. + +However, with the flexible API, as demonstrated below, the user will +be able to customize the decoding & encoding routines to produce and +consume any Erlang term as necessary in the particular application. + +**Note**: A decode-encode rountrip might not produce the same data, +even with custom decoders -- since JSON has such a limited data-type +options, compared to Erlang, some information will be commonly be lost, +for example, coercing all keys in maps to binaries. + +Streaming vs value-based parser +------------------------------- + +When it comes to data-structure parsers it's common to encounter two +types: ones that given the data produce a complete parsed value, +and others the same data produce a stream of events that can later +be processed to extract values. + +The first kind, which we'll call here value-based, is generally simpler, +usually more efficient, and more convenient to use. The second one offers +unique advantages in specific use-cases: for example, where data +can't fully fit into memory. + +For the proposed `json` library this EEP suggests a hybrid approach. + +First, a simple, value-based API: + + -type value() :: + integer() | + float() | + boolean() | + null | + binary() | + list(value()) | + #{binary() => value()}. + + -spec decode(binary()) -> value(). + +Error handling is achieved through exceptions. The following errors +are possible: + + -type error() :: + unexpected_end | + {unexpected_sequence, binary()} | + {invalid_byte, byte()} + +The exceptions might be enhanced through the [Error Info][ERRINFO] mechanism +with additional meta-data like byte offset where the error occurred. + +For the advanced and customizable API, this EEP proposes a callback-based +API that the decoder will use to produce values from the data it parses. + + -type from_binary_fun() :: fun((binary()) -> dynamic()). + -type array_start_fun() :: fun((Acc :: dynamic()) -> ArrayAcc :: dynamic()). + -type array_push_fun() :: fun((Value :: dynamic(), Acc :: dynamic()) -> NewAcc :: dynamic()). + -type array_finish_fun() :: fun((ArrayAcc :: dynamic(), OldAcc :: dynamic()) -> {dynamic(), Acc :: dynamic()}). + -type object_start_fun() :: fun((Acc :: dynamic()) -> ObjectAcc :: dynamic()). + -type object_push_fun() :: fun((Key :: dynamic(), Value :: dynamic(), Acc :: dynamic()) -> NewAcc :: dynamic()). + -type object_finish_fun() :: fun((ObjectAcc :: dynamic(), OldAcc :: dynamic()) -> {dynamic(), Acc :: dynamic()}). + + -type decoders() :: #{ + array_start => array_start_fun(), + array_push => array_push_fun(), + array_finish => array_finish_fun(), + object_start => object_start_fun(), + object_push => object_push_fun(), + object_finish => object_finish_fun(), + float => from_binary_fun(), + integer => from_binary_fun(), + string => from_binary_fun(), + null => term() + }. + + -spec decode(binary(), Acc :: dynamic(), decoders()) -> + {Value :: dynamic(), FinalAcc :: dynamic(), Rest :: binary()}. + +This allows the user to fully customize the decoded format, including +features seen in open-source JSON libraries: + +* decoding string keys as atoms; +* decoding objects as lists of pairs; +* decoding floats as custom structures with decimal precision; +* decoding `null` as another atom, in particular `undefined` or `nil`; +* using `binary:copy/1` on strings that will be retained in memory; +* decoding multiple JSON messages from a single binary blob; +* and more. + +Furthermore, this allows the user to only retain parts of the data structure +to achieve results similar to using a streaming SAX-like parser for data +that doesn't fully fit into memory. + +The `array_finish` and `object_finish` callbacks are responsible for +restoring the accumulator to continue processing the parent object. +To simplify the case where accumulators are not connected, these +callbacks receive value of the accumulator that was passed to the +corresponding `_start` call. + +All the callbacks are optional and have a default value corresponding to the +"simple" API behaviour, using lists as accumulators, in particular: + +* for `array_start`: `fun(_) -> [] end` +* for `array_push`: `fun(Elem, Acc) -> [Elem | Acc] end` +* for `array_finish`: `fun(Acc, OldAcc) -> {lists:reverse(Acc), OldAcc} end` +* for `object_start`: `fun(_) -> [] end` +* for `object_push`: `fun(Key, Value, Acc) -> [{Key, Value} | Acc] end` +* for `object_finish`: `fun(Acc, OldAcc) -> {maps:from_list(Acc), OldAcc} end` +* for `float`: `fun erlang:binary_to_float/1` +* for `integer`: `fun erlang:binary_to_integer/1` +* for `string`: `fun (Value) -> Value end` +* for `null`: the atom `null` + +Incomplete data parsing +----------------------- + +We propose a future enhancement to the full `decode/3` API, where +it can return an `{incomplete, continuation()}` value that can be used to +decode values split across multiple binary blobs (for example as received +from a TCP socket). + + -spec decode_continue(binary(), continuation()) -> + {Value :: dynamic(), FinalAcc :: dynamic(), Rest :: binary()} | + {incomplete, continuation()}. + +Encoding API +------------ + +For encoding this EEP again proposes two separate sets of APIs. +A simple API using "canonical" data types: + + -type encode_value() :: + integer() | + float() | + boolean() | + null | + binary() | + atom() | + list(encode_value()) | + #{binary() | atom() | integer() => encode_value()}. + + -spec encode(encode_value()) -> iodata(). + +And an advanced, callback-based API allowing for single-pass encoding +of custom data structures. This API is accompanied by a set of functions +facilitating the implementation of custom encoding callbacks. + + -type encoder() :: fun((dynamic(), encoder()) -> iodata()). + + -spec encode(dynamic(), encoder()) -> iodata(). + + -spec encode_value(dynamic(), encoder()) -> iodata(). + -spec encode_atom(atom(), encoder()) -> iodata(). + -spec encode_integer(integer()) -> iodata(). + -spec encode_float(float()) -> iodata(). + -spec encode_list(list(), encoder()) -> iodata(). + -spec encode_map(map(), encoder()) -> iodata(). + -spec encode_map_checked(map(), encoder()) -> iodata(). + -spec encode_key_value_list([{dynamic(), dynamic()}], encoder()) -> iodata(). + -spec encode_key_value_list_checked([{dynamic(), dynamic()}], encoder()) -> iodata(). + -spec encode_binary(binary()) -> iodata(). + -spec encode_binary_escape_all(binary()) -> iodata(). + +The `encoder()` callback is invoked on every value during traversal. +The simple API specified above is equivalent to using the +`fun json:encode_value/2` function as the encoder. + +The `*_checked/2` variants of functions offer verifying the encoder +doesn't produce repeated keys. +The default `encode_binary/1` function will emit unescaped unicode values +as allowed by the specifications; however for compatibility reasons +we provide the optional `encode_binary_escape_all/1` function +that will always produce purely ASCII messages encoding all higher +unicode values with the `\u` escape sequences. + +Formatting and pretty-printing +------------------------------ + +This EEP further proposes an additional API for formatting (and pretty-printing) +JSON messages. This API consists of transforming a textual JSON message into +a formatted JSON message. +This is the most flexible solution that orthogonally supports +formatting results of custom encoding functions like described above, +without adding the burden of complex formatting options in the middle of the +encoders. +Formatting isn't usually done in critical hot-paths of high-performance +services, therefore the overhead of a two-pass formatting is deemed acceptable. + + -type format_option() :: #{ + indent => iodata(), + line_separator => iodata(), + after_colon => iodata() + }. + -spec format(iodata()) -> iodata(). + -spec format(iodata(), format_option()) -> iodata(). + +Reference Implementation +======================== + +[PR-8111][PR] Implements the `encode/1`, `encode/2`, `decode/1`, and `decode/3` +functions as proposed in this EEP. +The formatting API and the support for incomplete message decoding is left +as a follow-up task. + +Appendix +======== + +Example of a decoding trace +--------------------------- + +Given the following data: + + {"a": [[], {}, true, false, null, {"foo": "baz"}], "b": [1, 2.0, "three"]} + +the decoding APIs will be called with the following arguments: + + object_start(Acc0) => Acc1 + string(<<"a">>) => Str1 + array_start(Acc1) => Acc2 + empty_array() => Arr1 + array_push(Acc2, Arr1) => Acc3 + empty_object() => Obj1 + array_push(Obj1, Acc3) => Acc4 + array_push(true, Acc4) => Acc5 + array_push(false, Acc5) => Acc6 + null() => Null + array_push(Null, Acc6) => Acc7 + object_start(Acc7) => Acc8 + string(<<"foo">>) => Str2 + string(<<"baz">>) => Str3 + object_push(Str2, Str3, Acc8) => Acc9 + object_finish(Acc9) => Obj2 + array_push(Obj2, Acc7) => Acc10 + array_finish(Acc10, Acc1) => {Arr1, Acc11} + object_push(Arr1, Acc11) => Acc12 + string(<<"b">>) => Str4 + array_start(Acc12) => Acc13 + integer(<<"1">>) => Int1 + array_push(Int1, Acc13) => Acc14 + float(<<"2.0">>) => Float1 + array_push(Float1, Acc14) => Acc15 + string(<<"three">>) => Str5 + array_push(Str5, Acc15) => Acc16 + array_finish(Acc16, Acc12) => {Arr2, Acc17} + object_push(Str4, Arr2, Acc17) => Acc18 + object_finish(Acc18, Acc0) => {Obj3, Acc19} + % final decode/3 return + {Obj3, Acc19, <<"">>} + +Example of a custom encoder +--------------------------- + +An example of a custom encoder that would support using a heuristic +to differentiate pairs of object-like key-value lists from plain +lists of values could look as follows: + + custom_encode(Value) -> json:encode(Value, fun encoder/2). + + encoder([{_, _} | _] = Value, Encode) -> json:encode_key_value_list(Value, Encode); + encoder(Other, Encode) -> json:encode_value(Other, Encode). + +Another encoder that supports using Elixir `nil` as Null and protocols for +further customisation could look as follows: + + encoder(nil, _Encode) -> <<"null">>; + encoder(null, _Encode) -> <<"\"null\"">>; + encoder(#{__struct__ => _} = Struct, Encode) -> 'Elixir.JSONProtocol':encode(Struct, Encode); + encoder(Other, Encode) -> json:encode_value(Other, Encode). + +[1]: https://www.json.org/json-en.html + "Introducing JSON" + +[RFC]: https://datatracker.ietf.org/doc/html/rfc8259 + "The JavaScript Object Notation (JSON) Data Interchange Format" + +[ECMA]: https://ecma-international.org/publications-and-standards/standards/ecma-404/ + "The JSON data interchange syntax" + +[EEP]: https://github.com/erlang/eep/blob/master/eeps/eep-0018.md + "EEP 18: JSON bifs" + +[ERRINFO]: https://github.com/erlang/eep/blob/master/eeps/eep-0054.md + "EEP 54: Provide more information about errors" + +[JSONTestSuite]: https://github.com/nst/JSONTestSuite + +[PR]: https://github.com/erlang/otp/pull/8111 + +Copyright +========= + +This document is placed in the public domain or under the CC0-1.0-Universal +license, whichever is more permissive.