Unified API draft #1
Conversation
- Added a short introduction of this repository
- More updates to come later
Hi Kenneth and Uli. First of all, thanks a lot for reviewing it :) I am adding a new patch to the existing PR accommodating all the changes we agreed on here. Please give it a look. Thanks 👍
Reviewing this in light of what we agreed upon at the plenary on Oct 15th.
Not being fluent in C++, I could only confirm that the following parts are included in this draft:
First API draft: A Fast translation endpoint
- INPUT: batch of plain (i.e. markup already removed) text blobs in UTF-8
- OUTPUT: batch of objects
- Result objects:
- {
- plain (i.e. markup not reinserted) translated text blob in UTF-8
- byteranges for alignments
- byteranges for sentence splits for original text blob
- byteranges for sentence splits for translated text blob
- }
- Something similar to StringPiece for annotation of ranges
- Quality estimation optional off by default
Not to be included in the first API draft:
- Domain adaptation is not supported in first draft
- Tags/Emojis are not handled in first draft (should be removed from API requests)
- We add more expensive “backfill” flags, which can give us QEs and N-Best for individual sentences if required
- First draft will not specifically cater to the currently yet-to-be-defined requirements related to Outbound Translation
@abhi-agg @kpu Could you confirm that std::string_view corresponds to something like StringPiece, and that I, as a WASM consumer of the API, can use it to split up the original and translated text blobs into the same sentences that were sent to marian for translation?
These are already covered in this PR.
std::string_view is a C++17 feature that resembles StringPiece for annotation of ranges.
I believe this boils down to whether the LLVM compiler used by Emscripten can compile C++ code that uses a C++17 feature to WASM or not. To my knowledge, it can. See this: emscripten-core/emscripten#5513
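To illustrate the point, here is a minimal sketch (the snippet and its variable names are mine, not part of the draft) of how std::string_view can annotate sentence byte ranges of a UTF-8 blob without copying it:

#include <string>
#include <string_view>
#include <vector>

int main() {
  // UTF-8 text blob owned by the translation result (or by the caller).
  std::string blob = "How are you? Bring your computer.";

  // Sentence annotations as views into blob's memory; no copies are made.
  std::vector<std::string_view> sentences = {
    std::string_view(blob.data(), 12),       // "How are you?"
    std::string_view(blob.data() + 13, 20),  // "Bring your computer."
  };

  // The views stay valid only as long as blob is alive and not reallocated.
  return 0;
}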
- Added examples for Sentence mapping and Alignment
The API splits each text entry into sentences internally, which are then translated independently of each other. The translated sentences are then joined together and returned in the TranslationResult.
Please refer to the TranslationRequest class to find out what additional information can be requested. The alignment information can only be requested if the model supports it (check the isAlignmentSupported() API).
*/
virtual std::vector<std::future<TranslationResult>> translate(std::vector<std::string> texts, TranslationRequest request) = 0;
Still has the same bug: should be one std::future<std::vector<TranslationResult>>.
Also, std::vector<std::string> &.
Comment that the std::string will be stolen and moved to the result.
> Still has the same bug: should be one std::future<std::vector<TranslationResult>>.

Having a separate std::future for each text might give some performance benefit. Sure, in the current implementation it might not give us a sufficient gain, but a separate future for each text entry makes sense, as the API should be agnostic of the implementation details of marian.

> Also, std::vector<std::string> &

Didn't we agree in the previous review that the API should take ownership of texts in this API call? Passing a reference leaves open the possibility of texts being modified by consumers of the API.
> Comment that the std::string will be stolen and moved to the result.

This comment would not make sense if I am passing a copy of texts and not a reference.
The efficient semantics are that texts moves into translate, which can do what it wants with it (and will move it into the TranslationRequest).
I think the most C++ way of doing this is

std::future<std::vector<TranslationResult>> translate(std::vector<std::string> && texts, TranslationRequest request);

and the caller does translate(std::move(texts), request).
You were the one saying we don't want to copy the input too much...
If you want to design a correct API for incremental returns that would do sentences at a time, you can do that later. It doesn't look like std::vector<std::future<TranslationResult>>. It looks like: this span translates as this span, in undefined order.
There is no efficient toolkit that would take a batch of requests and return them in the original order. It will require more work for you and for us to implement a less efficient API that nobody will support.
I don't want the caller to retain write access to those std::string. They are not shared. They are transferred. I realize I can't prevent writing to them. Cough on a std::string and the base pointer changes and all the std::string_view are invalid.
Which is why:

std::future<std::vector<TranslationResult>> translate(std::vector<std::string> && texts, TranslationRequest request) {
  std::vector<std::string> actualTexts = std::move(texts);
  // ...
}

And we just define that the passed vector is empty at the end of the translate call.
I still very much like the idea of using shared pointers. How about something along these lines:

std::future<std::vector<TranslationResult>> translate(std::vector<std::unique_ptr<std::string>>&& texts, ...) {
  std::vector<std::shared_ptr<const std::string>> actualTexts(texts.size());
  // notice that actualTexts are shared pointers to **const** strings!
  for (size_t i = 0; i < texts.size(); ++i) {
    actualTexts[i] = std::move(texts[i]);
  }
  ...
}

That way you get what you want (immutability) and I get what I want (shared pointers to make sure the strings are there as long as I need them).
Dude, they're moving into the response. The string_view classes depend on them. They will exist. I hate unnecessary malloc calls.
How many more malloc calls does my proposal involve?
For now, I will change the API from

std::vector<std::future<TranslationResult>> translate(std::vector<std::string> texts, TranslationRequest request);

to

std::future<std::vector<TranslationResult>> translate(std::vector<std::string> && texts, TranslationRequest request);

It is a first version of the API. In any case, changing texts to support a move paradigm (&& texts) or making it a shared/unique pointer is never a big problem.
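For illustration only, a sketch of how a caller might use the revised signature (the service object and the example strings are assumptions, not part of the draft):

// Assume `service` implements the abstract translate() declared above.
std::vector<std::string> texts = {"How are you?", "Bring your computer."};
TranslationRequest request;  // defaults: no alignment, QE off

// The texts are moved into the API; after this call the vector is empty.
std::future<std::vector<TranslationResult>> future =
    service.translate(std::move(texts), request);

// Blocks until the whole batch has been translated.
std::vector<TranslationResult> results = future.get();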
AlignmentType alignmentType = AlignmentType::NONE;

// Optional request. A true/false value will include/exclude the original text in the TranslationResult. By default (false), the original text is not included.
bool includeOriginalText = false;
Again, since we are returning string_view, returning the underlying std::string is essentially required. There should be no option here.
Will remove it.
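A minimal sketch of the point made above (member names are illustrative, not taken from the draft): the string_view annotations are only valid if the TranslationResult itself owns the underlying strings.

struct TranslationResult {
  // Owned copies of the original and translated UTF-8 blobs.
  std::string originalText;
  std::string translatedText;

  // Sentence annotations referencing the strings above; if the strings
  // were not stored here, these views would dangle.
  std::vector<std::string_view> originalSentenceViews;
  std::vector<std::string_view> translatedSentenceViews;
};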
// ["you", "need"], // originalTextViews[1] = All sections of original text that align with translatedTextViews[1] i.e. "brauchst" | ||
// ["you"] // originalTextViews[2] = All sections of original text that align with translatedTextViews[2] i.e. "du" | ||
// ] | ||
std::vector<std::vector<std::string_view>> originalTextViews; |
All this matching of indices is overcomplicated compared to just having a std::vector of a struct, yes?
Also, efficiency nit that this will require tons of memory allocation with all these std::vector running around.
Better:

struct AlignmentArc {
  std::string_view sourceSpan;
  float weight;
};

class AlignmentAnnotation {
 public:
  const AlignmentArc *begin() const { return begin_; }
  const AlignmentArc *end() const { return end_; }
  AlignmentArc *begin() { return begin_; }
  AlignmentArc *end() { return end_; }
  std::string_view targetSpan() const { return targetSpan_; }
 private:
  AlignmentArc *begin_, *end_;
  std::string_view targetSpan_;
};

class Alignment {
 public:
  // Note MSVC gets annoying about converting iterators to pointers when compiled with debug, may need to change this.
  const AlignmentAnnotation *begin() { return &*annotations_.begin(); }
  const AlignmentAnnotation *end() { return &*annotations_.end(); }
  AlignmentAnnotation &Add(std::string_view target, std::size_t arcCount);
 private:
  pool<AlignmentArc> arc_pool_; // I've got one of these lying around, we all do.
  std::vector<AlignmentAnnotation> annotations_;
};
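A hedged usage sketch of this proposal, assuming the classes above have been fleshed out and an Alignment object has been populated via Add():

void InspectAlignment(Alignment &alignment) {
  for (const AlignmentAnnotation &annotation : alignment) {
    // annotation.targetSpan() is a span of the translated text.
    for (const AlignmentArc &arc : annotation) {
      // arc.sourceSpan is an aligned span of the original text,
      // arc.weight its (soft) alignment weight.
      (void)arc;
    }
  }
}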
const std::string sourceLanguageVocabPath;

// Path to the target vocabulary file to be used by the model
const std::string targetLanguageVocabPath;
If we're going to abstract this so much, then we might as well derive the vocab paths from the model path.
doc/Unified_API.md (outdated)
Alignment getAlignment() const;

/* Return whether the original text is requested to be included in the TranslationResult. */
bool isOriginalTextInResult() const;
The use of std::string_view is somewhat problematic because it's a C++17 feature but the core marian code base currently assumes C++11. Also, I'm in favor of text spans (offset and length) instead of string_view, because the latter is specific to C++. It's better to keep the API specs portable.
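A minimal sketch of what such a portable span could look like (the struct and field names are assumptions; the TextSpan name is borrowed from the QE discussion later in this thread):

#include <cstddef>

// A span of a UTF-8 blob expressed as byte offset and length, so it can be
// serialized to JSON (or any other wire format) without C++-specific types.
struct TextSpan {
  std::size_t offset;  // byte offset into the blob
  std::size_t length;  // length of the span in bytes
};
// e.g. serialized as {"offset": 13, "length": 20} in a JSON response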
SOFT = 0,
NONE,
}
What type of alignments do we want? Really soft alignments? No hard alignments?
We get soft alignments natively. As to whether the Mozilla people want those hardened in C++, I'll leave that up to them.
In March I was told that it makes more sense to return hard alignments. I implemented both in the REST API, so I don't really care. I return what I'm being asked for. Soft alignments just add a lot of JSON clutter. One thing to keep in mind is that responses through Native Messaging are capped at 1MB per response message (browser to co-app: 4GB).
public:
// ToDo: Public Methods
}
I do not really see a need for this class. A quality estimate associates target text spans with a number, so a std::vector<std::pair<TextSpan,float>> could be a direct data member of the TranslationResult class. We could have two such vectors, one for word-level estimates, one for sentence-level estimates.
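A sketch of that suggestion (assuming the TextSpan type discussed earlier; member names are illustrative, not from the draft):

struct TranslationResult {
  // ... translated text, sentence annotations, alignments ...

  // Word-level and sentence-level quality estimates over the translated text.
  // Both vectors are empty unless QE was requested.
  std::vector<std::pair<TextSpan, float>> wordQualityEstimates;
  std::vector<std::pair<TextSpan, float>> sentenceQualityEstimates;
};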
Agreed, having a class is overkill.
> I do not really see a need for this class. A quality estimate associates target text spans with a number, so a std::vector<std::pair<TextSpan,float>> could be a direct data member of the TranslationResult class. We could have two such vectors, one for word-level estimates, one for sentence-level estimates.

Is this really extendable? Imagine we agree on supporting document-level QE scores in the future. Would we add one more vector there for document-level QE scores? I believe this design is not extendable, and this is the reason for a separate class.
You're either adding to an enum or adding a member variable. I don't see how adding a member variable is any less extensible; we're not aiming for ABI compatibility here. And @ugermann 's design has the advantage that multiple things can be returned in the same query---including document and sentence-level QE.
> You're either adding to an enum or adding a member variable. I don't see how adding a member variable is any less extensible;

I would be adding to an enum as well as adding a member variable for the solution proposed by @ugermann.

> @ugermann's design has the advantage that multiple things can be returned in the same query---including document and sentence-level QE.

Didn't we agree that there is no use case for returning QE scores at multiple granularities in the same request? I fail to understand how this use case came into the picture again when you suggested I not implement unnecessary stuff. And by the way, extending this API to this use case only requires changing QualityScore to vector<QualityScore> in TranslationResult.
Somehow I can't reply in-thread to the above @ugermann comment. I don't see a problem with the wrapper being C++17. If we want, it could be downgraded to …
For context, @ugermann's comment that @kpu is replying to:
Personally, I am agnostic of the implementation as long as the result makes sense once it lands in JS land :)
@kpu: The problem with string_view is that the Marian Vocab class currently expects a … I tried setting the C++ standard to C++17 in Marian's CMakeLists.txt, but the default cmake on Ubuntu 16.04 can't handle that.
And once more: if we go native messaging, we'll need to talk JSON. How do you convert a string view into JSON if not by converting it into a byte range (offset and length)? In order to get things off the ground quickly, we should focus on JSON in, JSON out and discuss the API in those terms, i.e. structures of generic data types that are replicable in JSON.
As an aside: with the REST server we already have an API of sorts (however poorly documented). I don't see the benefit of replicating the work that's already been done (including implementation). There currently is no C++ class for the translation response because the code directly builds the JSON response from the raw search histories that it gets out of the search.
Right now the discussion seems to focus on: how can we redesign and reimplement Marian from scratch in WASM? I'm not sure that's the most productive way forward.
On that note: how do text ranges work when JSON is consumed by JavaScript? As far as I know, JSON uses UTF-8 by default, while JavaScript uses UTF-16. So if text ranges are defined in terms of bytes, we have a conversion problem, no?
@ugermann Conversion between … Does Marian even do encoding without SentencePiece? We're going to need to track byte ranges through the normalizer anyway, so there will be refactoring somewhere.
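For what it's worth, a sketch of the offset conversion being discussed: mapping a UTF-8 byte offset to the corresponding UTF-16 code-unit index that JavaScript string positions count. This helper is illustrative, not part of any existing API, and assumes well-formed UTF-8 input.

#include <cstddef>
#include <string_view>

std::size_t Utf8ByteToUtf16Index(std::string_view utf8, std::size_t byteOffset) {
  std::size_t utf16Index = 0;
  std::size_t i = 0;
  while (i < byteOffset && i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    if (lead < 0x80)      { i += 1; utf16Index += 1; }  // 1-byte sequence (ASCII)
    else if (lead < 0xE0) { i += 2; utf16Index += 1; }  // 2-byte sequence
    else if (lead < 0xF0) { i += 3; utf16Index += 1; }  // 3-byte sequence (BMP)
    else                  { i += 4; utf16Index += 2; }  // 4-byte sequence -> surrogate pair
  }
  return utf16Index;
}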
@kpu The last time I checked, the default compilation option for the official Marian master is still -DUSE_SENTENCEPIECE=off. Marian has three types of vocabularies: "default" (simple whitespace-based tokenization for people who want to do their own tokenization externally), "factored" ("new|uc york|uc" instead of "New York"), and SentencePiece. I once proposed to always compile with SentencePiece but was told that wasn't desirable. I did not bother to inquire about the reasoning behind that stance.
The draft seems to cover everything we agreed on at the Oct 15 meeting:
First API draft: A Fast translation endpoint
- INPUT: batch of plain (i.e. markup already removed) text blobs in UTF-8
- OUTPUT: batch of objects
- Result objects:
- {
- plain (i.e. markup not reinserted) translated text blob in UTF-8
- byteranges for alignments
- byteranges for sentence splits for original text blob
- byteranges for sentence splits for translated text blob
- }
- Something similar to StringPiece for annotation of ranges
- Quality estimation optional off by default
Not to be included in the first API draft:
- Domain adaptation is not supported in first draft
- Tags/Emojis are not handled in first draft (should be removed from API requests)
- We add more expensive “backfill” flags, which can give us QEs and N-Best for individual sentences if required
- First draft will not specifically cater to the currently yet-to-be-defined requirements related to Outbound Translation
@kpu @ugermann Are you ok with approving this draft as is (in all its imperfection)? We can return to tweaking the exact interfaces later when we have an implementation and some memory and throughput benchmarks to go with it. We may of course also adjust the interfaces as we implement, depending on what seems right in the moment. E.g., is it ok to move on to implementation roughly based on this API draft? :)
We're mostly there! Let's move to C++ as discussed and get some implementation that will inform it.
* CI Changes to add tiny regression tests
* Adding an inspect cache step
* Removing ccache, pursue in another
* Incorporating Nick's changes through submodule merge
* Submodule now points to master
* Restoring ccache enabled workflow file
* Restoring ccache enabled CMakeLists
* cache -> ccache typo fix
* Moving CCACHE setup to GitHub runner file
* Find also uses CCACHE dir
* Updating CMakeLists not to override env
* Cache compiler binary's contents
* Changing a few names to trigger new build; Testing cache looks fun
* USE_CCACHE=on, -L for inspection
* Adding a ccache_cmd, but will only use in next commit
* Using ccache_cmd
* Removing "
* Adding compiler hash script
* Bunch of absolute paths
* GITHUB_WORKSPACE typo
* Nah, I'll keep -L and trigger another build
* Trying something with compiler hash on cache key backup as well
* builtin, bash it seems
* Empty commit #1
* Move ccache stats to after compile
* Reshuffling ccache vars
* No comments
* Updates to Github output set syntax
* Empty Commit 1
* Empty Commit 2
* Empty commit 3
* /bin/bash -> bash; ccache_cmd for consistency
* Adding ccache -s before and after build
* Adding comments to compiler-hash script
* Let's build cached and non-cached variants together for comparison
* Fixing quotes, /bin/bash -> bash
* Minor var/env adjustment
* Adding ccache -z before the job
* Reverting CMakeLists.txt without CCACHE
* Switching to CMAKE_LANG_COMPILER_LAUNCHER instead of CMakeLists.txt rule
* 5G -> 1G cache size
* 1G -> 2G; Hyperparameter tuning
This PR contains