Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unified api draft #1

Merged
merged 4 commits into from
Oct 29, 2020
Merged

Unified api draft #1

merged 4 commits into from
Oct 29, 2020

Conversation

abhi-agg
Copy link
Contributor

This PR contains

  • Small update to README file
  • First draft of unified API (as agreed in API review meeting)

 - Added a short introduction of this repository
 - More updates to come later
@abhi-agg abhi-agg requested review from kpu and motin October 21, 2020 09:24
doc/Unified_API.md Outdated Show resolved Hide resolved
doc/Unified_API.md Outdated Show resolved Hide resolved
doc/Unified_API.md Outdated Show resolved Hide resolved
doc/Unified_API.md Outdated Show resolved Hide resolved
doc/Unified_API.md Outdated Show resolved Hide resolved
doc/Unified_API.md Outdated Show resolved Hide resolved
doc/Unified_API.md Outdated Show resolved Hide resolved
doc/Unified_API.md Show resolved Hide resolved
doc/Unified_API.md Outdated Show resolved Hide resolved
doc/Unified_API.md Outdated Show resolved Hide resolved
doc/Unified_API.md Outdated Show resolved Hide resolved
Copy link
Contributor Author

@abhi-agg abhi-agg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Kenneth and Uli. First of all, thanks a lot for reviewing it :) I am adding a new patch to the existing PR accommodating all the changes that we agreed here. Please give it a look. Thanks 👍

@abhi-agg abhi-agg requested review from kpu and ugermann October 26, 2020 14:38
Copy link
Contributor

@motin motin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewing this in light of what we agreed upon at the plenary Oct 15th.

Not being native in C++, I could only confirm that the following parts are included in this draft:

First API draft: A Fast translation endpoint

  • INPUT: batch of plain (i.e. markup already removed) text blobs in UTF-8
  • OUTPUT: batch of objects
  • Result objects:
    • {
    • plain (i.e. markup not reinserted) translated text blob in UTF-8
    • byteranges for alignments
    • byteranges for sentence splits for original text blob
    • byteranges for sentence splits for translated text blob
    • }
  • Something similar to StringPiece for annotation of ranges
  • Quality estimation optional off by default

Not to be included in the first API draft:

  • Domain adaptation is not supported in first draft
  • Tags/Emojis are not handled in first draft (should be removed from API requests)
  • We add more expensive “backfill” flags, which can give us QEs and N-Best for individual sentences if required
  • First draft will not specifically cater to the currently yet-to-be-defined requirements related to Outbound Translation

@abhi-agg @kpu Could you confirm that std::string_view corresponds to something like StringPiece and that I as a WASM-consumer of the API can use this to split up the original and translated text blobs into the same sentences that were sent to marian for translation?

doc/Unified_API.md Show resolved Hide resolved
@abhi-agg
Copy link
Contributor Author

  * [ ]  byteranges for alignments
  * [ ]  byteranges for sentence splits for original text blob
  * [ ]  byteranges for sentence splits for translated text blob

These are already covered in this PR.

* [ ]  Something similar to StringPiece for annotation of ranges

@abhi-agg @kpu Could you confirm that std::string_view corresponds to something like StringPiece

std::string_view is a c++17 feature that resembles StringPiece for annotation of ranges.

and that I as a WASM-consumer of the API can use this to split up the original and translated text blobs into the same sentences that were sent to marian for translation?

I believe this boils down to whether the LLVM compiler used by emscripten can compile C++ code having a C++17 feature to WASM or not. As per my knowledge, it can. See this: emscripten-core/emscripten#5513

 - Added examples for Sentence mapping and Alignment
@abhi-agg abhi-agg requested a review from motin October 27, 2020 09:19
The API splits each text entry into sentences internally, which are then translated independent of each other. The translated sentences are then joined together and returned in TranslationResult.
Please refer to the TranslationRequest class to find out what additional information can be requested. The alignment information can only be requested if the model supports it (check isAlignmentSupported() API).
*/
virtual std::vector<std::future<TranslationResult>> translate(std::vector<std::string> texts, TranslationRequest request) = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still has the same bug: should be one std::future<std::vector<TranslationResult>>.

Also std::vector<std::string> &

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment that the std::string will be stolen and moved to the result.

Copy link
Contributor Author

@abhi-agg abhi-agg Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still has the same bug: should be one std::future<std::vector<TranslationResult>>.

Having separate std::future for each text might give some performance benefit. Surely, in current implementation it might not give us sufficient gain but a separate future for each text entry makes sense as the api should be agnostic of the implementation details of marian.

Also std::vector<std::string> &

Didn't we agree in previous review that the api should have the ownership of the textsin this API call? Passing a reference leaves a possibility of the texts to be modified by consumers of the API.

Comment that the std::string will be stolen and moved to the result.

This comment would not make sense if I am passing a copy of texts and not reference

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The efficient semantic is that texts moves to translate and can do what it wants with it (and will move it to TranslationRequest.

I think the most C++ way of doing this is

std::future<std::vector<TranslationRequest> > translate(std::vector<std::string> && texts, TranslationRequest request);

and the caller does translate(std::move(texts), request)

You were the one saying we don't want to copy the input too much...

Copy link
Member

@kpu kpu Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you want to design a correct API for incremental returns that would do sentences at a time, you can do that later. It doesn't look like std::vector<std::future<TranslationResult>>. It looks like this span translates as this span in undefined order.

There is no efficient toolkit that would take a batch of requests and return them in the original order. It will require more work for you and for us to implement a less efficient API that nobody will support.

Copy link
Member

@kpu kpu Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't want the caller to retain write access to those std::string. They are not shared. They are transferred. I realize I can't prevent writing to them. Cough on a std::string and the base pointer changes and all the std::string_view are invalid.

Which is why:

std::future<std::vector<TranslationRequest> > translate(std::vector<std::string> && texts, TranslationRequest request) {
  std::vector<std::string> actualTexts = std::move(texts);
  // ...
}

And we just define that the passed vector is clear at the end of the translate call.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still very much like the idea of using shared pointers. How about something along these lines:

std::future<std::vector<TranslationRequest> > translate(std::vector<std::unique_ptr<std::string>>&& texts, ...) {
  std::vector<std::shared_ptr<const std::string>> actualTexts(texts.size()); 
  // notice that actualTexts are shared pointers to **const** strings! 
  for (size_t i = 0; i < texts.size(); ++i) {
     actualTexts[i] = std::move(texts[i]);
  }
  ...
}

That way you get what you want (immutability) and I get what I want (shared pointers to make sure the strings are there as long as I need them).

Copy link
Member

@kpu kpu Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dude they're moving into the response. The string_view classes depend on them. They will exist. I hate unnecessary malloc calls.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How many more malloc calls does my proposal involve?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I will change the API

std::vector<std::future<TranslationRequest>> translate(std::vector<std::string> texts, TranslationRequest request);

to

std::future<std::vector<TranslationRequest>> translate(std::vector<std::string> && texts, TranslationRequest request);

It is a first version of the API which. In any case, changing the texts to support a move paradigm && texts or making it a shared/unique pointer is never a big problem.

AlignmentType alignmentType = AlignmentType::NONE;

// Optional request. A true/false value will include/exclude the original text in the TranslationResult. By default (false), the original text is not included.
bool includeOriginalText = false;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, since we are returning string_view returning the underlying std::string is essentially required. There should be no option here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will remove it.

doc/Unified_API.md Show resolved Hide resolved
// ["you", "need"], // originalTextViews[1] = All sections of original text that align with translatedTextViews[1] i.e. "brauchst"
// ["you"] // originalTextViews[2] = All sections of original text that align with translatedTextViews[2] i.e. "du"
// ]
std::vector<std::vector<std::string_view>> originalTextViews;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All this matching indices is overcomplicated compared to just having a std::vector of a struct, yes?

Copy link
Member

@kpu kpu Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, efficiency nit that this will require tons of memory allocation with all these std::vector running around.

Better:

struct AlignmentArc {
  std::string_view sourceSpan;
  float weight;
};
class AlignmentAnnotation {
  public:
    const AlignmentArc *begin() const { return begin_; }
    const AlignmentArc *end() const { return end_; }

    AlignmentArc *begin() { return begin_; }
    AlignmentArc *end() { return end_; }
    std::string_view targetSpan() const { return targetSpan_; }
  private:
    AlignmentArc *begin_, *end_;
    std::string_view targetSpan_;
};
class Alignment {
  public:
    // Note MSVC gets annoying about converting iterators to pointers when compiled with debug, may need to change this.
    const AlignmentAnnotation *begin() { return &*annotations_.begin(); }
    const AlignmentAnnotation *end() { return &*annotations_.end(); }
    AlignmentAnnotation &Add(std::string_view target, std::size_t arcCount);
  private:
    pool<AlignmentArc> arc_pool_; // I've got one of these lying around, we all do.  
    std::vector<AlignmentAnnotation> annotations_;
};

const std::string sourceLanguageVocabPath;

// Path to the target vocabulary file to be used by the model
const std::string targetLanguageVocabPath;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're going to abstract this so much, then we might as well derive the vocab paths from the model path.

Alignment getAlignment() const;

/* Return whether the original text is requested to be included in the TranslationResult. */
boolean isOriginalTextInResult() const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The use of std::string_view is somewhat problematic because it's a C++-17 feature but the core marian code base currently assumes C++-11. Also, I'm in favor of text spans (offset and length) instead of string_view because the latter is specific to C++. It's better to keep the API specs portable.


SOFT = 0,
NONE,
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What type of alignments do we want? Really soft alignments? No hard alignments?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We get soft alignments natively. As to whether the Mozilla people want those hardened in C++, I'll leave that up to them.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In March I was told that it makes more sense to return hard alignments. I implemented both in the REST API, so I don't really care. I return what I'm being asked for. Soft alignments just add a lot of JSON clutter. One thing to keep in mind is that responses through Native Messaging are capped at 1MB per response message (browser to co-app: 4GB).

doc/Unified_API.md Outdated Show resolved Hide resolved

public:
// ToDo: Public Methods
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not really see a need for this class. A quality estimate associates target text spans with a number, so a std::vector<std::pair<TextSpan,float>> could be a direct data member of the TranslationResult class. We could have two such vectors, one for word-level estimates, one for sentence-level estimates.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed having a class is overkill.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not really see a need for this class. A quality estimate associates target text spans with a number, so a std::vector<std::pair<TextSpan,float>> could be a direct data member of the TranslationResult class. We could have two such vectors, one for word-level estimates, one for sentence-level estimates.

Is this really extendable? Imagine if we agree on supporting document level QE scores in future. Would we add one more vector there for document level QE scores? I believe this design is not extendable and this is the reason for a separate class.

Copy link
Member

@kpu kpu Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're either adding to an enum or adding a member variable. I don't see how adding a member variable is any less extensible; we're not aiming for ABI compatibility here. And @ugermann 's design has the advantage that multiple things can be returned in the same query---including document and sentence-level QE.

Copy link
Contributor Author

@abhi-agg abhi-agg Oct 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're either adding to an enum or adding a member variable. I don't see how adding a member variable is any less extensible;

I would be adding an enum as well as adding a member variable for the solution proposed by @ugermann

@ugermann 's design has the advantage that multiple things can be returned in the same query---including document and sentence-level QE.

Didn't we agree on that there is no use case of returning QE scores for multiple granularities in the same request? I fail to understand how this use case came into picture again when you suggested me not to implement unnecessary stuff. And by the way, extending this API to this use case only require changing QualityScore to vector<QualityScore> in TranslationResult.

@kpu
Copy link
Member

kpu commented Oct 27, 2020

Somehow I can't reply in thread to the above @ugermann comment. I don't see a problem with the wrapper being C++17. If we want it could be downgraded to StringPiece.

@motin
Copy link
Contributor

motin commented Oct 27, 2020

For context, @ugermann's comment that @kpu is replying to:

The use of std::string_view is somewhat problematic because it's a C++-17 feature but the core marian code base currently assumes C++-11. Also, I'm in favor of text spans (offset and length) instead of string_view because the latter is specific to C++. It's better to keep the API specs portable.

Personally, I am agnostic of the implementation as long as the result makes sense once it lands in JS land :)

@ugermann
Copy link
Member

@kpu: The problem with string_view is that the Marian Vocab class currently expects a const std::string& as the input for 'encoding' (i.e., string to sequence of token ids). In order to accommodate std::string_view, we'd have to change the C++ settings for everything. SentencePiece ships with sentencepiece::utils::min_string_view that we could use, but that requires Marian to be set up with -DUSE_SENTENCEPIECE=on. Enforcing or expecting that always may break regression tests.

I tried setting the C++ standard to C++-17 in Marian's CMakeLists.txt, but the default cmake on Ubuntu 16.04 can't handle that.
In my opinion, sentencepiece::utils::min_string_view is good enough for our needs for the wrapper, and I'm currently adding a variant to Marian's Vocab::encode that expects a const char* and a length instead of const std::string&. That makes coding slightly (in one single place, actually!) more inconvenient but saves me all the adaptation work that would be required otherwise. As I said, for the wrapper we could use sentencepiece::utils::min_string_view because that what the SentencePiece API expects anyway.

And once more: if we go native messaging, we'll need to talk JSON, how do you convert a string view into JSON if not by converting it into a byte range (offset and length). In order to get things off the ground quickly, we should focus on JSON in, JSON out and discuss the API in those terms, i.e. structures of generic data types that are replicable in JSON.

As an aside: with the REST server we already have an API of sorts (however poorly documented). I don't see the benefit of replicating the work that's already been done (including implementation). There currently is no C++ class for the translation response because the code directly builds the JSON response from the raw search histories that it gets out of the search.

Right now the discussion seems to focus on: how can we redesign and reimplement Marian from scratch in WASM? I'm not sure that's the most productive way forward.

@ugermann
Copy link
Member

ugermann commented Oct 27, 2020

For context, @ugermann's comment that @kpu is replying to:

The use of std::string_view is somewhat problematic because it's a C++-17 feature but the core marian code base currently assumes C++-11. Also, I'm in favor of text spans (offset and length) instead of string_view because the latter is specific to C++. It's better to keep the API specs portable.

Personally, I am agnostic of the implementation as long as the result makes sense once it lands in JS land :)

On that note: how do text ranges work when JSON is consumed by JavaScript? As far as I know, by default, JSON uses utf-8. Javascript uses utf-16. So if text ranges are defined in terms of bytes, we have a conversion problem, no?

@kpu
Copy link
Member

kpu commented Oct 27, 2020

@ugermann Conversion between string_view, StringPiece, and similar types should be free if done correctly (ok, it might require subtraction since StringPiece is length-based). So if you're asking for "text spans" that are char * and length then make it a class and call it, say, StringPiece. I don't really care which of these options we use and honestly it will probably end up a typedef or something.

Does Marian even do encoding without SentencePiece? We're going to need to track byte ranges though the normalizer anyway so there will be refactoring somewhere.

@ugermann
Copy link
Member

Does Marian even do encoding without SentencePiece? We're going to need to track byte ranges though the normalizer anyway so there will be refactoring somewhere.

@kpu The last time I checked the default compilation option for the official Marian master is still -DUSE_SENTENCEPIECE=off. Marian has three types of vocbularies: "default" (simple whitespace-based tokenization for people who want to do their own tokenization externally), "factored" ("new|uc york|uc" instead of "New York"), and SentencePiece. I once proposed to always compile with SentencePiece but was told that wasn't desirable. I did not bother to inquire about the reasoning behind that stance.

Copy link
Contributor

@motin motin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The draft seems to cover everything we agreed on at the Oct 15 meeting:

First API draft: A Fast translation endpoint

  • INPUT: batch of plain (i.e. markup already removed) text blobs in UTF-8
  • OUTPUT: batch of objects
  • Result objects:
    • {
    • plain (i.e. markup not reinserted) translated text blob in UTF-8
    • byteranges for alignments
    • byteranges for sentence splits for original text blob
    • byteranges for sentence splits for translated text blob
    • }
  • Something similar to StringPiece for annotation of ranges
  • Quality estimation optional off by default

Not to be included in the first API draft:

  • Domain adaptation is not supported in first draft
  • Tags/Emojis are not handled in first draft (should be removed from API requests)
  • We add more expensive “backfill” flags, which can give us QEs and N-Best for individual sentences if required
  • First draft will not specifically cater to the currently yet-to-be-defined requirements related to Outbound Translation

@kpu @ugermann Are you ok with approving this draft as is (in all its imperfection)? We can return back to tweaking the exact interfaces later when we have an implementation and some memory and throughput benchmarks to go with it. We may of course also adjust the interfaces as we implement, depending on what seems right in the moment. Eg ok to move on to implementation roughly based on this API draft? :)

Copy link
Member

@kpu kpu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're mostly there! Let's move to C++ as discussed and get some implementation that will inform it.

@abhi-agg abhi-agg merged commit ef2323c into main Oct 29, 2020
jerinphilip pushed a commit that referenced this pull request Feb 20, 2021
jerinphilip pushed a commit that referenced this pull request May 9, 2021
XapaJIaMnu pushed a commit that referenced this pull request May 17, 2021
* CI Changes to add tiny regression tests

* Adding an inspect cache step

* Removing ccache, pursue in another

* Incorporating Nick's changes through submodule merge

* Submodule now points to master

* Restoring ccache enabled workflow file

* Restoring ccache enabled CMakeLists

* cache -> ccache typo fix

* Moving CCACHE setup to GitHub runner file

* Find also uses CCACHE dir

* Updating CMakeLists not to override env

* Cache compiler binary's contents

* Changing a few names to trigger new build; Testing cache looks fun

* USE_CCACHE=on, -L for inspection

* Adding a ccache_cmd, but will only use in next commit

* Using ccache_cmd

* Removing "

* Adding compiler hash script

* Bunch of absolute paths

* GITHUB_WORKSPACE typo

* Nah, I'll keep -L and trigger another build

* Trying something with compiler hash on cache key backup as well

* builtin, bash it seems

* Empty commit #1

* Move ccache stats to after compile

* Reshuffling ccache vars

* No comments

* Updates to Github output set syntax

* Empty Commit 1

* Empty Commit 2

* Empty commit 3

* /bin/bash -> bash; ccache_cmd for consistency

* Adding ccache -s before and after build

* Adding comments to compiler-hash script

* Let's build cached and non-cached variants together for comparison

* Fixing quotes, /bin/bash -> bash

* Minor var/env adjustment

* Adding ccache -z before the job

* Reverting CMakeLists.txt without CCACHE

* Switching to CMAKE_LANG_COMPILER_LAUNCHER instead of CMakeLists.txt rule

* 5G -> 1G cache size

* 1G -> 2G; Hyperparameter tuning
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants