Unified API draft #1
Conversation
- Added a short introduction of this repository
- More updates to come later
Hi Kenneth and Uli. First of all, thanks a lot for reviewing it :) I am adding a new patch to the existing PR accommodating all the changes we agreed on here. Please give it a look. Thanks 👍
Reviewing this in light of what we agreed upon at the plenary on Oct 15th.
Not being fluent in C++, I could only confirm that the following parts are included in this draft:
First API draft: A Fast translation endpoint
- INPUT: batch of plain (i.e. markup already removed) text blobs in UTF-8
- OUTPUT: batch of objects
- Result objects:
- {
- plain (i.e. markup not reinserted) translated text blob in UTF-8
- byteranges for alignments
- byteranges for sentence splits for original text blob
- byteranges for sentence splits for translated text blob
- }
- Something similar to StringPiece for annotation of ranges
- Quality estimation optional off by default
Not to be included in the first API draft:
- Domain adaptation is not supported in first draft
- Tags/Emojis are not handled in first draft (should be removed from API requests)
- We add more expensive “backfill” flags, which can give us QEs and N-Best for individual sentences if required
- First draft will not specifically cater to the currently yet-to-be-defined requirements related to Outbound Translation
@abhi-agg @kpu Could you confirm that std::string_view corresponds to something like StringPiece, and that I, as a WASM consumer of the API, can use it to split up the original and translated text blobs into the same sentences that were sent to marian for translation?
These are already covered in this PR.
std::string_view is a C++17 feature that resembles StringPiece for annotation of ranges.
I believe this boils down to whether the LLVM compiler used by Emscripten can compile C++ code that uses a C++17 feature to WASM or not. To my knowledge, it can. See this: emscripten-core/emscripten#5513
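To illustrate the point, here is a minimal sketch (the snippet and its variable names are mine, not part of the draft) of how std::string_view can annotate sentence byte ranges of a UTF-8 blob without copying it:

#include <string>
#include <string_view>
#include <vector>

int main() {
  // UTF-8 text blob owned by the translation result (or by the caller).
  std::string blob = "How are you? Bring your computer.";

  // Sentence annotations as views into blob's memory; no copies are made.
  std::vector<std::string_view> sentences = {
    std::string_view(blob.data(), 12),       // "How are you?"
    std::string_view(blob.data() + 13, 20),  // "Bring your computer."
  };

  // The views stay valid only as long as blob is alive and not reallocated.
  return 0;
}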
- Added examples for Sentence mapping and Alignment
The API splits each text entry into sentences internally, which are then translated independently of each other. The translated sentences are then joined together and returned in the TranslationResult.
Please refer to the TranslationRequest class to find out what additional information can be requested. The alignment information can only be requested if the model supports it (check the isAlignmentSupported() API).
*/
virtual std::vector<std::future<TranslationResult>> translate(std::vector<std::string> texts, TranslationRequest request) = 0;
Still has the same bug: should be one std::future<std::vector<TranslationResult>>.
Also, std::vector<std::string> &.
Comment that the std::string will be stolen and moved to the result.
> Still has the same bug: should be one std::future<std::vector<TranslationResult>>.

Having a separate std::future for each text might give some performance benefit. Sure, in the current implementation it might not give us a sufficient gain, but a separate future for each text entry makes sense, as the API should be agnostic of the implementation details of marian.

> Also, std::vector<std::string> &

Didn't we agree in the previous review that the API should take ownership of texts in this API call? Passing a reference leaves open the possibility of texts being modified by consumers of the API.
> Comment that the std::string will be stolen and moved to the result.

This comment would not make sense if I am passing a copy of texts and not a reference.
The efficient semantics are that texts moves into translate, which can do what it wants with it (and will move it into the TranslationRequest).
I think the most C++ way of doing this is

std::future<std::vector<TranslationResult>> translate(std::vector<std::string> && texts, TranslationRequest request);

and the caller does translate(std::move(texts), request).
You were the one saying we don't want to copy the input too much...
If you want to design a correct API for incremental returns that would do sentences at a time, you can do that later. It doesn't look like std::vector<std::future<TranslationResult>>. It looks like: this span translates as this span, in undefined order.
There is no efficient toolkit that would take a batch of requests and return them in the original order. It will require more work for you and for us to implement a less efficient API that nobody will support.
I don't want the caller to retain write access to those std::string. They are not shared. They are transferred. I realize I can't prevent writing to them. Cough on a std::string and the base pointer changes and all the std::string_view are invalid.
Which is why:

std::future<std::vector<TranslationResult>> translate(std::vector<std::string> && texts, TranslationRequest request) {
  std::vector<std::string> actualTexts = std::move(texts);
  // ...
}

And we just define that the passed vector is empty at the end of the translate call.
I still very much like the idea of using shared pointers. How about something along these lines:

std::future<std::vector<TranslationResult>> translate(std::vector<std::unique_ptr<std::string>>&& texts, ...) {
  std::vector<std::shared_ptr<const std::string>> actualTexts(texts.size());
  // notice that actualTexts are shared pointers to **const** strings!
  for (size_t i = 0; i < texts.size(); ++i) {
    actualTexts[i] = std::move(texts[i]);
  }
  ...
}

That way you get what you want (immutability) and I get what I want (shared pointers to make sure the strings are there as long as I need them).
Dude, they're moving into the response. The string_view classes depend on them. They will exist. I hate unnecessary malloc calls.
How many more malloc calls does my proposal involve?
For now, I will change the API from

std::vector<std::future<TranslationResult>> translate(std::vector<std::string> texts, TranslationRequest request);

to

std::future<std::vector<TranslationResult>> translate(std::vector<std::string> && texts, TranslationRequest request);

It is a first version of the API. In any case, changing texts to support a move paradigm (&& texts) or making it a shared/unique pointer is never a big problem.
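For illustration only, a sketch of how a caller might use the revised signature (the service object and the example strings are assumptions, not part of the draft):

// Assume `service` implements the abstract translate() declared above.
std::vector<std::string> texts = {"How are you?", "Bring your computer."};
TranslationRequest request;  // defaults: no alignment, QE off

// The texts are moved into the API; after this call the vector is empty.
std::future<std::vector<TranslationResult>> future =
    service.translate(std::move(texts), request);

// Blocks until the whole batch has been translated.
std::vector<TranslationResult> results = future.get();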
AlignmentType alignmentType = AlignmentType::NONE;

// Optional request. A true/false value will include/exclude the original text in the TranslationResult. By default (false), the original text is not included.
bool includeOriginalText = false;
Again, since we are returning string_view, returning the underlying std::string is essentially required. There should be no option here.
Will remove it.
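A minimal sketch of the point made above (member names are illustrative, not taken from the draft): the string_view annotations are only valid if the TranslationResult itself owns the underlying strings.

struct TranslationResult {
  // Owned copies of the original and translated UTF-8 blobs.
  std::string originalText;
  std::string translatedText;

  // Sentence annotations referencing the strings above; if the strings
  // were not stored here, these views would dangle.
  std::vector<std::string_view> originalSentenceViews;
  std::vector<std::string_view> translatedSentenceViews;
};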
// ["you", "need"], // originalTextViews[1] = All sections of original text that align with translatedTextViews[1] i.e. "brauchst" | ||
// ["you"] // originalTextViews[2] = All sections of original text that align with translatedTextViews[2] i.e. "du" | ||
// ] | ||
std::vector<std::vector<std::string_view>> originalTextViews; |
All this matching of indices is overcomplicated compared to just having a std::vector of a struct, yes?
Also, efficiency nit that this will require tons of memory allocation with all these std::vector running around.
Better:

struct AlignmentArc {
  std::string_view sourceSpan;
  float weight;
};

class AlignmentAnnotation {
 public:
  const AlignmentArc *begin() const { return begin_; }
  const AlignmentArc *end() const { return end_; }
  AlignmentArc *begin() { return begin_; }
  AlignmentArc *end() { return end_; }
  std::string_view targetSpan() const { return targetSpan_; }
 private:
  AlignmentArc *begin_, *end_;
  std::string_view targetSpan_;
};

class Alignment {
 public:
  // Note MSVC gets annoying about converting iterators to pointers when compiled with debug, may need to change this.
  const AlignmentAnnotation *begin() { return &*annotations_.begin(); }
  const AlignmentAnnotation *end() { return &*annotations_.end(); }
  AlignmentAnnotation &Add(std::string_view target, std::size_t arcCount);
 private:
  pool<AlignmentArc> arc_pool_; // I've got one of these lying around, we all do.
  std::vector<AlignmentAnnotation> annotations_;
};
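A hedged usage sketch of this proposal, assuming the classes above have been fleshed out and an Alignment object has been populated via Add():

void InspectAlignment(Alignment &alignment) {
  for (const AlignmentAnnotation &annotation : alignment) {
    // annotation.targetSpan() is a span of the translated text.
    for (const AlignmentArc &arc : annotation) {
      // arc.sourceSpan is an aligned span of the original text,
      // arc.weight its (soft) alignment weight.
      (void)arc;
    }
  }
}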
const std::string sourceLanguageVocabPath;

// Path to the target vocabulary file to be used by the model
const std::string targetLanguageVocabPath;
If we're going to abstract this so much, then we might as well derive the vocab paths from the model path.
doc/Unified_API.md (outdated)
Alignment getAlignment() const;

/* Return whether the original text is requested to be included in the TranslationResult. */
bool isOriginalTextInResult() const;
The use of std::string_view is somewhat problematic because it's a C++17 feature but the core marian code base currently assumes C++11. Also, I'm in favor of text spans (offset and length) instead of string_view, because the latter is specific to C++. It's better to keep the API specs portable.
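A minimal sketch of what such a portable span could look like (the struct and field names are assumptions; the TextSpan name is borrowed from the QE discussion later in this thread):

#include <cstddef>

// A span of a UTF-8 blob expressed as byte offset and length, so it can be
// serialized to JSON (or any other wire format) without C++-specific types.
struct TextSpan {
  std::size_t offset;  // byte offset into the blob
  std::size_t length;  // length of the span in bytes
};
// e.g. serialized as {"offset": 13, "length": 20} in a JSON response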
SOFT = 0,
NONE,
}
What type of alignments do we want? Really soft alignments? No hard alignments?
We get soft alignments natively. As to whether the Mozilla people want those hardened in C++, I'll leave that up to them.
In March I was told that it makes more sense to return hard alignments. I implemented both in the REST API, so I don't really care. I return what I'm being asked for. Soft alignments just add a lot of JSON clutter. One thing to keep in mind is that responses through Native Messaging are capped at 1MB per response message (browser to co-app: 4GB).
public:
// ToDo: Public Methods
}
I do not really see a need for this class. A quality estimate associates target text spans with a number, so a std::vector<std::pair<TextSpan,float>> could be a direct data member of the TranslationResult class. We could have two such vectors, one for word-level estimates, one for sentence-level estimates.
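A sketch of that suggestion (assuming the TextSpan type discussed earlier; member names are illustrative, not from the draft):

struct TranslationResult {
  // ... translated text, sentence annotations, alignments ...

  // Word-level and sentence-level quality estimates over the translated text.
  // Both vectors are empty unless QE was requested.
  std::vector<std::pair<TextSpan, float>> wordQualityEstimates;
  std::vector<std::pair<TextSpan, float>> sentenceQualityEstimates;
};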
Agreed, having a class is overkill.
> I do not really see a need for this class. A quality estimate associates target text spans with a number, so a std::vector<std::pair<TextSpan,float>> could be a direct data member of the TranslationResult class. We could have two such vectors, one for word-level estimates, one for sentence-level estimates.

Is this really extendable? Imagine we agree on supporting document-level QE scores in the future. Would we add one more vector there for document-level QE scores? I believe this design is not extendable, and this is the reason for a separate class.
You're either adding to an enum or adding a member variable. I don't see how adding a member variable is any less extensible; we're not aiming for ABI compatibility here. And @ugermann 's design has the advantage that multiple things can be returned in the same query---including document and sentence-level QE.
> You're either adding to an enum or adding a member variable. I don't see how adding a member variable is any less extensible;

I would be adding to an enum as well as adding a member variable for the solution proposed by @ugermann.

> @ugermann's design has the advantage that multiple things can be returned in the same query---including document and sentence-level QE.

Didn't we agree that there is no use case for returning QE scores at multiple granularities in the same request? I fail to understand how this use case came into the picture again when you suggested I not implement unnecessary stuff. And by the way, extending this API to this use case only requires changing QualityScore to vector<QualityScore> in TranslationResult.
Somehow I can't reply in-thread to the above @ugermann comment. I don't see a problem with the wrapper being C++17. If we want, it could be downgraded to …
For context, @ugermann's comment that @kpu is replying to:
Personally, I am agnostic of the implementation as long as the result makes sense once it lands in JS land :)
@kpu: The problem with string_view is that the Marian Vocab class currently expects a … I tried setting the C++ standard to C++17 in Marian's CMakeLists.txt, but the default cmake on Ubuntu 16.04 can't handle that.
And once more: if we go native messaging, we'll need to talk JSON. How do you convert a string view into JSON if not by converting it into a byte range (offset and length)? In order to get things off the ground quickly, we should focus on JSON in, JSON out and discuss the API in those terms, i.e. structures of generic data types that are replicable in JSON.
As an aside: with the REST server we already have an API of sorts (however poorly documented). I don't see the benefit of replicating the work that's already been done (including implementation). There currently is no C++ class for the translation response because the code directly builds the JSON response from the raw search histories that it gets out of the search.
Right now the discussion seems to focus on: how can we redesign and reimplement Marian from scratch in WASM? I'm not sure that's the most productive way forward.
On that note: how do text ranges work when JSON is consumed by JavaScript? As far as I know, JSON uses UTF-8 by default, while JavaScript uses UTF-16. So if text ranges are defined in terms of bytes, we have a conversion problem, no?
@ugermann Conversion between … Does Marian even do encoding without SentencePiece? We're going to need to track byte ranges through the normalizer anyway, so there will be refactoring somewhere.
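For what it's worth, a sketch of the offset conversion being discussed: mapping a UTF-8 byte offset to the corresponding UTF-16 code-unit index that JavaScript string positions count. This helper is illustrative, not part of any existing API, and assumes well-formed UTF-8 input.

#include <cstddef>
#include <string_view>

std::size_t Utf8ByteToUtf16Index(std::string_view utf8, std::size_t byteOffset) {
  std::size_t utf16Index = 0;
  std::size_t i = 0;
  while (i < byteOffset && i < utf8.size()) {
    const unsigned char lead = static_cast<unsigned char>(utf8[i]);
    if (lead < 0x80)      { i += 1; utf16Index += 1; }  // 1-byte sequence (ASCII)
    else if (lead < 0xE0) { i += 2; utf16Index += 1; }  // 2-byte sequence
    else if (lead < 0xF0) { i += 3; utf16Index += 1; }  // 3-byte sequence (BMP)
    else                  { i += 4; utf16Index += 2; }  // 4-byte sequence -> surrogate pair
  }
  return utf16Index;
}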
@kpu The last time I checked, the default compilation option for the official Marian master is still -DUSE_SENTENCEPIECE=off. Marian has three types of vocabularies: "default" (simple whitespace-based tokenization for people who want to do their own tokenization externally), "factored" ("new|uc york|uc" instead of "New York"), and SentencePiece. I once proposed to always compile with SentencePiece but was told that wasn't desirable. I did not bother to inquire about the reasoning behind that stance.
The draft seems to cover everything we agreed on at the Oct 15 meeting:
First API draft: A Fast translation endpoint
- INPUT: batch of plain (i.e. markup already removed) text blobs in UTF-8
- OUTPUT: batch of objects
- Result objects:
- {
- plain (i.e. markup not reinserted) translated text blob in UTF-8
- byteranges for alignments
- byteranges for sentence splits for original text blob
- byteranges for sentence splits for translated text blob
- }
- Something similar to StringPiece for annotation of ranges
- Quality estimation optional off by default
Not to be included in the first API draft:
- Domain adaptation is not supported in first draft
- Tags/Emojis are not handled in first draft (should be removed from API requests)
- We add more expensive “backfill” flags, which can give us QEs and N-Best for individual sentences if required
- First draft will not specifically cater to the currently yet-to-be-defined requirements related to Outbound Translation
@kpu @ugermann Are you ok with approving this draft as is (in all its imperfection)? We can return to tweaking the exact interfaces later when we have an implementation and some memory and throughput benchmarks to go with it. We may of course also adjust the interfaces as we implement, depending on what seems right in the moment. E.g., is it ok to move on to implementation roughly based on this API draft? :)
We're mostly there! Let's move to C++ as discussed and get some implementation that will inform it.
* CI Changes to add tiny regression tests
* Adding an inspect cache step
* Removing ccache, pursue in another
* Incorporating Nick's changes through submodule merge
* Submodule now points to master
* Restoring ccache enabled workflow file
* Restoring ccache enabled CMakeLists
* cache -> ccache typo fix
* Moving CCACHE setup to GitHub runner file
* Find also uses CCACHE dir
* Updating CMakeLists not to override env
* Cache compiler binary's contents
* Changing a few names to trigger new build; Testing cache looks fun
* USE_CCACHE=on, -L for inspection
* Adding a ccache_cmd, but will only use in next commit
* Using ccache_cmd
* Removing "
* Adding compiler hash script
* Bunch of absolute paths
* GITHUB_WORKSPACE typo
* Nah, I'll keep -L and trigger another build
* Trying something with compiler hash on cache key backup as well
* builtin, bash it seems
* Empty commit #1
* Move ccache stats to after compile
* Reshuffling ccache vars
* No comments
* Updates to Github output set syntax
* Empty Commit 1
* Empty Commit 2
* Empty commit 3
* /bin/bash -> bash; ccache_cmd for consistency
* Adding ccache -s before and after build
* Adding comments to compiler-hash script
* Let's build cached and non-cached variants together for comparison
* Fixing quotes, /bin/bash -> bash
* Minor var/env adjustment
* Adding ccache -z before the job
* Reverting CMakeLists.txt without CCACHE
* Switching to CMAKE_LANG_COMPILER_LAUNCHER instead of CMakeLists.txt rule
* 5G -> 1G cache size
* 1G -> 2G; Hyperparameter tuning
This PR contains