Unified api draft #1

Merged
merged 4 commits into from
Oct 29, 2020
Changes from 3 commits
4 changes: 3 additions & 1 deletion README.md
@@ -1 +1,3 @@
# bergamot-translator
# Bergamot Translator

Bergamot translator provides a unified API for ([Marian NMT](https://marian-nmt.github.io/) framework based) neural machine translation functionality in accordance with the [Bergamot](https://browser.mt/) project that focuses on improving client-side machine translation in a web browser.
187 changes: 187 additions & 0 deletions doc/Unified_API.md
@@ -0,0 +1,187 @@
# Unified (C++) API of Bergamot Translator

/* A translation model interface for translating plain UTF-8 encoded text (without any markup or emojis). The model supports translation from one source language to one target language. There can be different implementations of this interface. */

class **AbstractTranslationModel** {

public:

AbstractTranslationModel();

virtual ~AbstractTranslationModel() {};

/* This method performs translation on a list of (utf-8) texts and returns a list of results in the same order. Each text entry can either be a word, a phrase, a sentence or a list of sentences and should contain plain text (without any markups or emojis). Additional information related to the translated text can be requested via TranslationRequest which is applied equally to each text entry. The translated text corresponding to each text entry and the additional information (as specified in the TranslationRequest) is encapsulated and returned in TranslationResult.
The API splits each text entry into sentences internally, which are then translated independently of each other. The translated sentences are then joined together and returned in TranslationResult.
Please refer to the TranslationRequest class to find out what additional information can be requested. The alignment information can only be requested if the model supports it (check isAlignmentSupported() API).
*/
virtual std::vector<std::future<TranslationResult>> translate(std::vector<std::string> texts, TranslationRequest request) = 0;
Member

Still has the same bug: should be one std::future<std::vector<TranslationResult>>.

Also std::vector<std::string> &

Member

Comment that the std::string will be stolen and moved to the result.

Contributor Author

@abhi-agg abhi-agg Oct 27, 2020

> Still has the same bug: should be one std::future<std::vector<TranslationResult>>.

Having a separate std::future for each text might give some performance benefit. In the current implementation it may not yield a significant gain, but a separate future for each text entry makes sense because the API should be agnostic of Marian's implementation details.

> Also std::vector<std::string> &

Didn't we agree in the previous review that the API should take ownership of the texts in this call? Passing a reference leaves open the possibility that consumers of the API modify the texts.

> Comment that the std::string will be stolen and moved to the result.

This comment would not make sense if I am passing a copy of the texts rather than a reference.

Member

The efficient semantics are that texts moves into translate, which can do what it wants with it (and will move it into the TranslationResult).

I think the most C++ way of doing this is

std::future<std::vector<TranslationResult> > translate(std::vector<std::string> && texts, TranslationRequest request);

and the caller does translate(std::move(texts), request)

You were the one saying we don't want to copy the input too much...

Member

@kpu kpu Oct 27, 2020

If you want to design a correct API for incremental returns that would do sentences at a time, you can do that later. It doesn't look like std::vector<std::future<TranslationResult>>. It looks like this span translates as this span in undefined order.

There is no efficient toolkit that would take a batch of requests and return them in the original order. It will require more work for you and for us to implement a less efficient API that nobody will support.

Member

@kpu kpu Oct 27, 2020

I don't want the caller to retain write access to those std::string. They are not shared. They are transferred. I realize I can't prevent writing to them. Cough on a std::string and the base pointer changes and all the std::string_view are invalid.

Which is why:

std::future<std::vector<TranslationResult> > translate(std::vector<std::string> && texts, TranslationRequest request) {
  std::vector<std::string> actualTexts = std::move(texts);
  // ...
}

And we just define that the passed vector is clear at the end of the translate call.
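
A minimal sketch of the ownership transfer described in this comment, assuming a hypothetical `takeOwnership` helper in place of the real `translate` (the `std::future` wrapper is left out to keep the focus on the move). It illustrates that move-constructing a `std::vector` hands over its buffer without relocating the strings, so a `std::string_view` taken beforehand still points at the same bytes, and that the caller's vector is empty afterwards.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for translate(): it takes the texts by rvalue
// reference and immediately moves them into storage it owns.
std::vector<std::string> takeOwnership(std::vector<std::string>&& texts) {
  std::vector<std::string> actualTexts = std::move(texts);
  return actualTexts;
}

// Returns true if a view created before the transfer still refers to the
// string that is now owned by the callee's vector.
bool viewSurvivesTransfer() {
  std::vector<std::string> texts = {"Hello world.", "How are you?"};
  std::string_view view = texts[0];
  const char* before = view.data();

  std::vector<std::string> owned = takeOwnership(std::move(texts));

  // The vector move constructor leaves its source empty and keeps the
  // elements where they were, so the view's pointer is unchanged.
  return texts.empty() && owned.size() == 2 && owned[0].data() == before &&
         view == "Hello world.";
}
```

The guarantee only covers moving the vector as a whole. Moving or modifying an individual `std::string` may relocate its characters, which is the hazard the comment warns about.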

Member

I still very much like the idea of using shared pointers. How about something along these lines:

std::future<std::vector<TranslationResult> > translate(std::vector<std::unique_ptr<std::string>>&& texts, ...) {
  std::vector<std::shared_ptr<const std::string>> actualTexts(texts.size()); 
  // notice that actualTexts are shared pointers to **const** strings! 
  for (size_t i = 0; i < texts.size(); ++i) {
     actualTexts[i] = std::move(texts[i]);
  }
  ...
}

That way you get what you want (immutability) and I get what I want (shared pointers to make sure the strings are there as long as I need them).
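
A compilable sketch of the shared-pointer variant, assuming a hypothetical `shareTexts` helper that contains only the conversion loop from the proposal. On the allocation question raised in this thread: each text costs one allocation for the `std::string` object held by the `std::unique_ptr`, plus a control block when that pointer is converted to a `std::shared_ptr`, in addition to whatever the string's own character buffer needs.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using SharedTexts = std::vector<std::shared_ptr<const std::string>>;

// Hypothetical helper: converts caller-owned unique pointers into shared
// pointers to const strings, following the loop in the proposal.
SharedTexts shareTexts(std::vector<std::unique_ptr<std::string>>&& texts) {
  SharedTexts actualTexts(texts.size());
  for (std::size_t i = 0; i < texts.size(); ++i) {
    // Converting a unique_ptr into a shared_ptr allocates a control block.
    actualTexts[i] = std::move(texts[i]);
  }
  texts.clear();  // leave the caller's vector empty rather than full of nulls
  return actualTexts;
}

// Returns true if a second owner keeps a string alive after the first
// owner has let go of it.
bool sharedTextsStayAlive() {
  std::vector<std::unique_ptr<std::string>> texts;
  texts.push_back(std::make_unique<std::string>("Hello world."));

  SharedTexts shared = shareTexts(std::move(texts));
  std::shared_ptr<const std::string> extra = shared[0];  // second owner
  shared.clear();                                        // first owner gone

  return texts.empty() && extra.use_count() == 1 && *extra == "Hello world.";
}
```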

Member

@kpu kpu Oct 27, 2020

Dude they're moving into the response. The string_view classes depend on them. They will exist. I hate unnecessary malloc calls.

Member

How many more malloc calls does my proposal involve?

Contributor Author

For now, I will change the API from

std::vector<std::future<TranslationResult>> translate(std::vector<std::string> texts, TranslationRequest request);

to

std::future<std::vector<TranslationResult>> translate(std::vector<std::string> && texts, TranslationRequest request);

This is a first version of the API. In any case, changing texts later, whether to a different move-based signature or to shared/unique pointers, should not be a big problem.
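
A minimal sketch of how the revised signature could behave, under stated assumptions: `TranslationResult` is reduced to two fields, `TranslationRequest` is an empty placeholder, the "translation" is a fake prefix, and `std::launch::deferred` is used only to keep the example single-threaded. None of this is the project's implementation.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, reduced versions of the classes in the draft.
struct TranslationResult {
  std::string originalText;
  std::string translatedText;
};
struct TranslationRequest {};

// One future for the whole batch; the texts are moved in and end up in
// the results, so no copy of the input is kept.
std::future<std::vector<TranslationResult>> translate(
    std::vector<std::string>&& texts, TranslationRequest request) {
  return std::async(
      std::launch::deferred,  // assumption: runs lazily on get(), no thread
      [owned = std::move(texts), request]() mutable {
        (void)request;  // unused in this sketch
        std::vector<TranslationResult> results;
        results.reserve(owned.size());
        for (std::string& text : owned) {
          std::string fake = "[translated] " + text;  // placeholder output
          results.push_back({std::move(text), std::move(fake)});
        }
        return results;
      });
}
```

A caller would write `auto results = translate(std::move(texts), request).get();` and treat `texts` as empty from that point on.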


/* Check if the model can provide alignment information b/w original and translated text. */
virtual bool isAlignmentSupported() const = 0;
}

/* This class specifies the additional information related to the translated text (e.g. quality of the translation) that can be requested for inclusion in the TranslationResult. These optional requests are set/unset independently of each other, i.e. setting any one of them doesn’t have the side effect of setting any of the others. */

class **TranslationRequest** {

private:

// Optional request. The granularity for which Quality scores of the translated text will be included in TranslationResult. By default (QualityScoreGranularity::NONE), scores are not included.
QualityScoreGranularity qualityScore = QualityScoreGranularity::NONE;

// Optional request. The type of the alignment b/w original and translated text that will be included in TranslationResult. By default (AlignmentType::NONE), alignment is not included.
AlignmentType alignmentType = AlignmentType::NONE;

// Optional request. A true/false value will include/exclude the original text in the TranslationResult. By default (false), the original text is not included.
bool includeOriginalText = false;
Member

Again, since we are returning string_view returning the underlying std::string is essentially required. There should be no option here.

Contributor Author

Will remove it.


// Optional request. A true/false value will include/exclude the information regarding how individual sentences of original text map to corresponding translated sentences in joined translated text in the TranslationResult. By default (false), this information is not included.
bool includeSentenceMapping = false;

public:

explicit TranslationRequest();

~TranslationRequest();

/* Set the granularity for which the Quality scores of translated text should be included in the TranslationResult. By default (QualityScoreGranularity::NONE), scores are not included. */
void setQualityScoreGranularity(QualityScoreGranularity granularity);

/* Set the type of Alignment b/w original and translated text to be included in the TranslationResult. By default (AlignmentType::NONE), alignment is not included. */
void setAlignmentType(AlignmentType alignmentType);

/* Set to true/false to include/exclude the original text in the TranslationResult. By default (false), the original text is not included. */
void includeOriginalText(bool originalText);

/* Set to true/false to include/exclude the information regarding how individual sentences of original text map to corresponding translated sentences in joined translated text in the TranslationResult. By default (false), this information is not included. */
void includeSentenceMapping(bool sentenceMapping);

/* Return the granularity for which the Quality scores of the translated text will be included in TranslationResult. QualityScoreGranularity::NONE means the scores will not be included. */
QualityScoreGranularity getQualityScoreGranularity() const;

/* Return the type of Alignment b/w original and translated text that should be included in the TranslationResult. AlignmentType::NONE means the alignment will not be included. */
AlignmentType getAlignmentType() const;

/* Return whether the original text should be included in the TranslationResult. False means the original text will not be included. */
bool includeOriginalText() const;

/* Return whether the information regarding how individual sentences of original text map to corresponding translated sentences in joined translated text should be included in the TranslationResult. False means this information will not be included. */
bool includeSentenceMapping() const;
}
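
A small compilable sketch of the request class, under two assumptions. First, the draft gives data members and accessors the same name (for example `includeSentenceMapping`), which C++ does not allow within one class, so the members carry a trailing underscore here. Second, `includeOriginalText` is left out, following the review comment on that option. The inline bodies are illustrative only.

```cpp
#include <cassert>

enum class QualityScoreGranularity { WORD, SENTENCE, NONE };
enum class AlignmentType { SOFT, NONE };

class TranslationRequest {
 public:
  void setQualityScoreGranularity(QualityScoreGranularity granularity) {
    qualityScore_ = granularity;
  }
  void setAlignmentType(AlignmentType alignmentType) {
    alignmentType_ = alignmentType;
  }
  void includeSentenceMapping(bool sentenceMapping) {
    includeSentenceMapping_ = sentenceMapping;
  }

  QualityScoreGranularity getQualityScoreGranularity() const {
    return qualityScore_;
  }
  AlignmentType getAlignmentType() const { return alignmentType_; }
  bool includeSentenceMapping() const { return includeSentenceMapping_; }

 private:
  // Every option defaults to "not requested" and is set independently.
  QualityScoreGranularity qualityScore_ = QualityScoreGranularity::NONE;
  AlignmentType alignmentType_ = AlignmentType::NONE;
  bool includeSentenceMapping_ = false;
};
```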

/* This class represents the result of translation on a TranslationRequest. */

class **TranslationResult** {

private:

// Original text (utf-8) that was supposed to be translated; An optional result (it will be an empty string if not requested in TranslationRequest).
std::string originalText;

// Translation (in utf-8 format) of the originalText
std::string translatedText;

// Quality score of the translated text at the granularity specified in TranslationRequest; An optional result (it will have no information if not requested in TranslationRequest)
QualityScore qualityScore;

// Alignment information b/w original and translated text for AlignmentType specified in TranslationRequest; An optional result (it will have no information if not requested in TranslationRequest)
Alignment alignment;

// Information regarding how individual sentences of originalText map to corresponding translated sentences in joined translated text (translatedText); An optional result (it will be empty if not requested in TranslationRequest);
std::vector<std::pair<std::string_view, std::string_view>> sentenceMappings;

public:
// ToDo: Public Methods
}
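
To illustrate how `sentenceMappings` relates to the two strings, here is a hedged sketch with public members and a hypothetical `addSentenceMapping` helper; the `Span` struct and the byte offsets are assumptions made for the example. Each pair holds views into `originalText` and `translatedText`, so no sentence text is copied.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Reduced version of the class in the draft, with public members.
struct TranslationResult {
  std::string originalText;
  std::string translatedText;
  std::vector<std::pair<std::string_view, std::string_view>> sentenceMappings;
};

// Hypothetical byte range inside one of the two strings.
struct Span {
  std::size_t begin;
  std::size_t length;
};

// Records that one source sentence corresponds to one target sentence.
// The views point into the strings owned by `result`, so they stay valid
// only as long as those strings are neither modified nor relocated.
void addSentenceMapping(TranslationResult& result, Span source, Span target) {
  std::string_view original(result.originalText);
  std::string_view translated(result.translatedText);
  result.sentenceMappings.emplace_back(
      original.substr(source.begin, source.length),
      translated.substr(target.begin, target.length));
}
```

Copying a `TranslationResult` built this way would leave the copied views pointing into the source object's strings, and moving one can do the same for short strings, so an implementation would likely need to rebuild the views or restrict copying.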

/* This class encapsulates the configuration that is required by a translation model to perform translation. This configuration includes a path to the model file, source language vocabulary file, target language vocabulary file along with other options. */

class **TranslationModelConfiguration** {

private:

// Path to the translation model file
const std::string modelPath;

// Path to the source vocabulary file to be used by the model
const std::string sourceLanguageVocabPath;

// Path to the target vocabulary file to be used by the model
const std::string targetLanguageVocabPath;
Member

If we're going to abstract this so much, then we might as well derive the vocab paths from the model path.


// ToDo: Add all possible user configurable options (e.g. min batch size, max batch size) that are relevant for translation

public:

// Provide the path to the model file along with the source and target vocabulary files
TranslationModelConfiguration(const std::string& modelFilePath,
const std::string& sourceVocabPath,
const std::string& targetVocabPath);

// Return the path of the model file
const std::string& getModelFilePath() const;

// Return the path of the source language vocabulary file
const std::string& getSourceVocabularyPath() const;

// Return the path of the target language vocabulary file
const std::string& getTargetVocabularyPath() const;
}

// All possible granularities for which Quality Scores can be returned for translated (utf-8) text

enum class QualityScoreGranularity {

WORD,
SENTENCE,
NONE,
}

// All possible supported alignment types between a text and its translation

enum class AlignmentType {

SOFT,
NONE,
}
Member

What type of alignments do we want? Really soft alignments? No hard alignments?

Member

We get soft alignments natively. As to whether the Mozilla people want those hardened in C++, I'll leave that up to them.

Member

In March I was told that it makes more sense to return hard alignments. I implemented both in the REST API, so I don't really care. I return what I'm being asked for. Soft alignments just add a lot of JSON clutter. One thing to keep in mind is that responses through Native Messaging are capped at 1MB per response message (browser to co-app: 4GB).
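
For readers unfamiliar with the distinction, one common way to turn soft alignments into hard ones is to keep, for each target token, only the source token with the highest weight. The sketch below assumes that reading of "hard" and a row-per-target-token matrix layout; `hardenAlignment` is a hypothetical name, not part of the draft API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

// softAlignment[t][s] is the weight linking target token t to source
// token s (assumed layout). Returns, for each target token, the index of
// the source token with the largest weight.
std::vector<std::size_t> hardenAlignment(
    const std::vector<std::vector<float>>& softAlignment) {
  std::vector<std::size_t> hard;
  hard.reserve(softAlignment.size());
  for (const std::vector<float>& row : softAlignment) {
    // For an empty row max_element returns end(), which maps to index 0.
    auto best = std::max_element(row.begin(), row.end());
    hard.push_back(static_cast<std::size_t>(std::distance(row.begin(), best)));
  }
  return hard;
}
```

The hardened form stores one index per target token instead of one weight per source/target pair, which is relevant to the response size limit mentioned above.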


// This class represents the Quality Scores for various spans of the translated text at a specific granularity

class QualityScore {

private:

// Sections of a text for the Quality Scores
std::vector<std::string_view> textViews;

// Quality Scores corresponding to each section of the text in textViews in the same order
std::vector<float> textScores;

// Granularity of the text for the Quality scores above
QualityScoreGranularity textGranularity;

public:
// ToDo: Public Methods
}
Member

I do not really see a need for this class. A quality estimate associates target text spans with a number, so a std::vector<std::pair<TextSpan,float>> could be a direct data member of the TranslationResult class. We could have two such vectors, one for word-level estimates, one for sentence-level estimates.

Member

Agreed having a class is overkill.

Contributor Author

> I do not really see a need for this class. A quality estimate associates target text spans with a number, so a std::vector<std::pair<TextSpan,float>> could be a direct data member of the TranslationResult class. We could have two such vectors, one for word-level estimates, one for sentence-level estimates.

Is this really extensible? Imagine we agree to support document-level QE scores in the future. Would we add one more vector for document-level QE scores? I believe this design is not extensible, and that is the reason for a separate class.

Member

@kpu kpu Oct 27, 2020

You're either adding to an enum or adding a member variable. I don't see how adding a member variable is any less extensible; we're not aiming for ABI compatibility here. And @ugermann 's design has the advantage that multiple things can be returned in the same query---including document and sentence-level QE.

Contributor Author

@abhi-agg abhi-agg Oct 27, 2020

> You're either adding to an enum or adding a member variable. I don't see how adding a member variable is any less extensible;

I would be adding an enum value as well as a member variable for the solution proposed by @ugermann.

> @ugermann 's design has the advantage that multiple things can be returned in the same query---including document and sentence-level QE.

Didn't we agree that there is no use case for returning QE scores at multiple granularities in the same request? I fail to understand how this use case came into the picture again when you suggested that I not implement unnecessary stuff. And by the way, extending this API to cover that use case only requires changing QualityScore to vector<QualityScore> in TranslationResult.
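
For comparison, a sketch of the flat alternative proposed earlier in this thread, with `std::string_view` standing in for the `TextSpan` type named there (an assumption, since `TextSpan` is not defined in the draft). The `averageScore` helper is hypothetical and only shows that consumers can work with the vectors directly.

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// A span of the translated text paired with its quality estimate.
using ScoredSpan = std::pair<std::string_view, float>;

// Reduced TranslationResult with the scores as direct data members,
// one vector per granularity.
struct TranslationResult {
  std::string translatedText;
  std::vector<ScoredSpan> wordScores;
  std::vector<ScoredSpan> sentenceScores;
};

// Hypothetical consumer-side helper: mean score over a list of spans,
// or 0 when the list is empty (i.e. the scores were not requested).
float averageScore(const std::vector<ScoredSpan>& scores) {
  if (scores.empty()) return 0.0f;
  float sum = 0.0f;
  for (const ScoredSpan& entry : scores) sum += entry.second;
  return sum / static_cast<float>(scores.size());
}
```

With this layout, adding document-level scores means adding a third vector, whereas the class-based design adds an enum value; which of the two counts as more extensible is the point under debate here.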


// This class encapsulates a translated text, all the sections of the original text that align to this translated text and the corresponding alignments for each of these sections of original text.

class Alignment {

private:

// A list of sections of a translated text
std::vector<std::string_view> translatedTextViews;

// Each ith entry of this container corresponds to a list of all the sections of the original text that align to the ith entry of translatedTextViews
std::vector<std::vector<std::string_view>> originalTextViews;
Member

All this matching indices is overcomplicated compared to just having a std::vector of a struct, yes?

Member

@kpu kpu Oct 27, 2020

Also, efficiency nit that this will require tons of memory allocation with all these std::vector running around.

Better:

struct AlignmentArc {
  std::string_view sourceSpan;
  float weight;
};
class AlignmentAnnotation {
  public:
    const AlignmentArc *begin() const { return begin_; }
    const AlignmentArc *end() const { return end_; }

    AlignmentArc *begin() { return begin_; }
    AlignmentArc *end() { return end_; }
    std::string_view targetSpan() const { return targetSpan_; }
  private:
    AlignmentArc *begin_, *end_;
    std::string_view targetSpan_;
};
class Alignment {
  public:
    // Note MSVC gets annoying about converting iterators to pointers when compiled with debug, may need to change this.
    const AlignmentAnnotation *begin() { return &*annotations_.begin(); }
    const AlignmentAnnotation *end() { return &*annotations_.end(); }
    AlignmentAnnotation &Add(std::string_view target, std::size_t arcCount);
  private:
    pool<AlignmentArc> arc_pool_; // I've got one of these lying around, we all do.  
    std::vector<AlignmentAnnotation> annotations_;
};
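
A self-contained variant of the sketch in this comment, under stated assumptions: `pool<AlignmentArc>` is not defined in the thread, so a hypothetical chunked `ArcPool` stands in for it, and the outer `begin()`/`end()` use `data()` so that the end iterator is never dereferenced. It is meant as an illustration of the layout, not as the intended implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string_view>
#include <vector>

struct AlignmentArc {
  std::string_view sourceSpan;
  float weight = 0.0f;
};

// Hypothetical stand-in for pool<AlignmentArc>: hands out contiguous runs
// of arcs from large chunks, so most Add() calls do not allocate.
class ArcPool {
 public:
  AlignmentArc* allocate(std::size_t count) {
    if (chunks_.empty() || used_ + count > capacity_) {
      capacity_ = count > kChunkSize ? count : kChunkSize;
      chunks_.push_back(std::make_unique<AlignmentArc[]>(capacity_));
      used_ = 0;
    }
    AlignmentArc* first = chunks_.back().get() + used_;
    used_ += count;
    return first;
  }

 private:
  static constexpr std::size_t kChunkSize = 256;
  std::vector<std::unique_ptr<AlignmentArc[]>> chunks_;
  std::size_t capacity_ = 0;  // size of the newest chunk
  std::size_t used_ = 0;      // arcs already handed out from it
};

// One target span together with the arcs that point at source spans.
class AlignmentAnnotation {
 public:
  AlignmentAnnotation(std::string_view target, AlignmentArc* begin,
                      AlignmentArc* end)
      : begin_(begin), end_(end), targetSpan_(target) {}
  const AlignmentArc* begin() const { return begin_; }
  const AlignmentArc* end() const { return end_; }
  AlignmentArc* begin() { return begin_; }
  AlignmentArc* end() { return end_; }
  std::string_view targetSpan() const { return targetSpan_; }

 private:
  AlignmentArc* begin_;
  AlignmentArc* end_;
  std::string_view targetSpan_;
};

class Alignment {
 public:
  const AlignmentAnnotation* begin() const { return annotations_.data(); }
  const AlignmentAnnotation* end() const {
    return annotations_.data() + annotations_.size();
  }
  // The returned reference is only valid until the next Add() call,
  // because the vector may reallocate. The arcs themselves do not move.
  AlignmentAnnotation& Add(std::string_view target, std::size_t arcCount) {
    AlignmentArc* arcs = arcPool_.allocate(arcCount);
    annotations_.emplace_back(target, arcs, arcs + arcCount);
    return annotations_.back();
  }

 private:
  ArcPool arcPool_;
  std::vector<AlignmentAnnotation> annotations_;
};
```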


// Each ith entry of this container corresponds to the alignments of all the sections of the original text (ith entry of originalTextViews) that align to the ith entry of translatedTextViews
std::vector<std::vector<float>> alignments;

// Type of the alignment b/w original and translated text above
AlignmentType alignmentType;

public:
// ToDo: Public Methods
}