Binary vector format for flat and hnsw vectors #14078

Open · wants to merge 3 commits into main

Conversation

@benwtrent (Member) commented Dec 17, 2024:

This provides a binary quantized vector format for flat and HNSW vectors. The key ideas are:

  • Centroid centered vectors
  • Asymmetric quantization
  • Individually optimized scalar quantization

This gives Lucene a single scalar quantization format that delivers high-quality vector retrieval, even down to a single bit per dimension.
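
A rough conceptual sketch of these three ideas is below. This is illustrative only: the class, method, and field names are invented, and the crude grid search merely stands in for the PR's actual per-vector optimization.

```java
// Conceptual sketch (not the PR's code): centroid-center a vector, pick a per-vector
// ("individually optimized") quantization interval, and quantize against it.
// Assumes bits <= 8 so each code fits in a byte.
final class PerVectorQuantizationSketch {

  /** A quantized vector plus the per-vector corrective terms kept alongside it. */
  record Quantized(byte[] codes, float lowerQuantile, float upperQuantile, int sumOfCodes) {}

  static Quantized quantize(float[] vector, float[] centroid, int bits) {
    // 1. Centroid-center the vector.
    float[] centered = new float[vector.length];
    float min = Float.POSITIVE_INFINITY, max = Float.NEGATIVE_INFINITY;
    for (int i = 0; i < vector.length; i++) {
      centered[i] = vector[i] - centroid[i];
      min = Math.min(min, centered[i]);
      max = Math.max(max, centered[i]);
    }
    // 2. "Individually optimized" interval: a crude per-vector grid search that picks the
    //    [lower, upper] pair minimizing squared reconstruction error.
    int levels = (1 << bits) - 1;
    float bestLower = min, bestUpper = max, bestError = Float.POSITIVE_INFINITY;
    for (int l = 0; l < 8; l++) {
      for (int u = 0; u < 8; u++) {
        float lower = min + (max - min) * l / 16f;
        float upper = max - (max - min) * u / 16f;
        if (upper <= lower) continue;
        float error = reconstructionError(centered, lower, upper, levels);
        if (error < bestError) {
          bestError = error;
          bestLower = lower;
          bestUpper = upper;
        }
      }
    }
    // 3. Quantize each component against the chosen per-vector interval.
    byte[] codes = new byte[centered.length];
    int sum = 0;
    float delta = (bestUpper - bestLower) / levels;
    for (int i = 0; i < centered.length; i++) {
      float clamped = Math.min(Math.max(centered[i], bestLower), bestUpper);
      int code = Math.round((clamped - bestLower) / delta);
      codes[i] = (byte) code;
      sum += code;
    }
    return new Quantized(codes, bestLower, bestUpper, sum);
  }

  private static float reconstructionError(float[] v, float lower, float upper, int levels) {
    float delta = (upper - lower) / levels;
    float error = 0;
    for (float x : v) {
      float clamped = Math.min(Math.max(x, lower), upper);
      float dequantized = lower + Math.round((clamped - lower) / delta) * delta;
      error += (x - dequantized) * (x - dequantized);
    }
    return error;
  }
}
```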

For all similarity types, the per-vector on-disk layout looks like this:

| quantized vector | lower quantile | upper quantile | additional correction | sum of quantized components |
|---|---|---|---|---|
| (vector_dimension/8) bytes | float | float | float | short |
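
For concreteness, here is a minimal sketch of writing one such per-vector entry, assuming the 1-bit case where codes are packed eight per byte. The class and parameter names are illustrative, not the PR's.

```java
import java.io.IOException;
import org.apache.lucene.store.IndexOutput;

final class BinaryEntryWriterSketch {
  // Writes one per-vector entry in the layout described above:
  // packed codes, lower quantile, upper quantile, additional correction, sum of codes.
  static void writeEntry(
      IndexOutput out,
      byte[] packedCodes, // vector_dimension / 8 bytes of packed 1-bit codes
      float lowerQuantile,
      float upperQuantile,
      float additionalCorrection,
      short sumOfQuantizedComponents)
      throws IOException {
    out.writeBytes(packedCodes, packedCodes.length);
    out.writeInt(Float.floatToIntBits(lowerQuantile));
    out.writeInt(Float.floatToIntBits(upperQuantile));
    out.writeInt(Float.floatToIntBits(additionalCorrection));
    out.writeShort(sumOfQuantizedComponents);
  }
}
```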

During segment merge & HNSW building, another temporary file is written containing the query-quantized vectors over the configured centroids. One downside is that this temporary file will actually be larger than the regular vector index, because we use asymmetric quantization to keep more information on the query side. Once the merge is complete, this file is deleted. I think this temporary file can eventually be removed entirely.
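
Under my own assumptions about how this spill works, the flow might look roughly like the sketch below; the names are invented and the real writer is per-field and more involved.

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexOutput;

final class MergeTimeQueryVectorsSketch {
  /**
   * Spills the asymmetric (higher-bit) query-side quantized vectors to a temporary file that the
   * HNSW graph build can read, and returns its name so it can be deleted once the merge completes.
   */
  static String writeTemporaryQueryVectors(Directory dir, byte[][] queryQuantizedVectors)
      throws IOException {
    IndexOutput tmp = dir.createTempOutput("binarized_query_vectors", "tmp", IOContext.DEFAULT);
    try (tmp) {
      for (byte[] quantized : queryQuantizedVectors) {
        // Query codes carry more bits than the 1-bit index codes, which is why this
        // temporary file can end up larger than the final vector index.
        tmp.writeBytes(quantized, quantized.length);
      }
    }
    return tmp.getName(); // caller deletes via dir.deleteFile(name) after the HNSW graph is built
  }
}
```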

Here are the results for Recall@10|50

| Dataset | Old PR | This PR | Improvement |
|---|---|---|---|
| Cohere 768 | 0.933 | 0.938 | 0.5% |
| Cohere 1024 | 0.932 | 0.945 | 1.3% |
| E5-Small-v2 | 0.972 | 0.975 | 0.3% |
| GIST-1M | 0.740 | 0.989 | 24.9% |

Even with the optimization step, indexing time with HNSW is only marginally increased.

| Dataset | Old PR | This PR | Difference |
|---|---|---|---|
| Cohere 768 | 368.62s | 372.95s | +1% |
| Cohere 1024 | 307.09s | 314.08s | +2% |
| E5-Small-v2 | 227.37s | 229.83s | < +1% |

The consistent improvement in recall, together with the flexibility to quantize to various bit counts, makes this format and quantization technique strongly preferable.

Eventually, we should consider moving scalar quantization over to this new optimized quantizer. However, the on-disk format and scoring would change, so I didn't do that in this PR.

supersedes: #13651

Co-Authors: @tveasey @john-wagster @mayya-sharipova @ChrisHegarty

@gaoj0017 commented:

Hi @benwtrent, I am the first author of the RaBitQ paper and its extended version. As your team knows, our RaBitQ method brings breakthrough performance to binary quantization and scalar quantization.

We notice that in this pull request, you mention a method which individually optimizes the lower and upper bounds of scalar quantization. This idea is highly similar to our idea of individually searching for the optimal rescaling factor of scalar quantization, as described in our extended RaBitQ paper, which we shared with your team in Oct 2024. An intuitive explanation can be found in our recent blog. The mathematical equivalence between these two ideas is given in Remark 2 below.

In addition, the contribution of our RaBitQ has not been properly acknowledged in several other places. For example, in a previous post from Elastic - Better Binary Quantization (BBQ) in Lucene and Elasticsearch - the major features of BBQ are introduced, yet it is not made clear that all these features originate from our RaBitQ paper. In a press release, Elastic claims that "Elasticsearch’s new BBQ algorithm redefines vector quantization"; however, BBQ is not a brand-new method but a variant of RaBitQ with some minor adaptations.

We note that when a breakthrough is made, it is always easy to derive variants of it or to restate the method in different terms. One should not claim a variant to be a new method with a new name while ignoring the contribution of the original method. We hope that you will understand our concern and properly acknowledge the contributions of our RaBitQ and its extension in your pull requests and/or blogs.

  • Remark 1. The BBQ feature fails on the GIST dataset because it removes the randomization operation of the RaBitQ method. With the randomization operation, RaBitQ is theoretically guaranteed to perform stably on all datasets.
  • Remark 2. Let $B$ be the number of bits for scalar quantization. The scalar quantization can be represented in two equivalent ways.
    1. Scalar quantization can be determined by the lower bound $v_l$ and the upper bound $v_r$. The algorithm first computes $\Delta =(v_r-v_l) / (2^{B}-1)$ and then maps each real value $x$ to the nearest integer of $(x-v_l) / \Delta$.
    2. Based on the process above, scalar quantization can be equivalently determined by a rescaling factor $\Delta$ and a shifting factor $v_l$.
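
As a concrete, purely illustrative instance of this equivalence (numbers chosen by the editor, not from the paper): with $B = 2$, $v_l = 0.1$, and $v_r = 0.9$, the first formulation gives $\Delta = (0.9 - 0.1)/(2^2 - 1) \approx 0.267$ and maps $x = 0.55$ to $\mathrm{round}((0.55 - 0.1)/\Delta) = 2$; the second formulation starts directly from $\Delta \approx 0.267$ and $v_l = 0.1$ and yields the same code, since $v_r$ is recoverable as $v_l + (2^B - 1)\Delta$.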

* </tr>
* <tr>
* <td>{@link org.apache.lucene.codecs.lucene99.Lucene99HnswVectorsFormat Vector values}</td>
* <td>.vec, .vem, .veq, vex</td>
Contributor: should we also add veb and vemb files to the list?

import org.apache.lucene.index.SegmentWriteState;

/**
* Copied from Lucene, replace with Lucene's implementation sometime after Lucene 10 Codec for
Contributor: Also should we remove this line?

return "Lucene102BinaryQuantizedVectorsFormat(name="
+ NAME
+ ", flatVectorScorer="
+ scorer
Contributor: nit: should we also add `+ ", rawVectorFormat=" + rawVectorFormat`?

@benwtrent (Member, Author) commented:

@gaoj0017 Thank you for your feedback!

Truly, y'all inspired us to improve scalar quantization. RaBitQ showed that it is possible to achieve a 32x reduction while retaining high recall, without product quantization. And, to my knowledge, we have attributed inspiration wherever the particulars of the algorithm were used.

As for this change, it is not mathematically the same as, nor derived from, y'all's new or old paper. Indeed, your new paper is interesting, provides the same flexibility across bit sizes, and shows that this is possible. However, we haven't tested it, nor implemented it.

Here are some details about this implementation.

https://www.elastic.co/search-labs/blog/scalar-quantization-optimization

@gaoj0017 commented:

@benwtrent Thanks for your reply.

First, in the blog - Better Binary Quantization at Elastic and Lucene - the BBQ method is a variant of our RaBitQ with no major differences. The claimed major features of BBQ all originate from our RaBitQ paper (as we have explained in our last reply). There is only one attribution to our method, where it is mentioned (in one sentence) that BBQ is based on some inspirations from RaBitQ. We think this attribution is not sufficient - it should be made clear that the mentioned features of BBQ all originate from RaBitQ.

Second, for the new method described in this pull request, there is no attribution to our extended RaBitQ method at all - we note that we shared the extended RaBitQ paper with your team more than 2 months ago. To our understanding, the method is highly similar to our extended RaBitQ at its core (which also supports quantizing a vector to 1 bit, 2 bits, ... per dimension). They share the major idea of optimizing the scalar quantization method by trying different parameters. In your new blog, it is mentioned that “Although the RaBitQ approach is conceptually rather different to scalar quantization, we were inspired to re-evaluate whether similar performance could be unlocked for scalar quantization.” This is not true, since our extended RaBitQ corresponds to an optimized scalar quantization method.

Given that our extended RaBitQ method is prior art for the method introduced in the blog, and that our method was known to your team more than 2 months ago, it should not have been ignored. Any differences between the two methods should be clearly explained, and experiments comparing the two methods should be provided as well.

@msokolov (Contributor) commented:

@gaoj0017 it sounds to me as if your concern is about lack of attribution in the blog post you mentioned, and doesn't really relate to this pull request (code change) - is that accurate?

@mikemccand (Member) commented:

+1 for proper attribution.

We should give credit where credit is due. The evolution of this PR clearly began with the RaBitQ paper, as seen in the opening comment on the original PR as well as the original issue.

Specifically for the open source changes proposed here (this pull request suggesting changes to Lucene's ASL2 licensed source code):

Linking to the papers that inspired important changes in Lucene is not only for proper attribution but also so users have a deep resource they can fall back on to understand the algorithm, understand how tunable parameters are expected to behave, etc. It's an important part of the documentation too! Also, future developers can re-read the paper and study Lucene's implementation and maybe find bugs / improvement ideas.

For the Elastic-specific artifacts (blog posts, press releases, tweets, etc.): I would agree that Elastic should also attribute properly, probably with an edit/update/sorry-about-the-oversight sort of addition? But I no longer work at Elastic, so this is merely my (external) opinion! Perhaps a future blog post, either from Elastic or someone else, could correct the mistake (missed attribution).

Finally, thank you to @gaoj0017 and team for creating RaBitQ and publishing these papers -- this is an impactful vector quantization algorithm that can help the many Lucene/OpenSearch/Solr/Elasticsearch users building semantic / LLM engines these days.

@benwtrent (Member, Author) commented:

To head this off: this implementation is not an evolution of RaBitQ in any way. It's intellectually dishonest to say it's an evolution of RaBitQ. I know that's pedantic, but it's a fact.

This is the next step of the global vector quantization optimization done already in Lucene. Instead of global, it's local and utilizes anisotropic quantization. I am still curious as to what in particular is considered built on RaBitQ here. Just because things reach the same ends (various bit level quantization) doesn't mean they are the same.

We can say "so this idea is unique from RaBitQ in these ways" to keep attribution, but it seems weird to call out another algorithm to simply say this one is different.

I agree, Elastic stuff should be discussed and fixed in a different forum.

@gaoj0017 commented Jan 6, 2025:

Hi @msokolov, the discussion here is not only about the blog posts but also relates to this pull request. In this pull request (and its related blogs), a new method is claimed without properly acknowledging the contributions/inspirations from our extended RaBitQ method, as we explained in our last reply. Besides, we believe this discussion is relevant to the Lucene community because Lucene is a collaborative project, used and contributed to by many teams beyond Elastic.

Thanks @mikemccand for your kind words - we truly appreciate them! It is also an honor to us that RaBitQ and the extended RaBitQ are seen as impactful in improving industry productivity.

Our responses to @benwtrent are as follows.
Point 1 in our last reply: This has been ignored in Ben’s reply. We would like to emphasize once again that the so-called “BBQ” method from Elastic is largely based on our RaBitQ method, with only minor modifications - this is reflected in the previous PRs. Yet Elastic has repeatedly referred to BBQ as their new algorithm without acknowledging RaBitQ. For example, in a press release, Elastic states that "Elasticsearch’s new BBQ algorithm redefines vector quantization," omitting any reference to RaBitQ. This is unacceptable and particularly unfair to other teams who have openly acknowledged their use of RaBitQ. We request that our RaBitQ method be properly credited in all existing and future blogs, press releases, pull requests, and other communications regarding the “BBQ” method.

Point 2 in our last reply: In the related blog that describes the method in this pull request, it states “Although the RaBitQ approach is conceptually rather different to scalar quantization, we were inspired to re-evaluate whether similar performance could be unlocked for scalar quantization” - this is not true since this target has already been achieved by our extended RaBitQ method. Your team should not have made this mis-claim by ignoring our extended RaBitQ method since we circulated our extended RaBitQ paper to your team more than three months ago. In addition, our extended RaBitQ method proposed the idea of searching for the optimal parameters of scalar quantization for each vector (for details, please refer to our blog). The method in this pull request has adopted a highly similar idea. For this reason, we request that in any existing and potentially future channels of introducing the method in this PR, proper acknowledgement of our extended RaBitQ method should be made.

@ChrisHegarty (Contributor) commented:

In my capacity as the Lucene PMC Chair (and with explicit acknowledgment of my current employment with Elastic, as of the date of this writing), I want to emphasize that proper attribution and acknowledgment should be provided for all contributions, as applicable, in accordance with best practices.

While the inclusion of links to external blogs and prior works serves to provide helpful context regarding the broader landscape, it would be of greater value to explicitly delineate which specific elements within this pull request are directly related to the RaBitQ method or its extension.

@tveasey commented Jan 6, 2025:

Just sticking purely to the issues raised regarding this PR and the blog Ben linked explaining the methodology...

> “Although the RaBitQ approach is conceptually rather different to scalar quantization, we were inspired to re-evaluate whether similar performance could be unlocked for scalar quantization” - this is not true since this target has already been achieved by our extended RaBitQ method. Your team should not have made this mis-claim by ignoring our extended RaBitQ method since we circulated our extended RaBitQ paper to your team more than three months ago.

This comment relates to the fact that RaBitQ, as you yourself describe it in both your papers, is motivated by seeking a form of product quantization (PQ) for which one can compute the dot product directly rather than via lookup. Your papers make minimal reference to scalar quantization (SQ) other than to say the method is a drop-in replacement. If you strongly take issue with the statement based on this clarification, we can further clarify it in the blog. I still feel this is separate from this PR, and it seems better to discuss it in a separate forum.

I would also reiterate that conceptually, our approach is much closer to our prior work on int4 SQ we blogged about last April, which is what inspired it more directly.

> In addition, our extended RaBitQ method proposed the idea of searching for the optimal parameters of scalar quantization for each vector (for details, please refer to our blog).

I would argue that finding the nearest point on the sphere is exactly equivalent to the standard process in SQ of finding the nearest grid point to a vector. Perhaps it would be more accurate to say you've ported SQ to work with spherical geometry, although, as before, the more natural motivation, and the one you yourselves adopt, is in terms of PQ. This isn't related to optimising hyperparameters of SQ, IMO.

You could perhaps argue that arranging for both the codebook centres and the corpus vectors to be uniformly distributed on the sphere constitutes this sort of optimization, although that would not be standard usage. At best you could say it indirectly arranges for raw vectors to wind up close, in some average sense, to the quantized vectors. However, I'd take issue with this statement because a single sample of a random rotation does not ensure that the corpus vectors are uniformly distributed on the sphere: applying a single random rotation to, for example, a set of points which are concentrated somewhere on the sphere does not remove that concentration. You would have to use different samples for different vectors, but this eliminates the performance advantages.

Incidentally, I think this is the reason it performs significantly worse on GIST, and indeed part of the reason why we found small improvements across the board for binary. (Tangentially, it feels like a whitening pre-conditioner might actually be of more benefit to the performance of RaBitQ. I also can't help but feel some combination of hyperparameter optimization and normalization will yield even further improvements, but I haven't been able to get this to work out yet.)
