Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

alternative characterization for weighted sample quantile #687

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

salbert83
Copy link

I am not familiar with the weighted quantile calculation provided by StatsBase.jl. It seems inconsistent with what one would want, for example, in a minimum weighted absolute deviation calculation. Based on the implementation, it seems this is intentional, but I am not sure why? Please review the document and sample code. Thank you.
QuantileCalc.pdf
quantile_chk.txt

@nalimilan
Copy link
Member

See #435 and linked PR. @matthieugomez may be able to answer.

@matthieugomez
Copy link
Contributor

@salbert83 Does your quantile function gives the same thing as Base Julia for weights equal to one?

@salbert83
Copy link
Author

salbert83 commented Apr 27, 2021

The short answer is no, because the solution to the associated optimization problem might not be unique. I extended the writeup.
QuantileCalc.pdf
I began looking at this in relation to a rejected PR I'd made to Distributions.jl regarding a MLE fit for a weighted Laplace distribution. The reviewer suggested using the StatsBase weighted quantile calculation, which doesn't return the result one needs in this context. So, I figured to see whether others agreed with this suggested change to the weighted quantile calculation, or should I reopen the other PR? Thanks for your time.

@matthieugomez
Copy link
Contributor

matthieugomez commented Apr 27, 2021

There are several ways to extend unweighted quantile for non-frequency weights. The current implementation satisfies these three properties:

  1. equals unweighted quantile for a weight vector of ones
  2. gives the same result after normalizing weights to sum to one
  3. gives the same result after removing elements with weights equal to zero

If you can show me that your definition satisfies these three properties, and that it can also be seen as the solution of a nice minimization problem, I'd be down to use it instead.

@salbert83
Copy link
Author

  1. Please review the PDF, where I examined the 30% unweighted quantile vs my implementation on the integers 1..100.
    2 & 3) I believe my implementation satisfies these requirements, unless my tests haven't uncovered some bug.

The notes elaborate on the associated optimization problem, which may not have a unique solution.

Thanks again for your time.

@matthieugomez
Copy link
Contributor

In the pdf, you don't show that 1, 2, 3 are true, you just look at one example.

@salbert83
Copy link
Author

The implementation largely follows the current implementation. A view is taken restricted to observations with nonzero weights, so (3) should be satisfied in all cases. The code should also be invariant to rescaling all the weights by the same positive constant, so (2) should be satisfied. (1) is not satisfied, as demonstrated by the example. In my humble opinion, for this case I find 30 easier to explain than 30.7 for the 30% quantile. Thanks again.

@matthieugomez
Copy link
Contributor

matthieugomez commented Apr 27, 2021

Right, 2&3 are satisfied. And I agree with you — I don't particularly like the definition of unweighted quantile used in Julia. But I still think it's important to match it in case weights equal to one.

@salbert83
Copy link
Author

The unweighted quantile is the one from Statistics.jl, right? I see Julia defaults to definition 7 of Hyndman and Fan (1996). It seems definition 4 is more consistent with the results of the optimization (achieved setting alpha = 0, beta =1 in optional params), whereas the Julia defaults for these optional parameters are alpha = beta = 1. I'm guessing since my main interest is in MLE calculations, would you suggest I reopen that PR with the specific point that the Julia weighted quantile calculation is not correct for this case? Thank you.

@matthieugomez
Copy link
Contributor

matthieugomez commented Apr 27, 2021

I guess you could open a PR to add a type kwarg to quantile in Statistics, adding the type you want + open a PR in StatsBase to add a type kwarg to quantile with weights too.

@salbert83
Copy link
Author

Statistics.quantile already supports the optional arguments,
julia> z = [1:100...];
julia> quantile(z, 0.3, alpha = 1, beta = 1) # These are the defaults if left unspecified
30.7
julia> quantile(z, 0.3, alpha = 0, beta = 1) # This seems consistent with the characterization through optimization.
30.0

Your suggestion then is to add a kwarg to the StatsBase calculation?

@matthieugomez
Copy link
Contributor

matthieugomez commented Apr 27, 2021

Oh, I see! So yeah, it'd be terrific to add the kwargs alpha and beta to the weighted quantile function in StatsBase (at least if you find an extension that makes sense, in particular, satisfying the requirements 1 2 3 given above).

@salbert83
Copy link
Author

I implemented the MLE consistent calculation using an optional kwarg. It doesn't cover the full range regarding alpha and beta settings available in Statistics.quantile, but does provide an option for those requiring this for MLE. I refactored the implementation so both methods share as much code as possible. Please review. Thanks.

@nalimilan
Copy link
Member

Even if you don't support arbitrary alpha and beta values, can you use the same keyword argument names as in the unweighted function (throwing errors for values that are not supported)?

@salbert83
Copy link
Author

salbert83 commented Apr 28, 2021

While I think alpha = 0 and beta = 1 are consistent with the optimization approach, I think it would be misleading. Statistics.quantile supports arbitrary alpha and beta in [0,1] as parameters controlling interpolation. The point of the kwarg here is to indicate its consistency with MLE.

@salbert83 salbert83 closed this Apr 28, 2021
@salbert83 salbert83 reopened this Apr 28, 2021
@salbert83
Copy link
Author

Sorry, closed accidentally.

@nalimilan
Copy link
Member

Why would it be misleading? That would make the different quantile definitions more consistent. And if in the future somebody requests supporting another definition, we don't want to add multiple keyword arguments.

@salbert83
Copy link
Author

salbert83 commented Apr 28, 2021

I did some further investigation to see whether this could work. Unfortunately, my previous speculation was incorrect, as demonstrated by the following examples. Consider the following 2 cases, where the function "f" is the cost function from my earlier notes.

Example 1
q = 0.5
z = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
w = [1.0, 1.0, 1.0, 1.0, 1.9, 0.1, 1.0, 1.0, 1.0, 1.0]
StatsBase.quantile(z, Weights(w), [q]) |> only = 4.7894736842105265
f(StatsBase.quantile(z, Weights(w), [q]) |> only) = 12.260526315789473
StatsBase.quantile(z, Weights(w), [q], mle = true) |> only = 5.0
f(StatsBase.quantile(z, Weights(w), [q], mle = true) |> only) = 12.05
Statistics.quantile(z, q, alpha = 0, beta = 1) = 5.0
Statistics.quantile(z, q, alpha = 1, beta = 0) = 6.0

Example 2:
q = 0.5
z = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
w = [1.0, 1.0, 1.0, 1.0, 0.1, 1.9, 1.0, 1.0, 1.0, 1.0]
StatsBase.quantile(z, Weights(w), [q]) |> only = 5.7368421052631575
f(StatsBase.quantile(z, Weights(w), [q]) |> only) = 12.286842105263158
StatsBase.quantile(z, Weights(w), [q], mle = true) |> only = 6.0
f(StatsBase.quantile(z, Weights(w), [q], mle = true) |> only) = 12.05
Statistics.quantile(z, q, alpha = 0, beta = 1) = 5.0
Statistics.quantile(z, q, alpha = 1, beta = 0) = 6.0

In Example 1, alpha = 0 and beta = 1 is consistent with the optimization approach. In Example 2, alpha = 1 and beta = 0 is consistent. In both these cases, the minima is unique and there is no ambiguity in the value that is selected by optimization. I believe selecting according to this optimization criteria doesn't quite fit the current framework of selecting interpolation parameters.

@salbert83
Copy link
Author

Here are some examples with equal weights showing selection by optimization is different from choosing alpha and beta parameters for interpolation:

z = [0.1576177442664426, 0.22588785878291495, 0.3472306214052947, 0.484404897858993, 0.5295720233336718, 0.5706659909590641, 0.6204132900748311, 0.6264793742353929, 0.9104324596368851, 0.9773671697505484]
w = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
StatsBase.quantile(z, Weights(w), [q]) |> only = 0.4968710244900043
f(StatsBase.quantile(z, Weights(w), [q]) |> only) = 0.9475584471334264
StatsBase.quantile(z, Weights(w), [q], mle = true) |> only = 0.484404897858993
f(StatsBase.quantile(z, Weights(w), [q], mle = true) |> only) = 0.9430706415462624
Statistics.quantile(z, q, alpha = 0, beta = 1) = 0.43502215833566155
f(Statistics.quantile(z, q, alpha = 0, beta = 1)) = 0.9746755948411945
Statistics.quantile(z, q, alpha = 1, beta = 0) = 0.5133118581627875
f(Statistics.quantile(z, q, alpha = 1, beta = 0)) = 0.9534771472556283
Statistics.quantile(z, q, alpha = 1, beta = 1) = 0.4968710244900043
f(Statistics.quantile(z, q, alpha = 1, beta = 1)) = 0.9475584471334264

@nalimilan
Copy link
Member

Thanks for checking. So you mean that this is yet another variant that isn't supported by the unweighted method in Statistics? Is it implemented in other statistical packages? If we add support for the weighted variant, we should probably support the unweighted one too (in Statistics).

@matthieugomez What do you think?

@salbert83
Copy link
Author

I'm not sure whether this is implemented in other statistical packages. I'm not a statistician by trade, so I can't claim expertise here. My motivation for this pull request started here: JuliaStats/Distributions.jl#1310 (comment) The weighted absolute deviation calculation arises if, for example, one uses Laplace distributions as part of a GMM or HMM. I'd made a submission regarding this to Distributions.jl, but was informed what was available (the weighted median) was good enough. This relied on the current weighted quantile method, which doesn't actually match the MLE for the distribution. I disagreed with the reviewer's assessment, but it did seem reasonable to see whether there was interest in a general weighted quantile calculation that could then be leveraged for my use case of interest. Thanks again.

@nalimilan
Copy link
Member

OK. Please do have a look at what other languages do, as it's generally a good idea to try using the same vocabulary for argument names if possible.

@devmotion Do you have an opinion about this?

@salbert83
Copy link
Author

I saw the following Python package, but I'm not sure how widely it is used, https://github.com/nudomarinero/wquantiles. My personal opinion is to design around use cases. My current use case is a correct implementation for parameter estimation for a weighted Laplace distribution. By correct, I mean returning a value which maximizes the likelihood function, noting that this need not uniquely determine the value. To resolve non-uniqueness, I imagine users would want to have a prior, in which case they would probably want to handle the details themselves.

@devmotion
Copy link
Member

I am not completely sure, are you mainly interested in my opinion about keyword argument names?

I just skimmed through the discussion quickly but I think it mixes up different ways to define (weighted) quantiles, based on the (smoothed) cumulative distribution function and based on quantile regression.

As mentioned before by others, I think any implementation of weighted quantiles in StatsBase has to be consistent with the unweighted implementation in Statistics. I would go even further than property 1 above and say that it would be desirable that it

1+: equals unweighted quantile with replicated samples for integer weights

It would be nice to add supports for parameter alpha and beta to the implementation in StatsBase - in my opinion, ideally the implementation in StatsBase would just be a generalization of the one in Statistics and satisfy all consistency properties mentioned above.

To me it seems, it would be better to implement estimates based on (weighted) quantile regression in a separate function.

@salbert83
Copy link
Author

I'm not sure how to proceed here. My end goal is to have a weighted Laplace fit (similar to weighted fits for other distributions, that returns a correct answer. If the internal implementation needs to leverage the StatsBase weighted median function, then this would need to change. But then that causes inconsistency with the unweighted quantile calculations in Statistics. It seems to me the way to proceed is to introduce a separate weighted quantile regression (per the previous suggestion), and then submit a pull request for Laplace that leverages this function. If there's general agreement, I'll reimplement. Thanks.

@devmotion
Copy link
Member

devmotion commented May 8, 2021

It seems to me the way to proceed is to introduce a separate weighted quantile regression (per the previous suggestion), and then submit a pull request for Laplace that leverages this function.

Yes, this would be my suggestion. What do you think @nalimilan?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants