simplified likelihoods and pyhf #1359
Comments
Do you mind pointing at an example specification of the CMS simplified likelihood? I thought it was in the Combine card format (#344) but perhaps it was a custom format? There are pieces of […]
Since multivariate Gaussians are not part of the […]
Ok, so I personally do not have a Combine card -- in SModelS all we use is the covariance matrix, and the number of observed and expected events per signal region. Nick Wardle (@nucleosynthesis) is the person to ask. Of course he cannot give the real-life "combine" data card, but he might help guide the development here?
In CMS, we never use combine for the simplified likelihood (but we do use it to generate the inputs). In terms of pyHF, I think this just means putting in the parameterisation for the bin content and adding the Gaussian constraint term (I don't think one needs any fancy class, it's just a matrix multiplication of course). Would be super nice if the code here can be re-used: https://gitlab.cern.ch/SimplifiedLikelihood/SLtools. In CMS we already have a way to go from […] Even further, it would be really nice if pyHF could implement the simplification in the form as it is in the code above. It's more accurate than correlations alone and doesn't require much additional input.
To keep this thread going, trying to obtain a CMS-style simplified likelihood from a pyhf likelihood is still a desirable goal, IMHO. In SLs, we "collapse" all nuisances to a single nuisance described by a multivariate Gaussian. So we would need the correlations of the "collapsed" nuisance between the signal regions, not between the sources of the nuisances.

I see how I can get the latter, i.e. the correlations between the individual nuisances:

result, result_obj = pyhf.infer.mle.fit(
    asimov, pdf, return_result_obj=True, return_correlations=True, do_grad=True
)
correlations = result_obj.corr

But I still fail to see how I can get the correlations between the signal regions. Any ideas @kratsg @lukasheinrich @matthewfeickert?

Thinking wildly, I imagine that one can somehow extract how a given nuisance parameter affects a given signal region, and from this info plus the above "nuisance" correlation matrix I can fit the "signal region" covariance matrix between the signal regions? Something along these lines?

JFYI, the document of the simplified likelihoods v1 is here: https://cds.cern.ch/record/2242860?ln=en

cheers
Wolfgang
Hi Wolfgang,

I don't know specifically how one could do this in pyHF (syntax-wise), but another route (which is rather straightforward) would be to use the known correlations between nuisance parameters (from the fit) and how they affect each signal region to generate the correlation (or rather covariance) between signal regions via MC toys. It should be easy to simply randomize the parameters (according to the correlations from correlations = result_obj.corr) and evaluate the total expected yield in each SR across the toys to do so (does that make sense?)

Note, it would also be great to include the 3rd-moment diagonal vector a la SL v2 as an option too!

Best,
Nick
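A minimal sketch of this toy procedure, assuming the fit from Wolfgang's comment above is additionally run with return_uncertainties=True (so that result carries per-parameter values and errors, and result_obj.corr their correlation matrix):

```python
import numpy as np

# parameter values and uncertainties from the fit above
best_fit, unc = result[:, 0], result[:, 1]
# turn the correlation matrix into a covariance matrix
param_cov = result_obj.corr * np.outer(unc, unc)
# throw toys: sample parameter points according to the fitted correlations
toys = np.random.multivariate_normal(best_fit, param_cov, size=5000)
# evaluate the expected yield in each signal-region bin for every toy
yields_per_toy = np.asarray(
    [pdf.expected_data(t, include_auxdata=False) for t in toys]
)
# covariance between signal-region bins, estimated across the toys
sr_cov = np.cov(yields_per_toy, rowvar=False)
```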
Hey Nick, yes that's a good idea! I wouldn't mind working on this idea myself, but I currently do not know how I can get information about […]

Wolfgang
pyhf.pdf.Model.expected_data (https://pyhf.readthedocs.io/en/v0.6.2/_generated/pyhf.pdf.Model.html#pyhf.pdf.Model.expected_data) returns the model prediction (yields and optionally also auxiliary data) for a point in parameter space. Is this what you are looking for? The output can be split into regions via a small utility function like this: https://github.com/alexander-held/cabinetry/blob/d888ec054e080f474a12de6de375857631c55987/src/cabinetry/model_utils.py#L154-L173
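A minimal sketch of such a splitting utility, using only public pyhf model attributes (model.config.channels and model.config.channel_nbins); the cabinetry function linked above is the reference implementation:

```python
import numpy as np

def split_by_channel(model, prediction):
    """Split a flat expected_data yield vector into per-channel arrays."""
    out, pos = {}, 0
    for channel in model.config.channels:
        nbins = model.config.channel_nbins[channel]  # bins in this channel
        out[channel] = np.asarray(prediction[pos : pos + nbins])
        pos += nbins
    return out

# usage:
# yields = split_by_channel(model, model.expected_data(pars, include_auxdata=False))
```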
Thanks Alexander. I am thinking, maybe I am too focused on using the correlations matrix. Maybe I should rather have MC sampling in mind, as Nick said. So how can I create fake observations from a pyhf model? I could use these to estimate the signal-region covariance matrix from the sample covariance...

Sorry for my ignorance about all this.
Wolfgang
Hi Wolfgang, here is an example with a simple model. I get best-fit parameter results + covariance from an MLE, and then sample parameters from a multivariate Gaussian. The model is subsequently evaluated for all samples of parameter points. This can be used for example to evaluate post-fit yield uncertainties via bootstrapping. I think it also allows the yield covariance calculation as described by Nick, though I am currently not completely sure that this is indeed the intended quantity:

import numpy as np
import pyhf

pyhf.set_backend("numpy", pyhf.optimize.minuit_optimizer(verbose=1))

# set up a simple model
model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
data = [62, 63] + model.config.auxdata

# MLE fit to get best-fit results and covariance matrix
result, result_obj = pyhf.infer.mle.fit(
    data, model, return_uncertainties=True, return_result_obj=True
)

# sample parameters from multivariate Gaussian and evaluate model
sampled_parameters = np.random.multivariate_normal(
    result_obj.minuit.values, result_obj.minuit.covariance, size=10000
)
model_predictions = [
    model.expected_data(p, include_auxdata=False) for p in sampled_parameters
]

yields = np.mean(model_predictions, axis=0)
yield_unc = np.std(model_predictions, axis=0)
print(f"model prediction: {yields} +/- {yield_unc}")
print(f"covariance:\n{np.cov(model_predictions, rowvar=False)}")
print(f"correlation:\n{np.corrcoef(model_predictions, rowvar=False)}")

Output example:

model prediction: [62.07857537 63.12867452] +/- [6.56092723 6.24693401]
covariance:
[[43.05007114 20.9128427 ]
 [20.9128427  39.02808728]]
correlation:
[[1.         0.51019653]
 [0.51019653 1.        ]]
thanks @alexander-held - maybe we can convert this into a backend-independent language, what APIs would be needed?
Yes, return_covariance should be all that is needed (return_correlations already exists).
One minor thing to be careful of (and I'm not sure we need to care about it or not yet): when using the simplified likelihood (a la the CMS note), there is a distinction made between the control and signal regions, in that the *fit* from which the nuisance parameters (and their correlations) are determined only uses data from the control regions. The reason is that the data in the signal region are used in the inference step, and this way one doesn't double count the data in the signal region (in CMS we call this "masking" parts of the likelihood, such that the expected bin yields can be evaluated without explicitly being included in the fit).

Now if Wolfgang needs the covariance between *all regions* then you can ignore this caveat.
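One way to approximate this "masking" with pyhf's existing API could be to drop the signal-region channels from the workspace before fitting; a hedged sketch (the file and channel names here are placeholders):

```python
import json
import pyhf

with open("BkgOnly.json") as f:  # illustrative workspace file
    ws = pyhf.Workspace(json.load(f))

# remove the signal-region channels so the fit only sees control regions
cr_only = ws.prune(channels=["SR_low", "SR_high"])  # placeholder channel names
model = cr_only.model()
data = cr_only.data(model)

result, result_obj = pyhf.infer.mle.fit(data, model, return_result_obj=True)
```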
You are right, Nick, I need the covariance of SRs only. But this shouldn't make a huge difference, should it? I patch the model with a signal at high stau masses, so the signal should not play much of a role:

jsonpatch SUSY-2018-04_likelihoods/Region-combined/BkgOnly.json SUSY-2018-04_likelihoods/Region-combined/patch.DS_440_80_Staus.json > test.json

(I tried also with pyhf patchset apply, but that failed for me.) I retrieve the channel names: […] From these names I expect to be interested in rows number 2 and 3 (0-indexed). I run Alexander's code (thanks a lot!!) and look at the yields: […] Comparing the yields with the expectation values, it however seems like rows number 1 and 3 […]

In[]: np.diag(np.cov(model_predictions, rowvar=False))

(I assume the Poissonian errors of the counting variables do not enter here. But even if they do, I am still […]) So there is still something I do not yet understand, I guess.

Cheers
Wolfgang
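For reference, a sketch of applying the same patch in Python with the jsonpatch library rather than the CLI, assuming the patch file is a plain JSON Patch document (file names taken from the command above):

```python
import json
import jsonpatch
import pyhf

with open("SUSY-2018-04_likelihoods/Region-combined/BkgOnly.json") as f:
    bkg_only = json.load(f)
with open("SUSY-2018-04_likelihoods/Region-combined/patch.DS_440_80_Staus.json") as f:
    patch = json.load(f)

# apply the JSON Patch to the background-only spec and build a workspace
patched_spec = jsonpatch.apply_patch(bkg_only, patch)
ws = pyhf.Workspace(patched_spec)
```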
These would be the files: […]

Wolfgang
Hi @WolfgangWaltenberger use […]
The yield uncertainties should indeed be recovered from the diagonal of this yield covariance matrix via np.sqrt(np.diag(np.cov(model_predictions, rowvar=False))). It works fine in the toy setup I posted above. Looking at the numbers in @WolfgangWaltenberger's example, the yields are

[ 309.42967087, 12.37700728, 246.4054857 , 7.0491261 , 1108.5018055 ]

and the variances

[2.51467088e+07, 1.24868271e+02, 2.18699298e+07, 1.49262233e+01, 1.45148174e+03]

are much too large to be compatible with those yields. Is it possible that the fit did not converge, and the parameter covariances reported from the fit are nonsensical?

The Poisson uncertainty component enters here as well; these numbers are the full post-fit uncertainty including statistical and systematic effects.
Just to come back to an earlier point (not related to the technical part), Wolfgang: when I said you mustn't include the signal-region bins in the fit, I meant that if you

1) include the signal region in the fit, and then obtain the covariance between signal regions (or whatever regions), and
2) use the covariance from 1 as input to the SL,

you will double count the data in the signal regions and not get the right thing. Thus in 1), you need to ignore the signal regions when fitting to avoid the double count.

Hope it's clear, and sorry to interrupt the technical discussion,
Nick
@alexander-held result_obj.success is True, so it seems that minuit assumes it converged. Will keep playing around with the code snippet.

Wolfgang
Hi @WolfgangWaltenberger, even though the fit is reported to be successful, at least in my version of it using the files you linked above I see that MINUIT reports that a parameter hit its boundary (the signal strength in this example, capped at 0). This will make the results unreliable (unrelated to this discussion: maybe such a case should not be flagged as success?). In this specific example there are some very strong anti-correlations between […]
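A small sketch of how one might check for this situation directly, assuming the iminuit object that pyhf exposes on the fit result as result_obj.minuit (attribute names are those of iminuit v2):

```python
minuit = result_obj.minuit
print(minuit.valid)                         # validity of the minimum
print(minuit.fmin.has_parameters_at_limit)  # True if a parameter sits at a bound
```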
Alright. I allowed for negative signal strengths in the fit (as I use it only to extract the cov matrix), sampled a bit longer, and assumed that the order of the yields and cov matrix is not the one given in models.config.channels. The script (abridged):

#!/usr/bin/env python3

import cabinetry
[…]
pyhf.set_backend("numpy", pyhf.optimize.minuit_optimizer(verbose=1))

jsonfile = "example.json"
download()
ws = cabinetry.workspace.load(jsonfile)
[…]
muSigIndex = model.config.parameters.index("mu_SIG")
result, result_obj = pyhf.infer.mle.fit([…])
sampled_parameters = np.random.multivariate_normal([…])
for i, name in enumerate(model.config.parameters):
    […]
yields = np.mean(model_predictions, axis=0)
np.set_printoptions(precision=3)
indices = np.array([2, 3])
indices = np.array([1, 3])  # indices of signal regions
print(f"covariance of SRs (1,3):\n{scov}")
import IPython
[…]
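A hedged, self-contained reconstruction of this workflow (the simple two-bin model from Alexander's earlier example stands in for the patched stau workspace; the widened POI bounds via the par_bounds keyword of pyhf.infer.mle.fit are the one deliberate change, allowing negative signal strengths):

```python
import numpy as np
import pyhf

pyhf.set_backend("numpy", pyhf.optimize.minuit_optimizer(verbose=1))

# stand-in for the patched workspace from the files above
model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
data = [62, 63] + model.config.auxdata

# widen the POI bounds so the signal strength may go negative in the fit
bounds = model.config.suggested_bounds()
bounds[model.config.poi_index] = (-10.0, 10.0)

result, result_obj = pyhf.infer.mle.fit(
    data, model, par_bounds=bounds, return_result_obj=True
)

# sample parameters and estimate the yield covariance across toys
sampled = np.random.multivariate_normal(
    result_obj.minuit.values, result_obj.minuit.covariance, size=50000
)
predictions = [model.expected_data(p, include_auxdata=False) for p in sampled]
np.set_printoptions(precision=3)
print(np.cov(predictions, rowvar=False))
```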
Now the first open issue I would wish to tackle is the question of the ordering of the channels. Why does that change?
The order in models.config.channels is "correct" in the sense that it is consistently used within pyhf. This is the order in which things like expected_data return you the yields. The order in the workspace itself (ws["channels"]) does not matter; pyhf sorts channels alphabetically when creating a Workspace instance: https://github.com/scikit-hep/pyhf/blob/03b914b1aa9006d8aa8ca5ecbbd2e6b6e6bdd1ce/src/pyhf/mixins.py#L41
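A small sketch illustrating that sorting, with made-up channel names:

```python
import pyhf

spec = {
    "channels": [
        {"name": "SRB", "samples": [{"name": "bkg", "data": [50.0],
            "modifiers": [{"name": "mu", "type": "normfactor", "data": None}]}]},
        {"name": "CRA", "samples": [{"name": "bkg", "data": [30.0],
            "modifiers": [{"name": "mu", "type": "normfactor", "data": None}]}]},
    ],
    "observations": [
        {"name": "SRB", "data": [55.0]},
        {"name": "CRA", "data": [33.0]},
    ],
    "measurements": [{"name": "meas", "config": {"poi": "mu", "parameters": []}}],
    "version": "1.0.0",
}
ws = pyhf.Workspace(spec)
print(ws.channels)  # ['CRA', 'SRB'] -- alphabetical, not the spec order
model = ws.model()
print(model.config.channels)  # same sorted order, used by expected_data etc.
```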
Hm, then I don't know what's wrong. I do take the names from models.config.channels.

Wolfgang
Discussion - simplified likelihoods
To get the discussion started, let me open up this ticket. As we have already discussed a bit in other channels, for SModelS we would actually like to adopt pyhf also for our (CMS) simplified likelihoods (SL) in the database.

To remind ourselves, the CMS simplified likelihoods (v1.0) were of the form "Poisson * Gaussian", where the multivariate Gaussian was used to describe all nuisances, collapsed to a single "enveloping" nuisance with a single big covariance matrix, and the Poissons (one for each signal region) were used to model our counting variables, as always. So ideally, in the long run, we would kick out our own SL implementation and also use pyhf for that task. We at SModelS could of course work on that from our side, as all necessary functionality is already in place -- except for, possibly, performance tweaks within pyhf. But since you at pyhf seem to have similar ambitions anyway, you guys might just go ahead and implement all that is needed, or we team up, or whatevs. Let the discussion begin.
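A hedged sketch of that "Poisson * Gaussian" form with illustrative names (not SModelS's actual implementation): each signal region contributes a Poisson count whose expectation is shifted by a nuisance vector constrained by a single multivariate Gaussian:

```python
import numpy as np
from scipy import stats

def simplified_loglikelihood(mu, theta, n_obs, signal, background, covariance):
    """log L = sum_i log Pois(n_i | mu*s_i + b_i + theta_i) + log N(theta | 0, cov)"""
    lam = mu * np.asarray(signal) + np.asarray(background) + np.asarray(theta)
    poisson = stats.poisson.logpmf(n_obs, lam).sum()  # one Poisson per signal region
    gauss = stats.multivariate_normal.logpdf(theta, cov=covariance)  # collapsed nuisance
    return poisson + gauss
```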