
simplified likelihoods and pyhf #1359

Open
WolfgangWaltenberger opened this issue Mar 8, 2021 · 25 comments
Labels: question (Further information is requested), research (experimental stuff)

Comments

@WolfgangWaltenberger

WolfgangWaltenberger commented Mar 8, 2021

Discussion - simplified likelihoods

To get the discussion started, let me open this ticket. As we have already discussed a bit in other channels, for SModelS we would like to adopt pyhf also for the (CMS) simplified likelihoods (SL) we have in our database.
To remind ourselves, the CMS simplified likelihoods (v1.0) were of the form "Poisson * Gaussian": a multivariate Gaussian with a single big covariance matrix described all nuisances, collapsed into a single "enveloping" nuisance per signal region, and the Poissons (one per signal region) modelled our counting variables, as always. So ideally, in the long run, we would retire our own SL implementation and use pyhf for that task too. We at SModelS could of course work on this from our side, as all the necessary functionality is already in place -- except for, possibly, performance tweaks within pyhf. But since you at pyhf seem to have similar ambitions anyway, you might just go ahead and implement all that is needed, or we team up, or whatevs. Let the discussion begin.
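For concreteness, a minimal numpy/scipy sketch of that SL v1 form; all numbers (observed counts, backgrounds, signal yields, covariance) are illustrative placeholders, not taken from any real analysis:

import numpy as np
from scipy import stats

def sl_v1_likelihood(mu, theta, n_obs, n_bkg, n_sig, cov):
    """SL v1 "Poisson * Gaussian": one Poisson per signal region for the
    counts, times a single multivariate Gaussian constraint over the
    collapsed "enveloping" nuisances theta (one per signal region)."""
    lam = n_bkg + theta + mu * n_sig  # expected counts per signal region
    poisson_terms = stats.poisson.pmf(n_obs, lam).prod()
    gauss_term = stats.multivariate_normal.pdf(theta, cov=cov)
    return poisson_terms * gauss_term

# two illustrative signal regions
n_obs = np.array([10, 7])
n_bkg = np.array([6.0, 10.2])
n_sig = np.array([1.2, 0.8])
cov = np.array([[2.9, 1.0], [1.0, 10.9]])
print(sl_v1_likelihood(1.0, np.zeros(2), n_obs, n_bkg, n_sig, cov))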

@kratsg
Contributor

kratsg commented Mar 8, 2021

Do you mind pointing at an example specification of the CMS simplified likelihood? I thought it was in the Combine card format (#344), but perhaps it was a custom format? There are pieces of pyhf which are somewhat HiFa-JSON-independent, so this is not the craziest idea... but it would be nice to see what things look like, and we can probably guide this towards what the pyhf API is intended to support.

@kratsg added the question and research labels on Mar 8, 2021
@alexander-held
Member

Since multivariate Gaussians are not part of the HistFactory model, I assume this would either require something like #1188 (if the pdf is still supposed to be built by pyhf), or the pdf could be built externally (e.g. via a tool that translates a HistFactory workspace -> simplified likelihood) and fitted via pyhf.infer.

@WolfgangWaltenberger
Author

Do you mind pointing at an example specification of the CMS simplified likelihood? I thought it was in the Combine card format (#344), but perhaps it was a custom format? There are pieces of pyhf which are somewhat HiFa-JSON-independent, so this is not the craziest idea... but it would be nice to see what things look like, and we can probably guide this towards what the pyhf API is intended to support.

Ok, so I personally do not have a Combine card -- in SModelS all we use is the covariance matrix, plus the numbers of observed and expected events per signal region. Nick Wardle (@nucleosynthesis) is the person to ask. Of course he cannot share a real-life "combine" data card, but he might help guide the development here?

@nucleosynthesis

In CMS, we never use combine for the simplified likelihood (but we do use it to generate the inputs). In terms of pyhf, I think this just means putting in the parameterisation for the bin content and adding the Gaussian constraint term (I don't think one needs any fancy class, it's just a matrix multiplication, of course).

It would be super nice if the code here, https://gitlab.cern.ch/SimplifiedLikelihood/SLtools, could be re-used. In CMS we already have a way to go from combine datacard -> model.py/HepData format -> SLtools, so if the intermediate format needs to be modified, it should be easy enough.

Even further, it would be really nice if pyhf could implement the simplification in the form it takes in the code above. It's more accurate than correlations alone and doesn't require much additional input.
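For reference, a sketch of what that more accurate parameterisation looks like, assuming the per-region coefficients a, b, c are already given; deriving them from the covariance and third moments is what SLtools implements and is omitted here:

import numpy as np
from scipy import stats

def sl_v2_likelihood(mu, theta, n_obs, n_sig, a, b, c, rho):
    """Sketch of the SL "v2" form: the expected count in region I is
    mu*s_I + a_I + b_I*theta_I + c_I*theta_I**2, with the theta
    correlated via the matrix rho."""
    lam = mu * n_sig + a + b * theta + c * theta**2
    poisson_terms = stats.poisson.pmf(n_obs, lam).prod()
    gauss_term = stats.multivariate_normal.pdf(theta, cov=rho)
    return poisson_terms * gauss_term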

@WolfgangWaltenberger
Author

To keep this thread going: obtaining a CMS-style simplified likelihood from a pyhf likelihood is still a desirable goal, IMHO.
In SLs, we "collapse" all nuisances into a single nuisance per signal region, described by a multivariate Gaussian. So we need the correlations of the "collapsed" nuisances between the signal regions, not between the individual sources of the nuisances.

I see how I can get the latter, i.e. the correlations between the individual nuisances:

result, correlations, result_obj = pyhf.infer.mle.fit(
    asimov, pdf, return_result_obj=True, return_correlations=True, do_grad=True
)
# the correlation matrix is also available as result_obj.corr

But I still fail to see how I can get the correlations between the signal regions.
Any ideas @kratsg @lukasheinrich @matthewfeickert?
Thinking wildly, I imagine one could somehow extract how a given nuisance parameter affects a given signal region, and from that information plus the above "nuisance" correlation matrix derive the "signal region" covariance matrix. Something along these lines?
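One way to make this concrete (a sketch of linear error propagation, not an existing pyhf API): build the Jacobian J of the expected yields with respect to the parameters numerically, and propagate the fitted parameter covariance V as J V J^T:

import numpy as np

def yield_covariance(model, best_fit, par_cov, eps=1e-4):
    """Propagate the parameter covariance to a per-bin yield covariance
    via a numerical Jacobian of model.expected_data."""
    nominal = np.asarray(model.expected_data(best_fit, include_auxdata=False))
    jac = np.zeros((len(nominal), len(best_fit)))
    for j in range(len(best_fit)):
        shifted = np.array(best_fit, dtype=float)
        shifted[j] += eps
        varied = np.asarray(model.expected_data(shifted, include_auxdata=False))
        jac[:, j] = (varied - nominal) / eps
    return jac @ par_cov @ jac.T  # V_yields = J V_pars J^T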

JFYI, the document of the simplified likelihoods v1 is here: https://cds.cern.ch/record/2242860?ln=en

cheers

Wolfgang

@nucleosynthesis

nucleosynthesis commented Jun 23, 2021 via email

@WolfgangWaltenberger
Author

Hey Nick,

yes, that's a good idea! I wouldn't mind working on this myself, but I currently do not know how to get information about how the nuisances affect the signal regions. For this I hope for some help from the pyhf devs. Then, yes, I could just throw toys.
Agreed on the 3rd moment; I just wanted to keep things simple for starters ;)

Wolfgang

@alexander-held
Member

pyhf.pdf.Model.expected_data returns the model prediction (yields and optionally also auxiliary data) for a point in parameter space. Is this what you are looking for? The output can be split into regions via a small utility function like this.
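For illustration, a sketch of such a split; the slicing helper here is my own illustration using model.config.channel_nbins, not necessarily the linked utility:

import pyhf

model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
pars = model.config.suggested_init()
yields = model.expected_data(pars, include_auxdata=False)

# slice the flat yield vector into per-channel pieces
per_channel = {}
start = 0
for channel in model.config.channels:
    nbins = model.config.channel_nbins[channel]
    per_channel[channel] = yields[start : start + nbins]
    start += nbins
print(per_channel)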

@WolfgangWaltenberger
Author

WolfgangWaltenberger commented Jun 23, 2021 via email

@alexander-held
Member

alexander-held commented Jun 24, 2021

Hi Wolfgang, here is an example with a simple model. I get the best-fit parameter results + covariance from an MLE fit, and then sample parameters from a multivariate Gaussian. The model is subsequently evaluated for all sampled parameter points. This can be used, for example, to evaluate post-fit yield uncertainties via bootstrapping. I think it also allows the yield covariance calculation as described by Nick, though I am currently not completely sure that this is indeed the intended quantity:

import numpy as np
import pyhf

pyhf.set_backend("numpy", pyhf.optimize.minuit_optimizer(verbose=1))

# set up a simple model
model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
)
data = [62, 63] + model.config.auxdata

# MLE fit to get best-fit results and covariance matrix
result, result_obj = pyhf.infer.mle.fit(
    data, model, return_uncertainties=True, return_result_obj=True
)

# sample parameters from multivariate Gaussian and evaluate model
sampled_parameters = np.random.multivariate_normal(
    result_obj.minuit.values, result_obj.minuit.covariance, size=10000
)
model_predictions = [
    model.expected_data(p, include_auxdata=False) for p in sampled_parameters
]

yields = np.mean(model_predictions, axis=0)
yield_unc = np.std(model_predictions, axis=0)
print(f"model prediction: {yields} +/- {yield_unc}")

print(f"covariance:\n{np.cov(model_predictions, rowvar=False)}")
print(f"correlation:\n{np.corrcoef(model_predictions, rowvar=False)}")

output example:

model prediction: [62.07857537 63.12867452] +/- [6.56092723 6.24693401]
covariance:
[[43.05007114 20.9128427 ]
 [20.9128427  39.02808728]]
correlation:
[[1.         0.51019653]
 [0.51019653 1.        ]]

@lukasheinrich
Contributor

thanks @alexander-held - maybe we can convert this into a backend-independent form. What APIs would be needed?
Additional kwargs like return_covariance?

@alexander-held
Member

Yes, return_covariance should be all that is needed (return_correlations already exists).

@nucleosynthesis

nucleosynthesis commented Jun 24, 2021 via email

@WolfgangWaltenberger
Author

You are right, Nick, I need the covariance of the SRs only. But this shouldn't make a huge difference, should it?
Anyway, some progress.
I take SUSY-2018-04. It has two signal regions:
SRlow with an expected yield of 6 +/- 1.7 (observed 10), and SRhigh with an expected yield of 10.2 +/- 3.3 (observed 7).
See https://atlas.web.cern.ch/Atlas/GROUPS/PHYSICS/PAPERS/SUSY-2018-04/tab_06.png

I patch the model with a signal at high stau masses, so the signal should not play much of a role:

jsonpatch SUSY-2018-04_likelihoods/Region-combined/BkgOnly.json SUSY-2018-04_likelihoods/Region-combined/patch.DS_440_80_Staus.json > test.json

(I also tried pyhf patchset apply, but that failed for me.)

I retrieve the channel names:

In []: channelnames = [x["name"] for x in ws["channels"]]
Out[]: ['QCR1cut_cuts', 'QCR2cut_cuts', 'SR1cut_cuts', 'SR2cut_cuts', 'WCRcut_cuts']

From these names I expect to be interested in rows number 2 and 3 (0-indexed).

I run Alexander's code (thanks a lot!!) and look at the yields:

Out[]: array([ 309.42967087,   12.37700728,  246.4054857 ,    7.0491261 , 1108.5018055 ])

Comparing the yields with the expected values, it however seems that rows number 1 and 3 are the signal-region yields -- those look roughly right.
In the covariance matrix I would then expect numbers close to 1.7² ≈ 3 and 3.3² ≈ 10 somewhere along the diagonal.
But I see much higher numbers.

In []: np.diag(np.cov(model_predictions, rowvar=False))
Out[]: array([2.51467088e+07, 1.24868271e+02, 2.18699298e+07, 1.49262233e+01, 1.45148174e+03])

(I assume the Poissonian errors of the counting variables do not enter here. But even if they do, I am still quite far off.)

So there is still something I do not yet understand, I guess.
Anyways, thanks for all the help so far.

Cheers

Wolfgang

@WolfgangWaltenberger
Author

These would be the files.

Wolfgang

SL.zip

@kratsg
Contributor

kratsg commented Jun 24, 2021

Hi @WolfgangWaltenberger, use ws.channels or model.config.channels to get the right channel-name ordering.
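For instance (assuming the patched workspace from above is in test.json):

import json
import pyhf

with open("test.json") as f:
    ws = pyhf.Workspace(json.load(f))
model = ws.model()

print(ws.channels)            # channel ordering pyhf uses internally
print(model.config.channels)  # same ordering; matches expected_data output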

@alexander-held
Member

The yield uncertainties should indeed be recovered from the diagonal of this yield covariance matrix via np.sqrt(np.diag(np.cov(model_predictions, rowvar=False))). It works fine in that toy setup I posted above. Looking at the numbers in @WolfgangWaltenberger's example, the yields are

[ 309.42967087, 12.37700728, 246.4054857 , 7.0491261 , 1108.5018055 ]

and the variances

[2.51467088e+07, 1.24868271e+02, 2.18699298e+07, 1.49262233e+01, 1.45148174e+03]

are much too large to be compatible with those yields. Is it possible that the fit did not converge, and the parameter covariances reported from the fit are nonsensical?

The Poisson uncertainty component enters here as well; these numbers are the full post-fit uncertainty including statistical and systematic effects.
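If only the systematic part is wanted, one ad-hoc option (a sketch; Wolfgang's script further down does the same for the signal-region entries) is to subtract the approximate Poisson variance, i.e. the expected yield itself, from the diagonal of the sampled yield covariance:

import numpy as np

def remove_poisson_diagonal(yield_cov, yields):
    """Subtract the (approximate) Poisson variance, i.e. the expected
    yield, from the diagonal of the yield covariance matrix."""
    cov = np.array(yield_cov, dtype=float)
    cov[np.diag_indices_from(cov)] -= np.asarray(yields)
    return cov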

@nucleosynthesis

nucleosynthesis commented Jun 25, 2021 via email

@WolfgangWaltenberger
Author

@alexander-held result_obj.success is True, so it seems that MINUIT thinks it converged.
@nucleosynthesis thanks, and yes, you made yourself clear. What I mean is: assuming that the yields in the signal regions are much smaller than the yields in the control regions, fitting the signal regions along with the CRs should not have a big impact on the fit result, no?

Will keep playing around with the code snippet.

Wolfgang

@alexander-held
Member

alexander-held commented Jun 25, 2021

Hi @WolfgangWaltenberger, even though the fit is reported to be successful, at least in my version of it, using the files you linked above, I see that MINUIT reports that a parameter hit its boundary (the signal strength in this example, bounded below at 0). This will make the results unreliable (unrelated to this discussion: maybe such a case should not be flagged as a success?).

In this specific example there are some very strong anti-correlations between mu_QCD / mu_W and the gammas in two regions. This may make the estimation via sampling a bit unstable. I have not run any likelihood scans for the parameters; if the strongly correlated parameters are very non-Gaussian, then I think this approach will quickly run into limitations. For the WCRcut_cuts region, the yield of 1109 +/- 39 via sampling is fairly close to the 1099.63 ± 33.27 obtained via cabinetry.model_utils.calculate_stdev. The discrepancies in the other regions are much larger.
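For reference, the cabinetry cross-check referred to above looks roughly like this; hedged, as the exact calculate_stdev signature and return format depend on the cabinetry version:

import cabinetry

ws = cabinetry.workspace.load("test.json")
model, data = cabinetry.model_utils.model_and_data(ws)

fit_results = cabinetry.fit.fit(model, data)
stdevs = cabinetry.model_utils.calculate_stdev(
    model, fit_results.bestfit, fit_results.uncertainty, fit_results.corr_mat
)
print(stdevs)  # post-fit yield uncertainties via linear error propagation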

@WolfgangWaltenberger
Author

WolfgangWaltenberger commented Jun 28, 2021

Alright. I allowed for negative signal strengths in the fit (as I use it only to extract the covariance matrix), sampled a bit longer, assumed that the order of the yields and covariance matrix is not the one given in model.config.channels, and ended up with a covariance matrix that actually looks realistic. The procedure is clearly not yet "correct" (Nick's caveat still applies, plus some more issues regarding the correlations), but it is not completely bonkers numerically anymore either.

[comparison plot]

#!/usr/bin/env python3

import cabinetry
import pyhf
import numpy as np

pyhf.set_backend("numpy", pyhf.optimize.minuit_optimizer(verbose=1))

jsonfile = "example.json"

def download():
    import os, subprocess
    if os.path.exists(jsonfile):
        return
    from pyhf.contrib.utils import download
    download("https://www.hepdata.net/record/resource/1406212?view=true", "SUSY-2018-04_likelihoods/")
    cmd = "jsonpatch SUSY-2018-04_likelihoods/Region-combined/BkgOnly.json SUSY-2018-04_likelihoods/Region-combined/patch.DS_440_80_Staus.json"
    # cmd = "jsonpatch SUSY-2018-04_likelihoods/Region-combined/BkgOnly.json SUSY-2018-04_likelihoods/Region-combined/test.json"
    cmd += f" > {jsonfile}"
    subprocess.getoutput(cmd)

download()

ws = cabinetry.workspace.load(jsonfile)
model, data = cabinetry.model_utils.model_and_data(ws)
channels = model.config.channels

# widen the signal strength bounds to allow negative values
muSigIndex = model.config.parameters.index("mu_SIG")
suggestedBounds = model.config.suggested_bounds()
suggestedBounds[muSigIndex] = (-10.0, 10.0)

result, result_obj = pyhf.infer.mle.fit(
    data, model, return_uncertainties=True, return_result_obj=True,
    par_bounds=suggestedBounds,
)

# sample parameters from the fitted multivariate Gaussian and evaluate the model
sampled_parameters = np.random.multivariate_normal(
    result_obj.minuit.values, result_obj.minuit.covariance, size=50000
)
model_predictions = [
    model.expected_data(p, include_auxdata=False) for p in sampled_parameters
]

# warn about parameters that ended up at one of their bounds
for i, name in enumerate(model.config.parameters):
    fit = result_obj.minuit.values[i]
    bound = model.config.suggested_bounds()[i]
    if abs(fit - bound[0]) < 1e-5:
        print(f"Fitted value {fit} of {name} hit bound {bound}")
    if abs(fit - bound[1]) < 1e-5:
        print(f"Fitted value {fit} of {name} hit bound {bound}")

yields = np.mean(model_predictions, axis=0)
yield_unc = np.std(model_predictions, axis=0)
print("model predictions:\n")
for i, channel in enumerate(channels):
    print(f" -- {channel}: {yields[i]:.2f}+/-{yield_unc[i]:.2f}")

np.set_printoptions(precision=3)
ncov = np.cov(model_predictions, rowvar=False)
print(f"covariance:\n{ncov}")
print(f"correlation:\n{np.corrcoef(model_predictions, rowvar=False)}")

indices = np.array([2, 3])
print(f"covariance of SRs (2,3):\n{ncov[indices[:, None], indices]}")

# indices of signal regions
indices = np.array([1, 3])
scov = ncov[indices[:, None], indices]
print(f"covariance of SRs (1,3):\n{scov}")

# subtract the Poissonian variance (~ the yield) from the diagonal
for i in indices.tolist():
    ncov[i][i] = ncov[i][i] - yields[i]
scov = ncov[indices[:, None], indices]
print(f"covariance of SRs w/o Poissonian (1,3):\n{scov}")

import IPython
IPython.embed(using=False)

@WolfgangWaltenberger
Author

Now the first open issue I would like to tackle is the ordering of the channels. Why does it change?
Next, I would like to fit without including the signal regions in the fit.

@alexander-held
Member

assumed that the order of the yields and covariance matrix is not the one given in model.config.channels

The order in model.config.channels is "correct" in the sense that it is used consistently within pyhf. This is the order in which things like expected_data return the yields. The order in the workspace itself (ws["channels"]) does not matter; pyhf sorts channels alphabetically when creating a Workspace instance:

self.channels = sorted(list(set(self.channels)))
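A tiny demonstration of that sorting (the workspace content here is made up, with channel names echoing the example above):

import pyhf

spec = {
    "channels": [
        {"name": "WCRcut_cuts", "samples": [{"name": "bkg", "data": [50.0],
            "modifiers": [{"name": "mu", "type": "normfactor", "data": None}]}]},
        {"name": "QCR1cut_cuts", "samples": [{"name": "bkg", "data": [30.0],
            "modifiers": [{"name": "mu", "type": "normfactor", "data": None}]}]},
    ],
    "observations": [
        {"name": "WCRcut_cuts", "data": [55.0]},
        {"name": "QCR1cut_cuts", "data": [33.0]},
    ],
    "measurements": [{"name": "meas", "config": {"poi": "mu", "parameters": []}}],
    "version": "1.0.0",
}
ws = pyhf.Workspace(spec)
print(ws.channels)  # ['QCR1cut_cuts', 'WCRcut_cuts'] -- sorted, not spec order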

@WolfgangWaltenberger
Author

WolfgangWaltenberger commented Jun 28, 2021 via email

@WolfgangWaltenberger
Author

[comparison plot]

It's a bit puzzling: I know I am not doing it right yet, but the results actually already look quite good. A rare problem.
