Material for unfolding tutorial #862
Conversation
Thanks Aliya! This is really nice.
I wonder if it's worth suggesting at the beginning that newer users might also want to check out the original long exercise to get a better idea of basic combine usage?
I haven't worked through all of the commands yet (on the train), so I may have some more comments later as well.
I've added a bunch for now, but they are mostly small things; I think this will be really helpful to have.
I wonder if at the end we might also want a small section mentioning regularization and that details on regularization can be found in some of the other documentation pages, if necessary, though in general the recommendation is to avoid it where possible?
Thanks a lot!
When constructing the reco-level for any differential analysis the main goal is to match the gen-level bins as closely as possible. In the simplest case it can be done with the cut-based approach, i.e. applying the selection on the corresponding reco-level variables: $p_{T}(Z)$ and $n_{\text{add. jets}}$. Due to the good lepton $p_{T}$ resolution we can follow the original STXS scheme quite closely with the reco-level selection, with one exception, it is not possible to access the very-low transverse momenta bin $p_{T}(Z)<75$ GeV.
In the `counting/regions` directory you can find the datacards with for five reco-level categories, each targeting a corresponding gen-level bin. Below you can find an example of the datacard for the reco-level bin with $p_{T}(Z)>400$ GeV.
"with for five" --> "with five"
In the repo, I see the directory `counting/`, but it doesn't have a `regions` subdirectory, only a `combined_ratesOnly.txt` datacard.
Maybe it would be good to have them create a workspace from this datacard and run a simple fit (perhaps e.g. a fitdiagnostics to get the signal strength) on a single region, with the very simple datacard.
Then maybe add another heading before moving on to the full datacard and the migration matrix?
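A rough sketch of what that suggested step could look like, assuming a hypothetical single-region datacard name (the actual file in the repository may be named differently):

```shell
# Build a workspace from a single-region counting datacard (file name is a placeholder)
text2workspace.py -m 125 counting/region_ptz_gt400.txt -o ws_single_region.root
# Run a simple FitDiagnostics fit to extract the signal strength in that region
combine -M FitDiagnostics ws_single_region.root -m 125 -n .SingleRegion
```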
One of the most important stages in the analysis design is to make sure that the reco-level categories are pure in the corresponding gen-level processes.
To explicitly check it, one can plot the contributions of gen-level bins in all of the reco-level bins. We propose to use the script provided in the tutorial git-lab page.
Maybe we could add a couple of words about what this script does. e.g.
"This script uses combine harvester to loop over detector level bins, and get the rate at which each of the signal processes (generator-level bins) contributes to that detector-level bin; which is then used to plot the migration matrix."
```shell
text2workspace.py -m 125 counting/* -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO verbose --PO 'map=.*/.*ZH_lep_PTV_75_150_hbb:r_zh_75_150[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_0J_hbb:r_zh_150_250noj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_GE1J_hbb:r_zh_150_250wj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_250_400_hbb:r_zh_250_400[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_GT400_hbb:r_zh_gt400[1,-5,5]' -o ws_counting.root
```
where we use `--PO 'map=bin/process:poi[init, min, max]'`. So in the example above a signal POI is assigned to each gen-level bin independent on reco-level bin. This allows to take into account the non-trivial acceptance effects. One can also perform bin-by-bin unfolding using the mapping to the bin names rather than processes, e.g. `'map= vhbb_Zmm_gt400_13TeV/.*:r_reco_zh_gt400[1,-5,5]'`, but this method is not recommended and can be used for tests.
I wonder if it is maybe better not to mention bin-by-bin at all here? Unless it is used as a useful test? I haven't used it that way, but maybe there is a use case I'm not aware of.
Perhaps a couple of words about `multiSignalModel` would also be a good idea?
To extract the measurement, let's run the initial fit first:
Maybe a few more words about `MultiDimFit`, e.g. "We will use the MultiDimFit algorithm to simultaneously extract best-fit values and uncertainties on multiple parameters."
And maybe a short word about the initial values: "we use `--setParameters <param>=<value>,...` to ..."
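As a hedged illustration of what such an initial fit could look like (the POI names match the workspace mapping above, but the fit name and exact options here are illustrative, not the tutorial's canonical command):

```shell
# Asimov (-t -1) fit with MultiDimFit: best-fit values and uncertainties for all five POIs,
# with the generated signal strengths set to 1 via --setParameters
combine -M MultiDimFit ws_counting.root -m 125 -t -1 -n .InitialFit --algo singles \
    --setParameters r_zh_75_150=1,r_zh_150_250noj=1,r_zh_150_250wj=1,r_zh_250_400=1,r_zh_gt400=1
```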
The datacards for this part of the exercise are located in `full_model_datacards/`, where you can find a separate datacard for each region within the `full_model_datacards/regions` directory and also a combined datacard `full_model_datacards/comb_full_model.txt`.
As you will find the datacards also contain several background processes. To control them properly we will add the regions enriched in the respective backgrounds. Then we can define a common set rate parameters for signal and control regions to scale the rates or other parameters affecting their shape.
"define a common set rate parameters" --> "define a common set of rate parameters"
Now we can create the workspace using the same `multiSignalModel`:
```shell
text2workspace.py -m 125 full_model_datacards/comb_full_model.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO verbose --PO 'map=.*/.*ZH_lep_PTV_75_150_hbb:r_zh_75_150[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_0J_hbb:r_zh_150_250noj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_GE1J_hbb:r_zh_150_250wj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_250_400_hbb:r_zh_250_400[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_GT400_hbb:r_zh_gt400[1,-5,5]' --for-fits --no-wrappers --X-pack-asympows --optimize-si.png-constraints=cms --use-histsum -o ws_full.root
```
I think `-simpdf-` has morphed into `-si.png-` in the `--optimize-simpdf-constraints` flag. It might also be good to explain somewhere a little what some of these flags do? Perhaps we could add a collapsable info box, or similar?
> Following the instructions given earlier, create the workspace and run the initial fit with `-t -1` and set the name `-n .BestFit`.
Since this time the datacards include shape uncertainties as well as additional categories to improve the background description, the fit might take much longer, but we can submit condor jobs and have results ready to look at in a few minutes.
this is nice, showing how to scale up some aspects. But maybe we should also advertise that it will only work on systems with condor (and perhaps provide an alternative for people running in places without?)
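For context, submissions like this are commonly handled with `combineTool.py` from `CombineHarvester`; a sketch under the assumption that an HTCondor scheduler is available (the task name is arbitrary, and dropping `--job-mode condor` runs the same fit locally for systems without condor):

```shell
# Submit the Asimov best fit of the full model as an HTCondor job;
# without --job-mode condor the identical command runs interactively instead.
combineTool.py -M MultiDimFit ws_full.root -m 125 -t -1 -n .BestFit --job-mode condor --task-name bestfit_full
```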
## Unfolded measurements

Now that we studied the NP impacts for each POI, we can finally extract the measurements.
Note, that in this exercise we are skipping couple of checks that have to be done before the unblinding. Namely the goodness of fit test and the post-fit plots of folded observables. Both of these checks were detailed in the previous exercises, you can find the description under the following links.
"skipping couple of checks" --> Maybe something like "we are skipping further checks and validation that you should do on your analysis for the purposes of the tutorial."
```shell
python scripts/make_XSplot.py summary_zh_stxs.json
```
![](figures/stxs_zh.png)
Is there a reason I am not seeing observed values for two of the bins? Maybe it would be good to comment on this briefly.
Thanks @kcormi, I will address your comments soon. But what I've actually realised is that it's not completely ok to have some of these plots on combine public docs pages, even if the analysis is already public. I'm using only partial Run 2 datacards to be able to fit them within 1-2h, so these results are not public or going to be public. I will move these plots to the cms-analysis gitlab.
Hi @anigamova what if you generate some pseudodata from the model, and then use that as the input data histogram? Then I think we could make everything public, right? There's no real data being used in that case, but it should work more or less as if it were.
Added the labels for the plots, should be clear now that they are only intended for this particular tutorial.
There's a set of combine (.txt) datacards which will help you get through the various parts of the exercise. The exercises should help you become familiar with the structure of fitting datacards.
Note that the general recomendation on unfolding in `Combine` are available [here](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/regularisation/), which also includes recommendations on regularisation techniques and when to use it, which is completely is not discussed in this tutorial at all.
"which is completely is not discussed" --> "is not discussed"
In the example given above a signal POI is assigned to each gen-level bin independent on reco-level bin. This allows to take into account the non-trivial acceptance effects. One can also perform bin-by-bin unfolding using the mapping to the bin names rather than processes, e.g. `'map= vhbb_Zmm_gt400_13TeV/.*:r_reco_zh_gt400[1,-5,5]'`, but this method is not recommended and can be used only for tests as another way to ensure that the migration matrix is close to diagonal.
I think you mean "to take into account migrations" rather than "acceptance effects" here, and that will maybe be a bit more clear to the reader.
## Shape analysis with control regions
One of the advantages of the maximum likelihood unfolding is the flexibility to choose the analysis observable and include more information on the event kinematics, consequently improving the analysis sensitivity. This analysis benefits from the shape information of the DNN output trained to differentiate the VH(bb) signal from the SM backgrounds.
The datacards for this part of the exercise are located in `full_model_datacards/`, where you can find a separate datacard for each region within the `full_model_datacards/regions` directory and also a combined datacard `full_model_datacards/comb_full_model.txt`.
As you will find the datacards also contain several background processes. To control them properly we will add the regions enriched in the respective backgrounds. Then we can define a common set of rate parameters for signal and control regions to scale the rates or other parameters affecting their shape.
For the shape datacards one has to specify the mapping of histograms and channels/processes as described below:
we could also link to the part of the documentation that describes this, for further information.
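For readers of this thread: the mapping being referred to is the `shapes` line of a Combine datacard. A generic example is sketched below; the ROOT file name is a placeholder and the channel name is simply reused from the mapping example earlier in this thread, so the tutorial's actual lines may differ.

```
# Map each (channel, process) to nominal and systematic-variation histograms in a ROOT file
shapes * vhbb_Zmm_gt400_13TeV hists_vhbb.root $CHANNEL/$PROCESS $CHANNEL/$PROCESS_$SYSTEMATIC
```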
In a realistic CMS analysis there are hundreds of nuisance parameters corresponding to various sources of systematics.
When we unfold to the gen-level observable we should remove the nuisances affecting the rate of the gen-level bins, i.e. the `lnN` NPs: `THU_ZH_mig*, THU_ZH_inc` and keep only the acceptance `shape` uncertainties: `THU_ZH_acc` and `THU_ggZH_acc`, which do not scale the inclusive cross sections by construction.
This can be achieved by freezing the respective nuisance parameters with the option `--freezeParameters par_name1,par_name2`. Alternatively you can create a group following the syntax given below at the end of the combined datacard, and freeze the parameters with the `--freezeNuisanceGroups group_name` option.
Maybe before this it would be good to point out explicitly that for this analysis you have already done the work of separating out the normalization and shape components by normalizing the theoretical variations to the nominal generator-level cross sections for each bin?
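To illustrate the group syntax mentioned a couple of paragraphs above (the group name and the explicit parameter list here are assumptions for illustration; the tutorial's actual group definition is not shown in this thread):

```
# At the end of the combined datacard: collect the rate-like theory NPs into a named group
theory_rate group = THU_ZH_inc THU_ZH_mig75 THU_ZH_mig150
```

The whole group can then be frozen in the fit with `--freezeNuisanceGroups theory_rate` instead of listing each parameter individually.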
![](figures/scan_plot_r_zh_75_150.png)
Repeat the same command for other POIs to fill the `summary_zh_stxs.json`, which can then be used to create the cross section plot as shown below.
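The scan command itself is not quoted in this thread; as a generic sketch, a one-dimensional likelihood scan of a single POI with Combine usually looks something like the following (the point count, parameter range and output names are illustrative):

```shell
# Scan r_zh_75_150 while profiling the other POIs and nuisance parameters
combine -M MultiDimFit ws_full.root -m 125 --algo grid --points 50 -P r_zh_75_150 \
    --floatOtherPOIs 1 --setParameterRanges r_zh_75_150=-1,3 -n .scan.r_zh_75_150
# plot1DScan.py ships with CombineHarvester/CombineTools
plot1DScan.py higgsCombine.scan.r_zh_75_150.MultiDimFit.mH125.root --POI r_zh_75_150
```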
Maybe it would also be good to add a few pedagogical lines here to the effect of:
"In order to plot our differential cross sections measurements at generator-level, we need to multiply the fitted parameter of interest values by their original cross sections which are stored in .... "
Some more comments about the presentation, mostly suggesting to make the writing and exposition a bit more pedagogical and try to make sure that the people following the tutorial understand how each of the pieces fit together.
# Combine unfolding exercise

## Getting started
By now you should have a working setup of Combine v9 from the pre-tutorial exercise. If so then move on to the cloning of the parametric fitting exercise gitlab repo below. If not then you need to set up a CMSSW area and check out the combine package:
I wonder if for this part it actually makes more sense to link directly to the installation instructions and tell them to use one of those setups. That way we avoid repeating information (and lower the risk of it going out of date or changing)?
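For reference, the standard Combine v9 installation follows roughly the steps below; the CMSSW release and tag shown are the ones recommended for v9 at the time of writing, so please defer to the official installation instructions for the current versions.

```shell
cmsrel CMSSW_11_3_4
cd CMSSW_11_3_4/src
cmsenv
git clone https://github.com/cms-analysis/HiggsAnalysis-CombinedLimit.git HiggsAnalysis/CombinedLimit
cd HiggsAnalysis/CombinedLimit
git checkout v9.1.0
scramv1 b clean
scramv1 b
```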
![](figures/simplifiedXS_VH_1_2.png)
In this tutorial we will focus on the ZH production, with Z boson decaying to charged leptons, and Higgs boson reconstructed with the resolved $b\bar{b}$ pair. We will also use only a part of the Run 2 categories, so the analysis sensitivity is not going to be achieved. Note that ggZH and ZH production modes are combined in the fit, since it is not possible to resolve them at this stage of the analysis. The STXS categories are defined independently of the Higgs decay channel, to streamline the combinations of the cross section measurement.
"the analysis sensitivity is not going to be achieved" --> maybe "we will not achieve the same sensitivity as the full analysis."
## Simple datacards, one-bin measurement
When constructing the reco-level for any differential analysis the main goal is to match the gen-level bins as closely as possible. In the simplest case it can be done with the cut-based approach, i.e. applying the selection on the corresponding reco-level variables: $p_{T}(Z)$ and $n_{\text{add. jets}}$.
I think a word is missing in the first sentence here?
Maybe e.g. "When determining the reco-level binning for any differential analysis" (it may also be good to use "detector-level" rather than "reco-level", but I'm not sure; consistency with either choice is probably the most important thing)
I wonder if in general spelling out "generator level" and "detector level" is a bit more understandable and good for the tutorial.
One of the most important stages in the analysis design is to make sure that the reco-level categories are pure in the corresponding gen-level processes.
To explicitly check it, one can plot the contributions of gen-level bins in all of the reco-level bins. We propose to use the script provided in the tutorial git-lab page. This script uses `CombineHarvester` to loop over detector level bins, and get the rate at which each of the signal processes (generator-level bins) contributes to that detector-level bin; which is then used to plot the migration matrix.
Maybe "we propose to use the script" --> "You can use the script"
```shell
text2workspace.py -m 125 counting/combined_ratesOnly.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO verbose --PO 'map=.*/.*ZH_lep_PTV_75_150_hbb:r_zh_75_150[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_0J_hbb:r_zh_150_250noj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_GE1J_hbb:r_zh_150_250wj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_250_400_hbb:r_zh_250_400[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_GT400_hbb:r_zh_gt400[1,-5,5]' -o ws_counting.root
```
In the example given above a signal POI is assigned to each gen-level bin independent on reco-level bin. This allows to take into account the non-trivial to take into account migration. One can also perform bin-by-bin unfolding using the mapping to the bin names rather than processes, e.g. `'map= vhbb_Zmm_gt400_13TeV/.*:r_reco_zh_gt400[1,-5,5]'`, but this method is not recommended and can be used only for tests as another way to ensure that the migration matrix is close to diagonal.
"Independent on" --> "Independent of"
"This allows to take into account the non-trivial to take into account migration" --> "This allows us to take into account migrations from a given generator level bin into various detector level bins."
![](figures/simplifiedXS_VH_1_2.png)

In this tutorial we will focus on the ZH production, with Z boson decaying to charged leptons, and Higgs boson reconstructed with the resolved $b\bar{b}$ pair. We will also use only a part of the Run 2 categories, so the analysis sensitivity is not going to be achieved. Note that ggZH and ZH production modes are combined in the fit, since it is not possible to resolve them at this stage of the analysis. The STXS categories are defined independently of the Higgs decay channel, to streamline the combinations of the cross section measurement.
Maybe we could also add a few words on the structure of the tutorial here.
e.g.
"In the first part of the tutorial, we will setup a relatively simple unfolding, where there is a single detector-level bin for every generator-level bin we are trying to measure. We will then perform a blind analysis using this setup to see the expected sensitivity.
In the second part of the tutorial we will perform the same measurement with a more advanced setup, making use of differential distributions per generator-level bin we are trying to measure, as well as control regions. By providing this additional information to the fit, we are able to achieve a better and more robust unfolding result. After checking the expected sensitivity, we will take a look at the impacts and pulls of the nuisance parameters. Then we will unblind and look at the results of the measurement, produce generator-level plots and provide the correlation matrix for our measured observables."
The hands-on exercise is split into seven parts:

1) Counting experiment
I wonder if it might be clearer to change this name from "counting experiment" to something like "simplified analysis strategy" or "simplified unfolding"?
And similarly, maybe changing "Shape analysis with control regions" to something like "Extending the analysis with other variables and control regions"
![](figures/impacts_zh_75_150.png)
* Do you observe differences in the impacts plots for different POIs? Do these differences make sense to you?
Because of the layout, I think people might miss this question if it is after the plot; it might be easier to see if it comes just before the figure?
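For reference, impact plots like the one above are usually produced with the `Impacts` workflow of `combineTool.py`; a generic sketch (the workspace name comes from the earlier text2workspace step, while the output names and the choice of `--robustFit 1` are illustrative):

```shell
# 1) initial fit, 2) one fit per nuisance parameter, 3) collect results, 4) plot one POI
combineTool.py -M Impacts -d ws_full.root -m 125 -t -1 --robustFit 1 --doInitialFit
combineTool.py -M Impacts -d ws_full.root -m 125 -t -1 --robustFit 1 --doFits
combineTool.py -M Impacts -d ws_full.root -m 125 -t -1 -o impacts.json
plotImpacts.py -i impacts.json -o impacts_zh_75_150 --POI r_zh_75_150
```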
## Unfolded measurements

Now that we studied the NP impacts for each POI, we can finally extract the measurements.
Note, that in this exercise we are skipping further checks and validation that you should do on your analysis for the purposes of the tutorial. Namely the goodness of fit test and the post-fit plots of folded observables. Both of these checks were detailed in the previous exercises, you can find the description under the following links.
Sorry, it was my suggestion earlier but maybe it's better to remove "for the purposes of the tutorial" here. To make sure no one thinks we are saying that they should do the checks for the purposes of the tutorial. Alternatively we could write it as "Note, that in this exercise, for the purposes of the tutorial, we are skipping ... "
(and I think we are missing the links referred to by "the following links")
## POIs correlations
In addition to the cross-section measurements it is very important to publish correlation matrix of measured cross sections.
"to publish the correlation matrix" (missing the) "of the measured cross sections" (again missing the)
Thanks for the updates @kcormi, all good from my side.
Hands-on session in the afternoon, should fit in 90-120 minutes