Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Material for unfolding tutorial #862

Merged
merged 30 commits into from
Feb 7, 2024

Conversation

anigamova
Copy link
Collaborator

Hands-on session in the afternoon, should fit in 90-120 minutes

Copy link
Collaborator

@kcormi kcormi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks Aliya! this is really nice.

I wonder if its worth at the beginning suggesting that newer users might also want to check out the original long exercise to get a better idea of basic combine usage?

I haven't worked through all of the commands yet (on the train), so I may have some more comments later as well.

I've added a bunch for now, but they are mostly small things; I think this will be really helpful to have.

I wonder if at the end we might also want a small section mentioning regularization and that details on regularization can be found in some of the other documentation pages, if necessary, though in general the recommendation is to avoid it where possible?

Thanks a lot!


When constructing the reco-level for any differential analysis the main goal is to match the gen-level bins as closely as possible. In the simplest case it can be done with the cut-based approach, i.e. applying the selection on the corresponding reco-level variables: $p_{T}(Z)$ and $n_{\text{add. jets}}$. Due to the good lepton $p_{T}$ resolution we can follow the original STXS scheme quite closely with the reco-level selection, with one exception, it is not possible to access the very-low transverse momenta bin $p_{T}(Z)<75$ GeV.

In `counting/regions` dicrectory you can find the datacards with for five reco-level categories, each targetting a corresponding gen-level bin. Below you can find an example of the datacard for reco-level bin with $p_{T}(Z)$>400 GeV,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"with for five" --> "with five"

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the repo, I see the directory counting/ but it doesn't have a regions subdirectory, only a combined_ratesOnly.txt datacard.

--------------------------------------------------------------------------------

```

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would be good to have them create a workspace from this datacard and run a simple fit (perhaps e.g. a fitdiagnostics to get the signal strength) on a single region, with the very simple datacard.

Then maybe add another heading before moving on to the full datacard and the migration matrix?


One of the most important stages in the analysis design, is to make sure that the reco-level categories are pure with the corresponding gen-level processes.

To explicitly check it, one can plot the contributions of gen-level bins in all of the reco-level bins. We propose to use the script provided in the tutorial git-lab page.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could add a couple of words about what this script does. e.g.

"This script uses combine harvester to loop over detector level bins, and get the rate at which each of the signal processes (generator-level bins) contributes to that detector-level bin; which is then used to plot the migration matrix."

```shell
text2workspace.py -m 125 counting/* -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO verbose --PO 'map=.*/.*ZH_lep_PTV_75_150_hbb:r_zh_75_150[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_0J_hbb:r_zh_150_250noj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_GE1J_hbb:r_zh_150_250wj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_250_400_hbb:r_zh_250_400[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_GT400_hbb:r_zh_gt400[1,-5,5]' -o ws_counting.root
```
where we use `--PO 'map=bin/process:poi[init, min, max]'`. So in the example above a signal POI is assigned to each gen-level bin independent on reco-level bin. This allows to take into account the non-trivial acceptance effects. One can also perform bin-by-bin unfolding using the mapping to the bin names rather that processes, e.g. `'map= vhbb_Zmm_gt400_13TeV/.*:r_reco_zh_gt400[1,-5,5]'`, but this method is not recommended and can be used for tests.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if it is maybe better not to mention bin-by-bin at all here? Unless it is used a useful test? I haven't used it that way, but maybe there is a use case I'm not aware of.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps a couple of words about multiSignalModel would also be a good idea?

```
where we use `--PO 'map=bin/process:poi[init, min, max]'`. So in the example above a signal POI is assigned to each gen-level bin independent on reco-level bin. This allows to take into account the non-trivial acceptance effects. One can also perform bin-by-bin unfolding using the mapping to the bin names rather that processes, e.g. `'map= vhbb_Zmm_gt400_13TeV/.*:r_reco_zh_gt400[1,-5,5]'`, but this method is not recommended and can be used for tests.

To extract the measurement let's run the initial fit first:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe a few more words about MultiDimFit, e.g.

"We will use the MultiDimFit algorithm to simultaneously extract best-fit values and uncertainties on multiple parameters."

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And maybe a short word about the initial values "we use --setParameters <param>=<value>,... to ... "


The datacards for this part of the exercise located `full_model_datacards/`, where you can find a separate datacard for each region within `full_model_datacards/regions` directory and also a combined datacard `full_model_datacards/comb_full_model.txt`.

As you will find the datacards also contain several background processes. To control them properly we will add the regions enriched in the respective backgrounds. Then we can define a common set rate parameters for signal and control regions to scale the rates or other parameters affecting their shape.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"define a common set rate parameters" --> "define a common set of rate parameters"

Now we can create the workspace using the same `multiSignalmodel`

```shell
text2workspace.py -m 125 full_model_datacards/comb_full_model.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO verbose --PO 'map=.*/.*ZH_lep_PTV_75_150_hbb:r_zh_75_150[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_0J_hbb:r_zh_150_250noj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_GE1J_hbb:r_zh_150_250wj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_250_400_hbb:r_zh_250_400[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_GT400_hbb:r_zh_gt400[1,-5,5]' --for-fits --no-wrappers --X-pack-asympows --optimize-si.png-constraints=cms --use-histsum -o ws_full.root
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think -simpdf- has morphed into -si.png- in the --optimize-simpdf-constraints flag. It might also be good to explain somewhere a little what some of these flags do? Perhaps we could add a collapsable info box, or similar?


> Following the instructions given earlier, create the workspace and run the initial fit with `-t -1` and set the name `-n .BestFit`.

Since this time the datacards include shape uncertainties as well as additional categories to improve the background description the fit might take much longer, but we can submit condor jobs and have results ready to look at in a few minutes.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is nice, showing how to scale up some aspects. But maybe we should also advertise that it will only work on systems with condor (and perhaps provide an alternative for people running in places without?)

## Unfolded measurements

Now that we studied the NP impacts for each POI, we can finally extract the measurements.
Note, that in this exercise we are skipping couple of checks that have to be done before the unblinding. Namely the goodness of fit test and the post-fit plots of folded observables. Both of these checks were detailed in the previous exercises, you can find the description under the following links.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"skipping couple of checks" --> Maybe something like "we are skipping further checks and validation that you should do on your analysis for the purposes of the tutorial."

```shell
python scripts/make_XSplot.py summary_zh_stxs.json
```
![](figures/stxs_zh.png)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason I am not seeing observed values for two of the bins? Maybe it would be good to comment on this briefly.

@anigamova
Copy link
Collaborator Author

Thanks @kcormi, I will address your comments soon. But what I've realised actually, is that it's not completely ok to have some of these plots on combine public docs pages, even if the analysis is already public. I'm using only partial run2 datacards to be able to fit them within 1-2h so these results are not public or going to be public. I will move these plot to the cms-analysis gitlab

@kcormi
Copy link
Collaborator

kcormi commented Sep 26, 2023

plot

Hi @anigamova what if you generate some pseudodata from the model, and then use that as the input data histogram? Then I think we could make everything public right? There's no real data being used in that case, but it should work more or less as if it were.

@anigamova
Copy link
Collaborator Author

Added the labels for the plots, should be clear now that they are only intended for this particular tutorial


There's a set of combine (.txt) datacards which will help you get through the various parts of the exercise. The exercises should help you become familiar with the structure of fitting datacards.

Note that the general recomendation on unfolding in `Combine` are available [here](https://cms-analysis.github.io/HiggsAnalysis-CombinedLimit/part3/regularisation/), which also includes recommendations on regularisation techniques and when to use it, which is completely is not discussed in this tutorial at all.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"which is completely is not discussed" --> "is not discussed"

```
where we use `--PO 'map=bin/process:poi[init, min, max]'`. So in the example above a signal POI is assigned to each gen-level bin independent on reco-level bin. This allows to take into account the non-trivial acceptance effects. One can also perform bin-by-bin unfolding using the mapping to the bin names rather that processes, e.g. `'map= vhbb_Zmm_gt400_13TeV/.*:r_reco_zh_gt400[1,-5,5]'`, but this method is not recommended and can be used for tests.
In the example given above a signal POI is assigned to each gen-level bin independent on reco-level bin. This allows to take into account the non-trivial acceptance effects. One can also perform bin-by-bin unfolding using the mapping to the bin names rather that processes, e.g. `'map= vhbb_Zmm_gt400_13TeV/.*:r_reco_zh_gt400[1,-5,5]'`, but this method is not recommended and can be used only for tests as another way to ensure that the migration matrix is close to diagonal.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you mean "to take into account migrations" rather than "acceptance effects" here, and that will maybe be a bit more clear to the reader.


## Shape analysis with control regions

One of the advantages of the maximum likelihood unfolding is the flexibility to choose the analysis observable and include more information on the event kinematics, consequently improving the analysis sensitivity. This analysis benefits from the shape information of the DNN output trained to differentiate the VH(bb) signal from the SM backgrounds.

The datacards for this part of the exercise located `full_model_datacards/`, where you can find a separate datacard for each region within `full_model_datacards/regions` directory and also a combined datacard `full_model_datacards/comb_full_model.txt`.

As you will find the datacards also contain several background processes. To control them properly we will add the regions enriched in the respective backgrounds. Then we can define a common set rate parameters for signal and control regions to scale the rates or other parameters affecting their shape.
As you will find the datacards also contain several background processes. To control them properly we will add the regions enriched in the respective backgrounds. Then we can define a common set of rate parameters for signal and control regions to scale the rates or other parameters affecting their shape.

For the shape datacards one has to specify the mapping of histograms and channels/processes as given described below:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we could also link to the part of the documentation that describes this, for further information.

@@ -142,26 +143,33 @@ Then the `shape` nuisance parameters can be defined in the systematics block in

In the realistic CMS analysis there are hundreds of nuisance parameters corresponding to various source of systematics.

When we unfold to the gen-level observable we should remove the nuisances affecting the rate of the gen-level bins, i.e. the `lnN` NPs: `THU_ZH_mig*, THU_ZH_inc` and keep only the acceptance `shape` uncertainties: `THU_ZH_acc` and `THU_ggZH_acc`. This can be achieved by freezing the respective nuisance parameters with the option `--freezeParameters par_name1,par_name2`. Alternatively you can create a group following the syntax given below at the end of the combined datacard, and freeze the parameters with the `--freezeNuisanceGroups group_name` option.
When we unfold to the gen-level observable we should remove the nuisances affecting the rate of the gen-level bins, i.e. the `lnN` NPs: `THU_ZH_mig*, THU_ZH_inc` and keep only the acceptance `shape` uncertainties: `THU_ZH_acc` and `THU_ggZH_acc`, which do not scale the inclusive cross sections by construction.
This can be achieved by freezing the respective nuisance parameters with the option `--freezeParameters par_name1,par_name2`. Alternatively you can create a group following the syntax given below at the end of the combined datacard, and freeze the parameters with the `--freezeNuisanceGroups group_name` option.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe before this it would be good to point out explicitly that for this analysis you have already done the work of separating out the normalization and shape components by normalizing the theoretical variations to the nominal generator-level cross sections for each bin?


![](figures/scan_plot_r_zh_75_150.png)

Repeat the same command for other POIs to fill the `summary_zh_stxs.json`, which can then be used to create the cross section plot as shown below.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it would also be good to add a few pedagogical lines here to the effect of:

"In order to plot our differential cross sections measurements at generator-level, we need to multiply the fitted parameter of interest values by their original cross sections which are stored in .... "

Copy link
Collaborator

@kcormi kcormi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some more comments about the presentation, mostly suggesting to make the writing and exposition a bit more pedagogical and try to make sure that the people following the tutorial understand how each of the pieces fit together.

# Combine unfolding exercise

## Getting started
By now you should have a working setup of Combine v9 from the pre-tutorial exercise. If so then move onto the cloning of the parametric fitting exercise gitlab repo below. If not then you need to set up a CMSSW area and checkout the combine package:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if for this part it actually makes more sense to link directly to the installation instructions and tell them to use one of those setups. That way we avoid repeating information (and lower the risk of it going out of date or changing)?


![](figures/simplifiedXS_VH_1_2.png)

In this tutorial we will focus on the ZH production, with Z boson decaying to charged leptons, and Higgs boson reconstructed with the resolved $b\bar{b}$ pair. We will also use only a part of the Run 2 categories, so the analysis sensitivity is not going to be achieved. Note that ggZH and ZH production modes are combined in the fit, since it is not possible to resolve them at this stage of the analysis. The STXS categories are defined independently of the Higgs decay channel, to streamline the combinations of the cross section measurement.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the analysis sensitivity is not going to be achieved" --> maybe "we will not achieve the same sensitivity as the full analysis."


## Simple datacards, one-bin measurement

When constructing the reco-level for any differential analysis the main goal is to match the gen-level bins as closely as possible. In the simplest case it can be done with the cut-based approach, i.e. applying the selection on the corresponding reco-level variables: $p_{T}(Z)$ and $n_{\text{add. jets}}$.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a word is missing in the first sentence here?

Maybe e.g. "When determining the reco-level binning for any differential analysis" (it may also be good to use "detector-level" rather than "reco-level", but not sure consistency with either choice is probably the most important thing)

I wonder if in general spelling out "generator level" and "detector level" is a bit more understandable and good for the tutorial.


One of the most important stages in the analysis design, is to make sure that the reco-level categories are pure with the corresponding gen-level processes.

To explicitly check it, one can plot the contributions of gen-level bins in all of the reco-level bins. We propose to use the script provided in the tutorial git-lab page. This script uses `CombineHarvester` to loop over detector level bins, and get the rate at which each of the signal processes (generator-level bins) contributes to that detector-level bin; which is then used to plot the migration matrix.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe "we propose to use the script" --> "You can use the script"

```shell
text2workspace.py -m 125 counting/combined_ratesOnly.txt -P HiggsAnalysis.CombinedLimit.PhysicsModel:multiSignalModel --PO verbose --PO 'map=.*/.*ZH_lep_PTV_75_150_hbb:r_zh_75_150[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_0J_hbb:r_zh_150_250noj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_150_250_GE1J_hbb:r_zh_150_250wj[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_250_400_hbb:r_zh_250_400[1,-5,5]' --PO 'map=.*/.*ZH_lep_PTV_GT400_hbb:r_zh_gt400[1,-5,5]' -o ws_counting.root
```
In the example given above a signal POI is assigned to each gen-level bin independent on reco-level bin. This allows to take into account the non-trivial to take into account migration. One can also perform bin-by-bin unfolding using the mapping to the bin names rather that processes, e.g. `'map= vhbb_Zmm_gt400_13TeV/.*:r_reco_zh_gt400[1,-5,5]'`, but this method is not recommended and can be used only for tests as another way to ensure that the migration matrix is close to diagonal.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Independent on" --> "Independent of"

"This allows to take into account the non-trivial to take into account migration" --> "This allows us to take into account migrations from a given generator level bin into various detector level bins."

![](figures/simplifiedXS_VH_1_2.png)

In this tutorial we will focus on the ZH production, with Z boson decaying to charged leptons, and Higgs boson reconstructed with the resolved $b\bar{b}$ pair. We will also use only a part of the Run 2 categories, so the analysis sensitivity is not going to be achieved. Note that ggZH and ZH production modes are combined in the fit, since it is not possible to resolve them at this stage of the analysis. The STXS categories are defined independently of the Higgs decay channel, to streamline the combinations of the cross section measurement.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we could also add a few words on the structure of the tutorial here.

e.g.

"In the first part of the tutorial, we will setup a relatively simple unfolding, where there is a single detector-level bin for every generator-level bin we are trying to measure. We will then perform a blind analysis using this setup to see the expected sensitivity.

In the second part of the tutorial we will perform the same measurement with a more advanced setup, making use of differential distributions per generator-level bin we are trying to measure, as well as control regions. By providing this additional information to the fit, we are able to achieve a better and more robust unfolding result. After checking the expected sensitivity, we will take a look at the impacts and pulls of the nuisance parameters. Then we will unblind and look at the results of the measurement, produce generator-level plots and provide the correlation matrix for our measured observables."


The hands-on exercise is split into seven parts:

1) Counting experiment
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It wonder if it might be more clear to change this name from something like "counting experiment" to "simplified analysis strategy" or "simplified unfolding"?

And similarly, maybe changing "Shape analysis with control regions" to something like "Extending the analysis with other variables and control regions"

```

![](figures/impacts_zh_75_150.png)
* Do you observe differences in impacts plots for different POIs, do these differences make sense to you?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because of the layout, I think people might miss this question if it is after the plot, might be easier to see if it comes just before the figure?

## Unfolded measurements

Now that we studied the NP impacts for each POI, we can finally extract the measurements.
Note, that in this exercise we are skipping further checks and validation that you should do on your analysis for the purposes of the tutorial. Namely the goodness of fit test and the post-fit plots of folded observables. Both of these checks were detailed in the previous exercises, you can find the description under the following links.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, it was my suggestion earlier but maybe its better to remove "for the puproses of the tutorial" here. To make sure no one thinks we are saying that they should do the checks for the purposes of the tutorial. Alternatively we could write it as "Note, that in this exercise, for the purposes of the tutorial, we are skipping ... "

(and I think we are missing the links referred to by "the following links")


## POIs correlations

In addition to the cross-section measurements it is very important to publish correlation matrix of measured cross sections.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"to publish the correlation matrix" (missing the) "of the measured cross sections" (again missing the)

@anigamova
Copy link
Collaborator Author

Thanks for the updates @kcormi, all good from my side.

@kcormi kcormi merged commit f5d8df9 into cms-analysis:main Feb 7, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants