-
-
Notifications
You must be signed in to change notification settings - Fork 194
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Estimation of uncertainty associated to predictions from ranger #136
Comments
As discussed via email, we are currently investigating two approaches for variance estimation: The first one is based on the infinitesimal Jackknife (Wager et al. 2014), the other on U-statistics (Mentch & Hooker 2016). So expect something to happen here, but it will take some time. |
Since #188 standard errors of predictions can be computed for regression forests using the Jackknife. Example: idx <- sample(nrow(mtcars), 10)
test <- mtcars[idx, ]
train <- mtcars[-idx, ]
rf <- ranger(mpg ~., train, num.trees = 50, keep.inbag = TRUE)
pred <- predict(rf, test, type = "se")
pred$predictions
pred$se |
Great work Marvin, I have just checked your new code and I get:
So there is a very good match between the median of all se's and what comes as OOB prediction error (MSE). I did notice however that if you do not set the number of trees low (50-85) then computing time blows up if the |
One remaining issue is how to derive 95% prob prediction interval using ranger. I know that we could use the simple formula for this:
But I wonder if you could also maybe implement it using the approach of Wager et al (2014) so it can be requested via the |
I have just tested the |
Yes, it's not multithreaded in the current implementation. It's also in plain R, so I'm sure a speedup is possible. In Wager et al. the CI's are computed the same way, see the footnote on p.2 in their paper. |
I've did some research and I came up to five methods that can be used to derive prediction errors (i.e. confidence limits) at new prediction locations (the main focus here is on regression problems i.e. numeric target variable):
From the five above, the quantile regression forest approach seems to be most attractive, but it is very computational. I would not mind if the confidence intervals are not super precise (but they should definitively be unbiased). At the moment using quantregRanger to predict confidence intervals for millions of pixels is not realistic. The infinitesimal Jackknife approach is at the order of magnitude faster, but seem to result in too wide confidence intervals. I am eager to find out which approach to derivation of prediction errors / intervals gives tolerable precision, is unbiased, while being 'fast enough'. |
@thengl do you want prediction intervals (i.e., intervals that cover new observations Yi with high probability), or confidence intervals (i.e., intervals that cover mu(x) = E[Y|X=x] with high probability)? In the former case, quantile regression forests are the way to go. The suggestion of of Greg Snow, properly calibrated, could perhaps function as sort of a back-of-the-envelope approximation to quantile regression forests; but it won't give you any formal guarantees. For confidence intervals, the infinitesimal jackknife and the method of Mentch and Hooker are very closely related, both corresponding to the CIs available via ranger. The |
@swager Can prediction intervals be derived from the IJ estimates somehow? We'd need an additional variance component, right? |
I think that in most problems (and certainly in any regime where we can do asymptotics), the irreducible noise Ei = Yi - mu(Xi) dwarfs the estimation error mu(Xi) - hat(mu)(Xi). The implication is that, to get accurate prediction intervals, it may be much more important to model Ei accurately than to quantify the latter term well (which is what confidence intervals are after). If there's interest in quick and dirty prediction intervals, maybe something like this could work?
|
Thank you for your work on this, especially for implementing it in a well-documented R package. I am also interested in the variance of ranger predictions on spatial data. In addition to the variance at train/test points, I am also interested in aggregating the variance across subsets of the dataset - is there a way to get the covariance of ranger predictions? My example is predicting species abundance in space and time (similar to @thengl), then producing a CI for abundance estimates in each year (aggregating across space, but variance is spatially correlated as in Tomislav's map above). Something like:
Apologies if this is documented elsewhere. The only R implementation of covariance for RF predictions I've found is at http://shftan.github.io/surfin/demo.html. Thanks, |
Brian, here are some worked out examples of how to derive s.d. of prediction error and standard error using ranger and spatial data: https://github.com/thengl/GeoMLA/blob/master/RF_vs_kriging/R/RF_uncertainty_maps.R Having said all this, derivation of the prediction error with ranger is complex and we are still discussing differences and issues. |
Thanks for pointing me to your examples - very helpful. I've been able to produce a time-series of uncertainty maps, and my question here is about the variance of the sum of spatially correlated predictions. I.e, in your spatiotemporal rainfall ex, if you now want to plot a time series of total rainfall (sum of mean predictions across space in a given year), with error bars for this total. If the observations were independent, you could just do |
As far as I know, the only implementation to get covariances of predictions is the |
Thanks for the reply, I will check that out. I'm sure you have a long list of features to work on, but if you do have time... producing estimates of the covariance between RF predictions would greatly expand what we can use spatial RF models for (at least in my field, fisheries/ecology). |
Hi. Thanks to everyone here for developing such a nicely worked out package. Just to clarify, because the language is unclear: |
Yes, that's correct. Where do you think the language is unclear? Any suggestions for improvement? |
I was unsure whether you were using the phrase "prediction" in the technical sense or the colloquial sense (where it just means anything that comes out of a model). I suggest adding a sense somewhere clarifying that these are "confidence intervals for the mean, and not prediction intervals" or similar. Thanks! |
This issue is quite old but still open. forestError is a new R package that's out-of-box compatible with ranger (and on CRAN as of yesterday, 14 Jan 2020), implementing methods for estimating "conditional mean squared prediction errors, conditional biases, conditional prediction intervals, and conditional error distributions for random forest predictions." The ArXiv preprint is here. This might be a helpful addition to the README/docs. |
I'm bumping into a case with quantile regression forests with {ranger} where my coverage is substantially higher than my prediction intervals. I posted the question on Stack Overflow: Prediction Intervals from Quantile Regression Forests have higher coverage than expected?. I'm curious if you have any clues as to why I would be getting substantially higher coverage rates than would be expected? (Or could point me in the correct direction.) (I described the problem further in a draft blog post that prompted the question... I tried different values of |
I've made few code examples that I think could be very generic (and hence of interest to ranger package users) and that provide information about how to estimate prediction uncertainty / associated prediction errors. FYI, we use ranger to generate spatial predictions mainly (hence maps). Points indicated on maps are the training points used to build models.
I consider two generic cases:
The second stream requires extra computing and some parameters needs to be set by hand such as number of folds (one extra model) but it also produces reasonable results.
It would be nice to see, in near future that, when function
predict.ranger
is triggered, that also standard errors can be derived for any type of model.The text was updated successfully, but these errors were encountered: