This presentation focuses on the estimation of the variance function in regression and on its applications to regression with reject option and to prediction intervals.
First, we are interested in estimating the variance function through two methods: model selection (MS) and convex aggregation (C). The goal of the MS procedure is to select the best estimator from a given set of predictors, while the C procedure aims to build the best convex combination of these predictors. The resulting predictors are referred to as the MS-estimator and the C-estimator, respectively. The construction of both estimators relies on a two-step procedure based on two independent samples. In the first step, we use the first sample to build estimators of the variance function through a residual-based method. In the second step, we aggregate these estimators using the second sample. We establish the consistency of both the MS-estimator and the C-estimator with respect to the L2-risk.
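As an illustration, the following minimal Python sketch mimics the two-step scheme: candidate variance estimators are fit to squared residuals on the first sample, then a single candidate is selected (MS) or a convex combination is built (C) by minimizing an empirical risk on the second sample. Every concrete choice here (random forest for the mean, kNN regressors on squared residuals, exponentiated-gradient weights on the simplex) is an illustrative stand-in, not the procedure analyzed in the work.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)

# Synthetic heteroscedastic data: Y = f(X) + sigma(X) * eps.
n = 2000
X = rng.uniform(-2, 2, size=(n, 1))
Y = np.sin(2 * X[:, 0]) + (0.3 + 0.5 * X[:, 0] ** 2) * rng.standard_normal(n)

# Two independent samples.
X1, Y1, X2, Y2 = X[: n // 2], Y[: n // 2], X[n // 2 :], Y[n // 2 :]

# Step 1 (first sample): residual-based variance estimators.
# Estimate the regression function, then regress squared residuals on X.
reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X1, Y1)
sq_res = (Y1 - reg.predict(X1)) ** 2
candidates = [
    KNeighborsRegressor(n_neighbors=k).fit(X1, sq_res) for k in (5, 20, 80)
]

# Step 2 (second sample): aggregate.  Squared residuals on the second
# sample serve as noisy targets for the variance function.
sq_res2 = (Y2 - reg.predict(X2)) ** 2
preds = np.stack([c.predict(X2) for c in candidates])   # shape (M, n/2)
risks = ((preds - sq_res2) ** 2).mean(axis=1)

# MS-estimator: the single candidate with smallest empirical risk.
ms_index = int(np.argmin(risks))

# C-estimator: convex weights found by exponentiated gradient descent,
# a simple stand-in for the aggregation scheme studied in the work.
M = len(candidates)
w = np.full(M, 1.0 / M)
for _ in range(500):
    grad = 2 * preds @ (w @ preds - sq_res2) / preds.shape[1]
    w = w * np.exp(-0.05 * grad)
    w /= w.sum()

print("MS picks candidate", ms_index, "with empirical risk", risks[ms_index])
print("C weights:", np.round(w, 3))
```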
Next, we shift our focus to the regression problem in which the predictor is allowed to abstain from predicting. We focus on the case where the rejection rate is fixed and derive the optimal rule, which relies on thresholding the conditional variance function. We provide a semi-supervised estimation procedure for this optimal rule. The resulting predictor with reject option is shown to be almost as good as the optimal predictor with reject option, both in terms of risk and of rejection rate. We additionally apply our methodology to the kNN algorithm and establish rates of convergence for the resulting kNN predictor under mild conditions. Finally, a numerical study illustrates the benefits of the proposed procedure.
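A minimal sketch of this idea, under simplifying assumptions: kNN estimates of the regression and conditional variance functions are fit on a labeled sample, and the variance threshold is calibrated on an unlabeled sample so that a target fraction of points is rejected (the semi-supervised step). The specific estimators and sample sizes are illustrative, not those of the analysis.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(1)

# Heteroscedastic data: the noise level grows with |x|.
n = 3000
X = rng.uniform(-2, 2, size=(n, 1))
Y = np.sin(2 * X[:, 0]) + (0.2 + 0.6 * np.abs(X[:, 0])) * rng.standard_normal(n)

# Labeled sample: kNN estimates of the regression function and of the
# conditional variance (via squared residuals).
Xl, Yl = X[:2000], Y[:2000]
knn_reg = KNeighborsRegressor(n_neighbors=30).fit(Xl, Yl)
sq_res = (Yl - knn_reg.predict(Xl)) ** 2
knn_var = KNeighborsRegressor(n_neighbors=30).fit(Xl, sq_res)

# Unlabeled sample: calibrate the variance threshold so that a target
# fraction eps of points is rejected.
Xu = rng.uniform(-2, 2, size=(5000, 1))
eps = 0.2
threshold = np.quantile(knn_var.predict(Xu), 1 - eps)

def predict_with_reject(x):
    """Predict everywhere, but abstain where the estimated conditional
    variance exceeds the calibrated threshold."""
    accepted = knn_var.predict(x) <= threshold
    return knn_reg.predict(x), accepted

Xtest = X[2000:]
pred, acc = predict_with_reject(Xtest)
print(f"rejection rate on test points: {1 - acc.mean():.3f}")
```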
Finally, we tackle the problem of building a prediction interval in heteroscedastic Gaussian regression. We focus on prediction intervals with constrained expected length in order to guarantee the interpretability of the output. In this framework, we derive a closed-form expression of the optimal prediction interval, which allows for the development of a data-driven prediction interval based on the plug-in principle. The proposed algorithm relies on two samples, one labeled and the other unlabeled. Under mild conditions, we show that our procedure is asymptotically as good as the optimal prediction interval, both in terms of expected length and of error rate. A numerical analysis demonstrates the good performance of our method.
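The sketch below shows a simplified plug-in construction in this spirit, not the closed-form rule derived in the work: the regression and variance functions are estimated on a labeled sample, and intervals of the hypothetical form f_hat(x) ± lam * sigma_hat(x) are calibrated on an unlabeled sample so that the average length matches a budget L. All estimator and parameter choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(2)

# Labeled sample: heteroscedastic Gaussian regression data.
n = 3000
X = rng.uniform(0, 1, size=(n, 1))
Y = np.cos(4 * X[:, 0]) + (0.2 + 0.8 * X[:, 0]) * rng.standard_normal(n)

# Plug-in estimates of the regression and variance functions.
f_hat = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, Y)
sq_res = (Y - f_hat.predict(X)) ** 2
var_hat = KNeighborsRegressor(n_neighbors=50).fit(X, sq_res)

# Unlabeled sample: calibrate lam so that the average interval length
# 2 * lam * mean(sigma_hat(X)) matches the expected-length budget L.
Xu = rng.uniform(0, 1, size=(10000, 1))
sig_hat_u = np.sqrt(np.clip(var_hat.predict(Xu), 0, None))
L = 1.0
lam = L / (2 * sig_hat_u.mean())

def prediction_interval(x):
    """Interval f_hat(x) +/- lam * sigma_hat(x)."""
    center = f_hat.predict(x)
    half = lam * np.sqrt(np.clip(var_hat.predict(x), 0, None))
    return center - half, center + half

# Evaluate on a fresh test sample.
Xt = rng.uniform(0, 1, size=(2000, 1))
Yt = np.cos(4 * Xt[:, 0]) + (0.2 + 0.8 * Xt[:, 0]) * rng.standard_normal(2000)
lo, hi = prediction_interval(Xt)
print(f"empirical coverage: {np.mean((lo <= Yt) & (Yt <= hi)):.3f}")
print(f"average length: {np.mean(hi - lo):.3f}  (budget L = {L})")
```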