Found 364
On practical implementation of the fully robust one-sided cross-validation method in the nonparametric regression and density estimation contexts
Savchuk O.
Q2
Springer Nature
Computational Statistics, 2025, citations: 0, doi.org, Abstract
The fully robust one-sided cross-validation (OSCV) method has versions in the nonparametric regression and density estimation settings. It selects consistent bandwidths for estimating continuous regression and density functions that may have finitely many discontinuities in their first derivatives. The theoretical results underlying the method were thoroughly elaborated in preceding publications, while its practical implementation needed improvement. In particular, until this publication, no appropriate implementation of the method existed in the density estimation context. In the regression setting, the previously proposed implementation has the serious disadvantage of occasionally producing irregular OSCV functions, which complicates the bandwidth selection procedure. In this article, we make substantial progress towards resolving these issues by proposing a suitable implementation of fully robust OSCV for density estimation and by providing specific recommendations for further improvement of the method in the regression setting.
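For context, the generic OSCV recipe that the fully robust version builds on can be sketched as follows (notation ours, not the paper's): the cross-validation criterion is computed for a one-sided leave-one-out estimator, and its minimizer is rescaled by a known kernel-dependent constant.

```latex
% Generic OSCV scheme (sketch; notation is ours). The bandwidth b-hat
% minimizes the CV criterion of a one-sided leave-one-out estimator m^-;
% C is a known constant depending on the kernel.
\hat{b} = \operatorname*{arg\,min}_{b}\,
  \frac{1}{n}\sum_{i=1}^{n}\bigl(Y_i - \hat{m}^{\,-}_{b,-i}(X_i)\bigr)^{2},
\qquad
\hat{h}_{\mathrm{OSCV}} = C\,\hat{b}.
```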
Lasso multinomial performance indicators for in-play basketball data
Damoulaki A., Ntzoufras I., Pelechrinis K.
Q2
Springer Nature
Computational Statistics, 2025, citations: 0, doi.org, Abstract
A typical approach to quantifying the contribution of each player in basketball uses the plus–minus method. The ratings obtained by such a method are estimated using simple regression models and their regularized variants, with the response variable being either the points scored or the point differences. To capture the effect of each player more precisely, detailed possession-based play-by-play data may be used. This is the direction we take in this article, in which we investigate the performance of regularized adjusted plus–minus (RAPM) indicators estimated by different regularized models having as a response the number of points scored in each possession. To this end, we use possession play-by-play data from all NBA games for the 2021–2022 season (322,852 possessions). We initially present simple regression model-based indices, starting from the implementation of ridge regression, which is the standard technique in the relevant literature. We proceed with the lasso approach, which has specific advantages and performs better than ridge regression under selected objective validation criteria. Then, we implement regularized binary and multinomial logistic regression models to obtain more accurate performance indicators, since the response is a discrete variable taking values mainly from zero to three. Our final proposal is an improved RAPM measure based on the expected points of a multinomial logistic regression model, where each player's contribution is weighted by his participation in the team's possessions. The proposed indicator, called weighted expected points (wEPTS), outperforms all other RAPM measures we investigate in this study.
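For readers who want to see the baseline concretely, here is a minimal, hedged sketch of the standard ridge-RAPM setup in Python; the data are synthetic, and the sizes, scoring probabilities, and penalty value are illustrative assumptions rather than the paper's choices.

```python
# Ridge-RAPM baseline on synthetic possessions (illustrative only).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n_possessions, n_players = 5000, 100

# Design matrix: +1 for each of the 5 offensive players on court,
# -1 for each of the 5 defenders.
X = np.zeros((n_possessions, n_players))
for i in range(n_possessions):
    offense = rng.choice(n_players, size=5, replace=False)
    rest = np.setdiff1d(np.arange(n_players), offense)
    defense = rng.choice(rest, size=5, replace=False)
    X[i, offense] = 1.0
    X[i, defense] = -1.0

# Response: points scored on the possession (mostly 0 to 3).
y = rng.choice([0, 1, 2, 3], size=n_possessions, p=[0.55, 0.05, 0.3, 0.1])

# One ridge coefficient per player is that player's RAPM rating.
model = Ridge(alpha=2000.0)  # alpha would normally be tuned by CV
model.fit(X, y)
rapm = model.coef_
```

With real play-by-play data the rows encode actual lineups and alpha is tuned by cross-validation; the paper's lasso and multinomial logistic variants depart from this baseline.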
Nonparametric CUSUM change-point detection procedures based on modified empirical likelihood
Wang P., Ning W.
Q2
Springer Nature
Computational Statistics, 2025, citations: 0, doi.org, Abstract
Sequential change-point analysis, which identifies a change of probability distribution in a sequence of random observations, has important applications in many fields. A good method should detect the change point as soon as possible while keeping a low rate of false alarms. As an outstanding procedure, Page's CUSUM rule enjoys many optimality properties. However, its implementation requires the pre-change and post-change distributions to be known, which is not achievable in practice. In this article, we propose a nonparametric CUSUM procedure that embeds different versions of empirical likelihood, assuming that two training samples, one before and one after the change, are available for parameter estimation. Simulations are conducted to compare the performance of the proposed methods with that of existing methods. The results show that when the underlying distribution is unknown and the training sample sizes are small, our modified procedures exhibit an advantage by giving a smaller detection delay. A well-log data set is provided to illustrate the detection procedure.
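For reference, Page's CUSUM rule mentioned above takes the following classical form when the pre- and post-change densities f0 and f1 are known; the proposed procedure replaces the unknown likelihood ratio with empirical-likelihood surrogates built from the two training samples.

```latex
% Page's CUSUM statistic and stopping rule (classical known-density form);
% the threshold h is calibrated to the tolerated false-alarm rate.
W_0 = 0, \qquad
W_n = \max\!\Bigl(0,\; W_{n-1} + \log\frac{f_1(X_n)}{f_0(X_n)}\Bigr), \qquad
T = \inf\{\, n \ge 1 : W_n \ge h \,\}.
```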
Sequential Monte Carlo for cut-Bayesian posterior computation
Mathews J., Gopalan G., Gattiker J., Smith S., Francom D.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
We propose a sequential Monte Carlo (SMC) method to efficiently and accurately compute cut-Bayesian posterior quantities of interest, variations of standard Bayesian approaches constructed primarily to account for model misspecification. We prove finite sample concentration bounds for estimators derived from the proposed method and apply these results to a realistic setting where a computer model is misspecified. Two theoretically justified variations are presented for making the sequential Monte Carlo estimator more computationally efficient, based on linear tempering and finding suitable permutations of initial parameter draws. We then illustrate the SMC method for inference in a modular chemical reactor example that includes submodels for reaction kinetics, turbulence, mass transfer, and diffusion. The samples obtained are commensurate with a direct-sampling approach that consists of running multiple Markov chains, with computational efficiency gains using the SMC method. Overall, the SMC method presented yields a novel, rigorous approach to computing with cut-Bayesian posterior distributions.
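For context, in the standard two-module setting the cut posterior blocks feedback from a suspect module; a generic form (notation ours, not necessarily the paper's) is:

```latex
% Cut posterior for two modules: theta_1 with trusted data y,
% theta_2 with possibly misspecified data z. Feedback from z to theta_1
% is "cut".
\pi_{\mathrm{cut}}(\theta_1, \theta_2 \mid y, z)
  = \pi(\theta_1 \mid y)\,\pi(\theta_2 \mid \theta_1, z).
```

Because the first factor is not updated by z, this is not a standard posterior and is awkward to sample directly, which is what motivates SMC-style schemes.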
Efficient estimation of a disease prevalence using auxiliary ranks information
Zamanzade E., Saboori H., Samawi H.M.
Q2
Springer Nature
Computational Statistics, 2024, citations: 1, doi.org, Abstract
Obtaining the prevalence of a specific disease within a given population is a common challenge in the medical field. To tackle this problem, researchers usually draw a random sample from the target population to obtain an accurate estimate of the proportion of diseased people. In practice, however, constraints such as complexity or cost may limit this approach. In these situations, alternative sampling techniques are needed to achieve precision with smaller sample sizes. One such approach is Neoteric Ranked Set Sampling (NRSS), a variation of the Ranked Set Sampling (RSS) design. The NRSS scheme selects sample units using a rank-based method that incorporates auxiliary information to obtain a more informative sample. In this article, we focus on the problem of estimating the population proportion using NRSS. We develop an estimator for the population proportion under the NRSS design and establish some of its properties. We employ Monte Carlo simulations to compare the proposed estimator with competitors under Simple Random Sampling (SRS) and RSS designs. Our results demonstrate that statistical inference based on the introduced estimator can be significantly more efficient than its competitors under RSS and SRS designs. Finally, to demonstrate the effectiveness of the proposed procedure in estimating breast cancer prevalence within the target population, we apply it to the Wisconsin Breast Cancer data.
A memorial for the late Professor Friedrich Leisch
Symanzik J., Mori Y., Vieu P.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org
Conditional sufficient variable selection with prior information
Wang P., Lu J., Weng J., Mitra S.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
Dimension reduction and variable selection play crucial roles in high-dimensional data analysis. Numerous existing methods have been demonstrated to attain either or both of these goals. The Minimum Average Variance Estimation (MAVE) method and its variants are effective approaches for estimating directions of the Central Mean Subspace (CMS). The Sparse Minimum Average Variance Estimation (SMAVE) combines sufficient dimension reduction and variable selection and has been demonstrated to exhaustively estimate the CMS while simultaneously selecting informative variables using the LASSO, without assuming any specific model or distribution on the predictor variables. In many applications, however, researchers possess prior knowledge of a set of predictors associated with the response. In the presence of such a known set of variables, the conditional contribution of additional predictors provides a natural evaluation of their relative importance. Based on this idea, we propose the Conditional Sparse Minimum Average Variance Estimation (CSMAVE) method. By utilizing prior information and creating a meaningful conditioning set for SMAVE, we aim to select variables that yield a more parsimonious model and a more accurate interpretation than SMAVE. We evaluate our strategy on simulation examples, comparing it with the SMAVE method, and a real-world dataset validates the applicability and efficiency of our method.
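For reference, the central mean subspace targeted by MAVE-type methods is, in the standard formulation, the smallest subspace spanned by the columns of a matrix B such that the conditional mean depends on the predictors only through their projection:

```latex
% Central Mean Subspace (CMS): the mean response depends on X only
% through B^T X, for B with the fewest possible columns.
E(Y \mid X) = E\bigl(Y \mid B^{\top} X\bigr).
```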
Should sports professionals consider their adversary’s strategy? A case study of match play in golf
Wajge N., Stauffer G.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
This study explores strategic considerations in professional golf's Match Play format. Leveraging Professional Golfers' Association Tour data, we investigate the impact of factoring in an adversary's strategy. Our findings suggest that while slight strategy adjustments can be advantageous in specific scenarios, the overall benefit of considering an opponent's strategy remains modest. This confirms the common wisdom in golf, reinforcing the recommendation to adhere to optimal stroke-play strategies due to challenges in obtaining precise opponent statistics. The methodology employed here is generic and could offer valuable insights into whether opponents' performances should also be considered in other two-player or team sports, such as tennis, darts, soccer, volleyball, etc. We hope that this research will pave the way for new avenues of study in these areas.
A new class of information criteria for improved prediction in the presence of training/validation data heterogeneity
Flores J.E., Cavanaugh J.E., Neath A.A.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
Information criteria provide a cogent approach for identifying models that provide an optimal balance between the competing objectives of goodness-of-fit and parsimony. Models that better conform to a dataset are often more complex, yet such models are plagued by greater variability in estimation and prediction. Conversely, overly simplistic models reduce variability at the cost of increases in bias. Asymptotically efficient criteria are those that, for large samples, select the fitted candidate model whose predictors minimize the mean squared prediction error, optimizing between prediction bias and variability. In the context of prediction, asymptotically efficient criteria are thus a preferred tool for model selection, with the Akaike information criterion (AIC) being among the most widely used. However, asymptotic efficiency relies upon the assumption of a panel of validation data generated independently from, but identically to, the set of training data. We argue that assuming identically distributed training and validation data is misaligned with the premise of prediction and often violated in practice. This is most apparent in a regression context, where assuming training/validation data homogeneity requires identical panels of regressors. We therefore develop a new class of predictive information criteria (PIC) that do not assume training/validation data homogeneity and are shown to generalize AIC to the more practically relevant setting of training/validation data heterogeneity. The analytic properties and predictive performance of these new criteria are explored within the traditional regression framework. We consider both simulated and real-data settings. Software for implementing these methods is provided in the R package, picR, available through CRAN.
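For reference, the asymptotically efficient criterion that the new class generalizes is the familiar AIC:

```latex
% Akaike information criterion: maximized log-likelihood penalized by
% the number k of estimated parameters.
\mathrm{AIC} = -2\,\log L(\hat{\theta}) + 2k.
```

The PIC class retains this bias-correction logic while dropping the assumption that the validation design matches the training design.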
Forecasting the cost of drought events in France by Super Learning from a short time series of many slightly dependent data
Ecoto G., Bibaut A.F., Chambaz A.
Q2
Springer Nature
Computational Statistics, 2024, citations: 1, doi.org, Abstract
Drought events are the second most expensive type of natural disaster within the French legal framework known as the natural disasters compensation scheme. In recent years, drought events have been remarkable in their geographical location and scale and in their intensity. We develop and apply a new methodology to forecast the cost of a drought event in France. The methodology hinges on Super Learning (van der Laan et al. in Stat Appl Genet Mol Biol 6:23, 2007; Benkeser et al. Stat Med 37:249-260, 2018), a general aggregation strategy to learn a feature of the law of the data identified through an ad hoc risk function by relying on a library of algorithms. The algorithms either compete (discrete Super Learning) or collaborate (continuous Super Learning), with a cross-validation scheme determining the best performing algorithm or combination of algorithms. The theoretical analysis reveals that our Super Learner can learn from a short time series where each time-t-specific data-structure consists of many slightly dependent data indexed by a. We use a dependency graph to model the amount of conditional independence within each t-specific data-structure and a concentration inequality by Janson (Random Struct Algorithms 24:234-248, 2004) and leverage a large ratio of the number of distinct a-s to the degree of the dependency graph in the face of a small number of t-specific data-structures.
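For context, a generic statement (notation ours) of the two Super Learning flavors contrasted above: discrete SL picks the single cross-validation-best algorithm, while continuous SL picks the best convex combination of the library.

```latex
% Discrete SL: select the library algorithm minimizing the
% cross-validated risk R_CV.
\hat{k} = \operatorname*{arg\,min}_{k}\; \widehat{R}_{\mathrm{CV}}(\hat{f}_k).
% Continuous SL: select convex weights over the library.
\hat{\alpha} = \operatorname*{arg\,min}_{\alpha \ge 0,\ \sum_k \alpha_k = 1}
  \widehat{R}_{\mathrm{CV}}\Bigl(\textstyle\sum_k \alpha_k \hat{f}_k\Bigr),
\qquad
\hat{f} = \textstyle\sum_k \hat{\alpha}_k \hat{f}_k.
```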
Bayesian adaptive lasso quantile regression with non-ignorable missing responses
Chen R., Dao M., Ye K., Wang M.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
In this paper, we develop a fully Bayesian adaptive lasso quantile regression model to analyze data with non-ignorable missing responses, which frequently occur in various fields of study. Specifically, we employ a logistic regression model for the non-ignorable missingness mechanism. By using the asymmetric Laplace working likelihood for the data and specifying Laplace priors for the regression coefficients, our proposed method extends the Bayesian lasso framework by imposing a specific penalization parameter on each regression coefficient, enhancing estimation and variable selection. Furthermore, we exploit the normal-exponential mixture representation of the asymmetric Laplace distribution and the Student-t approximation of the logistic regression model to develop a simple and efficient Gibbs sampling algorithm for generating posterior samples and making statistical inferences. The finite-sample performance of the proposed algorithm is investigated through various simulation studies and a real-data example.
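For reference, the asymmetric Laplace working likelihood mentioned above takes the standard form below; maximizing it is equivalent to minimizing the check loss of quantile regression at level tau, and its normal-exponential mixture representation is what makes the Gibbs sampler tractable.

```latex
% Asymmetric Laplace density with location mu, scale sigma, quantile
% level tau, and check function rho_tau.
f(y \mid \mu, \sigma, \tau)
  = \frac{\tau(1-\tau)}{\sigma}
    \exp\!\Bigl\{-\rho_\tau\!\Bigl(\frac{y-\mu}{\sigma}\Bigr)\Bigr\},
\qquad
\rho_\tau(u) = u\,\bigl(\tau - I(u < 0)\bigr).
```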
Change point estimation for Gaussian time series data with copula-based Markov chain models
Sun L., Wang Y., Liu L., Emura T., Chiu C.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
This paper proposes a method for change-point estimation, focusing on detecting structural shifts within time series data. Traditional maximum likelihood estimation (MLE) methods assume either independence or linear dependence via auto-regressive models. To address this limitation, the paper introduces copula-based Markov chain models, offering more flexible dependence modeling. These models treat a Gaussian time series as a Markov chain and utilize copula functions to handle serial dependence. The profile MLE procedure is then employed to estimate the change-point and other model parameters, with the Newton–Raphson algorithm facilitating numerical calculations for the estimators. The proposed approach is evaluated through simulations and real stock return data, considering two distinct periods: the 2008 financial crisis and the COVID-19 pandemic in 2020.
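For context, the copula-based Markov chain construction referenced above is standard: for a stationary chain with marginal cdf F, marginal density f, and copula density c for consecutive pairs, the transition density factors as

```latex
% Serial dependence is carried entirely by the copula density c;
% the marginal distribution f can be modeled separately.
f(x_t \mid x_{t-1}) = c\bigl(F(x_{t-1}),\, F(x_t)\bigr)\, f(x_t).
```

In the change-point setting, the parameters before and after the change are then estimated jointly by profile maximum likelihood.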
A novel nonconvex, smooth-at-origin penalty for statistical learning
John M., Vettam S., Wu Y.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
Nonconvex penalties are utilized for regularization in high-dimensional statistical learning algorithms primarily because they yield unbiased or nearly unbiased estimators of the model parameters. Nonconvex penalties in the literature, such as SCAD, MCP, Laplace, and arctan, have a singularity at the origin, which also makes them useful for variable selection. However, in several high-dimensional frameworks, such as deep learning, variable selection is less of a concern. In this paper, we present a nonconvex penalty that is smooth at the origin. The paper includes asymptotic results for ordinary least squares estimators regularized with the new penalty function, showing an asymptotic bias that vanishes exponentially fast. We also conduct simulations to better understand the finite-sample properties, and we carry out an empirical study employing a deep neural network architecture on three datasets and a convolutional neural network on four datasets. The empirical study based on artificial neural networks shows better performance for the new regularization approach in five out of the seven datasets.
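For context, here are the penalized least squares objective and one of the cited singular-at-origin penalties, MCP, in their standard forms; the paper's new penalty replaces the kink at zero with a smooth minimum, forgoing exact sparsity.

```latex
% Penalized least squares with a coordinate-separable penalty p_lambda:
\min_{\beta}\; \frac{1}{2n}\,\lVert y - X\beta \rVert_2^2
  + \sum_{j} p_\lambda\bigl(\lvert \beta_j \rvert\bigr),
\qquad
% MCP (minimax concave penalty): singular at 0, constant beyond
% gamma * lambda, which is what yields exact zeros and near-unbiasedness.
p_\lambda^{\mathrm{MCP}}(t) =
  \begin{cases}
    \lambda t - \dfrac{t^2}{2\gamma}, & 0 \le t \le \gamma\lambda,\\[6pt]
    \dfrac{\gamma \lambda^2}{2},      & t > \gamma\lambda.
  \end{cases}
```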
BARMPy: Bayesian additive regression models Python package
Van Boxel D.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
We make Bayesian additive regression networks (BARN) available as a Python package, barmpy, with documentation at https://dvbuntu.github.io/barmpy/ for general machine learning practitioners. Our object-oriented design is compatible with SciKit-Learn, allowing the use of their tools such as cross-validation. To ease learning, we provide a companion tutorial that expands on the reference information in the documentation. Any interested user can pip install barmpy from the official PyPI repository. barmpy also serves as a baseline Python library for generic Bayesian additive regression models.
Semiparametric regression analysis of panel binary data with an informative observation process
Ge L., Li Y., Sun J.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
Panel binary data arise in event history studies when subjects are observed only at discrete time points rather than continuously, and the only available information on the occurrence of the recurrent event of interest is whether the event has occurred between consecutive observation times, i.e., within each observation window. Although some methods have been proposed for regression analysis of such data, all of them assume independent observation times or processes, which may not hold. To address this, we propose a joint modeling procedure that allows for informative observation processes. For the implementation of the proposed method, a computationally efficient EM algorithm is developed, and the resulting estimators are shown to be consistent and asymptotically normal. A simulation study conducted to assess its performance indicates that the method works well in practical situations, and the proposed approach is applied to the motivating data set from the Health and Retirement Study.
Site-specific nitrogen recommendation: fast, accurate, and feasible Bayesian kriging
Poursina D., Brorsen B.W.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
Bayesian Kriging (BK) provides a way to estimate regression models whose parameters are smoothed across space. Such estimates could help guide site-specific fertilizer recommendations. One advantage of BK is that it can readily fill in the missing values that are common in yield monitor data. The problem is that previous methods are too computationally intensive to be commercially feasible when estimating a nonlinear production function. This paper seeks to increase computational speed by imposing restrictions on the spatial covariance matrix. Previous research used an exponential function for the spatial covariance matrix; the two alternatives considered here are the conditional autoregressive and simultaneous autoregressive models. In addition, a new analytical solution is provided for finding the optimal nitrogen rate under a stochastic linear plateau model. A comparison of the models' accuracy and computational burden shows that the restrictions significantly reduce the computational burden, although they sacrifice some accuracy on the dataset considered.
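For context, a generic (non-spatial) linear plateau yield response, in our notation; the paper's model additionally smooths the parameters across space, and stochastic versions let the plateau vary randomly by site-year.

```latex
% Yield rises linearly in applied nitrogen N until it hits a
% plateau mu_max.
y_i = \min\bigl(\beta_0 + \beta_1 N_i,\; \mu_{\max}\bigr) + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2).
```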
Computational econometrics with gretl
Yalta A.T., Cottrell A., Rodrigues P.C.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org
Empirical likelihood change point detection in quantile regression models
Ratnasingam S., Gamage R.D.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
Quantile regression is an extension of linear regression that estimates a conditional quantile of interest. In this paper, we propose an empirical likelihood-based nonparametric procedure to detect structural changes in quantile regression models. Further, we modify the proposed smoothed empirical likelihood-based method using adjusted smoothed empirical likelihood and transformed smoothed empirical likelihood techniques. We show that, under the null hypothesis, the limiting distribution of the smoothed empirical likelihood ratio test statistic is identical to that of the classical parametric likelihood. Simulations are conducted to investigate the finite-sample properties of the proposed methods. Finally, to demonstrate its effectiveness, the proposed method is applied to urinary glycosaminoglycans (GAGs) data to detect structural changes.
Double truncation method for controlling local false discovery rate in case of spiky null
Kim S., Oh Y., Lim J., Park D., Green E.M., Ramos M.L., Jeong J.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
Many multiple testing procedures that control the false discovery rate have been developed to identify cases (e.g., genes) showing statistically significant differences between two groups. However, a common issue in practical data sets is the presence of highly spiky null distributions. Existing methods struggle to control the type I error in such cases because of inflated false positives, a problem that has not been addressed in the previous literature. Our team recently encountered this issue while analyzing SET4 gene deletion data and proposed modeling the null distribution using a scale mixture of normal distributions. However, that approach is limited by strong assumptions on the spiky peak. In this paper, we present a novel multiple testing procedure that can be applied to any type of spiky-peak data, including situations with no spiky peak or with one or two spiky peaks. Our approach truncates the central statistics around 0, which primarily contribute to the null spike, as well as the two tails, which may be contaminated by alternative distributions. We refer to this method as the "double truncation method." After applying double truncation, we estimate the null density using the doubly truncated maximum likelihood estimator. We demonstrate numerically that our proposed method effectively controls the false discovery rate at the desired level on simulated data. Furthermore, we apply our method to two real data sets, namely the SET protein data and the peony data.
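For reference, the local false discovery rate underlying such procedures is, in Efron's standard formulation,

```latex
% pi_0: proportion of true nulls; f_0: null density; f: marginal density
% of the test statistics. Accurate estimation of f_0 is the crux when
% the null is spiky.
\mathrm{fdr}(z) = \frac{\pi_0\, f_0(z)}{f(z)}.
```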
FPDclustering: a comprehensive R package for probabilistic distance clustering based methods
Tortora C., Palumbo F.
Q2
Springer Nature
Computational Statistics, 2024, citations: 1, doi.org, Abstract
Data clustering has a long history and refers to a vast range of models and methods that exploit ever-better-performing numerical optimization algorithms and are designed to find homogeneous groups of observations in data. In this framework, the probability distance clustering (PDC) family of methods offers a numerically effective alternative to model-based clustering and a more flexible option within geometric data clustering. Given n J-dimensional data vectors arranged in a data matrix and the number K of clusters, PDC maximizes the joint density function, defined as the sum of the products between the distance and the probability, both measured for each data vector from each center. This article shows the capabilities of the PDC family, illustrating the package FPDclustering.
Efficient regression analyses with zero-augmented models based on ranking
Kanda D., Yin J., Zhang X., Samawi H.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
Several zero-augmented models exist for estimation involving outcomes with a large number of zeros. Two such models for handling count endpoints are the zero-inflated and hurdle regression models. In this article, we apply the extreme ranked set sampling (ERSS) scheme to estimation with zero-inflated and hurdle regression models. We provide theoretical derivations showing the superiority of ERSS over simple random sampling (SRS) under these zero-augmented models. A simulation study is also conducted to compare the efficiency of ERSS with SRS, and lastly, we illustrate applications with real data sets.
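For reference, the two zero-augmented count models compared above, in their standard Poisson-based forms (covariates enter through the usual logit and log links):

```latex
% Zero-inflated Poisson: zeros arise either from a point mass
% (probability pi) or from the Poisson component.
P(Y=0) = \pi + (1-\pi)\,e^{-\lambda}, \qquad
P(Y=k) = (1-\pi)\,\frac{\lambda^{k} e^{-\lambda}}{k!}, \quad k \ge 1.
% Hurdle: a binary model for zero vs. positive, with a zero-truncated
% Poisson for the positive counts.
P(Y=0) = \pi_0, \qquad
P(Y=k) = (1-\pi_0)\,\frac{\lambda^{k} e^{-\lambda}/k!}{1 - e^{-\lambda}},
\quad k \ge 1.
```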
Exact and approximate computation of the scatter halfspace depth
Liu X., Liu Y., Laketa P., Nagy S., Chen Y.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
The scatter halfspace depth (sHD) is an extension of the location halfspace (also called Tukey) depth that is applicable in the nonparametric analysis of scatter. Using sHD, it is possible to define minimax optimal robust scatter estimators for multivariate data. The problem of exact computation of sHD for data of dimension d ≥ 2 has, however, not been addressed in the literature. We develop an exact algorithm for the computation of sHD in any dimension d and implement it efficiently for any dimension d ≥ 1. Since the exact computation of sHD is slow, especially in higher dimensions, we also propose two fast approximate algorithms. All our programs are freely available in the R package scatterdepth.
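For reference, the location halfspace (Tukey) depth that sHD extends is defined, in the standard way, as the smallest probability mass of a closed halfspace containing the point:

```latex
% Tukey (location halfspace) depth of a point x with respect to
% distribution P; sHD transfers this minimax idea to scatter matrices.
\mathrm{HD}(x; P) = \inf_{\lVert u \rVert = 1}
  P\bigl(u^{\top} X \ge u^{\top} x\bigr).
```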
A Bayesian approach for clustering and exact finite-sample model selection in longitudinal data mixtures
Corneli M., Erosheva E., Qian X., Lorenzi M.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
We consider mixtures of longitudinal trajectories, where one trajectory contains measurements over time of the variable of interest for one individual and each individual belongs to one cluster. The number of clusters as well as individual cluster memberships are unknown and must be inferred. We propose an original Bayesian clustering framework that allows us to obtain an exact finite-sample model selection criterion for selecting the number of clusters. Our finite-sample approach is more flexible and parsimonious than asymptotic alternatives, such as the Bayesian information criterion or the integrated classification likelihood criterion, in the choice of the number of clusters. Moreover, our approach has other desirable qualities: (i) it keeps the computational effort of the clustering algorithm under control and (ii) it generalizes to several families of regression mixture models, from linear to purely non-parametric. We test our method on simulated datasets as well as on a real-world dataset from the Alzheimer's Disease Neuroimaging Initiative database.
Closed-form expressions of the run-length distribution of the nonparametric double sampling precedence monitoring scheme
Magagula Z., Malela-Majika J., Human S.W., Castagliola P., Chatterjee K., Koukouvinos C.
Q2
Springer Nature
Computational Statistics, 2024, citations: 1, doi.org, Abstract
A significant challenge in statistical process monitoring (SPM) is to find exact, closed-form expressions (CFEs) (i.e., expressions formed with constants, variables, and a finite set of essential functions connected by arithmetic operations and function composition) for run-length properties such as the average run length (ARL), the standard deviation of the run length (SDRL), and the percentiles of the run length (PRL) of nonparametric monitoring schemes. Most of the properties of these schemes are usually evaluated using simulation techniques. Although simulation techniques are helpful when the expression for the run length is complicated, their shortfall is that they require a large number of replications to reach reasonably accurate answers. Consequently, they take too much computational time compared with other methods, such as the Markov chain method or integration techniques, and even with many replications the results are always affected by simulation error and may yield inaccurate estimates. In this paper, closed-form expressions of the run-length properties for the nonparametric double sampling precedence monitoring scheme are derived and used to evaluate its ability to detect shifts in the location parameter. The computational times of the run-length properties for the CFE and simulation approaches are compared under different scenarios. The proposed approach is found to require less computational time than the simulation approach. Moreover, once derived, CFEs have the added advantage of ease of implementation, avoiding complex convergence techniques. CFEs can also easily be built into mathematical software and may be recalled for further work.
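For context, the run-length metrics in question reduce to simple closed forms in the textbook i.i.d. case where each plotted point signals independently with probability p, so that the run length N is geometric (an illustration of what CFEs buy, not the scheme's actual distribution):

```latex
% Geometric run length: ARL, SDRL, and the q-th percentile of N.
\mathrm{ARL} = E(N) = \frac{1}{p}, \qquad
\mathrm{SDRL} = \frac{\sqrt{1-p}}{p}, \qquad
\mathrm{PRL}_q = \min\{\, n : 1 - (1-p)^n \ge q \,\}.
```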
A class of transformed joint quantile time series models with applications to health studies
Tourani-Farani F., Aghabazaz Z., Kazemi I.
Q2
Springer Nature
Computational Statistics, 2024, citations: 0, doi.org, Abstract
Extensions of quantile regression modeling for time series analysis are extensively employed in medical and health studies. This study introduces a class of transformed quantile-dispersion regression models for non-stationary time series. These models have the flexibility to incorporate time-varying structure into the model specification, enabling precise predictions for future decisions. Our proposed modeling methodology applies to dynamic processes characterized by high variation and possible periodicity, and relies on a non-linear framework. Additionally, unlike transformed time series models, our approach directly interprets the regression parameters with respect to the original response. For computation, we present an iteratively reweighted least squares algorithm. To assess the performance of our model, we conduct simulation experiments. To illustrate the modeling strategy, we analyze time series measurements of influenza infection and daily COVID-19 deaths.
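For context, the generic iteratively reweighted least squares update referred to above (standard form; the weight matrix and working response are model-specific):

```latex
% One IRLS step: W^(t) is a diagonal weight matrix and z^(t) a working
% response, both recomputed from the current fit until convergence.
\beta^{(t+1)} = \bigl(X^{\top} W^{(t)} X\bigr)^{-1} X^{\top} W^{(t)} z^{(t)}.
```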