Found: 265
Comments on: Exploratory functional data analysis
Lopez-Pintado S.
Q2
Springer Nature
Test, 2025, citations: 0, doi.org, Abstract
In this invited paper we highlight some of the exploratory functional data methods described in the systematic review paper by Qu et al. (TEST, 2024. 10.1007/s11749-024-00952-8). We discuss recent developments related to functional boxplots and consider possible extensions of exploratory methods to non-Euclidean object data.
Flexible clustering via Gaussian parsimonious mixture models with censored and missing values
Wang W., Lachos V.H., Chen Y., Lin T.
Q2
Springer Nature
Test, 2025, citations: 0, doi.org
Comments on: Exploratory functional data analysis
Lillo R.E.
Q2
Springer Nature
Test, 2025, citations: 0, doi.org
Comments on: Exploratory functional data analysis
Hyndman R.J.
Q2
Springer Nature
Test, 2025, citations: 0, doi.org, Abstract
A useful approach to exploratory functional data analysis is to work in the lower-dimensional principal component space rather than in the original functional data space. I demonstrate this approach by finding anomalies in age-specific US mortality rates between 1933 and 2022. The same approach can be employed for many other standard data analysis tasks and has the advantage that it allows immediate use of the vast array of multivariate data analysis tools that already exist, rather than having to develop new tools for functional data.
A one-bring-one route for assessing the uncertainty of small area estimation in nested-error regression models
Liu Y., Ma H., Liu X., Jiang J.
Q2
Springer Nature
Test, 2025, citations: 0, doi.org, Abstract
The nested-error regression (NER) models are widely used to analyze unit-level data in small area estimation. Concerned about possible model misspecification, Jiang et al. (Surv Methodol 41:37–55, 2015) suggested a new prediction procedure, entitled observed best prediction (OBP), for the NER models and showed its desirable properties under such a setting. However, how to assess the uncertainty of OBP in such a case remains poorly addressed. This paper investigates this issue by developing a new estimator relying on the so-called one-bring-one route. It is shown that the new estimator is second-order unbiased under some mild conditions. Some simulations are conducted to confirm its finite sample performance. Finally, we applied the proposed estimator to a real-data example.
Comments on: Exploratory functional data analysis
Goldsmith J.
Q2
Springer Nature
Test, 2025, citations: 0, doi.org, Abstract
Exploratory FDA is a critical step in understanding functional data. In this article, the authors present a range of compelling real-data examples to illustrate best practices in visualization and initial analyses. My comments include suggestions to incorporate covariates in exploratory approaches and note a new software package to facilitate these techniques.
Topical collection on “goodness-of-fit, change-point and related problems”
González-Manteiga W., Meintanis S.G., Patilea V.
Q2
Springer Nature
Test, 2025, citations: 0, doi.org
Copula based dependent censoring in cure models
Delhelle M., Van Keilegom I.
Q2
Springer Nature
Test, 2025, citations: 0, doi.org, Abstract
In this paper we consider a time-to-event variable T that is subject to random right censoring, and we assume that the censoring time C is stochastically dependent on T and that there is a positive probability of not observing the event. There are various situations in practice in which this happens, and appropriate models and methods need to be considered to avoid biased estimators of the survival function or incorrect conclusions in clinical trials. In this work we propose a fully parametric mixture cure model for the bivariate distribution of (T, C), which deals with all these features. The model depends on a parametric copula and on parametric marginal distributions for T and C. A major advantage of our approach in comparison to existing approaches in the literature is that the copula which models the dependence between T and C is not assumed to be known, nor is the association parameter. Furthermore, our model allows for the identification and estimation of the cure fraction and the association between T and C, despite the fact that only the smallest of these variables is observable. Sufficient conditions are developed under which the model is identified, and an estimation procedure is proposed. The asymptotic behaviour of the estimated parameters is studied, and their finite sample performance is illustrated by means of a thorough simulation study and an analysis of breast cancer data.
Mixed causal-noncausal count process
Pei J., Lu Y., Zhu F.
Q2
Springer Nature
Test, 2024, citations: 0, doi.org, Abstract
Recently, Gouriéroux and Lu (Electron J Stat 15(2):3852–3891, 2021) introduced a class of (Markov) noncausal count processes. These processes are obtained by time-reverting a standard count process (such as INAR(1)), but have quite different dynamic properties. In particular, they can feature bubble-type phenomena, which are epochs of steady increase, followed by sharp decreases. This is in contrast to usual INAR and INGARCH type models, which only feature “reverse bubbles”, that are epochs of sharp increase followed by steady decreases. In practice, however, in many datasets, sudden jumps and crashes are rare, while it is more frequent to observe epochs of steady increase or decrease. This paper introduces the mixed causal-noncausal integer-valued autoregressive (m-INAR(1,1)) process, obtained by superposing a causal and a noncausal INAR(1) process sharing the same sequence of error terms. We show that this process inherits some key properties from the noncausal INAR(1), such as the bi-modality of the predictive distribution and the irreversibility of the dynamics, while at the same time allowing different accumulation and burst speeds for the bubble. We propose a GMM estimation method, investigate its finite sample performance, develop testing procedures, and apply the methodology to stock transaction data.
Convolution smoothing and online updating estimation for support vector machine
Wang K., Meng X., Sun X.
Q2
Springer Nature
Test, 2024, citations: 1, doi.org, Abstract
Support vector machine (SVM) is a powerful binary classification statistical learning tool. In real applications, streaming data are common, which arrive in batches and have unbounded cumulative size. Because of the memory constraints of one single computer, the classical SVM solving the entire data together is unsuitable. Furthermore, the non-smoothness of hinge loss in SVM also poses high computational complexity. To overcome these issues, we first develop a convolution smoothing approach that achieves smooth and convex approximation to SVM. Then an online updating SVM is proposed, in which the estimators are renewed with current data and historical summary statistics. In theory, we prove that the convolution smoothing SVM achieves adequate approximation to SVM, and they are asymptotically equivalent in inference. Furthermore, the online updating SVM achieves the same efficiency as the classical SVM applying to the entire dataset. Numerical experiments on both synthetic and real data also validate our new methods.
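As a rough illustration of the convolution-smoothing idea in this abstract, the sketch below smooths the hinge loss with a Gaussian kernel of bandwidth h. The kernel choice and the names `hinge`, `smoothed_hinge`, and `h` are assumptions made for illustration; the abstract does not say which kernel the authors use. For a Gaussian kernel the convolution has the closed form v·Φ(v/h) + h·φ(v/h) with v = 1 − u, which is smooth, convex, and tends to the hinge loss as h → 0.

```python
import math


def hinge(u: float) -> float:
    """Classical hinge loss max(0, 1 - u) for the margin u = y * f(x)."""
    return max(0.0, 1.0 - u)


def _std_normal_pdf(z: float) -> float:
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def _std_normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def smoothed_hinge(u: float, h: float) -> float:
    """Hinge loss convolved with a N(0, h^2) kernel (assumed kernel choice).

    Equals E[max(0, 1 - u + h * Z)] for standard normal Z, which has the
    closed form v * Phi(v / h) + h * phi(v / h) with v = 1 - u.
    """
    if h <= 0:
        raise ValueError("bandwidth h must be positive")
    v = 1.0 - u
    return v * _std_normal_cdf(v / h) + h * _std_normal_pdf(v / h)


if __name__ == "__main__":
    # Compare the two losses near the kink at u = 1 for a shrinking bandwidth.
    for h in (0.5, 0.1, 0.01):
        print(h, [round(smoothed_hinge(u, h) - hinge(u), 6) for u in (0.5, 1.0, 1.5)])
```

The smoothed loss lies on or above the hinge loss everywhere, and the gap is largest at the kink u = 1, where it equals h/√(2π).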
Exploratory functional data analysis
Qu Z., Dai W., Euan C., Sun Y., Genton M.G.
Q2
Springer Nature
Test, 2024, citations: 2, doi.org, Abstract
With the advance of technology, functional data are being recorded more frequently, whether over one-dimensional or multi-dimensional domains. Due to the high dimensionality and complex features of functional data, exploratory data analysis (EDA) faces significant challenges. To meet the demands of practical applications, researchers have developed various EDA tools, including visualization tools, outlier detection techniques, and clustering methods that can handle diverse types of functional data. This paper offers a comprehensive overview of recent procedures for exploratory functional data analysis (EFDA). It begins by introducing fundamental statistical concepts, such as mean and covariance functions, as well as robust statistics such as the median and quantiles in multivariate functional data. Then, the paper reviews popular visualization methods for functional data, such as the rainbow plot, and various versions of the functional boxplot, each designed to accommodate different features of functional data. In addition to visualization tools, the paper also reviews outlier detection methods, which are commonly integrated with visualization methods to identify anomalous patterns within the data. Finally, the paper focuses on functional data clustering techniques which provide another set of practical tools for EFDA. The paper concludes with a brief discussion of future directions for EFDA. All the reviewed methods have been implemented in an R package named EFDA.
Distribution-free tests for lossless feature selection in classification and regression
Györfi L., Linder T., Walk H.
Q2
Springer Nature
Test, 2024, citations: 0, doi.org, Abstract
We study the problem of lossless feature selection for a d-dimensional feature vector $$X=(X^{(1)},\dots ,X^{(d)})$$ and label Y for binary classification as well as nonparametric regression. For an index set $$S\subset \{1,\dots ,d\}$$, consider the selected |S|-dimensional feature subvector $$X_S=(X^{(i)}, i\in S)$$. If $$L^*$$ and $$L^*(S)$$ stand for the minimum risk based on X and $$X_S$$, respectively, then $$X_S$$ is called lossless if $$L^*=L^*(S)$$. For classification, the minimum risk is the Bayes error probability, while in regression, the minimum risk is the residual variance. We introduce nearest-neighbor-based test statistics to test the hypothesis that $$X_S$$ is lossless. This test statistic is an estimate of the excess risk $$L^*(S)-L^*$$. Surprisingly, estimating this excess risk turns out to be a functional estimation problem that does not suffer from the curse of dimensionality in the sense that the convergence rate does not depend on the dimension d. For the threshold $$a_n=\log n/\sqrt{n}$$, the corresponding tests are proved to be consistent under conditions on the distribution of (X, Y) that are significantly milder than in previous work. Also, our threshold is universal (dimension independent), in contrast to earlier methods where for large d the threshold becomes too large to be useful in practice.
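To make the role of the universal threshold concrete, here is a minimal sketch that computes a_n = log n / √n and applies it to an already computed estimate of the excess risk. The function names and the direction of the decision rule (keep the losslessness hypothesis when the estimate does not exceed a_n) are assumptions for illustration; the nearest-neighbor statistic itself is not reproduced here.

```python
import math


def threshold(n: int) -> float:
    """Threshold a_n = log(n) / sqrt(n); it depends on n only, not on d."""
    if n < 2:
        raise ValueError("sample size n must be at least 2")
    return math.log(n) / math.sqrt(n)


def accept_lossless(excess_risk_estimate: float, n: int) -> bool:
    """Hypothetical decision rule: keep the hypothesis that X_S is lossless
    when the estimated excess risk L*(S) - L* does not exceed a_n."""
    return excess_risk_estimate <= threshold(n)


if __name__ == "__main__":
    # The threshold shrinks as the sample grows, whatever the dimension d is.
    for n in (100, 1_000, 10_000):
        print(n, threshold(n))
```

Because a_n involves only the sample size, the same cut-off applies whether the feature vector has 5 or 500 coordinates.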
Statistical properties of partially observed integrated functional depths
Elías A., Nagy S.
Q2
Springer Nature
Test, 2024, citations: 0, doi.org, Abstract
Integrated functional depths (IFDs) present a versatile toolbox of methods introducing notions of ordering, quantiles, and rankings into a functional data analysis context. They provide fundamental tools for nonparametric inference of infinite-dimensional data. Recently, the literature has extended IFDs to address the challenges posed by partial observability of functional data, commonly encountered in practice. That resulted in the development of partially observed integrated functional depths (POIFDs). POIFDs have demonstrated good empirical results in simulated experiments and real problems. However, there are still no theoretical results in line with the state of the art of IFDs. This article addresses this gap by providing theoretical support for POIFDs, including (i) uniform consistency of their sample versions, (ii) weak continuity with respect to the underlying probability measure, and (iii) uniform consistency for discretely observed functional data. Finally, we present a sensitivity analysis that evaluates how our theoretical results are affected by violations of the main assumptions.
Integrative subgroup analysis for high-dimensional mixed-type multi-response data
Song S., Wu J., Zhang W.
Q2
Springer Nature
Test, 2024, citations: 0, doi.org, Abstract
Identifying subgroup structures presents an intriguing challenge in data analysis. Various methods have been proposed to divide the population into subgroups based on individual heterogeneity. However, these methods often fail to accommodate mixed multi-responses and high-dimensional covariates. This article considers the problem of high-dimensional mixed multi-response data with heterogeneity and sparsity. We introduce an integrative subgroup analysis approach with general linear models, accounting for heterogeneity through unobserved latent factors across different responses and sparsity due to high-dimensional covariates. Our approach automatically divides observations into subgroups while identifying significant covariates using non-convex penalty functions. We develop an algorithm that combines the alternating direction method of multipliers with the coordinate descent algorithm for implementation. Additionally, we establish the oracle property of the estimator, illustrating consistent identification of latent subgroups and significant covariates. The efficacy of our method is further validated through numerical simulations and a case study on a randomized clinical trial for buprenorphine maintenance treatment in opiate dependence.
Semi-functional partial linear regression with measurement error: an approach based on kNN estimation
Novo S., Aneiros G., Vieu P.
Q2
Springer Nature
Test, 2024, citations: 0, doi.org, Abstract
This paper focuses on a semi-parametric regression model in which the response variable is explained by the sum of two components. One of them is parametric (linear), the corresponding explanatory variable is measured with additive error and its dimension is finite (p). The other component models, in a nonparametric way, the effect of a functional variable (infinite dimension) on the response. kNN-based estimators are proposed for each component, and some asymptotic results are obtained. A simulation study illustrates the behaviour of such estimators for finite sample sizes, while an application to real data shows the usefulness of our proposal.
Inference and prediction for ARCH time series via innovation distribution function
Zhong C., Zhang Y., Yang L.
Q2
Springer Nature
Test, 2024, citations: 0, doi.org, Abstract
A kernel distribution estimator (KDE) is obtained based on residuals of innovation distribution in ARCH time series. The deviation between KDE and the innovation distribution function is shown to converge to a Gaussian process. Based on this convergence, a smooth simultaneous confidence band is constructed for the innovation distribution and an invariant procedure is proposed for testing the symmetry of the innovation distribution function. Quantiles are further estimated from the KDE, and multi-step-ahead prediction intervals (PIs) of future observations are constructed using the estimated quantiles, which achieve asymptotically the nominal prediction level. The multi-step-ahead PI is constructed for the S&P 500 daily returns series with satisfactory performance, which corroborates the asymptotic theory.
Conditional minimum density power divergence estimator for self-exciting integer-valued threshold autoregressive models
Sun M., Yang K., Li A.
Q2
Springer Nature
Test, 2024, citations: 0, doi.org, Abstract
To overcome the sensitivity of maximum likelihood estimation to outliers in integer-valued time series of counts, we develop a conditional version of the minimum density power divergence estimator by introducing the structure of the loss function of the original minimum density power divergence estimator. The properties of the proposed estimator, including strong consistency and asymptotic normality, are obtained. Some simulation studies are conducted to show the performance of the conditional minimum density power divergence estimator. Finally, an application to quarterly earthquake data is provided, showing that when outliers exist in the data set, the proposed estimator performs better than the conditional maximum likelihood estimator, which demonstrates its robustness.
A semiparametric approach for simple step-stress model
Pal A., Samanta D., Kundu D.
Q2
Springer Nature
Test, 2024, citations: 0, doi.org, Abstract
In many life-testing experiments, interest often lies in examining the effect of extreme or varying stress factors such as frequency, voltage, temperature, load, etc. on the lifetimes of the experimental units. An experimenter often then performs the step-stress accelerated life-testing (SSALT) experiment, a special case of the more general accelerated life-testing (ALT) experiment, to gain insight into various reliability characteristics of the lifetime distribution much more quickly than would be possible under normal operating conditions. An extensive amount of work has been performed analyzing data obtained from a simple SSALT experiment based on different parametric models. We propose here a flexible data-driven semiparametric approach based on a piecewise constant approximation (PCA) of the baseline hazard function (HF) in order to analyze failure time data obtained from a simple SSALT experiment when the data are Type-I censored. It is assumed that the associated lifetime distribution satisfies the failure-rate-based model assumptions. We provide both the classical and Bayesian solutions to this problem. In particular, methodologies to obtain the point and interval estimates of the associated model parameters are discussed. Extensive simulation studies are carried out to see the effectiveness of the proposed method. A real-life data example is considered for illustrative purposes.
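The piecewise constant approximation of a hazard function mentioned in this abstract can be illustrated generically: with cut points τ_1 < … < τ_{k−1} and hazard levels λ_1, …, λ_k, the cumulative hazard is a sum of level times interval length, and the survival function is its negative exponential. The sketch below shows only this generic building block; the names `cuts` and `rates` are hypothetical, and the step-stress structure and Type-I censoring of the paper are not modelled.

```python
import math
from typing import Sequence


def cumulative_hazard(t: float, cuts: Sequence[float], rates: Sequence[float]) -> float:
    """Cumulative hazard H(t) for a piecewise constant hazard.

    cuts  -- increasing interior cut points (tau_1, ..., tau_{k-1})
    rates -- k non-negative hazard levels, one per interval
    """
    if len(rates) != len(cuts) + 1:
        raise ValueError("need exactly one more rate than cut points")
    total, left = 0.0, 0.0
    for right, rate in zip(list(cuts) + [math.inf], rates):
        if t <= left:
            break  # t lies before this interval; nothing more to add
        total += rate * (min(t, right) - left)
        left = right
    return total


def survival(t: float, cuts: Sequence[float], rates: Sequence[float]) -> float:
    """Survival function S(t) = exp(-H(t))."""
    return math.exp(-cumulative_hazard(t, cuts, rates))


if __name__ == "__main__":
    cuts, rates = [1.0, 2.0], [0.5, 1.0, 2.0]
    for t in (0.5, 1.5, 3.0):
        print(t, cumulative_hazard(t, cuts, rates), survival(t, cuts, rates))
```

A finer grid of cut points lets the step function follow a smooth baseline hazard more closely, at the cost of more parameters to estimate.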
A Kolmogorov–Smirnov-type test for the two-sample problem with left-truncated data
Lago A., de Uña-Álvarez J., Pardo-Fernández J.C.
Q2
Springer Nature
Test, 2024, citations: 2, doi.org, Abstract
A Kolmogorov–Smirnov-type test for the two-sample problem with left-truncated data is proposed. The asymptotic null distribution of the test statistic and its omnibus consistency are established. A bootstrap resampling plan to approximate the null distribution of the test statistic is introduced. The finite sample performance of the proposed test is investigated through simulations. Comparison to the well-known log-rank test and a real data illustration are included.
Modeling paired binary data by a new bivariate Bernoulli model with flexible beta kernel correlation
Li X., Li S., Tian G., Shi J.
Q2
Springer Nature
Test, 2024, citations: 0, doi.org, Abstract
Paired binary data often appear in studies of subjects with two sites such as eyes, ears, lungs, kidneys, feet and so on. Three popular models [i.e., the R model of Rosner (Biometrics 38:105-114, 1982), the model of Dallal (Biometrics 44:253-257, 1988), and the model of Donner (Biometrics 45:605-661, 1989)] were proposed to fit such twin data by considering the intra-person correlation. However, Rosner’s R model can only fit the twin data with an increasing correlation coefficient, Dallal’s model may incur the problem of over-fitting, while Donner’s model can only fit the twin data with a constant correlation. This paper aims to propose a new bivariate Bernoulli model with flexible beta kernel correlation (denoted by $$\hbox {Bernoulli}_2^{\textrm{bk}}$$) for fitting the paired binary data with a wide range of group-specific disease probabilities. The correlation coefficient of the $$\hbox {Bernoulli}_2^{\textrm{bk}}$$ model could be increasing, or decreasing, or unimodal, or convex with respect to the disease probability of one eye. To obtain the maximum likelihood estimates (MLEs) of parameters, we develop a series of minorization-maximization (MM) algorithms by constructing four surrogate functions with closed-form expressions at each iteration of the MM algorithms. Simulation studies are conducted, and two real datasets are analyzed to illustrate the proposed model and methods.
Jackknife empirical likelihood for the correlation coefficient with additive distortion measurement errors
Chen D., Dai L., Zhao Y.
Q2
Springer Nature
Test, 2024, citations: 1, doi.org, Abstract
The correlation coefficient is fundamental in advanced statistical analysis. However, traditional methods of calculating correlation coefficients can be biased due to the existence of confounding variables. Such confounding variables could act in an additive or multiplicative fashion. To study the additive model, previous research has shown residual-based estimation of correlation coefficients. The powerful tool of empirical likelihood (EL) has been used to construct the confidence interval for the correlation coefficient. However, the methods so far only perform well when sample sizes are large. With small sample size situations, the coverage probability of EL, for instance, can be below 90% at confidence level 95%. On the basis of previous research, we propose new methods of interval estimation for the correlation coefficient using jackknife empirical likelihood, mean jackknife empirical likelihood and adjusted jackknife empirical likelihood. For better performance with small sample sizes, we also propose mean adjusted empirical likelihood. The simulation results show the best performance with mean adjusted jackknife empirical likelihood when the sample sizes are as small as 25. Real data analyses are used to illustrate the proposed approach.
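Jackknife empirical likelihood applies empirical likelihood to jackknife pseudo-values of a statistic. As a minimal sketch of that first step only, the code below computes the pseudo-values V_i = n·r − (n − 1)·r_(−i) for the ordinary Pearson correlation r, where r_(−i) leaves out the i-th pair. The function names are hypothetical, and the additive-distortion correction and the empirical likelihood optimisation described in the abstract are not included.

```python
import math
from typing import List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Sample Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def jackknife_pseudo_values(x: Sequence[float], y: Sequence[float]) -> List[float]:
    """Pseudo-values V_i = n * r - (n - 1) * r_(-i) for the correlation r."""
    x, y = list(x), list(y)
    n = len(x)
    if n != len(y) or n < 4:
        raise ValueError("need two samples of equal length, at least 4 pairs")
    full = pearson(x, y)
    pseudo = []
    for i in range(n):
        # Correlation with the i-th observation pair left out.
        loo = pearson(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        pseudo.append(n * full - (n - 1) * loo)
    return pseudo


if __name__ == "__main__":
    xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    ys = [1.2, 1.9, 3.4, 3.8, 5.1, 6.3]
    values = jackknife_pseudo_values(xs, ys)
    print(sum(values) / len(values))  # jackknife estimate of the correlation
```

The mean of the pseudo-values is the jackknife estimate; treating them as approximately independent observations is what allows standard empirical likelihood machinery to be applied to them.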
Comments on: Data integration via analysis of subspaces (DIVAS)
Zhou L., Song P.X.
Q2
Springer Nature
Test, 2024, citations: 0, doi.org
Rejoinder on: Data integration via analysis of subspaces (DIVAS)
Prothero J., Jiang M., Hannig J., Tran-Dinh Q., Ackerman A., Marron J.S.
Q2
Springer Nature
Test, 2024, citations: 0, doi.org, Abstract
Modern data collection in many data paradigms, including bioinformatics, often incorporates multiple traits derived from different data types (i.e., platforms). We call this data multi-block, multi-view, or multi-omics data. The emergent field of data integration develops and applies new methods for studying multi-block data and identifying how different data types relate and differ. One major frontier in contemporary data integration research is methodology that can identify partially shared structure between sub-collections of data types. This work presents a new approach: Data Integration Via Analysis of Subspaces (DIVAS). DIVAS combines new insights in angular subspace perturbation theory with recent developments in matrix signal processing and convex–concave optimization into one algorithm for exploring partially shared structure. Based on principal angles between subspaces, DIVAS provides built-in inference on the results of the analysis and is effective even in high-dimension-low-sample-size (HDLSS) situations.
Nonparametric conditional survival function estimation and plug-in bandwidth selection with multiple covariates
Bagkavos D., Guillen M., Nielsen J.P.
Q2
Springer Nature
Test, 2024, citations: 1, doi.org, Abstract
The present research provides two methodological advances, simulation evidence and a real data analysis, all contributing to the area of local linear survival function estimation and bandwidth selection. The first contribution is the development of a double smoothed local linear survival function estimator which admits an arbitrary number of covariates and the analytic establishment of its asymptotic properties. The second contribution is the efficient implementation of the estimator in practice. This is achieved by developing an automatic plug-in smoothing parameter selector which optimizes the estimator’s performance in all coordinate directions. The traditional problem of vectorization of higher-order derivatives, which leads to increasingly intractable matrix algebraic expressions, is addressed here by introducing an alternative vectorization that exploits the analytic relationships between the functionals involved. This yields expressions that are simpler, tractable, and efficient in terms of computing time, which greatly facilitates the implementation of the rule in practice. The analytic study of the rule’s rate of convergence shows that, in contrast to the traditional cross validation approach, the proposed bandwidth selector is functional even for a large number of covariates. The benefits of all methodological advances are illustrated with the analysis of a motivating real-world dataset on credit risk.
Higher-order spatial autoregressive varying coefficient model: estimation and specification test
Li T., Wang Y.
Q2
Springer Nature
Test, 2024, citations: 1, doi.org, Abstract
Conventional higher-order spatial autoregressive models assume that regression coefficients are constant over space, which is overly restrictive and unrealistic in applications. In this paper, we introduce a higher-order spatial autoregressive varying coefficient model where regression coefficients are allowed to change smoothly over space, which enables us to simultaneously explore different types of spatial dependence and spatial heterogeneity of the regression relationship. We propose a semi-parametric generalized method of moments estimation method for the proposed model and derive asymptotic properties of the resulting estimators. Moreover, we propose a testing method to detect spatial heterogeneity of the regression relationship. Simulation studies show that the proposed estimation and testing methods perform quite well in finite samples. The Boston house price data are finally analyzed to demonstrate the proposed model and its estimation and testing methods.