University of California, Davis, July 29 - August 1, 2003
Abstracts of talks at the New Researchers conference by name of participant in alphabetical order:
Anna Amirdjanova, Department of Statistics University of Michigan, Ann Arbor
This work is devoted to stochastic modeling of motion of
incompressible homogeneous viscous flows. We focus on a two-dimensional
stochastic Navier-Stokes model in its vorticity formulation based on the
Skorohod-Ito microscopic evolution of a finite system of interacting
point vortices. The equations are driven by a two-dimensional Brownian
sheet and a compensated Poisson random measure and allow one to incorporate
a possible discontinuous displacement of vortices while retaining the
continuous Brownian component intact.
Under fairly general conditions on the initial vorticity profile we
show the existence and uniqueness of the solution to the stochastic
vorticity equation. When the vorticity process is not observed with
precision, associated nonlinear filtering problems are considered
and the equations for the optimal filter are derived.
The original idea to describe numerically the time evolution of a
vorticity field in terms of the motion of point vortices goes back
to L. Rosenhead (1932). Today vortex methods are widely used for
numerical simulation of both inviscid and viscous flows. Our goal is
to develop a mathematical theory of chaotic viscous fluid motion
from a perspective offered by point vortex methods of computational
hydrodynamics.
Sudipto Banerjee, School of Public Health, University of Minnesota
We look at a spatially replicated experiment, where each location is, by itself, a lattice of plots with each plot monitoring the growth of a weed through time. Along with estimating fixed effects that affect the growth of the weed, of particular interest is the spatial variation in the growth patterns. This setup allows the modelling of spatial variation at two resolutions - for the plots nested within locations and between the locations. We develop a Bayesian hierarchical framework that helps capture variations across replicates and locations, using growth curve coefficients that vary spatially. We use spatial processes to model the spatially-varying coefficients and discuss the flexibility of our approach. We look at several competing models that arise in our context, discuss fitting algorithms and perform model comparisons. We interpret these models with respect to our experiment and indicate extensions and future work.
Vera Bulaevskaya, Department of Statistics, Carnegie Mellon University
Currently, images acquired via Magnetic Resonance Imaging (MRI) technology are reconstructed using the discrete inverse Fourier transform. While computationally convenient, this method has a serious limitation: its approach to reconstructing images of 3-dimensional objects is 2-dimensional. We propose an alternative approach to reconstruction, based on the penalized likelihood methodology, that makes 3-dimensional reconstruction possible. We focus on a particular form of the penalized likelihood: shrinkage estimation. In addition, we consider the Haar wavelet basis as an alternative to the Fourier basis functions currently in use. Our work concentrates on penalized likelihood estimation and its advantages over the current approach in 1- and 2-dimensional settings as building blocks necessary to reach the ultimate objective of reconstruction in 3-dimensional settings.
Florentina Bunea, Florida State University, Department of Statistics
The False Discovery Rate (FDR) procedure has been developed in the context
of multiple hypotheses testing, by Benjamini and Hochberg (1995). Given a
set of $p$ hypotheses out of which an unknown number $p_0$ may be true,
the FDR method identifies the hypotheses that can be rejected, while
keeping the expected value of the ratio of false rejections to rejections
below a user specified control value, $q$. This technique can handle
problems in which $p$ is very large at a very low computational cost.
In this talk, we regard the FDR as an estimation procedure. Given a
generic model ${\bf M}_{\beta}$,
in which $\beta \in \mathbb{R}^p$ is the parameter of interest, we let $I_0
\subseteq \{1, \ldots, p\}$ be the
index set of the non-zero components of $\beta$. Then, the FDR method
yields an estimator for $I_0$. The procedure requires the construction of
$p$ test statistics. We show that under appropriate conditions on the
distribution $F$ of these test statistics and by choosing the control
value $q = q_n$ to converge to zero as the sample size $n$ increases,
the FDR leads to consistent estimators of $I_0$. We consider both the case
$p$ fixed and $p \rightarrow \infty$. The rate at which $q_n \rightarrow
0$ depends on $F$ and, for the latter case, on
$p$. As an important example, we show that in any parametric model in
which one can construct asymptotically normal estimators for $\beta$, the
FDR estimator of $I_0$ is consistent.
Also, we indicate how the FDR method can be used as a step in a multistage
estimation procedure in semiparametric models.
A simulation study for covariate selection ($p = 50$) in linear and
logistic regression strongly supports our theoretical findings; we note
that for this case a traditional search over the space of the $2^p$
submodels may become computationally intractable.
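The step-up rule behind the FDR procedure can be sketched in a few lines. The following is a minimal illustration of the standard Benjamini-Hochberg rule, read as an estimator of the index set $I_0$; it assumes that a $p$-value is already available for each component of $\beta$, and the function name `bh_selected_indices` is hypothetical rather than taken from the talk.

```python
import numpy as np

def bh_selected_indices(p_values, q):
    """Indices (0-based) rejected by the Benjamini-Hochberg step-up rule.

    Read as an estimator, the returned set is the estimate of I_0,
    the index set of the non-zero components of beta.
    """
    p_values = np.asarray(p_values, dtype=float)
    p = p_values.size
    order = np.argsort(p_values)               # indices sorted by p-value
    sorted_p = p_values[order]
    thresholds = q * np.arange(1, p + 1) / p   # q * k / p for k = 1..p
    below = np.nonzero(sorted_p <= thresholds)[0]
    if below.size == 0:
        return np.array([], dtype=int)         # nothing is rejected
    k = below[-1]                              # largest k with p_(k) <= q k / p
    return np.sort(order[: k + 1])

if __name__ == "__main__":
    # Hypothetical p-values for six candidate covariates.
    example = [0.001, 0.8, 0.012, 0.6, 0.03, 0.5]
    print(bh_selected_indices(example, q=0.1))
```

The cost is dominated by one sort of the $p$ values, which is why the method remains cheap when a search over all $2^p$ submodels is not feasible.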
Yuguo Chen, Institute of Statistics and Decision Sciences, Duke University
The Hardy-Weinberg law is of basic importance in studying biological systems, and it is important to be able to determine if a population is in Hardy-Weinberg equilibrium. For finite populations, this means testing if they are a draw from a distribution known as Hardy-Weinberg proportions. Because the state space is exponentially large in the population size, the only efficient means for doing this is Monte Carlo simulation. Here we apply a sequential importance sampling technique to this problem that works in the presence of structural zeros. It can also give an estimate of the total number of realizations under these constraints. This is joint work with the SAMSI contingency table group: Ian Dinwoodie, Adrian Dobra, Mark Huber and Michael Nicholas.
Vanja Dukic, Department of Health Studies, University of Chicago
The problem of estimating the variance in a Gaussian model which
ranges between a model where the mean parameter is fully known and a
model where the mean parameter is completely unknown will be
considered. This research is motivated by the desire to understand
theoretical implications of the process of selecting a model among
several sub-models, and then estimating a parameter of interest after
model selection, but with these sequential steps using the {\em same}
sample data.
In this talk we compare the following three approaches to
inference: (I) Utilizing estimators developed under a more general
model, and in the extreme case, under a fully nonparametric model;
(II) Performing a two-step process, with the first step being to
select the sub-model, and the second step being to use an estimator
developed under the chosen sub-model, with both steps using the same
sample data; and (III) Forming a weighted combination of sub-model
estimators, with the weights, which are the sub-model posterior
probabilities, being data-dependent. It will be shown that efficiency
gains can be obtained by exploiting the sub-model structure through
the use of adaptive, Bayes, and sub-model weighted estimators,
especially when the number of competing sub-models is few, but this
advantage may deteriorate and be lost altogether for some adaptive
estimators as the number of sub-models increases. In particular, it
will be demonstrated that weighted estimators perform best in
estimating the variance and are preferable over two-step adaptive
estimators.
Danielle Harvey, Division of Biostatistics, Department of Epidemiology and Preventive Medicine, UC Davis
Consider an observational study of the effect of hospice care
on mortality.
Because referral to hospice occurs only when death is near, a
simple analysis of time from
hospice referral to death does not capture the independent effect of
hospice, but instead a combination of this effect and other
factors leading to referral. One of these
factors is the patient's true state of health, which is
partially known but cannot be completely measured. This
state of health affects both when the patient is referred to
hospice and when the patient dies. In order to
estimate the true effect of hospice, the unknown health effect must
be separated from the hospice effect.
In problems such as this one, there are two
correlated event times, which we call the ``treatment event''
and the ``outcome event'', with a common unobservable variable.
We propose a two-stage model of the
process; the first stage describes the model for the
log-hazard of the ``treatment event,'' while the second
describes the model for the
log-hazard of the ``outcome event'' conditional on the
``treatment event.'' In addition, our model includes an observable
instrumental variable which influences the ``treatment
event,'' but is not
correlated with the common unknown variable.
We use a maximum
likelihood approach to estimate the parameters in the model,
implementing a grid search. Through
simulations, we compare the estimates of the effect of the ``treatment
event'' on the ``outcome event'' from the Cox model and our
two-stage model to
assess the properties of these estimates with or without
censoring. We find that
the estimates from the two-stage
model perform much better than those from the Cox model in terms of
bias and coverage probabilities.
Matthew J. Hayat, National Institutes of Health
Heterogeneity of variance and serial correlation are often present in
measurements taken over time. The most popular parametric dependence models
for serial correlation are stationary autoregressive models and other
second-order stationary models. In these models, it is assumed that
variances are constant over time and correlations between measurements
equidistant in time are equal. These assumptions may not be reasonable. Our
work considers a class of nonstationary models that allows for heterogeneity
of variance and serial correlation.
Modeling dependence is difficult for two reasons. First, dimensionality of
the problem can be large in many applications and second, the covariance
matrix must be constrained to be positive definite. A modified Cholesky
decomposition of the precision matrix (inverse of the covariance matrix)
allows us to address both of these challenges. It also produces
nonstationary analogues of many stationary covariances with special
structure that are available in the literature of longitudinal data
analysis.
We implement full Bayesian inference for several such models. Markov Chain
Monte Carlo (MCMC) techniques are used to calculate posterior distributions
for all parameters. Missing data are handled by generating samples from
their conditional predictive distribution. The methodology is applied to a
study of late-deafened adults receiving cochlear implants.
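To make the role of the modified Cholesky decomposition concrete, here is a minimal sketch. It assumes the usual form $T \Sigma T' = D$, with $T$ unit lower triangular and $D$ diagonal, so that $\Sigma^{-1} = T' D^{-1} T$. The function names are hypothetical and the code is an illustration, not the models fitted in the talk.

```python
import numpy as np

def modified_cholesky(sigma):
    """Return (T, D) with T unit lower triangular and T @ sigma @ T.T = diag(D)."""
    sigma = np.asarray(sigma, dtype=float)
    L = np.linalg.cholesky(sigma)   # sigma = L @ L.T, L lower triangular
    d_sqrt = np.diag(L)
    C = L / d_sqrt                  # scale column j by 1 / L[j, j] -> unit diagonal
    T = np.linalg.inv(C)            # inverse of unit lower triangular matrix
    D = d_sqrt ** 2                 # innovation variances
    return T, D

def covariance_from_unconstrained(phi, log_d):
    """Build a covariance matrix from unconstrained parameters.

    phi   : below-diagonal entries of -T, listed row by row (any real values)
    log_d : logarithms of the innovation variances (any real values)
    """
    n = len(log_d)
    T = np.eye(n)
    T[np.tril_indices(n, k=-1)] = -np.asarray(phi, dtype=float)
    D_inv = np.diag(np.exp(-np.asarray(log_d, dtype=float)))
    precision = T.T @ D_inv @ T     # positive definite by construction
    return np.linalg.inv(precision)
```

Any real values of `phi` and `log_d` give a positive definite covariance matrix, which removes the constraint. Structured choices of `phi` (for example, banded) reduce the number of parameters.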
Rosaria Ignaccolo, Dipartimento di Statistica e Matematica Applicata, Universita degli Studi di Torino, Italy
We establish some properties of a class of functional tests of goodness-of-fit for correlated observations generated by strongly mixing discrete time stochastic processes. These tests are associated with the projection density estimator. This class of functional tests contains Pearson's chi-squared test (1900) and the smooth test of Neyman (1937) as particular cases. Results in the case of independent observations, in a general framework, can be found in Bosq (2002). Moreover, we consider a class of tests for the regression function and we establish some properties for the case of the no-effect hypothesis. In both cases, we analyse the asymptotic behaviour of the test statistics, under both the null hypothesis and the alternative, establishing the rate of convergence to their limit distributions. We determine the necessary and sufficient conditions for the consistency of the tests. Indications for implementing the tests are provided and results of a simulation study are given.
Sonia Jain, Department of Family & Preventive Medicine, University of California, San Diego
A major impediment in designing Markov chain Monte Carlo algorithms for nonconjugate models is the computational difficulty that arises when the model is no longer analytically tractable. In this talk, we will propose a new nonincremental Markov chain sampling technique that efficiently clusters heterogeneous data by splitting and merging mixture components of a nonconjugate Dirichlet Process mixture model. Our method, which is a generalization of our conjugate split-merge Metropolis-Hastings procedure, will accommodate models with a specific type of nonconjugate prior, the conditionally conjugate family of priors. Appropriate Metropolis-Hastings split-merge proposal distributions are obtained by utilizing properties of a restricted Gibbs sampling scan. Highlights from a simulation study that was conducted to empirically compare the performance of our nonconjugate split-merge procedure and a nonconjugate Gibbs sampling technique will be shown.
Wolfgang Jank, Department of Decision and Information Technologies, University of Maryland
The EM algorithm is a popular tool in statistics and many other fields. One of the reasons for EM's popularity is its convenient statistical and numerical properties. Arguably, one of the most famous of these properties is EM's likelihood-ascent property. That is, EM guarantees an increase in the likelihood function between pairs of successive parameter updates. Unfortunately, not all of these properties are inherited automatically by its stochastic version, the Monte Carlo EM (MCEM) algorithm. Indeed, MCEM does not converge with a fixed Monte Carlo sample size, nor does it inherit EM's ascent property. In this work we propose a new MCEM formulation, the Ascent MCEM algorithm. Borrowing ideas from its deterministic origin, this new formulation recovers the likelihood-ascent property, at least with high probability. We show that while Ascent MCEM allows for a convenient implementation of many different sampling schemes, it can also result in a better allocation of Monte Carlo resources than competing methods. Furthermore, it admits standard stopping rules and it can reduce the total simulation effort significantly over existing approaches.
Ming Ji, Graduate School of Public Health, San Diego State University
In this paper, we develop a statistical hypothesis test for detecting a change point over the course of cognitive decline among Alzheimer's disease (AD) patients. The model under the null hypothesis assumes a constant rate of cognitive decline over time and the model under the alternative hypothesis is a general bilinear model with an unknown change point. When the change point is unknown, however, the null distribution of the test statistic is not analytically tractable and has to be simulated by parametric bootstrap. When the alternative hypothesis that a change point exists is accepted, we propose an estimate of its location based on Akaike's Information Criterion. We applied our method to a data set from the Neuropsychological Database Initiative by implementing our hypothesis testing method to analyze Mini Mental Status Exam (MMSE) scores based on a random-slope and random-intercept model with a bilinear fixed effect. Our results show that, despite a large amount of missing data, accelerated decline does exist for MMSE. Our finding supports the clinical belief of the existence of a change point during cognitive decline among AD patients and suggests the use of change point models for the longitudinal modeling of cognitive decline in AD research.
Samir Lababidi, National Institutes of Health
There have been many parametric estimation problems from indirect observations in the literature, e.g., in filtering and stochastic control. In this talk I consider the nonparametric estimation problem of an integral-type functional from indirect observations where the observation $Y_\varepsilon(t)$ is a sum of a known function of an unobservable process $X_\varepsilon(t)$ and a Gaussian white noise, and $X_\varepsilon(t)$ is a sum of an unknown function $a(t)$ and a Gaussian process. First I will present some examples of these functionals. Then I will derive the minimax lower bound on the quality of nonparametric estimation and provide an asymptotically efficient estimator.
Min Li, Management Information Science Department, California State University, Sacramento
This talk provides a Bayesian approach to estimating Treasury and corporate term structures with a regression spline model. The number and locations of the knots are adaptively chosen by the data. We first estimate the Treasury term structure with a Bayesian adaptive regression spline model, simultaneously selecting the number and locations of knots. We then estimate the corporate term structure by adding a spread to the estimated Treasury term structure, incorporating the knowledge of positive credit spread into our Bayesian model as informative priors.
Joey Lin, Department of Mathematics and Statistics, San Diego State University
This article studies small sample confidence intervals for the simple difference of the success proportions under inverse sampling. Three methods are proposed. The first method is based on the conditional intervals for the ratio of the proportions. The second method constructs unconditional intervals using the tail method. The third method inverts a two-sided test based on score statistics to construct confidence intervals. The coverage probability and the length of the intervals from the proposed methods are discussed.
Martin Lindquist, Department of Statistics, Columbia University
What occurs in humans' brains when they see a picture of someone they
recognize? Recognition occurs almost immediately, but where in the brain
does it take place? Functional Magnetic Resonance Imaging (fMRI), a
technique that can be used to study mental activity in the brain, holds
great promise for assisting in the arduous task of mapping brain
functions. It is also a technique where mathematics and statistics promise
to play a crucial role.
As currently used, the temporal resolution of fMRI studies is too low to
effectively answer the questions posed above. To increase their
usefulness, new methods to accelerate the speed of fMRI studies must be
introduced. In this talk, an idea is presented for obtaining this needed
acceleration, based on a trade-off of spatial for temporal resolution,
achieved by sampling only a small fraction of the Fourier transform of the
spin density and applying a prolate spheroidal wave function filter. This
is used to obtain the total activity in a predetermined region of the
brain. The fraction sampled will depend upon the shape of the region being
studied, and is carefully chosen to optimize certain criteria.
In order to properly detect activation in the predetermined region, there
is a need for a thorough statistical analysis of the collected fMRI time
series data. Some novel approaches to analyzing fMRI data will also be
discussed.
Ji Meng Loh, Department of Statistics, Columbia University
The nature and extent of clustering in the universe is of fundamental interest to cosmologists. Typically the two-point correlation function is estimated from a catalog of observed objects (such as galaxies) and used to constrain certain parameters found in cosmological models. We instead suggest estimating a related measure of clustering, called the reduced second moment function, and illustrate with applications to an absorber catalog and a galaxy catalog. Important considerations in correctly estimating the clustering include accounting for edge effects due to the boundary of the observation window and accounting for bias due to possible selection effects.
Louis T. Mariano, RAND
When modeling a latent quantity, repeated ratings--measures of a related variable from a subjective source--are often available. For example, a single student's essay on a standardized test may be scored by more than one grader. The availability of repeated ratings allows for the consideration of individual rater bias and variability in the estimation of the latent quantity, and hierarchical models based in item response theory have been introduced to model rater effects. In this paper, we demonstrate how these models may be extended to include covariates of the rating process. For example, how do features of an essay grader's training affect their performance and thus the estimate of a student's writing proficiency? We first formulate a design matrix that indicates the available raters and covariates and also reflects necessary identifiability constraints. Using this, we cast the overall rating effects as a linear combination of individual rater and covariate effects and discuss competing options for including these overall effects in a model hierarchy. We use data from a rating study of a 5th grade student assessment to illustrate the method.
Susanne May, Department of Family and Preventive Medicine, University of California, San Diego
A number of goodness-of-fit test statistics have been proposed for the Cox proportional hazards model. Among these are test statistics that are based on a partitioning of the time and covariate space (e.g. Schoenfeld, 1980; Moreau et al., 1986; Parzen and Lipsitz, 1999). Some of these test statistics can be easily computed by any statistical software package that allows for time-dependent covariates. Nevertheless, determining the appropriate degrees of freedom for these test statistics can be difficult for certain partitionings. We use a graphical representation to illustrate some of the issues involved and provide worked examples.
Carlos Morales, Department of Mathematical Sciences, Worcester Polytechnic Institute
Multifractal processes have become a widely used modeling tool with applications in areas such as turbulence, Internet traffic, and more recently biomedical engineering. Their appeal as a modeling tool stems from the fact that sample paths of these processes exhibit intricate patterns of locally varying scaling behavior and smoothness. In the present work, we use wavelet-based methods to estimate a typical path's range and prevalence of smooth behavior. The asymptotic behavior of the estimators is studied for the case of fractional Brownian motion, accounting for the across- and within-scale covariance structure of wavelet coefficients. The procedure is then applied to study differences in sample paths of biomedical signals from normal and pathological patients.
Bhramar Mukherjee, Department of Statistics, University of Florida
Case-control studies occupy a central place in epidemiological research. Fixed case-control studies separately collect a case sample and a control sample, with the two sample sizes being fixed prior to the study and sometimes chosen arbitrarily. This often results in loss of efficiency in terms of cost and time. We propose a Bayesian sequential design for case-control studies and derive a simple optimal sampling rule with an accompanying asymptotically pointwise optimal (in the sense of Bickel and Yahav (1965, 1967, 1968)) stopping rule for case-control designs. The group sequential extension and higher order asymptotic analysis of this procedure are also considered.
Kerrie Nelson, Department of Statistics, University of South Carolina
Over the last decade, generalized linear mixed models have become an
increasingly popular modelling approach in a regression setting for correlated,
clustered and overdispersed data. Due to the intractability of the likelihood
functions involved, it is usually not possible to find closed form
solutions for parameter estimates for these models, and consequently,
alternative methods have been developed for estimation including commonly
used methods such as penalized quasi-likelihood, an iterative bias
correction method and a maximum likelihood algorithm using Markov chain
methods. These methods tend to be either biased or computationally
intensive.
To date, little has been done to evaluate the performance of the above
methods. In this talk I will describe an improvement to the iterative
bias correction method which reduces its computational time. I will also
compare the performance of the above methods with particular regard to the
effects of sample size, regression coefficients and variance components
using simulation studies for a generalized linear mixed model based on a
time series of polio incidence counts in the USA.
Jacob Oleson, Department of Mathematics & Statistics, Arizona State University
The purpose of this work is to incorporate both pre- and post-stratification into a Bayesian hierarchical framework. A generalized linear model with possibly correlated random effects is often used when there is one unknown parameter (\fullcite*{Sun00a}). We propose a new family of generalized linear mixed models with correlated random effects when there are two unknown canonical parameters. This is the case when the sample size is considered random. Current methods for handling cases of this nature are not suitable. Such a family can be used to model both random sample sizes and success probabilities in small area estimation under pre- and post-stratification. General formulae for Bayesian estimation and prediction at the post-stratification level are given. One application is the 1998 Missouri Turkey Hunting Survey, which was pre-stratified based on the hunter's place of residence. Success rates, hunting pressure, and harvests are of interest at the post-stratified county level. The computations are performed via Gibbs sampling and adaptive rejection sampling techniques. Results show that there are significant spatial correlations between counties and that the variability at the post-stratification level is reduced with the pre-stratification.
Hernando Ombao, Department of Statistics, University of Illinois, Champaign
Many time series datasets are non-stationary in nature. As examples, brain waves, seismic waves and speech signals have amplitudes (variance) that change over time. Moreover, the waves oscillate at frequencies that vary over time. In this talk, we will first introduce the SLEX transform (Smooth Localized Complex EXponentials) as a basic tool in analyzing non-stationary time series. The SLEX transform forms a collection of bases with orthogonal vectors which are time-localized versions of the Fourier complex exponentials. Hence, they are ideal for representing processes with spectral properties that evolve over time. In view of the above, the SLEX analysis for non-stationary time series is a generalization of the traditional Fourier analysis for stationary time series. We will then discuss how the SLEX library can be used to detect patterns in non-stationary time series. Our approach is to select from the SLEX library a basis that can best discriminate between classes of processes. We then develop a classification criterion which is based on the Kullback-Leibler distance and show that it is consistent, i.e., the probability of incorrectly classifying a time series goes to zero. We will illustrate this using seismic waves recorded from an earthquake and an explosion.
Iain Pardoe, Charles H. Lundquist College of Business, University of Oregon
The authors use the multinomial adjacent-categories logit (ACL) random effects model to explore preferences for products reflecting various socially responsible practices. The ACL model is uniquely appropriate to an experimental choice situation in which subjects make repeated choices among alternatives in multiple purchasing situations and more than two choice alternatives are fully ordered. Subjects were presented with a choice between socially responsible and more conventional but lower priced versions of a product in each of nine purchasing situations. Subjects were then asked to repeat the choice task under two different relative price conditions. Subjects therefore could choose the more socially responsible product never, once (at the lowest price differential), twice, or three times (at the highest price differential). The ACL model allows the comparison of selection odds across these ordered categories, providing insights that are not otherwise available.
Vladimir Pozdnyakov, Department of Statistics, University of Connecticut
In statistical analysis we usually collect data of a certain size that is specified in advance, before the experiment. Then we make our decision based on the collected observations. However, the decision often could be made when just a part of the observations is sampled. So we can stop the sampling earlier, saving a significant amount of time and cost. Precise statistical procedures that implement this approach are the main subjects of investigation in sequential analysis. Most sequential methods are designed under the assumption that data are normally distributed or at least have finite variance. But there is growing evidence that, for example, in the case of financial data we often have to work with so-called heavy tailed distributions. One of the standard fixed-size statistical solutions is to use trimmed or truncated sums -- sums from which the most extreme observations are excluded. Methods based on such objects usually give very good results in the case of non-normal data and are still almost as efficient as classical statistical methods in situations when data are normal. Since trimmed and truncated sums are not processes with independent increments, the development of such tools is also a challenging theoretical problem. In this talk we present a robust nonparametric repeated significance test based on truncated sums with a truncating level that goes up as the sample size increases.
Rema Raman, Family and Preventive Medicine & Neurosciences, University of California, San Diego
Three-level data occur frequently in the behavioral and medical sciences. For example, in a multi-center trial, subjects within a given site are randomly assigned to treatments and then studied over time. Mixed-effects models have been developed to analyze such three-level data only when the response is binary, not ordinal. These models for binary data also assume that the variances at the second and/or the third level of data are the same. Unfortunately, this assumption does not hold in several situations. A mixed-effects model is described for the analysis of three-level ordinal response data. This model allows for either homogeneous or heterogeneous variances between groups at either higher level of data. A maximum marginal likelihood (MML) solution is described and Gauss-Hermite numerical quadrature is used to integrate over the distribution of random effects. Simulation studies will show that the fit of the heterogeneous model increases as the magnitude of the difference in variation between the groups increases. The features of this model will be illustrated using a real-life data set.
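The quadrature step can be illustrated on a deliberately simplified case: a single normal random intercept $b \sim N(0, \sigma^2)$ in a cumulative logit model. This is only a sketch of the Gauss-Hermite idea, not the three-level model of the talk, and the function names and logit link are assumptions made for illustration.

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss

def normal_expectation(g, sigma, n_nodes=20):
    """Approximate E[g(b)] for b ~ N(0, sigma^2) by Gauss-Hermite quadrature.

    hermgauss returns nodes x_i and weights w_i for the weight exp(-x^2),
    so the substitution b = sqrt(2) * sigma * x gives
    E[g(b)] ~ sum_i w_i * g(sqrt(2) * sigma * x_i) / sqrt(pi).
    """
    nodes, weights = hermgauss(n_nodes)
    return np.sum(weights * g(np.sqrt(2.0) * sigma * nodes)) / np.sqrt(np.pi)

def marginal_cumulative_prob(threshold, sigma, n_nodes=20):
    """Marginal P(Y <= c) when P(Y <= c | b) = expit(threshold - b)."""
    def conditional(b):
        return 1.0 / (1.0 + np.exp(-(threshold - b)))
    return normal_expectation(conditional, sigma, n_nodes)

if __name__ == "__main__":
    # Hypothetical threshold and random-effect standard deviation.
    print(marginal_cumulative_prob(threshold=1.0, sigma=2.0))
```

With several random effects the same rule is applied on a product grid of nodes, so the number of evaluations grows quickly with the number of levels.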
Jerome P. Reiter, Institute of Statistics and Decision Sciences, Duke University
To avoid disclosures of microdata, one proposal is to release multiply-imputed, synthetic (i.e. simulated) public-use datasets so that (i) no unit in the released data has sensitive data from an actual unit in the population, and (ii) statistical procedures that are valid for the original data are valid for the released data. In this talk, I discuss my current research on this topic. Specifically, I discuss methods of obtaining inferences from multiply-imputed, fully synthetic and partially synthetic datasets. These methods differ from the usual combining rules for multiple imputation for missing data. I present theoretical and simulation-based results that illustrate the potential of this approach.
Rhonda J. Rosychuk, Dept of Pediatrics, University of Alberta, Canada
Responses may be misclassified when a diagnostic test is imperfect. The diagnostic test may not accurately reflect the underlying state of a disease process. I consider a two-state continuous-time Markov model for an unobservable alternating binary process. A related process, such as the repeated diagnostic test results, is observed at discrete time points. I examine the identifiability and estimability of maximum likelihood estimates and provide a simple restriction to ensure identifiability. The behaviour of maximum likelihood transition probability estimates as functions of known misclassification probabilities is also investigated. Bias-adjusted approximate estimators are easily constructed. Simulation studies reveal the effect of misclassification on estimation. Repeated diagnostic testing of parasitic infection serves as an example.
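To make the setting concrete, the following Python sketch computes the transition probabilities of a two-state continuous-time Markov chain and the joint distribution of two consecutive, possibly misclassified, test results. It is only an illustration: the rate names `a` and `b`, the error probabilities `false_pos` and `false_neg`, and the assumption that the hidden process starts from its stationary distribution are all introduced here and are not taken from the abstract.

```python
import math


def transition_matrix(a, b, t):
    """Transition probabilities over an interval of length t for a two-state
    chain with rate a for 0 -> 1 and rate b for 1 -> 0."""
    total = a + b
    decay = math.exp(-total * t)
    p01 = a / total * (1.0 - decay)
    p10 = b / total * (1.0 - decay)
    return [[1.0 - p01, p01], [p10, 1.0 - p10]]


def observed_joint(a, b, t, false_pos, false_neg):
    """Joint probabilities joint[r1][r2] of two test results taken t apart.

    false_pos = P(test = 1 | state = 0), false_neg = P(test = 0 | state = 1).
    The hidden chain is assumed (for illustration) to be stationary.
    """
    stationary = [b / (a + b), a / (a + b)]
    trans = transition_matrix(a, b, t)
    emit = [[1.0 - false_pos, false_pos], [false_neg, 1.0 - false_neg]]
    joint = [[0.0, 0.0], [0.0, 0.0]]
    for s in (0, 1):              # hidden state at the first visit
        for u in (0, 1):          # hidden state at the second visit
            for r1 in (0, 1):     # observed result at the first visit
                for r2 in (0, 1): # observed result at the second visit
                    joint[r1][r2] += (stationary[s] * emit[s][r1]
                                      * trans[s][u] * emit[u][r2])
    return joint
```

Comparing `observed_joint` with and without misclassification shows how imperfect testing distorts the apparent transition frequencies.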
Fotios Siannis, MRC Biostatistics Unit, University Forvie Site, UK
Many different models and approaches have been studied in the literature for analyzing survival data, and almost invariably all assume that the censoring is {\it non-informative} or {\it ignorable}. Whatever the mechanism is which determines the times of censoring, it is deemed irrelevant as far as inference about the failure distribution is concerned. In many applications, however, the assumption of ignorable censoring is at best an approximation and at worst seriously misleading. In this work we study the bias induced by informative censoring by embedding censored survival data in a competing risks framework. For each individual we assume there is a potential random censoring time $C$ and a potential random lifetime $T$. The censoring is non-informative if $C$ and $T$ are independent (conditional on values of covariates). We observe the time $Y=\min (T,C)$, and the censoring indicator $I=1$ if $T \leq C$ and $I=0$ if $T>C$. If $f_C(c,\gamma)$ is the marginal distribution of the censored times, we propose the parametric model \begin{eqnarray*} P \big( C=c|T=t \big) & = & f_{C} \big( c,\gamma + \delta B(t,\theta) \big) \end{eqnarray*} which allows for dependence in terms of a parameter $\delta$ and a bias function $B(t,\theta)$. Parameter $\theta$ of the failure distribution is hence the parameter of interest. However, given that we are unable to estimate the level of dependence between lifetime and censoring mechanisms, we argue that the next best thing is to develop a sensitivity analysis which will enable us to see how robust our estimates and conclusions are to different degrees of dependence which may be present in our data. We develop a relatively simple sensitivity analysis using linear approximations to parameter estimates for small values of $\delta$. Extension of the model to include covariates is straightforward. Examples are presented to illustrate the theory.
Xiaogang Su, Department of Statistics and Actuarial Science, University of Central Florida
This paper introduces the TAR algorithm, which augments the linear regression model with a tree structure. The proposed method can be described as a `father regression and mother tree' idea, not only capturing global patterns of data structure, but also identifying local properties at finer levels. As a result it yields better predictions without loss of interpretability, and moreover, its flexibility makes it a useful routine procedure for data analysis. We illustrate the use of the TAR model by assessing treatment effects in the presence of confounding covariates, detecting interactions in an explicit manner, and estimating transitions with an unknown number of change points. A real data application of TAR analyzing baseball salary data is also presented.
Lehana Thabane, Department of Clinical Epidemiology and Biostatistics, McMaster University, Canada
In this paper, I present a Bayesian approach to estimation of the number needed to treat (NNT). The use of NNT as a measure of clinical benefit is now becoming commonplace. As a measure of clinical benefit, NNT is intuitively attractive due to its interpretability in a clinical setting. Various classical methods of estimation have been proposed, but none of them seems to provide entirely satisfactory estimates. Very little has been done to understand the statistical properties of NNT. Here, I derive the posterior distribution of NNT and use simulations to investigate the general behaviour of the distribution for large and small samples. The Bayesian approach provides a constructive alternative to classical methods to express the uncertainty about NNT using its posterior distribution, although in some cases the distribution has unpleasant properties because of the obvious instability of NNT as an estimand. The posterior mode of the distribution is proposed as a point estimate and results are compared with the conventional method of estimation of NNT done by inversion and the one based on the posterior mean. Simulation results show that the posterior mode outperforms its two counterparts with respect to the average error percentage.
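The contrast between estimation by inversion and a posterior distribution for NNT can be sketched by simulation. The Python code below is not the derivation of the abstract: it assumes independent binomial event counts in the two arms, independent uniform Beta(1, 1) priors, and NNT defined as the reciprocal of the absolute risk reduction. All function and argument names are hypothetical.

```python
import random


def nnt_by_inversion(events_control, n_control, events_treated, n_treated):
    """Conventional point estimate: reciprocal of the observed risk difference."""
    arr = events_control / n_control - events_treated / n_treated
    return 1.0 / arr


def nnt_posterior_draws(events_control, n_control, events_treated, n_treated,
                        draws=20000, seed=0):
    """Monte Carlo draws from the posterior of NNT under Beta(1, 1) priors
    (an assumption made for this sketch)."""
    rng = random.Random(seed)
    out = []
    for _ in range(draws):
        p_control = rng.betavariate(1 + events_control,
                                    1 + n_control - events_control)
        p_treated = rng.betavariate(1 + events_treated,
                                    1 + n_treated - events_treated)
        arr = p_control - p_treated
        if arr != 0.0:  # NNT is undefined when the risk difference is zero
            out.append(1.0 / arr)
    return out
```

Draws with a negative risk difference yield negative NNT values, and draws near zero yield very large magnitudes, which is one way to see the instability of NNT as an estimand.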
Roshan Joseph Vengazhiyil, School of Industrial and Systems Engineering, Georgia Institute of Technology
Stochastic approximation has many applications in science and engineering. The Robbins-Monro procedure (1951) is a nonparametric approach. The convergence can be improved if we know the distribution of the response. Wu (1985, 1988) has demonstrated that the maximum likelihood recursion methods outperform the Robbins-Monro procedure. Wu's approach assumes a parametric model and therefore the convergence rate slows down if the assumed model is very different from the true model. In this research we propose an adaptive Bayesian approach that is robust to the model assumptions. A simulation study shows that the new approach gives superior performance over the existing methods.
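For reference, the baseline Robbins-Monro recursion can be written in a few lines of Python. This sketch covers only the classical procedure, not the adaptive Bayesian approach proposed in the abstract; the step-size constant `c`, the number of steps, and the function name are illustrative assumptions, and the mean response is assumed to be increasing in `x`.

```python
def robbins_monro(noisy_response, target, x0, steps=5000, c=1.0):
    """Search for x such that the mean response equals `target`.

    Uses the classical update x_{n+1} = x_n - (c / n) * (y_n - target),
    where y_n is a noisy observation of the response at x_n.
    """
    x = x0
    for n in range(1, steps + 1):
        y = noisy_response(x)
        x -= (c / n) * (y - target)
    return x
```

For example, with a response of mean `2 * x + 1` observed under unit-variance Gaussian noise and `target = 5`, the iterates settle near `x = 2`.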
Steve C. Wang, Department of Mathematics and Statistics, Swarthmore College
Mass extinctions have played a major role in the history of life. The end-Cretaceous extinction 65 million years ago, for instance, killed the dinosaurs and opened the way for many new mammal species. A much-debated question is whether mass extinctions are sudden or gradual. We address this question by presenting a model for the mass extinction of a group of $S$ species. Occurrences of the $i$th fossil species are modeled as a Poisson process defined over time 0 to $\theta_i$, where $\theta_i$ represents the extinction time of the $i$th species. We then use a likelihood ratio statistic to test hypotheses of the form $\theta_i = k_i$ for $i = 1, \ldots, S$. Equal $k_i$ correspond to a simultaneous (sudden) extinction; unequal $k_i$ correspond to a gradual extinction. We also calculate confidence intervals for $\max(\theta_i - \theta_j)$, the duration of the extinction event. Such information provides important evidence for inferring whether the extinction was sudden and caused by factors such as asteroid impact, or gradual and caused by factors such as climate change.
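One way such a likelihood ratio could be computed is sketched below in Python. The sketch rests on assumptions not stated in the abstract: each species has a constant, unknown fossil recovery rate on the interval from 0 to its extinction time, and that rate is profiled out. Under these assumptions the likelihood for an extinction time is proportional to that time raised to the power minus the number of finds, for any time no earlier than the latest find, so the maximum likelihood estimate is the latest find itself. The function name and data layout are hypothetical, and the statistic actually used in the talk may differ.

```python
import math


def log_likelihood_ratio(finds_by_species, hypothesized_times):
    """Log likelihood ratio for H0: extinction time of species i equals
    hypothesized_times[i], against unrestricted extinction times.

    finds_by_species[i] lists the (positive) times of fossil finds of species i.
    Returns -inf if a hypothesized time precedes an observed find.
    """
    log_lr = 0.0
    for finds, k in zip(finds_by_species, hypothesized_times):
        latest = max(finds)  # maximum likelihood estimate of the extinction time
        if k < latest:
            return -math.inf  # H0 is contradicted by a later fossil find
        log_lr += len(finds) * math.log(latest / k)
    return log_lr
```

Passing the same hypothesized time for every species corresponds to the sudden-extinction hypothesis; values of the log ratio far below zero count as evidence against it.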
Virginia Wheway, The Boeing Company
Recent advances in data mining have led to the development of a method called 'boosting'. Instead of building a single model for each dataset, 'boosting' sees several models being built using weighted versions of the original data. These models are then combined into a single prediction model via a voting method. Models that are more accurate on the original data are given higher voting power. Studies have demonstrated that this method of model combination leads to significantly more accurate predictions on unseen data.
The method of boosting may be extended beyond its original aims of improved prediction. A plot of error statistics may be used as a tool to detect noisy data and unearth structure within datasets that cannot be detected using standard methods. Research into outlier detection led to a technique for unearthing clusters of data that appear to behave differently from other sections of the data. This technique will be demonstrated on an industrial (steel processing) dataset. If such clusters go undetected, model errors are higher and the resulting model is unnecessarily complex. In practice, an accurate model that is interpretable to all team members is preferable to a complex mathematical model understood by few.
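The reweighting and weighted-voting scheme described in this abstract can be illustrated with a small AdaBoost-style sketch in Python. The choice of AdaBoost with one-dimensional decision stumps as base models is an assumption made here for illustration; the abstract does not say which boosting variant or base learner was used, and the function names are hypothetical. Labels are coded as +1 and -1.

```python
import math


def stump_predict(x, threshold, sign):
    """Base model: predict `sign` when x >= threshold, otherwise `-sign`."""
    return sign if x >= threshold else -sign


def fit_stump(xs, ys, weights):
    """Return (threshold, sign, weighted_error) of the best decision stump."""
    best = None
    for threshold in sorted(set(xs)):
        for sign in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, sign) != y)
            if best is None or err < best[2]:
                best = (threshold, sign, err)
    return best


def adaboost(xs, ys, rounds=3):
    """Build `rounds` stumps on reweighted data; return (vote, threshold, sign) triples."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        threshold, sign, err = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1.0 - 1e-12)  # guard against log(0)
        vote = 0.5 * math.log((1.0 - err) / err)  # more accurate models get more say
        ensemble.append((vote, threshold, sign))
        # Increase the weight of misclassified points, decrease the rest.
        weights = [w * math.exp(-vote * y * stump_predict(x, threshold, sign))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble


def predict(ensemble, x):
    """Weighted vote of all stumps in the ensemble."""
    score = sum(vote * stump_predict(x, threshold, sign)
                for vote, threshold, sign in ensemble)
    return 1 if score >= 0 else -1
```

Points whose weights stay high after many rounds are the ones the base models repeatedly misclassify, which is the kind of error statistic that can point to noisy or atypical observations.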
Jayson D. Wilbur, Department of Mathematical Sciences, Worcester Polytechnic Institute
In order to understand the role of microorganisms in environmental processes, the relevant microbial communities must be characterized and compared. Denaturing gradient gel electrophoresis (DGGE) is a process by which characteristic profiles of microbial communities can be produced on an agarose gel. These profiles are used to make comparisons between microbial communities sampled from different environments or at different stages in an environmental process. Unfortunately, DGGE profiles are subject to a systematic pattern warping commonly referred to as the ``smiling effect.'' In order to enable meaningful statistical comparisons between microbial communities, a Bayesian model is fit to the warped profiles, directly accounting for the smiling effect. Posterior estimates are used to obtain unwarped profiles via an inverse transformation.
Hao Helen Zhang, Department of Statistics, North Carolina State University
Variable selection is fundamental to multivariate statistical model building,
validation and selection. Traditional approaches include best subset selection,
forward, backward, and stepwise selection, often validated with selection
criteria like Mallows' $C_p$, AIC and BIC. Shrinkage methods such as ridge
regression, the non-negative garrote, the least absolute shrinkage and selection
operator (LASSO), and the smoothly clipped absolute deviation (SCAD) penalized
approach have been proposed recently. By solving a certain constrained linear regression
problem, these methods tend to mitigate the instability and high variability
of subset selection. In this work we propose a new method for model selection
and model fitting in nonparametric regression models, in the framework of smoothing
spline ANOVA.
The ``COSSO'', short for component selection and smoothing operator, is a
method of regularization with the penalty functional being the sum of component
norms, instead of the squared norm employed in the traditional smoothing spline
method. Not only does it provide a unified framework for several recent proposals
for model selection in linear models and smoothing spline ANOVA models, it also
has optimal theoretical properties in terms of the rate of convergence of
the COSSO estimator. In the case of a tensor product design with periodic functions,
the COSSO applies a novel soft thresholding type operation to the function components
and selects the correct model structure with probability tending to one. Compared
with MARS, the COSSO gives very competitive performance in both simulations
and real examples. The COSSO has great potential for use in a variety of stochastic
settings such as regression, density estimation, hazard regression, and so on.
Free software implemented in both MATLAB and R is available for public use.
Lan Zhang, Department of Statistics, Carnegie Mellon University
It is a common financial practice to estimate volatility from the sum of frequently-sampled squared returns. However, market microstructure poses a challenge to this estimation approach, as evidenced by recent empirical studies in finance. This work attempts to lay out theoretical grounds that reconcile continuous-time modeling and discrete-time samples. We propose an estimation approach that takes advantage of the rich sources in tick-by-tick data while preserving the continuous-time assumption on the underlying returns. Under our framework, it becomes clear why and where the ``usual'' volatility estimator fails when the returns are sampled at the highest frequency.
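The failure of the ``usual'' estimator at the highest sampling frequency can be reproduced in a small simulation. The Python sketch below assumes, purely for illustration, that the observed log price equals a latent Brownian log price with constant volatility plus independent Gaussian noise at each tick; the abstract does not specify the noise model, and the function names and parameter values are hypothetical.

```python
import math
import random


def simulate_log_prices(n_ticks, daily_vol, noise_sd, seed=0):
    """Observed log prices: latent random walk plus i.i.d. microstructure noise."""
    rng = random.Random(seed)
    step_sd = daily_vol / math.sqrt(n_ticks)
    latent = 0.0
    observed = [latent + rng.gauss(0.0, noise_sd)]
    for _ in range(n_ticks):
        latent += rng.gauss(0.0, step_sd)
        observed.append(latent + rng.gauss(0.0, noise_sd))
    return observed


def realized_variance(prices, every=1):
    """Sum of squared returns using every `every`-th observation."""
    sampled = prices[::every]
    return sum((b - a) ** 2 for a, b in zip(sampled, sampled[1:]))
```

Under this noise model, the sum of squared returns over all ticks is dominated by a term proportional to the number of ticks times the noise variance, so it can exceed the true integrated variance (here `daily_vol ** 2`) by orders of magnitude, whereas sparser sampling (a larger `every`) reduces that bias at the cost of discarding data.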
Zhengjun Zhang, Department of Mathematics, Washington University
Studies have shown that financial data are fat tailed. As a result, the traditional multivariate normal assumption, when applied to analyzing extremal financial price movements, may give rise to misleading results. Using extreme value theory, we have shown, in an earlier paper, that the negative returns of each of the three financial assets are fat tailed, whereas the positive returns are fat tailed for one asset and short tailed for the other two. Based on an efficient test statistic, constructed in that paper, the negative returns between different risk factors are shown to be asymptotically dependent, while the positive returns are shown to be asymptotically independent. In this paper, a new model is proposed for modeling multiple financial time series. This new model combines the theory and methodology of max-stable processes and Bernoulli processes with the extreme value theory and results obtained from the earlier paper. It can be used to model the asymmetric behaviors of the financial time series, and to model the tail dependencies between risk factors and tail dependencies within each sequence. Based on this proposed model, we calculate the VaRs (Value at Risk), the maximal possible losses of portfolios under a given level of confidence, and develop the optimal choice of the portfolios under VaR constraints. Unlike the Mean-Variance approach and other approaches, the new portfolio optimization model effectively demonstrates the asymmetric behavior of the positive returns and the negative returns, and at the same time exhibits a desirable feature that the higher the risk, the higher the returns.
Jun Zhu, Department of Statistics, University of Wisconsin--Madison
Many agricultural, biological, and environmental studies involve detecting temporal changes of a response variable, based on data observed at sampling sites in a spatial region and repeatedly over a fixed number of time points. That is, data are repeated measures over time and are potentially correlated across space. The traditional repeated-measures analysis allows for time dependence but assumes that the observations at different sampling sites are mutually independent. Therefore the existing statistical tools are not suitable for field data that are correlated across space. In this talk, a nonparametric large-sample inference procedure is proposed to assess time effects, while accounting for spatial dependence. For illustration, the methodology is applied to describe population dynamics of root-lesion nematodes on a production field in Wisconsin.