Analysis of incomplete datasets: Estimation
of mean values and covariance matrices and imputation of missing
values
What follows is a collection of
Matlab modules for
- the estimation of mean values and covariance matrices from
incomplete datasets, and
- the imputation of missing values in incomplete datasets.
The modules implement the regularized EM algorithm described in
T. Schneider, 2001: Analysis of incomplete climate data:
Estimation of mean values and covariance matrices and imputation
of missing values. Journal of Climate,
14, 853–871.
The EM algorithm for Gaussian data is based on iterated linear
regression analyses. In the regularized EM algorithm, a
regularized estimation method replaces the conditional maximum
likelihood estimation of regression parameters in the conventional
EM algorithm for Gaussian data. The modules here provide truncated
total least squares (with fixed truncation parameter) and ridge
regression with generalized cross-validation as regularized
estimation methods.
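The regression step for a single record can be illustrated with a minimal sketch. It is not code from the package: the toy values, the variable names, the fixed regularization parameter h, and the scaling of the ridge penalty by the variances are all assumptions made for illustration. The package selects the ridge parameter by generalized cross-validation rather than fixing it.

```matlab
% Toy example: impute the missing entry of one record by regression on
% its available entries, given current estimates of mean and covariance.
M = [1 2 3];                      % current estimate of the mean (row vector)
C = [2.0 0.8 0.6;                 % current estimate of the covariance matrix
     0.8 1.0 0.5;
     0.6 0.5 1.5];
x = [1.5 NaN 2.0];                % one record; NaN marks a missing value

kmis = isnan(x);                  % logical index of missing variables
kavl = ~kmis;                     % logical index of available variables

Caa = C(kavl, kavl);              % covariance among available variables
Cam = C(kavl, kmis);              % cross-covariance available/missing

% Conventional EM: conditional maximum likelihood regression coefficients
B_ml = Caa \ Cam;

% Regularized variant: ridge regression with a fixed, hypothetical
% regularization parameter h (assumption of this sketch)
h = 0.5;
D = diag(diag(Caa));              % penalty scaled by the variances
B_ridge = (Caa + h^2 * D) \ Cam;

% Imputed values: regression prediction of missing from available values
x_ml = x;
x_ml(kmis) = M(kmis) + (x(kavl) - M(kavl)) * B_ml;
x_ridge = x;
x_ridge(kmis) = M(kmis) + (x(kavl) - M(kavl)) * B_ridge;
```

In the full algorithm, a step of this kind is repeated for every record with missing values, and the mean and covariance estimates are then updated from the completed data until the imputed values stop changing appreciably.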
The implementation of the regularized EM algorithm is modular,
so that the modules that perform the regularized estimation of
regression parameters (e.g., ridge regression and generalized
cross-validation) can be exchanged for other regularization
methods and other methods of determining a regularization
parameter. Per-Christian Hansen's Regularization
Tools contain Matlab modules implementing a collection of
regularization methods that can be adapted to fit into the
framework of the EM algorithm. The generalized cross-validation
modules of the regularized EM algorithm are adapted from Hansen's
generalized cross-validation modules.
In the Matlab implementation of the regularized EM algorithm,
more emphasis was placed on the modularity of the program code
than on computational efficiency. Below are some suggestions on how the regularized EM
algorithm could be implemented more efficiently.
The program package consists of several Matlab modules. To
install the programs, copy the package (available as a tar.gz-file)
into a directory that is accessible by Matlab. Unpack the package using
gunzip imputation.tar.gz
tar -xvf imputation.tar
Starting Matlab and invoking Matlab's online help function
help filename
displays information on the module filename.m.
- CHANGES: Recent significant changes of the programs.
- center.m: Centers data by subtracting the mean.
- gcvfctn.m (auxiliary module to gcvridge.m): Evaluates generalized
cross-validation function.
- gcvridge.m: Finds minimum of generalized cross-validation function
for ridge regression.
- iridge.m: Computes regression parameters by individual ridge
regressions.
- mridge.m: Computes regression parameters by a multiple ridge
regression.
- nancov.m: Sample covariance matrix of available values in
incomplete dataset.
- nanmean.m: Sample mean of available values in incomplete dataset.
- nanstd.m: Standard deviation of available values in incomplete
dataset.
- nansum.m: Sum over available values in incomplete dataset.
- peigs.m: Computes positive eigenvalues and corresponding
eigenvectors.
- pttls.m: Computes regression parameters by truncated total least
squares.
- regem.m: Driver module for regularized EM algorithm.
- standardize.m: Standardizes data by subtracting the mean and
scaling with the standard deviation.
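To illustrate what statistics "of available values" means, here is a minimal sketch of the kind of computation the NaN-aware helper modules perform. It does not call the package modules and is not their code. The toy data and all variable names are assumptions, and missing values are assumed to be marked by NaN.

```matlab
% Toy dataset: 4 records (rows), 2 variables (columns); NaN = missing
X = [1   2;
     3   NaN;
     5   6;
     NaN 10];
n = size(X, 1);

avail = ~isnan(X);                % mask of available values
n_av  = sum(avail, 1);            % available values per variable

% Mean over available values: zero out missing entries before summing
X0 = X;
X0(~avail) = 0;
m = sum(X0, 1) ./ n_av;

% Center the data; missing entries remain NaN
Xc = X - repmat(m, n, 1);

% Standard deviation over available values (normalized by n_av - 1)
Xc0 = Xc;
Xc0(~avail) = 0;
s = sqrt(sum(Xc0.^2, 1) ./ (n_av - 1));

% Standardize: subtract the mean and scale with the standard deviation
Xs = Xc ./ repmat(s, n, 1);
```

The positions of the missing values are preserved throughout, so the standardized data can be passed on to an imputation step unchanged.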
More efficient implementations of the regularized EM algorithm
are possible. For example, if the missing values in the dataset
under consideration follow regular patterns, the algorithm might
exploit the special structure of the dataset. Other possible
modifications include:
- One could implement a regularized EM algorithm that exploits
spatio-temporal covariability (cf. Section 4 of the above paper).
- One could implement an adaptive method for the choice of
truncation parameter if truncated total least squares (TTLS) is
used as the regularization method in the regularized EM
algorithm. Some criteria for the choice of truncation parameter
in TTLS are discussed in Sima and van Huffel (2007), Level
choice in truncated total least squares, Comp. Stat. Data
Anal. (to appear). These methods require one eigendecomposition
per record, in addition to the one eigendecomposition of the
total covariance matrix per iteration that TTLS already
requires.
- One could find matching patterns of missing values in
different records and compute a regression for each pattern of
missing values instead of for each record.
- One could parallelize the algorithm, so that the
computations for several records (or for several patterns of
missing values) are carried out simultaneously.
- One could compute only one eigendecomposition per iteration,
instead of one eigendecomposition per record (or per pattern of
missing values), and compute the ridge regression via a singular
value decomposition of a data matrix (cf. Section 3 of the
above paper). For
datasets with many more variables than records, this procedure
might be faster than computing one eigendecomposition per record
and iteration.
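The suggestion of computing one regression per pattern of missing values, rather than one per record, can be sketched as follows. This is an illustration under stated assumptions, not part of the package: the toy data, the variable names, and the use of an unregularized regression in place of a regularized one are all choices made to keep the example short.

```matlab
% Toy dataset with repeated patterns of missing values (NaN = missing)
X = [1 NaN 3;
     4 5   NaN;
     7 NaN 9;
     2 3   NaN;
     1 2   3];
M = [1 2 3];                      % assumed current mean estimate
C = [2.0 0.8 0.6;                 % assumed current covariance estimate
     0.8 1.0 0.5;
     0.6 0.5 1.5];

% Find the distinct patterns of missing values and, for each record,
% the index of the pattern it belongs to
[patterns, ~, pat_idx] = unique(double(isnan(X)), 'rows');

Ximp = X;
for j = 1:size(patterns, 1)
    kmis = logical(patterns(j, :));   % missing variables in this pattern
    if ~any(kmis), continue; end      % complete records need no regression
    rows = find(pat_idx == j);        % all records sharing this pattern

    % One regression per pattern (unregularized here for brevity)
    B = C(~kmis, ~kmis) \ C(~kmis, kmis);

    % Apply the same coefficients to every record with this pattern
    Xa = X(rows, ~kmis) - repmat(M(~kmis), numel(rows), 1);
    Ximp(rows, kmis) = repmat(M(kmis), numel(rows), 1) + Xa * B;
end
```

In this toy example, four records have missing values but only two regressions are computed. The iterations of the loop are independent of one another, so the same grouping would also be a natural unit of work for a parallel implementation.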