Analysis of scientific computer models. Methodology

Course monograph on a special topic
Analysis of scientific computer models
Methodology in computer simulator data analysis

Author Ksenia N. Kyzyurova

Content Book of abstracts
Chapter 1: Prerequisites: an emulator of a computer model
 Abstract: This chapter provides a tutorial on developing a Gaussian process statistical emulator, an approximation to a computationally demanding computer model. The emphasis is on constructing the 'default' emulator in its objective Bayesian implementation.
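As an illustration only, the sketch below builds a simple Gaussian process emulator with NumPy. It makes several assumptions that are not taken from the chapter: a one-dimensional input, a squared-exponential correlation with a fixed, hypothetical `length_scale`, a constant mean, and plug-in estimates of the mean and variance. It is therefore a simplified stand-in, not the objective Bayesian 'default' emulator, and `toy_simulator` is a hypothetical placeholder for an expensive computer model.

```python
import numpy as np

def toy_simulator(x):
    # Hypothetical stand-in for a computationally demanding computer model.
    return np.sin(2.0 * np.pi * x)

def sq_exp_corr(a, b, length_scale):
    # Squared-exponential correlation between two sets of 1-D inputs.
    d = a[:, None] - b[None, :]
    return np.exp(-(d / length_scale) ** 2)

def fit_predict(x_train, y_train, x_new, length_scale=0.2, nugget=1e-8):
    """Plug-in GP emulator with constant mean (assumed, simplified setup)."""
    n = len(x_train)
    # Correlation matrix of the design, with a small nugget for stability.
    R = sq_exp_corr(x_train, x_train, length_scale) + nugget * np.eye(n)
    L = np.linalg.cholesky(R)

    def solve(b):
        # Solve R z = b using the Cholesky factor.
        return np.linalg.solve(L.T, np.linalg.solve(L, b))

    ones = np.ones(n)
    # Generalized least-squares estimate of the constant mean.
    mu_hat = (ones @ solve(y_train)) / (ones @ solve(ones))
    resid = y_train - mu_hat
    Rinv_resid = solve(resid)
    # Plug-in (maximum likelihood) estimate of the process variance.
    sigma2_hat = (resid @ Rinv_resid) / n

    r = sq_exp_corr(x_new, x_train, length_scale)
    mean = mu_hat + r @ Rinv_resid
    # Predictive variance; uncertainty in mu_hat is ignored in this sketch.
    var = sigma2_hat * (1.0 - np.sum(r * solve(r.T).T, axis=1))
    return mean, np.maximum(var, 0.0)

x_train = np.linspace(0.0, 1.0, 10)
y_train = toy_simulator(x_train)
x_new = np.linspace(0.05, 0.95, 7)
pred_mean, pred_var = fit_predict(x_train, y_train, x_new)
```

The emulator interpolates the simulator runs at the design points (up to the nugget) and reports larger predictive variance away from them.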
Chapter 2: Assessment of a statistical emulator
 Abstract: Protagoras argued that man is the measure of all things. We show mathematically that this is indeed so: scoring rules (scores) calculated for predictive model evaluation and comparison are subjective. By this we mean that the choice of scoring rule affects the results of a model comparison and, therefore, the decision on which model to choose. We recommend instead employing three independent frequency measures of model predictive performance: (1) empirical frequency coverage, (2) an estimate of predictive bias, and (3) an estimate of uncertainty (variability) in predictions.
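The three measures can be sketched as follows. The precise definitions used in the chapter are not given in this abstract, so the choices here are assumptions: Gaussian predictive distributions, central credible intervals, bias as the average of prediction minus held-out value, and uncertainty as the average interval width. The function name `frequency_measures` is hypothetical.

```python
from statistics import NormalDist

def frequency_measures(y_true, pred_mean, pred_sd, level=0.95):
    """Return (coverage, bias, average interval width) on held-out data.

    Assumes Gaussian predictive distributions; all three definitions are
    illustrative choices, not necessarily those of the chapter.
    """
    n = len(y_true)
    # Half-width multiplier of a central credible interval.
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    # (1) Empirical frequency coverage: share of held-out values
    #     falling inside their credible intervals.
    covered = sum(
        abs(y - m) <= z * s for y, m, s in zip(y_true, pred_mean, pred_sd)
    )
    coverage = covered / n
    # (2) Predictive bias: average signed prediction error.
    bias = sum(m - y for y, m in zip(y_true, pred_mean)) / n
    # (3) Uncertainty: average width of the credible intervals.
    avg_width = sum(2.0 * z * s for s in pred_sd) / n
    return coverage, bias, avg_width

coverage, bias, avg_width = frequency_measures(
    y_true=[0.0, 1.0, 2.0, 3.0],
    pred_mean=[0.1, 1.1, 2.1, 3.1],
    pred_sd=[1.0, 1.0, 1.0, 1.0],
)
```

Reporting the three numbers separately keeps calibration, systematic error, and sharpness visible individually instead of merging them into one score.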

Chapter 3: Linked emulator: emulator of a system of models Manuscript, Supplementary materials, Code | Presentation, poster | Movie 1, Movie 2
 Abstract: Direct coupling of computer models is difficult for computational and logistical reasons. We propose coupling computer models by linking independently developed Gaussian process emulators of these models. The resulting linked emulator is available in closed form. The linked emulator has smaller epistemic uncertainty than a standard Gaussian process emulator of the coupled computer model would have (if such a model were available). This feature is illustrated via simulations.
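The idea of linking can be written down through the laws of total expectation and total variance. The notation below is ours and is only a sketch: $\eta_f$ and $\eta_g$ denote independent emulators of an upstream model $f$ and a downstream model $g$, with Gaussian predictive distributions. The closed-form expressions of the manuscript are not reproduced here.

```latex
% Independent emulators (notation assumed for illustration):
%   \eta_f(x) \sim N(\mu_f(x), \sigma_f^2(x)),
%   \eta_g(u) \sim N(\mu_g(u), \sigma_g^2(u)).
% Linked emulator of g(f(x)): \eta_g(U), with U = \eta_f(x).
\begin{align*}
  \mathbb{E}\bigl[\eta_g(U)\bigr]
    &= \mathbb{E}_U\bigl[\mu_g(U)\bigr], \\
  \operatorname{Var}\bigl[\eta_g(U)\bigr]
    &= \mathbb{E}_U\bigl[\sigma_g^2(U)\bigr]
     + \operatorname{Var}_U\bigl[\mu_g(U)\bigr],
  \qquad U \sim N\bigl(\mu_f(x), \sigma_f^2(x)\bigr).
\end{align*}
```

The first variance term carries the uncertainty of the downstream emulator, and the second propagates the uncertainty of the upstream emulator through the downstream mean.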

Chapter 4: Calibration of computer models Manuscript, Code
 Abstract: The problem of calibrating mathematical computer models with respect to collected data occasionally arises in contemporary research. The calibration task is analogous to identifying the preimage of a set of experimental or observational data under a certain computer model, viewed as a function over the input space of that model's parameters. The identified preimage is a subset of the input space of the computer model. In turn, collected data are typically described by probability distributions, which leads to probabilistic calibration within the Bayesian framework. Bayesian inversion is considered advantageous over non-Bayesian or pseudo-Bayesian approaches because its results are interpretable.

In practice, calibration quickly runs into computational obstacles: the shape of the posterior distribution resulting from the Bayesian inversion may be "ugly", so that standard approaches to estimating it (including Markov chain Monte Carlo (MCMC) approximation) become prohibitive. Instructional examples are provided for illustration.
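A toy example, not taken from the chapter, shows how an awkward posterior can arise even in one dimension. It assumes a hypothetical model `toy_model(theta) = theta**2`, a single observation with Gaussian noise, and a uniform prior on [-2, 2]; the posterior is evaluated on a grid, which is feasible only because the input is one-dimensional.

```python
import math

def toy_model(theta):
    # Hypothetical computer model; two inputs map to each positive output.
    return theta ** 2

def grid_posterior(y_obs, noise_sd, lo=-2.0, hi=2.0, n=401):
    """Posterior density of theta on a regular grid (uniform prior assumed)."""
    step = (hi - lo) / (n - 1)
    grid = [lo + i * step for i in range(n)]
    # Gaussian likelihood of the observation at each grid point.
    unnorm = [
        math.exp(-0.5 * ((y_obs - toy_model(t)) / noise_sd) ** 2)
        for t in grid
    ]
    # Normalize with a Riemann sum so the density integrates to one.
    total = sum(unnorm) * step
    return grid, [u / total for u in unnorm]

grid, dens = grid_posterior(y_obs=1.0, noise_sd=0.2)
```

The preimage of the observation consists of two separated regions, near theta = -1 and theta = 1, so the posterior is bimodal with almost no mass in between. A sampler that moves in small local steps can have difficulty crossing between such modes.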

Chapter 5: Multivariate output emulation | Presentation
 Abstract: Computer models often produce multivariate output for every single run of the model. There have been attempts to account for correlation among outputs in the construction of a Gaussian process emulator of such models, with the goal of achieving a more accurate emulator. We investigate properties of the linear model of coregionalization (LMC), the model typically used to construct a multivariate output emulator. Both theoretical and numerical evidence shows that a multivariate emulator does not lead to "better" (that is, more accurate, more precise, or less uncertain) emulation results than independent modeling of each component of the output.
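For reference, the LMC in its commonly used form can be written as below. The notation ($A$, $w_j$, $c_j$) is ours and is introduced only for illustration.

```latex
% p-variate output y(x) built from p independent latent processes:
%   w_j(\cdot): zero-mean, unit-variance Gaussian processes
%   c_j(x, x'): correlation function of w_j
%   A = [a_1, \dots, a_p]: p x p coregionalization matrix
\begin{align*}
  y(x) &= A\, w(x), \qquad w(x) = \bigl(w_1(x), \dots, w_p(x)\bigr)^{\top}, \\
  \operatorname{Cov}\bigl(y(x), y(x')\bigr)
       &= \sum_{j=1}^{p} c_j(x, x')\, a_j a_j^{\top}.
\end{align*}
```

When $A$ is diagonal, the cross-covariances vanish and the model reduces to independent Gaussian processes for each output component, which is the baseline the LMC is compared against.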

The formulation and the performance of the LMC for spatial data on a grid are mathematically exactly the same as those discussed in our work; that is, for predictive purposes the LMC offers no advantage over independent modeling of each component of a multivariate variable. The paper on the LMC (Schmidt and Gelfand 2003) claimed otherwise, and we consider that claim incorrect. It has been repeated in a substantial body of subsequent literature, including the book by Banerjee, Carlin and Gelfand (2014), which we regard as detrimental to science and statistical practice.

NB: The author is strongly critical of the published work of Alan E. Gelfand (James B. Duke Professor at Duke University) more generally.
Chapter 6: Censored emulator of zero-inflated output | Presentation
 Abstract: The computer model TITAN2D, given a set of initial conditions, produces as output the height of a volcanic pyroclastic flow at several thousand spatial locations in a geographical region of interest. This output is non-negative and is often exactly zero, indicating the absence of flow. This undermines the customary assumption that the computer model is represented by a smooth function. To account for the large number of zeros in the non-negative output, we propose a censored Gaussian stochastic process approximation to such a computer model. The resulting probabilistic hazard assessment is compared with the assessments given by other methods proposed in the literature. The difference in hazard estimates is dramatic, and the censoring methodology proposed in the present work appears to assess the probability of a hazard more adequately.

Supplementary materials include (1) a discussion of truncated output of a computer model, (2) the strong disadvantages of several log-based transformations of the output, and (3) the linking of a Gaussian process emulator with either a censored or a truncated emulator in a sequence of two emulators.
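The effect of censoring at a single location can be sketched with standard results for a Gaussian variable censored at zero. The setup is an assumption made for illustration: a latent prediction Z ~ N(mu, sigma^2) and observed output Y = max(Z, 0). The function name `censored_summary` is hypothetical, and the sketch does not reproduce the chapter's emulator.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()

def censored_summary(mu, sigma, threshold=0.0):
    """Summaries of Y = max(Z, 0) for latent Z ~ N(mu, sigma^2).

    Returns (P(Y = 0), E[Y], P(Y > threshold)); threshold must be >= 0.
    """
    if threshold < 0.0:
        raise ValueError("threshold must be non-negative")
    a = mu / sigma
    # Point mass at zero: probability that the latent value is non-positive.
    p_zero = _STD_NORMAL.cdf(-a)
    # Mean of the censored variable.
    mean_y = mu * _STD_NORMAL.cdf(a) + sigma * _STD_NORMAL.pdf(a)
    # For a non-negative threshold, Y exceeds it exactly when Z does.
    p_exceed = 1.0 - _STD_NORMAL.cdf((threshold - mu) / sigma)
    return p_zero, mean_y, p_exceed

p_zero, mean_y, p_exceed = censored_summary(mu=0.0, sigma=1.0, threshold=1.0)
```

The censored distribution places positive probability on an exact zero, which a smooth Gaussian prediction of the output itself cannot do.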

Chapter 7: Design of experiments for large-scale simulators
 Abstract: A simulator is defined as large-scale if the number of its inputs is such that construction of its emulator (which involves optimization over the emulator's parameters) is prohibitively time-consuming. To facilitate the exploration of such a simulator, it is useful to divide the inputs into two groups. First, construct a design over the range of the first group, choosing, say, m points; for each ith design point, develop a Gaussian process emulator over the inputs of the second group. Second, for each set of fixed inputs from the second group, construct an emulator over the inputs of the first group, conditional on those fixed inputs.

This methodology may be used to facilitate parameter estimation and fast emulation of a model with many inputs. Depending on the purpose and on how the methodology is implemented in practice, a Gaussian process over the entire input space may be lost, although useful approximations are still constructed.