# Bayesian Entity Resolution

Record linkage (also called deduplication or entity resolution) is the process of merging noisy databases and removing duplicate entities, often in the absence of a unique identifier. Linking data from multiple databases can increase the utility of many datasets, and performing the linkage with Bayesian methods can greatly enhance the resulting analysis by allowing linkage error to be propagated into downstream inference. My work has broadly focused on improving the computational efficiency of fitting Bayesian entity resolution models and on ensuring that proper inference can be achieved on linked data. Applications include official statistics, ecology, and the social sciences.

#### Papers (* denotes student)

Drew, L.*, **Kaplan, A.**, and Breckheimer, I. "A Bayesian Record Linkage Approach to Tree Demography Using Overlapping Lidar Scans". (2023+).

Taylor, I.*, **Kaplan, A.**, and Betancourt, B. "Fast Bayesian Record Linkage for Streaming Data Contexts". (2023+).

**Kaplan, A.**, Betancourt, B., and Steorts, R. C. "A Practical Approach to Proper Inference with Linked Data". The American Statistician 76.4 (2022), pp. 384-393.

Lu, X.*, Hooten, M., **Kaplan, A.**, Womble, J., and Bower, M. "Improving Wildlife Population Inference Using Aerial Imagery and Entity Resolution". Journal of Agricultural, Biological and Environmental Statistics 27.2 (2022), pp. 364-381.

Marchant, N.*, **Kaplan, A.**, Elazar, D. N., Rubinstein, B. I. P., and Steorts, R. C. "d-blink: Distributed End-to-End Bayesian Entity Resolution". Journal of Computational and Graphical Statistics 30.2 (2021), pp. 406-421.

#### Software

bstrl: Perform record linkage on streaming files using recursive Bayesian updating. R package version 0.1.0. 2022.

representr: Create Representative Records After Entity Resolution. R package version 0.1.4. 2022.

dblink: Distributed End-to-End Bayesian Entity Resolution. Scala package version 0.2.0. 2020.

# MCMC for Resampling with Complex Dependency

For data with complex dependency (including spatial, graph, network, and other data structures), conditionally specified models can be formulated on the basis of an underlying Markov random field. This approach often provides an attractive alternative to direct specification of a full joint data distribution, which may be difficult for large, correlated data structures. For such Markov random field models, I have developed a new and fast way to simulate data that has provable convergence rate properties, such as geometric ergodicity. This method has been used to enable hypothesis testing based on statistics with no known asymptotic distribution, such as those for assessing goodness-of-fit.
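
As a rough illustration of the idea, and not the implementation in the conclique package, the sketch below runs a conclique-based Gibbs sampler on a small binary lattice. It assumes a four-nearest-neighbor structure with free boundaries, for which the two checkerboard colors form concliques (sets of sites in which no two sites are neighbors), and a simple autologistic conditional model. The function names and the parameters `alpha` and `eta` are hypothetical and chosen only for this example.

```python
import math
import random


def concliques_four_nn(n_rows, n_cols):
    """Split lattice sites into two concliques by checkerboard parity."""
    groups = ([], [])
    for i in range(n_rows):
        for j in range(n_cols):
            groups[(i + j) % 2].append((i, j))
    return groups


def neighbor_sum(y, i, j):
    """Sum of the four nearest neighbors, with free (non-wrapping) boundaries."""
    n_rows, n_cols = len(y), len(y[0])
    total = 0
    for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        a, b = i + di, j + dj
        if 0 <= a < n_rows and 0 <= b < n_cols:
            total += y[a][b]
    return total


def conclique_gibbs(n_rows, n_cols, alpha, eta, n_iter, seed=0):
    """Simulate a binary field, updating one whole conclique at a time.

    Assumed conditional model (for illustration only):
    P(y_s = 1 | neighbors) = logistic(alpha + eta * sum of neighbors).
    """
    rng = random.Random(seed)
    y = [[rng.randint(0, 1) for _ in range(n_cols)] for _ in range(n_rows)]
    groups = concliques_four_nn(n_rows, n_cols)
    for _ in range(n_iter):
        for sites in groups:
            # Sites in a conclique are not neighbors of one another, so their
            # conditionals depend only on the other conclique and can be
            # computed up front and drawn independently as one block.
            probs = [
                1.0 / (1.0 + math.exp(-(alpha + eta * neighbor_sum(y, i, j))))
                for i, j in sites
            ]
            for (i, j), p in zip(sites, probs):
                y[i][j] = 1 if rng.random() < p else 0
    return y
```

Each iteration needs only two block updates here, regardless of the lattice size, whereas a single-site Gibbs sampler would cycle through every location in turn.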

#### Papers

Biswas, E.*, **Kaplan, A.**, and Nordman, D. "A Goodness-of-Fit Test for Binary Spatial Data". (2023+).

**Kaplan, A.**, Kaiser, M. S., Lahiri, S. N., and Nordman, D. J. "Simulating Markov Random Fields With a Conclique-Based Gibbs Sampler". Journal of Computational and Graphical Statistics 29.2 (2020), pp. 286-296.

#### Software

conclique: Gibbs Sampling for Spatial Data and Concliques. R package version 0.1.0. 2017.

# Instability and Degeneracy of Deep Learning

A restricted Boltzmann machine (RBM) is an undirected graphical model used for image classification that, in recent years, has risen to prominence because of its connection to deep learning. An RBM is characterized by having two layers, one hidden and one visible, and is described as a generative model. By incorporating a hidden layer, RBMs are thought to have the ability to encode very complex and rich structures in data, making them attractive for supervised learning. However, the statistical properties of this model for conceptualizing data are largely unexplored in the literature, and the commonly cited fitting methodology remains heuristic-based and abstruse. I provide steps toward a thorough understanding of the model and its behavior from the perspective of statistical theory and then explore the possibility of a rigorous fitting methodology via MCMC.
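
To make the two-layer structure concrete, here is a minimal sketch of a binary RBM. It is an illustration only, not the code from the papers below. It assumes units coded as 0/1 and a joint probability proportional to exp(a'v + b'h + v'Wh), and it shows block Gibbs sampling of the states for fixed parameters rather than any fitting procedure. All function and parameter names are hypothetical.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def negative_energy(v, h, W, a, b):
    """a'v + b'h + v'Wh; the joint probability is proportional to exp of this."""
    main = sum(ai * vi for ai, vi in zip(a, v)) + sum(bj * hj for bj, hj in zip(b, h))
    interaction = sum(
        v[i] * W[i][j] * h[j] for i in range(len(v)) for j in range(len(h))
    )
    return main + interaction


def sample_hidden(v, W, b, rng):
    """Hidden units are conditionally independent given the visible layer."""
    return [
        1 if rng.random() < sigmoid(b[j] + sum(v[i] * W[i][j] for i in range(len(v)))) else 0
        for j in range(len(b))
    ]


def sample_visible(h, W, a, rng):
    """Visible units are conditionally independent given the hidden layer."""
    return [
        1 if rng.random() < sigmoid(a[i] + sum(W[i][j] * h[j] for j in range(len(h)))) else 0
        for i in range(len(a))
    ]


def block_gibbs(W, a, b, n_iter, seed=0):
    """Alternate between the two layers, updating each layer as one block."""
    rng = random.Random(seed)
    v = [rng.randint(0, 1) for _ in a]
    h = [0] * len(b)
    for _ in range(n_iter):
        h = sample_hidden(v, W, b, rng)
        v = sample_visible(h, W, a, rng)
    return v, h
```

Because there are no connections within a layer, each layer can be drawn in a single step given the other, which is what makes the model "restricted".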

#### Papers

**Kaplan, A.**, Nordman, D., and Vardeman, S. "On the S-instability and degeneracy of discrete deep learning models". Information and Inference: A Journal of the IMA 9.3 (2020), pp. 627-655.

**Kaplan, A.**, Nordman, D., and Vardeman, S. "Properties and Bayesian fitting of restricted Boltzmann machines". Statistical Analysis and Data Mining: The ASA Data Science Journal 12.1 (2019), pp. 23-38.

#### Code and Apps

Reproducible Code supplement to "Properties and Bayesian fitting of restricted Boltzmann machines."

Shiny Apps to visualize and understand the properties of restricted Boltzmann machines.

# Interactive and Statistical Graphics

Across all my research, I place a strong value on how research is conducted and how results are shared. Specifically, I am a proponent of reproducibility in research. My work is open source, developed openly in the online community, and often accompanied by a software package or application. By espousing these practices, I have enjoyed collaborations with other researchers in the field who share the same values, and our work has greatly benefited. Much of that collaboration has happened in the field of interactive statistical graphics.

#### Papers

**Kaplan, A.** and Bien, J. "Interactive Exploration of Large Dendrograms with Prototypes". The American Statistician 0.0 (2022), pp. 1-11.

**Kaplan, A.** and Hare, E. "Putting Down Roots: A Graphical Exploration of Community Attachment". Computational Statistics 34.4 (2019), pp. 1449-1464.

Hare, E. and **Kaplan, A.** "Designing Modular Software: A Case Study in Introductory Statistics". Journal of Computational and Graphical Statistics 26.3 (2017), pp. 493-500.

**Kaplan, A.**, Hofmann, H., and Nordman, D. "An interactive graphical method for community detection in network data". Computational Statistics 32.2 (2017), pp. 535-557.

**Kaplan, A.**, Hare, E., Hofmann, H., and Cook, D. "Can you buy a president? Politics after the Tillman Act". Chance 27.1 (2014), pp. 20-30.

#### Software

protoshiny: Interactive Dendrograms for Visualizing Hierarchical Clusters with Prototypes. R package version 0.1.0. 2022.

intRo: Download and Run the intRo Statistical Software. R package version 0.1.0. 2017.

#### Code and Apps

Shiny App to explore community attachment, supplement to "Putting Down Roots: A Graphical Exploration of Community Attachment"

Reproducible Code supplement to "Putting Down Roots: A Graphical Exploration of Community Attachment"

Shiny App for teaching introductory statistics, supplement to "Designing Modular Software: A Case Study in Introductory Statistics"

Shiny App to visually perform community detection in networks, supplement to "An interactive graphical method for community detection in network data"

Reproducible Code supplement to "An interactive graphical method for community detection in network data"

Reproducible Code supplement to "Can you buy a president? Politics after the Tillman Act"

# Miscellaneous

According to John Tukey, "the best thing about being a statistician is that you get to play in everyone's backyard." I don't disagree. I love being able to work (and play) in a variety of topics. Some of this work does not fit nicely into the above topics, but was still incredibly rewarding.

#### Papers

Keller, J. P., Zhou, T., **Kaplan, A.**, Anderson, G. B., and Zhou, W. "Tracking the transmission dynamics of COVID-19 with a time-varying coefficient state-space model". Statistics in Medicine 41.15 (2022), pp. 2745-2767.

#### Software

forestr: Random Forests with a User Created Splitting Criterion. R package version 0.0.0.9000. 2015.