MDI 404: Statistical Principles of Materials Informatics, Fall 2023

Instructor

Prof. Kristofer Reyes

[email protected]
134 Bell Hall
Office Hours: Fridays, 1–2 p.m.

Instruction style

In-person, mostly on the board
Assessments: homework sets and oral quizzes.

Course time and location

119 Baldy Hall
Tuesdays and Thursdays, 8–9:20 a.m.

Textbook

There is no official textbook, but many of the topics are covered in one or both of the following books:

“Machine Learning: A Bayesian and Optimization Perspective, 2nd Edition,” Sergios Theodoridis.
“Pattern Recognition and Machine Learning,” Christopher M. Bishop. It is available as a free PDF from the author here: https://www.microsoft.com/en-us/research/people/cmbishop/prml-book/

Other books that complement the above:

“Introduction to Probability Models,” Sheldon Ross.
“Introduction to Linear Algebra,” Gilbert Strang.

Prerequisites

MDI 306 - Introduction to Differential Equations
MDI 309 - Introductory Linear Algebra
MDI 312 - Multiscale Design of Materials

Description

This course provides an introduction to mathematical, statistical, and programming principles for materials informatics. The topics covered include probability theory and modeling (frequentist/Bayesian), hypothesis testing, regression and classification analysis, dimensionality reduction, and design of experiments. Emphasis is placed on developing an understanding of statistical concepts and their specific application to materials science problems, with a focus on real-world modeling, data analysis, and best practices. Students will gain experience using Python software packages to model problems, analyze data, and interpret statistical results.

Course Outline

Module 1: Data Exploration, Visualization, and Modeling

This module explores the fundamental concepts of data modeling and exploration. Students will learn how to load, visualize, and analyze data using Python software packages such as NumPy and Matplotlib. They will also learn how to filter and aggregate data through the lens of conditional and marginal distributions, and how to make and test hypotheses about the data in a statistically rigorous way. By the end of Module 1, students will have a solid understanding of the basics of probability theory and modeling, including frequentist and Bayesian approaches, as well as hypothesis testing and data visualization techniques. Students will also have gained experience using Python software packages to model, analyze, and interpret data, in preparation for the techniques taught in later modules.

Learning outcomes:

By the end of this module, students will be able to:

Load a data set and characterize data distributions through summary statistics. Topics to master include:
1. Basic definitions of probability distributions and random variables.
  1. What is a PDF? CDF?
  2. What is a random variable?
  3. What are some common probability distributions?
  4. What distributions describe some physical phenomena?
2. Summary statistics such as mean, median, standard deviation, percentiles, correlation coefficients, and covariances.
  1. What is the expected value of a random variable?
  2. What is the variance of a random variable?
  3. What is the median value of a random variable?
  4. What is the Pearson correlation coefficient?
  5. What is the covariance between two random variables?
3. The NumPy Python package: loading and saving data, and initializing, accessing, and editing NumPy arrays.
4. Using and editing Jupyter notebooks and Python files.
5. Using software versioning tools such as Git.
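
As a minimal sketch of the summary statistics listed above, the following uses NumPy on synthetic data. The variable names (`hardness`, `strength`), the distribution parameters, and the helper `summarize` are illustrative assumptions, not course material; the in-memory buffer stands in for a file on disk.

```python
import io

import numpy as np

# Hypothetical data: two made-up measurement columns, generated here so the
# example is self-contained. Real data would usually be read from a file.
rng = np.random.default_rng(seed=0)
hardness = rng.normal(loc=200.0, scale=15.0, size=500)
strength = 3.0 * hardness + rng.normal(loc=0.0, scale=30.0, size=500)


def summarize(x):
    """Return common summary statistics of a 1-D sample as a dict."""
    return {
        "mean": np.mean(x),
        "median": np.median(x),
        "std": np.std(x, ddof=1),  # sample standard deviation
        "p05": np.percentile(x, 5),
        "p95": np.percentile(x, 95),
    }


stats_hardness = summarize(hardness)

# Sample covariance and Pearson correlation between the two columns.
cov = np.cov(hardness, strength)[0, 1]
pearson_r = np.corrcoef(hardness, strength)[0, 1]

# Save and reload an array; an in-memory buffer replaces a file path here.
buffer = io.BytesIO()
np.save(buffer, hardness)
buffer.seek(0)
reloaded = np.load(buffer)

print(stats_hardness)
print(f"covariance = {cov:.2f}, Pearson r = {pearson_r:.3f}")
```

The Pearson coefficient is the covariance divided by the product of the two sample standard deviations, which can be checked directly from the quantities computed above.
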
Visualize data using several types of plots, understanding their uses and limitations. Topics to master include:
1. Plot types to characterize data, such as line, scatter, and bar plots.
2. Plot types to characterize distributions, such as histograms, box, and quantile-quantile plots.
  1. What do box plots show?
  2. What is the use of a QQ plot?
3. Plots and strategies to visualize high-dimensional data.
4. Plots for a mix of discrete, continuous, and categorical data.
5. The Matplotlib Python package.
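
The distribution plots named above can be sketched with Matplotlib as follows. This is one possible layout under stated assumptions: the sample is synthetic log-normal data, the normal reference distribution for the QQ plot is a choice made for illustration, and `scipy.stats.probplot` is used only to compute the quantile pairs.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this line in a Jupyter notebook
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats

# Synthetic, right-skewed sample (an assumption made for illustration).
rng = np.random.default_rng(seed=1)
sample = rng.lognormal(mean=0.0, sigma=0.5, size=300)

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))

# Histogram: an estimate of the shape of the PDF.
axes[0].hist(sample, bins=30, density=True)
axes[0].set_title("Histogram")

# Box plot: median, quartiles, whiskers, and outlying points.
axes[1].boxplot(sample)
axes[1].set_title("Box plot")

# QQ plot: ordered sample values against theoretical normal quantiles.
# Points far from the fitted line suggest the sample is not normal.
(theoretical_q, ordered_q), (slope, intercept, r) = stats.probplot(sample, dist="norm")
axes[2].scatter(theoretical_q, ordered_q, s=8)
axes[2].plot(theoretical_q, slope * theoretical_q + intercept, color="k")
axes[2].set_title("Normal QQ plot")

fig.tight_layout()
# Call fig.savefig(...) or plt.show() to view the figure.
```
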
Understand how to aggregate and filter data viewed through the lens of conditional and marginal distributions. Topics to master include:
1. Joint, conditional, and marginal distributions.
2. The law of total probability and Bayes' theorem.
3. Boolean indexing with NumPy.
4. Plots to present conditional and marginal distributions succinctly.
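
To illustrate how filtering relates to conditioning, here is a small sketch in which Boolean indexing with NumPy produces empirical conditional probabilities. The events ("annealed", "passed") and their probabilities are invented for the example and are not from the course.

```python
import numpy as np

rng = np.random.default_rng(seed=2)
n = 10_000

# Hypothetical binary events recorded for each sample:
# A = "sample was annealed", B = "sample passed a test".
annealed = rng.random(n) < 0.4
p_pass = np.where(annealed, 0.8, 0.5)  # assumed pass rates, for illustration
passed = rng.random(n) < p_pass

# Marginal probabilities: aggregate over all samples.
p_a = annealed.mean()
p_b = passed.mean()

# Conditional probabilities: filter with a Boolean mask, then aggregate.
p_b_given_a = passed[annealed].mean()
p_b_given_not_a = passed[~annealed].mean()
p_a_given_b = annealed[passed].mean()

# Law of total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A).
total = p_b_given_a * p_a + p_b_given_not_a * (1.0 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) P(A) / P(B).
bayes = p_b_given_a * p_a / p_b

print(f"P(B) = {p_b:.4f}, via total probability = {total:.4f}")
print(f"P(A|B) = {p_a_given_b:.4f}, via Bayes' theorem = {bayes:.4f}")
```

Because every quantity is a ratio of counts from the same data set, both identities hold for the empirical frequencies up to floating-point rounding.
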
Make and test hypotheses about the data in a statistically rigorous way. Topics to master include:
1. Frequentist hypothesis testing concepts such as one- and two-sided testing; the size, power, and significance level of a test; and the Neyman-Pearson lemma.
2. Specific hypothesis tests: $t$-test, $F$-test, $\chi^2$ test, Kolmogorov-Smirnov test.
3. Bayesian hypothesis testing.
4. The scipy.stats Python package.
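
A frequentist two-sample comparison of the kind listed above might look like the following sketch using `scipy.stats`. The two batches are synthetic, the significance level of 0.05 is an assumed convention, and Welch's unequal-variance $t$-test is one choice among several.

```python
import numpy as np
from scipy import stats

# Synthetic measurements from two hypothetical batches.
rng = np.random.default_rng(seed=3)
batch_a = rng.normal(loc=50.0, scale=5.0, size=40)
batch_b = rng.normal(loc=53.0, scale=5.0, size=40)

alpha = 0.05  # assumed significance level

# Two-sided Welch t-test. Null hypothesis: the two batch means are equal.
t_result = stats.ttest_ind(batch_a, batch_b, equal_var=False)
reject_equal_means = bool(t_result.pvalue < alpha)

# The same statistic computed by hand, to show what the test measures.
standard_error = np.sqrt(
    np.var(batch_a, ddof=1) / batch_a.size + np.var(batch_b, ddof=1) / batch_b.size
)
t_manual = (batch_a.mean() - batch_b.mean()) / standard_error

# Two-sample Kolmogorov-Smirnov test. Null hypothesis: both batches are
# drawn from the same distribution.
ks_result = stats.ks_2samp(batch_a, batch_b)

print(f"t = {t_result.statistic:.3f}, p = {t_result.pvalue:.4f}")
print(f"reject equal means at alpha = {alpha}: {reject_equal_means}")
print(f"KS statistic = {ks_result.statistic:.3f}, p = {ks_result.pvalue:.4f}")
```
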

Module 2: Unsupervised Learning

Unsupervised learning techniques provide a more systematic approach to data exploration and data representation. These techniques represent a key step in the machine learning pipeline, preparing our data for use in subsequent analysis. In this module, we’ll focus on dimensionality reduction/data representation and clustering, both of which are methods for identifying structure present within the data set in aggregate. We shall also consider ways to assess the performance of the methods we employ. We will introduce more topics from linear algebra and graph theory, as well as the scikit-learn Python package.

Learning outcomes: