# Inference

By: Asher Noel & Leo Saenger

## Introduction

Beyond getting data, one usually needs to interpret it, to some degree. For this, we have statistics: a discipline centered around exploring data and a phenomenon of interest, describing causal conclusions about the effect of changing one variable on another, and predicting one variable using another.

##### important

The purpose of the Statistical Analyses HODP docs is to give bootcampers enough of a working understanding of statistics to intelligently work with data, without assuming any familiarity with statistics at Harvard. As such, there are three Statistical Analyses docs:

The first doc is Inference, which aims to give the basic tools and understanding necessary to explore data. More broadly, inference is the branch of statistics that extracts information from already generated data.

The second doc is Hypothesis Testing. Hypothesis testing has been used to validate the work of many experiments, but it is also vulnerable to manipulation and misinterpretation. Broadly, hypothesis testing is the process of testing one hypothesis by comparing it to a null hypothesis.

The third doc is Regression. Regression is only one form of prediction, but it is fairly common and can be simple to implement. More generally, regression is the task where a computer program is asked to predict a numerical value given some input, outputting a function $f : \mathcal{R}^n \to \mathcal{R}$.

## Estimation

When conducting data analysis, one usually has a phenomenon of interest in mind. An analyst might want to know the average amount of water that Harvard students drink each day, the maximal number of trips to Mount Auburn that any Harvard first-year took, or the probability that a student has COVID-19 given a variety of factors, for example.

In each of these cases, there is a true “god-given” value: this is the estimand $\theta$. Before sampling, each data point is a random variable $X \in \mathcal{R}$, and all $n$ data points can be represented by $\vec{X} \in \mathcal{R}^n$. One does not know whether the first person surveyed will have drunk 14 or 16 oz of water, or something else entirely. Once the value does crystallize, it is usually denoted $y$. An estimator $\hat{\theta}$ is the output of a function $g(\vec{X})$ that attempts to estimate the estimand $\theta$. Because an estimator is a function of unobserved data, it is also a random variable. After observing the data, the estimate is the output of the function $g(\vec{y})$.

If the phenomenon of interest is the daily average water intake of Harvard students, then the “true” value, the estimand $\theta$, may be some irrational number close to 64.43 oz, which could only be known if an analyst had perfect information and measurements about the exact water intake of every student. An estimator of $\theta$ may be the sample mean. In addition to the point estimate, the estimator has an associated standard error $s = \sqrt{\textrm{Var}(\hat{\theta})}$, the standard deviation of the estimator (as a mnemonic, the “e” in error can stand for “estimator”). In practice, the standard error of a sample mean can be estimated by dividing the sample standard deviation by $\sqrt{n}$, which many packages can calculate.
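As a minimal sketch of a point estimate and its standard error, the snippet below uses only Python's standard library. The function name `mean_and_standard_error` and the `water_oz` values are hypothetical, invented for illustration. The standard error of the mean is estimated here as the sample standard deviation divided by $\sqrt{n}$.

```python
import math
import statistics

def mean_and_standard_error(sample):
    """Return the sample mean and the estimated standard error of that mean."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations to estimate a standard error")
    point_estimate = statistics.fmean(sample)
    # statistics.stdev uses the n - 1 denominator (sample standard deviation).
    sample_sd = statistics.stdev(sample)
    standard_error = sample_sd / math.sqrt(n)
    return point_estimate, standard_error

# Hypothetical survey responses: daily water intake in ounces (made-up numbers).
water_oz = [60.0, 72.0, 64.0, 58.0, 68.0, 70.0, 62.0, 66.0]
estimate, se = mean_and_standard_error(water_oz)
print(f"estimate = {estimate:.2f} oz, standard error = {se:.2f} oz")
# prints: estimate = 65.00 oz, standard error = 1.73 oz
```

Note that the standard error shrinks as more responses are collected, while the sample standard deviation itself does not.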

An estimator is unbiased if its expectation is the estimand. Formally, we define bias to be $\textrm{bias}(\hat{\theta}) = \mathbb{E}_{\theta}(\hat{\theta}) - \theta$. The mean squared error (MSE), $\mathbb{E}_{\theta}[(\hat{\theta} - \theta)^2]$, equals the sum of the squared bias and the variance (the squared standard error), the proof of which is beyond the scope of the docs. When designing experiments, statisticians often have to trade off bias and variance to minimize mean squared error.
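The decomposition of mean squared error into squared bias plus variance can be checked numerically. The sketch below is a simulation under assumptions that are not in the text: data are drawn from a normal distribution with a known mean `theta`, and `shrunken_mean` is a hypothetical, deliberately biased estimator chosen only for illustration.

```python
import random
import statistics

def simulate_estimates(estimator, theta, n, trials, rng):
    """Draw `trials` samples of size n from Normal(theta, 1) and apply the estimator to each."""
    return [
        estimator([rng.gauss(theta, 1.0) for _ in range(n)])
        for _ in range(trials)
    ]

def shrunken_mean(sample):
    # A deliberately biased estimator: shrinks the sample mean toward zero.
    return 0.9 * statistics.fmean(sample)

rng = random.Random(0)  # fixed seed so the run is repeatable
theta = 5.0             # assumed "true" value, known only because we simulate
estimates = simulate_estimates(shrunken_mean, theta, n=20, trials=5000, rng=rng)

bias = statistics.fmean(estimates) - theta
variance = statistics.pvariance(estimates)  # variance of the estimator across simulations
mse = statistics.fmean([(e - theta) ** 2 for e in estimates])

print(f"bias^2 + variance = {bias ** 2 + variance:.4f}")
print(f"mse               = {mse:.4f}")
```

The two printed numbers agree up to floating-point rounding, because the identity also holds exactly for the empirical distribution of the simulated estimates.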

Ideally, the estimate would converge to the estimand as $n$ grows large. We call this consistency: an estimator $\hat{\theta}$ is consistent if $\hat{\theta} \overset{p}{\longrightarrow} \theta$, that is, if it converges in probability to $\theta$ as $n \to \infty$. An estimator can be consistent but biased (e.g., estimating the mean with $\frac{1}{n}\sum_{i=1}^{n} x_i + \frac{1}{n}$) or unbiased but inconsistent (e.g., estimating the mean with the first observation alone, $g(\vec{X}) = x_1$ for all $n \geq 2$).
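To make the distinction concrete, the following sketch simulates both kinds of estimator: the sample mean plus $\frac{1}{n}$ (biased but consistent) and the first observation alone (unbiased but inconsistent). The normal data-generating process, the value of `theta`, and all function names are assumptions made for illustration.

```python
import random
import statistics

def mean_plus_one_over_n(sample):
    # Biased for every finite n, but the bias 1/n vanishes as n grows (consistent).
    return statistics.fmean(sample) + 1.0 / len(sample)

def first_observation(sample):
    # Unbiased, but ignores every data point after the first (inconsistent).
    return sample[0]

def root_mean_squared_error(estimator, theta, n, trials, rng):
    """Root mean squared error of the estimator over repeated Normal(theta, 1) samples."""
    squared_errors = []
    for _ in range(trials):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        squared_errors.append((estimator(sample) - theta) ** 2)
    return statistics.fmean(squared_errors) ** 0.5

rng = random.Random(1)  # fixed seed so the run is repeatable
theta = 5.0             # assumed "true" mean
results = {}
for n in (10, 100, 1000):
    results[n] = (
        root_mean_squared_error(mean_plus_one_over_n, theta, n, 500, rng),
        root_mean_squared_error(first_observation, theta, n, 500, rng),
    )
    print(n, round(results[n][0], 3), round(results[n][1], 3))
```

In this setup the error of the first estimator shrinks toward zero as $n$ grows, while the error of the second stays near 1 no matter how much data is collected.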

##### caution

When doing work in HODP, it is important to be cognizant of as many potential sources of bias as possible whenever collecting or analyzing data and to choose, as best as possible, consistent estimators of a phenomenon of interest.

## Confidence Intervals

Point estimates and standard errors suit some phenomena of interest, but sometimes it is better to report a range of plausible values for a fixed yet unknown estimand $\theta$. In the frequentist paradigm, where probabilities describe long-run frequencies, these ranges are called confidence intervals: formally, a $1 - \alpha$ confidence interval for an estimand $\theta$ is an interval $C_n = (a,b)$ whose bounds are functions of the data such that $\mathbb{P}_{\theta}(\theta \in C_n) \geq 1 - \alpha \textrm{ for all } \theta \in \Theta$, where $\Theta$ is the parameter space and $1 - \alpha$ is the confidence level.

##### important

Because the interval is a function of the data and therefore random, a correct interpretation of a 95% confidence interval (that is, $\alpha = 0.05$, an arbitrary choice made popular by convention) is that, across repeated samples, the random interval would contain the true estimand in at least 95% of its crystallizations. Once the data are observed, a particular interval either contains the estimand or it does not.

##### warning

Many times in practice, people will define a 95% confidence interval of the mean to be the range of values within two standard errors (more precisely, about 1.96) of the sample mean. This assumes that the estimator, here the sample mean, is approximately normally distributed.
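The "two standard errors" rule and the repeated-sampling interpretation can both be sketched in a few lines. The code below is illustrative only: the function names, the normal data-generating process, and the values `theta=64.0` and a standard deviation of 10 are assumptions, not HODP data.

```python
import math
import random
import statistics

def normal_confidence_interval(sample, confidence=0.95):
    """Normal-approximation interval: sample mean +/- z * estimated standard error."""
    n = len(sample)
    mean = statistics.fmean(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)
    # z is about 1.96 for 95% confidence, which is where "two standard errors" comes from.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    return mean - z * standard_error, mean + z * standard_error

def coverage(theta, n, trials, rng, confidence=0.95):
    """Fraction of simulated intervals that contain the true mean theta."""
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(theta, 10.0) for _ in range(n)]
        low, high = normal_confidence_interval(sample, confidence)
        if low <= theta <= high:
            hits += 1
    return hits / trials

rng = random.Random(2)  # fixed seed so the run is repeatable
observed_coverage = coverage(theta=64.0, n=50, trials=2000, rng=rng)
print(f"share of intervals containing theta: {observed_coverage:.3f}")
```

The printed share should land close to 0.95. With small samples the normal approximation tends to undercover slightly, which is one reason a t-based interval is often preferred there.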

##### tip

Fortunately, the normal distribution shows up many times in real life. Generally, the sum of a reasonably large number (in practice, loosely $n \geq 30$) of independent random variables drawn from the same distribution with finite variance is approximately normally distributed, per the central limit theorem (proof omitted). This implies that the sample mean, which is the sum scaled by a constant factor, is also approximately normally distributed.
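One hedged way to see the central limit theorem at work is to average draws from a strongly skewed distribution and watch the skewness of the sample mean fade as $n$ grows. The sketch below assumes an Exponential(1) distribution (mean 1, standard deviation 1), a choice made only for illustration; a normal distribution has skewness 0.

```python
import math
import random
import statistics

def standardized_sample_means(n, trials, rng):
    """Sample means of n Exponential(1) draws, centered and scaled by their true mean and standard error."""
    true_mean, true_sd = 1.0, 1.0  # Exponential with rate 1 has mean 1 and standard deviation 1.
    z_scores = []
    for _ in range(trials):
        sample_mean = statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        z_scores.append((sample_mean - true_mean) / (true_sd / math.sqrt(n)))
    return z_scores

def skewness(z_scores):
    # Third moment of already standardized values; close to 0 for a normal distribution.
    return statistics.fmean([z ** 3 for z in z_scores])

rng = random.Random(3)  # fixed seed so the run is repeatable
skew_by_n = {}
for n in (2, 30, 200):
    skew_by_n[n] = skewness(standardized_sample_means(n, 4000, rng))
    print(n, round(skew_by_n[n], 2))
```

For this distribution the theoretical skewness of the sample mean is $2/\sqrt{n}$, so the printed values should fall toward zero as $n$ increases, up to simulation noise.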

If you are ever unsure about the interpretation of statistics or confidence intervals, or whether the technique you are applying works well with your data, feel free to reach out to anyone in HODP on Slack. We recognize that people have varying degrees of expertise when it comes to statistics and drawing inferences from data, and we would love to help!

## The Benefits of Large Samples

Sample size is one of the most important considerations in many statistical experiments. More data is valuable for several reasons: the Strong Law of Large Numbers states that sample means converge to the true mean with probability one as $n$ grows, the normal approximation from the central limit theorem becomes more accurate, and, whenever the variance or bias of an estimator is inversely related to sample size, the mean squared error of the estimator decreases.
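As a short worked illustration of the last point, assume (beyond what the text states) that the observations $X_1, \dots, X_n$ are independent with a common variance $\sigma^2$. The variance of the sample mean $\bar{X}_n$ is then:

$$
\textrm{Var}(\bar{X}_n) = \textrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \frac{1}{n^2}\sum_{i=1}^{n}\textrm{Var}(X_i) = \frac{\sigma^2}{n}
$$

Under these assumptions the standard error of the sample mean is $\sigma / \sqrt{n}$, so quadrupling the sample size halves the standard error. Because the sample mean is unbiased for the true mean, its mean squared error equals this variance and also shrinks at rate $1/n$.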

##### warning

A small sample size does not mean that an experiment is worthless; it just means that the statistician must be even more careful when interpreting results.

With this in mind, and the language from above, any statistician is ready to dive into analyzing their data. To help with claims about significance and causality, we have created the next doc: Hypothesis Testing.