Practical Statistics for Data Science

This guide explains core statistical concepts with a focus on real-world data science applications. Each topic includes definitions, formulas (rendered with MathJax), and practical examples.

1. Types of Analysis

In data science, we use different types of statistical analysis depending on the question we’re trying to answer:

Practical Example: A data scientist at an e-commerce company uses descriptive analysis to report average monthly sales, then uses inferential analysis to test if a new website layout increases conversion rates.

2. Population and Sampling

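A population is the entire group you want to draw conclusions about; a sample is the subset of that group you actually measure. We use samples because collecting data from an entire population is often impractical or too expensive.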

Example: If you want to know the average height of all adults in a country (population), you might measure 1,000 randomly selected adults (sample).

3. Sampling Methods

How you select your sample affects the reliability of your conclusions.

Common Sampling Techniques:

Practical Tip: In customer surveys, stratified sampling ensures representation across age groups, regions, or user tiers.
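The stratified sampling described in the tip above can be sketched with the standard library alone. This is a minimal illustration, assuming proportional allocation (the same fraction drawn from every stratum); the `stratified_sample` helper, the `tier` field, and the customer counts are all hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(records, key, fraction, seed=0):
    """Draw the same fraction of records from every stratum."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[rec[key]].append(rec)
    sample = []
    for members in strata.values():
        # Take at least one record so no stratum is left out entirely.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample

# Hypothetical customer base: 800 free, 150 pro, 50 enterprise users.
customers = (
    [{"id": i, "tier": "free"} for i in range(800)]
    + [{"id": i, "tier": "pro"} for i in range(800, 950)]
    + [{"id": i, "tier": "enterprise"} for i in range(950, 1000)]
)

sample = stratified_sample(customers, key="tier", fraction=0.1)
counts = Counter(rec["tier"] for rec in sample)
print(dict(counts))
```

A simple random sample of 100 from the same list could, by chance, contain very few enterprise users; stratifying fixes each tier's share in advance.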

4. Sample Size

Choosing the right sample size balances cost, time, and accuracy.

A larger sample reduces sampling error but increases cost. The required size depends on the desired confidence level, the variability of the data, and the acceptable margin of error.

For estimating a population mean:

\[ n = \left( \frac{Z \cdot \sigma}{E} \right)^2 \]

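Where \(Z\) is the z-score for the chosen confidence level (e.g., 1.96 for 95% confidence), \(\sigma\) is the population standard deviation, and \(E\) is the acceptable margin of error.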

If \(\sigma\) is unknown, estimate it from a pilot study or use a conservative (larger) value.

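Example: Suppose you want to estimate average app usage time to within ±5 minutes with 95% confidence. If \(\sigma \approx 30\) min:

\[ n = \left( \frac{1.96 \cdot 30}{5} \right)^2 \approx 138.3 \Rightarrow \text{Use } n = 139 \]

Always round up, since rounding down would give a margin of error slightly larger than the target.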
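The calculation above can be reproduced in a few lines. This is a minimal sketch using only the standard library; the function name `sample_size_for_mean` and its parameters are illustrative, and the z-score is derived from the confidence level rather than hard-coded.

```python
import math
from statistics import NormalDist

def sample_size_for_mean(sigma, margin, confidence=0.95):
    """Sample size needed to estimate a mean to within +/- margin."""
    # Two-sided z-score, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Round up: a fractional observation is not possible.
    return math.ceil((z * sigma / margin) ** 2)

n = sample_size_for_mean(sigma=30, margin=5)
print(n)  # 139
```

Because \(E\) appears squared in the denominator, halving the margin of error roughly quadruples the required sample size.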

5. Variables

Variables represent measurable attributes. Understanding their type guides analysis.

Types of Variables:

Practical Impact: You’d use linear regression for continuous outcomes but logistic regression for binary (categorical) outcomes.

6. Branches of Statistics

Data science relies heavily on both descriptive and inferential methods, with growing use of Bayesian approaches in machine learning.

7. Moments

Moments describe the shape of a distribution.

Use in Practice: Skewness indicates whether a transformation (e.g., a log transform) may be worth applying before fitting linear models.
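As a rough illustration of that check, the sketch below computes skewness as the third central moment divided by the variance raised to the power 1.5 (the population form, with no small-sample correction) and compares it before and after a log transform. The `skewness` helper and the `order_values` data are hypothetical.

```python
import math

def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical right-skewed data: order values with a few large purchases.
order_values = [12, 15, 18, 20, 22, 25, 30, 45, 120, 400]
logged = [math.log(v) for v in order_values]

print(f"raw skewness:    {skewness(order_values):.2f}")
print(f"logged skewness: {skewness(logged):.2f}")
```

The log transform pulls in the long right tail, so the skewness drops, although in this small example it does not reach zero.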

8. 5-Number Summary

A quick way to describe the distribution of numeric data:

  1. Minimum
  2. First Quartile (\(Q_1\)) — 25th percentile
  3. Median (\(Q_2\)) — 50th percentile
  4. Third Quartile (\(Q_3\)) — 75th percentile
  5. Maximum

These five values are used to build box plots, which visualize spread and help flag outliers.

Outliers are often defined as values below \(Q_1 - 1.5 \cdot IQR\) or above \(Q_3 + 1.5 \cdot IQR\), where \(IQR = Q_3 - Q_1\).

Example: For the dataset [1, 3, 5, 7, 9, 11, 13], Min = 1, \(Q_1 = 3\), Median = 7, \(Q_3 = 11\), Max = 13, so \(IQR = 11 - 3 = 8\). The outlier thresholds are \(3 - 1.5 \cdot 8 = -9\) and \(11 + 1.5 \cdot 8 = 23\), so there are no outliers.
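The same example can be checked in code. This sketch uses `statistics.quantiles` with `method="exclusive"`, which reproduces the quartiles quoted above; other quartile conventions can give slightly different values for small datasets. The `five_number_summary` helper name is illustrative.

```python
from statistics import quantiles

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    data = sorted(values)
    # n=4 splits the data into quartiles and returns the three cut points.
    q1, median, q3 = quantiles(data, n=4, method="exclusive")
    return data[0], q1, median, q3, data[-1]

data = [1, 3, 5, 7, 9, 11, 13]
minimum, q1, median, q3, maximum = five_number_summary(data)

iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(minimum, q1, median, q3, maximum)     # 1 3.0 7.0 11.0 13
print(lower_fence, upper_fence, outliers)   # -9.0 23.0 []
```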

9. Distributions

A distribution shows how values are spread across possible outcomes.

Common Distributions in Data Science:

Many statistical methods assume normality, so checking distribution shape is crucial.

10. Distribution Comparison

Comparing distributions helps detect differences between groups or changes over time.

Methods:

Practical Use: In A/B testing, compare the control and treatment groups using a chi-square test when the metric is a proportion (such as conversion rate) or a t-test when the metric is a mean.
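For the proportion case, here is a minimal standard-library sketch. It uses a two-proportion z-test rather than a chi-square routine; for a 2×2 table the two are equivalent when no continuity correction is applied (the chi-square statistic equals \(z^2\)). The function name and the conversion counts are hypothetical.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for a difference between two conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both groups share one rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts: control converts 200 of 5000, treatment 260 of 5000.
z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

A small p-value suggests the difference in conversion rates is unlikely to be due to sampling noise alone; it says nothing about whether the difference is large enough to matter commercially.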

11. Correlation

Correlation measures the strength and direction of a linear relationship between two numeric variables.

Pearson Correlation Coefficient (\(r\)):

\[ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} \]

Range: \(-1 \leq r \leq 1\)
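The formula above translates directly into code. This is a minimal sketch with no external libraries; the `pearson_r` name and the housing figures are hypothetical and serve only to exercise the formula.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: square root of the product of the sums of squares.
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(ss_x * ss_y)

# Hypothetical data: square footage and price (in thousands).
sqft = [850, 1200, 1500, 1800, 2400]
price = [210, 250, 340, 330, 480]

r = pearson_r(sqft, price)
print(f"r = {r:.3f}")
```

If either variable is constant, its sum of squares is zero and the function raises `ZeroDivisionError`; correlation is undefined in that case.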

Important: Correlation ≠ Causation!

Other Types:

Example: In housing data, square footage and price often have \(r \approx 0.7\), a strong positive correlation. But adding more square footage doesn’t *guarantee* a higher price (location matters too!).