UNIT 5.6

Data Analysis & Statistics for Biochemists

Making sense of your experimental results

🎯 After this unit, you will be able to:

Organize and clean raw biochemical data
Calculate descriptive statistics (mean, SD, SEM)
Create and interpret standard curves
Choose and apply appropriate statistical tests
Present data in figures and tables

📊 Why Data Analysis Matters

In the lab, you generate numbers—absorbance readings, enzyme activities, protein concentrations. But raw data is meaningless without proper analysis. Good data analysis:

Reveals whether your results are significant or just random variation
Allows you to compare treatments (e.g., stressed vs. control plants)
Enables you to publish your findings with confidence
Helps you make informed decisions (e.g., which variety is more drought-tolerant)

Key insight: "The difference between the almost right word and the right word is really a large matter—'tis the difference between the lightning-bug and the lightning." — Mark Twain. The same applies to choosing the right statistical test .

📁 Part 1: Organizing Raw Data

Good Data Management Practices

📝 Lab Notebook First

Always record raw data in your bound lab notebook immediately. Never trust your memory!

📱 Digital Backup

Enter data into spreadsheets (Excel, Google Sheets) with clear headers and consistent formatting.

🏷️ Naming Conventions

Use consistent file names: 20240315_protein_assay_teff.xlsx

📋 Metadata

Record conditions: temperature, pH, instrument, operator, date.

Example: Raw Data from Bradford Assay

File: bradford_teff_leaves_20240315.csv
Sample, Rep, Absorbance_595nm, Dilution
Blank, 1, 0.000, -
Blank, 2, 0.002, -
BSA_0.1, 1, 0.085, -
BSA_0.1, 2, 0.089, -
BSA_0.2, 1, 0.172, -
...
Teff_Control, 1, 0.345, 10
Teff_Control, 2, 0.352, 10
Teff_Drought, 1, 0.412, 10
Teff_Drought, 2, 0.408, 10

💡 Excel Tip: Use "Format as Table" (Ctrl+T) to keep your data organized. This allows you to add new rows and formulas automatically extend .

📈 Part 2: Descriptive Statistics

Descriptive statistics summarize your data with a few key numbers.

Measures of Central Tendency

Term	Definition	Formula	When to use
Mean (x̄)	Average of all values	x̄ = Σx / n	Normally distributed data
Median	Middle value when sorted	Sort, take middle	Skewed data, outliers
Mode	Most frequent value	Count frequencies	Categorical data

Measures of Variability

Term	Definition	Formula	Interpretation
Range	Max - Min	=MAX()-MIN()	Simple but sensitive to outliers
Variance (s²)	Average squared deviation from mean	s² = Σ(x - x̄)²/(n-1)	Used in ANOVA, but units squared
Standard Deviation (SD)	Square root of variance	s = √s²	Describes spread of raw data
Standard Error (SEM)	SD / √n	SEM = s/√n	Describes precision of the mean

🧮 Excel Formulas

=AVERAGE(range) - mean
=MEDIAN(range) - median
=STDEV.S(range) - standard deviation (sample)
=STDEV.P(range) - standard deviation (population)
=VAR.S(range) - variance
=STDEV.S(range)/SQRT(COUNT(range)) - SEM

Example Calculation

Data: Absorbance readings from teff leaf extracts: 0.345, 0.352, 0.338, 0.341, 0.350

Mean: (0.345+0.352+0.338+0.341+0.350)/5 = 0.345

SD: =STDEV.S(0.345,0.352,0.338,0.341,0.350) = 0.0057

SEM: 0.0057/√5 = 0.0025

Report as: 0.345 ± 0.003 (SEM) (n=5)

📉 Part 3: Standard Curves

As covered in Unit 5.2, standard curves are essential for quantifying unknowns. But now we focus on the analysis of those curves.

Creating a Standard Curve in Excel

Enter your standard concentrations in column A, absorbances in column B
Select data, insert Scatter plot
Right-click on data points, "Add Trendline"
Choose "Linear" and check "Display Equation on chart" and "Display R-squared"

📈 [Screenshot: Excel standard curve with equation y = 0.85x + 0.02, R² = 0.99 — to be inserted]

What Makes a Good Standard Curve?

Criterion	Good	Poor
R² value	>0.98	<0.95
Range	Covers expected unknown concentrations	Unknowns outside standard range
Number of points	At least 5-6 standards	Only 2-3 points
Blanks	Near zero	High blank, negative values

Calculating Unknown Concentrations

🧪 Example

Standard curve equation: y = 0.85x + 0.02 (y = absorbance, x = mg/mL protein)

Unknown absorbance: 0.45 (after blank correction)

Calculation: 0.45 = 0.85x + 0.02 → x = (0.45 - 0.02)/0.85 = 0.506 mg/mL

With dilution factor: If sample was diluted 10×, original = 0.506 × 10 = 5.06 mg/mL

🔬 Part 4: Choosing the Right Statistical Test

The most common mistake in data analysis is using the wrong test. Here's a decision guide:

📊 [Flowchart: t-test vs. ANOVA vs. chi-square — to be inserted]

Common Tests for Biochemists

Question	Test	Example
Compare two groups	Student's t-test	Control vs. drought-treated plants
Compare >2 groups	ANOVA	Control, mild drought, severe drought
Compare before/after	Paired t-test	Enzyme activity before and after heat treatment
Correlation	Pearson's r	Brix vs. fruit weight
Frequency data	Chi-square	Survived vs. died under stress

Student's t-test in Excel

📊 Two-sample t-test (assuming equal variance)

Data: Group A (control) and Group B (treated)

Excel formula: =T.TEST(array1, array2, 2, 2)

2 = two-tailed (for difference in either direction)
2 = two-sample equal variance (homoscedastic)

Interpretation: If p < 0.05, groups are significantly different.

🌾 Teff Drought Experiment

A researcher measures proline content in teff leaves under control and drought conditions (n=5 each):

Control: 2.1, 2.3, 2.0, 2.2, 2.4 μmol/g
Drought: 5.2, 5.5, 4.8, 5.3, 5.1 μmol/g

t-test result: p = 0.000003 (p < 0.001). Conclusion: Drought significantly increases proline accumulation in teff.

🎨 Part 5: Presenting Data in Figures and Tables

Tables vs. Figures

Tables: Use when you need exact values, multiple parameters, or small datasets
Figures: Use to show trends, comparisons, or relationships visually

Guidelines for Good Figures

📊 Bar charts

Use for comparing categories. Always include error bars (SD or SEM) and indicate significance (**, *, ns).

📈 Scatter plots

Use for correlations or showing individual data points. Add trend line with equation and R².

📉 Line graphs

Use for time courses or dose-response curves.

🧪 Box plots

Show median, quartiles, and outliers. Good for skewed data.

Figure Legends

A good figure legend should stand alone. Include:

Brief title
What was measured
Experimental conditions (n, treatments)
What error bars represent (SD or SEM)
Statistical significance notation

✅ Example Figure Legend

"Figure 1. Proline accumulation in teff leaves under control and drought stress. Plants were grown for 14 days with (n=5) or without (n=5) water. Bars represent mean ± SEM. *** p < 0.001 (t-test)."

⚠️ Part 6: Common Pitfalls and How to Avoid Them

Pitfall	Why it's a problem	Solution
Confusing SD and SEM	SD describes data spread; SEM describes precision of mean. Using SEM to describe variation makes data look less variable than it really is.	Always specify which you use. For describing data, use SD. For comparing means, use SEM with clear statement.
Multiple t-tests instead of ANOVA	Increases chance of Type I error (false positive). If you have 3 groups and do 3 t-tests, your chance of a false positive is ~14% not 5%.	Use ANOVA first, then post-hoc tests (Tukey, Bonferroni) if ANOVA is significant.
Ignoring outliers	Outliers can skew results dramatically.	Use Grubbs' test or IQR method to identify outliers. Document any removal.
P-hacking	Running many tests until you get p < 0.05 is unethical and produces false results.	Pre-register your hypothesis and analysis plan. Be transparent about all tests run.
Extrapolating beyond standard curve	Linear relationship may not hold at higher concentrations.	Dilute samples so unknowns fall within your standard curve range.

📊 Did you know? The journal "Basic and Applied Ecology" found that ~50% of papers had statistical errors, most commonly misuse of t-tests and incorrect error bars .

💻 Part 7: Software Options

📊 Microsoft Excel

Good for basic stats, standard curves, and simple tests. Available in most Ethiopian universities.

📈 R (Free)

Powerful, free, and widely used in research. Steep learning curve but worth it.

📉 Python

With libraries like pandas, numpy, matplotlib. Also free and powerful.

🧪 GraphPad Prism

User-friendly, excellent for biochemists. Commercial license required.

📥 Free Resources:

✏️ Part 8: Practice Dataset

Download this dataset and analyze it using Excel:

Dataset: peroxidase_activity.csv
Variety, Treatment, Activity_U_per_g
Teff_White, Control, 2.3
Teff_White, Control, 2.5
Teff_White, Control, 2.2
Teff_White, Drought, 4.8
Teff_White, Drought, 5.1
Teff_White, Drought, 4.9
Teff_Brown, Control, 3.1
Teff_Brown, Control, 3.3
Teff_Brown, Control, 3.0
Teff_Brown, Drought, 5.8
Teff_Brown, Drought, 6.2
Teff_Brown, Drought, 5.9

📋 Analysis Questions

Calculate mean and SD for each group
Create a bar chart with error bars (SEM)
Perform a t-test comparing control vs. drought for each variety
Perform a two-way ANOVA (if possible) to test for variety × treatment interaction
Write a figure legend for your bar chart

🇪🇹 Data Analysis in Ethiopian Research

Ethiopian researchers face unique challenges in data analysis:

Software access: Many rely on Excel, which is sufficient for basic stats. Free R workshops are increasingly available.
Collaboration: International partners often provide statistical support and software.
Publishing: International journals require rigorous statistics. Learning proper analysis increases publication success.

🌾 Teff Variety Trial

An Ethiopian researcher tested 5 teff varieties under 3 water regimes (well-watered, moderate drought, severe drought). Yield data (kg/ha) was collected. Analysis:

Two-way ANOVA: Variety, water regime, and interaction all significant (p < 0.01)
Post-hoc test (Tukey HSD): Identified which varieties performed best under each regime
Conclusion: Variety "DZ-01-196" had highest yield under moderate drought, recommended for areas with intermittent drought

📌 Unit Summary

Concept	Key points
Descriptive stats	Mean (average), SD (spread), SEM (precision of mean)
Standard curves	y = mx + b, R² > 0.98, never extrapolate
t-test	Compare two groups; p < 0.05 = significant difference
ANOVA	Compare >2 groups; use post-hoc tests if significant
Error bars	SD for data spread, SEM for comparing means—always specify which
Common pitfalls	Multiple t-tests, p-hacking, extrapolating beyond standard curve

Reflection question: A student measures peroxidase activity in two teff varieties under control and drought conditions (n=3 each). They run three separate t-tests (control vs. drought for variety A; control vs. drought for variety B; and variety A vs. variety B under drought). Explain why this approach is problematic and what they should do instead.

📌 Key terms introduced

Mean Standard deviation (SD) Standard error (SEM) Variance R² t-test ANOVA Post-hoc test p-value Significance Error bars Outliers p-hacking

✅ Check your understanding

What's the difference between standard deviation and standard error?
You measure protein concentration in three groups: control, treatment A, treatment B. What statistical test would you use?
Your standard curve has R² = 0.89. Is this acceptable? Why or why not?
Explain why you should never extrapolate beyond your standard curve.
A t-test gives p = 0.045. Is this significant? What does it mean?

Discuss your answers in the course forum.

Plant Biochemistry for Horticulture · HORT 202 · Dilla University · Last updated March 2026