← Back to course dashboard 🔬 Module V · Analytical & Biochemical Techniques
UNIT 5.6

Data Analysis & Statistics for Biochemists

Making sense of your experimental results

🎯 After this unit, you will be able to:

  • Organize and clean raw biochemical data
  • Calculate descriptive statistics (mean, SD, SEM)
  • Create and interpret standard curves
  • Choose and apply appropriate statistical tests
  • Present data in figures and tables

📊 Why Data Analysis Matters

In the lab, you generate numbers—absorbance readings, enzyme activities, protein concentrations. But raw data is meaningless without proper analysis. Good data analysis:

  • Reveals whether your results are significant or just random variation
  • Allows you to compare treatments (e.g., stressed vs. control plants)
  • Enables you to publish your findings with confidence
  • Helps you make informed decisions (e.g., which variety is more drought-tolerant)

Key insight: "The difference between the almost right word and the right word is really a large matter—'tis the difference between the lightning-bug and the lightning." — Mark Twain. The same applies to choosing the right statistical test .

📁 Part 1: Organizing Raw Data

Good Data Management Practices

📝 Lab Notebook First

Always record raw data in your bound lab notebook immediately. Never trust your memory!

📱 Digital Backup

Enter data into spreadsheets (Excel, Google Sheets) with clear headers and consistent formatting.

🏷️ Naming Conventions

Use consistent file names: 20240315_protein_assay_teff.xlsx

📋 Metadata

Record conditions: temperature, pH, instrument, operator, date.

Example: Raw Data from Bradford Assay

File: bradford_teff_leaves_20240315.csv
Sample, Rep, Absorbance_595nm, Dilution
Blank, 1, 0.000, -
Blank, 2, 0.002, -
BSA_0.1, 1, 0.085, -
BSA_0.1, 2, 0.089, -
BSA_0.2, 1, 0.172, -
...
Teff_Control, 1, 0.345, 10
Teff_Control, 2, 0.352, 10
Teff_Drought, 1, 0.412, 10
Teff_Drought, 2, 0.408, 10
💡 Excel Tip: Use "Format as Table" (Ctrl+T) to keep your data organized. This allows you to add new rows and formulas automatically extend .

📈 Part 2: Descriptive Statistics

Descriptive statistics summarize your data with a few key numbers.

Measures of Central Tendency

Term Definition Formula When to use
Mean (x̄) Average of all values x̄ = Σx / n Normally distributed data
Median Middle value when sorted Sort, take middle Skewed data, outliers
Mode Most frequent value Count frequencies Categorical data

Measures of Variability

Term Definition Formula Interpretation
Range Max - Min =MAX()-MIN() Simple but sensitive to outliers
Variance (s²) Average squared deviation from mean s² = Σ(x - x̄)²/(n-1) Used in ANOVA, but units squared
Standard Deviation (SD) Square root of variance s = √s² Describes spread of raw data
Standard Error (SEM) SD / √n SEM = s/√n Describes precision of the mean

🧮 Excel Formulas

  • =AVERAGE(range) - mean
  • =MEDIAN(range) - median
  • =STDEV.S(range) - standard deviation (sample)
  • =STDEV.P(range) - standard deviation (population)
  • =VAR.S(range) - variance
  • =STDEV.S(range)/SQRT(COUNT(range)) - SEM

Example Calculation

Data: Absorbance readings from teff leaf extracts: 0.345, 0.352, 0.338, 0.341, 0.350

Mean: (0.345+0.352+0.338+0.341+0.350)/5 = 0.345

SD: =STDEV.S(0.345,0.352,0.338,0.341,0.350) = 0.0057

SEM: 0.0057/√5 = 0.0025

Report as: 0.345 ± 0.003 (SEM) (n=5)

📉 Part 3: Standard Curves

As covered in Unit 5.2, standard curves are essential for quantifying unknowns. But now we focus on the analysis of those curves.

Creating a Standard Curve in Excel

  1. Enter your standard concentrations in column A, absorbances in column B
  2. Select data, insert Scatter plot
  3. Right-click on data points, "Add Trendline"
  4. Choose "Linear" and check "Display Equation on chart" and "Display R-squared"
📈 [Screenshot: Excel standard curve with equation y = 0.85x + 0.02, R² = 0.99 — to be inserted]

What Makes a Good Standard Curve?

Criterion Good Poor
R² value >0.98 <0.95
Range Covers expected unknown concentrations Unknowns outside standard range
Number of points At least 5-6 standards Only 2-3 points
Blanks Near zero High blank, negative values

Calculating Unknown Concentrations

🧪 Example

Standard curve equation: y = 0.85x + 0.02 (y = absorbance, x = mg/mL protein)

Unknown absorbance: 0.45 (after blank correction)

Calculation: 0.45 = 0.85x + 0.02 → x = (0.45 - 0.02)/0.85 = 0.506 mg/mL

With dilution factor: If sample was diluted 10×, original = 0.506 × 10 = 5.06 mg/mL

🔬 Part 4: Choosing the Right Statistical Test

The most common mistake in data analysis is using the wrong test. Here's a decision guide:

📊 [Flowchart: t-test vs. ANOVA vs. chi-square — to be inserted]

Common Tests for Biochemists

Question Test Example
Compare two groups Student's t-test Control vs. drought-treated plants
Compare >2 groups ANOVA Control, mild drought, severe drought
Compare before/after Paired t-test Enzyme activity before and after heat treatment
Correlation Pearson's r Brix vs. fruit weight
Frequency data Chi-square Survived vs. died under stress

Student's t-test in Excel

📊 Two-sample t-test (assuming equal variance)

Data: Group A (control) and Group B (treated)

Excel formula: =T.TEST(array1, array2, 2, 2)

  • 2 = two-tailed (for difference in either direction)
  • 2 = two-sample equal variance (homoscedastic)

Interpretation: If p < 0.05, groups are significantly different.

🌾 Teff Drought Experiment

A researcher measures proline content in teff leaves under control and drought conditions (n=5 each):

  • Control: 2.1, 2.3, 2.0, 2.2, 2.4 μmol/g
  • Drought: 5.2, 5.5, 4.8, 5.3, 5.1 μmol/g

t-test result: p = 0.000003 (p < 0.001). Conclusion: Drought significantly increases proline accumulation in teff.

🎨 Part 5: Presenting Data in Figures and Tables

Tables vs. Figures

  • Tables: Use when you need exact values, multiple parameters, or small datasets
  • Figures: Use to show trends, comparisons, or relationships visually

Guidelines for Good Figures

📊 Bar charts

Use for comparing categories. Always include error bars (SD or SEM) and indicate significance (**, *, ns).

📈 Scatter plots

Use for correlations or showing individual data points. Add trend line with equation and R².

📉 Line graphs

Use for time courses or dose-response curves.

🧪 Box plots

Show median, quartiles, and outliers. Good for skewed data.

Figure Legends

A good figure legend should stand alone. Include:

  • Brief title
  • What was measured
  • Experimental conditions (n, treatments)
  • What error bars represent (SD or SEM)
  • Statistical significance notation

✅ Example Figure Legend

"Figure 1. Proline accumulation in teff leaves under control and drought stress. Plants were grown for 14 days with (n=5) or without (n=5) water. Bars represent mean ± SEM. *** p < 0.001 (t-test)."

⚠️ Part 6: Common Pitfalls and How to Avoid Them

Pitfall Why it's a problem Solution
Confusing SD and SEM SD describes data spread; SEM describes precision of mean. Using SEM to describe variation makes data look less variable than it really is. Always specify which you use. For describing data, use SD. For comparing means, use SEM with clear statement.
Multiple t-tests instead of ANOVA Increases chance of Type I error (false positive). If you have 3 groups and do 3 t-tests, your chance of a false positive is ~14% not 5%. Use ANOVA first, then post-hoc tests (Tukey, Bonferroni) if ANOVA is significant.
Ignoring outliers Outliers can skew results dramatically. Use Grubbs' test or IQR method to identify outliers. Document any removal.
P-hacking Running many tests until you get p < 0.05 is unethical and produces false results. Pre-register your hypothesis and analysis plan. Be transparent about all tests run.
Extrapolating beyond standard curve Linear relationship may not hold at higher concentrations. Dilute samples so unknowns fall within your standard curve range.
📊 Did you know? The journal "Basic and Applied Ecology" found that ~50% of papers had statistical errors, most commonly misuse of t-tests and incorrect error bars .

💻 Part 7: Software Options

📊 Microsoft Excel

Good for basic stats, standard curves, and simple tests. Available in most Ethiopian universities.

📈 R (Free)

Powerful, free, and widely used in research. Steep learning curve but worth it.

📉 Python

With libraries like pandas, numpy, matplotlib. Also free and powerful.

🧪 GraphPad Prism

User-friendly, excellent for biochemists. Commercial license required.

✏️ Part 8: Practice Dataset

Download this dataset and analyze it using Excel:

Dataset: peroxidase_activity.csv
Variety, Treatment, Activity_U_per_g
Teff_White, Control, 2.3
Teff_White, Control, 2.5
Teff_White, Control, 2.2
Teff_White, Drought, 4.8
Teff_White, Drought, 5.1
Teff_White, Drought, 4.9
Teff_Brown, Control, 3.1
Teff_Brown, Control, 3.3
Teff_Brown, Control, 3.0
Teff_Brown, Drought, 5.8
Teff_Brown, Drought, 6.2
Teff_Brown, Drought, 5.9

📋 Analysis Questions

  1. Calculate mean and SD for each group
  2. Create a bar chart with error bars (SEM)
  3. Perform a t-test comparing control vs. drought for each variety
  4. Perform a two-way ANOVA (if possible) to test for variety × treatment interaction
  5. Write a figure legend for your bar chart

🇪🇹 Data Analysis in Ethiopian Research

Ethiopian researchers face unique challenges in data analysis:

  • Software access: Many rely on Excel, which is sufficient for basic stats. Free R workshops are increasingly available.
  • Collaboration: International partners often provide statistical support and software.
  • Publishing: International journals require rigorous statistics. Learning proper analysis increases publication success.

🌾 Teff Variety Trial

An Ethiopian researcher tested 5 teff varieties under 3 water regimes (well-watered, moderate drought, severe drought). Yield data (kg/ha) was collected. Analysis:

  • Two-way ANOVA: Variety, water regime, and interaction all significant (p < 0.01)
  • Post-hoc test (Tukey HSD): Identified which varieties performed best under each regime
  • Conclusion: Variety "DZ-01-196" had highest yield under moderate drought, recommended for areas with intermittent drought

📌 Unit Summary

Concept Key points
Descriptive stats Mean (average), SD (spread), SEM (precision of mean)
Standard curves y = mx + b, R² > 0.98, never extrapolate
t-test Compare two groups; p < 0.05 = significant difference
ANOVA Compare >2 groups; use post-hoc tests if significant
Error bars SD for data spread, SEM for comparing means—always specify which
Common pitfalls Multiple t-tests, p-hacking, extrapolating beyond standard curve
Reflection question: A student measures peroxidase activity in two teff varieties under control and drought conditions (n=3 each). They run three separate t-tests (control vs. drought for variety A; control vs. drought for variety B; and variety A vs. variety B under drought). Explain why this approach is problematic and what they should do instead.

📌 Key terms introduced

Mean Standard deviation (SD) Standard error (SEM) Variance t-test ANOVA Post-hoc test p-value Significance Error bars Outliers p-hacking

✅ Check your understanding

  1. What's the difference between standard deviation and standard error?
  2. You measure protein concentration in three groups: control, treatment A, treatment B. What statistical test would you use?
  3. Your standard curve has R² = 0.89. Is this acceptable? Why or why not?
  4. Explain why you should never extrapolate beyond your standard curve.
  5. A t-test gives p = 0.045. Is this significant? What does it mean?

Discuss your answers in the course forum.

Plant Biochemistry for Horticulture · HORT 202 · Dilla University · Last updated March 2026