In the lab, you generate numbers—absorbance readings, enzyme activities, protein concentrations. But raw data is meaningless without proper analysis. Good data analysis:
Key insight: "The difference between the almost right word and the right word is really a large matter—'tis the difference between the lightning-bug and the lightning." — Mark Twain. The same applies to choosing the right statistical test.
Always record raw data in your bound lab notebook immediately. Never trust your memory!
Enter data into spreadsheets (Excel, Google Sheets) with clear headers and consistent formatting.
Use consistent file names that start with the date (YYYYMMDD) so files sort chronologically, for example: 20240315_protein_assay_teff.xlsx
Record conditions: temperature, pH, instrument, operator, date.
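The naming convention above can be generated rather than typed by hand, which avoids small inconsistencies between operators. This is a minimal sketch: the helper name `make_filename` and its arguments are illustrative assumptions, not part of the course materials.

```python
from datetime import date

def make_filename(run_date: date, assay: str, sample: str, ext: str = "xlsx") -> str:
    """Build a name such as 20240315_protein_assay_teff.xlsx (hypothetical helper)."""
    parts = [run_date.strftime("%Y%m%d"), assay, sample]
    # Lowercase and replace spaces so names stay consistent across operators.
    cleaned = [p.strip().lower().replace(" ", "_") for p in parts]
    return "_".join(cleaned) + "." + ext

filename = make_filename(date(2024, 3, 15), "protein assay", "teff")
print(filename)  # 20240315_protein_assay_teff.xlsx
```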
Descriptive statistics summarize your data with a few key numbers.
| Term | Definition | Formula | When to use |
|---|---|---|---|
| Mean (x̄) | Average of all values | x̄ = Σx / n | Normally distributed data |
| Median | Middle value when sorted | Sort, take middle | Skewed data, outliers |
| Mode | Most frequent value | Count frequencies | Categorical data |
| Term | Definition | Formula | Interpretation |
|---|---|---|---|
| Range | Max - Min | =MAX()-MIN() | Simple but sensitive to outliers |
| Variance (s²) | Average squared deviation from mean | s² = Σ(x - x̄)²/(n-1) | Used in ANOVA, but units squared |
| Standard Deviation (SD) | Square root of variance | s = √s² | Describes spread of raw data |
| Standard Error (SEM) | SD / √n | SEM = s/√n | Describes precision of the mean |
Data: Absorbance readings from teff leaf extracts: 0.345, 0.352, 0.338, 0.341, 0.350
Mean: (0.345+0.352+0.338+0.341+0.350)/5 = 0.345
SD: =STDEV.S(0.345,0.352,0.338,0.341,0.350) = 0.0059
SEM: 0.0059/√5 = 0.0026
Report as: 0.345 ± 0.003 (SEM) (n=5)
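The same calculation can be checked outside Excel. The sketch below uses only Python's standard library and the five readings from the example; `statistics.stdev` uses the n − 1 denominator, so it corresponds to Excel's STDEV.S.

```python
import math
import statistics

# Absorbance readings from the teff leaf extract example
readings = [0.345, 0.352, 0.338, 0.341, 0.350]

n = len(readings)
mean = statistics.mean(readings)
sd = statistics.stdev(readings)   # sample SD (n - 1 denominator), like STDEV.S
sem = sd / math.sqrt(n)           # standard error of the mean

print(f"mean = {mean:.3f}, SD = {sd:.4f}, SEM = {sem:.4f}, n = {n}")
# mean = 0.345, SD = 0.0059, SEM = 0.0026, n = 5
```

Rounding only at the reporting step, as done here, keeps small rounding errors from propagating into the SEM.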
As covered in Unit 5.2, standard curves are essential for quantifying unknowns. Here we focus on how to analyze those curves and judge their quality.
| Criterion | Good | Poor |
|---|---|---|
| R² value | >0.98 | <0.95 |
| Range | Covers expected unknown concentrations | Unknowns outside standard range |
| Number of points | At least 5-6 standards | Only 2-3 points |
| Blanks | Near zero | High blank, negative values |
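Two of the criteria in the table, R² and the number of standards, can be checked directly once the curve is fitted. The following is a rough sketch using an ordinary least-squares fit written out by hand; the standard concentrations and absorbances are hypothetical values chosen for illustration, and `fit_line` is an assumed helper name.

```python
# Hypothetical standards for illustration only (not data from this unit).
conc = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0]                    # mg/mL protein
absorbance = [0.021, 0.188, 0.365, 0.527, 0.703, 0.868]  # blank included at 0.0

def fit_line(x, y):
    """Ordinary least-squares fit; returns slope, intercept, and R^2."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # R^2 = 1 - (residual sum of squares / total sum of squares)
    ss_res = sum((yi - (slope * xi + intercept)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot

slope, intercept, r2 = fit_line(conc, absorbance)
print(f"y = {slope:.3f}x + {intercept:.3f}, R^2 = {r2:.4f}")
print("R^2 above 0.98:", r2 > 0.98)
print("At least 5 standards:", len(conc) >= 5)
```

A high R² does not by itself show that the unknowns fall inside the standard range, so that criterion still needs a separate check against the lowest and highest standards.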
Standard curve equation: y = 0.85x + 0.02 (y = absorbance, x = mg/mL protein)
Unknown absorbance: 0.45 (after blank correction)
Calculation: 0.45 = 0.85x + 0.02 → x = (0.45 - 0.02)/0.85 = 0.506 mg/mL
With dilution factor: If sample was diluted 10×, original = 0.506 × 10 = 5.06 mg/mL
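The calculation above amounts to inverting the line and multiplying by the dilution factor. A minimal sketch, using the slope and intercept from the example; the function name `concentration_from_absorbance` is an assumption made for illustration.

```python
def concentration_from_absorbance(a, slope, intercept, dilution_factor=1.0):
    """Invert y = slope * x + intercept, then scale by the dilution factor."""
    return (a - intercept) / slope * dilution_factor

# Values from the worked example: y = 0.85x + 0.02, blank-corrected A = 0.45
diluted = concentration_from_absorbance(0.45, slope=0.85, intercept=0.02)
original = concentration_from_absorbance(0.45, slope=0.85, intercept=0.02,
                                         dilution_factor=10)

print(f"{diluted:.3f} mg/mL as measured, {original:.2f} mg/mL in the original sample")
# 0.506 mg/mL as measured, 5.06 mg/mL in the original sample
```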
A common mistake in data analysis is using the wrong test. Here's a decision guide:
| Question | Test | Example |
|---|---|---|
| Compare two independent groups | Student's (unpaired) t-test | Control vs. drought-treated plants |
| Compare >2 groups | ANOVA | Control, mild drought, severe drought |
| Compare before/after | Paired t-test | Enzyme activity before and after heat treatment |
| Correlation | Pearson's r | Brix vs. fruit weight |
| Frequency data | Chi-square | Survived vs. died under stress |
Data: Group A (control) and Group B (treated)
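Excel formula: =T.TEST(array1, array2, 2, 2), where the third argument (2) requests a two-tailed test and the fourth argument (2) selects a two-sample test assuming equal variances.
Interpretation: If p < 0.05, the difference between the group means is statistically significant at the 5% level.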
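To see what the equal-variance test computes, the sketch below works out the t statistic by hand with Python's standard library. The measurements are hypothetical, and `pooled_t` is an assumed helper name. The standard library has no t distribution, so the sketch compares |t| with the tabulated two-tailed critical value for 8 degrees of freedom instead of computing p; a statistics package such as SciPy (`scipy.stats.ttest_ind`) would return the p-value directly.

```python
import math
import statistics

# Hypothetical measurements for illustration only (not data from this unit).
group_a = [12.1, 11.8, 12.5, 12.0, 11.9]  # control
group_b = [13.4, 13.9, 13.1, 13.6, 13.8]  # treated

def pooled_t(a, b):
    """Two-sample t statistic assuming equal variances; returns (t, df)."""
    n_a, n_b = len(a), len(b)
    df = n_a + n_b - 2
    # Pooled variance: weighted average of the two sample variances
    pooled_var = ((n_a - 1) * statistics.variance(a)
                  + (n_b - 1) * statistics.variance(b)) / df
    se = math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (statistics.mean(a) - statistics.mean(b)) / se, df

t_stat, df = pooled_t(group_a, group_b)

# Two-tailed critical value for alpha = 0.05 and df = 8, from a standard t table.
T_CRITICAL_DF8 = 2.306
significant = df == 8 and abs(t_stat) > T_CRITICAL_DF8
print(f"t = {t_stat:.2f}, df = {df}, significant at 0.05: {significant}")
```

The critical value applies only to df = 8 (two groups of five); other sample sizes need the matching row of the t table.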
A researcher measures proline content in teff leaves under control and drought conditions (n=5 each):
t-test result: p = 0.000003 (p < 0.001). Conclusion: Drought significantly increases proline accumulation in teff.
Use for comparing categories. Always include error bars (SD or SEM) and indicate significance (**, *, ns).
Use for correlations or showing individual data points. Add trend line with equation and R².
Use for time courses or dose-response curves.
Show median, quartiles, and outliers. Good for skewed data.
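The numbers behind a box plot can be computed before plotting. This sketch uses hypothetical values and assumes the common 1.5 × IQR convention for flagging outliers; quartile definitions differ slightly between programs, so results may not match Excel or R exactly.

```python
import statistics

# Hypothetical measurements for illustration only; one value is unusually high.
values = [4.1, 4.3, 4.4, 4.6, 4.7, 4.9, 5.0, 5.2, 9.8]

# statistics.quantiles with n=4 returns the three quartile cut points.
q1, median, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the quartiles are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print(f"Q1 = {q1:.2f}, median = {median:.2f}, Q3 = {q3:.2f}, IQR = {iqr:.2f}")
print("Flagged as outliers:", outliers)
```

Flagging a point is not the same as removing it: any exclusion should be documented with its justification.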
A good figure legend should stand alone. Include:
"Figure 1. Proline accumulation in teff leaves under control and drought stress. Plants were grown for 14 days with (n=5) or without (n=5) water. Bars represent mean ± SEM. *** p < 0.001 (t-test)."
| Pitfall | Why it's a problem | Solution |
|---|---|---|
| Confusing SD and SEM | SD describes the spread of the data; SEM describes the precision of the mean. Using SEM to describe variation makes data look less variable than it really is. | Always state which one you report. Use SD to describe data spread. Use SEM when comparing means, and say so in the legend. |
| Multiple t-tests instead of ANOVA | Increases the chance of a Type I error (false positive). With 3 groups and 3 pairwise t-tests at α = 0.05, the chance of at least one false positive is about 14% (1 − 0.95³), not 5%. | Use ANOVA first, then post-hoc tests (Tukey, Bonferroni) if the ANOVA is significant. |
| Ignoring outliers | Outliers can skew results dramatically. | Use Grubbs' test or IQR method to identify outliers. Document any removal. |
| P-hacking | Running many tests until you get p < 0.05 is unethical and produces false results. | Pre-register your hypothesis and analysis plan. Be transparent about all tests run. |
| Extrapolating beyond standard curve | Linear relationship may not hold at higher concentrations. | Dilute samples so unknowns fall within your standard curve range. |
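The inflation of false positives from multiple t-tests, noted in the table, grows quickly with the number of groups. The sketch below assumes the tests are independent, which is a simplification for pairwise comparisons that share groups, and also shows the Bonferroni-adjusted threshold. The function name `familywise_error` is an assumption made for illustration.

```python
from math import comb

def familywise_error(n_groups, alpha=0.05):
    """Return (number of pairwise tests, chance of at least one false positive).

    Assumes independent tests, so this is an approximation for pairwise t-tests.
    """
    n_tests = comb(n_groups, 2)
    return n_tests, 1 - (1 - alpha) ** n_tests

for k in (3, 4, 5):
    n_tests, fwer = familywise_error(k)
    bonferroni_alpha = 0.05 / n_tests  # per-test threshold after Bonferroni
    print(f"{k} groups: {n_tests} tests, "
          f"false-positive chance {fwer:.1%}, "
          f"Bonferroni threshold {bonferroni_alpha:.4f}")
```

For 3 groups this gives 3 tests and roughly a 14% chance of at least one false positive.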
Good for basic stats, standard curves, and simple tests. Available in most Ethiopian universities.
Powerful, free, and widely used in research. Steep learning curve but worth it.
With libraries like pandas, numpy, matplotlib. Also free and powerful.
User-friendly, excellent for biochemists. Commercial license required.
Download this dataset and analyze it using Excel:
Ethiopian researchers face unique challenges in data analysis:
An Ethiopian researcher tested 5 teff varieties under 3 water regimes (well-watered, moderate drought, severe drought). Yield data (kg/ha) was collected. Analysis:
| Concept | Key points |
|---|---|
| Descriptive stats | Mean (average), SD (spread), SEM (precision of mean) |
| Standard curves | y = mx + b, R² > 0.98, never extrapolate |
| t-test | Compare two groups; p < 0.05 = significant difference |
| ANOVA | Compare >2 groups; use post-hoc tests if significant |
| Error bars | SD for data spread, SEM for comparing means—always specify which |
| Common pitfalls | Multiple t-tests, p-hacking, extrapolating beyond standard curve |
Discuss your answers in the course forum.