Computational Biology · Case Study
Gene Expression Age Prediction
Do multivariate gene-expression patterns predict donor age within and across human tissues?
Aging alters gene regulation across thousands of genes at once, not just a handful of markers. Using GTEx bulk RNA-seq across four human tissues (Whole Blood, Muscle-Skeletal, Brain-Cortex, and Liver), this project asks whether expression patterns predict a donor's age, and whether that signal is shared across tissues or is largely tissue-specific.
Dataset & preprocessing
The analysis draws on 2,153 donor-tissue samples spanning 26,611 genes. Sample metadata includes donor age, age group, tissue type, and sex, giving both a continuous target and categorical covariates for downstream modeling.
- GTEx RNA-seq: bulk expression
- 2,153 samples × 26,611 genes: filtered TPM matrix
- X_genes: expression predictors
- y = age: AGE_NUM (years)
- metadata: tissue · sex · age group
Gene expression matrix
2,153 samples (rows) × 26,611 genes (columns); each entry is an expression level. Age, tissue, and sex are kept as separate metadata.
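The layout described above can be sketched as follows. This is a minimal illustration on small synthetic data, not the project's loading code: the names `X_genes` and `AGE_NUM` come from the page, while the dictionary layout, the random values, and the reduced shape are assumptions made so the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Small synthetic stand-in; the real matrix is 2,153 samples x 26,611 genes.
n_samples, n_genes = 200, 500
X_genes = rng.lognormal(mean=1.0, sigma=1.0, size=(n_samples, n_genes))  # TPM-like, non-negative

# Metadata is kept separate from the expression matrix (hypothetical layout).
tissues = np.array(["Whole Blood", "Muscle-Skeletal", "Brain-Cortex", "Liver"])
metadata = {
    "AGE_NUM": rng.integers(20, 80, size=n_samples).astype(float),  # invented ages in years
    "tissue": rng.choice(tissues, size=n_samples),
    "sex": rng.choice(np.array(["F", "M"]), size=n_samples),
}

# Continuous regression target: one age per row of X_genes.
y = metadata["AGE_NUM"]
print(X_genes.shape, y.shape)
```

Keeping age, tissue, and sex out of `X_genes` means any later unsupervised step sees expression values only.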
Analysis workflow
Gene expression matrix → PCA → Visualize patterns → Regression (age) → Bootstrap resampling → Confidence intervals → Tissue comparison
Unsupervised analysis (PCA)
Before modeling, PCA reduces the 26,611-gene matrix to a handful of components that capture the dominant patterns of variation. Each principal component is a weighted combination of genes, and crucially PCA never uses age — so any age structure that shows up in the components is intrinsic to the expression data, not imposed by the model.
Gene expression matrix (2,153 × 26,611) → Standardize (StandardScaler) → PCA (10 components) → PC scores (PC1–PC10 per sample) → Visualize (color = age / tissue)
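The standardize-then-PCA step might look like the sketch below, using scikit-learn's `StandardScaler` and `PCA`. The input is random synthetic data with a reduced shape. Whether the project log-transforms TPM values before scaling is not stated on the page, so no transform is applied here.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_genes = rng.normal(size=(200, 500))  # synthetic stand-in for the 2,153 x 26,611 matrix

# Give every gene zero mean and unit variance so highly expressed genes do not dominate.
X_scaled = StandardScaler().fit_transform(X_genes)

# PCA is fit on expression alone; age is never passed in.
pca = PCA(n_components=10, random_state=0)
pc_scores = pca.fit_transform(X_scaled)  # shape: (n_samples, 10)

explained = pca.explained_variance_ratio_  # per-component fraction of variance
cumulative = np.cumsum(explained)          # running total across PC1..PC10
print(pc_scores.shape, cumulative[-1])
```

On the real data, `explained[0]` and `cumulative[-1]` would correspond to the PC1 and first-10-PC figures reported on the page. The values from this random matrix carry no meaning.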
Variance explained by principal components
PC1 alone captures 28.5%; the first 10 PCs capture 73.7%.
Supervised modeling
The response variable is donor age; the predictors are the first ten principal components (PC1–PC10) plus a sex covariate. A linear regression model is evaluated with 5-fold cross-validation, yielding an R² (fraction of age variance explained) and an RMSE (root-mean-squared prediction error, in years) for each fold.
Inputs: PC1–PC10 (input features) · Sex (covariate) · y = age (response)
Model matrix (n × 11) → Linear regression (5-fold cross-validation)
Outputs: R² (variance explained) · RMSE (prediction error)
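A minimal sketch of this evaluation with scikit-learn is shown below. The PC scores, sex codes, and ages are synthetic. Coding sex as 0/1 and shuffling the folds with a fixed seed are assumptions, since the page does not say how either was handled.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

rng = np.random.default_rng(0)
n = 300
pc_scores = rng.normal(size=(n, 10))  # stand-in for PC1..PC10
sex = rng.integers(0, 2, size=n)      # assumed 0/1 coding of the sex covariate
age = 50 + 4.0 * pc_scores[:, 0] + rng.normal(scale=10.0, size=n)  # invented ages

# Model matrix: ten PCs plus sex gives n x 11.
X = np.column_stack([pc_scores, sex])

cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_validate(
    LinearRegression(), X, age, cv=cv,
    scoring=("r2", "neg_root_mean_squared_error"),
)

fold_r2 = scores["test_r2"]
fold_rmse = -scores["test_neg_root_mean_squared_error"]  # flip sign back to a positive error
print(fold_r2.mean(), fold_rmse.mean())
```

Cross-validated R² is computed on held-out folds, so it can be negative when the model predicts worse than the fold's mean age.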
- R²: 0.129 (≈13% of age variance)
- RMSE: 11.7 yr (root-mean-squared prediction error)
- Samples / Genes: 2,153 / 26,611
- Bootstrap mean R²: 0.130 (95% CI 0.102–0.160)
- Elastic Net R²: 0.130 (α = 0.1, l1 = 0.5; CI 0.102–0.160)
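For comparison, an Elastic Net fit could be sketched as below. It assumes the reported α and l1 values map onto scikit-learn's `alpha` and `l1_ratio` parameters and that the model uses the same 11-column feature matrix. Neither assumption is confirmed on the page, and the data are synthetic.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 300
X = rng.normal(size=(n, 11))  # stand-in for PC1..PC10 plus sex
age = 50 + 4.0 * X[:, 0] + rng.normal(scale=10.0, size=n)  # invented ages

# Scale inside the pipeline so each fold is standardized using only its training rows.
model = make_pipeline(StandardScaler(), ElasticNet(alpha=0.1, l1_ratio=0.5))

cv = KFold(n_splits=5, shuffle=True, random_state=0)
enet_r2 = cross_val_score(model, X, age, cv=cv, scoring="r2")
print(enet_r2.mean())
```

With only 11 predictors and a small penalty, shrinkage has little to remove, so a result close to ordinary least squares would not be surprising.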
Bootstrap resampling
To quantify uncertainty in the cross-validated R², the dataset is resampled with replacement, 5-fold CV is rerun on each resample, and the process is repeated 10,000 times. This produces a distribution of CV R² values from which a 95% confidence interval is derived.
Original dataset (X, y) → Resample with replacement (same size) → 5-fold CV (record mean R²) → Repeat ×10,000 → Bootstrap distribution → 95% CI 0.102–0.160
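The resampling loop can be sketched as follows. The helper name `bootstrap_cv_r2` is hypothetical, and the data are synthetic. The example uses 200 replicates instead of 10,000 to keep the run short. It also assumes a percentile interval, because the page does not name the interval method.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def bootstrap_cv_r2(X, y, n_boot=200, seed=0):
    """Mean 5-fold CV R2 on each of n_boot resamples drawn with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y)
    means = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # same-size resample of row indices
        cv = KFold(n_splits=5, shuffle=True, random_state=b)
        fold_r2 = cross_val_score(LinearRegression(), X[idx], y[idx], cv=cv, scoring="r2")
        means[b] = fold_r2.mean()
    return means

rng = np.random.default_rng(0)
n = 300
X = rng.normal(size=(n, 11))  # stand-in for PC1..PC10 plus sex
y = 50 + 4.0 * X[:, 0] + rng.normal(scale=10.0, size=n)  # invented ages

boot_r2 = bootstrap_cv_r2(X, y, n_boot=200)
ci_low, ci_high = np.percentile(boot_r2, [2.5, 97.5])  # percentile 95% interval
print(boot_r2.mean(), ci_low, ci_high)
```

One design point to keep in mind: a resample drawn with replacement contains duplicated rows, and the same donor-tissue sample can land in both a training and a test fold, which tends to make the CV R² somewhat optimistic.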
Tissue comparison
Repeating the bootstrap within each tissue separately shows how the age signal varies. Whole Blood is the most predictable; Liver and Brain-Cortex have intervals that overlap zero, meaning their age signal is weak and uncertain.
Age predictability across tissues
Whole Blood shows the strongest, most reliable age signal; the intervals for Liver and Brain-Cortex overlap zero.
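A per-tissue version of the bootstrap might look like the sketch below. The helper `bootstrap_ci` is hypothetical. The effect sizes and the equal sample count per tissue are invented so that the synthetic tissues differ, and they do not reflect the GTEx data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def bootstrap_ci(X, y, n_boot=100, seed=0):
    """Return (mean, 2.5th percentile, 97.5th percentile) of bootstrapped 5-fold CV R2."""
    rng = np.random.default_rng(seed)
    n = len(y)
    means = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        cv = KFold(n_splits=5, shuffle=True, random_state=b)
        means[b] = cross_val_score(
            LinearRegression(), X[idx], y[idx], cv=cv, scoring="r2"
        ).mean()
    low, high = np.percentile(means, [2.5, 97.5])
    return means.mean(), low, high

# Invented per-tissue effect sizes (illustration only, not estimates from the data).
effect = {"Whole Blood": 8.0, "Muscle-Skeletal": 4.0, "Brain-Cortex": 0.0, "Liver": 0.0}

rng = np.random.default_rng(42)
results = {}
for tissue, beta in effect.items():
    X_t = rng.normal(size=(150, 11))  # synthetic features for this tissue
    y_t = 50 + beta * X_t[:, 0] + rng.normal(scale=10.0, size=150)
    results[tissue] = bootstrap_ci(X_t, y_t)

for tissue, (mean_r2, low, high) in results.items():
    overlaps_zero = low <= 0.0 <= high
    print(f"{tissue}: mean R2={mean_r2:.3f}, 95% CI [{low:.3f}, {high:.3f}], overlaps zero: {overlaps_zero}")
```

An interval that contains zero indicates that the model cannot be distinguished from simply predicting the mean age for that tissue.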
Conclusions
- PCA shows tissue identity dominates expression variation.
- Expression carries a modest aging signal (~13% of age variance).
- Age predictability differs by tissue: Whole Blood is strongest, Muscle-Skeletal is moderate, and Liver and Brain-Cortex are weaker.
- These differences suggest that aging-related transcriptomic change is partly tissue-specific.
Limitations: PCA captures only linear variation; GTEx is cross-sectional rather than longitudinal; environment and lifestyle factors are not included. Future work could explore nonlinear models and a broader range of regularized (Elastic Net) settings than the single configuration reported here.