
Computational Biology · Case Study

Gene Expression Age Prediction

Do multivariate gene-expression patterns predict donor age within and across human tissues?


Aging alters gene regulation across thousands of genes at once, not just a handful of markers. Using GTEx bulk RNA-seq across four human tissues — Whole Blood, Muscle-Skeletal, Brain-Cortex, and Liver — this project asks whether expression patterns predict a donor's age, and whether that signal is shared across tissues or is largely tissue-specific.

Dataset & preprocessing

The analysis draws on 2,153 donor-tissue samples spanning 26,611 genes. Sample metadata includes donor age, age group, tissue type, and sex, giving both a continuous target and categorical covariates for downstream modeling.

GTEx RNA-seq (bulk expression) → filtered TPM matrix (2,153 samples × 26,611 genes)

X_genes: expression predictors
y = age: AGE_NUM (years)
metadata: tissue · sex · age group

Figure: Gene expression matrix, with samples S1 to S2153 as rows and genes G1 to G26611 as columns. Each entry is an expression level. Age, tissue, and sex are kept as separate metadata.
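As a rough illustration of that layout, the sketch below splits a samples × genes table and a metadata table into predictors, target, and covariates. Only `X_genes` and `AGE_NUM` come from the text; the function name `split_inputs`, the `tissue` and `sex` column names, and the toy values are assumptions made for this example, not the project's actual loading code.

```python
import numpy as np
import pandas as pd

def split_inputs(expr, meta, age_col="AGE_NUM"):
    """Align expression rows with metadata rows; return (X_genes, y, covariates)."""
    # Keep samples present in both tables, in the expression table's row order.
    shared = expr.index[expr.index.isin(meta.index)]
    X_genes = expr.loc[shared].to_numpy(dtype=float)
    y = meta.loc[shared, age_col].to_numpy(dtype=float)
    covariates = meta.loc[shared].drop(columns=[age_col])
    return X_genes, y, covariates

# Toy stand-in: 4 samples x 3 genes (the real matrix is 2,153 x 26,611).
rng = np.random.default_rng(0)
expr = pd.DataFrame(rng.random((4, 3)),
                    index=["S1", "S2", "S3", "S4"], columns=["G1", "G2", "G3"])
meta = pd.DataFrame(
    {"AGE_NUM": [34, 61, 47, 55],
     "tissue": ["Liver", "Whole Blood", "Liver", "Whole Blood"],
     "sex": ["F", "M", "M", "F"]},
    index=["S2", "S1", "S4", "S3"],  # deliberately in a different order
)

X_genes, y, covariates = split_inputs(expr, meta)
print(X_genes.shape, y)
```

Aligning on sample IDs rather than row position guards against the two tables being sorted differently.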

Analysis workflow

Gene expression matrix → PCA → Visualize patterns → Regression (age) → Bootstrap resampling → Confidence intervals → Tissue comparison

Unsupervised analysis (PCA)

Before modeling, PCA reduces the 26,611-gene matrix to ten components that capture the dominant patterns of variation. Each principal component is a weighted combination of genes. PCA never uses age, so any age structure that appears in the components comes from the expression data itself rather than being imposed by the model.

Gene expression matrix (2,153 × 26,611) → Standardize (StandardScaler) → PCA (10 components) → PC scores (PC1–PC10 per sample) → Visualize (color = age / tissue)
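The standardize-then-project steps can be sketched with scikit-learn as follows. This is a minimal illustration on synthetic data, not the project's code; the helper name `pca_scores` and the random matrix are assumptions made for this example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def pca_scores(X_genes, n_components=10, random_state=0):
    """Standardize each gene, then project samples onto the top components."""
    X_scaled = StandardScaler().fit_transform(X_genes)  # zero mean, unit variance per gene
    pca = PCA(n_components=n_components, random_state=random_state)
    scores = pca.fit_transform(X_scaled)                # one row of PC scores per sample
    return scores, pca.explained_variance_ratio_

# Synthetic stand-in for the expression matrix (the real one is 2,153 x 26,611).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 500))

pc_scores, var_ratio = pca_scores(X_demo)
print(pc_scores.shape)       # (200, 10)
print(var_ratio.cumsum())    # cumulative share of variance for PC1..PC10
```

Age is never passed to `pca_scores`, which mirrors the point that the components are learned from expression alone. Coloring a scatter of `pc_scores[:, 0]` against `pc_scores[:, 1]` by age or tissue would then show whether those labels line up with the leading components.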

Figure: Variance explained by principal components (x-axis PC1–PC10; y-axis 0% to 80%). PC1 alone captures 28.5%; the first 10 PCs capture 73.7%.

PCA coordinates load from a precomputed export

Supervised modeling

The response variable is donor age; predictors are the first ten principal components (PC1–PC10) plus a sex covariate. A linear regression model is evaluated with 5-fold cross-validation, producing an R² (fraction of age variance explained) and an RMSE (root-mean-square prediction error, in years) for each fold.

Inputs: PC1–PC10 (input features) · Sex (covariate) · y = age (response)

Model matrix (n × 11) → Linear regression → 5-fold cross-validation

Metrics: R² (variance explained) · RMSE (prediction error)
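A hedged sketch of this evaluation, assuming scikit-learn and a numerically coded sex column: the function name `cv_linear_age_model`, the shuffled folds, the seed, and the synthetic data are illustrative assumptions, since the source does not say how the folds were drawn.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_validate

def cv_linear_age_model(pc_scores, sex, age, n_splits=5, seed=0):
    """Return per-fold R^2 and RMSE for age ~ PC1..PC10 + sex."""
    X = np.column_stack([pc_scores, sex])  # n x 11 model matrix
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    res = cross_validate(
        LinearRegression(), X, age, cv=cv,
        scoring=("r2", "neg_root_mean_squared_error"),
    )
    # scikit-learn reports RMSE negated so that larger is better; flip the sign back.
    return res["test_r2"], -res["test_neg_root_mean_squared_error"]

# Synthetic stand-in data: 300 samples with a weak age signal on PC1.
rng = np.random.default_rng(0)
n = 300
pcs = rng.normal(size=(n, 10))
sex = rng.integers(0, 2, size=n)  # assumed 0/1 coding
age = 50 + 4 * pcs[:, 0] + rng.normal(scale=10, size=n)

r2_folds, rmse_folds = cv_linear_age_model(pcs, sex, age)
print(r2_folds.mean(), rmse_folds.mean())
```

Averaging the per-fold values gives single summary numbers of the kind reported for this model.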

R²: 0.129 (≈13% of age variance)
RMSE: 11.7 yr (typical prediction error)
Samples / Genes: 2,153 / 26,611
Bootstrap mean R²: 0.130 (95% CI 0.102–0.160)
Elastic Net R²: 0.130 (α=0.1, l1=0.5 · CI 0.102–0.160)
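For the Elastic Net comparison, one plausible reading of "α=0.1, l1=0.5" is scikit-learn's `ElasticNet(alpha=0.1, l1_ratio=0.5)`. The sketch below takes that mapping as an assumption. The standardization step, the helper name `cv_elastic_net_r2`, and the synthetic data are likewise illustrative rather than taken from the project.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def cv_elastic_net_r2(X, age, alpha=0.1, l1_ratio=0.5, n_splits=5, seed=0):
    """Mean 5-fold CV R^2 for an Elastic Net on the PC + sex model matrix."""
    # Scaling inside the pipeline is an assumption; it keeps the penalty
    # comparable across predictors and is refit on each training fold.
    model = make_pipeline(StandardScaler(),
                          ElasticNet(alpha=alpha, l1_ratio=l1_ratio))
    cv = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    return float(cross_val_score(model, X, age, cv=cv, scoring="r2").mean())

# Synthetic stand-in: 10 PC-like columns plus a 0/1 covariate.
rng = np.random.default_rng(1)
n = 300
X_demo = np.column_stack([rng.normal(size=(n, 10)), rng.integers(0, 2, size=n)])
age_demo = 50 + 8 * X_demo[:, 0] + rng.normal(scale=10, size=n)

enet_r2 = cv_elastic_net_r2(X_demo, age_demo)
print(enet_r2)
```

With only eleven predictors and a mild penalty, an Elastic Net would be expected to land close to ordinary least squares, which is consistent with the near-identical R² values reported above.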

Bootstrap resampling

To quantify uncertainty in the cross-validated R², the dataset is resampled with replacement, 5-fold CV is rerun on each resample, and the process is repeated 10,000 times. This produces a distribution of CV R² values from which a 95% confidence interval is derived.

Original dataset (X, y) → Resample w/ replacement (same size) → 5-fold CV (record mean R²) → Repeat ×10,000 → Bootstrap distribution → 95% CI 0.102–0.160
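The resampling loop can be sketched as follows. The source does not say how the interval is derived from the distribution, so the percentile method used here is an assumption. The function name `bootstrap_cv_r2`, the reduced repeat count, and the synthetic data are also illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

def bootstrap_cv_r2(X, y, n_boot=10_000, n_splits=5, seed=0):
    """Bootstrap the mean CV R^2; return the distribution and a 95% percentile CI."""
    rng = np.random.default_rng(seed)
    n = len(y)
    means = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement, same size
        cv = KFold(n_splits=n_splits, shuffle=True, random_state=b)
        scores = cross_val_score(LinearRegression(), X[idx], y[idx],
                                 cv=cv, scoring="r2")
        means[b] = scores.mean()          # record mean R^2 for this resample
    ci_low, ci_high = np.percentile(means, [2.5, 97.5])
    return means, (ci_low, ci_high)

# Synthetic stand-in data; n_boot is kept small so the sketch runs quickly.
rng = np.random.default_rng(0)
n = 300
X_demo = rng.normal(size=(n, 11))
y_demo = 50 + 4 * X_demo[:, 0] + rng.normal(scale=10, size=n)

boot_means, (ci_low, ci_high) = bootstrap_cv_r2(X_demo, y_demo, n_boot=100)
print(boot_means.mean(), ci_low, ci_high)
```

One caveat of this design: a resample can place copies of the same sample in both the training and test folds, which may make each CV R² slightly optimistic.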

Bootstrap distribution loads from a precomputed export

Tissue comparison

Repeating the bootstrap within each tissue separately shows how the age signal varies by tissue. Whole Blood is the most predictable; Liver and Brain-Cortex have confidence intervals that include zero, meaning their age signal is weak and uncertain.

Figure: Age predictability across tissues (x-axis from -0.10 to 0.25). Whole Blood 0.170 · Muscle-Skeletal 0.139 · Liver 0.091 · Brain-Cortex 0.049. Whole Blood shows the strongest, most reliable age signal; Liver and Brain-Cortex overlap zero.
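Checking whether a tissue's interval overlaps zero can be sketched as a summary over per-tissue bootstrap distributions. The helper `summarize_tissue_bootstrap`, the percentile interval, and the two synthetic distributions below are assumptions made for this example; they are not the project's results.

```python
import numpy as np

def summarize_tissue_bootstrap(boot_r2_by_tissue, level=0.95):
    """Mean R^2, percentile CI, and zero-overlap flag for each tissue."""
    tail = 100 * (1 - level) / 2
    summary = {}
    for tissue, values in boot_r2_by_tissue.items():
        lo, hi = np.percentile(values, [tail, 100 - tail])
        summary[tissue] = {
            "mean_r2": float(np.mean(values)),
            "ci": (float(lo), float(hi)),
            "overlaps_zero": bool(lo <= 0.0 <= hi),  # weak or uncertain signal
        }
    return summary

# Synthetic bootstrap distributions for two hypothetical tissues.
rng = np.random.default_rng(0)
demo = {
    "tissue_a": rng.normal(loc=0.17, scale=0.03, size=1000),  # clearly above zero
    "tissue_b": rng.normal(loc=0.05, scale=0.05, size=1000),  # straddles zero
}

summary = summarize_tissue_bootstrap(demo)
for tissue, stats in summary.items():
    print(tissue, stats)
```

A tissue whose interval includes zero cannot be distinguished from having no cross-validated age signal at the chosen confidence level.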

Conclusions

Limitations: PCA captures only linear variation; GTEx is cross-sectional rather than longitudinal; environment and lifestyle factors are not included. Future work could explore nonlinear models and a wider range of regularization than the single Elastic Net setting (α=0.1, l1=0.5) tried here.