Medical Study Cost Prediction with ElasticNet Algorithm in ML
FREE Online Courses: Elevate Skills, Zero Cost. Enroll Now!
Sponsors and contract‑research organisations must budget a clinical study before procurement teams lock investigator contracts. Historical evidence indicates that the total study cost (USD) is driven by:
- Study phase (I–IV)
- Therapeutic area
- Planned enrolment size
- Number of participating countries/sites
- Trial duration (months)
- Randomisation arm count
- Start year (captures inflation & learning curve effects)
These features are strongly collinear—later‑phase trials recruit more participants in more countries over more extended periods—so ordinary least‑squares shoots unstable coefficients, while a pure Lasso model (ℓ¹) can over‑shrink and discard useful predictors. Elastic Net blends Ridge’s ℓ² stability with Lasso’s sparsity, yielding a transparent, robust estimator that produces cost forecasts from the handful of variables known at protocol design.
Libraries Required
| Purpose | Library |
| Data handling | pandas, numpy |
| Charts | matplotlib, seaborn |
| Machine‑learning pipeline | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Evaluation | mean_squared_error, r2_score |
Dataset
Master List of Clinical Trial Costs
Step-by-Step Code Implementation
1. Import libraries
import pandas as pd import numpy as np import matplotlib.pyplot as plt import seaborn as sns from sklearn.compose import ColumnTransformer from sklearn.preprocessing import OneHotEncoder, StandardScaler from sklearn.model_selection import train_test_split, GridSearchCV from sklearn.pipeline import Pipeline from sklearn.linear_model import ElasticNet from sklearn.metrics import mean_squared_error, r2_score
2. Load and light‑clean the dataset
df = pd.read_csv("trial_costs.csv")
# Keep only columns available at planning stage
cols = ['Total_Cost_USD', 'Phase', 'Therapeutic_Area', 'Enrolment',
'Countries', 'Duration_Months', 'Arms', 'Start_Year']
df = df[cols].dropna()
y = df['Total_Cost_USD'] # target
3. Feature matrix
X = df.drop(columns='Total_Cost_USD') cat_cols = ['Phase', 'Therapeutic_Area'] num_cols = ['Enrolment', 'Countries', 'Duration_Months', 'Arms', 'Start_Year']
4. Elastic Net pipeline
Pre‑processing:
- One‑hot encode categorical predictors (Phase, Therapeutic_Area); z‑scale numeric predictors to equalise penalty impact.
- Embedding transformations in a Pipeline ensures identical processing during cross‑validation and real‑world inference.
preprocess = ColumnTransformer([
('cat', OneHotEncoder(drop='first'), cat_cols),
('num', StandardScaler(), num_cols)
])
pipe = Pipeline([
('prep', preprocess),
('enet', ElasticNet(max_iter=20000, random_state=42))
])
5. Train/test split & grid search
ElasticNet hyper‑tuning:
- Alpha increases or decreases overall shrinkage; larger α reduces variance but increases bias.
- l1_ratio slides from Ridge (0 = pure ℓ², robust to multicollinearity) to Lasso (1 = pure ℓ¹, drives sparsity).
- A grid of 162 models (18 alphas × 9 ratios) evaluated via 5‑fold CV selects the configuration with the lowest RMSE.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=df['Phase'])
param_grid = {
'enet__alpha' : np.logspace(-3, 1, 18), # 0.001 → 10
'enet__l1_ratio': np.linspace(0.1, 0.9, 9) # Ridge‑heavy → Lasso‑heavy
}
gs = GridSearchCV(pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1).fit(X_train, y_train)
print("Best alpha:", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])
6. Evaluate on the hold‑out set
y_pred = gs.predict(X_test)
rmse = mean_squared_error(y_test, y_pred, squared=False)
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
7. Interpret the most influential features
The coefficient bar chart quantifies that each additional 100 subjects adds roughly $2 M in total cost, Phase III trials cost $30 M more than Phase II, and an extra country raises cost by about $2.5 M—concrete levers for study‑design trade‑offs.
# Retrieve feature names
ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
feature_names = np.hstack([ohe.get_feature_names_out(cat_cols), num_cols])
# De‑scale numeric coefficients
scales = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
coef = gs.best_estimator_.named_steps['enet'].coef_
coef[-len(num_cols):] = coef[-len(num_cols):] / scales
(pd.Series(coef, index=feature_names)
.sort_values(key=abs, ascending=False)
.head(15)
.plot(kind='barh', figsize=(9,5)))
plt.gca().invert_yaxis()
plt.xlabel('Δ Cost (USD)')
plt.title('Elastic Net – Top Drivers of Study Cost')
plt.tight_layout()
plt.show()
Summary
The ElasticNet workflow delivers:
- Early, defendable study‑cost forecasts with low error on unseen data.
- Stability in the presence of correlated planning variables while keeping the feature set concise.
- Clear dollar impacts for enrolment, phase, geography, and timeline, empowering sponsors and CROs to optimise protocol scope and negotiate budgets on solid financial ground.