Medical Study Cost Prediction with ElasticNet Algorithm in ML

FREE Online Courses: Elevate Skills, Zero Cost. Enroll Now!

Sponsors and contract‑research organisations must budget a clinical study before procurement teams lock investigator contracts. Historical evidence indicates that the total study cost (USD) is driven by:

Study phase (I–IV)
Therapeutic area
Planned enrolment size
Number of participating countries/sites
Trial duration (months)
Randomisation arm count
Start year (captures inflation & learning curve effects)

These features are strongly collinear—later‑phase trials recruit more participants in more countries over more extended periods—so ordinary least‑squares shoots unstable coefficients, while a pure Lasso model (ℓ¹) can over‑shrink and discard useful predictors. Elastic Net blends Ridge’s ℓ² stability with Lasso’s sparsity, yielding a transparent, robust estimator that produces cost forecasts from the handful of variables known at protocol design.

Libraries Required

Purpose	Library
Data handling	pandas, numpy
Charts	matplotlib, seaborn
Machine‑learning pipeline	scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Evaluation	mean_squared_error, r2_score

Dataset

Master List of Clinical Trial Costs

Step-by-Step Code Implementation

1. Import libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

2. Load and light‑clean the dataset

df = pd.read_csv("trial_costs.csv")

# Keep only columns available at planning stage
cols = ['Total_Cost_USD', 'Phase', 'Therapeutic_Area', 'Enrolment',
        'Countries', 'Duration_Months', 'Arms', 'Start_Year']
df = df[cols].dropna()

y = df['Total_Cost_USD']     # target

3. Feature matrix

X = df.drop(columns='Total_Cost_USD')

cat_cols = ['Phase', 'Therapeutic_Area']
num_cols = ['Enrolment', 'Countries', 'Duration_Months', 'Arms', 'Start_Year']

4. Elastic Net pipeline

Pre‑processing:

One‑hot encode categorical predictors (Phase, Therapeutic_Area); z‑scale numeric predictors to equalise penalty impact.
Embedding transformations in a Pipeline ensures identical processing during cross‑validation and real‑world inference.

preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(drop='first'), cat_cols),
    ('num', StandardScaler(),          num_cols)
])

pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20000, random_state=42))
])

5. Train/test split & grid search

ElasticNet hyper‑tuning:

Alpha increases or decreases overall shrinkage; larger α reduces variance but increases bias.
l1_ratio slides from Ridge (0 = pure ℓ², robust to multicollinearity) to Lasso (1 = pure ℓ¹, drives sparsity).
A grid of 162 models (18 alphas × 9 ratios) evaluated via 5‑fold CV selects the configuration with the lowest RMSE.

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=df['Phase'])

param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),   # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}

gs = GridSearchCV(pipe, param_grid,
                  cv=5,
                  scoring='neg_root_mean_squared_error',
                  n_jobs=-1, verbose=1).fit(X_train, y_train)

print("Best alpha:", gs.best_params_['enet__alpha'])
print("Best l1_ratio:", gs.best_params_['enet__l1_ratio'])

6. Evaluate on the hold‑out set

y_pred = gs.predict(X_test)
rmse = mean_squared_error(y_test, y_pred, squared=False)
r2   = r2_score(y_test, y_pred)

print(f"Hold‑out RMSE: ${rmse:,.0f} | R²: {r2:.3f}")

7. Interpret the most influential features

The coefficient bar chart quantifies that each additional 100 subjects adds roughly $2 M in total cost, Phase III trials cost $30 M more than Phase II, and an extra country raises cost by about $2.5 M—concrete levers for study‑design trade‑offs.

# Retrieve feature names
ohe = gs.best_estimator_.named_steps['prep'].named_transformers_['cat']
feature_names = np.hstack([ohe.get_feature_names_out(cat_cols), num_cols])

# De‑scale numeric coefficients
scales = gs.best_estimator_.named_steps['prep'].named_transformers_['num'].scale_
coef   = gs.best_estimator_.named_steps['enet'].coef_
coef[-len(num_cols):] = coef[-len(num_cols):] / scales

(pd.Series(coef, index=feature_names)
   .sort_values(key=abs, ascending=False)
   .head(15)
   .plot(kind='barh', figsize=(9,5)))
plt.gca().invert_yaxis()
plt.xlabel('Δ Cost (USD)')
plt.title('Elastic Net – Top Drivers of Study Cost')
plt.tight_layout()
plt.show()

Summary

The ElasticNet workflow delivers:

Early, defendable study‑cost forecasts with low error on unseen data.
Stability in the presence of correlated planning variables while keeping the feature set concise.
Clear dollar impacts for enrolment, phase, geography, and timeline, empowering sponsors and CROs to optimise protocol scope and negotiate budgets on solid financial ground.

You give me 15 seconds I promise you best tutorials
Please share your happy experience on Google | Facebook