Academic Grant Funding Prediction using ElasticNet Algorithm in ML
Grant‑making bodies such as the U.S. National Institutes of Health (NIH) receive tens of thousands of proposals per year. Program officers want an early, data‑driven signal of how much funding each application is likely to receive, based only on information available at submission time: institute, study section, principal‑investigator (PI) history, project duration, keywords, etc.
The feature set is broad (dozens of textual and categorical indicators) and collinear (e.g., institute ↔ study section). Under collinearity, ordinary least squares yields inflated, unstable coefficients, while a pure Lasso (ℓ₁) penalty may over-shrink and discard genuinely helpful columns. Elastic Net, a blend of the Ridge (ℓ₂) and Lasso (ℓ₁) penalties, delivers a sparse yet stable model that forecasts award_amount (USD) for unseen proposals. Accurate forecasts help agencies set realistic funding lines and help PIs scope budgets wisely.
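For reference, the objective that scikit-learn's ElasticNet minimizes can be written as below. The notation is chosen here rather than taken from the tutorial: $n$ is the number of training rows, $w$ the coefficient vector, $\alpha$ the overall penalty strength (the `alpha` parameter), and $\rho$ the ℓ₁ share (the `l1_ratio` parameter). The intercept is left out for brevity.

```latex
\min_{w}\; \frac{1}{2n}\,\lVert y - Xw \rVert_2^2
  \;+\; \alpha\,\rho\,\lVert w \rVert_1
  \;+\; \frac{\alpha\,(1-\rho)}{2}\,\lVert w \rVert_2^2
```

With $\rho = 1$ the ℓ₂ term vanishes and the problem reduces to the Lasso; as $\rho$ moves toward 0 the penalty becomes increasingly Ridge-like.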
Libraries Required
| Purpose | Library |
| --- | --- |
| Data wrangling | pandas, numpy |
| Visuals | matplotlib, seaborn |
| ML workflow | scikit‑learn → ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split |
| Metrics | mean_squared_error, r2_score |
Dataset Link
Kaggle dataset: brasks43/nih-reporter-grant-dataset-w-abstracts-2017-2022
Step-by-Step Code Implementation
Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
Download and load the dataset
# one‑time only (requires Kaggle API credentials in ~/.kaggle/kaggle.json):
# kaggle datasets download -d brasks43/nih-reporter-grant-dataset-w-abstracts-2017-2022 -p data --unzip
df = pd.read_csv("data/nih_grant_dataset_2017_2022.csv", low_memory=False)
Quick EDA & target check
print(df[['fy', 'activity', 'agency_ic', 'award_amount']].head())
sns.histplot(df['award_amount']/1e6, kde=True)
plt.xlabel('Award (million USD)'); plt.title('Distribution of NIH Grant Size'); plt.show()
Define features & target
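Why Elastic Net? Collinearity abounds: direct_cost_amt and indirect_cost_amt move together (their sum is approximately award_amount), and activity codes are closely tied to the funding institute. Elastic Net's ℓ₂ part keeps correlated predictors, while its ℓ₁ part zeroes trivial dummies (e.g., PI names with scant data). Because the two cost columns add up to roughly the target, they are likely to dominate the fit; drop them if the forecast must rely only on information known before the award is made.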
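The contrast between the two penalties can be seen on a tiny synthetic example. The sketch below is not part of the NIH workflow: it builds two identical columns (the most extreme form of collinearity) and fits both a Lasso and an Elastic Net, with `alpha`, `l1_ratio`, and the tight `tol` chosen purely for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 200
x = rng.normal(size=n)

# Two identical columns: perfectly collinear synthetic predictors
X = np.column_stack([x, x])
y = 3.0 * x + 3.0 * x + rng.normal(scale=0.5, size=n)

# Tight tolerance so both solvers run close to their optimum
lasso = Lasso(alpha=0.1, tol=1e-10, max_iter=100_000).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5, tol=1e-10, max_iter=100_000).fit(X, y)

print("Lasso coefficients      :", lasso.coef_)
print("Elastic Net coefficients:", enet.coef_)
```

With duplicated columns the Lasso solution is not unique, and a coordinate-descent solver typically concentrates the weight on one of them. The ℓ₂ term makes the Elastic Net solution unique and splits the weight evenly, which is the "keeps correlated predictors" behavior described above.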
# Select a compact set of predictors available at review time
keep_cols = ['fy', 'agency_ic', 'activity', 'admin_ic', 'support_year',
'project_start', 'project_end', 'pi_name', 'org_country',
'direct_cost_amt', 'indirect_cost_amt']
model_df = df[keep_cols + ['award_amount']].dropna()
# Target
y = model_df['award_amount'] # USD
X = model_df.drop(columns='award_amount')
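# Identify column types
# Note: project_start and project_end are in keep_cols but in neither list below,
# so the ColumnTransformer (remainder='drop' by default) leaves them out of the model.
cat_cols = ['agency_ic', 'activity', 'admin_ic', 'pi_name', 'org_country']
num_cols = ['fy', 'support_year', 'direct_cost_amt', 'indirect_cost_amt']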
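The introduction lists project duration among the submission-time signals, but the two date columns are not assigned to either column list. One way to use them is to derive a numeric duration, as in the sketch below. It runs on made-up rows; the ISO date format and the feature name `duration_days` are assumptions, not properties of the NIH file.

```python
import pandas as pd

# Toy rows standing in for the NIH table (values are made up)
toy = pd.DataFrame({
    "project_start": ["2019-01-01", "2020-07-01", None],
    "project_end":   ["2019-12-31", "2025-06-30", "2022-01-01"],
})

# errors="coerce" turns missing or unparseable dates into NaT
start = pd.to_datetime(toy["project_start"], errors="coerce")
end = pd.to_datetime(toy["project_end"], errors="coerce")

# Hypothetical feature name; NaN wherever either date is missing
toy["duration_days"] = (end - start).dt.days
print(toy)
```

If a column like this were added to the real frame, it would also need to be appended to num_cols so that the scaler picks it up.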
Pre‑processing & Elastic Net pipeline
Pipeline safety: because the ColumnTransformer sits inside the Pipeline, the one-hot encoder and the z-scaler are refit on the training portion of each CV fold, which prevents data leakage.
preprocess = ColumnTransformer([
('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), cat_cols),
('num', StandardScaler(), num_cols)
])
pipe = Pipeline([
('prep', preprocess),
('enet', ElasticNet(max_iter=20000, random_state=42))
])
Train/test split & hyper‑parameter tuning
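Hyper-parameter tuning: we sweep 18 α values × 9 l1_ratio values (162 candidate models) with five-fold CV, i.e. 810 fits plus one final refit. GridSearchCV keeps the candidate with the lowest mean cross-validated RMSE (scoring='neg_root_mean_squared_error'), which is how the bias-variance trade-off is settled.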
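# Stratify on agency_ic so each institute has the same share in both splits;
# train_test_split raises a ValueError if any institute has only one row.
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=X['agency_ic'])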
param_grid = {
'enet__alpha' : np.logspace(-3, 1, 18), # 0.001 → 10
'enet__l1_ratio': np.linspace(0.1, 0.9, 9) # Ridge‑heavy → Lasso‑heavy
}
search = GridSearchCV(pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1)
search.fit(X_train, y_train)
print("Best α (overall shrinkage):", search.best_params_['enet__alpha'])
print("Best l1_ratio (Ridge ↔ Lasso):", search.best_params_['enet__l1_ratio'])
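The size of the search can be checked before launching it. The sketch below rebuilds the same grid and counts candidates with scikit-learn's ParameterGrid; the variable names `n_candidates`, `n_folds`, and `n_fits` are introduced here for illustration.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grid as in the tutorial's search
param_grid = {
    "enet__alpha": np.logspace(-3, 1, 18),       # 0.001 to 10
    "enet__l1_ratio": np.linspace(0.1, 0.9, 9),  # Ridge-heavy to Lasso-heavy
}

n_candidates = len(ParameterGrid(param_grid))
n_folds = 5
n_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {n_fits} fits (plus one final refit)")
```

Knowing the fit count up front makes it easier to judge whether the grid is affordable on a large one-hot-encoded matrix, or whether a coarser grid is a better first pass.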
Evaluate on the hold‑out set
y_pred = search.predict(X_test)
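# The squared=False argument is deprecated in recent scikit-learn releases;
# taking the square root of the MSE works on every version.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))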
r2 = r2_score(y_test, y_pred)
print(f"Hold‑out RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
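An RMSE in dollars is easier to judge next to a naive baseline. The sketch below is one possible check on synthetic data, not a result from the NIH dataset: it compares an Elastic Net against a DummyRegressor that always predicts the training mean. The data-generating coefficients and the `rmse` helper are assumptions made for the example.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression problem with a strong linear signal
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = 5.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)  # predicts the training mean
model = ElasticNet(alpha=0.01, l1_ratio=0.5).fit(X_tr, y_tr)

def rmse(y_true, y_hat):
    """Root mean squared error, computed without the `squared` argument."""
    return float(np.sqrt(mean_squared_error(y_true, y_hat)))

baseline_rmse = rmse(y_te, baseline.predict(X_te))
model_rmse = rmse(y_te, model.predict(X_te))
baseline_r2 = r2_score(y_te, baseline.predict(X_te))

print(f"Baseline RMSE: {baseline_rmse:.3f} | R²: {baseline_r2:.3f}")
print(f"Model RMSE:    {model_rmse:.3f}")
```

A constant prediction can never score above R² = 0 on the hold-out set, so the gap between the two RMSE values shows how much the features actually contribute.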
Interpret feature importance
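Interpretation: non-zero coefficients highlight big-ticket levers such as National Cancer Institute (NCI) grants or Program Project activity codes (P series), adding $1–$3 million. Because the encoder uses drop='first', each dummy coefficient is measured relative to the dropped baseline category. At the same time, year-of-support slopes capture budget inflation trends.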
# Get final feature names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
# Reverse‑scale numeric coeffs
scale = search.best_estimator_.named_steps['prep'] \
.named_transformers_['num'].scale_
# Copy first so the division below does not modify the fitted model's coef_ in place
coef = search.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scale
imp = pd.Series(coef, index=feature_names).sort_values(key=abs, ascending=False)
plt.figure(figsize=(9,6))
imp.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net Coefficients – Drivers of NIH Award Size')
plt.xlabel('Δ USD'); plt.tight_layout(); plt.show()
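The reverse-scaling step relies on a simple identity: a coefficient learned on a z-scored column, divided by that column's standard deviation, is the slope per raw unit. The sketch below checks this on synthetic data by rebuilding the pipeline's predictions from raw-unit coefficients. The data, `alpha`, and the names `raw_coef` and `raw_intercept` are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Two synthetic numeric features on very different scales
rng = np.random.default_rng(0)
X = rng.normal(loc=[1000.0, 5.0], scale=[250.0, 2.0], size=(300, 2))
y = 40.0 * X[:, 0] + 3000.0 * X[:, 1] + rng.normal(scale=100.0, size=300)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("enet", ElasticNet(alpha=0.1)),
]).fit(X, y)
scaler = pipe.named_steps["scale"]
enet = pipe.named_steps["enet"]

# Slope per raw unit, and the intercept that goes with unscaled inputs
raw_coef = enet.coef_ / scaler.scale_
raw_intercept = enet.intercept_ - np.sum(enet.coef_ * scaler.mean_ / scaler.scale_)

# Predictions rebuilt from raw-unit coefficients match the pipeline's own
manual = X @ raw_coef + raw_intercept
print(np.allclose(manual, pipe.predict(X)))
```

Only the scaled numeric columns need this conversion; one-hot columns are already in 0/1 units, so their coefficients read directly as dollar shifts.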
Summary
This tutorial shows how to craft a human‑interpretable Elastic Net model that predicts grant award amounts from proposal metadata:
- Accuracy check: hold-out RMSE and R² quantify how well the model generalizes to unseen proposals despite collinear finance variables.
- Sparsity + stability: Ridge component tames correlation, Lasso component trims noise.
- Actionable insights: coefficients quantify how institute, activity, and cost requests sway funding.
Refreshing the model each fiscal year is a single search.fit() call on the updated data, which keeps budget planning sharp as grant portfolios evolve.