Academic Grant Funding Prediction using ElasticNet Algorithm in ML


Grant‑making bodies such as the U.S. National Institutes of Health (NIH) receive tens of thousands of proposals per year. Program officers want an early, data‑driven signal of how much funding each application is likely to receive, based only on information available at submission time: institute, study section, principal‑investigator (PI) history, project duration, keywords, etc.

The feature set is broad (dozens of textual and categorical indicators) and collinear (e.g., institute ↔ study section). Under collinearity, ordinary least squares produces inflated, unstable coefficients, while a pure Lasso (ℓ1 penalty) may over-shrink and discard genuinely helpful columns. Elastic Net, a blend of the Ridge (ℓ2) and Lasso (ℓ1) penalties, delivers a sparse yet stable model that forecasts award_amount (USD) for unseen proposals. Accurate forecasts help agencies set realistic funding lines and help PIs scope budgets wisely.
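To make the collinearity argument concrete, here is a minimal sketch on synthetic data (not the NIH dataset). The two columns are near-duplicates of each other; the sample size, noise levels, and the alpha and l1_ratio values are arbitrary assumptions chosen only for illustration.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, LinearRegression

rng = np.random.default_rng(0)
n = 200

# Two almost identical predictors (assumed toy setup)
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)
X = np.column_stack([x1, x2])

# Both predictors truly contribute 3 units each
y = 3 * x1 + 3 * x2 + rng.normal(size=n)

ols = LinearRegression().fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

# OLS pins down the sum of the two coefficients well but not the split
# between them; the Ridge part of Elastic Net pulls the pair together.
print("OLS coefficients:        ", ols.coef_)
print("Elastic Net coefficients:", enet.coef_)
```

Rerunning with a different seed typically moves the individual OLS coefficients a lot, while the Elastic Net pair stays close to each other (slightly shrunk toward zero).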

Libraries Required

Purpose        | Library
Data wrangling | pandas, numpy
Visuals        | matplotlib, seaborn
ML workflow    | scikit-learn: ColumnTransformer, OneHotEncoder, StandardScaler, ElasticNet, GridSearchCV, Pipeline, train_test_split
Metrics        | scikit-learn: mean_squared_error, r2_score

Dataset Link

NIH RePORTER Grant Dataset

Step-by-Step Code Implementation

Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error, r2_score

Download and load the dataset

# one‑time only (requires Kaggle API credentials in ~/.kaggle/kaggle.json):
# kaggle datasets download -d brasks43/nih-reporter-grant-dataset-w-abstracts-2017-2022 -p data --unzip

df = pd.read_csv("data/nih_grant_dataset_2017_2022.csv", low_memory=False)

Quick EDA & target check

print(df[['fy', 'activity', 'agency_ic', 'award_amount']].head())
sns.histplot(df['award_amount']/1e6, kde=True)
plt.xlabel('Award (million USD)'); plt.title('Distribution of NIH Grant Size'); plt.show()

Define features & target

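Why Elastic Net? Collinearity abounds: direct_cost_amt + indirect_cost_amt ≈ award_amount, and activity codes are closely tied to the institute that issues them. Elastic Net's ℓ2 part keeps correlated predictors together, while its ℓ1 part zeroes out uninformative dummies (e.g., PI names with scant data). Note that because the two cost columns nearly sum to the target, they will dominate the fit; keep them only if those figures are genuinely known at the time the forecast is made.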
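One way to see how strongly the cost columns determine the target is to measure how often they sum to it. The sketch below is an assumption-laden illustration: the helper name share_costs_match_award and the 1% tolerance are hypothetical, and the toy rows are made up rather than taken from the NIH data.

```python
import pandas as pd

def share_costs_match_award(frame, rel_tol=0.01):
    """Fraction of rows where direct + indirect costs are within rel_tol of the award.

    Assumes award_amount is non-zero in every row.
    """
    total = frame["direct_cost_amt"] + frame["indirect_cost_amt"]
    rel_gap = (total - frame["award_amount"]).abs() / frame["award_amount"]
    return float((rel_gap <= rel_tol).mean())

# Toy rows for illustration only (not real NIH figures)
toy = pd.DataFrame({
    "direct_cost_amt":   [300_000, 150_000, 500_000],
    "indirect_cost_amt": [100_000,  50_000, 200_000],
    "award_amount":      [400_000, 250_000, 700_000],
})

match_share = share_costs_match_award(toy)
print(f"Rows where costs sum to the award: {match_share:.0%}")
```

Running the same helper on model_df would show how much of the model's accuracy comes from these two columns alone; a share near 100% suggests also fitting the model without them to see what the other predictors contribute.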

# Select a compact set of predictors available at review time
keep_cols = ['fy', 'agency_ic', 'activity', 'admin_ic', 'support_year',
             'project_start', 'project_end', 'pi_name', 'org_country',
             'direct_cost_amt', 'indirect_cost_amt']
model_df = df[keep_cols + ['award_amount']].dropna()

# Target
y = model_df['award_amount']          # USD

X = model_df.drop(columns='award_amount')

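# Identify column types. project_start and project_end are in keep_cols
# but in neither list, so the ColumnTransformer below drops them.
cat_cols = ['agency_ic', 'activity', 'admin_ic', 'pi_name', 'org_country']
num_cols = ['fy', 'support_year', 'direct_cost_amt', 'indirect_cost_amt']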
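The introduction lists project duration as a predictor, and the two date columns are already selected, so a duration feature could be derived from them. The following is only a sketch: the helper add_project_days and the column name project_days are hypothetical, and it assumes the date strings in the dataset can be parsed by pd.to_datetime.

```python
import pandas as pd

def add_project_days(frame):
    """Return a copy with a project_days column (end date minus start date)."""
    out = frame.copy()
    # errors="coerce" turns unparseable dates into NaT instead of raising
    start = pd.to_datetime(out["project_start"], errors="coerce")
    end = pd.to_datetime(out["project_end"], errors="coerce")
    out["project_days"] = (end - start).dt.days
    return out

# Toy rows for illustration only
toy = pd.DataFrame({
    "project_start": ["2019-01-01", "2020-07-01"],
    "project_end":   ["2019-12-31", "2025-06-30"],
})
toy = add_project_days(toy)
print(toy)
```

To use the feature in the model, apply the helper to model_df before splitting off X and append 'project_days' to num_cols.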

Pre‑processing & Elastic Net pipeline

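Pipeline safety: the ColumnTransformer one-hot-encodes categorical IDs and z-scales numeric columns. Because it sits inside the Pipeline, it is refit on the training portion of each CV fold, which prevents data leakage from the validation portion.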

preprocess = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore', drop='first'), cat_cols),
    ('num', StandardScaler(), num_cols)
])

pipe = Pipeline([
    ('prep', preprocess),
    ('enet', ElasticNet(max_iter=20000, random_state=42))
])

Train/test split & hyper‑parameter tuning

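Hyper-tuning: We sweep 18 α values × 9 l1_ratio values (162 candidate models) with five-fold CV, which means 810 cross-validation fits plus one final refit on the full training set. The neg_root_mean_squared_error scorer selects the combination with the lowest mean cross-validated RMSE.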

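# Stratify on agency_ic so every institute appears in both splits.
# (train_test_split raises an error if any institute has only one row.)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=X['agency_ic'])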

param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),   # 0.001 → 10
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9)  # Ridge‑heavy → Lasso‑heavy
}

search = GridSearchCV(pipe, param_grid,
                      cv=5,
                      scoring='neg_root_mean_squared_error',
                      n_jobs=-1, verbose=1)
search.fit(X_train, y_train)

print("Best α (overall shrinkage):", search.best_params_['enet__alpha'])
print("Best l1_ratio (Ridge ↔ Lasso):", search.best_params_['enet__l1_ratio'])
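Before launching the search it can help to count how many fits it will run, since each one trains on a wide one-hot matrix. This small sketch rebuilds the same grid and counts it with scikit-learn's ParameterGrid; the variable names n_candidates and n_folds are just illustrative.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

param_grid = {
    'enet__alpha'   : np.logspace(-3, 1, 18),
    'enet__l1_ratio': np.linspace(0.1, 0.9, 9),
}

n_candidates = len(ParameterGrid(param_grid))  # every alpha paired with every l1_ratio
n_folds = 5
total_cv_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {total_cv_fits} CV fits (+1 final refit)")
```

If the run time is too long, a coarser grid or fewer folds reduces the count proportionally.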

Evaluate on the hold‑out set

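y_pred = search.predict(X_test)
# np.sqrt(MSE) works on every scikit-learn version; the squared=False
# argument was deprecated in 1.4 and removed in 1.6.
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)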

print(f"Hold‑out RMSE: ${rmse:,.0f} | R²: {r2:.3f}")
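An RMSE in dollars is easier to judge next to a naive baseline. One possible reference point, sketched here, is a model that always predicts the training-set mean, built with scikit-learn's DummyRegressor. The toy award values and the helper name rmse_usd are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error

def rmse_usd(y_true, y_pred):
    """Root mean squared error, computed in a version-independent way."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Toy award amounts (illustrative, not real NIH figures)
y_train_toy = np.array([100_000.0, 200_000.0, 300_000.0, 400_000.0])
y_test_toy = np.array([150_000.0, 350_000.0])

# DummyRegressor ignores the features, so placeholder columns are enough
X_train_toy = np.zeros((len(y_train_toy), 1))
X_test_toy = np.zeros((len(y_test_toy), 1))

baseline = DummyRegressor(strategy="mean").fit(X_train_toy, y_train_toy)
baseline_rmse = rmse_usd(y_test_toy, baseline.predict(X_test_toy))

print(f"Baseline RMSE: ${baseline_rmse:,.0f}")
```

On the real split, the same baseline would be fit on X_train and y_train and scored on X_test; the Elastic Net is only useful to the extent that its hold-out RMSE is clearly lower.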

Interpret feature importance

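Interpretation: Non-zero coefficients highlight big-ticket levers such as Cancer Institute grants (NCI) or Program Project activity codes (P series), which can shift the predicted award by roughly $1–$3 million. Because the encoder uses drop='first', each categorical coefficient is measured relative to the dropped baseline category of its column. At the same time, the support_year slope captures budget inflation trends.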

# Get final feature names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])

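# Reverse-scale numeric coeffs so they read as USD per original unit.
# Copy first: coef_ is the fitted model's own array, and editing it in
# place would silently change later predictions.
scale = search.best_estimator_.named_steps['prep'] \
            .named_transformers_['num'].scale_
coef = search.best_estimator_.named_steps['enet'].coef_.copy()
coef[-len(num_cols):] = coef[-len(num_cols):] / scale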

imp = pd.Series(coef, index=feature_names).sort_values(key=abs, ascending=False)

plt.figure(figsize=(9,6))
imp.head(20).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Elastic Net Coefficients – Drivers of NIH Award Size')
plt.xlabel('Δ USD'); plt.tight_layout(); plt.show()
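The reverse-scaling step divides each numeric coefficient by the scaler's standard deviation for that column. The sketch below checks that identity on synthetic data. It uses plain LinearRegression rather than Elastic Net so that the two slopes match exactly; the data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# One numeric feature on a large raw scale, true slope 2.5 per unit
x = rng.uniform(0, 1000, size=(100, 1))
y = 2.5 * x[:, 0] + rng.normal(size=100)

scaler = StandardScaler().fit(x)

# Slope per standard deviation (fit on z-scores) vs. slope per raw unit
coef_scaled = LinearRegression().fit(scaler.transform(x), y).coef_[0]
coef_raw = LinearRegression().fit(x, y).coef_[0]

# Dividing by the scaler's standard deviation converts one into the other
coef_rescaled = coef_scaled / scaler.scale_[0]

print(f"per-SD slope: {coef_scaled:.2f}")
print(f"rescaled: {coef_rescaled:.4f} | fit on raw units: {coef_raw:.4f}")
```

For the penalized model the rescaled value is still the fitted change in USD per raw unit, although the penalty was applied on the standardized scale.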

Summary

This tutorial shows how to craft a human‑interpretable Elastic Net model that predicts grant award amounts from proposal metadata:

  • Checked accuracy: RMSE and R² are measured on a 20% hold-out set of unseen proposals, despite collinear finance variables.
  • Sparsity + stability: Ridge component tames correlation, Lasso component trims noise.
  • Actionable insights: coefficients quantify how institute, activity, and cost requests sway funding.

Refreshing the model each fiscal year takes a single search.fit() call on the updated training data, which keeps budget planning sharp as grant portfolios evolve.


ProjectGurukul Team

