Crop Yield Prediction using Quantile Regression in ML
Farmers and agronomists need not only a point estimate of expected crop yield but also an understanding of uncertainty—for instance, what yields might look like in adverse (10th percentile) or exceptional (90th percentile) seasons.
In this crop yield prediction ML project, we'll predict multiple quantiles (e.g., the 10th, 50th, and 90th percentiles) of crop yield, measured in hectograms per hectare, from input rates (fertilizer, pesticide, labor, seed) and environmental factors (rainfall, temperature).
Libraries Required
import pandas as pd                                    # Data loading & manipulation
import numpy as np                                     # Numerical operations
import statsmodels.formula.api as smf                  # Quantile regression via formula API
from sklearn.model_selection import train_test_split   # Train/test split
from sklearn.metrics import mean_pinball_loss          # Quantile loss metric
Dataset
Step-by-Step Code Implementation
Load & Inspect Data
We load ~28,000 observations of yield and input/environmental factors. The initial .info() and .describe() calls let us check data completeness and value ranges.
# Load Crop Yield Prediction dataset from Kaggle
df = pd.read_csv("crop-yield-prediction-dataset.csv")
# Quick look at structure and stats
print(df.head())
df.info()  # prints its report directly and returns None, so no print() is needed
print(df.describe())
Feature Engineering & Cleaning
- We calculate an example Input_Cost_per_ha to illustrate cost‐sensitive analyses (not used directly in the quantile formula here, but could be).
- We drop rows missing any core field to ensure clean modeling.
- We designate six predictors (features) and the response Yield_hg_per_ha.
# Compute total input cost per hectare (example unit costs)
df['Input_Cost_per_ha'] = (
    df['Fertilizer_kg_per_ha'] * 1.2
    + df['Pesticide_kg_per_ha'] * 3.5
    + df['Labor_Hours_per_ha'] * 10
    + df['Seed_Cost_per_ha']
)
# The target is the raw yield, because we predict yield quantiles directly
# (a cost-efficiency target, e.g. yield per unit of input cost, would be an alternative)
# Drop rows missing the target or any predictor
df = df.dropna(subset=[
    'Yield_hg_per_ha', 'Fertilizer_kg_per_ha', 'Pesticide_kg_per_ha',
    'Labor_Hours_per_ha', 'Seed_Cost_per_ha', 'Annual_Rainfall_mm', 'Avg_Temperature_C'
])
# Define predictors and response
features = [
    'Fertilizer_kg_per_ha', 'Pesticide_kg_per_ha', 'Labor_Hours_per_ha',
    'Seed_Cost_per_ha', 'Annual_Rainfall_mm', 'Avg_Temperature_C'
]
X = df[features]
y = df['Yield_hg_per_ha']
# Keep predictors and response in one frame for the formula API
data = pd.concat([X, y], axis=1)
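To make the cost formula concrete, here is a small worked example for a single hectare. It is only a sketch: the unit costs are the example values from the formula above, while the function name input_cost_per_ha and the farm record are hypothetical and not taken from the dataset.

```python
# Unit costs copied from the tutorial's formula (the tutorial labels them as examples).
FERTILIZER_COST_PER_KG = 1.2
PESTICIDE_COST_PER_KG = 3.5
LABOR_COST_PER_HOUR = 10.0


def input_cost_per_ha(fertilizer_kg, pesticide_kg, labor_hours, seed_cost):
    """Total input cost for one hectare, mirroring the Input_Cost_per_ha column."""
    return (
        fertilizer_kg * FERTILIZER_COST_PER_KG
        + pesticide_kg * PESTICIDE_COST_PER_KG
        + labor_hours * LABOR_COST_PER_HOUR
        + seed_cost
    )


# Hypothetical record (made-up numbers): 100 kg fertilizer, 2 kg pesticide,
# 20 labor hours, and a seed cost of 50 per hectare.
cost = input_cost_per_ha(fertilizer_kg=100, pesticide_kg=2, labor_hours=20, seed_cost=50)

# Term by term: 120 (fertilizer) + 7 (pesticide) + 200 (labor) + 50 (seed) = 377
print(f"Input cost per hectare: {cost:.2f}")
```

With these example unit costs, labor dominates the total, so the column is quite sensitive to the assumed hourly rate.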
Train/Test Split
We randomly reserve 20% of the data for out-of-sample evaluation, so we can check whether the quantile estimates generalize to unseen observations.
# Hold out 20% for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)
Fit Quantile Regression Models
For each quantile (0.1, 0.5, 0.9):
- We define a statsmodels formula relating yield to all features.
- We fit a QuantReg model at that percentile.
- We output the coefficient table, which shows how each feature’s effect on yield differs across the distribution (e.g., rainfall might be more critical at the 10th percentile than the median).
quantiles = [0.1, 0.5, 0.9]
results = {}
formula = "Yield_hg_per_ha ~ " + " + ".join(features)
for q in quantiles:
    model = smf.quantreg(formula, train)
    res = model.fit(q=q)
    results[q] = res
    print(f"\n=== Quantile {int(q*100)}th Summary ===")
    print(res.summary().tables[1])  # coefficient table
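Because each quantile is fitted as a separate model, nothing forces the predicted 10th percentile to stay below the predicted median, or the median below the 90th, for every row. The sketch below counts such "crossing" rows. The helper name count_quantile_crossings and the prediction values are hypothetical, chosen only to illustrate the check.

```python
def count_quantile_crossings(preds_by_quantile):
    """Count rows whose predictions are not non-decreasing in the quantile level.

    preds_by_quantile maps a quantile level (e.g. 0.1) to a list of predictions,
    one per row; all lists must have the same length.
    """
    levels = sorted(preds_by_quantile)
    columns = [preds_by_quantile[q] for q in levels]
    n_rows = len(columns[0])
    if any(len(col) != n_rows for col in columns):
        raise ValueError("all prediction lists must have the same length")

    crossings = 0
    for row in zip(*columns):
        # A row crosses if a lower quantile is predicted above a higher one.
        if any(lo > hi for lo, hi in zip(row, row[1:])):
            crossings += 1
    return crossings


# Made-up predictions for four fields (hg/ha); the third row crosses on purpose.
example_preds = {
    0.1: [30000.0, 42000.0, 51000.0, 38000.0],
    0.5: [45000.0, 55000.0, 50000.0, 52000.0],
    0.9: [60000.0, 70000.0, 66000.0, 65000.0],
}
n_crossed = count_quantile_crossings(example_preds)
print(f"rows with crossing quantiles: {n_crossed}")
```

In this tutorial's setting, the input could plausibly be built as {q: res.predict(train[features]).tolist() for q, res in results.items()}. A large share of crossing rows would suggest the three models should not be read together as one predictive distribution.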
Evaluate with Pinball Loss
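We predict yields on the test set and compute the pinball loss for each quantile. Pinball loss is a proper scoring rule for quantile forecasts: at quantile level q it charges q per unit of under-prediction and 1 − q per unit of over-prediction, then averages over all observations. Lower is better, and the values are most meaningful when comparing different models at the same quantile level.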
for q, res in results.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Yield_hg_per_ha'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
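To see what the metric measures, the pinball loss can also be written out by hand. The following is a minimal pure-Python sketch of the standard definition, not a replacement for mean_pinball_loss; the function name pinball_loss and the yield values are made up for illustration.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile level q strictly between 0 and 1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one observation")

    total = 0.0
    for y, pred in zip(y_true, y_pred):
        diff = y - pred
        # Under-prediction (diff > 0) costs q per unit;
        # over-prediction (diff < 0) costs (1 - q) per unit.
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


# Made-up yields (hg/ha) and a constant forecast that sits low on average.
actual = [40000.0, 50000.0, 60000.0]
forecast = [45000.0, 45000.0, 45000.0]

loss_q10 = pinball_loss(actual, forecast, q=0.1)
loss_q50 = pinball_loss(actual, forecast, q=0.5)
loss_q90 = pinball_loss(actual, forecast, q=0.9)
print(f"q=0.1: {loss_q10:.2f}  q=0.5: {loss_q50:.2f}  q=0.9: {loss_q90:.2f}")
```

The same low forecast is penalized far more at q = 0.9 than at q = 0.1, because under-prediction is the expensive error for a high quantile. At q = 0.5 the loss equals half the mean absolute error.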
Summary
Quantile regression unveils how agronomic inputs and weather factors impact different segments of the yield distribution. For example, fertilizer application may have a modest effect on median yield (50th percentile) but a pronounced effect in low‑yield conditions (10th percentile).
By modeling the 10th, 50th, and 90th percentiles separately, this approach gives agronomists and farm managers distribution-aware forecasts. They can plan for worst-case (e.g., drought), typical, and best-case scenarios and craft input strategies that are robust to environmental variability.
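Before relying on the 10th to 90th percentile band for planning, one simple sanity check is to measure how often held-out yields actually fall inside it. If both quantile models were well calibrated, roughly 80% of observations would be expected to land in the band. The sketch below assumes a hypothetical helper interval_coverage and made-up numbers.

```python
def interval_coverage(y_true, lower, upper):
    """Fraction of observations y with lower <= y <= upper (element-wise)."""
    if not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("all inputs must have the same length")
    if len(y_true) == 0:
        raise ValueError("need at least one observation")
    inside = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return inside / len(y_true)


# Made-up held-out yields and predicted band edges (hg/ha).
observed = [41000.0, 52000.0, 48000.0, 70000.0, 39000.0]
q10_pred = [35000.0, 45000.0, 40000.0, 50000.0, 40000.0]
q90_pred = [60000.0, 65000.0, 62000.0, 68000.0, 58000.0]

coverage = interval_coverage(observed, q10_pred, q90_pred)
print(f"empirical coverage of the 10-90 band: {coverage:.0%} (nominal: 80%)")
```

On a sample this small the estimate is very noisy. On the full test set, coverage well below 80% would indicate a band that is too narrow, and coverage well above 80% a band that is too wide.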