Agricultural Yield Prediction using Quantile Regression in ML
Farmers and agribusinesses must plan not only for the average crop yield but for its range—preparing for poor‐yield seasons (10th percentile), typical harvests (50th percentile), and bumper crops (90th percentile).
Relying on mean forecasts alone hides this spread: a plan sized for the average season under-provisions in some years and over-invests in others. Here, we’ll predict the 10th, 50th, and 90th percentiles of yield in hectograms per hectare (hg/ha) for key crops using agronomic input rates (fertilizer, pesticide, labor, seed) and environmental factors (rainfall, temperature).
By fitting separate quantile regression models, we’ll reveal how each factor’s effect shifts across low‑, median‑, and high‑yield scenarios—enabling resource plans that are robust to both adverse and exceptional conditions.
Libraries Required
import pandas as pd                                   # Data loading & manipulation
import numpy as np                                    # Numerical operations
import statsmodels.formula.api as smf                 # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss         # Proper loss for quantile forecasts
Dataset
Step-by-Step Code Implementation
Load & Inspect Data
We load ~28 000 field observations—each with yield per hectare and input/environmental features—and verify structure (.info()) and yield distribution (.describe()) to understand data quality and skew.
# Load the Crop Yield Prediction dataset
# Source: Kaggle – Crop Yield Prediction Dataset
df = pd.read_csv("crop-yield-prediction-dataset.csv")
# Inspect top rows and summary stats
print(df.head())
df.info()  # info() prints its report itself and returns None, so no print() is needed
print(df['Yield_hg_per_ha'].describe())
Preprocessing & Feature Engineering
- We drop any rows missing essential agronomic or weather data.
- We optionally compute Input_Cost_per_ha to illustrate cost–yield trade‐off analyses. It is not added to the predictors, because it is an exact linear combination of them.
- We assemble predictors (input rates and environmental metrics) and rename Yield_hg_per_ha to Yield for cleaner formulas.
# Remove rows with missing core variables
df = df.dropna(subset=[
    'Yield_hg_per_ha',
    'Fertilizer_kg_per_ha', 'Pesticide_kg_per_ha',
    'Labor_Hours_per_ha', 'Seed_Cost_per_ha',
    'Annual_Rainfall_mm', 'Avg_Temperature_C'
])
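# (Optional) Compute total input cost per hectare.
# The multipliers 1.2, 3.5 and 10 act as per-unit prices (per kg of fertilizer,
# per kg of pesticide, per labor hour); replace them with local prices.
df['Input_Cost_per_ha'] = (
    df['Fertilizer_kg_per_ha'] * 1.2
    + df['Pesticide_kg_per_ha'] * 3.5
    + df['Labor_Hours_per_ha'] * 10
    + df['Seed_Cost_per_ha']
)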
# Define predictors and response
features = [
    'Fertilizer_kg_per_ha', 'Pesticide_kg_per_ha',
    'Labor_Hours_per_ha', 'Seed_Cost_per_ha',
    'Annual_Rainfall_mm', 'Avg_Temperature_C'
]
data = df[features + ['Yield_hg_per_ha']].copy()
data.rename(columns={'Yield_hg_per_ha':'Yield'}, inplace=True)
Train/Test Split
An 80/20 random split reserves data for unbiased evaluation of quantile forecasts on unseen fields.
# Hold out 20% for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)
Fit Quantile Regression Models
For each quantile (10th, 50th, 90th):
- We build a formula string (“Yield ~ Fertilizer_kg_per_ha + …”).
- We fit a QuantReg model on the training set.
- We print the coefficient table, revealing how each predictor’s effect varies across yield levels—e.g. rainfall may drive the lower tail more strongly than the median.
quantiles = [0.10, 0.50, 0.90]
results = {}
formula = "Yield ~ " + " + ".join(features)
for q in quantiles:
    model = smf.quantreg(formula, train)
    res = model.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])  # coefficient table only
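To make the idea of a coefficient that shifts across quantiles concrete, here is a minimal sketch on synthetic data rather than the Kaggle dataset. The column names `rain` and `y` and the data-generating process are assumptions chosen for illustration: the noise grows with rainfall, so the fitted rainfall slope should rise from the 10th to the 90th percentile.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic, illustrative data (an assumption, not the article's dataset):
# the spread of y widens as rain increases, so quantile slopes differ.
rng = np.random.default_rng(0)
n = 2000
rain = rng.uniform(1.0, 10.0, size=n)
y = 2.0 + 1.0 * rain + rain * rng.normal(0.0, 1.0, size=n)
toy = pd.DataFrame({"rain": rain, "y": y})

# Fit one model per quantile and collect the rainfall slope from each fit.
slopes = {}
for q in (0.10, 0.50, 0.90):
    fit = smf.quantreg("y ~ rain", toy).fit(q=q)
    slopes[q] = fit.params["rain"]

# One row per quantile makes the shift in the coefficient easy to read.
slope_table = pd.Series(slopes, name="rain_slope")
print(slope_table.round(2))
```

The same pattern should carry over to the fitted models above: collecting `res.params` from each entry in `results` into one table puts the 10th, 50th, and 90th percentile coefficients side by side.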
Evaluation with Pinball Loss
- We predict quantile‐specific yields on the test set.
- We compute pinball loss, a scoring rule tailored to quantile forecasts that penalises under‑ and over‑predictions asymmetrically. Lower loss means a better forecast of that quantile. The loss is expressed in yield units (hg/ha), and values are comparable only at the same quantile level.
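The asymmetry is easiest to see by computing the loss by hand. The sketch below is a plain NumPy version of the standard pinball loss definition; the helper name `pinball_loss` and the toy numbers are illustrative assumptions, not part of the dataset or of any library.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss for quantile level q (0 < q < 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit;
    # over-prediction (diff < 0) costs (1 - q) per unit.
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# Toy case: the true yield is 100 and the forecast misses by 10 in each direction.
under = pinball_loss([100.0], [90.0], q=0.90)   # about 9: too low is expensive at q = 0.9
over = pinball_loss([100.0], [110.0], q=0.90)   # about 1: too high is cheap at q = 0.9
print(f"under-prediction: {under:.2f}, over-prediction: {over:.2f}")
```

At the 90th percentile an under-prediction costs roughly nine times as much as an over-prediction of the same size, which is what pushes the fitted line toward the upper tail. At q = 0.5 both directions cost the same, and the loss reduces to half the mean absolute error.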
for q, res in results.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Yield'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
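A pinball loss value is hard to judge in isolation, so two complementary checks can help: empirical coverage (the share of observed yields at or below the predicted q-th quantile should be close to q) and quantile crossing (a lower quantile forecast should not exceed a higher one). The sketch below uses stand-in data, and the helper names `empirical_coverage` and `crossing_rate` are hypothetical. It illustrates the checks and is not an evaluation of the models above.

```python
import numpy as np

def empirical_coverage(y_true, y_pred):
    """Share of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))

def crossing_rate(lower, upper):
    """Share of rows where a lower-quantile forecast exceeds a higher one."""
    return float(np.mean(np.asarray(lower) > np.asarray(upper)))

# Stand-in data (an assumption): yields drawn from Normal(50_000, 5_000) and
# constant forecasts set to that distribution's 10th/50th/90th percentiles.
rng = np.random.default_rng(42)
y = rng.normal(50_000.0, 5_000.0, size=10_000)
z90 = 1.2816  # approximate 90th percentile of the standard normal
preds = {
    0.10: np.full_like(y, 50_000.0 - z90 * 5_000.0),
    0.50: np.full_like(y, 50_000.0),
    0.90: np.full_like(y, 50_000.0 + z90 * 5_000.0),
}

for q, p in preds.items():
    print(f"q={q:.2f}: coverage={empirical_coverage(y, p):.3f}")

# A well-calibrated 10th-90th band should contain roughly 80% of observations.
interval_cov = float(np.mean((y >= preds[0.10]) & (y <= preds[0.90])))
print(f"10-90 interval coverage: {interval_cov:.3f}")
print(f"crossing rate: {crossing_rate(preds[0.10], preds[0.90]):.3f}")
```

To apply this to the fitted models, replace `y` with `test['Yield']` and each entry of `preds` with `results[q].predict(test[features])`. Because separately fitted quantile regressions are not constrained to stay ordered, a non-zero crossing rate is possible and worth checking.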
Summary
Quantile regression equips agronomists and farm managers with tail‐sensitive yield forecasts:
- The 10th‑percentile model prepares for low-yield seasons—guiding minimal-input strategies.
- The median (50th percentile) model predicts typical harvests for routine planning.
- The 90th‑percentile model anticipates bumper crops—guiding opportunities to maximise revenue.
By modelling multiple yield quantiles, stakeholders gain robust insights into variability—optimising input allocation, mitigating downside risk, and capitalising on favourable conditions.