Farm Output Prediction using Quantile Regression in ML
Farm managers need to anticipate not just the average crop yield but the range of possible outcomes, preparing for poor-yield years (e.g., the 10th percentile) as well as bumper harvests (e.g., the 90th percentile). In this project, we'll predict the 10th, 50th, and 90th percentiles of yield, measured in hectograms per hectare, from agronomic inputs (fertilizer rate, pesticide usage, labor hours, seed cost) and environmental factors (rainfall, temperature).
Libraries Required
import pandas as pd  # Data loading & manipulation
import numpy as np  # Numerical operations
import statsmodels.formula.api as smf  # Quantile regression via formula API
from sklearn.model_selection import train_test_split  # Train/test split
from sklearn.metrics import mean_pinball_loss  # Quantile loss metric
Dataset
Step-by-Step Code Implementation
Load & Inspect Data
We load ~28,000 field observations of yield and related inputs. Initial .info() and .describe() calls confirm no unexpected missing values in core fields and show yield ranges.
# Load the Crop Yield Prediction dataset
# Source: Kaggle
df = pd.read_csv("crop-yield-prediction-dataset.csv")
# Examine structure and summary statistics
print(df.head())
df.info()  # info() prints its report directly and returns None
print(df['Yield_hg_per_ha'].describe())
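The missing-value check described above can also be made explicit. The following is a minimal sketch on a tiny made-up frame (the values are placeholders, not rows from the Kaggle file); only the column names are taken from the article.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame; values are invented for illustration only
toy = pd.DataFrame({
    "Yield_hg_per_ha": [52000.0, 61000.0, np.nan, 48000.0],
    "Annual_Rainfall_mm": [820.0, np.nan, 1010.0, 760.0],
    "Avg_Temperature_C": [21.5, 23.0, 19.8, 22.1],
})

# Count missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Number of rows that would survive a dropna() over these columns
complete_rows = toy.dropna().shape[0]
print(f"Complete rows: {complete_rows} of {len(toy)}")
```

Running the same two calls on the real `df` shows how many rows the later `dropna` step is likely to remove.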
Preprocessing & Feature Engineering
- We drop any rows missing essential agronomic or weather data to ensure consistent modeling.
- We optionally compute Input_Cost_per_ha (not used in the quantile formula here) to illustrate integrating cost indicators in future analyses.
- We select six predictors—input rates and environmental metrics—and rename the target to Yield for succinct formulas.
# Drop rows with missing core features
df = df.dropna(subset=[
    'Yield_hg_per_ha',
    'Fertilizer_kg_per_ha', 'Pesticide_kg_per_ha',
    'Labor_Hours_per_ha', 'Seed_Cost_per_ha',
    'Annual_Rainfall_mm', 'Avg_Temperature_C'
])
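# (Optional) Compute input cost per hectare for later analysis
# The multipliers appear to be per-unit prices (fertilizer, pesticide, labor);
# adjust them to local costs before relying on the result
df['Input_Cost_per_ha'] = (
    df['Fertilizer_kg_per_ha'] * 1.2
    + df['Pesticide_kg_per_ha'] * 3.5
    + df['Labor_Hours_per_ha'] * 10
    + df['Seed_Cost_per_ha']
)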
# Define predictor list and response
features = [
    'Fertilizer_kg_per_ha', 'Pesticide_kg_per_ha',
    'Labor_Hours_per_ha', 'Seed_Cost_per_ha',
    'Annual_Rainfall_mm', 'Avg_Temperature_C'
]
data = df[features + ['Yield_hg_per_ha']].copy()
data.rename(columns={'Yield_hg_per_ha':'Yield'}, inplace=True)
Train/Test Split
An 80/20 random split reserves data for unbiased evaluation of quantile forecasts on unseen fields.
# Reserve 20% of observations for out-of-sample evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)
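To see what the split does, here is a small sketch on a 100-row placeholder frame (the frame and its column `x` are hypothetical). It checks the 80/20 sizes, that a fixed `random_state` reproduces the same held-out rows, and that no row lands in both sets.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 100 rows; the values are arbitrary placeholders
toy = pd.DataFrame({"x": np.arange(100), "Yield": np.arange(100) * 2.0})

# Two splits with the same seed
train_a, test_a = train_test_split(toy, test_size=0.2, random_state=42)
train_b, test_b = train_test_split(toy, test_size=0.2, random_state=42)

print(len(train_a), len(test_a))               # 80 training rows, 20 test rows
print(test_a.index.equals(test_b.index))       # same seed gives the same held-out rows
print(set(train_a.index) & set(test_a.index))  # empty set: no row is in both
```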
Fit Quantile Regression Models
For each quantile (10th, 50th, 90th):
- We construct a formula (“Yield ~ Fertilizer_kg_per_ha + …”).
- We fit a QuantReg model at that percentile on the training data.
- We print the coefficient table, revealing how each input’s impact on yield shifts across the distribution—e.g., rainfall may have a stronger positive effect at the 10th than at the 50th percentile.
quantiles = [0.10, 0.50, 0.90]
results = {}
formula = "Yield ~ " + " + ".join(features)
for q in quantiles:
    model = smf.quantreg(formula, train)
    res = model.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])  # coefficient estimates per quantile
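Why coefficients can differ across quantiles is easier to see on synthetic data. The sketch below is not the crop dataset: it assumes a single made-up predictor `x` and a response `y` whose noise spread grows with `x`, so the lower and upper quantile lines fan out around the median line.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000

# Synthetic predictor and response (illustrative only)
x = rng.uniform(0, 10, n)
# Noise standard deviation grows with x, so quantile slopes differ
y = 5.0 + 2.0 * x + rng.normal(0, 1, n) * (0.5 + 0.5 * x)
toy = pd.DataFrame({"x": x, "y": y})

# Fit one quantile regression per target quantile and collect the slope on x
slopes = {}
for q in (0.10, 0.50, 0.90):
    res = smf.quantreg("y ~ x", toy).fit(q=q)
    slopes[q] = res.params["x"]
    print(f"q={q:.2f}: slope on x = {slopes[q]:.2f}")
```

In this setup the slope should increase from the 10th to the 90th percentile, which an ordinary least-squares fit (a single slope) would not reveal.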
Evaluation with Pinball Loss
- We generate quantile‑specific yield predictions on the test set.
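- We compute pinball loss for each quantile. This scoring function is tailored to quantile estimates: at quantile q, each unit of under-prediction is weighted by q and each unit of over-prediction by 1 - q, and the weighted errors are averaged. Lower pinball loss indicates a better forecast for that quantile; compare losses between models at the same quantile, not across different quantiles.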
for q, res in results.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Yield'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
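For intuition about what the metric computes, here is a hand-written version of the pinball loss using only NumPy. The helper name `pinball_loss` and the three-point arrays are illustrative; the formula is the standard one that `mean_pinball_loss` implements with `alpha=q`.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball loss at quantile q.

    Under-prediction (y_true > y_pred) costs q per unit;
    over-prediction (y_true < y_pred) costs (1 - q) per unit.
    """
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

y_true = np.array([10.0, 20.0, 30.0])
under = np.array([8.0, 18.0, 28.0])   # every prediction is 2 too low
over = np.array([12.0, 22.0, 32.0])   # every prediction is 2 too high

# At q = 0.9, being too low is penalized 9x more than being too high
print(f"{pinball_loss(y_true, under, q=0.9):.2f}")  # 0.9 * 2 = 1.80
print(f"{pinball_loss(y_true, over, q=0.9):.2f}")   # 0.1 * 2 = 0.20
```

The asymmetry is the point: a 90th-percentile forecast is supposed to sit above most outcomes, so falling short is the costlier mistake.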
Summary
Quantile regression provides distribution‑aware insights into crop yield:
- The 10th-percentile model describes how each input relates to yield at the low end of the outcome distribution, guiding conservative input plans for poor-yield years.
- The median (50th) model informs typical yield expectations for standard seasons.
- The 90th-percentile model highlights factors associated with exceptional yields, aiding strategies to capitalize on favorable seasons.
These tailored quantile forecasts equip agronomists and farm managers with robust planning tools—balancing conservative budgeting for low‑yield scenarios, accurate forecasting of average output, and targeted strategies for high‑yield opportunities.
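One practical way to use the low and high forecasts together is to treat the 10th and 90th percentile predictions as an 80% planning band and check two things: that the lower bound never exceeds the upper bound (quantile crossing), and that roughly 80% of observed yields fall inside the band. The sketch below uses made-up yields and constant stand-in bounds; with fitted models, the bounds would be the per-row predictions of the 10th and 90th percentile models on the test set.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up "observed" yields in hg/ha; not drawn from the article's dataset
y_obs = rng.normal(50_000, 8_000, size=5_000)

# Stand-ins for 10th and 90th percentile predictions (constant for simplicity).
# With fitted models these would be per-row predictions instead.
lower = np.full_like(y_obs, np.quantile(y_obs, 0.10))
upper = np.full_like(y_obs, np.quantile(y_obs, 0.90))

# Share of rows where the lower bound exceeds the upper bound (quantile crossing)
crossing_rate = float(np.mean(lower > upper))

# Share of observations inside the band; near 0.80 when the quantiles are well calibrated
coverage = float(np.mean((y_obs >= lower) & (y_obs <= upper)))

print(f"crossing rate: {crossing_rate:.3f}")
print(f"empirical coverage: {coverage:.3f}")
```

Coverage well below 0.80 on held-out data would suggest the band is too narrow for budgeting purposes; coverage well above it would suggest the band is wider than needed.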