Crop Yield Prediction using Quantile Regression in ML


Farmers and agronomists need not only a point estimate of expected crop yield but also an understanding of uncertainty—for instance, what yields might look like in adverse (10th percentile) or exceptional (90th percentile) seasons.

In this crop yield prediction ML project, we’ll predict multiple quantiles (e.g., 10th, 50th, 90th percentiles) of per‑hectare crop yield (in hectograms) based on input rates (fertilizer, pesticide, labor, seed) and environmental factors (rainfall, temperature).

Libraries Required

import pandas as pd                       # Data loading & manipulation  
import numpy as np                        # Numerical operations  
import statsmodels.formula.api as smf     # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  # Train/test split  
from sklearn.metrics import mean_pinball_loss        # Quantile loss metric

Dataset

Crop Yield Prediction Dataset

Step-by-Step Code Implementation

Load & Inspect Data

We load ~28,000 observations of yield and input/environmental factors. Calling .info() and .describe() lets us check for missing values and inspect the value ranges before modeling.

# Load the Crop Yield Prediction dataset (downloaded from Kaggle)
df = pd.read_csv("crop-yield-prediction-dataset.csv")

# Quick look at structure and stats
print(df.head())
df.info()              # prints its report directly; wrapping it in print() adds a stray "None"
print(df.describe())

Feature Engineering & Cleaning

We calculate an example Input_Cost_per_ha to illustrate cost‐sensitive analyses (not used directly in the quantile formula here, but could be).
We drop rows missing any core field to ensure clean modeling.
We designate six predictors (features) and the response Yield_hg_per_ha.

# Compute total input cost per hectare (example unit costs)
df['Input_Cost_per_ha'] = (
      df['Fertilizer_kg_per_ha'] * 1.2  
    + df['Pesticide_kg_per_ha'] * 3.5  
    + df['Labor_Hours_per_ha'] * 10  
    + df['Seed_Cost_per_ha']
)

# Drop rows with a missing value in the target or in any predictor
# The target is the raw yield (Yield_hg_per_ha), which we model at several quantiles
df = df.dropna(subset=[
    'Yield_hg_per_ha','Fertilizer_kg_per_ha','Pesticide_kg_per_ha',
    'Labor_Hours_per_ha','Seed_Cost_per_ha','Annual_Rainfall_mm','Avg_Temperature_C'
])

# Define predictors and response
features = [
    'Fertilizer_kg_per_ha','Pesticide_kg_per_ha','Labor_Hours_per_ha',
    'Seed_Cost_per_ha','Annual_Rainfall_mm','Avg_Temperature_C'
]
X = df[features]
y = df['Yield_hg_per_ha']
data = pd.concat([X, y], axis=1)

Train/Test Split

We randomly hold out 20% of the data for out-of-sample evaluation, so the quantile estimates can be checked on observations the models never saw during fitting.

# Hold out 20% for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)

Fit Quantile Regression Models

For each quantile (0.1, 0.5, 0.9):

We define a statsmodels formula relating yield to all features.
We fit a QuantReg model at that percentile.
We output the coefficient table, which shows how each feature’s effect on yield differs across the distribution (e.g., rainfall might be more critical at the 10th percentile than the median).

quantiles = [0.1, 0.5, 0.9]
results = {}
formula = "Yield_hg_per_ha ~ " + " + ".join(features)

for q in quantiles:
    model = smf.quantreg(formula, train)
    res = model.fit(q=q)
    results[q] = res
    print(f"\n=== Quantile {int(q*100)}th Summary ===")
    print(res.summary().tables[1])   # coefficient table
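
Reading three separate summary tables makes it hard to see how a coefficient changes across quantiles. A minimal sketch of one way to line them up is shown here. It runs on synthetic stand-in data (an assumption, not the Kaggle file) in which the noise grows with rainfall, so the rainfall coefficient should rise from the 10th to the 90th percentile. The names demo and coef_table are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption): yield noise scales with rainfall,
# so the rainfall slope differs between low and high quantiles.
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "Annual_Rainfall_mm": rng.uniform(300, 1500, n),
    "Fertilizer_kg_per_ha": rng.uniform(0, 200, n),
})
noise = rng.normal(0, 1, n) * (0.5 * demo["Annual_Rainfall_mm"])
demo["Yield_hg_per_ha"] = (
    20000
    + 10 * demo["Annual_Rainfall_mm"]
    + 30 * demo["Fertilizer_kg_per_ha"]
    + noise
)

formula = "Yield_hg_per_ha ~ Annual_Rainfall_mm + Fertilizer_kg_per_ha"
quantiles = [0.1, 0.5, 0.9]

# One column of fitted coefficients per quantile, one row per model term
coef_table = pd.DataFrame({
    q: smf.quantreg(formula, demo).fit(q=q).params for q in quantiles
})
coef_table.columns = [f"q={q}" for q in quantiles]
print(coef_table.round(2))
```

With the real data, the same pattern should work by building the table from the results dictionary above, for example with results[q].params for each q.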

Evaluate with Pinball Loss

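We predict yields on the test set and compute the pinball loss for each quantile. Pinball loss is a proper scoring rule for quantile forecasts: it penalizes under-prediction and over-prediction asymmetrically, with weights set by the target quantile. Because those weights change with the quantile, the losses are most meaningful when comparing different models at the same quantile rather than comparing one quantile against another.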

for q, res in results.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Yield_hg_per_ha'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
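
Pinball loss is hard to interpret on its own, so a complementary sanity check is to count how often the observed yield falls below the 10th-percentile forecast and inside the 10th-90th band. The sketch below illustrates the idea on synthetic stand-in data (an assumption). The names demo, train_demo, and test_demo are illustrative, and the positional 80/20 split is a simplification of the train_test_split call used above.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (assumption): linear rainfall effect plus constant noise
rng = np.random.default_rng(42)
n = 3000
demo = pd.DataFrame({"Annual_Rainfall_mm": rng.uniform(300, 1500, n)})
demo["Yield_hg_per_ha"] = (
    20000 + 10 * demo["Annual_Rainfall_mm"] + rng.normal(0, 3000, n)
)

# Rows are already in random order, so a positional 80/20 split is enough here
train_demo, test_demo = demo.iloc[:2400], demo.iloc[2400:]

formula = "Yield_hg_per_ha ~ Annual_Rainfall_mm"
fits = {q: smf.quantreg(formula, train_demo).fit(q=q) for q in (0.1, 0.9)}

lower = fits[0.1].predict(test_demo)   # 10th-percentile forecast
upper = fits[0.9].predict(test_demo)   # 90th-percentile forecast
y_test = test_demo["Yield_hg_per_ha"]

# Well-calibrated forecasts should give shares near 0.10 and 0.80
below_lower = float((y_test < lower).mean())
inside_band = float(((y_test >= lower) & (y_test <= upper)).mean())
print(f"share below 10th-percentile forecast: {below_lower:.2f}")
print(f"share inside 10th-90th band: {inside_band:.2f}")
```

Shares far from these nominal values on the real test set would suggest that the linear quantile models are miscalibrated for that part of the distribution.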

Summary

Quantile regression unveils how agronomic inputs and weather factors impact different segments of the yield distribution. For example, fertilizer application may have a modest effect on median yield (50th percentile) but a pronounced effect in low‑yield conditions (10th percentile).

By modeling the 10th, 50th, and 90th percentiles separately, this approach gives agronomists and farm managers distribution-aware forecasts. They can plan for worst-case (e.g., drought), typical, and best-case seasons, and design input strategies that are robust to environmental variability.
