Crop Yield Prediction using Quantile Regression in ML


Farmers and agronomists need not only a point estimate of expected crop yield but also an understanding of uncertainty—for instance, what yields might look like in adverse (10th percentile) or exceptional (90th percentile) seasons.

In this crop yield prediction ML project, we’ll predict multiple quantiles (e.g., 10th, 50th, 90th percentiles) of per‑hectare crop yield (in hectograms) based on input rates (fertilizer, pesticide, labor, seed) and environmental factors (rainfall, temperature).

Libraries Required

import pandas as pd                       # Data loading & manipulation  
import numpy as np                        # Numerical operations  
import statsmodels.formula.api as smf     # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  # Train/test split  
from sklearn.metrics import mean_pinball_loss        # Quantile loss metric  

Dataset

Crop Yield Prediction Dataset

Step-by-Step Code Implementation

Load & Inspect Data

We load ~28,000 observations of yield and input/environmental factors. Initial .info() and .describe() confirm data completeness and value ranges.

# Load the Crop Yield Prediction dataset (downloaded from Kaggle)
df = pd.read_csv("crop-yield-prediction-dataset.csv")

# Quick look at structure and stats
print(df.head())
df.info()   # prints its own report; wrapping it in print() would also print "None"
print(df.describe())
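
The rest of the walkthrough assumes the CSV exposes specific column names, so it can help to verify them right after loading. The sketch below is a minimal check under that assumption; missing_columns and EXPECTED_COLUMNS are hypothetical names introduced here for illustration, and example_columns stands in for df.columns.

```python
# Column names the later steps of this walkthrough rely on
EXPECTED_COLUMNS = [
    "Yield_hg_per_ha", "Fertilizer_kg_per_ha", "Pesticide_kg_per_ha",
    "Labor_Hours_per_ha", "Seed_Cost_per_ha", "Annual_Rainfall_mm",
    "Avg_Temperature_C",
]

def missing_columns(available, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent, in expected order."""
    available = set(available)
    return [name for name in expected if name not in available]

# Stand-in for df.columns (hypothetical example data)
example_columns = ["Yield_hg_per_ha", "Annual_Rainfall_mm", "Avg_Temperature_C"]
missing = missing_columns(example_columns)
print("Missing columns:", missing)
```

With the real data, calling missing_columns(df.columns) and stopping if the result is non-empty avoids a harder-to-read KeyError later on. If your copy of the dataset uses different headers, rename them or adjust the list before continuing.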

Feature Engineering & Cleaning

  • We calculate an example Input_Cost_per_ha to illustrate cost‐sensitive analyses (not used directly in the quantile formula here, but could be).
  • We drop rows missing any core field to ensure clean modeling.
  • We designate six predictors (features) and the response Yield_hg_per_ha.

# Compute total input cost per hectare (example unit costs)
df['Input_Cost_per_ha'] = (
      df['Fertilizer_kg_per_ha'] * 1.2  
    + df['Pesticide_kg_per_ha'] * 3.5  
    + df['Labor_Hours_per_ha'] * 10  
    + df['Seed_Cost_per_ha']
)

# The target is the raw per-hectare yield; Input_Cost_per_ha is kept only for optional follow-up analysis
# Drop rows missing the response or any predictor
df = df.dropna(subset=[
    'Yield_hg_per_ha','Fertilizer_kg_per_ha','Pesticide_kg_per_ha',
    'Labor_Hours_per_ha','Seed_Cost_per_ha','Annual_Rainfall_mm','Avg_Temperature_C'
])

# Define predictors and response
features = [
    'Fertilizer_kg_per_ha','Pesticide_kg_per_ha','Labor_Hours_per_ha',
    'Seed_Cost_per_ha','Annual_Rainfall_mm','Avg_Temperature_C'
]
X = df[features]
y = df['Yield_hg_per_ha']
data = pd.concat([X, y], axis=1)

Train/Test Split

We randomly reserve 20% of the data for out‑of‑sample evaluation, so we can check whether the quantile estimates generalize to unseen observations.

# Hold out 20% for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)

Fit Quantile Regression Models

For each quantile (0.1, 0.5, 0.9):

  • We define a statsmodels formula relating yield to all features.
  • We fit a QuantReg model at that percentile.
  • We output the coefficient table, which shows how each feature’s effect on yield differs across the distribution (e.g., rainfall might be more critical at the 10th percentile than the median).

quantiles = [0.1, 0.5, 0.9]
results = {}
formula = "Yield_hg_per_ha ~ " + " + ".join(features)

for q in quantiles:
    model = smf.quantreg(formula, train)
    res = model.fit(q=q)
    results[q] = res
    print(f"\n=== {round(q * 100)}th Percentile Summary ===")  # round() avoids float truncation
    print(res.summary().tables[1])   # coefficient table
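
Because each quantile is fitted as a separate model, nothing forces the predicted 10th percentile to stay below the median, or the median below the 90th. The sketch below is one possible sanity check for such "quantile crossing"; crossing_rate is a hypothetical helper, and the toy numbers are made up for illustration.

```python
import numpy as np

def crossing_rate(pred_low, pred_mid, pred_high):
    """Fraction of rows where the predicted quantiles are out of order."""
    low = np.asarray(pred_low, dtype=float)
    mid = np.asarray(pred_mid, dtype=float)
    high = np.asarray(pred_high, dtype=float)
    # A row "crosses" if the lower quantile exceeds the median,
    # or the median exceeds the upper quantile
    crossed = (low > mid) | (mid > high)
    return float(np.mean(crossed))

# Toy predictions (illustrative values only)
p10 = [100, 120, 150, 90]
p50 = [130, 125, 140, 110]
p90 = [160, 150, 170, 105]
rate = crossing_rate(p10, p50, p90)
print(f"Share of rows with crossing quantiles: {rate:.2f}")
```

In this tutorial's setting, the three inputs could be results[0.1].predict(test[features]), results[0.5].predict(test[features]), and results[0.9].predict(test[features]). A rate well above zero suggests the fitted quantiles should be interpreted with care.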

Evaluate with Pinball Loss

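We predict yields on the test set and compute the pinball loss for each quantile. Pinball loss is a proper scoring rule for quantile forecasts: at level q it weights under‑prediction by q and over‑prediction by 1 − q, so lower values indicate better quantile estimates. Because the weights depend on q, compare losses between models at the same quantile rather than across quantiles.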

for q, res in results.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Yield_hg_per_ha'], preds, alpha=q)
    print(f"{round(q * 100)}th percentile pinball loss: {loss:.2f}")
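
To make the metric less of a black box, here is a minimal NumPy sketch of the mean pinball loss. The function name pinball_loss and the toy arrays are illustrative assumptions; the formula is the standard one and should agree with mean_pinball_loss when alpha equals q.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile level q in (0, 1)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    diff = y_true - y_pred
    # Under-prediction (diff > 0) costs q per unit of error;
    # over-prediction (diff < 0) costs (1 - q) per unit of error
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Toy values (illustrative only): one over-prediction, one under-prediction, one exact hit
y_true = np.array([10.0, 20.0, 30.0])
y_pred = np.array([12.0, 19.0, 30.0])
loss_10 = pinball_loss(y_true, y_pred, 0.1)
loss_90 = pinball_loss(y_true, y_pred, 0.9)
print(f"q=0.1: {loss_10:.4f}, q=0.9: {loss_90:.4f}")
```

The same predictions score differently at q = 0.1 and q = 0.9 because over‑prediction is penalized heavily at low quantiles and lightly at high ones.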

Summary

Quantile regression unveils how agronomic inputs and weather factors impact different segments of the yield distribution. For example, fertilizer application may have a modest effect on median yield (50th percentile) but a pronounced effect in low‑yield conditions (10th percentile).

By modeling the 10th, 50th, and 90th percentiles separately, this approach gives agronomists and farm managers distribution‑aware forecasts. They can plan for worst‑case (for example, drought), typical, and best‑case scenarios, and design input strategies that are robust to environmental variability.
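
One way to check whether such a forecast band is trustworthy is to measure its empirical coverage on held‑out data. If the 10th and 90th percentile models are well calibrated, roughly 80% of observed yields should fall between them. The sketch below assumes this setup; interval_coverage is a hypothetical helper and the numbers are toy values.

```python
import numpy as np

def interval_coverage(y_true, lower, upper):
    """Fraction of observations that fall inside [lower, upper]."""
    y_true = np.asarray(y_true, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    inside = (y_true >= lower) & (y_true <= upper)
    return float(np.mean(inside))

# Toy observed yields and predicted 10th/90th percentile bounds (illustrative only)
observed = [50, 60, 70, 80, 90]
lower_10 = [45, 62, 60, 70, 85]
upper_90 = [55, 70, 75, 79, 95]
coverage = interval_coverage(observed, lower_10, upper_90)
print(f"Empirical coverage of the 10-90 band: {coverage:.0%}")
```

Coverage far below 80% on the test set would indicate the band is too narrow to plan around, while coverage far above it suggests the band is wider than necessary.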


ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.
