Agricultural Yield Prediction using Quantile Regression in ML


Farmers and agribusinesses must plan not only for the average crop yield but for its range—preparing for poor‐yield seasons (10th percentile), typical harvests (50th percentile), and bumper crops (90th percentile).

Relying on a mean forecast alone hides that spread, which can lead to underprovisioning inputs in some seasons and overinvesting in others. Here, we’ll predict the 10th, 50th, and 90th percentiles of yield (in hectograms per hectare) for key crops using agronomic input rates (fertilizer, pesticide, labor, seed) and environmental factors (rainfall, temperature).

By fitting separate quantile regression models, we’ll reveal how each factor’s effect shifts across low‑, median‑, and high‑yield scenarios—enabling resource plans that are robust to both adverse and exceptional conditions.

Libraries Required

import pandas as pd                          # Data loading & manipulation  
import numpy as np                           # Numerical operations  
import statsmodels.formula.api as smf        # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts  

Dataset

Crop Yield Prediction Dataset

Step-by-Step Code Implementation

Load & Inspect Data

We load ~28 000 field observations—each with yield per hectare and input/environmental features—and verify structure (.info()) and yield distribution (.describe()) to understand data quality and skew.

# Load the Crop Yield Prediction dataset
# Source: Kaggle – Crop Yield Prediction Dataset
df = pd.read_csv("crop-yield-prediction-dataset.csv")

# Inspect top rows and summary stats
print(df.head())
df.info()   # prints its own report and returns None, so no print() is needed
print(df['Yield_hg_per_ha'].describe())
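Before modelling, it can help to read the empirical percentiles off the data directly. The sketch below uses a synthetic, right-skewed yield series (an assumption made for illustration, not the Kaggle data) to show how the 10th, 50th, and 90th percentiles and the skewness could be inspected with pandas.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed yields in hg/ha (illustrative only, not the real dataset)
rng = np.random.default_rng(0)
yields = pd.Series(
    rng.lognormal(mean=10.0, sigma=0.5, size=1000),
    name="Yield_hg_per_ha",
)

# Empirical 10th, 50th, and 90th percentiles
pcts = yields.quantile([0.10, 0.50, 0.90])
print(pcts)

# Positive skewness and mean > median both point to a long right tail
print("skew:", yields.skew())
print("mean > median:", yields.mean() > yields.median())
```

On the real data, the same calls would apply to df['Yield_hg_per_ha']. A wide gap between the percentiles is the kind of spread that a single mean forecast cannot describe.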

Preprocessing & Feature Engineering

  • We drop any rows missing essential agronomic or weather data.
  • We optionally compute Input_Cost_per_ha from fixed per-unit prices (placeholders to adjust to local costs) to illustrate cost–yield trade‐off analyses; it is not used as a predictor in the models below.
  • We assemble predictors (input rates and environmental metrics) and rename Yield_hg_per_ha to Yield for cleaner formulas.

# Remove rows with missing core variables
df = df.dropna(subset=[
    'Yield_hg_per_ha',
    'Fertilizer_kg_per_ha','Pesticide_kg_per_ha',
    'Labor_Hours_per_ha','Seed_Cost_per_ha',
    'Annual_Rainfall_mm','Avg_Temperature_C'
])

# (Optional) Compute total input cost per hectare
df['Input_Cost_per_ha'] = (
      df['Fertilizer_kg_per_ha'] * 1.2
    + df['Pesticide_kg_per_ha'] * 3.5
    + df['Labor_Hours_per_ha'] * 10
    + df['Seed_Cost_per_ha']
)

# Define predictors and response
features = [
    'Fertilizer_kg_per_ha','Pesticide_kg_per_ha',
    'Labor_Hours_per_ha','Seed_Cost_per_ha',
    'Annual_Rainfall_mm','Avg_Temperature_C'
]
data = df[features + ['Yield_hg_per_ha']].rename(columns={'Yield_hg_per_ha': 'Yield'})

Train/Test Split

An 80/20 random split reserves data for unbiased evaluation of quantile forecasts on unseen fields.

# Hold out 20% for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)

Fit Quantile Regression Models

For each quantile (10th, 50th, 90th):

  • We build a formula string (“Yield ~ Fertilizer_kg_per_ha + …”).
  • We fit a QuantReg model on the training set.
  • We print the coefficient table to see how each predictor’s effect varies across yield levels; for example, rainfall may drive the lower tail more strongly than the median.

quantiles = [0.10, 0.50, 0.90]
results   = {}
formula   = "Yield ~ " + " + ".join(features)

for q in quantiles:
    model = smf.quantreg(formula, train)
    res   = model.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])   # coefficient table only
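Three separate summary tables are hard to compare by eye, so it may be easier to line the coefficients up side by side. The sketch below does this on synthetic data, which is an assumption for illustration: the toy DataFrame and its Rainfall and Yield columns are made up, with noise that grows with rainfall so that the slope differs between the tails.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (illustrative only): yield noise grows with rainfall,
# so the rainfall slope should differ between the lower and upper tails.
rng = np.random.default_rng(42)
n = 2000
rain = rng.uniform(0.0, 1.0, n)
noise = rng.normal(0.0, 1.0, n)
toy = pd.DataFrame({
    "Rainfall": rain,
    "Yield": 2.0 + 3.0 * rain + (0.5 + 2.0 * rain) * noise,
})

# One column of fitted coefficients per quantile
coefs = pd.DataFrame({
    q: smf.quantreg("Yield ~ Rainfall", toy).fit(q=q).params
    for q in [0.10, 0.50, 0.90]
})
print(coefs.round(2))
```

With the fitted models from the loop above, the same table could be built as pd.DataFrame({q: res.params for q, res in results.items()}). A coefficient that changes noticeably from one column to the next marks a predictor whose effect depends on the yield level.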

Evaluation with Pinball Loss

  • We predict quantile‐specific yields on the test set.
  • We compute pinball loss, a scoring rule tailored to quantile forecasts that penalises under‑ and over‑predictions asymmetrically. The loss is expressed in yield units (hg/ha). For a given quantile, a lower loss indicates a better forecast; losses for different quantiles are not directly comparable.

for q, res in results.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['Yield'], preds, alpha=q)
    print(f"{int(q*100)}th quantile pinball loss: {loss:.2f}")
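To make the metric less of a black box, here is a minimal hand-written version of the pinball loss, together with a simple coverage check. Both helper functions (pinball_loss and coverage) and the toy values are illustrative assumptions, not part of the tutorial's pipeline.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile level q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q)
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

def coverage(y_true, y_pred):
    """Share of observations at or below the predicted quantile."""
    return float(np.mean(np.asarray(y_true) <= np.asarray(y_pred)))

# Toy values (illustrative only)
y = [1.0, 2.0, 3.0]
pred = [2.0, 2.0, 2.0]
print(pinball_loss(y, pred, q=0.10))
print(coverage(y, pred))
```

For a well-calibrated forecast of quantile q, roughly a fraction q of the held-out observations should fall at or below the prediction. Comparing coverage(test['Yield'], preds) with q for each model is a quick complement to the pinball loss.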

Summary

Quantile regression equips agronomists and farm managers with tail‐sensitive yield forecasts:

  • The 10th‑percentile model prepares for low-yield seasons—guiding minimal-input strategies.
  • The median (50th percentile) model predicts typical harvests for routine planning.
  • The 90th‑percentile model anticipates bumper crops—guiding opportunities to maximise revenue.
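Because the three models are fitted independently, their predictions can occasionally cross (for instance, a 10th‑percentile prediction above the median). The sketch below, using made-up predictions for five fields, flags such cases, applies one simple remedy (sorting the three predictions per field), and computes the width of the resulting 10th-to-90th percentile band. The arrays are hypothetical, and sorting is only one of several possible fixes.

```python
import numpy as np

# Hypothetical predictions for five fields in hg/ha (not taken from the dataset)
p10 = np.array([21000, 18000, 30000, 25000, 15000])
p50 = np.array([26000, 24000, 29500, 31000, 22000])
p90 = np.array([33000, 31000, 36000, 38000, 30000])

stacked = np.vstack([p10, p50, p90])

# A field is "crossed" if a higher quantile is predicted below a lower one
crossed = np.any(np.diff(stacked, axis=0) < 0, axis=0)

# Simple remedy: sort the three predictions within each field
fixed = np.sort(stacked, axis=0)

# Width of the band between the lowest and highest predicted quantile
width = fixed[2] - fixed[0]

print("crossed:", crossed)
print("band width:", width)
```

A narrow band suggests the inputs pin the yield down fairly tightly for that field, while a wide band signals that plans should allow for more variation.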

By modelling multiple yield quantiles, stakeholders gain robust insights into variability—optimising input allocation, mitigating downside risk, and capitalising on favourable conditions.


ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.
