Farm Output Prediction using Quantile Regression in ML


Farm managers need to anticipate not just the average crop yield but the range of possible outcomes, so they can prepare for poor‑yield years (e.g., the 10th percentile) as well as bumper harvests (e.g., the 90th percentile). In this project, we’ll predict the 10th, 50th, and 90th percentiles of yield, measured in hectograms per hectare (hg/ha), from agronomic inputs (fertilizer rate, pesticide usage, labor hours, seed cost) and environmental factors (rainfall, temperature).

Libraries Required

import pandas as pd                       # Data loading & manipulation  
import numpy as np                        # Numerical operations  
import statsmodels.formula.api as smf     # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  # Train/test split  
from sklearn.metrics import mean_pinball_loss        # Quantile loss metric

Dataset

Crop Yield Prediction Dataset

Step-by-Step Code Implementation

Load & Inspect Data

We load ~28,000 field observations of yield and related inputs. Initial .info() and .describe() calls confirm no unexpected missing values in core fields and show yield ranges.

# Load the Crop Yield Prediction dataset
# Source: Kaggle 
df = pd.read_csv("crop-yield-prediction-dataset.csv")

# Examine structure and summary statistics
print(df.head())
df.info()  # prints its own report and returns None, so no print() wrapper is needed
print(df['Yield_hg_per_ha'].describe())
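
As a complement to .info(), a per-column count of missing values makes any gaps explicit. The sketch below runs on a small made-up frame so it works without the CSV: the name toy and all of its values are hypothetical, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real dataset (values are invented)
toy = pd.DataFrame({
    "Yield_hg_per_ha": [30000.0, np.nan, 45000.0, 52000.0],
    "Annual_Rainfall_mm": [800.0, 650.0, np.nan, 1200.0],
    "Avg_Temperature_C": [18.5, 21.0, 19.2, 24.1],
})

# Count missing values per column and show only the columns that have any
missing = toy.isna().sum()
print(missing[missing > 0])

# Empirical 10th, 50th, and 90th percentiles of the raw target (NaN is skipped)
print(toy["Yield_hg_per_ha"].quantile([0.10, 0.50, 0.90]))
```

On the real data, the same two calls on df would show which core fields need the dropna step and give a first look at the spread the quantile models will try to explain.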

Preprocessing & Feature Engineering

We drop any rows missing essential agronomic or weather data to ensure consistent modeling.
We optionally compute Input_Cost_per_ha (not used in the quantile formula here) to illustrate integrating cost indicators in future analyses.
We select six predictors—input rates and environmental metrics—and rename the target to Yield for succinct formulas.

# Drop rows with missing core features
df = df.dropna(subset=[
    'Yield_hg_per_ha',
    'Fertilizer_kg_per_ha','Pesticide_kg_per_ha',
    'Labor_Hours_per_ha','Seed_Cost_per_ha',
    'Annual_Rainfall_mm','Avg_Temperature_C'
])

# (Optional) Compute input cost per hectare for later analysis
# The multipliers (1.2, 3.5, 10) act as per-unit prices; replace them with your own figures
df['Input_Cost_per_ha'] = (
      df['Fertilizer_kg_per_ha'] * 1.2
    + df['Pesticide_kg_per_ha'] * 3.5
    + df['Labor_Hours_per_ha'] * 10
    + df['Seed_Cost_per_ha']
)

# Define predictor list and response
features = [
    'Fertilizer_kg_per_ha','Pesticide_kg_per_ha',
    'Labor_Hours_per_ha','Seed_Cost_per_ha',
    'Annual_Rainfall_mm','Avg_Temperature_C'
]
data = df[features + ['Yield_hg_per_ha']].copy()
data.rename(columns={'Yield_hg_per_ha':'Yield'}, inplace=True)

Train/Test Split

An 80/20 random split holds out 20% of the observations, so the quantile forecasts can be evaluated on fields the models have not seen.

# Reserve 20% of observations for out‑of‑sample evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)

Fit Quantile Regression Models

For each quantile (10th, 50th, 90th):

We construct a formula (“Yield ~ Fertilizer_kg_per_ha + …”).
We fit a QuantReg model at that percentile on the training data.
We print the coefficient table, revealing how each input’s impact on yield shifts across the distribution—e.g., rainfall may have a stronger positive effect at the 10th than at the 50th percentile.

quantiles = [0.10, 0.50, 0.90]
results   = {}
formula   = "Yield ~ " + " + ".join(features)

for q in quantiles:
    model = smf.quantreg(formula, train)
    res   = model.fit(q=q)
    results[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])   # coefficient estimates per quantile
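
To compare coefficients across quantiles at a glance, the fitted params can be collected into one table. The sketch below is only an illustration on synthetic data: the frame toy, the single rainfall predictor, and the noise model (spread growing with rainfall) are assumptions made so that the slopes visibly differ by quantile, and none of the numbers come from the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
rain = rng.uniform(300, 1500, n)

# Synthetic yield: the noise spread grows with rainfall, so the
# rainfall slope is smaller at low quantiles and larger at high ones
noise = rng.normal(0, 1, n) * (500 + 3 * rain)
toy = pd.DataFrame({
    "Annual_Rainfall_mm": rain,
    "Yield": 20000 + 10 * rain + noise,
})

# One column of coefficients per quantile, indexed by term name
coefs = pd.DataFrame({
    f"q{int(round(q * 100))}": smf.quantreg("Yield ~ Annual_Rainfall_mm", toy).fit(q=q).params
    for q in (0.10, 0.50, 0.90)
})
print(coefs.round(2))
```

With the real models, the same table could be built from the results dictionary, for example pd.DataFrame({q: res.params for q, res in results.items()}).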

Evaluation with Pinball Loss

We generate quantile‑specific yield predictions on the test set.
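We compute the pinball loss for each quantile. This scoring function is tailored to quantile estimates: at quantile level q, each unit of under‑prediction is penalized by q and each unit of over‑prediction by 1 - q, and the penalties are averaged over the test set. A lower pinball loss means a more accurate quantile forecast, but losses should only be compared at the same quantile level, since the loss is expressed in yield units and its typical size changes with q.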

for q, res in results.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['Yield'], preds, alpha=q)
    print(f"{int(q*100)}th percentile pinball loss: {loss:.2f}")
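
To see what the metric rewards, the pinball loss can be written out by hand. The function below is a minimal NumPy sketch (pinball_loss is a hypothetical helper name, and the three toy values are invented); it is meant to mirror what mean_pinball_loss computes for a single output.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Average pinball (quantile) loss at quantile level q."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit;
    # over-prediction (diff < 0) costs (1 - q) per unit.
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Invented toy values: one over-prediction and two under-predictions
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([150.0, 150.0, 150.0])

for q in (0.10, 0.50, 0.90):
    print(f"q={q:.2f}: {pinball_loss(y_true, y_pred, q):.2f}")
```

The same constant forecast of 150 is penalized lightly at q = 0.10 and heavily at q = 0.90, because it sits too low to serve as a 90th‑percentile estimate for these values.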

Summary

Quantile regression provides distribution‑aware insights into crop yield:

The 10th‑percentile model captures how the drivers act in low‑yield outcomes, guiding input plans for poor seasons.
The median (50th) model informs typical yield expectations for standard seasons.
The 90th‑percentile model highlights factors enabling exceptional yields, aiding strategies to capitalize on favorable weather.

These tailored quantile forecasts equip agronomists and farm managers with robust planning tools: conservative budgeting for low‑yield scenarios, forecasts of typical (median) output, and targeted strategies for high‑yield opportunities.
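
One further sanity check for such planning bands is empirical coverage: if the 10th and 90th percentile models are well calibrated, roughly 80% of held‑out yields should fall between them. The sketch below illustrates the idea on synthetic data only; the frame toy, the single predictor, and the noise level are assumptions, not properties of the real dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 3000
rain = rng.uniform(300, 1500, n)

# Synthetic data: linear rainfall effect plus constant-variance noise
toy = pd.DataFrame({
    "Annual_Rainfall_mm": rain,
    "Yield": 20000 + 10 * rain + rng.normal(0, 3000, n),
})
train, test = toy.iloc[:2400], toy.iloc[2400:]

# Fit the lower and upper edges of the band on the training rows
fits = {
    q: smf.quantreg("Yield ~ Annual_Rainfall_mm", train).fit(q=q)
    for q in (0.10, 0.90)
}
lower = fits[0.10].predict(test)
upper = fits[0.90].predict(test)

# Share of held-out yields that land inside the 10th-90th percentile band
inside = (test["Yield"] >= lower) & (test["Yield"] <= upper)
coverage = inside.mean()
print(f"Empirical coverage of the 10-90% band: {coverage:.1%}")
```

Coverage far below 80% would suggest the band is too narrow for budgeting purposes, while coverage far above it would suggest the band is wider than it needs to be.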
