Agricultural Input Cost Prediction using Quantile Regression in ML

Farm managers and agribusinesses incur different input costs per unit of yield depending on fertilizer, pesticide, labor, and seed expenses. Rather than forecasting only the average cost per hectogram of yield, stakeholders need to anticipate cost variability, from low‑cost scenarios (25th percentile) to high‑cost scenarios (75th percentile).

In this agricultural input cost prediction ML project, we’ll fit separate quantile regression models to predict the 25th, 50th, and 75th percentiles of cost per hectogram (Cost_per_hg) from agronomic input rates (fertilizer, pesticide, labor, seed) and weather factors (rainfall, temperature). These distribution‑aware cost forecasts will guide budgeting for low‑cost seasons, typical operations, and high‑cost seasons (an upper‑quartile scenario rather than a true worst case).

Libraries Required

import pandas as pd                          # Data loading & manipulation  
import numpy as np                           # Numerical operations  
import statsmodels.formula.api as smf        # Quantile regression via formula API  
from sklearn.model_selection import train_test_split  
from sklearn.metrics import mean_pinball_loss  # Proper loss for quantile forecasts

Dataset

Crop Yield Prediction Dataset

Step-by-Step Code Implementation

Load & Inspect Data

We load the crop‐yield dataset (~28 k records), which contains input rates and yield per hectare. Calling .info() and .describe() lets us check for missing values and inspect the yield distribution.

# Load Crop Yield Prediction Dataset – Kaggle 
df = pd.read_csv("crop-yield-prediction-dataset.csv")

# Inspect first rows and summary statistics
print(df.head())
df.info()   # prints its summary directly and returns None, so no print() is needed
print(df['Yield_hg_per_ha'].describe())
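
Because the column names used in this walkthrough may not match every copy of the dataset, a quick check before feature engineering can prevent a confusing KeyError later. The following is a minimal sketch: the helper name missing_columns is hypothetical, and the required list simply mirrors the columns referenced in this tutorial.

```python
# Columns this tutorial refers to; adjust if your CSV uses different names.
REQUIRED_COLUMNS = [
    "Fertilizer_kg_per_ha", "Pesticide_kg_per_ha",
    "Labor_Hours_per_ha", "Seed_Cost_per_ha",
    "Annual_Rainfall_mm", "Avg_Temperature_C",
    "Yield_hg_per_ha",
]

def missing_columns(available, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from `available`."""
    available = set(available)
    return [name for name in required if name not in available]

# Toy header standing in for list(df.columns); not taken from the real file.
example_header = ["Fertilizer_kg_per_ha", "Yield_hg_per_ha", "Annual_Rainfall_mm"]
print(missing_columns(example_header))
```

With the real data, missing_columns(df.columns) should return an empty list before you continue to the next step.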

Preprocessing & Feature Engineering

We calculate Input_Cost_per_ha by pricing fertilizer, pesticide, and labor at fixed unit prices and adding the seed cost, which is already expressed in dollars per hectare.
We drop rows with zero yield or missing values so that the division in the next step is well defined.
We derive the target Cost_per_hg by dividing Input_Cost_per_ha by Yield_hg_per_ha.
We select six predictors (four input rates and two weather metrics) and assemble our modeling DataFrame, renaming the target to Cost.

# Compute total input cost per hectare (unit costs: fertilizer $1.2/kg, pesticide $3.5/kg, labor $10/hr)
df['Input_Cost_per_ha'] = (
      df['Fertilizer_kg_per_ha'] * 1.2
    + df['Pesticide_kg_per_ha'] * 3.5
    + df['Labor_Hours_per_ha'] * 10
    + df['Seed_Cost_per_ha']
)

# Drop zero-yield rows and rows with missing cost, yield, or weather values
df = df[df['Yield_hg_per_ha'] > 0].dropna(
    subset=['Input_Cost_per_ha','Yield_hg_per_ha',
            'Annual_Rainfall_mm','Avg_Temperature_C']
)

# Compute cost per hectogram of yield
df['Cost_per_hg'] = df['Input_Cost_per_ha'] / df['Yield_hg_per_ha']

# Define predictors and response
features = [
    'Fertilizer_kg_per_ha','Pesticide_kg_per_ha',
    'Labor_Hours_per_ha','Seed_Cost_per_ha',
    'Annual_Rainfall_mm','Avg_Temperature_C'
]
data = df[features + ['Cost_per_hg']].copy()
data.rename(columns={'Cost_per_hg':'Cost'}, inplace=True)
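
To make the target concrete, here is the same arithmetic for a single hypothetical hectare. The unit prices come from the comment in the code above; the plot values (100 kg fertilizer, 2 kg pesticide, 40 labor hours, $80 seed, 30,000 hg yield) are invented for illustration, and cost_per_hg is a hypothetical helper name.

```python
# Unit prices from the tutorial; the plot values below are made up.
FERTILIZER_PRICE_PER_KG = 1.2
PESTICIDE_PRICE_PER_KG = 3.5
LABOR_PRICE_PER_HOUR = 10.0

def cost_per_hg(fertilizer_kg, pesticide_kg, labor_hours, seed_cost, yield_hg):
    """Input cost per hectogram of yield for one hectare."""
    if yield_hg <= 0:
        raise ValueError("yield must be positive to compute cost per hectogram")
    input_cost = (
        fertilizer_kg * FERTILIZER_PRICE_PER_KG
        + pesticide_kg * PESTICIDE_PRICE_PER_KG
        + labor_hours * LABOR_PRICE_PER_HOUR
        + seed_cost
    )
    return input_cost / yield_hg

# 120 + 7 + 400 + 80 = 607 dollars per hectare, spread over 30,000 hg
example = cost_per_hg(fertilizer_kg=100, pesticide_kg=2, labor_hours=40,
                      seed_cost=80, yield_hg=30000)
print(f"{example:.5f}")
```

The result is roughly 0.02 dollars per hectogram, which shows why the target values are small and why a zero yield has to be filtered out before dividing.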

Train/Test Split

We randomly reserve 20% of observations for out‐of‐sample evaluation, so we can check how well the quantile estimates generalize.

# Hold out 20% of data for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)

Fit Quantile Regression Models

For each desired percentile (25th, 50th, 75th):

We construct a formula string (“Cost ~ Fertilizer_kg_per_ha + …”).
We fit a QuantReg model at that quantile on the training set.
We print the coefficient table, revealing how each predictor’s marginal effect on cost shifts across the cost distribution—for example, labor may drive up the 75th‐percentile cost more than the median.

quantiles = [0.25, 0.50, 0.75]
models    = {}
formula   = "Cost ~ " + " + ".join(features)

for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])   # coefficient estimates per quantile
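
Quantile regression at level q works by minimizing the pinball (check) loss rather than squared error. As an intuition check, the sketch below drops the predictors entirely and searches for the best constant forecast on toy numbers: the minimizer lands on the corresponding sample quantile. The function names are hypothetical and the data are not from the dataset.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (check) loss of predictions y_pred at quantile level q."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)

def best_constant(y, q):
    """Observed value that minimizes pinball loss when used as a constant forecast."""
    return min(y, key=lambda c: pinball_loss(y, [c] * len(y), q))

costs = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # toy cost values for illustration
for q in (0.25, 0.50, 0.75):
    print(q, best_constant(costs, q))
```

On these nine values the best constants are 3, 5, and 7, the lower quartile, median, and upper quartile. The fitted models apply the same idea, but let the quantile vary linearly with the predictors.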

Evaluation with Pinball Loss

We predict quantile‐specific costs on the test set.
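We compute pinball loss for each quantile, a proper scoring rule for quantile forecasts. For a quantile level q, it weights under‑predictions by q and over‑predictions by 1 - q, so the penalty is asymmetric at every level except the median. Lower pinball loss indicates a more accurate forecast of that quantile; losses are best compared between models at the same quantile level rather than across quantiles.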

for q, res in models.items():
    preds = res.predict(test[features])
    loss  = mean_pinball_loss(test['Cost'], preds, alpha=q)
    # Cost per hectogram can be a small number, so show four decimals
    print(f"{int(q*100)}th percentile pinball loss: {loss:.4f}")
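
Pinball loss can be complemented by two simple sanity checks. First, the share of test observations at or below the predicted q-quantile should be close to q. Second, because each quantile is fitted separately, nothing forces the 25th-percentile forecast to stay below the 75th, so it is worth counting rows where they cross. The sketch below uses hypothetical helper names and toy numbers.

```python
def empirical_coverage(y_true, y_pred):
    """Share of observations at or below their predicted quantile."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if y <= p)
    return hits / len(y_true)

def count_crossings(lower_preds, upper_preds):
    """Number of rows where a lower-quantile forecast exceeds a higher one."""
    return sum(1 for lo, hi in zip(lower_preds, upper_preds) if lo > hi)

# Toy numbers for illustration only
actual   = [0.010, 0.020, 0.030, 0.040]
pred_q25 = [0.012, 0.015, 0.020, 0.030]
pred_q75 = [0.015, 0.025, 0.028, 0.045]

print(empirical_coverage(actual, pred_q25))   # ideally near 0.25
print(empirical_coverage(actual, pred_q75))   # ideally near 0.75
print(count_crossings(pred_q25, pred_q75))    # ideally 0
```

With the objects from this tutorial, the same functions could be called as empirical_coverage(test['Cost'], models[0.25].predict(test[features])), and similarly for the other quantiles.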

Summary

Quantile regression of agricultural input costs uncovers distribution‑aware insights:

The 25th‑percentile model describes low-cost scenarios, when inputs and yields align favorably, and gives a lower reference point for lean budgets.
The median (50th‑percentile) model predicts typical costs for routine financial planning.
The 75th‑percentile model anticipates high‐cost conditions, such as low yields or elevated input rates, enabling contingency reserves.

These quantile forecasts give agronomists and finance teams a range of cost estimates, rather than a single average, for managing input budgets under environmental and operational uncertainty.
