Agricultural Input Cost Prediction using Quantile Regression in ML
Farm managers and agribusinesses incur different input costs per unit of yield depending on fertilizer, pesticide, labor, and seed expenses. Rather than forecasting only the average cost per hectogram of yield, stakeholders need to anticipate cost variability, from low-cost scenarios (25th percentile) to high-cost scenarios (75th percentile).
In this agricultural input cost prediction ML project, we’ll predict the 25th, 50th, and 75th percentiles of cost per hectogram (Cost_per_hg) using agronomic input rates (fertilizer, pesticide, labor, seed) and weather factors (rainfall, temperature) by fitting a separate quantile regression model for each percentile. These distribution-aware cost forecasts will guide budgeting for low-cost seasons, typical operations, and high-cost conditions.
Libraries Required
import pandas as pd                                   # Data loading & manipulation
import numpy as np                                    # Numerical operations
import statsmodels.formula.api as smf                 # Quantile regression via formula API
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_pinball_loss         # Proper loss for quantile forecasts
Dataset
Step-by-Step Code Implementation
Load & Inspect Data
We load the crop‐yield dataset (~28 k records) containing input rates and yield per hectare. Initial .info() and .describe() confirm data completeness and yield distributions.
# Load Crop Yield Prediction Dataset – Kaggle
df = pd.read_csv("crop-yield-prediction-dataset.csv")
# Inspect first rows and summary statistics
print(df.head())
df.info()  # prints its report directly; wrapping it in print() would also emit "None"
print(df['Yield_hg_per_ha'].describe())
Preprocessing & Feature Engineering
- We calculate Input_Cost_per_ha by summing fertilizer, pesticide, labor, and seed costs at specified unit prices.
- We derive the target Cost_per_hg by dividing Input_Cost_per_ha by Yield_hg_per_ha.
- We drop zero‐yield or missing rows to avoid division errors.
- We select six predictors—four input rates and two weather metrics—and assemble our modeling DataFrame with renamed target Cost.
# Compute total input cost per hectare (unit costs: fertilizer $1.2/kg, pesticide $3.5/kg, labor $10/hr)
df['Input_Cost_per_ha'] = (
    df['Fertilizer_kg_per_ha'] * 1.2
    + df['Pesticide_kg_per_ha'] * 3.5
    + df['Labor_Hours_per_ha'] * 10
    + df['Seed_Cost_per_ha']
)
# Drop zero-yield and incomplete rows first to avoid division errors
df = df[df['Yield_hg_per_ha'] > 0].dropna(
    subset=['Input_Cost_per_ha', 'Yield_hg_per_ha',
            'Annual_Rainfall_mm', 'Avg_Temperature_C']
)
# Compute cost per hectogram of yield
df['Cost_per_hg'] = df['Input_Cost_per_ha'] / df['Yield_hg_per_ha']
# Define predictors and response
features = [
    'Fertilizer_kg_per_ha', 'Pesticide_kg_per_ha',
    'Labor_Hours_per_ha', 'Seed_Cost_per_ha',
    'Annual_Rainfall_mm', 'Avg_Temperature_C'
]
data = df[features + ['Cost_per_hg']].copy()
data.rename(columns={'Cost_per_hg':'Cost'}, inplace=True)
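As a sanity check on the cost formula, the arithmetic can be traced by hand for a single record. The sketch below uses a hypothetical hectare whose values are made up for illustration (they are not taken from the dataset) and applies the same unit prices as the code above.

```python
# Hypothetical single-hectare record; values are illustrative, not from the dataset.
record = {
    "Fertilizer_kg_per_ha": 100.0,
    "Pesticide_kg_per_ha": 2.0,
    "Labor_Hours_per_ha": 40.0,
    "Seed_Cost_per_ha": 73.0,
    "Yield_hg_per_ha": 30000.0,
}

# Same unit prices as the tutorial: fertilizer $1.2/kg, pesticide $3.5/kg, labor $10/hr.
input_cost = (
    record["Fertilizer_kg_per_ha"] * 1.2
    + record["Pesticide_kg_per_ha"] * 3.5
    + record["Labor_Hours_per_ha"] * 10
    + record["Seed_Cost_per_ha"]  # already expressed as a cost per hectare
)

# Target variable: input cost divided by yield (hectograms per hectare).
cost_per_hg = input_cost / record["Yield_hg_per_ha"]

print(f"Input cost per ha: {input_cost:.2f}")
print(f"Cost per hg:       {cost_per_hg:.4f}")
```

For this made-up record the input cost is 600 per hectare and the cost per hectogram is 0.02, which shows how small the target values can be when yield is expressed in hectograms.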
Train/Test Split
We randomly reserve 20% of observations for out-of-sample evaluation, so we can check whether the quantile estimates generalize to unseen data.
# Hold out 20% of data for evaluation
train, test = train_test_split(data, test_size=0.2, random_state=42)
Fit Quantile Regression Models
For each desired percentile (25th, 50th, 75th):
- We construct a formula string (“Cost ~ Fertilizer_kg_per_ha + …”).
- We fit a QuantReg model at that quantile on the training set.
- We print the coefficient table, revealing how each predictor’s marginal effect on cost shifts across the cost distribution—for example, labor may drive up the 75th‐percentile cost more than the median.
quantiles = [0.25, 0.50, 0.75]
models = {}
formula = "Cost ~ " + " + ".join(features)
for q in quantiles:
    mod = smf.quantreg(formula, train)
    res = mod.fit(q=q)
    models[q] = res
    print(f"\n--- {int(q*100)}th Percentile Coefficients ---")
    print(res.summary().tables[1])  # coefficient estimates per quantile
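Because each percentile is fitted as its own model, nothing forces the predicted 25th percentile to stay below the predicted median, or the median below the 75th. A minimal sketch of a check for such "quantile crossing" follows; `count_quantile_crossings` is a hypothetical helper written for this illustration, not part of statsmodels, and the example numbers are made up.

```python
def count_quantile_crossings(preds_by_q):
    """Count rows whose predictions are not non-decreasing in the quantile level.

    preds_by_q maps a quantile level (e.g. 0.25) to a sequence of predictions;
    all sequences are assumed to have the same length and row order.
    """
    levels = sorted(preds_by_q)
    rows = zip(*(preds_by_q[q] for q in levels))
    return sum(
        any(lower > higher for lower, higher in zip(row, row[1:]))
        for row in rows
    )


# Illustrative predictions for three rows (made-up numbers).
example = {
    0.25: [1.0, 2.0, 3.0],
    0.50: [1.5, 1.8, 3.5],  # second row: median falls below the 25th percentile
    0.75: [2.0, 2.5, 3.4],  # third row: 75th percentile falls below the median
}
print(count_quantile_crossings(example))
```

With the fitted models, one could pass `{q: res.predict(test[features]) for q, res in models.items()}` to the same helper. A count well above zero would suggest the three forecasts should not be read as a consistent cost band for those rows.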
Evaluation with Pinball Loss
- We predict quantile‐specific costs on the test set.
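- We compute pinball loss for each quantile, a proper scoring rule for quantile forecasts that penalizes under- and over-predictions asymmetrically. Lower pinball loss indicates more accurate forecasts of that quantile and thus more reliable cost planning. Compare losses between models at the same quantile, not across different quantiles, because the penalty weights differ.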
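To make the asymmetry concrete, here is a minimal pure-Python sketch of the pinball loss. The function `pinball_loss` is written for illustration and is meant to mirror what `mean_pinball_loss` computes with `alpha=q`; the numbers are made up.

```python
def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a quantile level q between 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        # Under-prediction (diff > 0) is weighted by q,
        # over-prediction (diff < 0) by (1 - q).
        total += q * diff if diff >= 0 else (q - 1) * diff
    return total / len(y_true)


actual = [10.0, 10.0]
under = [8.0, 8.0]   # forecasts 2 units too low
over = [12.0, 12.0]  # forecasts 2 units too high

print(pinball_loss(actual, under, 0.75))  # 1.5
print(pinball_loss(actual, over, 0.75))   # 0.5
```

At the 75th percentile, missing low costs three times as much as missing high by the same amount, which pushes the fitted line upward until roughly three quarters of observations fall below it.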
for q, res in models.items():
    preds = res.predict(test[features])
    loss = mean_pinball_loss(test['Cost'], preds, alpha=q)
    # Four decimals, since per-hectogram costs (and their losses) can be small
    print(f"{int(q*100)}th percentile pinball loss: {loss:.4f}")
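A complementary check to pinball loss is empirical coverage: the share of test observations that fall at or below the predicted quantile, which should sit close to the nominal level (about 0.75 for the 75th-percentile model). The sketch below uses a hypothetical helper, `empirical_coverage`, and made-up numbers.

```python
def empirical_coverage(y_true, y_pred):
    """Share of observations at or below their predicted quantile."""
    hits = sum(1 for y, p in zip(y_true, y_pred) if y <= p)
    return hits / len(y_true)


# Illustrative per-hectogram costs and 75th-percentile forecasts (made-up numbers).
actual = [0.010, 0.020, 0.030, 0.040]
pred_q75 = [0.015, 0.025, 0.035, 0.030]

print(empirical_coverage(actual, pred_q75))  # 0.75
```

Applied to the tutorial's objects, this would be `empirical_coverage(test['Cost'], res.predict(test[features]))` inside the evaluation loop. Coverage far from `q` suggests the model is biased high or low at that quantile even if its pinball loss looks acceptable.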
Summary
Quantile regression of agricultural input costs uncovers distribution‑aware insights:
- The 25th-percentile model describes low-cost scenarios, indicating how lean a budget can be when inputs and yields align favorably.
- The median (50th‑percentile) model predicts typical costs for routine financial planning.
- The 75th‑percentile model anticipates high‐cost conditions, such as low yields or elevated input rates, enabling contingency reserves.
These quantile forecasts equip agronomists and finance teams with robust tools to manage input budgets under environmental and operational uncertainty—optimizing resource allocation across diverse seasonal outcomes.