Agricultural Yield Cost Prediction using Stepwise Regression in ML

FREE Online Courses: Dive into Knowledge for Free. Learn More!

Farm managers and agribusinesses need to estimate the cost per unit of yield—for example, cost per ton of harvested crop—to evaluate profitability and optimize input use.

Thus, Factors such as fertilizer rate, seed variety, labor hours, pesticide application, and agro‐climatic conditions all influence both yield and cost.

In this project, we will predict the cost per hectogram of crop yield using field‐level data (input rates, labor, environmental metrics) by fitting a linear model with stepwise feature selection.

Certainly, the goal is an interpretable model that highlights which inputs and conditions drive the lowest cost per unit of yield.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Ordinary Least Squares regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

Crop Yield Prediction Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We discard records with missing or zero yields to avoid division errors, ensuring a consistent dataset for modeling.

# Block 1: Load dataset
# Crop Yield Prediction Dataset – Kaggle   
url = "https://www.kaggle.com/datasets/patelris/crop-yield-prediction-dataset/download"
df = pd.read_csv(url)

# Inspect first few rows and summary
print(df.head())
print(df.info())
print(df.describe())

Feature Engineering & Cost Computation

We approximate total input cost per hectare by summing fertilizer, pesticide, labor, and seed expenses.
Therefore, using unit costs of $1.2/kg, $3.5/kg, and $10/hour, respectively, then derive the cost per hectogram of yield.

# Block 2: Compute yield cost and prepare features
# Assume columns: 'Yield_hg_per_ha', 'Fertilizer_kg_per_ha', 'Pesticide_kg_per_ha',
# 'Labor_Hours_per_ha', 'Seed_Cost_per_ha', 'Annual_Rainfall_mm', 'Avg_Temperature_C'

# Define cost per hectare as sum of input costs
df['Input_Cost_per_ha'] = (
    df['Fertilizer_kg_per_ha'] * 1.2 +   # $1.2 per kg fertilizer
    df['Pesticide_kg_per_ha'] * 3.5 +    # $3.5 per kg pesticide
    df['Labor_Hours_per_ha'] * 10 +      # $10 per labor hour
    df['Seed_Cost_per_ha']               # already in USD
)

# Compute cost per unit yield (USD per hg)
df['Cost_per_hg'] = df['Input_Cost_per_ha'] / df['Yield_hg_per_ha']

# Drop any rows with zero or missing yield
df = df[df['Yield_hg_per_ha'] > 0].dropna(
    subset=['Cost_per_hg','Fertilizer_kg_per_ha','Pesticide_kg_per_ha',
            'Labor_Hours_per_ha','Seed_Cost_per_ha',
            'Annual_Rainfall_mm','Avg_Temperature_C']
)

# Define predictors and target
X = df[[
    'Fertilizer_kg_per_ha','Pesticide_kg_per_ha',
    'Labor_Hours_per_ha','Seed_Cost_per_ha',
    'Annual_Rainfall_mm','Avg_Temperature_C'
]]
y = df['Cost_per_hg']

Train/Test Split

# Block 3: Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Stepwise Regression Function

The custom stepwise_selection function iteratively adds the excluded predictor with the smallest p‑value (< 0.01) and removes the included predictor with the largest p‑value (> 0.05) until no further changes occur.
Therefore, this yields a reduced set of statistically significant features.

# Block 4: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=[],
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    included = list(initial_list)
    while True:
        changed = False

        # Forward step
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            pval = sm.OLS(y, sm.add_constant(X[included + [col]])).fit().pvalues[col]
            new_pvals[col] = pval
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:25} p-value {best_pval:.4f}")

        # Backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:25} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

We fit an OLS regression on the selected features using statsmodels.
The .summary() output reports coefficient estimates (USD cost impact per unit change), p‑values, R², and diagnostic statistics—clarifying which inputs and conditions most influence cost efficiency.
Predictions on the held‑out test data produce R² (variance explained) and RMSE (root‑mean‑square error), quantifying out‑of‑sample accuracy.

# Block 5: Select features and fit model
selected = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set and evaluate
X_test_sel = sm.add_constant(X_test[selected])
y_pred = model.predict(X_test_sel)
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

Residual Diagnostics

A residuals plot checks for non‑random patterns or heteroscedasticity, validating key OLS assumptions and highlighting potential model improvements.

# Block 6: Plot residuals
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Cost per hg")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()

Summary

By applying stepwise regression to farm input and yield data, we isolate the key drivers of cost per unit of yield—like fertilizer and labor intensity—while excluding less informative factors.

Moreover, the final linear model achieves a clear balance between interpretability (few, significant predictors) and predictive performance (strong R², low RMSE),

Thus, it offers agronomists and farm managers a transparent tool for forecasting production costs and optimizing input decisions.

Did you like our efforts? If Yes, please give ProjectGurukul 5 Stars on Google | Facebook