Agricultural Input Cost Prediction using Stepwise Regression in ML

FREE Online Courses: Your Passport to Excellence - Start Now

Farmers and agri‑businesses need to forecast their input costs—covering seeds, fertilisers, labour, and machinery use—to budget effectively and maximise profits. In this project, we’ll predict the total input cost per hectare based on factors such as crop type, soil quality metrics, fertiliser application rates, labour hours, and regional economic indicators. Through stepwise regression, we’ll find the most impactful cost drivers and build an interpretable linear model that balances simplicity with predictive accuracy—helping stakeholders plan expenses and improve resource allocation.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # OLS regression  
from sklearn.model_selection import train_test_split   # Data splitting  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization  

Dataset

Agriculture and Farming Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We import the “Agriculture & Farming Dataset” from Kaggle, which includes crop, soil, input usage, and cost data. Initial inspection reveals key variables and checks for completeness.

# Block 1: Load dataset
# Agriculture & Farming Dataset – Kaggle :contentReference[oaicite:0]{index=0}
url = "https://www.kaggle.com/datasets/bhadramohit/agriculture-and-farming-dataset/download"
df = pd.read_csv(url)

# Inspect structure and summary
print(df.head())
print(df.info())
print(df.describe())

Data Preprocessing

We drop records missing any essential fields. Categorical features (Crop_Type, Region) are transformed via one‑hot encoding. The feature matrix X includes numeric soil and input application metrics plus encoded categories; the target y is Input_Cost_per_ha. We then split into training (80%) and testing (20%) sets.

# Block 2: Clean & prepare data
# Assume dataset columns include:
# 'Crop_Type', 'Soil_pH', 'Soil_Organic_Carbon', 'Fertilizer_kg_per_ha',
# 'Labor_Hours_per_ha', 'Machinery_Hours_per_ha', 'Region', and 'Input_Cost_per_ha'

# Drop any rows with missing values in key columns
df = df.dropna(subset=[
    'Crop_Type','Soil_pH','Soil_Organic_Carbon',
    'Fertilizer_kg_per_ha','Labor_Hours_per_ha',
    'Machinery_Hours_per_ha','Region','Input_Cost_per_ha'
])

# One‐hot encode categorical features
df_enc = pd.get_dummies(df, columns=["Crop_Type","Region"], drop_first=True)

# Define predictors and target
X = df_enc.drop("Input_Cost_per_ha", axis=1)
y = df_enc["Input_Cost_per_ha"]

# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Stepwise Regression Function

The stepwise_selection function implements a hybrid approach:

  • Forward inclusion: adds the excluded predictor with the lowest p‑value under 0.01.
  • Backward elimination: removes the included predictor with the highest p‑value above 0.05. Iteration continues until no further additions or removals are warranted, yielding a parsimonious set of cost drivers.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=[],
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    included = list(initial_list)
    while True:
        changed = False

        # Forward step: test adding each excluded predictor
        excluded = list(set(X.columns) - set(included))
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p‑value {best_pval:.4f}")

        # Backward step: test removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p‑value {worst_pval:.4f}")

        if not changed:
            break

    return included

Model Building & Evaluation

  • Model Fitting: We fit an Ordinary Least Squares regression using statsmodels on the selected features. The .summary() output provides coefficient estimates, standard errors, p‑values, R², and fit diagnostics—quantifying each predictor’s impact on input cost.
  • Evaluation: Predictions on the held‑out test set allow us to compute R² (variance explained) and RMSE (prediction error), assessing generalisation performance.
# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_features])
y_pred = model.predict(X_test_sel)

# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))

Residual Diagnostics

A residual plot (residuals vs predicted values) checks for non‑random patterns or heteroscedasticity, validating core OLS assumptions and model reliability.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Input Cost per ha")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()

Summary

Applying stepwise regression to agricultural input data identifies the key drivers of per‑hectare cost—such as fertiliser rate, labour hours, soil organic carbon, and specific crop or region categories—while pruning redundant variables. The resulting linear model strikes a strong balance between interpretability and predictive power (high R², low RMSE), equipping farmers and agribusiness planners with a transparent forecasting tool to budget inputs more accurately and optimise resource use.

Your 15 seconds will encourage us to work even harder
Please share your happy experience on Google | Facebook

ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for enginering students.

Leave a Reply

Your email address will not be published. Required fields are marked *