Agricultural Input Cost Prediction using Stepwise Regression in ML
FREE Online Courses: Your Passport to Excellence - Start Now
Farmers and agri‑businesses need to forecast their input costs—covering seeds, fertilisers, labour, and machinery use—to budget effectively and maximise profits. In this project, we’ll predict the total input cost per hectare based on factors such as crop type, soil quality metrics, fertiliser application rates, labour hours, and regional economic indicators. Through stepwise regression, we’ll find the most impactful cost drivers and build an interpretable linear model that balances simplicity with predictive accuracy—helping stakeholders plan expenses and improve resource allocation.
Libraries Required
import pandas as pd # Data loading & manipulation import numpy as np # Numerical operations import statsmodels.api as sm # OLS regression from sklearn.model_selection import train_test_split # Data splitting from sklearn.metrics import r2_score, mean_squared_error # Evaluation metrics import matplotlib.pyplot as plt # Visualization
Dataset
Agriculture and Farming Dataset
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We import the “Agriculture & Farming Dataset” from Kaggle, which includes crop, soil, input usage, and cost data. Initial inspection reveals key variables and checks for completeness.
# Block 1: Load dataset
# Agriculture & Farming Dataset – Kaggle :contentReference[oaicite:0]{index=0}
url = "https://www.kaggle.com/datasets/bhadramohit/agriculture-and-farming-dataset/download"
df = pd.read_csv(url)
# Inspect structure and summary
print(df.head())
print(df.info())
print(df.describe())
Data Preprocessing
We drop records missing any essential fields. Categorical features (Crop_Type, Region) are transformed via one‑hot encoding. The feature matrix X includes numeric soil and input application metrics plus encoded categories; the target y is Input_Cost_per_ha. We then split into training (80%) and testing (20%) sets.
# Block 2: Clean & prepare data
# Assume dataset columns include:
# 'Crop_Type', 'Soil_pH', 'Soil_Organic_Carbon', 'Fertilizer_kg_per_ha',
# 'Labor_Hours_per_ha', 'Machinery_Hours_per_ha', 'Region', and 'Input_Cost_per_ha'
# Drop any rows with missing values in key columns
df = df.dropna(subset=[
'Crop_Type','Soil_pH','Soil_Organic_Carbon',
'Fertilizer_kg_per_ha','Labor_Hours_per_ha',
'Machinery_Hours_per_ha','Region','Input_Cost_per_ha'
])
# One‐hot encode categorical features
df_enc = pd.get_dummies(df, columns=["Crop_Type","Region"], drop_first=True)
# Define predictors and target
X = df_enc.drop("Input_Cost_per_ha", axis=1)
y = df_enc["Input_Cost_per_ha"]
# Split into training and testing sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
Stepwise Regression Function
The stepwise_selection function implements a hybrid approach:
- Forward inclusion: adds the excluded predictor with the lowest p‑value under 0.01.
- Backward elimination: removes the included predictor with the highest p‑value above 0.05. Iteration continues until no further additions or removals are warranted, yielding a parsimonious set of cost drivers.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
initial_list=[],
threshold_in=0.01,
threshold_out=0.05,
verbose=True):
included = list(initial_list)
while True:
changed = False
# Forward step: test adding each excluded predictor
excluded = list(set(X.columns) - set(included))
new_pvals = pd.Series(index=excluded, dtype=float)
for col in excluded:
model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
new_pvals[col] = model.pvalues[col]
best_pval = new_pvals.min()
if best_pval < threshold_in:
best_var = new_pvals.idxmin()
included.append(best_var)
changed = True
if verbose:
print(f"Add {best_var:30} p‑value {best_pval:.4f}")
# Backward step: test removing each included predictor
model = sm.OLS(y, sm.add_constant(X[included])).fit()
pvals = model.pvalues.iloc[1:] # exclude intercept
worst_pval = pvals.max()
if worst_pval > threshold_out:
worst_var = pvals.idxmax()
included.remove(worst_var)
changed = True
if verbose:
print(f"Drop {worst_var:30} p‑value {worst_pval:.4f}")
if not changed:
break
return included
Model Building & Evaluation
- Model Fitting: We fit an Ordinary Least Squares regression using statsmodels on the selected features. The .summary() output provides coefficient estimates, standard errors, p‑values, R², and fit diagnostics—quantifying each predictor’s impact on input cost.
- Evaluation: Predictions on the held‑out test set allow us to compute R² (variance explained) and RMSE (prediction error), assessing generalisation performance.
# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)
# Fit final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict on test set
X_test_sel = sm.add_constant(X_test[selected_features])
y_pred = model.predict(X_test_sel)
# Compute performance metrics
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
Residual Diagnostics
A residual plot (residuals vs predicted values) checks for non‑random patterns or heteroscedasticity, validating core OLS assumptions and model reliability.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Input Cost per ha")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
Summary
Applying stepwise regression to agricultural input data identifies the key drivers of per‑hectare cost—such as fertiliser rate, labour hours, soil organic carbon, and specific crop or region categories—while pruning redundant variables. The resulting linear model strikes a strong balance between interpretability and predictive power (high R², low RMSE), equipping farmers and agribusiness planners with a transparent forecasting tool to budget inputs more accurately and optimise resource use.