Farm Irrigation Cost Prediction using Stepwise Regression in ML
Irrigation accounts for a large share of on‑farm expenses—covering water pumping, distribution, and system maintenance. Accurately forecasting irrigation costs per hectare, based on crop type, soil properties, water use, and energy prices, enables farmers to budget effectively and optimise water‑use efficiency.
In this project, we will predict irrigation cost (USD/ha) using stepwise linear regression to select the most significant predictors and build an interpretable model that balances simplicity with predictive power, thereby helping agronomists and farm managers plan input expenditures and improve sustainability.
Libraries Required
import pandas as pd                                         # Data loading & manipulation
import numpy as np                                          # Numerical operations
import statsmodels.api as sm                                # Ordinary Least Squares regression
from sklearn.model_selection import train_test_split        # Train/test split
from sklearn.metrics import r2_score, mean_squared_error    # Evaluation metrics
import matplotlib.pyplot as plt                             # Visualization
Dataset
Agricultural Data for Rajasthan, India (2018–2019)
Step-by-Step Code Implementation
Data Loading & Initial Inspection
We load a region‑specific agricultural dataset that includes soil, rainfall, irrigation usage, and recorded irrigation costs. Initial commands (.head(), .info(), .describe()) verify completeness and distributions.
# Block 1: Load dataset
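# Agricultural Data for Rajasthan, India (2018–2019)
# Kaggle requires a login to download, so pd.read_csv cannot read this URL directly.
url = "https://www.kaggle.com/datasets/suraj520/agricultural-data-for-rajasthan-india-2018-2019/download"
# Download the file from the URL above in a browser, then point csv_path at the saved CSV
# (the file name here is a placeholder, not the dataset's actual file name).
csv_path = "rajasthan_agriculture.csv"
df = pd.read_csv(csv_path)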
# Inspect the data
print(df.head())
df.info()  # prints its report directly; wrapping it in print() would also print "None"
print(df.describe())
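Before preprocessing, it can help to confirm that the columns the later steps rely on are actually present. The following is a minimal sketch only: the helper name `missing_columns` is hypothetical, and the column names are the ones assumed in Block 2, not confirmed against the actual dataset.

```python
import pandas as pd

# Column names assumed by the later blocks; adjust to match the real file.
EXPECTED_COLUMNS = (
    "Crop", "Soil_pH", "Rainfall_mm", "Irrigation_Method",
    "Water_Used_mm", "Energy_Price_per_kWh", "Irrigation_Cost_per_ha",
)

def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names that are absent from frame."""
    return [col for col in expected if col not in frame.columns]

# Toy frame standing in for the loaded dataset (illustrative values only).
toy = pd.DataFrame({"Crop": ["Wheat"], "Soil_pH": [7.1], "Rainfall_mm": [310.0]})
print(missing_columns(toy))
```

Running the same check on the real `df` and getting an empty list would suggest the assumed schema matches; any names returned need to be mapped or renamed before continuing.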
Data Preprocessing
We drop incomplete records, then one‑hot encode categorical fields (Irrigation_Method, Crop). The predictors matrix X includes soil pH, rainfall, water used, energy price, and encoded categories; the target y is the observed irrigation cost per hectare. We split into training (80%) and testing (20%) sets.
# Block 2: Clean & prepare features
# Assume columns include: 'Crop', 'Soil_pH', 'Rainfall_mm', 'Irrigation_Method',
# 'Water_Used_mm', 'Energy_Price_per_kWh', 'Irrigation_Cost_per_ha'
# Drop any rows with missing values in key columns
df = df.dropna(subset=[
    'Crop', 'Soil_pH', 'Rainfall_mm', 'Irrigation_Method',
    'Water_Used_mm', 'Energy_Price_per_kWh', 'Irrigation_Cost_per_ha'
])
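# One‑hot encode irrigation method and crop type
# (dtype=float keeps the dummy columns numeric so statsmodels OLS accepts them)
df_enc = pd.get_dummies(df,
                        columns=['Irrigation_Method', 'Crop'],
                        drop_first=True,
                        dtype=float)
# Define predictors (X) and target (y); keep numeric columns only
X = df_enc.drop('Irrigation_Cost_per_ha', axis=1).select_dtypes(include='number')
y = df_enc['Irrigation_Cost_per_ha']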
# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
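To make the encoding step concrete, here is a small sketch of what one-hot encoding with `drop_first=True` produces. The toy values are illustrative and not taken from the dataset. Dropping the first category avoids perfect collinearity between the full set of dummy columns and the intercept; the dropped category becomes the baseline against which the other coefficients are read.

```python
import pandas as pd

# Toy data with illustrative values only (not taken from the dataset).
toy = pd.DataFrame({
    "Irrigation_Method": ["Drip", "Sprinkler", "Flood", "Drip"],
    "Water_Used_mm": [320.0, 410.0, 560.0, 300.0],
})

# drop_first=True drops the alphabetically first category ("Drip"), which
# becomes the baseline; dtype=float keeps the dummies numeric for OLS.
encoded = pd.get_dummies(toy, columns=["Irrigation_Method"],
                         drop_first=True, dtype=float)
print(encoded.columns.tolist())
# ['Water_Used_mm', 'Irrigation_Method_Flood', 'Irrigation_Method_Sprinkler']
```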
Stepwise Regression Function
The stepwise_selection function alternates forward inclusion—adding predictors with p < 0.01—and backward elimination—dropping predictors with p > 0.05—until no further changes occur, yielding a parsimonious set of significant drivers.
# Block 3: Forward–backward stepwise selection
def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting list (a None default avoids a shared mutable default argument)
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # Forward step: test each excluded predictor (in column order, for reproducibility)
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.4f}")
        # Backward step: test removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.4f}")
        if not changed:
            break
    return included
Model Building & Evaluation
- We fit an Ordinary Least Squares regression on the selected features using statsmodels. The .summary() output provides coefficient estimates, p-values (significance), R², and diagnostic statistics, offering clear interpretability of each factor’s impact on cost.
- On the held‑out test set, we compute R² (variance explained) and RMSE (root‑mean‑square error) to quantify predictive performance and generalisation.
# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)
# Fit the final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())
# Predict and evaluate on test set
X_test_sel = sm.add_constant(X_test[selected_features])
y_pred = model.predict(X_test_sel)
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
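For reference, the two test metrics are R² = 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the observed values from their mean, and RMSE = sqrt(mean((y − ŷ)²)). A small worked sketch with made-up numbers (not model output) shows the arithmetic:

```python
import numpy as np

# Illustrative values only (not model output).
y_true = np.array([100.0, 120.0, 140.0, 160.0])
y_hat = np.array([110.0, 115.0, 145.0, 150.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # residual sum of squares: 250
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares: 2000
r2 = 1.0 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))
print(f"R2 = {r2:.3f}, RMSE = {rmse:.2f}")
# R2 = 0.875, RMSE = 7.91
```

RMSE is in the same units as the target (cost per hectare), which makes it easier to relate to a budget than R² alone.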
Residual Diagnostics
A plot of residuals versus predicted costs is used to check for patterns or heteroscedasticity, validate linear model assumptions, and identify outliers.
# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Irrigation Cost (USD/ha)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
Summary
Applying stepwise regression to irrigation data identifies which candidate cost drivers (for example water-use volume, energy price, and specific irrigation methods) are statistically significant, while pruning non‑informative variables.
The resulting linear model trades off interpretability (few, statistically significant predictors) against predictive power, which should be judged from the test‑set R² and RMSE reported in Block 4. A model that does well on both gives farmers and agronomists a transparent tool to forecast irrigation expenditures and optimise water‑use strategies.