Farm Irrigation Cost Prediction using Stepwise Regression in ML


Irrigation accounts for a large share of on‑farm expenses—covering water pumping, distribution, and system maintenance. Accurately forecasting irrigation costs per hectare, based on crop type, soil properties, water use, and energy prices, enables farmers to budget effectively and optimise water‑use efficiency.

In this project, we predict irrigation cost (USD/ha) using stepwise linear regression. The procedure selects the most significant predictors and yields an interpretable model that balances simplicity with predictive power, helping agronomists and farm managers plan input expenditures and improve sustainability.

Libraries Required

import pandas as pd               # Data loading & manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Ordinary Least Squares regression  
from sklearn.model_selection import train_test_split   # Train/test split  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation metrics  
import matplotlib.pyplot as plt   # Visualization

Dataset

Agricultural Data for Rajasthan, India (2018–2019)

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load a region‑specific agricultural dataset that includes soil, rainfall, irrigation usage, and recorded irrigation costs. Initial commands (.head(), .info(), .describe()) verify completeness and distributions.

# Block 1: Load dataset
# Agricultural Data for Rajasthan, India (2018–2019)
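# Kaggle downloads require a login, so pd.read_csv cannot fetch this URL directly.
# Download the CSV from the dataset page, then point csv_path at the local file.
url = "https://www.kaggle.com/datasets/suraj520/agricultural-data-for-rajasthan-india-2018-2019/download"
csv_path = "rajasthan_agriculture.csv"  # hypothetical local file name; use your own
df = pd.read_csv(csv_path)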

# Inspect the data
print(df.head())
df.info()  # prints its own report; wrapping it in print() would also print "None"
print(df.describe())

Data Preprocessing

We drop incomplete records, then one‑hot encode categorical fields (Irrigation_Method, Crop). The predictors matrix X includes soil pH, rainfall, water used, energy price, and encoded categories; the target y is the observed irrigation cost per hectare. We split into training (80%) and testing (20%) sets.

# Block 2: Clean & prepare features
# Assume columns include: 'Crop', 'Soil_pH', 'Rainfall_mm', 'Irrigation_Method',
# 'Water_Used_mm', 'Energy_Price_per_kWh', 'Irrigation_Cost_per_ha'

# Drop any rows with missing values in key columns
df = df.dropna(subset=[
    'Crop','Soil_pH','Rainfall_mm','Irrigation_Method',
    'Water_Used_mm','Energy_Price_per_kWh','Irrigation_Cost_per_ha'
])

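# One‑hot encode irrigation method and crop type.
# dtype=float keeps the dummies numeric; recent pandas versions default to bool,
# which can make statsmodels reject the mixed-type design matrix.
df_enc = pd.get_dummies(df,
                        columns=['Irrigation_Method','Crop'],
                        drop_first=True,
                        dtype=float)

# Define predictors (X) and target (y).
# Keep only the numeric predictors and the encoded dummies, so any other
# columns in the file (IDs, free text) do not enter the model.
numeric_cols = ['Soil_pH','Rainfall_mm','Water_Used_mm','Energy_Price_per_kWh']
dummy_cols = [c for c in df_enc.columns
              if c.startswith(('Irrigation_Method_','Crop_'))]
X = df_enc[numeric_cols + dummy_cols].astype(float)
y = df_enc['Irrigation_Cost_per_ha']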

# Split into training and test sets (80/20)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
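
To make the effect of drop_first=True concrete, here is a minimal sketch on a made-up three-row table. The values and the method names (Drip, Flood, Sprinkler) are illustrative assumptions, not taken from the dataset.

```python
import pandas as pd

# Toy table with hypothetical values, only to show the encoding
toy = pd.DataFrame({
    "Irrigation_Method": ["Drip", "Flood", "Sprinkler"],
    "Water_Used_mm": [300.0, 650.0, 450.0],
})

toy_enc = pd.get_dummies(toy, columns=["Irrigation_Method"],
                         drop_first=True, dtype=float)

# 'Drip' (the first category alphabetically) becomes the baseline: no column
print(list(toy_enc.columns))
print(toy_enc)
```

The Drip row has zeros in both dummy columns, so the coefficients of the remaining dummies are read as cost differences relative to that baseline. Dropping one level also avoids perfect collinearity with the intercept.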

Stepwise Regression Function

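The stepwise_selection function alternates two steps until neither changes the model. The forward step adds the excluded predictor with the smallest p-value, provided it is below 0.01; the backward step drops the included predictor with the largest p-value, provided it is above 0.05. Keeping the entry threshold stricter than the removal threshold helps prevent a variable from being added and then immediately dropped. The result is a parsimonious set of significant drivers.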

# Block 3: Forward–backward stepwise selection
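def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Start from the caller's list if given, otherwise from no predictors
    included = list(initial_list) if initial_list is not None else []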
    while True:
        changed = False

        # Forward step: test each excluded predictor
        # Preserve column order so repeated runs test candidates in the same order
        excluded = [c for c in X.columns if c not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.4f}")

        # Backward step: test removing each included predictor
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.4f}")

        if not changed:
            break

    return included
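
The add and drop decisions above rest on coefficient p-values. As a rough illustration of where those numbers come from, the sketch below fits OLS by hand on synthetic data. The variable names and values are assumptions made for the example, and the p-values use a normal approximation instead of the exact t distribution that statsmodels uses.

```python
import math
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic data (assumption): cost depends on 'water' but not on 'noise'
water = rng.normal(size=n)
noise = rng.normal(size=n)
cost = 5.0 + 3.0 * water + rng.normal(size=n)

# Design matrix with an intercept column, like sm.add_constant
A = np.column_stack([np.ones(n), water, noise])

# OLS coefficients and their standard errors
beta, _, _, _ = np.linalg.lstsq(A, cost, rcond=None)
resid = cost - A @ beta
sigma2 = resid @ resid / (n - A.shape[1])   # residual variance
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(A.T @ A)))
t_stats = beta / se

# Two-sided p-values via a normal approximation (reasonable for n = 200)
p_values = np.array([math.erfc(abs(t) / math.sqrt(2)) for t in t_stats])

for name, t, p in zip(["const", "water", "noise"], t_stats, p_values):
    print(f"{name:6} t = {t:8.2f}   p ~ {p:.4f}")
```

A predictor that truly drives the target gets a large t-statistic and a p-value far below the 0.01 entry threshold, while an unrelated predictor usually does not. Because many candidates are tested in sequence, some noise variables can still pass by chance, so the selected set is worth checking on held-out data.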

Model Building & Evaluation

We fit an Ordinary Least Squares regression on the selected features using statsmodels. The .summary() output provides coefficient estimates, p-values (significance), R², and diagnostic statistics, offering clear interpretability of each factor’s impact on cost.
On the held‑out test set, we compute R² (variance explained) and RMSE (root‑mean‑square error) to quantify predictive performance and generalisation.

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

# Fit the final OLS model
X_train_sel = sm.add_constant(X_train[selected_features])
model = sm.OLS(y_train, X_train_sel).fit()
print(model.summary())

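# Predict and evaluate on test set
# has_constant='add' forces the intercept column even if a test-set feature
# happens to be constant (e.g. a dummy that is all zeros in this split)
X_test_sel = sm.add_constant(X_test[selected_features], has_constant='add')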
y_pred = model.predict(X_test_sel)
print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
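
For readers who want to see what the two metrics measure, here is a small worked example that computes them by hand. The four observed and predicted costs are hypothetical numbers chosen for easy arithmetic, not model output.

```python
import numpy as np

# Hypothetical observed and predicted costs (USD/ha), for illustration only
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_hat = np.array([110.0, 140.0, 210.0, 240.0])

ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2 = 1.0 - ss_res / ss_tot
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print(f"R^2  = {r2:.3f}")          # R^2  = 0.968
print(f"RMSE = {rmse:.1f} USD/ha") # RMSE = 10.0 USD/ha
```

R² is unitless and compares the model against simply predicting the mean cost, while RMSE is in the same units as the target, so it reads directly as a typical error in USD per hectare.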

Residual Diagnostics

A plot of residuals versus predicted costs is used to check for patterns or heteroscedasticity, validate linear model assumptions, and identify outliers.

# Block 5: Residual plot
residuals = y_test - y_pred
plt.scatter(y_pred, residuals, alpha=0.6)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Irrigation Cost (USD/ha)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Cost")
plt.show()
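
A visual check can be backed by a simple number. The sketch below is a rough heuristic, not a formal test: it correlates the absolute residuals with the predictions, using a hypothetical helper named spread_trend and synthetic data in place of the model's output.

```python
import numpy as np

def spread_trend(predicted, residuals):
    """Correlation between |residual| and predicted value (rough heuristic)."""
    return float(np.corrcoef(predicted, np.abs(residuals))[0, 1])

rng = np.random.default_rng(1)
pred = np.linspace(1.0, 10.0, 500)   # synthetic predictions (assumption)

even_resid = rng.normal(0.0, 1.0, size=500)         # constant spread
fan_resid = rng.normal(0.0, 1.0, size=500) * pred   # spread grows with prediction

print("constant spread:", round(spread_trend(pred, even_resid), 2))
print("growing spread: ", round(spread_trend(pred, fan_resid), 2))
```

A value near zero is consistent with constant variance, while a clearly positive value suggests the fan shape typical of heteroscedasticity. With the tutorial's variables, the call would be spread_trend(y_pred, residuals).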

Summary

By applying stepwise regression to irrigation data, we identify the most influential cost drivers—such as water-use volume, energy price, and specific irrigation methods—while pruning non‑informative variables.

The resulting linear model aims to balance interpretability (few, statistically significant predictors) with predictive power, as measured by test‑set R² and RMSE, giving farmers and agronomists a transparent tool to forecast irrigation expenditures and optimise water‑use strategies.
