Crop Yield Prediction using Stepwise Regression in ML

Accurate prediction of crop yields is essential for farmers and policy‑makers to optimize resource allocation, balance food supply chains, and mitigate risks due to weather variability or input costs.

In this project, we will predict crop yield (in hectograms per hectare, hg/ha) based on environmental factors (rainfall and temperature), agricultural inputs (pesticide use), and geographic indicators (country/region and crop type).

By applying stepwise regression, we will identify the essential predictors and create a streamlined linear model that balances interpretability with predictive performance, helping stakeholders make data‑driven decisions.

Libraries Required

import pandas as pd               # Data manipulation  
import numpy as np                # Numerical operations  
import statsmodels.api as sm      # Statistical modeling  
from sklearn.model_selection import train_test_split   # Data splitting  
from sklearn.metrics import r2_score, mean_squared_error  # Evaluation  
import matplotlib.pyplot as plt   # Visualization 

Dataset

Crop Yield Prediction Dataset

Step-by-Step Code Implementation

Data Loading & Initial Inspection

We load the CSV from Kaggle and assess the data structure, identifying key features such as annual rainfall and pesticide use, as well as categorical fields (Country, Crop Type). (gts.ai)

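# Block 1: Load dataset
# Kaggle downloads require a logged-in session, so pd.read_csv cannot fetch
# the dataset URL directly. Download the CSV manually from
# https://www.kaggle.com/datasets/patelris/crop-yield-prediction-dataset
# and point csv_path at the local copy (the file name here is an assumption).
csv_path = "crop_yield.csv"
df = pd.read_csv(csv_path)

# Inspect structure and summary
print(df.head())
df.info()  # info() prints its report itself and returns None
print(df.describe())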

The dataset contains ~28,242 entries with features such as Country, Crop Type, Year, Yield per Hectare, Annual Rainfall, and Pesticide Usage (gts.ai).

Data Preprocessing

Categorical variables are one‑hot encoded to convert them into numeric dummy columns; missing entries (if any) are dropped.

The dataset is split into features (X) and target (y), then divided into 80/20 train‑test sets.

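# Block 2: Encode categorical variables
# dtype=float keeps the dummy columns numeric; recent pandas versions default
# to bool, which can make statsmodels raise a dtype error when fitting OLS
df_enc = pd.get_dummies(df, columns=["Country", "Crop Type"], drop_first=True, dtype=float)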

# Drop any missing values (if present)
df_enc = df_enc.dropna()

# Define predictors and target
X = df_enc.drop("Yield per Hectare (hg/ha)", axis=1)
y = df_enc["Yield per Hectare (hg/ha)"]

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
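To see what the encoding step produces, here is a minimal sketch on a made-up four-row frame. The column names mirror the ones used above, but the values are invented for illustration only. With drop_first=True, the first category becomes the baseline that the regression intercept absorbs, which avoids perfectly collinear dummy columns.

```python
import pandas as pd

# Toy data; column names mirror the tutorial, values are made up
toy = pd.DataFrame({
    "Rainfall": [800.0, 1200.0, 950.0, 1100.0],
    "Crop Type": ["Maize", "Rice", "Wheat", "Rice"],
})

# drop_first=True removes the first category in sorted order ("Maize"),
# so a row with all dummy columns at 0.0 represents Maize
toy_enc = pd.get_dummies(toy, columns=["Crop Type"], drop_first=True, dtype=float)

print(toy_enc.columns.tolist())
print(toy_enc)
```

Each remaining dummy coefficient in the fitted model is then read as the difference from the baseline category.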

Stepwise Regression Function

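The stepwise_selection function alternates two steps. In the forward step, it adds the excluded predictor with the lowest p‑value, provided that value is below 0.01. In the backward step, it removes the included predictor with the highest p‑value if that value exceeds 0.05. The loop terminates when neither step changes the feature set. Keeping the entry threshold stricter than the removal threshold prevents a variable from being added and dropped in an endless cycle.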

This produces a parsimonious feature set.

# Block 3: Forward–backward stepwise selection
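def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    # Copy the starting features; a None default avoids a shared mutable list
    included = list(initial_list) if initial_list is not None else []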
    while True:
        changed = False

        # Forward step
        # Preserve column order so ties are broken deterministically
        excluded = [col for col in X.columns if col not in included]
        new_pvals = pd.Series(dtype=float, index=excluded)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        best_pval = new_pvals.min()
        if best_pval < threshold_in:
            best_var = new_pvals.idxmin()
            included.append(best_var)
            changed = True
            if verbose:
                print(f"Add  {best_var:30} p-value {best_pval:.6f}")

        # Backward step
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        pvals = model.pvalues.iloc[1:]  # exclude intercept
        worst_pval = pvals.max()
        if worst_pval > threshold_out:
            worst_var = pvals.idxmax()
            included.remove(worst_var)
            changed = True
            if verbose:
                print(f"Drop {worst_var:30} p-value {worst_pval:.6f}")

        if not changed:
            break

    return included

Model Building & Evaluation

Model Fitting:

  • An Ordinary Least Squares regression is fit on the selected predictors using statsmodels.
  • The .summary() output provides coefficient estimates, standard errors, p‑values, and overall fit statistics (R², F‑statistic).

Evaluation:

  • Predictions on the test set yield two quantitative metrics, R² (variance explained) and RMSE (prediction error), to assess generalisation performance.

# Block 4: Feature selection
selected_features = stepwise_selection(X_train, y_train)

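# Fit final OLS model
# has_constant="add" always inserts the intercept column, even if a selected
# feature happens to be constant within a split
X_train_sel = sm.add_constant(X_train[selected_features], has_constant="add")
model = sm.OLS(y_train, X_train_sel).fit()

# Review model summary
print(model.summary())

# Predict and evaluate on test set
X_test_sel = sm.add_constant(X_test[selected_features], has_constant="add")
y_pred = model.predict(X_test_sel)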

print("Test R²:", r2_score(y_test, y_pred))
print("Test RMSE:", np.sqrt(mean_squared_error(y_test, y_pred)))
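For readers who want to see what the two metrics measure, here is a small sketch that computes them by hand with NumPy on four made-up values (y_true and y_hat are illustrative, not taken from the dataset). It should agree with r2_score and the square root of mean_squared_error on the same inputs.

```python
import numpy as np

# Made-up observations and predictions, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# RMSE = square root of the mean squared residual
rmse = np.sqrt(np.mean((y_true - y_hat) ** 2))

print("R²:", r2)
print("RMSE:", rmse)
```

RMSE is expressed in the same unit as the target (hg/ha in this project), which makes it easier to interpret than the raw mean squared error.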

Residual Diagnostics

Plotting residuals versus predicted values checks for patterns indicating heteroscedasticity or non‑linearity, validating key OLS assumptions.

# Block 5: Residual plot to check assumptions
residuals = y_test - y_pred
plt.scatter(y_pred, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted Yield (hg/ha)")
plt.ylabel("Residuals")
plt.title("Residuals vs. Predicted Yield")
plt.show()

Summary

By applying stepwise regression to the crop yield dataset, we isolate the most influential factors (such as rainfall, pesticide use, and regional indicators) while discarding variables that add little explanatory power.

The resulting linear model is meant to balance interpretability with predictive accuracy; the test R² and RMSE reported in Block 4 show how well it does so on unseen data.

Farmers and policymakers can leverage this model to forecast yields, optimise input usage, and make informed decisions to enhance food security and resource sustainability.


ProjectGurukul Team

The ProjectGurukul Team delivers project-based tutorials on programming, machine learning, and web development. We simplify learning by providing hands-on projects to help you master real-world skills. We also provide free major and minor projects for engineering students.
