Crop Yield Response Prediction using Polynomial Regression in ML

Agronomists and farm‑management teams need to forecast crop yield (tons per hectare) before harvest from readily measured environmental and management factors: soil moisture, nutrient levels, rainfall, temperature, and fertiliser application rate. These relationships are inherently nonlinear: for example, yield gains taper off at high fertiliser rates, and the fertiliser response interacts multiplicatively with moisture and temperature. A simple linear model cannot represent such curvature, while an unconstrained high‑degree polynomial tends to overfit. By applying Polynomial Regression (i.e., linear regression on engineered polynomial and interaction features) with Ridge regularisation, we can capture smooth nonlinear responses and deliver robust, interpretable yield forecasts for precision‑agriculture decision support.
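
For concreteness, the degree‑2 case of this model and the Ridge objective can be written as below. The notation is chosen here for illustration and is not taken from the dataset: x_1 … x_p are the p = 5 inputs, the β terms are the fitted coefficients, and α is the regularisation strength tuned later by grid search. The model is nonlinear in the inputs but still linear in the coefficients, which is why ordinary Ridge regression can fit it. The intercept β_0 is left unpenalised, as scikit-learn's Ridge does by default.

```latex
\hat{y} = \beta_0 + \sum_{j=1}^{p} \beta_j x_j
        + \sum_{j=1}^{p} \sum_{k=j}^{p} \beta_{jk}\, x_j x_k ,
\qquad
\min_{\beta_0,\,\beta}\; \sum_{i=1}^{n} \bigl( y_i - \hat{y}_i \bigr)^2
        + \alpha \lVert \beta \rVert_2^2
```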

Libraries Required

import pandas as pd                   # data loading & handling  
import numpy as np                    # numerical operations  

import matplotlib.pyplot as plt       # plotting  
import seaborn as sns                 # visualization enhancements  

from sklearn.model_selection import train_test_split, GridSearchCV  
from sklearn.preprocessing import StandardScaler, PolynomialFeatures  
from sklearn.linear_model import Ridge  
from sklearn.pipeline import Pipeline  
from sklearn.metrics import mean_squared_error, r2_score

Dataset

Crop Yield Prediction Using Soil and Weather

Step-by-Step Code Implementation

1. Load Libraries & Data

import pandas as pd

# Adjust path if needed
df = pd.read_csv("data/crop_yield_soil_weather.csv")

# Preview key columns
df.head()[['soil_moisture','soil_nitrogen','rainfall_mm',
           'avg_temp_c','fertilizer_kg_ha','yield_t_ha']]
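
If the CSV is not at hand, the remaining steps can still be exercised on a synthetic stand‑in. The sketch below is an assumption made purely for illustration: the function name make_synthetic_yield, the value ranges, and every coefficient in the response formula are invented, and only the column names match the dataset above. It is not real agronomic data.

```python
import numpy as np
import pandas as pd

def make_synthetic_yield(n_rows=500, seed=0):
    """Return a made-up DataFrame with the tutorial's column names (not real data)."""
    rng = np.random.default_rng(seed)
    soil_moisture = rng.uniform(0.10, 0.45, n_rows)   # assumed range
    soil_nitrogen = rng.uniform(10, 60, n_rows)       # assumed range
    rainfall_mm = rng.uniform(200, 900, n_rows)       # assumed range
    avg_temp_c = rng.uniform(12, 32, n_rows)          # assumed range
    fertilizer = rng.uniform(0, 250, n_rows)          # assumed range

    # Hypothetical response: diminishing returns to fertiliser, a
    # fertiliser x moisture interaction, and a temperature optimum near 24 C.
    yield_t_ha = (
        1.5
        + 0.030 * fertilizer - 0.00008 * fertilizer**2
        + 0.020 * fertilizer * soil_moisture
        + 0.02 * soil_nitrogen
        + 0.002 * rainfall_mm
        - 0.01 * (avg_temp_c - 24) ** 2
        + rng.normal(0, 0.3, n_rows)                  # observation noise
    )
    return pd.DataFrame({
        "soil_moisture": soil_moisture,
        "soil_nitrogen": soil_nitrogen,
        "rainfall_mm": rainfall_mm,
        "avg_temp_c": avg_temp_c,
        "fertilizer_kg_ha": fertilizer,
        "yield_t_ha": yield_t_ha,
    })

df_demo = make_synthetic_yield()
print(df_demo.head())
```

Assigning df = make_synthetic_yield() in place of the read_csv call lets the later steps run unchanged, though any scores obtained this way say nothing about real fields.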

2. Exploratory Data Analysis

import seaborn as sns
import matplotlib.pyplot as plt

# Visualise nonlinear trend: fertilizer vs yield
sns.scatterplot(x='fertilizer_kg_ha', y='yield_t_ha', data=df, alpha=0.5)
plt.title("Fertilizer Rate vs Yield")
plt.xlabel("Fertilizer (kg/ha)")
plt.ylabel("Yield (t/ha)")
plt.show()
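
A scatter plot can hide the taper behind noise, so a numeric check is sometimes useful alongside it. The sketch below bins the fertiliser rate and compares mean yield per bin; shrinking bin‑to‑bin gains indicate diminishing returns. The data are synthetic and the 50 kg/ha bin width is an arbitrary choice made for this illustration.

```python
import numpy as np
import pandas as pd

# Made-up data with a concave fertiliser response (illustration only).
rng = np.random.default_rng(1)
fert = rng.uniform(0, 250, 600)
yld = 2.0 + 0.03 * fert - 0.00008 * fert**2 + rng.normal(0, 0.3, 600)
demo = pd.DataFrame({"fertilizer_kg_ha": fert, "yield_t_ha": yld})

# Bin the fertiliser rate into 50 kg/ha bands: 0-50, 50-100, ..., 200-250.
bins = np.arange(0, 251, 50)
demo["fert_bin"] = pd.cut(demo["fertilizer_kg_ha"], bins=bins, include_lowest=True)

# Mean yield per band, and the gain from each band to the next.
bin_means = demo.groupby("fert_bin", observed=True)["yield_t_ha"].mean()
gains = bin_means.diff().dropna()

print(bin_means.round(2))
print(gains.round(2))   # gains that shrink with rate suggest diminishing returns
```

On the real data, the same steps would apply to df directly, with bin edges adjusted to the observed fertiliser range.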

3. Define Features & Target

PolynomialFeatures (added to the pipeline in step 4) augments the five raw inputs with their squares and pairwise interactions at degree 2 (e.g., fertilizer_kg_ha², rainfall_mm × soil_moisture), and with cubic terms at degree 3. This lets the model learn diminishing returns and synergistic effects.

# Select predictors and response
X = df[['soil_moisture','soil_nitrogen','rainfall_mm','avg_temp_c','fertilizer_kg_ha']]
y = df['yield_t_ha']

4. Build Pipeline with Polynomial Features

StandardScaler standardises each raw input before the polynomial expansion, so the squared and interaction terms are built from comparably scaled variables and Ridge’s ℓ2 penalty is not dominated by high‑variance predictors.

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge

pipe = Pipeline([
    ('scale', StandardScaler()),  
    ('poly', PolynomialFeatures(include_bias=False)),  
    ('ridge', Ridge(random_state=42))  
])
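
The effect of scaling before expansion can be checked numerically. The sketch below uses two made‑up inputs on very different scales (the ranges are assumptions) and compares the spread of column standard deviations in the expanded matrix with and without a preceding StandardScaler. The helper name spread is introduced here for illustration.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(2)
X_demo = np.column_stack([
    rng.uniform(0.10, 0.45, 300),   # soil_moisture (made-up range)
    rng.uniform(200, 900, 300),     # rainfall_mm (made-up range)
])

poly = PolynomialFeatures(degree=2, include_bias=False)
raw_terms = poly.fit_transform(X_demo)
scaled_terms = poly.fit_transform(StandardScaler().fit_transform(X_demo))

def spread(terms):
    """Ratio of the largest to the smallest column standard deviation."""
    sd = terms.std(axis=0)
    return sd.max() / sd.min()

print(f"raw inputs   : {spread(raw_terms):.1f}")
print(f"scaled first : {spread(scaled_terms):.1f}")
```

The polynomial terms are not exactly unit‑variance after this step, since only the raw inputs are standardised, but they end up on a similar scale, which is what the penalty needs.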

5. Train/Test Split & Hyperparameter Search

GridSearchCV searches over the polynomial degree (1–3) and seven log‑spaced values of the regularisation strength α (1e‑3 to 1e3) using 5‑fold cross‑validation on the training set, and selects the combination with the lowest cross‑validated root‑mean‑squared error.

from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

param_grid = {
    'poly__degree' : [1, 2, 3],
    'ridge__alpha' : np.logspace(-3, 3, 7)
}

gs = GridSearchCV(
    pipe, param_grid,
    cv=5,
    scoring='neg_root_mean_squared_error',
    n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)

print("Best parameters:", gs.best_params_)
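
Beyond the single best setting, the full grid of cross‑validated scores shows how sensitive the model is to degree and α. The sketch below runs the same kind of search on small synthetic data (a stand‑in for X_train and y_train; all values are invented) and pivots cv_results_ into a degree × α table of RMSE. The names pipe_demo, gs_demo, and rmse_table are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Made-up data with a curved response (illustration only).
rng = np.random.default_rng(3)
X_demo = rng.uniform(0, 1, size=(300, 2))
y_demo = (3 * X_demo[:, 0] - 2 * X_demo[:, 0] ** 2
          + X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 0.1, 300))

pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("poly", PolynomialFeatures(include_bias=False)),
    ("ridge", Ridge()),
])
grid = {"poly__degree": [1, 2, 3], "ridge__alpha": np.logspace(-3, 3, 7)}
gs_demo = GridSearchCV(pipe_demo, grid, cv=5,
                       scoring="neg_root_mean_squared_error")
gs_demo.fit(X_demo, y_demo)

# cv_results_ holds one row per parameter combination.
res = pd.DataFrame(gs_demo.cv_results_)
res["degree"] = res["param_poly__degree"].astype(int)
res["alpha"] = res["param_ridge__alpha"].astype(float)
res["cv_rmse"] = -res["mean_test_score"]   # the scorer is negated RMSE

rmse_table = res.pivot(index="degree", columns="alpha", values="cv_rmse")
print(rmse_table.round(3))
```

A flat row in this table suggests α barely matters at that degree, while a sharp rise at small α for degree 3 would point to overfitting that the penalty is holding back.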

6. Evaluate Model

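import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_pred = gs.predict(X_test)
# Take the square root explicitly: squared=False is deprecated and removed in recent scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2   = r2_score(y_test, y_pred)

print(f"Test RMSE: {rmse:.3f} t/ha")
print(f"Test R²  : {r2:.3f}")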
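
For readers who want to see what the two metrics measure, the sketch below computes them by hand on four made‑up values and checks the result against scikit-learn. RMSE is the typical prediction error in the units of the target (t/ha), while R² is the share of variance explained relative to always predicting the mean.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 4.5, 5.0, 6.2])   # made-up yields, t/ha
y_hat = np.array([2.8, 4.9, 5.1, 5.7])    # made-up predictions, t/ha

residuals = y_true - y_hat

# RMSE: square root of the mean squared residual.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: one minus residual sum of squares over total sum of squares.
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# Both should agree with the library implementations.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))
r2_sklearn = r2_score(y_true, y_hat)

print(f"RMSE: {rmse_manual:.3f} t/ha, R²: {r2_manual:.3f}")
```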

7. Inspect Key Polynomial Coefficients

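Coefficient inspection reveals which nonlinear terms (e.g., fertilizer_kg_ha², rainfall_mm × soil_nitrogen) most strongly affect the predicted yield, which can guide agronomic decisions on fertiliser dosing and irrigation scheduling. Because the inputs were standardised before expansion, the coefficients are on the standardised scale and their magnitudes are roughly comparable across terms.
Ridge regression applies ℓ2 shrinkage to control the overfitting that the expanded feature space would otherwise invite.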

# Extract polynomial feature names
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=X.columns)

# Retrieve Ridge coefficients
coefs = gs.best_estimator_.named_steps['ridge'].coef_

import pandas as pd
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)

# Plot top 10
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Influencing Yield")
plt.xlabel("Coefficient magnitude")
plt.tight_layout()
plt.show()

Summary

Combining polynomial feature engineering with Ridge regularisation yields a model that:

1. Captures nonlinear soil‑weather‑management effects on crop yield, with accuracy quantified by the test RMSE and R².

2. Controls complexity via α tuning, balancing bias and variance in the presence of high‑order terms.

3. Provides interpretable insights: the most influential polynomial features pinpoint the key agronomic interactions driving yield, supporting data‑driven recommendations for fertiliser application and irrigation management.
