Crop Yield Response Prediction using Polynomial Regression in ML
Agronomists and farm‑management teams need to forecast crop yield (tons per hectare) before harvest from readily measured environmental and management factors: soil moisture, nutrient levels, rainfall, temperature, and fertiliser application rate. These relationships are inherently nonlinear: for example, yield gains taper off at high fertiliser rates, and the fertiliser response itself depends on moisture and temperature. A simple linear model cannot represent such curvature, while an unconstrained high‑degree polynomial tends to overfit. By applying Polynomial Regression (i.e., linear regression on engineered polynomial and interaction features) with Ridge regularisation, we can capture smooth nonlinear responses and produce stable, interpretable yield forecasts for precision‑agriculture decision support.
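To make the curvature argument concrete, here is a minimal sketch on synthetic data. The response curve, its coefficients, and the noise level are made up for illustration and do not come from the dataset used in this article; the point is only that a straight line leaves a systematic error that a quadratic term removes.

```python
import numpy as np

# Hypothetical diminishing-returns curve: yield rises with fertiliser, then flattens.
rng = np.random.default_rng(0)
fert = np.linspace(0, 200, 100)                        # kg/ha, illustrative range
yield_true = 2.0 + 0.04 * fert - 0.0001 * fert ** 2    # concave quadratic, invented
yield_obs = yield_true + rng.normal(0, 0.1, fert.size)

# Least-squares fits of degree 1 (straight line) and degree 2 (with curvature)
lin_coefs = np.polyfit(fert, yield_obs, deg=1)
quad_coefs = np.polyfit(fert, yield_obs, deg=2)

def rmse(y, y_hat):
    """Root-mean-squared error between observations and fitted values."""
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))

rmse_lin = rmse(yield_obs, np.polyval(lin_coefs, fert))
rmse_quad = rmse(yield_obs, np.polyval(quad_coefs, fert))

print(f"Linear fit RMSE   : {rmse_lin:.3f} t/ha")
print(f"Quadratic fit RMSE: {rmse_quad:.3f} t/ha")
print(f"Fitted squared-term coefficient: {quad_coefs[0]:.6f}")
```

A negative squared-term coefficient is the signature of diminishing returns in this toy setting.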
Libraries Required
import pandas as pd              # data loading & handling
import numpy as np               # numerical operations
import matplotlib.pyplot as plt  # plotting
import seaborn as sns            # visualization enhancements
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_squared_error, r2_score
Dataset
Crop Yield Prediction Using Soil and Weather
Step-by-Step Code Implementation
1. Load Libraries & Data
import pandas as pd
# Adjust path if needed
df = pd.read_csv("data/crop_yield_soil_weather.csv")
# Preview key columns
df.head()[['soil_moisture','soil_nitrogen','rainfall_mm',
'avg_temp_c','fertilizer_kg_ha','yield_t_ha']]
2. Exploratory Data Analysis
import seaborn as sns
import matplotlib.pyplot as plt
# Visualise nonlinear trend: fertilizer vs yield
sns.scatterplot(x='fertilizer_kg_ha', y='yield_t_ha', data=df, alpha=0.5)
plt.title("Fertilizer Rate vs Yield")
plt.xlabel("Fertilizer (kg/ha)")
plt.ylabel("Yield (t/ha)")
plt.show()
3. Define Features & Target
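PolynomialFeatures augments the five raw inputs with higher-order terms. At degree 2 these are the squares and pairwise interactions (e.g., fertilizer_kg_ha², rainfall_mm × soil_moisture); at degree 3 cubic and three-way terms are added as well. These extra columns let a linear model learn diminishing returns and synergistic effects.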
# Select predictors and response
X = df[['soil_moisture','soil_nitrogen','rainfall_mm','avg_temp_c','fertilizer_kg_ha']]
y = df['yield_t_ha']
4. Build Pipeline with Polynomial Features
StandardScaler standardises each raw feature before the polynomial expansion, so the generated terms are built from inputs on a common scale and Ridge’s ℓ² penalty is not dominated by predictors that happen to be measured in large units.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import Ridge
pipe = Pipeline([
('scale', StandardScaler()),
('poly', PolynomialFeatures(include_bias=False)),
('ridge', Ridge(random_state=42))
])
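The effect of the scale-then-expand ordering can be seen on a small synthetic example. The two columns below are invented stand-ins for a small-scale and a large-scale predictor; their ranges are assumptions, not values from the dataset.

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures

rng = np.random.default_rng(0)
# Two hypothetical predictors on very different scales
X_demo = np.column_stack([
    rng.uniform(0.1, 0.5, 200),     # small-scale column
    rng.uniform(200, 1200, 200),    # large-scale column
])

prep = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(degree=2, include_bias=False)),
])
Z = prep.fit_transform(X_demo)

raw_std = X_demo.std(axis=0)     # spreads differ by orders of magnitude
lin_std = Z[:, :2].std(axis=0)   # linear terms after scaling: unit spread
all_std = Z.std(axis=0)          # squared/interaction terms are NOT exactly unit spread

print("Raw std      :", raw_std)
print("Scaled linear:", lin_std)
print("All terms    :", all_std)
```

Note that only the linear terms end up with exactly unit variance; the squared and interaction terms are on a comparable but not identical scale, so the penalty is approximately, not perfectly, even across terms.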
5. Train/Test Split & Hyperparameter Search
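GridSearchCV searches over the polynomial degree (1 to 3) and the regularisation strength α (seven log‑spaced values from 1e‑3 to 1e3) using 5‑fold cross‑validation on the training set, and selects the combination with the lowest mean cross‑validated root‑mean‑squared error.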
from sklearn.model_selection import train_test_split, GridSearchCV
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
param_grid = {
'poly__degree' : [1, 2, 3],
'ridge__alpha' : np.logspace(-3, 3, 7)
}
gs = GridSearchCV(
pipe, param_grid,
cv=5,
scoring='neg_root_mean_squared_error',
n_jobs=-1, verbose=1
)
gs.fit(X_train, y_train)
print("Best parameters:", gs.best_params_)
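Beyond the single best combination, the full grid of cross‑validation scores is often worth a look. The following self‑contained sketch repeats the search on synthetic data (the data-generating formula is invented) and ranks all 21 candidates via cv_results_; the same pattern should apply to the gs object above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in data with a known quadratic and interaction structure
rng = np.random.default_rng(0)
X_demo = rng.uniform(-1, 1, size=(150, 2))
y_demo = (1.0 + X_demo[:, 0] - 2.0 * X_demo[:, 0] ** 2
          + X_demo[:, 0] * X_demo[:, 1] + rng.normal(0, 0.05, 150))

pipe_demo = Pipeline([
    ('scale', StandardScaler()),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('ridge', Ridge()),
])
grid = {'poly__degree': [1, 2, 3], 'ridge__alpha': np.logspace(-3, 3, 7)}
gs_demo = GridSearchCV(pipe_demo, grid, cv=5,
                       scoring='neg_root_mean_squared_error')
gs_demo.fit(X_demo, y_demo)

# Scores are negated RMSE, so flip the sign to read them as errors
results = pd.DataFrame(gs_demo.cv_results_)
results['cv_rmse'] = -results['mean_test_score']
ranked = results.sort_values('cv_rmse')[
    ['param_poly__degree', 'param_ridge__alpha', 'cv_rmse', 'std_test_score']
]
best_cv_rmse = -gs_demo.best_score_

print(ranked.head(5))
print("Best CV RMSE:", round(best_cv_rmse, 4))
```

If several candidates score within one standard deviation of the best, preferring the lower degree or the larger α is a common, more conservative choice.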
6. Evaluate Model
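from sklearn.metrics import mean_squared_error, r2_score
import numpy as np
y_pred = gs.predict(X_test)
# Take the square root explicitly: the squared=False argument is deprecated
# and no longer accepted by newer scikit-learn releases
rmse = np.sqrt(mean_squared_error(y_test, y_pred))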
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.3f} t/ha")
print(f"Test R² : {r2:.3f}")
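For readers who want to see what the two metrics compute, here is a small worked example on four made‑up yield values. It evaluates RMSE and R² by hand with NumPy and checks them against scikit‑learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented values purely for illustration (t/ha)
y_true = np.array([3.0, 4.0, 5.0, 6.0])
y_hat = np.array([2.5, 4.5, 5.0, 6.0])

# RMSE: square root of the mean squared residual
residuals = y_true - y_hat
rmse_manual = float(np.sqrt(np.mean(residuals ** 2)))

# R²: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = float(np.sum(residuals ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2_manual = 1.0 - ss_res / ss_tot

# Cross-check against scikit-learn
rmse_sklearn = float(np.sqrt(mean_squared_error(y_true, y_hat)))
r2_sklearn = float(r2_score(y_true, y_hat))

print(f"RMSE: {rmse_manual:.3f}")  # 0.354
print(f"R²  : {r2_manual:.3f}")    # 0.900
```

RMSE is in the same units as the target (t/ha here), whereas R² is unitless and compares the model against always predicting the mean.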
7. Inspect Key Polynomial Coefficients
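- Coefficient inspection shows which nonlinear terms (e.g., fertilizer_kg_ha², rainfall_mm × soil_nitrogen) carry the largest weights in the fitted model. Because the inputs are standardised before expansion, each coefficient refers to scaled rather than raw units, so magnitudes are best read as a rough ranking. That ranking can inform agronomic decisions on fertilizer dosing and irrigation scheduling.
- Ridge regression applies ℓ² shrinkage to control overfitting caused by the expanded feature space.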
# Extract polynomial feature names
poly = gs.best_estimator_.named_steps['poly']
feat_names = poly.get_feature_names_out(input_features=X.columns)
# Retrieve Ridge coefficients
coefs = gs.best_estimator_.named_steps['ridge'].coef_
import pandas as pd
coef_series = pd.Series(coefs, index=feat_names).abs().sort_values(ascending=False)
# Plot top 10
plt.figure(figsize=(8,5))
coef_series.head(10).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title("Top Polynomial Features Influencing Yield")
plt.xlabel("Coefficient magnitude")
plt.tight_layout()
plt.show()
Summary
Combining polynomial feature engineering with Ridge regularisation yields a model that:
1. Captures nonlinear soil‑weather‑management effects on crop yield, with accuracy judged by test RMSE and R².
2. Controls complexity via α tuning, balancing bias and variance in the presence of high‑order terms.
3. Provides interpretable insights: the most influential polynomial features pinpoint the key agronomic interactions driving yield, supporting data‑driven recommendations for fertiliser application and irrigation management.
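The complexity control mentioned in point 2 can be illustrated with a short sketch: as α grows, Ridge shrinks the coefficient vector towards zero. The data and the true coefficients below are synthetic and chosen only for demonstration.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem with five standard-normal predictors
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
true_coefs = np.array([1.0, -2.0, 0.5, 0.0, 3.0])   # invented for illustration
y_demo = X_demo @ true_coefs + rng.normal(0, 0.5, 100)

# Same alpha grid as in the hyperparameter search above
alphas = np.logspace(-3, 3, 7)
coef_norms = [
    float(np.linalg.norm(Ridge(alpha=a).fit(X_demo, y_demo).coef_))
    for a in alphas
]

for a, norm in zip(alphas, coef_norms):
    print(f"alpha={a:>8.3f}  ||coef||_2 = {norm:.4f}")
```

Small α leaves the fit close to ordinary least squares (lower bias, higher variance); large α pulls all coefficients towards zero (higher bias, lower variance). Cross‑validation picks a point between the two.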