Crop Yield Variation Prediction with Lasso Regression in ML
Farmers and agronomists constantly battle shifting weather, soil depletion, and input‑cost pressures. Accurately forecasting how these factors interact to influence tonnes per hectare can help them fine-tune irrigation, fertiliser plans, and seed choice. This project develops a Lasso‑regularised linear model that:
- Predicts the expected crop yield for a given field-season using easily collected soil chemistry, rainfall, temperature, and management variables.
- Highlights the handful of drivers with the most decisive influence by shrinking unimportant coefficients to zero, providing practitioners with a concise, evidence-based checklist for intervention.
Because Lasso couples an ℓ1 penalty with linear regression, it balances interpretability and predictive power, guarding against over‑fitting in datasets where environmental variables are often collinear.
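For reference, the objective that scikit-learn's Lasso minimises can be written as below. The notation is chosen here for illustration and does not come from the original text: n is the number of field-seasons, x_i the pre-processed feature vector for observation i, w the p coefficients, b the intercept (which is not penalised), and α the penalty strength.

```latex
\min_{w,\,b}\; \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - b - x_i^{\top}w\bigr)^2 \;+\; \alpha\sum_{j=1}^{p}\lvert w_j\rvert
```

The absolute-value term is what allows individual coefficients to land exactly on zero, rather than merely shrinking towards it as a squared (ridge) penalty would.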
Libraries Required
| Purpose | Library |
|---|---|
| Data wrangling | pandas, numpy |
| Visualisation | matplotlib, seaborn |
| Modelling pipeline | scikit‑learn: Lasso, Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, GridSearchCV |
| Evaluation | scikit‑learn metrics: mean_squared_error, r2_score |
Dataset Link
Step-by-Step Code Implementation
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
2. Download and load the dataset
The Kaggle file records crop type, soil NPK levels, average rainfall, temperature, and pesticide/fertiliser usage for multiple regions and seasons.
# One‑time download (requires Kaggle API):
# kaggle datasets download -d patelris/crop-yield-prediction-dataset -p data --unzip
data = pd.read_csv("data/crop_yield_data.csv") # adjust name if different
3. Initial inspection & EDA
print(data.head())
print(data.info())
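# Pass the column by keyword so seaborn draws a single horizontal box
sns.boxplot(x=data['yield'])
plt.title('Yield distribution')
plt.show()
# Correlation heatmap over numeric columns only
sns.heatmap(data.corr(numeric_only=True), cmap='RdBu', center=0)
plt.title('Numeric correlation')
plt.show()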
4. Define target & feature matrix
y = data['yield']  # target: tonnes per hectare
X = data.drop(columns=['yield'])
5. Pre‑processing recipe
Country, crop variety, and management‑practice columns are one‑hot encoded; numeric predictors are z‑scaled so the Lasso penalty treats each on equal footing.
cat_cols = X.select_dtypes('object').columns
num_cols = X.select_dtypes(exclude='object').columns
preprocess = ColumnTransformer(
    transformers=[
        # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
        ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
        ('num', StandardScaler(), num_cols)
    ]
)
6. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
7. Build & tune Lasso pipeline
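A log‑spaced grid of 25 values from 0.001 to 10 is searched for the sweet spot between bias and variance. Small α keeps more variables; large α zeroes out more coefficients. Five-fold cross-validation on the training set picks the α with the lowest average validation RMSE, so the choice does not hinge on a single split.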
pipe = Pipeline(steps=[
    ('prep', preprocess),
    ('model', Lasso(max_iter=10_000, random_state=42))
])
param_grid = {'model__alpha': np.logspace(-3, 1, 25)} # 0.001 → 10
search = GridSearchCV(pipe, param_grid,
cv=5, scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)
print("Optimal α:", search.best_params_['model__alpha'])
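The effect of α on sparsity can be illustrated with a small sketch on synthetic data. Everything here is an assumption made for the demonstration: ten random predictors of which only the first two influence the target, and four hand-picked α values rather than the full grid.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (illustrative assumption): 10 predictors, only 2 matter
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
true_coef = np.zeros(10)
true_coef[:2] = [3.0, -2.0]
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

alphas = [0.001, 0.1, 1.0, 10.0]
nonzero_counts = []
for alpha in alphas:
    # Refit at each penalty strength and count the surviving coefficients
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_demo, y_demo)
    nonzero_counts.append(int(np.count_nonzero(model.coef_)))

for alpha, count in zip(alphas, nonzero_counts):
    print(f"alpha={alpha:<6} non-zero coefficients: {count}")
```

At the largest α the penalty outweighs every predictor's association with the target, so all coefficients are zero and the model predicts the mean. The grid search looks for the point before that collapse where validation error is lowest.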
8. Evaluate model
RMSE summarises the typical prediction error in tonnes per hectare, the same unit as the target, and weights large misses more heavily than small ones. R² shows the proportion of yield variance explained.
y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))  # the squared=False argument is gone from recent scikit-learn releases
r2 = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f} t/ha | R²: {r2:.3f}")
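To make the two metrics concrete, the sketch below computes them by hand with NumPy on four made-up yield values (the numbers are illustrative assumptions, not results from the dataset).

```python
import numpy as np

# Hypothetical observed and predicted yields in t/ha
y_true_demo = np.array([3.0, 4.5, 5.0, 6.5])
y_pred_demo = np.array([2.5, 5.0, 5.0, 6.0])

residuals = y_true_demo - y_pred_demo

# RMSE: square the errors, average them, take the square root
rmse_demo = np.sqrt(np.mean(residuals ** 2))

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)
r2_demo = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_demo:.3f} t/ha | R^2: {r2_demo:.2f}")
```

An R² of 0 corresponds to always predicting the mean yield, and it can go negative on a test set if the model does worse than that baseline.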
9. Interpret coefficients
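Non‑zero coefficients that survive the penalty point to candidate levers, e.g. rainfall during the critical growth stage or soil phosphorus. Because numeric predictors were z‑scaled, their coefficient magnitudes are directly comparable. Zeroed features can often be dropped from future data collection, saving cost, although when predictors are strongly correlated Lasso may keep one and zero the others, so a zero alone does not prove a variable is irrelevant.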
# Recover one‑hot names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])
coefs = search.best_estimator_.named_steps['model'].coef_
importance = pd.Series(coefs, index=feature_names).sort_values(key=abs, ascending=False)
plt.figure(figsize=(9,6))
importance.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top 15 Lasso coefficients (ranked by absolute value)')
plt.xlabel('Coefficient')
plt.show()
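Alongside the plot, it can help to list which features were kept and which were zeroed. The sketch below uses a small hand-written Series as a stand-in for `importance`; the feature names and coefficient values are hypothetical.

```python
import pandas as pd

# Hypothetical coefficients standing in for the `importance` Series
demo_importance = pd.Series({
    "num__rainfall": 0.82,
    "num__soil_P": 0.35,
    "cat__crop_rice": -0.40,
    "num__soil_K": 0.0,
    "num__pesticide": 0.0,
})

# Features that survived the penalty, strongest effect first
kept = demo_importance[demo_importance != 0].sort_values(key=abs, ascending=False)

# Features whose coefficients were shrunk exactly to zero
dropped = demo_importance[demo_importance == 0].index.tolist()

print("Kept:\n", kept)
print("Zeroed:", dropped)
```

The sign of each kept coefficient indicates direction: positive values are associated with higher predicted yield and negative values with lower, holding the other features fixed.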
Summary
By combining scikit-learn’s Pipeline, ColumnTransformer, and Lasso, we created an interpretable model that explains yield variation and forecasts production before harvest. Agronomists can plug fresh season data into the pipeline to:
- Receive an early alert if the projected yield dips below the target.
- Prioritise the top environmental or management factors driving that shortfall.
The ℓ1‑regularised approach keeps the model compact, which is crucial when communicating results to growers who value clear, actionable insights over abstract algorithmic complexity.
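The early-alert idea can be sketched in a few lines. The helper `flag_shortfall`, the forecast values, and the 3.5 t/ha target are all hypothetical; in practice the forecasts would come from calling `search.predict` on a frame of new-season data with the same columns as the training set.

```python
import numpy as np

def flag_shortfall(predicted_yield, target_yield):
    """Return a boolean mask marking fields whose forecast is below target."""
    predicted = np.asarray(predicted_yield, dtype=float)
    return predicted < target_yield

# Hypothetical forecasts in t/ha, standing in for search.predict(new_season_df)
forecasts = np.array([4.2, 3.1, 5.0, 2.8])

alerts = flag_shortfall(forecasts, target_yield=3.5)
shortfall_fields = np.flatnonzero(alerts).tolist()  # positions of flagged fields

print("Fields below target:", shortfall_fields)
```

For the flagged fields, the non-zero coefficients from the fitted model indicate which inputs are worth examining first.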