Crop Yield Variation Prediction with Lasso Regression in ML


Farmers and agronomists constantly battle shifting weather, soil depletion, and input‑cost pressures. Accurately forecasting how these factors interact to influence tonnes per hectare can help them fine-tune irrigation, fertiliser plans, and seed choice. This project develops a Lasso‑regularised linear model that:

  • Predicts the expected crop yield for a given field-season using easily collected soil chemistry, rainfall, temperature, and management variables.
  • Highlights the handful of drivers with the most decisive influence by shrinking unimportant coefficients to zero, providing practitioners with a concise, evidence-based checklist for intervention.

Because Lasso couples an ℓ1 penalty with linear regression, it balances interpretability and predictive power, guarding against over‑fitting in datasets where environmental variables are often collinear.
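For reference, the objective minimised by scikit-learn's Lasso can be written as below. The symbols are notation introduced here rather than taken from the article: n is the number of field-seasons, x_i the pre-processed feature vector, y_i the observed yield, w the coefficient vector, b the unpenalised intercept, p the number of features, and α the penalty strength.

```latex
\hat{w}, \hat{b} = \arg\min_{w,\,b} \; \frac{1}{2n} \sum_{i=1}^{n} \left( y_i - b - x_i^{\top} w \right)^2 + \alpha \lVert w \rVert_1,
\qquad \lVert w \rVert_1 = \sum_{j=1}^{p} \lvert w_j \rvert
```

The first term is the usual least-squares fit; the second charges a cost proportional to the absolute size of each coefficient, which is what drives weak coefficients exactly to zero.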

Libraries Required

Purpose             Library
Data wrangling      pandas, numpy
Visualisation       matplotlib, seaborn
Modelling pipeline  scikit-learn: Lasso, Pipeline, ColumnTransformer, StandardScaler, OneHotEncoder, GridSearchCV
Evaluation          scikit-learn: mean_squared_error, r2_score

Dataset Link

Crop Yield Prediction Dataset

Step-by-Step Code Implementation

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score

2.  Download and load the dataset

The Kaggle file records crop type, soil NPK levels, average rainfall, temperature, and pesticide/fertiliser usage for multiple regions and seasons.

# One‑time download (requires Kaggle API):
# kaggle datasets download -d patelris/crop-yield-prediction-dataset -p data --unzip

data = pd.read_csv("data/crop_yield_data.csv")   # adjust name if different
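Before modelling, it can help to confirm that the columns the later steps rely on are actually present, since the file and column names may differ between dataset versions. The sketch below is only illustrative: check_columns is a hypothetical helper, and the toy frame and its column names stand in for the real CSV.

```python
import pandas as pd

def check_columns(df, required):
    """Return the required column names that are absent from df."""
    return [col for col in required if col not in df.columns]

# Toy frame standing in for the loaded CSV; the column names are assumptions.
toy = pd.DataFrame({
    "crop": ["maize", "wheat", "rice"],
    "rainfall": [820.0, None, 1100.0],
    "yield": [4.1, 3.2, 5.0],
})

missing_cols = check_columns(toy, ["yield", "rainfall", "temperature"])
missing_values = toy.isna().sum()  # count of NaN entries per column

print("Absent columns:", missing_cols)
print("Missing values per column:")
print(missing_values)
```

On the real data, the same two checks would be run on `data` with whichever column names the downloaded file uses.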

3.  Initial inspection & EDA

print(data.head())
print(data.info())

sns.boxplot(x=data['yield']); plt.title('Yield distribution'); plt.show()
sns.heatmap(data.corr(numeric_only=True), cmap='RdBu', center=0); plt.title('Numeric correlation'); plt.show()

4.  Define target & feature matrix

y = data['yield']                        # target: tonnes per hectare
X = data.drop(columns=['yield'])

5.  Pre‑processing recipe

Country, crop variety, and management‑practice columns are one‑hot encoded; numeric predictors are z‑scaled so the Lasso penalty treats each on equal footing.

cat_cols = X.select_dtypes('object').columns
num_cols = X.select_dtypes(exclude='object').columns

preprocess = ColumnTransformer(
    transformers=[
        # sparse_output replaces the older `sparse` argument (scikit-learn >= 1.2)
        ('cat', OneHotEncoder(drop='first', sparse_output=False), cat_cols),
        ('num', StandardScaler(), num_cols)
    ]
)

6.  Train/test split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

7.  Build & tune Lasso pipeline

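The search covers 25 log-spaced α values from 0.001 to 10 to find a good trade-off between bias and variance. A small α keeps more variables in the model; a large α shrinks more coefficients to zero. Five-fold cross-validation on the training set selects the α with the lowest average RMSE, so the choice is not tuned to a single split.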

pipe = Pipeline(steps=[
    ('prep', preprocess),
    ('model', Lasso(max_iter=10_000, random_state=42))
])

param_grid = {'model__alpha': np.logspace(-3, 1, 25)}   # 0.001 → 10
search = GridSearchCV(pipe, param_grid,
                      cv=5, scoring='neg_root_mean_squared_error')
search.fit(X_train, y_train)

print("Optimal α:", search.best_params_['model__alpha'])
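The effect of α on sparsity can be checked directly. The sketch below uses synthetic data (not the crop dataset) in which only two of eight features carry signal, and counts the non-zero coefficients at a few values from the same range as the grid above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n_samples, n_features = 200, 8
X_demo = rng.standard_normal((n_samples, n_features))
# Only the first two columns carry signal; the other six are pure noise.
y_demo = 3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1] + rng.normal(scale=0.5, size=n_samples)

alphas = np.logspace(-3, 1, 5)  # 0.001, 0.01, 0.1, 1, 10
nonzero_counts = []
for alpha in alphas:
    model = Lasso(alpha=alpha, max_iter=10_000).fit(X_demo, y_demo)
    nonzero_counts.append(int(np.sum(model.coef_ != 0)))

for alpha, count in zip(alphas, nonzero_counts):
    print(f"alpha={alpha:g}: {count} non-zero coefficients")
```

At the largest α every coefficient is shrunk to zero and the model predicts the mean yield, which is why the grid needs to extend far enough in both directions for cross-validation to find an interior optimum.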

8.  Evaluate model

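RMSE expresses the typical prediction error in the target's own unit, tonnes per hectare, which agronomists can read directly; because errors are squared before averaging, large misses weigh more heavily than in a plain average. R² shows the proportion of yield variance explained.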

y_pred = search.predict(X_test)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))   # the squared=False argument is gone in recent scikit-learn
r2   = r2_score(y_test, y_pred)
print(f"Test RMSE: {rmse:.2f} t/ha | R²: {r2:.3f}")
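As a check on what the two metrics measure, they can be computed by hand. The four observed and predicted yields below are made-up numbers chosen only to keep the arithmetic easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_true = np.array([3.0, 5.0, 4.0, 6.0])  # made-up observed yields, t/ha
y_hat = np.array([2.5, 5.5, 4.0, 6.5])   # made-up predictions, t/ha

residuals = y_true - y_hat
rmse_manual = np.sqrt(np.mean(residuals ** 2))   # root of the mean squared error
ss_res = np.sum(residuals ** 2)                  # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2_manual = 1.0 - ss_res / ss_tot

print(f"RMSE: {rmse_manual:.3f} t/ha")  # 0.433
print(f"R2:   {r2_manual:.3f}")         # 0.850
```

Here the squared residuals sum to 0.75 against a total variation of 5.0, so 85% of the variance is explained, and the hand-computed values agree with mean_squared_error and r2_score.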

9.  Interpret coefficients

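Non-zero coefficients that survive the penalty point to candidate levers, e.g. rainfall during the critical growth stage or soil phosphorus. Because the numeric predictors are z-scaled, each of their coefficients is the change in predicted yield per one standard deviation of that predictor. Zeroed features are candidates for dropping from future data collection, which saves cost; note, however, that when predictors are collinear Lasso may keep one and zero the others, so a zero coefficient does not prove a variable is irrelevant.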

# Recover one‑hot names
ohe = search.best_estimator_.named_steps['prep'].named_transformers_['cat']
ohe_names = ohe.get_feature_names_out(cat_cols)
feature_names = np.hstack([ohe_names, num_cols])

coefs = search.best_estimator_.named_steps['model'].coef_
importance = pd.Series(coefs, index=feature_names).sort_values(key=abs, ascending=False)

plt.figure(figsize=(9,6))
importance.head(15).plot(kind='barh')
plt.gca().invert_yaxis()
plt.title('Top 15 Lasso coefficients (ranked by absolute value)')
plt.xlabel('Coefficient')
plt.show()

Summary

By combining scikit-learn’s Pipeline, ColumnTransformer, and Lasso, we created an interpretable model that explains yield variation and forecasts production before harvest. Agronomists can plug fresh season data into the pipeline to:

  • Receive an early alert if the projected yield dips below the target.
  • Prioritise the top environmental or management factors driving that shortfall.
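A minimal sketch of the early-alert idea follows, assuming a hypothetical target of 3.5 t/ha and synthetic history with two numeric predictors. In the article's workflow one would instead call search.predict on a DataFrame of new field-seasons with the same columns as the training data.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)
# Synthetic history: rainfall (mm) and temperature (deg C); values are made up.
X_hist = np.column_stack([rng.normal(800, 150, 200), rng.normal(22, 3, 200)])
y_hist = (4.0 + 0.004 * (X_hist[:, 0] - 800)
          - 0.1 * (X_hist[:, 1] - 22)
          + rng.normal(0, 0.3, 200))

alert_pipe = Pipeline(steps=[
    ("scale", StandardScaler()),
    ("model", Lasso(alpha=0.01, max_iter=10_000)),
]).fit(X_hist, y_hist)

TARGET_YIELD = 3.5  # hypothetical target in t/ha, set by the grower
X_new = np.array([[500.0, 27.0],   # dry, hot season
                  [900.0, 21.0]])  # wet, mild season
projected = alert_pipe.predict(X_new)
alerts = projected < TARGET_YIELD  # True where the forecast falls short

for row, value, flag in zip(X_new, projected, alerts):
    status = "ALERT" if flag else "ok"
    print(f"rainfall={row[0]:.0f} mm, temp={row[1]:.0f} C -> {value:.2f} t/ha [{status}]")
```

Because the whole pre-processing chain lives inside the pipeline, new rows are scaled with the statistics learned from the training data, so no separate transformation step is needed at prediction time.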

The ℓ1‑regularised approach keeps the model compact—crucial when communicating results to growers who value clear, actionable insights over abstract algorithmic complexity.


ProjectGurukul Team
