Soil Nutrient Level Prediction using Linear Regression in ML


Soil scientists and agronomists often need a quick estimate of the primary macronutrients—nitrogen (N), phosphorus (P), and potassium (K)—present in a soil sample so they can prescribe fertiliser rates and plan crop rotations.

In this mini-project, we build a linear-regression baseline that predicts each nutrient's concentration in parts per million (ppm) from inexpensive, easy-to-obtain field measurements: soil pH, electrical conductivity, organic-matter percentage, moisture, and region. Because the model is a transparent linear fit, its coefficients show which factors are most strongly associated with N, P, and K, and its scores provide a benchmark before switching to more complex regressors.

Libraries Required

pandas # data wrangling
numpy # numerical helpers
matplotlib.pyplot # optional quick plots
scikit‑learn # preprocessing, model, metrics
joblib # save the trained pipeline

Dataset Link

Soil measures

Step-by-Step Code Implementation

Why linear regression first? In many soils, macronutrient levels can respond approximately linearly to pH, organic matter, and salinity within normal ranges. A straight-line fit quantifies these sensitivities (ppm change per unit change in each input) and is easy to explain to extension agents.

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load & peek at the data

Download soil_measures.csv from the dataset link above, then load it:

df = pd.read_csv("soil_measures.csv")
print(df.head())

Typical columns

column	description
N / P / K	macronutrient ppm (targets)
pH	soil pH (0–14)
EC	electrical conductivity (dS m⁻¹)
OrganicMatter	% by weight
Moisture	% at sampling time
Region	categorical (e.g., North, East …)

3. Basic cleaning

core = ['N','P','K','pH','EC','OrganicMatter','Moisture','Region']
df   = df.dropna(subset=core).copy()
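
Dropping missing values does not catch readings that are present but physically implausible. A minimal sketch of a range check follows, assuming a small made-up sample in place of soil_measures.csv; the bounds are illustrative assumptions (pH on the 0–14 scale, non-negative conductivity, percentages between 0 and 100) and should be adjusted to the lab's own limits.

```python
import pandas as pd

# Hypothetical mini-sample standing in for soil_measures.csv
sample = pd.DataFrame({
    "pH":            [6.5, 7.2, 15.0, 5.8],   # 15.0 is outside the 0-14 scale
    "EC":            [0.4, 1.1, 0.9, -0.2],   # negative conductivity is not physical
    "OrganicMatter": [2.1, 3.4, 1.8, 2.9],
    "Moisture":      [18.0, 22.5, 30.1, 12.4],
})

# Assumed plausibility bounds (inclusive); not taken from the dataset
bounds = {
    "pH":            (0, 14),
    "EC":            (0, float("inf")),
    "OrganicMatter": (0, 100),
    "Moisture":      (0, 100),
}

# Start with every row marked valid, then knock out rows that break any bound
valid = pd.Series(True, index=sample.index)
for col, (low, high) in bounds.items():
    valid &= sample[col].between(low, high)

cleaned = sample[valid].copy()
print(f"kept {len(cleaned)} of {len(sample)} rows")
```

Here the third row (pH 15.0) and the fourth row (negative EC) are filtered out, so two rows remain.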

4. Choose predictors & label(s)

We will train one model per nutrient to ensure crystal-clear interpretation.

num_cols = ['pH', 'EC', 'OrganicMatter', 'Moisture']
cat_cols = ['Region']

5. Reusable preprocessing pipeline

Standard scaling puts the numeric coefficients on a comparable footing (ppm per one standard deviation, 1 σ), and one-hot encoding treats Region as categorical, avoiding a fake numeric ordering.

preproc = ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                      num_cols)
])

linreg = LinearRegression()
pipe   = Pipeline([('prep', preproc),
                   ('model', linreg)])

6. Train, evaluate, and save for each nutrient

Separate models for N, P, and K keep each nutrient's coefficient table and metrics self-contained, which simplifies interpretation.
Evaluation metrics – R² is the fraction of nutrient variance the model explains on the held-out test set; MAE in ppm gives a tangible typical error. Agronomists can then decide whether an error of, say, ±3 ppm is “good enough” for fertiliser planning.
Joblib persistence lets a mobile soil‑testing app load the .pkl files and output nutrient estimates on‑site without retraining.
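
To make the two metrics concrete, here is a hand computation on five made-up ppm values (not results from the dataset), checked against the scikit-learn functions used in this project.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error

# Hypothetical measured and predicted values in ppm
y_true = np.array([40.0, 55.0, 62.0, 48.0, 70.0])
y_pred = np.array([43.0, 52.0, 60.0, 50.0, 66.0])

# MAE: average absolute miss, in the same unit as the target
mae = np.mean(np.abs(y_true - y_pred))

# R^2: 1 minus (squared error of the model / squared error of always predicting the mean)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print(f"MAE by hand : {mae:.1f} ppm, sklearn: {mean_absolute_error(y_true, y_pred):.1f} ppm")
print(f"R2 by hand  : {r2:.3f}, sklearn: {r2_score(y_true, y_pred):.3f}")
```

An R² of 0 means the model does no better than predicting the mean for every sample, and it can be negative on a test set if the model does worse than that.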

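from sklearn.base import clone

models = {}
X = df[num_cols + cat_cols]          # same predictors for every nutrient
for nutrient in ['N', 'P', 'K']:
    y = df[nutrient]

    X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.2, random_state=42)

    # Fit a fresh copy so each nutrient keeps its own fitted pipeline;
    # refitting the shared `pipe` would leave all three dict entries
    # pointing at the last (K) model.
    model = clone(pipe)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    print(f"\n=== {nutrient} Model ===")
    print(f"R²  : {r2_score(y_test, pred):.3f}")
    print(f"MAE : {mean_absolute_error(y_test, pred):.1f} ppm")

    models[nutrient] = model
    joblib.dump(model, f"soil_{nutrient.lower()}_linreg.pkl")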
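
Loading a saved pipeline and scoring a new sample could look like the sketch below. To stay self-contained it trains on synthetic data with an invented relationship between organic matter and N, and writes the .pkl file to a temporary directory; in practice you would load the files produced by the loop above.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ['pH', 'EC', 'OrganicMatter', 'Moisture']
cat_cols = ['Region']

# Synthetic stand-in for the real dataset (all values are made up)
rng = np.random.default_rng(0)
n = 60
train = pd.DataFrame({
    'pH':            rng.uniform(5, 8, n),
    'EC':            rng.uniform(0.1, 2.0, n),
    'OrganicMatter': rng.uniform(0.5, 6, n),
    'Moisture':      rng.uniform(5, 40, n),
    'Region':        rng.choice(['North', 'East'], n),
})
train['N'] = 20 + 8 * train['OrganicMatter'] + rng.normal(0, 2, n)

pipe = Pipeline([
    ('prep', ColumnTransformer([
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(), num_cols),
    ])),
    ('model', LinearRegression()),
])
pipe.fit(train[num_cols + cat_cols], train['N'])

# Persist the fitted pipeline, then reload it as a separate app would
path = os.path.join(tempfile.mkdtemp(), "soil_n_linreg.pkl")
joblib.dump(pipe, path)
loaded = joblib.load(path)

# One new field reading; column names must match those used in training
new_sample = pd.DataFrame([{
    'pH': 6.8, 'EC': 0.7, 'OrganicMatter': 3.2, 'Moisture': 21.0, 'Region': 'North',
}])
estimate = loaded.predict(new_sample)[0]
print(f"Estimated N: {estimate:.1f} ppm")
```

Because the preprocessing is stored inside the pipeline, the app passes raw readings and never has to re-implement the scaling or encoding. Only load .pkl files from sources you trust, since unpickling can execute arbitrary code.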

7. Interpreting coefficients (example for Nitrogen)

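Coefficient tables point to potentially actionable levers. For example, a substantial positive weight on OrganicMatter for N would suggest that adding compost could boost nitrogen reserves, although a coefficient shows an association in the data rather than a proven cause.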

ohe_feats = models['N'].named_steps['prep']\
                       .named_transformers_['cat']\
                       .get_feature_names_out(cat_cols)
all_feats = list(ohe_feats) + num_cols

coef_series = pd.Series(models['N'].named_steps['model'].coef_,
                        index=all_feats).sort_values()

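# With few features the two lists can overlap, so check the signs as well
print("\nLowest five coefficients for N (most negative first):")
print(coef_series.head(5))

print("\nHighest five coefficients for N (most positive last):")
print(coef_series.tail(5))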

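Because the numeric inputs are z-scored, each of their coefficients reads as the ppm change in N for a one-standard-deviation change in that feature. The one-hot Region columns are not scaled, so their coefficients read as ppm offsets for samples from that region.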
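
To express a standardized coefficient per original unit (for example, ppm per pH unit), divide it by that feature's standard deviation, which StandardScaler stores in scale_. A minimal sketch, assuming noise-free synthetic data and a numeric-only pipeline, both invented here so the recovered slopes can be checked:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with known per-unit slopes: -2 ppm per pH unit, +8 ppm per % organic matter
rng = np.random.default_rng(1)
X = pd.DataFrame({
    'pH':            rng.uniform(5, 8, 200),
    'OrganicMatter': rng.uniform(0.5, 6, 200),
})
y = 30 - 2.0 * X['pH'] + 8.0 * X['OrganicMatter']

pipe = Pipeline([('scale', StandardScaler()),
                 ('model', LinearRegression())]).fit(X, y)

coef_per_sd = pipe.named_steps['model'].coef_   # ppm per 1 standard deviation
sd = pipe.named_steps['scale'].scale_           # standard deviation of each feature
coef_per_unit = coef_per_sd / sd                # ppm per original unit

print(pd.DataFrame({'per_sd': coef_per_sd, 'per_unit': coef_per_unit},
                   index=X.columns).round(3))
```

The per-σ view is better for ranking features against each other; the per-unit view is easier to communicate in the field.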

Summary

In well under 100 lines of Python, we transformed routine soil-lab readings into an explainable nutrient-content estimator:

Instant ppm estimates help farmers fine‑tune fertiliser rates before planting.
Transparent coefficients quantify how pH, salinity, organic matter, and moisture relate to N-P-K levels in this dataset.

Keep this linear baseline as a benchmark—when you switch to tree ensembles or spectroscopy‑based models, you’ll know precisely how much additional predictive value the sophistication brings.
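
One way to run that comparison is to score the baseline and a candidate model with the same cross-validation folds and the same metric. The sketch below assumes synthetic data with a purely linear signal and uses a random forest as the stand-in ensemble; neither comes from the project's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic features and a made-up linear target with noise (ppm)
rng = np.random.default_rng(2)
X = rng.uniform(0, 1, size=(200, 4))
y = 30 + 10 * X[:, 0] - 5 * X[:, 1] + rng.normal(0, 1, 200)

candidates = {
    "linear baseline": LinearRegression(),
    "random forest":   RandomForestRegressor(n_estimators=50, random_state=0),
}

mae = {}
for name, model in candidates.items():
    # scikit-learn reports negated MAE so that higher is always better
    scores = cross_val_score(model, X, y, cv=5,
                             scoring="neg_mean_absolute_error")
    mae[name] = -scores.mean()
    print(f"{name:16s} MAE: {mae[name]:.2f} ppm")
```

If the more complex model does not beat the baseline's MAE by a margin that matters agronomically, the simpler and more explainable model is usually the better choice.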
