Soil Nutrient Level Prediction using Linear Regression in ML
Soil scientists and agronomists often need a quick estimate of the primary macronutrients—nitrogen (N), phosphorus (P), and potassium (K)—present in a soil sample so they can prescribe fertiliser rates and plan crop rotations.
In this mini-project, we build a linear-regression baseline that predicts each nutrient's concentration (in parts per million, ppm) from inexpensive, easy-to-obtain field measurements: soil pH, electrical conductivity, organic-matter percentage, moisture, and region. A transparent linear model shows which factors most strongly influence N, P, and K, and it provides a benchmark before switching to more complex regressors.
Libraries Required
- pandas # data wrangling
- numpy # numerical helpers
- matplotlib.pyplot # optional quick plots
- scikit‑learn # preprocessing, model, metrics
- joblib # save the trained pipeline
Dataset Link
Step-by-Step Code Implementation
Why linear regression first? Within normal ranges, macronutrient levels can often be approximated as linear functions of pH, organic matter, and salinity. A straight-line fit summarises each of these sensitivities as a single coefficient and is easy to explain to extension agents.
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load & peek at the data
Download soil_measures.csv from the given link above:
df = pd.read_csv("soil_measures.csv")
print(df.head())
Typical columns
| column | description |
| --- | --- |
| N / P / K | macronutrient ppm (targets) |
| pH | soil pH (0–14) |
| EC | electrical conductivity (dS m⁻¹) |
| OrganicMatter | % by weight |
| Moisture | % at sampling time |
| Region | categorical (e.g., North, East …) |
3. Basic cleaning
core = ['N', 'P', 'K', 'pH', 'EC', 'OrganicMatter', 'Moisture', 'Region']
df = df.dropna(subset=core).copy()
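Dropping missing values does not catch readings that are present but physically impossible. The following is a minimal sketch of an extra range check, assuming the column names from the table above; `drop_implausible` is a hypothetical helper, and the bounds (pH within 0 to 14, percentages within 0 to 100, non-negative EC and nutrient values) are illustrative assumptions rather than rules taken from the dataset.

```python
import pandas as pd


def drop_implausible(df: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows whose readings fall inside plausible ranges.

    The bounds below are illustrative assumptions, not dataset rules.
    """
    ok = (
        df["pH"].between(0, 14)
        & (df["EC"] >= 0)
        & df["OrganicMatter"].between(0, 100)
        & df["Moisture"].between(0, 100)
        & (df[["N", "P", "K"]] >= 0).all(axis=1)
    )
    return df.loc[ok].copy()


# Made-up frame: row 1 has an impossible pH, row 2 a negative N reading.
demo = pd.DataFrame({
    "N": [40.0, 35.0, -5.0],
    "P": [12.0, 10.0, 9.0],
    "K": [150.0, 140.0, 130.0],
    "pH": [6.5, 15.2, 7.0],
    "EC": [0.8, 0.9, 1.1],
    "OrganicMatter": [3.1, 2.8, 2.5],
    "Moisture": [22.0, 18.0, 25.0],
    "Region": ["North", "East", "North"],
})

clean = drop_implausible(demo)
print(f"kept {len(clean)} of {len(demo)} rows")
```

In the project itself, the same check could be applied to `df` right after the `dropna` call.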
4. Choose predictors & label(s)
We will train one model per nutrient to ensure crystal-clear interpretation.
num_cols = ['pH', 'EC', 'OrganicMatter', 'Moisture']
cat_cols = ['Region']
5. Reusable preprocessing pipeline
Standard scaling puts the numeric coefficients on a comparable footing (ppm per one standard deviation), and one-hot encoding treats Region as categorical, which avoids imposing a fake numeric ordering.
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline([('prep', preproc),
                 ('model', linreg)])
6. Train, evaluate, and save for each nutrient
- Separate models for N, P, and K avoid multivariate target interactions that could complicate interpretation.
- Evaluation metrics – R² is the fraction of nutrient variance the model explains on the held-out test set; MAE in ppm gives a tangible error band. Agronomists can decide whether an average error of, say, ±3 ppm is “good enough” for fertiliser planning.
- Joblib persistence lets a mobile soil‑testing app load the .pkl files and output nutrient estimates on‑site without retraining.
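For readers who want to see what the two metrics measure, here is a small worked example on made-up numbers (the readings and predictions are invented for illustration). It computes MAE and R² by hand so they can be compared with the scikit-learn functions.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Made-up nitrogen readings (ppm) and predictions, for illustration only.
y_true = np.array([30.0, 40.0, 50.0, 60.0])
y_pred = np.array([32.0, 38.0, 53.0, 59.0])

residuals = y_true - y_pred
mae = np.abs(residuals).mean()                   # average miss, in ppm
ss_res = (residuals ** 2).sum()                  # variation the model misses
ss_tot = ((y_true - y_true.mean()) ** 2).sum()   # total variation around the mean
r2 = 1 - ss_res / ss_tot

print(f"MAE = {mae:.1f} ppm, R² = {r2:.3f}")     # MAE = 2.0 ppm, R² = 0.964

# The hand-computed values agree with scikit-learn's implementations.
print(np.isclose(mae, mean_absolute_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```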
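from sklearn.base import clone  # gives each nutrient its own unfitted copy of the pipeline

models = {}
for nutrient in ['N', 'P', 'K']:
    X = df[num_cols + cat_cols]
    y = df[nutrient]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)
    # Clone before fitting: refitting the shared `pipe` would leave every
    # entry in `models` pointing at the last model trained (K).
    model = clone(pipe)
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    print(f"\n=== {nutrient} Model ===")
    print(f"R²  : {r2_score(y_test, pred):.3f}")
    print(f"MAE : {mean_absolute_error(y_test, pred):.1f} ppm")
    models[nutrient] = model
    joblib.dump(model, f"soil_{nutrient.lower()}_linreg.pkl")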
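The persistence step can be sketched end to end as follows. Because soil_measures.csv is not bundled here, this example trains on synthetic data (an assumption made only so the snippet runs on its own), saves the pipeline under the same file-name pattern, reloads it, and scores one new sample. The new sample must be a DataFrame with the same column names used during training.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["pH", "EC", "OrganicMatter", "Moisture"]
cat_cols = ["Region"]

# Synthetic stand-in for the real dataset (assumption, for illustration only).
rng = np.random.default_rng(0)
n = 200
train = pd.DataFrame({
    "pH": rng.uniform(5, 8, n),
    "EC": rng.uniform(0.2, 2.0, n),
    "OrganicMatter": rng.uniform(1, 6, n),
    "Moisture": rng.uniform(10, 35, n),
    "Region": rng.choice(["North", "East", "South"], n),
})
train["N"] = 20 + 8 * train["OrganicMatter"] + rng.normal(0, 2, n)

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])),
    ("model", LinearRegression()),
])
pipe.fit(train[num_cols + cat_cols], train["N"])

# Save and reload, as a field app might do with the shipped .pkl file.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "soil_n_linreg.pkl"
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

# One new field reading, using the training column names.
sample = pd.DataFrame([{
    "pH": 6.4, "EC": 0.9, "OrganicMatter": 3.2,
    "Moisture": 21.0, "Region": "North",
}])
estimate = float(loaded.predict(sample)[0])
print(f"Estimated N: {estimate:.1f} ppm")
```

Pickled pipelines are generally only safe to load with the same scikit-learn version that produced them, so the app and the training environment should pin matching versions.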
7. Interpreting coefficients (example for Nitrogen)
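Coefficient tables point to candidate levers. For example, a substantial positive weight on OrganicMatter for N would indicate that soils richer in organic matter tend to hold more nitrogen, which is consistent with (though not proof of) compost additions boosting nitrogen reserves.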
ohe_feats = (models['N'].named_steps['prep']
             .named_transformers_['cat']
             .get_feature_names_out(cat_cols))
# Order must match the ColumnTransformer: one-hot Region columns first, then numeric
all_feats = list(ohe_feats) + num_cols
coef_series = pd.Series(models['N'].named_steps['model'].coef_,
                        index=all_feats).sort_values()
print("\nFactors decreasing N content:")
print(coef_series.head(5))
print("\nFactors increasing N content:")
print(coef_series.tail(5))
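Because the numeric inputs are z-scored, each numeric coefficient reads as the ppm change in N for a one-standard-deviation change in that feature, holding the others fixed. The one-hot Region coefficients are not scaled: they act as ppm offsets for each region, and because the model also fits an intercept, only the differences between them are meaningful.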
Summary
In well under 100 lines of Python, we turned routine soil-lab readings into an explainable nutrient-content estimator:
- Instant ppm estimates help farmers fine‑tune fertiliser rates before planting.
- Transparent coefficients quantify how pH, salinity, organic matter, and moisture relate to N-P-K levels.
Keep this linear baseline as a benchmark: when you switch to tree ensembles or spectroscopy-based models, you will be able to measure how much additional predictive value the sophistication brings.
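One way to make that comparison concrete is to score the baseline and a candidate model on identical cross-validation folds. The sketch below is an illustration under stated assumptions: the data are synthetic, the random forest settings are arbitrary, and `build` is a hypothetical helper that wraps any regressor in the same preprocessing.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["pH", "EC", "OrganicMatter", "Moisture"]
cat_cols = ["Region"]

# Synthetic stand-in for the real dataset (assumption, for illustration only).
rng = np.random.default_rng(2)
n = 300
X = pd.DataFrame({
    "pH": rng.uniform(5, 8, n),
    "EC": rng.uniform(0.2, 2.0, n),
    "OrganicMatter": rng.uniform(1, 6, n),
    "Moisture": rng.uniform(10, 35, n),
    "Region": rng.choice(["North", "East", "South"], n),
})
y = 20 + 8 * X["OrganicMatter"] - 3 * X["pH"] + rng.normal(0, 2, n)


def build(model):
    """Wrap a regressor in the same preprocessing used for the baseline."""
    prep = ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])
    return Pipeline([("prep", prep), ("model", model)])


# Identical folds for both models keep the comparison fair.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

scores = {}
for name, model in candidates.items():
    neg_mae = cross_val_score(build(model), X, y, cv=cv,
                              scoring="neg_mean_absolute_error")
    scores[name] = -neg_mae.mean()  # flip sign back to a positive MAE in ppm
    print(f"{name:>6}: MAE = {scores[name]:.2f} ppm")
```

If the more complex model does not beat the baseline MAE by a margin that matters for fertiliser decisions, the simpler and more explainable model may be the better choice.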