Restaurant Profit Prediction using Linear Regression in ML
Independent owners and franchise groups alike ask the same question before expanding or redesigning a concept:
“Given a restaurant’s location, concept type, opening age and local demographics, how much net profit will it make in its next full fiscal year?”
Being able to forecast annual profit (USD) early allows investors to size loans, landlords to negotiate leases, and operators to plan headcount. Here we build a linear‑regression baseline that predicts a unit’s profit from readily available data:
- restaurant age (years since opening)
- city and city‑group size (big city / other)
- concept type (food‑court / inline / drive‑thru / mobile)
- 37 anonymised site variables supplied by the franchisor (P1 … P37)
- a simple industry‑average margin to convert revenue into profit.
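The dataset comes from the classic “Restaurant Revenue Prediction” Kaggle competition. It provides yearly revenue only, so to obtain an approximate profit target we assume a flat 15 % net margin for every unit. Treat that figure as a placeholder: real margins vary by concept and site, so substitute your own number if you have one.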
Libraries Required
- pandas # tabular wrangling
- numpy # numerical helpers
- matplotlib.pyplot # quick sanity plots
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained pipeline
Dataset Link
Restaurant Revenue Prediction Dataset
Step-by-Step Code Implementation
Why linear regression? Restaurant profit often follows an additive recipe: baseline margin on sales + uplifts (prime location, upscale concept) – penalties (ageing décor, poor demographics). OLS captures those additive effects and yields coefficients managers can sanity‑check.
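The additive recipe above can be written out explicitly. The symbols are notation introduced here for illustration only: z_ij are the standardised numeric inputs (YearsOpen plus P1 … P37, so 38 in total), C is the set of one-hot category levels, and the β and γ terms are the coefficients OLS estimates.

```latex
\[
\widehat{\text{profit}}_i
  = \beta_0
  + \sum_{j=1}^{38} \beta_j \, z_{ij}
  + \sum_{c \in \mathcal{C}} \gamma_c \, \mathbb{1}\!\left[\text{restaurant } i \text{ belongs to level } c\right]
\]
```

Each term contributes independently of the others, which is what makes the coefficients easy to read and also what the model cannot capture when effects interact.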
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load data & quick look
df = pd.read_csv("restaurant_revenue_train.csv") # file from Kaggle
print(df.head())
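Beyond `head()`, a few cheap health checks are worth running before modelling. The sketch below is one possible helper, not part of the original walkthrough; `quick_look` is a hypothetical name and the toy frame uses made-up values standing in for the Kaggle file.

```python
import pandas as pd

def quick_look(df: pd.DataFrame) -> dict:
    """Return a few basic health checks for a freshly loaded frame."""
    return {
        "n_rows": len(df),
        "n_cols": df.shape[1],
        "n_missing": int(df.isna().sum().sum()),         # total empty cells
        "n_duplicate_rows": int(df.duplicated().sum()),  # exact repeats
    }

# Toy stand-in for the Kaggle file (values are invented for illustration)
toy = pd.DataFrame({
    "Open Date": ["07/17/1999", "02/14/2008", "02/14/2008"],
    "City Group": ["Big Cities", "Other", "Other"],
    "Type": ["IL", "FC", "FC"],
    "revenue": [None, 4_000_000.0, 4_000_000.0],
})

summary = quick_look(toy)
print(summary)
```

On the real file, the same call would be `quick_look(df)`; a non-zero missing count or duplicate count is a cue to clean before fitting.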
Key columns in the file
| column | description |
| --- | --- |
| Open Date | day the unit opened |
| City | literal city name |
| City Group | Big Cities / Other |
| Type | FC (food‑court) / IL (inline) / DT (drive‑thru) / MB (mobile) |
| P1‑P37 | anonymised site metrics |
| revenue | yearly revenue (obfuscated scale) |
3. Convert revenue → profit & feature engineer age
- Profit derivation – We multiply the revenue target by an industry‑average 15 % net margin. If actual margin data becomes available per unit, swap in that column and retrain; the pipeline stays identical.
- Standard scaling (applied later inside the pipeline) puts P‑features and age on comparable variance, so their coefficients read as dollars of profit per 1 σ change.
NET_MARGIN = 0.15 # industry‑average after‑tax margin
df['profit'] = df['revenue'] * NET_MARGIN
# years open as of dataset snapshot (assume snapshot on 2025‑01‑01)
snapshot = datetime(2025, 1, 1)
df['Open Date'] = pd.to_datetime(df['Open Date'])
df['YearsOpen'] = (snapshot - df['Open Date']).dt.days / 365.25
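The snapshot date is an assumption, so it is fair to ask how much it matters. Moving the snapshot shifts every restaurant's age by the same constant, and standard scaling removes constant shifts. The sketch below checks this on invented dates; `years_open` is a hypothetical helper, and the `%m/%d/%Y` date format is assumed rather than confirmed from the file.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def years_open(open_dates: pd.Series, snapshot: str) -> pd.Series:
    """Age in years at the given snapshot date."""
    opened = pd.to_datetime(open_dates, format="%m/%d/%Y")  # assumed format
    return (pd.Timestamp(snapshot) - opened).dt.days / 365.25

# Invented opening dates for illustration
toy_dates = pd.Series(["01/01/2015", "06/15/2010", "03/01/2000"])

age_2025 = years_open(toy_dates, "2025-01-01")
age_2016 = years_open(toy_dates, "2016-01-01")

# z-score each version of the age column separately
z_2025 = StandardScaler().fit_transform(age_2025.to_frame())
z_2016 = StandardScaler().fit_transform(age_2016.to_frame())

print(age_2025.round(2).tolist())
print(np.allclose(z_2025, z_2016))
```

The raw ages differ between the two snapshots, but the standardised column that reaches the regression is the same. What does matter is using the same snapshot for new restaurants at prediction time as was used in training.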
4. Define predictors & target
YearsOpen translates the Open Date string into a numeric age. Older outlets often earn higher profits due to brand equity, but may also suffer from cost creep; the coefficient reveals which side prevails.
num_cols = ['YearsOpen'] + [f'P{i}' for i in range(1, 38)]
cat_cols = ['City Group', 'Type', 'City']
target = 'profit'
X = df[num_cols + cat_cols]
y = df[target]
5. Pre‑processing & regression pipeline
One-hot encoding allows each city group or concept type to carry its own fixed profit offset without imposing a “distance” between categories.
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
linreg = LinearRegression()
pipe = Pipeline([
    ('prep', preproc),
    ('model', linreg)
])
6. Train‑test split & training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)
pipe.fit(X_train, y_train)
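When the training file has only a few hundred rows or fewer, a single 80/20 split can give a noisy error estimate. K-fold cross-validation is a common complement, sketched here on synthetic data with a stand-in pipeline; the toy arrays and `toy_pipe` are illustrative names, not part of the walkthrough.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 120 rows, 5 numeric features, linear signal plus noise
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(120, 5))
y_toy = X_toy @ np.array([3.0, -2.0, 0.0, 1.0, 0.5]) + rng.normal(size=120)

toy_pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", LinearRegression()),
])

# scikit-learn reports MAE as a negative score, so flip the sign
cv = KFold(n_splits=5, shuffle=True, random_state=42)
mae_per_fold = -cross_val_score(
    toy_pipe, X_toy, y_toy, cv=cv, scoring="neg_mean_absolute_error")

print(f"MAE per fold: {np.round(mae_per_fold, 2)}")
print(f"Mean ± std  : {mae_per_fold.mean():.2f} ± {mae_per_fold.std():.2f}")
```

With the real data the call would take `pipe`, `X` and `y` in place of the toy objects; the spread across folds indicates how much a single split could mislead.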
7. Evaluation
Performance metrics – R² shows the share of profit variation our simple recipe explains; MAE in dollars gives finance teams a tangible error band (e.g., ±$38 k).
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : ${mean_absolute_error(y_test, y_pred):,.0f}")
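One consequence of deriving profit as revenue times a constant is worth keeping in mind when reading these numbers: R² is the same as it would be for a revenue model, and MAE is simply the revenue MAE multiplied by the margin. The sketch below verifies this on synthetic figures (all values invented).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Synthetic revenue driven by three features plus noise
rng = np.random.default_rng(1)
X_toy = rng.normal(size=(200, 3))
revenue = (4_000_000
           + X_toy @ np.array([500_000.0, -200_000.0, 100_000.0])
           + rng.normal(scale=300_000, size=200))

NET_MARGIN = 0.15
profit = revenue * NET_MARGIN  # same constant-margin assumption as above

rev_pred = LinearRegression().fit(X_toy, revenue).predict(X_toy)
prof_pred = LinearRegression().fit(X_toy, profit).predict(X_toy)

r2_rev, r2_prof = r2_score(revenue, rev_pred), r2_score(profit, prof_pred)
mae_rev = mean_absolute_error(revenue, rev_pred)
mae_prof = mean_absolute_error(profit, prof_pred)

print(f"R² revenue vs profit : {r2_rev:.4f} vs {r2_prof:.4f}")
print(f"MAE profit / MAE rev : {mae_prof / mae_rev:.4f}")
```

So the reported R² measures how well the features explain revenue; the profit framing adds information only once per-unit margins replace the constant.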
8. Understand profit drivers
The coefficient table quickly spotlights profit levers: a large positive bump for Type_FC might confirm that food courts are cash cows; a negative weight on City Group_Other quantifies the drag of small-town sites.
# recover encoded feature names
ohe_feats = (pipe.named_steps['prep']
             .named_transformers_['cat']
             .get_feature_names_out(cat_cols))
all_feats = list(ohe_feats) + num_cols
coef = pd.Series(pipe.named_steps['model'].coef_, index=all_feats)\
         .sort_values()
print("\nNegative‑impact factors (lower profit):")
print(coef.head(8))
print("\nPositive‑impact factors (raise profit):")
print(coef.tail(8))
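Because numeric inputs are z‑scored, coefficients on YearsOpen or P‑features read as “profit change for a one‑σ shift.” One-hot coefficients are dollar offsets per category. Because the encoder in step 5 keeps every level, they are not measured against a single reference level; they are easier to interpret as differences between categories of the same feature than as standalone values.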
9. Persist the trained pipeline
The saved joblib file is ready for your BI tool or quoting web form: call joblib.load(…), pass a new restaurant’s spec as a one-row pandas DataFrame with the same columns used in training, and get a profit forecast back in milliseconds.
joblib.dump(pipe, "restaurant_profit_linreg.pkl")
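A round trip through disk might look like the sketch below. To stay self-contained it trains a two-feature toy pipeline on invented numbers and writes to a temporary directory; the file name `toy_profit_linreg.pkl` and the toy columns are assumptions for illustration, and the real pipeline would need all of its training columns in the new row.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Invented training rows
toy = pd.DataFrame({
    "YearsOpen": [2.0, 5.0, 9.0, 14.0],
    "Type": ["FC", "IL", "FC", "DT"],
    "profit": [400_000.0, 520_000.0, 610_000.0, 700_000.0],
})
toy_pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Type"]),
        ("num", StandardScaler(), ["YearsOpen"]),
    ])),
    ("model", LinearRegression()),
])
toy_pipe.fit(toy[["YearsOpen", "Type"]], toy["profit"])

# Persist, then reload as a serving process would
path = os.path.join(tempfile.mkdtemp(), "toy_profit_linreg.pkl")
joblib.dump(toy_pipe, path)
loaded = joblib.load(path)

# One new restaurant, expressed with the training column names
new_site = pd.DataFrame([{"YearsOpen": 3.5, "Type": "IL"}])
forecast = loaded.predict(new_site)
print(f"Forecast profit: ${forecast[0]:,.0f}")
```

Pickled pipelines are tied to the library versions that produced them, so it is safest to load the file with the same scikit-learn release used for training.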
Summary
With a few dozen lines of Python, we turned an open revenue dataset into an explainable restaurant‑profit forecaster:
- Instant, data-backed profit projections assist in site selection, loan sizing, and concept tweaks.
- Transparent marginal effects reveal exactly how age, concept type, city size, and site metrics influence the bottom line.
Keep this transparent baseline in your toolbox; every gradient‑boosted tree, Bayesian network, or causal‑impact model you test next must beat its mean‑absolute‑error and still make business sense to the CFO.