Library Usage Prediction using Linear Regression in ML
City librarians want to know how many visitors (or check‑outs) to expect tomorrow so they can schedule staff, open the right service desks, and time community events. A quick, transparent model also helps grant writers demonstrate why extended‑hour funding is (or isn’t) warranted.
We will build a linear‑regression baseline that predicts a branch’s daily foot‑traffic count (VisitorsNextDay) from signals available the moment today’s doors close:
- yesterday’s visitors and check‑outs
- day‑of‑week and month
- whether tomorrow is a school day or a public holiday
- branch neighbourhood type (downtown / suburban / rural)
- forecast temperature and rainfall (optional, merged from a public feed)
Although modern libraries may graduate to time‑series models or gradient‑boosted trees, a straight‑line fit surfaces first‑order drivers and serves as the benchmark any fancier model must beat.
Libraries Required
- pandas # tabular wrangling
- numpy # numerical helpers
- matplotlib.pyplot # optional quick plots
- scikit‑learn # preprocessing, model, metrics
- joblib # persist the trained pipeline
Dataset Link
Step by Step Code Implementation
Why linear regression? Tomorrow’s library visits often resemble yesterday’s, with additional bumps for weekends, holidays, and downtown locations. OLS captures that additive logic and yields coefficients that are easy to discuss in staff meetings.
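To make the additive logic concrete, here is a minimal sketch on synthetic data. The carry‑over rate, weekend bump, and noise level are made‑up assumptions, not values from the Seattle data; the point is only that ordinary least squares recovers an additive rule when one exists.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic, made-up inputs: yesterday's visitors and a weekend flag
visitors_prev = rng.normal(2000, 300, n)
is_weekend = rng.integers(0, 2, n)

# Assumed "true" additive rule: base 400, 80 % carry-over, +250 on weekends
y = 400 + 0.8 * visitors_prev + 250 * is_weekend + rng.normal(0, 50, n)

# Design matrix with an intercept column; OLS is the least-squares solution
A = np.column_stack([np.ones(n), visitors_prev, is_weekend])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

intercept, carry_over, weekend_bump = beta
print(f"carry-over per visitor: {carry_over:.2f}")
print(f"weekend bump: {weekend_bump:.0f} visitors")
```

Each fitted number maps to one sentence a planner can say out loud, which is the main reason to start with a straight‑line fit.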
1. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib
2. Load & glimpse the data
We’ll use the popular Seattle Public Library Check‑outs dataset and aggregate it to one row per branch‑per‑day:
# raw file: 2017‑2023 SPL check‑outs, each row = one item checked out
raw = pd.read_csv("Checkouts_by_Title_Detail.csv",
                  usecols=['CheckoutDateTime','Branch','UsageClass'])
# convert to date and branch‑daily counts
raw['date'] = pd.to_datetime(raw['CheckoutDateTime']).dt.date
daily_count = (raw.groupby(['Branch','date'])
                  .size()
                  .reset_index(name='Checkouts'))
# load foot‑traffic file exported by kiosks (simplified for demo)
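traffic = pd.read_csv("SPL_FootTraffic.csv") # Branch, date, Visitors, HolidayFlag
# parse the date strings so the merge key has the same type as daily_count['date']
traffic['date'] = pd.to_datetime(traffic['date']).dt.date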
df = traffic.merge(daily_count, on=['Branch','date'], how='left')
df['Checkouts'] = df['Checkouts'].fillna(0) # days with 0 loans
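A left merge silently produces all‑zero check‑outs if the join keys do not match (for instance, date strings on one side and date objects on the other). The sketch below uses two tiny hypothetical frames, with illustrative branch names and counts, to show one way of checking how many rows actually found a partner.

```python
import pandas as pd

# Hypothetical miniature versions of the two inputs
raw = pd.DataFrame({
    "CheckoutDateTime": ["2023-01-02 10:15", "2023-01-02 16:40", "2023-01-03 09:05"],
    "Branch": ["Downtown", "Downtown", "Downtown"],
})
traffic = pd.DataFrame({
    "Branch": ["Downtown", "Downtown", "Ballard"],
    "date": ["2023-01-02", "2023-01-03", "2023-01-02"],
    "Visitors": [2300, 2100, 640],
})

# Give the join key the same dtype on both sides (midnight timestamps here)
raw["date"] = pd.to_datetime(raw["CheckoutDateTime"]).dt.normalize()
traffic["date"] = pd.to_datetime(traffic["date"])

daily_count = raw.groupby(["Branch", "date"]).size().reset_index(name="Checkouts")

# indicator=True adds a _merge column: "both" or "left_only"
df = traffic.merge(daily_count, on=["Branch", "date"], how="left", indicator=True)
df["Checkouts"] = df["Checkouts"].fillna(0).astype(int)

print(df)
print(df["_merge"].value_counts())
```

If nearly every row comes back as left_only, the keys are mismatched and the zeros are an artifact rather than genuinely quiet days.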
3. Feature engineering
Lag features (VisitorsPrev, CheckoutsPrev) anchor the model in recent reality; the categorical dummies explain deviations.
df['date'] = pd.to_datetime(df['date'])
df['dow'] = df['date'].dt.dayofweek # 0‑Mon … 6‑Sun
df['month'] = df['date'].dt.month
df['is_holiday'] = df['HolidayFlag'] # pre‑merged calendar
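# demo rule: every non-downtown branch is labelled suburban;
# a rural category would need a branch lookup table
df['branch_type'] = np.where(df['Branch'].str.contains('Downtown'),
                             'downtown', 'suburban')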
# yesterday’s visitors / check‑outs
df = df.sort_values(['Branch','date'])
df['VisitorsPrev'] = df.groupby('Branch')['Visitors'].shift(1)
df['CheckoutsPrev'] = df.groupby('Branch')['Checkouts'].shift(1)
# drop first day rows without lag values
df = df.dropna(subset=['VisitorsPrev','CheckoutsPrev'])
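The per‑branch lag is easy to get wrong, so here is a small sketch on a toy frame (branches A and B and their counts are invented). It also adds a hypothetical gap_days column, because shift(1) returns the previous row, which is the previous calendar day only if no dates are missing.

```python
import pandas as pd

toy = pd.DataFrame({
    "Branch": ["A", "B", "A", "B", "A", "B"],
    "date": pd.to_datetime(["2023-01-02", "2023-01-02", "2023-01-03",
                            "2023-01-03", "2023-01-04", "2023-01-04"]),
    "Visitors": [100, 10, 110, 11, 120, 12],
})

# Sort first: shift() works on row order within each group
toy = toy.sort_values(["Branch", "date"])
toy["VisitorsPrev"] = toy.groupby("Branch")["Visitors"].shift(1)

# Days between a row and the row its lag came from (NaN on each first row)
toy["gap_days"] = toy.groupby("Branch")["date"].diff().dt.days

print(toy)
```

The first row of each branch has no lag, and branch B never inherits a value from branch A. Rows with gap_days greater than 1 would carry a stale lag and may deserve separate handling.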
4. Define predictors & label
Interpretability – For example, if is_holiday=True adds +230 visitors while branch_type_suburban subtracts 120, planners can see at a glance why holiday staffing at the downtown hub matters.
num_cols = ['VisitorsPrev', 'CheckoutsPrev']
# day-of-week and month are labels, not magnitudes, so they are one-hot encoded
cat_cols = ['dow', 'month', 'is_holiday', 'branch_type']
target = 'VisitorsNextDay' # tomorrow's visitors
# shift the label one step back so that row t carries Visitors(t+1)
df[target] = df.groupby('Branch')['Visitors'].shift(-1)
# the last day of each branch has no next-day label
df = df.dropna(subset=[target])
X = df[num_cols + cat_cols]
y = df[target]
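A quick way to confirm that predictors and label line up is to print them side by side for one branch. This is a sketch on an invented four‑day series, not the real data.

```python
import pandas as pd

toy = pd.DataFrame({
    "Branch": ["A"] * 4,
    "date": pd.date_range("2023-01-02", periods=4, freq="D"),
    "Visitors": [100, 110, 120, 130],
})

toy["VisitorsPrev"] = toy.groupby("Branch")["Visitors"].shift(1)      # V(t-1)
toy["VisitorsNextDay"] = toy.groupby("Branch")["Visitors"].shift(-1)  # V(t+1)

# First row lacks a lag, last row lacks a label; both are dropped
aligned = toy.dropna(subset=["VisitorsPrev", "VisitorsNextDay"])
print(aligned[["date", "VisitorsPrev", "Visitors", "VisitorsNextDay"]])
```

Note that in every surviving row VisitorsPrev is two days older than the label. If today's count is known at closing time, adding Visitors (and Checkouts) as predictors would be one way to close that gap.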
5. Pre‑processing & pipeline
Standard scaling puts visitor counts and check‑outs on equal footing, so coefficients read as visitors per 1 σ change.
preproc = ColumnTransformer([
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
    ('num', StandardScaler(), num_cols)
])
lin = LinearRegression()
pipe = Pipeline([
    ('prep', preproc),
    ('model', lin)
])
6. Train‑test split & training
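For a quick baseline we shuffle the branch‑daily rows and split them at random. Because neighbouring days resemble each other, a random split lets the model train on days adjacent to each test day, so treat the resulting scores as optimistic compared with a chronological hold‑out:
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True)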
pipe.fit(X_train, y_train)
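As an alternative to a shuffled split, a chronological hold‑out keeps every test day strictly after every training day, which is closer to how the forecast is used in practice. A minimal sketch on a synthetic 100‑day frame (column values are random placeholders):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range("2023-01-01", periods=100, freq="D")
toy = pd.DataFrame({
    "date": dates,
    "VisitorsPrev": rng.normal(2000, 200, len(dates)),
    "VisitorsNextDay": rng.normal(2000, 200, len(dates)),
})

# Hold out the most recent 20 % of calendar days
unique_days = toy["date"].drop_duplicates().sort_values().reset_index(drop=True)
cutoff = unique_days.iloc[int(len(unique_days) * 0.8)]

train = toy[toy["date"] < cutoff]
test = toy[toy["date"] >= cutoff]

print(f"train: {len(train)} rows up to {train['date'].max().date()}")
print(f"test : {len(test)} rows from {test['date'].min().date()}")
```

Splitting on calendar days rather than on row positions keeps all branches for a given day on the same side of the cutoff.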
7. Evaluation
Performance metrics – MAE in people is tangible (for example, “we’re off by about 90 visitors on a base of 2,300”). That’s often good enough for scheduling before diving into LSTM time‑series nets.
y_pred = pipe.predict(X_test)
print(f"R² : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.1f} visitors")
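An MAE is easier to judge next to a naive reference. One common reference, sketched here on synthetic data with made‑up coefficients, is a persistence forecast that simply repeats the lagged visitor count.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(2)
n = 400

# Synthetic predictors and an assumed additive rule for next-day visitors
visitors_prev = rng.normal(2000, 300, n)
is_holiday = rng.integers(0, 2, n)
y = 500 + 0.75 * visitors_prev + 200 * is_holiday + rng.normal(0, 60, n)

X = np.column_stack([visitors_prev, is_holiday])
X_train, X_test = X[:300], X[300:]
y_train, y_test = y[:300], y[300:]

model = LinearRegression().fit(X_train, y_train)
mae_model = mean_absolute_error(y_test, model.predict(X_test))

# Persistence baseline: forecast = the lagged visitor count, no fitting at all
mae_naive = mean_absolute_error(y_test, X_test[:, 0])

print(f"linear model MAE: {mae_model:.1f} visitors")
print(f"persistence MAE : {mae_naive:.1f} visitors")
```

If the fitted model cannot beat persistence on real data, the extra features are not yet earning their keep.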
8. Interpret usage drivers
ohe_names = (pipe.named_steps['prep']
                 .named_transformers_['cat']
                 .get_feature_names_out(cat_cols))
all_feats = list(ohe_names) + num_cols
coef = (pd.Series(pipe.named_steps['model'].coef_, index=all_feats)
          .sort_values())
print("\nFactors that LOWER next‑day traffic:")
print(coef.head(5))
print("\nFactors that RAISE next‑day traffic:")
print(coef.tail(5))
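Because numeric predictors are z‑scored, each numeric coefficient reads as the change in predicted visitors for a 1 σ shift in that predictor. The encoder above keeps every category level, so one‑hot coefficients are best read as differences between levels (for example, downtown minus suburban) rather than as offsets from a dropped reference level.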
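Staff may prefer "visitors per extra visitor yesterday" over "visitors per 1 σ". Dividing a standardized coefficient by the scaler's stored standard deviation converts it back to raw units. A sketch with one synthetic predictor and assumed values:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
visitors_prev = rng.normal(2000, 300, 300)
y = 400 + 0.8 * visitors_prev + rng.normal(0, 40, 300)
X = visitors_prev.reshape(-1, 1)

pipe = Pipeline([("scale", StandardScaler()), ("model", LinearRegression())])
pipe.fit(X, y)

coef_per_sigma = pipe.named_steps["model"].coef_[0]  # visitors per 1 sigma
sigma = pipe.named_steps["scale"].scale_[0]          # sigma in raw visitors
coef_per_visitor = coef_per_sigma / sigma            # visitors per visitor

print(f"per 1 sigma ({sigma:.0f} visitors): {coef_per_sigma:.1f}")
print(f"per single visitor: {coef_per_visitor:.3f}")
```

The converted value matches what an unscaled regression would report, so scaling changes the units of the coefficients, not the fitted line.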
9. Persist the model
joblib.dump(pipe, "library_usage_linreg.pkl")
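Reloading the pipeline and scoring a single new row might look like the sketch below. The training rows, the temporary path, and the "rural" branch are invented for illustration; because the encoder uses handle_unknown='ignore', an unseen category is encoded as all zeros instead of raising an error.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ["VisitorsPrev", "CheckoutsPrev"]
cat_cols = ["is_holiday", "branch_type"]

# Tiny invented training set standing in for the real branch-daily table
train = pd.DataFrame({
    "VisitorsPrev": [2000.0, 2400.0, 600.0, 700.0, 2200.0, 650.0],
    "CheckoutsPrev": [900.0, 1100.0, 200.0, 260.0, 1000.0, 230.0],
    "is_holiday": [False, True, False, False, False, True],
    "branch_type": ["downtown", "downtown", "suburban",
                    "suburban", "downtown", "suburban"],
})
y_next = [2100.0, 2600.0, 640.0, 690.0, 2250.0, 800.0]

pipe = Pipeline([
    ("prep", ColumnTransformer([
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
        ("num", StandardScaler(), num_cols),
    ])),
    ("model", LinearRegression()),
]).fit(train, y_next)

# Round-trip through disk, as a scheduling script would on the next day
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "library_usage_linreg.pkl")
    joblib.dump(pipe, path)
    loaded = joblib.load(path)

# One new row with the same column names used in training
tomorrow = pd.DataFrame([{
    "VisitorsPrev": 2300.0, "CheckoutsPrev": 1050.0,
    "is_holiday": False, "branch_type": "rural",  # unseen level
}])
forecast = loaded.predict(tomorrow)[0]
print(f"forecast: {forecast:.0f} visitors")
```

Saving the whole pipeline, rather than the bare model, means the scaler statistics and category lists travel with it, so new rows only need the raw columns.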
Summary
With well under a hundred lines of code, we transformed raw check‑out logs and entry‑gate counts into an explainable next‑day library‑usage predictor. The linear model:
- gives immediate visitor forecasts that branch managers can act on,
- shows clear drivers (weekends, holidays, downtown districts) that raise or lower demand, and
- provides a benchmark MAE that any future, more exotic model must beat.
Feel free to extend the pipeline with weather forecasts, school calendars, or social‑media event mentions—and compare their incremental value against this sturdy linear baseline.