Library Usage Prediction using Linear Regression in ML


City librarians want to know how many visitors (or check‑outs) to expect tomorrow so they can schedule staff, open the right service desks, and time community events. A quick, transparent model also helps grant writers demonstrate why extended‑hour funding is (or isn’t) warranted.

We will build a linear‑regression baseline that predicts a branch’s daily foot‑traffic count (VisitorsNextDay) from signals available the moment today’s doors close:

yesterday’s visitors and check‑outs
day‑of‑week and month
whether tomorrow is a school day or a public holiday
branch neighbourhood type (downtown / suburban / rural)
forecast temperature and rainfall (optional, merged from a public feed)

Although modern libraries may graduate to time‑series models or gradient‑boosted trees, a straight‑line fit surfaces first‑order drivers and serves as the benchmark any fancier model must beat.

Libraries Required

pandas # tabular wrangling
numpy # numerical helpers
matplotlib.pyplot # optional quick plots
scikit‑learn # preprocessing, model, metrics
joblib # persist the trained pipeline

Dataset Link

Daily checkout records

Step by Step Code Implementation

Why linear regression? Tomorrow’s library visits often resemble yesterday’s, with additional bumps for weekends, holidays, and downtown locations. Ordinary least squares (OLS) captures that additive logic and yields coefficients that are easy to discuss in staff meetings.
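Written out, the additive logic looks roughly like the following. The symbols are notation introduced here for illustration: the β and γ terms are the fitted coefficients, and z stands for the remaining calendar and branch features in whatever encoding the pipeline uses.

```latex
\widehat{\text{Visitors}}_{t+1}
  = \beta_0
  + \beta_1\,\text{VisitorsPrev}_t
  + \beta_2\,\text{CheckoutsPrev}_t
  + \sum_{k} \gamma_k\, z_{k,t},
\qquad
(\hat\beta,\hat\gamma)
  = \arg\min_{\beta,\gamma}\sum_{t}\bigl(\text{Visitors}_{t+1}-\widehat{\text{Visitors}}_{t+1}\bigr)^2
```

Each term contributes independently to the forecast, which is what makes the individual coefficients readable on their own.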

1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_absolute_error
import joblib

2. Load & glimpse the data

We’ll use the popular Seattle Public Library Check‑outs dataset and aggregate it to one row per branch‑per‑day:

# raw file: 2017‑2023 SPL check‑outs, each row = one item checked out
raw = pd.read_csv("Checkouts_by_Title_Detail.csv",
                  usecols=['CheckoutDateTime','Branch','UsageClass'])

# convert to midnight‑normalised timestamps and branch‑daily counts
raw['date']   = pd.to_datetime(raw['CheckoutDateTime']).dt.normalize()
daily_count   = (raw.groupby(['Branch','date'])
                     .size()
                     .reset_index(name='Checkouts'))

# load foot‑traffic file exported by kiosks (simplified for demo)
# parse 'date' so both merge keys share the same datetime64 dtype
traffic = pd.read_csv("SPL_FootTraffic.csv",          # Branch, date, Visitors
                      parse_dates=['date'])

df = traffic.merge(daily_count, on=['Branch','date'], how='left')
df['Checkouts'] = df['Checkouts'].fillna(0)           # days with 0 loans
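A left merge silently produces all‑zero check‑outs if the two date keys do not match, so it is worth checking how many rows actually found a partner. The sketch below uses made‑up toy frames (the branch names and counts are illustrative, not taken from the real files) and pandas' `indicator=True` option to flag unmatched rows.

```python
import pandas as pd

# Toy stand-ins for the two sources (values are made up for illustration).
checkouts = pd.DataFrame({
    "Branch": ["Central", "Central", "Ballard"],
    "CheckoutDateTime": ["2023-03-01 10:15", "2023-03-01 16:40", "2023-03-01 11:05"],
})
traffic = pd.DataFrame({
    "Branch": ["Central", "Ballard", "Ballard"],
    "date": ["2023-03-01", "2023-03-01", "2023-03-02"],
    "Visitors": [2300, 640, 610],
})

# Put both date keys on the same dtype (midnight-normalised timestamps).
checkouts["date"] = pd.to_datetime(checkouts["CheckoutDateTime"]).dt.normalize()
traffic["date"] = pd.to_datetime(traffic["date"])

# One row per branch per day, counting check-out events.
daily = checkouts.groupby(["Branch", "date"]).size().reset_index(name="Checkouts")

# indicator=True adds a '_merge' column: 'both' or 'left_only'.
merged = traffic.merge(daily, on=["Branch", "date"], how="left", indicator=True)
merged["Checkouts"] = merged["Checkouts"].fillna(0).astype(int)

# Branch-days with visitors recorded but no check-out rows at all.
unmatched = int((merged["_merge"] == "left_only").sum())
print(merged[["Branch", "date", "Visitors", "Checkouts", "_merge"]])
print("branch-days without check-out rows:", unmatched)
```

If nearly every row comes back as `left_only`, the keys probably differ in dtype or spelling rather than the branch truly having no loans.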

3. Feature engineering

Lag features (VisitorsPrev, CheckoutsPrev) anchor the model in recent reality; the categorical dummies explain deviations.

df['date']       = pd.to_datetime(df['date'])
df['dow']        = df['date'].dt.dayofweek            # 0‑Mon … 6‑Sun
df['month']      = df['date'].dt.month
df['is_holiday'] = df['HolidayFlag']                  # pre‑merged calendar
df['branch_type'] = np.where(df['Branch'].str.contains('Downtown'),
                             'downtown', 'suburban')

# yesterday’s visitors / check‑outs
df = df.sort_values(['Branch','date'])
df['VisitorsPrev']  = df.groupby('Branch')['Visitors'].shift(1)
df['CheckoutsPrev'] = df.groupby('Branch')['Checkouts'].shift(1)

# drop first day rows without lag values
df = df.dropna(subset=['VisitorsPrev','CheckoutsPrev'])

4. Define predictors & label

Interpretability – If, hypothetically, is_holiday=True adds 230 visitors while branch_type_suburban subtracts 120, planners can see at a glance why holiday staffing at the downtown hub matters.

num_cols = ['VisitorsPrev', 'CheckoutsPrev']
# calendar codes are labels, not magnitudes, so they are one‑hot encoded
cat_cols = ['dow', 'month', 'is_holiday', 'branch_type']
target   = 'VisitorsNextDay'    # tomorrow's visitors, created by the shift below

# shift label one step backward so X(t) predicts Visitors(t+1);
# note VisitorsPrev(t) is Visitors(t-1), i.e. two days before the label
df[target] = df.groupby('Branch')['Visitors'].shift(-1)
df = df.dropna(subset=[target])

X = df[num_cols + cat_cols]
y = df[target]
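The shifts work row by row, so they only line up with calendar days when each branch has one row for every consecutive date. A minimal check on made‑up toy data is sketched below; the helper columns `next_date` and `next_is_tomorrow` are hypothetical names introduced for this illustration.

```python
import pandas as pd

# Toy data: branch A is missing 2023-01-05 (e.g. a closure day).
toy = pd.DataFrame({
    "Branch": ["A"] * 4 + ["B"] * 3,
    "date": pd.to_datetime(["2023-01-02", "2023-01-03", "2023-01-04", "2023-01-06",
                            "2023-01-02", "2023-01-03", "2023-01-04"]),
    "Visitors": [100, 110, 120, 90, 40, 45, 50],
}).sort_values(["Branch", "date"])

# Lag feature and next-day label, shifted within each branch.
toy["VisitorsPrev"] = toy.groupby("Branch")["Visitors"].shift(1)
toy["VisitorsNextDay"] = toy.groupby("Branch")["Visitors"].shift(-1)

# Date of the row the label was taken from, and whether it is really tomorrow.
toy["next_date"] = toy.groupby("Branch")["date"].shift(-1)
toy["next_is_tomorrow"] = (toy["next_date"] - toy["date"]) == pd.Timedelta(days=1)

# Keep rows that have both a lag and a label, and whose label is one day ahead.
aligned = toy.dropna(subset=["VisitorsPrev", "VisitorsNextDay"])
aligned = aligned[aligned["next_is_tomorrow"]]
print(aligned[["Branch", "date", "VisitorsPrev", "VisitorsNextDay"]])
```

Here the 2023‑01‑04 row of branch A is discarded because its "next day" label would actually come from two days later.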

5. Pre‑processing & pipeline

Standard scaling puts visitor counts and check‑outs on equal footing, so coefficients read as visitors per 1 σ change.

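preproc = ColumnTransformer([
        # drop='first' makes the first level of each category the reference
        # (combining it with handle_unknown='ignore' needs scikit‑learn >= 1.0)
        ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'), cat_cols),
        ('num', StandardScaler(),                                    num_cols)
])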

lin = LinearRegression()

pipe = Pipeline([
        ('prep',  preproc),
        ('model', lin)
])

6. Train‑test split & training

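Because each branch‑daily row already carries its own lagged inputs, a random split is used here for a quick baseline. Shuffling does let the model train on days that come after some test days, so treat the resulting score as optimistic; holding out the most recent dates is the stricter check.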

X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, shuffle=True)

pipe.fit(X_train, y_train)
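For comparison, a chronological hold‑out can be sketched as follows. This is an illustration on synthetic data, not the article's dataset: the frame `demo`, its coefficients, and the 80/20 cut‑off are all assumptions chosen for the example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

# Synthetic single-branch series (made-up numbers).
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "date": pd.date_range("2023-01-01", periods=200, freq="D"),
    "VisitorsPrev": rng.normal(2000, 200, size=200),
})
demo["VisitorsNextDay"] = 400 + 0.8 * demo["VisitorsPrev"] + rng.normal(0, 50, size=200)

# Pick the date that separates the earliest 80 % of days from the latest 20 %.
unique_days = demo["date"].drop_duplicates().sort_values().reset_index(drop=True)
cutoff = unique_days.iloc[int(len(unique_days) * 0.8)]

# Train strictly on the past, test strictly on the future.
train = demo[demo["date"] < cutoff]
test = demo[demo["date"] >= cutoff]

model = LinearRegression().fit(train[["VisitorsPrev"]], train["VisitorsNextDay"])
mae = mean_absolute_error(test["VisitorsNextDay"], model.predict(test[["VisitorsPrev"]]))
print(f"train days: {len(train)}, test days: {len(test)}, MAE: {mae:.1f}")
```

Splitting on unique dates rather than on row positions keeps all branches' rows for a given day on the same side of the cut.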

7. Evaluation

Performance metrics – MAE in people is tangible (for example, “we’re off by about 90 visitors on a base of 2,300”; these are illustrative figures, not results from this dataset). That is often good enough for scheduling before reaching for LSTM time‑series networks.

y_pred = pipe.predict(X_test)
print(f"R²  : {r2_score(y_test, y_pred):.3f}")
print(f"MAE : {mean_absolute_error(y_test, y_pred):.1f} visitors")
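An MAE is easier to judge next to a naive forecast that simply repeats the last observed count. The sketch below uses made‑up arrays to show the comparison; in the pipeline above, a lagged column such as VisitorsPrev could play the role of the naive forecast, but that pairing is a suggestion rather than something the original code does.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up numbers for illustration only.
last_observed = np.array([2300, 2250, 2400, 1800, 1500])   # naive forecast
actual_next   = np.array([2250, 2400, 1800, 1500, 2350])   # what really happened
model_pred    = np.array([2280, 2330, 1900, 1600, 2200])   # hypothetical model output

naive_mae = mean_absolute_error(actual_next, last_observed)
model_mae = mean_absolute_error(actual_next, model_pred)

# Share of the naive error that the model removes (1.0 = perfect, 0.0 = no gain).
skill = 1 - model_mae / naive_mae

print(f"naive MAE: {naive_mae:.1f}  model MAE: {model_mae:.1f}  skill: {skill:.2f}")
```

A skill near zero or below would mean the regression adds little over repeating the previous count.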

8. Interpret usage drivers

ohe_names = (pipe.named_steps['prep']
                  .named_transformers_['cat']
                  .get_feature_names_out(cat_cols))

all_feats = list(ohe_names) + num_cols
coef      = pd.Series(pipe.named_steps['model'].coef_, index=all_feats)\
              .sort_values()

print("\nFactors that LOWER next‑day traffic:")
print(coef.head(5))

print("\nFactors that RAISE next‑day traffic:")
print(coef.tail(5))

Because numeric predictors are z‑scored, each numeric coefficient reads as visitor change for a 1 σ shift; categorical one‑hots are offsets versus the reference level.
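To restate a per‑σ coefficient in original units (visitors per extra visitor yesterday, say), divide it by the scaler's fitted standard deviation. The sketch below demonstrates this on synthetic, noise‑free data with a known slope of 0.6; the step names `scale` and `model` belong to this toy pipeline, while in the article's pipeline the fitted scaler would sit at `pipe.named_steps['prep'].named_transformers_['num']`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with a known raw slope of 0.6 (no noise, for clarity).
rng = np.random.default_rng(1)
x = rng.normal(2000, 250, size=(500, 1))
y = 300 + 0.6 * x[:, 0]

demo = Pipeline([("scale", StandardScaler()),
                 ("model", LinearRegression())]).fit(x, y)

coef_per_sigma = demo.named_steps["model"].coef_[0]   # change in y per 1 sigma of x
sigma = demo.named_steps["scale"].scale_[0]           # fitted standard deviation of x
coef_per_unit = coef_per_sigma / sigma                # change in y per 1 unit of x

print(f"per sigma: {coef_per_sigma:.2f}, sigma: {sigma:.1f}, per unit: {coef_per_unit:.3f}")
```

Both views are useful: the per‑σ figure compares drivers with different units, and the per‑unit figure answers "what does one more visitor yesterday imply for tomorrow?".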

9. Persist the model

joblib.dump(pipe, "library_usage_linreg.pkl")
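Reloading the saved pipeline and scoring a new day might look like the sketch below. It trains a small stand‑in pipeline on made‑up rows so that the example runs on its own; the two‑column feature set and the temporary file location are assumptions of this illustration, and a real call would pass every column the saved pipeline was fitted on.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows for a stand-in pipeline.
train = pd.DataFrame({
    "VisitorsPrev": [2000, 2100, 500, 520, 2200, 480],
    "branch_type": ["downtown", "downtown", "suburban",
                    "suburban", "downtown", "suburban"],
    "VisitorsNextDay": [2050, 2150, 510, 530, 2180, 490],
})
prep = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["branch_type"]),
    ("num", StandardScaler(), ["VisitorsPrev"]),
])
stand_in = Pipeline([("prep", prep), ("model", LinearRegression())])
stand_in.fit(train[["VisitorsPrev", "branch_type"]], train["VisitorsNextDay"])

# Save and reload, as a scheduling script would do on another machine.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "library_usage_linreg.pkl")
    joblib.dump(stand_in, path)
    loaded = joblib.load(path)

# One new row with the same column names used at fit time.
new_day = pd.DataFrame({"VisitorsPrev": [2120], "branch_type": ["downtown"]})
forecast = loaded.predict(new_day)[0]
print(f"forecast visitors: {forecast:.0f}")
```

Because preprocessing lives inside the pipeline, the caller passes raw columns and never has to re‑create the encoder or scaler by hand. Pickled pipelines should be loaded with the same scikit‑learn version that wrote them, and only from trusted sources.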

Summary

With roughly a hundred lines of code, we transformed raw check‑out logs and entry‑gate counts into an explainable next‑day library‑usage predictor. The linear model:

gives immediate visitor forecasts that branch managers can act on,
shows clear drivers (weekends, holidays, downtown districts) that raise or lower demand, and
provides a benchmark MAE that any future, more exotic model must beat.

Feel free to extend the pipeline with weather forecasts, school calendars, or social‑media event mentions, and compare their incremental value against this sturdy linear baseline.
