NYC Subway Foot-Traffic Forecasting

A self-hosted streaming ML pipeline for NYC subway foot-traffic forecasting

I built a Kafka and Spark Structured Streaming system that keeps historical model development separate from live simulated turnstile ingestion.

I designed and built the real-time streaming infrastructure for a subway foot-traffic forecasting system: a realistic Kafka producer, a Spark Structured Streaming pipeline into MongoDB, and live Random Forest inference for station entries and exits. On the historical side, a teammate and I analyzed approximately 13 million MTA turnstile records. The Random Forest regression models reached an RMSE of approximately 2,700 and an MAE below 4.8%, and the Random Forest classifier reached 93.36% accuracy.

Context: Personal Project

Role: Data Engineer

Team: 2 collaborators

Date: May 2025

I designed and built the Kafka producer, the Spark Structured Streaming pipeline, MongoDB persistence, the live Random Forest inference models, the automated tests, and the Docker deployment setup. A teammate and I worked together on the historical analysis and the Random Forest regression and classification models trained on the 13 million-record dataset.

13M historical MTA records | Kafka + Spark Structured Streaming | Random Forest regression and classification

Apache Kafka | Spark Structured Streaming | PySpark | MongoDB | scikit-learn | Python

Overview

The project combines two distinct workloads: analysis and model development over approximately 13 million historical MTA turnstile records, and a live self-hosted streaming pipeline using simulated turnstile events. The system applies historical traffic patterns to station-level forecasting while showing how Kafka, Spark Structured Streaming, MongoDB, and trained ML models work together in a continuously running pipeline.

Architecture

A teammate and I cleaned and analyzed approximately 13 million historical MTA turnstile records, engineered station and time-based features, and trained Random Forest regression models for station entries and exits alongside a Random Forest classifier distinguishing low from high traffic periods. For live streaming, I built a Kafka producer that weights ten real subway stations by ridership and reverses the entries-versus-exits balance between the morning and evening rush, while scaling volume down on weekends and major holidays. Kafka transports each event to a Spark Structured Streaming consumer I built, which parses it against a defined schema, tags it with an ingestion timestamp, and writes it to MongoDB and a console sink in the same micro-batch. A second streaming job I built loads the trained Random Forest models and applies them to each incoming batch, predicting entries and exits per station in near real time. Docker packages the services for deployment on a self-hosted server.
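The producer's event logic can be sketched in plain Python. This is a minimal illustration, not the project's actual producer: the station weights, holiday list, rush-hour windows, direction of the entries/exits flip, weekend scale factor, and event field names are all assumptions made for the example.

```python
import json
import random
from datetime import datetime

# Hypothetical relative ridership weights. The project uses ten real stations;
# three are shown here and the numbers are made up for illustration.
STATION_WEIGHTS = {
    "Times Sq-42 St": 5.0,
    "Grand Central-42 St": 4.0,
    "Canal St": 1.5,
}
HOLIDAYS = {(1, 1), (7, 4), (12, 25)}  # (month, day), illustrative only


def entry_share(hour: int) -> float:
    """Fraction of traffic counted as entries; flips between the rush periods."""
    if 7 <= hour < 10:   # assumed morning rush window
        return 0.7
    if 16 <= hour < 19:  # assumed evening rush window
        return 0.3
    return 0.5


def volume_scale(ts: datetime) -> float:
    """Scale volume down on weekends and on the listed holidays."""
    if ts.weekday() >= 5 or (ts.month, ts.day) in HOLIDAYS:
        return 0.6
    return 1.0


def make_event(ts: datetime, rng: random.Random, base_volume: int = 100) -> dict:
    """Build one simulated turnstile event for a ridership-weighted station."""
    stations = list(STATION_WEIGHTS)
    weights = [STATION_WEIGHTS[s] for s in stations]
    station = rng.choices(stations, weights=weights)[0]
    total = round(base_volume * volume_scale(ts) * rng.uniform(0.8, 1.2))
    entries = round(total * entry_share(ts.hour))
    return {
        "station": station,
        "timestamp": ts.isoformat(),
        "entries": entries,
        "exits": total - entries,
    }


def encode(event: dict) -> bytes:
    """Serialize an event as the UTF-8 JSON payload a Kafka client would send."""
    return json.dumps(event).encode("utf-8")
```

With a Kafka client library, the bytes returned by `encode` would be the message value published to the topic. Passing a seeded `random.Random` keeps a simulated run reproducible.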

What I Built

I personally designed and built the Kafka producer, the Spark Structured Streaming ingestion and inference path, MongoDB persistence, the live Random Forest inference models, the automated tests, the Docker packaging, and the self-hosted deployment. A teammate and I worked together on the historical analysis workflow and the Random Forest regression and classification models trained on the 13 million-record dataset.

Built a Kafka producer that weights ten real subway stations by ridership, reverses the entries-versus-exits balance between the morning and evening rush, and scales volume down on weekends and holidays.
Built the Spark Structured Streaming consumer that parses each event against a defined schema, tags it with an ingestion timestamp, and writes it to MongoDB and a console sink in the same micro-batch.
Trained Random Forest regression models on the live-streamed MongoDB data and built the streaming inference job that scores each incoming batch for station-level entries and exits in near real time.
Worked with a teammate to analyze approximately 13 million historical MTA turnstile records and train the Random Forest regression and classification models that reached the reported RMSE, MAE, and accuracy figures.
Found that station identity and hour of day accounted for nearly all of the predictive signal in the entry and exit models, while day of week contributed almost nothing.
Packaged the Kafka, Spark, and MongoDB services with Docker and validated the pipeline with automated tests.
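The consumer's per-batch behavior (parse against a schema, tag with an ingestion timestamp, write the same rows to two sinks) can be illustrated without Spark. The sketch below is a plain-Python stand-in for what the Structured Streaming job does in each micro-batch, not the project's PySpark code. The schema fields, the `ingested_at` name, and the choice to drop malformed messages are assumptions.

```python
import json
from datetime import datetime, timezone

# Hypothetical event schema: field name -> required Python type.
EVENT_SCHEMA = {"station": str, "timestamp": str, "entries": int, "exits": int}


def parse_event(raw: bytes):
    """Return a dict matching EVENT_SCHEMA, or None if the message is malformed."""
    try:
        data = json.loads(raw)
    except ValueError:  # covers invalid JSON and undecodable bytes
        return None
    if not isinstance(data, dict):
        return None
    event = {}
    for field, expected in EVENT_SCHEMA.items():
        value = data.get(field)
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, expected):
            return None
        event[field] = value
    return event


def process_batch(raw_messages, sinks, now=None):
    """Parse one micro-batch, tag it, and hand the same rows to every sink."""
    ingested_at = (now or datetime.now(timezone.utc)).isoformat()
    rows = []
    for raw in raw_messages:
        event = parse_event(raw)
        if event is not None:
            event["ingested_at"] = ingested_at
            rows.append(event)
    for sink in sinks:
        sink(list(rows))  # each sink receives its own list of the batch rows
    return rows
```

Here a sink is any callable that accepts a list of rows, so a MongoDB insert and a console print can both be passed in and will see identical data for the batch.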

Engineering Decisions

Keep historical analysis separate from the live stream

Why — The 13 million-record dataset supported model development, while the live self-hosted stream validated runtime behavior with simulated turnstile events.

Trade-off — That separation made the system easier to reason about, but it also meant the live stream was a controlled simulation rather than an official transit feed.

Kafka and Spark Structured Streaming for ingestion and processing

Why — Kafka handled event transport while Spark Structured Streaming handled ingestion, transformation, aggregation, and inference in one pipeline.

Trade-off — The design required Kafka, ZooKeeper, MongoDB, and Spark to run together, which added more local operational complexity than a lightweight consumer script.

MongoDB for raw and aggregated persistence

Why — MongoDB kept both the raw stream events and the aggregated station outputs accessible without adding a separate relational schema layer.

Trade-off — The document store made iteration straightforward, but it did not replace a warehoused analytical model for larger downstream workloads.
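As a rough illustration of the aggregated station outputs mentioned here, the sketch below rolls one batch of raw events up to per-station totals. The field names (`station`, `entries`, `exits`, `events`) are assumptions for the example. The project's actual aggregation runs in Spark, and its windowing is not described in this write-up.

```python
from collections import defaultdict


def aggregate_by_station(rows):
    """Sum entries and exits per station for one batch of raw events."""
    totals = defaultdict(lambda: {"entries": 0, "exits": 0, "events": 0})
    for row in rows:
        agg = totals[row["station"]]
        agg["entries"] += row["entries"]
        agg["exits"] += row["exits"]
        agg["events"] += 1
    # One document per station, sorted by name for stable output
    return [{"station": name, **agg} for name, agg in sorted(totals.items())]
```

Each returned dict has the shape of a document that could sit in an aggregated collection alongside the raw events, with no relational schema needed for either.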

Separate regression and classification models

Why — Regression produced entry and exit forecasts, while a separate Random Forest classifier distinguished low from high traffic periods as a different modeling task.

Trade-off — That split kept each model focused on one target, but it required separate training and evaluation paths.

A compact, station- and hour-based feature set

Why — Feature importance from the trained Random Forest models showed station identity and hour of day carrying nearly all of the predictive weight, with day of week contributing almost nothing, so the feature set stayed deliberately small.

Trade-off — A compact feature set kept training fast and interpretable, but the models could miss longer-range seasonal or event-driven patterns that calendar features might capture with more data.

Results & Validation

On approximately 13 million historical MTA turnstile records, the entry and exit regression models reached an RMSE of around 2,700 and an MAE below 4.8%, and the separate traffic-level classification task reached 93.36% accuracy.
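For reference, the three metric types reported above can be computed as in the sketch below. It uses standard definitions only. This write-up does not state how the MAE was normalized to a percentage, so the sketch shows plain MAE in the target's own units and makes no claim about that normalization.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more than small ones."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def mae(y_true, y_pred):
    """Mean absolute error, in the same units as the target."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(labels, predicted):
    """Share of predictions that match the true class label."""
    return sum(a == b for a, b in zip(labels, predicted)) / len(labels)
```

For example, with true counts `[100, 200, 300, 400]` and predictions `[110, 190, 320, 380]`, the errors are 10, 10, 20, and 20, so the MAE is 15 and the RMSE is the square root of 250, about 15.8.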

Automated tests validated the historical workflow, the streaming path, MongoDB persistence, and the model inference steps.

The system was packaged with Docker so Kafka, Spark Structured Streaming, and MongoDB could run together as a repeatable setup on the self-hosted server. Each streamed event carries an ingestion timestamp, and the consumer writes to MongoDB and a console sink in the same micro-batch, so freshness and delivery could be checked directly against the live stream.
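A freshness check of the kind described could look like the sketch below. It assumes each stored row carries both an event time and an ingestion time as ISO 8601 strings under the hypothetical field names `timestamp` and `ingested_at`. The 30-second threshold is likewise illustrative.

```python
from datetime import datetime


def ingestion_lag_seconds(row):
    """Seconds between when an event was produced and when it was ingested."""
    event_time = datetime.fromisoformat(row["timestamp"])
    ingested_at = datetime.fromisoformat(row["ingested_at"])
    return (ingested_at - event_time).total_seconds()


def stale_rows(rows, max_lag_seconds=30.0):
    """Return the rows whose ingestion lag exceeds the allowed threshold."""
    return [r for r in rows if ingestion_lag_seconds(r) > max_lag_seconds]
```

Both timestamps need to be consistently timezone-aware or consistently naive, because Python raises a `TypeError` when subtracting one kind from the other.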

Deployment

I deployed and operated the pipeline on a self-hosted server, where it ran as a live stream of simulated turnstile events, and kept the 13 million-record historical analysis workflow separate from the runtime stream.

Evidence / Technologies

View Kafka producer
View Spark streaming consumer
View model training notebook
View historical EDA and training notebook

Apache Kafka | Spark Structured Streaming | PySpark | MongoDB | scikit-learn | Python | Pandas | Docker | Automated Testing
