SignalLake

A local-first log analytics platform built with FastAPI, partitioned Parquet, and DuckDB

Designed the complete path from validated event ingestion to query-ready operational analytics.

I designed and built an end-to-end log analytics platform that validates incoming events, preserves raw records in JSONL, transforms them into Hive-partitioned Parquet, and serves operational metrics through DuckDB and FastAPI. On a benchmark of 1 million events across 192 files, partition pruning reduced data scanned from 86.7 MB to 6.3 MB.

Context: Personal Project

Role: Backend / Data Engineer

Team: Solo

Date: Jun 2026

Designed and built the complete ingestion-to-query pipeline solo, including the event schema, storage layout, query layer, API, and test suite.

1M events · 192 files

pytest · 4 test files

GitHub Actions CI

FastAPI · Python · Pydantic · DuckDB · Apache Parquet · PyArrow

Overview

SignalLake gives backend, data, and platform engineers a lightweight way to investigate service latency, failures, and slow endpoints. It combines typed ingestion, raw-event preservation, analytical storage, SQL queries, operational APIs, structured logging, automated validation, and containerized execution in one cohesive system.

Architecture

Events submitted through the API or generated for benchmarking follow the same Pydantic LogEvent contract. FastAPI accepts individual or batch events and appends validated records to the raw JSONL zone. The ETL workflow converts those records into analytics-ready Parquet partitioned by event date, service name, and status code. DuckDB queries the files directly, while FastAPI exposes p50, p95, and p99 service latency, endpoint error rates, and the slowest endpoints. Middleware emits structured JSON logs and assigns a request ID to every API call.
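The validate-then-append step described above can be sketched as follows. This is a minimal illustration rather than the project's code: it uses a standard-library dataclass as a stand-in for the Pydantic LogEvent model so it runs without dependencies, and the field names, validation rules, and the `append_events` helper are all assumptions.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime
from pathlib import Path


@dataclass(frozen=True)
class LogEvent:
    """Stand-in for the Pydantic model; every field name here is hypothetical."""

    timestamp: str  # ISO 8601, e.g. "2026-06-01T12:00:00+00:00"
    service: str
    endpoint: str
    status_code: int
    latency_ms: float

    def __post_init__(self) -> None:
        # Reject records that would corrupt downstream partitions or metrics.
        datetime.fromisoformat(self.timestamp)  # raises ValueError if malformed
        if not self.service:
            raise ValueError("service must be non-empty")
        if not 100 <= self.status_code <= 599:
            raise ValueError("status_code must be a valid HTTP status")
        if self.latency_ms < 0:
            raise ValueError("latency_ms must be non-negative")


def append_events(raw_path: Path, events: list[LogEvent]) -> int:
    """Append already-validated events to the raw zone, one JSON object per line."""
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    with raw_path.open("a", encoding="utf-8") as fh:
        for event in events:
            fh.write(json.dumps(asdict(event)) + "\n")
    return len(events)
```

Because validation happens when the event object is constructed, a batch containing one bad record fails before anything is written, which keeps the raw zone free of partial batches in this sketch.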

What I Built

I independently designed and implemented the complete ingestion-to-query system:

Defined the shared Pydantic event schema used by ingestion, generation, storage, and tests.
Implemented single-event and batch FastAPI ingestion backed by an append-only JSONL raw zone.
Built the raw-to-Parquet ETL workflow and designed the Hive-style partition layout.
Implemented five DuckDB analytical queries covering latency percentiles, error rates, slowest endpoints, regional server errors, and hourly volume. The first three are exposed as FastAPI endpoints, while the regional and hourly queries are available directly through the DuckDB query layer.
Created the partition-pruning benchmark, structured request logging, pytest suite, GitHub Actions workflow, Dockerfile, and Docker Compose environment.
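To make the latency and error-rate metrics concrete, here is a plain-Python sketch of what p50, p95, p99, and an error rate mean. In the project these are computed by DuckDB in SQL; the functions below are illustrative only, use linear interpolation as one common percentile definition, and assume that an "error" is any 5xx status, which the project may define differently.

```python
import math


def percentile(values: list[float], q: float) -> float:
    """Linearly interpolated percentile for q in [0, 1]."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must be between 0 and 1")
    ordered = sorted(values)
    rank = q * (len(ordered) - 1)  # fractional index into the sorted list
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


def latency_summary(latencies_ms: list[float]) -> dict[str, float]:
    """Return the three percentiles the API exposes per service."""
    targets = (("p50", 0.50), ("p95", 0.95), ("p99", 0.99))
    return {name: percentile(latencies_ms, q) for name, q in targets}


def error_rate(status_codes: list[int]) -> float:
    """Share of responses with a 5xx status (assumed definition of an error)."""
    if not status_codes:
        return 0.0
    return sum(code >= 500 for code in status_codes) / len(status_codes)
```

The tail percentiles matter because a healthy median can hide a slow minority of requests; p95 and p99 surface exactly those.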

Engineering Decisions

DuckDB for embedded analytical SQL

Why — DuckDB queries Parquet directly without requiring a separate database service, making the complete analytical path reproducible locally.

Trade-off — Performance remains sensitive to file layout, projection efficiency, and local execution conditions.

Hive-partitioned Parquet for selective scans

Why — Partitioning by date, service, and status code aligns the storage layout with common operational filters and allows DuckDB to skip unrelated files.

Trade-off — At the measured scale, the pruned query still opened 8 partition files against the single unpartitioned file, and the best-of-seven query time moved from 9.03 ms to 13.10 ms rather than improving, so per-file overhead outweighed the byte savings at this dataset size.
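The idea behind Hive-style pruning can be shown without any engine: partition values live in the directory names, so a filter can discard files before reading a single byte. The sketch below is an illustration of that mechanism, not DuckDB's implementation; the column names (`event_date`, `service`, `status_code`), the root path, and the helper functions are assumptions.

```python
from pathlib import PurePosixPath


def partition_dir(root: str, event_date: str, service: str, status_code: int) -> PurePosixPath:
    """Build a key=value directory path for one partition."""
    return (
        PurePosixPath(root)
        / f"event_date={event_date}"
        / f"service={service}"
        / f"status_code={status_code}"
    )


def parse_partitions(path: PurePosixPath) -> dict[str, str]:
    """Read the key=value directory segments back out of a file path."""
    return dict(part.split("=", 1) for part in path.parts if "=" in part)


def prune(files: list[PurePosixPath], **filters: str) -> list[PurePosixPath]:
    """Keep only files whose partition values match every filter."""
    return [
        f
        for f in files
        if all(parse_partitions(f).get(key) == value for key, value in filters.items())
    ]


# Hypothetical layout: 2 dates x 2 services x 2 status codes = 8 files.
files = [
    partition_dir("data/parquet", d, s, c) / "part-0.parquet"
    for d in ("2026-06-01", "2026-06-02")
    for s in ("checkout", "search")
    for c in (200, 500)
]
selected = prune(files, service="checkout", status_code="500")
```

Here the filter keeps 2 of the 8 files. Every file that survives pruning still has to be opened and have its metadata read, which is the per-file cost the trade-off above refers to.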

FastAPI and Pydantic for the service contract

Why — FastAPI provides explicit ingestion and analytical routes, while Pydantic keeps event validation consistent across the API, generator, and storage layers.

Trade-off — Synchronous JSONL appends keep the local execution path simple, but request completion remains coupled to file I/O.

Results & Validation

Benchmarked 1,000,000 events stored as 192 partitioned Parquet files; the partitioned and unpartitioned layouts returned identical analytical results. The pruned query touched 8 of those 192 files.

Twenty-one tests across four pytest files validate API behavior (ingestion and all three live query endpoints), schema defaults and serialization, reproducible event generation, and storage round trips for both partitioned and unpartitioned layouts. GitHub Actions runs the suite on pushes to main and on pull requests.

Structured JSON request logs capture request ID, endpoint, status, service, and latency fields, and each API response includes its request ID.
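A framework-free sketch of that logging pattern follows. The project does this in FastAPI middleware; the version below only shows the shape of the idea using the standard library. The `handle_with_logging` helper, the `X-Request-ID` header name, and the exact log keys are assumptions.

```python
import json
import logging
import time
import uuid
from typing import Callable


def handle_with_logging(
    logger: logging.Logger,
    service: str,
    endpoint: str,
    handler: Callable[[], int],
) -> dict:
    """Run a handler, emit one JSON log line, and echo the request ID back."""
    request_id = uuid.uuid4().hex  # unique per call
    start = time.perf_counter()
    status = handler()  # hypothetical handler returning an HTTP status code
    record = {
        "request_id": request_id,
        "service": service,
        "endpoint": endpoint,
        "status": status,
        "latency_ms": round((time.perf_counter() - start) * 1000, 3),
    }
    logger.info(json.dumps(record))  # one JSON object per log line
    # Returning the ID lets a caller quote it when reporting a problem.
    return {"status": status, "headers": {"X-Request-ID": request_id}}
```

Emitting the same ID in the log line and in the response is what makes a single request traceable from a client report back to its server-side record.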

The FastAPI service is packaged with Docker and Docker Compose for repeatable local and containerized execution.

Benchmark Insight

Partition pruning reduced data scanned by 92.7%, from 86.7 MB to 6.3 MB, yet the best-of-seven query time rose by about 45%, from 9.03 ms on the single unpartitioned file to 13.10 ms across the 8 pruned partition files. The benchmark showed that bytes scanned must be evaluated alongside file count, projection efficiency, cache state, and engine overhead when designing analytical storage. Next, I would benchmark larger cold-cache and concurrent workloads, test compaction and alternative file-size strategies, and evaluate the layout with S3, Glue, and Athena.
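The headline ratios can be reproduced directly from the reported measurements. The snippet below only restates the figures given in this write-up; the function and variable names are illustrative.

```python
def relative_change(before: float, after: float) -> float:
    """Signed fractional change from `before` to `after`."""
    return (after - before) / before


# Figures reported in the benchmark above.
unpartitioned_mb, pruned_mb = 86.7, 6.3
unpartitioned_ms, pruned_ms = 9.03, 13.10
files_total, files_read = 192, 8

bytes_change = relative_change(unpartitioned_mb, pruned_mb)
latency_change = relative_change(unpartitioned_ms, pruned_ms)
files_skipped = 1 - files_read / files_total

print(f"bytes scanned: {bytes_change:+.1%}")    # bytes scanned: -92.7%
print(f"query time:    {latency_change:+.1%}")  # query time:    +45.1%
print(f"files skipped: {files_skipped:.1%}")    # files skipped: 95.8%
```

Seen side by side, the numbers make the point of the section: pruning skipped about 96% of the files and about 93% of the bytes, and the query still got slower at this dataset size.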

Evidence / Technologies

Benchmarks · Tests · CI workflow · Docker setup · Architecture notes · Design decisions

FastAPI · Python · Pydantic · DuckDB · Apache Parquet · PyArrow · pytest · Docker · GitHub Actions