Universal Content Extractor — multi-source news aggregation

01 Overview

The Universal Content Extractor ingests public news from wildly different sources — PDF newspapers and magazines, the open web, RSS/Atom feeds and social platforms — and turns the firehose into a coherent, deduplicated, cross-language, personalised feed. The defining design choice: every piece of text understanding and on-page image recognition runs on a self-hosted Gemma model on the server, so processing news at scale costs nothing in commercial LLM API fees.

02 The problem it solves

Aggregating news at scale means two hard problems at once: getting content out of formats that fight you (a PDF newspaper is a layout, not an article list; a website is markup behind anti-bot defences), and then making sense of the result across languages without it becoming a wall of duplicates. And doing all of that with commercial LLMs would make the per-article cost prohibitive. Running a local model removes that ceiling — and since the content is public news, the angle is purely cost, not data privacy.

Result: news processed at scale with an effectively zero per-article model bill.

03 What we built

Multi-source extraction

PDF newspapers & magazines — AI segmentation via a "Flash" pipeline: PyMuPDF text extraction, a streaming / incremental segmenter, per-page image processing, spread-based page processing, and multi-page article-continuation detection for Finnish, Swedish and English.
Web scraping — CSS selectors plus Playwright for JavaScript-rendered pages.
RSS / Atom feeds.
Social — X / Twitter, Facebook, Instagram and LinkedIn.
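As an illustration of the simplest of these sources, here is a minimal sketch of feed ingestion using only the Python standard library. It is not the project's actual code: the `FeedItem` type and `parse_feed` function are hypothetical names, and the sketch assumes plain RSS 2.0 or Atom 1.0 input (namespaced RSS 1.0 / RDF feeds would need extra handling).

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Atom 1.0 elements live in this XML namespace; RSS 2.0 elements are un-namespaced.
ATOM_NS = "{http://www.w3.org/2005/Atom}"


@dataclass
class FeedItem:  # hypothetical record type, for illustration only
    title: str
    link: str
    published: str


def parse_feed(xml_text: str) -> list[FeedItem]:
    """Normalise an RSS 2.0 or Atom 1.0 document into a flat list of items."""
    root = ET.fromstring(xml_text)
    items = []
    if root.tag == f"{ATOM_NS}feed":
        for entry in root.findall(f"{ATOM_NS}entry"):
            link_el = entry.find(f"{ATOM_NS}link")
            # Atom puts the URL in the href attribute, not in the element text.
            link = link_el.get("href", "") if link_el is not None else ""
            published = (entry.findtext(f"{ATOM_NS}published")
                         or entry.findtext(f"{ATOM_NS}updated") or "")
            items.append(FeedItem(
                title=(entry.findtext(f"{ATOM_NS}title") or "").strip(),
                link=link,
                published=published.strip(),
            ))
    else:
        # RSS 2.0: <item> elements sit under <rss><channel>.
        for item in root.iter("item"):
            items.append(FeedItem(
                title=(item.findtext("title") or "").strip(),
                link=(item.findtext("link") or "").strip(),
                published=(item.findtext("pubDate") or "").strip(),
            ))
    return items
```

Normalising both formats into one record shape early means everything downstream (enrichment, clustering, ranking) does not need to know which kind of source an article came from.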

Anti-detection

Scraping at scale survives because of the unglamorous defences: random delays, user-agent rotation, proxy rotation, exponential backoff and a circuit breaker to back off cleanly when a source pushes back.
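Two of these defences, exponential backoff and the circuit breaker, can be sketched in a few lines. This is a minimal illustration under assumptions, not the production implementation: the class and function names, the failure threshold, the cooldown and the choice of "full jitter" backoff are all hypothetical.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when a source is cooling down and must not be contacted."""


class CircuitBreaker:
    # Threshold and cooldown values are illustrative assumptions.
    def __init__(self, failure_threshold=3, cooldown_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed (healthy)

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let a trial request through ("half-open").
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open (or re-open) the circuit


def backoff_delay(attempt, base_s=1.0, cap_s=60.0, rng=random.random):
    """Exponential backoff with full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


def fetch_with_retries(fetch, breaker, max_attempts=4, sleep=time.sleep):
    """Call fetch() with backoff between failures, respecting the breaker."""
    for attempt in range(max_attempts):
        if not breaker.allow():
            raise CircuitOpenError("source is cooling down")
        try:
            result = fetch()
        except Exception:
            breaker.record_failure()
            if attempt == max_attempts - 1:
                raise
            sleep(backoff_delay(attempt))
        else:
            breaker.record_success()
            return result
```

In use, one breaker per source would wrap that source's fetch function, for example `fetch_with_retries(lambda: download(url), breaker)`, where `download` stands in for whatever HTTP client is used. Backoff handles short-lived failures; the breaker stops the scraper from hammering a source that keeps refusing.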

Enrichment & the feed

Once content is in, it's enriched: semantic clustering by cosine similarity (threshold ~0.75), cross-language linking so the same story in different languages connects, AI synopsis generation and event classification. A personalised feed then applies multi-algorithm ranking, deduplication and user preferences.
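To make the clustering step concrete, here is a minimal sketch of threshold-based grouping by cosine similarity. The greedy single-pass strategy and the function names are assumptions for illustration; the source only states that clustering uses cosine similarity with a threshold of roughly 0.75. Cross-language linking falls out of the same comparison only if the embeddings come from a multilingual model, which is also assumed here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_articles(embeddings, threshold=0.75):
    """Greedy single-pass clustering (an assumed strategy, not the confirmed one).

    Each article joins the first cluster whose seed (first member) is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `embeddings`.
    """
    clusters = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            seed = embeddings[cluster[0]]
            if cosine_similarity(emb, seed) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

For example, `cluster_articles([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])` groups the first two vectors together and leaves the third on its own. At production volume the pairwise comparison would typically be delegated to the vector database rather than done in a Python loop.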

Local LLM

A self-hosted Gemma model does all the text understanding and the on-page image recognition locally — reading and comprehension alike — which is what makes processing this volume economically viable.

04 Storage & interfaces

Records live in PostgreSQL; embeddings live in a Milvus vector database for semantic search and clustering. On top sit three interfaces: an admin panel, a REST API, and a news-feed UI.

05 Tech

Self-hosted Gemma · PyMuPDF · Playwright · PostgreSQL · Milvus vector DB · Semantic clustering ~0.75 · REST API · FI / SV / EN

Sources: PDF · web · RSS/Atom · social

LLM: Self-hosted Gemma, fully local

Storage: PostgreSQL + Milvus vectors

Interfaces: Admin panel · REST API · feed UI

Deployment: On-prem

06 Highlights

A "Flash" PDF pipeline that segments newspaper layouts and stitches multi-page articles across FI / SV / EN.
Web, RSS and social ingestion behind a full anti-detection layer — rotation, backoff and a circuit breaker.
Semantic clustering and cross-language linking so one story isn't a dozen duplicates.
A personalised feed with multi-algorithm ranking, dedup and user preferences.
All text understanding and on-page image recognition on a local Gemma model — zero commercial LLM API fees at scale.
PostgreSQL for records, Milvus for vector search.
