How do I avoid showing the same story five times?

Embed each article and cluster items whose vectors are highly similar — the same event reported by different outlets lands close together in embedding space. Pick one representative per cluster (or merge them in the summary) so the digest carries one entry per story instead of five near-duplicates from competing publishers.
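As a rough illustration of that grouping, here is a minimal sketch of greedy clustering by cosine similarity. It assumes the embeddings are already computed as plain lists of floats. The 0.85 threshold and the function names are illustrative assumptions, not values from any particular library, so tune the threshold against your own embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_articles(embeddings, threshold=0.85):
    """Greedy single-pass clustering.

    Each article joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new
    cluster. The threshold is an assumed starting point, not a tuned value.
    Returns a list of clusters, each a list of article indices.
    """
    clusters = []
    for index, vector in enumerate(embeddings):
        for members in clusters:
            representative = embeddings[members[0]]
            if cosine(vector, representative) >= threshold:
                members.append(index)
                break
        else:
            # No existing cluster was close enough.
            clusters.append([index])
    return clusters
```

The first member of each cluster doubles as its representative here, which keeps the sketch short. A production system might compare against a cluster centroid instead.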

How do I keep LLM costs down at news scale?

Summarise per cluster, not per article, so five sources collapse into one summarisation call. Cache summaries by content hash so re-runs do not pay for the same summary twice, use a small, cheap model for summarisation, and only summarise clusters that pass a relevance filter. Most of the spend disappears once you stop summarising duplicates.
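One way to cache by content hash is sketched below. It assumes an in-memory dictionary stands in for whatever store you actually use, and that `summarise` is any callable you supply that wraps your model call. The class and function names are hypothetical.

```python
import hashlib


def content_hash(text):
    """Hash the text after collapsing whitespace and lowercasing it.

    The normalisation is an assumption of this sketch: it makes trivially
    reformatted copies of the same text share one cache entry.
    """
    normalised = " ".join(text.split()).lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


class SummaryCache:
    """Wraps a summariser so identical content is summarised only once."""

    def __init__(self, summarise):
        self._summarise = summarise  # caller-supplied function: str -> str
        self._store = {}             # in-memory stand-in for a real store
        self.calls = 0               # number of paid summarisation calls

    def get(self, cluster_text):
        key = content_hash(cluster_text)
        if key not in self._store:
            self.calls += 1
            self._store[key] = self._summarise(cluster_text)
        return self._store[key]
```

Counting `calls` makes it easy to confirm that a re-run over unchanged clusters triggers no new summarisation requests.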

How do I respect publishers and copyright?

Store and display short generated summaries with a prominent link back to the original, never republish full article text, and honour each feed's terms of use and each site's robots.txt rules. Aggregators that add a summary and drive traffic to the source sit on far safer ground than ones that mirror full content. Attribute every item to its outlet.
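For the robots part, Python's standard library can evaluate robots.txt rules. The sketch below assumes you have already downloaded the robots.txt text for the site. The user agent string `NewsDigestBot` is a made-up example. This check covers robots rules only and says nothing about a feed's terms of use, which still need a human read.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if `user_agent` may fetch `url` under the given rules.

    `robots_txt` is the already-downloaded text of the site's robots.txt.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules; the paths and the bot name are illustrative only.
EXAMPLE_RULES = """User-agent: *
Disallow: /private/
"""

feed_ok = allowed_by_robots(
    EXAMPLE_RULES, "NewsDigestBot", "https://example.com/feed.xml"
)
private_ok = allowed_by_robots(
    EXAMPLE_RULES, "NewsDigestBot", "https://example.com/private/draft.html"
)
```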

How often should I poll feeds?

Match the cadence to the source: breaking-news outlets justify polling every few minutes, while most blogs need a poll only every 30 to 60 minutes. Use conditional requests (If-None-Match with the stored ETag, and If-Modified-Since with the stored Last-Modified date) so you only download feeds that have changed, and stagger polls to avoid hammering any one server. The planner below estimates daily article volume from your feed count and cadence.
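A conditional fetch might look like the following sketch, which uses only the standard library. The stored ETag goes back to the server in the If-None-Match header and the stored Last-Modified value in If-Modified-Since. A 304 response means the feed is unchanged, and `urllib` reports it as an `HTTPError`. The function names and the returned dictionary shape are assumptions of this example, and the `opener` parameter exists only so the function can be exercised without a network.

```python
import urllib.error
import urllib.request


def conditional_headers(etag=None, last_modified=None):
    """Build validator headers from values saved on the previous fetch."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


def fetch_if_changed(url, etag=None, last_modified=None,
                     opener=urllib.request.urlopen):
    """Fetch a feed, skipping the download when the server answers 304."""
    request = urllib.request.Request(
        url, headers=conditional_headers(etag, last_modified)
    )
    try:
        with opener(request, timeout=10) as response:
            return {
                "changed": True,
                "body": response.read(),
                # Save these validators for the next poll.
                "etag": response.headers.get("ETag"),
                "last_modified": response.headers.get("Last-Modified"),
            }
    except urllib.error.HTTPError as err:
        if err.code == 304:
            # Not modified: keep the validators we already had.
            return {
                "changed": False,
                "body": None,
                "etag": etag,
                "last_modified": last_modified,
            }
        raise
```

Not every server sends validators, so a feed that never returns an ETag or Last-Modified header will simply be downloaded in full on each poll.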

How do I personalise the digest?

Build an interest vector per user from topics they follow or articles they engage with, then rank clusters by similarity to that vector before assembling the digest. Start simple with explicit topic tags, then layer in implicit signals like opens and clicks. Always let users tune their topics so the personalisation stays transparent.
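The ranking step can be sketched as follows, assuming the interest vector is the plain average of the embeddings of articles a user engaged with. Averaging is one simple choice among several and the function names are hypothetical. Explicit topic tags could be handled the same way by embedding the tag text.

```python
import math


def mean_vector(vectors):
    """Average a non-empty list of equal-length vectors component-wise."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_clusters(interest, cluster_vectors):
    """Return cluster ids ordered from most to least similar to `interest`.

    `cluster_vectors` maps a cluster id to that cluster's embedding.
    """
    return sorted(
        cluster_vectors,
        key=lambda cluster_id: cosine(interest, cluster_vectors[cluster_id]),
        reverse=True,
    )
```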

How to Build an AI News Aggregator

What you are building

This tutorial builds an AI news aggregator: a pipeline that pulls articles from many RSS feeds, collapses the same story reported by different outlets into a single item, summarises each one, and delivers a personalised digest by email or web. The intelligence is in two places — using embeddings to cluster duplicates so readers see one entry per story instead of five, and using an LLM to summarise each cluster into a few tight sentences. The rest is plumbing: scheduled fetching, deduplication, ranking, and delivery.

How it works

A scheduled fetcher polls your RSS and Atom feeds, parses each entry into a common shape, and stores new items, skipping anything it has seen by URL hash and using conditional requests so it only downloads changed feeds. A clustering step embeds every new article and groups items whose vectors are highly similar — the same event from competing publishers sits close together in embedding space, so it merges into one cluster. A summarisation step runs a cheap LLM once per cluster (not per article) and caches the result by content hash. Finally a personalisation and delivery step ranks clusters against each user’s interest vector and assembles a digest sent on their chosen cadence, every cluster linking back to the original sources.
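The first stage, parsing entries into a common shape and skipping anything already seen by URL hash, could be sketched like this. The example handles only RSS 2.0 `item` elements with the standard library XML parser. Atom feeds use different element names and namespaces and would need their own branch. The field names in the common shape and the in-memory `seen_hashes` set are assumptions of this sketch.

```python
import hashlib
import xml.etree.ElementTree as ET


def url_hash(url):
    """Stable key for an article, derived from its URL."""
    return hashlib.sha256(url.strip().encode("utf-8")).hexdigest()


def parse_rss(xml_text):
    """Turn RSS 2.0 items into dictionaries with a common shape."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        if not link:
            continue  # without a URL there is nothing to dedupe or link to
        entries.append({
            "url": link,
            "title": (item.findtext("title") or "").strip(),
            "summary": (item.findtext("description") or "").strip(),
            "published": (item.findtext("pubDate") or "").strip(),
        })
    return entries


def store_new(entries, seen_hashes, store):
    """Append unseen entries to `store`; return how many were added."""
    added = 0
    for entry in entries:
        key = url_hash(entry["url"])
        if key in seen_hashes:
            continue
        seen_hashes.add(key)
        store.append(entry)
        added += 1
    return added
```

Feeding the same document through `store_new` twice adds nothing the second time, which is the behaviour the fetcher relies on between polls.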

Tips and the planner below

Summarise per cluster, not per article — that one decision removes most of your LLM bill because five sources of the same story become one call. Cache summaries by content hash so re-runs cost nothing, and only summarise clusters that clear a relevance filter. Respect publishers: show a short generated summary with a prominent link back, never mirror full text, and honour feed terms. Poll at a cadence that matches each source and stagger requests to be a good citizen. The planner below estimates your daily article volume, how many clusters and summarisation calls that produces, and the resulting monthly cost from your feed count, polling cadence, and model price.
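The arithmetic behind such an estimate can be sketched as below. This is not the planner's actual formula, which is not given here. Every parameter beyond feed count, polling cadence, and model price is an assumption of this example: the average number of new items found per poll, the average number of articles per cluster, the share of clusters passing the relevance filter, and the tokens consumed per summarisation call.

```python
def plan_costs(feeds, poll_minutes, new_items_per_poll, articles_per_cluster,
               relevance_rate, tokens_per_call, price_per_million_tokens,
               days=30):
    """Estimate daily volume and monthly summarisation cost.

    All inputs are averages you supply; the result is only as good as
    those guesses.
    """
    polls_per_day = 24 * 60 / poll_minutes
    articles_per_day = feeds * polls_per_day * new_items_per_poll
    # Duplicates of one story collapse into a single cluster.
    clusters_per_day = articles_per_day / articles_per_cluster
    # Only clusters that pass the relevance filter get summarised.
    calls_per_day = clusters_per_day * relevance_rate
    monthly_tokens = calls_per_day * days * tokens_per_call
    monthly_cost = monthly_tokens / 1_000_000 * price_per_million_tokens
    return {
        "articles_per_day": articles_per_day,
        "clusters_per_day": clusters_per_day,
        "calls_per_day": calls_per_day,
        "monthly_cost": monthly_cost,
    }


# Illustrative inputs only: 50 feeds polled hourly, one new item every
# second poll, five articles per story, half the clusters relevant.
estimate = plan_costs(
    feeds=50, poll_minutes=60, new_items_per_poll=0.5,
    articles_per_cluster=5, relevance_rate=0.5,
    tokens_per_call=2000, price_per_million_tokens=1.0,
)
```

With those made-up inputs the sketch yields 600 articles a day, 120 clusters, and 60 summarisation calls, which shows how clustering and filtering shrink the number of paid calls well below the raw article count.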