Building an Intelligent RSS News Aggregator for Discord
How I built a sophisticated RSS news aggregation system with AI-powered summaries, feed diversity, and smart deduplication for my Discord bot
The Problem
A friend of mine created a useful RSS feed that aggregates a bunch of San Diego news feeds. My instinct was to use my Discord bot to post the content to a new channel, which worked, but generated a ton of noise.

Starting Simple: Direct RSS Posting
Initial Implementation
Before I gloss over the initial solution, there is some cool stuff in it. To make it work, I created a new slash command to register a feed for a server. Since my Discord bot lives in multiple servers, this registers RSS feeds on a per-channel, per-server basis. A cron job on the box that runs my Discord bot (a Raspberry Pi on my home network) checks every 10 minutes and posts any new feed items to the channel, comparing against a saved list of already-posted feed items. Like I said, this worked, using a simple ID-based system to prevent duplicate posts, but it wasn't very sophisticated and produced quite a lot of posts every 10 minutes.
Architecture:
User runs /news add
↓
Bot stores feed config in app_state.json
↓
Background poster (Docker service) runs every 10 minutes
↓
Fetches RSS feed with feedparser
↓
Posts new items as Discord embeds
↓
Tracks seen item IDs to prevent duplicates
Key Files:
- bot/app/commands/news/news.py - Slash command handlers
- bot/app/tasks/rss_feed_poster.py - Background polling service
- bot/app/app_state.json - Persistent state storage
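The duplicate-prevention logic is simple enough to sketch. A minimal version of the seen-ID check, where the entry dicts and state shape are assumptions for illustration rather than the bot's exact schema:

```python
def filter_new_entries(entries, seen_ids):
    """Return entries not yet posted and record their IDs.

    `entries` are dicts shaped like feedparser entries; `seen_ids` is the
    persisted set of already-posted item IDs (e.g. kept in app_state.json).
    """
    new_items = []
    for entry in entries:
        # feedparser entries usually carry an `id`; fall back to the link
        item_id = entry.get("id") or entry.get("link")
        if item_id and item_id not in seen_ids:
            seen_ids.add(item_id)
            new_items.append(entry)
    return new_items
```

Each 10-minute cycle, the poster loads the seen-ID set, runs the new entries through this filter, posts each survivor as an embed, and persists the updated set.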
Evolution: From Spam to Signal
The "Too Many Articles" Problem
With a sea of articles flooding the newly created news channel, it was not very useful, as evidenced by how quickly the other server members muted it. I realized I had to change this if I wanted it to be useful to anyone. With some feedback from the friend behind the RSS feed, I decided to implement something a bit more sophisticated: using LLMs to aggregate and summarize content into a few smaller updates throughout the day.
Designing the Summary System
The big tradeoff here was collecting and filtering news without overwhelming my OpenAI API budget, while still producing meaningful news summaries. For this reason I went with a two-stage pipeline: first, collect posts throughout the day; later, when it's time to post the summary, parse the stored entries into an update.
Stage 1: Collection (rss_feed_poster.py)
- Runs every 10 minutes
- Collects new articles from all feeds
- Stores them in pending_news.json
- Tracks what's been seen
Stage 2: Summarization (rss_summary_poster.py)
- Runs on a configurable schedule (defaults to 8am/8pm Pacific)
- Processes pending articles
- Generates AI summary
- Posts to Discord
- Clears pending articles
Why two stages?
- Decouples data collection from presentation
- Allows flexible scheduling
- Enables batching for better AI summaries
- Reduces API costs (fewer LLM calls)
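Stage 1's hand-off to stage 2 is just a JSON file. A minimal sketch of appending collected articles to the pending store; the flat-list layout is an assumption, not necessarily the real pending_news.json schema:

```python
import json
from pathlib import Path

def append_pending(path, new_articles):
    """Append collected articles to the pending store (pending_news.json).

    Returns the number of articles now waiting for the next summary run.
    """
    store = Path(path)
    pending = json.loads(store.read_text()) if store.exists() else []
    pending.extend(new_articles)
    store.write_text(json.dumps(pending, indent=2))
    return len(pending)
```

The collector only ever appends; the summarizer reads the whole list, posts its update, then truncates the file, which is what keeps the two schedules decoupled.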
Direct Mode:
For some RSS feeds that don't update frequently, I added a fallback method that will still post every 10 minutes.
- Good for: Breaking news, time-sensitive feeds
- Posts articles as they arrive
- No AI processing
Summary Mode: Scheduled aggregation (new default)
- Good for: General news, opinion pieces, most content
- Collects articles throughout the day
- Generates contextual summary at scheduled times
- Reduces noise, increases signal
Building the AI Summarization Pipeline
The goal of the AI summarization pipeline is to extract the signal of "what's going on" from multiple similar stories, while avoiding posting the same stories over and over. We also want to make sure the stories are relevant to the specific channel or topic, so we needed a way to cut out stories unrelated to the desired topic.
Step 1: URL Filtering
The first step is really just to eliminate identical stories. This cuts down on noise and reduces the number of LLM summarization and filtering calls we'll need to make.
Initial Limits:
- Start with up to 50 most recent articles
- Configurable per channel
Feed-Specific Filters:
- Allow custom filter instructions per feed
- Example: "only San Diego articles" for broad regional feeds
- AI evaluates each article against filter criteria
URL Deduplication:
- Same article from different feeds? Keep only one
- Prevents redundant processing
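URL deduplication hinges on normalizing links so the same article arriving from two feeds compares equal. A rough sketch; the normalization rules here are illustrative and would need tuning per feed:

```python
from urllib.parse import urlsplit

def canonical_url(url):
    """Normalize a URL so the same article from different feeds matches.

    Drops the scheme, query string (often just tracking params), fragment,
    "www." prefix, and trailing slashes. A simplification: feeds with
    meaningful query parameters would need smarter handling.
    """
    parts = urlsplit(url.strip().lower())
    return parts.netloc.removeprefix("www.") + parts.path.rstrip("/")

def dedupe_by_url(articles):
    """Keep only the first article seen for each canonical URL."""
    seen, unique = set(), []
    for article in articles:
        key = canonical_url(article["link"])
        if key not in seen:
            seen.add(key)
            unique.append(article)
    return unique
```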
Step 2: Relevance Ranking
LLM "scoring" is a pretty flawed concept, but with so many news stories coming in all the time, I wanted some kind of sorting system. The LLM doesn't do a terrible job at this, as long as you recognize it isn't giving scientific scores. It's just a nice gut-check for which stories are more important than others.
AI-Powered Scoring:
- Evaluates each article for importance (1-10 scale)
- Considers: timeliness, significance, local relevance
- Narrows down to the top N articles (default: 18)
- Configurable per channel
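One way to sketch this step: build a scoring prompt, then parse "index: score" lines out of the model's reply and keep the top N. The prompt wording and reply format are assumptions for illustration, not the bot's actual prompts:

```python
import re

def build_scoring_prompt(articles):
    """Ask the model to score each headline 1-10 (illustrative wording)."""
    lines = [f"{i}. {a['title']}" for i, a in enumerate(articles, start=1)]
    return (
        "Score each article 1-10 for importance, considering timeliness, "
        "significance, and local relevance. Reply with one 'index: score' "
        "line per article.\n\n" + "\n".join(lines)
    )

def top_articles(articles, llm_reply, top_n=18):
    """Parse 'index: score' lines from the reply and keep the top N."""
    scores = {}
    for match in re.finditer(r"(\d+)\s*:\s*(\d+)", llm_reply):
        idx, score = int(match.group(1)), int(match.group(2))
        if 1 <= idx <= len(articles):
            scores[idx] = score
    ranked = sorted(scores, key=lambda i: scores[i], reverse=True)
    return [articles[i - 1] for i in ranked[:top_n]]
```

Parsing defensively matters here: the model occasionally skips an index or adds commentary, so anything unmatched simply falls out of the ranking.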
Step 3: Story Clustering
Next we identify stories that are similar in content and write a summary about all of them. This allows us to capture details from multiple sources about the same story instead of repeating the story or leaving out details from different sources.
Similarity Detection:
- AI identifies articles about the same story
- Groups them into clusters
- Each cluster = one story
- Generates single summary per story instead of per article
Example:
Fire breaks out in downtown (CBS8)
Downtown building evacuated (Union-Tribune)
Fire contained, no injuries (Fox5)
→ all three become one "Downtown Fire" cluster
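A sketch of applying cluster assignments, assuming the model replies with a JSON list of index groups (that reply format is an assumption for this sketch, not the bot's actual contract):

```python
import json

def cluster_articles(articles, llm_reply):
    """Group articles using the model's reply: a JSON list of index lists,
    e.g. [[0, 1, 2], [3]] meaning articles 0-2 cover the same story.

    Any article the model leaves unassigned becomes its own cluster.
    """
    groups = json.loads(llm_reply)
    clusters = [
        [articles[i] for i in group if 0 <= i < len(articles)]
        for group in groups
    ]
    assigned = {i for group in groups for i in group}
    clusters += [[a] for i, a in enumerate(articles) if i not in assigned]
    return [c for c in clusters if c]
```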
Step 4: Deduplication Against History
With the clustered and summarized stories in place, we do another check, comparing these summaries against everything posted within the rolling history window so we don't repeat stories we've already covered.
Story History System:
- Tracks summaries posted in last N hours (default: 24)
- Configurable time window per channel (6-168 hours)
- AI compares new stories against recent history
- Filters out stories already covered
Why this matters:
- Breaking news continues for days
- This prevents a sense of "we already told you this yesterday"
- User can adjust based on channel purpose
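The rolling window itself is plain date math. A sketch of pruning story history down to the configured window before handing it to the AI for comparison; the history-entry shape is an assumption:

```python
from datetime import datetime, timedelta, timezone

def prune_history(history, window_hours=24):
    """Drop stored summaries older than the channel's rolling window.

    `history` entries are dicts with an ISO-8601 `posted_at` timestamp and
    a `summary` string (an assumed shape for this sketch). Whatever
    survives is what new story clusters get compared against.
    """
    cutoff = datetime.now(timezone.utc) - timedelta(hours=window_hours)
    return [
        entry for entry in history
        if datetime.fromisoformat(entry["posted_at"]) >= cutoff
    ]
```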
Step 5: Final Summary Generation
Now we're ready to post the full news summary for this period. At this stage the LLM reviews the story clusters and posts one update covering all the stories; then we clear the pending stories from state so the system can start collecting for the next update.
Contextual Summarization:
- Groups all remaining stories
- Generates cohesive narrative
- Includes relevant details from all sources
- Natural language, not bullet points
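A sketch of assembling the final summarization prompt from the surviving clusters; the wording and field names are illustrative, not the bot's actual prompt:

```python
def build_summary_prompt(clusters):
    """Build one prompt covering every surviving story cluster.

    Each cluster contributes its headlines and sources so the model can
    weave details from multiple outlets into a single narrative.
    """
    sections = []
    for n, cluster in enumerate(clusters, start=1):
        lines = [f"- {a['title']} ({a['source']})" for a in cluster]
        sections.append(f"Story {n}:\n" + "\n".join(lines))
    return (
        "Write one cohesive news update in natural language (no bullet "
        "points), covering every story below and folding in details from "
        "all listed sources.\n\n" + "\n\n".join(sections)
    )
```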
Advanced Features
Feed Diversity & Fairness
After creating several channels with multiple feeds, it was apparent that some feeds post so often that they drown out other feeds in the same channel. To regulate how much space one feed can take up in a given news update, I created this feed diversity solution. Here's how it works:
- Each channel already caps how many stories are collected into the pending_stories state (default: 50)
- Each channel can be configured with a minimum and maximum number of stories per feed
- As stories are collected, the minimum is taken from each feed first, then each feed is topped up toward its per-feed maximum
- Once a feed hits its maximum in this pass, no more stories are taken from it
- If slots remain after every feed has been capped or exhausted, the leftover slots in pending_stories are filled from feeds that still have extra stories, using a round-robin approach
Example: News Channel with 5 feeds
Collection Limit of 50 stories
Minimum per feed is 2 stories
Maximum per feed is 8 stories
Feed 1: 125 new stories
Feed 2: 8 new stories
Feed 3: 2 new stories
Feed 4: 12 new stories
Feed 5: 220 new stories
(Without distribution, it's possible that all stories would come from Feed 1 and no other feeds would be represented in the update.)
With distribution using the values above, we first collect 2 stories from each feed (10 total). Feeds 1, 2, 4, and 5 still have leftover stories, so we collect up to the maximum from each: 6 more per feed (24 total). Of our 50-story limit, 16 slots remain, so we pull one story from each feed that still has extras, round-robin style, until we reach the limit. We end up with the following distribution:
Feed 1: 2 + 6 + 6 = 14/125 stories
Feed 2: 2 + 6 + 0 = 8/8 stories
Feed 3: 2 + 0 + 0 = 2/2 stories
Feed 4: 2 + 6 + 4 = 12/12 stories
Feed 5: 2 + 6 + 6 = 14/220 stories
(up to min + up to max + round robin additions = total stories)
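The three passes in the worked example can be sketched as a single function; running it on the feed counts above reproduces the distribution shown. Function and parameter names are mine, not the bot's:

```python
def distribute(feed_counts, limit=50, min_per_feed=2, max_per_feed=8):
    """Decide how many stories to take from each feed for one collection.

    Pass 1: guarantee each feed its minimum.
    Pass 2: top each feed up to its per-feed maximum.
    Pass 3: round-robin any leftover slots among feeds with extra stories.
    """
    taken = {feed: 0 for feed in feed_counts}
    remaining = limit

    # Pass 1: minimum per feed (or fewer, if the feed has fewer stories).
    for feed, available in feed_counts.items():
        take = min(min_per_feed, available, remaining)
        taken[feed] += take
        remaining -= take

    # Pass 2: top up each feed toward its maximum.
    for feed, available in feed_counts.items():
        take = min(max_per_feed - taken[feed], available - taken[feed], remaining)
        if take > 0:
            taken[feed] += take
            remaining -= take

    # Pass 3: round-robin fill from feeds that still have stories left.
    while remaining > 0:
        candidates = [f for f, avail in feed_counts.items() if taken[f] < avail]
        if not candidates:
            break
        for feed in candidates:
            if remaining == 0:
                break
            taken[feed] += 1
            remaining -= 1
    return taken
```

With the example's numbers this yields Feed 1: 14, Feed 2: 8, Feed 3: 2, Feed 4: 12, Feed 5: 14, matching the table above. Note the per-feed maximum only binds during the top-up pass; the round-robin pass can exceed it, which is what lets busy feeds absorb otherwise-wasted slots.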
Configuration:
/news diversity configure
strategy:balanced
max_per_feed:4
min_per_feed:1
On-Demand Summaries
Sometimes we want an ad-hoc summarization of the pending stories. This is helpful for testing and for very busy feeds with lots of stories.
The /news summary Command:
- Generate summary immediately
- Don't wait for scheduled time
- Useful for checking "what did I miss?"
- Clears pending articles after posting
Article Browsing
It's also helpful to see what posts are in the pending stories before they are summarized. It can be entertaining to browse them and also helps validate that story collection is working as expected.
The /news latest Command:
- Paginated view of pending articles
- Filter by specific feed
- See what's been collected
- Decide if you want to trigger summary early
This bot is running in production and serving daily news summaries to my Discord community. If you're interested in the code or want to discuss the architecture, feel free to reach out. If you want to add a feature-rich Discord bot with tons of AI features, you can download and run your own copy of the bot here.