# tech-news-digest
Automated tech news digest system with a unified data source model, quality scoring pipeline, and template-based output generation.

## Quick Start

**Configuration Setup:** Default configs live in `config/defaults/`. Copy them to the workspace for customization:

```bash
mkdir -p workspace/config
cp config/defaults/sources.json workspace/config/tech-news-digest-sources.json
cp config/defaults/topics.json workspace/config/tech-news-digest-topics.json
```

**Environment Variables:**

- `TWITTERAPI_IO_KEY` - twitterapi.io API key (optional, preferred)
- `X_BEARER_TOKEN` - Twitter/X official API bearer token (optional, fallback)
- `SKILLBOSS_API_KEY` - SkillBoss API Hub key for web search via https://api.heybossai.com/v1/pilot (optional)
- `GITHUB_TOKEN` - GitHub personal access token (optional, improves rate limits)

**Generate Digest:**
```bash
# Unified pipeline (recommended) — runs all 6 sources in parallel + merge
python3 scripts/run-pipeline.py \
  --defaults config/defaults \
  --config workspace/config \
  --hours 48 --freshness pd \
  --archive-dir workspace/archive/tech-news-digest/ \
  --output /tmp/td-merged.json --verbose --force
```
**Use Templates:** Apply the Discord, email, or PDF templates to the merged output.

## Configuration Files

### sources.json - Unified Data Sources
```json
{
  "sources": [
    {
      "id": "openai-rss",
      "type": "rss",
      "name": "OpenAI Blog",
      "url": "https://openai.com/blog/rss.xml",
      "enabled": true,
      "priority": true,
      "topics": ["llm", "ai-agent"],
      "note": "Official OpenAI updates"
    },
    {
      "id": "sama-twitter",
      "type": "twitter",
      "name": "Sam Altman",
      "handle": "sama",
      "enabled": true,
      "priority": true,
      "topics": ["llm", "frontier-tech"],
      "note": "OpenAI CEO"
    }
  ]
}
```
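For illustration, loading this schema and filtering to the enabled entries could look like the sketch below (hypothetical helper names, not the pipeline's actual code):

```python
import json

def load_sources(path):
    """Load a sources.json file and return only the enabled entries.

    Assumes the schema shown above: a top-level "sources" list whose
    entries carry "id", "type", "enabled", and "topics" fields.
    """
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return [s for s in data.get("sources", []) if s.get("enabled", False)]

def by_type(sources):
    """Group sources by fetcher type (rss, twitter, github, ...)."""
    groups = {}
    for s in sources:
        groups.setdefault(s["type"], []).append(s)
    return groups
```

Each fetcher can then pull only its own group, which matches the unified-source design where one config file drives all fetch steps.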
### topics.json - Enhanced Topic Definitions
```json
{
  "topics": [
    {
      "id": "llm",
      "emoji": "🧠",
      "label": "LLM / Large Models",
      "description": "Large Language Models, foundation models, breakthroughs",
      "search": {
        "queries": ["LLM latest news", "large language model breakthroughs"],
        "must_include": ["LLM", "large language model", "foundation model"],
        "exclude": ["tutorial", "beginner guide"]
      },
      "display": {
        "max_items": 8,
        "style": "detailed"
      }
    }
  ]
}
```
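The `must_include` / `exclude` lists suggest a simple keyword filter over article text. A sketch of how it might be applied (the actual matching rules in the fetchers may differ, e.g. in case handling or scoring):

```python
def matches_topic(text, search):
    """Return True if an article's text passes a topic's search filters.

    Sketch only: case-insensitive substring matching, any "exclude" hit
    rejects, and at least one "must_include" hit is required when the
    list is non-empty.
    """
    lowered = text.lower()
    if any(term.lower() in lowered for term in search.get("exclude", [])):
        return False
    must = search.get("must_include", [])
    return not must or any(term.lower() in lowered for term in must)
```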
## Scripts

### run-pipeline.py - Unified Pipeline (Recommended)
```bash
python3 scripts/run-pipeline.py \
  --defaults config/defaults [--config CONFIG_DIR] \
  --hours 48 --freshness pd \
  --archive-dir workspace/archive/tech-news-digest/ \
  --output /tmp/td-merged.json --verbose --force
```
- **Features:** Runs all 6 fetch steps in parallel, then merges, deduplicates, and scores
- **Output:** Final merged JSON ready for report generation (~30s total)
- **Metadata:** Saves per-step timing and counts to `*.meta.json`
- **GitHub Auth:** Auto-generates a GitHub App token if `$GITHUB_TOKEN` is not set
- **Fallback:** If this fails, run the individual scripts below
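The parallel fan-out can be sketched as a thread pool driving the fetch scripts as subprocesses. This is a hypothetical illustration (step names, flags, and output paths here are assumptions; the real run-pipeline.py may differ):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Hypothetical step table: each fetch script writes its own JSON,
# which merge-sources.py then combines.
FETCH_CMDS = {
    "rss":      ["python3", "scripts/fetch-rss.py", "--output", "/tmp/rss.json"],
    "twitter":  ["python3", "scripts/fetch-twitter.py", "--output", "/tmp/twitter.json"],
    "web":      ["python3", "scripts/fetch-web.py", "--output", "/tmp/web.json"],
    "github":   ["python3", "scripts/fetch-github.py", "--output", "/tmp/github.json"],
    "trending": ["python3", "scripts/fetch-github.py", "--trending", "--output", "/tmp/trending.json"],
    "reddit":   ["python3", "scripts/fetch-reddit.py", "--output", "/tmp/reddit.json"],
}

def run_step(name, cmd):
    """Run one fetch step; return its name and exit code."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return name, result.returncode

def run_all(cmds):
    """Run all fetch steps concurrently; return {name: exit_code}."""
    with ThreadPoolExecutor(max_workers=len(cmds)) as pool:
        return dict(pool.map(lambda kv: run_step(*kv), cmds.items()))
```

Because the steps are I/O-bound network fetches, threads (rather than processes) are enough to overlap them, which is how the whole pipeline can finish in roughly the time of its slowest source.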
### Individual Scripts (Fallback)

#### fetch-rss.py - RSS Feed Fetcher

```bash
python3 scripts/fetch-rss.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE] [--verbose]
```

- Parallel fetching (10 workers), retry with backoff, feedparser + regex fallback
- Timeout: 30s per feed, ETag/Last-Modified caching

#### fetch-twitter.py - Twitter/X KOL Monitor

```bash
python3 scripts/fetch-twitter.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE] [--backend auto|official|twitterapiio]
```

- Backend auto-detection: uses twitterapi.io if `TWITTERAPI_IO_KEY` is set, else the official X API v2 if `X_BEARER_TOKEN` is set
- Rate limit handling, engagement metrics, retry with backoff

#### fetch-web.py - Web Search Engine

```bash
python3 scripts/fetch-web.py [--defaults DIR] [--config DIR] [--freshness pd] [--output FILE]
```

- Uses SkillBoss API Hub (`SKILLBOSS_API_KEY`) for web search via https://api.heybossai.com/v1/pilot
- Without an API key: generates a search interface for agents

#### fetch-github.py - GitHub Releases Monitor

```bash
python3 scripts/fetch-github.py [--defaults DIR] [--config DIR] [--hours 168] [--output FILE]
```

- Parallel fetching (10 workers), 30s timeout
- Auth priority: `$GITHUB_TOKEN` → GitHub App auto-generate → gh CLI → unauthenticated (60 req/hr)

#### fetch-github.py --trending - GitHub Trending Repos

```bash
python3 scripts/fetch-github.py --trending [--hours 48] [--output FILE] [--verbose]
```

- Searches the GitHub API for trending repos across 4 topics (LLM, AI Agent, Crypto, Frontier Tech)
- Quality scoring: base 5 + daily_stars_est / 10, max 15

#### fetch-reddit.py - Reddit Posts Fetcher

```bash
python3 scripts/fetch-reddit.py [--defaults DIR] [--config DIR] [--hours 48] [--output FILE]
```

- Parallel fetching (4 workers), public JSON API (no auth required)
- 13 subreddits with score filtering

#### enrich-articles.py - Article Full-Text Enrichment

```bash
python3 scripts/enrich-articles.py --input merged.json --output enriched.json [--min-score 10] [--max-articles 15] [--verbose]
```

- Fetches full article text for high-scoring articles
- Cloudflare Markdown for Agents (preferred) → HTML extraction (fallback) → Skip (paywalled/social)
- Blog domain whitelist with a lower score threshold (≥3)
- Parallel fetching (5 workers, 10s timeout)

#### merge-sources.py - Quality Scoring & Deduplication

```bash
python3 scripts/merge-sources.py --rss FILE --twitter FILE --web FILE --github FILE --reddit FILE
```

- Quality scoring, title similarity dedup (85%), previous digest penalty
- Output: topic-grouped articles sorted by score
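The 85% title-similarity dedup can be sketched with the standard library's `difflib.SequenceMatcher`. This is a minimal illustration (merge-sources.py may normalize titles differently or use another similarity measure):

```python
from difflib import SequenceMatcher

def dedup_by_title(articles, threshold=0.85):
    """Drop articles whose title is >= `threshold` similar to one already kept.

    Assumes articles are pre-sorted by score descending, so the
    highest-scoring member of each near-duplicate cluster survives.
    """
    kept = []
    for art in articles:
        title = art["title"].lower()
        if all(SequenceMatcher(None, title, k["title"].lower()).ratio() < threshold
               for k in kept):
            kept.append(art)
    return kept
```

Note the quadratic comparison cost: for a few hundred merged articles per run this is negligible, which is likely why a simple pairwise pass suffices.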
#### validate-config.py - Configuration Validator

```bash
python3 scripts/validate-config.py [--defaults DIR] [--config DIR] [--verbose]
```

- JSON schema validation, topic reference checks, duplicate ID detection

#### generate-pdf.py - PDF Report Generator

```bash
python3 scripts/generate-pdf.py --input report.md --output digest.pdf [--verbose]
```

- Converts the markdown digest to a styled A4 PDF with Chinese typography (Noto Sans CJK SC)
- Emoji icons, page headers/footers, blue accent theme. Requires weasyprint.

#### sanitize-html.py - Safe HTML Email Converter

```bash
python3 scripts/sanitize-html.py --input report.md --output email.html [--verbose]
```

- Converts markdown to XSS-safe HTML email with inline CSS
- URL whitelist (http/https only), HTML-escaped text content

#### source-health.py - Source Health Monitor

```bash
python3 scripts/source-health.py --rss FILE --twitter FILE --github FILE --reddit FILE --web FILE [--verbose]
```

- Tracks per-source success/failure history over 7 days
- Reports unhealthy sources (>50% failure rate)

#### summarize-merged.py - Merged Data Summary

```bash
python3 scripts/summarize-merged.py --input merged.json [--top N] [--topic TOPIC]
```

- Human-readable summary of merged data for LLM consumption
- Shows top articles per topic with scores and metrics

## User Customization

### Workspace Configuration Override

Place custom configs in `workspace/config/` to override defaults:
- **Sources:** Append new sources; disable defaults with `"enabled": false`
- **Topics:** Override topic definitions, search queries, display settings
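Disabling a default source or appending a new one amounts to a merge keyed by `id`. A sketch of that override merge, assuming shallow field-level updates (the actual merge code in the scripts may behave differently):

```python
def merge_source_configs(defaults, overrides):
    """Merge user source overrides into defaults by "id".

    Same id  -> the user entry's fields take precedence (shallow update,
                so a partial override like {"id": ..., "enabled": false}
                keeps the default's other fields).
    New id   -> appended after the defaults.
    """
    merged = {s["id"]: dict(s) for s in defaults}
    for o in overrides:
        if o["id"] in merged:
            merged[o["id"]].update(o)   # user fields win
        else:
            merged[o["id"]] = dict(o)   # new source appended
    return list(merged.values())
```

Topics would use a simpler rule: a user topic with the same `id` replaces the default wholesale rather than being field-merged.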
**Merge Logic:**

- Sources with the same `id` → user version takes precedence
- Sources with a new `id` → appended to defaults
- Topics with the same `id` → user version completely replaces the default

### Example Workspace Override

```jsonc
// workspace/config/tech-news-digest-sources.json
{
  "sources": [
    {
      "id": "simonwillison-rss",
      "enabled": false,
      "note": "Disabled: too noisy for my use case"
    },
    {
      "id": "my-custom-blog",
      "type": "rss",
      "name": "My Custom Tech Blog",
      "url": "https://myblog.com/rss",
      "enabled": true,
      "priority": true,
      "topics": ["frontier-tech"]
    }
  ]
}
```

## Templates & Output

### Discord Template (references/templates/discord.md)

- Bullet list format with link suppression (`<>`)
- Mobile-optimized, emoji headers
- 2000 character limit awareness

### Email Template (references/templates/email.md)

- Rich metadata, technical stats, archive links
- Executive summary, top articles section
- HTML-compatible formatting

### PDF Template (references/templates/pdf.md)

- A4 layout with Noto Sans CJK SC font for Chinese support
- Emoji icons, page headers/footers with page numbers
- Generated via scripts/generate-pdf.py (requires weasyprint)

## Default Sources (151 total)

- RSS Feeds (62): AI labs, tech blogs, crypto news, Chinese tech media
- Twitter/X KOLs (48): AI researchers, crypto leaders, tech executives
- GitHub Repos (28): Major open-source projects (LangChain, vLLM, DeepSeek, Llama, etc.)
- Reddit (13): r/MachineLearning, r/LocalLLaMA, r/CryptoCurrency, r/ChatGPT, r/OpenAI, etc.
- Web Search (4 topics): LLM, AI Agent, Crypto, Frontier Tech

All sources are pre-configured with appropriate topic tags and priority levels.

## Dependencies

```bash
pip install -r requirements.txt
```

Optional but recommended:

- feedparser>=6.0.0 - Better RSS parsing (falls back to regex if unavailable)
- jsonschema>=4.0.0 - Configuration validation

All scripts work with the Python 3.8+ standard library only.

## Monitoring & Operations

### Health Checks
```bash
# Validate configuration
python3 scripts/validate-config.py --verbose

# Test RSS feeds
python3 scripts/fetch-rss.py --hours 1 --verbose

# Check Twitter API
python3 scripts/fetch-twitter.py --hours 1 --verbose
```
### Archive Management

Digests are automatically archived to `workspace/archive/tech-news-digest/` (the pipeline's `--archive-dir`).

## API Keys
```bash
# Twitter (at least one required for Twitter source)
export TWITTERAPI_IO_KEY="your_key"         # twitterapi.io key (preferred)
export X_BEARER_TOKEN="your_bearer_token"   # Official X API v2 (fallback)
export TWITTER_API_BACKEND="auto"           # auto|twitterapiio|official (default: auto)

# Web Search (optional, enables web search layer via SkillBoss API Hub)
export SKILLBOSS_API_KEY="your_skillboss_key"  # SkillBoss API Hub key (https://api.heybossai.com/v1/pilot)

# GitHub (optional, improves rate limits)
export GITHUB_TOKEN="ghp_xxx"               # PAT (simplest)
export GH_APP_ID="12345"                    # Or use GitHub App for auto-token
export GH_APP_INSTALL_ID="67890"
export GH_APP_KEY_FILE="/path/to/key.pem"
```
- **Twitter:** `TWITTERAPI_IO_KEY` preferred ($3-5/mo); `X_BEARER_TOKEN` as fallback; auto mode tries twitterapiio first
- **Web Search:** `SKILLBOSS_API_KEY` powers search via SkillBoss API Hub; optional, falls back to agent web_search if unavailable
- **GitHub:** Auto-generates a token from a GitHub App if no PAT is set; unauthenticated fallback (60 req/hr)
- **Reddit:** No API key needed (uses the public JSON API)
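The GitHub auth priority chain ($GITHUB_TOKEN → GitHub App → gh CLI → unauthenticated) could be resolved like the sketch below. Everything here is illustrative: `generate_app_token` is a hypothetical placeholder, and the real scripts may order or implement the fallbacks differently.

```python
import os
import shutil
import subprocess

def generate_app_token():
    """Placeholder: a real implementation would sign a JWT with the app's
    private key and exchange it for an installation token via the GitHub API."""
    raise NotImplementedError

def resolve_github_token():
    """Walk the auth fallback chain; return a token string or None."""
    # 1. Explicit PAT wins.
    token = os.environ.get("GITHUB_TOKEN")
    if token:
        return token
    # 2. GitHub App credentials, if all three are configured.
    if all(os.environ.get(k) for k in ("GH_APP_ID", "GH_APP_INSTALL_ID", "GH_APP_KEY_FILE")):
        return generate_app_token()
    # 3. Logged-in gh CLI, if available.
    if shutil.which("gh"):
        proc = subprocess.run(["gh", "auth", "token"], capture_output=True, text=True)
        if proc.returncode == 0 and proc.stdout.strip():
            return proc.stdout.strip()
    # 4. Unauthenticated (GitHub allows ~60 requests/hour per IP).
    return None
```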
## Cron / Scheduled Task Integration

### OpenClaw Cron (Recommended)

The cron prompt should NOT hardcode the pipeline steps. Instead, reference `references/digest-prompt.md` and pass only configuration parameters. This ensures the pipeline logic stays in the skill repo and is consistent across all installations.
**Daily Digest Cron Prompt:** instruct the job to read `references/digest-prompt.md` and supply only the configuration parameters (paths, channel IDs).
- **Portable:** Same skill on different OpenClaw instances; just change paths and channel IDs
- **Maintainable:** Update the skill → all cron jobs pick up changes automatically
- **Anti-pattern:** Do NOT copy pipeline steps into the cron prompt — it will drift out of sync
### Multi-Channel Delivery Limitation

OpenClaw enforces cross-provider isolation: a single session can only send messages to one provider (e.g., Discord OR Telegram, not both). If you need to deliver digests to multiple platforms, create separate cron jobs for each provider:

```
# Job 1: Discord + Email
# Job 2: Telegram DM
```