This is a real-time news search engine using live ingested data from various news APIs.
In today's digital age, the volume of data has grown rapidly, and it has to be acquired, curated, parsed, categorized, and organized into a coherent structure before it becomes useful. With the recent advances in LLMs and AI-generated content, it can be hard to filter your information stream. This article tackles exactly that by providing a customizable way of keeping your knowledge base up to date. In real-world scenarios data flows continuously, so it has to be fetched, curated, parsed, grouped, and presented in a structured format on an ongoing basis. Yahoo Finance for finance, HackerRank for tech, and ESPN for sports all follow the same underlying principle: provide niche, high-quality news articles.
The application we're going to build follows a pattern commonly used in RAG (Retrieval-Augmented Generation) systems. The only difference is that we won't use an LLM; instead we focus on data ingestion, processing, and retrieval, which are the key building blocks of a RAG system.
Here's how we're going to proceed with the project:
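To make the "retrieval without an LLM" idea concrete, here is a minimal sketch under stated assumptions. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the in-memory `index` dict stands in for the vector database; neither is the project's actual implementation.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: a sparse term-count vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, index: dict, top_k: int = 2) -> list:
    """Rank every indexed document against the query and keep the best top_k."""
    q = embed(query)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


# Hypothetical mini-corpus, used only for illustration.
articles = {
    "a1": "RBI orders Paytm Payments Bank to stop FASTag top-ups",
    "a2": "Russian presidential elections begin this week",
    "a3": "Advance tax payment deadline is approaching",
}
index = {doc_id: embed(text) for doc_id, text in articles.items()}
results = search("presidential elections", index)
```

The ranked `results` list is what a search UI would render directly; in a full RAG setup the same list would instead be passed to an LLM as context.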
- ingest articles from news APIs
- spawn multiple Kafka Producer threads and push messages to Upstash Kafka cluster
- use Bytewax to process payloads from Kafka
- parse, chunk, and embed payloads using Pydantic models, then upsert to Upstash Vector Database
- build a UI with a search bar, and render results from Vector Database
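The producer step in the list above can be sketched with the standard library alone. This is an illustration of the threading pattern only: a `queue.Queue` stands in for the Kafka topic, and `fetch_articles` is a hypothetical placeholder for a real HTTP call to a news API.

```python
import json
import queue
import threading

# Stand-in for the Kafka topic: a thread-safe in-memory queue.
topic = queue.Queue()


def fetch_articles(source: str) -> list:
    """Hypothetical fetcher; a real one would call the news API over HTTP."""
    return [{"source": source, "title": f"{source} headline {i}"} for i in range(3)]


def produce(source: str) -> None:
    """Fetch one batch from a source and push each article as a JSON message."""
    for article in fetch_articles(source):
        topic.put(json.dumps(article).encode("utf-8"))


# One producer thread per news source.
threads = [
    threading.Thread(target=produce, args=(name,))
    for name in ("newsapi", "newsdataio")
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Drain the queue the way a downstream consumer would.
messages = []
while not topic.empty():
    messages.append(json.loads(topic.get()))
```

Serializing each article to JSON bytes before publishing keeps producers and the stream processor decoupled, because the consumer only needs to agree on the message format.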
- Python (version 3.9)
- Miniconda (version 24.1.2)
- Poetry (version 1.7.1)
- GNU Make (version 3.81)
- Docker (version 24.0.7)
We're using Poetry to manage the environment and dependencies of this project. You don't need a GPU to run it.
To install, run the following command:
make install
This will create a new conda environment called py39upstash, activate it and install dependencies defined in pyproject.toml.
The Makefile found at the root of this project contains 5 commands:
- test: will run the defined unit tests
- run_producers: will start the Kafka Producer Threads that ingest from NewsAPIs
- run_pipeline: will start the Bytewax Stream Processing consumer to parse messages from Kafka, embed and push to VectorDB
- clean_vdb: [WARNING] This is used to purge the VectorDB
- run_ui: will start the interactive Streamlit UI.
Here's the full command-set to start the solution:
- Start Producers :
make run_producers
- Start Bytewax Consumer :
make run_pipeline
- Start UI :
make run_ui
NOTE:
To ingest News Articles, we're using 2 APIs which offer free-tier access:
- NewsAPI
- Articles fetched from here have their content field limited to 260 characters. You can get the full content with a paid plan, but for our use case this is more than enough.
- Payload example:
{
"source": {
"id": null,
"name": "News18"
},
"author": "News18",
"title": "Still Using Paytm FASTag? Here Is A Step-by-Step Guide To Port To A New FASTag - News18",
"description": "NHAI suggests users acquire FASTags from the 32 banks that are now on the authorised list for FASTag issuing",
"url": "https://www.news18.com/business/still-using-paytm-fastag-here-is-a-step-by-step-guide-to-port-to-a-new-fastag-8814964.html",
"urlToImage": "https://images.news18.com/ibnlive/uploads/2024/02/untitled-design-2024-02-12t023732.441-2024-02-d1cbfb73a1fd442b891b3917ea3d4de1-16x9.jpg?impolicy=website&width=1200&height=675",
"publishedAt": "2024-03-14T12:18:27Z",
"content": "In a recent move, the Reserve Bank of India (RBI) ordered Paytm Payments Bank Ltd. (PPBL) to cease taking deposits or top-ups in any client accounts including wallets and FASTags after February 29, 2\u2026 [+3529 chars]"
}
- NewsDataIO
- Articles fetched from this API have their full content field masked, but they provide a description field that summarizes the content.
- Payloads from here have more fields than those from NewsAPI, but some of them are available only on paid plans.
- Payload example:
{
"article_id": "0b09d2891dcb9085f2d5201249356458",
"title": "Top events of the day: From PM Modi's Kerala visit to Russian presidential elections, track top news on March 15 here",
"link": "https://www.livemint.com/news/india/top-events-of-the-day-from-pm-modis-kerala-visit-to-russian-presidential-elections-track-top-news-on-march-15-here-11710464847509.html",
"keywords": "None",
"creator": "None",
"video_url": "None",
"description": "Top news of the day: PM Modi's Lok Sabha poll campaign in Kerala, Rahul Gandhi to address a public rally in Bhiwandi, deadline of advance tax payment, Russia's presidential elections, and more",
"content": "ONLY AVAILABLE IN PAID PLANS",
"pubDate": "2024-03-15 01:42:57",
"image_url": "https://www.livemint.com/lm-img/img/2024/03/15/1600x900/Modi-16_1710466650723_1710466677228.jpg",
"source_id": "livemint",
"source_url": "https://www.livemint.com",
"source_icon": "https://i.bytvi.com/domain_icons/livemint.png",
"source_priority": 7134,
"country": ["india"],
"category": ["top"],
"language": "english",
"ai_tag": "ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE PLANS",
"sentiment": "ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE PLANS",
"sentiment_stats": "ONLY AVAILABLE IN PROFESSIONAL AND CORPORATE PLANS",
"ai_region": "ONLY AVAILABLE IN CORPORATE PLANS"
}
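Since the two APIs return differently shaped payloads, one way to handle them downstream is to map both onto a single schema before chunking and embedding. The sketch below is an assumption-laden illustration: `Article`, `from_newsapi`, and `from_newsdataio` are hypothetical names, a stdlib dataclass stands in for the Pydantic models the project uses, and `pubDate` is assumed to be UTC because the sample carries no offset.

```python
import re
from dataclasses import dataclass
from datetime import datetime, timezone

# Matches the trailing "… [+3529 chars]" marker seen in the NewsAPI sample.
_TRUNCATION = re.compile(r"\s*(?:\u2026|\.\.\.)?\s*\[\+\d+ chars\]\s*$")


@dataclass(frozen=True)
class Article:
    """Hypothetical common schema; the project itself uses Pydantic models."""
    title: str
    url: str
    text: str
    source: str
    published_at: datetime


def from_newsapi(payload: dict) -> Article:
    """Map a NewsAPI payload, stripping the truncation marker from content."""
    content = _TRUNCATION.sub("", payload.get("content") or "")
    description = payload.get("description") or ""
    published = datetime.strptime(payload["publishedAt"], "%Y-%m-%dT%H:%M:%SZ")
    return Article(
        title=payload["title"],
        url=payload["url"],
        text=" ".join(part for part in (description, content) if part),
        source=payload["source"]["name"],
        published_at=published.replace(tzinfo=timezone.utc),
    )


def from_newsdataio(payload: dict) -> Article:
    """Map a NewsDataIO payload; content is masked, so only description is used."""
    # Assumption: pubDate has no offset in the sample and is treated as UTC.
    published = datetime.strptime(payload["pubDate"], "%Y-%m-%d %H:%M:%S")
    return Article(
        title=payload["title"],
        url=payload["link"],
        text=payload.get("description") or "",
        source=payload["source_id"],
        published_at=published.replace(tzinfo=timezone.utc),
    )


# Trimmed copies of the two sample payloads shown above.
a = from_newsapi({
    "source": {"id": None, "name": "News18"},
    "title": "Still Using Paytm FASTag? Here Is A Step-by-Step Guide To Port To A New FASTag - News18",
    "description": "NHAI suggests users acquire FASTags from the 32 banks that are now on the authorised list for FASTag issuing",
    "url": "https://www.news18.com/business/still-using-paytm-fastag-here-is-a-step-by-step-guide-to-port-to-a-new-fastag-8814964.html",
    "publishedAt": "2024-03-14T12:18:27Z",
    "content": "In a recent move, the Reserve Bank of India (RBI) ordered Paytm Payments Bank Ltd. (PPBL) to cease taking deposits or top-ups in any client accounts including wallets and FASTags after February 29, 2\u2026 [+3529 chars]",
})
b = from_newsdataio({
    "title": "Top events of the day: From PM Modi's Kerala visit to Russian presidential elections, track top news on March 15 here",
    "link": "https://www.livemint.com/news/india/top-events-of-the-day-from-pm-modis-kerala-visit-to-russian-presidential-elections-track-top-news-on-march-15-here-11710464847509.html",
    "description": "Top news of the day: PM Modi's Lok Sabha poll campaign in Kerala, Rahul Gandhi to address a public rally in Bhiwandi, deadline of advance tax payment, Russia's presidential elections, and more",
    "pubDate": "2024-03-15 01:42:57",
    "source_id": "livemint",
})
```

With both sources reduced to the same `Article` shape, the later chunking and embedding steps do not need to know which API a record came from.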
This article is an open-source project released under the MIT license. Thus, as long as you distribute our LICENSE and acknowledge our work, you can safely clone or fork this project and use it as a source of inspiration for whatever you want (e.g., work, university projects, college degree projects, etc.).
Sakalya Mitra, ML Developer | Data Scientist