
[Bug]: Deep crawling is exceeding the max_pages parameter and continuing beyond the set limit. #927

Open
Harinib-Kore opened this issue Apr 2, 2025 · 2 comments
Labels: 🐞 Bug (Something isn't working) · ⚙️ In-progress · 💪 Intermediate (difficulty level)

Harinib-Kore commented Apr 2, 2025

crawl4ai version

0.5.0.post4

Expected Behavior

The crawler should stop after crawling 10 pages, as specified by max_pages=10.
len(results) should report a maximum of 10 pages.

Current Behavior

When using AsyncWebCrawler with BestFirstCrawlingStrategy and setting max_pages=10, the crawler unexpectedly crawls more pages than specified. In my case, it crawled 17 pages instead of stopping at 10.

Is this reproducible?

Yes

Inputs Causing the Bug

I think adding the filter chain is what is causing this bug.

Steps to Reproduce

Code snippets

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)

async def main():
    filter_chain = FilterChain([
        DomainFilter(allowed_domains=["kore.ai"]),
        URLPatternFilter(patterns=["*use-cases*", "*blog*", "*research*"]),
        ContentTypeFilter(allowed_types=["text/html"])
    ])
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=3, 
            include_external=False,
            max_pages=10,
            filter_chain=filter_chain
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        verbose=True
    )

    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun("https://kore.ai/use-cases", config=config)
        print(f"Crawled {len(results)} pages in total")

if __name__ == "__main__":
    asyncio.run(main())
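
Until this is fixed, one possible caller-side workaround is to consume results in streaming mode and stop manually once 10 pages have arrived. This is only a sketch: it assumes the installed version supports CrawlerRunConfig(stream=True), with arun returning an async generator of results as described in the crawl4ai deep-crawling docs.

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy
from crawl4ai.deep_crawling import BestFirstCrawlingStrategy
from crawl4ai.deep_crawling.filters import (
    FilterChain,
    DomainFilter,
    URLPatternFilter,
    ContentTypeFilter
)

MAX_PAGES = 10

async def main():
    filter_chain = FilterChain([
        DomainFilter(allowed_domains=["kore.ai"]),
        URLPatternFilter(patterns=["*use-cases*", "*blog*", "*research*"]),
        ContentTypeFilter(allowed_types=["text/html"])
    ])
    config = CrawlerRunConfig(
        deep_crawl_strategy=BestFirstCrawlingStrategy(
            max_depth=3,
            include_external=False,
            max_pages=MAX_PAGES,
            filter_chain=filter_chain
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,  # assumption: streaming mode yields results one at a time
        verbose=True
    )

    results = []
    async with AsyncWebCrawler() as crawler:
        # In streaming mode we can enforce the page budget on the caller side,
        # regardless of how far the strategy's internal counter overshoots.
        async for result in await crawler.arun("https://kore.ai/use-cases", config=config):
            results.append(result)
            if len(results) >= MAX_PAGES:
                break
    print(f"Crawled {len(results)} pages in total")

if __name__ == "__main__":
    asyncio.run(main())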

OS

Linux

Python version

3.9.7

Browser

Chrome

Browser version

131.0.6778.139

Error logs & Screenshots (if applicable)

(Screenshot of the crawler output attached in the original issue.)

Harinib-Kore added the 🐞 Bug and 🩺 Needs Triage labels on Apr 2, 2025
Harinib-Kore (Author) commented Apr 2, 2025

@aravindkarnam

unclecode added the ⚙️ In-progress and 💪 Intermediate labels and removed the 🩺 Needs Triage label on Apr 2, 2025
unclecode (Owner) commented

Thanks for reporting this, @Harinib-Kore

After reviewing the code based on your report, we can confirm this is indeed a bug in how max_pages is handled within the deep crawling strategies when URLs are processed in batches. The FilterChain you suspected is not the direct cause, though.

@ntohidi @aravindkarnam

The core issue lies in the timing of the max_pages check relative to processing results from crawler.arun_many.

  1. Current behavior: the check if self._pages_crawled >= self.max_pages: runs before a batch of URLs is processed, while the counter self._pages_crawled is incremented inside the loop that handles that batch's results.
  2. Problem: the counter can therefore exceed the max_pages limit while a batch is being processed, but the crawl only stops after that batch is fully processed and before the next batch starts. This produces the overshoot you observed (see the sketch below).
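
To make the overshoot concrete, here is a self-contained toy simulation of that ordering. It is not the library's actual code (which uses self._pages_crawled and crawler.arun_many as described above), and the batch sizes are made up purely to illustrate how 17 pages can get through a limit of 10.

# Toy simulation of the current ordering: the limit is checked only between
# batches, while the counter grows inside the batch loop, so an in-flight
# batch can push the total past max_pages.

def simulate_buggy_crawl(batches, max_pages):
    pages_crawled = 0
    for batch in batches:
        # Check happens BEFORE the batch is processed...
        if pages_crawled >= max_pages:
            break
        for url in batch:
            # ...but the counter is only incremented here, with no mid-batch
            # check, so the whole batch is always consumed.
            pages_crawled += 1
    return pages_crawled

# Hypothetical frontier batches: 1 seed page, then two batches of links
# discovered from it (sizes chosen only to reproduce the reported 17 pages).
batches = [["seed"], [f"u{i}" for i in range(7)], [f"v{i}" for i in range(9)]]
print(simulate_buggy_crawl(batches, max_pages=10))  # prints 17, not 10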

Required Fix Hint:

We need to add an additional check for the max_pages limit immediately after self._pages_crawled is incremented inside the inner result-processing loops (async for result in ... or similar) within all relevant deep crawling strategies (like BFSDeepCrawlStrategy, BestFirstCrawlingStrategy, etc.).

Implementation Steps:

  1. Locate the self._pages_crawled += 1 line within the result loops in each deep crawl strategy's run methods (e.g., _arun_batch, _arun_stream, _arun_best_first).
  2. Immediately after incrementing the counter, add a check:
    if self._pages_crawled >= self.max_pages:
        self.logger.info(f"Max pages limit ({self.max_pages}) reached, stopping processing.")
        break # Exit the inner loop handling the current batch/stream
  3. Ensure link_discovery is only called for a result when the limit has not yet been reached; the break then stops the remaining results in the batch from being processed.
  4. Apply this fix consistently across all deep crawling strategies that implement max_pages.

This change will ensure the strategies stop processing and yielding results much closer to the specified max_pages limit.
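
For concreteness, here is the same toy simulation as above with the post-increment check from step 2 applied. With the hypothetical batch sizes used earlier it now stops at exactly 10; the real strategies would apply the same pattern inside their async result loops.

# The same toy simulation with the suggested fix: the limit is re-checked
# immediately after each increment, so processing stops mid-batch.

def simulate_fixed_crawl(batches, max_pages):
    pages_crawled = 0
    for batch in batches:
        if pages_crawled >= max_pages:
            break
        for url in batch:
            pages_crawled += 1
            # New check immediately after the increment; in the real
            # strategies this break also skips link_discovery for any
            # further results in the batch/stream.
            if pages_crawled >= max_pages:
                break
    return pages_crawled

batches = [["seed"], [f"u{i}" for i in range(7)], [f"v{i}" for i in range(9)]]
print(simulate_fixed_crawl(batches, max_pages=10))  # prints 10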
