
[Bug]: Error: Page.content: Target page, context or browser has been closed #842

Open
eliaweiss opened this issue Mar 16, 2025 · 16 comments
Assignees
Labels
🐞 Bug Something isn't working ⚙️ In-progress Issues, Features requests that are in Progress 📌 Root caused identified the root cause of bug

Comments

@eliaweiss

crawl4ai version

0.5.0.post4

Expected Behavior

Crawler should crawl

Current Behavior

I get the following error:

[ERROR]... × https://out-door.co.il/product/%d7%a4%d7%90%d7%a0%... | Error:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 528 in wrap_api_call (venv/lib/python3.12/site- │
│ packages/playwright/_impl/_connection.py): │
│ Error: Page.content: Target page, context or browser has been closed │
│ │
│ Code context: │
│ 523 parsed_st = _extract_stack_trace_information_from_stack(st, is_internal) │
│ 524 self._api_zone.set(parsed_st) │
│ 525 try: │
│ 526 return await cb() │
│ 527 except Exception as error: │
│ 528 → raise rewrite_error(error, f"{parsed_st['apiName']}: {error}") from None │
│ 529 finally: │
│ 530 self._api_zone.set(None) │
│ 531 │
│ 532 def wrap_api_call_sync( │
│ 533 self, cb: Callable[[], Any], is_internal: bool = False │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

This happens after about 50 to 100 pages.

I use an EC2 t2.large instance, and this is my code:

@app.post("/crawl", response_model=CrawlResponse)
async def crawl(request: CrawlRequest):
"""
Run the crawler on the specified URL
"""
print(request)

try:
    # Convert UUID to string for the query
    crawler_config = execute_select_query(f"SELECT * FROM crawls WHERE id = '{request.crawler_id}'")
    if not crawler_config:
        raise HTTPException(
            status_code=404,
            detail=f"Crawler config not found for id: {request.crawler_id}"
        )
    
    crawler_config = crawler_config[0]
    root_url = crawler_config['root_url']
    logger.info(f"🔍 Starting crawl for URL: {root_url}")
    
    depth = crawler_config.get('depth', 1)
    include_external = crawler_config.get('include_external', False)
    max_pages = crawler_config.get('max_pages', 5)
    
    # Step 1: Create a pruning filter
    prune_filter = PruningContentFilter(
        # Lower → more content retained, higher → more content pruned
        threshold=0.45,           
        # "fixed" or "dynamic"
        threshold_type="dynamic",  
        # Ignore nodes with <5 words
        min_word_threshold=5      
    )

    # Step 2: Insert it into a Markdown Generator
    md_generator = DefaultMarkdownGenerator(content_filter=prune_filter) #, options={"ignore_links": True}

    # Step 3: Pass it to CrawlerRunConfig
    # Configure the crawler
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=depth,
            include_external=include_external,
            max_pages=max_pages
        ),
        scraping_strategy=LXMLWebScrapingStrategy(),
        stream=True,
        verbose=True,
        markdown_generator=md_generator
    )

    crawled_pages = []
    page_count = 0

    # Run the crawler
    async with AsyncWebCrawler() as crawler:
        try:
            async for result in await crawler.arun(crawler_config['root_url'], config=config):
                processed_result = await process_crawl_result(crawler_config, result)
                crawled_pages.append(processed_result)
                page_count += 1
                logger.info(f"Processed page {page_count}: {result.url}")
        except Exception as crawl_error:
            logger.error(f"Error during crawling: {str(crawl_error)}")
            raise HTTPException(
                status_code=500,
                detail=f"Crawling process failed: {str(crawl_error)}"
            )

    result = {
        "url": root_url,
        "depth": depth,
        "pages_crawled": page_count,
        "crawled_pages": crawled_pages
    }
    
    return CrawlResponse(
        status="success",
        data=result
    )

except Exception as e:
    logger.error(f"Crawling error: {str(e)}")
    raise HTTPException(
        status_code=500,
        detail=f"Crawling failed: {str(e)}"
    )

Any idea how to debug this?
What does this error mean?

My guess is that the headless browser is crashing, but I'm not sure how to debug it or why it would happen.

When I run a crawler with a simple fetch I can crawl all 483 pages on the website, but with crawl4ai it crashes after about 50 to 100 pages and just prints a list of these errors.
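One thing I plan to try is watching memory while the crawl runs; a rough sketch using psutil (a generic monitoring library, nothing to do with crawl4ai):

import asyncio
import psutil  # generic system-monitoring library, not part of crawl4ai

async def log_memory(interval_s: float = 5.0):
    # Print available RAM every few seconds while the crawl runs, to see whether
    # the t2.large is running out of memory before the browser dies.
    while True:
        mem = psutil.virtual_memory()
        print(f"available: {mem.available / 1024**2:.0f} MiB ({mem.percent}% used)")
        await asyncio.sleep(interval_s)

# started with asyncio.create_task(log_memory()) right before the async for loop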

Is this reproducible?

Yes

Inputs Causing the Bug

Steps to Reproduce

Code snippets

OS

Ubuntu (EC2 t2.large)

Python version

3.12.3

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

@eliaweiss eliaweiss added 🐞 Bug Something isn't working 🩺 Needs Triage Needs attention of maintainers labels Mar 16, 2025
@eliaweiss
Author

Here is some more info:

  1. max_pages is ignored

     max_pages = 10

     # Configure a 2-level deep crawl
     config = CrawlerRunConfig(
         semaphore_count=1,
         deep_crawl_strategy=BFSDeepCrawlStrategy(
             max_depth=10,
             include_external=False,
             # Maximum number of pages to crawl (optional)
             max_pages=max_pages
         ),
         scraping_strategy=LXMLWebScrapingStrategy(),
         stream=True,  # Enable streaming
         verbose=True
     )

  2. Adding a break

     page_count = 0
     async with AsyncWebCrawler() as crawler:
         async for result in await crawler.arun("https://out-door.co.il/", config=config):
             page_count += 1
             print(f"page_count {page_count}")
             if page_count > 10:
                 break
             await process_result(result)

causes this error:
[ERROR]... × https://out-door.co.il/product-category/%d7%9e%d7%... | Error:
┌───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ × Unexpected error in _crawl_web at line 579 in _crawl_web (venv/lib/python3.10/site- │
│ packages/crawl4ai/async_crawler_strategy.py): │
│ Error: Failed on navigating ACS-GOTO: │
│ Page.goto: net::ERR_ABORTED; maybe frame was detached? │
│ Call log: │
│ - navigating to "https://out-door.co.il/product-category/%d7%9e%d7%a2%d7%a7%d7%94-%d7%a7%d7%a6%d7%94-
│ %d7%9c%d7%9e%d7%9b%d7%99%d7%a8%d7%94/%d7%a1%d7%95%d7%92%d7%99-%d7%9e%d7%a2%d7%a7%d7%95%d7%aa", waiting until │
│ "domcontentloaded" │
│ │
│ │
│ Code context: │
│ 574 response = await page.goto( │
│ 575 url, wait_until=config.wait_until, timeout=config.page_timeout │
│ 576 ) │
│ 577 redirected_url = page.url │
│ 578 except Error as e: │
│ 579 → raise RuntimeError(f"Failed on navigating ACS-GOTO:\n{str(e)}") │
│ 580 │
│ 581 await self.execute_hook( │
│ 582 "after_goto", page, context=context, url=url, response=response, config=config │
│ 583 ) │
│ 584 │
└───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

@eliaweiss
Author

Although it shows 500 crawled pages, it only saves 250. Does it know how to handle repeated links?

@eliaweiss
Author

It seems that I was able to suppress this issue by setting semaphore_count=1.
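For reference, a minimal sketch of the workaround (import paths as in the docs; the depth/page values are just placeholders):

import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy
from crawl4ai.content_scraping_strategy import LXMLWebScrapingStrategy

config = CrawlerRunConfig(
    semaphore_count=1,  # crawl one page at a time; this is what suppressed the crashes for me
    deep_crawl_strategy=BFSDeepCrawlStrategy(
        max_depth=2,
        include_external=False,
        max_pages=50
    ),
    scraping_strategy=LXMLWebScrapingStrategy(),
    stream=True,
    verbose=True
)

async def main():
    async with AsyncWebCrawler() as crawler:
        async for result in await crawler.arun("https://out-door.co.il/", config=config):
            print(result.url, result.success)

asyncio.run(main())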

@inVains

inVains commented Mar 17, 2025

Same problem.

@eliaweiss
Author

I'm pretty sure the problem is in Playwright/Chromium rather than crawl4ai, and that it is a resource problem.

Note that a similar problem is reported on the Playwright project.

@aravindkarnam
Collaborator

I'm pretty sure the problem is in Playwright/Chromium rather than crawl4ai, and that it is a resource problem.

Note that a similar problem is reported on the Playwright project.

@eliaweiss Do you have the issue ID for the problem reported on the Playwright project? Can you link it here?

@aravindkarnam aravindkarnam added ❓ Question Q&A and removed 🩺 Needs Triage Needs attention of maintainers labels Mar 17, 2025
@eliaweiss
Author

@aravindkarnam
See this issue microsoft/playwright#13038

The error message is different, but my log contained a ton of error messages, and I later realized that the first one was:
browser.newContext: Target page, context or browser has been closed

which is also reported in playwright/issues/13038.

@no-chris

On my side, I fixed it by switching from Chromium to Firefox:
https://docs.crawl4ai.com/api/parameters/
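Roughly like this, if I'm reading the parameters page right (remember to install the browser binary first with: python -m playwright install firefox):

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig

# browser_type can be "chromium", "firefox" or "webkit"
browser_config = BrowserConfig(browser_type="firefox", headless=True)

async def crawl_one(url: str):
    async with AsyncWebCrawler(config=browser_config) as crawler:
        return await crawler.arun(url, config=CrawlerRunConfig(verbose=True))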

@aysan0

aysan0 commented Mar 26, 2025

Same problem; it consistently happens on the second crawl attempt. Any updates here?

@Sandy-Tsang

Same problem here. I changed my browser to Firefox, and the bug was not fixed.

@aravindkarnam
Collaborator

aravindkarnam commented Mar 28, 2025

RCA

When making consecutive requests to the /crawl endpoint, the second request would fail with:

"BrowserType.launch: Target page, context or browser has been closed"

The BrowserManager class in Crawl4AI implemented a singleton pattern for the Playwright instance using a static class variable:

_playwright_instance = None
    
@classmethod
async def get_playwright(cls):
    if cls._playwright_instance is None:
        cls._playwright_instance = await async_playwright().start()
    return cls._playwright_instance

When the browser was closed after the first request, the close() method properly stopped the Playwright instance, but did not reset the static _playwright_instance reference:

async def close(self):
    # ...
    if self.playwright:
        await self.playwright.stop()
        self.playwright = None
    # Missing: BrowserManager._playwright_instance = None

This caused subsequent requests to try using an already-closed Playwright instance.

Why Did This Only Appear in the Server Environment?

This issue specifically manifested in the server environment because:

- In server contexts, the process remains alive between requests
- Static/class variables persist across multiple requests
- In library usage, the process would typically terminate after use, naturally cleaning up all resources

Solution

We modified the close() method in the AsyncPlaywrightCrawlerStrategy class to reset the Playwright instance after cleanup:

async def close(self):
    """
    Close the browser and clean up resources.
    """
    await self.browser_manager.close()
    
    # Reset the static Playwright instance
    BrowserManager._playwright_instance = None

This ensures that each new request gets a fresh Playwright instance, preventing the error while maintaining the resource efficiency benefits of the singleton pattern within a single request's lifecycle.
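If you want to sanity-check this outside the server, a minimal sketch of the failure mode is simply two crawls back to back in the same long-lived process (plain crawl4ai usage, nothing server-specific):

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def crawl_once(url: str) -> bool:
    # Each call opens and closes its own crawler, like a single /crawl request does
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url, config=CrawlerRunConfig(verbose=True))
        return result.success

async def main():
    # Before the fix, the second call failed with
    # "BrowserType.launch: Target page, context or browser has been closed"
    # because the stopped Playwright singleton was reused.
    print(await crawl_once("https://example.com"))
    print(await crawl_once("https://example.com"))

asyncio.run(main())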

@aysan0

aysan0 commented Mar 28, 2025

@aravindkarnam awesome! Appreciate the quick turnaround!
Is there a PR?

aravindkarnam added a commit that referenced this issue Mar 28, 2025
@aravindkarnam
Collaborator

aravindkarnam commented Mar 28, 2025

@aysan0 Yeah. This was quite a mole hunt! I need some help with testing this out first. I pushed this to the bug fix branch. Could you pull this, run it once and give me confirmation that this indeed fixes the issue.

@aravindkarnam aravindkarnam added ⚙️ In-progress Issues, Features requests that are in Progress 📌 Root caused identified the root cause of bug and removed ❓ Question Q&A labels Mar 31, 2025
@aravindkarnam aravindkarnam self-assigned this Mar 31, 2025
@Rolnand

Rolnand commented Mar 31, 2025

@aysan0 Yeah. This was quite a mole hunt! I need some help with testing this out first. I pushed this to the bug fix branch. Could you pull this, run it once and give me confirmation that this indeed fixes the issue.

It works. Thank you so much for fixing the bug!

@Sanjaypranav

Is that fixed now?

@StefanSamba

I don't think it is fixed yet; meanwhile, you can monkey-patch it in your code. When the fix is released, you can upgrade the package and omit the patch.

from crawl4ai.async_crawler_strategy import AsyncPlaywrightCrawlerStrategy
from crawl4ai.browser_manager import BrowserManager


async def patched_async_playwright__crawler_strategy_close(self) -> None:
    """
    Close the browser and clean up resources.

    This patch addresses an issue with Playwright instance cleanup where the static instance
    wasn't being properly reset, leading to issues with multiple crawls.

    Issue: https://github.com/unclecode/crawl4ai/issues/842

    Returns:
        None
    """
    await self.browser_manager.close()

    # Reset the static Playwright instance
    BrowserManager._playwright_instance = None


AsyncPlaywrightCrawlerStrategy.close = patched_async_playwright__crawler_strategy_close
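Keep the patch in a module of its own (the module name below is just an example) and import it once at startup, before the first crawler is created:

# main.py — example FastAPI entry point; importing the patch module applies it process-wide
from fastapi import FastAPI

import crawl4ai_close_patch  # the module holding the patch above (name is arbitrary)

app = FastAPI()

# Every AsyncWebCrawler created after this point uses the patched close()
# and resets BrowserManager._playwright_instance between requests.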
