
Unable to scrape JavaScript website #891


Open
nihaludin opened this issue Jan 13, 2025 · 9 comments
Labels
bug: Something isn't working
stale: Issue has not had recent activity or appears to be solved. Stale issues will be automatically closed

Comments

@nihaludin

Hi all,

I’m encountering a similar issue and would appreciate your help. Initially, I received the message, "You need to enable JavaScript to run this app."

However, after enabling requires_js_support: True in my configuration, I am now encountering a timeout error.

--- Executing Fetch Node ---
--- (Fetching HTML from: https://www.example.com/) ---
Attempt 1 failed: Page.goto: Timeout 30000ms exceeded.
Call log:

navigating to "https://www.example.com/", waiting until "networkidle"
Here’s my current configuration:

graph_config = { "llm": { "api_key": OPENAI_API_KEY, "model": "openai/gpt-4o", }, "verbose": True, "headless": False, "loader_kwargs": { "requires_js_support": True, "timeout": 120, "retry_limit": 3, "slow_mo": 20000 } }


dosubot bot commented Jan 13, 2025

Hey @nihaludin!

I'm a bot here to help you with bugs, questions, and contributions. Let's tackle this issue together!

To address the timeout error you're experiencing, consider the following adjustments:

  1. Increase Timeout: You can try increasing the timeout parameter further if the page requires more time to load. For example, set it to 180 seconds or more.

  2. Check URL: Ensure that the URL you're trying to scrape actually contains dynamic content that requires JavaScript. The example URL you provided is a placeholder and may not have any content to load.

  3. Debugging: Since you have headless set to False, you can visually inspect what the browser is doing during the scraping process. This can help identify if there are any issues with the page loading.

  4. Network Conditions: Ensure that there are no network issues that might be causing delays in loading the page.

Here's an updated configuration suggestion:

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o",
    },
    "verbose": True,
    "headless": False,
    "loader_kwargs": {
        "requires_js_support": True,
        "timeout": 180,  # Increase timeout
        "retry_limit": 3,
        "slow_mo": 20000
    }
}

These steps should help mitigate the timeout issue. If the problem persists, consider testing with a different URL that has known dynamic content.
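For example, a minimal run against a test page might look like this (a sketch only: the prompt and source are placeholders, and graph_config is the configuration above):

from scrapegraphai.graphs import SmartScraperGraph

# Sketch: swap the source for a URL you know requires JavaScript.
smart_scraper = SmartScraperGraph(
    prompt="List the main product categories on the page",
    source="https://www.example.com/",
    config=graph_config,
)
print(smart_scraper.run())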

To continue talking to Dosu, mention @dosu.



dosubot added the bug label on Jan 13, 2025
@nihaludin
Author

I increased the timeout parameter to 1000, but I'm still getting the same error. @dosu


dosubot bot commented Jan 13, 2025

If increasing the timeout to 1000 seconds didn't resolve the issue, consider the following steps:

  1. Verify JavaScript Content: Double-check that the URL you're trying to scrape actually requires JavaScript to load content. If the page is static or doesn't have dynamic content, enabling JavaScript support might not be necessary.

  2. Inspect Network Activity: Use the browser's developer tools to inspect network activity when loading the page manually. This can help identify if there are specific resources or scripts that are taking too long to load.

  3. Check for Errors: Look for any JavaScript errors in the console when loading the page manually. These errors might indicate issues that could affect the scraping process.

  4. Test with Different URLs: Try scraping a different URL that you know requires JavaScript and has dynamic content. This can help determine if the issue is specific to the current URL.

  5. Review Proxy and Network Settings: Ensure that there are no proxy or network settings that might be interfering with the page load.

  6. Adjust Slow Motion: Consider reducing the slow_mo parameter to see if it affects the loading time. While slow_mo is useful for debugging, it can slow down the overall process.

If these steps don't resolve the issue, it might be helpful to log more detailed information about the scraping attempts to diagnose the problem further.
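On point 6, for instance, the loader settings could be retried with slow_mo reduced (a sketch, assuming loader_kwargs forwards slow_mo to Playwright, where it adds that many milliseconds to every browser operation):

graph_config["loader_kwargs"] = {
    "requires_js_support": True,
    "timeout": 180,   # seconds
    "retry_limit": 3,
    "slow_mo": 100,   # 20000 here would add roughly 20 seconds per browser action
}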

To continue talking to Dosu, mention @dosu.



@nihaludin
Author

@VinciGit00 I'm not able to fetch details even after setting "requires_js_support": True. Is there any other approach to overcome this issue? Thanks!

@VinciGit00
Collaborator

Why did you navigate to https://www.example.com/? It does not require JavaScript.

@nihaludin
Author

@VinciGit00 Apologies, that's just an example, not the actual site. I'm working with an e-commerce website that uses JavaScript.

@VinciGit00
Collaborator

Ok, send the complete code please.

@nihaludin
Author

graph_config = {
    "llm": {
        "api_key": OPENAI_API_KEY,
        "model": "openai/gpt-4o-mini",
    },
    "verbose": True,
    "headless": False,
    "loader_kwargs": {
        # "requires_js_support": True,
        "timeout": 120,
        "retry_limit": 1,
        # "slow_mo": 200000,
    }
}

smart_scraper = SmartScraperGraph(
    prompt="Segment the URLs with their respective category based on product type",  # prompt for extracting categories
    source="some ecommerce website",
    config=graph_config,
)

result = smart_scraper.run()

@VinciGit00, please find the code above. Thanks!


dosubot bot commented Jun 2, 2025

Hi, @nihaludin. I'm Dosu, and I'm helping the Scrapegraph-ai team manage their backlog. I'm marking this issue as stale.

Issue Summary:

  • You are experiencing a timeout error while scraping a JavaScript-dependent e-commerce website.
  • Despite increasing the timeout parameter, the issue persists.
  • I suggested troubleshooting steps, including verifying JavaScript necessity and inspecting network activity.
  • VinciGit00 noted the example URL doesn't require JavaScript, leading to clarification that the target is an e-commerce site.
  • You provided the complete code for further assistance, but the issue remains unresolved.

Next Steps:

  • Please confirm if this issue is still relevant to the latest version of the Scrapegraph-ai repository by commenting here.
  • If no updates are provided, this issue will be automatically closed in 7 days.

Thank you for your understanding and contribution!

dosubot added the stale label on Jun 2, 2025