
[Bug]: exclude_domains invalid #925

Closed
cccmolo opened this issue Apr 2, 2025 · 1 comment
cccmolo commented Apr 2, 2025

crawl4ai version

0.5.0.post8

Expected Behavior

import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

crawler_cfg = CrawlerRunConfig(
    exclude_domains=["camo.githubusercontent.com", "img.shields.io"],
    # exclude_social_media_links=True,  # skip Twitter, Facebook, etc.
    wait_for_images=True,  # ensure images are loaded
    verbose=True
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://github.com/unclecode/crawl4ai/blob/main/README.md",
            config=crawler_cfg
        )
        if result.success:
            for image in result.media.get("images", []):
                print(image["src"])

if __name__ == "__main__":
    asyncio.run(main())


Current Behavior

Images from "camo.githubusercontent.com" and "img.shields.io" still appear in the results. Why?

Is this reproducible?

No

Inputs Causing the Bug

Steps to Reproduce

Code snippets

OS

macOS

Python version

3.11

Browser

No response

Browser version

No response

Error logs & Screenshots (if applicable)

No response

@cccmolo cccmolo added 🐞 Bug Something isn't working 🩺 Needs Triage Needs attention of maintainers labels Apr 2, 2025
unclecode (Owner) commented
@cccmolo Thanks for reporting this issue. The behavior you're seeing is expected. The exclusion based on the base domain applies only to external content. The code checks if an image or link is internal by comparing its domain to the page's base domain. That means even if the target domain appears in your exclude_domains list, internal images remain untouched.

To filter out images from external domains like "camo.githubusercontent.com" or "img.shields.io," you need to enable the exclude_external_images flag. For example:

crawler_cfg = CrawlerRunConfig(
    exclude_domains=["camo.githubusercontent.com", "img.shields.io"],
    exclude_external_images=True,  # This will filter out external images matching the domains
    wait_for_images=True,
    verbose=True
)

This configuration ensures that only external images with the specified domains are removed, while internal resources remain unaffected.
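The internal-vs-external distinction described above can be sketched as a plain domain comparison. This is an illustrative stand-in, not crawl4ai's actual implementation; the function name `is_external` and its signature are assumptions for the example:

```python
from urllib.parse import urlparse

def is_external(src: str, base_domain: str) -> bool:
    """Illustrative check (not crawl4ai's code): treat an image as
    'external' when its host differs from the page's base domain.
    Relative URLs have no host, so they count as internal."""
    host = urlparse(src).netloc
    return bool(host) and host != base_domain

# An image served from camo.githubusercontent.com is external to a
# github.com page, so it is only dropped once exclude_external_images
# is enabled; a relative asset path stays internal and untouched.
print(is_external("https://camo.githubusercontent.com/x.png", "github.com"))  # True
print(is_external("/assets/logo.png", "github.com"))                           # False
```

This mirrors why `exclude_domains` alone did not remove the badge images: the domain filter is only consulted for content already classified as external.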

I'll close the issue, but you're most welcome to continue the conversation. Thanks again for your report!

@unclecode unclecode added ❓ Question Q&A and removed 🐞 Bug Something isn't working 🩺 Needs Triage Needs attention of maintainers labels Apr 2, 2025
@unclecode unclecode self-assigned this Apr 2, 2025