@cccmolo Thanks for reporting this issue. The behavior you're seeing is expected. The exclusion based on the base domain applies only to external content. The code checks if an image or link is internal by comparing its domain to the page's base domain. That means even if the target domain appears in your exclude_domains list, internal images remain untouched.
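To make the internal/external distinction concrete, here is a minimal sketch of that kind of domain comparison. It is not crawl4ai's actual implementation: the helper names base_domain and is_external are hypothetical, and the "last two labels" rule is a simplifying assumption that ignores multi-part suffixes such as co.uk.

```python
from urllib.parse import urljoin, urlparse


def base_domain(url: str) -> str:
    """Naive base domain: the last two labels of the hostname (assumption)."""
    host = (urlparse(url).hostname or "").lower()
    parts = host.split(".")
    return ".".join(parts[-2:]) if len(parts) >= 2 else host


def is_external(resource_url: str, page_url: str) -> bool:
    """Treat a resource as external when its base domain differs from the page's."""
    # Resolve relative paths against the page URL so they count as internal.
    absolute = urljoin(page_url, resource_url)
    return base_domain(absolute) != base_domain(page_url)


page = "https://github.com/unclecode/crawl4ai/blob/main/README.md"
for src in (
    "https://img.shields.io/badge/example.svg",
    "https://camo.githubusercontent.com/abc123",
    "/unclecode/crawl4ai/raw/main/logo.png",
):
    print(src, "external" if is_external(src, page) else "internal")
```

Under this sketch, both badge hosts from the report count as external to github.com, while a relative path on the same site counts as internal.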
To filter out images from external domains such as "camo.githubusercontent.com" or "img.shields.io", you need to enable the exclude_external_images flag. For example:
crawler_cfg = CrawlerRunConfig(
    exclude_domains=["camo.githubusercontent.com", "img.shields.io"],
    exclude_external_images=True,  # This will filter out external images matching the domains
    wait_for_images=True,
    verbose=True,
)
This configuration ensures that only external images with the specified domains are removed, while internal resources remain unaffected.
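If the flag turns out to remove more or less than you want, the image list can also be filtered after the crawl. The following is only a sketch, assuming each entry in result.media["images"] is a dict with a "src" key, as in the snippet from the report. The helper drop_excluded_images is hypothetical and not part of crawl4ai.

```python
from urllib.parse import urlparse

EXCLUDED_DOMAINS = {"camo.githubusercontent.com", "img.shields.io"}


def drop_excluded_images(images, excluded=EXCLUDED_DOMAINS):
    """Return only the image entries whose host is not an excluded domain."""
    kept = []
    for image in images:
        host = (urlparse(image.get("src") or "").hostname or "").lower()
        # Match the excluded domain itself or any of its subdomains.
        if any(host == d or host.endswith("." + d) for d in excluded):
            continue
        kept.append(image)
    return kept


sample = [
    {"src": "https://img.shields.io/badge/example.svg"},
    {"src": "https://camo.githubusercontent.com/abc123"},
    {"src": "https://github.com/unclecode/crawl4ai/raw/main/logo.png"},
]
for image in drop_excluded_images(sample):
    print(image["src"])
```

In the reported script this would be applied as drop_excluded_images(result.media.get("images", [])) before printing.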
I'm closing the issue, but you are most welcome to continue the conversation. Thanks again for your report!
crawl4ai version
0.5.0.post8
Expected Behavior
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

crawler_cfg = CrawlerRunConfig(
    exclude_domains=["camo.githubusercontent.com", "img.shields.io"],
    # exclude_social_media_links=True,  # skip Twitter, Facebook, etc.
    wait_for_images=True,  # ensure images are loaded
    verbose=True,
)

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://github.com/unclecode/crawl4ai/blob/main/README.md",
            config=crawler_cfg,
        )
        if result.success:
            # Print the source URL of every image the crawler collected
            for image in result.media.get("images", []):
                print(image["src"])

if __name__ == "__main__":
    asyncio.run(main())
Result: images from "camo.githubusercontent.com" and "img.shields.io" still appear in the output. Why?

Current Behavior
Result: images from "camo.githubusercontent.com" and "img.shields.io" still appear in the output. Why?
Is this reproducible?
No
Inputs Causing the Bug
Steps to Reproduce
Code snippets
OS
macOS
Python version
3.11
Browser
No response
Browser version
No response
Error logs & Screenshots (if applicable)
No response