[Bug]: Incorrect rendering of inline code inside of links #583

Closed
dmurat opened this issue Jan 29, 2025 · 8 comments
Assignees
Labels
💪 - Intermediate Difficulty level - Intermediate 🐞 Bug Something isn't working ☕ Low Priority - Low 📌 Root caused identified the root cause of bug ⚙️ Under Test Bug fix / Feature request that's under testing

Comments

@dmurat

dmurat commented Jan 29, 2025

crawl4ai version

0.4.248b3

Expected Behavior

Correct rendering of links containing inline code. For example:

<a href="https://docs.spring.io/spring-framework/docs/6.2.x/javadoc-api/org/springframework/context/annotation/Configuration.html" class="apiref"><code>@Configuration</code></a>

should be rendered as

[`@Configuration`](https://docs.spring.io/spring-framework/docs/6.2.x/javadoc-api/org/springframework/context/annotation/Configuration.html)

Current Behavior

Currently, a link containing inline code is rendered as the inline code first, followed by a link that has the correct URL but empty text, as in:

`@Configuration`[](https://docs.spring.io/spring-framework/docs/6.2.x/javadoc-api/org/springframework/context/annotation/Configuration.html)
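The broken pattern above is easy to detect mechanically. Here is a minimal sketch of a check that scans generated Markdown for links with empty text. The names `EMPTY_LINK` and `find_empty_links` are made up for illustration and are not part of crawl4ai, and the example.com URL in the usage note is a placeholder.

```python
import re

# Matches a Markdown link whose text is empty, e.g. "[](https://example.com)".
# The negative lookbehind skips images, which are written "![](...)".
EMPTY_LINK = re.compile(r"(?<!!)\[\]\(([^)\s]+)\)")


def find_empty_links(markdown):
    """Return the target URL of every link that has no link text."""
    return EMPTY_LINK.findall(markdown)
```

Calling `find_empty_links` on the output shown under "Current Behavior" should return the link's URL, while the output shown under "Expected Behavior" should give an empty list.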

Is this reproducible?

Yes

Inputs Causing the Bug

- URL: https://docs.spring.io/spring-boot/how-to/security.html
- css_selector: "article.doc > *:not(.breadcrumbs-container):not(aside):not(nav)"
- excluded_selector: ".source-toolbox, .ulist.tablist, .tab:not(.is-selected), .tabpanel.is-hidden"

Steps to Reproduce

Code snippets

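# CustomWebScrapingStrategy is the reporter's own class and is not shown here.
from crawl4ai import CacheMode, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# css_selector and excluded_css_selector hold the values listed under
# "Inputs Causing the Bug".
crawler_run_config = CrawlerRunConfig(
    scraping_strategy=CustomWebScrapingStrategy(),
    css_selector=css_selector,
    excluded_selector=excluded_css_selector or "",
    exclude_external_links=True,
    exclude_external_images=True,
    markdown_generator=DefaultMarkdownGenerator(
        options={
            "skip_internal_links": True,
            "single_line_break": False,
            "protect_links": False,
            "pad_tables": True
        }
    ),
    process_iframes=False,
    magic=True,
    cache_mode=CacheMode.BYPASS,
    verbose=True,
)

# crawler is an already started AsyncWebCrawler; this call runs inside an async function.
crawl_result = await crawler.arun(
    url=url,
    config=crawler_run_config,
)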

OS

macOS

Python version

3.12.7

Browser

Chrome

Browser version

Version 132.0.6834.160 (Official Build) (arm64)

Error logs & Screenshots (if applicable)

No response

@dmurat dmurat added 🐞 Bug Something isn't working 🩺 Needs Triage Needs attention of maintainers labels Jan 29, 2025
@dmurat
Author

dmurat commented Jan 29, 2025

Here is the outline of the fix I'm currently using, and it works ok as far as I can see:

class CustomHTML2Text(HTML2Text):
    def __init__(self, *args, handle_code_in_pre=False, **kwargs):
        super().__init__(*args, **kwargs)
        ...
        self.inside_link = False  # Add this to track if we're inside a link
        ...

    # fmt: off
    def handle_tag(self, tag, attrs, start):
        # Handle links
        if tag == "a":
            if start:
                self.inside_link = True
            else:
                self.inside_link = False

            super().handle_tag(tag, attrs, start)
            return

        ...
        # Handle pre tags
        if tag == 'pre':
            ...

        elif tag == 'code':
            if self.inside_pre and not self.handle_code_in_pre:
                return

            if start:
                if not self.inside_link:
                    self.o("`")  # Only output backtick if not inside a link
                self.inside_code = True
            else:
                if not self.inside_link:
                    self.o("`")  # Only output backtick if not inside a link
                self.inside_code = False

            # If inside a link, let the parent class handle the content
            if self.inside_link:
                super().handle_tag(tag, attrs, start) 

        else:
            super().handle_tag(tag, attrs, start)

    ...

HTH
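For readers who want to try the idea without crawl4ai installed, here is a self-contained sketch of the same technique, tracking whether the parser is inside a link so that the backticks end up in the link text. It uses only the standard-library `html.parser` rather than `HTML2Text`, so it is a toy converter and not the actual fix. The class `LinkAwareMarkdown` and the function `to_markdown` are hypothetical names, and only `<a>` and `<code>` are handled.

```python
from html.parser import HTMLParser


class LinkAwareMarkdown(HTMLParser):
    """Toy HTML-to-Markdown converter; handles only <a> and <code>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []             # finished Markdown fragments
        self.link_text = []       # text collected while inside <a>
        self.href = ""            # href of the currently open <a>
        self.inside_link = False  # same flag as in the outline above

    def _emit(self, text):
        # While a link is open, route text into the link buffer instead of
        # the main output, so backticks become part of the link text.
        target = self.link_text if self.inside_link else self.out
        target.append(text)

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.inside_link = True
            self.href = dict(attrs).get("href") or ""
            self.link_text = []
        elif tag == "code":
            self._emit("`")

    def handle_endtag(self, tag):
        if tag == "a" and self.inside_link:
            self.inside_link = False
            text = "".join(self.link_text)
            self.out.append(f"[{text}]({self.href})")
        elif tag == "code":
            self._emit("`")

    def handle_data(self, data):
        self._emit(data)


def to_markdown(html):
    """Convert an HTML fragment to Markdown using the toy parser."""
    parser = LinkAwareMarkdown()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Feeding it an anchor that wraps a `<code>` element should give the link form from "Expected Behavior", with the backticks inside the square brackets. Buffering the link text until `</a>` is a different mechanism from the outline above, which delegates to the parent class, but it relies on the same `inside_link` flag.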

@aravindkarnam aravindkarnam added 📌 Root caused identified the root cause of bug and removed 🩺 Needs Triage Needs attention of maintainers labels Jan 31, 2025
@aravindkarnam
Collaborator

aravindkarnam commented Jan 31, 2025

@dmurat Thanks for pointing this out and for your suggestion. Looks like you have already fixed it. Could you raise a PR for this?

@aravindkarnam aravindkarnam added 💪 - Intermediate Difficulty level - Intermediate ☕ Low Priority - Low labels Jan 31, 2025
@dmurat
Author

dmurat commented Jan 31, 2025

@aravindkarnam Sure, I can try. One question, though: are there any existing tests I can look at for examples?

@aravindkarnam
Collaborator

@dmurat There are several examples in the /tests folder.

@aravindkarnam
Collaborator

@dmurat Were you able to make any progress on this?

@dmurat
Author

dmurat commented Feb 10, 2025

@aravindkarnam Sorry, I haven't found the time yet. Maybe this week or next, if you can wait.

@aravindkarnam aravindkarnam self-assigned this Feb 14, 2025
aravindkarnam added a commit that referenced this issue Feb 14, 2025
@aravindkarnam
Collaborator

aravindkarnam commented Feb 14, 2025

@dmurat No worries. I went through your code suggestions, and they work as expected. I added these changes to the relevant files and tested them. You will be credited for your contribution in the next release. Thanks again!

@aravindkarnam aravindkarnam added the ⚙️ Under Test Bug fix / Feature request that's under testing label Feb 14, 2025
@aravindkarnam aravindkarnam mentioned this issue Feb 15, 2025
6 tasks
@banagale

@dmurat This also fixed an issue I was seeing. Thank you for identifying a solution and @aravindkarnam for getting it in.

@github-project-automation github-project-automation bot moved this from To Assign to Done in 2025-Feb-Alpha-1 Mar 4, 2025
promagician77 added a commit to promagician77/crawl4ai that referenced this issue Mar 13, 2025
* spelling change in prompt

* gpt-4o-mini support

* Remove leading Y before here

* prompt spell correction

* (Docs) Fix numbered list end-of-line formatting

Added the missing "two spaces" to add a line break

* fix: access downloads_path through browser_config in _handle_download method - Fixes #585

* crawl

* fix: unclecode/crawl4ai#592

* fix: unclecode/crawl4ai#583

* Docs update: unclecode/crawl4ai#649

* fix: unclecode/crawl4ai#570

* Docs: updated example for content-selection to reflect new changes in yc newsfeed css

* Refactor: Removed old filters and replaced with optimised filters

* fix:Fixed imports as per the new names of filters

* Tests: For deep crawl filters

* Refactor: Remove old scorers and replace with optimised ones: Fix imports forall filters and scorers.

* fix: awaiting on filters that are async in nature eg: content relevance and seo filters

* fix: unclecode/crawl4ai#592

* fix: unclecode/crawl4ai#715

---------

Co-authored-by: DarshanTank <[email protected]>
Co-authored-by: Tuhin Mallick <[email protected]>
Co-authored-by: Serhat Soydan <[email protected]>
Co-authored-by: cardit1 <[email protected]>
Co-authored-by: Tautik Agrahari <[email protected]>