[Bug]: Markdown output has incorrect spacing. #599

Closed
dkampien opened this issue Feb 1, 2025 · 5 comments · Fixed by #658
Assignees
Labels
💪 - Beginner (Difficulty level - Beginners); 🐞 Bug (Something isn't working); ⚡ High (Priority - High); 📌 Root caused (identified the root cause of bug); ⚙️ Under Test (Bug fix / Feature request that's under testing)

Comments

@dkampien

dkampien commented Feb 1, 2025

crawl4ai version

0.4.247

Expected Behavior

I'm trying to scrape a page from the Blender manual at https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html

The markdown should look a little more like this (scraped with jina-ai):

[Image: markdown output scraped with jina-ai]

Notice the spacing between paragraphs.

Current Behavior

Instead it messes up the spacing like so:

[Image: markdown output produced by crawl4ai]

Notice that the spacing between paragraphs is messed up. LLMs can pick up on this paragraph proximity.

Is there any option in CrawlerRunConfig that I should know about that can fix this? @aravindkarnam @unclecode

Is this reproducible?

Yes

Inputs Causing the Bug

https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html

Steps to Reproduce

Code snippets

import asyncio
from crawl4ai import AsyncWebCrawler, CacheMode
from crawl4ai.async_configs import BrowserConfig, CrawlerRunConfig


async def main():
    browser_config = BrowserConfig()  # Default browser configuration
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS,  # Always fetch a fresh copy of the page
        css_selector="#furo-main-content",  # Keep only the main content area
        excluded_selector=".toc-drawer, a.headerlink"  # Drop the TOC drawer and header anchor links
    )


    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.blender.org/manual/en/4.3/editors/outliner/interface.html",
            config=run_config
        )
        
        # Export to markdown file
        with open('output.md', 'w', encoding='utf-8') as f:
            f.write(result.markdown)  # Write markdown content to file

if __name__ == "__main__":
    asyncio.run(main())

OS

macOS

Python version

3.11.9

Browser

Edge

Browser version

No response

Error logs & Screenshots (if applicable)

No response

@dkampien dkampien added 🐞 Bug Something isn't working 🩺 Needs Triage Needs attention of maintainers labels Feb 1, 2025
@aravindkarnam
Collaborator

aravindkarnam commented Feb 2, 2025

RCA:

The HTML tags corresponding to the sections in question are the dl, dt, and dd tags (description list).

Currently these are handled by the handle_tag function in the custom html2text package (crawl4ai/html2text/__init__.py). The exact code causing the problem is as follows:

        if tag == "dl" and start:
            self.p()
        if tag == "dt" and not start:
            self.pbr()
        if tag == "dd" and start:
            self.o("    ")
        if tag == "dd" and not start:
            self.pbr()

The issue occurs because:

  1. After each dt ends, we add a line break (self.pbr())
  2. When dd starts, we only add indentation with no spacing control
  3. After each dd ends, we add another line break
  4. The self.p_p counter that controls paragraph breaks isn't being properly managed between terms and definitions
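
To make the four points above concrete, here is a toy model of a pending-break counter. It is a sketch under assumptions: ToyEmitter and render_pairs are hypothetical names, and the bodies of p, pbr, and o are a simplified guess at how such a counter behaves, not code copied from crawl4ai's html2text. Only the handle_tag branches are taken from the snippet quoted above. The model ignores nested tags (for example a p inside a dd), so its output is illustrative rather than a reproduction of the real markdown.

```python
class ToyEmitter:
    """Reduced, hypothetical model of a pending-break emitter (not crawl4ai code)."""

    def __init__(self):
        self.parts = []  # chunks emitted so far
        self.p_p = 0     # number of newlines owed before the next chunk

    def p(self):
        # Request a paragraph break (two newlines).
        self.p_p = 2

    def pbr(self):
        # Request a single line break unless a break is already pending.
        if self.p_p == 0:
            self.p_p = 1

    def o(self, text):
        # Flush any pending newlines, then emit the text.
        if self.p_p:
            self.parts.append("\n" * self.p_p)
            self.p_p = 0
        self.parts.append(text)

    def handle_tag(self, tag, start):
        # Same branches as the snippet quoted in the RCA.
        if tag == "dl" and start:
            self.p()
        if tag == "dt" and not start:
            self.pbr()
        if tag == "dd" and start:
            self.o("    ")
        if tag == "dd" and not start:
            self.pbr()


def render_pairs(pairs):
    """Feed (term, definition) pairs through the toy emitter as one <dl>."""
    emitter = ToyEmitter()
    emitter.handle_tag("dl", True)
    for term, definition in pairs:
        emitter.handle_tag("dt", True)
        emitter.o(term)
        emitter.handle_tag("dt", False)
        emitter.handle_tag("dd", True)
        emitter.o(definition)
        emitter.handle_tag("dd", False)
    emitter.handle_tag("dl", False)
    return "".join(emitter.parts)


print(repr(render_pairs([("Term A", "Definition A"), ("Term B", "Definition B")])))
```

In this model the break after a dd is the same single newline as the break after a dt, so nothing in the output separates one term-definition pair from the next.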

Fix Suggestions

Courtesy of Claude Sonnet, the following changes fix the markdown as expected (each dt and dd, the term and its description, are positioned together rather than with the preceding or following descriptions):

        if tag == "dl" and start:
            self.p()  # Add paragraph break before list starts
            self.p_p = 0  # Reset paragraph state
        
        elif tag == "dt" and start:
            if self.p_p == 0:  # No break pending; also true for the first term, since dl resets p_p
                self.o("\n\n")  # Add spacing before new term-definition pair
            self.p_p = 0  # Reset paragraph state
        
        elif tag == "dt" and not start:
            self.o("\n")  # Single newline between term and definition
        
        elif tag == "dd" and start:
            self.o("    ")  # Indent definition
        
        elif tag == "dd" and not start:
            self.p_p = 0
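
The target shape for a description list (each term directly followed by its indented definition, with a blank line between pairs) can also be pinned down independently of html2text, for example as a reference when testing a fix. The following is a standalone sketch using only the standard library html.parser. DescriptionListToText and dl_to_text are hypothetical names and are not part of crawl4ai, and the sketch does not handle a dl nested inside a dd.

```python
from html.parser import HTMLParser


class DescriptionListToText(HTMLParser):
    """Hypothetical helper: renders dt/dd items as term + indented definition."""

    def __init__(self):
        super().__init__()
        self.lines = []      # finished output lines
        self.current = None  # "dt", "dd", or None when outside both
        self.buffer = []     # text collected for the current dt/dd

    def handle_starttag(self, tag, attrs):
        if tag in ("dt", "dd"):
            self.current = tag
            self.buffer = []

    def handle_data(self, data):
        if self.current:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self.current:
            return  # ignore closing tags nested inside a dt/dd, such as </p>
        text = " ".join("".join(self.buffer).split())  # collapse whitespace
        if tag == "dt":
            if self.lines:
                self.lines.append("")  # blank line before each new pair
            self.lines.append(text)
        else:
            self.lines.append("    " + text)  # definition stays under its term
        self.current = None


def dl_to_text(html):
    parser = DescriptionListToText()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.lines)


sample = "<dl><dt>Term A</dt><dd><p>Definition A</p></dd><dt>Term B</dt><dd>Definition B</dd></dl>"
print(dl_to_text(sample))
```

Comparing the crawler's markdown for a small dl fragment against this shape is one way to check that a term stays attached to its own definition.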

I have verified that the suggested handle_tag changes produce the expected output. They just need a little more refinement and testing.


Call for contributors

We are on the lookout for talented open source contributors. This one is a simple fix. If you are a beginner, you can bag your first open source contribution (and we want that for you 😉). Comment "Interested" below and the issue will be assigned to you.

@aravindkarnam aravindkarnam added 💪 - Beginner Difficulty level - Beginners ⚡ High Priority - High 📌 Root caused identified the root cause of bug and removed 🩺 Needs Triage Needs attention of maintainers labels Feb 2, 2025
@github-project-automation github-project-automation bot moved this to To Assign in 2025-Feb-Alpha-1 Feb 2, 2025
@tautik
Contributor

tautik commented Feb 2, 2025

interested @aravindkarnam

@aravindkarnam
Collaborator

@tautikAg Thanks for showing interest. The next release is due by Feb 15th, so please plan to raise a PR 2-3 days before that.

@aravindkarnam
Collaborator

@tautikAg Hi. Were you able to make progress on this?

@tautik
Contributor

tautik commented Feb 11, 2025

Hey @aravindkarnam, I am testing right now. Will tag you on the PR soon (in a few hours).

aravindkarnam added a commit that referenced this issue Feb 12, 2025
@aravindkarnam aravindkarnam added the ⚙️ Under Test Bug fix / Feature request that's under testing label Feb 14, 2025
@aravindkarnam aravindkarnam mentioned this issue Feb 15, 2025
6 tasks
@aravindkarnam aravindkarnam moved this from To Assign to Ready in 2025-Feb-Alpha-1 Feb 25, 2025
@aravindkarnam aravindkarnam moved this from Ready to Done in 2025-Feb-Alpha-1 Feb 25, 2025
@aravindkarnam aravindkarnam closed this as completed by moving to Done in 2025-Feb-Alpha-1 Feb 25, 2025