Bug: GitbookLoader fails to process nested sitemaps #30629
Comments
I got this one.
@andrasfe do this:
    from langchain_community.document_loaders import GitbookLoader

    loader = GitbookLoader(
        web_page="https://platform-docs.opentargets.org/",
        load_all_paths=True,
        sitemap_url="https://platform-docs.opentargets.org/sitemap.xml",
        content_selector="url",
    )
- **Description:** We do not need to set the parser in `scrape`, since that has already been done in `_scrape`.
- **Issue:** #30629; not directly related, but this makes sure the XML parser is used.
@keenborder786, perhaps I am missing something, but that does not work.
@andrasfe is it working now?
It was not, but a fix is pending merge approval, @keenborder786.
Example Code
The following code returns 0.
Expected Behavior
The loader should process the sitemap index file, follow links to child sitemaps, and extract URLs to content pages, returning all available documents.
Actual Behavior
The loader processes only the top-level sitemap file, fails to recognize it as a sitemap index, and returns 0 documents.
Root Cause
The code doesn't distinguish between sitemap index files (`<sitemapindex>`) and regular sitemap files (`<urlset>`)
It isn't using the proper XML parser for sitemap files
There is no recursive processing for nested sitemaps
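The three points above can be illustrated with a minimal sketch of recursive sitemap handling. This is not the GitbookLoader implementation or the pending fix; `collect_page_urls`, the `fetch` callable, the `max_depth` guard, and the `docs.example.com` URLs are all hypothetical names introduced for illustration. It uses the standard-library XML parser and the sitemap protocol's namespace to tell a `<sitemapindex>` root from a `<urlset>` root.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

# Namespace defined by the sitemaps.org protocol; ElementTree exposes it
# as a "{namespace}" prefix on every tag name.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    _seen: Optional[Set[str]] = None,
    max_depth: int = 5,
) -> List[str]:
    """Return content-page URLs, following sitemap indexes recursively.

    `fetch` is a hypothetical callable that returns the XML text for a URL
    (in real use it could wrap an HTTP client).
    """
    seen = _seen if _seen is not None else set()
    if sitemap_url in seen or max_depth < 0:
        return []  # guard against cycles and runaway nesting
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]

    if root.tag == f"{SITEMAP_NS}sitemapindex":
        # Sitemap index: every <loc> points at a child sitemap, so recurse.
        pages: List[str] = []
        for child_url in locs:
            pages.extend(collect_page_urls(child_url, fetch, seen, max_depth - 1))
        return pages

    # Regular sitemap (<urlset>): every <loc> is a content page.
    return locs


# In-memory stand-in for a site with a nested sitemap (made-up URLs).
FAKE_SITE = {
    "https://docs.example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://docs.example.com/sitemap-pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://docs.example.com/sitemap-pages.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://docs.example.com/</loc></url>"
        "<url><loc>https://docs.example.com/getting-started</loc></url>"
        "</urlset>"
    ),
}

pages = collect_page_urls("https://docs.example.com/sitemap.xml", FAKE_SITE.__getitem__)
print(pages)
```

Passing `fetch` as a parameter keeps the sketch independent of any HTTP library; the `seen` set and `max_depth` limit are defensive choices for sitemap indexes that reference each other.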
Error Message and Stack Trace (if applicable)
The code above returns 0 documents.
Description
I am trying to load a GitBook documentation site. The GitbookLoader fails to extract documents when the target site uses nested sitemaps (a sitemap index pointing to child sitemaps). When it encounters a sitemap index file, the loader attempts to process it as a regular content page rather than recursively exploring the referenced sitemaps, so zero documents are returned.
PS: This is a continuation of issue #30473. The prior PR fixed the example in that issue but not the recursive sitemap case.