Bug: GitbookLoader fails to process nested sitemaps #30629

Open
5 tasks done
andrasfe opened this issue Apr 3, 2025 · 5 comments · May be fixed by #30681
Labels
🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature

Comments

@andrasfe
Contributor

andrasfe commented Apr 3, 2025

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

The following code prints 0, meaning the loader returns no documents.

from langchain_community.document_loaders import GitbookLoader

loader = GitbookLoader(
    web_page="https://platform-docs.opentargets.org/",
    load_all_paths=True,
    sitemap_url="https://platform-docs.opentargets.org/sitemap.xml",
)
docs = loader.load()
print(len(docs))  # Returns 0 instead of expected documents

Expected Behavior

The loader should process the sitemap index file, follow the links to its child sitemaps, and extract the content page URLs from them, returning all available documents.

Actual Behavior

The loader processes only the top-level sitemap file, fails to recognize it as a sitemap index, and returns 0 documents.

Root Cause

  • The code doesn't distinguish between sitemap index files (<sitemapindex>) and regular sitemap files (<urlset>).
  • It doesn't use the proper XML parser for sitemap files.
  • There is no recursive processing of nested sitemaps.
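The three points under Root Cause can be illustrated with a minimal sketch. This is not the GitbookLoader implementation, nor the fix proposed in #30681; it only shows one way to tell a sitemap index from a regular sitemap and recurse into child sitemaps, using the standard library's XML parser. The names `collect_page_urls`, `fetch`, and `max_depth` are hypothetical and introduced purely for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set


def _local_name(tag: str) -> str:
    """Strip an XML namespace prefix: '{ns}urlset' -> 'urlset'."""
    return tag.rsplit("}", 1)[-1]


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],  # hypothetical: returns the XML text for a URL
    max_depth: int = 5,           # hypothetical guard against very deep nesting
    _seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return content page URLs, following nested sitemap indexes."""
    seen = set() if _seen is None else _seen
    if sitemap_url in seen or max_depth < 0:
        return []  # already visited (cycle) or nested too deeply
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    kind = _local_name(root.tag)

    # Each <sitemap> or <url> entry carries its address in a direct <loc> child.
    locs = []
    for entry in root:
        for child in entry:
            if _local_name(child.tag) == "loc" and child.text and child.text.strip():
                locs.append(child.text.strip())

    if kind == "sitemapindex":
        # The <loc> values point at child sitemaps, so recurse into each one.
        pages: List[str] = []
        for child_sitemap in locs:
            pages.extend(collect_page_urls(child_sitemap, fetch, max_depth - 1, seen))
        return pages
    if kind == "urlset":
        # The <loc> values are the content pages themselves.
        return locs
    return []  # not a sitemap document
```

Passing `fetch` as a parameter keeps the sketch testable without network access; against a live site it could be something like `lambda url: requests.get(url, timeout=10).text`, assuming the `requests` package is installed.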

Error Message and Stack Trace (if applicable)

The code above prints 0.

Description

I am trying to load a GitBook site. The GitbookLoader fails to extract documents when the target site uses nested sitemaps (a sitemap index pointing to child sitemaps). When it encounters a sitemap index file, the loader attempts to process it as a regular content page rather than recursively exploring the referenced sitemaps, so zero documents are returned.

PS: This is a continuation of issue #30473. The prior PR fixed the example given there but not the recursive sitemap case.

System Info

System Information

OS: Linux
OS Version: #59-Ubuntu SMP PREEMPT_DYNAMIC Sat Mar 15 17:40:59 UTC 2025
Python Version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0]

Package Information

langchain_core: 0.3.49
langchain: 0.3.22
langchain_community: 0.3.20
langsmith: 0.3.22
langchain_anthropic: 0.3.10
langchain_openai: 0.3.11
langchain_text_splitters: 0.3.7

Optional packages not installed

langserve

Other Dependencies

aiohttp<4.0.0,>=3.8.3: Installed. No version info available.
anthropic<1,>=0.49.0: Installed. No version info available.
async-timeout<5.0.0,>=4.0.0;: Installed. No version info available.
dataclasses-json<0.7,>=0.5.7: Installed. No version info available.
httpx: 0.28.1
httpx-sse<1.0.0,>=0.4.0: Installed. No version info available.
jsonpatch<2.0,>=1.33: Installed. No version info available.
langchain-anthropic;: Installed. No version info available.
langchain-aws;: Installed. No version info available.
langchain-azure-ai;: Installed. No version info available.
langchain-cohere;: Installed. No version info available.
langchain-community;: Installed. No version info available.
langchain-core<1.0.0,>=0.3.45: Installed. No version info available.
langchain-core<1.0.0,>=0.3.49: Installed. No version info available.
langchain-deepseek;: Installed. No version info available.
langchain-fireworks;: Installed. No version info available.
langchain-google-genai;: Installed. No version info available.
langchain-google-vertexai;: Installed. No version info available.
langchain-groq;: Installed. No version info available.
langchain-huggingface;: Installed. No version info available.
langchain-mistralai;: Installed. No version info available.
langchain-ollama;: Installed. No version info available.
langchain-openai;: Installed. No version info available.
langchain-text-splitters<1.0.0,>=0.3.7: Installed. No version info available.
langchain-together;: Installed. No version info available.
langchain-xai;: Installed. No version info available.
langchain<1.0.0,>=0.3.21: Installed. No version info available.
langsmith-pyo3: Installed. No version info available.
langsmith<0.4,>=0.1.125: Installed. No version info available.
langsmith<0.4,>=0.1.17: Installed. No version info available.
numpy<3,>=1.26.2: Installed. No version info available.
openai-agents: Installed. No version info available.
openai<2.0.0,>=1.68.2: Installed. No version info available.
opentelemetry-api: Installed. No version info available.
opentelemetry-exporter-otlp-proto-http: Installed. No version info available.
opentelemetry-sdk: Installed. No version info available.
orjson: 3.10.16
packaging: 24.2
packaging<25,>=23.2: Installed. No version info available.
pydantic: 2.11.1
pydantic-settings<3.0.0,>=2.4.0: Installed. No version info available.
pydantic<3.0.0,>=2.5.2;: Installed. No version info available.
pydantic<3.0.0,>=2.7.4: Installed. No version info available.
pydantic<3.0.0,>=2.7.4;: Installed. No version info available.
pytest: 8.3.5
PyYAML>=5.3: Installed. No version info available.
requests: 2.32.3
requests-toolbelt: 1.0.0
requests<3,>=2: Installed. No version info available.
rich: Installed. No version info available.
SQLAlchemy<3,>=1.4: Installed. No version info available.
tenacity!=8.4.0,<10,>=8.1.0: Installed. No version info available.
tenacity!=8.4.0,<10.0.0,>=8.1.0: Installed. No version info available.
tiktoken<1,>=0.7: Installed. No version info available.
typing-extensions>=4.7: Installed. No version info available.
zstandard: 0.23.0

@dosubot dosubot bot added the 🤖:bug Related to a bug, vulnerability, unexpected error with an existing feature label Apr 3, 2025
@andrasfe
Contributor Author

andrasfe commented Apr 3, 2025

I got this one

@keenborder786
Contributor

@andrasfe do this:

loader = GitbookLoader(
    web_page="https://platform-docs.opentargets.org/",
    load_all_paths=True,
    sitemap_url="https://platform-docs.opentargets.org/sitemap.xml",
    content_selector="url",
)

eyurtsev pushed a commit that referenced this issue Apr 4, 2025
- **Description:** We do not need to set the parser in `scrape` since it has already been set in `_scrape`.
- **Issue:** #30629, not directly related, but makes sure the XML parser is used.
@andrasfe
Contributor Author

andrasfe commented Apr 4, 2025

> @andrasfe do this:
>
> loader = GitbookLoader(
>     web_page="https://platform-docs.opentargets.org/",
>     load_all_paths=True,
>     sitemap_url="https://platform-docs.opentargets.org/sitemap.xml",
>     content_selector="url",
> )

@keenborder786, perhaps I am missing something, but that does not work.

@keenborder786
Contributor

@andrasfe is it working now?

@andrasfe
Contributor Author

andrasfe commented Apr 21, 2025

> @andrasfe is it working now?

It was not, but a fix is pending merge approval, @keenborder786.
