GitBook loader does not load any pages when Sitemap has nested Sitemaps #30473

Open
@mutje

Checked other resources

  • I added a very descriptive title to this issue.
  • I searched the LangChain documentation with the integrated search.
  • I used the GitHub search to find a similar question and didn't find it.
  • I am sure that this is a bug in LangChain rather than my code.
  • The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).

Example Code

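from langchain_community.document_loaders import GitbookLoader

# Crawl every page listed in the site's sitemap
loader = GitbookLoader(
    web_page="https://docs.gitbook.com/",
    load_all_paths=True,
)
docs = loader.load()
print(len(docs))  # prints 0 instead of the number of pages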

Error Message and Stack Trace (if applicable)

No response

Description

  • I am trying to fetch all pages of a GitBook documentation site using GitbookLoader with load_all_paths=True.
  • The site's sitemap (e.g. the one for GitBook's own documentation) does not list pages directly; it contains references to other sitemaps.
  • Instead of holding the sub-pages, docs is an empty list (0 is printed).
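
To illustrate the nesting described above, here is a minimal sketch of how a sitemap that references other sitemaps could be resolved down to page URLs. It is not the loader's code: `collect_page_urls`, the injected `fetch` callable, and the `example.com` documents in `FAKE_SITE` are all hypothetical and exist only for this example. The only assumption about the real site is that it follows the standard sitemaps.org format, where a `sitemapindex` root lists child sitemaps and a `urlset` root lists pages.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

# Namespace used by the sitemaps.org protocol.
LOC_TAG = "{http://www.sitemaps.org/schemas/sitemap/0.9}loc"


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    max_depth: int = 3,
) -> List[str]:
    """Return page URLs, following nested sitemaps up to max_depth levels."""
    root = ET.fromstring(fetch(sitemap_url))
    kind = root.tag.rsplit("}", 1)[-1]  # "urlset" or "sitemapindex"
    locs = [el.text.strip() for el in root.iter(LOC_TAG) if el.text]

    if kind == "urlset":
        # Leaf sitemap: the <loc> entries are the pages themselves.
        return locs
    if kind == "sitemapindex" and max_depth > 0:
        # Index sitemap: each <loc> points at another sitemap to resolve.
        pages: List[str] = []
        for child_url in locs:
            pages.extend(collect_page_urls(child_url, fetch, max_depth - 1))
        return pages
    return []


# Hypothetical in-memory site standing in for real HTTP responses.
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/sitemap-pages.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/intro</loc></url>"
        "<url><loc>https://example.com/setup</loc></url>"
        "</urlset>"
    ),
}

pages = collect_page_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
print(pages)
```

A reader that only looks for page entries in the top-level file would find none in an index like the one above, which is consistent with the empty result reported here.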

The problem can be fixed by changing the web_page assignment in the __init__ of gitbook.py to point at the pages sitemap:

if load_all_paths:
    # set web_path to the sitemap if we want to crawl all paths
    web_page = f"{self.base_url}/sitemap-pages.xml"

So perhaps a constructor parameter for providing a custom sitemap URL would be sufficient.
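
As a rough sketch of that suggestion, the URL selection could look like the function below. This is not the library's API: `resolve_start_url` and its `sitemap_url` parameter are hypothetical names, and the fallback to `{base_url}/sitemap.xml` is an assumption about the loader's current default.

```python
from typing import Optional


def resolve_start_url(
    web_page: str,
    load_all_paths: bool = False,
    base_url: Optional[str] = None,
    sitemap_url: Optional[str] = None,  # hypothetical new parameter
) -> str:
    """Pick the URL to start loading from."""
    if not load_all_paths:
        # Single-page mode: load exactly the page that was asked for.
        return web_page
    if sitemap_url:
        # Caller supplied an explicit sitemap, e.g. a nested pages sitemap.
        return sitemap_url
    # Assumed default: the sitemap at the root of the site.
    base = (base_url or web_page).rstrip("/")
    return f"{base}/sitemap.xml"


start = resolve_start_url(
    "https://docs.gitbook.com/",
    load_all_paths=True,
    sitemap_url="https://docs.gitbook.com/sitemap-pages.xml",
)
print(start)
```

With an override like this, existing callers keep the default behaviour, while sites with nested sitemaps can point directly at the sitemap that lists the pages.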

System Info

System Information

OS: Windows
OS Version: 10.0.19045
Python Version: 3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]

Package Information

langchain_core: 0.3.48
langchain: 0.3.21
langchain_community: 0.3.20
langsmith: 0.1.137
langchain_openai: 0.3.10
langchain_text_splitters: 0.3.7

Labels

🤖:bug (Related to a bug, vulnerability, unexpected error with an existing feature)
