Description
Checked other resources
- I added a very descriptive title to this issue.
- I searched the LangChain documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.
- I am sure that this is a bug in LangChain rather than my code.
- The bug is not resolved by updating to the latest stable version of LangChain (or the specific integration package).
Example Code
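from langchain_community.document_loaders import GitbookLoader

# Crawl every page listed in the site's sitemap
loader = GitbookLoader(
    web_page="https://docs.gitbook.com/",
    load_all_paths=True,
)
docs = loader.load()
print(len(docs))  # prints 0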
Error Message and Stack Trace (if applicable)
No response
Description
- I am trying to fetch all pages of a GitBook documentation site using GitbookLoader with load_all_paths=True.
- The site's sitemap (e.g. the one for GitBook's own documentation) contains references to other sitemaps.
- Instead of the sub-pages being loaded into the docs variable, docs is an empty list (0 is printed).
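For context, the sitemap protocol distinguishes a regular sitemap (root element `urlset`, listing pages) from a sitemap index (root element `sitemapindex`, listing other sitemaps). The following is a minimal sketch for telling the two apart with the standard library only. The helper name `sitemap_kind` and the sample XML are made up for illustration and are not part of LangChain.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample documents, not copied from any real site.
SAMPLE_INDEX = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
    "</sitemapindex>"
)
SAMPLE_PAGES = (
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<url><loc>https://example.com/getting-started</loc></url>"
    "</urlset>"
)


def sitemap_kind(xml_text: str) -> str:
    """Return 'index', 'pages', or 'unknown' based on the XML root element."""
    root = ET.fromstring(xml_text)
    # Tags look like '{namespace}name'; keep only the local name.
    local_name = root.tag.rsplit("}", 1)[-1]
    if local_name == "sitemapindex":
        return "index"
    if local_name == "urlset":
        return "pages"
    return "unknown"


print(sitemap_kind(SAMPLE_INDEX))
print(sitemap_kind(SAMPLE_PAGES))
```

A loader that assumes every sitemap is of the "pages" kind would need an extra step to follow the child sitemaps of an "index".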
The problem can be fixed by changing the web_page assignment in __init__ of gitbook.py to:
if load_all_paths:
    # set web_path to the sitemap if we want to crawl all paths
    web_page = f"{self.base_url}/sitemap-pages.xml"
So perhaps a constructor parameter for providing a custom sitemap URL would be sufficient.
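As an alternative to hard-coding a sitemap file name, a loader could follow nested sitemaps itself. Below is a rough sketch of that idea, not the actual GitbookLoader implementation: `collect_page_urls` and the `fetch` callback are hypothetical names, and the in-memory `FAKE_SITE` stands in for real HTTP responses so the example runs without network access.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    _seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return page URLs from a sitemap, following sitemap indexes recursively."""
    seen = _seen if _seen is not None else set()
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        urls: List[str] = []
        for loc in root.iterfind(f"{SITEMAP_NS}sitemap/{SITEMAP_NS}loc"):
            if loc.text:
                urls.extend(collect_page_urls(loc.text.strip(), fetch, seen))
        return urls
    # Otherwise treat the document as a regular <urlset> sitemap.
    return [
        loc.text.strip()
        for loc in root.iterfind(f"{SITEMAP_NS}url/{SITEMAP_NS}loc")
        if loc.text
    ]


# Hypothetical site: an index pointing at one page sitemap.
NS_ATTR = 'xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"'
FAKE_SITE = {
    "https://example.com/sitemap.xml": (
        f"<sitemapindex {NS_ATTR}>"
        "<sitemap><loc>https://example.com/sitemap-pages.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/sitemap-pages.xml": (
        f"<urlset {NS_ATTR}>"
        "<url><loc>https://example.com/</loc></url>"
        "<url><loc>https://example.com/getting-started</loc></url>"
        "</urlset>"
    ),
}

pages = collect_page_urls("https://example.com/sitemap.xml", FAKE_SITE.__getitem__)
print(pages)
```

In real use, `fetch` would be replaced by an HTTP call that returns the response body as text. Passing the starting `sitemap_url` explicitly also covers the custom sitemap URL case suggested above.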
System Info
System Information
OS: Windows
OS Version: 10.0.19045
Python Version: 3.11.9 (tags/v3.11.9:de54cf5, Apr 2 2024, 10:12:12) [MSC v.1938 64 bit (AMD64)]
Package Information
langchain_core: 0.3.48
langchain: 0.3.21
langchain_community: 0.3.20
langsmith: 0.1.137
langchain_openai: 0.3.10
langchain_text_splitters: 0.3.7