Commit 03e49dc

feat: added markdownify and localscraper tools
1 parent 8f4c9d2 commit 03e49dc

15 files changed, +796 −54 lines changed

README.md

+140 −1
@@ -1 +1,140 @@
-# langchain-scrapegraph
+# 🕷️🦜 langchain-scrapegraph

[![License](https://img.shields.io/badge/License-MIT-blue.svg)](https://opensource.org/licenses/MIT)
[![Python Support](https://img.shields.io/pypi/pyversions/langchain-scrapegraph.svg)](https://pypi.org/project/langchain-scrapegraph/)
[![Documentation](https://img.shields.io/badge/Documentation-Latest-green)](https://scrapegraphai.com/docs)

Supercharge your LangChain agents with AI-powered web scraping capabilities. LangChain-ScrapeGraph provides a seamless integration between [LangChain](https://github.com/langchain-ai/langchain) and [ScrapeGraph AI](https://scrapegraphai.com), enabling your agents to extract structured data from websites using natural language.

## 📦 Installation

```bash
pip install langchain-scrapegraph
```

## 🛠️ Available Tools

### 📝 MarkdownifyTool
Convert any webpage into clean, formatted markdown.

```python
from langchain_scrapegraph.tools import MarkdownifyTool

tool = MarkdownifyTool()
markdown = tool.invoke({"website_url": "https://example.com"})

print(markdown)
```

### 🔍 SmartScraperTool
Extract structured data from any webpage using natural language prompts.

```python
from langchain_scrapegraph.tools import SmartScraperTool

# Initialize the tool (uses SGAI_API_KEY from environment)
tool = SmartScraperTool()

# Extract information using natural language
result = tool.invoke({
    "website_url": "https://www.example.com",
    "user_prompt": "Extract the main heading and first paragraph"
})

print(result)
```

### 💻 LocalScraperTool
Extract information from HTML content using AI.

```python
from langchain_scrapegraph.tools import LocalScraperTool

tool = LocalScraperTool()
result = tool.invoke({
    "user_prompt": "Extract all contact information",
    "website_html": "<html>...</html>"
})

print(result)
```

## 🌟 Key Features

- 🐦 **LangChain Integration**: Seamlessly works with LangChain agents and chains
- 🔍 **AI-Powered Extraction**: Use natural language to describe what data to extract
- 📊 **Structured Output**: Get clean, structured data ready for your agents
- 🔄 **Flexible Tools**: Choose from multiple specialized scraping tools
- **Async Support**: Built-in support for async operations (see the sketch below)
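
The async support mentioned above goes through LangChain's standard `ainvoke` method (the `GetCreditsTool` docstring added in this commit shows `await tool.ainvoke({})`). A minimal sketch, assuming the other tools expose the same interface:

```python
import asyncio

from langchain_scrapegraph.tools import MarkdownifyTool


async def main() -> None:
    # Reads SGAI_API_KEY from the environment, like the synchronous examples.
    tool = MarkdownifyTool()
    # ainvoke is the standard async entry point on LangChain tools.
    markdown = await tool.ainvoke({"website_url": "https://example.com"})
    print(markdown)


asyncio.run(main())
```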

## 💡 Use Cases

- 📖 **Research Agents**: Create agents that gather and analyze web data
- 📊 **Data Collection**: Automate structured data extraction from websites
- 📝 **Content Processing**: Convert web content into markdown for further processing
- 🔍 **Information Extraction**: Extract specific data points using natural language

## 🤖 Example Agent

```python
from langchain.agents import initialize_agent, AgentType
from langchain_scrapegraph.tools import SmartScraperTool
from langchain_openai import ChatOpenAI

# Initialize tools
tools = [
    SmartScraperTool(),
]

# Create an agent
agent = initialize_agent(
    tools=tools,
    llm=ChatOpenAI(temperature=0),
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True
)

# Use the agent
response = agent.run("""
Visit example.com, make a summary of the content, and extract the main heading and first paragraph.
""")
```

## ⚙️ Configuration

Set your ScrapeGraph API key in your environment:

```bash
export SGAI_API_KEY="your-api-key-here"
```

Or set it programmatically:

```python
import os
os.environ["SGAI_API_KEY"] = "your-api-key-here"
```
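
The `GetCreditsTool` docstring added in this commit also documents an `api_key` init argument, so the key can be passed directly at construction time. A minimal sketch (it is an assumption here that the scraping tools accept the same argument):

```python
from langchain_scrapegraph.tools import GetCreditsTool, SmartScraperTool

# Documented for GetCreditsTool in this commit; assumed to work the same
# way for SmartScraperTool. Replace the placeholder with a real key.
credits_tool = GetCreditsTool(api_key="your-api-key-here")
scraper_tool = SmartScraperTool(api_key="your-api-key-here")
```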

## 📚 Documentation

- [API Documentation](https://scrapegraphai.com/docs)
- [LangChain Documentation](https://python.langchain.com/docs/get_started/introduction.html)
- [Examples](examples/)

## 💬 Support & Feedback

- 📧 Email: [email protected]
- 💻 GitHub Issues: [Create an issue](https://github.com/ScrapeGraphAI/langchain-scrapegraph/issues)
- 🌟 Feature Requests: [Request a feature](https://github.com/ScrapeGraphAI/langchain-scrapegraph/issues/new)

## 📄 License

This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.

## 🙏 Acknowledgments

This project is built on top of:
- [LangChain](https://github.com/langchain-ai/langchain)
- [ScrapeGraph AI](https://scrapegraphai.com)

---

Made with ❤️ by [ScrapeGraph AI](https://scrapegraphai.com)

examples/agent_example.py

+57
@@ -0,0 +1,57 @@
"""
Remember to install the additional dependencies for this example to work:
pip install langchain-openai langchain
"""

from dotenv import load_dotenv
from langchain.agents import AgentExecutor, create_openai_functions_agent
from langchain_core.messages import SystemMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_openai import ChatOpenAI

from langchain_scrapegraph.tools import (
    GetCreditsTool,
    LocalScraperTool,
    SmartScraperTool,
)

load_dotenv()

# Initialize the tools
tools = [
    SmartScraperTool(),
    LocalScraperTool(),
    GetCreditsTool(),
]

# Create the prompt template
prompt = ChatPromptTemplate.from_messages(
    [
        SystemMessage(
            content=(
                "You are a helpful AI assistant that can analyze websites and extract information. "
                "You have access to tools that can help you scrape and process web content. "
                "Always explain what you're doing before using a tool."
            )
        ),
        MessagesPlaceholder(variable_name="chat_history", optional=True),
        ("user", "{input}"),
        MessagesPlaceholder(variable_name="agent_scratchpad"),
    ]
)

# Initialize the LLM
llm = ChatOpenAI(temperature=0)

# Create the agent
agent = create_openai_functions_agent(llm, tools, prompt)

# Create the executor
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)

# Example usage
query = """Extract the main products from https://www.scrapegraphai.com/"""

print("\nQuery:", query, "\n")
response = agent_executor.invoke({"input": query})
print("\nFinal Response:", response["output"])

examples/get_credits_tool.py

+9 −5
@@ -1,9 +1,13 @@
+from scrapegraph_py.logger import sgai_logger
+
 from langchain_scrapegraph.tools import GetCreditsTool

-# Will automatically get SGAI_API_KEY from environment, or set it manually
+sgai_logger.set_logging(level="INFO")
+
+# Will automatically get SGAI_API_KEY from environment
 tool = GetCreditsTool()
-credits = tool.run()

-print("\nCredits Information:")
-print(f"Remaining Credits: {credits['remaining_credits']}")
-print(f"Total Credits Used: {credits['total_credits_used']}")
+# Use the tool
+credits = tool.invoke({})
+
+print(credits)

examples/localscraper_tool.py

+28
@@ -0,0 +1,28 @@
from scrapegraph_py.logger import sgai_logger

from langchain_scrapegraph.tools import LocalScraperTool

sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from environment
tool = LocalScraperTool()

# Example website and prompt
html_content = """
<html>
<body>
    <h1>Company Name</h1>
    <p>We are a technology company focused on AI solutions.</p>
    <div class="contact">
        <p>Email: [email protected]</p>
        <p>Phone: (555) 123-4567</p>
    </div>
</body>
</html>
"""
user_prompt = "Make a summary of the webpage and extract the email and phone number"

# Use the tool
result = tool.invoke({"website_html": html_content, "user_prompt": user_prompt})

print(result)

examples/markdownify_tool.py

+16
@@ -0,0 +1,16 @@
from scrapegraph_py.logger import sgai_logger

from langchain_scrapegraph.tools import MarkdownifyTool

sgai_logger.set_logging(level="INFO")

# Will automatically get SGAI_API_KEY from environment
tool = MarkdownifyTool()

# Example website
website_url = "https://www.example.com"

# Use the tool
result = tool.invoke({"website_url": website_url})

print(result)

examples/smartscraper_tool.py

+10 −8
@@ -1,15 +1,17 @@
-from langchain_scrapegraph.tools import SmartscraperTool
+from scrapegraph_py.logger import sgai_logger

-# Will automatically get SGAI_API_KEY from environment, or set it manually
-tool = SmartscraperTool()
+from langchain_scrapegraph.tools import SmartScraperTool
+
+sgai_logger.set_logging(level="INFO")
+
+# Will automatically get SGAI_API_KEY from environment
+tool = SmartScraperTool()

 # Example website and prompt
 website_url = "https://www.example.com"
 user_prompt = "Extract the main heading and first paragraph from this webpage"

-# Use the tool synchronously
-result = tool.run({"user_prompt": user_prompt, "website_url": website_url})
+# Use the tool
+result = tool.invoke({"website_url": website_url, "user_prompt": user_prompt})

-print("\nExtraction Results:")
-print(f"Main Heading: {result['main_heading']}")
-print(f"First Paragraph: {result['first_paragraph']}")
+print(result)

langchain_scrapegraph/tools/__init__.py

+4 −2
@@ -1,4 +1,6 @@
 from .credits import GetCreditsTool
-from .smartscraper import SmartscraperTool
+from .localscraper import LocalScraperTool
+from .markdownify import MarkdownifyTool
+from .smartscraper import SmartScraperTool

-__all__ = ["SmartscraperTool", "GetCreditsTool"]
+__all__ = ["SmartScraperTool", "GetCreditsTool", "MarkdownifyTool", "LocalScraperTool"]

langchain_scrapegraph/tools/credits.py

+51 −4
@@ -7,25 +7,72 @@
 from langchain_core.tools import BaseTool
 from langchain_core.utils import get_from_dict_or_env
 from pydantic import model_validator
-from scrapegraph_py import SyncClient
+from scrapegraph_py import Client


 class GetCreditsTool(BaseTool):
+    """Tool for checking remaining credits on your ScrapeGraph AI account.
+
+    Setup:
+        Install ``langchain-scrapegraph`` python package:
+
+        .. code-block:: bash
+
+            pip install langchain-scrapegraph
+
+        Get your API key from ScrapeGraph AI (https://scrapegraphai.com)
+        and set it as an environment variable:
+
+        .. code-block:: bash
+
+            export SGAI_API_KEY="your-api-key"
+
+    Key init args:
+        api_key: Your ScrapeGraph AI API key. If not provided, will look for SGAI_API_KEY env var.
+        client: Optional pre-configured ScrapeGraph client instance.
+
+    Instantiate:
+        .. code-block:: python
+
+            from langchain_scrapegraph.tools import GetCreditsTool
+
+            # Will automatically get SGAI_API_KEY from environment
+            tool = GetCreditsTool()
+
+            # Or provide API key directly
+            tool = GetCreditsTool(api_key="your-api-key")
+
+    Use the tool:
+        .. code-block:: python
+
+            result = tool.invoke({})
+
+            print(result)
+            # {
+            #     "remaining_credits": 100,
+            #     "total_credits_used": 50
+            # }
+
+    Async usage:
+        .. code-block:: python
+
+            result = await tool.ainvoke({})
+    """
+
     name: str = "GetCredits"
     description: str = (
         "Get the current credits available in your ScrapeGraph AI account"
     )
     return_direct: bool = True
-    client: Optional[SyncClient] = None
+    client: Optional[Client] = None
     api_key: str
-    testing: bool = False

     @model_validator(mode="before")
     @classmethod
     def validate_environment(cls, values: Dict) -> Dict:
         """Validate that api key exists in environment."""
         values["api_key"] = get_from_dict_or_env(values, "api_key", "SGAI_API_KEY")
-        values["client"] = SyncClient(api_key=values["api_key"])
+        values["client"] = Client(api_key=values["api_key"])
         return values

     def __init__(self, **data: Any):
