Commit e4d9c5e

feat: Implement first version
1 parent 4cf8f9b commit e4d9c5e

7 files changed

+393
-24
lines changed

Diff for: README.md

+65 -3
@@ -5,14 +5,76 @@
 [![pypi version](https://img.shields.io/pypi/v/mkdocs-llmstxt.svg)](https://pypi.org/project/mkdocs-llmstxt/)
 [![gitter](https://badges.gitter.im/join%20chat.svg)](https://app.gitter.im/#/room/#mkdocs-llmstxt:gitter.im)

-MkDocs plugin to generate an /llms.txt file.
+MkDocs plugin to generate an [/llms.txt file](https://llmstxt.org/).

-## Requirements
+> /llms.txt - A proposal to standardise on using an /llms.txt file to provide information to help LLMs use a website at inference time.

-Pandoc must be [installed](https://pandoc.org/installing.html) and available as `pandoc`.
+See our own dynamically generated [/llms.txt](llms.txt) as a demonstration.

 ## Installation

 ```bash
 pip install mkdocs-llmstxt
 ```
+
+## Usage
+
+Enable the plugin in `mkdocs.yml`:
+
+```yaml title="mkdocs.yml"
+plugins:
+- llmstxt:
+    files:
+    - output: llms.txt
+      inputs:
+      - file1.md
+      - folder/file2.md
+```
+
+You can generate several files, each from its own set of input files.
+
+File globbing is supported:
+
+```yaml title="mkdocs.yml"
+plugins:
+- llmstxt:
+    files:
+    - output: llms.txt
+      inputs:
+      - file1.md
+      - reference/*/*.md
+```
+
+The plugin will concatenate the rendered HTML of these input pages, clean it up a bit (with [BeautifulSoup](https://pypi.org/project/beautifulsoup4/)), convert it back to Markdown (with [Markdownify](https://pypi.org/project/markdownify)), and format it (with [Mdformat](https://pypi.org/project/mdformat)). By concatenating HTML instead of Markdown, we ensure that dynamically generated contents (API documentation, executed code blocks, snippets from other files, Jinja macros, etc.) are part of the generated text files. Credits to [Petyo Ivanov](https://github.com/petyosi) for the original idea ✨
+
+You can disable auto-cleaning of the HTML:
+
+```yaml title="mkdocs.yml"
+plugins:
+- llmstxt:
+    autoclean: false
+```
+
+You can also pre-process the HTML before it is converted back to Markdown:
+
+```yaml title="mkdocs.yml"
+plugins:
+- llmstxt:
+    preprocess: path/to/script.py
+```
+
+The specified `script.py` must expose a `preprocess` function that accepts the `soup` and `output` arguments:
+
+```python
+from typing import TYPE_CHECKING
+
+if TYPE_CHECKING:
+    from bs4 import BeautifulSoup
+
+def preprocess(soup: BeautifulSoup, output: str) -> None:
+    ...  # modify the soup
+```
+
+The `output` argument lets you modify the soup *depending on which file is being generated*.
+
+Have a look at [our own pre-processing function](https://pawamoy.github.io/mkdocs-llmstxt/reference/mkdocs_llmstxt/preprocess/#mkdocs_llmstxt.preprocess.autoclean) to get inspiration.
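
A minimal sketch of a pre-processing script matching the `preprocess(soup, output)` signature described in the README, offered as an illustration rather than as the project's own code. The tag names (`nav`, `table`) and the per-output rule are assumptions chosen for the example. Note that because `BeautifulSoup` is imported only under `TYPE_CHECKING`, the sketch adds `from __future__ import annotations` so the annotation is not evaluated at function definition time on Python versions that evaluate annotations eagerly.

```python
from __future__ import annotations  # keep the BeautifulSoup annotation unevaluated at runtime

from typing import TYPE_CHECKING

if TYPE_CHECKING:
    from bs4 import BeautifulSoup


def preprocess(soup: BeautifulSoup, output: str) -> None:
    """Hypothetical hook: strip navigation everywhere, and tables for one output only."""
    # find_all returns a list-like result, so removing tags while iterating is safe.
    for nav in soup.find_all("nav"):
        nav.decompose()

    # The `output` argument identifies the file being generated (assumed name here).
    if output == "llms.txt":
        for table in soup.find_all("table"):
            table.decompose()
```

The function mutates the soup in place and returns nothing, which matches the `-> None` signature shown in the README.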

Diff for: mkdocs.yml

+4 -15
@@ -140,23 +140,12 @@ plugins:
     enabled: !ENV [DEPLOY, false]
     enable_creation_date: true
     type: timeago
-- manpage:
-    preprocess: scripts/preprocess.py
-    pages:
-    - title: MkDocs Manpage
-      header: MkDocs plugins
-      output: share/man/man1/mkdocs-manpage.1
+- llmstxt:
+    files:
+    - output: llms.txt
       inputs:
       - index.md
-      - changelog.md
-      - contributing.md
-      - credits.md
-      - license.md
-    - title: mkdocs-manpage API
-      header: Python Library APIs
-      output: share/man/man3/mkdocs_manpage.3
-      inputs:
-      - reference/mkdocs_manpage/*.md
+      - reference/mkdocs_llmstxt/*.md
 - minify:
     minify_html: !ENV [DEPLOY, false]
 - group:

Diff for: pyproject.toml

+4 -6
@@ -28,12 +28,10 @@ classifiers = [
     "Topic :: Utilities",
     "Typing :: Typed",
 ]
-dependencies = []
-
-[project.optional-dependencies]
-preprocess = [
+dependencies = [
     "beautifulsoup4>=4.12",
-    "lxml>=5.3",
+    "markdownify>=0.14",
+    "mdformat>=0.7.21",
 ]

 [project.urls]
@@ -108,4 +106,4 @@ dev = [
     "mkdocstrings[python]>=0.25",
     # YORE: EOL 3.10: Remove line.
     "tomli>=2.0; python_version < '3.11'",
-]
+]

Diff for: src/mkdocs_llmstxt/config.py

+21
@@ -0,0 +1,21 @@
+"""Configuration options for the MkDocs LLMsTxt plugin."""
+
+from __future__ import annotations
+
+from mkdocs.config import config_options as mkconf
+from mkdocs.config.base import Config as BaseConfig
+
+
+class FileConfig(BaseConfig):
+    """Sub-config for each Markdown file."""
+
+    output = mkconf.Type(str)
+    inputs = mkconf.ListOfItems(mkconf.Type(str))
+
+
+class PluginConfig(BaseConfig):
+    """Configuration options for the plugin."""
+
+    autoclean = mkconf.Type(bool, default=True)
+    preprocess = mkconf.Optional(mkconf.File(exists=True))
+    files = mkconf.ListOfItems(mkconf.SubConfig(FileConfig))

Diff for: src/mkdocs_llmstxt/logger.py

+49
@@ -0,0 +1,49 @@
+"""Logging functions."""
+
+from __future__ import annotations
+
+import logging
+from typing import TYPE_CHECKING, Any
+
+if TYPE_CHECKING:
+    from collections.abc import MutableMapping
+
+
+class PluginLogger(logging.LoggerAdapter):
+    """A logger adapter to prefix messages with the originating package name."""
+
+    def __init__(self, prefix: str, logger: logging.Logger):
+        """Initialize the object.
+
+        Arguments:
+            prefix: The string to insert in front of every message.
+            logger: The logger instance.
+        """
+        super().__init__(logger, {})
+        self.prefix = prefix
+
+    def process(self, msg: str, kwargs: MutableMapping[str, Any]) -> tuple[str, Any]:
+        """Process the message.
+
+        Arguments:
+            msg: The message:
+            kwargs: Remaining arguments.
+
+        Returns:
+            The processed message.
+        """
+        return f"{self.prefix}: {msg}", kwargs
+
+
+def get_logger(name: str) -> PluginLogger:
+    """Return a logger for plugins.
+
+    Arguments:
+        name: The name to use with `logging.getLogger`.
+
+    Returns:
+        A logger configured to work well in MkDocs,
+        prefixing each message with the plugin package name.
+    """
+    logger = logging.getLogger(f"mkdocs.plugins.{name}")
+    return PluginLogger(name.split(".", 1)[0], logger)
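
A small standalone sketch of how such an adapter prefixes messages, using only the standard library. `PrefixAdapter`, `ListHandler`, and `demo` are illustrative names that do not appear in the commit; the logger name mirrors what `get_logger(__name__)` would presumably receive from `mkdocs_llmstxt.plugin`.

```python
import logging


class PrefixAdapter(logging.LoggerAdapter):
    """Illustrative adapter: prepend a fixed prefix to every message."""

    def __init__(self, prefix: str, logger: logging.Logger) -> None:
        super().__init__(logger, {})
        self.prefix = prefix

    def process(self, msg, kwargs):
        # Called by LoggerAdapter before the message reaches the wrapped logger.
        return f"{self.prefix}: {msg}", kwargs


class ListHandler(logging.Handler):
    """Illustrative handler that collects formatted messages in memory."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def demo() -> list[str]:
    name = "mkdocs_llmstxt.plugin"  # assumed module name
    base = logging.getLogger(f"mkdocs.plugins.{name}")
    base.setLevel(logging.INFO)
    base.propagate = False  # keep the demo output out of the root logger
    handler = ListHandler()
    base.addHandler(handler)
    try:
        # Same prefix rule as in the diff: the first dotted component of the name.
        adapter = PrefixAdapter(name.split(".", 1)[0], base)
        adapter.info("Generated file /%s", "llms.txt")
    finally:
        base.removeHandler(handler)
    return handler.messages


messages = demo()
print(messages)
```

Registering the underlying logger beneath `mkdocs.plugins.` places it in the MkDocs logger hierarchy, while the adapter keeps the visible prefix short.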

Diff for: src/mkdocs_llmstxt/plugin.py

+149
@@ -0,0 +1,149 @@
+"""MkDocs plugin that generates a Markdown file at the end of the build."""
+
+from __future__ import annotations
+
+import fnmatch
+from collections import defaultdict
+from itertools import chain
+from pathlib import Path
+from typing import TYPE_CHECKING
+
+import mdformat
+from bs4 import BeautifulSoup as Soup
+from bs4 import Tag
+from markdownify import ATX, MarkdownConverter
+from mkdocs.config.defaults import MkDocsConfig
+from mkdocs.exceptions import PluginError
+from mkdocs.plugins import BasePlugin
+
+from mkdocs_llmstxt.config import PluginConfig
+from mkdocs_llmstxt.logger import get_logger
+from mkdocs_llmstxt.preprocess import autoclean, preprocess
+
+if TYPE_CHECKING:
+    from typing import Any
+
+    from mkdocs.config.defaults import MkDocsConfig
+    from mkdocs.structure.files import Files
+    from mkdocs.structure.pages import Page
+
+
+logger = get_logger(__name__)
+
+
+class MkdocsLLMsTxtPlugin(BasePlugin[PluginConfig]):
+    """The MkDocs plugin to generate an `llms.txt` file.
+
+    This plugin defines the following event hooks:
+
+    - `on_page_content`
+    - `on_post_build`
+
+    Check the [Developing Plugins](https://www.mkdocs.org/user-guide/plugins/#developing-plugins) page of `mkdocs`
+    for more information about its plugin system.
+    """
+
+    mkdocs_config: MkDocsConfig
+
+    def __init__(self) -> None:  # noqa: D107
+        self.html_pages: dict[str, dict[str, str]] = defaultdict(dict)
+
+    def _expand_inputs(self, inputs: list[str], page_uris: list[str]) -> list[str]:
+        expanded: list[str] = []
+        for input_file in inputs:
+            if "*" in input_file:
+                expanded.extend(fnmatch.filter(page_uris, input_file))
+            else:
+                expanded.append(input_file)
+        return expanded
+
+    def on_config(self, config: MkDocsConfig) -> MkDocsConfig | None:
+        """Save the global MkDocs configuration.
+
+        Hook for the [`on_config` event](https://www.mkdocs.org/user-guide/plugins/#on_config).
+        In this hook, we save the global MkDocs configuration into an instance variable,
+        to re-use it later.
+
+        Arguments:
+            config: The MkDocs config object.
+
+        Returns:
+            The same, untouched config.
+        """
+        self.mkdocs_config = config
+        return config
+
+    def on_files(self, files: Files, *, config: MkDocsConfig) -> Files | None:  # noqa: ARG002
+        """Expand inputs for generated files.
+
+        Hook for the [`on_files` event](https://www.mkdocs.org/user-guide/plugins/#on_files).
+        In this hook we expand inputs for generated file (glob patterns using `*`).
+
+        Parameters:
+            files: The collection of MkDocs files.
+            config: The MkDocs configuration.
+
+        Returns:
+            Modified collection or none.
+        """
+        for file in self.config.files:
+            file["inputs"] = self._expand_inputs(file["inputs"], page_uris=list(files.src_uris.keys()))
+        return files
+
+    def on_page_content(self, html: str, *, page: Page, **kwargs: Any) -> str | None:  # noqa: ARG002
+        """Record pages contents.
+
+        Hook for the [`on_page_content` event](https://www.mkdocs.org/user-guide/plugins/#on_page_content).
+        In this hook we simply record the HTML of the pages into a dictionary whose keys are the pages' URIs.
+
+        Parameters:
+            html: The rendered HTML.
+            page: The page object.
+        """
+        for file in self.config.files:
+            if page.file.src_uri in file["inputs"]:
+                logger.debug(f"Adding page {page.file.src_uri} to page {file['output']}")
+                self.html_pages[file["output"]][page.file.src_uri] = html
+        return html
+
+    def on_post_build(self, config: MkDocsConfig, **kwargs: Any) -> None:  # noqa: ARG002
+        """Combine all recorded pages contents and convert it to a Markdown file with BeautifulSoup and Markdownify.
+
+        Hook for the [`on_post_build` event](https://www.mkdocs.org/user-guide/plugins/#on_post_build).
+        In this hook we concatenate all previously recorded HTML, and convert it to Markdown using Markdownify.
+
+        Parameters:
+            config: MkDocs configuration.
+        """
+
+        def language_callback(tag: Tag) -> str:
+            for css_class in chain(tag.get("class", ()), tag.parent.get("class", ())):
+                if css_class.startswith("language-"):
+                    return css_class[9:]
+            return ""
+
+        converter = MarkdownConverter(
+            bullets="-",
+            code_language_callback=language_callback,
+            escape_underscores=False,
+            heading_style=ATX,
+        )
+
+        for file in self.config.files:
+            try:
+                html = "\n\n".join(self.html_pages[file["output"]][input_page] for input_page in file["inputs"])
+            except KeyError as error:
+                raise PluginError(str(error)) from error
+
+            soup = Soup(html, "html.parser")
+            if self.config.autoclean:
+                autoclean(soup)
+            if self.config.preprocess:
+                preprocess(soup, self.config.preprocess, file["output"])
+
+            output_file = Path(config.site_dir).joinpath(file["output"])
+            output_file.parent.mkdir(parents=True, exist_ok=True)
+            markdown = mdformat.text(converter.convert_soup(soup), options={"wrap": "no"})
+            output_file.write_text(markdown, encoding="utf8")
+
+            logger.info(f"Generated file /{file['output']}")
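
A minimal standalone sketch of the glob expansion performed by `_expand_inputs`, using only `fnmatch` from the standard library. The page list and the `missing.md` entry are invented for illustration. Two behaviours follow from the code in the diff: `fnmatch` lets `*` match `/` as well, so `reference/*/*.md` can match pages nested more than one level deep; and only entries containing `*` are expanded, so a pattern relying solely on `?` or `[...]`, or a literal path with no recorded page, is passed through unchanged. In the plugin, such a leftover entry would presumably surface later as the `KeyError` that `on_post_build` wraps in `PluginError`.

```python
import fnmatch


def expand_inputs(inputs: list[str], page_uris: list[str]) -> list[str]:
    """Standalone copy of the expansion logic, for experimentation."""
    expanded: list[str] = []
    for input_file in inputs:
        if "*" in input_file:
            # fnmatch.filter keeps the order of page_uris, not of the pattern.
            expanded.extend(fnmatch.filter(page_uris, input_file))
        else:
            # Literal entries are kept as-is, whether or not the page exists.
            expanded.append(input_file)
    return expanded


# Hypothetical source URIs, as MkDocs would report them relative to docs_dir.
pages = [
    "index.md",
    "reference/api/config.md",
    "reference/api/sub/deep.md",
    "changelog.md",
]

result = expand_inputs(["index.md", "reference/*/*.md", "missing.md"], pages)
print(result)
```

If a single-level match is required, a stricter matcher (for example one based on `pathlib.PurePosixPath.match`) would be a possible alternative; that is a design option, not something the commit does.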
