specs/crawler/common/schemas/action.yml (27 additions, 21 deletions)
@@ -21,17 +21,16 @@ Action:
     discoveryPatterns:
       type: array
       description: |
-        Indicates _intermediary_ pages that the crawler should visit.
-
-        For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/discoverypatterns/).
+        Which _intermediary_ web pages the crawler should visit.
+        Use `discoveryPatterns` to define pages that should be visited _just_ for their links to other pages,
+        _not_ their content.
+        It functions similarly to the `pathsToMatch` action but without record extraction.
       items:
         $ref: '#/urlPattern'
     fileTypesToMatch:
       type: array
       description: |
         File types for crawling non-HTML documents.
-
-        For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
       maxItems: 100
       items:
         $ref: '#/fileTypes'
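The fields in this hunk can be combined in a single crawler action. A hedged sketch of such a configuration, using made-up URLs and an illustrative index name; only `discoveryPatterns`, `pathsToMatch`, and `fileTypesToMatch` come from this schema, the rest is assumed context:

```yaml
actions:
  - indexName: example_index            # illustrative name
    pathsToMatch:
      - https://example.com/products/**
    discoveryPatterns:
      # Category pages are visited only for their links;
      # no records are extracted from them.
      - https://example.com/categories/**
    fileTypesToMatch:
      - pdf
      - doc
```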
@@ -69,16 +68,22 @@ Action:
       type: object
       description: |
         Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
-        The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor` property.
 
-        For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/recordextractor/).
+        The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor`.
+        For details, see the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
       properties:
         __type:
           $ref: '#/configurationRecordExtractorType'
         source:
           type: string
           description: |
             A JavaScript function (as a string) that returns one or more Algolia records for each crawled page.
+    schedule:
+      type: string
+      description: |
+        How often to perform a complete crawl for this action.
+
+        For more information, see the [`schedule` parameter documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/schedule/).
     selectorsToMatch:
       type: array
       description: |
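The `recordExtractor` and `schedule` parameters from the previous hunk might appear together like this. A hedged sketch: the `__type` value and the schedule phrasing are assumptions based on the schema references above, not verified values:

```yaml
recordExtractor:
  __type: function                # assumed value of configurationRecordExtractorType
  source: |
    ({ url, $ }) => {
      // One record per crawled page (illustrative only)
      return [{ objectID: url.href, title: $('title').text() }];
    }
schedule: every 1 day             # illustrative schedule expression
```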
@@ -87,7 +92,8 @@ Action:
       maxItems: 100
       items:
         type: string
-        description: DOM selector. Negation is supported. This lets you ignore pages that match the selector.
+        description: |
+          Prefix a selector with `!` to ignore matching pages.
       example:
         - .products
         - '!.featured'
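Read in context, the negation syntax means an action can require one selector and exclude another. A sketch based on the example values above:

```yaml
selectorsToMatch:
  - .products       # crawl only pages containing this selector
  - '!.featured'    # ...but skip pages matching this one
```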
@@ -110,8 +116,6 @@ ActionSchedule:
 fileTypes:
   type: string
   description: |
-    Supported file type for indexing non-HTML documents.
-
     For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
   enum:
     - doc
@@ -127,20 +131,14 @@ fileTypes:
 urlPattern:
   type: string
   description: |
-    Pattern for matching URLs.
-
-    Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
+    Use [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
   example: https://www.algolia.com/**
 
 hostnameAliases:
   type: object
   example:
     'dev.example.com': 'example.com'
-  description: |
-    Key-value pairs to replace matching hostnames found in a sitemap,
-    on a page, in canonical links, or redirects.
-
-    For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/hostnamealiases/).
+  description: "Key-value pairs to replace matching hostnames found in a sitemap,\non a page, in canonical links, or redirects.\n\n\nDuring a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs.\nThis helps with links to staging environments (like `dev.example.com`) or external hosting services (such as YouTube).\n\n\nFor example, with this `hostnameAliases` mapping:\n\n    {\n      hostnameAliases: {\n        'dev.example.com': 'example.com'\n      }\n    }\n\n1. The crawler encounters `https://dev.example.com/solutions/voice-search/`.\n\n1. `hostnameAliases` transforms the URL to `https://example.com/solutions/voice-search/`.\n\n1. The crawler follows the transformed URL (not the original).\n\n\n**`hostnameAliases` only changes URLs, not page text. In the preceding example, if the extracted text contains the string `dev.example.com`, it remains unchanged.**\n\n\nThe crawler can discover URLs in places such as:\n\n\n- Crawled pages\n\n- Sitemaps\n\n- [Canonical URLs](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#canonical-urls-and-crawler-behavior)\n\n- Redirects\n\n\nHowever, `hostnameAliases` doesn't transform URLs you explicitly set in the `startUrls` or `sitemaps` parameters,\nnor does it affect the `pathsToMatch` action or other configuration elements.\n"
   additionalProperties:
     type: string
     description: Hostname that should be added in the records.
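micromatch itself is a JavaScript library, so the following Python sketch only approximates the wildcard matching that `urlPattern` relies on: `fnmatch`'s `*` also crosses `/` boundaries, loosely imitating micromatch's `**`.

```python
from fnmatch import fnmatch

# Approximation of micromatch-style globbing. Note: unlike micromatch,
# fnmatch has no "!pattern" negation; this only illustrates wildcards.
pattern = "https://www.algolia.com/*"

print(fnmatch("https://www.algolia.com/doc/tools/crawler/", pattern))  # True
print(fnmatch("https://example.com/doc/", pattern))                    # False
```

A real configuration should use micromatch's own syntax, including its negation and brace-expansion features, which `fnmatch` does not model.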
@@ -153,13 +151,21 @@ pathAliases:
     '/foo': '/bar'
   description: |
     Key-value pairs to replace matching paths with new values.
-
+
     It doesn't replace:
-
+
     - URLs in the `startUrls`, `sitemaps`, `pathsToMatch`, and other settings.
     - Paths found in extracted text.
 
     The crawl continues from the _transformed_ URLs.
+
+
+    For example, if you create a mapping for `{ "dev.example.com": { '/foo': '/bar' } }` and the crawler encounters `https://dev.example.com/foo/hello/`,
+    it's transformed to `https://dev.example.com/bar/hello/`.
+
+
+    > Compare with the `hostnameAliases` action.
+
   additionalProperties:
     type: object
     description: Hostname for which matching paths should be replaced.
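The URL rewriting that `hostnameAliases` and `pathAliases` describe can be sketched in Python. This illustrates the documented behavior only, not the crawler's implementation, and it applies both mappings at once (the `pathAliases` example above leaves the hostname unchanged because it defines no hostname alias):

```python
from urllib.parse import urlsplit, urlunsplit

# Illustrative mappings taken from the examples in this schema.
hostname_aliases = {"dev.example.com": "example.com"}
path_aliases = {"dev.example.com": {"/foo": "/bar"}}

def transform(url: str) -> str:
    scheme, host, path, query, frag = urlsplit(url)
    # pathAliases is keyed by the *original* hostname.
    for prefix, replacement in path_aliases.get(host, {}).items():
        if path.startswith(prefix):
            path = replacement + path[len(prefix):]
            break
    # hostnameAliases then rewrites the hostname itself.
    host = hostname_aliases.get(host, host)
    return urlunsplit((scheme, host, path, query, frag))

print(transform("https://dev.example.com/foo/hello/"))
# -> https://example.com/bar/hello/ under these illustrative mappings
```

URLs whose hostname appears in neither mapping pass through unchanged, matching the documented rule that `startUrls`, `sitemaps`, and extracted text are never rewritten.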
@@ -174,7 +180,7 @@ cache:
   description: |
     Whether the crawler should cache crawled pages.
 
-    For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/cache/).
+    For more information, see [Partial crawls with caching](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#partial-crawls-with-caching).
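For completeness, a hedged sketch of how `cache` might be set in a configuration; the `enabled` field name is an assumption not confirmed by this hunk, so check the linked caching guide:

```yaml
cache:
  enabled: true   # assumed field name; see the caching guide linked above
```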