
Commit d231d08

fix(specs): New Crawler API parameter - ignorePaginationAttributes (#4614)

Authored by gazconroy (Gary Conroy), kai687, and shortcuts

Co-authored-by: Gary Conroy <[email protected]>
Co-authored-by: Kai Welke <[email protected]>
Co-authored-by: Clément Vannicatte <[email protected]>

1 parent d91d660 · commit d231d08

File tree: 4 files changed, +181 −66 lines

specs/crawler/common/parameters.yml

Lines changed: 0 additions & 1 deletion

@@ -54,7 +54,6 @@ applicationID:
   type: string
   description: |
     Algolia application ID where the crawler creates and updates indices.
-    The Crawler add-on must be enabled for this application.
 
 CrawlerID:
   type: string
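
For reference, the `applicationID` parameter after this change, reconstructed from the hunk's context lines:

applicationID:
  type: string
  description: |
    Algolia application ID where the crawler creates and updates indices.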

specs/crawler/common/schemas/action.yml

Lines changed: 27 additions & 21 deletions
@@ -21,17 +21,16 @@ Action:
     discoveryPatterns:
       type: array
       description: |
-        Indicates _intermediary_ pages that the crawler should visit.
-
-        For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/discoverypatterns/).
+        Which _intermediary_ web pages the crawler should visit.
+        Use `discoveryPatterns` to define pages that should be visited _just_ for their links to other pages,
+        _not_ their content.
+        It functions similarly to the `pathsToMatch` action but without record extraction.
       items:
         $ref: '#/urlPattern'
     fileTypesToMatch:
       type: array
       description: |
         File types for crawling non-HTML documents.
-
-        For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
       maxItems: 100
       items:
         $ref: '#/fileTypes'
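
To make the `discoveryPatterns`/`pathsToMatch` distinction concrete, here is a minimal, hypothetical action excerpt; the index name and URL patterns are placeholders, not part of this diff:

actions:
  - indexName: products                     # placeholder
    pathsToMatch:
      - https://example.com/products/**     # crawled and extracted into records
    discoveryPatterns:
      - https://example.com/categories/**   # visited only for the links they contain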
@@ -69,16 +68,22 @@ Action:
       type: object
       description: |
         Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
-        The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor` property.
 
-        For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/recordextractor/).
+        The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor`.
+        For details, see the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
       properties:
         __type:
           $ref: '#/configurationRecordExtractorType'
         source:
           type: string
           description: |
             A JavaScript function (as a string) that returns one or more Algolia records for each crawled page.
+    schedule:
+      type: string
+      description: |
+        How often to perform a complete crawl for this action.
+
+        For more information, consult the [`schedule` parameter documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/schedule/).
     selectorsToMatch:
       type: array
       description: |
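
A rough sketch combining the `recordExtractor` and the new `schedule` property in one action. The function body, selectors, and schedule expression are illustrative assumptions; only the parameter names come from this spec:

actions:
  - indexName: docs                  # placeholder
    pathsToMatch:
      - https://example.com/docs/**
    schedule: every 1 day            # hypothetical schedule expression
    recordExtractor:
      source: |
        ({ url, $ }) => {
          // Return one Algolia record per crawled page (sketch only).
          return [
            {
              objectID: url.href,
              title: $('title').text(),
              content: $('main').text().trim(),
            },
          ];
        }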
@@ -87,7 +92,8 @@ Action:
       maxItems: 100
       items:
         type: string
-        description: DOM selector. Negation is supported. This lets you ignore pages that match the selector.
+        description: |
+          Prefix a selector with `!` to ignore matching pages.
       example:
         - .products
         - '!.featured'
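
Reading the spec's own example values as a configuration excerpt: pages matching `.products` are processed unless they also match `.featured` (the surrounding action is hypothetical):

actions:
  - indexName: products            # placeholder
    pathsToMatch:
      - https://example.com/**
    selectorsToMatch:
      - .products                  # only process pages containing this selector
      - '!.featured'               # ...but skip pages that also match this one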
@@ -110,8 +116,6 @@ ActionSchedule:
 fileTypes:
   type: string
   description: |
-    Supported file type for indexing non-HTML documents.
-
     For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
   enum:
     - doc
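
Tying the `fileTypes` enum back to the `fileTypesToMatch` action above, a hypothetical action for non-HTML documents; this hunk shows only `doc`, and `pdf` is assumed to be among the enum values truncated by the hunk boundary:

actions:
  - indexName: documents           # placeholder
    pathsToMatch:
      - https://example.com/files/**
    fileTypesToMatch:
      - doc
      - pdf                        # assumed enum value, not visible in this hunk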
@@ -127,20 +131,14 @@ fileTypes:
 urlPattern:
   type: string
   description: |
-    Pattern for matching URLs.
-
-    Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
+    Use [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
   example: https://www.algolia.com/**
 
 hostnameAliases:
   type: object
   example:
     'dev.example.com': 'example.com'
-  description: |
-    Key-value pairs to replace matching hostnames found in a sitemap,
-    on a page, in canonical links, or redirects.
-
-    For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/hostnamealiases/).
+  description: "Key-value pairs to replace matching hostnames found in a sitemap,\non a page, in canonical links, or redirects.\n\n\nDuring a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs.\nThis helps with links to staging environments (like `dev.example.com`) or external hosting services (such as YouTube).\n\n\nFor example, with this `hostnameAliases` mapping:\n\n    {\n      hostnameAliases: {\n        'dev.example.com': 'example.com'\n      }\n    }\n\n1. The crawler encounters `https://dev.example.com/solutions/voice-search/`.\n\n1. `hostnameAliases` transforms the URL to `https://example.com/solutions/voice-search/`.\n\n1. The crawler follows the transformed URL (not the original).\n\n\n**`hostnameAliases` only changes URLs, not page text. In the preceding example, if the extracted text contains the string `dev.example.com`, it remains unchanged.**\n\n\nThe crawler can discover URLs in places such as:\n\n\n- Crawled pages\n\n- Sitemaps\n\n- [Canonical URLs](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#canonical-urls-and-crawler-behavior)\n\n- Redirects. \n\n\nHowever, `hostnameAliases` doesn't transform URLs you explicitly set in the `startUrls` or `sitemaps` parameters,\nnor does it affect the `pathsToMatch` action or other configuration elements.\n"
   additionalProperties:
     type: string
     description: Hostname that should be added in the records.
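
The same mapping as the description's inline JSON, written as a configuration excerpt; the hostnames are the placeholders from the spec's own `example`:

hostnameAliases:
  'dev.example.com': 'example.com'   # crawler follows example.com URLs instead;
                                     # extracted page text is left unchanged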
@@ -153,13 +151,21 @@ pathAliases:
     '/foo': '/bar'
   description: |
     Key-value pairs to replace matching paths with new values.
-
+
     It doesn't replace:
-
+
     - URLs in the `startUrls`, `sitemaps`, `pathsToMatch`, and other settings.
     - Paths found in extracted text.
 
     The crawl continues from the _transformed_ URLs.
+
+
+    For example, if you create a mapping for `{ "dev.example.com": { '/foo': '/bar' } }` and the crawler encounters `https://dev.example.com/foo/hello/`,
+    it’s transformed to `https://dev.example.com/bar/hello/`.
+
+
+    > Compare with the `hostnameAliases` action.
+
   additionalProperties:
     type: object
     description: Hostname for which matching paths should be replaced.
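
And the matching `pathAliases` excerpt, using the placeholder values from the description's example:

pathAliases:
  'dev.example.com':
    '/foo': '/bar'   # https://dev.example.com/foo/hello/ becomes /bar/hello/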
@@ -174,7 +180,7 @@ cache:
   description: |
     Whether the crawler should cache crawled pages.
 
-    For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/cache/).
+    For more information, see [Partial crawls with caching](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#partial-crawls-with-caching).
   properties:
     enabled:
       type: boolean
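
Based on the `enabled` property in this hunk, a minimal cache setting might look like this (sketch):

cache:
  enabled: true   # reuse cached pages for partial crawls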
