fix(specs): New Crawler API parameter - ignorePaginationAttributes #4614

Merged
merged 10 commits on Mar 20, 2025
Changes from 1 commit
76 changes: 67 additions & 9 deletions specs/crawler/common/schemas/action.yml
@@ -3,10 +3,14 @@ Action:
description: |
How to process crawled URLs.


Each action defines:


- The targeted subset of URLs it processes.

- What information to extract from the web pages.

- The Algolia indices where the extracted records will be stored.

If a single web page matches several actions,
@@ -21,16 +25,23 @@
discoveryPatterns:
type: array
description: |
Indicates _intermediary_ pages that the crawler should visit.
Which _intermediary_ web pages the crawler should visit.
Use `discoveryPatterns` to define pages that should be visited _just_ for their links to other pages, _not_ their content.


For more information, see the [`discoveryPatterns` documentation](https://www.algolia.com/doc/tools/crawler/apis/discoverypatterns/).
It functions similarly to the `pathsToMatch` action but without record extraction.


`discoveryPatterns` uses [micromatch](https://github.com/micromatch/micromatch) to support matching with wildcards, negation, and other features.
The crawler adds all matching URLs to its queue.
items:
$ref: '#/urlPattern'
fileTypesToMatch:
type: array
description: |
File types for crawling non-HTML documents.


For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
maxItems: 100
items:
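For illustration, here's a minimal action sketch (hypothetical index name and URLs) showing how `discoveryPatterns` and `fileTypesToMatch` might sit alongside `pathsToMatch`:

{
  indexName: 'example_index', // hypothetical index name
  // Visited only to collect links to other pages; no records are extracted from them
  discoveryPatterns: ['https://www.example.com/blog/page/**'],
  // Crawl PDFs in addition to HTML pages
  fileTypesToMatch: ['html', 'pdf'],
  // Records are only extracted from pages matching these patterns
  pathsToMatch: ['https://www.example.com/blog/**'],
  // recordExtractor omitted for brevity; see the sketch further down
}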
@@ -59,6 +70,7 @@ Action:
description: |
URLs to which this action should apply.


Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
minItems: 1
maxItems: 100
@@ -69,9 +81,12 @@
type: object
description: |
Function for extracting information from a crawled page and transforming it into Algolia records for indexing.
The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor` property.

For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/recordextractor/).

The Crawler has an [editor](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#the-editor) with autocomplete and validation to help you update the `recordExtractor`.


For details, consult the [`recordExtractor` documentation](https://www.algolia.com/doc/tools/crawler/apis/configuration/actions/#parameter-param-recordextractor).
properties:
__type:
$ref: '#/configurationRecordExtractorType'
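As a hedged sketch (hypothetical index name and selectors), a `recordExtractor` receives the fetched page, here through a Cheerio-style `$` handle and the page `url`, and returns an array of records:

{
  indexName: 'example_index', // hypothetical
  pathsToMatch: ['https://www.example.com/**'],
  recordExtractor: ({ url, $ }) => {
    // Selectors here are illustrative
    return [
      {
        objectID: url.pathname,
        title: $('head title').text(),
      },
    ];
  },
}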
@@ -110,7 +125,8 @@ ActionSchedule:
fileTypes:
type: string
description: |
Supported file type for indexing non-HTML documents.
Supported file types for indexing non-HTML documents.


For more information, see [Extract data from non-HTML documents](https://www.algolia.com/doc/tools/crawler/extracting-data/non-html-documents/).
enum:
@@ -129,6 +145,7 @@ urlPattern:
description: |
Pattern for matching URLs.


Uses [micromatch](https://github.com/micromatch/micromatch) for negation, wildcards, and more.
example: https://www.algolia.com/**

@@ -140,7 +157,43 @@ hostnameAliases:
Key-value pairs to replace matching hostnames found in a sitemap,
on a page, in canonical links, or redirects.

For more information, see the [`hostnameAliases` documentation](https://www.algolia.com/doc/tools/crawler/apis/hostnamealiases/).

During a crawl, this action maps one hostname to another whenever the crawler encounters specific URLs.
This helps with links to staging environments (like `dev.example.com`) or external hosting services (such as YouTube).


For example, with this `hostnameAliases` mapping:

{
  hostnameAliases: {
    'dev.example.com': 'example.com'
  }
}

1. The crawler encounters `https://dev.example.com/solutions/voice-search/`.

1. `hostnameAliases` transforms the URL to `https://example.com/solutions/voice-search/`.

1. The crawler follows the transformed URL (not the original).


**`hostnameAliases` only changes URLs, not page text. In the preceding example, if the extracted text contains the string `dev.example.com`, it remains unchanged.**


The crawler can discover URLs in places such as:


- Crawled pages

- Sitemaps

- [Canonical URLs](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#canonical-urls-and-crawler-behavior)

- Redirects.


However, `hostnameAliases` doesn't transform URLs you explicitly set in the `startUrls` or `sitemaps` parameters,
nor does it affect the `pathsToMatch` action or other configuration elements.
additionalProperties:
type: string
description: Hostname that should be added in the records.
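To show where this sits in a configuration, here's a minimal sketch (hypothetical hostnames), with the caveat from above that explicitly listed `startUrls` aren't rewritten:

{
  startUrls: ['https://dev.example.com/'], // not transformed; crawled as written
  hostnameAliases: {
    // URLs discovered during the crawl on this hostname are rewritten before they're followed
    'dev.example.com': 'example.com'
  }
}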
@@ -153,12 +206,16 @@ pathAliases:
'/foo': '/bar'
description: |
Key-value pairs to replace matching paths with new values.


It doesn't replace:


Member
are all of those spaces expected?

Contributor Author
Double spacing makes sure paragraphs render properly, and spacing between list elements stops them running into each other.

Contributor
I don't think it's necessary for specs in this repo as you can see with the other API specs. The sources don't have this formatting, but in the generated, complete specs, you'll find the double spacing. Maybe something to do with YAML parsing/writing.

Contributor
In the source, you'll notice that long description fields are declared as literal block scalars with the `|` symbol, which means newlines are kept as written.

During the bundling of the spec, and I don't know why, most of these are converted into folded block scalars with the `>` symbol, where single newlines are replaced by a space and two newlines become one newline.

I find the `|` style more readable as it leads to more compact blocks.

Contributor Author
I'll keep an eye on this when the bundle is built. I did notice some discrepancies in spacing that I manually corrected for the Mintlify prototype.


- URLs in the `startUrls`, `sitemaps`, `pathsToMatch`, and other settings.

- Paths found in extracted text.


The crawl continues from the _transformed_ URLs.
additionalProperties:
type: object
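A hedged sketch of the shape implied by the example above (hypothetical hostname and paths):

{
  pathAliases: {
    'example.com': {
      // Illustrative: a discovered URL with a matching '/foo' path continues the crawl at '/bar'
      '/foo': '/bar'
    }
  }
}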
@@ -172,9 +229,10 @@ pathAliases:
cache:
type: object
description: |
Whether the crawler should cache crawled pages.
Whether the crawler should cache crawled pages.


For more information, see the [`cache` documentation](https://www.algolia.com/doc/tools/crawler/apis/cache/).
For more information, see [Partial crawls with caching](https://www.algolia.com/doc/tools/crawler/getting-started/crawler-configuration/#partial-crawls-with-caching).
properties:
enabled:
type: boolean
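For example, a minimal sketch turning the cache on via the `enabled` property shown above:

{
  cache: {
    enabled: true
  }
}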