[DOCS] Defines data frame transform stats API objects #44197
Pinging @elastic/es-docs
Pinging @elastic/ml-core
@lcawl Can this PR be closed? Looks outdated to me.
@elasticmachine update branch
merge conflict between base and head
Thanks for looking into this again.
I commented the TBD parts and added some explanations.
`checkpointing`.`last`:::
(object) Contains statistics about the last completed checkpoint.
`checkpointing`.`last`.`checkpoint`::::
(TBD) A unique identifier for the checkpoint.
I suggest: "sequence number for the checkpoint".
`checkpointing`.`last`.`checkpoint`::::
(TBD) A unique identifier for the checkpoint.
`checkpointing`.`last`.`time_upper_bound_millis`::::
(date) TBD
Optional; the timestamp up to which data has been processed when using time-based synchronization.
... timestamp until data has been processed...
Thanks for the feedback, @hendrikmuhs! I'm not sure I understand this description yet, however. Is it the duration of the checkpoint?
Think of a continuous transform where your source index keeps receiving new data, so the destination (transformed) index always runs behind the source. `time_upper_bound` marks the timestamp up to which all data from `source` has been processed into `dest`. So it's not a duration; it's an end marker up to which data has been processed.
There is also `timestamp`, and it might seem like the same thing, but `timestamp` is the time the checkpoint was created, whereas `time_upper_bound` has to take the delay into account. So normally `time_upper_bound = timestamp - delay`. (However, this might change in the future, which is why `timestamp` and `time_upper_bound` are separate fields.)
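To make the relationship between the two fields concrete, here is a minimal Python sketch. It assumes the "normal" case described in this comment (upper bound equals checkpoint creation time minus the configured delay); the function name and the sample values are hypothetical and are not taken from the transform implementation.

```python
from datetime import datetime, timedelta, timezone


def time_upper_bound_millis(timestamp_millis: int, delay: timedelta) -> int:
    """Return the upper bound (epoch millis) of data covered by a checkpoint.

    Assumption: upper bound = checkpoint creation time - sync delay.
    """
    return timestamp_millis - int(delay.total_seconds() * 1000)


# Hypothetical values: a checkpoint created at 2019-07-10T12:00:00Z
# for a transform configured with a 60 second delay.
created = int(datetime(2019, 7, 10, 12, 0, tzinfo=timezone.utc).timestamp() * 1000)
upper = time_upper_bound_millis(created, timedelta(seconds=60))
```

The result is a point in time, not a duration: data with timestamps up to `upper` is covered by the checkpoint, while `created` only records when the checkpoint itself was made.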
Thanks, I've drafted changes to those two descriptions. If they still need tweaking, please just let me know.
`checkpointing`.`last`.`time_upper_bound_millis`::::
(date) TBD
`checkpointing`.`last`.`timestamp_millis`::::
(date) TBD
The timestamp of the checkpoint (when the checkpoint was created).
(date) TBD
`checkpointing`.`last`.`timestamp_millis`::::
(date) TBD
`checkpointing`.`next`:::
optional (object) Contains statistics about the next checkpoint, which is currently in progress. This object only appears if the transform is currently processing data and only for the 1st checkpoint.
It uses the same fields as `last` but has one more object:
`checkpoint_progress`::
(object) Contains statistics about the progress of the checkpoint.
Not sure how much we want to go into the details; the inner fields are:
- total_docs
- docs_remaining
- percent_complete
- docs_processed
- docs_indexed
This information is only available for batch transforms and for the 1st checkpoint of a continuous transform.
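As an illustration of how the inner fields listed above could relate to one another, here is a small Python sketch. The formula is an assumption made for the example (percentage of `total_docs` that is no longer in `docs_remaining`); this thread does not state how the API actually computes `percent_complete`, and the sample object is hypothetical.

```python
def percent_complete(total_docs: int, docs_remaining: int) -> float:
    """Derive a completion percentage from two progress counters.

    Assumption: percent_complete = 100 * (total_docs - docs_remaining) / total_docs.
    An empty source (total_docs == 0) is treated as fully complete here.
    """
    if total_docs < 0 or docs_remaining < 0 or docs_remaining > total_docs:
        raise ValueError("counters must satisfy 0 <= docs_remaining <= total_docs")
    if total_docs == 0:
        return 100.0
    return 100.0 * (total_docs - docs_remaining) / total_docs


# Hypothetical checkpoint_progress object, using the field names from the list above.
progress = {"total_docs": 200, "docs_remaining": 50}
pct = percent_complete(progress["total_docs"], progress["docs_remaining"])
```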
* `indexing`: The {transform} is actively processing data and creating new
documents.
* `started`: The {transform} is running but not actively indexing data.
* `stopped`: The {transform} is stopped.
* `aborting`: The {transform} is aborting.
* `stopping`: The {transform} is stopping.
* `failed`: The {transform} has failed. Check the `reason` field for further information.
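For readers consuming these values from client code, the six states discussed in this thread can be sketched as a Python enum. This is only an illustration: the `TransformState` class and `describe` helper are hypothetical names, and the example assumes the stats object exposes the value under a `state` key with an optional `reason` key.

```python
from enum import Enum


class TransformState(Enum):
    """State values mentioned in this review thread."""
    INDEXING = "indexing"
    STARTED = "started"
    STOPPED = "stopped"
    ABORTING = "aborting"
    STOPPING = "stopping"
    FAILED = "failed"


def describe(stats: dict) -> str:
    """Summarize a stats object, surfacing the reason when the transform failed."""
    state = TransformState(stats["state"])  # raises ValueError for unknown states
    if state is TransformState.FAILED:
        return f"failed: {stats.get('reason', 'no reason given')}"
    return state.value
```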
`stats`::
(object) An object that provides statistical information about the {transform}.
`stats`.`documents_indexed`:::
(TBD) The number of new documents that have been indexed.
The number of documents that have been indexed into the transform `dest` index.
`stats`.`documents_indexed`:::
(TBD) The number of new documents that have been indexed.
`stats`.`documents_processed`:::
(TBD) The number of documents that have been processed.
The number of documents that have been processed from the transform `source` index.
`stats`.`index_total`:::
(long) The number of indices created.
`stats`.`pages_processed`:::
(TBD) The number of pages processed.
(long) The number of pages (that is, the number of search or bulk index operations) processed.
(I do not know if this needs a better explanation. In a nutshell, documents are not processed one by one but always in batches. This happens both for search (a search page) and for indexing, where a "page" describes one bulk index operation that consists of a list of documents to be indexed.)
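The batching idea described in this comment can be sketched in a few lines of Python. This is a toy model only, not the transform indexer: the `pages` helper, the page size of 10, and the sample documents are all assumptions made for illustration.

```python
from typing import Iterable, Iterator, List


def pages(docs: Iterable[dict], page_size: int) -> Iterator[List[dict]]:
    """Group documents into pages of at most page_size items."""
    page: List[dict] = []
    for doc in docs:
        page.append(doc)
        if len(page) == page_size:
            yield page
            page = []
    if page:  # final, possibly partial page
        yield page


# Hypothetical run: 25 documents handled in pages of 10.
docs = [{"id": i} for i in range(25)]
pages_processed = 0
documents_indexed = 0
for page in pages(docs, page_size=10):
    pages_processed += 1            # one bulk operation per page
    documents_indexed += len(page)  # every document in the page is indexed
```

In this model the page counter grows once per batch while the document counter grows by the batch size, which is why the two statistics differ.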
`stats`.`search_time_in_ms`:::
(long) The amount of time spent searching, in milliseconds.
`stats`.`search_total`:::
(long) TBD
The number of search operations on the transform `source` index.
`stats`.`search_total`:::
(long) TBD
`stats`.`trigger_count`:::
(TBD) TBD
The number of times the transform has been triggered by the scheduler.
(The scheduler triggers the transform indexer to, for example, check for updates or ingest new data. This can be controlled via the `frequency` parameter in the config: https://www.elastic.co/guide/en/elasticsearch/reference/master/put-transform.html#put-transform-request-body)
Thanks! Looks great. I still suggest removing some of the technical detail; it was only meant to explain to you how transforms work internally.
`checkpointing`.`next`:::
(object) Contains statistics about the next checkpoint that is currently in
progress. This object appears only if the {transform} is currently processing
data and only for the first checkpoint.
Sorry, this is not quite correct yet: "and only for the first checkpoint" is only true for the `checkpoint_progress` nested object below. `checkpointing`.`next` will always be there if the transform is actively doing something (when the state is `indexing`).
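Because both `checkpointing`.`next` and its nested `checkpoint_progress` object are optional, client code has to tolerate their absence. The following Python sketch shows one way to read them defensively; the helper name is hypothetical and the nesting is assumed from the field names used in this thread.

```python
from typing import Optional


def next_checkpoint_progress(stats: dict) -> Optional[dict]:
    """Return checkpointing.next.checkpoint_progress, or None if either level is absent."""
    next_checkpoint = stats.get("checkpointing", {}).get("next")
    if next_checkpoint is None:
        # The transform is not actively working on a checkpoint.
        return None
    # May still be None, for example after the first checkpoint of a continuous transform.
    return next_checkpoint.get("checkpoint_progress")
```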
Thanks for the clarifications! I've pushed another commit
(date) When using time-based synchronization, this timestamp indicates the
upper bound of data that is included in the checkpoint. Typically, this value
is equal to the `checkpointing`.`last`.`time_upper_bound_millis` minus the
`sync`.`time`.`delay`, which is defined when you create the {transform}.
I would not include "Typically, this value is ..." This was just for your information, not meant to be put here.
(object) Contains statistics about the progress of the checkpoint. For example,
it lists the `total_docs`, `docs_remaining`, `percent_complete`,
`docs_processed`, and `docs_indexed`. This information is available only for
batch {transforms} and the first checkpoint of {ctransforms}.
👍
batch {transforms} and the first checkpoint of {ctransforms}.
`checkpointing`.`next`.`time_upper_bound_millis`::::
(date) When using time-based synchronization, this timestamp indicates the
upper bound of data that is included in the checkpoint. Typically, this value
I would omit "Typically..."
LGTM!
This PR drafts definitions for results from the "get data frame transform statistics" API (https://www.elastic.co/guide/en/elasticsearch/reference/master/get-transform-stats.html), equivalent to what we have for anomaly detector job statistics:
https://www.elastic.co/guide/en/elasticsearch/reference/master/ml-get-job-stats.html
Preview: http://elasticsearch_44197.docs-preview.app.elstc.co/guide/en/elasticsearch/reference/master/get-transform-stats.html