Create a doc for versioning info (#113601)

thecoop · web-flow · commit acd4f0747591 · 2024-09-30T10:42:59.000+01:00
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -660,51 +660,11 @@ node cannot continue to operate as a member of the cluster:
 
 Errors like this should be very rare. When in doubt, prefer `WARN` to `ERROR`.
 
-### Version numbers in the Elasticsearch codebase
-
-Starting in 8.8.0, we have separated out the version number representations
-of various aspects of Elasticsearch into their own classes, using their own
-numbering scheme separate to release version. The main ones are
-`TransportVersion` and `IndexVersion`, representing the version of the
-inter-node binary protocol and index data + metadata respectively.
-
-Separated version numbers are comprised of an integer number. The semantic
-meaning of a version number are defined within each `*Version` class.  There
-is no direct mapping between separated version numbers and the release version.
-The versions used by any particular instance of Elasticsearch can be obtained
-by querying `/_nodes/info` on the node.
-
-#### Using separated version numbers
-
-Whenever a change is made to a component versioned using a separated version
-number, there are a few rules that need to be followed:
-
-1. Each version number represents a specific modification to that component,
-   and should not be modified once it is defined. Each version is immutable
-   once merged into `main`.
-2. To create a new component version, add a new constant to the respective class
-   with a descriptive name of the change being made. Increment the integer
-   number according to the particular `*Version` class.
-
-If your pull request has a conflict around your new version constant,
-you need to update your PR from `main` and change your PR to use the next
-available version number.
-
-### Checking for cluster features
-
-As part of developing a new feature or change, you might need to determine
-if all nodes in a cluster have been upgraded to support your new feature.
-This can be done using `FeatureService`. To define and check for a new
-feature in a cluster:
-
-1. Define a new `NodeFeature` constant with a unique id for the feature
-   in a class related to the change you're doing.
-2. Return that constant from an instance of `FeatureSpecification.getFeatures`,
-   either an existing implementation or a new implementation. Make sure
-   the implementation is added as an SPI implementation in `module-info.java`
-   and `META-INF/services`.
-3. To check if all nodes in the cluster support the new feature, call
-`FeatureService.clusterHasFeature(ClusterState, NodeFeature)`
+### Versioning Elasticsearch
+
+There are various concepts used to identify running node versions,
+and the capabilities and compatibility of those nodes. For more information,
+see `docs/internal/Versioning.md`.
 
 ### Creating a distribution
 
diff --git a/docs/internal/Versioning.md b/docs/internal/Versioning.md
@@ -0,0 +1,297 @@
+Versioning Elasticsearch
+========================
+
+Elasticsearch is a complicated product, and is run in many different scenarios.
+A single version number is not sufficient to cover the whole of the product;
+instead, we need different concepts to provide versioning capabilities
+for different aspects of Elasticsearch, depending on their scope, updatability,
+responsiveness, and maintenance.
+
+## Release version
+
+This is the version number used for published releases of Elasticsearch,
+and the Elastic stack. This takes the form _major.minor.patch_,
+with a corresponding version id.
+
+Avoid using this version number in code: it does not apply to some scenarios,
+and relying on the release version there will break Elasticsearch nodes.
+
+The release version is accessible in code through `Build.current().version()`,
+but it **should not** be assumed that this is a semantic version number;
+it could be any arbitrary string.
+
+## Transport protocol
+
+The transport protocol is used to send binary data between Elasticsearch nodes;
+`TransportVersion` is the version number used for this protocol.
+This version number is negotiated between each pair of nodes in the cluster
+on first connection, and is set as the lower of the highest transport version
+understood by each node.
+This version is then accessible through the `getTransportVersion` method
+on `StreamInput` and `StreamOutput`, so serialization code can read/write
+objects in a form that will be understood by the other node.
+
+Every change to the transport protocol is represented by a new transport version,
+higher than all previous transport versions, which then becomes the highest version
+recognized by that build of Elasticsearch. The version ids are stored
+as constants in the `TransportVersions` class.
+Each id has a standard pattern `M_NNN_SS_P`, where:
+* `M` is the major version
+* `NNN` is an incrementing id
+* `SS` is used in subsidiary repos amending the default transport protocol
+* `P` is used for patches and backports
+
+When you make a change to the serialization form of any object,
+you need to create a new sequential constant in `TransportVersions`,
+introduced in the same PR that adds the change, that increments
+the `NNN` component from the previous highest version,
+with other components set to zero.
+For example, if the previous version number is `8_413_00_1`,
+the next version number should be `8_414_00_0`.
+
+Once you have defined your constant, you then need to use it
+in serialization code. If the transport version is at or above the new id,
+the modified protocol should be used:
+
+    str = in.readString();
+    bool = in.readBoolean();
+    if (in.getTransportVersion().onOrAfter(TransportVersions.NEW_CONSTANT)) {
+        num = in.readVInt();
+    }
+
+If a transport version change needs to be reverted, a **new** version constant
+should be added representing the revert, and the version id checks
+adjusted so that the modified protocol is only used from the version id
+in which the change was added up to, but excluding, the version id of the revert.
+The `between` method can be used for this.
+
+Once a transport change with a new version has been merged into main or a release branch,
+it **must not** be modified - this is so the meaning of that specific
+transport version does not change.
+
+_Elastic developers_ - please see corresponding documentation for Serverless
+on creating transport versions for Serverless changes.
+
+### Collapsing transport versions
+
+As each change adds a new constant, the list of constants in `TransportVersions`
+will keep growing. However, once there has been an official release of Elasticsearch
+that includes that change, that specific transport version is no longer needed,
+apart from constants that happen to be used for release builds.
+As part of managing transport versions, consecutive transport versions can be
+periodically collapsed together into those that are only used for release builds.
+This task is normally performed by Core/Infra on a semi-regular basis,
+usually after each new minor release, to collapse the transport versions
+for the previous minor release. An example of such an operation can be found
+[here](https://github.com/elastic/elasticsearch/pull/104937).
+
+### Minimum compatibility versions
+
+The transport version used between two nodes is determined by the initial handshake
+(see `TransportHandshaker`, where the two nodes swap their highest known transport version).
+The lowest transport version that is compatible with the current node
+is determined by `TransportVersions.MINIMUM_COMPATIBLE`,
+and a node is prevented from joining the cluster if its transport version is below that constant.
+This constant should be updated manually on a major release.
+
+The minimum version that can be used for CCS is determined by
+`TransportVersions.MINIMUM_CCS_VERSION`, but this is not actively checked
+before queries are performed. Only if a query cannot be serialized at that
+version is an action rejected. This constant is updated automatically
+as part of performing a release.
+
+### Mapping to release versions
+
+For releases that do use a version number, it can be confusing to encounter
+a log or exception message that references an arbitrary transport version,
+where you don't know which release version that corresponds to. This is where
+the `.toReleaseVersion()` method comes in. It uses metadata stored in a CSV file
+(`TransportVersions.csv`) to map from the transport version id to the corresponding
+release version. For any transport versions it encounters without a direct mapping,
+it performs a best guess based on the information it has. The CSV file
+is updated automatically as part of performing a release.
+
+In releases that do not have a release version number, that method becomes
+a no-op.
+
+### Managing patches and backports
+
+Backporting transport version changes to previous releases
+should only be done if absolutely necessary, as it is very easy to get wrong
+and break the release in a way that is very hard to recover from.
+
+If we consider the version number as an incrementing line, what we are doing is
+grafting a change that takes effect at a certain point in the line,
+to additionally take effect in a fixed window earlier in the line.
+
+To take an example, using indicative version numbers, when the latest
+transport version is 52, we decide we need to backport a change done in
+transport version 50 to transport version 45. We use the `P` version id component
+to create version 45.1 with the backported change.
+This change will apply for version ids 45.1 to 45.9 (should they exist in the future).
+
+The serialization code in the backport needs to use the backported protocol
+for all version numbers 45.1 to 45.9. The `TransportVersion.isPatchFrom` method
+can be used to easily determine if this is the case: `streamVersion.isPatchFrom(45.1)`.
+However, the `onOrAfter` method also does what is needed on patch branches.
+
+The serialization code in version 53 then needs to additionally check
+version numbers 45.1-45.9 to use the backported protocol, also using the `isPatchFrom` method.
+
+As an example, [this transport change](https://github.com/elastic/elasticsearch/pull/107862)
+was backported from 8.15 to [8.14.0](https://github.com/elastic/elasticsearch/pull/108251)
+and [8.13.4](https://github.com/elastic/elasticsearch/pull/108250) at the same time
+(8.14 was a build candidate at the time).
+
+The 8.13 PR has:
+
+    if (transportVersion.onOrAfter(8.13_backport_id))
+
+The 8.14 PR has:
+
+    if (transportVersion.isPatchFrom(8.13_backport_id)
+        || transportVersion.onOrAfter(8.14_backport_id))
+
+The 8.15 PR has:
+
+    if (transportVersion.isPatchFrom(8.13_backport_id)
+        || transportVersion.isPatchFrom(8.14_backport_id)
+        || transportVersion.onOrAfter(8.15_transport_id))
+
+In particular, if you are backporting a change to a patch release,
+you also need to make sure that any subsequent released version on any branch
+also has that change, and knows about the patch backport ids and what they mean.
+
+## Index version
+
+Index version is a single incrementing version number for the index data format,
+metadata, and associated mappings. It is declared the same way as the
+transport version - with the pattern `M_NNN_SS_P`, for the major version, version id,
+subsidiary version id, and patch number respectively.
+
+Index version is stored in index metadata when an index is created,
+and it is used to determine the storage format and what functionality that index supports.
+The index version does not change once an index is created.
+
+In the same way as transport versions, when a change is needed to the index
+data format or metadata, or new mapping types are added, create a new version constant
+below the last one, incrementing the `NNN` version component.
+
+Unlike transport version, version constants cannot be collapsed together,
+as an index keeps its creation version id once it is created.
+Fortunately, new index versions are only created once a month or so,
+so we don’t have a large list of index versions that need managing.
+
+Similar to transport version, index version has a `toReleaseVersion` to map
+onto release versions, in appropriate situations.
+
+## Cluster Features
+
+Cluster features are identifiers, published by a node in cluster state,
+indicating that the node supports a particular top-level operation or set of functionality.
+They are used for internal checks within Elasticsearch, and for gating tests
+on certain functionality. For example, to check all nodes have upgraded
+to a certain point before running a large migration operation to a new data format.
+Cluster features should not be referenced by anything outside the Elasticsearch codebase.
+
+Cluster features are indicative of top-level functionality introduced to
+Elasticsearch - e.g. a new transport endpoint, or new operations.
+
+They are also used to check whether nodes can join a cluster - once all nodes in a cluster
+support a particular feature, no nodes can then join the cluster that do not
+support that feature. This is to ensure that once a feature is supported
+by a cluster, it will then always be supported in the future.
+
+To declare a new cluster feature, add an implementation of the `FeatureSpecification` SPI,
+suitably registered (or use an existing one for your code area), and add the feature
+as a constant to be returned by `getFeatures`. To then check whether all nodes
+in the cluster support that feature, use the method `clusterHasFeature` on `FeatureService`.
+It is only possible to check whether all nodes in the cluster have a feature;
+individual node checks should not be done.
+
+Once a cluster feature is declared and deployed, it cannot be modified or removed,
+else new nodes will not be able to join existing clusters.
+If functionality represented by a cluster feature needs to be removed,
+a new cluster feature should be added indicating that functionality is no longer
+supported, and the code modified accordingly (bearing in mind additional BwC constraints).
+
+The cluster features infrastructure is only designed to support a few hundred features
+per major release, and once features are added to a cluster they cannot be removed.
+Cluster features should therefore be used sparingly.
+Adding too many cluster features risks increasing cluster instability.
+
+When we release a new major version N, we limit our backwards compatibility
+to the highest minor of the previous major N-1. Therefore, any cluster formed
+with the new major version is guaranteed to have all features introduced during
+releases of major N-1. All such features can be deemed to be met by the cluster,
+and the features themselves can be removed from cluster state over time,
+and the feature checks removed from the code of major version N.
+
+### Testing
+
+Tests often want to check if a certain feature is implemented / available on all nodes,
+particularly BwC or mixed cluster tests.
+
+Rather than introducing a production feature just for a test condition,
+this can be done by adding a _test feature_ in an implementation of
+`FeatureSpecification.getTestFeatures`. These features will only be set
+on clusters running as part of an integration test. Even so, cluster features
+should be used sparingly if possible; Capabilities is generally a better
+option for test conditions.
+
+In Java Rest tests, checking cluster features can be done using
+`ESRestTestCase.clusterHasFeature(feature)`.
+
+In YAML Rest tests, conditions can be defined in the `requires` or `skip` sections
+that use cluster features; see [here](https://github.com/elastic/elasticsearch/blob/main/rest-api-spec/src/yamlRestTest/resources/rest-api-spec/test/README.asciidoc#skipping-tests) for more information.
+
+To aid with backwards compatibility tests, the test framework adds synthetic features
+for each previously released Elasticsearch version, of the form `gte_v{VERSION}`
+(for example `gte_v8.14.2`).
+This can be used to add conditions based on previous releases. It _cannot_ be used
+to check the current snapshot version; real features or capabilities should be
+used instead.
+
+## Capabilities
+
+The Capabilities API is a REST API for external clients to check the capabilities
+of an Elasticsearch cluster. As it is dynamically calculated for every query,
+it is not limited in size or usage.
+
+A capabilities query can be used to query for three things:
+* Is this endpoint supported for this HTTP method?
+* Are these parameters of this endpoint supported?
+* Are these capabilities (arbitrary string ids) of this endpoint supported?
+
+The API will return with a simple true/false, indicating if all specified aspects
+of the endpoint are supported by all nodes in the cluster.
+If any aspect is not supported by any one node, the API returns `false`.
+
+The API can also return `supported: null` (indicating unknown)
+if there was a problem communicating with one or more nodes in the cluster.
+
+All registered endpoints automatically work with the endpoint existence check.
+To add support for parameter and feature capability queries to your REST endpoint,
+implement the `supportedQueryParameters` and `supportedCapabilities` methods in your rest handler.
+
+To perform a capability query, perform a REST call to the `_capabilities` API,
+with parameters `method`, `path`, `parameters`, `capabilities`.
+The call will query every node in the cluster, and return `{supported: true}`
+if all nodes support that specific combination of method, path, query parameters,
+and endpoint capabilities. If any single aspect is not supported,
+the query will return `{supported: false}`. If there are any problems
+communicating with nodes in the cluster, the response will be `{supported: null}`
+indicating support or lack thereof cannot currently be determined.
+Capabilities can be checked using the `clusterHasCapability` method in `ESRestTestCase`.
+
+Similar to cluster features, YAML tests can have `skip` and `requires` conditions
+specified with capabilities like the following:
+
+    - requires:
+        capabilities:
+          - method: GET
+            path: /_endpoint
+            parameters: [param1, param2]
+            capabilities: [cap1, cap2]
+
+`method: GET` is the default, and does not need to be explicitly specified.
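
The 8.15 check from "Managing patches and backports" can be tried out in isolation. What follows is a minimal sketch, not the Elasticsearch implementation: the `Version` class, the three ids and `usesNewFormat` are hypothetical stand-ins, and `isPatchFrom` is modelled on the "45.1 to 45.9" description (same `M_NNN_SS` prefix, patch digit at or above the given one) rather than copied from `TransportVersion`.

```java
public class TransportVersionSketch {

    /** Hypothetical stand-in for TransportVersion: an int id laid out as M_NNN_SS_P. */
    public static final class Version {
        final int id;

        public Version(int id) {
            this.id = id;
        }

        public boolean onOrAfter(Version other) {
            return id >= other.id;
        }

        /**
         * Assumed semantics, modelled on the "45.1 to 45.9" description:
         * same M_NNN_SS prefix as the patch id, and a patch digit at or above it.
         */
        public boolean isPatchFrom(Version patch) {
            return id / 10 == patch.id / 10 && id >= patch.id;
        }
    }

    // Made-up ids for illustration only; they are not real TransportVersions constants.
    static final Version BACKPORT_8_13 = new Version(8_100_00_1);
    static final Version BACKPORT_8_14 = new Version(8_200_00_1);
    static final Version CHANGE_8_15 = new Version(8_300_00_0);

    /** Mirrors the shape of the 8.15 check: either patch window, or the main-line change onwards. */
    public static boolean usesNewFormat(Version v) {
        return v.isPatchFrom(BACKPORT_8_13)
            || v.isPatchFrom(BACKPORT_8_14)
            || v.onOrAfter(CHANGE_8_15);
    }

    public static void main(String[] args) {
        int[] ids = { 8_100_00_0, 8_100_00_1, 8_100_00_3, 8_150_00_0, 8_200_00_1, 8_300_00_0 };
        for (int id : ids) {
            System.out.println(id + " -> " + usesNewFormat(new Version(id)));
        }
    }
}
```

Ids that fall between the patch windows, such as `8_150_00_0` in this sketch, keep the old format. This is why every later branch has to know about each earlier backport id.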