Fix failing tests on feature/desired-balance-allocator branch #86429

Closed
73 tasks done
idegtiarenko opened this issue May 4, 2022 · 1 comment
Labels
:Distributed Coordination/Allocation: All issues relating to the decision making around placing a shard (both master logic & on the nodes)
Team:Distributed (Obsolete): Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination.

Comments

idegtiarenko commented May 4, 2022

The following tests are failing and need to be fixed before DesiredBalanceShardsAllocator can be merged to master.

  • DesiredBalanceServiceTests.* (Fix DesiredBalanceServiceTests #86435)
  • DesiredBalanceReconcilerTests.* (Fix DesiredBalanceReconcilerTests#testFailsNewPrimariesIfNoDataNodes #86432)
  • ClusterAllocationExplainIT. testAllocationFilteringOnIndexCreation
  • ClusterHealthIT. testHealthOnIndexCreation
  • CorruptedFileIT. testCorruptionOnNetworkLayer (Replica shard is not started and remains in error state)
  • CorruptedFileIT. testReplicaCorruption
  • FilteringAllocationIT. testDecommissionNodeNoReplicas
  • IndexFoldersDeletionListenerIT. testListenersInvokedWhenIndexHasLeftOverShard (small probability of getting stuck after logger.debug("--> creating a new index [{}]", indexName);)
  • IndexRecoveryIT. testCancelNewShardRecoveryAndUsesExistingShardCopy
  • IndexRecoveryIT. testDoNotInfinitelyWaitForMapping (timed out waiting for green state: ALLOCATION_FAILED, failed shard on node [y0Jt0-QrSu29efTbYU8AdQ]: failed to create index, failure org.elasticsearch.index.mapper.MapperParsingException: simulate mapping parsing error)
  • IndexRecoveryIT. testCancelRecoveryWithAutoExpandReplicas (gets stuck after creating a [0-all] index in a cluster with a single master and no data nodes)
  • RareClusterStateIT. testDeleteCreateInOneBulk (consistently times out when creating an index with a 0s timeout)
  • RecoveryFromGatewayIT. testSingleNodeNoFlush
  • ReplicaShardAllocatorIT. testDoNotCancelRecoveryForBrokenNode (timed out waiting for green state: ALLOCATION_FAILED, failed recovery, failure org.elasticsearch.indices.recovery.RecoveryFailedException)
  • ReplicaShardAllocatorIT. testPreferCopyCanPerformNoopRecovery
  • ReplicaShardAllocatorIT. testPreferCopyWithHighestMatchingOperations
  • ReplicaShardAllocatorIT. testPeerRecoveryForClosedIndices (<5% probability)
  • ReplicaShardAllocatorSyncIdIT. testPreferCopyCanPerformNoopRecovery
  • SimpleIndexStateIT. testFastCloseAfterCreateContinuesCreateAfterOpen (~50% failure rate with Expected: <RED> but: was <YELLOW> when creating index that could not be allocated)
  • TransportSearchFailuresIT. testFailedSearchWithWrongQuery (~1% probability of timing out at logger.info("Done Cluster Health, status {}", clusterHealth.getStatus()); the rate looks closer to ~5% with -Dtests.seed=F9E8E5F50A9C9B21)
  • UpdateShardAllocationSettingsIT. testUpdateSameHostSetting
  • ClusterRerouteIT. testDelayWithALargeAmountOfShards (rarely times out: the shard balance does not converge for 250 shards over 3 data nodes with ~5% probability). Related to: BalancedShardsAllocator rebalancing might move shards but not improve the balance #88384
  • GetGlobalCheckpointsActionIT. testWaitOnIndexCreated (repeatedly failing)
  • GetGlobalCheckpointsActionIT#testWaitOnPrimaryShardThrottled (cluster.routing.allocation.node_initial_primaries_recoveries=0 prevents balance from converging)
  • org.elasticsearch.datastreams.DataStreamMigrationIT. testBasicMigration (times out when executing the migration; the listener is not called in the else branch)
  • NodeShutdownShardsIT. testNodeReplacementOnlyAllowsShardsFromReplacedNode
  • test {yaml=indices.split/30_copy_settings/Copy settings during split index}
  • test {yaml=indices.shrink/30_copy_settings/Copy settings during shrink index}
  • TransformAuditorIT.testAliasCreatedforBWCIndexes

org.elasticsearch.action.admin.indices.shrink.TransportResizeAction

  • ShrinkIndexIT. testCreateShrinkIndexToN ([NO(initial allocation of the shrunken index is only allowed on nodes [_id:"hg09_hMfS3uDUfv93xggmA"] that hold a copy of every shard in the index)])
  • ShrinkIndexIT. testShrinkThenSplitWithFailedNode (NO(initial allocation of the shrunken index is only allowed on nodes [_id:"eh7a8csCQzOwCePaFlw9xA"] that hold a copy of every shard in the index))
  • SplitIndexIT. testCreateSplitIndexToN (NO(source primary is allocated on another node))
  • SplitIndexIT. testSplitFromOneToN (NO(source primary is allocated on another node))
  • SplitIndexIT. testSplitIndexPrimaryTerm (NO(source primary is allocated on another node))
  • PartitionedRoutingIT. testShrinking

HasFrozenCacheAllocationDecider

  • various searchable snapshot test failures due to throttling when xpack.searchable.snapshot.shared_cache.size is not yet reported
  • FrozenExistenceDeciderIT. testZeroToOne fails for the same reason

MoveAllocationCommand usage

  • ClusterRerouteIT. testClusterRerouteWithBlocks (uses MoveAllocationCommand)
  • IndexPrimaryRelocationIT. testPrimaryRelocationWhileIndexing (uses MoveAllocationCommand)
  • IndexRecoveryIT. testRerouteRecovery
  • IndicesStoreIntegrationIT. testIndexCleanup (~10% chance of getting stuck when run individually)
  • RelocationIT. testRelocationWhileIndexingRandom (MoveAllocationCommand)
  • RelocationIT. testRelocationWhileRefreshing (MoveAllocationCommand)

Not retrying shard allocation after an error

setWaitForNoRelocatingShards(true) should wait for desired balance to converge

  • AwarenessAllocationIT. testAwarenessZonesIncrementalNodes (health setWaitForNoRelocatingShards(true) does not wait for a pending desired balance computation, ~10% chance)

Snapshot related tests

  • AbortedRestoreIT. testAbortedRestoreAlsoAbortFileRestores
  • BlobStoreIncrementalityIT. testIncrementalBehaviorOnPrimaryFailover (20% chance of failure with "timed out waiting for green state")
  • FsBlobStoreRepositoryIntegTests. testSnapshotAndRestore
  • IndicesOptionsIntegrationIT. testWildcardBehaviourSnapshotRestore
  • MetadataLoadingDuringSnapshotRestoreIT. testWhenMetadataAreLoaded
  • ConcurrentSnapshotsIT. testConcurrentRestoreDeleteAndClone
  • CorruptedBlobStoreRepositoryIT. *
  • DataStreamsSnapshotsIT. *
  • DedicatedClusterSnapshotRestoreIT. *
  • DiskThresholdDeciderIT. testRestoreSnapshotAllocationDoesNotExceedWatermark
  • RestoreSnapshotIT. *
  • SharedClusterSnapshotRestoreIT.testUnrestorableIndexDuringRestore (this test gets stuck when run individually)
  • SnapshotCustomPluginStateIT. testIncludeGlobalState
  • SnapshotStressTestsIT. testRandomActivities
  • SystemDataStreamSnapshotIT. *
  • SystemIndicesSnapshotIT. *

4308 integration tests passed.

ESAllocationTestCase related unit test failures when using desired balance allocator

@idegtiarenko added the :Distributed Coordination/Allocation and Team:Distributed (Obsolete) labels May 4, 2022
@elasticmachine

Pinging @elastic/es-distributed (Team:Distributed)

DaveCTurner added a commit to DaveCTurner/elasticsearch that referenced this issue May 4, 2022
With this change we withhold the response to the update cluster settings
API until the corresponding reroute completes, fixing tests that do
things such as updating allocation filters and then waiting for all
relocations to complete, such as `AwarenessAllocationIT`.

Relates elastic#86429
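
The idea in the commit message above, holding back the API response until the follow-up reroute has finished, could be sketched roughly as follows. This is only an illustration under stated assumptions: `DeferredAckSketch`, `FakeRerouteService`, `updateSettings` and `completeAll` are hypothetical names invented here, and plain `CompletableFuture` stands in for whatever listener mechanism the actual change uses.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class DeferredAckSketch {

    /** Hypothetical stand-in for a reroute that finishes asynchronously. */
    static class FakeRerouteService {
        private final List<CompletableFuture<Void>> pending = new ArrayList<>();

        /** Starts a reroute and returns a future completed when it is done. */
        CompletableFuture<Void> reroute(String reason) {
            CompletableFuture<Void> done = new CompletableFuture<>();
            pending.add(done);
            return done;
        }

        /** Simulates all outstanding reroutes finishing. */
        void completeAll() {
            List<CompletableFuture<Void>> copy = new ArrayList<>(pending);
            pending.clear();
            copy.forEach(f -> f.complete(null));
        }
    }

    /**
     * Hypothetical settings update: the response future is chained onto the
     * reroute future, so the caller is not acknowledged before the reroute ends.
     */
    static CompletableFuture<String> updateSettings(FakeRerouteService rerouteService, String setting) {
        // Applying the setting itself is omitted; only the response ordering is shown.
        return rerouteService.reroute("settings update: " + setting)
            .thenApply(ignored -> "acknowledged: " + setting);
    }

    public static void main(String[] args) {
        FakeRerouteService rerouteService = new FakeRerouteService();
        CompletableFuture<String> response =
            updateSettings(rerouteService, "cluster.routing.allocation.exclude._name");

        System.out.println("response ready before reroute: " + response.isDone());
        rerouteService.completeAll();
        System.out.println("response ready after reroute: " + response.isDone());
        System.out.println(response.join());
    }
}
```

With this ordering, a test that updates an allocation filter and then waits for relocations to finish only starts waiting after the reroute triggered by the update has run.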