[Cluster] Error 503 handling with ROUND_ROBIN #710


Closed
orgrimarr opened this issue Dec 22, 2020 · 3 comments
Labels
Bug A code defect that needs to be fixed.

Comments

@orgrimarr

Environment

  • ArangoJS: 7.2.0
  • ArangoDB 3.7.5 Cluster
    • Tested on docker
    • Tested on windows

Description

When an ArangoDB coordinator is in maintenance mode or is starting up, it returns a 503 error.

This error is handled by arangojs, but with ROUND_ROBIN the driver should try another node even when ArangoDB does not respond with the LEADER_ENDPOINT_HEADER.

Instead of retrying, arangojs throws an error.
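The failover being asked for can be sketched as follows. This is a hypothetical illustration of the desired ROUND_ROBIN behavior, not arangojs internals; `makeRoundRobinClient` and its `request` callback are invented names:

```javascript
// Hypothetical sketch of the desired behavior (not arangojs code):
// a plain 503 from one coordinator should cause the next host in the
// rotation to be tried, instead of surfacing the error immediately.
function makeRoundRobinClient(hosts, request) {
  let active = 0;
  return function run(task) {
    for (let attempt = 0; attempt < hosts.length; attempt++) {
      const host = hosts[active];
      active = (active + 1) % hosts.length; // round-robin rotation
      const res = request(host, task);
      if (res.statusCode !== 503) return res;
      // plain 503 (no leader-endpoint header): fall through and
      // retry the task against the next host
    }
    throw new Error("all hosts returned 503");
  };
}
```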

Steps to reproduce

Coordinators run on ports 9001, 9002, and 9003

Script

const arangojs = require('./build') // local arangojs build

const test = async function(){
    const db = new arangojs.Database({
        url: ['https://127.0.0.1:9001', 'https://127.0.0.1:9002', 'https://127.0.0.1:9003'],
        maxRetries: 3,
        databaseName: '_system',
        loadBalancingStrategy: "ROUND_ROBIN",
        agentOptions: {
            rejectUnauthorized: false // accept the self-signed certificates from --ssl.auto-key
        }
    })

    db.useBasicAuth('root', '')


    while(true){
        console.time('test')
        const cursor = await db.query({
            query: `
                FOR element IN @@collection
                RETURN element
            `,
            bindVars: {
                "@collection": '_users'
            }
        })
        const results = []
        for await (const result of cursor) {
            results.push(result)
        }
        console.timeEnd('test')
        console.log('OK', results.length)
    }
}

test()
.catch(console.error)

Cluster

  • node1: /usr/bin/arangodb --ssl.auto-key
  • node2: /usr/bin/arangodb --ssl.auto-key --starter.join=db1
  • node3: /usr/bin/arangodb --ssl.auto-key --starter.join=db1

docker-compose.yaml

version: '3.7'
services:
  db1:
    image: arangodb:3.7.5
    container_name: db1
    command: /usr/bin/arangodb --ssl.auto-key
    environment:
      ARANGO_ROOT_PASSWORD:
    ports:
      - 9001:8529
    volumes:
      - ./db1:/data
  db2:
    image: arangodb:3.7.5
    container_name: db2
    command: /usr/bin/arangodb --ssl.auto-key --starter.join=db1
    environment:
      ARANGO_ROOT_PASSWORD:
    ports:
      - 9002:8529
    volumes:
      - ./db2:/data
  db3:
    image: arangodb:3.7.5
    container_name: db3
    command: /usr/bin/arangodb --ssl.auto-key --starter.join=db1
    environment:
      ARANGO_ROOT_PASSWORD:
    ports:
      - 9003:8529
    volumes:
      - ./db3:/data

Steps

  • Run the cluster
  • Run the script
  • Kill an arango instance
  • If the script is still running, restart the failed node

Error

    body: {
      error: true,
      errorNum: 503,
      errorMessage: 'service unavailable due to startup or maintenance mode',
      code: 503
    },
    arangojsHostId: 2,
    [Symbol(kCapture)]: false
  },
  errorNum: 503,
  code: 503

Proposition

  • Apply the retry strategy to 503 errors that lack the LEADER_ENDPOINT_HEADER
  • connection.ts, line 61X

Example:

      } else {
        const response = res!;
        if (response.statusCode === 503) {
          if(response.headers[LEADER_ENDPOINT_HEADER]){
            const url = response.headers[LEADER_ENDPOINT_HEADER]!;
            const [index] = this.addToHostList(url);
            task.host = index;
            if (this._activeHost === host) {
              this._activeHost = index;
            }
            this._queue.push(task);
          }
          else if(!task.host && this._shouldRetry && task.retries < (this._maxRetries || this._hosts.length - 1)){
            task.retries += 1;
            this._queue.push(task);
          }
          else {
            response.arangojsHostId = host;
            task.resolve(response);
          }
        } else {
          response.arangojsHostId = host;
          task.resolve(response);
        }
      }
@pluma pluma added the Bug A code defect that needs to be fixed. label Jan 13, 2021
@pluma
Contributor

pluma commented Sep 30, 2021

I don't think this is a bug. A naked 503 response could mean anything. We would need to make explicit assumptions about whether or not the 503 response without the leader endpoint header means it is safe to retry the request.

wdyt @rashtao?

@rashtao

rashtao commented Oct 1, 2021

I think we can safely retry or fail over to another coordinator if the contacted coordinator is starting up or in maintenance mode.
In this case you would get back a 503 response with a JSON body like:

{"error":true,"errorNum":503,"code":503, ...} 
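A driver could use that body to distinguish this retryable case from a naked 503. The following helper is only an illustration of that check; `isRetryableServiceUnavailable` is an invented name, not part of arangojs:

```javascript
// Illustrative helper (not arangojs API): decide whether a 503 response
// looks like ArangoDB's "startup or maintenance mode" error, which should
// be safe to retry against another coordinator.
function isRetryableServiceUnavailable(statusCode, body) {
  if (statusCode !== 503) return false;
  try {
    const parsed = typeof body === "string" ? JSON.parse(body) : body;
    return parsed != null && parsed.error === true && parsed.errorNum === 503;
  } catch (e) {
    // Non-JSON body: a naked 503, do not assume it is retryable.
    return false;
  }
}
```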

@orgrimarr
Author

Hi,
Any update on this issue?

Can I contribute?

This issue was closed.