When the cluster is abnormal, the node started without the full local DB loaded. #1111

Open
zhangyongding opened this issue Nov 10, 2022 · 9 comments

@zhangyongding

What version are you running?
7.6

Are you using Docker or Kubernetes to run your system?
no

Are you running a single node or a cluster?
cluster

What did you do?
Start one of the two cluster nodes and check the database data on the node

What did you expect to happen?
The node started with the full local DB loaded.

What happened instead?
The node loads only snapshot data.

Please include the Status, Nodes, and Expvar output from each node (or at least the Leader!)

See https://github.com/rqlite/rqlite/blob/master/DOC/DIAGNOSTICS.md

@otoolep
Member

otoolep commented Nov 10, 2022

If you start only one node of a two-node cluster, your system is in an undefined state. Why would you expect the node to show the full state?

Anyway, I need more information than this. Please show me your query (ideally using curl) of the node in question before and after restarting.

@otoolep
Member

otoolep commented Nov 10, 2022

I will also need the output of /status, /nodes and /nodes?noleader for the node.

@zhangyongding
Author

Start two rqlite nodes:

rqlited -node-id 1 -http-addr localhost:4441 -raft-addr localhost:4442 ./node.1
rqlited -node-id 2 -http-addr localhost:4443 -raft-addr localhost:4444 -join http://localhost:4441 ./node.2

Create table foo:

curl -XPOST 'localhost:4441/db/execute?pretty&timings' -H "Content-Type: application/json" -d '["CREATE TABLE foo (id INTEGER NOT NULL PRIMARY KEY, name TEXT, age INTEGER)"]'
{
    "results": [
        {
            "time": 0.000236969
        }
    ],
    "time": 0.00389962
}

Insert data into the foo table:

curl -XPOST 'localhost:4441/db/execute?pretty&timings' -H "Content-Type: application/json" -d '[["INSERT INTO foo(name) VALUES(?)", "fiona"],["INSERT INTO foo(name) VALUES(?)", "sinead"]]'
{
        "results": [
            {
                "last_insert_id": 1,
                "rows_affected": 1,
                "time": 0.000159378
            },
            {
                "last_insert_id": 2,
                "rows_affected": 1,
                "time": 0.000022018
            }
        ],
        "time": 0.004037973
}

Query table foo:

curl -XPOST 'localhost:4441/db/query?pretty' -H "Content-Type: application/json" -d '["SELECT * FROM foo"]'  
{
    "results": [
        {
            "columns": [
                "id",
                "name",
                "age"
            ],
            "types": [
                "integer",
                "text",
                "integer"
            ],
            "values": [
                [
                    1,
                    "fiona",
                    null
                ],
                [
                    2,
                    "sinead",
                    null
                ]
            ]
        }
    ]
}

Query status:

curl localhost:4441/status?pretty
{
    "build": {
        "branch": "unknown",
        "build_time": "unknown",
        "commit": "unknown",
        "compiler": "gc",
        "version": "7"
    },
    "cluster": {
        "addr": "localhost:4442",
        "api_addr": "localhost:4441",
        "https": "false"
    },
    "http": {
        "auth": "disabled",
        "bind_addr": "127.0.0.1:4441",
        "cluster": {
            "local_node_addr": "localhost:4442",
            "timeout": 30000000000
        },
        "queue": {
            "_default": {
                "batch_size": 16,
                "max_size": 128,
                "sequence_number": 0,
                "timeout": 50000000
            }
        }
    },
    "node": {
        "current_time": "2022-11-11T09:15:54.128169592+08:00",
        "start_time": "2022-11-11T09:06:51.410110136+08:00",
        "uptime": "9m2.718060081s"
    },
    "os": {
        "executable": "/home/godev/go/bin/rqlited",
        "hostname": "iZwz9a5suhdnpl2tlbxscgZ",
        "page_size": 4096,
        "pid": 10742,
        "ppid": 941
    },
    "runtime": {
        "GOARCH": "amd64",
        "GOMAXPROCS": 4,
        "GOOS": "linux",
        "num_cpu": 4,
        "num_goroutine": 22,
        "version": "go1.19.3"
    },
    "store": {
        "addr": "localhost:4442",
        "apply_timeout": "10s",
        "db_applied_index": 4,
        "db_conf": {
            "memory": true,
            "fk_constraints": false
        },
        "dir": "./node.1",
        "dir_size": 32768,
        "election_timeout": "1s",
        "fsm_index": 4,
        "heartbeat_timeout": "1s",
        "leader": {
            "addr": "localhost:4442",
            "node_id": "1"
        },
        "no_freelist_sync": false,
        "node_id": "1",
        "nodes": [
            {
                "id": "1",
                "addr": "localhost:4442",
                "suffrage": "Voter"
            },
            {
                "id": "2",
                "addr": "localhost:4444",
                "suffrage": "Voter"
            }
        ],
        "observer": {
            "dropped": 0,
            "observed": 1
        },
        "raft": {
            "applied_index": 4,
            "bolt": {
                "FreePageN": 0,
                "PendingPageN": 2,
                "FreeAlloc": 8192,
                "FreelistInuse": 32,
                "TxN": 73,
                "OpenTxN": 0,
                "TxStats": {
                    "PageCount": 22,
                    "PageAlloc": 90112,
                    "CursorCount": 180,
                    "NodeCount": 21,
                    "NodeDeref": 0,
                    "Rebalance": 0,
                    "RebalanceTime": 0,
                    "Split": 0,
                    "Spill": 11,
                    "SpillTime": 141255,
                    "Write": 33,
                    "WriteTime": 21008335
                }
            },
            "commit_index": 4,
            "fsm_pending": 0,
            "last_contact": 0,
            "last_log_index": 4,
            "last_log_term": 2,
            "last_snapshot_index": 0,
            "last_snapshot_term": 0,
            "latest_configuration": "[{Suffrage:Voter ID:1 Address:localhost:4442} {Suffrage:Voter ID:2 Address:localhost:4444}]",
            "latest_configuration_index": 0,
            "log_size": 32768,
            "num_peers": 1,
            "protocol_version": 3,
            "protocol_version_max": 3,
            "protocol_version_min": 0,
            "snapshot_version_max": 1,
            "snapshot_version_min": 0,
            "state": "Leader",
            "term": 2
        },
        "request_marshaler": {
            "compression_batch": 5,
            "compression_size": 150,
            "force_compression": false
        },
        "snapshot_interval": 30000000000,
        "snapshot_threshold": 8192,
        "sqlite3": {
            "compile_options": [
                "ATOMIC_INTRINSICS=1",
                "COMPILER=gcc-7.5.0",
                "DEFAULT_AUTOVACUUM",
                "DEFAULT_CACHE_SIZE=-2000",
                "DEFAULT_FILE_FORMAT=4",
                "DEFAULT_JOURNAL_SIZE_LIMIT=-1",
                "DEFAULT_MMAP_SIZE=0",
                "DEFAULT_PAGE_SIZE=4096",
                "DEFAULT_PCACHE_INITSZ=20",
                "DEFAULT_RECURSIVE_TRIGGERS",
                "DEFAULT_SECTOR_SIZE=4096",
                "DEFAULT_SYNCHRONOUS=2",
                "DEFAULT_WAL_AUTOCHECKPOINT=1000",
                "DEFAULT_WAL_SYNCHRONOUS=1",
                "DEFAULT_WORKER_THREADS=0",
                "ENABLE_DBSTAT_VTAB",
                "ENABLE_FTS3",
                "ENABLE_FTS3_PARENTHESIS",
                "ENABLE_RTREE",
                "ENABLE_UPDATE_DELETE_LIMIT",
                "MALLOC_SOFT_LIMIT=1024",
                "MAX_ATTACHED=10",
                "MAX_COLUMN=2000",
                "MAX_COMPOUND_SELECT=500",
                "MAX_DEFAULT_PAGE_SIZE=8192",
                "MAX_EXPR_DEPTH=1000",
                "MAX_FUNCTION_ARG=127",
                "MAX_LENGTH=1000000000",
                "MAX_LIKE_PATTERN_LENGTH=50000",
                "MAX_MMAP_SIZE=0x7fff0000",
                "MAX_PAGE_COUNT=1073741823",
                "MAX_PAGE_SIZE=65536",
                "MAX_SQL_LENGTH=1000000000",
                "MAX_TRIGGER_DEPTH=1000",
                "MAX_VARIABLE_NUMBER=32766",
                "MAX_VDBE_OP=250000000",
                "MAX_WORKER_THREADS=8",
                "MUTEX_PTHREADS",
                "OMIT_DEPRECATED",
                "OMIT_SHARED_CACHE",
                "SYSTEM_MALLOC",
                "TEMP_STORE=1",
                "THREADSAFE=1"
            ],
            "conn_pool_stats": {
                "ro": {
                    "max_open_connections": 0,
                    "open_connections": 1,
                    "in_use": 0,
                    "idle": 1,
                    "wait_count": 0,
                    "wait_duration": 0,
                    "max_idle_closed": 0,
                    "max_idle_time_closed": 0,
                    "max_lifetime_closed": 0
                },
                "rw": {
                    "max_open_connections": 1,
                    "open_connections": 1,
                    "in_use": 0,
                    "idle": 1,
                    "wait_count": 0,
                    "wait_duration": 0,
                    "max_idle_closed": 0,
                    "max_idle_time_closed": 0,
                    "max_lifetime_closed": 0
                }
            },
            "db_size": 8192,
            "mem_stats": {
                "cache_size": -2000,
                "freelist_count": 0,
                "hard_heap_limit": 0,
                "max_page_count": 1073741823,
                "page_count": 2,
                "page_size": 4096,
                "soft_heap_limit": 0
            },
            "path": ":memory:",
            "ro_dsn": "file:/cODBBejrmMLDfcgldpMK?mode=ro\u0026vfs=memdb\u0026_txlock=deferred\u0026_fk=false",
            "rw_dsn": "file:/cODBBejrmMLDfcgldpMK?mode=rw\u0026vfs=memdb\u0026_txlock=immediate\u0026_fk=false",
            "version": "3.38.5"
        },
        "startup_on_disk": false,
        "trailing_logs": 10240
    }
}

Query nodes:

curl localhost:4441/nodes?pretty 
{
    "1": {
        "api_addr": "http://localhost:4441",
        "addr": "localhost:4442",
        "reachable": true,
        "leader": true,
        "time": 0.00002403
    },
    "2": {
        "api_addr": "http://localhost:4443",
        "addr": "localhost:4444",
        "reachable": true,
        "leader": false,
        "time": 0.001975779
    }
}

Query nodes?noleader:

curl localhost:4441/nodes?noleader
{
        "1": {
                "api_addr": "http://localhost:4441",
                "addr": "localhost:4442",
                "reachable": true,
                "leader": true,
                "time": 0.000027689
        },
        "2": {
                "api_addr": "http://localhost:4443",
                "addr": "localhost:4444",
                "reachable": true,
                "leader": false,
                "time": 0.000294698
        }
}

Stop node 1 and node 2.

Start node 1:

rqlited -node-id 1 -http-addr localhost:4441 -raft-addr localhost:4442 ./node.1

Query table foo with the read consistency level set to none:

curl -XPOST 'localhost:4441/db/query?level=none&pretty' -H "Content-Type: application/json" -d '["SELECT * FROM foo"]'  
{
        "results": [
            {
                "error": "no such table: foo"
            }
        ]
}

Query status:

curl localhost:4441/status?pretty
{
    "build": {
        "branch": "unknown",
        "build_time": "unknown",
        "commit": "unknown",
        "compiler": "gc",
        "version": "7"
    },
    "cluster": {
        "addr": "localhost:4442",
        "api_addr": "localhost:4441",
        "https": "false"
    },
    "http": {
        "auth": "disabled",
        "bind_addr": "127.0.0.1:4441",
        "cluster": {
            "local_node_addr": "localhost:4442",
            "timeout": 30000000000
        },
        "queue": {
            "_default": {
                "batch_size": 16,
                "max_size": 128,
                "sequence_number": 0,
                "timeout": 50000000
            }
        }
    },
    "node": {
        "current_time": "2022-11-11T10:07:46.737276036+08:00",
        "start_time": "2022-11-11T10:06:36.912889316+08:00",
        "uptime": "1m9.824387398s"
    },
    "os": {
        "executable": "/home/godev/go/bin/rqlited",
        "hostname": "iZwz9a5suhdnpl2tlbxscgZ",
        "page_size": 4096,
        "pid": 11574,
        "ppid": 941
    },
    "runtime": {
        "GOARCH": "amd64",
        "GOMAXPROCS": 4,
        "GOOS": "linux",
        "num_cpu": 4,
        "num_goroutine": 16,
        "version": "go1.19.3"
    },
    "store": {
        "addr": "localhost:4442",
        "apply_timeout": "10s",
        "db_applied_index": 0,
        "db_conf": {
            "memory": true,
            "fk_constraints": false
        },
        "dir": "./node.1",
        "dir_size": 32768,
        "election_timeout": "1s",
        "fsm_index": 0,
        "heartbeat_timeout": "1s",
        "leader": {
            "addr": "",
            "node_id": ""
        },
        "no_freelist_sync": false,
        "node_id": "1",
        "nodes": [
            {
                "id": "1",
                "addr": "localhost:4442",
                "suffrage": "Voter"
            },
            {
                "id": "2",
                "addr": "localhost:4444",
                "suffrage": "Voter"
            }
        ],
        "observer": {
            "dropped": 0,
            "observed": 0
        },
        "raft": {
            "applied_index": 0,
            "bolt": {
                "FreePageN": 0,
                "PendingPageN": 2,
                "FreeAlloc": 8192,
                "FreelistInuse": 32,
                "TxN": 17,
                "OpenTxN": 0,
                "TxStats": {
                    "PageCount": 291,
                    "PageAlloc": 1191936,
                    "CursorCount": 473,
                    "NodeCount": 290,
                    "NodeDeref": 0,
                    "Rebalance": 0,
                    "RebalanceTime": 0,
                    "Split": 0,
                    "Spill": 145,
                    "SpillTime": 1420934,
                    "Write": 437,
                    "WriteTime": 221949343
                }
            },
            "commit_index": 0,
            "fsm_pending": 0,
            "last_contact": "never",
            "last_log_index": 8,
            "last_log_term": 38,
            "last_snapshot_index": 0,
            "last_snapshot_term": 0,
            "latest_configuration": "[{Suffrage:Voter ID:1 Address:localhost:4442} {Suffrage:Voter ID:2 Address:localhost:4444}]",
            "latest_configuration_index": 0,
            "log_size": 32768,
            "num_peers": 1,
            "protocol_version": 3,
            "protocol_version_max": 3,
            "protocol_version_min": 0,
            "snapshot_version_max": 1,
            "snapshot_version_min": 0,
            "state": "Candidate",
            "term": 162
        },
        "request_marshaler": {
            "compression_batch": 5,
            "compression_size": 150,
            "force_compression": false
        },
        "snapshot_interval": 30000000000,
        "snapshot_threshold": 8192,
        "sqlite3": {
            "compile_options": [
                "ATOMIC_INTRINSICS=1",
                "COMPILER=gcc-7.5.0",
                "DEFAULT_AUTOVACUUM",
                "DEFAULT_CACHE_SIZE=-2000",
                "DEFAULT_FILE_FORMAT=4",
                "DEFAULT_JOURNAL_SIZE_LIMIT=-1",
                "DEFAULT_MMAP_SIZE=0",
                "DEFAULT_PAGE_SIZE=4096",
                "DEFAULT_PCACHE_INITSZ=20",
                "DEFAULT_RECURSIVE_TRIGGERS",
                "DEFAULT_SECTOR_SIZE=4096",
                "DEFAULT_SYNCHRONOUS=2",
                "DEFAULT_WAL_AUTOCHECKPOINT=1000",
                "DEFAULT_WAL_SYNCHRONOUS=1",
                "DEFAULT_WORKER_THREADS=0",
                "ENABLE_DBSTAT_VTAB",
                "ENABLE_FTS3",
                "ENABLE_FTS3_PARENTHESIS",
                "ENABLE_RTREE",
                "ENABLE_UPDATE_DELETE_LIMIT",
                "MALLOC_SOFT_LIMIT=1024",
                "MAX_ATTACHED=10",
                "MAX_COLUMN=2000",
                "MAX_COMPOUND_SELECT=500",
                "MAX_DEFAULT_PAGE_SIZE=8192",
                "MAX_EXPR_DEPTH=1000",
                "MAX_FUNCTION_ARG=127",
                "MAX_LENGTH=1000000000",
                "MAX_LIKE_PATTERN_LENGTH=50000",
                "MAX_MMAP_SIZE=0x7fff0000",
                "MAX_PAGE_COUNT=1073741823",
                "MAX_PAGE_SIZE=65536",
                "MAX_SQL_LENGTH=1000000000",
                "MAX_TRIGGER_DEPTH=1000",
                "MAX_VARIABLE_NUMBER=32766",
                "MAX_VDBE_OP=250000000",
                "MAX_WORKER_THREADS=8",
                "MUTEX_PTHREADS",
                "OMIT_DEPRECATED",
                "OMIT_SHARED_CACHE",
                "SYSTEM_MALLOC",
                "TEMP_STORE=1",
                "THREADSAFE=1"
            ],
            "conn_pool_stats": {
                "ro": {
                    "max_open_connections": 0,
                    "open_connections": 1,
                    "in_use": 0,
                    "idle": 1,
                    "wait_count": 0,
                    "wait_duration": 0,
                    "max_idle_closed": 0,
                    "max_idle_time_closed": 0,
                    "max_lifetime_closed": 0
                },
                "rw": {
                    "max_open_connections": 1,
                    "open_connections": 1,
                    "in_use": 0,
                    "idle": 1,
                    "wait_count": 0,
                    "wait_duration": 0,
                    "max_idle_closed": 0,
                    "max_idle_time_closed": 0,
                    "max_lifetime_closed": 0
                }
            },
            "db_size": 0,
            "mem_stats": {
                "cache_size": -2000,
                "freelist_count": 0,
                "hard_heap_limit": 0,
                "max_page_count": 1073741823,
                "page_count": 0,
                "page_size": 4096,
                "soft_heap_limit": 0
            },
            "path": ":memory:",
            "ro_dsn": "file:/ksLfdbkaaHttAjGtaNbM?mode=ro\u0026vfs=memdb\u0026_txlock=deferred\u0026_fk=false",
            "rw_dsn": "file:/ksLfdbkaaHttAjGtaNbM?mode=rw\u0026vfs=memdb\u0026_txlock=immediate\u0026_fk=false",
            "version": "3.38.5"
        },
        "startup_on_disk": false,
        "trailing_logs": 10240
    }
}
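
For readers comparing the two /status outputs, the relevant difference is in the `store.raft` section: before the shutdown the log and the database are level (index 4 everywhere), while after the restart the log holds 8 entries but `applied_index`, `fsm_index` and `db_applied_index` are all 0 and `db_size` is 0. The following is a minimal sketch of that comparison; the dictionaries are hand-copied subsets of the outputs above, and the function names are hypothetical helpers, not part of rqlite.

```python
# Values copied by hand from the store.raft section of the two /status
# outputs in this thread. Only the fields needed for the comparison are kept.
BEFORE_SHUTDOWN = {"state": "Leader", "commit_index": 4,
                   "applied_index": 4, "last_log_index": 4}
AFTER_RESTART = {"state": "Candidate", "commit_index": 0,
                 "applied_index": 0, "last_log_index": 8}


def unapplied_entries(raft: dict) -> int:
    """Log entries present in the Raft log but not yet applied to the database."""
    return raft["last_log_index"] - raft["applied_index"]


def describe(raft: dict) -> str:
    """One-line summary of whether the database reflects the local log."""
    pending = unapplied_entries(raft)
    if pending == 0:
        return f"{raft['state']}: database is up to date with the local log"
    return f"{raft['state']}: {pending} log entries on disk are not applied"


if __name__ == "__main__":
    print(describe(BEFORE_SHUTDOWN))
    print(describe(AFTER_RESTART))
```

Against a live node, the same dictionary could be taken from the `store.raft` object of the parsed /status response.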

Query nodes:

curl localhost:4441/nodes?pretty 
{
    "1": {
        "api_addr": "http://localhost:4441",
        "addr": "localhost:4442",
        "reachable": true,
        "leader": false,
        "time": 0.000023635
    },
    "2": {
        "addr": "localhost:4444",
        "reachable": false,
        "leader": false,
        "error": "factory is not able to fill the pool: dial tcp [::1]:4444: connect: connection refused"
    }
}

Query nodes?noleader:

curl localhost:4441/nodes?noleader
{
        "1": {
                "api_addr": "http://localhost:4441",
                "addr": "localhost:4442",
                "reachable": true,
                "leader": false,
                "time": 0.000022582
        },
        "2": {
                "addr": "localhost:4444",
                "reachable": false,
                "leader": false,
                "error": "factory is not able to fill the pool: dial tcp [::1]:4444: connect: connection refused"
        }
}
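
The /nodes outputs also show why node 1 stays in the Candidate state: Raft needs a strict majority of voters to elect a leader, and with two voters that majority is both nodes. A minimal sketch of the check follows, assuming every node listed by /nodes is a voter (true for this cluster according to the `suffrage` fields in /status); the dictionaries are trimmed copies of the responses above and the helper names are hypothetical.

```python
# Trimmed copies of the two /nodes responses shown in this thread.
NODES_BEFORE = {
    "1": {"addr": "localhost:4442", "reachable": True, "leader": True},
    "2": {"addr": "localhost:4444", "reachable": True, "leader": False},
}
NODES_AFTER = {
    "1": {"addr": "localhost:4442", "reachable": True, "leader": False},
    "2": {"addr": "localhost:4444", "reachable": False, "leader": False},
}


def quorum_size(voters: int) -> int:
    """Smallest strict majority of `voters`."""
    return voters // 2 + 1


def can_elect_leader(nodes: dict) -> bool:
    """True if enough nodes are reachable to form a majority.

    Assumption: every listed node is a voter, as in this two-node cluster.
    """
    reachable = sum(1 for node in nodes.values() if node["reachable"])
    return reachable >= quorum_size(len(nodes))


if __name__ == "__main__":
    print(can_elect_leader(NODES_BEFORE))
    print(can_elect_leader(NODES_AFTER))
```

By the same arithmetic, a three-node cluster has a quorum of two and can still elect a leader with one node down, which a two-node cluster cannot.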

@zhangyongding
Author

I want to use level none to read the full data of the local node when the cluster is abnormal.

@otoolep
Member

otoolep commented Nov 11, 2022

Thanks for the detailed report. Your particular scenario does not appear to be supported by the underlying Raft system.

  • Start node 1 and node 2, forming a 2-node cluster. Kill node 1 while keeping node 2 running. In this case a query of the remaining node will work OK.
  • Form a 2-node cluster. Shut down the entire cluster. Only start node 2. In this case the Raft subsystem on node 2 refuses to restore the database until it can first connect to node 1.

So the situations are different: in the first a single node goes down; in the second the whole cluster goes down. It seems like the Hashicorp Raft code works differently at full cluster restart time, which is what you are doing.

I agree it's not ideal, but I don't know at this time if this behaviour can be changed. I'll look into it.
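
One way to picture the difference between the two cases is the toy model below. It is only a sketch of the general Raft idea that a node applies log entries once it knows they are committed, and that this knowledge lives in memory and is lost on restart. It is not the Hashicorp Raft implementation, and the class and method names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNode:
    """Toy stand-in for a Raft node; NOT the Hashicorp Raft implementation."""
    snapshot: list = field(default_factory=list)  # state captured by the last snapshot (on disk)
    log: list = field(default_factory=list)       # entries persisted after that snapshot (on disk)
    commit_index: int = 0                         # how many log entries are known committed (memory only)
    db: list = field(default_factory=list)        # in-memory database

    def restart(self) -> None:
        """Process restart: memory is lost, the snapshot and the log survive."""
        self.commit_index = 0
        self.db = list(self.snapshot)

    def learn_commit_index(self, index: int) -> None:
        """Once a leader exists, apply log entries up to the committed index."""
        self.commit_index = min(index, len(self.log))
        self.db = list(self.snapshot) + self.log[: self.commit_index]


if __name__ == "__main__":
    node = ToyNode(log=["CREATE TABLE foo", "INSERT fiona", "INSERT sinead"])
    node.learn_commit_index(3)  # cluster healthy: all three entries applied
    node.restart()              # whole cluster stopped, only this node restarted
    print(node.db)              # only snapshot contents, which is empty here
    node.learn_commit_index(3)  # the peer returns and a leader is elected
    print(node.db)
```

In this model, a node that keeps running after its peer dies still holds the applied data in memory, while a node restarted alone has only whatever its last snapshot contained. That matches the report above, where `last_snapshot_index` is 0 and the table is missing.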

@zhangyongding
Author

Thank you. Looking forward to it.

@zhangyongding
Author

Is there any new progress?

@otoolep
Member

otoolep commented Dec 25, 2023

No updates yet.

@otoolep
Member

otoolep commented May 17, 2024

No sign at this time that this behavior can be changed.
