Disrupted bootstrap can break host ID -> IP mappings #18676
I see this error also in dtest, for example:
It'd be good to have the backtrace decoded and in a comment here (for searching later). @margdoc

It seems like a recent regression; let's get to the bottom of it and fix whatever needs fixing, or consider reverting the patch that caused/exposed it.
The root cause of the failure -- the scenario leading to a missing mapping -- is not a recent regression. But the fact that it now causes a crash is.

@margdoc please also post the findings of what happens before cfd03fe. I remember there were no failures or crashes, but there were lots of warnings in the logs, and through analysis we concluded that in that broken state with the missing mapping it is possible to perform writes which should be failing.
Also @margdoc please write down an explanation of how we end up with an entry in …
On one hand, the scenario presented looks hard to hit -- we need a crash at a very specific moment. On the other hand, because of the recently introduced abort, hitting it once may put the node in a crash loop.

But even if the node enters a crash loop, it should be possible to recover it using maintenance mode -- by writing the mapping into system.peers manually. All in all, if we consider just the scenario presented in the first post, I don't think it's a blocker.

But there's also the dtest that @bhalevy referenced: I suspect it's a slightly different scenario that hits the same symptom. We need to understand if it's the same thing (losing the IP mapping due to a crash) or something else. For now I'm marking the issue as a blocker. I'll try to take a look at Benny's test and see if it's a different issue. If it is something different, then I'll probably open a new GitHub issue, mark that one as a blocker, and mark this one as non-blocker.
We now also see a similar symptom in e.g. https://jenkins.scylladb.com/job/scylla-master/job/tablets/job/gating-dtest-release-with-tablets/67/artifact/logs-full.release.003/1716082048650_replace_address_test.py%3A%3ATestReplaceAddress%3A%3Atest_replace_node_diff_ip_take_write%5Buse_endpoint-rbo_enabled%5D/node3.log
Decoded:
Please see https://github.com/bhalevy/scylla/commits/storage_proxy-host_id-wip/ for a sketch of preventing this issue by using host_id where possible on the relevant paths. Cc @kbr-scylla, @gleb-cloudius WDYT?
I did not look too deeply into it yet, but when using host ids at higher levels, things like …
Right. This is exactly the direction in my branch.
When the second node is bootstrapping, the first node runs the `node_state::bootstrapping` case of `topology_coordinator::handle_topology_transition`:

```cpp
case node_state::bootstrapping: {
    assert(!node.rs->ring);
    auto num_tokens = std::get<join_param>(node.req_param.value()).num_tokens;
    auto tokens_string = std::get<join_param>(node.req_param.value()).tokens_string;
    // A node has just been accepted and does not have tokens assigned yet
    // Need to assign random tokens to the node
    auto tmptr = get_token_metadata_ptr();
    std::unordered_set<token> bootstrap_tokens;
    try {
        bootstrap_tokens = dht::boot_strapper::get_bootstrap_tokens(tmptr, tokens_string, num_tokens, dht::check_token_endpoint::yes);
    } catch (...) {
        _rollback = fmt::format("Failed to assign tokens: {}", std::current_exception());
    }
    auto [gen_uuid, guard, mutation] = co_await prepare_and_broadcast_cdc_generation_data(
            tmptr, take_guard(std::move(node)), bootstrapping_info{bootstrap_tokens, *node.rs});
    topology_mutation_builder builder(guard.write_timestamp());
    // Write the new CDC generation data through raft.
    builder.set_transition_state(topology::transition_state::commit_cdc_generation)
           .set_new_cdc_generation_data_uuid(gen_uuid)
           .with_node(node.id)
           .set("tokens", bootstrap_tokens);
    auto reason = ::format(
            "bootstrap: insert tokens and CDC generation data (UUID: {})", gen_uuid);
    co_await update_topology_state(std::move(guard), {std::move(mutation), builder.build()}, reason);
}
```

But the first node crashes before it saves the IP address of the second node (error injection "crash-before-bootstrapping-node-added"). When the first node restarts, it has only the bootstrap tokens and the host ID of the second node:

```cpp
auto process_transition_node = [&](raft::server_id id, const replica_state& rs) -> future<> {
    locator::host_id host_id{id.uuid()};
    auto ip = am.find(id);
    rtlogger.trace("loading topology: raft id={} ip={} node state={} dc={} rack={} tokens state={} tokens={}",
            id, ip, rs.state, rs.datacenter, rs.rack, _topology_state_machine._topology.tstate,
            seastar::value_of([&] () -> sstring {
                return rs.ring ? ::format("{}", rs.ring->tokens) : sstring("null");
            }));
    switch (rs.state) {
    case node_state::bootstrapping:
        if (rs.ring.has_value()) {
            if (ip && !is_me(*ip)) {
                utils::get_local_injector().inject("crash-before-bootstrapping-node-added", [] {
                    slogger.error("crash-before-bootstrapping-node-added hit, killing the node");
                    _exit(1);
                });
                // Save ip -> id mapping in peers table because we need it on restart, but do not save tokens until owned
                sys_ks_futures.push_back(_sys_ks.local().update_peer_info(*ip, host_id, {}));
            }
}
```
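The broken state this leaves behind can be pictured with a small self-contained sketch (hypothetical stand-in types, not Scylla's; the point is only the asymmetry between what raft persisted and what system.peers holds):

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <unordered_map>

// Stand-in for the coordinator's persisted view of the bootstrapping node.
struct replica_state_sketch {
    std::optional<std::set<long>> ring_tokens; // tokens, persisted through raft
};

int main() {
    // After the injected crash, the raft topology still holds node2's tokens...
    replica_state_sketch node2{std::set<long>{42, 4242}};

    // ...but system.peers was never written, so the host id -> IP mapping is gone.
    std::unordered_map<std::string /*host id*/, std::string /*ip*/> peers;

    assert(node2.ring_tokens.has_value());     // survived the crash (raft state)
    assert(!peers.contains("node2-host-id")); // did not (local system.peers)
}
```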
Before cfd03fe, this test passes (but it shouldn't): when the first node restarts and enters the `commit_cdc_generation` transition state, writes that should fail still succeed.

Why should these writes fail?
A write fails if the number of failed responses plus the number of required successful responses exceeds the total number of replicas. What are these values in our scenario?
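Before looking at the actual values, here is a minimal sketch of the failure condition itself (hypothetical names; the real logic lives in the write response handler):

```cpp
#include <cstddef>

// block_for: successful responses required by the consistency level
//            (plus pending endpoints, as shown below).
// total:     total number of replica targets for the write.
// failed:    failure responses received so far.
//
// Once even an all-success outcome from the remaining replicas could not
// reach block_for, the write must be reported as failed.
bool write_must_fail(std::size_t failed, std::size_t block_for, std::size_t total) {
    return failed + block_for > total;
}
```

The actual values come from the handler constructors: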
```cpp
abstract_write_response_handler(shared_ptr<storage_proxy> p,
        locator::effective_replication_map_ptr erm,
        db::consistency_level cl, db::write_type type,
        std::unique_ptr<mutation_holder> mh, inet_address_vector_replica_set targets, tracing::trace_state_ptr trace_state,
        storage_proxy::write_stats& stats, service_permit permit, db::per_partition_rate_limit::info rate_limit_info, size_t pending_endpoints = 0,
        inet_address_vector_topology_change dead_endpoints = {}, is_cancellable cancellable = is_cancellable::no)
    : _id(p->get_next_response_id()), _proxy(std::move(p))
    , _effective_replication_map_ptr(std::move(erm))
    , _trace_state(trace_state), _cl(cl), _type(type), _mutation_holder(std::move(mh)), _targets(std::move(targets)),
      _dead_endpoints(std::move(dead_endpoints)), _stats(stats), _expire_timer([this] { timeout_cb(); }), _permit(std::move(permit)),
      _rate_limit_info(rate_limit_info) {
    // original comment from cassandra:
    // during bootstrap, include pending endpoints in the count
    // or we may fail the consistency level guarantees (see #833, #8058)
    _total_block_for = db::block_for(*_effective_replication_map_ptr, _cl) + pending_endpoints;
    ++_stats.writes;
    if (cancellable) {
        register_cancellable();
    }
}

datacenter_write_response_handler(shared_ptr<storage_proxy> p,
        locator::effective_replication_map_ptr ermp,
        db::consistency_level cl, db::write_type type,
        std::unique_ptr<mutation_holder> mh, inet_address_vector_replica_set targets,
        const inet_address_vector_topology_change& pending_endpoints, inet_address_vector_topology_change dead_endpoints, tracing::trace_state_ptr tr_state,
        storage_proxy::write_stats& stats, service_permit permit, db::per_partition_rate_limit::info rate_limit_info) :
    abstract_write_response_handler(p, ermp, cl, type, std::move(mh), // can't move ermp, it's used below
            std::move(targets), std::move(tr_state), stats, std::move(permit), rate_limit_info,
            ermp->get_topology().count_local_endpoints(pending_endpoints), std::move(dead_endpoints)) {
    _total_endpoints = _effective_replication_map_ptr->get_topology().count_local_endpoints(_targets);
}
```

`count_local_endpoints` looks up each endpoint's location, implemented as:

```cpp
const endpoint_dc_rack& topology::get_location(const inet_address& ep) const {
    if (auto node = find_node(ep)) {
        return node->dc_rack();
    }
    // We should do the following check after lookup in nodes.
    // In tests, there may be no config for local node, so fall back to get_location()
    // only if no mapping is found. Otherwise, get_location() will return empty location
    // from config or random node, neither of which is correct.
    if (ep == _cfg.this_endpoint) {
        return get_location();
    }
    // FIXME -- this shouldn't happen. After topology is stable and is
    // correctly populated with endpoints, this should be replaced with
    // on_internal_error()
    tlogger.warn("Requested location for node {} not in topology. backtrace {}", ep, lazy_backtrace());
    return endpoint_dc_rack::default_location;
}
```

The pending endpoint without an IP mapping gets `endpoint_dc_rack::default_location`, so the datacenter filter discards this entry, and the number of pending endpoints passed to the handler is smaller than it should be: `_total_block_for` ends up too low, and the write succeeds even though it should fail.
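To make the effect concrete, here is a hedged worked example; the concrete numbers (RF=1, a local consistency level requiring one response, one bootstrapping node) are assumptions chosen to match a two-node test, not values taken from the code:

```cpp
#include <cstddef>
#include <cstdio>

int main() {
    // Case 1: the pending endpoint's IP is known, so it is counted.
    std::size_t block_for = 1 + 1; // CL requirement + 1 pending endpoint
    std::size_t total     = 2;     // natural replica + pending replica
    std::size_t failed    = 1;     // the bootstrapping node is unreachable
    std::printf("mapping present: fails = %d\n",
                failed + block_for > total); // 1 + 2 > 2 -> the write fails

    // Case 2: the mapping is missing; get_location() returns
    // default_location and the pending endpoint is filtered out.
    block_for = 1 + 0; // pending count dropped to 0
    total     = 1;     // only the natural replica remains a target
    failed    = 0;
    std::printf("mapping missing: fails = %d\n",
                failed + block_for > total); // 0 + 1 > 1 -> the write succeeds
}
```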
FWIW, we ran a bisect job to try to pinpoint when this started:
Edit: I'm not sure that the good commit is indeed good.
@bhalevy I know exactly why it happens in the dtest, and it is a different, tablets-related issue. Please open it. The problem is that the tablets code adds a node in the left state to the topology, and it brings havoc to the universe.
Well, I do not know exactly what happens, since I did not investigate why a left node is returned as a replica, but I do know it is a different issue from this one.
🤦 It is returned because the table is governed by tablets now and the left node is still a replica there. This is why we are adding it to the topology in the first place.
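A hedged sketch of why that matters (hypothetical types; real tablet metadata is more involved): tablet replica sets reference nodes by host id, so a node in the left state can still appear as a replica until its tablets are migrated away, and it therefore still needs a topology entry for location/IP lookups.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <string>
#include <vector>

using host_id = std::string; // stand-in for locator::host_id

int main() {
    std::set<host_id> left_nodes = {"nodeC"}; // nodes that already left

    // A tablet's replica set, recorded by host id; migration away from
    // nodeC has not happened yet, so it is still listed.
    std::vector<host_id> tablet_replicas = {"nodeA", "nodeC"};

    // Any replica still in the left set needs a topology entry, otherwise
    // location lookups for it fail -- the havoc described above.
    bool left_replica = std::any_of(tablet_replicas.begin(), tablet_replicas.end(),
            [&](const host_id& h) { return left_nodes.contains(h); });
    assert(left_replica);
}
```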
Moving the release blocker label to #18843 |
If there is no mapping from host id to IP while a node is in the bootstrap state, there is no point in adding it to the pending endpoints, since the write handler will not be able to map it back to a host id anyway. If the transition state requires double writes, though, we still want to fail. In case the state is write_both_read_old, we fail the barrier, which will cause the topology operation to roll back; in case of write_both_read_new, we assert, but this should not happen, since the mapping is persisted by that point (or we already failed in the write_both_read_old state).

Fixes: scylladb#18676
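A minimal sketch of the fix's decision logic as described in the commit message above (names, types, and structure are assumptions, not the actual patch):

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>

// Hypothetical stand-ins for the real types.
enum class transition_state { commit_cdc_generation, write_both_read_old, write_both_read_new };
struct ip_address { unsigned v; };

// Decide what to do with a bootstrapping node whose host id has no IP mapping.
// Returns true if the node should be counted as a pending endpoint.
bool count_as_pending(std::optional<ip_address> ip, transition_state st) {
    if (ip) {
        return true; // mapping known: count it as usual
    }
    switch (st) {
    case transition_state::write_both_read_old:
        // Double writes are required but the replica cannot be addressed:
        // fail the barrier so the topology operation rolls back.
        throw std::runtime_error("no host id -> IP mapping; failing barrier");
    case transition_state::write_both_read_new:
        // The mapping must have been persisted before reaching this state
        // (or we already failed in write_both_read_old).
        assert(false && "mapping missing in write_both_read_new");
        return false;
    default:
        // No double writes yet: skip the node; the write handler could not
        // map it back to a host id anyway.
        return false;
    }
}

int main() {
    // In commit_cdc_generation (no double writes yet) the node is skipped.
    assert(!count_as_pending(std::nullopt, transition_state::commit_cdc_generation));
}
```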
I'm not sure what the status of the issue here is, but it should be moved to 6.1, and perhaps backported to 6.0.x.
Updated milestone to 6.0.1 |
When running test_ip_mappings on margdoc@a7c4afa, the first node is aborted after the restart.

The scenario of this test is as follows:

- The first node crashes before saving the second node's IP in the system.peers table (error injection "crash-before-bootstrapping-node-added"). Bootstrap tokens of the second node are already saved in `topology_coordinator::handle_topology_transition`.
- The first node restarts while in the `commit cdc generation` transition state.

Logs: scylla-1.log, scylla-2.log, topology_experimental_raft.test_ip_mappings.1.log
Decoded error backtrace: