Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bitnami/etcd] Stop relying on files for state #75906

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

pckhoi
Copy link
Contributor

@pckhoi pckhoi commented Dec 25, 2024

Description of the change

The current etcd container and chart have a few major problems:

  1. It relies on files outside of the data directory which could contain conflicting information compared to the data dir and the cluster
  2. Removing the member during pre-stop hook is problematic. I guess this was added to support scaling down the cluster. If so, this logic is leaky. There are 2 cases where this breaks down:
    1. If the pod was killed for reasons other than replicas update then the next time the pod starts, it will not be able to start from the existing data dir which means it must throw away the data dir and start from scratch.
    2. If the cluster is scaled down and the PVC is retained, the next time the cluster is scaled up, the new member will encounter a non-empty data dir which it must discard
  3. If the member was removed, the container chokes up on non-empty data dir in most cases except when recovering from a snapshot
  4. It might attempt to add a new member even when an old member with the same name already exists. This is caused by relying on files for state
  5. It runs etcdctl member update for unclear reasons when the data dir is not empty and there is a member ID
  6. It relies on ETCD_INITIAL_CLUSTER_STATE to know whether the cluster is new which could be inaccurate

This PR add the following changes:

  • Add preupgrade.sh which should be run in a Helm pre-upgrade hook. When the cluster is scaled down, it detects and removes obsolete members with etcdctl member remove.
  • Remove prestop.sh
  • Stop storing/checking member ID from the member_id file. Instead, the remote member ID is read from the cluster with etcdctl member list, and the local member ID is checked for conflict during startup.
  • Stop storing/checking member removal state from member_removal.log. Check with etcdctl member list instead.
  • If the data dir is not empty, check if the member still belongs to the cluster (remote ID and local ID are the same). If there is a conflict, remove the data dir, remove the old member, add a new member, and start the member from scratch.
  • Remove environment variable ETCD_DISABLE_STORE_MEMBER_ID
  • Remove environment variable ETCD_DISABLE_PRESTOP
  • Environment variable ETCD_INITIAL_CLUSTER_STATE becomes read-only

Benefits

  • Not relying on files outside of the data directory means there is only a single source of truths (or only as many as there are live members in the cluster plus the data dir), which makes most operations more reliable
  • Removing obsolete members in Helm pre-upgrade hook means the etcdctl member remove command tends to be executed against a healthy cluster
  • If the pod was killed for reasons other than replica changes, it can rejoin the cluster on its own while keeping all its data intact
  • The container no longer chokes up on a non-empty data dir, even when the old member is removed

Possible drawbacks

  • If during initialization there is a network outage and the current member can't connect to other members, it will think that it must start a new cluster. That said, I don't think there is any good solution in this case except manual recovery.
  • I have not tested this set of changes outside of Helm/K8s

Applicable issues

Additional information

Related changes in the Helm chart: bitnami/charts#31161 and bitnami/charts#31164

- Remove prestop logic (no longer removing member when container stops)
- Remove members not included in ETCD_INITIAL_CLUSTERS during startup
- Stop storing member id on a separate file, member id is checked from
  etcd data dir instead
- Stop reading member removal state off of disk, probe the cluster
  instead
- Remove old member (with the same name) if exist before adding new
  member
- If data dir is not empty, check if the member still belongs to the
  cluster. If not, remove data dir, remove member with the same name,
  and add new member
- Remove env var ETCD_DISABLE_STORE_MEMBER_ID
- Remove env var ETCD_DISABLE_PRESTOP

Signed-off-by: Khoi Pham <[email protected]>
@github-actions github-actions bot added etcd triage Triage is needed labels Dec 25, 2024
@github-actions github-actions bot requested a review from javsalgar December 25, 2024 09:12
@pckhoi
Copy link
Contributor Author

pckhoi commented Dec 26, 2024

I'm planning to open a complementary PR in the charts repo. I will try to add more tests there.

@carrodher carrodher added verify Execute verification workflow for these changes in-progress labels Dec 26, 2024
@github-actions github-actions bot removed the triage Triage is needed label Dec 26, 2024
@github-actions github-actions bot removed the request for review from javsalgar December 26, 2024 08:06
@github-actions github-actions bot requested a review from alvneiayu December 26, 2024 08:06
@carrodher carrodher removed the request for review from alvneiayu December 27, 2024 07:36
Copy link
Contributor

@juan131 juan131 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pckhoi

Thanks so much for this amazing contribution! It'd definitely help on making the Bitnami etcd chart more stable.

I think the main concern/challenge with your changes would be providing a solution for users who may scale down the cluster via kubectl scale sts/etcd --replicas X (or via some HorizontalPodAutoscaler that may also scale down the cluster without Helm's control via hooks). Correct me if I'm wrong but this use case won't be covered, right?

@pckhoi
Copy link
Contributor Author

pckhoi commented Jan 3, 2025

@juan131 you're correct that the autoscaling use case isn't covered. People use Etcd for its consistency rather than for handling large, fluctuating traffic so I think autoscaling to handle large traffic is a niche use case.

As for manual scaling, running helm upgrade makes more sense if the cluster is installed via Helm. If people are scaling with kubectl scale then they're probably not using Helm which makes it difficult to operate. So no, deploying/upgrading without Helm is also not supported.

@juan131
Copy link
Contributor

juan131 commented Jan 3, 2025

Thanks for confirming so @pckhoi ! In that case, I'd add a warning at the "Upgrading" section alerting about what these changes imply (I mean, warning users to use exclusively Helm to scale the cluster):

We could even add it in the chart NOTES:

@pckhoi
Copy link
Contributor Author

pckhoi commented Jan 3, 2025

Sure, I will do that.

@pckhoi
Copy link
Contributor Author

pckhoi commented Jan 4, 2025

Comment on lines 20 to 22
# Return a comma separated list of <host>:<port> for each endpoint
# Globals:
# ETCD_*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Return a comma separated list of <host>:<port> for each endpoint
# Globals:
# ETCD_*
# Return a comma separated list of <host>:<port> for each endpoint
# based on "initial-cluster" flag value
# ref: https://etcd.io/docs/latest/op-guide/clustering/#static
# Globals:
# ETCD_INITIAL_CLUSTER

Comment on lines 32 to 63
########################
# Remove members that are not named in ETCD_INITIAL_CLUSTER
# Globals:
# ETCD_*
# Arguments:
# None
# Returns:
# None
#########################
remove_members() {
local -a extra_flags current expected

read -r -a extra_flags <<<"$(etcdctl_auth_flags)"
is_boolean_yes "$ETCD_ON_K8S" && extra_flags+=("--endpoints=$(endpoints_as_host_port)")
debug "Listing members"
current="$(etcdctl member list ${extra_flags[@]} --write-out simple | awk -F ", " '{print $1 "," $3}')"
if [ $? -ne 0 ]; then
debug "Error listing members, is this a new cluster?"
return 0
fi
info "Current cluster members are: $(echo "${current[@]}" | awk -F, '{print $2}' | tr -s '\n' ',' | sed 's/,$//g')"

expected="$(echo $ETCD_INITIAL_CLUSTER | sed 's/,/\n/g' | awk -F= '{print $1}')"
info "Expected cluster members are: $(IFS= echo "${expected[@]}" | tr -s '\n' ', ' | sed 's/,$//g')"

for member in $(comm -23 <(echo "${current[@]}" | awk -F, '{print $2}' | sort) <(echo "${expected[@]}" | sort)); do
info "Removing obsolete member $member"
etcdctl member remove ${extra_flags[@]} $(echo "${current[@]}" | grep "$member" | awk -F, '{print $1}')
done
}

remove_members
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this function isn't reused anywhere and it contains the actual logic of the script, I think it's simpler to directly use the code, check my suggestion:

Suggested change
########################
# Remove members that are not named in ETCD_INITIAL_CLUSTER
# Globals:
# ETCD_*
# Arguments:
# None
# Returns:
# None
#########################
remove_members() {
local -a extra_flags current expected
read -r -a extra_flags <<<"$(etcdctl_auth_flags)"
is_boolean_yes "$ETCD_ON_K8S" && extra_flags+=("--endpoints=$(endpoints_as_host_port)")
debug "Listing members"
current="$(etcdctl member list ${extra_flags[@]} --write-out simple | awk -F ", " '{print $1 "," $3}')"
if [ $? -ne 0 ]; then
debug "Error listing members, is this a new cluster?"
return 0
fi
info "Current cluster members are: $(echo "${current[@]}" | awk -F, '{print $2}' | tr -s '\n' ',' | sed 's/,$//g')"
expected="$(echo $ETCD_INITIAL_CLUSTER | sed 's/,/\n/g' | awk -F= '{print $1}')"
info "Expected cluster members are: $(IFS= echo "${expected[@]}" | tr -s '\n' ', ' | sed 's/,$//g')"
for member in $(comm -23 <(echo "${current[@]}" | awk -F, '{print $2}' | sort) <(echo "${expected[@]}" | sort)); do
info "Removing obsolete member $member"
etcdctl member remove ${extra_flags[@]} $(echo "${current[@]}" | grep "$member" | awk -F, '{print $1}')
done
}
remove_members
# Remove members that are not listed in ETCD_INITIAL_CLUSTER
# from the cluster before running Helm upgrades that potentially scale
# down the etcd cluster
read -r -a extra_flags <<<"$(etcdctl_auth_flags)"
is_boolean_yes "$ETCD_ON_K8S" && extra_flags+=("--endpoints=$(endpoints_as_host_port)")
debug "Listing members"
if ! current="$(etcdctl member list ${extra_flags[@]} --write-out simple | awk -F ", " '{print $3 ":" $1}')"; then
debug "Error listing members, is this a new cluster?"
return 0
fi
info "Current cluster members are: $(echo "$current" | awk -F: '{print $1}' | tr -s '\n' ',' | sed 's/,$//g')"
expected="$(echo $ETCD_INITIAL_CLUSTER | tr -s ',' '\n' | awk -F= '{print $1}')"
info "Expected cluster members are: $(echo "$expected" | tr -s '\n' ',' | sed 's/,$//g')"
for member in $(comm -23 <(echo "$current" | awk -F: '{print $1}' | sort) <(echo "$expected" | sort)); do
info "Removing obsolete member $member"
etcdctl member remove ${extra_flags[@]} "$(echo "$current" | grep "$member" | awk -F: '{print $2}')"
done

Comment on lines 346 to 350
if ! etcdctl endpoint status --cluster ${extra_flags[@]}; then
return 0
else
return 1
fi
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shorter:

Suggested change
if ! etcdctl endpoint status --cluster ${extra_flags[@]}; then
return 0
else
return 1
fi
! debug_execute etcdctl endpoint status --cluster ${extra_flags[@]}


Additionally, you can configure etcd using the upstream env variables [here](https://etcd.io/docs/v3.4/op-guide/configuration/)

## Notable Changes

### 3.5.17-debian-12-r3

* Drop support for non-Helm deployment. Upgrading of any kind including increasing replica count must also be done with `helm upgrade` exclusively. CD automation tools that respect Helm hooks such as ArgoCD can also be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* Drop support for non-Helm deployment. Upgrading of any kind including increasing replica count must also be done with `helm upgrade` exclusively. CD automation tools that respect Helm hooks such as ArgoCD can also be used.
* Drop support for non-Helm cluster deployment. Upgrading of any kind including increasing replica count must also be done with `helm upgrade` exclusively. CD automation tools that respect Helm hooks such as ArgoCD can also be used.

| `ETCD_NEW_MEMBERS_ENV_FILE` | File containining the etcd environment to use after adding a member. | `${ETCD_DATA_DIR}/new_member_envs` |
| `ETCD_DAEMON_USER` | etcd system user name. | `etcd` |
| `ETCD_DAEMON_GROUP` | etcd system user group. | `etcd` |
| `ETCD_INITIAL_CLUSTER_STATE` | Initial cluster state. Either "new" or "existing". | `nil` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this still used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user does not use it. Is it fine to leave it in the read-only section then?

@pckhoi
Copy link
Contributor Author

pckhoi commented Jan 11, 2025

Thanks! I have addressed all the suggestsions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
etcd in-progress verify Execute verification workflow for these changes
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[bitnami/etcd] etcd pods are unable to join existing cluster on node drain
6 participants