
e2e for work suspension resume #5354

Merged: 1 commit into karmada-io:master on Aug 15, 2024

Conversation

@a7i (Contributor) commented Aug 12, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:

Adding more e2e tests based on the proposal
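For context, the behavior under test is toggled through the policy's suspension field. A hypothetical example based on the proposal referenced below in #5217 (field names may differ from the shipped API; verify against the released CRDs):

```yaml
# Illustrative only: "suspension.dispatching" is taken from the
# work-suspension proposal and may not match the final API exactly.
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: nginx-propagation
spec:
  resourceSelectors:
    - apiVersion: apps/v1
      kind: Deployment
      name: nginx
  placement:
    clusterAffinity:
      clusterNames:
        - member1
  suspension:
    dispatching: true   # pause dispatching Work to member clusters
```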

Which issue(s) this PR fixes:
Part of #5217

Special notes for your reviewer:

Note that 2 of the tests are NOT passing, which indicates a bug somewhere in the execution / work sync path

Does this PR introduce a user-facing change?:


@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Aug 12, 2024
@karmada-bot karmada-bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Aug 12, 2024
@codecov-commenter commented Aug 12, 2024


Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 29.36%. Comparing base (235ec91) to head (d909218).
Report is 5 commits behind head on master.

❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #5354      +/-   ##
==========================================
+ Coverage   29.01%   29.36%   +0.34%     
==========================================
  Files         632      632              
  Lines       43862    43862              
==========================================
+ Hits        12728    12878     +150     
+ Misses      30218    30050     -168     
- Partials      916      934      +18     
Flag Coverage Δ
unittests 29.36% <ø> (+0.34%) ⬆️

Flags with carried forward coverage won't be shown.


@RainbowMango (Member) commented

Just a reminder, the tests are failing:

• [FAILED] [425.219 seconds]
[Suspend] PropagationPolicy testing update resource in the control plane [It] suspends updating deployment replicas in member cluster
/home/runner/work/karmada/karmada/test/e2e/propagationpolicy_test.go:1238

  Captured StdOut/StdErr Output >>
  I0812 12:56:18.118934   52953 deployment.go:75] Waiting for deployment(karmadatest-x2q9n/deploy-7vsmw01) synced on cluster(member1)
  << Captured StdOut/StdErr Output

  Timeline >>
  STEP: Creating PropagationPolicy(karmadatest-x2q9n/deploy-7vsmw) @ 08/12/24 12:56:13.049
  STEP: Creating Deployment(karmadatest-x2q9n/deploy-7vsmw01) @ 08/12/24 12:56:13.064
  STEP: Updating Deployment(karmadatest-x2q9n/deploy-7vsmw01)'s replicas to 2 @ 08/12/24 12:56:13.073
  [FAILED] in [It] - /home/runner/work/karmada/karmada/test/e2e/framework/deployment.go:82 @ 08/12/24 13:03:18.12
  STEP: Removing Deployment(karmadatest-x2q9n/deploy-7vsmw01) @ 08/12/24 13:03:18.263
  << Timeline

  [FAILED] Timed out after 420.001s.
  Expected
      <bool>: false
  to equal
      <bool>: true
  In [It] at: /home/runner/work/karmada/karmada/test/e2e/framework/deployment.go:82 @ 08/12/24 13:03:18.12

  Full Stack Trace
    github.com/karmada-io/karmada/test/e2e/framework.WaitDeploymentPresentOnClusterFitWith({0xc000228a99, 0x7}, {0xc0009e3248, 0x11}, {0xc000f199b2, 0xe}, 0x5138690)
    	/home/runner/work/karmada/karmada/test/e2e/framework/deployment.go:82 +0x614
    github.com/karmada-io/karmada/test/e2e.init.func41.5.3()
    	/home/runner/work/karmada/karmada/test/e2e/propagationpolicy_test.go:1239 +0xdc

@a7i (Contributor, Author) commented Aug 13, 2024

> Just a reminder, the tests are failing: […]

@RainbowMango thanks for the logs. As mentioned in the PR description, there seems to be a bug in the code, because suspended dispatching is still allowing updates/deletions. I'm debugging this issue right now, but wanted to push up the e2e changes in case anyone had ideas on why it's failing.

@a7i a7i marked this pull request as draft August 13, 2024 02:14
@karmada-bot karmada-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 13, 2024
@a7i a7i force-pushed the work-suspend-e2e branch 2 times, most recently from e932996 to ad2e9a7, on August 13, 2024 22:21
@karmada-bot karmada-bot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Aug 13, 2024
@a7i a7i force-pushed the work-suspend-e2e branch 2 times, most recently from 32b7ddd to c3d539f, on August 13, 2024 23:31
@XiShanYongYe-Chang (Member) commented

> there seems to be a bug in code because suspended dispatching is still allowing updates/deletion. I'm debugging this issue right now but wanted to push up the e2e changes in case anyone had ideas on why it's failing.

Hi @a7i Is there anything I can do for you?

@a7i (Contributor, Author) commented Aug 14, 2024

Thanks @XiShanYongYe-Chang Here is the test case:

  1. Create a PropagationPolicy with resourceSelector for Deployment
  2. Create a Deployment
  3. Pause the Propagation Policy
  4. Delete the Deployment

=> Observe that Deployment is not deleted ✅

  1. Resume the Propagation Policy

=> Observe that Deployment is deleted ❌

This is because when the Deployment is deleted (step 4) in the Karmada control plane:

  • the binding controller's Reconcile gets called, which deletes all Work
  • the ResourceBinding is deleted
  • the Work is marked for deletion, but the execution controller stops when it's suspended (the suspend check comes before the deletion handling)

And since you can't update a resource that has a deletion timestamp, I cannot unpause it.

@a7i a7i force-pushed the work-suspend-e2e branch from c3d539f to d909218 on August 14, 2024 20:46
@a7i (Contributor, Author) commented Aug 14, 2024

/retest

@a7i (Contributor, Author) commented Aug 14, 2024

I think I figured it out 🤞🏼

@a7i (Contributor, Author) commented Aug 14, 2024

/retest

@a7i a7i marked this pull request as ready for review August 14, 2024 21:49
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 14, 2024
@XiShanYongYe-Chang (Member) commented Aug 15, 2024

> Thanks @XiShanYongYe-Chang Here is the test case: […]

Hi @a7i I tested this case and the Work resource ended up being left behind. This is not supposed to happen.

This is likely caused by my earlier comment. The desired behavior is probably what you started with: the pause operation should not affect deletion of the resource.

@a7i (Contributor, Author) commented Aug 15, 2024

> Thanks @XiShanYongYe-Chang Here is the test case: […]

> Hi @a7i I tested this case and the work resource ended up being left behind. This is not supposed to happen. […]

Then in that case we don't need an e2e test for deletion resume. I can fix up the intended behavior in a separate PR and remove the invalid test-case from this PR. Thoughts?

@XiShanYongYe-Chang (Member) commented

> Then in that case we don't need an e2e test for deletion resume. I can fix up the intended behavior in a separate PR and remove the invalid test-case from this PR. Thoughts?

Thanks a lot @a7i I agree with you.

I'm sorry my previous comment wasn't well thought out and caused this problem.

@a7i (Contributor, Author) commented Aug 15, 2024

All good. I most certainly appreciate your feedback and guidance on this feature! ❤️

@XiShanYongYe-Chang (Member) left a review comment

Thanks
/lgtm
/approve

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Aug 15, 2024
@karmada-bot (Collaborator) commented

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: XiShanYongYe-Chang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Aug 15, 2024
@karmada-bot karmada-bot merged commit cd99897 into karmada-io:master Aug 15, 2024
12 checks passed
@a7i a7i deleted the work-suspend-e2e branch August 15, 2024 12:46
@chaosi-zju (Member) commented

Hi @a7i, I found an occasional e2e failure: https://github.com/karmada-io/karmada/actions/runs/10553025531/job/29232824137?pr=5423

• [FAILED] [300.256 seconds]
[Suspend] clusterPropagation testing suspend the ClusterPropagationPolicy dispatching [It] suspends Work
/home/runner/work/karmada/karmada/test/e2e/clusterpropagationpolicy_test.go:1077

  Timeline >>
  STEP: Creating ClusterPropagationPolicy(clusterrole-tmxsv) @ 08/26/24 03:46:47.688
  STEP: Creating ClusterRole(system:test-clusterrole-tmxsv) @ 08/26/24 03:46:47.701
  STEP: Updating ClusterPropagationPolicy(clusterrole-tmxsv) spec @ 08/26/24 03:46:47.705
  [FAILED] in [It] - /home/runner/work/karmada/karmada/test/e2e/clusterpropagationpolicy_test.go:1085 @ 08/26/24 03:51:47.742
  STEP: Removing ClusterPropagationPolicy(clusterrole-tmxsv) @ 08/26/24 03:51:47.929
  STEP: Remove ClusterRole(system:test-clusterrole-tmxsv) @ 08/26/24 03:51:47.936
  << Timeline

  [FAILED] Timed out after 300.000s.
  Expected
      <bool>: false
  to equal
      <bool>: true
  In [It] at: /home/runner/work/karmada/karmada/test/e2e/clusterpropagationpolicy_test.go:1085 @ 08/26/24 03:51:47.742

  Full Stack Trace
    github.com/karmada-io/karmada/test/e2e.init.func7.4.2()
    	/home/runner/work/karmada/karmada/test/e2e/clusterpropagationpolicy_test.go:1085 +0x288
ClusterResourceBinding
{
	"kind": "ClusterResourceBinding",
	"apiVersion": "work.karmada.io/v1alpha2",
	"metadata": {
		"name": "system.test-clusterrole-tmxsv-clusterrole",
		"uid": "8b0c2f62-7b21-43cc-95fa-6656c01fa6e1",
		"resourceVersion": "21296",
		"generation": 4,
		"creationTimestamp": "2024-08-26T03:46:47Z",
		"deletionTimestamp": "2024-08-26T03:51:47Z",
		"deletionGracePeriodSeconds": 0,
		"labels": {
			"clusterresourcebinding.karmada.io/permanent-id": "4ede7d27-97b9-475b-abf0-3b87ee46a16c"
		},
		"annotations": {
			"policy.karmada.io/applied-placement": "{\"clusterAffinity\":{\"clusterNames\":[\"member1\"]},\"clusterTolerations\":[{\"key\":\"cluster.karmada.io/not-ready\",\"operator\":\"Exists\",\"effect\":\"NoExecute\",\"tolerationSeconds\":30},{\"key\":\"cluster.karmada.io/unreachable\",\"operator\":\"Exists\",\"effect\":\"NoExecute\",\"tolerationSeconds\":30}]}"
		},
		"ownerReferences": [{
			"apiVersion": "rbac.authorization.k8s.io/v1",
			"kind": "ClusterRole",
			"name": "system:test-clusterrole-tmxsv",
			"uid": "6ab3bd00-f68a-4ff5-9ac2-45a21b81eea1",
			"controller": true,
			"blockOwnerDeletion": true
		}],
		"finalizers": ["karmada.io/cluster-resource-binding-controller"]
	},
	"spec": {
		"resource": {
			"apiVersion": "rbac.authorization.k8s.io/v1",
			"kind": "ClusterRole",
			"name": "system:test-clusterrole-tmxsv",
			"uid": "6ab3bd00-f68a-4ff5-9ac2-45a21b81eea1",
			"resourceVersion": "11592"
		},
		"clusters": [{
			"name": "member1"
		}],
		"placement": {
			"clusterAffinity": {
				"clusterNames": ["member1"]
			},
			"clusterTolerations": [{
				"key": "cluster.karmada.io/not-ready",
				"operator": "Exists",
				"effect": "NoExecute",
				"tolerationSeconds": 30
			}, {
				"key": "cluster.karmada.io/unreachable",
				"operator": "Exists",
				"effect": "NoExecute",
				"tolerationSeconds": 30
			}]
		},
		"schedulerName": "default-scheduler",
		"conflictResolution": "Abort"
	},
	"status": {
		"schedulerObservedGeneration": 3,
		"lastScheduledTime": "2024-08-26T03:50:16Z",
		"conditions": [{
			"type": "Scheduled",
			"status": "True",
			"lastTransitionTime": "2024-08-26T03:46:47Z",
			"reason": "Success",
			"message": "Binding has been scheduled successfully."
		}, {
			"type": "FullyApplied",
			"status": "True",
			"lastTransitionTime": "2024-08-26T03:46:47Z",
			"reason": "FullyAppliedSuccess",
			"message": "All works have been successfully applied"
		}],
		"aggregatedStatus": [{
			"clusterName": "member1",
			"applied": true,
			"health": "Unknown"
		}]
	}
}

It seems like the ClusterResourceBinding is not as expected and the Work doesn't exist, but I can't find the root cause. Do you have any ideas?

@a7i (Contributor, Author) commented Aug 27, 2024

> Hi @a7i, I found an occasional e2e failure: https://github.com/karmada-io/karmada/actions/runs/10553025531/job/29232824137?pr=5423 […]
>
> It seem like the crb is not as expected and the work doesn't exist, but I can't find the root cause, do you have some inspire?

I will take a look today. Thank you for the logs!

@chaosi-zju (Member) commented

@a7i (Contributor, Author) commented Aug 28, 2024

Unfortunately I'm having a hard time reproducing locally:

 [Suspend] clusterPropagation testing suspend the ClusterPropagationPolicy dispatching suspends Work
/Users/amiralavi/workspace/a7i/karmada/test/e2e/clusterpropagationpolicy_test.go:1077

  Timeline >>
  STEP: Creating ClusterPropagationPolicy(clusterrole-q5qkh) @ 08/27/24 23:08:17.531
  STEP: Creating ClusterRole(system:test-clusterrole-q5qkh01) @ 08/27/24 23:08:17.543
  STEP: Updating ClusterPropagationPolicy(clusterrole-q5qkh) spec @ 08/27/24 23:08:17.546
  STEP: Removing ClusterPropagationPolicy(clusterrole-q5qkh) @ 08/27/24 23:08:22.566
  STEP: Remove ClusterRole(system:test-clusterrole-q5qkh01) @ 08/27/24 23:08:22.574
  << Timeline
------------------------------
[SynchronizedAfterSuite] PASSED [0.008 seconds]

@a7i (Contributor, Author) commented Aug 28, 2024

Giving this a try: #5440

@chaosi-zju (Member) commented

> Unfortunately I'm having a hard time reproducing locally: […]

I couldn't reproduce it locally either. Is it possible that different e2e cases affect each other?

@a7i (Contributor, Author) commented Aug 28, 2024

> I didn't reproduce it locally too, is it possible that different e2e cases affect each other?

Yes, that's my thinking as well. ClusterRole is cluster-scoped, so we cannot isolate it to a single namespace. We can conclude this because the same test for PropagationPolicy is passing.
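Since cluster-scoped fixtures can only be isolated by name rather than by namespace, the usual mitigation is a per-spec random suffix, similar to the `clusterrole-tmxsv` names visible in the logs above. A minimal stdlib-only sketch (the helper name is hypothetical, not the karmada e2e framework's actual API):

```go
package main

import (
	"fmt"
	"math/rand"
)

const letters = "abcdefghijklmnopqrstuvwxyz"

// uniqueName appends a random 5-character suffix to a base name so two
// specs creating the same cluster-scoped kind (e.g. ClusterRole) get
// distinct objects, even though they share one cluster-wide name space.
func uniqueName(base string) string {
	suffix := make([]byte, 5)
	for i := range suffix {
		suffix[i] = letters[rand.Intn(len(letters))]
	}
	return fmt.Sprintf("%s-%s", base, suffix)
}

func main() {
	// e.g. "clusterrole-qkzma"; the suffix differs per call.
	fmt.Println(uniqueName("clusterrole"))
}
```

Random names prevent collisions between specs, but they do not prevent one spec's cluster-wide side effects (controllers, webhooks, shared cluster state) from influencing another running concurrently, which may be what happened here.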

@RainbowMango RainbowMango added this to the v1.11 milestone Aug 31, 2024