Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix panic when servers return a wrapped error with status OK #6374

Merged
merged 1 commit into from
Jul 5, 2023

Conversation

atollena
Copy link
Collaborator

@atollena atollena commented Jun 15, 2023

When wrapping an error that has a gRPC status, we reuse the underlying status message field and put the error message in there. If the returned gRPC status is nil, then there is a nil dereference of the status Message field, causing panic.

This change fixes this behavior by returning an Unknown status code and the error message whenever the wrapped error
has status OK.

RELEASE NOTES:

  • status: status.FromError now returns an error with codes.Unknown when the error implements the GRPCStatus() method, and calling GRPCStatus() returns nil.

@atollena atollena force-pushed the issues-6373 branch 2 times, most recently from 2fe624f to 667903b Compare June 15, 2023 14:15
@arvindbr8 arvindbr8 self-assigned this Jun 15, 2023
@atollena
Copy link
Collaborator Author

@dfawley if you're going to cherry-pick #6354 into a new version of 1.56, I'd really appreciate if it also includes this fix. FWIW I think it would also make sense to back port this patch (or another solution for this bug) into 1.55, if it isn't too much hassle.

@atollena
Copy link
Collaborator Author

@zasweq I see that this change missed the 1.56.1 release :( Any chance we can consider this PR and #6373 for any upcoming release?

@easwars
Copy link
Contributor

easwars commented Jun 22, 2023

I'm in agreement with the spirit of this change, i.e grpc should better protect itself from misbehaving wrapped errors that implement the GRPCStatus method. I'm also wondering if the semantics of the GRPCStatus method should be better documented.

Also, @dfawley is OOO and is back mid next week. I would want to get his thoughts too on this one.

And backporting it to as many releases as necessary shouldn't be an issue.

@atollena : Is your production service still affected? How badly/frequently? Or do you have a workaround for now? Thanks.

@easwars easwars added this to the 1.57 Release milestone Jun 22, 2023
@atollena
Copy link
Collaborator Author

I'm also wondering if the semantics of the GRPCStatus method should be better documented.

I think the only trace of documentation for the GRPCStatus() method is the godoc comment at the top of this method (status.FromError). I changed my commit to add a specific bullet point for this case in this commit, since it was giving conflicting information. Is there some other clarification you think are unclear?

@atollena : Is your production service still affected? How badly/frequently? Or do you have a workaround for now? Thanks.

We mitigated that problem by reverting to v1.54.1, so we are only affected in that we are stuck on an older version. The problem is infrequent but exists. We fixed the occurrences that we found, some of them in third party libraries: googleapis/google-cloud-go#8127 (in case of errors retrying to GCS, I think). It's not trivial to audit the codebase to be 100% sure if there is no other occurence.

@arvindbr8 arvindbr8 removed their assignment Jun 23, 2023
Copy link
Contributor

@easwars easwars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM module some minor nits.

status/status.go Outdated Show resolved Hide resolved
status/status.go Outdated Show resolved Hide resolved
status/status_test.go Show resolved Hide resolved
When wrapping an error that has a gRPC status, we reuse the underlying status message field and put
the error message in there. If the returned gRPC status is nil, then there is a nil dereference of
the status Message field.

This change fixes this behavior by returning Unknown status and the error message when the wrapped
error has status OK.
@atollena
Copy link
Collaborator Author

@easwars I took your suggestion thank you. I also updated the RELEASE NOTES string in the PR description.

I saw that you approved but in the past, I've let you take care of merging. Should I merge or should I let you do it? Thanks!

@dfawley
Copy link
Member

dfawley commented Jun 28, 2023

I'm confused about this.

This change fixes this behavior by returning an Unknown status code and the error message whenever the wrapped error has status OK.

How is this not an error in usage? Why/when are you wrapping a "not-an-error" status error (err==nil; code==OK) inside another error?

@atollena
Copy link
Collaborator Author

atollena commented Jun 29, 2023

How is this not an error in usage? Why/when are you wrapping a "not-an-error" status error (err==nil; code==OK) inside another error?

There was an occurence in the google cloud sdk (it's been fixed in the latest release). This made services that use Google Cloud Storage crash on retry after the grpc upgrade, while they were working fine with 1.54.1. Even though, as you point out, this is an error in usage, I do think this change in behaviour is dangerous.

Do you think we should just fix all call sites as we discover them? (note that this can be difficult when it involves third party libraries, like was the case for us -- but we could hope that this never happens again).

@dfawley
Copy link
Member

dfawley commented Jun 30, 2023

Thanks for the explanation. I agree we shouldn't panic when this happens even if it's an error in usage, and so this change is probably good as-is. I just wanted to make sure you weren't actually doing this kind of thing intentionally.

Is the fix in google-cloud-go the right change? I was thinking they could just remove the GRPCStatus() method directly now that grpc itself supports unwrapping in FromError.

@atollena
Copy link
Collaborator Author

atollena commented Jul 3, 2023

Is the fix in google-cloud-go the right change? I was thinking they could just remove the GRPCStatus() method directly now that grpc itself supports unwrapping in FromError.

You're right (and looking at the code again it took me a moment to understand why GRPCStatus() calling status.FromError() doesn't result in infinite recursion). Opened googleapis/google-cloud-go#8196 to remove GRPCStatus entirely.

Could you merge this PR please, now that it got 2 approvals? Or do you want any other changes applied? Thanks!

@dfawley dfawley merged commit df3e021 into grpc:master Jul 5, 2023
1 check passed
@dfawley
Copy link
Member

dfawley commented Jul 5, 2023

IIUC this needs to be backported to the v1.56.x branch too?

@atollena
Copy link
Collaborator Author

atollena commented Jul 6, 2023

IIUC this needs to be backported to the v1.56.x branch too?

It would be very useful to us, so if it isn't too much to ask, yes please.

@atollena atollena deleted the issues-6373 branch July 6, 2023 09:38
@zasweq
Copy link
Contributor

zasweq commented Jul 6, 2023

I'll backport this to 1.55 and 1.56 today :)

zasweq added a commit that referenced this pull request Jul 6, 2023
zasweq added a commit that referenced this pull request Jul 6, 2023
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jan 3, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants