schemachange: speed up slow schema changes #48608
scErr = sc.exec(ctx)
if scErr == nil {
	return nil
}
switch {
probably cleaner as:
switch scErr := sc.exec(ctx); scErr {
case nil:
	return nil
...
done
pkg/sql/schema_changer.go
Outdated
}
}
return nil
return jobs.NewRetryJobError(scErr.Error())
If you're here it probably means that your context was canceled. It's reasonably likely that scErr
is nil here, which means this will panic.
well if scErr was nil, we would return inside the body of the loop
	MaxBackoff: 20 * time.Second,
	Multiplier: 1.5,
}
var scErr error
I'm not sure it makes sense to retain this across iterations of the loop.
No, but we need it after we exit the loop so that we can return the last error from the schema change to the registry.
Makes sense.
e14652c to e281a34
❌ The GitHub CI (Cockroach) build has failed on e281a348. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Touches cockroachdb#47790.

Release note (performance improvement): Before this change, a simple schema change could take 30s+. The reason was that if the schema change was not first in line in the table mutation queue, it would return a retriable error and the jobs framework would re-adopt and run it later. The problem is that the job adoption loop interval is 30s.

To repro run this for some time:

```
cockroach sql --insecure --watch 1s -e 'drop table if exists users cascade; create table users (id uuid not null, name varchar(255) not null, email varchar(255) not null, password varchar(255) not null, remember_token varchar(100) null, created_at timestamp(0) without time zone null, updated_at timestamp(0) without time zone null, deleted_at timestamp(0) without time zone null); alter table users add primary key (id); alter table users add constraint users_email_unique unique (email);'
```

Instead of returning on retriable errors, we retry with exponential backoff in the schema change code. This pattern of dealing with retriable errors in client job code is encouraged over relying on the registry, because the latter leads to slowness and to more complicated test fixtures that rely on hacking the internals of the job registry.
LGTM
bors r+
Build failed (retrying...)
bors r+
Already running a review
Build succeeded
Touches #45150.
Fixes #47607.
Touches #47790.
Release note (performance improvement):
Before this change, a simple schema change could take 30s+.
The reason was that if the schema change was not first
in line in the table mutation queue, it would return a
retriable error and the jobs framework would re-adopt and
run it later. The problem is that the job adoption loop
interval is 30s.
To repro, run the command quoted in the commit message above for some time.
Instead of returning on retriable errors, we retry with exponential
backoff in the schema change code. This pattern of dealing with
retriable errors in client job code is encouraged over relying on the
registry, because the latter leads to slowness and to more
complicated test fixtures that rely on hacking the internals of the
job registry.