
sql: schema changes can be very slow #47790

Closed
1 of 5 tasks
ajwerner opened this issue Apr 21, 2020 · 8 comments
Assignees
Labels
A-schema-changes C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments


ajwerner commented Apr 21, 2020

Describe the problem

We've had a number of reports of schema changes being slow lately (#38111, #47607, #47512). This is a meta-issue for a problem theorized here, and for tracking reproduction and evaluation. It also includes proposed steps.

Theory

We've exposed some schema changes to the latency of the job adoption cycle, even when the changes themselves would complete very quickly.

  • Run just sends on the channel N times for N jobs.
  • The adoption loop might pick up jobs other than the ones we wanted.
  • Especially bad: we might pick up GC jobs, which jobs now create from other jobs.
  • We might also pick up queued schema changes, which will fail immediately due to unmet dependencies.
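The theory above can be sketched as a toy model (invented names, not the actual jobs registry code): the adoption pass claims whichever unclaimed job it scans first, so a caller that nudges the loop N times for its own N jobs may end up adopting unrelated GC or queued jobs instead.

```go
// Illustrative sketch only: a claim loop that takes the first unclaimed
// job in table-scan order, regardless of which job the nudger created.
package main

import "fmt"

type job struct {
	id      int
	typ     string // e.g. "SCHEMA CHANGE" or "SCHEMA CHANGE GC"
	claimed bool
}

// adoptOne models one pass of the adoption loop: it claims the first
// unclaimed job it finds and returns its ID, or -1 if none remain.
func adoptOne(jobs []job) int {
	for i := range jobs {
		if !jobs[i].claimed {
			jobs[i].claimed = true
			return jobs[i].id
		}
	}
	return -1
}

func main() {
	// The caller created jobs 3 and 4 and nudges the loop twice, but
	// older GC jobs sit earlier in the scan and get claimed first.
	jobs := []job{
		{id: 1, typ: "SCHEMA CHANGE GC"},
		{id: 2, typ: "SCHEMA CHANGE GC"},
		{id: 3, typ: "SCHEMA CHANGE"},
		{id: 4, typ: "SCHEMA CHANGE"},
	}
	for nudge := 0; nudge < 2; nudge++ {
		fmt.Println("adopted job", adoptOne(jobs)) // claims jobs 1 and 2, not 3 and 4
	}
}
```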

Alternative Consideration

We scan the entire jobs table in the job adoption loop; as this table gets big, that scan probably becomes slow.

Proposals

  • 1. Understand the exact implications with regards to user-visible latency for schema changes

  • We should write some tests that are currently very slow, and fix them

    • N goroutines, each creating a table M times, running some queued schema changes, then dropping the table
    • The randomized syntax tests also seem slow
  • 2. Prioritize the jobs we adopt in the adoption loop

  • We really want to adopt just about anything ahead of GC jobs

  • We really don't want to adopt jobs that have unmet mutation dependencies

  • We could create a simple ranking mechanism where we prioritize

    1. Type: do GC jobs last, maybe
    2. Key: choose uniformly at random over the jobs with the same key
    3. Value: choose in value order for jobs with the same key

I'm not exactly sold on the type prioritization, but at least it's easy.
For the others, we could inject a function per type that controls its ranking.

It doesn't need to be very tightly coupled.
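The ranking above could be a plain comparator over adoptable jobs. Here is a hypothetical sketch (names invented, not the registry's real API); the issue suggests a random choice among jobs with the same key, but this sketch uses a deterministic key order to keep the example simple and testable.

```go
// Sketch of proposal 2: rank adoptable jobs so GC jobs come last, then
// break ties by key and by value. Types and field meanings are assumed
// for illustration only.
package main

import (
	"fmt"
	"sort"
)

type adoptable struct {
	typ   string // job type, e.g. "SCHEMA CHANGE" or "SCHEMA CHANGE GC"
	key   string // e.g. the descriptor the job mutates
	value int    // e.g. mutation ID; lower values must run first
}

// typeRank pushes GC jobs to the back of the adoption order.
func typeRank(typ string) int {
	if typ == "SCHEMA CHANGE GC" {
		return 1
	}
	return 0
}

// rank sorts jobs into adoption-priority order: type, then key, then value.
func rank(jobs []adoptable) {
	sort.SliceStable(jobs, func(i, j int) bool {
		if a, b := typeRank(jobs[i].typ), typeRank(jobs[j].typ); a != b {
			return a < b
		}
		if jobs[i].key != jobs[j].key {
			return jobs[i].key < jobs[j].key
		}
		return jobs[i].value < jobs[j].value
	})
}

func main() {
	jobs := []adoptable{
		{typ: "SCHEMA CHANGE GC", key: "t1", value: 0},
		{typ: "SCHEMA CHANGE", key: "t2", value: 2},
		{typ: "SCHEMA CHANGE", key: "t2", value: 1},
	}
	rank(jobs)
	for _, j := range jobs {
		fmt.Println(j.typ, j.key, j.value)
	}
}
```

As noted, a per-type ranking function could be injected later without tightly coupling the registry to job internals.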

  • 3. Start jobs immediately after finishing a job.

  • Right now we wait for the adoption loop before picking up another job. If you have a number of queued mutations, it may take a long time for the subsequent jobs to get picked up.

  • Really easy
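A minimal sketch of proposal 3 (invented names, not the real registry API): when a job finishes, nudge the adoption loop over a channel instead of waiting up to 30s for the next ticker-driven pass.

```go
// Sketch: an adoption loop that wakes on either the periodic tick or an
// explicit nudge sent when a job finishes.
package main

import (
	"fmt"
	"time"
)

type registry struct {
	nudge chan struct{}
}

// adoptionLoop reports what triggered each adoption pass.
func (r *registry) adoptionLoop(done <-chan struct{}, adopted chan<- string) {
	ticker := time.NewTicker(30 * time.Second) // the slow path
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			adopted <- "tick"
		case <-r.nudge:
			adopted <- "nudge" // fast path: a job just finished
		case <-done:
			return
		}
	}
}

// finishJob marks a job done and immediately asks for another adoption
// pass, without blocking if one is already pending.
func (r *registry) finishJob() {
	select {
	case r.nudge <- struct{}{}:
	default:
	}
}

func main() {
	r := &registry{nudge: make(chan struct{}, 1)}
	done := make(chan struct{})
	adopted := make(chan string)
	go r.adoptionLoop(done, adopted)
	r.finishJob()
	fmt.Println("adoption pass triggered by:", <-adopted) // "nudge", well before 30s
	close(done)
}
```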

  • 4. Stop creating jobs that just wait.

  • Right now we create jobs that make no sense as jobs.

  • There is no value in another node adopting the job, except for coordinating the waiting.

  • These excess jobs probably exacerbate the above problems.

  • 5. Create an index on the status column of jobs.

  • Junk piles up in the jobs table, and then the adoption query gets very slow.

  • Adding an index should be an easy migration.

Environment:

  • Applies especially to CockroachDB 20.1

ajwerner commented May 7, 2020

  1. Reproduces well enough with the set of queries in #47607 (sql: add primary key on empty table very slow)

Offline discussion with @spaskob: we can sidestep 2. above by instead having the Registry.Run() call force the registry to adopt specific jobs rather than any jobs. For 3., we can then have schema change jobs that are aware of subsequent jobs in the mutations queue launch a goroutine to call Registry.Run() explicitly, or something like that.

  1. We should do this by just not writing jobs in cases where we use sqlbase.InvalidMutationID

  2. also seems good

spaskob added a commit to spaskob/cockroach that referenced this issue May 8, 2020
Touches cockroachdb#47790.

Release note (performance improvement):
Before this a simple schema change could take 30s+.
The reason was that if the schema change is not first
in line in the table mutation queue it would return a
re-triable error and the jobs framework will re-adopt and
run it later. The problem is that the job adoption loop
is 30s.

To repro run this for some time:
```
cockroach sql --insecure --watch 1s -e 'drop table if exists users cascade; create table users (id uuid not null, name varchar(255) not null, email varchar(255) not null, password varchar(255) not null, remember_token varchar(100) null, created_at timestamp(0) without time zone null, updated_at timestamp(0) without time zone null, deleted_at timestamp(0) without time zone null); alter table users add primary key (id); alter table users add constraint users_email_unique unique (email);'
```

Instead of returning on retriable errors we retry with a exponential
backoff in the schema change code. This pattern of dealing with
retriable errors in client job code is encouraged vs relying on the
registry beacuse the latter leads to slowness and additionally to more
complicated test fixtures that rely in hacking with the internals of the
job registry,
spaskob added a commit to spaskob/cockroach that referenced this issue May 8, 2020

spaskob commented May 8, 2020

This is a simple way to reproduce the slowness:

```
cockroach sql --insecure --watch 1s -e 'drop table if exists users cascade; create table users (id uuid not null, name varchar(255) not null, email varchar(255) not null, password varchar(255) not null, remember_token varchar(100) null, created_at timestamp(0) without time zone null, updated_at timestamp(0) without time zone null, deleted_at timestamp(0) without time zone null); alter table users add primary key (id); alter table users add constraint users_email_unique unique (email);'
```


spaskob commented May 8, 2020

It turns out there's a short-term fix that seems to work pretty well. The reason for the slowness is that if the schema change is not first in line in the table mutation queue, it returns a retriable error and the jobs framework re-adopts and runs it later. The problem is that the job adoption loop interval is 30s. This is now fixed via #48608.


spaskob commented May 8, 2020

I will still work on the job improvements to prevent future regressions and simplify the jobs framework. This PR #48600 is an example of what is coming next.

craig bot pushed a commit that referenced this issue May 9, 2020
48045: sql: enable indexing and ordering on arrays of orderable and indexable types r=rohany a=rohany

Fixes #17154.
Fixes #35707.

This PR enables arrays to be ordered and indexed by
introducing an ordered key encoding for arrays.
Once this exists, the rest of the SQL infrastructure
is ready to handle indexing and ordering on arrays.

To encode an array of elements `ARRAY[a, b]`,
we create the following encoding.

Let `AM` = a marker byte for arrays and let `AT` be a terminator byte.

`enc(ARRAY[a, b]) = [AM, enc(a), enc(b), AT]`

The key is that the terminator is less than the element marker.
This allows for the "prefix matching" style comparison that
arrays support.

Release note (sql change): This PR adds support for indexing
and ordering of arrays of indexable and orderable inner types.


48608: schemachange: speed up slow schema changes r=spaskob a=spaskob

Touches #45150.
Fixes #47607.
Touches #47790.


Co-authored-by: Rohan Yadav <[email protected]>
Co-authored-by: Spas Bojanov <[email protected]>
spaskob added a commit to spaskob/cockroach that referenced this issue May 9, 2020
@jordanlewis (Member)

Points 2 and 3 were subsumed by #48621.

Points 4 and 5 are lower priority.

@kocoten1992

Hi, schema changes have improved in speed, but there is still an issue:

CPU usage is still extremely high (90%-100% of a 4-core machine) during schema changes.

@ajwerner (Contributor, Author)

> Hi, schema changes have improved in speed, but there is still an issue: CPU usage is still extremely high (90%-100% of a 4-core machine) during schema changes.

How much data is in the tables, and how many schema changes are you running? If you'd be willing to provide any scripts to reproduce, or CPU profiles, that would be very helpful.

@spaskob spaskob added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-schema-changes labels Oct 5, 2020
@spaskob spaskob closed this as completed Oct 5, 2020
aeneasr commented Mar 19, 2021

We are also experiencing extremely high CPU usage (>90%), as well as high bandwidth usage, in our Cockroach Cloud 3-node production cluster (ID 36691cbd-9927-438f-83af-cdc3c06a2b20). Is this regression truly resolved?
