@aws-cdk/aws-redshift : Table; new tableColumns ordering not preserved on UPDATE #20495
Thank you for the very detailed bug description @Romanmc72. Would you be willing to contribute and help us with this yourself? Check out our contributing guide if you're interested 🙂
Happy to help! I will take a look there and see where I can get started.
Let me know how the PR looks; happy to change it if need be.
The PR raised was closed a year ago, as this bug was deemed not to require a fix given the best practice of always specifying the order of columns in any SQL one writes. This is a fair callout, and as such this is not really a major bug that requires a fix. I do want to make some additional callouts on the above use of
Describe the bug
When using the `Table` construct in the `@aws-cdk/aws-redshift` module, if you have a table that already exists and you then modify that table by adding more than one column, the ordering of those columns is not guaranteed. Table creation works correctly because the entire operation runs as a single `CREATE TABLE` SQL statement; however, adding more than one new column at a time can create the columns out of order in an erratic and unpredictable manner. The reason is that new columns are added via parallel asynchronous calls of `ALTER TABLE <tableName> ADD COLUMN <columnName> <dataType>` SQL statements. This can be seen here, where the `ALTER TABLE` statements are added to an array, and then here, where those statements are executed asynchronously against the Redshift Data API. Whichever request arrives at the API first wins, instead of the statements executing in the order the developer specified in their CDK code.

Expected Behavior
As a user of the TypeScript version of this library: arrays in TypeScript are ordered (as they are in most languages), so when I have an ordered array of column definitions, my expectation is that that order is preserved when applied to my table, and that the columns in my CDK code match the columns in my Redshift table in both data type and ordering.
Current Behavior
Currently, if I have a table and add more than one new column to that table at a time, the order of those columns will randomly fail to match the order in which I placed them in my `tableColumns` array, even between deployments to different isolated environments that used the same templates.

Reproduction Steps
If you only add 2 or 3 columns, there is a chance that they just happen to resolve in the correct order. I added 10 here and am able to reproduce this bug every time I run it (ran 4 times). The code below worked for me as-is.
npm run build && cdk deploy RedshiftStack
4. Add 10 new columns to that original table.
npm run build && cdk deploy RedshiftStack
You would expect to query the Redshift database and see the columns in the numeric order matching what is laid out above in the `tableColumns` array; however, you will get something more like this:

Feel free to edit the stack, remove the new columns, then re-add them, and get a whole new order! I did exactly that, and on the second attempt got the following column order:
Possible Solution
There are 2 ways to solve this as I see it.

The first involves the calls to `executeStatement()` performed here. So instead of:
it'd be more like:
This will take a while to execute and require many round trips to the API, but it would "work".
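The before/after snippets referenced above are not preserved in this copy of the issue; as a rough sketch of the sequential approach (function names and simulated latency are hypothetical, not the actual handler code), the idea is to `await` each statement in a loop so the next `ALTER TABLE` only starts after the previous one finishes:

```typescript
// Hypothetical stand-in for the redshift-data ExecuteStatement call.
// The delay simulates variable network latency; `executed` records the
// order in which statements actually complete.
const executed: string[] = [];

async function executeStatement(sql: string, delayMs: number): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, delayMs));
  executed.push(sql);
}

// Option 1: run statements strictly in order. Even if an earlier statement
// is slower, later ones cannot overtake it, because each `await` completes
// before the next call is issued.
async function executeStatementsInOrder(
  statements: Array<{ sql: string; delayMs: number }>,
): Promise<string[]> {
  for (const { sql, delayMs } of statements) {
    await executeStatement(sql, delayMs);
  }
  return executed;
}
```

With `Promise.all(statements.map(...))`, a slow first statement would finish last; with the loop above, completion order always matches submission order, at the cost of one API round trip per statement.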
The second option: currently, the `Table` construct only leverages the ExecuteStatement endpoint on the redshift-data API. The shape of the output and response from the batch endpoint matches that of the regular execute endpoint, and the subsequent call made to waitForStatementComplete could remain unchanged as well. The shape of the DescribeStatement response changes slightly when describing a batch execute statement, but not in a way that would break the API response to this function.
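To make the `Sql` → `Sqls` swap concrete, here is a hedged sketch of building a batch request from an ordered list of statements. The input shape follows the redshift-data BatchExecuteStatement API; the cluster/database/secret prop names below are illustrative assumptions, not the construct's actual property names:

```typescript
// Input shape for the redshift-data BatchExecuteStatement endpoint: identical
// to ExecuteStatement except that Sql: string becomes Sqls: string[].
interface BatchExecuteStatementInput {
  ClusterIdentifier: string;
  Database: string;
  SecretArn: string;
  Sqls: string[];
}

// Hypothetical cluster props; field names are placeholders for whatever the
// handler actually receives.
interface ClusterProps {
  clusterName: string;
  databaseName: string;
  adminSecretArn: string;
}

function buildBatchInput(props: ClusterProps, sqls: string[]): BatchExecuteStatementInput {
  return {
    ClusterIdentifier: props.clusterName,
    Database: props.databaseName,
    SecretArn: props.adminSecretArn,
    // Statements in this array are executed in order by the service, which is
    // exactly the ordering guarantee the Table construct needs.
    Sqls: sqls,
  };
}
```

The handler would then send this input to the batch endpoint and poll DescribeStatement on the returned statement `Id`, just as it does today for single statements.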
The only difference would be the existence of the `SubStatements` attribute on a batch statement. In effect, you could just replace the use of the `executeStatement` endpoint entirely with the `batchExecuteStatement` one, change one of the input properties from `Sql: string` to `Sqls: string[]`, and voilà, done. Of course, you would need to update everywhere that calls `executeStatement` to pass in an array of strings rather than a single string, but this would be pretty trivial to fix. BatchExecuteStatement supports sending 1 or more SQL statements, so it would be compatible with the rest of the single-statement redshift-data API calls as well.

Additional Information/Context
Happy to raise the Pull Request here for this one myself. Would love to know your thoughts on switching from the ExecuteStatement command to the BatchExecuteStatement command as that seems like the best way forward here in my opinion.
Here is my `package.json`, if that helps. I did have `eslint` set up, as you can see above, but that is not required for reproducing this bug.

CDK CLI Version
2.1.0 (build f4f18b1)
Framework Version
No response
Node.js Version
v16.15.0
OS
macOS 12.3.1 (21E258)
Language
Typescript
Language Version
TypeScript (3.9.10)
Other information
`await Promise.all(array.map())`. (There are many such articles out there if you do some light googling.) Here is a small Node.js program you can run to see this mess in action for yourself:
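The original snippet did not survive in this copy of the issue; a minimal stand-in along these lines (with fixed, staggered delays instead of real network latency, so the out-of-order completion is reproducible) demonstrates the same effect:

```typescript
// Simulate N ALTER TABLE calls fired concurrently, as the custom resource
// handler does with Promise.all(array.map(...)). Each "statement" records
// itself when it completes; earlier columns get *longer* delays here so the
// completion order is visibly reversed relative to submission order.
async function fakeAlterTable(column: string, latencyMs: number, log: string[]): Promise<void> {
  await new Promise((resolve) => setTimeout(resolve, latencyMs));
  log.push(column);
}

async function main(): Promise<string[]> {
  const columns = ["col1", "col2", "col3", "col4", "col5"];
  const log: string[] = [];
  // All five calls start at once; nothing enforces completion order.
  await Promise.all(
    columns.map((col, i) => fakeAlterTable(col, (columns.length - i) * 20, log)),
  );
  console.log(`submitted: ${columns.join(", ")}`);
  console.log(`completed: ${log.join(", ")}`);
  return log;
}

main();
```

With real network latency the completion order varies run to run, which is exactly the nondeterministic column ordering described above.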