S3Client.getObject does not work for binary data #105

Closed
KOConchobhair opened this issue Jun 6, 2024 · 7 comments

@KOConchobhair
Contributor

On top of #45, the S3Client cannot be used to download a binary file from S3!

As I'm sure you are aware, if you have discardResponseBodies=false, there's a Params.responseType that you can pass to a call to k6's http, like:

  const res = http.get('https://httpbin.test.k6.io/bytes/10', {
    responseType: 'binary',
  });

and that controls whether res.body is a string or an ArrayBuffer.

However, since the S3Client code does not pass a responseType in this call:

const res = await http.asyncRequest(method, signedRequest.url, null, {
    headers: signedRequest.headers,
})

it then defaults to responseType: 'text', and per the k6 docs:

  • text: k6 will return it as a string. This might not be what you want in cases of binary data as the conversion to UTF-16 will probably break it.
    This is also the default if discardResponseBodies is set to false or not set at all.

We need a way to either pass through or hard-code responseType: 'binary' for the k6 http client call in getObject.
Do you have a preference on how that might be implemented?

@joanlopez
Contributor

Hi @KOConchobhair,

Thanks for raising this topic 🙇🏻 First, let me ask you, as a library user, do you have any preference?

From my side, I don't have a clear path in mind, but some thoughts off the top of my head are:

  • In the context of S3.getObject specifically, I think it's generally better to assume that the contents are binary (use ArrayBuffer?), because no matter what kind of content is more common, it's usually easier to convert the body into a string later, if needed, than the opposite (see the sketch after this list).
    • Although I know this is probably not very feasible now, as it would represent a breaking change with respect to the current behavior.
  • That said, unless we want to go all-in with ArrayBuffer (which doesn't sound really bad, except for what I mentioned above), I guess we'll need to add a new optional argument to the S3.getObject function.
    • What I'm wondering now is whether it would make more sense as a client parameter/configuration, but at first glance it looks quite specific, although a few other service calls may also return binary data.
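For reference on the "convert later" point above, here is a minimal sketch (purely illustrative, not part of the library) of turning an ArrayBuffer body back into a string in a k6 script; it assumes single-byte ASCII/Latin-1 content, so real UTF-8 payloads would need proper decoding:

    // Illustrative helper (not part of jslib-aws): convert an ArrayBuffer
    // body into a string, byte by byte.
    function bufferToString(buf) {
        const bytes = new Uint8Array(buf)
        let out = ''
        for (let i = 0; i < bytes.length; i++) {
            out += String.fromCharCode(bytes[i])
        }
        return out
    }

    // e.g. const text = bufferToString(res.body)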

That said, let me invoke @oleiade, because as the main contributor of the project, he might have some thoughts to share.

joanlopez assigned oleiade and unassigned joanlopez Jun 7, 2024
@KOConchobhair
Contributor Author

KOConchobhair commented Jun 7, 2024

Hmm, it seems that the problem is worse than simply passing the responseType into the k6 http request, because this doesn't even work:

        const res = await http.asyncRequest(method, signedRequest.url, null, {
            headers: signedRequest.headers,
            responseType: 'binary',
        })
        console.log(res)

It always returns {} for res.body.

console.log(res) --> https://pastebin.com/raw/S3mZR9CH

The TypeScript types for RefinedResponseBody seem to want to treat the data as bytes in this case:
https://github.com/DefinitelyTyped/DefinitelyTyped/blob/c58e3459ea3d434ca1e7bf3a0e1ffcb4a95d363b/types/k6/http.d.ts#L631

But it seems the k6 http module is returning an ArrayBuffer
https://github.com/grafana/k6/blob/73aaa790da2c66d9f3bbc3fd059246fc6ad5b63d/js/modules/k6/http/request.go#L134

I don't know if this is the root cause, but it appears only responseType: 'text' is functional at the moment. I also confirmed that there's no issue with LocalStack and downloading an image file via the S3 API (using the AWS CLI), so the problem appears to be in k6. Let me know if there is another repo to escalate this issue to.
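For what it's worth, a minimal standalone script (independent of jslib-aws) could help confirm whether the k6 http module itself handles binary bodies; here is a sketch that reuses the httpbin endpoint from the k6 docs example quoted above:

    import http from 'k6/http'

    export default function () {
        const res = http.get('https://httpbin.test.k6.io/bytes/10', {
            responseType: 'binary',
        })
        // If binary handling works, res.body is a 10-byte ArrayBuffer,
        // even though console.log(res.body) renders it as {}.
        console.log(res.body instanceof ArrayBuffer)
        console.log(res.body.byteLength)
    }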

@KOConchobhair
Contributor Author

KOConchobhair commented Jun 11, 2024

Okay, I think I see what's going on here. Even with the types fixed (I've got a PR for that going here: DefinitelyTyped/DefinitelyTyped#69781), the k6 script itself is JavaScript, so it's a little confusing how to access the value of res.body. For example, res.body.byteLength works, but console.log(res.body) always converts to an empty object {}. Maybe that's expected for JavaScript/ArrayBuffer, I'm not sure. But the good news is that the k6 http module is working fine for binary files! So I will proceed with a PR to address the originally reported issue using the following recommended approach (which should also help with #45):

add a new optional argument to the S3.getObject function.
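As a side note on the console.log(res.body) behavior: printing {} is just how an ArrayBuffer gets stringified, since it has no enumerable properties; the bytes are still there. A small illustrative sketch, assuming res.body holds the binary response:

    const bytes = new Uint8Array(res.body)
    console.log(res.body.byteLength)                           // actual size in bytes
    console.log(JSON.stringify(Array.from(bytes.slice(0, 8)))) // first bytes, now visible as numbers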

@joanlopez
Contributor

Hey @KOConchobhair,

Thanks for spending time on this, and for bringing all the details! 🙇🏻 Really appreciated!

For example, res.body.byteLength works but console.log(res.body) always converts to empty object {}.

I guess this might be related to 👉🏻 grafana/k6#3677 (comment)

In any case, regarding your contributions:

Thanks! 🙏🏻

KOConchobhair added a commit to KOConchobhair/k6-jslib-aws that referenced this issue Jun 15, 2024
@KOConchobhair
Contributor Author

KOConchobhair commented Jun 15, 2024

  • Really looking forward for that PR to address the reported issue.

Okay, a PR for this issue is ready for review: #107
(I also think it mostly resolves #45 as well)

I've got a second PR for general cleanup/unit test fix ready for review as well: #106

Regarding the types, yeah, I suppose I could move the changes there. Although, while I don't claim to understand the release process, this seems to me like a bug in the types for the currently released v0.51.0 version of k6. Why would we have to wait until v0.52.0 to fix an existing issue with the types? It would seem DefinitelyTyped can handle an update prior to the next version.

@oleiade
Member

oleiade commented Jun 17, 2024

For context, I believe this is connected to #45, for which I unfortunately didn't find a satisfying solution at the time.

Regarding the type definitions, we try to keep changes aligned there with the k6 releases indeed. However, in this specific case, I believe we could probably go ahead and amend the k6 v0.51 definitions directly.

@oleiade
Member

oleiade commented Jun 17, 2024

Hey folks 👋🏻

First and foremost, thanks for raising this and proactively contributing to the resolution @KOConchobhair 🙇🏻

I've been considering our options there, and although k6 initially adopted this approach in its HTTP module, and it made sense there, I'd rather avoid adding a responseType parameter to getObject.

Although we don't support most of them, the S3 getObject action actually supports many pre-existing request parameters, and I'd rather leverage those instead.

Thus, my preference would be to add an additionalHeaders (or any better name) argument to s3.getObject, which would be added to the underlying request. Using this, we could integrate the logic in the method so that if, say, the additional headers contain an Accept header set to application/octet-stream, we then treat the response as binary and return an ArrayBuffer or bytes.

That way, we can also start supporting more of the underlying API call request parameters as we need them in the future.

What do you folks think?

For reference, I've tried implementing it on the side to test the idea, and I quite like the look and feel...

Implementation

async getObject(bucketName: string, objectKey: string, additionalHeaders = {}): Promise<S3Object> {
    // Prepare request
    const method = 'GET'

    const signedRequest = this.signature.sign(
        {
            method: method,
            endpoint: this.endpoint,
            path: encodeURI(`/${bucketName}/${objectKey}`),
            headers: {
                ...additionalHeaders,
            },
        },
        {}
    )

    let responseType: ResponseType = 'text'
    if (
        'Accept' in additionalHeaders &&
        additionalHeaders['Accept'] !== undefined &&
        additionalHeaders['Accept'] === 'application/octet-stream'
    ) {
        responseType = 'binary'
    }

    const res = await http.asyncRequest(method, signedRequest.url, null, {
        headers: signedRequest.headers,
        responseType: responseType as ResponseType,
    })
    this._handle_error('GetObject', res)

    return new S3Object(
        objectKey,
        Date.parse(res.headers['Last-Modified']),
        res.headers['ETag'],
        parseInt(res.headers['Content-Length']),

        // The X-Amz-Storage-Class header is only set if the storage class is
        // not the default 'STANDARD' one.
        (res.headers['X-Amz-Storage-Class'] ?? 'STANDARD') as StorageClass,

        res.body
    )
}

Usage

// Let's redownload it, verify it's correct, and delete it
    const downloadedFileContent = await s3.getObject(
        testBucketName,
        testFileKey,
        { 'Accept': 'application/octet-stream' })

    console.log(`downloadedFileContent.byteLength: ${downloadedFileContent.data.byteLength}`)
    console.log(downloadedFileContent.data instanceof ArrayBuffer)

    if (downloadedFileContent.data.byteLength !== testFileContent.byteLength) {
        exec.test.abort("Downloaded file size does not match the original file size")
    }
