-
Notifications
You must be signed in to change notification settings - Fork 47.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Flight] Optimize Large Strings by Not Escaping Them #26932
Conversation
@@ -66,6 +66,10 @@ export function clonePrecomputedChunk( | |||
return chunk; | |||
} | |||
|
|||
export function byteLengthOfChunk(chunk: Chunk | PrecomputedChunk): number { | |||
throw new Error('Not implemented.'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Jarred-Sumner Not sure what you want to do here. Either you can expose a JSC fast path for getting the byteLength of a string in UTF-8 or we can switch to always encoding strings eagerly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Buffer.byteLength(string, "utf8") is fast in Bun. Uses SIMD, handles latin1 vs utf16, etc
@@ -960,7 +975,12 @@ function resolveModelToJSON( | |||
return serializeDateFromDateJSON(value); | |||
} | |||
} | |||
|
|||
if (value.length > 1024) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took some large looking strings that looked clowny and measured them but this is heuristic is pretty much taken out of thin air.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wouldn't >= 1024
be a rounder number?
d686497
to
df81cb6
Compare
This puts a lot of burden on a runtime to support UTF-8 encoding exposed in the runtime. E.g. the FB one doesn't atm. However, I'm not aware of any current supported environment that wouldn't be able to. The follow up to support binary data also puts more requirements on the client (and SSR). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
did you know that TextEncoder doesn't let you specify an encoding? I learned this last week when thinking about your question
_rowID: number, // parts of a row ID parsed so far | ||
_rowTag: number, // 0 indicates that we're currently parsing the row ID | ||
_rowLength: number, // remaining bytes in the row. 0 indicates that we're looking for a newline. | ||
_buffer: Array<string | Uint8Array>, // chunks received so far as part of this row |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when is this a string?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It’s not anymore but really ideally it should be able to accept string chunks at least it’s known not to be binary content but then it gets tricky to count the bytes again, especially for the last chunk, so maybe we don’t support it.
When it’s coming through a html encoded stream we’d encode the chunking in html so there might be that the last chunk resolved is a string but not the buffer.
} | ||
if (lastIdx > -1) { | ||
// We found the last chunk of the row | ||
const lastChunk = chunk.slice(i, lastIdx); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would DataView be more efficient here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doubt it. It generally isn’t but also we don’t really control the source and this will also be used for binary streams where the protocol is uint8arrays too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh you mean to avoid the copy? I actually don’t know why I did a copy.
In general we expect these to be immutable representations so I could probably do a new array from the same underlying buffer.
It’ll need to be copied for serialized typed arrays in some cases anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, in JSC (don't know about V8), calling .subarray or .buffer on an ArrayBufferView will actually clone it on the first access because by default the lifetime of the buffer is the lifetime of the ArrayBufferView ("WastefulTypedArray" is the name for this in the code iirc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's unfortunate. Given that TextDecoder requires a view, is there any workaround that doesn't?
@@ -960,7 +975,12 @@ function resolveModelToJSON( | |||
return serializeDateFromDateJSON(value); | |||
} | |||
} | |||
|
|||
if (value.length > 1024) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: wouldn't >= 1024
be a rounder number?
Yea, that was the part that made any clever hacks with encoding in HTML impossible. I guess it makes sense that browsers don't really have to know how to encode other encodings - only decoding. |
…ter (#26945) Follow up to #26932 For regular rows, we're increasing the index by one to skip past the last trailing newline character which acts a boundary. For length encoded rows we shouldn't skip an extra byte because it'll leave us missing one. This only accidentally worked because this was also the end of the current chunk which tests don't account for since we're just passing through the chunks. So I added some noise by splitting and joining the chunks so that this gets tested.
…26954) This uses the same mechanism as [large strings](#26932) to encode chunks of length based binary data in the RSC payload behind a flag. I introduce a new BinaryChunk type that's specific to each stream and ways to convert into it. That's because we sometimes need all chunks to be Uint8Array for the output, even if the source is another array buffer view, and sometimes we need to clone it before transferring. Each type of typed array is its own row tag. This lets us ensure that the instance is directly in the right format in the cached entry instead of creating a wrapper at each reference. Ideally this is also how Map/Set should work but those are lazy which complicates that approach a bit. We assume both server and client use little-endian for now. If we want to support other modes, we'd convert it to/from little-endian so that the transfer protocol is always little-endian. That way the common clients can be the fastest possible. So far this only implements Server to Client. Still need to implement Client to Server for parity. NOTE: This is the first time we make RSC effectively a binary format. This is not compatible with existing SSR techniques which serialize the stream as unicode in the HTML. To be compatible, those implementations would have to use base64 or something like that. Which is what we'll do when we move this technique to be built-in to Fizz.
This PR updates the vendored react dependencies using `pnpm sync-react` ### React upstream changes - facebook/react#27028 - facebook/react#27027 - facebook/react#27019 - facebook/react#26954 - facebook/react#26987 - facebook/react#26985 - facebook/react#26933 - facebook/react#26625 - facebook/react#27011 - facebook/react#27008 - facebook/react#26997 - facebook/react#26989 - facebook/react#26955 - facebook/react#26963 - facebook/react#26983 - facebook/react#26914 - facebook/react#26951 - facebook/react#26977 - facebook/react#26958 - facebook/react#26940 - facebook/react#26939 - facebook/react#26887 - facebook/react#26947 - facebook/react#26945 - facebook/react#26942 - facebook/react#26938 - facebook/react#26844 - facebook/react#25510 - facebook/react#26932 - facebook/react#26896 - facebook/react#26913 - facebook/react#26888 - facebook/react#26827 - facebook/react#26889 - facebook/react#26877 - facebook/react#26873 - facebook/react#26880 - facebook/react#26842 - facebook/react#26858 - facebook/react#26754 - facebook/react#26753 - facebook/react#26881 ### Related Closes #49409 (by facebook/react#26977) fix NEXT-1189 Co-authored-by: Shu Ding <[email protected]>
Fixes #49409 ### React upstream changes - facebook/react#27045 - facebook/react#27051 - facebook/react#27032 - facebook/react#27031 - facebook/react#27029 - facebook/react#27028 - facebook/react#27027 - facebook/react#27019 - facebook/react#26954 - facebook/react#26987 - facebook/react#26985 - facebook/react#26933 - facebook/react#26625 - facebook/react#27011 - facebook/react#27008 - facebook/react#26997 - facebook/react#26989 - facebook/react#26955 - facebook/react#26963 - facebook/react#26983 - facebook/react#26914 - facebook/react#26951 - facebook/react#26977 - facebook/react#26958 - facebook/react#26940 - facebook/react#26939 - facebook/react#26887 - facebook/react#26947 - facebook/react#26945 - facebook/react#26942 - facebook/react#26938 - facebook/react#26844 - facebook/react#25510 - facebook/react#26932 - facebook/react#26896 - facebook/react#26913 - facebook/react#26888 - facebook/react#26827 - facebook/react#26889 - facebook/react#26877 - facebook/react#26873 - facebook/react#26880 - facebook/react#26842 - facebook/react#26858 - facebook/react#26754 - facebook/react#26753 - facebook/react#26881 --------- Co-authored-by: Jiachi Liu <[email protected]>
This PR updates the vendored react dependencies using `pnpm sync-react` ### React upstream changes - facebook/react#27028 - facebook/react#27027 - facebook/react#27019 - facebook/react#26954 - facebook/react#26987 - facebook/react#26985 - facebook/react#26933 - facebook/react#26625 - facebook/react#27011 - facebook/react#27008 - facebook/react#26997 - facebook/react#26989 - facebook/react#26955 - facebook/react#26963 - facebook/react#26983 - facebook/react#26914 - facebook/react#26951 - facebook/react#26977 - facebook/react#26958 - facebook/react#26940 - facebook/react#26939 - facebook/react#26887 - facebook/react#26947 - facebook/react#26945 - facebook/react#26942 - facebook/react#26938 - facebook/react#26844 - facebook/react#25510 - facebook/react#26932 - facebook/react#26896 - facebook/react#26913 - facebook/react#26888 - facebook/react#26827 - facebook/react#26889 - facebook/react#26877 - facebook/react#26873 - facebook/react#26880 - facebook/react#26842 - facebook/react#26858 - facebook/react#26754 - facebook/react#26753 - facebook/react#26881 ### Related Closes #49409 (by facebook/react#26977) fix NEXT-1189 Co-authored-by: Shu Ding <[email protected]>
This introduces a Text row (T) which is essentially a string blob and refactors the parsing to now happen at the binary level. ``` RowID + ":" + "T" + ByteLengthInHex + "," + Text ``` Today, we encode all row data in JSON, which conveniently never has newline characters and so we use newline as the line terminator. We can't do that if we pass arbitrary unicode without escaping it. Instead, we pass the byte length (in hexadecimal) in the leading header for this row tag followed by a comma. We could be clever and use fixed or variable-length binary integers for the row id and length but it's not worth the more difficult debuggability so we keep these human readable in text. Before this PR, we used to decode the binary stream into UTF-8 strings before parsing them. This is inefficient because sometimes the slices end up having to be copied so it's better to decode it directly into the format. The follow up to this is also to add support for binary data and then we can't assume the entire payload is UTF-8 anyway. So this refactors the parser to parse the rows in binary and then decode the result into UTF-8. It does add some overhead to decoding on a per row basis though. Since we do this, we need to encode the byte length that we want decode - not the string length. Therefore, this requires clients to receive binary data and why I had to delete the string option. It also means that I had to add a way to get the byteLength from a chunk since they're not always binary. For Web streams it's easy since they're always typed arrays. For Node streams it's trickier so we use the byteLength helper which may not be very efficient. Might be worth eagerly encoding them to UTF8 - perhaps only for this case.
…ter (facebook#26945) Follow up to facebook#26932 For regular rows, we're increasing the index by one to skip past the last trailing newline character which acts a boundary. For length encoded rows we shouldn't skip an extra byte because it'll leave us missing one. This only accidentally worked because this was also the end of the current chunk which tests don't account for since we're just passing through the chunks. So I added some noise by splitting and joining the chunks so that this gets tested.
…acebook#26954) This uses the same mechanism as [large strings](facebook#26932) to encode chunks of length based binary data in the RSC payload behind a flag. I introduce a new BinaryChunk type that's specific to each stream and ways to convert into it. That's because we sometimes need all chunks to be Uint8Array for the output, even if the source is another array buffer view, and sometimes we need to clone it before transferring. Each type of typed array is its own row tag. This lets us ensure that the instance is directly in the right format in the cached entry instead of creating a wrapper at each reference. Ideally this is also how Map/Set should work but those are lazy which complicates that approach a bit. We assume both server and client use little-endian for now. If we want to support other modes, we'd convert it to/from little-endian so that the transfer protocol is always little-endian. That way the common clients can be the fastest possible. So far this only implements Server to Client. Still need to implement Client to Server for parity. NOTE: This is the first time we make RSC effectively a binary format. This is not compatible with existing SSR techniques which serialize the stream as unicode in the HTML. To be compatible, those implementations would have to use base64 or something like that. Which is what we'll do when we move this technique to be built-in to Fizz.
This introduces a Text row (T) which is essentially a string blob and refactors the parsing to now happen at the binary level. ``` RowID + ":" + "T" + ByteLengthInHex + "," + Text ``` Today, we encode all row data in JSON, which conveniently never has newline characters and so we use newline as the line terminator. We can't do that if we pass arbitrary unicode without escaping it. Instead, we pass the byte length (in hexadecimal) in the leading header for this row tag followed by a comma. We could be clever and use fixed or variable-length binary integers for the row id and length but it's not worth the more difficult debuggability so we keep these human readable in text. Before this PR, we used to decode the binary stream into UTF-8 strings before parsing them. This is inefficient because sometimes the slices end up having to be copied so it's better to decode it directly into the format. The follow up to this is also to add support for binary data and then we can't assume the entire payload is UTF-8 anyway. So this refactors the parser to parse the rows in binary and then decode the result into UTF-8. It does add some overhead to decoding on a per row basis though. Since we do this, we need to encode the byte length that we want decode - not the string length. Therefore, this requires clients to receive binary data and why I had to delete the string option. It also means that I had to add a way to get the byteLength from a chunk since they're not always binary. For Web streams it's easy since they're always typed arrays. For Node streams it's trickier so we use the byteLength helper which may not be very efficient. Might be worth eagerly encoding them to UTF8 - perhaps only for this case. DiffTrain build for commit db50164.
…ter (#26945) Follow up to #26932 For regular rows, we're increasing the index by one to skip past the last trailing newline character which acts a boundary. For length encoded rows we shouldn't skip an extra byte because it'll leave us missing one. This only accidentally worked because this was also the end of the current chunk which tests don't account for since we're just passing through the chunks. So I added some noise by splitting and joining the chunks so that this gets tested. DiffTrain build for commit a1723e1.
This introduces a Text row (T) which is essentially a string blob and refactors the parsing to now happen at the binary level.
Today, we encode all row data in JSON, which conveniently never has newline characters and so we use newline as the line terminator. We can't do that if we pass arbitrary unicode without escaping it. Instead, we pass the byte length (in hexadecimal) in the leading header for this row tag followed by a comma.
We could be clever and use fixed or variable-length binary integers for the row id and length but it's not worth the more difficult debuggability so we keep these human readable in text.
Before this PR, we used to decode the binary stream into UTF-8 strings before parsing them. This is inefficient because sometimes the slices end up having to be copied so it's better to decode it directly into the format. The follow up to this is also to add support for binary data and then we can't assume the entire payload is UTF-8 anyway. So this refactors the parser to parse the rows in binary and then decode the result into UTF-8. It does add some overhead to decoding on a per row basis though.
Since we do this, we need to encode the byte length that we want decode - not the string length. Therefore, this requires clients to receive binary data and why I had to delete the string option.
It also means that I had to add a way to get the byteLength from a chunk since they're not always binary. For Web streams it's easy since they're always typed arrays. For Node streams it's trickier so we use the byteLength helper which may not be very efficient. Might be worth eagerly encoding them to UTF8 - perhaps only for this case.