[python] non ascii HTTP headers #868
Are these headers being sent from the client? If so, which client, and is it reproducible with a different client? Can you give an example of a specific header with both its correct and incorrect encoding?
Yes, the headers are being sent by the client. I made a repository with an example to reproduce the issue, if it can help: https://github.com/RomainMou/unit-django-issue-headers.
Thanks for the app! I set it up under Fedora 38 (Python 3.11.3, Django 4.0.10) with the latest Unit from master (I am not using the `home` config directive). The first time I ran it I got the error:

```
$ curl --header "ASCIITEST: 1" http://localhost:8080/test/
Hello, HTTP_ASCIITEST is ascii? False
```

But of course that isn't entirely correct, as the code is looking for a header called `HTTP_ASCIITEST`. Once we use the right header...

```
$ curl --header "HTTP_ASCIITEST: 1" http://localhost:8080/test/ | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 49764  100 49764    0     0   771k      0 --:--:-- --:--:-- --:--:--  783k
<!DOCTYPE html>
<html lang="en">
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <meta name="robots" content="NONE,NOARCHIVE">
  <title>AttributeError
          at /test/</title>
  <style type="text/css">
    html * { padding:0; margin:0; }
    body * { padding:10px 20px; }
```

I've noticed some inconsistency in the headers you're sending and what you're checking against. Could you double check that you are indeed checking against the correct headers?
The `HTTP_` prefix is added automatically by the WSGI/CGI layer, so the `ASCIITEST` request header shows up as `HTTP_ASCIITEST` in `request.META`. So yes, it should be the same header.
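For reference, this is the CGI-style header-name mapping that WSGI servers apply before a header reaches `environ` (and hence Django's `request.META`): upper-case the name, turn dashes into underscores, and prepend `HTTP_` (Content-Type and Content-Length are the usual exceptions). A minimal sketch; the `to_environ_key` helper name is mine:

```python
def to_environ_key(header_name: str) -> str:
    """Map an HTTP request header name to its WSGI environ key,
    per the CGI convention: upper-case, '-' -> '_', 'HTTP_' prefix."""
    return "HTTP_" + header_name.upper().replace("-", "_")

print(to_environ_key("ASCIITEST"))     # HTTP_ASCIITEST
print(to_environ_key("X-Request-Id"))  # HTTP_X_REQUEST_ID
```

This is why sending a literal `HTTP_ASCIITEST` header, as in the curl command above, makes the app look up the wrong key.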
AFAICT this header is coming through as ASCII. If I save this header value off to a file

```python
f = open("/tmp/t.txt", "w")
f.write(request.META.get("HTTP_ASCIITEST"))
f.close()
```

```
$ ls -l /tmp/t.txt
-rw-r--r-- 1 andrew andrew 14 May 10 03:17 /tmp/t.txt
$ file /tmp/t.txt
/tmp/t.txt: ASCII text, with no line terminators
$ od -a /tmp/t.txt
0000000   T   h   i   s  sp   i   s  sp   a  sp   t   e   s   t
0000016
```

If I look in gdb at the memory where this is stored before being passed through to the application, again, it looks OK. Unclear where isascii() is going wrong...
Another thing I noticed: each char of the string is individually considered ascii.
The one difference I could find: the memory footprint of the string seems to correspond to the extended-ascii storage kind (characters 128-255), even though the value is equal to a simple ascii string.
Edit: another thing, using
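The memory-size observation is consistent with CPython's compact string layouts (PEP 393): a pure-ASCII string is stored as a `PyASCIIObject`, while a string whose highest code point falls in 128-255 carries the larger `PyCompactUnicodeObject` header. A small sketch; the example strings are mine:

```python
import sys

ascii_str = "test"      # stored as a compact PyASCIIObject
latin1_str = "tes\xff"  # U+00FF forces the latin-1 storage kind

print(ascii_str.isascii())   # True
print(latin1_str.isascii())  # False

# Same number of characters, one byte per character in both cases,
# but the latin-1 kind carries a bigger object header.
print(sys.getsizeof(ascii_str) < sys.getsizeof(latin1_str))  # True
```

This matches the 74-versus-50-byte difference reported later in the thread for Unit-created header strings.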
I'm leaning towards this not being a problem per se. With a simple PHP application

```json
{
    "listeners": {
        "[::]:8080": {
            "pass": "applications/php"
        }
    },
    "applications": {
        "php": {
            "type": "php",
            "root": "/home/andrew/src/php"
        }
    }
}
```

```php
<?php
foreach ($_SERVER as $key => $value) {
    echo "{$key} => {$value} ";
    echo "[";
    echo mb_detect_encoding($value, 'ASCII', true);
    echo "]\n";
}
?>
```

```
$ curl --header "ASCIITEST: This is a test" http://localhost:8080/test.php
SERVER_SOFTWARE => Unit/1.30.0 [ASCII]
SERVER_PROTOCOL => HTTP/1.1 [ASCII]
PHP_SELF => /test.php [ASCII]
SCRIPT_NAME => /test.php [ASCII]
SCRIPT_FILENAME => /home/andrew/src/php/test.php [ASCII]
DOCUMENT_ROOT => /home/andrew/src/php [ASCII]
REQUEST_METHOD => GET [ASCII]
REQUEST_URI => /test.php [ASCII]
QUERY_STRING =>  [ASCII]
REMOTE_ADDR => ::1 [ASCII]
SERVER_ADDR => ::1 [ASCII]
SERVER_NAME => localhost [ASCII]
SERVER_PORT => 80 [ASCII]
HTTP_HOST => localhost:8080 [ASCII]
HTTP_USER_AGENT => curl/8.0.1 [ASCII]
HTTP_ACCEPT => */* [ASCII]
HTTP_ASCIITEST => This is a test [ASCII]
REQUEST_TIME_FLOAT => 1683739126.4316 [ASCII]
REQUEST_TIME => 1683739126 [ASCII]
```

All the headers came through as ASCII, so Unit isn't doing anything silly. Now with a simple Python application

```json
{
    "listeners": {
        "[::1]:8080": {
            "pass": "applications/python"
        }
    },
    "applications": {
        "python": {
            "type": "python",
            "path": "/home/andrew/src/python",
            "module": "gh868"
        }
    }
}
```

```python
def application(environ, start_response):
    t = environ['HTTP_ASCIITEST']
    # t = "This is a test"
    t = "'" + t + "'" + " (" + str(len(t)) + ")"
    try:
        t.decode("ascii")
    except:
        t = t + " [non-ascii]"
    else:
        t = t + " [ascii]"
    resp = t + "\n"
    start_response("200 OK", [("Content-Type", "text/plain")])
    return (resp.encode())
```

```
$ curl --header "ASCIITEST: This is a test" http://localhost:8080/
```
'This is a test' (14) [non-ascii]

Let's try with a simple standalone application

```python
#!/usr/bin/python
t = "This is a test"
t = "'" + t + "'" + " (" + str(len(t)) + ")"
try:
    t.decode("ascii")
except:
    t = t + " [non-ascii]"
else:
    t = t + " [ascii]"
print(t)
```

```
$ ./s.py
'This is a test' (14) [non-ascii]
```

This is with Python 3.11.3, and it looks like just normal behaviour. Wasn't this one of the big changes in Python 3? UTF-8 encoding being used, and the use of byte arrays rather than just simple strings... Are you seeing an actual problem because of this?
In your python example, `str` objects no longer have a `decode()` method in Python 3. So when you call `t.decode("ascii")`, it raises an `AttributeError` and always passes into the `except` branch. With something like `t.isascii()` you get the expected result. So no, it's not the default behavior.

For the actual problem: yes, I have some headers that I compare with reference values. I couldn't reproduce the issue without Nginx Unit, so I still think it's an issue with how the header strings are initialized, somehow.
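To make this concrete: since Python 3 `str` has no `decode()` method, the bare `except` swallows the `AttributeError` and the test app always prints `[non-ascii]`. A Python 3 check can use `str.isascii()` (available since 3.7) or attempt an ASCII encode; the `is_ascii` helper below is a hypothetical name:

```python
def is_ascii(t: str) -> bool:
    # str.isascii() exists since Python 3.7; encoding is the portable fallback.
    try:
        t.encode("ascii")
    except UnicodeEncodeError:
        return False
    return True

print(is_ascii("This is a test"))  # True
print(is_ascii("£"))               # False
print(hasattr("", "decode"))       # False: str.decode() is gone in Python 3
```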
Looks like in Python 3 (and for WSGI at least) we add the headers in Python as Unicode; see https://github.com/nginx/unit/blob/master/src/python/nxt_python_wsgi.c#L675. It's been this way for at least ~3 years, and I'm not sure this is something we'd change. But it sounds like you can work with byte strings...
unit/src/python/nxt_python_wsgi.c line 826 in aa45d17
unit/src/python/nxt_python_wsgi.c line 867 in aa45d17

I think the issue is with these lines. If I understand the documentation about `PyUnicode_New()`, its second argument is the maximum code point the string may hold, and 255 makes CPython pick the latin-1 storage kind rather than the compact ASCII layout. I rebuilt Unit with that value changed from 255 to 127, and we can see that the string is now ASCII and that the memory usage dropped from 74 to 50 bytes. I don't know if this can have other impacts.
Thanks for digging into this. With a maxchar of 127 my mind immediately went to "what happens with extended ASCII characters, such as the '£' sign (GBP Sterling)?". With the current code (maxchar 255), things should work as expected...

```
$ curl --header "ASCIITEST: GBP currency symbol £" http://localhost:8080/
'GBP currency symbol Â£' (22) [non-ascii]
```

Err, no, that's not right! We have an extraneous character. Let's try with a maxchar of 127

```
$ curl --header "ASCIITEST: GBP currency symbol £" http://localhost:8080/
'GBP currency symbol £' (22) [ascii]
```

Hmm. OK, looks better. Because the header is UTF-8 encoded, even though the string length should only be 21 (as in visible characters) we get 22. We know the pound sign is in extended ascii, i.e. above 127; from GNOME character map, looking at the 'Liberation Sans' font, we find the '£' in the 'Latin-1 Supplement' Unicode block. So we can see that it takes up two bytes in UTF-8, hence we get 22 for the length. Things are starting to make sense now! How does curl actually send that?

```
$ strace -f -e sendto -s 256 curl --header "ASCIITEST: GBP currency symbol £" http://localhost:8080/
sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: GBP currency symbol \302\243\r\n\r\n", 112, MSG_NOSIGNAL, NULL, 0) = 112
```

OK, makes sense. Hmm, but how does something not in basic ASCII get encoded wrong? This seems to be the key bit (from CPython's string allocation):

```c
is_ascii = 0;
struct_size = sizeof(PyCompactUnicodeObject);

if (maxchar < 128) {
    kind = PyUnicode_1BYTE_KIND;
    char_size = 1;
    is_ascii = 1;
    struct_size = sizeof(PyASCIIObject);
}
else if (maxchar < 256) {
    kind = PyUnicode_1BYTE_KIND;
    char_size = 1;
}
```

So with maxchar 127 the object is created as a PyASCIIObject with the is_ascii flag set, regardless of what bytes are later copied into it. Hmm, what happens if we actually send some Unicode with this change?

```
$ curl --header "ASCIITEST: Have some unicode 💩" http://localhost:8080/test/
'Have some unicode 💩' (22) [ascii]
```

Well it came through OK (pardon the pun!) but it's labelled as ascii; I bet isascii() simply checks the is_ascii flag... Is this what you'd expect? So it seems you can either have everything treated as Unicode or everything treated as ASCII...
Oh right, I wanted to test that but I forgot. No, it's not what I'd expect. I have really limited knowledge of C programming, so I just made some dirty proofs of concept. With something like:
I get something that seems better:
For reference, current Unit output respectively:
And Gunicorn output:
With
To be honest I don't know what the right move is here. I still think the result of `isascii()` is wrong.
Hmm. Thanks!
OK, so this seems to work for all cases: unique and duplicate Unicode and ASCII headers. Unicode flows through correctly and .isascii() does the right thing.

```diff
diff --git a/src/python/nxt_python_wsgi.c b/src/python/nxt_python_wsgi.c
index dfb31509..a925b113 100644
--- a/src/python/nxt_python_wsgi.c
+++ b/src/python/nxt_python_wsgi.c
@@ -863,11 +863,34 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
     char      *p, *src;
     PyObject  *res;
 
+    src = nxt_unit_sptr_get(&f->value);
+
 #if PY_MAJOR_VERSION == 3
-    res = PyUnicode_New(vl, 255);
+    if (nxt_slow_path(n > 1)) {
+        char  *ptr;
+
+        p = nxt_malloc(vl + 1);
+        ptr = p;
+        p = nxt_cpymem(p, src, f->value_length);
+
+        for (i = 1; i < n; i++) {
+            p = nxt_cpymem(p, ", ", 2);
+
+            src = nxt_unit_sptr_get(&f[i].value);
+            p = nxt_cpymem(p, src, f[i].value_length);
+        };
+        *p = '\0';
+
+        src = ptr;
+    }
+
+    res = PyUnicode_FromString(src);
+
+    if (nxt_slow_path(n > 1)) {
+        nxt_free(src);
+    }
 #else
     res = PyString_FromStringAndSize(NULL, vl);
-#endif
 
     if (nxt_slow_path(res == NULL)) {
         return NULL;
@@ -875,7 +898,6 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
 
     p = PyString_AS_STRING(res);
 
-    src = nxt_unit_sptr_get(&f->value);
     p = nxt_cpymem(p, src, f->value_length);
 
     for (i = 1; i < n; i++) {
@@ -884,6 +906,7 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
         src = nxt_unit_sptr_get(&f[i].value);
         p = nxt_cpymem(p, src, f[i].value_length);
     }
+#endif
 
     return res;
 }
```

Heh, I ain't touching the Python 2.7 code! There are likely many other problematic areas, e.g. I haven't touched the header names, and likely other places that I've seen. But they can wait until they are actually deemed an issue.
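For illustration, the joining of duplicate header field values that the C code performs with `nxt_cpymem()` can be sketched in Python; `join_field_values` is a hypothetical helper name:

```python
def join_field_values(values: list) -> bytes:
    """Join duplicate header field values with ", ",
    mirroring the for-loop in nxt_python_field_value()."""
    return b", ".join(values)

# Two ASCIITEST headers: '$' and '£' (sent by curl as UTF-8 0xC2 0xA3).
joined = join_field_values([b"$", b"\xc2\xa3"])
print(joined)       # b'$, \xc2\xa3'
print(len(joined))  # 5 bytes: 1 + 2 (separator) + 2
```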
Slightly modified patch

```diff
diff --git a/src/python/nxt_python_wsgi.c b/src/python/nxt_python_wsgi.c
index dfb31509..9cbae708 100644
--- a/src/python/nxt_python_wsgi.c
+++ b/src/python/nxt_python_wsgi.c
@@ -863,11 +863,34 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
     char      *p, *src;
     PyObject  *res;
 
+    src = nxt_unit_sptr_get(&f->value);
+
 #if PY_MAJOR_VERSION == 3
-    res = PyUnicode_New(vl, 255);
+    if (nxt_slow_path(n > 1)) {
+        char  *ptr;
+
+        p = nxt_malloc(vl + 1);
+        ptr = p;
+        p = nxt_cpymem(p, src, f->value_length);
+
+        for (i = 1; i < n; i++) {
+            p = nxt_cpymem(p, ", ", 2);
+
+            src = nxt_unit_sptr_get(&f[i].value);
+            p = nxt_cpymem(p, src, f[i].value_length);
+        };
+        *p = '\0';
+
+        src = ptr;
+    }
+
+    res = PyUnicode_DecodeCharmap(src, vl, NULL, NULL);
+
+    if (nxt_slow_path(n > 1)) {
+        nxt_free(src);
+    }
 #else
     res = PyString_FromStringAndSize(NULL, vl);
-#endif
 
     if (nxt_slow_path(res == NULL)) {
         return NULL;
@@ -875,7 +898,6 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
 
     p = PyString_AS_STRING(res);
 
-    src = nxt_unit_sptr_get(&f->value);
     p = nxt_cpymem(p, src, f->value_length);
 
     for (i = 1; i < n; i++) {
@@ -884,6 +906,7 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
         src = nxt_unit_sptr_get(&f[i].value);
         p = nxt_cpymem(p, src, f[i].value_length);
     }
+#endif
 
     return res;
 }
```

The only difference is the use of PyUnicode_DecodeCharmap() instead of PyUnicode_FromString(). While PyUnicode_FromString() generally worked, it broke one of our tests where we send a 0xFF byte as latin1 (in which it is a valid character); 0xFF is not valid in UTF-8, and PyUnicode_FromString() expects a valid UTF-8 string.

The differences between the two:

PyUnicode_FromString() with UTF-8 being returned from the application

```
$ strace -s 256 -e sendto,recvfrom curl --header "ASCIITEST: £" http://localhost:8080/
sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: \302\243\r\n\r\n", 92, MSG_NOSIGNAL, NULL, 0) = 92
recvfrom(5, "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nServer: Unit/1.30.0\r\nDate: Fri, 19 May 2023 14:07:59 GMT\r\nTransfer-Encoding: chunked\r\n\r\n16\r\n'\302\243' (1) [non-ascii]\n\n\r\n0\r\n\r\n", 102400, 0, NULL, NULL) = 164
'£' (1) [non-ascii]
```

So we're sending and receiving UTF-8 (\302\243), but the string length according to Python is 1; is it storing it as latin1?

PyUnicode_DecodeCharmap() with latin1 being returned from the application

```
$ strace -s 256 -e sendto,recvfrom curl --header "ASCIITEST: £" http://localhost:8080/
sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: \302\243\r\n\r\n", 92, MSG_NOSIGNAL, NULL, 0) = 92
recvfrom(5, "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nServer: Unit/1.30.0\r\nDate: Fri, 19 May 2023 14:12:16 GMT\r\nTransfer-Encoding: chunked\r\n\r\n16\r\n'\302\243' (2) [non-ascii]\n\n\r\n0\r\n\r\n", 102400, 0, NULL, NULL) = 164
'£' (2) [non-ascii]
```

So, again we're sending and receiving UTF-8, but now the string length according to Python is 2, which matches what strlen("£") returns in C.
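The behavioural difference between the two C calls can be mimicked from Python: a UTF-8 decode (what PyUnicode_FromString() performs) collapses the bytes 0xC2 0xA3 into the single code point U+00A3, while the CPython docs describe PyUnicode_DecodeCharmap() with a NULL mapping as latin-1 decoding, which keeps one character per byte and accepts any byte, including 0xFF. A sketch:

```python
raw = b"\xc2\xa3"  # what curl puts on the wire for '£' (UTF-8)

utf8 = raw.decode("utf-8")            # PyUnicode_FromString() behaviour
print(len(utf8), utf8.isascii())      # 1 False

latin1 = raw.decode("latin-1")        # PyUnicode_DecodeCharmap(.., NULL, NULL) behaviour
print(len(latin1), latin1.isascii())  # 2 False
print(latin1.encode("latin-1") == raw)  # True: byte-preserving round trip

# 0xFF is a valid latin-1 character but not valid UTF-8:
print(b"\xff".decode("latin-1"))  # ÿ
try:
    b"\xff".decode("utf-8")
except UnicodeDecodeError:
    print("0xff is not valid UTF-8")
```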
@RomainMou does the patch work for you?
Hi @tippexs, the patch is working fine for me. |
On GitHub, @RomainMou reported an issue whereby HTTP header field values were being incorrectly reported as non-ascii by the Python .isascii() method. For example, using the following test application

```python
def application(environ, start_response):
    t = environ['HTTP_ASCIITEST']
    t = "'" + t + "'" + " (" + str(len(t)) + ")"
    if t.isascii():
        t = t + " [ascii]"
    else:
        t = t + " [non-ascii]"
    resp = t + "\n\n"
    start_response("200 OK", [("Content-Type", "text/plain")])
    return (bytes(resp, 'latin1'))
```

you would see the following

```
$ curl -H "ASCIITEST: $" http://localhost:8080/
'$' (1) [non-ascii]
```

'$' has an ASCII code of 0x24 (36).

The initial idea was to adjust the second parameter to the PyUnicode_New() call from 255 to 127. This unfortunately had the opposite effect.

```
$ curl -H "ASCIITEST: $" http://localhost:8080/
'$' (1) [ascii]
```

Good. However...

```
$ curl -H "ASCIITEST: £" http://localhost:8080/
'£' (2) [ascii]
```

Not good. Let's take a closer look at this. '£' is not in basic ASCII, but is in extended ASCII with a value of 0xA3 (163). Its UTF-8 encoding is 0xC2 0xA3, hence the length of 2 bytes above.

```
$ strace -s 256 -e sendto,recvfrom curl -H "ASCIITEST: £" http://localhost:8080/
sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: \302\243\r\n\r\n", 92, MSG_NOSIGNAL, NULL, 0) = 92
recvfrom(5, "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nServer: Unit/1.30.0\r\nDate: Mon, 22 May 2023 12:44:11 GMT\r\nTransfer-Encoding: chunked\r\n\r\n12\r\n'\302\243' (2) [ascii]\n\n\r\n0\r\n\r\n", 102400, 0, NULL, NULL) = 160
'£' (2) [ascii]
```

So we can see curl sent it UTF-8 encoded ('\302\243' is C octal-escaped UTF-8 for 0xC2 0xA3), and we got the same back. But it should not be marked as ASCII. When doing PyUnicode_New(size, 127) it sets the buffer as ASCII. So we need to use another function, and that function would appear to be PyUnicode_DecodeCharmap(), which creates a Unicode object with the correct ascii/non-ascii properties based on the character encoding. With this function we now get

```
$ curl -H "ASCIITEST: $" http://localhost:8080/
'$' (1) [ascii]
$ curl -H "ASCIITEST: £" http://localhost:8080/
'£' (2) [non-ascii]
```

and for good measure

```
$ curl -H "ASCIITEST: $ £" http://localhost:8080/
'$ £' (4) [non-ascii]
$ curl -H "ASCIITEST: $" -H "ASCIITEST: £" http://localhost:8080/
'$, £' (5) [non-ascii]
```

PyUnicode_DecodeCharmap() does require having the full string upfront, so we need to build up the potentially comma-separated header field values string before invoking this function.

I did not want to touch the Python 2.7 code (which may or may not even be affected by this), so I kept these changes completely isolated from it, hence a slight duplication with the for () loop. Python 2.7 was sunset on January 1st 2020 [0], so this code will hopefully just disappear soon anyway.

I also purposefully didn't touch other code that may well have similar issues (such as the HTTP header field names); if we ever get issue reports about them, we'll deal with them then.

[0]: <https://www.python.org/doc/sunset-python-2/>
Link: <https://docs.python.org/3/c-api/unicode.html>
Closes: <nginx#868>
Reported-by: RomainMou <https://github.com/RomainMou>
Tested-by: RomainMou <https://github.com/RomainMou>
Signed-off-by: Andrew Clayton <[email protected]>
The fix for this was merged. Thanks for your work on this.
Hello,
I'm trying to replace Gunicorn with Unit 1.29.1 for some Django projects, but I have some weird issues with the HTTP header contents:
It seems that the string is encoded in a way that Python does not consider to be an ASCII string. I can also see that the strings seem to have a different memory size between the non-ASCII and the ASCII versions.
I couldn't manage to reproduce this issue with Gunicorn:
Is this normal? Did I miss some configuration?