
[python] non ascii HTTP headers #868

Closed
RomainMou opened this issue May 5, 2023 · 18 comments

@RomainMou
Contributor

RomainMou commented May 5, 2023

Hello,

I'm trying to replace gunicorn with Unit 1.29.1 for some Django projects, but I have some weird issues with the HTTP header contents:

(Pdb) request.META.get("HTTP_DNT")
'1'
(Pdb) request.META.get("HTTP_DNT").isascii()
False
(Pdb) test = request.META.get("HTTP_DNT")
(Pdb) test.isascii()
False
(Pdb) test2 = (test + '.')[:-1]
(Pdb) test == test2
True
(Pdb) test2.isascii()
True

It seems that the string is encoded in a way that Python does not consider to be an ASCII string. I can also see that the strings have different memory sizes between the non-ASCII and the ASCII versions.

(Pdb) from sys import getsizeof
(Pdb) getsizeof(test)
74
(Pdb) getsizeof(test2)
50
(Pdb) import ctypes
(Pdb) ctypes.string_at(id(test), getsizeof(test))
b'\x02\x00\x00\x00\x00\x00\x00\x00\xa0\xc8\x9e\xddT\x7f\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\xff\xff\xff\xff\xff\xff\xff\xff\xa4\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x001\x00'
(Pdb) ctypes.string_at(id(test2), getsizeof(test2))
b'K\x00\x00\x00\x00\x00\x00\x00\xa0\xc8\x9e\xddT\x7f\x00\x00\x01\x00\x00\x00\x00\x00\x00\x00\x06\x1b\xaa\x81s-=\xed\xe5\xe5L\xddT\x7f\x00\x00\x00\x00\x00\x00\x00\x00\x00\x001\x00'

I couldn't manage to reproduce this issue with Gunicorn:

(Pdb) request.META.get("HTTP_DNT")
'1'
(Pdb) test = request.META.get("HTTP_DNT")
(Pdb) test.isascii()
True
(Pdb) from sys import getsizeof
(Pdb) getsizeof(test)
50

Is this normal? Did I miss some configuration?

@ac000
Member

ac000 commented May 8, 2023

Are these headers being sent from the client? If so, which client and is it reproducible with a different client?

If so, can you give an example of a specific header with both its correct and incorrect encoding?

@RomainMou
Contributor Author

Yes, the headers are being sent by the client, for example with curl --header "ASCIITEST: 1" http://localhost:8000/test/. In that case, request.META.get("HTTP_ASCIITEST") will be considered non-ASCII when using Unit.

I made a repository with an example to reproduce the issue if it can help: https://github.com/RomainMou/unit-django-issue-headers.

@ac000
Member

ac000 commented May 9, 2023

Yes, the headers are being sent by the client, for example with curl --header "ASCIITEST: 1" http://localhost:8000/test/. In that case, request.META.get("HTTP_ASCIITEST") will be considered non-ASCII when using Unit.

I made a repository with an example to reproduce the issue if it can help: https://github.com/RomainMou/unit-django-issue-headers.

Thanks for the app!

I set it up under Fedora 38 (Python 3.11.3, Django 4.0.10) with the latest Unit from master (I am not using the home config directive).

The first time I ran it I got the error.

$ curl --header "ASCIITEST: 1" http://localhost:8080/test/
Hello, HTTP_ASCIITEST is ascii? False

But of course that isn't entirely correct, as the code is looking for a header called HTTP_ASCIITEST. Once we use the right header...

$ curl --header "HTTP_ASCIITEST: 1" http://localhost:8080/test/ | head
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 49764  100 49764    0     0   771k      0 --:--:-- --:--:-- --:--:--  783k
<!DOCTYPE html>
<html lang="en">
<head>
  <meta http-equiv="content-type" content="text/html; charset=utf-8">
  <meta name="robots" content="NONE,NOARCHIVE">
  <title>AttributeError
          at /test/</title>
  <style type="text/css">
    html * { padding:0; margin:0; }
    body * { padding:10px 20px; }

I've noticed some inconsistency between the headers you're sending and what you're checking against. Could you double check that you are indeed checking against the correct headers?

@RomainMou
Contributor Author

RomainMou commented May 9, 2023

The request.META object in Django has specific behavior for header names: all headers are prefixed with HTTP_, as stated in the documentation:

With the exception of CONTENT_LENGTH and CONTENT_TYPE, as given above, any HTTP headers in the request are converted to META keys by converting all characters to uppercase, replacing any hyphens with underscores and adding an HTTP_ prefix to the name. So, for example, a header called X-Bender would be mapped to the META key HTTP_X_BENDER.

So yes, it should be the same header.
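A minimal sketch of that documented mapping (an illustration of the rule only, not Django's actual implementation):

```python
# Sketch of Django's documented META key mapping (illustrative only):
# uppercase the header name, replace hyphens with underscores,
# and add the HTTP_ prefix.
def to_meta_key(header_name):
    return "HTTP_" + header_name.upper().replace("-", "_")

print(to_meta_key("X-Bender"))   # HTTP_X_BENDER
print(to_meta_key("ASCIITEST"))  # HTTP_ASCIITEST
```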

Edit :
Another way to access headers is with the HttpRequest.headers object (documentation), and I can reproduce the same behavior with it:

(Pdb) request.headers.get('ASCIITEST')
'1'
(Pdb) request.headers.get('ASCIITEST').isascii()
False

@ac000
Member

ac000 commented May 10, 2023

AFAICT this header is coming through as ASCII.

If I save this header value off to a file

f = open("/tmp/t.txt", "w")                                                 
f.write(request.META.get("HTTP_ASCIITEST"))                                 
f.close()
$ ls -l /tmp/t.txt 
-rw-r--r-- 1 andrew andrew 14 May 10 03:17 /tmp/t.txt
$ file /tmp/t.txt 
/tmp/t.txt: ASCII text, with no line terminators
$ od -a /tmp/t.txt 
0000000   T   h   i   s  sp   i   s  sp   a  sp   t   e   s   t
0000016

If I look at the memory where this is stored before being passed through to the application in gdb

0x7f9c2c8de100: "ASCIITEST"
0x7f9c2c8de10a: "This is a test"

Again, looks OK.

Unclear where isascii() is going wrong...

@RomainMou
Contributor Author

RomainMou commented May 10, 2023

Another thing I noticed: each char of the string is individually considered ASCII:

>>> request.META.get("HTTP_ASCIITEST").isascii()
False
>>> for char in request.META.get("HTTP_ASCIITEST"):
...   char.isascii()
... 
True
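For what it's worth, in a string built by CPython itself the whole-string and per-character checks always agree, so a mismatch like the one above points at how the object was created on the C side:

```python
# For a string constructed in pure Python, isascii() on the whole
# string always matches isascii() applied to every character.
s = "This is a test"
print(s.isascii())                  # True
print(all(c.isascii() for c in s))  # True
assert s.isascii() == all(c.isascii() for c in s)
```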

Among the differences I could find, the memory footprint of the string seems to correspond to a multi-byte extended ASCII char (characters 128-255), even though the value is equal to a simple ASCII char.

>>> import sys
>>> simple_ascii_char = "1"
>>> multi_bytes_ascii_char = "é"
>>> two_bytes_utf_8_char = "ʃ"
>>> print(sys.getsizeof(simple_ascii_char), sys.getsizeof(multi_bytes_ascii_char), sys.getsizeof(two_bytes_utf_8_char))
50 74 76
>>> sys.getsizeof(request.META.get("HTTP_ASCIITEST"))
74
>>> request.META.get("HTTP_ASCIITEST")
'1'
>>> simple_ascii_char == request.META.get("HTTP_ASCIITEST")
True

Edit: another thing, using encode() results in a valid ASCII byte string:

(Pdb) request.META.get("HTTP_ASCIITEST").isascii()
False
(Pdb) request.META.get("HTTP_ASCIITEST").encode()
b'1'
(Pdb) request.META.get("HTTP_ASCIITEST").encode().isascii()
True
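The size difference matches CPython's compact string layout: pure-ASCII strings use the smaller PyASCIIObject header, while a latin-1 (maxchar 255) string pays for the larger PyCompactUnicodeObject header even when every character is ASCII. A small check (exact sizes vary between CPython versions):

```python
import sys

ascii_str = "1"    # stored as a PyASCIIObject
latin1_str = "é"   # stored with the latin-1 (non-ASCII) representation
print(sys.getsizeof(ascii_str))   # typically 50 on 64-bit CPython
print(sys.getsizeof(latin1_str))  # typically 74
assert sys.getsizeof(latin1_str) > sys.getsizeof(ascii_str)
```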

@ac000
Member

ac000 commented May 10, 2023

I'm leaning towards this not being a problem per se.

With a simple PHP application

{
    "listeners": {
        "[::]:8080": {
            "pass": "applications/php"
        }
    },

    "applications": {
        "php": {
            "type": "php",
            "root": "/home/andrew/src/php"
        }
    }
}
<?php

foreach ($_SERVER as $key => $value) {
        echo "{$key} => {$value} ";
        echo "[";
        echo mb_detect_encoding($value, 'ASCII', true);
        echo "]\n";
}

?>
$ curl --header "ASCIITEST: This is a test" http://localhost:8080/test.php
SERVER_SOFTWARE => Unit/1.30.0 [ASCII]
SERVER_PROTOCOL => HTTP/1.1 [ASCII]
PHP_SELF => /test.php [ASCII]
SCRIPT_NAME => /test.php [ASCII]
SCRIPT_FILENAME => /home/andrew/src/php/test.php [ASCII]
DOCUMENT_ROOT => /home/andrew/src/php [ASCII]
REQUEST_METHOD => GET [ASCII]
REQUEST_URI => /test.php [ASCII]
QUERY_STRING =>  [ASCII]
REMOTE_ADDR => ::1 [ASCII]
SERVER_ADDR => ::1 [ASCII]
SERVER_NAME => localhost [ASCII]
SERVER_PORT => 80 [ASCII]
HTTP_HOST => localhost:8080 [ASCII]
HTTP_USER_AGENT => curl/8.0.1 [ASCII]
HTTP_ACCEPT => */* [ASCII]
HTTP_ASCIITEST => This is a test [ASCII]
REQUEST_TIME_FLOAT => 1683739126.4316 [ASCII]
REQUEST_TIME => 1683739126 [ASCII]

All the headers came through as ASCII. So Unit isn't doing anything silly.

Now with a simple Python application

{
    "listeners": {
        "[::1]:8080": {
            "pass": "applications/python"
        }
    },
    "applications": {
        "python": {
            "type": "python",
            "path": "/home/andrew/src/python",
            "module": "gh868"
        }
    }
}
def application(environ, start_response):
    t = environ['HTTP_ASCIITEST']
#    t = "This is a test"
    t = "'" + t + "'" +  " (" + str(len(t)) + ")"
    
    try:
        t.decode("ascii")
    except:
        t = t + " [non-ascii]"
    else:
        t = t + " [ascii]"

    resp = t + "\n"
    start_response("200 OK", [("Content-Type", "text/plain")])
    return (resp.encode())
$ curl --header "ASCIITEST: This is a test" http://localhost:8080/
'This is a test' (14) [non-ascii]

Let's try with a simple standalone application

#!/usr/bin/python
    
t = "This is a test"
t = "'" + t + "'" +  " (" + str(len(t)) + ")"

try:
    t.decode("ascii")
except:
    t = t + " [non-ascii]"
else:
    t = t + " [ascii]"

print(t)
$ ./s.py
'This is a test' (14) [non-ascii]

This is with Python 3.11.3

It looks like just normal behaviour. Wasn't this one of the big changes in Python 3? UTF-8 encoding being used and the use of ByteArrays rather than just simple strings...

Are you seeing an actual problem because of this?

@RomainMou
Contributor Author

RomainMou commented May 10, 2023

In your Python example, t is defined as a string; it doesn't have a decode() function. You can encode() a string to obtain a byte string, and decode() a byte string to obtain a string, in my understanding of https://docs.python.org/3/library/stdtypes.html#bytes.decode and https://docs.python.org/3/library/stdtypes.html#str.encode.

So when you try to decode, you get an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'decode'. Did you mean: 'encode'?

And it always falls into the except branch.

With something like:

#!/usr/bin/python
    
t = "This is a test"
t = "'" + t + "'" +  " (" + str(len(t)) + ")"

if t.isascii():
  t = t + " [ascii]"
else:
  t = t + " [non-ascii]"

print(t)

You get the expected result: 'This is a test' (14) [ascii]

So no, it's not the default behavior.

For the actual problem, yes, I have some headers that I compare with reference values using the function hmac.compare_digest(the_header, the_reference), and when using strings as arguments, they need to be ASCII strings. I can convert each string to a byte string to avoid this issue.
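That constraint is documented behavior: hmac.compare_digest() only accepts str arguments when both are ASCII-only, and raises TypeError otherwise; byte strings always work. A quick demonstration:

```python
import hmac

# str arguments work only when both strings are ASCII-only...
assert hmac.compare_digest("secret", "secret")

# ...byte strings are always accepted.
assert hmac.compare_digest("é".encode(), "é".encode())

# A non-ASCII str argument raises TypeError.
try:
    hmac.compare_digest("é", "é")
except TypeError:
    print("TypeError raised")
```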

I couldn't reproduce the issue without Nginx Unit, so I still think it's an issue on how the header strings are initialized, somehow.

@ac000
Member

ac000 commented May 10, 2023

Looks like in Python 3 (and for WSGI at least) we add the headers in Python as Unicode.

See

https://github.com/nginx/unit/blob/master/src/python/nxt_python_wsgi.c#L675
https://github.com/nginx/unit/blob/master/src/python/nxt_python_wsgi.c#L797
https://github.com/nginx/unit/blob/master/src/python/nxt_python_wsgi.c#L867

It's been this way for at least ~3 years. I'm not sure this is something we'd change.

But it sounds like you can work with byte strings...

@RomainMou
Contributor Author

res = PyUnicode_New(len + 5, 255);

res = PyUnicode_New(vl, 255);

I think the issue is with these lines. If I understand the documentation about PyUnicode_New, the maxchar value should be the maximum character code, to optimize the memory usage. In the case of ASCII, characters are between 0 and 127. 255 corresponds to extended ASCII.

I rebuilt Unit with 127 instead of 255 and it's working as expected:

(Pdb) sys.getsizeof(request.META.get("HTTP_ASCIITEST"))
50
(Pdb) request.META.get("HTTP_ASCIITEST")
'1'
(Pdb) request.META.get("HTTP_ASCIITEST").isascii()
True

We can see that the string is ASCII and that the memory usage dropped from 74 to 50 bytes.

I don't know if this can have other impacts.

@ac000
Member

ac000 commented May 11, 2023

Thanks for digging into this.

So with using a maxchar of 127, my mind immediately went to: what happens with extended ASCII characters, such as the '£' sign (GBP sterling)?

With the current code (maxchar 255), things should work as expected...

$ curl --header "ASCIITEST: GBP currency symbol £" http://localhost:8080/
'GBP currency symbol £' (22) [non-ascii]

Err, no, that's not right! We have an extraneous character.

Let's try with a maxchar of 127

$ curl --header "ASCIITEST: GBP currency symbol £" http://localhost:8080/
'GBP currency symbol £' (22) [ascii]

Hmm. OK, looks better. So because it's encoded as Unicode, even though the string length should only be 21 (as in visible characters), we get 22. We know the pound sign is in extended ASCII, i.e. above 127.

From GNOME character map, looking at the 'Liberation Sans' font, we find the '£' in the 'Latin-1 Supplement' Unicode block

Various Useful Representations

UTF-8: 0xC2 0xA3
UTF-16: 0x00A3

C octal escaped UTF-8: \302\243
XML decimal entity: &#163;

So we can see that it takes up two bytes in UTF-8, hence we get 22 for the length.
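Those representations are easy to confirm from Python itself:

```python
pound = "£"  # U+00A3, a single code point

print(len(pound))                  # 1
print(pound.encode("utf-8"))       # b'\xc2\xa3'  (two bytes)
print(len(pound.encode("utf-8")))  # 2
print(pound.encode("latin-1"))     # b'\xa3'  (one byte in latin-1)
```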

Things are starting to make sense now!

How does curl actually send that?

$ strace -f -e sendto -s 256 curl --header "ASCIITEST: GBP currency symbol £" http://localhost:8080/
sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: GBP currency symbol \302\243\r\n\r\n", 112, MSG_NOSIGNAL, NULL, 0) = 112

OK, makes sense.

Hmm, but how does something not in extended ascii get encoded wrong?

OK, this seems to be the key bit here

    is_ascii = 0;
    struct_size = sizeof(PyCompactUnicodeObject);
    if (maxchar < 128) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
        is_ascii = 1;
        struct_size = sizeof(PyASCIIObject);
    }
    else if (maxchar < 256) {
        kind = PyUnicode_1BYTE_KIND;
        char_size = 1;
    }

So is_ascii is only set when maxchar < 128.

Hmm, what happens if we actually send some Unicode with this change?

curl --header "ASCIITEST: Have some unicode 💩" http://localhost:8080/test/
'Have some unicode 💩' (22) [ascii]

Well it came through OK (pardon the pun!) but it's labelled as ascii; I bet isascii() simply checks the is_ascii flag...

Is this what you'd expect? So it seems you can either have everything treated as Unicode or everything treated as ASCII...

@RomainMou
Contributor Author

Oh right, I wanted to test that but I forgot. No, it's not what I'd expect.

I have really limited knowledge of C programming, so I just made dirty proof of concepts:

With something like:

static PyObject *
nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
{
    char      *src;
    PyObject  *res;

    src = nxt_unit_sptr_get(&f->value);
    res = PyUnicode_FromString(src);
    return res;
}

I get something that seems better:

curl --header "ASCIITEST: unicorn" http://localhost:8000/test/ascii
'unicorn' (7) [ascii]
curl --header "ASCIITEST: é_è" http://localhost:8000/test/ascii
'é_è' (3) [non-ascii]
curl --header "ASCIITEST: I'm a unicorn now 🦄" http://localhost:8000/test/ascii
'I'm a unicorn now 🦄' (19) [non-ascii]

For reference, the current Unit outputs, respectively:

'unicorn' (7) [non-ascii]
'é_è' (5) [non-ascii]
'I'm a unicorn now ð
                    ' (22) [non-ascii]

And Gunicorn output:

'unicorn' (7) [ascii]
'é_è' (5) [non-ascii]
'I'm a unicorn now ð
                    ' (22) [non-ascii]

With res = PyUnicode_DecodeLatin1(src, strlen(src), NULL); I get what's closest to the current behavior, with the isascii() fixed:

curl --header "ASCIITEST: unicorn" http://localhost:8000/test/ascii
'unicorn' (7) [ascii]
curl --header "ASCIITEST: é_è" http://localhost:8000/test/ascii
'é_è' (5) [non-ascii]
curl --header "ASCIITEST: I'm a unicorn now 🦄" http://localhost:8000/test/ascii
'I'm a unicorn now ð
                    ' (22) [non-ascii]

To be honest, I don't know what's the right move here. I still think the result of isascii() should be correct, but I'm not an expert in how a web server should parse header values.
And I don't know if it's feasible to do something like PyUnicode_DecodeLatin1 that doesn't break the Python 2 stuff, for example.
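The latin1 behavior above can be modelled in pure Python: decoding UTF-8 bytes with latin-1 turns each byte into its own character, which is exactly where the (5) length for 'é_è' comes from:

```python
raw = "é_è".encode("utf-8")  # 5 bytes on the wire: b'\xc3\xa9_\xc3\xa8'

# latin-1 maps every byte to one character, so the UTF-8 bytes
# come out as mojibake with length 5...
print(raw.decode("latin-1"))       # Ã©_Ã¨
print(len(raw.decode("latin-1")))  # 5

# ...while a proper UTF-8 decode recovers the 3 original characters.
print(raw.decode("utf-8"))         # é_è
print(len(raw.decode("utf-8")))    # 3
```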

@ac000
Member

ac000 commented May 12, 2023

Hmm, PyUnicode_FromString() does seem to do the right thing. I'll need to see if I can get it to play nice with the rest of the function.

Thanks!

@ac000 ac000 self-assigned this May 12, 2023
@ac000
Member

ac000 commented May 17, 2023

OK, so this seems to work for all cases: unique and duplicate Unicode and ASCII headers. Unicode flows through correctly and .isascii() does the right thing.

diff --git a/src/python/nxt_python_wsgi.c b/src/python/nxt_python_wsgi.c
index dfb31509..a925b113 100644
--- a/src/python/nxt_python_wsgi.c
+++ b/src/python/nxt_python_wsgi.c
@@ -863,11 +863,34 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
     char      *p, *src;
     PyObject  *res;
 
+    src = nxt_unit_sptr_get(&f->value);
+
 #if PY_MAJOR_VERSION == 3
-    res = PyUnicode_New(vl, 255);
+    if (nxt_slow_path(n > 1)) {
+        char  *ptr;
+
+        p = nxt_malloc(vl + 1);
+        ptr = p;
+        p = nxt_cpymem(p, src, f->value_length);
+
+        for (i = 1; i < n; i++) {
+            p = nxt_cpymem(p, ", ", 2);
+
+            src = nxt_unit_sptr_get(&f[i].value);
+            p = nxt_cpymem(p, src, f[i].value_length);
+        };
+        *p = '\0';
+
+        src = ptr;
+    }
+
+    res = PyUnicode_FromString(src);
+
+    if (nxt_slow_path(n > 1)) {
+        nxt_free(src);
+    }
 #else
     res = PyString_FromStringAndSize(NULL, vl);
-#endif
 
     if (nxt_slow_path(res == NULL)) {
         return NULL;
@@ -875,7 +898,6 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
 
     p = PyString_AS_STRING(res);
 
-    src = nxt_unit_sptr_get(&f->value);
     p = nxt_cpymem(p, src, f->value_length);
 
     for (i = 1; i < n; i++) {
@@ -884,6 +906,7 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
         src = nxt_unit_sptr_get(&f[i].value);
         p = nxt_cpymem(p, src, f[i].value_length);
     }
+#endif
 
     return res;
 }

Heh, I ain't touching the Python 2.7 code!

There are likely many other problematic areas, e.g. I haven't touched the header names, and likely other places I've seen. But they can wait until they are actually deemed an issue.

@ac000
Member

ac000 commented May 19, 2023

Slightly modified patch

diff --git a/src/python/nxt_python_wsgi.c b/src/python/nxt_python_wsgi.c
index dfb31509..9cbae708 100644
--- a/src/python/nxt_python_wsgi.c
+++ b/src/python/nxt_python_wsgi.c
@@ -863,11 +863,34 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
     char      *p, *src;
     PyObject  *res;
 
+    src = nxt_unit_sptr_get(&f->value);
+
 #if PY_MAJOR_VERSION == 3
-    res = PyUnicode_New(vl, 255);
+    if (nxt_slow_path(n > 1)) {
+        char  *ptr;
+
+        p = nxt_malloc(vl + 1);
+        ptr = p;
+        p = nxt_cpymem(p, src, f->value_length);
+
+        for (i = 1; i < n; i++) {
+            p = nxt_cpymem(p, ", ", 2);
+
+            src = nxt_unit_sptr_get(&f[i].value);
+            p = nxt_cpymem(p, src, f[i].value_length);
+        };
+        *p = '\0';
+
+        src = ptr;
+    }
+
+    res = PyUnicode_DecodeCharmap(src, vl, NULL, NULL);
+
+    if (nxt_slow_path(n > 1)) {
+        nxt_free(src);
+    }
 #else
     res = PyString_FromStringAndSize(NULL, vl);
-#endif
 
     if (nxt_slow_path(res == NULL)) {
         return NULL;
@@ -875,7 +898,6 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
 
     p = PyString_AS_STRING(res);
 
-    src = nxt_unit_sptr_get(&f->value);
     p = nxt_cpymem(p, src, f->value_length);
 
     for (i = 1; i < n; i++) {
@@ -884,6 +906,7 @@ nxt_python_field_value(nxt_unit_field_t *f, int n, uint32_t vl)
         src = nxt_unit_sptr_get(&f[i].value);
         p = nxt_cpymem(p, src, f[i].value_length);
     }
+#endif
 
     return res;
 }

The only difference is the use of PyUnicode_DecodeCharmap() instead of PyUnicode_FromString().

While PyUnicode_FromString() generally worked, it did break one of our tests where we send a 0xFF byte as latin1 (where it is a valid character) but which is not valid in UTF-8; PyUnicode_FromString() expects a valid UTF-8 string.

The differences between the two are

PyUnicode_FromString() with UTF-8 being returned from the application: return (bytes(resp, 'utf8'))

$ strace -s 256 -e sendto,recvfrom curl --header "ASCIITEST: £" http://localhost:8080/
sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: \302\243\r\n\r\n", 92, MSG_NOSIGNAL, NULL, 0) = 92
recvfrom(5, "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nServer: Unit/1.30.0\r\nDate: Fri, 19 May 2023 14:07:59 GMT\r\nTransfer-Encoding: chunked\r\n\r\n16\r\n'\302\243' (1) [non-ascii]\n\n\r\n0\r\n\r\n", 102400, 0, NULL, NULL) = 164
'£' (1) [non-ascii]

So we're sending and receiving UTF-8 (\302\243), and the string length according to Python is 1. Is it storing it as latin1?

PyUnicode_DecodeCharmap() with latin1 being returned from the application: return (bytes(resp, 'latin1'))

$ strace -s 256 -e sendto,recvfrom curl --header "ASCIITEST: £" http://localhost:8080/
sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: \302\243\r\n\r\n", 92, MSG_NOSIGNAL, NULL, 0) = 92
recvfrom(5, "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nServer: Unit/1.30.0\r\nDate: Fri, 19 May 2023 14:12:16 GMT\r\nTransfer-Encoding: chunked\r\n\r\n16\r\n'\302\243' (2) [non-ascii]\n\n\r\n0\r\n\r\n", 102400, 0, NULL, NULL) = 164
'£' (2) [non-ascii]

So, again we're sending and receiving UTF-8, but now the string length according to Python is 2, which matches what strlen("£") returns in C.

@tippexs
Contributor

tippexs commented Sep 8, 2023

@RomainMou does the patch work for you?
@ac000 looks like something we can push into 1.32?

@RomainMou
Contributor Author

RomainMou commented Sep 13, 2023

Hi @tippexs, the patch is working fine for me.

ac000 added a commit to ac000/unit that referenced this issue Nov 8, 2023
On GitHub, @RomainMou reported an issue whereby HTTP header field values
were being incorrectly reported as non-ascii by the Python .isascii()
method.

For example, using the following test application

  def application(environ, start_response):
      t = environ['HTTP_ASCIITEST']

      t = "'" + t + "'" +  " (" + str(len(t)) + ")"

      if t.isascii():
          t = t + " [ascii]"
      else:
          t = t + " [non-ascii]"

      resp = t + "\n\n"

      start_response("200 OK", [("Content-Type", "text/plain")])
      return (bytes(resp, 'latin1'))

You would see the following

  $ curl -H "ASCIITEST: $" http://localhost:8080/
  '$' (1) [non-ascii]

'$' has an ASCII code of 0x24 (36).

The initial idea was to adjust the second parameter to the
PyUnicode_New() call from 255 to 127. This unfortunately had the
opposite effect.

  $ curl -H "ASCIITEST: $" http://localhost:8080/
  '$' (1) [ascii]

Good. However...

  $ curl -H "ASCIITEST: £" http://localhost:8080/
  '£' (2) [ascii]

Not good. Let's take a closer look at this.

'£' is not in basic ASCII, but is in extended ASCII with a value of 0xA3
(163). Its UTF-8 encoding is 0xC2 0xA3, hence the length of 2 bytes
above.

  $ strace -s 256 -e sendto,recvfrom curl -H "ASCIITEST: £" http://localhost:8080/
  sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: \302\243\r\n\r\n", 92, MSG_NOSIGNAL, NULL, 0) = 92
  recvfrom(5, "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nServer: Unit/1.30.0\r\nDate: Mon, 22 May 2023 12:44:11 GMT\r\nTransfer-Encoding: chunked\r\n\r\n12\r\n'\302\243' (2) [ascii]\n\n\r\n0\r\n\r\n", 102400, 0, NULL, NULL) = 160
  '£' (2) [ascii]

So we can see curl sent it UTF-8 encoded '\302\243' which is C octal
escaped UTF-8 for 0xC2 0xA3, and we got the same back. But it should not
be marked as ASCII.

When doing PyUnicode_New(size, 127) it sets the buffer as ASCII. So we
need to use another function, and that function would appear to be

  PyUnicode_DecodeCharmap()

which creates a Unicode object with the correct ascii/non-ascii
properties based on the character encoding.

With this function we now get

  $ curl -H "ASCIITEST: $" http://localhost:8080/
  '$' (1) [ascii]

  $ curl -H "ASCIITEST: £" http://localhost:8080/
  '£' (2) [non-ascii]

and for good measure

  $ curl -H "ASCIITEST: $ £" http://localhost:8080/
  '$ £' (4) [non-ascii]

  $ curl -H "ASCIITEST: $" -H "ASCIITEST: £" http://localhost:8080/
  '$, £' (5) [non-ascii]

PyUnicode_DecodeCharmap() does require having the full string upfront so
we need to build up the potentially comma separated header field values
string before invoking this function.

I did not want to touch the Python 2.7 code (which may or may not even
be affected by this) so kept these changes completely isolated from
that, hence a slight duplication with the for () loop.

Python 2.7 was sunset on January 1st 2020[0], so this code will
hopefully just disappear soon anyway.

I also purposefully didn't touch other code that may well have similar
issues (such as the HTTP header field names) if we ever get issue
reports about them, we'll deal with them then.

[0]: <https://www.python.org/doc/sunset-python-2/>

Link: <https://docs.python.org/3/c-api/unicode.html>
Closes: <nginx#868>
Reported-by: RomainMou <https://github.com/RomainMou>
Tested-by: RomainMou <https://github.com/RomainMou>
Signed-off-by: Andrew Clayton <[email protected]>
ac000 added a commit to ac000/unit that referenced this issue Nov 8, 2023
ac000 added a commit to ac000/unit that referenced this issue Nov 9, 2023
On GitHub, @RomainMou reported an issue whereby HTTP header field values
where being incorrectly reported as non-ascii by the Python .isacii()
method.

For example, using the following test application

  def application(environ, start_response):
      t = environ['HTTP_ASCIITEST']

      t = "'" + t + "'" +  " (" + str(len(t)) + ")"

      if t.isascii():
          t = t + " [ascii]"
      else:
          t = t + " [non-ascii]"

      resp = t + "\n\n"

      start_response("200 OK", [("Content-Type", "text/plain")])
      return (bytes(resp, 'latin1'))

You would see the following

  $ curl -H "ASCIITEST: $" http://localhost:8080/
  '$' (1) [non-ascii]

'$' has an ASCII code of 0x24 (36).

The initial idea was to adjust the second parameter to the
PyUnicode_New() call from 255 to 127. This unfortunately had the
opposite effect.

  $ curl -H "ASCIITEST: $" http://localhost:8080/
  '$' (1) [ascii]

Good. However...

  $ curl -H "ASCIITEST: £" http://localhost:8080/
  '£' (2) [ascii]

Not good. Let's take a closer look at this.

'£' is not in basic ASCII, but is in extended ASCII with a value of 0xA3
(163). Its UTF-8 encoding is 0xC2 0xA3, hence the length of 2 bytes
above.

  $ strace -s 256 -e sendto,recvfrom curl -H "ASCIITEST: £" http://localhost:8080/
  sendto(5, "GET / HTTP/1.1\r\nHost: localhost:8080\r\nUser-Agent: curl/8.0.1\r\nAccept: */*\r\nASCIITEST: \302\243\r\n\r\n", 92, MSG_NOSIGNAL, NULL, 0) = 92
  recvfrom(5, "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\nServer: Unit/1.30.0\r\nDate: Mon, 22 May 2023 12:44:11 GMT\r\nTransfer-Encoding: chunked\r\n\r\n12\r\n'\302\243' (2) [ascii]\n\n\r\n0\r\n\r\n", 102400, 0, NULL, NULL) = 160
  '£' (2) [ascii]

So we can see curl sent it UTF-8 encoded '\302\243\' which is C octal
escaped UTF-8 for 0xC2 0xA3, and we got the same back. But it should not
be marked as ASCII.

PyUnicode_New(size, 127) marks the resulting buffer as ASCII regardless
of its contents. So we need another function, and that function would
appear to be

  PyUnicode_DecodeCharmap()

which creates a Unicode object with the correct ascii/non-ascii
properties based on the character encoding.
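Per the CPython docs, passing a NULL mapping to PyUnicode_DecodeCharmap()
applies Latin-1 decoding; the Python-level equivalent is
bytes.decode('latin-1'), which flags the result as ASCII only when every
byte is below 0x80:

```python
def decode_header_value(raw: bytes) -> str:
    # Python-level equivalent of PyUnicode_DecodeCharmap() with a NULL
    # mapping: each byte maps to the code point of the same value, and
    # CPython flags the result ASCII only if all bytes are < 0x80.
    return raw.decode("latin-1")

assert decode_header_value(b"$").isascii()
assert not decode_header_value(b"\xc2\xa3").isascii()
```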

With this function we now get

  $ curl -H "ASCIITEST: $" http://localhost:8080/
  '$' (1) [ascii]

  $ curl -H "ASCIITEST: £" http://localhost:8080/
  '£' (2) [non-ascii]

and for good measure

  $ curl -H "ASCIITEST: $ £" http://localhost:8080/
  '$ £' (4) [non-ascii]

  $ curl -H "ASCIITEST: $" -H "ASCIITEST: £" http://localhost:8080/
  '$, £' (5) [non-ascii]

PyUnicode_DecodeCharmap() does require having the full string upfront,
so we need to build up the potentially comma-separated header field
values string before invoking this function.
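A rough Python sketch of that approach (the names here are illustrative,
not the actual C code):

```python
def join_header_values(values: list[bytes]) -> str:
    # Concatenate repeated field values with ", " first, then decode the
    # complete byte string once, so the ascii/non-ascii flag reflects
    # the whole value rather than just the first fragment.
    return b", ".join(values).decode("latin-1")

assert join_header_values([b"$"]).isascii()
# '$' plus UTF-8 '£' (0xC2 0xA3) joins to 5 characters, flagged non-ascii.
assert len(join_header_values([b"$", b"\xc2\xa3"])) == 5
assert not join_header_values([b"$", b"\xc2\xa3"]).isascii()
```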

I did not want to touch the Python 2.7 code (which may or may not even
be affected by this), so I kept these changes completely isolated from
it, hence a slight duplication with the for () loop.

Python 2.7 was sunset on January 1st 2020[0], so this code will
hopefully just disappear soon anyway.

I also purposefully didn't touch other code that may well have similar
issues (such as the HTTP header field names). If we ever get issue
reports about them, we'll deal with them then.

[0]: <https://www.python.org/doc/sunset-python-2/>

Link: <https://docs.python.org/3/c-api/unicode.html>
Closes: <nginx#868>
Reported-by: RomainMou <https://github.com/RomainMou>
Tested-by: RomainMou <https://github.com/RomainMou>
Signed-off-by: Andrew Clayton <[email protected]>
ac000 added a commit to ac000/unit that referenced this issue Nov 9, 2023
@ac000
Member

ac000 commented Nov 10, 2023

The fix for this was merged.

@RomainMou

Thanks for your work on this.

@ac000 ac000 closed this as completed Nov 10, 2023
andrey-zelenkov pushed a commit that referenced this issue Feb 27, 2024