Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check unicode links correctly #3

Closed
wants to merge 2 commits into from
Closed

Conversation

Flimm
Copy link
Contributor

@Flimm Flimm commented Sep 24, 2015

Take a web page, which is correctly encoded in UTF-8 and has set the UTF-8 header correctly, that has a link like this:

<a href="https://fr.wikipedia.org/wiki/Café_(établissement)">café</a>

If you click on the link, the browser encodes/escapes the URL to this:

https://fr.wikipedia.org/wiki/Caf%C3%A9_%28%C3%A9tablissement%29

And sends a GET request with that URL. If the browser forgets to do encode/escape it, some servers won't accept the request (although it seems some servers do recover well).

The patch included fixes check-pages to handle this use-case correctly, even when the URL contains white-space, and it includes a test.

@DavidAnson
Copy link
Owner

Thank you! I'm working on a different task at the moment, but will come back to this soon. For the time being, I have a couple of questions about your changes and will leave them inline. Appreciate the help! :)

@@ -50,7 +50,7 @@ function runTest(options, callback) {
function testOutput(test, log, error, exception) {
return function(err, context, count) {
if (err || exception) {
test.equal(err.message, exception, 'Wrong exception text');
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was kind of a deliberate attempt to ensure that err always had a .message property (i.e., was an Error). Do you recall the scenario that required you to make a change here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I don't remember. I was getting an error (something like "undefined has no message property"), and it wasn't clear to me whether the error was coming from this line or from checkPages.js, so I decided to fix it here to reduce confusion for future me. If err doesn't have a message property, this test will still fail.

@Flimm
Copy link
Contributor Author

Flimm commented Sep 28, 2015

I really dug in to this bug, investigating with Wireshark. Firefox always URI encodes URLs before putting them in the GET HTTP header. However, it seems that Node's http library (which request uses) does not do the same. For characters greater than U+FFFF, it encodes to UTF-16 and then only sends the first byte for each character, because of the way Buffer(x, 'binary').toString() works.

This is bug #2114 in Node.js and I want to see if my pull request gets accepted.

@DavidAnson
Copy link
Owner

Thanks for the update! Now I'm curious as well. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants