-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use CMake to build uchardet and update upstream submodule #5
Conversation
uchardet now returns "ASCII" when the input was ASCII and only returns "" when it failed to detect a charset
I'm going to test this on Windows (GNU and MSVC) in a couple of hours. |
As for Windows, I'm happy to set up an appveyor.yml file so that it stays Le 26 oct. 2016 8:54 AM, "Eric Kidd" [email protected] a écrit :
|
This is a fantastic idea! I'll take a look and merge it shortly. If not,
ping me.
If this changes the charsets detected for plain ASCII inputs, that will be
a breaking change for semver. No worries; I was thinking about a version
bump anyway.
|
(At least on my machine...) Setting up Windows CI is a good idea. I only tested an x64 build with Visual Studio 15 (preview version; had to make cmake-rs recognize it first...). Now on to windows-gnu... |
Got it (windows-gnu) working after I rediscovered the solution I found two month ago. Depends on rust-lang/cmake-rs#17. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really great patch, and I'm looking forward to merging it as soon as I can test it against substudy. Happily, I was already planning to update substudy (and overhaul the build system for rust-uchardet), and you've just saved me a ton of work! 👌🏻
I ask several questions below and suggest a couple of minor changes, but this is a great PR and you've really made my day.
@@ -3,7 +3,7 @@ sudo: required | |||
rust: | |||
- nightly | |||
- beta | |||
- 1.0.0 | |||
- stable |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we want to always require the latest stable Rust, or do we want to support back to some specific version? I could be convinced to go either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really have an opinion on that (I'm normally running nightly). Theoretically (I believe) 1.2 should suffice as the incompatibilty was caused by the usage of debug builders which are stabilized since 1.2.
//! | ||
//! Note that the underlying implemention is written in C and C++, and I'm | ||
//! not aware of any security audits which have been performed against it. | ||
//! | ||
//! ``` | ||
//! use uchardet::detect_encoding_name; | ||
//! | ||
//! assert_eq!(Some("windows-1252".to_string()), | ||
//! detect_encoding_name(&[0x66u8, 0x72, 0x61, 0x6e, 0xe7, | ||
//! 0x61, 0x69, 0x73]).unwrap()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the detector still detect "windows-1252"
and output that string? I notice that this PR removes several tests related to this encoding.
For my use case, I encounter a lot of input data in "windows-1252"
(including older subtitle data for European films), and I want to make sure the detector can still detect that format correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It outputs "WINDOWS-1252"
when e.g. , 0x80
is appended to the array. I just changed it so the encoded string isn't totally random but I can also change the examples back to windows-1252 if you like.
This character encoding is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range.
Relevant wiki pages:
https://en.wikipedia.org/wiki/Windows-1252
https://en.wikipedia.org/wiki/ISO/IEC_8859-1
@@ -54,27 +53,30 @@ struct EncodingDetector { | |||
} | |||
|
|||
/// Return the name of the charset used in `data`, or `None` if the | |||
/// charset is ASCII or if the encoding can't be detected. This is | |||
/// charset if the encoding can't be detected. This is |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function can no longer return None
, so this comment is now incorrect.
/// `None` on error, or if the data appears to be ASCII. | ||
fn charset(&self) -> Option<String> { | ||
fn charset(&self) -> Result<String, EncodingDetectorError> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This type is written as EncodingDetectorResult<String>
above, but Result<String, EncodingDetectorError>
here. At the least we should be consistent—but I'm halfway tempted to convert this library to use error-chain
since we're making breaking API changes, and since our error type is relatively heavyweight already thanks to the message: String
field. What do you think?
I'm totally happy to handle the error-chain
conversion myself; I've done several of them lately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to try the conversion. Though it's my first time using quick-error
so tell me if I'm doing dumb stuff.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ehm, I meant error-chain
...
Ok("") => None, | ||
Ok(encoding) => Some(encoding.to_string()) | ||
Ok("") => Err(EncodingDetectorError { | ||
message: "uchardet failed to recognize a charset".to_string() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks like a good idea, I need to double-check this new behavior against substudy
to see if it still gets good results with older subtitle files.
And if I convert the library to use error-chain, I want to break this out as a distinct error type so that our callers can handle it specially.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will try to integrate this.
@@ -18,3 +18,4 @@ libc = "*" | |||
|
|||
[build-dependencies] | |||
pkg-config = '*' | |||
cmake = "*" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm quite happy to see this dependency, especially if it helps us build on Windows and stay in sync with upstream.
.arg(&src.join("uchardet")) | ||
.arg("-DCMAKE_BUILD_TYPE=Release") | ||
.arg(&format!("-DCMAKE_INSTALL_PREFIX={}", dst.display())) | ||
.arg(&format!("-DCMAKE_CXX_FLAGS={}", cxxflags))); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This massive code deletion is immensely satisfying.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a great conversion to error-chain
, and it makes things feel tidier.
I have a couple of very minor remarks below.
/// `None` on error, or if the data appears to be ASCII. | ||
fn charset(&self) -> Result<String, EncodingDetectorError> { | ||
/// Get the decoder's current best guess as to the encoding. May return | ||
/// an error if uchardet was unable to detect an encoding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing period.
/// detect_encoding_name("ascii".as_bytes()).unwrap()); | ||
/// assert_eq!("UTF-8".to_string(), | ||
/// detect_encoding_name("©français".as_bytes()).unwrap()); | ||
/// assert_eq!("ISO-8859-1".to_string(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These to_string
calls in these assertions are no longer necessary now that the surrounding Ok
has been removed from the strings, because you can compare &str
and String
, but not Ok(&str)
and Ok(String)
, if I recall correctly.
fn description(&self) -> &str { "encoding detector error" } | ||
fn cause(&self) -> Option<&Error> { None } | ||
} | ||
use errors::*; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try pub use
instead, so that users can access all the types used in our API signatures. Older Rust will fail to compile without this, and callers can't use links
to import our error chain into theirs.
let result = unsafe { | ||
ffi::uchardet_handle_data(self.ptr, data.as_ptr() as *const i8, | ||
data.len() as size_t) | ||
}; | ||
match result { | ||
0 => Ok(()), | ||
_ => { | ||
let msg = "Error handling data".to_string(); | ||
Err(EncodingDetectorError{message: msg}) | ||
Err(ErrorKind::DataHandlingError.into()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any other uchardet error codes we should map to separate ErrorKind
values? Or does it just return one cryptic error code for everything?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I see it, there are currently only two possible return values:
enum nsresult
{
NS_OK,
NS_ERROR_OUT_OF_MEMORY
};
source
In nsUniversalDetector::HandleData
only those values are return
ed but I don't think they guarantee that it stays like this in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, in this case, we should map nsresult
to Ok(())
, Err(ErrorKind::OutOfMemory)
and Err(ErrorKind::Other)
. We might want to do this using an impl From<nsresult> for ErrorKind
in the errors
module if nsresult
is an actual struct
or enum
in Rust. Or if nsresult
is just a type
alias, then we could define a private ErrorKind::from_nsresult
function.
@@ -128,9 +120,7 @@ impl EncodingDetector { | |||
match charset { | |||
Err(_) => | |||
panic!("uchardet_get_charset returned invalid value"), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know this is my code, but I'd like to change this panic!
message to something like "uchardet_get_charset returned a charset name containing invalid characters"
. I can do this when merging unless you want to do it yourself.
OK, here's my TODO list to get this patch merged and a 2.0.0 release made. Some of these items are for me, some you'll probably want to do yourself, and some are for either of us. Merge checklist:
I'll check off items as we do them. |
What about AppVeyor Windows CI (GNU and MSVC)? |
AppVeyor is added to the checklist! I need to do some setup on my end, and I already have a standard config file that I use for my portable projects. I'll take a look at it shortly. |
} | ||
|
||
impl ErrorKind { | ||
pub fn from_nsresult(nsresult: ::ffi::nsresult) -> ErrorKind { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't know how to make this private but still accessible for EncodingDetector::handle_data
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I could make a public free function error_kind_from_nsresult(res: i32) -> ErrorKind
in the errors
module and (in the crate root) only reexport errors::{Error, ErrorKind, ChainErr}
.
Does that seem fine?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, we should probably just hoist all the code in mod errors
up to the top of the crate for simplicity.
Thank you for taking the time to respond to all my excessively detailed review comments, by the way!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I put it in a module so I could #[allow(missing_docs)]
as it appears that you're unable to document the generated code (and I don't see another way to set that attribute). I'm currently trying to write up a PR for error-chain
so the generated code has #[allow(missing_docs)]
.
It's no problem at all; Rather it's really educational for me to have somebody more experienced guide me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(This comment is just a remark about something I noticed, not a feature request!)
You know, I think it might actually be possible to fix error-chain to support doc comments. Comments starting with ///
actually get transformed to #[doc...]
attributes before macros see them (if I recall correctly), so the macro could move them around and attach them to the right enum
members on output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually also thought about that possibility but after taking a glance at quick_error.rs I quickly abandoned that idea... 😅
(PR is up: rust-lang-deprecated/error-chain#50)
Looks great!
I'm thinking that from_nsresult might be better as private, especially
since it has that assertion.
|
OK, this patch is looking really great! I'm going to go ahead and merge it, because I want to take a look at #1 (which might be more feasible now that we have from https://github.com/hsivonen/encoding_rs) as well as BurntSushi/ripgrep#1 (which can't depend on any C code, but which would benefit from a Rust-native implementation of detecting the most common character sets). If you have any other tweaks, please feel free to open a new PR! And thank you for all your terrific work here. |
OK, I've set up AppVeyor for both Windows targets, I've updated Travis to trying building against 1.10.0. We can cross-compile from Linux to MinGW!
Unfortunately, AppVeyor fails for the MinGW build, because I don't seem to have everything installed:
If you have a spare moment, would you be interested in looking at this and offering me a hint? :-) |
I can take a look at it in a couple of hours after I'm back. |
Thank you! I don't have a working Windows setup at the moment, so I can't test any of the Windows builds locally, except for cross-compilation (which works). I really appreciate your help. Before we release this, we also need to take care of licensing. As the README.md points out, this crate is in the public domain so that it can be used without any restrictions at all, and to keep it that way, contributors need to include the following notice in one of their commit messages:
Once again, many thanks for helping update and port this project! |
Upstream change: see README
Also had to change the examples a bit as the algorithm changed and other charsets are detected.
ASCII
is now a valid result of uchardet so we can directly return aEncodingDetectorResult<String>
.This is a breaking change.