Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use CMake to build uchardet and update upstream submodule #5

Merged
merged 9 commits into from
Oct 29, 2016
Merged

Use CMake to build uchardet and update upstream submodule #5

merged 9 commits into from
Oct 29, 2016

Conversation

Bobo1239
Copy link
Contributor

Upstream change: see README

Also had to change the examples a bit as the algorithm changed and other charsets are detected. ASCII is now a valid result of uchardet so we can directly return a EncodingDetectorResult<String>.

This is a breaking change.

uchardet now returns "ASCII" when the input was ASCII and only returns "" when it failed to detect a charset
@Bobo1239
Copy link
Contributor Author

I'm going to test this on Windows (GNU and MSVC) in a couple of hours.

@emk emk self-assigned this Oct 26, 2016
@emk
Copy link
Owner

emk commented Oct 26, 2016

As for Windows, I'm happy to set up an appveyor.yml file so that it stays
working.

Le 26 oct. 2016 8:54 AM, "Eric Kidd" [email protected] a écrit :

This is a fantastic idea! I'll take a look and merge it shortly. If not,
ping me.

If this changes the charsets detected for plain ASCII inputs, that will be
a breaking change for semver. No worries; I was thinking about a version
bump anyway.

@emk
Copy link
Owner

emk commented Oct 26, 2016 via email

@Bobo1239
Copy link
Contributor Author

(At least on my machine...)

Setting up Windows CI is a good idea. I only tested an x64 build with Visual Studio 15 (preview version; had to make cmake-rs recognize it first...).

Now on to windows-gnu...

@Bobo1239
Copy link
Contributor Author

Bobo1239 commented Oct 26, 2016

Got it (windows-gnu) working after I rediscovered the solution I found two month ago. Depends on rust-lang/cmake-rs#17.

Copy link
Owner

@emk emk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really great patch, and I'm looking forward to merging it as soon as I can test it against substudy. Happily, I was already planning to update substudy (and overhaul the build system for rust-uchardet), and you've just saved me a ton of work! 👌🏻

I ask several questions below and suggest a couple of minor changes, but this is a great PR and you've really made my day.

@@ -3,7 +3,7 @@ sudo: required
rust:
- nightly
- beta
- 1.0.0
- stable
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to always require the latest stable Rust, or do we want to support back to some specific version? I could be convinced to go either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really have an opinion on that (I'm normally running nightly). Theoretically (I believe) 1.2 should suffice as the incompatibilty was caused by the usage of debug builders which are stabilized since 1.2.

//!
//! Note that the underlying implemention is written in C and C++, and I'm
//! not aware of any security audits which have been performed against it.
//!
//! ```
//! use uchardet::detect_encoding_name;
//!
//! assert_eq!(Some("windows-1252".to_string()),
//! detect_encoding_name(&[0x66u8, 0x72, 0x61, 0x6e, 0xe7,
//! 0x61, 0x69, 0x73]).unwrap());
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the detector still detect "windows-1252" and output that string? I notice that this PR removes several tests related to this encoding.

For my use case, I encounter a lot of input data in "windows-1252" (including older subtitle data for European films), and I want to make sure the detector can still detect that format correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It outputs "WINDOWS-1252" when e.g. , 0x80 is appended to the array. I just changed it so the encoded string isn't totally random but I can also change the examples back to windows-1252 if you like.

This character encoding is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range.

Relevant wiki pages:
https://en.wikipedia.org/wiki/Windows-1252
https://en.wikipedia.org/wiki/ISO/IEC_8859-1

@@ -54,27 +53,30 @@ struct EncodingDetector {
}

/// Return the name of the charset used in `data`, or `None` if the
/// charset is ASCII or if the encoding can't be detected. This is
/// charset if the encoding can't be detected. This is
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function can no longer return None, so this comment is now incorrect.

/// `None` on error, or if the data appears to be ASCII.
fn charset(&self) -> Option<String> {
fn charset(&self) -> Result<String, EncodingDetectorError> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type is written as EncodingDetectorResult<String> above, but Result<String, EncodingDetectorError> here. At the least we should be consistent—but I'm halfway tempted to convert this library to use error-chain since we're making breaking API changes, and since our error type is relatively heavyweight already thanks to the message: String field. What do you think?

I'm totally happy to handle the error-chain conversion myself; I've done several of them lately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to try the conversion. Though it's my first time using quick-error so tell me if I'm doing dumb stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehm, I meant error-chain...

Ok("") => None,
Ok(encoding) => Some(encoding.to_string())
Ok("") => Err(EncodingDetectorError {
message: "uchardet failed to recognize a charset".to_string()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good idea, I need to double-check this new behavior against substudy to see if it still gets good results with older subtitle files.

And if I convert the library to use error-chain, I want to break this out as a distinct error type so that our callers can handle it specially.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will try to integrate this.

@@ -18,3 +18,4 @@ libc = "*"

[build-dependencies]
pkg-config = '*'
cmake = "*"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quite happy to see this dependency, especially if it helps us build on Windows and stay in sync with upstream.

.arg(&src.join("uchardet"))
.arg("-DCMAKE_BUILD_TYPE=Release")
.arg(&format!("-DCMAKE_INSTALL_PREFIX={}", dst.display()))
.arg(&format!("-DCMAKE_CXX_FLAGS={}", cxxflags)));
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This massive code deletion is immensely satisfying.

Copy link
Owner

@emk emk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great conversion to error-chain, and it makes things feel tidier.

I have a couple of very minor remarks below.

/// `None` on error, or if the data appears to be ASCII.
fn charset(&self) -> Result<String, EncodingDetectorError> {
/// Get the decoder's current best guess as to the encoding. May return
/// an error if uchardet was unable to detect an encoding
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing period.

/// detect_encoding_name("ascii".as_bytes()).unwrap());
/// assert_eq!("UTF-8".to_string(),
/// detect_encoding_name("©français".as_bytes()).unwrap());
/// assert_eq!("ISO-8859-1".to_string(),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These to_string calls in these assertions are no longer necessary now that the surrounding Ok has been removed from the strings, because you can compare &str and String, but not Ok(&str) and Ok(String), if I recall correctly.

fn description(&self) -> &str { "encoding detector error" }
fn cause(&self) -> Option<&Error> { None }
}
use errors::*;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try pub use instead, so that users can access all the types used in our API signatures. Older Rust will fail to compile without this, and callers can't use links to import our error chain into theirs.

let result = unsafe {
ffi::uchardet_handle_data(self.ptr, data.as_ptr() as *const i8,
data.len() as size_t)
};
match result {
0 => Ok(()),
_ => {
let msg = "Error handling data".to_string();
Err(EncodingDetectorError{message: msg})
Err(ErrorKind::DataHandlingError.into())
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any other uchardet error codes we should map to separate ErrorKind values? Or does it just return one cryptic error code for everything?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I see it, there are currently only two possible return values:

enum nsresult
{
    NS_OK,
    NS_ERROR_OUT_OF_MEMORY
};

source
In nsUniversalDetector::HandleData only those values are returned but I don't think they guarantee that it stays like this in the future.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, in this case, we should map nsresult to Ok(()), Err(ErrorKind::OutOfMemory) and Err(ErrorKind::Other). We might want to do this using an impl From<nsresult> for ErrorKind in the errors module if nsresult is an actual struct or enum in Rust. Or if nsresult is just a type alias, then we could define a private ErrorKind::from_nsresult function.

@@ -128,9 +120,7 @@ impl EncodingDetector {
match charset {
Err(_) =>
panic!("uchardet_get_charset returned invalid value"),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is my code, but I'd like to change this panic! message to something like "uchardet_get_charset returned a charset name containing invalid characters". I can do this when merging unless you want to do it yourself.

@emk
Copy link
Owner

emk commented Oct 27, 2016

OK, here's my TODO list to get this patch merged and a 2.0.0 release made. Some of these items are for me, some you'll probably want to do yourself, and some are for either of us.

Merge checklist:

  • @Bobo1239 Add back unit tests for "WINDOWS-1252" (preferably using 1252 smart quotes or something like that).
  • @emk Decide what the oldest Rust version we want to support is. Debian testing seems to have 1.10; we could try that, maybe?
  • Fix charset name decoding panic message to be more informative.
  • Fix missing period.
  • @Bobo1239 Remove all unnecessary to_string calls from doc tests.
  • Check what error codes are returned from uchardet and decide if we want to break out any of them as separate Rust-level errors.
  • @Bobo1239 pub use errors::*;
  • @emk Test a build against emk/substudy to make sure that we still produce good results with older real-world data (from subtitle files in this case).
  • @emk Set up AppVeyor on this PR branch so it will merge over to master with the PR.

I'll check off items as we do them.

@Bobo1239
Copy link
Contributor Author

What about AppVeyor Windows CI (GNU and MSVC)?

@emk
Copy link
Owner

emk commented Oct 27, 2016

AppVeyor is added to the checklist! I need to do some setup on my end, and I already have a standard config file that I use for my portable projects. I'll take a look at it shortly.

}

impl ErrorKind {
pub fn from_nsresult(nsresult: ::ffi::nsresult) -> ErrorKind {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to make this private but still accessible for EncodingDetector::handle_data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could make a public free function error_kind_from_nsresult(res: i32) -> ErrorKind in the errors module and (in the crate root) only reexport errors::{Error, ErrorKind, ChainErr}.
Does that seem fine?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In that case, we should probably just hoist all the code in mod errors up to the top of the crate for simplicity.

Thank you for taking the time to respond to all my excessively detailed review comments, by the way!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I put it in a module so I could #[allow(missing_docs)] as it appears that you're unable to document the generated code (and I don't see another way to set that attribute). I'm currently trying to write up a PR for error-chain so the generated code has #[allow(missing_docs)].

It's no problem at all; Rather it's really educational for me to have somebody more experienced guide me.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(This comment is just a remark about something I noticed, not a feature request!)

You know, I think it might actually be possible to fix error-chain to support doc comments. Comments starting with /// actually get transformed to #[doc...] attributes before macros see them (if I recall correctly), so the macro could move them around and attach them to the right enum members on output.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I actually also thought about that possibility but after taking a glance at quick_error.rs I quickly abandoned that idea... 😅

(PR is up: rust-lang-deprecated/error-chain#50)

@emk
Copy link
Owner

emk commented Oct 28, 2016 via email

@emk emk merged commit 2281551 into emk:master Oct 29, 2016
@emk
Copy link
Owner

emk commented Oct 29, 2016

OK, this patch is looking really great!

I'm going to go ahead and merge it, because I want to take a look at #1 (which might be more feasible now that we have from https://github.com/hsivonen/encoding_rs) as well as BurntSushi/ripgrep#1 (which can't depend on any C code, but which would benefit from a Rust-native implementation of detecting the most common character sets).

If you have any other tweaks, please feel free to open a new PR! And thank you for all your terrific work here.

@emk
Copy link
Owner

emk commented Oct 29, 2016

OK, I've set up AppVeyor for both Windows targets, I've updated Travis to trying building against 1.10.0. We can cross-compile from Linux to MinGW!

rustup run nightly cargo build --target x86_64-pc-windows-gnu
   Compiling uchardet-sys v1.0.0 (file:///home/emk/w/src/rust-uchardet/uchardet-sys)
   Compiling dbghelp-sys v0.2.0
   Compiling kernel32-sys v0.2.2
   Compiling backtrace v0.2.3
   Compiling error-chain v0.5.0
   Compiling uchardet v1.0.0 (file:///home/emk/w/src/rust-uchardet)
    Finished debug [unoptimized + debuginfo] target(s) in 2.35 secs

Unfortunately, AppVeyor fails for the MinGW build, because I don't seem to have everything installed:

CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
CMake Error: CMAKE_AR was not found, please set to archive program.

If you have a spare moment, would you be interested in looking at this and offering me a hint? :-)

@Bobo1239
Copy link
Contributor Author

I can take a look at it in a couple of hours after I'm back.

@emk
Copy link
Owner

emk commented Oct 29, 2016

Thank you! I don't have a working Windows setup at the moment, so I can't test any of the Windows builds locally, except for cross-compilation (which works). I really appreciate your help.

Before we release this, we also need to take care of licensing. As the README.md points out, this crate is in the public domain so that it can be used without any restrictions at all, and to keep it that way, contributors need to include the following notice in one of their commit messages:

I dedicate any and all copyright interest in my contributions to this project to the public domain. I make this dedication for the benefit of the public at large and to the detriment of my heirs and successors. I intend this dedication to be an overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law.

Once again, many thanks for helping update and port this project!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants