Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use CMake to build uchardet and update upstream submodule #5

Merged
merged 9 commits into from
Oct 29, 2016
Merged
Show file tree
Hide file tree
Changes from 5 commits
Commits
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion .gitmodules
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
[submodule "uchardet-sys/uchardet"]
path = uchardet-sys/uchardet
url = https://github.com/BYVoid/uchardet
url = https://anongit.freedesktop.org/git/uchardet/uchardet.git
2 changes: 1 addition & 1 deletion .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ sudo: required
rust:
- nightly
- beta
- 1.0.0
- stable
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to always require the latest stable Rust, or do we want to support back to some specific version? I could be convinced to go either way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really have an opinion on that (I'm normally running nightly). Theoretically (I believe) 1.2 should suffice as the incompatibilty was caused by the usage of debug builders which are stabilized since 1.2.

before_script:
- |
pip install 'travis-cargo<0.2' --user &&
Expand Down
46 changes: 25 additions & 21 deletions src/lib.rs
Original file line number Diff line number Diff line change
@@ -1,14 +1,13 @@
//! A wrapper around the uchardet library. Detects character encodings.
//! A wrapper around the uchardet library. Detects character encodings.
//!
//! Note that the underlying implemention is written in C and C++, and I'm
//! not aware of any security audits which have been performed against it.
//!
//! ```
//! use uchardet::detect_encoding_name;
//!
//! assert_eq!(Some("windows-1252".to_string()),
//! detect_encoding_name(&[0x66u8, 0x72, 0x61, 0x6e, 0xe7,
//! 0x61, 0x69, 0x73]).unwrap());
//! assert_eq!(Ok("ISO-8859-1".to_string()),
//! detect_encoding_name(&[0x46, 0x72, 0x61, 0x6e, 0xe7, 0x6f, 0x69, 0x73, 0xe9]));
//! ```
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the detector still detect "windows-1252" and output that string? I notice that this PR removes several tests related to this encoding.

For my use case, I encounter a lot of input data in "windows-1252" (including older subtitle data for European films), and I want to make sure the detector can still detect that format correctly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It outputs "WINDOWS-1252" when e.g. , 0x80 is appended to the array. I just changed it so the encoded string isn't totally random but I can also change the examples back to windows-1252 if you like.

This character encoding is a superset of ISO 8859-1 in terms of printable characters, but differs from the IANA's ISO-8859-1 by using displayable characters rather than control characters in the 80 to 9F (hex) range.

Relevant wiki pages:
https://en.wikipedia.org/wiki/Windows-1252
https://en.wikipedia.org/wiki/ISO/IEC_8859-1

//!
//! For more information, see [this project on
Expand All @@ -27,7 +26,7 @@ use std::ffi::CStr;
use std::str::from_utf8;

/// An error occurred while trying to detect the character encoding.
#[derive(Debug)]
#[derive(Debug, PartialEq)]
pub struct EncodingDetectorError {
message: String
}
Expand All @@ -54,27 +53,30 @@ struct EncodingDetector {
}

/// Return the name of the charset used in `data`, or `None` if the
/// charset is ASCII or if the encoding can't be detected. This is
/// charset if the encoding can't be detected. This is
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function can no longer return None, so this comment is now incorrect.

/// the value returned by the underlying `uchardet` library, with
/// the empty string mapped to `None`.
///
/// ```
/// use uchardet::detect_encoding_name;
///
/// assert_eq!(None, detect_encoding_name("ascii".as_bytes()).unwrap());
/// assert_eq!(Some("UTF-8".to_string()),
/// detect_encoding_name("français".as_bytes()).unwrap());
/// assert_eq!(Some("windows-1252".to_string()),
/// detect_encoding_name(&[0x66u8, 0x72, 0x61, 0x6e, 0xe7,
/// 0x61, 0x69, 0x73]).unwrap());
/// assert_eq!(Ok("ASCII".to_string()),
/// detect_encoding_name("ascii".as_bytes()));
/// assert_eq!(Ok("UTF-8".to_string()),
/// detect_encoding_name("©français".as_bytes()));
/// assert_eq!(Ok("ISO-8859-1".to_string()),
/// detect_encoding_name(&[0x46, 0x72, 0x61, 0x6e, 0xe7, 0x6f, 0x69, 0x73, 0xe9]));



/// ```
pub fn detect_encoding_name(data: &[u8]) ->
EncodingDetectorResult<Option<String>>
EncodingDetectorResult<String>
{
let mut detector = EncodingDetector::new();
try!(detector.handle_data(data));
detector.data_end();
Ok(detector.charset())
detector.charset()
}

impl EncodingDetector {
Expand All @@ -85,7 +87,7 @@ impl EncodingDetector {
EncodingDetector{ptr: ptr}
}

/// Pass a chunk of raw bytes to the detector. This is a no-op if a
/// Pass a chunk of raw bytes to the detector. This is a no-op if a
/// charset has been detected.
fn handle_data(&mut self, data: &[u8]) -> EncodingDetectorResult<()> {
let result = unsafe {
Expand All @@ -102,9 +104,9 @@ impl EncodingDetector {
}

/// Notify the detector that we're done calling `handle_data`, and that
/// we want it to make a guess as to our encoding. This is a no-op if
/// we want it to make a guess as to our encoding. This is a no-op if
/// no data has been passed yet, or if an encoding has been detected
/// for certain. From reading the code, it appears that you can safely
/// for certain. From reading the code, it appears that you can safely
/// call `handle_data` after calling this, but I'm not certain.
fn data_end(&mut self) {
unsafe { ffi::uchardet_data_end(self.ptr); }
Expand All @@ -115,9 +117,9 @@ impl EncodingDetector {
// unsafe { ffi::uchardet_reset(self.ptr); }
//}

/// Get the decoder's current best guess as to the encoding. Returns
/// Get the decoder's current best guess as to the encoding. Returns
/// `None` on error, or if the data appears to be ASCII.
fn charset(&self) -> Option<String> {
fn charset(&self) -> Result<String, EncodingDetectorError> {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This type is written as EncodingDetectorResult<String> above, but Result<String, EncodingDetectorError> here. At the least we should be consistent—but I'm halfway tempted to convert this library to use error-chain since we're making breaking API changes, and since our error type is relatively heavyweight already thanks to the message: String field. What do you think?

I'm totally happy to handle the error-chain conversion myself; I've done several of them lately.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to try the conversion. Though it's my first time using quick-error so tell me if I'm doing dumb stuff.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ehm, I meant error-chain...

unsafe {
let internal_str = ffi::uchardet_get_charset(self.ptr);
assert!(!internal_str.is_null());
Expand All @@ -126,8 +128,10 @@ impl EncodingDetector {
match charset {
Err(_) =>
panic!("uchardet_get_charset returned invalid value"),
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is my code, but I'd like to change this panic! message to something like "uchardet_get_charset returned a charset name containing invalid characters". I can do this when merging unless you want to do it yourself.

Ok("") => None,
Ok(encoding) => Some(encoding.to_string())
Ok("") => Err(EncodingDetectorError {
message: "uchardet failed to recognize a charset".to_string()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a good idea, I need to double-check this new behavior against substudy to see if it still gets good results with older subtitle files.

And if I convert the library to use error-chain, I want to break this out as a distinct error type so that our callers can handle it specially.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will try to integrate this.

}),
Ok(encoding) => Ok(encoding.to_string())
}
}
}
Expand Down
1 change: 1 addition & 0 deletions uchardet-sys/Cargo.toml
Original file line number Diff line number Diff line change
Expand Up @@ -18,3 +18,4 @@ libc = "*"

[build-dependencies]
pkg-config = '*'
cmake = "*"
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm quite happy to see this dependency, especially if it helps us build on Windows and stay in sync with upstream.

80 changes: 21 additions & 59 deletions uchardet-sys/build.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5,71 +5,33 @@
// Patches are welcome to help make it work on other operating systems!

extern crate pkg_config;
extern crate cmake;

use std::env;
use std::fs::create_dir;
use std::path::Path;
use std::process::{Command, Stdio};
use cmake::Config;

fn main() {
// Do nothing if this package is already provided by the system.
if pkg_config::find_library("uchardet").is_ok() { return; }

// Get our configuration from our environment.
let mut cxxflags = env::var("CXXFLAGS").unwrap_or(String::new());
let target = env::var("TARGET").unwrap();
let manifest_dir = env::var("CARGO_MANIFEST_DIR").unwrap();
let src = Path::new(&manifest_dir);
let out_dir = env::var("OUT_DIR").unwrap();
let dst = Path::new(&out_dir);

// Fix up our build flags.
if target.contains("i686") {
cxxflags.push_str(" -m32");
} else if target.contains("x86_64") {
cxxflags.push_str(" -m64");
}
if !target.contains("i686") {
cxxflags.push_str(" -fPIC");
}

// Make a build directory.
let build = dst.join("build");
// This isn't ideal but until is_dir() is stable just always try to create
// the build directory. If it exists and error is returned, which is
// ignored.
create_dir(&build).unwrap_or(());

// Set up our CMake command.
run(Command::new("cmake")
.current_dir(&dst.join("build"))
.arg(&src.join("uchardet"))
.arg("-DCMAKE_BUILD_TYPE=Release")
.arg(&format!("-DCMAKE_INSTALL_PREFIX={}", dst.display()))
.arg(&format!("-DCMAKE_CXX_FLAGS={}", cxxflags)));

// Run our make command.
run(Command::new("make")
.current_dir(&dst.join("build"))
.arg("install"));

// Decide how to link our C++ runtime. Feel free to submit patches
// to make this work on your platform. Other likely options are "c++"
// and "c++abi" depending on OS and compiler.
let cxx_abi = "stdc++";
// Build uchardet ourselves
// Mustn't build the binaries as they aren't compatible with Windows
// and cause a compiler error
let dst = Config::new("uchardet")
.define("BUILD_BINARY", "OFF")
.build();

// Print out link instructions for Cargo.
println!("cargo:rustc-flags=-L {} -l static=uchardet -l {}",
dst.join("lib").display(), cxx_abi);
println!("cargo:root={}", dst.display());
}

// Run an external build command.
fn run(cmd: &mut Command) {
println!("running: {:?}", cmd);
assert!(cmd.stdout(Stdio::inherit())
.stderr(Stdio::inherit())
.status()
.unwrap()
.success());
println!("cargo:rustc-link-search=native={}/lib", dst.display());
println!("cargo:rustc-link-search=native={}/lib64", dst.display());
println!("cargo:rustc-link-lib=static=uchardet");

if !(cfg!(target_os = "windows") && cfg!(target_env = "msvc")) {
// Not needed on windows-msvc

// Decide how to link our C++ runtime. Feel free to submit patches
// to make this work on your platform. Other likely options are "c++"
// and "c++abi" depending on OS and compiler.
let cxx_abi = "stdc++";
println!("cargo:rustc-flags=-l {}", cxx_abi);
}
}
2 changes: 1 addition & 1 deletion uchardet-sys/uchardet
Submodule uchardet updated from 69b713 to 119fed