Use CMake to build uchardet and update upstream submodule #5
Changes from 5 commits
7c37c65
f319faf
784a47b
69c80ba
e80234a
44553be
11fdb93
5be347f
4a5a4fe
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,3 +1,3 @@ | ||
[submodule "uchardet-sys/uchardet"] | ||
path = uchardet-sys/uchardet | ||
url = https://github.com/BYVoid/uchardet | ||
url = https://anongit.freedesktop.org/git/uchardet/uchardet.git |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,14 +1,13 @@ | ||
//! A wrapper around the uchardet library. Detects character encodings. | ||
//! A wrapper around the uchardet library. Detects character encodings. | ||
//! | ||
//! Note that the underlying implementation is written in C and C++, and I'm | ||
//! not aware of any security audits which have been performed against it. | ||
//! | ||
//! ``` | ||
//! use uchardet::detect_encoding_name; | ||
//! | ||
//! assert_eq!(Some("windows-1252".to_string()), | ||
//! detect_encoding_name(&[0x66u8, 0x72, 0x61, 0x6e, 0xe7, | ||
//! 0x61, 0x69, 0x73]).unwrap()); | ||
//! assert_eq!(Ok("ISO-8859-1".to_string()), | ||
//! detect_encoding_name(&[0x46, 0x72, 0x61, 0x6e, 0xe7, 0x6f, 0x69, 0x73, 0xe9])); | ||
//! ``` | ||
Does the detector still detect For my use case, I encounter a lot of input data in
It outputs
Relevant wiki pages: |
||
//! | ||
//! For more information, see [this project on | ||
|
@@ -27,7 +26,7 @@ use std::ffi::CStr; | |
use std::str::from_utf8; | ||
|
||
/// An error occurred while trying to detect the character encoding. | ||
#[derive(Debug)] | ||
#[derive(Debug, PartialEq)] | ||
pub struct EncodingDetectorError { | ||
message: String | ||
} | ||
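The added `PartialEq` derive is what allows the updated doc tests to compare a whole `Result` with `assert_eq!`. The sketch below illustrates this with a stand-in for the crate's error type; `fake_detect` and `compare_results` are hypothetical helpers introduced only for this example and do not call uchardet.

```rust
use std::fmt;

// Stand-in for the crate's error type, reduced to what the diff shows.
#[derive(Debug, PartialEq)]
pub struct EncodingDetectorError {
    message: String,
}

impl fmt::Display for EncodingDetectorError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.message)
    }
}

// Hypothetical helper with the same return shape as the new
// `detect_encoding_name`; it does not perform any detection.
pub fn fake_detect(name: &str) -> Result<String, EncodingDetectorError> {
    if name.is_empty() {
        Err(EncodingDetectorError {
            message: "no charset".to_string(),
        })
    } else {
        Ok(name.to_string())
    }
}

// `assert_eq!` needs both `Debug` and `PartialEq` on the error type,
// because it compares and, on failure, prints the whole `Result`.
pub fn compare_results() {
    assert_eq!(Ok("UTF-8".to_string()), fake_detect("UTF-8"));
    assert_ne!(Ok("UTF-8".to_string()), fake_detect(""));
}
```

Without `PartialEq` on the error type, the `assert_eq!` calls above would not compile, which is presumably why the derive was added alongside the new doc tests.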
|
@@ -54,27 +53,30 @@ struct EncodingDetector { | |
} | ||
|
||
/// Return the name of the charset used in `data`, or `None` if the | ||
/// charset is ASCII or if the encoding can't be detected. This is | ||
/// charset if the encoding can't be detected. This is | ||
This function can no longer return |
||
/// the value returned by the underlying `uchardet` library, with | ||
/// the empty string mapped to `None`. | ||
/// | ||
/// ``` | ||
/// use uchardet::detect_encoding_name; | ||
/// | ||
/// assert_eq!(None, detect_encoding_name("ascii".as_bytes()).unwrap()); | ||
/// assert_eq!(Some("UTF-8".to_string()), | ||
/// detect_encoding_name("français".as_bytes()).unwrap()); | ||
/// assert_eq!(Some("windows-1252".to_string()), | ||
/// detect_encoding_name(&[0x66u8, 0x72, 0x61, 0x6e, 0xe7, | ||
/// 0x61, 0x69, 0x73]).unwrap()); | ||
/// assert_eq!(Ok("ASCII".to_string()), | ||
/// detect_encoding_name("ascii".as_bytes())); | ||
/// assert_eq!(Ok("UTF-8".to_string()), | ||
/// detect_encoding_name("©français".as_bytes())); | ||
/// assert_eq!(Ok("ISO-8859-1".to_string()), | ||
/// detect_encoding_name(&[0x46, 0x72, 0x61, 0x6e, 0xe7, 0x6f, 0x69, 0x73, 0xe9])); | ||
|
||
|
||
|
||
/// ``` | ||
pub fn detect_encoding_name(data: &[u8]) -> | ||
EncodingDetectorResult<Option<String>> | ||
EncodingDetectorResult<String> | ||
{ | ||
let mut detector = EncodingDetector::new(); | ||
try!(detector.handle_data(data)); | ||
detector.data_end(); | ||
Ok(detector.charset()) | ||
detector.charset() | ||
} | ||
|
||
impl EncodingDetector { | ||
|
@@ -85,7 +87,7 @@ impl EncodingDetector { | |
EncodingDetector{ptr: ptr} | ||
} | ||
|
||
/// Pass a chunk of raw bytes to the detector. This is a no-op if a | ||
/// Pass a chunk of raw bytes to the detector. This is a no-op if a | ||
/// charset has been detected. | ||
fn handle_data(&mut self, data: &[u8]) -> EncodingDetectorResult<()> { | ||
let result = unsafe { | ||
|
@@ -102,9 +104,9 @@ impl EncodingDetector { | |
} | ||
|
||
/// Notify the detector that we're done calling `handle_data`, and that | ||
/// we want it to make a guess as to our encoding. This is a no-op if | ||
/// we want it to make a guess as to our encoding. This is a no-op if | ||
/// no data has been passed yet, or if an encoding has been detected | ||
/// for certain. From reading the code, it appears that you can safely | ||
/// for certain. From reading the code, it appears that you can safely | ||
/// call `handle_data` after calling this, but I'm not certain. | ||
fn data_end(&mut self) { | ||
unsafe { ffi::uchardet_data_end(self.ptr); } | ||
|
@@ -115,9 +117,9 @@ impl EncodingDetector { | |
// unsafe { ffi::uchardet_reset(self.ptr); } | ||
//} | ||
|
||
/// Get the decoder's current best guess as to the encoding. Returns | ||
/// Get the decoder's current best guess as to the encoding. Returns | ||
/// `None` on error, or if the data appears to be ASCII. | ||
fn charset(&self) -> Option<String> { | ||
fn charset(&self) -> Result<String, EncodingDetectorError> { | ||
This type is written as I'm totally happy to handle the
I'd like to try the conversion. Though it's my first time using
Ehm, I meant |
||
unsafe { | ||
let internal_str = ffi::uchardet_get_charset(self.ptr); | ||
assert!(!internal_str.is_null()); | ||
|
@@ -126,8 +128,10 @@ impl EncodingDetector { | |
match charset { | ||
Err(_) => | ||
panic!("uchardet_get_charset returned invalid value"), | ||
I know this is my code, but I'd like to change this |
||
Ok("") => None, | ||
Ok(encoding) => Some(encoding.to_string()) | ||
Ok("") => Err(EncodingDetectorError { | ||
message: "uchardet failed to recognize a charset".to_string() | ||
This looks like a good idea, I need to double-check this new behavior against And if I convert the library to use error-chain, I want to break this out as a distinct error type so that our callers can handle it specially.
Will try to integrate this. |
||
}), | ||
Ok(encoding) => Ok(encoding.to_string()) | ||
} | ||
} | ||
} | ||
|
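The review discussion on this hunk mentions breaking the "failed to recognize a charset" case out as a distinct error that callers can handle specially. A minimal sketch of that idea follows, using a plain enum rather than error-chain. `DetectError`, its variants, and `charset_from_bytes` are hypothetical names, not part of the crate, and returning an error for invalid UTF-8 instead of panicking is an assumption about one possible direction, not something the PR does.

```rust
use std::fmt;
use std::str::from_utf8;

/// Hypothetical error type with one variant per failure mode.
#[derive(Debug, PartialEq)]
pub enum DetectError {
    /// The name returned by the detector was not valid UTF-8.
    InvalidName,
    /// The detector returned an empty name: no charset was recognized.
    UnrecognizedCharset,
}

impl fmt::Display for DetectError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match *self {
            DetectError::InvalidName => write!(f, "detector returned an invalid charset name"),
            DetectError::UnrecognizedCharset => write!(f, "detector failed to recognize a charset"),
        }
    }
}

impl std::error::Error for DetectError {}

/// Map the raw bytes of a charset name to a `Result`, mirroring the
/// `match` in the diff but without the FFI call or the panic.
pub fn charset_from_bytes(raw: &[u8]) -> Result<String, DetectError> {
    match from_utf8(raw) {
        Err(_) => Err(DetectError::InvalidName),
        Ok("") => Err(DetectError::UnrecognizedCharset),
        Ok(name) => Ok(name.to_string()),
    }
}
```

With a dedicated variant, a caller could match on `Err(DetectError::UnrecognizedCharset)` and fall back to a default encoding, while still treating other failures as hard errors.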
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -18,3 +18,4 @@ libc = "*" | |
|
||
[build-dependencies] | ||
pkg-config = '*' | ||
cmake = "*" | ||
I'm quite happy to see this dependency, especially if it helps us build on Windows and stay in sync with upstream. |
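As a rough illustration of what the new `cmake` build dependency enables, the sketch below shows the Cargo directives a build script might print once the submodule has been built. It uses only the standard library. In a real `build.rs` the install directory would typically come from the `cmake` crate (for example the `PathBuf` returned by `cmake::build`). The `lib` subdirectory, the static link kind, and the helper name `link_directives` are assumptions, not taken from this PR, and any extra linking of a C++ runtime is left out.

```rust
use std::path::Path;

/// Build the `cargo:` lines a build script would print to link against
/// a uchardet library installed under `install_dir`.
/// Assumption: the library lands in `<install_dir>/lib` and is linked statically.
pub fn link_directives(install_dir: &Path) -> Vec<String> {
    vec![
        format!(
            "cargo:rustc-link-search=native={}",
            install_dir.join("lib").display()
        ),
        "cargo:rustc-link-lib=static=uchardet".to_string(),
    ]
}

/// Print the directives to stdout, which is where Cargo reads them
/// from when running a build script.
pub fn emit(install_dir: &Path) {
    for line in link_directives(install_dir) {
        println!("{}", line);
    }
}
```

Keeping the directive construction in a pure function makes it easy to check without running CMake.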
Do we want to always require the latest stable Rust, or do we want to support back to some specific version? I could be convinced to go either way.
I don't really have an opinion on that (I'm normally running nightly). In theory, I believe 1.2 should suffice, since the incompatibility was caused by the use of debug builders, which have been stable since 1.2.