Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Official API for parsing Cargo.tomls schema #12801

Closed
20 tasks done
epage opened this issue Oct 10, 2023 · 9 comments · Fixed by #13178
Closed
20 tasks done

Official API for parsing Cargo.tomls schema #12801

epage opened this issue Oct 10, 2023 · 9 comments · Fixed by #13178
Labels
A-cargo-api Area: cargo-the-library API and internal code issues C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` S-triage Status: This issue is waiting on initial triage.

Comments

@epage
Copy link
Contributor

epage commented Oct 10, 2023

Problem

Something is needed by crates.io for creating index entries, rather than reading the json blob which might be out of sync. They switched to community developed crates but this creates hidden relationships when making manifest changes (rust-lang/crates.io#6914).

Proposed Solution

Bigger requests, like #4614, are blocked on the fact that the code ties into everything within cargo. So instead, let's get an incremental step by just putting out the schemas, with no business logic.

Notes

Steps

Related work

See also #2259

@epage epage added C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` A-cargo-api Area: cargo-the-library API and internal code issues S-triage Status: This issue is waiting on initial triage. labels Oct 10, 2023
@epage
Copy link
Contributor Author

epage commented Oct 10, 2023

InternedString offers us two big advantages

  • Cheap clone
  • Cheap equality checks

My hope is that we can switch to a Arc<str> | InlinedStr type that can offer us the same benefits and use it everywhere. The fallback is to use it just for the schema.

I've updated string-rosetta-rs to include equality checks and everything but Arc<str> and arcstr fails on this front. I was hoping to use something with a bit more flexibility than arcstr as not having to switch between String and OptimizedString would be nice, like ecow or hipstr (flexstr doesn't have full mutability but does have a format!-like macro). However, that is for a future date and the overall performance of arcstr seems really tempting, so I'm going to propose we go with that (it is also maintained by a libs team member)

The main alternative to all of this is to make our types generic over the string type which I'd rather avoid because that gets messy.

Next step, benchmarking cargo with the change from InternedString to arcstr

@Eh2406
Copy link
Contributor

Eh2406 commented Oct 10, 2023

Edit: there is one more big advantage to InternedString, it only allocates one copy of strings. It is not uncommon for cargo to have generated the same string in hundreds of places. "serde" is the name of a package, and the name in hundreds of dependencies that depend on that package, and the feature name of the optional feature enabling that dependency in each of throws packages... So when we do benchmarking we should look at memory footprint in addition to wall time.

Cheap equality checks

The thing that makes the equality check so fast is that it does not need to dereference the pointer. Before InternedString was added the cash thrashing caused by the pointer dereferences significantly slow down the resolver. That being said, a lot else has improved since then so this may no longer be critical to performance.

Next step, benchmarking cargo with the change from InternedString to arcstr

A specific benchmark I would care about is building a large lock file. Currently all the benchmarks we have measure the time to build the project with a locked file. Furthermore, they all involve git dependencies so rebuilding a lock file is dominated by git. Which means that "just" adding a "generate lock file" call does not tell us about regressions to the performance of the resolver.

@CAD97
Copy link
Contributor

CAD97 commented Oct 11, 2023

Note that arcstr's performance as currently shown in the linked repo is a bit misleading: it only has that cheap clone when comparing strings cloned from a shared arcstr. As soon as you compare any strings that aren't that special case, you'll get the usual O(n) comparison of the string data.

If you want to avoid string comparisons when comparing strings, you need some form of interning. It can be pointer based (store strings, compare pointers) instead of purely symbolic (store indices, ask cache for real string), but you do need to deduplicate the strings somehow (i.e. intern them) to be able to exclusively test pointer equality.

@epage
Copy link
Contributor Author

epage commented Oct 11, 2023

Note that arcstr's performance as currently shown in the linked repo is a bit misleading: it only has that cheap clone when comparing strings cloned from a shared arcstr. As soon as you compare any strings that aren't that special case, you'll get the usual O(n) comparison of the string data.

This depends on how much we are doing comparisons of the data from the same source or not.

If you want to avoid string comparisons when comparing strings, you need some form of interning. It can be pointer based (store strings, compare pointers) instead of purely symbolic (store indices, ask cache for real string), but you do need to deduplicate the strings somehow (i.e. intern them) to be able to exclusively test pointer equality.

We have string interning today. The problem is that I doubt people want that as part of the Cargo.toml parsing API.

@epage
Copy link
Contributor Author

epage commented Oct 24, 2023

Just did a quick benchmark with hipstr being used only for manifests

$ ../dump/cargo-12801-bench.rs run
    Finished dev [unoptimized + debuginfo] target(s) in 0.07s
     Running `target/debug/cargo -Zscript -Zmsrv-policy ../dump/cargo-12801-bench.rs run`
warning: `package.edition` is unspecified, defaulting to `2021`
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `/home/epage/.cargo/target/0a/7f4c1ab500f045/debug/cargo-12801-bench run`
$ hyperfine "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     119.6 ms ±   2.7 ms    [User: 99.6 ms, System: 19.6 ms]
  Range (min … max):   115.6 ms … 124.5 ms    25 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     118.9 ms ±   2.8 ms    [User: 98.3 ms, System: 20.2 ms]
  Range (min … max):   115.6 ms … 125.3 ms    25 runs

Summary
  ../cargo-new check ran
    1.01 ± 0.03 times faster than ../cargo-old check
$ hyperfine --prepare "cargo clean" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     19.626 s ±  0.198 s    [User: 151.846 s, System: 22.022 s]
  Range (min … max):   19.348 s … 19.966 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     19.807 s ±  0.403 s    [User: 152.382 s, System: 22.086 s]
  Range (min … max):   19.469 s … 20.861 s    10 runs

Summary
  ../cargo-old check ran
    1.01 ± 0.02 times faster than ../cargo-new check
$ hyperfine --prepare "cargo clean && rm -f Cargo.lock" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     20.864 s ±  0.217 s    [User: 152.427 s, System: 22.366 s]
  Range (min … max):   20.632 s … 21.227 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     20.934 s ±  0.509 s    [User: 151.831 s, System: 22.348 s]
  Range (min … max):   20.410 s … 21.871 s    10 runs

Summary
  ../cargo-old check ran
    1.00 ± 0.03 times faster than ../cargo-new check

Code:

cargo-12801-bench.rs:

#!/usr/bin/env nargo
```cargo
[dependencies]
lexopt = "0.3"
xshell = "0.2"
```

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let args = Args::parse_args()?;

    for command in args.commands {
        match command {
            Command::Setup(args) => setup(args)?,
            Command::Run => run()?,
        }
    }

    Ok(())
}

fn setup(args: SetupArgs) -> Result<(), Box<dyn std::error::Error>> {
    let sh = xshell::Shell::new()?;
    let tmp = tmp();
    let old_path = tmp.join("cargo-old");
    let new_path = tmp.join("cargo-new");
    let built_path = std::path::Path::new("target/release/cargo");

    if args.hard && tmp.exists() {
        std::fs::remove_dir_all(&tmp)?;
    }
    std::fs::create_dir_all(&tmp)?;

    if !tmp.join("cargo").exists() {
        let _dir = sh.push_dir(&tmp);
        xshell::cmd!(sh, "git clone [email protected]:rust-lang/cargo.git").run()?;
    }

    xshell::cmd!(sh, "git checkout master").run()?;
    xshell::cmd!(sh, "cargo build --release").run()?;
    dbg!(built_path);
    dbg!(&old_path);
    std::fs::copy(built_path, &old_path)?;

    xshell::cmd!(sh, "git checkout intern").run()?;
    xshell::cmd!(sh, "cargo build --release").run()?;
    std::fs::copy(built_path, &new_path)?;

    xshell::cmd!(sh, "git checkout master").run()?;

    Ok(())
}

fn run() -> Result<(), Box<dyn std::error::Error>> {
    let sh = xshell::Shell::new()?;
    let tmp = tmp();

    let _dir = sh.push_dir(&tmp.join("cargo"));
    xshell::cmd!(sh, "../cargo-old check").run()?;
    xshell::cmd!(sh, "hyperfine '../cargo-old check' '../cargo-new check'").run()?;
    xshell::cmd!(sh, "hyperfine --prepare 'cargo clean' '../cargo-old check' '../cargo-new check'").run()?;
    xshell::cmd!(sh, "hyperfine --prepare 'cargo clean && rm -f Cargo.lock' '../cargo-old check' '../cargo-new check'").run()?;

    Ok(())
}

fn tmp() -> std::path::PathBuf {
    "target/tmp/bench".into()
}

enum Command {
    Setup(SetupArgs),
    Run,
}

struct SetupArgs {
        hard: bool
}

struct Args {
    commands: Vec<Command>,
}

impl Args {
    fn parse_args() -> Result<Self, lexopt::Error> {
        use lexopt::prelude::*;

        let mut args = Self {
            commands: Default::default(),
        };
        let mut parser = lexopt::Parser::from_env();
        while let Some(arg) = parser.next()? {
            match arg {
                Value(command) => {
                    let command = command.parse_with(|command| match command {
                        "setup" => Ok(Command::Setup(SetupArgs{
                            hard: false
                        })),
                        "run" => Ok(Command::Run),
                        _ => Err("invalid command, needed `setup`, `run`"),
                    })?;
                    args.commands.push(command);
                }
                Long("hard") => {
                    if let Some(Command::Setup(args)) = args.commands.last_mut() {
                        args.hard = true;
                    } else {
                        return Err(arg.unexpected());
                    }
                }
                Long("soft") => {
                    if let Some(Command::Setup(args)) = args.commands.last_mut() {
                        args.hard = false;
                    } else {
                        return Err(arg.unexpected());
                    }
                }
                _ => {
                    return Err(arg.unexpected());
                }
            }
        }

        Ok(args)
    }
}

@epage
Copy link
Contributor Author

epage commented Oct 25, 2023

To double check my assumptions, I went ahead and ran the benchmark with String. Apparently, we don't need anything fancy.

$ hyperfine "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     119.3 ms ±   3.2 ms    [User: 98.6 ms, System: 20.3 ms]
  Range (min … max):   115.6 ms … 124.3 ms    24 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     119.4 ms ±   2.4 ms    [User: 98.0 ms, System: 21.1 ms]
  Range (min … max):   115.7 ms … 123.6 ms    24 runs

Summary
  ../cargo-old check ran
    1.00 ± 0.03 times faster than ../cargo-new check
$ hyperfine --prepare "cargo clean" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     20.249 s ±  0.392 s    [User: 157.719 s, System: 22.771 s]
  Range (min … max):   19.605 s … 21.123 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     20.123 s ±  0.212 s    [User: 156.156 s, System: 22.325 s]
  Range (min … max):   19.764 s … 20.420 s    10 runs

Summary
  ../cargo-new check ran
    1.01 ± 0.02 times faster than ../cargo-old check
$ hyperfine --prepare "cargo clean && rm -f Cargo.lock" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     21.105 s ±  0.465 s    [User: 156.482 s, System: 22.799 s]
  Range (min … max):   20.156 s … 22.010 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     21.358 s ±  0.538 s    [User: 156.187 s, System: 22.979 s]
  Range (min … max):   20.703 s … 22.462 s    10 runs

Summary
  ../cargo-old check ran
    1.01 ± 0.03 times faster than ../cargo-new check

epage added a commit to epage/cargo that referenced this issue Oct 25, 2023
To have a separate manifest API (rust-lang#12801), we can't rely on interning
because it might be used in longer-lifetime applications.

I consulted https://github.com/rosetta-rs/string-rosetta-rs when
deciding on what string type to use for performance.

Originally, I hoped to entirely replacing string interning.  For that, I
was looking at `arcstr` as it had a fast equality operator.  However,
that is only helpful so long as the two strings we are comparing came
from the original source.  Unsure how likely that is to happen (and
daunted by converting all of the `Copy`s into `Clone`s), I decided to
scale back.

Concerned about all of the small allocations when parsing a manifest, I
assumed I'd need a string type with small-string optimizations, like
`hipstr`, `compact_str`, `flexstr`, and `ecow`.
The first three give us more wiggle room and `hipstr` was the fastest of
them, so I went with that.

I then double checked macro benchmarks, and realized `hipstr` made no
difference and switched to `String` to keep things simple / with lower
dependencies.

When doing this, I had created a type alias (`TomlStr`) for the string
type so I could more easily swap it out if needed
(and not have to always write out a lifetime).
With just using `String`, I went ahead and dropped that.

I had problems getting the cargo benchmarks running, so I did a quick
and dirty benchmark that is end-to-end, covering fresh builds, clean
builds, and resolution.  I ran these against a fresh clone of cargo's
code base.
```console
$ ../dump/cargo-12801-bench.rs run
    Finished dev [unoptimized + debuginfo] target(s) in 0.07s
     Running `target/debug/cargo -Zscript -Zmsrv-policy ../dump/cargo-12801-bench.rs run`
warning: `package.edition` is unspecified, defaulting to `2021`
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `/home/epage/.cargo/target/0a/7f4c1ab500f045/debug/cargo-12801-bench run`
$ hyperfine "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     119.3 ms ±   3.2 ms    [User: 98.6 ms, System: 20.3 ms]
  Range (min … max):   115.6 ms … 124.3 ms    24 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     119.4 ms ±   2.4 ms    [User: 98.0 ms, System: 21.1 ms]
  Range (min … max):   115.7 ms … 123.6 ms    24 runs

Summary
  ../cargo-old check ran
    1.00 ± 0.03 times faster than ../cargo-new check
$ hyperfine --prepare "cargo clean" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     20.249 s ±  0.392 s    [User: 157.719 s, System: 22.771 s]
  Range (min … max):   19.605 s … 21.123 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     20.123 s ±  0.212 s    [User: 156.156 s, System: 22.325 s]
  Range (min … max):   19.764 s … 20.420 s    10 runs

Summary
  ../cargo-new check ran
    1.01 ± 0.02 times faster than ../cargo-old check
$ hyperfine --prepare "cargo clean && rm -f Cargo.lock" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     21.105 s ±  0.465 s    [User: 156.482 s, System: 22.799 s]
  Range (min … max):   20.156 s … 22.010 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     21.358 s ±  0.538 s    [User: 156.187 s, System: 22.979 s]
  Range (min … max):   20.703 s … 22.462 s    10 runs

Summary
  ../cargo-old check ran
    1.01 ± 0.03 times faster than ../cargo-new check
```
epage added a commit to epage/cargo that referenced this issue Oct 27, 2023
To have a separate manifest API (rust-lang#12801), we can't rely on interning
because it might be used in longer-lifetime applications.

I consulted https://github.com/rosetta-rs/string-rosetta-rs when
deciding on what string type to use for performance.

Originally, I hoped to entirely replacing string interning.  For that, I
was looking at `arcstr` as it had a fast equality operator.  However,
that is only helpful so long as the two strings we are comparing came
from the original source.  Unsure how likely that is to happen (and
daunted by converting all of the `Copy`s into `Clone`s), I decided to
scale back.

Concerned about all of the small allocations when parsing a manifest, I
assumed I'd need a string type with small-string optimizations, like
`hipstr`, `compact_str`, `flexstr`, and `ecow`.
The first three give us more wiggle room and `hipstr` was the fastest of
them, so I went with that.

I then double checked macro benchmarks, and realized `hipstr` made no
difference and switched to `String` to keep things simple / with lower
dependencies.

When doing this, I had created a type alias (`TomlStr`) for the string
type so I could more easily swap it out if needed
(and not have to always write out a lifetime).
With just using `String`, I went ahead and dropped that.

I had problems getting the cargo benchmarks running, so I did a quick
and dirty benchmark that is end-to-end, covering fresh builds, clean
builds, and resolution.  I ran these against a fresh clone of cargo's
code base.
```console
$ ../dump/cargo-12801-bench.rs run
    Finished dev [unoptimized + debuginfo] target(s) in 0.07s
     Running `target/debug/cargo -Zscript -Zmsrv-policy ../dump/cargo-12801-bench.rs run`
warning: `package.edition` is unspecified, defaulting to `2021`
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `/home/epage/.cargo/target/0a/7f4c1ab500f045/debug/cargo-12801-bench run`
$ hyperfine "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     119.3 ms ±   3.2 ms    [User: 98.6 ms, System: 20.3 ms]
  Range (min … max):   115.6 ms … 124.3 ms    24 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     119.4 ms ±   2.4 ms    [User: 98.0 ms, System: 21.1 ms]
  Range (min … max):   115.7 ms … 123.6 ms    24 runs

Summary
  ../cargo-old check ran
    1.00 ± 0.03 times faster than ../cargo-new check
$ hyperfine --prepare "cargo clean" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     20.249 s ±  0.392 s    [User: 157.719 s, System: 22.771 s]
  Range (min … max):   19.605 s … 21.123 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     20.123 s ±  0.212 s    [User: 156.156 s, System: 22.325 s]
  Range (min … max):   19.764 s … 20.420 s    10 runs

Summary
  ../cargo-new check ran
    1.01 ± 0.02 times faster than ../cargo-old check
$ hyperfine --prepare "cargo clean && rm -f Cargo.lock" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     21.105 s ±  0.465 s    [User: 156.482 s, System: 22.799 s]
  Range (min … max):   20.156 s … 22.010 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     21.358 s ±  0.538 s    [User: 156.187 s, System: 22.979 s]
  Range (min … max):   20.703 s … 22.462 s    10 runs

Summary
  ../cargo-old check ran
    1.01 ± 0.03 times faster than ../cargo-new check
```
epage added a commit to epage/cargo that referenced this issue Oct 27, 2023
To have a separate manifest API (rust-lang#12801), we can't rely on interning
because it might be used in longer-lifetime applications.

I consulted https://github.com/rosetta-rs/string-rosetta-rs when
deciding on what string type to use for performance.

Originally, I hoped to entirely replacing string interning.  For that, I
was looking at `arcstr` as it had a fast equality operator.  However,
that is only helpful so long as the two strings we are comparing came
from the original source.  Unsure how likely that is to happen (and
daunted by converting all of the `Copy`s into `Clone`s), I decided to
scale back.

Concerned about all of the small allocations when parsing a manifest, I
assumed I'd need a string type with small-string optimizations, like
`hipstr`, `compact_str`, `flexstr`, and `ecow`.
The first three give us more wiggle room and `hipstr` was the fastest of
them, so I went with that.

I then double checked macro benchmarks, and realized `hipstr` made no
difference and switched to `String` to keep things simple / with lower
dependencies.

When doing this, I had created a type alias (`TomlStr`) for the string
type so I could more easily swap it out if needed
(and not have to always write out a lifetime).
With just using `String`, I went ahead and dropped that.

I had problems getting the cargo benchmarks running, so I did a quick
and dirty benchmark that is end-to-end, covering fresh builds, clean
builds, and resolution.  I ran these against a fresh clone of cargo's
code base.
```console
$ ../dump/cargo-12801-bench.rs run
    Finished dev [unoptimized + debuginfo] target(s) in 0.07s
     Running `target/debug/cargo -Zscript -Zmsrv-policy ../dump/cargo-12801-bench.rs run`
warning: `package.edition` is unspecified, defaulting to `2021`
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `/home/epage/.cargo/target/0a/7f4c1ab500f045/debug/cargo-12801-bench run`
$ hyperfine "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     119.3 ms ±   3.2 ms    [User: 98.6 ms, System: 20.3 ms]
  Range (min … max):   115.6 ms … 124.3 ms    24 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     119.4 ms ±   2.4 ms    [User: 98.0 ms, System: 21.1 ms]
  Range (min … max):   115.7 ms … 123.6 ms    24 runs

Summary
  ../cargo-old check ran
    1.00 ± 0.03 times faster than ../cargo-new check
$ hyperfine --prepare "cargo clean" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     20.249 s ±  0.392 s    [User: 157.719 s, System: 22.771 s]
  Range (min … max):   19.605 s … 21.123 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     20.123 s ±  0.212 s    [User: 156.156 s, System: 22.325 s]
  Range (min … max):   19.764 s … 20.420 s    10 runs

Summary
  ../cargo-new check ran
    1.01 ± 0.02 times faster than ../cargo-old check
$ hyperfine --prepare "cargo clean && rm -f Cargo.lock" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     21.105 s ±  0.465 s    [User: 156.482 s, System: 22.799 s]
  Range (min … max):   20.156 s … 22.010 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     21.358 s ±  0.538 s    [User: 156.187 s, System: 22.979 s]
  Range (min … max):   20.703 s … 22.462 s    10 runs

Summary
  ../cargo-old check ran
    1.01 ± 0.03 times faster than ../cargo-new check
```
epage added a commit to epage/cargo that referenced this issue Oct 28, 2023
To have a separate manifest API (rust-lang#12801), we can't rely on interning
because it might be used in longer-lifetime applications.

I consulted https://github.com/rosetta-rs/string-rosetta-rs when
deciding on what string type to use for performance.

Originally, I hoped to entirely replacing string interning.  For that, I
was looking at `arcstr` as it had a fast equality operator.  However,
that is only helpful so long as the two strings we are comparing came
from the original source.  Unsure how likely that is to happen (and
daunted by converting all of the `Copy`s into `Clone`s), I decided to
scale back.

Concerned about all of the small allocations when parsing a manifest, I
assumed I'd need a string type with small-string optimizations, like
`hipstr`, `compact_str`, `flexstr`, and `ecow`.
The first three give us more wiggle room and `hipstr` was the fastest of
them, so I went with that.

I then double checked macro benchmarks, and realized `hipstr` made no
difference and switched to `String` to keep things simple / with lower
dependencies.

When doing this, I had created a type alias (`TomlStr`) for the string
type so I could more easily swap it out if needed
(and not have to always write out a lifetime).
With just using `String`, I went ahead and dropped that.

I had problems getting the cargo benchmarks running, so I did a quick
and dirty benchmark that is end-to-end, covering fresh builds, clean
builds, and resolution.  I ran these against a fresh clone of cargo's
code base.
```console
$ ../dump/cargo-12801-bench.rs run
    Finished dev [unoptimized + debuginfo] target(s) in 0.07s
     Running `target/debug/cargo -Zscript -Zmsrv-policy ../dump/cargo-12801-bench.rs run`
warning: `package.edition` is unspecified, defaulting to `2021`
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `/home/epage/.cargo/target/0a/7f4c1ab500f045/debug/cargo-12801-bench run`
$ hyperfine "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     119.3 ms ±   3.2 ms    [User: 98.6 ms, System: 20.3 ms]
  Range (min … max):   115.6 ms … 124.3 ms    24 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     119.4 ms ±   2.4 ms    [User: 98.0 ms, System: 21.1 ms]
  Range (min … max):   115.7 ms … 123.6 ms    24 runs

Summary
  ../cargo-old check ran
    1.00 ± 0.03 times faster than ../cargo-new check
$ hyperfine --prepare "cargo clean" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     20.249 s ±  0.392 s    [User: 157.719 s, System: 22.771 s]
  Range (min … max):   19.605 s … 21.123 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     20.123 s ±  0.212 s    [User: 156.156 s, System: 22.325 s]
  Range (min … max):   19.764 s … 20.420 s    10 runs

Summary
  ../cargo-new check ran
    1.01 ± 0.02 times faster than ../cargo-old check
$ hyperfine --prepare "cargo clean && rm -f Cargo.lock" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     21.105 s ±  0.465 s    [User: 156.482 s, System: 22.799 s]
  Range (min … max):   20.156 s … 22.010 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     21.358 s ±  0.538 s    [User: 156.187 s, System: 22.979 s]
  Range (min … max):   20.703 s … 22.462 s    10 runs

Summary
  ../cargo-old check ran
    1.01 ± 0.03 times faster than ../cargo-new check
```
bors added a commit that referenced this issue Oct 28, 2023
refactor(toml): Decouple parsing from interning system

### What does this PR try to resolve?

To have a separate manifest API (#12801), we can't rely on interning because it might be used in longer-lifetime applications.

To keep this limited in scope, this only removes `InternedString` from manifest parsing.  Everything else still uses `InternedString`.

### How should we test and review this PR?

I had problems getting the cargo benchmarks running, so I did a quick and dirty benchmark that is end-to-end, covering fresh builds, clean builds, and resolution.  I ran these against a fresh clone of cargo's code base.  See [my comment](#12801 (comment)) for the script that managed the benchmarks.

Benchmarks:

<details>

```console
$ ../dump/cargo-12801-bench.rs run
    Finished dev [unoptimized + debuginfo] target(s) in 0.07s
     Running `target/debug/cargo -Zscript -Zmsrv-policy ../dump/cargo-12801-bench.rs run`
warning: `package.edition` is unspecified, defaulting to `2021`
    Finished dev [unoptimized + debuginfo] target(s) in 0.04s
     Running `/home/epage/.cargo/target/0a/7f4c1ab500f045/debug/cargo-12801-bench run`
$ hyperfine "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     119.3 ms ±   3.2 ms    [User: 98.6 ms, System: 20.3 ms]
  Range (min … max):   115.6 ms … 124.3 ms    24 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     119.4 ms ±   2.4 ms    [User: 98.0 ms, System: 21.1 ms]
  Range (min … max):   115.7 ms … 123.6 ms    24 runs

Summary
  ../cargo-old check ran
    1.00 ± 0.03 times faster than ../cargo-new check
$ hyperfine --prepare "cargo clean" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     20.249 s ±  0.392 s    [User: 157.719 s, System: 22.771 s]
  Range (min … max):   19.605 s … 21.123 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     20.123 s ±  0.212 s    [User: 156.156 s, System: 22.325 s]
  Range (min … max):   19.764 s … 20.420 s    10 runs

Summary
  ../cargo-new check ran
    1.01 ± 0.02 times faster than ../cargo-old check
$ hyperfine --prepare "cargo clean && rm -f Cargo.lock" "../cargo-old check" "../cargo-new check"
Benchmark 1: ../cargo-old check
  Time (mean ± σ):     21.105 s ±  0.465 s    [User: 156.482 s, System: 22.799 s]
  Range (min … max):   20.156 s … 22.010 s    10 runs

Benchmark 2: ../cargo-new check
  Time (mean ± σ):     21.358 s ±  0.538 s    [User: 156.187 s, System: 22.979 s]
  Range (min … max):   20.703 s … 22.462 s    10 runs

Summary
  ../cargo-old check ran
    1.01 ± 0.03 times faster than ../cargo-new check
```

</details>

### Additional information

I consulted https://github.com/rosetta-rs/string-rosetta-rs when deciding on what string type to use for performance.

Originally, I hoped to entirely replacing string interning.  For that, I was looking at `arcstr` as it had a fast equality operator.  However, that is only helpful so long as the two strings we are comparing came from the original source.  Unsure how likely that is to happen (and daunted by converting all of the `Copy`s into `Clone`s), I decided to scale back.

Concerned about all of the small allocations when parsing a manifest, I assumed I'd need a string type with small-string optimizations, like `hipstr`, `compact_str`, `flexstr`, and `ecow`.
The first three give us more wiggle room and `hipstr` was the fastest of them, so I went with that.

I then double checked macro benchmarks, and realized `hipstr` made no difference and switched to `String` to keep things simple / with lower dependencies.

When doing this, I had created a type alias (`TomlStr`) for the string type so I could more easily swap it out if needed
(and not have to always write out a lifetime).
With just using `String`, I went ahead and dropped that.
bors added a commit that referenced this issue Oct 31, 2023
refactor(toml): Cleanup noticed on the way to #12801

In splitting up `toml/mod.rs` for #12801, I noticed some things to clean up.  I decided to split this up.
bors added a commit to rust-lang-ci/rust that referenced this issue Oct 31, 2023
Update cargo

7 commits in 708383d620e183a9ece69b8fe930c411d83dee27..b4d18d4bd3db6d872892f6c87c51a02999b80802
2023-10-27 21:09:26 +0000 to 2023-10-31 18:19:10 +0000
- refactor(toml): Cleanup noticed on the way to rust-lang/cargo#12801 (rust-lang/cargo#12902)
- feat(trim-paths): set env `CARGO_TRIM_PATHS` for build scripts (rust-lang/cargo#12900)
- feat: implement RFC 3127 `-Ztrim-paths` (rust-lang/cargo#12625)
- docs: clarify config to use vendored source is printed to stdout (rust-lang/cargo#12893)
- Improve the margin calculation for the search command's UI (rust-lang/cargo#12890)
- Add new packages to [workspace.members] automatically (rust-lang/cargo#12779)
- refactor(toml): Decouple parsing from interning system (rust-lang/cargo#12881)

r? ghost
bors added a commit that referenced this issue Nov 6, 2023
refactor(toml): Pull out the schema

### What does this PR try to resolve?

On its own, this PR is a net negative for readability / complexity.  It moves all of the serde types related to manifest from `toml/mod.rs` to `toml/schema.rs`, leaving a lot of the functions and some trait implementations back in `toml/mod.rs` (some basic functions that made sense for the type on their own were also moved).

So why do this?  This is an ooch towards having the schema broken out into a separate package (#12801).  To do this, we need to
- Identify what all types need to be put in the package.  This refactor highlights the dependence on `RustVersion` and `PackageIdSpec`
- Highlights what functionality we need to find a new home for

Follow up PRs would
- Find better homes for the logic in `toml/mod.rs`, moving us away from having `impl schema::Type` blocks.
- Pull out a `src/cargo/util_semver` package to own `PartialVersion` (at least) as prep for making a `cargo-util-semver` package
- Move `RustVersion` to `manifest-toml/schema.rs`, deciding what functionality needs to move with the type
- Move or copy `PackageIdSpec` into `manfest-toml/schema.rs`, deciding what functionality remain in `core/` and what moves over
- Move `toml/schema.rs` to `src/cargo/util_schema`
- Actually make `cargo-util-semver` and `cargo-util-manifest-schema` packages

So why do this now? This is a big change!  By being incremental
- Reduce churn for me and others
- Be easier to review
- Collect feedback as we go on the whole plan to avoid more painful changes later

We *can* back this out if needed but the further we go, the more painful it will be.

### How should we test and review this PR?

### Additional information
bors added a commit that referenced this issue Nov 6, 2023
refactor(toml): Pull out the schema

### What does this PR try to resolve?

On its own, this PR is a net negative for readability / complexity.  It moves all of the serde types related to manifest from `toml/mod.rs` to `toml/schema.rs`, leaving a lot of the functions and some trait implementations back in `toml/mod.rs` (some basic functions that made sense for the type on their own were also moved).

So why do this?  This is an ooch towards having the schema broken out into a separate package (#12801).  To do this, we need to
- Identify what all types need to be put in the package.  This refactor highlights the dependence on `RustVersion` and `PackageIdSpec`
- Highlights what functionality we need to find a new home for

Follow up PRs would
- Find better homes for the logic in `toml/mod.rs`, moving us away from having `impl schema::Type` blocks.
- Pull out a `src/cargo/util_semver` package to own `PartialVersion` (at least) as prep for making a `cargo-util-semver` package
- Move `RustVersion` to `manifest-toml/schema.rs`, deciding what functionality needs to move with the type
- Move or copy `PackageIdSpec` into `manfest-toml/schema.rs`, deciding what functionality remain in `core/` and what moves over
- Move `toml/schema.rs` to `src/cargo/util_schema`
- Actually make `cargo-util-semver` and `cargo-util-manifest-schema` packages

So why do this now? This is a big change!  By being incremental
- Reduce churn for me and others
- Be easier to review
- Collect feedback as we go on the whole plan to avoid more painful changes later

We *can* back this out if needed but the further we go, the more painful it will be.

### How should we test and review this PR?

### Additional information
epage added a commit to epage/cargo that referenced this issue Nov 6, 2023
For `cargo install` we'll now show a more specific parse error for
semver, much like other parts of cargo.

This came out of my work on rust-lang#12801.  I was looking at what might be
appropriate to put in a `cargo-util-semver` crate and realized we have
the `ToSemver` trait that exists but doesn't do much, so I dropped it.
epage added a commit to epage/cargo that referenced this issue Nov 6, 2023
This came out of my work on rust-lang#12801.
I was looking at what might be appropriate to put in a `cargo-util-semver` crate and realized we have the `VersionExt` trait that exists but doesn't do much, so I dropped it.
epage added a commit to epage/cargo that referenced this issue Nov 6, 2023
For `cargo install` we'll now show a more specific parse error for
semver, much like other parts of cargo.

This came out of my work on rust-lang#12801.  I was looking at what might be
appropriate to put in a `cargo-util-semver` crate and realized we have
the `ToSemver` trait that exists but doesn't do much, so I dropped it.
epage added a commit to epage/cargo that referenced this issue Nov 7, 2023
For `cargo install` we'll now show a more specific parse error for
semver, much like other parts of cargo.

This came out of my work on rust-lang#12801.  I was looking at what might be
appropriate to put in a `cargo-util-semver` crate and realized we have
the `ToSemver` trait that exists but doesn't do much, so I dropped it.
epage added a commit to epage/cargo that referenced this issue Dec 1, 2023
Originally for rust-lang#12801 we talked about a `cargo-util-manifest-schema` package
- `util` in the name to not clash with plugins
- manifest specific to keep the scope down

The problem is we have types that aren't manifest specific, like
- `PartialVersion` (currently slated for `cargo-util-semverext`)
- `RustVersion`
- `PackageIdSpec`
- `SourceKind` (soon)

Things get messy if we try to break things down into common packages.
Instead, I think it'd be useful to have a schemas package that has mods
for each type of schema, re-exporting what is needed.

Normally, componentizing your package by the layer in the stack is a
recipe for pain.
I don't think that'll apply here because these are meant to be so low
level.

The other big concern could be compile times.  My hope is it won't be
too bad.

So this moves the `util/toml` types into the module and we can add more
in the future.
epage added a commit to epage/cargo that referenced this issue Dec 1, 2023
Originally for rust-lang#12801 we talked about a `cargo-util-manifest-schema` package
- `util` in the name to not clash with plugins
- manifest specific to keep the scope down

The problem is we have types that aren't manifest specific, like
- `PartialVersion` (currently slated for `cargo-util-semverext`)
- `RustVersion`
- `PackageIdSpec`
- `SourceKind` (soon)

Things get messy if we try to break things down into common packages.
Instead, I think it'd be useful to have a schemas package that has mods
for each type of schema, re-exporting what is needed.

Normally, componentizing your package by the layer in the stack is a
recipe for pain.
I don't think that'll apply here because these are meant to be so low
level.

The other big concern could be compile times.  My hope is it won't be
too bad.

So this moves the `util/toml` types into the module and we can add more
in the future.
epage added a commit to epage/cargo that referenced this issue Dec 1, 2023
Originally for rust-lang#12801 we talked about a `cargo-util-manifest-schema` package
- `util` in the name to not clash with plugins
- manifest specific to keep the scope down

The problem is we have types that aren't manifest specific, like
- `PartialVersion` (currently slated for `cargo-util-semverext`)
- `RustVersion`
- `PackageIdSpec`
- `SourceKind` (soon)

Things get messy if we try to break things down into common packages.
Instead, I think it'd be useful to have a schemas package that has mods
for each type of schema, re-exporting what is needed.

Normally, componentizing your package by the layer in the stack is a
recipe for pain.
I don't think that'll apply here because these are meant to be so low
level.

The other big concern could be compile times.  My hope is it won't be
too bad.

So this moves the `util/toml` types into the module and we can add more
in the future.
epage added a commit to epage/cargo that referenced this issue Dec 2, 2023
Originally for rust-lang#12801 we talked about a `cargo-util-manifest-schema` package
- `util` in the name to not clash with plugins
- manifest specific to keep the scope down

The problem is we have types that aren't manifest specific, like
- `PartialVersion` (currently slated for `cargo-util-semverext`)
- `RustVersion`
- `PackageIdSpec`
- `SourceKind` (soon)

Things get messy if we try to break things down into common packages.
Instead, I think it'd be useful to have a schemas package that has mods
for each type of schema, re-exporting what is needed.

Normally, componentizing your package by the layer in the stack is a
recipe for pain.
I don't think that'll apply here because these are meant to be so low
level.

The other big concern could be compile times.  My hope is it won't be
too bad.

So this moves the `util/toml` types into the module and we can add more
in the future.
epage added a commit to epage/cargo that referenced this issue Dec 5, 2023
Originally for rust-lang#12801 we talked about a `cargo-util-manifest-schema` package
- `util` in the name to not clash with plugins
- manifest specific to keep the scope down

The problem is we have types that aren't manifest specific, like
- `PartialVersion` (currently slated for `cargo-util-semverext`)
- `RustVersion`
- `PackageIdSpec`
- `SourceKind` (soon)

Things get messy if we try to break things down into common packages.
Instead, I think it'd be useful to have a schemas package that has mods
for each type of schema, re-exporting what is needed.

Normally, componentizing your package by the layer in the stack is a
recipe for pain.
I don't think that'll apply here because these are meant to be so low
level.

The other big concern could be compile times.  My hope is it won't be
too bad.

So this moves the `util/toml` types into the module and we can add more
in the future.
bors added a commit that referenced this issue Dec 5, 2023
refactor(schemas): Pull out mod for proposed schemas package

Originally for #12801 we talked about a `cargo-util-manifest-schema` package
- `util` in the name to not clash with plugins
- manifest specific to keep the scope down

The problem is we have types that aren't manifest specific, like
- `PartialVersion` (currently slated for `cargo-util-semverext`)
- `RustVersion`
- `PackageIdSpec`
- `SourceKind` (soon)

Things get messy if we try to break things down into common packages. Instead, I think it'd be useful to have a schemas package that has mods for each type of schema, re-exporting what is needed.

Normally, componentizing your package by the layer in the stack is a recipe for pain.
I don't think that'll apply here because these are meant to be so low level.

The other big concern could be compile times.  My hope is it won't be too bad.

So this moves the `util/toml` types into the module and we can add more in the future.
epage added a commit to epage/cargo that referenced this issue Dec 6, 2023
This is a step towards moving `PackageIdSpec` into `schema` as manifests
need it as part of rust-lang#12801.

While I didn't take this approach in rust-lang#13080, that was largely how core
these functions are / how pervasive their use is.
epage added a commit to epage/cargo that referenced this issue Dec 8, 2023
This is a step towards moving `PackageIdSpec` into `schema` as manifests
need it as part of rust-lang#12801.

While I didn't take this approach in rust-lang#13080, that was largely how core
these functions are / how pervasive their use is.
bors added a commit that referenced this issue Dec 8, 2023
refactor: Pull PackageIdSpec into schema

### What does this PR try to resolve?

This removes one of the remaining ties `util_schemas` has on the rest of cargo as part of #12801.

### How should we test and review this PR?

This is broken up into small commits

### Additional information
bors added a commit that referenced this issue Dec 11, 2023
refactor(schema): Remove reliance on cargo types

### What does this PR try to resolve?

This is another step towards #12801.

The only thing left is `validate_package_name` which I left out because I want to explore ways of handing that and don't want this commit blocked on it

### How should we test and review this PR?

### Additional information
bors added a commit that referenced this issue Dec 13, 2023
fix: Fill in more empty name holes

### What does this PR try to resolve?

This is a follow up to #13152 and expands on work done in #12928.

This is also part of #12801 as I am refactoring how we do validation so that it parts of it can move into the schema, removing the last dependency the schema has on the rest of cargo.

### How should we test and review this PR?

This prevents empty registry names which should be fine for the same reason as preventing empty feature names in #12928, that this was a regression due to changes in toml parsers

### Additional information
bors added a commit that referenced this issue Dec 15, 2023
refactor: Pull name validation into `util_schemas`

### What does this PR try to resolve?

In preparation for #12801, this moves the last dependency on the rest of the `cargo` crate into the future `util_schemas` crate.  It does this by refocusing the code from being validation functions to being newtypes.  I did not try to thread the newtypes everywhere, that can be an exercise for the future.  There are places I didn't put newtypes because it didn't seem worth it (e.g. places needing `StringOrVec`, `ProfileName` not being used in CLI because of the `check` psuedo-profile, etc)

The main risk with this is when validation changes according to a nightly feature, like packages-as-namespaces.  My thought is that the validation code would be updated for the nightly behavior and then the caller would check if it isn't nightly and fail in that case.

This has a side effect of improving the error messages for manifest parsing because we will show more context

### How should we test and review this PR?

#13164 should be reviewed first

### Additional information
epage added a commit to epage/cargo that referenced this issue Dec 15, 2023
epage added a commit to epage/cargo that referenced this issue Dec 15, 2023
epage added a commit to epage/cargo that referenced this issue Dec 15, 2023
epage added a commit to epage/cargo that referenced this issue Dec 15, 2023
epage added a commit to epage/cargo that referenced this issue Dec 15, 2023
epage added a commit to epage/cargo that referenced this issue Dec 18, 2023
bors added a commit that referenced this issue Dec 18, 2023
refactor(schemas): Pull out as `cargo-util-schemas`

### What does this PR try to resolve?

Fixes #12801

### How should we test and review this PR?

See the individual commits for further justifications on the changes

### Additional information
@bors bors closed this as completed in 633929d Dec 18, 2023
@Rustin170506
Copy link
Member

Thank you for working on this! 💚 💙 💜 💛 ❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-cargo-api Area: cargo-the-library API and internal code issues C-feature-request Category: proposal for a feature. Before PR, ping rust-lang/cargo if this is not `Feature accepted` S-triage Status: This issue is waiting on initial triage.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants
@ehuss @epage @Eh2406 @CAD97 @Rustin170506 and others