Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initialize values for yield normalization during startup #99991

Conversation

eduardo-vp
Copy link
Member

@eduardo-vp eduardo-vp commented Mar 20, 2024

For yield normalization, we currently start by doing 8 iterations of ~1 microsecond measurements of the system yield. These iterations are scheduled lazily so the very first spin waits happen before the first measurements. Because of this, GC initialization and the first GCs can't use the adjusted yield which led to spin waits slower than should have been in some scenarios.

This is an attempt to solve this problem by doing 2 of those 8 iterations during startup (before GC initialization) while keeping the remaining 6 happening lazily so the first spin waits (including the ones done by the GC) can use an adjusted yield.

src/coreclr/vm/yieldprocessornormalized.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/yieldprocessornormalized.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/yieldprocessornormalized.cpp Show resolved Hide resolved
src/coreclr/vm/yieldprocessornormalized.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/yieldprocessornormalized.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/yieldprocessornormalized.cpp Outdated Show resolved Hide resolved
src/coreclr/vm/yieldprocessornormalized.cpp Outdated Show resolved Hide resolved
Copy link
Member

@kouvel kouvel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@jkotas
Copy link
Member

jkotas commented Mar 21, 2024

What is this trying to fix exactly? Do we have any numbers for what this change is trying to fix?

@kouvel
Copy link
Member

kouvel commented Mar 21, 2024

@Maoni0 mentioned that during GC initialization some joins involved when changing the heap count end up spinning too much, as the normalization is not done early enough. This is an attempt to address that by doing a couple of short measurements early.

@mangod9
Copy link
Member

mangod9 commented Mar 21, 2024

Correct, with DATAS enabled NativeAOT benchmarks are showing a startup regression. This should hopefully help with #94076. But would be good to validate that.

@jkotas
Copy link
Member

jkotas commented Mar 21, 2024

How does the time spent in extra spinning compares to the time spent on these eager measurements?

@jkotas
Copy link
Member

jkotas commented Mar 21, 2024

Correct, with DATAS enabled NativeAOT benchmarks are showing a startup regression

In the current state, this change is not changing native AOT.

@kouvel
Copy link
Member

kouvel commented Mar 21, 2024

We were planning on porting this and the newer normalization implementation to NativeAOT as well. NativeAOT currently uses the old normalization scheme.

@eduardo-vp eduardo-vp marked this pull request as ready for review March 22, 2024 00:59
@eduardo-vp
Copy link
Member Author

/azp run runtime

Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@eduardo-vp eduardo-vp force-pushed the add-yield-normalization-initialization branch from 11f7c24 to cf053ab Compare March 22, 2024 19:34
s_normalizationState = NormalizationState::Failed;
return;
LARGE_INTEGER li;
if (!QueryPerformanceFrequency(&li) || li.QuadPart < 1000 * 1000)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is somewhat pre-existing to the PR here, but I think there's some latent bugs in the general measurement logic here...

Notably, Linux always reports 1 billion here for the frequency (1 nanosecond ticks) even if the underlying hardware timer is actually much slower.

Additionally, there are places below where we try to measure the length of time between two QPC calls and while QPC is guaranteed to be monotonic (never decreasing) there is no guarantee that two separate calls will not return the same value (which can occur if they happen faster than the underlying timer resolution). A simple program such as the following shows this on most modern hardware:

using System.Runtime.InteropServices;

internal unsafe class Program
{
    private static void Main(string[] args)
    {
        long start = 0;
        _ = QueryPerformanceCounter(&start);

        long end = 0;
        _ = QueryPerformanceCounter(&end);

        Console.WriteLine(end - start); // likely prints 0
    }

    [DllImport("kernel32", ExactSpelling = true)]
    [SuppressGCTransition]
    static extern int QueryPerformanceCounter(long* lpPerformanceCount);
}

Additionally, on some hardware the underlying mechanism used by QPC could have a resolution as low as 10-20 cycles or it could have a latency of nearly 100-300 ns (detailed in part here: https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps). While the pause instruction (x86/x64) can execute somewhere around 10-20 cycles, 40 cycles, or up to 150 cycles (depending on the hardware in question, with newer and higher core count machines often explicitly having longer delay times). All of these cycle times could be potentially much less than the 100-300ns latency that a QPC call might have which can lead to skewed measurements.

We've also seen that yield on Arm64 is known to be a hint which can often be ignored or so low that many other ecosystems have switched to using the isb instruction instead to achieve a similar effect to the x86/x64 pause instruction.

It might be good to take a deeper look at what this code is actually doing and how accurate it actually is compared to mechanisms which are known to more accurately measure the actual instruction latency in a platform specific manner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notably, Linux always reports 1 billion here for the frequency (1 nanosecond ticks) even if the underlying hardware timer is actually much slower.

Stopwatch.Frequency also seems to report similarly. The tick count reports ticks in nanoseconds. I don't think this affects the measurements, just the frequency doesn't indicate the actual underlying frequency of the clock. The actual frequency of the clock also does not necessarily reflect the total overhead involved in getting the ticks.

Additionally, there are places below where we try to measure the length of time between two QPC calls and while QPC is guaranteed to be monotonic (never decreasing) there is no guarantee that two separate calls will not return the same value (which can occur if they happen faster than the underlying timer resolution).

I don't see any assumption being made in this code that the difference wouldn't be zero.

Additionally, on some hardware the underlying mechanism used by QPC could have a resolution as low as 10-20 cycles or it could have a latency of nearly 100-300 ns (detailed in part here: https://learn.microsoft.com/en-us/windows/win32/sysinfo/acquiring-high-resolution-time-stamps). While the pause instruction (x86/x64) can execute somewhere around 10-20 cycles, 40 cycles, or up to 150 cycles (depending on the hardware in question, with newer and higher core count machines often explicitly having longer delay times). All of these cycle times could be potentially much less than the 100-300ns latency that a QPC call might have which can lead to skewed measurements.

Yes that's why the QPC overhead is measured and the measurement duration is adjusted (with some reasonable limits). Previously the measurement was taken over a 10 ms interval but that was too much, and taking multiple shorter measurements as currently done showed similar measurement results even on systems with higher QPC overhead.

We've also seen that yield on Arm64 is known to be a hint which can often be ignored or so low that many other ecosystems have switched to using the isb instruction instead to achieve a similar effect to the x86/x64 pause instruction.

The normalization should be able to figure that out and compensate by using more iterations. isb may be more useful when there is no normalization, to induce a more similar delay. isb appears to have other side effects (involving an instruction barrier) that may be unnecessary for spin-waiting.

This is somewhat pre-existing to the PR here, but I think there's some latent bugs in the general measurement logic here...

Let me know if you see any specific issues, I can take a look.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notably, Linux always reports 1 billion here for the frequency (1 nanosecond ticks) even if the underlying hardware timer is actually much slower.

For instance, suppose the actual clock resolution is 1 us or higher (quite high, though perhaps possible). Each measurement would then typically be shorter than that duration, with some being shorter and some longer. The chosen measurement that is used for normalization favors a lower ns-per-yield, which would come from the longer measurements and would be a bit closer to the actual delay.

Copy link
Member

@tannergooding tannergooding Mar 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Stopwatch.Frequency also seems to report similarly. The tick count reports ticks in nanoseconds. I don't think this affects the measurements, just the frequency doesn't indicate the actual underlying frequency of the clock. The actual frequency of the clock also does not necessarily reflect the total overhead involved in getting the ticks.

That's because Stopwatch.Frequency just calls into QPF, which is implemented in PAL as a call to the underlying CLOCK_MONOTONIC (or similar), which always reports 1 ns ticks.

The general consideration here is we have logic such as this https://github.com/dotnet/runtime/blob/cf053aba5c42ad94fe696238ff40f28ec82c7362/src/coreclr/vm/yieldprocessornormalized.cpp#L148C13-L153C14

Where the first check (did QPF fail) cannot be triggered on Windows and where the latter check (li.QuadPart < 1_000_000) cannot be triggered on Linux. Given the intent of the check, the general logic here is "flawed" and isn't actually accounting for whether normalization can be properly done. -- This is particularly relevant as we're much more likely to encounter a low clock resolution device on Linux where embedded and low-end/legacy device support is more prevalent.

I don't see any assumption being made in this code that the difference wouldn't be zero.

I think the fact that we aren't explicitly checking for and handling that is the problem.

This call is trying to detect overhead and its registering it as 0:

QueryPerformanceCounter(&li);
UINT64 startTicks = li.QuadPart;
QueryPerformanceCounter(&li);
UINT64 elapsedTicks = li.QuadPart - startTicks;
if (elapsedTicks >= s_performanceCounterTicksPerS * measureDurationUs * (1000 / 4) / NsPerS) // elapsed >= 1/4 of the measure duration
{
measureDurationUs *= 4;
}
return measureDurationUs;
, thus returning a MeasureDuration of 1us without actually confirming, for example, that 1us is a reasonable measurement duration.

This value is passed into MeasureNsPerYield: https://github.com/dotnet/runtime/blob/cf053aba5c42ad94fe696238ff40f28ec82c7362/src/coreclr/vm/yieldprocessornormalized.cpp#L59C46-L59C63. yieldCount will on many machines simply be 28 and measureDurationTicks will be 10.

On a machine, like the Ryzen 7950X which has documented pause to be 64 cycles and therefore to take ~14ns to execute (base clock of 4.5GHz, can boost up to 5.7GHz), this is generally sufficient and gets roughly the right times. However, I've seen it compute anywhere up to 30ns per pause and down to 10ns, such as if the core it happens to measure is in a low-power state or have just finished a boost state. This variance can cause quite a bit of noise in some benchmarks or even just general app execution.
 
We then have some logic just below that which tries to account for inaccurate measures.:

// Limit the minimum to a reasonable value considering that on some systems a yield may be implemented as a no-op
const double MinNsPerYield = 0.1;
// Measured values higher than this don't affect values calculated for normalization, and it's very unlikely for a yield to
// really take this long. Limit the maximum to keep the recorded values reasonable.
const double MaxNsPerYield = YieldProcessorNormalization::TargetMaxNsPerSpinIteration / 1.5 + 1;
return Max(MinNsPerYield, Min((double)elapsedTicks * NsPerS / ((double)yieldCount * ticksPerS), MaxNsPerYield));

However, the numbers are a bit odd as there is no such thing as a 10GHz processor and so 0.1 as a MinNsPerYield is completely unrealistic value. Beyond that, a more typical processor is around 2.5GHz (often more for Desktop or Workstation, sometimes less for Mobile) and so a reasonable yield time (based on actual known data) would likely be at a minimum of around 10 cycles, given around 2-4ns per yield as a reasonable minimum.

The normalization should be able to figure that out and compensate by using more iterations. isb may be more useful when there is no normalization, to induce a more similar delay. isb appears to have other side effects (involving an instruction barrier) that may be unnecessary for spin-waiting.

The main consideration is that if the underlying yield/pause instruction is truly a no-op, then we likely should be falling back to an alternative sequence that better matches the actual intent of the spin loop. If we're literally just executing what is functionally an empty while() loop, that's not great for the system.

There's a lot of cases, particularly for Arm64 systems, where we're seeing tons of context switches compared to an equivalent x64 machine. We're also spending for more time executing instructions in places like the underlying Semaphore spin loop used by the thread pool.

So not only are we spinning and not letting the processor acknowledge the potential for a proper lower power state, but, we're doing a lot of potentially expensive work on the core that could be getting spent elsewhere.

For example, on my Volterra, it's measuring around 0.5-5ns per yield, with some of those being done on BIG cores and some being done on LITTLE cores, which means that separate measurements of yield could be problematic. If you just spin doing MeasureNsPerYield forever, measing the min and max seen over time and the ProcessorId it's running on, for example:

min: 0.4478280340349306 5
max: 0.4478280340349306 6
min: 0.42662116040955633 6
max: 0.5071461502996772 5
min: 0.41736227045075125 6
min: 0.3985651654045436 5
max: 2.345415778251599 6
max: 2.487562189054726 5
min: 0.3838771593090211 6
min: 0.37091988130563797 5
max: 36.47746243739566 6
max: 89.28571428571429 5
max: 173.2193732193732 6
max: 174.28571428571428 0
max: 182.33333333333334 6

With the really high numbers (89 and 173-182) being caused by the OS context switches -- Similar happens on the 7950X:

min: 12.76595744680851 8
max: 12.76595744680851 8
max: 13.924050632911392 8
min: 12.658227848101266 8
min: 11.702127659574469 8
max: 16.949152542372882 8
max: 24.390243902439025 30
max: 26.829268292682926 26
max: 81.69014084507042 8
max: 142.85714285714286 8
max: 182.33333333333334 0

Copy link
Member

@tannergooding tannergooding Mar 24, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these unaccounted for nuances are causing us issues, especially for Arm64 on Linux and are part of why the retired instruction counts, the number of context switches (which are currently more expensive for Arm64), and other metrics are so much higher.

I would imagine some of them would also cause issues for things like Intel Alder Lake CPUs where you have Power and Efficiency cores with very different cycle times and base/max clock speeds.

For example, the same code running on WSL or an actual direct boot to Linux (Ubuntu 22.04) reports the following for Ryzen 7950X:

min: 8.333333333333334 23
max: 8.333333333333334 23
min: 7.046979865771812 23
min: 6.709677419354839 23
min: 6.580645161290323 23
min: 6.538461538461538 23
max: 9.0625 23
max: 10.412371134020619 23
max: 40.796460176991154 23
max: 54.01639344262295 23
max: 102.53571428571429 23
max: 181.78571428571428 23
max: 182.14285714285714 23
max: 182.33333333333334 23
min: 6.5 23
min: 6.4743589743589745 23
min: 6.3354037267080745 23
min: 6.31875 23
min: 6.3125 23
min: 6.273291925465839 22

And the following for Volterra:

min: 0.7791195948578107 2
max: 0.7791195948578107 2
min: 0.5340114431023522 2
min: 0.43294614572333684 2
min: 0.41355658664514827 2
min: 0.3994154895275207 2
min: 0.3990119703591108 2
min: 0.3896736483195324 2
min: 0.38767019667170954 2
min: 0.3806164129223914 2
min: 0.3795242062390077 2
min: 0.37821482602118 2
max: 15.130674002751032 2
max: 32.18707015130674 2
min: 0.3668715032559846 2
max: 37.61467889908257 2
max: 53.211009174311926 2
max: 55.04587155963303 2
max: 166.9724770642202 2
max: 182.33333333333334 2

So the same code on the same hardware is reporting significantly different results in Linux and not measuring something which really does what we're wanting for Arm64 as its taking so little time.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tannergooding, can you file an issue? We can continue the discussion there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Logged #100242

@eduardo-vp eduardo-vp force-pushed the add-yield-normalization-initialization branch from cf053ab to acd40a5 Compare March 25, 2024 20:43
@eduardo-vp
Copy link
Member Author

eduardo-vp commented Apr 17, 2024

@jkotas @Maoni0 I collected some numbers using this test case \clrmain\public\writable\users\maonis\YieldProcessor. This is the branch with the baseline and this is the one with the changes.

The GC did ~30% less YieldProcessor() calls with this change by doing an early measurement of the system yield which took on average 5 us during start up time.

image

@jkotas
Copy link
Member

jkotas commented Apr 17, 2024

How do these numbers from GCPerf simulator translate to user observable startup perf number?

The spin count measurement should happen fairly quickly in the background. I do not think that you should ever see 100ms-200ms spent by spinning during startup in real world app. If we are actually spending that much time spinning in a real world apps during startup, I think we have a different problem.

I would like us to compare the solution proposed here with a much simpler solution that simply starts with the low spin count multiplier and lets the background measurement to increase it.

@kouvel
Copy link
Member

kouvel commented Apr 17, 2024

I would like us to compare the solution proposed here with a much simpler solution that simply starts with the low spin count multiplier and lets the background measurement to increase it.

I also suggested this as an alternative to @Maoni0, I think it would indeed solve this particular issue. From my measurements it did not take very long for the background measurement to update the GC with that info, in that particular test case, and in the meantime better defaults could suffice. Granted, our defaults in CoreCLR are not great for arm (the GC has separate defaults). A thought is that the default could just be to not spin-wait, but that goes into territory that would need startup time measurements on various kinds of systems.

@jkotas
Copy link
Member

jkotas commented Apr 17, 2024

I would like us to compare the solution proposed here with a much simpler solution that simply starts with the low spin count multiplier and lets the background measurement to increase it.

Something like: 87bbad1

@Maoni0
Copy link
Member

Maoni0 commented Apr 17, 2024

this was discovered in asp.net benchmarks, it took 3 GCs for the measurement to come in. I can also easily repro this on my machines with a few GCs go by before the measurement comes in.

I see this has taken way more time than I had anticipated. at this point I'd say I'd just do the "start with a low count" thing in GC (which was what I did as a workaround just to see the effect anyway when this was discovered).

@kouvel
Copy link
Member

kouvel commented Apr 18, 2024

Wanted to clarify some things:

  • The defaults (which are currently not appropriate for all architectures) become irrelevant with this change, one of the benefits
    • There is a very small price to pay for that in startup time (~5 us)
    • Maintaining good defaults can be problematic, though there are only a few archs we are dealing with currently, so that is arguable
  • The current spin-waiting defaults are not appropriate on arm, they are closer to not spin-waiting on arm. I'm not sure that we have tested if doing a reasonable amount of spin-waiting would improve startup time there. Perhaps that is something to be tested.

In any case, the main intent of this change is to alleviate ourselves from having to come up with the correct defaults for every type of system out there, with a small price to pay for that. I think it's a worthy attempt, but I'd like to hear feedback.

@jkotas
Copy link
Member

jkotas commented Apr 18, 2024

a small price to pay

Unless we can measure what we are getting out of the proposed change in real world apps, I would argue that no or very limited spin-waiting during startup is even smaller price to pay and much simpler as well.

@kouvel
Copy link
Member

kouvel commented Apr 24, 2024

Maybe this issue can be closed for now. I've filed a followup issue (#101506) to check the defaults on ARM processors. If there are other cases where this sort of early measurement can help, perhaps they could be tackled as needed.

@eduardo-vp eduardo-vp closed this Apr 24, 2024
@github-actions github-actions bot locked and limited conversation to collaborators May 25, 2024
@eduardo-vp eduardo-vp deleted the add-yield-normalization-initialization branch May 29, 2024 20:38
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants