Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RCU CPU stall warning in a multi-core system simulation #51

Open
chiangkd opened this issue Aug 4, 2024 · 21 comments
Open

RCU CPU stall warning in a multi-core system simulation #51

chiangkd opened this issue Aug 4, 2024 · 21 comments
Assignees

Comments

@chiangkd
Copy link
Collaborator

chiangkd commented Aug 4, 2024

Execute semu with multi-core system simulation

make check SMP=4

The RCU CPU stall warning, as discussed in #49 , is accompanied by an increase in timer interrupts and clock_gettime system calls to produce a real-time timer. This causes the CPU to wait longer than a typical grace period, which is usually 21 seconds.

[   33.044071] rcu: INFO: rcu_sched self-detected stall on CPU
[   33.048071] rcu: 	2-....: (5247 ticks this GP) idle=da1c/1/0x40000002 softirq=689/689 fqs=1847
[   33.056071] 	(t=5253 jiffies g=-887 q=11 ncpus=4)
[   33.064071] CPU: 2 PID: 8 Comm: kworker/u8:0 Not tainted 6.1.94 #12
[   33.068071] Hardware name: semu (DT)
[   33.072071] Workqueue: events_unbound async_run_entry_fn
[   33.080071] epc : eat+0x34/0x54
[   33.088071]  ra : do_reset+0x38/0x6c
[   33.096071] epc : c03811fc ra : c038137c sp : c0871e40
[   33.104071]  gp : c04de858 tp : c0852f40 t0 : c0aabc70
[   33.112071]  t1 : 00000008 t2 : 67e0cb47 s0 : c0871e50
[   33.120071]  s1 : c03a48b8 a0 : 00000001 a1 : 00000000
[   33.128071]  a2 : 00000000 a3 : 0053be47 a4 : 00000000
[   33.136071]  a5 : c03a48b8 a6 : 00000000 a7 : 0000000c
[   33.144071]  s2 : 004725a7 s3 : 0038da58 s4 : 00000000
[   33.152071]  s5 : 00000030 s6 : 00000007 s7 : c03a496c
[   33.156071]  s8 : c03a48b8 s9 : c038112c s10: c03a4968
[   33.160071]  s11: c0381538 t3 : c09e4140 t4 : 00000003
[   33.168071]  t5 : ec37d331 t6 : c0871e08
[   33.172071] status: 00000120 badaddr: c03a48c0 cause: 80000005
[   33.180071] [<c03811fc>] eat+0x34/0x54
[   33.184071] [<c038137c>] do_reset+0x38/0x6c
[   33.192071] [<c0381514>] write_buffer+0x40/0x64
[   33.200071] [<c0381d20>] unpack_to_rootfs+0x1c4/0x300
[   33.212072] [<c03824ec>] do_populate_rootfs+0x70/0xd8
[   33.220072] [<c002bd58>] async_run_entry_fn+0x30/0xd0
[   33.232072] [<c002195c>] process_one_work+0x198/0x224
[   33.240072] [<c0021ee4>] worker_thread+0x238/0x294
[   33.248072] [<c002898c>] kthread+0xd4/0xd8
[   33.252072] [<c0002508>] ret_from_exception+0x0/0x1c

Implementing multi-thread support for semu, as discussed in PR #49, might improve the performance of the booting process.

@Mes0903
Copy link
Collaborator

Mes0903 commented Dec 25, 2024

Although implementing multi-threaded system emulation can significantly alleviate this issue, I think it does not mean the problem will no longer occur after multi-threaded system emulation is complete. Perhaps we can start by optimizing the timer.

Currently, the function semu_timer_get relies on clock_gettime, and semu_timer_get is primarily called during reads and writes to mtime. In other words, every time mtime is accessed, clock_gettime is invoked. This leads to an excessive number of timer interrupts, causing performance issues in the emulator.

In #49, it was suggested that lowering the frequency set in semu_timer_init could mitigate this problem. However, this would mean the Linux kernel perceives the clock as no longer being "real-time" but rather a "slowed down" or "imprecise" clock, which goes against the intent of that pull request.

To strike a balance between these two extremes, I think we can maintain a dedicated emulator timer and updating it at the start of each emulation cycle. This way, clock_gettime would only be called once per emulation cycle, while tick updates would be maintained separately. The logic for reading mtime would also be simplified.

I made some modifications to test this approach, and it did result in a slight performance improvement for the emulator. On my machine, with SMP=6, the RCU CPU Stall warning no longer appears. However, with SMP=8, the warning still occurs.

@chiangkd
Copy link
Collaborator Author

I completely agree that your implementation can help avoid RCU CPU stall warnings. However, as the number of simulated cores increases, the warning is likely to occur again unless the emulation period per cycle is also increased proportionally to the number of cores. Is my understanding correct?

I rebuilt the Linux kernel with CONFIG_PREEMPT=y to enable kernel preemption. This configuration suppresses the warning when simulating with SMP=32. The system boots successfully, as shown below:

[  351.875391] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[  356.664664] virtio_net virtio0: Assigned random MAC address 6e:8d:9a:9d:3f:86
[  360.199865] clk: Disabling unused clocks
[ 1133.368994] Freeing initrd memory: 8188K
[ 1135.085089] Freeing unused kernel image (initmem) memory: 180K
[ 1135.149092] Kernel memory protection not selected by kernel config.
[ 1135.201095] Run /init as init process
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login:

But preemption may increase the frequency of context switches, causing overall CPU usage to increase.
For high-throughput tasks, overall performance may be reduced.

@Mes0903
Copy link
Collaborator

Mes0903 commented Dec 26, 2024

However, as the number of simulated cores increases, the warning is likely to occur again unless the emulation period per cycle is also increased proportionally to the number of cores. Is my understanding correct?

yes, the improvement is limited, as the number of cores grow up, the warning would occur again.

@Mes0903
Copy link
Collaborator

Mes0903 commented Dec 26, 2024

The accuracy of the timer primarily affects user programs. I think that during the boot process, it is unnecessary to use such a high-precision timer. Instead, a less precise timer, or even one that simply increments in a straightforward manner, could be used. After the boot process is complete, the system can switch back to a more precise timer.

Here is a simple example. I added a global flag boot_complete to detect whether the boot process has finished. This can be determined by the first transition to U mode.

static void op_sret(hart_t *vm)
{
    /* Restore from stack */
    vm->pc = vm->sepc;
    mmu_invalidate(vm);
    vm->s_mode = vm->sstatus_spp;
    vm->sstatus_sie = vm->sstatus_spie;

    /* After the booting process is complete, initrd will be loaded. At this
     * point, the sytstem will switch to U mode for the first time. Therefore,
     * by checking whether the switch to U mode has already occurred, we can
     * determine if the boot process has been completed.
     */
    if (!boot_complete && !vm->s_mode)
        boot_complete = true;

    /* Reset stack */
    vm->sstatus_spp = false;
    vm->sstatus_spie = true;
}

Before the boot process is complete, I didn't use clock_gettime. Instead, I simply increment tv_nsec and combine it with the lower frequency mentioned in #49 to eliminate the CPU stall warning.

bool boot_complete = false;
static struct timespec host_time;

// ...

void semu_timer_init(semu_timer_t *timer, uint64_t freq)
{
    timer->freq = freq;
    clock_gettime(CLOCKID, &host_time);
    semu_timer_rebase(timer, 0);
}

static uint64_t semu_timer_clocksource(uint64_t freq)
{
#if defined(HAVE_POSIX_TIMER)
    if (boot_complete) {
        clock_gettime(CLOCKID, &host_time);
        return host_time.tv_sec * freq +
               mult_frac(host_time.tv_nsec, freq, 1e9);
    } else {
        return host_time.tv_sec * freq +
               mult_frac(host_time.tv_nsec++, freq / 1000, 1e9);
    }
    
// ...
#endif
}

uint64_t semu_timer_get(semu_timer_t *timer)
{
    /* Rebase the timer to the current time after the boot process. */
    static bool first = true;
    if (first && boot_complete) {
        first = false;
        timer->begin = semu_timer_clocksource(timer->freq);
    }

    return semu_timer_clocksource(timer->freq) - timer->begin;
}

// ...

This is the sample output:

mes@DESKTOP-HLQ9F6A:~/semu$ make check SMP=32
  CC    riscv.o
  CC    ram.o
  CC    utils.o
  CC    plic.o
  CC    uart.o
  CC    main.o
  CC    aclint.o
  CC    virtio-blk.o
  CC    virtio-net.o
  CC    netdev.o
  LD    semu
 DTC    minimal.dtb
Ready to launch Linux kernel. Please be patient.
failed to allocate TAP device: Operation not permitted
[    0.000000] Linux version 6.1.99 (jserv@node1) (riscv32-buildroot-linux-gnu-gcc.br_real (Buildroot 2024.02.4) 12.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP Thu Jul 18 13:04:10 CST 2024
[    0.000000] Machine model: semu
[    0.000000] earlycon: ns16550 at MMIO 0xf4000000 (options '')
[    0.000000] printk: bootconsole [ns16550] enabled
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x999 Version=0x1
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions aim
[    0.000000] riscv: ELF capabilities aim
[    0.000000] percpu: Embedded 10 pages/cpu s11604 r8192 d21164 u40960
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130048
[    0.000000] Kernel command line: earlycon console=ttyS0
[    0.000000] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] printk: log_buf_len total cpu_extra contributions: 126976 bytes
[    0.000000] printk: log_buf_len min size: 65536 bytes
[    0.000000] printk: log_buf_len: 262144 bytes
[    0.000000] printk: early log buf free: 63952(97%)
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 504100K/524288K available (3578K kernel code, 345K rwdata, 873K rodata, 185K init, 140K bss, 20188K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: interrupt-controller@0: mapped 31 interrupts with 32 handlers for 32 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv-timer: riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [0]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0xefdb196da, max_idle_ns: 440795204367 ns
[    0.000000] sched_clock: 64 bits at 65MHz, resolution 15ns, wraps every 2199023255550ns
[    0.000007] Console: colour dummy device 80x25
[    0.000008] Calibrating delay loop (skipped), value calculated using timer frequency.. 130.00 BogoMIPS (lpj=260000)
[    0.000009] pid_max: default: 32768 minimum: 301
[    0.000014] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.000015] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.000038] ASID allocator disabled (0 bits)
[    0.000040] rcu: Hierarchical SRCU implementation.
[    0.000040] rcu:     Max phase no-delay instances is 1000.
[    0.000057] smp: Bringing up secondary CPUs ...
[    0.000295] smp: Brought up 1 node, 32 CPUs
[    0.000320] devtmpfs: initialized
[    0.000381] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.000382] futex hash table entries: 8192 (order: 7, 524288 bytes, linear)
[    0.000406] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    0.000486] platform soc@F0000000: Fixed dependency cycle(s) with /soc@F0000000/interrupt-controller@0
[    0.000662] clocksource: Switched to clocksource riscv_clocksource
[    0.000806] NET: Registered PF_INET protocol family
[    0.000811] IP idents hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.000858] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[    0.000859] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000860] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    0.000862] TCP bind hash table entries: 4096 (order: 4, 65536 bytes, linear)
[    0.000865] TCP: Hash tables configured (established 4096 bind 4096)
[    0.000868] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.000869] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.000874] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    0.000883] Unpacking initramfs...
[    0.005874] Freeing initrd memory: 8188K
[    0.005907] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    0.006895] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    0.006917] printk: console [ttyS0] disabled
[    0.006918] f4000000.serial: ttyS0 at MMIO 0xf4000000 (irq = 1, base_baud = 312500) is a 16550
[    0.006919] printk: console [ttyS0] enabled
[    0.006919] printk: console [ttyS0] enabled
[    0.006920] printk: bootconsole [ns16550] disabled
[    0.006920] printk: bootconsole [ns16550] disabled
[    0.006952] virtio_blk virtio1: 1/0/0 default/read/poll queues
[    0.006977] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[    0.006993] virtio_net virtio0: Assigned random MAC address 9e:21:77:c8:0b:a8
[    0.007019] clk: Disabling unused clocks
[    0.007029] Freeing unused kernel image (initmem) memory: 180K
[    0.007030] Kernel memory protection not selected by kernel config.
[    0.007031] Run /init as init process
[   17.029125] hrtimer: interrupt took 79999969 ns
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login: root
# cat /proc/cpuinfo
processor       : 0
hart            : 0
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 1
hart            : 1
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 2
hart            : 2
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 3
hart            : 3
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 4
hart            : 4
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 5
hart            : 5
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 6
hart            : 6
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 7
hart            : 7
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 8
hart            : 8
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 9
hart            : 9
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 10
hart            : 10
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 11
hart            : 11
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 12
hart            : 12
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 13
hart            : 13
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 14
hart            : 14
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 15
hart            : 15
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 16
hart            : 16
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 17
hart            : 17
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 18
hart            : 18
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 19
hart            : 19
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 20
hart            : 20
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 21
hart            : 21
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 22
hart            : 22
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 23
hart            : 23
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 24
hart            : 24
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 25
hart            : 25
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 26
hart            : 26
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 27
hart            : 27
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 28
hart            : 28
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 29
hart            : 29
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 30
hart            : 30
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

processor       : 31
hart            : 31
isa             : rv32ima
mmu             : sv32
mvendorid       : 0x12345678
marchid         : 0x80000001
mimpid          : 0x1

@chiangkd
Copy link
Collaborator Author

I agree that your changes can quickly and easily resolve the RCU stall warning issue! However, your log cannot represent the actual boot time in this situation ([ 17.029125]). It should take much more time, am I correct?

I try to reproduce your work, the RCU stall warning is resolved when simulate SMP=32

[    0.007006] Run /init as init process
[    0.026121] hrtimer: interrupt took 12000723 ns
Starting syslogd: OK

another message hrtimer: interrupt took XXXX ns is shown (also in your log)

A quick research that this warning is produced by hrtimer_interrupt() (high resolution timer) because our kernel has config CONFIG_HIGH_RES_TIMERS=y

I'm not sure is any side effect here

@Mes0903
Copy link
Collaborator

Mes0903 commented Dec 26, 2024

However, your log cannot represent the actual boot time in this situation ([ 17.029125]). It should take much more time, am I correct?

Yes, the primary reason for incrementing tv_nsec is to allow the boot process time to progress, ensuring the boot process can proceed smoothly. Therefore, during the boot process, the timer does not serve as an actual timing mechanism.

Once the system switches to U mode for the first time, the boot_complete flag is set, and the time reference is reset. As a result, the [17.029125] here represents 17.029125 seconds since the first switch to U mode.

As for the HRT warning, I think it was triggered due to a sudden jump in the system clock right after the boot process completed. I'm not sure is any side effect here too.

@jserv
Copy link
Collaborator

jserv commented Dec 26, 2024

A quick research that this warning is produced by hrtimer_interrupt() (high resolution timer) because our kernel has config CONFIG_HIGH_RES_TIMERS=y

This option was set for the sake of benchmarking purpose. RT-Tests relies on HRT features.

@jserv
Copy link
Collaborator

jserv commented Dec 26, 2024

However, your log cannot represent the actual boot time in this situation ([ 17.029125]). It should take much more time, am I correct?

The timer increments should align with the frequency defined in the device tree. We could use an approach similar to BogoMips to make the necessary adjustments.

@Mes0903
Copy link
Collaborator

Mes0903 commented Dec 29, 2024

After multiple attempts, I realize that independently maintaining nsec is not a good idea. This results in time intervals between reads being uniform, whereas in reality, we call clock_gettime at varying time intervals.

In contrast, I believe maintaining a frequency scaling factor is a better solution. As mentioned in #49, this achieves the purpose of slowing down time during the boot process. Since it’s merely a scaling factor, we can still derive real-time values from it.

As for the hrtimer: interrupt took XXXX ns issue, after a few days of testing, I found that this doesn’t seem to be caused by time drift. In fact, during these tests, I re-based the timer after the boot process using the recorded time to ensure no drift, but the hrtimer message still appeared. The timing of the message is not consistent—it sometimes appears after the "Welcome to Buildroot" message, and sometimes right after switching to U-mode. But always after switch to U-mode.

Interestingly, on my machine, the warning doesn't appear at all with SMP=16, regardless of how the frequency is adjusted. Moreover, the current implementation only affects the timer during the boot process; after switching to U-mode, the timer behaves exactly as before. Therefore, I believe this warning is due to the current sequential emulation approach. After the multi-threaded emulation is implemented, I think the situation would be mitigated a lot.

Here’s a diagram of the overall flow:

image

Below is a simple example:

static uint64_t semu_timer_clocksource(uint64_t freq)
{
#if defined(HAVE_POSIX_TIMER)
    struct timespec t;
    clock_gettime(CLOCKID, &t);

    if (boot_complete)
        return t.tv_sec * freq + mult_frac(t.tv_nsec, freq, 1e9);
    else
        return t.tv_sec * (freq / 100) + mult_frac(t.tv_nsec, (freq / 100), 1e9); 
#elif defined(HAVE_MACH_TIMER)
    static mach_timebase_info_data_t t;
    if (t.denom == 0)
        (void) mach_timebase_info(&t);
    return mult_frac(mult_frac(mach_absolute_time(), t.numer, t.denom), freq,
                     1e9);
#else
    return time(0) * freq;
#endif
}

uint64_t semu_timer_get(semu_timer_t *timer)
{
    static bool first = true;
    if (first && boot_complete) {
        first = false;
        semu_timer_rebase(timer, 0);
        printf("\033[1;33m[SEMU LOG]: Switch to real time\033[0m\n");
    }

    return semu_timer_clocksource(timer->freq) - timer->begin;
}

I think this approach is better than maintaining two separate timers during the boot process. Dividing the frequency by 100 means the boot process operates at one one-hundredth of real-time, allowing us to easily derive the actual boot time. This scaling factor can also be configured in the Makefile. I attempted to dynamically measure the cost of clock_gettime to set this value, but realized that using clock_gettime as a benchmark was not accurate. Therefore, in this example, a scale of 100 was used for simplicity.

Here is the output of the test:

mes@DESKTOP-HLQ9F6A:~/MesRv32emu/semu$ make check SMP=32
 DTC    minimal.dtb
Ready to launch Linux kernel. Please be patient.
failed to allocate TAP device: Operation not permitted
[    0.000000] Linux version 6.1.99 (jserv@node1) (riscv32-buildroot-linux-gnu-gcc.br_real (Buildroot 2024.02.4) 12.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP Thu Jul 18 13:04:10 CST 2024
[    0.000000] Machine model: semu
[    0.000000] earlycon: ns16550 at MMIO 0xf4000000 (options '')
[    0.000000] printk: bootconsole [ns16550] enabled
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x999 Version=0x1
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions aim
[    0.000000] riscv: ELF capabilities aim
[    0.000000] percpu: Embedded 10 pages/cpu s11604 r8192 d21164 u40960
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130048
[    0.000000] Kernel command line: earlycon console=ttyS0
[    0.000000] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] printk: log_buf_len total cpu_extra contributions: 126976 bytes
[    0.000000] printk: log_buf_len min size: 65536 bytes
[    0.000000] printk: log_buf_len: 262144 bytes
[    0.000000] printk: early log buf free: 63952(97%)
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 504100K/524288K available (3578K kernel code, 345K rwdata, 873K rodata, 185K init, 140K bss, 20188K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: interrupt-controller@0: mapped 31 interrupts with 32 handlers for 32 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv-timer: riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [0]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0xefdb196da, max_idle_ns: 440795204367 ns
[    0.000000] sched_clock: 64 bits at 65MHz, resolution 15ns, wraps every 2199023255550ns
[    0.000694] Console: colour dummy device 80x25
[    0.000893] Calibrating delay loop (skipped), value calculated using timer frequency.. 130.00 BogoMIPS (lpj=260000)
[    0.000992] pid_max: default: 32768 minimum: 301
[    0.001489] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.001588] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.004070] ASID allocator disabled (0 bits)
[    0.004269] rcu: Hierarchical SRCU implementation.
[    0.004368] rcu:     Max phase no-delay instances is 1000.
[    0.006056] smp: Bringing up secondary CPUs ...
[    0.049211] smp: Brought up 1 node, 32 CPUs
[    0.055530] devtmpfs: initialized
[    0.071056] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.071360] futex hash table entries: 8192 (order: 7, 524288 bytes, linear)
[    0.077127] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    0.097399] platform soc@F0000000: Fixed dependency cycle(s) with /soc@F0000000/interrupt-controller@0
[    0.141221] clocksource: Switched to clocksource riscv_clocksource
[    0.176651] NET: Registered PF_INET protocol family
[    0.177847] IP idents hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.189801] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[    0.190000] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.190199] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    0.190697] TCP bind hash table entries: 4096 (order: 4, 65536 bytes, linear)
[    0.191394] TCP: Hash tables configured (established 4096 bind 4096)
[    0.192190] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.192390] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.193684] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    0.196073] Unpacking initramfs...
[    0.217861] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    0.463628] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    0.468923] printk: console [ttyS0] disabled
[    0.469123] f4000000.serial: ttyS0 at MMIO 0xf4000000 (irq = 1, base_baud = 312500) is a 16550
[    0.469422] printk: console [ttyS0] enabled
[    0.469422] printk: console [ttyS0] enabled
[    0.469622] printk: bootconsole [ns16550] disabled
[    0.469622] printk: bootconsole [ns16550] disabled
[    0.477610] virtio_blk virtio1: 1/0/0 default/read/poll queues
[    0.483698] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[    0.487487] virtio_net virtio0: Assigned random MAC address 1e:07:10:f3:00:37
[    0.494367] clk: Disabling unused clocks
[    1.486134] Freeing initrd memory: 8188K
[    1.488423] Freeing unused kernel image (initmem) memory: 180K
[    1.488622] Kernel memory protection not selected by kernel config.
[    1.488821] Run /init as init process
[SEMU LOG]: Switch to real time
[    7.946974] hrtimer: interrupt took 69481784 ns
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login:

Another output for the factor set to 50:

mes@DESKTOP-HLQ9F6A:~/MesRv32emu/semu$ make check SMP=32
  CC    riscv.o
  CC    ram.o
  CC    utils.o
  CC    plic.o
  CC    uart.o
  CC    main.o
  CC    aclint.o
  CC    virtio-blk.o
  CC    virtio-net.o
  CC    netdev.o
  LD    semu
 DTC    minimal.dtb
Ready to launch Linux kernel. Please be patient.
failed to allocate TAP device: Operation not permitted
[    0.000000] Linux version 6.1.99 (jserv@node1) (riscv32-buildroot-linux-gnu-gcc.br_real (Buildroot 2024.02.4) 12.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP Thu Jul 18 13:04:10 CST 2024
[    0.000000] Machine model: semu
[    0.000000] earlycon: ns16550 at MMIO 0xf4000000 (options '')
[    0.000000] printk: bootconsole [ns16550] enabled
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x999 Version=0x1
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions aim
[    0.000000] riscv: ELF capabilities aim
[    0.000000] percpu: Embedded 10 pages/cpu s11604 r8192 d21164 u40960
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130048
[    0.000000] Kernel command line: earlycon console=ttyS0
[    0.000000] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] printk: log_buf_len total cpu_extra contributions: 126976 bytes
[    0.000000] printk: log_buf_len min size: 65536 bytes
[    0.000000] printk: log_buf_len: 262144 bytes
[    0.000000] printk: early log buf free: 63952(97%)
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 504100K/524288K available (3578K kernel code, 345K rwdata, 873K rodata, 185K init, 140K bss, 20188K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: interrupt-controller@0: mapped 31 interrupts with 32 handlers for 32 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv-timer: riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [0]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0xefdb196da, max_idle_ns: 440795204367 ns
[    0.000000] sched_clock: 64 bits at 65MHz, resolution 15ns, wraps every 2199023255550ns
[    0.001390] Console: colour dummy device 80x25
[    0.001589] Calibrating delay loop (skipped), value calculated using timer frequency.. 130.00 BogoMIPS (lpj=260000)
[    0.001788] pid_max: default: 32768 minimum: 301
[    0.002781] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.002980] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.007549] ASID allocator disabled (0 bits)
[    0.007947] rcu: Hierarchical SRCU implementation.
[    0.007947] rcu:     Max phase no-delay instances is 1000.
[    0.011523] smp: Bringing up secondary CPUs ...
[    0.095725] smp: Brought up 1 node, 32 CPUs
[    0.108034] devtmpfs: initialized
[    0.138606] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.139202] futex hash table entries: 8192 (order: 7, 524288 bytes, linear)
[    0.150715] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    0.190212] platform soc@F0000000: Fixed dependency cycle(s) with /soc@F0000000/interrupt-controller@0
[    0.279523] clocksource: Switched to clocksource riscv_clocksource
[    0.350564] NET: Registered PF_INET protocol family
[    0.352978] IP idents hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.375270] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[    0.375872] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.376273] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    0.377275] TCP bind hash table entries: 4096 (order: 4, 65536 bytes, linear)
[    0.378478] TCP: Hash tables configured (established 4096 bind 4096)
[    0.380083] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.380484] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.382890] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    0.387903] Unpacking initramfs...
[    0.415743] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    0.910292] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    0.921461] printk: console [ttyS0] disabled
[    0.921867] f4000000.serial: ttyS0 at MMIO 0xf4000000 (irq = 1, base_baud = 312500) is a 16550
[    0.922476] printk: console [ttyS0] enabled
[    0.922476] printk: console [ttyS0] enabled
[    0.922881] printk: bootconsole [ns16550] disabled
[    0.922881] printk: bootconsole [ns16550] disabled
[    0.939299] virtio_blk virtio1: 1/0/0 default/read/poll queues
[    0.951841] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[    0.959919] virtio_net virtio0: Assigned random MAC address 8e:5e:2e:78:c2:e5
[    0.974247] clk: Disabling unused clocks
[    2.984914] Freeing initrd memory: 8188K
[    2.989718] Freeing unused kernel image (initmem) memory: 180K
[    2.989918] Kernel memory protection not selected by kernel config.
[    2.990318] Run /init as init process
[SEMU LOG]: Switch to real time
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login: [   88.934001] hrtimer: interrupt took 60891661 ns

Also another output of the factor set to 10:

mes@DESKTOP-HLQ9F6A:~/MesRv32emu/semu$ make check SMP=32
  CC    riscv.o
  CC    ram.o
  CC    utils.o
  CC    plic.o
  CC    uart.o
  CC    main.o
  CC    aclint.o
  CC    virtio-blk.o
  CC    virtio-net.o
  CC    netdev.o
  LD    semu
 DTC    minimal.dtb
Ready to launch Linux kernel. Please be patient.
failed to allocate TAP device: Operation not permitted
[    0.000000] Linux version 6.1.99 (jserv@node1) (riscv32-buildroot-linux-gnu-gcc.br_real (Buildroot 2024.02.4) 12.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP Thu Jul 18 13:04:10 CST 2024
[    0.000000] Machine model: semu
[    0.000000] earlycon: ns16550 at MMIO 0xf4000000 (options '')
[    0.000000] printk: bootconsole [ns16550] enabled
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x999 Version=0x1
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions aim
[    0.000000] riscv: ELF capabilities aim
[    0.000000] percpu: Embedded 10 pages/cpu s11604 r8192 d21164 u40960
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130048
[    0.000000] Kernel command line: earlycon console=ttyS0
[    0.000000] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] printk: log_buf_len total cpu_extra contributions: 126976 bytes
[    0.000000] printk: log_buf_len min size: 65536 bytes
[    0.000000] printk: log_buf_len: 262144 bytes
[    0.000000] printk: early log buf free: 63952(97%)
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 504100K/524288K available (3578K kernel code, 345K rwdata, 873K rodata, 185K init, 140K bss, 20188K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: interrupt-controller@0: mapped 31 interrupts with 32 handlers for 32 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv-timer: riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [0]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0xefdb196da, max_idle_ns: 440795204367 ns
[    0.000000] sched_clock: 64 bits at 65MHz, resolution 15ns, wraps every 2199023255550ns
[    0.006954] Console: colour dummy device 80x25
[    0.008940] Calibrating delay loop (skipped), value calculated using timer frequency.. 130.00 BogoMIPS (lpj=260000)
[    0.009934] pid_max: default: 32768 minimum: 301
[    0.014901] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.015894] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.040726] ASID allocator disabled (0 bits)
[    0.042712] rcu: Hierarchical SRCU implementation.
[    0.042712] rcu:     Max phase no-delay instances is 1000.
[    0.059595] smp: Bringing up secondary CPUs ...
[    0.510227] smp: Brought up 1 node, 32 CPUs
[    0.577710] devtmpfs: initialized
[    0.738451] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.740435] futex hash table entries: 8192 (order: 7, 524288 bytes, linear)
[    0.801942] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    1.015202] platform soc@F0000000: Fixed dependency cycle(s) with /soc@F0000000/interrupt-controller@0
[    1.502151] clocksource: Switched to clocksource riscv_clocksource
[    1.887699] NET: Registered PF_INET protocol family
[    1.900829] IP idents hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    2.030783] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[    2.032790] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    2.034797] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    2.040818] TCP bind hash table entries: 4096 (order: 4, 65536 bytes, linear)
[    2.047842] TCP: Hash tables configured (established 4096 bind 4096)
[    2.055870] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    2.057877] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    2.070922] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    2.095005] Unpacking initramfs...
[    2.187237] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    4.908858] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    4.971514] printk: console [ttyS0] disabled
[    4.972524] f4000000.serial: ttyS0 at MMIO 0xf4000000 (irq = 1, base_baud = 312500) is a 16550
[    4.975553] printk: console [ttyS0] enabled
[    4.975553] printk: console [ttyS0] enabled
[    4.978582] printk: bootconsole [ns16550] disabled
[    4.978582] printk: bootconsole [ns16550] disabled
[    5.065338] virtio_blk virtio1: 1/0/0 default/read/poll queues
[    5.132830] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[    5.178077] virtio_net virtio0: Assigned random MAC address 6e:6c:05:99:21:82
[    5.251437] clk: Disabling unused clocks
[   15.876369] Freeing initrd memory: 8188K
[   15.901243] Freeing unused kernel image (initmem) memory: 180K
[   15.903233] Kernel memory protection not selected by kernel config.
[   15.905222] Run /init as init process
[SEMU LOG]: Switch to real time
Starting syslogd: [   16.343392] hrtimer: interrupt took 61000570 ns
OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login: 

In my environment, even with varying scale factors, the hrtimer warning consistently appeared at approximately 60000000 ns. I think this observation supports my hypothesis.

@Mes0903
Copy link
Collaborator

Mes0903 commented Jan 3, 2025

Here is a summary of two potential approaches to mitigate RCU CPU stalls under the current sequentially-emulation scenario, we have two methods now

Methods

  1. Scale Frequency
  2. Manually Maintain Increment of nsec

1. Scale Frequency

This method involves calling clock_gettime during the boot process to obtain the time difference and then scaling the frequency to slow down the system. An example implementation was provided in earlier discussions of this issue.

Pros

  • The boot process can use logs to approximate real-world time very easily.
  • By scaling, RCU CPU stall warnings can be significantly reduced. As long as the scaling factor is large enough, no matter how many harts are involved, there will be no RCU CPU stall warnings during the boot process.

Cons

  • Requires calling clock_gettime, resulting in a longer execution time.

2. Manually Increment nsec

This method uses statistical estimations to determine an empirical value representing the boot process period. Based on this value, the increment of nsec is designed. A rough estimate of the relationship is provided below:

image

The orange segments in the diagram represent calls to semu_timer_clocksource. Each call adds a predefined value (blue arrows) regardless of the interval between calls.

Pros

  • Does not require clock_gettime, so the boot process executes faster.

Cons

  • During the boot process, semu_timer_clocksource no longer functions as a timer period calculator but acts like a counter. Therefore, the logs cannot be used to approximate real-world time.
  • Since nsec increments are based on a designed value, RCU CPU stalls may still occur under different environments.

Synchronization and Time Rebase

Regardless of the chosen method, the emulator's internal time and real-world execution time will not synchronize during the boot process. Thus, when switch back to real-time timer, an big time difference would appear. However, this time discrepancy can be resolved by rebasing real-time with emulation time when switching to U mode.

image

Additionally, hrtimer warnings during the boot process are unrelated to the timers being used. This issue can be revisited during the development of multi-threaded emulation to check if it can be mitigated.

Analysis of Method 2: Empirical Estimation

Below is the data from tests conducted on my computer with freq/10, showing consistency in results across multiple trials:

SMP times call semu_timer_clocksource time(sec) of boot process hrtimer warning
1 223,992,364 3.40001
2 382,486,686 8.01002
3 577,491,593 13.44003
4 774,125,110 17.85185
5 973,274,729 22.94007
6 1,174,038,398 27.11009
7 1,377,244,622 31.80010
8 1,605,001,986 37.52011
9 1,793,136,295 41.41014
10 2,005,988,752 45.53015
11 2,220,126,569 51.66018
12 2,440,897,255 56.13018
13 2,651,860,790 60.71019
14 2,882,701,067 65.92020
15 3,103,978,838 70.30022
16 3,343,030,072 76.31025
17 3,566,365,881 80.24026
18 3,800,214,669 86.59028
19 4,031,961,176 92.00030
20 4,280,331,336 94.47030
21 4,516,731,902 101.68033
22 4,883,959,327 104.95035
23 5,143,022,258 110.69036
24 5,260,058,753 118.59098
25 5,526,277,854 125.30041
26 5,790,681,086 132.98045 50000184 ns
27 6,044,658,240 140.04046 80000307 ns
28 6,328,119,424 146.18047 60000231 ns
29 6,598,156,499 154.15050 80000261 ns
30 6,868,480,625 159.83052 90000308 ns
31 7,129,979,196 163.82054 50000169 ns
32 7,410,129,712 170.80054 80000508 ns

Tests were also conducted on my workstation:

SMP times call semu_timer_clocksource time(sec) of boot process hrtimer warning
1 223,450,834 15.21302
2 388,551,174 31.45406
3 586,279,749 48.33009
4 791,644,232 68.00714
5 1,003,639,012 83.64418
6 1,216,761,778 99.95122 12000031 ns
7 1,438,276,507 120.21144 14000047 ns
8 1,704,344,789 122.50440 11000030 ns
9 1,900,605,464 156.91848 10000031 ns
10 2,140,147,966 176.43249 11000031 ns
11 2,451,031,756 179.20599 12000062 ns
12 2,633,717,918 217.70393 14000046 ns
13 2,993,790,985 216.13076 15000046 ns
14 3,165,383,012 262.75081 14000046 ns
15 3,437,855,090 286.43180 15000015 ns

Since the workstation was slow, the execution time was long. Thus I just statistics until SMP=15.

Target Time Configuration

To use the second method, a target time need to be determined. If a target boot time of 10 seconds is set, nsec increment values can be calculated based on the SMP parameter.

image

For example, with SMP=4 and a target time of 10 seconds ($10^{10}$ ns), each call to semu_timer_clocksource adds approximately:

$$ \frac{10^{10}}{774125110} \approx 13 \text{ ns} $$

to nsec.

However, this method may introduce timing discrepancies across different environments. For instance, with SMP=1, the boot process takes approximately 3 seconds on my personal computer but 18 seconds on my workstation, resulting in a sixfold difference.

This leads to an implicit problem: if adding a core increases the number of semu_timer_clocksource calls by an approximate number $2 \times 10^8$, and the target time is set to 10 seconds, then each call to semu_timer_clocksource will increment nsec by approximately:

$$ \frac{10^{10}}{SMPs \times 2 \times 10^8} = \frac{50}{SMPs} \text{ ns} $$

where SMPs represents the number followed by SMP parameter.

Under this method, the time in emulator during boot process is calculated as:

$$ \text{Number of Calls} \times \frac{10^{10}}{SMPs \times 2 \times 10^8} $$

If the assumption of $2 \times 10^8$ calls per core is incorrect, the emulation time will deviate from real-world time. For instance, if the actual number of semu_timer_clocksource calls exceeds $2 \times 10^8$, the boot process will take much longer than the target time, potentially triggering RCU CPU stall warnings.

The value $2 \times 10^8$ was derived from tests on my personal computer and workstation. Despite the sixfold difference in execution times, the number of semu_timer_clocksource calls was remarkably consistent, leading to this assumption. The corresponding numbers could be checked in the tables above: for each increment of the SMP parameter, the number of calls to semu_timer_clocksource roughly increases by $2 \times 10^8$.

If we still want a coarse-grained timer during the boot process to roughly approximate real-world time, clock_gettime could be called at specific intervals (e.g., every $10^n$ calls). A pseudo-random number generator (PRNG) or the earlier mentioned relationship could then provide time displacement values, allowing approximate synchronization with real-world time during the boot process.

However, if the actual number of semu_timer_clocksource calls exceeds $2 \times 10^8$, time recalibration using clock_gettime may lead to time regression if real-world time is less than emulation time.

image

In contrast, if the number of semu_timer_clocksource calls is too low, time will continue to increment, leading only to a deviation that can be corrected via rebase.

Although manually incrementing nsec avoids calling clock_gettime, differences in execution time across environments reduce the stability of RCU CPU stall warning mitigation. This also eliminates the ability to correlate boot process logs with real-world time.

Nonetheless, since boot process timing may not be critical, meanwhile, as the number of harts increment, we can easily notice that the execution time of boot process is getting longer and longer, so I think the benefits of avoiding the call of clock_gettime still remain attractive.

In my opinion,

  • If boot time accuracy matters: Use scaled frequency and continue relying on clock_gettime to update nsec. There is an simple example code in the previous discussion of this issue.
  • If boot time accuracy does not matter: Manually update nsec without scaling frequency. Increment nsec by the method mentioned above.
    • If the boot process timing is completely irrelevant, I think even just update nsec by an really small PRNG number like 1~3 is okay

Maybe we can discuss which method to adopt or any better modifications here. Once decided, I think I can start to submit a PR.

@jserv
Copy link
Collaborator

jserv commented Jan 3, 2025

@chiangkd and @RinHizakura, please comment the above.

@chiangkd
Copy link
Collaborator Author

chiangkd commented Jan 3, 2025

I tend to prefer using "scaled frequency", based on your analysis.

Pros

  • The boot process can use logs to approximate real-world time very easily.
  • By scaling, RCU CPU stall warnings can be significantly reduced. As long as the scaling factor is large enough, no matter how >many harts are involved, there will be no RCU CPU stall warnings during the boot process.

Cons

  • Requires calling clock_gettime, resulting in a longer execution time.

In my opinion, using real-world time offers valuable benefits for developers. It aids in analyzing and identifying potential improvements to accelerate the booting process (e.g., multi-threaded simulations).

@RinHizakura
Copy link

RinHizakura commented Jan 4, 2025

If we just consider the description, scaling frequency seems to be better for me due to the timing accuracy.

However, I am still hesitating after thinking further about it. For example, considering debug mode(yes we didn't implement this for semu, but the possibility still exists) the situation can be more complicated: The guest can be stopped, but the host timer will keep going. The clock_gettime() method may lead to a more complex in this case. On the other, most of the time we don't evaluate the software(Linux) performance on the emulator, because there are too many biases on the emulation even if we can solve this timer issue.

In summary, do we have any real experience in evaluating the boot time accuracy on the emulator(e.g. qemu)? If we never did this(in my experience this is true), the time accuracy does not matter. I'll support just using the increment counter-based method.

@Mes0903
Copy link
Collaborator

Mes0903 commented Jan 5, 2025

In summary, do we have any real experience in evaluating the boot time accuracy on the emulator(e.g. qemu)?

If I understand correctly, QEMU by default provides a rough(coarse) mapping to the host real-time. This is sufficient for the Guest OS to perceive "time passing," but it does not guarantee accurate or stable synchronization. At the same time, QEMU offers another clock type(QEMU_CLOCK_VIRTUAL_RT), the "increment-based" method(icount). In this mode, the emulator updates time based on the number of executed instructions(or events). This ensures functional correctness and repeatability, unaffected by fluctuations in host speed.

For semu, since we aim to allow user programs to leverage benchmarking tools like Dhrystone or CoreMark for performance testing, it is essential to read accurate time. However, reading mtime is a privileged operation, so such user programs will not have as frequent system calls as during the boot process.

Thus, as mentioned earlier, I think time during the boot process is relatively less critical. During this period, simply ensuring that the nsec value increments is sufficient, and timer functionality can begin after mode switching.

The guest can be stopped, but the host timer will keep going. The clock_gettime() method may lead to more complexity in this case.

At this stage, I believe we should focus solely on addressing the RCU CPU stall warning issue. Our current implementation is relatively simplified. Though more complex debugging scenarios could be resolved (e.g., using rebase to adjust time offsets), doing so would require enumerating and handling different cases within our simplified implementation, leading to code clutter.

In my opinion, if we want to account for more complex factors, a more comprehensive timer framework would be needed. For example, adopting a cycle-accurate simulation approach with a dedicated thread to update timer values and signaling this thread to obtain monotonic time would be a viable solution. However, I don't think this is very necessary currently.

@RinHizakura
Copy link

Thus, as mentioned earlier, I think time during the boot process is relatively less critical. During this period, simply ensuring that the nsec value increments is sufficient, and timer functionality can begin after mode switching.

Agree. If there is no case where the accuracy of boot time is important, it would be more attractive if the emulator could boot quickly. For example, when we use it as a CI/CD test platform. So I still prefer the counter-based method. My experience may be too one-sided, so feel free to make any exceptions.

At this stage, I believe we should focus solely on addressing the RCU CPU stall warning issue.

Right, I may overthink. We can ignore the extended features as of now.

@chiangkd
Copy link
Collaborator Author

chiangkd commented Jan 5, 2025

I tend to prefer using "scaled frequency", based on your analysis.

Sorry I'd like to revise my previous opinion. I initially preferred using a "scaled frequency" approach based on your analysis. I agree with @RinHizakura and your proposed Method 2—simply incrementing the counter during mode switching.

Thus, as mentioned earlier, I think time during the boot process is relatively less critical. During this period, simply ensuring that the nsec value increments is sufficient, and timer functionality can begin after mode switching.

I agree that precise timekeeping during the boot process is relatively less critical, and frequent clock_gettime() calls are unnecessary at this stage. During this phase, it is sufficient to ensure the nsec value increments consistently, with full timer functionality initialized after the mode switch.

I’ve learned a lot from this discussion—thank you!

@Mes0903
Copy link
Collaborator

Mes0903 commented Jan 6, 2025

TL;DR

After implementing the method of incrementing nsec, I encountered some issues. Switching to maintaining the increment of ticks resolved the problems, although I am still unsure why these issues occurred. For now, everything appears to be working well.

Detailed Explanation

Initial Attempt: Incrementing nsec

During the implementation of the increment method, I noticed that the time displayed by the emulator during the boot process did not increase proportionally with the number of harts. Instead, it oscillated significantly, albeit with an acceleration effect.

Here is the initial implementation:

#include <math.h>
#include <stdio.h>
#include <time.h>

#include "utils.h"

#if defined(__APPLE__)
#define HAVE_MACH_TIMER
#include <mach/mach_time.h>
#elif !defined(_WIN32) && !defined(_WIN64)
#define HAVE_POSIX_TIMER

/*
 * Use a faster but less precise clock source because we need quick
 * timestamps rather than fine-grained precision.
 */
#ifdef CLOCK_MONOTONIC_COARSE
#define CLOCKID CLOCK_MONOTONIC_COARSE
#else
#define CLOCKID CLOCK_REALTIME_COARSE
#endif
#endif

/* To simplify the example, just hardcode the configuration of SMP here. */
#ifndef SMP
#define SMP 32
#endif

bool boot_complete = false;

/* We want the guest to see 10 seconds by the end of boot.
 *  => 10 seconds = 1e10 ns
 * We'll guess there are about (SMP * 2e8) total calls to
 * semu_timer_total_ticks.
 */
static const double TARGET_BOOT_TIME_NS = 1.0e10; /* 10 seconds in ns */
static const double CALLS_PER_SMP = 2.0e8;        /* from table's assumption */

/* This is the increment for each call, in nanoseconds. */
static double nsec_increment = 0.0;
static struct timespec emulator_time;
static uint64_t offset = 0;

/* For testing how long did the boot process take. */
static uint64_t count = 0;
static struct timespec boot_start;

/* Calculate "x * n / d" without unnecessary overflow or loss of precision.
 *
 * Reference:
 * https://elixir.bootlin.com/linux/v6.10.7/source/include/linux/math.h#L121
 */
static inline uint64_t mult_frac(uint64_t x, uint64_t n, uint64_t d)
{
    const uint64_t q = x / d;
    const uint64_t r = x % d;

    return q * n + r * n / d;
}

static inline uint64_t get_ticks(struct timespec *ts, uint64_t freq)
{
    return ts->tv_sec * freq + mult_frac(ts->tv_nsec, freq, 1e9);
}

void semu_timer_init(semu_timer_t *timer, uint64_t freq)
{
    timer->freq = freq;

    /* From example:
     *   calls = SMP * 2e8
     *   nsec_increment = 1e10 / calls
     */
    double calls = SMP * CALLS_PER_SMP;
    nsec_increment = TARGET_BOOT_TIME_NS / calls;

    /* rebase by emulator_time */
    clock_gettime(CLOCKID, &emulator_time);
    timer->begin = get_ticks(&emulator_time, timer->freq);

    /* For measuring how long the boot process takes */
    boot_start = emulator_time;
}

static uint64_t semu_timer_total_ticks(semu_timer_t *timer)
{
#if defined(HAVE_POSIX_TIMER)
    if (!boot_complete) {
        ++count;
        emulator_time.tv_nsec += nsec_increment;

        /* If overflow 1 second, update sec part */
        if (emulator_time.tv_nsec >= 1.0e9) {
            uint64_t overflow_sec = floor(emulator_time.tv_nsec / 1.0e9);
            emulator_time.tv_sec += overflow_sec;
            emulator_time.tv_nsec = fmod(emulator_time.tv_nsec, 1.0e9);
        }

        return get_ticks(&emulator_time, timer->freq);
    } else {
        static bool first = true;
        if (first) {
            first = false;

            struct timespec rt_boot;
            clock_gettime(CLOCKID, &rt_boot);

            uint64_t rt_base = get_ticks(&rt_boot, timer->freq);
            uint64_t et_base = get_ticks(&emulator_time, timer->freq);

            offset = rt_base - et_base;

            /* print the testing info */
            printf(
                "\033[1;31m[SEMU LOG]: Boot complete, switch to real-time "
                "timer\033[0m\n");

            double boot_time = (rt_boot.tv_sec - boot_start.tv_sec) +
                               (rt_boot.tv_nsec - boot_start.tv_nsec) / 1.0e9;
            printf(
                "\033[1;31m[SEMU LOG]: Boot time: %.5f seconds, called %ld "
                "times semu_timer_clocksource\033[0m\n",
                boot_time, count);

            printf(
                "\033[1;31m[SEMU LOG]: timer->begin: %ld, "
                "rt_base: %ld, et_base: %ld, offset: %ld\033[0m\n",
                timer->begin, rt_base, et_base, offset);
        }

        clock_gettime(CLOCKID, &emulator_time);

        return get_ticks(&emulator_time, timer->freq) - offset;
    }
#elif defined(HAVE_MACH_TIMER)
    static mach_timebase_info_data_t t;
    if (t.denom == 0)
        (void) mach_timebase_info(&t);
    return mult_frac(mult_frac(mach_absolute_time(), t.numer, t.denom), freq,
                     1e9);
#else
    return time(0) * freq;
#endif
}

uint64_t semu_timer_get(semu_timer_t *timer)
{
    return semu_timer_total_ticks(timer) - timer->begin;
}

void semu_timer_rebase(semu_timer_t *timer, uint64_t time)
{
    timer->begin = semu_timer_total_ticks(timer) - time;
}

The output:

mes@DESKTOP-HLQ9F6A:~/semu$ make check SMP=32
  CC    riscv.o
  CC    ram.o
  CC    utils.o
  CC    plic.o
  CC    uart.o
  CC    main.o
  CC    aclint.o
  CC    virtio-blk.o
  CC    virtio-net.o
  CC    netdev.o
  LD    semu
 DTC    minimal.dtb
Ready to launch Linux kernel. Please be patient.
failed to allocate TAP device: Operation not permitted
[    0.000000] Linux version 6.1.99 (jserv@node1) (riscv32-buildroot-linux-gnu-gcc.br_real (Buildroot 2024.02.4) 12.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP Thu Jul 18 13:04:10 CST 2024
[    0.000000] Machine model: semu
[    0.000000] earlycon: ns16550 at MMIO 0xf4000000 (options '')
[    0.000000] printk: bootconsole [ns16550] enabled
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x999 Version=0x1
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions aim
[    0.000000] riscv: ELF capabilities aim
[    0.000000] percpu: Embedded 10 pages/cpu s11604 r8192 d21164 u40960
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130048
[    0.000000] Kernel command line: earlycon console=ttyS0
[    0.000000] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] printk: log_buf_len total cpu_extra contributions: 126976 bytes
[    0.000000] printk: log_buf_len min size: 65536 bytes
[    0.000000] printk: log_buf_len: 262144 bytes
[    0.000000] printk: early log buf free: 63952(97%)
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 504100K/524288K available (3578K kernel code, 345K rwdata, 873K rodata, 185K init, 140K bss, 20188K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: interrupt-controller@0: mapped 31 interrupts with 32 handlers for 32 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv-timer: riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [0]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0xefdb196da, max_idle_ns: 440795204367 ns
[    0.000013] sched_clock: 64 bits at 65MHz, resolution 15ns, wraps every 2199023255550ns
[    0.007380] Console: colour dummy device 80x25
[    0.008748] Calibrating delay loop (skipped), value calculated using timer frequency.. 130.00 BogoMIPS (lpj=260000)
[    0.009633] pid_max: default: 32768 minimum: 301
[    0.014808] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.015438] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.040353] ASID allocator disabled (0 bits)
[    0.042148] rcu: Hierarchical SRCU implementation.
[    0.042493] rcu:     Max phase no-delay instances is 1000.
[    0.060750] smp: Bringing up secondary CPUs ...
[    0.315316] smp: Brought up 1 node, 32 CPUs
[    0.341302] devtmpfs: initialized
[    0.404967] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.405788] futex hash table entries: 8192 (order: 7, 524288 bytes, linear)
[    0.430439] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    0.513758] platform soc@F0000000: Fixed dependency cycle(s) with /soc@F0000000/interrupt-controller@0
[    0.696968] clocksource: Switched to clocksource riscv_clocksource
[    0.848966] NET: Registered PF_INET protocol family
[    0.853929] IP idents hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    0.903390] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[    0.904286] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.905154] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    0.907162] TCP bind hash table entries: 4096 (order: 4, 65536 bytes, linear)
[    0.910137] TCP: Hash tables configured (established 4096 bind 4096)
[    0.913478] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.914286] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    0.919500] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    0.930174] Unpacking initramfs...
[    0.969899] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    2.029020] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    2.051915] printk: console [ttyS0] disabled
[    2.052836] f4000000.serial: ttyS0 at MMIO 0xf4000000 (irq = 1, base_baud = 312500) is a 16550
[    2.053785] printk: console [ttyS0] enabled
[    2.053785] printk: console [ttyS0] enabled
[    2.054503] printk: bootconsole [ns16550] disabled
[    2.054503] printk: bootconsole [ns16550] disabled
[    2.088792] virtio_blk virtio1: 1/0/0 default/read/poll queues
[    2.114963] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[    2.131372] virtio_net virtio0: Assigned random MAC address 2a:34:a1:24:85:37
[    2.160763] clk: Disabling unused clocks
[    6.129851] Freeing initrd memory: 8188K
[    6.138502] Freeing unused kernel image (initmem) memory: 180K
[    6.139086] Kernel memory protection not selected by kernel config.
[    6.139707] Run /init as init process
[SEMU LOG]: Boot complete, switch to real-time timer
[SEMU LOG]: Boot time: 146.10029 seconds, called 7057127651 times semu_timer_clocksource
[SEMU LOG]: timer->begin: 7922787382082, rt_base: 7932283900770, et_base: 7923246095379, offset: 9037805391
[   26.228166] hrtimer: interrupt took 180000431 ns
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login:

Below is the graphs generated based on the data from the tables at the end of this comment:

image

The results indicated a truncation issue due to timespec using a non-floating-point data type (e.g., long) for storing time. This explains why the emulator’s time did not accurately increment to the target time.

However, it does not explain why emulator time decreased significantly as the number of harts increased. I haven't recognize why the issues occurred.

Transition to Maintaining Ticks

Since the emulator ultimately requires ticks rather than time, time is merely a tool to derive ticks. Thus, I changed the implementation to directly maintain ticks.

Implementation Details

To begin, I renamed semu_timer_clocksource to semu_timer_total_ticks. Based on the previous assumptions, by the time the boot process completes, approximately $SMPs \times 2 \times 10^8$ calls to semu_timer_total_ticks will have occurred.

With a target time set to 10 seconds, this translates to $10 \times SMPs \times 2 \times 10^8$ calls within 10 seconds. Additionally, we have set CLOCK_FREQ to 65,000,000, resulting in 650,000,000 ticks over 10 seconds.

Combining these two points, a simple calculation reveals that each call to semu_timer_total_ticks must increment by:

$$ \frac{10 \times 65,000,000}{SMPs \times 2 \times 10^8} $$

ticks.

Since this result is typically a floating-point number, I used a floating-point type for the accumulation variable. The calculated value is converted to uint64_t before being returned within semu_timer_total_ticks.

During the rebase process, I discovered that directly using semu_timer_rebase(timer, 0) is not feasible. The desired outcome is to align the emulator’s time with real-time, adding subsequent increments to the emulator's existing time. However, using semu_timer_rebase as-is does not achieve this goal.

If we use real-time to rebase, it will disregard the emulator’s previous time, resetting it to 0 and beginning accumulation anew. This leads to time regression.

If we use emulator-time to rebase, it will subtract the emulator’s time from the real-time value, which does not address the RCU CPU stall warning issue.

The proper method is to calculate the offset between real-time and emulator time. This offset, measured in ticks, is then subtracted from the tick count obtained after switching to the real-time timer using clock_gettime. This adjustment ensures accurate alignment of times:

Take the output of SMP=32 below as example, the start ticks is 8188341156344. When switching to real-time timer, the ticks of real-time timer is 8197945583709, the ticks of emulator time is 8189070441181. Thus, the offset is 8875142528:

image

Here is the preliminary implementation:

#include <math.h>
#include <stdio.h>
#include <time.h>

#include "utils.h"

#if defined(__APPLE__)
#define HAVE_MACH_TIMER
#include <mach/mach_time.h>
#elif !defined(_WIN32) && !defined(_WIN64)
#define HAVE_POSIX_TIMER

/*
 * Use a faster but less precise clock source because we need quick
 * timestamps rather than fine-grained precision.
 */
#ifdef CLOCK_MONOTONIC_COARSE
#define CLOCKID CLOCK_MONOTONIC_COARSE
#else
#define CLOCKID CLOCK_REALTIME_COARSE
#endif
#endif

/* To simplify the example, just hardcode the configuration of SMP here. */
#ifndef SMP
#define SMP 32
#endif

bool boot_complete = false;

/* We want the guest to see 10 seconds by the end of boot.
 *  => 10 seconds = 'SMP * 2e8' total calls to semu_timer_total_ticks.
 * also, 1 second = 'freq' ticks.
 * Thus, per call, we should increment by '10 * freq / (SMP * 2e8)' ticks.
 */
static const uint8_t TARGET_BOOT_TIME_SEC = 10;
static const double ticks_increment =
    (TARGET_BOOT_TIME_SEC * CLOCK_FREQ) / (2.0e8 * SMP);

/* This is the increment for each call, in nanoseconds. */
static struct timespec emulator_time;
static int64_t offset = 0;

/* For testing how long did the boot process take. */
static uint64_t count = 0;
static struct timespec boot_start;
static uint64_t boot_start_ticks;
static double boot_ticks;

/* Calculate "x * n / d" without unnecessary overflow or loss of precision.
 *
 * Reference:
 * https://elixir.bootlin.com/linux/v6.10.7/source/include/linux/math.h#L121
 */
static inline uint64_t mult_frac(uint64_t x, uint64_t n, uint64_t d)
{
    const uint64_t q = x / d;
    const uint64_t r = x % d;

    return q * n + r * n / d;
}

static inline uint64_t get_ticks(struct timespec *ts, uint64_t freq)
{
    return ts->tv_sec * freq + mult_frac(ts->tv_nsec, freq, 1e9);
}

void semu_timer_init(semu_timer_t *timer, uint64_t freq)
{
    timer->freq = freq;

    /* rebase by emulator_time */
    clock_gettime(CLOCKID, &emulator_time);
    boot_start_ticks = get_ticks(&emulator_time, timer->freq);
    timer->begin = boot_start_ticks;
    boot_ticks = boot_start_ticks;

    printf("\033[1;31m[SEMU LOG]: Start ticks: %lu\033[0m\n", boot_start_ticks);
    printf("\033[1;31m[SEMU LOG]: ticks_increment: %f\033[0m\n",
           ticks_increment);

    /* For measuring how long the boot process takes */
    boot_start = emulator_time;
}

static uint64_t semu_timer_total_ticks(semu_timer_t *timer)
{
#if defined(HAVE_POSIX_TIMER)
    if (!boot_complete) {
        ++count;
        boot_ticks += ticks_increment;
        return (uint64_t) boot_ticks;
    } else {
        static bool first = true;
        if (first) {
            first = false;

            struct timespec rt_boot;
            clock_gettime(CLOCKID, &rt_boot);

            uint64_t rt_base = get_ticks(&rt_boot, timer->freq);
            uint64_t et_base = (uint64_t) boot_ticks;

            offset = rt_base - et_base;

            /* print the testing info */
            printf(
                "\033[1;31m[SEMU LOG]: Boot complete, switch to real-time "
                "timer\033[0m\n");

            double boot_time = (rt_boot.tv_sec - boot_start.tv_sec) +
                               (rt_boot.tv_nsec - boot_start.tv_nsec) / 1.0e9;
            printf(
                "\033[1;31m[SEMU LOG]: Boot time: %.5f seconds, called %ld "
                "times semu_timer_total_ticks\033[0m\n",
                boot_time, count);

            printf(
                "\033[1;31m[SEMU LOG]: timer->begin: %lu, "
                "rt_base: %lu, et_base: %lu, offset: %ld\033[0m\n",
                timer->begin, rt_base, et_base, offset);
        }

        clock_gettime(CLOCKID, &emulator_time);
        return get_ticks(&emulator_time, timer->freq) - offset;
    }
#elif defined(HAVE_MACH_TIMER)
    static mach_timebase_info_data_t t;
    if (t.denom == 0)
        (void) mach_timebase_info(&t);
    return mult_frac(mult_frac(mach_absolute_time(), t.numer, t.denom), freq,
                     1e9);
#else
    return time(0) * freq;
#endif
}

uint64_t semu_timer_get(semu_timer_t *timer)
{
    return semu_timer_total_ticks(timer) - timer->begin;
}

void semu_timer_rebase(semu_timer_t *timer, uint64_t time)
{
    timer->begin = semu_timer_total_ticks(timer) - time;
}

The output:

mes@DESKTOP-HLQ9F6A:~/semu$ make check SMP=32
  CC    riscv.o
  CC    ram.o
  CC    utils.o
  CC    plic.o
  CC    uart.o
  CC    main.o
  CC    aclint.o
  CC    virtio-blk.o
  CC    virtio-net.o
  CC    netdev.o
  LD    semu
 DTC    minimal.dtb
Ready to launch Linux kernel. Please be patient.
failed to allocate TAP device: Operation not permitted
[SEMU LOG]: Start ticks: 8188341156344
[SEMU LOG]: ticks_increment: 0.101562
[    0.000000] Linux version 6.1.99 (jserv@node1) (riscv32-buildroot-linux-gnu-gcc.br_real (Buildroot 2024.02.4) 12.3.0, GNU ld (GNU Binutils) 2.41) #1 SMP Thu Jul 18 13:04:10 CST 2024
[    0.000000] Machine model: semu
[    0.000000] earlycon: ns16550 at MMIO 0xf4000000 (options '')
[    0.000000] printk: bootconsole [ns16550] enabled
[    0.000000] Zone ranges:
[    0.000000]   Normal   [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000000000-0x000000001fffffff]
[    0.000000] SBI specification v2.0 detected
[    0.000000] SBI implementation ID=0x999 Version=0x1
[    0.000000] SBI TIME extension detected
[    0.000000] SBI IPI extension detected
[    0.000000] SBI RFENCE extension detected
[    0.000000] SBI SRST extension detected
[    0.000000] SBI HSM extension detected
[    0.000000] riscv: base ISA extensions aim
[    0.000000] riscv: ELF capabilities aim
[    0.000000] percpu: Embedded 10 pages/cpu s11604 r8192 d21164 u40960
[    0.000000] Built 1 zonelists, mobility grouping on.  Total pages: 130048
[    0.000000] Kernel command line: earlycon console=ttyS0
[    0.000000] printk: log_buf_len individual max cpu contribution: 4096 bytes
[    0.000000] printk: log_buf_len total cpu_extra contributions: 126976 bytes
[    0.000000] printk: log_buf_len min size: 65536 bytes
[    0.000000] printk: log_buf_len: 262144 bytes
[    0.000000] printk: early log buf free: 63952(97%)
[    0.000000] Dentry cache hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    0.000000] Inode-cache hash table entries: 32768 (order: 5, 131072 bytes, linear)
[    0.000000] mem auto-init: stack:off, heap alloc:off, heap free:off
[    0.000000] Memory: 504100K/524288K available (3578K kernel code, 345K rwdata, 873K rodata, 185K init, 140K bss, 20188K reserved, 0K cma-reserved)
[    0.000000] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=32, Nodes=1
[    0.000000] rcu: Hierarchical RCU implementation.
[    0.000000] rcu: RCU calculated value of scheduler-enlistment delay is 25 jiffies.
[    0.000000] NR_IRQS: 64, nr_irqs: 64, preallocated irqs: 0
[    0.000000] riscv-intc: 32 local interrupts mapped
[    0.000000] plic: interrupt-controller@0: mapped 31 interrupts with 32 handlers for 32 contexts.
[    0.000000] rcu: srcu_init: Setting srcu_struct sizes based on contention.
[    0.000000] riscv-timer: riscv_timer_init_dt: Registering clocksource cpuid [0] hartid [0]
[    0.000000] clocksource: riscv_clocksource: mask: 0xffffffffffffffff max_cycles: 0xefdb196da, max_idle_ns: 440795204367 ns
[    0.000021] sched_clock: 64 bits at 65MHz, resolution 15ns, wraps every 2199023255550ns
[    0.011532] Console: colour dummy device 80x25
[    0.013860] Calibrating delay loop (skipped), value calculated using timer frequency.. 130.00 BogoMIPS (lpj=260000)
[    0.015045] pid_max: default: 32768 minimum: 301
[    0.023329] Mount-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.024314] Mountpoint-cache hash table entries: 1024 (order: 0, 4096 bytes, linear)
[    0.064384] ASID allocator disabled (0 bits)
[    0.067303] rcu: Hierarchical SRCU implementation.
[    0.067842] rcu:     Max phase no-delay instances is 1000.
[    0.097174] smp: Bringing up secondary CPUs ...
[    0.505388] smp: Brought up 1 node, 32 CPUs
[    0.546817] devtmpfs: initialized
[    0.648255] clocksource: jiffies: mask: 0xffffffff max_cycles: 0xffffffff, max_idle_ns: 7645041785100000 ns
[    0.649787] futex hash table entries: 8192 (order: 7, 524288 bytes, linear)
[    0.688998] NET: Registered PF_NETLINK/PF_ROUTE protocol family
[    0.821427] platform soc@F0000000: Fixed dependency cycle(s) with /soc@F0000000/interrupt-controller@0
[    1.117924] clocksource: Switched to clocksource riscv_clocksource
[    1.360409] NET: Registered PF_INET protocol family
[    1.368445] IP idents hash table entries: 8192 (order: 4, 65536 bytes, linear)
[    1.447862] tcp_listen_portaddr_hash hash table entries: 512 (order: 0, 4096 bytes, linear)
[    1.449541] Table-perturb hash table entries: 65536 (order: 6, 262144 bytes, linear)
[    1.450612] TCP established hash table entries: 4096 (order: 2, 16384 bytes, linear)
[    1.454023] TCP bind hash table entries: 4096 (order: 4, 65536 bytes, linear)
[    1.458664] TCP: Hash tables configured (established 4096 bind 4096)
[    1.463880] UDP hash table entries: 256 (order: 1, 8192 bytes, linear)
[    1.465421] UDP-Lite hash table entries: 256 (order: 1, 8192 bytes, linear)
[    1.473833] NET: Registered PF_UNIX/PF_LOCAL protocol family
[    1.488511] Unpacking initramfs...
[    1.540352] workingset: timestamp_bits=30 max_order=17 bucket_order=0
[    3.236442] Serial: 8250/16550 driver, 4 ports, IRQ sharing disabled
[    3.273553] printk: console [ttyS0] disabled
[    3.274724] f4000000.serial: ttyS0 at MMIO 0xf4000000 (irq = 1, base_baud = 312500) is a 16550
[    3.276207] printk: console [ttyS0] enabled
[    3.276207] printk: console [ttyS0] enabled
[    3.277608] printk: bootconsole [ns16550] disabled
[    3.277608] printk: bootconsole [ns16550] disabled
[    3.332378] virtio_blk virtio1: 1/0/0 default/read/poll queues
[    3.374667] virtio_blk virtio1: [vda] 4800 512-byte logical blocks (2.46 MB/2.34 MiB)
[    3.400856] virtio_net virtio0: Assigned random MAC address 2a:2b:84:52:fe:ab
[    3.445666] clk: Disabling unused clocks
[    9.770108] Freeing initrd memory: 8188K
[    9.784206] Freeing unused kernel image (initmem) memory: 180K
[    9.785467] Kernel memory protection not selected by kernel config.
[    9.786437] Run /init as init process
[SEMU LOG]: Boot complete, switch to real-time timer
[SEMU LOG]: Boot time: 147.76042 seconds, called 7180650711 times semu_timer_total_ticks
[SEMU LOG]: timer->begin: 8188341156344, rt_base: 8197945583709, et_base: 8189070441181, offset: 8875142528
[   33.089529] hrtimer: interrupt took 70000215 ns
Starting syslogd: OK
Starting klogd: OK
Running sysctl: OK
Starting network: OK

Welcome to Buildroot
buildroot login:

Output and Results Comparison

Below are the results comparing the three methods: scale freq by 10, increment nsec, and increment ticks.

Times Called semu_timer_clocksource:

SMP scale freq by 10 increment nsec increment ticks
1 223,992,364 239,936,190 239,937,385
2 382,486,686 410,350,413 410,377,969
3 577,491,593 597,403,029 599,377,578
4 774,125,110 800,859,292 825,078,230
5 973,274,729 1,006,497,972 1,005,718,212
6 1,174,038,398 1,211,971,435 1,213,769,036
7 1,377,244,622 1,417,808,155 1,422,112,653
8 1,605,001,986 1,622,327,309 1,627,353,803
9 1,793,136,295 1,832,897,356 1,841,933,374
10 2,005,988,752 2,053,838,405 2,056,838,991
11 2,220,126,569 2,258,511,835 2,266,359,688
12 2,440,897,255 2,545,025,764 2,483,505,990
13 2,651,860,790 2,732,577,418 2,703,809,852
14 2,882,701,067 2,898,655,823 2,917,853,568
15 3,103,978,838 3,122,944,650 3,141,405,055
16 3,343,030,072 3,355,869,282 3,450,041,050
17 3,566,365,881 3,534,461,625 3,587,717,644
18 3,800,214,669 3,770,840,706 3,819,287,621
19 4,031,961,176 3,997,057,349 4,047,732,234
20 4,280,331,336 4,240,098,183 4,277,291,251
21 4,516,731,902 4,470,467,091 4,506,903,787
22 4,883,959,327 4,822,792,362 4,731,616,330
23 5,143,022,258 4,954,544,816 4,975,643,903
24 5,260,058,753 5,198,275,343 5,208,915,523
25 5,526,277,854 5,592,128,390 5,601,777,728
26 5,790,681,086 5,566,017,777 5,696,829,697
27 6,044,658,240 5,813,936,910 5,932,303,468
28 6,328,119,424 6,059,320,278 6,189,134,088
29 6,598,156,499 6,304,558,565 6,425,892,515
30 6,868,480,625 6,553,948,238 6,682,215,118
31 7,129,979,196 6,808,544,491 6,935,364,394
32 7,410,129,712 7,067,744,976 7,180,650,711

Boot Process Time (seconds):

SMP scale freq by 10 increment nsec increment ticks
1 3.40001 3.43001 3.37001
2 8.01002 7.82002 7.60002
3 13.44003 12.97003 13.00003
4 17.85185 17.59004 16.74004
5 22.94007 21.54004 21.20004
6 27.11009 26.15005 25.74005
7 31.80010 30.50006 30.52006
8 37.52011 34.66007 34.33017
9 41.41014 39.05009 38.61007
10 45.53015 42.95009 42.74009
11 51.66018 47.63010 46.40010
12 56.13018 50.37010 51.27010
13 60.71019 53.60011 55.89012
14 65.92020 60.54012 60.16012
15 70.30022 65.15013 65.81013
16 76.31025 70.09014 67.95014
17 80.24026 73.90014 73.29030
18 86.59028 78.92016 78.91032
19 92.00030 82.91016 84.04035
20 94.47030 88.05019 88.06036
21 101.68033 92.53019 92.11038
22 104.95035 94.37019 96.46040
23 110.69036 103.57021 101.64042
24 118.59098 107.63022 106.39043
25 125.30041 109.88022 108.33045
26 132.98045 114.10023 116.97026
27 140.04046 120.04024 121.35049
28 146.18047 124.85025 128.63053
29 154.15050 129.38026 131.42054
30 159.83052 135.39027 136.87057
31 163.82054 140.07028 142.77042
32 170.80054 146.00028 147.76042

Emulator Time:

SMP emulator time (nsec) emulator time (ticks)
1 11.086913 11.087134
2 9.344152 9.334738
3 8.656941 9.047707
4 8.694160 9.359388
5 9.095570 9.093272
6 8.750582 9.135809
7 8.944581 9.148663
8 8.759224 9.156270
9 8.237356 9.208044
10 9.205089 9.225684
11 8.083883 9.235738
12 9.128268 9.241278
13 7.331282 9.288196
14 7.742981 9.305096
15 8.340038 9.334116
16 8.955129 9.622751
17 6.269885 9.388393
18 6.666563 9.411124
19 7.066534 9.423658
20 7.491480 9.434454
21 7.888990 9.449546
22 8.529552 9.467739
23 8.726807 9.551230
24 9.146896 9.573380
25 9.866130 9.875079
26 4.871893 9.620185
27 5.083991 9.613555
28 5.293480 9.695915
29 5.501661 9.714826
30 5.713393 9.741353
31 5.929234 9.735223
32 6.150587 9.786437

Due to the slower environment of my workstation, I tested only SMP=4 and SMP=32 on my workstation:

SMP Scale Freq by 10 Increment Ticks
4 70.87429 56.76912
32 634.91923 578.99135

It can show that this approach saved nearly one minute on the slower environment.

image

Remarks

Maintaining ticks resolved the inconsistency issues encountered with nsec increments. When using the nsec increment method, each boot process execution yield slight differences in the number of semu_timer_total_ticks calls and the emulator's internal time. These variations result in inconsistent internal emulator time across runs.

However, the ticks increment method consistently produces identical results for every run. I think this is because the value added to the ticks during each call to semu_timer_total_ticks is fixed, resulting in the emulator time being the same during each boot process.

But there is an fluctuation when SMP is set to 4, 16 and 26. I am uncertain whether this is entirely expected. Aside from this, there is no operational problems been observed in the emulator's functionality so far.

Currently, the number of SMPs is hardcoded in the code as a macro. Therefore, when testing, adjustments must be made according to the SMP count specified at compile time.

@Mes0903
Copy link
Collaborator

Mes0903 commented Jan 6, 2025

Comparatively, the scaling method appears more stable (as it only scales time). Also, there are still unexplained results when using the ticks increment method. Therefore, I am unsure which method is better.

Based on these results, I’d like to ask for your opinions. Currently, the advantages and disadvantages of the scaling method are as follows:

  • Advantages:
    • The results are highly interpretable.
    • It is easy to calculate the correspondence between emulator time and real time during the boot process.
  • Disadvantages:
    • A larger scaling parameter is required to ensure the elimination of the RCU CPU stall warning.
      • I believe a value of 1000 is ideal—it is large enough and allows real time to be easily understood at a glance.
    • Longer execution time:
      • On my workstation (a slower environment), the difference between the two methods for SMP=32 is nearly one minute.

Advantages of the Ticks Increment Method:

  • It provides highly consistent boot times, ensuring the elimination of the RCU CPU stall warning.
  • Faster execution during the boot process.

Disadvantages:

  • Some results remain unexplained.
  • The displayed time during the boot process is difficult to map back to real time (there might be a way, but I am unaware of it).

As everything is confirmed, I think I can start to submit a PR.

BTW, In the sample code above, I’ve added a debug message showing the boot process duration at startup for easier debugging and performance testing. Should we include this message in the final PR submission?

@jserv
Copy link
Collaborator

jserv commented Jan 6, 2025

BTW, In the sample code above, I’ve added a debug message showing the boot process duration at startup for easier debugging and performance testing. Should we include this message in the final PR submission?

Don't do this.

@Mes0903
Copy link
Collaborator

Mes0903 commented Jan 7, 2025

However, the ticks increment method consistently produces identical results for every run. I think this is because the value added to the ticks during each call to semu_timer_total_ticks is fixed, resulting in the emulator time being the same during each boot process.

I noticed that the results (the times call semu_timer_total_ticks) are not always consistent. After retesting the second method a day later, I observed that the values had changed. However, after a few hours, the results could sometimes revert to being identical to the previous values. In contrast, when conducting multiple tests within a short period under the same SMP configuration, the results remained the same.

semu_timer_total_ticks is called in two locations: semu_timer_get and semu_timer_rebase. In other words, semu_timer_total_ticks reflects the number of times the Guest OS reads or writes to the timer.

I am unsure why the results remain consistent over short intervals but fluctuate when retested after a day. Additionally, I am not certain whether these results are fully analyzable. I speculate that this behavior is closely related to the state of the Host OS.

At the very least, the consistency of the short-term results is good. The table below shows the values I just measured. Some values are identical to those from yesterday, while most are different. However, they still exhibit a similar distribution.

SMP times call semu_timer_total_ticks deviation increment of ticks on each call total ticks(emulator time)
1 239,937,385 16.64% 3.250000 779,796,501.25
2 410,377,969 2.52% 1.625000 666,864,199.625
3 600,190,253 0.03% 1.083333 650,205,907.353249
4 825,078,230 3.03% 0.812500 670,376,061.875
5 1,007,098,511 0.70% 0.650000 654,614,032.15
6 1,213,304,632 1.09% 0.541667 657,207,080.101544
7 1,419,857,500 1.39% 0.464286 659,219,959.245
8 1,627,353,803 1.68% 0.406250 661,112,482.46875
9 1,835,991,887 1.96% 0.361111 662,996,866.306457
10 2,056,837,458 2.76% 0.325000 668,472,173.85
11 2,269,651,319 3.06% 0.295455 670,579,830.455145
12 2,485,299,111 3.43% 0.270833 673,101,014.129463
13 2,703,809,852 3.83% 0.250000 675,952,463
14 2,917,880,774 4.03% 0.232143 677,365,596.518682
15 3,134,960,385 4.30% 0.216667 679,242,461.736795
16 3,450,041,050 7.24% 0.203125 700,789,588.28125
17 3,587,880,492 5.23% 0.191176 685,916,640.938592
18 3,810,315,132 5.51% 0.180556 687,975,258.973392
19 4,052,935,937 6.24% 0.171053 693,266,850.831661
20 4,274,307,156 6.41% 0.162500 694,574,912.85
21 4,503,423,577 6.73% 0.154762 696,958,839.623674
22 4,742,850,891 7.22% 0.147727 700,647,133.574757
23 4,973,145,039 7.50% 0.141304 702,725,286.590856
24 5,213,953,075 7.93% 0.135417 706,057,883.557275
25 5,597,276,263 10.67% 0.130000 727,645,914.19
26 5,696,829,697 8.72% 0.125000 712,103,712.125
27 5,935,553,504 9.02% 0.120370 714,462,575.27648
28 6,173,571,960 9.29% 0.116071 716,572,670.96916
29 6,620,649,951 12.39% 0.112069 741,969,619.358619
30 6,683,572,727 10.22% 0.108333 724,051,484.234091
31 6,942,616,048 10.69% 0.104839 727,856,923.856272
32 7,180,650,711 10.87% 0.101562 729,281,247.510582

image

This table helps explain why there is oscillation in the results. We assume that for every increment of 1 in SMP, the number of times semu_timer_total_ticks is called should increase by $2 \times 10^8$. However, in reality, there is always some deviation. This deviation currently shows an upward trend as SMP increases, and in certain configurations (e.g., SMP set to 1, 2, 4, 16, 25, or 29), the deviation becomes more pronounced. These points also exhibit significant increments in the plotted graphs.

Additionally, it is noteworthy that the total ticks are always greater than the ticks corresponding to 10 seconds (650,000,000). However, the time displayed during the boot process is almost always less than 10 seconds. Dividing the total ticks by 650,000,000 yields the target boot time, and subtracting this from the displayed time results in a value that generally falls between 0.9 and 1.5 seconds.

In summary, there are still several unexplained aspects of this method's results:

  1. The number of times semu_timer_total_ticks is called occasionally varies.
  2. In certain SMP configurations, the time deviation becomes larger.
  3. Even though the total ticks exceed the target ticks, the emulator time is still less than the target time (10 seconds), with the gap ranging between 0.9 and 1.5 seconds.

Nonetheless, this method reduces the execution time of the boot process. Although the resulting values are less interpretable, the deviations remain within an acceptable range. Therefore, I think the second method is still preferable.

@Mes0903
Copy link
Collaborator

Mes0903 commented Jan 10, 2025

After several days of analysis and discussion, I think the acceleration effect of the increment method is limited. However, since this method requires complete management of the tick count, it results in increased program complexity. At the same time, it has led to some currently unexplained results, but it remains an interesting experiment that I believe will be helpful for future analysis..

Therefore, I revisited the first scale method. After designing the scale parameter, it is now capable of setting the target boot time. The only remaining drawback is that the boot process is slower. Consequently, I plan to submit a PR using the scale method.

Below is an explanation of the sample code:

Using gprof, I found that semu_timer_clocksource accounts for approximately 10% to 15% of the boot process time:

  • CPU: 13th Gen Intel(R) Core(TM) i7-13700

     mes@DESKTOP-HLQ9F6A:~/semu$ gprof ./semu
     Flat profile:
     
     Each sample counts as 0.01 seconds.
       %   cumulative   self              self     total           
      time   seconds   seconds    calls  Ts/call  Ts/call  name    
      39.91     55.19    55.19                             vm_step
      10.55     69.78    14.59                             semu_timer_clocksource
       9.27     82.60    12.82                             semu_start
       6.49     91.57     8.97                             aclint_mswi_update_interrupts
       6.19    100.13     8.56                             mmu_store
       5.73    108.05     7.92                             mmu_translate
       3.98    113.56     5.51                             semu_timer_get
       3.33    118.16     4.60                             op_rv32i
       2.92    122.20     4.04                             aclint_mtimer_update_interrupts
       2.76    126.02     3.82                             aclint_sswi_update_interrupts
       2.10    128.93     2.91                             ram_read
       1.59    131.13     2.20                             mmu_load
       1.27    132.89     1.76                             mem_store
       1.19    134.54     1.65                             _init
       0.62    135.40     0.86                             mem_load
       0.46    136.04     0.64                             u8250_check_ready
       0.41    136.61     0.57                             ram_write
       0.39    137.15     0.54                             csr_read
       0.37    137.66     0.51                             virtio_net_refresh_queue
       0.28    138.05     0.39                             mem_fetch
       0.17    138.28     0.23                             op_csr_cs
    
  • CPU: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz

     mes@Mes:~/semu$ gprof ./semu
     Flat profile:
     
     Each sample counts as 0.01 seconds.
       %   cumulative   self              self     total           
      time   seconds   seconds    calls  Ts/call  Ts/call  name    
      32.96     77.77    77.77                             vm_step
      15.96    115.43    37.66                             semu_timer_clocksource
      13.95    148.34    32.91                             semu_start
       6.10    162.74    14.40                             aclint_mswi_update_interrupts
       5.95    176.77    14.03                             mmu_store
       5.35    189.39    12.62                             mmu_translate
       3.85    198.47     9.08                             aclint_mtimer_update_interrupts
       2.73    204.91     6.44                             op_rv32i
       2.46    210.72     5.81                             ram_read
       2.11    215.70     4.98                             aclint_sswi_update_interrupts
       1.74    219.81     4.11                             mmu_load
       1.74    223.91     4.10                             semu_timer_get
       1.30    226.98     3.07                             mem_store
       0.92    229.14     2.16                             ram_write
       0.81    231.05     1.91                             _init
       0.61    232.48     1.43                             mem_load
       0.55    233.78     1.30                             csr_read
       0.35    234.60     0.82                             u8250_check_ready
       0.22    235.11     0.51                             op_csr_cs
       0.21    235.61     0.50                             virtio_net_refresh_queue
       0.14    235.95     0.34                             mem_fetch
       0.00    235.96     0.01                             semu_timer_init
    

The logic inside semu_timer_clocksource is relatively simple. Thus, I assume its execution time is almost equal to clock_gettime. Based on the previous discussions, where we estimated that semu_timer_clocksource is called approximately $(2 \times 10^8)$ times, we can roughly predict the boot process will take

$$ \text{SecPerCall} \times \text{SMPs} \times 2 \times 10^8 \times \frac{100}{10\sim15} $$

seconds.

And for example, if this calculation results in 200 seconds, we can use the target time to set the scale factor. Assuming the target time is 10 seconds, the scaling factor would be $(10 / 200)$. It makes us still can set the target time for boot process.

Below is the sample code:

  • Bogomips:

     #ifndef SEMU_SMP
     #define SEMU_SMP 1
     #endif
     
     #ifndef SEMU_BOOT_TARGET_TIME
     #define SEMU_BOOT_TARGET_TIME 10
     #endif
     
     bool boot_complete = false;
     static double scale_factor;
     
     static double measure_bogomips_ns(uint64_t target_loop)
     {
         struct timespec start, end;
         clock_gettime(CLOCKID, &start);
     
         for (uint64_t loops = 0; loops < target_loop; loops++)
             clock_gettime(CLOCKID, &end);
     
         int64_t sec_diff = end.tv_sec - start.tv_sec;
         int64_t nsec_diff = end.tv_nsec - start.tv_nsec;
         double ns_per_call = (sec_diff * 1e9 + nsec_diff) / target_loop;
         double predict_sec = ns_per_call * SEMU_SMP * 2;
         scale_factor = SEMU_BOOT_TARGET_TIME / predict_sec;
     
         return ns_per_call;
     }

    This function is called at the beginning of semu_timer_init, and the two macros are currently configured directly in the Makefile:

     //...
     SMP ?= 1
     CFLAGS += -D SEMU_SMP=$(SMP)
     CFLAGS += -D SEMU_BOOT_TARGET_TIME=10
     //...
  • semu_timer_clocksource

     static uint64_t semu_timer_clocksource(semu_timer_t *timer)
     {
     #if defined(HAVE_POSIX_TIMER)
         static int64_t offset = 0;
     
         struct timespec emulator_time;
         clock_gettime(CLOCKID, &emulator_time);
     
         if (!boot_complete) {
             return get_ticks(&emulator_time, timer->freq * scale_factor);
         } else {
             static bool first = true;
             if (first) {
                 first = false;
                 offset = get_ticks(&emulator_time, timer->freq) -
                          get_ticks(&emulator_time, timer->freq * scale_factor);
             }
     
             return get_ticks(&emulator_time, timer->freq) - offset;
         }
     #elif defined(HAVE_MACH_TIMER)
         static mach_timebase_info_data_t emulator_time;
         if (emulator_time.denom == 0)
             (void) mach_timebase_info(&emulator_time);
         return mult_frac(mult_frac(mach_absolute_time(), emulator_time.numer,
                                    emulator_time.denom),
                          freq, 1e9);
     #else
         return time(0) * freq;
     #endif
     }

Currently, only the HAVE_POSIX_TIMER method has been implemented. I am not familiar with macOS, so I may need to investigate further. I plan to complete it and submit the PR (#67) tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants