It can be NG or NE, doesn’t really matter…
I ran the script posted by @doppingkoala on a 905X4, but lscpu still displays no cache.
Thank you! I’m inclined to leave the higher values. I see AVI files at low resolution using the A55 cores only without breaking a sweat, utilizing big.LITTLE to its maximum potential.
We can change it if data proves that the values need changing. With so much interest and a new on-the-fly script, I’m sure we’ll get all combinations and permutations.
Your thoughts on it?
@doppingkoala if you want to define your CPU cores as independent, you can create 4 clusters, so that each CPU is assigned its own cluster in software. You can have n clusters. It should stop the scheduler from moving processes around.
The downside would be that the calculation for bumping a process to another core will have a high threshold, with all the consequences that brings.
If you indeed have per-core caches then it should actually help. If you have shared caches it might make the Linux scheduler slower. Try it! Worth a shot.
Edit: @doppingkoala can you add an L2 cache of size 0 and then add an L3 of size 512 KB? You define an L3, but the kernel assumes it is L2.
See the kernel function cpus_share_cache().
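For context, that function just compares the two CPUs’ last-level-cache scheduling-domain IDs, so what the dtb says about shared caches directly changes its answer. Paraphrased from kernel/sched/core.c (the exact body varies by kernel version):

/* CPUs are considered to share a cache if their LLC domain IDs match */
bool cpus_share_cache(int this_cpu, int that_cpu)
{
	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}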
That SoC has been hard to find the right values for. Maybe we start with the L1 cache only before we go to L2 and L3.
So make a new L2 cache with a size of zero and change each core so that it is the next-level-cache? How do I then reference the L3 cache?
Inside the level 2 cache you create a new level 3 cache.
l2-cache {
	cache-level = <2>;
	/* ... other L2 properties ... */
	next-level-cache = <&L3>;

	L3: l3-cache {
		/* ... other L3 properties ... */
		cache-level = <3>;
	};
};
Nested like that.
It might be dumb and I may be asking something impossible. You can definitely call out my ignorance.
“They didn’t know, it was impossible, so they did it.”
I’ll give it a try later. Looking at some other dtsi files though, it looks like the approach used for devices with no L2 is to directly reference the L3 cache from each of the cpu cores (as it already is). There is a difference, though: it looks like the cache-level for the L3 cache is set to 2 by others…
You bring me back to my aerospace days, when I used to code partial differential equation solvers for aircraft airflow. <3 I got a chuckle out of your message.
There are a number of factors when it comes to the UI, like which skin is being used, what add-ons are present, eMMC vs USB install. There’s also just the placebo effect. When I first added the cache nodes I was convinced the UI felt faster too. But since I have been going back and forth between cache and no cache nodes, I lost that feeling.
I think we need more standardized testing that relies less on how fast something feels. For example, a few people were saying bootup times felt faster. That’s reported by the system, so we can grab those results.
CoreELEC:~ # systemd-analyze
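A sketch of how the numbers could be collected across boots (the log path is made up; run it once the system has finished booting):

# append this boot's summary line to a log
systemd-analyze >> /storage/boot-times.log

# average the kernel times collected so far, assuming lines like
# "Startup finished in 5.509s (kernel) + 7.930s (userspace) = 13.439s"
# (times over a minute print differently, so this is only a sketch)
awk '/^Startup/ { gsub(/s$/, "", $4); sum += $4; n++ }
     END { if (n) printf "avg kernel: %.3fs over %d boots\n", sum / n, n }' /storage/boot-times.log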
I took the average of 7 bootups with the cache nodes present in the dtb, vs the average of 4 without the cache nodes.
With L2 cache nodes (avg of 7 boots)
kernel:       5.509s (1.2% slower)
userspace:    7.930s (2.5% faster)
kodi.target:  6.207s (1.1% faster)
total:       19.647s (1.1% faster)
Without L2 cache nodes (avg of 4 boots)
kernel:       5.441s
userspace:    8.136s
kodi.target:  6.279s
total:       19.857s
So maybe adding the nodes speeds up the boot time by 1.1%, but that might just be variation.
The other measure I saw mentioned is load balancing across CPU cores with software decoding. The examples I saw were based on a one-second average, though, and it’s hard to gauge things off a snapshot when the CPU load shifts a lot.
I used a script to get the average CPU load (1 s resolution) over a few 3-5 min AV1 1080p videos. Looking at the average over a few minutes, I got consistent results on re-views. Example:
With L2 cache nodes - Top Gun: Maverick trailer - AV1 1080p
Average Loads (averaged over 259 seconds):
Core 0 Average Load: 41.2%
Core 1 Average Load: 31.7%
Core 2 Average Load: 20.2%
Core 3 Average Load: 18.3%
Core 4 Average Load: 17.9%
Core 5 Average Load: 16.4%
Without L2 cache nodes - Top Gun: Maverick trailer
Average Loads (averaged over 244 seconds):
Core 0 Average Load: 42.0%
Core 1 Average Load: 32.9%
Core 2 Average Load: 20.6%
Core 3 Average Load: 19.1%
Core 4 Average Load: 18.0%
Core 5 Average Load: 17.4%
The results look very similar. I didn’t see a noticeable difference with the other videos I tested using SW decoding either.
Something may be wrong with my setup, but if we could standardize a couple tests, it would make direct comparisons easier.
If you have a video where you do see a shift in the CPU load balance, can you share it? I will run it on my setup to see if I can reproduce the effect. It’s better if the video is over 2 min long to reduce the variation.
Run this script to get the average load for each CPU; hit CTRL+C to stop and the averages will be shown.
cpu-load.sh (2.1 KB)
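For reference, a minimal sketch of the same idea (the attached script may well differ; this one samples /proc/stat once per second for a fixed number of seconds passed as $1, instead of waiting for CTRL+C):

#!/bin/sh
# Print each core's average busy percentage over $1 seconds (default 60).
DURATION="${1:-60}"

i=0
while [ "$i" -le "$DURATION" ]; do
	grep '^cpu[0-9]' /proc/stat
	i=$((i + 1))
	[ "$i" -le "$DURATION" ] && sleep 1
done | awk '
{
	# /proc/stat fields: cpuN user nice system idle iowait irq softirq ...
	busy = $2 + $3 + $4 + $7 + $8
	total = busy + $5 + $6
	if ($1 in last_total) {
		busy_sum[$1] += busy - last_busy[$1]
		total_sum[$1] += total - last_total[$1]
	}
	last_busy[$1] = busy
	last_total[$1] = total
}
END {
	for (cpu in total_sum)
		if (total_sum[cpu] > 0)
			printf "%s average load: %.1f%%\n", cpu, 100 * busy_sum[cpu] / total_sum[cpu]
}'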
Is it a shared L3 labeled as level 2?
Or is it a per core separate cache?
Looks like a shared L3. I was looking at linux/arch/arm64/boot/dts/rockchip/rk356x.dtsi at master · torvalds/linux · GitHub for example.
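From memory, the pattern there is roughly the following (the exact sizes are in the file itself): a single cache node marked as level 2 that every cpu node points at directly:

l3_cache: l3-cache {
	compatible = "cache";
	cache-level = <2>;	/* labelled L3, but declared as level 2 */
	cache-unified;
	/* cache-size, cache-line-size, cache-sets per the SoC */
};

cpu0: cpu@0 {
	/* ... */
	next-level-cache = <&l3_cache>;
};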
Any direction to take when starting with L1 cache only? I tried removing the L3 cache part. I can experiment with some values.
#!/bin/sh
(
/bin/mount -o rw,remount /flash

# Write L1 cache properties for each A55 core into the dtb
# (-p creates missing properties; -t i = integer, -t x = hex).
for CPU in 0 1 2 3; do
	NODE="/cpus/cpu@$CPU"
	fdtput -p -t i /flash/dtb.img "$NODE" d-cache-line-size 32
	fdtput -p -t x /flash/dtb.img "$NODE" d-cache-size 0x8000
	fdtput -p -t i /flash/dtb.img "$NODE" d-cache-sets 32
	fdtput -p -t i /flash/dtb.img "$NODE" i-cache-line-size 32
	fdtput -p -t x /flash/dtb.img "$NODE" i-cache-size 0x8000
	fdtput -p -t i /flash/dtb.img "$NODE" i-cache-sets 32
done

/bin/sync
/bin/mount -o ro,remount /flash

# Offer a reboot so the modified dtb takes effect.
read -p "Restart now? [Y/N]: " KEYINPUT
if [ "$KEYINPUT" != "${KEYINPUT#[Yy]}" ]; then
	/sbin/reboot
fi
)
lscpu:
CoreELEC:~ # lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Cortex-A55
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 4
Socket(s): -
Cluster(s): 1
Stepping: r2p0
CPU(s) scaling MHz: 100%
CPU max MHz: 2000.0000
CPU min MHz: 100.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp
Keep it shared. That may be the best we can do.
All AV1 videos
1080p - All cores
Average Loads (averaged over 13 seconds):
Core 0 Average Load: 53.0%
Core 1 Average Load: 42.6%
Core 2 Average Load: 31.3%
Core 3 Average Load: 29.1%
Core 4 Average Load: 27.1%
Core 5 Average Load: 26.3%
720p
Average Loads (averaged over 13 seconds):
Core 0 Average Load: 52.4%
Core 1 Average Load: 44.3%
Core 2 Average Load: 15.1%
Core 3 Average Load: 5.7%
Core 4 Average Load: 5.0%
Core 5 Average Load: 2.5%
360p
Average Loads (averaged over 13 seconds):
Core 0 Average Load: 39.5%
Core 1 Average Load: 23.8%
Core 2 Average Load: 4.6%
Core 3 Average Load: 1.0%
Core 4 Average Load: 0.6%
Core 5 Average Load: 0.2%
You can see that as the demand for CPU cycles goes down, the scheduler prioritizes the A53 cores. Look at the very potato-quality AV1 360p: all decoding is done on the A53s. The A73s are mostly doing some system tasks to keep Linux from crashing.
Now look at 720p: the majority of the work is done by the A53s alone. A small amount is done by the A73s, but the A53s still do the heavy lifting.
Now look at 1080p: the scheduler is doing its damn best to keep all CPU loads under 80%, and is therefore utilizing all 6 cores even though they are uneven in power (A53 weaker than A73 kind of unevenness).
Sorry, none of my test videos were over 2 min.
There aren’t going to be any speed increases in benchmarks at high CPU usage. The hardware is the same as before. We are trying to bump up the low end, without consuming extra power, to make the device feel faster.
I might have to sleep on this. Someone’s gotta have an epiphany or read 100s of pages of ARM documentation to find that one sentence which makes us go “aha”. I do not volunteer for this; I need sleep after fighting GitHub for 3 hours today.
@akmarwah03 line-size is 64 for the ARM A55.
https://developer.arm.com/documentation/100442/0100/functional-description/level-1-memory-system/about-the-l1-memory-system?lang=en
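If the line size changes, the sets value has to change with it. Assuming the 32 KB, 4-way set associative L1 organization described in the A55 TRM, that works out to 32768 / (64 × 4) = 128 sets. A sketch of the corrected fdtput calls:

# 64-byte lines, and 32768 / (64 * 4) = 128 sets (assuming 4-way L1 per the TRM)
for CPU in 0 1 2 3; do
	NODE="/cpus/cpu@$CPU"
	fdtput -p -t i /flash/dtb.img "$NODE" d-cache-line-size 64
	fdtput -p -t i /flash/dtb.img "$NODE" d-cache-sets 128
	fdtput -p -t i /flash/dtb.img "$NODE" i-cache-line-size 64
	fdtput -p -t i /flash/dtb.img "$NODE" i-cache-sets 128
done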
Completely understandable, thanks for all your work
I know I’m replying to myself…
@doppingkoala what I was saying was that you can trick the scheduler into thinking you have 4 separate CPU dies in separate BGA packages joined by copper traces. All you gotta do is create one cluster per CPU. I want to mimic this kind of process affinity for single-threaded operations at a system level, not a process level.
What will this achieve? The scheduler will become hesitant to bump processes between cores and will only spread work for truly multi-threaded applications.
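In the dtb that would look roughly like the cpu-map below (a sketch; the cpu0..cpu3 labels are assumed to match the dtb’s existing cpu node labels):

cpus {
	cpu-map {
		cluster0 {
			core0 { cpu = <&cpu0>; };
		};
		cluster1 {
			core0 { cpu = <&cpu1>; };
		};
		cluster2 {
			core0 { cpu = <&cpu2>; };
		};
		cluster3 {
			core0 { cpu = <&cpu3>; };
		};
	};
};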