S922X Ugoos AM6B Device Tree - Performance/Efficiency - Testing Needed

sunseeker2k5 · 27 September 2024 16:34

Just saw A2, thank you.

MasterKeyxda · 27 September 2024 17:32

@YadaYada sorry I couldn’t come up with a test that could verify all of the talk above.

I’m willing to believe the many users who have reported a nice quality of life improvement.

From my first post:

With thousands of programs and tens of thousands of dependencies everywhere, who knows which program checks for what. Again, this is only in user space. The hardware always had cache and it was always used. The memcpy functions of glibc are written in assembly, so I don’t know what they are doing, but most likely they are highly optimized already.

Closing poll

Is your device better now than before the update? (Anonymous vote)

Yes
No


akmarwah03 · 27 September 2024 18:09

I looked into this, but I don’t have the knowledge to progress from here.
1.dts (123.0 KB)

MasterKeyxda · 27 September 2024 19:13

@YadaYada I looked in the scheduler code since I can’t come up with a test method. I do not know how the kernel determines, in this code, whether CPUs share a cache.

/*
 * The purpose of wake_affine() is to quickly determine on which CPU we can run
 * soonest. For the purpose of speed we only consider the waking and previous
 * CPU.
 *
 * wake_affine_idle() - only considers 'now', it check if the waking CPU is
 *			cache-affine and is (or	will be) idle.
 *
 * wake_affine_weight() - considers the weight to reflect the average
 *			  scheduling latency of the CPUs. This seems to work
 *			  for the overloaded case.
 */
static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
	/*
	 * If this_cpu is idle, it implies the wakeup is from interrupt
	 * context. Only allow the move if cache is shared. Otherwise an
	 * interrupt intensive workload could force all tasks onto one
	 * node depending on the IO topology or IRQ affinity settings.
	 *
	 * If the prev_cpu is idle and cache affine then avoid a migration.
	 * There is no guarantee that the cache hot data from an interrupt
	 * is more important than cache hot data on the prev_cpu and from
	 * a cpufreq perspective, it's better to have higher utilisation
	 * on one CPU.
	 */
	if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
		return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;

	if (sync && cpu_rq(this_cpu)->nr_running == 1)
		return this_cpu;

	return nr_cpumask_bits;
}

This is just one instance I can point to in the kernel scheduler.

if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))

This branch is only taken if cpus_share_cache() returns true. With the device tree we defined, the S922X’s A53 and A73 cores do not share a cache, so when the check is between an A53 and an A73 we never reach return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;. How often does that happen? Anybody’s guess.

The comment says “Only allow the move if cache is shared.” Here the cache is not shared, so this branch does not allow the move (the sync check that follows can still return this_cpu).

The CPUs are complex and I do not claim I understand what’s happening, but at a glance it seems that the kernel will change which CPU does the work.

If someone here can write assembly, maybe they can come up with two versions of a user-space test: one that uses the cache info and one that doesn’t.
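What user space can at least inspect is the cache topology the kernel exports through sysfs, which is also where lscpu gets its cache columns. The sketch below lists, for every CPU and cache index, the level, type and the CPUs sharing that cache. The function name list_cache_sharing is made up for this example, and whether the scheduler's cpus_share_cache() relies on the same data on this kernel is not established here.

```sh
#!/bin/sh
# List each CPU's cache levels and the CPUs that share them.
# $1 = sysfs CPU directory (optional, defaults to the real one).
# list_cache_sharing is a hypothetical helper name, not an existing tool.
list_cache_sharing() {
    root=${1:-/sys/devices/system/cpu}
    for idx in "$root"/cpu[0-9]*/cache/index[0-9]*; do
        # Skip when the glob did not match or cacheinfo is not exported.
        [ -r "$idx/shared_cpu_list" ] || continue
        cpu=${idx%/cache/*}   # strip "/cache/indexN"
        cpu=${cpu##*/}        # keep only "cpuN"
        printf '%s L%s %s shared_with=%s\n' "$cpu" \
            "$(cat "$idx/level")" "$(cat "$idx/type")" \
            "$(cat "$idx/shared_cpu_list")"
    done
}

list_cache_sharing
```

If the A53 and A73 clusters are described as having separate L2 caches, the L2 lines for the two clusters should show different shared_with lists.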

YadaYada · 27 September 2024 19:24

Adding the additional DTB information probably isn’t hurting anything, so it’s not a bad idea to include it in some form either way. I was looking at an A311D2 DTB (basically a newer version of the S922X: 4xA73+4xA53, kernel 5.4), and it does include the capacity-dmips-mhz properties now, and a single L2 cache node. Maybe the kernel is able to automatically determine that there are two separate L2 caches, and the separate nodes aren’t necessary?

MasterKeyxda · 27 September 2024 19:44

For that I’d like to see the output of lscpu --caches.

Our output looks like this (L2 explicitly separated by cluster):

0:0:0
0:1:0
1:0:1
1:1:1
1:2:1

5.4 vs 4.9: who knows what changed under the hood in the kernel.

This is beyond my knowledge now.

Story below, no technical info:
I’ll share a story of how I found out that ARM CPUs cannot do fast inverse square root. Specifically on an i.MX8.

I was working on a sensor, trying to find orientation with a Madgwick filter, which was just not working whatsoever! We used gcc to compile the binary. I was managing an SDE at the time. We were both stumped as to why it was not working. I programmed something in Python with a regular-speed inverse square root and he programmed the same algorithm in C++; mine worked fine and his didn’t. We debugged and found that the fast inverse square root taken from Wikipedia works fine on x64 but not on ARM. Of course we tested both Python and C++ on ARM. We stopped using fast inverse square root and suddenly the code and sensor just worked! Due to deadlines we never went further in debugging. To this day I don’t have an answer to why that happened. A case of beyond my knowledge. I’m not a programmer.

YadaYada · 27 September 2024 20:19

The A311D2 documentation is clearer than the S922X’s about there being two discrete L2 caches for the two CPU clusters:

https://dl.khadas.com/products/vim4/datasheet/a311d2-quick-reference-manual-rev-c-0.2.pdf

I don’t have an A311D2 running Linux/CE to check lscpu. CE does support the SoC, so there may be someone else here with a VIM4 or GT King II who can check. I had looked at an Android DTB; I don’t know if the CE DTB for that SoC includes that information too.

EDIT: the CE DTB doesn’t have the L2 cache node; it would need to be added to check this device.

ht2tweak · 27 September 2024 21:27

Is there a thread where one can keep up with new builds? I had been looking here but, unless I missed it, I’m not seeing any toward the end of that thread. Thanks.

Nm, I see it’s on GitHub.

doppingkoala · 27 September 2024 23:55

For anyone who copied my previous script: it had a problem where a phandle was duplicated for the cache. The revised script is:

#!/bin/sh

(
    /bin/mount -o rw,remount /flash

    fdtput -p /flash/dtb.img /cpus/l3-cache0 compatible cache -t s
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-level 2 -t i
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-unified
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-size 0x7d000 -t x
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-sets 512 -t i
    fdtput -p /flash/dtb.img /cpus/l3-cache0 phandle 1113 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@0 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 next-level-cache 1113 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@1 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 next-level-cache 1113 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@2 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 next-level-cache 1113 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@3 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 next-level-cache 1113 -t i


    /bin/sync
    /bin/mount -o ro,remount /flash

    read -p "Restart now? [Y/N]: " KEYINPUT
    if [ "$KEYINPUT" != "${KEYINPUT#[Yy]}" ]; then
        /sbin/reboot
    fi
)
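Since the phandle value is picked by hand, a collision with a phandle already present in the DTB is easy to miss. A minimal sketch for checking this, assuming dtc is available (on the box or on a PC) and that the file is a single DTB that dtc can decompile; the function name find_duplicate_phandles is hypothetical:

```sh
#!/bin/sh
# Print every phandle value that is defined by more than one node.
# Reads decompiled device tree source (DTS text) on stdin.
# find_duplicate_phandles is a hypothetical helper name.
find_duplicate_phandles() {
    # Match only "phandle = <...>;" definitions, not "linux,phandle"
    # and not references such as "next-level-cache = <...>;".
    sed -n 's/^[[:space:]]*phandle = <\([^>]*\)>;.*$/\1/p' | sort | uniq -d
}

# Usage (assumption: dtc is installed and dtb.img is a single DTB):
#   dtc -I dtb -O dts /flash/dtb.img 2>/dev/null | find_duplicate_phandles
```

No output means no phandle is defined twice. dtc prints the values in hex, so 1113 would show up as 0x459.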

rho-bot · 28 September 2024 00:40

I think you are missing the L2 cache configuration (per core!) between L1 and L3.
You should see L2 2M (4 instances)

Hierarchy should look like:

cpu@0 next-level-cache L2-0
cpu@1 next-level-cache L2-1
[...]
L2-0 / cache-level 2 / next-level-cache L3
L2-1 / cache-level 2 / next-level-cache L3
[...]
L3 / cache-level 3

doppingkoala · 28 September 2024 00:47

I was going off [PATCHv1 1/5] arm64: dts: amlogic: Add cache information to the Amlogic GXBB and GXL SoC, and other comments here that seemed to suggest there isn’t an L2 cache:

ARM Cortex-A55 CPU uses unified L3 cache instead of L2 cache

I’ll try adding an L2 and see what happens anyway.

MasterKeyxda · 28 September 2024 00:54

If there is an L2, it must be private per core.

https://developer.arm.com/documentation/100442/latest/functional-description/level-2-memory-system/about-the-l2-memory-system

And if it’s private per core, then the kernel function I mentioned above applies to you too. Its effect depends on IO topology and IRQ affinity.

doppingkoala · 28 September 2024 01:10

So I updated the script to:

s905x3 with L2 cache


#!/bin/sh

(
    /bin/mount -o rw,remount /flash

    fdtput -p /flash/dtb.img /cpus/l3-cache0 compatible cache -t s
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-level 3 -t i
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-unified
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-size 0x7d000 -t x
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-sets 512 -t i
    fdtput -p /flash/dtb.img /cpus/l3-cache0 phandle 1113 -t i

    fdtput -p /flash/dtb.img /cpus/l2-cache0 compatible cache -t s
    fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-level 2 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-size 0x10000 -t x
    fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-sets 512 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache0 next-level-cache 1113 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache0 phandle 1114 -t i

    fdtput -p /flash/dtb.img /cpus/l2-cache1 compatible cache -t s
    fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-level 2 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-size 0x10000 -t x
    fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-sets 512 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache1 next-level-cache 1113 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache1 phandle 1115 -t i

    fdtput -p /flash/dtb.img /cpus/l2-cache2 compatible cache -t s
    fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-level 2 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-size 0x10000 -t x
    fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-sets 512 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache2 next-level-cache 1113 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache2 phandle 1116 -t i

    fdtput -p /flash/dtb.img /cpus/l2-cache3 compatible cache -t s
    fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-level 2 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-size 0x10000 -t x
    fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-sets 512 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache3 next-level-cache 1113 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache3 phandle 1117 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@0 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 next-level-cache 1114 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@1 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 next-level-cache 1115 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@2 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 next-level-cache 1116 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@3 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 next-level-cache 1117 -t i


    /bin/sync
    /bin/mount -o ro,remount /flash

    read -p "Restart now? [Y/N]: " KEYINPUT
    if [ "$KEYINPUT" != "${KEYINPUT#[Yy]}" ]; then
        /sbin/reboot
    fi
)
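The four per-CPU stanzas in the script above differ only in the CPU index and the next-level-cache phandle, so they could be generated instead of typed out. The sketch below only prints the commands (a dry run) so they can be reviewed first. It reuses the fdtput arguments from the script as they are; the function name print_cpu_cache_cmds is hypothetical, and the phandle scheme 1114 + CPU index simply mirrors the values used above.

```sh
#!/bin/sh
# Print (do not run) the fdtput commands for one CPU node.
# $1 = DTB path, $2 = CPU index, $3 = phandle of that CPU's next-level cache.
# print_cpu_cache_cmds is a hypothetical helper name.
print_cpu_cache_cmds() {
    dtb=$1
    node=/cpus/cpu@$2
    next=$3
    printf 'fdtput -p %s %s compatible "arm,cortex-a55" "arm,armv8" -t s\n' "$dtb" "$node"
    # Same three properties for the data (d) and instruction (i) L1 caches.
    for kind in d i; do
        printf 'fdtput -p %s %s %s-cache-line-size 64 -t i\n' "$dtb" "$node" "$kind"
        printf 'fdtput -p %s %s %s-cache-size 0x8000 -t x\n' "$dtb" "$node" "$kind"
        printf 'fdtput -p %s %s %s-cache-sets 32 -t i\n' "$dtb" "$node" "$kind"
    done
    printf 'fdtput -p %s %s next-level-cache %s -t i\n' "$dtb" "$node" "$next"
}

# Dry run for cpu@0..cpu@3 with phandles 1114..1117.
for cpu in 0 1 2 3; do
    print_cpu_cache_cmds /flash/dtb.img "$cpu" $((1114 + cpu))
done
```

Once the printed commands look right, the output can be piped to sh after /flash has been remounted read-write as in the script.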

@rho-bot is that how it should be?

and I am now getting:

CoreELEC-21:~ # lscpu
Architecture:            aarch64
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               ARM
  Model name:            Cortex-A55
    Model:               0
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          1
    Stepping:            r1p0
    CPU(s) scaling MHz:  100%
    CPU max MHz:         1908.0000
    CPU min MHz:         100.0000
    BogoMIPS:            48.00
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    2 MiB (4 instances)

CoreELEC-21:~ # lscpu --caches
NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d       32K     128K    4 Data            1  128                      64
L1i       32K     128K    4 Instruction     1  128                      64
L2       512K       2M   16 Unified         2  512                      64

To get the 2 MiB, is it counting the 512 KiB L3 four times?
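The numbers in the table are consistent with ONE-SIZE = SETS x WAYS x line size (the COHERENCY-SIZE column) and ALL-SIZE = ONE-SIZE x number of instances, which would mean the same 512K entry is counted once per reported instance. That lscpu computes the totals exactly this way is an assumption inferred from the output, not something checked in its source. A small sketch of the arithmetic, with made-up function names:

```sh
#!/bin/sh
# Size of one cache instance in KiB.
# $1 = sets, $2 = ways, $3 = line size in bytes.
cache_one_size_kib() {
    echo $(( $1 * $2 * $3 / 1024 ))
}

# Total size in KiB over all instances.
# $1 = size of one instance in KiB, $2 = number of instances.
cache_all_size_kib() {
    echo $(( $1 * $2 ))
}

# Values from the L2 row above: 512 sets, 16 ways, 64-byte lines, 4 instances.
one=$(cache_one_size_kib 512 16 64)
all=$(cache_all_size_kib "$one" 4)
echo "ONE-SIZE=${one}K ALL-SIZE=${all}K"   # prints ONE-SIZE=512K ALL-SIZE=2048K
```

The same formula reproduces the L1 rows: 128 sets x 4 ways x 64 bytes is 32K per core.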

MasterKeyxda · 28 September 2024 01:14

Can you post the output of:

lscpu --output-all -e

doppingkoala · 28 September 2024 01:15

CoreELEC-21:~ # lscpu --output-all -e
BOGOMIPS CPU CORE SOCKET CLUSTER NODE BOOK DRAWER L1d:L1i:L2 POLARIZATION ADDRESS CONFIGURED ONLINE       MHZ SCALMHZ%    MAXMHZ   MINMHZ MODELNAME
   48.00   0    0      0       0    -    -      - 0:0:0      -            -       -             yes 1908.0000     100% 1908.0000 100.0000 Cortex-A55
   48.00   1    1      0       0    -    -      - 1:1:1      -            -       -             yes 1908.0000     100% 1908.0000 100.0000 Cortex-A55
   48.00   2    2      0       0    -    -      - 2:2:2      -            -       -             yes 1908.0000     100% 1908.0000 100.0000 Cortex-A55
   48.00   3    3      0       0    -    -      - 3:3:3      -            -       -             yes 1908.0000     100% 1908.0000 100.0000 Cortex-A55

rho-bot · 28 September 2024 01:15

Sorry for the confusion: according to ARM, the A55 has optional L2 and L3.
For the S905X3, the datasheet confirms there is only L1 and L3. The block diagram says the same, so your former mod was likely right.
For the S905X4, I have no datasheet, but the block diagram shows L1/L2/L3. This may work with your latest mod.

MasterKeyxda · 28 September 2024 01:24

Is there any difference in bursty loads between shared cache vs explicit cache?

L3 can be partitioned for embedded systems. We have a dynamic workload in our OS though.

akmarwah03 · 28 September 2024 01:27

I ran this one on an S905X4 and, for the first time, I’m seeing some values under lscpu:

CoreELEC:~ # lscpu
Architecture:            aarch64
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               ARM
  Model name:            Cortex-A55
    Model:               0
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          1
    Stepping:            r2p0
    CPU(s) scaling MHz:  100%
    CPU max MHz:         2000.0000
    CPU min MHz:         100.0000
    BogoMIPS:            48.00
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    256 KiB (4 instances)
  L3:                    512 KiB (1 instance)

CoreELEC:~ # lscpu --caches
NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d       32K     128K    4 Data            1  128                      64
L1i       32K     128K    4 Instruction     1  128                      64
L2        64K     256K    4 Unified         2  256                      64
L3       512K     512K   16 Unified         3  512                      64

doppingkoala · 28 September 2024 01:28

tbh I haven’t seen any differences between any of the options or the original without any cache specified. The UI is still laggy, hardware-accelerated video plays fine, and SW-decoded video is bad unless it is low bitrate.

MasterKeyxda · 28 September 2024 01:28

That’s it!! I’m glad the hierarchy is now found for the S905X4.