S905X4 Device Tree - Performance/Efficiency - Testing Needed

Hey everyone,

I recently came across a very insightful post by @MasterKeyxda discussing some performance hiccups with the Ugoos AM6B, related to how the Linux scheduler handles CPU process switching and cache detection. You should check out the original post for more details.

Following his work, I ran tests on my S905x4 CPU using his script, and I’d like to share my results here and get more feedback from others.

Anecdotal: Browsing the menus feels faster, and the YouTube add-on loads subscriptions and recommendations faster.

CoreELEC:~ # lscpu
Architecture:            aarch64
  CPU op-mode(s):        32-bit, 64-bit
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               ARM
  Model name:            Cortex-A55
    Model:               0
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          1
    Stepping:            r2p0
    CPU(s) scaling MHz:  100%
    CPU max MHz:         2004.0000
    CPU min MHz:         100.0000
    BogoMIPS:            48.00
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    256 KiB (4 instances)
  L3:                    512 KiB (1 instance)
Vulnerabilities:
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Spec store bypass:     Not affected
  Spectre v1:            Mitigation; __user pointer sanitization
  Spectre v2:            Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
CoreELEC:~ # lscpu --caches
NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d       32K     128K   16 Data            1   32                      64
L1i       32K     128K   16 Instruction     1   32                      64
L2        64K     256K    2 Unified         2  512                      64
L3       512K     512K   16 Unified         3  512                      64
CoreELEC:~ # lscpu --output-all -e
BOGOMIPS CPU CORE SOCKET CLUSTER NODE BOOK DRAWER L1d:L1i:L2:L3 POLARIZATION ADDRESS CONFIGURED ONLINE       MHZ SCALMHZ%    MAXMHZ   MINMHZ MODELNAME
   48.00   0    0      0       0    -    -      - 0:0:0:0       -            -       -             yes 2004.0000     100% 2004.0000 100.0000 Cortex-A55
   48.00   1    1      0       0    -    -      - 1:1:1:0       -            -       -             yes 2004.0000     100% 2004.0000 100.0000 Cortex-A55
   48.00   2    2      0       0    -    -      - 2:2:2:0       -            -       -             yes 2004.0000     100% 2004.0000 100.0000 Cortex-A55
   48.00   3    3      0       0    -    -      - 3:3:3:0       -            -       -             yes 2004.0000     100% 2004.0000 100.0000 Cortex-A55
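The WAYS column in the `lscpu --caches` output is not set directly by the script; it follows from the cache size, line size and set count. A minimal sketch of that arithmetic, assuming the usual relation ways = size / (line size × sets); the helper name `cache_ways` is made up for illustration:

```sh
#!/bin/sh
# ways = cache size / (line size * number of sets), all sizes in bytes.
cache_ways() {
    size=$1; line=$2; sets=$3
    echo $(( size / (line * sets) ))
}

cache_ways 32768 64 32     # L1 as reported above (32K, 32 sets): prints 16
cache_ways 65536 64 512    # L2 as reported above (64K, 512 sets): prints 2
cache_ways 524288 64 512   # L3 as reported above (512K, 512 sets): prints 16
cache_ways 32768 64 128    # L1 with 128 sets, as in later replies: prints 4
```

This only shows how the reported numbers relate to each other; it does not say which set count is right for this SoC.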

I’m interested in seeing whether others get similar or different results.

Script:

#!/bin/sh

(
    /bin/mount -o rw,remount /flash

    fdtput -p /flash/dtb.img /cpus/l3-cache0 compatible cache -t s
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-level 3 -t i
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-unified
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-size 0x80000 -t x
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-sets 512 -t i
    fdtput -p /flash/dtb.img /cpus/l3-cache0 phandle 1113 -t i

    fdtput -p /flash/dtb.img /cpus/l2-cache0 compatible cache -t s
    fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-level 2 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-size 0x10000 -t x
    fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-sets 512 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache0 next-level-cache 1113 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache0 phandle 1114 -t i

    fdtput -p /flash/dtb.img /cpus/l2-cache1 compatible cache -t s
    fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-level 2 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-size 0x10000 -t x
    fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-sets 512 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache1 next-level-cache 1113 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache1 phandle 1115 -t i

    fdtput -p /flash/dtb.img /cpus/l2-cache2 compatible cache -t s
    fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-level 2 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-size 0x10000 -t x
    fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-sets 512 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache2 next-level-cache 1113 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache2 phandle 1116 -t i

    fdtput -p /flash/dtb.img /cpus/l2-cache3 compatible cache -t s
    fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-level 2 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-size 0x10000 -t x
    fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-sets 512 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache3 next-level-cache 1113 -t i
    fdtput -p /flash/dtb.img /cpus/l2-cache3 phandle 1117 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@0 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@0 next-level-cache 1114 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@1 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@1 next-level-cache 1115 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@2 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@2 next-level-cache 1116 -t i

    fdtput -p /flash/dtb.img /cpus/cpu@3 compatible "arm,cortex-a55" "arm,armv8" -t s
    fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-line-size 64 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-size 0x8000 -t x
    fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-sets 32 -t i
    fdtput -p /flash/dtb.img /cpus/cpu@3 next-level-cache 1117 -t i


    /bin/sync
    /bin/mount -o ro,remount /flash

    read -p "Restart now? [Y/N]: " KEYINPUT
    if [ "$KEYINPUT" != "${KEYINPUT#[Yy]}" ]; then
        /sbin/reboot
    fi
)
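The four l2-cache blocks in the script differ only in the node index and the phandle, so they can be generated in a loop. The following is a dry-run sketch that only prints the commands; the function name `l2_cache_cmds` is hypothetical, and the phandle numbers (1113 for L3, 1114 and up for L2) are simply the ones used in the script above, which I am assuming are not already taken in your dtb.img:

```sh
#!/bin/sh
# Dry run: print the fdtput commands for the L2 cache nodes instead of running them.
DTB=/flash/dtb.img
L3_PHANDLE=1113          # phandle given to l3-cache0 in the script above

l2_cache_cmds() {
    n=$1
    node=/cpus/l2-cache$n
    phandle=$(( 1114 + n ))   # 1114..1117, as in the script above
    echo "fdtput -p $DTB $node compatible cache -t s"
    echo "fdtput -p $DTB $node cache-level 2 -t i"
    echo "fdtput -p $DTB $node cache-size 0x10000 -t x"
    echo "fdtput -p $DTB $node cache-line-size 64 -t i"
    echo "fdtput -p $DTB $node cache-sets 512 -t i"
    echo "fdtput -p $DTB $node next-level-cache $L3_PHANDLE -t i"
    echo "fdtput -p $DTB $node phandle $phandle -t i"
}

for idx in 0 1 2 3; do
    l2_cache_cmds "$idx"
done
```

Printing first makes it easy to compare the output against the hand-written script before anything touches /flash; piping the output to `sh` would then run the commands.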

Thanks again @MasterKeyxda and @doppingkoala for getting values and the script.

Script tested so far for:

  • Homatics Box R 4k Plus
  • DuneHD Homatics Box R 4k Plus

Here are the results for the Dune Homatics with the NG build:

CoreELEC:~ # lscpu
Architecture:            aarch64
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               ARM
  Model name:            Cortex-A55
    Model:               0
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          1
    Stepping:            r2p0
    CPU(s) scaling MHz:  100%
    CPU max MHz:         2000.0000
    CPU min MHz:         100.0000
    BogoMIPS:            48.00
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    256 KiB (4 instances)
  L3:                    512 KiB (1 instance)
CoreELEC:~ # lscpu --caches
NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d       32K     128K    4 Data            1  128                      64
L1i       32K     128K    4 Instruction     1  128                      64
L2        64K     256K    4 Unified         2  256                      64
L3       512K     512K   16 Unified         3  512                      64
CoreELEC:~ # lscpu --output-all -e
BOGOMIPS CPU CORE SOCKET CLUSTER NODE BOOK DRAWER L1d:L1i:L2:L3 POLARIZATION ADDRESS CONFIGURED ONLINE       MHZ SCALMHZ%    MAXMHZ   MINMHZ MODELNAME
   48.00   0    0      0       0    -    -      - 0:0:0:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55
   48.00   1    1      0       0    -    -      - 1:1:1:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55
   48.00   2    2      0       0    -    -      - 2:2:2:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55
   48.00   3    3      0       0    -    -      - 3:3:3:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55
CoreELEC:~ # cat /sys/class/aml_ddr/freq
600 MHz

Nokia 8010, NG cpm build:

CoreELEC build
CoreELEC (test): 21.1.1-Omega_devel_20240927112932 (Amlogic-ng.arm)
Machine model: RockTek G2, Nokia 8010
CoreELEC dt-id: sc2_s905x4_sei_smb_280_id7
Amlogic dt-id: sc2_s905x4_ah212-id7
Linux version: 4.9.269 (test@test) #1 Fri Sep 27 11:09:05 UTC 2024
Kodi compiled: 2024-09-27 13:22:21 +0200
Results
CoreELEC:~ # lscpu
Architecture:            aarch64
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               ARM
  Model name:            Cortex-A55
    Model:               0
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          1
    Stepping:            r2p0
    CPU(s) scaling MHz:  100%
    CPU max MHz:         2000.0000
    CPU min MHz:         100.0000
    BogoMIPS:            48.00
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    256 KiB (4 instances)
  L3:                    512 KiB (1 instance)
CoreELEC:~ # lscpu --caches
NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d       32K     128K    4 Data            1  128                      64
L1i       32K     128K    4 Instruction     1  128                      64
L2        64K     256K    4 Unified         2  256                      64
L3       512K     512K   16 Unified         3  512                      64
CoreELEC:~ # lscpu --output-all -e
BOGOMIPS CPU CORE SOCKET CLUSTER NODE BOOK DRAWER L1d:L1i:L2:L3 POLARIZATION ADDRESS CONFIGURED ONLINE       MHZ SCALMHZ%    MAXMHZ   MINMHZ MODELNAME
   48.00   0    0      0       0    -    -      - 0:0:0:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55
   48.00   1    1      0       0    -    -      - 1:1:1:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55
   48.00   2    2      0       0    -    -      - 2:2:2:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55
   48.00   3    3      0       0    -    -      - 3:3:3:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55
CoreELEC:~ # cat /sys/class/aml_ddr/freq
600 MHz

Wow, menu browsing is indeed a little faster than before and much smoother when using 5x rows of widgets on the home screen (finally feels like proper 60fps). Thank you very much, MasterKeyxda!

For people that feel the difference:

I have a proposal.

If it’s possible to change the DTB on Android TV, I’d like you to test Android 11 with the new script too.

I don’t think that is possible without root.

Kinhank G1

CoreELECg1:~ # lscpu
Architecture:            aarch64
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               ARM
  Model name:            Cortex-A55
    Model:               0
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          1
    Stepping:            r2p0
    CPU(s) scaling MHz:  100%
    CPU max MHz:         2000.0000
    CPU min MHz:         100.0000
    BogoMIPS:            48.00
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp
Caches (sum of all):
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    256 KiB (4 instances)
  L3:                    512 KiB (1 instance)
CoreELECg1:~ # lscpu --caches
NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d       32K     128K    4 Data            1  128                      64
L1i       32K     128K    4 Instruction     1  128                      64
L2        64K     256K    4 Unified         2  256                      64
L3       512K     512K   16 Unified         3  512                      64
CoreELECg1:~ # lscpu --output-all -e
BOGOMIPS CPU CORE SOCKET CLUSTER NODE BOOK DRAWER L1d:L1i:L2:L3 POLARIZATION ADDRESS CONFIGURED ONLINE       MHZ SCALMHZ%    MAXMHZ   MINMHZ MODELNAME
   48.00   0    0      0       0    -    -      - 0:0:0:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55
   48.00   1    1      0       0    -    -      - 1:1:1:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55
   48.00   2    2      0       0    -    -      - 2:2:2:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55
   48.00   3    3      0       0    -    -      - 3:3:3:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000 Cortex-A55

So it works. Thank You.
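For anyone who wants to double-check without lscpu: as far as I know, lscpu takes these cache values from the kernel's sysfs cacheinfo files. A small sketch that prints them for one CPU, assuming the files exist under `/sys/devices/system/cpu/cpu0/cache` on your kernel (the function name `show_caches` is made up, and missing files are shown as `?`):

```sh
#!/bin/sh
# Read one sysfs attribute, printing "?" if it is missing.
rd() {
    cat "$1" 2>/dev/null || echo '?'
}

# Print level, type, size, sets and ways for each cache index of one CPU.
show_caches() {
    base=${1:-/sys/devices/system/cpu/cpu0/cache}
    for d in "$base"/index*; do
        [ -d "$d" ] || continue
        printf 'L%s %s size=%s sets=%s ways=%s\n' \
            "$(rd "$d/level")" "$(rd "$d/type")" "$(rd "$d/size")" \
            "$(rd "$d/number_of_sets")" "$(rd "$d/ways_of_associativity")"
    done
}

show_caches
```

On an unpatched box this may print nothing, or only partial entries, which would itself be a sign that the kernel has no cache description to work with.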

I tried this on my X96-X4 Hardware Rev 1.3 box, which currently runs on 20.5-Nexus (Amlogic-ng.arm), Linux version 4.9.269 (portisch@ubuntu) (gcc version 12.2.0 (GCC) ) #1 SMP PREEMPT Tue Mar 5 10:40:40 CET 2024

CoreELEC:~ # lscpu
Architecture:            aarch64
  Byte Order:            Little Endian
CPU(s):                  4
  On-line CPU(s) list:   0-3
Vendor ID:               ARM
  Model name:            Cortex-A55
    Model:               0
    Thread(s) per core:  1
    Core(s) per cluster: 4
    Socket(s):           -
    Cluster(s):          1
    Stepping:            r2p0
    CPU(s) scaling MHz:  100%
    CPU max MHz:         2000.0000
    CPU min MHz:         100.0000
    BogoMIPS:            48.00
    Flags:               fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp
                          asimdhp asimddp
Caches (sum of all):     
  L1d:                   128 KiB (4 instances)
  L1i:                   128 KiB (4 instances)
  L2:                    256 KiB (4 instances)
  L3:                    512 KiB (1 instance)
CoreELEC:~ # lscpu --caches
NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d       32K     128K    4 Data            1  128                      64
L1i       32K     128K    4 Instruction     1  128                      64
L2        64K     256K    4 Unified         2  256                      64
L3       512K     512K   16 Unified         3  512                      64
CoreELEC:~ # lscpu --output-all -e
BOGOMIPS CPU CORE SOCKET CLUSTER NODE BOOK DRAWER L1d:L1i:L2:L3 POLARIZATION ADDRESS CONFIGURED ONLINE       MHZ SCALMHZ%    MAXMHZ   MINMHZ
   48.00   0    0      0       0    -    -      - 0:0:0:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000
   48.00   1    1      0       0    -    -      - 1:1:1:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000
   48.00   2    2      0       0    -    -      - 2:2:2:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000
   48.00   3    3      0       0    -    -      - 3:3:3:0       -            -       -             yes 2000.0000     100% 2000.0000 100.0000

CoreELEC:~ # cat /sys/class/aml_ddr/freq 
396 MHz


@Kill_Bill and @karmantyu did you feel any performance difference?

I did not feel any difference, but we should have run benchmarks anyway.
7-Zip is also an excellent benchmark tool; it can even measure RAM performance directly. A statically linked arm64 binary, 7zzs, is available and runs in the console on CoreELEC 20.5 Nexus (Amlogic-ng.arm). However, now that the dtb is modified, we would need to switch back to the original first to get comparable results.
https://7-zip.org/a/7z2408-linux-arm64.tar.xz

Benchmarks that can be run:
7zzs b -bt
7zzs b -bt -mm=*
7zzs b -mm=swap4 -mtic=30 -bt

They are all multithreaded and max out the CPU, so the box can get very hot!
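Since comparing patched and unpatched runs means switching the dtb back and forth, it helps to keep a copy of the original dtb.img before patching. A minimal sketch, assuming a backup next to the original with a `.orig` suffix (the suffix and the function names `backup_dtb` and `restore_dtb` are just my choice for illustration):

```sh
#!/bin/sh
# Keep a copy of the unpatched device tree so it can be restored for A/B benchmarks.
backup_dtb() {
    dtb=$1
    # Never overwrite an existing backup: it may be the only unpatched copy.
    [ -e "$dtb.orig" ] || cp -p "$dtb" "$dtb.orig"
}

# Copy the saved original back over the (patched) device tree.
restore_dtb() {
    dtb=$1
    [ -e "$dtb.orig" ] && cp -p "$dtb.orig" "$dtb"
}

# On the box, /flash is normally read-only, so wrap the calls the same way
# the patch script does:
#   mount -o rw,remount /flash
#   backup_dtb /flash/dtb.img      # before patching
#   restore_dtb /flash/dtb.img     # to go back to the original
#   sync
#   mount -o ro,remount /flash
```

A reboot is needed after restoring, just as after patching, because the device tree is only read at boot.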

So I ran these benchmarks on both the patched dtb.img and the original one, without shutting down Kodi (it was idling on the TV).

There is not much difference; I think the patch has no effect on the S905X4.
The results are in these text files:
DeviceTree_unpatched.txt (8.0 KB)
DeviceTree_patched.txt (8.0 KB)


@MasterKeyxda already said that it will not improve maximum performance but will raise minimum performance.

If someone has an ultra-high-speed camera, you could count the frames it takes to open things when navigating.

I don’t know if you are being sarcastic lmao. Unfortunately I sold my box today, so I don’t have an S905X4 anymore.

Lmao, after all this effort and making this thread, you sell? What was the reason?

I initially chose Homatics to dual boot Android and CoreELEC, thinking it would be convenient for switching between YouTube and Kodi. However, I found myself constantly toggling between the two OSs. Unfortunately, Kodi on Android gave me frequent stutters, which made it unusable.

I’ve also lost faith in the Homatics team to resolve the issues on the Android side, so I decided to take a different route. Now, I’m using the YouTube app on my TV’s OS and keeping a dedicated CoreELEC box without dual booting. Currently, I’m deciding between the AM6B+ and SK1 (I’m not too concerned about FEL; I want better performance than the S905X4). Even the X96 X10 Pro, if it works properly.

Hopefully this effort doesn’t go to waste and other people are able to notice some improvements


Thanks for the explanation. I now have a better idea of what not to get if I ever need Android. I’d wait until the devs have worked out the kinks on those boxes before settling on one.

7-Zip mainly relies on CPU core performance, not memory I/O, so it’s expected that there is no significant difference.
Cache affects memory performance for random reads and writes.
It would be better to test with a memory benchmark, e.g. ssvb/tinymembench on GitHub (a simple benchmark for memory throughput and latency).
Example results for the Odroid C2 are on the "ODROID C2" page of the ssvb/tinymembench wiki on GitHub.

I’m more of a user than a programmer. Could you provide something precompiled with a usage description?

I installed the System tool add-on from the CoreELEC repo and ran some tests with the stress-ng tool.

Not patched, original dtb.

CoreELEC (official): 21.1.1-Omega_nightly_20241003 (Amlogic-ng.arm)
      Machine model: Amlogic
     CoreELEC dt-id: sc2_s905x4_kinhank_g1
      Linux version: 4.9.269 (docker@894e25749f63) #1 Thu Oct 3 04:52:34 IDT 2024
      Kodi compiled: 2024-10-03 04:15:47 +0200

CoreELECg1:~ # stress-ng --skip-silent --cache-enable-all --class cpu-cache --all 1 -t 100
stress-ng: info:  [5912] setting to a 1 min, 40 secs run per stressor
stress-ng: info:  [5912] far-branch: architecture not supported
stress-ng: info:  [5912] flushcache: architecture not supported
stress-ng: info:  [5912] icache: architecture not supported
stress-ng: info:  [5912] dispatching hogs: 1 bitonicsort, 1 bsearch, 1 cache, 1 cacheline, 1 dekker, 1 heapsort, 1 hsearch, 1 insertionsort, 1 l1cache, 1 list, 1 llc-affinity, 1 lockbus, 1 lsearch, 1 malloc, 1 matrix, 1 matrix-3d, 1 membarrier, 1 memcpy, 1 mergesort, 1 misaligned, 1 peterson, 1 prefetch, 1 qsort, 1 radixsort, 1 shellsort, 1 skiplist, 1 sparsematrix, 1 spinmem, 1 str, 1 stream, 1 tree, 1 tsearch, 1 vecfp, 1 vecmath, 1 vecshuf, 1 vecwide, 1 wcs, 1 zlib
stress-ng: info:  [5915] cache: cache flags used: prefetch fence
stress-ng: info:  [5915] cache: unavailable unused cache flags: flush sfence clflushopt cldemote clwb
stress-ng: info:  [5916] cacheline: using built-in defaults as no suitable cache found
stress-ng: info:  [5916] cacheline: to fully exercise a 64 byte cache line, 32 instances are required
stress-ng: info:  [5918] heapsort: using method 'heapsort-nonlibc'
stress-ng: info:  [5935] qsort: using method 'qsort-libc'
stress-ng: info:  [5936] radixsort: using method 'radixsort-nonlibc'
stress-ng: info:  [5931] mergesort: using method 'mergesort-nonlibc'
stress-ng: info:  [5934] prefetch: using built-in defaults as no suitable cache found
stress-ng: info:  [5942] stream: using built-in defaults as no suitable cache found
stress-ng: info:  [5942] stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info:  [5942] stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info:  [5942] stream: Using cache size of 2048K
stress-ng: info:  [5939] sparsematrix: 10000 items in 500 x 500 sparse matrix (4.00% full)
stress-ng: info:  [5934] prefetch: using a 4096 KB L3 cache with prefetch method 'builtin'
stress-ng: info:  [5932] misaligned: exercised all int16rd int16wr int16inc int32rd int32wr int32inc
stress-ng: info:  [5942] stream: memory rate: 259.05 MB read/sec, 172.70 MB write/sec, 22.64 double precision Mflop/sec (instance 0)
stress-ng: info:  [5912] skipped: 7: far-branch (1) flushcache (1) icache (1) judy (1) l1cache (1) llc-affinity (1) wcs (1)
stress-ng: info:  [5912] passed: 35: bitonicsort (1) bsearch (1) cache (1) cacheline (1) dekker (1) heapsort (1) hsearch (1) insertionsort (1) list (1) lockbus (1) lsearch (1) malloc (1) matrix (1) matrix-3d (1) membarrier (1) memcpy (1) mergesort (1) misaligned (1) peterson (1) prefetch (1) qsort (1) radixsort (1) shellsort (1) skiplist (1) sparsematrix (1) spinmem (1) str (1) stream (1) tree (1) tsearch (1) vecfp (1) vecmath (1) vecshuf (1) vecwide (1) zlib (1)
stress-ng: info:  [5912] failed: 0
stress-ng: info:  [5912] metrics untrustworthy: 0
stress-ng: info:  [5912] successful run completed in 1 min, 40.58 secs

CoreELECg1:~ # stress-ng --skip-silent --cache-enable-all --class memory --all 1 -t 100
stress-ng: info:  [8670] setting to a 1 min, 40 secs run per stressor
stress-ng: info:  [8670] dispatching hogs: 1 atomic, 1 bad-altstack, 1 bitonicsort, 1 bsearch, 1 context, 1 full, 1 heapsort, 1 hsearch, 1 insertionsort, 1 list, 1 lockbus, 1 lsearch, 1 malloc, 1 matrix, 1 matrix-3d, 1 mcontend, 1 membarrier, 1 memcpy, 1 memfd, 1 memrate, 1 memthrash, 1 mergesort, 1 mincore, 1 misaligned, 1 null, 1 pipe, 1 pipeherd, 1 prefetch, 1 qsort, 1 radixsort, 1 randlist, 1 remap, 1 resources, 1 rmap, 1 shellsort, 1 skiplist, 1 sparsematrix, 1 spinmem, 1 stack, 1 stackmmap, 1 str, 1 stream, 1 tlb-shootdown, 1 tmpfs, 1 tree, 1 tsearch, 1 vm, 1 vm-addr, 1 vm-rw, 1 vm-segv, 1 wcs, 1 zero, 1 zlib
stress-ng: info:  [8678] heapsort: using method 'heapsort-nonlibc'
stress-ng: info:  [8700] mergesort: using method 'mergesort-nonlibc'
stress-ng: info:  [8699] memthrash: no NUMA nodes or maximum NUMA nodes, ignoring numa memthrash method
stress-ng: info:  [8699] memthrash: starting 4 threads on each of the 1 stressors on a 4 CPU system
stress-ng: info:  [8698] memrate: using buffer size of 262144K, cache flushing disabled
stress-ng: info:  [8698] memrate: cache flushing can be enabled with --memrate-flush option
stress-ng: info:  [8706] prefetch: using built-in defaults as no suitable cache found
stress-ng: info:  [8703] null: exercising /dev/null with writes, lseek, ioctl, fcntl, fallocate, fdatasync and mmap; for just write benchmarking use --null-write
stress-ng: info:  [8707] qsort: using method 'qsort-libc'
stress-ng: info:  [8708] radixsort: using method 'radixsort-nonlibc'
stress-ng: info:  [8706] prefetch: using a 4096 KB L3 cache with prefetch method 'builtin'
stress-ng: info:  [8731] sparsematrix: 10000 items in 500 x 500 sparse matrix (4.00% full)
stress-ng: info:  [8736] stream: using built-in defaults as no suitable cache found
stress-ng: info:  [8736] stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info:  [8736] stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info:  [8736] stream: Using cache size of 2048K
stress-ng: info:  [8746] zero: exercising /dev/zero with reads, mmap, lseek, and ioctl; for just read benchmarking use --zero-read
stress-ng: info:  [8702] misaligned: exercised all int16rd int16wr int16inc int32rd int32wr int32inc
stress-ng: info:  [8736] stream: memory rate: 41.85 MB read/sec, 27.90 MB write/sec, 3.66 double precision Mflop/sec (instance 0)
stress-ng: info:  [8670] skipped: 4: judy (1) numa (1) oom-pipe (1) wcs (1)
stress-ng: info:  [8670] passed: 52: atomic (1) bad-altstack (1) bitonicsort (1) bsearch (1) context (1) full (1) heapsort (1) hsearch (1) insertionsort (1) list (1) lockbus (1) lsearch (1) malloc (1) matrix (1) matrix-3d (1) mcontend (1) membarrier (1) memcpy (1) memfd (1) memrate (1) memthrash (1) mergesort (1) mincore (1) misaligned (1) null (1) pipe (1) pipeherd (1) prefetch (1) qsort (1) radixsort (1) randlist (1) remap (1) resources (1) rmap (1) shellsort (1) skiplist (1) sparsematrix (1) spinmem (1) stack (1) stackmmap (1) str (1) stream (1) tlb-shootdown (1) tmpfs (1) tree (1) tsearch (1) vm (1) vm-addr (1) vm-rw (1) vm-segv (1) zero (1) zlib (1)
stress-ng: info:  [8670] failed: 0
stress-ng: info:  [8670] metrics untrustworthy: 0
stress-ng: info:  [8670] successful run completed in 1 min, 41.53 secs
Patched dtb.
CoreELEC (official): 21.1.1-Omega_nightly_20241003 (Amlogic-ng.arm)
      Machine model: Amlogic
     CoreELEC dt-id: sc2_s905x4_kinhank_g1
      Linux version: 4.9.269 (docker@894e25749f63) #1 Thu Oct 3 04:52:34 IDT 2024
      Kodi compiled: 2024-10-03 04:15:47 +0200

CoreELECg1:~ # stress-ng --skip-silent --cache-enable-all --class cpu-cache --all 1 -t 100
stress-ng: info:  [5430] setting to a 1 min, 40 secs run per stressor
stress-ng: info:  [5430] far-branch: architecture not supported
stress-ng: info:  [5430] flushcache: architecture not supported
stress-ng: info:  [5430] icache: architecture not supported
stress-ng: info:  [5430] dispatching hogs: 1 bitonicsort, 1 bsearch, 1 cache, 1 cacheline, 1 dekker, 1 heapsort, 1 hsearch, 1 insertionsort, 1 l1cache, 1 list, 1 llc-affinity, 1 lockbus, 1 lsearch, 1 malloc, 1 matrix, 1 matrix-3d, 1 membarrier, 1 memcpy, 1 mergesort, 1 misaligned, 1 peterson, 1 prefetch, 1 qsort, 1 radixsort, 1 shellsort, 1 skiplist, 1 sparsematrix, 1 spinmem, 1 str, 1 stream, 1 tree, 1 tsearch, 1 vecfp, 1 vecmath, 1 vecshuf, 1 vecwide, 1 wcs, 1 zlib
stress-ng: info:  [5433] cache: cache flags used: prefetch fence
stress-ng: info:  [5433] cache: unavailable unused cache flags: flush sfence clflushopt cldemote clwb
stress-ng: info:  [5436] heapsort: using method 'heapsort-nonlibc'
stress-ng: info:  [5434] cacheline: to fully exercise a 64 byte cache line, 32 instances are required
stress-ng: info:  [5439] l1cache: l1cache: size: 32.0K, sets: 128, ways: 4, line size: 64 bytes
stress-ng: info:  [5457] sparsematrix: 10000 items in 500 x 500 sparse matrix (4.00% full)
stress-ng: info:  [5449] mergesort: using method 'mergesort-nonlibc'
stress-ng: info:  [5453] qsort: using method 'qsort-libc'
stress-ng: info:  [5460] stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info:  [5460] stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info:  [5460] stream: Using cache size of 512K
stress-ng: info:  [5454] radixsort: using method 'radixsort-nonlibc'
stress-ng: info:  [5441] llc-affinity: using LLC cache size of 512K
stress-ng: info:  [5452] prefetch: using a 512 KB L3 cache with prefetch method 'builtin'
stress-ng: info:  [5460] stream: memory rate: 230.35 MB read/sec, 153.57 MB write/sec, 20.13 double precision Mflop/sec (instance 0)
stress-ng: info:  [5450] misaligned: exercised all int16rd int16wr int16inc int32rd int32wr int32inc
stress-ng: info:  [5430] skipped: 5: far-branch (1) flushcache (1) icache (1) judy (1) wcs (1)
stress-ng: info:  [5430] passed: 37: bitonicsort (1) bsearch (1) cache (1) cacheline (1) dekker (1) heapsort (1) hsearch (1) insertionsort (1) l1cache (1) list (1) llc-affinity (1) lockbus (1) lsearch (1) malloc (1) matrix (1) matrix-3d (1) membarrier (1) memcpy (1) mergesort (1) misaligned (1) peterson (1) prefetch (1) qsort (1) radixsort (1) shellsort (1) skiplist (1) sparsematrix (1) spinmem (1) str (1) stream (1) tree (1) tsearch (1) vecfp (1) vecmath (1) vecshuf (1) vecwide (1) zlib (1)
stress-ng: info:  [5430] failed: 0
stress-ng: info:  [5430] metrics untrustworthy: 0
stress-ng: info:  [5430] successful run completed in 1 min, 41.61 secs
CoreELECg1:~ # stress-ng --skip-silent --cache-enable-all --class memory --all 1 -t 100
stress-ng: info:  [7384] setting to a 1 min, 40 secs run per stressor
stress-ng: info:  [7384] dispatching hogs: 1 atomic, 1 bad-altstack, 1 bitonicsort, 1 bsearch, 1 context, 1 full, 1 heapsort, 1 hsearch, 1 insertionsort, 1 list, 1 lockbus, 1 lsearch, 1 malloc, 1 matrix, 1 matrix-3d, 1 mcontend, 1 membarrier, 1 memcpy, 1 memfd, 1 memrate, 1 memthrash, 1 mergesort, 1 mincore, 1 misaligned, 1 null, 1 pipe, 1 pipeherd, 1 prefetch, 1 qsort, 1 radixsort, 1 randlist, 1 remap, 1 resources, 1 rmap, 1 shellsort, 1 skiplist, 1 sparsematrix, 1 spinmem, 1 stack, 1 stackmmap, 1 str, 1 stream, 1 tlb-shootdown, 1 tmpfs, 1 tree, 1 tsearch, 1 vm, 1 vm-addr, 1 vm-rw, 1 vm-segv, 1 wcs, 1 zero, 1 zlib
stress-ng: info:  [7391] heapsort: using method 'heapsort-nonlibc'
stress-ng: info:  [7413] null: exercising /dev/null with writes, lseek, ioctl, fcntl, fallocate, fdatasync and mmap; for just write benchmarking use --null-write
stress-ng: info:  [7409] memthrash: no NUMA nodes or maximum NUMA nodes, ignoring numa memthrash method
stress-ng: info:  [7409] memthrash: starting 4 threads on each of the 1 stressors on a 4 CPU system
stress-ng: info:  [7417] qsort: using method 'qsort-libc'
stress-ng: info:  [7425] sparsematrix: 10000 items in 500 x 500 sparse matrix (4.00% full)
stress-ng: info:  [7416] prefetch: using a 512 KB L3 cache with prefetch method 'builtin'
stress-ng: info:  [7410] mergesort: using method 'mergesort-nonlibc'
stress-ng: info:  [7418] radixsort: using method 'radixsort-nonlibc'
stress-ng: info:  [7430] stream: stressor loosely based on a variant of the STREAM benchmark code
stress-ng: info:  [7430] stream: do NOT submit any of these results to the STREAM benchmark results
stress-ng: info:  [7430] stream: Using cache size of 512K
stress-ng: info:  [7440] zero: exercising /dev/zero with reads, mmap, lseek, and ioctl; for just read benchmarking use --zero-read
stress-ng: info:  [7408] memrate: using buffer size of 262144K, cache flushing disabled
stress-ng: info:  [7408] memrate: cache flushing can be enabled with --memrate-flush option
stress-ng: info:  [7430] stream: memory rate: 47.53 MB read/sec, 31.69 MB write/sec, 4.15 double precision Mflop/sec (instance 0)
stress-ng: info:  [7412] misaligned: exercised all int16rd int16wr int16inc int32rd int32wr int32inc
stress-ng: info:  [7384] skipped: 4: judy (1) numa (1) oom-pipe (1) wcs (1)
stress-ng: info:  [7384] passed: 52: atomic (1) bad-altstack (1) bitonicsort (1) bsearch (1) context (1) full (1) heapsort (1) hsearch (1) insertionsort (1) list (1) lockbus (1) lsearch (1) malloc (1) matrix (1) matrix-3d (1) mcontend (1) membarrier (1) memcpy (1) memfd (1) memrate (1) memthrash (1) mergesort (1) mincore (1) misaligned (1) null (1) pipe (1) pipeherd (1) prefetch (1) qsort (1) radixsort (1) randlist (1) remap (1) resources (1) rmap (1) shellsort (1) skiplist (1) sparsematrix (1) spinmem (1) stack (1) stackmmap (1) str (1) stream (1) tlb-shootdown (1) tmpfs (1) tree (1) tsearch (1) vm (1) vm-addr (1) vm-rw (1) vm-segv (1) zero (1) zlib (1)
stress-ng: info:  [7384] failed: 0
stress-ng: info:  [7384] metrics untrustworthy: 0
stress-ng: info:  [7384] successful run completed in 1 min, 52.12 secs

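From the command output, it seems that the patch makes the cache layout visible to the system: stress-ng now detects the L1 cache (32K, 128 sets, 4 ways) and a 512K L3 instead of falling back to built-in defaults, and the l1cache and llc-affinity stressors, which were skipped before, now run. In these single runs the stream rate was slightly lower in the cpu-cache class (230.35 vs 259.05 MB read/sec) and higher in the memory class (47.53 vs 41.85 MB read/sec). So the patch could go upstream if nobody sees setbacks.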
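To put numbers on the comparison, the stream read rates can be pulled out of the stress-ng logs and turned into a percentage change. A rough sketch, with helper names (`stream_read_rate`, `pct_change`) made up for illustration; these are single runs with many stressors active at once, so small differences may well be noise:

```sh
#!/bin/sh
# Extract the read rate (MB/s) from a stress-ng "stream: memory rate" log line on stdin.
stream_read_rate() {
    sed -n 's/.*stream: memory rate: \([0-9.]*\) MB read\/sec.*/\1/p'
}

# Percentage change from the first value to the second, one decimal place.
pct_change() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) / old * 100 }'
}

pct_change 259.05 230.35   # cpu-cache class, unpatched vs patched: prints -11.1
pct_change 41.85 47.53     # memory class, unpatched vs patched: prints 13.6
```

With saved logs, something like `stream_read_rate < unpatched.log` would give the first argument (the file name is only an example).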