Just saw A2, thank you.
@YadaYada sorry I couldn't come up with a test that could verify all of the talk above.
I'm willing to believe the many users who have reported a nice quality-of-life improvement.
In my first post
With thousands of programs and tens of thousands of dependencies, who knows which program checks for what. Again, this is only about user space; the hardware always had cache and it was always used. The memcpy functions of glibc are written in assembly, so I don't know exactly what they're doing, but most likely they're already highly optimized.
Closing poll
- Yes
- No
I looked into this but I don't have the knowledge to progress from here.
1.dts (123.0 KB)
@YadaYada I looked in the scheduler code since I can't come up with a test method. I do not know how the kernel gets the data on whether CPUs share cache in this code.
/*
* The purpose of wake_affine() is to quickly determine on which CPU we can run
* soonest. For the purpose of speed we only consider the waking and previous
* CPU.
*
* wake_affine_idle() - only considers 'now', it check if the waking CPU is
* cache-affine and is (or will be) idle.
*
* wake_affine_weight() - considers the weight to reflect the average
* scheduling latency of the CPUs. This seems to work
* for the overloaded case.
*/
static int
wake_affine_idle(int this_cpu, int prev_cpu, int sync)
{
/*
* If this_cpu is idle, it implies the wakeup is from interrupt
* context. Only allow the move if cache is shared. Otherwise an
* interrupt intensive workload could force all tasks onto one
* node depending on the IO topology or IRQ affinity settings.
*
* If the prev_cpu is idle and cache affine then avoid a migration.
* There is no guarantee that the cache hot data from an interrupt
* is more important than cache hot data on the prev_cpu and from
* a cpufreq perspective, it's better to have higher utilisation
* on one CPU.
*/
if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu;
if (sync && cpu_rq(this_cpu)->nr_running == 1)
return this_cpu;
return nr_cpumask_bits;
}
This is just one instance I can point to in the kernel scheduler.
if (available_idle_cpu(this_cpu) && cpus_share_cache(this_cpu, prev_cpu))
This code only executes if cpus_share_cache() returns true. With the device tree we defined, the S922X A53 and A73 clusters do not share cache, so we will never reach return available_idle_cpu(prev_cpu) ? prev_cpu : this_cpu; when checking between an A53 and an A73. How often does that happen? Anybody's guess.
As the comment says, "Only allow the move if cache is shared." The cache is not shared, therefore no move is allowed.
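A rough cross-check from userspace (a sketch, assuming the standard Linux sysfs cache nodes under /sys/devices/system/cpu; the expand_list helper is mine): print which CPUs the kernel believes share each cache level with cpu0, which is the same topology information cpus_share_cache() is ultimately derived from.

```shell
#!/bin/sh
# Sketch: for each cache level visible to cpu0, show which CPUs share it.
# Assumes the standard sysfs layout /sys/devices/system/cpu/cpuN/cache/indexM.

# Expand a kernel cpulist string like "0-3,6" into "0 1 2 3 6".
expand_list() {
    echo "$1" | tr ',' '\n' | while IFS=- read -r lo hi; do
        seq "$lo" "${hi:-$lo}"
    done | xargs
}

for idx in /sys/devices/system/cpu/cpu0/cache/index*; do
    [ -f "$idx/shared_cpu_list" ] || continue
    printf 'L%s %-11s shared by CPUs: %s\n' \
        "$(cat "$idx/level")" "$(cat "$idx/type")" \
        "$(expand_list "$(cat "$idx/shared_cpu_list")")"
done
```

If the big and little clusters never appear together in any level's shared_cpu_list, then cpus_share_cache() across clusters should be false and that early-return path in wake_affine_idle() is unreachable between them.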
The CPUs are complex and I don't claim to understand what's happening, but at a glance it seems the kernel will change its behavior as to which CPU does the work.
If someone here can write assembly, then maybe they can come up with two different assembly programs, one which uses the cache info and one that doesn't, from userspace.
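Short of writing assembly, one simpler thing to compare from userspace is what cache geometry glibc actually reports to programs. A sketch, assuming a glibc-based userland where getconf exposes the sysconf(3) cache variables:

```shell
#!/bin/sh
# Sketch: list the cache geometry values glibc exposes to userspace via
# sysconf(3)/getconf(1); a program "using the cache info" could branch on
# these (they read as 0 or blank when the kernel provides nothing).
getconf -a | grep -i 'CACHE' || echo "no cache values exposed"
```

On boards whose DTB carries no cache nodes, these values typically come back 0 or blank, which is exactly the situation the DTB edits in this thread are trying to fix.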
Adding the additional DTB information probably isn't hurting anything, so it's not a bad idea to include it in some form either way. I was looking at an A311D2 DTB (basically a newer version of the S922X: 4xA73+4xA53, kernel 5.4), and it does include the capacity-dmips-mhz properties now, and a single L2 cache node. Maybe the kernel is able to automatically determine that there are two separate L2 caches, and the separate nodes aren't necessary?
For that I'd like to see the output of lscpu --caches.
Our output has the L2 separated by cluster explicitly (the L1d:L1i:L2 column):
0:0:0
0:1:0
1:0:1
1:1:1
1:2:1
Between 5.4 and 4.9, who knows what changed under the hood in the kernel.
This is beyond my knowledge now.
Story below, no technical info:
I'll share a story of how I found out that ARM CPUs cannot do the fast inverse square root, specifically the i.MX8.
I was working on a sensor, trying to find orientation with a Madgwick filter, which was just not working whatsoever! We used gcc to compile the binary. I was managing an SDE at the time, and we were both stumped as to why the hell it was not working. I programmed it in Python with a regular-speed inverse square root and he programmed the same algorithm in C++; mine worked fine and his didn't. We debugged and found that the fast inverse square root taken from Wikipedia works fine on x64 but not on ARM. Of course we tested both the Python and the C++ on ARM. We stopped using the fast inverse square root and suddenly the code and the sensor just worked! Due to deadlines we never went further in debugging, and to this day I don't have an answer to why that happened. A case of beyond my knowledge; I'm not a programmer.
The A311D2 documentation is clearer than the S922X's about there being two discrete L2 caches for the two CPU clusters.
https://dl.khadas.com/products/vim4/datasheet/a311d2-quick-reference-manual-rev-c-0.2.pdf
I don't have an A311D2 running Linux/CE to check lscpu. CE does support the SoC, so there may be someone else here with a VIM4 or GT King II that can check. I had looked at an Android DTB; I don't know if the CE DTB for that SoC includes that information too.
EDIT: the CE DTB doesn't have the L2 cache node, it would need to be added to check this device.
Is there a thread where one can keep up with new builds? I had been looking here, but unless I missed it, I'm not seeing one toward the end of that thread. Thanks.
For anyone that copied my previous script: it had a problem where a phandle was duplicated for the cache. The revised script is:
#!/bin/sh
(
/bin/mount -o rw,remount /flash
fdtput -p /flash/dtb.img /cpus/l3-cache0 compatible cache -t s
fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-level 2 -t i
fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-unified
fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-size 0x7d000 -t x
fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-sets 512 -t i
fdtput -p /flash/dtb.img /cpus/l3-cache0 phandle 1113 -t i
fdtput -p /flash/dtb.img /cpus/cpu@0 compatible "arm,cortex-a55" "arm,armv8" -t s
fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@0 next-level-cache 1113 -t i
fdtput -p /flash/dtb.img /cpus/cpu@1 compatible "arm,cortex-a55" "arm,armv8" -t s
fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@1 next-level-cache 1113 -t i
fdtput -p /flash/dtb.img /cpus/cpu@2 compatible "arm,cortex-a55" "arm,armv8" -t s
fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@2 next-level-cache 1113 -t i
fdtput -p /flash/dtb.img /cpus/cpu@3 compatible "arm,cortex-a55" "arm,armv8" -t s
fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@3 next-level-cache 1113 -t i
/bin/sync
/bin/mount -o ro,remount /flash
read -p "Restart now? [Y/N]: " KEYINPUT
if [ "$KEYINPUT" != "${KEYINPUT#[Yy]}" ]; then
/sbin/reboot
fi
)
I think you are missing the L2 cache configuration (per core!) between L1 & L3.
You should see L2 2M (4 instances)
Hierarchy should look like:
cpu@0 next-level-cache L2-0
cpu@1 next-level-cache L2-1
[...]
L2-0 / cache-level 2 / next-level-cache L3
L2-1 / cache-level 2 / next-level-cache L3
[...]
L3 / cache-level 3
I was going off [PATCHv1 1/5] arm64: dts: amlogic: Add cache information to the Amlogic GXBB and GXL SoC and other comments here that seemed to suggest there isn't an L2 cache:
ARM Cortex-A55 CPU uses unified L3 cache instead of L2 cache
I'll try adding an L2 and see what happens anyway.
If there is an L2, then it must be private per core.
And if it's private per core, then that kernel function I mentioned above applies to you too; it has an effect depending on IO topology and IRQ affinity.
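For reference, the IRQ affinity settings that kernel comment mentions are visible under /proc/irq. A sketch (the mask_to_cpus helper is mine, and it assumes a single-word mask, i.e. at most 32 CPUs):

```shell
#!/bin/sh
# Sketch: decode a hex smp_affinity mask (as found in /proc/irq/N/smp_affinity)
# into the CPUs it allows, then dump the live settings where readable.

# mask_to_cpus 0xb  ->  "0 1 3"
mask_to_cpus() {
    mask=$(( $1 ))    # accepts a 0x-prefixed hex word
    cpu=0
    out=""
    while [ "$mask" -ne 0 ]; do
        [ $(( mask & 1 )) -ne 0 ] && out="$out $cpu"
        mask=$(( mask >> 1 ))
        cpu=$(( cpu + 1 ))
    done
    echo "${out# }"
}

for f in /proc/irq/[0-9]*/smp_affinity; do
    m=$(cat "$f" 2>/dev/null) || continue
    [ -n "$m" ] || continue
    case "$m" in *,*) continue ;; esac   # skip multi-word masks (>32 CPUs)
    printf '%s -> CPUs %s\n' "$f" "$(mask_to_cpus "0x$m")"
done
```

On a stock box most IRQs default to the full mask unless a driver or irqbalance narrows them, which is when the interrupt-context path in wake_affine_idle() starts to matter.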
So I updated the script to
s905x3 with L2 cache
#!/bin/sh
(
/bin/mount -o rw,remount /flash
fdtput -p /flash/dtb.img /cpus/l3-cache0 compatible cache -t s
fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-level 3 -t i
fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-unified
fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-size 0x7d000 -t x
fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/l3-cache0 cache-sets 512 -t i
fdtput -p /flash/dtb.img /cpus/l3-cache0 phandle 1113 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache0 compatible cache -t s
fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-level 2 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-size 0x10000 -t x
fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache0 cache-sets 512 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache0 next-level-cache 1113 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache0 phandle 1114 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache1 compatible cache -t s
fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-level 2 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-size 0x10000 -t x
fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache1 cache-sets 512 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache1 next-level-cache 1113 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache1 phandle 1115 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache2 compatible cache -t s
fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-level 2 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-size 0x10000 -t x
fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache2 cache-sets 512 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache2 next-level-cache 1113 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache2 phandle 1116 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache3 compatible cache -t s
fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-level 2 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-size 0x10000 -t x
fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache3 cache-sets 512 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache3 next-level-cache 1113 -t i
fdtput -p /flash/dtb.img /cpus/l2-cache3 phandle 1117 -t i
fdtput -p /flash/dtb.img /cpus/cpu@0 compatible "arm,cortex-a55" "arm,armv8" -t s
fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@0 d-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@0 i-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@0 next-level-cache 1114 -t i
fdtput -p /flash/dtb.img /cpus/cpu@1 compatible "arm,cortex-a55" "arm,armv8" -t s
fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@1 d-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@1 i-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@1 next-level-cache 1115 -t i
fdtput -p /flash/dtb.img /cpus/cpu@2 compatible "arm,cortex-a55" "arm,armv8" -t s
fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@2 d-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@2 i-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@2 next-level-cache 1116 -t i
fdtput -p /flash/dtb.img /cpus/cpu@3 compatible "arm,cortex-a55" "arm,armv8" -t s
fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@3 d-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-line-size 64 -t i
fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-size 0x8000 -t x
fdtput -p /flash/dtb.img /cpus/cpu@3 i-cache-sets 32 -t i
fdtput -p /flash/dtb.img /cpus/cpu@3 next-level-cache 1117 -t i
/bin/sync
/bin/mount -o ro,remount /flash
read -p "Restart now? [Y/N]: " KEYINPUT
if [ "$KEYINPUT" != "${KEYINPUT#[Yy]}" ]; then
/sbin/reboot
fi
)
@rho-bot is that how it should be?
and am now getting
CoreELEC-21:~ # lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Cortex-A55
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 4
Socket(s): -
Cluster(s): 1
Stepping: r1p0
CPU(s) scaling MHz: 100%
CPU max MHz: 1908.0000
CPU min MHz: 100.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 2 MiB (4 instances)
CoreELEC-21:~ # lscpu --caches
NAME ONE-SIZE ALL-SIZE WAYS TYPE LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d 32K 128K 4 Data 1 128 64
L1i 32K 128K 4 Instruction 1 128 64
L2 512K 2M 16 Unified 2 512 64
To get the 2 MB, is it counting the 512 KB L3 four times?
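My understanding (an assumption about how lscpu sums, not verified against its source): ALL-SIZE is just ONE-SIZE times the instance count, and since all four cpu@N nodes point at the same cache node via next-level-cache, it is counted once per CPU:

```shell
#!/bin/sh
# Sketch: lscpu's ALL-SIZE = ONE-SIZE * instances; one 512K node
# referenced by four cpu@N entries shows up as four instances.
one_size_kib=512
instances=4
echo "ALL-SIZE = $(( one_size_kib * instances / 1024 ))M"
```

So the 2M total is consistent with the one 512K node being counted four times.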
Can you post the output of:
lscpu --output-all -e
CoreELEC-21:~ # lscpu --output-all -e
BOGOMIPS CPU CORE SOCKET CLUSTER NODE BOOK DRAWER L1d:L1i:L2 POLARIZATION ADDRESS CONFIGURED ONLINE MHZ SCALMHZ% MAXMHZ MINMHZ MODELNAME
48.00 0 0 0 0 - - - 0:0:0 - - - yes 1908.0000 100% 1908.0000 100.0000 Cortex-A55
48.00 1 1 0 0 - - - 1:1:1 - - - yes 1908.0000 100% 1908.0000 100.0000 Cortex-A55
48.00 2 2 0 0 - - - 2:2:2 - - - yes 1908.0000 100% 1908.0000 100.0000 Cortex-A55
48.00 3 3 0 0 - - - 3:3:3 - - - yes 1908.0000 100% 1908.0000 100.0000 Cortex-A55
Sorry for the confusion; according to ARM, the A55 has optional L2 & L3.
For the S905X3, their datasheet confirms there is only L1 & L3, and the block diagram says the same, so your former mod was likely right.
For the S905X4 I have no datasheet, but the block diagram shows L1/L2/L3. This may work with your latest mod.
Is there any difference in bursty loads between shared cache vs explicit cache?
L3 can be partitioned for embedded systems. We have a dynamic workload in our OS though.
I ran this one on the S905X4 and for the first time I'm seeing some values under lscpu:
CoreELEC:~ # lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Cortex-A55
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 4
Socket(s): -
Cluster(s): 1
Stepping: r2p0
CPU(s) scaling MHz: 100%
CPU max MHz: 2000.0000
CPU min MHz: 100.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp
Caches (sum of all):
L1d: 128 KiB (4 instances)
L1i: 128 KiB (4 instances)
L2: 256 KiB (4 instances)
L3: 512 KiB (1 instance)
CoreELEC:~ # lscpu --caches
NAME ONE-SIZE ALL-SIZE WAYS TYPE LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d 32K 128K 4 Data 1 128 64
L1i 32K 128K 4 Instruction 1 128 64
L2 64K 256K 4 Unified 2 256 64
L3 512K 512K 16 Unified 3 512 64
tbh I haven't seen any differences between any of the options, or the original without any cache specified. The UI is still laggy, hardware-accelerated video plays fine, and SW-decoded video is bad unless it's low bitrate.
That's it!! I'm glad the hierarchy is now found for the S905X4.