It can be NG or NE, doesn’t really matter…
I ran the script posted by @doppingkoala on a 905X4, but lscpu still displays no cache.
Thank you! I’m inclined to leave the higher values. I see AVI files at low resolution using the A55 cores only without breaking a sweat, utilizing big.LITTLE to its maximum potential.
We can change it if data proves that the values need changing. With so much interest and a new on-the-fly script, I’m sure we’ll get all combinations and permutations.
Your thoughts on it?
@doppingkoala if you want to define your CPU cores as independent, you can create 4 clusters, so that each CPU is assigned its own cluster in software. You can have n clusters. It should stop the scheduler from moving processes around.
The downside would be that the calculation for bumping a process to another core will have a high threshold, with all the consequences that brings.
If you indeed have per-core caches then it should actually help. If you have shared caches it might make the Linux scheduler slower. Try it! Worth a shot.
Edit: @doppingkoala can you add an L2 cache of size 0 and then add an L3 of size 512 KB? You define an L3, but the kernel assumes it is L2.
See the kernel function cpus_share_cache().
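For context, that function just compares the two CPUs’ last-level-cache scheduling-domain IDs, so what the dtb says about shared caches directly changes its answer. Paraphrased from kernel/sched/core.c (the exact body varies by kernel version):

/* CPUs are considered to share a cache if their LLC domain IDs match */
bool cpus_share_cache(int this_cpu, int that_cpu)
{
	return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);
}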
That SoC has been hard to find the right values for. Maybe we start with the L1 cache only before we go to L2 and L3.
So make a new L2 cache with a size of zero and change each core so that it is the next-level-cache? How do I then reference the L3 cache?
Inside the level 2 cache you create a new level 3 cache.
l2-cache {
	cache-level = <2>;
	/* ... other L2 properties ... */
	next-level-cache = <&L3>;

	L3: l3-cache {
		/* ... other L3 properties ... */
		cache-level = <3>;
	};
};
Nested like that.
It might be dumb and I may be asking something impossible. You can definitely call out my ignorance.
“They didn’t know, it was impossible, so they did it.”
I’ll give it a try later. Looking at some other dtsi files though, it looks like the approach used for devices with no L2 is to directly reference the L3 cache from each of the cpu cores (as it already is). There is a difference, though: it looks like the cache-level for the L3 cache is set to 2 by others…
You bring me back to my aerospace days, when I used to code partial differential equation solvers for aircraft airflow. <3 I got a chuckle out of your message.
There are a number of factors when it comes to the UI, like which skin is being used, what add-ons are present, eMMC vs USB install. There’s also just the placebo effect. When I first added the cache nodes I was convinced the UI felt faster too. But since I have been going back and forth between cache and no cache nodes, I lost that feeling.
I think we need more standardized testing that relies less on how fast something feels. For example, a few people were saying bootup times felt faster. That’s reported by the system, so we can grab those results.
CoreELEC:~ # systemd-analyze
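A sketch of how the numbers could be collected across boots (the log path is made up; run it once the system has finished booting):

# append this boot's summary line to a log
systemd-analyze >> /storage/boot-times.log

# average the kernel times collected so far, assuming lines like
# "Startup finished in 5.509s (kernel) + 7.930s (userspace) = 13.439s"
# (times over a minute print differently, so this is only a sketch)
awk '/^Startup/ { gsub(/s$/, "", $4); sum += $4; n++ }
     END { if (n) printf "avg kernel: %.3fs over %d boots\n", sum / n, n }' /storage/boot-times.log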
I took the average of 7 bootups with the cache nodes present in the dtb, vs the average of 4 without the cache nodes.
With L2 cache nodes (avg of 7 boots)
kernel:       5.509s (1.2% slower)
userspace:    7.930s (2.5% faster)
kodi.target:  6.207s (1.1% faster)
total:       19.647s (1.1% faster)
Without L2 cache nodes (avg of 4 boots)
kernel:       5.441s
userspace:    8.136s
kodi.target:  6.279s
total:       19.857s
So maybe adding the nodes speeds up the boot time by 1.1%, but that might just be variation.
The other measure I saw mentioned is load balancing across CPU cores with software decoding. The examples I saw were based on a one-second average, though, and it’s hard to gauge things off a snapshot when the CPU load shifts a lot.
I used a script to get the average CPU load (1 s resolution) over a few 3-5 min AV1 1080p videos. Looking at the average over a few minutes, I got consistent results on re-views. Example:
With L2 cache nodes - Top Gun: Maverick trailer - AV1 1080p
Average Loads (averaged over 259 seconds):
Core 0 Average Load: 41.2%
Core 1 Average Load: 31.7%
Core 2 Average Load: 20.2%
Core 3 Average Load: 18.3%
Core 4 Average Load: 17.9%
Core 5 Average Load: 16.4%
Without L2 cache nodes - Top Gun: Maverick trailer
Average Loads (averaged over 244 seconds):
Core 0 Average Load: 42.0%
Core 1 Average Load: 32.9%
Core 2 Average Load: 20.6%
Core 3 Average Load: 19.1%
Core 4 Average Load: 18.0%
Core 5 Average Load: 17.4%
The results look very similar. I didn’t see a noticeable difference with the other videos I tested using SW decoding either.
Something may be wrong with my setup, but if we could standardize a couple tests, it would make direct comparisons easier.
If you have a video where you do see a shift in the CPU load balance, can you share it? I will run it on my setup to see if I can reproduce the effect. It’s better if the video is over 2 min long to reduce the variation.
Run this script to get the average load for each CPU; hit CTRL+C to stop and the averages will be shown.
cpu-load.sh (2.1 KB)
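For reference, a minimal sketch of the same idea (the attached script may well differ; this one samples /proc/stat once per second for a fixed number of seconds passed as $1, instead of waiting for CTRL+C):

#!/bin/sh
# Print each core's average busy percentage over $1 seconds (default 60).
DURATION="${1:-60}"

i=0
while [ "$i" -le "$DURATION" ]; do
	grep '^cpu[0-9]' /proc/stat
	i=$((i + 1))
	[ "$i" -le "$DURATION" ] && sleep 1
done | awk '
{
	# /proc/stat fields: cpuN user nice system idle iowait irq softirq ...
	busy = $2 + $3 + $4 + $7 + $8
	total = busy + $5 + $6
	if ($1 in last_total) {
		busy_sum[$1] += busy - last_busy[$1]
		total_sum[$1] += total - last_total[$1]
	}
	last_busy[$1] = busy
	last_total[$1] = total
}
END {
	for (cpu in total_sum)
		if (total_sum[cpu] > 0)
			printf "%s average load: %.1f%%\n", cpu, 100 * busy_sum[cpu] / total_sum[cpu]
}'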
Is it a shared L3 labeled as level 2?
Or is it a per core separate cache?
Looks like a shared L3. I was looking at linux/arch/arm64/boot/dts/rockchip/rk356x.dtsi at master · torvalds/linux · GitHub for example.
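From memory, the pattern there is roughly the following (the exact sizes are in the file itself): a single cache node marked as level 2 that every cpu node points at directly:

l3_cache: l3-cache {
	compatible = "cache";
	cache-level = <2>;	/* labelled L3, but declared as level 2 */
	cache-unified;
	/* cache-size, cache-line-size, cache-sets per the SoC */
};

cpu0: cpu@0 {
	/* ... */
	next-level-cache = <&l3_cache>;
};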
Any direction to take when starting with L1 cache only? I tried removing the L3 cache part. I can experiment with some values.
#!/bin/sh
(
/bin/mount -o rw,remount /flash

# Write L1 cache properties for each A55 core into the dtb
# (-p creates missing properties; -t i = integer, -t x = hex).
for CPU in 0 1 2 3; do
	NODE="/cpus/cpu@$CPU"
	fdtput -p -t i /flash/dtb.img "$NODE" d-cache-line-size 32
	fdtput -p -t x /flash/dtb.img "$NODE" d-cache-size 0x8000
	fdtput -p -t i /flash/dtb.img "$NODE" d-cache-sets 32
	fdtput -p -t i /flash/dtb.img "$NODE" i-cache-line-size 32
	fdtput -p -t x /flash/dtb.img "$NODE" i-cache-size 0x8000
	fdtput -p -t i /flash/dtb.img "$NODE" i-cache-sets 32
done

/bin/sync
/bin/mount -o ro,remount /flash

# Offer a reboot so the modified dtb takes effect.
read -p "Restart now? [Y/N]: " KEYINPUT
if [ "$KEYINPUT" != "${KEYINPUT#[Yy]}" ]; then
	/sbin/reboot
fi
)
lscpu:
CoreELEC:~ # lscpu
Architecture: aarch64
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Vendor ID: ARM
Model name: Cortex-A55
Model: 0
Thread(s) per core: 1
Core(s) per cluster: 4
Socket(s): -
Cluster(s): 1
Stepping: r2p0
CPU(s) scaling MHz: 100%
CPU max MHz: 2000.0000
CPU min MHz: 100.0000
BogoMIPS: 48.00
Flags: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp asimddp
Keep it shared. That may be the best we can do.
All AV1 videos
1080p - All cores
Average Loads (averaged over 13 seconds):
Core 0 Average Load: 53.0%
Core 1 Average Load: 42.6%
Core 2 Average Load: 31.3%
Core 3 Average Load: 29.1%
Core 4 Average Load: 27.1%
Core 5 Average Load: 26.3%
720p
Average Loads (averaged over 13 seconds):
Core 0 Average Load: 52.4%
Core 1 Average Load: 44.3%
Core 2 Average Load: 15.1%
Core 3 Average Load: 5.7%
Core 4 Average Load: 5.0%
Core 5 Average Load: 2.5%
360p
Average Loads (averaged over 13 seconds):
Core 0 Average Load: 39.5%
Core 1 Average Load: 23.8%
Core 2 Average Load: 4.6%
Core 3 Average Load: 1.0%
Core 4 Average Load: 0.6%
Core 5 Average Load: 0.2%
You can see that as the demand for CPU cycles goes down, the scheduler prioritizes the A53 cores. Look at the very potato-quality AV1 360p: all decoding is done on the A53s. The A73s are mostly doing some system tasks to keep Linux from crashing.
Now look at 720p: the majority of the work is done by the A53s alone. A small amount is done by the A73s, but the A53s still do the heavy lifting.
Now look at 1080p: the scheduler is doing its damn best to keep all CPU loads under 80%, and is therefore utilizing all 6 cores even though they are uneven in power (A53 weaker than A73 kind of unevenness).
Sorry, none of my test videos were over 2 min.
There aren’t going to be any speed increases in benchmarks at high CPU usage. The hardware is the same as before. We are trying to bump up the low end, without consuming extra power, to make the device feel faster.
I might have to sleep on this. Someone’s gotta have an epiphany or read 100s of pages of ARM documentation to find that one sentence which makes us go “aha”. I do not volunteer for this; I need sleep after fighting GitHub for 3 hours today.
@akmarwah03 line-size is 64 for the ARM A55.
https://developer.arm.com/documentation/100442/0100/functional-description/level-1-memory-system/about-the-l1-memory-system?lang=en
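If the line size changes, the sets value has to change with it. Assuming the 32 KB, 4-way set associative L1 organization described in the A55 TRM, that works out to 32768 / (64 × 4) = 128 sets. A sketch of the corrected fdtput calls:

# 64-byte lines, and 32768 / (64 * 4) = 128 sets (assuming 4-way L1 per the TRM)
for CPU in 0 1 2 3; do
	NODE="/cpus/cpu@$CPU"
	fdtput -p -t i /flash/dtb.img "$NODE" d-cache-line-size 64
	fdtput -p -t i /flash/dtb.img "$NODE" d-cache-sets 128
	fdtput -p -t i /flash/dtb.img "$NODE" i-cache-line-size 64
	fdtput -p -t i /flash/dtb.img "$NODE" i-cache-sets 128
done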
Completely understandable, thanks for all your work
I know I’m replying to myself…
@doppingkoala what I was saying was that you can trick the scheduler into thinking you have 4 separate CPU dies in separate BGA packages joined by copper traces. All you gotta do is create one cluster per CPU. I want to mimic this kind of process affinity for single-threaded operations at a system level, not a process level.
What will this achieve? The scheduler will become hesitant to bump processes between cores and will only spread work for truly multi-threaded applications.
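In the dtb that would look roughly like the cpu-map below (a sketch; the cpu0..cpu3 labels are assumed to match the dtb’s existing cpu node labels):

cpus {
	cpu-map {
		cluster0 {
			core0 { cpu = <&cpu0>; };
		};
		cluster1 {
			core0 { cpu = <&cpu1>; };
		};
		cluster2 {
			core0 { cpu = <&cpu2>; };
		};
		cluster3 {
			core0 { cpu = <&cpu3>; };
		};
	};
};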