S922X Ugoos AM6B Device Tree - Performance/Efficiency - Testing Needed

After the dtb. I haven’t tested anything yet; I will tomorrow.

CoreELEC:~ # lscpu
Architecture:           aarch64
  Byte Order:           Little Endian
CPU(s):                 6
  On-line CPU(s) list:  0-5
Vendor ID:              ARM
  Model name:           Cortex-A53
    Model:              4
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s):          1
    Stepping:           r0p4
    CPU(s) scaling MHz: 100%
    CPU max MHz:        1908.0000
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32
  Model name:           Cortex-A73
    Model:              2
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r0p2
    CPU(s) scaling MHz: 100%
    CPU max MHz:        2208.0000
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32
Caches (sum of all):
  L1d:                  192 KiB (6 instances)
  L1i:                  320 KiB (6 instances)
  L2:                   1.3 MiB (2 instances)
CoreELEC:~ #

@Boulder Buttery smooth playback of 1080p at low CPU usage. Check the screenshots below.

Test file of AV1 - Sintel_1080_10s_20MB.mp4


I compiled libdav1d.so.7.0.0 with some non-standard, possibly dangerous gcc flags along with -mtune=cortex-a73.cortex-a53. It did not make enough of a difference, because dav1d already uses hand-written assembly.

My old file:

AM6B+ eMMC install:

Architecture:           aarch64
  Byte Order:           Little Endian
CPU(s):                 6
  On-line CPU(s) list:  0-5
Vendor ID:              ARM
  Model name:           Cortex-A53
    Model:              4
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s):          1
    Stepping:           r0p4
    CPU(s) scaling MHz: 100%
    CPU max MHz:        1800.0000
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32
  Model name:           Cortex-A73
    Model:              2
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r0p2
    CPU(s) scaling MHz: 100%
    CPU max MHz:        2208.0000
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32
Caches (sum of all):
  L1d:                  192 KiB (6 instances)
  L1i:                  320 KiB (6 instances)
  L2:                   1.3 MiB (2 instances)

Several hours of use after a day of having the updated dtb. Still haven’t noticed anything broken, no crashes, and the logs seem as clean as normal.

Some anecdotal numbers. Boot, feels maybe 5-10% faster. Initial skin (Nimbus) load, feels 10-20% faster. Initial widget loading, 20-30% faster. Subsequent widget loading, 30-40% faster. Addon activity :zipper_mouth_face:, feels about the same with my settings, but it wasn’t noticeably slow to me before. Clicking around menus, feels 20-30% faster. Menus in addons like YouTube feel particularly faster.

Playback seems smooth. Playing a big 100GB remux (2160p, h265, DTS HD):

top - 03:57:54 up 23:20,  1 user,  load average: 0.88, 0.93, 0.94
Tasks: 177 total,   1 running, 176 sleeping,   0 stopped,   0 zombie
%Cpu0  : 16.6 us,  7.0 sy,  0.7 ni, 75.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  7.7 us,  2.7 sy,  0.3 ni, 89.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.0 us,  0.7 sy,  0.0 ni, 99.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  0.0 us,  0.7 sy,  0.3 ni, 99.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3799.5 total,   1268.8 free,   1937.1 used,    672.7 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   1862.4 avail Mem

Clicking around menus and widgets:

top - 04:08:36 up 23:30,  1 user,  load average: 1.43, 1.08, 1.03
Tasks: 176 total,   2 running, 174 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.7 us,  5.5 sy, 34.3 ni, 59.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  2.7 us, 11.1 sy, 29.7 ni, 56.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 22.8 us,  3.6 sy,  3.6 ni, 70.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 15.8 us,  2.0 sy,  6.4 ni, 75.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  8.6 us,  1.7 sy,  8.3 ni, 81.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 17.4 us,  2.0 sy,  3.7 ni, 76.8 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3799.5 total,   2017.3 free,   1186.3 used,    675.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   2613.3 avail Mem

Live stream on YouTube addon (720p, h264):

top - 04:13:38 up 23:35,  1 user,  load average: 2.08, 1.55, 1.21
Tasks: 181 total,   1 running, 180 sleeping,   0 stopped,   0 zombie
%Cpu0  :  0.0 us,  1.3 sy,  0.0 ni, 98.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 18.7 us,  6.7 sy,  0.0 ni, 74.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 32.9 us, 11.0 sy,  0.0 ni, 56.1 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  : 23.2 us, 11.3 sy,  0.3 ni, 65.2 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  1.3 us,  2.0 sy,  0.3 ni, 96.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :   3799.5 total,   2277.0 free,    646.6 used,    955.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.   3153.0 avail Mem

Pretty happy with this change. On a spectrum where Kodi on Android on a Shield is a 0 and Kodi on x86 on something like an i5 NUC is a 10, interface speed was previously about a 4; this is about a 6+. Continues to be well done! If there’s more empirical data that would be useful, definitely shout it out; happy to help provide where I can.

Interesting to see that YouTube is using A73 cluster at a 720p resolution. Before your post I’d have bet money that H264 is hardware accelerated so it would not use much CPU and would use A53, but I’d have lost that money.

It is safe to assume there must be other things in CoreELEC which are also benefiting from this change.

Can these improvements apply to the SEI S905X4 boxes?

Short answer: it depends, gotta experiment to form an opinion and also to see if it is worth it. Have no inclination positive or negative on it.

Longer answer:
Theoretically it is a single cluster so there is no bumping of processes between clusters. Scheduler is free to put the process anywhere it wants.

However, exposing cache information (size, line size, sets, etc.) to userspace would benefit programs that use __builtin_prefetch, if the programmers coded them to take it into account.
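To make "exposing cache information to userspace" a bit more concrete, here is a minimal Python sketch that reads the per-CPU cache attributes Linux publishes under sysfs. The file names (`level`, `type`, `size`, `coherency_line_size`, `number_of_sets`, `ways_of_associativity`) are the standard cacheinfo attributes; the helper name `read_cache_info` is made up for illustration, and which attributes actually appear depends on the kernel and the DTB.

```python
from pathlib import Path

# Standard cacheinfo attribute files; some may be absent on a given kernel/DTB.
FIELDS = ("level", "type", "size", "coherency_line_size",
          "number_of_sets", "ways_of_associativity")

def read_cache_info(cpu_dir):
    """Return one dict per cache/index* directory found under cpu_dir."""
    caches = []
    for index in sorted(Path(cpu_dir, "cache").glob("index*")):
        entry = {}
        for field in FIELDS:
            attr = index / field
            if attr.is_file():
                entry[field] = attr.read_text().strip()
        caches.append(entry)
    return caches

if __name__ == "__main__":
    # cpu0 is only an example; repeat per CPU to compare the two clusters.
    for cache in read_cache_info("/sys/devices/system/cpu/cpu0"):
        print(cache)
```

On a box without the cache nodes in the device tree, some of these entries may simply be missing, which is roughly what a userspace program would see as well.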

The kernel would detect the separate L1 data and L1 instruction caches by itself, and then we would define a unified L2 that holds both data and instructions. The kernel would detect the size like it does for the S922X. I have not done enough research on the 905X4 to say what it could do. Architecturally the A55 is similar to the A53, with an in-order execution pipeline. Basically lots of ifs, buts, etc.

To make the most out of an A55 core I’d propose using the Latency-criticality Aware Virtual Deadline (LAVD) scheduler. With that we could use memory latency data, which benefits “small” programs. Once you reach large programs that benefit goes away and the CPU is saturated.

Edit
@frodo19 there is potentially a way to find out if cache information helps speed up other boxes. This is mostly an inference exercise. If someone is willing to install the -A53-Cache DTB and report back on the speed-up compared to the CE default, we’d have information and could then figure out how to make the A55 in the 905X4 behave the same way.

Edit 2:

Dead end. Do not experiment.

Unfortunately I have been so used to the full DTB file that my opinion would be biased.

@frodo19 do you have a 905X4 box?

Yes, I have some.

Your dtb on an old Ugoos AM6 Plus (the non-B device from 2021) with an Amlogic S922X-J rev. B CPU, running from SD card:

CoreELEC:~ # lscpu
Architecture:           aarch64
  Byte Order:           Little Endian
CPU(s):                 6
  On-line CPU(s) list:  0-5
Vendor ID:              ARM
  Model name:           Cortex-A53
    Model:              4
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s):          1
    Stepping:           r0p4
    CPU(s) scaling MHz: 37%
    CPU max MHz:        1800.0000
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32
  Model name:           Cortex-A73
    Model:              2
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r0p2
    CPU(s) scaling MHz: 30%
    CPU max MHz:        2208.0000
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32
Caches (sum of all):
  L1d:                  192 KiB (6 instances)
  L1i:                  320 KiB (6 instances)
  L2:                   1.3 MiB (2 instances)

Can you run this on the AM6+?

lscpu --caches

I’m happy to see we are getting more testers with devices that I do not own. Hoping I can get enough testers to prove the changes are robust.

AM6 Plus:

CoreELEC:~ # lscpu --caches
NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL SETS PHY-LINE COHERENCY-SIZE
L1d       32K     192K    4 Data            1  128                      64
L1i       32K     320K    2 Instruction     1  256                      64
L2       256K     1.3M   16 Unified         2  256                      64

Thanks!

Please let us know how much of a difference you notice in speed. Best template so far is by @rmkjr.

I’m curious to see how much the SD card IO hampers the performance improvements. As in, did the SD card make the difference smaller because the CPU is waiting on SDIO?

Personally, I feel better performance with Auto Frame Rate. The temperature is also lower (34°C instead of 37°C on the standard DTB from CE), though this needs to be tested over a longer time. The GUI is smooth even with the “ondemand” CPU governor. Sorry, I don’t know how the tables from @rmkjr were created. My SanDisk SD card is 64 GB and A2-rated, for better performance when running apps from the SD card. I tried RetroArch with Flycast, and with the “ondemand” governor there was no sound stuttering, unlike with the standard dtb. But for good gameplay I recommend always using the “performance” governor. Temperatures are awesome!

I posted this on the Kodi forums in the best hardware thread:
https://forum.kodi.tv/showthread.php?tid=376035&page=52

I also created a new thread, since I believe this is fantastic news. Congratulations to MasterKeyxda!
https://forum.kodi.tv/showthread.php?tid=378911

I expect a good handful will report back, since they are knowledgeable about hardware.

I also posted this on a subreddit and might get some testers from there too.
Great work, and really excited for this!

Thank you, I’m trying to understand the added properties. Where do the values for capacity-dmips-mhz come from? I found this post from a few years ago that suggested adding this property to the S922X/A311D dtb, but it has a different A53 value, and I’m not sure where that came from either.

NAME ONE-SIZE ALL-SIZE WAYS TYPE        LEVEL SETS PHY-LINE COHERENCY-SIZE
L2       256K     1.3M   16 Unified         2  256                      64

If the total L2 cache size is 1.3M, doesn’t that suggest 1 MB of L2 for the A73 cluster and 256 KB for the A53 cluster (1 MB + 256 KB = 1.25 MB, rounded to 1.3 MB)? In that case this would be:

l2_0: l2-cache0 {
    compatible = "cache";
    cache-level = <2>;
    cache-unified;
    cache-size = <0x40000>;  // 256KB
    cache-line-size = <64>;
    cache-sets = <512>;
};

The exact values might not matter, but if they can be more accurate, they may as well be.
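A quick way to sanity-check numbers like these is the usual set-associative relation, size = sets × ways × line size. The sketch below (Python, helper names invented for illustration) applies it to the `lscpu --caches` rows posted above and to the values in the snippet. It only checks that the figures are self-consistent; it says nothing about what the hardware really has.

```python
def cache_size_bytes(sets, ways, line_size):
    """Total size of a set-associative cache in bytes."""
    return sets * ways * line_size

def implied_ways(size_bytes, sets, line_size):
    """Associativity implied by a given size, set count and line size."""
    return size_bytes // (sets * line_size)

# (sets, ways, line size) copied from the lscpu --caches output in this thread
lscpu_rows = {
    "L1d": (128, 4, 64),
    "L1i": (256, 2, 64),
    "L2": (256, 16, 64),
}

for name, (sets, ways, line) in lscpu_rows.items():
    print(name, cache_size_bytes(sets, ways, line) // 1024, "KiB")

# Values from the l2-cache0 snippet: 0x40000 bytes, 64-byte lines, 512 sets
print("ways implied by the snippet:", implied_ways(0x40000, 512, 64))

# Set count that would match 16 ways (the WAYS figure lscpu reports for L2)
print("sets for 16 ways:", 0x40000 // (16 * 64))
```

All three lscpu rows are consistent with their reported sizes. With cache-size 0x40000 and 512 sets the snippet implies 8 ways, whereas the 16 ways lscpu reports would correspond to 256 sets, so if cache-size is changed it may be worth revisiting cache-sets as well.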

The DMIPS calculation is from an ARM reference document for device trees. I forget the link where I got it from, but it is a calculation based on energy efficiency and the amount of work a core can do at a given frequency in MHz (core agnostic, because we create a non-dimensional number for work).

Yes, technically 256 KB is the right value, but as commented and as seen in the posts above, the boot-time detection finds the right amount anyway. That’s why I added a comment saying we don’t have to be 100% accurate.

Doing it the “incorrect” way on purpose is my method of leaving breadcrumbs for future developers. I don’t own a CoreELEC 905X4 box or a 928X or a 905Y4. If someone with one of those devices wants to experiment, they can take the hint that, incorrect or not, the cache will still show up.

I am barely scratching the surface with this. I’m not sure what the impact of incorrect value is (negative or positive). I’m happy to correct it if it brings benefits. But if not I’d like to leave hints for other developers that sometimes perfect is the enemy of good.

It is not every day that you find a 25% speed improvement across a whole system. What if someone with one of those other devices gets an improvement in their system too? I don’t want to rob them of their chance to make CE as good as it can be.

Edit: for explicit clarity, I have put in a value of 0x80000, which is 512 KB and absolutely incorrect.

According to this, the S928X has 20-25% better single-core performance, but the S922X has 10-15% better multi-core performance.

You’ve basically made the S922X more powerful all round than the (previous) S928X, without spending a penny!
Of course the S928X will also get the speed bump, so it’s still ahead, but think about it. If I wanted that kind of performance I would have had to pay for an S928X; now I get that performance for free!
It’s insane that you achieved this! I still can’t believe that after 5 years of owning a box, it now has even more features! DV, a 25% performance boost! Next you’ll be adding 4GB RAM with a software update!

CC: @YadaYada

Here’s how I calculate DMIPS (from memory, I can’t find the original source PDF anymore):

  1. Find a reference DMIPS. Can be from other DTBs or an approximation.
  2. Another way is to find work from the energy dtsi files. This is what I’ve done

Use the equation:
DMIPS = (highest energy work value of current core) / ([frequency in MHz of current core] / [frequency in MHz of fastest core])

Example:

For separate clusters:
DMIPS little = 611 / (1800/2200)
DMIPS big = 1192 / (2200/2200)

For a single cluster:
DMIPS = 592 / 1 = 592

The kernel scales to a max of 1024 at boot time by itself. No need to worry about scaling. Just get the ratios right.
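To put numbers on the example above, here is a small Python sketch of the same formula, plus the normalisation step as I understand it: the kernel multiplies each capacity-dmips-mhz value by that CPU’s maximum frequency and scales the result so the largest comes out as 1024. The function names are made up, and the kernel uses integer math, so real values may differ by a unit or so.

```python
def dmips_mhz(work_value, core_mhz, fastest_mhz):
    """Formula from the post: work value / (core frequency / fastest frequency)."""
    return work_value / (core_mhz / fastest_mhz)

def normalized_capacities(cpus):
    """cpus maps a name to (capacity_dmips_mhz, max_mhz).

    Approximates the boot-time scaling: raw = dmips_mhz * max_mhz,
    then rescale so the largest raw value becomes 1024.
    """
    raw = {name: d * mhz for name, (d, mhz) in cpus.items()}
    top = max(raw.values())
    return {name: round(1024 * value / top) for name, value in raw.items()}

little = dmips_mhz(611, 1800, 2200)   # roughly 746.8
big = dmips_mhz(1192, 2200, 2200)     # 1192.0

caps = normalized_capacities({
    "little": (little, 1800),
    "big": (big, 2200),
})
print(round(little, 1), big, caps)
```

With these inputs the little cores end up at about half the capacity of the big cores (roughly 525 vs 1024), which is the same ratio as 611 to 1192. That is why only the ratios need to be right.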

Do we have any kernel experts here that I can bounce ideas off? I have some dumbass ideas that might work.

Leave them here, different timezones. People will see it approximately 8 hours from now.