S922X Ugoos AM6B Device Tree - Performance/Efficiency - Testing Needed

MasterKeyxda · 24 September 2024 16:40

If the response to this effort at a performance gain is positive, I’ll share DTB files for any and all S922X devices.

According to one post above, improvements can range from 16-25%. I need more people to report, because a sample size of 1-4 is not conclusive.

On the other hand, we can also experiment with the ondemand governor to keep CPU temps low.
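
For anyone who wants to try the governor idea, here is a minimal sketch. It assumes the kernel exposes one cpufreq policy directory per cluster under /sys/devices/system/cpu/cpufreq/ and that ondemand is actually built into the running kernel; neither has been confirmed in this thread. The function name set_governor and its optional second argument (an alternative sysfs root for dry runs) are made up for illustration.

```shell
# Switch every cpufreq policy to the requested governor, if it is offered.
# $1 = governor name, $2 = sysfs root (defaults to /sys; override for dry runs)
set_governor() {
    gov="$1"
    sys_root="${2:-/sys}"
    for policy in "$sys_root"/devices/system/cpu/cpufreq/policy[0-9]*; do
        [ -d "$policy" ] || continue
        # Only write the governor if this policy lists it as available.
        if grep -qw "$gov" "$policy/scaling_available_governors"; then
            echo "$gov" > "$policy/scaling_governor"
        else
            echo "$gov not available for ${policy##*/}" >&2
        fi
    done
}
```

Calling `set_governor ondemand` as root would apply it to both clusters; the setting is written to sysfs only, so it should not be expected to survive a reboot.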

With a defined cache we can enable prefetching for SW decoders so that more data is kept in L2, since the L1d cache is 32 KB but L1i is 64 KB. That would let us reduce the DRAM penalty.

The core cluster bumps create a very heavy penalty for performance because that involves DRAM.

vpeter · 24 September 2024 16:59

Why not send a PR to CE so that everyone benefits by default without messing with files?

jamal2367 · 24 September 2024 17:02

@MasterKeyxda please make a pull request on GitHub. As @vpeter already wrote, everyone would benefit.

MasterKeyxda · 24 September 2024 17:05

I’m happy to do it! @vpeter

I want this to be robust and well tested. Might take a couple days for all the reports to come in.

I’ll do a PR for all S922X.

jamal2367 · 24 September 2024 17:05

If this were included in future nightlies, the test response would be much higher: many more people would test it than through this thread.

jonnypuma · 24 September 2024 17:08

EDIT - rome1931 pointed out my stupid mistake

I got an error on the mv command:
CoreELECbedroom:~ # mv dtb.img dtb_o.img
mv: can’t rename ‘dtb.img’: No such file or directory

But it worked overwriting the dtb with WinSCP after the mount command.

rome1931 · 24 September 2024 17:16

Are you in the /flash directory?

jonnypuma · 24 September 2024 17:18

D’oh, you’re right. I wasn’t. My bad
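
The backup-and-replace steps discussed above can be sketched as a small function that does not depend on the current directory. This is only a sketch: the function name install_dtb and its optional second argument are hypothetical, and it assumes the active device tree is /flash/dtb.img and that /flash has already been made writable with the mount command mentioned earlier.

```shell
# Back up the current dtb.img once, then copy the new DTB over it.
# $1 = path to the new .dtb file, $2 = flash directory (defaults to /flash)
install_dtb() {
    new_dtb="$1"
    flash_dir="${2:-/flash}"

    [ -f "$new_dtb" ] || { echo "missing: $new_dtb" >&2; return 1; }
    [ -f "$flash_dir/dtb.img" ] || { echo "no dtb.img in $flash_dir" >&2; return 1; }

    # Keep the first backup: never overwrite an existing dtb_o.img.
    [ -f "$flash_dir/dtb_o.img" ] || cp "$flash_dir/dtb.img" "$flash_dir/dtb_o.img" || return 1

    cp "$new_dtb" "$flash_dir/dtb.img"
}
```

Because the paths are absolute, the "No such file or directory" error from running mv outside /flash cannot occur, and the original DTB stays available as dtb_o.img for rolling back.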

YadaYada · 24 September 2024 17:32

I think you are right about the separate L2 caches for the two clusters. The wording in the S922X datasheet is just clumsy. The A311D2 datasheet is clearer:

The main system CPU is based on Big.LITTLE architecture which integrates a quad-core ARM Cortex-A73 CPU cluster and a quad-core Cortex-A53 cluster with unified L2 cache for each cluster to improve system performance. In addition, the CPU includes the NEON SIMD co-processor to improve software media processing capability.

And also

Unified system L2 cache for each cluster

The same optimization would also benefit other big.LITTLE SoCs like the A311D2, and maybe the S928X (no datasheet available yet). But again, the A311D2 datasheet doesn’t specify the L2 cache sizes.

MasterKeyxda · 24 September 2024 17:47

Yep. I’m impressed by the Linux runtime/boot time detection. I found that I can even declare a 1 MB cache, which is absolutely wrong, and it still only displays 256 KB.

Caution: this only applies to CPUs with an integrated L1/L2 cache controller. We cannot just blindly put this out for every CPU model out there. Most processors from the A53 onward have an integrated cache controller.

There might be a benefit on the shared-cache A55 in the S905X4, because defining a unified cache may allow for better prefetching.

Basically, for other SoCs I can declare a 3 MB cache and it’ll still find the right amount. We just need to identify specifically whether the SoC has a dual cache cluster or a single one.
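
To check what the kernel actually detected, and whether L2 is reported per cluster, the standard Linux cacheinfo files in sysfs can be listed. A minimal sketch follows; the function name show_caches and its optional sysfs-root argument are made up for illustration, and whether these files are populated on a given CE kernel is an assumption that needs checking on the box itself.

```shell
# Print level, type, size and sharing CPUs for every cache the kernel reports.
# $1 = sysfs root (defaults to /sys; override only for dry runs)
show_caches() {
    sys_root="${1:-/sys}"
    for idx in "$sys_root"/devices/system/cpu/cpu[0-9]*/cache/index[0-9]*; do
        [ -d "$idx" ] || continue
        cpu=${idx%/cache/*}   # strip the trailing /cache/indexN
        cpu=${cpu##*/}        # keep only the cpuN component
        printf '%s L%s %s %s shared_with=%s\n' "$cpu" \
            "$(cat "$idx/level")" "$(cat "$idx/type")" \
            "$(cat "$idx/size")" "$(cat "$idx/shared_cpu_list")"
    done
}
```

On a dual-cluster SoC, the L2 lines for the A53 cores and for the A73 cores would be expected to show different shared_with lists; a single list covering every core would point to one shared cache instead.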

xmlcom · 24 September 2024 20:03

How do we put this in the cube?

freddy · 24 September 2024 20:52

Does this apply only if CoreELEC is installed on internal storage?

I am testing the new DTB on an external uSD.

YadaYada · 24 September 2024 20:54

When MasterKeyxda posts the short dts, I can update the cube dtb.

MasterKeyxda · 24 September 2024 21:07

I’ve updated mesong12b.dtsi which is used by all g12b devices. Any device using this dtsi will benefit from this improvement.

If the cube uses g12b, post which DTB file you use and I’ll upload it here.

Next steps:

1. Find all core dtsi files that use the S922X.
2. Update them all with cache info to propagate this benefit.
3. Create a PR for kernel 4.9.20.
4. Ask for the CE team’s help to propagate that change into the newer 5.15 and future kernels. If they are too busy, do it the manual, dirty way with a PR for each kernel.

I need your (the community’s) help with step 1.

saxman · 24 September 2024 21:44

Can you make one for odroid n2+ ?
g12b_s922x_odroid_n2.dtb

MasterKeyxda · 24 September 2024 22:10

Can you post the output of lscpu before the DTB? Trying to chase another hypothesis.

saxman · 24 September 2024 22:14

CoreELEC:~ # lscpu
Architecture:           aarch64
  Byte Order:           Little Endian
CPU(s):                 6
  On-line CPU(s) list:  0-5
Vendor ID:              ARM
  Model name:           Cortex-A53
    Model:              4
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s):          1
    Stepping:           r0p4
    CPU(s) scaling MHz: 100%
    CPU max MHz:        1908.0000
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32
  Model name:           Cortex-A73
    Model:              2
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r0p2
    CPU(s) scaling MHz: 100%
    CPU max MHz:        2208.0000
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32

YadaYada · 24 September 2024 22:17

I’d also like to try changing some of the node properties. Can you post the mesong12b.dtsi? That would save some time on edits.

MasterKeyxda · 24 September 2024 23:10

@YadaYada linux-amlogic-masterkeyxda/arch/arm64/boot/dts/amlogic/mesong12b.dtsi at amlogic-4.9-20 · MasterKeyxda/linux-amlogic-masterkeyxda · GitHub

MasterKeyxda · 24 September 2024 23:22

All DTB files with g12b. Please report back with your findings of lscpu and cpu usage. Data and metrics are preferred.

For example: software decoding of VC-1 used to consume 50% CPU across cores 0-5, but after the update it consumes 20% on cores 0-1 and 65% on cores 2-5. Data like that is ideal if you are able to get it.
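
One way to collect per-core numbers like these without perf is to diff two /proc/stat snapshots. The sketch below assumes awk is available on the box (BusyBox normally provides it); the function name core_usage and the snapshot file names are only illustrative.

```shell
# Per-core busy percentage between two /proc/stat snapshots.
# $1 = snapshot taken before, $2 = snapshot taken after
core_usage() {
    awk '
        /^cpu[0-9]+ / {
            total = 0
            for (i = 2; i <= NF && i <= 9; i++) total += $i   # user .. steal
            idle = $5 + $6                                    # idle + iowait
            if (FNR == NR) { t0[$1] = total; i0[$1] = idle; next }
            dt = total - t0[$1]
            di = idle - i0[$1]
            if (dt > 0) printf "%s %.0f%%\n", $1, 100 * (dt - di) / dt
        }
    ' "$1" "$2"
}

# Example run while a video is playing:
#   cat /proc/stat > /tmp/stat.before; sleep 10; cat /proc/stat > /tmp/stat.after
#   core_usage /tmp/stat.before /tmp/stat.after
```

Running it once with the old DTB and once with the new one, on the same clip and over the same interval, gives a before/after comparison per core.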

Complete quantified data can be had if there are perf binaries compiled for CE-NG.

cc: @saxman can you post lscpu output after this update
g12b_bananapi_m2s.dtb (37.4 KB)
g12b_s922x_beelink_gs_king_x.dtb (36.9 KB)
g12b_s922x_beelink_gt_king.dtb (36.6 KB)
g12b_s922x_odroid_n2.dtb (53.8 KB)
g12b_s922x_ugoos_am6.dtb (74.1 KB)
g12b_s922x_ugoos_am6b_all.dtb (74.1 KB)

Attachment removed! It will be available in nightly build!

I have everything ready on GitHub. As soon as I have enough confidence, I will create a PR so it can be pushed to the nightlies.