S922X Ugoos AM6B Device Tree - Performance/Efficiency - Testing Needed

If the response to this performance effort is positive, I’ll share DTB files for all S922X devices.

One post above reports improvements in the range of 16-25%. I need more people to report, because a sample size of 1-4 is not conclusive.

On the other hand, we can also experiment with the ondemand governor to keep CPU temps low.

With a defined cache we can enable prefetching for SW decoders so that more data is kept in L2, since the L1d cache is only 32 KB while L1i is 64 KB. That would reduce the DRAM penalty.

The core cluster bumps carry a very heavy performance penalty because they involve DRAM.


Why not send a PR to CE, so that everyone benefits by default without messing with files?


@MasterKeyxda please make a pull request on GitHub. As @vpeter already wrote, everyone would benefit.

I’m happy to do it! @vpeter

I want this to be robust and well tested. Might take a couple days for all the reports to come in.

I’ll do a PR for all S922X.


If this were included in future nightlies, the test response would be much higher, and many more people would test it than through this thread.

EDIT - rome1931 pointed out my stupid mistake :)

I got an error on the mv command:
CoreELECbedroom:~ # mv dtb.img dtb_o.img
mv: can’t rename ‘dtb.img’: No such file or directory

But overwriting the dtb with WinSCP worked after running the mount command.

Are you in the /flash directory?


D’oh, you’re right. I wasn’t. My bad

I think you are right about the separate L2 caches for the two clusters. The wording in the S922X datasheet is just clumsy. The A311D2 datasheet is clearer:

The main system CPU is based on Big.LITTLE architecture which integrates a quad-core ARM Cortex-A73 CPU cluster and a quad-core Cortex-A53 cluster with unified L2 cache for each cluster to improve system performance. In addition, the CPU includes the NEON SIMD co-processor to improve software media processing capability.

And also

  • Unified system L2 cache for each cluster

The same optimization would also benefit other big.LITTLE SoCs like the A311D2, and maybe the S928X (no datasheet available yet). But again, the A311D2 datasheet doesn’t specify the L2 cache sizes.

Yep. I’m impressed by the Linux runtime/boot-time detection. I found that I can even declare a 1 MB cache, which is absolutely wrong, and it still only displays 256 KB.

Caution: this only applies to CPUs with an integrated L1/L2 cache controller, so it cannot be blindly rolled out to every CPU model out there. Most cores from the A53 onward have an integrated cache controller.

There might also be a benefit on the A55-based S905X4 with its shared cache, because defining a unified cache may allow for better prefetching.

Basically, for other SoCs I can declare a 3 MB cache and the kernel will still find the right amount. I just need to identify specifically whether the SoC has a dual cache cluster or a single one.
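For anyone who wants to check what the kernel actually detected, and whether the L2 is split per cluster or shared, here is a minimal Python sketch that reads the standard Linux sysfs cache entries. It assumes a Python 3 interpreter is available on the box and that the kernel exposes `/sys/devices/system/cpu/cpuN/cache/indexM/`; the function names are made up for illustration and are not part of CoreELEC.

```python
import glob
import os


def _read_attr(index_dir, name, default="?"):
    """Read one sysfs attribute; return a placeholder if the kernel omits it."""
    try:
        with open(os.path.join(index_dir, name)) as f:
            return f.read().strip()
    except OSError:
        return default


def read_cache_topology(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: [(level, type, size, shared_cpu_list), ...]}."""
    topology = {}
    # "cpu[0-9]*" matches cpu0, cpu1, ... but not cpufreq or cpuidle.
    for cpu_dir in sorted(glob.glob(os.path.join(sysfs_root, "cpu[0-9]*"))):
        entries = []
        pattern = os.path.join(cpu_dir, "cache", "index[0-9]*")
        for index_dir in sorted(glob.glob(pattern)):
            entries.append((
                _read_attr(index_dir, "level"),
                _read_attr(index_dir, "type"),
                _read_attr(index_dir, "size"),
                _read_attr(index_dir, "shared_cpu_list"),
            ))
        topology[os.path.basename(cpu_dir)] = entries
    return topology


def l2_sharing_groups(topology):
    """Return the distinct shared_cpu_list values reported for level-2 caches."""
    groups = set()
    for entries in topology.values():
        for level, _ctype, _size, shared in entries:
            if level == "2":
                groups.add(shared)
    return sorted(groups)


if __name__ == "__main__":
    topo = read_cache_topology()
    for cpu, entries in topo.items():
        for level, ctype, size, shared in entries:
            print(f"{cpu}: L{level} {ctype} {size} shared with CPUs {shared}")
    print("L2 sharing groups:", l2_sharing_groups(topo))
```

Two distinct L2 sharing groups would point to a dual cache cluster, a single group to one shared cache. If the cache directories are missing entirely, the kernel has no cache info for that CPU.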


How do we put this in the cube?

Does this apply only if CoreELEC is installed on internal storage?

I am testing the new DTB on an external uSD.

When MasterKeyxda posts the short dts, I can update the cube dtb.

I’ve updated mesong12b.dtsi which is used by all g12b devices. Any device using this dtsi will benefit from this improvement.

If the cube uses g12b, you can post which DTB file you use and I’ll upload it here.

Next steps:

  1. Find all core dtsi that utilize s922x
  2. Update them all with cache info to propagate this benefit
  3. Create a PR for kernel 4.9.20
  4. Ask for the CE team’s help to propagate that change into the newer 5.15 and future kernels. If they are too busy, then do it the manual, dirty way with a PR for each kernel.

I need your (community’s) help with step 1.


Can you make one for odroid n2+ ?
g12b_s922x_odroid_n2.dtb


Can you post the output of lscpu before the DTB? Trying to chase another hypothesis.

CoreELEC:~ # lscpu
Architecture:           aarch64
  Byte Order:           Little Endian
CPU(s):                 6
  On-line CPU(s) list:  0-5
Vendor ID:              ARM
  Model name:           Cortex-A53
    Model:              4
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s):          1
    Stepping:           r0p4
    CPU(s) scaling MHz: 100%
    CPU max MHz:        1908.0000
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32
  Model name:           Cortex-A73
    Model:              2
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r0p2
    CPU(s) scaling MHz: 100%
    CPU max MHz:        2208.0000
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32

I’d also like to try changing some of the node properties. Can you post the mesong12b.dtsi? That would save some time on edits.

@YadaYada linux-amlogic-masterkeyxda/arch/arm64/boot/dts/amlogic/mesong12b.dtsi at amlogic-4.9-20 · MasterKeyxda/linux-amlogic-masterkeyxda · GitHub


Here are all the g12b DTB files. Please report back with your lscpu output and CPU usage. Data and metrics are preferred.

For example: software decoding of VC-1 used to consume 50% CPU across cores 0-5, but after the update it consumes 20% on cores 0-1 and 65% on cores 2-5. A report like this is ideal if you are able to collect that kind of data.
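If you don’t have perf, per-core numbers like these can be derived from two snapshots of `/proc/stat`. Below is a rough Python sketch of that calculation, assuming Python 3 is available on the device; the function names are invented for this post and the 5 second interval is an arbitrary choice.

```python
import time


def parse_proc_stat(text):
    """Map 'cpuN' -> (busy_jiffies, total_jiffies) from /proc/stat content."""
    samples = {}
    for line in text.splitlines():
        fields = line.split()
        # Per-core lines look like 'cpu0 ...'; skip the aggregate 'cpu' line.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # Columns: user nice system idle iowait irq softirq steal.
        # Guest time is already counted inside user, so it is left out.
        values = [int(v) for v in fields[1:9]]
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        total = sum(values)
        samples[fields[0]] = (total - idle, total)
    return samples


def per_core_usage(before, after):
    """Return {core: percent busy} between two parse_proc_stat() samples."""
    usage = {}
    for core, (busy_after, total_after) in after.items():
        if core not in before:
            continue
        busy_before, total_before = before[core]
        elapsed = total_after - total_before
        if elapsed > 0:
            usage[core] = 100.0 * (busy_after - busy_before) / elapsed
        else:
            usage[core] = 0.0
    return usage


def sample_usage(interval=5.0, path="/proc/stat"):
    """Take two snapshots 'interval' seconds apart and return per-core usage."""
    with open(path) as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_proc_stat(f.read())
    return per_core_usage(before, after)
```

Start playback of the test file, call `sample_usage()` while it is running, and post the per-core percentages together with the lscpu output, once with the old DTB and once with the new one.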

Complete quantified data can be had if there are perf binaries compiled for CE-NG.

cc: @saxman can you post the lscpu output after this update?
g12b_bananapi_m2s.dtb (37.4 KB)
g12b_s922x_beelink_gs_king_x.dtb (36.9 KB)
g12b_s922x_beelink_gt_king.dtb (36.6 KB)
g12b_s922x_odroid_n2.dtb (53.8 KB)
g12b_s922x_ugoos_am6.dtb (74.1 KB)
g12b_s922x_ugoos_am6b_all.dtb (74.1 KB)

Attachment removed! It will be available in nightly build!

I have everything ready on GitHub. As soon as I have enough confidence, I will create a PR and it will be pushed to the nightlies.
