S922X Ugoos AM6B Device Tree - Performance/Efficiency - Testing Needed

I have been noticing some hiccups on my Ugoos AM6B and have been investigating them closely. I saw that the Linux scheduler was moving processes around from CPU#0 in cluster 0 to CPU#X in cluster 1. I got deep into the weeds and found that lscpu doesn't show any cache information.

I read the ARM documentation, the S922X datasheet, and the arch/arm64/include/xxxx files to understand how cache detection happens. I was seeing VIPT cache detection on all cores 0-5, but then dmesg showed the error "no cache hierarchy found for cpu 0".
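
For anyone who wants to reproduce the "before" state, this is roughly how I checked (a sketch; the exact sysfs layout can vary by kernel version):

dmesg | grep -i cache                    # look for the VIPT / cache hierarchy messages
ls /sys/devices/system/cpu/cpu0/cache/   # without cache nodes in the dtb the index*
                                         # directories are missing, which is why lscpu
                                         # prints no cache information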

Anyway, enough of the backstory. I wondered if there are programs running in CoreELEC that would benefit from better scheduling decisions by the Linux scheduler. I am still searching for an answer to this.

Regardless of whether there is a benefit or not, I created a new dtb file and, lo and behold, the Linux kernel can now do runtime detection of the cache hierarchy:

CoreELEC:~ # lscpu
Architecture:           aarch64
  Byte Order:           Little Endian
CPU(s):                 6
  On-line CPU(s) list:  0-5
Vendor ID:              ARM
  Model name:           Cortex-A53
    Model:              4
    Thread(s) per core: 1
    Core(s) per socket: 2
    Socket(s):          1
    Stepping:           r0p4
    CPU(s) scaling MHz: 100%
    CPU max MHz:        1800.0001
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32
  Model name:           Cortex-A73
    Model:              2
    Thread(s) per core: 1
    Core(s) per socket: 4
    Socket(s):          1
    Stepping:           r0p2
    CPU(s) scaling MHz: 100%
    CPU max MHz:        2208.0000
    CPU min MHz:        500.0000
    BogoMIPS:           48.00
    Flags:              fp asimd evtstrm aes pmull sha1 sha2 crc32
Caches (sum of all):
  L1d:                  192 KiB (6 instances)
  L1i:                  320 KiB (6 instances)
  L2:                   1.3 MiB (2 instances)
CoreELEC:~ # lscpu --output-all -e
BOGOMIPS CPU CORE SOCKET CLUSTER NODE BOOK DRAWER L1d:L1i:L2 POLARIZATION ADDRESS CONFIGURED ONLINE       MHZ SCALMHZ%    MAXMHZ   MINMHZ MODELNAME
   48.00   0    0      0       -    -    -      - 0:0:0      -            -       -             yes 1800.0001     100% 1800.0001 500.0000 Cortex-A53
   48.00   1    1      0       -    -    -      - 1:1:0      -            -       -             yes 1800.0001     100% 1800.0001 500.0000 Cortex-A53
   48.00   2    0      0       -    -    -      - 2:2:1      -            -       -             yes 2208.0000     100% 2208.0000 500.0000 Cortex-A73
   48.00   3    1      0       -    -    -      - 3:3:1      -            -       -             yes 2208.0000     100% 2208.0000 500.0000 Cortex-A73
   48.00   4    2      0       -    -    -      - 4:4:1      -            -       -             yes 2208.0000     100% 2208.0000 500.0000 Cortex-A73
   48.00   5    3      0       -    -    -      - 5:5:1      -            -       -             yes 2208.0000     100% 2208.0000 500.0000 Cortex-A73

Now I see the Linux scheduler doesn't bounce processes from cluster 0 (A53) to cluster 1 (A73) anymore.

ANECDOTE (not a data-backed experiment): AV1 decoding now seems better than before. I still cannot decode 4K smoothly, but it drops fewer frames.

Data: When running a CPU-intensive task before the device tree change, CPU#0 would reach 89% usage while CPU#4/5 would reach 60%. After the change, CPU#0 reaches 35% usage and CPU#4/5 are at 80%+.

The Linux scheduler is working better because it can now see that the L2 cache is not shared between the A53 and A73 clusters; the L2 is shared only among the cores within each cluster.
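
You can confirm the topology the kernel picked up straight from sysfs (a sketch; it assumes the L2 shows up as index2, the usual layout where index0/index1 are L1d/L1i):

grep . /sys/devices/system/cpu/cpu*/cache/index2/shared_cpu_list
# expected with the new dtb: cpu0-1 report "0-1" (A53 cluster L2)
# and cpu2-5 report "2-5" (A73 cluster L2)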

I am attaching two files here for testing. The file ending in _all has the final, complete form. The file ending in A53 defines a 256 KB cache for all 6 cores, and you can see the scheduler bumps processes out of cluster 0 because it assumes that cache is shared between all 6 processor cores.

I am looking for this community's opinion on whether this is useful (e.g. for gaming / software decoding).

g12b_s922x_ugoos_am6b_all.dtb (74.1 KB)
g12b_s922x_ugoos_am6b-A53-Cache.dtb (74.0 KB)
Attachment removed! It will be available in nightly build!

11 Likes

@vpeter How can I update the dtb file on eMMC? Is there a way to do it?

EDIT: Spoke too soon. Figured it out.

mount -o remount,rw /flash
cd /flash
mv dtb.img dtb_o.img   # keep the original as a backup

The eMMC install is blazing fast! No more hiccups!

Loading new widgets happens on CPU 2-5 (A73), making it faster than before. If you spam the navigation keys (right/left), everything pins to CPU 2-5. As soon as you stop moving around the Kodi screen, the process gets pinned back to CPU 0-1.
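
If you want to watch this yourself, something like the line below works (a sketch; it assumes the full procps ps with thread support, not the busybox applet):

ps -eLo pid,psr,comm | grep -i kodi   # PSR = the core each Kodi thread last ran on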

Now we can truly benefit from the ARM big.LITTLE architecture.

  1. When I play Contra 3 through the Internet Archive game launcher, the CE box barely uses 20% CPU on CPU 0/1. It does not get moved to CPU 2-5.
  2. Searching in an addon uses CPU 2-5, making the search faster.
2 Likes

Brief test: nothing seems broken, and clicking around and Nimbus widget loads do seem a fair bit quicker. I happened to have a CE update come down while testing; the update overwrote the new dtb, in case anyone else testing was wondering. Well done, seems pretty great on first impressions! I'll run it for a bit to see if I notice anything.

3 Likes

Thank you, I will have a go today after work to see how software decoding of 1080p AV1 is affected. Currently it runs right at the limit of frame dropping with 1.78:1 AR material.

Which dtb do you recommend, not for gaming but for media playback and Kodi usage?

I’d still recommend the _all.dtb.

The other one is for testing only if someone wants to experiment.

Just a reminder to all that you can discuss this subject without resorting to naming banned addons. Thanks.

2 Likes

Where do I put the _all dtb to move it to eMMC? Thanks.

  1. Download the file
  2. Rename it to dtb.img
  3. Run the remount command from above
  4. Use scp or any other copy tool to put the newly renamed file at /flash/dtb.img (the combined commands are sketched below)
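
Putting the steps together (a sketch; the IP address is an example, substitute your box's own):

# on your PC: send the new dtb to the box
scp g12b_s922x_ugoos_am6b_all.dtb root@192.168.1.100:/tmp/dtb.img

# on the box, over SSH:
mount -o remount,rw /flash
cp /flash/dtb.img /flash/dtb_o.img   # keep a backup of the original
mv /tmp/dtb.img /flash/dtb.img
mount -o remount,ro /flash           # lock the partition again
reboot
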
1 Like

Thank you, I also rebooted after copying the dtb.

Where do we enter the remount command? Over SSH?

Yes, through SSH.

By the way, there is an alias called “mf” (mount flash) that does the same thing. Much less to type :smiley:
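
If you're curious what it expands to on your build, you can check (assuming a shell with the type builtin):

type mf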

mount -o remount,ro /flash

I’d recommend locking the flash partition at the end after overwriting the dtb which is unlocked in this process.

The box does feel noticeably snappier now using the _all variant dtb.

1 Like

@OmgA
Have you also increased the eMMC speed to HS400?

Yes, HS400 is enabled.
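
For anyone wanting to verify on their own box, the current eMMC timing mode is visible in debugfs (a sketch; it assumes debugfs is mounted and the eMMC is host mmc0):

cat /sys/kernel/debug/mmc0/ios   # the "timing spec" line reads hs400 when enabled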

1 Like

Works nicely. I'd estimate the average CPU usage (without any upscaling) is around 60-70%, while with the standard DTB it was more or less constantly closer to 80%.

Works well, thanks. :slight_smile:
I'm booting from an SD card; the boot time is about 10% faster and the menu navigation is snappier. :+1:

1 Like

@MasterKeyxda can you share the shortened dts you used to compile the dtb?
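
In the meantime, anyone curious can decompile the attached dtb themselves with dtc (a sketch; note that a decompile loses the original labels and comments):

dtc -I dtb -O dts -o am6b_all.dts g12b_s922x_ugoos_am6b_all.dtb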

Looking at the S922X datasheet:

"The quad core Cortex™-A73 processor is paired with A53 processor in a big.Little configuration, with each core has L1 instruction and data chaches, together with a single shared L2 unified cache with A53."

Based on this discussion I'd accept that there isn't actually one L2 cache shared by both the A53 and A73 clusters. But it's still not clear what the actual L2 cache sizes are for the two clusters; the S922X datasheet doesn't say.

This older review suggests the A73 cluster has a 1 MB L2 cache and the A53 cluster a 256 KB L2 cache. I don't know where they got that information, and it's for the S922X rev A, but I'd guess that wouldn't have changed for the S922X rev B.

This might be something Amlogic is willing to directly clarify.

1 Like

Working well. Thanks

I can share it once I get home tonight.

I experimented with both what the S922X datasheet says and what ARM says.

ARM says the A53 has to have its own cache.

The -A53-Cache.dtb file is actually defined with one unified cache, as the datasheet describes. I defined it as 1 MB, yet Linux's runtime detection only finds 256 KB. This little experiment is why I attached both files above: defined as 1 MB, but detected as 256 KB.

The _all.dtb is defined with a separate L2 per cluster, the way ARM says its cores are built. The runtime detection then finds all of the cache.

This is why I’m willing to believe ARM and follow the dual cache theory.
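
If someone wants to double-check the dual cache theory on their own box with the _all.dtb, the per-cluster L2 sizes are visible in sysfs (a sketch; again assuming the L2 is index2):

grep . /sys/devices/system/cpu/cpu*/cache/index2/size
# expected: 256K on cpu0-1 (A53 cluster) and 1024K on cpu2-5 (A73 cluster),
# which matches the 1.3 MiB total that lscpu reports above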

2 Likes