If the response to this performance-gain effort is positive, I'll share DTB files for any and all S922X devices.
From one post above, I'm reading that improvements range from 16% to 25%. We need more people to report, because a sample size of 1-4 is not conclusive.
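To judge when there are enough reports, it helps to look at the spread of the reported gains and not only their range. Below is a minimal sketch. It assumes each report boils down to a single percent gain on whatever benchmark the reporter used, and `summarize_gains` is a hypothetical helper, not part of any CoreELEC tool:

```python
import math
import statistics


def summarize_gains(gains_pct):
    """Mean, sample standard deviation and standard error of reported gains (%).

    Each entry is assumed to be one reporter's percent improvement.
    """
    n = len(gains_pct)
    if n < 2:
        # A single report gives no information about the spread.
        raise ValueError("need at least two reports to estimate spread")
    mean = statistics.mean(gains_pct)
    sd = statistics.stdev(gains_pct)  # sample (n - 1) standard deviation
    return {"n": n, "mean": mean, "stdev": sd, "stderr": sd / math.sqrt(n)}
```

The standard error shrinks with the square root of the number of reports. With a similar spread, going from 4 reports to 16 would halve it.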
On the other hand, we could also experiment with the ondemand governor to keep CPU temperatures low.
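For the governor experiment, cpufreq exposes one policyN directory per group of CPUs that share a clock. On big.LITTLE parts like the S922X, that is normally one per cluster. Below is a minimal sketch. It assumes the standard sysfs layout under /sys/devices/system/cpu/cpufreq, root access, and that the kernel was built with the governor you ask for. `set_governor` is a hypothetical helper:

```python
import sys
from pathlib import Path


def set_governor(governor, root="/sys/devices/system/cpu/cpufreq"):
    """Switch every cpufreq policy to `governor` and return the previous settings.

    On big.LITTLE SoCs each cluster normally has its own policyN directory.
    All policies are validated before any of them is changed.
    """
    policies = sorted(Path(root).glob("policy[0-9]*"))
    for policy in policies:
        available = (policy / "scaling_available_governors").read_text().split()
        if governor not in available:
            raise ValueError(f"{policy.name}: {governor!r} not available ({available})")
    previous = {}
    for policy in policies:
        current = policy / "scaling_governor"
        previous[policy.name] = current.read_text().strip()
        current.write_text(governor + "\n")
    return previous


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage (as root): python3 governor.py ondemand
    print("previous governors:", set_governor(sys.argv[1]))
```

It returns the previous governors so they can be restored afterwards. Temperatures can be compared before and after through a thermal zone such as /sys/class/thermal/thermal_zone0/temp, which reports millidegrees Celsius. Which zone maps to the CPU varies by device.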
With the cache defined, we can enable prefetching for software decoders so that more data is kept in L2, since the L1d cache is only 32 KB while the L1i cache is 64 KB. That would reduce the DRAM penalty.
Hops between the core clusters carry a very heavy performance penalty because they involve DRAM.
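If cross-cluster hops really are this costly, one way to check is to pin the decoder to a single cluster and compare the result against an unpinned run. Below is a minimal sketch using Python's os.sched_setaffinity. Linux applies affinity per thread, so the sketch walks /proc/<pid>/task, roughly what `taskset -a -p` does. Which CPU numbers form a cluster depends on the device, and `pin_process` is a hypothetical helper:

```python
import os
import sys


def pin_process(pid, cpus):
    """Pin every thread of process `pid` to `cpus` and return the main thread's mask.

    Linux applies CPU affinity per thread, so each entry in /proc/<pid>/task
    is updated.
    """
    cpus = set(cpus)
    if not cpus:
        raise ValueError("CPU set must not be empty")
    for tid in os.listdir(f"/proc/{pid}/task"):
        try:
            os.sched_setaffinity(int(tid), cpus)
        except ProcessLookupError:
            pass  # the thread exited between listing and pinning
    return os.sched_getaffinity(pid)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python3 pin.py <pid> <comma-separated CPUs>, e.g. 1234 2,3,4,5
    target = int(sys.argv[1])
    wanted = {int(c) for c in sys.argv[2].split(",")}
    print("affinity now:", sorted(pin_process(target, wanted)))
```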
I think you are right about the separate L2 caches for the two clusters. The wording in the S922X datasheet is just clumsy; the A311D2 datasheet is clearer:
The main system CPU is based on Big.LITTLE architecture which integrates a quad-core ARM Cortex-A73 CPU cluster and a quad-core Cortex-A53 cluster with unified L2 cache for each cluster to improve system performance. In addition, the CPU includes the NEON SIMD co-processor to improve software media processing capability.
And also:
Unified system L2 cache for each cluster
The same optimization would also benefit other big.LITTLE SoCs like the A311D2, and maybe the S928X (no datasheet available yet). But again, the A311D2 datasheet doesn't specify the L2 cache sizes.
Yep. I'm impressed by the Linux runtime/boot-time detection. I found that I can even specify a 1 MB cache, which is absolutely wrong, and it still displays only 256 KB.
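To see what the kernel actually detected, you can read the same data that lscpu summarizes straight from the cacheinfo files under /sys/devices/system/cpu/cpuN/cache/indexM/. Below is a minimal sketch. It assumes the kernel exports the level, type, size and shared_cpu_list attributes there; older kernels, or CPUs without cacheinfo support, may not. `read_cache_info` is a hypothetical helper:

```python
from pathlib import Path


def _read(path):
    """Return a sysfs attribute's text, or "?" if the file is missing."""
    return path.read_text().strip() if path.exists() else "?"


def read_cache_info(root="/sys/devices/system/cpu"):
    """Map each CPU to a list of (level, type, size, shared_cpu_list) tuples.

    `root` defaults to the live sysfs tree; it is a parameter only so the
    function can be pointed at a copy of that tree.
    """
    result = {}
    cpu_dirs = sorted(Path(root).glob("cpu[0-9]*"), key=lambda p: int(p.name[3:]))
    for cpu_dir in cpu_dirs:
        cache_dir = cpu_dir / "cache"
        if not cache_dir.is_dir():
            continue  # no cacheinfo exported for this CPU
        entries = []
        for index_dir in sorted(cache_dir.glob("index[0-9]*")):
            entries.append(tuple(_read(index_dir / name) for name in
                                 ("level", "type", "size", "shared_cpu_list")))
        result[cpu_dir.name] = entries
    return result


if __name__ == "__main__":
    for cpu, caches in read_cache_info().items():
        for level, ctype, size, shared in caches:
            print(f"{cpu}: L{level} {ctype:<12} {size:>6}  shared with CPUs {shared}")
```

Posting this output together with lscpu makes it easy to see whether each L2 entry is reported per cluster.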
Caution: this only applies to CPUs with an integrated L1/L2 cache controller, so we cannot blindly roll it out to every CPU model out there. Most cores from the A53 onward have an integrated cache controller.
There might also be a benefit on the S905X4, whose A55 cores share an L2, because defining a unified cache may allow for better prefetching.
Basically, for other SoCs I can specify a 3 MB cache and it'll still find the right amount. I just need to identify specifically whether each SoC has a dual cache cluster or a single one.
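On a running box, you can check whether an SoC has one shared L2 or one L2 per cluster by grouping CPUs by the shared_cpu_list of their level-2 cache entry. Below is a minimal sketch under the same sysfs assumption. On an S922X with per-cluster L2s, one would expect two groups (for example CPUs 0-1 and 2-5, the split used in the VC-1 example). A shared-L2 part should give a single group. The helper names are hypothetical:

```python
from pathlib import Path


def parse_cpu_list(text):
    """Expand a kernel CPU list such as "0-1" or "0,2-5" into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def l2_clusters(root="/sys/devices/system/cpu"):
    """Return the distinct sets of CPUs that share an L2, ordered by lowest CPU."""
    groups = set()
    for level_file in Path(root).glob("cpu[0-9]*/cache/index[0-9]*/level"):
        if level_file.read_text().strip() != "2":
            continue  # only level-2 entries matter here
        shared = level_file.with_name("shared_cpu_list")
        if shared.exists():
            cpus = parse_cpu_list(shared.read_text())
            if cpus:
                groups.add(frozenset(cpus))
    return sorted(groups, key=min)


if __name__ == "__main__":
    clusters = l2_clusters()
    if not clusters:
        print("No L2 reported (cache not described or not detected)")
    elif len(clusters) == 1:
        print("Single shared L2 for CPUs", sorted(clusters[0]))
    else:
        for i, cpus in enumerate(clusters):
            print(f"L2 #{i}: CPUs {sorted(cpus)}")
```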
I’ve updated mesong12b.dtsi which is used by all g12b devices. Any device using this dtsi will benefit from this improvement.
If cube uses g12b, post which DTB file you use and I'll upload it here.
Next steps:
Find all core dtsi files that use the S922X
Update them all with cache info to propagate this benefit
Create a PR for kernel 4.9.20
Ask the CE team for help propagating this change into the newer 5.15 and future kernels. If they are too busy, do it the manual, dirty way with a PR for each kernel.
All DTB files for g12b. Please report back with your lscpu output and CPU usage findings; data and metrics are preferred.
For example: software-decoding VC-1 used to consume 50% CPU across cores 0-5, but after the update it consumes 20% on cores 0-1 and 65% on cores 2-5. Something like this is ideal if you are able to collect that kind of data.
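Per-core numbers like these can be collected without perf. Sample /proc/stat twice and compare the busy and idle ticks on each cpuN line. Below is a minimal sketch. It assumes the standard /proc/stat field order (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice) and counts iowait as idle. The function names are hypothetical:

```python
import time


def parse_proc_stat(text):
    """Return {core: (idle_ticks, total_ticks)} for each per-core "cpuN" line."""
    times = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the aggregate "cpu" line and non-CPU lines such as "intr".
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        core = int(fields[0][3:])
        values = [int(v) for v in fields[1:]]
        # idle + iowait (4th and 5th fields) count as idle time.
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        # guest/guest_nice are already included in user/nice, so they are left out.
        total = sum(values[:8])
        times[core] = (idle, total)
    return times


def per_core_usage(before, after):
    """Busy percentage per core between two parse_proc_stat() snapshots."""
    usage = {}
    for core, (idle0, total0) in before.items():
        if core not in after:
            continue  # core went offline between samples
        idle1, total1 = after[core]
        d_total = total1 - total0
        d_idle = idle1 - idle0
        usage[core] = 100.0 * (d_total - d_idle) / d_total if d_total > 0 else 0.0
    return usage


def sample(interval=5.0):
    """Measure per-core busy % over `interval` seconds on the live system."""
    with open("/proc/stat") as f:
        before = parse_proc_stat(f.read())
    time.sleep(interval)
    with open("/proc/stat") as f:
        after = parse_proc_stat(f.read())
    return per_core_usage(before, after)
```

Start playback first, then call sample() so the measurement window covers decoding. Running it once with the old DTB and once with the new one gives a before/after comparison like the VC-1 example.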
Fully quantified data would be possible if perf binaries were compiled for CE-NG.