Le Potato Segmentation Fault on Large File Transfer

CoreElec throws a “Segmentation Fault” on all SSH sessions when doing a large file transfer.

Thanks for all the hard work the CoreElec team contributes to this project.

This is not Kodi related; it is related to the underlying OS.
It seems to be the exact same issue as what was encountered on armbian here: https://forum.armbian.com/topic/6591-le-potato-network-stack-crash-on-huge-file-writes/

I shut down kodi with systemctl stop kodi.

Then I run the following test file:

#!/bin/bash
# Blue LED indicates SD card activity
echo sd > /sys/class/leds/librecomputer:blue/trigger
# Green LED indicates heartbeat activity
echo heartbeat > /sys/class/leds/librecomputer:system-status/trigger

LARGEFILE="/storage/.config/largefile.txt"

# Clean up any old file
rm -f "$LARGEFILE"

# Create a string of 4096 bytes
MAKE4K=""
for i in $(seq 1 4096);
do
    MAKE4K="${MAKE4K}A"
done

# Write the 4096-byte string 250 000 times --> roughly 1 GB file
for j in $(seq 1 250000);
do
    echo "$MAKE4K" >> "$LARGEFILE"
done

Simultaneously I will start a
watch -n 1 cat /sys/devices/system/cpu/cpufreq/all_time_in_state
on a new SSH session.
The heartbeat LED keeps beating regularly. Blue LED flashes for Card writes.

At a random point, once the file has grown to somewhere between 500 and 700 MB, all SSH sessions throw a Segmentation Fault and everything running in them exits.
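One way to pin down the size at which the sessions die is to print a running total while the file grows. This is only a minimal sketch: the function name, the 1 MiB chunk size and the use of zeros from /dev/zero (instead of the "A" string in the script above) are assumptions for illustration.

```shell
#!/bin/sh
# Append COUNT chunks of 1 MiB to FILE and print the running total.
# Function name, chunk size and zero-filled content are illustrative assumptions.
write_with_progress() {
    file=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        # Append one 1 MiB chunk; stop at the first failed write.
        dd if=/dev/zero bs=1048576 count=1 2>/dev/null >> "$file" || return 1
        echo "written: ${i} MiB"
        i=$((i + 1))
    done
}

# Example (hypothetical path, 1024 chunks = 1 GiB):
# write_with_progress /storage/.config/largefile.bin 1024
```

The last line printed before the session dies gives the approximate size reached.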

Heartbeat LED keeps flashing regularly.

If I then run the test script again everything crashes - full freeze up. Heartbeat LED stops flashing.
Need to power cycle the device.

I looked for logs that reflected the errors but could not find any.

What can I do to get this issue looked at?
Are there any logs and where are they located on CoreElec that I can supply that would capture the segmentation faults?

Any other tests I can do to isolate the issue?
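For the log question, one thing that could be tried is to follow the kernel messages and flush every line to persistent storage, so that whatever was printed just before a freeze has a better chance of surviving the power cycle. A minimal sketch, assuming journalctl is available on the image; the function name and the log path are hypothetical.

```shell
#!/bin/sh
# Copy lines from stdin to FILE, calling sync after each line so the
# most recent messages are more likely to be on disk if the box freezes.
log_lines_synced() {
    logfile=$1
    while IFS= read -r line; do
        printf '%s\n' "$line" >> "$logfile"
        sync
    done
}

# Example (assumes journalctl exists; the target path is hypothetical):
# journalctl -k -f | log_lines_synced /storage/kernel-messages.log
```

Calling sync per line is slow, but kernel messages are usually sparse enough that this should not matter much.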

Furthermore, I get random freezes in Kodi when I try to back up Kodi, but I think that is related to the same underlying file system issue. Nothing gets logged in the Kodi log file when the entire unit freezes.

Thanks Everybody

This has been a known issue for some time. Unfortunately we have been unable to find the root cause, and it’s likely that this issue will only be resolved when we move to a mainline kernel.

Ouch!!!

Thanks adamg

@oz-ra
BayLibre has fixed this in mainline Linux for Libre Computer. See the last two commits here: https://github.com/libre-computer-project/libretech-linux/commits/linux-4.14/net
However, as @adamg stated, the patches do not translate directly to Amlogic’s BSP.

@daxue Thanks for that.

Unfortunately, I do not possess enough knowledge to do anything with the result. :grinning:

There were some updates pushed recently to the kernel repository that may fix this issue in the next release.


Please try the following test build here and see if it helps.

@adamg WOW. Thanks very much for that.
Will try it in the next 12 hours and will let you know.

Thanks again.

@adamg I have tried your test build.
It seems to be heading in the right direction.

I can now consistently write up to 1.2GB~1.3GB to local or nfs drives before I get a segmentation fault.

Maximum size I achieved previously was limited to 700MB.

Any suggestions that I can try?

Thanks for your help.

I now believe this to be a hardware issue after discussions with another developer.

This issue sounds really similar to what I am experiencing. When playing high-bitrate files locally on the 2 GB LePotato, the device completely locks up after a short while.

It was suggested that the USB-powered drive was the cause, but I have ruled that out, as the exact same configuration running the LePotato Android preview image and Kodi plays the file perfectly fine.

I was connected over SSH once or twice when the crash happened, and I believe I saw a segmentation fault as well. In my case I do not think it’s a hardware issue, as it works on Android.

@ih8lag Thanks for the info. I am not sure it is a hardware issue as Armbian claim that the above issue was resolved, unless I am reading it wrong:

"Part of the problem is linked to LPA issues and the other to EEE:

https://patchwork.kernel.org/patch/10142531/

https://patchwork.kernel.org/patch/10102343/

For Armbian an apt update/upgrade should fix it, as long as you can stay connected. I’d use a wifi dongle if you have one. There are quite a few other updates in there as well, such as drivers for the audio subsystem."

I’m fairly certain this is a hardware issue, as testing has been done with an oscilloscope: it looks good at first, but then the crystal drops to 24.7 MHz. If you have a switch with a web interface you will also see RX errors.
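The receive errors can also be checked from the board side, since the kernel exposes per-interface counters in sysfs. A minimal sketch, assuming the wired interface is called eth0; the function name and the SYSFS_NET override are only there for illustration and testing.

```shell
#!/bin/sh
# Print the receive-error counter the kernel keeps for a network interface.
# SYSFS_NET may be overridden; it defaults to the standard sysfs location.
rx_errors() {
    iface=$1
    cat "${SYSFS_NET:-/sys/class/net}/${iface}/statistics/rx_errors"
}

# Example (eth0 is an assumed interface name):
# before=$(rx_errors eth0); sleep 60; after=$(rx_errors eth0)
# echo "new RX errors: $((after - before))"
```

Comparing the counter before and after a large transfer would show whether errors accumulate while the system is under load.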

I believe XTAL_IN has been hooked up incorrectly on LePotato: a SoC PLL has been used rather than an external crystal, probably as a cost-saving measure. As such, when the CPU is stressed, ethernet dies because the PLL will wane.

We do not have the source for the Android images, so we don’t know what has been done in them. This issue can be fixed with some PHY calibration and the proper schematic, but that is not something I have access to. My guess is that the system is not stressed enough under Android to invoke the behaviour, whereas on CE/LE the GUI is being redrawn, adding stress to the CPU, and transferring over Samba is also CPU intensive, which in turn triggers the problem.

@adamg Thanks, That would make sense. Good old Rigol! Cheap expansion options :slight_smile:

Although:

  1. During my testing, Kodi is shut down. So nothing else is stressing the SOC except the copy process. systemctl stop kodi

  2. The entire system freezes during local copying operations or LAN copy operations. Otherwise we could troubleshoot with a USB Ethernet device to narrow the OSC problems down. Or a WiFi dongle.

Is there any way we can disable the onboard ethernet? Unload its driver etc. to minimise its impact?
That way we could demonstrate the issue to the seller.

Interesting problem though.

PS: And then, according to the above, on Armbian with a full GUI desktop it is not occurring anymore.

Armbian is using a mainline kernel, isn’t it? I believe BayLibre is doing a lot of the mainline work for Amlogic and works closely with LibreComputer on adding support for LePotato, so they may have already fixed it there. They probably also have the schematics for this device.

We build the ethernet module directly into the kernel and not as a loadable module so you are not able to blacklist it.

Are you copying over NFS or Samba?

I haven’t tried local copy operations, but if what you say is true then I would imagine that is a separate issue.

Thanks @adamg,

I copy local -> local and local -> NFS from an SSH console with kodi shutdown.
Both exhibit the exact same file size limitations, of 1.2~1.3 GB with your test version or 500~700 MB with 8.90.3, before a segmentation fault occurs.

Sorry, I don’t have many details about the Armbian kernel, only what I have read in their forum regarding a similar problem to mine.

Cheers

You can grab the schematics here: https://drive.google.com/file/d/0B1Rq7NcD_39QYnltdGtWWEFvS0U/view?usp=sharing

@daxue

From that schematic, Unless I am mistaken, Ethernet is on-board of the 905X with an external PHY. So no clock feeding an external Ethernet chip.

All clocks are generated and supplied by 905X from a single external 24MHz Crystal on pins B13 & C14.

If that was unstable ALL timing including Video and DDR would be affected. Ethernet would not be singled out. The crystal is rated at 20ppm so Freq will vary between 24 000 480 Hz & 23 999 520 Hz.
Standard ppm variance.
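For reference, the tolerance window above is just nominal frequency times ppm divided by one million. A small sketch of that arithmetic (the function name is made up for illustration):

```shell
#!/bin/sh
# Print the lowest and highest frequency (in Hz) allowed for a crystal
# with the given nominal frequency (Hz) and tolerance (ppm).
xtal_range() {
    nominal_hz=$1
    ppm=$2
    delta_hz=$(( nominal_hz * ppm / 1000000 ))
    echo "$(( nominal_hz - delta_hz )) $(( nominal_hz + delta_hz ))"
}

xtal_range 24000000 20   # prints "23999520 24000480"
```

So a 20 ppm part at 24 MHz may be off by at most 480 Hz in either direction.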

From where I am sitting, it is definitely not a PLL osc.
Spec Sheet for Crystal http://41j.com/blog/wp-content/uploads/2014/08/smd3225.pdf

No doubt there are internal PLLs in the 905X, but then I would expect this timing bug to affect ALL 905X boards, not just this Le Potato implementation.

I would have thought putting a probe on a running un-buffered OSC would skew the Freq quite a bit due to the capacitance.

I cannot find any crystals connected to a XTAL_IN. The only other one is the standard 12MHz one for the USB Hub.

The measured Freq on the above scope is 25MHz which is a HUGE HUGE variance from the specced 24MHz on the schem. That confuses me.

If you look at the link I posted above to the github branch, the last two commits highlight the network issues that were discovered so far. I am not sure if your issue relates to those two. If you can replicate the issues using the mainline image, then there is another network issue that I am not aware of.
Download link: http://share.loverpi.com/board/libre-computer-project/libre-computer-board-aml-s905x-cc/image/ubuntu/

Thank You @daxue

I’ll try that. Hopefully that will give a clear indication either way.

Any specific one or would either 4.14.20 or 4.14.50 do?

Regards