CEC working unreliably after wakeup

I am running the latest stable release, that is CoreELEC 9.2.1.

I’ve got Odroid N2 connected directly to my Philips TV. I am able to suspend and wake up the Ordroid with my TV’s remote, however, after a wake up, the Odroid fails to react to TV remote button presses. It may recover by itself in 10 or so seconds, or it may not, in which case I have to switch the TV to another source and then back.

Interestingly, if it fails to recover by itself, I can make it recover merely by running cec-client from the command line. The recovery is not instant - it takes a few seconds. Here is the cec-client output for such a recovery:

CoreELEC:~ # cec-client
No device type given. Using 'recording device'
CEC Parser created - libCEC version 4.0.4
no serial port given. trying autodetect:
 path:     /dev/aocec
 com port: AOCEC

opening a connection to the CEC adapter...
DEBUG:   [             123]     Broadcast (F): osd name set to 'Broadcast'
NOTICE:  [             123]     connection opened
DEBUG:   [             123]     << Broadcast (F) -> TV (0): POLL
TRAFFIC: [             123]     << f0
DEBUG:   [             123]     processor thread started
DEBUG:   [             302]     >> POLL sent
DEBUG:   [             302]     TV (0): device status changed into 'present'
DEBUG:   [             302]     << requesting vendor ID of 'TV' (0)
TRAFFIC: [             302]     << f0:8c
DEBUG:   [            1579]     expected response not received (87: device vendor id)
TRAFFIC: [            1579]     << f0:8c
TRAFFIC: [            2405]     >> 01:46
DEBUG:   [            2798]     expected response not received (87: device vendor id)
NOTICE:  [            2799]     registering new CEC client - v4.0.4
DEBUG:   [            2799]     detecting logical address for type 'recording device'
DEBUG:   [            2799]     trying logical address 'Recorder 1'
DEBUG:   [            2799]     << Recorder 1 (1) -> Recorder 1 (1): POLL
TRAFFIC: [            2799]     << 11
WARNING: [            3077]     Write: write failed
TRAFFIC: [            3077]     << 11
WARNING: [            3077]     Write: write failed
DEBUG:   [            3077]     >> POLL not sent
DEBUG:   [            3077]     using logical address 'Recorder 1'
DEBUG:   [            3077]     Recorder 1 (1): device status changed into 'handled by libCEC'
DEBUG:   [            3077]     Recorder 1 (1): power status changed from 'unknown' to 'on'
DEBUG:   [            3077]     Recorder 1 (1): vendor = Pulse Eight (001582)
DEBUG:   [            3077]     Recorder 1 (1): CEC version 1.4
DEBUG:   [            3077]     AllocateLogicalAddresses - device '0', type 'recording device', LA '1'
DEBUG:   [            3080]     Recorder 1 (1): osd name set to 'CECTester'
DEBUG:   [            3080]     Recorder 1 (1): menu language set to 'eng'
DEBUG:   [            3080]     AutodetectPhysicalAddress - autodetected physical address '2000'
DEBUG:   [            3080]     Recorder 1 (1): physical address changed from ffff to 2000
DEBUG:   [            3080]     << Recorder 1 (1) -> broadcast (F): physical address 2000
TRAFFIC: [            3080]     << 1f:84:20:00:01
NOTICE:  [            3236]     CEC client registered: libCEC version = 4.0.4, client version = 4.0.4, firmware version = 5, logical address(es) = Recorder 1 (1) , physical address: 2.0.0.0,  compiled on Wed Nov 27 08:01:29 UTC 2019 by adamg@zulu on Linux 4.15.0-71-generic (x86_64), features: P8_USB, DRM, P8_detect, AOCEC
DEBUG:   [            3236]     << Recorder 1 (1) -> TV (0): OSD name 'CECTester'
TRAFFIC: [            3236]     << 10:47:43:45:43:54:65:73:74:65:72
DEBUG:   [            3622]     << requesting power status of 'TV' (0)
TRAFFIC: [            3622]     << 10:8f
TRAFFIC: [            4339]     >> 01:90:00
DEBUG:   [            4339]     TV (0): power status changed from 'unknown' to 'on'
DEBUG:   [            4340]     >> TV (0) -> Recorder 1 (1): report power status (90)
waiting for input
DEBUG:   [            4340]     expected response received (90: report power status)

libCEC is doing a hardware init on each resume or startup of the device. This can take up to 60s until the CEC remote start working again. It’s same on my LG. Just wait longer and press sometime a button to test if it started working again.

How common is this problem across different platforms? I know Android TV boxes don’t have it, and neither does my Linux based Sat receiver.

Do you have an idea about the root cause of such behavior? I mean, from a TVs perspective, is there any difference between a connected device waking up from suspend or the same device doing a cold start?

So, I’ve decided to investigate this issue myself. I’ve built the CoreELEC image for Odroid N2, enabled debug logging in the CEC kernel driver (drivers/amlogic/cec/hdmi_ao_cec.c) and also added my own debugging messages to it. Then I captured the log in the scenario where TV remote button presses eventually start coming through. After reading the relevant parts of the CEC spec (I’ve found only the old version 1.3a), I was able to figure out what’s going on. Below is my interpretation of what I captured in the log. If anyone is interested, I can provide the raw log as well.

1. After a wakeup, a number of incoming CEC messages are received from the TV
   and my Sat receiver, which is connected to another HDMI input on the TV.
   The messages are:
   "2f 84 10 00 01" - The Sat receiver provides its physical address.
   "0f 84 00 00 00" - The TV provides its physical address.
   "0f 87 00 90 3e" - The TV reports its vendor id.
   "0f 32 72 75 73" - The TV reports its menu language.
   "2f a6 06 40 00 00" - Some HDMI 2.0 stuff. I don't have the spec for that.

2. The userland asks to send the following message:
   "f0" - Pings the TV to check if it's there, sent from the "Unregistered"
          logical address.
   The transmission is acknowledged.

3. The userland asks to send the following message:
   "f0 8c" - Requests the TVs vendor id, send from the "Unregistered"
             logical address.
   The transmission is acknowledged.

3. The TV sends its physical address and vendor id again:
   "0f 84 00 00 00" - Physical address.
   "0f 87 00 90 3e" - Vendor id.

5. The userland asks to send the following message:
   "11" - Is there anyone using the logical address of 1 (aka "Receiver 1")?
   The transmission is not acknowledged, as no one is using the logical
   address of 1. The TV is at logical address 0 and my Sat receiver is at
   logical address of 2.

   It has to be pointed out that the driver treats the lack of transmission
   acknowledgement as a serious problem, and responds by resetting the hardware
   by calling ceca_hw_reset(). In this particular case, the lack of
   acknowledgement is not a problem at all, and shouldn't need a hardware
   reset. However, this reset contributes very little to the overall delay
   in Odroid starting to accept TV remote commands.
   For reference, see "CEC 10.2.1 Logical Address Allocation" in the spec
   (my spec is v1.3a).

6. The userland again asks to send the message "11".
   Why? Wasn't it told the previous one didn't get acknowledged?
   This time, the driver detects this situation and refuses to send
   the duplicate ping message, instead immediately responding with
   "not acknowledged".
   For reference, seach for "NACK repeat message:%x\n" in hdmi_ao_cec.c.

7. The userland asks to send the following message:
   "1f 84 20 00 01" - The Odroid provides its physical address.
   This time it's send from a logical address of 1. So, the userland
   finally figured out there is no one else using it.
   The transmission is acknowledged.

8. The userland asks to send the following message:
   "1f 87 00 15 82" - The Odroid is providing its vendor id.
   The transmission is acknowledged.

9. The userland asks to send the following message:
   "10 47 43 6f 72 65 45 4c 45 43" - The Odroid provides its device name.
   The transmission is acknowledged.

10. The userland asks to send the previous message again.
    The transmission is acknowledged.

11. The userland asks to send the following message:
    "10 8f" - The Odroid asks the TV to provide its power status (is the TV on?)
    The transmission is acknowledged.

12. The TV responds with:
    "01 90 00" - The TV's power status "On".

13. The userland asks to send the following message:
    "10 04" - The Odroid reports it's in active state.
    The transmission is acknowledged.

14. The userland asks to send the following message:
    "1f 82 20 00" - The Odroid informs everyone that it has started
                    transmitting a stream.
    The transmission is acknowledged.

15. The userland asks to send the following message:
    "10 8e 00" - The Odroid requests TV remote buttons to be passed through
                 to it.
    The transmission is acknowledged.

From there on, the TV starts reporting remote button presses.

In total, 5 seconds have passed since the Odroid woke up. We take the call
to aocec_late_resume() as the point of warking up.

The largest individual delay was 2 seconds between waking up and userland
asking us to send its first message.

However, the story doesn't end here. At this point, the userland decides
for some reason to ping all other logical addresses, by sending messages
"12" ... "1e". Now, the logical address of 2 is occupied by my Sat receiver,
so message "12" gets acknowledged, but the others are not. As mentioned before,
the driver treats each unacknowledged transmission as serious problem and
resets the hardware in response. What's worse, 2 of these messages result
in "tx timeout" rather than in "TX ERROR!" (both are searchable strings
in hdmi_ao_cec.c). I haven't figured out what causes those timeouts, but
each of them blocks transmissions for another 5 seconds.

So, it sounds to me that there are problems in both the driver and the userland. If I find more time, I’ll investigate some more.

1 Like

:+1:
Thank you for your really good debugging!
The issue is on my to-do list but with low priority. It’s “only” a delay after resume until the remote start working again. So the main function is still working - but it could be better I know.

I see the main issue in libCEC but maybe also the driver needs some tunning.
We can’t do a lot of things with libCEC as it’s really hard to get familiar with the source code. I know the handling of libCEC and the polling of all address available and so on - it’s strange.

Maybe you will take a look to the Pulse Eight libCEC support to get this issue solved by the main developers of libCEC.

I look forward to receive a pull request from you for CE to fix this issue. :heart_eyes:

Finally I’ve found some more time to debug the issue.

First of all, regarding the “tx timeout” situations in the driver. It turns out they are very easy to reproduce. Just boot the system or wake it up from suspend, and as soon as the GUI appears, start pressing the Down button (doesn’t have to be that particular one) on the TV remote fast enough. Twice a second seems to be enough to reproduce it. Now, I’ve actually added the code to check the CEC_TX_MSG_STATUS register at the point of timeout (that is after 5 seconds) and it’s still at TX_BUSY! So, it seems possible to completely saturate the CEC link just by sending 4 short messages a second (3 bytes for key press, 2 for key release). Either that, or this hardware is way too conservative with timings.

So, your key presses cause the transmissions from libCEC to timeout, and until libCEC is done probing all possible logical addresses, it won’t process the key presses, that are in fact coming through.

Now, why does libCEC probe all possible logical addresses? I’ve managed to find at least one such scenario. Here is the stack trace that leads to logical address 0xc (reserved1) being probed:

#0  CEC::CCECBusDevice::TransmitPoll (this=0xe9b59328,
    dest=CEC::CECDEVICE_RESERVED1, bUpdateDeviceStatus=true)
    at ../src/libcec/devices/CECBusDevice.cpp:937
#1  0xf7089a2c in CEC::CCECBusDevice::GetStatus (this=0x3b01418,
    bForcePoll=<optimized out>, bSuppressPoll=<optimized out>)
    at ../src/libcec/devices/CECBusDevice.cpp:861
#2  0xf708e3b0 in CEC::CCECDeviceMap::GetActive (this=0x3b38eb8, devices=...)
    at ../src/libcec/devices/CECDeviceMap.cpp:180
#3  0xf7077e44 in CEC::CCECClient::IsActiveDeviceType (this=<optimized out>,
    type=CEC::CEC_DEVICE_TYPE_AUDIO_SYSTEM) at ../src/libcec/CECProcessor.h:152
#4  0x009926ac in PERIPHERALS::CPeripheralCecAdapterUpdateThread::UpdateAudioSystemStatus (this=this@entry=0xd9e8f9f0)
    at ../xbmc/peripherals/devices/PeripheralCecAdapter.cpp:1485
#5  0x009929d4 in PERIPHERALS::CPeripheralCecAdapterUpdateThread::SetInitialConfiguration (this=this@entry=0xd9e8f9f0)
    at ../xbmc/peripherals/devices/PeripheralCecAdapter.cpp:1530
#6  0x00992ae0 in PERIPHERALS::CPeripheralCecAdapterUpdateThread::Process (
    this=0xd9e8f9f0)
    at ../xbmc/peripherals/devices/PeripheralCecAdapter.cpp:1548
#7  0x006c58ec in CThread::Action (this=0xd9e8f9f0)
    at ../xbmc/threads/Thread.cpp:208
#8  0x006c67b4 in CThread::staticThread (data=0xd9e8f9f0)
    at ../xbmc/threads/Thread.cpp:116

So, Kodi is trying to do the following:

std::string CPeripheralCecAdapterUpdateThread::UpdateAudioSystemStatus(void)
{
  std::string strAmpName;

  /* disable the mute setting when an amp is found, because the amp handles the mute setting and
       set PCM output to 100% */
  if (m_adapter->m_cecAdapter->IsActiveDeviceType(CEC_DEVICE_TYPE_AUDIO_SYSTEM))
  {

It asks libCEC whether there is an active audio system. LibCEC handles this request the following way:

bool CCECClient::IsActiveDeviceType(const cec_device_type type)
{
  CECDEVICEVEC activeDevices;
  if (m_processor)
    m_processor->GetDevices()->GetActive(activeDevices);
  CCECDeviceMap::FilterType(type, activeDevices);
  return !activeDevices.empty();
}

So, it first collects the list of all active devices (which involves polling all logical addresses) and then it returns those of them that are audio devices. This behaviour doesn’t seem hard to fix. I’ll try that when I have more time. I’d also like to try to build this CEC driver on the Amlogic kernel to see how well it works. At the very least, I can say its source code is much cleaner.

1 Like