🇨🇳 Mainland China network issues

Hi,

We have Mender-enabled devices in mainland China, which means they operate behind the Great Firewall.

Our Mender backend is hosted and operated by Mender (i.e. “Mender Professional”). We have noticed that a lot of Mender’s underlying network calls (e.g. update check, inventory upload, update download) fail. Short calls (e.g. update check, inventory upload) fail randomly. Longer ones (e.g. update downloads) fail consistently. The error printed by the mender client in the logs is always “connection timed out”.

Does this match the experience of other Mender customers? What can be done to address this?

Best,
Guillaume

Hello @guillaumekh, welcome to Mender Hub.

I’ve heard some discussions around this and it’s unsurprising that there are issues. The API calls all go directly to the hosted.mender.io IP address whereas actual file downloads come from Amazon S3. I’m not surprised the behavior is different for the two classes of content.

I’m not sure there is a good workaround for this apart from hosting your own server. Perhaps @merlin or @0lmi know otherwise.

Hello and thanks for using Mender.

Could you please supply as much of the exact logs as possible? If you do not want to make them public, share them with me privately: peter@northern.tech

peter

@drewmoseley possibly. Are files served straight from S3 or from a CDN (e.g. CloudFront)? Which region?
@peter here are some logs from the mender client on a device over the last 4 days. I have renamed some minor identifying bits, removed all occurrences of the following state transitions to reduce verbosity, and inserted a typo in the https:// scheme to circumvent the ban on new users posting more than 10 URLs in a post:

  • check-wait [Idle] → inventory-update [Sync]
  • inventory-update [Sync] → check-wait [Idle]
  • check-wait [Idle] → update-check [Sync]
  • update-check [Sync] → check-wait [Idle]

Dec 23 11:34:05 spectre-player-XXXXXX mender: time=“2019-12-23T18:34:05+08:00” level=info msg=“State transition: check-wait [Idle] → update-check [Sync]” module=mender
Dec 23 15:52:05 spectre-player-XXXXXX mender: time=“2019-12-23T22:52:04+08:00” level=error msg=“failed to submit inventory data: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: EOF” module=“client_inventory”
Dec 23 15:52:05 spectre-player-XXXXXX mender: time=“2019-12-23T22:52:04+08:00” level=warning msg=“failed to refresh inventory: failed to submit inventory data: inventory submit failed: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: EOF” module=state
Dec 24 14:23:57 spectre-player-XXXXXX mender: time=“2019-12-24T21:23:56+08:00” level=error msg=“failed to submit inventory data: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: read tcp 10.120.40.101:45346->18.205.126.2:443: read: connection reset by peer” module=“client_inventory”
Dec 24 14:23:57 spectre-player-XXXXXX mender: time=“2019-12-24T21:23:56+08:00” level=warning msg=“failed to refresh inventory: failed to submit inventory data: inventory submit failed: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: read tcp 10.120.40.101:45346->18.205.126.2:443: read: connection reset by peer” module=state
Dec 25 15:50:45 spectre-player-XXXXXX mender: time=“2019-12-25T22:50:45+08:00” level=error msg=“failed to submit inventory data: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: EOF” module=“client_inventory”
Dec 25 15:50:45 spectre-player-XXXXXX mender: time=“2019-12-25T22:50:45+08:00” level=warning msg=“failed to refresh inventory: failed to submit inventory data: inventory submit failed: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: EOF” module=state
Dec 26 03:05:14 spectre-player-XXXXXX mender: time=“2019-12-26T10:05:13+08:00” level=error msg=“Error receiving scheduled update data: update check request failed: Get https:/hosted.mender.io/api/devices/v1/deployments/device/deployments/next?artifact_name=artifact-2019-08-16-07%3A38&device_type=raspberrypi3: EOF” module=mender
Dec 26 03:05:14 spectre-player-XXXXXX mender: time=“2019-12-26T10:05:13+08:00” level=error msg=“update check failed: transient error: update check request failed: Get https:/hosted.mender.io/api/devices/v1/deployments/device/deployments/next?artifact_name=artifact-2019-08-16-07%3A38&device_type=raspberrypi3: EOF” module=state
Dec 26 03:05:14 spectre-player-XXXXXX mender: time=“2019-12-26T10:05:13+08:00” level=info msg=“State transition: update-check [Sync] → error [Error]” module=mender
Dec 26 03:05:14 spectre-player-XXXXXX mender: time=“2019-12-26T10:05:13+08:00” level=info msg=“handling error state, current error: transient error: update check request failed: Get https:/hosted.mender.io/api/devices/v1/deployments/device/deployments/next?artifact_name=artifact-2019-08-16-07%3A38&device_type=raspberrypi3: EOF” module=state
Dec 26 03:05:14 spectre-player-XXXXXX mender: time=“2019-12-26T10:05:13+08:00” level=info msg=“State transition: error [Error] → idle [Idle]” module=mender
Dec 26 03:05:14 spectre-player-XXXXXX mender: time=“2019-12-26T10:05:13+08:00” level=info msg=“authorization data present and valid” module=mender
Dec 26 03:05:14 spectre-player-XXXXXX mender: time=“2019-12-26T10:05:13+08:00” level=info msg=“State transition: idle [Idle] → check-wait [Idle]” module=mender
Dec 26 06:49:38 spectre-player-XXXXXX mender: time=“2019-12-26T13:49:38+08:00” level=info msg=“Device unauthorized; attempting reauthorization” module=client
Dec 26 14:29:01 spectre-player-XXXXXX mender: time=“2019-12-26T21:29:01+08:00” level=error msg=“failed to submit inventory data: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: read tcp 10.120.40.101:50530->34.196.11.110:443: read: connection reset by peer” module=“client_inventory”
Dec 26 14:29:01 spectre-player-XXXXXX mender: time=“2019-12-26T21:29:01+08:00” level=warning msg=“failed to refresh inventory: failed to submit inventory data: inventory submit failed: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: read tcp 10.120.40.101:50530->34.196.11.110:443: read: connection reset by peer” module=state
Dec 26 15:51:20 spectre-player-XXXXXX mender: time=“2019-12-26T22:51:19+08:00” level=error msg=“failed to submit inventory data: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: read tcp 10.120.40.101:50550->34.196.11.110:443: read: connection reset by peer” module=“client_inventory”
Dec 26 15:51:20 spectre-player-XXXXXX mender: time=“2019-12-26T22:51:19+08:00” level=warning msg=“failed to refresh inventory: failed to submit inventory data: inventory submit failed: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: read tcp 10.120.40.101:50550->34.196.11.110:443: read: connection reset by peer” module=state
Dec 26 16:05:12 spectre-player-XXXXXX mender: time=“2019-12-26T23:05:11+08:00” level=error msg=“Error receiving scheduled update data: update check request failed: Get https:/hosted.mender.io/api/devices/v1/deployments/device/deployments/next?artifact_name=artifact-2019-08-16-07%3A38&device_type=raspberrypi3: EOF” module=mender
Dec 26 16:05:12 spectre-player-XXXXXX mender: time=“2019-12-26T23:05:11+08:00” level=error msg=“update check failed: transient error: update check request failed: Get https:/hosted.mender.io/api/devices/v1/deployments/device/deployments/next?artifact_name=artifact-2019-08-16-07%3A38&device_type=raspberrypi3: EOF” module=state
Dec 26 16:05:12 spectre-player-XXXXXX mender: time=“2019-12-26T23:05:11+08:00” level=info msg=“State transition: update-check [Sync] → error [Error]” module=mender
Dec 26 16:05:12 spectre-player-XXXXXX mender: time=“2019-12-26T23:05:11+08:00” level=info msg=“handling error state, current error: transient error: update check request failed: Get https:/hosted.mender.io/api/devices/v1/deployments/device/deployments/next?artifact_name=artifact-2019-08-16-07%3A38&device_type=raspberrypi3: EOF” module=state
Dec 26 16:05:12 spectre-player-XXXXXX mender: time=“2019-12-26T23:05:11+08:00” level=info msg=“State transition: error [Error] → idle [Idle]” module=mender
Dec 26 16:05:12 spectre-player-XXXXXX mender: time=“2019-12-26T23:05:11+08:00” level=info msg=“authorization data present and valid” module=mender
Dec 26 16:05:12 spectre-player-XXXXXX mender: time=“2019-12-26T23:05:11+08:00” level=info msg=“State transition: idle [Idle] → check-wait [Idle]” module=mender
Dec 26 16:23:31 spectre-player-XXXXXX mender: time=“2019-12-26T23:23:31+08:00” level=error msg=“failed to submit inventory data: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: EOF” module=“client_inventory”
Dec 26 16:23:31 spectre-player-XXXXXX mender: time=“2019-12-26T23:23:31+08:00” level=warning msg=“failed to refresh inventory: failed to submit inventory data: inventory submit failed: Patch https:/hosted.mender.io/api/devices/v1/inventory/device/attributes: EOF” module=state

Interestingly, I’m noticing these logs don’t feature any timeouts, but instead a lot of what must be TCP RSTs. These are typical of the GFW.
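
For anyone wanting to put numbers on this from their own device logs, here is a minimal Python sketch that tallies failures by kind. The function name and the pattern list are my own assumptions, not anything shipped with Mender; the matched strings are simply the error texts seen in this thread plus Go's usual "i/o timeout". It deduplicates on the `time=` field because the client reports each failure more than once (from different modules).

```python
import re
from collections import Counter

# Ordered (label, pattern) pairs; the first matching pattern wins.
# Assumption: these substrings cover the failure kinds of interest.
FAILURE_PATTERNS = [
    ("connection reset", re.compile(r"connection reset by peer")),
    ("timeout", re.compile(r"timed out|i/o timeout")),
    ("EOF", re.compile(r": EOF\b")),
]

# Accepts both straight quotes (real logs) and the curly quotes the forum renders.
TIME_RE = re.compile(r'time=["“]?([0-9T:+-]+)')


def tally_failures(lines):
    """Count distinct network failures per kind in mender client log lines."""
    seen = set()
    counts = Counter()
    for line in lines:
        # Skip info lines, which only repeat an error already reported.
        if "level=error" not in line and "level=warning" not in line:
            continue
        label = next(
            (name for name, pat in FAILURE_PATTERNS if pat.search(line)), None
        )
        if label is None:
            continue
        match = TIME_RE.search(line)
        # The same failure is logged by several modules with one timestamp.
        key = (match.group(1) if match else line, label)
        if key in seen:
            continue
        seen.add(key)
        counts[label] += 1
    return counts
```

Fed with the output of something like `journalctl -u mender` (the unit name may differ on your image), this gives a rough per-kind failure count that can be compared across devices inside and outside China.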

Hey guys (and happy new year :wink: )
@drewmoseley where are hosted mender artifacts served from? Are files served straight from S3 or from a CDN (e.g. CloudFront)? Which region? I’m asking this on the assumption that not all origins are treated equally by the GFW.
@peter what do you make of these logs ?

Best

@peter can provide the details of where the images are served from. I don’t know off the top of my head.

It is all crossing the firewall in question. Deployments are served from region us-east-1.

peter

Hi guys,

I have learned a lot on the topic of deliverability lately, especially in the context of the Chinese Internet.

To put it simply, you really need to put a CDN in front of S3. It’s dead easy to set up and will do wonders for your deliverability issues (yeah you have some :wink:). Right now, AFAICT all artefact downloads are served straight from S3 us-east-1. This yields some really bad bandwidth and latency in a lot of corners of the world (which is rather fine for OTA updates) and a high failure rate in mainland China, which is very much less fine. You can enable S3 logging to monitor and put a number on that if you’re interested.

For the world, a simple CloudFront CDN will do magic and can likely be deployed to production in minutes. For mainland China, you’ll need a private link through the GFW which can be obtained from one of the Chinese ISPs (China Telecom, China Unicom, China Mobile), so your Chinese CDN PoPs have a reliable link to your S3 origin. Expect something in the $1-5k/mo range for the link, and you’ll need to set up a Chinese company to obtain an ICP number. Larger CDN providers like Verizon or Akamai can likely provide a more streamlined, one-stop solution.

CDN delivery would fit very well with your Enterprise plan…

Best

Hello @guillaumekh!

Thanks a lot for the suggestions and your input, we will consider it as an addition to the Mender roadmap.

peter

you can monitor the progress here: [MEN-3966] - Mender and CFEngine (by Northern.tech) Jira

peter

Hey @peter
Neat, thanks for the follow-up.

Hi @guillaumekh

my pleasure! I hope that we can make it easier for you very soon.

peter

Short update on China deliverability (please forgive the necroposting):

At our company, we have deployed a new CDN with “near-China” PoPs (in Hong Kong or Singapore). They cost a fraction of the system I described above back in September, with minimal impact on performance. Check out CDNetworks/Wangsu and ChinaCache in particular.

Another key takeaway from our using a CDN over the last months: always monitor the performance (time to DNS/TCP/TLS/FirstByte, bandwidth, availability), preferably with “RUM” (real user metrics, i.e. metrics collected from mender-client itself). It seems a lot of CDN vendors cannot be trusted to maintain their advertised QoS over time…
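
To make the monitoring point concrete, here is a rough Python sketch of the kind of probe I mean, using only the standard library. It is not what we run in production and not part of mender-client; `probe`, `summarize` and the choice of a plain `HEAD` request are assumptions for illustration. It times the DNS, TCP, TLS and first-byte phases of one HTTPS request, then aggregates availability and median timings over a batch of samples. Bandwidth is left out to keep it short.

```python
import socket
import ssl
import statistics
import time

PHASES = ("dns", "tcp", "tls", "first_byte")


def probe(host, port=443, path="/", timeout=10.0):
    """Time each phase of one HTTPS request; return None if any phase fails."""
    try:
        t0 = time.monotonic()
        sockaddr = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)[0][4]
        t_dns = time.monotonic()
        # Connect to the resolved IP so DNS time is not counted twice.
        with socket.create_connection(sockaddr[:2], timeout=timeout) as sock:
            t_tcp = time.monotonic()
            ctx = ssl.create_default_context()
            with ctx.wrap_socket(sock, server_hostname=host) as tls:
                t_tls = time.monotonic()
                request = (
                    f"HEAD {path} HTTP/1.1\r\nHost: {host}\r\n"
                    "Connection: close\r\n\r\n"
                )
                tls.sendall(request.encode("ascii"))
                if not tls.recv(1):
                    return None  # peer closed before sending anything
                t_fb = time.monotonic()
    except OSError:
        # Covers DNS errors, timeouts, resets and TLS failures alike.
        return None
    return {
        "dns": t_dns - t0,
        "tcp": t_tcp - t_dns,
        "tls": t_tls - t_tcp,
        "first_byte": t_fb - t_tls,
    }


def summarize(samples):
    """Availability plus median seconds per phase over probe() results."""
    ok = [s for s in samples if s is not None]
    report = {"availability": len(ok) / len(samples) if samples else 0.0}
    for phase in PHASES:
        report[phase] = statistics.median(s[phase] for s in ok) if ok else None
    return report
```

Something like `summarize([probe("hosted.mender.io") for _ in range(10)])`, run periodically from devices in the field and shipped back with your other telemetry, is enough to spot a CDN or route degrading over time.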