Deployment time varies significantly depending on the number of devices

We are currently preparing for the rollout of a new IoT product that uses Mender for OTA updates. We have noticed that deploying to a single device with a rootfs of roughly 750 MB (~100 MB artifact size) takes 8-10 minutes. If we set up a deployment for, say, 2 devices at a time and power them both up, the time jumps to around 20 minutes, and it gets worse as we increase the number of devices. I realize some of this could be due to bottlenecks in the local WiFi infrastructure. That said, if I download the artifact directly onto a device with wget, the download takes 3-4 minutes.

We are currently using an AWS t3.medium instance to run the Mender server, and S3 for the artifacts. We have also noticed that, regardless of the actual WiFi download rate of each device, the progress bar seems to move at the same speed for every device. Does anyone have additional insight into the mechanics of how updates are streamed, and where we might look for bottlenecks in the process?

Also, what are you all seeing in terms of update times for a given rootfs size and/or artifact size?

Hi @drewwestrick,

We are currently utilizing an AWS t3.medium instance to run the mender server. We are also using S3 for the artifacts

If you are using S3 for storage of Mender Artifacts, then I suspect that this is related to “local saturation” of the network.

Once a device starts a download, it downloads the artifact directly from S3. The Mender Server is not involved at all at this stage, which means the traffic does not go through your “t3.medium” instance. You would need plenty of devices to saturate the capacity of S3 :slight_smile:, and that is why it is designed this way: to allow large-scale deployments.

the progress bar seems to move at the same speed for every device

Unfortunately, the progress bar is just a mock-up based on an estimated time that is the same for everyone. That is why it is perceived as moving at the same speed for every device.

The reason for this is (as described above) that the Mender Server is not involved at all in the Download stage, so it cannot know how many bytes the client has downloaded (the client does not report this). The Mender client only reports state changes to the server, e.g. going from Download → Install.

the file takes 3 - 4 minutes to download via wget.

Note that the Mender client will download the image and stream to the storage medium using synchronous writes, comparable to wget && sync.
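To see how much of a "wget && sync" style transfer is spent flushing to storage rather than receiving data, the two phases can be timed separately. The following is only a minimal sketch, not how the Mender client is implemented: `timed_copy` and its `chunk_size` parameter are hypothetical names chosen for illustration, and it writes to an ordinary file rather than to a partition.

```python
import os
import time


def timed_copy(src, dst_path, chunk_size=1024 * 1024):
    """Copy a binary stream to dst_path and time the copy and the fsync separately.

    src is any object with a read(n) method, for example the response
    returned by urllib.request.urlopen(url).
    """
    total = 0
    start = time.monotonic()
    with open(dst_path, "wb") as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            dst.write(chunk)
            total += len(chunk)
        dst.flush()
        copied = time.monotonic()
        # Force buffered data out to the storage medium, similar to `sync`.
        os.fsync(dst.fileno())
        synced = time.monotonic()
    return {
        "bytes": total,
        "copy_seconds": copied - start,
        "sync_seconds": synced - copied,
    }
```

A large `sync_seconds` relative to `copy_seconds` would point at the storage medium rather than the network. Note that the copy phase can also stall on storage once the kernel's write cache fills up, so the split is only approximate.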

This might also depend on which Mender client version you are using, because there was a significant improvement in write speed in Mender Client 2.x, largely due to writing in larger blocks. I would expect Mender Client 2.x to be comparable to wget; earlier versions were a bit slower.

But in the end, if your network connectivity is optimal, it is the storage medium of the device that is the bottleneck.

Also, what are you all seeing in terms of update times for a given rootfs size and/or artifact size?

This is very hard to compare across different devices, and it will largely depend on the hardware write speed of the storage medium, e.g. whether you are using eMMC, SD, raw NAND, SSD or USB, and what the specifications of the interfaces on your SoC are.

I have hardware that is able to download and write 1.5 GB (468 MB Mender Artifact) in less than 2 minutes, but an artifact of similar size can take 10-20 minutes on different hardware.
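One way to make such numbers comparable is to convert them to effective throughput. The sketch below uses only the figures quoted in this thread and assumes decimal megabytes; `throughput_mb_per_s` is a hypothetical helper name.

```python
def throughput_mb_per_s(size_mb, minutes):
    """Average throughput in MB/s for size_mb transferred in the given minutes."""
    return size_mb / (minutes * 60.0)


# (label, size in MB, fastest time in minutes, slowest time in minutes)
cases = [
    ("rootfs written, single device (750 MB in 8-10 min)", 750, 8, 10),
    ("artifact fetched with wget (100 MB in 3-4 min)", 100, 3, 4),
]

for label, size_mb, fast_min, slow_min in cases:
    low = throughput_mb_per_s(size_mb, slow_min)
    high = throughput_mb_per_s(size_mb, fast_min)
    print(f"{label}: {low:.2f} to {high:.2f} MB/s")
```

By the same arithmetic, writing 1.5 GB in under 2 minutes corresponds to more than 12.5 MB/s of uncompressed data, while the wget figure above works out to roughly half a megabyte per second over the network.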

@mirzak,

Thanks for answering my questions and for the additional insights. We were already leaning heavily towards this being a WiFi firmware/driver or local network issue, but I wanted to make sure I wasn’t overlooking a setting or rate limit in the Mender server software. It sounds like if I were to wget the Mender artifact and then do a manual install on the device, the total time should be comparable to the standard “streamed” method normally used during a deployment. We are going to run a series of WiFi-specific tests today on both a single device and multiple devices to see if we can track down the issue. I have confirmed that both the server and the Mender client on the device are running version 2.0.1.

For troubleshooting, if you run the update in standalone mode, the stdout from the Mender executable will give details about download vs. write time, etc. As far as I know, you can use the S3 URI directly in standalone mode, but you have to look at the Mender logs in managed mode to get a URI generated.

@drewmoseley,

This is more or less what I had in mind. We run Mender in debug mode and pipe the output to a log file so we can grab the signed AWS S3 URL and run these tests with wget, standalone mode, etc. Thanks for the input.
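Pulling the signed URL out of a captured log could be scripted along these lines. This is a rough sketch under stated assumptions: the exact Mender log format is not shown in this thread, so the code only assumes that the URL appears as plain text and carries the `X-Amz-Signature` query parameter that AWS Signature Version 4 presigned URLs include. `find_presigned_urls` is a hypothetical helper, and undoing `\u0026` escaping is a guess at how an ampersand might be encoded in a log line.

```python
import re

# Assumption: the presigned URL is logged as plain text and contains the
# SigV4 "X-Amz-Signature" query parameter with a hexadecimal value.
PRESIGNED_URL = re.compile(
    r"https?://[^\s\"'<>]+X-Amz-Signature=[0-9a-fA-F]+[^\s\"'<>]*"
)


def find_presigned_urls(log_text):
    """Return the unique presigned-looking URLs in log_text, in order of appearance."""
    # Assumption: "&" may have been escaped as \u0026 by a JSON-style encoder.
    text = log_text.replace("\\u0026", "&")
    urls = []
    for match in PRESIGNED_URL.finditer(text):
        url = match.group(0)
        if url not in urls:
            urls.append(url)
    return urls
```

Keep in mind that presigned URLs expire, so a URL taken from an old log may no longer be usable for wget or standalone tests.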