Slow deployment

Since we moved the Mender server into a datacenter, deployments have been really slow. I applied the tactics described in [1], which make upgrade times acceptable. (1 to 1,5 hour update time, Mender file is about 80 MB)

Today, we pushed the first update to a device in the field. The update completed in 4 minutes.
Something must be wrong on our network, but we don't experience any other issues.

Can anyone suggest what to investigate?

[1] Speed up mender upgrade

I am a bit confused by these two lines:

which make upgrade times acceptable. (1 to 1,5 hour update time

and

The update completed in 4 minutes.

First you mention hours to complete an update, and then 4 minutes. Could you clarify this?

These are on different networks: on our office network an update takes hours, on a customer network it takes 4 minutes.

I’m pretty sure it’s something on our network, but I have no clue how to find out what…
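To put the two figures side by side, here is a rough back-of-the-envelope sketch in Python. It assumes the whole update time is spent transferring the roughly 80 MB file, which overstates the download share because writing to the partition also takes time, so treat the results as orders of magnitude only. Under that assumption the office network manages roughly 15 to 22 kB/s, against about 333 kB/s on the customer network.

```python
# Rough throughput implied by the numbers in this thread.
# Assumption: the entire update time is download time (it is not, exactly).
ARTIFACT_BYTES = 80 * 1000 * 1000  # about 80 MB


def throughput_kb_per_s(size_bytes: int, seconds: float) -> float:
    """Average throughput in kB/s (1 kB = 1000 bytes)."""
    return size_bytes / seconds / 1000


office_best = throughput_kb_per_s(ARTIFACT_BYTES, 1.0 * 3600)   # 1 hour
office_worst = throughput_kb_per_s(ARTIFACT_BYTES, 1.5 * 3600)  # 1.5 hours
customer = throughput_kb_per_s(ARTIFACT_BYTES, 4 * 60)          # 4 minutes

print(f"office:   {office_worst:.0f} to {office_best:.0f} kB/s")
print(f"customer: {customer:.0f} kB/s")
```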

I turned on debug logging for the Mender service, but found no clues there.

I would just try fetching an artifact with “wget” from the device. If it is still slow, then Mender can be eliminated and it could be a “general network” issue.

Could you try that?

How do I get the URL to use? I guess the Mender server is secured against unauthorized downloads?

You can check the Mender client log: the generated URL is printed there. It is a pre-signed URL and is usable for 24 hours.
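If wget is not handy, or you want a number to compare between networks, a small Python sketch like the following can time a plain, single-request download of that URL. The function name `timed_download` is made up for this example, and the URL has to be the one copied from your own client log.

```python
import time
import urllib.request


def timed_download(url: str, chunk_size: int = 64 * 1024):
    """Fetch `url` with one plain GET, discard the data,
    and return (bytes_received, seconds_elapsed)."""
    total = 0
    start = time.monotonic()
    with urllib.request.urlopen(url) as resp:
        while True:
            chunk = resp.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total, time.monotonic() - start
```

Usage would be something like `size, secs = timed_download(url_from_log)` followed by printing `size / secs / 1000` as kB/s. Running it once on the slow network and once on a fast one gives a baseline that does not involve the Mender client at all.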

I did some assumption-checking:

  • wget downloads as fast as I expected

  • dd-ing the passive root partition takes 4 minutes

  • ran Wireshark on the traffic, but because HTTPS is used I cannot see the contents. Nothing very interesting there, but I’m not too familiar with these things

Any suggestions for what I could try to pinpoint this further?

Made a big breakthrough:

On our network, we use a pfSense-based firewall. The underlying firewall software (pf, the BSD packet filter) has an option to pre-filter TCP packets, for example for bad combinations of flags. Disabling that functionality solves my speed issues…

Now imo the big question is: what is Mender doing to trigger this…

I could offer a Wireshark trace of the data during a (slow) upgrade…

Hard to say anything specific about the why, but there is one difference between doing a wget and downloading with the Mender client: the Mender client uses HTTP range requests. One of the reasons is to support resuming downloads after a network interruption.

HTTP range requests are a standard HTTP feature and I do not really think they are what raises the “red flags”, but it is a notable difference in how content is moved over the network.
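To compare a ranged download with the plain wget case on the same network, a minimal Python sketch of a resumable download could look like this. It is not how the Mender client is implemented; it only illustrates the standard `Range: bytes=<offset>-` header, and the helper names `build_range_request` and `resume_download` are invented for the example.

```python
import os
import urllib.request


def build_range_request(url: str, offset: int) -> urllib.request.Request:
    """Ask for everything from byte `offset` to the end of the resource."""
    req = urllib.request.Request(url)
    req.add_header("Range", f"bytes={offset}-")
    return req


def resume_download(url: str, dest_path: str, chunk_size: int = 64 * 1024) -> int:
    """Continue a partial download into dest_path; return the final file size.

    Note: if the file is already complete, a server will typically answer
    416 and urlopen raises HTTPError; this sketch does not handle that.
    """
    offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    with urllib.request.urlopen(build_range_request(url, offset)) as resp:
        # 206 means the server honoured the range, so append.
        # Anything else (usually 200) means it sent the whole file, so overwrite.
        mode = "ab" if resp.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while True:
                chunk = resp.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
    return os.path.getsize(dest_path)
```

Interrupting the script partway and running it again exercises the resume path. If this is as fast as wget on the slow network, range requests by themselves are probably not what the firewall reacts to.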

It seems the scrubbing does not work very well with TCP windows, which are probably being used here.

See http://openbsd-archive.7691.n7.nabble.com/Scrub-reassemble-tcp-td259581.html