Documentation on state transitions

I think the documentation is not entirely correct in some places.

From the documentation on Mender 2.5:

State scripts can either be run as we transition into a state; “Enter”, or out from a state, “Leave”. Most of the states also have an “Error” transition which is run when some error occurs while executing any action inside the given state (including execution of Enter and Leave scripts).

Are you really sure that’s true? I don’t have any logs on hand to support this, but I think I observed that errors in the Enter and Leave scripts lead to different consequences. That would make sense to me, because when there’s an error in the Leave script, whatever was done within the state likely has to be reverted. That’s not the case when there’s an error in the Enter script.

Additionally, I observed the following:

  • Errors in ArtifactInstall_Leave will trigger a transition to ArtifactRollback. I couldn’t see this in the docs.
  • Errors in ArtifactCommit_Leave lead to the artifact being marked as _INCONSISTENT. There is a separate page on that issue, but I think the page on the state scripts would benefit from carrying this information as well, or at least linking to that page.
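
The two observations above can be written down as a small lookup table. This is only a sketch of what was observed in this thread, written in Python for brevity; the dictionary and function names are hypothetical and not part of Mender.

```python
# Consequences of a failing state script, as observed in this thread.
# Hypothetical helper for illustration; not part of the Mender client.
OBSERVED_ON_ERROR = {
    "ArtifactInstall_Leave": "transition to ArtifactRollback",
    "ArtifactCommit_Leave": "artifact marked as _INCONSISTENT",
}


def observed_consequence(script: str) -> str:
    """Return what was observed when the given state script fails."""
    return OBSERVED_ON_ERROR.get(script, "not observed in this thread")


if __name__ == "__main__":
    for name in ("ArtifactInstall_Leave", "ArtifactCommit_Leave", "ArtifactInstall_Enter"):
        print(f"{name}: {observed_consequence(name)}")
```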

Best regards,
Manuel

Thanks @manuel_vps that’s good input. I’ll leave it to @kacf to comment on the specifics.

Hi again, hi @kacf :slight_smile:

I hope you don’t mind me spamming your support board. I have set aside capacity to work on the deployment-related parts of our product, so that’s why I’m more active now than I used to be.

I consulted the documentation on the points below but couldn’t find enough detail to feel confident.

On our devices we observed the following behaviour:

  • During the deployment, after booting into the new artifact, the client tries to contact the server periodically. The interval is the one we set in RetryPollIntervalSeconds.
    • If it succeeds, the server reports the deployment as successful
    • but if this fails 10 times, the device rolls back.
  • After rollback, the device again tries to contact the server. But this time the interval seems to be hardcoded to 10 min. (We observed an interval of 10 min even if RetryPollIntervalSeconds is set to some other value.)
    • If the client succeeds, the server reports the deployment as failed.
    • but if this fails 10 times, the client goes back to idle state.
  • If we then enable internet access, the client will happily pick up the deployment and run the whole deployment cycle again.
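
The sequence in the bullets above can be modelled roughly as follows. This is a sketch of the observed behaviour only, in Python for brevity; the function names, the `MAX_ATTEMPTS` constant and the outcome strings are hypothetical, and the waiting time between attempts is left out.

```python
from typing import Callable

# Observed in this thread, not a documented constant.
MAX_ATTEMPTS = 10


def reaches_server(can_reach: Callable[[], bool], attempts: int = MAX_ATTEMPTS) -> bool:
    """Try up to `attempts` times and stop at the first success.

    The real client waits between attempts; the wait is omitted here.
    """
    return any(can_reach() for _ in range(attempts))


def observed_outcome(up_after_boot: Callable[[], bool],
                     up_after_rollback: Callable[[], bool]) -> str:
    """Model the sequence described in the bullets above."""
    if reaches_server(up_after_boot):
        return "deployment reported as successful"
    # All attempts failed: the device rolls back, then tries again.
    if reaches_server(up_after_rollback):
        return "deployment reported as failed"
    return "client idle, deployment picked up again later"
```

For example, `observed_outcome(lambda: False, lambda: True)` models a device that is offline after booting the new artifact and back online after the rollback.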

I’d have the following questions:

  1. Is our observation correct? Except for the RetryPollIntervalSeconds I couldn’t find anything in the docs on that. It would be good to know if the behaviour is really that hardcoded or if there are some configuration options. If not we might look into patching the mender client source code.
  2. Ultimately, we want to manipulate when or how often the mender client tries to talk to the server to commit the deployment, without affecting the rollback behaviour.
    2.1 Can we force-trigger the client to reattempt talking to the server? Perhaps with a signal, just like one can trigger the client to check for updates by sending the USR1 signal?
    2.2 If not, we would likely patch the source code to change the retry count to something better suited to us. Any thoughts on that, or things we might want to know?
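
For reference, the existing update-check trigger mentioned above amounts to sending SIGUSR1 to the client process. A minimal sketch in Python, assuming you already know the client's PID; the function name is hypothetical, and this only illustrates the update check, not a way to force a retry of the commit report.

```python
import os
import signal


def request_update_check(pid: int) -> None:
    """Send SIGUSR1 to the process with the given PID.

    `pid` is assumed to be the PID of the running Mender client; how you
    look it up (service manager, PID file, ...) depends on your system.
    """
    os.kill(pid, signal.SIGUSR1)
```

The caller needs permission to signal the target process, which typically means running as the same user or as root.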

Thanks!

I can’t immediately see that this should be the case. To me it looks like both sections refer to the GetRetryPollInterval() call. This would need additional debugging to figure out the cause.

Not at this time, I’m afraid.

Just make sure you understand the maxSendingAttempts function. This is where the “magic happens”, so to speak.

In addition to that, you will need to raise the maximum data store count, or you will get failures that the client thinks it’s stuck in a loop. This is essentially how many times you are allowed to change state in an update.

Thanks for your input, @kacf .

I can’t immediately see that this should be the case. To me it looks like both sections refer to the GetRetryPollInterval() call. This would need additional debugging to figure out the cause.

That was a mistake on my side, I think. I missed the fact that after the rollback, the RetryPollIntervalSeconds of the original artifact (which happened to have a different value) would be in effect again.

Just make sure you understand the maxSendingAttempts function. This is where the “magic happens”, so to speak.

We will keep that in mind! The maximum retry count of 10 is not currently part of the documentation, is it?

// try to send failed report at least minRetries times or keep trying every
// ‘retryPollInterval’ for the duration of two ‘updatePollInterval’, or a
// maximum of 10 times
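
Read literally, that comment suggests logic along the following lines. This is a Python model of the comment's wording only, not a translation of the client's Go code; the parameter names, the `cap` default and the handling of a zero retry interval are assumptions.

```python
def max_sending_attempts(update_poll_s: int, retry_poll_s: int,
                         min_retries: int, cap: int = 10) -> int:
    """Model of the quoted comment; not the client's actual code.

    Retry every `retry_poll_s` seconds for the duration of two update poll
    intervals, but never fewer than `min_retries` times and (assumed) never
    more than `cap` times.
    """
    if retry_poll_s <= 0:
        # Assumption: without a usable retry interval, fall back to the minimum.
        return min_retries
    # Number of retries that fit into two update poll intervals.
    in_window = (2 * update_poll_s) // retry_poll_s
    return max(min_retries, min(in_window, cap))
```

With an update poll interval of 1800 s and a retry interval of 300 s, twelve retries would fit into the window, so under this reading the cap of 10 applies.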

If you can, I’d also ask you to take a look at the initial post in this thread and comment on that.

You’d have to back this up with logs I think. I’m pretty sure this works as described, and we have a lot of tests in this area.

Many people use both Enter and Leave scripts to make changes to the system, so it makes sense to run Error scripts in both cases. The Error script will then have to figure out exactly what needs to be done. But in most cases it’s better to put such steps in an ArtifactRollback script.

Right, this arrow is missing in the diagram, I can fix that.

Thanks, I can fix this too!