Python client repeatedly gets "[Errno 104] Connection reset by peer"

Asked by Martin Assarsson

It seems to me that the MQTT queue is corrupted and the server is rejecting the client over and over again because of a malformed packet.
If I shut down the client and start it again, everything is fine.
The problem only appears after the client has been running for an extended time.

Just to test whether this is the problem, is there a way to reset the queue even if QoS is set to 2?

Question information

Language:
English
Status:
Solved
For:
mosquitto
Assignee:
No assignee
Solved by:
Roger Light
Roger Light (roger.light) said :
#1

Can you reproduce this consistently? What steps do you take?

Running mosq.reinitialise() will put the client into its original state as when it was first created.
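
A minimal sketch of how a non-threaded client could fall back to reinitialise() after repeated failures. It assumes that loop() returns 0 on success and non-zero on error, and that reinitialise() and connect() accept the arguments shown; check both against your copy of mosquitto.py. FakeClient, pump() and max_failures are hypothetical names used only so the sketch runs without a broker.

```python
class FakeClient:
    """Stand-in for demonstration only; not part of mosquitto.py."""

    def __init__(self):
        self.broken = True
        self.calls = []

    def loop(self):
        # Assumed convention: 0 means success, anything else is an error.
        return 1 if self.broken else 0

    def reinitialise(self, client_id):
        self.calls.append("reinitialise")
        self.broken = False

    def connect(self, host):
        self.calls.append("connect")


def pump(client, client_id, host, max_failures=3, iterations=10):
    """Drive the client; fully reset it after max_failures errors in a row."""
    failures = 0
    resets = 0
    for _ in range(iterations):
        if client.loop() == 0:
            failures = 0
            continue
        failures += 1
        if failures >= max_failures:
            # Put the client back into its freshly created state, then
            # connect again. A fresh state presumably has no callbacks
            # or subscriptions, so those would be set up again here.
            client.reinitialise(client_id)
            client.connect(host)
            failures = 0
            resets += 1
    return resets
```

With the stand-in above, pump(FakeClient(), "server-client", "localhost") performs exactly one reset and then runs cleanly for the remaining iterations.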

Martin Assarsson (martin-assarsson) said :
#2

At the moment the only pattern I can see is that error 104 occurs during low traffic at night.
It has happened twice so far in one week.
I cannot yet pinpoint what is triggering it.
Since the setup is so large, I need to see exactly when it happens and somehow record the moment.
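
One way to record the moment might be to log a timestamped line only when the loop result changes between working and failing, so that a long night of repeated errors does not flood the log. This is a sketch using the standard logging module; TransitionRecorder, build_logger and the logger name are hypothetical, and it assumes loop() returns 0 on success.

```python
import logging


class TransitionRecorder:
    """Log only when the loop result flips between OK and failing."""

    def __init__(self, logger):
        self.logger = logger
        self.last_ok = True

    def observe(self, rc):
        ok = (rc == 0)  # assumed convention: 0 means success
        if ok != self.last_ok:
            if ok:
                self.logger.info("loop() recovered")
            else:
                self.logger.error("loop() started failing, rc=%s", rc)
            self.last_ok = ok
        return ok


def build_logger(stream):
    """Create a logger that prefixes every line with a timestamp."""
    logger = logging.getLogger("mqtt.monitor")
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

In the client this would be called as recorder.observe(client.loop()) on every pass, with the stream replaced by a file opened for appending.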

Hardware:
* One Dell server
* 20 embedded ARM machines
* 40 Android and iOS handhelds

Environment:
* One central MQTT server on the Dell server
* One Python client on the Dell server. This is the one generating error 104
* On each ARM machine:
** One MQTT bridge
** Two different Python clients
** One C++ client
* On each handheld:
** Two different MQTT clients

The one client that gets these errors is also connected to a MySQL database.

Roger Light (roger.light) said :
#3

That does sound quite complicated :)

Is there any chance you could provide a cut-down copy of the Python
client with the problem? Just the MQTT-related parts should be
sufficient. Can I also check that you're using version 1.0.2? There
were some fixes in 1.0 that helped with this type of reconnecting
problem.

Martin Assarsson (martin-assarsson) said :
#4

Yes I'm using 1.0.2!
The problem seems to have appeared when we introduced 1.0.2.
At the same time as we upgraded to 1.0.2 we also migrated to new hardware, and at the moment I'm investigating the possibility that the hardware is the root cause.

  // Martin

Roger Light (roger.light) said :
#5

Martin,

Can you provide any details about the MQTT side of your Python client? Some things that might be important:

* Is it clean_session?
* What is the keepalive?
* Does it use the threaded interface (loop_start()) or the non-threaded one (loop())?
* If it uses loop(), what is the timeout value?

Tobias Assarsson (tobias-assarsson) said :
#6

Hello!
I'm also working with this.
It's a non-threaded approach and we are using the default values:
* clean_session (true)
* keepalive (60)
* loop() timeout (1)

Best Roger Light (roger.light) said :
#7

I've made some progress with this. After creating a version of the broker that deliberately misbehaved on occasion, I think I've isolated the problem and found a solution: some variables weren't being reinitialised on calling reconnect(). Most of the time this wouldn't matter, but in some situations it could produce the problem you were seeing.
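
To illustrate this class of bug (this is a toy model, not the code from the changeset): if a client keeps the bytes of a half-read packet across a reconnect, the first bytes from the new connection are appended to the stale ones and every following packet is framed wrongly. ToyConnection, its fixed 4-byte packets and the reset flag are all hypothetical.

```python
class ToyConnection:
    """Toy client that reads fixed-size 4-byte packets from a stream."""

    PACKET_SIZE = 4

    def __init__(self):
        self._reset_state()

    def _reset_state(self):
        # Bytes of the packet currently being read.
        self.partial = b""

    def feed(self, data):
        """Accumulate incoming bytes and return any complete packets."""
        self.partial += data
        packets = []
        while len(self.partial) >= self.PACKET_SIZE:
            packets.append(self.partial[:self.PACKET_SIZE])
            self.partial = self.partial[self.PACKET_SIZE:]
        return packets

    def reconnect(self, reset=True):
        """Simulate a reconnect; reset=False models the stale-state bug."""
        if reset:
            self._reset_state()


# The connection drops after half a packet has arrived.
buggy = ToyConnection()
buggy.feed(b"AB")
buggy.reconnect(reset=False)
garbled = buggy.feed(b"WXYZ")   # stale "AB" is glued onto the new data

fixed = ToyConnection()
fixed.feed(b"AB")
fixed.reconnect(reset=True)
clean = fixed.feed(b"WXYZ")     # per-connection state was cleared first
```

In the buggy case the first packet after the reconnect is b"ABWX" and two stray bytes remain buffered, so the framing never recovers until the process is restarted, which matches the symptom of the client only working again after a restart.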

I've committed the code in the changeset linked below. I plan to release it fairly soon as part of 1.0.3 - I've got a few other small fixes to include. If you could give it a test and report back I'd be grateful. The only file that has been updated that is relevant for you is mosquitto.py.

https://bitbucket.org/oojah/mosquitto/changeset/89743b5a25cae1e8201c1a3573518389a01d9ac2

Although I believe this fixes the problem with the reconnections there may still be another problem that is causing the disconnects in the first place. With keepalive==60 the broker has 60 seconds to respond to the client after it has sent its ping request. That's a long time for a packet to be delayed.
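
The timing described above can be sketched as a small decision function. It assumes the client sends a ping request after keepalive seconds without traffic and gives up if no response arrives within a further keepalive seconds; the function name, its arguments and the returned strings are hypothetical and only model the description, not the actual client code.

```python
def keepalive_action(now, last_activity, ping_sent_at, keepalive=60):
    """Decide what a client should do at time `now` (all values in seconds).

    ping_sent_at is None when no ping request is outstanding.
    """
    if ping_sent_at is not None:
        # A ping is outstanding: the broker has `keepalive` seconds to answer.
        if now - ping_sent_at >= keepalive:
            return "disconnect"
        return "wait"
    if now - last_activity >= keepalive:
        # The connection has been idle for a full keepalive interval.
        return "send_ping"
    return "idle"
```

With keepalive=60, a ping sent at t=60 that is still unanswered at t=120 leads to "disconnect", so a reset during quiet periods would point to a response being delayed or lost for a full minute.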

Martin Assarsson (martin-assarsson) said :
#8

Thanks Roger!
I will put this in our test environment immediately.
Tobias will verify this.

  // Martin

Martin Assarsson (martin-assarsson) said :
#9

This patch seems to work.

Martin Assarsson (martin-assarsson) said :
#10

Thanks Roger Light, that solved my question.