multiple ESP8266s on one WPA2 network -> no ARP #3095

Closed
lexelby opened this issue Mar 28, 2017 · 26 comments

Comments

@lexelby

lexelby commented Mar 28, 2017

I had the issue so many others have had, where my ESP8266s would stop responding to ARP. In my case (and perhaps in others'?), they also stopped responding to MDNS queries (which ArduinoOTA uses for discovery). However, if I add an ARP entry manually, I can reach them. Unlike what others have reported, ARP/UDP broadcast reception seems to fail after a few hours, not 36.

Through a lot of trial and error, I determined that the problem can be triggered by having more than one ESP8266 on the same WPA2 Personal AES network. I created a WEP network, moved one ESP8266 over, and to my surprise, both started working stably for days. I moved the second over to the WEP network, and they're both still stable.

I wonder if this might have to do with group key renewal. Perhaps two ESP8266s confuse each other? I can't be certain, but I've never seen one fail while the other was stable. One thing is certain: they don't fail 100% after the first group key renewal, so perhaps it's partially random?

This is all with the latest master branch of this repo, so I definitely have a recent SDK.

In any event, I have a workaround, so I'm not begging for a fix. Perhaps this issue will also help others in the same boat.

@mtnbrit

mtnbrit commented Mar 28, 2017 via email

@lexelby
Author

lexelby commented Mar 28, 2017

@mtnbrit Thanks for the quick reply!

I suppose a router issue is possible. I have an Asus RT-AC66U running tomato shibby. I systematically modified every advanced wifi setting I could think of, to no avail. I upgraded the firmware to the latest. No dice.

I don't really like the idea that we might close this issue as "router issue". I've never had ARP or MDNS issues with any other device connected to it. I can clearly see in packet captures on other devices and on the router itself that the ESP8266 devices in this state simply do not send ARP or MDNS responses, even to the router itself.

I'm hardly going to go and buy another $100+ router on the chance that it fixes my problem, and I doubt others would either -- especially to make a <$5 part work.

Tomato shibby is rock solid, in my experience. If there is some kind of lack of adherence to standards or whatever, it only seems to affect ESP8266s, so they should just work around the issue. It seems far more likely to me that the ESP8266, with its relatively new network stack, has some kind of edge-case bug or standards compliance issue.

@mtnbrit

mtnbrit commented Mar 28, 2017 via email

@lexelby
Author

lexelby commented Mar 28, 2017

I didn't try that, but I doubt it would work. Not even resetting or power-cycling the ESP8266s would bring them back. They'd DHCP fine but still not respond to ARP and MDNS. Only a router reboot would do the job. Perhaps they somehow end up in some kind of temporary blacklist in the router for disobeying the WPA2 spec?

@mtnbrit

mtnbrit commented Mar 28, 2017 via email

@devyte
Collaborator

devyte commented Mar 29, 2017 via email

@davisonja

davisonja commented Mar 29, 2017 via email

@BrandonLWhite
Contributor

I'm having the same issue! Asus RT-N66U with Tomato Shibby 140. WPA2 Personal (PSK) + AES. No problems with a single ESP8266 connected. Within a week of adding the second, both quit responding to ARP. After a power cycle, the ESP8266 would reconnect to WiFi and the DHCP assignment appeared successful.

But ESP8266 devices would no longer respond to ping commands, except from one PC that had a previous ARP entry still cached.

@ccrisan

ccrisan commented Nov 20, 2017

I can confirm that multiple ESPs connected to my ASUS RT-AC66U (running rmerlin's latest firmware) fail to respond to ARP requests after a few days. I can also confirm that:

  • resetting the ESPs doesn't help
  • rebooting the router does fix the issue
  • manually adding an ARP entry for the ESPs on any machine temporarily fixes the problem

@papergion

Hi,
in my case the ESP responds correctly to an ARP request made by a PC, but doesn't respond to an ARP request made by another ESP.
ARP request made by a PC (packet dissection):

No.     Time           Source                Destination           Protocol Length Info
      2 94.496161000   HonHaiPr_e0:7a:ed     Broadcast             ARP      42     Who has 192.168.2.183?  Tell 192.168.2.14

Frame 2: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
Ethernet II, Src: HonHaiPr_e0:7a:ed (b0:10:41:e0:7a:ed), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
    Destination: Broadcast (ff:ff:ff:ff:ff:ff)
        Address: Broadcast (ff:ff:ff:ff:ff:ff)
        .... ..1. .... .... .... .... = LG bit: Locally administered address (this is NOT the factory default)
        .... ...1 .... .... .... .... = IG bit: Group address (multicast/broadcast)
    Source: HonHaiPr_e0:7a:ed (b0:10:41:e0:7a:ed)
        Address: HonHaiPr_e0:7a:ed (b0:10:41:e0:7a:ed)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: ARP (0x0806)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IP (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: HonHaiPr_e0:7a:ed (b0:10:41:e0:7a:ed)
    Sender IP address: 192.168.2.14 (192.168.2.14)
    Target MAC address: 00:00:00_00:00:00 (00:00:00:00:00:00)
    Target IP address: 192.168.2.183 (192.168.2.183)

ARP request made by another ESP (failed):

No.     Time           Source                Destination           Protocol Length Info
      1 0.000000000    5c:cf:7f:3c:d4:2d     Broadcast             ARP      42     Who has 192.168.2.183?  Tell 192.168.2.184

Frame 1: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
Ethernet II, Src: 5c:cf:7f:3c:d4:2d (5c:cf:7f:3c:d4:2d), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
    Destination: Broadcast (ff:ff:ff:ff:ff:ff)
        Address: Broadcast (ff:ff:ff:ff:ff:ff)
        .... ..1. .... .... .... .... = LG bit: Locally administered address (this is NOT the factory default)
        .... ...1 .... .... .... .... = IG bit: Group address (multicast/broadcast)
    Source: 5c:cf:7f:3c:d4:2d (5c:cf:7f:3c:d4:2d)
        Address: 5c:cf:7f:3c:d4:2d (5c:cf:7f:3c:d4:2d)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: ARP (0x0806)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IP (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: 5c:cf:7f:3c:d4:2d (5c:cf:7f:3c:d4:2d)
    Sender IP address: 192.168.2.184 (192.168.2.184)
    Target MAC address: 00:00:00_00:00:00 (00:00:00:00:00:00)
    Target IP address: 192.168.2.183 (192.168.2.183)

The only difference is in the "source" data.
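The comparison above can be reproduced with a minimal sketch in standalone C++ (not ESP8266 code). It rebuilds both 42-byte requests from the field values shown in the dissections and lists the byte offsets where they differ. The names `makeArpRequest`, `pcRequest`, `espRequest`, and `diffOffsets` are hypothetical helpers introduced only for this illustration. It models just the Ethernet/ARP layers visible in the capture, so it says nothing about 802.11-level differences such as how the broadcast frame was encrypted.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Mac = std::array<std::uint8_t, 6>;
using Ip4 = std::array<std::uint8_t, 4>;
// 14-byte Ethernet header + 28-byte ARP payload = 42 bytes, as in the captures.
using Frame = std::array<std::uint8_t, 42>;

// Build a broadcast ARP request ("who has targetIp? tell senderIp").
Frame makeArpRequest(const Mac& senderMac, const Ip4& senderIp, const Ip4& targetIp) {
    Frame f{};  // zero-filled, so the target MAC field stays 00:00:00:00:00:00
    std::size_t i = 0;
    for (int k = 0; k < 6; ++k) f[i++] = 0xFF;      // Ethernet destination: broadcast
    for (std::uint8_t b : senderMac) f[i++] = b;    // Ethernet source
    f[i++] = 0x08; f[i++] = 0x06;                   // EtherType: ARP (0x0806)
    f[i++] = 0x00; f[i++] = 0x01;                   // hardware type: Ethernet (1)
    f[i++] = 0x08; f[i++] = 0x00;                   // protocol type: IP (0x0800)
    f[i++] = 6;                                     // hardware size
    f[i++] = 4;                                     // protocol size
    f[i++] = 0x00; f[i++] = 0x01;                   // opcode: request (1)
    for (std::uint8_t b : senderMac) f[i++] = b;    // sender MAC address
    for (std::uint8_t b : senderIp) f[i++] = b;     // sender IP address
    i += 6;                                         // target MAC address: left as zeros
    for (std::uint8_t b : targetIp) f[i++] = b;     // target IP address
    assert(i == f.size());
    return f;
}

// Request sent by the PC (values copied from the first dissection).
Frame pcRequest() {
    return makeArpRequest(Mac{{0xb0, 0x10, 0x41, 0xe0, 0x7a, 0xed}},
                          Ip4{{192, 168, 2, 14}}, Ip4{{192, 168, 2, 183}});
}

// Request sent by the other ESP (values copied from the second dissection).
Frame espRequest() {
    return makeArpRequest(Mac{{0x5c, 0xcf, 0x7f, 0x3c, 0xd4, 0x2d}},
                          Ip4{{192, 168, 2, 184}}, Ip4{{192, 168, 2, 183}});
}

// Return every byte offset at which the two frames differ.
std::vector<std::size_t> diffOffsets(const Frame& a, const Frame& b) {
    std::vector<std::size_t> out;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) out.push_back(i);
    }
    return out;
}
```

Calling `diffOffsets(pcRequest(), espRequest())` yields offsets only inside the Ethernet source field (6 to 11), the ARP sender MAC field (22 to 27), and the last byte of the sender IP (31), which matches the observation that only the "source" data differs.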

@SupraJames

Me too. I lose the ability to ping my ESP modules after a few hours, and need to reboot to bring them back. I know it's the ARP issue because they are able to maintain a constant connection with a server, and otherwise operate correctly, until a ping comes from a machine which doesn't have them in its ARP cache.

FYI my router is a TP Link Archer C7, stock firmware.

I've dug out an ancient Linksys WRT54GL running dd-wrt and have moved them over to that today, to see if it helps.

@d-a-v
Collaborator

d-a-v commented Jan 9, 2018

Long discussion and workaround proposals are in #2330.
Testers are welcome.

@devyte
Collaborator

devyte commented Jan 9, 2018

@SupraJames here's something interesting: I use Archer C7s with dd-wrt, and I haven't encountered the arp issue.

@SupraJames

@devyte before I brick my router and annoy my wife with lack of internet, is there a build of DD-WRT that you recommend?

I have an Archer C7 V2 which does seem to be supported, and I can see the latest build is from Jan 7th, but I'd be wary of flashing something that new.

@devyte
Collaborator

devyte commented Jan 11, 2018

I have the same v2. I don't have my build number at hand, but I updated beginning of December, and no issues. I'd say go for it :P

@martin072

Just for info and referring to 2330, I am also experiencing issues with a TP-Link and the ESP.

@spilz87

spilz87 commented Jun 3, 2018

Me too, with a TP-Link Archer C7 V1.0 on stock firmware.
I spent a lot of time looking for the problem in my code, but I finally solved it by using the router's DHCP to assign fixed IPs by MAC address and binding each IP and MAC in the ARP table. Now all my ESPs run without issue for days!
I plan to buy the Archer C7 V2.0 (I found one for 35€).
Can you confirm there is no issue with that one?

@riker65

riker65 commented Jun 24, 2019

Having the same issue, especially when the ESP8266 has a static IP.

@riker65

riker65 commented Jun 24, 2019

@devyte
Collaborator

devyte commented Jul 2, 2019

@riker65 no, or at least not directly.

@riker65

riker65 commented Jul 3, 2019

@devyte,
Some Not directly
Some never
Some when I Ping them

This happens mainly when connected to UniFi APs.

@TD-er
Contributor

TD-er commented Jul 4, 2019

It does seem to be related to the power consumption of the ESP itself.
I have had a few ESPs here connected to a power supply that also displays its consumption.
The nodes I have do exchange some UDP packets every minute as a p2p protocol between nodes to exchange sensor data.
All nodes in the network are also displayed on a page with the last time one of these were seen (in minutes), so it is a very simple view on what nodes can reach others.

As soon as a node starts using less energy (some kind of "eco mode"), these packets are lost. Either missed by the receiving end or never reaching the receiving end. (ARP issue?)
This lower energy mode can be achieved by calling delay(...) as long as nothing has to be done.
But it may also happen to other nodes when not occupied full time. It may take a minute or sometimes even 10 minutes.
As soon as you're actively accessing several pages served by the node, or simply sending ping packets to it, it will jump back to 'normal' power consumption again.
Funny thing is, an ICMP echo request always gets a reply (as long as the ARP entry is known), but the first reply may take several hundred milliseconds.

I do send Gratuitous ARP packets from the ESP to overcome the ARP issue, but that still doesn't help for missing packets when the node is in some kind of "eco mode".
But this "eco mode" can explain why it may not reply to ARP packets in the first place.
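For readers unfamiliar with gratuitous ARP, here is a minimal sketch of what such a frame contains, written as standalone C++ rather than ESP8266 firmware. It uses the request-style announcement form, in which the sender and target IP fields both carry the node's own address and the target MAC is left as zeros. The helper names `makeGratuitousArp` and `isGratuitousArp` are hypothetical, and how the frame is actually handed to the network stack on the ESP is deliberately left out, since that depends on the core and lwIP version in use.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>

using Mac = std::array<std::uint8_t, 6>;
using Ip4 = std::array<std::uint8_t, 4>;
using Frame = std::array<std::uint8_t, 42>;  // Ethernet header (14) + ARP payload (28)

// Byte offsets of the ARP address fields inside the Ethernet frame.
constexpr std::size_t kSenderMac = 22;
constexpr std::size_t kSenderIp = 28;
constexpr std::size_t kTargetIp = 38;

// True if the frame is an ARP packet whose sender IP equals its target IP.
bool isGratuitousArp(const Frame& f) {
    const bool isArp = f[12] == 0x08 && f[13] == 0x06;
    const bool sameIp = std::equal(f.begin() + kSenderIp, f.begin() + kSenderIp + 4,
                                   f.begin() + kTargetIp);
    return isArp && sameIp;
}

// Build a broadcast ARP announcement for the node's own MAC/IP pair.
Frame makeGratuitousArp(const Mac& ownMac, const Ip4& ownIp) {
    Frame f{};  // zero-filled: the target MAC field (offsets 32..37) stays all zeros
    std::fill(f.begin(), f.begin() + 6, std::uint8_t{0xFF});   // Ethernet destination: broadcast
    std::copy(ownMac.begin(), ownMac.end(), f.begin() + 6);    // Ethernet source
    // EtherType ARP, hardware type Ethernet, protocol type IPv4,
    // hardware size 6, protocol size 4, opcode request.
    const std::uint8_t fixed[] = {0x08, 0x06, 0x00, 0x01, 0x08, 0x00, 6, 4, 0x00, 0x01};
    std::copy(std::begin(fixed), std::end(fixed), f.begin() + 12);
    std::copy(ownMac.begin(), ownMac.end(), f.begin() + kSenderMac);
    std::copy(ownIp.begin(), ownIp.end(), f.begin() + kSenderIp);
    std::copy(ownIp.begin(), ownIp.end(), f.begin() + kTargetIp);
    assert(isGratuitousArp(f));
    return f;
}
```

Because the frame is broadcast, every station that receives it can refresh its ARP cache without the ESP having to hear and answer a request. As noted above, this only helps peers learn the address; it does not prevent packets from being missed while the node is in its low-power state.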

@devyte
Collaborator

devyte commented Jul 5, 2019

@TD-er your observations sound like the modem sleep thing that was addressed in sdk pre3, where there are 2 sort of modem sleep states. In one of them, the esp could miss incoming packets due to desync of some beacon or something. In the other, all beacons were listened for, so no packets missed. I think the former case is the default and used in 2.2.x, while the latter is made available with a new api in pre3.
As a side note, the missed-packets case has lower power usage, and I suspect it is meant for deep sleep cases where only transmission is done on wakeup, while the other, good case with higher power usage is meant for serving, i.e. when you need to access the ESP remotely. That's just my guess, though.

@TD-er
Contributor

TD-er commented Jul 5, 2019

Is that the new "listeninterval" that can be set in SDK3?
Do you know if this is set in SDK2.2.x to some dynamic value?
Can we read its value somewhere? The wrapper in this repo is just returning 0 for SDK < 3

@d-a-v
Collaborator

d-a-v commented Jul 5, 2019

listenInterval is ignored in sdk2.

@devyte
Collaborator

devyte commented Oct 31, 2019

I have a whole bunch of ESPs on the same network. They have been running for several weeks without a single drop, all since #6484.
I'm closing this.
If anyone still encounters dropouts, please open a new issue and we can continue discussion there.
