Bug #40490
Massive packet loss en route to gw.sepia.ceph.com
Description
This is somewhat similar to #14767; I'm not sure if there's anything you can do about this from your end. Since last week, I've been having a lot of trouble connecting to some of the Ceph infrastructure, especially the Etherpad, which is basically unusable for me now.
A traceroute from my location reveals unusually high packet loss at two nodes, ae-2-3602.ear3.Washington1.Level and 8.43.84.3:
My traceroute  [v0.92]
metis.fritz.box (192.168.178.114)                 2019-06-24T10:30:04+0200
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                             Packets               Pings
 Host                               Loss%   Snt   Last    Avg   Best   Wrst  StDev
 1. fritz.box                        0.0%   215    5.5    8.6    0.6   38.3    5.2
 2. hhb1000cihr001.versatel.de       0.0%   215    9.0    7.0    5.3   29.7    3.1
 3. 62.214.38.9                      0.0%   215    5.5    5.5    4.5   44.6    3.5
 4. 62.214.37.158                    0.0%   215    5.3    5.8    4.8   36.3    3.0
 5. 213.242.108.153                  0.0%   215    9.4    9.4    9.0   10.3    0.2
 6. ae-2-3602.ear3.Washington1.Level 88.8%  215  314.5  180.4  109.1  1109.  218.3
 7. 4.16.240.122                     0.0%   215  115.4  115.8  115.3  133.5    1.6
 8. 8.43.84.1                        0.0%   215  134.2  135.8  120.9  246.0   19.1
 9. 8.43.84.3                       30.7%   215  115.8  115.9  115.6  116.7    0.2
10. 8.43.84.4                        0.0%   215  135.5  140.2  119.6  246.1   24.4
11. 8.43.84.190                      0.0%   214  123.6  129.3  116.4  244.2   24.5
12. gw.sepia.ceph.com                0.0%   214  116.1  116.0  115.6  116.7    0.2
History
#1 Updated by Lenz Grimmer almost 5 years ago
Looks like the host name was truncated in this output - it's ae-2-3602.ear3.Washington1.Level3.net.
#2 Updated by David Galloway almost 5 years ago
I've been told in the past by Red Hat IT that ICMP packet loss at both of those hops is typical. ICMP packets are low priority, so if the equipment is overloaded, it will drop them first. I don't know how factual that is, but here's my MTR for reference.
My traceroute  [v0.92]
p50 (192.168.1.45)                                2019-06-26T12:39:46-0400
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                      Packets               Pings
 Host                                        Loss%   Snt   Last    Avg   Best   Wrst  StDev
 1. 192.168.1.1                               0.0%   154    0.3    0.3    0.2    0.8    0.1
 2. cpe-76-182-56-1.nc.res.rr.com             0.0%   153  159.1  173.0   37.2  248.0   41.6
 3. cpe-174-111-105-112.triad.res.rr.com      0.0%   153  168.6  174.5   37.6  255.5   41.9
 4. cpe-024-025-062-104.ec.res.rr.com         0.0%   153  152.5  173.0   40.3  246.2   42.8
 5. be31.drhmncev01r.southeast.rr.com         0.0%   153  145.1  177.9   50.5  251.8   43.9
 6. 66.109.10.176                             0.0%   153  143.2  184.6   51.5  261.1   46.0
 7. 0.ae4.pr0.dca10.tbone.rr.com              0.7%   153  115.3  178.6   61.4  263.9   44.6
 8. 107.14.16.82                              0.0%   153   85.3  181.0   47.8  271.1   45.5
 9. ???
10. ???
11. et-3-3-0.582.rtsw.rale.net.internet2.edu  0.0%   153   79.4  185.4   31.0  262.9   41.4
12. 198.71.47.222                             0.0%   153  178.2  187.9   47.1  255.0   38.7
13. 128.109.25.14                             0.0%   153  177.5  187.5   70.6  260.5   39.6
14. 8.43.84.1                                 0.7%   153  210.7  229.0  100.0  361.4   46.0
15. 8.43.84.3                                 2.0%   153  189.2  188.9   40.6  269.6   43.0
16. 8.43.84.4                                 0.0%   153  175.3  207.7   67.5  354.1   46.8
17. 8.43.84.190                               0.0%   153  141.1  211.2   73.0  348.1   48.7
18. gw.sepia.ceph.com                         0.7%   153  205.5  185.6   62.0  265.6   41.5
Do you know if anyone else in Europe is seeing the same? This is the first I'm hearing of it.
#3 Updated by Lenz Grimmer almost 5 years ago
- File Screenshot from 2019-06-27 09-44-58.png View added
David Galloway wrote:
I've been told in the past by Red Hat IT that ICMP packet loss at both of those hops is typical. ICMP packets are low priority, so if the equipment is overloaded, it will drop them first. I don't know how factual that is, but here's my MTR for reference.
[...]
Do you know if anyone else in Europe is seeing the same? This is the first I'm hearing of it.
I'm not aware of anybody else, so maybe the root cause of my problem lies elsewhere. For me, Etherpads currently load very slowly (~30 seconds per page).
As soon as I try to make changes to them, I get disconnected with the following error message:
After clicking Force reconnect, the page reloads, but all my recent changes are gone.
I'll investigate more, maybe it's a different issue on my end.
#4 Updated by David Galloway almost 5 years ago
If you're comfortable sharing your public IP along with timestamps and the pad you were working on, I can take a look at server-side logs.
#5 Updated by Lenz Grimmer almost 5 years ago
David Galloway wrote:
If you're comfortable sharing your public IP along with timestamps and the pad you were working on, I can take a look at server-side logs.
Sure! Just ran into the issue again while trying to edit https://pad.ceph.com/p/ceph-newsletter-june2019 (check the logs shortly after 15:27 CET).
My current IP: 82.207.219.209 (muedsl-82-207-219-209.citykom.de)
Thanks a lot for your help!
#6 Updated by Lenz Grimmer almost 5 years ago
Update: I have now run mtr with TCP and still observe significant packet loss at that particular router:
$ mtr --report --tcp --port=443 pad.ceph.com
Start: 2019-06-28T11:15:50+0200
HOST: metis.fritz.box              Loss%   Snt   Last    Avg   Best   Wrst  StDev
  1.|-- _gateway                    0.0%    10    1.0    0.9    0.7    1.0    0.1
  2.|-- hhb1000cihr001.versatel.d   0.0%    10   14.0   10.7    6.6   16.3    3.6
  3.|-- 62.214.38.9                 0.0%    10    5.4    5.6    5.4    5.8    0.1
  4.|-- 62.214.37.158               0.0%    10    5.7    5.9    5.6    6.4    0.3
  5.|-- 213.242.108.153             0.0%    10    9.8   10.2    9.8   10.8    0.3
  6.|-- ae-2-3602.ear3.Washington  50.0%    10  7346.  2413.  110.1  7346. 3056.3
  7.|-- 4.16.240.122                0.0%    10  116.7  116.5  116.1  117.4    0.4
  8.|-- 8.43.84.1                   0.0%    10  118.9  125.6  116.5  142.1    9.5
  9.|-- 8.43.84.3                   0.0%    10  116.3  116.7  116.3  117.5    0.3
 10.|-- 8.43.84.4                   0.0%    10  124.5  123.5  118.3  127.9    3.0
 11.|-- 8.43.84.190                 0.0%    10  117.9  127.8  117.9  137.6    7.9
 12.|-- gw.sepia.ceph.com           0.0%    10  117.0  117.1  116.8  117.5    0.3
#7 Updated by David Galloway over 4 years ago
- Status changed from New to In Progress
I had our IT guy take a look at your mtr output, and he confirmed (basically) what I said previously: "Transit carriers will rate limit ICMP because it hits the control plane, so it's normal in some cases."
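The rule of thumb the IT team applied can be sketched programmatically. This is a minimal illustrative sketch (the `benign_hops` helper is hypothetical, not part of mtr or any tooling mentioned here): loss at an intermediate hop that does not carry through to the final hop usually means the router is deprioritizing ICMP, not that traffic is actually being dropped in transit.

```python
# Hypothetical helper illustrating the heuristic from this ticket:
# intermediate-hop loss is probably benign if the destination itself
# shows (almost) no loss, because real traffic is clearly arriving.

def benign_hops(hops, threshold=5.0):
    """Return intermediate hops whose loss is likely just ICMP deprioritization.

    hops: list of (hostname, loss_percent) tuples in path order.
    If the final hop's loss is below `threshold`, lossy middle hops
    are flagged as benign; if loss persists to the destination, it
    may be genuine, so nothing is flagged.
    """
    final_loss = hops[-1][1]
    if final_loss >= threshold:
        # Loss reaches the destination: possibly a real problem.
        return []
    return [host for host, loss in hops[:-1] if loss >= threshold]

# Loss figures taken from the TCP mtr run in comment #6 (abridged):
path = [
    ("_gateway", 0.0),
    ("hhb1000cihr001.versatel.d", 0.0),
    ("213.242.108.153", 0.0),
    ("ae-2-3602.ear3.Washington", 50.0),
    ("8.43.84.3", 0.0),
    ("gw.sepia.ceph.com", 0.0),
]

print(benign_hops(path))  # ['ae-2-3602.ear3.Washington']
```

By this reading, the 50% loss at the Level3 hop is cosmetic, which matches the IT team's conclusion that the path itself was healthy.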
I made some tweaks to the nginx config for pad.ceph.com based on the Etherpad docs. Can you let me know if things have improved, please?
I don't see anything obvious in the logs.
#8 Updated by Lenz Grimmer over 4 years ago
- Status changed from In Progress to Closed
Hmm, I'm not sure what has changed, but it works again now. Maybe I simply had too many Etherpads open in browser tabs and confused the server's session management?
Closing this one for now. Thanks a lot for your help!