Bug #40490

Massive packet loss en route to gw.sepia.ceph.com

Added by Lenz Grimmer almost 5 years ago. Updated over 4 years ago.

Status: Closed
Priority: Normal
Category: DC ops
Target version: -
% Done: 0%
Regression: No
Severity: 3 - minor

Description

This is somewhat similar to #14767; I'm not sure if there's anything you can do about this from your end. Since last week, I've been having a lot of trouble connecting to some of the Ceph infrastructure, especially the Etherpad, which is basically unusable for me.

A traceroute from my location shows unusually high packet loss at two hops, ae-2-3602.ear3.Washington1.Level and 8.43.84.3:

                             My traceroute  [v0.92]
metis.fritz.box (192.168.178.114)                      2019-06-24T10:30:04+0200
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                       Packets               Pings
 Host                                Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. fritz.box                         0.0%   215    5.5   8.6   0.6  38.3   5.2
 2. hhb1000cihr001.versatel.de        0.0%   215    9.0   7.0   5.3  29.7   3.1
 3. 62.214.38.9                       0.0%   215    5.5   5.5   4.5  44.6   3.5
 4. 62.214.37.158                     0.0%   215    5.3   5.8   4.8  36.3   3.0
 5. 213.242.108.153                   0.0%   215    9.4   9.4   9.0  10.3   0.2
 6. ae-2-3602.ear3.Washington1.Level 88.8%   215  314.5 180.4 109.1 1109. 218.3
 7. 4.16.240.122                      0.0%   215  115.4 115.8 115.3 133.5   1.6
 8. 8.43.84.1                         0.0%   215  134.2 135.8 120.9 246.0  19.1
 9. 8.43.84.3                        30.7%   215  115.8 115.9 115.6 116.7   0.2
10. 8.43.84.4                         0.0%   215  135.5 140.2 119.6 246.1  24.4
11. 8.43.84.190                       0.0%   214  123.6 129.3 116.4 244.2  24.5
12. gw.sepia.ceph.com                 0.0%   214  116.1 116.0 115.6 116.7   0.2
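
(For context: loss that shows up only at an intermediate hop, while the final hop reports 0%, can simply mean that router deprioritizes ICMP replies addressed to itself rather than dropping transit traffic. A direct end-to-end loss check would look something like the following; the count and interval are arbitrary choices:)

$ ping -c 100 -i 0.5 gw.sepia.ceph.com | tail -n 2
# The two summary lines show transmitted/received counts and the loss
# percentage as seen by packets that traverse the entire path.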

Attachment: Screenshot from 2019-06-27 09-44-58.png (14.9 KB), Lenz Grimmer, 06/27/2019 07:45 AM

History

#1 Updated by Lenz Grimmer almost 5 years ago

Looks like the hostname was truncated in this output; it's ae-2-3602.ear3.Washington1.Level3.net.

#2 Updated by David Galloway almost 5 years ago

Red Hat IT has told me in the past that ICMP packet loss at both of those hops is typical: the packets are low priority, so if the equipment is overloaded, it will ignore ICMP. I don't know how accurate that is, but here's my mtr for reference.

                                  My traceroute  [v0.92]
p50 (192.168.1.45)                                                2019-06-26T12:39:46-0400
Keys:  Help   Display mode   Restart statistics   Order of fields   quit
                                                  Packets               Pings
 Host                                           Loss%   Snt   Last   Avg  Best  Wrst StDev
 1. 192.168.1.1                                  0.0%   154    0.3   0.3   0.2   0.8   0.1
 2. cpe-76-182-56-1.nc.res.rr.com                0.0%   153  159.1 173.0  37.2 248.0  41.6
 3. cpe-174-111-105-112.triad.res.rr.com         0.0%   153  168.6 174.5  37.6 255.5  41.9
 4. cpe-024-025-062-104.ec.res.rr.com            0.0%   153  152.5 173.0  40.3 246.2  42.8
 5. be31.drhmncev01r.southeast.rr.com            0.0%   153  145.1 177.9  50.5 251.8  43.9
 6. 66.109.10.176                                0.0%   153  143.2 184.6  51.5 261.1  46.0
 7. 0.ae4.pr0.dca10.tbone.rr.com                 0.7%   153  115.3 178.6  61.4 263.9  44.6
 8. 107.14.16.82                                 0.0%   153   85.3 181.0  47.8 271.1  45.5
 9. ???
10. ???
11. et-3-3-0.582.rtsw.rale.net.internet2.edu     0.0%   153   79.4 185.4  31.0 262.9  41.4
12. 198.71.47.222                                0.0%   153  178.2 187.9  47.1 255.0  38.7
13. 128.109.25.14                                0.0%   153  177.5 187.5  70.6 260.5  39.6
14. 8.43.84.1                                    0.7%   153  210.7 229.0 100.0 361.4  46.0
15. 8.43.84.3                                    2.0%   153  189.2 188.9  40.6 269.6  43.0
16. 8.43.84.4                                    0.0%   153  175.3 207.7  67.5 354.1  46.8
17. 8.43.84.190                                  0.0%   153  141.1 211.2  73.0 348.1  48.7
18. gw.sepia.ceph.com                            0.7%   153  205.5 185.6  62.0 265.6  41.5
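
One way to tell rate limiting apart from real loss is to probe with TCP toward the actual service port instead of ICMP echo; if the loss at those hops shrinks or disappears, it was deprioritization. A sketch, assuming an mtr build with TCP support:

$ mtr --report --tcp --port=443 gw.sepia.ceph.com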

Do you know if anyone else in Europe is seeing the same? This is the first I'm hearing of it.

#3 Updated by Lenz Grimmer almost 5 years ago

David Galloway wrote:

Red Hat IT has told me in the past that ICMP packet loss at both of those hops is typical: the packets are low priority, so if the equipment is overloaded, it will ignore ICMP. I don't know how accurate that is, but here's my mtr for reference.

[...]

Do you know if anyone else in Europe is seeing the same? This is the first I'm hearing of it.

I'm not aware of anybody else, so maybe the root cause of my problem lies elsewhere. For me, Etherpads currently load very slowly (~30 seconds per page).
As soon as I try to make changes to them, I get disconnected with the error message shown in the attached screenshot:

After clicking Force reconnect, the page reloads, but all my recent changes are gone.

I'll investigate more; maybe it's a different issue on my end.

#4 Updated by David Galloway almost 5 years ago

If you're comfortable sharing your public IP along with timestamps and the pad you were working on, I can take a look at server-side logs.

#5 Updated by Lenz Grimmer almost 5 years ago

David Galloway wrote:

If you're comfortable sharing your public IP along with timestamps and the pad you were working on, I can take a look at server-side logs.

Sure! I just ran into the issue again while trying to edit https://pad.ceph.com/p/ceph-newsletter-june2019 (check the logs shortly after 15:27 CET).
My current IP: 82.207.219.209 (muedsl-82-207-219-209.citykom.de)

Thanks a lot for your help!

#6 Updated by Lenz Grimmer almost 5 years ago

Update: I have now run mtr with TCP and still observe significant packet loss at that particular router:

$ mtr --report --tcp --port=443 pad.ceph.com
Start: 2019-06-28T11:15:50+0200
HOST: metis.fritz.box             Loss%   Snt   Last   Avg  Best  Wrst StDev
  1.|-- _gateway                   0.0%    10    1.0   0.9   0.7   1.0   0.1
  2.|-- hhb1000cihr001.versatel.d  0.0%    10   14.0  10.7   6.6  16.3   3.6
  3.|-- 62.214.38.9                0.0%    10    5.4   5.6   5.4   5.8   0.1
  4.|-- 62.214.37.158              0.0%    10    5.7   5.9   5.6   6.4   0.3
  5.|-- 213.242.108.153            0.0%    10    9.8  10.2   9.8  10.8   0.3
  6.|-- ae-2-3602.ear3.Washington 50.0%    10  7346. 2413. 110.1 7346. 3056.3
  7.|-- 4.16.240.122               0.0%    10  116.7 116.5 116.1 117.4   0.4
  8.|-- 8.43.84.1                  0.0%    10  118.9 125.6 116.5 142.1   9.5
  9.|-- 8.43.84.3                  0.0%    10  116.3 116.7 116.3 117.5   0.3
 10.|-- 8.43.84.4                  0.0%    10  124.5 123.5 118.3 127.9   3.0
 11.|-- 8.43.84.190                0.0%    10  117.9 127.8 117.9 137.6   7.9
 12.|-- gw.sepia.ceph.com          0.0%    10  117.0 117.1 116.8 117.5   0.3
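
One caveat when reading this: even with --tcp, the replies mtr counts for intermediate hops are still ICMP "time exceeded" messages generated by each router's control plane, so those hops can be rate-limited exactly as with ICMP echo; only the final-hop line reflects what real TCP traffic experiences. For a purely application-level measurement against the pad itself (using curl's standard --write-out timing variables):

$ curl -o /dev/null -s -w 'connect=%{time_connect}s total=%{time_total}s\n' \
    https://pad.ceph.com/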

#7 Updated by David Galloway over 4 years ago

  • Status changed from New to In Progress

I had our IT guy take a look at your mtr output, and he confirmed (basically) what I said previously: "Transit carriers will rate limit ICMP because it hits the control plane, so normal in some cases."

I made some tweaks to the nginx config for pad.ceph.com based on the Etherpad docs. Can you let me know if things have improved, please?

I don't see anything obvious in the logs.
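
If the disconnects are websocket-related, one client-side check is whether the upgrade handshake makes it through the proxy: an "HTTP/1.1 101 Switching Protocols" response means nginx forwards the Upgrade/Connection headers, while a 400/502 or a plain 200 would point at the proxy config rather than the network. A sketch (the /socket.io/ path and EIO=3 parameter are assumptions based on a stock Etherpad install, not confirmed here):

$ curl -s -i -N -m 5 \
    -H 'Connection: Upgrade' -H 'Upgrade: websocket' \
    -H 'Sec-WebSocket-Version: 13' \
    -H "Sec-WebSocket-Key: $(head -c 16 /dev/urandom | base64)" \
    'https://pad.ceph.com/socket.io/?EIO=3&transport=websocket' | head -n 5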

#8 Updated by Lenz Grimmer over 4 years ago

  • Status changed from In Progress to Closed

Hmm, I'm not sure what changed, but it works again now. Maybe I simply had too many Etherpads open in browser tabs and confused the server's session management?
Closing this one for now. Thanks a lot for your help!
