Bug #14767
Massive packet loss to/from gw.sepia.ceph.com (closed)
Updated by Zack Cerza about 8 years ago
From my laptop in CO to gw:
My traceroute [v0.86]
zwork.local (0.0.0.0)                                  Mon Feb 15 10:35:20 2016
Keys: Help  Display mode  Restart statistics  Order of fields  quit
                                                   Packets               Pings
    Host                                           Loss%   Snt   Last    Avg   Best   Wrst  StDev
 1. 192.168.1.1                                     0.0%   241    1.5    1.0    0.7    4.3    0.3
 2. c-98-245-112-1.hsd1.co.comcast.net              0.0%   241    9.7   11.3    8.7   30.2    2.4
 3. xe-9-1-3-sur02.boulder.co.denver.comcast.net    0.0%   241    9.7   11.3    8.4   28.8    3.0
 4. ae-10-sur03.boulder.co.denver.comcast.net       0.0%   241    9.1   10.8    8.6   22.4    2.0
 5. ae-29-ar01.denver.co.denver.comcast.net         0.0%   240   10.3   12.1    9.7   28.7    2.8
 6. 4.68.63.165                                     0.0%   240   13.3   14.8    9.7   81.3    8.9
 7. ae-2-3602.ear3.Washington1.Level3.net          92.9%   240   58.2  1175.   52.1  10029  2504.
 8. 4.16.240.122                                    0.0%   240   59.4   60.6   58.1   76.5    2.8
 9. 8.43.84.1                                      25.8%   240  40532  33568  25428  46669  4768.
10. 8.43.84.3                                      53.6%   240   58.2   59.8   57.7   69.4    2.2
11. 8.43.84.4                                      64.4%   240  28595  27510  25672  28642  828.9
12. 8.43.84.190                                    62.8%   240   71.1   68.5   59.8  152.3   11.8
13. 8.43.84.129                                    67.9%   240   59.6   60.2   57.7   73.3    2.8
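A note on reading a report like the one above: loss shown at a single intermediate hop that does not carry through to later hops (hop 7 here) is often just that router rate-limiting its ICMP replies, whereas loss that persists through to the final hop is more likely to affect real traffic. The following is a minimal Python sketch of that heuristic, not a diagnosis. The names `parse_hops` and `first_sustained_loss` are hypothetical helpers written for this illustration, and the sample rows are copied from hops 6 to 13 of the report above.

```python
import re

# Matches one mtr hop row: "<n>. <host> <loss>% <sent> ...".
# Rows without statistics, such as "13. ???", do not match and are skipped.
HOP_ROW = re.compile(r"^\s*(\d+)\.\s+(\S+)\s+(\d+(?:\.\d+)?)%\s+(\d+)\b")

# Hops 6-13 from the laptop-to-gw report (one row per line).
SAMPLE = """\
6. 4.68.63.165 0.0% 240 13.3 14.8 9.7 81.3 8.9
7. ae-2-3602.ear3.Washington1.Level3.net 92.9% 240 58.2 1175. 52.1 10029 2504.
8. 4.16.240.122 0.0% 240 59.4 60.6 58.1 76.5 2.8
9. 8.43.84.1 25.8% 240 40532 33568 25428 46669 4768.
10. 8.43.84.3 53.6% 240 58.2 59.8 57.7 69.4 2.2
11. 8.43.84.4 64.4% 240 28595 27510 25672 28642 828.9
12. 8.43.84.190 62.8% 240 71.1 68.5 59.8 152.3 11.8
13. 8.43.84.129 67.9% 240 59.6 60.2 57.7 73.3 2.8
"""


def parse_hops(report):
    """Return (hop_number, host, loss_percent) for each hop row with statistics."""
    hops = []
    for line in report.splitlines():
        match = HOP_ROW.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), float(match.group(3))))
    return hops


def first_sustained_loss(hops):
    """Return the earliest hop of the trailing run of lossy hops, or None.

    Walks backwards from the final hop and stops at the first hop with 0% loss,
    so isolated loss earlier in the path (hop 7 in SAMPLE) is ignored.
    """
    start = None
    for hop in reversed(hops):
        if hop[2] > 0.0:
            start = hop
        else:
            break
    return start


if __name__ == "__main__":
    print(first_sustained_loss(parse_hops(SAMPLE)))  # (9, '8.43.84.1', 25.8)
```

On this sample the heuristic points at hop 9 (8.43.84.1) as the start of loss that persists to the destination, while the 92.9% at hop 7 is not carried forward by hop 8.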
Updated by Zack Cerza about 8 years ago
From gw to google.com:
My traceroute [v0.85]
gw (0.0.0.0)                                           Mon Feb 15 17:34:42 2016
Keys: Help  Display mode  Restart statistics  Order of fields  quit
                                                   Packets               Pings
    Host                                           Loss%   Snt   Last    Avg   Best   Wrst  StDev
 1. 8.43.84.190                                     0.0%   117   11.7   18.1    6.3   84.8   10.1
 2. 8.43.84.4                                      71.6%   116  28477  28368  28139  28551  100.0
 3. 8.43.84.5                                      14.7%   116    0.5    0.6    0.4    0.8    0.0
 4. 8.43.84.2                                      70.4%   116  28479  28400  28131  28564   91.0
 5. 8.43.84.0                                       0.0%   116    0.8    0.9    0.7    1.2    0.0
 6. 6-1-23-101.ear3.Washington1.Level3.net          0.0%   116    7.4    8.1    7.3   24.1    2.2
 7. ae-1-3501.ear1.Washington12.Level3.net         90.5%   116    7.9  2776.    7.7  12923  4306.
 8. Google-level3-60G.WashingtonDC.Level3.net       0.0%   116    7.9    8.6    7.9   43.6    4.0
 9. 209.85.242.142                                  0.0%   116    8.3    9.1    8.2   89.2    7.5
10. 209.85.143.212                                  0.0%   116    9.9    9.6    9.3   10.9    0.1
11. 216.239.48.160                                  0.0%   116   16.1   16.7   16.1   24.7    1.3
12. 216.239.49.147                                  0.0%   116   32.9   16.6   16.0   38.3    2.5
13. ???
14. qb-in-f102.1e100.net                            0.0%   116   15.5   15.6   15.4   15.8    0.0
Updated by Zack Cerza about 8 years ago
Yet, it's not entirely broken:
$ wget -O /dev/null http://speedtest.wdc01.softlayer.com/downloads/test100.zip
--2016-02-15 17:36:32--  http://speedtest.wdc01.softlayer.com/downloads/test100.zip
Resolving speedtest.wdc01.softlayer.com (speedtest.wdc01.softlayer.com)... 2607:f0d0:3001:78::2, 208.43.102.250
Connecting to speedtest.wdc01.softlayer.com (speedtest.wdc01.softlayer.com)|2607:f0d0:3001:78::2|:80... failed: No route to host.
Connecting to speedtest.wdc01.softlayer.com (speedtest.wdc01.softlayer.com)|208.43.102.250|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 104874307 (100M) [application/zip]
Saving to: ‘/dev/null’

100%[================================================================================>] 104,874,307 74.3MB/s in 1.3s

2016-02-15 17:36:34 (74.3 MB/s) - ‘/dev/null’ saved [104874307/104874307]
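For comparing the wget result with link speeds, which are usually quoted in bits per second, here is a small conversion sketch. It assumes wget's "M" is a binary prefix (1024 * 1024 bytes), which is consistent with the output above showing 104874307 bytes as 100M. The function name `wget_rate_to_mbit` is a hypothetical helper for this illustration.

```python
# Assumption: wget's "M" means 1024 * 1024 bytes, as suggested by the
# "Length: 104874307 (100M)" line in the output above.
MIB = 1024 * 1024


def wget_rate_to_mbit(rate_mb_per_s):
    """Convert a wget-reported rate in MB/s (binary) to decimal megabits per second."""
    return rate_mb_per_s * MIB * 8 / 1_000_000


if __name__ == "__main__":
    print(round(wget_rate_to_mbit(74.3), 1))  # 623.3
```

So the 74.3 MB/s transfer corresponds to roughly 623 Mbit/s over IPv4 for that single download.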
Updated by David Galloway about 8 years ago
- Status changed from New to In Progress
- Assignee set to David Galloway
gw to google.com now:
gw (0.0.0.0)                                           Mon Feb 15 23:41:16 2016
Resolver error: No error returned but no answers given. of fields quit
                                                   Packets               Pings
    Host                                           Loss%   Snt   Last    Avg   Best   Wrst  StDev
 1. 8.43.84.190                                     0.0%    10   16.5   16.4   11.3   21.9    3.1
 2. 8.43.84.4                                       0.0%    10    4.6    5.8    1.6   10.5    3.0
 3. 8.43.84.5                                      40.0%    10    0.5    0.5    0.5    0.5    0.0
 4. 8.43.84.2                                       0.0%    10   39.2   31.1   16.6   42.1    8.3
 5. 8.43.84.0                                       0.0%    10    0.7    0.7    0.7    0.8    0.0
 6. 6-1-23-101.ear3.Washington1.Level3.net          0.0%    10    7.3    8.3    7.3   16.4    2.8
 7. ???
 8. Google-level3-60G.WashingtonDC.Level3.net       0.0%    10    7.9    7.9    7.8    8.1    0.0
 9. 209.85.242.142                                  0.0%    10    8.3   11.2    8.1   37.3    9.1
10. 216.239.43.67                                   0.0%    10    8.8    8.8    8.7    8.9    0.0
11. 209.85.250.70                                   0.0%    10   27.2   18.5   16.6   27.2    3.9
12. 209.85.252.69                                   0.0%    10   17.0   16.9   16.8   17.0    0.0
13. ???
14. qb-in-f102.1e100.net                            0.0%    10   17.5   16.8   16.7   17.5    0.0
Home to gw:
dgallowa (0.0.0.0)                                     Mon Feb 15 18:43:14 2016
Keys: Help  Display mode  Restart statistics  Order of fields  quit
                                                   Packets               Pings
    Host                                           Loss%   Snt   Last    Avg   Best   Wrst  StDev
 1. router.asus.com                                 0.0%    11    0.4    0.4    0.3    0.4    0.0
 2. mta-107-13-32-1.nc.rr.com                       0.0%    10   17.5   20.5    9.1   84.3   22.6
 3. mta-107-13-32-1.nc.rr.com                       0.0%    10   69.0   19.3    9.2   69.0   17.7
 4. cpe-174-111-117-141.triad.res.rr.com            0.0%    10   22.9   42.1   22.3  171.6   45.7
 5. cpe-024-025-062-106.ec.res.rr.com               0.0%    10   13.4   17.1   13.2   21.6    3.3
 6. 24.93.67.202                                    0.0%    10   23.8   25.6   22.5   31.7    2.8
 7. bu-ether14.atlngamq46w-bcr00.tbone.rr.com       0.0%    10   30.7   29.7   25.5   33.8    2.2
 8. 0.ae5.pr0.atl20.tbone.rr.com                    0.0%    10   23.8   25.3   22.4   27.9    1.9
 9. ???
10. ae-1-3502.ear3.Washington1.Level3.net          77.8%    10  556.9  996.4  556.9  1436.  621.6
11. 4.16.240.122                                    0.0%    10   43.8   45.3   41.6   56.7    4.3
12. 8.43.84.1                                       0.0%    10   63.4   62.6   58.9   67.7    2.7
13. 8.43.84.3                                      30.0%    10   46.4   46.7   44.4   49.8    1.6
14. 8.43.84.4                                       0.0%    10   74.9   79.2   73.1   89.4    6.1
15. 8.43.84.190                                     0.0%    10   52.0   52.6   42.1   59.4    5.3
16. 8.43.84.129                                     0.0%    10   49.7   47.1   41.4   56.6    4.3
There are still a few unknowns but here's what we do know.
The Sepia lab is in a shared Community space in the RDU2 lab. On Friday
afternoon, an additional tenant was brought online so they could start
building out their infrastructure.
At that time, severe packet loss, intermittent downtime, and network
degradation were observed. This manifested itself in yum and apt update
failures on test nodes and 404 timeouts to external sites like git.ceph.com.
As of about 03:00 UTC today, the packet loss to the Sepia VPN server was
resolved, so we're able to reliably reach gw.sepia.ceph.com and
gitbuilder.ceph.com. There's still some packet loss on the network
equipment in front of the lab, which IT is reaching out to Level3 to
diagnose.
We're still not seeing the network speeds we're expecting, however.
We've run a few smoke tests to verify stability despite the network
speed drops.
I've forced rebuilds on a few critical branches in the lab so tests can
resume.
Updated by David Galloway about 8 years ago
- Is duplicate of Bug #14763: teuthology.front.sepia.ceph.com: unable to connect to git.ceph.com added
Updated by David Galloway about 8 years ago
- Status changed from In Progress to Resolved
Timeouts are still intermittent but mostly resolved. Lingering network issues are still being investigated by IT.
No update as of 16:51 UTC, 16 Feb.
Updated by Nathan Cutler about 8 years ago
- Related to Bug #15199: git.ceph.com[0: 67.205.20.229]: errno=Connection timed out added