Bug #42611

mgr/dashboard: dashboard e2e Jenkins job failures when testing RGW

Added by Laura Paduano over 4 years ago. Updated about 3 years ago.

Status:
Resolved
Priority:
Normal
Assignee:
Matthew Oliver
Category:
Testing & QA
Target version:
-
% Done:
0%

Source:
Q/A
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
31422
Crash signature (v1):
Crash signature (v2):

Description

  RGW buckets page

    breadcrumb tests
      ✗ should open and show breadcrumb (0.787 sec)
        - Expected 'Object Gateway' to be 'Buckets'.
            at Object.<anonymous> (/home/jenkins-build/build/workspace/ceph-dashboard-pull-requests/src/pybind/mgr/dashboard/frontend/e2e/rgw/buckets.e2e-spec.ts:22:60)
            at step (/home/jenkins-build/build/workspace/ceph-dashboard-pull-requests/src/pybind/mgr/dashboard/frontend/node_modules/tslib/tslib.js:136:27)
            at Object.next (/home/jenkins-build/build/workspace/ceph-dashboard-pull-requests/src/pybind/mgr/dashboard/frontend/node_modules/tslib/tslib.js:117:57)
            at /home/jenkins-build/build/workspace/ceph-dashboard-pull-requests/src/pybind/mgr/dashboard/frontend/node_modules/tslib/tslib.js:110:75
            at new Promise (<anonymous>)
            at Object.__awaiter (/home/jenkins-build/build/workspace/ceph-dashboard-pull-requests/src/pybind/mgr/dashboard/frontend/node_modules/tslib/tslib.js:106:16)
            at UserContext.<anonymous> (/home/jenkins-build/build/workspace/ceph-dashboard-pull-requests/src/pybind/mgr/dashboard/frontend/e2e/rgw/buckets.e2e-spec.ts:21:43)
            at /home/jenkins-build/build/workspace/ceph-dashboard-pull-requests/src/pybind/mgr/dashboard/frontend/node_modules/jasminewd2/index.js:112:25
            at new Promise (<anonymous>)

**************************************************
*                    Failures                    *
**************************************************

1) RGW buckets page breadcrumb tests should open and show breadcrumb
  - Expected 'Object Gateway' to be 'Buckets'.

See: https://jenkins.ceph.com/job/ceph-dashboard-pull-requests/2291/
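For context, the failing assertion comes from a Protractor/Jasmine spec that navigates to the buckets page and checks the active breadcrumb text. A simplified sketch is shown below; the real spec goes through the dashboard's page-helper classes, so the route and CSS selector used here are assumptions for illustration only:

```typescript
import { $, browser } from 'protractor';

describe('RGW buckets page', () => {
  describe('breadcrumb tests', () => {
    it('should open and show breadcrumb', async () => {
      // Navigate to the buckets page (route is an assumption; the real spec
      // uses a PageHelper and relies on baseUrl from protractor.conf).
      await browser.get('/#/rgw/bucket');
      // Breadcrumb selector is likewise an assumption for this sketch.
      const breadcrumb = $('.breadcrumb-item.active');
      // With the bug, the dashboard cannot resolve an RGW endpoint, the buckets
      // page never loads, and the breadcrumb stays at the parent 'Object Gateway'.
      expect(await breadcrumb.getText()).toBe('Buckets');
    });
  });
});
```

The 'Object Gateway' value in the failure message is consistent with the dashboard being unable to reach RGW, which the comments below trace back to the daemon missing from the mgr service map.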


Actions #1

Updated by Laura Paduano over 4 years ago

  • Description updated (diff)
Actions #2

Updated by Laura Paduano over 4 years ago

2019-11-04T10:54:34.815+0000 7f397c970700  1 -- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x463c000 msgr2=0x454e100 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 30
2019-11-04T10:54:34.815+0000 7f397c970700  1 -- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x463c000 msgr2=0x454e100 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-11-04T10:54:34.815+0000 7f397d171700  1 -- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598000 msgr2=0x44fe680 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_bulk peer close file descriptor 24
2019-11-04T10:54:34.815+0000 7f397d171700  1 -- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598000 msgr2=0x44fe680 crc :-1 s=STATE_CONNECTION_ESTABLISHED l=1).read_until read failed
2019-11-04T10:54:34.815+0000 7f397c970700  1 --2- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x463c000 0x454e100 crc :-1 s=READY pgs=80 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)
2019-11-04T10:54:34.815+0000 7f397c970700  1 --2- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x463c000 0x454e100 crc :-1 s=READY pgs=80 cs=0 l=1 rx=0 tx=0).stop
2019-11-04T10:54:34.815+0000 7f397d171700  1 --2- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598000 0x44fe680 crc :-1 s=READY pgs=79 cs=0 l=1 rx=0 tx=0).handle_read_frame_preamble_main read frame length and tag failed r=-1 ((1) Operation not permitted)

2019-11-04T10:54:34.815+0000 7f397d171700  1 --2- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598000 0x44fe680 crc :-1 s=READY pgs=79 cs=0 l=1 rx=0 tx=0).stop
2019-11-04T10:54:34.815+0000 7f3967145700  1 -- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x463c000 msgr2=0x454e100 unknown :-1 s=STATE_CLOSED l=1).mark_down
2019-11-04T10:54:34.815+0000 7f3967145700  1 --2- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x463c000 0x454e100 unknown :-1 s=CLOSED pgs=80 cs=0 l=1 rx=0 tx=0).stop
2019-11-04T10:54:34.815+0000 7f397b96e700  1 -- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598000 msgr2=0x44fe680 unknown :-1 s=STATE_CLOSED l=1).mark_down
2019-11-04T10:54:34.815+0000 7f397b96e700  1 --2- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598000 0x44fe680 unknown :-1 s=CLOSED pgs=79 cs=0 l=1 rx=0 tx=0).stop
2019-11-04T10:54:34.815+0000 7f3967145700  1 --2- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x385f000 0x454c580 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
2019-11-04T10:54:34.815+0000 7f397b96e700  1 --2- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598c00 0x44fe100 unknown :-1 s=NONE pgs=0 cs=0 l=1 rx=0 tx=0).connect
2019-11-04T10:54:34.815+0000 7f397d171700  1 -- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x385f000 msgr2=0x454c580 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:127.0.0.1:6800/13954
2019-11-04T10:54:34.815+0000 7f397c970700  1 -- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598c00 msgr2=0x44fe100 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:127.0.0.1:6800/13954
2019-11-04T10:54:34.815+0000 7f397d171700  1 --2- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x385f000 0x454c580 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rx=0 tx=0)._fault waiting 0.200000
2019-11-04T10:54:34.815+0000 7f397c970700  1 --2- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598c00 0x44fe100 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rx=0 tx=0)._fault waiting 0.200000
2019-11-04T10:54:34.821+0000 7f3967145700  1 -- 127.0.0.1:0/1930480545 --> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] -- mgropen(rgw.8000 daemon) v3 -- 0x498edc0 con 0x385f000
2019-11-04T10:54:35.016+0000 7f397c970700  1 -- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598c00 msgr2=0x44fe100 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:127.0.0.1:6800/13954
2019-11-04T10:54:35.016+0000 7f397d171700  1 -- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x385f000 msgr2=0x454c580 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:127.0.0.1:6800/13954
2019-11-04T10:54:35.016+0000 7f397c970700  1 --2- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598c00 0x44fe100 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rx=0 tx=0)._fault waiting 0.400000
2019-11-04T10:54:35.016+0000 7f397d171700  1 --2- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x385f000 0x454c580 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rx=0 tx=0)._fault waiting 0.400000
2019-11-04T10:54:35.215+0000 7f397d171700  1 -- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6802/14723,v1:127.0.0.1:6803/14723] conn(0x4599800 msgr2=0x44ff180 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:127.0.0.1:6802/14723
2019-11-04T10:54:35.215+0000 7f397d171700  1 --2- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6802/14723,v1:127.0.0.1:6803/14723] conn(0x4599800 0x44ff180 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rx=0 tx=0)._fault waiting 1.600000
2019-11-04T10:54:35.215+0000 7f397d171700  1 -- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6802/14723,v1:127.0.0.1:6803/14723] conn(0x385f800 msgr2=0x44fcb00 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:127.0.0.1:6802/14723
2019-11-04T10:54:35.215+0000 7f397d171700  1 --2- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6802/14723,v1:127.0.0.1:6803/14723] conn(0x385f800 0x44fcb00 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rx=0 tx=0)._fault waiting 1.600000
2019-11-04T10:54:35.217+0000 7f397d972700  1 -- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6810/15052,v1:127.0.0.1:6811/15052] conn(0x4598800 msgr2=0x44fdb80 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:127.0.0.1:6810/15052
2019-11-04T10:54:35.217+0000 7f397d972700  1 --2- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6810/15052,v1:127.0.0.1:6811/15052] conn(0x4598800 0x44fdb80 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rx=0 tx=0)._fault waiting 1.600000
2019-11-04T10:54:35.218+0000 7f397d972700  1 -- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6810/15052,v1:127.0.0.1:6811/15052] conn(0x463c400 msgr2=0x454cb00 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:127.0.0.1:6810/15052
2019-11-04T10:54:35.218+0000 7f397c970700  1 -- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6818/15383,v1:127.0.0.1:6819/15383] conn(0x4598400 msgr2=0x44fd600 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:127.0.0.1:6818/15383
2019-11-04T10:54:35.218+0000 7f397d972700  1 --2- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6810/15052,v1:127.0.0.1:6811/15052] conn(0x463c400 0x454cb00 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rx=0 tx=0)._fault waiting 1.600000
2019-11-04T10:54:35.218+0000 7f397c970700  1 --2- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6818/15383,v1:127.0.0.1:6819/15383] conn(0x4598400 0x44fd600 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rx=0 tx=0)._fault waiting 1.600000
2019-11-04T10:54:35.417+0000 7f397c970700  1 -- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598c00 msgr2=0x44fe100 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:127.0.0.1:6800/13954
2019-11-04T10:54:35.417+0000 7f397d171700  1 -- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x385f000 msgr2=0x454c580 unknown :-1 s=STATE_CONNECTING_RE l=1).process reconnect failed to v2:127.0.0.1:6800/13954
2019-11-04T10:54:35.417+0000 7f397c970700  1 --2- 127.0.0.1:0/2029453658 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x4598c00 0x44fe100 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rx=0 tx=0)._fault waiting 0.800000
2019-11-04T10:54:35.417+0000 7f397d171700  1 --2- 127.0.0.1:0/1930480545 >> [v2:127.0.0.1:6800/13954,v1:127.0.0.1:6801/13954] conn(0x385f000 0x454c580 unknown :-1 s=START_CONNECT pgs=0 cs=0 l=1 rx=0 tx=0)._fault waiting 0.800000
2019-11-04T10:54:35.795+0000 7f397f976700 -1 received  signal: Terminated from  (PID: 17778) UID: 1001
2019-11-04T10:54:35.795+0000 7f397f976700  1 handle_sigterm
2019-11-04T10:54:35.796+0000 7f397f976700  1 handle_sigterm set alarm for 120
2019-11-04T10:54:35.796+0000 7f3994560780 -1 shutting down
2019-11-04T10:54:35.796+0000 7f3994560780  4 frontend initiating shutdown...
2019-11-04T10:54:35.796+0000 7f3994560780  4 frontend joining threads...
2019-11-04T10:54:35.800+0000 7f397f976700 -1 received  signal: Terminated from  (PID: 17780) UID: 1001
2019-11-04T10:54:35.800+0000 7f397f976700  1 handle_sigterm
2019-11-04T10:54:35.800+0000 7f397f976700  1 handle_sigterm set alarm for 120
2019-11-04T10:54:35.809+0000 7f3994560780  4 frontend done
2019-11-04T10:54:35.811+0000 7f395e934700 20 BucketsSyncThread: done
2019-11-04T10:54:35.812+0000 7f395e133700 20 UserSyncThread: done
Actions #4

Updated by Kiefer Chang over 4 years ago

This can also be reproduced by simply creating a vstart cluster (tested on 5a15148b7bd2a2b7919fb873a6c18405524151b9):
  • The radosgw daemon is alive after vstart; radosgw-admin and s3cmd both work.
  • `ceph service status` contains no rgw.
  • `ceph -s` contains no rgw.

After a while (15-30 minutes), rgw appears in `ceph service status` and `ceph -s`, and the dashboard is able to load the RGW pages.
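One way to measure the delay described above is to poll the cluster status until an rgw entry appears in the service map. A minimal sketch follows (assumptions: a Node/ts-node environment, a vstart cluster whose `ceph` wrapper is on PATH, and that `ceph status --format json` reports the service map under `servicemap.services`):

```typescript
import { execSync } from 'child_process';

// Returns true once an "rgw" service is present in the mgr service map.
function rgwInServiceMap(): boolean {
  const out = execSync('ceph status --format json', { encoding: 'utf8' });
  const services = JSON.parse(out)?.servicemap?.services ?? {};
  return 'rgw' in services;
}

const start = Date.now();
const timer = setInterval(() => {
  if (rgwInServiceMap()) {
    console.log(`rgw registered after ${Math.round((Date.now() - start) / 1000)}s`);
    clearInterval(timer);
  }
}, 10000); // check every 10 seconds
```

With the bug present this takes on the order of 15-30 minutes; with the fix referenced later in the thread, rgw shows up right after the daemon starts.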

Actions #5

Updated by Laura Paduano over 4 years ago

Kiefer Chang wrote:

This can also be reproduced by simply creating a vstart cluster (tested on 5a15148b7bd2a2b7919fb873a6c18405524151b9):
  • The radosgw daemon is alive after vstart; radosgw-admin and s3cmd both work.
  • `ceph service status` contains no rgw.
  • `ceph -s` contains no rgw.

After a while (15-30 minutes), rgw appears in `ceph service status` and `ceph -s`, and the dashboard is able to load the RGW pages.

I can reproduce this behavior on my local system!
It takes about 15 minutes for RGW to show up as a service in `ceph -s`.
Tested on commit 8ece36241e072dd50ea47eab8e6676fa99256d9d.

Actions #6

Updated by Laura Paduano over 4 years ago

  • Related to Bug #42610: mgr/dashboard: dashboard e2e Jenkins job failures when testing "services link(s)" within the Host page added
Actions #7

Updated by Laura Paduano over 4 years ago

  • Related to deleted (Bug #42610: mgr/dashboard: dashboard e2e Jenkins job failures when testing "services link(s)" within the Host page)
Actions #8

Updated by Abhishek Lekshmanan over 4 years ago

Seems reproducible in master. The mgrc register daemon command should've succeeded (or at least the error isn't logged in rgw), so we need to debug on the mgr side whether this was seen.

Actions #9

Updated by Matthew Oliver over 4 years ago

Abhishek Lekshmanan wrote:

Seems reproducible in master. The mgrc register daemon command should've succeeded (or at least the error isn't logged in rgw), so we need to debug on the mgr side whether this was seen.

Just having a quick look, I've added some extra logging to debug. There is no error when it comes to the register daemon command; in fact, the return code is success.

2019-11-06T05:04:36.499+0000 7f693e0e2940  0 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< PRE rados.service_daemon_register
2019-11-06T05:04:36.499+0000 7f693e0e2940  1 mgrc service_daemon_register rgw.8000 metadata {arch=x86_64,ceph_release=octopus,ceph_version=ceph version Development (no_version) octopus (dev),ceph_version_short=Development,cpu=Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz,distro=opensuse-tumbleweed,distro_description=openSUSE Tumbleweed,distro_version=20191011,frontend_config#0=beast port=8000,frontend_type#0=beast,hostname=ceph-1,kernel_description=#1 SMP Tue Oct 8 05:42:34 UTC 2019 (2982b5d),kernel_version=4.12.14-lp151.28.20-default,mem_swap_kb=0,mem_total_kb=131887336,num_handles=1,os=Linux,pid=10561,zone_id=c349463a-c3d7-4c45-a85d-ba5989c429b7,zone_name=default,zonegroup_id=83f69839-c775-40c1-ba1f-a14f9620b7d5,zonegroup_name=default}
2019-11-06T05:04:36.499+0000 7f693e0e2940  0 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< POST rados.service_daemon_register return code: 0: (0) Success

I'm now bisecting the repo to see if I can pinpoint the patch that "broke" it. As I'm new to the code base, I'm hoping that'll be the quickest way. Let's see how I go.

Actions #10

Updated by Laura Paduano over 4 years ago

  • Status changed from New to 12
Actions #11

Updated by Laura Paduano over 4 years ago

  • Assignee set to Matthew Oliver
Actions #12

Updated by Laura Paduano over 4 years ago

Matthew Oliver wrote:

Abhishek Lekshmanan wrote:

Seems reproducible in master. The mgrc register daemon command should've succeeded (or at least the error isn't logged in rgw), so we need to debug on the mgr side whether this was seen.

Just having a quick look, I've added some extra logging to debug. There is no error when it comes to the register daemon command; in fact, the return code is success.

[...]

I'm now bisecting the repo to see if I can pinpoint the patch that "broke" it. As I'm new to the code base, I'm hoping that'll be the quickest way. Let's see how I go.

Thanks a lot for looking into this! For now I've assigned the issue to you, hope that's fine!

Actions #13

Updated by Kiefer Chang over 4 years ago

Just did a test reverting the recent change in MgrClient:
https://github.com/ceph/ceph/commit/fc60989bf7a72c35b8f6b8fec2407b3080ad9bbd

With that change reverted, the RGW daemon appears in the service map right after it is started.
Looks like the radosgw daemon uses the client key to report itself?
NOTE: I don't understand the code thoroughly, I just tried reverting recent changes.

Actions #14

Updated by Matthew Oliver over 4 years ago

Nice work on finding the commit, Kiefer! I'll take a look at the code and see if I can track down what's happening.

Actions #15

Updated by Matthew Oliver over 4 years ago

This was caused by an accidentally negated check (probably a copy-and-paste error) introduced in https://github.com/ceph/ceph/commit/fc60989bf7a72c35b8f6b8fec2407b3080ad9bbd
This was fixed overnight by Sage in https://github.com/ceph/ceph/pull/31422.

What's the normal next step for a bug: can I move it to Resolved? Or is there some testing by QA that needs to happen first, in which case should it move to Needs Testing?

Actions #16

Updated by Kiefer Chang over 4 years ago

  • Pull request ID set to 31422
Actions #17

Updated by Kiefer Chang over 4 years ago

  • Status changed from 12 to Resolved
Actions #18

Updated by Kiefer Chang over 4 years ago

Matthew Oliver wrote:

What's the normal next step for a bug: can I move it to Resolved? Or is there some testing by QA that needs to happen first, in which case should it move to Needs Testing?

Thanks Matt, I tested the PR locally and the issue should be resolved.

Actions #19

Updated by Ernesto Puerta about 3 years ago

  • Project changed from mgr to Dashboard
  • Category changed from 151 to Testing & QA