Bug #17093 (closed): Test failure due to extra data in JSON parsing

Added by Jeff Layton over 7 years ago. Updated over 7 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Opening this as a qa-suite bug to get some help with interpreting what failed, but we may need to move it to a different project. I'm seeing several fs test suite test cases fail with variations on the sort of error below:

From this run: http://pulpito.ceph.com/jlayton-2016-08-22_12:56:03-fs-master---basic-mira/


2016-08-22T12:44:31.157 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:30.442027 7f43d2ffd700  1 -- 172.21.9.134:0/2179350879 <== mon.1 172.21.9.134:6789/0 4 ==== auth_reply(proto 2 0 (0) Success) v1 ==== 393+0+0 (3458294841 0 0) 0x7f43c00008c0 con 0x7f43d4064630
2016-08-22T12:44:31.157 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:30.442162 7f43d2ffd700  1 -- 172.21.9.134:0/2179350879 --> 172.21.9.134:6789/0 -- mon_subscribe({monmap=0+}) v2 -- ?+0 0x7f43d4062510 con 0x7f43d4064630
2016-08-22T12:44:31.157 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:30.442315 7f43d9f86700  1 -- 172.21.9.134:0/2179350879 --> 172.21.9.134:6789/0 -- mon_subscribe({osdmap=0}) v2 -- ?+0 0x7f43d4062760 con 0x7f43d4064630
2016-08-22T12:44:31.157 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:30.442665 7f43d2ffd700  1 -- 172.21.9.134:0/2179350879 <== mon.1 172.21.9.134:6789/0 5 ==== mon_map magic: 0 v1 ==== 473+0+0 (3242978494 0 0) 0x7f43c00012a0 con 0x7f43d4064630
2016-08-22T12:44:31.157 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:30.442840 7f43d2ffd700  1 -- 172.21.9.134:0/2179350879 <== mon.1 172.21.9.134:6789/0 6 ==== osd_map(9..9 src has 1..9) v3 ==== 4040+0+0 (2928568949 0 0) 0x7f43c00008c0 con 0x7f43d4064630
2016-08-22T12:44:31.157 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:30.460358 7f43d9f86700  1 -- 172.21.9.134:0/2179350879 --> 172.21.9.134:6789/0 -- mon_command({"prefix": "get_command_descriptions"} v 0) v1 -- ?+0 0x7f43d4052db0 con 0x7f43d4064630
2016-08-22T12:44:31.157 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:30.463566 7f43d2ffd700  1 -- 172.21.9.134:0/2179350879 <== mon.1 172.21.9.134:6789/0 7 ==== mon_command_ack([{"prefix": "get_command_descriptions"}]=0  v0) v1 ==== 72+0+41539 (1092875540 0 1507896993) 0x7f43c0001240 con 0x7f43d4064630
2016-08-22T12:44:31.157 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:30.635717 7f43d9f86700  1 -- 172.21.9.134:0/2179350879 --> 172.21.9.134:6789/0 -- mon_command({"pool2": "cephfs_data", "prefix": "osd pool delete", "sure": "--yes-i-really-really-mean-it", "pool": "cephfs_data"} v 0) v1 -- ?+0 0x7f43d4052db0 con 0x7f43d4064630
2016-08-22T12:44:31.157 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:31.136322 7f43d2ffd700  1 -- 172.21.9.134:0/2179350879 <== mon.1 172.21.9.134:6789/0 8 ==== mon_command_ack([{"pool2": "cephfs_data", "prefix": "osd pool delete", "sure": "--yes-i-really-really-mean-it", "pool": "cephfs_data"}]=0 pool 'cephfs_data' removed v10) v1 ==== 177+0+0 (2812490412 0 0) 0x7f43c0001240 con 0x7f43d4064630
2016-08-22T12:44:31.157 INFO:teuthology.orchestra.run.mira112.stderr:pool 'cephfs_data' removed
2016-08-22T12:44:31.158 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:31.139048 7f43d9f86700  1 -- 172.21.9.134:0/2179350879 mark_down 0x7f43d4064630 -- 0x7f43d408a050
2016-08-22T12:44:31.158 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:31.139341 7f43d9f86700  1 -- 172.21.9.134:0/2179350879 mark_down_all
2016-08-22T12:44:31.158 INFO:teuthology.orchestra.run.mira112.stderr:2016-08-22 12:44:31.139654 7f43d9f86700  1 -- 172.21.9.134:0/2179350879 shutdown complete.
2016-08-22T12:44:31.158 INFO:teuthology.orchestra.run.mira112:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_mon_report_interval_max'
2016-08-22T12:44:31.414 INFO:teuthology.orchestra.run.mira112.stdout:{
2016-08-22T12:44:31.415 INFO:teuthology.orchestra.run.mira112.stdout:    "osd_mon_report_interval_max": "600" 
2016-08-22T12:44:31.415 INFO:teuthology.orchestra.run.mira112.stdout:}
2016-08-22T12:44:31.415 INFO:teuthology.orchestra.run.mira112.stdout:7-cad3c0ebbd92",
2016-08-22T12:44:31.415 INFO:teuthology.orchestra.run.mira112.stdout:    "whoami": 0,
2016-08-22T12:44:31.415 INFO:teuthology.orchestra.run.mira112.stdout:    "want_state": "up:active",
2016-08-22T12:44:31.415 INFO:teuthology.orchestra.run.mira112.stdout:    "state": "up:active",
2016-08-22T12:44:31.415 INFO:teuthology.orchestra.run.mira112.stdout:    "mdsmap_epoch": 7,
2016-08-22T12:44:31.415 INFO:teuthology.orchestra.run.mira112.stdout:    "osdmap_epoch": 7,
2016-08-22T12:44:31.415 INFO:teuthology.orchestra.run.mira112.stdout:    "osdmap_epoch_barrier": 7
2016-08-22T12:44:31.416 INFO:teuthology.orchestra.run.mira112.stdout:}
2016-08-22T12:44:31.416 INFO:tasks.cephfs.filesystem:_json_asok output: {
    "osd_mon_report_interval_max": "600" 
}
7-cad3c0ebbd92",
    "whoami": 0,
    "want_state": "up:active",
    "state": "up:active",
    "mdsmap_epoch": 7,
    "osdmap_epoch": 7,
    "osdmap_epoch_barrier": 7
}

2016-08-22T12:44:31.416 INFO:tasks.cephfs_test_runner:test_basic (tasks.cephfs.test_dump_tree.TestDumpTree) ... ERROR
2016-08-22T12:44:31.416 INFO:tasks.cephfs_test_runner:
2016-08-22T12:44:31.417 INFO:tasks.cephfs_test_runner:======================================================================
2016-08-22T12:44:31.417 INFO:tasks.cephfs_test_runner:ERROR: test_basic (tasks.cephfs.test_dump_tree.TestDumpTree)
2016-08-22T12:44:31.417 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2016-08-22T12:44:31.417 INFO:tasks.cephfs_test_runner:Traceback (most recent call last):
2016-08-22T12:44:31.417 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/ceph-qa-suite_master/tasks/cephfs/cephfs_test_case.py", line 124, in setUp
2016-08-22T12:44:31.417 INFO:tasks.cephfs_test_runner:    osd_mon_report_interval_max = int(self.fs.get_config("osd_mon_report_interval_max", service_type='osd'))
2016-08-22T12:44:31.417 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/ceph-qa-suite_master/tasks/cephfs/filesystem.py", line 168, in get_config
2016-08-22T12:44:31.417 INFO:tasks.cephfs_test_runner:    return self.json_asok(['config', 'get', key], service_type, service_id)[key]
2016-08-22T12:44:31.417 INFO:tasks.cephfs_test_runner:  File "/home/teuthworker/src/ceph-qa-suite_master/tasks/cephfs/filesystem.py", line 186, in json_asok
2016-08-22T12:44:31.417 INFO:tasks.cephfs_test_runner:    return json.loads(response_data)
2016-08-22T12:44:31.417 INFO:tasks.cephfs_test_runner:  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
2016-08-22T12:44:31.418 INFO:tasks.cephfs_test_runner:    return _default_decoder.decode(s)
2016-08-22T12:44:31.418 INFO:tasks.cephfs_test_runner:  File "/usr/lib/python2.7/json/decoder.py", line 367, in decode
2016-08-22T12:44:31.418 INFO:tasks.cephfs_test_runner:    raise ValueError(errmsg("Extra data", s, end, len(s)))
2016-08-22T12:44:31.418 INFO:tasks.cephfs_test_runner:ValueError: Extra data: line 4 column 1 - line 12 column 1 (char 45 - 214)
2016-08-22T12:44:31.418 INFO:tasks.cephfs_test_runner:
2016-08-22T12:44:31.418 INFO:tasks.cephfs_test_runner:----------------------------------------------------------------------
2016-08-22T12:44:31.418 INFO:tasks.cephfs_test_runner:Ran 1 test in 13.739s
2016-08-22T12:44:31.418 INFO:tasks.cephfs_test_runner:
2016-08-22T12:44:31.418 INFO:tasks.cephfs_test_runner:FAILED (errors=1)

Some sort of parsing error in the JSON output?
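For context on the ValueError itself: json.loads() rejects any input that has trailing content after the first complete JSON document, which matches the traceback above (valid JSON up to char 45, extra data through char 214). A minimal standalone reproduction, with made-up strings loosely modeled on the log output (not the actual asok response):

import json

# One complete JSON document followed by leftover fragments of a different one.
good = '{\n    "osd_mon_report_interval_max": "600"\n}'
junk = '\n7-cad3c0ebbd92",\n    "whoami": 0,\n    "state": "up:active",\n    "osdmap_epoch_barrier": 7\n}\n'

json.loads(good)              # parses fine on its own

try:
    json.loads(good + junk)   # any trailing content makes loads() bail out
except ValueError as e:
    print("json.loads failed: %s" % e)   # "Extra data: line 4 column 1 ..."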


Related issues: 1 (0 open, 1 closed)

Related to CephFS - Bug #25148: "ceph session ls" produces unparseable json when run against ceph-mds.a.asok when mds.a not running (Won't Fix - EOL, Patrick Donnelly)

#1 - Updated by Jeff Layton over 7 years ago

Yes, it does look like the JSON has some extra junk attached to the end that didn't seem to be present before. From a passing run a few days ago:

http://qa-proxy.ceph.com/teuthology/jlayton-2016-08-16_14:57:06-fs-wip-jlayton-statx---basic-mira/368237/teuthology.log

2016-08-16T15:09:17.810 INFO:teuthology.orchestra.run.mira119:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_mon_report_interval_max'
2016-08-16T15:09:18.081 INFO:teuthology.orchestra.run.mira119.stdout:{
2016-08-16T15:09:18.081 INFO:teuthology.orchestra.run.mira119.stdout:    "osd_mon_report_interval_max": "600" 
2016-08-16T15:09:18.081 INFO:teuthology.orchestra.run.mira119.stdout:}
2016-08-16T15:09:18.081 INFO:tasks.cephfs.filesystem:_json_asok output: {
    "osd_mon_report_interval_max": "600" 
}

2016-08-16T15:09:18.082 INFO:teuthology.orchestra.run.mira119:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 120 ceph --cluster ceph osd dump --format=json-pretty'
#2 - Updated by Jeff Layton over 7 years ago

Ok, I see -- we run "ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_mon_report_interval_max"

...and that produced some extra junk at the end of the JSON output. Smells like a buffer length handling problem, where we're getting the leftover remnant of some other JSON info?
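Not the actual ceph code, just a self-contained sketch of the kind of failure that hypothesis describes, with invented names (buf, handle_reply): a reader that knows how many bytes it just received but decodes its whole reused buffer will return a short reply with the tail of an earlier, longer reply appended, which is the same shape as the garbage above:

buf = bytearray(256)          # hypothetical reply buffer, reused across commands

def handle_reply(payload):
    # BUG (illustrative): we know how many bytes just arrived, but we decode
    # the entire buffer, so whatever an earlier, longer reply left behind
    # past those bytes survives and gets appended to this reply.
    n = len(payload)
    buf[:n] = payload
    return bytes(buf).rstrip(b"\x00").decode("utf-8")

long_reply = b'{\n    "cluster_fsid": "...",\n    "state": "up:active"\n}'
short_reply = b'{\n    "osd_mon_report_interval_max": "600"\n}'

print(handle_reply(long_reply))    # parses cleanly: nothing stale in the buffer yet
print(handle_reply(short_reply))   # short reply plus a fragment of the long one -> "Extra data"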

#3 - Updated by Jeff Layton over 7 years ago

  • Project changed from 23 to Ceph

#4 - Updated by Jeff Layton over 7 years ago

A little while before the ERROR hits, this was run:

2016-08-22T12:44:08.996 INFO:teuthology.orchestra.run.mira064:Running: 'sudo adjust-ulimits ceph-coverage /home/ubuntu/cephtest/archive/coverage timeout 0 ceph --cluster ceph --admin-daemon /var/run/ceph/ceph-mds.a-s.asok status'
2016-08-22T12:44:09.218 INFO:teuthology.orchestra.run.mira064.stdout:{
2016-08-22T12:44:09.218 INFO:teuthology.orchestra.run.mira064.stdout:    "cluster_fsid": "b62427df-c6a1-4aba-bab7-cad3c0ebbd92",
2016-08-22T12:44:09.218 INFO:teuthology.orchestra.run.mira064.stdout:    "whoami": 0,
2016-08-22T12:44:09.218 INFO:teuthology.orchestra.run.mira064.stdout:    "want_state": "up:active",
2016-08-22T12:44:09.219 INFO:teuthology.orchestra.run.mira064.stdout:    "state": "up:active",
2016-08-22T12:44:09.219 INFO:teuthology.orchestra.run.mira064.stdout:    "mdsmap_epoch": 7,
2016-08-22T12:44:09.219 INFO:teuthology.orchestra.run.mira064.stdout:    "osdmap_epoch": 7,
2016-08-22T12:44:09.219 INFO:teuthology.orchestra.run.mira064.stdout:    "osdmap_epoch_barrier": 7
2016-08-22T12:44:09.219 INFO:teuthology.orchestra.run.mira064.stdout:}
2016-08-22T12:44:09.219 INFO:tasks.cephfs.filesystem:_json_asok output: {
    "cluster_fsid": "b62427df-c6a1-4aba-bab7-cad3c0ebbd92",
    "whoami": 0,
    "want_state": "up:active",
    "state": "up:active",
    "mdsmap_epoch": 7,
    "osdmap_epoch": 7,
    "osdmap_epoch_barrier": 7
}

Note that the tail end of the above (valid) JSON output is identical to the extra junk that got tacked onto the end of the one that failed.

It's not clear to me how we'd end up with that extra info tacked onto the end though, given that these are separate invocations of the "ceph" command-line utility.

#5 - Updated by Jeff Layton over 7 years ago

Correction: logfiles for the one I quoted originally are here:

http://pulpito.ceph.com/jlayton-2016-08-22_12:35:31-fs-master---basic-mira/

#6 - Updated by Samuel Just over 7 years ago

The "config get" output should have been generated in CephContext::do_command, that would be a good place to look. Looks like we there is logging at the end indicating the length, would be good to know whether the length matches the part that makes sense, or the extended part that doesn't

#7 - Updated by Jeff Layton over 7 years ago

I reran the test this morning, and then quickly logged in and locked the machine where the failure occurred just after it did. I don't think it was fast enough though, as I can't seem to locate the logs and the ceph packages had all already been uninstalled.

Is there some way to ensure that the osd logs are preserved after a teuthology run?

Hmm...and even then, on my own test rig, I don't see the do_command log messages. I'm guessing I'd have to crank up the debug level for the osd daemon there? Is that even possible under teuthology?

#8 - Updated by Jeff Layton over 7 years ago

With Samuel's help, I was able to crank up the debug levels for the OSD and rerun the test. Here's what we see:

2016-08-23 15:45:14.644812 7efd43f4d700 30 asok(0xbc86560) AdminSocket: calling accept
2016-08-23 15:45:14.644883 7efd43f4d700 30 asok(0xbc86560) AdminSocket: finished accept
2016-08-23 15:45:14.645878 7efd43f4d700  5 asok(0xbc86560) AdminSocket: request 'get_command_descriptions' '' to 0xbbe8120 returned 4191 bytes
2016-08-23 15:45:14.674393 7efd43f4d700 30 asok(0xbc86560) AdminSocket: calling accept
2016-08-23 15:45:14.674446 7efd43f4d700 30 asok(0xbc86560) AdminSocket: finished accept
2016-08-23 15:45:14.674810 7efd43f4d700  1 do_command 'config get' 'var:osd_mon_report_interval_max 
2016-08-23 15:45:14.674903 7efd43f4d700  1 do_command 'config get' 'var:osd_mon_report_interval_max result is 45 bytes
2016-08-23 15:45:14.674980 7efd43f4d700  5 asok(0xbc86560) AdminSocket: request 'config get' '' to 0xbbe8030 returned 45 bytes

45 bytes is the expected size (sans the trailing junk at the end). Samuel suspects that something else is writing to the fd at the same time. I'm not so sure...
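A rough way to answer the question from #6 (which part of the response does the logged length correspond to), given a captured response: json.JSONDecoder().raw_decode() parses the leading document and reports where it ends, so the size of the valid prefix can be measured directly. Sketch only; split_json_prefix is an invented helper, and response_data stands in for the 214-character string that json_asok() received (not reproduced here):

import json

def split_json_prefix(response_data):
    """Parse the leading JSON document and return (object, valid_prefix, extra)."""
    s = response_data.lstrip()            # raw_decode() won't skip leading whitespace
    obj, end = json.JSONDecoder().raw_decode(s)
    return obj, s[:end], s[end:]

# For this failure the valid prefix should measure out to roughly the same
# 45 bytes that the do_command log reports, which would suggest the junk was
# tacked on somewhere after do_command built its result.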

#10 - Updated by Samuel Just over 7 years ago

  • Status changed from New to Resolved

#11 - Updated by Nathan Cutler over 5 years ago

  • Related to Bug #25148 ("ceph session ls" produces unparseable json when run against ceph-mds.a.asok when mds.a not running) added