Bug #11510 (closed)

ceph-fuse crashes

Added by Jifeng Yin almost 9 years ago. Updated almost 9 years ago.

Status: Resolved
Priority: Normal
Assignee: -
Category: -
Target version: -
% Done: 0%
Source: other
Tags:
Backport:
Regression: No
Severity: 3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Component(FS):
Labels (FS):
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

When I read the file through ceph-fuse, it crashes, while listing is okay.
The kernel driver doesn't crash, but shows a 'read error'.

   
   -46> 2015-04-30 16:29:12.858174 7fe957fff700  3 client.58272 ll_getattr 10000000013.head
   -45> 2015-04-30 16:29:12.858176 7fe957fff700 10 client.58272 _getattr mask pAsLsXsFs issued=1
   -44> 2015-04-30 16:29:12.858179 7fe957fff700 10 client.58272 fill_stat on 10000000013 snap/devhead mode 040755 mtime 2015-04-30 11:42:59.464095 ctime 2015-04-30 11:42:59.464095
   -43> 2015-04-30 16:29:12.858191 7fe957fff700  3 client.58272 ll_getattr 10000000013.head = 0
   -42> 2015-04-30 16:29:12.858200 7fe957fff700  3 client.58272 ll_forget 10000000013 1
   -41> 2015-04-30 16:29:12.858202 7fe957fff700 20 client.58272 _ll_put 0x7fe96801c210 10000000013 1 -> 61
   -40> 2015-04-30 16:29:12.858206 7fe957fff700 20 client.58272 _ll_get 0x7fe96801c210 10000000013 -> 62
   -39> 2015-04-30 16:29:12.858208 7fe957fff700  3 client.58272 ll_lookup 0x7fe96801c210 1_20150416143559_6122
   -38> 2015-04-30 16:29:12.858210 7fe957fff700 20 client.58272 _lookup have dn 1_20150416143559_6122 mds.-1 ttl 0.000000 seq 0
   -37> 2015-04-30 16:29:12.858213 7fe957fff700 10 client.58272 _lookup 10000000013.head(ref=3 ll_ref=62 cap_refs={} open={} mode=40755 size=0/0 mtime=2015-04-30 11:42:59.464095 caps=pAsLsXsFs(0=pAsLsXsFs) COMPLETE parents=0x7fe96801cf30 0x7fe96801c210) 1_20150416143559_6122 = 10000000015.head(ref=2 ll_ref=8 cap_refs={} open={} mode=100664 size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[10000000015 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30)
   -36> 2015-04-30 16:29:12.858242 7fe957fff700 10 client.58272 fill_stat on 10000000015 snap/devhead mode 0100664 mtime 2015-04-20 14:14:57.961482 ctime 2015-04-20 14:14:57.960359
   -35> 2015-04-30 16:29:12.858249 7fe957fff700 20 client.58272 _ll_get 0x7fe968021c30 10000000015 -> 9
   -34> 2015-04-30 16:29:12.858250 7fe957fff700  3 client.58272 ll_lookup 0x7fe96801c210 1_20150416143559_6122 -> 0 (10000000015)
   -33> 2015-04-30 16:29:12.858257 7fe957fff700  3 client.58272 ll_forget 10000000013 1
   -32> 2015-04-30 16:29:12.858258 7fe957fff700 20 client.58272 _ll_put 0x7fe96801c210 10000000013 1 -> 61
   -31> 2015-04-30 16:29:12.858263 7fe957fff700 20 client.58272 _ll_get 0x7fe968021c30 10000000015 -> 10
   -30> 2015-04-30 16:29:12.858265 7fe957fff700  3 client.58272 ll_getattr 10000000015.head
   -29> 2015-04-30 16:29:12.858266 7fe957fff700 10 client.58272 _getattr mask pAsLsXsFs issued=1
   -28> 2015-04-30 16:29:12.858270 7fe957fff700 10 client.58272 fill_stat on 10000000015 snap/devhead mode 0100664 mtime 2015-04-20 14:14:57.961482 ctime 2015-04-20 14:14:57.960359
   -27> 2015-04-30 16:29:12.858281 7fe957fff700  3 client.58272 ll_getattr 10000000015.head = 0
   -26> 2015-04-30 16:29:12.858287 7fe957fff700  3 client.58272 ll_forget 10000000015 1
   -25> 2015-04-30 16:29:12.858289 7fe957fff700 20 client.58272 _ll_put 0x7fe968021c30 10000000015 1 -> 9
   -24> 2015-04-30 16:29:12.858296 7fe9757f2700 20 client.58272 _ll_get 0x7fe968021c30 10000000015 -> 10
   -23> 2015-04-30 16:29:12.858305 7fe9757f2700  3 client.58272 ll_open 10000000015.head 32768
   -22> 2015-04-30 16:29:12.858309 7fe9757f2700 10 client.58272 check_caps on 10000000015.head(ref=2 ll_ref=10 cap_refs={} open={1=1} mode=100664 size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[10000000015 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30) wanted pFscr used - is_delayed=1
   -21> 2015-04-30 16:29:12.858326 7fe9757f2700 10 client.58272  cap mds.0 issued pAsLsXsFscr implemented pAsLsXsFscr revoking -
   -20> 2015-04-30 16:29:12.858333 7fe9757f2700 10 client.58272 send_cap 10000000015.head(ref=2 ll_ref=10 cap_refs={} open={1=1} mode=100664 size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[10000000015 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30) mds.0 seq 1 used - want pFscr flush - retain pAsxLsxXsxFsxcrwbl held pAsLsXsFscr revoking - dropping -
   -19> 2015-04-30 16:29:12.858358 7fe9757f2700 15 client.58272 auth cap, setting max_size = 0
   -18> 2015-04-30 16:29:12.858368 7fe9757f2700 10 client.58272 _create_fh 10000000015 mode 1
   -17> 2015-04-30 16:29:12.858376 7fe9757f2700 20 client.58272 trim_cache size 14 max 16384
   -16> 2015-04-30 16:29:12.858378 7fe9757f2700  3 client.58272 ll_open 10000000015.head 32768 = 0 (0x7fe95c0052c0)
   -15> 2015-04-30 16:29:12.858385 7fe9757f2700  3 client.58272 ll_forget 10000000015 1
   -14> 2015-04-30 16:29:12.858386 7fe9757f2700 20 client.58272 _ll_put 0x7fe968021c30 10000000015 1 -> 9
   -13> 2015-04-30 16:29:12.858500 7fe974ff1700 20 client.58272 _ll_get 0x7fe968021c30 10000000015 -> 10
   -12> 2015-04-30 16:29:12.858503 7fe974ff1700  3 client.58272 ll_getattr 10000000015.head
   -11> 2015-04-30 16:29:12.858505 7fe974ff1700 10 client.58272 _getattr mask pAsLsXsFs issued=1
   -10> 2015-04-30 16:29:12.858509 7fe974ff1700 10 client.58272 fill_stat on 10000000015 snap/devhead mode 0100664 mtime 2015-04-20 14:14:57.961482 ctime 2015-04-20 14:14:57.960359
    -9> 2015-04-30 16:29:12.858518 7fe974ff1700  3 client.58272 ll_getattr 10000000015.head = 0
    -8> 2015-04-30 16:29:12.858525 7fe974ff1700  3 client.58272 ll_forget 10000000015 1
    -7> 2015-04-30 16:29:12.858526 7fe974ff1700 20 client.58272 _ll_put 0x7fe968021c30 10000000015 1 -> 9
    -6> 2015-04-30 16:29:12.858536 7fe9577fe700  3 client.58272 ll_read 0x7fe95c0052c0 10000000015  0~4096
    -5> 2015-04-30 16:29:12.858539 7fe9577fe700 10 client.58272 get_caps 10000000015.head(ref=3 ll_ref=9 cap_refs={} open={1=1} mode=100664 size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[10000000015 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30) have pAsLsXsFscr need Fr want Fc but not Fc revoking -
    -4> 2015-04-30 16:29:12.858561 7fe9577fe700 10 client.58272 _read_async 10000000015.head(ref=3 ll_ref=9 cap_refs={2048=1} open={1=1} mode=100664 size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[10000000015 ts 0/0 objects 0 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30) 0~4096
    -3> 2015-04-30 16:29:12.858575 7fe9577fe700 10 client.58272  max_byes=0 max_periods=4
    -2> 2015-04-30 16:29:12.858692 7fe9577fe700  5 client.58272 get_cap_ref got first FILE_CACHE ref on 10000000015.head(ref=3 ll_ref=9 cap_refs={1024=0,2048=1} open={1=1} mode=100664 size=119/0 mtime=2015-04-20 14:14:57.961482 caps=pAsLsXsFscr(0=pAsLsXsFscr) objectset[10000000015 ts 0/0 objects 1 dirty_or_tx 0] parents=0x7fe968022980 0x7fe968021c30)
    -1> 2015-04-30 16:29:12.867657 7fe9797fa700 10 client.58272 ms_handle_connect on 172.16.3.149:6823/982446
     0> 2015-04-30 16:29:12.872787 7fe97bfff700 -1 *** Caught signal (Aborted) **
 in thread 7fe97bfff700

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (()+0x249805) [0x7fe9859eb805]
 2: (()+0x10d10) [0x7fe984f19d10]
 3: (gsignal()+0x37) [0x7fe9837b3267]
 4: (abort()+0x16a) [0x7fe9837b4eca]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x16d) [0x7fe9840c706d]
 6: (()+0x5eee6) [0x7fe9840c4ee6]
 7: (()+0x5ef31) [0x7fe9840c4f31]
 8: (()+0x5f149) [0x7fe9840c5149]
 9: (ceph::buffer::list::substr_of(ceph::buffer::list const&, unsigned int, unsigned int)+0x24b) [0x7fe985b6993b]
 10: (ObjectCacher::_readx(ObjectCacher::OSDRead*, ObjectCacher::ObjectSet*, Context*, bool)+0x1423) [0x7fe9859a6b73]
 11: (ObjectCacher::C_RetryRead::finish(int)+0x20) [0x7fe9859ad870]
 12: (Context::complete(int)+0x9) [0x7fe985967eb9]
 13: (void finish_contexts<Context>(CephContext*, std::list<Context*, std::allocator<Context*> >&, int)+0xac) [0x7fe9859aa73c]
 14: (ObjectCacher::bh_read_finish(long, sobject_t, unsigned long, long, unsigned long, ceph::buffer::list&, int, bool)+0x29e) [0x7fe98599fd2e]
 15: (ObjectCacher::C_ReadFinish::finish(int)+0x7f) [0x7fe9859ac85f]
 16: (Context::complete(int)+0x9) [0x7fe985967eb9]
 17: (C_Lock::finish(int)+0x29) [0x7fe985968269]
 18: (Context::complete(int)+0x9) [0x7fe985967eb9]
 19: (Finisher::finisher_thread_entry()+0x1b4) [0x7fe9859ff184]
 20: (()+0x76aa) [0x7fe984f106aa]
 21: (clone()+0x6d) [0x7fe983884eed]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
Actions #1

Updated by John Spray almost 9 years ago

I've definitely seen this backtrace before, trying to remember when... could it have been when we were dealing with the case of invalid layouts on a file?

Jifeng: was this file using a customized layout?

Actions #2

Updated by Jifeng Yin almost 9 years ago

Nothing special. I just used "ceph fs new cephfs".

I've noticed ticket #11120, and trying 'getfattr -d' shows nothing.

Actions #3

Updated by Jifeng Yin almost 9 years ago

John, do you know how to export the data from the underlying pool? I'm very curious.

ceph-fuse just crashes when I copy/read the files.

Oh, FYI, `ceph -s` reported `failing to respond to capability release` when I first looked into the issue, and a "ceph daemon cephfs.sock session evict" made it go away.

And at some point, the log said:

   -11> 2015-04-30 15:55:59.063828 7fd6a816c700 10 -- 172.30.11.188:0/10443 >> 172.16.3.153:6820/1532355 pipe(0x7fd6740344c0 sd=8 :58596 s=2 pgs=3721 cs=1 l=1 c=0x7fd674038760).reader got message 1 0x7fd65c001940 osd_op_reply(1 10000000019.00000000 [read 0~4390] v0'0 uv0 ack = -1 ((1) Operation not permitted)) v6

Hope it helps. Thanks!

Actions #4

Updated by Jifeng Yin almost 9 years ago

The admin account works, so it turns out to be a permission problem, although I'm surprised that it worked at first.

I'm on holiday; more details will come later.

Actions #5

Updated by Zheng Yan almost 9 years ago

  • Status changed from New to 12

This bug is fixed by commit 540346d4 (osdc: In _readx() only no error can tidy read result).

Actions #6

Updated by Jifeng Yin almost 9 years ago

It works when I use a user with the caps

mds 'allow rwx' mon 'allow r' osd 'allow rwx pool=cephfs_data, allow rwx pool=cephfs_metadata'

When I set up the fs, I thought it was okay because I could touch a file without any problem. Actually, it isn't???

Actions #7

Updated by Greg Farnum almost 9 years ago

  • Status changed from 12 to Need More Info

It still shouldn't have crashed, and I'm not sure why that would have happened, unless the kernel gave us ENOMEM or something...?

Actions #8

Updated by Zheng Yan almost 9 years ago

  • Status changed from Need More Info to 12
  • Regression set to No

It crashes because ObjectCacher does not check the error code of the OSD read request. The bug is already fixed in the master branch by commit 540346d4 (osdc: In _readx() only no error can tidy read result). The question is whether we need to backport the fix to hammer.
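
For illustration, here is a minimal, self-contained C++ sketch of that idea; the types and names (ByteList, ReadResult, finish_read) are stand-ins invented for this example, not Ceph's actual ObjectCacher/bufferlist API. The point is only that the read-completion path hands back the negative error code instead of slicing a buffer that the failed (-EPERM) read never filled, which is what drove buffer::list::substr_of to abort in the backtrace above.

    // Sketch only, assuming a simplified read path; not Ceph's real code.
    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <iostream>
    #include <stdexcept>
    #include <vector>

    // Stand-in for ceph::buffer::list: a byte vector plus a substr helper that,
    // like substr_of, refuses to hand out a range it does not actually contain.
    struct ByteList {
        std::vector<uint8_t> data;
        ByteList substr(size_t off, size_t len) const {
            if (off + len > data.size())
                throw std::out_of_range("substr past end of buffer");
            ByteList r;
            r.data.assign(data.begin() + off, data.begin() + off + len);
            return r;
        }
    };

    // Result of a hypothetical OSD read: rcode < 0 means -errno (e.g. -EPERM),
    // otherwise payload holds whatever the OSD returned.
    struct ReadResult {
        int rcode;
        ByteList payload;
    };

    // Completion path: trim the payload to the caller's requested extent, but
    // only when the read actually succeeded. Returns bytes read or -errno.
    int finish_read(const ReadResult& r, size_t off, size_t want, ByteList* out) {
        if (r.rcode < 0)
            return r.rcode;  // propagate the error; do not touch the empty payload
        size_t have = r.payload.data.size();
        size_t len = std::min(want, have > off ? have - off : 0);
        *out = r.payload.substr(off, len);
        return static_cast<int>(len);
    }

    int main() {
        ByteList out;
        // Simulate the reply seen in comment #3: (1) Operation not permitted.
        ReadResult denied{-1, {}};
        std::cout << "denied read -> " << finish_read(denied, 0, 4096, &out) << "\n";

        ReadResult ok{0, {{'h', 'e', 'l', 'l', 'o'}}};
        std::cout << "ok read -> " << finish_read(ok, 0, 4096, &out) << " bytes\n";
        return 0;
    }

In the actual fix the equivalent check sits inside ObjectCacher::_readx(), per the commit subject quoted above.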

Actions #9

Updated by Jifeng Yin almost 9 years ago

Actually, I'm wondering how file creation (or maybe copying files) works without enough permissions. Could anyone give me a hint?

Actions #10

Updated by Zheng Yan almost 9 years ago

File creation goes through the MDS; it does not require the client to have permission to write to the OSDs.

Actions #11

Updated by Greg Farnum almost 9 years ago

  • Assignee set to Josh Durgin

Looks like this is something that will cause trouble for RBD as well. Josh, Jason, are you guys planning to backport any of that series? (https://github.com/ceph/ceph/pull/3625)

Actions #12

Updated by Josh Durgin almost 9 years ago

  • Assignee deleted (Josh Durgin)

Wasn't planning on backporting those, since EPERM is caught before the object cacher is used for rbd, but feel free to backport the first two commits. A similar purge on shutdown may be needed in Client.cc, like the one the last commit added to librbd.

Actions #13

Updated by Greg Farnum almost 9 years ago

  • Status changed from 12 to Resolved

I backported the pool permissions checking (https://github.com/ceph/ceph/pull/4629); it's waiting for the backports guys but will get in eventually.
