Bug #12297

Updated by Loic Dachary almost 5 years ago

Running CephFS for data volumes on a compute cluster.

ceph-fuse aborts and leaves the mount point unusable (transport endpoint not connected).

Excerpt of the client log, including the 20 most recent entries:
<pre>


-20> 2015-07-13 12:08:13.905733 7f60be7fc700 20 client.412413 trim_caps counting as trimmed: 100015d4d13.head(ref=2 ll_ref=1 cap_refs={} open={} mode=120777 size=74/0 mtime=2015-07-13 12:08:05.875809 caps=pAsLsXsFscr(0=pAsLsXsFscr) 0x7f60ac8c6320)
-19> 2015-07-13 12:08:13.905744 7f60be7fc700 10 client.412413 put_inode on 100015d4d13.head(ref=2 ll_ref=1 cap_refs={} open={} mode=120777 size=74/0 mtime=2015-07-13 12:08:05.875809 caps=pAsLsXsFscr(0=pAsLsXsFscr) 0x7f60ac8c6320)
-18> 2015-07-13 12:08:13.905755 7f60be7fc700 20 client.412413 trying to trim dentries for 100015d4d14.head(ref=2 ll_ref=1 cap_refs={} open={} mode=120777 size=74/0 mtime=2015-07-13 12:08:05.893082 caps=pAsLsXsFscr(0=pAsLsXsFscr) parents=0x7f5fb43408e0 0x7f60acc44830)
-17> 2015-07-13 12:08:13.905767 7f60be7fc700 15 client.412413 trim_dentry unlinking dn raw_reads.136.raw_reads.296.N2.las in dir 100012fa7e7
-16> 2015-07-13 12:08:13.905770 7f60be7fc700 15 client.412413 unlink dir 0x7f60ac59d5a0 'raw_reads.136.raw_reads.296.N2.las' dn 0x7f5fb43408e0 inode 0x7f60acc44830
-15> 2015-07-13 12:08:13.905773 7f60be7fc700 20 client.412413 unlink inode 0x7f60acc44830 parents now
-14> 2015-07-13 12:08:13.905774 7f60be7fc700 10 client.412413 put_inode on 100015d4d14.head(ref=3 ll_ref=1 cap_refs={} open={} mode=120777 size=74/0 mtime=2015-07-13 12:08:05.893082 caps=pAsLsXsFscr(0=pAsLsXsFscr) 0x7f60acc44830)
-13> 2015-07-13 12:08:13.905785 7f60be7fc700 15 client.412413 unlink removing 'raw_reads.136.raw_reads.296.N2.las' dn 0x7f5fb43408e0
-12> 2015-07-13 12:08:13.905789 7f60be7fc700 20 client.412413 trim_caps counting as trimmed: 100015d4d14.head(ref=2 ll_ref=1 cap_refs={} open={} mode=120777 size=74/0 mtime=2015-07-13 12:08:05.893082 caps=pAsLsXsFscr(0=pAsLsXsFscr) 0x7f60acc44830)
-11> 2015-07-13 12:08:13.905801 7f60be7fc700 10 client.412413 put_inode on 100015d4d14.head(ref=2 ll_ref=1 cap_refs={} open={} mode=120777 size=74/0 mtime=2015-07-13 12:08:05.893082 caps=pAsLsXsFscr(0=pAsLsXsFscr) 0x7f60acc44830)
-10> 2015-07-13 12:08:13.905840 7f60a3fff700 10 client.412413 _async_dentry_invalidate 'ch' ino 0 in dir 100009de0f2.head
-9> 2015-07-13 12:08:13.905849 7f606bfff700 15 client.412413 de raw_reads.124.raw_reads.296.C2.las off 6170202972637346 = 0
-8> 2015-07-13 12:08:13.906701 7f60a06f7700 2 -- 192.168.2.22:0/20333 >> 192.168.6.5:6810/4247 pipe(0x7f607c030cf0 sd=2 :37085 s=2 pgs=28199 cs=1 l=1 c=0x7f607c034f90).reader couldn't read tag, (11) Resource temporarily unavailable
-7> 2015-07-13 12:08:13.906738 7f60a06f7700 2 -- 192.168.2.22:0/20333 >> 192.168.6.5:6810/4247 pipe(0x7f607c030cf0 sd=2 :37085 s=2 pgs=28199 cs=1 l=1 c=0x7f607c034f90).fault (11) Resource temporarily unavailable
-6> 2015-07-13 12:08:13.906808 7f60be7fc700 1 client.412413.objecter ms_handle_reset on osd.61
-5> 2015-07-13 12:08:13.906818 7f60be7fc700 1 -- 192.168.2.22:0/20333 mark_down 0x7f607c034f90 -- pipe dne
-4> 2015-07-13 12:08:13.907127 7f60be7fc700 10 monclient: renew_subs
-3> 2015-07-13 12:08:13.907135 7f60be7fc700 10 monclient: _send_mon_message to mon.mon-i1 at 192.168.6.50:6789/0
-2> 2015-07-13 12:08:13.907141 7f60be7fc700 1 -- 192.168.2.22:0/20333 --> 192.168.6.50:6789/0 -- mon_subscribe({mdsmap=4312+,monmap=14+,osdmap=175910}) v2 -- ?+0 0x7f60ac68f8b0 con 0x3b50160
-1> 2015-07-13 12:08:13.907279 7f606bfff700 -1 *** Caught signal (Segmentation fault) **
in thread 7f606bfff700

ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
1: ceph-fuse() [0x6235ca]
2: (()+0x10340) [0x7f60c9dd8340]
3: (std::string::assign(std::string const&)+0x1c) [0x7f60c8ff748c]
4: (Client::_readdir_cache_cb(dir_result_t*, int (*)(void*, dirent*, stat*, int, long), void*)+0x39a) [0x55263a]
5: (Client::readdir_r_cb(dir_result_t*, int (*)(void*, dirent*, stat*, int, long), void*)+0xfc5) [0x596045]
6: ceph-fuse() [0x546d2d]
7: (()+0x13e76) [0x7f60ca20fe76]
8: (()+0x1522b) [0x7f60ca21122b]
9: (()+0x11e49) [0x7f60ca20de49]
10: (()+0x8182) [0x7f60c9dd0182]
11: (clone()+0x6d) [0x7f60c875547d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

0> 2015-07-13 12:08:13.908290 7f60be7fc700 1 -- 192.168.2.22:0/20333 <== mon.3 192.168.6.50:6789/0 266 ==== osd_map(175910..175910 src has 172194..175910) v3 ==== 222+0+0 (8575416 0 0) 0x7f60b00008c0 con 0x3b50160
</pre>


The OSD mentioned in log entry -8 is up and running, but it may be slow to respond because of ongoing backfill operations. The compute jobs are set up to write their output to the same file, so several CephFS clients are trying to open a single file for writing.
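The access pattern described above can be approximated with a small script. This is only a sketch of the workload, not a confirmed reproducer: the file name `shared_output.txt`, the function names, and all counts are made up for illustration. It uses threads on one host, whereas the real jobs run as separate CephFS clients. The directory listing thread is an assumption, added because the backtrace shows the crash in `Client::_readdir_cache_cb`.

```python
import os
import tempfile
import threading


def append_lines(path, worker_id, count):
    """Append `count` short lines to the shared file, reopening it each time."""
    for i in range(count):
        # Reopen on every write so each iteration issues a fresh open(),
        # similar to many short-lived jobs writing to the same output file.
        with open(path, "a") as f:
            f.write(f"worker {worker_id} line {i}\n")


def list_directory(directory, rounds):
    """List the directory repeatedly to generate readdir traffic."""
    for _ in range(rounds):
        os.listdir(directory)


def run_workload(directory, writers=4, lines_per_writer=50, list_rounds=200):
    """Run the writers and one lister concurrently; return the line count."""
    path = os.path.join(directory, "shared_output.txt")  # hypothetical name
    threads = [
        threading.Thread(target=append_lines, args=(path, w, lines_per_writer))
        for w in range(writers)
    ]
    threads.append(
        threading.Thread(target=list_directory, args=(directory, list_rounds))
    )
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    with open(path) as f:
        return sum(1 for _ in f)


if __name__ == "__main__":
    # Runs against a local temporary directory; to exercise ceph-fuse,
    # pass a directory on the CephFS mount instead.
    with tempfile.TemporaryDirectory() as d:
        print(run_workload(d))
```

To get closer to the reported setup, the same script would have to be started on several compute nodes at once, all pointing at one directory on the ceph-fuse mount.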
