Bug #9860

Updated by Loïc Dachary over 9 years ago

h3. Workaround 

 Disable os-prober by setting the following in /etc/default/grub:
 <pre> 
 GRUB_DISABLE_OS_PROBER=true 
 </pre> 

 http://www.gnu.org/software/grub/manual/html_node/Simple-configuration.html 
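 A minimal sketch of applying the workaround from a shell, assuming the Debian default location /etc/default/grub and GNU sed. The helper name @disable_os_prober@ is hypothetical and not part of grub; it also assumes the file ends with a newline.

```shell
# Hypothetical helper: make sure GRUB_DISABLE_OS_PROBER=true is set in a
# grub defaults file (path is an assumption, defaults to /etc/default/grub).
disable_os_prober() {
    conf="${1:-/etc/default/grub}"
    if grep -q '^GRUB_DISABLE_OS_PROBER=' "$conf"; then
        # The variable is already present: overwrite its value in place.
        sed -i 's/^GRUB_DISABLE_OS_PROBER=.*/GRUB_DISABLE_OS_PROBER=true/' "$conf"
    else
        # The variable is absent: append it.
        printf '%s\n' 'GRUB_DISABLE_OS_PROBER=true' >> "$conf"
    fi
}
```

 Typical use would be @disable_os_prober /etc/default/grub@ as root, followed by @update-grub@ so that the generated configuration picks up the change.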

 h3. Description 

 This morning, an automatic Debian jessie upgrade of the following packages on our running system:

 libsigc++-2.0-0c2a,libssl1.0.0,man-db,libgtk2.0-common,libgtk2.0-bin, 
 libgtk2.0-0,openssh-sftp-server,openssh-server, 
 openssh-client,grub-pc,grub-pc-bin,grub2-common,grub-common,openssl,python-cryptography,python-pygraphviz 

 killed five of the 15 OSDs on our Ceph 0.80.6 cluster of 5 machines:

 root@g2:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.7.log 
 2014-10-22 07:41:36.783358 7f4d33d55700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted 
 2014-10-22 07:41:36.783617 7f4d33d55700 -1 journal FileJournal::do_write: write_bl(pos=793935872) failed 
 2014-10-22 07:41:36.800201 7f4d33d55700 -1 os/FileJournal.cc: In function 'void FileJournal::do_write(ceph::bufferlist&)' thread 7f4d33d55700 time 2014-10-22 07:41:36.783629 
 2014-10-22 07:41:36.847389 7f4d33d55700 -1 *** Caught signal (Aborted) ** 

 root@n7:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.10.log|cut -c-120 
 2014-10-22 07:42:18.169142 7f9b977df700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted 

 root@n7:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.9.log|cut -c-120 
 2014-10-22 07:42:17.509579 7f6efa27b700 -1 osd.9 15390 heartbeat_check: no reply from osd.13 since back 2014-10-22 07:41 
 2014-10-22 07:42:17.509593 7f6efa27b700 -1 osd.9 15390 heartbeat_check: no reply from osd.14 since back 2014-10-22 07:41 
 2014-10-22 07:42:17.945433 7f6ef6a74700 -1 journal FileJournal::do_write: pwrite(fd=23, hbp.length=4096) failed :(1) Ope 
 2014-10-22 07:42:17.960678 7f6ef6a74700 -1 os/FileJournal.cc: In function 'void FileJournal::do_write(ceph::bufferlist&) 

 root@stri:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.13.log 
 2014-10-22 00:42:01.140574 7fa929b8a700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted 
 2014-10-22 00:42:01.141439 7fa929b8a700 -1 journal FileJournal::do_write: write_bl(pos=3496448000) failed 

 root@stri:/var/log/ceph# grep -E ^2014-10-22 ceph-osd.14.log 
 2014-10-22 00:41:54.828719 7f438eb45700 -1 osd.14 15388 heartbeat_check: no reply from osd.7 since back 2014-10-22 00:41:34.499777 front 2014-10-22 00:41:34.499777 (cutoff 2014-10-22 00:41:34.828717) 
 2014-10-22 00:41:55.241586 7f437217f700 0 -- 192.168.99.246:6811/17136 >> 192.168.99.253:6806/25800 pipe(0x7f439f5fd900 sd=182 :6811 s=0 pgs=0 cs=0 l=0 c=0x7f43a71f1180).accept connect_seq 34 vs existing 33 state standby 
 2014-10-22 00:42:01.235014 7f438b33e700 -1 journal FileJournal::write_bl : write_fd failed: (1) Operation not permitted 
 2014-10-22 00:42:01.235032 7f438b33e700 -1 journal FileJournal::do_write: write_bl(pos=4626878464) failed 

 According to the logs, the OSDs all died just after a run of os-prober:

 Oct 22 07:41:36 g2 os-prober: debug: running /usr/lib/os-probes/mounted/05efi on mounted /dev/sda1 
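 To line up the journal failures with os-prober runs, the failure timestamps can be pulled out of the OSD logs. This is only a sketch: the function name @journal_write_failures@ is hypothetical, and it assumes log lines shaped like the excerpts above and log paths that contain no colon.

```shell
# Print "<log file> <date> <time>" for each journal write failure
# found in the OSD log files given as arguments.
journal_write_failures() {
    grep -H 'FileJournal::write_bl : write_fd failed' "$@" |
        sed -E 's/^([^:]+):[[:space:]]*([0-9-]+) ([0-9:]+)\.[0-9]+ .*/\1 \2 \3/'
}
```

 For example, @journal_write_failures /var/log/ceph/ceph-osd.*.log@ gives times to compare with the os-prober lines in the system log. Note that in the excerpts above the host stri logs 00:41 where the other hosts log 07:41, so the clocks or timezones may need to be reconciled before comparing.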

 os-prober likely performed an operation on the journal partition that caused the write to fail. Perhaps the OSD could be made more robust in this case.

 In the meantime, we have deactivated os-prober and grub updates.
