Project

General

Profile

Bug #19033

cephfs: mds is crushed, after I set about 400 64KB xattr kv pairs to a file

Added by Honggang Yang 4 months ago. Updated 2 months ago.

Status:
Pending Backport
Priority:
High
Assignee:
-
Category:
-
Target version:
-
Start date:
02/22/2017
Due date:
% Done:

0%

Source:
Development
Tags:
Backport:
jewel, kraken
Regression:
No
Severity:
1 - critical
Reviewed:
ceph-qa-suite:
Release:
jewel, kraken, master
Component(FS):
Client, MDS, osdc
Needs Doc:
No

Description

1. Problem

After I have set about 400 64KB xattr kv pair to a file,
mds is crashed. Every time I try to start mds, it will crash again.

bt:

    Program received signal SIGABRT, Aborted.                                                       
    [Switching to Thread 0x7f9850be8700 (LWP 24280)]                                                
    0x00007f98591155d7 in raise () from /lib64/libc.so.6                                            
    (gdb) bt                                                                                        
    #0  0x00007f98591155d7 in raise () from /lib64/libc.so.6                                        
    #1  0x00007f9859116cc8 in abort () from /lib64/libc.so.6                                        
    #2  0x00007f985c926094 in ceph::__ceph_assert_fail (assertion=0x7f985cbca728 "write_buf.length() == write_pos - flush_pos", file=0x7f985cbca0b3 "osdc/Journaler.cc", line=674, 
        func=0x7f985cbcc1c0 <Journaler::_do_flush(unsigned int)::__PRETTY_FUNCTION__> "void Journaler::_do_flush(unsigned int)") at common/assert.cc:78                                                                                                         
    #3  0x00007f985c6c012d in Journaler::_do_flush (this=0x7f986801cd00, amount=0) at osdc/Journaler.cc:674                     
    #4  0x00007f985c6c1f17 in Journaler::_finish_prezero (this=0x7f986801cd00, r=-2, start=54525952, len=4194304) at osdc/Journaler.cc:846                                                                                                                      
    #5  0x00007f985c6cf148 in C_Journaler_Prezero::finish (this=0x7f98684a8040, r=-2) at osdc/Journaler.cc:771                  
    #6  0x00007f985c2a44c7 in Context::complete (this=0x7f98684a8040, r=-2) at include/Context.h:64                             
    #7  0x00007f985c7bcf74 in Finisher::finisher_thread_entry (this=0x7f9867fdf760) at common/Finisher.cc:76                    
    #8  0x00007f985c2ec120 in Finisher::FinisherThread::entry (this=0x7f9867fdf880) at common/Finisher.h:66                     
    #9  0x00007f985c90ec1d in Thread::entry_wrapper (this=0x7f9867fdf880) at common/Thread.cc:87                                
    #10 0x00007f985c90eb52 in Thread::_entry_func (arg=0x7f9867fdf880) at common/Thread.cc:67                                   
    #11 0x00007f985a708df5 in start_thread () from /lib64/libpthread.so.0                                                       
    #12 0x00007f98591d61ad in clone () from /lib64/libc.so.6                                                                    

    (gdb) l osdc/Journaler.cc:674                                                                                               
    669                   flush_pos, len, write_bl, ceph::real_clock::now(cct),                                                 
    670                   0,                                                                                                    
    671                   NULL, wrap_finisher(onsafe), write_iohint);                                                           
    672                                                                                                                         
    673       flush_pos += len;                                                                                                 
    674       assert(write_buf.length() == write_pos - flush_pos);                                                              
    675                                                                                                                         
    676       ldout(cct, 10)                                                                        
    677         << "_do_flush (prezeroing/prezero)/write/flush/safe pointers now at "               
    678         << "(" << prezeroing_pos << "/" << prezero_pos << ")/" << write_pos                 
    (gdb) up 3                                                                                      
    #3  0x00007f985c6c012d in Journaler::_do_flush (this=0x7f986801cd00, amount=0) at osdc/Journaler.cc:674
    674       assert(write_buf.length() == write_pos - flush_pos);                                  
    (gdb) p write_pos                                                                               
    $57 = 5989821898                                                                                
    (gdb) p flush_pos                                                                               
    $58 = 83886080                                                                                  
    (gdb) p write_buf.length()                                                                      
    $59 = 1610968522                   // 0x 60056dca                                                                   
    (gdb) p (write_pos - flush_pos)         |||||||||                                                           
    $60 = 5905935818                   // 0x160056dca 

2. Reason

The root reason is write_buf._len overflowed in the following call path.

MDLog::_submit_thread() -> Journaler::append_entry() -> JournalStream::write() -> buffer::list::claim_append()

Because there is no overflow checing in buffer::list::claim_append, this overflow is hidden.

Until Journaler::_do_flush() is called, this error is captured.

- first overflow checking                                                                       

::                                                                                              

  // flush                                                                                      
  unsigned len = write_pos - flush_pos; <---------------<<<< overflow can not be detected because len is 32bit
  assert(len == write_buf.length());                                                            

- len is asigned to newlen                                                                          

::                                                                                                  

    if (newlen < len) {                                                                             
      ldout(cct, 10) << "_do_flush wanted to do " << flush_pos << "~" << len                        
                     << " but hit prezero_pos " << prezero_pos                                                                        
                     << ", will do " << flush_pos << "~" << newlen << dendl;                        
      len = newlen;

- write_buf.splice is called                                                                        

::                                                                                                  

  // adjust pointers                                                                                
  if (len == write_buf.length()) {                                                                  
    write_bl.swap(write_buf);                                                                       
  } else {                                                                                          
    write_buf.splice(0, len, &write_bl);                                                            
  }                                                                                                 

- overflow is detected                                                                              

::                                                                                                  

  filer.write(ino, &layout, snapc,                                                                  
              flush_pos, len, write_bl, ceph::real_clock::now(),                                    
              0,                                                                                    
              wrap_finisher(onsafe), write_iohint);                                                 

  flush_pos += len;                                                                                 
  assert(write_buf.length() == write_pos - flush_pos); <-------------<<<< assert failed 

3. Fix proposal

- add bufferlist's _len overflow checking
- stop appending entry to a nearly overflow write_buf

4. Bug reproduce

    #include <sys/types.h>                                                                          
    #include <sys/xattr.h>                                                                          
    #include <stdio.h>                                                                              

    #define LEN (64*1024)                                                                           

    int main(int argc, const char* argv)                                                            
    {                                                                                               
        char val[LEN];                                                            
        int i = 0;                                                                                  

        for (; i < LEN; i++) {                                                                      
            val[i] = 'X';                                                                           
        }                                                                                           

        int j = 0;                                                                                  
        for (; j < 1000; j ++) {                                                                    
            char tmp[50];                                                                           
            snprintf(tmp, 50, "%s%d", "trusted.verybigkey", j);                                     
            printf("set xattr: %s\n", tmp);                                                         
            int ret = setxattr("/mnt/mycephfs/test6/mmm", tmp, val, sizeof(val), 0);                
            if (ret < 0) {                                                                          
                perror("setxattr failed\n");                                                        
            } else {                                                                                
                printf("setxattr success\n");                                                       
            }                                                                                       
        }                                                                                           
        return 0;                                                                                   
    }   

Related issues

Copied to fs - Backport #19673: jewel: cephfs: mds is crushed, after I set about 400 64KB xattr kv pairs to a file Resolved
Copied to fs - Backport #19674: kraken: cephfs: mds is crushed, after I set about 400 64KB xattr kv pairs to a file New

History

#2 Updated by John Spray 4 months ago

  • Status changed from New to Need Review

#3 Updated by John Spray 4 months ago

Thank you for the very detailed report

#4 Updated by John Spray 2 months ago

  • Status changed from Need Review to Pending Backport
  • Backport set to jewel, kraken

#5 Updated by Nathan Cutler 2 months ago

  • Copied to Backport #19673: jewel: cephfs: mds is crushed, after I set about 400 64KB xattr kv pairs to a file added

#6 Updated by Nathan Cutler 2 months ago

  • Copied to Backport #19674: kraken: cephfs: mds is crushed, after I set about 400 64KB xattr kv pairs to a file added

Also available in: Atom PDF