Bug #21401
closed
rgw: Missing error handling when gen_rand_alphanumeric is failing
Description
The function gen_rand_alphanumeric() tries to read some randomness from /dev/urandom and convert it into a string. The read operation may fail (e.g. with "Too many open files"), in which case a negative error code is returned:
int gen_rand_alphanumeric(CephContext *cct, char *dest, int size) /* size should be the required string size + 1 */
{
  int ret = get_random_bytes(dest, size);
  if (ret < 0) {
    lderr(cct) << "cannot get random bytes: " << cpp_strerror(-ret) << dendl;
    return ret;
  }
  ...
}
The consuming function append_rand_alpha(), however, does not check the return code; it uses the uninitialized buffer buf and appends it to the result string:
static inline void append_rand_alpha(CephContext *cct, const string& src, string& dest, int len)
{
  dest = src;
  char buf[len + 1];
  gen_rand_alphanumeric(cct, buf, len);
  dest.append("_");
  dest.append(buf);
}
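A minimal sketch of the obvious fix is to check the return code and propagate the error instead of appending the never-filled buffer. The names below (fake_gen_rand_alphanumeric, append_rand_alpha_checked) are illustrative stand-ins, not the actual Ceph patch:

```cpp
#include <cerrno>
#include <cstring>
#include <string>
#include <vector>

// Stand-in for gen_rand_alphanumeric(): fills dest with size-1 characters
// plus a trailing NUL, or returns a negative errno on failure (the real
// function reads from /dev/urandom via get_random_bytes()).
static int fake_gen_rand_alphanumeric(char *dest, int size, bool fail) {
  if (fail)
    return -EMFILE;  // "Too many open files", as observed in this report
  std::memset(dest, 'a', size - 1);
  dest[size - 1] = '\0';
  return 0;
}

// Sketch of a checked append_rand_alpha(): on failure, dest is left
// untouched and the error is returned to the caller instead of silently
// appending uninitialized memory.
static int append_rand_alpha_checked(const std::string &src, std::string &dest,
                                     int len, bool fail) {
  std::vector<char> buf(len + 1);
  int r = fake_gen_rand_alphanumeric(buf.data(), len + 1, fail);
  if (r < 0)
    return r;  // propagate the error; caller must handle it
  dest = src;
  dest.append("_");
  dest.append(buf.data());
  return 0;
}
```

Making the function return int of course means every caller has to be touched as well, which is why the eventual rework in master went further (see the backport notes below in the thread).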
As a result, when this happens while an object is being copied, we see its tag and prefix fields containing garbage instead of the expected 24-character string. In particular, the prefix field seems to always contain just ".P_", leading to collisions in tail object names and, in the long run, to data loss: a second object's tail objects will now overwrite those of the first object.
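The observed ".P_" prefix is consistent with the failure path: the buffer is never filled, and if its first byte happens to be NUL, append_rand_alpha() appends an empty C string. A toy model of the broken path (not Ceph code; the ".P" source string is assumed for illustration, and the buffer is zeroed rather than truly uninitialized):

```cpp
#include <string>

// Model of the broken path: gen_rand_alphanumeric() has failed, so buf is
// never filled. In the real bug the stack memory is uninitialized; it is
// zeroed here to model the observed outcome where the prefix is just ".P_".
static std::string broken_append_rand_alpha(const std::string &src) {
  char buf[25] = {0};  // gen_rand_alphanumeric() failed: buf stays empty
  std::string dest = src;
  dest.append("_");
  dest.append(buf);    // appends nothing, since buf[0] == '\0'
  return dest;
}
```

Every affected object thus ends up with the identical prefix, and two different objects derive identical tail object names from it, so the second PUT overwrites the first object's tail.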
Originally found in v0.94.10, but the code still looks the same in master.
Updated by Jens Harbott over 6 years ago
In order to reproduce, just make /dev/urandom inaccessible for some time, e.g. run (on the rgw node):

mv /dev/urandom /dev/blah; sleep 60; mv /dev/blah /dev/urandom
During these 60 seconds now run on a client node:
$ dd if=/dev/urandom of=test01 count=520 bs=1024
520+0 records in
520+0 records out
532480 bytes (532 kB) copied, 0.044453 s, 12.0 MB/s
$ dd if=/dev/urandom of=test02 count=520 bs=1024
520+0 records in
520+0 records out
532480 bytes (532 kB) copied, 0.0425381 s, 12.5 MB/s
$ s3cmd put test01 s3://test1/atest01
test01 -> s3://test1/atest01  [1 of 1]
 532480 of 532480   100% in    0s     3.57 MB/s  done
$ s3cmd put test02 s3://test1/atest02
test02 -> s3://test1/atest02  [1 of 1]
 532480 of 532480   100% in    0s     3.91 MB/s  done
Now the first object is corrupted, as its tail object has been overwritten by the second PUT.
$ s3cmd get s3://test1/atest01 atest01
s3://test1/atest01 -> atest01  [1 of 1]
 532480 of 532480   100% in    0s    24.54 MB/s  done
WARNING: MD5 signatures do not match: computed=949048077122cc83f883a72b690db3cb, received="c9212fd208c64173a7b4b1e057f5b752"
One will also get truncated objects when one of these objects is removed; see http://tracker.ceph.com/issues/20107 and http://tracker.ceph.com/issues/20166 for related issues.
Updated by Casey Bodley over 6 years ago
- Status changed from New to 12
- Assignee set to Casey Bodley
Updated by Casey Bodley over 6 years ago
- Status changed from 12 to Fix Under Review
Updated by Abhishek Lekshmanan over 6 years ago
- Status changed from Fix Under Review to Pending Backport
- Backport set to Luminous
Jewel backports may need a separate fix.
Updated by Abhishek Lekshmanan over 6 years ago
- Copied to Backport #21851: luminous: rgw: Missing error handling when gen_rand_alphanumeric is failing added
Updated by Nathan Cutler over 6 years ago
- Backport changed from Luminous to luminous
Updated by Nathan Cutler over 6 years ago
- Related to Bug #22006: RGWCrashError: RGW will crash when generating random bucket name and object name during loadgen process added
Updated by Nathan Cutler over 6 years ago
Note that the luminous backport is non-trivial because random number generation has been reworked in master.
Updated by Casey Bodley almost 6 years ago
- Related to Bug #22225: rgw:socket leak in s3 multi part upload added
Updated by Nathan Cutler over 4 years ago
- Status changed from Pending Backport to Resolved
While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".