Project

General

Profile

Bug #11484

OPT_INT option interprets "3221225472" as -1073741824, and crashes in Throttle::Throttle()

Added by elder one about 4 years ago. Updated almost 4 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
Start date:
04/27/2015
Due date:
% Done:

60%

Source:
Community (user)
Tags:
Backport:
hammer
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:

Description

Upgraded one of my cluster nodes to Hammer 0.94.1.
After upgrade OSD-s on that node fail to start with error on log:
Throttle.cc: 38: FAILED assert(m >= 0)

See osd log for details.

Ubuntu 14.04, 3.18.12 kernel

ceph-osd.23.log View (595 KB) elder one, 04/27/2015 11:48 PM


Related issues

Related to Ceph - Feature #11525: improve OPTION macro to add a check function for invalid options. New 05/04/2015
Copied to Ceph - Backport #11738: OPT_INT option interprets "3221225472" as -1073741824, and crashes in Throttle::Throttle() Resolved 04/27/2015

Associated revisions

Revision d62f80dc (diff)
Added by Kefu Chai about 4 years ago

common/config: detect overflow of int values

  • #include "strtol.h" in strtol.cc, to ensure the function defintions
    are consistent.
  • add a test accordingly
  • fix the testcase of converting 1024E.
  • do not accept integers overflow after adding SI suffix
  • do not accept integers underflow (i.e. negative values)

Fixes: #11484
Signed-off-by: Kefu Chai <>

Revision 9b947fa3 (diff)
Added by Kefu Chai about 4 years ago

common/config: detect overflow of int values

  • #include "strtol.h" in strtol.cc, to ensure the function defintions
    are consistent.
  • add a test accordingly
  • fix the testcase of converting 1024E.
  • do not accept integers overflow after adding SI suffix
  • do not accept integers underflow (i.e. negative values)

Fixes: #11484
Signed-off-by: Kefu Chai <>
(cherry picked from commit d62f80dc7b25d312ff05b65b7be854aae15b66a8)

History

#1 Updated by Kefu Chai about 4 years ago

common/Throttle.cc: 38: FAILED assert(m >= 0)

 ceph version 0.94.1 (e4bfad3a3c51054df7e537a724c8d0bf9be972ff)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x8b) [0xbc271b]
 2: (Throttle::Throttle(CephContext*, std::string, long, bool)+0x290) [0xba7670]
 3: (FileJournal::FileJournal(uuid_d, Finisher*, Cond*, char const*, bool, bool, bool)+0x6ad) [0x9312dd]
 4: (FileStore::open_journal()+0xd9) [0x90a3c9]
 5: (FileStore::mount()+0x2824) [0x90f9f4]
 6: (OSD::init()+0x259) [0x6c47b9]
 7: (main()+0x2860) [0x651fc0]
 8: (__libc_start_main()+0xf5) [0x7f5d977c4ec5]
 9: /usr/bin/ceph-osd() [0x66aff7]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

looks like the "journal_queue_max_ops" and/or "journal_queue_max_bytes" is negative.

elder one, what are the values of these settings in your ceph cluster? or could you post the output of following commands:

ceph-conf --lookup journal_queue_max_bytes
ceph-conf --lookup journal_queue_max_ops

#2 Updated by elder one about 4 years ago

Here it is:

ceph-conf -s osd --lookup journal_queue_max_bytes
3G
ceph-conf -s osd --lookup journal_queue_max_ops
5000

Is it a "G" suffix causing troubles?

#3 Updated by elder one about 4 years ago

Yeah, seems to be so:

ceph --admin-daemon /var/run/ceph/ceph-osd.23.asok config show | grep journal_queue_max_bytes
"journal_queue_max_bytes": "-1073741824"

#4 Updated by elder one about 4 years ago

That's strange - changed journal_queue_max_bytes from "3G" to 3221225472 in ceph.conf and restarted all osd's on host but admin socket gives the same negative output:

ceph-conf -s osd --lookup journal_queue_max_bytes
3221225472
ceph --admin-daemon /var/run/ceph/ceph-osd.23.asok config show | grep journal_queue_max_bytes
"journal_queue_max_bytes": "-1073741824",

#5 Updated by Kefu Chai about 4 years ago

  • Status changed from New to Verified
  • Assignee set to Kefu Chai
  • Source changed from other to Community (user)

elder one, yes, this is a bug, because the 3221225472 (i.e. 0xc0000000) is out of range of numeric_limits<int>::max() (i.e. 0x7fffffff on LP64 arch0), and journal_queue_max_bytes is a config whose type is int. so your config value of 3221225472 cannot be represented by int.

so we need to:
  • fix the config parsing for int
  • probably change the type of journal_queue_max_bytes from int to unsigned long long (if we have a good reason to do so)

is it a "G" suffix causing troubles?

not exactly, you can still use the "G" suffix. but 3G exceeds the limit of int. so 2G-1 should be fine.

mean while, you could use "2147483647" instead of 3G for this config as a work around. just out of curiosity, is there any specific reason you need a 3G journal_queue_max_bytes for your system?

---
[0] most modern unix platforms on amd64 are LP64. see http://www.unix.org/whitepapers/64bit.html. and yes, i believe your linux is not an exception.

#6 Updated by Kefu Chai about 4 years ago

  • Subject changed from OSD-s die after upgrade from Giant to Hammer to OPT_INT option interprets "3221225472" as -1073741824, and crashes in Throttle::Throttle()

#7 Updated by Kefu Chai about 4 years ago

  • Status changed from Verified to In Progress
  • % Done changed from 0 to 40

#8 Updated by Kefu Chai about 4 years ago

  • Status changed from In Progress to Need Review
  • % Done changed from 40 to 60

#9 Updated by elder one about 4 years ago

You are right, apparently I don't need 3G journal queue. Value left info config after benchmarking/tuning cluster.
That 2G limit I figured out yesterday and now my cluster runs happily on Hammer:)

On Giant, OSD andmin socket gives also negative output, but osd-s start fine.

I propose if any value in config file is out of range, Ceph could use default value for that and warn about that on startup.

#10 Updated by Kefu Chai about 4 years ago

elder one wrote:

I propose if any value in config file is out of range, Ceph could use default value for that and warn about that on startup.

thank you, elder one, =)

Yeah, i had the same feeling when looking at this issue. will step back and see what we can do to make the error handling of bad config value more user friendly. maybe we should shutdown gracefully at seeing a bad config value when starting up, but reject it at runtime. will file another ticket when i have a better idea.

#11 Updated by Kefu Chai about 4 years ago

elder one, i created #11525 to track the feature request for improving the invalid option handling.

#12 Updated by Kefu Chai about 4 years ago

  • Status changed from Need Review to Resolved
  • Backport set to hammer
  • Regression set to No

#13 Updated by Kefu Chai about 4 years ago

  • Status changed from Resolved to Pending Backport

#14 Updated by Abhishek Lekshmanan about 4 years ago

  • Description updated (diff)

#15 Updated by Loic Dachary about 4 years ago

@abhishek did you intend to edit the summary to include a link to the master pull request https://github.com/ceph/ceph/pull/4494 ? You added https://github.com/ceph/ceph/pull/4889 which is the hammer backport matching the issue http://tracker.ceph.com/issues/11738.

#16 Updated by Abhishek Lekshmanan about 4 years ago

@loic sorry it was unintended.. Nathan informed me of the mistake I had done; basically if we use the script and feed it some other tracker id; it goes and updates the description to that instead..

#17 Updated by Abhishek Lekshmanan about 4 years ago

  • Description updated (diff)

#18 Updated by Loic Dachary almost 4 years ago

  • Status changed from Pending Backport to Resolved

Also available in: Atom PDF