Bug #63314
openkafka crashed during message callback in teuthology
0%
Description
test: http://pulpito.front.sepia.ceph.com/yuvalif-2023-10-25_06:07:27-rgw:notifications-wip-topic-policy-distro-default-smithi/
backtrace:
Fatal glibc error: tpp.c:87 (__pthread_tpp_change_priority): assertion failed: previous_prio == -1 || (previous_prio >= fifo_min_prio && previous_prio <= fifo_max_prio) *** Caught signal (Aborted) ** in thread 7fdf852cc640 thread_name:kafka_manager ceph version 18.0.0-6835-ge5c0f496 (e5c0f4966b3da3fed8fcdae150335c900609e4d9) reef (dev) 1: /lib64/libc.so.6(+0x54db0) [0x7fe0b8054db0] 2: /lib64/libc.so.6(+0xa154c) [0x7fe0b80a154c] 3: raise() 4: abort() 5: /lib64/libc.so.6(+0x29130) [0x7fe0b8029130] 6: /lib64/libc.so.6(+0x4daf7) [0x7fe0b804daf7] 7: /lib64/libc.so.6(+0xa7d18) [0x7fe0b80a7d18] 8: (RGWPubSubKafkaEndpoint::Waiter::finish(int)+0xbe) [0x55dc941187ee] 9: (rgw::kafka::message_callback(rd_kafka_s*, rd_kafka_message_s const*, void*)+0xef) [0x55dc94245c8f] 10: /lib64/librdkafka.so.1(+0x256ef) [0x7fe0b90776ef] 11: /lib64/librdkafka.so.1(+0x5b862) [0x7fe0b90ad862] 12: rd_kafka_poll() 13: (rgw::kafka::Manager::run()+0x328) [0x55dc94248008] 14: /lib64/libstdc++.so.6(+0xdb924) [0x7fe0b84db924] 15: /lib64/libc.so.6(+0x9f802) [0x7fe0b809f802] 16: /lib64/libc.so.6(+0x3f450) [0x7fe0b803f450]
Updated by Yuval Lifshitz 6 months ago
found this issue in the librdkafka repo: https://github.com/confluentinc/librdkafka/issues/3688
that seems related
Updated by Yuval Lifshitz 6 months ago
will attempts to fix the issue by using librdkafka as a git submodule.
using a version that has the fix above.
Updated by Yuval Lifshitz 6 months ago
according to the log, centos9 has this version: librdkafka-1.6.1.
and according to the git log, this version has the fix to the crash:
commit 7915f2a9adf30ca5938945c607b35c9c5a5a63e3 Author: Magnus Edenhill <magnus@edenhill.se> Date: Mon Apr 6 12:52:25 2020 +0200 Align bundled c11 threads (tinycthreads) constants to glibc and musl (#2681)
this means that there is no need for a newer version of the library via a submodule
Updated by Yuval Lifshitz 5 months ago
crash is happening in the dtor if the unique_lock defined here (when it goes out of scope): https://github.com/ceph/ceph/blob/main/src/rgw/driver/rados/rgw_pubsub_push.cc#L359
backtrace from gdb:
#0 0x00007fe48c4a154c in __pthread_kill_implementation () from /lib64/libc.so.6 #1 0x00007fe48c454d06 in raise () from /lib64/libc.so.6 #2 0x00007fe48c4287f3 in abort () from /lib64/libc.so.6 #3 0x00007fe48c429130 in __libc_message.cold () from /lib64/libc.so.6 #4 0x00007fe48c44daf7 in __libc_assert_fail () from /lib64/libc.so.6 #5 0x00007fe48c4a7d18 in __pthread_tpp_change_priority () from /lib64/libc.so.6 #6 0x000055be80a0ef0e in RGWPubSubKafkaEndpoint::Waiter::finish(int) () at /usr/include/c++/11/bits/unique_lock.h:103 #7 0x000055be80b37b2f in std::function<void (int)>::operator()(int) const (__args#0=0, this=0x55be87ea6c88) at /usr/include/c++/11/bits/std_function.h:590 #8 rgw::kafka::message_callback (rk=<optimized out>, rkmessage=<optimized out>, opaque=0x55be93d3c5a0) at /usr/src/debug/ceph-18.0.0-7517.gcdddd29f.el9.x86_64/src/rgw/rgw_kafka.cc:168 #9 0x00007fe48d2776ef in rd_kafka_poll_cb () from /lib64/librdkafka.so.1 #10 0x00007fe48d2ad862 in rd_kafka_q_serve.localalias () from /lib64/librdkafka.so.1 #11 0x00007fe48d2724b8 in rd_kafka_poll () from /lib64/librdkafka.so.1 #12 0x000055be80b39ea8 in rgw::kafka::Manager::run (this=0x55be84213a20) at /usr/src/debug/ceph-18.0.0-7517.gcdddd29f.el9.x86_64/src/rgw/rgw_kafka.cc:487 #13 0x00007fe48c8db924 in execute_native_thread_routine () from /lib64/libstdc++.so.6 #14 0x00007fe48c49f802 in start_thread () from /lib64/libc.so.6 #15 0x00007fe48c43f450 in clone3 () from /lib64/libc.so.6
Updated by Yuval Lifshitz 5 months ago
- Status changed from Triaged to Fix Under Review
- Pull request ID set to 54697
Updated by Yuval Lifshitz 5 months ago
- Tags changed from kafka to kafka, amqp
- Backport set to reef, quincy
Updated by Yuval Lifshitz 5 months ago
- Status changed from Fix Under Review to Pending Backport
Updated by Backport Bot 5 months ago
- Copied to Backport #63701: reef: kafka crashed during message callback in teuthology added
Updated by Backport Bot 5 months ago
- Copied to Backport #63702: quincy: kafka crashed during message callback in teuthology added
Updated by Backport Bot 5 months ago
- Tags changed from kafka, amqp to kafka, amqp backport_processed
Updated by Casey Bodley 3 months ago
@Yuval, this has still been crashing on main: http://qa-proxy.ceph.com/teuthology/suriarte-2024-01-22_16:24:20-rgw-rgw_asio_frontend_asserts-distro-default-smithi/7527259/teuthology.log
2024-01-24T10:35:50.574 DEBUG:teuthology.orchestra.run.smithi045:bucket notification tests against different endpoints> BNTESTS_CONF=/home/ubuntu/cephtest/ceph/src/test/rgw/bucket_notification/bn-tests.client.0.conf /home/ubuntu/cephtest/ceph/src/test/rgw/bucket_notification/virtualenv/bin/python -m nose -s /home/ubuntu/cephtest/ceph/src/test/rgw/bucket_notification/test_bn.py -v -a kafka_test 2024-01-24T10:36:01.500 INFO:tasks.rgw.client.0.smithi045.stdout:Fatal glibc error: tpp.c:87 (__pthread_tpp_change_priority): assertion failed: previous_prio == -1 || (previous_prio >= fifo_min_prio && previous_prio <= fifo_max_prio) 2024-01-24T10:36:01.501 INFO:tasks.rgw.client.0.smithi045.stdout:*** Caught signal (Aborted) ** 2024-01-24T10:36:01.501 INFO:tasks.rgw.client.0.smithi045.stdout: in thread 7f087c57e640 thread_name:kafka_manager 2024-01-24T10:36:01.501 INFO:tasks.rgw.client.0.smithi045.stdout: ceph version 19.0.0-717-gee50056b (ee50056b4421500d187e28edc49461ee9569002e) squid (dev) 2024-01-24T10:36:01.501 INFO:tasks.rgw.client.0.smithi045.stdout: 1: /lib64/libc.so.6(+0x54db0) [0x7f09abc54db0] 2024-01-24T10:36:01.502 INFO:tasks.rgw.client.0.smithi045.stdout: 2: /lib64/libc.so.6(+0xa154c) [0x7f09abca154c] 2024-01-24T10:36:01.502 INFO:tasks.rgw.client.0.smithi045.stdout: 3: raise() 2024-01-24T10:36:01.502 INFO:tasks.rgw.client.0.smithi045.stdout: 4: abort() 2024-01-24T10:36:01.502 INFO:tasks.rgw.client.0.smithi045.stdout: 5: /lib64/libc.so.6(+0x29130) [0x7f09abc29130] 2024-01-24T10:36:01.502 INFO:tasks.rgw.client.0.smithi045.stdout: 6: /lib64/libc.so.6(+0x4daf7) [0x7f09abc4daf7] 2024-01-24T10:36:01.502 INFO:tasks.rgw.client.0.smithi045.stdout: 7: /lib64/libc.so.6(+0xa7d18) [0x7f09abca7d18] 2024-01-24T10:36:01.502 INFO:tasks.rgw.client.0.smithi045.stdout: 8: (std::_Function_handler<void (int), RGWPubSubKafkaEndpoint::send_to_completion_async(ceph::common::CephContext*, rgw_pubsub_s3_event const&, optional_yield)::{lambda(int)#1}>::_M_invoke(std::_Any_data const&, int&&)+0x95) [0x564f77e7fa75] 2024-01-24T10:36:01.502 INFO:tasks.rgw.client.0.smithi045.stdout: 9: (rgw::kafka::message_callback(rd_kafka_s*, rd_kafka_message_s const*, void*)+0xef) [0x564f77fabaaf] 2024-01-24T10:36:01.502 INFO:tasks.rgw.client.0.smithi045.stdout: 10: /lib64/librdkafka.so.1(+0x256ef) [0x7f09aca776ef] 2024-01-24T10:36:01.502 INFO:tasks.rgw.client.0.smithi045.stdout: 11: /lib64/librdkafka.so.1(+0x5b862) [0x7f09acaad862] 2024-01-24T10:36:01.502 INFO:tasks.rgw.client.0.smithi045.stdout: 12: rd_kafka_poll() 2024-01-24T10:36:01.503 INFO:tasks.rgw.client.0.smithi045.stdout: 13: (rgw::kafka::Manager::run()+0x350) [0x564f77fade50] 2024-01-24T10:36:01.503 INFO:tasks.rgw.client.0.smithi045.stdout: 14: /lib64/libstdc++.so.6(+0xdb924) [0x7f09ac0db924] 2024-01-24T10:36:01.503 INFO:tasks.rgw.client.0.smithi045.stdout: 15: /lib64/libc.so.6(+0x9f802) [0x7f09abc9f802] 2024-01-24T10:36:01.503 INFO:tasks.rgw.client.0.smithi045.stdout: 16: /lib64/libc.so.6(+0x3f450) [0x7f09abc3f450]
Updated by Casey Bodley 3 months ago
- Related to Bug #64184: test_bn.py -v -a kafka_test: Fatal glibc error: tpp.c:87 (__pthread_tpp_change_priority): assertion failed added