Bug #56500

closed

rgw:verify tests are failing with valgrind error

Added by Soumya Koduri almost 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
Urgent
Assignee:
-
Target version:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

Many of the rgw:verify test cases are failing with valgrind error -
http://pulpito.front.sepia.ceph.com/soumyakoduri-2022-07-08_05:08:48-rgw:verify-wip-skoduri-main-distro-default-smithi/detail

Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_10062088f503b43eff3624326bda825b23438f9b/teuthology/run_tasks.py", line 188, in run_tasks
suppress = manager.__exit__(*exc_info)
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/home/teuthworker/src/git.ceph.com_ceph-c_a3815e4f3cc2fcfa37d949491fb5fb434404011e/qa/tasks/ceph.py", line 1922, in task
check_status=False,
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_10062088f503b43eff3624326bda825b23438f9b/teuthology/contextutil.py", line 55, in nested
raise exc1
File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/home/teuthworker/src/git.ceph.com_ceph-c_a3815e4f3cc2fcfa37d949491fb5fb434404011e/qa/tasks/ceph.py", line 251, in ceph_log
yield
File "/home/teuthworker/src/git.ceph.com_git_teuthology_10062088f503b43eff3624326bda825b23438f9b/teuthology/contextutil.py", line 47, in nested
if exit(*exc):
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/home/teuthworker/src/git.ceph.com_ceph-c_a3815e4f3cc2fcfa37d949491fb5fb434404011e/qa/tasks/ceph.py", line 362, in valgrind_post
raise valgrind_exception
Exception: saw valgrind issues

Actions #1

Updated by Casey Bodley almost 2 years ago

  • Priority changed from Normal to Urgent

similar valgrind issues from http://qa-proxy.ceph.com/teuthology/mbenjamin-2022-07-17_22:24:15-rgw-wip-rgwlc-azone-distro-default-smithi/6935342/teuthology.log

2022-07-18T08:30:10.415 INFO:teuthology.orchestra.run.smithi117.stdout:/var/log/ceph/valgrind/ceph.client.0.log: <kind>InvalidRead</kind>
2022-07-18T08:30:10.416 INFO:teuthology.orchestra.run.smithi117.stdout:/var/log/ceph/valgrind/ceph.client.0.log: <kind>InvalidWrite</kind>
2022-07-18T08:30:10.416 INFO:teuthology.orchestra.run.smithi117.stdout:/var/log/ceph/valgrind/ceph.client.0.log: <kind>SyscallParam</kind>
2022-07-18T08:30:10.417 INFO:teuthology.orchestra.run.smithi117.stdout:/var/log/ceph/valgrind/ceph.client.0.log: <kind>UninitCondition</kind>
2022-07-18T08:30:10.417 INFO:teuthology.orchestra.run.smithi117.stdout:/var/log/ceph/valgrind/ceph.client.0.log: <kind>UninitValue</kind>

Actions #2

Updated by Casey Bodley almost 2 years ago

it looks like the majority of the valgrind issues are InvalidRead/Writes with a message like:

<auxwhat>Address 0x57f47f60 is on thread 135's stack</auxwhat>

this appears to be complaining about memory on the frontend's coroutine stacks. i think we've seen this in the past, and discovered that boost has a 'valgrind' compile option that causes boost::context to call VALGRIND_STACK_REGISTER()[1] on these stacks

Adam added a WITH_BOOST_VALGRIND option to ceph's cmake in [2], and this flag was enabled for shaman 'notcmalloc' builds in [3]

however, we eventually figured out how to run valgrind against tcmalloc builds, and stopped doing 'notcmalloc' builds entirely in [4]

so we lost the WITH_BOOST_VALGRIND fix over a year ago, but the rgw suite had been running clean until just recently. so what's changed?

[1] https://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.clientreq
[2] https://github.com/ceph/ceph/pull/34043
[3] https://github.com/ceph/ceph-build/pull/1736
[4] https://github.com/ceph/ceph-build/pull/1750

Actions #3

Updated by Casey Bodley almost 2 years ago

confirmed with local testing (fedora 36) that the valgrind issues go away after:

cmake -DWITH_BOOST_VALGRIND=ON -DWITH_SYSTEM_BOOST=OFF ..

Actions #4

Updated by Laura Flores almost 2 years ago

Soumya Koduri wrote:

Many of the rgw:verify test cases are failing with valgrind error -
http://pulpito.front.sepia.ceph.com/soumyakoduri-2022-07-08_05:08:48-rgw:verify-wip-skoduri-main-distro-default-smithi/detail

Traceback (most recent call last):
File "/home/teuthworker/src/git.ceph.com_git_teuthology_10062088f503b43eff3624326bda825b23438f9b/teuthology/run_tasks.py", line 188, in run_tasks
suppress = manager.__exit__(*exc_info)
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/home/teuthworker/src/git.ceph.com_ceph-c_a3815e4f3cc2fcfa37d949491fb5fb434404011e/qa/tasks/ceph.py", line 1922, in task
check_status=False,
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/home/teuthworker/src/git.ceph.com_git_teuthology_10062088f503b43eff3624326bda825b23438f9b/teuthology/contextutil.py", line 55, in nested
raise exc1
File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
self.gen.throw(type, value, traceback)
File "/home/teuthworker/src/git.ceph.com_ceph-c_a3815e4f3cc2fcfa37d949491fb5fb434404011e/qa/tasks/ceph.py", line 251, in ceph_log
yield
File "/home/teuthworker/src/git.ceph.com_git_teuthology_10062088f503b43eff3624326bda825b23438f9b/teuthology/contextutil.py", line 47, in nested
if exit(*exc):
File "/usr/lib/python3.6/contextlib.py", line 88, in __exit__
next(self.gen)
File "/home/teuthworker/src/git.ceph.com_ceph-c_a3815e4f3cc2fcfa37d949491fb5fb434404011e/qa/tasks/ceph.py", line 362, in valgrind_post
raise valgrind_exception
Exception: saw valgrind issues

These are failing from this in the teuthology log:

ga3815e4f/rpm/el8/BUILD/ceph-17.0.0-13446-ga3815e4f/src/common/fault_injector.h: In function 'int FaultInjector<Key>::check(const Key&) const::visitor::operator()(const InjectAbort&) const [with Key = std::basic_string_view<char>]' thread 7f7441d52540 time 2022-07-08T05:48:06.589191+0000
2022-07-08T05:48:06.640 INFO:tasks.workunit.client.0.smithi006.stderr:/home/jenkins-build/build/workspace/ceph-dev-new-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.0.0-13446-ga3815e4f/rpm/el8/BUILD/ceph-17.0.0-13446-ga3815e4f/src/common/fault_injector.h: 101: FAILED ceph_assert(!"FaultInjector")
2022-07-08T05:48:06.640 INFO:tasks.workunit.client.0.smithi006.stderr: ceph version 17.0.0-13446-ga3815e4f (a3815e4f3cc2fcfa37d949491fb5fb434404011e) quincy (dev)
2022-07-08T05:48:06.640 INFO:tasks.workunit.client.0.smithi006.stderr: 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x152) [0x7f743dde0f44]
2022-07-08T05:48:06.641 INFO:tasks.workunit.client.0.smithi006.stderr: 2: /usr/lib64/ceph/libceph-common.so.2(+0x2c2165) [0x7f743dde1165]
2022-07-08T05:48:06.641 INFO:tasks.workunit.client.0.smithi006.stderr: 3: (std::__detail::__variant::__gen_vtable_impl<std::__detail::__variant::_Multi_array<int (*)(FaultInjector<std::basic_string_view<char, std::char_traits<char> > >::check(std::basic_string_view<char, std::char_traits<char> > const&) const::visitor&&, std::variant<std::monostate, InjectAbort, InjectError> const&)>, std::tuple<std::variant<std::monostate, InjectAbort, InjectError> const&>, std::integer_sequence<unsigned long, 1ul> >::__visit_invoke(FaultInjector<std::basic_string_view<char, std::char_traits<char> > >::check(std::basic_string_view<char, std::char_traits<char> > const&) const::visitor&&, std::variant<std::monostate, InjectAbort, InjectError> const&)+0x41) [0x55ad01619061]
2022-07-08T05:48:06.642 INFO:tasks.workunit.client.0.smithi006.stderr: 4: radosgw-admin(+0x911b65) [0x55ad01610b65]
2022-07-08T05:48:06.642 INFO:tasks.workunit.client.0.smithi006.stderr: 5: (RGWBucketReshard::execute(int, FaultInjector<std::basic_string_view<char, std::char_traits<char> > >&, int, DoutPrefixProvider const*, bool, std::ostream*, ceph::Formatter*, RGWReshard*)+0xe5) [0x55ad016150e5]
2022-07-08T05:48:06.643 INFO:tasks.workunit.client.0.smithi006.stderr: 6: main()
2022-07-08T05:48:06.643 INFO:tasks.workunit.client.0.smithi006.stderr: 7: __libc_start_main()
2022-07-08T05:48:06.644 INFO:tasks.workunit.client.0.smithi006.stderr: 8: _start()

Actions #5

Updated by Casey Bodley almost 2 years ago

thanks Laura. those assertions are expected - we use fault injection in qa/workunits/rgw/test_rgw_reshard.py to test that we can recover from errors and crashes
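For readers unfamiliar with the technique: the idea is that a test passes a named checkpoint at which the injector deliberately aborts, then the test verifies the operation can be resumed or rolled back. The sketch below is an illustrative Python analogue only; Ceph's real FaultInjector is the C++ template in src/common/fault_injector.h, and all names here are hypothetical.

```python
class InjectError(Exception):
    """Raised when execution reaches a checkpoint armed with a fault."""


class FaultInjector:
    """Minimal sketch of the fault-injection pattern: a no-op by default,
    it raises when execution reaches the configured checkpoint."""

    def __init__(self):
        self.location = None  # checkpoint name to fail at, or None

    def inject_error(self, location):
        self.location = location

    def check(self, location):
        if location == self.location:
            raise InjectError(f"injected failure at {location!r}")


def reshard(faults):
    """Hypothetical multi-step operation with checkpoints, standing in for
    the reshard steps exercised by qa/workunits/rgw/test_rgw_reshard.py."""
    done = []
    for step in ("init_index", "do_reshard", "commit_target"):
        faults.check(step)  # a normal run uses an unarmed injector
        done.append(step)
    return done


# a normal run completes every step
assert reshard(FaultInjector()) == ["init_index", "do_reshard", "commit_target"]

# with a fault armed at "commit_target" the run fails partway through,
# which is what lets the test exercise the recovery path afterwards
f = FaultInjector()
f.inject_error("commit_target")
try:
    reshard(f)
except InjectError as e:
    print("caught:", e)
```

The ceph_assert crash in the quoted log is the C++ equivalent of the armed checkpoint firing, which is why it shows up in teuthology logs without indicating a real bug.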

Actions #6

Updated by Laura Flores almost 2 years ago

Ah, makes sense Casey.

Actions #7

Updated by Casey Bodley almost 2 years ago

https://github.com/ceph/ceph/pull/46554 updated boost to version 1.79, but the valgrind issues persist

Actions #8

Updated by Casey Bodley almost 2 years ago

our rgw/verify suite is pinned to centos, but i ran against ubuntu and saw the same failures: https://pulpito.ceph.com/cbodley-2022-07-26_14:01:08-rgw:verify-pr-46554-kefu-distro-default-smithi/

Actions #9

Updated by Casey Bodley almost 2 years ago

$ MON=1 OSD=1 RGW=1 MDS=0 MGR=0 ../src/vstart.sh -n -d --valgrind_rgw memcheck --valgrind_args '--suppressions=../qa/valgrind.supp --max-threads=2048'
Actions #10

Updated by Mark Kogan almost 2 years ago

tested on CentOS Stream, main branch

❯ git branch -vv
* main                  e9d361f6219 [origin/main: behind 30] Merge pull request #47334 from s0nea/wip-osd-objectstore-types-fix

gcc --version
gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-13)
(without SCL)

valgrind --version
valgrind-3.19.0

./do_cmake.sh -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DDIAGNOSTICS_COLOR=always -DBOOST_J=$(nproc) -DCMAKE_BUILD_TYPE=RelWithDebInfo 

In a vstart env, starting RGW as follows:
sudo pkill radosgw
sudo truncate -s0 ./out/radosgw.8000.log
sudo valgrind --soname-synonyms=somalloc="*tcmalloc*" --vgdb=yes --trace-children=no \
  --child-silent-after-fork=yes --num-callers=10 --suppressions=../qa/valgrind.supp \
  --xml=yes --xml-file=./valgrind_log.xml --time-stamp=yes --tool=memcheck --max-threads=2048 \
  -- ./bin/radosgw -c ./ceph.conf --log-file=./out/radosgw.8000.log \
  --admin-socket=./out/radosgw.8000.asok --pid-file=./out/radosgw.8000.pid \
  -n client.rgw.8000 --rgw_frontends="beast port=8000" --debug_ms=0 --debug_rgw=1 -f

and executing ../qa/workunits/rgw/run-s3tests.sh |& tee ./run-s3tests.log
with the following setup:
vim ../qa/workunits/rgw/run-s3tests.sh
...
S3TEST_CONF=s3tests.conf.SAMPLE virtualenv/bin/python -m nose -a '!fails_on_rgw,!lifecycle_expiration,!fails_strict_rfc2616,!s3select,!test_of_sts,!webidentity_test,!fails_with_subdomain,!fails_on_dbstore' -v
...

Comparing -DWITH_BOOST_VALGRIND=ON vs. -DWITH_BOOST_VALGRIND=OFF results:

When: -DWITH_BOOST_VALGRIND=ON

there are no unsuppressed valgrind errors.

performance of 3 hsbench 100K small obj PUT runs:

❯ nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 2 )) -b $(( $(numactl -N 0 -- nproc) / 2 )) -n 100000 -m cxip -bp b01b

2022/08/01 15:05:31 Loop: 0, Int: TOTAL, Dur(s): 9.6, Mode: PUT, Ops: 100000, MB/s: 40.58, IO/s: 10389, Lat(ms): [ min: 1.1, avg: 2.7, 99%: 4.0, max: 8.3 ], Slowdowns: 0
2022/08/01 15:11:11 Loop: 0, Int: TOTAL, Dur(s): 9.7, Mode: PUT, Ops: 100000, MB/s: 40.34, IO/s: 10328, Lat(ms): [ min: 1.3, avg: 2.7, 99%: 4.0, max: 8.2 ], Slowdowns: 0
2022/08/01 15:19:09 Loop: 0, Int: TOTAL, Dur(s): 9.6, Mode: PUT, Ops: 100000, MB/s: 40.73, IO/s: 10427, Lat(ms): [ min: 1.3, avg: 2.7, 99%: 3.9, max: 7.5 ], Slowdowns: 0

When: -DWITH_BOOST_VALGRIND=OFF

there are many valgrind errors related to boost context & spawn
(see https://www.boost.org/doc/libs/master/libs/context/doc/html/context/stack/valgrind.html)
performance of 3 hsbench 100K small obj PUT runs:

❯ nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 2 )) -b $(( $(numactl -N 0 -- nproc) / 2 )) -n 100000 -m cxip -bp b01b

2022/08/01 13:53:30 Loop: 0, Int: TOTAL, Dur(s): 9.6, Mode: PUT, Ops: 100000, MB/s: 40.57, IO/s: 10387, Lat(ms): [ min: 1.3, avg: 2.7, 99%: 4.0, max: 8.4 ], Slowdowns: 0
2022/08/01 13:55:54 Loop: 0, Int: TOTAL, Dur(s): 9.7, Mode: PUT, Ops: 100000, MB/s: 40.39, IO/s: 10340, Lat(ms): [ min: 1.2, avg: 2.7, 99%: 4.0, max: 7.9 ], Slowdowns: 0
2022/08/01 13:58:06 Loop: 0, Int: TOTAL, Dur(s): 9.6, Mode: PUT, Ops: 100000, MB/s: 40.63, IO/s: 10402, Lat(ms): [ min: 1.2, avg: 2.7, 99%: 4.0, max: 8.3 ], Slowdowns: 0

there is no discernible performance difference in IO/s between -DWITH_BOOST_VALGRIND=ON and -DWITH_BOOST_VALGRIND=OFF for the above workload

I think it should be safe to configure -DWITH_BOOST_VALGRIND=ON on Shaman and keep -DWITH_BOOST_VALGRIND=OFF on Brew
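As a sanity check, the "no difference" claim can be verified directly from the IO/s figures quoted above. A small Python sketch (the `iops` helper is hypothetical, written to match the hsbench summary-line format shown in the runs):

```python
import re


def iops(line):
    """Extract the IO/s figure from an hsbench 'Int: TOTAL' summary line."""
    return float(re.search(r"IO/s: (\d+)", line).group(1))


# verify the parser against one of the lines quoted above
sample = ("2022/08/01 15:05:31 Loop: 0, Int: TOTAL, Dur(s): 9.6, Mode: PUT, "
          "Ops: 100000, MB/s: 40.58, IO/s: 10389, Lat(ms): [ min: 1.1, "
          "avg: 2.7, 99%: 4.0, max: 8.3 ], Slowdowns: 0")
assert iops(sample) == 10389

# IO/s from the three PUT runs of each build, as quoted above
valgrind_on = [10389, 10328, 10427]   # -DWITH_BOOST_VALGRIND=ON
valgrind_off = [10387, 10340, 10402]  # -DWITH_BOOST_VALGRIND=OFF

mean_on = sum(valgrind_on) / len(valgrind_on)
mean_off = sum(valgrind_off) / len(valgrind_off)
diff_pct = (mean_on - mean_off) / mean_off * 100
print(f"ON: {mean_on:.0f} IO/s, OFF: {mean_off:.0f} IO/s, delta: {diff_pct:+.2f}%")
```

The computed delta is a small fraction of a percent, well inside run-to-run variance, which supports enabling the flag on Shaman builds.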

Actions #11

Updated by Casey Bodley almost 2 years ago

thanks Mark! can you please clarify whether radosgw was running under valgrind during those hsbench runs? i think it's important that we measure performance without that runtime overhead from valgrind itself

Actions #12

Updated by Mark Kogan almost 2 years ago

all benchmarks were performed without valgrind

adding results from additional tests after rebasing today, this time also with a large-object workload.
the large-object results are interesting because WITH_BOOST_VALGRIND=ON is between 0.9% and 1% slower than WITH_BOOST_VALGRIND=OFF: IO/s 92 vs. 91 on PUT and 101 vs. 100 on GET.
not concerned though, as this is within the noise threshold of run-to-run clock variance and similar.

# branch:
tig
49937c3 2022-08-03 08:04 -0700 Yuri Weinstein               {origin/main} Merge pull request #47165 from NitzanMordhai/wip-nitzan-librados-ioctx-destroy-Segmentation-fault

-DWITH_SYSTEM_BOOST=OFF  &  boost_1_79_0.tar.bz2

# small objects

❯ nice numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 4K -d -1 -t $(( $(numactl -N 0 -- nproc) / 2 )) -b $(( $(numactl -N 0 -- nproc) / 2 )) -n 100000 -m cxip -bp b01b

-DWITH_BOOST_VALGRIND=OFF
2022/08/03 15:54:45 Loop: 0, Int: TOTAL, Dur(s): 9.3, Mode: PUT, Ops: 100000, MB/s: 42.06, IO/s: 10767, Lat(ms): [ min: 1.2, avg: 2.6, 99%: 3.9, max: 7.8 ], Slowdowns: 0

-DWITH_BOOST_VALGRIND=ON
2022/08/03 16:07:23 Loop: 0, Int: TOTAL, Dur(s): 9.3, Mode: PUT, Ops: 100000, MB/s: 42.03, IO/s: 10759, Lat(ms): [ min: 1.3, avg: 2.6, 99%: 3.9, max: 7.9 ], Slowdowns: 0

# large objects

numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 24M -d -1 -t $(( $(numactl -N 0 -- nproc) / 2 )) -b $(( $(numactl -N 0 -- nproc) / 2 )) -n 5000 -m cxip -bp b01b

numactl -N 1 -m 1 -- ~/go/bin/hsbench -a b2345678901234567890 -s b234567890123456789012345678901234567890 -u http://127.0.0.1:8000 -z 24M -d -1 -t 900 -b $(( $(numactl -N 0 -- nproc) / 2 )) -n 5000 -m g -bp b01b

-DWITH_BOOST_VALGRIND=OFF
2022/08/03 16:39:33 Loop: 0, Int: TOTAL, Dur(s): 54.6, Mode: PUT, Ops: 5000, MB/s: 2197.09, IO/s: 92, Lat(ms): [ min: 136.4, avg: 305.1, 99%: 773.8, max: 1163.4 ], Slowdowns: 0

2022/08/03 16:40:49 Loop: 0, Int: TOTAL, Dur(s): 49.5, Mode: GET, Ops: 5000, MB/s: 2422.76, IO/s: 101, Lat(ms): [ min: 68.0, avg: 2089.3, 99%: 11901.5, max: 15301.1 ], Slowdowns: 0

-DWITH_BOOST_VALGRIND=ON
2022/08/03 16:17:38 Loop: 0, Int: TOTAL, Dur(s): 54.7, Mode: PUT, Ops: 5000, MB/s: 2193.01, IO/s: 91, Lat(ms): [ min: 135.7, avg: 305.7, 99%: 768.2, max: 1202.2 ], Slowdowns: 0

2022/08/03 16:19:25 Loop: 0, Int: TOTAL, Dur(s): 49.9, Mode: GET, Ops: 5000, MB/s: 2406.48, IO/s: 100, Lat(ms): [ min: 70.1, avg: 2165.4, 99%: 10205.4, max: 11119.7 ], Slowdowns: 0
Actions #13

Updated by Mark Kogan almost 2 years ago

adding to this tracker a pointer to PR: https://github.com/ceph/ceph/pull/47441 --> commit: https://github.com/ceph/ceph/pull/47441/commits/6052ada61d585db5a577f4b3e935f7f2477ff2a1

the default cmake config value of WITH_BOOST_VALGRIND was set after cmake/modules/BuildBoost.cmake had already acted on it, so it was implicitly always =OFF unless explicitly passed as a parameter to ./do_cmake.sh

in practice it was not possible to modify it on the fly when pushing to Shaman with something like:

sed -i 's/"Boost support for valgrind" OFF/"Boost support for valgrind" ON/' ./CMakeLists.txt
git commit ./CMakeLists.txt -m "build: temporary WITH_BOOST_VALGRIND=ON for shaman"
git push ci wip-rgw-40183
git reset --mixed HEAD~1
sed -i 's/"Boost support for valgrind" ON/"Boost support for valgrind" OFF/' ./CMakeLists.txt
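The ordering problem can be sketched with a minimal CMake fragment; this is an illustration of the pitfall, not Ceph's actual build files:

```cmake
# WRONG (the pre-fix situation, sketched): BuildBoost consumes
# WITH_BOOST_VALGRIND here, but option() only defines it further down,
# so at this point the variable is empty and the build behaves as if OFF,
# unless the value was passed explicitly on the command line with
# -DWITH_BOOST_VALGRIND=ON, which creates the cache entry up front.
include(BuildBoost)   # reads ${WITH_BOOST_VALGRIND}
option(WITH_BOOST_VALGRIND "Boost support for valgrind" OFF)

# RIGHT: declare the option before anything consumes it, so changing the
# default in CMakeLists.txt (as the sed one-liner above tries to do)
# actually takes effect.
option(WITH_BOOST_VALGRIND "Boost support for valgrind" OFF)
include(BuildBoost)
```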

Actions #14

Updated by Casey Bodley almost 2 years ago

  • Status changed from New to Resolved

WITH_BOOST_VALGRIND should now be enabled on all shaman builds with the merge of https://github.com/ceph/ceph-build/pull/2043
