unittest_rgw_dmclock_scheduler: error while loading shared libraries: libboost_thread.so.1.73.0: No such file or directory
Start 183: unittest_rgw_dmclock_scheduler
/home/jenkins-build/build/workspace/ceph-pull-requests/build/bin/unittest_rgw_dmclock_scheduler: error while loading shared libraries: libboost_thread.so.1.73.0: cannot open shared object file: No such file or directory
...
99% tests passed, 1 tests failed out of 204
Total Test time (real) = 785.78 sec
The following tests FAILED:
    183 - unittest_rgw_dmclock_scheduler (Failed)
Errors while running CTest
Build step 'Execute shell' marked build as failure
#3 Updated by J. Eric Ivancich 5 months ago
IRC conversation early July 9, 2020:
[02:27:54] <kefu> josh, dmick yes. that build host is using an older prebuilt ceph-libboost package.
[02:30:01] <kefu> once David Galloway is back, i will ask him to help remove ceph-libboost1.73-dev on all ubuntu test nodes. so they can upgrade it when install-deps.sh is executed again.
[02:30:42] <kefu> dmick, i think it's unittest_rgw_dmclock_scheduler => libceph-common => libboost.
[02:31:08] <dmick> does it have some kind of version of the package that doesn't obey the normal package upgrade rules or something?
[02:31:21] <dmick> and ldd is supposed to follow dependency chains
[02:31:43] <dmick> unless libceph-common calls dlopen() on its own?...
[02:31:54] <kefu> dmick i use "apt -qq list ceph-libboost1.72-dev" before installing prebuilt libboost to avoid unnecessary traffic.
[02:32:36] <kefu> haven't looked into a finer-grained test that takes the package version into consideration. probably i should have done this.
[02:32:47] dmick looks up that apt command
[02:32:55] <kefu> see install-deps.sh
[02:33:45] <kefu> the quick fix is to "echo "/opt/ceph/lib/x86_64-linux-gnu" > /etc/ld.so.conf.d/ceph-libboost1.73-x86_64-linux-gnu.conf" on the test node.
[02:34:44] <kefu> as ceph-libboost is installed into /opt on purpose to avoid possible naming collisions with libboost packaged by ubuntu.
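The quick fix kefu describes above could be scripted roughly as follows. This is a minimal sketch: the library path and the conf file name are taken from the log, while the function name `add_ceph_boost_libdir`, the `CONF_DIR` override, and the follow-up `ldconfig` call are assumptions (the log does not mention refreshing the linker cache, although that is normally needed after adding a conf file).

```shell
#!/bin/sh
# Sketch of the quick fix from the log. add_ceph_boost_libdir and CONF_DIR are
# hypothetical names; the library path and conf file name come from the log.
CONF_DIR="${CONF_DIR:-/etc/ld.so.conf.d}"
LIB_DIR="/opt/ceph/lib/x86_64-linux-gnu"
CONF_FILE="ceph-libboost1.73-x86_64-linux-gnu.conf"

add_ceph_boost_libdir() {
    # Tell the dynamic linker to also search the prebuilt boost directory.
    echo "$LIB_DIR" > "$CONF_DIR/$CONF_FILE"
}

# On a real test node (as root) one would then refresh the linker cache.
# Assumption: this step is not mentioned in the log.
#   add_ceph_boost_libdir && ldconfig
```

Keeping the call commented out means that sourcing the snippet never touches /etc by accident; pointing `CONF_DIR` at a scratch directory allows a dry run.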
[02:35:30] <dmick> wow. that....wow.
[02:36:00] <kefu> like evil ? =)
[02:36:13] <dmick> I mean you're implementing a package manager on a package manager
[02:37:19] <dmick> but
[02:37:41] <dmick> so that function looks for 1.72-dev, and if it finds it, removes it and installs 1.73-dev
[02:37:55] <dmick> why would we need to remove 1.73-dev and reinstall it?
[02:38:27] <dmick> has the library changed without changing version?
[02:38:44] <kefu> no, i believe that unittest_rgw_dmclock_scheduler was looking for libboost_thread.so.1.73.0.
[02:39:11] <kefu> as it was linked against that file with 1.73.0 as the so version.
[02:39:30] <kefu> oh, you mean the function in install-deps.sh.
[02:39:32] <kefu> yes.
[02:40:01] <kefu> because of https://tracker.ceph.com/issues/46208.
[02:40:30] <kefu> ahh, you mean as a fix, why i suggest removing and reinstalling it?
[02:40:34] <dmick> yes
[02:40:43] <kefu> because in the latest revision i fixed the packaging
[02:41:02] <kefu> to install /opt/ceph/lib/x86_64-linux-gnu with the correct path.
[02:41:22] <dmick> but the package version is the same?
[02:41:34] <kefu> dmick see https://github.com/tchaikov/ceph-boost/commit/9785969428c717864b185bf541836b80361f4029
[02:41:55] <kefu> yes, i confess.
[02:42:02] <dmick> <hangs head>
[02:42:37] <kefu> as i didn't want to mess with install-deps.sh to detect the package version and remove and install packages.
[02:43:47] <kefu> sorry for that, wish i could run "apt-get remove" on all ubuntu nodes.
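The version-aware check kefu says he skipped might look something like the sketch below. It is not the detection logic in install-deps.sh: the function names are hypothetical and the version string in the usage comment is a made-up placeholder. It also only helps if the package revision is bumped whenever the packaging changes, which the log says did not happen here.

```shell
#!/bin/sh
# Hypothetical sketch; not the detection logic in install-deps.sh.

# installed_version prints the installed version of a Debian package,
# or nothing if the package is not installed.
installed_version() {
    dpkg-query -W -f='${Version}' "$1" 2>/dev/null
}

# needs_reinstall succeeds when the installed version differs from the
# wanted one (an empty installed version counts as "differs").
needs_reinstall() {
    installed="$1"
    wanted="$2"
    [ "$installed" != "$wanted" ]
}

# Usage (package name from the log; "1.73.0-2" is a placeholder version):
#   if needs_reinstall "$(installed_version ceph-libboost1.73-dev)" "1.73.0-2"; then
#       echo "stale or missing ceph-libboost, reinstall it"
#   fi
```

A plain string comparison is used on purpose: any mismatch, older or newer, triggers a reinstall, so no Debian version-ordering rules are needed.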
[02:43:57] <dmick> so
[02:44:15] <dmick> I think I follow that horrible trail of "why libboost is broken"
[02:44:17] <dmick> but
[02:44:32] <dmick> I still don't get how it's required. all of the binaries I build don't seem to require it
[02:45:00] <dmick> is it optional based on a #define or something that's not included when one runs "run-make-check.sh"?
[02:45:03] <kefu> it depends on how you build them.
[02:45:36] <dmick> sometimes the binaries need libboost and sometimes they don't?
[02:45:52] <kefu> see src/script/run-make.sh
[02:46:00] <kefu> and search for boost_root.
[02:46:31] <kefu> no, basically all of them need it, except for some executables written in C language.
[02:46:41] <kefu> like mount.ceph.
[02:47:15] <dmick> run-make.sh looks like it controls whether boost is a custom version or system version, yes?
[02:47:22] <dmick> but not whether boost is used at all?
[02:47:27] <kefu> exactly.
[02:47:33] <dmick> so why doesn't ldd see it?
[02:47:35] <kefu> i think boost is a building block of ceph.
[02:47:50] <dmick> it must be dlopen()ed, I guess
[02:48:49] <dmick> because
[02:49:00] <dmick> a lot (maybe most?) of boost is just preprocessor stuff, right?
[02:49:10] <dmick> fairly little of it is in the .so, from what I understand
[02:50:02] <kefu> i don't think ldd checks for the dependency of dependency
[02:52:02] <kefu> lemme double check.
[02:52:24] <kefu> but it's not dlopen()'ed, i am sure.
[02:53:40] <dmick> compare objdump -p to ldd on, say, /bin/ls
[02:53:50] <dmick> the NEEDED entries are recorded for the binary
[02:53:54] <dmick> but ldd shows more
[02:55:15] <dmick> libselinux.so.1, for instance, has a NEEDED for libpcre
[02:55:45] <dmick> but it's not NEEDED by /bin/ls. so it doesn't show for objdump, but does for ldd
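The comparison dmick suggests can be sketched as below. `objdump -p` and `ldd` are the tools named in the log; the helper names `parse_needed`, `direct_needed` and `resolved_libs` are hypothetical, and the awk filters assume the usual output layout of those two tools.

```shell
#!/bin/sh
# Hypothetical helpers for comparing direct and transitive dependencies.

# parse_needed reads `objdump -p` output on stdin and prints the library
# names from the NEEDED entries recorded in the ELF file itself.
parse_needed() {
    awk '$1 == "NEEDED" { print $2 }'
}

# direct_needed prints only the direct dependencies of a binary.
direct_needed() {
    objdump -p "$1" | parse_needed
}

# resolved_libs prints the first column of ldd's output, i.e. every library
# the loader would pull in, including dependencies of dependencies.
resolved_libs() {
    ldd "$1" | awk '{ print $1 }'
}

# Example:
#   direct_needed /bin/ls
#   resolved_libs /bin/ls
```

Any library that appears in the second listing but not the first is a dependency of a dependency, which is the libpcre case dmick points out.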
[02:58:54] <kefu> dmick yeah, you are right.
[02:59:04] <kefu> dmick https://pastebin.com/bPWyWswT
[02:59:25] <kefu> this is a sample collected from one of our ubuntu build hosts.
[03:00:12] <dmick> could the dependencies be different on rpm-based machines?
[03:00:19] <dmick> to the point of not requiring libboost?
[03:00:37] <dmick> that seems crazy, but maybe libboost is statically linked into some other .so or something?
[03:00:53] <kefu> so your executables are linked against the static libraries
[03:01:09] <kefu> git grep -w Boost_USE_STATIC_LIBS
[03:01:30] <kefu> dmick ^
[03:02:06] <kefu> if you build boost from source, by default you get the .a archives with BuildBoost.cmake.
[03:02:25] <kefu> brb
[03:02:26] kefu is now known as kefu-away
[03:04:28] <dmick> and yet
[03:05:31] kefu-away is now known as kefu
[03:05:33] <kefu> back
[03:05:36] <dmick> in the build that failed for me on the jenkins slave, -DWITH_SYSTEM_BOOST=ON
[03:06:10] <dmick> and ... ENABLE_SHARED?....
[03:06:17] <dmick> jeez this just gets twistier
[03:06:36] <kefu> i don't think these two options are mutually exclusive.
[03:06:55] <dmick> ENABLE_SHARED defaults to on and isn't reset by this cmake invocation
[03:07:16] <kefu> right, i think we should drop that option.
[03:07:24] <dmick> so it seems to me this should have built dynamic
[03:07:37] <kefu> IIRC, it was added when root wanted to a single blob of executable.
[03:07:42] <kefu> to have
[03:07:50] <kefu> rook
[03:08:04] <kefu> (sorry for the typo, was using the wrong kbd)
[03:08:16] <dmick> (looking at https://jenkins.ceph.com/job/ceph-pull-requests/55206/consoleFull#-85362248744e9240e-b50a-4693-bac0-8a991bac86ac and the cmake line is cmake -DWITH_CCACHE=ON -DWITH_SYSTEM_BOOST=ON -DBOOST_ROOT=/opt/ceph -DWITH_PYTHON3=3 -DWITH_GTEST_PARALLEL=ON -DWITH_FIO=ON -DWITH_CEPHFS_SHELL=ON -DWITH_SPDK=ON -DENABLE_GIT_VERSION=OFF -DWITH_SEASTAR=ON ..
[03:08:54] <dmick> ah. but that failed. because it couldn't find the .so. so that makes sense.
[03:09:08] <dmick> but what did my centos8 build, with run-make-check.sh, do, that's the question
[03:10:02] <kefu> because i don't know how to break^w build rpm packages from ceph-libboost.
[03:11:16] <kefu> in other words, i don't know how to implement a package manager on a package manager with rpm spec yet.
[03:11:37] <dmick> maybe you mean that rpm builds don't do the boost magic. so how does Ceph run with a too-old libboost?
[03:12:15] <kefu> not sure i follow you. on ubuntu, it uses prebuilt libboost packages.
[03:12:33] <kefu> on rhel/centos 8, it builds libboost from source.
[03:12:44] <dmick> uh...oh
[03:12:58] <dmick> why don't we build libboost from source on ubuntu?
[03:13:25] <kefu> because i want to save precious CPU cycles and build time.
[03:13:53] <dmick> oh jeez. so this is all asymmetrical on purpose as an "optimization"
[03:13:59] <kefu> trade them for some traffic from chacra repos.
[03:14:04] <dmick> well it optimized the hell out of my day :)
[03:14:36] <kefu> i am sorry for the premature optimization.
[03:14:46] <dmick> yeah, I had no idea this was different on deb vs rpm until just now.
[03:14:52] <dmick> okay. that explains it
[03:15:18] <kefu> and different behaviors on different distros.
[03:16:57] <dmick> so rpm machines build and link a static libboost
[03:17:11] <dmick> and deb machines, if they can, download a prebuilt .so
[03:17:42] <dmick> but because it's all outside the package manager, the packages don't follow normal upgrade rules, so bugfixes have to be deployed manually
[03:17:52] <kefu> yes, rpm links with just-cooked libboost.a
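The deb/rpm asymmetry summarized above can be illustrated with a small sketch. It is not the actual code in src/script/run-make.sh: the function name `boost_cmake_args` and the simple directory test are assumptions, while the two cmake flags and the /opt/ceph root are taken from the cmake line quoted in the log.

```shell
#!/bin/sh
# Hypothetical helper, not the actual code in src/script/run-make.sh.
# Prints the extra cmake arguments for the boost setup described in the log.
boost_cmake_args() {
    boost_root="$1"
    if [ -d "$boost_root" ]; then
        # Prebuilt ceph-libboost is present (the Ubuntu case in the log):
        # link against the shared libraries under that root.
        echo "-DWITH_SYSTEM_BOOST=ON -DBOOST_ROOT=$boost_root"
    else
        # No prebuilt boost (the RHEL/CentOS 8 case in the log): print
        # nothing, so the build compiles boost from source and links the
        # static archives.
        echo ""
    fi
}

# Example:
#   cmake $(boost_cmake_args /opt/ceph) ..
```

Seen this way, the same test binary depends on libboost_thread.so.1.73.0 at run time on one platform and has boost linked in statically on the other, which is why only the Ubuntu nodes hit the loader error.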
[03:18:58] <kefu> install-deps.sh could take care of this. but i am using a lame detection machinery.
[03:19:21] <kefu> and failed to bump up the package version of the fixed revision.
[03:20:43] <kefu> FWIW, i fixed a dozen build hosts manually yesterday. but missed some of them apparently.
[03:23:36] <dmick> no human should ever have to do that; I feel bad that you did
[03:23:39] <kefu> dmick but it's not fully out of package manager, the prebuilt .so are packaged.
[03:25:50] <dmick> the one that failed for me today was 220.127.116.11
[03:27:59] <kefu> guess these nodes are using an IP pool.
[03:28:16] <kefu> that IP is not listed in https://jenkins.ceph.com/computer/
[03:28:57] <dmick> it's an OVH dynamically-provisioned host. I guess we still have a few
[03:29:03] <kefu> and that IP is not reachable from sepia lab. i am ping'ing it now.
[03:29:08] <kefu> ic.
[03:29:19] <dmick> I thought they were gone, but apparently only mostly gone
[03:29:37] <kefu> thought those OVH builders were decommissioned.
[03:29:48] <kefu> yes.
[03:30:47] <dmick> well, we've beat this to death, and I'm exhausted. Thanks for all your help, no way I would have gotten anywhere with this without you
[03:30:58] <dmick> guess we need david
[03:31:16] <kefu> yes. i've been missing him this week.
[03:32:09] <dmick> cheers
[03:32:14] <kefu> np.
[03:32:17] <kefu> later!