Bug #57296

open

Internal compiler errors and unmet dependencies on some sepia nodes

Added by Laura Flores over 1 year ago. Updated 8 months ago.

Status:
New
Priority:
Normal
Assignee:
-
% Done:

0%

Source:
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=focal,DIST=focal,MACHINE_SIZE=gigantic/65020

make[4]: Leaving directory '/build/ceph-16.2.10-677-g5c73933c/obj-x86_64-linux-gnu'
[100%] Built target ceph-dencoder
make[4]: Leaving directory '/build/ceph-16.2.10-677-g5c73933c/obj-x86_64-linux-gnu'
make[3]: *** [CMakeFiles/Makefile2:16816: src/test/librbd/CMakeFiles/unittest_librbd.dir/all] Error 2
make[3]: Leaving directory '/build/ceph-16.2.10-677-g5c73933c/obj-x86_64-linux-gnu'
make[2]: *** [Makefile:144: all] Error 2
make[2]: Leaving directory '/build/ceph-16.2.10-677-g5c73933c/obj-x86_64-linux-gnu'
dh_auto_build: error: cd obj-x86_64-linux-gnu && make -j90 returned exit code 2
make[1]: *** [debian/rules:47: override_dh_auto_build] Error 25
make[1]: Leaving directory '/build/ceph-16.2.10-677-g5c73933c'
make: *** [debian/rules:40: build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2
E: Failed autobuilding of package

https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=bionic,DIST=bionic,MACHINE_SIZE=gigantic/65031
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
src/tools/ceph-dencoder/CMakeFiles/ceph-dencoder.dir/build.make:162: recipe for target 'src/tools/ceph-dencoder/CMakeFiles/ceph-dencoder.dir/osd_types.cc.o' failed
make[4]: *** [src/tools/ceph-dencoder/CMakeFiles/ceph-dencoder.dir/osd_types.cc.o] Error 4
make[4]: Leaving directory '/build/ceph-16.2.10-677-g44f3f2ab/obj-x86_64-linux-gnu'
CMakeFiles/Makefile2:9982: recipe for target 'src/tools/ceph-dencoder/CMakeFiles/ceph-dencoder.dir/all' failed

...

make[4]: Leaving directory '/build/ceph-16.2.10-677-g44f3f2ab/obj-x86_64-linux-gnu'
CMakeFiles/Makefile2:18643: recipe for target 'src/test/librbd/CMakeFiles/unittest_librbd.dir/all' failed
make[3]: *** [src/test/librbd/CMakeFiles/unittest_librbd.dir/all] Error 2
Makefile:143: recipe for target 'all' failed
make[2]: *** [all] Error 2
dh_auto_build: cd obj-x86_64-linux-gnu && make -j90 -O returned exit code 2
debian/rules:47: recipe for target 'override_dh_auto_build' failed
make[1]: *** [override_dh_auto_build] Error 25
make[1]: Leaving directory '/build/ceph-16.2.10-677-g44f3f2ab'
debian/rules:40: recipe for target 'build' failed
make: *** [build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2
E: Failed autobuilding of package

https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=bionic,DIST=bionic,MACHINE_SIZE=gigantic/65039

In file included from /build/ceph-16.2.10-677-gc035170d/src/librados/AioCompletionImpl.h:21:0,
                 from /build/ceph-16.2.10-677-gc035170d/src/rgw/cls_fifo_legacy.h:41,
                 from /build/ceph-16.2.10-677-gc035170d/src/rgw/rgw_log_backing.h:35,
                 from /build/ceph-16.2.10-677-gc035170d/src/rgw/rgw_datalog.h:39,
                 from /build/ceph-16.2.10-677-gc035170d/src/rgw/services/svc_bi_rados.h:19,
                 from /build/ceph-16.2.10-677-gc035170d/src/rgw/rgw_rados.h:34,
                 from /build/ceph-16.2.10-677-gc035170d/src/rgw/rgw_admin.cc:39:
/build/ceph-16.2.10-677-gc035170d/src/osd/osd_types.h: In member function 'void object_ref_delta_t::mut_ref(const hobject_t&, int)':
/build/ceph-16.2.10-677-gc035170d/src/osd/osd_types.h:5551:35: warning: unused variable '_' [-Wunused-variable]
     [[maybe_unused]] auto [iter, _] = ref_delta.try_emplace(hoid, 0);
                                   ^
/build/ceph-16.2.10-677-gc035170d/src/rgw/rgw_admin.cc: In function 'int main(int, const char**)':
/build/ceph-16.2.10-677-gc035170d/src/rgw/rgw_admin.cc:4671:33: warning: unused variable 'name' [-Wunused-variable]
           for (auto& [name, zone] : zonegroup.zones) {
                                 ^
c++: internal compiler error: Killed (program cc1plus)
Please submit a full bug report,
with preprocessed source if appropriate.
See <file:///usr/share/doc/gcc-7/README.Bugs> for instructions.
src/rgw/CMakeFiles/radosgw-admin.dir/build.make:65: recipe for target 'src/rgw/CMakeFiles/radosgw-admin.dir/rgw_admin.cc.o' failed
make[4]: *** [src/rgw/CMakeFiles/radosgw-admin.dir/rgw_admin.cc.o] Error 4
make[4]: Leaving directory '/build/ceph-16.2.10-677-gc035170d/obj-x86_64-linux-gnu'
CMakeFiles/Makefile2:28779: recipe for target 'src/rgw/CMakeFiles/radosgw-admin.dir/all' failed
make[3]: *** [src/rgw/CMakeFiles/radosgw-admin.dir/all] Error 2

...

make[4]: Leaving directory '/build/ceph-16.2.10-677-gc035170d/obj-x86_64-linux-gnu'
CMakeFiles/Makefile2:18643: recipe for target 'src/test/librbd/CMakeFiles/unittest_librbd.dir/all' failed
make[3]: *** [src/test/librbd/CMakeFiles/unittest_librbd.dir/all] Error 2
Makefile:143: recipe for target 'all' failed
make[2]: *** [all] Error 2
dh_auto_build: cd obj-x86_64-linux-gnu && make -j90 -O returned exit code 2
debian/rules:47: recipe for target 'override_dh_auto_build' failed
make[1]: *** [override_dh_auto_build] Error 25
make[1]: Leaving directory '/build/ceph-16.2.10-677-gc035170d'
debian/rules:40: recipe for target 'build' failed
make: *** [build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2

Some more examples:

Related issues 1 (0 open, 1 closed)

Related to Ceph - Bug #38629: ceph_dencoder.cc: c++: internal compiler error: Killed (program cc1plus) (Closed, assigned to Brad Hubbard)

Actions #1

Updated by Laura Flores over 1 year ago

  • Related to Bug #38629: ceph_dencoder.cc: c++: internal compiler error: Killed (program cc1plus) added
Actions #2

Updated by Laura Flores over 1 year ago

  • Priority changed from Normal to Urgent

This is a big problem for getting runs approved in the teuthology testing suite.

Actions #3

Updated by Laura Flores over 1 year ago

  • Subject changed from Internal compiler errors on some sepia nodes to Internal compiler errors and unmet dependencies on some sepia nodes

This is an "unmet dependencies" example. (Actually, the first build linked in the description hit unmet dependencies as well.)

https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=focal,DIST=focal,MACHINE_SIZE=gigantic/65086

c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
make[4]: *** [src/test/librbd/CMakeFiles/unittest_librbd.dir/build.make:599: src/test/librbd/CMakeFiles/unittest_librbd.dir/io/test_mock_SimpleSchedulerObjectDispatch.cc.o] Error 1
make[4]: Leaving directory '/build/ceph-16.2.10-685-g33392c83/obj-x86_64-linux-gnu'
[100%] Built target ceph-dencoder
make[4]: Leaving directory '/build/ceph-16.2.10-685-g33392c83/obj-x86_64-linux-gnu'
make[3]: *** [CMakeFiles/Makefile2:16816: src/test/librbd/CMakeFiles/unittest_librbd.dir/all] Error 2
make[3]: Leaving directory '/build/ceph-16.2.10-685-g33392c83/obj-x86_64-linux-gnu'
make[2]: *** [Makefile:144: all] Error 2
make[2]: Leaving directory '/build/ceph-16.2.10-685-g33392c83/obj-x86_64-linux-gnu'
dh_auto_build: error: cd obj-x86_64-linux-gnu && make -j90 returned exit code 2
make[1]: *** [debian/rules:47: override_dh_auto_build] Error 25
make[1]: Leaving directory '/build/ceph-16.2.10-685-g33392c83'
make: *** [debian/rules:40: build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2

...

Need to get 0 B/486 MB of archives. After unpacking 1672 MB will be used.
The following packages have unmet dependencies:
 libc6-dev : Depends: libc6 (= 2.31-0ubuntu9) but 2.31-0ubuntu9.9 is to be installed
 curl : Depends: libcurl4 (= 7.68.0-1ubuntu2) but 7.68.0-1ubuntu2.12 is to be installed
The following actions will resolve these dependencies:

     Upgrade the following packages:                                                            
1)     curl [7.68.0-1ubuntu2 (focal, now) -> 7.68.0-1ubuntu2.12 (focal-security, focal-updates)]
2)     libc-dev-bin [2.31-0ubuntu9 (focal, now) -> 2.31-0ubuntu9.9 (focal-updates)]             
3)     libc6-dev [2.31-0ubuntu9 (focal, now) -> 2.31-0ubuntu9.9 (focal-updates)]                

Actions #4

Updated by Brad Hubbard over 1 year ago

I tried to reproduce this locally with the latest focal container, but so far no joy, so this may be something specific to the failing environment.

[ 92%] Building CXX object src/test/librbd/CMakeFiles/unittest_librbd.dir/io/test_mock_ImageRequest.cc.o              
[ 92%] Building CXX object src/test/librbd/CMakeFiles/unittest_librbd.dir/io/test_mock_ObjectRequest.cc.o             
[ 92%] Building CXX object src/test/librbd/CMakeFiles/unittest_librbd.dir/io/test_mock_SimpleSchedulerObjectDispatch.cc.o                                                                                                                   
[ 92%] Building CXX object src/test/librbd/CMakeFiles/unittest_librbd.dir/journal/test_mock_OpenRequest.cc.o          
[ 92%] Building CXX object src/test/librbd/CMakeFiles/unittest_librbd.dir/journal/test_mock_PromoteRequest.cc.o
...
[100%] Linking CXX executable ../../../bin/unittest_librbd                                                            
[100%] Built target unittest_librbd 

I'll try a 'make all', but based on historical experience I'd say that won't reproduce either.

I suggest we start working out which build machines specifically are failing, what (if anything) those machines have in common, and whether we can reproduce this on one of them so we can run the compiler command manually (VERBOSE=1 make unittest_librbd) and see what we can glean from that. I would do this myself, but I'm not sure how safe it would be on a machine that is running other build jobs, nor how to temporarily remove a machine from the pool that jobs can run on. If anyone knows how to do that, let me know and I'll take a crack at it.

Actions #6

Updated by Laura Flores about 1 year ago

  • Status changed from New to Closed

Have not been seeing this.

Actions #7

Updated by Laura Flores 11 months ago

adami03

https://shaman.ceph.com/builds/ceph/wip-yuri2-testing-2023-05-15-0810-pacific/25796df0fedbe757877f6bad6ff202a3d2ca4abf/default/343071/
https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=focal,DIST=focal,MACHINE_SIZE=gigantic/69881/consoleFull

The following packages have unmet dependencies:
 libc6-dev : Depends: libc6 (= 2.31-0ubuntu9) but 2.31-0ubuntu9.9 is to be installed
 curl : Depends: libcurl4 (= 7.68.0-1ubuntu2) but 7.68.0-1ubuntu2.18 is to be installed
The following actions will resolve these dependencies:

     Upgrade the following packages:                                                            
1)     curl [7.68.0-1ubuntu2 (focal, now) -> 7.68.0-1ubuntu2.18 (focal-security, focal-updates)]
2)     libc-dev-bin [2.31-0ubuntu9 (focal, now) -> 2.31-0ubuntu9.9 (focal-updates)]             
3)     libc6-dev [2.31-0ubuntu9 (focal, now) -> 2.31-0ubuntu9.9 (focal-updates)]                

...

[100%] Built target radosgw-admin
c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
make[4]: *** [src/test/librbd/CMakeFiles/unittest_librbd.dir/build.make:742: src/test/librbd/CMakeFiles/unittest_librbd.dir/migration/test_mock_HttpClient.cc.o] Error 1
make[4]: *** Waiting for unfinished jobs....
c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
make[4]: *** [src/test/librbd/CMakeFiles/unittest_librbd.dir/build.make:183: src/test/librbd/CMakeFiles/unittest_librbd.dir/test_mock_Watcher.cc.o] Error 1
c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
make[4]: *** [src/test/librbd/CMakeFiles/unittest_librbd.dir/build.make:495: src/test/librbd/CMakeFiles/unittest_librbd.dir/image/test_mock_ListWatchersRequest.cc.o] Error 1
c++: fatal error: Killed signal terminated program cc1plus
compilation terminated.
make[4]: *** [src/test/librbd/CMakeFiles/unittest_librbd.dir/build.make:859: src/test/librbd/CMakeFiles/unittest_librbd.dir/mirror/snapshot/test_mock_ImageMeta.cc.o] Error 1
make[4]: Leaving directory '/build/ceph-16.2.13-145-g25796df0/obj-x86_64-linux-gnu'
make[3]: *** [CMakeFiles/Makefile2:16983: src/test/librbd/CMakeFiles/unittest_librbd.dir/all] Error 2
make[3]: Leaving directory '/build/ceph-16.2.13-145-g25796df0/obj-x86_64-linux-gnu'
make[2]: *** [Makefile:144: all] Error 2
make[2]: Leaving directory '/build/ceph-16.2.13-145-g25796df0/obj-x86_64-linux-gnu'
dh_auto_build: error: cd obj-x86_64-linux-gnu && make -j90 returned exit code 2
make[1]: *** [debian/rules:47: override_dh_auto_build] Error 25
make[1]: Leaving directory '/build/ceph-16.2.13-145-g25796df0'
make: *** [debian/rules:40: build] Error 2
dpkg-buildpackage: error: debian/rules build subprocess returned exit status 2
E: Failed autobuilding of package
Actions #8

Updated by Laura Flores 11 months ago

  • Status changed from Closed to New
  • Priority changed from Urgent to Normal
Actions #10

Updated by Laura Flores 11 months ago

In the past, I remember this "resolving on its own", so I'm not really sure what can be done to fix these instances.

Actions #11

Updated by Ilya Dryomov 11 months ago

It's getting axed by the OOM killer:

[6135613.010843] Out of memory: Killed process 2744119 (cc1plus) total-vm:2349764kB, anon-rss:2222100kB, file-rss:4kB, shmem-rss:19260kB, UID:1234 pgtables:4484kB oom_score_adj:0
[6135614.318236] Out of memory: Killed process 2744101 (cc1plus) total-vm:2341644kB, anon-rss:2209352kB, file-rss:4kB, shmem-rss:19440kB, UID:1234 pgtables:4460kB oom_score_adj:0
[6135616.249715] Out of memory: Killed process 2745814 (cc1plus) total-vm:2251748kB, anon-rss:2112456kB, file-rss:4kB, shmem-rss:19304kB, UID:1234 pgtables:4264kB oom_score_adj:0
[6135617.254399] Out of memory: Killed process 2746127 (cc1plus) total-vm:2227204kB, anon-rss:2088512kB, file-rss:4kB, shmem-rss:19420kB, UID:1234 pgtables:4216kB oom_score_adj:0
[6135618.573590] Out of memory: Killed process 2746295 (cc1plus) total-vm:2217952kB, anon-rss:2079664kB, file-rss:4kB, shmem-rss:19392kB, UID:1234 pgtables:4196kB oom_score_adj:0
[6135619.692717] Out of memory: Killed process 2746332 (cc1plus) total-vm:2222676kB, anon-rss:2088612kB, file-rss:4kB, shmem-rss:19192kB, UID:1234 pgtables:4208kB oom_score_adj:0
[6220856.328496] Out of memory: Killed process 1953918 (cc1plus) total-vm:3190308kB, anon-rss:3067760kB, file-rss:4kB, shmem-rss:22564kB, UID:1234 pgtables:6132kB oom_score_adj:0
[6220857.525933] Out of memory: Killed process 1953774 (cc1plus) total-vm:3115360kB, anon-rss:2989476kB, file-rss:4kB, shmem-rss:22020kB, UID:1234 pgtables:5984kB oom_score_adj:0
[6220858.339171] Out of memory: Killed process 1976078 (ld) total-vm:2156480kB, anon-rss:2097896kB, file-rss:4kB, shmem-rss:3768kB, UID:1234 pgtables:4176kB oom_score_adj:0
[6220859.066586] Out of memory: Killed process 1973142 (cc1plus) total-vm:2181256kB, anon-rss:2043632kB, file-rss:4kB, shmem-rss:22108kB, UID:1234 pgtables:4124kB oom_score_adj:0
[6220859.777710] Out of memory: Killed process 1973223 (cc1plus) total-vm:2185484kB, anon-rss:2046984kB, file-rss:4kB, shmem-rss:19036kB, UID:1234 pgtables:4128kB oom_score_adj:0
[6307146.871368] Out of memory: Killed process 1332969 (cc1plus) total-vm:2734880kB, anon-rss:2603424kB, file-rss:4kB, shmem-rss:19756kB, UID:1234 pgtables:5224kB oom_score_adj:0
[6307147.531479] Out of memory: Killed process 1332944 (cc1plus) total-vm:2721336kB, anon-rss:2587752kB, file-rss:4kB, shmem-rss:19660kB, UID:1234 pgtables:5204kB oom_score_adj:0
[6307148.426018] Out of memory: Killed process 1332856 (cc1plus) total-vm:2708952kB, anon-rss:2576892kB, file-rss:4kB, shmem-rss:19860kB, UID:1234 pgtables:5180kB oom_score_adj:0
[6307149.003684] Out of memory: Killed process 1332982 (cc1plus) total-vm:2701568kB, anon-rss:2573264kB, file-rss:4kB, shmem-rss:21932kB, UID:1234 pgtables:5160kB oom_score_adj:0
[6307149.828054] Out of memory: Killed process 1332952 (cc1plus) total-vm:2647496kB, anon-rss:2517048kB, file-rss:4kB, shmem-rss:22464kB, UID:1234 pgtables:5060kB oom_score_adj:0
[6307150.798411] Out of memory: Killed process 1342363 (cc1plus) total-vm:2198348kB, anon-rss:2059784kB, file-rss:4kB, shmem-rss:21624kB, UID:1234 pgtables:4156kB oom_score_adj:0
[6405383.240408] Out of memory: Killed process 3772439 (cc1plus) total-vm:3070440kB, anon-rss:2945592kB, file-rss:4kB, shmem-rss:22448kB, UID:1234 pgtables:5904kB oom_score_adj:0
[6405384.225951] Out of memory: Killed process 3772404 (cc1plus) total-vm:3008776kB, anon-rss:2879792kB, file-rss:4kB, shmem-rss:21976kB, UID:1234 pgtables:5784kB oom_score_adj:0
[6405384.891282] Out of memory: Killed process 3773090 (ld) total-vm:2150412kB, anon-rss:2086068kB, file-rss:4kB, shmem-rss:3748kB, UID:1234 pgtables:4152kB oom_score_adj:0
[6405385.641453] Out of memory: Killed process 3772979 (cc1plus) total-vm:2180436kB, anon-rss:2043000kB, file-rss:4kB, shmem-rss:19096kB, UID:1234 pgtables:4120kB oom_score_adj:0
[6406552.060448] Out of memory: Killed process 3885522 (cc1plus) total-vm:3068848kB, anon-rss:2945336kB, file-rss:4kB, shmem-rss:22496kB, UID:1234 pgtables:5896kB oom_score_adj:0
[6406552.795840] Out of memory: Killed process 3885508 (cc1plus) total-vm:2990384kB, anon-rss:2861984kB, file-rss:4kB, shmem-rss:22000kB, UID:1234 pgtables:5732kB oom_score_adj:0
[6406553.443921] Out of memory: Killed process 3886140 (cc1plus) total-vm:2239092kB, anon-rss:2101288kB, file-rss:4kB, shmem-rss:18964kB, UID:1234 pgtables:4236kB oom_score_adj:0
[6406554.239502] Out of memory: Killed process 3886189 (ld) total-vm:2151280kB, anon-rss:2086860kB, file-rss:4kB, shmem-rss:3992kB, UID:1234 pgtables:4164kB oom_score_adj:0
[6406554.789541] Out of memory: Killed process 3885875 (cc1plus) total-vm:2194368kB, anon-rss:2056192kB, file-rss:4kB, shmem-rss:21392kB, UID:1234 pgtables:4152kB oom_score_adj:0
[6406555.797806] Out of memory: Killed process 3885790 (cc1plus) total-vm:2193780kB, anon-rss:2056364kB, file-rss:4kB, shmem-rss:18392kB, UID:1234 pgtables:4140kB oom_score_adj:0

One fix would be to reduce the concurrency level (-j) to account not just for available CPUs but also for available memory (and the fact that C++ compilers eat memory for breakfast). A better fix might be to partially serialize the build, running targets that are known to be OOM killed most often in (relative) isolation.
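The first suggestion could be sketched as a job count capped by both the CPU count and total memory. This is a rough illustration, not the actual build scripts; the function name and the ~3 GB-per-job default are assumptions chosen to match the cc1plus peak RSS seen in the OOM log above:

```shell
# Hypothetical sketch of a memory-aware -j value: cap parallelism at both
# the CPU count and total memory divided by an assumed per-job footprint.
memory_aware_jobs() {
  local cpus mem_mb per_job_mb jobs
  cpus=$(nproc)
  mem_mb=$(awk '/MemTotal/ {print int($2 / 1024)}' /proc/meminfo)
  per_job_mb=${1:-3000}                 # assumed worst-case MB per compile job
  jobs=$(( mem_mb / per_job_mb ))
  [ "$jobs" -lt 1 ] && jobs=1           # never drop below one job
  [ "$jobs" -gt "$cpus" ] && jobs=$cpus # never exceed the CPU count
  echo "$jobs"
}

# e.g.  make -j"$(memory_aware_jobs)"
```

Raising per_job_mb trades build speed for OOM headroom; link steps would likely need an even larger allowance than compiles.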

Actions #12

Updated by Laura Flores 11 months ago

Ilya Dryomov wrote:

It's getting axed by the OOM killer:
[...]
One fix would be to reduce the concurrency level (-j) to account not just for available CPUs but also for available memory (and the fact that C++ compilers eat memory for breakfast). A better fix might be to partially serialize the build, running targets that are known to be OOM killed most often in (relative) isolation.

Thanks Ilya, can you link where you got that log from?

Actions #13

Updated by Ilya Dryomov 11 months ago

Laura Flores wrote:

Thanks Ilya, can you link where you got that log from?

I had it lying around from a past instance. This came up before ;)

Actions #15

Updated by Brad Hubbard 11 months ago

We should verify this is the oom-killer every time, then look at the nature of memory use on that specific machine compared to the others, e.g. tracking overall usage, number of processes running, etc. via sar or similar. We should also look at and compare memory fragmentation, since there may be available memory, just not in the right block sizes (this may also be visible in the complete oom-killer output). I believe /proc/buddyinfo will give an indication of fragmentation and could be used for comparison with the other build machines.
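As a rough aid for that buddyinfo comparison, the per-zone free totals can be summed with a small helper. This is an illustrative sketch (the function name is made up); on x86_64, the i-th count after the zone name holds free blocks of 2^i pages at 4 kB per page:

```shell
# Hypothetical helper: collapse a /proc/buddyinfo snapshot into total free kB
# per zone, for quick comparison between build machines.
buddy_free_kb() {
  awk '{
    kb = 0
    for (i = 5; i <= NF; i++)          # free-block counts start at field 5
      kb += $i * 4 * 2 ^ (i - 5)       # an order-(i-5) block is 2^(i-5) * 4 kB
    printf "%s %s %s: %d kB free\n", $1, $2, $4, kb
  }' "${1:-/proc/buddyinfo}"
}
```

Fed the same per-order counts that the kernel prints in its oom dumps, the helper reproduces the kernel's own free-kB totals, so it can be sanity-checked against those lines.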

Actions #16

Updated by Laura Flores 11 months ago

Brad Hubbard wrote:

We should verify this is the oom-killer every time, then look at the nature of memory use on that specific machine compared to the others, e.g. tracking overall usage, number of processes running, etc. via sar or similar. We should also look at and compare memory fragmentation, since there may be available memory, just not in the right block sizes (this may also be visible in the complete oom-killer output). I believe /proc/buddyinfo will give an indication of fragmentation and could be used for comparison with the other build machines.

Here is the os info for adami08:

lflores@adami08:~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.1 LTS" 
NAME="Ubuntu" 
VERSION_ID="22.04" 
VERSION="22.04.1 LTS (Jammy Jellyfish)" 
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/" 
SUPPORT_URL="https://help.ubuntu.com/" 
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" 
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" 
UBUNTU_CODENAME=jammy

Actions #17

Updated by Laura Flores 11 months ago

Brad and I looked into what was going on in adami08:

First, we checked /proc/buddyinfo, which didn't reveal anything too exciting but may be good to compare against other nodes:

root@adami08:~# cat /proc/buddyinfo
Node 0, zone      DMA      0      0      0      0      0      0      0      0      1      1      2 
Node 0, zone    DMA32   4291   3703   3496   3319   2780   2159   1422    663    154     16    230 
Node 0, zone   Normal  12629  10989   5862 123232 130773  81577  51114  34759  20572  10873    225 

Next, we checked the syslog and found several instances of oom-kills. In this case, process 674510 was trying to allocate memory, but there wasn't enough available, so process 674251 was killed:

May 23 22:23:38 adami08 kernel: [7110737.864795] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=system-pbuilder.slice,mems_allowed=0,global_oom,task_memcg=/system.slice/system-pbuilder.slice/system-pbuilder-build.slice/system-pbuilder-build-ceph_16.2.13\x2d224\x2dgea32eda4\x2d1focal.slice/system-pbuilder-build-ceph_16.2.13\x2d224\x2dgea32eda4\x2d1focal-315188.slice/run-r1acc4db4713a4612b6614fd6018299ea.scope,task=cc1plus,pid=674251,uid=1234
May 23 22:23:38 adami08 kernel: [7110737.864812] Out of memory: Killed process 674251 (cc1plus) total-vm:2880320kB, anon-rss:2813164kB, file-rss:4kB, shmem-rss:22740kB, UID:1234 pgtables:5644kB oom_score_adj:0
May 23 22:23:40 adami08 kernel: [7110739.540422] cc1plus invoked oom-killer: gfp_mask=0x1100dca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
May 23 22:23:40 adami08 kernel: [7110739.540430] CPU: 42 PID: 674510 Comm: cc1plus Not tainted 5.15.0-43-generic #46-Ubuntu
May 23 22:23:40 adami08 kernel: [7110739.540433] Hardware name: Supermicro AS -1013S-MTR/H11SSL-i, BIOS 2.1 02/21/2020
May 23 22:23:40 adami08 kernel: [7110739.540435] Call Trace:
May 23 22:23:40 adami08 kernel: [7110739.540437]  <TASK>
May 23 22:23:40 adami08 kernel: [7110739.540440]  show_stack+0x52/0x58
May 23 22:23:40 adami08 kernel: [7110739.540446]  dump_stack_lvl+0x4a/0x5f
May 23 22:23:40 adami08 kernel: [7110739.540450]  dump_stack+0x10/0x12
May 23 22:23:40 adami08 kernel: [7110739.540453]  dump_header+0x53/0x224
May 23 22:23:40 adami08 kernel: [7110739.540456]  oom_kill_process.cold+0xb/0x10
May 23 22:23:40 adami08 kernel: [7110739.540458]  out_of_memory+0x106/0x2e0
May 23 22:23:40 adami08 kernel: [7110739.540462]  __alloc_pages_slowpath.constprop.0+0x97a/0xa40
May 23 22:23:40 adami08 kernel: [7110739.540467]  __alloc_pages+0x30d/0x320
May 23 22:23:40 adami08 kernel: [7110739.540469]  alloc_pages_vma+0x9d/0x380
May 23 22:23:40 adami08 kernel: [7110739.540473]  do_anonymous_page+0xee/0x3b0
May 23 22:23:40 adami08 kernel: [7110739.540476]  handle_pte_fault+0x1fe/0x230
May 23 22:23:40 adami08 kernel: [7110739.540478]  __handle_mm_fault+0x3c7/0x700
May 23 22:23:40 adami08 kernel: [7110739.540481]  handle_mm_fault+0xd8/0x2c0
May 23 22:23:40 adami08 kernel: [7110739.540484]  do_user_addr_fault+0x1c5/0x670
May 23 22:23:40 adami08 kernel: [7110739.540488]  exc_page_fault+0x77/0x160
May 23 22:23:40 adami08 kernel: [7110739.540492]  ? asm_exc_page_fault+0x8/0x30
May 23 22:23:40 adami08 kernel: [7110739.540495]  asm_exc_page_fault+0x1e/0x30
May 23 22:23:40 adami08 kernel: [7110739.540498] RIP: 0033:0xb24ae7

Looking further into the syslog, we noted that for the Node 0 Normal zone, free space (118592kB) is below the defined minimum (136540kB). We also saw that Node 0 Normal does not have any larger blocks available, which indicates that anything requiring a contiguous allocation of 256kB or more cannot be satisfied:

May 23 22:23:40 adami08 kernel: [7110739.540574] Node 0 Normal free:118592kB min:136540kB low:397612kB high:658684kB reserved_highatomic:0KB active_anon:95212760kB inactive_anon:162530668kB active_file:868kB inactive_file:1124kB unevictable:0kB writepending:0kB present:265535488kB managed:261081196kB mlocked:0kB bounce:0kB free_pcp:3672kB local_pcp:492kB free_cma:0kB

May 23 22:23:40 adami08 kernel: [7110739.540605] Node 0 Normal: 7805*4kB (UME) 3134*8kB (UME) 3790*16kB (UME) 1296*32kB (UME) 38*64kB (M) 14*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 162628kB

We also counted how many syslog entries mention cc1plus, the same kind of process that was killed above:

root@adami08:~# grep "cc1plus" syslog | wc
     52     832    6760

The conclusion here is that the issue is with the Node 0 Normal zone: we are trying to allocate memory there and don't have enough. For next steps, we want to figure out what kind of monitoring is in place for adami08 as opposed to other nodes.

Actions #18

Updated by Laura Flores 11 months ago

Another observation from Brad:

Looking further at the syslog, there's a pattern where free is less than min, and there is an absence of large fragments to allocate.

$ egrep '(Node 0 Normal free:|Node 0 Normal:)' syslog.orig |tail -10

May 23 22:23:36 adami08 kernel: [7110735.779341] Node 0 Normal free:119704kB min:171356kB low:432428kB high:693500kB reserved_highatomic:0KB active_anon:95212804kB inactive_anon:162577756kB active_file:0kB inactive_file:2436kB unevictab
le:0kB writepending:4kB present:265535488kB managed:261081196kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

May 23 22:23:36 adami08 kernel: [7110735.779367] Node 0 Normal: 1248*4kB (UM) 2983*8kB (UME) 3441*16kB (UME) 1163*32kB (UME) 3*64kB (M) 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 121320kB

May 23 22:23:37 adami08 kernel: [7110736.778657] Node 0 Normal free:115288kB min:195932kB low:457004kB high:718076kB reserved_highatomic:0KB active_anon:95212788kB inactive_anon:162584708kB active_file:668kB inactive_file:316kB unevicta
ble:0kB writepending:0kB present:265535488kB managed:261081196kB mlocked:0kB bounce:0kB free_pcp:744kB local_pcp:248kB free_cma:0kB

May 23 22:23:37 adami08 kernel: [7110736.778682] Node 0 Normal: 920*4kB (U) 2647*8kB (UE) 3419*16kB (UE) 1143*32kB (UE) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 116136kB

May 23 22:23:38 adami08 kernel: [7110737.863680] Node 0 Normal free:115536kB min:122204kB low:383276kB high:644348kB reserved_highatomic:0KB active_anon:95212776kB inactive_anon:162547268kB active_file:1408kB inactive_file:2368kB unevic
table:0kB writepending:396kB present:265535488kB managed:261081196kB mlocked:0kB bounce:0kB free_pcp:644kB local_pcp:472kB free_cma:0kB

May 23 22:23:38 adami08 kernel: [7110737.863707] Node 0 Normal: 5694*4kB (UME) 3014*8kB (UME) 3631*16kB (UME) 1203*32kB (UME) 30*64kB (UM) 3*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 145784kB

May 23 22:23:40 adami08 kernel: [7110739.540574] Node 0 Normal free:118592kB min:136540kB low:397612kB high:658684kB reserved_highatomic:0KB active_anon:95212760kB inactive_anon:162530668kB active_file:868kB inactive_file:1124kB unevict
able:0kB writepending:0kB present:265535488kB managed:261081196kB mlocked:0kB bounce:0kB free_pcp:3672kB local_pcp:492kB free_cma:0kB

May 23 22:23:40 adami08 kernel: [7110739.540605] Node 0 Normal: 7805*4kB (UME) 3134*8kB (UME) 3790*16kB (UME) 1296*32kB (UME) 38*64kB (M) 14*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 162628kB

May 23 22:23:45 adami08 kernel: [7110745.081543] Node 0 Normal free:108228kB min:245084kB low:506156kB high:767228kB reserved_highatomic:0KB active_anon:95212740kB inactive_anon:162589944kB active_file:1800kB inactive_file:0kB unevictab
le:0kB writepending:0kB present:265535488kB managed:261081196kB mlocked:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB

May 23 22:23:45 adami08 kernel: [7110745.081598] Node 0 Normal: 1544*4kB (UM) 2054*8kB (UME) 3324*16kB (UME) 1146*32kB (UME) 0*64kB 1*128kB (M) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 112592kB

Actions #19

Updated by Dan Mick 11 months ago

Number of build jobs should be similar on both deb and rpm AFAICT. For deb, it's set in DEB_BUILD_OPTIONS by scripts/build_utils.sh:get_nr_build_jobs, to nproc first, then lowered to "memory/3000" (although the comment says 2200). For rpm, it's set in the ceph.spec.in file via _smp_ncpus_max, which uses a macro smp_limit_mem_per_job that does the same thing.

As far as I know, ninja/cmake treat all build jobs the same at this level by default, but it's important to note:

1) deb builds run under pbuilder, which will probably consume some extra memory
2) links and compiles are quite different in memory usage; links tend to be very memory-hungry because of C++ template resolution and just by their global nature

It's possible we could ameliorate build failures by dropping the concurrency in general, or by getting slightly smarter about the algorithm (maybe something that maxes out sooner on very-many-core machines; adami nodes have nproc == 96 and at least one observed run was using 90).

There's also a facility in ninja to use different levels of concurrency for different phases of the build; see 'pools' in https://ninja-build.org/manual.html#ref_pool.
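With the Ninja generator, CMake can express such pools directly via CMAKE_JOB_POOLS and the CMAKE_JOB_POOL_LINK / CMAKE_JOB_POOL_COMPILE variables. The pool names, depths, and source path below are illustrative, not a tested configuration for these builders:

```shell
# Sketch: give link steps (typically the most memory-hungry phase) their own
# small Ninja pool while compile steps keep the full concurrency.
cmake -G Ninja \
      -DCMAKE_JOB_POOLS="compile=90;link=4" \
      -DCMAKE_JOB_POOL_COMPILE=compile \
      -DCMAKE_JOB_POOL_LINK=link \
      /path/to/ceph        # hypothetical source directory
ninja
```

Note this only helps if the builders moved from the Makefile generator (the make -j90 seen in the logs) to Ninja.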

Actions #21

Updated by Laura Flores 8 months ago

Seeing this happen on other adami nodes now (adami07: https://jenkins.ceph.com/job/ceph-dev-new-build/ARCH=x86_64,AVAILABLE_ARCH=x86_64,AVAILABLE_DIST=focal,DIST=focal,MACHINE_SIZE=gigantic/72094/consoleFull)

This has never been reported with braggi nodes. The difference, as Dan noted above, is that adami nodes have a higher nproc than braggi:

adami:

++ get_nr_build_jobs
+++ nproc
++ local nproc=96
+++ vmstat --stats --unit m
+++ grep 'total memory'
+++ awk '{print int($1/3000)}'
++ local max_build_jobs=90
++ [[ 90 -eq 0 ]]
++ [[ 96 -ge 90 ]]
++ n_build_jobs=90
++ echo 90
+ DEB_BUILD_OPTIONS=parallel=90

braggi:

++ get_nr_build_jobs
+++ nproc
++ local nproc=48
+++ vmstat --stats --unit m
+++ grep 'total memory'
+++ awk '{print int($1/3000)}'
++ local max_build_jobs=90
++ [[ 90 -eq 0 ]]
++ [[ 48 -ge 90 ]]
++ n_build_jobs=48
++ echo 48
+ DEB_BUILD_OPTIONS=parallel=48

I checked several past builds for adami and braggi, and this is true for all that I saw. The confusing bit is that sometimes 90 max jobs does end up succeeding on adami, but more often than not, it fails.
