Bug #19126: "libsemanage.semanage_direct_get_module_info:" error causing ceph-cm-ansible to fail - sepia - Ceph

Bug #19126

closed

"libsemanage.semanage_direct_get_module_info:" error causing ceph-cm-ansible to fail

Added by David Galloway about 7 years ago. Updated about 7 years ago.

Status: Resolved
Priority: Urgent
Assignee: David Galloway
Category: Test Node
Severity: 3 - minor

Description

The common role is occasionally failing to complete due to the following error:

TASK [common : nrpe - Load SELinux policy package] *****************************
task path: /home/dgalloway/git/ceph/ceph-cm-ansible/roles/common/tasks/nrpe-selinux.yml:38
fatal: [smithi150.front.sepia.ceph.com]: FAILED! => {
    "changed": true, 
    "cmd": [
        "semodule", 
        "-i", 
        "/tmp/nrpe.pp" 
    ], 
    "delta": "0:00:00.085664", 
    "end": "2017-03-01 21:14:07.733917", 
    "failed": true, 
    "invocation": {
        "module_args": {
            "_raw_params": "semodule -i /tmp/nrpe.pp", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "warn": true
        }, 
        "module_name": "command" 
    }, 
    "rc": 1, 
    "start": "2017-03-01 21:14:07.648253", 
    "stderr": "libsemanage.semanage_direct_get_module_info: Unable to read mod_fastcgi module lang ext file.\nlibsemanage.semanage_direct_get_module_info: Unable to read mod_fastcgi module lang ext file.\nlibsemanage.semanage_direct_get_module_info: Unable to read mod_fastcgi module lang ext file.\nsemodule:  Failed on /tmp/nrpe.pp!", 
    "stdout": "", 
    "stdout_lines": [], 
    "warnings": []
}

Once this happens, the testnode stays in a broken state and fails every subsequent job until it is repaired.

Example: http://sentry.ceph.com/sepia/teuthology/issues/736/events/67513
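For triage, the name of the module that semodule trips over can be pulled out of the stderr string. The following is a minimal Python sketch and not part of ceph-cm-ansible: the function name `failing_modules` is hypothetical, and the regular expression assumes the message wording is exactly what appears in the log above.

```python
import re
from typing import List

# Matches the libsemanage message seen in the failed task's stderr.
# Assumption: the wording is exactly as in the log above.
_PATTERN = re.compile(
    r"libsemanage\.semanage_direct_get_module_info: "
    r"Unable to read (\S+) module lang ext file\."
)


def failing_modules(stderr: str) -> List[str]:
    """Return the distinct module names reported as unreadable, in order."""
    seen: List[str] = []
    for name in _PATTERN.findall(stderr):
        if name not in seen:
            seen.append(name)
    return seen


if __name__ == "__main__":
    sample = (
        "libsemanage.semanage_direct_get_module_info: Unable to read "
        "mod_fastcgi module lang ext file.\n"
        "semodule:  Failed on /tmp/nrpe.pp!"
    )
    print(failing_modules(sample))
```

The message is repeated three times in the real stderr, so the sketch de-duplicates the names while keeping their order.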

Updated by David Galloway about 7 years ago

Description updated (diff)

Updated by David Galloway about 7 years ago

Some notes.

I've got smithi150 (broken) and smithi143 (not broken) locked.

yum/rpm report that mod_fastcgi-2.4.7-1.ceph.el7.centos.x86_64 is installed on both machines.

However, all the files in /etc/selinux/targeted/active/modules/400/mod_fastcgi/ (cil, hll, and lang_ext) are empty on smithi150.

[root@smithi143 selinux]# file /etc/selinux/targeted/active/modules/400/mod_fastcgi/*
/etc/selinux/targeted/active/modules/400/mod_fastcgi/cil:      bzip2 compressed data, block size = 500k
/etc/selinux/targeted/active/modules/400/mod_fastcgi/hll:      bzip2 compressed data, block size = 500k
/etc/selinux/targeted/active/modules/400/mod_fastcgi/lang_ext: ASCII text, with no line terminators

[root@smithi150 ~]# file /etc/selinux/targeted/active/modules/400/mod_fastcgi/*
/etc/selinux/targeted/active/modules/400/mod_fastcgi/cil:      empty
/etc/selinux/targeted/active/modules/400/mod_fastcgi/hll:      empty
/etc/selinux/targeted/active/modules/400/mod_fastcgi/lang_ext: empty

So something broke mod_fastcgi. I queried the last 50 jobs run on smithi150, and this job appears to be the culprit: http://qa-proxy.ceph.com/teuthology/teuthology-2017-02-27_02:01:17-rbd-master-distro-basic-smithi/862407/teuthology.log
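The manual comparison of the two nodes with `file` could be automated by scanning the module store for zero-length files. This is a rough Python sketch under stated assumptions: the function name `find_empty_module_files` is hypothetical, and the store is assumed to follow the `<priority>/<module>/{cil,hll,lang_ext}` layout visible in the listings above.

```python
import os
from typing import List

# File names found in each module directory, per the listings above.
MODULE_FILES = ("cil", "hll", "lang_ext")


def find_empty_module_files(store_root: str) -> List[str]:
    """Return paths of zero-length module files under store_root.

    store_root is expected to hold <priority>/<module>/<file> entries,
    for example /etc/selinux/targeted/active/modules.
    """
    empty: List[str] = []
    for priority in sorted(os.listdir(store_root)):
        priority_dir = os.path.join(store_root, priority)
        if not os.path.isdir(priority_dir):
            continue
        for module in sorted(os.listdir(priority_dir)):
            for name in MODULE_FILES:
                path = os.path.join(priority_dir, module, name)
                # Only existing regular files of size 0 count as corrupted.
                if os.path.isfile(path) and os.path.getsize(path) == 0:
                    empty.append(path)
    return empty
```

Run as root against `/etc/selinux/targeted/active/modules`, an empty result would correspond to a node like smithi143, while a node like smithi150 would list the three mod_fastcgi files.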

Updated by David Galloway about 7 years ago

Description updated (diff)

All I was really able to deduce was that something was corrupting the mod_fastcgi SELinux policy module files in /etc/selinux/targeted/active/modules/400/mod_fastcgi.

We remove [1] and reinstall [2] mod_fastcgi and nrpe [3] modules with every ansible run anyway, so I added a task that simply makes sure mod_fastcgi and nrpe are not present in /etc/selinux/targeted/active/modules/400.

https://github.com/ceph/ceph-cm-ansible/pull/309

[1] https://github.com/ceph/ceph-cm-ansible/blob/master/roles/testnode/vars/yum_systems.yml#L21
[2] https://github.com/ceph/ceph-cm-ansible/blob/master/roles/testnode/vars/centos_7.yml#L78
[3] https://github.com/ceph/ceph-cm-ansible/blob/master/roles/common/tasks/nrpe-selinux.yml
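
The actual change is an Ansible task in the pull request above. Purely as an illustration of the idea, a Python sketch of "make sure these module directories are absent" might look like the following; the function name and its default argument are assumptions for this example, not code from ceph-cm-ansible.

```python
import os
import shutil
from typing import Iterable, List

STORE_DIR = "/etc/selinux/targeted/active/modules/400"


def ensure_modules_absent(names: Iterable[str],
                          store_dir: str = STORE_DIR) -> List[str]:
    """Delete the given module directories if present; return those removed."""
    removed: List[str] = []
    for name in names:
        path = os.path.join(store_dir, name)
        if os.path.isdir(path):
            shutil.rmtree(path)
            removed.append(path)
    return removed
```

Calling `ensure_modules_absent(["mod_fastcgi", "nrpe"])` as root is idempotent: a second call finds nothing to delete and returns an empty list.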

Updated by David Galloway about 7 years ago

Status changed from New to Resolved

Updated by David Galloway about 7 years ago

Project changed from ceph-cm-ansible to sepia
Subject changed from libsemanage.semanage_direct_get_module_info: Unable to read mod_fastcgi module lang ext file. to "libsemanage.semanage_direct_get_module_info:" error causing ceph-cm-ansible to fail
Category set to Test Node
Status changed from Resolved to 12

Updated by David Galloway about 7 years ago

Priority changed from Normal to Urgent

The problem reappeared, except that semodule now fails on abrt (the first module in /etc/selinux/targeted/active/modules/100/) instead of mod_fastcgi.

Here are the last jobs that passed (or went dead, if ceph-cm-ansible passed) on five of the smithi nodes with this problem:

smithi014: http://pulpito.ceph.com/sage-2017-03-04_20:16:25-rados:thrash-erasure-code-wip-osd-full---basic-smithi/882764
smithi021: http://pulpito.ceph.com/sage-2017-03-03_22:21:26-rados-wip-sage-testing---basic-smithi/879924
smithi029: http://pulpito.ceph.com/sage-2017-03-03_02:56:43-rados-wip-sage-testing---basic-smithi/877300
smithi038: http://pulpito.ceph.com/teuthology-2017-03-04_05:20:02-kcephfs-kraken-testing-basic-smithi/882656/
smithi134: http://pulpito.ceph.com/sage-2017-03-03_22:21:26-rados-wip-sage-testing---basic-smithi/879934

The first instance of this problem was on January 23, 2017 at 17:58:26 UTC: http://sentry.ceph.com/sepia/teuthology/issues/736/events/60205/

Updated by David Galloway about 7 years ago

Reinstalling selinux-policy-targeted restores the module files (/etc/selinux/targeted/active/modules/100/*/*) to a sane state and semodule exits cleanly again.

I could just have selinux-policy-targeted reinstalled on every ansible run, but something is putting the module files into a bad state and I'm trying to find out what.

smithi014 and smithi021 are no longer in a broken state because of my testing.

Updated by David Galloway about 7 years ago

I'm starting to wonder whether the latest version of the selinux-policy-targeted package is causing this.

Here's a diff of the install scripts between selinux-policy-targeted-3.13.1-102.el7_3.7.noarch and selinux-policy-targeted-3.13.1-102.el7_3.15.noarch:
https://www.diffchecker.com/dxDGIUZP

According to http://mirror.centos.org/centos/7/updates/x86_64/Packages/, selinux-policy-targeted-3.13.1-102.el7_3.13 was released on 2017-01-18, which is close to the first time we saw this problem.

Difference between rebuild scripts: https://www.diffchecker.com/xqGJDnyW
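
A comparison like the ones linked above can also be produced locally. This sketch assumes the scriptlet text of each package version has already been captured as a string (for example from `rpm -qp --scripts` run against each downloaded RPM). The function name and the sample strings are made up for illustration and are not the real scriptlets.

```python
import difflib


def diff_scriptlets(old: str, new: str, old_label: str, new_label: str) -> str:
    """Return a unified diff of two scriptlet dumps as one string."""
    lines = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",
    )
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder text only; not the actual package scriptlets.
    old = "step one\nstep two\n"
    new = "step one\nstep two changed\n"
    print(diff_scriptlets(old, new, "el7_3.7", "el7_3.15"))
```

Identical inputs produce an empty string, which makes it easy to tell at a glance whether a given scriptlet changed between versions.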

Updated by David Galloway about 7 years ago

On smithi029, abrt's module files were last modified on 2017-03-04 at 10:33.

[root@smithi029 abrt]# stat cil 
  File: ‘cil’
  Size: 0             Blocks: 0          IO Block: 4096   regular empty file
Device: 801h/2049d    Inode: 6161979     Links: 1
Access: (0644/-rw-r--r--)  Uid: (    0/    root)   Gid: (    0/    root)
Context: unconfined_u:object_r:semanage_store_t:s0
Access: 2017-03-06 19:19:53.034358186 +0000
Modify: 2017-03-04 10:33:12.224531450 +0000
Change: 2017-03-04 10:33:12.224531450 +0000

Looking at yum history, I see that transaction IDs 46557 and 46558 ran at that time.

[root@smithi029 abrt]# yum history info 46558
Loaded plugins: fastestmirror, langpacks, priorities
Transaction ID : 46558
Begin time     : Sat Mar  4 10:33:03 2017
Begin rpmdb    : 860:bd1cfe5bdc56c8e26dd990bad3d73de40da68d2d
End time       :            10:33:25 2017 (22 seconds)
End rpmdb      : 837:f1d1399e1a489c3a04b8f62cfc5dd37c8b3e505b
User           :  <ubuntu>
Return-Code    : Success
Command Line   : -y remove  librados2
Transaction performed with:
    Installed     rpm-4.11.3-21.el7.x86_64                      @anaconda
    Installed     yum-3.4.3-150.el7.centos.noarch               @anaconda
    Installed     yum-plugin-fastestmirror-1.1.31-40.el7.noarch @anaconda
Packages Altered:
    Erase ceph-base-1:12.0.0-931.g8d07615.el7.x86_64          @ceph
    Erase ceph-common-1:12.0.0-931.g8d07615.el7.x86_64        @ceph
    Erase ceph-mds-1:12.0.0-931.g8d07615.el7.x86_64           @ceph
    Erase ceph-mgr-1:12.0.0-931.g8d07615.el7.x86_64           @ceph
    Erase ceph-mon-1:12.0.0-931.g8d07615.el7.x86_64           @ceph
    Erase ceph-osd-1:12.0.0-931.g8d07615.el7.x86_64           @ceph
    Erase ceph-selinux-1:12.0.0-931.g8d07615.el7.x86_64       @ceph
    Erase libcephfs-devel-1:12.0.0-931.g8d07615.el7.x86_64    @ceph
    Erase libcephfs2-1:12.0.0-931.g8d07615.el7.x86_64         @ceph
    Erase librados-devel-1:12.0.0-931.g8d07615.el7.x86_64     @ceph
    Erase librados2-1:12.0.0-931.g8d07615.el7.x86_64          @ceph
    Erase libradosstriper1-1:12.0.0-931.g8d07615.el7.x86_64   @ceph
    Erase librbd1-1:12.0.0-931.g8d07615.el7.x86_64            @ceph
    Erase librgw2-1:12.0.0-931.g8d07615.el7.x86_64            @ceph
    Erase python-ceph-compat-1:12.0.0-931.g8d07615.el7.x86_64 @ceph
    Erase python-cephfs-1:12.0.0-931.g8d07615.el7.x86_64      @ceph
    Erase python-rados-1:12.0.0-931.g8d07615.el7.x86_64       @ceph
    Erase python-rbd-1:12.0.0-931.g8d07615.el7.x86_64         @ceph
    Erase python-rgw-1:12.0.0-931.g8d07615.el7.x86_64         @ceph
    Erase qemu-img-10:1.5.3-126.el7_3.5.x86_64                @updates
    Erase qemu-kvm-10:1.5.3-126.el7_3.5.x86_64                @updates
    Erase qemu-kvm-common-10:1.5.3-126.el7_3.5.x86_64         @updates
    Erase rbd-fuse-1:12.0.0-931.g8d07615.el7.x86_64           @ceph
Scriptlet output:
   1 warning: file /etc/logrotate.d/ceph: remove failed: No such file or directory
history info

That package version leads me to https://3.chacra.ceph.com/repos/ceph/wip-zyan-testing/8d0761524228ef05170ebadd57c65b92e5b66694/centos/7/flavors/default/

Installing ceph-selinux and removing it does not reproduce the issue.

Did the same exercise on smithi038. No joy on a reproducer.
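
The timestamp correlation used above (file modification time versus yum transaction window) can be written down as a small check. This is only a sketch: the function name is hypothetical, and it assumes that the `stat` output and `yum history` report times in the same timezone. Only transaction 46558 is checked, since its begin time and duration are the ones shown.

```python
from datetime import datetime, timedelta


def modified_during(mtime: datetime, begin: datetime, duration_s: int) -> bool:
    """True if mtime falls within [begin, begin + duration_s seconds]."""
    return begin <= mtime <= begin + timedelta(seconds=duration_s)


if __name__ == "__main__":
    # Values copied from the stat and yum history output above.
    # Assumption: both clocks report the same timezone (UTC).
    cil_mtime = datetime(2017, 3, 4, 10, 33, 12, 224531)  # truncated to microseconds
    txn_begin = datetime(2017, 3, 4, 10, 33, 3)
    print(modified_during(cil_mtime, txn_begin, 22))
```

A match only shows that the files were modified while the transaction was running. As the failed reproduction attempts show, it does not prove that the package removal caused the corruption.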

#10

Updated by Zheng Yan about 7 years ago

I still see similar errors on smithi{014,038,134}.

One example: http://pulpito.ceph.com/teuthology-2017-03-06_03:25:01-kcephfs-master-testing-basic-smithi/886325/

I may have locked these machines manually, run the following job, and then run teuthology -r -u -t.

roles:
- [osd.0, mds.a, mds.b]
- [osd.1, mds.c, mds.d]
- [osd.2, mds.e, mds.f]
- [osd.3, mds.g, mon.0]
- [client.0]
- [client.1]

branch: wip-zyan-testing
suite_branch: wip-zyan-testing
suite_relpath: qa
kernel:
  branch: testing

overrides:
  install:
    ceph:
      branch: wip-zyan-testing
  ceph:
    conf:
      mds:
        mds thrash exports: 0
        mds debug scatterstat: 0
        debug monc: 20

tasks:
- install:
- ceph:
- kclient: [client.0, client.1]
- interactive:

Maybe I did something wrong.

#11