Bug #43551

Trying to enable the CEPH Telegraf module errors 'No such file or directory'

Added by Scott Hubbard over 2 years ago. Updated almost 2 years ago.

Status:
Resolved
Priority:
High
Assignee:
Category:
-
Target version:
-
% Done:

0%

Source:
Community (user)
Tags:
Backport:
nautilus, octopus
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Pull request ID:
Crash signature (v1):
Crash signature (v2):

Description

I followed the steps at https://docs.ceph.com/docs/master/mgr/telegraf/ and enabled the telegraf module.
I then configured the Telegraf daemon to open a local listener port, ran the second command to set the address, and received this error:

$ sudo ceph telegraf config-set address udp://:8094
Error EIO: Module 'telegraf' has experienced an error and cannot handle commands: (2, 'No such file or directory')

I also noticed that the cluster health degraded:

2020-01-09 22:02:25.317507 mon.ceph-mon0 [ERR] Health check failed: Module 'telegraf' has failed: (2, 'No such file or directory') (MGR_MODULE_ERROR)

No matter which telegraf-related command I ran, I received the same error:

$ sudo ceph telegraf config-show
Error EIO: Module 'telegraf' has experienced an error and cannot handle commands: (2, 'No such file or directory')

The ceph-mgr log shows that it loads the modules but eventually raises an exception:

2020-01-09 22:02:13.021 7fb3c44e7700 1 mgr load Constructed class from module: rbd_support
2020-01-09 22:02:13.021 7fb3c44e7700 1 mgr load Constructed class from module: restful
2020-01-09 22:02:13.021 7fb3c44e7700 1 mgr load Constructed class from module: status
2020-01-09 22:02:13.021 7fb3c44e7700 1 mgr load Constructed class from module: telegraf
2020-01-09 22:02:13.025 7fb3b94d1700 1 mgr[restful] server not running: no certificate configured
2020-01-09 22:02:13.025 7fb3c44e7700 1 mgr load Constructed class from module: volumes
2020-01-09 22:02:23.077 7fb3b84cf700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'telegraf' while running on mgr.ceph-mon0: (2, 'No such file or directory')
2020-01-09 22:02:23.077 7fb3b84cf700 -1 telegraf.serve:
2020-01-09 22:02:23.077 7fb3b84cf700 -1 Traceback (most recent call last):
  File "/usr/share/ceph/mgr/telegraf/module.py", line 295, in serve
    self.send_to_telegraf()
  File "/usr/share/ceph/mgr/telegraf/module.py", line 243, in send_to_telegraf
    with sock as s:
  File "/usr/share/ceph/mgr/telegraf/basesocket.py", line 41, in __enter__
    self.connect()
  File "/usr/share/ceph/mgr/telegraf/basesocket.py", line 29, in connect
    return self.sock.connect(self.address)
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
error: (2, 'No such file or directory')
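
The error in this traceback can be reproduced outside of Ceph. The following is a minimal Python 3 sketch, not the module's code: it connects a Unix datagram socket to a path where nothing is listening. The `try_connect` helper and the temporary path are hypothetical names used only for illustration.

```python
import errno
import os
import socket
import tempfile


def try_connect(path):
    """Connect a Unix datagram socket to `path`; return the errno, or 0 on success."""
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM)
    try:
        sock.connect(path)
        return 0
    except OSError as exc:  # socket.error is an alias of OSError on Python 3
        return exc.errno
    finally:
        sock.close()


# Hypothetical path inside a fresh temporary directory, so no socket file exists there.
missing_path = os.path.join(tempfile.mkdtemp(), "telegraf.sock")
result = try_connect(missing_path)
print(errno.errorcode.get(result), os.strerror(result))
```

The connect call fails with errno 2 (ENOENT) whenever the socket file is not visible to the calling process, regardless of whether it exists elsewhere on the host.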

I tried creating the telegraf unix socket at the default location, /tmp/telegraf.sock, but still got the same error.

What I ended up having to do was enable the module and then set the config address very quickly afterwards; then it worked.

sudo ceph mgr module enable telegraf
sudo ceph telegraf config-set address udp://:8094

There appear to be two problems:
1: Even if the socket file exists, the telegraf module is not able to see or read it.
2: If the module cannot find the unix socket file, that should not break other commands that try to set the module's options.

My Ceph version:
ceph version 14.2.5 (ad5bd132e1492173c85fda2cc863152730b16a92) nautilus (stable)

3 mgr/mon nodes, and 5 osd nodes.


Related issues

Copied to mgr - Backport #45069: octopus: Trying to enable the CEPH Telegraf module errors 'No such file or directory' Resolved
Copied to mgr - Backport #45070: nautilus: Trying to enable the CEPH Telegraf module errors 'No such file or directory' Resolved

History

#1 Updated by Greg Farnum over 2 years ago

  • Project changed from Ceph to mgr

#2 Updated by Josh Durgin over 2 years ago

  • Assignee set to Wido den Hollander

Wido, can you take a look at this one?

#3 Updated by Wido den Hollander over 2 years ago

Yes, so I saw this recently as well.

The telegraf module by default looks at a local unix socket: unixgram:///tmp/telegraf.sock

It throws an 'error', but apparently not one of the exceptions I tried to catch when I found this recently:

            except (socket.error, RuntimeError, IOError, OSError):
                self.log.exception('Failed to send statistics to Telegraf:')

This except block doesn't seem to catch it.

Any suggestions on what we need to catch to overcome this?
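
One possible explanation, which this thread does not confirm: the traceback shows the error being raised from `__enter__` on the `with sock as s:` line, so an except clause placed inside the body of the `with` statement never sees it. The exception tuple itself looks sufficient, since `socket.error` is a subclass of `IOError` on Python 2.7 and an alias of `OSError` on Python 3, where `FileNotFoundError` is an `OSError` subclass. The sketch below illustrates the scoping difference; `FailingSocket` and both `send_*` functions are hypothetical stand-ins, not the module's code.

```python
import socket


class FailingSocket:
    """Hypothetical stand-in for the socket wrapper: connecting fails in __enter__."""

    def __enter__(self):
        raise FileNotFoundError(2, "No such file or directory")

    def __exit__(self, exc_type, exc, tb):
        return False


def send_try_inside(sock):
    # The try block only covers the body of the with statement, so an
    # error raised by __enter__ propagates to the caller.
    with sock as s:
        try:
            s.send(b"stats")
        except (socket.error, RuntimeError, IOError, OSError):
            return "caught"
    return "sent"


def send_try_outside(sock):
    # The try block covers the with statement itself, so the connect
    # error raised by __enter__ is caught.
    try:
        with sock as s:
            s.send(b"stats")
    except (socket.error, RuntimeError, IOError, OSError):
        return "caught"
    return "sent"


try:
    inside_result = send_try_inside(FailingSocket())
except OSError:
    inside_result = "escaped"

outside_result = send_try_outside(FailingSocket())
print(inside_result, outside_result)
```

If the real module has the first shape, moving the `try` so that it encloses the `with` statement would keep a failed connect from terminating the serve loop.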

#4 Updated by Sage Weil over 2 years ago

  • Priority changed from Normal to High

2020-03-18T17:03:03.739 INFO:tasks.ceph.mgr.x.smithi182.stderr:2020-03-18T17:03:03.737+0000 7f892f71a700 -1 log_channel(cluster) log [ERR] : Unhandled exception from module 'telegraf' while running on mgr.x: [Errno 2] No such file or directory
2020-03-18T17:03:03.740 INFO:tasks.ceph.mgr.x.smithi182.stderr:2020-03-18T17:03:03.737+0000 7f892f71a700 -1 telegraf.serve:
2020-03-18T17:03:03.740 INFO:tasks.ceph.mgr.x.smithi182.stderr:2020-03-18T17:03:03.737+0000 7f892f71a700 -1 Traceback (most recent call last):
2020-03-18T17:03:03.740 INFO:tasks.ceph.mgr.x.smithi182.stderr:  File "/usr/share/ceph/mgr/telegraf/module.py", line 295, in serve
2020-03-18T17:03:03.740 INFO:tasks.ceph.mgr.x.smithi182.stderr:    self.send_to_telegraf()
2020-03-18T17:03:03.741 INFO:tasks.ceph.mgr.x.smithi182.stderr:  File "/usr/share/ceph/mgr/telegraf/module.py", line 243, in send_to_telegraf
2020-03-18T17:03:03.741 INFO:tasks.ceph.mgr.x.smithi182.stderr:    with sock as s:
2020-03-18T17:03:03.741 INFO:tasks.ceph.mgr.x.smithi182.stderr:  File "/usr/share/ceph/mgr/telegraf/basesocket.py", line 41, in __enter__
2020-03-18T17:03:03.741 INFO:tasks.ceph.mgr.x.smithi182.stderr:    self.connect()
2020-03-18T17:03:03.742 INFO:tasks.ceph.mgr.x.smithi182.stderr:  File "/usr/share/ceph/mgr/telegraf/basesocket.py", line 29, in connect
2020-03-18T17:03:03.742 INFO:tasks.ceph.mgr.x.smithi182.stderr:    return self.sock.connect(self.address)
2020-03-18T17:03:03.743 INFO:tasks.ceph.mgr.x.smithi182.stderr:FileNotFoundError: [Errno 2] No such file or directory

/a/sage-2020-03-18_14:59:42-rados-wip-sage-testing-2020-03-18-0826-distro-basic-smithi/4866150
description: rados/mgr/{clusters/{2-node-mgr.yaml} debug/mgr.yaml objectstore/bluestore-comp-zlib.yaml
supported-random-distro$/{rhel_8.yaml} tasks/module_selftest.yaml}

#6 Updated by Kefu Chai over 2 years ago

  • Status changed from New to Pending Backport
  • Assignee changed from Wido den Hollander to Kefu Chai
  • Backport set to nautilus, octopus
  • Pull request ID set to 34468

#7 Updated by Nathan Cutler over 2 years ago

  • Copied to Backport #45069: octopus: Trying to enable the CEPH Telegraf module errors 'No such file or directory' added

#8 Updated by Nathan Cutler over 2 years ago

  • Copied to Backport #45070: nautilus: Trying to enable the CEPH Telegraf module errors 'No such file or directory' added

#9 Updated by Nathan Cutler over 2 years ago

  • Status changed from Pending Backport to Resolved

While running with --resolve-parent, the script "backport-create-issue" noticed that all backports of this issue are in status "Resolved" or "Rejected".

#10 Updated by Stefan Kooman almost 2 years ago

The fix does nothing to prevent this from happening; it only handles the exception. We need socket support. The underlying issue is the updated systemd service file for the Ceph manager that comes with Nautilus, which enables a private /tmp (PrivateTmp=true); it works in Mimic and Luminous. The Ceph manager does not find /tmp/telegraf.sock in its own namespace. A fix might be to change the default location of the socket file to /var/telegraf.sock.
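
The condition described above can be checked with a short script. This is a rough sketch under stated assumptions: `SAMPLE_UNIT` is an invented unit file, not the one shipped with Ceph, and both helper functions are hypothetical. It relies on the documented systemd behaviour that `PrivateTmp=true` gives the service private `/tmp` and `/var/tmp` directories.

```python
import configparser

# Invented example unit text; the real ceph-mgr unit file will differ.
SAMPLE_UNIT = """\
[Unit]
Description=Example manager daemon

[Service]
ExecStart=/usr/bin/example-mgr --id %i
PrivateTmp=true
"""


def private_tmp_enabled(unit_text):
    """Return True if the [Service] section sets PrivateTmp to a true value."""
    # interpolation=None keeps specifiers such as %i from being treated as
    # configparser interpolation; strict=False tolerates repeated keys.
    parser = configparser.ConfigParser(interpolation=None, strict=False)
    parser.read_string(unit_text)
    return parser.getboolean("Service", "PrivateTmp", fallback=False)


def socket_hidden_by_private_tmp(unit_text, socket_path):
    """Return True if the socket path lies in a directory that PrivateTmp replaces."""
    in_tmp = socket_path.startswith(("/tmp/", "/var/tmp/"))
    return in_tmp and private_tmp_enabled(unit_text)


print(socket_hidden_by_private_tmp(SAMPLE_UNIT, "/tmp/telegraf.sock"))
print(socket_hidden_by_private_tmp(SAMPLE_UNIT, "/var/telegraf.sock"))
```

With the sample unit, the default `/tmp/telegraf.sock` is reported as hidden from the service, while the `/var/telegraf.sock` location suggested above is not.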
