Bug #17415 (closed): RHEV Network or Storage issues

Added by David Galloway over 7 years ago. Updated over 6 years ago.

Status:
Resolved
Priority:
Normal
Category:
Infrastructure Service
Target version:
-
% Done:
0%

Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):

Description

At about 00:27 UTC on 27SEP2016, the RHEV cluster began experiencing either backend storage issues or overall network issues. So far, all clues point to store01.front.sepia.ceph.com, the main gluster node backing the RHEV VMs.

Symptoms

  • teuthology and pcp VMs were paused
  • RHEV logs show an "unknown storage error"
  • RHEV Manager VM was restarted (this VM's disk is served over NFS, not gluster)
  • Logs from the different involved machines are below
  • CPU soft lockups in the teuthology VM's kernel log, indicating either network issues accessing the LRC or its own disk (hosted on gluster)

NOTE: Times around 00:20 are UTC. Times around 20:20 are EDT.
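Because hv01's syslog is stamped in EDT while the gluster brick log uses UTC, correlating the two requires normalizing timestamps first. A minimal sketch (assuming a fixed UTC-4 offset for EDT and the standard syslog `Mon DD HH:MM:SS` prefix; the year is supplied since syslog omits it):

```python
from datetime import datetime, timedelta

EDT_OFFSET = timedelta(hours=4)  # EDT is UTC-4, so add 4h to get UTC

def syslog_edt_to_utc(line, year=2016):
    # Parse the leading "Sep 26 20:27:00" syslog timestamp and shift to UTC.
    ts = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    return ts + EDT_OFFSET

utc = syslog_edt_to_utc("Sep 26 20:27:00 hv01 journal: ...")
print(utc.isoformat())  # 2016-09-27T00:27:00
```

This confirms the hv01 events at 20:27 EDT line up with the gluster brick errors shortly after 00:27 UTC.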

===== hv01:/var/log/messages =====
Sep 26 20:27:00 hv01 journal: ovirt-ha-agent ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Sep 26 20:27:00 hv01 ovirt-ha-agent: ERROR:ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink:Connection closed: Connection timed out
Sep 26 20:27:00 hv01 ovirt-ha-agent: WARNING:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Error while monitoring engine: Connection timed out
Sep 26 20:27:00 hv01 ovirt-ha-agent: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Sleeping 60 seconds
Sep 26 20:27:07 hv01 journal: vdsm ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out
Sep 26 20:27:07 hv01 journal: vdsm vds ERROR failed to retrieve Hosted Engine HA info#012Traceback (most recent call last):#012  File "/usr/share/vdsm/API.py", line 1851, in _getHaInfo#012    stats = instance.get_all_stats()#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats#012    self._configure_broker_conn(broker)#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn#012    dom_type=dom_type)#012  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain#012    .format(sd_type, options, e))#012RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': 'ca46f740-bc09-477f-b09e-cf7e746dca73'}: Connection timed out

===== store01:/var/log/messages =====
Sep 27 00:27:01 old-gw pdns[2958]: Backend reported permanent error which prevented lookup (GSQLBackend lookup query:Failed to execute mysql_query, perhaps connection died? Err=1: Lost connection to MySQL server during query), aborting

===== store01:/var/log/glusterfs/bricks/srv-gluster-shardbrick1.log =====
[2016-09-26 03:27:08.778199] I [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing
[2016-09-27 00:28:08.678750] W [socket.c:696:__socket_rwv] 0-tcp.shardvol1-server: writev on 172.21.0.5:49148 failed (Broken pipe)
[2016-09-27 00:28:15.872485] I [socket.c:2468:socket_event_handler] 0-transport: disconnecting now
[2016-09-27 00:28:17.803641] I [MSGID: 115036] [server.c:548:server_rpc_notify] 0-shardvol1-server: disconnecting connection from hv01.front.sepia.ceph.com-15267-2016/08/04-00:38:10:236127-shardvol1-client-0-0-0
[2016-09-27 00:28:30.945155] I [socket.c:3498:socket_submit_reply] 0-tcp.shardvol1-server: not connected (priv->connected = -1)
... Repeated ...
[2016-09-27 00:28:44.611277] E [rpcsvc.c:1310:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x255417e4, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.shardvol1-server)
[2016-09-27 00:28:44.611286] E [server.c:200:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.0/xlator/debug/io-stats.so(+0x1814e) [0x7f8396b4514e] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x1a2e9) [0x7f83966e72e9] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x8fa6) [0x7f83966d5fa6] ) 0-: Reply submission failed
[2016-09-27 00:28:44.611331] E [server.c:200:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.0/xlator/debug/io-stats.so(+0x1814e) [0x7f8396b4514e] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x1a2e9) [0x7f83966e72e9] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x8fa6) [0x7f83966d5fa6] ) 0-: Reply submission failed

===== mgr01:/var/log/ovirt-engine/engine.log =====
2016-09-26 20:35:58,182 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-6-thread-16) [1c630ccf] domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4:iso_storage' in problem. vds: 'hv02'
2016-09-26 20:35:58,184 WARN  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-6-thread-16) [1c630ccf] domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700:shardvol1' in problem. vds: 'hv02'

2016-09-26 20:36:03,468 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-42) [4a281328] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: d6b1d38b-6f1f-4438-b04f-53619241a1ef Type: StoragePool
2016-09-26 20:36:03,480 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-43) [22db002d] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: d3ec2fae-de85-41d2-8cf4-b09c3fd77ca3 Type: StoragePool
2016-09-26 20:36:03,481 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-44) [75431892] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: 71a72320-daa6-4c1c-a9ee-c809a8f0ad9f Type: StoragePool
2016-09-26 20:36:03,510 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-42) [4a281328] Storage Pool 'd6b1d38b-6f1f-4438-b04f-53619241a1ef' - Updating Storage Domain '14b9a2fc-838c-46d4-8ba6-52f8d815f19e' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,511 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-43) [22db002d] Storage Pool 'd3ec2fae-de85-41d2-8cf4-b09c3fd77ca3' - Updating Storage Domain '1ecdcdf2-4301-4bba-a17c-270a85f11ade' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,523 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-44) [75431892] Storage Pool '71a72320-daa6-4c1c-a9ee-c809a8f0ad9f' - Updating Storage Domain 'b43d2eef-5cfa-4290-9e7b-f6ebc4e9c891' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,535 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-41) [7734731c] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: 28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6 Type: StoragePool
2016-09-26 20:36:03,583 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,596 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,598 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain 'ca46f740-bc09-477f-b09e-cf7e746dca73' status from 'Active' to 'Unknown', reason: null
2016-09-26 20:36:03,832 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-43) [22db002d] Correlation ID: 22db002d, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi08-Local. Setting status to Non Responsive.
2016-09-26 20:36:03,853 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-44) [75431892] Correlation ID: 75431892, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi02-Local. Setting status to Non Responsive.
2016-09-26 20:36:03,871 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-42) [4a281328] Correlation ID: 4a281328, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi07-Local. Setting status to Non Responsive.
2016-09-26 20:36:03,874 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [7734731c] Correlation ID: 7734731c, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center sepia. Setting status to Non Responsive.

2016-09-26 20:36:08,732 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-67) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM RHEV-Manager is down with error. Exit message: Failed to find the libvirt domain.
2016-09-26 20:36:08,735 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-67) [] VM '6406669d-7df3-4719-a3ae-bf82f523ff03(RHEV-Manager) is running in db and not running in VDS 'hv02'
2016-09-26 20:36:08,744 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM pcp has been paused due to unknown storage error.
2016-09-26 20:36:08,746 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-68) [] VM 'da38e6f8-9ccb-4ce0-95f0-afc98a171386'(teuthology) moved from 'Up' --> 'Paused'
2016-09-26 20:36:08,787 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Failed during vms monitoring on host hv02 error is: java.lang.NullPointerException
2016-09-26 20:36:08,788 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Exception:: java.lang.NullPointerException
        at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:99) [dal.jar:]
        at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:27) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:110) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:107) [dal.jar:]
        at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:73) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:61) [dal.jar:]
        at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:103) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.saveVmsToDb(VmsMonitoring.java:374) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.refreshVmStats(VmsMonitoring.java:243) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VmsMonitoring.perform(VmsMonitoring.java:165) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:36) [vdsbroker.jar:]
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_111]
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_111]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2016-09-26 20:36:08,825 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused.
2016-09-26 20:36:08,857 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused due to unknown storage error.

Actions #1

Updated by David Galloway over 7 years ago

Worth noting:
  • PowerDNS running on store01 also had trouble accessing its own MySQL DB around the same time.
  • The RHEV Manager VM also failed and restarted itself. This VM's disk is served over NFS from store01 (not gluster).

My theory is either network problems or problems accessing the storage/disks/RAID on store01.
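One way to test that theory is to pull the error events from each host into a single UTC timeline and see which host reports first: if store01 (or its clients) leads the cascade, storage is the likelier culprit. A rough sketch; the sample events are transcribed from the logs above, and the correlation window is an arbitrary choice:

```python
from datetime import datetime, timedelta

def leading_events(events, window=timedelta(minutes=1)):
    """Given (host, utc_timestamp, message) tuples, return the events that
    fall within `window` of the earliest one, ordered by time. The first
    host listed is the most likely origin of the cascade."""
    events = sorted(events, key=lambda e: e[1])
    if not events:
        return []
    start = events[0][1]
    return [e for e in events if e[1] - start <= window]

# Timestamps normalized to UTC from the excerpts above.
timeline = [
    ("mgr01",   datetime(2016, 9, 27, 0, 36, 8), "VM pcp paused: unknown storage error"),
    ("store01", datetime(2016, 9, 27, 0, 27, 1), "pdns lost connection to MySQL"),
    ("hv01",    datetime(2016, 9, 27, 0, 27, 0), "BrokerLink connection timed out"),
]
for host, ts, msg in leading_events(timeline):
    print(ts, host, msg)
```

Here hv01's broker timeout and store01's MySQL failure are within a second of each other, roughly nine minutes before the engine starts pausing VMs, which is consistent with a shared network or store01-side storage fault rather than an engine-side one.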

Actions #2

Updated by David Galloway over 7 years ago

Red Hat IT suspects switch instability caused by soft parity errors and believes a software update may fix it. The switch software update will be performed on Tuesday at 11:00 AM.

Actions #3

Updated by David Galloway over 7 years ago

This happened again.

===== hv01:/var/log/vdsm.log =====

Thread-561102::ERROR::2016-10-05 20:17:05,346::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
Thread-561102::ERROR::2016-10-05 20:17:05,347::API::1876::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1856, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 105, in get_all_stats
    stats = broker.get_stats_from_storage(service)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 232, in get_stats_from_storage
    result = self._checked_communicate(request)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 248, in _checked_communicate
    response = self._communicate(request)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 277, in _communicate
    isTimed=True)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 89, in socket_readline
    raise DisconnectionError("Connection timed out")
DisconnectionError: Connection timed out

...

periodic/42::WARNING::2016-10-05 20:17:10,305::periodic::268::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'b4a645a0-b444-49c1-8642-2f8f3c6beb95', u'632b8445-2d6f-4b70-8961-66b88c1e8abc', u'da38e6f8-9ccb-4ce0-95f0-afc98a171386']
periodic/42::WARNING::2016-10-05 20:17:10,306::periodic::268::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'b4a645a0-b444-49c1-8642-2f8f3c6beb95', u'632b8445-2d6f-4b70-8961-66b88c1e8abc', u'da38e6f8-9ccb-4ce0-95f0-afc98a171386']
periodic/42::WARNING::2016-10-05 20:17:10,308::periodic::268::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'b4a645a0-b444-49c1-8642-2f8f3c6beb95', u'632b8445-2d6f-4b70-8961-66b88c1e8abc', u'da38e6f8-9ccb-4ce0-95f0-afc98a171386']
Thread-561110::ERROR::2016-10-05 20:17:11,204::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
Thread-561110::ERROR::2016-10-05 20:17:11,204::API::1876::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info
Traceback (most recent call last):
  File "/usr/share/vdsm/API.py", line 1856, in _getHaInfo
    stats = instance.get_all_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats
    self._configure_broker_conn(broker)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn
    dom_type=dom_type)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain
    .format(sd_type, options, e))
RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': 'ca46f740-bc09-477f-b09e-cf7e746dca73'}: Connection timed out

===== mgr01:/var/log/ovirt-engine/engine.log =====

2016-10-05 20:17:26,307 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "607926c1-a3e3-42c5-ab52-adc5fae80d24" 
2016-10-05 20:17:26,322 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-36) [41f2754e] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv01 command failed: Heartbeat exeeded
2016-10-05 20:17:26,323 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-36) [41f2754e] Command 'GetStatsVDSCommand(HostName = hv01, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='a8f312e9-1582-4914-aed8-14e614c7af7e', vds='Host[hv01,a8f312e9-1582-4914-aed8-14e614c7af7e]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-05 20:17:26,323 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-36) [41f2754e]  Failed getting vds stats,  vds='hv01'(a8f312e9-1582-4914-aed8-14e614c7af7e): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-05 20:17:26,323 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-36) [41f2754e] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-05 20:17:26,323 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-36) [41f2754e] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_111]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2016-10-05 20:17:26,324 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-36) [41f2754e] Failed to refresh VDS, network error, continuing, vds='hv01'(a8f312e9-1582-4914-aed8-14e614c7af7e): VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-05 20:17:26,328 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-2) [41f2754e] Host 'hv01' is not responding. It will stay in Connecting state for a grace period of 63 seconds and after that an attempt to fence the host will be issued.
2016-10-05 20:17:26,328 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-97) [71b1cc69] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv02 command failed: Heartbeat exeeded
2016-10-05 20:17:26,328 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-97) [71b1cc69] Command 'GetStatsVDSCommand(HostName = hv02, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', vds='Host[hv02,7c70df22-d55f-4f7e-beee-9dc4bfe0b70d]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-05 20:17:26,328 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-97) [71b1cc69]  Failed getting vds stats,  vds='hv02'(7c70df22-d55f-4f7e-beee-9dc4bfe0b70d): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-05 20:17:26,328 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-97) [71b1cc69] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-05 20:17:26,328 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-97) [71b1cc69] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_111]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2016-10-05 20:17:26,329 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-97) [71b1cc69] Failed to refresh VDS, network error, continuing, vds='hv02'(7c70df22-d55f-4f7e-beee-9dc4bfe0b70d): VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-05 20:17:26,336 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-19) [71b1cc69] Host 'hv02' is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued.
2016-10-05 20:17:26,338 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv02 command failed: Heartbeat exeeded
2016-10-05 20:17:26,338 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-60) [] Command 'SpmStatusVDSCommand(HostName = hv02, SpmStatusVDSCommandParameters:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', storagePoolId='28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-05 20:17:26,346 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-60) [795c08f9] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: 28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6 Type: StoragePool
2016-10-05 20:17:26,355 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-60) [795c08f9] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4' status from 'Active' to 'Unknown', reason: null
2016-10-05 20:17:26,356 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-60) [795c08f9] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700' status from 'Active' to 'Unknown', reason: null
2016-10-05 20:17:26,357 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-60) [795c08f9] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain 'ca46f740-bc09-477f-b09e-cf7e746dca73' status from 'Active' to 'Unknown', reason: null
2016-10-05 20:17:26,382 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-19) [71b1cc69] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hv02 is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued.
2016-10-05 20:17:26,400 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-2) [41f2754e] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hv01 is not responding. It will stay in Connecting state for a grace period of 63 seconds and after that an attempt to fence the host will be issued.
2016-10-05 20:17:26,447 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) [795c08f9] Correlation ID: 795c08f9, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center sepia. Setting Data Center status to Non Responsive (On host hv02, Error: Network error during communication with the Host.).
2016-10-05 20:17:26,511 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to hv02.front.sepia.ceph.com/172.21.0.6
2016-10-05 20:17:26,522 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "6f505595-9808-4626-a825-c6e0445af015" 

...

2016-10-05 20:18:54,588 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-56) [1ade0205] VM 'ed4d86bb-588a-44f1-bf23-82344a936615'(pcp) moved from 'Up' --> 'NotResponding'
2016-10-05 20:18:54,667 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-51) [73d5354f] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv02 command failed: Heartbeat exeeded
2016-10-05 20:18:54,667 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-51) [73d5354f] Command 'SpmStatusVDSCommand(HostName = hv02, SpmStatusVDSCommandParameters:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', storagePoolId='28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-05 20:18:54,672 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-1) [73d5354f] Host 'hv02' is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued.
2016-10-05 20:18:54,686 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-51) [73e99a83] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: 28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6 Type: StoragePool
2016-10-05 20:18:54,688 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-51) [73e99a83] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4' status from 'Active' to 'Unknown', reason: null
2016-10-05 20:18:54,688 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-51) [73e99a83] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700' status from 'Active' to 'Unknown', reason: null
2016-10-05 20:18:54,689 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-51) [73e99a83] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain 'ca46f740-bc09-477f-b09e-cf7e746dca73' status from 'Active' to 'Unknown', reason: null
2016-10-05 20:18:54,713 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-1) [73d5354f] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hv02 is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued.
2016-10-05 20:18:54,727 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-92) [] VM '6406669d-7df3-4719-a3ae-bf82f523ff03'(RHEV-Manager) moved from 'Up' --> 'NotResponding'
2016-10-05 20:18:55,923 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "42e8ccc6-fbcd-4859-a17c-45ec7bd78210" 
2016-10-05 20:18:57,517 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-51) [73e99a83] Correlation ID: 73e99a83, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center sepia. Setting Data Center status to Non Responsive (On host hv02, Error: Network error during communication with the Host.).
2016-10-05 20:18:57,521 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-56) [1ade0205] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM pcp is not responding.
2016-10-05 20:18:57,533 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-92) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM RHEV-Manager is not responding.
2016-10-05 20:18:57,535 INFO  [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-92) [] VM '7704dd14-08ea-4094-904a-79b129e4689f'(cobbler) moved from 'Up' --> 'NotResponding'
2016-10-05 20:18:57,598 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to hv02.front.sepia.ceph.com/172.21.0.6
2016-10-05 20:18:57,833 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-65) [1af12bba] START, GetHardwareInfoVDSCommand(HostName = hv02, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', vds='Host[hv02,7c70df22-d55f-4f7e-beee-9dc4bfe0b70d]'}), log id: 29040941
2016-10-05 20:18:57,839 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-65) [1af12bba] FINISH, GetHardwareInfoVDSCommand, log id: 29040941
2016-10-05 20:18:58,014 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-92) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM cobbler is not responding.

Found this in Nagios:
Service Warning[2016-10-06 00:11:08] SERVICE ALERT: store01;Current Load;WARNING;SOFT;1;WARNING - load average: 1.70, 0.90, 0.41

As a matter of fact, each time this has happened, the load on store01 has spiked.

EDIT: I've installed atop and have sar running on store01 and teuthology. atop should at least tell us which process is causing the spike.

Actions #4

Updated by David Galloway over 7 years ago

Still happening almost daily. I'm beginning to think this is a network issue again.

2016-10-07 20:20:45,397 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-77) [4c58cfdb] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv02 command failed: Heartbeat exeeded
2016-10-07 20:20:45,397 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-77) [4c58cfdb] Command 'GetStatsVDSCommand(HostName = hv02, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', vds='Host[hv02,7c70df22-d55f-4f7e-beee-9dc4bfe0b70d]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 20:20:45,397 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-77) [4c58cfdb]  Failed getting vds stats,  vds='hv02'(7c70df22-d55f-4f7e-beee-9dc4bfe0b70d): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 20:20:45,397 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-77) [4c58cfdb] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-12) [50e3a2fb] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv01 command failed: Heartbeat exeeded
2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-12) [50e3a2fb] Command 'GetStatsVDSCommand(HostName = hv01, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='a8f312e9-1582-4914-aed8-14e614c7af7e', vds='Host[hv01,a8f312e9-1582-4914-aed8-14e614c7af7e]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-12) [50e3a2fb]  Failed getting vds stats,  vds='hv01'(a8f312e9-1582-4914-aed8-14e614c7af7e): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-12) [50e3a2fb] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-37) [768d56b2] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM irvingi04 command failed: Heartbeat exeeded
2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-37) [768d56b2] Command 'SpmStatusVDSCommand(HostName = irvingi04, SpmStatusVDSCommandParameters:{runAsync='true', hostId='faf8082a-3289-4cd4-bb34-d78d60bde960', storagePoolId='a0d40a8c-6aa7-418b-9714-9bc30aa65433'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 20:20:45,401 INFO  [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-37) [175d4b72] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: a0d40a8c-6aa7-418b-9714-9bc30aa65433 Type: StoragePool
2016-10-07 20:20:45,402 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-37) [175d4b72] Storage Pool 'a0d40a8c-6aa7-418b-9714-9bc30aa65433' - Updating Storage Domain 'b3391eea-7067-4cdf-93b8-8b3fabd4fff5' status from 'Active' to 'Unknown', reason: null
2016-10-07 20:20:45,397 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-77) [4c58cfdb] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_111]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2016-10-07 20:20:45,535 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-77) [4c58cfdb] Failed to refresh VDS, network error, continuing, vds='hv02'(7c70df22-d55f-4f7e-beee-9dc4bfe0b70d): VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-12) [50e3a2fb] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded
        at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:]
        at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:]
        at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:]
        at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:]
        at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_111]
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111]
        at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111]
        at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:]
        at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:]
        at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:]
        at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:]

2016-10-07 20:20:45,535 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-25) [4c58cfdb] Host 'hv02' is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued.
2016-10-07 20:20:45,535 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-12) [50e3a2fb] Failed to refresh VDS, network error, continuing, vds='hv01'(a8f312e9-1582-4914-aed8-14e614c7af7e): VDSGenericException: VDSNetworkException: Heartbeat exeeded
2016-10-07 20:20:45,535 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-3) [50e3a2fb] Host 'hv01' is not responding. It will stay in Connecting state for a grace period of 63 seconds and after that an attempt to fence the host will be issued.
2016-10-07 20:20:45,536 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-15) [107ea268] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Error storage pool action: (u'spUUID=28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6',)
2016-10-07 20:20:45,536 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-15) [107ea268] IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: Failed to GetStoragePoolInfoVDS, error = Error storage pool action: (u'spUUID=28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6',), code = 300
2016-10-07 20:20:45,602 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-3) [50e3a2fb] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hv01 is not responding. It will stay in Connecting state for a grace period of 63 seconds and after that an attempt to fence the host will be issued.
2016-10-07 20:20:45,630 WARN  [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-28) [768d56b2] Host 'irvingi04' is not responding.
2016-10-07 20:20:45,641 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-25) [4c58cfdb] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hv02 is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued.
2016-10-07 20:20:45,644 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to hv02.front.sepia.ceph.com/172.21.0.6
2016-10-07 20:20:45,645 WARN  [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Exception thrown during message processing
2016-10-07 20:20:45,718 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-28) [768d56b2] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host irvingi04 is not responding. Host cannot be fenced automatically because power management for the host is disabled.
2016-10-07 20:20:45,719 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-37) [175d4b72] Correlation ID: 175d4b72, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi04-Local. Setting Data Center status to Non Responsive (On host irvingi04, Error: Network error during communication with the Host.).
2016-10-07 20:20:47,768 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler_Worker-15) [107ea268] START, SpmStopVDSCommand(HostName = hv02, SpmStopVDSCommandParameters:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', storagePoolId='28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6'}), log id: 30c66b
2016-10-07 20:20:47,768 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler_Worker-15) [107ea268] SpmStopVDSCommand:: vds 'hv02' is in 'Connecting' status - not performing spm stop, pool id '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6'
2016-10-07 20:20:47,768 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler_Worker-15) [107ea268] FINISH, SpmStopVDSCommand, log id: 30c66b
2016-10-07 20:20:47,768 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-15) [107ea268] IRS failover failed - can't allocate vds server
2016-10-07 20:20:48,548 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to hv01.front.sepia.ceph.com/172.21.0.5
2016-10-07 20:20:48,639 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to irvingi04.front.sepia.ceph.com/172.21.3.231
2016-10-07 20:20:48,751 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-5) [4267870f] START, GetHardwareInfoVDSCommand(HostName = hv02, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', vds='Host[hv02,7c70df22-d55f-4f7e-beee-9dc4bfe0b70d]'}), log id: 1a34a64e
2016-10-07 20:20:48,788 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-5) [4267870f] FINISH, GetHardwareInfoVDSCommand, log id: 1a34a64e
2016-10-07 20:20:48,800 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-99) [3f1b2582] START, GetHardwareInfoVDSCommand(HostName = hv01, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='a8f312e9-1582-4914-aed8-14e614c7af7e', vds='Host[hv01,a8f312e9-1582-4914-aed8-14e614c7af7e]'}), log id: 594e7a8a
2016-10-07 20:20:48,811 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-99) [3f1b2582] FINISH, GetHardwareInfoVDSCommand, log id: 594e7a8a
2016-10-07 20:20:49,144 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (DefaultQuartzScheduler_Worker-5) [34b5a29f] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: 7c70df22-d55f-4f7e-beee-9dc4bfe0b70d Type: VDS
2016-10-07 20:20:49,156 INFO  [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (DefaultQuartzScheduler_Worker-99) [230b35b2] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected :  ID: a8f312e9-1582-4914-aed8-14e614c7af7e Type: VDS

Actions #5

Updated by David Galloway over 7 years ago

This all started back around Sep12 when I Force Removed the irvingi02 export domain from RHEV.

I increased the vdsHeartbeat value on mgr01 (RHEV Manager) from 30sec to 60sec.
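For reference, on RHEV/oVirt this kind of engine setting is normally changed with engine-config on the Manager and then an engine restart; the exact key name below is my assumption, not taken from the ticket:

```shell
# Hypothetical sketch: raise the VDSM heartbeat timeout from 30s to 60s
# (key name vdsHeartbeatInSeconds assumed) and restart the engine to apply it.
engine-config -s vdsHeartbeatInSeconds=60
systemctl restart ovirt-engine
```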

Actions #6

Updated by David Galloway over 7 years ago

I've got a ping sequence running from mgr01 -> hv01 and from hv01 -> store01 in screen sessions on mgr01 and hv01. Output is being logged to /root/pings/out.log
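To turn those logs into something actionable, a small script can scan the output for gaps in icmp_seq, which would pinpoint when packets were dropped. This is a sketch, assuming the log contains iputils `ping -D` lines (timestamp in brackets followed by the usual reply line):

```python
import re

# Matches iputils `ping -D` output, e.g.:
# [1475875245.123456] 64 bytes from 172.21.0.5: icmp_seq=5 ttl=64 time=0.21 ms
LINE_RE = re.compile(r"\[(?P<ts>\d+\.\d+)\].*icmp_seq=(?P<seq>\d+)")

def find_gaps(lines):
    """Return (prev_ts, ts, missing_count) for every icmp_seq jump > 1."""
    gaps = []
    prev_ts = prev_seq = None
    for line in lines:
        m = LINE_RE.search(line)
        if not m:
            continue  # skip noise (headers, DUP!, unreachable messages)
        ts, seq = float(m.group("ts")), int(m.group("seq"))
        if prev_seq is not None and seq - prev_seq > 1:
            gaps.append((prev_ts, ts, seq - prev_seq - 1))
        prev_ts, prev_seq = ts, seq
    return gaps
```

Running `find_gaps(open("/root/pings/out.log"))` would list the outage windows, which can then be correlated against the engine.log timestamps above.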

Actions #7

Updated by David Galloway over 7 years ago

  • Priority changed from Urgent to Normal

The daily RHEV VM pausing was resolved by:

gluster volume set shardvol1 performance.strict-o-direct on
gluster volume set shardvol1 network.remote-dio off
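A quick sanity check after applying the two options is to read them back; this sketch assumes `gluster volume get` is available on the Gluster version in use here:

```shell
# Confirm the two options took effect on the volume.
gluster volume get shardvol1 performance.strict-o-direct
gluster volume get shardvol1 network.remote-dio
```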

I also added an arbiter brick to the cluster to give us full HA if one of the Gluster nodes went down. We're experiencing a severe decrease in VM disk I/O speed as a result of the issue addressed by http://review.gluster.org/#/c/15641/. We haven't yet decided whether to remove the arbiter brick, install nightly packages that include the fix, or wait for the fix to make it upstream. The earliest it might make it into the next release would be around this time in November.

Actions #8

Updated by David Galloway over 7 years ago

We're still limping along with the arbiter brick in place. We had to disable PCP reporting for now, though. The PCP VM's load would spike to 300+ and sit in iowait without actually writing anything.

The next point release should come out in the next few weeks. I'll evaluate installing that release if it includes the patch for the Gluster bug.

Actions #9

Updated by David Galloway over 7 years ago

Upgraded to 3.8.6, which was supposed to include the arbiter write performance fix.

Turns out the performance hit is actually a result of sharding.

Created https://bugzilla.redhat.com/show_bug.cgi?id=1406547

Actions #10

Updated by David Galloway over 6 years ago

  • Status changed from In Progress to Resolved

All VMs are on NVMe storage using RHGS now.
