Bug #17415
closedRHEV Network or Storage issues
0%
Description
At about 00:27 UTC on 27SEP2016, the RHEV cluster either began having backend storage issues or overall network issues. So far all clues point to store01.front.sepia.ceph.com which is the main gluster backing node serving the RHEV VMs.
Symptoms
- teuthology and pcp VMs were paused
- RHEV logs show an "unknown storage error"
- RHEV Manager VM was restarted (This VM's disk is served over NFS; not gluster)
- Logs from different involved machines below
- CPU Soft lockups in teuthology VM's kernel log indicating either network issues accessing LRC or its own disk (hosted on gluster)
NOTE: Times around 00:20 are UTC. Times around 20:20 are EDT.
===== hv01:/var/log/messages ===== Sep 26 20:27:00 hv01 journal: ovirt-ha-agent ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out Sep 26 20:27:00 hv01 ovirt-ha-agent: ERROR:ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink:Connection closed: Connection timed out Sep 26 20:27:00 hv01 ovirt-ha-agent: WARNING:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Error while monitoring engine: Connection timed out Sep 26 20:27:00 hv01 ovirt-ha-agent: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Sleeping 60 seconds Sep 26 20:27:07 hv01 journal: vdsm ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out Sep 26 20:27:07 hv01 journal: vdsm vds ERROR failed to retrieve Hosted Engine HA info#012Traceback (most recent call last):#012 File "/usr/share/vdsm/API.py", line 1851, in _getHaInfo#012 stats = instance.get_all_stats()#012 File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats#012 self._configure_broker_conn(broker)#012 File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn#012 dom_type=dom_type)#012 File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain#012 .format(sd_type, options, e))#012RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': 'ca46f740-bc09-477f-b09e-cf7e746dca73'}: Connection timed out ===== store01:/var/log/messages ===== Sep 27 00:27:01 old-gw pdns[2958]: Backend reported permanent error which prevented lookup (GSQLBackend lookup query:Failed to execute mysql_query, perhaps connection died? Err=1: Lost connection to MySQL server during query), aborting ===== store01:/var/log/glusterfs/bricks/srv-gluster-shardbrick1.log ===== [2016-09-26 03:27:08.778199] I [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing [2016-09-27 00:28:08.678750] W [socket.c:696:__socket_rwv] 0-tcp.shardvol1-server: writev on 172.21.0.5:49148 failed (Broken pipe) [2016-09-27 00:28:15.872485] I [socket.c:2468:socket_event_handler] 0-transport: disconnecting now [2016-09-27 00:28:17.803641] I [MSGID: 115036] [server.c:548:server_rpc_notify] 0-shardvol1-server: disconnecting connection from hv01.front.sepia.ceph.com-15267-2016/08/04-00:38:10:236127-shardvol1-client-0-0-0 [2016-09-27 00:28:30.945155] I [socket.c:3498:socket_submit_reply] 0-tcp.shardvol1-server: not connected (priv->connected = -1) ... Repeated ... [2016-09-27 00:28:44.611277] E [rpcsvc.c:1310:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x255417e4, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.shardvol1-server) [2016-09-27 00:28:44.611286] E [server.c:200:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.0/xlator/debug/io-stats.so(+0x1814e) [0x7f8396b4514e] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x1a2e9) [0x7f83966e72e9] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x8fa6) [0x7f83966d5fa6] ) 0-: Reply submission failed [2016-09-27 00:28:44.611331] E [server.c:200:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.0/xlator/debug/io-stats.so(+0x1814e) [0x7f8396b4514e] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x1a2e9) [0x7f83966e72e9] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x8fa6) [0x7f83966d5fa6] ) 0-: Reply submission failed ===== mgr01:/var/log/ovirt-engine/engine.log ===== 2016-09-26 20:35:58,182 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-6-thread-16) [1c630ccf] domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4:iso_storage' in problem. vds: 'hv02' 2016-09-26 20:35:58,184 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-6-thread-16) [1c630ccf] domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700:shardvol1' in problem. vds: 'hv02' 2016-09-26 20:36:03,468 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-42) [4a281328] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: d6b1d38b-6f1f-4438-b04f-53619241a1ef Type: StoragePool 2016-09-26 20:36:03,480 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-43) [22db002d] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: d3ec2fae-de85-41d2-8cf4-b09c3fd77ca3 Type: StoragePool 2016-09-26 20:36:03,481 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-44) [75431892] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 71a72320-daa6-4c1c-a9ee-c809a8f0ad9f Type: StoragePool 2016-09-26 20:36:03,510 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-42) [4a281328] Storage Pool 'd6b1d38b-6f1f-4438-b04f-53619241a1ef' - Updating Storage Domain '14b9a2fc-838c-46d4-8ba6-52f8d815f19e' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,511 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-43) [22db002d] Storage Pool 'd3ec2fae-de85-41d2-8cf4-b09c3fd77ca3' - Updating Storage Domain '1ecdcdf2-4301-4bba-a17c-270a85f11ade' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,523 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-44) [75431892] Storage Pool '71a72320-daa6-4c1c-a9ee-c809a8f0ad9f' - Updating Storage Domain 'b43d2eef-5cfa-4290-9e7b-f6ebc4e9c891' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,535 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-41) [7734731c] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6 Type: StoragePool 2016-09-26 20:36:03,583 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,596 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,598 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain 'ca46f740-bc09-477f-b09e-cf7e746dca73' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,832 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-43) [22db002d] Correlation ID: 22db002d, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi08-Local. Setting status to Non Responsive. 2016-09-26 20:36:03,853 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-44) [75431892] Correlation ID: 75431892, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi02-Local. Setting status to Non Responsive. 2016-09-26 20:36:03,871 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-42) [4a281328] Correlation ID: 4a281328, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi07-Local. Setting status to Non Responsive. 2016-09-26 20:36:03,874 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [7734731c] Correlation ID: 7734731c, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center sepia. Setting status to Non Responsive. 2016-09-26 20:36:08,732 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-67) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM RHEV-Manager is down with error. Exit message: Failed to find the libvirt domain. 2016-09-26 20:36:08,735 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-67) [] VM '6406669d-7df3-4719-a3ae-bf82f523ff03(RHEV-Manager) is running in db and not running in VDS 'hv02' 2016-09-26 20:36:08,744 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM pcp has been paused due to unknown storage error. 2016-09-26 20:36:08,746 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-68) [] VM 'da38e6f8-9ccb-4ce0-95f0-afc98a171386'(teuthology) moved from 'Up' --> 'Paused' 2016-09-26 20:36:08,787 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Failed during vms monitoring on host hv02 error is: java.lang.NullPointerException 2016-09-26 20:36:08,788 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Exception:: java.lang.NullPointerException at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:99) [dal.jar:] at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:27) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:110) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:107) [dal.jar:] at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:73) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:61) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:103) [dal.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.saveVmsToDb(VmsMonitoring.java:374) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.refreshVmStats(VmsMonitoring.java:243) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.perform(VmsMonitoring.java:165) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:36) [vdsbroker.jar:] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_111] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_111] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111] at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:] at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:] at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:] 2016-09-26 20:36:08,825 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused. 2016-09-26 20:36:08,857 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused due to unknown storage error. 2016-09-26 20:36:08,732 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-67) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM RHEV-Manager is down with error. Exit message: Failed to find the libvirt domain. 2016-09-26 20:36:08,735 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-67) [] VM '6406669d-7df3-4719-a3ae-bf82f523ff03(RHEV-Manager) is running in db and not running in VDS 'hv02' 2016-09-26 20:36:08,744 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM pcp has been paused due to unknown storage error. 2016-09-26 20:36:08,746 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-68) [] VM 'da38e6f8-9ccb-4ce0-95f0-afc98a171386'(teuthology) moved from 'Up' --> 'Paused' 2016-09-26 20:36:08,787 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Failed during vms monitoring on host hv02 error is: java.lang.NullPointerException 2016-09-26 20:36:08,788 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Exception:: java.lang.NullPointerException at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:99) [dal.jar:] at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:27) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:110) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:107) [dal.jar:] at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:73) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:61) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:103) [dal.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.saveVmsToDb(VmsMonitoring.java:374) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.refreshVmStats(VmsMonitoring.java:243) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.perform(VmsMonitoring.java:165) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:36) [vdsbroker.jar:] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_111] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_111] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111] at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:] at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:] at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:] 2016-09-26 20:36:08,825 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused. 2016-09-26 20:36:08,857 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused due to unknown storage error.
Updated by David Galloway over 7 years ago
- PowerDNS running on store01 also had trouble accessing its own mysql db around the same time.
- The RHEV Manager VM also failed and restarted itself. This VM's disk is served over NFS from store01 (not Gluster)
My theory is either network problems or problems accessing the storage/disks/RAID on store01.
Updated by David Galloway over 7 years ago
Red Hat IT suspects a software update may fix switch instability caused by soft parity errors. The switch software update will be performed on Tuesday at 11AM.
Updated by David Galloway over 7 years ago
This happened again.
hv01:/var/log/vdsm.log
Thread-561102::ERROR::2016-10-05 20:17:05,346::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out Thread-561102::ERROR::2016-10-05 20:17:05,347::API::1876::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info Traceback (most recent call last): File "/usr/share/vdsm/API.py", line 1856, in _getHaInfo stats = instance.get_all_stats() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 105, in get_all_stats stats = broker.get_stats_from_storage(service) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 232, in get_stats_from_storage result = self._checked_communicate(request) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 248, in _checked_communicate response = self._communicate(request) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 277, in _communicate isTimed=True) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 89, in socket_readline raise DisconnectionError("Connection timed out") DisconnectionError: Connection timed out ... periodic/42::WARNING::2016-10-05 20:17:10,305::periodic::268::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'b4a645a0-b444-49c1-8642-2f8f3c6beb95', u'632b8445-2d6f-4b70-8961-66b88c1e8abc', u'da38e6f8-9ccb-4ce0-95f0-afc98a171386'] periodic/42::WARNING::2016-10-05 20:17:10,306::periodic::268::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'b4a645a0-b444-49c1-8642-2f8f3c6beb95', u'632b8445-2d6f-4b70-8961-66b88c1e8abc', u'da38e6f8-9ccb-4ce0-95f0-afc98a171386'] periodic/42::WARNING::2016-10-05 20:17:10,308::periodic::268::virt.periodic.VmDispatcher::(__call__) could not run <class 'virt.periodic.DriveWatermarkMonitor'> on [u'b4a645a0-b444-49c1-8642-2f8f3c6beb95', u'632b8445-2d6f-4b70-8961-66b88c1e8abc', u'da38e6f8-9ccb-4ce0-95f0-afc98a171386'] Thread-561110::ERROR::2016-10-05 20:17:11,204::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out Thread-561110::ERROR::2016-10-05 20:17:11,204::API::1876::vds::(_getHaInfo) failed to retrieve Hosted Engine HA info Traceback (most recent call last): File "/usr/share/vdsm/API.py", line 1856, in _getHaInfo stats = instance.get_all_stats() File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats self._configure_broker_conn(broker) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn dom_type=dom_type) File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain .format(sd_type, options, e)) RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': 'ca46f740-bc09-477f-b09e-cf7e746dca73'}: Connection timed out
mgr01:/var/log/ovirt-engine/engine.log
2016-10-05 20:17:26,307 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "607926c1-a3e3-42c5-ab52-adc5fae80d24" 2016-10-05 20:17:26,322 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-36) [41f2754e] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv01 command failed: Heartbeat exeeded 2016-10-05 20:17:26,323 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-36) [41f2754e] Command 'GetStatsVDSCommand(HostName = hv01, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='a8f312e9-1582-4914-aed8-14e614c7af7e', vds='Host[hv01,a8f312e9-1582-4914-aed8-14e614c7af7e]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-05 20:17:26,323 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-36) [41f2754e] Failed getting vds stats, vds='hv01'(a8f312e9-1582-4914-aed8-14e614c7af7e): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-05 20:17:26,323 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-36) [41f2754e] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-05 20:17:26,323 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-36) [41f2754e] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:] at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:] at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:] at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_111] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111] at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:] at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:] at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:] 2016-10-05 20:17:26,324 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-36) [41f2754e] Failed to refresh VDS, network error, continuing, vds='hv01'(a8f312e9-1582-4914-aed8-14e614c7af7e): VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-05 20:17:26,328 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-2) [41f2754e] Host 'hv01' is not responding. It will stay in Connecting state for a grace period of 63 seconds and after that an attempt to fence the host will be issued. 2016-10-05 20:17:26,328 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-97) [71b1cc69] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv02 command failed: Heartbeat exeeded 2016-10-05 20:17:26,328 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-97) [71b1cc69] Command 'GetStatsVDSCommand(HostName = hv02, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', vds='Host[hv02,7c70df22-d55f-4f7e-beee-9dc4bfe0b70d]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-05 20:17:26,328 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-97) [71b1cc69] Failed getting vds stats, vds='hv02'(7c70df22-d55f-4f7e-beee-9dc4bfe0b70d): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-05 20:17:26,328 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-97) [71b1cc69] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-05 20:17:26,328 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-97) [71b1cc69] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:] at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:] at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:] at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_111] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111] at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:] at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:] at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:] 2016-10-05 20:17:26,329 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-97) [71b1cc69] Failed to refresh VDS, network error, continuing, vds='hv02'(7c70df22-d55f-4f7e-beee-9dc4bfe0b70d): VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-05 20:17:26,336 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-19) [71b1cc69] Host 'hv02' is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued. 2016-10-05 20:17:26,338 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv02 command failed: Heartbeat exeeded 2016-10-05 20:17:26,338 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-60) [] Command 'SpmStatusVDSCommand(HostName = hv02, SpmStatusVDSCommandParameters:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', storagePoolId='28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-05 20:17:26,346 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-60) [795c08f9] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6 Type: StoragePool 2016-10-05 20:17:26,355 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-60) [795c08f9] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4' status from 'Active' to 'Unknown', reason: null 2016-10-05 20:17:26,356 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-60) [795c08f9] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700' status from 'Active' to 'Unknown', reason: null 2016-10-05 20:17:26,357 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-60) [795c08f9] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain 'ca46f740-bc09-477f-b09e-cf7e746dca73' status from 'Active' to 'Unknown', reason: null 2016-10-05 20:17:26,382 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-19) [71b1cc69] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hv02 is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued. 2016-10-05 20:17:26,400 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-2) [41f2754e] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hv01 is not responding. It will stay in Connecting state for a grace period of 63 seconds and after that an attempt to fence the host will be issued. 2016-10-05 20:17:26,447 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-60) [795c08f9] Correlation ID: 795c08f9, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center sepia. Setting Data Center status to Non Responsive (On host hv02, Error: Network error during communication with the Host.). 2016-10-05 20:17:26,511 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to hv02.front.sepia.ceph.com/172.21.0.6 2016-10-05 20:17:26,522 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "6f505595-9808-4626-a825-c6e0445af015" ... 2016-10-05 20:18:54,588 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-56) [1ade0205] VM 'ed4d86bb-588a-44f1-bf23-82344a936615'(pcp) moved from 'Up' --> 'NotResponding' 2016-10-05 20:18:54,667 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-51) [73d5354f] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv02 command failed: Heartbeat exeeded 2016-10-05 20:18:54,667 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-51) [73d5354f] Command 'SpmStatusVDSCommand(HostName = hv02, SpmStatusVDSCommandParameters:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', storagePoolId='28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-05 20:18:54,672 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-1) [73d5354f] Host 'hv02' is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued. 2016-10-05 20:18:54,686 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-51) [73e99a83] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6 Type: StoragePool 2016-10-05 20:18:54,688 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-51) [73e99a83] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4' status from 'Active' to 'Unknown', reason: null 2016-10-05 20:18:54,688 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-51) [73e99a83] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700' status from 'Active' to 'Unknown', reason: null 2016-10-05 20:18:54,689 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-51) [73e99a83] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain 'ca46f740-bc09-477f-b09e-cf7e746dca73' status from 'Active' to 'Unknown', reason: null 2016-10-05 20:18:54,713 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-1) [73d5354f] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hv02 is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued. 2016-10-05 20:18:54,727 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-92) [] VM '6406669d-7df3-4719-a3ae-bf82f523ff03'(RHEV-Manager) moved from 'Up' --> 'NotResponding' 2016-10-05 20:18:55,923 ERROR [org.ovirt.vdsm.jsonrpc.client.JsonRpcClient] (ResponseWorker) [] Not able to update response for "42e8ccc6-fbcd-4859-a17c-45ec7bd78210" 2016-10-05 20:18:57,517 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-51) [73e99a83] Correlation ID: 73e99a83, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center sepia. Setting Data Center status to Non Responsive (On host hv02, Error: Network error during communication with the Host.). 2016-10-05 20:18:57,521 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-56) [1ade0205] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM pcp is not responding. 2016-10-05 20:18:57,533 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-92) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM RHEV-Manager is not responding. 2016-10-05 20:18:57,535 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-92) [] VM '7704dd14-08ea-4094-904a-79b129e4689f'(cobbler) moved from 'Up' --> 'NotResponding' 2016-10-05 20:18:57,598 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to hv02.front.sepia.ceph.com/172.21.0.6 2016-10-05 20:18:57,833 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-65) [1af12bba] START, GetHardwareInfoVDSCommand(HostName = hv02, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', vds='Host[hv02,7c70df22-d55f-4f7e-beee-9dc4bfe0b70d]'}), log id: 29040941 2016-10-05 20:18:57,839 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-65) [1af12bba] FINISH, GetHardwareInfoVDSCommand, log id: 29040941 2016-10-05 20:18:58,014 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-92) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM cobbler is not responding.
Found this in nagios
Service Warning[2016-10-06 00:11:08] SERVICE ALERT: store01;Current Load;WARNING;SOFT;1;WARNING - load average: 1.70, 0.90, 0.41
As a matter of fact, each time this has happened, the load on store01 has spiked.
EDIT: I've installed atop and have sar running on store01 and teuthology. atop should tell us what process is causing the spike at least
Updated by David Galloway over 7 years ago
Still happening almost daily. I'm beginning to think this is a network issue again.
2016-10-07 20:20:45,397 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-77) [4c58cfdb] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv02 command failed: Heartbeat exeeded 2016-10-07 20:20:45,397 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-77) [4c58cfdb] Command 'GetStatsVDSCommand(HostName = hv02, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', vds='Host[hv02,7c70df22-d55f-4f7e-beee-9dc4bfe0b70d]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-07 20:20:45,397 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-77) [4c58cfdb] Failed getting vds stats, vds='hv02'(7c70df22-d55f-4f7e-beee-9dc4bfe0b70d): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-07 20:20:45,397 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-77) [4c58cfdb] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-12) [50e3a2fb] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM hv01 command failed: Heartbeat exeeded 2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler_Worker-12) [50e3a2fb] Command 'GetStatsVDSCommand(HostName = hv01, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='a8f312e9-1582-4914-aed8-14e614c7af7e', vds='Host[hv01,a8f312e9-1582-4914-aed8-14e614c7af7e]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-12) [50e3a2fb] Failed getting vds stats, vds='hv01'(a8f312e9-1582-4914-aed8-14e614c7af7e): org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-12) [50e3a2fb] Failure to refresh Vds runtime info: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-37) [768d56b2] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM irvingi04 command failed: Heartbeat exeeded 2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStatusVDSCommand] (DefaultQuartzScheduler_Worker-37) [768d56b2] Command 'SpmStatusVDSCommand(HostName = irvingi04, SpmStatusVDSCommandParameters:{runAsync='true', hostId='faf8082a-3289-4cd4-bb34-d78d60bde960', storagePoolId='a0d40a8c-6aa7-418b-9714-9bc30aa65433'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-07 20:20:45,401 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-37) [175d4b72] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: a0d40a8c-6aa7-418b-9714-9bc30aa65433 Type: StoragePool 2016-10-07 20:20:45,402 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-37) [175d4b72] Storage Pool 'a0d40a8c-6aa7-418b-9714-9bc30aa65433' - Updating Storage Domain 'b3391eea-7067-4cdf-93b8-8b3fabd4fff5' status from 'Active' to 'Unknown', reason: null 2016-10-07 20:20:45,397 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-77) [4c58cfdb] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:] at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:] at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:] at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_111] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111] at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:] at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:] at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:] 2016-10-07 20:20:45,535 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-77) [4c58cfdb] Failed to refresh VDS, network error, continuing, vds='hv02'(7c70df22-d55f-4f7e-beee-9dc4bfe0b70d): VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-07 20:20:45,398 ERROR [org.ovirt.engine.core.vdsbroker.HostMonitoring] (DefaultQuartzScheduler_Worker-12) [50e3a2fb] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exeeded at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand.executeVdsBrokerCommand(GetStatsVDSCommand.java:21) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.vdsbroker.VdsBrokerCommand.executeVDSCommand(VdsBrokerCommand.java:110) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VDSCommandBase.executeCommand(VDSCommandBase.java:65) [vdsbroker.jar:] at org.ovirt.engine.core.dal.VdcCommandBase.execute(VdcCommandBase.java:33) [dal.jar:] at org.ovirt.engine.core.vdsbroker.ResourceManager.runVdsCommand(ResourceManager.java:467) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsStats(HostMonitoring.java:472) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refreshVdsRunTimeInfo(HostMonitoring.java:114) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.HostMonitoring.refresh(HostMonitoring.java:84) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VdsManager.onTimer(VdsManager.java:227) [vdsbroker.jar:] at sun.reflect.GeneratedMethodAccessor11.invoke(Unknown Source) [:1.7.0_111] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111] at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:] at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:] at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:] 2016-10-07 20:20:45,535 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-25) [4c58cfdb] Host 'hv02' is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued. 2016-10-07 20:20:45,535 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (DefaultQuartzScheduler_Worker-12) [50e3a2fb] Failed to refresh VDS, network error, continuing, vds='hv01'(a8f312e9-1582-4914-aed8-14e614c7af7e): VDSGenericException: VDSNetworkException: Heartbeat exeeded 2016-10-07 20:20:45,535 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-3) [50e3a2fb] Host 'hv01' is not responding. It will stay in Connecting state for a grace period of 63 seconds and after that an attempt to fence the host will be issued. 2016-10-07 20:20:45,536 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-15) [107ea268] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VDSM command failed: Error storage pool action: (u'spUUID=28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6',) 2016-10-07 20:20:45,536 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsBrokerCommand] (DefaultQuartzScheduler_Worker-15) [107ea268] IrsBroker::Failed::GetStoragePoolInfoVDS: IRSGenericException: IRSErrorException: Failed to GetStoragePoolInfoVDS, error = Error storage pool action: (u'spUUID=28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6',), code = 300 2016-10-07 20:20:45,602 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-3) [50e3a2fb] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hv01 is not responding. It will stay in Connecting state for a grace period of 63 seconds and after that an attempt to fence the host will be issued. 2016-10-07 20:20:45,630 WARN [org.ovirt.engine.core.vdsbroker.VdsManager] (org.ovirt.thread.pool-6-thread-28) [768d56b2] Host 'irvingi04' is not responding. 2016-10-07 20:20:45,641 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-25) [4c58cfdb] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hv02 is not responding. It will stay in Connecting state for a grace period of 83 seconds and after that an attempt to fence the host will be issued. 2016-10-07 20:20:45,644 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to hv02.front.sepia.ceph.com/172.21.0.6 2016-10-07 20:20:45,645 WARN [org.ovirt.vdsm.jsonrpc.client.internal.ResponseWorker] (ResponseWorker) [] Exception thrown during message processing 2016-10-07 20:20:45,718 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-6-thread-28) [768d56b2] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host irvingi04 is not responding. Host cannot be fenced automatically because power management for the host is disabled. 2016-10-07 20:20:45,719 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-37) [175d4b72] Correlation ID: 175d4b72, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi04-Local. Setting Data Center status to Non Responsive (On host irvingi04, Error: Network error during communication with the Host.). 2016-10-07 20:20:47,768 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler_Worker-15) [107ea268] START, SpmStopVDSCommand(HostName = hv02, SpmStopVDSCommandParameters:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', storagePoolId='28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6'}), log id: 30c66b 2016-10-07 20:20:47,768 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler_Worker-15) [107ea268] SpmStopVDSCommand:: vds 'hv02' is in 'Connecting' status - not performing spm stop, pool id '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' 2016-10-07 20:20:47,768 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStopVDSCommand] (DefaultQuartzScheduler_Worker-15) [107ea268] FINISH, SpmStopVDSCommand, log id: 30c66b 2016-10-07 20:20:47,768 ERROR [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (DefaultQuartzScheduler_Worker-15) [107ea268] IRS failover failed - can't allocate vds server 2016-10-07 20:20:48,548 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to hv01.front.sepia.ceph.com/172.21.0.5 2016-10-07 20:20:48,639 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to irvingi04.front.sepia.ceph.com/172.21.3.231 2016-10-07 20:20:48,751 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-5) [4267870f] START, GetHardwareInfoVDSCommand(HostName = hv02, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='7c70df22-d55f-4f7e-beee-9dc4bfe0b70d', vds='Host[hv02,7c70df22-d55f-4f7e-beee-9dc4bfe0b70d]'}), log id: 1a34a64e 2016-10-07 20:20:48,788 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-5) [4267870f] FINISH, GetHardwareInfoVDSCommand, log id: 1a34a64e 2016-10-07 20:20:48,800 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-99) [3f1b2582] START, GetHardwareInfoVDSCommand(HostName = hv01, VdsIdAndVdsVDSCommandParametersBase:{runAsync='true', hostId='a8f312e9-1582-4914-aed8-14e614c7af7e', vds='Host[hv01,a8f312e9-1582-4914-aed8-14e614c7af7e]'}), log id: 594e7a8a 2016-10-07 20:20:48,811 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.GetHardwareInfoVDSCommand] (DefaultQuartzScheduler_Worker-99) [3f1b2582] FINISH, GetHardwareInfoVDSCommand, log id: 594e7a8a 2016-10-07 20:20:49,144 INFO [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (DefaultQuartzScheduler_Worker-5) [34b5a29f] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected : ID: 7c70df22-d55f-4f7e-beee-9dc4bfe0b70d Type: VDS 2016-10-07 20:20:49,156 INFO [org.ovirt.engine.core.bll.HandleVdsCpuFlagsOrClusterChangedCommand] (DefaultQuartzScheduler_Worker-99) [230b35b2] Running command: HandleVdsCpuFlagsOrClusterChangedCommand internal: true. Entities affected : ID: a8f312e9-1582-4914-aed8-14e614c7af7e Type: VDS
Updated by David Galloway over 7 years ago
This all started back around Sep12 when I Force Removed the irvingi02 export domain from RHEV.
I increased the vdsHeartbeat value on mgr01 (RHEV Manager) from 30sec to 60sec.
Updated by David Galloway over 7 years ago
I've got ping sequence running from mgr01 -> hv01 and from hv01 -> store01 in a screen session on mgr01 and hv01. Output is being logged to /root/pings/out.log
Updated by David Galloway over 7 years ago
- Priority changed from Urgent to Normal
The daily RHEV VM pausing was resolved by:
gluster volume set shardvol1 performance.strict-o-direct on gluster volume set shardvol1 network.remote-dio off
I also added an arbiter brick to the cluster to give us full HA if one of the Gluster nodes went down. We're experiencing a severe decrease in VM disk IO speed as a result of http://review.gluster.org/#/c/15641/. We haven't yet decided whether to remove the arbiter brick, install nightly packages that include the fix, or wait for the fix to make it upstream. The earliest it might make it into the next release would be around this time in November.
Updated by David Galloway over 7 years ago
We're still limping along with the arbiter brick in place. We had to disable PCP reporting for now, though. The PCP VM's load would spike to 300+ and would sit on iowait and wasn't actually writing.
The next point release should come out in the next few weeks. I'll evaluate installing that release if it includes the patch for the Gluster bug.
Updated by David Galloway over 7 years ago
Upgraded to 3.8.6 which was supposed to have arbiter write performance fix in it.
Turns out it's a result of sharding.
Updated by David Galloway over 6 years ago
- Status changed from In Progress to Resolved
All VMs are on NVMe storage using RHGS now.