Actions
Bug #17415
closedRHEV Network or Storage issues
Status:
Resolved
Priority:
Normal
Assignee:
Category:
Infrastructure Service
Target version:
-
% Done:
0%
Source:
other
Tags:
Backport:
Regression:
No
Severity:
3 - minor
Reviewed:
Affected Versions:
ceph-qa-suite:
Crash signature (v1):
Crash signature (v2):
Description
At about 00:27 UTC on 27SEP2016, the RHEV cluster either began having backend storage issues or overall network issues. So far all clues point to store01.front.sepia.ceph.com which is the main gluster backing node serving the RHEV VMs.
Symptoms
- teuthology and pcp VMs were paused
- RHEV logs show an "unknown storage error"
- RHEV Manager VM was restarted (This VM's disk is served over NFS; not gluster)
- Logs from different involved machines below
- CPU Soft lockups in teuthology VM's kernel log indicating either network issues accessing LRC or its own disk (hosted on gluster)
NOTE: Times around 00:20 are UTC. Times around 20:20 are EDT.
===== hv01:/var/log/messages ===== Sep 26 20:27:00 hv01 journal: ovirt-ha-agent ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out Sep 26 20:27:00 hv01 ovirt-ha-agent: ERROR:ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink:Connection closed: Connection timed out Sep 26 20:27:00 hv01 ovirt-ha-agent: WARNING:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Error while monitoring engine: Connection timed out Sep 26 20:27:00 hv01 ovirt-ha-agent: INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Sleeping 60 seconds Sep 26 20:27:07 hv01 journal: vdsm ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink ERROR Connection closed: Connection timed out Sep 26 20:27:07 hv01 journal: vdsm vds ERROR failed to retrieve Hosted Engine HA info#012Traceback (most recent call last):#012 File "/usr/share/vdsm/API.py", line 1851, in _getHaInfo#012 stats = instance.get_all_stats()#012 File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 103, in get_all_stats#012 self._configure_broker_conn(broker)#012 File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 180, in _configure_broker_conn#012 dom_type=dom_type)#012 File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 176, in set_storage_domain#012 .format(sd_type, options, e))#012RequestError: Failed to set storage domain FilesystemBackend, options {'dom_type': 'nfs3', 'sd_uuid': 'ca46f740-bc09-477f-b09e-cf7e746dca73'}: Connection timed out ===== store01:/var/log/messages ===== Sep 27 00:27:01 old-gw pdns[2958]: Backend reported permanent error which prevented lookup (GSQLBackend lookup query:Failed to execute mysql_query, perhaps connection died? Err=1: Lost connection to MySQL server during query), aborting ===== store01:/var/log/glusterfs/bricks/srv-gluster-shardbrick1.log ===== [2016-09-26 03:27:08.778199] I [glusterfsd-mgmt.c:1596:mgmt_getspec_cbk] 0-glusterfs: No change in volfile, continuing [2016-09-27 00:28:08.678750] W [socket.c:696:__socket_rwv] 0-tcp.shardvol1-server: writev on 172.21.0.5:49148 failed (Broken pipe) [2016-09-27 00:28:15.872485] I [socket.c:2468:socket_event_handler] 0-transport: disconnecting now [2016-09-27 00:28:17.803641] I [MSGID: 115036] [server.c:548:server_rpc_notify] 0-shardvol1-server: disconnecting connection from hv01.front.sepia.ceph.com-15267-2016/08/04-00:38:10:236127-shardvol1-client-0-0-0 [2016-09-27 00:28:30.945155] I [socket.c:3498:socket_submit_reply] 0-tcp.shardvol1-server: not connected (priv->connected = -1) ... Repeated ... [2016-09-27 00:28:44.611277] E [rpcsvc.c:1310:rpcsvc_submit_generic] 0-rpc-service: failed to submit message (XID: 0x255417e4, Program: GlusterFS 3.3, ProgVers: 330, Proc: 34) to rpc-transport (tcp.shardvol1-server) [2016-09-27 00:28:44.611286] E [server.c:200:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.0/xlator/debug/io-stats.so(+0x1814e) [0x7f8396b4514e] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x1a2e9) [0x7f83966e72e9] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x8fa6) [0x7f83966d5fa6] ) 0-: Reply submission failed [2016-09-27 00:28:44.611331] E [server.c:200:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.0/xlator/debug/io-stats.so(+0x1814e) [0x7f8396b4514e] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x1a2e9) [0x7f83966e72e9] -->/usr/lib64/glusterfs/3.8.0/xlator/protocol/server.so(+0x8fa6) [0x7f83966d5fa6] ) 0-: Reply submission failed ===== mgr01:/var/log/ovirt-engine/engine.log ===== 2016-09-26 20:35:58,182 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-6-thread-16) [1c630ccf] domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4:iso_storage' in problem. vds: 'hv02' 2016-09-26 20:35:58,184 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxyData] (org.ovirt.thread.pool-6-thread-16) [1c630ccf] domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700:shardvol1' in problem. vds: 'hv02' 2016-09-26 20:36:03,468 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-42) [4a281328] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: d6b1d38b-6f1f-4438-b04f-53619241a1ef Type: StoragePool 2016-09-26 20:36:03,480 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-43) [22db002d] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: d3ec2fae-de85-41d2-8cf4-b09c3fd77ca3 Type: StoragePool 2016-09-26 20:36:03,481 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-44) [75431892] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 71a72320-daa6-4c1c-a9ee-c809a8f0ad9f Type: StoragePool 2016-09-26 20:36:03,510 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-42) [4a281328] Storage Pool 'd6b1d38b-6f1f-4438-b04f-53619241a1ef' - Updating Storage Domain '14b9a2fc-838c-46d4-8ba6-52f8d815f19e' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,511 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-43) [22db002d] Storage Pool 'd3ec2fae-de85-41d2-8cf4-b09c3fd77ca3' - Updating Storage Domain '1ecdcdf2-4301-4bba-a17c-270a85f11ade' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,523 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-44) [75431892] Storage Pool '71a72320-daa6-4c1c-a9ee-c809a8f0ad9f' - Updating Storage Domain 'b43d2eef-5cfa-4290-9e7b-f6ebc4e9c891' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,535 INFO [org.ovirt.engine.core.bll.storage.SetStoragePoolStatusCommand] (DefaultQuartzScheduler_Worker-41) [7734731c] Running command: SetStoragePoolStatusCommand internal: true. Entities affected : ID: 28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6 Type: StoragePool 2016-09-26 20:36:03,583 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '67ff9a5d-b5da-4a2f-b5ce-2286bc82e3e4' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,596 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain '2a81a4ec-23c1-48b8-b296-e0bcc4f59700' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,598 INFO [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (DefaultQuartzScheduler_Worker-41) [7734731c] Storage Pool '28fc87ad-2e28-44d2-8ce4-2e63b9bad4c6' - Updating Storage Domain 'ca46f740-bc09-477f-b09e-cf7e746dca73' status from 'Active' to 'Unknown', reason: null 2016-09-26 20:36:03,832 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-43) [22db002d] Correlation ID: 22db002d, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi08-Local. Setting status to Non Responsive. 2016-09-26 20:36:03,853 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-44) [75431892] Correlation ID: 75431892, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi02-Local. Setting status to Non Responsive. 2016-09-26 20:36:03,871 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-42) [4a281328] Correlation ID: 4a281328, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center irvingi07-Local. Setting status to Non Responsive. 2016-09-26 20:36:03,874 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-41) [7734731c] Correlation ID: 7734731c, Call Stack: null, Custom Event ID: -1, Message: Invalid status on Data Center sepia. Setting status to Non Responsive. 2016-09-26 20:36:08,732 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-67) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM RHEV-Manager is down with error. Exit message: Failed to find the libvirt domain. 2016-09-26 20:36:08,735 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-67) [] VM '6406669d-7df3-4719-a3ae-bf82f523ff03(RHEV-Manager) is running in db and not running in VDS 'hv02' 2016-09-26 20:36:08,744 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM pcp has been paused due to unknown storage error. 2016-09-26 20:36:08,746 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-68) [] VM 'da38e6f8-9ccb-4ce0-95f0-afc98a171386'(teuthology) moved from 'Up' --> 'Paused' 2016-09-26 20:36:08,787 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Failed during vms monitoring on host hv02 error is: java.lang.NullPointerException 2016-09-26 20:36:08,788 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Exception:: java.lang.NullPointerException at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:99) [dal.jar:] at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:27) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:110) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:107) [dal.jar:] at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:73) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:61) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:103) [dal.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.saveVmsToDb(VmsMonitoring.java:374) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.refreshVmStats(VmsMonitoring.java:243) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.perform(VmsMonitoring.java:165) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:36) [vdsbroker.jar:] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_111] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_111] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111] at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:] at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:] at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:] 2016-09-26 20:36:08,825 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused. 2016-09-26 20:36:08,857 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused due to unknown storage error. 2016-09-26 20:36:08,732 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-67) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM RHEV-Manager is down with error. Exit message: Failed to find the libvirt domain. 2016-09-26 20:36:08,735 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-67) [] VM '6406669d-7df3-4719-a3ae-bf82f523ff03(RHEV-Manager) is running in db and not running in VDS 'hv02' 2016-09-26 20:36:08,744 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM pcp has been paused due to unknown storage error. 2016-09-26 20:36:08,746 INFO [org.ovirt.engine.core.vdsbroker.VmAnalyzer] (DefaultQuartzScheduler_Worker-68) [] VM 'da38e6f8-9ccb-4ce0-95f0-afc98a171386'(teuthology) moved from 'Up' --> 'Paused' 2016-09-26 20:36:08,787 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Failed during vms monitoring on host hv02 error is: java.lang.NullPointerException 2016-09-26 20:36:08,788 ERROR [org.ovirt.engine.core.vdsbroker.VmsMonitoring] (DefaultQuartzScheduler_Worker-67) [] Exception:: java.lang.NullPointerException at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:99) [dal.jar:] at org.ovirt.engine.core.dao.VmDynamicDaoImpl.createFullParametersMapper(VmDynamicDaoImpl.java:27) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:110) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao$1.map(MassOperationsGenericDao.java:107) [dal.jar:] at org.ovirt.engine.core.dal.dbbroker.SimpleJdbcCallsHandler.executeStoredProcAsBatch(SimpleJdbcCallsHandler.java:73) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:61) [dal.jar:] at org.ovirt.engine.core.dao.MassOperationsGenericDao.updateAllInBatch(MassOperationsGenericDao.java:103) [dal.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.saveVmsToDb(VmsMonitoring.java:374) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.refreshVmStats(VmsMonitoring.java:243) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.VmsMonitoring.perform(VmsMonitoring.java:165) [vdsbroker.jar:] at org.ovirt.engine.core.vdsbroker.PollVmStatsRefresher.poll(PollVmStatsRefresher.java:36) [vdsbroker.jar:] at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) [rt.jar:1.7.0_111] at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) [rt.jar:1.7.0_111] at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) [rt.jar:1.7.0_111] at java.lang.reflect.Method.invoke(Method.java:606) [rt.jar:1.7.0_111] at org.ovirt.engine.core.utils.timer.JobWrapper.invokeMethod(JobWrapper.java:81) [scheduler.jar:] at org.ovirt.engine.core.utils.timer.JobWrapper.execute(JobWrapper.java:52) [scheduler.jar:] at org.quartz.core.JobRunShell.run(JobRunShell.java:213) [quartz.jar:] at org.quartz.simpl.SimpleThreadPool$WorkerThread.run(SimpleThreadPool.java:557) [quartz.jar:] 2016-09-26 20:36:08,825 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused. 2016-09-26 20:36:08,857 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler_Worker-68) [] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM teuthology has been paused due to unknown storage error.
Actions