Overview

Notes on IPMI fencing tests for masakari deployed with kolla-ansible.
Enable masakari and hacluster in /etc/kolla/globals.yml:

```yaml
enable_masakari: "yes"
enable_hacluster: "yes"
```
Deployment
```shell
kolla-ansible -i /etc/kolla/multinode deploy -t masakari -vv
```
1.1 Error 1

The hacluster-pacemaker log may contain errors like the following:
```
Jan 09 13:40:23.643 controller01 pacemakerd [17] (pcmk_process_exit) notice: Respawning pacemaker-based subdaemon after unexpected exit
Jan 09 13:40:23.643 controller01 pacemakerd [17] (start_child) info: Using uid=42486 and group=42486 for process pacemaker-based
Jan 09 13:40:23.643 controller01 pacemakerd [17] (start_child) info: Forked child 58 for process pacemaker-based
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (crm_log_init) info: Changed active directory to /var/lib/pacemaker/cores
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (main) notice: Starting Pacemaker CIB manager
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (get_cluster_type) info: Verifying cluster type: 'corosync'
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (get_cluster_type) info: Assuming an active 'corosync' cluster
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (pcmk__daemon_user_can_write) notice: /var/lib/pacemaker/cib/cib.xml is not owned by user hacluster | uid 42486 != 189
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (pcmk__daemon_group_can_write) notice: /var/lib/pacemaker/cib/cib.xml is not readable and writable by group haclient | st_mode=0100600
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (pcmk__daemon_can_write) error: /var/lib/pacemaker/cib/cib.xml must be owned and writable by either user hacluster or group haclient | st_mode=0100600
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (activateCibXml) error: Ignoring invalid CIB
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (activateCibXml) crit: Could not write out new CIB and no saved version to revert to
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (cib_init) crit: Cannot start CIB... terminating
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (crm_xml_cleanup) info: Cleaning up memory from libxml2
Jan 09 13:40:23.673 controller01 pacemaker-based [58] (crm_exit) info: Exiting pacemaker-based | with status 66
Jan 09 13:40:23.673 controller01 pacemakerd [17] (pcmk_child_exit) error: pacemaker-based[58] exited with status 66 (Input file not available)
Jan 09 13:40:23.673 controller01 pacemakerd [17] (pcmk__ipc_is_authentic_process_active) info: Could not connect to cib_ro IPC: Connection refused
```
In that case, check the hacluster user and group inside the container:
```
cat /etc/passwd | grep hacluster
hacluster:x:42486:42486::/home/hacluster:/usr/sbin/nologin
cat /etc/group | grep haclient
haclient:x:42486:
```
Then check the directory ownership of the hacluster_pacemaker Docker volume on the host:
```
ll /var/lib/docker/volumes/hacluster_pacemaker/_data/
total 40
drwxr-x--- 2 42486 42486    10 Mar 16  2023 blackbox
drwxr-x--- 2 42486 42486  4096 Mar 16  2023 cib
drwxr-x--- 2 42486 42486    10 Mar 16  2023 cores
drwxr-x--- 2 42486 42486 20480 Jan 12 15:05 pengine
```
Make the host-side ownership match the hacluster UID/GID used inside the container.
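A minimal sketch of the fix, assuming the volume path and the 42486 UID/GID observed above (the helper name is ours; run as root on the Docker host):

```shell
# Align a host-side volume with the container's hacluster UID/GID so
# pacemaker-based can write the CIB.
fix_volume_ownership() {
    vol="$1"; uid="$2"; gid="$3"
    chown -R "${uid}:${gid}" "$vol"
    # cib.xml itself must be writable by user hacluster
    # (mode 0600, as reported in the log above)
    if [ -f "$vol/cib/cib.xml" ]; then
        chmod 0600 "$vol/cib/cib.xml"
    fi
}

# Values observed in this deployment (adjust to yours):
# fix_volume_ownership /var/lib/docker/volumes/hacluster_pacemaker/_data 42486 42486
```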
1.2 Error 2

The hacluster user and group are consistent across the containers, but the masakari-hostmonitor container still produces log entries like the following:
```
2024-01-07 03:39:10.350 7 WARNING masakarimonitors.hostmonitor.host_handler.handle_host [-] Exception caught: Unexpected error while running command.
Command: crmadmin -S controller01
Exit code: 102
Stdout: ''
Stderr: 'error: Could not connect to controller: Connection refused\nerror: Command failed: Connection refused\n': oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2024-01-07 03:39:10.350 7 WARNING masakarimonitors.hostmonitor.host_handler.handle_host [-] 'controller01' is unstable state on cluster.: oslo_concurrency.processutils.ProcessExecutionError: Unexpected error while running command.
2024-01-07 03:39:10.350 7 WARNING masakarimonitors.hostmonitor.host_handler.handle_host [-] hostmonitor skips monitoring hosts.
```
The likely cause is Docker shared memory: the container cannot reach the pacemaker IPC files under /dev/shm.
```
crmadmin -S controller01 -V
error: couldn't open file /dev/shm/qb-21-0-15-xwVUp8/qb-request-crmd-header
```
When redeploying, map /dev into the container. Edit the kolla-ansible masakari role defaults:
```shell
vim /usr/local/share/kolla-ansible/ansible/roles/masakari/defaults/main.yml
```
Change masakari_hostmonitor_default_volumes to the following:
```yaml
masakari_hostmonitor_default_volumes:
  - "{{ node_config_directory }}/masakari-hostmonitor/:{{ container_config_directory }}/:ro"
  - "/etc/localtime:/etc/localtime:ro"
  - "{{ '/etc/timezone:/etc/timezone:ro' if ansible_os_family == 'Debian' else '' }}"
  - "kolla_logs:/var/log/kolla/"
  - "{{ kolla_dev_repos_directory ~ '/masakari-monitors/masakarimonitors:/var/lib/kolla/venv/lib/python' ~ distro_python_version ~ '/site-packages/masakarimonitors' if masakari_dev_mode | bool else '' }}"
  - "/dev/:/dev/"
```
Finally, redeploy masakari.
1.3 Configuring IPMI

Enter the hacluster-pacemaker container on a controller node and install the required packages.
Note: fence-agents-4.2.1-109.uelc20.01 from the current cdimage repository has an internationalization (i18n) failure; use fence-agents-4.2.1-109.uelc20.04 instead.
Download the rpm files in advance and install them.
```shell
wget http://10.30.38.131/kojifiles/packages/fence-agents/4.2.1/109.uelc20.04/noarch/fence-agents-common-4.2.1-109.uelc20.04.noarch.rpm
wget http://10.30.38.131/kojifiles/packages/fence-agents/4.2.1/109.uelc20.04/noarch/fence-agents-ipmilan-4.2.1-109.uelc20.04.noarch.rpm
```
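The install step itself is a plain rpm command, sketched here as a helper (filenames taken from the wget step above; the helper name is ours), to be run inside the hacluster-pacemaker container:

```shell
# Install the fence-agents packages downloaded above; assumes the rpm
# files sit in the current working directory inside the container.
install_fence_agents() {
    rpm -ivh \
        fence-agents-common-4.2.1-109.uelc20.04.noarch.rpm \
        fence-agents-ipmilan-4.2.1-109.uelc20.04.noarch.rpm
}

# Run inside the hacluster-pacemaker container:
# install_fence_agents
```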
Check the power status of the compute node:
```shell
ipmitool -I lanplus -H <bmc_ip> -U admin -P <bmc_pass> power status
```
or use fence_ipmilan directly:
```shell
fence_ipmilan -a <bmc_ip> -P -l admin -p <bmc_pass> -o status
```
Set the stonith property:
```shell
pcs property set stonith-enabled=true
```
Create the stonith device:
```shell
pcs stonith create ipmi-fence-compute4 fence_ipmilan \
    verbose_level=1 \
    lanplus=1 pcmk_host_list='compute4' delay=5 pcmk_host_check='static-list' \
    pcmk_off_action=off pcmk_reboot_action=off ip='<bmc_ip>' \
    username='admin' password='<bmc_pass>' power_wait=4 op monitor interval=60s
```
Finally, check the resource status.
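One way to do this (a sketch; the helper name is ours — pcs status and crm_mon -1 are the standard pacemaker status commands, run inside the hacluster-pacemaker container):

```shell
# Inspect cluster and stonith resource state; the ipmi-fence-compute4
# resource created above should show as Started.
check_cluster_status() {
    pcs status     # full cluster overview, including stonith resources
    crm_mon -1     # one-shot snapshot of the same information
}

# check_cluster_status
```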
1.4 Testing

Create a segment:
```shell
openstack --os-ha-api-version 1.1 \
    segment create \
    --description "segment for vm ha test" \
    segment_for_vmha_test auto compute
```
Create the hosts:
```shell
openstack --os-ha-api-version 1.1 segment host create compute1 COMPUTE SSH <segment-id>
openstack --os-ha-api-version 1.1 segment host create compute4 COMPUTE SSH <segment-id>
```
On the compute4 node, shut down the network interface.
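A minimal way to simulate the failure, assuming the interface name (eth0 below is a placeholder for the node's real management NIC; the helper name is ours):

```shell
# Cut the cluster's view of the node by downing its NIC
# (run on compute4 as root).
simulate_host_failure() {
    iface="${1:-eth0}"            # placeholder default interface name
    ip link set dev "$iface" down
}

# simulate_host_failure eth0
```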
Check the hostmonitor log:
```
2024-01-12 15:03:23.084 7 INFO masakarimonitors.ha.masakari [-] Send a notification. {'notification': {'type': 'COMPUTE_HOST', 'hostname': 'compute4', 'generated_time': datetime.datetime(2024, 1, 12, 7, 3, 23, 84763), 'payload': {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'}}}
2024-01-12 15:03:33.317 7 INFO masakarimonitors.ha.masakari [-] Response: openstack.instance_ha.v1.notification.Notification(type=COMPUTE_HOST, hostname=compute4, generated_time=2024-01-12T07:03:23.084763, payload={'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'}, id=2, notification_uuid=c5b2e8cb-3090-4a7e-8dfa-72561e190809, source_host_uuid=64847f55-2487-4904-bd64-680d8542b60f, status=new, created_at=2024-01-12T07:03:30.000000, updated_at=None, location=Munch({'cloud': '10.10.15.10', 'region_name': 'RegionOne', 'zone': None, 'project': Munch({'id': '43f079ad3e324bd982f205bded4fc729', 'name': None, 'domain_id': None, 'domain_name': None})}))
```
Check the masakari-engine log:
```
2024-01-12 15:03:33.318 7 INFO masakari.engine.manager [req-b870773d-998d-4187-9010-c50d654ed1bf 93328d5c40d74721886733900547d730 43f079ad3e324bd982f205bded4fc729 - - -] Processing notification c5b2e8cb-3090-4a7e-8dfa-72561e190809 of type: COMPUTE_HOST
2024-01-12 15:04:01.157 7 INFO masakari.compute.nova [req-774e9209-f0bb-41c6-b842-1abaa0a52dd2 nova - - - -] Disable nova-compute on compute4
2024-01-12 15:04:09.499 7 INFO masakari.engine.drivers.taskflow.host_failure [req-774e9209-f0bb-41c6-b842-1abaa0a52dd2 nova - - - -] Sleeping 180 sec before starting recovery thread until nova recognizes the node down.
... ...
2024-01-12 15:07:46.530 7 INFO masakari.engine.manager [req-e4a7d64a-3802-46b2-8b10-5c0342b23fef nova - - - -] Notification c5b2e8cb-3090-4a7e-8dfa-72561e190809 exits with status: finished.
```
Check the notification:
```
openstack --os-ha-api-version 1.1 notification list
+--------------------------------------+----------------------------+----------+--------------+--------------------------------------+----------------------------------------------------------------------------+
| notification_uuid                    | generated_time             | status   | type         | source_host_uuid                     | payload                                                                    |
+--------------------------------------+----------------------------+----------+--------------+--------------------------------------+----------------------------------------------------------------------------+
| c5b2e8cb-3090-4a7e-8dfa-72561e190809 | 2024-01-12T07:03:23.000000 | finished | COMPUTE_HOST | 64847f55-2487-4904-bd64-680d8542b60f | {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'} |
+--------------------------------------+----------------------------+----------+--------------+--------------------------------------+----------------------------------------------------------------------------+
```
2 Usage

2.1 Creating a segment

Create a segment with the openstack CLI:
```
openstack segment create <name> <recovery_method> <service_type>
openstack segment create test_segment reserved_host compute
+-----------------+--------------------------------------+
| Field           | Value                                |
+-----------------+--------------------------------------+
| created_at      | 2023-07-05T08:39:03.000000           |
| updated_at      | None                                 |
| uuid            | 5d0db8e4-1b93-4c5f-931e-39b92a0eeee1 |
| name            | test_segment                         |
| description     | None                                 |
| id              | 5                                    |
| service_type    | compute                              |
| recovery_method | reserved_host                        |
+-----------------+--------------------------------------+
```
- name: must be unique.
- recovery_method, one of:
  - auto: evacuate all VMs from the failed host without picking a target host (nova chooses the destination).
  - reserved_host: evacuate all VMs from the failed host to a designated reserved host.
  - auto_priority: try the auto strategy first; if evacuation fails, fall back to reserved_host.
  - rh_priority: try the reserved_host strategy first; if evacuation fails, fall back to auto.
- service_type: defaults to compute.
2.2 Creating hosts

First list the nodes running nova-compute:
```
openstack compute service list --service nova-compute
+----+--------------+---------+------------+---------+-------+----------------------------+
| ID | Binary       | Host    | Zone       | Status  | State | Updated At                 |
+----+--------------+---------+------------+---------+-------+----------------------------+
| 22 | nova-compute | ceph1   | nova       | enabled | up    | 2023-07-05T08:58:31.000000 |
| 25 | nova-compute | ceph2   | nova       | enabled | up    | 2023-07-05T08:58:21.000000 |
| 28 | nova-compute | ceph3   | nova       | enabled | down  | 2023-06-28T06:03:16.000000 |
| 31 | nova-compute | extend1 | extend-1-2 | enabled | up    | 2023-07-05T08:58:25.000000 |
| 35 | nova-compute | extend3 | nova       | enabled | up    | 2023-07-05T08:58:23.000000 |
| 37 | nova-compute | extend2 | extend-1-2 | enabled | up    | 2023-07-05T08:58:31.000000 |
| 39 | nova-compute | extend4 | nova       | enabled | up    | 2023-07-05T08:58:31.000000 |
| 41 | nova-compute | extend5 | nova       | enabled | up    | 2023-07-05T08:58:21.000000 |
+----+--------------+---------+------------+---------+-------+----------------------------+
```
Then create the host:
```shell
openstack segment host create [--reserved <reserved>] \
    [--on_maintenance <on_maintenance>] \
    <name> <type> <control_attributes> \
    <segment_id>
```
Create a host named extend1 and mark it as a reserved host with --reserved. The name must match the compute node's hostname.
```
openstack segment host create --reserved True extend1 compute ssh 5d0db8e4-1b93-4c5f-931e-39b92a0eeee1
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2023-07-05T08:59:49.000000           |
| updated_at          | None                                 |
| uuid                | 33c31659-ad94-487b-8ac9-609484ae9394 |
| name                | extend1                              |
| type                | compute                              |
| control_attributes  | ssh                                  |
| reserved            | True                                 |
| on_maintenance      | False                                |
| failover_segment_id | 5d0db8e4-1b93-4c5f-931e-39b92a0eeee1 |
+---------------------+--------------------------------------+
```
Create another host to use in the evacuation test:
```
openstack segment host create extend2 compute ssh 5d0db8e4-1b93-4c5f-931e-39b92a0eeee1
+---------------------+--------------------------------------+
| Field               | Value                                |
+---------------------+--------------------------------------+
| created_at          | 2023-07-05T09:06:48.000000           |
| updated_at          | None                                 |
| uuid                | 68a8b714-162a-4646-9640-fa665c0e4e37 |
| name                | extend2                              |
| type                | compute                              |
| control_attributes  | ssh                                  |
| reserved            | False                                |
| on_maintenance      | False                                |
| failover_segment_id | 5d0db8e4-1b93-4c5f-931e-39b92a0eeee1 |
+---------------------+--------------------------------------+
```
Listing the hosts now gives:
```
openstack segment host list 5d0db8e4-1b93-4c5f-931e-39b92a0eeee1
+--------------------------------------+---------+---------+--------------------+----------+----------------+--------------------------------------+
| uuid                                 | name    | type    | control_attributes | reserved | on_maintenance | failover_segment_id                  |
+--------------------------------------+---------+---------+--------------------+----------+----------------+--------------------------------------+
| 68a8b714-162a-4646-9640-fa665c0e4e37 | extend2 | compute | ssh                | False    | False          | 5d0db8e4-1b93-4c5f-931e-39b92a0eeee1 |
| 33c31659-ad94-487b-8ac9-609484ae9394 | extend1 | compute | ssh                | True     | False          | 5d0db8e4-1b93-4c5f-931e-39b92a0eeee1 |
+--------------------------------------+---------+---------+--------------------+----------+----------------+--------------------------------------+
```
2.3 Testing automatic evacuation

Check the cluster status with crm_mon. For container deployments, enter the hacluster_pacemaker container first.
```
crm_mon
Cluster Summary:
  * Stack: corosync
  * Current DC: control3 (version 2.0.4-6.uelc20.8-2deceaa3ae) - partition with quorum
  * Last updated: Wed Jul 5 17:12:56 2023
  * Last change: Wed Jul 5 11:28:59 2023 by root via cibadmin on control1
  * 10 nodes configured
  * 7 resource instances configured

Node List:
  * Online: [ control1 control2 control3 ]
  * RemoteOnline: [ ceph1 ceph2 extend1 extend2 extend3 extend4 extend5 ]

Active Resources:
  * extend5 (ocf::pacemaker:remote): Started control1
  * extend4 (ocf::pacemaker:remote): Started control2
  * extend2 (ocf::pacemaker:remote): Started control3
  * extend3 (ocf::pacemaker:remote): Started control1
  * ceph1 (ocf::pacemaker:remote): Started control2
  * extend1 (ocf::pacemaker:remote): Started control3
  * ceph2 (ocf::pacemaker:remote): Started control1
```
Log in to the extend2 compute node and shut the machine down.
Check the masakari-hostmonitor log; in container deployments it is at /var/log/kolla/masakari/masakari-hostmonitor.log:
```
2023-07-05 17:18:35.961 7 INFO masakarimonitors.hostmonitor.host_handler.handle_host [-] 'extend2' is 'offline' (current: 'offline').
2023-07-05 17:18:35.961 7 INFO masakarimonitors.ha.masakari [-] Send a notification. {'notification': {'type': 'COMPUTE_HOST', 'hostname': 'extend2', 'generated_time': datetime.datetime(2023, 7, 5, 9, 18, 35, 961461), 'payload': {'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'}}}
2023-07-05 17:18:38.209 7 INFO masakarimonitors.ha.masakari [-] Response: openstack.instance_ha.v1.notification.Notification(type=COMPUTE_HOST, hostname=extend2, generated_time=2023-07-05T09:18:35.961461, payload={'event': 'STOPPED', 'cluster_status': 'OFFLINE', 'host_status': 'NORMAL'}, id=1, notification_uuid=e0ca6a58-cdda-495f-b84d-f0ef158b5e90, source_host_uuid=68a8b714-162a-4646-9640-fa665c0e4e37, status=new, created_at=2023-07-05T09:18:38.000000, updated_at=None, location=Munch({'cloud': '10.10.15.221', 'region_name': 'RegionOne', 'zone': None, 'project': Munch({'id': '53389ce5aec842ba88a8715c4b5b9d86', 'name': None, 'domain_id': None, 'domain_name': None})}))
```
Meanwhile the cluster status becomes:
```
Cluster Summary:
  * Stack: corosync
  * Current DC: control3 (version 2.0.4-6.uelc20.8-2deceaa3ae) - partition with quorum
  * Last updated: Wed Jul 5 17:19:25 2023
  * Last change: Wed Jul 5 11:28:59 2023 by root via cibadmin on control1
  * 10 nodes configured
  * 7 resource instances configured

Node List:
  * Online: [ control1 control2 control3 ]
  * RemoteOnline: [ ceph1 ceph2 extend1 extend3 extend4 extend5 ]
  * RemoteOFFLINE: [ extend2 ]

Active Resources:
  * extend5 (ocf::pacemaker:remote): Started control1
  * extend4 (ocf::pacemaker:remote): Started control2
  * extend3 (ocf::pacemaker:remote): Started control3
  * ceph1 (ocf::pacemaker:remote): Started control2
  * extend1 (ocf::pacemaker:remote): Started control3
  * ceph2 (ocf::pacemaker:remote): Started control1
```
Check the masakari-engine log:
```
2023-07-05 17:18:47.488 8 INFO masakari.compute.nova [req-7b8637b6-51ae-487f-8635-56fa96e8bd99 nova - - - -] Disable nova-compute on extend2
...
2023-07-05 17:22:20.616 8 INFO masakari.engine.drivers.taskflow.host_failure [req-7b8637b6-51ae-487f-8635-56fa96e8bd99 nova - - - -] Sleeping 180 sec before starting recovery thread until nova recognizes the node down.
```
Check the notification status:
```
openstack notification list -f json
[
  {
    "notification_uuid": "e0ca6a58-cdda-495f-b84d-f0ef158b5e90",
    "generated_time": "2023-07-05T17:18:35.000000",
    "status": "finished",
    "type": "COMPUTE_HOST",
    "source_host_uuid": "68a8b714-162a-4646-9640-fa665c0e4e37",
    "payload": {
      "event": "STOPPED",
      "cluster_status": "OFFLINE",
      "host_status": "NORMAL"
    }
  }
]
```
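Because the CLI can emit JSON, the status can also be checked from a script. A sketch (the helper name is ours; the field names match the listing above), using python3 to parse the JSON:

```shell
# Print "<uuid> <status>" for each notification, reading the JSON
# produced by `openstack notification list -f json` on stdin; useful
# for polling until a notification reaches the finished state.
notification_status() {
    python3 -c '
import json, sys
for n in json.load(sys.stdin):
    print(n["notification_uuid"], n["status"])
'
}

# Usage: openstack notification list -f json | notification_status
```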
Check the host status:
```
openstack segment host list 5d0db8e4-1b93-4c5f-931e-39b92a0eeee1
+--------------------------------------+---------+---------+--------------------+----------+----------------+--------------------------------------+
| uuid                                 | name    | type    | control_attributes | reserved | on_maintenance | failover_segment_id                  |
+--------------------------------------+---------+---------+--------------------+----------+----------------+--------------------------------------+
| 68a8b714-162a-4646-9640-fa665c0e4e37 | extend2 | compute | ssh                | False    | True           | 5d0db8e4-1b93-4c5f-931e-39b92a0eeee1 |
| 33c31659-ad94-487b-8ac9-609484ae9394 | extend1 | compute | ssh                | False    | False          | 5d0db8e4-1b93-4c5f-931e-39b92a0eeee1 |
+--------------------------------------+---------+---------+--------------------+----------+----------------+--------------------------------------+
```