Zabbix HA on AWS

Zabbix HA Requirements

On AWS
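Two Zabbix servers run on AWS in the same VPC, each in a different AZ.
They run in Active-Passive mode, because the Zabbix server has its own set of external scripts that pull AWS service metrics from AWS CloudWatch.
In an Active-Active setup, both active Zabbix servers would collect these metrics, producing duplicate or conflicting data.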

HA Implementation

Decouple Zabbix Database

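The Zabbix database needs to be decoupled from the servers. This makes failover between the active and passive Zabbix server easier and keeps a single copy of the stored data.
For a Zabbix server, the monitoring data is what matters most. On AWS, the most convenient option is to use AWS RDS as the Zabbix backend database and rely on the managed service.
RDS offers MySQL, MariaDB and PostgreSQL. To keep the HA design reusable on other cloud platforms, PostgreSQL is chosen as the backend database.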

RDS Setup

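Before launching RDS, decide on the master-user-name and master-user-password.
Enable the RDS Multi-AZ option. Configure backup and maintenance as needed.
Estimate the database capacity needed for the monitoring data and choose a matching DB instance type (e.g. db.m4.large).
Update the RDS security group so that the Zabbix servers can reach the database.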
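
As a rough illustration of the sizing and setup steps above, the sketch below estimates history storage with the rule of thumb from the Zabbix documentation (days, times items per refresh interval, times 86400 seconds, times bytes per value) and then prints, without running it, an aws rds create-db-instance command. The item count, the refresh interval, the figure of roughly 90 bytes per value and the zabbix-db identifier are all assumptions for illustration; trends and events need extra space on top of this.

```shell
#!/bin/sh
# History sizing rule of thumb:
#   days * (items / refresh_seconds) * 86400 * bytes_per_value
# The multiplication is done before the division to avoid integer truncation.
# $1=days of history  $2=number of items  $3=refresh interval in seconds
# $4=bytes per stored value (assumed; depends on the database engine)
estimate_history_bytes() {
    echo $(( $1 * $2 * 86400 / $3 * $4 ))
}

# Convert a byte count to whole GiB, rounding up.
bytes_to_gib() {
    echo $(( ($1 + 1073741823) / 1073741824 ))
}

# Print (do not run) the aws CLI call for a Multi-AZ PostgreSQL instance.
# $1=instance identifier (hypothetical)  $2=allocated storage in GiB
print_create_rds() {
    echo aws rds create-db-instance \
        --db-instance-identifier "$1" \
        --engine postgres \
        --db-instance-class db.m4.large \
        --allocated-storage "$2" \
        --multi-az \
        --master-username "<RDS-master-user-name>" \
        --master-user-password "<RDS-master-user-password>"
}

# Example: 3000 items polled every 60s, 90 days of history, ~90 bytes each.
history_bytes=$(estimate_history_bytes 90 3000 60 90)
print_create_rds zabbix-db "$(bytes_to_gib "$history_bytes")"
```

Printing the command first makes it easy to review the storage figure and the Multi-AZ flag before anything is created.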

MariaDB

Verify connection

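mysql -h <RDS-DNS> -P 3306 -u <RDS-master-user-name> -p

The mysql client used above comes from the mariadb package: yum install -y mariadb
If SELinux is enforcing, allow httpd and the Zabbix server to open network connections:
setsebool -P httpd_can_network_connect_db=1
setsebool -P zabbix_can_network=1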

Postgres

Verify connection

psql \
   --host=occ-<rds instance DNS name> \
   --port=5432 \
   --username <username> \
   --password

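The psql client used above comes from the postgresql package: yum install -y postgresql

If SELinux is enforcing, allow httpd and the Zabbix server to open network connections:
setsebool -P httpd_can_network_connect_db=1
setsebool -P zabbix_can_network=1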
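
With the connection verified, both Zabbix servers would be pointed at the same RDS endpoint in /etc/zabbix/zabbix_server.conf. Here is a minimal sketch that prints the relevant settings. The zabbix database and user names are assumptions, and DBPassword is deliberately left to be filled in by hand.

```shell
#!/bin/sh
# Print the database settings for zabbix_server.conf.
# $1=RDS endpoint  $2=port (default 5432)
# $3=database name (default "zabbix", assumed)
# $4=database user (default "zabbix", assumed)
render_zabbix_db_conf() {
    cat <<EOF
DBHost=$1
DBPort=${2:-5432}
DBName=${3:-zabbix}
DBUser=${4:-zabbix}
EOF
}

render_zabbix_db_conf "<RDS-DNS>"
```

Using identical settings on the active and the passive node is what lets either server take over without any data migration.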

AWS ELB

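On AWS, the Zabbix service sits behind AWS ELB, which uses health checks to find out which zone's Zabbix server is currently in service.

Health-check ports:
- external ELB:
  - 80 (apache)
  - 3000 (grafana)
- internal ELB: 10051 (used by zabbix-agent to report metrics)
NOTE: A VIP cannot be used for this on AWS, because AWS does not propagate multicast packets.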
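
The health checks above could be configured with the aws CLI roughly as follows. This sketch assumes Classic Load Balancers, which take a single health-check target each, and it only prints the commands. The load balancer names and the interval, timeout and threshold values are hypothetical.

```shell
#!/bin/sh
# Print (do not run) a TCP health check for a Classic Load Balancer.
# $1=load balancer name (hypothetical)  $2=port to probe on the backend
print_health_check() {
    echo aws elb configure-health-check \
        --load-balancer-name "$1" \
        --health-check "Target=TCP:$2,Interval=10,Timeout=5,UnhealthyThreshold=2,HealthyThreshold=2"
}

print_health_check zabbix-internal 10051   # zabbix-server port used by agents
print_health_check zabbix-external 80      # apache (Zabbix web frontend)
```

Because the services run only on the active node, the passive node fails these checks and the ELB sends traffic to the active node alone.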

Pacemaker

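HA is provided by pacemaker (a FOSS, scalable HA cluster resource manager).

For installation and the general configuration steps, refer to Configuring High Availability (HA) Zabbix Server on CentOS 7; they are not repeated here. Note the following points, though:
To check the state of apache, enable its built-in status module (the built-in modules are in /usr/lib64/httpd/modules/) and configure it as follows: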
- apache httpd status configuration
  - Add LoadModule status_module modules/mod_status.so to /etc/httpd/conf/httpd.conf
  - Add url handling as below

# Allow server status reports generated by mod_status,
# with the URL of http://servername/server-status
# Only local requests are allowed, which is all the pacemaker
# status check against http://127.0.0.1/server-status needs.
#
<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>

Create the resources in pacemaker

pcs resource create zabbix_server systemd:zabbix-server op monitor interval=10s
pcs resource create zabbix_httpd ocf:heartbeat:apache configfile=/etc/httpd/conf/httpd.conf statusurl="http://127.0.0.1/server-status" op monitor interval=20s
pcs resource create grafana_server systemd:grafana-server op monitor interval=10s

Group the three resources (zabbix, apache, grafana) into one resource group in pacemaker

pcs resource group add zabbixcluster zabbix_server zabbix_httpd grafana_server

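On AWS, pay attention to network connectivity, especially when pcs status shows the cluster in an unexpected state. Check that the following port is open in the security group:
- UDP 5405: used by corosync for cluster state checks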
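
One way to open the corosync port is an ingress rule limited to the VPC address range. The sketch below only prints the aws CLI command. The security group ID and the CIDR are placeholders to be replaced with real values.

```shell
#!/bin/sh
# Print (do not run) an ingress rule that opens one port to a CIDR range.
# $1=security group ID (placeholder)  $2=protocol  $3=port
# $4=source CIDR (placeholder, e.g. the VPC range)
print_allow_ingress() {
    echo aws ec2 authorize-security-group-ingress \
        --group-id "$1" \
        --protocol "$2" \
        --port "$3" \
        --cidr "$4"
}

print_allow_ingress "<zabbix-sg-id>" udp 5405 "<VPC-CIDR>"   # corosync
```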


MISC

config files:
- /etc/corosync/corosync.conf
- /var/lib/pcsd/pcs_settings.conf
logs:
- /var/log/pacemaker.log
- /var/log/cluster/corosync.log
- /var/log/pcsd/pcsd.log

Reference

Author: Kasper Deng
Link: http://kasperdeng.github.io/2017/03/16/zabbix-ha-on-aws/
Copyright: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC 4.0. Please credit Kasper Deng's blog when reposting!