美国服务器如何使用Prometheus + Grafana 搭建炫酷的服务器监控面板?
美国服务器本文将详细介绍如何搭建功能强大、视觉炫酷的服务器监控系统,使用Prometheus进行指标收集,Grafana进行数据可视化。
一、架构设计与环境准备
监控系统架构
text
┌─────────────────┐ ┌──────────────────┐ ┌─────────────────┐ │ 监控目标 │ │ Prometheus │ │ Grafana │ │ - Node Exporter│ ──>│ 数据收集与存储 │ ──>│ 数据可视化 │ │ - 应用指标 │ │ │ │ │ │ - 自定义指标 │ │ │ │ │ └─────────────────┘ └──────────────────┘ └─────────────────┘
环境要求与检查
bash
#!/bin/bash# environment-check.shecho "=== 监控系统环境检查 ==="echo "检查时间: $(date)"echo ""# 系统信息echo "1. 系统信息:"echo " 操作系统: $(lsb_release -d | cut -f2)"echo " 内核版本: $(uname -r)"echo " CPU架构: $(uname -m)"echo ""# 资源检查echo "2. 资源检查:"echo " CPU核心: $(nproc)"echo " 内存: $(free -h | awk '/Mem:/ {print $2}')"echo " 磁盘空间:"df -h / | tail -1echo ""# 端口检查echo "3. 端口可用性检查:"for port in 9090 3000 9100; do
if ss -tulpn | grep ":$port " > /dev/null; then
echo " 端口 $port: 已被占用"
else
echo " 端口 $port: 可用"
fidoneecho ""# 依赖检查echo "4. 依赖检查:"for cmd in wget curl tar systemctl; do
if command -v $cmd &> /dev/null; then
echo " $cmd: 已安装"
else
echo " $cmd: 未安装"
fidoneecho ""echo "环境检查完成"二、Prometheus 安装与配置
2.1 Prometheus 安装
bash
#!/bin/bash# install-prometheus.shset -eecho "开始安装 Prometheus..."echo ""# 创建用户和目录sudo useradd --no-create-home --shell /bin/false prometheussudo mkdir /etc/prometheus /var/lib/prometheussudo chown prometheus:prometheus /etc/prometheus /var/lib/prometheus# 下载 Prometheuscd /tmpPROMETHEUS_VERSION="2.45.0"wget https://github.com/prometheus/prometheus/releases/download/v${PROMETHEUS_VERSION}/prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gz# 解压安装tar xvf prometheus-${PROMETHEUS_VERSION}.linux-amd64.tar.gzcd prometheus-${PROMETHEUS_VERSION}.linux-amd64sudo cp prometheus promtool /usr/local/bin/sudo cp -r consoles console_libraries /etc/prometheus/sudo chown prometheus:prometheus /usr/local/bin/prometheus /usr/local/bin/promtoolsudo chown -R prometheus:prometheus /etc/prometheus/consoles /etc/prometheus/console_libraries# 创建配置文件sudo tee /etc/prometheus/prometheus.yml > /dev/null <<'EOF'
global:
scrape_interval: 15s
evaluation_interval: 15s
# 告警规则文件
rule_files:
- "alert_rules.yml"
# 告警管理器配置
alerting:
alertmanagers:
- static_configs:
- targets:
# 抓取配置
scrape_configs:
# Prometheus 自身监控
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
scrape_interval: 30s
metrics_path: /metrics
# Node Exporter 监控
- job_name: 'node_exporter'
static_configs:
- targets: ['localhost:9100']
scrape_interval: 15s
# 其他服务器监控(示例)
- job_name: 'web_servers'
static_configs:
- targets: ['192.168.1.101:9100', '192.168.1.102:9100']
scrape_interval: 15s
# 自定义应用监控
- job_name: 'custom_apps'
static_configs:
- targets: ['localhost:8080']
scrape_interval: 30s
metrics_path: /metrics
EOF# 创建告警规则sudo tee /etc/prometheus/alert_rules.yml > /dev/null <<'EOF'
groups:
- name: node_alerts
rules:
- alert: NodeDown
expr: up{job="node_exporter"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "节点 {{ $labels.instance }} 宕机"
description: "节点 {{ $labels.instance }} 已经宕机超过 1 分钟"
- alert: HighCPUUsage
expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
for: 2m
labels:
severity: warning
annotations:
summary: "CPU 使用率过高 - {{ $labels.instance }}"
description: "CPU 使用率超过 80% 持续 2 分钟"
- alert: HighMemoryUsage
expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
for: 2m
labels:
severity: warning
annotations:
summary: "内存使用率过高 - {{ $labels.instance }}"
description: "内存使用率超过 90% 持续 2 分钟"
- alert: DiskSpaceLow
expr: (1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100 > 85
for: 2m
labels:
severity: warning
annotations:
summary: "磁盘空间不足 - {{ $labels.instance }}"
description: "磁盘使用率超过 85%"
EOFsudo chown prometheus:prometheus /etc/prometheus/*.yml# 创建 systemd 服务sudo tee /etc/systemd/system/prometheus.service > /dev/null <<'EOF'
[Unit]
Description=Prometheus Time Series Collection and Processing Server
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
--config.file /etc/prometheus/prometheus.yml \
--storage.tsdb.path /var/lib/prometheus/ \
--web.console.templates=/etc/prometheus/consoles \
--web.console.libraries=/etc/prometheus/console_libraries \
--web.listen-address=0.0.0.0:9090 \
--web.external-url=
Restart=always
[Install]
WantedBy=multi-user.target
EOF# 启动服务sudo systemctl daemon-reloadsudo systemctl enable prometheussudo systemctl start prometheusecho ""echo "Prometheus 安装完成!"echo "访问地址: http://$(curl -s ifconfig.me):9090"echo "检查状态: sudo systemctl status prometheus"2.2 Node Exporter 安装
bash
#!/bin/bash# install-node-exporter.shset -eecho "开始安装 Node Exporter..."echo ""# 创建用户sudo useradd --no-create-home --shell /bin/false node_exporter# 下载 Node Exportercd /tmpNODE_EXPORTER_VERSION="1.6.1"wget https://github.com/prometheus/node_exporter/releases/download/v${NODE_EXPORTER_VERSION}/node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gz# 解压安装tar xvf node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64.tar.gzcd node_exporter-${NODE_EXPORTER_VERSION}.linux-amd64sudo cp node_exporter /usr/local/bin/sudo chown node_exporter:node_exporter /usr/local/bin/node_exporter# 创建 systemd 服务sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<'EOF'
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target
[Service]
User=node_exporter
Group=node_exporter
Type=simple
ExecStart=/usr/local/bin/node_exporter \
--web.listen-address=:9100 \
--collector.systemd \
--collector.systemd.unit-whitelist=(sshd|nginx|mysql|docker).service \
--collector.processes \
--collector.tcpstat
Restart=always
[Install]
WantedBy=multi-user.target
EOF# 启动服务sudo systemctl daemon-reloadsudo systemctl enable node_exportersudo systemctl start node_exporterecho ""echo "Node Exporter 安装完成!"echo "访问地址: http://$(curl -s ifconfig.me):9100"echo "检查状态: sudo systemctl status node_exporter"echo "指标端点: http://$(curl -s ifconfig.me):9100/metrics"2.3 高级 Prometheus 配置
yaml
# /etc/prometheus/advanced_config.ymlglobal: scrape_interval: 15s external_labels: cluster: 'production' environment: 'prod'# 远程写入配置(可选,用于长期存储)remote_write: - url: "http://thanos-receive:10908/api/v1/receive" queue_config: capacity: 2500 max_shards: 200 min_shards: 1# 远程读取配置(可选)remote_read: - url: "http://thanos-query:10902/api/v1/read"# 服务发现配置示例scrape_configs: - job_name: 'node_exporter' scrape_interval: 15s consul_sd_configs: - server: 'localhost:8500' services: ['node_exporter'] relabel_configs: - source_labels: [__meta_consul_node] target_label: instance # 文件服务发现 - job_name: 'file_sd_nodes' file_sd_configs: - files: - /etc/prometheus/targets/nodes.yml refresh_interval: 5m # Blackbox Exporter(网络探测) - job_name: 'blackbox_http' metrics_path: /probe params: module: [http_2xx] static_configs: - targets: - 'https://example.com' - 'https://google.com' relabel_configs: - source_labels: [__address__] target_label: __param_target - source_labels: [__param_target] target_label: instance - target_label: __address__ replacement: localhost:9115
三、Grafana 安装与配置
3.1 Grafana 安装
bash
#!/bin/bash# install-grafana.shset -eecho "开始安装 Grafana..."echo ""# 添加 Grafana 仓库sudo apt-get install -y software-properties-common wgetwget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -echo "deb https://packages.grafana.com/oss/deb stable main" | sudo tee -a /etc/apt/sources.list.d/grafana.list# 安装 Grafanasudo apt-get updatesudo apt-get install -y grafana# 配置 Grafanasudo tee /etc/grafana/grafana.ini > /dev/null <<'EOF' [server] protocol = http http_addr = 0.0.0.0 http_port = 3000 domain = localhost root_url = %(protocol)s://%(domain)s:%(http_port)s/ enable_gzip = true [database] type = sqlite3 path = grafana.db [security] admin_user = admin admin_password = admin123 secret_key = SW2YcwTIb9zpOOhoPsMm [users] allow_sign_up = false auto_assign_org = true auto_assign_org_role = Viewer [auth] disable_login_form = false disable_signout_menu = false [analytics] reporting_enabled = false check_for_updates = false [log] mode = console file level = info [grafana_net] url = https://grafana.net EOF# 启动服务sudo systemctl daemon-reloadsudo systemctl enable grafana-serversudo systemctl start grafana-serverecho ""echo "Grafana 安装完成!"echo "访问地址: http://$(curl -s ifconfig.me):3000"echo "用户名: admin"echo "密码: admin123"echo "检查状态: sudo systemctl status grafana-server"
3.2 数据源配置脚本
bash
#!/bin/bash# configure-grafana-datasource.shset -eecho "配置 Grafana 数据源..."echo ""GRAFANA_URL="http://localhost:3000"USERNAME="admin"PASSWORD="admin123"# 等待 Grafana 启动echo "等待 Grafana 启动..."until curl -s $GRAFANA_URL/api/health; do
sleep 2done# 配置 Prometheus 数据源curl -X POST \
-H "Content-Type: application/json" \
-d '{
"name": "Prometheus",
"type": "prometheus",
"url": "http://localhost:9090",
"access": "proxy",
"isDefault": true,
"jsonData": {
"timeInterval": "15s",
"queryTimeout": "60s",
"httpMethod": "POST",
"manageAlerts": true,
"prometheusType": "Prometheus",
"prometheusVersion": "2.45.0",
"cacheLevel": "High",
"disableMetricsLookup": false
}
}' \
http://$USERNAME:$PASSWORD@localhost:3000/api/datasourcesecho ""echo "Prometheus 数据源配置完成!"# 配置 Loki 数据源(日志)curl -X POST \
-H "Content-Type: application/json" \
-d '{
"name": "Loki",
"type": "loki",
"url": "http://localhost:3100",
"access": "proxy"
}' \
http://$USERNAME:$PASSWORD@localhost:3000/api/datasourcesecho ""echo "Loki 数据源配置完成!"四、炫酷监控面板配置
4.1 Node Exporter 全面监控面板
创建节点监控仪表板:
bash
#!/bin/bash# create-node-dashboard.shset -eecho "创建 Node Exporter 监控仪表板..."echo ""GRAFANA_URL="http://localhost:3000"USERNAME="admin"PASSWORD="admin123"DASHBOARD_JSON=$(cat << 'EOF'{
"dashboard": {
"id": null, "title": "Node Exporter 全面监控", "tags": ["node", "prometheus", "system"], "timezone": "browser", "panels": [
{
"id": 1, "title": "CPU 使用率", "type": "stat", "targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[1m])) * 100)",
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 0},
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "red", "value": 80}
]
}
}
}
},
{
"id": 2,
"title": "内存使用率",
"type": "stat",
"targets": [
{
"expr": "(1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100",
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 0},
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "orange", "value": 70},
{"color": "red", "value": 90}
]
}
}
}
},
{
"id": 3,
"title": "CPU 使用率趋势",
"type": "timeseries",
"targets": [
{
"expr": "100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=\"idle\"}[1m])) * 100)",
"legendFormat": "{{instance}}",
"refId": "A"
}
],
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 8},
"fieldConfig": {
"defaults": {
"unit": "percent",
"color": {"mode": "palette-classic"}
}
}
},
{
"id": 4,
"title": "内存使用趋势",
"type": "timeseries",
"targets": [
{
"expr": "node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes",
"legendFormat": "已用内存",
"refId": "A"
},
{
"expr": "node_memory_MemAvailable_bytes",
"legendFormat": "可用内存",
"refId": "B"
}
],
"gridPos": {"h": 8, "w": 24, "x": 0, "y": 16},
"fieldConfig": {
"defaults": {
"unit": "bytes",
"color": {"mode": "palette-classic"}
}
}
},
{
"id": 5,
"title": "磁盘 I/O",
"type": "timeseries",
"targets": [
{
"expr": "rate(node_disk_read_bytes_total[1m])",
"legendFormat": "{{device}} 读取",
"refId": "A"
},
{
"expr": "rate(node_disk_written_bytes_total[1m])",
"legendFormat": "{{device}} 写入",
"refId": "B"
}
],
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 24},
"fieldConfig": {
"defaults": {
"unit": "Bps",
"color": {"mode": "palette-classic"}
}
}
},
{
"id": 6,
"title": "网络流量",
"type": "timeseries",
"targets": [
{
"expr": "rate(node_network_receive_bytes_total[1m])",
"legendFormat": "{{device}} 接收",
"refId": "A"
},
{
"expr": "rate(node_network_transmit_bytes_total[1m])",
"legendFormat": "{{device}} 发送",
"refId": "B"
}
],
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 24},
"fieldConfig": {
"defaults": {
"unit": "Bps",
"color": {"mode": "palette-classic"}
}
}
},
{
"id": 7,
"title": "系统负载",
"type": "timeseries",
"targets": [
{
"expr": "node_load1",
"legendFormat": "1分钟负载",
"refId": "A"
},
{
"expr": "node_load5",
"legendFormat": "5分钟负载",
"refId": "B"
},
{
"expr": "node_load15",
"legendFormat": "15分钟负载",
"refId": "C"
}
],
"gridPos": {"h": 8, "w": 12, "x": 0, "y": 32},
"fieldConfig": {
"defaults": {
"color": {"mode": "palette-classic"}
}
}
},
{
"id": 8,
"title": "磁盘使用率",
"type": "bargauge",
"targets": [
{
"expr": "(1 - (node_filesystem_avail_bytes / node_filesystem_size_bytes)) * 100",
"legendFormat": "{{mountpoint}}",
"refId": "A"
}
],
"gridPos": {"h": 8, "w": 12, "x": 12, "y": 32},
"fieldConfig": {
"defaults": {
"unit": "percent",
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "orange", "value": 70},
{"color": "red", "value": 90}
]
}
}
}
}
],
"time": {
"from": "now-6h",
"to": "now"
},
"refresh": "10s"
},
"folderId": 0,
"overwrite": true
}
EOF
)
# 创建仪表板
curl -X POST \
-H "Content-Type: application/json" \
-d "$DASHBOARD_JSON" \
http://$USERNAME:$PASSWORD@localhost:3000/api/dashboards/db
echo ""echo "Node Exporter 监控仪表板创建完成!"4.2 高级监控面板配置
创建企业级监控仪表板:
json
{
"dashboard": {
"title": "企业级服务器监控",
"tags": ["enterprise", "servers", "monitoring"],
"style": "dark",
"timezone": "browser",
"panels": [
{
"id": 1,
"type": "stat",
"title": "服务器状态",
"targets": [{
"expr": "up",
"legendFormat": "{{instance}}"
}],
"gridPos": {"h": 4, "w": 6, "x": 0, "y": 0},
"fieldConfig": {
"defaults": {
"color": {"mode": "thresholds"},
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "green", "value": 1}
]
},
"mappings": [
{
"type": "value",
"options": {
"0": {"text": "离线"},
"1": {"text": "在线"}
}
}
]
}
}
},
{
"id": 2,
"type": "heatmap",
"title": "请求延迟分布",
"targets": [{
"expr": "histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))",
"legendFormat": "P95 延迟"
}],
"gridPos": {"h": 8, "w": 12, "x": 6, "y": 0}
}
],
"templating": {
"list": [
{
"name": "instance",
"type": "query",
"query": "label_values(up, instance)",
"refresh": 1,
"includeAll": true,
"multi": true
},
{
"name": "job",
"type": "query",
"query": "label_values(up, job)",
"refresh": 1,
"includeAll": true,
"multi": false
}
]
},
"annotations": {
"list": [{
"name": "告警事件",
"datasource": "Prometheus",
"enable": true,
"expr": "ALERTS",
"iconColor": "red"
}]
}
}}五、高级功能与集成
5.1 告警配置
bash
#!/bin/bash# setup-alerting.shecho "配置 Grafana 告警..."echo ""# 创建告警通道(邮件)curl -X POST \
-H "Content-Type: application/json" \
-d '{
"name": "Email Alerts",
"type": "email",
"settings": {
"addresses": "admin@example.com",
"singleEmail": true
}
}' \
http://admin:admin123@localhost:3000/api/alert-notificationsecho ""echo "邮件告警通道配置完成!"# 创建告警通道(Slack)curl -X POST \
-H "Content-Type: application/json" \
-d '{
"name": "Slack Alerts",
"type": "slack",
"settings": {
"url": "https://hooks.slack.com/services/XXX/XXX/XXX",
"recipient": "#monitoring",
"username": "Grafana"
}
}' \
http://admin:admin123@localhost:3000/api/alert-notificationsecho ""echo "Slack 告警通道配置完成!"5.2 自动发现配置
yaml
# /etc/prometheus/targets/nodes.yml- targets: - '192.168.1.101:9100' - '192.168.1.102:9100' - '192.168.1.103:9100' labels: environment: 'production' role: 'web-server'- targets: - '192.168.1.201:9100' - '192.168.1.202:9100' labels: environment: 'production' role: 'database'
六、维护与优化
6.1 备份脚本
bash
#!/bin/bash# backup-monitoring.shBACKUP_DIR="/backup/monitoring"DATE=$(date +%Y%m%d_%H%M%S)echo "开始备份监控系统..."echo ""# 创建备份目录mkdir -p $BACKUP_DIR/$DATE# 备份 Prometheus 数据echo "备份 Prometheus 数据..."sudo systemctl stop prometheussudo tar -czf $BACKUP_DIR/$DATE/prometheus_data.tar.gz /var/lib/prometheus/sudo systemctl start prometheus# 备份 Grafana 配置echo "备份 Grafana 配置..."sudo tar -czf $BACKUP_DIR/$DATE/grafana_config.tar.gz /etc/grafana/# 备份仪表板echo "备份 Grafana 仪表板..."curl -s http://admin:admin123@localhost:3000/api/search | jq -r '.[] | select(.type == "dash-db") | .uid' | while read uid; do
curl -s http://admin:admin123@localhost:3000/api/dashboards/uid/$uid > $BACKUP_DIR/$DATE/dashboard_${uid}.jsondone# 清理旧备份(保留30天)find $BACKUP_DIR -type d -mtime +30 -exec rm -rf {} \;echo ""echo "备份完成: $BACKUP_DIR/$DATE"6.2 性能优化配置
ini
# /etc/prometheus/prometheus.yml - 优化版本global: scrape_interval: 30s scrape_timeout: 10s evaluation_interval: 30s# 优化存储配置storage: tsdb: retention: 15d out_of_order_time_window: 2h# 优化资源限制query: lookback-delta: 5m max-concurrency: 20 timeout: 2m# WAL 配置wal: segment_size: 128MB truncate_frequency: 2h
七、监控系统验证
7.1 系统健康检查
bash
#!/bin/bash# monitoring-health-check.shecho "=== 监控系统健康检查 ==="echo "检查时间: $(date)"echo ""# 检查服务状态echo "1. 服务状态检查:"for service in prometheus node_exporter grafana-server; do if systemctl is-active --quiet $service; then echo " ✅ $service: 运行中" else echo " ❌ $service: 未运行" fidoneecho ""# 检查端口监听echo "2. 端口监听检查:"for port in 9090 9100 3000; do if ss -tulpn | grep ":$port " > /dev/null; then echo " ✅ 端口 $port: 监听中" else echo " ❌ 端口 $port: 未监听" fidoneecho ""# 检查数据收集echo "3. 数据收集检查:"PROMETHEUS_URL="http://localhost:9090"if curl -s "$PROMETHEUS_URL/api/v1/query?query=up" | grep -q "result"; then echo " ✅ Prometheus 数据收集正常"else echo " ❌ Prometheus 数据收集异常"fiecho ""# 检查指标端点echo "4. 指标端点检查:"if curl -s http://localhost:9100/metrics | grep -q "node_"; then echo " ✅ Node Exporter 指标正常"else echo " ❌ Node Exporter 指标异常"fiecho ""# 检查 Grafana 访问echo "5. Grafana 访问检查:"if curl -s http://localhost:3000/api/health | grep -q "Database"; then echo " ✅ Grafana 访问正常"else echo " ❌ Grafana 访问异常"fiecho ""echo "健康检查完成"
总结
通过本文的详细指南,你已经成功搭建了一个功能强大、视觉炫酷的服务器监控系统:
核心组件:
Prometheus - 高性能指标收集和存储
Node Exporter - 系统级指标暴露
Grafana - 强大的数据可视化
关键特性:
实时监控:15秒级别的数据收集和展示
多维度监控:CPU、内存、磁盘、网络等全面覆盖
智能告警:基于阈值的自动告警机制
炫酷可视化:现代化的仪表板和图表
易于扩展:支持多节点和服务发现
最佳实践建议:
定期备份:配置和数据的定期备份
容量规划:根据数据量规划存储空间
安全加固:配置适当的访问控制
性能监控:监控监控系统本身的性能
文档维护:记录配置变更和故障处理
后续扩展方向:
集成应用级监控(Java, Python, Go应用)
添加日志监控(Loki + Promtail)
配置分布式存储(Thanos/Cortex)
实现多区域监控
添加业务指标监控
这个监控系统将为你的服务器基础设施提供全面的可视性和可靠性保障

关键词:

扫码关注
微信好友
关注抖音