📊 Lesson 11: RabbitMQ Monitoring

⏱️ Reading time: 20 minutes | 📚 Difficulty: Advanced

🎯 After this lesson, you will be able to:

Set up Prometheus + Grafana for RabbitMQ
Identify the key metrics to monitor
Configure alerting rules
Troubleshoot common issues

1. Key Metrics

Metric	Description	Alert when
Queue depth	Number of messages in the queue	> 10,000 messages
Consumer count	Number of active consumers	= 0 (no consumers!)
Publish rate	Messages published per second	Sudden spike, or = 0
Deliver rate	Messages delivered to consumers per second	Far below the publish rate
Unacked messages	Messages delivered but not yet ACKed	> 1,000 (consumer stuck)
Memory usage	RAM used by RabbitMQ	> 80% of the memory watermark
Disk free	Free disk space remaining	< 2 GB
Connection count	Number of client connections	> 500, or growing quickly
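
The thresholds in the table can be expressed as a small check function. The following is a minimal sketch, not an official API: the `snapshot` object and all of its field names are hypothetical, and you would populate it yourself from whatever metrics source you use.

```javascript
// Thresholds copied from the table above; tune them for your own workload.
const THRESHOLDS = {
    maxQueueDepth: 10000,
    maxUnacked: 1000,
    maxMemoryRatio: 0.8,              // fraction of the memory watermark
    minDiskFreeBytes: 2 * 1024 ** 3,  // 2 GB
    maxConnections: 500
};

// `snapshot` is a hypothetical plain object; its field names are assumptions
// made for this sketch and do not come from any RabbitMQ API.
function evaluateSnapshot(snapshot, t = THRESHOLDS) {
    const alerts = [];
    if (snapshot.queueDepth > t.maxQueueDepth) alerts.push('queue depth high');
    if (snapshot.consumers === 0) alerts.push('no consumers');
    if (snapshot.publishRate === 0) alerts.push('publish rate is zero');
    if (snapshot.unacked > t.maxUnacked) alerts.push('unacked high');
    if (snapshot.memoryUsed / snapshot.memoryWatermark > t.maxMemoryRatio) {
        alerts.push('memory near watermark');
    }
    if (snapshot.diskFreeBytes < t.minDiskFreeBytes) alerts.push('disk free low');
    if (snapshot.connections > t.maxConnections) alerts.push('too many connections');
    return alerts;
}
```

An empty array means no threshold was crossed. Rate-based conditions such as "sudden spike" need history across several samples, so this sketch leaves them out.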

2. Prometheus + Grafana Setup

# docker-compose-monitoring.yml
version: '3.8'
services:
  rabbitmq:
    image: rabbitmq:3-management
    ports:
      - "5672:5672"
      - "15672:15672"
      - "15692:15692"  # Prometheus metrics
    environment:
      RABBITMQ_DEFAULT_USER: admin
      RABBITMQ_DEFAULT_PASS: secret123
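    # --offline updates the enabled-plugins file without contacting a node,
    # which fits here because the broker has not started yet.
    command: >
      bash -c "rabbitmq-plugins enable --offline rabbitmq_prometheus
      && rabbitmq-server"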

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
    metrics_path: /metrics
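
To see what Prometheus actually scrapes, you can read the text exposition format yourself. The parser below is a simplified sketch: `parseMetrics` is a name invented here, it leaves escaped characters in label values as-is, and it does not handle special values such as `+Inf`.

```javascript
// Parse Prometheus text exposition format into [{ name, labels, value }].
// Simplified sketch: label values are kept as written (no unescaping).
function parseMetrics(text) {
    const samples = [];
    for (const line of text.split('\n')) {
        const trimmed = line.trim();
        // Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if (trimmed === '' || trimmed.startsWith('#')) continue;
        const match = trimmed.match(/^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)/);
        if (!match) continue;
        const labels = {};
        if (match[2]) {
            const labelRe = /([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"/g;
            let m;
            while ((m = labelRe.exec(match[2])) !== null) {
                labels[m[1]] = m[2];
            }
        }
        samples.push({ name: match[1], labels, value: Number(match[3]) });
    }
    return samples;
}

// Possible usage inside an async function (Node.js 18+ for global fetch),
// assuming the port mapping from the compose file above:
//   const text = await (await fetch('http://localhost:15692/metrics')).text();
//   const depth = parseMetrics(text).filter(s => s.name === 'rabbitmq_queue_messages');
```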

3. Management API Health Check

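// health-check.js
const http = require('http');

// Resolves when the node reports status "ok"; rejects otherwise.
// Note: /api/healthchecks/node is reported as deprecated in recent RabbitMQ
// releases; verify it against the HTTP API reference for your version.
function checkRabbitMQ() {
    return new Promise((resolve, reject) => {
        const options = {
            hostname: 'localhost',
            port: 15672,
            path: '/api/healthchecks/node',
            auth: 'admin:secret123' // demo credentials from the compose file
        };

        http.get(options, (res) => {
            let data = '';
            res.on('data', chunk => data += chunk);
            res.on('end', () => {
                let health;
                try {
                    // A non-JSON body (for example a proxy error page) must not crash the process
                    health = JSON.parse(data);
                } catch (err) {
                    reject(new Error(`Unexpected response (HTTP ${res.statusCode}): ${err.message}`));
                    return;
                }
                if (health.status === 'ok') {
                    resolve({ healthy: true, details: health });
                } else {
                    reject(new Error(`Unhealthy: ${health.reason}`));
                }
            });
        }).on('error', reject);
    });
}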

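// Check queue depth (uses the global fetch and btoa available in Node.js 18+)
async function checkQueueDepth(queueName, maxDepth = 10000) {
    // %2f is the URL-encoded default vhost "/"; the queue name is encoded as well
    const res = await fetch(
        `http://localhost:15672/api/queues/%2f/${encodeURIComponent(queueName)}`,
        { headers: { Authorization: 'Basic ' + btoa('admin:secret123') } }
    );
    if (!res.ok) {
        throw new Error(`Management API returned HTTP ${res.status} for queue "${queueName}"`);
    }
    const queue = await res.json();

    return {
        name: queue.name,
        messages: queue.messages,
        consumers: queue.consumers,
        alert: queue.messages > maxDepth
    };
}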
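
The Management API also exposes `/api/queues`, which returns an array of queue objects. As a rough sketch, the helper below filters such an array down to the queues that need attention. `summarizeQueues` is a name invented for this example, and it assumes each object carries `name`, `messages`, `messages_unacknowledged`, and `consumers` fields.

```javascript
// Reduce an array of queue objects to the ones that look unhealthy.
// Missing numeric fields are treated as 0.
function summarizeQueues(queues, maxDepth = 10000) {
    return queues
        .map(q => ({
            name: q.name,
            messages: q.messages ?? 0,
            unacked: q.messages_unacknowledged ?? 0,
            consumers: q.consumers ?? 0
        }))
        // Keep queues that are too deep or have nobody consuming from them.
        .filter(q => q.messages > maxDepth || q.consumers === 0)
        // Deepest queues first.
        .sort((a, b) => b.messages - a.messages);
}
```

Feeding it the parsed JSON from `GET /api/queues` gives a short list suitable for a log line or a dashboard panel.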

4. Alerting Rules

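# Prometheus alerting rules (load this file via rule_files in prometheus.yml)
# Note: the {{ $labels.queue }} annotations need per-queue series. By default
# rabbitmq_prometheus aggregates metrics, so enable per-object metrics
# (for example prometheus.return_per_object_metrics = true in rabbitmq.conf)
# or scrape the /metrics/per-object endpoint instead.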
groups:
  - name: rabbitmq
    rules:
      - alert: RabbitMQQueueDepthHigh
        expr: rabbitmq_queue_messages > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} has {{ $value }} messages"

      - alert: RabbitMQNoConsumers
        expr: rabbitmq_queue_consumers == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Queue {{ $labels.queue }} has no consumers!"

      - alert: RabbitMQHighMemory
        expr: rabbitmq_process_resident_memory_bytes > 1e9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ using > 1GB memory"
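
The `for:` clause means a rule must stay true for the whole duration before the alert fires; until then it is only pending. The sketch below imitates that behaviour in plain JavaScript to make the semantics concrete. It is a simplification (Prometheus evaluates rules at its own evaluation interval), and `createPendingAlert` is a made-up helper name.

```javascript
// Tracks how long a condition has been continuously true, similar to `for:`.
function createPendingAlert(forMs) {
    let since = null; // timestamp at which the condition first became true
    return function update(conditionTrue, nowMs) {
        if (!conditionTrue) {
            since = null; // any false evaluation resets the timer
            return 'inactive';
        }
        if (since === null) since = nowMs;
        return nowMs - since >= forMs ? 'firing' : 'pending';
    };
}

// Example: equivalent of `for: 5m` on a queue depth rule.
const queueDepthAlert = createPendingAlert(5 * 60 * 1000);
```

This is why a short spike above 10,000 messages does not page anyone: the state drops back to inactive as soon as one evaluation is false.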

5. Common Troubleshooting

🔍 Queue depth keeps growing:
• Consumers too slow → Add workers or raise prefetch
• Consumer crashed → Check logs, restart the service
• Poison message → Check the DLQ, fix the handling logic

🔍 Memory alarm triggered:
• Too many messages in queues → Add consumers
• Lazy queues: set x-queue-mode: lazy so messages are stored on disk
• Connection leak → Verify the app closes connections properly

🔍 High unacked message count:
• Consumer receives messages but never sends ACK/NACK
• Consumer processing takes too long → Raise the timeout or optimize the logic
• Bug in the consumer → Check error handling
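
One way to keep unacked messages from piling up is to make sure every delivery ends in exactly one ACK or NACK, even when the handler throws. The wrapper below is a sketch under assumptions: it expects a channel object with `ack(msg)` and `nack(msg, allUpTo, requeue)` methods as provided by the amqplib client, and `handleMessage`, `processOrder`, and the `orders` queue are hypothetical names.

```javascript
// Run `handler` for one delivery and always settle the message afterwards.
async function handleMessage(channel, msg, handler) {
    if (msg === null) return; // amqplib passes null when the consumer is cancelled
    try {
        await handler(msg);
    } catch (err) {
        console.error('handler failed:', err.message);
        // requeue=false avoids an endless redelivery loop for poison messages;
        // with a dead-letter exchange configured, the message is routed there.
        channel.nack(msg, false, false);
        return;
    }
    channel.ack(msg);
}

// Possible usage with amqplib (not executed here):
//   await channel.prefetch(10); // cap in-flight unacked messages per consumer
//   await channel.consume('orders', msg => handleMessage(channel, msg, processOrder));
```

Combining a prefetch limit with this wrapper bounds the unacked count per consumer, so a sustained high value points at slow handlers instead of forgotten acknowledgements.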

📝 Summary

Prometheus + Grafana: The standard monitoring stack for RabbitMQ
Key metrics: Queue depth, consumer count, publish/deliver rate, memory
Alerting: Queue depth > 10K, no consumers, high memory
Health check: Management API /api/healthchecks/node
