← Về danh sách bài họcBài 19/20

📊 Bài 19: Kafka Monitoring & Production

⏱️ Thời gian đọc: 22 phút | 📚 Độ khó: Nâng cao

🎯 Sau bài học này, bạn sẽ:

1. Key Metrics

Category Metric Alert khi
Broker Under-replicated partitions > 0
Broker Active controller count ≠ 1
Broker Request latency (p99) > 100ms
Topic Messages in/sec Đột biến tăng/giảm
Consumer Consumer lag > 10,000
Consumer Consumer lag increasing Liên tục tăng > 5 phút
System Disk usage > 80%
System Network I/O Gần saturation

2. Monitoring Stack

# docker-compose-monitoring.yml
version: '3.8'
services:
  kafka:
    image: confluentinc/cp-kafka:7.5.0
    environment:
      KAFKA_JMX_PORT: 9999
      KAFKA_JMX_HOSTNAME: kafka
      # ... other configs

  jmx-exporter:
    image: bitnami/jmx-exporter:latest
    ports:
      - "5556:5556"
    volumes:
      - ./jmx-config.yml:/etc/jmx-exporter/config.yml
    command: ["5556", "/etc/jmx-exporter/config.yml"]

  prometheus:
    image: prom/prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin

3. Consumer Lag Monitoring

// lag-monitor.js - Custom lag monitoring
const { Kafka } = require('kafkajs');

const kafka = new Kafka({ clientId: 'lag-monitor', brokers: ['localhost:9092'] });
const admin = kafka.admin();

async function checkLag(groupId) {
    await admin.connect();

    const offsets = await admin.fetchOffsets({ groupId, topics: ['orders'] });
    const topicOffsets = await admin.fetchTopicOffsets('orders');

    let totalLag = 0;
    for (const partition of offsets) {
        const endOffset = topicOffsets.find(
            t => t.partition === partition.partition
        );
        const lag = parseInt(endOffset.offset) - parseInt(partition.offset);
        totalLag += lag;
        console.log(`P${partition.partition}: offset=${partition.offset}, end=${endOffset.offset}, lag=${lag}`);
    }

    console.log(`Total lag: ${totalLag}`);

    if (totalLag > 10000) {
        console.log('🚨 ALERT: Consumer lag is too high!');
        // Send alert to Slack/PagerDuty
    }

    await admin.disconnect();
}

// Check lag mỗi 30 giây
setInterval(() => checkLag('order-processors'), 30000);

4. Performance Tuning

# Broker tuning (server.properties)
num.network.threads=8
num.io.threads=16
socket.send.buffer.bytes=102400
socket.receive.buffer.bytes=102400
log.retention.hours=168
log.segment.bytes=1073741824
log.retention.check.interval.ms=300000

# Producer tuning
batch.size=32768
linger.ms=5
compression.type=lz4
buffer.memory=33554432

# Consumer tuning
fetch.min.bytes=1
fetch.max.wait.ms=500
max.poll.records=500
session.timeout.ms=30000
💡 Performance tips:
Producer: Tăng batch.size + linger.ms cho throughput
Consumer: Tăng max.poll.records cho batch processing
Broker: SSD disk, đủ RAM cho page cache
Compression: lz4 cho throughput, zstd cho ratio tốt hơn

5. Production Checklist

📌 Deployment Checklist:
✅ Cluster tối thiểu 3 brokers
✅ Replication factor ≥ 3, min.insync.replicas = 2
SSD storage cho Kafka logs
✅ Dedicated disks cho Kafka (không share với OS)
JVM heap: 6-8GB (không quá 50% RAM)
✅ Network: ≥ 1Gbps between brokers
Monitoring: Prometheus + Grafana + alerting
Security: SSL/TLS, SASL authentication, ACLs
Backup: MirrorMaker 2 cho cross-DC replication
✅ Auto-scaling consumers based on lag

📝 Tóm Tắt