Lecture 12 · 60 minutes

Monitoring and Logging Strategies

Learn the monitoring, logging, and distributed tracing strategies used to achieve observability in a microservices environment.

Learning Objectives

  • The three pillars of observability
  • Metric monitoring with Prometheus + Grafana
  • Centralized logging with the ELK Stack
  • Distributed tracing with Jaeger
  • Alerting strategies

The Three Pillars of Observability

1. Metrics

Numeric measurements of system state:

  • CPU and memory usage
  • Request rate (RPS)
  • Latency (p50, p95, p99)
  • Error rate

2. Logs

Detailed records of events:

  • Error stack traces
  • Request/response payloads
  • Business events

3. Traces (Distributed Tracing)

The path a request takes as it flows through multiple services:

[API Gateway] → [Order Service] → [Product Service]
                                ↘ [Inventory Service]
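
In a trace viewer such as Jaeger, this flow appears as a waterfall of timed spans, one per hop (timings illustrative):

[API Gateway]          ├──────────────────────┤  230ms
  [Order Service]        ├───────────────────┤   210ms
    [Product Service]      ├──────┤               60ms
    [Inventory Service]           ├─────┤         45ms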

Prometheus + Grafana

Prometheus Metric Collection

Collecting metrics in a Node.js app:

import express from 'express';
import { Counter, Histogram, Gauge, register } from 'prom-client';

const app = express();

// 1. Counter: cumulative value (can only increase)
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'path', 'status'],
});

// 2. Histogram: measures a distribution (e.g. latency)
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'path'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
});

// 3. Gauge: a current value (can go up or down)
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

// Middleware
app.use((req, res, next) => {
  const start = Date.now();

  activeConnections.inc(); // connection opened

  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;

    // Count the request
    httpRequestsTotal.inc({
      method: req.method,
      path: req.route?.path || req.path,
      status: res.statusCode,
    });

    // Record the latency
    httpRequestDuration.observe(
      {
        method: req.method,
        path: req.route?.path || req.path,
      },
      duration
    );

    activeConnections.dec(); // connection closed
  });

  next();
});

// Business metrics
const ordersCreated = new Counter({
  name: 'orders_created_total',
  help: 'Total orders created',
  labelNames: ['status'],
});

app.post('/orders', async (req, res) => {
  try {
    const order = await createOrder(req.body);
    ordersCreated.inc({ status: 'success' });
    res.json(order);
  } catch (error) {
    ordersCreated.inc({ status: 'error' });
    res.status(500).json({ error: error.message });
  }
});

// Metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

app.listen(3000);
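
Prometheus scrapes GET /metrics and receives the plain-text exposition format. An abbreviated example of the response (values illustrative):

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/products",status="200"} 1027
http_requests_total{method="POST",path="/orders",status="500"} 3

# HELP active_connections Number of active connections
# TYPE active_connections gauge
active_connections 12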

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'product-service'
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
            - production
    relabel_configs:
      # Scrape only pods labeled app=product
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: product
      # Expose the pod name as a "pod" label
      - source_labels: [__meta_kubernetes_pod_name]
        target_label: pod
      # Rewrite the scrape address to the app port (3000)
      - source_labels: [__address__]
        target_label: __address__
        regex: ([^:]+)(?::\d+)?
        replacement: $1:3000

  - job_name: 'order-service'
    static_configs:
      - targets: ['order-service:3000']
        labels:
          service: 'order'

# Alerting rules
rule_files:
  - 'alerts.yml'

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

Alerting Rules

# alerts.yml
groups:
  - name: service_alerts
    interval: 30s
    rules:
      # High Error Rate
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
          / sum(rate(http_requests_total[5m])) by (service)
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "Error rate is {{ $value | humanizePercentage }} (>5%)"

      # High Latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High latency on {{ $labels.service }}"
          description: "p95 latency is {{ $value }}s (>2s)"

      # Pod Down
      - alert: PodDown
        expr: up{job="product-service"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Pod {{ $labels.pod }} is down"
          description: "Pod has been down for more than 1 minute"

      # High Memory Usage
      - alert: HighMemoryUsage
        expr: |
          (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
          / node_memory_MemTotal_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value | humanizePercentage }}"

Grafana Dashboards

Example PromQL queries:

# Request rate (requests per second)
rate(http_requests_total[5m])

# Error rate (share of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))

# p95 Latency
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# Top 5 Slowest Endpoints
topk(5,
  histogram_quantile(0.95,
    sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path)
  )
)
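
Before these queries can back a dashboard, Grafana needs Prometheus registered as a data source. A minimal provisioning sketch (the file path and service URL are typical defaults, not specific to this course's cluster):

# provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true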

Centralized Logging (ELK Stack)

Structured Logging

import winston from 'winston';

const logger = winston.createLogger({
  level: 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  defaultMeta: {
    service: 'product-service',
    version: process.env.APP_VERSION,
  },
  transports: [
    new winston.transports.Console(),
    new winston.transports.File({ filename: 'error.log', level: 'error' }),
    new winston.transports.File({ filename: 'combined.log' }),
  ],
});

// Usage examples
logger.info('Order created', {
  orderId: 'order-123',
  customerId: 'user-456',
  totalAmount: 100.50,
  items: 3,
});

// e.g. inside a catch (error) block
logger.error('Database connection failed', {
  error: error.message,
  stack: error.stack,
  attemptCount: 3,
});

Output (JSON):

{
  "level": "info",
  "message": "Order created",
  "orderId": "order-123",
  "customerId": "user-456",
  "totalAmount": 100.50,
  "items": 3,
  "service": "product-service",
  "version": "1.2.3",
  "timestamp": "2024-12-01T10:30:00.123Z"
}
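
To correlate every log line from a single request, attach a request ID once with winston's child() rather than passing it to each call. A minimal sketch (the requestId value is illustrative):

// Child logger: every line carries requestId alongside service/version
const requestLogger = logger.child({ requestId: 'req-789' });

requestLogger.info('Order created', { orderId: 'order-123' });
requestLogger.info('Payment captured', { orderId: 'order-123' });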

Filebeat (Log Collection)

# filebeat.yml
filebeat.inputs:
  - type: container
    paths:
      - '/var/lib/docker/containers/*/*.log'
    processors:
      - add_kubernetes_metadata:
          host: ${NODE_NAME}
          matchers:
            - logs_path:
                logs_path: "/var/lib/docker/containers/"

output.elasticsearch:
  hosts: ['http://elasticsearch:9200']
  index: "filebeat-%{+yyyy.MM.dd}"

setup.template.name: "filebeat"
setup.template.pattern: "filebeat-*"
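
As configured above, the structured JSON log body reaches Elasticsearch as a single message string. To make fields such as orderId individually searchable, Filebeat's decode_json_fields processor can be appended to the input's processors list (a sketch; adjust the field name to your log shape):

      - decode_json_fields:
          fields: ['message']
          target: ''
          overwrite_keys: true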

Elasticsearch Queries

// Search for error logs
GET /filebeat-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "sort": [{ "@timestamp": "desc" }],
  "size": 100
}

// Trace a specific order
GET /filebeat-*/_search
{
  "query": {
    "match": { "orderId": "order-123" }
  },
  "sort": [{ "@timestamp": "asc" }]
}
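
An aggregation gives the per-service error breakdown over the same window, which is also what backs the Kibana visualizations below (a sketch; assumes the service field from the structured logs):

// Error counts per service over the last hour
GET /filebeat-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        { "match": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "errors_by_service": {
      "terms": { "field": "service.keyword", "size": 10 }
    }
  }
}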

Kibana Dashboards

Example visualizations:

  1. Error Rate Over Time
    • X-axis: @timestamp
    • Y-axis: count of level:error
    • Interval: 5 minutes

  2. Top Error Messages
    • Aggregation: terms on message.keyword
    • Size: 10
    • Order: count desc

  3. Service Health Map
    • Type: tag cloud
    • Field: service.keyword
    • Size by: error count

Distributed Tracing (Jaeger)

OpenTelemetry Setup

import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';

const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'product-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
  }),
  traceExporter: new JaegerExporter({
    endpoint: 'http://jaeger:14268/api/traces',
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-http': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-express': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-pg': {
        enabled: true,
      },
      '@opentelemetry/instrumentation-redis': {
        enabled: true,
      },
    }),
  ],
});

sdk.start();

// Graceful shutdown
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});
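
Auto-instrumentation patches modules as they load, so this setup must run before express, pg, or redis are imported. A common pattern is keeping it in its own file (here, hypothetically, tracing.js) and preloading it with node --require ./tracing.js server.js.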

Custom Spans

import { trace, context, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('product-service');

async function createOrder(orderData) {
  // Create the root span for this operation
  const span = tracer.startSpan('create_order');

  try {
    span.setAttribute('order.customerId', orderData.customerId);
    span.setAttribute('order.itemCount', orderData.items.length);
    span.setAttribute('order.totalAmount', orderData.totalAmount);

    // Child span: check inventory (parented via the active context)
    const inventorySpan = tracer.startSpan(
      'check_inventory',
      undefined,
      trace.setSpan(context.active(), span)
    );

    const available = await checkInventory(orderData.items);
    inventorySpan.setAttribute('inventory.available', available);
    inventorySpan.end();

    if (!available) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: 'Out of stock' });
      throw new Error('Out of stock');
    }

    // Child span: persist the order
    const dbSpan = tracer.startSpan(
      'save_order',
      undefined,
      trace.setSpan(context.active(), span)
    );
    const order = await orderRepo.create(orderData);
    dbSpan.setAttribute('order.id', order.id);
    dbSpan.end();

    span.setStatus({ code: SpanStatusCode.OK });
    return order;
  } catch (error) {
    span.recordException(error);
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: error.message,
    });
    throw error;
  } finally {
    span.end();
  }
}

Trace Context Propagation

Propagating the trace ID via HTTP headers:

// Service A (Order Service)
import axios from 'axios';

async function getProductInfo(productId: string) {
  const span = tracer.startSpan('get_product_info');

  try {
    // Build the W3C traceparent header: version-traceId-spanId-flags
    const ctx = span.spanContext();
    const traceparent = `00-${ctx.traceId}-${ctx.spanId}-${ctx.traceFlags
      .toString(16)
      .padStart(2, '0')}`;

    const response = await axios.get(
      `http://product-service/products/${productId}`,
      { headers: { traceparent } }
    );

    return response.data;
  } finally {
    span.end();
  }
}
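
Note: the HTTP auto-instrumentation enabled earlier injects this header into outgoing requests automatically; building traceparent by hand is only needed for clients the instrumentation does not cover.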

SLI, SLO, SLA

SLI (Service Level Indicator)

Measurable indicators:

  • Availability: 99.9%
  • Latency: p95 < 200ms
  • Error Rate: < 1%

SLO (Service Level Objective)

Target values:

# SLO definitions
slos:
  - name: availability
    target: 99.9
    window: 30d

  - name: latency_p95
    target: 200ms
    window: 7d

  - name: error_rate
    target: 1%
    window: 1h

Error budget calculation:

Availability SLO = 99.9%
Error Budget = 100% - 99.9% = 0.1%

Over 30 days:
30 days * 24 hours * 60 minutes = 43,200 minutes
Error budget = 43,200 * 0.001 = 43.2 minutes

→ Up to 43.2 minutes of downtime is tolerable per month
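
The availability SLI can be computed directly from the request counter defined earlier and compared against the 99.9% target (a PromQL sketch):

# Fraction of successful (non-5xx) requests over the 30-day window
sum(rate(http_requests_total{status!~"5.."}[30d]))
/ sum(rate(http_requests_total[30d]))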

SLA (Service Level Agreement)

A contract with the customer:

- Availability: 99.95%
- If the target is missed: 10% refund of the monthly fee
- Measurement: uptime monitoring at 5-minute intervals
- Exclusions: scheduled maintenance

Integrating Alert Channels

Slack Webhook

import axios from 'axios';

async function sendSlackAlert(message: string) {
  await axios.post(process.env.SLACK_WEBHOOK_URL, {
    text: `🚨 *Alert*`,
    blocks: [
      {
        type: 'section',
        text: {
          type: 'mrkdwn',
          text: message,
        },
      },
      {
        type: 'section',
        fields: [
          {
            type: 'mrkdwn',
            text: `*Service:*\nproduct-service`,
          },
          {
            type: 'mrkdwn',
            text: `*Severity:*\nCritical`,
          },
        ],
      },
      {
        type: 'actions',
        elements: [
          {
            type: 'button',
            text: {
              type: 'plain_text',
              text: 'View Dashboard',
            },
            url: 'https://grafana.example.com',
          },
          {
            type: 'button',
            text: {
              type: 'plain_text',
              text: 'View Logs',
            },
            url: 'https://kibana.example.com',
          },
        ],
      },
    ],
  });
}

PagerDuty (On-call Management)

import axios from 'axios';

// Shape of the incident we report (a type defined for this example)
interface Incident {
  summary: string;
  severity: 'critical' | 'error' | 'warning' | 'info';
  errorRate: number;
  affectedUsers: number;
}

async function triggerPagerDuty(incident: Incident) {
  await axios.post('https://events.pagerduty.com/v2/enqueue', {
    routing_key: process.env.PAGERDUTY_KEY,
    event_action: 'trigger',
    payload: {
      summary: incident.summary,
      severity: incident.severity,
      source: 'product-service',
      custom_details: {
        error_rate: incident.errorRate,
        affected_users: incident.affectedUsers,
      },
    },
  });
}
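
The severity-based split can also be driven by Alertmanager itself, so critical alerts page on-call while warnings go to Slack. A minimal routing sketch (the webhook URL and routing key are placeholders):

# alertmanager.yml
route:
  receiver: slack-warnings
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall

receivers:
  - name: slack-warnings
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'
        channel: '#alerts'
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: '<PAGERDUTY_KEY>'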

Key Takeaways

  • Observability = Metrics + Logs + Traces
  • Monitor metrics in real time with Prometheus + Grafana
  • Analyze logs centrally with the ELK Stack
  • Trace requests across services with Jaeger
  • Structured logging keeps logs searchable
  • Set SLOs and manage the error budget
  • Route alerts to different channels by severity
  • Monitoring is prevention, not just reaction after the fact