Monitoring and Logging Strategy
Learning Objectives
- The three pillars of observability
- Metric monitoring with Prometheus + Grafana
- Centralized logging with the ELK Stack
- Distributed tracing with Jaeger
- Alerting strategy
The Three Pillars of Observability
1. Metrics
System state measured as numbers:
- CPU and memory usage
- Request rate (RPS)
- Latency (p50, p95, p99)
- Error rate
2. Logs
Detailed records of events:
- Error stack traces
- Request/response payloads
- Business events
3. Traces (Distributed Tracing)
The flow of a request as it passes through multiple services:
[API Gateway] → [Order Service] → [Product Service]
↘ [Inventory Service]
Prometheus + Grafana
Collecting Metrics with Prometheus
Collecting metrics from a Node.js app:
import express from 'express';
import { Counter, Histogram, Gauge, register } from 'prom-client';
const app = express();
// 1. Counter: cumulative value (can only increase)
const httpRequestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total number of HTTP requests',
labelNames: ['method', 'path', 'status'],
});
// 2. Histogram: measures a distribution (e.g., latency)
const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'path'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5, 10],
});
// 3. Gauge: point-in-time value (can increase or decrease)
const activeConnections = new Gauge({
name: 'active_connections',
help: 'Number of active connections',
});
// Middleware
app.use((req, res, next) => {
const start = Date.now();
activeConnections.inc(); // increment active connections
res.on('finish', () => {
const duration = (Date.now() - start) / 1000;
// Count the request
httpRequestsTotal.inc({
method: req.method,
path: req.route?.path || req.path,
status: res.statusCode,
});
// Record latency
httpRequestDuration.observe(
{
method: req.method,
path: req.route?.path || req.path,
},
duration
);
activeConnections.dec(); // decrement active connections
});
next();
});
// Business metrics
const ordersCreated = new Counter({
name: 'orders_created_total',
help: 'Total orders created',
labelNames: ['status'],
});
app.post('/orders', async (req, res) => {
try {
const order = await createOrder(req.body);
ordersCreated.inc({ status: 'success' });
res.json(order);
} catch (error) {
ordersCreated.inc({ status: 'error' });
res.status(500).json({ error: error.message });
}
});
// Metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
app.listen(3000);
Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'product-service'
kubernetes_sd_configs:
- role: pod
namespaces:
names:
- production
relabel_configs:
- source_labels: [__meta_kubernetes_pod_label_app]
action: keep
regex: product
- source_labels: [__meta_kubernetes_pod_name]
target_label: pod
- source_labels: [__address__]
target_label: __address__
regex: ([^:]+)(?::\d+)?
replacement: $1:3000
- job_name: 'order-service'
static_configs:
- targets: ['order-service:3000']
labels:
service: 'order'
# Alerting rules
rule_files:
- 'alerts.yml'
alerting:
alertmanagers:
- static_configs:
- targets: ['alertmanager:9093']
Alerting Rules
# alerts.yml
groups:
- name: service_alerts
interval: 30s
rules:
# High Error Rate
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/ sum(rate(http_requests_total[5m])) by (service)
> 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate on {{ $labels.service }}"
description: "Error rate is {{ $value | humanizePercentage }} (>5%)"
# High Latency
- alert: HighLatency
expr: |
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
) > 2
for: 10m
labels:
severity: warning
annotations:
summary: "High latency on {{ $labels.service }}"
description: "p95 latency is {{ $value }}s (>2s)"
# Pod Down
- alert: PodDown
expr: up{job="product-service"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.pod }} is down"
description: "Pod has been down for more than 1 minute"
# High Memory Usage
- alert: HighMemoryUsage
expr: |
(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)
/ node_memory_MemTotal_bytes > 0.9
for: 5m
labels:
severity: warning
annotations:
summary: "High memory usage on {{ $labels.instance }}"
description: "Memory usage is {{ $value | humanizePercentage }}"
Grafana Dashboards
Example PromQL queries:
# Request rate (requests per second)
rate(http_requests_total[5m])
# Error rate (share of 5xx responses)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# p95 Latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
# Top 5 Slowest Endpoints
topk(5,
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path)
)
)
Centralized Logging (ELK Stack)
Structured Logging
import winston from 'winston';
const logger = winston.createLogger({
level: 'info',
format: winston.format.combine(
winston.format.timestamp(),
winston.format.errors({ stack: true }),
winston.format.json()
),
defaultMeta: {
service: 'product-service',
version: process.env.APP_VERSION,
},
transports: [
new winston.transports.Console(),
new winston.transports.File({ filename: 'error.log', level: 'error' }),
new winston.transports.File({ filename: 'combined.log' }),
],
});
// Usage examples
logger.info('Order created', {
orderId: 'order-123',
customerId: 'user-456',
totalAmount: 100.50,
items: 3,
});
// e.g., from inside a catch (error) block
logger.error('Database connection failed', {
error: error.message,
stack: error.stack,
attemptCount: 3,
});
Output (JSON):
{
"level": "info",
"message": "Order created",
"orderId": "order-123",
"customerId": "user-456",
"totalAmount": 100.50,
"items": 3,
"service": "product-service",
"version": "1.2.3",
"timestamp": "2024-12-01T10:30:00.123Z"
}
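A common extension, shown here as a hedged sketch rather than part of the setup above: attach a request-scoped correlation ID with winston's child() so every log line from one request can be grouped in Kibana. The x-request-id header and the res.locals.log property are illustrative assumptions; app is the Express instance and logger the winston logger from earlier.
import { randomUUID } from 'crypto';

// Request-scoped logger: every line from one request carries the same requestId
app.use((req, res, next) => {
  const requestId = req.headers['x-request-id'] || randomUUID();
  res.locals.log = logger.child({ requestId });
  next();
});

app.post('/orders', async (req, res) => {
  res.locals.log.info('Order received', { customerId: req.body.customerId });
  // ... handle the order, logging through res.locals.log throughout
});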
Filebeat (Log Collection)
# filebeat.yml
filebeat.inputs:
- type: container
paths:
- '/var/lib/docker/containers/*/*.log'
processors:
- add_kubernetes_metadata:
host: ${NODE_NAME}
matchers:
- logs_path:
logs_path: "/var/lib/docker/containers/"
output.elasticsearch:
hosts: ['http://elasticsearch:9200']
index: "filebeat-%{+yyyy.MM.dd}"
setup.template.name: "filebeat"
setup.template.pattern: "filebeat-*"
Elasticsearch Queries
// Search for error logs
GET /filebeat-*/_search
{
"query": {
"bool": {
"must": [
{ "match": { "level": "error" } },
{ "range": { "@timestamp": { "gte": "now-1h" } } }
]
}
},
"sort": [{ "@timestamp": "desc" }],
"size": 100
}
// Trace a specific order
GET /filebeat-*/_search
{
"query": {
"match": { "orderId": "order-123" }
},
"sort": [{ "@timestamp": "asc" }]
}
Kibana Dashboards
Example visualizations:
1. Error Rate Over Time
- X-axis: @timestamp
- Y-axis: Count of level:error
- Interval: 5 minutes
2. Top Error Messages
- Aggregation: Terms on message.keyword
- Size: 10
- Order: Count desc
3. Service Health Map
- Type: Tag Cloud
- Field: service.keyword
- Size by: error count
Distributed Tracing (Jaeger)
OpenTelemetry Setup
Start the SDK before the instrumented modules (Express, pg, Redis) are loaded, for example from a dedicated tracing module that is imported first:
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { JaegerExporter } from '@opentelemetry/exporter-jaeger';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions';
const sdk = new NodeSDK({
resource: new Resource({
[SemanticResourceAttributes.SERVICE_NAME]: 'product-service',
[SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
}),
traceExporter: new JaegerExporter({
endpoint: 'http://jaeger:14268/api/traces',
}),
instrumentations: [
getNodeAutoInstrumentations({
'@opentelemetry/instrumentation-http': {
enabled: true,
},
'@opentelemetry/instrumentation-express': {
enabled: true,
},
'@opentelemetry/instrumentation-pg': {
enabled: true,
},
'@opentelemetry/instrumentation-redis': {
enabled: true,
},
}),
],
});
sdk.start();
// Graceful shutdown
process.on('SIGTERM', () => {
sdk.shutdown().finally(() => process.exit(0));
});
Custom Spans
import { trace, context, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('product-service');
async function createOrder(orderData) {
// Create the root span for this operation
const span = tracer.startSpan('create_order');
// Use it as the active parent context for the child spans below
const ctx = trace.setSpan(context.active(), span);
try {
span.setAttribute('order.customerId', orderData.customerId);
span.setAttribute('order.itemCount', orderData.items.length);
span.setAttribute('order.totalAmount', orderData.totalAmount);
// Child span: inventory check
const inventorySpan = tracer.startSpan('check_inventory', undefined, ctx);
const available = await checkInventory(orderData.items);
inventorySpan.setAttribute('inventory.available', available);
inventorySpan.end();
if (!available) {
span.setStatus({ code: SpanStatusCode.ERROR, message: 'Out of stock' });
throw new Error('Out of stock');
}
// Child span: save the order
const dbSpan = tracer.startSpan('save_order', undefined, ctx);
const order = await orderRepo.create(orderData);
dbSpan.setAttribute('order.id', order.id);
dbSpan.end();
span.setStatus({ code: SpanStatusCode.OK });
return order;
} catch (error) {
span.recordException(error);
span.setStatus({
code: SpanStatusCode.ERROR,
message: error.message,
});
throw error;
} finally {
span.end();
}
}
Trace Context Propagation
Propagating the trace context (W3C traceparent header) across services via HTTP headers:
// Service A (Order Service)
import axios from 'axios';
import { context, propagation, trace } from '@opentelemetry/api';
async function getProductInfo(productId: string) {
const span = tracer.startSpan('get_product_info');
try {
// Inject the W3C Trace Context (traceparent/tracestate) into the headers
const headers: Record<string, string> = {};
propagation.inject(trace.setSpan(context.active(), span), headers);
const response = await axios.get(
`http://product-service/products/${productId}`,
{ headers }
);
return response.data;
} finally {
span.end();
}
}
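On the receiving side, the incoming headers can be turned back into a parent context with propagation.extract. A minimal sketch assuming an Express route in the Product Service (productRepo is a hypothetical repository); with the HTTP/Express auto-instrumentations from the SDK setup enabled, this extraction happens automatically, so the manual version only makes the mechanism explicit.
// Service B (Product Service)
import { context, propagation, trace } from '@opentelemetry/api';

const tracer = trace.getTracer('product-service');

app.get('/products/:id', async (req, res) => {
  // Rebuild the parent context from the incoming traceparent header
  const parentCtx = propagation.extract(context.active(), req.headers);
  const span = tracer.startSpan('get_product', undefined, parentCtx);
  try {
    const product = await productRepo.findById(req.params.id); // hypothetical repository
    res.json(product);
  } finally {
    span.end();
  }
});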
SLI, SLO, SLA
SLI (Service Level Indicator)
Measurable indicators:
- Availability: 99.9%
- Latency: p95 < 200ms
- Error Rate: < 1%
SLO (Service Level Objective)
Target values:
# SLO definitions
slos:
- name: availability
target: 99.9
window: 30d
- name: latency_p95
target: 200ms
window: 7d
- name: error_rate
target: 1%
window: 1h
Calculating the error budget:
Availability SLO = 99.9%
Error budget = 100% - 99.9% = 0.1%
Over a 30-day window:
30 days * 24 hours * 60 minutes = 43,200 minutes
Error budget = 43,200 * 0.001 = 43.2 minutes
→ up to 43.2 minutes of downtime allowed per month
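The same arithmetic as a small TypeScript sketch (the function name and signature are illustrative, not from any particular library):
// Error budget in minutes for an availability SLO over a given window
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  const totalMinutes = windowDays * 24 * 60; // 30 days -> 43,200 minutes
  return totalMinutes * (1 - sloTarget);     // 99.9% -> ~43.2 minutes
}

console.log(errorBudgetMinutes(0.999, 30)); // ≈ 43.2 minutes of allowed downtime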
SLA (Service Level Agreement)
A contract with the customer:
- Availability: 99.95%
- If the target is missed: 10% of the monthly fee refunded
- Measurement method: uptime monitoring at 5-minute intervals
- Exclusions: scheduled maintenance
Alert Channel Integrations
Slack Webhook
import axios from 'axios';
async function sendSlackAlert(message: string) {
await axios.post(process.env.SLACK_WEBHOOK_URL, {
text: `🚨 *Alert*`,
blocks: [
{
type: 'section',
text: {
type: 'mrkdwn',
text: message,
},
},
{
type: 'section',
fields: [
{
type: 'mrkdwn',
text: `*Service:*\nproduct-service`,
},
{
type: 'mrkdwn',
text: `*Severity:*\nCritical`,
},
],
},
{
type: 'actions',
elements: [
{
type: 'button',
text: {
type: 'plain_text',
text: 'View Dashboard',
},
url: 'https://grafana.example.com',
},
{
type: 'button',
text: {
type: 'plain_text',
text: 'View Logs',
},
url: 'https://kibana.example.com',
},
],
},
],
});
}
PagerDuty (On-call Management)
import axios from 'axios';
async function triggerPagerDuty(incident: Incident) {
await axios.post('https://events.pagerduty.com/v2/enqueue', {
routing_key: process.env.PAGERDUTY_KEY,
event_action: 'trigger',
payload: {
summary: incident.summary,
severity: incident.severity,
source: 'product-service',
custom_details: {
error_rate: incident.errorRate,
affected_users: incident.affectedUsers,
},
},
});
}
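Alerts are usually routed to different channels depending on severity. A hedged sketch that ties the two integrations above together (the Alert shape and the routing thresholds are illustrative assumptions, not a fixed schema): critical alerts page the on-call engineer via PagerDuty and also post to Slack, warnings go to Slack only.
interface Alert {
  summary: string;
  severity: 'critical' | 'warning' | 'info';
  errorRate?: number;
  affectedUsers?: number;
}

async function routeAlert(alert: Alert) {
  if (alert.severity === 'critical') {
    // Page the on-call engineer and notify the team channel
    await triggerPagerDuty({
      summary: alert.summary,
      severity: alert.severity,
      errorRate: alert.errorRate,
      affectedUsers: alert.affectedUsers,
    });
    await sendSlackAlert(`🚨 ${alert.summary}`);
  } else if (alert.severity === 'warning') {
    await sendSlackAlert(`⚠️ ${alert.summary}`);
  }
  // info-level alerts are left to dashboards and logs
}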
Key Takeaways
- Observability = metrics + logs + traces
- Monitor metrics in real time with Prometheus + Grafana
- Centralize log analysis with the ELK Stack
- Trace requests across services with Jaeger
- Structured logging makes logs easy to search
- Set SLOs and manage the error budget
- Route alerts to different channels based on severity
- Monitoring is about prevention, not just reacting after incidents