Daya

DevOps Best Practices - Observability Stack

Overview

This document outlines the DevOps best practices applied to the observability stack configuration.

Architecture Principles

1. Containerization Best Practices

2. Security Practices

Environment Variables

Volume Security

Network Security

3. Data Persistence

Volume Strategy

Retention Policies

4. Service Configuration

Grafana

Prometheus

Alertmanager

Loki

Promtail

5. Network Architecture

┌─────────────────────────────────────────┐
│      monitoring-network (bridge)        │
│                                         │
│  ┌──────────┐  ┌──────────┐  ┌────────┐ │
│  │Prometheus│  │   Loki   │  │Grafana │ │
│  └──────────┘  └──────────┘  └────────┘ │
│       │             │            │      │
│  ┌──────────┐  ┌──────────┐             │
│  │Alertmgr  │  │ Promtail │             │
│  └──────────┘  └──────────┘             │
└─────────────────────────────────────────┘
         │
         │ (via host.docker.internal)
         ▼
┌─────────────────────────────────────────┐
│         Host Machine                    │
│  Backend: localhost:8080                │
└─────────────────────────────────────────┘

6. Health Monitoring

All services implement health checks:

healthcheck:
  test: ["CMD-SHELL", "wget --no-verbose --tries=1 --spider <endpoint> || exit 1"]
  interval: 30s
  timeout: 10s
  retries: 3
  start_period: 30s  # 30s to 40s, depending on the service

7. Port Management

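| Service      | Host Port | Container Port | Purpose                               |
|--------------|-----------|----------------|---------------------------------------|
| Grafana      | 3030      | 3000           | Web UI (avoids conflict with Next.js) |
| Prometheus   | 9090      | 9090           | Metrics collection                    |
| Alertmanager | 9093      | 9093           | Alert management                      |
| Loki         | 3100      | 3100           | Log aggregation API                   |
| Promtail     | -         | -              | Internal log collection               |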

8. Startup Dependencies

depends_on:
  - prometheus  # Grafana needs Prometheus datasource
  - loki       # Grafana needs Loki datasource

9. Environment Variables

Required for Production

Optional (Notification Channels)

10. Maintenance Operations

Backup Strategy

# Backup Grafana data (quote $(pwd) so paths containing spaces still work)
docker run --rm -v monitoring_grafana-data:/data -v "$(pwd)":/backup \
  alpine tar czf /backup/grafana-backup-$(date +%Y%m%d).tar.gz /data

# Backup Prometheus data
docker run --rm -v monitoring_prometheus-data:/data -v "$(pwd)":/backup \
  alpine tar czf /backup/prometheus-backup-$(date +%Y%m%d).tar.gz /data
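
The backup commands above archive /data, so each tarball stores its files under data/ and can be unpacked at the container root to repopulate the volume. The following is a minimal restore sketch under that assumption. The helper names backup_name and restore_volume and the DRY_RUN switch are hypothetical, introduced here for illustration only.

```shell
#!/bin/sh
# Sketch only: helper names and DRY_RUN are illustrative, not part of the stack.

# Build the archive name used by the backup commands above,
# e.g. backup_name grafana 20240101 -> grafana-backup-20240101.tar.gz
backup_name() {
  printf '%s-backup-%s.tar.gz\n' "$1" "$2"
}

# Restore an archive from the current directory into a named volume.
# The archive contains data/..., so extracting at / refills /data,
# which is where the volume is mounted.
# Usage: restore_volume <volume> <archive-file>
restore_volume() {
  volume=$1
  archive=$2
  set -- docker run --rm \
    -v "$volume:/data" -v "$(pwd):/backup" \
    alpine tar xzf "/backup/$archive" -C /
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$@"   # print the command instead of running it
  else
    "$@"
  fi
}
```

For example, `DRY_RUN=1 restore_volume monitoring_grafana-data "$(backup_name grafana 20240101)"` prints the docker command without running it. Stopping the stack before a real restore is probably the safer choice, since the services may hold files in the volume open.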

Log Rotation

Update Strategy

  1. Stop services: docker compose -f monitoring/docker-compose.observability.yml down
  2. Pull new images: docker compose -f monitoring/docker-compose.observability.yml pull
  3. Start services: docker compose -f monitoring/docker-compose.observability.yml up -d
  4. Verify health: Check health endpoints
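
The four steps above can be sketched as one script that stops at the first failing step. This is a sketch rather than a definitive procedure. The names STACK_FILE, run, update_stack and DRY_RUN are assumptions made for illustration, and the final step uses the `ps` command from the troubleshooting section as a stand-in for a fuller health check.

```shell
#!/bin/sh
# Sketch only: STACK_FILE, run, update_stack and DRY_RUN are illustrative names.
STACK_FILE=${STACK_FILE:-monitoring/docker-compose.observability.yml}

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = 1 ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Steps 1-4 in order; && stops the sequence at the first failure.
update_stack() {
  run docker compose -f "$STACK_FILE" down &&
  run docker compose -f "$STACK_FILE" pull &&
  run docker compose -f "$STACK_FILE" up -d &&
  run docker compose -f "$STACK_FILE" ps
}
```

Calling `DRY_RUN=1 update_stack` first prints the four commands so they can be reviewed before a real run.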

11. Troubleshooting

Check Service Health

# All services
docker compose -f monitoring/docker-compose.observability.yml ps

# Individual service logs
docker compose -f monitoring/docker-compose.observability.yml logs <service>

# Health check manually
curl http://localhost:3030/api/health  # Grafana
curl http://localhost:9090/-/healthy   # Prometheus
curl http://localhost:9093/-/healthy  # Alertmanager
curl http://localhost:3100/ready       # Loki
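
# The four manual checks above can be wrapped in one loop that reports each service and fails if any probe fails. A minimal sketch, assuming the ports listed in this document; HEALTH_ENDPOINTS and check_stack_health are hypothetical names, not part of the stack.

```shell
#!/bin/sh
# Sketch only: HEALTH_ENDPOINTS and check_stack_health are illustrative names.
# Endpoints are the ones listed above, as "name=url" pairs.
HEALTH_ENDPOINTS="
grafana=http://localhost:3030/api/health
prometheus=http://localhost:9090/-/healthy
alertmanager=http://localhost:9093/-/healthy
loki=http://localhost:3100/ready
"

# Probe every endpoint, print one line per service, and
# return non-zero if any probe fails.
check_stack_health() {
  failed=0
  for entry in $HEALTH_ENDPOINTS; do
    name=${entry%%=*}   # text before the first "="
    url=${entry#*=}     # text after the first "="
    # -f: treat HTTP errors as failures; -sS: quiet, but still show errors;
    # --max-time: do not hang on an unresponsive service
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name"
      failed=1
    fi
  done
  return "$failed"
}
```

# Because the function returns non-zero on any failure, it can gate a follow-up step, for example: check_stack_health && echo "stack healthy"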

Port Conflicts

# Check port usage
lsof -i :3030
lsof -i :9090
lsof -i :9093
lsof -i :3100

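# Stop the process listening on a port (replace PORT).
# -sTCP:LISTEN skips clients that are merely connected to the port.
# Escalate to kill -9 only if the process ignores a plain kill.
kill $(lsof -t -i:PORT -sTCP:LISTEN)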
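
# The per-port checks above can be folded into one report covering every port the stack publishes. A minimal sketch, assuming lsof is installed; STACK_PORTS, port_listeners and report_ports are hypothetical names used for illustration.

```shell
#!/bin/sh
# Sketch only: STACK_PORTS, port_listeners and report_ports are illustrative names.
STACK_PORTS="3030 9090 9093 3100"

# Print the PIDs listening on a TCP port (no output if the port is free).
# -sTCP:LISTEN limits matches to listeners, so clients connected to
# the port are not reported.
port_listeners() {
  lsof -t -iTCP:"$1" -sTCP:LISTEN 2>/dev/null
}

# Print one status line per stack port.
report_ports() {
  for port in $STACK_PORTS; do
    pids=$(port_listeners "$port" | tr '\n' ' ')
    pids=${pids% }   # drop the trailing space left by tr
    if [ -n "$pids" ]; then
      echo "port $port in use by PID(s): $pids"
    else
      echo "port $port is free"
    fi
  done
}
```

# Running report_ports before starting the stack shows which host ports would collide, without killing anything.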

12. Production Considerations

  1. Secrets Management: Use Docker secrets or external secret managers
  2. TLS/SSL: Configure reverse proxy (nginx/traefik) with SSL
  3. Resource Limits: Add CPU/memory limits to services
  4. Monitoring: Monitor the monitoring stack itself
  5. Backup Automation: Schedule regular backups
  6. Access Control: Implement proper authentication/authorization
  7. Network Policies: Restrict network access in production
  8. Log Aggregation: Centralize logs for audit purposes

Compliance