Observability#

Why "observe" a system?#

Imagine driving a car blindfolded. You know you are moving, but you cannot see your speed, fuel level, or engine temperature. Sooner or later something will go wrong, and you will not see it coming.

Running a production system without observability is exactly that. Observability is the ability to understand a system's internal state from its external outputs. More precisely, it is the property of a system that lets you answer arbitrary questions about it without having to access its internals directly.

Observability vs. monitoring

Monitoring means watching indicators that are known in advance (e.g. "alert if CPU > 80%"). You know what you are looking for.

Observability lets you explore unknown behavior (e.g. "why is this particular request slow for this specific user?"). You can ask questions you had not anticipated.

In practice, an observable system can also be monitored well, but the converse does not hold: a monitored system is not necessarily observable.
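
The difference can be sketched with a few lines of Python. This is only an illustration: the event fields (`user`, `route`, `region`, `latency_ms`, `status`) and the helper names are assumptions made for the example, not part of any tool. A monitoring view answers one question chosen in advance, while keeping rich events lets you filter on any dimension after the fact.

```python
from collections import Counter

# Hypothetical request events; every field name here is illustrative.
EVENTS = [
    {"user": "alice", "route": "/checkout", "region": "eu", "latency_ms": 120, "status": 200},
    {"user": "bob", "route": "/checkout", "region": "us", "latency_ms": 95, "status": 200},
    {"user": "carol", "route": "/checkout", "region": "eu", "latency_ms": 2300, "status": 200},
    {"user": "carol", "route": "/cart", "region": "eu", "latency_ms": 80, "status": 200},
    {"user": "dave", "route": "/checkout", "region": "us", "latency_ms": 110, "status": 500},
]


def monitoring_view(events):
    """Monitoring: a pre-aggregated indicator decided in advance (count per status)."""
    return Counter(e["status"] for e in events)


def explore(events, **filters):
    """Observability: keep the events matching arbitrary field=value filters."""
    return [e for e in events if all(e.get(k) == v for k, v in filters.items())]


def slow_requests(events, threshold_ms):
    """Keep only the events slower than the given threshold."""
    return [e for e in events if e["latency_ms"] > threshold_ms]


# Known question: how many errors?
print(monitoring_view(EVENTS))

# Unanticipated question: which /checkout requests were slow, and for whom?
slow = slow_requests(explore(EVENTS, route="/checkout"), threshold_ms=1000)
print([(e["user"], e["region"], e["latency_ms"]) for e in slow])
```

The status counter cannot tell you that one user in one region saw a 2.3 s checkout; the raw events can.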

The three pillars of observability#

The industry has converged on three complementary types of telemetry data, often called the "three pillars":

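# Shared imports for the figures on this page
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.gridspec as gridspec
from matplotlib.patches import FancyBboxPatch

fig, axes = plt.subplots(1, 3, figsize=(15, 6))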

piliers = [
    {
        "nom": "Metrics",
        "emoji": "📊",
        "couleur": "#2196F3",
        "definition": "Numeric values\naggregated over time",
        "exemples": ["CPU: 45%", "Requests/s: 1200", "P99 latency: 250ms", "Errors: 0.3%"],
        "forces": ["Cheap to store", "Efficient alerting", "Aggregation"],
        "limites": ["Little context", "Limited cardinality"]
    },
    {
        "nom": "Logs",
        "emoji": "📝",
        "couleur": "#4CAF50",
        "definition": "Timestamped, structured\ntext events",
        "exemples": ["ERROR: DB timeout", "INFO: User login ok", "WARN: Cache miss", "DEBUG: Query plan"],
        "forces": ["Rich context", "Precise debugging", "Audit trail"],
        "limites": ["High volume", "Indexing cost"]
    },
    {
        "nom": "Traces",
        "emoji": "🔗",
        "couleur": "#FF9800",
        "definition": "Path of a request\nacross services",
        "exemples": ["Span: API 12ms", "Span: Auth 3ms", "Span: DB 45ms", "Span: Cache 1ms"],
        "forces": ["End-to-end view", "Pinpoints bottlenecks", "Dependencies"],
        "limites": ["Sampling required", "Instrumentation overhead"]
    }
]

for ax, pilier in zip(axes, piliers):
    ax.set_xlim(0, 10)
    ax.set_ylim(0, 12)
    ax.axis('off')

    # Titre
    rect = FancyBboxPatch((0.5, 9.5), 9, 2, boxstyle="round,pad=0.2",
                          facecolor=pilier["couleur"], edgecolor='none', alpha=0.9)
    ax.add_patch(rect)
    ax.text(5, 10.5, pilier["nom"], ha='center', va='center',
            fontsize=16, fontweight='bold', color='white')
    ax.text(5, 9.8, pilier["definition"], ha='center', va='center',
            fontsize=9, color='white', alpha=0.9)

    # Examples
    ax.text(5, 9.1, "Examples", ha='center', va='center',
            fontsize=10, fontweight='bold', color=pilier["couleur"])
    for i, ex in enumerate(pilier["exemples"]):
        ax.text(5, 8.4 - i*0.65, f"• {ex}", ha='center', va='center',
                fontsize=8.5, color='#333333',
                bbox=dict(boxstyle="round,pad=0.2", facecolor=pilier["couleur"],
                          alpha=0.1, edgecolor='none'))

    # Strengths
    ax.text(5, 5.5, "✓ Strengths", ha='center', va='center',
            fontsize=10, fontweight='bold', color='#2E7D32')
    for i, f in enumerate(pilier["forces"]):
        ax.text(5, 4.9 - i*0.55, f"+ {f}", ha='center', va='center',
                fontsize=8.5, color='#2E7D32')

    # Limits
    ax.text(5, 3.2, "⚠ Limits", ha='center', va='center',
            fontsize=10, fontweight='bold', color='#C62828')
    for i, lim in enumerate(pilier["limites"]):
        ax.text(5, 2.6 - i*0.55, f"- {lim}", ha='center', va='center',
                fontsize=8.5, color='#C62828')

plt.suptitle("The three pillars of observability", fontsize=16, fontweight='bold', y=1.01)
plt.tight_layout()
plt.savefig("obs_piliers.png", dpi=110, bbox_inches='tight')
plt.show()
print("The three pillars are complementary: metrics alert,")
print("logs explain, traces localize.")

[Figure: The three pillars of observability]

The three pillars are complementary: metrics alert,
logs explain, traces localize.

These three pillars work together. When a metric alert fires (high CPU, say), you consult the logs to understand what happened, then follow a trace to identify which service is responsible.

Kubernetes metrics#

Metric layers#

In a Kubernetes cluster, metrics come from several layers:

fig, ax = plt.subplots(figsize=(14, 8))
ax.axis('off')
ax.set_xlim(0, 14)
ax.set_ylim(0, 9)

couches = [
    {"y": 7.2, "label": "Application", "couleur": "#9C27B0", "alpha": 0.25,
     "outil": "Prometheus client library", "exemples": "requests/s, business errors, processing time"},
    {"y": 5.4, "label": "Container / Pod", "couleur": "#2196F3", "alpha": 0.25,
     "outil": "cAdvisor (built into the kubelet)", "exemples": "container CPU, memory, network I/O"},
    {"y": 3.6, "label": "Kubernetes (objects)", "couleur": "#00BCD4", "alpha": 0.25,
     "outil": "kube-state-metrics", "exemples": "pending pods, available deployments, failed jobs"},
    {"y": 1.8, "label": "Node (OS)", "couleur": "#4CAF50", "alpha": 0.25,
     "outil": "node_exporter", "exemples": "node CPU, memory, disk, network"},
]

for c in couches:
    rect = FancyBboxPatch((0.3, c["y"] - 0.7), 13.4, 1.4,
                          boxstyle="round,pad=0.1",
                          facecolor=c["couleur"], edgecolor=c["couleur"],
                          alpha=c["alpha"], linewidth=2)
    ax.add_patch(rect)
    ax.text(0.8, c["y"], c["label"], ha='left', va='center',
            fontsize=12, fontweight='bold', color=c["couleur"])
    ax.text(4.5, c["y"] + 0.25, f"Tool: {c['outil']}", ha='left', va='center',
            fontsize=9, color='#555555', style='italic')
    ax.text(4.5, c["y"] - 0.25, f"E.g.: {c['exemples']}", ha='left', va='center',
            fontsize=8.5, color='#333333')

# Prometheus scrape arrow
ax.annotate("", xy=(13.2, 7.5), xytext=(13.2, 1.5),
            arrowprops=dict(arrowstyle='<->', color='#FF5722', lw=2))
ax.text(13.5, 4.5, "Prometheus\nscrape", ha='center', va='center',
        fontsize=9, color='#FF5722', fontweight='bold', rotation=90)

ax.set_title("Metric layers in Kubernetes", fontsize=14, fontweight='bold', pad=10)
plt.tight_layout()
plt.savefig("obs_couches_metriques.png", dpi=110, bbox_inches='tight')
plt.show()

[Figure: Metric layers in Kubernetes]

Prometheus exposition format#

Prometheus uses a simple text format. Each application exposes a /metrics endpoint:

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
http_requests_total{method="POST",status="500"} 7

# HELP process_cpu_seconds_total CPU time consumed, in seconds
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 23.45

# HELP http_request_duration_seconds Request duration
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 800
http_request_duration_seconds_bucket{le="0.5"} 1100
http_request_duration_seconds_bucket{le="1.0"} 1200
http_request_duration_seconds_bucket{le="+Inf"} 1234
http_request_duration_seconds_sum 156.3
http_request_duration_seconds_count 1234
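
To make the structure of the format concrete, here is a minimal parsing sketch in Python. It is not the parser Prometheus uses: `parse_exposition` and the two regular expressions are illustrative names, and the sketch deliberately ignores optional timestamps and escape sequences inside label values.

```python
import re

# A sample line is: metric_name{label="value",...} value
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

EXPOSITION = """\
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1234
http_requests_total{method="POST",status="500"} 7

# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 23.45
"""


def parse_exposition(text):
    """Return a list of (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a HELP/TYPE metadata line
        match = SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample this simplified parser understands
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_exposition(EXPOSITION)
for name, labels, value in samples:
    print(name, labels, value)
```

Each distinct combination of metric name and label values is its own time series, which is why adding a high-cardinality label (a user ID, for instance) multiplies the number of series.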

Prometheus metric types#

fig, axes = plt.subplots(2, 2, figsize=(14, 9))

# 1. Counter: monotonically increasing
ax1 = axes[0, 0]
t = np.linspace(0, 60, 300)
# Counter built by accumulating random positive increments
counter = np.cumsum(np.random.exponential(1.2, 300))
ax1.plot(t, counter, color='#2196F3', linewidth=2)
ax1.fill_between(t, counter, alpha=0.15, color='#2196F3')
ax1.set_title("Counter: monotonically increasing", fontweight='bold')
ax1.set_xlabel("Time (s)")
ax1.set_ylabel("http_requests_total")
ax1.text(30, counter[150]*0.3, "Only ever goes up\n(or resets to 0 on restart)",
         ha='center', fontsize=9, color='#1565C0',
         bbox=dict(boxstyle='round,pad=0.3', facecolor='#E3F2FD', edgecolor='#2196F3'))

# 2. Gauge: can go up or down
ax2 = axes[0, 1]
t2 = np.linspace(0, 60, 300)
gauge = 50 + 20*np.sin(t2/8) + 10*np.sin(t2/3) + np.random.normal(0, 3, 300)
ax2.plot(t2, gauge, color='#4CAF50', linewidth=2)
ax2.fill_between(t2, gauge, alpha=0.15, color='#4CAF50')
ax2.axhline(y=80, color='#F44336', linestyle='--', alpha=0.7, label='Alert threshold')
ax2.set_title("Gauge: rises and falls", fontweight='bold')
ax2.set_xlabel("Time (s)")
ax2.set_ylabel("process_memory_bytes")
ax2.legend(fontsize=8)
ax2.text(30, 20, "Can go up AND down\nE.g.: RAM, CPU, temperature",
         ha='center', fontsize=9, color='#2E7D32',
         bbox=dict(boxstyle='round,pad=0.3', facecolor='#E8F5E9', edgecolor='#4CAF50'))

# 3. Histogram: latency distribution in cumulative buckets
ax3 = axes[1, 0]
latences = np.random.lognormal(mean=np.log(0.15), sigma=0.8, size=2000)
latences = latences[latences < 2]
buckets = [0.01, 0.05, 0.1, 0.25, 0.5, 1.0, 2.0]
counts = [np.sum(latences <= b) for b in buckets]
ax3.bar([str(b) for b in buckets], counts, color='#FF9800', edgecolor='white', linewidth=0.5)
ax3.set_title("Histogram: distribution (e.g. latencies)", fontweight='bold')
ax3.set_xlabel("Buckets (seconds)")
ax3.set_ylabel("Cumulative request count")
ax3.text(3.5, max(counts)*0.3, "Lets you estimate\npercentiles (P50, P95, P99)\nfrom the buckets",
         ha='center', fontsize=9, color='#E65100',
         bbox=dict(boxstyle='round,pad=0.3', facecolor='#FFF3E0', edgecolor='#FF9800'))

# 4. Summary (simplified)
ax4 = axes[1, 1]
np.random.seed(42)
t4 = np.arange(0, 60)
p50 = 120 + 10*np.sin(t4/10) + np.random.normal(0, 5, 60)
p95 = p50 * 2.5 + np.random.normal(0, 10, 60)
p99 = p50 * 4 + np.random.normal(0, 20, 60)
ax4.fill_between(t4, p50, p99, alpha=0.15, color='#9C27B0', label='P50–P99')
ax4.plot(t4, p50, color='#9C27B0', linewidth=2, label='P50 (median)')
ax4.plot(t4, p95, color='#9C27B0', linewidth=1.5, linestyle='--', label='P95')
ax4.plot(t4, p99, color='#9C27B0', linewidth=1, linestyle=':', label='P99')
ax4.set_title("Summary: precomputed quantiles", fontweight='bold')
ax4.set_xlabel("Time (s)")
ax4.set_ylabel("Latency (ms)")
ax4.legend(fontsize=8)

plt.suptitle("The four Prometheus metric types", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig("obs_types_metriques.png", dpi=110, bbox_inches='tight')
plt.show()

[Figure: The four Prometheus metric types]
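
Because a counter restarts from 0 when the process restarts, any per-second rate has to compensate for resets. The sketch below shows the idea in Python. It is a simplification, not Prometheus's actual `rate()` implementation (which also extrapolates to the edges of the time window), and the function names and sample values are made up for the example.

```python
def increase(samples):
    """Total increase of a counter over (timestamp_s, value) samples, reset-aware."""
    total = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current < previous:
            # The value dropped: the counter was reset, so everything
            # counted since the restart is new.
            total += current
        else:
            total += current - previous
    return total


def simple_rate(samples):
    """Average per-second increase between the first and last sample."""
    duration = samples[-1][0] - samples[0][0]
    if duration <= 0:
        raise ValueError("need at least two samples at different timestamps")
    return increase(samples) / duration


# Scrapes every 15 s; the process restarts between t=30 and t=45.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]
print(increase(scrapes))
print(simple_rate(scrapes))
```

A gauge needs no such treatment: its current value is meaningful on its own, which is why `rate()` is only applied to counters.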

Prometheus: architecture and PromQL#

Scraping architecture#

Prometheus works in pull mode: it fetches the metrics from each target itself, at regular intervals. The built-in default scrape interval is one minute; the configuration below sets it to 15 seconds, a common choice.

# prometheus.yml: scrape configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'kube-state-metrics'
    static_configs:
      - targets: ['kube-state-metrics:8080']

PromQL: essential queries#

PromQL (Prometheus Query Language) is used to query time series:

# HTTP request rate (per second, averaged over 5 minutes)
rate(http_requests_total[5m])

# Error ratio (both sides are aggregated so their label sets match in the division)
sum(rate(http_requests_total{status=~"5.."}[5m]))
  /
sum(rate(http_requests_total[5m]))

# P95 latency (aggregated across series, keeping the le bucket label)
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# CPU used per pod
sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)

# Available memory on nodes (%)
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes * 100

# Pods that are neither Running nor Succeeded
kube_pod_status_phase{phase!="Running",phase!="Succeeded"} == 1
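
The percentile query estimates a quantile from cumulative buckets by linear interpolation inside the bucket that contains the target rank. The Python sketch below reproduces that idea on the bucket counts from the exposition example earlier on this page. It is a simplified illustration (the function name mirrors the PromQL one, but the code is not Prometheus's), and it works on raw counts where PromQL would normally receive per-second rates; the arithmetic is the same.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative buckets.

    buckets: list of (upper_bound, cumulative_count), sorted by upper bound,
    ending with the +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # The rank falls in the +Inf bucket: the best available
                # answer is the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Buckets from the /metrics example: le="0.1", "0.5", "1.0", "+Inf"
BUCKETS = [(0.1, 800), (0.5, 1100), (1.0, 1200), (math.inf, 1234)]

print(f"P50 ~ {histogram_quantile(0.50, BUCKETS):.4f} s")
print(f"P95 ~ {histogram_quantile(0.95, BUCKETS):.4f} s")
```

The estimate can only be as precise as the bucket boundaries: here P95 lands somewhere between 0.5 s and 1.0 s, so bucket bounds should be chosen close to the latency thresholds you care about.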

# Simulated Prometheus dashboard
np.random.seed(42)
fig = plt.figure(figsize=(16, 10))
gs = gridspec.GridSpec(3, 3, figure=fig, hspace=0.45, wspace=0.35)

t = np.arange(0, 60)  # 60 points = 60 minutes

# 1. Request rate
ax1 = fig.add_subplot(gs[0, :2])
rps = 800 + 200*np.sin(t/15) + 100*np.random.randn(60)
rps_err = 5 + 3*np.abs(np.sin(t/20)) + np.random.exponential(1, 60)
ax1.fill_between(t, rps, alpha=0.3, color='#2196F3')
ax1.plot(t, rps, color='#2196F3', linewidth=2, label='req/s total')
ax1.fill_between(t, rps_err, alpha=0.4, color='#F44336')
ax1.plot(t, rps_err, color='#F44336', linewidth=1.5, label='req/s errors')
ax1.set_title("HTTP request rate", fontweight='bold', fontsize=11)
ax1.set_xlabel("Time (min)")
ax1.set_ylabel("Requests / second")
ax1.legend(fontsize=8)
ax1.set_xlim(0, 59)

# 2. Gauge CPU
ax2 = fig.add_subplot(gs[0, 2])
cpu_val = 62
gauge_colors = ['#4CAF50' if cpu_val < 70 else '#FF9800' if cpu_val < 85 else '#F44336']
ax2.pie([cpu_val, 100-cpu_val], colors=[gauge_colors[0], '#EEEEEE'],
        startangle=90, counterclock=False,
        wedgeprops={'width': 0.5})
ax2.text(0, 0, f"{cpu_val}%", ha='center', va='center',
         fontsize=20, fontweight='bold', color=gauge_colors[0])
ax2.set_title("Mean CPU\n(cluster)", fontweight='bold', fontsize=11)

# 3. Latences P50/P95/P99
ax3 = fig.add_subplot(gs[1, :2])
p50 = 50 + 10*np.sin(t/12) + 5*np.random.randn(60)
p95 = p50 * 2.2 + 20*np.random.exponential(0.5, 60)
p99 = p50 * 4 + 50*np.random.exponential(0.3, 60)
ax3.fill_between(t, p50, p99, alpha=0.1, color='#9C27B0')
ax3.plot(t, p50, color='#4CAF50', linewidth=2, label='P50')
ax3.plot(t, p95, color='#FF9800', linewidth=1.5, linestyle='--', label='P95')
ax3.plot(t, p99, color='#F44336', linewidth=1.5, linestyle=':', label='P99')
ax3.axhline(y=200, color='#F44336', alpha=0.4, linestyle='--', linewidth=1)
ax3.text(59, 205, 'SLO 200ms', ha='right', fontsize=8, color='#F44336')
ax3.set_title("HTTP latencies (ms)", fontweight='bold', fontsize=11)
ax3.set_xlabel("Time (min)")
ax3.set_ylabel("ms")
ax3.legend(fontsize=8)
ax3.set_xlim(0, 59)

# 4. Pods par état
ax4 = fig.add_subplot(gs[1, 2])
etats = ['Running', 'Pending', 'Failed', 'Unknown']
counts = [28, 2, 1, 0]
colors_pods = ['#4CAF50', '#FF9800', '#F44336', '#9E9E9E']
bars = ax4.barh(etats, counts, color=colors_pods)
for bar, count in zip(bars, counts):
    ax4.text(bar.get_width() + 0.2, bar.get_y() + bar.get_height()/2,
             str(count), va='center', fontsize=10, fontweight='bold')
ax4.set_title("Pod status", fontweight='bold', fontsize=11)
ax4.set_xlim(0, 35)
ax4.set_xlabel("Number of pods")

# 5. Mémoire par namespace
ax5 = fig.add_subplot(gs[2, :])
namespaces = ['default', 'monitoring', 'ingress-nginx', 'cert-manager', 'kube-system']
mem_request = [4.2, 2.1, 1.0, 0.5, 3.8]
mem_used = [3.1, 1.8, 0.8, 0.3, 3.2]
x = np.arange(len(namespaces))
width = 0.35
b1 = ax5.bar(x - width/2, mem_request, width, label='Requested (requests)', color='#90CAF9', edgecolor='white')
b2 = ax5.bar(x + width/2, mem_used, width, label='Used (actual)', color='#2196F3', edgecolor='white')
ax5.set_title("Memory per namespace (Gi)", fontweight='bold', fontsize=11)
ax5.set_xticks(x)
ax5.set_xticklabels(namespaces)
ax5.set_ylabel("Gibibytes")
ax5.legend(fontsize=9)

fig.suptitle("Prometheus dashboard (simulated)", fontsize=15, fontweight='bold')
plt.savefig("obs_dashboard_prometheus.png", dpi=110, bbox_inches='tight')
plt.show()

[Figure: Simulated Prometheus dashboard]

Alerting with Prometheus#

Prometheus lets you define alerting rules in PromQL:

# alerting-rules.yml
groups:
  - name: kubernetes-applications
    rules:
      - alert: HighErrorRate
        # Both sides are aggregated by `service` so their label sets match;
        # this assumes the metric carries a `service` label.
        expr: |
          sum by (service) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by (service) (rate(http_requests_total[5m])) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High error rate on {{ $labels.service }}"
          description: "{{ $value | humanizePercentage }} 5xx errors for the last 2 min"

      - alert: PodCrashLooping
        expr: rate(kube_pod_container_status_restarts_total[15m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Pod {{ $labels.pod }} is crash looping"
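
The `for:` clause is what keeps a short blip from paging anyone: the condition must hold continuously for that long before the alert fires. The Python sketch below models this as a small state machine. It is a simplified illustration with hypothetical names (`alert_states`, `eval_interval_s`, `for_s`), not Prometheus code.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation.

    condition_by_eval: one boolean per evaluation (True = expression matched).
    """
    states = []
    active_since = None  # time at which the condition started holding
    for index, condition in enumerate(condition_by_eval):
        now = index * eval_interval_s
        if not condition:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# A short blip: the alert stays pending and never fires.
print(alert_states([True, True, False, True], eval_interval_s=15, for_s=120))

# A sustained problem: firing once the condition has held for 2 minutes.
print(alert_states([True] * 10, eval_interval_s=15, for_s=120))
```

With a 15 s evaluation interval and `for: 2m`, the alert fires on the ninth consecutive matching evaluation, which is the trade-off between noise and detection delay.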

Grafana: visualizing metrics#

Grafana is the de facto standard visualization tool. It connects to Prometheus (and to other data sources) to display interactive dashboards.

# Install via Helm
helm repo add grafana https://grafana.github.io/helm-charts
helm install grafana grafana/grafana \
  --namespace monitoring \
  --set adminPassword='MonMotDePasse' \
  --set datasources."datasources\.yaml".apiVersion=1 \
  --set datasources."datasources\.yaml".datasources[0].name=Prometheus \
  --set datasources."datasources\.yaml".datasources[0].type=prometheus \
  --set datasources."datasources\.yaml".datasources[0].url=http://prometheus-server

# Port-forward to access Grafana
kubectl port-forward svc/grafana 3000:80 -n monitoring

Grafana panels support several visualization types: time series graphs, gauges, heatmaps, tables, and stat panels. You can import community dashboards from grafana.com/dashboards (e.g. dashboard 315 for Kubernetes).

Logs: centralization and analysis#

Logs in Kubernetes#

By convention, containers write their logs to stdout and stderr. Kubernetes captures these streams and exposes them through kubectl logs:

# Logs of a pod
kubectl logs monpod

# Stream logs in real time (follow)
kubectl logs -f monpod

# Logs of a specific container in a multi-container pod
kubectl logs monpod -c mon-conteneur

# Last 100 lines
kubectl logs monpod --tail=100

# Logs from the last hour
kubectl logs monpod --since=1h

# Logs of an entire deployment (all pods); --all-pods is only available in recent kubectl releases
kubectl logs deployment/mon-deploiement --all-pods

Log centralization architecture#

fig, ax = plt.subplots(figsize=(15, 9))
ax.set_xlim(0, 15)
ax.set_ylim(0, 10)
ax.axis('off')

def boite(ax, x, y, w, h, texte, couleur, fontsize=9):
    rect = FancyBboxPatch((x, y), w, h, boxstyle="round,pad=0.15",
                          facecolor=couleur, edgecolor='#333333',
                          alpha=0.85, linewidth=1.5)
    ax.add_patch(rect)
    ax.text(x + w/2, y + h/2, texte, ha='center', va='center',
            fontsize=fontsize, fontweight='bold', color='white',
            wrap=True, multialignment='center')

def fleche(ax, x1, y1, x2, y2, label='', couleur='#555555'):
    ax.annotate("", xy=(x2, y2), xytext=(x1, y1),
                arrowprops=dict(arrowstyle='->', color=couleur, lw=2))
    if label:
        mx, my = (x1+x2)/2, (y1+y2)/2
        ax.text(mx, my + 0.15, label, ha='center', fontsize=7.5,
                color=couleur, style='italic')

# Pods sources
pods = [("Pod A\n(app)", 0.3, 7.5), ("Pod B\n(api)", 0.3, 5.5), ("Pod C\n(worker)", 0.3, 3.5)]
for nom, x, y in pods:
    boite(ax, x, y, 2, 1.2, nom, '#2196F3', fontsize=8)

# Fluent Bit DaemonSet
boite(ax, 3.5, 5.5, 2.5, 1.5, "Fluent Bit\n(DaemonSet)\nCollects & filters", '#FF9800')
for _, x, y in pods:
    fleche(ax, 2.3, y+0.6, 3.5, 6.25, couleur='#2196F3')

# Fluentd (aggregator)
boite(ax, 7.2, 5.5, 2.5, 1.5, "Fluentd\n(Deployment)\nAggregates & routes", '#9C27B0')
fleche(ax, 6.0, 6.25, 7.2, 6.25, "stdout/stderr", '#FF9800')

# Destinations
boite(ax, 11.0, 7.5, 3, 1.2, "Elasticsearch\n(storage)", '#F44336', fontsize=8)
boite(ax, 11.0, 5.5, 3, 1.2, "Loki\n(storage)", '#4CAF50', fontsize=8)
boite(ax, 11.0, 3.5, 3, 1.2, "S3 / GCS\n(archive)", '#607D8B', fontsize=8)

fleche(ax, 9.7, 6.5, 11.0, 8.1, couleur='#9C27B0')
fleche(ax, 9.7, 6.25, 11.0, 6.1, couleur='#9C27B0')
fleche(ax, 9.7, 6.0, 11.0, 4.1, couleur='#9C27B0')

# Kibana / Grafana
boite(ax, 11.0, 1.0, 3, 1.2, "Kibana / Grafana\n(visualization)", '#795548', fontsize=8)
fleche(ax, 12.5, 5.5, 12.5, 2.2, "queries", '#333333')

# Legend for the two stacks
ax.text(7.5, 1.2, "EFK stack: Elasticsearch + Fluentd + Kibana", ha='center',
        fontsize=9, style='italic', color='#F44336',
        bbox=dict(boxstyle='round,pad=0.3', facecolor='#FFEBEE', edgecolor='#F44336', alpha=0.8))
ax.text(7.5, 0.5, "PLG stack: Promtail/Fluent Bit + Loki + Grafana", ha='center',
        fontsize=9, style='italic', color='#4CAF50',
        bbox=dict(boxstyle='round,pad=0.3', facecolor='#E8F5E9', edgecolor='#4CAF50', alpha=0.8))

ax.set_title("Log centralization architecture in Kubernetes", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig("obs_logs_archi.png", dpi=110, bbox_inches='tight')
plt.show()

[Figure: Log centralization architecture in Kubernetes]

Loki vs EFK#

Loki (from Grafana Labs) is often preferred today for its simplicity and lower cost. Unlike Elasticsearch, which indexes the content of the logs, Loki indexes only the labels (metadata). Full-text search is slower because matching lines are scanned at query time, but storage is much cheaper.

# LogQL queries (Loki's query language, modeled on PromQL)
# Error logs from the "production" namespace
{namespace="production"} |= "ERROR"

# Count errors per service
sum by (app) (count_over_time({namespace="production"} |= "ERROR" [5m]))

# Extract structured fields
{app="api"} | json | status_code >= 500
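
The label-only indexing trade-off can be illustrated with a toy store in Python. This is a conceptual sketch, not how Loki is implemented: `TinyLogStore` and its methods are hypothetical names. Labels select whole streams cheaply, and the text filter is then a plain scan over the lines of those streams.

```python
from collections import defaultdict


class TinyLogStore:
    """Toy log store that indexes streams by labels only, never by content."""

    def __init__(self):
        # One stream per unique label set: frozenset of (key, value) -> lines
        self.streams = defaultdict(list)

    def push(self, labels, line):
        self.streams[frozenset(labels.items())].append(line)

    def query(self, selector, contains=None):
        """Select streams by labels, then scan their lines for a substring."""
        wanted = set(selector.items())
        hits = []
        for stream_labels, lines in self.streams.items():
            if wanted <= stream_labels:  # index lookup on labels
                hits.extend(
                    line for line in lines if contains is None or contains in line
                )  # brute-force scan, comparable to `|= "ERROR"`
        return hits


store = TinyLogStore()
store.push({"namespace": "production", "app": "api"}, "ERROR: DB timeout")
store.push({"namespace": "production", "app": "api"}, "INFO: User login ok")
store.push({"namespace": "production", "app": "worker"}, "ERROR: job failed")
store.push({"namespace": "staging", "app": "api"}, "ERROR: ignored in this query")

errors = store.query({"namespace": "production"}, contains="ERROR")
print(sorted(errors))
```

The narrower the label selector, the fewer lines have to be scanned, which is why LogQL queries start with a stream selector before any text filter.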

Distributed traces#

The microservices problem#

In a monolithic architecture, when a request is slow, you look at the stack trace. In a microservices architecture, a single request may cross 10 different services. How do you know which one is responsible for the slowdown?

Distributed traces address this by following a request end to end across all the services it touches.

OpenTelemetry concepts#

OpenTelemetry (OTel) is the open standard for instrumentation. It defines:

Trace: the complete path of a request, identified by a unique trace_id
Span: a unit of work (HTTP call, DB query, etc.) with a duration and attributes
Context propagation: passing the trace_id between services (via HTTP headers)
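
Context propagation over HTTP is commonly done with the W3C Trace Context `traceparent` header, whose value has the form `version-trace_id-parent_span_id-flags`. The Python sketch below builds and forwards such a header by hand to show what travels between services. In a real application an OpenTelemetry SDK does this for you; the helper names here are hypothetical.

```python
import re
import secrets

# version (2 hex) - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
TRACEPARENT_RE = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled=True):
    """Start a new trace: fresh trace id and fresh root span id."""
    trace_id = secrets.token_hex(16)  # 16 random bytes -> 32 hex characters
    span_id = secrets.token_hex(8)    # 8 random bytes -> 16 hex characters
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming_header):
    """Build the header for an outgoing call made while handling a request.

    The trace id is kept so that all spans join the same trace; only the
    span id changes, identifying the current span as the new parent.
    """
    match = TRACEPARENT_RE.match(incoming_header)
    if match is None:
        raise ValueError(f"invalid traceparent: {incoming_header!r}")
    version, trace_id, _parent_span_id, flags = match.groups()
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = new_traceparent()
outgoing = child_traceparent(incoming)
print("received :", incoming)
print("forwarded:", outgoing)
```

Every service that forwards the header this way contributes spans with the same trace id, which is what lets a backend reassemble the full request path.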

# Simulation of a distributed trace

class Span:
    def __init__(self, service, operation, start_ms, duration_ms, parent=None, error=False):
        self.service = service
        self.operation = operation
        self.start_ms = start_ms
        self.duration_ms = duration_ms
        self.parent = parent
        self.error = error
        self.end_ms = start_ms + duration_ms

# Construction d'une trace réaliste
spans = [
    Span("api-gateway",   "HTTP POST /checkout",       0,   320, None),
    Span("auth-service",  "ValidateToken",             5,    18, "api-gateway"),
    Span("cart-service",  "GetCart",                  25,    85, "api-gateway"),
    Span("cart-service",  "Redis GET cart:user123",   28,    12, "cart-service"),
    Span("cart-service",  "Postgres SELECT items",    42,    60, "cart-service"),
    Span("order-service", "CreateOrder",             115,   190, "api-gateway"),
    Span("order-service", "Postgres INSERT order",   120,    45, "order-service"),
    Span("payment-svc",   "ProcessPayment",          170,   130, "order-service"),
    Span("payment-svc",   "Stripe API call",         175,   120, "payment-svc"),
    Span("notif-service", "SendEmail",               305,    10, "order-service"),
]

# Couleurs par service
service_colors = {
    "api-gateway":   "#2196F3",
    "auth-service":  "#9C27B0",
    "cart-service":  "#FF9800",
    "order-service": "#4CAF50",
    "payment-svc":   "#F44336",
    "notif-service": "#00BCD4",
}

fig, ax = plt.subplots(figsize=(15, 7))

for i, span in enumerate(spans):
    y = len(spans) - 1 - i
    color = service_colors[span.service]
    # Barre de span
    rect = FancyBboxPatch((span.start_ms, y + 0.1), span.duration_ms, 0.8,
                          boxstyle="round,pad=0.05",
                          facecolor=color, edgecolor='white',
                          alpha=0.85 if not span.error else 1.0, linewidth=1)
    ax.add_patch(rect)
    if span.error:
        rect2 = FancyBboxPatch((span.start_ms, y + 0.1), span.duration_ms, 0.8,
                               boxstyle="round,pad=0.05",
                               facecolor='none', edgecolor='#F44336',
                               linewidth=3)
        ax.add_patch(rect2)

    # Texte dans la barre
    label = f"{span.service} : {span.operation} ({span.duration_ms}ms)"
    ax.text(span.start_ms + span.duration_ms/2, y + 0.5, label,
            ha='center', va='center', fontsize=7.5,
            color='white', fontweight='bold',
            clip_on=True)

ax.set_xlim(-5, 340)
ax.set_ylim(-0.3, len(spans))
ax.set_xlabel("Temps (ms depuis début de la requête)", fontsize=10)
ax.set_yticks([])
ax.set_title("Distributed trace: /checkout request crossing 6 services\n"
             "Trace ID : a3f7c2b1-9d4e-4a8b-b5c1-7f2e6d3a1b9c",
             fontsize=12, fontweight='bold')

# Service legend
from matplotlib.patches import Patch
legend_elements = [Patch(facecolor=c, label=s) for s, c in service_colors.items()]
ax.legend(handles=legend_elements, loc='lower right', fontsize=8, ncol=3)

# Annotate the slowest leaf span (Stripe call, spans[8], bar centered at y = len(spans) - 8.5)
ax.annotate("Stripe call\n(critical path)", xy=(235, len(spans)-8.5),
            xytext=(280, len(spans)-7.5),
            arrowprops=dict(arrowstyle='->', color='#F44336', lw=1.5),
            fontsize=8.5, color='#F44336', fontweight='bold')

ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.savefig("obs_trace_distribuee.png", dpi=110, bbox_inches='tight')
plt.show()

total = max(s.end_ms for s in spans)
print(f"Total trace duration: {total}ms")
print("Slowest leaf span: payment-svc/Stripe API call (120ms)")
print(f"Number of services involved: {len(set(s.service for s in spans))}")

[Figure: Distributed trace of a /checkout request]

Total trace duration: 320ms
Slowest leaf span: payment-svc/Stripe API call (120ms)
Number of services involved: 6

Jaeger: trace visualization#

Jaeger is a widely used open-source tool for visualizing traces (created at Uber, now a CNCF project):

# Install Jaeger via Helm
helm repo add jaegertracing https://jaegertracing.github.io/helm-charts
helm install jaeger jaegertracing/jaeger \
  --namespace monitoring \
  --set provisionDataStore.cassandra=false \
  --set allInOne.enabled=true \
  --set storage.type=memory

# Port-forward to reach the UI
kubectl port-forward svc/jaeger-query 16686:16686 -n monitoring

To instrument a Python application:

# OpenTelemetry instrumentation (illustrative)
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
exporter = OTLPSpanExporter(endpoint="http://jaeger-collector:4317")
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("mon-service")

def traiter_commande(commande_id):
    with tracer.start_as_current_span("traiter-commande") as span:
        span.set_attribute("commande.id", commande_id)
        # appeler_base_de_donnees stands for the application's own database call
        result = appeler_base_de_donnees(commande_id)
        return result

kube-state-metrics#

kube-state-metrics watches the Kubernetes API and exposes the state of K8s objects as Prometheus metrics. This differs from cAdvisor, which measures the resource consumption of containers:

# Key kube-state-metrics metrics
kube_pod_status_phase{phase="Pending"}               # Pending pods
kube_deployment_status_replicas_unavailable          # Unavailable replicas
kube_node_status_condition{condition="Ready"}        # Node readiness
kube_job_status_failed                              # Failed jobs
kube_persistentvolumeclaim_status_phase             # PVC state
kube_horizontalpodautoscaler_status_current_replicas # Current HPA replicas

SLI / SLO / SLA#

These terms define a service's reliability targets:

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# --- Gauche : définitions hiérarchiques ---
ax = axes[0]
ax.set_xlim(0, 10)
ax.set_ylim(0, 10)
ax.axis('off')

niveaux = [
    (1.5, 7.5, 7, 1.8, "#1565C0", "SLA: Service Level Agreement",
     "Legal contract with customers\nE.g.: 99.9% monthly availability\nOtherwise: financial penalties"),
    (2.5, 5.0, 5, 1.8, "#0288D1", "SLO: Service Level Objective",
     "Internal reliability target\nE.g.: 99.5% of requests < 200ms\nBuffer between SLO and SLA"),
    (3.5, 2.5, 3, 1.8, "#03A9F4", "SLI: Service Level Indicator",
     "Concrete, quantified measurement\nE.g.: success rate = 99.72%\nComes from Prometheus metrics"),
]

for x, y, w, h, couleur, titre, desc in niveaux:
    rect = FancyBboxPatch((x, y), w, h, boxstyle="round,pad=0.2",
                          facecolor=couleur, edgecolor='white', alpha=0.9)
    ax.add_patch(rect)
    ax.text(x + w/2, y + h - 0.4, titre, ha='center', va='top',
            fontsize=9, fontweight='bold', color='white')
    ax.text(x + w/2, y + h/2 - 0.3, desc, ha='center', va='center',
            fontsize=7.5, color='white', alpha=0.9, multialignment='center')

ax.annotate("", xy=(5, 5.0+1.8), xytext=(5, 5.0+1.8+1),
            arrowprops=dict(arrowstyle='->', color='#333', lw=1.5))
ax.annotate("", xy=(5, 2.5+1.8), xytext=(5, 5.0),
            arrowprops=dict(arrowstyle='->', color='#333', lw=1.5))

ax.set_title("SLI / SLO / SLA hierarchy", fontweight='bold', fontsize=12)

# --- Droite : Error Budget ---
ax2 = axes[1]

slo_target = 99.5  # 99.5% availability
t = np.arange(0, 30)  # 30 days

# Simulated availability with a few incidents
np.random.seed(7)
dispo_base = 99.7 + 0.2*np.random.randn(30)
# Incidents on days 8, 19, 25
dispo_base[8] -= 1.5
dispo_base[19] -= 0.8
dispo_base[25] -= 2.1
dispo_base = np.clip(dispo_base, 96, 100)

# Cumulative error budget consumption (minutes of downtime per day, summed)
total_minutes_par_mois = 30 * 24 * 60  # 43200 min
budget_total = total_minutes_par_mois * (100 - slo_target) / 100  # 216 min
budget_consomme = np.cumsum((100 - dispo_base) / 100 * 24 * 60)
budget_restant = budget_total - budget_consomme

colors_line = ['#F44336' if b < 0 else '#4CAF50' for b in budget_restant]

ax2.fill_between(t, budget_restant, 0,
                 where=(budget_restant >= 0), color='#4CAF50', alpha=0.3, label='Budget remaining')
ax2.fill_between(t, budget_restant, 0,
                 where=(budget_restant < 0), color='#F44336', alpha=0.3, label='Budget exhausted')
ax2.plot(t, budget_restant, color='#1565C0', linewidth=2.5)
ax2.axhline(y=0, color='#F44336', linestyle='--', linewidth=1.5, alpha=0.7)
ax2.axhline(y=budget_total, color='#4CAF50', linestyle='--', linewidth=1,
            alpha=0.5, label=f'Total budget ({budget_total:.0f} min)')

ax2.set_title(f"Error budget (SLO = {slo_target}%)\n= allowed downtime",
              fontweight='bold', fontsize=11)
ax2.set_xlabel("Day of the month")
ax2.set_ylabel("Remaining error budget (minutes)")
ax2.legend(fontsize=8)
ax2.set_xlim(0, 29)

# Annotate incidents; each label is the downtime actually consumed that day
for jour in (8, 19, 25):
    val = budget_restant[jour]
    minutes_perdues = (100 - dispo_base[jour]) / 100 * 24 * 60
    ax2.annotate(f"Incident\n-{minutes_perdues:.0f} min", xy=(jour, val), xytext=(jour+1.5, val+20),
                arrowprops=dict(arrowstyle='->', color='#F44336', lw=1),
                fontsize=7.5, color='#F44336')

plt.suptitle("SLI / SLO / SLA et Error Budget", fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig("obs_slo.png", dpi=110, bbox_inches='tight')
plt.show()

print(f"SLO cible : {slo_target}% de disponibilité")
print(f"Error budget mensuel : {budget_total:.0f} minutes ({budget_total/60:.1f}h)")
print(f"Budget consommé ce mois : {budget_consomme[-1]:.0f} minutes")
print(f"Budget restant : {budget_restant[-1]:.0f} minutes")

_images/66d37074b6034ddffa7a9b50bdcd0e60cb70269afef032d61ccbc5786b87cb68.png

SLO cible : 99.5% de disponibilité
Error budget mensuel : 216 minutes (3.6h)
Budget consommé ce mois : 202 minutes
Budget restant : 14 minutes
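
The budget arithmetic used in the cell above generalises to any SLO and any window. The sketch below is illustrative only: `error_budget_minutes` and `burn_rate` are hypothetical helper names, not part of the original notebook, and the burn rate is taken here to mean actual consumption divided by a perfectly even consumption.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


def burn_rate(consumed_minutes: float, elapsed_days: float,
              slo_percent: float, window_days: int = 30) -> float:
    """Actual budget consumption relative to an even, linear consumption.

    A value above 1 means the budget runs out before the window ends
    if the current pace continues.
    """
    budget = error_budget_minutes(slo_percent, window_days)
    expected = budget * elapsed_days / window_days
    return consumed_minutes / expected


for slo in (99.0, 99.5, 99.9):
    print(f"SLO {slo}% -> {error_budget_minutes(slo):.1f} min per 30 days")
```

With the figures printed above (202 minutes consumed over the full 30 days against a 216-minute budget), `burn_rate(202, 30, 99.5)` is about 0.94: the month ends just inside the budget.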

Full simulation: metrics and SLO

_images/694eb759d532b83de4ce206b569abba982f37e63600dc40a1addda627a571274.png

Bilan de la simulation :
  Requêtes totales : 5848
  Erreurs totales  : 106 (1.81%)
  Minutes avec SLO violé (P95 > 500ms) : 1
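
The code behind this simulation is not shown in this section. As a rough idea of how a per-minute latency SLO such as "P95 > 500ms" can be evaluated, here is a minimal sketch on made-up traffic. The function names, the nearest-rank percentile method and the synthetic numbers are all assumptions for illustration, not the author's simulation.

```python
import random
from collections import defaultdict


def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, -(-95 * len(ordered) // 100))  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def violated_minutes(requests, threshold_ms=500):
    """Return the minutes whose P95 latency exceeds the threshold.

    `requests` is an iterable of (minute, latency_ms) pairs.
    """
    by_minute = defaultdict(list)
    for minute, latency_ms in requests:
        by_minute[minute].append(latency_ms)
    return sorted(m for m, lat in by_minute.items() if p95(lat) > threshold_ms)


# Synthetic traffic (made-up numbers): 60 minutes, with a slowdown during minute 42
rng = random.Random(0)
requests = []
for minute in range(60):
    base = 900 if minute == 42 else 120
    requests += [(minute, rng.gauss(base, 30)) for _ in range(100)]

print(violated_minutes(requests))  # [42]
```

Evaluating the percentile per window rather than over the whole run is what makes a short slowdown visible: a single bad minute barely moves a global P95.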

Complete observability architecture

fig, ax = plt.subplots(figsize=(16, 10))
ax.set_xlim(0, 16)
ax.set_ylim(0, 11)
ax.axis('off')

def box(ax, x, y, w, h, text, fc, ec=None, fontsize=9, alpha=0.85):
    ec = ec or fc
    r = FancyBboxPatch((x, y), w, h, boxstyle="round,pad=0.15",
                       facecolor=fc, edgecolor=ec, alpha=alpha, linewidth=1.5)
    ax.add_patch(r)
    ax.text(x+w/2, y+h/2, text, ha='center', va='center',
            fontsize=fontsize, fontweight='bold', color='white',
            multialignment='center')

def arr(ax, x1, y1, x2, y2, label='', color='#555'):
    ax.annotate("", xy=(x2, y2), xytext=(x1, y1),
                arrowprops=dict(arrowstyle='->', color=color, lw=1.8))
    if label:
        ax.text((x1+x2)/2, (y1+y2)/2 + 0.1, label, ha='center',
                fontsize=7, color=color, style='italic')

# Applications
box(ax, 0.2, 7.0, 2.2, 1.2, "App A\n(api)", '#2196F3')
box(ax, 0.2, 5.2, 2.2, 1.2, "App B\n(worker)", '#2196F3')
box(ax, 0.2, 3.4, 2.2, 1.2, "App C\n(frontend)", '#2196F3')

# Node Exporter, kube-state-metrics
box(ax, 0.2, 1.5, 2.2, 1.2, "node-exporter\nkube-state-metrics", '#607D8B', fontsize=7.5)

# Prometheus
box(ax, 3.5, 4.5, 2.5, 2.0, "Prometheus\n\nScrape ↔ Store\nAlerts", '#FF5722')

# AlertManager
box(ax, 3.5, 2.0, 2.5, 1.5, "AlertManager\n(routage alertes)", '#FF9800', fontsize=8)

# Grafana
box(ax, 7.5, 6.0, 2.5, 2.0, "Grafana\n\nDashboards\nAlerting", '#F57C00')

# Loki
box(ax, 3.5, 8.0, 2.5, 1.5, "Loki\n(logs)", '#4CAF50')

# Fluent Bit
box(ax, 0.2, 9.2, 2.2, 1.2, "Fluent Bit\n(DaemonSet)", '#66BB6A', fontsize=8)

# OpenTelemetry Collector
box(ax, 7.5, 3.5, 2.5, 1.8, "OTel\nCollector", '#9C27B0', fontsize=8)

# Jaeger
box(ax, 11.0, 3.5, 2.5, 1.8, "Jaeger\n(traces)", '#CE93D8', fontsize=9)

# Slack / PagerDuty
box(ax, 7.5, 0.8, 2.5, 1.5, "Slack /\nPagerDuty", '#795548', fontsize=8)

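# Arrows
for y in [7.6, 5.8, 4.0]:
    arr(ax, 2.4, y, 3.5, 5.5, 'métriques', '#2196F3')

# Infrastructure exporters are scraped by Prometheus (not by AlertManager);
# Prometheus then sends firing alerts down to AlertManager
arr(ax, 2.4, 2.1, 3.5, 4.8, 'métriques', '#607D8B')
arr(ax, 4.75, 4.5, 4.75, 3.5, '', '#FF5722')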
arr(ax, 2.4, 9.8, 3.5, 9.0, 'logs', '#4CAF50')

for y in [7.6, 5.8, 4.0]:
    arr(ax, 2.4, y, 7.5, 4.4, 'traces', '#9C27B0')

arr(ax, 6.0, 5.5, 7.5, 7.0, 'requêtes PromQL', '#FF5722')
arr(ax, 6.0, 8.75, 7.5, 7.3, 'requêtes LogQL', '#4CAF50')
arr(ax, 10.0, 4.4, 11.0, 4.4, 'OTLP', '#9C27B0')
arr(ax, 6.0, 2.75, 7.5, 1.55, 'alertes', '#FF9800')
arr(ax, 10.0, 7.0, 11.0, 5.0, 'drill-down', '#F57C00')

# Jaeger ↔ Grafana
arr(ax, 11.0, 4.4, 10.0, 7.0, '', '#CE93D8')

ax.text(13.5, 5.5, "Utilisateurs\n/ SRE", ha='center', va='center',
        fontsize=11, fontweight='bold', color='#333',
        bbox=dict(boxstyle='round,pad=0.4', facecolor='#FFF9C4', edgecolor='#F9A825'))
arr(ax, 10.0, 7.0, 13.2, 5.8, '', '#F57C00')

ax.set_title("Architecture d'observabilité complète — Métriques + Logs + Traces",
             fontsize=13, fontweight='bold')
plt.tight_layout()
plt.savefig("obs_architecture_complete.png", dpi=110, bbox_inches='tight')
plt.show()

_images/ababd65ec93d7342d63a6410aa39fb2d2148650dc07db1ea3495ccff3d4752c4.png
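
In this architecture Prometheus pulls metrics from each application over HTTP. To make that arrow concrete, the sketch below renders a counter in the Prometheus text exposition format using only the standard library. `render_counter` and the sample metric are hypothetical, and label-value escaping is deliberately left out. In a real service the official `prometheus_client` library would normally produce this output.

```python
def render_counter(name, help_text, samples):
    """Render one counter family in the Prometheus text exposition format.

    `samples` is a list of (labels_dict, value) pairs.
    Simplification: label values are not escaped.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} counter"]
    for labels, value in samples:
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical payload that "App A" could serve on its metrics endpoint
print(render_counter(
    "http_requests_total",
    "Total HTTP requests.",
    [({"method": "GET", "status": "200"}, 1027),
     ({"method": "GET", "status": "500"}, 3)],
))
```

Each distinct label combination becomes its own time series once scraped, which is why label sets should stay small.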

Summary

Key takeaways

The three pillars: metrics (alert), logs (explain), traces (locate). They complement one another.
Prometheus scrapes metrics at a configurable interval (15 s is a common setting). PromQL filters and aggregates the resulting time series. AlertManager routes the alerts.
Grafana displays metrics, logs (via Loki) and traces (via Jaeger) in unified dashboards.
kube-state-metrics exposes the state of Kubernetes objects (pending pods, deployments, etc.).
OpenTelemetry is the vendor-neutral instrumentation standard used here for traces. Jaeger stores and visualises them.
An SLI is the measurement, an SLO is the internal target set on that SLI, and an SLA is the contract built on top of it. The error budget is the downtime the SLO tolerates: 216 minutes per 30-day month at 99.5%.

An observable system lets you diagnose problems you had not anticipated.
