augusthottie

I Added Log Aggregation to My EKS Observability Stack, Metrics + Logs in One Dashboard

Last week I built an observability stack with Prometheus, Grafana, and custom alerting on EKS. The LinkedIn post got more engagement than anything I'd posted before, and two comments suggested the same thing: "Integrate Loki for logs."

They were right. Metrics tell you that something is wrong. Logs tell you why. Without both in the same place, you're switching between kubectl logs and Grafana dashboards trying to correlate timestamps manually. That's not a workflow, that's a scavenger hunt.

So I added Loki.

What I Added

Loki and Promtail, deployed via ArgoCD alongside the existing Prometheus stack:

  • Promtail runs as a DaemonSet on every node, tailing container logs from /var/log/pods
  • Loki stores and indexes the logs, queryable via LogQL
  • Grafana gets a new "Logs & Metrics Correlation" dashboard with metrics and logs side by side
  • A new Loki datasource in Grafana so both Prometheus and Loki are available in the same dashboard

The entire addition was three files: an ArgoCD application for the Loki stack, a Grafana datasource ConfigMap, and a logs dashboard ConfigMap. Push to main, ArgoCD syncs, done.
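For anyone curious what that first file looks like, here's a rough sketch of an ArgoCD Application pulling the loki-stack Helm chart. The repoURL is the real Grafana Helm charts repo, but the chart version, values, and namespaces here are illustrative placeholders, not my exact config:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: loki-stack
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://grafana.github.io/helm-charts
    chart: loki-stack
    targetRevision: 2.10.2   # placeholder version
    helm:
      values: |
        loki:
          persistence:
            enabled: true
            size: 10Gi
        promtail:
          enabled: true       # the DaemonSet that tails /var/log/pods
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```

The loki-stack chart bundles Loki and Promtail together, which is why one Application covers both.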

The Logs & Metrics Correlation Dashboard


This is the dashboard I wish I'd had from the start. Seven panels across five rows:

Row 1: Metrics: API Request Rate and Error Rate from Prometheus. See the traffic pattern and spot anomalies.

Row 2: API Logs: Live log stream from the gitops-api containers via Loki. When you see a spike in the metrics above, scroll down and the logs from that exact time range are right there.

Row 3: Infrastructure Logs: PostgreSQL logs on the left, Redis logs on the right. Database checkpoint warnings, connection events, cache operations, all visible without running kubectl logs across multiple pods.

Row 4: Error Logs: A filtered view showing only lines matching error, fail, panic, crash, or exception across all containers. This is the "something is broken, show me what" panel.

Row 5: Log Volume: Lines per second per container. A sudden spike in log volume often means something is throwing errors in a loop.

The key insight: time-synced panels. When you drag to select a time range on the metrics graph, the log panels update to show logs from that exact window. That's the metric-to-log correlation workflow: see a spike, select the time range, read the logs. Root cause in under two minutes.

LogQL: The Query Language

If you know PromQL, LogQL feels familiar. Stream selectors use curly braces like Prometheus label matchers:

All logs from the three-tier namespace:

{namespace="three-tier"}

Just the API container:

{namespace="three-tier", container="gitops-api"}

Filter for errors using a pipeline:

{namespace="three-tier"} |~ "(?i)(error|fail|panic|crash|exception)"

Log volume as a metric (for the timeseries panel):

sum(rate({namespace="three-tier"}[5m])) by (container)

That last one is interesting: rate() on a log stream gives you lines per second, which you can graph just like a Prometheus metric. Useful for spotting error storms.
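Taking that one step further, you can divide two log-stream rates to graph an error ratio directly in LogQL. This is a sketch rather than a query from my actual dashboard:

```logql
sum(rate({namespace="three-tier"} |~ "(?i)(error|fail|panic)" [5m]))
  /
sum(rate({namespace="three-tier"}[5m]))
```

A value near 1.0 means almost every log line is an error, which is usually a crash loop.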

What Went Wrong (And What I Learned)

Node Capacity

Loki wouldn't schedule. The two t3.medium nodes were already running the app, Prometheus, Grafana, Alertmanager, Node Exporter, kube-state-metrics, ArgoCD, cert-manager, and the LB controller. Too many pods. I had to scale the node group to 3 nodes before Loki could start.

This is something you don't think about until it happens: a t3.medium supports around 17 pods per node (the VPC CNI caps pods by the instance's ENI and IP address capacity), and a monitoring stack eats through that fast.
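Where does 17 come from? The VPC CNI's default max-pods formula, using the ENI and IP limits AWS publishes for each instance type (t3.medium: 3 ENIs, 6 IPv4 addresses per ENI):

```shell
# AWS VPC CNI max-pods formula: ENIs * (IPs per ENI - 1) + 2
# One IP per ENI is reserved for the ENI itself; +2 covers host-network pods.
enis=3
ips_per_eni=6
max_pods=$((enis * (ips_per_eni - 1) + 2))
echo "$max_pods"   # 17
```

Three nodes gives you roughly 51 pod slots, which is why scaling the node group unblocked Loki.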

EBS CSI Driver (Again)

The Loki PVC was stuck in Pending even after adding a third node. The EBS CSI driver's IAM role still had the old cluster's OIDC provider URL. This was the third time hitting this issue, so by now the fix is muscle memory: delete the IAM service account, recreate it, reinstall the addon with --resolve-conflicts OVERWRITE.
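For reference, the fix as a command sequence, assuming eksctl-managed IRSA. The cluster name, role name, and account ID below are placeholders:

```shell
# 1. Remove the service account tied to the stale OIDC provider
eksctl delete iamserviceaccount --cluster my-cluster \
  --namespace kube-system --name ebs-csi-controller-sa

# 2. Recreate it against the current cluster's OIDC provider
eksctl create iamserviceaccount --cluster my-cluster \
  --namespace kube-system --name ebs-csi-controller-sa \
  --role-name AmazonEKS_EBS_CSI_DriverRole \
  --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy \
  --approve

# 3. Re-apply the addon with the fresh role ARN
aws eks update-addon --cluster-name my-cluster \
  --addon-name aws-ebs-csi-driver \
  --service-account-role-arn arn:aws:iam::<ACCOUNT_ID>:role/AmazonEKS_EBS_CSI_DriverRole \
  --resolve-conflicts OVERWRITE
```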

Grafana Datasource Provisioning

The Loki datasource ConfigMap existed in Kubernetes but Grafana's sidecar didn't pick it up. After a restart, the Prometheus datasource also disappeared. I ended up adding both datasources manually through the Grafana UI.

The lesson: Grafana's sidecar provisioning is convenient when it works, but when it doesn't, just add datasources manually and move on. The dashboards are what matter, not how the datasource was configured.
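If you do want the sidecar route to work, the ConfigMap needs the label the sidecar watches for. A rough example, where grafana_datasource: "1" is the kube-prometheus-stack default and the Loki service URL is an assumption about in-cluster DNS:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-datasource
  namespace: monitoring
  labels:
    grafana_datasource: "1"   # without this label the sidecar ignores the ConfigMap
data:
  loki-datasource.yaml: |
    apiVersion: 1
    datasources:
      - name: Loki
        type: loki
        access: proxy
        url: http://loki:3100
```

A missing or mismatched label (or a sidecar configured to watch a different namespace) produces exactly the silent failure described above.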

Curly Quotes in Dashboard JSON

When importing dashboard JSON through Grafana's UI, the PromQL queries had corrupted quotes: curly "smart" quotes instead of straight quotes. Every panel showed a parse error. The fix was to edit each panel and retype the query manually from the keyboard.

This is a subtle one. If you copy-paste JSON through a text editor or chat that auto-converts quotes, your Grafana panels will break silently.
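A faster fix than retyping every panel is normalizing the quotes in the JSON file before import. A minimal sketch, assuming GNU sed and a placeholder filename of dashboard.json:

```shell
# Replace curly double quotes (U+201C/U+201D) with straight quotes in place
sed -i 's/“/"/g; s/”/"/g' dashboard.json
```

Running this before pasting the JSON into Grafana's import dialog avoids the per-panel parse errors entirely.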

Why Logs Complete the Observability Story

With just Prometheus, my monitoring answer was: "The error rate spiked at 3:42 PM." With Loki added, it becomes: "The error rate spiked at 3:42 PM because PostgreSQL was restarting after an OOM kill; here are the exact log lines."

That's the difference between detecting a problem and diagnosing it. In an interview, being able to describe a workflow that goes from alert → metric → log → root cause shows you've actually operated production systems, not just set up dashboards.

The full observability stack now covers:

  • Metrics: Prometheus + custom application instrumentation
  • Logs: Loki + Promtail collecting from every container
  • Alerting: PrometheusRules with 9 custom alerts
  • Dashboards: Two Grafana dashboards, one for metrics and one for metric-to-log correlation
  • All deployed via GitOps: ArgoCD managing four applications from a single repo

Links


Building my DevOps portfolio ahead of the AWS DevOps Professional certification. Connect on LinkedIn; I'd love to hear what you're building.
