Monitoring with Prometheus and Grafana - Complete Guide
In the era of microservices, containerization, and distributed systems, monitoring is no longer an optional add-on -- it has become a fundamental pillar of reliability. Without proper insight into system behavior, every incident becomes a guessing game. Prometheus and Grafana are the duo that has become the de facto standard in the open-source observability world, powering systems from small startups to Fortune 500 enterprises.
In this article, we will walk through the complete process of implementing monitoring -- from understanding the fundamentals of observability, through Prometheus architecture and exporter configuration, to building Grafana dashboards and setting up alerts. We will also discuss commercial alternatives such as Datadog and New Relic so you can make an informed decision.
The Three Pillars of Observability#
Observability is the ability of a system to reveal its internal state based on external signals. It rests on three pillars, each providing a different kind of diagnostic information.
Metrics#
Metrics are numerical data measured over time -- counters, histograms, gauges. They answer the question "what is happening?" Examples include HTTP requests per second, CPU usage, API response time, and active database connections. Metrics are lightweight to store and ideal for trend detection and alerting.
Logs#
Logs are structured or unstructured records of events. They answer the question "why did something happen?" Examples include error stack traces, HTTP access logs, and application messages. Logs are invaluable for debugging, but storing and searching them at scale can be expensive.
Traces#
Distributed traces track the flow of a request through multiple services. They answer the question "where in the call chain is the problem?" Tools like Jaeger or Tempo visualize traces as waterfall diagrams, showing exactly how much time each service needed to process a request.
Prometheus specializes in metrics -- and does it exceptionally well. For logs, consider Loki (also from Grafana Labs), and for traces -- Tempo or Jaeger. All of these integrate with Grafana, forming a complete observability stack often referred to as the LGTM stack (Loki, Grafana, Tempo, Mimir).
Prometheus Architecture#
Prometheus stands out with its pull-based architecture. Instead of receiving data from applications (push), it actively scrapes metrics from HTTP endpoints at regular intervals. This model simplifies configuration and makes it easy to detect whether a given target is available.
Key Components#
- Prometheus Server -- the main server responsible for scraping, storing data in its local TSDB (Time Series Database), and querying via PromQL
- Exporters -- processes that expose metrics in Prometheus format (Node Exporter for system metrics, cAdvisor for containers, database exporters for PostgreSQL/MySQL/Redis)
- Pushgateway -- an intermediary for short-lived jobs (batch jobs) that cannot be scraped in the pull model
- Alertmanager -- a dedicated service for handling alerts: routing, deduplication, silencing, and sending notifications to Slack, email, PagerDuty
- Service Discovery -- mechanisms for automatic target discovery (Kubernetes, Consul, DNS, static files)
Data Model#
Prometheus stores data as time series -- sequences of values marked with timestamps. Each series is identified by a metric name and a set of labels:
http_requests_total{method="GET", endpoint="/api/users", status="200"} 1547
Metric types in Prometheus:
- Counter -- a monotonically increasing value (e.g., http_requests_total). Never decreases; resets only on process restart
- Gauge -- a value that can go up and down (e.g., temperature_celsius, active_connections)
- Histogram -- bucketed observations with sum and count (e.g., http_request_duration_seconds). Ideal for measuring percentiles
- Summary -- similar to a histogram, but with quantiles calculated client-side. Less commonly used because client-side quantiles cannot be aggregated across instances
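In the text exposition format that Prometheus scrapes, these types look roughly like this (metric names and values are illustrative):

```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1547

# HELP active_connections Currently open connections
# TYPE active_connections gauge
active_connections 23

# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 2402
http_request_duration_seconds_bucket{le="0.5"} 2980
http_request_duration_seconds_bucket{le="+Inf"} 3000
http_request_duration_seconds_sum 412.7
http_request_duration_seconds_count 3000
```

Note that a histogram is exposed as multiple series: cumulative buckets (`le` = "less than or equal"), a sum, and a count.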
Prometheus Configuration#
The main Prometheus configuration file is prometheus.yml. It defines global settings, scrape targets, and alert rules:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: "dotnet-app"
    metrics_path: /metrics
    static_configs:
      - targets: ["dotnet-api:5000"]

  - job_name: "nodejs-app"
    static_configs:
      - targets: ["nodejs-api:3000"]
Each job_name defines a group of targets. Prometheus regularly queries each target at the /metrics path (by default) and stores the received metrics in its database.
PromQL -- The Prometheus Query Language#
PromQL (Prometheus Query Language) is a powerful, functional query language for analyzing metric data. It is the heart of Prometheus and the key to effective monitoring.
Basic Queries#
# Current metric value
up
# Filtering by labels
http_requests_total{method="GET", status="200"}
# Regex matching
http_requests_total{status=~"5.."}
# Label exclusion
http_requests_total{method!="OPTIONS"}
Range Functions and Rate#
# Request rate per second (last 5 minutes)
rate(http_requests_total[5m])
# Absolute increase over one hour
increase(http_requests_total[1h])
# Rate per endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
Advanced Queries#
# 95th percentile response time (histogram)
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Error percentage (error rate)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
# Service availability (uptime)
avg_over_time(up[24h]) * 100
# Top 5 endpoints by traffic
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))
# Prediction -- estimated free disk space 24 hours from now, based on the last 6 hours
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600)
PromQL also supports binary operations between metrics, result grouping (by, without), and aggregation functions (sum, avg, min, max, count).
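The idea behind predict_linear() is worth internalizing: it fits a least-squares line to the (timestamp, value) samples in the range and extrapolates it forward. A small sketch of that concept (this mirrors the math, not Prometheus's exact implementation):

```javascript
// Least-squares fit over (t, v) samples, extrapolated secondsAhead
// past the last sample -- the concept behind PromQL's predict_linear().
function predictLinear(samples, secondsAhead) {
  const n = samples.length;
  const meanT = samples.reduce((s, p) => s + p.t, 0) / n;
  const meanV = samples.reduce((s, p) => s + p.v, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.t - meanT) * (p.v - meanV);
    den += (p.t - meanT) ** 2;
  }
  const slope = num / den;               // bytes per second
  const intercept = meanV - slope * meanT;
  const lastT = samples[n - 1].t;
  return intercept + slope * (lastT + secondsAhead);
}

// Disk free space shrinking by 1 GB per hour:
const samples = [
  { t: 0, v: 100e9 },
  { t: 3600, v: 99e9 },
  { t: 7200, v: 98e9 },
];
console.log(predictLinear(samples, 24 * 3600)); // extrapolated free bytes in 24h
```

With a steady 1 GB/hour decline from 98 GB, the 24-hour extrapolation lands at 74 GB -- exactly what you would expect from the trend.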
Node Exporter -- System Metrics#
Node Exporter is the official Prometheus exporter for Linux/Unix system metrics. It provides data about CPU, memory, disk, network, and many other resources.
Key Metrics#
# CPU usage (percentage)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# RAM usage (percentage)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage (percentage)
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
# Network traffic (bytes/s)
rate(node_network_receive_bytes_total{device="eth0"}[5m])
# Load average
node_load1
node_load5
node_load15
Other useful exporters include cAdvisor (Docker container metrics), blackbox_exporter (HTTP/TCP/ICMP probing), postgres_exporter, redis_exporter, and mysqld_exporter.
Application Metrics#
.NET -- prometheus-net#
For .NET applications (ASP.NET Core), we use the prometheus-net library:
// Program.cs
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Middleware for collecting HTTP metrics
app.UseHttpMetrics(options =>
{
    options.AddCustomLabel("host", context => context.Request.Host.Host);
});

app.MapControllers();

// /metrics endpoint
app.MapMetrics();

app.Run();

// Custom metrics in a service
using Prometheus;

public class OrderService
{
    private static readonly Counter OrdersCreated = Metrics
        .CreateCounter("orders_created_total", "Total orders created",
            new CounterConfiguration
            {
                LabelNames = new[] { "payment_method", "status" }
            });

    private static readonly Histogram OrderProcessingDuration = Metrics
        .CreateHistogram("order_processing_duration_seconds",
            "Time spent processing an order",
            new HistogramConfiguration
            {
                Buckets = Histogram.ExponentialBuckets(0.01, 2, 10)
            });

    private static readonly Gauge ActiveOrders = Metrics
        .CreateGauge("active_orders", "Number of currently active orders");

    public async Task<Order> CreateOrder(OrderRequest request)
    {
        ActiveOrders.Inc();
        using (OrderProcessingDuration.NewTimer())
        {
            try
            {
                var order = await ProcessOrder(request);
                OrdersCreated.WithLabels(request.PaymentMethod, "success").Inc();
                return order;
            }
            catch (Exception)
            {
                OrdersCreated.WithLabels(request.PaymentMethod, "failed").Inc();
                throw;
            }
            finally
            {
                ActiveOrders.Dec();
            }
        }
    }
}
Node.js -- prom-client#
For Node.js applications (Express), we use prom-client:
// metrics.js
const client = require('prom-client');

// Default metrics (CPU, memory, event loop, GC)
client.collectDefaultMetrics({ prefix: 'nodejs_' });

// HTTP metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
});

const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

// Express middleware
function metricsMiddleware(req, res, next) {
  const end = httpRequestDuration.startTimer();
  activeConnections.inc();
  res.on('finish', () => {
    const route = req.route?.path || req.path;
    const labels = {
      method: req.method,
      route: route,
      status_code: res.statusCode,
    };
    end(labels);
    httpRequestsTotal.inc(labels);
    activeConnections.dec();
  });
  next();
}

module.exports = { metricsMiddleware, client };

// app.js
const express = require('express');
const { metricsMiddleware, client } = require('./metrics');

const app = express();
app.use(metricsMiddleware);

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.get('/api/health', (req, res) => {
  res.json({ status: 'healthy', uptime: process.uptime() });
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});
Grafana -- Visualization and Dashboards#
Grafana is a visualization platform that transforms raw metrics into readable, interactive dashboards. It supports dozens of data sources -- Prometheus, InfluxDB, Elasticsearch, PostgreSQL, Loki, and many more.
Data Source Configuration#
After installing Grafana, the first step is adding Prometheus as a data source. This can be done through the graphical interface or automatically via provisioning:
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      httpMethod: POST
Panel Types in Grafana#
Grafana offers diverse visualization types:
- Time series -- classic line chart, ideal for metrics changing over time
- Stat -- a single value with optional sparkline (e.g., current error rate)
- Gauge -- gauge-style visualization (e.g., CPU usage percentage)
- Table -- tabular data with sorting and filtering capabilities
- Heatmap -- ideal for visualizing histogram distributions
- Bar chart -- comparisons between categories
- Logs -- integration with Loki for displaying logs alongside metrics
The Four Golden Signals#
Google SRE defines four key signals that should be monitored for every service:
1. Latency -- response time for requests:
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
2. Traffic -- request volume:
sum(rate(http_requests_total[5m]))
3. Errors -- error rate:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
4. Saturation -- resource utilization level:
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
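A useful companion to these signals is the error budget implied by an availability target -- the arithmetic is simple enough to sanity-check directly (a small illustrative helper, not part of any library):

```javascript
// Allowed downtime ("error budget") for an availability target over a window.
function errorBudgetMinutes(sloPercent, windowDays) {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloPercent / 100);
}

console.log(errorBudgetMinutes(99.9, 30));  // ~43.2 minutes per 30 days
console.log(errorBudgetMinutes(99.99, 30)); // ~4.32 minutes per 30 days
```

This is why alert thresholds and "for:" durations matter: at 99.99%, a single slow incident response can consume the entire monthly budget.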
Dashboard Provisioning#
Dashboards can be defined as code (JSON) and loaded automatically:
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
Alerting with Alertmanager#
Alertmanager handles alerts generated by Prometheus -- it groups them, deduplicates, silences, and routes them to the appropriate notification channels.
Alert Rules in Prometheus#
# alerts/application.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate (> 5%)"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
          > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P95 latency (> 1s)"
          description: "P95 latency is {{ $value }}s"

      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage (> 85%)"

      - alert: DiskSpaceRunningLow
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space (> 80% used)"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
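Before reloading Prometheus, rule and configuration files can be validated with promtool, the CLI utility that ships with Prometheus:

```
promtool check rules alerts/application.yml
promtool check config prometheus.yml
```

Running this in CI catches syntax errors before a broken rule file silently disables your alerting.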
Alertmanager Configuration#
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: "alerts@example.com"
  smtp_smarthost: "smtp.example.com:587"
  smtp_auth_username: "alerts@example.com"
  smtp_auth_password: "secret"

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: "team@example.com"
  - name: 'critical-alerts'
    slack_configs:
      - api_url: "https://hooks.slack.com/services/xxx/yyy/zzz"
        channel: "#alerts-critical"
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    pagerduty_configs:
      - service_key: "your-pagerduty-key"
  - name: 'warning-alerts'
    slack_configs:
      - api_url: "https://hooks.slack.com/services/xxx/yyy/zzz"
        channel: "#alerts-warning"

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
Alertmanager also supports integrations with Microsoft Teams, OpsGenie, VictorOps, and webhooks, allowing you to fit into existing incident response processes.
Docker Compose -- Complete Monitoring Stack#
Below is the full Docker Compose configuration, ready to launch in minutes:
# docker-compose.monitoring.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=SecurePassword123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://grafana.example.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    depends_on:
      - prometheus
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
Launch the stack:
docker compose -f docker-compose.monitoring.yml up -d
After startup, the following will be available:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001
- Alertmanager: http://localhost:9093
- Node Exporter: http://localhost:9100/metrics
- cAdvisor: http://localhost:8080
Monitoring in Kubernetes#
In Kubernetes environments, the most convenient solution is kube-prometheus-stack (formerly prometheus-operator), installed via Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set grafana.adminPassword="SecurePassword123" \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
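To reach the bundled Grafana without exposing it publicly, port-forwarding works for a first look. The exact Service name depends on your Helm release name, so list the services first:

```
kubectl --namespace monitoring get svc
# For a release named "kube-prometheus", the Grafana service is typically:
kubectl --namespace monitoring port-forward svc/kube-prometheus-grafana 3000:80
```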
ServiceMonitor -- Automatic Discovery#
In Kubernetes, service monitoring is done through the ServiceMonitor CRD:
# servicemonitor.yml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-api-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: my-api
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
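Note that a ServiceMonitor selects Kubernetes Services, not Pods, and matches endpoints by port name. The application's Service therefore needs the matching label and a named port -- a sketch consistent with the selector above (names are illustrative):

```yaml
# service.yml
apiVersion: v1
kind: Service
metadata:
  name: my-api
  namespace: production
  labels:
    app: my-api        # matched by spec.selector.matchLabels above
spec:
  selector:
    app: my-api
  ports:
    - name: http       # must match the port name in the ServiceMonitor
      port: 5000
      targetPort: 5000
```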
Custom Metrics -- Best Practices#
When defining your own metrics, follow these principles:
- Naming convention -- use the format namespace_subsystem_name_unit. Example: myapp_orders_processing_duration_seconds
- Use appropriate types -- Counter for events (requests, errors), Gauge for states (connections, temperature), Histogram for distributions (latency)
- Do not overuse labels -- each unique label combination creates a separate time series. Avoid high-cardinality labels (e.g., user_id, request_id)
- Use base SI units -- seconds instead of milliseconds, bytes instead of megabytes
- RED and USE methodologies:
- RED (Rate, Errors, Duration) -- for services
- USE (Utilization, Saturation, Errors) -- for infrastructure resources
- Document your metrics -- every metric should have a descriptive help text explaining what it measures
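The naming and unit rules are mechanical enough to lint in CI. A toy checker for the conventions above (an illustrative sketch, not a Prometheus API -- the accepted unit suffixes are a deliberately short sample):

```javascript
// Toy check for the namespace_subsystem_name_unit convention:
// lowercase snake_case, ending in a base unit or _total for counters.
// The suffix list here is a small illustrative sample.
function isWellNamedMetric(name) {
  const snakeCase = /^[a-z][a-z0-9_]*$/.test(name);
  const hasUnitOrTotal = /_(seconds|bytes|total|ratio|celsius)$/.test(name);
  return snakeCase && hasUnitOrTotal;
}

console.log(isWellNamedMetric('myapp_orders_processing_duration_seconds')); // true
console.log(isWellNamedMetric('ResponseTimeMs')); // false: camel case, non-base unit
```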
Comparison with Commercial Solutions#
| Aspect | Prometheus + Grafana | Datadog | New Relic |
|--------|---------------------|---------|-----------|
| Cost | Free (open-source) | From $15/host/month | From $25/host/month |
| Hosting | Self-hosted | SaaS | SaaS |
| Scalability | Requires Thanos/Cortex/Mimir | Built-in | Built-in |
| Learning curve | Steep (PromQL, YAML) | Moderate | Gentle |
| Integrations | Hundreds of exporters | 700+ native | 500+ native |
| Data retention | Configurable (local) | 15 months (Pro plan) | 8 days (Free) |
| Alerts | Alertmanager | Built-in | Built-in |
| APM / Traces | Requires additional tools | Built-in | Built-in |
| Logs | Loki (separate deployment) | Built-in | Built-in |
When to choose Prometheus + Grafana:
- Cost control is a priority -- no per-host or per-metric-volume fees
- You need full control over your data and infrastructure
- You have experience with Kubernetes and DevOps
- You want to avoid vendor lock-in
- You have a team capable of maintaining monitoring infrastructure
When to consider Datadog or New Relic:
- No DevOps team to maintain monitoring infrastructure
- You need an "all-in-one" solution with APM, logs, and traces in one place
- Fast deployment is more important than cost
- You need advanced APM with automatic instrumentation
Other alternatives worth considering include Elastic Stack (ELK) for logs and metrics, Zabbix for traditional infrastructure monitoring, and VictoriaMetrics as a drop-in replacement for Prometheus with better performance and compression.
Monitoring Best Practices#
1. Monitor What Matters to the User#
Start with business metrics and user experience (latency, availability), then work your way down to infrastructure level. If the user is not experiencing a problem, an alert should not wake anyone up at night.
2. Alert on Symptoms, Not Causes#
Alert on "error rate > 5%" instead of "CPU > 90%". High CPU does not always mean a problem -- high error rate always does.
3. Use Multi-Level Dashboards#
- Executive -- SLA, uptime, key business metrics
- Service -- Golden Signals per service
- Infrastructure -- CPU, memory, disk, network
- Debug -- detailed metrics for diagnostics
4. Retention and Compression#
Store high-resolution data (15s) for 2 weeks, downsampled data (5min) for 6 months, and heavily aggregated data (1h) for longer. Thanos, Cortex, or Mimir help with long-term storage.
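Capacity planning for that retention follows the rule of thumb from the Prometheus storage documentation: disk ≈ retention_seconds × ingested_samples_per_second × bytes_per_sample, with compressed samples typically taking around 1-2 bytes each. A quick back-of-the-envelope helper (illustrative, actual usage varies with churn and compression):

```javascript
// Rough Prometheus disk estimate:
// retention_seconds * samples_per_second * bytes_per_sample (~1-2 bytes).
function estimateDiskGB(retentionDays, activeSeries, scrapeIntervalSec, bytesPerSample = 2) {
  const samplesPerSecond = activeSeries / scrapeIntervalSec;
  const bytes = retentionDays * 86400 * samplesPerSecond * bytesPerSample;
  return bytes / 1e9;
}

// 100k active series scraped every 15s, kept for 30 days:
console.log(estimateDiskGB(30, 100000, 15).toFixed(1)); // "34.6" GB
```

Numbers like this make it obvious why downsampling and a long-term store (Thanos, Cortex, Mimir) matter once series counts reach the millions.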
5. Regularly Test Your Alerts#
Alerts that have never fired may not work when needed. Regularly conduct "fire drills" -- simulated incidents that test the entire alerting pipeline.
6. Apply Infrastructure as Code#
Keep Prometheus configurations, alert rules, and Grafana dashboards in a Git repository. Use provisioning instead of manual configuration through the UI.
Summary#
Prometheus and Grafana form a powerful, mature, and battle-tested monitoring ecosystem. The pull model, PromQL language, rich exporter ecosystem, and native Kubernetes integration make them the natural choice for DevOps and SRE teams.
Key takeaways:
- Start with the three pillars of observability and the four golden signals
- Instrument your applications from day one -- adding metrics later is harder
- Set alerts on symptoms visible to the user
- Build multi-level dashboards -- from the big picture down to details
- Plan retention and scalability from the start
- Keep configurations in Git as Infrastructure as Code
Need Professional Monitoring?#
MDS Software Solutions Group specializes in implementing complete observability solutions. From configuring Prometheus and Grafana, through designing alerts, to integration with CI/CD pipelines -- we will help you build monitoring that truly protects your systems.
Our team has experience in monitoring .NET, Node.js, Java, and Python applications, both in Docker Compose and Kubernetes environments. We design dashboards that deliver valuable insights, not noise.
Contact us to discuss monitoring for your infrastructure.
Team of programming experts specializing in modern web technologies.