Monitoring with Prometheus and Grafana - Complete Guide
In the era of microservices, containerization, and distributed systems, monitoring is no longer an optional add-on -- it has become a fundamental pillar of reliability. Without proper insight into system behavior, every incident becomes a guessing game. Prometheus and Grafana are the duo that has become the de facto standard in the open-source observability world, powering systems from small startups to Fortune 500 enterprises.
In this article, we will walk through the complete process of implementing monitoring -- from understanding the fundamentals of observability, through Prometheus architecture and exporter configuration, to building Grafana dashboards and setting up alerts. We will also discuss commercial alternatives such as Datadog and New Relic so you can make an informed decision.
The Three Pillars of Observability#
Observability is the ability of a system to reveal its internal state based on external signals. It rests on three pillars, each providing a different kind of diagnostic information.
Metrics#
Metrics are numerical data measured over time -- counters, histograms, gauges. They answer the question "what is happening?" Examples include HTTP requests per second, CPU usage, API response time, and active database connections. Metrics are lightweight to store and ideal for trend detection and alerting.
Logs#
Logs are structured or unstructured records of events. They answer the question "why did something happen?" Examples include error stack traces, HTTP access logs, and application messages. Logs are invaluable for debugging, but storing and searching them at scale can be expensive.
Traces#
Distributed traces track the flow of a request through multiple services. They answer the question "where in the call chain is the problem?" Tools like Jaeger or Tempo visualize traces as waterfall diagrams, showing exactly how much time each service needed to process a request.
Prometheus specializes in metrics -- and does it exceptionally well. For logs, consider Loki (also from Grafana Labs), and for traces -- Tempo or Jaeger. All of these integrate with Grafana, forming a complete observability stack often referred to as the LGTM stack (Loki, Grafana, Tempo, Mimir).
Prometheus Architecture#
Prometheus stands out with its pull-based architecture. Instead of receiving data from applications (push), it actively scrapes metrics from HTTP endpoints at regular intervals. This model simplifies configuration and makes it easy to detect whether a given target is available.
Key Components#
- Prometheus Server -- the main server responsible for scraping, storing data in its local TSDB (Time Series Database), and querying via PromQL
- Exporters -- processes that expose metrics in Prometheus format (Node Exporter for system metrics, cAdvisor for containers, database exporters for PostgreSQL/MySQL/Redis)
- Pushgateway -- an intermediary for short-lived jobs (batch jobs) that cannot be scraped in the pull model
- Alertmanager -- a dedicated service for handling alerts: routing, deduplication, silencing, and sending notifications to Slack, email, PagerDuty
- Service Discovery -- mechanisms for automatic target discovery (Kubernetes, Consul, DNS, static files)
Data Model#
Prometheus stores data as time series -- sequences of values marked with timestamps. Each series is identified by a metric name and a set of labels:
http_requests_total{method="GET", endpoint="/api/users", status="200"} 1547
Metric types in Prometheus:
- Counter -- a monotonically increasing value (e.g., http_requests_total). Never decreases; resets only on process restart
- Gauge -- a value that can go up and down (e.g., temperature_celsius, active_connections)
- Histogram -- bucketed observations with sum and count (e.g., http_request_duration_seconds). Ideal for measuring percentiles
- Summary -- similar to a histogram, but with quantiles calculated client-side. Less commonly used because client-side quantiles cannot be aggregated across instances
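In the text exposition format that Prometheus scrapes, these types look roughly like this (metric names and values are illustrative):

```
# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1547

# HELP active_connections Currently open connections
# TYPE active_connections gauge
active_connections 23

# HELP http_request_duration_seconds Request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{le="0.1"} 2402
http_request_duration_seconds_bucket{le="0.5"} 2980
http_request_duration_seconds_bucket{le="+Inf"} 3000
http_request_duration_seconds_sum 412.7
http_request_duration_seconds_count 3000
```

Note that a histogram is exposed as multiple series: cumulative buckets (`le` = "less than or equal"), a sum, and a count.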
Prometheus Configuration#
The main Prometheus configuration file is prometheus.yml. It defines global settings, scrape targets, and alert rules:
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

rule_files:
  - "alerts/*.yml"

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - "alertmanager:9093"

scrape_configs:
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: "node-exporter"
    static_configs:
      - targets: ["node-exporter:9100"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: "dotnet-app"
    metrics_path: /metrics
    static_configs:
      - targets: ["dotnet-api:5000"]

  - job_name: "nodejs-app"
    static_configs:
      - targets: ["nodejs-api:3000"]
Each job_name defines a group of targets. Prometheus regularly queries each target at the /metrics path (by default) and stores the received metrics in its database.
PromQL -- The Prometheus Query Language#
PromQL (Prometheus Query Language) is a powerful, functional query language for analyzing metric data. It is the heart of Prometheus and the key to effective monitoring.
Basic Queries#
# Current metric value
up
# Filtering by labels
http_requests_total{method="GET", status="200"}
# Regex matching
http_requests_total{status=~"5.."}
# Label exclusion
http_requests_total{method!="OPTIONS"}
Range Functions and Rate#
# Request rate per second (last 5 minutes)
rate(http_requests_total[5m])
# Absolute increase over one hour
increase(http_requests_total[1h])
# Rate per endpoint
sum by (endpoint) (rate(http_requests_total[5m]))
Advanced Queries#
# 95th percentile response time (histogram)
histogram_quantile(0.95,
sum by (le) (rate(http_request_duration_seconds_bucket[5m]))
)
# Error percentage (error rate)
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
# Service availability (uptime)
avg_over_time(up[24h]) * 100
# Top 5 endpoints by traffic
topk(5, sum by (endpoint) (rate(http_requests_total[5m])))
# Prediction -- estimated free disk space 24 hours from now, based on the last 6 hours
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600)
PromQL also supports binary operations between metrics, result grouping (by, without), and aggregation functions (sum, avg, min, max, count).
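The idea behind predict_linear() is worth internalizing: it fits a least-squares line to the (timestamp, value) samples in the range and extrapolates it forward. A small sketch of that concept (this mirrors the math, not Prometheus's exact implementation):

```javascript
// Least-squares fit over (t, v) samples, extrapolated secondsAhead
// past the last sample -- the concept behind PromQL's predict_linear().
function predictLinear(samples, secondsAhead) {
  const n = samples.length;
  const meanT = samples.reduce((s, p) => s + p.t, 0) / n;
  const meanV = samples.reduce((s, p) => s + p.v, 0) / n;
  let num = 0;
  let den = 0;
  for (const p of samples) {
    num += (p.t - meanT) * (p.v - meanV);
    den += (p.t - meanT) ** 2;
  }
  const slope = num / den;               // bytes per second
  const intercept = meanV - slope * meanT;
  const lastT = samples[n - 1].t;
  return intercept + slope * (lastT + secondsAhead);
}

// Disk free space shrinking by 1 GB per hour:
const samples = [
  { t: 0, v: 100e9 },
  { t: 3600, v: 99e9 },
  { t: 7200, v: 98e9 },
];
console.log(predictLinear(samples, 24 * 3600)); // extrapolated free bytes in 24h
```

With a steady 1 GB/hour decline from 98 GB, the 24-hour extrapolation lands at 74 GB -- exactly what you would expect from the trend.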
Node Exporter -- System Metrics#
Node Exporter is the official Prometheus exporter for Linux/Unix system metrics. It provides data about CPU, memory, disk, network, and many other resources.
Key Metrics#
# CPU usage (percentage)
100 - (avg by (instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
# RAM usage (percentage)
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
# Disk usage (percentage)
(1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100
# Network traffic (bytes/s)
rate(node_network_receive_bytes_total{device="eth0"}[5m])
# Load average
node_load1
node_load5
node_load15
Other useful exporters include cAdvisor (Docker container metrics), blackbox_exporter (HTTP/TCP/ICMP probing), postgres_exporter, redis_exporter, and mysqld_exporter.
Application Metrics#
.NET -- prometheus-net#
For .NET applications (ASP.NET Core), we use the prometheus-net library:
// Program.cs
using Prometheus;

var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

// Middleware for collecting HTTP metrics
app.UseHttpMetrics(options =>
{
    options.AddCustomLabel("host", context => context.Request.Host.Host);
});

app.MapControllers();

// /metrics endpoint
app.MapMetrics();

app.Run();

// Custom metrics in a service
using Prometheus;

public class OrderService
{
    private static readonly Counter OrdersCreated = Metrics
        .CreateCounter("orders_created_total", "Total orders created",
            new CounterConfiguration
            {
                LabelNames = new[] { "payment_method", "status" }
            });

    private static readonly Histogram OrderProcessingDuration = Metrics
        .CreateHistogram("order_processing_duration_seconds",
            "Time spent processing an order",
            new HistogramConfiguration
            {
                Buckets = Histogram.ExponentialBuckets(0.01, 2, 10)
            });

    private static readonly Gauge ActiveOrders = Metrics
        .CreateGauge("active_orders", "Number of currently active orders");

    public async Task<Order> CreateOrder(OrderRequest request)
    {
        ActiveOrders.Inc();
        using (OrderProcessingDuration.NewTimer())
        {
            try
            {
                var order = await ProcessOrder(request);
                OrdersCreated.WithLabels(request.PaymentMethod, "success").Inc();
                return order;
            }
            catch (Exception)
            {
                OrdersCreated.WithLabels(request.PaymentMethod, "failed").Inc();
                throw;
            }
            finally
            {
                ActiveOrders.Dec();
            }
        }
    }
}
Node.js -- prom-client#
For Node.js applications (Express), we use prom-client:
// metrics.js
const client = require('prom-client');

// Default metrics (CPU, memory, event loop, GC)
client.collectDefaultMetrics({ prefix: 'nodejs_' });

// HTTP metrics
const httpRequestDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
});

const httpRequestsTotal = new client.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status_code'],
});

const activeConnections = new client.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});

// Express middleware
function metricsMiddleware(req, res, next) {
  const end = httpRequestDuration.startTimer();
  activeConnections.inc();
  res.on('finish', () => {
    const route = req.route?.path || req.path;
    const labels = {
      method: req.method,
      route: route,
      status_code: res.statusCode,
    };
    end(labels);
    httpRequestsTotal.inc(labels);
    activeConnections.dec();
  });
  next();
}

module.exports = { metricsMiddleware, client };

// app.js
const express = require('express');
const { metricsMiddleware, client } = require('./metrics');

const app = express();
app.use(metricsMiddleware);

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});

app.get('/api/health', (req, res) => {
  res.json({ status: 'healthy', uptime: process.uptime() });
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});
Grafana -- Visualization and Dashboards#
Grafana is a visualization platform that transforms raw metrics into readable, interactive dashboards. It supports dozens of data sources -- Prometheus, InfluxDB, Elasticsearch, PostgreSQL, Loki, and many more.
Data Source Configuration#
After installing Grafana, the first step is adding Prometheus as a data source. This can be done through the graphical interface or automatically via provisioning:
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false
    jsonData:
      timeInterval: "15s"
      httpMethod: POST
Panel Types in Grafana#
Grafana offers diverse visualization types:
- Time series -- classic line chart, ideal for metrics changing over time
- Stat -- a single value with optional sparkline (e.g., current error rate)
- Gauge -- gauge-style visualization (e.g., CPU usage percentage)
- Table -- tabular data with sorting and filtering capabilities
- Heatmap -- ideal for visualizing histogram distributions
- Bar chart -- comparisons between categories
- Logs -- integration with Loki for displaying logs alongside metrics
The Four Golden Signals#
Google SRE defines four key signals that should be monitored for every service:
1. Latency -- response time for requests:
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
2. Traffic -- request volume:
sum(rate(http_requests_total[5m]))
3. Errors -- error rate:
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
4. Saturation -- resource utilization level:
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100
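A useful companion to these signals is the error budget implied by an availability target -- the arithmetic is simple enough to sanity-check directly (a small illustrative helper, not part of any library):

```javascript
// Allowed downtime ("error budget") for an availability target over a window.
function errorBudgetMinutes(sloPercent, windowDays) {
  const totalMinutes = windowDays * 24 * 60;
  return totalMinutes * (1 - sloPercent / 100);
}

console.log(errorBudgetMinutes(99.9, 30));  // ~43.2 minutes per 30 days
console.log(errorBudgetMinutes(99.99, 30)); // ~4.32 minutes per 30 days
```

This is why alert thresholds and "for:" durations matter: at 99.99%, a single slow incident response can consume the entire monthly budget.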
Dashboard Provisioning#
Dashboards can be defined as code (JSON) and loaded automatically:
# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: 'Infrastructure'
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
Alerting with Alertmanager#
Alertmanager handles alerts generated by Prometheus -- it groups them, deduplicates, silences, and routes them to the appropriate notification channels.
Alert Rules in Prometheus#
# alerts/application.yml
groups:
  - name: application
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate (> 5%)"
          description: "Error rate is {{ $value | humanizePercentage }} for {{ $labels.job }}"

      - alert: HighLatency
        expr: |
          histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
          > 1.0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P95 latency (> 1s)"
          description: "P95 latency is {{ $value }}s"

      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage (> 85%)"

      - alert: DiskSpaceRunningLow
        expr: |
          (1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) * 100 > 80
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space (> 80% used)"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
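Before reloading Prometheus, rule and configuration files can be validated with promtool, the CLI utility that ships with Prometheus:

```
promtool check rules alerts/application.yml
promtool check config prometheus.yml
```

Running this in CI catches syntax errors before a broken rule file silently disables your alerting.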
Alertmanager Configuration#
# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_from: "alerts@example.com"
  smtp_smarthost: "smtp.example.com:587"
  smtp_auth_username: "alerts@example.com"
  smtp_auth_password: "secret"

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: "team@example.com"
  - name: 'critical-alerts'
    slack_configs:
      - api_url: "https://hooks.slack.com/services/xxx/yyy/zzz"
        channel: "#alerts-critical"
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    pagerduty_configs:
      - service_key: "your-pagerduty-key"
  - name: 'warning-alerts'
    slack_configs:
      - api_url: "https://hooks.slack.com/services/xxx/yyy/zzz"
        channel: "#alerts-warning"

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
Alertmanager also supports integrations with Microsoft Teams, OpsGenie, VictorOps, and webhooks, allowing you to fit into existing incident response processes.
Docker Compose -- Complete Monitoring Stack#
Below is the full Docker Compose configuration, ready to launch in minutes:
# docker-compose.monitoring.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:v2.48.0
    container_name: prometheus
    restart: unless-stopped
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./prometheus/alerts:/etc/prometheus/alerts
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
      - '--web.enable-admin-api'
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:10.2.0
    container_name: grafana
    restart: unless-stopped
    ports:
      - "3001:3000"
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=SecurePassword123
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=https://grafana.example.com
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    depends_on:
      - prometheus
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.26.0
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml
      - alertmanager_data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.7.0
    container_name: node-exporter
    restart: unless-stopped
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.rootfs=/rootfs'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.47.0
    container_name: cadvisor
    restart: unless-stopped
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg:/dev/kmsg
    networks:
      - monitoring

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:

networks:
  monitoring:
    driver: bridge
Launch the stack:
docker compose -f docker-compose.monitoring.yml up -d
After startup, the following will be available:
- Prometheus: http://localhost:9090
- Grafana: http://localhost:3001
- Alertmanager: http://localhost:9093
- Node Exporter: http://localhost:9100/metrics
- cAdvisor: http://localhost:8080
Monitoring in Kubernetes#
In Kubernetes environments, the most convenient solution is kube-prometheus-stack (formerly prometheus-operator), installed via Helm:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring \
--create-namespace \
--set grafana.adminPassword="SecurePassword123" \
--set prometheus.prometheusSpec.retention=30d \
--set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
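To reach the bundled Grafana without exposing it publicly, port-forwarding works for a first look. The exact Service name depends on your Helm release name, so list the services first:

```
kubectl --namespace monitoring get svc
# For a release named "kube-prometheus", the Grafana service is typically:
kubectl --namespace monitoring port-forward svc/kube-prometheus-grafana 3000:80
```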
ServiceMonitor -- Automatic Discovery#
In Kubernetes, service monitoring is done through the ServiceMonitor CRD:
# servicemonitor.yml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-api-monitor
  namespace: monitoring
  labels:
    release: kube-prometheus
spec:
  namespaceSelector:
    matchNames:
      - production
  selector:
    matchLabels:
      app: my-api
  endpoints:
    - port: http
      path: /metrics
      interval: 15s
      scrapeTimeout: 10s
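Note that a ServiceMonitor selects Kubernetes Services, not Pods, and matches endpoints by port name. The application's Service therefore needs the matching label and a named port -- a sketch consistent with the selector above (names are illustrative):

```yaml
# service.yml
apiVersion: v1
kind: Service
metadata:
  name: my-api
  namespace: production
  labels:
    app: my-api        # matched by spec.selector.matchLabels above
spec:
  selector:
    app: my-api
  ports:
    - name: http       # must match the port name in the ServiceMonitor
      port: 5000
      targetPort: 5000
```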
Custom Metrics -- Best Practices#
When defining your own metrics, follow these principles:
- Naming convention -- use the format namespace_subsystem_name_unit. Example: myapp_orders_processing_duration_seconds
- Use appropriate types -- Counter for events (requests, errors), Gauge for states (connections, temperature), Histogram for distributions (latency)
- Do not overuse labels -- each unique label combination creates a separate time series. Avoid high-cardinality labels (e.g., user_id, request_id)
- Use base SI units -- seconds instead of milliseconds, bytes instead of megabytes
- RED and USE methodologies:
- RED (Rate, Errors, Duration) -- for services
- USE (Utilization, Saturation, Errors) -- for infrastructure resources
- Document your metrics -- every metric should have a descriptive help text explaining what it measures
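The naming and unit rules are mechanical enough to lint in CI. A toy checker for the conventions above (an illustrative sketch, not a Prometheus API -- the accepted unit suffixes are a deliberately short sample):

```javascript
// Toy check for the namespace_subsystem_name_unit convention:
// lowercase snake_case, ending in a base unit or _total for counters.
// The suffix list here is a small illustrative sample.
function isWellNamedMetric(name) {
  const snakeCase = /^[a-z][a-z0-9_]*$/.test(name);
  const hasUnitOrTotal = /_(seconds|bytes|total|ratio|celsius)$/.test(name);
  return snakeCase && hasUnitOrTotal;
}

console.log(isWellNamedMetric('myapp_orders_processing_duration_seconds')); // true
console.log(isWellNamedMetric('ResponseTimeMs')); // false: camel case, non-base unit
```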
Comparison with Commercial Solutions#
| Aspect | Prometheus + Grafana | Datadog | New Relic |
|--------|---------------------|---------|-----------|
| Cost | Free (open-source) | From $15/host/month | From $25/host/month |
| Hosting | Self-hosted | SaaS | SaaS |
| Scalability | Requires Thanos/Cortex/Mimir | Built-in | Built-in |
| Learning curve | Steep (PromQL, YAML) | Moderate | Gentle |
| Integrations | Hundreds of exporters | 700+ native | 500+ native |
| Data retention | Configurable (local) | 15 months (Pro plan) | 8 days (Free) |
| Alerts | Alertmanager | Built-in | Built-in |
| APM / Traces | Requires additional tools | Built-in | Built-in |
| Logs | Loki (separate deployment) | Built-in | Built-in |
When to choose Prometheus + Grafana:
- Cost control is a priority -- no per-host or per-metric-volume fees
- You need full control over your data and infrastructure
- You have experience with Kubernetes and DevOps
- You want to avoid vendor lock-in
- You have a team capable of maintaining monitoring infrastructure
When to consider Datadog or New Relic:
- No DevOps team to maintain monitoring infrastructure
- You need an "all-in-one" solution with APM, logs, and traces in one place
- Fast deployment is more important than cost
- You need advanced APM with automatic instrumentation
Other alternatives worth considering include Elastic Stack (ELK) for logs and metrics, Zabbix for traditional infrastructure monitoring, and VictoriaMetrics as a drop-in replacement for Prometheus with better performance and compression.
Monitoring Best Practices#
1. Monitor What Matters to the User#
Start with business metrics and user experience (latency, availability), then work your way down to infrastructure level. If the user is not experiencing a problem, an alert should not wake anyone up at night.
2. Alert on Symptoms, Not Causes#
Alert on "error rate > 5%" instead of "CPU > 90%". High CPU does not always mean a problem -- high error rate always does.
3. Use Multi-Level Dashboards#
- Executive -- SLA, uptime, key business metrics
- Service -- Golden Signals per service
- Infrastructure -- CPU, memory, disk, network
- Debug -- detailed metrics for diagnostics
4. Retention and Compression#
Store high-resolution data (15s) for 2 weeks, downsampled data (5min) for 6 months, and heavily aggregated data (1h) for longer. Thanos, Cortex, or Mimir help with long-term storage.
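Capacity planning for that retention follows the rule of thumb from the Prometheus storage documentation: disk ≈ retention_seconds × ingested_samples_per_second × bytes_per_sample, with compressed samples typically taking around 1-2 bytes each. A quick back-of-the-envelope helper (illustrative, actual usage varies with churn and compression):

```javascript
// Rough Prometheus disk estimate:
// retention_seconds * samples_per_second * bytes_per_sample (~1-2 bytes).
function estimateDiskGB(retentionDays, activeSeries, scrapeIntervalSec, bytesPerSample = 2) {
  const samplesPerSecond = activeSeries / scrapeIntervalSec;
  const bytes = retentionDays * 86400 * samplesPerSecond * bytesPerSample;
  return bytes / 1e9;
}

// 100k active series scraped every 15s, kept for 30 days:
console.log(estimateDiskGB(30, 100000, 15).toFixed(1)); // "34.6" GB
```

Numbers like this make it obvious why downsampling and a long-term store (Thanos, Cortex, Mimir) matter once series counts reach the millions.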
5. Regularly Test Your Alerts#
Alerts that have never fired may not work when needed. Regularly conduct "fire drills" -- simulated incidents that test the entire alerting pipeline.
6. Apply Infrastructure as Code#
Keep Prometheus configurations, alert rules, and Grafana dashboards in a Git repository. Use provisioning instead of manual configuration through the UI.
Summary#
Prometheus and Grafana form a powerful, mature, and battle-tested monitoring ecosystem. The pull model, PromQL language, rich exporter ecosystem, and native Kubernetes integration make them the natural choice for DevOps and SRE teams.
Key takeaways:
- Start with the three pillars of observability and the four golden signals
- Instrument your applications from day one -- adding metrics later is harder
- Set alerts on symptoms visible to the user
- Build multi-level dashboards -- from the big picture down to details
- Plan retention and scalability from the start
- Keep configurations in Git as Infrastructure as Code
Need Professional Monitoring?#
MDS Software Solutions Group specializes in implementing complete observability solutions. From configuring Prometheus and Grafana, through designing alerts, to integration with CI/CD pipelines -- we will help you build monitoring that truly protects your systems.
Our team has experience in monitoring .NET, Node.js, Java, and Python applications, both in Docker Compose and Kubernetes environments. We design dashboards that deliver valuable insights, not noise.
Contact us to discuss monitoring for your infrastructure.
Team of programming experts specializing in modern web technologies.