
Status & Uptime

How we monitor production and respond to incidents

High uptime is the foundation of reliable applications. Below we explain how we monitor production, which tools we use, and how we respond to problems before they impact users.

Our SLOs (Service Level Objectives)

Measurable service quality goals

SLOs are concrete targets we commit to maintaining. They aren't vague promises but measurable metrics we report on regularly.

Uptime (availability)

99.9% per month

A maximum of about 43 minutes of downtime per month (0.1% of a 30-day month). Planned maintenance windows don't count towards this time.
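For intuition, here is the arithmetic behind that number, as a back-of-the-envelope sketch assuming a 30-day month:

    // Monthly error budget for a 99.9% uptime SLO (assumes a 30-day month).
    const minutesPerMonth = 30 * 24 * 60; // 43,200 minutes
    const slo = 0.999;
    const downtimeBudgetMinutes = minutesPerMonth * (1 - slo);
    console.log(downtimeBudgetMinutes.toFixed(1)); // "43.2"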

Response Time

< 500ms (p95)

95% of API requests must be handled in less than 500ms. For HTML pages: < 1.5s Time to First Byte.
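As an illustration, this is one way a p95 value can be computed from a sample of request durations (nearest-rank method; real APM tools use streaming quantile estimators):

    // Nearest-rank percentile over a sample of request latencies (ms).
    // Illustrative only; APM tools estimate quantiles over streams.
    function percentile(samples: number[], p: number): number {
      const sorted = [...samples].sort((a, b) => a - b);
      const rank = Math.ceil((p / 100) * sorted.length) - 1;
      return sorted[Math.max(0, rank)];
    }

    const latencies = [120, 95, 480, 210, 1500, 330, 88, 450, 140, 600];
    console.log(percentile(latencies, 95)); // 1500 for this sample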

Error Rate

< 0.1%

Less than 1 failed request per 1000. 5xx errors are treated with priority and fixed immediately.

Incident Response

< 15 minutes

From incident detection to starting work on resolution. On-call available 24/7 for critical systems.

SLOs can be customized for project specifics. Above values are standard for production applications.

Monitoring and alerting stack

Tools that give us full insight into application health

Synthetic Monitoring

Uptime Robot / Pingdom

Checking application availability every 1-5 minutes from different geographic locations. SMS/email alerts in case of unavailability.
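For illustration, a minimal sketch of the kind of check such a monitor performs. In practice Uptime Robot/Pingdom run these checks for us as managed services; the URL and the 10-second timeout below are placeholders:

    // Minimal synthetic check: GET the endpoint, treat non-2xx, network
    // errors and timeouts as "down". URL and timeout are placeholders.
    async function checkUptime(url: string): Promise<boolean> {
      const controller = new AbortController();
      const timer = setTimeout(() => controller.abort(), 10_000);
      try {
        const res = await fetch(url, { signal: controller.signal });
        return res.ok; // status in the 200-299 range
      } catch {
        return false; // network error or timeout
      } finally {
        clearTimeout(timer);
      }
    }

    checkUptime("https://example.com/health").then((up) =>
      console.log(up ? "UP" : "DOWN"),
    );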

Application Performance Monitoring (APM)

Application Insights / New Relic

Performance tracking: response times, CPU/RAM load, slow SQL queries. Distributed tracing for microservices.
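To show how distributed tracing is wired into application code, here is a hedged sketch of a custom span created with the OpenTelemetry API. It assumes an SDK and exporter (e.g. to Application Insights or New Relic) are configured elsewhere; the tracer and span names are illustrative:

    import { trace, SpanStatusCode } from "@opentelemetry/api";

    // Wrap a unit of work in a custom span so it appears in the
    // distributed trace; names below are illustrative.
    const tracer = trace.getTracer("orders-service");

    async function processOrder(orderId: string): Promise<void> {
      await tracer.startActiveSpan("process-order", async (span) => {
        try {
          span.setAttribute("order.id", orderId);
          // ... business logic; outgoing HTTP/SQL calls inherit this context
        } catch (err) {
          span.recordException(err as Error);
          span.setStatus({ code: SpanStatusCode.ERROR });
          throw err;
        } finally {
          span.end();
        }
      });
    }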

Error Tracking

Sentry / Azure Monitor

Automatic collection of JavaScript, .NET, SQL errors. Stack trace, user context, breadcrumbs. Real-time alerting.
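A minimal bootstrap sketch for error tracking with the Sentry SDK; the DSN and the riskyOperation helper are placeholders:

    import * as Sentry from "@sentry/node";

    // Error-tracking bootstrap; the DSN below is a placeholder.
    Sentry.init({
      dsn: "https://examplePublicKey@o0.ingest.sentry.io/0",
      environment: process.env.NODE_ENV ?? "production",
      tracesSampleRate: 0.1, // sample 10% of transactions
    });

    function riskyOperation(): void {
      throw new Error("example failure"); // stand-in for real work
    }

    try {
      riskyOperation();
    } catch (err) {
      // Captured with stack trace, user context and breadcrumbs attached
      Sentry.captureException(err);
    }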

Logs Aggregation

Azure Log Analytics / ELK Stack

Centralized logs from all environments. Search, filtering, event correlation. 30-90 day retention.
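Centralized search and correlation work best when every log line is structured JSON; a small sketch using the pino logger, where the field names are our illustrative convention rather than a standard:

    import pino from "pino";

    // Structured JSON logs are easy to filter and correlate centrally.
    // Field names are illustrative, not a standard.
    const logger = pino({ level: "info" });

    logger.info(
      {
        requestId: "req-7f3a", // correlates all events of one request
        userId: 4217,
        durationMs: 182,
      },
      "order created",
    );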

Infrastructure Monitoring

Azure Monitor / Datadog

Monitoring of servers, databases, queues, and cache. Metrics: CPU, RAM, disk I/O, network throughput.

Real User Monitoring (RUM)

Google Analytics 4 / Plausible

Tracking real users' experience: Core Web Vitals (LCP, CLS, INP), JS errors, navigation timing. Privacy-friendly analytics.
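As a sketch of how these metrics can be collected in the browser, here is the open-source web-vitals library reporting to a collector; the /rum endpoint is a placeholder:

    import { onCLS, onINP, onLCP } from "web-vitals";

    // Report Core Web Vitals from real user sessions.
    // The /rum endpoint is a placeholder for whatever collector is used.
    function report(metric: { name: string; value: number }) {
      navigator.sendBeacon("/rum", JSON.stringify(metric));
    }

    onLCP(report);
    onCLS(report);
    onINP(report);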

Alert examples and escalation policy

Automatic alerts in case of problems

We configure alerts to inform about problems before they impact users. Each alert has defined thresholds and escalation procedures.

HTTP 5xx > 1% for 5 minutes

Critical

Immediate SMS + email alert to on-call. Automatic diagnostic playbook execution.
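The logic behind this rule is roughly the sketch below; in practice the monitoring backend evaluates it, not application code:

    // Evaluate the "HTTP 5xx > 1% for 5 minutes" rule over window counters.
    interface WindowStats {
      total: number;     // all requests in the 5-minute window
      errors5xx: number; // responses with status 500-599
    }

    function shouldPage(w: WindowStats): boolean {
      if (w.total === 0) return false;     // no traffic, no signal
      return w.errors5xx / w.total > 0.01; // > 1% error rate
    }

    console.log(shouldPage({ total: 12_000, errors5xx: 150 })); // true (1.25%)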

Response Time p95 > 2s for 10 minutes

High

Email + Slack alert to team. Analysis of slow queries and bottlenecks. Escalation after 30 minutes.

Uptime check failed (3 attempts)

Critical

SMS + email alert. Health endpoint check; service restart if the check keeps failing. Client notification.
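The health endpoint these checks call can be very simple; a sketch using Express, where the pingDatabase dependency check is hypothetical:

    import express from "express";

    const app = express();

    // Health endpoint used by uptime checks. pingDatabase is a hypothetical
    // placeholder; check whichever dependencies matter for the application.
    app.get("/health", async (_req, res) => {
      try {
        await pingDatabase();
        res.status(200).json({ status: "ok" });
      } catch {
        res.status(503).json({ status: "unhealthy" });
      }
    });

    async function pingDatabase(): Promise<void> {
      // e.g. run SELECT 1 against the primary database
    }

    app.listen(3000);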

Disk usage > 85%

Medium

Email alert. Cleanup of old logs/temp files. Planning capacity increase.

Failed logins > 10 within a minute

High

Potential brute-force attack. Temporary IP block. Security team alert.
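A sketch of the temporary-block logic, using an in-memory counter for illustration; production setups keep these counters in a shared store such as Redis:

    // Count failed logins per IP in a sliding 60-second window and flag the
    // IP for a temporary block once it exceeds 10. In-memory for
    // illustration only; use a shared store (e.g. Redis) in production.
    const failures = new Map<string, number[]>();

    function recordFailedLogin(ip: string, now = Date.now()): boolean {
      const windowStart = now - 60_000;
      const recent = (failures.get(ip) ?? []).filter((t) => t > windowStart);
      recent.push(now);
      failures.set(ip, recent);
      return recent.length > 10; // true => temporarily block this IP
    }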

Escalation policy

Level 1: On-call engineer

First line of response. Available 24/7 for critical incidents. Response time: up to 15 minutes.

Level 2: Senior engineer / Team lead

Escalation if the problem isn't resolved within 30 minutes or requires deeper expertise. Makes rollback decisions.

Level 3: CTO / Architecture team

Escalation for systemic issues, infrastructure changes, strategic decisions. Client notification.

How we respond to incidents

Response process and communication

On-call and availability

For production applications, we maintain 24/7 on-call rotations. The on-call engineer has access to all systems and can make decisions about rollbacks or failovers.

  • On-call rotation: weekly, with backup person
  • Alerts: SMS + email + Slack (redundancy)
  • Remote access: VPN + jump server + MFA
  • Playbooks: documented procedures for common problems

Incident handling process

1. Detection and alert

Automatic alerts from monitoring or a report from a user/client.

2. Triage (within 15 minutes)

Impact assessment, priority assignment, team notification.

3. Mitigation (immediate relief)

Rollback, restart, failover: whatever restores operation as quickly as possible.

4. Root cause analysis

Identifying the underlying cause once service is restored.

5. Fix and deploy

Permanent fix, staging tests, production deployment.

6. Post-mortem

Incident report: what happened, why, and how we'll prevent it in the future.
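Each post-mortem feeds a structured incident record; a sketch of the fields we find useful (the field set is illustrative, not a standard):

    // Minimal incident record behind a post-mortem report.
    // The field set is our illustrative convention.
    interface IncidentReport {
      id: string; // e.g. "INC-2024-031"
      detectedAt: Date;
      mitigatedAt: Date;
      resolvedAt: Date;
      severity: "critical" | "high" | "medium";
      rootCause: string;
      actionItems: string[]; // preventive follow-ups with owners
    }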

Maintenance windows

We perform planned maintenance work during established time windows to minimize impact on users.

Standard window

Monday-Thursday, 10:00 PM-2:00 AM CET. Minimal risk of user impact.

Advance notification

Minimum 48 hours before planned window. Email + status page + app banner.

Zero-downtime deployment

When possible, we use blue-green deployments or canary releases, with no downtime at all.
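For illustration, the routing decision behind a canary release can be as small as the sketch below; the 5% traffic share and the hash function are illustrative:

    // Canary routing: send a configurable share of users to the new version.
    // A sticky hash keeps each user on the same version across requests.
    const CANARY_SHARE = 0.05; // 5% of users, illustrative

    function routeToCanary(userId: string): boolean {
      let hash = 0;
      for (const ch of userId) hash = (hash * 31 + ch.charCodeAt(0)) >>> 0;
      return hash % 100 < CANARY_SHARE * 100;
    }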

Rollback plan

Each maintenance window has a prepared emergency rollback plan.

Related information

SLA and Security

Detailed SLA response rules, security practices, code review and change management.


See our projects

Examples of applications we monitor and maintain for clients.


Questions about monitoring and uptime?

Let's talk about how we can ensure your application's reliability

Contact us