Status & Uptime
How we monitor production and respond to incidents
High uptime is the foundation of reliable applications. Here we explain how we monitor production, which tools we use, and how we respond to problems before they impact users.
Our SLOs (Service Level Objectives)
Measurable service quality goals
SLOs are concrete goals we commit to maintaining. They aren't vague promises but measurable metrics we report on regularly.
Uptime (availability)
Maximum 43 minutes of downtime per month, which corresponds to roughly 99.9% monthly availability (see the error-budget calculation below). Planned maintenance windows don't count towards this time.
Response Time
95% of API requests must be handled in less than 500ms. For HTML pages: < 1.5s Time to First Byte.
Error Rate
Less than 1 failed request per 1,000. 5xx errors are treated as a priority and fixed immediately.
Incident Response
Up to 15 minutes from incident detection to starting work on resolution (matching the on-call response time described below). On-call available 24/7 for critical systems.
SLOs can be tailored to the specifics of each project. The values above are our standard for production applications.
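To make the uptime target concrete: 43 minutes of downtime corresponds to roughly 99.9% availability in a 30-day month. A minimal sketch of the error-budget arithmetic, assuming a 30-day month (the function and constants below are purely illustrative, not part of our tooling):

```typescript
// Error-budget arithmetic behind the uptime SLO above (illustrative only).
const MINUTES_PER_MONTH = 30 * 24 * 60; // 43,200 minutes in a 30-day month

/** Allowed downtime in minutes for a given availability target (e.g. 0.999). */
function errorBudgetMinutes(availabilityTarget: number): number {
  return MINUTES_PER_MONTH * (1 - availabilityTarget);
}

console.log(errorBudgetMinutes(0.999));  // ~43.2 minutes per month
console.log(errorBudgetMinutes(0.9995)); // ~21.6 minutes per month
```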
Monitoring and alerting stack
Tools that give us full insight into application health
Synthetic Monitoring
Uptime Robot / Pingdom
Checking application availability every 1-5 minutes from different geographic locations. SMS/email alerts in case of unavailability.
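For illustration, this is roughly what a synthetic check does under the hood; the health-check URL and timeout are placeholder assumptions, and in practice Uptime Robot / Pingdom handle the scheduling and multi-region probing for us:

```typescript
// Illustrative synthetic probe: request a health endpoint, fail on errors or timeouts.
async function checkHealth(url: string, timeoutMs = 10_000): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const res = await fetch(url, { signal: controller.signal });
    return res.ok; // 2xx means the application answered in time
  } catch {
    return false; // timeout, DNS failure, or connection error
  } finally {
    clearTimeout(timer);
  }
}

// One probe; the real services repeat this every 1-5 minutes from several regions.
checkHealth("https://example.com/health").then((ok) => {
  if (!ok) console.error("Health check failed - raise SMS/email alert");
});
```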
Application Performance Monitoring (APM)
Application Insights / New Relic
Performance tracking: response times, CPU/RAM load, slow SQL queries. Distributed tracing for microservices.
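As a rough illustration of what the APM agent measures automatically, here is a hand-rolled request-timing sketch for a Node.js server; the 500 ms threshold mirrors the p95 SLO above, and the server itself is a placeholder:

```typescript
import { createServer } from "node:http";

// Illustrative only: an APM agent (Application Insights, New Relic) instruments
// this automatically; here we time each request by hand and flag slow ones.
const SLOW_THRESHOLD_MS = 500; // matches the p95 response-time SLO

const server = createServer((req, res) => {
  const start = process.hrtime.bigint();
  res.on("finish", () => {
    const elapsedMs = Number(process.hrtime.bigint() - start) / 1_000_000;
    if (elapsedMs > SLOW_THRESHOLD_MS) {
      // A real APM would attach a full distributed trace; we only log the outlier.
      console.warn(`Slow request: ${req.method} ${req.url} took ${elapsedMs.toFixed(0)} ms`);
    }
  });
  res.end("ok");
});

server.listen(3000);
```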
Error Tracking
Sentry / Azure Monitor
Automatic collection of JavaScript, .NET, and SQL errors. Stack traces, user context, breadcrumbs. Real-time alerting.
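A typical Sentry setup for a Node.js backend looks roughly like the sketch below; the DSN, environment, and the chargePayment function are placeholders, and the sample rate is an example value rather than our standard:

```typescript
import * as Sentry from "@sentry/node";

// Placeholder configuration - real values come from environment-specific config.
Sentry.init({
  dsn: process.env.SENTRY_DSN,
  environment: process.env.NODE_ENV ?? "production",
  tracesSampleRate: 0.1, // sample 10% of transactions for performance data
});

// chargePayment is a made-up example of application code that may fail.
function chargePayment(): void {
  throw new Error("payment gateway timeout");
}

try {
  chargePayment();
} catch (err) {
  // The error arrives in Sentry with its stack trace plus any user context and breadcrumbs.
  Sentry.captureException(err);
}
```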
Logs Aggregation
Azure Log Analytics / ELK Stack
Centralized logs from all environments. Search, filtering, event correlation. 30-90 day retention.
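Event correlation works best when every log line carries a shared correlation ID. A small sketch of structured (JSON) logging; the field names are an example schema, not a fixed contract in our projects:

```typescript
import { randomUUID } from "node:crypto";

// Illustrative structured logging: JSON lines with a correlation ID let Azure Log
// Analytics or ELK group every event belonging to one request or business operation.
type Level = "info" | "warn" | "error";

function logEvent(level: Level, message: string, correlationId: string, extra: Record<string, unknown> = {}): void {
  console.log(JSON.stringify({
    timestamp: new Date().toISOString(),
    level,
    message,
    correlationId,
    ...extra,
  }));
}

const correlationId = randomUUID();
logEvent("info", "Order received", correlationId, { orderId: 1234 });
logEvent("error", "Payment gateway timeout", correlationId, { gateway: "example" });
```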
Infrastructure Monitoring
Azure Monitor / Datadog
Monitoring of servers, databases, queues, and cache. Metrics: CPU, RAM, disk I/O, network throughput.
Real User Monitoring (RUM)
Google Analytics 4 / Plausible
Core Web Vitals (LCP, CLS, INP), JS errors, navigation. Privacy-friendly analytics.
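Collecting Core Web Vitals in the browser can be as small as the sketch below (using the open-source web-vitals library); the /rum endpoint is a placeholder for wherever the metrics are shipped, whether GA4, Plausible, or a custom collector:

```typescript
import { onLCP, onCLS, onINP } from "web-vitals";

// Ships each Core Web Vitals metric to a collector endpoint (placeholder URL).
function report(metric: { name: string; value: number }): void {
  navigator.sendBeacon("/rum", JSON.stringify({ name: metric.name, value: metric.value }));
}

onLCP(report); // Largest Contentful Paint
onCLS(report); // Cumulative Layout Shift
onINP(report); // Interaction to Next Paint
```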
Alert examples and escalation policy
Automatic alerts in case of problems
We configure alerts to surface problems before they impact users. Each alert has defined thresholds and an escalation procedure; a sketch of how one such threshold check works follows the examples below.
HTTP 5xx > 1% for 5 minutes
Critical: Immediate SMS + email alert to the on-call engineer. Automatic diagnostic playbook execution.
Response Time p95 > 2s for 10 minutes
High: Email + Slack alert to the team. Analysis of slow queries and bottlenecks. Escalation after 30 minutes.
Uptime check failed (3 attempts)
Critical: SMS + email alert. Health endpoint check, reboot if that doesn't help. Client notification.
Disk usage > 85%
Medium: Email alert. Cleanup of old logs/temp files. Capacity increase planning.
Failed logins > 10 within a minute
High: Potential brute-force attack. Temporary IP block. Security team alert.
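To show how such a rule translates into an automated check, here is a sketch of the first rule above (HTTP 5xx > 1% sustained for 5 minutes). The data structures and function are illustrative; in practice the monitoring platform evaluates the thresholds for us:

```typescript
// Per-minute request counters, as an alerting system might aggregate them.
interface MinuteBucket {
  total: number;        // all requests in that minute
  serverErrors: number; // 5xx responses in that minute
}

// Page the on-call engineer only when every one of the last 5 minutes breaches 1%.
function shouldPageOnCall(lastFiveMinutes: MinuteBucket[], maxErrorRate = 0.01): boolean {
  return (
    lastFiveMinutes.length >= 5 &&
    lastFiveMinutes.every((b) => b.total > 0 && b.serverErrors / b.total > maxErrorRate)
  );
}

// Example: a 2% error rate sustained for 5 minutes triggers the critical alert.
const sampleWindow = Array.from({ length: 5 }, () => ({ total: 1000, serverErrors: 20 }));
console.log(shouldPageOnCall(sampleWindow)); // true
```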
Escalation policy
Level 1: On-call engineer
First line of response. Available 24/7 for critical incidents. Response time: up to 15 minutes.
Level 2: Senior engineer / Team lead
Escalation if the problem isn't resolved within 30 minutes or requires deeper knowledge. Rollback decisions.
Level 3: CTO / Architecture team
Escalation for systemic issues, infrastructure changes, strategic decisions. Client notification.
How we respond to incidents
Response process and communication
On-call and availability
For production applications, we maintain 24/7 on-call rotations. The on-call engineer has access to all systems and can make decisions about rollbacks or failovers.
- On-call rotation: weekly, with a backup person
- Alerts: SMS + email + Slack (redundancy)
- Remote access: VPN + jump server + MFA
- Playbooks: documented procedures for common problems
Incident handling process
1. Detection and alert
Automatic alerts from monitoring or report from user/client.
2. Triage (within 15 minutes)
Impact assessment, priority assignment, team notification.
3. Mitigation (immediate relief)
Rollback, restart, failover - whatever restores operation as quickly as possible.
4. Root cause analysis
Investigation of the underlying cause after service has been restored.
5. Fix and deploy
Permanent fix, staging tests, production deployment.
6. Post-mortem
Incident report: what happened, why, how we'll prevent it in the future.
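As an illustration, the data an incident leaves behind can be captured in a simple record that mirrors the six steps above; the field names below are an example shape, not our actual tooling schema:

```typescript
// Illustrative incident record following the process above.
interface IncidentRecord {
  id: string;
  severity: "critical" | "high" | "medium";
  detectedAt: Date;       // step 1: alert or user report
  triagedAt?: Date;       // step 2: impact assessment and priority
  mitigatedAt?: Date;     // step 3: rollback / restart / failover applied
  rootCause?: string;     // step 4: filled in after service is restored
  fixDeployedAt?: Date;   // step 5: permanent fix in production
  postMortemUrl?: string; // step 6: link to the written report
}

/** Time from detection to mitigation, in minutes (undefined until mitigated). */
function timeToMitigateMinutes(incident: IncidentRecord): number | undefined {
  if (!incident.mitigatedAt) return undefined;
  return (incident.mitigatedAt.getTime() - incident.detectedAt.getTime()) / 60_000;
}
```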
Maintenance windows
We perform planned maintenance work during established time windows to minimize impact on users.
Standard window
Monday-Thursday, 10:00 PM-2:00 AM CET. Minimal risk of user impact.
Advance notification
At least 48 hours before the planned window. Email + status page + in-app banner.
Zero-downtime deployment
When possible, we use blue-green deployments or canary releases with no downtime (a promotion-gate sketch follows this list).
Rollback plan
Each maintenance window has a prepared emergency rollback plan.
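A blue-green or canary rollout ultimately comes down to a promotion decision: shift traffic only when the new version looks healthy. Below is a simplified sketch of that gate; the thresholds and the CanaryStats shape are assumptions for illustration, not our production tooling:

```typescript
// Health and traffic statistics gathered from the new (canary) version.
interface CanaryStats {
  healthy: boolean;     // result of the /health probe on the new version
  requests: number;     // requests the canary has served so far
  serverErrors: number; // 5xx responses from the canary
}

// Promote only when the canary is healthy, has seen enough traffic, and stays within the error budget.
function canPromote(stats: CanaryStats, minRequests = 500, maxErrorRate = 0.001): boolean {
  if (!stats.healthy) return false;               // fail fast on an unhealthy instance
  if (stats.requests < minRequests) return false; // not enough traffic to judge yet
  return stats.serverErrors / stats.requests <= maxErrorRate;
}

// Example: 1,000 canary requests with a single 5xx response - safe to shift the remaining traffic.
console.log(canPromote({ healthy: true, requests: 1000, serverErrors: 1 })); // true
```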
Related information
SLA and Security
Detailed SLA response rules, security practices, code review and change management.
See our projects
Examples of applications we monitor and maintain for clients.
Questions about monitoring and uptime?
Let's talk about how we can ensure your application's reliability
Contact us