# Monitoring & Alerting

The UBU Finance backend implements comprehensive monitoring, logging, and alerting to provide real-time visibility into system health and security events.

## Overview
Monitoring and alerting are crucial for maintaining the security and reliability of the UBU Finance backend. The system collects metrics, logs events, and sends alerts when potential issues are detected.
## Components

### Prometheus

Prometheus is used for metrics collection and storage. It scrapes metrics from the application and other services at regular intervals and stores them in a time-series database.

### Grafana

Grafana is used for visualization and dashboarding. It provides a user-friendly interface for viewing metrics and creating alerts.

### Alertmanager

Alertmanager handles alerts from Prometheus and routes them to the appropriate receiver (email, Slack, etc.).

### Node Exporter

Node Exporter collects host-level metrics such as CPU, memory, and disk usage.

### cAdvisor

cAdvisor collects container metrics for monitoring Docker containers.
## Metrics

The UBU Finance backend collects the following metrics; an instrumentation sketch follows the lists below.

### HTTP Metrics

- Request count by endpoint and method
- Request duration by endpoint
- Error count by endpoint and status code

### Security Metrics

- Login attempts (success, failure)
- Rate limit hits
- Account lockouts
- Token blacklist hits
- IP whitelist blocks

### Business Metrics

- Transaction count by type
- Loan count by status
- Active users

### System Metrics

- CPU usage
- Memory usage
- Disk usage
- Network traffic
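As a rough illustration of how such metrics could be exposed, the sketch below uses the Python prometheus_client library. The metric names, labels, and port are assumptions for illustration, not the backend's actual instrumentation.

```python
# Illustrative sketch only: metric names, labels, and the port are assumptions.
import time

from prometheus_client import Counter, Histogram, start_http_server

# HTTP metrics
REQUESTS = Counter("http_requests_total", "Request count", ["endpoint", "method"])
DURATION = Histogram("http_request_duration_seconds", "Request duration", ["endpoint"])
ERRORS = Counter("http_errors_total", "Error count", ["endpoint", "status_code"])

# Security metrics
LOGIN_ATTEMPTS = Counter("login_attempts_total", "Login attempts", ["result"])
RATE_LIMIT_HITS = Counter("rate_limit_hits_total", "Rate limit hits")


def record_request(endpoint: str, method: str, status_code: int, duration_seconds: float) -> None:
    """Record one handled HTTP request."""
    REQUESTS.labels(endpoint=endpoint, method=method).inc()
    DURATION.labels(endpoint=endpoint).observe(duration_seconds)
    if status_code >= 400:
        ERRORS.labels(endpoint=endpoint, status_code=str(status_code)).inc()


if __name__ == "__main__":
    # Expose the metrics on :8000/metrics so Prometheus can scrape them.
    start_http_server(8000)
    record_request("/api/v1/loans", "GET", 200, 0.042)
    LOGIN_ATTEMPTS.labels(result="success").inc()
    while True:  # keep the process alive so the endpoint can be scraped
        time.sleep(5)
```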
## Logging

The UBU Finance backend uses structured logging to record events in a format that is easy to parse and analyze. Logs are written to files and can be forwarded to a centralized logging system.

### Log Levels
- DEBUG: Detailed information for debugging
- INFO: General information about system operation
- WARNING: Potential issues that don't affect normal operation
- ERROR: Errors that affect a specific operation
- CRITICAL: Critical errors that affect the entire system
### Log Format

Logs are formatted as JSON for easy parsing:

```json
{
  "timestamp": "2023-05-23T12:34:56.789Z",
  "level": "INFO",
  "logger": "app.auth.authentication",
  "message": "User logged in successfully",
  "user_id": "123e4567-e89b-12d3-a456-426614174000",
  "ip": "192.168.1.1"
}
```
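A minimal sketch of how records in this shape could be produced with Python's standard logging module; the formatter below is an illustration, not necessarily the one the backend uses.

```python
# Illustrative sketch: the backend's actual JSON formatter may differ.
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render log records as single-line JSON objects."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Carry through structured context passed via the `extra` argument.
        for key in ("user_id", "ip"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app.auth.authentication")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info(
    "User logged in successfully",
    extra={"user_id": "123e4567-e89b-12d3-a456-426614174000", "ip": "192.168.1.1"},
)
```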
## Alerting

Alerts are triggered when certain conditions are met, such as high error rates or system resource exhaustion. Alerts can be sent via email, Slack, or webhooks. A sketch of corresponding Prometheus alert rules follows the list below.

### Alert Types
- High Error Rate: Triggered when the error rate exceeds a threshold
- High Response Time: Triggered when response times exceed a threshold
- High Rate Limit Hits: Triggered when rate limit hits exceed a threshold
- High Account Lockouts: Triggered when account lockouts exceed a threshold
- System Resource Exhaustion: Triggered when CPU, memory, or disk usage exceeds a threshold
- Service Down: Triggered when a service is unreachable
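The conditions above map onto Prometheus alerting rules, which belong in the rules.yml file referenced by the Prometheus configuration later in this document. The sketch below shows two such rules; the metric names and thresholds are illustrative assumptions, not the project's actual rules.

```yaml
# Illustrative rules only: metric names and thresholds are assumptions.
groups:
  - name: ubu-finance-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_errors_total[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High HTTP error rate"
          description: "More than 5% of requests failed over the last 5 minutes."
      - alert: ServiceDown
        expr: up{job="ubu-finance-backend"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "UBU Finance backend is unreachable"
          description: "Prometheus has not been able to scrape the backend for 1 minute."
```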
## Setup

### Docker Compose

The monitoring infrastructure is set up using Docker Compose. The docker-compose.yml file includes all the necessary services:
```yaml
version: '3'

services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./monitoring/prometheus:/etc/prometheus
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/etc/prometheus/console_libraries'
      - '--web.console.templates=/etc/prometheus/consoles'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:latest
    volumes:
      - ./monitoring/alertmanager:/etc/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
    environment:
      - GF_SECURITY_ADMIN_USER=admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
    ports:
      - "3000:3000"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--collector.filesystem.ignored-mount-points=^/(sys|proc|dev|host|etc)($$|/)'
    ports:
      - "9100:9100"
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:latest
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
```
### Starting the Monitoring Stack

To start the monitoring stack, run the following from the directory containing docker-compose.yml:
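This assumes the classic docker-compose CLI; with the Docker Compose v2 plugin, use `docker compose up -d` instead.

```bash
docker-compose up -d
```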
### Accessing the Dashboards
- Grafana: http://localhost:3000 (default credentials: admin/admin)
- Prometheus: http://localhost:9090
- Alertmanager: http://localhost:9093
## Client Implementation Examples

### Checking System Health

The examples below show equivalent health-check and metrics clients in Python, JavaScript, Bash, and C#.
```python
import requests


def check_system_health(base_url):
    """Call the backend's /health endpoint and summarize the result."""
    url = f"{base_url}/health"
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return {
                "status": "healthy",
                "details": response.json()
            }
        else:
            return {
                "status": "unhealthy",
                "details": response.json() if response.text else {"error": "No response"}
            }
    except Exception as e:
        return {
            "status": "error",
            "details": {"error": str(e)}
        }


def get_metrics(base_url):
    """Fetch /metrics and parse the Prometheus exposition format into a dict."""
    url = f"{base_url}/metrics"
    try:
        response = requests.get(url)
        if response.status_code == 200:
            # Parse Prometheus metrics format
            metrics = {}
            for line in response.text.split('\n'):
                if line and not line.startswith('#'):
                    parts = line.split(' ')
                    if len(parts) >= 2:
                        metrics[parts[0]] = float(parts[1])
            return metrics
        else:
            return {"error": "Failed to get metrics"}
    except Exception as e:
        return {"error": str(e)}
```
```javascript
async function checkSystemHealth(baseUrl) {
  const url = `${baseUrl}/health`;
  try {
    const response = await fetch(url);
    if (response.ok) {
      const data = await response.json();
      return {
        status: "healthy",
        details: data
      };
    } else {
      const data = await response.json().catch(() => ({ error: "No response" }));
      return {
        status: "unhealthy",
        details: data
      };
    }
  } catch (error) {
    return {
      status: "error",
      details: { error: error.message }
    };
  }
}

async function getMetrics(baseUrl) {
  const url = `${baseUrl}/metrics`;
  try {
    const response = await fetch(url);
    if (response.ok) {
      const text = await response.text();
      // Parse Prometheus metrics format
      const metrics = {};
      text.split('\n').forEach(line => {
        if (line && !line.startsWith('#')) {
          const parts = line.split(' ');
          if (parts.length >= 2) {
            metrics[parts[0]] = parseFloat(parts[1]);
          }
        }
      });
      return metrics;
    } else {
      return { error: "Failed to get metrics" };
    }
  } catch (error) {
    return { error: error.message };
  }
}
```
```bash
#!/bin/bash

check_system_health() {
  local base_url=$1
  response=$(curl -s -w "%{http_code}" "$base_url/health")
  http_code=${response: -3}
  content=${response:0:${#response}-3}

  if [ "$http_code" == "200" ]; then
    echo "System is healthy"
    echo "$content"
  else
    echo "System is unhealthy"
    echo "$content"
  fi
}

get_metrics() {
  local base_url=$1
  response=$(curl -s -w "%{http_code}" "$base_url/metrics")
  http_code=${response: -3}
  content=${response:0:${#response}-3}

  if [ "$http_code" == "200" ]; then
    echo "$content"
  else
    echo "Failed to get metrics"
  fi
}
```
```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;
using Newtonsoft.Json;

public class MonitoringClient
{
    private readonly HttpClient _client;
    private readonly string _baseUrl;

    public MonitoringClient(string baseUrl)
    {
        _client = new HttpClient();
        _baseUrl = baseUrl;
    }

    public class HealthStatus
    {
        public string status { get; set; }
        public object details { get; set; }
    }

    public async Task<HealthStatus> CheckSystemHealthAsync()
    {
        string url = $"{_baseUrl}/health";
        try
        {
            HttpResponseMessage response = await _client.GetAsync(url);
            string content = await response.Content.ReadAsStringAsync();

            if (response.IsSuccessStatusCode)
            {
                object details = JsonConvert.DeserializeObject(content);
                return new HealthStatus
                {
                    status = "healthy",
                    details = details
                };
            }
            else
            {
                object details;
                try
                {
                    details = JsonConvert.DeserializeObject(content);
                }
                catch
                {
                    details = new { error = "No response" };
                }

                return new HealthStatus
                {
                    status = "unhealthy",
                    details = details
                };
            }
        }
        catch (Exception ex)
        {
            return new HealthStatus
            {
                status = "error",
                details = new { error = ex.Message }
            };
        }
    }

    public async Task<Dictionary<string, double>> GetMetricsAsync()
    {
        string url = $"{_baseUrl}/metrics";
        try
        {
            HttpResponseMessage response = await _client.GetAsync(url);
            if (response.IsSuccessStatusCode)
            {
                string content = await response.Content.ReadAsStringAsync();

                // Parse Prometheus metrics format
                var metrics = new Dictionary<string, double>();
                foreach (string line in content.Split('\n'))
                {
                    if (!string.IsNullOrEmpty(line) && !line.StartsWith("#"))
                    {
                        string[] parts = line.Split(' ');
                        if (parts.Length >= 2 && double.TryParse(parts[1], out double value))
                        {
                            metrics[parts[0]] = value;
                        }
                    }
                }
                return metrics;
            }
            else
            {
                return new Dictionary<string, double> { { "error", 1 } };
            }
        }
        catch (Exception)
        {
            return new Dictionary<string, double> { { "error", 1 } };
        }
    }
}
```
## Configuration

### Prometheus Configuration

The Prometheus configuration is defined in monitoring/prometheus/prometheus.yml:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  scrape_timeout: 10s

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

rule_files:
  - rules.yml

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']

  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']

  - job_name: 'ubu-finance-backend'
    metrics_path: /metrics
    static_configs:
      - targets: ['host.docker.internal:8080']
```
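To verify that the scrape jobs defined above are healthy, you can query the Prometheus HTTP API. A small sketch in Python, assuming Prometheus is reachable on localhost:9090 as in the Compose file:

```python
# Report any scrape targets that Prometheus does not consider "up".
import requests


def list_unhealthy_targets(prometheus_url: str = "http://localhost:9090"):
    resp = requests.get(f"{prometheus_url}/api/v1/targets", timeout=5)
    resp.raise_for_status()
    targets = resp.json()["data"]["activeTargets"]
    return [t["scrapeUrl"] for t in targets if t["health"] != "up"]


if __name__ == "__main__":
    down = list_unhealthy_targets()
    print("All targets up" if not down else f"Unhealthy targets: {down}")
```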
### Alertmanager Configuration

The Alertmanager configuration is defined in monitoring/alertmanager/alertmanager.yml:

```yaml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.gmail.com:587'
  smtp_from: 'alerts@ubufinance.com'
  smtp_auth_username: 'alerts@ubufinance.com'
  smtp_auth_password: 'your-app-password'
  smtp_require_tls: true

route:
  receiver: 'email-notifications'
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  routes:
    - match:
        severity: critical
      receiver: 'all-notifications'
      continue: true
    - match:
        severity: warning
      receiver: 'email-notifications'

inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'cluster', 'service']

receivers:
  - name: 'email-notifications'
    email_configs:
      - to: 'admin@ubufinance.com'
        send_resolved: true

  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts'
        send_resolved: true
        title: |-
          [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
        text: >-
          {{ range .Alerts -}}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          *Time:* {{ .StartsAt }}
          {{ end }}

  - name: 'all-notifications'
    email_configs:
      - to: 'admin@ubufinance.com,oncall@ubufinance.com'
        send_resolved: true
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/T00000000/B00000000/XXXXXXXXXXXXXXXXXXXXXXXX'
        channel: '#alerts'
        send_resolved: true
        title: |-
          [{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}] {{ .CommonLabels.alertname }}
        text: >-
          {{ range .Alerts -}}
          *Alert:* {{ .Annotations.summary }}
          *Description:* {{ .Annotations.description }}
          *Severity:* {{ .Labels.severity }}
          *Time:* {{ .StartsAt }}
          {{ end }}
```
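Routing and receivers can be exercised end to end by posting a synthetic alert to Alertmanager's v2 API. A sketch in Python, assuming Alertmanager is reachable on localhost:9093 as in the Compose file; the label values are illustrative:

```python
# Inject a short-lived test alert to verify Alertmanager routing and receivers.
from datetime import datetime, timedelta, timezone

import requests


def send_test_alert(alertmanager_url: str = "http://localhost:9093") -> None:
    now = datetime.now(timezone.utc)
    alerts = [{
        "labels": {
            "alertname": "TestAlert",          # illustrative label values
            "severity": "warning",
            "service": "ubu-finance-backend",
        },
        "annotations": {
            "summary": "Test alert",
            "description": "Manually injected to verify alert routing.",
        },
        "startsAt": now.isoformat(),
        "endsAt": (now + timedelta(minutes=5)).isoformat(),
    }]
    resp = requests.post(f"{alertmanager_url}/api/v2/alerts", json=alerts, timeout=5)
    resp.raise_for_status()


if __name__ == "__main__":
    send_test_alert()
```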
### Grafana Configuration

Grafana is configured with datasources and dashboards in the monitoring/grafana/provisioning directory (a sketch of the datasource file follows the list below):

- datasources/datasource.yml: Configures Prometheus as a datasource
- dashboards/dashboard.yml: Configures dashboard provisioning
- dashboards/json/ubu_finance_overview.json: Defines the UBU Finance overview dashboard
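As an illustration, a provisioned Prometheus datasource file typically looks like the sketch below; the exact contents of the project's datasource.yml may differ.

```yaml
# Illustrative sketch; the project's actual datasource.yml may differ.
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
```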
## Best Practices

- Monitor Critical Metrics: Focus on the metrics that are critical to your application's health and security.
- Set Appropriate Thresholds: Choose alert thresholds that balance sensitivity and noise.
- Use Structured Logging: Structured logs are easier to parse and analyze.
- Implement Log Rotation: Rotate log files so they don't grow without bound (see the sketch after this list).
- Secure Monitoring Endpoints: Restrict access to monitoring endpoints to prevent unauthorized access.
- Regularly Review Alerts: Periodically review alerts to ensure they are still relevant and effective.
- Document Alert Procedures: Document response procedures for each type of alert.
- Test Alert Channels: Regularly test alert channels to confirm they are working correctly.
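For the log rotation recommendation, a minimal sketch using Python's standard RotatingFileHandler; the file path and rotation limits are assumptions.

```python
# Illustrative sketch: the log path and rotation limits are assumptions.
import logging
from logging.handlers import RotatingFileHandler

# Rotate at ~10 MB and keep five old files alongside the active one.
handler = RotatingFileHandler("app.log", maxBytes=10 * 1024 * 1024, backupCount=5)
# In practice this handler would be paired with the JSON formatter shown in the Logging section.
handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(name)s %(message)s"))

logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("Log rotation configured")
```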