Monitoring Stack
This document provides a comprehensive guide to the monitoring implementation in PTIIKInsight using Prometheus and Grafana for system observability and performance tracking.
Overview
The PTIIKInsight monitoring stack consists of:
Prometheus: Metrics collection and time-series database
Grafana: Visualization dashboards and alerting
FastAPI Instrumentator: Automatic metrics collection from API
Architecture
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   FastAPI App   │    │   Prometheus    │    │     Grafana     │
│   Port: 8000    │────┤   Port: 9090    │────┤   Port: 3000    │
│                 │    │                 │    │                 │
│    /metrics     │    │  Scrapes every  │    │     Queries     │
│    endpoint     │    │   15 seconds    │    │   Prometheus    │
└─────────────────┘    └─────────────────┘    └─────────────────┘
Implementation Details
1. FastAPI Metrics Integration
The FastAPI application automatically exposes metrics using the prometheus-fastapi-instrumentator library:
from fastapi import FastAPI
from prometheus_fastapi_instrumentator import Instrumentator
from prometheus_client import Counter, Histogram, Gauge
app = FastAPI(title="PTIIK Insight API", description="ML-powered topic analysis")
# Initialize Prometheus metrics
instrumentator = Instrumentator()
instrumentator.instrument(app).expose(app)
# Custom metrics for ML operations
model_predictions_total = Counter('model_predictions_total', 'Total number of model predictions')
model_prediction_errors_total = Counter('model_prediction_errors_total', 'Total number of prediction errors')
model_prediction_duration = Histogram('model_prediction_duration_seconds', 'Time spent on predictions')
model_accuracy = Gauge('model_accuracy', 'Current model accuracy')
scraping_requests_total = Counter('scraping_requests_total', 'Total number of scraping requests')
scraping_errors_total = Counter('scraping_errors_total', 'Total number of scraping errors')
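Continuing the snippet above, here is a minimal sketch of how these metrics might be updated inside a prediction endpoint. The /predict route, the PredictRequest model, and the run_topic_model helper are hypothetical illustrations, not part of the actual API:
from pydantic import BaseModel

class PredictRequest(BaseModel):
    texts: list[str]

@app.post("/predict")
async def predict(request: PredictRequest):
    model_predictions_total.inc()  # count every prediction attempt
    try:
        with model_prediction_duration.time():  # record wall-clock duration
            topics = run_topic_model(request.texts)  # placeholder for the real inference call
    except Exception:
        model_prediction_errors_total.inc()  # track failures separately
        raise
    return {"topics": topics}
Because the histogram is used as a context manager, the duration is observed even when the inference call raises.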
Available Metrics:
Standard HTTP metrics (request count, duration, status codes)
Custom ML metrics (predictions, errors, accuracy)
Application-specific metrics (scraping operations)
2. Prometheus Configuration
The Prometheus configuration file (monitoring/prometheus/prometheus.yml) defines how metrics are collected:
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "rules/*.yml"

scrape_configs:
  - job_name: 'fastapi-app'
    static_configs:
      - targets: ['fastapi-app:8000']
    metrics_path: '/metrics'
    scrape_interval: 15s
Key Configuration:
Scrape Interval: 15 seconds between metric collections
Target: FastAPI application on port 8000
Metrics Path: the /metrics endpoint
Rule Files: custom alerting rules loaded from the rules/ directory
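To confirm that Prometheus has registered the target defined above, you can query its HTTP API. The following is a sketch using the requests package; the URL assumes the default port mapping from the Docker Compose file:
import requests

# List every active scrape target and its health as seen by Prometheus
resp = requests.get("http://localhost:9090/api/v1/targets", timeout=5)
for target in resp.json()["data"]["activeTargets"]:
    print(target["labels"]["job"], target["scrapeUrl"], target["health"])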
3. Docker Compose Configuration
The monitoring services are defined in docker-compose.yml:
services:
  fastapi-app:
    build:
      context: .
      dockerfile: Dockerfile
    container_name: fastapi-scraper
    ports:
      - "8000:8000"
    volumes:
      - ./data:/app/data
      - ./model:/app/model
      - ./preprocessing:/app/preprocessing
      - ./api:/app/api
    restart: always

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./monitoring/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml
      - ./monitoring/prometheus/rules:/etc/prometheus/rules
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=200h"  # 200-hour retention (see Data Retention)
      - "--web.console.libraries=/etc/prometheus/console_libraries"
      - "--web.console.templates=/etc/prometheus/consoles"
      - "--web.enable-lifecycle"
    ports:
      - "9090:9090"
    restart: always

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    volumes:
      - grafana-storage:/var/lib/grafana  # persist dashboards and settings in the named volume
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning
      - ./monitoring/grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_USERS_DEFAULT_THEME=dark
      - GF_INSTALL_PLUGINS=grafana-piechart-panel
    depends_on:
      - prometheus
    restart: always

volumes:
  grafana-storage:
Monitoring Features
Available Metrics
HTTP Metrics (Automatic):
http_requests_total: Total HTTP requests by method and status
http_request_duration_seconds: Request processing time
http_request_size_bytes: Request payload size
http_response_size_bytes: Response payload size
Custom ML Metrics:
model_predictions_total: Count of topic predictions made
model_prediction_errors_total: Count of prediction failures
model_prediction_duration_seconds: Time spent on predictions
model_accuracy: Current model accuracy score
scraping_requests_total: Count of scraping operations
scraping_errors_total: Count of scraping failures
Grafana Dashboard Configuration
Grafana is pre-configured with:
Data Source: Prometheus connection
Provisioning: Automatic dashboard and data source setup
Theme: Dark theme by default
Plugins: Pie chart panel for better visualizations
Dashboard Features:
API request volume and response times
Error rate monitoring
ML model performance metrics
System resource utilization
Custom alerts and notifications
Alerting Rules
Custom alerting rules can be defined in monitoring/prometheus/rules/alerts.yml:
groups:
  - name: ptiik-insight-alerts
    rules:
      - alert: HighErrorRate
        expr: rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) > 0.1
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High error rate detected"
          description: "Error rate is above 10% for 5 minutes"
      - alert: ModelPredictionFailures
        expr: rate(model_prediction_errors_total[5m]) / rate(model_predictions_total[5m]) > 0.05
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Model prediction failures detected"
          description: "Model prediction error rate is above 5%"
      - alert: APIDown
        expr: up{job="fastapi-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "API service is down"
          description: "FastAPI service has been down for more than 1 minute"
Access and Usage
Service URLs
Prometheus: http://localhost:9090
Grafana: http://localhost:3000 (admin/admin)
API Metrics: http://localhost:8000/metrics
Starting the Monitoring Stack
# Start all services including monitoring
docker-compose up -d
# Check status
docker-compose ps
# View logs
docker-compose logs -f prometheus
docker-compose logs -f grafana
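After the containers are up, a quick reachability check can hit all three services. This is a sketch: the URLs assume the port mappings above, and it relies on Prometheus's /-/ready and Grafana's /api/health built-in health endpoints:
import requests

# Each service exposes a cheap endpoint suitable for a smoke test
endpoints = {
    "FastAPI metrics": "http://localhost:8000/metrics",
    "Prometheus": "http://localhost:9090/-/ready",
    "Grafana": "http://localhost:3000/api/health",
}
for name, url in endpoints.items():
    status = requests.get(url, timeout=5).status_code
    print(f"{name}: HTTP {status}")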
Grafana Setup
Access Grafana: Navigate to http://localhost:3000
Login: Use admin/admin (change password on first login)
Data Source: Prometheus should be auto-configured at http://prometheus:9090
Dashboards: Import or create custom dashboards for your metrics
Common Queries
Prometheus Query Examples:
# Request rate per minute
rate(http_requests_total[1m])
# Average response time
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# Error rate percentage
rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100
# Model prediction success rate percentage
(rate(model_predictions_total[5m]) - rate(model_prediction_errors_total[5m])) / rate(model_predictions_total[5m]) * 100
# Scraping success rate
(rate(scraping_requests_total[5m]) - rate(scraping_errors_total[5m])) / rate(scraping_requests_total[5m]) * 100
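Any of these expressions can also be evaluated outside the Prometheus UI through its instant-query API. A sketch, assuming the default port mapping:
import requests

# Evaluate the error-rate-percentage query from above as an instant query
query = 'rate(http_requests_total{status=~"5.."}[5m]) / rate(http_requests_total[5m]) * 100'
resp = requests.get("http://localhost:9090/api/v1/query", params={"query": query}, timeout=5)
for series in resp.json()["data"]["result"]:
    print(series["metric"], series["value"][1])  # value is a [timestamp, value] pair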
Troubleshooting
Common Issues
Prometheus Not Scraping Metrics:
# Check FastAPI metrics endpoint
curl http://localhost:8000/metrics
# Check Prometheus targets
# Visit http://localhost:9090/targets
Grafana Can't Connect to Prometheus:
# Verify Prometheus is running
docker-compose logs prometheus
# Check network connectivity
docker exec -it grafana ping prometheus
Missing Custom Metrics:
# Ensure metrics are being incremented in your code
model_predictions_total.inc()
scraping_requests_total.inc()
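If a metric is incremented but still missing from dashboards, check whether it appears in the raw exposition output at all. A sketch (note that labelled metrics only appear after their first .labels() call):
import requests

# Dump only the lines for a specific metric family from the raw exposition text
metrics_text = requests.get("http://localhost:8000/metrics", timeout=5).text
for line in metrics_text.splitlines():
    if line.startswith("model_predictions_total"):
        print(line)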
Performance Optimization
Reduce Metric Cardinality:
Limit label values in custom metrics
Use appropriate metric types (Counter, Gauge, Histogram)
Avoid high-cardinality labels such as user IDs or raw URLs (see the sketch below)
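As an illustration of bounded cardinality, labels should come from small fixed sets such as HTTP methods or status classes. The metric name here is hypothetical:
from prometheus_client import Counter

# Labels drawn from small, fixed sets keep the number of time series bounded:
# ~5 methods x 5 status classes = at most 25 series for this metric
requests_by_class = Counter(
    "app_requests_by_class_total",
    "HTTP requests by method and status class",
    ["method", "status_class"],
)

requests_by_class.labels(method="GET", status_class="2xx").inc()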
Storage Optimization:
Set appropriate retention periods
Monitor disk usage for time-series data
Use metric aggregation for long-term storage
Best Practices
Metric Design
Use Standard Names: Follow Prometheus naming conventions
Appropriate Types: Choose Counter, Gauge, or Histogram based on use case
Consistent Labels: Use consistent label names across metrics
Documentation: Add help text to all custom metrics
Monitoring Strategy
Start Simple: Begin with basic HTTP metrics
Add Business Metrics: Include ML-specific metrics gradually
Set Meaningful Alerts: Focus on actionable alerts
Regular Review: Periodically review and clean up unused metrics
Security Considerations
Change Default Passwords: Update Grafana admin credentials
Network Isolation: Use Docker networks for service communication
Access Control: Implement proper authentication for production
Data Retention: Set appropriate data retention policies
The monitoring stack provides comprehensive observability for the PTIIKInsight system, enabling proactive monitoring and performance optimization of both the infrastructure and ML operations.
Key Metrics Tracked:
Response Time: Request duration percentiles (95th, 99th)
Error Rate: Failed requests percentage
Active Connections: Current active connections count
Throughput: Data transfer rates
System Resources: CPU and memory usage
Dashboard Panels:
Request Volume: Total requests over time
Response Time: Average response time trends
Error Rate: HTTP error rate percentage
Status Code Distribution: Breakdown of response codes
Active Users: Current active connections
System Health: Overall system status indicators
3. Challenges and Solutions
3.1 Current Service Limitations
Challenge Identified: The current deployment covers the scraping service and the display of scraped results. The team intended to add a model service for predictions but encountered significant obstacles.
Specific Issues:
Extended Build Times: Docker image build process consumes excessive time
Resource Constraints: Model integration requires substantial computational resources
Service Complexity: ML model deployment adds complexity to the monitoring stack
Current Service Architecture:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Scraping     │    │  Data Display   │    │   Monitoring    │
│    Service      │───▶│    Service      │───▶│     Stack       │
└─────────────────┘    └─────────────────┘    └─────────────────┘
Planned Future Architecture:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Scraping     │    │  Data Display   │    │  ML Prediction  │
│    Service      │───▶│    Service      │───▶│    Service      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │                      │
                                ▼                      ▼
                  ┌─────────────────────────────────────┐
                  │          Monitoring Stack           │
                  │       (Prometheus + Grafana)        │
                  └─────────────────────────────────────┘
3.2 Solutions and Workarounds
Immediate Solutions:
Optimized Build Process: Implement multi-stage Docker builds to reduce image size
Resource Management: Configure resource limits and requests for containers
Caching Strategy: Implement build caching to reduce subsequent build times
Future Enhancements:
Model Service Integration: Gradual integration of ML prediction services
Load Balancing: Implement load balancing for high-availability
Auto-scaling: Configure automatic scaling based on load metrics
4. Technical Specifications
4.1 System Requirements
Minimum Requirements:
CPU: 4 cores
RAM: 8 GB
Storage: 20 GB available space
Docker: Version 20.10+
Docker Compose: Version 2.0+
Recommended Requirements:
CPU: 8 cores
RAM: 16 GB
Storage: 50 GB SSD
Network: High-speed internet for Docker image downloads
4.2 Port Configuration
Service Ports:
FastAPI Application: 8000
Prometheus: 9090
Grafana: 3000
Network Security:
Internal communication via Docker network
External access controlled via port mapping
Authentication required for Grafana access
4.3 Data Retention
Prometheus Data Retention:
Default: 15 days
Configured: 200 hours via the --storage.tsdb.retention.time command-line flag
Storage: Time-series database (TSDB)
Grafana Data Persistence:
Dashboards: Stored in Grafana database
Configuration: Persistent via Docker volumes
Backup: Regular automated backups recommended
5. Best Practices Implemented
5.1 Security Measures
Authentication: Grafana admin password configuration
Network Isolation: Services communicate via dedicated Docker network
Access Control: Limited external port exposure
5.2 Performance Optimization
Metric Collection: Optimized 15-second scrape interval
Data Retention: Configured retention policies to manage storage
Resource Limits: Container resource constraints for stability
5.3 Monitoring Standards
Metric Naming: Consistent naming conventions
Dashboard Organization: Logical grouping of metrics
Alert Configuration: Preparation for future alerting setup
6. Future Improvements
6.1 Enhanced Monitoring
Application Performance Monitoring (APM): Integrate distributed tracing
Log Aggregation: Add ELK stack or Loki for log management
Custom Metrics: Implement business-specific metrics
6.2 Alerting System
Alert Rules: Configure Prometheus alert rules
Notification Channels: Integrate Slack, email notifications
Escalation Policies: Define alert escalation procedures
6.3 Scalability Enhancements
Horizontal Scaling: Multi-instance Prometheus setup
Load Balancing: Implement Grafana load balancing
High Availability: Configure redundant monitoring infrastructure
Conclusion
The monitoring stack implementation successfully establishes comprehensive observability for the PTIIKInsight Machine Learning System. Despite challenges with ML model service integration due to build time constraints, the current implementation provides a solid foundation for monitoring and performance tracking.
The Prometheus and Grafana integration enables real-time monitoring of system metrics, HTTP performance, and service health. The containerized deployment ensures easy management and scalability, while the dashboard provides intuitive visualization of key performance indicators.
This monitoring infrastructure will support the continued development and operation of the PTIIKInsight platform, providing essential insights for performance optimization and system reliability.
This implementation demonstrates practical application of modern monitoring tools in machine learning systems and provides a template for similar projects requiring comprehensive observability solutions.