
All content on this page by eGov Foundation is licensed under a Creative Commons Attribution 4.0 International License.

Monitoring


Last updated 1 month ago


Many monitoring tools are available, and several factors must be weighed before choosing what works best for our client clusters. We use Prometheus and Grafana to monitor the clusters.

Introduction

Monitoring is an important pillar of DevOps best practices. It provides essential information about the performance and status of your platform, and it matters even more in distributed environments such as Kubernetes and microservices.

One of Kubernetes’ great strengths is its ability to scale services and applications. Once you reach thousands of applications, monitoring them manually or with ad hoc scripts is impractical. You need a scalable monitoring system. This is where Prometheus and Grafana come in.

Prometheus makes it possible to collect, store, and use platform metrics. Grafana, on the other hand, connects to Prometheus, allowing you to create beautiful dashboards and charts.

Prometheus

What is Prometheus?

  • Prometheus is an open-source monitoring and alerting toolkit designed for reliability and scalability.

  • It collects and stores time-series data, allowing real-time monitoring of applications, infrastructure, and services.

Key Features:

  • Time-Series Data Collection: Stores metrics with timestamps and labels.

  • Powerful Querying: Uses PromQL (Prometheus Query Language) for data analysis.

  • Alerting Mechanism: Integrated with Alertmanager to notify about critical issues.

  • Service Discovery: Automatically detects targets (containers, pods, nodes) for monitoring.

  • Pull-Based Model: Fetches metrics from targets instead of relying on push mechanisms.
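As a rough illustration of PromQL querying, the sketch below sends an instant query to the Prometheus HTTP API with curl. The base URL http://localhost:9090 is an assumption (for example, a port-forward to the Prometheus service, whose name is not stated on this page), and prom_query is a hypothetical helper name.

```shell
# Hypothetical helper: run an instant PromQL query against the Prometheus HTTP API.
# PROM_URL is an assumption; point it at wherever Prometheus is reachable.
PROM_URL="${PROM_URL:-http://localhost:9090}"

prom_query() {
  # -G sends the data as URL query parameters;
  # --data-urlencode escapes the PromQL expression.
  curl -sS -G "${PROM_URL}/api/v1/query" --data-urlencode "query=$1"
}
```

For example, prom_query 'up == 0' asks for every scrape target that Prometheus currently considers down.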

Loki

Distributed Log Aggregation System: Loki is an open-source log aggregation system built for cloud-native environments, designed to efficiently collect, store, and query log data. Loki was inspired by Prometheus and shares similarities in its architecture and query language, making it a natural complement to Prometheus for comprehensive observability.

Key Features

  • Label-based Indexing

  • LogQL Query Language

  • Log Stream Compression

  • Scalable and Cost-Efficient

  • Integration with Grafana

Alertmanager

Alertmanager is a crucial component in the Prometheus ecosystem responsible for managing alerts and notifications. It receives alerts from Prometheus and processes them by grouping, deduplicating, and silencing them based on predefined rules. This ensures that notifications are sent only when necessary, reducing alert fatigue. Alertmanager supports multiple integrations, allowing alerts to be delivered via email, Slack, PagerDuty, and other communication channels. It also provides high availability by clustering multiple Alertmanager instances. By enabling effective alert management, Alertmanager helps DevOps and SRE teams maintain system reliability and quickly respond to incidents.

Install Monitoring Tools Using Helmfile

Pre-requisites

  • Export the kubeconfig file to enable cluster login from the terminal.

  • Command to install Helm, a package manager for Kubernetes that simplifies the deployment and management of applications using Helm charts.

curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
  • Commands to install Helmfile, which allows you to manage multiple Helm releases in a declarative manner.

wget https://github.com/helmfile/helmfile/releases/latest/download/helmfile_linux_amd64 -O /usr/local/bin/helmfile
chmod +x /usr/local/bin/helmfile
  • Command to install the Helm Diff plugin, which allows you to compare the current Helm release with the proposed changes before applying them:

helm plugin install https://github.com/databus23/helm-diff
  • Use the command below to install the decryption plugin. If it is already installed, you can skip this step.

helm plugin install https://github.com/jkroepke/helm-secrets
  • With this plugin, the encrypted key file will be automatically decrypted during deployment.

  • Command to verify the plugin installation

helm plugin list
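Before moving on, it can help to confirm that the tools and plugins listed above are actually in place. The following is a minimal sketch; missing_tools and missing_plugins are hypothetical helper names, and the plugin names diff and secrets are the names under which these two plugins normally appear in the helm plugin list output.

```shell
# Print the name of every required tool that is not on PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Read the output of `helm plugin list` on stdin and print each
# expected plugin (diff, secrets) that does not appear in it.
missing_plugins() {
  plugins="$(cat)"
  for name in diff secrets; do
    printf '%s\n' "$plugins" | grep -q "^${name}[[:space:]]" || printf '%s\n' "$name"
  done
}
```

Typical usage would be missing_tools kubectl helm helmfile jq, followed by helm plugin list | missing_plugins. Empty output from both means nothing is missing.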

Deployment Steps

  • Clone the repository

git clone https://github.com/egovernments/DIGIT-DevOps.git
  • Command to change directory

cd DIGIT-DevOps
  • Command to switch to the monitoring tools branch

git checkout DIGIT-2.9LTS-monitoring
  • Change to the working directory

cd deploy-as-code/charts/monitoring
  • Generate and preview Kubernetes manifests to see what will be applied.

helmfile -e env -f monitoring-helmfile.yaml template
  • Compare the current state with the new changes to see what will be modified.

helmfile -e env -f monitoring-helmfile.yaml diff
  • Use the command below to deploy the monitoring tools.

helmfile -e env -f monitoring-helmfile.yaml apply
  • This command will deploy all the monitoring tools.
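The template, diff, and apply steps above can be chained so that nothing is applied unless the preview steps succeed. This is a sketch only: run_helmfile and deploy_monitoring are hypothetical names, the default environment name env is taken from the commands above, and the final check assumes the tools land in the monitoring namespace used elsewhere on this page.

```shell
# Assumption: ENV_NAME matches an environment defined for monitoring-helmfile.yaml.
ENV_NAME="${ENV_NAME:-env}"
HELMFILE="${HELMFILE:-monitoring-helmfile.yaml}"

# Run one helmfile action (template, diff, or apply) for the chosen environment.
run_helmfile() {
  helmfile -e "$ENV_NAME" -f "$HELMFILE" "$1"
}

# Render and diff first; apply only if both succeed, then list the pods.
deploy_monitoring() {
  run_helmfile template >/dev/null &&
    run_helmfile diff &&
    run_helmfile apply &&
    kubectl get pods -n monitoring
}
```

Run deploy_monitoring from the deploy-as-code/charts/monitoring directory. The rendered manifests are discarded here because the diff output is usually the more useful preview.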

Connect To Alertmanager Web Interface

The Alertmanager web UI is accessible through port-forward with this command:

kubectl port-forward svc/kube-prometheus-stack-alertmanager -n monitoring 9093:9093
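With the port-forward running in another terminal, Alertmanager can also be checked from the command line. The sketch below assumes it answers on http://localhost:9093 and uses Alertmanager's health endpoint and v2 alerts API; the function names are hypothetical.

```shell
# Assumes the port-forward above is active, so Alertmanager is on localhost:9093.
AM_URL="${AM_URL:-http://localhost:9093}"

# Succeeds (exit status 0) when the health endpoint returns an HTTP success code.
alertmanager_healthy() {
  curl -fsS "${AM_URL}/-/healthy" >/dev/null
}

# Print the currently active alerts as JSON.
alertmanager_alerts() {
  curl -fsS "${AM_URL}/api/v2/alerts?active=true"
}
```

For example, alertmanager_healthy && alertmanager_alerts prints the alert list only if the instance is reachable.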

Connect To Grafana

  • Grafana Dashboard URL: DomainName/monitoring

  • Command to get the Grafana admin credentials

kubectl get secret grafana -n monitoring -o json | jq -r '.data | map_values(@base64d)'
  • Use sign-in with GitHub for viewer access.
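If jq is not available, a single credential can be read with kubectl's jsonpath output and base64 instead. This sketch assumes the secret stores its values under the keys admin-user and admin-password, as the upstream Grafana Helm chart does; verify the key names in your cluster. grafana_secret_value is a hypothetical helper.

```shell
# Print one decoded value from the "grafana" secret in the monitoring namespace.
# $1 is the key name, e.g. admin-user or admin-password (assumed key names).
grafana_secret_value() {
  kubectl get secret grafana -n monitoring -o "jsonpath={.data.$1}" | base64 -d
}
```

For example, grafana_secret_value admin-password prints the admin password in plain text, so avoid running it where the terminal output is logged.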

Kubernetes Dashboards

Dashboard Explanations

1. Blackbox Exporter

  • Purpose:

    • The dashboard monitors the health and availability of multiple HTTP endpoints using probes.

    • It tracks response status, latency, SSL details, and DNS lookup times.

  • Key Metrics:

    • Instance: The URL of the monitored endpoint.

    • Status: Indicates whether the endpoint is UP (reachable) or DOWN (unreachable or failing).

    • HTTP Code: Displays the HTTP response code (e.g., 200 for success, 403 for forbidden, 503 for service unavailable).

    • SSL Status: Shows whether SSL/TLS is correctly configured.

    • TLS Version: Indicates the TLS version used (e.g., TLS 1.3).

    • SSL Certificate Expiry: Days remaining before the SSL certificate expires.

    • Probe Duration: The time taken to complete the probe request.

    • DNS Lookup Duration: The time required to resolve the domain name.

  • Observations:

    • SSL is correctly configured for all monitored URLs with TLS 1.3.

    • Certificate expiry dates are tracked to prevent SSL disruptions.

    • Probe and DNS lookup durations help identify performance issues.
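The panels described above map onto metrics exported by the Prometheus Blackbox Exporter. The sketch below lists a few PromQL expressions as comments and shows the days-until-expiry arithmetic in shell; days_until_expiry is a hypothetical helper, and the exact queries used by the deployed dashboard may differ.

```shell
# PromQL expressions for the metrics described above:
#   probe_success == 0                                   endpoints currently DOWN
#   (probe_ssl_earliest_cert_expiry - time()) / 86400    days until certificate expiry
#   probe_dns_lookup_time_seconds                        DNS lookup duration

# Whole days between "now" and a certificate expiry time.
# Both arguments are Unix epoch seconds; "now" defaults to the current time.
days_until_expiry() {
  expiry="$1"
  now="${2:-$(date +%s)}"
  echo $(( (expiry - now) / 86400 ))
}
```

The division by 86400 converts seconds to days, which is the same conversion the certificate expiry expression performs.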

2. Kubernetes Overview

  • Purpose:

    • Provides an overview of cluster-wide resource utilization.

    • Helps monitor CPU, memory, and resource allocation.

  • Key Metrics Monitored:

    • CPU Utilization: Tracks real usage, requested, and limit allocations.

    • Memory Utilization: Monitors actual usage, requested, and allocated limits.

    • Cluster Resources: Displays active nodes, namespaces, and running pods.

    • Kubernetes Object Counts: Shows counts of containers, services, secrets, ingresses, PVCs, and other resources.

  • Insights & Utility:

    • Identifies resource consumption trends.

    • Helps in optimizing resource requests and limits.

    • Assists in capacity planning and performance monitoring.

3. Namespaces Overview

  • Purpose:

    • Monitors resource usage at the namespace level.

    • Helps track CPU, memory, and resource allocation per namespace.

  • Key Metrics Monitored:

    • CPU Usage: Percentage and core utilization within the cluster.

    • Memory Usage: Percentage and actual memory consumption.

    • Resource Count: Tracks running pods, services, config maps, secrets, ingresses, and persistent volume claims.

  • Insights & Utility:

    • Identifies high resource consumption namespaces.

    • Helps optimize requests and limits for efficient utilization.

    • Assists in monitoring namespace health and scaling decisions.

4. Nodes Overview

  • Purpose:

    • Monitors individual node performance and resource utilization.

    • Provides insights into node-level workload distribution.

  • Key Metrics Tracked:

    • CPU & RAM Usage: Current utilization and total capacity.

    • Pod Count: Number of running pods on the node.

    • Uptime: Node availability duration.

  • Insights & Utility:

    • Helps detect resource bottlenecks or underutilized nodes.

    • Assists in load balancing and capacity planning.

    • Useful for troubleshooting node-specific performance issues.

5. Pods Overview

  • General Pod Information

    • Displays pod details such as name, namespace, creation source, node it’s running on, and IP address.

    • Includes priority and QoS class to determine resource allocation behavior.

  • Resource Utilization Metrics

    • Shows CPU and memory requests vs. limits for the pod.

    • Visualizes real-time utilization with gauges for quick assessment.

  • Container-Level Resource Usage

    • Breaks down CPU and memory usage per container inside the pod.

    • Helps in identifying whether a container is over- or under-utilizing resources.

  • Performance Insights

    • Helps monitor resource consumption to avoid over-provisioning or underutilization.

    • Supports scaling decisions based on actual usage trends.

  • Operational Use Case

    • Useful for Kubernetes administrators to track performance, optimize configurations, and ensure stability in deployments.
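The same pod details can be cross-checked from the terminal. This is a sketch with hypothetical helper names; kubectl top relies on metrics-server being installed in the cluster, which is an assumption here.

```shell
# Print the QoS class (Guaranteed, Burstable, or BestEffort) assigned to a pod.
# $1 is the pod name, $2 the namespace.
pod_qos() {
  kubectl get pod "$1" -n "$2" -o "jsonpath={.status.qosClass}"
}

# Show CPU and memory usage per container in a pod (needs metrics-server).
pod_usage() {
  kubectl top pod "$1" -n "$2" --containers
}
```

For example, pod_qos my-pod my-namespace followed by pod_usage my-pod my-namespace, where both names are placeholders.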

6. Nginx Ingress Dashboard

  • Requests Overview

    • Displays total HTTP requests over a specific period.

    • Shows the percentage of successful requests.

    • Provides active connections and recent request counts.

  • Success Rate & HTTP Status Codes

    • Shows success percentage over a 2-minute window.

    • Categorizes HTTP responses:

      • 1xx/2xx: Informational and successful responses.

      • 3xx: Redirects.

      • 4xx: Client errors (e.g., 404, 499).

      • 5xx: Server errors (e.g., 500, 502, 503).

  • Traffic Analysis

    • HTTP Requests / Ingress Graph: Visualizes request trends over time.

    • Total HTTP Requests Graph: Breakdown of request types (color-coded for different status codes).

  • Additional Metrics

    • Latency Panels: Measure response time.

    • Connection Panels: Monitor concurrent connections.

    • CPU Intensive Graphs: Help in resource usage analysis.
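To make the status code grouping and the success rate concrete, here is a small sketch. The PromQL in the comment is a query commonly used with ingress-nginx for a 2-minute success rate and is an assumption about this dashboard; status_class and success_rate are hypothetical helpers.

```shell
# PromQL commonly used for a 2-minute success rate (assumption; check the panel):
#   sum(rate(nginx_ingress_controller_requests{status!~"[4-5].*"}[2m]))
#     / sum(rate(nginx_ingress_controller_requests[2m]))

# Map an HTTP status code to the class used in the list above.
status_class() {
  case "$1" in
    1??|2??) echo "1xx/2xx" ;;
    3??) echo "3xx redirect" ;;
    4??) echo "4xx client error" ;;
    5??) echo "5xx server error" ;;
    *) echo "unknown" ;;
  esac
}

# Success rate in percent: non-error requests ($1) divided by all requests ($2).
success_rate() {
  awk -v ok="$1" -v total="$2" \
    'BEGIN { if (total == 0) print "n/a"; else printf "%.2f\n", 100 * ok / total }'
}
```

The zero-total guard matters for quiet ingresses, where a window with no requests would otherwise cause a division by zero.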

7. Persistent Volume (PV) & Persistent Volume Claim Dashboard

  • Purpose:

    • Monitors the status and usage of PVs and PVCs in the Kubernetes cluster.

    • Helps track storage capacity, usage trends, and potential issues.

  • Key Metrics Tracked:

    • PVC Utilization: Shows if any PVCs are nearing full capacity.

    • Storage Availability: Displays available storage per PVC.

    • PVC Status: Indicates if PVCs are bound, pending, or lost.
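A quick command-line counterpart to the PVC status panel is sketched below. unbound_pvcs is a hypothetical helper; the PromQL in the comment uses the kubelet volume metrics and may not match the dashboard's exact query.

```shell
# PromQL for PVC utilisation in percent, based on kubelet volume metrics:
#   100 * kubelet_volume_stats_used_bytes / kubelet_volume_stats_capacity_bytes

# List PVCs whose status is not Bound (for example Pending or Lost).
# Output columns of `kubectl get pvc`: NAMESPACE NAME STATUS ...
unbound_pvcs() {
  kubectl get pvc --all-namespaces --no-headers |
    awk '$3 != "Bound" { print $1 "/" $2 ": " $3 }'
}
```

Empty output means every claim in the cluster is bound.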

8. Loki Grafana Dashboard

  • Log Filtering & Search

    • Filters logs by namespace, pod, and keyword (for example, error|fatal).

    • Supports advanced search queries for efficient troubleshooting.

  • Real-Time Log Visualization

    • Displays log frequency over time for better incident analysis.

    • Helps identify peaks in errors or warnings.

  • Log Source Details

    • Shows logs from specific Loki stack components.

    • Includes metadata like timestamps, severity levels, and source pods.

  • Performance Insights

    • Provides query execution details, including response times and processed log entries.

    • Helps optimize query performance for log retrieval.
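The namespace and keyword filters described above correspond to a LogQL stream selector followed by a line filter. The sketch below builds such a query and sends it to Loki's HTTP API. It assumes Loki is reachable at http://localhost:3100 (for example, through a port-forward) and that log streams carry a namespace label; both depend on the deployment, and the function names are hypothetical.

```shell
# Assumption: Loki is reachable at LOKI_URL; service name and port vary by setup.
LOKI_URL="${LOKI_URL:-http://localhost:3100}"

# Build a LogQL query: select one namespace, keep lines matching a regex.
logql_filter() {
  printf '{namespace="%s"} |~ "%s"' "$1" "$2"
}

# Query recent logs (Loki's default time range), returning at most 100 lines.
loki_search() {
  curl -sS -G "${LOKI_URL}/loki/api/v1/query_range" \
    --data-urlencode "query=$(logql_filter "$1" "$2")" \
    --data-urlencode "limit=100"
}
```

For example, loki_search my-namespace 'error|fatal' mirrors the keyword filter used in the dashboard, with my-namespace as a placeholder.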

9. Redis Monitoring Dashboard

  • Max Uptime

    • Displays how long the Redis instance has been running without a restart (e.g., 6 days).

    • Helps track system stability and potential need for maintenance.

  • Clients

    • Shows the number of active client connections (e.g., 35).

    • A high or fluctuating number might indicate load variations or connection issues.

  • Memory Usage

    • Indicates the percentage of allocated memory in use (e.g., 83%).

    • High memory usage could lead to performance degradation or eviction of keys.

  • Total Commands per Second

    • Displays the rate of commands being executed.

    • Spikes in command execution can indicate high activity or potential performance bottlenecks.

  • Hits/Misses per Second

    • Tracks cache efficiency by measuring successful vs. failed key lookups.

    • A high miss rate may indicate suboptimal caching strategies.

  • Total Memory Usage

    • Shows used vs. maximum memory allocation over time.

    • Helps identify memory growth trends and potential out-of-memory risks.

  • Network I/O

    • Monitors data ingress and egress (e.g., in MiB).

    • Spikes may indicate high request loads or potential network-related issues.

  • Total Items per DB

    • Represents the number of keys stored in each database instance.

    • Helps monitor data distribution and growth across databases.

  • Expiring vs. Non-Expiring Keys

    • Tracks how many keys are set to expire vs. persistent ones.

    • Useful for understanding key retention policies and potential memory optimizations.
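The hits and misses panel boils down to a ratio of two Redis counters, keyspace_hits and keyspace_misses, which appear in the output of the INFO stats command. The sketch below computes that ratio; the helper names are hypothetical, and how you reach redis-cli (for example, from inside the Redis pod) depends on the deployment.

```shell
# Cache hit ratio in percent from hit and miss counts.
hit_ratio() {
  awk -v h="$1" -v m="$2" \
    'BEGIN { t = h + m; if (t == 0) print "n/a"; else printf "%.2f\n", 100 * h / t }'
}

# Read `redis-cli INFO stats` output on stdin and print the hit ratio.
# Redis ends lines with CRLF, so carriage returns are stripped first.
hit_ratio_from_info() {
  tr -d '\r' | awk -F: '
    $1 == "keyspace_hits"   { h = $2 }
    $1 == "keyspace_misses" { m = $2 }
    END { t = h + m; if (t == 0) print "n/a"; else printf "%.2f\n", 100 * h / t }'
}
```

Typical usage would be redis-cli INFO stats | hit_ratio_from_info. A persistently low percentage is the "high miss rate" condition described above.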

Please refer to this document for the required changes in the environment and secrets YAML files.

Opening a browser tab on http://localhost:9093 shows the Alertmanager web UI.

Example Grafana Dashboard URL: https://unified-dev.digit.org/monitoring/