Operational Best Practices

Introduction

This article gives an overview of the DIGIT infrastructure and sets out guidelines for operational excellence when DIGIT is deployed on an SDC, NIC, or any commercial cloud, along with recommendations and the segregation of duties (SoD). It helps teams plan procurement and build the capabilities needed to deploy and implement DIGIT.

Under a shared-control model, the state program team and its partners can use these guidelines as a reference, but they must provide their own control implementation for the state’s cloud infrastructure and partners to keep operations standardized and smooth.

Operational Recommendations

DIGIT strongly recommends site reliability engineering (SRE) principles as a key means of bridging development and operations by applying a software engineering mindset to system and IT administration. In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Monitoring Tools Recommendations: Commercial clouds such as AWS, Azure, and GCP offer sophisticated managed monitoring solutions across the various infra levels, for example CloudWatch and StackDriver. In the absence of such managed services, the best practices and tools listed below help with efficient debugging and troubleshooting.

Key Standard Operating Procedures (SOPs)

  • Segregation of duties and responsibilities.

  • SMEs and SPOCs for L1.5 support, along with defined SLAs.

  • A ticketing system to manage incidents and to converge and collaborate on operational issues.

  • Monitoring dashboards at various levels, such as infrastructure, network, and application.

  • Transparency of monitoring data and collaboration between teams.

  • Periodic remote sync-up meetings, with invitations accepted and attended.

  • Visibility into stakeholders' calendar availability for scheduling meetings.

  • Periodic (weekly, monthly) summary reports of infra and operations incidents by category.

  • Communication channels, with synchronization on a regular basis and upon critical issues, changes, upgrades, releases, etc.
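As an illustration of the periodic summary reports mentioned in the list above, the following minimal sketch counts incidents per ISO week and category. The incident records, the category names, and the `weekly_summary` helper are hypothetical; a real report would presumably read from whatever ticketing system the teams adopt.

```python
from collections import Counter
from datetime import date

# Hypothetical incident records: (date opened, category).
# The category names are illustrative, not a DIGIT-defined taxonomy.
INCIDENTS = [
    (date(2024, 1, 1), "infrastructure"),
    (date(2024, 1, 2), "network"),
    (date(2024, 1, 3), "application"),
    (date(2024, 1, 9), "application"),
    (date(2024, 1, 10), "application"),
]


def weekly_summary(incidents):
    """Count incidents per (ISO year, ISO week, category)."""
    counts = Counter()
    for opened, category in incidents:
        iso = opened.isocalendar()  # (ISO year, ISO week, ISO weekday)
        counts[(iso[0], iso[1], category)] += 1
    return dict(counts)


if __name__ == "__main__":
    # Print one line per week and category, in chronological order.
    for (year, week, category), n in sorted(weekly_summary(INCIDENTS).items()):
        print(f"{year}-W{week:02d} {category}: {n}")
```

Grouping by ISO week keeps the weekly buckets aligned to Monday-to-Sunday periods; a monthly report could group on `(opened.year, opened.month)` instead.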

Segregation of Duties

When DIGIT is deployed on state cloud infrastructure, it is essential to identify and distinguish the responsibilities of the infrastructure, operations, and implementation partners. Identify these teams, assign a SPOC to each, and define the responsibilities and the incident management process used to visualize and track issues and manage dependencies between teams. These are essentially monitored through dashboards, with alerts sent proactively to the stakeholders. The eGov team can provide consultation and training on a need basis for any of the categories below.

  • State program team - Refers to the owner of the whole DIGIT implementation, application rollouts, and capacity building. Responsible for identifying and synchronizing the operating mechanism between the teams below.

  • Implementation partner - Refers to the team responsible for the DIGIT implementation, application performance monitoring for errors, log scrutiny, TPS at peak load, distributed tracing, DB query analysis, etc.

  • Operations team - This team could be an extension of the implementation team and is responsible for DIGIT deployments, configurations, CI/CD, change management, traffic monitoring and alerting, log monitoring and dashboards, application security, DB backups, application uptime, etc.

  • State IT/Cloud team - Refers to state infra team for the Infra, network architecture, LAN network speed, internet speed, OS Licensing and upgrade, patch, compute, memory, disk, firewall, IOPS, security, access, SSL, DNS, data backups/recovery, snapshots, capacity monitoring dashboard.
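One way to make this segregation of duties usable in incident management is a simple lookup from responsibility area to owning team. The sketch below is only an illustration: the area labels are shortened from the team descriptions above, and both the `owner_for` helper and the choice to fall back to the State program team for unclassified areas are assumptions, not part of DIGIT.

```python
# Responsibility areas paraphrased from the team descriptions above.
# The labels are illustrative, not identifiers used by DIGIT.
RESPONSIBILITIES = {
    "Implementation partner": {
        "application errors", "log scrutiny", "distributed tracing",
        "db query analysis",
    },
    "Operations team": {
        "deployments", "ci/cd", "change management", "db backups",
        "application uptime",
    },
    "State IT/Cloud team": {
        "firewall", "dns", "ssl", "disk", "os patching", "snapshots",
    },
}

# Assumption: anything unclassified is escalated to the State program team,
# since it synchronizes the operating mechanism between the other teams.
FALLBACK_OWNER = "State program team"


def owner_for(area: str) -> str:
    """Return the team that owns a responsibility area (case-insensitive)."""
    key = area.strip().lower()
    for team, areas in RESPONSIBILITIES.items():
        if key in areas:
            return team
    return FALLBACK_OWNER


if __name__ == "__main__":
    for area in ("DNS", "CI/CD", "distributed tracing", "license renewal"):
        print(f"{area} -> {owner_for(area)}")
```

Keeping the mapping in one table means a ticketing workflow or alert router can assign a SPOC from the same source that documents the responsibilities.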

Skills Required to Set up, Operate and Maintain DIGIT on SDC

| Tools/Skills | Specification | Weightage (1-5) | Yes/No |
|---|---|---|---|
| System Administration | Linux administration, troubleshooting, OS installation, package management, security updates, firewall configuration, performance tuning, recovery, networking, routing tables, etc. | 4 | |
| Containers/Docker | Build/push Docker containers, tune and maintain containers, startup scripts, troubleshooting Docker containers. | 2 | |
| Kubernetes | Set up a Kubernetes cluster on bare metal and VMs using kubeadm/kubespray, Terraform, etc. Strong understanding of the various Kubernetes components, configurations, kubectl commands, and RBAC. Creating and attaching persistent volumes, log aggregation, deployments, networking, service discovery, rolling updates. Scaling pods, deployments, worker nodes, node affinity, secrets, ConfigMaps, etc. | 3 | |
| Database Administration | Set up PostgreSQL DB, set up read replicas, backup, logs, DB RBAC setup, SQL queries. | 3 | |
| Docker Registry | Set up and manage a Docker registry. | 2 | |
| SCM/Git | Source code management, branches, forking, tagging, pull requests, etc. | 4 | |
| CI Setup | Jenkins setup, master-slave configuration, plugins, Jenkinsfile, Groovy scripting, Jenkins CI jobs for Maven and Node applications, deployment jobs, etc. | 4 | |
| Artifact management | Code artifact management, versioning. | 1 | |
| Apache Tomcat | Web server setup, configuration, load balancing, sticky sessions, etc. | 2 | |
| WildFly JBoss | Application server setup, configuration, etc. | 3 | |
| Spring Boot | Build and deploy Spring Boot applications. | 2 | |
| NodeJS | NPM setup; build Node applications. | 2 | |
| Scripting | Shell scripting, Python scripting. | 4 | |
| Log Management | Aggregating system and container logs, troubleshooting. Monitoring dashboards for logs using Prometheus, Fluentd, Kibana, Grafana, etc. | 3 | |
| WordPress | Set up and maintain a multi-tenant portal. | 2 | |
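The article does not say how the Weightage and Yes/No columns are meant to be combined. One possible reading, sketched below as an assumption, is a weighted coverage score: the sum of the weightages of the skills marked Yes, divided by the total weightage of all skills (41 for the table above). The `coverage` helper is hypothetical.

```python
# Weightage values copied from the skills table above
# (skill names lightly normalized).
WEIGHTAGE = {
    "System Administration": 4,
    "Containers/Docker": 2,
    "Kubernetes": 3,
    "Database Administration": 3,
    "Docker Registry": 2,
    "SCM/Git": 4,
    "CI Setup": 4,
    "Artifact management": 1,
    "Apache Tomcat": 2,
    "WildFly JBoss": 3,
    "Spring Boot": 2,
    "NodeJS": 2,
    "Scripting": 4,
    "Log Management": 3,
    "WordPress": 2,
}


def coverage(available):
    """Return the weighted share (0.0 to 1.0) of skills marked Yes.

    Assumption: the score is the weightage of the available skills
    divided by the total weightage; the article defines no formula.
    """
    skills = set(available)
    unknown = skills - set(WEIGHTAGE)
    if unknown:
        raise ValueError(f"unknown skills: {sorted(unknown)}")
    covered = sum(WEIGHTAGE[s] for s in skills)
    return covered / sum(WEIGHTAGE.values())


if __name__ == "__main__":
    team = ["System Administration", "Kubernetes", "Scripting", "SCM/Git"]
    print(f"Weighted skill coverage: {coverage(team):.0%}")
```

A score like this only helps compare teams or track capability building over time; it says nothing about depth within any single skill.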
