DIGIT - Infra Overview

Operational Guidelines & Security Standards

Overview

The objective of this section is to provide a clear guide for efficiently using DIGIT infrastructure on various platforms like SDC, NIC, or commercial clouds. The content on this page outlines the infrastructure overview, operational guidelines, and recommendations, along with the segregation of duties (SoD). It helps to plan the procurement and build the necessary capabilities to deploy and implement DIGIT.

In a shared control scenario, the state program team must adhere to these guidelines, develop their control implementation for the state's cloud infrastructure and collaborate with partners. This ensures standardized and smooth operational excellence in the overall system.

DIGIT Infrastructure

The DIGIT platform is designed as a microservices architecture, using open-source technologies and containerized apps and services. DIGIT components/services are deployed as Docker containers on Kubernetes, which provides the flexibility to run cloud-native applications anywhere: on physical or virtual infrastructure, hypervisors, HCI and so on. Kubernetes schedules containerized services onto a compute cluster and manages the workloads to ensure they run as intended, which substantially simplifies the deployment and management of microservices.

Provisioning the Kubernetes cluster varies between commercial clouds and state data centres, especially in the absence of managed Kubernetes services like those offered by AWS, Azure, GCP and NIC. Kubernetes clusters can also be provisioned in state data centres on bare metal, virtual machines, hypervisors, HCI, etc. However, providing integrated networking, monitoring, logging and alerting is critical for operating Kubernetes clusters in state data centres. The DIGIT platform also offers add-ons for Kubernetes cluster performance monitoring, logging, tracing, service monitoring and alerting, which the implementation team can take advantage of.

Below are the useful links to understand Kubernetes:

DIGIT Infra Specification on SDC or NIC Or Any Commercial Cloud

Systems

Specification

Spec/Count

Comment

User Accounts/VPN

Dev, UAT and Prod Envs

User Roles

Admin, Deploy, ReadOnly

Any Linux (preferably Ubuntu/RHEL)

All

Kubernetes as a managed service or VMs to provision Kubernetes

Managed Kubernetes service with HA/DRS

(Or) VMs with 2 vCore, 4 GB RAM, 20 GB Disk

If no managed k8s

3 VMs/env

Dev - 3 VMs

UAT - 3VMs

Prod - 3VMs

Kubernetes worker nodes or VMs to provision Kube worker nodes.

VMs with 4 vCore, 16 GB RAM, 20 GB Disk / per env

3-5 VMs/env

DEV - 3VMs

UAT - 4VMs

PROD - 5VMs

Disk Storage (NFS/iSCSI)

Storage with backup, snapshot, dynamic inc/dec

1 TB/env

Dev - 1000 GB

UAT - 800 GB

PROD - 1.5 TB

VM Instance IOPS

Max throughput 1750 MB/s

1750 MB/s

Disk IOPS

Max throughput 1000 MB/s

1000 MB/s

Internet Speed

Min 100 MB - 1000MB/Sec (dedicated bandwidth)

Public IP/NAT or LB

Internet-facing, 1 public IP per env

3 IPs

Availability Region

VMs from different regions are preferable for DRS/HA

at least 2 Regions

Private vLan

Per env, all VMs should be within a private vLan

Gateways

NAT Gateway, Internet Gateway, Payment and SMS gateway

1 per env

Firewall

Ability to configure Inbound, Outbound ports/rules

Managed DataBase

(or) VM Instance

Postgres 12 or above, as a managed DB with backup, snapshot and logging.

(Or) 1 VM with 4 vCore, 16 GB RAM, 100 GB Disk per env.

per env

DEV - 1VMs

UAT - 1VMs

PROD - 2VMs

CI/CD server self hosted (or) Managed DevOps

Self Hosted Jenkins : Master, Slave (VM 4vCore, 8 GB each)

(Or) Managed CI/CD: NIC DevOps or AWS CodeDeploy or Azure DevOps

2 VMs (Master, Slave)

Nexus Repo

Self hosted Artifactory Repo (Or) NIC Nexus Artifactory

DockerRegistry

DockerHub (Or) SelfHosted private docker reg

Git/SCM

GitHub (Or) Any Source Control tool

DNS

main domain & ability to add more sub-domain

SSL Certificate

NIC managed (Or) SDC managed SSL certificate per URL

2 urls per env
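The per-environment figures above can be added up to plan procurement. The following is a minimal sketch, assuming the self-managed case (no managed Kubernetes or managed database) and using only the VM sizes and counts from the specification above. The names `VmSpec`, `ENV_COUNTS` and `env_totals` are illustrative, and the shared storage and CI/CD VMs are left out of the totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmSpec:
    vcpu: int
    ram_gb: int
    disk_gb: int


# Per-VM sizes taken from the specification above.
CONTROL_PLANE = VmSpec(vcpu=2, ram_gb=4, disk_gb=20)
WORKER = VmSpec(vcpu=4, ram_gb=16, disk_gb=20)
DATABASE = VmSpec(vcpu=4, ram_gb=16, disk_gb=100)

# VM counts per environment, also from the specification (self-managed case).
ENV_COUNTS = {
    "dev": {"control_plane": 3, "worker": 3, "database": 1},
    "uat": {"control_plane": 3, "worker": 4, "database": 1},
    "prod": {"control_plane": 3, "worker": 5, "database": 2},
}


def env_totals(env: str) -> VmSpec:
    """Sum vCPU, RAM and local disk over all VMs of one environment."""
    counts = ENV_COUNTS[env]
    parts = [
        (CONTROL_PLANE, counts["control_plane"]),
        (WORKER, counts["worker"]),
        (DATABASE, counts["database"]),
    ]
    return VmSpec(
        vcpu=sum(spec.vcpu * n for spec, n in parts),
        ram_gb=sum(spec.ram_gb * n for spec, n in parts),
        disk_gb=sum(spec.disk_gb * n for spec, n in parts),
    )


if __name__ == "__main__":
    for env in ENV_COUNTS:
        print(env, env_totals(env))
```

Under these assumptions, Dev needs 22 vCores and 76 GB RAM, UAT 26 vCores and 92 GB RAM, and Prod 34 vCores and 124 GB RAM, before adding the shared NFS/iSCSI storage and the CI/CD servers.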

Operational Recommendations

DIGIT strongly recommends Site reliability engineering (SRE) principles as a key means to bridge development and operations gaps by applying a software engineering mindset to system and IT administration topics. In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.

Monitoring Tools Recommendations

Commercial clouds like AWS, Azure and GCP offer sophisticated monitoring solutions, such as CloudWatch and Stackdriver, across various infra levels. In the absence of such managed monitoring services, the best practices and tools listed below help in debugging and troubleshooting efficiently.

Key Standard Operating Procedures (SOPs)

Segregation of duties and responsibilities.
SME and SPOCs for L1.5 support along with the SLAs defined.
Ticketing system to manage incidents and to converge and collaborate on various operational issues.
Monitoring dashboards at various levels like infrastructure, networks and applications.
Transparency of monitoring data and collaboration between teams.
Periodic remote sync-up meetings, with invitations accepted and meetings attended by the stakeholders.
Ability to see stakeholders' calendar availability to schedule meetings.
Periodic (weekly, monthly) summary reports across the various infra and operations incident categories.
Communication channels for regular synchronization and for critical issues, changes, upgrades, releases, etc.

Segregation Of Duties

When DIGIT is deployed on state cloud infrastructure, it is essential to identify and distinguish the responsibilities of the Infrastructure, Operations and Implementation partners. Identify these teams, assign a SPOC for each, define responsibilities and ensure the Incident Management process is followed to visualize and track issues and manage dependencies between teams. Essentially, these are monitored through dashboards, and alerts are sent to the stakeholders proactively. The eGov team can provide consultation and training on a need basis in any of the categories below.

State IT/Cloud team - Refers to the state infra team responsible for the infra, network architecture, LAN network speed, internet speed, OS licensing and upgrades, patches, compute, memory, disk, firewall, IOPS, security, access, SSL, DNS, data backups/recovery, snapshots and the capacity monitoring dashboard.
State program team - Refers to the owner of the whole DIGIT implementation, application rollouts and capacity building. Responsible for identifying and synchronizing the operating mechanism between the teams below.
Implementation partner - Refers to the team responsible for the DIGIT implementation, application performance monitoring for errors, log scrutiny, TPS at peak load, distributed tracing, DB query analysis, etc.
Operations team - This team could be an extension of the implementation team and is responsible for DIGIT deployments, configurations, CI/CD, change management, traffic monitoring and alerting, log monitoring and dashboards, application security, DB backups, application uptime, etc.

Skills to Setup, Operate & Maintain DIGIT on SDC

Tools/Skills

Specification

Weightage (1-5)

Yes/No

System Administration

Linux Administration, troubleshooting, OS Installation, Package Management, Security Updates, Firewall configuration, Performance tuning, Recovery, Networking, Routing tables, etc

Containers/Dockers

Build/push Docker containers, tune and maintain containers, startup scripts, troubleshooting Docker containers.

Kubernetes

Set up a Kubernetes cluster on bare metal and VMs using kubeadm/kubespray, Terraform, etc. Strong understanding of various Kubernetes components, configurations, kubectl commands and RBAC. Creating and attaching persistent volumes, log aggregation, deployments, networking, service discovery and rolling updates. Scaling pods, deployments, worker nodes, node affinity, secrets, configMaps, etc.

Skills Needed: https://docs.google.com/document/d/1CM_w6Q82b70ir8m8O_0XAaJuf9fv11DRhjT0M85LaTA/edit

Database Administration

Setup Postgres DB, Set up read replicas, Backup, Log, DB RBAC setup, SQL Queries

Docker Registry

Set up and manage a Docker registry

SCM/Git

Source Code management, branches, forking, tagging, Pull Requests, etc.

CI Setup

Jenkins Setup, Master-slave configuration, plugins, jenkinsfile, groovy scripting, Jenkins CI Jobs for Maven, Node application, deployment jobs, etc.

Artifact management

Code artifact management, versioning

Apache Tomcat

Web server setup, configuration, load balancing, sticky sessions, etc

WildFly JBoss

Application server setup, configuration, etc.

Spring Boot

Build and deploy spring boot applications

NodeJS

NPM Setup and build node applications

Scripting

Shell scripting, python scripting.

Log Management

Aggregating system and container logs, troubleshooting. Monitoring dashboards for logs using Prometheus, Fluentd, Kibana, Grafana, etc.

WordPress

Multi-tenant portal setup and maintenance

Resource Requirement

Team

Roles

Responsibility

Program Management

Responsible for driving the transformation vision for the state, forming the teams, reviewing them and resolving hurdles for the teams.

Program Leader

Overall responsibility to Drive Vision of the program.

Identify Success Metrics for the program and the budgets for it. Staff the teams with the right / capable people to drive the outcomes.

Define program Structure and ensure that the various teams work in tandem towards the Program Plan/ Schedule.

Review program Progress and remove bottlenecks for the Implementation Teams

Procurement

Help timely procurements of various items/ services needed for the Program

Program Manager

Plan, establish tracking mechanism,

Track and Manage Program activities,

Conduct reviews with various teams to drive the Program. Ensure that the efforts of various teams are aligned.

Escalate/ seek support as appropriate to the Program Leader.

Program Coordinator

Track progress of activities,

Help documentation of the Program team,

Coordinate meeting schedules and logistics.

Implementation Review

Reports to program leader

Ensures Processes and System adoption happens in the ULB

Ensures the Program metrics are headed in the right direction (Their responsibility will extend well beyond the technical rollout)

Domain Team

Finalize finance and other related processes for all ULBs, Provide Specific Inputs to Technical Implementation team, Capacity Building,

Data Preparation

Oversee UAT

Monitor data to identify process execution on ground, Identify improvement areas for the Finance function.

State Finance Accounting Leader

Should be a TRUSTED Line Function person, who can be the guide to all the Accounting Head at the ULBs.

Should be able to take decisions for the state on all ULB Finance processes and appropriate automation related to that.

Finance Advisors / Consultants/ Accounts officers

Finalise Standardised Finance processes that need to be there on the ground to realise the State's vision.

Technology Implementation Team

Technical Specialist team that has knowledge of the eGov Platform, technologies, the DIGIT modules.

Configure/ customise the product to the needs of the state. Integrate the product with other systems as needed and manage and support the State

Technical Program Manager

Has a good understanding of the eGov Platform/ Product.

Plans the technical track of the product. Manages the technical team. Coordinates with various stakeholders during different phases of implementation to get the product ready for rollout in the ULBs. Plans and schedules activities as needed in the program.

He/She will be part of the Program Management team.

Business Analysts

Study and design State specific Accounting and other taxation Processes working with the Domain team.

Capture and document all Processes

Ensure that the Product will meet the needs of the State

Software Designers / Architects

Designing Software requirements based on the requirements finalised by the Business Analysts and leveraging platform as appropriate.

Developers

Configurations, Customization and Data Loads.

Testers

Test configuration / customisation and regression testing for each release

Project Coordinator

Coordinate activities amongst the various stakeholders and provide logistics support

DevOps & Cloud Monitoring

Release Management, Managing Repository, Security and Build tools

DBA

Postgres DBA. Database Tuning, backup, Archiving

Field Team

Statewide capacity building (Including Change Management). Experience in Finance Area preferred.

Measure training effectiveness and fine-tune approach.

Plan refresher training as needed.

Content Developer

Prepare content for training different roles in DIGIT.

Trainers

Execute training as per content developed for the different roles in DIGIT.

Capture feedback and identify additional training needs if required

Help Desk and Support

Central help desk

On-ground support in a planned manner to each ULB during the first 2 months after rollout.

Help Desk leader

Organise and run the help desk operations.

Ensure that tickets are handled as per agreed SLAs, Coordinate with Technical team as needed.

Analyse Help desk calls and identify potential areas for the Domain / Business Analysts to work on.

Central Help Desk

To take care of L1 and L2 Support.

Ensure Tracking of issues on the help desk tool.

Provide On ground support (Face to face) during the first 2 months of rollout

At least 1 person per 3-4 ULBs who can travel during the first 2 months to provide support to end users. This is more for confidence building and ensuring adoption.

DIGIT - Security Standards & Operational Recommendations

This section provides insights into the security principles, layers and lines of control that we focus on to protect DIGIT across code, application, access, infra and operations. The target audience is internal teams, partners, ecosystems and states who need to understand what security measures to consider to secure DIGIT from an infrastructure and operations perspective.

DIGIT - Key Security Principles

Subscribe to the DIGIT-applicable OWASP Top 10 standard across various security layers.

Minimize attack surface area
Implement a strong identity foundation - Who accesses what and who does what.
Apply security at all possible layers
Automate security best practices
Separation of duties (SoD).
The principle of Least privilege (PoLP)
Templatized design - (Code, Images, Infra-as-code, Deploy-as-code, Conf-as-code, etc)
Align with MeitY standards to meet SDC infra policies.

Security Layers & Line of Control

Security Layers

Line Of Controls

Application Layer

WAF, IAM, VA/PT, XSS, CSRF, SQLi, DDoS Defense.

Code

Defining security in the code, Static/Dynamic vulnerabilities scan

Libraries/Containers

Templatize Design, Vulnerabilities scanning at CI

Data

Encryption, Backups, DLP

Network

TLS, Firewalls, Ingress/Egress, Routing.

Infra/Cloud

Configurations/Infra Templates, ACL, user/privilege mgmt, Secrets mgmt

Operations

(PoLP) Least Privilege, Shared Responsibilities, CSA, etc

Application Layer

The presentation layer is likely to be the #1 attack vector for malicious individuals seeking to breach security defences, through DDoS attacks, malicious bots, cross-site scripting (XSS) and SQL injection. Invest in web security testing that combines tools, automation, process and speed and integrates seamlessly into software development, helping to eliminate vulnerabilities more effectively. Also deploy a web application firewall (WAF) that monitors and filters traffic to and from the application, blocking bad actors while safe traffic proceeds normally.

Key Security Measures

1. TLS protocols/encryption: Secure authentication and authorization with access control. All exposed APIs must be served over HTTPS with valid certificates, so that all communication between client and server is encrypted with Transport Layer Security (TLS).
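On the client side, the TLS requirement above can be sketched as follows. This is an illustration only, using Python's standard `ssl` module. The TLS 1.2 minimum is an assumption, not a DIGIT requirement stated in this section, and `make_client_context` is a hypothetical helper name.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client TLS context that verifies server certificates."""
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Assumption: refuse anything older than TLS 1.2.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    ctx = make_client_context()
    print(ctx.minimum_version, ctx.verify_mode, ctx.check_hostname)
```

A context like this can be passed to `urllib.request.urlopen(url, context=ctx)` so that calls to an API fail when the certificate is invalid or the protocol version is too old.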

2. Auth tokens: An authorization framework (such as OAuth 2.0) allows users to obtain access to a resource from the server. This is done using tokens and four roles in the microservices security pattern: resource server, resource owner, authorization server and client. Access tokens grant access to the resource until they expire, and refresh tokens are used to request a new access token after the original token has expired.
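The idea of an access token that is only valid before its expiry time can be illustrated with a small signed token. This is a simplified sketch using only the standard library. It is not OAuth, not JWT, and not how DIGIT issues tokens. `SECRET_KEY`, `issue_token` and `verify_token` are hypothetical names, and a real deployment should rely on an established authorization server and token library.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

SECRET_KEY = b"demo-only-secret"  # hypothetical key, for illustration only


def _sign(body: str) -> str:
    return hmac.new(SECRET_KEY, body.encode(), hashlib.sha256).hexdigest()


def issue_token(user: str, ttl_seconds: int, now: Optional[float] = None) -> str:
    """Return '<base64 payload>.<signature>' with an expiry claim."""
    now = time.time() if now is None else now
    payload = json.dumps({"sub": user, "exp": now + ttl_seconds}).encode()
    body = base64.urlsafe_b64encode(payload).decode()
    return f"{body}.{_sign(body)}"


def verify_token(token: str, now: Optional[float] = None) -> Optional[str]:
    """Return the user if the signature is valid and the token is unexpired."""
    now = time.time() if now is None else now
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return None  # no signature part present
    if not hmac.compare_digest(sig.encode(), _sign(body).encode()):
        return None  # payload or signature was tampered with
    claims = json.loads(base64.urlsafe_b64decode(body))
    if claims["exp"] <= now:
        return None  # expired: the client must obtain a new access token
    return claims["sub"]


if __name__ == "__main__":
    token = issue_token("alice", ttl_seconds=60)
    print(verify_token(token))
```

When `verify_token` returns `None` because of expiry, a refresh flow would exchange a longer-lived refresh token for a new access token instead of asking the user to log in again.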

3. Multi-factor authentication: Authenticate users on the front end with a username and password plus another form of identity verification. This gives users better protection by default, as some factors are harder to steal than others. For instance, using a one-time password (OTP) as the second factor significantly strengthens microservice security.
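One common form of OTP is the time-based one-time password (TOTP) defined in RFC 6238. The sketch below shows how such a code can be computed and checked with the standard library. It is an illustration of the technique, not a description of DIGIT's OTP service (which may, for example, send codes by SMS), and the function names are hypothetical.

```python
import hashlib
import hmac
import struct


def totp(secret: bytes, unix_time: int, step: int = 30, digits: int = 6) -> str:
    """Compute an RFC 6238 TOTP code (HMAC-SHA1) for the given time."""
    counter = unix_time // step
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    # Dynamic truncation as defined in RFC 4226.
    offset = mac[-1] & 0x0F
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret: bytes, submitted: str, unix_time: int) -> bool:
    """Constant-time comparison of the submitted code with the expected one."""
    expected = totp(secret, unix_time)
    return hmac.compare_digest(expected.encode(), submitted.encode())


if __name__ == "__main__":
    # Test vector from RFC 6238 (SHA-1, 8 digits, time = 59 s).
    print(totp(b"12345678901234567890", 59, digits=8))
```

A production check would usually also accept the code from the adjacent time step to tolerate clock drift, and would limit the number of attempts.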

4. Rate limit/DDoS: Denial-of-service attacks attempt to send an overwhelming number of service messages to cause application failure, typically by volumetric flooding of the network pipe. Such attacks can target the entire platform and network stack.

To prevent this:

We should set a limit on how many requests can be sent to each API in a given period.
If the number exceeds the limit, block that caller's access to the API, at least for some reasonable interval.
Also, make sure to analyze the payload for threats.
Incoming calls from an API gateway should also be rate-limited.
Add filters to the router to drop packets from suspicious sources.

5. Cross-site scripting (XSS): Malicious scripts are embedded in a webpage and executed on the client side, in a user's browser, instead of on the server side. When applications take data from users and dynamically include it in webpages without validating the data properly, attackers can execute arbitrary commands and display arbitrary content in the user's browser to gain access to account credentials.

How to prevent:

Applications must validate data input to the web application from user browsers.
All output from the web application to user browsers must be encoded.
Users must have the option to disable client-side scripts.
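Output encoding, the second point above, can be illustrated with the standard library's `html.escape`. This is a minimal sketch with a hypothetical `render_comment` helper. Real applications normally get this from a templating engine with auto-escaping enabled.

```python
import html


def render_comment(user_input: str) -> str:
    """Embed untrusted text in HTML with special characters encoded."""
    return f"<p>{html.escape(user_input, quote=True)}</p>"


if __name__ == "__main__":
    print(render_comment('<script>alert("x")</script>'))
```

The script tag is rendered as visible text instead of being executed, because `<`, `>`, `&` and quotes are replaced by their HTML entities.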

6. Cross-Site Request Forgery (CSRF): An attack in which a malicious website sends a request to a web application that the user is already authenticated against. This way an attacker can access functionality in the target web application via the victim's already authenticated browser. Targets include web applications like social media, in-browser email clients, online banking and web interfaces for network devices. To prevent this, CSRF tokens are appended to each request and associated with the user's session. Such tokens should at a minimum be unique per user session, but can also be unique per request.

How to prevent:

By including a challenge token with each request, the developer can ensure that the request is valid and not coming from a source other than the user.
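A challenge token of this kind can be sketched as follows, assuming the token is generated once per session, stored server-side, and sent back by the client with each state-changing request. The function names are hypothetical. Most web frameworks provide this mechanism out of the box.

```python
import hmac
import secrets


def new_csrf_token() -> str:
    """Generate an unpredictable token to store in the user's session."""
    return secrets.token_urlsafe(32)


def is_valid_csrf(session_token: str, submitted_token: str) -> bool:
    """Compare the token from the request with the one held in the session."""
    if not session_token or not submitted_token:
        return False
    # compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(session_token.encode(), submitted_token.encode())


if __name__ == "__main__":
    token = new_csrf_token()
    print(is_valid_csrf(token, token), is_valid_csrf(token, "forged"))
```

A forged cross-site request cannot read the session's token, so it fails the check even though the browser attaches the user's cookies.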

7. SQL Injection (SQLi): Allows attackers to control an application's database (letting them access or delete data, change an application's data-driven behaviour and do other undesirable things) by tricking the application into sending unexpected SQL commands. SQL injections are among the most frequent threats to data security.

How to prevent:

Using parameterized queries which specify placeholders for parameters so that the database will always treat them as data rather than part of an SQL command. Prepared statements and object-relational mappers (ORMs) make this easy for developers.
Remediate SQLi vulnerabilities in legacy systems by escaping inputs before adding them to the query. Use this technique only where prepared statements or similar facilities are unavailable.
Mitigate the impact of SQLi vulnerabilities by enforcing least privilege on the database. Ensure that each application has its own database credentials and that these credentials have the minimum rights the application needs.
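The parameterized-query technique described above can be shown with a small example. It uses Python's built-in `sqlite3` as a stand-in so that it runs without a server. DIGIT uses Postgres, where the same pattern applies through the driver's own placeholder syntax. The `users` table and `find_user` function are hypothetical.

```python
import sqlite3


def find_user(conn: sqlite3.Connection, username: str) -> list:
    """Look up a user with a placeholder, so input is always treated as data."""
    cur = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cur.fetchall()


# In-memory demo database (illustrative schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany("INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)])

if __name__ == "__main__":
    print(find_user(conn, "alice"))
    # A classic injection string matches nothing, because it is compared
    # literally against the username column.
    print(find_user(conn, "alice' OR '1'='1"))
```

Had the query been built by string concatenation, the second call would have returned every row in the table.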

Security In The Code

The primary causes of commonly exploited software vulnerabilities are consistent defects, bugs, and logic flaws in the code. Poor coding practices can create vulnerabilities in the system that can be exploited by cybercriminals.

What defines security in the code:

1. White-box code analysis: As developers write code, the IDE needs to provide focused, real-time security feedback with white-box code analysis. It also helps developers remediate faster and learn on the job through positive reinforcement, remediation guidance, code examples, etc.

2. Static Code Analysis (SAST): A static analysis tool reviews program code, searching for application coding flaws, back doors or other malicious code that could give hackers access to critical data or customer information. However, most static analysis tools can only scan source code.

3. Vulnerability assessment: Vulnerability assessment for the third-party libraries/artefacts as part of CI and GitHub PR process. Test results are returned quickly and prioritized in a Fix-First Analysis that identifies both the most urgent flaws and the ones that can be fixed most quickly, allowing developers to optimize efforts and save additional resources.

4. Secure PII/Encrypt: Make sure that personally identifiable information (PII) is not displayed or stored as plain text. All passwords and usernames must be masked when they are written to logs or records. Note that adding extra encryption on top of TLS/HTTPS does not add protection for traffic travelling over the wire. It only helps a little at the point where TLS terminates, where it can protect sensitive data (such as passwords or credit card numbers) from being accidentally dumped into a request log. Extra encryption (RSA 2048+ or Blowfish) might help protect data against attacks that aim at accessing the log data, but it will not help against those who try to access the memory of the application servers or the main data storage.
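Masking credentials before they reach a log can be sketched with a small helper. This is an illustration only. The `key=value` log format and the field names `username` and `password` are assumptions, and `mask_pii` is a hypothetical function that would typically be wired into a logging filter.

```python
import re

# Assumption: sensitive values appear in logs as key=value pairs.
_SENSITIVE = re.compile(r"(?i)\b(password|username)=([^\s&]+)")


def mask_pii(line: str) -> str:
    """Replace the values of sensitive fields with a fixed mask."""
    return _SENSITIVE.sub(lambda m: f"{m.group(1)}=****", line)


if __name__ == "__main__":
    print(mask_pii("login username=alice password=s3cret ok"))
```

The same function also covers query strings, because the value pattern stops at `&` as well as at whitespace.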

5. Manual penetration testing: Some categories of vulnerabilities, such as authorization issues and business logic flaws, cannot be found with automated assessments and will always require a skilled penetration tester to identify them. Employ manual penetration testing that uses proven practices to provide extensive and comprehensive security testing for web, mobile, desktop and back-end, with detailed results, including attack simulations.

Libraries/Containers

Components, such as libraries, frameworks, container images, and other software modules, almost always run with full privileges. If a vulnerable component is exploited, such an attack can facilitate serious data loss or server takeover. Applications using components with known vulnerabilities may undermine application defences and enable a range of possible attacks and impacts.

Automating dependency checks for libraries and container auditing, as well as running other container security processes periodically as part of CI or as part of PRs, can largely prevent these vulnerabilities. Subscribing to tools that draw on vulnerability databases and benchmarks such as OSVDB, Node Security Project, CIS, the National Vulnerability Database and Docker Bench for Security can help identify and fix vulnerabilities periodically. A private Docker registry can also help.

Data Security

Data Security involves putting in place specific controls, standard policies, and procedures to protect data from a range of issues, including:

Enforced encryption: Encrypt, manage and secure data by safeguarding it in transit. Password-based, easy to use and very efficient.
Unauthorized access: Blocking unauthorized access plays a central role in preventing data breaches. Implement a strong password policy and MFA.
Accidental loss: All data should be backed up. In the event of hardware or software failure, a breach or any other damage to data, a backup allows operations to continue with minimal interruption. Storing copies of the files elsewhere also helps to quickly determine how much data was lost and/or corrupted.
Destruction: Endpoint Detection and Response (EDR) provides visibility and defensive measures on the endpoint itself. When attacks occur on endpoint devices, this can stop attackers from gaining access to systems and avoid destruction of the data.

Infra/Cloud

In microservices and the cloud-native architectural approach, the explosion of ephemeral, containerized services that arises from scaling applications increases the complexity of delivery. Kubernetes was developed for this purpose. It provides DevOps teams with an orchestration capability for managing the multitude of deployed services, with built-in automation, resilience, load balancing and much more, which makes it well suited to the reliable delivery of cloud-native applications. Below are some of the key areas in which to gain more control and establish policies, procedures and safeguards by implementing a set of rules for compliance. These rules cover infra privacy, security, breach notification, enforcement, and an omnibus rule that deals with security compliance.

Strong stance on authentication and authorization
Role-Based Access Control (RBAC)
Kubernetes infrastructure vulnerability scanning
Hunting misplaced secrets
Workload hardening from Pod Security to network policies
Ingress Controllers for security best practices
Constantly watch your Kubernetes deployments
Find deviations from desired baselines
Should alert or deny on policy violation
Block/Whitelist (IP or DNS) connections before entering the workloads.
Templatize the deployment/secrets configs and serve as config-as-code.
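The IP allowlisting point above can be illustrated in a few lines. This sketch only shows the decision logic. In a cluster, the rule itself would normally live in an ingress, firewall or network policy configuration. The CIDR ranges and the `is_allowed` helper are hypothetical.

```python
import ipaddress

# Hypothetical allowlist: only these networks may reach the workloads.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("192.168.10.0/24"),
]


def is_allowed(source_ip: str) -> bool:
    """Return True only if the source address falls in an allowed network."""
    try:
        addr = ipaddress.ip_address(source_ip)
    except ValueError:
        return False  # malformed input is denied by default
    return any(addr in net for net in ALLOWED_NETWORKS)


if __name__ == "__main__":
    for ip in ("10.1.2.3", "192.168.11.1", "not-an-ip"):
        print(ip, is_allowed(ip))
```

Keeping the allowlist as data rather than scattered conditions fits the config-as-code principle: the list can be templatized, reviewed and versioned like any other configuration.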

Network Security

Kubernetes brings new requirements for network security because applications designed to run on Kubernetes are usually architected as microservices that rely on the network and make API calls to each other. Steps must be taken to ensure proper security protocols are in place. The following are the key areas for implementing network security for a Kubernetes platform:

Container Groups: Tightly coupled communication between grouped containers is achieved inside the Pod, which contains one or more containers.
Communication between Pods: Pods are the smallest unit of deployment in Kubernetes. A Pod can be scheduled on one of the many nodes in a cluster and has a unique IP address. Kubernetes places certain requirements on communication between Pods when the network has not been intentionally segmented. These requirements include:
Containers should be able to communicate with other Pods without using network address translation (NAT).
All the nodes in the cluster should be able to communicate with all the containers in the cluster.
The IP address assigned to a container should be the same that is visible to other entities communicating with the container.
Pods and Services: Since Pods are ephemeral in nature, an abstraction called a Service provides a long-lived virtual IP address (VIP) that is tied to the service locator (e.g., a DNS name). Traffic destined for that service VIP is then redirected to one of the Pods offering the service, using that specific Pod's IP address as the destination.
Traffic Direction: Traffic is directed to Pods and services in the cluster via multiple mechanisms. The most common is via an ingress controller, which exposes one or more service VIPs to the external network. Other mechanisms include node ports and even publicly-addressed Pods.

Operational Security

Operational security is a procedural form of security that manages risk and encourages viewing operations from the perspective of an adversary, to protect sensitive information from falling into the wrong hands. The following are a few best practices for implementing a robust, comprehensive operational security program:

Implement precise change management processes: All changes should be logged and controlled so they can be monitored and audited.
Restrict access to network devices using AAA authentication: “Need-to-know” is the rule of thumb regarding access and sharing of information.
Least Privilege (PoLP): Give users the minimum access necessary to perform their jobs.
Implement dual control: Those who work on the tasks are not the same people in charge of security.
Automate tasks: reduce the need for human intervention. Humans are the weakest link in any organization’s operational security initiatives because they make mistakes, overlook details, forget things, and bypass processes.
Incident response and disaster recovery planning: These are always crucial components of a sound security posture. We must have a plan to identify risks, respond to them and mitigate potential damage.
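The least-privilege practice above can be sketched with the three user roles named in the infra specification (Admin, Deploy, ReadOnly). The actions assigned to each role here are assumptions made for illustration. In a real cluster these would be expressed as Kubernetes RBAC roles and bindings rather than application code.

```python
# Role names come from the infra specification; the permitted actions
# per role are hypothetical and only illustrate the principle.
ROLE_PERMISSIONS = {
    "Admin": frozenset({"read", "deploy", "manage_users"}),
    "Deploy": frozenset({"read", "deploy"}),
    "ReadOnly": frozenset({"read"}),
}


def is_permitted(role: str, action: str) -> bool:
    """Deny by default: unknown roles and unlisted actions are rejected."""
    return action in ROLE_PERMISSIONS.get(role, frozenset())


if __name__ == "__main__":
    print(is_permitted("Deploy", "deploy"))     # allowed for this role
    print(is_permitted("ReadOnly", "deploy"))   # not part of the role
    print(is_permitted("Unknown", "read"))      # unknown role is denied
```

Starting from an empty permission set and adding only what a role needs keeps the default outcome a denial, which is the safer failure mode.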
