DIGIT - Infra overview, Operational Guidelines & Security Standards
This document provides a DIGIT infrastructure overview and guidelines for operational excellence when DIGIT is deployed on SDC, NIC or any commercial cloud, along with recommendations and the segregation of duties (SoD). It helps plan the procurement and build the necessary capabilities to deploy and implement DIGIT.
The DIGIT platform is designed as a microservices architecture, using open-source technologies and containerized apps and services. DIGIT components/services are deployed as Docker containers on Kubernetes, which provides the flexibility to run cloud-native applications anywhere: on physical or virtual infrastructure, on a hypervisor, on HCI, and so on. Kubernetes handles the work of scheduling containerized services onto a compute cluster and manages the workloads to ensure they run as intended, substantially simplifying the deployment and management of microservices.
Provisioning the Kubernetes cluster varies between commercial clouds and state data centers, especially in the absence of managed Kubernetes services such as those offered by AWS, Azure, GCP and NIC. Kubernetes clusters can also be provisioned in state data centers on bare metal, virtual machines, hypervisors, HCI, etc. However, providing integrated networking, monitoring, logging and alerting is critical for operating Kubernetes clusters in state data centers. The DIGIT platform also offers add-ons for Kubernetes cluster performance monitoring, logging, tracing, service monitoring and alerting, which the implementation team can take advantage of.
Here are some useful links to understand Kubernetes:
DIGIT strongly recommends Site Reliability Engineering (SRE) principles as a key means to bridge development and operations by applying a software engineering mindset to systems and IT administration. In general, an SRE team is responsible for availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning.
Commercial clouds like AWS, Azure and GCP offer sophisticated monitoring solutions across various infra levels, such as CloudWatch and Stackdriver. In the absence of such managed monitoring services, the best practices and tools listed below help with efficient debugging and troubleshooting.
Key Standard Operating Procedures (SOPs)
- Segregation of duties and responsibilities.
- Monitoring dashboards at various levels like Infrastructure, Network and applications.
- Transparency of monitoring data and collaboration between teams.
- Periodic remote sync-up meetings, with acceptance of and attendance at those meetings.
- Visibility into stakeholders' calendar availability for scheduling meetings.
- Periodic (weekly, monthly) summary reports of the various infra, operations incident categories.
- Communication channels and synchronization on a regular basis and also upon critical issues, changes, upgrades, releases etc.
When DIGIT is deployed on state cloud infrastructure, it is essential to identify and distinguish the responsibilities of the infrastructure, operations and implementation partners. Identify these teams, assign a SPOC for each, define responsibilities, and follow incident management to visualize and track issues and manage dependencies between teams. Essentially, these are monitored through dashboards, and alerts are sent to stakeholders proactively. The eGov team can provide consultation and training on a need basis for any of the categories below.
- State IT/Cloud team - Refers to the state infra team responsible for the infra and network architecture, LAN network speed, internet speed, OS licensing and upgrades, patches, compute, memory, disk, firewall, IOPS, security, access, SSL, DNS, data backups/recovery, snapshots, and the capacity monitoring dashboard.
- State program team - Refers to the owner for the whole DIGIT implementation, application rollouts, capacity building. Responsible for identifying and synchronizing the operating mechanism between the below teams.
- Implementation partner - Refers to the DIGIT implementation team: application performance monitoring for errors, log scrutiny, TPS at peak load, distributed tracing, DB query analysis, etc.
- Operations team - This team could be an extension of the implementation team and is responsible for DIGIT deployments, configurations, CI/CD, change management, traffic monitoring and alerting, log monitoring and dashboards, application security, DB backups, application uptime, etc.
This document is intended to provide insights into the security principles, security layers and lines of control that we focus on to secure DIGIT across code, application, access, infra and operations. The audience for this document is internal teams, partners, the ecosystem and states, so they understand which security measures to consider to secure DIGIT from an infrastructure and operations perspective.
1. Minimize the attack surface area
2. Implement a strong identity foundation - who has access to what, and who does what
3. Apply security at all possible layers
4. Automate security best practices
5. Separation of duties (SoD)
6. The principle of least privilege (PoLP)
7. Templatized design - (code, images, infra-as-code, deploy-as-code, conf-as-code, etc.)
The presentation layer is likely to be the #1 attack vector for malicious individuals seeking to breach security defenses, via DDoS attacks, malicious bots, cross-site scripting (XSS) and SQL injection. Invest in web security testing with a powerful combination of tools, automation, process and speed that seamlessly integrates testing into software development, helping to eliminate vulnerabilities more effectively. Also deploy a web application firewall (WAF) that monitors and filters traffic to and from the application, blocking bad actors while safe traffic proceeds normally.
1. TLS protocols/Encryption: Access control to secure authentication and authorization. All exposed APIs must have HTTPS certificates and encrypt all communication between client and server with transport layer security (TLS).
2. Auth Tokens: An authorization framework that allows users to obtain access to a resource from the server. This is done using tokens in the microservices security pattern: resource server, resource owner, authorization server, and client. Access tokens grant access to the resource until their expiry time; refresh tokens are used to request new access after the original token has expired.
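As an illustrative sketch only (not DIGIT's actual auth implementation), an expiring access token can be modelled as an HMAC-signed payload. The `SECRET` key and field names below are placeholder assumptions; a production deployment would rely on a standard OAuth2/JWT library instead of hand-rolled tokens:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"replace-with-a-server-side-secret"  # placeholder key for illustration


def issue_token(user_id, ttl_seconds=900):
    """Issue a signed token that embeds its own expiry time."""
    payload = json.dumps({"sub": user_id, "exp": int(time.time()) + ttl_seconds})
    body = base64.urlsafe_b64encode(payload.encode()).decode()
    sig = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + sig


def verify_token(token):
    """Return the payload if the signature is valid and the token unexpired."""
    try:
        body, sig = token.rsplit(".", 1)
    except ValueError:
        return None
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    payload = json.loads(base64.urlsafe_b64decode(body))
    if payload["exp"] < time.time():
        return None  # expired: the caller should present a refresh token instead
    return payload
```

The expiry check is what forces clients back through the refresh-token flow once the access token ages out.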
3. Multi-factor Authentication: Authorize users on the front end with a username and password plus another form of identity verification, offering better protection by default since some factors are harder to steal than others. For instance, using an OTP for authentication takes microservice security to a whole new level.
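To make the OTP factor concrete, here is a minimal sketch of a time-based one-time password (TOTP) per RFC 6238 using only the standard library; real deployments would use a vetted library and securely provisioned per-user secrets:

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, at=None, digits=6, step=30):
    """Compute an RFC 6238 time-based one-time password.

    secret_b32: base32-encoded shared secret (as shown in authenticator apps).
    at: unix timestamp to compute the code for (defaults to now).
    """
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if at is None else at) // step)
    msg = struct.pack(">Q", counter)  # 8-byte big-endian counter
    digest = hmac.new(key, msg, hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation per RFC 4226
    code = (struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF) % (10 ** digits)
    return str(code).zfill(digits)
```

Because the code depends on the current 30-second window, a stolen password alone is not enough to authenticate.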
4. Rate Limit/DDoS: Denial-of-service attacks attempt to send an overwhelming number of service messages with the aim of causing application failure, concentrating on volumetric flooding of the network pipe. Such attacks can target the entire platform and network stack.
To prevent this:
- We should set a limit on how many requests in a given period of time can be sent to each API.
- If the number exceeds the limit, block further access from that source, at least for some reasonable interval.
- Also, make sure to analyze the payload for threats.
- The incoming calls from a gateway API would also have to be rate-limited.
- Add filters to the router to drop packets from suspicious sources.
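The per-API request limit described above can be sketched as a token bucket, a common rate-limiting algorithm; the rate and capacity values here are arbitrary examples, and a real gateway would keep one bucket per client or API key:

```python
import time


class TokenBucket:
    """Allows `rate` requests per second with bursts of up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate = float(rate)          # tokens added per second
        self.capacity = float(capacity)  # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self, now=None):
        """Return True if the request may proceed, False if it is rate-limited."""
        now = time.monotonic() if now is None else now
        # Refill tokens in proportion to the time elapsed since the last check.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

Requests that return `False` would be rejected (typically with HTTP 429), satisfying the "block access for a reasonable interval" step since the bucket refills gradually.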
5. Cross-Site Scripting (XSS): Scripts embedded in a webpage are executed on the client side, in a user's browser, instead of on the server side. When applications take data from users and dynamically include it in webpages without validating it properly, attackers can execute arbitrary commands and display arbitrary content in the user's browser to gain access to account credentials.
How to prevent:
- Applications must validate data input to the web application from user browsers.
- All output from the web application to user browsers must be encoded.
- Users must have the option to disable client-side scripts.
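The output-encoding step above can be sketched with the standard library's HTML escaping; `render_comment` is a hypothetical helper, and a real application would use its template engine's auto-escaping instead:

```python
from html import escape


def render_comment(user_input):
    """Encode user-supplied text before embedding it in HTML so that any
    markup is displayed as literal text instead of being executed."""
    return "<p>" + escape(user_input, quote=True) + "</p>"


# A script payload is rendered harmless:
print(render_comment("<script>alert(1)</script>"))
# → <p>&lt;script&gt;alert(1)&lt;/script&gt;</p>
```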
6. Cross-Site Request Forgery (CSRF): An attack whereby a malicious website sends a request to a web application that a user is already authenticated against, from a different site. This way an attacker can access functionality in a target web application via the victim's already-authenticated browser. Targets include web applications like social media, in-browser email clients, online banking and web interfaces for network devices. To prevent this, append CSRF tokens to each request and associate them with the user's session. Such tokens should at a minimum be unique per user session, but can also be unique per request.
How to prevent:
- By including a challenge token with each request, the developer can ensure that the request is valid and not coming from a source other than the user.
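One common way to implement such a challenge token is to derive it from the session with an HMAC, so the server need not store it separately. This is a minimal sketch under that assumption; `SERVER_KEY` is a placeholder, and frameworks such as Django or Spring provide this out of the box:

```python
import hashlib
import hmac

SERVER_KEY = b"replace-with-a-server-side-secret"  # placeholder key


def csrf_token_for(session_id):
    """Derive a per-session CSRF token; embed it in each form as a hidden field."""
    return hmac.new(SERVER_KEY, session_id.encode(), hashlib.sha256).hexdigest()


def csrf_valid(session_id, submitted_token):
    """Constant-time comparison so the token cannot be leaked via timing."""
    return hmac.compare_digest(csrf_token_for(session_id), submitted_token)
```

A request forged from another origin cannot include the correct token for the victim's session, so `csrf_valid` rejects it.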
7. SQL Injection (SQLi): Allows attackers to control an application’s database – letting them access or delete data, change an application’s data-driven behavior, and do other undesirable things – by tricking the application into sending unexpected SQL commands. SQL injections are among the most frequent threats to data security.
How to prevent:
- Using parameterized queries which specifies placeholders for parameters so that the database will always treat them as data rather than part of a SQL command. Prepared statements and object relational mappers (ORMs) make this easy for developers.
- Remediate SQLi vulnerabilities in legacy systems by escaping inputs before adding them to the query. Use this technique only where prepared statements or similar facilities are unavailable.
- Mitigate the impact of SQLi vulnerabilities by enforcing least privilege on the database. Ensure that each application has its own database credentials, and that these credentials have the minimum rights the application needs.
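The parameterized-query advice above can be illustrated with Python's built-in sqlite3 driver (the `users` table and data are invented for the example; the same placeholder pattern applies to any database driver):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin'), ('bob', 'clerk')")


def find_user(conn, name):
    # The '?' placeholder makes the driver treat `name` strictly as data,
    # never as part of the SQL command itself.
    return conn.execute(
        "SELECT name, role FROM users WHERE name = ?", (name,)
    ).fetchall()


# A classic injection payload matches nothing instead of dumping the table:
print(find_user(conn, "' OR '1'='1"))  # → []
print(find_user(conn, "alice"))        # → [('alice', 'admin')]
```

Had the query been built with string concatenation, the first call would have returned every row in the table.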
The primary causes of commonly exploited software vulnerabilities are consistently defects, bugs, and logic flaws in the code. It is clear that poor coding practices can create vulnerabilities in the system that can be exploited by cyber criminals.
What defines security in the code:
1. White-box code analysis: As developers write code, the IDE needs to provide focused, real-time security feedback with white-box code analysis. It also helps developers remediate faster and learn on the job through positive reinforcement, remediation guidance, code examples, etc.
2. Static Code Analysis (SAST): A static analysis tool reviews program code, searching for application coding flaws, back doors or other malicious code that could give hackers access to critical data or customer information. But most static analysis tools can only scan source code.
3. Vulnerability assessment: Vulnerability assessment of third-party libraries/artifacts as part of CI and the GitHub PR process. Test results are returned quickly and prioritized in a fix-first analysis that identifies both the most urgent flaws and the ones that can be fixed most quickly, allowing developers to optimize effort and save additional resources.
4. Secure PII/Encrypt: Personally identifiable information must not be displayed as plain text. All passwords and usernames must be masked when stored in logs or records. However, adding extra encryption on top of TLS/HTTPS won’t add protection for traffic traveling over the wire. It can only help a little at the point where TLS terminates, where it can protect sensitive data (such as passwords or credit card numbers) from being accidentally dumped into a request log. Extra encryption (RSA 2048+ or Blowfish) might help protect data against attacks that aim at accessing the log data, but it will not help against those that access the memory of the application servers or the main data storage.
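The masking requirement can be sketched as a small filter applied to structured log records before they are written; the field names in `SENSITIVE_KEYS` are example assumptions, and production code would typically hook this into the logging framework itself:

```python
import json

# Example field names to redact; a real deployment would maintain this list
# centrally alongside its data-classification policy.
SENSITIVE_KEYS = {"password", "username", "token", "card_number"}


def mask_record(record):
    """Return a copy of a log record dict with sensitive fields redacted."""
    return {k: ("***" if k.lower() in SENSITIVE_KEYS else v)
            for k, v in record.items()}


event = {"username": "alice", "password": "s3cret", "action": "login"}
print(json.dumps(mask_record(event)))
```

Only the redacted copy ever reaches the log file, so credentials cannot leak through log aggregation or backups.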
5. Manual Penetration Testing: Some categories of vulnerabilities, such as authorization issues and business logic flaws, cannot be found with automated assessments and will always require a skilled penetration tester to identify them. Employ manual penetration testing that uses proven practices to provide extensive and comprehensive security testing for web, mobile, desktop and back-end systems, with detailed results including attack simulations.
Components, such as libraries, frameworks, container images, and other software modules, almost always run with full privileges. If a vulnerable component is exploited, such an attack can facilitate serious data loss or server takeover. Applications using components with known vulnerabilities may undermine application defenses and enable a range of possible attacks and impacts.
Automating dependency checks for libraries, auditing containers, and using other container security processes as part of CI, periodically or as part of PRs, can largely prevent these vulnerabilities. Subscribing to tools that track vulnerability databases such as OSVDB, the Node Security Project, CIS and the National Vulnerability Database, along with Docker Bench for Security, can help identify and fix vulnerabilities periodically. A private Docker registry can also help control which images are used.
Data Security involves putting in place specific controls, standard policies, and procedures to protect data from a range of issues, including:
- Enforced encryption: Encrypt, manage and secure data by safeguarding it in transit. Password-based, easy to use and very efficient.
- Unauthorized access: Blocking unauthorized access plays a central role in preventing data breaches. Implementing Strong Password Policy and MFA.
- Accidental loss: All data should be backed up. In the event of hardware or software failure, a breach, or any other damage to data, a backup allows operations to continue with minimal interruption. Storing the files elsewhere also helps quickly determine how much data was lost and/or corrupted.
- Destruction: Endpoint Detection and Response (EDR) provides visibility and defensive measures on the endpoint itself; when attacks occur on endpoint devices, EDR can stop attackers from gaining access to systems and prevent destruction of the data.
In the microservices and cloud-native architectural approach, the explosion of ephemeral, containerized services that arises from scaling applications increases the complexity of delivery. Fortunately, Kubernetes was developed for just this purpose. It provides DevOps teams with an orchestration capability for managing the multitude of deployed services, with built-in automation, resilience, load balancing, and much more, making it well suited to reliable delivery of cloud-native applications. Below are some of the key areas for gaining more control and establishing policies, procedures and safeguards through a set of compliance rules. These rules cover infra privacy, security, breach notification, enforcement, and an omnibus rule that deals with security compliance.
- Strong stance on authentication and authorization
- Role-Based Access Control (RBAC)
- Kubernetes infrastructure vulnerability scanning
- Hunting misplaced secrets
- Workload hardening from Pod Security to network policies
- Ingress Controllers for security best practices
- Constantly watch your Kubernetes deployments
- Find deviations from desired baselines
- Should alert or deny on policy violation
- Block or allowlist (IP or DNS) connections before they enter the workloads.
- Templatize the deployment/secrets configs and serve as config-as-code.
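As one concrete example of "hunting misplaced secrets" from the checklist above, a minimal pattern-based scan over manifests or config-as-code files might look like the sketch below. The patterns are illustrative assumptions only; dedicated scanners such as gitleaks or trufflehog ship far larger rule sets tuned against false positives:

```python
import re

# Hypothetical example patterns for common secret shapes.
SECRET_PATTERNS = {
    "aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "generic_password": re.compile(r"(?i)password\s*[:=]\s*\S+"),
}


def scan_text(text):
    """Return the names of secret patterns found in a config or manifest."""
    return sorted(name for name, pat in SECRET_PATTERNS.items() if pat.search(text))


manifest = "db_host: db.example.org\npassword: hunter2\n"
print(scan_text(manifest))  # → ['generic_password']
```

Running such a check in CI on every PR catches credentials before they reach the cluster, where they belong in Kubernetes Secrets or an external vault instead.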
Kubernetes brings new requirements for network security, because applications designed to run on Kubernetes are usually architected as microservices that rely on the network and make API calls to each other. Steps must be taken to ensure proper security protocols are in place. The following are the key areas for implementing network security on a Kubernetes platform:
- Container Groups: Tightly coupled communication between grouped containers is achieved inside the Pod, which contains one or more containers.
- Communication between Pods: Pods are the smallest unit of deployment in Kubernetes. A Pod can be scheduled on one of the many nodes in a cluster and has a unique IP address. Kubernetes places certain requirements on communication between Pods when the network has not been intentionally segmented. These requirements include:
- Containers should be able to communicate with other Pods without using network address translation (NAT).
- All the nodes in the cluster should be able to communicate with all the containers in the cluster.
- The IP address assigned to a container should be the same that is visible to other entities communicating with the container.
- Pods and Services: Since Pods are ephemeral in nature, an abstraction called a Service provides a long-lived virtual IP address tied to a service locator (e.g., a DNS name). Traffic destined for that service VIP is redirected to one of the Pods offering the service, using that specific Pod’s IP address as the destination.
- Traffic Direction: Traffic is directed to Pods and services in the cluster via multiple mechanisms. The most common is via an ingress controller, which exposes one or more service VIPs to the external network. Other mechanisms include nodePorts and even publicly-addressed Pods.
Operational security is a procedural discipline for managing risk that encourages viewing operations from the perspective of an adversary in order to protect sensitive information from falling into the wrong hands. The following are a few best practices for implementing a robust, comprehensive operational security program:
- Implement precise change management processes: All changes should be logged and controlled so they can be monitored and audited.
- Restrict access to network devices using AAA authentication: "need-to-know" is the rule of thumb for access to and sharing of information.
- Least privilege (PoLP): Give people the minimum access necessary to perform their jobs.
- Implement dual control: Those who work on the tasks are not the same people in charge of security.
- Automate tasks: reduce the need for human intervention. Humans are the weakest link in any organization’s operational security initiatives because they make mistakes, overlook details, forget things, and bypass processes.
Incident response and disaster recovery planning are always crucial components of a sound security posture; we must have a plan to identify risks, respond to them, and mitigate potential damage.