study guide

17 posts with the tag “study guide”

Examining Technical Processes for GCP

Sep 24, 2023

Christopher Shaun Godwin

Author

In your role as an architect, you’ll be involved in numerous technical procedures, some of which were covered in earlier chapters, like continuous integration/continuous delivery (CI/CD) and post-mortem analysis. This chapter will delve deeper into these and other processes like software development lifecycle management, testing and validation, and business continuity and disaster recovery planning. Our aim is to present a comprehensive understanding of these technical processes, with an emphasis on their connection to business goals. For a more technical exploration, including the use of tools like Jenkins in CI/CD, please refer to the subsequent discussions.

Software Development Lifecycle Plan

Analysis
Design
Development
Testing
Deployment
Documentation
Maintenance

This is called a cycle because at the end of the process, we iterate over again from our business and technical needs.

Analysis and Examination

The analysis and requirements gathering phase is critical for a Google Cloud project to ensure the solution meets business needs. Key analysis activities include:

Evaluating the Business Problem - Work with stakeholders to fully understand the issues or opportunities being addressed. Drill into the pain points experienced by users and customers. Quantify the impacts on revenue, productivity, and other metrics. This foundational understanding guides the entire project.

Assessing Solution Options - With the problem scope clarified, brainstorm potential technical and process solutions. Leverage Google Cloud technologies like BigQuery, App Engine, and AI Platform for options. Estimate level of effort, costs, and benefits for each option.

Analyzing Requirements - Gather detailed requirements through sessions with stakeholders. Document user stories, edge cases, interfaces, reporting needs, and more. Prioritize must-have versus nice-to-have capabilities. Define MVP vs. longer term functionality.

Clarifying Technical Constraints - Determine limitations imposed by data sources, legacy systems, regulations, and other factors. Identify potential blockers and dependencies.

Defining the Solution Scope - Synthesize the research into high-level solutions, priorities, delivery timelines, and measures of success. Build consensus among stakeholders on what will be delivered.

Careful analysis and requirements gathering reduces risk by aligning project plans with business needs. The deliverables enable constructive discussions on tradeoffs and set clear expectations before committing to a solution.

Problem Scoping

Problem scoping involves clearly defining the issues or opportunities to be addressed by the project. This requires understanding the current state and desired future state from the user’s perspective. Effective scoping frames the problem statement and bounds the scope to reasonable parameters. It identifies relevant systems, stakeholders, processes, and objectives. Well-defined problem scoping sets the foundation for the solution requirements and design. It focuses efforts on the core issues rather than trying to boil the ocean. The analysis should yield a narrowly targeted problem statement that the project aims to resolve for a specific set of users.

Domain knowledge from teams with direct experience is critical for accurate problem scoping. For example, having customer support agents who regularly interface with users participate in requirements gathering will surface pain points that internal teams may overlook. Operations engineers who maintain existing systems can identify technical limitations and dependencies. Subject matter experts like data scientists and UX designers can provide realistic assessments of proposed solutions. Involving these domain experts validates assumptions and brings real-world perspectives to scope the problem appropriately. Direct engagement with the right staff builds comprehensive understanding to frame the problem statement and requirements.

Evaluating Options

When evaluating solutions for Google Cloud projects, leveraging the platform’s comprehensive toolset and the team’s domain expertise is key. For example, if improving analytics processing time is the scoped problem, options would include migrating analytics to BigQuery for scalability, using Dataflow for streaming pipelines, and employing AI Platform for predictive modeling. Google engineers can provide guidance on capability, complexity, and costs of each option based on real customer engagements. The cloud support team can detail integration and migration considerations. Together, detailed problem scoping with domain knowledge of Google Cloud capabilities enables data-driven evaluation of solution options on metrics like time, cost, and quality. Evaluations based on Google’s experience and advice sets projects up for successful outcomes within reasonable constraints.

When well-aligned to the problem scope, commercial software can offer a faster and lower-risk alternative to custom development. For common needs like CRM, HR systems, or content management, COTS solutions have pre-built capabilities that can be configured versus built from scratch. This can significantly reduce project timelines and costs. COTS options should be considered when requirements closely match package functionality and limited customization is needed. However, COTS does bring constraints, like rigid workflows or license fees. Integration with other systems may be limited. Vendor dependence risks continuity. Before pursuing COTS, the team should evaluate fit, total cost of ownership, limitations, and vendor viability. Example COTS solutions that may merit consideration for applicable problems include Salesforce CRM, Workday HR, and Adobe Marketing Cloud.

Sometimes the optimal solution is to modify or extend existing applications vs. building new ones. This leverages prior investments and skills while incrementally improving capabilities. When evaluating options, modernization of legacy apps should be considered based on factors like remaining lifespan, technical debt, business value. Modifications may involve re-platforming, re-architecting databases or UIs, integrating new APIs and microservices. Google Cloud provides tools like Cloud Code and Migrate for Anthos to incrementally transform applications.

Greenfield development is advised when existing systems are highly outdated, fragmented, or limiting. Building from scratch enables creating optimal UX, modern tech stack, and cloud-native architecture. While resource-intensive, greenfield development removes legacy constraints and technical debt. It should be considered when no platform exists to meet business needs. Still, integration challenges with remaining legacy systems can add complexity.

Migrating existing apps to the cloud often requires modifications to enable cloud capabilities. Re-architecting for microservices, adding autoscaling, optimizing for serverless, and leveraging managed cloud services typically involves app changes. Google’s Migrate for Anthos can automate and modernize parts of the migration. But modifications are likely required to realize the full benefits of cloud. Assessing migration options should consider app changes needed versus “lift and shift”.

Cost Benefit Analysis

Performing cost-benefit analysis is a critical skill for cloud architects to quantify the business case for technology investments. For Google Cloud projects, analyze costs across the full lifecycle including implementation, operations, maintenance, and sun-setting legacy systems. Consider both hard costs like gear, licenses, and engineering time as well as soft costs like training, change management, and risks/liabilities.

Weigh these costs against the expected strategic and tactical benefits for metrics like revenue, customer satisfaction, brand reputation, and competitive advantage. Assign tangible values to intangible benefits where possible. Involve finance teams to model total cost of ownership and return on investment.

For example, migrating analytics to BigQuery could require higher point-in-time costs for data migration, pipeline changes, added headcount, and training. But benefits like improved insights, faster customer intelligence, and developer productivity gains over time could outweigh the near-term expenses.

Likewise, replacing legacy CRM with Salesforce adds licensing costs but can enable sales productivity and pipeline visibility gains that ultimately pay for themselves. Focus beyond simple cost comparisons to fully capture benefits. Leverage Google Cloud Pricing Calculator to estimate usage costs. Consider Cloud Billing discounts like committed use and enterprise agreements to optimize spending. Building credible business cases via thorough cost-benefit analysis is essential for gaining executive buy-in on Google Cloud investments.

Design

The design phase is crucial for architecting scalable, secure, and robust Google Cloud solutions. Design involves translating requirements into technical specifications that serve as blueprints for development teams. Areas of focus include mapping architectures, data models, infrastructure topology, connectivity, integrations, UIs, APIs, security controls, and disaster recovery. Architectural diagrams are core design artifacts. Design decisions consider factors like time-to-market, TCO, extendability, ease of maintenance, and leveraging native Google Cloud building blocks. Well-constructed designs align technology means with business ends.

High-level Design

High-level design defines the major architectural components and interactions for a solution. It establishes a conceptual blueprint prior to detailed technical specifications.

Identifying Major Components

Break down the overall system into core functional pieces. For example, an e-commerce platform may include:

Frontend app - Browser/mobile apps for shopping workflows

Backend app - Business logic, integrations, order processing

Databases - Products, customers, orders, transactions, analytics

Storage - Blobs for images, videos, documents

CDN - Cache static content closer to users

Payment gateway - Process credit cards securely

Notifications - Email, SMS, push for order status

Search/Recommendations - Catalog lookups and suggestions

Analytics - Usage statistics, metrics, reporting

Third-party APIs - Shipping, taxes, marketing, fraud detection

Component segregation promotes separation of concerns and modularity.

Defining Component Interfaces

Identify key connections and integrations between components. Specify input/output data formats and protocols.

This is crucial for high-volume transactional exchanges like orders passing between frontend, backend, databases, and payment gateways. Architect for scale during peak loads and traffic spikes like holiday sales.

Latency-sensitive UIs require responsive APIs. Asynchronous flows using message queues and caches help ensure snappy performance even during peaks. Indexed databases speed lookups for search and recommendations.

Component contracts establish clear expectations for interoperability. Strong interfaces decouple subsystems, enhancing maintainability and extensibility. Loose coupling eases onboarding of new technologies like Kubernetes and Knative serverless.

High-level designs focus on major building blocks, interactions, and flows. They help validate fit to requirements before diving into technical minutiae. Align components with Google Cloud services like Compute Engine, App Engine, and Dataflow for execution. Create modular architecture supported by clean interfaces and separation of concerns.

Detailed Design

The detailed design phase fleshes out specifications for each component. This includes:

Data structures - Define database schemas, table relationships, document formats, message payloads, etc. Optimize queries and indexes for performance. For example, denormalize tables for fast reads even if it duplicates some data.

Service accounts - Specify privileges, roles, and access controls. Follow principle of least privilege, e.g. read-only APIs for public data. Use Cloud IAM to manage permissions.

Algorithms - Map out business logic, calculations, data transformations, analytics, machine learning models, etc. Leverage Cloud services like Dataflow and AI Platform.

UIs - Wireframes, page flows, style guides, client-side logic. Ensure mobile-friendly responsive design.

Logging - Structured logs for monitoring and debugging all components. Aggregate with Cloud Logging.

Engaging domain experts who will implement the designs is vital. Their experience surfaces edge cases and opportunities to refine implementations without wasted effort. For example, App Engine developers can recommend splitting front-end and back-end services to isolate scaling and security.

Choosing foundational software like OS, languages, frameworks, and databases affects operations and costs. While open source is free, it requires more effort for patches and upgrades. Managed platforms like Cloud Run reduce admin overhead at an added cost.

For example, running containerized microservices on Cloud Run avoids managing Kubernetes infrastructure yourself. But you lose fine-grained resource controls. There are always tradeoffs to evaluate.

Detailed designs enable building smooth-running, efficient systems. Collaborating with implementation teams ensures designs translate cleanly into production-ready code.

Development, Testing, and Deployment

Development teams build out system components based on the technical designs using coding languages, frameworks, and cloud services. They create executable artifacts like applications, functions, containers, schemas and scripts.

Artifacts are configured for environments like dev, test, staging, and prod. For example, separate Redis caches per environment. Load balancers and autoscaling rules match expected usage patterns.

Static analysis tools like cred scans, dependency checks, and vulnerability scanning are integrated in CI/CD pipelines to identity issues early. Unit and integration testing validate code modules before release.

End-to-end testing across staged environments shakes out bugs before production deployment. Stress/load testing verifies performance holds at peak levels.

Monitoring and logging are implemented for observability. Canary releases rollout changes to a subset of users first.

Deployment automation tools like Terraform and Cloud Build ship artifacts to environments reliably and repeatably. Zero downtime deployments are preferred over risky big bang releases. Rollbacks recover quickly from failures.

Documentation like runbooks, playbooks, and architecture diagrams are created alongside implementation. Immutable infrastructure patterns on containers simplify environment consistency.

In summary, development brings designs to life into hardened, production-ready implementations. Testing and automation help deploy those implementations rapidly, safely, and reliably. Careful configuration, testing, and documentation are essential for smooth cloud operations.

Maintenance

Once successfully deployed, maintenance activities sustain ongoing operations of the solution:

Bug Fixes - Issues inevitably arise in production that require diagnosis and rapid patching. Monitoring alerts help surface problems early. Logs and debugging tools facilitate root cause analysis. Bug fixes aim to resolve specific defects without introducing regressions.

Enhancements - New features, capabilities, and integrations are needed over time to improve the product. Enhancements build upon the existing codebase vs. major rewrites. They go through the SDLC starting with scoping needs and designing changes.

Technical Debt Reduction - Shortcuts taken initially like tight coupling, incomplete implementations, or technical shortcuts accrue debt over time. Refactoring to modernize architectures, improve performance, and enhance resilience pays down this debt.

Upgrades - Frameworks, libraries, APIs, and cloud services eventually reach end-of-life and need upgrading. Kubernetes engine rolling upgrades exemplify non-disruptive approach.

Sun-setting - Retiring legacy solutions that have been replaced. Redirecting traffic, exporting data, and dismantling resources.

Ongoing maintenance sustains production health. Establish processes to continuously improve operations, reliability, efficiency and effectiveness. Monitor for performance, availability, and stability trends.

Leverage managed services to reduce maintenance overhead. Implement immutable infrastructure patterns for consistency. Automate testing to prevent regressions.

Evaluate when re-architecture is needed versus incremental improvements. Factor maintenance needs into solution designs and technical choices.

In summary, maintenance keeps solutions aligned with business needs through a culture of incremental, continuous improvement while remaining focused on end-user value.

Continuous Integration & Delivery

Continuous integration and deployment (CI/CD) automates building, testing, and releasing software changes rapidly and reliably handling all we’ve discussed on an automatic basis. CI/CD pipelines improve software quality and accelerate delivery to end users. Architects must design robust CI/CD workflows to unlock agility benefits. Google Cloud provides managed CI/CD services like Cloud Build, Cloud Source Repositories, and Cloud Deploy to simplify implementation.

Business Drivers to Adopt CI/CD

The first driver for CI/CD adoption is accelerating speed to market. Manual software release processes slow down delivery and cannot keep pace with the rapid rate of change expected by customers today. CI/CD automates testing and deployments enabling teams to safely release changes in hours or minutes versus weeks or months. Rapid iteration speeds new features, bug fixes, and innovations to customers faster.

The second driver is improving software quality. CI/CD bakes in testing from unit to integration to end-to-end levels for every commit. Issues are caught early before impacting users. Automated testing provides consistency across environments. Robust testing reduces risks from defects and outages caused by problematic changes. Higher quality improves customer satisfaction.

The third driver is increasing developer productivity. CI/CD eliminates tedious repetitive tasks like configuring test beds, running regressions, and deploying builds. Developers gain more time for innovation by offloading these roadblocks to automated pipelines. Self-service access enables releasing changes on demand. By systematically catching issues early, CI/CD also massively cuts down on wasteful rework. Developers can deliver more business value faster.

CI/CD’s compelling benefits around accelerating speed to market, improving software quality, and increasing developer productivity explain its widespread enterprise adoption. Businesses recognize CI/CD’s power to meet the rapid pace of change expected by modern customers.

CI/CD Elements

Continuous delivery systems are comprised of source control, build automation, testing suites, deployment orchestration, and runtime monitoring capabilities to enable push-button software releases, with core elements including version control repositories, build tools, test runners, container registries, orchestrators like Kubernetes, CI/CD platforms like Jenkins or Spinnaker, infrastructure provisioning through infrastructure-as-code tools, observability dashboards, and more.

When these capabilities for source control, build/test automation, and environment/deploy orchestration are tightly integrated and driven through code, it enables a “GitOps” approach to software delivery. With GitOps, the application source code repository acts as the single source of truth for both developers making changes as well as for the CI/CD tooling that builds, tests, packages and deploys those changes. Infrastructure definitions using infrastructure-as-code are versioned alongside the application code. Deployments and configuration changes are applied automatically on every code change merged to the main branch. Runtime monitoring checks for any drift between code definitions and system state. This tight feedback loop between git repository, automation tooling, and production environments powered by code gives DevOps teams end-to-end visibility and control of the entire software lifecycle.

VCS

Version control tools and strategies are instrumental in GitOps design planning, especially in an environment that leverages Google Cloud Platform (GCP). When preparing for the GCP Professional Cloud Architect exam, understanding how GitOps integrates with GCP services like Cloud Build, Cloud Source Repositories, and Kubernetes Engine is crucial. In GitOps, a version control system like Git serves as the ‘single source of truth’ for the declarative state of your infrastructure and applications. By treating infrastructure as code, you facilitate automated, reliable, and fast deployments, which is in line with many of the architectural best practices covered in the exam.

GCP services are built to work seamlessly with version control systems, enhancing the GitOps workflow. For instance, Google Cloud Build can be triggered to automate builds and deployments whenever there is a Git commit. Cloud Source Repositories, a fully-featured, scalable, and private Git repository service by GCP, can serve as your central Git repository, integrating directly with other GCP services. A Cloud Architect should understand how to design systems that incorporate these services for a cohesive GitOps workflow, an area of focus in the certification exam.

In GitOps, monitoring and observability are made simpler because changes are trackable and reversible through Git. Within the GCP ecosystem, monitoring solutions like Cloud Monitoring and Cloud Logging can be integrated into the GitOps pipeline to track performance metrics and logs in real-time. The ability to correlate deployments and changes with system behavior is beneficial for making informed architectural decisions. Therefore, a solid grasp of GitOps, backed by version control strategies, not only helps you implement efficient CI/CD pipelines but also prepares you for scenarios that might appear in the GCP Professional Cloud Architect exam.

Understanding the integration of version control tools and GitOps in a GCP environment is essential for two key reasons. First, it prepares you to build automated, secure, and efficient CI/CD pipelines, a crucial element in cloud architecture. Second, it equips you with knowledge that is directly applicable to topics likely to be covered in the GCP Professional Cloud Architect exam. Both of these benefits make version control and GitOps an indispensable part of your exam preparation and practical application.

Secrets

Secrets management is a critical component of cloud architecture and a focus area for anyone preparing for the GCP Professional Cloud Architect exam. The ability to securely handle sensitive information like API keys, access tokens, and certificates is crucial for maintaining the integrity and security of applications and services. Google Cloud Secret Manager, a fully managed service on GCP, provides a centralized and secure way to manage, access, and audit secrets. It allows Cloud Architects to set IAM policies, enabling fine-grained control over who can access what secrets, thereby contributing to a more robust security posture. Understanding the nuances of Secret Manager, such as versioning and audit logging, could well be a topic you encounter on the exam.

Apart from Google Cloud Secret Manager, popular vault systems like HashiCorp Vault are also widely used for secrets management. HashiCorp Vault not only provides features for storing secrets securely but also offers functionalities like secret generation, data encryption, and identity-based access. Given that the GCP Professional Cloud Architect exam may include hybrid or multi-cloud scenarios, understanding how HashiCorp Vault integrates with GCP resources is valuable. This can be particularly useful when dealing with workloads that span multiple cloud providers or even on-premises data centers.

One essential best practice to follow, which is likely to be endorsed in the GCP Cloud Architect exam, is the strict avoidance of storing secret values within code repositories. Even with private repositories, the risk associated with exposing sensitive information can lead to significant security vulnerabilities. Tools like git-secrets or pre-commit hooks can be configured to prevent accidental commits of secrets into version control systems. Also, both Google Cloud Secret Manager and HashiCorp Vault can integrate with CI/CD pipelines to provide secrets dynamically, mitigating the need to hardcode sensitive information in codebases.

A robust understanding of secrets management is indispensable for both practical application and preparation for the GCP Professional Cloud Architect exam. You’ll want to be versed in best practices like avoiding the storage of secrets in code repositories and understand the functionalities and limitations of secret management services like Google Cloud Secret Manager and HashiCorp Vault. Mastering these topics not only enhances the security posture of your cloud architecture but also prepares you for scenarios likely to appear in the certification exam.

Deployment

In the context of analyzing and defining technical processes, mastering the intricacies of Deployment Pipelines in Continuous Deployment is pivotal. A Deployment Pipeline is essentially a series of automated steps that allow software teams to reliably and efficiently release their code to the end-users. It includes building the code, running a suite of tests to detect bugs and vulnerabilities, and finally, deploying the code to production environments. For a Cloud Architect, especially one preparing for the GCP Professional Cloud Architect exam, understanding how to design and implement these pipelines on Google Cloud Platform using services like Cloud Build, Cloud Functions, and Google Kubernetes Engine is essential. These services, when properly configured, can automatically pick up code changes from repositories, build container images, and deploy them to orchestrated container platforms, thus bringing significant agility to the development cycle.

When developing deployment pipelines, certain technical processes are crucial for robustness and scalability. These include blue-green deployments, canary releases, and feature flags, which allow for minimal downtime and low-risk feature rollouts. The GCP Professional Cloud Architect exam often touches on how to architect such processes for scalability, fault-tolerance, and seamless rollbacks. For example, by leveraging Google Kubernetes Engine, you can effortlessly implement blue-green deployments by switching service labels between stable and new release versions. Additionally, Stackdriver, Google Cloud’s integrated monitoring, logging, and diagnostics host, can be woven into the pipeline to provide real-time insights and facilitate quicker decision-making.

Security also plays a vital role in deployment pipelines. Automated security checks, secret management, and compliance audits are part and parcel of the deployment process. Knowing how to integrate tools like Google Cloud Secret Manager for secure handling of API keys or credentials, and setting IAM policies to restrict pipeline access are skills that can set you apart. These considerations are not only imperative for real-world applications but are likely covered under the ‘Analyzing and Defining Technical Processes’ section of the GCP Professional Cloud Architect exam.

Understanding Deployment Pipelines in Continuous Deployment is vital for both your real-world applications and for acing the ‘Analyzing and Defining Technical Processes’ section of the GCP Professional Cloud Architect exam. Being proficient in implementing automated, secure, and scalable deployment processes using Google Cloud Platform’s array of services prepares you for complex architectural questions and scenarios you may encounter in the exam. Therefore, honing these skills is twofold beneficial, offering practical advantages while increasing your likelihood of certification success.

Secrets

Managing secrets securely is a critical element for anyone preparing for the GCP Professional Cloud Architect exam, especially when it comes to designing and implementing deployment pipelines. Google Cloud Secret Manager offers a centralized and secure way to manage sensitive information like API keys, access tokens, and certificates. Understanding how to leverage Secret Manager to inject secrets into CI/CD pipelines, which could be orchestrated using Google Cloud Build or Kubernetes Engine, is essential. Best practices such as fine-grained access control through IAM policies can ensure that only authorized services or personnel can access these secrets. Learning how to integrate Secret Manager with other GCP services for automated and secure secret retrieval during deployment phases will not only strengthen the pipeline but could also be a focus area in the certification exam. Moreover, knowing how to avoid common pitfalls like storing secrets in code repositories is pivotal for both exam success and real-world application security.

Troubleshooting and Post-mortem Culture

Troubleshooting and post-mortem culture are essential aspects of Analyzing and Defining Technical Processes, particularly when aiming to pass the GCP Professional Cloud Architect exam. Mastery in troubleshooting implies not just fixing immediate issues but understanding the architecture well enough to anticipate and prevent future problems. GCP provides robust logging and monitoring tools like Cloud Monitoring and Cloud Logging that can be integrated into the incident response strategy. Knowing how to leverage these tools to identify bottlenecks or vulnerabilities can be an important part of the certification exam.

Post-mortem culture, on the other hand, involves the systematic review of incidents or failures to understand their root causes. Lessons learned from post-mortems often lead to preventive measures that improve system resilience and performance. Google Cloud’s suite of SRE (Site Reliability Engineering) tools can facilitate effective post-mortems by providing key data and insights. A strong grasp of these methodologies not only enhances your operational excellence but is likely to be a topic covered in the GCP Professional Cloud Architect exam.

Incident Post Mortems

An incident refers to an unplanned event that disrupts the normal operation of a system or leads to a suboptimal user experience. Postmortems are the structured analyses performed after resolving the incident to uncover its root causes, learn from the event, and improve future responses. When preparing for the GCP Professional Cloud Architect exam, understanding incident management and the role of postmortems is crucial. These practices directly relate to Analyzing and Defining Technical Processes, a key domain in the certification. GCP offers specialized tools for incident monitoring and logging that can assist in both real-time troubleshooting and post-incident reviews. Mastery of these areas will better equip you for exam scenarios and real-world applications.

Learning from Incidents

When preparing for the GCP Professional Cloud Architect exam, a nuanced understanding of how to analyze and learn from both minor and major incidents is crucial. Minor incidents are those that cause limited impact on your system’s availability, performance, or user experience. Although they may seem inconsequential, overlooking them could lead to more significant issues in the long term. The key to managing minor incidents is rapid identification and containment. Tools like Google Cloud Monitoring can help you set up alerts for anomalies that indicate a minor problem, enabling quick action.

Another important aspect of dealing with minor incidents is documentation. While the incidents themselves might be minor, the patterns that emerge could indicate a larger, systemic issue. It’s crucial to log even small disruptions or glitches using a platform like Google Cloud Logging. Over time, this data can provide invaluable insights into the health and efficiency of your infrastructure, which can be crucial not just for the business but also for questions you might encounter on the GCP Professional Cloud Architect exam.

Immediate resolution should be the aim for minor incidents, but the learnings should contribute to preventive measures. After resolving the incident, run a lightweight postmortem to identify the root cause and recommend preventive actions. Though the solutions might be simple, such as code fixes or updates, their role in avoiding future incidents can be significant. Implement these preventive steps as part of a continuous improvement process, as it contributes to the stability and resilience of the system.

Lastly, minor incidents serve as a great training ground for incident response teams. They present an opportunity to improve response strategies and communication protocols without the pressure of a significant system failure. Periodic reviews of minor incidents, and the response strategies employed, can provide a wealth of knowledge to both your team and you as you prepare for the GCP Professional Cloud Architect exam.

On the other hand, major incidents are significant events that cause a noticeable impact on system performance, availability, or security. They demand immediate attention and rapid mobilization of resources. Google’s Site Reliability Engineering (SRE) principles emphasize the importance of immediate, coordinated action to mitigate the issue. When such incidents occur, it’s often necessary to establish an Incident Command System (ICS) to manage the situation efficiently. The ICS is a hierarchical structure that allows for clear command and communication lines, something often emphasized in GCP certification study material.

Post-incident, a thorough postmortem is non-negotiable. Unlike minor incidents, the postmortem for a major incident involves cross-functional teams and often requires intense scrutiny. Google Cloud Platform provides tools that allow for in-depth analysis and data mining, helping to unearth even the most obscured root causes. Each of these steps may be intricately described in your postmortem report, which should be reviewed and acted upon by all stakeholders.

Moreover, major incidents usually prompt a review of the architecture and the incident response plan. This often leads to significant changes aimed at ensuring the incident doesn’t recur. Such reviews and changes can be complex and time-consuming but are vital for the long-term health of your systems.

Additionally, the learnings from major incidents often lead to updates in policies, procedures, and perhaps even company culture. It’s essential to disseminate the learnings across the organization and, if appropriate, to external stakeholders. This is where Google Cloud’s vast array of documentation and information-sharing tools can come in handy.

Understanding how to deal with both minor and major incidents not only strengthens your real-world applications but also prepares you for the sort of complex, scenario-based questions you may encounter in the GCP Professional Cloud Architect exam.

Project Post-Mortems

Analyzing and learning from project work and retrospectives are essential skills for a GCP Professional Cloud Architect. Project work often involves deploying and managing applications and services on Google Cloud Platform, and each project provides a unique learning experience. Utilizing built-in GCP features like Cloud Monitoring, Cloud Logging, and Data Studio can help you measure the success of deployments, infrastructure scaling, and other critical metrics. These tools not only provide real-time data but also offer historical views that can help identify trends, bottlenecks, or areas for improvement. Learning to interpret this data is crucial for both improving ongoing projects and for the analytical questions that might appear on the GCP certification exam.

Retrospectives, commonly employed in Agile frameworks, offer another rich avenue for learning. These scheduled reviews allow teams to discuss what went well, where they faced challenges, and how they can improve in the future. In the context of Google Cloud Platform projects, retrospectives can focus on optimizing resource utilization, improving security protocols through services like Identity and Access Management (IAM), or enhancing automation and CI/CD pipelines with tools like Cloud Build. Retrospectives should result in actionable items, with corresponding changes tracked over time for efficacy. This iterative process of feedback and improvement is fundamental in any cloud architect’s skill set and is highly likely to be a topic of interest in the GCP Professional Cloud Architect exam.

The practice of consistently analyzing project work and conducting retrospectives provides multiple benefits. First, it cultivates a culture of continuous improvement, essential for maintaining efficient, secure, and reliable cloud architecture. Second, the insights and lessons learned directly feed into better design and decision-making for future projects. Third, it prepares you for the GCP Professional Cloud Architect exam by ingraining best practices and a systematic approach to problem-solving. As the certification exam includes scenario-based questions that assess your ability to analyze and define technical processes, being adept at learning from project work and retrospectives is invaluable.

Enterprise IT Processes

Enterprise IT Processes form a cornerstone in the preparation for the GCP Professional Cloud Architect exam, particularly when it comes to Analyzing and Defining Technical Processes. Understanding the ITIL (Information Technology Infrastructure Library) model is vital, as it provides a standardized approach to IT service management. ITIL organizes its framework around four dimensions: Organizations and People, Information and Technology, Partners and Suppliers, and Value Streams and Processes. These dimensions help create a balanced focus across the enterprise, ensuring that technology services align with business goals.

ITIL management practices are categorized into three groups: General Management Practices, Service Management Practices, and Technical Management Practices. These categories collectively aim to provide a comprehensive guide to planning, implementing, and optimizing IT services, making ITIL a valuable framework for cloud architects to understand. This knowledge can be especially beneficial when answering scenario-based questions on the GCP Professional Cloud Architect exam that require a deep understanding of how to analyze and define complex technical processes within an organization.

Business Continuity

Business continuity and disaster recovery are not merely technical or operational concerns; they profoundly impact an organization’s most important asset—its people. Imagine a scenario where a critical internal service, such as an HR portal or a data analytics dashboard, experiences a catastrophic failure. It’s not just about data loss or a dip in sales metrics; it’s about the immediate disruption it causes in the day-to-day lives of employees who rely on these services to do their jobs efficiently. For a sales team, a CRM outage means an inability to track customer interactions or follow leads, directly impacting revenue. For HR, a system failure could affect everything from payroll processing to employee onboarding, leading to delays, confusion, and frustration. The ripple effects of such a breakdown can severely compromise employee morale and productivity, which, in turn, affect customer satisfaction and the bottom line.

To mitigate these risks, the first step in business continuity planning is conducting a Business Impact Analysis (BIA). This involves identifying the most crucial business functions and the resources needed to support them. A thorough BIA will evaluate the financial and operational impact of system unavailability, helping to prioritize recovery strategies. Employee dependencies on specific services should also be assessed, as their productivity is directly tied to the availability of these services.

The next critical component is formulating a disaster recovery plan, which should outline the steps needed to restore essential functions. This plan should detail the resources, personnel, and technologies required to recover from various types of disasters such as cyber-attacks, natural calamities, or infrastructure failures. Staff should be trained and well-versed in implementing the plan, and regular drills should be conducted to test its effectiveness.

Disaster Plan: A guide outlining the specific actions to be taken in the event of various types of disruptions.
Impact Analysis: An assessment identifying critical business functions and quantifying the impact of their unavailability.
Recovery Plans: Detailed strategies for restoring essential business functions.
Recovery Time Objectives: Timeframes within which systems, applications, or functions must be recovered after an outage.

Another crucial aspect of business continuity is setting Recovery Time Objectives (RTOs), which specify the maximum allowable downtime for various business processes. Achieving the defined RTOs requires implementing appropriate technology solutions, from redundant systems to automatic failover capabilities. These technologies must be tested rigorously to ensure they meet the needs outlined in the business impact analysis and disaster recovery plans.

In summary, business continuity planning is a multifaceted exercise that goes beyond mere technology fail-safes. It encompasses a deep understanding of business processes, a thorough analysis of various impact scenarios, comprehensive recovery strategies, and clear time objectives for restoring functionality. And at the heart of it all are the employees, whose productivity and well-being are directly influenced by the resilience and reliability of the systems they use daily. Therefore, every effort must be made to ensure that the business continuity and disaster recovery plans are robust, comprehensive, and regularly updated to adapt to evolving challenges.

Disaster Recovery

Disaster recovery (DR) planning is an integral component of a GCP Professional Cloud Architect’s role, especially when it comes to safeguarding an organization’s data and applications hosted on Google Cloud Platform. The GCP certification exam tests candidates on their capability to architect robust disaster recovery solutions, making it a critical subject of focus. Architecting a DR strategy on GCP involves choosing the right combination of services such as Cloud Storage, Persistent Disk snapshots, and other backup solutions, as well as planning for multi-regional deployments to ensure data availability even when an entire region faces issues. Mastery of these services and their proper implementation is vital for both exam preparation and real-world responsibilities.

One of the key aspects of DR planning on GCP involves designing for redundancy and high availability. GCP’s various data storage options, like Cloud SQL, Bigtable, and Datastore, offer built-in replication and failover capabilities. Understanding the nuances of these features, such as replication types and eventual or strong consistency models, will not only aid in successful disaster recovery but also in answering nuanced questions that may appear in the certification exam. Knowing when to use a multi-regional storage class versus a regional or nearline storage class can significantly impact an organization’s ability to recover quickly from a failure.

Creating and executing DR plans in GCP also involves automating backup processes and orchestrating recovery workflows. For this, Google Cloud offers specialized services like Cloud Scheduler for cron job automation and Cloud Composer for workflow orchestration. A GCP Cloud Architect needs to design these automated processes in a manner that minimizes the Recovery Time Objective (RTO) and Recovery Point Objective (RPO). Knowing how to configure, trigger, and monitor these services is often scrutinized in the GCP Cloud Architect exam, as it directly relates to one’s capability to create an effective DR plan.

Furthermore, the role of a GCP Cloud Architect extends to performing regular tests of the DR plans, including failover and failback exercises. This ensures that all team members understand their roles in the event of a disaster and that the plan itself remains effective as system configurations evolve. Google Cloud Platform provides robust logging and monitoring solutions, such as Cloud Monitoring and Cloud Logging, which enable architects to keep an eye on system health and performance metrics continuously. Familiarity with these tools is essential, as they help validate the DR strategy’s effectiveness and can offer insights for ongoing optimization.

Security also plays a pivotal role in disaster recovery planning. GCP’s robust Identity and Access Management (IAM) allows architects to define roles and permissions explicitly, thereby ensuring only authorized personnel can execute different parts of the DR plan. This layer of security is crucial in the larger schema of DR planning, ensuring that the recovery process itself doesn’t become a vector for security vulnerabilities. The understanding of IAM in a disaster recovery context is another area that the GCP Professional Cloud Architect exam could potentially explore.

In summary, a GCP Professional Cloud Architect has an expansive role in disaster recovery planning, from architecture and redundancy to automation, security, and ongoing testing. Expertise in these areas is not just crucial for executing this role effectively but also for succeeding in the GCP Cloud Architect certification exam. Therefore, it’s imperative to grasp the breadth of services and features offered by Google Cloud Platform that facilitate robust disaster recovery plans. Each component, from storage and data replication to automation and security, is a critical puzzle piece in architecting resilient systems capable of withstanding and recovering from unexpected adverse events.

Summary

Software solutions require careful analysis, planning, development, testing, and ongoing maintenance. The software development lifecycle provides a structured approach to manage this process. It starts with gathering requirements by evaluating the problems to be solved, assessing potential solutions, analyzing needs, clarifying constraints, and defining the overall scope. The next phase focuses on solution design, including mapping system architecture, data models, infrastructure, integrations, interfaces, security controls, and disaster recovery. Detailed technical specifications are created to provide blueprints for development teams.

Development teams then build out the designed components using coding languages, frameworks, and cloud services. The resulting executable artifacts are configured for dev, test, staging and production environments. Testing validates code modules before release through practices like unit testing, integration testing, and end-to-end testing. Monitoring, logging and canary releases further harden releases before full production deployment. Automation tools assist with deployment, enabling frequent updates with minimal downtime and quick rollback when issues arise. Alongside implementation, documentation like runbooks and architecture diagrams are created.

Once in production, maintenance activities sustain operations. Bug fixes resolve issues without introducing regressions. Enhancements incrementally improve capabilities over time. Technical debt is paid down through refactoring and modernization. Components are upgraded before reaching end-of-life. Legacy solutions are retired after traffic redirection and data migration. Ongoing maintenance aligns solutions with evolving business needs through continuous incremental improvement.

Continuous integration and deployment (CI/CD) automates these processes through pipelines integrating version control, build automation, testing, and release orchestration. CI/CD accelerates speed to market, improves software quality through robust testing, and increases developer productivity by eliminating manual tasks. Core CI/CD components include source control repositories, build tools, test runners, container registries, orchestrators, infrastructure provisioning, observability dashboards, and deployment automation.

Troubleshooting involves not just fixing immediate issues but anticipating and preventing future problems through monitoring, logging, and post-incident analysis. Post-mortems foster improvement by systematically reviewing major incidents to understand root causes and prevent recurrence. Retrospectives similarly help teams learn from project experiences to optimize future work. These practices contribute to a culture of continuous improvement rooted in data-driven insights.

Exam Essentials

Software development lifecycles provide structured processes for delivering solutions. Know key phases like requirements analysis, solution design, development, testing, deployment, documentation, and maintenance. Analysis should align solutions with business needs through problem scoping, solution option evaluation, and cost-benefit analysis. High-level designs define major components and interactions. Detailed designs specify data structures, algorithms, interfaces, infrastructure, security controls, and integrations. Development teams build designed components using coding, frameworks, and cloud services. Testing validates code before release through practices like unit, integration, end-to-end, load, and canary testing. Deployment automation enables rapid, reliable delivery with minimal downtime. Maintenance sustains operations through bug fixes, enhancements, debt reduction, upgrades and retirement of legacy systems.

Continuous integration and deployment (CI/CD) automates testing and releases through pipelines integrating version control, build tools, test runners, registries, orchestrators, provisioning tools and monitoring. Know how source control enables GitOps workflows in GCP through integration with Cloud Build and Cloud Source Repositories. Secrets management securely injects credentials into pipelines using tools like Secret Manager and Vault. Deployment best practices include blue/green, canary releases, and feature flags. Monitoring and logging facilitate troubleshooting and post-mortems.

Troubleshooting involves not just fixing immediate issues but anticipating and preventing future problems through monitoring, logging, and post-incident analysis. Post-mortems foster improvement by systematically reviewing major incidents to understand root causes and prevent recurrence. Retrospectives help teams learn from project experiences to optimize future work. These practices contribute to a culture of continuous improvement rooted in data-driven insights.

For business continuity planning, know the purpose of business impact analysis, disaster recovery plans, and recovery time objectives. Recovery strategies should focus on restoring prioritized business functions within target timeframes. Solutions encompass redundancy, backups, multi-region deployments, and failover automation. Regular testing validates effectiveness.

Disaster recovery on GCP leverages built-in data replication, automated backup processes, workflow orchestration, and multi-regional data availability. Recovery time and recovery point objectives guide design. Failover and failback testing ensures plan readiness. Identity and access management secures access. Monitoring tools validate design and uncover optimization opportunities.

Know ITIL service management framework, including the four dimensions: Organizations/People, Information/Technology, Partners/Suppliers, Value Streams/Processes. ITIL practices fall into three groups: General Management, Service Management, and Technical Management. ITIL provides standards for planning, delivering, and improving IT services across the enterprise.

In summary, focus on understanding end-to-end software delivery processes, CI/CD pipelines, troubleshooting methodologies, business continuity planning, disaster recovery design, and ITIL for service management. Know how to leverage GCP tools and best practices across these areas. Mastering technical processes demonstrates ability to analyze and define solutions aligned with business goals.

Official Resources

Analyzing Technical Processes for GCP

Nov 17, 2022

Christopher Shaun Godwin

Author

Architects are involved in many different types of technical processes:

Continuos Deployment
Continuous Delivery
Post-mortem Analysis
Development Lifecycle Planning
Testing
Validation
Business continuity
Disaster Recovery(DR)

Here we will discuss these processes in relation to business needs and goals. We will learn to focus on and define these processes rather than simply follow them.

Software Development Lifecycle(SDLC)

The Software development lifecycle are the steps that software, and those who engineer it, goes through from beginning to end to create and host a service. This includes 12 phases. In some cases these are collapsed or combined to fewer, or some are regarded as pre-SDLC steps.

Proposal
Scope Analysis
Planning
Requirements Analysis
Design
Development
Integration & Testing
Implementing
Documentation
Operations
Maintenance
Disposition


Software Development Lifecycle - Wikipedia

Every phase does work that is required to produce quality software. It is a cycle because you reiterate over these steps until the software is no longer used. The process could start over after the Maintenance step and begin at any one of these beginning steps. After a software is deployed, the next iteration of that software could be as complex as having another Proposal created by the captains of those duties. Or 2nd time the process iterates, it loops back to the Development phase depending if the next iteration requirements are already known. Proposal, scope analysis, planning and requirements analysis can even be done by non-developers or teams of analysts.

For this reason we’re going to jump right into Planning.

Planning

Planning is a step performed by the Project Manager, they’ll create all the spaces which track work, all the spaces where the documentation, solution architect design document, specifications and roadmaps will live. They’ll create the roadmap for the different project phases. They’ll create the templates for spring planning, sprint retros, creation of overarching tasks often called ‘epics’.

Requirements Analysis

This may be done by developers and architects together. The goal is to fully understand the needs and wants of the proposal and find potential ways to meet them. The problem is discussed and ideas are put together to meet those problems. Here the solutions are not designed but considered. Any spikes that are needed to sus out requirements are performed by developers or other engineers. A spike is a short development period where a developer tries a feature to come to some knowledge required for planning a full fledged effort to achieve those requirements in the context of existing systems. Spikes are often isolated to proof of concepts. Proof of concept projects might exist here and iterate back to requirements for an actual project.

In this phase you’re trying to:

Grasp the scope of the needs and wants of the proposal
Track and assess of all possible solutions
Evaluate the cost benefits of the different paths toward a solution

Understanding the scope is a matter of both knowledge of the domain in question: if a mail problem, familiarity with mail operations and development; it is also a matter of systems and software knowledge of the existing infrastructure. Domain knowledge, for example, is knowing that kubernetes secrets are not very secure. Systems and software knowledge, is knowing where you’ll inject and use the google libraries to fetch secrets from GSM. This is precisely why developers, architects, and reliability engineers all engage together in this phase.

When finding solutions for your problem, you need to be able to filter them out without trying them. The solutions you’re filtering in your search are those that aren’t feasible, do not fit your use case, or don’t fit within your limitations. Once you know the limits of the project, you can search for possible solutions. If your Google Secret Manager project has a limit placed on it that it must work for in-house apps and third party apps, the direction you’ll go into will be wildly different than if you weren’t filtering based on this rubric. You’ll also consider if commercial software meets your needs at a better cost than you can.

Purchased or Free and Open Source Software(FOSS) can meets a wide range of use cases faster than developing something new. They also have the benefit of the ability to focus on other easier to solve problems. Purchased software or purchased FOSS support can help offset the costs of provisioning new services. This disadvantages are potential licensing models and costs and being locked into a feature set that doesn’t evolve with your needs.

You can decide to build from scratch, from a framework, or from an opensource project. There are different considerations with each of these. How much modification does ready made software require, what languages and formats does it exist in, do you have to acquire talent to work with it. Consider the lifecycles of the software you use. For instance, if you build docker images from other images, knowing the release cycles of those will help you be able to create new releases at the time new operating systems are released. Paying attention to the popularity and maintainers of the application can help you know if a project has become deprecated. You can avoid deprecated software if you do not want to deal with becoming the new maintainer or updater to the software within your use of it. Or you could choose actively maintained software to fork and modify so that you can roll in security backports from the upstream project into your project.

Scratch allows for full control but involves the most work, most maintenance, most planning, having the team with the talent and skillsets needed, most resolution of issues.

Once you have several viable solutions to consider, spike the one first with the greatest cost benefit. You’ll know this because you can do a Cost Benefit Analysis on all these options we’ve discussed.

Cost Benefit Analysis

Part of Analysis is the cost benefit analysis of meeting the requirements with your various solution options. When asked to justify the decisions in your project you’ll be asked for this and be able to contrast the different values of each solution. As part of this you’ll calculate the ROI for the different options to arrive at the solutions value. At the end of this phase you’ll decide which solutions you’ll pursue in the Design.

Design

As part of the design phase, you’ll plan out how the software will work, the structure of the schemas and endpoints, and the functionality that these will achieve. This phase starts with a high level design and ends in a detailed one.

The High Level design is an inventory of all the top levels of parts of the application. Here you’ll identify how components will interact as well their overarching functions. You might work up UML or mermaid diagrams describing parts and interactions.

The Detailed design is a plan of implementation of each of these parts. These parts will be modularized in though and broken down into the most sensible and efficient anatomies in which for them to exist. Some of the things planned include, error codes or pages, data structures, algorithms, security controls, logging, exit codes and wire-frames for user interfaces.

During the design phase, its best to work directly with the users of the system as you would work with other disciplines during other phases. The Users of a system will have a closer relationship to the requirements. In this phase developers will choose which frameworks, libraries, and dependencies.

Develop, Test, Implement

Under development, software is created by engineers and built as artifacts which are pushed to a repository. These artifacts are deployed into an operating system either with a package manager, ssh, direct copying, a build process or via Dockerfile commands. Artifacts can have within them code, binaries, documentation, configuration, or raw mime/type files.

In this phase developers might use tools like ‘VSCode’, analysis applications, administration tools; while changes are committed with source control tools that have gitOps attached to them. All these processes are in the domain of an Architect to conceive and track when designing a project.

Developers will also test as part of the commands they give the continuous integration(CI) system. Well before the CI steps are created, the developer has created unit and integration tests and knows the commands to run them so that the automation team can include them in the creation of the CI portion of the development operations. There are language specific unit tests but generally the integration tests the API endpoints and you have a choice of software for that.

Documentation

Documentation is crucial to the SDLC because it lets others using the software know how to operate the software. This is often your DevOps team handling automation in deployments. Developer documentation can be in the form of inline comments within the code, but also developers should release a manual as a README.md file in the source control repository root. A README.md file should exist in every folder where a different component has different usage instructions.

You entire solution architecture design should be documented. For a lot of companies this is a page in a intranet wiki like Confluence.

Maintenance

This is the practice of keeping the software running and updated. In Agile software practices, developers maintain code and run deployment pipelines to development environments which graduate to higher environments. In a fully agile environment, automation engineers create the pipelines but an automation release team approves the barriers so that developer initiated deployments can be released to production under supervision during a release window.

Keeping a service running includes logging, monitoring, alerting and mitigation. Some of this work includes log rotation and performance scaling. Developers control log messages but infrastructure developers like cloud engineering teams might create the terraform modules that automation engineers use to automatically create alerts and logging policies.

Continuous Integration / Continuous Delivery(CICD)

Continuous integration is the practice of building code every time there is a change to a code base. This usually starts with a commit to a version control system. If the branch or tag of the commit is part of the rules for the continuous part, then the integration part will take place automatically. Integration pipelines often have built, test, and push steps.

Continuous deployment is often the practice of deploying new artifacts as soon as they are available. If a repository’s continuous integration settings builds a package and places it in the repo, continuous deployment systems polling for new artifacts may trigger a deployment pipeline when it finds one. So once a new version is added to nexus or a deb-repository, CD systems often send that artifact down the line.

Business Use

The cornerstone of CI/CD is that individual features can be added quickly, unlike the past’s methods which had to weave several new features together into a major release. Instead, new features are built on different feature branches, those feature branches have builds, those builds can be deployed quickly and then once tested the feature branch can be merged into one of the trunks. The version control system acts as an integration engine which takes all these features and incorporates them together, if you’re using trunk based development. In the context of hosted services, users get a risk free but up-to-date experience.

CI/CD is testing heavy. In real life production pipelines, tests are over 50% or more of the pipeline steps and is used throughout the workflows as steps. Automated tests allow the test cases to pass or fail without human intervention. This means that services can be tested with scripted steps, and then deployed if those steps succeed. This prevents deployments and the building of artifacts that do not pass tests.

In certain critical cases, continuous deliver isn’t possible as the safety risk is too high to deploy the latest code. Sometimes code needs to be hand certified and hand installed.

VCS

The foundation of Continuous Deployment / Continuous Integration is Version control of software source code. When developers checkout code to work on it and improve it, they get it from a git repository. They make their changes, and push them to the git repository. Git makes a revision and keeps both copies. Points in time in the revisions are called references. Branches and tags are references. You can merge two disparate code bodies by merging two references. A request to merge two references is caled a pull request. So to merge one branch like feature/my-latest-change into develop you’d create a pull request from my ‘feature branch’ into the ‘trunk’ which in this case is develop.

This is how basic version control works with source code. When you commit, often the repository server will notify listening services that code is updated. Those services will look at the repo and if they find a build instruction file they will do the steps listed in the file. This way when we want to build our software, we put all the means to do it in that file. When new commits are made to the repo, listeners will build the application based on our instructions.

If there are no code updates, listeners, or build instructions, there is no continuous part and no integration is happening. In the ancient software world, a developer would commit code and send an integration engineer release notes in an email and the integration engineer would run and babysit a build script while the developer went and got coffee. Now the developer makes a commit and then watches a job console with output logs from the build without communicating with other engineers… they still get coffee while the build runs.

Secrets Management

Pipelines

Fixing Incident Culture

Enterprise IT Processes

Business Continuity & Disaster Recovery

Summary

blah

Exam Essentials

blah

Official Resources

Architecting for Reliability

Nov 17, 2022

Christopher Shaun Godwin

Author

A reliable system is one people can get to now. Reliability is the probability a system can be reached and used without failure and Availability is a measure of how available a system is to be used within a given period of time.

Cloud Operations Suite

In an environment of constant change, hyper scaling, frequent deployments, and business demand, you cannot maintain reliable systems without metics and insights.

There are some problems you’ll come up against such as needing additional compute power, having to handle seasonal ups and downs, errors or crashing under load, storage filling up, a memory ceiling causes cache to cycle too often and therefore cause latency. The ways things can go wrong a several and in a distributed hyper-scaled environment you’ll run into 1 in a million problems as well. That is why we need detailed information about the operation of the resources in our project.

Cloud Operations Suite which used to be known as Stackdriver has several operations products:

Cloud Logging
- Log Router
Cloud Monitoring
- Alerts
- Managed Prometheus
Service monitoring
Latency management
Performance and cost management
Security management

Cloud Logging has the Log Router which is a built in part of Cloud Logging. The Cloud Logging API receives each log message and then send it to the Log Router which stores log based metrics, an then sends those messages to log sinks which store those entires in logs in a Cloud Storage bucket. Cloud Monitoring receives these log metrics and user defined sinks can send entries to BigQuery for longer storage retention. The default retention for Cloud Logging is 30 das.

Cloud Monitoring is Google’s managed product in which you can setup alerting policies to alert you or your team when things go wrong. Things go wrong in the form of failed health or status checks, metrics over defined thresholds, and failed uptime checks. Policies can be defined so that uptime checks can meet certain requirements. Cloud Monitoring has several integrations for notifications which include Slack and custom webhooks. Alerting Policies are Google’s way of user defined criteria for notifications about problems.

These are the three major services which when combined increase observability into your operations in GCP.

Cloud Monitoring

Monitoring is collecting measurements about what hardware, infrastructure and performance. For example, CPU minimum and maximum, CPU averages, disk usage as well as capacity, network throughput and latency, application response times, memory utilization, 1/5/15 minute load averages. These metrics are generally time series. Metrics usually have a timestamp, a name and a value. Sometimes they can have other attributes like labels as is the case in GCP. GCP auto defines metrics but you can define your own metrics using BigQuery queries while having the Log Router send the custom logs to BigQuery. The timestamp is usually epoch time while the value is some value like percent of disk capacity used, web1_disk_usage might be the name of the metric.

Cloud Monitoring has an API which you can query for time series data based on name or resource, offers grouping resource groups based on attribute, list members of resource groups, list metrics and descriptors, listing descriptors of monitored resources and objects.

Dashboards

Some out of the box dashboards are created when you create certain resources such as a Cloud Run instance or a firewall rule, or a Cloud SQL instance. Otherwise you can create the dashboards that you need for your project’s golden signals and other operational metrics that are important to your workload. Users can fully customize them to their needs and to their specific data. Like development, creating dashboards in GCP can often be a cyclical process because you have to create displays which help you quickly diagnose problems at scale. You may start out with planned Key Performance Indicators(KPIs) but then you might drop some and tune into others.

Cloud Monitoring Alerts

When you monitor for problems and use your metrics data in dashboards, you may move on to automatic alerts so you don’t have to monitor the dashboards. This allows you to notify the correct parties when incidents occur. Normally your cloud infrastructure is structured in a way that auto-healing remediates problems, but in the case where auto healing can’t fix an issue. Crashed pods for instance, are restarted when their liveness probes meet the failure criteria and the RestartPolicy allows for it.

Alerts trigger when time series data goes above or below a certain threshold and can be integrated with third party notifications systems such as MS Teams and Slack. Policies specify these conditions, who to notify, and also specify a way to specify the data you’re selecting the resources to alert on. Conditions are used to determine unhealthy states so they can be fixed. It is up to the architect of the policy to figure out how to define what is unhealthy. It could be a port not responding, an http code, or how long ago a file was written, as long as it can be exposed as a metric.

It is easy to create false or flapping alerts and therefore you’ll have to adjust the timing and thresholds for your conditions. You can also increase reliability by setting automatic remediation responses. When a CPU utilization alert is set, for instance you can add new VMs to a group, you can run a job that runs kubectl patch a kubernetes deployment’s Horizontal Pod Autoscaler(HPA) to raise the replicaCount ceiling and then lower it after load is decreased.

All of Google’s managed products like BigTable and Cloud Spanner do not need to be monitored because Google manages the incident response. Switching to these services can help you reduce the amount of monitoring, notifying and alerting you have overall. Of course the recommended approach on migrating to managed services with regard to alerting is to monitor throughput and latency on managed services though resource monitoring like cpu and memory are not needed on them. This is especially true if you are connecting to in-cloud managed DBs from on-prem workloads through VPN or interconnects. Hybrid and Multi-cloud latencies are metric points of monitoring that should be shown on a dashboard and included in notifications.

Cloud Logging

Cloud Logging is a log collection service that either has log collection agents or collects logs from managed services like GKE naturally. Log entries are not time series and occur when system events happen. The /var/log/syslog or /var/log/messages in a Linux VM collects messages about several services together, but there are other logs like /var/log/auth.log or /var/log/lastlog. These logs record data about who is currently logged in and the most recent shell sessions respectively. So these logs are only filled when users trigger login events, either on the consol or remotely. Processes may run garbage collection or some kind of file de-fragmentation and print log messages.

Cloud Logging can store logs from any GCP resource or on-premises resource. In Cloud Logging logs can be searched and queried, exported to BigQuery. When you use Cloud Log Analytics log data is automatically exported to BigQuery. You may also choose to send logs to Pub/Sub and have them consumable by third party log software such as Splunk.

FOSS Monitoring

The popular free and open source(FOSS) tools such as Prometheus and Grafana can be used with Cloud Monitoring. Prometheus is controlled by the CNFN who controlls Kubernetes. Prometheus scrapes HTTP(S) endpoints and collects data and stores it in a multi-dimensional way based on attributes. This is great for Querying with PromQL the query language used with the project.

Google Managed Prometheus provides a monitoring agent which uses Google’s in-memory time-series database called Monarch. Grafana used in conjunction with Prometheus can display metrics in graphs from several data sources. Grafana has the ability to directly query data sources and monitoring services.

Continuous Delivery

Managing releases are an important part of the software lifecycle development process. Some releases are more involved and more complex than others. Releases are often interdependent and therefore need high levels of planning and coordination on behalf of development teams and release engineering teams. The better release management and deployment strategies you have the more reliable your services will become. In an agile and continuously deployed environment, there are pipelines that deploy new artifacts to dev, test, staging, and production often called intg, qa, uat, and prod. intg and qa are what are considered the lower environments and those experience lower load but a higher and more frequent rate of iterations, so intg and qa get the most deployments. These frequent deployments to development and testing environments allows developers to go back to the planning stage before a change gets to production in the case it doesn’t pass 100% of the tests or function 100% of the time.

So problems are worked out early on and once they get to Unattended Automated Testing(UAT), the programmatic list of tests in a production similar environment under production similar load validates the release to go to production. Some pipelines have fully automated and unimpeded ascents to higher environments, however the more critical workloads in Fortune 500 enterprises all have barriers to production on services that will have customer impact in the case of a release failure.

Even with that, errors get into production and need to be fixed quickly, which is where this release management using DevOps principles of Continuous Deployment allows for a pull request to be merged and tagged, automatically built and polled by the CD triggers, and in minutes be sent out to all the environments ready for the approval barrier to succeed so that the hot fix makes it to production quickly.

This is the best way to rapid produce fixes while reducing risk in releases. In this model, all the access to make release imacting changes are given to the developer who runs these pipelines when needed or when triggered automatically, while some release and integration engineers approve and perform the production runs and service swaps.

Testing in continuous deployment pipelines involves acceptance and regression tests, while unit and integration tests are usually part of Continuous Integration. The exception is that a lot of deployment code might have validations and unit testing as part of their runs. This is the case in terraform Infrastructure as Code(IaC) and with configuration management pipelines like salt and puppet. Tests usually define expected states and the resource being tested say an endpoint such as /health which prints the artifact version. The endpoint is what is checked and the state is the key and value expected. The test for that state passes when the endpoint is fetched and the real value and key are compared to the expected state. If the value was lower than expected the service has regressed and the regression testing will fail. In the case of a unit test, a yaml file might contain input, and the unit test in puppet processes the function and contrasts the output to the expected output that the developer defined in the test. Several of these definitions when related constitute a test that would run before the deployment code executives the active part of the workflow.

Integration tests can exist at all different layers that code exists: in a repository, running as a service, testing dependent APIs. Integration tests can tests for things such as a name longer than the amount of allow characters that the backend will receive per the database schema. Integration tests are different than Unit tests as they span all the units of code together in a running artifact.

Acceptance testing are generally testing if the release being deployed meets the business needs the software was designed to meet such as a customer being able to open a new account, change their account data, review it and delete their account. This is an example of an acceptance test for a root business goal of onboarding new customers.

Some times an automation department will order a whole environment tier just for performance testing and load testing. With this you can understand how your application wil fail or perform under load. You can use load testing to simulate so many transactions per minute. While load testing you can do chaos engineering and make things go wrong to see how customers will be impacted. This teases out bugs, latency tuning problems, memory tuning problems, database connection ceilings and subsequent timeouts.

Deployment Methodologies

Service swaps are done typically in a blue green or canary deployment style. There are a few different popular deployment strategies:

Big Bang
Rolling
Canary
Blue Green
Swap Only

	Big Bang	Rolling	Canary	Blue Green
Expense	$	$	$	$$
Risk	very high	high	low	very low
Complexity	low	low	mid	high

Big Bang deployments, often called “complete” deployments simply update all instances of the software wherever they occur according to the recommended approach in the release notes. On a linux server that uses rpm packages as the method of deployment delivery, a service is stopped, the RPMs are installed with yum, dnf, or rpm directly, database deltas are applied if they are included in the release, and the service is started again. This may happen in series or parallel on all the systems to which it will be applied. This process can be run by script, package configuration and package manager, or a configuration management tool like Ansible, Salt or Puppet. Before continuous deployment was popular, this was the most used deployment style and it was performed manually at first, then with automation.

This is the cheapest as it only ever involves one copy of your infrastructure to be alive at one time.

Rolling deployments are the second cheapest because in some contexts you only need two copies of your infrastructure running while the latest one boots and becomes available. Once its healthy the previous versions are terminated. This is the case with Cloud Run and Kubernetes Pods with regard to rolling deployments. Otherwise with VMs, rolling deployments upgrades one server, tests or problems and then after a time moves onto another until the deployment is rolled out.

This kind of deployment is database delta risky because changes to the database which are not additions might cause 9 out of 10 servers fail running the older version. In this example scenario, 90% of your customers are impacted until the rollout progresses to 2 servers, then 80% suffer until the 3rd server has its deployment updated. If you don’t have db deltas or only ever append to you schemas, the risk is considerably less and only impacting a subset of customers at a time becomes an advantage.

Canary deployments are a type which releases new artifacts to infrastructure that receives a test amount of live traffic. When no errors are detected in the deployment, the rest of the traffic is routed to the new infrastructure. In the case of VMs this can be in the form of creating a new Managed Instance Group(MIG) with a new image that has been built with the new code. It can use its existing disk image but run some configuration management code to perform the upgrade, or it can have a new version label applied to it to be selected for an ssh script which does the deployment. In the case of containers, this comes in the form of a new docker tag, a new docker deployment and some routing magic which is automatically built into services like GKE and Cloud Run. There are several ways to choose users whose traffic is routed to the canary deployment.

Blue Green strategies are those which use two environments, one active while the other is inactive. When deployment pipelines run, they keep track of which service, either blue or green is active. When the deployment workflow performs the release, it releases to the inactive set of infrastructure which is receiving not traffic. Verifications, regressions and production tests validate the inactive deployment and then the workflow switches all the traffic to the new deployment at once.

While the most expensive route because it requires constantly maintaining two copies of identical infrastructure, it mitigates the most risk. Firstly, failed deployments result in a failed iteration and no change in routing therefore customers continue using the older version of the service. Then, if a live active deployment fails, the traffic can be swapped to the inactive service to reinstate an older version of the software without any new releases. The iteration can fail and the developer can take the feedback and begin again fixing the issues and cutting a new release version for a new deployment. In a blue green strategy you have to decide if both versions will connect to the same database. If you ony append to your schema this is fine, otherwise database deltas which edit, rename, remove or change tabes and felids you may consider running a blue and green database, configuring each service with either of these and when swapping the traffic, you change an environment variable selecting the database and restart the service. In kubernetes this is as simple as running kubectl set env on the deployment. You can run this command in swap deployment workflows for pod, replicationcontroller, deployment, daemonset statefulset, cronjob, replicaset.

With blue green deployments you’ll have to also script workflows which swap any urls of the services from active to inactive so that all the active services point to active urls while all the inactive deployments point to inactive endpoints. You can accomplish this in manually in the application config prior to deployment, or you can script this as part of your deployment swap workflows. Inside a GKE namespace, the nginx service is actively routing services to the nginx-blue pod while the nginx-stage service routes to the nginx-green pod. The nginx pods all proxy content for application pods called app. So nginx_blue needs to point its configuration at app_blue, which then connect to database_blue. Both the nginx and the app pods will need their urls swapped via kubectl set env or kubectl patch.

Continuous Integration

This is the practice of building code with triggers that listen to each commit to a repository or set of repositories. The CI jobs are configured to run syntax validation, vulnerability scanning, unit tests, code quality test uploads, and pushing artifacts to artifact repositories. CI jobs might be single stage, multistage, create java artifacts, create deb and rpm packages and then repackage them in docker files. There are several CI suites which drive integration from Jenkins to Bamboo. Google Cloud Build is googles managed and serverless Continuous Integration product. With it you can host source code in Cloud Source Repositories, or sync them there. Cloud Build Triggers can then listen to the repository and trigger the jobs configured in the cloudbuild.yaml file stored in the triggering repo.

You can also manually build Continuous Integration pipelines. Configuring build pipelines is a much more consistent way to ensure artifact quality than manual integrations and manually running the build steps by hand.

Reliability Engineering

Reliability engineering is mostly about building resiliency in pipelines, in the software, in the performance of services under load. One such example is a vanilla linux postfix mail server which uses linux users and groups as the main source of mail accounts. If users send mail using SMTP auth and check it with Imap, the shadow and passwd files are being queried ever time mail sends and receives. Additionally, when users change their password at the same time, there’s always a chance of collision in that the files in question become corrupted because it suffers form simultaneous writes from two different processes. Not collecting password changes an queuing them one at a time with a success confirmation between each change, that corruption will never happen. The acts of Setting up a message queue to collect password jobs, writing an agent which reads the queue, and does the work while tracking what work as been done successfully or with errors, and what work is undone are all efforts of Reliability Engineering(RE).

RE takes place on any layer of the technology surrounding a service from the code that runs services to the code that deploys services. Ensuring quality on every distinct layer is an SREs job.

Load is something you cannot plan for exactly how much you’ll have. Errors happen at rates because there are certain chances of an error occurring. When you increase load, you not only increase frequency of known errors, you pull out of the chaotic universe higher magnitude errors and lower frequencies. These are you one in a million errors, that say Ticket Master might face every day since they’re doing 100k transactions per second in some cases.

Overload

You can guarantee that at some point you’ll experience increased load and need to scale. If you aren’t using a service like Cloud Run’s autoscale then you’ll have to manually configure and reconfigure each service to handle the load per that service’s resource usage. Even in that case if you’re running a Cloud SQL instance you’ll have to vertically or horizontally scale it at some point.

Its best to design for this possibility at the beginning. The more user-facing a service is, the more reliability engineering will surround that service. Internal services and things like batch services which can fail cyclically and then at some point reach eventual processing, we don’t necessarily have to worry about unless we have inter-team SLAs which we have to honor.

You can simply shed load, meaning you can respond to requests greater than a system can handle with error codes instead of the application. This isn’t a clean approach though it is an approach. Based on revenue, business needs, you can shed load from priority services last and tertiary services first.

You can handle overload by degrading the accuracy or quality of the services. Switch ‘contains’ filters to ‘beginswith’ filters to reduce load on the database. Reduce latency everywhere you can but Instead of delivering full images deliver thumbnails to reduce load and restore higher resolution delivery later.

Upstream throttling is another way to deal with overload, you limit the calls or requests that you make on crippled systems. You can cache requests and process them later, enter requests into a message queue and process them later. You can switch from instant operations to queued operations modes, reducing load you can later offload to batch processing like profile edits or other non critical parts of your application. Spotify used a combination of CDNs and peer to peer client network to handle overload. The first 10 seconds of a song are loaded from a server and the rest of the file is loaded from other spotify users who have recently listened to a track.

If you build in a trip switch into your app, and then use monitoring to flip the trip switch, you application can decide to cache requests and process them when batch processes are performed like a wordpress cron job, for instance. You can flip the trip switch back when load returns to normal and the logic in your app will return its mode of behavior to the default behavior. When applications have built-in internal responses to overload, they become more reliable and they can log these occurrences for increase observability.

Cascading Failures

Cascading failures are those whose effect becomes the cause of another failure. If a database has a disk error, application instances fail, and then proxy instances fail. This is the simplest and easiest form but consider when the application is generally mostly functioning but particular operations are inefficiently written in code and so they create unnecessary cycles. Certain days when certain jobs run and there are intermittent failures, and everything retries three times before completing. This is like cars backing up on the highway because they have to try three times to change into the lane that goes their way. The traffic gets backed up and it affects not only the cars here, but also the cars in queue to arrive here and this can compound and compound and remain a problem long after the initial prime cause is removed from the situation.

In a cascading failure, you may have a resource consumption problem that is the root cause and have issues determining on which system the root cause is happening. You can Upstream throttle in this case, and really apply any overload strategy in this case. If you have increased observability, say a dashboard for every impact causing signal, you can quickly see all the failed services in the cascade. You can have them organized and ordered by dependency so that you eye can go right to the problem. You can order your tests the same way in reverse so that things at the bottom of the stack like database and db disk size are the first tests so that you and run a test to identify the last responding service in a stack to quickly locate the root. So deal with these as you would overload, including using degraded levels of service. Windows introduced safe mode as a way to reliably boot your computer amid problems enabling users to make changes, and fix the issues, rebooting into disabled mode. Windows safe mode boots into a degraded level of service and sheds some load by not enabling it.

When mitigating overload with autoscaling, consider that you need to set the thresholds as low as they need to be so that the load does eat up the resource gap by the time the new resources become available. If you set your Horizontal Pod Austoscalers to add a new replica when one container reaches 90% of CPU resources, but it takes 156 seconds to start a new pod, but only 100 seconds to eat up the remaining 10% of the resources, there will be a period of 56 seconds of unavailability. You’ll need to set your thresholds lower or work on a speedier boot time on our containers.

Scaling down too quickly is also a concern and if your scaling down thresholds are too low, you might create a flapping situation where pods or instances are created and then destroyed repeatedly.

Reliability Testing

The reason why you want to test is certainly increased observability but also have you ever put a bed-sheet on alone? One change to one area tugs an unexpected change in another area. You have to iterate too many times to arrive at your goals. With testing, you peg values as non-moving targets and the more you add the more of the field you successes and errors you can orient your self against. It’s like pinning the sheet into place on one side while you tighten it on the other side getting out every wrinkle. This ensures that all processes stop right when you make a change that disrupts expected states and values.

Unit tests do this for software, are written and performed by developers and then incudes into the continuous integration process. Integration tests ensure that the units of a feature perform as a whole represented as a function. System tests are those which tests all components under a simple set of conditions that represent sanity checks. System tests that are called performance tests do this same process while placing several repeated requests simulating load. Regression tests are system tests which check to make sure past issues continue to be resolved in future releases. Reliability stress tests are those which do not limit the load but increase it continually until something breaks. The configuration and memory management of a java application, or instance is adjusted and the tests are rerun. You approach this until you exceed at minimum 20% growth over your highest load.

Stress tests are often used to simulate and understand cascading failures. This will inform your monitoring goals and strategy. Chaos engineering puts load on a system an then just randomly causes probably problems to see how the system will respond in order to tease out mitigation responses before they occur.

Incident Management and Post-mortem sessions

Incidents are major problems. Severe incidences are those that impact services which have Service Level Agreements. Severe incidents can be defined as those which impact multiple teams and multiple different type of customer experiences on the service-level. Incident management is the set of duties surrounding incidents and include remediation and fixing the incident, recording details about the state of the incident as it initially occurred and a history of all the decision surrounding the incident. Incident management duties often include making calls to involved parties in an escalation tree.

Notify a captain who coordinates the incident response.
Call a working session with available response teams from operations, automation, and development teams.
Analyze the problem, make corrections
Audit all actions taken into a log for the post-mortem analysis

Incident management focuses on correcting the service-level disruption as soon as possible. There should be less concern with why it failed but how it will be fixed. Incident management focuses on correcting the service-level disruption as soon as possible. There should be less concern with why it failed but how it will be fixed.

The post mortem should focus on a blameless cause of the incident. Blameless postmortems create less of an environment of fear which reduces cognition. Cognition is key to production solutions which fix future versions of this problem. In the spectrum of problems one can have there are patterns, unique to your app, that will form in incidents. If you catch and fix each one, 20% of all fixes will negate 80% of the errors. This zipfy statistic is what allows startups to launch on a startup amount of effort. As an application matures, engineers take on the remaining 80% of fixes which are one offs which apply to fringe cases that only affect 20% of the customers.

Incidents/Bugs	Fixes	Customer Affect
Wide field	20%	80%
Narrow field	80%	20%

This zipfy pareto principle is basically a law of nature and governs everything.

Summary

Reliability is a measure of how available the system is over a period of time. Creating reliable systems is a discipline involving application design and development, deployment methodologies, Incident management, Continuous Testing and more. Continuous Integration and Delivery managed code releases and bring sanity and mitigate risk in what was traditionally quickly changing process. Systems Reliability Engineering involves software development that includes operations goals, things like safe modes with degraded services or upstream throttling. Architects must understand that systems will fail, and that the best way to live with failures are to have defined service level objectives service level indicators, monitor services to detect incidences, and learn from failures by to risk assessment and mitigation techniques.

Exam Essentials

Understand monitoring, logging, and alerting in gcp and in relation to reliability
Be able to design for continuous deployments and integration
Be versed in kinds of tests use in reliability engineering
Understand that Reliability Engineering(RE) is a collaboration of operations and development goals combined on all levels of the system to reduce the risk of conflicting interests between development and operations.
Understand that RE includes planning for unplanned load, cascading failures, and responding to incidents
Understand that testing is a cornerstone of reliability engineering

Official Resources

Architecting GCP Solutions for Security and Legal Compliance

Nov 3, 2022

Christopher Shaun Godwin

Author

Identity and Access Management

Identity and Access Management or IAM is a service which lets you specify which users can perform which actions in the cloud. IAM includes the following objects:

Identities and Groups
Resources
Permissions
Roles
Policies

Identities users and service accounts, groups are collections of those. The Identity entity itself is the thing which is granted access. When you perform any actions in GCP, you must first authenticate against an identity, either on the Console or with the gcloud command. Identities are also called ‘members’. There are three kinds of core identities: Google account, Service Accounts, and Cloud Identity Domains.

Google accounts are members that represent users who access resources in GCP. Active directory users often are synced as Google accounts. Service accounts are accounts systems and programs use. Your terraform instances in an Enterprise environment might be created by a service account with the appropriate IAM roles or permissions to do so. Service accounts are denoted by a service account id in projects/{{project}}/serviceAccounts/{{email}} notation, or email notation sa-name@iam.gserviceaccounts.com. GKE service accounts are the same as compute service accounts. All compute operations run as the default compute service account.

Cloud Identity is an Identity as a service managed product which creates identities that are not tied to Google accounts. You can interface this with Active Directory OIDC and SAML.

Federating Google Cloud with Active Directory

Groups are collections of Identities belonging together. A group is the object that binds the members or the entity they’re associated with. The kind of members of a Google Group in IAM are service accounts and Google accounts. G Suite users and domains are group identities in GCP.

All of these: identities, groups and service accounts can be granted permissions or roles on Resources. A Resource is any GCP object.

Resources:

Compute Instances
Storage Buckets
GSM Secrets
Projects
etc…

Every resource has both granular permissions that correspond to any action that can be done on that resource and predefined roles which represent workloads a person may be assigned with regard to the resource(i.e. developer, viewer, administrator).

Permissions correspond to specific actions like getting, listing, or deleting a resource.

Cloud Run IAM permissions examples:

Permission	Description
run.services.get	View services, excluding IAM policies.
run.services.list	List services.
run.services.create	Create new services.
run.services.update	Update existing services.
run.services.delete	Delete services.
run.services.getIamPolicy	Get an IAM policy.
run.services.setIamPolicy	Set an IAM policy.

In Enterprise level companies, these fine grained permissions are more often used. Small companies may use the roles or even basic roles. If you’re going for a least privilege principal of access, then steering clear of the roles and only granting permissions will provide this. You’ll collect job roles from the team, and consider the privileges needed to do that work. Secrets accessor can be granted on the project level or the secret level. Enterprise companies will want to place it on the secret level. They’ll want to group the secrets to a service which accesses it and create a specific service account it will impersonate so that only that service can access it secrets and not the secrets of other services. The exam will not require you to know the permissions, however, knowing how granular they can be is what the exam creators will expect GCP Certified Architect’s to know.

Roles are groups of these permissions bound together in a role which you assign to an identity or group in order to grant access. Identities can have multiple roles. Roles apply across the project.

Cloud Run IAM predefined roles examples:

Role	Permission	Description
(roles/run.developer)	…	…
(roles/run.developer)	run.jobs.create	Create Cloud Run jobs
(roles/run.developer)	run.jobs.delete	Delete Cloud Run jobs
(roles/run.developer)	run.jobs.get	Get Cloud Run jobs
(roles/run.developer)	run.jobs.list	List cloud Run jobs
(roles/run.developer)	run.jobs.run	Run a job in Cloud Run
(roles/run.developer)	run.jobs.update	Update a Cloud Run job
(roles/run.developer)	…	so on and so forth

Applying these to an identity can be done at the Org, Folder or Project level and would apply to all sub resources in one of those three. Predefined roles are those like the above example. They pre-exist and are pre-defined collections of permissions. Other kinds of roles exist named basic roles which were the roles that existed before IAM. Basic roles apply to every resource and are Viewer, Editor, and Owner. The Viewer role gives read only access to resources, the editor grants change and delete access to resources which the Owner role inherits. Additionally the Owner role can assign roles and manage permissions to resources.

You can grant basic roles per resource so you can make on identity or group owner over certain Compute Managed Instance Groups while giving Owner to other MIGs. Owner role over resources allows users to set up a billing account for those resources. Its best to consider basic roles legacy and avoid them when possible.

Custom roles are those which are created by you where you group a permissions set into a custom role which you grant to identities or groups. This can help you adhere closer to the lease privilege access principle. Some developer roles allow you to set anything in a space where sometimes a thing should be restricted, like production. You would use a custom role to include all the things that a developer role has without the ability to write code and instead limit writes only to pull requests to the master branch.

Policies are Json definitions or directives called a binding which specifies which identities are bound to which roles and permissions. The IAM API allows you to set or get policies and test permissions. Policies can be set on the Organization, individual projects or folders of projects and are inherited infinitely deep.

IAM also has Conditions that are written in a logic language called CEL that is a versatile way to define access granting logic so that things like resource tagging may trigger granting access to certain groups over that resource based simply upon its attributes. Conditions can apply to the following services:

Cloud Storage
Compute Engine
Cloud KMS
GSM
Resource Manager
Cloud SQL
Bigtable
IAP

Google recommends these best practices for using IAM in a secure way.

Lease Privilege Principles

Do not ever use Basic roles in production
- Role recommendations will recommend a role to replace basic roles.
- Policy Simulator will allow you to ensure this replacement will not change access
Consider each layer of workload of your app is untrusted, give each one its own serviceaccount and grant only the permissions the app needs.
Consider that all child Resources inherit the permissions of their parent Resources. Don’t grant project level roles when Resource level roles will suffice.
Grant permissions or roles on the smallest scope needed.
Specify who can impersonate which service accounts
Limit who can create and access service accounts.
Take care who you grant Project IAM Admin and Folder IAM Admin
Conditional bindings can allow access to expire
Consider granting privileged access only on a just-in-time basis.

Service Accounts

Rotate your service account keys using the IAM service account API.
Label the service account with a deploy name that tells you about what it is for and what it has access to.
Don’t leave the service account keys in email, check them into code, or leave them in the Downloads directory.
Audit changes to your policies with Cloud Audit Logs
Export logs to Cloud Storage for preservation
Audit who has the ability to change your allow policies on your projects.
Limit access to logs per least privilege principles
Use the Cloud Audit Logs to audit who has service account key access

Policy management

If identities need to access all projects in an organization, grant access at the organization level.
Use groups instead of users when possible.

Bad Actors will look for Service Account Keys in these locations:

Source code repositories of open-source projects
Public Cloud Storage buckets
Public data dumps of breached services
Compromised Email inboxes
File shares
Backup storage
Temporary file system directories

Identity Aware Proxies(IAP)

IAP are Layer 7 proxies which are capable of allowing or denying HTTP(S) requests based on IAM policy and identity membership. If a user making the request doesn’t have an identity associated with it, the user will be redirected to a Google Oath page to sign into to a Google account or single signon account. Once an identity is associated with the request, and if the identity is allowed to access the resource, then the IAP forwards the connection to its destination.

Using IAP Proxies in front of app are ways you can limit access to parts or all of your application based on Google account. IAP for On-Premises Apps is Googles way of protecting Apps in Hybrid-Cloud Networking environments with IAM.

Workload Identity

Workload Identity is a way to grant IAM roles and permissions to external identities. If you want a Kubernetes service account to have certain permissions in GCP, the secretAccessor role for instance, workload identity federation is the IAM feature which will allow you to do that. Workload Identity providers do the magic of connecting the external entity to the workload defined. These providers either use SAML or OAuth 2.0 token exchange.

Providers supported:

AWS
Azure Active Directory
On-premises Active Directory
Okta
Kubernetes clusters

Organization Policy Constraints

Organizations can have limits placed on them for any number of attributes of the org’s resources. You can prevent certain actions from being taken by identities or service accounts. For instance, if you want all CloudFunctions to work through the VPC in a given project, you can create and then apply a constraint against constraints/cloudfunctions.requireVPCConnector. Depending on the constraint, it may apply to a set o Google services, or to specific services. You can find a full list here.

Data Security and Encryption

Encryption is the process of masquerading data that is in one form into another form using encoding algorithms which produce results that are impractical to convert back without having the cypher keys. Encryption at rest is usually denoting filesystem encryption. Encryption in transit usually refers to things like SSL over TCP or HTTPS encryption.

Within the ecosystem of Google Cloud, Encryption at rest occurs at the hardware level, at the data infrastructure level, and using file encryption. On the infrastructure level the data is grouped into chunks and each one is encrypted. Using AES 256 and 128 encryption, Google can either use encryption keys Google creates and manages or customer managed keys in Cloud KMS.

Cloud SQL encrypts all data together with one key in the same instance. Cloud Spanner, Cloud Bigtable, and Cloud Firestore using an infrastructure encryption mechanism. In storage systems, the data is grouped into chunks which can be several gigabytes in size, and each chunk is encrypted with a data encryption key(DEK) which Google encrypts with key encryption keys(KEKs). DEKs are stored near the chunks they encrypt and sent to a centralized store where they are encrypted by the KEKs which are also stored in a centralized location. If data is changed or added to a chunk, a new key is created and the chunk re-encrypted. Keys are never reused with regard to chunks. Access control lists refer to some of the chunks’ unique identifiers. All these chunks are stored on drives which have hardware encryption built into their chips.

Encryption in transit or encryption-in-motion protects against network interceptors and middle men. Data in transit in GCP on the Google network may not be encrypted but is authenticated at every transfer. Data in GCP that is outside the borders of the Google network is always encrypted. All incoming traffic to Google Cloud goes through the Google Frontend which runs on distributed global loadbalancers and protects against DDoS attacks. All communication to Google Cloud uses either TLS or QUIC. Within the Google network Application Layer Transport Security(ALTS) to authenticate and encrypt most intra-network connections.

Users do not have to create resources or set anything up to enable this encryption but they cannot control or manipulate the default Google Managed keys. Rather they can use their own keys with Cloud KMS. By default, DEKs and KEKs are rotated by Google. When a system tries to access a chunk, it requests the DEK from the key management system which authenticates the calling service, and then it sends the DEK to the storage system that decrypts the data.

Cloud KMS is a managed service for customer controlled encryption keys. It handles generating, importing and storing the keys within Google for application layer encryption on services such as Cloud Storage, BigQuery and Bigtable.

Cloud HSM is Google’s support for FIPS 140-2 keys using them only in Level 3 hardware modules which are tamper-evident and respond to tamper attempts.

Customer Supplied Keys is the option for using your own key management entirely. Keys are generated and kept onpremises and passed along with API calls which only use them in memory never storing them to disk. This way, Google can encrypt or decrypt the data with the customer supplied keys. This customer provided key is used to create a new customer derived key in combination with a per-persistent disk cryptographic nonce. In many cases, the customer supplied key is used to seed other keys that only stay in memory except for the nonce. Cloud External Key Manager(EKM) is the service which allows one to use third party management of keys and sets up Cloud KMS to consume them.

Cloud Storage supports ACLs in finegrained access mode to mirror support for them in Amazon S3 buckets to aid migrations, but this support is considered legacy. Otherwise buckets support IAM access at the bucket and project levels in uniform access mode. You can also use url signatures to grant temporary access to objects. Storage Buckets can also be made available publicly.

With Cloud Storage, signed policy documents can be created and set to restrict uploads based on sizes, type and file attribute based restrictions. It is a best practice to write checksums for all uploads and verify them. Google recommends creating and using CRC32C vs MD5 checksums due to its support of composite objects that are created with parallel uploads.

You can Secure your GKE or Anthos clusters with binary authorization, istio and mesh networking(ASM), cert manager, OPA policies and create all your elevated access service accounts with ACM.

Security Observability

Evaluation of security practices starts with increased observability into the different layers and components of the application you’re working with. This starts with understanding if your access controls and IAM policies work correctly. Otherwise you’re unaware the security measures put in place to run the application are working.

Auditing your policies begins with reviewing them and what has happened in your projects audit logs. The Cloud Logging agent will collect most common logs needed and can be configured to collect specific logins and accesses. Cloud Audit Logs is a logging service which records administrative operations taken in your project. Audit Logs are saved for a limited amount of time so they need to be exported to Cloud Storage or BigQuery if regulations require retaining for a longer amount of time. Logging can export messages to Pub/Sub as JSON messages, to ‘Logging’ datasets in BigQuery, or as JSON files to Cloud Storage. When everything is sufficiently logged, you can create access monitoring and run audit queries which that scan for anomalies which can be reported. Turning on a Google Artifact Registry’s automatic scan for vulnerabilities is an example of increasing security observability.

Penetration testing simulates an attack, particularly on a network interface of a host or a firewall. These tests connect with services and detect security vulnerabilities in running services. The solution is to often upgrade or patch an application so that it is no longer vulnerable.

The first phase of Penetration Testing is Reconnaissance where testers scope out the target much like a burglar looking for ways in. All information that can be gathered is gathered like Apache’s ServerToken string. Recon phase testing might include social aspects where the tester learns everything they can about the operators who do have access to the target system. This might come in the form of phishing or leaving a USB key near someone’s car in the parking lot. This phase can can be very short or very long.

The second step is Scanning. Once information is gathered, points of access on the network like IPs and ports are scanned, http endpoints have their root and header capabilities fetched and tested, commonly vulnerable urls are checked to see if they exist to determine if an access vector is present.

Gaining Access is a phase where the information gathered and a scanned access vector is exploited to obtain access to the breeched system. Maintaining access is what happens when parts of the exploit or other exploits are stored or hidden in the filesystem, obfuscated, set to sleep or listen for commands from some remote uri. They may even scrub logs hiding their tracks.

It is recommended for highly secure environments to create automatic pentesting tools that run automatically and log to Cloud Logging, from which you can draw monitoring alerts or reports.

Security Principles

Three main principles apply when we discuss Cloud Security: Separation of Duties, Least Privilege, Defense in Depth.

Separation of Duties

Separation of Duties, especially combined with these two other principles, creates a strong accountability and oversight in the work. Separation of Duties means code committers aren’t the same as pull-request approvers. When multiple people have a scope of duties that are closely related, the impact and risk of internal bad actors is reduced.

Developers use pipelines created by reliability engineers through DevOps principles, but often they are not allowed to approve pipeline steps in the higher environments such as production. Small teams may have a harder time accomplishing this.

Least Privilege

Least privilege is the principle of giving only the access that is needed. Working in least privilege focused companies is often a headache as nothing is easy to setup, it often takes coming up against an access denial to know what requests you need to make of the access teams. It may take you weeks to set up something it’d take you days to do if you had full access. This is because when access is denied, despite planning, and requests are made for grants, documentation has to be updated, the Solution Architecture Document may need to be updated, several security teams may need to reapprove your project after discovery of new facets of the work, you might need to wait on a Cloud Solutions team to produce a terraform module which provides your needed resources for a part of the project.

If you have microservices that use serviceaccounts to access resources, separate the serviceaccounts into ones that represent the workload, so that resources are grouped and only the services which need to access their resources will be able to do so.

IAM roles and permissions can be granted to satisfy whatever schema you can conceive. Once roles are granted, or custom roles created, you can use the Recommendation Engine to help prune unnecessary principle grants in IAM.

Defense in Depth

This is the practice of controlling security at multiple levels of your application using the tools of those layers. For instance, If you treat a kubernetes pod as if it has a bad actor built into its image, we can distrust the filesystem as a safe place to store sensitive data. We can exclude secrets from env vars and use Google’s SDK to request the directly from the secret manager api upon startup of our application. This assures the secrets are only in memory and our bad actor now can be inside our pod and not be able to know the sensitive information.

So like a stairway of distrust we design while considering:

The Network is compromised
The Cluster or VM is compromised
The disk is compromised
Root is compromised
The Application is compromised

We are trying to reduce the above list to just the last item:

The Application is compromised

As an SRE, SRE Manager or an Architect, it is important to know that last item is the responsibility of the application development team to secure their code and app. The other items on the list, we can as SREs design around. We can introduce securityContexts on pods or containers which mount the root FS readonly. We can ask the app team to modify those applications so they only write to volumes. We can design around this stairway of distrust. If every connection is suspect, then when we secure them with istio and certificates then we can be fulfilling the principles of Defense in Depth.

Regulations

Regulations are a big part of organizations and business. Every industry and company is regulated. Understanding where those regulations intersect with your design decisions is the same thing as knowing the impact they’ll have on your project. Cloud Architects should know how these regulations apply to them and how to stay compliant with regulations like US medical industry’s HIPPA/HITECH, Europe’s GDPR, and COPPA.

The exam will cover these as well as Sarbanes-Oxley.

HIPPA/HITECH

This is the law which applies to medical records in the United States. It is designed to protect personal information and privacy.

The Health Insurance Portability and Accountability Act (HIPAA) was enacted in 1996 to improve the portability and continuity of health insurance coverage. The HITECH Act, enacted as part of the American Recovery and Reinvestment Act of 2009, promotes the adoption and meaningful use of health information technology. Both HIPAA and HITECH place privacy, security, and breach notification requirements on covered entities and their business associates.

As a cloud architect, it is important to be aware of HIPAA and HITECH and how they impact the handling of health information in the cloud. HIPAA and HITECH impose requirements on covered entities and business associates with respect to the security, privacy, and confidentiality of health information. These requirements must be met when storing or transmitting

The HIPAA Security Rule is a federal law that establishes national standards for the security of electronic protected health information. The Rule requires covered entities to implement security measures to protect the confidentiality, integrity, and availability of PHI.

What are the Security Rule Safeguards? The HIPAA Security Rule is a set of standards that must be met in order to ensure the confidentiality, integrity, and availability of electronic protected health information (PHI).

There are four main types of safeguards that must be in place in order to meet the requirements of the Security Rule: administrative, physical, technical, and organizational. Administrative safeguards are policies and procedures that must be put in place in order to protect PHI, while physical safeguards are measures taken to secure the physical environment in which PHI is stored. Technical safeguards are security measures used to protect electronic personal health information. Organizational safeguards are measures taken by an organization to protect the personal information of its clients, employees, and other individuals it deals with. Organizational safeguards are specified under Section 164.308 of the HIPAA Security Rule. Organizations must be able to design and implement appropriate administrative, technical, and physical safeguards to protect the privacy and security of individuals’ health information.

The most common technical safeguards are authentication, authorization, integrity, confidentiality, and availability.

The Privacy Rule requires entities covered by HIPAA to identify the personal health information (PHI) of individuals in certain transactions and maintain that information in an identifiable form only for legitimate business purposes.

The European Union’s (EU) General Data Protection Regulation (GDPR) came into effect on 25 May 2018, replacing the previous EU data protection legislation from 1995.

Under the new rules, organizations handling personal data of EU citizens must comply with a variety of requirements covering privacy by design, consent for data use, and access to personal information.

The GDPR treats Controllers and Processors differently. A controller is any person, organization or company that controls the collection and use of personal data. A person or company that collects data on people for their own use is called a processor. Any processor that uses personal data to create a valuable asset is required to identify the asset as a data subject’s must be informed and give consent to.

In the event of a data breach (e.g. leaked passwords), data processors must notify the data controllers who have to notify the government and the people whose data was breached.

SOX

The Sarbanes-Oxley (SOX) Act is a set of rules and regulations that help ensure the accuracy and transparency accounting information in publicly traded companies.

The act was introduced by Senator Paul Sarbanes of Maryland in 2002. The primary purpose of the act is to ensure that public information served by companies is accurate and complete.

In addition, the act requires companies to disclose any material weaknesses in their internal control over financial reporting.

What rules do they put in place? As far as IT Architects are concerned, the act requires the prevention of falsification and deletion of records, retention of certain records for defined periods.

This includes measures to increase transparency, and may include: periodic auditing compliance with SOX, developing a plan to disclose material information on a regular basis, ensuring that employees understand the company’s reporting process and comply with it, developing training programs to help employees recognize potential conflicts of interest, and creating a culture in which employees feel confident to raise issues without fear of being sued.

requirement to implement tamper-prevention controls
requirement for an annual audits
requirement to keep data confidential

Childrens’ Online Privacy Protection Act(COPPA)

COPPA is a United States law passed in 1998 which requires websites and online service to restrict what they do reguarding the personal information of children under the age of 13. Websites which serve this audience must:

Notify Parents before collecting data about their child
Allow Parents to block such collection
Give Parents access to the data collected
Give Parents the choice of how such data is used
Have clear and understandable privacy policies
Retain the data only for the length of time for which it is needed
Maintain confidentiality, integrity and availability of the collected data.

All the data covered by the law aren’t limited to but specifically mention the identifying information in the data, such as name, dwelling, photographs.

ITIL Framework

ITIL is a standard of IT management practices that dovetails business goals with common IT activities. ITIL has 34 practices that are grouped into General, Service, and Technical practices. General are strategy, risk management, disaster recovery, architecture, project and security management. Service management items are analytics and analysis, service design, capacity and performance, incident management, and asset management. Technical practices are those which include management of deployments, infrastructure, and software development practices. Businesses adopt something like the ITIL because its a magic box of best practices that fits many different scenarios. It creates a repeatable standard which can simplify a lot of trouble and guesswork in IT management.

Summary

Designing secure systems that will live in GCP starts with access and ends with compliance and touches everywhere in between. IAM is used to give access to identities which are users, groups or serviceaccounts, Permissions, custom roles, predefined roles, and basic roles provide for just about any concievable combination of access and limits. Policies ensure that company wide standards are enforced.

Encryption is everywhere and its power can be placed within the customer’s hands. Least privilege, defense-in depth, and proper auditing fill in the gaps.

Exam Essentials

Understand all the different parts of IAM and how they interact
Understand that roles are simply groups of permissions which go together
Basic roles are legacy and should be avoided when possible
Understand that access can be granted at the resource, project and folder levels
Understand that Policies use bindings to associate roles with resources
Understand the hierarchy of Organizations, Folders, Projects and inheritance
Understand Google’s Encryption at Rest and in Transit, know the AES bit level for each
Understand DEKs, KEKs, and how they’re used and interact
Understand all the types of managing keys
Understand pentesting and auditing
Understand the best practices for security
Understand how to use access and storage classes to achieve compliance

Official Resources

Architecting GCP Network Solutions

Oct 31, 2022

Christopher Shaun Godwin

Author

OSI Model Layers

Physical, the actual metal, wires, electrons and plastic ethernet plugs. You’ll find wifi’s radio frequency here because radio is physical phenomenon. In Quantum networking this layer are the entangled particles and the equipment uses to read and write to them plus the equipment used to connect to that. Voltage is sometimes the physical layer in Ethernet Over Power. With tin cans on a string, this layer is the cans, string and the vocal vibrations traveling through them.
Data Link, ARP, Mac Addresses, Collision avoidance. This is broken into two mini-layers, the first is media access control(MAC) and the second is Logical Link Control(LLC). The second acts as a negotiator between the MAC layer and the third ‘Network’ layer.
Network, this is where IP Addresses live. Keep in mind these network layers are the layers of a packet sent over the network. This is the base layer for packets. A packet is data encapsulated in a route with source and destination addresses.
Transport. The protocol that makes this process work known by all networking devices speak is Transmission Control Protocol(TCP) or User Datagram Protocol(UDP). The Protocol identifier stored in a packet lives in this layer.
Session, this layer manages handshakes. An SMTP connection timeout would exist on this layer. TLS handshakes happen here. An https packet is fully encrypted, so a request to a server asking for a url cannot be understood unless it is decrypted, then it can be seen. Inside layer 4 lives an encrypted layer 5 envelope in the case oF HTTPS connections. Layer 5 is the encrypted data, while layer 6 is the decrypted data.
Presentation, A GET / request is in this layer. Mappings of network resources to application resources in the OS kernel happen at layer 6.
Application, This is the later applications connect to in order to do networking. A webbrowser fetches web pages from this layer. This layer one might consider a data format. A TXT file vs a Json file. Mime types exist at this Layer. Layer 7 in the packet is the raw data unenveloped by network dressing that tells the network about it.

OSI Model as an Out of Town Drive

Gravel, Concrete, Rebar, Paint, Reflectors, Lights, Engine, Fuel, Speed Limit Sign
The Lane
Connected Roads
Vehicle Tags, Driving Skills, Driving Laws
The Trip Session
The Itinerary of the Trip
The People on the Trip

Architects really only need to worry about layers 3, 4, and 7 with regard to load balancers, gateways, proxies, firewall rules, subnets, and traffic flow.

IP Addresses, FW Rules, and Routing

CIDR means classless inter-domain routing notation. It’s a way of simplifying the subnet mask by only specifying the bits. Understanding CIDR notation and IPV4 should be sufficient for the exam.

Networking in the cloud and in general works with IP Networking. IP networks are groups of devices. Subnets are spaces that identifiers live. A subnet is a street in a neighborhood. If all the addresses are single digits, then only 10 houses on that street are addressable. This is how IP networking works. You have to add more digits, or break the street into east street and west street to fit more addresses on that street.

In this way, networks are partitioned by their octets and subnet masks. Additionally they are partitioned with firewalls, NATing and public vs private IP spaces. Computers on the same physical network have an ARP table which maps IPs to MAC addresses as well as routing tables which map certain networks to specific network interfaces. IPV4 uses a four octet notation. Each octet represents numbers from 0-255. 0.0.0.0 is the internet. 255.255.255.255 is a subnet mask. Routers usually sit on the first or last ip in a network: 1 or 254. In binary to count to 255 you need 8 bits: 11111111, while 255 can be represented in hexadecimal as FF. but both have the same number of bits. So the highest number in an IPv6 block(FFFF) is 65535. That means that the IPv6 block has an entire IPv4 class B network within just one of its 8 groups: F0d8:0000:0000:0000:0000:0000:0000:0000. No IPv6 knowledge is required.

You’ll use CIDR ranges to specify subnets in GCP. You can learn subnetting in IPv4 or use tools online or in the shell like ipcalc to find the right amount of addresses for your private networks. Remember to consider growth. No overlapping subnets can be created in a VPC and each subnet must be uniquely defined.

In IP Networks, there are public and private networks. Certain online committees like the Internet Engineering Tas Force(IETF) process documents lie those called RFCs which define internet open standards. RFC 1918 designates these subnets for internal private use:

10.0.0.0/8

$ ipcalc 10.0.0.0/8
Address:   10.0.0.0                 00001010. 00000000.00000000.00000000
Netmask:   255.0.0.0 = 8            11111111. 00000000.00000000.00000000
Wildcard:  0.255.255.255            00000000. 11111111.11111111.11111111
=>
Network:   10.0.0.0/8               00001010. 00000000.00000000.00000000
HostMin:   10.0.0.1                 00001010. 00000000.00000000.00000001
HostMax:   10.255.255.254           00001010. 11111111.11111111.11111110
Broadcast: 10.255.255.255           00001010. 11111111.11111111.11111111
Hosts/Net: 16777214                  Class A, Private Internet

172.16.0.0/12

$ ipcalc 172.16.0.0/12
Address:   172.16.0.0           10101100.0001 0000.00000000.00000000
Netmask:   255.240.0.0 = 12     11111111.1111 0000.00000000.00000000
Wildcard:  0.15.255.255         00000000.0000 1111.11111111.11111111
=>
Network:   172.16.0.0/12        10101100.0001 0000.00000000.00000000
HostMin:   172.16.0.1           10101100.0001 0000.00000000.00000001
HostMax:   172.31.255.254       10101100.0001 1111.11111111.11111110
Broadcast: 172.31.255.255       10101100.0001 1111.11111111.11111111
Hosts/Net: 1048574               Class B, Private Internet

192.168.0.0/16

$ ipcalc 192.168.0.0/16
Address:   192.168.0.0              11000000.10101000. 00000000.00000000
Netmask:   255.255.0.0 = 16         11111111.11111111. 00000000.00000000
Wildcard:  0.0.255.255              00000000.00000000. 11111111.11111111
=>
Network:   192.168.0.0/16           11000000.10101000. 00000000.00000000
HostMin:   192.168.0.1              11000000.10101000. 00000000.00000001
HostMax:   192.168.255.254          11000000.10101000. 11111111.11111110
Broadcast: 192.168.255.255          11000000.10101000. 11111111.11111111
Hosts/Net: 65534                     Class C, Private Internet

Above, Hosts/Net shows the total number of ip addresses on the network.

Firewall rules control the flow of traffic over any network. In a VPC in GCP, you’ll find firewall rules are part of the network. Traffic flowing into a network is called ingress, and traffic which exits the network is called egress.

Respectively, firewall rules fall into the categories of controlling either ingressive or egressing traffic. Implied firewall rules exist by default. The first one blocks all ingressive traffic and the second one allows all egressing traffic. These rules cannot be deleted and they aren’t listed, they’re implied. To override them you make other rules with a higher priority. If traffic enters or exits the network, its properties are matched to all the rules in order of priority. When a match occurs the rules are no longer processed. Therefore a higher priority rule allowing all HTTPS traffic into the network that matches an incoming packet will allow the packet and not move on to the lower priority implied rule that blocks all traffic.

Rule priority is processed from low to high, low being 65535 and the highest being 0. The two implied rules have a priority of 65535.

There are four default rules designated on each default VPC network.

default-allow-internal: allows all VPC traffic to and from the VPC
default-allow-ssh: allows ssh from outside the network to any instance within the network
default-allow-rdp: allows Remote Desktop Protocol(RDP) connections from any source to any VPC destination
default-allow-icmp: allows ping to ingress into the VPC

These four rules have a priority of 65534 and are therefore the second lowest.

Ingress rules can specify the source ip while egress rules can specify the destination. To get more granular than that you can use network tagging in your firewall rules, and then tag compute resources. Otherwise all rules can specify an allow or deny action, the targets to which the rule applies, the protocol, the port, and enforcement status(enabled or disabled). Firewall rules exist in Google’s network at the global scale so all a Projects rules apply to every location within which the project has resources.

Cloud Router is a Border Gateway Protocol(BGP) software router in the cloud which advertises its IPs to networks outside of the cloud. When it interacts with those networks, it learns IP information about them. These public routers then speak to each other to map and remap the internet to physical connections. In this way, a ip range can be moved from one internet provider to another when they both allow BGP to communicate over them. This allows physical internet connectivity redundancy.

Cloud Router handles routing for the following services:

Dedicated & Partner Interconnects
High Availability VPNs
Router appliances

Cloud armor is an application layer(OSI Layer 7) web applications firewall(WAF) what protects against DDoS. attacks, cross-site scripting, and Database injections. The preconfiguration for Cloud Armor uses rules mitigating OWASPs top ten threat list. Cloud Armor has security policies that filter connections that use attack methodologies allowing the ones free of them to pass. Policies are available as preconfiguration while allowing for manually configured policies. Rules are defined with a rules language, but policies can also simply specify whitelists of trusted parties.

Virtual Private Clouds(VPCs) are networks which exist in the cloud at the global scale, so VPCs in Google span all regions. VPCs have subnets and all resources that use internal ips, which are Compute Engine based services for the most part. Cloud Run and App Engine can connect to VPC resources through a Serverless VPC connector, though the connectors for each.

Though VPCs are global, subnet resources are regional resources, since there is no overlap between subnets, each region’s subnet resources must be unique from other subnet resources in any region including unique from those within the same region. When VPCs are created you can specify automatic creation of subnets for different regions, or you can choose a custom provisioning of subnets for the regions involved. /29 subnets are the smallest allowed networks within a VPC.

VPSs can be set to one of three modes:

default: the mode selected when creating a new project
auto-mode: an automatic mode that creates subnets in every region
custom: allows full control of subnetting for production and high security environments

Auto-mode uses this range to create a subnet in every region automatically:

$ ipcalc 10.128.0.0/9
Address:   10.128.0.0           00001010.1 0000000.00000000.00000000
Netmask:   255.128.0.0 = 9      11111111.1 0000000.00000000.00000000
Wildcard:  0.127.255.255        00000000.0 1111111.11111111.11111111
=>
Network:   10.128.0.0/9         00001010.1 0000000.00000000.00000000
HostMin:   10.128.0.1           00001010.1 0000000.00000000.00000001
HostMax:   10.255.255.254       00001010.1 1111111.11111111.11111110
Broadcast: 10.255.255.255       00001010.1 1111111.11111111.11111111
Hosts/Net: 8388606               Class A, Private Internet

The VPC reserves four ip addresses from every subnet. Shared VPCs are shared from one project to another. This may be part of an organizational structure, or collaboration between parts of a company. Google recommends using one VPC because its easier to manage. However large enterprises will ignore this.

Shared VPCs are how the resources across several projects can be on the same network. This works because the host project defines service projects. The firewall rules for the resources can exist on the project but apply the shared VPC. You can specify that all future subnets are shared in a host project or just specific subnets.

You can take this further and delineate network and project duties partitioning them among teams and therefore separating their privileges. As long as the host project and service projects are in the same organization, shared VPCs can be used. Migrations are the exception.

When projects are in different organizations and need to communicate over a network, they can use network peering. VPC Network peering allows two VPCs to communicate with one another via RFC 1918 private ranges. Organizations usually communicate over the internet with public ips. If a lot of private communication exists between companies, they’ll use a VPN to communicate over private networks. VPC Network peering is an alternative to these approaches.

VPC Network peering might be used by an organization wanting to make their services available to its customers who are different organizations in GCP. A Concert company might make a private cloud network available to the ticketing vendor and the marketing vendor so that the concert organization can coordinate ticketing and sales from booths within the venue.

Companies might use organizations as part of a higher segmentation of their projects and may have a need for organizations to communicate over its peered VPC.

VPC Network Peering:

has lower latency, doesn’t travel over the internet
as an alternative to public ips, a peered VPC is a reduced attack surface
egress between peered VPC is free

Peered VPCs have their own firewall rule definitions from the VPC that is within an organization. A single VPC can have up to 25 connections peered at maximum. VPC peering works with Compute base services which receive a private IP. With peering, both peers must set up the configuration and the configurations must match. If a peer deletes their side’s configuration, the peering will cease and go into inactive mode. Peering doesn’t add latency.

Hybrid-Cloud Networking

Hybrid-Cloud Networking is networking which spans clouds or to onprem datacenters. When only separate public clouds are involved, Multi-cloud Networking is involved. But when an onprem datacenters is involved with one or more public clouds, Hybrid-Cloud Networking is the term applied. Services which connect to onprem databases thorough a dedicated or partner interconnect is considered Hybrid-cloud networking as is something like Anthos Service Mesh in a hybrid context.

Top 5 workloads staying onprem according to Dell:

Unstructured Data Analytics is staying onprem 31% of the time due to a more secure environment for which the data to live.
Structured Data Management & Analytics for the same reasons.
Business Applications like ERM, ERP, CRM
Engineering/Technical

Top 5 workloads moving to the cloud:

Databases
Batch processing, File lifecycle
Backups, Disaster Recovery
Petabyte scale data warehouses
Scaled workloads, Compute Workloads, Stateless kubernetes applications

Data warehouses in the cloud like BigQuery can use onprem sources, and the interconnect between cloud and onpremises datacenters needs to have the capacity for that connectivity. You must know the projected bandwidth usage and adequately plan for not only growth but redundancy for critical operations. This keeps the network reliable under load.

Latency is also a consideration. Running stateless GKE applications that connect to an onpremises database can expect a 2000 millisecond latency accessing a moderate payload event when they run on the fastest and most compute specialized nodes. The bottleneck is entirely the connectivity between datacenters and cloud regions. This is less of an issue with customer non-facing applications, however with things like JAM stack APIs running in cloud, this affects page load and the quickness of your app.

One way to handle latency is to use caching in the cloud so that the calls back to onprem databases or APIs will only take long once in a while. One might take a local database and sync it with mongo mirror or add a replica to a local MySQL database in the cloud to reduce latency and continue to meet SLAs.

Network Topologies:

Mirrored topology: General onprem resources are exactly mirrored in the cloud
Meshed topology: All resources can connect with all resources
Gated egress topology: Onprem APIs are made available to the cloud
Gated ingress topology: Cloud APIs are made available to onprem services
Gated egress and ingress topology: both the prior two
Handover topology: Onprem data is uploaded to the cloud to be used by cloud services

Your choice of these depends on workload distribution, latency, throughput, and existing topology.

The ways to implement Hybrid-Cloud Networking are by three different means:

Cloud VPN
Cloud Interconnect (either direct or partner)
Direct Peering

Cloud VPNs are services that create a virtual private connection between your VPC in Google and your other networks. Cloud VPNs are IPSec tunnels and so they require public static IPs on both ends. Google offers an HA VPN and VPN Classic. The HA VPN uses two connections to one HA VPN gateway, each connection comes with their own external IP addresses. The HA option has 99.99% availability. VPN Classic provides 99.9% availability with one connection and endpoint. Both option support 3Gbps. When data egresses to the VPN is it encrypted and when it ingresses into the destination network it is decrypted. Cloud VPNs operate with Internet Key Exchange(IKE) protocol.

Cloud Interconnects provide direct connections between GCP and onpremises networks. Highly available interconnects use two connections. Interconnects are available in 10Gbps and 100Gbps bandwidths. Partner interconnects are available between 50Mbps to 50Gbps bandwidths. Google’s interconnects terminate at one of Google’s Points of Port(PoP). If you are not near enough to a PoP, you interconnect to a third party near you who has connections near one.

Interconnects are:

Private
VPC addresses are available to onpremises networks without NAT or encryption
You can scale up interconnects

Interconnect scaling chart:

	Dedicated	Partner
Unscaled	10/100Gbps	50Mbps-50Gbps
Scaled	80/200Gbps	80Gbps

80Gbps connections use eight 10Gbps combined and 200Gbps interconnects use two 10Gbps combined.

Direct Network Peering is used when you need to affect the BGP routing from GCP and Google Workspace services. Peering doesn’t utilize any part of GCP, rather it affects the internets routing matrix so that your public resources route directly to you. Google recommends simply using interconnects when not needing to connect to Workspace services.

Private Service Connect for Google APIs connects Google’s public API to private locations without the need for egressing over the public side of the network. Private Service connect can be configured to point to private.Googleapis.com(all-apis) or restricted.Googleapis.com(vpc-sc).

Private Service Connect for Google APIs with Consumer HTTP(S) offers the same service but connects to internal loadbalancers inside your VPC which forwards the correct requests to the correct API.

Private Google Access connects custom domains to Google’s APIs through a VPC’s internet gateway. With this option you have to create the DNS records you’re using and the dns records that point to the all-apis or vpc-sc api domains.

Private Google Access for Onpremises Hosts is access that allows onpremises hosts to connect to private Google resources over Cloud VPN or Cloud Interconnect.

Private Service Connect for Published Services allows you to privately connect to services in a different VPC that has published their service using the Private Service Connect for Service Producers.

Private Service Access is network access used by Serverless GCP resources to connect to VPC resources over IP when VPC peering is used.

Serverless VPC Access is used by serverless resources to connect to VPC resources using an internal IP address. This option uses VPC Connectors to connect from Cloud Run, Cloud Functions, and App Engine Standard.

Load Balancing

GCP has five different loadbalancers(LBs) for different use cases. Is your workload balanced between addresses in a region or across several regions? Does the LB receive internal, external, or both internal and external traffic? What are the protocols of the connections being balanced?

GCP Loadbalancers:

Network TCP/UDP
Internal TCP/UDP
HTTP(S) Proxy
SSL Proxy
TCP Proxy

multiregional=>condition: Multi-Regional Balancing?
https=>condition: HTTP(S)?
ssl=>condition: SSL?
tcp=>condition: TCP?
intorext=>condition: Internal traffic?
internallb=>operation: Internal TCP/UDP
externallb=>operation: Network TCP/UDP
httpstraffic=>operation: HTTP(S) Proxy
ssllb=>operation: SSL Proxy
tcplb=>operation: TCP Proxy
e=>end: End

multiregional(yes)->https->e
multiregional(no)->intorext
intorext(yes)->internallb
intorext(no)->externallb
https(yes)->httpstraffic
https(no)->ssl
ssl(yes)->ssllb
ssl(no)->tcp
tcp(yes)->tcplb

HTTP(S) Load balancers are Layer 7 LBs and specifically handle http traffic. For other SSL purposes, like loadbalancing SMTP TLS you’d use the SSL LB as it is also a Layer 7 LB which operates on other protocols. For everything else, there’s TCP. You would use any of these three if you are balancing across two or more regions.

Additional Networking Services in GCP

Service Directory is a managed service discovery meta- database. Service directory can be accessed by a number of means, clouds, and GCP Services.

Cloud CDN is a managed content delivery network enabling global latency reduction for data access of files such as images or documents. Cloud CDN can pull content from Compute Engine Managed Instance Groups, App Engine, Cloud Run, Cloud Functions, and Cloud Storage.

Cloud DNS is a managed and globally distributed hosting service for the Domain Name System. Cloud DNS supports public and private DNS zones. Private zones are visible within the VPC and public zones are published to the internet.

Summary

Virtual Private Clouds are global resources which contain your addressed services. VPCs have various ways of having serverless environments connect to them, or private connections out to Google APIs from VPCs with no egress to the internet. Connecting VPCs to onpremises networks is done through hard connection and network management of the flow of traffic over interconnects which can be Highly Available as can Cloud VPNs.

Hybrid Cloud Networking, either with Interconnects, VPNs or Direct Peering allow workloads to span between local datacenters and cloud resources. Architects include latency, network topology, transfer time, maximum throughput, and room for growth.

Load Balancing handles different use cases with 5 types of loadbalancers, 2 regional, and 3 global.

Exam Essentials

Grasp VPCs
Understand VPC Sharing
Understand Firewall Rules, priorities, and direction
Know CIDR notation, lean how to subnet in your head or with ipcalc
Understand Hybrid-cloud Networking(HCN)
Understand when to use HCN
Know the advantages and disadvantages of each HCN option
Understand Private Access Services
Understand GCP Load Balancing

Official Resources

Architecting Storage Solutions in Google Cloud Storage

Oct 31, 2022

Christopher Shaun Godwin

Author

Object Storage

Object Storage is common to all cloud systems and has its roots way back in 2006 with Amazon S3 and Rackspace Files/OpenStack Swift and Google Cloud Storage in 2010. These systems are for storing files or documents as objects as opposed to a directory filesystem. Instead of hierarchical the particulate nature of object storage treats everything atomically. You can’t seek and read parts of the file, you can’t tail off of object storage. You can get, put, delete objects. Their organization depends on the system.

Organization

Buckets in GCP are containers filled with these particular objects. Objects when updated create new versions, you cannot update an old version with a new file. Once a version is there they’re immutable or unchangeable. The bucket is the logical definition with the IAM permissions that the objects inherit. Therefor you’ll give write access to all the objects in the bucket to any accounts with write access. You can place individual IAM permissions upon individual objects. There is an illusion of a directory structure because the file /pictures/2022-10-20/picture.jpg on a file system would be named picture.jpg and live in the folder /2022-10-20/ which in turn lives in the folder /pictures/. However, with object storage, /pictures/2022-10-20/picture.jpg is the entire filename.

Buckets must be uniquely named from all other buckets in the cloud owned by all other users. Buckets cannot be renamed or automatically copied to a new bucket. Objects don’t have to be uniquely named.

Bucket name best practices:

Bucket names shouldn’t have personal information.
Use DNS naming standards.
Use UUIDs or GUIDs if you have buckets in any real quantity.
Don’t upload objects with time series based filenames in parallel
Don’t name objects in sequence if uploading them in parallel
It’s best to use the fully qualified subdomain

One way to access cloud storage is through a FUSE mount. FUSE (Filesystem in Userspace) is a software interface that allows users to create and access virtual filesystems. This can be useful for mounting cloud storage buckets so that they can be accessed like any other local filesystem. To do this, first install the FUSE package for your operating system. Then, create a directory that will serve as the mount point for the bucket. For example, if you want to mount a bucket named “mybucket” on your local machine, you would create a directory named “mybucket” in your home directory. Next, use the fuse-bucket tool to mount the bucket. To use FUSE with Cloud Storage, you first need to install the FUSE library and the gcsfuse tool. Once these are installed, you can use the gcsfuse command to mount a bucket. For example, the following command will mount a bucket named mybucket.

Storage classes

GCP has different classes of storage:

Standard
Nearline
Coldline
Archive

Different storage classes in Google Cloud Storage offer different benefits for different workloads. The most basic storage class, Standard, is great for storing data that is accessed frequently. The next class, Nearline, is ideal for data that is accessed less frequently, but still needs to be accessed quickly. The last class, Coldline, is perfect for data that is infrequently accessed and can tolerate higher retrieval costs. By understanding the different workloads and access patterns, users can select the most appropriate storage class for their needs and optimize their Google Cloud Storage experience.

The Standard storage class is designed for frequently accessed data. Data stored in the Standard storage class is charged based on how much you store.

Nearline storage is a type of cloud storage that is similar to online storage but with lower availability and higher latency. Nearline storage is typically used for data that is not accessed more often than once every 30 days but needs to be stored for long-term retention. Costs are calculated based on how often you access the data and how much you store.

Coldline is a class of storage that was announced by Google in October 2016. It is designed for data that doesn’t need to be frequently accessed, such as historical logs or data archival. The storage itself is designed for files accessed less than once per year. It has a higher retrieval cost than nearline.

Archive storage is the lowest cost storage option in Google Cloud with the highest retrieval costs. It is specifically for data that you don’t need to access more than once a year, such as historical data, backup files, or log files. This is great for compliance storage of files that never need to be accessed.

Feature Summary

Feature	Standard	Nearline	Coldline	Archive
Multiregion SLA	99.95%	99.9%	99.9%	99.9%
Region SLA	99.9%	99.0%	99.0%	99.0%
Latency	millisecond access	millisecond access	millisecond access	millisecond access
Frequency	Often	1x30 days	1x90 days	1x1 year
Capabilities	Video, Multimedia, Business Continuity, Transcoding, Data analytics, General Compute	Backup, Long-tail content, Rarely accessed docs	Archive, Source File Escrow, Disaster Recovery Testing	Compliance Retention, Disaster Recovery

Costs Summary

Cost	Standard	Nearline	Coldline	Archive
Size	$0.020/GB	$0.010/GB	$0.004/GB	$0.0012/GB
Retrieval	$0.00/GB	$0.01/GB	$0.02/GB	$0.05/GB

Example use-cases for Google Cloud Storage:

Hosting website static assets (images, JS, CSS)
Distributed backup and disaster recovery
Storing data for analytics and Big Data processing
Storing data for internet of things devices
Storing data for mobile apps
Storing data for gaming applications
Storing data for video and audio streaming
Collaboration and sharing of files non-persistent attached storage
Security and compliance data
Geospacial data storage
In combination with Cloud Functions

These examples leverage both the storage classes and the atomic treatment of the objects themselves. Architects must understand the differences between these storage classes.

Network Attached Storage

Network Attached Storage (NAS) is a type of storage that allows files to be accessed over a network. NAS devices typically connect to a network using Ethernet and can be used by any computer on the network.

Google Cloud Filestore is a NAS service that provides high performance, scalable file storage for applications running on Google Cloud Platform. Cloud Filestore is built on top of Google Cloud Storage and offers the same benefits as other Cloud Storage products, such as high availability, durability, and security.

Cloud Filestore is a good choice for applications that require low latency access to files, such as video editing, media streaming, and scientific computing. Cloud Filestore is also a good choice for applications that require high throughput.

Google Cloud Filestore is a high-performance, managed file storage service for applications that require a file system interface and a shared filesystem. It supports industry-standard file system protocols such as NFSv3 and SMB. Google Cloud Filestore is available in three storage tiers: Basic, High Scale, and Enterprise.

Basic HDD, Good
Basic SSD, Great
High Scale SSD, Better
Enterprise, Best

Basic

The basic Filestore option strikes a good match for file sharing, software development, and use as a backend service with GKE workloads. You can opt for either hard disk drives (HDD) or solid state disks (SSD) when choosing storage, but SSDs provide higher performance at higher cost. For HDD, the I/O performance is reliant on the provisioned capacity, with peak performance increasing when the storage capacity exceeds 10 TiB. For SSD, the performance is fixed no matter the storage capacity.

High Scale

High-scale SSD storage tiers instances are ideal for performing large-scale computing tasks such as DNA sequencing and data analysis for financial services. It gives fast throughput with the ability to scale up and down with demand.

Enterprise

Enterprise tier is designed for enterprise-grade NFS workloads, critical applications (for example, SAP), and GKE workloads. It supports regional high availability and data replication over multiple zones for resilience within a region.

Service Tier	Provisionable capacity	Scalability	Performance	Availability	Data recovery	Monthly Pricing
Basic HDD	1–63.9 TiB	Up only in 1 GiB units	Standard fixed	Zonal	Backups	$204.80($0.20/GiB)
Basic SSD	2.5–63.9 TiB	Up only in 1 GiB units	Premium fixed	Zonal	Backups	$768.00($0.30/GiB)
High Scale SSD	10–100 TiB	Up or down in 2.5 TiB units	Scales with capacity	Zonal	None	$3,072.00($0.30/GiB)
Enterprise	1–10 TiB	Up or down in 256 GiB units	Scales with capacity	Regional	Snapshots	$614.40(0.60/GiB)

Filestore Networking

Cloud file store can connect to a Virtual Private Cloud (VPC) network either by using VPC Network Peering or Private Services Access. When connecting to a VPC network with standalone VPC, when creating an Instance within a Host Project of a Shared VPC, or when accessing the Filesystem from an On-Premises network, you can use VPC Network Peering. When connecting from a Service Project to a Shared VPC, or when using Centralized IP Range Management for Multiple Google Services, you need to use the Private Services Access.

Filestore Access Controls

Iam roles only grant you management access on the GCP resource but file access is managed with unix permissions in an octet format 0777, chown and chgrp.

Databases

Google cloud has several different database options. Relational, NoSQL, Analytical.

Relational Databases

Relational databases have tables with fields which can to refer to fields in other tables. An Example:

User Table:

ID	Name	Age
0	Jeff	35
8	John	35

Jobs Table:

ID	Job Title
25	Software Engineer
8	CEO
0	Director of Engineering

From the example above we can see that these two tables relate on the ID colum, they are relational. So Jeff is Director of Engineering.

Relational databases are built to support a query language and minimize problems with the data often called anomalies. In the above two tables, ID 25 doesn’t exist in the user table so the first row in the Jobs table above is a data anomaly. When fields are properly related, deleting a record in one should cascade to the others. These constraints are part of table schemas. Relational databases conform to ACID(atomicity, consistency, isolation, and durability) transaction models.

ACID Transactions

Atomicity means the whole transaction is done or none at all. A transaction is indivisible for relational databases to work.
Consistency Means when a transaction is complete, the database is constrained to a consistent state so that all foreign key reference a primary key, all unique keys are unique and the database is in an integral state.
Isolation Isolation means that parts of transactions cannot be mixed. Meaning strict grouping and ordering of transaction data in buffers.
Durability Durability means that when a transaction is complete, its change will be immediately reflected in requests for the data that was changed even if the database crashes after the completed transaction.

Cloud SQL

Cloud SQL offers MySQL server, Microsoft SQL server, or PostgresSQL via managed VMs. Google will perform upgrades and backups and let you specify maintenance times. Failovers are automatically managed and healing is an automatic process. Regional Databases are perfect for Cloud SQL. Cloud SQL supports databases up to 30 Terabytes.

All data is encrypted at rest and in transit
Data is replicated across the region to other zones
Failover to replicas is automatic
Standard tools and libraries can connect to Cloud SQL as if they’re connecting to MySQL, SQL Server, or Postgres
Logging is integrated as well as monitoring

Cloud SQL Machine Type Examples

Legacy Type	vCPUs	Memory(MB)	Machine Type
`db-f1-micro`	1	614	n/a
`db-g1-small`	1	1700	n/a
`db-n1-standard-1`	1	3840	`db-custom-1-3840`
`db-n1-standard-2`	2	7680	`db-custom-2-7680`
`db-n1-standard-4`	4	15360	`db-custom-4-15360`
`db-n1-standard-8`	8	30720	`db-custom-8-30720`
`db-n1-standard-16`	16	61440	`db-custom-16-61440`
`db-n1-standard-32`	32	122880	`db-custom-32-122880`
`db-n1-standard-64`	64	245760	`db-custom-64-245760`
`db-n1-standard-96`	96	368640	`db-custom-96-368640`
`db-n1-highmem-2`	2	13312	`db-custom-2-13312`
`db-n1-highmem-4`	4	26624	`db-custom-4-26624`
`db-n1-highmem-8`	8	53248	`db-custom-8-53248`
`db-n1-highmem-16`	16	106496	`db-custom-16-106496`
`db-n1-highmem-32`	32	212992	`db-custom-32-212992`
`db-n1-highmem-64`	64	425984	`db-custom-64-425984`
`db-n1-highmem-96`	96	638976	`db-custom-96-638976`

Shared core types db-f1-micro and db-g1-small are not covered by Google’s Cloud SQL SLA.

By default a Cloud SQL instance is a single machine in a single zone, but high availability options for provisioning additional failover and read replicas in additional zones exist. Additionally you can add read replicas in different regions. This is one way to migrate data between regions and to do disaster recovery testing. Failover replica’s are automatically promoted from read replicas to master in the case of failure.

GCP’s Database Migration Service is designed for MySQL and PostgresSQL workloads and will continuously replicate data from on-premises or other clouds. It performs and initial snapshot of the database and then leverages the native replication features of your database to continually migrate the data. You can also perform lift and shift migrations with this tool in addition to continuous. Cloud SQL scales well only vertically and not well horizontally. More memory and CPU power is needed for bigger workloads, they aren’t sharded across several devices in a workload agnostic manner.

Cloud Spanner

Cloud Spanner is a globally consistent and distributed database that provides the highest level of horizontal scalability as any relational database on the biggest network of its kind. It is fully managed and scales to multiple regions. Spanner supports relational schemas and 2011 ANSI SQL as well as Postgres dialects. Supporting instance consistency rather than “eventual consistency” as is the case with Cloud SQL read replicas, so the risk of data anomalies that eventual models produce are reduced.

Example Use Cases:

Stock trading systems that want to enable global purchasing of a security at a current price at a known time of day.
Shipping companies who need a consistent view of their global distribution network, the status of packages and the sending of global notifications.
Global inventory for a company like Sony Playstation.

Spanner provides a 5 9s availability which means less than 5 minutes of downtime per year. Its fully managed and like other managed database services in GCP its upgraded, backed up, and failover is managed. Data is also encrypted and at rest and in transit.

Analytical Databases

Analytical databases are usually data warehouses. We’ve described some data lake and data warehouse options in Google Cloud’s Hadoop and Spark offering. Though they’re used or ETL, Hadoop data lakes can be used as the data from which analytical systems can draw their data.

BigQuery(BQ)

Hadoop cannot do analytics but BigQuery is able to provide insights and is an analytics solution. Its queries scan large amounts of data and can perform data aggregation. BigQuery uses SQL and is serverless, managed and scales automatically.

BQ Analytics

Big Query is built upon Dremel, Colossus, Borg and Jupiter. Dremel maps queries to extraction trees with leaves called slots. Slots read information from storage and do a bit of processing on the data. Branches on the tree aggregate the data. Colossus is distributed filesystem by Google that offers encryption and replication. Borg is a request router that can handle rerouting during node failure. Jupiter is a petabyte per second network built by Google with rack aware placement which improves fault tolerance and throughput and requires less replication.

While other databases group rows together, in BigQuery, the data in the same column are stored together in an columnar structure called Capacitor. Capacitor supports nested fields and is used because the analytics and business intelligence filtering only happens on a small number of columns compared to a traditional application’s filtering of a number of columns in a row.

BigQuery has batch and streaming jobs to load the data, jobs can export the data, run queries or copy data. Projects contain objects called a dataset that are regional or multi-regional. Regional is straight forward, is what it sounds like. But with multi-regional you either choose the United States or Europe and Google copies the dataset into multiple regions within the continent you’ve chosen.

BigQuery bills on size stored as well as the query size and the data scanned when running the query. For this reason it is advisable to partition your query to specifically the time when the data occurred. Use less broad queries for smaller ones and less data scanned while running the query. You can read more about BigQuery Pricing. For this reason, don’t use queries to view the structure of the tables use bq head or use the Preview Option on the console. You can also use --dry-run to test command line queries which will tell you the number of bytes the query would have returned. You’re not billed for errors or queries whose results are returned from cache.

BQ IAM Roles

Access permissions in all of GCP’s products are granted by IAM, which generally has predefined roles for its products. The roles in IAM for BigQuery are:

roles/bigquery.dataViewer can list projects, tables, and access table data.
roles/bigquery.dataEditor has the permissions of dataViewer and can create and change tables and datasets.
roles/bigquery.dataOwner has dataEditor and can delete tables and datasets.
roles/bigquery.metadataViewer can list tables, datasets and projects.
roles/bigquery.user can list projects, tables has metadataViewer, and can create jobs and datasets.
roles/bigquery.jobUser Can list projects, create queries and jobs.
roles/bigquery.admin Can do any BigQuery operation.

In addition to these overarching roles, granular access can be given to Google service accounts, Google groups, etc over organizations, projects, datasets, tables and table views.

Loading Data into BQ

You can batch load or stream load data into BigQuery.

Batch Loading

Through ETL and ELT processes, data is typically batch loaded into a data warehouse through combining some sort of extraction, loading and transformation. Jobs which load the data into BigQuery can use files as objects in Cloud Storage, files on your local filesystem. Files can be Avro, CSV, ORC, and Parquet formats.

The Data Transfer Service in BigQuery loads the data from other services such as Youtube, Google Ads and Google Ad Manager, Google’s SaaS products and third-party sources. The Storage Write API is used to load data in a batch and process the records in one shot atomically, meaning the whole thing goes in or none of it does. Big Query can load data from Cloud Datastore and Cloud Firestore.

Stream Loading

To stream data into BigQuery you can use the Storage Write API or Cloud Dataflow which uses a runner in Apache Beam to write the data directly to BigQuery tables from a job in Cloud Dataflow. The Storage Write API will ingest the data with high throughput and ingest each record only once.

NoSQL Databases

GCP has four NoSQL databases: BigTable, Datastore, Cloud Firestore and Redis via Cloud MemoryStore(especially with RDB snapshotting).

Cloud Bigtable

Bigtable is a wide column multidimensional database that supports petabyte size databases for analytics, operational use, and time series data for Internet of Things(IoT) sensors. It’s ability to handle time series data well means it is good for marketing, advertisement, financial data and graphs.

Bigtable supports latencies lower than 10ms, Stores at the Petabyte scale, replicates into multiple regions, supports Hadoop HBase interfacing, data is stored in the Colossus filesystem, and metadata is stored in the cluster directly.

Data is stored in tables with key to value maps and each row stores information about the entry which is indexed by a row-key. Columns are grouped into column families like collections and a table can container multiple column families.

Tables are sectioned into blocks of contiguous rows called tablets. These tablets are stored in Colossus. Hotspots occur when you make the row key associated with a workload. For instance, if you make the row key the user ID, the heavier use users will all write to one tablet server. Design the workloads so that they’re as distributed as possible, and if hotspots still do occur you can limit or throttle the keys that cause the problem. Find out more about Bigtable hotspots.

Bigtable has support for the HBase API, so one can migrate from Hadoop HBase to Bigtable. Bigtable is the best option for migrating Cassandra databases to Google Cloud. One can create Bigtable as a multi-cluster and multi-regional and Google will take care of replicating the data. Multi-cluster systems can have their workloads separated, one being the read cluster and the other being assigned a write workload. The cluster replication procedures will assure that both cluster reach “eventual consistency”.

Cloud Datastore

Datastore is a fully managed, autoscaled, flexible structure NoSQL database for storing json objects called documents. It is superseded by Cloud Firestore. Datastore doesn’t have tables, it has what is known as a ‘kind’. Kinds contain entities. Datastore does have relational column called a property and it has a key instead of a primary key.

Cloud Firestore

This is the next product iteration of Cloud Datastore. Firestore is consistent, has two data models(collections and documents). Firestore operates under Datastore mode or Firestore mode, for supporting the latest document database features. Firestore is strongly consistent in either mode. Firestore in Datastore mode is strongly consistent where as Cloud Datastore is eventually consistent. Firestore offers millions of writes a second and the fully featured mode can handle millions of connections.

Memorystore

Managed as other products are, Memory store comes in two forms, Redis and Memcached. You can use memory caches for message processing, database caching, session sharing, etc. Memory caches are generally nonpersistent, but Redis can be configured to snapshot to dir and start again with that same data.

Memorystore for Redis

Redis is a memory datastore designed to return information with sub-millisecond latency. You can store many data types in Redis. Instance memory ceilings top out at 300GB with 12 Gigabit networking. Caches can be replicated across zones for 3 nines availability. As a managed service, Google handles updates, upgrades, syncing and failing over to other instances.

Memorystore for Redis comes in two tiers:

Basic
Standard

Basic is a single server with no replication, Standard is a multi-zonal replication and failover model.

Memorystore for Memcached

Memcached is an opensource cache which was first written for LiveJournal to perform query results caching, session caching, and data caching. Memcached nodes within a cluster called an ‘instance’ must all have the same cpu and memory geometry. So the same amount of resources on each node. Instances can have 20 nodes max, nodes can utilize a max of 32vCPUs and 256GB of memory, with a total cluster memory size of 5TB. This integrated service can be accessed from other services.

Data retention & Lifecycle Management

Data has lifecycles, is fresh, becomes inactive over time, must be archived or pruned. Different types of data have different stages, not only that they can be differently required. As an architect, you must track and handle these data lifecycles for a project or migration.

Storage requirements often impact how policies can be implemented. That is why intimate knowledge of various storage attributes is required of Cloud Architects.

Considering these things is a matter of knowing all your data and the types of data. From there you can record how quickly data must available to be accessed. Then knowing the frequency of access for each type helps define your retention planning and your planning the management of its lifecycles.

Frequency	Solution
Sub-millisecond	Cloud Memorystore, Bigtable, Firestore
frequently	Cloud Storage, Database, NoSQl, Document database
infrequently	Cloud Storage Coldline
not accessed archived	Cloud Storage Archive
not accessed	Prune

In Cloud Storage, one can create triggers that run based on the age of an object or file, the versions of that file, the object’s storage class, actions can include deleting, manipulating or changing the storage class of the object. So when objects are old and not accessed, they can be migrated to different classes. Retention policies can be created and when their specifications are not yet satisfied, they are locked into place guaranteeing their retention under the conditions specified in the policy.

Network and Latency

Latency is a big consideration in overall cloud design. There are decisions you can make that impact latency without knowing their consequences if you are unfamiliar with the particulars of different storage cloud products. Reducing latency is as simple as:

Replicating data into regions across customer locations
Distributing data over a CDN
Using the Premium Network Tier
Using services like Firestore or Spanner which are already global

Summary

GCP has Relational, Analytical, and Unstructured Databases. There are four kinds of cloud storage systems:

Cloud Storage for objects
NAS via Cloud Filestore
Databases
Memory Caches

GCP Relational Databases:

Cloud SQL: Eventual Consistency
Cloud Spanner: High Consistency

GCP Analytical Databases:

BigQuery: Columnar

NoSQL Databases:

Bigtable
Datastore
Firestore

Exam Essentials

Understand all the Storage Systems in GCP
Understand: Standard, Nearline, Coldline, Archive classes in Cloud Storage
Understand: Cloud Filestore NAS features, accessing from Compute
Know how to deploy Cloud SQL as a single server or with replication
Understand horizontal scalability in GCP Storage options
Be familiar with BigQuery as a data warehouse
Be familiar with BigTables Petabyte Scale Options and Operations
Be familiar with migrating data to GCP
Understand GCP’s JSON Document stores
Understand Caching services
Understand data retention and lifecycle management
Understand how to consider latency when designing storage for GCP

Official Resources

Architecting Compute Engine Solutions in GCP

Oct 20, 2022

Christopher Shaun Godwin

Author

Compute Engine Services

Each of these services have different use cases. You’ll have to know how to select the right one for your requirements.

Service	Use Case	Fancy Buzzword
Compute Engine	If you need root access and are running multiple processes in the same operating system instance.	Infrastructure as a Service (IaaS)
App Engine	You need to run a nodeJS, Java, Ruby, C#, Go, Python or PHP application quickly with no configuration or management.	Platform as a service (PaaS)
Cloud Functions	You need to run a serverless routine.	Executions as a Service (EaaS)
Cloud Run	Run individual containers.	PaaS
Kubernetes Engine	Run several docker containers in group.	Containers as a Service (CaaS)
Anthos	Run containers in a hybrid or multi-cloud environment.	Hybrid CaaS

GCE

Compute Engine is an Infrastructure as a Service solution that is the underlying platform for many services like Cloud Functions. Compute Engine provides virtual machines called instances.

New virtual machines require a type be specified along with boot image, availability status, and security options. Machine types are sorted into different CPU and Memory options. Machine types are grouped into families like general purpose, cpu optimized, memory optimized, and GPU-capable.

Compute Instance Options

General Purpose
- shared-core
- standard
- high memory
- high cpu
CPU Optimized
- Standard
Memory Optimized
- Mega-memory
- Ultra-memory
GPU Capable
- Type of GPU / GPU Platform
Disk
- Standard Persistent Disk (SPD)
- Balanced Persistent Disk (BPD)
- SSD Persistent Disk (SPD)
- Extreme Persistent Disk (EPD)
- Disk size

Compute Disk Options

Type	Workload
Standard Persistent Disks	Block storage for large processing with sequential I/O
Balanced Persistent Disks	SSDs which balance cost for less performance with a higher IOPS than SPDs
SSD Persistent Disks	Low latency, high IOPS in the single digit milliseconds, databases
Extreme Persistent Disks	sequential and random access at highest IOPS that is user configurable

Compute disks are encrypted automatically with Google managed keys or customer managed keys with Google KMS which allows storage outside of GCP. Virtual machines run in your Google project as the default GCE service account though you can specify which service account the VM runs as.

Sole-tenancy

Sole-tenant VMs in Google compute engine offer a high degree of isolation and security for your workloads. By running your VMs on dedicated hardware, you can be sure that your data and applications are protected from other users on the same system. Additionally, sole-tenant VMs can be configured with custom security settings to further protect your data.

Good for Bring Your Own License (BYOL) applications that are based on the number of CPUs, cores, or memory. Sole tenancy VMs can allow CPU overcommit so that unused cycles can be given to other instances to balance performance fluctuations.

Preemptible VMs

Preemptible VMs are a type of VM offered by Google Compute Engine at a discounted price. These VMs may be preempted by Google at any time in order to accommodate higher priority workloads. Preemptible VMs are typically used for batch processing jobs that can be interrupted without affecting the overall workflow.

Preemptible VMs can run for a maximum of 24 hours and are terminated but not deleted when preempted. You can use preemptible VMs in a Managed Instance Group. These types of virtual machines cannot live migrate and cannot be converted to a standard VM. The compute SLA doesn’t cover preemptible or spot VMs.

Shielded VMs

Shielded VMs in Google Compute Engine provide an extra layer of security by enabling features like secure boot and vTPM. These features help to ensure the integrity of the VM and its contents. Additionally, integrity monitoring can be used to detect and respond to any changes that occur within the VM. By using shielded VMs, businesses can rest assured that their data and applications are safe and secure.

Secure boot is a UEFI feature that verifies the authenticity of bootloaders and other system files before they are executed. This verification is done using digital signatures and checksums, which are compared against a known good value. If the signature or checksum does not match, the file is considered malicious and is not executed. This helps to protect the system from bootkits and other forms of malware that could be used to gain access to the system.

A vTPM is a virtual Trusted Platform Module. It’s a security device that stores keys, secrets, and other sensitive data. Measured boot is a security feature that verifies the integrity of a system’s boot process. The vTPM can be used to measure the boot process and verify the integrity of the system. This helps ensure that the system is not compromised by malware or other malicious software.

Integrity monitoring is the process of verifying the accuracy and completeness of data. This is typically done by comparing a trusted baseline to current data, looking for changes or discrepancies. Logs can be used to track changes over time, and integrity checks can be used to verify the accuracy of data. Sequence integrity checks can be used to verify the order of events, and policy updates can be used to ensure that data is properly protected. In the context of a Shielded VM this is all built into the boot up process of the instances of this type.

Confidential VMs

Confidential VMs in Google Compute Engine encrypt data in use, providing an extra layer of security for sensitive information. By encrypting data at rest and in transit, confidential VMs help ensure that only authorized users can access it. Additionally, Confidential VMs can be used to comply with industry-specific regulations, such as HIPAA.

These VMs run on host systems which use AMD EPYC processors which provide Secure Encrypted Virtualization (SEV) that encrypts all memory.

Recommender

Google Compute Engine offers a recommender system that can help optimize your compute engine workloads. The recommender system uses Google’s extensive data and machine learning expertise to recommend the best way to save on cloud expense, improve security, and make your cloud usage more efficient.

Recommenders

Discount recommender
Idle custom image recommender
Idle IP address recommender
Idle persistent disk recommender
Idle VM recommender

Instance Groups

An instance group is a cluster of VMs that are managed together. Google Compute Engine offers both managed and unmanaged instance groups. Managed instance groups are ideal for instances that need to be closely monitored and controlled, such as web servers or database servers. Unmanaged instance groups are not identical and so they are not ‘managed’ by an instance template.

An instance template is a blueprint for creating virtual machines (VMs) in Google Compute Engine. You can use an instance template to create as many VMs as you want. To create a VM from an instance template, you must specify a machine type, disk image, and network settings. You can also specify other properties, such as the number of CPUs and the amount of memory.

Advantage of Managed Instance Groups (MIGS)

Minimum availability, auto-replacement on failure
Autohealing with healthchecks
Distribution of instances
Loadbalancing across the group
Autoscaling based on workload
Auto-updates, rolling and canary

Compute Engine Use Cases

GCP Compute Engine is a flexible, customizable platform that provides you with full control over a virtual machine (VM), including the operating system. This makes it an ideal choice for a wide range of workloads, from simple web applications to complex data processing and machine learning tasks.

GCP Compute Engine can be used to create a VM from a container image. The base image can be stored in GCS or GAR, and GCE uses COS to deploy the image. This allows for a more flexibility and full control over all aspects of a VM running docker.

Cloud Run is a GCP managed service for running stateless containers. It is a serverless platform that allows you to run your code without having to provision or manage any servers. All you need to do is supply your image and Cloud Run will take care of the rest. Cloud Run is highly scalable and can automatically scale your container up or down based on traffic demands.

Google Cloud Platform’s Compute Engine can be used for a variety of workloads, from simple web apps to complex distributed systems. Cloud Run is a great option for running stateless web applications or microservices, while Kubernetes can be used for managing containerized workloads at scale. App Engine is also a popular choice for web applications, offering both standard and flexible environments. In addition, Compute Engine can be used for batch processing, analytics, and other compute-intensive workloads.

GCP Compute Engine root access is granted through the cloud console or SSH. Once logged in, you can install packages and run configuration management agents. This gives you full control over your server and its environment.

GCP Compute Engine is a powerful platform for running stateful applications such as databases, accounting systems, and file-based transaction engines. The platform provides high performance, scalability, and reliability specifically for these workloads making it an ideal choice for mission-critical applications. In addition, GCP Compute Engine offers a number of features that make it easy to manage and deploy stateful applications, such as automatic failover and snapshotting.

GCP Compute Engine is a high security environment that offers Shielded VMs and sole-tenancy. This makes it an ideal platform for BYOL. Shielded VMs offer increased security by protecting against malicious activities such as rootkits and bootkits. Sole-tenancy provides an additional layer of security by ensuring that only authorized users have access to the platform.

Cloud Functions

Cloud functions are a type of serverless computing that allows you to execute code in response to events. This means that you can write code that will be triggered in response to certain events, such as a user request or a file being uploaded. This can be used to invoke additional processing, such as sending a notification or running a report. Cloud functions are a convenient way to add extra functionality to your application without having to provision and manage a server.

Event triggers are a great way to automate tasks in Google Cloud Functions. You can use event triggers to respond to events from HTTP requests, logging, storage, and Pub/Sub. Event triggers can make your life much easier by automate tasks that would otherwise be manual. For example, you can use an event trigger to automatically archive old logs when they’re created, or to automatically delete files from storage when they’re no longer needed.

Cloud Function Triggers

Broadly, triggers fall into two categories:

HTTP triggers, which react to HTTP(S) requests, and correspond to HTTP functions.
Event triggers, which react to events within your Google Cloud project, and correspond to event-driven functions.

HTTP Triggers

You can use these HTTP methods:

GET
POST
PUT
DELETE
OPTIONS

Depending on configuration, HTTP triggers to Cloud Functions can be by both authenticated and unauthenticated means.

Event Triggers

Pub/Sub triggers
Cloud Storage triggers
Generalized Eventarc triggers
- Supports any event type supported by Eventarc, including 90+ event sources via Cloud Audit Logs

Execution runtimes

dotnet core
Ruby
PHP
Node.js
Python 3
Go
Java 11

Requests are handled one at a time on a Cloud Function instance. If the instance doesn’t exist it’ll be created. You can specify the maximum number of concurrent instances for a function. HTTP triggered functions are executed at most once and other event triggers are ran at least once. Cloud Functions need to be idempotent, meaning that when ran multiple times does less and less work until the work is complete. When an idempotent script is ran after all work is completed, no work is performed.

::: tip Idempotent A script that downloads all of the pages of a website may be interrupted. If it picks up where it left off on a rerun, or especially if it doesn’t redownload the entire site on that rerun, it is idempotent. :::

Cloud Function Use Cases

Do something when something is uploaded to a Cloud Storage bucket
Run functions such as sending messages when code is updated
If a long app operation is issued, send a pub sub message to a queue and run a function around it
When a queued process completes, write a pub/sub message
When people login, write to an audit log

GKE

Google Kubernetes Engine (GKE) is GCP’s Kubernetes managed offering. This service offers more complex container orchestration than either App Engine or Cloud Run.

Kubernetes can be used for stateful deployments with certain storage objects configured into your deployment. Kubernetes has internal hooks that are auto configured by Google to provide you with GCP provisioned architecture when you deploy it. Kubernetes has different storage classes and some can be marked as default. This way when you provision an object of kind persistentvolumeclaim, a Cloud persistent disk is spun up, attache to the node running the pod, then mounted into the pod per your specifications.

To put it simple: it will create a cloud volume and mount it where you say in your yaml. You can install your own storage controllers by creating the yaml for one, creating a template that generates one(helm chart), or by following third party storage controller instructions.

The NFS-Ganesha storage controller is the most robust durable way to share highly available disks across a whole region in a cluster or set of clusters. You can set persistent volume defaults so that they don’t delete when you delete a k8s object, that way you can specify it in a create-once, reattach many deployment style. You can use logging and monitoring to initiate manual deletes when there are orphaned volumes in the process.

In k8s a combination of privoxy, istio and cert manager can secure connections between pods to institute a trust-no-one level of security. Here we assume your pods can be compromised so we configure them to only talk to the pods which we want and disallow the rest. We can disallow internet access and poke holes only to the services we need. We can ingress only to customer facing services and even put some armor on it by placing CloudFlare or Akamai in front of the services. In this model, we disallow all incoming connections to the ingress that aren’t from on-premises or from the proxies we may put in front of your customer facing services.

GKE Orchestrates the following operations:

Service discovery
Error correction and healing
Volume create, deletion, resizing
Load Balancing
Configuration
Restarts, Rollouts, and Rollbacks
Optimal resource allocation
Resource Versioning
Secrets management

As Free and Open Source Software(FOSS), Kubernetes can be self hosted, third-party hosted, or managed as it is hosted. Anthos is Google’s implementation of that designed to connect to the popular clouds and on-premises.

Kubernetes Cluster Architecture

Kubernetes is organized into nodes and masters. Masters usually only have one unless replicated or made highly available by whatever means. Nodes usually connect to masters but managed kubernetes options often group the nodes into node pools.

Default Node Pool

There is a default node pool with no toleration or taints specified, defaulted nodes will be added to this pool unless specified. In GKE node pools are specified when you provision the cluster. If using terraform your GKE module or resource ought to specify.

Kubernetes Workloads

Pods
Services
ReplicaSets
Deployments
Persistent Volumes
StatefulSets
Ingress
Node pool
CronJob

Pods are units of containers. Pods are basically containers if they only have one, but if there are many containers in a pod, consider them a dual headed container that shares networking.

Pods are ephemeral, their file systems are removed and recreated upon start up. Any stored data needs to be placed in storage via a volume and volumemount. Pods are deployed by the scheduler on nodes per no rules or specified rules.

ReplicaSets are controllers which scale pods up and down per specifications in the deployment.

Services are in-custer dns abstractions as proxies which route to to pods.

Deployments are controllers of pods running the same version of a container artifact.

PersistentVolumes are volumes requested from storage controllers, either CSI requests volumes from the cloud which attaches to a specific Kubernetes Node. Other types of volumes exist as different storage class attributes on the persistent volume.

PersistentVolumeClaims are the ways pods refer to a persistentvolume.

StatefulSets are like deployments in that they create pods, but the pods are always named the same consistent name with the replica number appended starting with zero.

Ingress objects define rules that allow requests into the cluster targeting a service. Some ingress gateways are capable of updating cloud dns entries directly while there’s always a docker image out there which will watch your public ips on your ingress load balancers and update Cloud DNS.

Node Pools are commonly labeled and generally of the same hardware class and size with the same disk geometry across nodes. One can run an NFS Ganesha storage controller from helm chart on a certain set of node pools using a shared volume on the instances. You can run one or two nodes in that pool and consider it a storage pool and then create another node pool that is your workload node pool, whose pods utilize the storage controller’s storage class. Kubernetes does the automatic job of connecting the NFS controller pods to the service pods. The controller pods can use PersistentVolumes of a more durable gcp default storage class which uses persistent disks.

Node pools and their labels allow pods to be configured with nodeAffinities and nodeSelectors among other ways of matching workloads to pools designed to handle their resource consumption.

Types of Clusters

Kubernetes Clusters come in two forms:

Standard
Autopilot

Standard is the most flexible but Autopilot is the easiest and requires the least management.

Feature	GKE Standard	GKE Autopilot
Zonal	🟢	🔴
Regional	🟢	🟢
Add Zones	🟢	⚪
Custom Networking	🟢	🔴 VPC native
Custom Version	🟢	🔴 GKE Managed
Private Clusters	🟢	🟢

Kubernetes Networking

Inside the cluster, networking is generally automatic. Outside the cluster, huge workloads, however, will often have to build node pools on top up subnets which are large enough for the NodePool to scale into.

Within the cluster service networking is handled by:

Ingresses: which stand up external load balancers that direct traffic at one of the services in the cluster.
Services
- ClusterIP, a private ip assigned to the vpc subnet that the cluster is using
- NodeIP, the ip of the node a pod is running within
- Pod IP, local private networks

Like the subnets of the nodepools, you’ll have to give pod subnets enough room to run your pods.

Service Types

Services can either be LoadBalancer for an external loadbalancer, ClusterIP for an ip that is only accessible within the cluster.

NodePort type services use an assigned port from the range 30000-32768 on the Node IP of the node that the pods which the service points to runs in.

LoadBalancers automatically create NodePort and ClusterIP resources and externally route traffic to them from a Cloud Provided LoadBalancer.

Load balancing across pods and containers is automatic, while service loadbalancing is external.

Cloud Run

Google Cloud Run is a serverless and stateless computing platform for container images. This product is ideal for deploying microservices and handling large scale data processing jobs. Cloud Run is highly scalable and can be deployed on demand.

You aren’t restricted to a set of runtime options, you build your runtime as a docker image and push to Google Artifact Registry or Google Container Registry. Google Cloud Run pulls the image and runs it.

::: tip Cloud Run Availability Google Cloud Run has regional availability. :::

If you app can only handle a single request or if that request uses most of the container’s resources, set its replica count to 1. You can set the maximum amount of requests a container can handle before it is killed and restarted. You can also adjust for avoiding cold starts by setting the minimum available count.

Each Cloud Run deployment is considered a revision and rollbacks when the latest revision is unhealthy is automatic. In fact, the health of a new revision is verified before traffic is sent to the most recent deployment. Each deployment in Cloud Run is a set of yaml syntax configuration that can live in a repo or inside Cloud Run itself. You can run gcloud against this file to issue new deployments or you can use command line options.

App Engine

App Engine is a serverless PaaS that runs on Google’s compute engine. It is fully managed, meaning you only need to provide your code. App Engine handles the rest, including provisioning servers, load balancing, and scaling.

Standard

App Engine Standard is a serverless environment that runs on Google’s compute engine. It is a fully managed PaaS that requires only code. There are no servers to manage. You simply upload your code and Google detects how to build it and runs it on App Engine.

Standard Code Environments

Python 2.7, Python 3.7, Python 3.8, Python 3.9, and Python 3.10.
Java 8, Java 11, and Java 17.
Node. js 10, Node. js 12, Node. js 14, Node. js 16.
PHP 5.5, PHP 7.2, PHP 7.3, PHP 7.4, and PHP 8.1.
Ruby 2.5, Ruby 2.6, Ruby 2.7, and Ruby 3.0.
Go 1.11, Go 1.12, Go 1.13, Go 1.14, Go 1.15, and Go 1.16.

App Engine Standard provides two types of instance classes or runtime generations: first-generation and second-generation. First-generation instance classes are legacy, while second-generation instance classes are offered for Python 3, Java 11 & 17, Node.js, PHP 7, Ruby, and Go >= 1.12. The F1 class is the default instance class and provides 600Mhz CPU limit and 256MB of memory. The maximum instances can have is 2048MB or ram and 4.8Ghz Compute speed.

First generation is provided for Python 2.7, PHP 5.5, and Java 8.

App Engine Flexible

App Engine Flexible allows you to customize the runtime via Dockerfile. This gives you the ability to modify the supported App Engine Flexible runtime and environment. You can also deploy your own custom containers. This makes it easy to scale your app and keep it running in a consistent environment.

Go
Java 8
dotnet
Node.s
PHP 5/7
Python 2.7 and 3.6
Ruby

Custom Runtimes

You can SSH into App Engine instances run custom docker containers and specify CPU and memory configuration. Other features include:

Health Checks
Automatically updated
Automatic replication of VM instances
Maintenance restarts
Root access

Use Cases

App Engine can be used for a variety of applications, from simple websites to complex applications that handle millions of requests. Some common use cases include:

Web applications: App Engine can host standard web applications written in languages like PHP, Java, Python, and Go.
Mobile backends: App Engine can be used to power the backend of mobile applications written in any language.
API services: App Engine can be used to build APIs that can be consumed by other applications.
IoT applications: App Engine can be used to build applications that collect and process data from IoT devices.
Data processing applications: App Engine can be used to build applications that process large amounts of data.

App Engine Flexible Key Differences from GCE

Flexible containers are restarted once a week
SSH can be enabled, but is defaulted to disabled
Built using cloud build
Settings controlled location and automatic collocation

App Engine includes a cron service, and deploys into many zones by default. App Engine is designed to run stateless workloads but you can write to disk on App Engine Flexible. App Engine provides task queues for a synchronous and background computing.

Anthos

Google Cloud Anthos is an advanced cloud computing service that provides the flexibility to run your containerized applications on-premise or in the cloud.

At its core, Google Cloud Anthos offers access to the benefits of the cloud without having to move all of your applications there. So you’ll be able to use the same tools, processes, and infrastructure you’re used to today—and still access the benefits of having a global platform.

Google Cloud Anthos offers security and privacy by design; it’s built with multi-factor authentication and encryption at all levels of data storage, from internal compute instances to external storage systems. It also has built-in threat detection capabilities that alert you when something seems fishy.

Google Cloud Anthos gives you access to powerful analytics features through its real-time reporting dashboard and machine learning algorithms that help you make better decisions based on data. And because everything runs in a virtual environment on Google’s worldwide network of datacenters, there are no limits on how many applications can run at once—so long as they’re all within one region or continent!

Anthos:

Centrally managed
Can use Version Control Based rollbacks
Centralizes infrastructure in a single view
Centralizes deployments and rollouts
Enables Code instrumentation(performance measurements) using ASM
Uses Anthos Service Mesh(ASM) for auth and cert based routing

::: tip Anthos is just Kubernetes designed to run in GCP, other cloud providers, and on-premises. :::

Anthos Service Mesh

Service meshes are patterns which provide common frameworks for intra-service communication. They’re used for monitoring, authentication, networking. Imagine wrapping every service in an identity aware proxy, that’s a service mesh. Difficult to set up initially, service meshes save time by defining systematic policy-compliant ways of communicating across infrastructure. Facilitating hybrid and multi-cloud communications is what Anthos Service Mesh does.

ASM is built on istio which is an open source service mesh. In a service mesh there is a control plane which configures sidecar proxies running as auxiliary services attached to each pod.

Anthos Service Mesh:

Can control the traffic between pods on the application and lower layers.
Collects metrics and logs
Has preconfigured Cloud Monitoring Dashboards
Service authentication with mutual TLS certificates
Encryption of communication with the Kubernetes Control Plane

ASM can be deployed in-cluster, across Compute VMs or via Managed Anthos Service Mesh. In-cluster options include running the control plane in kubernetes to manage discovery, authentication, security and traffic. With managed ASM Google managed the control plane, maintains it, scales it and updates it. When running istiod on Compute Engine, you can have instances in groups take advantage of using the service mesh. Anthos Service mesh only works on certain configurations for in-cluster VMWare, AWS EKS, GCP GKE and bare metal, while you must use an attached cluster if using Microsoft AKS.git

Multi-cluster Ingress

The Anthos Multi-Cluster Ingress controller is hosted on Google Cloud and enables load balancing across multi-regional clusters. A single virtual ip address is provided for the ingress object regardless of where it is deployed in your hybrid or multi cloud infrastructure setup. This makes your services more highly available, enables seamless migration from on-premises to the cloud.

The Ingress controller in this case is a globally replicated service that runs outside of your cluster.

Anthos Deployment Permutations

You can deploy anthos a number of ways depending on your needs and the features you would like to utilize. ASM and Anthos Config Management(ACM) are included in all Anthos deployments.

Traffic rules for TCP, HTTP(S), & gRPC
All HTTP(S) traffic in and out of the cluster is metered, logged and traced
Authentication and authorization at the service level
Rollout testing and canary rollouts

Anthos Config Management uses Kustomize to generate k8s yaml that configures the cluster. Yaml can be grouped into deployed services and supporting infrastructure. An NFS helm chart might be deployed to a cluster using ACM at cluster creation time to support a persistentvolume class of NFS within the deployment yaml.

ACM can be used to create initial kubernetes serviceaccounts(KSAs), namespaces, resource policy enforcers, labels, annotations, RBAC roles and role bindings. GKE Anthos deployments support a number of features:

Node auto provisioning
Vertical pod autoscaling
Shielded GKE Nodes
Workload Identity Bindings
GKE Sandboxes

ACM, ASM, Multi-Cluster ingress, and binary authorization also come with the GKE implementation of Anthos.

On-Prem Anthos GKE On-prem includes these features:

The network plugin
Anthos UI & Dash
ACM
CSI storage and hybrid storage
Authentication Plugin for Anthos
When running VMWare
Prometheus and Grafana
Layer 4 Load Balancers

Anthos on AWS includes:

ACM
Anthos UI & Dashboards
The network plugin
CSI storage and hybrid storage
Anthos Authentication Plugin
AWS Load Balancers

Attached Clusters which run on any cloud or On-prem have these features:

ACM
Anthos UI & Dash
Anthos Service Mesh

AI and Machine Learning

GCP offers several AI options and machine learning options. Vertex AI is an AI platform that offers one place to do machine learning. It handles development, deployment and scaling the ML models. Cloud TPUs are training accelerators for training deep networks.

Google also provides:

Speech-to-Text
Text-to-Speech
Virtual Agents
Dialogflow CX
Translation
Vision OCR
Document AI

Vertex AI

Vertex AI is basically a merger of two products: AutoML and the AI Platform. The merged Vertex AI provides one api and one interface for the two platforms. With Vertex you can train your models or you can let AutoML train them.

Vertex AI:

Supports AutoML training or custom training
Support for model deployment
Data labeling, which includes human assisted labeling training examples for supervised tasks
Feature store repo for sharing Machine Learning features
Workbench, a Jupyter notebook development environment

Vertex AI provides preconfigured deep learning VM images and containers.

Cloud TPU

Cloud TPU are Cloud Tensor Processing Units(TPUs) that are Google designed application specific integrated circuits(ASICs). They can train deep learning models faster than GPUs or CPUs. A Cloud TPU v2 can offer 180 teraflops, and a v3 420 teraflops. Groups of TPUs are called pods and a v2 pod can offer 11.5 petaflops while a v3 pod provides over 100 petaflops.

You can use Cloud TPUs in an integrated fashion by connecting from other Google services, for example, the Compute VM running a deep learning operating system image. TPUs come in preemptible form at a discount.

Dataflows and Pipelines

The model of the monolithic application is dead. It may be tempting to put your whole business on one web application but when an enterprise runs an application at scale, there are dozens of supporting applications that ensure reliability, applications which meter the availability, application code which deploys highly customized pipeline steps and standards, especially in the financial industry. At Enterprise scales, the pipeline or workflow steps have a Check to Action Ratio(CtAR) of probably 1 to 20. This means we’ll have about 20 checks, tests, tracking, metering, or logging steps to one step which actually makes a change like kubectl or cf push. And that’s just deployment.

To illustrate this dimension further there’s disaster recovery, durability, maintenance, ops and reporting all done as part of Continuous Deployment Standards. Therefore, each application is an ecosystem of standards and reporting.

Add to that that a company is often now an entire ecosystem of applications which work together, this is especially true for Internet of Things companies, for example. Some of these operations may have even been made auxiliary by leveraging some serverless functions, triggers, or webhooks.

Consider, for a moment, a vehicle insurance claim made on behalf of a driver by their spouse, the processing workflow of the claim might look like this:

Verifying that the spouse is on the policy and has access to file a claim.
Analyzing the damage and repair procedures and assigning a value to the damage
Reviewing the totals to make sure the repairs don’t exceed the value of the vehicle
Any fraud compliance reviews
Sending these interactions to a data warehouse for analysis
Sending the options and communications of circumstance to the claimant

Different applications monolithic or not will process this data in different ways.

If you buy a product online the inventory application may be a monolithic system or microservices, it may be separate or built into something else, but likely it is independent is some wise. A grocery story self checkout application would have to interact with this inventory application much like a cashier’s station. Each station is a set of services from the receipt printer to the laser scanner to the payment system. A simple grocery story transaction is not so simple and is fairly complex.

It is of key importance to consider the entire flow of data when designing for GCP.

Pub/Sub Pipelines

Cloud Pub/Sub is a giant buffer. It comes in regular and lite flavors. It supports pushing messages to subscribers or having subscribers pull messages from the queue. A message is a record or entry in the queue.

With push subscriptions, Pub/Sub makes and HTTP POST to a push endpoint. This method benefits when there is a single place to push in order to process the workload. This means its a perfect way to post to a Cloud Function, App Engine App, or Container.

Regarding pull subscriptions, services read the messages from the Pub/Sub topic. This is the most efficient method for processing large sets of messages within a topic. Pub/Sub works best when it is used as a buffer between communicating services which cannot have synchronous operations due to load, differences in availability, differences in resource pools serving the sending and receiving services. Consider a service that can quickly collect and send messages. It certainly uses less resources than the consuming services which has to do additional processing work on the messages. It is highly likely that at some point in time the sending service will be able to exceed the speed of the consuming service. Pub/Sub can bridge that gap by buffering the messages to the processing service. In a synchronous design, messages would be lost if there was no place for the sending service to put them. In this case Pub/Sub bridges the gap.

::: tip Pub/Sub is good for buffering, transmitting or flow controlling data. If you need to transform the data, Cloud Dataflow is the way to go. :::

Cloud Dataflow Pipelines

Cloud Dataflow is Apache Beam stream processing implemented as a fully managed Google Cloud Platform service. Normally you’d have to provision instances of this service on virtual machines, but Google managed the entire infrastructure for this service and maintains its availability and reliability.

The service works via processing code written in Python, Java or SQL. Code can be batch or stream processed. You can combine services and send the output from Dataflow into Dataproc or BigQuery or BigTable and so forth. Dataflow is organized into pipelines that are designed to tackle the work of the part of the app that comes after ingests data, but otherwise can be used anywhere Apache Beam is used in applications.

Cloud Dataproc

Dataproc is managed Spark + Hadoop. This is for stream / batch processing and machine learning at the largest magnitudes. Dataproc clusters are stood up and taken down quickly so they’re often treated as ephemeral after they produce batch results. Obviously a stream processing effort may run all the time, but if the stream is some sort of live data from an occasional event, like Olympics score data or Sports, can create the need for ephemeral clusters in either case.

Dataproc is already integrated with BigQuery, BigTable, Cloud Storage, Cloud Logging, Cloud Monitoring. This services replaces on-premises clusters in a migration.

Cloud Workflows

Workflows are HTTP api services and workflows. In conjunction with Cloud Run, Cloud Functions, GitOps webhooks, Cloud Build Triggers and so forth, you can accommodate any business and technical requirements. You set them up as yaml or json steps.

You can trigger a workflow to make several api calls in sequence to do a workload. Workflows do not perform well processing data, rather they do smaller actions in a series well. You wouldn’t use workflows to make large http POST calls.

Cloud Data Fusion

Another managed service, Cloud Data Fusion is based on something called Cask Data Application Platform (CDAP), which Atlassian defines as “a developer-centric middleware for developing and running Big Data applications. Before you learn how to develop and operate applications, this chapter will explain the concepts and architecture of CDAP.”

This platform allows the ELT pattern of extraction, load, and transform as well as the ETL pattern of extraction, transformation, load. It allows this without any coding. CDAP allows drag and drop interfaces as a no-code development tool that has around 200 connectors and transformations.

Cloud Data Fusion instances are deployed as one in three versions: developer, basic, and enterprise.

Developer	Basic	Enterprise
low cost but limited	visual editor, preloaded transformations, and an SDK	streaming, integration, high availability, triggers and schedules

Cloud Composer

Composer is basically a managed instance of Airflow which is a workflow coordination system that fires off workflows of a specific type: directed acyclic graphs (DAGs), which are python definitions of nodes and their connections. Here is an example:

import networkx as nx
graph = nx.DiGraph()
graph.add_edges_from([("root", "a"), ("a", "b"), ("a", "e"), ("b", "c"), ("b", "d"), ("d", "e")])

DAG example

example from Munging Data

These DAGs are stored in Cloud Storage and loaded in to Composer. Google gives this example on the Cloud Composer Concepts Page:

overview dag and tasks

Figure 1. Relationship between DAGs and tasks

Airflow includes plugins, hooks. operators, and tasks. Plugins are combinations of hooks and operators. Hooks are third party interfaces and operators define how tasks are run and can combine actions, transfers, and sensor operations. Tasks are work done symbolized as one of these nodes in the DAG.

Upon execution of a DAG, logs are stored in a Cloud Storage bucket. Each task has its own log and streaming logs are available.

Compute Systems and Provisioning

You can provision compute services via the console or via terraform. You can run terraform in Cloud Build or in Deployment Manager. Using Terraform allows you to perform GitOps on the processes surrounding version control, integration, pull requests and merging code. Branching strategies allow segmentation of environments. Multiple repositories can be combined into project creation code, infrastructure creation code, access granting code and its best to run all this as a privileged but guarded service account. Enterprises will use a series of layers of access, projects, folders and organizations in complex networks of infrastructure as code. It can all be pulled together using terraform modules, cloud build triggers and repository and project layering.

Compute Design problems

The key concerns when designing services that rely on compute systems are configuration, deployment, communication between services, data flows and monitoring and logging.

State

Inside the application you’ll have to work out how state will be stored either in a shared volume or in a distributed manor among your instances. This kind of design decision can leverage Cloud Storage or Persistent Volumes. Another problem is how to distribute state among instances. There are several means of doing this mathematically using modulo division on some unique attribute. You could also use aggregate level IDs.

You get around this by using things like Redis for session data, shared storage options and you make your app itself stateless in its core but know how to connect to where state information is stored. Running two replicas of Nextcloud containers requires state data be shared somehow or when you login to one, your round robin connection to the other will present you with another login screen. The browser will not be able to maintain the session data of two sessions when there’s one and therefore the disparity between the replicas will prevent the application from functioning.

So in memory caches bridge the gap between different instances. Wordpress for instance, is completely stateless(when you use Storage Bucket Media Backends) as it keeps all session and any other state data in the database so a memory cache is not needed.

Async vs Synchronous

Synchronous strategies are used when data can’t be lost. NFS mounts can be mounted async or sync, for instance. Synchronous setups require lightening fast networks that are fast than the disks involved with low to no latency and probably nothing else on the network. Otherwise if that’s not the case your system will try to save a file and will wait for the network to respond before it lets the process move on to other tasks. When a VM or bare-metal system has processes which have to wait on a slow network, the processes stack on top of each other increasing load. Load exponentially reduces a systems ability to respond to requests. Synchronous NFS systems on slow networks crash and so people can’t and therefore don’t use them.

These problems are universal across all independent systems that need to communicate over means that involve variable speeds. With Google’s premium network, however, the problem will always be rather load than network speed. Scaling ingestion, for instance, will resolve synchronous problems.

However, services like Pub/Sub can make this process asynchronous, relaxing some of the stress and impact on on such a system’s costs and reliability.

Credit card transactions are synchronous as well as maybe a bitcoin mining operation.

Overview

The most popular options provided by Google Compute Engine that cover a wide variety of use-cases include:

Dataprocessing and Workflow options include:

Exam Essentials

Know when to use particular compute services
Know all the optional features of these services
Know the differences between App Engine Standard and Flexible
Know when to use Machine Learning and Data workflows and pipelines
Understand the features of different Anthos clusters: EKS, AKS, GKE, Attached
Know Kubernetes features

Official Resources

Designing Solutions for Technical Requirements

Sep 18, 2022

Christopher Shaun Godwin

Author

High Availability

High availability is a key characteristic of any reliable system, and is typically measured by what is known as the “99999” rule. This rule states that a system must be operational 99.9999% of the time in order to be considered highly available. This equates to a maximum downtime of just over 5 minutes per year. In order to achieve such a high level of availability, a system must be designed and implemented with care, and must be constantly monitored and maintained. Additionally, a high availability system must have a robust service-level agreement (SLA) in place in order to ensure that the system meets the required availability levels.

::: tip The best general strategy for increasing availability is redundancy. :::

Table of Availability SLAs and Downtime

% Uptime	Downtime / Day	Downtime / Week	Downtime / Month
99	14 m 24 s	1h 40m 48s	7h 18m 17s
99.9	1m 26s	10m 4s	43m 49s
99.99	8s	1m	4m 22s
99.999	864 ms	6s 500ms	26s
99.9999	86 ms	604 ms	2s 630ms

When it comes to SLAs and account for hardware failures, it is important to consider network equipment and disk drives. Hardware failures can often be caused by a variety of factors, including physical damage, overheating, and software issues. By having a plan in place for how to deal with these failures, you can help minimize the impact on your business.

One way to prepare for hardware failures is to have a redundancy and a backup plan for your equipment. This way, if one piece of equipment fails, you can quickly switch to another while still running. The work of a cloud business with a 5 9s SLA is to statistically predict disk drive failures overall and plan redundancy and recover procedures. This way, if a drive fails, you actually never know there’s a problem.

::: danger Failure Stack

Application Bugs
Service problem
DB Disk Full
NIC Fails
Network fails
Misconfiguration of infrastructure or networks :::

One way to mitigate the errors that can occur during deployment and configuration is to test thoroughly before making any changes. This can be done by creating staging or lower environments that are identical to the production environment and testing all changes in it before deploying them to production. Canary deployments are another way to mitigate errors. With canary deployments, changes are first deployed to a small subset of users before being rolled out to the entire user base. This allows for any errors to be detected and fixed before they impact the entire user base. Regression testing can also be used to mitigate errors. This is where changes are tested not only in the staging environment, but also in the production environment.

Continuous deployment and continuous verification are two key concepts in minimizing downtime for deployments. By continuously deploying code changes and verifying them before they go live, we can ensure that only working code is deployed and that any issues are caught early. This minimizes the amount of time that our systems are down and keeps our users happy.

Compute Availability

Google Compute Engine is the underlying provider of the following services:

GCE VMs
GKE Masters and Worker Nodes
App Engine Applications
Cloud Functions

The process of meeting your availability needs using each of these services is slightly different for each one.

High Availability in Compute Engine

Hardware Redundancy and Live Migration

On the lowest level, much of the servers at Google have levels of redundancy. If a server fails for hardware issues, others are there for failing over to while others are booted up to replace redundancy.

Google also live migrates VMs to other hypervisors like it does when power or networks systems fail or during maintenance activities which have a real impact on hypervisors.

::: warning Live Migration

Live migration isn’t supported for the following VMs:

Confidential VMs
GPU Attached VMs
Cloud TPUs
Preemptible VMs
Spot VMs

:::

Managed Instance Groups

Managed Instance Groups(MIGs) create groups or clusters of virtual machines which exist together as instances of the same VM template.

Instance Templates A VM template looks like this:

POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/global/instanceTemplates

Here is what you’re posting before you make replacements:

{
  "name": "INSTANCE_TEMPLATE_NAME"
  "properties": {
    "machineType": "zones/ZONE/machineTypes/MACHINE_TYPE",
    "networkInterfaces": [
      {
        "network": "global/networks/default",
        "accessConfigs":
        [
          {
            "name": "external-IP",
            "type": "ONE_TO_ONE_NAT"
          }
        ]
      }
    ],
    "disks":
    [
      {
        "type": "PERSISTENT",
        "boot": true,
        "mode": "READ_WRITE",
        "initializeParams":
        {
          "sourceImage": "projects/IMAGE_PROJECT/global/images/IMAGE"
        }
      }
    ]
  }
}

Or with gcloud

gcloud compute instance-templates create example-template-custom \
    --machine-type=e2-standard-4 \
    --image-family=debian-10 \
    --image-project=debian-cloud \
    --boot-disk-size=250GB

And then instantiate the instance template into a group.

gcloud compute instance-groups managed create INSTANCE_GROUP_NAME \
    --size SIZE \
    --template INSTANCE_TEMPLATE \
    --zone ZONE

What makes it work well is that when a VM fails in the group, it is deleted and a new one created. This ensures the availability of the group.

Managed Instance Groups(MIGs) can be zonal, regional and can be autoscaled. Their traffic is load balanced and if one of the instances are unavailable the traffic will be routed to the other instances.

Multiple Regions and Global Load Balancing

Instance group’s top level is regional. You can however run many multizonal MIGs in different regions and balance them with a regional load balancer. Workload is distributed across all MIGs to each of the regional LBs. If one or more of the MIGs becomes unavailable, the global LB will exclude them from routing.

Users will be connected by the global load balancer(LB) to their closest region reducing latency.

High Availability in Kubernetes Engine

Kubernetes by default and if uses correctly provides high availability for containers and orchestrates their replication, scaling up, scaling down, container networking, service ingress. This enables canary, blue green and rollout deployments for further reliability testing.

GKE has an extra layer of availability on top of that which is provided by Kubernetes(k8s). Node pools are Managed Instance Groups of VMs running Kubernetes nodes.

Kubernetes monitors pods for readiness and liveness. Pods in k8s are replica sets of containers. Usually a pod has one container defined but often might have a sidecar or binary container pattern. Different containers in the same pod can communicate with IPC, network over localhost, or by volume. You cannot share the individual sockets but you can share the whole socket directory if you have permissions on the environment.

::: info For example PHP-FPS might need to run with the webserver it is coupled with. The nginx webserver would be configure similar to this:

        upstream webapp {
            server 127.0.0.1:9000;
        }

The would both share 127.0.0.1. :::

If one of the containers in a pod crashes, the restartPolicy directive tells k8s what to do.

Because Managed Instance groups are zonal or multizonal(regional), Kubernetes clusters are also zonal and multizonal(regional). Regional clusters have their control planes replicated across zones so if a control plane goes down, it hasn’t lost availability.

High Availability in App Engine and Cloud Functions

These services experience automatic high availability. When running these services, the items in the failure stack to worry about involve deployment, integration concerns, application failures.

High Availability Computing Requirements in Case Studies

Recall our case studies

EHR Healthcare needs a highly available API service to meet the business requirement of “entities will need and currently have different access to read and change records and information”. This is essential as it is an external-facing service for customers, vendors, and partners.
HRL requires high availability for its real-time telemetry and video feed during races to enhance the spectator experience. This is crucial to ensure uninterrupted live streaming of races.
A high availability analytics solution is needed to gain insights into viewer behavior and preferences. This will ensure uninterrupted access to critical viewer data for business decision-making.
The archival storage for past races also needs to be highly available for on-demand viewing by fans and analysts.
High availability is vital for the online video games developed by Mountkirk Games. This is necessary to ensure a seamless gaming experience for players across the globe.
The high scores and player achievements system also require high availability to record and display player scores and achievements in real time.
The user data collection system for personalizing the gaming experience needs to be highly available to collect and process user data efficiently.
For TerramEarth, high availability is essential for their IoT sensor data system, which provides crucial data for improving their products and services.
The migration of their existing on-premises data infrastructure to the cloud needs to ensure high availability to prevent any disruption to their operations.
The data analytics solution for deriving insights from sensor data also requires high availability to ensure continuous access to valuable business insights.

Storage Availability

Storage is considered Highly available when it is available and functional at all times.

GCP Storage Types

Object storage
block storage
Network attached storage
Database services
Caching

Availability refers to the quality belonging storage that its contents are retrievable right now. Durability, on the other hand, refers to the long term ability of the data to be in tact and to stay retrievable.

Availability of Different Storage Types

Object Storage

Cloud Storage is entirely managed service for storing objects, files, images, videos, backups, documents, and other unstructured data. It is always highly available as a managed service.

File storage

Cloud Filestore is a NAS that is fully managed and thus Google ensures it is highly available.

Block Storage

Persistent disks are disks that are attached to VMs but remain available after those VMs are shutoff. They can be used like any local hard drive on a server so they can store files and database backends. PDs are also highly available because they can be resized while in use. Google offers different types of persistent disks:

	Standard	Balanced	SSD	Extreme
Zonal	reliable block storage	reliable blk storage with higher IOPS	better IOPS than Balanced	Highest IOPS
Regional	PDs replicated across 2 zones within a region	dual zone replicated higher IOPS	dual zone replicated better IOPS	N/A

Better performance leads to higher costs as does going from a zonal PD to a regional PD.

Zonal Persistent Disks with a standard IOPS have a 4 9s durability(99.99%), while all the others have a 5 9s uptime(99.999%).

Availability of Databases

Self-Managed Databases

If you run your own database on a virtual machine topology, ensuring these systems are redundant is the key to managing your own database availability. The underlying db software will affect how you plan for availability in a architectural design.

For example, MySQL or MariaDB usually use master and replicas. You may want to set up a few regional sql proxy hosts and a global LB to them all to provide an endpoint for the app to all of these. Making your db cluster multiregional and therefore multizonal would involve considering the cost of network traffic, latency, consistency.

In each different sql server case you’ll have to decide if it is best to try to share a disk between active and inactive servers, filesystem replication to a standby system, or to use multimaster replication. You could also use vitesse to create your own globally available MySQL server either with containers or with virtual servers.

Or you could use Cloud SQL selecting a highly available cluster during creation and not worry about it. You could use Cloud Spanner for guaranteed consistency.

Managed Databases

HA by Default:

Firestore
BigQuery
Cloud Spanner

Have HA Options:

Cloud SQL
Bigtable

With services that have High Availability through setup or configuration, it is important to remember that seeking greater availability, say going from 3 9s to a 4 9s SLO, will cost more.

Availability of Caching

Caching is storing the most important immediate use data in low latency services to improve retrieval and storage speed. For example, using a high performance SSD on a raid array as the cache, or a redis server. Google’s managed caching service is made highly available.

::: tip Memcached and redis are supported by Google’s Cloud Memory Store. :::

High Availability Storage Requirements in Case Studies

EHR HealthCare’s active data available through the API will need to be highly durable and highly available at all times. Thier databases should take advantage of a managed database sorage solutions.
HRL needs highly durable storage for retaining permenant videos of races using archive class object storage. They also need always available storage for serving the most recent videos to audiences on their website. If transcoding is intense you might consider an extreme IOPS or SSD but a Regional SSD will have better availability. You might transcode locally and copy to an available drive.
Mountkirk will need durable and highly available Big Table as well as Firestore or Firebase Realtime Database. They can achieve this as these services are fully managed. If they required some durable volume space to share among gaming servers, highly durable Regional Balanced PDDS with backups. Their billing will be supported by Cloud Spanner.
TerramEarth will have highly available storage in BigQuery.

Network Availability

Using premium tier networing and redunant networks, you can increase network availability. If one interconnect is down, often a second will provide protection against connectivity loss. Interconnects have a minimum of 10Gbps and traffic does not cross the public internet. When crossing the internet is not a problem, Google offers and HA VPN which has redundant connections and offers a 4 9s(99.99%) uptime SLA.

Communication within Google usually uses their low latency Premium Network teir which doesn’t cross the internet and is global. Standard networking tiers will not be able to use this global network and so cannot take advantage of global load balancing. Communications within the cloud on the Standard Networking tier do cross the internet.

High Availability Network Requirements in Case Studies

Since networking requirements are not often specified, the Architect should analyze the requirements, ask questions and suggest the most cost effective solution which meets the needs of the requirements both business and technical.

Application Availability

Application Availablility is 3 parts infrastructure availability(network, storage, and compute), but its 1 part reliability engineering in the application design, integration and deployment. Logging and Monitoring is the most appropriate way to handle availability unknowns in the application. Technical and Development processes iterate over the logs and alerts in order to achieve their reliability SLOs within the application.

::: tip Add Cloud Monitoring with alerts as part of your availability standards to increase application and infrastructure reliability. :::

Scalability

This is the ability to add or remove resources based on load and demand. Different parts of the cloud scale differently and efficienly.

Managed Instance Groups, for instance, increase and decrease the amount of instances in the group.
Cloud Run when no one is requesting a resource, scales replicas of containers down to 0.
Unstructured Databases scale horizontally making consistency the main concern.

Stateless apllications can scale horizontally without additional configuration or without each unit needing to be aware of the other. Stateful applications, however, generally scale vertically but can scale horizontally with certain solutions:

Putting session data into a Redis cache in Cloud Memorystore
Shared volumes
Shared Database such as Cloud SQL

Resources of different flavors scale at different rates based on needs. Storage might need to scale up once a year while compute engine resources might scale up and down every day. Subnets do not auto scale so when creating a GKE cluster you’ll have to configure its network to handle the scaling of the node pool.

::: tip Scale database servers by allocating higher cpu and memory limits. This way, non-managed relational database servers often can handle pead load without scaling. :::

If you decouple your services which need to scale, they can scale separatley. For example, if your mail server system is a series of services on a VM like postfix, dovcot and mysql, to scale it you’d have to scale the whole VM. Alternatively, decoupling the database from your VM allows you to have more hosts that use the same information with a shared volume. Containerizing each process in the mail server, however, will allow you to scale each customer facing service to the exact appropriate level at all times.

::: warning Scaling often depends on active user count, request duration, and total memory/latency per process/thread. :::

The only network scaling you might do with GCP is increasing your on-premises bandwidth to GCP by increasing the number of interconnects or try an additional VPN over an additional internet connection.

Scaling Compute Resources

Google Compute Engine, Google Kubernetes Engine supports autoscaling while App Engine and Cloud Functions autoscale out of the box.

Scaling Compute in GCE

MIGs will scale the number of instances running your application. Statefully configured VMs cannot autoscale. Unmanaged instance groups also cannot autoscale. Compute instances can scale by CPU utilization, HTTP Load Balancing utilization, and metrics monitored with monitoring and logging.

Autoscaling policies define targets for average CPU use, this is compared to the data collected in the present and if the target is met, the autoscaling policy will grow or shring the group.

Autoscalers can make decisions and recommend a number of instances based on the metrics it is selected to use. You can autoscale based on time schedules and specify the capacity in the schedule. The Scaling schedule will operate at a start time, for a duration, with configuration about requency to reoccur. This enables to you skip slow days in the schedule. Use this option for predictable workloads which may have a long startup time. When using autoscaling with processes that have a long start, often the request times out before the scaling is completed. It is important that you use the appropriate scaling strategy to match what you’re dealing with.

When MIGs are scaled in or down, they can be set to run a script upon shutdown with a best-effort with no gaurentees. If this script is doing quick artifact collection, it will probably run. If it is doing a heavy shutdown workload, it may stall or be killed.

::: danger Cannot Autoscale

Stateful instance workloads
Unmanaged instance groups :::

Scaling Compute in GKE

Containers with sidecars or containers that run in the same pod will be scaled up and down together. Deployments specify replicasets which are sets of identically configured pods with a integer for a replica count. You can scale a deployment up from 1 to any number your worker nodes support.

Kubernetes autoscaling is split horizon, scaling the cluster and scaling what is in the cluster. Node pools are groups of nodes which have the same configuration. If a pod is deployed into a node pool that has no more resources, it will add another node to the pool.

Specifying the minimum and maximum number of replicas per depoyment with resource targets like CPU use and a threashold, in cluster scaling operates effortlessly.

Scaling Storage Resources

GCP uses virtualized storage, so a volume may not be a physical disk.

Locally attached SSD on VMs which aren’t persistent are the least scalable storage option in GCP. Preemptible VMs volumes are cleaned when VMs are preempted.

Zonal and regional persistent idsks and persistent SSDs are scalable up to 64TB while increasing performance is a matter of provisioning and migrating to a new disk with a higher IO operations per second(IOPS). Once you add a disk to a system, you have to use that systems commands to mount it and make it available for use. You may also have to sync data to it and remount it in the place of a lower performing disk. This isn’t scaling and it isn’t automatic but is often required planning to grow a design beyond its limits.

All managed services either automatically scale or must be configured to do so. BigQuery, Cloud Storage, Cloud Spanner, to name a few, provide scalable storage without effort. Big Query charges by data scanned. So if you logically partition the data by time, you can avoid scaling costs up when you scale your workload. Scanning only the last weeks of data will enable BigQuery to improve query time.

Network Design for Scalability

When designing connections from GCP with VPNs or interconnects, you need to plan for peak, or peak-plus-twenty(peak + 20%). Check with your provider as you may only be charged for traffic or bandwidth actually used.

Reliability

Reliability is repeatable consistency. Try/Catch statements are an example of reliabiity in code. If your app does the same thing all the time, but only under the circumstance it was developed in but not all the circumstances it was designed for it sn’t reliable. Another example of reliability is when an applications uses methods of quietly reconnecting to a database in the case of bandwidth issues.

Reliability is a specific part of availability which hovers around human error. Reliability Engineering is the practice of engineering to have your workload run consistently under all the circumstances which it will face within the scope of its support and design, or within the scope of what’s normal and reasonable.

To measure reliability, one measures the probability of failure and then tries to minimize it to see if they can have an affect on that measurement. This involves defining standards, best practices, identifying risk, gracefully deploying changes.

It is important to be throughly versed in your workload’s dependencies, their dependencies and the teams or organizations which provide those and the documentation produced by those entities. Knowing these trees will make the difference in the successful reliabiliy of a design.

Measuring Reliability

Uptime is one way to measure reliability, percentage of failed deployments to production to successful deployments is another. All of that shit should be wored out in lower environments. Other metrics may need to be logged or cataloged and placed in a report or dashboard for regular collection. Number of failed requests that didn’ return 200 versus number of successful requests. Each workload will have different reliability measurements. A set of microservices that together create a mail server will want to measure delierability and mail loss from the queue. You’ll have to design around these metrics.

Reliability Engineering

The design supports reliability in the long run by:

Identifying the best way to monitor services
Deciding on the best way to alert team and systems of failure.
Consider incedence response procedues those teams or systems will trigger
Implement tracking for outages, process introspection, to understand disruptions

Emphasize issues pertaining to management and operations, decide whose responsibilities are whose.

Exam Essentials

Be able to contrast availablility scalability, reliability, and availablility
Know how redundancy improves availability
Rely on managed services to increase availability and scalability
Understand the availability of GCE Migs and GKE globally loadbalanced Regionally replicate clusters
Be able to link reliability to risk mitigation

Official Resources

Designing and Planning GCP Solutions for Business Requirements

Sep 14, 2022

Christopher Shaun Godwin

Author

Key Considerations

Business Use Case & Product Strategy
Cost Optimization
Dovetail with Application Design
Integration with External Systems
Movement of Data
Security
Measuring Success
Compliance and Observability

Business Use Cases and Product Strategy

Business requirements dictate technical requirements implicitly. From statements like:

EHR HealthCare

Business Requirements

EHR Healthcare provides B2B services to various entities, including vendors, insurance providers, and network directories.
Different entities will need to have varying levels of access to read and modify records and information. This implies the need for a robust access control system.
Given the nature of their work, EHR Healthcare needs to ensure that their services are always available. High availability is thus a core business requirement.
Some of the information that entities will access is regulated, so compliance with relevant data protection and privacy laws is a must.
Confidentiality is crucial since EHR Healthcare deals with sensitive health data.
The company wants to track the number and type of data accessed and gain insights into trends. This suggests a need for a comprehensive analytics solution.
Different entities involved possess varying levels of expertise, which might require the development of user-friendly interfaces or provision of training for the effective use of EHR Healthcare’s systems.

::: tip Minimal Effort Predictions Cloud AutoML is a cloud-based tool that allows developers to train machine learning models with minimal effort. It is designed to make the process of training machine learning models easier and faster. Cloud AutoML is based on the Google Cloud Platform and offers a variety of features that make it a powerful tool for machine learning. :::

Technical Requirements

A publicly exposed API or set of APIs needs to be developed to facilitate interactions between various entities.
Access restrictions must be applied at the API level to adhere to the varying access rights of different entities.
There will be involvement of legacy systems due to insurance entities. This implies the need for systems integration or migration strategies.
Redundant infrastructure is required to ensure high availability and continuous operation of the services.
Data lifecycle management must be implemented, considering regulation, insights, and access controls.
Given the nature of their work, EHR Healthcare needs to employ Cloud Machine Learning to build insight models faster than they can be planned and built. This indicates a requirement for machine learning capabilities in their infrastructure.

Mountkirk Games

Business Requirements

Mountkirk Games develops and operates online video games. They need a robust and scalable solution to handle high scores and player achievements.
They aim to collect minimal user data for personalizing the gaming experience, complying with data privacy regulations.
The solution must be globally available to cater to their worldwide player base.
They seek low latency to ensure a smooth and responsive gaming experience.
Mountkirk Games expresses interest in Managed services which can automatically scale to meet demand.

Technical Requirements

A globally available high score and achievement system is needed to keep track of player progress and milestones.
User data needs to be collected and processed in a manner that is privacy-compliant and secure.
The system must provide low latency to ensure a seamless gaming experience, which may require a global distribution of resources.
Managed services can be used to handle automatic scaling, reducing the overhead of manual resource management.

::: tip Business to Technical Requirements When designing a new project, while collecting and studying business requirements, you’ll have to translate those into technical requirements. You’ll find that there’s not a one to one relationship. One technical solution may meet two business requirements. While one business requirement might encapsulate several solutions. :::

TerramEarth

Business Requirements

TerramEarth manufactures heavy equipment for the construction and mining industries. They want to leverage their extensive collection of IoT sensor data to improve their products and provide better service to their customers.
They aim to move their existing on-premises data infrastructure to the cloud, indicating a need for a comprehensive and secure cloud migration strategy.

Technical Requirements

IoT data needs to be ingested and processed in real-time. This involves creating a robust pipeline for data ingestion from various IoT devices, and real-time data processing capabilities.
A robust data analytics solution is needed to derive insights from the sensor data. This requires the deployment of big data analytics tools that can process and analyze large volumes of sensor data.
A migration plan is needed to move existing data and systems to the cloud.
This involves choosing the right cloud services for storage, computation, and analytics, and planning the migration process to minimize downtime and data loss.

::: tip Extract, Transform, Load It is what it says. It takes large volumes of data from different sources. Transforms it to useable data, and makes available the results somewhere for retrieval by others.

Cloud Datafusion handles these tasks for data scientists and makes it easy to transfer data between various data sources. It offers a simple drag-and-drop interface that makes it easy to connect to different data sources, transform and clean data, and load it into a centralized data warehouse. Cloud Datafusion is a cost-effective solution for businesses that need to quickly and easily integrate data from multiple sources. :::

Helicopter Racing League

Business Requirements

The Helicopter Racing League (HRL) organizes and manages helicopter races worldwide. They aim to enhance the spectator experience by providing real-time telemetry and video feed for each race.
HRL wants to archive all races for future viewing on demand. This will allow fans and analysts to revisit past races at their convenience.
A robust data analytics solution is required to gain insights into viewer behavior and preferences. This will help HRL understand their audience better and make data-informed decisions to improve the viewer experience.
The solution must be highly available and scalable to handle spikes during race events. This is essential to ensure a seamless live streaming experience for viewers, regardless of the number of concurrent viewers.

Technical Requirements

Real-time data processing capability is needed to handle race telemetry data. This involves setting up a system that can ingest and process high volumes of data in real time.
A scalable video streaming solution is needed to broadcast races worldwide. This system must be capable of handling high video quality and large volumes of concurrent viewers without degradation of service.
Archival storage is needed for storing race videos for on-demand viewing. This involves choosing a storage solution that is cost-effective, secure, and capable of storing large volumes of video data.
An analytics solution is needed for analyzing viewer behavior and preferences. This requires the deployment of data analytics tools that can process and analyze viewer data to provide actionable insights.

Application Design

Business requirements will affect application design when applications are brought into the cloud. In every set of requirements, stated or unstated will be the desire to reduce cost.

Licensing Costs
Cloud computing costs
Storage
Network Ingress and Egress Costs
Operational Personnel Costs
3rd Party Services Costs
Sanctions on missed SLA costs
Inter-connectivity charges

These contribute to the Total Cost Ownership(TCO) of a cloud project.

Managed Services

Google has a set of managed services like Cloud SQL which remove the low level work from running these services yourself.

Some of these include:

Compute Engine
- Virtual machines running in Google’s data center.
Cloud Storage
- Object storage that’s secure, durable, and scalable.
Cloud SDK
- Command-line tools and libraries for Google Cloud.
Cloud SQL
- Relational database services for MySQL, PostgreSQL, and SQL Server.
Google Kubernetes Engine
- Managed environment for running containerized apps.
BigQuery
- Data warehouse for business agility and insights.
Cloud CDN
- Content delivery network for delivering web and video.
Dataflow
- Streaming analytics for stream and batch processing.
Operations
- Monitoring, logging, and application performance suite.
Cloud Run
- Fully managed environment for running containerized apps.
Anthos
- Platform for modernizing existing apps and building new ones.
Cloud Functions
- Event-driven compute platform for cloud services and apps.
And dozens more.

To see an exhaustive list, please see My List of All GCP Managed Services

::: tip Reducing Latency on Image Heavy Applications Google Cloud CDN is a content delivery network that uses Google’s global network of edge locations to deliver content to users with low latency. It is a cost-effective way to improve the performance of your website or web application by caching static and dynamic content at the edge of Google’s network. Cloud CDN can also be used to deliver content from your own servers, or from a content provider such as a CDN or a cloud storage service.

Using Google’s Cloud CDN in combination with multi-regional storage will reduce load time. :::

Reduced-level Services

Many times when computing needs are considered, certain services with availability requirements lower than others can benefit from reduced-level services. If a job that must be processed can have those processes paused during peak times but can otherwise run normally, it can be preempted.

Reduced level services:

Preemptible Virtual Machines
Spot VMs
Standard Networking
Pub/Sub Lite
Durable Reduced Availability Storage

Preemptive VMs

Preemptible VMs are shutdown after 24 hours and Google can pause them at any time. Running process on those vms do not stop but they slow to a crawl and speed back up when services become available. You can write a robust application by setting it up to detect the preemptions. These VMs cost 60-90% or so less than their standard counterparts.

Preemptible VMs also get discounts on volumes and GPUs. Managed resource groups will replace a preempted VM when it is suspended after 24 hours. Preemptible VMs can use other services to reduce the overall cost of using those services with VMs.

::: warning Live Migration Preemptible and Spot VMs are not eligible for live migration. :::

Spot VMs

Spot VMs are the next generation Preemptible virtual machine. Though spot VMs are not automatically restarted, they can run for longer than 23 hours. Spot VMs can be set to a stopped state or be deleted on preemption. With a managed resource group of spot VMs, one can set the VMs to be deleted and replaced when resources are available.

Standard & Premium Networking

Premium Networking is the default, but Standard Tier Networking is a lower performing option. With Standard Tier Networking, Cloud Load Balancing is only regional load balancing and not global balancing. Standard Networking is not compliant with the global SLA

Pub/Sub Lite vs Pub/Sub

Pub/Sub is an extremely scalable but Pub/Sub Lite can be scaled providing lower levels of cost-effective service.

Pub/Sub come with features such as parallelism, automatic scaling, global routing, regional and global endpoints.

Pub/Sub Lite is less durable and less available than Pub/Sub. Messages can only be replicated to a single zone, while Pub/Sub has multizonal replication within a region. Pub/Sub Lite users also have to manage resource capacity themselves.

But if it meets your needs, Pub/Sub Lite is 80% cheaper.

App Engine Standard vs Flexible

App Engine Standard allows scaling down to zero, though the trade offs are that you can only use a set of languages, can only write to /tmp with java, can’t write with python. Standard apps cannot access GCP services, cannot modify the runtime, or have background processes, though they can have background threads.

Durable Reduced Availability Storage(DRA)

These are buckets which have an SLA of 99% availability instead of equal to and greater than 99.99% availability. Storage operations are divided into class A and class B operations:

API	Class A($0.10*/10,000 ops)	Class B($0.01*/10,000 ops)
JSON	storage.*.insert1	storage.*.get
JSON	storage.*.patch	storage.*.getIamPolicy
JSON	storage.*.update	storage.*.testIamPermissions
JSON	storage.*.setIamPolicy	storage.*AccessControls.list
JSON	storage.buckets.list	storage.notifications.list
JSON	storage.buckets.lockRetentionPolicy	Each object change notification
JSON	storage.notifications.delete
JSON	storage.objects.compose
JSON	storage.objects.copy
JSON	storage.objects.list
JSON	storage.objects.rewrite
JSON	storage.objects.watchAll
JSON	storage.projects.hmacKeys.create
JSON	storage.projects.hmacKeys.list
JSON	storage.*AccessControls.delete
XML	GET Service	GET Bucket (when retrieving bucket configuration or when listing ongoing multipart uploads)
XML	GET Bucket (when listing objects in a bucket)	GET Object
XML	POST	HEAD

* DRA Pricing

Data Lifecycle Management

Sort your data along a spectrum of most frequent to infrequent use. Spread your data along the following:

Memory Caching
Live Database
Time-series Database
Object Storage
- Standard
- Nearline
- Coldline
- Archive
Onprem, Offline storage

Objects have a storage class of either standard, nearline, coldline, or archive. Storage classes can be changed on single objects along this direction. You cannot move a storage class to a more frequent use class, only the opposite. You can move frequency lower on an object.

standard -> nearline -> coldline -> archive

storage class	`standard`	`nearline`	`coldline`	`archive`
accessing at least once per	week	month	quarter	year

::: tip Time series data Time series data is a type of data that is collected over time. This data can be used to track trends and patterns over time. Time series data can be collected manually or automatically. Automatic time series data collection is often done using sensors or other devices that collect data at regular intervals. This data can be used to track the performance of a system over time, or to predict future trends. These are examples of time-series data:

MRTG graph data
SNMP polled data
Everything a fitbit records
An EKG output

Time series data is best stored in BigTable which handles this workload better than BigQuery or CloudSQL. :::

Systems Integ and Data Management

Once we have these requirements, our minds already start placing the need in the right product, though we may be provisionally thinking about it. The same thing should be happening when you think of dependencies.

Systems Integration Business Requirements

Let’s review the business needs of our use cases.

Business requirements dictate technical requirements implicitly. From statements like:

EHR Healthcare

Business Requirements

EHR Healthcare provides B2B services to various entities, vendors, insurance providers, network directories, etc.
Different entities have different access rights to read and edit records and information.
Different entities possess varying levels of expertise.
The services must always be up and running.
Some information accessed by entities is regulated.
Confidentiality is of utmost importance.
The company wishes to track the number and type of data accessed to gain insights into trends.

Technical Requirements

They will need to publicly expose an API or set of them.
Access restrictions must be applied at the API level.
There will be legacy systems involved because of insurance entities.
Infrastructure redundancy is necessary.
Data Lifecycles must consider regulation, insights, and access controls.
Cloud Machine Learning can be leveraged to build insight models faster than they can be planned and built. ::: tip Cloud Dataflow Cloud dataflow is a cloud-based data processing service for batch and streaming data. It is a fully managed service that is designed to handle large data sets with high throughput and low latency. Cloud dataflow is a serverless platform that can scale automatically to meet the needs of your application. It is a cost-effective solution that allows you to pay only for the resources you use. :::

Helicopter Racing League

Business Requirements

The Helicopter Racing League (HRL) organizes and manages helicopter races worldwide.
HRL wants to enhance the spectator experience by providing real-time telemetry and video feed for each race.
HRL wants to archive all races for future viewing on demand.
A robust data analytics solution is needed to gain insights into viewer behavior and preferences.
The solution must be highly available and scalable to handle spikes during race events.

Technical Requirements

Real-time data processing capability is needed to handle race telemetry data.
A scalable video streaming solution is needed to broadcast races worldwide.
Archival storage is needed for storing race videos for on-demand viewing.
An analytics solution is needed for analyzing viewer behavior and preferences.
The solution must be highly available and scalable to handle traffic spikes during races. ::: tip Service Level Objectives Business requirements typically demand these common type of SLOs.

High Availability SLO Always accessible.
Durability SLO Always kept.
Reliability SLO Always meeting workloads.
Scalability SLO Always fitting its workloads.

:::

Mountkirk Games

Business Requirements

Mountkirk Games develops and operates online video games.
They need a solution to handle high scores and player achievements.
They need to collect minimal user data for personalizing the gaming experience.
The solution must be globally available and provide low latency.
They are interested in Managed services which can automatically scale.

Technical Requirements

A globally available high score and achievement system is needed.
User data needs to be collected and processed in a privacy-compliant manner.
The system must provide low latency for a smooth gaming experience.
Managed services can be used to handle automatic scaling.

::: tip Global Up-to-Date Data Cloud spanner is the best option for an SQL based global records storage with a High Consistency SLO. :::

TerramEarth

Business Requirements

TerramEarth manufactures heavy equipment for the construction and mining industries.
They want to leverage their vast trove of IoT sensor data to improve their products and provide better service to their customers.
They want to move their existing on-premises data infrastructure to the cloud.

Technical Requirements

IoT data needs to be ingested and processed in real-time.
A robust data analytics solution is needed to derive insights from the sensor data
A migration plan is needed to move existing data and systems to the cloud.

::: tip Cloud Dataproc Cloud Dataproc is a cloud-based platform for processing large data sets. It is designed to be scalable and efficient, and to handle data processing workloads of all types. Cloud Dataproc is based on the open-source Apache Hadoop and Apache Spark platforms, and provides a simple, cost-effective way to process and analyze data in the cloud. :::

Data Management Business Requirements

Business requirements help us know what platforms to connect and how they will work. Those same requirements will tell us what data is stored, how often, for how long, and who and what workloads have access to it.

How Processed?

What is the distance between where the data is stored and where it is processed? What volume of data will be moved between storage and processing during an operation or set of operations? Are we using stream or batch processing?

The first question’s answer influences both the read and write times and the network costs associated with transferring the data. Creating replicas in regions nearer to the point of processing will increase read times, but will only decrease network costs in a ‘replicate one time, read many times’ situation. Using storage solutions with a single write host will not improve replication times.

The second questions’s answer influences time and cost as well. On a long enough timeline, all processes fail. Build shorter running processes and design reconnecting robust processes.

The third question’s answer and future-plans answer will influence how you perform batch processing. Are you going to migrate from batch to stream?

Style	Pros	Cons
Batch	tolerates latency, on time data	interval updates, queue buildup
Stream	realtime	late/missing data

::: tip If using VMs for batch processing, use preemptible VMs to save money. :::

How Long?

At what point does data lose business value? With email, the answer is never, people want their past emails, they want all their backed up emails delivered. But with other kinds of data, like last year’s deployment errors, lose certain levels of value as it becomes less actionable now.

You’ll have to design processes for removing less valuable data from persistent storage locations and stored in archival locations or deleted. How long data is stored for each set of data will have a great affect on an architectural design.

How Much?

The volumes of data and how it will scale up when business goals are met or exceeded need to be planned for or else there will be a dreaded redesign and unnecessary iterations.

Storage related managers will need to know the volume and frequency of data storage and retrieval so they can plan for their duties and procedures which touch your design.

::: tip Factors of Volume and Load The main factors that affect volume are the number of data generators or sensors. If you consider each process that can log as a sensor, the more you log the higher your volume in Cloud Logging, the higher the processing costs in BigQuery and so forth.

Number of hosts
Number of logging processes
Network Connectivity
Verbosity Configuration

:::

Compliance and Regulations

Many businesses are under regulatory constraints. For example, “Mountkirk” receives payment via credit cards. So they must be PCI compliant and financial services laws apply their receiving payment.

Health Insurance Portability and Accountability Act (HIPAA) is United States legislation that provides data privacy and security regulations for safeguarding medical information.
General Data Protection Regulation (GDPR) a set of regulations that member states of the European Union must implement in order to protect the privacy of digital data.
The Sarbanes-Oxley (SOX) Act a number of provisions designed to improve corporate governance and address corporate fraud.
Children’s Online Privacy Protection Act (COPPA) is a U.S. law that requires website operators to get parental consent before collecting children’s personal information online.
Payment Card Industry Data Security Standard (PCI DSS) is a set of security standards designed to protect cardholders’ information.
Gram-Leach Bliley Act (GLBA) designed to protect consumers’ personal financial information held by financial institutions.

::: tip Compliance TLDR In the United States

SOX regulates financial records of corporate institutions.
HIPPA regulates US companies protecting consumer access and the privacy of medical data.
PCI DSS is a standard for taking credit cards which processing underwriters may require an e-commerce vendor to abide by.

In Europe

GDPR regulates information stored by companies operating in Europe for its protection and privacy.

:::

When we know what regulations apply to our workload it is easier to plan our design accordingly. Regulations can apply to jurisdictions like HealthCare or like the State of California. Operating within a jurisdiction means you’ll have to research your industry’s governance and what it may be subject to.

Privacy Regulations

Regulations on data slant toward protecting the consumer and giving them greater rights over their information and who it is shared with. You can review privacy policies per Country at Privacy Law By County

Architects not only need to comply with these laws, but kindle the the spirit of the law within themselves, that of protecting the consumer. Architects need to analyze each part of their design and ask themselves how is the consumer protected when something goes wrong?

Access controls need to cascade in such a way that permissions are restrictive and then opened, and not the other way around. Data needs to be encrypted at rest and in transit and potentially in memory. Networks need firewall and systems need verification of breaches through logging. One can use the Identity Aware Proxy and practice Defense in Depth.

Data Integrity Regulations

The Sarbanes-Oxley (SOX) Act aims to put controls on data that make tampering more difficult. I worked for a SOX compliant business, IGT PLC and we had to take escrow of code, making versions of code we deployed immutable so it could be audited. In this case, tampering with the data was made more difficult by using an escrow step in the data processing flows. Other business might need to store data for certain number of years while also being immutable or having some other condition applied to it.

Security

IS, information security, infosec or cybersecurity is the practice or discipline of keeping information secure. Secured information as a business need comes from the need for confidentiality, the need for lack of tampering, and availability. Unavailable systems are generally secure. No can remotely compromise a computer, for instance, that has no network interface.

Confidentiality

Businesses need to limit access to data so that only the legal, ethical and appropriate parties can read, write, or audit the data. In addition to compliance with data regulations, competing businesses have a need to keep their information private so that competitors cannot know their trade secrets, plans, strategies, and designs.

Google cloud offers several options for meeting these needs. Encryption at rest and in transit is a good start. Memory encryption using N2D compute instances and shielded VMs make the system the least compromisable.

Other offerings include Google Secret Manager, Cloud KMS for keeping Google from reading the data except for the least-access cases you let it. When using customer supplied keys, they are stored outside of Google’s own key management infrastructure.

Protected networks keep data confidential. Services also can be configured for maximum protection. For instance, consider these apache configuration directives:

ServerTokens Prod
ServerSignature Off
<Directory /opt/app/htdocs>
  Options -Indexes
</Directory>
FileETag None

Similar directives in other service configuration make confidential your software versions and system software. In fact, turning ServerTokens and ServerSignature off and prod is a PCI DSS requirement.

Determine the methods of authentication how methods of authorization can compromise confidentiality.

::: tip Dealing with Inconsistent Message Delivery Cloud Pubsub is a messaging service that allows applications to exchange messages in a reliable and scalable way. It is a fully managed service that can be used to build applications that require high throughput and low latency.

If Applications are working synchronously, decouple them and have the reporters interact with a third services that is always available and that autoscales. :::

Integrity

Data Integrity is required by regulations which focus on making data tamper-proof, but normally is simply a business requirement. You need your records to be consistent and reflect reality. Data Integrity is also about keeping it in that state.

Ways in Google cloud that you can promote and increase data integrity are to use ways to promote data integrity in Google cloud are to use tools like Data Loss Prevention (DLP) and Data Encryption. You should also enforce least privilege, use strong data encryption methods, and use access control lists.

Colocate report data instead of drawing on active data. That way, if data is tampered with discrepancies exist directly within the app. The search for these discrepancies can be automated into their own report.

Availability

DDos attacks, ransomware, and disgruntled administrators and bad faith actors threaten the availability of data.

You can combat ransomware with a well hardened IaC(Infrastructure as Code) pattern culling resources which have their availability degraded, restoring their data and stateful information from trusted disaster recovery provisions.

When designing a project, design around these scenarios to ensure a business can survive malicious activity. Design a project which can not only survive a malicious attack but one that can also continue to be available during one.

::: tip Keeping Data Entirely Secret Cloud KMS is a cloud-based key management system that allows you to manage your cryptographic keys in a secure, centralized location. With Cloud KMS, you can create, use, rotate, and destroy cryptographic keys, as well as control their permissions. Cloud KMS is integrated with other Google Cloud Platform (GCP) services, making it easy to use your keys with other GCP products.

When you manage the encryption keys Google uses to encrypt your data, the data is kept secret from anyone who doesn’t have access to decrypt it, which requires access to uses those keys. :::

Success Measures

As businesses move to agile continuous deployment and integration, they want to see reports of the deployments going well, development costs decreasing, the speed of development therefore increasing. Amid all of this the want to measure the overall success of an endeavor so they can correctly support the resources which will increase the bottom line.

::: tip Continuous Integration & Delivery The benefits of CICD to business requirements is that it enables smaller incremental trunk-based development. This shortens the feedback loop, reduces risks to services during deployment, increases the speed of debuging, isolates featuresets to known risks. :::

Key Performance Indicators

The first to two important measurements is Key Performance Indicators(KPIs). The other is Return on Investment(ROI). KPIs measures of value of some portion of business activity which can be used as a sign things are well and an effort is achieving its objectives. A KPI for an automation team of reliability engineers might be a certain percentage as a threshold of failed deployments to successful ones.

Project KPIs

Cloud migration projects have KPIs which the project manager can use to gauge the progress of the overall migration. Another KPI might be having a set of databases migrated to cloud and no longer being used on premises. KPIs are particular to a projects own needs.

::: tip Improving SQL Latency Export unaccessed data older than 90 days from the database and prune those records. Store these exports in Google Cloud Storage in Coldline or Archive class buckets. :::

Operations KPIs

Operations departments will use KPIs to determine if they are handling the situations they set out to address. Product support teams can use KPIs to determine if they are helping their customers use their product to the degrees which mean the business objectives. Cloud Architects will need to know which KPIs will the used to measure the success of the project being designed. The help the architect understand what takes priority and what motivates decision-makers to invest in a project or business effort.

::: tip Total Cost of Ownership When Managers and Directors Only Compare Infrastructure Costs Calculate the TCO of legacy projects against planned cloud projects. Calculate the potential ROI with regard to the TCO of the investment. Use this wider scope to compare the true cost of running legacy projects or forgoing cloud migrations. :::

Return on Investment

Return on investment is the measure of how much of a financial investment pays off. ROI is a percentage that measures the difference between the business before and after the investment. The profit or loss after an investment divided by the total value of the investment. So:

$ROI=\left(\frac {investment\ value-cost\ of\ investment} {cost\ of\ investment} \right) \times 100$

Lets work this out for a 1 year period. Host U Online bought $3000 in network equipment and spent $6000 to migrate to fiber. The total cost of investing in fiber was $9000. They began reselling their fiber internet to sublets in the building. In one year the acquire six customers totalling $12,000 per month. A year’s revenue from the investment is $144,000.

$\left(\frac {135000} {9000} \right) \times 100 = 1500%$

This is a real scenario I orchestrated for a real company. Our return on investment, the ROI, was a tremendous 1500%.

In a cloud migration project the investment costs includes costs Google cloud services and infrastructure, personnel costs, and vendor costs. You should include expenses saved in the value of the investment.

::: tip Reducing Costs When designing for cost reduction, there are three options you should strongly consider:

The goals and concepts that the organization places high value upon will be underlying the KPIs and ROI measures.

Essentials

Understanding the sample requirements word for word
Knowing the meanings of business terms like TCO, KPI, ROI
Learn about what Google services are for what use cases
Understanding managing data
Understanding how compliance with law can affect the architecture of a solution
Understand the business impetus behind the aspects of security pertaining to business requirements
- Confidentiality
- Integrity
- Availabiltiy
Understand the motives behind KPIs

Official Resources

List of All Managed Google Cloud Platform(GCP) Services

Sep 13, 2022

Christopher Shaun Godwin

Author

Service	Type	Description
AutoML Tables	AI and Machine Learning	Machine learning models for structured data
Recommendations AI	AI and Machine Learning	Personalized recommendations
Natural Language AI	AI and Machine Learning	Entity recognition, sentiment analysis, language identification
Cloud Translation	AI and Machine Learning	Translate between two languages
Cloud Vision	AI and Machine Learning	Understand contents of images
Dialogflow Essentials	AI and Machine Learning	Development suite for voice to text
BigQuery	Analytics	Data warehousing and analytics
Batch	Compute	fully managed batch jobs at scale
VMware Engine(GCVE)	Compute	running VMware workloads on GCP
Cloud Datalab	Analytics	Interactive data analysis tool based on Jupyter Notebooks
Data Catalog	Analytics	Managed and scalable metadata management service
Dataproc	Analytics	Managed hadoop and Spark service
Dataproc Metastore	Analytics	Managed Apache Hive
Cloud Composer	Analytics	Data workflow orchestration service
Cloud Data Fusion	Analytics	Data integration and ETL tool
Data Catalog	Analytics	Metadata management service
Dataflow	Analytics	Stream and Batch processing
Cloud Spanner	Database	Global relational database
Cloud SQl	Database	Regional relational database
Cloud Deployment Manager	Development	Infrastructure-as-code service
Cloud Pub/Sub	Messaging	Messaging service
Bigtable	Storage	Wide column, NoSQL databases
Cloud Data Transfer	Storage	Bulk data transfer service
Cloud Memorystore	Storage	Managed cache service using Redis or memcached
Cloud Storage	Storage	Managed object storage
Cloud Filestore	Storage	Managed shared files via NFS or mount
Cloud DNS	Networking	Managed DNS with API for publishing changes
Cloud IDS	Networking	Intrusion Detection Systems
Cloud Armor Managed Protection Plus	Networking	DDos Protection with Cloud Armor’s AI adaptive protection
Service Directory	Networking	Managed Service registry
Cloud Logging	Operations	Fully managed log aggregator
AI Platform Neural Architecture Search (NAS)	AI Platform	AI Search
AI Platform Training and Prediction	AI Platform	NAS training
Notebooks	AI Platform	JupyterLab environment
Apigee	API Management	API Gateway security and analysis
API Gateway	API Management	API Gateways
Payment Gateway	API Management	integration with real-time payment systems like UPI
Issuer Switch	API Management	user transactor deployment
Anthos Service Mesh	Hybrid/Multi-Cloud	devide up gke traffic into workloads and secure them with istio
BigQuery Omni	Analysis	Use BigQuery to query other clouds
BigQuery Data Transfer Service	Analysis	Migrate data to BigQuery
Database Migration Service	Storage	fully-managed migration service
Migrate to Virtual Machines	Migration	migrate workloads at scale into Google Cloud Compute Engine
Cloud Data Loss Prevention	Security and Identity	discover, classify, and protect your most sensitive data
Cloud HSM	Security and Identity	Fully managed hardware security module
Managed Service for Microsoft Active Directory (AD)	Identity & Access	Managed Service for Microsoft Active Directory
Cloud Run	Serverless Computing	Run serverless containers
Cloud Scheduler	Serverless Computing	cron job scheduler
Cloud Tasks	Serverless Computing	distributed task orchestration
Eventarc	Serverless Computing	Event rules between gcp services
Workflows	Serverless Computing	reliably execute sequences of operations across APIs or services
IoT Core	Internet of Things	Collect, process, analyze and visualize data from Iot devices in real time
Cloud Healthcare	Healthcare and Life Sciences	send, receive, store, query, transform, and analyze healthcare and life sciences data
Game Servers	Media and Gaming	deploy and manage dedicated game servers across multiple Agones clusters

Official Resources

Designing and Planning Solutions in Google Cloud with GCP Architecture

Sep 12, 2022

Christopher Shaun Godwin

Author

Key Considerations

Business Use Case & Product Strategy
Cost Optimization
Dovetail with Application Design
Integration with External Systems
Movement of Data
Planning decision trade-offs
Build, buy, modify, deprecate
Measuring Success
Compliance and Observability

Collecting & Reviewing Business Requirements

Architects begin by collecting business requirements and other required information. Architects are always solving design patterns for the current unique mix of particular business needs. So every design is different. Because of this, you cannot reuse as a template a previous design even if it solved a similar use case.

Operational Topology

This unique mix of business requirements and needs is what we’ll call the Operational Topology. An Architect begins their work by making a survey of this landscape.

The peaks and valleys, inlets and gorges of this topological map include things like:

Pressure to reduce costs.
Speeding up the rate at which software is changed and released.
Measuring Service Level Objectives(SLOs)
Reducing incidents and recovery time.
Improving legal compliance.

::: tip Incident An incident is a period of time where SLOs are not met. Incidents are disruptions in a service’s availability therefore becoming degraded. :::

Reducing Capital Expenditures

The use of Managed services places certain duties on specialized companies who can reduce the cost of management by focusing on that discipline’s efficiency. This enables your business to consolidate its focus on its trade and products.

Managed services remove from an engineering team’s focus those concerns such as provisioning setup, initial configuration, traffic volume increases, upgrades, and more. If planned properly, this will reduce costs but those projections need to be verified. Workloads need to be separated in scope of their availability requirements. Workloads that don’t need highly available systems can use preemptive workloads. Pub/Sub Lite trades availability for cost. Auto-scaling and scaling down to zero, for instance, enable cost savings in tools like Cloud Run and App Engine Standard. Compute Engine Managed Instance Groups will scale up with load and back down to their set minimum when that load subsides.

Accelerating Development

We want to accelerate all development to a speed of constant innovation, the CI/CD singularity. This is what success means. Again, using Managed Services enables this by letting developers and release engineers focus on other things besides infrastructure management. The services Google hosts and manages and offers allows developers without domain expertise in those fields to use those services.

Continuous Integration and Deployment enable quick delivery of minor changes so that reviews can be quick and tracked work can be completed like lightning. Automated testing and reporting can be built into these delivery pipelines so that developers can release their own software and get immediate feedback about what it is doing in development and integration environments.

However, sometimes there are tacit business requirements that prevent you from using one of these solutions on every asset a business needs to maintain. You may be tasked to architect solutions around an ancient monolithic service which cannot be delivered to production in an agile manner. Planning to get out of this situation is your job and selling that plan to decision makers is also your goal. You have to believe in your designs and be an optimist that these specifications are all that is needed to meet the Operational Topology.

You may break apart the giant macroservice into microservices, but even if you do, that’s the future, what to do now? Do you rip and replace, meaning rebuild the app from scratch? Do you lift and shift, bring the macroservice into a compute engine while moving to microservices later. Finally, you could convert to microservices as you move it into cloud striking a hybrid between the two. Business requirements will point the way to the correct solution every time without fail.

Reporting Service Level Objectives(SLOs)

An application’s requirements which surround how available it needs to be to those whom it serves is called Service Level Objectives. Accounting systems might not need to be running except during business hours, while Bill pay applications that customers use will need to always be available. These two different systems used by two different audiences needs two different Service Level Objectives.

SLOs specify things like uptime, page load time. These events are recorded within Cloud Logging. When they are not met Alerts can be created with Cloud Monitoring. The data points in these logs are called Service Level Indicators(SLIs). An SLO is a formal definition of a threshold which SLIs need to stay compliant with.

Incident Recovery-time Reduction

When services become unavailable or degrades, an Incident has occurred. A Business’s response to an incident may vary from company to company, but for the most part, every company has some sort of response system.

Collecting metrics and log entries along the way reduce the time it takes to recover from incidents because it illuminates the states of parts of the system when the error occurred. The first thing a reliability engineer does is look at logs on a problematic system. If one can see all logs from all components in one place at the same time one can better put together a complete story rather than having to revise the story continually as the information about the problem is discovered.

Compliance and Regulations

The Big five most architects have to worry about are:

Health Insurance Portability and Accountability Act(HIPPA), a healthcare regulation
Children’s Online Privacy Protection Act (COPPA), a privacy regulation
Sarbanes-Oxley Act(SOX), a financial reporting regulation
Payment Card Industry Data Standard(PCI), Compliance data regulation protection for credit card processing
General Data Protection Regulation(GDPR), a European Union privacy regulation

Compliance with these means controlling who has access to read and change the regulated data, how and where it is stored, how long it must be retained. Architects track and write schemes of controls which meet these regulations.

Capital Expenditure

Capital expenditures are funds used to purchase or improve fixed assets, such as land, buildings, or equipment. This type of spending is typically used to improve a company’s long-term prospects, rather than for day-to-day operations. Because of this, capital expenditures can be a significant financial decision for a business, and one that should not be made lightly.

Compliance

Implementation of controls on access, storage, and lifecycle of sensitive data.

Digital Transformation

Digital transformation is the process of using digital technologies to create new or improved business processes, products, and services. It can be used to improve customer experience, operational efficiency, and competitive advantage. In order to be successful, digital transformation must be driven by a clear strategy and executed with careful planning and execution.

Governance

Governance is the process by which organizations are directed and managed. It includes the creation and implementation of policies, the setting of goals, and the monitoring of progress. Good governance is essential for the success of any organization, as it ensures that resources are used efficiently and effectively. There are four main principles of good governance: accountability, transparency, participation, and inclusiveness. Accountability means that those in positions of authority are held accountable for their actions. Transparency means that information is readily available and accessible to those who need it. Participation means that all stakeholders have a say in decision-making. Inclusiveness means that all voices are heard and considered. These principles are essential for the success of any organization.

Key Performance Indicators(KPI)

A key performance indicator (KPI) is a metric used to evaluate the success of an organization or individual in achieving specific goals. KPIs are often used in business to track progress and compare performance against objectives. While there are many different KPIs that can be used, some common examples include measures of sales, profitability, productivity, customer satisfaction, and safety.

Line of Business

A line of business (LOB) is a group of products or services that are related to each other. Businesses often have multiple lines of business, each with its own set of customers, products, and services. For example, a company that sells both cars and trucks would have two lines of business: automotive and commercial vehicles. Lines of business can be created for different reasons. Sometimes, businesses create lines of business to take advantage of different market opportunities. Other times, businesses create lines of business to better serve their customers’ needs. Lines of business can be a helpful way for businesses to organize their products and services. By creating lines of business, businesses can more easily target their marketing and sales efforts.

Operational Expenditures

Operational expenditures are the costs associated with running a business on a day-to-day basis. They can include everything from rent and utilities to payroll and inventory costs. For many businesses, operational expenditures are the largest category of expenses. Managing operational expenditures is a key part of running a successful business. Careful planning and budgeting can help keep costs under control and ensure that the business is able to generate enough revenue to cover all of its expenses. Operational expenditures can have a major impact on a business’s bottom line. Therefore, it is important to carefully track and manage these costs. Doing so can help ensure that the business is able to remain profitable and continue to grow.

Operating Budget

An operating budget is a financial plan that details how a company will generate and spend revenue over a specific period of time. The operating budget is important because it ensures that a company has the resources it needs to meet its operational goals. The budget also provides a way to track actual results against desired outcomes.

Service-Level Agreement(SLA)

A service level agreement (SLA) is a contract between a service provider and a customer that specifies the nature and quality of the service to be provided. The SLA will typically include a description of the service to be provided, the standards that the service must meet, the customer’s responsibilities, and the service provider’s obligations. The SLA may also specify the remedies available to the customer if the service provider fails to meet the agreed-upon standards.

Service-Level Indicators(SLI)

Service-level indicators (SLIs) are performance metrics that help organizations measure and track the quality of their services. SLIs can be used to track the performance of individual service components, as well as the overall performance of the service. Common service-level indicators include uptime, response time, and error rates. By tracking SLIs, organizations can identify service problems early and take steps to improve the quality of their services.

Service-Level Objectives(SLO)

Service-level objectives (SLOs) are a key component of any effective service-level management (SLM) program. SLOs help ensure that services are delivered in a consistent and predictable manner, and help identify and track the key performance indicators (KPIs) that are most important to the success of the business.

SLOs should be designed to meet the specific needs of the business, and should be based on a thorough understanding of the customer’s requirements. They should be realistic and achievable, and should be reviewed and updated on a regular basis.

An effective SLM program will help to ensure that services are delivered in a timely and efficient manner, and that customer expectations are met or exceeded.

Analyzing Technical Requirements

Technical requirements specify the characteristics that a system or component must have in order to be able to perform its required functions. These include requirements such as atomicity, consistency, reliability, and durability. Atomicity refers to the ability of a system to guarantee that a transaction is either completed in its entirety or not at all. Consistency refers to the ability of a system to maintain data integrity. Reliability refers to the ability of a system to perform its required functions correctly and consistently. Durability refers to the ability of a system to maintain data integrity in the face of failures.

Functional Requirements

Functional requirements are the specific capabilities that a system must have in order to perform its intended functions. For example, a compute requirement might be the ability to process a certain amount of data within a certain time frame, while a storage requirement might be the need for a certain amount of space to store data. Network requirements might include the need for certain bandwidth or the ability to connect to certain types of devices. All of these requirements must be taken into account when designing a system.

Cloud Compute Engine

Requirements can be grouped into being met by the cloud’s offerings. Compute Engine, App Engine, Kubernetes Engine, Cloud Run, and Cloud Functions all solve unique use cases. It is forseeable that all of your requirements are going to fall along these lines when it comes to processing data requests, responding to requests, delivering content and interfaces. If not, another Google product will represent a Functional Needs subset.

Cloud Storage

Similarly, storage options are plethora. One or more of them meet our needs. Is your data Structured, or Unstructured, Relational? What latency requirements do you have? Group your requirements together and look at how the offerings meet those needs. If you are only appending dumps of data somewhere, you can chose a better option for that.

Cloud Networking Requirements

How many instances or nodes will you need? That number will affect how big your subnets will need to be. Can Firewall rules be allowed by service accounts? Do you have multiple workloads that you can sort into different groups to which the rules correspond?

Do you need DNS peering to enable hybrid-cloud networking between your VPC and your on-premises networks? These are questions an architect asks. You have to take the company’s subnets into account so that you can avoid collisions. So is automated or custom subnetting right for your project?

How is hybrid peering accomplished: VPN Peering which has high security but low througput? Or will Dedicated Interconnect and Partner Interconnects be used at higher cost for greater throughput?

Nonfunctional Requirements

Nonfunctional requirements are those that specify system characteristics such as availability, reliability, scalability, durability, and observability. They are often expressed as quality attributes or service level agreements. Functional requirements define what the system does, while nonfunctional requirements define how the system behaves. Nonfunctional requirements are important because they ensure that the system will meet the needs of its users.

Availabiltiy
Reliability
Scalability
Durability
Observability

Availability Requirements

There are many factors to consider when determining the availability requirements for a system. The first is the required uptime, which is the percentage of time that the system must be operational. For example, a system with a required uptime of 99% must be operational for at least 99% of the time. Other factors include the reliability of the components, the redundancy of the system, and the response time to failures. Availability requirements are often specified in terms of uptime and downtime, which is the amount of time that the system is operational and unavailable, respectively.

Reliability Requirements

Reliability requirements are those that specify how often a system or component must perform its required functions correctly. They are typically expressed as a percentage or a probability, and they may be specified for a single function or for the system as a whole. Reliability requirements are important because they help ensure that a system will be able to meet its operational objectives. Related to Availability, Reliability is the same requirement under the pressure of business load.

Scalability Requirements

Scalability requirements are those that dictate how well a system can cope with increased loads. They are typically expressed in terms of throughput, response time, or capacity. For example, a system that can handle twice the number of users without any degradation in performance is said to be scalable.

Scalability is a key consideration in the design of any system, be it a website, an application, or a network. It is especially important in the case of web-based systems, which are often subject to sudden and unexpected spikes in traffic. A system that is not scalable will quickly become overloaded and unable to cope, leading to a poor user experience and potential loss of business. Scalability requirements often are linked to Reliability factors.

Durability Requirements

In order for a product to be considered durable, it must be able to withstand repeated use and exposure to the elements without showing signs of wear and tear. This means that the materials used to construct the product must be of high quality and able to withstand regular use. Additionally, the product must be designed in a way that minimizes the likelihood of damage. For example, a durable product might have reinforced seams or be made from waterproof materials. Ultimately, the durability of a product is a key factor in determining its overall quality and usefulness.

Durability in the cloud is the ability to retrieve data placed there in the future. This means not losing volumes, files, objects and the immediate replacability and reproducibility of any resources that are not functioning correctly.

Observability Requirements

Observability requirements are those that enable a system to be monitored and its performance to be assessed and internal states to be known. They are typically concerned with aspects such as the availability of data, the ability to detect and diagnose faults, and the ability to predict future behavior. In many cases, these requirements will need to be trade-offs between conflicting goals, such as the need for timely data versus the need for comprehensive data.

Official Resources

Features in Google Cloud for Securing Virtual Machines(VMs)

Sep 11, 2022

Christopher Shaun Godwin

Author

Shielded VMs

Shielded VMs use verification on hardware IDs and chips to defend against Linux bootkits and rootkits and provides self-healing security features such as integrity monitoring and healing.

It uses Secure Boot, Virtual trusted platform module(vTPM)-enabled Measured Boot, and Integrity monitoring.

Monitoring

You can monitor your VMs in a few ways with Shielded VMs:

You can monitor the boot integrity of shielded VMs with cloud monitoring.
You can automatically take action on integrity failures with cloud functions.

Confidential VMs

These Virtual Machines use encryption-in-use and encrypt the data in memory. You provision this type of VM with the type N2D:

n2d-standard-2
n2d-standard-4
n2d-standard-8
n2d-standard-16
n2d-standard-32
n2d-standard-48
n2d-standard-64
n2d-standard-80
n2d-standard-96
n2d-standard-128
n2d-standard-224

VPC Service Controls

VPC Service Controls can define perimeters around sets of services within a VPC and can have their access limited. Traffic that crosses perimeters have Ingress and Egress rules. This affords us the following benefits:

Unauthorized networks with stolen credentials are blocked
Data exfiltration blocked.
Safety net for misconfigured over-permissive IAM policies.
Honeypot perimetering and additional monitoring.
Extend perimeters to on-premiss networks
Context-aware access to resources

VPC Service Control Netflow

Official Resources

Comparison of Google Cloud Database Options

Sep 10, 2022

Christopher Shaun Godwin

Author

Bigtable

There are many pros to using bigtable, including the ability to handle large amounts of data, the flexibility to scale up or down as needed, and the ability to support a variety of data types. Additionally, bigtable is designed to be highly available and can provide near-real-time access to data. resizing without downtime, simple administration, highly scalable.

BigQuery

BigQuery is a very powerful tool that can handle large amounts of data very efficiently. It is also easy to use and has a lot of features that make it a great choice for data analysis. On the downside, BigQuery can be expensive to use, and it can be challenging to get started if you are not familiar with it.

Cloud SQL

Google Cloud SQL is fully managed, flexible, automatically replicated across multiple zones, encrypted at rest and in transit, automatic updates.

Cloud Spanner

Cloud Spanner uses TrueTime to execute the same query on multiple regions to ensure consistency. If your data needs to be consistent and cannot wait for replication, cloud spanner is the clear choice.

Compute VM

Running a database cluster on Compute VM, you take all the management upon yourself. If you select the wrong compute sizes, either too big or too small, you run risks of rising costs or falling performance.

Product	Relational	Structured	Unstructured	Heavy R/W	Low Latency	Global Consistency
Bigtable	🔴	🟢	🟢	🟢	🟢	🔴
BigQuery	🟢	🟢	🟢✝	🔴✝✝	🔴	🔴
Cloud Firestore	🔴	🔴	🟢	🔴	🔴	🔴
Firebase Realtime Database	🟢	🟢	🟢✝	🔴✝✝	🔴	🟢
Cloud SQL	🟢	🟢	🟢	🔴	🔴	🔴
Cloud Spanner	🟢	🟢	🔴	🔴	🔴	🟢
Compute VM	🟢	🟢	🟢	🔴	🔴	🔴

Symbol	Meaning
🟢	Yes
🔴	No
✝	Semi Unstructured Data with the Json type
✝✝	Read / Append Only

Official Resources

Comparison of Standard and Flexible App Engine Environments

Sep 10, 2022

Christopher Shaun Godwin

Author

Table of App Engine Distinguishing Features

Product	Access GCP Services	Any Language	Scaling	Scale to Zero	Background threads	Background Processes	Modify the Runtime	Websockets	Write to Disk
Standard	🔴	🔴	🟢	🟢	🟢	🔴	🔴	🔴	Java: /tmp
Flexible	🟢	🟢	🟢	🔴	🟢	🟢	🟢	🟢	🟢

Symbol	Meaning
🟢	Yes
🔴	No

Official Resources

Contrasting Preemptible and Spot Virtual Machines(VMs)

Sep 10, 2022

Christopher Shaun Godwin

Author

Table of Preemptible vs Spot Distinguishing Features

Product	Unlimited Runtime	preemptive delete	preemptive pause	SLA Coverage	Cost Reduction	Migrate to Standard VM	Restart on Event	Live Migration
Preemptible VMs	🔴	🔴	🟢	🔴	🟢	🔴	🔴	🔴
Spot VMs	🟢	🟢	🟢	🔴	🟢	🔴	🔴	🔴

Symbol	Meaning
🟢	Yes
🔴	No

Official Resources

Differences in Google Cloud Platform(GCP) Premium Network Tiers

Sep 10, 2022

Christopher Shaun Godwin

Author

Premium Tier

This tier uses more resources to think out the best route. This tier has more than 100 Points-of-Presence(PoP) which lets the packets leave Google’s network nearest to the customer. Packets use a more direct route a “cold potato” algorithm. This Tier support global load balancers.

Standard Tier

Using the “Hot Potato” method, the Standard Tier network tries rid itself of the packet by sending it to the earliest responding route. This is a less direct path and may not egress through a PoP as near to the destination. This Tier can only support Load Balancers which are regional.

Product	Global LB	PoP Closest Hop	Next Hop Algorythm	High Performance	Inter-Regional Traffic	Cloud CDN	Cloud VPN/Router
Premium	🟢	🟢	Cold Potato	🟢	Google Network	🟢	🟢
Standard	🔴	🔴	Hot Potato	Standard ISP	Encrypted over Public ISPs	🔴	🔴

Symbol	Meaning
🟢	Yes
🔴	No

GCP Network Tier Decision Tree

Official Resources

Complete List of Google Cloud Certified Professional Cloud Architecture Skills

Sep 3, 2022

Christopher Shaun Godwin

Author

Ability to Design for Business

Business Use-Cases

When it comes to designing a cloud for business use-cases, there are a few key considerations that need to be taken into account.

Product Strategies

Product strategies have a big impact on cloud architecture design.

Go-live Strategies

Cloud-based launches require careful planning to ensure a successful outcome.

Application Design

Applications that are designed to run in the cloud must be able to take advantage of the functionality, scalability and flexibility that the cloud offers.

Cost Considerations

For every cost-effective option for running an application, there are at least two other cost-ineffective ways to accomplish the same thing in a cloud.

Systems Integration

Key considerations for communicating between cloud and on-premises networks and applications.

Data Handling and Management

Keeping your data safe and secret is the goal of proper data handling, especially production and sensitive data.

Compliance, Regulations, Access Restrictions

Designing cloud architecture requires staying within industry compliance, legal regulation, and policy access restrictions. That involves careful consideration and arduous planning.

Security

Trust no file, no connection, no application. Treat everything as untrusted as a potential layer of breach, act according to that and you’ll be fine.

Measures of Success

Does our proof of concept have any measures of success other than a successful healthcheck? It should. What else are we missing?

Ability to Design Around Technical Needs

Technical Debt Reconciliation

Technical debt is the amount of time and effort required to fix or improve software that is not up to date or is poorly designed. It can accumulate over time as a result of shortcuts taken during development, such as using quick and dirty solutions instead of taking the time to do things properly.

Availability

What is required to ensuring high-availability in the cloud. Is availability one of your business needs?

Scaling Resources

Ensuring that the scaling process is efficient and cost effective and elastic.

Reliability Engineering

Working to identify and mitigate potential sources of failure.

Ability to Design Compute Architecture

Compute Engine

Google Compute Engine is a cloud computing service that provides virtual machines that run on Google infrastructure.

Google Kubernetes Engine(GKE)

Google Kubernetes Engine (GKE) is a managed, production-ready environment for deploying containerized applications.

Anthos

A platform that enables runs containerized applications in hybrid or multi-cloud environment, whether they are deployed on-premises or in the cloud.

Cloud Functions

Google Cloud Functions is a serverless computing platform that allows you to run code in the cloud without having to manage a server or cluster.

Configuration

Handling Application configuration.

Management

Considering infrastructure management tasks such as log rotation.

States

Stateful applications, application states, statelessness.

Data flows

Message queuing, bottlenecks and performance.

Data Integrity

Maintaining the accuracy and consistency of data over its entire lifecycle.

Monitoring and Alerting

Set up alerts and view monitoring data for your projects via dashboards.

Ability to Design Storage Architecture

Object Storage

Object storage is a type of storage that is well-suited for storing large amounts of data that is unstructured or semi-structured.

Network Storage

Google Cloud Platform (GCP) offers a managed network attached storage (NAS) service called Filestore and volumes attached to compute based services such as GKE.

Databases

Knowledge of relational database creation, maintenance, backup, and related processes.

Cloud SQL

Managed relational databases meet certain needs that compute database engines do not.

Cloud Spanner

Google Cloud Spanner is a relational database service that offers global horizontal scaling, strong consistency, and high availability. Basically Managed Vitesse.

Big Query

Google BigQuery is a cloud-based big data analytics web service for processing very large read-only data sets.

Cloud Firestore

Cloud Firestore is an auto-scaling document database for storing, syncing, and querying data for mobile and web apps.

Bigtable

Google Big Table is a distributed storage system for low-latency access to large amounts(petabytes) of structured data.

Data Retention and Lifecycles

As data doesn’t need to be access as frequently, it can be time partitioned for cost-effective solutions.

Latency Issues

Latency is a key consideration for accessing data in any kind of storage. Regional, zone, and CDN considerations must be made.

Ability to Design Network Architecture

Virtual Private Cloud(VPC) Basics

IP Addressing, CIDR ranges, Firewall Rules and Routers. Cloud Router, Cloud Armor, VPC subnet and VPC sharing.

Hybrid Cloud Networking

Knowledge of how hybrid cloud networking a mix of on-premises, private cloud, and public cloud services are setup and maintained.

CDN, DNS Zones, Zone Peering, Service Registry

Additional services provided in GCP Networking.

Load Balancing(LB)

Regional and Global Load Balancing have different use-cases. How GKE and provisions LBs.

Legal and Security Centric Design Scrutiny

Identity and Access Management(IAM)

All access is managed through IAM, it is relevant to every gcp service.

Encryption

Storage Encryption

Understanding GCP’s encryption at-rest schema.

Connection Encryption

Understanding encryption in transit in GCP.

Key Management

Ability to ascertain the needs of projects which need to control their own key management for data encryption.

Security Evaluation

Penetration Testing & Iam Policy Auditing

Security Design Principles

Full understanding of concepts like separation of duties, least privilege and Defense in Depth.

Major Regulations

Information Technology Infrastructure Library Framework

The Information Technology Infrastructure Library (ITIL) is a framework that provides a set of best practices for managing IT services.

HIPPA/HITECH

The Health Insurance Portability and Accountability Act, or HIPAA, is a federal law that was enacted in 199 HIPAA protects the privacy of patients’ health information and establishes national standards for the security of electronic health information. The HITECH Act is a federal law that promotes the adoption and meaningful use of health information technology.

GDPR applies to any company that processes the personal data of EU citizens, regardless of where the company is located. It strengthens EU data protection rules by giving individuals more control over their personal data, and establishing new rights for individuals.

SOX Compliance

The Sarbanes-Oxley Act was enacted in 2002 in response to the Enron scandal. The Act includes provisions to protect investors from fraudulent accounting practices and to improve the accuracy and transparency of corporate disclosures. The Act also created the Public Company Accounting Oversight Board to oversee the auditing of public companies.

COPPA

The Children’s Online Privacy Protection Act (COPPA) is a law that requires companies to get parental consent before collecting, using, or disclosing personal information from children under 1 COPPA also gives parents the right to review and delete their child’s personal information, and to refuse to allow companies to collect or use it.

Ability to Design for Reliability

Stackdriver

Stackdriver is a cloud monitoring tool that provides comprehensive monitoring and logging for cloud-powered applications. It offers powerful features like monitoring dashboards, alerting, log management, and tracing. Stackdriver is a great tool for keeping track of the health and performance of your cloud-based applications.

Continuous Deployment

Just use Harness. Cloud Deploy in GCP is $15 per pipeline per month.

Continuous Integration

Cloud build basics.

Reliability Engineering

Reliability engineering via Cloud Ops: Logging, Monitoring, Alerting, Etc.

Overloads, Cascading Failure and Testing

Designs need to deal with capacity overloads, they need to fail in a cascading manor, and reliability testing.

Incident Management, Analysis, and Reporting

Identify Incident cause, Plan for fix remediation, and log the actions taken.

Technical Process Introspection

Lifecycle Planning

Create and Understand Software Development Lifecycle plans.

Troubleshooting

Fixing your technical processes by revisiting your Incident Response and Post-Mortem Culture

Enterprise IT Processes

Fit your Technical Processes into the IT processes of your wider group. For example, creating AD groups and syncing them to GCP for IAM federation.

Business Continuity Planning and Disaster Recovery

Architects wil be asked to help teams to be better prepared to run their app in a new environment from scratch.

Business Process Introspection

Stakeholder Management

The ability to deliver and set expectations with people who have an interest in the project you’re designing.

Change Management

Understanding of Plan, Do, Study, Act.

Team Skill Management

Help develop internal skill-sets among the team.

Customer Success Management

Helping customers to get the most value from your services.

Cost Management

Resource planning, Cost estimation, budgeting, and cost control.

Cost Optimization

Familiar with HR Costs, Infrastructure costs, Operational Costs, and Capital Costs. Can contribute to optimizing these costs.

Development and Operations Design

Create Development-and-Redevelopment-for-Cloud Strategies

Ability to guide app developers to plan for redeveloping applications for cloud specific services.

API Best Practices

Understanding APIs, RESTful and RPC. API Security familiarization and comprehension of resource limiting.

Testing Frameworks

Vulnerability Testing, Unit Testing, Regression Testing, WebDriver Testing, HTTP and Healthcheck verifications.

Secrets Integration for Third Party Apps

Strategy for storing sensitivity data in the cloud.

Google Cloud SDK

gcloud, gsutil, bq, cbt, kubectl, pubsub emulator…

Cloud Emulators

Awareness in local emulators for development reduction. Bigtable, Datastore, Firestore, Pub/Sub, Spanner.

Migration Strategies

Lift and Shift, Move and improve, or Rip and restore?

Tools

Storage Transfer Service, gsutil, Google Cloud Database Migration, Google Transfer Appliances, and 3rd party options.

Migration Cost/Time Optimization

Data Size, Redevelopment Time, Migration Time, Planning Time.

Migration Planning

Integrating Cloud Systems with Existing Services

Migrating Applications and Data to Support a Solution

Planning Changing code and configuration to support shifts in platform differences.

Planning for Data Migration

Consider the size and type of data being migrated, the workload requirements, and the available budget. Other restrictions.

Governance of Data and Migrations

Ensuring that data is managed to stay in-compliance effectively and consistently across a migration.

Migrating Object Storage

Bucket structure, Roles and Access Controls. Time and Cost comprehension, transfer sequence, transfer methods.

Migration Relational Data

Volume considerations, downtime considerations, replicate in the cloud for no-downtime migrations.

Software License Mapping

Understanding of BYOL models and pay-as-you-go models.

Network Planning

Planning Shared Networks in Tiered Projects, Planning VPCs, Planning Network Access Standards, Scaling & Performance Testing, Connectivity