Default provisioning of infrastructure in a standby state

Configuring infrastructure within cloud services in a cost-reduced way by default

Today I Explained

Shared infrastructure-as-code packages often configure the number of replicas or the minimum size with a value of at least 1, or an equivalent, to ensure that the service is running when first set up. This plug & play default reduces the time to value (TTV): after running the initial setup, it is possible to begin interacting with the service immediately. This doesn’t have to be the default behaviour. One consequence of choosing it is that infrastructure is provisioned without consideration for the expected workload traffic. This results in many pieces of infrastructure with resource reservations far exceeding the traffic you are actually working with. Especially in the prototyping or development case, it leads to many small wasteful expenses that add up to a significant budget expenditure.

An alternative is to make standby the default behaviour. In this pattern, the resource reservations & services are provisioned in an offline state, and must be explicitly scaled up before use. The plug & play pattern can still be achieved by overriding the default configuration to turn on the service. In practice, this means that AWS resources such as IAM Roles, Secrets Manager secrets, Route53 DNS entries, Load Balancers or Security Groups are created & configured within the AWS region. Resources that can be shut off or scaled down, such as Relational Databases (RDS), Machines (EC2s), Containers (Kubernetes / ECS) or Serverless (DynamoDB provisioned capacity / Lambda provisioned concurrency), are configured offline, or in a cost-effective way.
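As a minimal sketch of standby-by-default, the configuration below defaults every scalable setting to "off" and requires an explicit override to turn the service on. The names `ServiceConfig`, `desired_replicas`, `database_running` and `plug_and_play` are hypothetical and do not come from any particular infrastructure-as-code tool.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ServiceConfig:
    """Hypothetical settings for a shared infrastructure package."""

    name: str
    desired_replicas: int = 0       # standby: no running compute by default
    database_running: bool = False  # standby: database is created but stopped

    @property
    def is_standby(self) -> bool:
        # Standby means nothing that can be scaled down is currently running.
        return self.desired_replicas == 0 and not self.database_running


def plug_and_play(config: ServiceConfig, replicas: int = 1) -> ServiceConfig:
    """Explicit override that turns the service on."""
    if replicas < 1:
        raise ValueError("plug & play needs at least one replica")
    return replace(config, desired_replicas=replicas, database_running=True)


# The default configuration is offline; running is an opt-in.
standby = ServiceConfig(name="example-service")
running = plug_and_play(standby)
```

With this shape, the always-on resources (roles, secrets, DNS entries) would be created from either configuration, while only `running` reserves compute.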

This approach still incurs a fixed cost from the infrastructure existing, as resources like Secrets Manager secrets, load balancers, or the Kubernetes control plane have fixed prices associated with them. However, this infrastructure can be easier to identify for automatic removal, as it has a straightforward definition of unused: the infrastructure has been offline for some period of time.
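The "offline for some period of time" definition of unused can be sketched as a simple check. The function name `is_unused`, the `offline_since` timestamp and the 30-day threshold are assumptions chosen for illustration; the right period depends on your own environment.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def is_unused(
    offline_since: Optional[datetime],
    now: datetime,
    max_offline: timedelta = timedelta(days=30),  # assumed threshold
) -> bool:
    """Return True when infrastructure has been in standby long enough to remove."""
    if offline_since is None:
        # None means the infrastructure is currently running, so it is in use.
        return False
    return now - offline_since >= max_offline


now = datetime(2024, 3, 1, tzinfo=timezone.utc)
recently_stopped = is_unused(now - timedelta(days=2), now)
long_stopped = is_unused(now - timedelta(days=45), now)
still_running = is_unused(None, now)
```

A cleanup job could run this check against whatever record of state changes is available and flag the matching infrastructure for removal.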

This does assume that some mechanism exists to reset infrastructure back into the standby state, as otherwise infrastructure would just be created with a plug & play configuration to override the default standby behaviour.

A note on Lifecycles & Costs

Standby states are more a part of lifecycle management than of cost reduction, as being able to provision infrastructure in a cost-reduced configuration still requires coordination to shift infrastructure into this state. The standby state stands out when dealing with infrastructure that has lease-based lifecycles. Infrastructure is first provisioned into a standby state, and as a lease is assigned, the infrastructure transitions into a running state. When the lease expires or is released, the infrastructure can transition back into the standby state for eventual expiration or re-use.

        ┌───────┐
        │       │
┌───────┴─┐   ┌─▼───────┐
│ Standby │   │ Running │
└───────▲─┘   └─┬───────┘
        │       │
        └───────┘

Diagram showing 2 nodes, ‘standby’ and ‘running’, in which arrows connect in one direction ‘standby’ to ‘running’, and ‘running’ to ‘standby’
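The two-state lifecycle in the diagram can be sketched as a small state machine. The class `LeasedInfrastructure` and its method names are hypothetical; a real implementation would also scale the underlying resources on each transition and handle lease expiry.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    STANDBY = "standby"
    RUNNING = "running"


class LeasedInfrastructure:
    """Hypothetical lease-based lifecycle with two states."""

    def __init__(self) -> None:
        # Infrastructure is first provisioned into the standby state.
        self.state = State.STANDBY
        self.lease_holder: Optional[str] = None

    def assign_lease(self, holder: str) -> None:
        """Standby -> Running: a lease is assigned."""
        if self.state is not State.STANDBY:
            raise RuntimeError("a lease can only be assigned while in standby")
        self.lease_holder = holder
        self.state = State.RUNNING

    def release_lease(self) -> None:
        """Running -> Standby: the lease expired or was released."""
        if self.state is not State.RUNNING:
            raise RuntimeError("there is no active lease to release")
        self.lease_holder = None
        self.state = State.STANDBY


infra = LeasedInfrastructure()
infra.assign_lease("team-a")   # now running on behalf of "team-a"
infra.release_lease()          # back in standby, ready for re-use or expiration
```

Rejecting transitions from the wrong state keeps the reset back into standby explicit, which is the mechanism the pattern depends on.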

A note on disaster recovery

One of the nice benefits of this approach appears when performing disaster recovery simulations. To ensure business continuity, one could provision the entire service (& associated infrastructure) from scratch in an empty environment. By having the default configuration provision the infrastructure in a cost-reduced way, it becomes less expensive to investigate hidden dependency assumptions of the infrastructure, helping to improve its resiliency.