Adrian Cantrill’s SAA-C02 study course, 90 minutes: ‘RDS High Availability (multi-AZ)’, ‘RDS Automatic backups, RDS snapshots and restore’, ‘RDS read replicas’
CSA CCSK Security Guidance, 20 minutes: Deployment models
RDS High-Availability (Multi-AZ): A single-instance RDS deployment is vulnerable to the failure of its AZ, because the instance and its storage are both contained within that AZ.
Multi-AZ adds resilience. When enabled, secondary hardware is allocated inside another AZ. This is referred to as the ‘standby replica’, and RDS enables synchronous replication between the primary and the standby. You access RDS via the endpoint address: a CNAME (a DNS name) that applications use to connect to the RDS instance. With a single-AZ instance it points at the only database instance. In a multi-AZ system you still have only the single endpoint address, and it points at the primary instance. The standby replica can’t be used for extra capacity during normal operations; it just sits there accepting data from the primary instance using synchronous replication.
The flow is as follows. First, database writes happen, and they’re directed via the CNAME at the primary database instance. Second, the primary instance writes these changes to its local storage. Third, while the changes are happening, they are immediately replicated across to the standby replica. Fourth, the standby replica commits these writes to its local storage. Because writes to the primary and standby occur in parallel, there’s very little, if any, lag in data between the primary and the standby replica. This is synchronous replication.
If an error occurs with the primary instance, RDS detects this and changes the database endpoint CNAME, moving it from the primary instance across to the standby replica. Typically this failover occurs within 60 to 120 seconds. RDS Multi-AZ provides high availability, but not fault tolerance, because there is a brief outage while the failover happens. Multi-AZ is not available in the free tier, and is generally twice as expensive. The standby replica cannot be used directly; it becomes available only in the event of a failover. The standby is in the same region only, in a different AZ. Backups are taken from the standby replica, which removes the performance impact of backups on the primary instance.
Failover events include: an AZ outage, failure of the primary instance, an instance type change, software patching, and a manual failover that you trigger yourself.
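The Multi-AZ write path and failover described above can be sketched as a toy model. This is only an illustration, not how RDS is implemented: the class name, the AZ labels `az-a` and `az-b`, and the in-memory lists standing in for storage are all assumptions made for the example.

```python
class MultiAZInstance:
    """Toy model of an RDS Multi-AZ pair behind one endpoint (illustrative only)."""

    def __init__(self):
        # Each instance has its own local storage, one per AZ (hypothetical labels).
        self.storage = {"az-a": [], "az-b": []}
        self.primary = "az-a"   # the endpoint CNAME points at this instance
        self.standby = "az-b"

    def write(self, record):
        # Synchronous replication: the change is committed to the primary's
        # storage and to the standby's storage as part of the same write.
        self.storage[self.primary].append(record)
        self.storage[self.standby].append(record)
        return self.primary  # which instance served the write

    def failover(self):
        # RDS repoints the endpoint CNAME at the standby; applications keep
        # using the same endpoint name. The swap here ignores the outage window.
        self.primary, self.standby = self.standby, self.primary


db = MultiAZInstance()
db.write("order-1")
db.failover()
served_by = db.write("order-2")
```

After the failover the same endpoint is served from the other AZ, and no writes are lost because the standby already held every committed change.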
RDS BACKUPS and RESTORES: Knowledge of this topic is necessary to understand how to protect customer data and plan for disaster recovery and business continuity. This topic is also important to understand how a restore is performed inside RDS for application updates that require data restores after updates.
Two important terms: RPO: Recovery Point Objective; RTO: Recovery Time Objective.
RPO: Time between the last backup and the incident, i.e. the maximum amount of data loss; influences technical solution and cost; generally lower values cost more (more regular backups or system replication required)
RTO: Time between the DR event and full recovery, influenced by process, staff, tech, and documentation, generally lower values cost more
For AWS exams, focus on RPO and RTO objectives.
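As a worked example of the two terms, the sketch below measures the data-loss window and the recovery time for one incident and compares them with targets. The timestamps and the 24-hour and 2-hour objectives are made-up values for illustration.

```python
from datetime import datetime, timedelta


def data_loss_window(last_backup, incident):
    # RPO view: everything written after the last backup is lost.
    return incident - last_backup


def recovery_time(incident, fully_recovered):
    # RTO view: how long the service was unavailable.
    return fully_recovered - incident


# Hypothetical incident timeline.
last_backup = datetime(2024, 1, 1, 2, 0)
incident = datetime(2024, 1, 1, 14, 30)
recovered = datetime(2024, 1, 1, 17, 0)

loss = data_loss_window(last_backup, incident)
downtime = recovery_time(incident, recovered)

# Hypothetical objectives agreed with the business.
meets_rpo = loss <= timedelta(hours=24)
meets_rto = downtime <= timedelta(hours=2)
```

Here the nightly backup satisfies a 24-hour RPO, but the restore took longer than a 2-hour RTO allows, so the process or the technology would need to change.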
There are two types of backups as far as RDS is concerned: automated backups and manual snapshots. Both types are stored in S3, in AWS-managed buckets that aren’t visible in your S3 console. Because S3 replicates data across multiple AZs in the region, the backups are region resilient. RDS backups are taken from the single instance if multi-AZ is disabled, or from the standby if multi-AZ is enabled, in which case the primary is not used.
Snapshots: manual, run against RDS database instance. They function like EBS snapshots: The first snapshot is a full copy of the data that’s used on the RDS volume. From then on the snapshots are incremental and only store the change in data. When any snapshot occurs there’s a brief interruption of the flow of data between the compute resource and the storage. If you’re using single-AZ then this can impact your application, but if you choose to use multi-AZ then this occurs on the standby replica, so it will have no effect on any applications using your RDS instance. The initial snapshot will take a while because it’s a full copy of the data that’s used inside the RDS instance. From then on the backups will be much quicker for any databases except those databases that have massive data changes because incremental backups only store changes in data.
Manual snapshots don’t expire; you have to clear them up yourself. They will exist inside your AWS account forever until you delete them. This means that manual snapshots will live on past the lifetime of the RDS instance. If you delete an RDS instance any manual snapshots will remain in your RDS account. When you delete an RDS instance it will offer to make one final snapshot on your behalf. Snapshots are taken of the database instance storage, so they contain all of the databases inside the RDS instance, not just a single database. Duration between snapshots is up to you because they’re fully manual. If RDS could only use manual snapshots then you would have complete manual control over the RPO because it would be based on how often you did those snapshots. The more regular, the lower the RPO.
Automated backups: You can configure how often these occur, but the architecture is the same; they are just snapshots that occur automatically. The first snapshot is a full one and the ones which follow are all incremental. All these backups occur during a backup window defined by the end user, the general idea being to use a window which fits your business. If you are using single-AZ you have to make sure it doesn’t happen when you are actively using the database. The window impacts the RPO, the time between a successful backup and any potential failure, so timing this backup should be done to minimize the RPO value.
In addition to the automated backups, database transaction logs are written to S3 every five minutes. Transaction logs store the actual data changes made inside a database. This means a database can be restored to a point in time with five-minute granularity, which translates to an effective RPO of five minutes. The way this works is that a database snapshot is restored and then the transaction logs are replayed over the top of the snapshot to bring the database to a specific point in time. This offers really low RPO values. Automated backups are not retained indefinitely: RDS clears them up automatically according to a retention period that you set anywhere from zero to thirty-five days. A value of zero disables automated backups; a value of thirty-five means you can restore to any point in the last thirty-five days, using both the snapshots and the transaction logs. When deleting the database you can choose to retain the automated backups, but they still expire based on the retention period. The only way to keep backups indefinitely after deleting an RDS instance is to create a final snapshot.
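The restore-then-replay idea can be sketched in a few lines. This is a simplified model, not the RDS mechanism itself: snapshots are plain dictionaries, transaction log entries are `(time, key, value)` tuples, times are minutes since an arbitrary start, and the function name is hypothetical.

```python
def restore_to_point_in_time(snapshots, txn_log, target_time):
    """Rebuild state at target_time from the latest earlier snapshot plus logs."""
    # Pick the most recent snapshot taken at or before the target time.
    candidates = [s for s in snapshots if s[0] <= target_time]
    base_time, base_state = max(candidates, key=lambda s: s[0])

    # Copy the snapshot, then replay logged changes made after it,
    # stopping at the target time.
    state = dict(base_state)
    for t, key, value in sorted(txn_log, key=lambda entry: entry[0]):
        if base_time < t <= target_time:
            state[key] = value
    return state


# Hypothetical data: two snapshots and four logged changes.
snapshots = [(0, {"balance": 100}), (60, {"balance": 150})]
txn_log = [
    (20, "balance", 120),
    (60, "balance", 150),
    (65, "balance", 90),
    (70, "balance", 0),   # e.g. an unwanted change we want to restore before
]

state_at_65 = restore_to_point_in_time(snapshots, txn_log, 65)
state_at_30 = restore_to_point_in_time(snapshots, txn_log, 30)
```

Restoring to minute 65 starts from the minute-60 snapshot and replays one change, which recovers the state just before the unwanted write at minute 70.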
When performing a restore, RDS creates a new RDS instance with a new endpoint address, so applications must be updated to use it. Snapshots restore to a single point in time (when the snapshot was taken), while automated backups can be restored to any five-minute point in time within the retention period. For automated backups, the backup is restored and transaction logs are ‘replayed’ to bring the DB to the desired point in time. Important: restores are not fast. The restore time directly influences the RTO. Restoring from a snapshot or backup taken before the corruption is the only way to recover from data corruption.
Read Replicas: Read replicas provide performance benefits and availability benefits.
Read-replicas are read-only replicas of an RDS instance. Unlike Multi-AZ, where you can’t use the standby replica for anything, you can use read-replicas, but only for read operations. Read replicas have their own database endpoint address, so applications need to be adjusted to use them directly, and they’re kept in sync using asynchronous replication. A useful association: synchronous means Multi-AZ, asynchronous means read replicas. With asynchronous replication, data is written fully to the primary instance first, and then once it’s stored on disk, it’s replicated to the read replicas. In theory there could be a small amount of lag, maybe seconds, between writes to the primary instance and the read replicas; this depends on network conditions and the number of writes to the primary instance. Read-replicas can be created in the same region or in a different region; the latter is known as a cross-region read-replica. If you create a cross-region read-replica then AWS handles all the networking between regions transparently; it’s fully encrypted in transit and you have no exposure to the configuration.
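Replication lag is easier to see in a small simulation. The sketch below is an assumption-laden toy: the `Primary` and `ReadReplica` classes and the explicit `apply_from` step stand in for replication that RDS performs in the background.

```python
from collections import deque


class Primary:
    def __init__(self):
        self.data = {}
        self.pending = deque()  # committed changes not yet shipped to the replica

    def write(self, key, value):
        # The write is fully committed on the primary first...
        self.data[key] = value
        # ...and only queued for the replica (asynchronous replication).
        self.pending.append((key, value))


class ReadReplica:
    def __init__(self):
        self.data = {}

    def apply_from(self, primary):
        # Stand-in for the background replication stream catching up.
        while primary.pending:
            key, value = primary.pending.popleft()
            self.data[key] = value


primary = Primary()
replica = ReadReplica()

primary.write("user:1", "alice")
stale_read = replica.data.get("user:1")   # replica has not caught up yet

replica.apply_from(primary)
fresh_read = replica.data.get("user:1")   # replica is now in sync
```

A read issued against the replica immediately after a write can miss that write, which is why applications that need read-your-own-writes behaviour usually read from the primary.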
There are two main benefits of read-replicas. The first is performance. You can have five direct read-replicas per DB instance, each providing an additional instance of read performance. This is a way to scale out read capacity for a database. You can configure your application to perform read operations against the read replicas and use the primary instance only for write operations. You can deploy read-replicas and multi-AZ at the same time, using multi-AZ to provide the availability benefits and to remove any issues with backups affecting performance, and then using read replicas to scale out your reads. You can configure read-replicas of read-replicas, but lag starts to become a problem. Read replicas can also provide global performance improvements: deploy a new application front end in another region, connect it directly to a cross-region read-replica, and use it for read-only operations. The second benefit is availability. Snapshots and backups improve RPO but don’t help with RTO, because restores are slow. Read replicas offer near-zero RPO, because asynchronous replication keeps them only slightly behind the primary instance. If the primary instance fails, you can quickly promote a read replica to become a new read-write instance, which gives a low RTO. This only helps with failures: read replicas will replicate data corruption, so in that case you fall back to snapshots and backups.
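Because each read replica has its own endpoint, the application has to decide where each query goes. Below is a minimal sketch of that routing plus promotion. The `ReadWriteRouter` class, the crude `SELECT` check, and the `example.internal` hostnames are all hypothetical; real applications typically do this in the connection pool or data-access layer.

```python
import itertools


class ReadWriteRouter:
    """Send reads to replicas (round-robin) and everything else to the primary."""

    def __init__(self, primary_endpoint, replica_endpoints):
        self.primary = primary_endpoint
        self.replicas = list(replica_endpoints)
        self._cycle = itertools.cycle(self.replicas)

    def endpoint_for(self, sql):
        # Crude classification: treat statements starting with SELECT as reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        if is_read and self.replicas:
            return next(self._cycle)
        return self.primary

    def promote(self, replica_endpoint):
        # After a primary failure, a promoted replica becomes the new
        # read-write instance and stops serving as a replica.
        self.replicas.remove(replica_endpoint)
        self.primary = replica_endpoint
        self._cycle = itertools.cycle(self.replicas)


router = ReadWriteRouter(
    "primary.example.internal",
    ["replica-1.example.internal", "replica-2.example.internal"],
)
read_target = router.endpoint_for("SELECT * FROM orders")
write_target = router.endpoint_for("INSERT INTO orders VALUES (1)")
```

Calling `router.promote("replica-1.example.internal")` afterwards would direct writes to the promoted instance and leave the remaining replica serving reads.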
CSA CCSK Security Guidance: Deployment models:
Public cloud: Cloud customers have a reduced ability to govern operations in a public cloud; the provider is responsible for management and governance of its infrastructure, employees, and everything else. Customers also have limited ability to negotiate contracts. Inflexible contracts affect governance in the cloud and are a natural consequence of multitenancy: everything runs on one set of resources and uses one set of processes. A hosted private cloud enables full customization, but at increased cost (loss of economies of scale).
Private: Third-party ownership/management affects governance of a private cloud (similar to governance issues with any outsourced provider); there will be shared responsibilities with obligations that are defined in the contract. There will be more control over contractual terms, but you must ensure that these cover the needed governance mechanisms: a hosted private cloud may only offer exactly what is in the contract. Everything extra costs extra. This must be considered and accounted for in negotiations (including clauses to require that the platform stays current). For a private cloud you own and run yourself, governance will focus on internal SLAs for cloud users, and chargeback and billing models for cloud access.
Hybrid and Community: The governance strategy must consider the minimum common set of controls, made up of the CSP’s contract and the organization’s internal governance. In a hybrid cloud the cloud user is connecting two cloud environments, or a cloud and a data center, so governance is the intersection of the two models. Community cloud is a shared platform with multiple organizations, but not public. Governance extends to relationships with the other members of the community, not just provider and customer; it is a mix of public cloud and hosted private cloud governance.