Deploying Relational Databases in AWS
Many applications require the ability to store data in a relational database. From web services, business intelligence, and analytics to infrastructure management, many different tasks require the recording of data in a database. In AWS, you have two choices:
You can deploy an EC2 instance with a database server application installed.
You can choose to use Amazon Relational Database Service (RDS).
Amazon RDS
The choice between a standalone EC2 instance with a database installed on top and RDS is essentially a choice between an unmanaged environment, where you have to manage everything yourself, and a managed service, where most management tasks are automated and where deployment, backups, snapshots, restores, sizing, high availability, and replicas can all be controlled with a simple API call. When developing in AWS, it usually makes sense to lean toward a managed service because reducing the management overhead brings numerous benefits. Aside from simplified management, another business driver is increased flexibility and automation: the database back end can be deployed with very little effort through the AWS CLI, the SDKs, and CloudFormation, or through an automated CI/CD system. Managed services essentially empower developers to take control of the infrastructure and to design services that can be easily deployed and replicated and that have self-healing characteristics built in.
Example 4-6 shows how the deployment of an RDS database can be integrated into a Java application by using the AWS Java SDK, giving you the ability to deploy the database and then use the endpoint address that is returned to connect to the newly created database.
Example 4-6 Java Code That Can Be Used to Build an RDS Database
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.AWSStaticCredentialsProvider;
import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.rds.AmazonRDS;
import com.amazonaws.services.rds.AmazonRDSClientBuilder;
import com.amazonaws.services.rds.model.CreateDBInstanceRequest;

// define the credentials
AWSCredentials credentials = new BasicAWSCredentials(
    "AJDEIX4EE8UER4",
    "D3huG40jThD3huG40jThNPaAx2P3py85NPaAx2P3py85"
);

// pull the credentials into the RDS client builder and define the region as us-east-2
AmazonRDS amazonRDS = AmazonRDSClientBuilder.standard()
    .withCredentials(new AWSStaticCredentialsProvider(credentials))
    .withRegion(Regions.US_EAST_2)
    .build();

CreateDBInstanceRequest request = new CreateDBInstanceRequest(); // define the create request
request.setDBInstanceIdentifier("javadbinstance"); // give the database instance (the server) a name
request.setDBInstanceClass("db.t3.small");         // define the size of the database instance
request.setEngine("mysql");                        // define the database engine type
request.setMultiAZ(true);                          // make the database highly available with Multi-AZ
request.setMasterUsername("master");               // define the database master username
request.setMasterUserPassword("javadbpw");         // define the database master password
request.setDBName("masterdb");                     // give the database a name
request.setStorageType("gp2");                     // define the storage type - gp2 is general purpose SSD
request.setAllocatedStorage(30);                   // define the storage size as 30 GB
amazonRDS.createDBInstance(request);               // issue the request
Once the instance is created, you can list all your instances with the DescribeDBInstancesResult class. You will want to get the instance identifier and the endpoint, which is the SQL endpoint URL that you can later use to connect to the database. You can do this by including the snippet shown in Example 4-7 in your Java code.
Example 4-7 Using the Java DescribeDBInstancesResult Class
DescribeDBInstancesResult result = amazonRDS.describeDBInstances(); // list all RDS instances in the region
List<DBInstance> instances = result.getDBInstances();
for (DBInstance instance : instances) {
    String identifier = instance.getDBInstanceIdentifier(); // the instance name, for example javadbinstance
    Endpoint endpoint = instance.getEndpoint();             // the SQL endpoint used to connect to the database
}
Supported Database Types
Currently, the RDS service supports six different database engines:
MySQL
MariaDB
PostgreSQL
Amazon Aurora
Oracle
Microsoft SQL Server
RDS for MySQL, MariaDB, and PostgreSQL
MySQL, MariaDB, and PostgreSQL are the most popular open-source relational databases used in today’s enterprise environments. Because they are open source, they require little or no licensing while enterprise-grade support is still available, which makes them a great choice for an enterprise looking to deploy applications more efficiently. They can easily replace traditional databases that tend to come with expensive licensing.
The MySQL, MariaDB, and PostgreSQL engines all have similar general characteristics and support highly available Multi-AZ deployment topologies with a synchronous master/slave pair across two availability zones. All of them can also deploy multiple read replicas in the same region or in another region; a sketch showing how to create a read replica follows the version list below. The RDS service supports the following versions of these open-source databases:
MySQL Community Edition versions 5.5+ and 8.0
MariaDB Server versions 10.0+
All PostgreSQL versions (though version 9.3.5 or newer is required for Multi-AZ and read replicas)
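For example, a read replica of an existing RDS instance can be created with a single API call. The following is a minimal sketch using the boto3 Python SDK; the replica identifier is hypothetical, and the source instance reuses the javadbinstance name from Example 4-6.

import boto3  # AWS SDK for Python

rds = boto3.client('rds', region_name='us-east-2')

# create an asynchronous read replica of an existing source instance
# (identifiers are illustrative - adjust them to your own deployment)
response = rds.create_db_instance_read_replica(
    DBInstanceIdentifier='javadbinstance-replica-1',  # name of the new replica
    SourceDBInstanceIdentifier='javadbinstance',      # the existing source instance
    DBInstanceClass='db.t3.small'                     # replicas can be sized independently
)
print(response['DBInstance']['DBInstanceStatus'])     # typically 'creating' at first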
Figure 4-5 illustrates synchronous replication in Multi-AZ RDS deployments.
Figure 4-5 A Multi-AZ RDS Deployment
The MySQL, MariaDB, and PostgreSQL databases all support the use of SSL connections for the encryption of data in transit and can be configured with built-in volume encryption for data at rest.
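As a sketch of how these options are requested, encryption at rest is a flag on the create call, and encryption in transit can be enforced through a database parameter group. The instance identifier and parameter group name below are hypothetical; rds.force_ssl is the relevant parameter for PostgreSQL (MySQL and MariaDB use a different parameter).

import boto3

rds = boto3.client('rds', region_name='us-east-2')

# request built-in volume encryption (data at rest) when creating the instance
rds.create_db_instance(
    DBInstanceIdentifier='encrypteddbinstance',  # illustrative name
    DBInstanceClass='db.t3.small',
    Engine='postgres',
    MasterUsername='master',
    MasterUserPassword='examplepassword',
    AllocatedStorage=30,
    StorageEncrypted=True                        # encrypt the underlying volume at rest
)

# enforce SSL for data in transit by setting rds.force_ssl in a custom parameter group
rds.modify_db_parameter_group(
    DBParameterGroupName='custom-postgres-params',   # hypothetical pre-created parameter group
    Parameters=[{
        'ParameterName': 'rds.force_ssl',
        'ParameterValue': '1',
        'ApplyMethod': 'immediate'
    }]
)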
These three database types are limited to 16 TB per volume and can use numerous different RDS instance types, allowing you to scale an instance from small all the way to 8xlarge.
Amazon Aurora
Amazon Aurora is a next-generation, cloud-native database engine that is currently compatible with the MySQL and PostgreSQL database types. The benefit of Aurora is that it decouples processing from storage. All the data is stored on a synchronously replicated volume that spans three availability zones, and the processing of SQL requests is performed on the cluster instances. The instances have no local storage, and they all access the cluster volume at the same time, so the performance of the cluster can be scaled almost linearly by adding nodes.
The write node in an Aurora cluster, also called the primary instance, is used to process all write requests. The primary instance type needs to be scaled to the write performance requirements of your application and can be easily resized by promoting a larger read replica to the primary role. All other members of the cluster are called replica instances, and they can respond to read requests. The primary and the replicas have different DNS names to which you send requests, which means you can simply configure your application with two FQDN targets—one for the writes and another for the reads—and do not need to handle the read/write distribution on your own.
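The two endpoints can also be read programmatically. The following is a minimal boto3 sketch, assuming a hypothetical existing Aurora cluster named my-aurora-cluster; the cluster endpoint accepts writes, and the reader endpoint spreads reads across the replicas.

import boto3

rds = boto3.client('rds', region_name='us-east-2')

# look up an existing Aurora cluster (the identifier is hypothetical)
cluster = rds.describe_db_clusters(
    DBClusterIdentifier='my-aurora-cluster'
)['DBClusters'][0]

writer_fqdn = cluster['Endpoint']        # cluster endpoint - send writes here
reader_fqdn = cluster['ReaderEndpoint']  # reader endpoint - send reads here
print(writer_fqdn, reader_fqdn)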
Because the primary and replica instances have access to the same synchronously replicated cluster volume, you can also instantly promote any read replica into the primary role if the primary instance fails or if the availability zone where the primary instance is running experiences difficulties. Figure 4-6 illustrates how the Aurora design ensures synchronous writes and decouples storage from the compute layer.
Figure 4-6 Design of an Aurora Database Cluster
An Aurora cluster can scale quite a bit because you can add up to 15 replicas to the primary instance while additionally adding another 16 asynchronous replicas in another region. The Aurora engine also extends the maximum cluster volume to 64 TB, delivering not only a performance advantage but also a capacity advantage over traditional open-source databases, while maintaining the ability to use SSL for encryption in transit and delivering built-in encryption at rest.
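Because Aurora replicas are simply additional instances attached to the shared cluster volume, scaling out is a matter of creating another instance in the cluster. A minimal boto3 sketch, again assuming the hypothetical my-aurora-cluster from above:

import boto3

rds = boto3.client('rds', region_name='us-east-2')

# add another replica instance to an existing Aurora cluster
rds.create_db_instance(
    DBInstanceIdentifier='my-aurora-replica-2',  # illustrative instance name
    DBClusterIdentifier='my-aurora-cluster',     # the cluster whose volume it will attach to
    Engine='aurora-mysql',                       # must match the cluster engine
    DBInstanceClass='db.r5.large'                # replica size, independent of the primary
)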
Aurora is also available in a serverless on-demand mode as a pay-per-use service. This is a great option for any kind of transient SQL cluster where keeping the primary and replicas running 24/7 would cause unnecessary costs. Serverless Aurora also handles all scaling and capacity management automatically, so you can send as many requests as you need and always get a response. This essentially allows you to support very spiky applications where you are not sure of the performance required before the requests start rolling in.
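As a sketch of what this looks like with boto3 (assuming Aurora Serverless v1; the identifier, engine, and capacity values are illustrative), the cluster is created with the serverless engine mode and a scaling configuration instead of fixed instances:

import boto3

rds = boto3.client('rds', region_name='us-east-2')

# create an on-demand (serverless) Aurora cluster - no instances to manage
rds.create_db_cluster(
    DBClusterIdentifier='my-serverless-cluster',  # illustrative name
    Engine='aurora-mysql',
    EngineMode='serverless',                      # serverless on-demand mode
    MasterUsername='master',
    MasterUserPassword='examplepassword',
    ScalingConfiguration={
        'MinCapacity': 1,             # smallest capacity the cluster can scale down to
        'MaxCapacity': 16,            # largest capacity the cluster can scale up to
        'AutoPause': True,            # pause the cluster when idle to avoid unnecessary costs
        'SecondsUntilAutoPause': 300
    }
)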
Oracle and Microsoft SQL on RDS
Traditional enterprise databases are sometimes the only option, so RDS allows you to deploy an Oracle 11g or a Microsoft SQL Server 2008 or newer database as a service. The cost of these two engine types can have the licensing included, so there is no need to spend large sums of money on licensing upfront. There is, of course, also the option to bring your own license for each.
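The licensing model is simply a parameter on the create call. A minimal boto3 sketch follows; the identifier and sizing values are illustrative, and bring-your-own-license is only available where the engine supports it.

import boto3

rds = boto3.client('rds', region_name='us-east-2')

# deploy a Microsoft SQL Server instance with the license cost included in the hourly rate
rds.create_db_instance(
    DBInstanceIdentifier='mssqldbinstance',  # illustrative name
    DBInstanceClass='db.m5.large',
    Engine='sqlserver-se',                   # SQL Server Standard Edition
    LicenseModel='license-included',         # or 'bring-your-own-license' where supported
    MasterUsername='master',
    MasterUserPassword='examplepassword',
    AllocatedStorage=100,
    MultiAZ=True                             # Multi-AZ for high availability
)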
While you have a lot of choice in RDS instance types to run on, the Oracle and Microsoft SQL Server engines are limited to Multi-AZ deployments for high availability, provide no support for read replicas, and have a maximum size of 16 TB per volume. To protect data at rest, Transparent Data Encryption (TDE) is supported on both engine types, and SSL can be used to protect data in transit.
Scaling Databases
There are four general ways to scale database performance:
Vertical scaling: You can give a single database engine more power by adding more CPU and RAM.
Horizontal scaling: You can give a database cluster more power by adding more instances.
Read offloading: You can add read replicas and redirect read traffic to them.
Sharding: You can distribute the data across multiple database engines, with each one holding one section, or shard, of data.
With relational databases, vertical scaling always works, but it has a maximum limit. In AWS, that limit is the largest instance size that can be deployed in the service. An alternative is horizontal scaling, but relational databases are generally not well suited to scaling horizontally. The atomicity of SQL transactions usually means that the whole transaction must be processed by one server, and sometimes even in one thread on a single CPU.
If an RDS database is deployed in a Multi-AZ configuration, the resizing can be done transparently because the slave database is resized first, the data is synchronized, the connection fails over, and the slave becomes the master while the previous master instance is resized. When the resizing is complete, data is again synchronized, and a failover is performed to the previous master instance.
Example 4-8 uses the boto3 Python SDK to increase the instance size from db.t3.small to db.t3.medium for the instance created in the previous example.
Example 4-8 Python SDK (boto3) Script That Can Be Used to Resize an RDS Instance
import boto3  # boto3 is the AWS SDK for Python

client = boto3.client('rds')  # define the RDS client to be used

response = client.modify_db_instance(      # modify an existing instance
    DBInstanceIdentifier='javadbinstance',  # specify the instance ID
    DBInstanceClass='db.t3.medium',         # define the new size
    ApplyImmediately=True                   # run the command immediately (will not impact
                                            # availability since we set the database to be Multi-AZ)
)
Another way of scaling is to distribute the read and write transactions across multiple nodes. A typical relational database is more read intensive than write intensive, with a typical read-to-write ratio of 80:20 or even 90:10. By introducing one or more read replicas, you can offload 80 percent or even 90 percent of the traffic from your write node. Aurora excels at read replica scaling, whereas the other engines that support read replicas use only asynchronous replication, which means the data read from a replica might be stale and is therefore not as easily distributed across the cluster. But even asynchronous replicas can be a great benefit for offloading the write master where historical analytics and business intelligence applications are concerned. A simple way to take advantage of read replicas is to route queries at the application level, as sketched below.
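The following is a minimal, hypothetical sketch of application-level read/write routing; the endpoint names are placeholders for the values RDS returns for your deployment, and the actual database driver connection is omitted.

import itertools

# hypothetical endpoints - substitute the values returned by RDS for your deployment
WRITER_ENDPOINT = 'javadbinstance.abc123.us-east-2.rds.amazonaws.com'
READER_ENDPOINTS = [
    'javadbinstance-replica-1.abc123.us-east-2.rds.amazonaws.com',
    'javadbinstance-replica-2.abc123.us-east-2.rds.amazonaws.com',
]
_reader_cycle = itertools.cycle(READER_ENDPOINTS)  # round-robin over the read replicas

def endpoint_for(is_write):
    """Return the host the application should connect to for a given query."""
    if is_write:
        return WRITER_ENDPOINT   # all writes go to the write node
    return next(_reader_cycle)   # reads are spread across the replicas (data may be slightly stale)

An INSERT would be sent to endpoint_for(True), while a reporting SELECT would go to endpoint_for(False); because the replicas lag slightly behind, this routing is only appropriate for reads that tolerate stale data.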
Typically the last resort for scaling relational databases is to shard the data. Essentially this means that a dataset is sliced up into meaningful chunks and distributed across multiple masters, thus linearly increasing write performance.
For example, imagine a phone directory in a database with names from A to Z. When you need more performance, you can simply split up the database into names starting with A to M and N to Z. This way, you have two databases to write to, thus theoretically doubling the performance. Figure 4-7 illustrates the principle of sharding RDS databases to achieve better performance.
Figure 4-7 Sharding a Phone Directory into Two Databases
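In code, the routing logic for this example can be as simple as checking the first letter of the name. The following is a short Python sketch with hypothetical shard endpoints:

# hypothetical shard endpoints for the phone directory example
SHARDS = {
    'A-M': 'directory-shard-1.abc123.us-east-2.rds.amazonaws.com',
    'N-Z': 'directory-shard-2.abc123.us-east-2.rds.amazonaws.com',
}

def shard_for(name):
    """Return the database shard that holds entries for the given name."""
    first_letter = name.strip().upper()[0]
    return SHARDS['A-M'] if first_letter <= 'M' else SHARDS['N-Z']

# writes for "Adams" go to shard 1, writes for "Smith" go to shard 2
print(shard_for('Adams'), shard_for('Smith'))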
However, the limitation of sharding becomes immediately apparent when you try to perform analytics: you need to access both databases, join the two tables together, and only then perform the analytics or BI operation. Figure 4-8 illustrates tables from sharded databases being joined to an analytical database.
Figure 4-8 Steps Required for Analytics on Sharded Databases