Relational Versus Nonrelational Databases
Before doing a deep dive into the database services available in AWS, you need to take a look at the two major categories of databases that exist. In essence, a database is a collection of data that can be accessed in a certain ordered manner. Traditionally enterprise databases have been designed using a relational format. Enterprises traditionally needed to collect strictly formatted and well-structured data and then perform queries to get insights into how the different pieces of data in the database related to each other. When running business intelligence, analytics, ERP, accounting, management, and other tasks commonly done in an enterprise, it is always preferred to run these tasks on a well-structured dataset to get the clearest results and easily identify trends and outliers. These are typically SQL databases that are designed to run a whole SQL command on one server or even one CPU thread.
But the world has changed. Internet-connected companies today are ingesting data from sources where the structure is undefined, where data is stored for temporary purposes, where relationship information is not provided, or where the relationships are so complex that the sheer volume of data would easily overwhelm a traditional database server. The Internet-connected world needs databases where data can be stored at high velocity, where performance can be scaled linearly by adding nodes, and where data can be stored in an arbitrary format with an arbitrary structure. A new generation of databases has emerged; these databases are called NoSQL (or Not only SQL). These database models are designed to store key/value pairs, documents, data payloads of arbitrary structure, graphs, and so on. Also, nonrelational databases are typically schema-less. Figure 4-4 illustrates different data structures of SQL and NoSQL databases.
Figure 4-4 SQL Versus NoSQL Databases
Choosing which type of database to use is essentially governed by the data model. A typical relational database model is strictly structured by row and column, as illustrated in Table 4-2. The data must fit fully within one row and is required to be structured to fit the categories defined in the columns.
Table 4-2 Relational Database Table Example
Index |
Name |
Surname |
Occupation |
Active |
0000 |
Anthony |
Soprano |
Waste Management Consultant |
Y |
0001 |
Christopher |
Moltisanti |
Disposal Operator |
N |
Different columns of a traditional database are indexed to expedite the retrieval of the data from the database. The index is usually loaded into memory and allows for very fast retrieval of specific pieces of data. Traditional databases are usually also ACID compliant, where ACID stands for
Atomicity: Every transaction against an SQL database is atomic and (usually) cannot be broken down into smaller pieces. If a transaction fails, the whole operation needs to be restarted.
Consistency: Data must be consistent at all times, even if replicated across a cluster. This means that data will be made unavailable until the replication has completed and the data is confirmed to be consistent.
Isolation: Concurrent transactions can never interfere with each other. In SQL, for example, you always put a lock on a table or an index that you are modifying so no other transaction can interfere with it.
Durability: Data must be stored durably and must also have the ability to be recovered in case of a failure. You should keep a transaction log that can be replayed and one or more backups of a database to maintain durability.
With a NoSQL database, you can represent the whole dataset of one row as a set of key/value pairs that are stored and retrieved as a document. This document needs to be encoded in a format from which the application can build the rows and columns represented in the document. Example 4-3 demonstrates a JSON-formatted document that represents the same data as the first row of your SQL table (refer to Table 4-2).
Example 4-3 JSON-Formatted Data with Key/Value Pairs Matching the First Row of Table 4-2
{ "Index":"0000", "Name":" Anthony ", "Surname":" Soprano" "Occupation":" Waste Management Consultant" "Active":"Y" }
To speed up retrieval of the data, you need to select a key that can appear in all documents and allow for the prompt retrieval of the complete dataset. The benefit of this type of format is that only a certain part of the data—not the complete dataset—defines the structure. So you can essentially shorten or extend the dataset with any number of additional key/value pairs on the fly. For example, if you want to add the date of last activity for a user, you can simply add an additional key/value pair to the document denoting the date, as demonstrated in Example 4-4.
Example 4-4 Adding the Last Active Attribute to the Data
{ "Index":"0001", "Name":" Christopher ", "Surname":" Moltisanti " "Occupation":" Disposal Operator " "Active":"N" "Last active": 13052007 }
You could even structure the day, month, and year as their own nested key/value pairs in the Last active key, as demonstrated in Example 4-5.
Example 4-5 Adding an Entry as Nested Key/Value Pairs
{ "Index":"0001", "Name":" Christopher ", "Surname":" Moltisanti " "Occupation":" Disposal Operator " "Active":"N" "Last active": [ { "Day":13}, { "Month":"05"}, { "Year":"2007" }, ] }
The ability to nest keys in your database adds a lot more flexibility to the way you store and access the data in the NoSQL database. Just think of the impact of the schema modifications required to fit the new type of data into an existing SQL table: Not only would the process be disruptive to ongoing operations, but rolling back changes to a schema is sometimes impossible. With NoSQL, you can change the data model on the fly by adding and removing key/value pairs to items with ease.
NoSQL databases are designed with linear scalability in mind as all data is distributed across multiple nodes, which become authoritative for a certain subset of indexing keys. To retrieve the data, you usually address a common front end that then delivers the data by contacting multiple back ends and delivers documents from all of them in parallel. With a SQL database, that design is very hard to implement as the transaction usually cannot be easily distributed across multiple back ends. Unlike SQL databases, NoSQL databases usually conform to the BASE database ideology, where BASE stands for
Basic availability: Availability of the database is the main requirement. The database must seem to be up all the time, and reads from/writes to the database must succeed as much as possible.
Soft state: The state of the system is allowed to change over time: The database is allowed to be repartitioned (by adding or removing nodes), and the data can be expired, deleted, or offloaded. The replication system ensures that the data is replicated as soon as possible, but the availability must not be affected by the any state changes.
Eventual consistency: The system will eventually (after a period of time) achieve consistency of data across all the nodes of a cluster. The data will be available even during replication, and a client requesting data could access a node with a stale piece of data. To mitigate eventual consistency, strongly consistent reads can be utilized to read data from multiple nodes to ensure that the data is always in a consistent state when being read. The read consistency must be handled by the application.