PostgreSQL: Petabyte-scale at AWS

Matheus Neuhaus
15 min read · Apr 8, 2023


Managed vs Self-managed PostgreSQL

Administering a PostgreSQL database on Amazon EKS (Elastic Kubernetes Service), on Amazon EC2 (Elastic Compute Cloud), and on Amazon RDS (Relational Database Service) involves significant differences in deployment, management, and cost.

Amazon EC2 is a virtual machine-based service that provides scalable computing resources that can run various types of applications, including PostgreSQL. To deploy PostgreSQL on EC2, you need to create an EC2 instance, install PostgreSQL on it, and configure its networking, storage, and security settings. The benefits of using EC2 include its flexibility, customizability, and cost-effectiveness, as you only pay for what you use. However, this approach requires more infrastructure management and expertise compared to managed services like RDS.

Amazon EKS is a managed Kubernetes service that provides a highly scalable and available platform for running containerized applications. In this approach, you package PostgreSQL in a container and deploy it to Amazon EKS. The benefits of using EKS include the ability to manage containerized applications using Kubernetes and the flexibility to deploy PostgreSQL with custom configurations and access to Kubernetes ecosystem tools. However, this approach requires more expertise in containerization, Kubernetes, and infrastructure management.

Amazon RDS, on the other hand, is a managed relational database service that provides an easy-to-use and highly available platform for deploying and managing PostgreSQL databases. With RDS, you can deploy a PostgreSQL database with just a few clicks, and Amazon handles the underlying infrastructure, backup, and scaling. The benefits of RDS include its simplicity, high availability, automated backups, and scalability, but it comes with less flexibility and customizability compared to running PostgreSQL on EKS.

The choice between EKS, EC2, and RDS depends on your specific use case, budget, and expertise. EKS provides more flexibility and control over PostgreSQL deployments, and EC2 offers the most customization options and cost-effectiveness, but both require more infrastructure management and expertise. RDS provides a simple and automated approach to deploying PostgreSQL databases, but comes with less flexibility and customizability.

In this article, we won’t walk you through a step-by-step guide on how to deploy PostgreSQL on EKS, EC2, or RDS; plenty of articles already cover that. We will assume you have an intermediate understanding of these AWS services and give you a big picture of successful real-world architectures.

Scaling PostgreSQL on EC2 (Elastic Compute Cloud)

There are several options for scaling PostgreSQL, as listed below:

  1. Vertical Scaling: This involves upgrading the hardware resources of a single database server, such as increasing the CPU, RAM, or storage capacity. Vertical scaling can provide a temporary boost in performance, but it may not be a sustainable solution in the long term.
  2. Horizontal Scaling: This involves adding more database servers to a cluster and distributing the workload across them. Horizontal scaling can provide better scalability and reliability by allowing for more processing power and redundancy.
  3. Read Replicas: This involves creating one or more read-only copies of the database that can handle read-only queries. Read replicas improve read performance by reducing the load on the master database server. This can be achieved using streaming replication, since WAL records are shipped to the replicas as soon as they are generated, without waiting for the log files to be filled.
  4. Sharding: This involves partitioning the database into smaller, more manageable chunks called shards. Each shard contains a subset of the data and is hosted on a separate server. Sharding can provide better performance and scalability, especially for large databases.
  5. Connection Pooling: This involves using a connection pooler to manage a pool of database connections that can be shared among multiple clients. Connection pooling can improve the efficiency of database connections and reduce the overhead of creating and destroying connections.
  6. Caching (Shared Buffers or Redis): This involves using a caching layer to store frequently accessed data in memory, reducing the number of database queries required. Caching can improve performance and reduce database load, especially for read-heavy workloads.

Read Replica

For a read replica setup we have one server (master) accepting all writes and one or more servers (standbys) answering most of the reads. The process of replicating data from the master to the standby servers can be achieved using either PostgreSQL’s built-in replication capability or a middleware solution like Pgpool-II. However, when it comes to write requests, this approach has limitations: only one node is designated as the master, so all write operations must go through it. This can become a bottleneck if the application has a high write-to-read ratio, as the writes will need to be replicated to all the standbys, which can put a significant load on the infrastructure and impact performance.
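
As a minimal sketch, the built-in streaming replication can be wired up like this on PostgreSQL 12 or later (the hostname and user are illustrative, and a standby is typically seeded from a pg_basebackup copy of the master first):

# On the master (postgresql.conf):
wal_level = replica       # ship enough WAL for physical standbys
max_wal_senders = 5       # allow up to 5 replication connections

# On each standby: create an empty standby.signal file in the data
# directory, then point the standby at the master (postgresql.conf):
primary_conninfo = 'host=master.example.internal port=5432 user=replicator'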

In the following paragraphs we’ll be using Pgpool-II, a middleware that provides connection pooling and load balancing for PostgreSQL databases. It can distribute client connections among multiple PostgreSQL servers, allowing for better utilization of resources and improved scalability. Pgpool-II can be configured with different load balancing algorithms, such as round-robin or least-connections, and provides support for failover and replication.

To set up this architecture, open the pgpool.conf file and edit the variables below (a complete example follows the list):

  • listen_addresses: This sets the IP address on which pgpool-II should listen for incoming connections. In this case, it is set to the loopback address, which means that pgpool will only listen for connections from the same machine.
  • backend_hostname0: This sets the hostname or IP address of the first PostgreSQL backend node. In this case, it is set to the loopback address, which means that the backend node is on the same machine as pgpool.
  • backend_port0: This sets the port number on which the first PostgreSQL backend node is listening for incoming connections.
  • backend_weight0: This sets the load balancing ratio for the first PostgreSQL backend node. A weight of 0 means that this node will not receive any SELECT queries. Instead, all SELECT queries will be directed to the second node. This is because, in this configuration, the first node is the master and the second node is the standby.
  • backend_data_directory0: This sets the data directory of the first PostgreSQL backend node.
  • backend_hostname1: This sets the hostname or IP address of the second PostgreSQL backend node.
  • backend_port1: This sets the port number on which the second PostgreSQL backend node is listening for incoming connections.
  • backend_weight1: This sets the load balancing ratio for the second PostgreSQL backend node. A weight of 1 means that this node will receive all SELECT queries that are not directed to the first node.
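
Putting those variables together, a minimal pgpool.conf for this two-node setup might look like the sketch below (ports, addresses, and data directories are illustrative):

listen_addresses = 'localhost'   # accept connections from this machine only
port = 9999

# Node 0: the master; weight 0 keeps load-balanced SELECTs off it
backend_hostname0 = '127.0.0.1'
backend_port0 = 5432
backend_weight0 = 0
backend_data_directory0 = '/var/lib/postgresql/data0'

# Node 1: the standby; weight 1 sends it all load-balanced SELECTs
backend_hostname1 = '10.0.1.12'  # hypothetical standby address
backend_port1 = 5433
backend_weight1 = 1
backend_data_directory1 = '/var/lib/postgresql/data1'

load_balance_mode = on           # enable read/write splitting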

Pgpool-II also has more variables that can be used to configure automatic failover and load balancing (as we saw with the weights), as well as connection pooling.

Sharding

Sharding is a technique used to horizontally partition a database by distributing its data across multiple servers. In sharding, the data is divided into horizontal blocks called shards or partitions, with each shard containing a subset of the data. These shards are then stored on different servers, typically in a distributed or clustered environment.

For example, if we have a database with 1,000 records, we can consider splitting them into 10 shards of 100 records each and hosting the shards on 10 different servers. This allows us to scale the database horizontally by adding more servers to the cluster as the data grows, rather than scaling vertically by increasing the resources of a single server. Sharding is different from vertical partitioning, which breaks a table into smaller tables based on its attributes (columns). In sharding, the data is split horizontally based on some criterion such as a range of values, geographical location, or any other relevant factor.

Choosing the right sharding key is a critical decision when sharding a PostgreSQL database. The sharding key determines how data is partitioned horizontally across the different shards. A poorly chosen sharding key can lead to poor performance and scalability, since the key should ensure that data is evenly distributed across all the shards.
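
As a small illustration of key-based routing, assuming four shards and customer_id as the sharding key, the application can compute the target shard with PostgreSQL’s built-in hashtext() function and send the query to the server that owns that shard:

-- shard count (4) and key value are illustrative
SELECT abs(hashtext('customer-42')) % 4 AS target_shard;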

Sharding can provide significant benefits in terms of scalability, performance, and fault-tolerance, but it also comes with its own set of challenges: ensuring data consistency across shards can be difficult, and some queries may require joining data from multiple shards, which can be slow and resource-intensive. Additionally, sharding requires careful planning and design to ensure that the shards are balanced and that queries are directed to the appropriate shards.

The query-routing part of a sharded setup can be handled by middleware such as Pgpool-II.

Caching with Shared Buffers

Shared buffers exist to address a very common problem: when many users are reading and writing data across different tables, going to the files and directories directly results in a non-scalable system, because it involves many file operations (opening, reading, writing, and locking) that can be slow and cause contention.

To improve performance, PostgreSQL uses the concept of shared buffers, which is a memory area in RAM. Instead of accessing files and directories, PostgreSQL stores frequently accessed data in shared buffers to avoid disk I/O. When data is requested, the system checks if it’s in the shared buffer, and if it is, it’s retrieved from there instead of going to the disk. This can result in significant performance improvements because accessing data from RAM is much faster than accessing data from disk.

The amount of memory allocated to shared buffers is specified in the shared_buffers parameter in the postgresql.conf configuration file. This parameter sets the size of a fixed block of shared memory allocated when the server starts. The appropriate size of the shared buffers depends on the size of the database and the available system memory. Too small shared buffers can result in more disk I/O, while too large shared buffers can lead to memory contention and performance issues.
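
As a rough sketch in postgresql.conf, assuming a machine with 64 GB of RAM (about 25% of system memory is a common starting point, but the right value depends on your workload):

shared_buffers = 16GB         # ~25% of RAM
effective_cache_size = 48GB   # planner hint about OS-level caching, not an allocation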

Caching with Redis

PostgreSQL has built-in caching mechanisms that can help improve performance, such as the shared buffers we already discussed. However, caching can also be implemented with external tools like Redis, which offer more advanced features like distributed caching and expiration policies.

To use Redis as a caching layer for PostgreSQL, the usual approach is application-side (cache-aside) caching: the application checks Redis first, falls back to PostgreSQL on a miss, and writes the result back to Redis with an expiration policy, reducing the number of queries that hit the database.
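
A minimal cache-aside sketch using the redis-py and psycopg2 client libraries (the connection settings, the users table, and the key layout are illustrative):

import json
import redis
import psycopg2

cache = redis.Redis(host="localhost", port=6379)
db = psycopg2.connect("dbname=app user=app")  # hypothetical DSN

def get_user(user_id: int) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)  # cache hit: PostgreSQL is never touched
    with db.cursor() as cur:
        cur.execute("SELECT id, name FROM users WHERE id = %s", (user_id,))
        row = cur.fetchone()
    if row is None:
        raise KeyError(user_id)    # no such user
    user = {"id": row[0], "name": row[1]}
    cache.set(key, json.dumps(user), ex=300)  # expiration policy: 5 minutes
    return user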

When deciding whether to use Redis for caching in PostgreSQL, consider factors like the complexity of your application, the size of your database, and your performance requirements. Redis can be a powerful caching solution, but it adds complexity to your architecture and requires additional infrastructure to manage. It may be more appropriate for larger or more complex applications, or for use cases where PostgreSQL’s built-in caching mechanisms are insufficient.

Connection Pooling

Connection pooling is a technique used to improve the performance and scalability of database applications. It works by creating a pool of database connections that can be reused by multiple client applications, rather than creating a new connection for every client request. This approach reduces the overhead of connection setup and teardown and minimizes the amount of system resources needed to handle a large number of clients.

Pgpool-II works as a proxy between the client applications and the PostgreSQL database. It creates a pool of database connections and routes client requests to the appropriate backend server. When a client application requests a connection to the database, pgpool-II checks if there is an available connection in the pool. If there is an idle connection, it is reused. If not, pgpool-II creates a new connection to the database.

By default, Pgpool-II listens on port 9999. So, if your PostgreSQL server is listening on port 5432, you can connect to pgpool-II by using the following command:

psql -h <pgpool-II hostname> -p 9999 -U <username> <database name>

Once connected, Pgpool-II will manage the connections to the PostgreSQL servers and distribute the workload across the servers. It will also handle connection pooling, so that clients can reuse connections to the servers, rather than establishing a new connection for each request.
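
The relevant pgpool.conf knobs, with illustrative values:

num_init_children = 32   # pgpool child processes, i.e. max concurrent client sessions
max_pool = 4             # cached backend connections kept per child
connection_cache = on    # reuse backend connections instead of reopening them
child_life_time = 300    # recycle idle children after 5 minutes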

Finally, an overview of everything covered so far is sketched below.

Scaling Strategies for PostgreSQL on an EC2 machine

Common Scalability Issues — PostgreSQL Databases

Scalability issues can often be traced to a variety of factors, such as poorly written queries, suboptimal database parameters, and inadequate use of database features.

For instance, poorly written queries that miss a WHERE clause can lead to the database returning a large number of rows, which can be resource-intensive and slow down the overall performance of the system. Keeping default database parameters may also result in suboptimal performance, as these parameters may not be tuned to the specific needs of the application.

Not creating enough indexes, or not creating the right ones, can also negatively impact performance, as indexes are used to speed up queries and data retrieval. Similarly, not fully exploiting available object types such as materialized views and partitions can lead to suboptimal performance as well.
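
For instance, a frequent lookup on a hypothetical orders table can be served by an index, built here without blocking concurrent writes:

-- CONCURRENTLY avoids taking a lock that blocks writes during the build
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);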

Using the wrong data types is another common mistake that can lead to scalability issues. For example, using VARCHAR to store dates can result in inefficient sorting and querying operations.
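
A sketch of correcting that anti-pattern on the same hypothetical table; the USING clause converts the existing text values in place:

ALTER TABLE orders
    ALTER COLUMN created_at TYPE date USING created_at::date;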

Code issues such as transactions locking each other can also be a factor in scalability issues. For instance, if two transactions are trying to modify the same data simultaneously, this can result in a deadlock, where neither transaction can proceed until the other one finishes.
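
A classic illustration, assuming a hypothetical accounts table and two concurrent sessions that take row locks in opposite order:

-- Session A:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- locks row 1

-- Session B, concurrently:
BEGIN;
UPDATE accounts SET balance = balance - 10 WHERE id = 2;  -- locks row 2

-- Session A:
UPDATE accounts SET balance = balance + 10 WHERE id = 2;  -- blocks, waiting on B

-- Session B:
UPDATE accounts SET balance = balance + 10 WHERE id = 1;  -- deadlock: PostgreSQL
                                                          -- detects it and aborts one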

Finally, not using proper normalization techniques while designing tables can also lead to scalability issues. Proper normalization helps reduce data redundancy and ensures efficient querying and storage of data.

Data Replication — Disaster Recovery Strategies for EC2 Machines

Data replication is a key component of disaster recovery strategies in PostgreSQL. In the event of a primary server failure, having one or more standby servers with up-to-date copies of the data can help to ensure continuity of service and minimize the risk of data loss.

By replicating data from the primary database server to one or more standby servers, data replication provides a means of ensuring that there is a reliable and up-to-date copy of the data available in the event of a disaster or system failure. This is critical for disaster recovery scenarios where downtime and data loss can have serious consequences for the business or organization.

PostgreSQL supports several types of replication, each with its own features and advantages:

  1. Streaming replication: This is the most commonly used replication method in PostgreSQL. It uses PostgreSQL’s built-in WAL shipping to stream changes made on the primary database to one or more standby servers in near real-time; the standbys apply the WAL records at the physical (block) level.
  2. Logical replication: Logical replication is a feature that was introduced in PostgreSQL 10 that allows you to replicate data at a logical level rather than at a physical level. It provides a more flexible and efficient way to replicate data than traditional physical replication.
  3. Asynchronous replication: This replication method does not require the primary server to wait for confirmation from standby servers before committing transactions. It is often used in situations where data consistency is not critical and a small amount of data loss is acceptable.
  4. Synchronous replication: This replication method ensures that data changes made on the primary database server are immediately replicated to all standby servers, and that the primary server waits for confirmation from at least one standby server that the changes have been successfully written to disk before it acknowledges the transaction as committed. This method ensures data consistency but can have a performance impact (see the sketch after this list).
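
As a sketch, synchronous replication is enabled on the primary in postgresql.conf; the standby name below is illustrative and must match the standby’s application_name:

synchronous_standby_names = 'standby1'  # wait for this standby on every commit
synchronous_commit = on                 # commit returns only after the standby flushes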

We also have bi-directional replication (BDR), which won’t be covered in this article.

To implement the disaster recovery strategies below, you need to carefully plan and test your backup and recovery procedures. You should also consider using a tool like Barman or WAL-E to automate your backup and recovery processes.

  1. Backup and Restore: This strategy involves creating regular backups of the database and restoring them in case of a disaster. This can be done using the pg_dump and pg_restore tools provided by Postgres. Backups can be stored on a different server or in the cloud to protect against hardware failures.
  2. Warm Standby: A warm standby server is a replica of the primary server that is kept up-to-date with continuous streaming replication. In case of a disaster, the warm standby server can be promoted to the primary server to minimize downtime.
  3. Hot Standby: A hot standby server is similar to a warm standby server, but it is kept in a state where it can be promoted to the primary server almost immediately. This is achieved using synchronous replication, which ensures that all transactions are committed to both the primary and standby servers before they are considered complete.
  4. Continuous Archiving and Point-in-Time Recovery (PITR): This strategy involves creating continuous backups of the database using the archive_command setting in Postgres. In case of a disaster, the database can be restored to a specific point in time using the backups and the WAL (Write-Ahead Log) files (a configuration sketch follows this list).
  5. Multi-region Replication: This strategy involves replicating the database to multiple regions to ensure that it is available even if one region experiences a disaster. This can be done using tools like pglogical, which provide logical replication between different Postgres instances.
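
A minimal continuous-archiving sketch for strategy 4, assuming WAL segments are shipped to a hypothetical S3 bucket (%p is the path of the segment, %f its file name):

# postgresql.conf on the primary
archive_mode = on
archive_command = 'aws s3 cp %p s3://my-wal-archive/%f'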

PostgreSQL — EKS (Elastic Kubernetes Service)

Disclaimer: we need to be very careful when deploying PostgreSQL. Our initial architecture can compromise the entire project in the long term. There are no right or wrong architectures; it all depends on your requirements. This article is intended to provide you with useful information on how to choose the best architecture for your use case.

For EKS, I strongly recommend using the Amazon EFS file system, the Amazon EFS CSI driver, and a PostgreSQL Operator (I have had a good experience with the Operator from Crunchy Data) to deploy PostgreSQL into Kubernetes.

It’s important to note that the architecture is essentially fixed: the Operator does not offer the kind of customized architecture we saw in the previous section when deploying to EC2. To enhance the primary/standby setup, it’s possible to incorporate a PgBouncer connection pooler between the application and the PostgreSQL database. Furthermore, replica clusters can be utilized by setting up one or more disaster recovery clusters in distinct regions.
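
For reference, here is a trimmed PostgresCluster manifest in the style of the Crunchy Data operator’s examples, with PgBouncer enabled (names and sizes are illustrative, and the exact fields may vary by operator version):

apiVersion: postgres-operator.crunchydata.com/v1beta1
kind: PostgresCluster
metadata:
  name: hippo
spec:
  postgresVersion: 14
  instances:
    - name: instance1
      replicas: 2                # a primary plus a standby, managed for HA
      dataVolumeClaimSpec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 10Gi
  backups:
    pgbackrest:
      repos:
        - name: repo1
          volume:
            volumeClaimSpec:
              accessModes: ["ReadWriteOnce"]
              resources:
                requests:
                  storage: 10Gi
  proxy:
    pgBouncer:
      replicas: 2                # connection pooler between the app and the database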

Pgpool-II vs PgBouncer — Postgres Operator (EKS)

Pgpool-II is a sophisticated middleware that provides several features, including connection pooling, load balancing, query caching, and failover. It is designed for high availability and scalability, and is a good option for large-scale, high-traffic applications. Pgpool-II sits between the application and the PostgreSQL database, and manages connections to the database by creating a pool of reusable connections. This can improve the performance of the database and reduce the overhead of creating new connections.

On the other hand, PgBouncer is a lightweight connection pooler that focuses primarily on connection pooling. It is designed for low-latency, high-concurrency applications, and can handle a large number of connections with low overhead. PgBouncer also sits between the application and the PostgreSQL database, and manages connections to the database by creating a pool of reusable connections.

PgBouncer is already integrated with the Postgres Operator, providing some level of customization, including a way to define custom sidecar containers for the pgBouncer Pods.

As you can see, the number of customizations we can make when using the Postgres Operator is limited.

The EKS deployment comes with a big trade-off: limited architectural customization in exchange for a Kubernetes Operator managing the life-cycle of your cluster and nodes.

High Availability and Disaster Recovery — Postgres Operator (EKS)

The Postgres Operator relies on Patroni, which gives it synchronous replication across servers for high availability. The same Operator also offers streaming replication for disaster recovery.

Again, we can see that there are some limitations when using an Operator in Kubernetes.

Blue/Green Deployment — Amazon RDS

As we already know, Amazon RDS offers a set of nice features intended to reduce the overwhelming work of setting up a PostgreSQL database. The intrinsic cost is having fewer options to customize.

AWS provides a blue/green deployment workflow for RDS: you create a synchronized staging (green) copy of the production (blue) environment, validate your changes against it, and then switch traffic over with minimal downtime. This knowledge can also be extended to our self-managed clusters as well (a CLI sketch follows).
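
As a sketch with the AWS CLI (the deployment name, ARN, and target version are illustrative):

# Create the green environment from the existing (blue) instance
aws rds create-blue-green-deployment \
    --blue-green-deployment-name my-bg-deployment \
    --source arn:aws:rds:us-east-1:123456789012:db:my-postgres-db \
    --target-engine-version 15.3

# After validating the green environment, switch production traffic over
aws rds switchover-blue-green-deployment \
    --blue-green-deployment-identifier <deployment-id>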

Final Thoughts

EKS provides a containerized, scalable, and resilient Kubernetes infrastructure to manage your cluster, while EC2 instances provide the customization power needed to meet any requirement. RDS offers a managed database service that takes the operational overhead off your shoulders, making it easier to manage and maintain the database.

However, scaling a petabyte-scale database can be challenging, and there are several factors to consider, such as sharding, partitioning, load balancing, and connection pooling. Disaster recovery planning is also critical, as data loss or downtime can be catastrophic.

Customization is another important consideration, as PostgreSQL has many configuration options that can impact performance and scalability. Optimizing the database configuration, choosing the right hardware, and tuning the system for maximum performance can be a complex task that requires expertise and experience.

In conclusion, hosting a petabyte-scale PostgreSQL database on EKS, EC2, or RDS requires careful planning, testing, and ongoing management. While these platforms offer highly scalable and robust solutions, they also come with challenges and require expertise to optimize and maintain.
