Why study System Design?
You may have built some personal projects in college where you have a backend in NodeJS (or any other framework) and a database.
A user (or client) sends a request to your application. The backend server may do some calculations and perform CRUD operations on the database to return a response. This is fine for prototypes and personal projects, but in the real world, when we have actual users (in millions or billions), this simple architecture may not work. We need to think about scaling, fault tolerance, security, monitoring and all sorts of things to make our system reliable and efficient in all cases. That is why we study the different concepts in system design.
What is a Server?
A server is nothing but a physical machine (such as a laptop) where your application code is running. When you build a ReactJS or NodeJS application, it typically runs on http://localhost:8080. Localhost is the hostname that resolves to the IP address 127.0.0.1, which is the IP address of your own laptop.
For external websites, you type https://abc.com. When you hit this in your browser, it first resolves the domain name abc.com to the server's IP address (say 35.154.33.64) using DNS and then connects to that server.
Requesting https://abc.com is therefore the same as requesting 35.154.33.64:443. Here, 443 is the port (the default port of HTTPS).
Remembering IP Addresses is tough, so people generally buy domain names for it and point their domain name to their server’s IP address.
- How to deploy an application?
Your application runs on port 8080. Now, you want to expose it to the internet so that other people can visit your website. For this, you need to get a public IP address and attach it to your laptop so that people can hit https://<your_laptop_ip_address>:8080 to visit your site.
Doing all this and managing the server on your own is a pain, so people generally rent servers from cloud providers such as AWS, Azure, GCP, etc. They give you a virtual machine where you can run your application, and that machine also has a public IP address attached to it, so it can be reached from anywhere. In AWS, this virtual machine is called an EC2 instance. Putting your application code from your local laptop onto the cloud provider's virtual machine is called deployment.
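To make this concrete, here is a minimal sketch of the kind of app being deployed, assuming Node.js with TypeScript. The port 8080 matches the example above; binding to 0.0.0.0 (instead of localhost) is what makes the server reachable on the machine's public IP address.

```typescript
// A minimal sketch of "your application running on port 8080".
import { createServer } from "node:http";

const server = createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ message: "Hello from my server" }));
});

// 0.0.0.0 means "listen on all network interfaces", so the app is reachable
// via http://<public_ip>:8080 once a public IP is attached to the machine.
server.listen(8080, "0.0.0.0", () => {
  console.log("Listening on port 8080");
});
```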
Latency and Throughput
- Latency
Latency is the time taken for a single request to travel from the client to the server and back (or a single unit of work to complete). It is usually measured in milliseconds (ms).
Loading a webpage: If it takes 200ms for the server to send the page data back to the browser, the latency is 200ms.
In simple terms, a website that loads faster has low latency; one that loads slower has high latency.
Round Trip Time (RTT): The total time it takes for a request to go to the server and for the response to come back. RTT is sometimes used interchangeably with latency.
- Throughput
Throughput is the number of requests or units of work the system can handle per second. It is typically measured in requests per second (RPS) or transactions per second (TPS).
Every server has a limit: it can handle only X requests per second. Pushing more load than that can choke it, or it may go down.
Ideally, we want a system with high throughput and low latency.
- Example:
Latency: The time it takes for a car to travel from one point to another (e.g., 10 minutes).
Throughput: The number of cars that can travel on a highway in one hour (e.g., 1,000 cars).
In short, latency measures how long one request takes, while throughput measures how many requests the system can handle per second.
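If you want to get a feel for these numbers yourself, here is a rough sketch (assuming Node.js 18+ with a global fetch) that measures average latency and throughput against any URL. The URL and request count are placeholders, and because the requests run one after another, this measures a single client's view of the system.

```typescript
// A rough sketch: measure average latency (ms) and throughput (requests/sec).
async function measure(url: string, totalRequests: number): Promise<void> {
  const latencies: number[] = [];
  const start = performance.now();

  for (let i = 0; i < totalRequests; i++) {
    const t0 = performance.now();
    await fetch(url);                        // one round trip (RTT)
    latencies.push(performance.now() - t0);  // latency of this request in ms
  }

  const elapsedSeconds = (performance.now() - start) / 1000;
  const avgLatency = latencies.reduce((a, b) => a + b, 0) / latencies.length;
  console.log(`Average latency: ${avgLatency.toFixed(1)} ms`);
  console.log(`Throughput: ${(totalRequests / elapsedSeconds).toFixed(1)} requests/sec`);
}

measure("https://abc.com", 50);
```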
Scaling and its types
You often see that whenever someone launches their website on the internet and suddenly traffic increases, their website crashes. To prevent this, we need to scale our system.
Scaling means we need to increase the specs of our machine (like increasing RAM, CPU, storage etc) or add more machines to handle the load.
You can relate this to a mobile phone: a cheap phone with little RAM and storage hangs when you run heavy games or many applications simultaneously. The same happens with an EC2 instance: when a lot of traffic arrives at the same time, it starts choking, and that is when we need to scale our system.
- Types of Scaling
It is of 2 types:
Vertical Scaling (Scale Up/Down)
If we increase the specs (RAM, Storage, CPU) of the same machine to handle more load, then it is called vertical scaling.
This type of scaling is mostly used for SQL databases and stateful applications, because it is difficult to keep state consistent across machines in a horizontally scaled setup.
Horizontal Scaling (Scale Out/In)
Vertical scaling is not possible beyond a point: we can't increase a machine's specs infinitely, and eventually we hit a hardware limit.
The solution is to add more machines and distribute the incoming load across them. This is called horizontal scaling.
Ex: We have 8 clients and 2 machines and want to distribute our load equally, so the first 4 clients hit machine-1 and the next 4 clients hit machine-2. Clients are not smart; we can't give them 2 different IP addresses and let them decide which machine to hit, because they don't know anything about our system. For this, we put a load balancer in between. All the clients hit the load balancer, and the load balancer is responsible for routing the traffic to the least busy server.
In this setup, clients don't make requests directly to the servers. Instead, they send requests to the load balancer, which takes the incoming traffic and forwards it to the least busy machine.
Most of the time, in the real world, we use horizontal scaling.
Below is a picture where you can see 3 clients making requests, and the load balancer distributing the load across 3 EC2 instances equally.
Auto Scaling
Suppose you started a business and took it online. You rented an EC2 server to deploy your application. If a lot of users come to your website at the same time, your website may crash, because an EC2 server has limits (CPU, RAM, etc) on how many users it can serve concurrently. At that point, you do horizontal scaling: increase the number of EC2 instances and attach a load balancer.
Suppose one EC2 machine can serve 1,000 users without choking, but the number of users on our website is not constant. Some days we have 10,000 users, which can be served by 10 instances. Other days we have 100,000 users, and then we need 100 EC2 instances. One solution might be to keep the maximum number (100) of EC2 instances running all the time. That way we can serve all users at all times without any problem, but during low-traffic periods we are wasting money on extra instances: we only need 10 instances, yet we are running and paying for 100.
The better solution is to run only the required number of instances at any given time, and to add a mechanism so that if the CPU usage of an EC2 instance crosses a certain threshold (say 90%), another instance is launched and the traffic is redistributed, without us doing it manually. Changing the number of servers dynamically based on traffic is called Auto Scaling.
Note: These numbers are all hypothetical to make you understand the topic. If you want to find the actual threshold, then you can do Load Testing on your instance.
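To make the idea concrete, here is a hypothetical sketch of the scale-out/scale-in decision an auto scaler makes. In practice you configure thresholds and min/max instance counts in a managed service (such as an AWS Auto Scaling group) rather than writing this loop yourself; all numbers below are assumptions.

```typescript
// A hypothetical auto-scaling decision, evaluated periodically against metrics.
interface ScalingPolicy {
  scaleOutCpuPercent: number; // add an instance above this average CPU
  scaleInCpuPercent: number;  // remove an instance below this average CPU
  minInstances: number;
  maxInstances: number;
}

function decideInstanceCount(
  currentInstances: number,
  averageCpuPercent: number,
  policy: ScalingPolicy
): number {
  if (averageCpuPercent > policy.scaleOutCpuPercent) {
    return Math.min(currentInstances + 1, policy.maxInstances); // scale out
  }
  if (averageCpuPercent < policy.scaleInCpuPercent) {
    return Math.max(currentInstances - 1, policy.minInstances); // scale in
  }
  return currentInstances; // load is within the comfortable band
}

// Example: 10 instances at 95% average CPU -> scale out to 11.
console.log(decideInstanceCount(10, 95, {
  scaleOutCpuPercent: 90, scaleInCpuPercent: 30, minInstances: 10, maxInstances: 100,
}));
```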
Back-of-the-envelope Estimation
Above, in horizontal scaling, we saw that we need more servers to handle the load. In back-of-the-envelope estimation, we estimate roughly how many servers, how much storage, etc., we will need.
We do approximation here to make our calculation easy.
| Power of 2 | Approximate value | Power of 10 | Full name  | Short name |
|------------|-------------------|-------------|------------|------------|
| 10         | 1 Thousand        | 3           | 1 Kilobyte | 1 KB       |
| 20         | 1 Million         | 6           | 1 Megabyte | 1 MB       |
| 30         | 1 Billion         | 9           | 1 Gigabyte | 1 GB       |
| 40         | 1 Trillion        | 12          | 1 Terabyte | 1 TB       |
| 50         | 1 Quadrillion     | 15          | 1 Petabyte | 1 PB       |
There are many things you can calculate here, but I prefer to calculate only the following: load estimation, storage estimation, and resource estimation.
Let’s take an example of Twitter and do this calculation.
- Load Estimation
Here, ask for DAU (Daily Active Users). Then, calculate the number of reads and the number of writes.
Suppose Twitter has 100 million daily active users, and one user posts 10 tweets per day.
Number of tweets in one day = 100 million users * 10 tweets per user = 1 billion tweets.
This means the number of writes = 1 billion tweets per day.
Suppose 1 user reads 1000 tweets per day.
Number of reads in one day = 100 million users * 1,000 tweets = 100 billion tweet reads.
- Storage Estimation
Tweets are of two types:
Normal tweet and tweet with a photo. Suppose only 10% of the tweets contain photos.
Let one tweet contain 200 characters, one character take 2 bytes, and one photo be 2 MB.
Size of one tweet without a photo = 200 characters * 2 bytes = 400 bytes ~ 500 bytes
Total no. of tweets in one day = 1 billion (as calculated above)
Total no. of tweets with photo = 10% of 1 billion = 100 million
Total Storage required for each day:
=> (Size of one tweet) * (Total no. of tweets) + (Size of photo) * (total tweets with photo)
=> (500 bytes * 1 billion) + (2MB * 100 million)
=> 500 GB + 200 TB
Which is approximately 200 TB (the 500 GB of tweet text is negligible compared to the 200 TB of photos, so ignoring it won't matter).
Each day we require roughly 200 TB (about 0.2 PB) of storage.
- Resource Estimation
Here, calculate the total number of CPUs and servers required.
Assume we get 10 thousand requests per second and each request takes 10 ms of CPU time to process.
Total CPU time required per second:
(10,000 requests per second) * (10 ms) = 100,000 ms of CPU time per second.
Assuming each core of the CPU can handle 1000 ms of processing per second, the total no. of cores required:
=> 100,000 / 1000 = 100 cores.
Let one server have 4 cores of CPU.
So, the total no. of servers required = 100/4 = 25 servers.
So, we will keep 25 servers with a load balancer in front of them to handle our requests.
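Here is the same estimation written out as a small script, so you can re-run the arithmetic with your own assumptions. All constants below are the hypothetical numbers used above.

```typescript
// Back-of-the-envelope estimation for the Twitter-like example above.
const dailyActiveUsers = 100_000_000;   // 100 million DAU
const tweetsPerUser = 10;
const readsPerUser = 1_000;

const writesPerDay = dailyActiveUsers * tweetsPerUser;   // 1 billion tweets/day
const readsPerDay = dailyActiveUsers * readsPerUser;     // 100 billion reads/day

// Storage: 200 chars * 2 bytes ~ 500 bytes per tweet; 10% of tweets carry a 2 MB photo.
const bytesPerTweet = 500;
const photoBytes = 2 * 1024 * 1024;
const tweetsWithPhoto = writesPerDay * 0.1;
const storagePerDayTB =
  (writesPerDay * bytesPerTweet + tweetsWithPhoto * photoBytes) / 1024 ** 4; // ~200 TB

// Resources: 10k requests/sec * 10 ms CPU each = 100,000 ms of CPU work per second.
const requestsPerSecond = 10_000;
const cpuMsPerRequest = 10;
const coresNeeded = (requestsPerSecond * cpuMsPerRequest) / 1_000; // 100 cores
const serversNeeded = Math.ceil(coresNeeded / 4);                  // 4 cores per server -> 25

console.log({ writesPerDay, readsPerDay, storagePerDayTB, coresNeeded, serversNeeded });
```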
CAP Theorem
This theorem states a very important tradeoff while designing any system.
CAP consists of 3 properties: Consistency, Availability, and Partition tolerance, in the context of a distributed system.
A distributed system means data is stored on multiple servers instead of one, as we saw in horizontal scaling.
Why do we make the system distributed? Mainly so that the system keeps working even if one server fails, and so that users can be served from a server close to them.
One individual server that is part of the overall distributed system is called a node.
In this system, we replicate the same data across different servers to get these benefits.
You can see this in the picture below. The same data is stored in multiple database servers (nodes) kept in different locations in India.
If data is added to one of the nodes, it gets replicated to all the other nodes automatically. How this replication happens, we will talk about later in this blog.
Let's discuss all 3 properties of CAP:
Consistency means every node returns the same, most recent data: a read after a write should reflect that write on every node.
Availability means the system continues serving requests even when some nodes fail.
Partition tolerance means the system continues serving even when "network failures" (partitions) cut off communication between nodes.
Example: Let there be 3 nodes A, B, C.
- What is the CAP Theorem?
CAP theorem states that in a distributed system, you can only guarantee two out of these three properties simultaneously. It’s impossible to achieve all three.
It is quite logical if you think about it.
In a distributed system, network partitions are bound to happen, so the system must be partition tolerant. This means that for a distributed system, "P" will always be there; the real tradeoff is between CP and AP.
- Why can we only achieve CP or AP and not CAP?
- Why not choose CA?
- What to choose, CP or AP?
Scaling of Database
Normally you have one database server; your application server queries this DB and gets the result.
When you reach a certain scale, this database server starts giving slow responses or may go down because of its limitations. In this section, we will study how to scale the database in that situation.
We will scale our database step by step: if we have only 10k users, then scaling it to support 10 million is a waste. It's over-engineering. We will only scale up to the level that is sufficient for our business.
Suppose you have a database server, and inside that, there is a user table.
A lot of read requests are fired from the application server to get the user with a specific ID. To make these reads faster, do the following:
- Indexing
Before indexing, the database checks each row in the table to find your data. This is called a full table scan, and it can be slow for large tables. It takes O(N) time to check each id.
With indexing, the database uses the index to directly jump to the rows you need, making it much faster.
You make the "id" column indexed, then the database will make a copy of that id column in a data structure (called B-trees). It uses the B-trees to search the specific id. Searching is faster here because IDs are stored in a sorted way such that you can apply a binary search kind of thing to search in O(logN).
If you want to enable indexing in any column, then you just need to add one line of syntax, and all the overhead of creating b-trees, etc, is handled by DB. You don’t need to worry about anything.
This was a very short and simple explanation about indexing.
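As a rough illustration of that one line of SQL, here is a sketch using node-postgres (pg). The table and column names follow the users example above; the connection string is a placeholder, and note that a primary key column is normally indexed automatically anyway.

```typescript
// A minimal sketch of creating an index and querying through it with node-postgres.
import { Pool } from "pg";

const pool = new Pool({ connectionString: "postgres://user:pass@localhost:5432/app" });

async function addIndexAndQuery(): Promise<void> {
  // One line of SQL; PostgreSQL builds and maintains the B-tree for you.
  await pool.query("CREATE INDEX IF NOT EXISTS idx_users_id ON users (id)");

  // This lookup can now use the index instead of a full table scan.
  const { rows } = await pool.query("SELECT * FROM users WHERE id = $1", [4]);
  console.log(rows);
}

addIndexAndQuery().finally(() => pool.end());
```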
- Partitioning
Partitioning means breaking the big table into multiple small tables.
You can see that we have broken the users table into 3 tables:
These tables are kept in the same database server.
- What is the benefit of this partitioning?
When your index becomes very big, searches on that large index start to show performance issues. After partitioning, each table has its own smaller index, so searching the smaller tables is faster.
You may be wondering how we know which table to query, since before partitioning we could just run SELECT * FROM users WHERE id = 4. Don't worry, you can still run the same query. Behind the scenes, PostgreSQL is smart: it finds the appropriate partition and gives you the result. You can also handle this routing at the application level if you want.
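Here is a sketch of what PostgreSQL declarative range partitioning could look like, again driven from Node with node-postgres. The partition names (users_p1, etc.) and id boundaries are just illustrative.

```typescript
// A sketch of range partitioning the users table inside one database server.
import { Pool } from "pg";

const pool = new Pool({ connectionString: "postgres://user:pass@localhost:5432/app" });

async function createPartitionedUsers(): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS users (
      id   BIGINT NOT NULL,
      name TEXT
    ) PARTITION BY RANGE (id);
  `);

  // Three smaller tables (partitions) living in the same database server.
  await pool.query("CREATE TABLE IF NOT EXISTS users_p1 PARTITION OF users FOR VALUES FROM (1) TO (1000001)");
  await pool.query("CREATE TABLE IF NOT EXISTS users_p2 PARTITION OF users FOR VALUES FROM (1000001) TO (2000001)");
  await pool.query("CREATE TABLE IF NOT EXISTS users_p3 PARTITION OF users FOR VALUES FROM (2000001) TO (3000001)");

  // Same query as before; PostgreSQL routes it to the right partition (users_p1).
  const { rows } = await pool.query("SELECT * FROM users WHERE id = $1", [4]);
  console.log(rows);
}

createPartitionedUsers().finally(() => pool.end());
```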
- Master Slave Architecture
Use this method when you hit a bottleneck: even after indexing, partitioning, and vertical scaling, your queries are slow, or a single server can't handle any more requests.
In this setup, you replicate the data onto multiple servers instead of one.
Read requests (SELECT queries) are redirected to the least busy server; this is how you distribute the read load.
But all the Write requests (INSERT, UPDATE, DELETE) will only be processed by one server.
The node (server) which processes the write request is called the Master Node.
Nodes (servers) that take the read requests are called Slave Nodes.
When you make a Write Request, it is processed and written in the master node, and then it asynchronously (or synchronously depending upon configuration) gets replicated to all the slave nodes.
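A minimal sketch of how the application might split reads and writes, assuming node-postgres. The hostnames and the "least busy replica" heuristic are assumptions for illustration; remember that with asynchronous replication, reads from slaves may briefly lag behind the master.

```typescript
// Route writes to the master and reads to the least busy slave (replica).
import { Pool } from "pg";

const master = new Pool({ host: "db-master.internal", database: "app" });
const slaves = [
  new Pool({ host: "db-slave-1.internal", database: "app" }),
  new Pool({ host: "db-slave-2.internal", database: "app" }),
];

// Naive "least busy" choice: the replica with the fewest checked-out connections.
function pickSlave(): Pool {
  return slaves.reduce((a, b) =>
    a.totalCount - a.idleCount <= b.totalCount - b.idleCount ? a : b
  );
}

export async function createUser(id: number, name: string): Promise<void> {
  // Writes (INSERT/UPDATE/DELETE) always go to the master node.
  await master.query("INSERT INTO users (id, name) VALUES ($1, $2)", [id, name]);
}

export async function getUser(id: number): Promise<unknown> {
  // Reads (SELECT) go to a slave node.
  const { rows } = await pickSlave().query("SELECT * FROM users WHERE id = $1", [id]);
  return rows[0];
}
```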
- Multi-master Setup
Use this when write queries become slow or one master node cannot handle all the write requests.
In this, instead of a single master, use multiple master databases to handle writes.
Ex: A very common thing is to put two master nodes, one for North India and another for South India. All the write requests coming from North India are processed by North-India-DB, and all the write requests coming from South India are processed by South-India-DB, and periodically, they sync (or replicate) their data.
In a multi-master setup, the most challenging part is handling conflicts. If two different values for the same ID exist on the two masters, you have to decide in code how to resolve it: accept both, override the older value with the latest one, merge them, etc. There is no fixed rule here; it depends entirely on the business use case.
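For example, a hypothetical "last write wins" resolver could look like the sketch below. The row shape (an updatedAt timestamp) is an assumption for illustration; as noted above, your business rules may call for something entirely different.

```typescript
// A hypothetical "last write wins" conflict resolver for a multi-master sync job.
interface UserRow {
  id: number;
  name: string;
  updatedAt: Date; // when this row was last written on its master
}

function resolveConflict(northRow: UserRow, southRow: UserRow): UserRow {
  // Same id written on both masters: keep whichever write happened last.
  return northRow.updatedAt >= southRow.updatedAt ? northRow : southRow;
}

// Example: the South India master has the newer write, so it wins.
const winner = resolveConflict(
  { id: 7, name: "Asha", updatedAt: new Date("2024-01-01T10:00:00Z") },
  { id: 7, name: "Asha K", updatedAt: new Date("2024-01-01T10:05:00Z") },
);
console.log(winner.name); // "Asha K"
```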
- Database Sharding
Sharding is complex. Try to avoid it in practice and use it only when all the techniques above are not sufficient and you need further scaling.
Sharding is similar to partitioning, as we saw above, but instead of keeping the different tables on the same server, we put them on different servers.
You can see in the picture above that we cut the table into 3 parts and put them on 3 different servers. These servers are usually called shards.
Here, we sharded based on IDs, so the ID column is called the sharding key.
Note: The sharding key should distribute data evenly across shards to avoid overloading a single shard.
Each partition is kept in an independent database server (called a shard). So, now you can scale this server further individually according to your need, like doing master-slave architecture to one of the shards, which takes a lot of requests.
- Why sharding is difficult?
In partitioning (keeping chunks of the table in the same DB server), you don't have to worry about which table to query; PostgreSQL handles that for you. But in sharding (keeping chunks of the table on different DB servers), you have to handle this at the application level. You need to write code such that a query for ids 1-2 goes to DB-1, a query for ids 5-6 goes to DB-3, and so on. Likewise, when adding a new record, the application code must decide which shard the new record goes to.
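A minimal sketch of such application-level routing, assuming node-postgres and range-based shards. The shard hostnames and id ranges are placeholders.

```typescript
// The application (not the database) picks which shard to talk to.
import { Pool } from "pg";

const shards = [
  { maxId: 1_000_000, pool: new Pool({ host: "shard-1.internal", database: "app" }) },
  { maxId: 2_000_000, pool: new Pool({ host: "shard-2.internal", database: "app" }) },
  { maxId: 3_000_000, pool: new Pool({ host: "shard-3.internal", database: "app" }) },
];

// Range-based routing on the sharding key (the user id).
function shardFor(userId: number): Pool {
  const shard = shards.find((s) => userId <= s.maxId);
  if (!shard) throw new Error(`No shard configured for id ${userId}`);
  return shard.pool;
}

export async function getUser(userId: number): Promise<unknown> {
  const { rows } = await shardFor(userId).query("SELECT * FROM users WHERE id = $1", [userId]);
  return rows[0];
}

export async function createUser(userId: number, name: string): Promise<void> {
  // On insert, the application must also decide which shard the record belongs to.
  await shardFor(userId).query("INSERT INTO users (id, name) VALUES ($1, $2)", [userId, name]);
}
```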
- Sharding Strategies
Range-Based Sharding:
Data is divided into shards based on ranges of values in the sharding key.
Example:
Pros: Simple to implement.
Cons: Uneven distribution if data is skewed (e.g., some ranges have more users).
Hash-Based Sharding:
A hash function is applied to the sharding key, and the result determines the shard.
Example:
HASH(user_id) % number_of_shards determines the shard.
Pros: Ensures even distribution of data.
Cons: Rebalancing is difficult when adding new shards, as hash results change.
Geographic/Entity-Based Sharding:
Data is divided based on a logical grouping, like region or department.
Example:
Pros: Useful for geographically distributed systems.
Cons: Some shards may become "hotspots" with uneven traffic.
Directory-Based Sharding:
A mapping directory keeps track of which shard contains specific data.
Example: A lookup table maps user_id ranges to shard IDs.
Pros: Flexibility to reassign shards without changing application logic.
Cons: The directory can become a bottleneck.
- Disadvantage of sharding
- Sum up of Database Scaling
SQL vs. NoSQL Databases and when to use which Database
- SQL Database
- NoSQL Database
It is categorized into 4 types: document, key-value, column-family (wide-column), and graph databases.
It has a flexible schema, which means we can insert new fields that were not defined in the initial schema.
It doesn’t strictly follow ACID. It prioritizes other factors, such as scalability and performance.
- Scaling in SQL vs. NoSQL
- When to use which Database?
Microservices
- What is Monolith and Microservice?
Monolith: In a monolithic architecture, the entire application is built as a single unit. Suppose you are building an e-commerce app. In a monolith, you build just one backend that contains the entire functionality (user management, product listing, orders, payments, etc.) in one app.
Microservice: Break down large applications into smaller, manageable, and independently deployable services.
Ex: For an e-commerce app, suppose you break it into a user service, a product service, an order service, and a payment service.
You then build a separate backend app for each service.
- Why do we break our app into Microservices?
- When to use Microservice?
- How do clients request in a Microservice architecture?
Different microservices have different backends and are deployed independently.
Suppose the user service is deployed on a machine with IP address 192.168.24.32, the product service on 192.168.24.38, and so on for the other services. All are deployed on different machines. It is very cumbersome for clients to use a different IP address (or domain name) for each microservice, so we use an API Gateway.
The client sends every request to a single endpoint: the API gateway. The gateway takes each incoming request and routes it to the correct microservice.
You can also scale each service independently. Suppose the product service has more traffic and requires 3 machines, the user service requires 2 machines, and the payment service is fine with 1 machine; with separate services, you can do exactly that.
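As a toy illustration of the routing the gateway does, a bare-bones gateway could be a single Node.js server that forwards requests based on the path prefix. The routing table below reuses the example IPs above and only relays simple GET-style requests; a real setup would use a managed API gateway or a reverse proxy such as nginx.

```typescript
// A toy API gateway: one public endpoint, forwarding by path prefix.
import { createServer } from "node:http";

const routes: Record<string, string> = {
  "/users": "http://192.168.24.32:8080",    // user service
  "/products": "http://192.168.24.38:8080", // product service
};

const gateway = createServer(async (req, res) => {
  const prefix = Object.keys(routes).find((p) => req.url?.startsWith(p));
  if (!prefix) {
    res.writeHead(404);
    res.end("No such service");
    return;
  }

  // Forward the request to the matching microservice and relay its response.
  const upstream = await fetch(routes[prefix] + req.url, { method: req.method });
  res.writeHead(upstream.status, {
    "Content-Type": upstream.headers.get("content-type") ?? "text/plain",
  });
  res.end(await upstream.text());
});

gateway.listen(8080, () => console.log("API gateway up on port 8080"));
```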
API Gateway provides several other advantages as well:
Load Balancer
- Why do we need the Load Balancer?
As we saw earlier in horizontal scaling, if we have many servers handling requests, we can't give the clients all the machines' IP addresses and let them decide which server to send each request to.
The load balancer acts as the single point of contact for clients. They request the load balancer's domain name, and the load balancer forwards each request to one of the servers, for example the least busy one.
Which algorithm does the load balancer follow to decide where to send the traffic?
- Load Balancer Algorithms
Round Robin Algorithm
How it works: Requests are distributed sequentially to servers in a circular order.
Suppose we have 3 servers: Server-1, Server-2 and Server-3.
Then, in round robin, the 1st request goes to Server-1, the 2nd to Server-2, the 3rd to Server-3, the 4th back to Server-1, the 5th to Server-2, the 6th to Server-3, the 7th again to Server-1, the 8th to Server-2, and so on (a minimal sketch follows below).
Advantages:
Disadvantages:
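Here is the minimal round-robin sketch mentioned above. A real load balancer (nginx, HAProxy, an AWS ALB, etc.) implements this for you; the server names are just the ones from the example.

```typescript
// A minimal round-robin picker over the three servers in the example.
class RoundRobinBalancer {
  private next = 0;
  constructor(private servers: string[]) {}

  pick(): string {
    const server = this.servers[this.next];
    this.next = (this.next + 1) % this.servers.length; // wrap around circularly
    return server;
  }
}

const lb = new RoundRobinBalancer(["server-1", "server-2", "server-3"]);
for (let request = 1; request <= 7; request++) {
  console.log(`request ${request} -> ${lb.pick()}`);
}
// request 1 -> server-1, request 2 -> server-2, request 3 -> server-3,
// request 4 -> server-1, ... exactly as described above.
```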
Weighted Round Robin Algorithm
How it works: Similar to Round Robin, but servers are assigned weights based on their capacity, and servers with higher weights receive more requests. In the picture below, you can follow the request numbers to see how it works: the 3rd server is bigger (more RAM, storage, etc.), so it receives twice as many requests as the 1st and 2nd.
Advantages:
Disadvantages:
Least Connections Algorithm
How it works: Directs traffic to the server with the fewest active connections. A connection here can be anything: HTTP, TCP, WebSocket, etc. The load balancer sends each new request to the server that currently has the fewest active connections with it.
Advantages:
Disadvantages:
Hash-Based Algorithm
How it works: The load balancer takes some attribute of the request, such as the client's IP or user_id, hashes it, and uses the result to pick a server. This ensures a specific client is consistently routed to the same server.
Advantages:
Disadvantages:
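A small sketch of hash-based (sticky) routing follows. The hash function here is a simple illustrative one, not what any particular load balancer uses.

```typescript
// Hash the client's IP so the same client always lands on the same server.
function hashString(input: string): number {
  let hash = 0;
  for (const char of input) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0; // keep it an unsigned 32-bit int
  }
  return hash;
}

function pickServer(clientIp: string, servers: string[]): string {
  return servers[hashString(clientIp) % servers.length];
}

const servers = ["server-1", "server-2", "server-3"];
console.log(pickServer("203.0.113.7", servers)); // always the same server for this IP
console.log(pickServer("203.0.113.7", servers)); // ...on every request
```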
Caching
- Caching Introduction
Caching is the process of storing frequently accessed data in a high-speed storage layer so that future requests for that data can be served faster.
Ex: Suppose some data takes 500 ms to fetch from a MongoDB database, then 100 ms of calculations are done on it in the backend before it is finally sent to the client; in total, the client waits 600 ms for the data. If we cache this computed data in a high-speed store like Redis and serve it from there, we can reduce the time from 600 ms to 60 ms. (These numbers are hypothetical.)
Caching means storing pre-computed data in a fast-access datastore like Redis and, when a user requests that data, serving it from Redis instead of querying the database.
Example: Take a blog website. When we hit the route /blogs, we get all the blogs. The first time a user hits this route, there is no data in the cache, so we fetch it from the database; suppose the response time is 800 ms. We then store this data in Redis. The next time a user hits the route, the data comes from Redis instead of the database, with a response time of about 20 ms. When a new blog is added, we have to somehow remove the stale list of blogs from Redis and replace it with the new one. This is called cache invalidation. There are many ways to do it. One is to set an expiry time (Time To Live, TTL): for example, Redis deletes the cached blogs every 24 hours, the first request after that is served from the DB, and the result is cached again for subsequent requests.
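A sketch of this /blogs flow, assuming Express and ioredis; getAllBlogsFromDb() is a stand-in for the real database query, and the 24-hour TTL matches the example above.

```typescript
// Cache the /blogs response in Redis with a 24-hour TTL.
import express from "express";
import Redis from "ioredis";

const app = express();
const redis = new Redis(); // connects to localhost:6379 by default

// Placeholder for the real database query (assumed, not from the article).
async function getAllBlogsFromDb(): Promise<unknown[]> {
  return [{ id: 1, title: "Hello world" }]; // pretend this took ~800 ms
}

app.get("/blogs", async (_req, res) => {
  const cached = await redis.get("blogs");
  if (cached) {
    return res.json(JSON.parse(cached)); // cache hit: the ~20 ms path
  }

  const blogs = await getAllBlogsFromDb(); // cache miss: the ~800 ms path
  await redis.set("blogs", JSON.stringify(blogs), "EX", 60 * 60 * 24); // TTL = 24 hours
  res.json(blogs);
});

app.listen(8080);
```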
- Benefits of Caching:
- Types of Caches
Client-Side Cache
Server-Side Cache
CDN Cache
Application-Level Cache
Redis
Redis is an in-memory data structure store. In-memory means the data lives in RAM, and reading and writing from RAM is extremely fast compared to disk. We leverage this fast access for caching. Traditional databases store their data on disk, and disk reads/writes are very slow compared to Redis.
One question you might have: if Redis is so fast, why use a database at all? Can't we just rely on Redis to store all the data?
Redis stores data in RAM, and RAM is much smaller (and more expensive) than disk. If we try to store too much data in Redis, it will run out of memory and start evicting keys or rejecting writes.
Redis stores data as key-value pairs. Just as we access data in a database through tables, in Redis we access data through keys.
Values can be of any data type like string, list, etc, as shown in the figure below.
There are more data types also, but the above ones are mostly used.
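For a feel of these data types, here is a small sketch using ioredis; the key names are just examples.

```typescript
// A few commonly used Redis data types, accessed by key.
import Redis from "ioredis";

const redis = new Redis();

async function demo(): Promise<void> {
  await redis.set("user:1:name", "Asha");                        // string
  await redis.lpush("recent_posts", "post-42");                  // list (push to the front)
  await redis.hset("user:1", { name: "Asha", city: "Delhi" });   // hash (field-value map)
  await redis.sadd("post:42:likes", "user:1");                   // set (unique members)

  console.log(await redis.get("user:1:name"));
  console.log(await redis.lrange("recent_posts", 0, -1));
  await redis.quit();
}

demo();
```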
Blob Storage
In databases, text and numbers are stored as they are. But think of something like a file: mp4, png, jpeg, pdf, etc. These things can't simply be stored in rows and columns.
Such files can be represented as a bunch of 0s and 1s, and this binary representation is called a Blob (Binary Large Object). A database cannot meaningfully store "an mp4", but it can store its blob, because that is just a bunch of 0s and 1s.
However, a blob can be very big; a single mp4 video can be 1 GB. If you store blobs in databases like MySQL or MongoDB, your queries become too slow, and you also have to take care of scaling, backups, and availability for that large volume of data. These are some reasons why we don't store blob data in a database. Instead, we store it in blob storage, which is a managed service (managed means its scaling, security, etc., are taken care of by companies such as Amazon, and we use it as a black box).
Example of Blob Storage: AWS S3, Cloudflare R2.
AWS S3
S3 (Simple Storage Service) is AWS's blob storage, used to store files (blob data) such as mp4, png, jpeg, pdf, html, or any other kind of file you can think of.
You can think of S3 as Google Drive, where you store all your files.
The best thing about S3 is that it is very cheap. Storing 1 GB of data in S3 costs a lot less than storing it in RDS (RDS is AWS's database service, which provides databases such as PostgreSQL, MySQL, etc.).
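A minimal sketch of uploading a file to S3 with the AWS SDK for JavaScript (v3). The bucket name, region, and file path are placeholders.

```typescript
// Upload a video file (blob data) to an S3 bucket.
import { readFile } from "node:fs/promises";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "ap-south-1" });

async function uploadVideo(): Promise<void> {
  const body = await readFile("./intro.mp4"); // the binary data (blob) to store

  await s3.send(new PutObjectCommand({
    Bucket: "my-app-uploads",
    Key: "videos/intro.mp4",      // the object's path inside the bucket
    Body: body,
    ContentType: "video/mp4",
  }));

  console.log("Uploaded videos/intro.mp4 to S3");
}

uploadVideo();
```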
Features of S3:
Content Delivery Network (CDN)
Source: Medium