Why study System Design?
You may have built some personal projects in college where you have a backend in NodeJS (or any other framework) and a database.
A user (or client) sends a request to your application. The backend server may do some calculations and perform CRUD operations on the database to return a response. This is fine for prototypes and personal projects, but in the real world, when we have actual users (in millions or billions), this simple architecture may not work. We need to think about scaling, fault tolerance, security, monitoring and all sorts of things to make our system reliable and efficient in all cases. That is why we study the different concepts in system design.
What is a Server?
A server is nothing but a physical machine (such as a laptop) where your application code is running. When you build a ReactJS or NodeJS application, it typically runs on http://localhost:8080. Localhost is the hostname that resolves to the IP address 127.0.0.1, which is the IP address of your own laptop.
For external websites, you type https://abc.com. When you hit this in your browser, it first resolves the domain name abc.com to the server's IP address (say 35.154.33.64) using DNS and then connects to that server.
Requesting https://abc.com is therefore the same as requesting 35.154.33.64:443. Here, 443 is the port (the default port of HTTPS).
Remembering IP Addresses is tough, so people generally buy domain names for it and point their domain name to their server’s IP address.
- How to deploy an application?
Your application runs on port 8080. Now, you want to expose it to the internet so that other people can visit your website. For this, you need to get a public IP address and attach it to your laptop so that people can hit https://<your_laptop_ip_address>:8080 to visit your site.
Doing all this and managing the server on your own is a pain, so people generally rent servers from cloud providers such as AWS, Azure, GCP, etc. They give you a virtual machine where you can run your application, and that machine also has a public IP address attached to it, so it can be reached from anywhere. In AWS, this virtual machine is called an EC2 instance. Putting your application code from your local laptop onto the cloud provider's virtual machine is called deployment.
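To make this concrete, here is a minimal sketch of the kind of app being deployed, assuming Node.js with TypeScript. The port 8080 matches the example above; binding to 0.0.0.0 (instead of localhost) is what makes the server reachable on the machine's public IP address.

```typescript
// A minimal sketch of "your application running on port 8080".
import { createServer } from "node:http";

const server = createServer((req, res) => {
  res.writeHead(200, { "Content-Type": "application/json" });
  res.end(JSON.stringify({ message: "Hello from my server" }));
});

// 0.0.0.0 means "listen on all network interfaces", so the app is reachable
// via http://<public_ip>:8080 once a public IP is attached to the machine.
server.listen(8080, "0.0.0.0", () => {
  console.log("Listening on port 8080");
});
```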
Latency and Throughput
- Latency
Latency is the time taken for a single request to travel from the client to the server and back (or a single unit of work to complete). It is usually measured in milliseconds (ms).
Loading a webpage: If it takes 200ms for the server to send the page data back to the browser, the latency is 200ms.
In simple terms, a website that loads faster has low latency; one that loads slower has high latency.
Round Trip Time (RTT): The total time it takes for a request to go to the server and for the response to come back. RTT is sometimes used interchangeably with latency.
- Throughput
Throughput is the number of requests or units of work the system can handle per second. It is typically measured in requests per second (RPS) or transactions per second (TPS).
Every server has a limit: it can handle only X requests per second. Pushing more load than that can choke it, or it may go down.
Ideally, we want a system with high throughput and low latency.
- Example:
Latency: The time it takes for a car to travel from one point to another (e.g., 10 minutes).
Throughput: The number of cars that can travel on a highway in one hour (e.g., 1,000 cars).
In short, latency measures how long one request takes, while throughput measures how many requests the system can handle per second.
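If you want to get a feel for these numbers yourself, here is a rough sketch (assuming Node.js 18+ with a global fetch) that measures average latency and throughput against any URL. The URL and request count are placeholders, and because the requests run one after another, this measures a single client's view of the system.

```typescript
// A rough sketch: measure average latency (ms) and throughput (requests/sec).
async function measure(url: string, totalRequests: number): Promise<void> {
  const latencies: number[] = [];
  const start = performance.now();

  for (let i = 0; i < totalRequests; i++) {
    const t0 = performance.now();
    await fetch(url);                        // one round trip (RTT)
    latencies.push(performance.now() - t0);  // latency of this request in ms
  }

  const elapsedSeconds = (performance.now() - start) / 1000;
  const avgLatency = latencies.reduce((a, b) => a + b, 0) / latencies.length;
  console.log(`Average latency: ${avgLatency.toFixed(1)} ms`);
  console.log(`Throughput: ${(totalRequests / elapsedSeconds).toFixed(1)} requests/sec`);
}

measure("https://abc.com", 50);
```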
Scaling and its types
You often see that whenever someone launches their website on the internet and suddenly traffic increases, their website crashes. To prevent this, we need to scale our system.
Scaling means we need to increase the specs of our machine (like increasing RAM, CPU, storage etc) or add more machines to handle the load.
You can relate this to a mobile phone: a cheap phone with little RAM and storage hangs when you run heavy games or many applications simultaneously. The same happens with an EC2 instance: when a lot of traffic arrives at the same time, it starts choking, and that is when we need to scale our system.
- Types of Scaling
It is of 2 types:
Vertical Scaling (Scale Up/Down)
If we increase the specs (RAM, Storage, CPU) of the same machine to handle more load, then it is called vertical scaling.
This type of scaling is mostly used for SQL databases and stateful applications, because it is difficult to keep state consistent across machines in a horizontally scaled setup.
Horizontal Scaling (Scale Out/In)
Vertical scaling is not possible beyond a point: we can't increase a machine's specs infinitely, and eventually we hit a hardware limit.
The solution is to add more machines and distribute the incoming load across them. This is called horizontal scaling.
Ex: We have 8 clients and 2 machines and want to distribute our load equally, so the first 4 clients hit machine-1 and the next 4 clients hit machine-2. Clients are not smart; we can't give them 2 different IP addresses and let them decide which machine to hit, because they don't know anything about our system. For this, we put a load balancer in between. All the clients hit the load balancer, and the load balancer is responsible for routing the traffic to the least busy server.
In this setup, clients don't make requests directly to the servers. Instead, they send requests to the load balancer, which takes the incoming traffic and forwards it to the least busy machine.
Most of the time, in the real world, we use horizontal scaling.
Below is a picture where you can see 3 clients making requests, and the load balancer distributing the load across 3 EC2 instances equally.
Auto Scaling
Suppose you started a business and took it online. You rented an EC2 server to deploy your application. If a lot of users come to your website at the same time, your website may crash, because an EC2 server has limits (CPU, RAM, etc) on how many users it can serve concurrently. At that point, you do horizontal scaling: increase the number of EC2 instances and attach a load balancer.
Suppose one EC2 machine can serve 1,000 users without choking, but the number of users on our website is not constant. Some days we have 10,000 users, which can be served by 10 instances. Other days we have 100,000 users, and then we need 100 EC2 instances. One solution might be to keep the maximum number (100) of EC2 instances running all the time. That way we can serve all users at all times without any problem, but during low-traffic periods we are wasting money on extra instances: we only need 10 instances, yet we are running and paying for 100.
The better solution is to run only the required number of instances at any given time, and to add a mechanism so that if the CPU usage of an EC2 instance crosses a certain threshold (say 90%), another instance is launched and the traffic is redistributed, without us doing it manually. Changing the number of servers dynamically based on traffic is called Auto Scaling.
Note: These numbers are all hypothetical to make you understand the topic. If you want to find the actual threshold, then you can do Load Testing on your instance.
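To make the idea concrete, here is a hypothetical sketch of the scale-out/scale-in decision an auto scaler makes. In practice you configure thresholds and min/max instance counts in a managed service (such as an AWS Auto Scaling group) rather than writing this loop yourself; all numbers below are assumptions.

```typescript
// A hypothetical auto-scaling decision, evaluated periodically against metrics.
interface ScalingPolicy {
  scaleOutCpuPercent: number; // add an instance above this average CPU
  scaleInCpuPercent: number;  // remove an instance below this average CPU
  minInstances: number;
  maxInstances: number;
}

function decideInstanceCount(
  currentInstances: number,
  averageCpuPercent: number,
  policy: ScalingPolicy
): number {
  if (averageCpuPercent > policy.scaleOutCpuPercent) {
    return Math.min(currentInstances + 1, policy.maxInstances); // scale out
  }
  if (averageCpuPercent < policy.scaleInCpuPercent) {
    return Math.max(currentInstances - 1, policy.minInstances); // scale in
  }
  return currentInstances; // load is within the comfortable band
}

// Example: 10 instances at 95% average CPU -> scale out to 11.
console.log(decideInstanceCount(10, 95, {
  scaleOutCpuPercent: 90, scaleInCpuPercent: 30, minInstances: 10, maxInstances: 100,
}));
```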
Back-of-the-envelope Estimation
Above, in horizontal scaling, we saw that we need more servers to handle the load. In back-of-the-envelope estimation, we estimate roughly how many servers, how much storage, etc., we will need.
We do approximation here to make our calculation easy.
| Power of 2 | Approximate value | Power of 10 | Full name  | Short name |
|------------|-------------------|-------------|------------|------------|
| 10         | 1 Thousand        | 3           | 1 Kilobyte | 1 KB       |
| 20         | 1 Million         | 6           | 1 Megabyte | 1 MB       |
| 30         | 1 Billion         | 9           | 1 Gigabyte | 1 GB       |
| 40         | 1 Trillion        | 12          | 1 Terabyte | 1 TB       |
| 50         | 1 Quadrillion     | 15          | 1 Petabyte | 1 PB       |
There are many things you can calculate here, but I prefer to calculate only the following: load estimation, storage estimation, and resource estimation.
Let’s take an example of Twitter and do this calculation.
- Load Estimation
Here, ask for DAU (Daily Active Users). Then, calculate the number of reads and the number of writes.
Suppose Twitter has 100 million daily active users, and one user posts 10 tweets per day.
Number of tweets in one day = 100 million users * 10 tweets per user = 1 billion tweets.
This means the number of writes = 1 billion tweets per day.
Suppose 1 user reads 1000 tweets per day.
Number of reads in one day = 100 million users * 1,000 tweets = 100 billion tweet reads.
- Storage Estimation
Tweets are of two types:
Normal tweet and tweet with a photo. Suppose only 10% of the tweets contain photos.
Let one tweet contain 200 characters, one character take 2 bytes, and one photo be 2 MB.
Size of one tweet without a photo = 200 characters * 2 bytes = 400 bytes ~ 500 bytes
Total no. of tweets in one day = 1 billion (as calculated above)
Total no. of tweets with photo = 10% of 1 billion = 100 million
Total Storage required for each day:
=> (Size of one tweet) * (Total no. of tweets) + (Size of photo) * (total tweets with photo)
=> (500 bytes * 1 billion) + (2MB * 100 million)
=> 500 GB + 200 TB
Which is approximately 200 TB (the 500 GB of tweet text is negligible compared to the 200 TB of photos, so ignoring it won't matter).
Each day we require roughly 200 TB (about 0.2 PB) of storage.
- Resource Estimation
Here, calculate the total number of CPUs and servers required.
Assume we get 10 thousand requests per second and each request takes 10 ms of CPU time to process.
Total CPU time required per second:
(10,000 requests per second) * (10 ms) = 100,000 ms of CPU time per second.
Assuming each core of the CPU can handle 1000 ms of processing per second, the total no. of cores required:
=> 100,000 / 1000 = 100 cores.
Let one server have 4 cores of CPU.
So, the total no. of servers required = 100/4 = 25 servers.
So, we will keep 25 servers with a load balancer in front of them to handle our requests.
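Here is the same estimation written out as a small script, so you can re-run the arithmetic with your own assumptions. All constants below are the hypothetical numbers used above.

```typescript
// Back-of-the-envelope estimation for the Twitter-like example above.
const dailyActiveUsers = 100_000_000;   // 100 million DAU
const tweetsPerUser = 10;
const readsPerUser = 1_000;

const writesPerDay = dailyActiveUsers * tweetsPerUser;   // 1 billion tweets/day
const readsPerDay = dailyActiveUsers * readsPerUser;     // 100 billion reads/day

// Storage: 200 chars * 2 bytes ~ 500 bytes per tweet; 10% of tweets carry a 2 MB photo.
const bytesPerTweet = 500;
const photoBytes = 2 * 1024 * 1024;
const tweetsWithPhoto = writesPerDay * 0.1;
const storagePerDayTB =
  (writesPerDay * bytesPerTweet + tweetsWithPhoto * photoBytes) / 1024 ** 4; // ~200 TB

// Resources: 10k requests/sec * 10 ms CPU each = 100,000 ms of CPU work per second.
const requestsPerSecond = 10_000;
const cpuMsPerRequest = 10;
const coresNeeded = (requestsPerSecond * cpuMsPerRequest) / 1_000; // 100 cores
const serversNeeded = Math.ceil(coresNeeded / 4);                  // 4 cores per server -> 25

console.log({ writesPerDay, readsPerDay, storagePerDayTB, coresNeeded, serversNeeded });
```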
CAP Theorem
This theorem states a very important tradeoff while designing any system.
CAP consists of 3 properties: Consistency, Availability, and Partition tolerance, in the context of a distributed system.
A distributed system means data is stored on multiple servers instead of one, as we saw in horizontal scaling.
Why do we make the system distributed? Mainly so that the system keeps working even if one server fails, and so that users can be served from a server close to them.
One individual server that is part of the overall distributed system is called a node.
In this system, we replicate the same data across different servers to get these benefits.
You can see this in the picture below. The same data is stored in multiple database servers (nodes) kept in different locations in India.
If data is added to one of the nodes, it gets replicated to all the other nodes automatically. How this replication happens, we will talk about later in this blog.
Let's discuss all 3 properties of CAP:
Consistency means every node returns the same, most recent data: a read after a write should reflect that write on every node.
Availability means the system continues serving requests even when some nodes fail.
Partition tolerance means the system continues serving even when "network failures" (partitions) cut off communication between nodes.
Example: Let there be 3 nodes A, B, C.
- What is the CAP Theorem?
CAP theorem states that in a distributed system, you can only guarantee two out of these three properties simultaneously. It’s impossible to achieve all three.
It is quite logical if you think about it.
In a distributed system, network partitions are bound to happen, so the system must be partition tolerant. This means that for a distributed system, "P" will always be there; the real tradeoff is between CP and AP.
- Why can we only achieve CP or AP and not CAP?
- Why not choose CA?
- What to choose, CP or AP?
Scaling of Database
Normally you have one database server; your application server queries this DB and gets the result.
When you reach a certain scale, this database server starts giving slow responses or may go down because of its limitations. In this section, we will study how to scale the database in that situation.
We will scale our database step by step: if we have only 10k users, then scaling it to support 10 million is a waste. It's over-engineering. We will only scale up to the level that is sufficient for our business.
Suppose you have a database server, and inside that, there is a user table.
A lot of read requests are fired from the application server to get the user with a specific ID. To make these reads faster, do the following:
- Indexing
Before indexing, the database checks each row in the table to find your data. This is called a full table scan, and it can be slow for large tables. It takes O(N) time to check each id.
With indexing, the database uses the index to directly jump to the rows you need, making it much faster.
You make the "id" column indexed, then the database will make a copy of that id column in a data structure (called B-trees). It uses the B-trees to search the specific id. Searching is faster here because IDs are stored in a sorted way such that you can apply a binary search kind of thing to search in O(logN).
If you want to enable indexing in any column, then you just need to add one line of syntax, and all the overhead of creating b-trees, etc, is handled by DB. You don’t need to worry about anything.
This was a very short and simple explanation about indexing.
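As a rough illustration of that one line of SQL, here is a sketch using node-postgres (pg). The table and column names follow the users example above; the connection string is a placeholder, and note that a primary key column is normally indexed automatically anyway.

```typescript
// A minimal sketch of creating an index and querying through it with node-postgres.
import { Pool } from "pg";

const pool = new Pool({ connectionString: "postgres://user:pass@localhost:5432/app" });

async function addIndexAndQuery(): Promise<void> {
  // One line of SQL; PostgreSQL builds and maintains the B-tree for you.
  await pool.query("CREATE INDEX IF NOT EXISTS idx_users_id ON users (id)");

  // This lookup can now use the index instead of a full table scan.
  const { rows } = await pool.query("SELECT * FROM users WHERE id = $1", [4]);
  console.log(rows);
}

addIndexAndQuery().finally(() => pool.end());
```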
- Partitioning
Partitioning means breaking the big table into multiple small tables.
You can see that we have broken the users table into 3 tables:
These tables are kept in the same database server.
- What is the benefit of this partitioning?
When your index becomes very big, searches on that large index start to show performance issues. After partitioning, each table has its own smaller index, so searching the smaller tables is faster.
You may be wondering how we know which table to query, since before partitioning we could just run SELECT * FROM users WHERE id = 4. Don't worry, you can still run the same query. Behind the scenes, PostgreSQL is smart: it finds the appropriate partition and gives you the result. You can also handle this routing at the application level if you want.
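Here is a sketch of what PostgreSQL declarative range partitioning could look like, again driven from Node with node-postgres. The partition names (users_p1, etc.) and id boundaries are just illustrative.

```typescript
// A sketch of range partitioning the users table inside one database server.
import { Pool } from "pg";

const pool = new Pool({ connectionString: "postgres://user:pass@localhost:5432/app" });

async function createPartitionedUsers(): Promise<void> {
  await pool.query(`
    CREATE TABLE IF NOT EXISTS users (
      id   BIGINT NOT NULL,
      name TEXT
    ) PARTITION BY RANGE (id);
  `);

  // Three smaller tables (partitions) living in the same database server.
  await pool.query("CREATE TABLE IF NOT EXISTS users_p1 PARTITION OF users FOR VALUES FROM (1) TO (1000001)");
  await pool.query("CREATE TABLE IF NOT EXISTS users_p2 PARTITION OF users FOR VALUES FROM (1000001) TO (2000001)");
  await pool.query("CREATE TABLE IF NOT EXISTS users_p3 PARTITION OF users FOR VALUES FROM (2000001) TO (3000001)");

  // Same query as before; PostgreSQL routes it to the right partition (users_p1).
  const { rows } = await pool.query("SELECT * FROM users WHERE id = $1", [4]);
  console.log(rows);
}

createPartitionedUsers().finally(() => pool.end());
```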
- Master Slave Architecture
Use this method when you hit a bottleneck: even after indexing, partitioning, and vertical scaling, your queries are slow, or a single server can't handle any more requests.
In this setup, you replicate the data onto multiple servers instead of one.
Read requests (SELECT queries) are redirected to the least busy server; this is how you distribute the read load.
But all the Write requests (INSERT, UPDATE, DELETE) will only be processed by one server.
The node (server) which processes the write request is called the Master Node.
Nodes (servers) that take the read requests are called Slave Nodes.
When you make a Write Request, it is processed and written in the master node, and then it asynchronously (or synchronously depending upon configuration) gets replicated to all the slave nodes.
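A minimal sketch of how the application might split reads and writes, assuming node-postgres. The hostnames and the "least busy replica" heuristic are assumptions for illustration; remember that with asynchronous replication, reads from slaves may briefly lag behind the master.

```typescript
// Route writes to the master and reads to the least busy slave (replica).
import { Pool } from "pg";

const master = new Pool({ host: "db-master.internal", database: "app" });
const slaves = [
  new Pool({ host: "db-slave-1.internal", database: "app" }),
  new Pool({ host: "db-slave-2.internal", database: "app" }),
];

// Naive "least busy" choice: the replica with the fewest checked-out connections.
function pickSlave(): Pool {
  return slaves.reduce((a, b) =>
    a.totalCount - a.idleCount <= b.totalCount - b.idleCount ? a : b
  );
}

export async function createUser(id: number, name: string): Promise<void> {
  // Writes (INSERT/UPDATE/DELETE) always go to the master node.
  await master.query("INSERT INTO users (id, name) VALUES ($1, $2)", [id, name]);
}

export async function getUser(id: number): Promise<unknown> {
  // Reads (SELECT) go to a slave node.
  const { rows } = await pickSlave().query("SELECT * FROM users WHERE id = $1", [id]);
  return rows[0];
}
```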
- Multi-master Setup
Use this when write queries become slow or one master node cannot handle all the write requests.
In this, instead of a single master, use multiple master databases to handle writes.
Ex: A very common thing is to put two master nodes, one for North India and another for South India. All the write requests coming from North India are processed by North-India-DB, and all the write requests coming from South India are processed by South-India-DB, and periodically, they sync (or replicate) their data.
In a multi-master setup, the most challenging part is handling conflicts. If two different values for the same ID exist on the two masters, you have to decide in code how to resolve it: accept both, override the older value with the latest one, merge them, etc. There is no fixed rule here; it depends entirely on the business use case.
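For example, a hypothetical "last write wins" resolver could look like the sketch below. The row shape (an updatedAt timestamp) is an assumption for illustration; as noted above, your business rules may call for something entirely different.

```typescript
// A hypothetical "last write wins" conflict resolver for a multi-master sync job.
interface UserRow {
  id: number;
  name: string;
  updatedAt: Date; // when this row was last written on its master
}

function resolveConflict(northRow: UserRow, southRow: UserRow): UserRow {
  // Same id written on both masters: keep whichever write happened last.
  return northRow.updatedAt >= southRow.updatedAt ? northRow : southRow;
}

// Example: the South India master has the newer write, so it wins.
const winner = resolveConflict(
  { id: 7, name: "Asha", updatedAt: new Date("2024-01-01T10:00:00Z") },
  { id: 7, name: "Asha K", updatedAt: new Date("2024-01-01T10:05:00Z") },
);
console.log(winner.name); // "Asha K"
```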
- Database Sharding
Sharding is complex. Try to avoid it in practice and use it only when all the techniques above are not sufficient and you need further scaling.
Sharding is similar to partitioning, as we saw above, but instead of keeping the different tables on the same server, we put them on different servers.
You can see in the picture above that we cut the table into 3 parts and put them on 3 different servers. These servers are usually called shards.
Here, we sharded based on IDs, so the ID column is called the sharding key.
Note: The sharding key should distribute data evenly across shards to avoid overloading a single shard.
Each partition is kept in an independent database server (called a shard). So, now you can scale this server further individually according to your need, like doing master-slave architecture to one of the shards, which takes a lot of requests.
- Why sharding is difficult?
In partitioning (keeping chunks of the table in the same DB server), you don't have to worry about which table to query; PostgreSQL handles that for you. But in sharding (keeping chunks of the table on different DB servers), you have to handle this at the application level. You need to write code such that a query for ids 1-2 goes to DB-1, a query for ids 5-6 goes to DB-3, and so on. Likewise, when adding a new record, the application code must decide which shard the new record goes to.
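A minimal sketch of such application-level routing, assuming node-postgres and range-based shards. The shard hostnames and id ranges are placeholders.

```typescript
// The application (not the database) picks which shard to talk to.
import { Pool } from "pg";

const shards = [
  { maxId: 1_000_000, pool: new Pool({ host: "shard-1.internal", database: "app" }) },
  { maxId: 2_000_000, pool: new Pool({ host: "shard-2.internal", database: "app" }) },
  { maxId: 3_000_000, pool: new Pool({ host: "shard-3.internal", database: "app" }) },
];

// Range-based routing on the sharding key (the user id).
function shardFor(userId: number): Pool {
  const shard = shards.find((s) => userId <= s.maxId);
  if (!shard) throw new Error(`No shard configured for id ${userId}`);
  return shard.pool;
}

export async function getUser(userId: number): Promise<unknown> {
  const { rows } = await shardFor(userId).query("SELECT * FROM users WHERE id = $1", [userId]);
  return rows[0];
}

export async function createUser(userId: number, name: string): Promise<void> {
  // On insert, the application must also decide which shard the record belongs to.
  await shardFor(userId).query("INSERT INTO users (id, name) VALUES ($1, $2)", [userId, name]);
}
```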
- Sharding Strategies
Range-Based Sharding:
Data is divided into shards based on ranges of values in the sharding key.
Example:
Pros: Simple to implement.
Cons: Uneven distribution if data is skewed (e.g., some ranges have more users).
Hash-Based Sharding:
A hash function is applied to the sharding key, and the result determines the shard.
Example:
HASH(user_id) % number_of_shards determines the shard.
Pros: Ensures even distribution of data.
Cons: Rebalancing is difficult when adding new shards, as hash results change.
Geographic/Entity-Based Sharding:
Data is divided based on a logical grouping, like region or department.
Example:
Pros: Useful for geographically distributed systems.
Cons: Some shards may become "hotspots" with uneven traffic.
Directory-Based Sharding:
A mapping directory keeps track of which shard contains specific data.
Example: A lookup table maps user_id ranges to shard IDs.
Pros: Flexibility to reassign shards without changing application logic.
Cons: The directory can become a bottleneck.
- Disadvantage of sharding
- Sum up of Database Scaling
SQL vs. NoSQL Databases and when to use which Database
- SQL Database
- NoSQL Database
It is categorized into 4 types: document, key-value, column-family (wide-column), and graph databases.
It has a flexible schema, which means we can insert new fields that were not defined in the initial schema.
It doesn’t strictly follow ACID. It prioritizes other factors, such as scalability and performance.
- Scaling in SQL vs. NoSQL
- When to use which Database?
Microservices
- What is Monolith and Microservice?
Monolith: In a monolithic architecture, the entire application is built as a single unit. Suppose you are building an e-commerce app. In a monolith, you build just one backend that contains the entire functionality (user management, product listing, orders, payments, etc.) in one app.
Microservice: Break down large applications into smaller, manageable, and independently deployable services.
Ex: For an e-commerce app, suppose you break it into a user service, a product service, an order service, and a payment service.
You then build a separate backend app for each service.
- Why do we break our app into Microservices?
- When to use Microservice?
- How do clients request in a Microservice architecture?
Different microservices have different backends and are deployed independently.
Suppose the user service is deployed on a machine with IP address 192.168.24.32, the product service on 192.168.24.38, and so on for the other services. All are deployed on different machines. It is very cumbersome for clients to use a different IP address (or domain name) for each microservice, so we use an API Gateway.
The client sends every request to a single endpoint: the API gateway. The gateway takes each incoming request and routes it to the correct microservice.
You can also scale each service independently. Suppose the product service has more traffic and requires 3 machines, the user service requires 2 machines, and the payment service is fine with 1 machine; with separate services, you can do exactly that.
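As a toy illustration of the routing the gateway does, a bare-bones gateway could be a single Node.js server that forwards requests based on the path prefix. The routing table below reuses the example IPs above and only relays simple GET-style requests; a real setup would use a managed API gateway or a reverse proxy such as nginx.

```typescript
// A toy API gateway: one public endpoint, forwarding by path prefix.
import { createServer } from "node:http";

const routes: Record<string, string> = {
  "/users": "http://192.168.24.32:8080",    // user service
  "/products": "http://192.168.24.38:8080", // product service
};

const gateway = createServer(async (req, res) => {
  const prefix = Object.keys(routes).find((p) => req.url?.startsWith(p));
  if (!prefix) {
    res.writeHead(404);
    res.end("No such service");
    return;
  }

  // Forward the request to the matching microservice and relay its response.
  const upstream = await fetch(routes[prefix] + req.url, { method: req.method });
  res.writeHead(upstream.status, {
    "Content-Type": upstream.headers.get("content-type") ?? "text/plain",
  });
  res.end(await upstream.text());
});

gateway.listen(8080, () => console.log("API gateway up on port 8080"));
```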
API Gateway provides several other advantages as well:
Load Balancer
- Why do we need the Load Balancer?
As we saw earlier in horizontal scaling, if we have many servers handling requests, we can't give the clients all the machines' IP addresses and let them decide which server to send each request to.
The load balancer acts as the single point of contact for clients. They request the load balancer's domain name, and the load balancer forwards each request to one of the servers, for example the least busy one.
Which algorithm does the load balancer follow to decide where to send the traffic?
- Load Balancer Algorithms
Round Robin Algorithm
How it works: Requests are distributed sequentially to servers in a circular order.
Suppose we have 3 servers: Server-1, Server-2 and Server-3.
Then, in round robin, the 1st request goes to Server-1, the 2nd to Server-2, the 3rd to Server-3, the 4th back to Server-1, the 5th to Server-2, the 6th to Server-3, the 7th again to Server-1, the 8th to Server-2, and so on (a minimal sketch follows below).
Advantages:
Disadvantages:
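Here is the minimal round-robin sketch mentioned above. A real load balancer (nginx, HAProxy, an AWS ALB, etc.) implements this for you; the server names are just the ones from the example.

```typescript
// A minimal round-robin picker over the three servers in the example.
class RoundRobinBalancer {
  private next = 0;
  constructor(private servers: string[]) {}

  pick(): string {
    const server = this.servers[this.next];
    this.next = (this.next + 1) % this.servers.length; // wrap around circularly
    return server;
  }
}

const lb = new RoundRobinBalancer(["server-1", "server-2", "server-3"]);
for (let request = 1; request <= 7; request++) {
  console.log(`request ${request} -> ${lb.pick()}`);
}
// request 1 -> server-1, request 2 -> server-2, request 3 -> server-3,
// request 4 -> server-1, ... exactly as described above.
```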
Weighted Round Robin Algorithm
How it works: Similar to Round Robin, but servers are assigned weights based on their capacity, and servers with higher weights receive more requests. In the picture below, you can follow the request numbers to see how it works: the 3rd server is bigger (more RAM, storage, etc.), so it receives twice as many requests as the 1st and 2nd.
Advantages:
Disadvantages:
Least Connections Algorithm
How it works: Directs traffic to the server with the fewest active connections. A connection here can be anything: HTTP, TCP, WebSocket, etc. The load balancer sends each new request to the server that currently has the fewest active connections with it.
Advantages:
Disadvantages:
Hash-Based Algorithm
How it works: The load balancer takes some attribute of the request, such as the client's IP or user_id, hashes it, and uses the result to pick a server. This ensures a specific client is consistently routed to the same server.
Advantages:
Disadvantages:
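A small sketch of hash-based (sticky) routing follows. The hash function here is a simple illustrative one, not what any particular load balancer uses.

```typescript
// Hash the client's IP so the same client always lands on the same server.
function hashString(input: string): number {
  let hash = 0;
  for (const char of input) {
    hash = (hash * 31 + char.charCodeAt(0)) >>> 0; // keep it an unsigned 32-bit int
  }
  return hash;
}

function pickServer(clientIp: string, servers: string[]): string {
  return servers[hashString(clientIp) % servers.length];
}

const servers = ["server-1", "server-2", "server-3"];
console.log(pickServer("203.0.113.7", servers)); // always the same server for this IP
console.log(pickServer("203.0.113.7", servers)); // ...on every request
```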
Caching
- Caching Introduction
Caching is the process of storing frequently accessed data in a high-speed storage layer so that future requests for that data can be served faster.
Ex: Suppose some data takes 500 ms to fetch from a MongoDB database, then 100 ms of calculations are done on it in the backend before it is finally sent to the client; in total, the client waits 600 ms for the data. If we cache this computed data in a high-speed store like Redis and serve it from there, we can reduce the time from 600 ms to 60 ms. (These numbers are hypothetical.)
Caching means storing pre-computed data in a fast-access datastore like Redis and, when a user requests that data, serving it from Redis instead of querying the database.
Example: Take a blog website. When we hit the route /blogs, we get all the blogs. The first time a user hits this route, there is no data in the cache, so we fetch it from the database; suppose the response time is 800 ms. We then store this data in Redis. The next time a user hits the route, the data comes from Redis instead of the database, with a response time of about 20 ms. When a new blog is added, we have to somehow remove the stale list of blogs from Redis and replace it with the new one. This is called cache invalidation. There are many ways to do it. One is to set an expiry time (Time To Live, TTL): for example, Redis deletes the cached blogs every 24 hours, the first request after that is served from the DB, and the result is cached again for subsequent requests.
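A sketch of this /blogs flow, assuming Express and ioredis; getAllBlogsFromDb() is a stand-in for the real database query, and the 24-hour TTL matches the example above.

```typescript
// Cache the /blogs response in Redis with a 24-hour TTL.
import express from "express";
import Redis from "ioredis";

const app = express();
const redis = new Redis(); // connects to localhost:6379 by default

// Placeholder for the real database query (assumed, not from the article).
async function getAllBlogsFromDb(): Promise<unknown[]> {
  return [{ id: 1, title: "Hello world" }]; // pretend this took ~800 ms
}

app.get("/blogs", async (_req, res) => {
  const cached = await redis.get("blogs");
  if (cached) {
    return res.json(JSON.parse(cached)); // cache hit: the ~20 ms path
  }

  const blogs = await getAllBlogsFromDb(); // cache miss: the ~800 ms path
  await redis.set("blogs", JSON.stringify(blogs), "EX", 60 * 60 * 24); // TTL = 24 hours
  res.json(blogs);
});

app.listen(8080);
```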
- Benefits of Caching:
- Types of Caches
Client-Side Cache
Server-Side Cache
CDN Cache
Application-Level Cache
Redis
Redis is an in-memory data structure store. In-memory means the data lives in RAM, and reading and writing from RAM is extremely fast compared to disk. We leverage this fast access for caching. Traditional databases store their data on disk, and disk reads/writes are very slow compared to Redis.
One question you might have: if Redis is so fast, why use a database at all? Can't we just rely on Redis to store all the data?
Redis stores data in RAM, and RAM is much smaller (and more expensive) than disk. If we try to store too much data in Redis, it will run out of memory and start evicting keys or rejecting writes.
Redis stores data as key-value pairs. Just as we access data in a database through tables, in Redis we access data through keys.
Values can be of any data type like string, list, etc, as shown in the figure below.
There are more data types also, but the above ones are mostly used.
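For a feel of these data types, here is a small sketch using ioredis; the key names are just examples.

```typescript
// A few commonly used Redis data types, accessed by key.
import Redis from "ioredis";

const redis = new Redis();

async function demo(): Promise<void> {
  await redis.set("user:1:name", "Asha");                        // string
  await redis.lpush("recent_posts", "post-42");                  // list (push to the front)
  await redis.hset("user:1", { name: "Asha", city: "Delhi" });   // hash (field-value map)
  await redis.sadd("post:42:likes", "user:1");                   // set (unique members)

  console.log(await redis.get("user:1:name"));
  console.log(await redis.lrange("recent_posts", 0, -1));
  await redis.quit();
}

demo();
```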
Blob Storage
In databases, text and numbers are stored as they are. But think of something like a file: mp4, png, jpeg, pdf, etc. These things can't simply be stored in rows and columns.
Such files can be represented as a bunch of 0s and 1s, and this binary representation is called a Blob (Binary Large Object). A database cannot meaningfully store "an mp4", but it can store its blob, because that is just a bunch of 0s and 1s.
However, a blob can be very big; a single mp4 video can be 1 GB. If you store blobs in databases like MySQL or MongoDB, your queries become too slow, and you also have to take care of scaling, backups, and availability for that large volume of data. These are some reasons why we don't store blob data in a database. Instead, we store it in blob storage, which is a managed service (managed means its scaling, security, etc., are taken care of by companies such as Amazon, and we use it as a black box).
Example of Blob Storage: AWS S3, Cloudflare R2.
AWS S3
S3 (Simple Storage Service) is AWS's blob storage, used to store files (blob data) such as mp4, png, jpeg, pdf, html, or any other kind of file you can think of.
You can think of S3 as Google Drive, where you store all your files.
The best thing about S3 is that it is very cheap. Storing 1 GB of data in S3 costs a lot less than storing it in RDS (RDS is AWS's database service, which provides databases such as PostgreSQL, MySQL, etc.).
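A minimal sketch of uploading a file to S3 with the AWS SDK for JavaScript (v3). The bucket name, region, and file path are placeholders.

```typescript
// Upload a video file (blob data) to an S3 bucket.
import { readFile } from "node:fs/promises";
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "ap-south-1" });

async function uploadVideo(): Promise<void> {
  const body = await readFile("./intro.mp4"); // the binary data (blob) to store

  await s3.send(new PutObjectCommand({
    Bucket: "my-app-uploads",
    Key: "videos/intro.mp4",      // the object's path inside the bucket
    Body: body,
    ContentType: "video/mp4",
  }));

  console.log("Uploaded videos/intro.mp4 to S3");
}

uploadVideo();
```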
Features of S3:
Content Delivery Network (CDN)
Source: Medium