Why a Higher Replication Factor on Your Kafka Topic Might Be Killing Your Latency Goals
If you are reading this article, I am assuming you are aware of the Replication Factor in Kafka, which is a way to safeguard your data in case of broker (server) failures.
As a quick revision: Suppose you have a Kafka Cluster with 3 brokers (servers) as shown below:
Now, you create a topic called “sales-topic” with 1 partition, so the topic gets created on any one of the brokers as shown below:
The problem with the above setup is that all your “sales-topic” data is on Broker1. What happens if that broker fails (which, at some point, it will)?
If you lose that server forever, then all your data is lost too. And even if the loss isn’t permanent but it takes you around 3 hours to bring the broker back up, then for those 3 hours you lose the availability of that topic: no producers or consumers can write to or read from it.
That’s where the replication factor helps us. Not only does it keep data safe by storing it on multiple brokers, it also increases availability in case of failures. See the diagram below:
This time, while creating the topic, we configured it with a Replication Factor of 2. So now, the same topic has two copies, one acting as the Leader (L) and the other acting as a Follower (F).
Now if you lose Broker1, your data is still safe on Broker2, because everything that is written to sales-topic’s leader on Broker1 is copied to sales-topic’s follower on Broker2 (this is an internal process). And if Broker1 fails, the follower will become the leader, so your availability is maintained too.
But what happens if we lose both Broker1 and Broker2? In this case, again there will be data loss and availability loss. So what’s the fix? We can increase the replication factor by one more and create the topic with a Replication Factor of 3, as shown below:
In this case, your data’s durability is enhanced even further. Now you can withstand the simultaneous failure of 2 brokers and still maintain your data and the topic’s availability.
One final point: in the above setup you cannot go beyond a Replication Factor of 3 (you will get an error if you try to create a topic with a replication factor greater than the number of brokers). If you increased the RF by one more, the next follower would have to be placed on one of Broker1, Broker2 or Broker3, and there is no point in keeping a leader and a follower on the same broker, because a single failure would take out both of them at once.
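To make this concrete, here is a minimal sketch of creating the topic with a replication factor of 3 using Kafka’s Java AdminClient. The broker addresses and topic name are just placeholders for this example; asking for a replication factor greater than the number of brokers fails with an InvalidReplicationFactorException.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;

import java.util.Collections;
import java.util.Properties;

public class CreateSalesTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        // Placeholder bootstrap servers for a 3-broker cluster
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG,
                "broker1:9092,broker2:9092,broker3:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 1 partition, replication factor 3: one leader + two followers
            NewTopic salesTopic = new NewTopic("sales-topic", 1, (short) 3);

            // Requesting a replication factor of 4 on a 3-broker cluster would
            // fail with an InvalidReplicationFactorException.
            admin.createTopics(Collections.singletonList(salesTopic)).all().get();
        }
    }
}
```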
In your production environment, you might typically have tens or even hundreds of brokers. As we have seen, the higher the replication factor, the more broker failures we can sustain without any impact on durability or availability. So in a cluster with, say, 20 brokers, should we keep the replication factor at 20, so that even with 19 failed brokers there is no data loss?
The answer is NO! Increasing the replication factor has several downsides too. Some are obvious and some are not. Let’s take a look at them one by one.
Limitations of a Higher Replication Factor
- Hardware Costs: This one is pretty obvious. When we create replicas (copies) of a topic, we are storing the same data in multiple places. In Image 3, where we have a replication factor of 2, the data is stored on 2 brokers, B1 and B2. In Image 4, where we have a replication factor of 3, the data is stored on 3 brokers, B1, B2 and B3. With an RF of 3 and a data size of 1 TB, you will actually need 3 TB to store the data in 3 places.
- Inter-broker Communication: When we create a topic with an RF of 3, we get 1 leader and 2 followers, as shown in Image 4. Producers always write data to the leader. The followers internally keep polling the leader (just like consumer applications) to fetch the data. Let’s say your throughput is 10 Mbps from producer to leader. On top of that, an extra 10 Mbps of traffic is created from the leader to Follower1, and another 10 Mbps from the leader to Follower2. So with RF 3, your original traffic of 10 Mbps becomes 30 Mbps. This has network costs as well as a performance impact on your cluster (see the back-of-the-envelope sketch after this list).
- Latency: This is a slightly non-obvious drawback and will take some explanation, so let’s get to it in the following section.
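As a back-of-the-envelope illustration of the first two points, here is a tiny sketch; the 1 TB and 10 Mbps figures are just the example numbers from the list above.

```java
public class ReplicationOverhead {
    public static void main(String[] args) {
        int replicationFactor = 3;   // 1 leader + 2 followers
        double topicSizeTb = 1.0;    // logical size of the topic's data
        double producerMbps = 10.0;  // producer -> leader throughput

        // Every replica stores a full copy of the data.
        double totalStorageTb = topicSizeTb * replicationFactor;

        // The leader effectively re-sends the producer traffic to each follower.
        double totalNetworkMbps = producerMbps * replicationFactor;

        System.out.printf("Storage needed:  %.1f TB%n", totalStorageTb);     // 3.0 TB
        System.out.printf("Cluster traffic: %.1f Mbps%n", totalNetworkMbps); // 30.0 Mbps
    }
}
```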
How Can Increasing Replication Factor Increase End To End Latency?
To set some basics right: Producers can only write to the Leader Replica, and by default Consumers also consume only from the Leader Replica (although Consumers can be configured to read from the Follower that is closest to them, network-wise).
So how is data copied from the Leader Replica to a Follower Replica? The Followers send fetch requests to the Leader in a continuous poll loop, as shown in the figure below:
Because the Followers keep fetching data from the leader, it can take some time for a follower to get all the messages. Followers that are fully caught up with the leader are called “In-Sync Replicas”, or ISRs.
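For reference, reading from the closest follower is available on Kafka 2.4+ (KIP-392): the brokers need a rack-aware replica selector, and the consumer advertises which “rack” it sits in. Below is a minimal sketch, assuming the brokers already have `broker.rack` set and `replica.selector.class=org.apache.kafka.common.replica.RackAwareReplicaSelector` in their server config; the broker address, group id and rack name are placeholders.

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class NearestReplicaConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "sales-readers");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Tell the cluster which "rack" (e.g. availability zone) this consumer is in,
        // so a rack-aware broker can point it at the nearest replica.
        props.put(ConsumerConfig.CLIENT_RACK_CONFIG, "us-east-1a");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sales-topic"));
            consumer.poll(Duration.ofSeconds(1))
                    .forEach(record -> System.out.println(record.value()));
        }
    }
}
```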
Now here are 3 interesting facts:
- A message is only available to the Consumer application once it has been replicated to all the ISRs. This means that even if the Producer sent a message M1 to the Leader, Consumers will not receive it until M1 has been copied to every ISR. This is where latency starts creeping in. Let’s say message M1 was sent to the Leader at T0, but it takes some 2 seconds for a follower to get its copy. In those 2 seconds, the Consumer will not be given M1 even if it fetches for it. So we have introduced a latency of 2 seconds simply because we have replicas that the data needs to be copied to.
- This delay is only for the ISRs to catch up, not all the followers. If you have a replication factor of 5, you have 1 leader and 4 followers. Out of the 4 followers, not all of them might be ISRs; let’s say only 2 are. So for a message to be available to the consumer, it does not have to be copied to all 4 followers, only to the ISRs, which are 2 in this case.
- It takes 30 seconds for an ISR to become a non-ISR. Just because a follower does not have all the messages the Leader has does not immediately make it a non-ISR. Suppose at T=0 both the leader and a follower have 5,000 messages. Now the leader receives a surge of 2,000 more messages, but by T=10 seconds the follower has only copied 500 of them, so it is effectively behind by 1,500 messages. Does that make it a non-ISR? No! It has not been lagging for 30 seconds yet. This can increase the latency even further. Say you have a replication factor of 5, i.e. 1 leader and 4 followers, with 2 of the followers in the ISR and 2 not. We do not need to worry about the non-ISRs, because Kafka doesn’t wait for them; it only waits until the ISRs get the message. But remember, it takes 30 seconds for an ISR to be demoted to a non-ISR. Say at T=0 one of the ISRs starts falling behind, and at T=5 it is behind by 2,000 messages. Since 30 seconds haven’t passed, it is still considered an ISR, and for every message the leader receives, it will wait for this lagging ISR to get it before exposing it to the consumer. As you can see, the consumer now has to wait for this follower to catch up, and that wait can add up to 30 seconds of latency! (You can inspect a topic’s current ISRs with the sketch after this list.)
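If you want to see which followers are currently in the ISR for your topic, you can inspect it with the AdminClient. Here is a minimal sketch, assuming a Kafka 3.1+ Java client and a placeholder broker address. The 30-second window described above is governed by the broker setting replica.lag.time.max.ms, which defaults to 30 seconds on recent Kafka versions.

```java
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.TopicDescription;

import java.util.Collections;
import java.util.Properties;

public class InspectIsr {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            TopicDescription description = admin
                    .describeTopics(Collections.singletonList("sales-topic"))
                    .allTopicNames().get()
                    .get("sales-topic");

            // For each partition, print the leader, all replicas, and the current ISR.
            description.partitions().forEach(p ->
                    System.out.printf("partition %d: leader=%s replicas=%s isr=%s%n",
                            p.partition(), p.leader(), p.replicas(), p.isr()));
        }
    }
}
```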
So, in conclusion, we should be very careful when setting the replication factor of a topic. Yes, it does increase durability and availability in case of broker failures, but a replication factor beyond 3 is seldom recommended, and a higher RF can cause severe latency issues even when your Kafka cluster is working perfectly fine!
Bonus Section
You might be wondering why Kafka waits for the ISRs to get all the messages before making them available to the consumer. Why not make a message available to the consumer the moment the Leader receives it? The answer is Consistency! I will cover that in a future blog. Keep reading!