The World of Observability

Ankit Sahay
May 26, 2024


Recently, I started digging deeper into observability and its principles, and found the field riddled with terms like observability, telemetry, signal, and instrumentation.

This article aims to ease your journey into the world of observability. We will cover enough of the basics and key terms that you can dive in with confidence.

For the curious, let’s address the elephant in the room: what is observability? In short, it is telemetry plus analysis.

Now let’s take a step back and ask the question: what is observability observing?
In today’s world, in all likelihood, we want to observe a distributed system, be it in the cloud or on-premises.

A distributed system is a system whose components are located on different networked computers (servers/machines) that communicate and coordinate among themselves by passing messages to one another.

How do we observe these distributed systems?
We can’t, unless they emit telemetry.

Telemetry is data that describes what your system is doing. Without telemetry, your system is just a big black box filled with mystery.

The word telemetry is a bit confusing and overloaded. The distinction drawn in our industry, and in systems monitoring in general, is between user telemetry and performance telemetry.

User Telemetry

User telemetry refers to data about how a user interacts with a system through a client: button clicks, session duration, and so on. We can use this data to understand how users interact with an e-commerce site, an API, and the like.

Performance Telemetry

Performance telemetry is not primarily used to analyse user behaviour. Instead, it provides operators and SREs with statistical information about the behaviour and performance of system components. Performance data can come from different sources in a distributed system and offers developers a breadcrumb trail to follow, connecting cause with effect.

In plainer terms, user telemetry will tell you how long someone hovered their mouse cursor over a checkout button. Performance telemetry will tell you how long it took for that checkout page to load and which programs and resources the system used along the way.
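
To make the contrast concrete, here is a minimal sketch of what one record of each kind might look like. The class names, field names, and values are all made up for illustration; real tools define their own schemas.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UserTelemetry:
    """What a user did in the client (hypothetical fields)."""
    user_id: str
    element: str      # e.g. the checkout button
    action: str       # e.g. "hover" or "click"
    duration_ms: int  # how long the interaction lasted


@dataclass
class PerformanceTelemetry:
    """How the system behaved while serving a request (hypothetical fields)."""
    service: str
    operation: str        # e.g. loading the checkout page
    latency_ms: int       # how long the operation took
    resources: List[str]  # components touched along the way


# Illustrative values only.
hover = UserTelemetry("user-42", "checkout-button", "hover", 1800)
page_load = PerformanceTelemetry(
    "web-frontend", "GET /checkout", 640, ["cart-service", "pricing-db"]
)

print(f"{hover.user_id} hovered over {hover.element} for {hover.duration_ms} ms")
print(f"{page_load.operation} took {page_load.latency_ms} ms via {page_load.resources}")
```

The first record describes a person; the second describes the system. Neither one can stand in for the other.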

Underneath user and performance telemetry are different types of signals. A signal is a particular form of telemetry. Event logs are one kind of signal; metrics are another. These signal types each serve a different purpose, and they are not interchangeable.

You can’t derive the events that came from a user interaction just by looking at system metrics.
Similarly, you can’t derive system load just by looking at users’ transaction logs.
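
A small sketch of the first point, using invented data: two different event logs can collapse into exactly the same metric, so the metric alone cannot tell you which events happened. The event fields and the count-per-action metric are assumptions for this example, and it only illustrates one direction of the claim.

```python
from collections import Counter

# Event log signal: one record per thing that happened (hypothetical data).
events = [
    {"user": "alice", "action": "add_to_cart", "item": "book"},
    {"user": "bob", "action": "add_to_cart", "item": "lamp"},
    {"user": "alice", "action": "checkout", "item": "book"},
]

# Metric signal: a numeric summary. Here, a count per action.
action_counts = Counter(event["action"] for event in events)
print(dict(action_counts))

# A completely different event log produces the very same metric,
# so the events cannot be recovered from the counts.
other_events = [
    {"user": "carol", "action": "add_to_cart", "item": "mug"},
    {"user": "carol", "action": "add_to_cart", "item": "pen"},
    {"user": "dave", "action": "checkout", "item": "desk"},
]
other_counts = Counter(event["action"] for event in other_events)
print(action_counts == other_counts)
```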

Each signal consists of two parts: instrumentation and transmission.

Instrumentation is the code that emits telemetry data from within the program itself.
The transmission system sends that data over the network to an analysis tool, where the actual observing occurs.
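
Here is a rough sketch of the two parts working together. The `instrumented` decorator and the `transmit` function are hypothetical names, and an in-memory queue stands in for the network; a real setup would ship the data to an actual analysis backend.

```python
import functools
import json
import queue
import time

# Stand-in for the network. A real transmission system would deliver
# these payloads to an analysis tool; the queue is an assumption here.
outbox = queue.Queue()


def transmit(record):
    """Transmission: serialize the record and hand it off for delivery."""
    outbox.put(json.dumps(record))


def instrumented(func):
    """Instrumentation: code inside the program that emits telemetry."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            transmit({"operation": func.__name__, "duration_ms": elapsed_ms})
    return wrapper


@instrumented
def load_checkout_page():
    time.sleep(0.01)  # pretend to do some work
    return "checkout page"


result = load_checkout_page()
sent = json.loads(outbox.get_nowait())
print(result, sent["operation"], round(sent["duration_ms"], 1))
```

Notice that `load_checkout_page` itself knows nothing about where the data goes. The decorator measures, and `transmit` ships.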

This raises an important distinction: it’s common to treat telemetry and analysis as the same thing, but they are not. The system that emits the data and the system that analyses the data are separate from each other.

Telemetry is the data itself. Analysis is what you do with the data.
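
As a toy illustration of that split, the sketch below keeps the raw records (telemetry) apart from the function that answers a question about them (analysis). The records and the `analyse` helper are invented for this example.

```python
from statistics import mean

# Telemetry: raw records as they might arrive from instrumented services
# (hypothetical values).
telemetry = [
    {"operation": "GET /checkout", "duration_ms": 120},
    {"operation": "GET /checkout", "duration_ms": 640},
    {"operation": "GET /cart", "duration_ms": 80},
    {"operation": "GET /checkout", "duration_ms": 200},
]


def analyse(records, operation):
    """Analysis: turn raw records into an answer to a question."""
    durations = [r["duration_ms"] for r in records if r["operation"] == operation]
    return {
        "count": len(durations),
        "mean_ms": mean(durations),
        "max_ms": max(durations),
    }


summary = analyse(telemetry, "GET /checkout")
print(summary)
```

The same records could feed a completely different analysis tomorrow without touching the code that emitted them.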

And finally, putting the pieces together: telemetry plus analysis equals observability.

I hope the distinctions between terms like observability, telemetry, instrumentation, and signals are now clear! If you have any questions, let me know in the comments.
