Building a Telemetry Service

by Daniel López

Our telemetry system has been running live for the past three years. In this article, we’ll share code and outline some of the things you’ll need to keep in mind if you want to build an anonymous usage reporting system for an open-source project, as we did (at a ridiculously low cost).

We will show you how to use our reporting client and share details of our actual reporting servers, in case you are thinking of implementing a cost-efficient solution with high scalability.

Let’s get started!

Our anonymous report

We are well aware of the importance of privacy. We are not in the data-mining business, so we selected a minimal set of details to share from your KrakenD instances that would give us enough insight without being invasive. We decided that we’d rather lose some accuracy than collect (maybe) sensitive information, so we went for anonymous data.

We ignored typical system metrics like the number of CPUs/cores, CPU usage, available and consumed RAM, network throughput, etc. That’s more related to system monitoring than to the use of KrakenD, and we felt that collecting these metrics would create friction against the acceptance of a telemetry system.

The SPAM problem

The telemetry system gives us some value, but we were aware of its risks, both technical and economic. This system is not our primary focus, so it should be small and simple to keep the development and operational costs under control. Our storage and compute requirements were almost proportional to the number of reports to process, and there was no way to stop DoS attacks from hitting the service, leaving us exposed to a death by a thousand cuts (and a massive bill from our providers).

We took a different approach: instead of preventing a pure DoS, why don’t we avoid the possible spam? We classified as spam all the requests that look well-formed but contain forged or expired data. By rejecting these requests as quickly as possible, we contain the amount of storage and computing cycles consumed by a malicious actor. We looked at the strategies used in other sectors when dealing with spam, such as in emails and crypto transactions. We found an interesting idea: "…to require a user to compute a moderately hard, but not intractable function…". For a request to be accepted, a proof of work is needed.

Given that generating a proof of work is several times more expensive than verifying it, we could require a new proof of work for every report we receive. It could also be used as a signature for the request, making reuse (and shortcuts such as precalculated tables, memoization, etc.) impossible.

TL;DR: every request costs the sender more than it costs the receiver. If you’re a well-intended sender, the cost is negligible (1 second of work every 12 hours). If you’re a malicious actor, your bill will be some orders of magnitude bigger than ours. It will be like trying to drown us in your own blood.

From all the available options, we decided to go with hashcash since it’s free and already used to limit email spam and DoS attacks. For the proof of work, the client (sender) must solve, by brute force, a small guessing problem: discovering an offset that, added to the challenge, satisfies a condition with a very low probability of occurrence. On the other side, the server (receiver) merely needs to check that the proof of work has not expired and that its hash starts with a given number of zeros (the condition to satisfy).
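To make the asymmetry concrete, here is a minimal sketch of the mint/verify idea in Go. It is not the real hashcash stamp format nor our production code: the stamp layout (a Unix timestamp and no salt field), the function names, and the use of SHA-1 are assumptions made for illustration only, and the resource is assumed to contain no colon.

```go
package main

import (
	"crypto/sha1"
	"fmt"
	"math/bits"
	"strconv"
	"strings"
	"time"
)

// leadingZeroBits counts the zero bits at the start of a SHA-1 digest.
func leadingZeroBits(sum [sha1.Size]byte) int {
	n := 0
	for _, b := range sum {
		if b != 0 {
			return n + bits.LeadingZeros8(b)
		}
		n += 8
	}
	return n
}

// mint brute-forces a counter until the stamp's hash has at least
// `difficulty` leading zero bits. The stamp layout is a simplification.
func mint(resource string, difficulty int, issuedAt int64) string {
	prefix := fmt.Sprintf("1:%d:%d:%s::", difficulty, issuedAt, resource)
	for counter := uint64(0); ; counter++ {
		stamp := prefix + strconv.FormatUint(counter, 10)
		if leadingZeroBits(sha1.Sum([]byte(stamp))) >= difficulty {
			return stamp
		}
	}
}

// verify needs a single hash: it checks the fields, the age, and the zero bits.
func verify(stamp, resource string, difficulty int, maxAgeSec, now int64) bool {
	f := strings.Split(stamp, ":")
	if len(f) != 6 || f[0] != "1" || f[1] != strconv.Itoa(difficulty) || f[3] != resource {
		return false
	}
	issuedAt, err := strconv.ParseInt(f[2], 10, 64)
	if err != nil || now < issuedAt || now-issuedAt > maxAgeSec {
		return false // malformed, from the future, or expired
	}
	return leadingZeroBits(sha1.Sum([]byte(stamp))) >= difficulty
}

func main() {
	now := time.Now().Unix()
	stamp := mint("session-token", 20, now) // about 2^20 hashes on average
	fmt.Println(stamp, verify(stamp, "session-token", 20, 3600, now))
}
```

With 20 bits of difficulty, the sender needs around a million hash attempts on average, while the receiver computes exactly one hash per stamp. Because the resource is part of the hashed stamp, a stamp minted for one token does not verify for another.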

Sessions and reports

When a new instance of KrakenD gets started, the usage client asks for a new session. The usage server returns a token for the given pair of identifiers (ServerID and ClusterID) that should be used when creating the reports as part of the challenge.

To know our users better, we thought it was essential to get some system details from our binary: the version, the architecture, and the operating system it was running on.

Then we thought about keeping track of every running instance, so we generated a random UUID before creating the client (ServerID) and used it to identify a given single server.

To know the average size of the KrakenD clusters, we needed to group instances, so we decided to use a hash of the configuration as a cluster identifier (ClusterID).

Finally, we wanted to know the average life span of the KrakenD instances, so we added one final metric to the reports: the service’s uptime.
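The identifiers and fields described above could be assembled along these lines. This is a sketch under assumptions, not the code of our client: the struct and its field names are hypothetical, SHA-256 is just one possible choice for the configuration hash, and the config bytes and version string are placeholders.

```go
package main

import (
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"runtime"
	"time"
)

// report groups the fields discussed in this section; the names are hypothetical.
type report struct {
	ServerID, ClusterID, Version, Arch, OS string
	UptimeSeconds                          int64
}

// newServerID returns a random (version 4) UUID built from crypto/rand.
func newServerID() (string, error) {
	var b [16]byte
	if _, err := rand.Read(b[:]); err != nil {
		return "", err
	}
	b[6] = (b[6] & 0x0f) | 0x40 // set the version to 4
	b[8] = (b[8] & 0x3f) | 0x80 // set the RFC 4122 variant
	return fmt.Sprintf("%x-%x-%x-%x-%x", b[0:4], b[4:6], b[6:8], b[8:10], b[10:16]), nil
}

// clusterID hashes the raw configuration, so instances that share
// a byte-identical config end up with the same identifier.
func clusterID(config []byte) string {
	sum := sha256.Sum256(config)
	return hex.EncodeToString(sum[:])
}

func main() {
	startedAt := time.Now()
	serverID, err := newServerID()
	if err != nil {
		panic(err)
	}
	r := report{
		ServerID:      serverID,
		ClusterID:     clusterID([]byte(`{"version": 3}`)), // placeholder config
		Version:       "0.0.0-example",                     // placeholder version
		Arch:          runtime.GOARCH,
		OS:            runtime.GOOS,
		UptimeSeconds: int64(time.Since(startedAt).Seconds()),
	}
	fmt.Printf("%+v\n", r)
}
```

Note that hashing raw bytes means any formatting difference between two otherwise equivalent configuration files yields a different ClusterID, and that neither identifier reveals anything about the host or the content of the configuration.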

Reporting frequency

Another critical decision: if the reporting frequency were too low, the value of the telemetry system would be comparable to the value of our download stats (deb, rpm and tgz repositories, Docker Hub, the marketplaces, etc.), but if it were too high, we could affect the trust of the community or even the performance of the product. The worst-case scenario also included a self-inflicted DoS due to the high number of instances that could be running out there sending legit reports.

After several discussions, models, and benchmarks, we thought sending a report every 12 hours would be the right choice. Looking back, I guess I should have given my vote to a little more ambitious option and set the reporting to every 3 or 4 hours.
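The trade-off is simple arithmetic: the load on the server grows linearly with the reporting frequency. The sketch below uses a fleet of 10,000 instances, which is a made-up number for illustration and not a figure from our telemetry.

```go
package main

import "fmt"

// reportsPerDay returns how many reports a fleet sends per day
// when every instance reports once per intervalHours.
func reportsPerDay(instances, intervalHours int) int {
	return instances * 24 / intervalHours
}

func main() {
	const fleet = 10000 // hypothetical fleet size
	for _, hours := range []int{12, 4, 3} {
		fmt.Printf("every %2dh: %d reports/day\n", hours, reportsPerDay(fleet, hours))
	}
}
```

Going from 12 hours to 3 hours multiplies the number of reports to verify and store by four, which is still modest as long as each report is cheap to check.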

The reporting client

Here is the source code of our reporting client. It has some opinionated options that you should keep in mind if you intend to use it. The opinionated options are:

Requests will time out after 15 seconds
Sessions will be created by sending a POST request to the endpoint /session
Reports will be created by sending a POST request to the endpoint /report
Reports will be sent every 12 hours
The proof of work will be done with hashcash and the following params:
- HashBits = 20
- SaltChars = 40
- DefaultExtension = ""

The configurable options are:

ClusterID: the identification for a cluster
ServerID: the identification for the instance
URL: the URL of your telemetry service (by default: https://usage.krakend.io)
Version: the version of the binary

To start a reporting client, just import github.com/devopsfaith/krakend-usage/client and call the single exposed function:

client.StartReporter(ctx, client.Options{
    ClusterID: "your_cluster_id",
    ServerID:  "your_server_id",
    URL:       "https://your-usage-domain.tld",
    Version:   "your_binary_version",
})

The reporting client will create a session and use the token to keep reporting every 12 hours to the server with a new proof of work, so the report gets accepted.
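The general shape of such a reporting loop can be sketched with a ticker and a context. This is an illustration, not the code of the client: the function names are hypothetical, the send callback stands in for "compute a new proof of work and POST the report", and sending one report right away before waiting for the first tick is an assumption.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runReporter calls send once immediately and then on every tick,
// until the context is cancelled.
func runReporter(ctx context.Context, interval time.Duration, send func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := send(); err != nil {
			// A failed report is not fatal: log it and wait for the next tick.
			fmt.Println("report failed:", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// runFor runs the loop for totalMs milliseconds with a fake sender
// and returns how many reports were sent.
func runFor(intervalMs, totalMs int) int {
	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(totalMs)*time.Millisecond)
	defer cancel()
	sent := 0
	runReporter(ctx, time.Duration(intervalMs)*time.Millisecond, func() error {
		sent++ // a real sender would mint a proof of work and POST to /report
		return nil
	})
	return sent
}

func main() {
	fmt.Println("reports sent:", runFor(20, 110))
}
```

In a real service the interval would be 12 hours and the loop would run in its own goroutine, so that cancelling the context on shutdown stops the reporter cleanly.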

The reporting server

To limit our risk exposure, we decided to keep the server-side part of our telemetry system undisclosed for now. That means we won’t publish the entire source code of it, but we are open to sharing some details with the community, making it easier for everyone to build their own version.

Nevertheless, the service is a small API with just two endpoints exposed:

/session
/report

After every successful request to either of these endpoints, the service updates the database, sends a notification to our company Slack, and calls Google Analytics as if it were a simple page view, so we have complete visibility in real time.
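If you want to build your own version, a two-endpoint service along these lines is a reasonable starting point. Everything here is an assumption made for illustration and not our undisclosed implementation: the JSON payloads, the in-memory session map, the status codes, and the checkProof hook, which is a trivial stand-in where a real proof-of-work check would go.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// Hypothetical payloads: the article does not describe the real wire format.
type sessionRequest struct {
	ClusterID string `json:"cluster_id"`
	ServerID  string `json:"server_id"`
}

type reportRequest struct {
	Token  string `json:"token"`
	Uptime int64  `json:"uptime"`
	Proof  string `json:"proof"`
}

type server struct {
	mu         sync.Mutex
	sessions   map[string]sessionRequest      // token -> identifiers
	checkProof func(token, proof string) bool // stand-in for the proof-of-work check
}

// handleSession issues a random token for a (ClusterID, ServerID) pair.
func (s *server) handleSession(w http.ResponseWriter, r *http.Request) {
	var req sessionRequest
	if r.Method != http.MethodPost || json.NewDecoder(r.Body).Decode(&req) != nil ||
		req.ClusterID == "" || req.ServerID == "" {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	raw := make([]byte, 16)
	if _, err := rand.Read(raw); err != nil {
		http.Error(w, "internal error", http.StatusInternalServerError)
		return
	}
	token := hex.EncodeToString(raw)
	s.mu.Lock()
	s.sessions[token] = req
	s.mu.Unlock()
	json.NewEncoder(w).Encode(map[string]string{"token": token})
}

// handleReport rejects unknown sessions and bad proofs before doing any other work.
func (s *server) handleReport(w http.ResponseWriter, r *http.Request) {
	var req reportRequest
	if r.Method != http.MethodPost || json.NewDecoder(r.Body).Decode(&req) != nil {
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	s.mu.Lock()
	_, known := s.sessions[req.Token]
	s.mu.Unlock()
	if !known || !s.checkProof(req.Token, req.Proof) {
		http.Error(w, "rejected", http.StatusForbidden)
		return
	}
	// Storing the report and sending notifications would happen here.
	w.WriteHeader(http.StatusOK)
}

// demo drives both endpoints in memory and returns the three status codes.
func demo() (session, accepted, rejected int) {
	s := &server{
		sessions:   map[string]sessionRequest{},
		checkProof: func(token, proof string) bool { return proof == "ok:"+token },
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/session", s.handleSession)
	mux.HandleFunc("/report", s.handleReport)
	post := func(path, body string) *httptest.ResponseRecorder {
		rec := httptest.NewRecorder()
		mux.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, path, strings.NewReader(body)))
		return rec
	}
	first := post("/session", `{"cluster_id":"c1","server_id":"s1"}`)
	var out map[string]string
	if err := json.Unmarshal(first.Body.Bytes(), &out); err != nil {
		return first.Code, 0, 0
	}
	good := post("/report", fmt.Sprintf(`{"token":%q,"uptime":60,"proof":%q}`, out["token"], "ok:"+out["token"]))
	bad := post("/report", fmt.Sprintf(`{"token":%q,"uptime":60,"proof":"forged"}`, out["token"]))
	return first.Code, good.Code, bad.Code
}

func main() {
	fmt.Println(demo())
}
```

The important design choice is the ordering inside the report handler: the cheap checks (known session, valid proof) run first, so a forged request is dropped before it touches the database or triggers any notification.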

In the future, we plan to integrate other metrics into the system from our public rpm, deb and tgz repos, from Docker Hub, etc., so we can also cross-examine the correlation between downloads and usage.

Summary

This article shared our experience building and deploying a small but powerful telemetry system for the KrakenD API Gateway. Some of these recommendations might come in handy if you are interested in adding a similar feature to your project. We’ve stressed the importance of respecting our users’ privacy and keeping the trust they have in us.

Did you make it this far?

Thanks for reading! If you like our product, don’t forget to star our project!

Categories: Technical Insights & Best Practices
