With over 49M learners tallying up over 680M course enrollments, it’s imperative that there are no errors in our platform’s code that could disrupt learners’ ability to access educational content. Meanwhile, as developers, we work with an abstract representation of what our code does. We spend so much time inside our heads that it’s very easy to make mistakes when mapping our thinking to reality. As we continuously ship our code into production, we don’t actually know if users encounter any problems on their side. One way to overcome this blind spot is to monitor user events such as clicks and page views, which lets you see the granular details of user activity and, in turn, any errors users encounter while performing those actions.
In this blog post, we will look at how we make life easier for both event developers and designers by creating Auto-generated Event Monitors. We will also examine why monitoring is necessary to provide a better user experience and explore how we collect our monitoring data, aka our metrics.
At Udemy, we are always improving our system by adding new features and functionality, so our system is becoming increasingly complex and distributed. Consequently, end-to-end monitoring of our systems is essential to provide a better service for our users. You may be asking, “Why don’t you use testing strategies like unit or integration tests?” We do use several testing strategies, but they may not succeed at preventing:
This is where event monitors come into play. Setting up event monitors helps us minimize these issues. To set them up, we supply the monitors with metrics from our servers as input (for the purpose of this blog post, metrics are the total number of user events by type). On top of this, we define a set of alarms on the monitors that alert us when a threshold is exceeded. If a monitor triggers an alarm, we take immediate action to fix the issue so that no one using our platform has a bad experience.
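To make the idea of metrics and alarms concrete, here is a minimal sketch in plain Python. It is not how Datadog evaluates monitors; the Alarm class, its low and high bounds, and the PageViewEvent name are hypothetical and exist only for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    """Hypothetical alarm: fires when an event count leaves [low, high]."""
    event_type: str
    low: int
    high: int


def count_events_by_type(events):
    """The metric described in the text: total number of events per type."""
    return Counter(event["type"] for event in events)


def triggered_alarms(counts, alarms):
    """Return every alarm whose count falls outside its allowed range."""
    return [
        alarm for alarm in alarms
        if not alarm.low <= counts.get(alarm.event_type, 0) <= alarm.high
    ]


# Illustrative input: one add-to-cart event and two page views.
events = [
    {"type": "AddToCartEvent"},
    {"type": "PageViewEvent"},  # hypothetical event name
    {"type": "PageViewEvent"},
]
counts = count_events_by_type(events)
alarms = [
    Alarm("AddToCartEvent", low=2, high=100),
    Alarm("PageViewEvent", low=1, high=100),
]
fired = triggered_alarms(counts, alarms)
```

In this example only the AddToCartEvent alarm fires, because its count is below the lower bound.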
We use Datadog as a SaaS-based monitoring tool for the functionalities mentioned above. Datadog has plenty of features for monitoring, alerting, notifying, and even collecting metrics in various ways. Since metrics are essential data for monitoring, let’s take a step back and see how we collect our metrics at Udemy.
Let’s make an analogy using a scenario where you go to the hospital for your annual health check-up. Most likely, one of the exams will be an electrocardiogram (ECG) test, in which the electrical signals of your heart are recorded. These heart signals are equivalent to the metrics, and the graphs on the test report are equivalent to the monitors.
The illustration above, which includes Datadog Agents running as pods in the Kubernetes clusters, is similar to the ECG test. Each pod created by the DaemonSet acts as an ECG machine, which collects the metrics from other pods and sends them to the Datadog Cloud. Different types of metrics can be collected, such as server health metrics, application metrics, or even our event metrics, which are automatically created by the event tracking system after a new event is registered.
We will walk through an example of how AddToCartEvent is created, which service sends its metrics, and finally how we use the relevant event monitor:
Still, having these metrics in Datadog Cloud does not necessarily mean that required event monitors are created; someone needs to go to the UI and create them manually.
P.S. You can read more about our event tracking system’s architecture in this post.
At Udemy, each team is responsible for creating and managing its own event data. Although some generic monitors apply to all types of events, such as event serialization exception rates, in some cases each team should create its own event monitors. One example is an event monitor focused on traffic pattern anomalies: the teams are the subject-matter experts in their domains and know what is and is not expected in their data.
On the other hand, creating a new event monitor requires extra effort because the process is manual. First, you need to be aware that such event monitors can be created; then you need to know which metric to choose and how to configure it in the UI. This may be easy for people who are familiar with Datadog monitoring, but it may not be that simple for others.
To encourage teams to create their own monitors, we had to think about how we make this process easier. As a requirement, everyone should be able to create monitors with minimal effort.
In the event tracking system, event schemas are stored in a single repository on GitHub as Avro IDL files. These schemas may change over time according to business requirements. To reflect these changes in the system, we have an internal service called ESM (Event Schema Manager), which works with GitHub webhooks. GitHub calls these webhooks when a new comment is added to a pull request that includes a schema change. Commenting “esm register” on the pull request takes care of the registration process.
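The comment-driven trigger can be sketched as a small function that inspects a webhook payload. This is an assumption about how such a handler might look, not ESM's actual code: the function name is hypothetical, and the payload fields (action, issue.pull_request, issue.number, comment.body) follow GitHub's issue_comment webhook event.

```python
REGISTER_COMMAND = "esm register"


def registration_request(payload):
    """Return the pull request number to register, or None.

    Hypothetical helper: only a newly created comment on a pull request
    whose text is exactly the register command counts as a request.
    """
    if payload.get("action") != "created":
        return None
    issue = payload.get("issue", {})
    # GitHub marks pull requests with a "pull_request" key on the issue.
    if "pull_request" not in issue:
        return None
    body = payload.get("comment", {}).get("body", "")
    if body.strip() != REGISTER_COMMAND:
        return None
    return issue.get("number")


# Illustrative payload, trimmed to the fields the sketch reads.
example_payload = {
    "action": "created",
    "issue": {"number": 42, "pull_request": {}},
    "comment": {"body": "esm register\n"},
}
pr_number = registration_request(example_payload)
```

Comments on plain issues, edited comments, and comments with any other text are ignored, so unrelated discussion on the pull request does not trigger a registration.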
With this background, we have two options for automating monitor creation:
After reviewing these options, we chose the first one since the annotation-based configuration is a relatively more intuitive approach for developers. The disadvantage of the latter is that developers wouldn’t be able to know if there are any monitors created for their schemas by just looking at the schema structure. We love thinking of our schemas as the source of truth in our system, meaning that they should keep track of anything related to events including their monitor changes.
Now let’s explain how the annotation-based approach works by following the above illustration step-by-step:
Although we updated our parameter in the monitor annotation of the schema, our event monitor in Datadog hasn’t been updated yet. When registering the new schema changes with ESM, we need to get the monitor ID and pass the parameters to it so that the changes are reflected in Datadog as well. Instead of storing these monitor IDs in a database, we chose to use name and tag pairs that are specific to each event monitor. This enables us to search for the monitor by its unique pair and eventually get its ID. For example:
AddToCartEvent Traffic Anomaly Monitor → Name: Traffic Anomaly with AddToCartEvent, Tag:
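The name and tag lookup can be sketched as follows. This is only an illustration under assumptions: the in-memory list stands in for whatever a monitor search returns, the find_monitor_id helper is hypothetical, and the tag value is a placeholder, not the real tag used for this monitor.

```python
def find_monitor_id(monitors, name, tag):
    """Return the ID of the single monitor matching both name and tag.

    Returns None when nothing matches and raises if the pair is not unique,
    since the approach relies on each pair identifying exactly one monitor.
    """
    matches = [
        monitor["id"]
        for monitor in monitors
        if monitor["name"] == name and tag in monitor["tags"]
    ]
    if len(matches) > 1:
        raise ValueError(f"name/tag pair is not unique: {name!r}, {tag!r}")
    return matches[0] if matches else None


# Stand-in for monitors returned by a search; IDs and tags are made up.
monitors = [
    {"id": 101, "name": "Traffic Anomaly with AddToCartEvent",
     "tags": ["placeholder-tag"]},
    {"id": 102, "name": "Traffic Anomaly with AddToCartEvent",
     "tags": ["some-other-tag"]},
]
monitor_id = find_monitor_id(
    monitors, "Traffic Anomaly with AddToCartEvent", "placeholder-tag"
)
```

Because the pair is derived from the schema itself, no separate database of monitor IDs has to be kept in sync.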
After a while, a monitor may no longer be needed. To delete a monitor, you only need to remove the monitor annotation from the schema and register the schema again. After the registration, the monitor will be deleted in Datadog as well.
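Taken together, creating, updating, and deleting a monitor at registration time come down to comparing the schema's annotation with what already exists in Datadog. A minimal sketch of that decision, assuming a hypothetical plan_monitor_action function in which the annotation is a dictionary of parameters, or None when it has been removed from the schema:

```python
def plan_monitor_action(annotation, existing_monitor_id):
    """Decide what a registration should do with a schema's monitor.

    annotation: monitor parameters from the schema, or None if absent.
    existing_monitor_id: ID of the matching monitor, or None if not found.
    """
    if annotation is None:
        # Annotation removed: clean up the monitor if one still exists.
        return "delete" if existing_monitor_id is not None else "noop"
    # Annotation present: update in place, or create on first registration.
    return "update" if existing_monitor_id is not None else "create"


# The threshold parameter below is illustrative, not a real annotation field.
first_registration = plan_monitor_action({"threshold": 3}, None)
changed_parameters = plan_monitor_action({"threshold": 5}, 101)
annotation_removed = plan_monitor_action(None, 101)
```

Keeping the decision separate from the API calls makes it easy to test, and it keeps the schema as the single place that determines whether a monitor should exist.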
Although Auto-generated Event Monitors work great for creating predefined event monitors, they are not a silver bullet for every monitoring need. More complex monitoring requirements can still be handled through the Datadog UI or other approaches such as Terraform.
At Udemy, we are consistently working to be more data-driven in our decisions and we want to use our data to create more resilient and robust systems. Monitoring is one way of doing this and we hope this blog post clearly covered why event monitoring is so important, how we supply data to these monitors, and how to create monitors for eventing data easily with annotation-based Auto-generated Event Monitors.
Thank you for your time; we hope this blog post is useful for your projects. One final word: we encourage you to use automation for monitoring when it applies!
This work was made possible by the efforts of the Udemy Event Tracking Team. Additionally, huge thanks to the Udemy Customer Development Guild for coming up with the idea!