Let’s start with a common situation.

You’ve just created a new application in Node.js. The application uses a SQL database, Redis for caching, Kafka as a broker, and many other useful things.

You are ready to deploy your masterpiece to production, so you choose your hosting provider, set up a domain, and finally deploy the application. Everything works fine so far.

But after a few days, you start getting calls and emails from your clients. The application is getting slower and slower. You take a look at your infrastructure metrics and catch memory and CPU spikes, but when you look at APM you see no unusual spikes in requests; everything looks the same as yesterday. At this point, you are blind.

This hypothetical case sounds scary, but usually restarting the application fixes the problem and buys you some time to investigate what is actually happening.

We could avoid this situation, at least partially, if we had enabled profiling in production. But what the heck is profiling? Let’s get familiar with the definition:

Profiling, in a general sense, refers to the process of monitoring and analyzing how a computer program utilizes CPU and memory resources. It is a crucial technique for identifying performance bottlenecks, optimizing code, and enhancing the efficiency of software applications during the development and testing phases.

So in general, we want to track how our application uses memory and the processor, and which parts of the application have problems like:

  • Memory leaks
  • Processor abuse
  • Slow connections
  • I/O bottlenecks
  • Ineffective caching
  • Third-party library issues
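
To make this a bit more concrete, here is a minimal sketch of the raw signals a profiler builds on, using only Node.js built-ins (`process.memoryUsage()` and `process.cpuUsage()`). The `createSampler` helper and the field names it returns are made up for this example. A real profiler goes much further and attributes the usage to individual functions.

```javascript
// Minimal sketch: sample this process's own memory and CPU usage.
// createSampler and the returned field names are hypothetical, for illustration only.
function createSampler() {
  // Absolute CPU reading taken when the sampler is created.
  let lastCpu = process.cpuUsage();

  return function sample() {
    const mem = process.memoryUsage();
    // Passing the previous reading returns the CPU time (in microseconds) spent since then.
    const cpuDelta = process.cpuUsage(lastCpu);
    // Store a fresh absolute reading for the next call.
    lastCpu = process.cpuUsage();

    return {
      heapUsedMb: mem.heapUsed / 1024 / 1024,
      rssMb: mem.rss / 1024 / 1024,
      cpuUserMs: cpuDelta.user / 1000,
      cpuSystemMs: cpuDelta.system / 1000,
    };
  };
}

// Possible usage: log a sample every 10 seconds without keeping the process alive.
// const sample = createSampler();
// setInterval(() => console.log(sample()), 10000).unref();
```

Numbers like these can tell you that heap usage keeps growing, but not which function is responsible. That second part is what the tools below are for.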

Let’s see what kind of tools we can use in production to measure and profile the application.

Google Cloud Profiler

Cloud Profiler is a statistical, low-overhead profiler that continuously gathers CPU usage and memory-allocation information from your production applications. It attributes that information to the source code that generated it, helping you identify the parts of your application that are consuming the most resources, and otherwise illuminating your application’s performance characteristics.

Google Cloud Profiler has been a good friend of mine when it comes to finding what function or class has been naughty.

But as with any tool, it has some flaws, like showing you… too much.

Image: whole application heap

But at the end of the day, it does its job by helping you find resource usage in your own code and in third-party libraries.

Image: processor usage (wall time)

Image: memory usage (heap)

Image: ability to see top CPU and memory functions

Image: ability to compare profiles over time and across versions

It’s basic, but it works. You can see which function or third-party library is after your resources and then debug it locally (for example, with Clinic.js: https://clinicjs.org/).

Pros:

  • It’s cheap, so you can use it on as many instances as you want
  • Easy to install
  • You only need to install this as a package and include it in your main file

Cons:

  • UI is very laggy
  • Wall time is sometimes not useful without many filters
  • Lack of correlations with infrastructure
  • Captures a profile every 10 minutes
  • Higher memory usage, because it stores temporary profiles in application memory
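
As a rough illustration of the "install it as a package and include it in your main file" point above, the setup might look like the sketch below. It assumes `npm install @google-cloud/profiler` has been run and that the app runs somewhere with Google Cloud credentials available. The `SERVICE_NAME` and `SERVICE_VERSION` environment variables, the fallback values, and the two helper functions are placeholders invented for this example.

```javascript
// Sketch of enabling Cloud Profiler from the main file.
// buildProfilerConfig and startProfiler are hypothetical helpers;
// SERVICE_NAME / SERVICE_VERSION are assumed env variable names, not required by the library.
function buildProfilerConfig(env = process.env) {
  return {
    serviceContext: {
      // These labels are how the profiles get grouped in the Profiler UI.
      service: env.SERVICE_NAME || 'my-node-service',
      version: env.SERVICE_VERSION || '1.0.0',
    },
  };
}

async function startProfiler() {
  // Required lazily so this file still loads where the package is not installed.
  const profiler = require('@google-cloud/profiler');
  await profiler.start(buildProfilerConfig());
}

// In the main file, as early as possible:
// startProfiler().catch((err) => console.error('Profiler failed to start', err));
```

Setting a version is what makes it possible to compare profiles between releases later on.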

Link to Google Cloud Profiler: https://cloud.google.com/profiler/docs/about-profiler

Datadog

Datadog is a far bigger player when it comes to APM and infrastructure monitoring, as well as application profiling. To be honest, Datadog is currently my number one tool for profiling in the Node.js environment.

And there is a simple explanation for that: they understand how Node.js works and adjust the profiling to show you what you really need to see.

First, they let you see a timeline of wall time and heap usage, so you can spot that something is off at a glance.

Secondly, they let you see only your own code. Moreover, you can see at what point in time each profile was captured.

What I also find quite useful is that the UI is not laggy at all, even when you have hundreds of services.

Pros:

  • Smart and fast UI
  • Captures a profile every minute
  • Correlations with infrastructure, logs, etc.
  • Adjusted to Node.js needs, so you only see what’s important

Cons:

  • Kinda expensive: 48 USD per host plus 18 USD for infrastructure monitoring
  • You need to install the agent on the infrastructure and then install the package in your code in order to communicate with the agent
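
For the code side of that second point, a minimal sketch could look like this. It assumes the Datadog Agent is already running on the host and that `npm install dd-trace` has been run. The `buildTracerOptions` and `startDatadogProfiling` helpers and the fallback values are hypothetical, so check the linked documentation for the options that fit your setup.

```javascript
// Sketch of enabling the Datadog profiler through the dd-trace package.
// buildTracerOptions and startDatadogProfiling are hypothetical helpers for this example.
function buildTracerOptions(env = process.env) {
  return {
    profiling: true, // turns on the continuous profiler
    env: env.DD_ENV || 'production',
    service: env.DD_SERVICE || 'my-node-service',
    version: env.DD_VERSION || '1.0.0',
  };
}

function startDatadogProfiling() {
  // Required lazily so this file still loads where the package is not installed.
  const tracer = require('dd-trace');
  // init() should run before the rest of the application is imported.
  return tracer.init(buildTracerOptions());
}

// At the very top of the main file:
// startDatadogProfiling();
```

The package sends the collected profiles to the local agent, which is why both pieces have to be installed.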

Link to Datadog profiler: https://docs.datadoghq.com/profiler/

So as you can see, these tools can drastically reduce the time you spend debugging and make your application and your clients happy again.