Observable Node.js Applications

September 08, 2016 - by Wyatt Lyon Preul

Why observability matters

Getting an application into production is only one part of application development. After an application is in production, understanding, and diagnosing issues can be trying, to say the least. Fortunately, there are numerous tools to help the production debugging cause. For example, in SmartOS there are many process tools that will do everything from showing the arguments a process was executed with to displaying memory address space for a process. Not to mention DTrace, which is critical to have in your satchel of tools if you are running anything important. It cannot be overstated how useful DTrace is to have at your disposal, even if you never end up using it.

Even though all of the above tools and much more exist, they can only do so much. There is still a need for you, the developer, to make your application observable. A few years ago, the recommendation was simply to log operations and application specific details. This is still necessary, however, with the further distributed nature of applications, more effort is required to log and trace the interaction of the application services.

The architecture of a modern Node.js application

As a result of the increased popularity of microservices, newer applications tend to be decomposed into tiny services that are distributed in nature. Previously, you would make a request to a server and the business logic was contained in the single process that received the request. There may have been a few requests made to upstream services and data stores, but for the most part, the application was contained within a single process. As a result, it was more trivial to reason about bugs as they were contained in a single, monolithic application. It also meant that logging data could be easily correlated to incoming requests. Below is an illustration of one of these applications, which has its own datastore and is located behind a load balancer. The application logs can be aggregated together simply since the logs are uniform in their data structures.

legacy app

While the above application architecture is still prevalent, there is a definite shift towards a decomposed application that's distributed in nature. Since the shift is more recent, we will refer to this microservices architecture as a modern application architecture. Unfortunately, spreading an application into tiny services can make it more difficult to trace an incoming request throughout the entirety of the system. This is a result of a single incoming request now relying on many more service calls than would be the case in a monolithic application. This issue is often exacerbated when the various services don't log to an aggregated store or they don't have consistent data structures. In these cases, trying to identify the microservice that is the culprit to a slow response or flawed business logic can be difficult. However, once identified, the microservice is often easier to debug as it has a smaller surface area. Fortunately, implementing appropriate observability practices in a modern application will often make it easier to debug than its monolithic counterpart. The figure below is an example of an application comprised of microservices.

modern app

How to make applications observable

As we saw in the example of a modern application architecture, there is a definite need to be able to correlate a request to its subsequent upstream service requests. One common way to solve this problem is to generate a correlation identifier and to forward this identifier to upstream requests. The correlation identifier will be logged wherever request logs are stored. From this information, it's possible to correlate an external request to subsequent upstream service requests.

Seneca

In Seneca, a Node.js microservices toolkit, the concept of a correlation identifier exists. Seneca uses the actor model and has a concept of messages and transactions, as a result the correlation identifier is a property of the message object named tx$. This identifier is forwarded to upstream Seneca services and helps trace an originating action to all of the subsequent upstream actions that occur.

Having an identifier that is unique across an entire system of services is a powerful concept. As a result, you can perform additional logging for function calls that include the request identifier. This makes it possible to identify the actions and function executions that resulted from a single service call. With this information its usually trivial to identify and reproduce slow executing requests.

Logging timing information for data store executions along with the correlation identifier is often useful for diagnosing issues. In the example below, a Seneca action saves a user to a data store and logs the execution time along with the tx$ identifier.

const Seneca = require('seneca');

const seneca = Seneca();
seneca.add({ role: 'user', cmd: 'save' }, (msg, reply) => {
  console.time(`User Save (${msg.tx$})`);
  // Perform save
  console.timeEnd(`User Save (${msg.tx$})`);
  reply();
});

Wreck

Wreck is an HTTP request module for Node.js that is lightweight and intended for service to service requests. In order to accommodate the concept of the correlation identifier, wreck supports global request and response events (disclosure: I'm the creator and lead maintainer of wreck). With these global events, you can intercept requests and log them as well as forward or generate a correlation identifier header that is also forwarded by other services that utilize wreck. The events are available on the Node.js process module with the Symbol named 'wreck'. Below is an example of what the generation and forwarding of a tracking header could look like.

const symbol = Symbol.for('wreck');

process[symbol].on('request', (uri, options) => {
  options.headers.CorrelationID = options.headers.CorrelationID || genId();
});

Both the request and response events are useful to handle for logging purposes as well. Logging the CorrelationID from the example above along with the URI and timing information are beneficial for observing the flow of a request through a system of services.

OpenTracing

Even though the previous examples in Seneca and wreck help with observing a request as it flows through a system of services, they don't solve the issue of what the logging data structures should be. This is a common issue that reaches beyond the previous examples and is one that often plagues any sizeable organization. Even if an organization mandates that all request logs contain a correlation identifier, there are other issues that aren't solved by simply having an ID and timing information. For example, how would you determine what upstream service calls were the result of a downstream request? Having a correlation identifier and the time of the request isn't always enough to say with certainty that a request occurred because of a previous parent request. As a result of this common problem and similar problems, OpenTracing was created.

There are OpenTracing libraries for common languages and platforms to make it easier to adopt the specification across an organization. There is a supported Node.js and browser OpenTracing module that is freely available. The OpenTracing specification abstracts requests and work done in an application into a concept named spans that contain relationship data about parents and children as well as a context associated with the work. Again, this is intended to be an abstraction on top of the tracing data that concerns a modern application. Because of this need to be generic, the language involved in tracing services is itself generic.

Once a modern application implements the OpenTracing specification it's possible to have a time series graph of service calls throughout a system and how they are correlated. Visualizing the OpenTracing data is still something thats in the early stages. One example of the reporting that's available to OpenTracing applications is Zipkin, which is shown in the screenshot below. Even if you decide against having a visual representation of the application service calls, having consistent data structures to describe these calls means that they're easier to trace and debug. This point is even more obvious if you have ever tried to debug an issue in a microservices application that has inconsistent logging that's spread out across various logging stores.

Zipkin screenshot

Summary

Decomposing a monolith into microservices is only going to become more prevalent in the future. As this change in design continues to become more popular, my hope is that appropriate time will be paid to making these services more traceable. Fortunately, Node.js services have many modules that help to make it easier to start relating these various service calls to one another. Furthermore, this relationship building between services and the requests they make, whether over HTTP or another protocol, can be made simpler with the adoption of a specification like OpenTracing.