Monitoring Monolith With Datadog: How to Avoid the Bystander Effect

Written by feddena | Published 2024/07/08
Tech Story Tags: datadog | monolith | java | kotlin | hackernoon-top-story | what-is-the-bystander-effect | logging-and-monitoring | monolith-applications-guide

TL;DR: In large monolithic applications, error tracking and monitoring often become ineffective due to a lack of clear ownership. This guide addresses the issue by proposing a structured approach to assigning accountability through domain annotations, introduced in the monolith-domain-splitter library.

Setting up effective monitoring for a large monolith maintained by multiple teams can be challenging. Without clear ownership, error tracking becomes generic and often ignored. One solution is to have on-call engineers identify which team should respond to each monitoring alarm. However, a more efficient approach is to include domain and team information in every log entry and Datadog span.


Understanding Domain Annotations

To keep track of which team is responsible for various parts of our application, we use a system called Domain Annotations. Domain Annotations label every part of your application's code, clearly indicating who is accountable for what. This provides clear organization and accountability in managing responsibilities.

The Benefits of Using Domain Annotations

Domain annotations provide a clear and organized method to track team responsibilities within a monolithic application. By tagging parts of your code with domain annotations, you can:

  • Simplify Log and Trace Management: Filter logs and traces based on specific criteria, such as team responsibility, enabling quick identification and resolution of issues.
  • Maintain Accurate Tracking: Adapt seamlessly to changes in team responsibilities, as annotations are tied to the domain rather than team names.
  • Enhance Accountability: Clearly define which team is responsible for each domain, improving organization and targeted monitoring.
  • Improve Monitoring Efficiency: Facilitate better monitoring practices by providing precise accountability and enhancing overall efficiency.

Domain Annotations Processing

To ensure efficient monitoring and traceability, each web request is tagged with the appropriate domain information. This is achieved through the collaboration of several components: DomainProvider, DomainSpanService, DomainMdcProvider, and DomainHandlerInterceptor.

Here’s a high-level overview of the process depicted in the following diagram:

Explanation of Key Components

  • DomainProvider: Identifies the domain associated with specific handler methods or beans. It helps in finding domain annotations in AOP (Aspect-Oriented Programming) and MVC (Model-View-Controller) calls.
  • DomainSpanService: Adds domain tags to spans, which are units of work in tracing systems. This service ensures that each span is tagged with the appropriate domain information.
  • DomainMdcProvider: Manages domain tags within the MDC (Mapped Diagnostic Context), a feature of logging frameworks that allows tagging log entries with contextual information.
  • DomainHandlerInterceptor: Intercepts web requests, ensuring that each request is tagged with the appropriate domain information for better monitoring and traceability.

The detailed implementation of these components will be encapsulated in a shared library, providing a reusable solution for tagging and monitoring web requests in large monolithic applications.
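To make this flow more concrete, here is a minimal sketch of what the interceptor piece could look like, assuming Spring MVC (Spring Boot 3 with the jakarta servlet API) and the OpenTracing bridge exposed by dd-trace-java. The class and tag names mirror the components above, and the @Domain annotation and DomainValue enum are the ones defined at the end of this guide, but the body below is illustrative rather than the library's actual code:

import io.opentracing.util.GlobalTracer
import jakarta.servlet.http.HttpServletRequest
import jakarta.servlet.http.HttpServletResponse
import org.slf4j.MDC
import org.springframework.web.method.HandlerMethod
import org.springframework.web.servlet.HandlerInterceptor

// Illustrative interceptor: resolves the @Domain annotation on the handler method
// (or its declaring class) and propagates it to logs and the active Datadog span.
class DomainHandlerInterceptor : HandlerInterceptor {

    override fun preHandle(
        request: HttpServletRequest,
        response: HttpServletResponse,
        handler: Any,
    ): Boolean {
        if (handler !is HandlerMethod) return true

        // Method-level annotations take priority over class-level ones.
        val domain = handler.getMethodAnnotation(Domain::class.java)
            ?: handler.beanType.getAnnotation(Domain::class.java)
            ?: return true

        // Tag every log entry written while this request is being handled.
        MDC.put("domain", domain.value.name)
        MDC.put("team", domain.value.team.name)

        // Tag the active span so traces can be filtered by domain and team.
        GlobalTracer.get().activeSpan()?.apply {
            setTag("domain", domain.value.name)
            setTag("team", domain.value.team.name)
        }
        return true
    }

    override fun afterCompletion(
        request: HttpServletRequest,
        response: HttpServletResponse,
        handler: Any,
        ex: Exception?,
    ) {
        // Clean up so the tags don't leak onto the next request handled by this thread.
        MDC.remove("domain")
        MDC.remove("team")
    }
}

In a real setup, you would register this interceptor through a WebMvcConfigurer, and an AOP aspect would cover non-MVC entry points; both are exactly the kind of detail the shared library encapsulates.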

Sorting Out Who Owns What Code

Defining ownership at the class level is straightforward with domain annotations. By applying top-level annotations to main classes, ownership propagates down to all detailed resources within those classes. Each team can label classes that they own with the appropriate domain annotations, ensuring clarity and accountability without the need to mark every single method.

In cases where multiple teams own code in one class and immediate refactoring isn't practical, you can mark individual methods with different domain annotations; method-level annotations take priority over class-level ones. This allows specific methods to be assigned to different teams, providing flexibility without complicating the overall structure.
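For example, a class owned by one team with a single method handed over to another team could look like this (the class name is made up for illustration; the DomainValue entries are the ones defined in the step-by-step section below):

// Everything in this class defaults to the payment domain (Team B).
@Domain(DomainValue.PAYMENT_PROCESSING)
class BillingService {

    // Inherits PAYMENT_PROCESSING ownership from the class-level annotation.
    fun createInvoice() {
        // ...
    }

    // The method-level annotation takes priority: this method belongs to the
    // notifications domain even though it lives in a payment-owned class.
    @Domain(DomainValue.NOTIFICATIONS)
    fun sendInvoiceReminder() {
        // ...
    }
}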

Overcoming Cases Not Supported by Annotations

While domain annotations are incredibly useful, there are rare cases where they can't be used. For instance, we encountered issues with Quartz job creation, which did not function seamlessly with domain annotations due to a clash between Quartz's AOP logic and the AOP logic used for domain annotations.

For jobs and processes that cannot be annotated directly, we used the DomainTagsService directly in the job implementations. This approach allowed us to manually add domain tags within the job's execution logic.

Here's an example of how we integrated DomainTagsService into a Quartz job:

final override fun execute(context: JobExecutionContext) {
    // Wrap the job body so every span and log entry produced inside it
    // carries this job's domain and team tags.
    domainTagsService.invoke(domain) {
        withLoggedExecutionDetails(context, ::doExecute)
    }
}
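The real DomainTagsService ships with the shared library. As a rough sketch, and assuming the slf4j MDC plus the OpenTracing bridge from dd-trace-java, a wrapper of this kind could look roughly like the following:

import io.opentracing.util.GlobalTracer
import org.slf4j.MDC

// Illustrative only: applies domain and team tags for the duration of a block
// and restores the previous MDC state afterwards.
class DomainTagsService {

    fun <T> invoke(domain: DomainValue, block: () -> T): T {
        val previousDomain = MDC.get("domain")
        val previousTeam = MDC.get("team")

        MDC.put("domain", domain.name)
        MDC.put("team", domain.team.name)
        GlobalTracer.get().activeSpan()?.apply {
            setTag("domain", domain.name)
            setTag("team", domain.team.name)
        }

        return try {
            block()
        } finally {
            // Put back whatever was in the MDC before the block ran.
            if (previousDomain != null) MDC.put("domain", previousDomain) else MDC.remove("domain")
            if (previousTeam != null) MDC.put("team", previousTeam) else MDC.remove("team")
        }
    }
}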

Improve Monitoring and Visibility with Artificial Services

While having a separate service for each team offers significant advantages in monitoring and ownership, splitting a monolith requires substantial cost and effort and can bring additional ongoing development expenses. Since build times can already be improved by splitting the monolith into Gradle modules, maintaining a monorepo might be the most efficient solution in many cases.

Introduction of Artificial Services

To simplify monitoring each team's activities in Datadog, you can assign artificial service names to the spans of different teams. This gives every team its own dedicated section in Datadog's monitoring tools. Artificial service names can become confusing if you already have many services to manage, but with a limited number of backend services the approach stays manageable. Adding a prefix to these artificial service names helps keep your Datadog setup organized and makes it easier to distinguish between teams and their responsibilities.
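As a sketch of how a span can be relabeled, the helper below sets Datadog's service.name tag on the active span through the OpenTracing bridge. The helper itself and the prefix handling are illustrative rather than the library's actual DomainSpanService; the konsus- prefix simply mirrors the service names shown later in this post:

import datadog.trace.api.DDTags
import io.opentracing.util.GlobalTracer

// Illustrative helper: gives the active span a team-specific service name so each
// team gets its own entry in Datadog's APM service list.
fun tagSpanWithArtificialService(domain: DomainValue) {
    GlobalTracer.get().activeSpan()?.apply {
        // DDTags.SERVICE_NAME ("service.name") overrides the service Datadog reports for this span.
        setTag(DDTags.SERVICE_NAME, "konsus-" + domain.name.lowercase().replace('_', '-'))
        setTag("domain", domain.name)
        setTag("team", domain.team.name)
    }
}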


Why Not Use Artificial Services for Logs?

Using artificial service names for logs can create confusion as the same log entry might appear under different services.

For example, consider two endpoints that use the same authentication service. If the endpoints are annotated with different domains, the authentication logic will produce logs under different artificial services. That makes log exploration confusing, because the same log entry appears under multiple service names. To avoid this issue, it's better to apply artificial service names only to spans, which are aggregated together in traces, so there is far less room for confusion.


Here is a visual representation of this problem:

Using Artificial Services in Monitoring and Dashboards

Using artificial services enables you not only to work with APM traces but also to filter by service in Datadog metrics, which are retained for an extended period, allowing you to track changes over a long time frame.

Example of Monitor

Below is a screenshot of a monitor in Datadog that uses the artificial service name konsus-assets in the query:
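The screenshot itself isn't reproduced here. Purely as an illustration, a monitor query of roughly this shape could alert on errors for that service; the metric name assumes the standard Java servlet APM integration, so treat the exact query as an assumption rather than the one from the original monitor:

sum(last_10m):sum:trace.servlet.request.errors{service:konsus-assets}.as_count() > 10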

Example of Dashboard

Below is a screenshot of a dashboard in Datadog that uses the artificial service name konsus-assets in the filter:

By incorporating artificial services into your monitoring strategy, you can enhance the visibility and accountability of each team's activities within a monolithic application. This approach simplifies creating and maintaining team-specific monitors and dashboards, leading to more effective and better-organized monitoring in Datadog.


Wrapping Up

Domain annotations provide a straightforward approach to simplifying the monitoring of monolithic applications in Datadog. By implementing this strategy, you can enhance the manageability of logs, spans, and metrics, transforming your monitoring setup into a tool tailored to specific teams. This improves accountability and organization and facilitates more effective and efficient troubleshooting and performance analysis across your application.

Key Takeaways

  1. Enhanced Ownership and Accountability: By annotating parts of your code with domain annotations, you can clearly define which team is responsible for each domain. This facilitates better organization and targeted monitoring.
  2. Improved Log and Trace Management: Domain annotations allow you to filter both logs and traces based on specific criteria, such as team responsibility, enabling quick identification and resolution of issues.
  3. Flexibility with Artificial Services: Using artificial service names for spans (not logs) ensures that logs remain clear and traceable to their true origins, avoiding confusion.
  4. Overcoming Integration Challenges: For cases where annotations cannot be directly applied, such as with certain job execution frameworks like Quartz, using services like DomainTagsService directly in the job implementations ensures that domain-specific monitoring can still be maintained.

Step-by-Step Approach to Using Domain Annotations:

  1. Define Domains and Teams


    Create enums representing different domains and teams in your application:

    • @Domain is an annotation that can be applied to classes or functions, marking them with a specific domain value.
    • DomainValue is an enum representing different domains, each associated with a team.
    • Team is an enum representing the various teams working on the application.
    @Retention(AnnotationRetention.RUNTIME)
    @Target(AnnotationTarget.CLASS, AnnotationTarget.FUNCTION)
    annotation class Domain(val value: DomainValue)
    
    enum class DomainValue(val team: Team) {
        USER_MANAGEMENT(Team.TEAM_A),
        PAYMENT_PROCESSING(Team.TEAM_B),
        NOTIFICATIONS(Team.TEAM_C)
    }
    
    enum class Team {
        TEAM_A,
        TEAM_B,
        TEAM_C
    }
    
  2. Annotate Classes (and Methods if necessary)

    @Domain(DomainValue.USER_MANAGEMENT)
    class UserService {
        @Domain(DomainValue.PAYMENT_PROCESSING)
        fun processPayment() { /* ... */ }
    }
    
  3. Handle Unsupported Cases

    For cases that cannot be annotated directly, use DomainTagsService to wrap the logic manually:

    fun executeNotSupportedByAnnotationsLogic() {
        domainTagsService.invoke(domain) {
            executeLogic()
        }
    }
    
  4. Monitor with Datadog

    Use artificial service names to filter monitors, dashboards, and APM traces.

By following these steps, you can effectively implement domain annotations in your monolithic application, ensuring improved monitoring, accountability, and overall efficiency.


Thanks for reading the post!


Written by feddena | BE engineer, Kotlin/Java, computer vision
Published by HackerNoon on 2024/07/08