Suppose you have an application and you have an APM platform like Datadog or Application Insights that you use to monitor the performance and the inner workings of your application.

Sometimes, there are application events that you want to record and check later. By analyzing these events, you can spot bugs in your application, detect malicious users, or understand that some of your customers are using the application in the wrong way.

Example scenario: you have a message queue in your backend, and sometimes there are items in the queue that cannot be processed because some business validation rules fail (e.g.: mandatory data are missing in the work item).

My question is: where do you usually store these kinds of application events? Please note that I'm not referring to domain-level events that you raise and/or handle in the application code to model the business domain. I'm referring to application events used by operations teams to monitor the application's behavior and usage patterns and to ensure that the system is behaving correctly.

I have seen different approaches to solve this problem:

  • Use a structured logging system and store this information in the logs (e.g. Serilog in the .NET space).
  • Store events at the APM platform level. For example, Application Insights has a built-in concept of custom events with associated metadata and metrics (see here for an example in the C# SDK).
  • Store these events in the application database using a custom table. This makes sure that every event is actually stored, and it avoids sampling issues and events lost to log and trace retention policies.
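
To make the first option concrete, here is a minimal sketch of a structured application event written to the logs. It is in Python with only the standard library rather than Serilog, and the logger name `app.events`, the event name `WorkItemRejected`, and the field names are all made up for illustration; a real setup would ship these lines to whatever log backend you use.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as a single JSON object so fields stay queryable."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # "event" is a hypothetical attribute attached through `extra=`.
            "event": getattr(record, "event", None),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes one JSON line per record to `stream`."""
    logger = logging.getLogger("app.events")  # hypothetical logger name
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    logger.propagate = False
    logger.setLevel(logging.INFO)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


def record_rejected_item(logger, item_id, missing_fields):
    """Log a rejected work item as a structured event (names are made up)."""
    logger.warning(
        "work item rejected",
        extra={
            "event": {
                "name": "WorkItemRejected",
                "item_id": item_id,
                "missing_fields": missing_fields,
            }
        },
    )


# Usage: write to an in-memory buffer and print the JSON line.
buffer = io.StringIO()
record_rejected_item(build_logger(buffer), "42", ["customer_id"])
print(buffer.getvalue().strip())
```

Because the event is a nested object rather than a formatted string, the operations team can filter on `event.name` or `event.missing_fields` instead of parsing message text.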

Each of these approaches has pros and cons. I'm just looking for guidance and advice from people with experience in this area. My favorite approach is storing this information as APM-level events. The problem is that not all APM systems share the same concept of an event or the same underlying data model (for example, the concept of an event in Application Insights and in Datadog is quite different).

  • Questions asking for general guidance tend to get closed as needing focus. The trouble with questions like this is we don't have a well defined problem to solve. Answers become a giant list of possible solutions, which doesn't fit the Q&A format of this community.
    Commented Mar 17 at 21:39
  • In your example, I would write the program so that it feeds the error back to the user and gets them to resubmit. No need to store anything?
    – Ewan
    Commented Mar 17 at 21:52

1 Answer

Based on my experience, the answer is the usual "it depends". Even given the specific scenario you described, it depends on the exact scope and on what you want to monitor and troubleshoot. Starting out by monitoring "everything" is quite demanding.

  • Do you want to know how frequently such events occur? Let's go with metrics.
  • Do you want to know the exact reason such events couldn't be processed? Traces enriched with custom tags.
  • Do you want to add domain information? Structured logs. Information about the transport of your data is available out of the box with traces; adding logs for that kind of information is expensive in terms of cost, resources consumed, and maintenance.
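
As a rough illustration of how the first two signals differ for the rejected-work-item scenario, here is a sketch in Python. `InMemoryTelemetry` is a hypothetical stand-in for a real metrics and tracing client, and the metric name, tag names, and required fields are assumptions, not anything from a specific APM SDK.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class InMemoryTelemetry:
    """Hypothetical stand-in for a real metrics and tracing client."""

    counters: Counter = field(default_factory=Counter)
    span_tags: list = field(default_factory=list)

    def count(self, name: str, **tags: str) -> None:
        # Metric: answers "how often does this happen?".
        # Tag values should stay low-cardinality (a reason, not an item id).
        self.counters[(name, tuple(sorted(tags.items())))] += 1

    def tag_span(self, **tags: str) -> None:
        # Trace tags: answer "why did this particular item fail?".
        self.span_tags.append(tags)


def find_missing(item: dict, required: tuple) -> list:
    """Return the required fields that are absent or empty in `item`."""
    return [name for name in required if item.get(name) in (None, "")]


def process(item: dict, telemetry: InMemoryTelemetry) -> bool:
    """Validate one work item and record telemetry when it is rejected."""
    missing = find_missing(item, ("id", "customer_id"))  # assumed rule
    if not missing:
        return True
    telemetry.count("work_items.rejected", reason="missing_fields")
    telemetry.tag_span(item_id=str(item.get("id")), missing_fields=",".join(missing))
    return False
```

The counter only carries the rejection reason, so it stays cheap to aggregate, while the per-item detail lives on the trace where it is needed for troubleshooting.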

Sampling is indeed an issue, though it is less a limitation of the tool than a matter of how much you invest in it. With Application Insights and Datadog you can simply pay more and you will not run into it. Most of the time, however, it is better to reduce the amount of data stored and keep only the telemetry you actually need. Selecting what to keep can still be hard, depending on the system you are working on. An alternative is to run your own monitoring platform instead of relying on external products: Prometheus, Grafana, Tempo, Loki, Elastic, Kibana, Logstash. I would avoid building a custom solution on top of generic tools, or I would do it only if I did not plan to invest in it or expand it. Either way, you have to invest time, money, or resources somewhere.
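
One way to "save only what you need" without losing the events you care about is to always keep a short allow-list of event names and sample everything else. The sketch below is only an assumption about how that could look; the function name, the event names, and the CRC-based bucketing are mine, not how Application Insights or Datadog implement sampling.

```python
import zlib


def should_keep(event_name: str, trace_id: str, sample_rate: float,
                always_keep: frozenset) -> bool:
    """Decide whether to store a telemetry item.

    Events in `always_keep` are never dropped. Everything else is sampled
    deterministically by trace id, so all items of one trace share a fate.
    """
    if event_name in always_keep:
        return True
    bucket = zlib.crc32(trace_id.encode("utf-8")) % 10_000
    return bucket < sample_rate * 10_000


# Usage: keep every rejection, sample routine traffic at 10%.
important = frozenset({"WorkItemRejected"})  # hypothetical event name
keep_rejection = should_keep("WorkItemRejected", "trace-123", 0.10, important)
keep_routine = should_keep("RequestHandled", "trace-123", 0.10, important)
```

Hashing the trace id instead of calling a random generator keeps the decision reproducible, which helps when several services have to agree on which traces survive.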

Once you define what you need to monitor in your flows and how, the rest will follow. In my personal opinion, start small, with just metrics or traces. Once people start using the monitoring platform, more requests will come. As with a product driven by customer requests, it is a feature > user > feedback loop. Don't expect it to be a time-boxed activity; it is a continuous process.

  • Lots of good advice, considering the somewhat broad focus of the question. The main questions that you need to ask yourself as an organization considering some solution: what do you want to achieve with it, is the solution suitable for the problem, are the costs reasonable when weighed against the possible gain? Especially the last question may often lead to "may be nice to have, but isn't really carrying its own weight".
    Commented Mar 18 at 8:54
