How to choose IT operations metrics that deliver real value
The sheer volume of data collected in today's IT environments can be overwhelming. But by focusing on the user experience, IT ops teams can pinpoint the metrics that matter most.
Many IT organizations focus on activity metrics, such as total messages sent and received between a user and each application component, or total data capacity used. But this information is of limited value because it doesn't relate well to the experiences of users themselves.
The purpose of IT is to deliver a satisfying experience in which the user -- whether a worker, customer, prospect or partner -- plays the role that the business expects. The user's experience is the perceived quality of the interaction, including every technical element or service that contributes to it.
Given this, it's not surprising that optimizing IT operations means optimizing the quality of experience (QoE). This, in turn, means measuring the factors that influence user experience.
Quality of experience vs. quality of service
The original definition of QoE grew out of the networking concept of quality of service (QoS).
QoS measures factors such as delivered information rate, packet loss and delay, including any jitter or other variability in delay. Although QoS metrics are still important in measuring QoE, they're less important than application-related metrics, which can be harder to identify and capture.
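As a rough illustration of the QoS side, the sketch below summarizes loss, delay and jitter for one measurement window. The function name and inputs are hypothetical, and jitter is simplified to the mean absolute difference between consecutive delays, not the smoothed estimator that some protocols define.

```python
from statistics import mean

def qos_summary(sent_count, delays_ms):
    """Summarize loss, mean delay and jitter for one measurement window.

    sent_count: number of packets sent.
    delays_ms: one-way delay of each packet received, in arrival order.
    """
    received = len(delays_ms)
    loss_rate = (sent_count - received) / sent_count if sent_count else 0.0
    mean_delay = mean(delays_ms) if delays_ms else None
    # Simplified jitter: mean absolute change between consecutive delays.
    diffs = [abs(b - a) for a, b in zip(delays_ms, delays_ms[1:])]
    jitter = mean(diffs) if diffs else 0.0
    return {"loss_rate": loss_rate, "mean_delay_ms": mean_delay, "jitter_ms": jitter}

# Five packets sent, four received with these delays.
summary = qos_summary(sent_count=5, delays_ms=[20, 24, 22, 30])
```

Numbers like these describe the network path only. They say nothing about how long the application took to do its work, which is why they are only part of the QoE picture.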
Visibility and observability are important when gathering metrics. But much of the data available in IT and networking environments has little to no utility in assessing the actual quality of a user's experience with an application.
The central guiding principle of IT ops metrics is that applications must justify their business case by supporting the business. In turn, the teams that use the applications set the quality requirements, and QoE metrics are key in objectively assessing whether those requirements are met.
Start by capturing user interactions
The first step in understanding QoE is to catalog user interactions. To successfully optimize IT operations, it's essential to identify each time a user interacts with an application.
When a user wants to accomplish a task, they embark on a series of interactions that get them what they need. For example, a user might log in, make an inquiry, perform some secondary inquiries and updates, and then log off.
Once the user initiates this series of actions, they're tied up until it's complete. Anything that causes delays, confusion or errors negatively affects their experience -- as well as the business justification for the IT resources they use.
Choose metrics that illustrate the user's experience
Because improving user interactions is the foundation of optimizing IT operations, organizations should start by capturing the metrics that describe those interactions at the user level.
From the time a user attempts to log in, how long is it until they see the screen asking for an ID and password? After providing their login credentials, how long does it take to get access?
Capture a time series for each of these data points, ideally as close as possible to the user's point of connection. In some cases, however, the first events visible to IT are the receipt of a message from the user and the dispatch of a response. In those cases, record the time of each, along with the number of messages in each category.
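The login questions above reduce to differences between timestamps. Here is a minimal sketch, assuming a hypothetical event log in which each entry pairs an event name with an ISO 8601 timestamp. The event names are invented for illustration.

```python
from datetime import datetime

# Hypothetical event log for one login: (event name, ISO 8601 timestamp).
events = [
    ("login_requested", "2024-01-01T09:00:00.000"),
    ("login_form_shown", "2024-01-01T09:00:01.200"),
    ("credentials_submitted", "2024-01-01T09:00:09.200"),
    ("access_granted", "2024-01-01T09:00:11.700"),
]

def elapsed_seconds(events, start, end):
    """Seconds between the first occurrence of two named events."""
    times = {}
    for name, stamp in events:
        # Keep only the first timestamp seen for each event name.
        times.setdefault(name, datetime.fromisoformat(stamp))
    return (times[end] - times[start]).total_seconds()

time_to_form = elapsed_seconds(events, "login_requested", "login_form_shown")
time_to_access = elapsed_seconds(events, "credentials_submitted", "access_granted")
```

The eight seconds the user spends typing credentials is deliberately excluded. It is the user's time, not the system's, and mixing the two would blur the metric.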
When choosing specific metrics to collect, aim to measure application and network performance between the user and application. Think of the interaction as a chart of steps, with each step representing measurable work done to support the user.
Useful IT operations metrics characterize how each step proceeds in terms of volume -- for example, message counts and data volumes -- and process time. The goal is to track how much time each interaction requires and identify any points where information is lost or corrupted.
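One way to picture step-level metrics is a small summary that tracks volume, process time and unanswered requests for each step. The record layout below is an assumption made for illustration; real sources rarely arrive this tidy.

```python
from collections import defaultdict

# Hypothetical records: (step, request_id, direction, bytes, timestamp in seconds).
records = [
    ("inquiry", "r1", "request", 200, 0.00),
    ("inquiry", "r1", "response", 1500, 0.40),
    ("update", "r2", "request", 900, 1.00),
    ("update", "r2", "response", 100, 1.90),
    ("update", "r3", "request", 900, 2.00),  # no response was recorded
]

def summarize_steps(records):
    """Per step: message count, bytes, total process time and unanswered requests."""
    stats = defaultdict(
        lambda: {"messages": 0, "bytes": 0, "process_s": 0.0, "unanswered": 0}
    )
    pending = {}  # request_id -> (step, request timestamp)
    for step, request_id, direction, size, ts in records:
        stats[step]["messages"] += 1
        stats[step]["bytes"] += size
        if direction == "request":
            pending[request_id] = (step, ts)
        elif request_id in pending:
            _, started = pending.pop(request_id)
            stats[step]["process_s"] += ts - started
    # Requests still pending never got a response: a possible loss point.
    for step, _ in pending.values():
        stats[step]["unanswered"] += 1
    return dict(stats)

step_stats = summarize_steps(records)
```

In this sample, the update step shows one unanswered request, which is the kind of point where information might have been lost.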
To collect the required data, the next step is to trace the workflows associated with the interactions. Consider an application as a series of components linked by workflows at the logical level and by network connections at the physical level. Because each interaction creates a series of workflows, measuring QoE means gathering workflow data.
Where possible, collect metrics at both the network and application levels to enable correlation. Network-level metrics should include message counts, utilization, and loss and delay information, at as fine a granularity as possible. Application-level data should also include message counts to facilitate correlation, but should otherwise focus on process times.
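Message counts are useful for correlation because both levels can report them for the same interval. The sketch below flags intervals where the two disagree. The interval labels, counts and function name are hypothetical.

```python
def count_mismatches(network_counts, app_counts, tolerance=0):
    """Return intervals where network and application message counts
    differ by more than the given tolerance."""
    mismatches = {}
    for interval in sorted(set(network_counts) | set(app_counts)):
        net = network_counts.get(interval, 0)
        app = app_counts.get(interval, 0)
        if abs(net - app) > tolerance:
            mismatches[interval] = {"network": net, "application": app}
    return mismatches

# Hypothetical per-minute message counts from each level.
network_counts = {"09:00": 120, "09:01": 118, "09:02": 125}
app_counts = {"09:00": 120, "09:01": 110, "09:02": 125}

mismatches = count_mismatches(network_counts, app_counts)
```

A gap such as the one at 09:01 does not identify a cause on its own, but it narrows where to look in the loss and delay data for that interval.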
Gather and analyze IT operations metrics
Collecting the metrics needed for QoE assessment can be a major challenge.
To gather metrics, IT teams can use network management systems, public cloud platforms, platform software tools -- including container management -- and logs of all types. However, variations in APIs and data formats can make correlating data difficult.
After tracing workflows and identifying associated components, identify available APIs, data formats and any tools available to harmonize the information into a common database for analysis. Expect this process to involve a lot of time and testing.
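Harmonizing usually comes down to mapping each source onto one record shape with consistent timestamps and units. The following is a minimal sketch under assumed formats: one source that reports epoch seconds and milliseconds, and another that reports ISO 8601 timestamps and seconds. All field names are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class MetricRecord:
    source: str
    name: str
    timestamp: datetime  # always stored in UTC
    value_ms: float      # always stored in milliseconds

def from_network_tool(row):
    """Hypothetical source A: epoch seconds, value already in milliseconds."""
    ts = datetime.fromtimestamp(row["epoch"], tz=timezone.utc)
    return MetricRecord("network", row["metric"], ts, float(row["ms"]))

def from_app_log(row):
    """Hypothetical source B: ISO 8601 time with UTC offset, value in seconds."""
    ts = datetime.fromisoformat(row["time"]).astimezone(timezone.utc)
    return MetricRecord("application", row["name"], ts, float(row["seconds"]) * 1000.0)

records = [
    from_app_log({"name": "process_time", "time": "2024-01-01T10:00:01+01:00", "seconds": 0.25}),
    from_network_tool({"metric": "delay", "epoch": 1704099600, "ms": 42}),
]
# With one clock and one unit, records from both sources sort together.
records.sort(key=lambda r: r.timestamp)
```

Each new source then needs only its own small adapter function, which keeps the time and testing this process demands contained to one place.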
Tools for managing IT ops metrics
The market response to the need for an organized way to manage QoE metrics is the full-stack visibility or observability product set. Many vendors, including AppDynamics and Dynatrace, offer tools that collect and harmonize QoE-related metrics, at least to a degree.
To reduce the challenges of linking a full-stack system to each source of metrics in your IT environment, consider using the OpenTelemetry framework, a Cloud Native Computing Foundation project. OpenTelemetry is an emerging standard that's linked to, but not specific to, cloud computing.
The OpenTelemetry framework focuses on time-based metrics, which are critical in linking specific metrics back to QoE. The framework's broad vendor support doesn't guarantee that metrics from every element in your IT environment can be captured effectively, but it at least increases the chances.