The three Rs of visibility for any cloud journey
Dealing with a security incident requires prompt notification and the ability to triage its cause.
In this scenario, security teams must:
- Carry out forensics
- Identify what other systems, users, devices and applications have been compromised or impacted by the incident
- Identify the magnitude or impact of the incident
- Identify the duration of the activity that led to the incident
Notification of an incident is simply the first step in a complex journey that could lead to unearthing a major cyber breach, or perhaps writing off a completely benign non-incident.
And while security orchestration, automation and response (SOAR) solutions help automate and structure these activities, the activities themselves depend on telemetry data, which provides the breadcrumbs needed to remedy the situation. This takes on increasing significance in the cloud for a few reasons:
- The public cloud shared security model may lead to gaps in telemetry. For example, telemetry from the underlying infrastructure, which could help to correlate breadcrumbs at the infrastructure level with those at the application level, may not be available.
- Telemetry information lacks consistency as applications are increasingly segmented into microservices, containers and Platform-as-a-Service, with modules coming from different sources such as internal development, open source, commercial vendors, and outsourced development.
- Misconfigurations and misunderstandings arise as control shifts between DevOps, CloudOps, and SecOps.
When incidents occur, the ability to quickly determine the scope and root cause of the incident depends directly on the availability of quality data. As companies migrate to the cloud, logs have become the de facto standard for gathering telemetry.
However, there are a few challenges when relying almost exclusively on logs for telemetry.
A common practice among attackers is to turn off logging on a compromised system to cloak their activity and footprint. This creates gaps in telemetry that can significantly delay incident response and recovery initiatives.
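As a rough illustration of what such a gap looks like in practice, the Python sketch below flags stretches where a log source went silent for longer than expected. The function name, the ten-minute threshold, and the sample timestamps are all assumptions made for this example; a silent period is only a hint that logging may have been disabled, not proof of compromise.

```python
from datetime import datetime, timedelta

def find_log_gaps(timestamps, max_silence):
    """Return (start, end) pairs where consecutive log entries
    are further apart than max_silence."""
    ordered = sorted(timestamps)
    gaps = []
    for earlier, later in zip(ordered, ordered[1:]):
        if later - earlier > max_silence:
            gaps.append((earlier, later))
    return gaps

# Hypothetical log timestamps: steady entries, then a long silence.
base = datetime(2024, 1, 1, 12, 0, 0)
sample = [base + timedelta(minutes=m) for m in (0, 1, 2, 3, 45, 46)]

gaps = find_log_gaps(sample, max_silence=timedelta(minutes=10))
for start, end in gaps:
    print(f"No log entries between {start:%H:%M} and {end:%H:%M}")
```

A check like this only reveals that a gap exists; it cannot recover what happened during the silence, which is the limitation the rest of this article is concerned with.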
On occasion, DevOps teams may also reduce logging on end systems and applications to reduce CPU usage (and associated costs in the cloud), leading to additional gaps in telemetry data.
Logs also tend to be voluminous and, in many cases, written by developers for developers, leading to too much irrelevant telemetry data. This drives up costs of storing and indexing and also leads to longer query times.
Finally, log levels can be increased or decreased, but the log messages themselves are predefined because they are embedded in code. Changing what information the logs emit cannot be done in real time or near real time in response to an incident; it may require code changes, leading to significant delays and impaired incident response capability.
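The point about predefined logs can be sketched with Python's standard logging module. The logger name, the function, and the IP address below are hypothetical; the sketch only aims to show that raising the log level at runtime surfaces more of the messages already written into the code, while a field the developer never logged stays invisible at every level.

```python
import io
import logging

# Capture log output in memory so the example is self-contained.
stream = io.StringIO()
logger = logging.getLogger("orders_demo")  # hypothetical application logger
logger.propagate = False
handler = logging.StreamHandler(stream)
handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
logger.addHandler(handler)

def process_order(order_id, source_ip):
    # The content of each message is fixed here, in code.
    logger.debug("validating order %s", order_id)
    logger.info("order %s accepted", order_id)
    # source_ip is never logged, so no log level will ever surface it.

logger.setLevel(logging.INFO)
process_order(1, "203.0.113.7")

# Runtime change: more of the pre-written messages appear...
logger.setLevel(logging.DEBUG)
process_order(2, "203.0.113.7")

captured = stream.getvalue()
print(captured)
# ...but the source IP is still absent until the code itself is changed.
```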
This leads us to the three Rs of telemetry — reliable, relevant, and real-time.
Telemetry data needs to be reliable in that it is available when required, without gaps introduced by malicious actors or even inadvertently by various operators due to misconfiguration.
It needs to be relevant in that it should provide meaningful, actionable insights without significantly driving up costs or query times due to excessive and irrelevant information.
Finally, it needs to be real-time in the sense that the stream of telemetry data can be changed, and new telemetry data can be derived at the click of a button.
A great way to address the three Rs is with telemetry data derived from observing network traffic. After all, command and control activity, lateral movement of malware, and data exfiltration happen over the network.
If end systems or applications are compromised, and logging is turned off at the server or application, network monitoring can continue to capture the breadcrumbs that identify the malicious activity.
In that sense, network-based telemetry can provide a reliable stream of information even when endpoints or end systems are compromised or impacted. Metadata generated from network traffic can be surgically tuned to provide a highly relevant and targeted telemetry feed.
Security operations teams can select from thousands of metadata elements specific to their use case, and discard other network metadata that may not be relevant.
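The selection step can be pictured as a simple field filter over network metadata records. The sketch below is illustrative only: the record layout and field names (src_ip, dns_query, and so on) are assumptions for this example and do not come from any particular product.

```python
def select_metadata(records, wanted_fields):
    """Keep only the requested fields in each record, discarding the rest."""
    return [
        {field: record[field] for field in wanted_fields if field in record}
        for record in records
    ]

# Hypothetical network metadata records with more fields than a
# given use case needs.
records = [
    {"src_ip": "10.0.0.5", "dst_ip": "198.51.100.9", "dst_port": 443,
     "tls_sni": "example.com", "bytes_out": 1200, "http_user_agent": "curl/8.0"},
    {"src_ip": "10.0.0.8", "dst_ip": "198.51.100.53", "dst_port": 53,
     "dns_query": "example.org", "bytes_out": 80},
]

# A data exfiltration use case might only care about who talked to whom
# and how much data left the network.
exfil_view = select_metadata(records, ["src_ip", "dst_ip", "bytes_out"])
print(exfil_view)
```

Filtering at the point of collection, rather than after storage, is what keeps storage and query costs down.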
Should the need arise to change what telemetry data is being acquired, it can be easily changed at the network level without requiring any change to the application. A simple API call can change what network metadata is being captured in near real-time.
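What such an API call might look like is sketched below. Everything specific here is hypothetical: the URL, the JSON body, and the field names are invented for illustration and do not describe any real product's API. The request is only constructed, not sent, so the example runs without a network connection.

```python
import json
import urllib.request

def build_metadata_update(base_url, profile, fields):
    """Build (but do not send) a request that replaces the set of
    metadata fields captured for a given profile.
    The endpoint path and payload shape are assumptions."""
    body = json.dumps({"fields": sorted(fields)}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/metadata-profiles/{profile}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# During an incident, add DNS and TLS details to the captured metadata.
request = build_metadata_update(
    "https://visibility.example.com/api/v1",  # hypothetical endpoint
    "incident-response",                      # hypothetical profile name
    {"src_ip", "dst_ip", "dns_query", "tls_sni"},
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending it would be a single call: urllib.request.urlopen(request)
```

The application under investigation is not touched at any point; only the capture configuration changes.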
As organisations look to move to the cloud, complementing their log sources with network-based telemetry will prove invaluable in bolstering their security and compliance posture.
Network-based telemetry is therefore an essential component in securing the move to the cloud.