Kubernetes 1.31 – What’s new?

Kubernetes 1.31 is nearly here, and it’s full of exciting major changes to the project! So, what’s new in this upcoming release?

Kubernetes 1.31 brings a plethora of enhancements, including 37 line items tracked as ‘Graduating’ in this release. From these, 11 enhancements are graduating to stable, including the highly anticipated AppArmor support for Kubernetes, which includes the ability to specify an AppArmor profile for a container or pod in the API, and have that profile applied by the container runtime. 

34 new alpha features are also making their debut, with a lot of eyes on the initial design to support pod-level resource limits. Security teams will be particularly interested in tracking the progress on this one.

Watch out for major changes such as the improved ingress connectivity reliability for kube-proxy, which now provides connection draining on terminating Nodes for load balancers that support it.

Further enhancing security, Pod-level resource limits move from Net New to Alpha, offering a way to set resource constraints for an entire Pod and balance operational efficiency with robust security.

There are also numerous quality-of-life improvements that continue the trend of making Kubernetes more user-friendly and efficient, such as a randomized algorithm for Pod selection when downscaling ReplicaSets.

We are buzzing with excitement for this release! There’s plenty to unpack here, so let’s dive deeper into what Kubernetes 1.31 has to offer.

Editor’s pick:

These are some of the changes that look most exciting to us in this release:

#2395 Removing In-Tree Cloud Provider Code

Probably the most exciting advancement in v1.31 is the removal of all in-tree integrations with cloud providers. Since v1.26 there has been a large push to help Kubernetes truly become a vendor-neutral platform. This externalization process will remove all cloud-provider-specific code from the k8s.io/kubernetes repository with minimal disruption to end users and developers.

Nigel Douglas, Sr. Open Source Security Researcher

#2644 Always Honor PersistentVolume Reclaim Policy

I like this enhancement a lot as it finally allows users to honor the PersistentVolume Reclaim Policy through a deletion protection finalizer. HonorPVReclaimPolicy is now enabled by default. Finalizers can be added on a PersistentVolume to ensure that PersistentVolumes having Delete reclaim policy are deleted only after the backing storage is deleted.


The newly introduced finalizers kubernetes.io/pv-controller and external-provisioner.volume.kubernetes.io/finalizer are only added to dynamically provisioned volumes within your environment.

Pietro Piutti, Sr. Technical Marketing Manager
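
As an illustration of what this looks like on a cluster, a dynamically provisioned PersistentVolume with the Delete reclaim policy might carry one or both of the controller-managed finalizers mentioned above. This is a sketch: the finalizers are added by the control plane, and the driver name and volume handle are placeholders.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pvc-0f58a3c2                                 # illustrative dynamically provisioned volume
  finalizers:
    - kubernetes.io/pv-controller                    # added by the PV controller
    - external-provisioner.volume.kubernetes.io/finalizer   # added by the CSI external-provisioner
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  csi:
    driver: example.csi.k8s.io                       # hypothetical CSI driver
    volumeHandle: vol-0123456789
```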

#4292 Custom profile in kubectl debug


I’m delighted to see that they have finally introduced a new custom profile option for the kubectl debug command. This feature addresses the challenge teams have regularly faced when debugging applications built on shell-less base images. By allowing the mounting of data volumes and other resources within the debug container, this enhancement provides a significant security benefit for most organizations, encouraging the adoption of more secure, shell-less base images without sacrificing debugging capabilities.

Thomas Labarussias, Sr. Developer Advocate & CNCF Ambassador


Apps in Kubernetes 1.31

#3017 PodHealthyPolicy for PodDisruptionBudget

Stage: Graduating to Stable
Feature group: sig-apps

Kubernetes 1.31 introduces the PodHealthyPolicy for PodDisruptionBudget (PDB). PDBs currently serve two purposes: ensuring a minimum number of pods remain available during disruptions and preventing data loss by blocking pod evictions until data is replicated.

The current implementation has issues. Pods that are Running but not Healthy (Ready) may not be evicted even if their number exceeds the PDB threshold, hindering tools like cluster-autoscaler. Additionally, using PDBs to prevent data loss is considered unsafe and not their intended use.

Despite these issues, many users rely on PDBs for both purposes. Therefore, changing the PDB behavior without supporting both use-cases is not viable, especially since Kubernetes lacks alternative solutions for preventing data loss.
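A minimal sketch of a PDB using the field introduced by this work (unhealthyPodEvictionPolicy), which lets the eviction API remove Running-but-not-Ready pods so node drains and autoscaling are not blocked:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
  # Allow eviction of Running-but-not-Ready pods so node drains and the
  # cluster-autoscaler are not blocked by unhealthy replicas.
  unhealthyPodEvictionPolicy: AlwaysAllow
```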

#3335 Allow StatefulSet to control start replica ordinal numbering

Stage: Graduating to Stable
Feature group: sig-apps

The goal of this feature is to enable the migration of a StatefulSet across namespaces, clusters, or in segments without disrupting the application. Traditional methods like backup and restore cause downtime, while pod-level migration requires manual rescheduling. Migrating a StatefulSet in slices allows for a gradual and less disruptive migration process by moving only a subset of replicas at a time.
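For example, a StatefulSet slice that starts its replica numbering at ordinal 3 can be expressed with spec.ordinals.start (a minimal sketch; names and image are placeholders):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: db
spec:
  serviceName: db
  replicas: 3
  ordinals:
    start: 3              # pods are named db-3, db-4, db-5 instead of db-0..db-2
  selector:
    matchLabels:
      app: db
  template:
    metadata:
      labels:
        app: db
    spec:
      containers:
        - name: db
          image: example.com/db:latest   # placeholder image
```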

#3998 Job Success/completion policy

Stage: Graduating to Beta
Feature group: sig-apps

We are excited about the improvement to the Job API, which now allows setting conditions under which an Indexed Job can be declared successful. This is particularly useful for batch workloads like MPI and PyTorch that need to consider only leader indexes for job success. Previously, an indexed job was marked as completed only if all indexes succeeded. Some third-party frameworks, like Kubeflow Training Operator and Flux Operator, have implemented similar success policies. This improvement will enable users to mark jobs as successful based on a declared policy, terminating lingering pods once the job meets the success criteria.
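A minimal sketch of an Indexed Job that is considered successful once its leader index (index 0) succeeds, using the successPolicy field (the image name is a placeholder):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: mpi-train
spec:
  completionMode: Indexed
  completions: 4
  parallelism: 4
  successPolicy:
    rules:
      - succeededIndexes: "0"      # the leader index decides overall success
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: example.com/mpi-worker:latest   # placeholder image
```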

CLI in Kubernetes 1.31

#4006 Transition from SPDY to WebSockets

Stage: Graduating to Beta
Feature group: sig-cli

This enhancement proposes adding a WebSocketExecutor to the kubectl CLI tool, using a new subprotocol version (v5.channel.k8s.io), and creating a FallbackExecutor to handle client/server version discrepancies. The FallbackExecutor first attempts to connect using the WebSocketExecutor, then falls back to the legacy SPDYExecutor if unsuccessful, potentially requiring two request/response trips. Despite the extra roundtrip, this approach is justified because modifying the low-level SPDY and WebSocket libraries for a single handshake would be overly complex, and the additional IO load is minimal in the context of streaming operations. Additionally, as releases progress, the likelihood of a WebSocket-enabled kubectl interacting with an older, non-WebSocket API Server decreases.

#4706 Deprecate and remove kustomize from kubectl

Stage: Net New to Alpha
Feature group: sig-cli

The update was deferred from the Kubernetes 1.31 release. Kustomize was initially integrated into kubectl to enhance declarative support for Kubernetes objects. However, with the development of various customization and templating tools over the years, kubectl maintainers now believe that promoting one tool over others is not appropriate. Decoupling Kustomize from kubectl will allow each project to evolve at its own pace, avoiding issues with mismatched release cycles that can lead to kubectl users working with outdated versions of Kustomize. Additionally, removing Kustomize will reduce the dependency graph and the size of the kubectl binary, addressing some dependency issues that have affected the core Kubernetes project.

#3104 Separate kubectl user preferences from cluster configs

Stage: Net New to Alpha
Feature group: sig-cli

Kubectl, one of the earliest components of the Kubernetes project, upholds a strong commitment to backward compatibility. We aim to let users opt into new features (like delete confirmation), which might otherwise disrupt existing CI jobs and scripts. Although kubeconfig has an underutilized field for preferences, it isn’t ideal for this purpose. New clusters usually generate a new kubeconfig file with credentials and host details, and while these files can be merged or specified by path, we believe server configuration and user preferences should be distinctly separated.

To address these needs, the Kubernetes maintainers proposed introducing a kuberc file for client preferences. This file will be versioned and structured to easily incorporate new behaviors and settings for users. It will also allow users to define kubectl command aliases and default flags. With this change, we plan to deprecate the kubeconfig Preferences field. This separation ensures users can manage their preferences consistently, regardless of the --kubeconfig flag or $KUBECONFIG environment variable.
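
A rough sketch of what a kuberc file could look like, based on the KEP; since the feature is alpha, the kind, API version, and field names shown here are illustrative assumptions rather than a settled format:

```yaml
# ~/.kube/kuberc -- illustrative sketch only; exact schema may change while alpha
apiVersion: kubectl.config.k8s.io/v1alpha1
kind: Preference
aliases:
  - name: getall            # "kubectl getall" expands to "kubectl get all"
    command: get
    appendArgs:
      - all
defaults:
  - command: delete
    options:
      - name: interactive   # make delete confirmation the default behavior
        default: "true"
```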

Kubernetes 1.31 instrumentation

#2305 Metric cardinality enforcement

Stage: Graduating to Stable
Feature group: sig-instrumentation

Metrics that turn into memory leaks pose significant issues, especially when fixing them requires re-releasing the entire Kubernetes binary. Historically, we’ve tackled these issues inconsistently. For instance, coding mistakes sometimes cause unintended IDs to be used as metric label values. 

In other cases, we’ve had to delete metrics entirely due to their incorrect use. More recently, we’ve either removed metric labels or retroactively defined acceptable values for them. Fixing these issues is a manual, labor-intensive, and time-consuming process without a standardized solution.

This stable update should address these problems by enabling metric dimensions to be bound to known sets of values independently of Kubernetes code releases.

Network in Kubernetes 1.31

#3836 Ingress Connectivity Reliability Improvement for Kube-Proxy

Stage: Graduating to Stable
Feature group: sig-network

This enhancement finally introduces a more reliable mechanism for handling ingress connectivity for endpoints on terminating nodes and nodes with unhealthy kube-proxies, focusing on externalTrafficPolicy: Cluster (eTP:Cluster) services. Currently, kube-proxy’s response is based on its healthz state for eTP:Cluster services and on the presence of a Ready endpoint for eTP:Local services. This KEP addresses the former.

The proposed changes are:

  1. Connection Draining for Terminating Nodes:
    Kube-proxy will use the ToBeDeletedByClusterAutoscaler taint to identify terminating nodes and fail its healthz check to signal load balancers for connection draining. Other signals like .spec.unschedulable were considered but deemed less direct.
  2. Addition of /livez Path:
    Kube-proxy will add a /livez endpoint to its health check server to reflect the old healthz semantics, indicating whether data-plane programming is stale.
  3. Cloud Provider Health Checks:
    While not aligning cloud provider health checks for eTP:Cluster services, the KEP suggests creating a document on Kubernetes’ official site to guide and share knowledge with cloud providers for better health checking practices.

#4444 Traffic Distribution to Services

Stage: Graduating to Beta
Feature group: sig-network

To enhance traffic routing in Kubernetes, this KEP proposes adding a new field, trafficDistribution, to the Service specification. This field allows users to specify routing preferences, offering more control and flexibility than the earlier topologyKeys mechanism. trafficDistribution will provide a hint for the underlying implementation to consider in routing decisions without offering strict guarantees.

The new field will support values like PreferClose, indicating a preference for routing traffic to topologically proximate endpoints. The absence of a value indicates no specific routing preference, leaving the decision to the implementation. This change aims to provide enhanced user control, standard routing preferences, flexibility, and extensibility for innovative routing strategies.
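For example, a Service that hints the dataplane to prefer topologically close endpoints would set the new field like this:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    app: backend
  ports:
    - port: 80
      targetPort: 8080
  # Hint to the implementation to prefer endpoints close to the client (e.g. same zone).
  trafficDistribution: PreferClose
```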

#1880 Multiple Service CIDRs

Stage: Graduating to Beta
Feature group: sig-network

This proposal introduces a new allocator logic using two new API objects: ServiceCIDR and IPAddress, allowing users to dynamically increase available Service IPs by creating new ServiceCIDRs. The allocator will automatically consume IPs from any available ServiceCIDR, similar to adding more disks to a storage system to increase capacity.

To maintain simplicity, backward compatibility, and avoid conflicts with other APIs like Gateway APIs, several constraints are added:

  • ServiceCIDR is immutable after creation.
  • ServiceCIDR can only be deleted if no Service IPs are associated with it.
  • Overlapping ServiceCIDRs are allowed.
  • The API server ensures a default ServiceCIDR exists to cover service CIDR flags and the “kubernetes.default” Service.
  • All IPAddresses must belong to a defined ServiceCIDR.
  • Every Service with a ClusterIP must have an associated IPAddress object.
  • A ServiceCIDR being deleted cannot allocate new IPs.

This creates a one-to-one relationship between Service and IPAddress, and a one-to-many relationship between ServiceCIDR and IPAddress. Overlapping ServiceCIDRs are merged in memory, with IPAddresses coming from any ServiceCIDR that includes that IP. The new allocator logic can also be used by other APIs, such as the Gateway API, enabling future administrative and cluster-wide operations on Service ranges.
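As a sketch, adding a new range of Service IPs becomes a matter of creating a ServiceCIDR object; the API group/version shown below matches the 1.31 beta and may differ on other releases:

```yaml
apiVersion: networking.k8s.io/v1beta1   # beta API in Kubernetes 1.31
kind: ServiceCIDR
metadata:
  name: extra-service-range
spec:
  cidrs:
    - 10.96.100.0/24                    # additional range the allocator can assign ClusterIPs from
```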

Kubernetes 1.31 nodes

#2400 Node Memory Swap Support

Stage: Graduating to Stable
Feature group: sig-node

Now graduating to stable, this enhancement integrates swap memory support into Kubernetes, addressing two key user groups: node administrators for performance tuning and app developers requiring swap for their apps. 

The focus was to facilitate controlled swap use on a node level, with the kubelet enabling Kubernetes workloads to utilize swap space under specific configurations. The ultimate goal is to enhance Linux node operation with swap, allowing administrators to determine swap usage for workloads, initially not permitting individual workloads to set their own swap limits.
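On the node side, this is driven by kubelet configuration; a minimal sketch enabling limited swap use looks like this:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
failSwapOn: false            # allow the kubelet to start on a node with swap enabled
memorySwap:
  swapBehavior: LimitedSwap  # Burstable pods may use swap within limits; NoSwap keeps workloads off swap
```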

#4569 Move cgroup v1 support into maintenance mode

Stage: Net New to Stable
Feature group: sig-node

The proposal aims to transition Kubernetes’ cgroup v1 support into maintenance mode while encouraging users to adopt cgroup v2. Although cgroup v1 support won’t be removed immediately, its deprecation and eventual removal will be addressed in a future KEP. The Linux kernel community and major distributions are focusing on cgroup v2 due to its enhanced functionality, consistent interface, and improved scalability. Consequently, Kubernetes must align with this shift to stay compatible and benefit from cgroup v2’s advancements.

To support this transition, the proposal includes several goals. First, cgroup v1 will receive no new features, marking its functionality as complete and stable. End-to-end testing will be maintained to ensure the continued validation of existing features. The Kubernetes community may provide security fixes for critical CVEs related to cgroup v1 as long as the release is supported. Major bugs will be evaluated and fixed if feasible, although some issues may remain unresolved due to dependency constraints.

Migration support will be offered to help users transition from cgroup v1 to v2. Additionally, efforts will be made to enhance cgroup v2 support by addressing all known bugs, ensuring it is reliable and functional enough to encourage users to switch. This proposal reflects the broader ecosystem’s movement towards cgroup v2, highlighting the necessity for Kubernetes to adapt accordingly.

#24 AppArmor Support

Stage: Graduating to Stable
Feature group: sig-node

Adding AppArmor support to Kubernetes marks a significant enhancement in the security posture of containerized workloads. AppArmor is a Linux kernel security module that allows system admins to restrict certain capabilities of a program using profiles attached to specific applications or containers. By integrating AppArmor into Kubernetes, developers can now define security policies directly within an app config.

The initial implementation of this feature would allow for specifying an AppArmor profile within the Kubernetes API for individual containers or entire pods. This profile, once defined, would be enforced by the container runtime, ensuring that the container’s actions are restricted according to the rules defined in the profile. This capability is crucial for running secure and confined applications in a multi-tenant environment, where a compromised container could potentially affect other workloads or the underlying host.
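With the fields that graduated here, an AppArmor profile can be set at the pod or container level through securityContext; a minimal sketch (the Localhost profile name is a placeholder and must already be loaded on the node):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hello-apparmor
spec:
  securityContext:
    appArmorProfile:
      type: RuntimeDefault                 # pod-wide default profile
  containers:
    - name: app
      image: busybox:stable
      command: ["sh", "-c", "sleep 3600"]
      securityContext:
        appArmorProfile:
          type: Localhost                  # container-level override
          localhostProfile: k8s-apparmor-example-deny-write   # placeholder; must exist on the node
```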

Scheduling in Kubernetes

#3633 Introduce MatchLabelKeys and MismatchLabelKeys to PodAffinity and PodAntiAffinity

Stage: Graduating to Beta
Feature group: sig-scheduling

This was Tracked for Code Freeze as of July 23rd. This enhancement finally introduces the MatchLabelKeys for PodAffinityTerm to refine PodAffinity and PodAntiAffinity, enabling more precise control over Pod placements during scenarios like rolling upgrades. 

By allowing users to specify the scope for evaluating Pod co-existence, it addresses scheduling challenges that arise when new and old Pod versions are present simultaneously, particularly in saturated or idle clusters. This enhancement aims to improve scheduling effectiveness and cluster resource utilization.
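A sketch of how matchLabelKeys narrows an anti-affinity term to pods from the same ReplicaSet by keying on pod-template-hash (labels and image are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web-7c9b
  labels:
    app: web
    pod-template-hash: 7c9b        # set by the Deployment controller on real pods
spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - topologyKey: kubernetes.io/hostname
          labelSelector:
            matchLabels:
              app: web
          # Only pods with the same pod-template-hash are considered, so old and
          # new versions do not block each other during a rolling upgrade.
          matchLabelKeys:
            - pod-template-hash
  containers:
    - name: web
      image: nginx:stable
```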

Kubernetes storage

#3762 PersistentVolume last phase transition time

Stage: Graduating to Stable
Feature group: sig-storage

The Kubernetes maintainers plan to update the API server to support a new timestamp field for PersistentVolumes, which will record when a volume transitions to a different phase. This field will be set to the current time for all newly created volumes and those changing phases. While this timestamp is intended solely as a convenience for cluster administrators, it will enable them to list and sort PersistentVolumes based on the transition times, aiding in manual cleanup and management.

This change addresses issues experienced by users with the Delete retain policy, which led to data loss, prompting many to revert to the safer Retain policy. With the Retain policy, unclaimed volumes are marked as Released, and over time, these volumes accumulate. The timestamp field will help admins identify when volumes last transitioned to the Released phase, facilitating easier cleanup. 

Moreover, the generic recording of timestamps for all phase transitions will provide valuable metrics and insights, such as measuring the time between Pending and Bound phases. The goals are to introduce this timestamp field and update it with every phase transition, without implementing any volume health monitoring or additional actions based on the timestamps.
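In practice, the new status field can be used directly for sorting and cleanup; a sketch of what it looks like on a Released volume:

```yaml
# List volumes ordered by when they last changed phase, e.g. to find long-Released ones:
#   kubectl get pv --sort-by=.status.lastPhaseTransitionTime
# Relevant fragment of a PersistentVolume's status:
status:
  phase: Released
  lastPhaseTransitionTime: "2024-07-01T09:30:00Z"   # illustrative timestamp
```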

#3751 Kubernetes VolumeAttributesClass ModifyVolume

Stage: Graduating to Beta
Feature group: sig-storage

The proposal introduces a new Kubernetes API resource, VolumeAttributesClass, along with an admission controller and a volume attributes protection controller. This resource will allow users to manage volume attributes, such as IOPS and throughput, independently from capacity. The current immutability of StorageClass.parameters necessitates this new resource, as it permits updates to volume attributes without directly using cloud provider APIs, simplifying storage resource management.

VolumeAttributesClass will enable specifying and modifying volume attributes both at creation and for existing volumes, ensuring changes are non-disruptive to workloads. Conflicts between StorageClass.parameters and VolumeAttributesClass.parameters will result in errors from the driver. 

The primary goals include providing a cloud-provider-independent specification for volume attributes, enforcing these attributes through the storage, and allowing workload developers to modify them non-disruptively. The proposal does not address OS-level IO attributes, inter-pod volume attributes, or scheduling based on node-specific volume attributes limits, though these may be considered for future extensions.
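A sketch of a VolumeAttributesClass and a PVC referencing it; the driver name and parameter keys are driver-specific placeholders:

```yaml
apiVersion: storage.k8s.io/v1beta1        # beta API in Kubernetes 1.31
kind: VolumeAttributesClass
metadata:
  name: gold
driverName: example.csi.k8s.io            # hypothetical CSI driver; parameters are driver-specific
parameters:
  iops: "16000"
  throughput: "600"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 100Gi
  volumeAttributesClassName: gold         # can be changed later to modify attributes non-disruptively
```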

#3314 CSI Differential Snapshot for Block Volumes

Stage: Net New to Alpha
Feature group: sig-storage

This enhancement was removed from the Kubernetes 1.31 milestone. It aims at enhancing the CSI specification by introducing a new optional CSI SnapshotMetadata gRPC service. This service allows Kubernetes to retrieve metadata on allocated blocks of a single snapshot or the changed blocks between snapshots of the same block volume. Implemented by the community-provided external-snapshot-metadata sidecar, this service must be deployed by a CSI driver. Kubernetes backup applications can access snapshot metadata through a secure TLS gRPC connection, which minimizes load on the Kubernetes API server.

The external-snapshot-metadata sidecar communicates with the CSI driver’s SnapshotMetadata service over a private UNIX domain socket. The sidecar handles tasks such as validating the Kubernetes authentication token, authorizing the backup application, validating RPC parameters, and fetching necessary provisioner secrets. The CSI driver advertises the existence of the SnapshotMetadata service to backup applications via a SnapshotMetadataService CR, containing the service’s TCP endpoint, CA certificate, and audience string for token authentication.

Backup applications must obtain an authentication token using the Kubernetes TokenRequest API with the service’s audience string before accessing the SnapshotMetadata service. They should establish trust with the specified CA and use the token in gRPC calls to the service’s TCP endpoint. This setup ensures secure, efficient metadata retrieval without overloading the Kubernetes API server.

The goals of this enhancement are to provide a secure CSI API for identifying allocated and changed blocks in volume snapshots, and to efficiently relay large amounts of snapshot metadata from the storage provider. This API is an optional component of the CSI framework.

Other enhancements in Kubernetes 1.31

#4193 Bound service account token improvements

Stage: Graduating to Beta
Feature group: sig-auth

The proposal aims to enhance Kubernetes security by embedding the bound Node information in tokens and extending token functionalities. The kube-apiserver will be updated to automatically include the name and UID of the Node associated with a Pod in the generated tokens during a TokenRequest. This requires adding a Getter for Node objects to fetch the Node’s UID, similar to existing processes for Pod and Secret objects.

Additionally, the TokenRequest API will be extended to allow tokens to be bound directly to Node objects, ensuring that when a Node is deleted, the associated token is invalidated. The SA authenticator will be modified to verify tokens bound to Node objects by checking the existence of the Node and validating the UID in the token. This maintains the current behavior for Pod-bound tokens while enforcing new validation checks for Node-bound tokens from the start.

Furthermore, each issued JWT will include a UUID (JTI) to trace the requests made to the apiserver using that token, recorded in audit logs. This involves generating the UUID during token issuance and extending audit log entries to capture this identifier, enhancing traceability and security auditing.

#3962 Mutating Admission Policies

Stage: Net New to Alpha
Feature group: sig-api-machinery

Continuing the work started in KEP-3488, the project maintainers have proposed adding mutating admission policies using CEL expressions as an alternative to mutating admission webhooks. This builds on the API for validating admission policies established in KEP-3488. The approach leverages CEL’s object instantiation and Server Side Apply’s merge algorithms to perform mutations.

The motivation for this enhancement stems from the simplicity needed for common mutating operations, such as setting labels or adding sidecar containers, which can be efficiently expressed in CEL. This reduces the complexity and operational overhead of managing webhooks. Additionally, CEL-based mutations offer advantages such as allowing the kube-apiserver to introspect mutations and optimize the order of policy applications, minimizing reinvocation needs. In-process mutation is also faster compared to webhooks, making it feasible to re-run mutations to ensure consistency after all operations are applied.

The goals include providing a viable alternative to mutating webhooks for most use cases, enabling policy frameworks without webhooks, offering an out-of-tree implementation for compatibility with older Kubernetes versions, and providing core functionality as a library for use in GitOps, CI/CD pipelines, and auditing scenarios.
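A rough sketch, based on the KEP, of a CEL-based mutation that adds a label to new Deployments; because this is alpha, the API group, kind, and field names here are assumptions rather than a settled API:

```yaml
# Illustrative only -- field names follow the KEP and may differ in the released alpha.
apiVersion: admissionregistration.k8s.io/v1alpha1
kind: MutatingAdmissionPolicy
metadata:
  name: add-environment-label
spec:
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["deployments"]
  mutations:
    - patchType: ApplyConfiguration
      applyConfiguration:
        # CEL object instantiation merged via Server Side Apply semantics
        expression: >
          Object{
            metadata: Object.metadata{
              labels: {"environment": "production"}
            }
          }
```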

#3715 Elastic Indexed Jobs

Stage: Graduating to Stable
Feature group: sig-apps

Also graduating to Stable, this feature allows mutating spec.completions on Indexed Jobs, provided it matches and is updated together with spec.parallelism. The success and failure semantics remain unchanged for jobs that do not alter spec.completions. For jobs that do, failures always count against the job’s backoffLimit, even if spec.completions is scaled down and the failed pods fall outside the new range. The status.Failed count will not decrease, but status.Succeeded will update to reflect successful indexes within the new range. If a previously successful index is out of range due to scaling down and then brought back into range by scaling up, the index will restart.
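
For example, an Indexed Job created with matching completions and parallelism can later be scaled by patching both fields together (a sketch; the image and patch command are illustrative):

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: render
spec:
  completionMode: Indexed
  completions: 5
  parallelism: 5          # completions may only be mutated together with parallelism
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: render
          image: example.com/render:latest   # placeholder image
# Scale the running job up by patching both fields in one update, e.g.:
#   kubectl patch job render --type=merge -p '{"spec":{"completions":10,"parallelism":10}}'
```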

CVE-2024-6387 – Shields Up Against RegreSSHion

On July 1st, the Qualys security team announced CVE-2024-6387, a remotely exploitable vulnerability in the OpenSSH server. This critical vulnerability is nicknamed “regreSSHion” because its root cause is the accidental removal of code that fixed a much earlier vulnerability, CVE-2006-5051, back in 2006. The race condition affects the default configuration of sshd (the daemon program for SSH).

OpenSSH versions older than 4.4p1 (unless patched for the earlier CVE-2006-5051 and CVE-2008-4109) and versions from 8.5p1 up to, but not including, 9.8p1 are impacted. The general guidance is to update to a patched version; Ubuntu users can download the updated packages.

According to OpenSSH infosec researchers, this vulnerability may be difficult to exploit. 

Their investigation disclosed that under lab conditions, the attack requires, on average, 6-8 hours of continuous connections, until the maximum number of connections accepted by the server is reached.

Why is CVE-2024-6387 significant? 

This vulnerability allows an unauthenticated attacker to gain root-level privileges and remotely access glibc-based Linux systems, where the SIGALRM signal handler calls syslog() (a system logging function), which in turn calls async-signal-unsafe functions. Researchers believe that OpenSSH on OpenBSD, a notable exception, is not vulnerable by design because its SIGALRM handler calls syslog_r(), an async-signal-safer version of syslog(). 

What is the impact?

OpenSSH researchers believe the attacks will improve over time – thanks to advancements in deep learning – and impact other operating systems, including non-glibc systems. The net effect of exploiting CVE-2024-6387 is full system compromise and takeover, enabling threat actors to execute arbitrary code with the highest privileges, subvert security mechanisms, steal data, and maintain persistent access. The team at Qualys has already identified no less than 14 million potentially vulnerable OpenSSH server instances exposed to the internet. 

How to find vulnerable OpenSSH packages with Sysdig

You can use your inventory workflows to get visibility into resources and security blindspots across your cloud (GCP, Azure and AWS), Kubernetes, and container images. Besides patching, you should also limit SSH access to your critical assets. 

Here’s how you can look for the vulnerable OpenSSH package within your environment using Sysdig Secure:

  • Navigate to the Inventory tab
  • In the Search bar, enter the following query: 
Package contains openssh

The results show all the resources across your cloud estate that have the vulnerable package. Sysdig provides an overview of all the blind spots that may have gone unchecked within your environment. You can interact with the filters and further reduce your investigation timelines from within a single unified platform.

The need for stateful detections

Exploitation of regreSSHion involves multiple attempts (thousands, in fact) executed in a fixed period of time. This complexity is what downgrades the CVE from a “Critical” vulnerability to a “High” risk vulnerability.

Using Sysdig, we can detect drift from baseline sshd behaviors. In this case, stateful detections would track the number of failed attempts to authenticate with the sshd server. Falco rules alone detect the potential Indicators of Compromise (IoCs). By pulling this into a global state table, Sysdig can better detect a spike of actual, failed authentication attempts for anonymous users, rather than focusing on point-in-time alerting. 

At the heart of Sysdig Secure lies Falco’s unified detection engine. This cutting‑edge engine leverages real‑time behavioral insights and threat intelligence to continuously monitor the multi‑layered infrastructure, identifying potential security incidents. 

Whether it’s anomalous container activities, unauthorized access attempts, supply chain vulnerabilities, or identity‑based threats, Sysdig ensures that organizations have a unified and proactive defense against evolving threats.

Optimizing Wireshark in Kubernetes

In Kubernetes, managing and analyzing network traffic poses unique challenges due to the ephemeral nature of containers and the layered abstraction of Kubernetes structures like pods, deployments, and services. Traditional tools like Wireshark, although powerful, struggle to adapt to these complexities, often capturing excessive, irrelevant data – what we call “noise.”

The Challenge with Traditional Packet Capturing

The ephemerality of containers is one of the most obvious issues. By the time a security incident is detected and analyzed, the container involved may no longer exist. When a pod dies in Kubernetes, its controller is designed to instantly recreate it. When this happens, the new pod has new context, such as a new IP address and pod name. As a starting point, we need to look past the static context of legacy systems and try to do forensics based on Kubernetes abstractions such as network namespaces and service names.

It’s worth highlighting that there are some clear contextual limitations of Wireshark in cloud-native environments. Tools like Wireshark are not inherently aware of Kubernetes abstractions. This disconnect makes it hard to relate network traffic directly back to specific pods or services without significant manual configuration and contextual stitching. Thankfully, Falco carries Kubernetes context in its rule detections. Pairing Wireshark with Falco bridges the gap between raw network data and the intelligence provided by the Kubernetes audit logs. We now have associated metadata from the Falco alert for the network capture.

Finally, there’s the challenge of data overload associated with PCAP files. Traditional packet capture strategies, such as those employed by AWS VPC Traffic Mirroring or GCP Traffic Mirroring, often result in vast amounts of data, most of which is irrelevant to the actual security concern, making it harder to isolate important information quickly and efficiently. Comparatively, options like AWS VPC Flow Logs or Azure’s attempt at Virtual network tap, although less complex, still incur significant costs in data transfer/storage. 

When’s the appropriate time to start a capture? How do you know when to end it? Should it be pre-filtered to reduce the file size, or should we capture everything and then filter out noise in the Wireshark GUI? We might have a solution to these concerns that bypasses the complexities and costs of cloud services.

Introducing a New Approach with Falco Talon

Organizations have long dealt with security blindspots related to Kubernetes alerts. Falco and Falco Talon address these shortcomings through a novel approach that integrates Falco, a cloud-native detection engine, with tshark, the terminal version of Wireshark, for more effective and targeted network traffic analysis in Kubernetes environments.

Falco Talon’s event-driven, API approach to threat response is the best way to deal with initiating captures in real time. It’s also the most stable approach we can see with the existing state-of-the-art in cloud-native security – notably, Falco.

Step-by-Step Workflow:

  • Detection: Falco, designed specifically for cloud-native environments like Kubernetes, monitors the environment for suspicious activity and potential threats. It is finely tuned to understand Kubernetes context, making it adept at spotting Indicators of Compromise (IoCs). Let’s say, for example, it triggers a detection for specific anomalous network traffic to a Command and Control (C2) server or botnet endpoints.
  • Automating Tshark: Upon detection of an IoC, Falco sends a webhook to the Falco Talon backend. Talon has many no-code response actions, but one of these actions allows users to trigger arbitrary scripts. This trigger can be context-aware from the metadata associated with the Falco alert, allowing for a tshark command to be automatically initiated with metadata context specific to the incident.
  • Contextual Packet Capturing: Finally, a PCAP file is generated for a few seconds with more tailored context. In the event of a suspicious TCP traffic alert from Falco, we can filter a tshark command for just TCP activity. In the case of a suspicious botnet endpoint, let’s see all traffic to that botnet endpoint. Falco Talon, in each of these scenarios, initiates a tshark capture tailored to the exact network context of the alert. This means capturing traffic only from the relevant pod, service, or deployment implicated in the security alert.
  • Improved Analysis: Finally, the captured data is immediately available for deeper analysis, providing security teams with the precise information needed to respond effectively to the incident. This is valuable for Digital Forensics & Incident Response (DFIR) efforts, but also in maintaining regulatory compliance by logging context specific to security incidents in production.

This targeted approach not only reduces the volume of captured data, making analysis faster and more efficient, but also ensures that captures are immediately relevant to the security incidents detected, enhancing response times and effectiveness.
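
To ground the workflow above, here is a minimal sketch of what the detection side could look like: a Falco rule flagging outbound traffic to a placeholder C2 address, with a comment suggesting the kind of scoped tshark capture Falco Talon might then trigger. The rule name, IP address, and tshark invocation are illustrative assumptions, not artifacts of the project.

```yaml
- rule: Outbound Connection to Known C2 Address
  desc: >
    Detect an outbound connection from a container to a known
    command-and-control address. The IP below is a documentation
    placeholder, not real threat intelligence.
  condition: >
    evt.type = connect and evt.dir = < and fd.type = ipv4
    and fd.sip = "203.0.113.10" and container.id != host
  output: >
    Outbound connection to C2 address
    (command=%proc.cmdline connection=%fd.name
    pod=%k8s.pod.name ns=%k8s.ns.name)
  priority: WARNING
  tags: [network, falco-talon]

# Falco Talon could answer this event with a short, scoped capture, for example:
#   tshark -i any -a duration:30 -w /tmp/capture.pcap host 203.0.113.10
```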

Collaboration and Contribution

We believe this integrated approach marks a significant advancement in Kubernetes security management. If you are interested in contributing to this innovative project or have insights to share, feel free to contribute to the GitHub project today.

This method aligns with the needs of modern Kubernetes environments, leveraging the strengths of both Falco and Wireshark to provide a nuanced, powerful tool for network security. By adapting packet capture strategies to the specific demands of cloud-native architectures, we can significantly improve our ability to secure and manage dynamic containerized applications.

Open source software (OSS) is the only approach with the agility and broad reach to set up the conditions to meet modern security concerns, well-demonstrated by Wireshark over its 25 years of development. Sysdig believes that collaboration brings together expertise and scrutiny, and a broader range of use cases, which ultimately drives more secure software.

This proof-of-concept involves three OSS technologies (Falco, Falco Talon, and Wireshark). While the scenario was specific to Kubernetes, there is no reason why it cannot be adapted to standalone Linux systems, Internet of Things (IoT) devices, and Edge computing in the future.

What’s New in Kubernetes 1.30?

Kubernetes 1.30 is on the horizon, and it’s packed with fresh and exciting features! So, what’s new in this upcoming release?

Kubernetes 1.30 brings a plethora of enhancements, including a blend of 58 new and improved features. From these, several are graduating to stable, including the highly anticipated Container Resource Based Pod Autoscaling, which refines the capabilities of the Horizontal Pod Autoscaler by focusing on individual container metrics. New alpha features are also making their debut, promising to revolutionize how resources are managed and allocated within clusters.

Watch out for major changes such as the introduction of Structured Parameters for Dynamic Resource Allocation, enhancing the previously introduced dynamic resource allocation with a more structured and understandable approach. This ensures that Kubernetes components can make more informed decisions, reducing dependency on third-party drivers.

Further enhancing security, the support for User Namespaces in Pods moves to beta, offering refined isolation and protection against vulnerabilities by supporting user namespaces, allowing for customized UID/GID ranges that bolster pod security.

There are also numerous quality-of-life improvements that continue the trend of making Kubernetes more user-friendly and efficient, such as updates in pod resource management and network policies.

We are buzzing with excitement for this release! There’s plenty to unpack here, so let’s dive deeper into what Kubernetes 1.30 has to offer.

Kubernetes 1.30 – Editor’s pick

These are the features that look most exciting to us in this release:

#2400 Memory Swap Support

This enhancement sees the most significant overhaul, improving system stability by modifying swap memory behavior on Linux nodes to better manage memory usage and system performance. By optimizing how swap memory is handled, Kubernetes can ensure smoother operation of applications under various load conditions, thereby reducing system crashes and enhancing overall reliability.

Nigel Douglas, Sr. Open Source Security Advocate (Falco Security)

#3221 Structured Authorization Configuration

This enhancement also hits beta, streamlining the creation of authorization chains with enhanced capabilities like multiple webhooks and fine-grained control over request validation, all configured through a structured file. By allowing for complex configurations and precise authorization mechanisms, this feature significantly enhances security and administrative efficiency, making it easier for administrators to enforce policy compliance across the cluster.

Mike Coleman, Staff Developer Advocate – Open Source Ecosystem

#3488 CEL for Admission Control


The integration of Common Expression Language (CEL) for admission control introduces a dynamic method to enforce complex, fine-grained policies directly through the Kubernetes API, enhancing both security and governance capabilities. This improvement enables administrators to craft policies that are not only more nuanced but also responsive to the evolving needs of their deployments, thereby ensuring that security measures keep pace with changes without requiring extensive manual updates.

Thomas Labarussias, Sr. Developer Advocate & CNCF Ambassador



Apps in Kubernetes 1.30

#4443 More granular failure reason for Job PodFailurePolicy

Stage: Net New to Alpha
Feature group: sig-apps

The current approach of assigning a general “PodFailurePolicy” reason to a Job’s failure condition could be enhanced for specificity. One way to achieve this is by adding a customizable Reason field to the PodFailurePolicyRule, allowing for distinct, machine-readable reasons for each rule trigger, subject to character limitations. This method, preferred for its clarity, would enable higher-level APIs utilizing Jobs to respond more precisely to failures, particularly by associating them with specific container exit codes.
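
For context, a podFailurePolicy rule today looks like the sketch below; the commented-out reason field illustrates the kind of customizable, machine-readable reason the proposal discusses and is hypothetical, not a released API:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: batch-run
spec:
  backoffLimit: 6
  podFailurePolicy:
    rules:
      - action: FailJob
        # reason: OutOfMemory        # hypothetical per-rule reason proposed by this KEP
        onExitCodes:
          containerName: main
          operator: In
          values: [137]
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: main
          image: example.com/batch:latest   # placeholder image
```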

#3017 PodHealthyPolicy for PodDisruptionBudget

Stage: Graduating to Stable
Feature group: sig-apps

Pod Disruption Budgets (PDBs) are utilized for two main reasons: to maintain availability by limiting voluntary disruptions and to prevent data loss by avoiding eviction until critical data replication is complete. However, the current PDB system has limitations. It sometimes prevents eviction of unhealthy pods, which can impede node draining and auto-scaling. 

Additionally, the use of PDBs for data safety is not entirely reliable and could be considered a misuse of the API. Despite these issues, the dependency on PDBs for data protection is significant enough that any changes to PDBs must continue to support this requirement, as Kubernetes does not offer alternative solutions for this use case. The goals are to refine PDBs to avoid blocking evictions due to unhealthy pods and to preserve their role in ensuring data safety.

#3998 Job Success/completion policy

Stage: Net New to Alpha
Feature group: sig-apps

This Kubernetes 1.30 enhancement offers an extension to the Job API, specifically for Indexed Jobs, allowing them to be declared as successful based on predefined conditions. This change addresses the need in certain batch workloads, like those using MPI or PyTorch, where success is determined by the completion of specific “leader” indexes rather than all indexes. 

Currently, a job is only marked as complete if every index succeeds, which is limiting for some applications. By introducing a success policy, which is already implemented in third-party frameworks like the Kubeflow Training Operator, Flux Operator, and JobSet, Kubernetes aims to provide more flexibility. This enhancement would enable the system to terminate any remaining pods once the job meets the criteria specified by the success policy.

CLI in Kubernetes 1.30

#4292 Custom profile in kubectl debug

Stage: Net New to Alpha
Feature group: sig-cli

This merged enhancement adds the --custom flag to kubectl debug to let users customize their debug resources. The enhancement of the kubectl debug feature is set to significantly improve the security posture for operations teams. 

Historically, the absence of a shell in base images posed a challenge for real-time debugging, which discouraged some teams from using these secure, minimalistic containers. Now, with the ability to attach data volumes within a debug container, end-users are enabled to perform in-depth analysis and troubleshooting without compromising on security. 

This capability promises to make the use of shell-less base images more appealing by simplifying the debugging process.
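As a sketch, the --custom flag points at a file describing a partial container spec that is merged into the debug container, for example to mount an existing volume read-only; the file format and field set shown here are illustrative:

```yaml
# debug-profile.yaml (illustrative): a partial container spec applied to the debug container
volumeMounts:
  - name: app-data            # assumes the target pod already defines a volume with this name
    mountPath: /mnt/app-data
    readOnly: true
securityContext:
  runAsNonRoot: true

# Used as (command shown for illustration):
#   kubectl debug -it mypod --image=busybox:stable --custom=debug-profile.yaml
```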

#2590 Add subresource support to kubectl

Stage: Graduating to Stable
Feature group: sig-cli

The proposal introduces a new --subresource=[subresource-name] flag for the kubectl commands get, patch, edit, and replace. 

This enhancement will enable users to access and modify status and scale subresources for all compatible API resources, including both built-in resources and Custom Resource Definitions (CRDs). The output for status subresources will be displayed in a formatted table similar to the main resource. 

This feature follows the same API conventions as full resources, allowing expected reconciliation behaviors by controllers. However, if the flag is used on a resource without the specified subresource, a ‘NotFound’ error message will be returned.
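For instance, fetching the scale subresource of a Deployment returns a Scale object rather than the Deployment itself (output trimmed):

```yaml
# kubectl get deployment web --subresource=scale -o yaml
apiVersion: autoscaling/v1
kind: Scale
metadata:
  name: web
  namespace: default
spec:
  replicas: 3
status:
  replicas: 3
  selector: app=web
```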

#3895 Interactive flag added to kubectl delete command

Stage: Graduating to Stable
Feature group: sig-cli

This proposal suggests introducing an interactive mode for the kubectl delete command to enhance safety measures for cluster administrators against accidental deletions of critical resources. 

The kubectl delete command is powerful and permanent, presenting risks of unintended consequences from errors such as mistyping or hasty decisions. To address the potential for such mishaps without altering the default behavior due to backward compatibility concerns, the proposal recommends a new interactive (-i) flag. 

This flag would prompt users for confirmation before executing the deletion, providing an additional layer of protection and decision-making opportunity to prevent accidental removal of essential resources.

Instrumentation

#647 API Server tracing

Stage: Graduating to Stable
Feature group: sig-instrumentation

This Kubernetes 1.30 enhancement aims to improve debugging through enhanced tracing in the API Server, utilizing OpenTelemetry libraries for structured, detailed trace data. It seeks to facilitate easier analysis by enabling distributed tracing, which allows for comprehensive insight into requests and context propagation. 

The proposal outlines goals to generate and export trace data for requests, alongside propagating context between incoming and outgoing requests, thus enhancing debugging capabilities and enabling plugins like admission webhooks to contribute to trace data for a fuller understanding of request paths.
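A minimal sketch of the tracing configuration file passed to kube-apiserver via --tracing-config-file; the served config API version may vary across releases:

```yaml
apiVersion: apiserver.config.k8s.io/v1beta1   # version may differ on your release
kind: TracingConfiguration
# OTLP/gRPC endpoint of an OpenTelemetry collector
endpoint: localhost:4317
# Sample roughly 1 in 10,000 requests
samplingRatePerMillion: 100
```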

#2305 Metric cardinality enforcement

Stage: Graduating to Stable
Feature group: sig-instrumentation

This enhancement addresses the issue of unbounded metric dimensions causing memory problems in instrumented components by introducing a dynamic, runtime-configurable allowlist for metric label values. 

Historically, the Kubernetes community has dealt with problematic metrics through various inconsistent approaches, including deleting offending labels or metrics entirely, or defining a retrospective set of acceptable values. These fixes are manual, labor-intensive, and time-consuming, lacking a standardized solution. 

This enhancement aims to remedy this by allowing metric dimensions to be bound to a predefined set of values independently of code releases, streamlining the process and preventing memory leaks without necessitating immediate binary releases.

#3077 Contextual Logging

Stage: Graduating to Beta
Feature group: sig-instrumentation

This contextual logging proposal introduces a shift from using a global logger to passing a logr.Logger instance through functions, either via a context.Context or directly, leveraging the benefits of structured logging. This method allows callers to enrich log messages with key/value pairs, specify names indicating the logging component or operation, and adjust verbosity to control the volume of logs generated by the callee. 

The key advantage is that this is achieved without needing to feed extra information to the callee, as the necessary details are encapsulated within the logger instance itself. 

Furthermore, it liberates third-party components utilizing Kubernetes packages like client-go from being tethered to the klog logging framework, enabling them to adopt any logr.Logger implementation and configure it to their preferences. For unit testing, this model facilitates isolating log output per test case, enhancing traceability and analysis. 

The primary goal is to eliminate klog’s direct API calls and its mandatory adoption across packages, empowering function callers with logging control, and minimally impacting public APIs while providing guidance and tools for integrating logging into unit tests.

Network in Kubernetes 1.30

#3458 Remove transient node predicates from KCCM’s service controller

Stage: Graduating to Stable
Feature group: sig-network

To mitigate hasty disconnection of services and to minimize the load on cloud providers’ APIs, a new proposal suggests a change in how the Kubernetes cloud controller manager (KCCM) interacts with load balancer node sets. 

This enhancement aims to discontinue the practice of immediate node removal when nodes temporarily lose readiness or are being terminated. Instead, by introducing the StableLoadBalancerNodeSet feature gate, it would promote a smoother transition by enabling connection draining, allowing applications to benefit from graceful shutdowns and reducing unnecessary load balancer re-syncs. This change is aimed at enhancing application reliability without overburdening cloud provider systems.

#3836 Ingress Connectivity Reliability Improvement for Kube-Proxy

Stage: Graduating to Beta
Feature group: sig-network

This Kubernetes 1.30 enhancement introduces modifications to the Kubernetes cloud controller manager’s service controller, specifically targeting the health checks (HC) used by load balancers. These changes aim to improve how these checks interact with kube-proxy, the service proxy managed by Kubernetes. There are three main improvements: 

1) Enabling kube-proxy to support connection draining on terminating nodes by failing its health checks when nodes are marked for deletion, particularly useful during cluster downsizing scenarios; 

2) Introducing a new /livez health check path in kube-proxy that maintains traditional health check semantics, allowing uninterrupted service during node terminations; 

3) Advocating for standardized health check procedures across cloud providers through a comprehensive guide on Kubernetes’ official website. 

These updates seek to ensure graceful shutdowns of services and improve overall cloud provider integration with Kubernetes clusters, particularly for services routed through nodes marked for termination.

#1860 Make Kubernetes aware of the LoadBalancer behavior

Stage: Graduating to Beta
Feature group: sig-network

This enhancement is a modification to the kube-proxy configurations for handling External IPs of LoadBalancer Services. Currently, kube-proxy implementations, including ipvs and iptables, automatically bind External IPs to each node for optimal traffic routing directly to services, bypassing the load balancer. This process, while beneficial in some scenarios, poses problems for certain cloud providers like Scaleway and Tencent Cloud, where such binding disrupts inbound traffic from the load balancer, particularly health checks. 

Additionally, features like TLS termination and the PROXY protocol implemented at the load balancer level are bypassed, leading to protocol errors. The enhancement suggests making this binding behavior configurable at the cloud controller level, allowing cloud providers to disable or adjust this default setting to better suit their infrastructure and service features, addressing these issues and potentially offering a more robust solution than current workarounds.
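Conceptually, the behavior surfaces as an ipMode value in the Service’s load balancer status, set by the cloud controller manager rather than by users; a sketch:

```yaml
# Status fragment of a Service of type LoadBalancer, written by the cloud controller manager
status:
  loadBalancer:
    ingress:
      - ip: 203.0.113.25      # placeholder external IP
        ipMode: Proxy         # traffic must traverse the load balancer; VIP preserves the old node-level binding
```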

Kubernetes 1.30 Nodes

#3960 Introducing Sleep Action for PreStop Hook

Stage: Graduating to Beta
Feature group: sig-node

This Kubernetes 1.30 enhancement introduces a ‘sleep’ action for the PreStop lifecycle hook, offering a simpler, native option for managing container shutdowns. 

Instead of relying on scripts or custom solutions for delaying termination, containers could use this built-in sleep to gracefully wrap up operations, easing transitions in load balancing, and allowing external systems to adjust, thereby boosting Kubernetes applications’ reliability and uptime.
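A minimal sketch of the built-in sleep action on a container’s PreStop hook:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: nginx:stable
      lifecycle:
        preStop:
          sleep:
            seconds: 10   # pause before termination proceeds, giving load balancers time to drain
```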

#2400 Node Memory Swap Support

Stage: Major Change to Beta
Feature group: sig-node

The enhancement integrates swap memory support into Kubernetes, addressing two key user groups: node administrators for performance tuning and application developers requiring swap for their apps. 

The focus is to facilitate controlled swap use on a node level, with the kubelet enabling Kubernetes workloads to utilize swap space under specific configurations. The ultimate goal is to enhance Linux node operation with swap, allowing administrators to determine swap usage for workloads, initially not permitting individual workloads to set their own swap limits.

#24 AppArmor Support

Stage: Graduating to Stable
Feature group: sig-node

Adding AppArmor support to Kubernetes marks a significant enhancement in the security posture of containerized workloads. AppArmor is a Linux kernel security module that allows system admins to restrict certain capabilities of a program using profiles attached to specific applications or containers. By integrating AppArmor into Kubernetes, developers can now define security policies directly within an app config.

The initial implementation of this feature would allow for specifying an AppArmor profile within the Kubernetes API for individual containers or entire pods. This profile, once defined, would be enforced by the container runtime, ensuring that the container’s actions are restricted according to the rules defined in the profile. This capability is crucial for running secure and confined applications in a multi-tenant environment, where a compromised container could potentially affect other workloads or the underlying host.

Scheduling

#3633 Introduce MatchLabelKeys to Pod Affinity and Pod Anti Affinity

Stage: Graduating to Beta
Feature group: sig-scheduling

This Kubernetes 1.30 enhancement introduces MatchLabelKeys for PodAffinityTerm to refine PodAffinity and PodAntiAffinity, enabling more precise control over Pod placements during scenarios like rolling upgrades. 

By allowing users to specify the scope for evaluating Pod co-existence, it addresses scheduling challenges that arise when new and old Pod versions are present simultaneously, particularly in saturated or idle clusters. This enhancement aims to improve scheduling effectiveness and cluster resource utilization.

#3902 Decouple TaintManager from NodeLifecycleController

Stage: Graduating to Stable
Feature group: sig-scheduling

This enhancement separated the NodeLifecycleController duties into two distinct controllers. Currently, the NodeLifecycleController is responsible for both marking unhealthy nodes with NoExecute taints and evicting pods from these tainted nodes. 

The proposal introduces a dedicated TaintEvictionController specifically for managing the eviction of pods based on NoExecute taints, while the NodeLifecycleController will continue to focus on applying taints to unhealthy nodes. This separation aims to streamline the codebase, allowing for more straightforward enhancements and the potential development of custom eviction strategies. 

The motivation behind this change is to untangle the intertwined functionalities, thus improving the system’s maintainability and flexibility in handling node health and pod eviction processes.

#3838 Mutable Pod scheduling directives when gated

Stage: Graduating to Stable
Feature group: sig-scheduling

The enhancement introduced in #3521, PodSchedulingReadiness, aimed at empowering external resource controllers – like extended schedulers or dynamic quota managers – to determine the optimal timing for a pod’s eligibility for scheduling by the kube-scheduler. 

Building on this foundation, the current enhancement seeks to extend the flexibility by allowing mutability in a pod’s scheduling directives, specifically node selector and node affinity, under the condition that such updates further restrict the pod’s scheduling options. This capability enables external resource controllers not just to decide the timing of scheduling but also to influence the specific placement of the pod within the cluster. 

This approach fosters a new pattern in Kubernetes scheduling, encouraging the development of lightweight, feature-specific schedulers that complement the core functionality of the kube-scheduler without the need for maintaining custom scheduler binaries. This pattern is particularly advantageous for features that can be implemented without the need for custom scheduler plugins, offering a streamlined way to enhance scheduling capabilities within Kubernetes ecosystems.

Kubernetes 1.30 storage

#3141 Prevent unauthorized volume mode conversion during volume restore

Stage: Graduating to Stable
Feature group: sig-storage

This enhancement addresses a potential security gap in Kubernetes’ VolumeSnapshot feature by introducing safeguards against unauthorized changes in volume mode during the creation of a PersistentVolumeClaim (PVC) from a VolumeSnapshot. 

It outlines a mechanism to ensure that the original volume mode of the PVC is preserved, preventing exploitation through kernel vulnerabilities, while accommodating legitimate backup and restore processes that may require volume mode conversion for efficiency. This approach aims to enhance security without impeding valid backup and restore workflows.
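
For illustration, the safeguard works by requiring an explicit opt-in before a PVC restored from a snapshot may change volume mode; the annotation below is how the enhancement describes that opt-in (treat the exact name and the rest of the manifest as assumptions to verify against your external-snapshotter version):

apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotContent
metadata:
  name: snapcontent-example
  annotations:
    # Without this annotation, restoring the snapshot into a PVC with a
    # different volumeMode than the source volume is rejected.
    snapshot.storage.kubernetes.io/allow-volume-mode-change: "true"
spec:
  deletionPolicy: Delete
  driver: example.csi.k8s.io         # placeholder CSI driver name
  source:
    snapshotHandle: snap-0123456789  # placeholder handle on the storage backend
  volumeSnapshotRef:
    name: snap-example
    namespace: default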

#1710 Speed up recursive SELinux label change

Stage: Net New to Beta
Feature group: sig-storage

This enhancement details improvements to SELinux integration with Kubernetes, focusing on enhancing security measures for containers running on Linux systems with SELinux in enforcing mode. The proposal outlined how SELinux prevents escaped container users from accessing host OS resources or other containers by assigning unique SELinux contexts to each container and labeling volume contents accordingly. 

The proposal also seeks to refine how Kubernetes handles SELinux contexts, offering the option to either set these manually via PodSpec or allow the container runtime to automatically assign them. Key advancements include the ability to mount volumes with specific SELinux contexts using the -o context= option during the first mount to ensure the correct security labeling, as well as recognizing which volume plugins support SELinux. 

The motivation behind these changes includes enhancing performance by avoiding extensive file relabeling, preventing space issues on nearly full volumes, and increasing security, especially for read-only and shared volumes. This approach aims to streamline SELinux policy enforcement across Kubernetes deployments, particularly in securing containerized environments against potential security breaches like CVE-2021-25741.
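
A minimal sketch of setting the context manually via the PodSpec (the MCS level, image, and claim name are illustrative; the mount-time relabeling speed-up itself is handled by the kubelet when the relevant SELinux feature gates are enabled):

apiVersion: v1
kind: Pod
metadata:
  name: selinux-demo
spec:
  securityContext:
    seLinuxOptions:
      # Arbitrary example MCS level; volumes are labeled with it at mount
      # time instead of being relabeled file by file.
      level: "s0:c123,c456"
  containers:
  - name: app
    image: busybox
    command: ["sleep", "3600"]
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: data-pvc   # assumed existing claim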

#3756 Robust VolumeManager reconstruction after kubelet restart

Stage: Graduating to Stable
Feature group: sig-storage

This enhancement addresses the issues with kubelet’s handling of mounted volumes after a restart, where it currently loses track of volumes for running Pods and attempts to reconstruct this state from the API server and the host OS – a process known to be flawed. 

It proposes a reworking of this process, essentially a substantial bugfix that impacts significant portions of kubelet’s functionality. Due to the scope of these changes, they will be implemented behind a feature gate, allowing users to revert to the old system if necessary. This initiative builds on the foundations laid in KEP 1790, which previously went alpha in v1.26. 

The modifications aim to enhance how kubelet, during startup, can better understand how volumes were previously mounted and assess whether any changes are needed. Additionally, it seeks to address issues like those documented in bug #105536, where volumes fail to be properly cleaned up after a kubelet restart, thus improving the overall robustness of volume management and cleanup.

Other enhancements

#1610 Container Resource based Pod Autoscaling

Stage: Graduating to Stable
Feature group: sig-autoscaling

This enhancement extends the Horizontal Pod Autoscaler’s (HPA) functionality, specifically allowing it to scale resources based on the usage metrics of individual containers within a pod. Currently, HPA aggregates resource consumption across all containers, which may not be ideal for complex workloads with containers whose resource usage does not scale uniformly. 

With the proposed changes, HPA would have the capability to scale more precisely by assessing the resource demands of each container separately.
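
As a sketch, the autoscaling/v2 API exposes this as a ContainerResource metric type that targets a single named container instead of the pod-wide aggregate (the Deployment and container names are placeholders):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: application   # scale on this container only, ignoring sidecars
      target:
        type: Utilization
        averageUtilization: 60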

#2799 Reduction of Secret-based Service Account Tokens

Stage: Graduating to Stable
Feature group: sig-auth

This improvement outlines measures to minimize the reliance on less secure, secret-based service account tokens following the general availability of BoundServiceAccountTokenVolume in Kubernetes 1.22. With this feature, service account tokens are acquired via the TokenRequest API and stored in a projected volume, making the automatic generation of secret-based tokens unnecessary. 

This aims to cease the auto-generation of these tokens and remove any that are unused, while still preserving tokens explicitly requested by users. The suggested approach includes modifying the service account control loop to prevent automatic token creation, promoting the use of the TokenRequest API or manually created tokens, and implementing a purge process for unused auto-generated tokens.
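
In practice, this means favoring short-lived, audience-bound tokens over long-lived Secrets. A minimal sketch of a pod requesting such a token through a projected volume (the ServiceAccount name, audience, and expiry are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: token-demo
spec:
  serviceAccountName: app-sa      # assumed existing ServiceAccount
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "sleep 3600"]
    volumeMounts:
    - name: sa-token
      mountPath: /var/run/secrets/tokens
  volumes:
  - name: sa-token
    projected:
      sources:
      - serviceAccountToken:
          path: app-token
          audience: api            # illustrative audience
          expirationSeconds: 3600  # kubelet rotates the token before it expires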

#4008 CRD Validation Ratcheting

Stage: Graduating to Beta
Feature group: sig-api-machinery

This proposal focuses on improving the usability of Kubernetes by advocating for the “shift left” of validation logic, moving it from controllers to the frontend when possible. Currently, the process of modifying validation for unchanged fields in a Custom Resource Definition (CRD) is cumbersome, often requiring version increments even for minor validation changes. This complexity hinders the adoption of advanced validation features by both CRD authors and Kubernetes developers, as the risk of disrupting user workflows is high. Such restrictions not only degrade user experience but also impede the progression of Kubernetes itself. 

For instance, KEP-3937 suggests introducing declarative validation with new format types, which could disrupt existing workflows. The goals of this enhancement are to eliminate the barriers that prevent CRD authors and Kubernetes from both broadening and tightening value validations without causing significant disruptions. The proposal aimed to automate these enhancements for all CRDs in clusters where the feature is enabled, maintaining performance with minimal overhead and ensuring correctness by preventing invalid values according to the known schema.

If you liked this, you might want to check out our previous ‘What’s new in Kubernetes’ editions:

Get involved with the Kubernetes project:

And if you enjoy keeping up to date with the Kubernetes ecosystem, subscribe to our container newsletter, a monthly email with the coolest stuff happening in the cloud-native ecosystem.

The post What’s New in Kubernetes 1.30? appeared first on Sysdig.

]]>
Container Drift Detection with Falco https://sysdig.com/blog/container-drift-detection-with-falco/ Tue, 27 Feb 2024 15:30:00 +0000 https://sysdig.com/?p=85013 DIE is the notion that an immutable workload should not change during runtime; therefore, any observed change is potentially evident...

The post Container Drift Detection with Falco appeared first on Sysdig.

]]>
DIE is the notion that an immutable workload should not change during runtime; therefore, any observed change is potentially evidence of malicious activity, also commonly referred to as Drift. Container Drift Detection provides an easy way to prevent attacks at runtime by simply following security best practices of immutability and ensuring containers aren’t modified after deployment in production.

Getting ahead of drift in container security

According to the Sysdig 2024 Cloud-Native Security & Usage Report, approximately 25% of Kubernetes users receive alerts on drift behavior. On the other hand, about 4% of teams are fully leveraging drift control policies by automatically blocking unexpected executions. In order to prevent drift, you need to be able to detect drift in real-time. And that’s where Falco’s rich system call collection and analysis is required. We will highlight how Falco rules can detect drift in real time, and provide some practical drift control advice.

Container Drift Detection

Container drift detection when files are open and written

This Falco rule is rather rudimentary, but it still achieves its intended purpose. It looks for the following event types: open, openat, openat2, and creat. This works, but it relies on fairly ambiguous kernel signals and therefore only works with Falco Engine version 6 or higher. As the maturity_sandbox tag and enabled: false setting below indicate, the rule ships disabled by default in the sandbox rules feed of the Falco Rules Maturity Framework.

- rule: Container Drift Detected (open+create)
  desc: Detects new executables created within a container as a result of open+create.
  condition: >
    evt.type in (open,openat,openat2,creat) 
    and evt.rawres>=0
    and evt.is_open_exec=true 
    and container 
    and not runc_writing_exec_fifo 
    and not runc_writing_var_lib_docker 
    and not user_known_container_drift_activities 
  enabled: false
  output: Drift detected (open+create), new executable created in a container (filename=%evt.arg.filename name=%evt.arg.name mode=%evt.arg.mode evt_type=%evt.type user=%user.name user_uid=%user.uid user_loginuid=%user.loginuid process=%proc.name proc_exepath=%proc.exepath parent=%proc.pname command=%proc.cmdline terminal=%proc.tty %container.info)
  priority: ERROR
  tags: [maturity_sandbox, container, process, filesystem, mitre_execution, T1059]

To see which Falco rules are in what status of the Falco Maturity Framework, check out this link.
maturity_stable indicates that the rule has undergone thorough evaluation by experts with hands-on production experience. These practitioners have determined that the rules embody best practices and exhibit optimal robustness, making it more difficult for attackers to bypass Falco detection.

Container Drift Detection through chmod

In Unix and similar operating systems, the chmod command and system call are utilized to modify the access rights and specific mode flags (such as setuid, setgid, and sticky flags) for file system entities, including both files and directories.

- rule: Container Drift Detected (chmod)
  desc: Detects when new executables are created in a container as a result of chmod.
  condition: >
    chmod 
    and container 
    and evt.rawres>=0 
    and ((evt.arg.mode contains "S_IXUSR") or
         (evt.arg.mode contains "S_IXGRP") or
         (evt.arg.mode contains "S_IXOTH"))
    and not runc_writing_exec_fifo 
    and not runc_writing_var_lib_docker 
    and not user_known_container_drift_activities 
  enabled: false
  output: Drift detected (chmod), new executable created in a container (filename=%evt.arg.filename name=%evt.arg.name mode=%evt.arg.mode evt_type=%evt.type user=%user.name user_uid=%user.uid user_loginuid=%user.loginuid process=%proc.name proc_exepath=%proc.exepath parent=%proc.pname command=%proc.cmdline terminal=%proc.tty %container.info)
  priority: ERROR
  tags: [maturity_sandbox, container, process, filesystem, mitre_execution, T1059]

While this Falco rule can generate significant noise, chmod usage is frequently linked to dropping and executing malicious implants. The rule is therefore disabled by default and placed within the “Sandbox” rules feed of the maturity matrix; however, it can be fine-tuned to work better for your environment.

The newer rule “Drop and execute new binary in container” provides more precise detection of this TTP using unambiguous kernel signals. It is recommended to use the new rule. However, this rule might be more relevant for auditing if applicable in your environment, such as when chmod is used on files within the /tmp folder.

Detect drift when a new binary is dropped and executed

It’s ideal to detect if an executable not belonging to the base image of a container is being executed. The drop-and-execute pattern can be observed very often after an attacker has gained an initial foothold. The proc.is_exe_upper_layer filter field only applies to container runtimes that use overlayfs as a union mount filesystem.

- rule: Drop and execute new binary in container
  desc: Detects if an executable not belonging to a container base image is executed.
  condition: >
    spawned_process
    and container
    and proc.is_exe_upper_layer=true 
    and not container.image.repository in (known_drop_and_execute_containers)
  output: Executing binary not part of base image (proc_exe=%proc.exe proc_sname=%proc.sname gparent=%proc.aname[2] proc_exe_ino_ctime=%proc.exe_ino.ctime proc_exe_ino_mtime=%proc.exe_ino.mtime proc_exe_ino_ctime_duration_proc_start=%proc.exe_ino.ctime_duration_proc_start proc_cwd=%proc.cwd container_start_ts=%container.start_ts evt_type=%evt.type user=%user.name user_uid=%user.uid user_loginuid=%user.loginuid process=%proc.name proc_exepath=%proc.exepath parent=%proc.pname command=%proc.cmdline terminal=%proc.tty exe_flags=%evt.arg.flags %container.info)
  priority: CRITICAL
  tags: [maturity_stable, container, process, mitre_persistence, TA0003, PCI_DSS_11.5.1]

Adopters can utilize the provided template list known_drop_and_execute_containers containing allowed container images known to execute binaries not included in their base image. Alternatively, you could exclude non-production namespaces in Kubernetes settings by adjusting the rule further. This helps reduce noise by applying application and environment-specific knowledge to this rule. Common anti-patterns include administrators or SREs performing ad-hoc debugging.
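
For example, a local rules file might extend that list with images you expect to legitimately drop binaries (the image names below are placeholders; depending on your Falco version you may need the list append/override syntax instead of a plain redefinition):

# falco_rules.local.yaml (sketch)
- list: known_drop_and_execute_containers
  items:
    - docker.io/myorg/ci-debug-runner        # placeholder: CI image that builds tooling at runtime
    - docker.io/myorg/data-science-notebook  # placeholder: notebook image that installs packages on the fly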


Enforcing Container Drift Prevention at Runtime

Detecting container drift in real time is critical in reducing the risk of data theft or credential access in running workloads. Activating preventive drift control measures in production should reduce the amount of potentially malicious events requiring incident response intervention by approximately 9%, according to Sysdig’s report. That’s where Falco Talon comes to the rescue.


Falco Talon is a response engine for managing threats in your Kubernetes environment. It enhances the solutions proposed by the Falco community with a no-code, tailor-made solution for Falco rules. With easy to configure response actions, you can prevent the indicators of compromise in milliseconds.

- action: Terminate Pod
  actionner: kubernetes:terminate
  parameters:
    ignoreDaemonsets: false
    ignoreStatefulsets: true

- rule: Drift Prevention in a Kubernetes Pod
  match:
    rules:
      - Drop and execute new binary in container
  actions:
    - action: Terminate Pod
      parameters:
        gracePeriods: 2

As you can see in the above Talon Rule Drift Prevention in a Kubernetes Pod, we have configured a response actionner for the Falco rule Drop and execute new binary in container. So, when a user attempts to alter a running container, we can instantly and gracefully terminate the pod, upholding the cloud-native principle of immutability. It’s crucial to remember the DIE concept here. Regular modifications during runtime, if not aligned with DIE, could lead to alert overload or significant system disruptions due to frequent workload interruptions when drift prevention is enabled.


If you do not intend to shut down the workload in response to container drift detections, you could alternatively run a shell script within the container to remove the recently dropped binary, or enforce a Kubernetes network policy that isolates the workload from a suspected C2 server; a minimal sketch of the latter follows.
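
As a rough sketch of that network-isolation option (the namespace and quarantine label are assumptions; in practice a response engine would label the offending pod), a deny-all NetworkPolicy cuts the workload off from any C2 traffic while you investigate:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: quarantine-drifted-pod
  namespace: production        # assumed namespace
spec:
  podSelector:
    matchLabels:
      quarantine: "true"       # label applied to the suspect pod by the responder
  policyTypes:
  - Ingress
  - Egress
  # No ingress or egress rules are defined, so all traffic to and from
  # the selected pod is denied.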

Conclusion

Drift control in running containers is not an optional feature, but rather a necessity when we talk about runtime security. When we look back at the DIE philosophy, we need a real-time approach, as seen in Falco, to protect immutable cloud-native workloads in Kubernetes. By leveraging Falco rules to monitor for unauthorized changes, such as file modifications or unexpected binary executions, organizations can detect and automatically mitigate potential security breaches through Falco Talon. This proactive approach to container security, emphasizing immutability and continuous surveillance, not only fortifies defenses against malicious activities but also aligns with best practices for maintaining the integrity and security of modern cloud-native applications.

Moreover, the adaptability of Falco’s rules to specific operational environments, through customization and the application of context-aware filters, enhances their effectiveness while minimizing false positives. This tailored approach ensures that security measures are both stringent and relevant, avoiding unnecessary alerts that could lead to alert fatigue among security teams. The journey towards a secure containerized environment is ongoing and requires vigilance, collaboration, and a commitment to security best practices.

The post Container Drift Detection with Falco appeared first on Sysdig.

]]>
The power of prioritization: Why practitioners need CNAPP with runtime insights https://sysdig.com/blog/why-practitioners-need-cnapp-with-runtime-insights/ Tue, 20 Feb 2024 15:00:00 +0000 https://sysdig.com/?p=84531 The heightened demand for cloud applications places a premium on the agility of development teams to swiftly create and deploy...

The post The power of prioritization: Why practitioners need CNAPP with runtime insights appeared first on Sysdig.

]]>
The heightened demand for cloud applications places a premium on the agility of development teams to swiftly create and deploy them. Simultaneously, security teams face the crucial task of safeguarding the organization’s cloud infrastructure without impeding the pace of innovation. Navigating this balance between speed and security has become a pivotal challenge, compelling security teams and developers to seek integrated solutions that safeguard the entire cloud-native application lifecycle — from development to production. 

This demand has given rise to the adoption of cloud-native application protection platforms (CNAPP). Security practitioners are embracing CNAPP to streamline their cloud security programs by consolidating point solutions into a single platform. Operating from a unified user interface, security teams gain comprehensive threat visibility across the organization’s cloud environments and workloads, offering a more effective and efficient approach to preventing, detecting, and responding to cloud security risks.

There are two questions CNAPP adopters must ask themselves:

  • How can security teams unlock the full potential of CNAPP to effectively carry out their responsibilities?
  • And how can they use CNAPPs to ensure development teams can swiftly build and deliver applications? 

The key lies in giving security practitioners the ability to identify and address real risks promptly. Enter runtime insights — the linchpin CNAPP capability that enables security teams to effectively prioritize the most important and relevant risks in their environment. 

It probably doesn’t come as a surprise that risk prioritization is the key for CNAPP practitioners to be successful. But to grasp the importance of runtime insights in delivering this capability, it’s important to understand the cloud security complexities driving the need for better prioritization.

Lack of end-to-end visibility and alert overload

While there are multiple factors driving the shift to CNAPP, one of the most important is the need for visibility into risk across the entire application lifecycle. As risk spreads throughout development, staging, and runtime operations, both security and DevOps teams need deep visibility and insights across the organization’s entire multi-cloud footprint. 

In order to ensure comprehensive visibility, a successful CNAPP must process substantial volumes of data from diverse sources. This encompasses data from system calls, Kubernetes audit logs, cloud logs, identity and access tools such as Okta, and more. Extensive coverage is crucial due to the many potential entry points for attacks, as well as the potential for attackers to move laterally across these domains. However, this analysis can generate a flood of alerts and findings that may or may not represent real risk. Security teams can get overwhelmed by the endless stream of alerts, impeding their ability to identify actual suspicious activity such as remote code execution (RCE), privilege escalation, or lateral movement across cloud workloads.

The backlog of notifications can also delay development, as developers waste time with false positives or remediating low-risk vulnerabilities. Without addressing this, security can quickly become an obstacle that slows the pace of innovation. 

Collectively, these challenges make it critical for CNAPPs to provide deeper insights and prioritize the most critical vulnerabilities based on runtime context. That’s where runtime insights excel, distinguishing the most effective CNAPP solutions from the rest.

Enable rapid risk prioritization with runtime insights

The key for security teams to prioritize the most impactful issues across cloud environments is runtime insights. Runtime insights provide actionable information on the most critical problems in an environment based on the knowledge of what is running right now. This provides a lens into what’s actually happening in deployments, allowing security and development teams to focus on current, exploitable risks. 

Runtime insights are an essential capability for an effective CNAPP solution to eliminate alert fatigue, provide deep visibility, and enable teams to identify real and relevant suspicious activity.

For example, a CNAPP with runtime insights:

  • Prioritizes the most critical vulnerabilities to fix by analyzing which packages are in use at runtime. Sysdig research shows that 87% of container images have high or critical vulnerabilities, but only 15% of vulnerabilities are actually tied to loaded packages at runtime.
  • Aids in promptly identifying anomalous behavior, suspicious activity, or posture drift that pose a genuine, immediate risk.
  • Highlights the excessive permissions to fix first by leveraging runtime access patterns. 
  • Guides remediation efforts that ultimately help teams make informed decisions directly where it matters most — at the source of the misconfiguration or vulnerability issue.

Runtime use case: Preventing lateral movement

Let’s explore how a CNAPP with runtime insights can effectively identify and mitigate a lateral movement attack across an organization’s two cloud vendor environments:

Attack path:

  1. Entry: The attacker exploits a publicly exposed critical vulnerability.
  2. Access: Having gained entry, the attacker now has access to a Kubernetes workload.
  3. Privilege escalation: Exploiting failed privilege controls and excessive unused permissions, the attacker escalates privileges, obtaining permissions with admin access.
  4. Lateral movement: Using acquired credentials, the attacker navigates across cloud environments, reaching a sensitive Amazon S3 bucket.

How runtime insights mitigate the attack:

  • Stop initial access by identifying in-use vulnerabilities:

Challenge: Teams face an overwhelming number of system vulnerabilities.

Solution: Using runtime insights, security teams can pinpoint which vulnerabilities are actively in use, enabling practitioners to prioritize immediate patching of exploitable entry points.

  • Track and control excess permissions to block lateral movement:

Challenge: Sorting through permissions can be daunting, leading to excessive and unnecessary access.

Solution: Security teams can leverage runtime insights to differentiate between actively used and excessively assigned permissions so practitioners can effectively ensure they’re applying the principle of least privilege. 

With proper runtime visibility, it is possible for teams to conduct a thorough analysis of permissions usage over an extended period (e.g., 30 to 90 days). If higher-level permissions remain unused during this time, this signals that they are likely unnecessary for regular operations. This proactive visibility equips teams with the knowledge to promptly remove unnecessary permissions, effectively thwarting an attacker’s ability to escalate privileges, and thereby preventing lateral movement.

By leveraging runtime insights, practitioners can significantly enhance their ability to detect, prioritize, and address critical elements of a lateral movement attack, ultimately fortifying the organization’s cloud infrastructure against such security threats.

Wrapping up

Prioritizing CNAPP alerts with runtime insights empowers security practitioners to prevent and respond to cloud security issues with greater efficiency and confidence. As organizations increasingly navigate cloud security complexities, runtime insights provide a decisive advantage by offering comprehensive visibility, enabling rapid risk prioritization, and mitigating alert overload. 

By addressing the challenges of end-to-end visibility and alert fatigue, CNAPPs equipped with runtime insights enable security and development teams to swiftly identify, prioritize, and address critical vulnerabilities, ensuring the organization’s cloud security posture aligns seamlessly with the pace of innovation. 

The post The power of prioritization: Why practitioners need CNAPP with runtime insights appeared first on Sysdig.

]]>
Ephemeral Containers and APTs https://sysdig.com/blog/ephemeral-containers-and-apts/ Mon, 19 Feb 2024 16:00:00 +0000 https://sysdig.com/?p=84522 The Sysdig Threat Research Team (TRT) published their latest Cloud-Native Security & Usage Report for 2024. As always, the research...

The post Ephemeral Containers and APTs appeared first on Sysdig.

]]>
The Sysdig Threat Research Team (TRT) published their latest Cloud-Native Security & Usage Report for 2024. As always, the research team managed to shed additional light on critical vulnerabilities inherent in current container security practices. This blog post delves into the intricate balance between convenience, operational efficiency, and the rising threats of Advanced Persistent Threats (APTs) in the world of ephemeral containers – and what we can do to prevent those threats in milliseconds.

Attackers Have Adapted to Ephemeral Containers

A striking revelation from the Sysdig report is the increasingly transient life of containers. Approximately 70% of containers now have a lifespan of less than five minutes. While this ephemeral nature can be beneficial for resource management, it also presents unique security challenges. Attackers, adapting to these fleeting windows, have honed their methods to conduct swift, automated reconnaissance. The report highlights that a typical cloud attack unfolds within a mere 10 minutes, underscoring the need for real-time response actions.

How to prevent data exfiltration in ephemeral containers

Many organizations have opted to use open-source Falco for real-time threat detection in cloud-native environments. In cases where the adversary opts to use an existing tool such as kubectl cp to copy artifacts from a container’s file system to a remote location via the Kubernetes control plane, Falco can trigger a detection within milliseconds.

- rule: Exfiltrating Artifacts via Kubernetes Control Plane
  desc: Detect artifacts exfiltration from a container's file system using kubectl cp.
  condition: >
    open_read 
    and container 
    and proc.name=tar 
    and container_entrypoint 
    and proc.tty=0 
    and not system_level_side_effect_artifacts_kubectl_cp
  output: Exfiltrating Artifacts via Kubernetes Control Plane (file=%fd.name evt_type=%evt.type user=%user.name user_uid=%user.uid user_loginuid=%user.loginuid process=%proc.name proc_exepath=%proc.exepath parent=%proc.pname command=%proc.cmdline terminal=%proc.tty)
  priority: NOTICE
  tags: [maturity_incubating, container, filesystem, mitre_exfiltration, TA0010]

This Falco rule can identify potential exfiltration of application secrets from an ephemeral container’s file system, potentially revealing the outcome of unauthorized access and control plane misuse via stolen identities (such as Kubernetes service account tokens). In cases where an attack can start and complete its goal in less than five minutes, a quick response action is critical. Unfortunately, this Falco rule alone will only notify users of the exfiltration attempt; we need an additional add-on to stop the action entirely.

Preventing Data Exfiltration with Falco Talon

Falco Talon was recently designed as an open-source response engine for isolating threats, specifically in the container orchestration platform Kubernetes. It enhances the cloud-native threat detection engine Falco with a no-code solution. With it, developer operations and security teams can seamlessly author simple Talon rules that respond to existing Falco detections in real time. Notice how the Talon rule below gracefully terminates a workload when it is flagged as triggering the aforementioned “Exfiltrating Artifacts via Kubernetes Control Plane” Falco rule.

- name: Prevent control plane exfiltration
  match:
    rules:
      - "Exfiltrating Artifacts via Kubernetes Control Plane"
  action:
    name: kubernetes:terminate
    parameters:
      ignoreDaemonsets: true
      ignoreStatefulsets: true
      grace_period_seconds: 0


In the above example, the action uses the existing Kubernetes primitives for graceful termination via the kubernetes:terminate actionner. It’s important that your application handles termination gracefully so that there is minimal impact on the end user and the time-to-recovery is as fast as possible, unlike SIGKILL, which is much more forceful.

In practice, this terminate action means your pod will handle the SIGTERM message and begin shutting down when it receives the message. This involves saving state, closing down network connections, finishing any work that is left.

In Falco Talon, the grace_period_seconds parameter specifies the duration in seconds before the pod is deleted; a value of zero means delete immediately. If configured this way, the attacker is instantly kicked out of the session and is therefore unable to exfiltrate data.
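
For reference, the workload side of graceful termination is governed by the pod’s own settings. Here is a minimal sketch (the image and cleanup script are hypothetical) of a pod that flushes state in a preStop hook and bounds how long Kubernetes waits after SIGTERM before sending SIGKILL:

apiVersion: v1
kind: Pod
metadata:
  name: graceful-app
spec:
  terminationGracePeriodSeconds: 30   # upper bound between SIGTERM and SIGKILL
  containers:
  - name: app
    image: myorg/app:1.0              # placeholder image
    lifecycle:
      preStop:
        exec:
          # Hypothetical script that saves state and drains connections
          # before the container receives SIGTERM.
          command: ["/bin/sh", "-c", "/app/flush-state.sh"]

A responder passing a grace period of zero, as in the Talon action above, overrides this setting and deletes the pod immediately.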


The Threat of Quick and Agile Attackers

The agility of attackers in the cloud environment cannot be underestimated. Once they gain access, they rapidly acquire an understanding of the environment, poised to advance their malicious objectives. This rapid adaptation means that even short-lived, vulnerable workloads can expose organizations to significant risks. The traditional security models, which rely on longer response times, are proving inadequate against these fast-paced threats.

Conclusion

The insights from the Sysdig report unequivocally call for a strategic reevaluation of security approaches in Kubernetes environments. In response to the challenges posed by limited visibility and the need for effective security controls in ephemeral containers and workloads, projects like the Cloud Native Computing Foundation’s (CNCF) Falco, and its latest open-source companion Falco Talon, have emerged as vital tools. Designed to tackle the intricacies of short-lived (less than 5 minutes) containers, these solutions offer real-time security monitoring and continuous scanning, transitioning from recommended practices to essential components in a Kubernetes security arsenal.

Organizations must find a balance between leveraging the convenience of cloud-native technologies and enforcing stringent security protocols. As attackers increasingly exploit the ephemeral nature of containers, the organizational response must be both dynamic and proactive. Tools like Falco and Falco Talon exemplify the kind of responsive, advanced security measures necessary to navigate this landscape. They provide the much-needed visibility and control to detect and respond to threats in real-time, thereby enhancing the security posture in these fast-paced environments.

Ensuring robust cybersecurity in the face of sophisticated threats is undoubtedly challenging, but with the right tools and strategies, it is within reach. The integration of solutions like Falco and Falco Talon into Kubernetes environments is key to safeguarding against today’s advanced threats, ensuring a secure, efficient, and resilient cloud-native ecosystem for tomorrow.

The post Ephemeral Containers and APTs appeared first on Sysdig.

]]>
Exploring Syscall Evasion – Linux Shell Built-ins https://sysdig.com/blog/exploring-syscall-evasion/ Wed, 14 Feb 2024 15:15:00 +0000 https://sysdig.com/?p=84306 This is the first article in a series focusing on syscall evasion as a means to work around detection by...

The post Exploring Syscall Evasion – Linux Shell Built-ins appeared first on Sysdig.

]]>
This is the first article in a series focusing on syscall evasion as a means to work around detection by security tools and what we can do to combat such efforts. We’ll be starting out the series discussing how this applies to Linux operating systems, but this is a technique that applies to Windows as well, and we’ll touch on some of this later on in the series. 

In this particular installment, we’ll be discussing syscall evasion with bash shell builtins. If you read that and thought “what evasion with bash what now?”, that’s ok. We’ll walk through it from the beginning. 

What is a Syscall?

System calls, commonly referred to as syscalls, are the interface between user-space applications and the kernel, which, in turn, talks to the rest of our resources, including files, networks, and hardware. Basically, we can consider syscalls to be the gatekeepers of the kernel when we’re looking at things from a security perspective.

Many security tools (Falco included) that watch for malicious activity taking place are monitoring syscalls going by. This seems like a reasonable approach, right? If syscalls are the gatekeepers of the kernel and we watch the syscalls with our security tool, we should be able to see all of the activity taking place on the system. We’ll just watch for the bad guys doing bad things with bad syscalls and then we’ll catch them, right? Sadly, no.

There is a dizzying array of syscalls, some of which have overlapping sets of functionality. For instance, if we want to open a file, there is a syscall called open() and we can look at the documentation for it here. So if we have a security tool that can watch syscalls going by, we can just watch for the open() syscall and we should be all good for monitoring applications trying to open files, right? Well, sort of.

If we look at the synopsis in the open() documentation:

[The SYNOPSIS section of the open(2) man page, listing the open(), creat(), openat(), and openat2() prototypes]

As it turns out, there are several syscalls that we could be using to open our file: open(), creat(), openat(), and openat2(), each of which has a somewhat different set of behaviors. For example, the main difference between open() and openat() is that the path for the file being opened by openat() is considered to be relative to the current working directory, unless an absolute path is specified. Depending on the operating system being used, the application in question, and what it is doing relative to the file, we may see different variations of the open syscalls taking place. If we’re only watching open(), we may not see the activity that we’re looking for at all.

Generally, security tools watch for the execve() syscall, which is one syscall indicating process execution taking place (there are others of a similar nature such as execveat(), clone(), and fork()). This is a safer thing to watch from a resource perspective, as it doesn’t take place as often as some of the other syscalls. This is also where most of the interesting activity is taking place. Many of the EDR-like tools watch this syscall specifically. As we’ll see here shortly, this is not always the best approach. 

There aren’t any bad syscalls we can watch, they’re all just tools. Syscalls don’t hack systems, people with syscalls hack systems. There are many syscalls to watch and a lot of different ways they can be used. On Linux, one of the common methods of interfacing with the OS is through system shells, such as bash and zsh. 

NOTE: If you want to see a complete* list of syscalls, take a gander at the documentation in the syscalls man page here. The list also shows which syscalls are specific to certain architectures or have been deprecated. *for certain values of complete

Examining Syscalls

Now that we have some ideas of what syscalls are, let’s take a quick look at some of them in action. On Linux, one of the primary tools for examining syscalls as they happen is strace. There are a few other tools we can use for this (including the open source version of Sysdig), which we will discuss at greater length in future articles. The strace utility allows us to snoop on syscalls as they’re taking place, which is exactly what we want when we’re trying to get a better view of what exactly is happening when a command executes. Let’s try this out:

1 – We’re going to make a new directory to perform our test in, then use touch to make a file in it. This will help minimize what we get back from strace, but it will still return quite a bit.

5 – Then, we’ll run strace and ask it to execute the ls command. Bear in mind that this is the output of a very small and strictly bounded test where we aren’t doing much. With a more complex set of commands, we would see many, many more syscalls. 

7 – Here, we can see the execve() syscall and the ls command being executed. This particular syscall is often the one monitored for by various detection tools, as it indicates program execution. Note that there are a lot of other syscalls happening in our example, but only one execve().

8 – From here on down, we can see a variety of syscalls taking place in order to support the ls command being executed. We won’t dig too deeply into the output here, but we can see various libraries being used, address space being mapped, bytes being read and written, etc.

$ mkdir test
$ cd test/
$ touch testfile

$ strace ls

execve("/usr/bin/ls", ["ls"], 0x7ffcb7920d30 /* 54 vars */) = 0
brk(NULL)                               = 0x5650f69b7000
arch_prctl(0x3001 /* ARCH_??? */, 0x7fff2e5ae540) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f07f9f63000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=61191, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 61191, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f07f9f54000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libselinux.so.1", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=166280, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 177672, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f07f9f28000
mprotect(0x7f07f9f2e000, 139264, PROT_NONE) = 0
mmap(0x7f07f9f2e000, 106496, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6000) = 0x7f07f9f2e000
mmap(0x7f07f9f48000, 28672, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x20000) = 0x7f07f9f48000
mmap(0x7f07f9f50000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x27000) = 0x7f07f9f50000
mmap(0x7f07f9f52000, 5640, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f07f9f52000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\3\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0P\237\2\0\0\0\0\0"..., 832) = 832
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
pread64(3, "\4\0\0\0 \0\0\0\5\0\0\0GNU\0\2\0\0\300\4\0\0\0\3\0\0\0\0\0\0\0"..., 48, 848) = 48
pread64(3, "\4\0\0\0\24\0\0\0\3\0\0\0GNU\0 =\340\2563\265?\356\25x\261\27\313A#\350"..., 68, 896) = 68
newfstatat(3, "", {st_mode=S_IFREG|0755, st_size=2216304, ...}, AT_EMPTY_PATH) = 0
pread64(3, "\6\0\0\0\4\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0@\0\0\0\0\0\0\0"..., 784, 64) = 784
mmap(NULL, 2260560, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f07f9c00000
mmap(0x7f07f9c28000, 1658880, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x28000) = 0x7f07f9c28000
mmap(0x7f07f9dbd000, 360448, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1bd000) = 0x7f07f9dbd000
mmap(0x7f07f9e15000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x214000) = 0x7f07f9e15000
mmap(0x7f07f9e1b000, 52816, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f07f9e1b000
close(3)                                = 0
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libpcre2-8.so.0", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\2\1\1\0\0\0\0\0\0\0\0\0\3\0>\0\1\0\0\0\0\0\0\0\0\0\0\0"..., 832) = 832
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=613064, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 615184, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7f07f9e91000
mmap(0x7f07f9e93000, 438272, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x2000) = 0x7f07f9e93000
mmap(0x7f07f9efe000, 163840, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x6d000) = 0x7f07f9efe000
mmap(0x7f07f9f26000, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x94000) = 0x7f07f9f26000
close(3)                                = 0
mmap(NULL, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f07f9e8e000
arch_prctl(ARCH_SET_FS, 0x7f07f9e8e800) = 0
set_tid_address(0x7f07f9e8ead0)         = 877628
set_robust_list(0x7f07f9e8eae0, 24)     = 0
rseq(0x7f07f9e8f1a0, 0x20, 0, 0x53053053) = 0
mprotect(0x7f07f9e15000, 16384, PROT_READ) = 0
mprotect(0x7f07f9f26000, 4096, PROT_READ) = 0
mprotect(0x7f07f9f50000, 4096, PROT_READ) = 0
mprotect(0x5650f62f3000, 4096, PROT_READ) = 0
mprotect(0x7f07f9f9d000, 8192, PROT_READ) = 0
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7f07f9f54000, 61191)           = 0
statfs("/sys/fs/selinux", 0x7fff2e5ae580) = -1 ENOENT (No such file or directory)
statfs("/selinux", 0x7fff2e5ae580)      = -1 ENOENT (No such file or directory)
getrandom("\x9a\x10\x6f\x3b\x21\xc0\xe9\x56", 8, GRND_NONBLOCK) = 8
brk(NULL)                               = 0x5650f69b7000
brk(0x5650f69d8000)                     = 0x5650f69d8000
openat(AT_FDCWD, "/proc/filesystems", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0444, st_size=0, ...}, AT_EMPTY_PATH) = 0
read(3, "nodev\tsysfs\nnodev\ttmpfs\nnodev\tbd"..., 1024) = 421
read(3, "", 1024)                       = 0
close(3)                                = 0
access("/etc/selinux/config", F_OK)     = -1 ENOENT (No such file or directory)
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
newfstatat(3, "", {st_mode=S_IFREG|0644, st_size=5712208, ...}, AT_EMPTY_PATH) = 0
mmap(NULL, 5712208, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7f07f9600000
close(3)                                = 0
ioctl(1, TCGETS, {B38400 opost isig icanon echo ...}) = 0
ioctl(1, TIOCGWINSZ, {ws_row=48, ws_col=143, ws_xpixel=0, ws_ypixel=0}) = 0
openat(AT_FDCWD, ".", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
newfstatat(3, "", {st_mode=S_IFDIR|0775, st_size=4096, ...}, AT_EMPTY_PATH) = 0
getdents64(3, 0x5650f69bd9f0 /* 3 entries */, 32768) = 80
getdents64(3, 0x5650f69bd9f0 /* 0 entries */, 32768) = 0
close(3)                                = 0
newfstatat(1, "", {st_mode=S_IFCHR|0620, st_rdev=makedev(0x88, 0x2), ...}, AT_EMPTY_PATH) = 0
write(1, "testfile\n", 9testfile
)               = 9
close(1)                                = 0
close(2)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++


Strace has a considerably larger set of capabilities than what we touched on here. A good starting place for digging into it further can be found in the documentation.

Now that we’ve covered syscalls, let’s talk a bit about system shells. 

Linux System Shell 101

System shells are interfaces that allow us to interact with an operating system. While shells can be graphical in nature, most of the time when we hear the word shell, it will be in reference to a command-line shell accessed through a terminal application. The shell interprets commands from the user and passes them on to the kernel via, you guessed it, syscalls. We can use the shell to interact with the resources we discussed earlier as being available via syscalls, such as networks, files, and hardware components. 

On any given Linux installation, there will be one or more shells installed. On a typical server or desktop installation, we’ll likely find a small handful of them installed by default. On a purposefully stripped-down distribution, such as those used for containers, there may only be one. 

On most distributions, we can easily ask about the shell environment that we are operating in: 

1 – Reading /etc/shells should get us a list of which shells are installed on the system. Here we can see sh, bash, rbash, dash, and zsh as available shells. 

NOTE: The contents of /etc/shells aren’t, in all cases, the complete list of shells on the system. It’s a list of which ones can be used as login shells. These are generally the same list, but YMMV.

15 – We can easily check which shell we’re currently using by executing echo $0. In this case, we’re running the bash shell.

19 – Switching to another shell is simple enough. We can see that zsh is present in our list of shells and we can change to it by simply issuing zsh from our current shell. 

21 – Once in zsh, we’ll ask which shell we are in again, and we can see it is now zsh.

25 – We’ll then exit zsh, which will land us back in our previous shell. If we check which shell we’re in again, we can see it is bash once again. 

$ cat /etc/shells

# /etc/shells: valid login shells
/bin/sh
/bin/bash
/usr/bin/bash
/bin/rbash
/usr/bin/rbash
/usr/bin/sh
/bin/dash
/usr/bin/dash
/bin/zsh
/usr/bin/zsh

$ echo $0

/bin/bash

$ zsh

% echo $0

zsh

% exit

$ echo $0

/bin/bash

As we walk through the rest of our discussion, we’ll be focusing on the bash shell. The various shells have somewhat differing functionality, but are usually similar, at least in broad strokes. Bash stands for “Bourne Again SHell” as it was designed as a replacement for the original Bourne shell. We’ll often find the Bourne shell on many systems also. It’s in the list we looked at above at /bin/sh

All this is great, you might say, but we were promised syscall evasion. Hold tight, we have one more background bit to cover, then we’ll talk about those parts. 

Shell Builtins vs. External Binaries

When we execute a command in a shell, it can fall into one of several categories:

  • It can be a program binary external to our shell (we’ll call it a binary for short). 
  • It can be an alias, which is a sort of macro pointing to another command or commands. 
  • It can be a function, which is a user defined script or sequence of commands. 
  • It can be a keyword, a common example of which would be something like ‘if’ which we might use when writing a script. 
  • It can be a shell builtin, which is, as we might expect, a command built into the shell itself.

We’ll focus primarily on binaries and builtins here. 

Identifying External Binaries

Let’s take another look at the ls command:

1 – We can use the which command to see the location of the command being executed when we run ls. We’ll use the -a switch so it will return all of the results. We can see there are a couple of results, but this doesn’t tell us what ls is, just where it is.

6 – To get a better idea of what is on the other end of ls when we run it, we can use the type command. Again, we’ll add the -a switch to get all the results. Here, we can see that there is one alias and two files in the filesystem behind the ls command.

7 – First, the alias will be evaluated. This particular alias adds the switch to colorize the output of ls when we execute it. 

8 – After this, there are two ls binaries in the filesystem. Which of these is executed depends on the order of our path. 

11 – If we take a look at the path, we can see that /usr/bin appears in the path before /bin, so /usr/bin/ls is the command ultimately executed by the ls alias when we type ls into our shell. The final piece of information we need to know here is what type of command this particular ls is.

15 – We can use the file command to dig into ls. File tells us that this particular version of ls is a 64-bit ELF binary. Circling all the way back around to our discussion on types of commands, this makes ls an external binary. 

21 – Incidentally, if we look at the other ls located in /bin, we will find that it is an identical file with an identical hash. What is this sorcery? If we use file to interrogate /bin, we’ll see that it’s a symlink to /usr/bin. We’re seeing the ls binary twice, but there is really only one copy of the file. 

$ which -a ls
/usr/bin/ls
/bin/ls


$ type -a ls
ls is aliased to `ls --color=auto'
ls is /usr/bin/ls
ls is /bin/ls

$ echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/sna
p/bin:/snap/bin

$ file /usr/bin/ls
/usr/bin/ls: ELF 64-bit LSB pie executable, x86-64, version 1 (SYSV), dynamically
 linked, interpreter /lib64/ld-linux-x86-64.so.2,
 BuildID[sha1]=897f49cafa98c11d63e619e7e40352f855249c13, for GNU/Linux 3.2.0,
 stripped

$ file /bin
/bin: symbolic link to usr/bin

Identifying Shell Builtins

We briefly mentioned that a shell builtin is built into the binary of the shell itself. The builtins available for any given shell can vary quite widely. Let’s take a quick look at what we have available in bash:

1 – The compgen command is one of those esoteric command line kung-fu bits. In this case, we’ll use it with the -b switch, which effectively says “show me all the shell builtins.” We’ll also do a little formatting to show the output in columns and then show a count of the results.

2 – We can see some common commands in the output, like cd, echo, and pwd (also note the compgen command we just ran). When we execute these, we don’t reach out to any other binaries inside the filesystem, we do it all inside of the bash shell already running. 

17 – We should also note that just because one of these commands is in the builtins list for our shell doesn’t mean it can’t also exist elsewhere. If we use the type command again to inquire about echo, which is in our builtins list, type will tell us it is a shell builtin, but we will also see a binary sitting in the filesystem. If we run echo from bash, we will get the builtin, but if we run it from another shell without a builtin echo, we may get the one from the filesystem instead. 

$ compgen -b | pr -5 -t; echo "Count: $(compgen -b | wc -l)"
.           compopt     fc          popd        suspend
:           continue    fg          printf      test
[           declare     getopts     pushd       times
alias       dirs        hash        pwd         trap
bg          disown      help        read        true
bind        echo        history     readarray   type
break       enable      jobs        readonly    typeset
builtin     eval        kill        return      ulimit
caller      exec        let         set         umask
cd          exit        local       shift       unalias
command     export      logout      shopt       unset
compgen     false       mapfile     source      wait
complete
Count: 61

$ type -a echo
echo is a shell builtin
echo is /usr/bin/echo
echo is /bin/echo


It’s also important to note that this set of builtins are specific to the bash shell, and other shells may be very different. Let’s take a quick look at the builtins for zsh.

1 - Zsh doesn’t have compgen, so we’ll need to get the data we want in a different manner. We’ll access the builtins associative array, which contains all the builtin commands of zsh, then do some formatting to make the results a bit more sane and put the output into columns, lastly getting a count of the results.

% print -roC5 -- ${(k)builtins}; echo "Count: ${(k)#builtins}"

-                compquote     fg            pushln       umask
.                compset       float         pwd          unalias
:                comptags      functions     r            unfunction
[                comptry       getln         read         unhash
alias            compvalues    getopts       readonly     unlimit
autoload         continue      hash          rehash       unset
bg               declare       history       return       unsetopt
bindkey          dirs          integer       sched        vared
break            disable       jobs          set          wait
builtin          disown        kill          setopt       whence
bye              echo          let           shift        where
cd               echotc        limit         source       which
chdir            echoti        local         suspend      zcompile
command          emulate       log           test         zf_ln
compadd          enable        logout        times        zformat
comparguments    eval          noglob        trap         zle
compcall         exec          popd          true         zmodload
compctl          exit          print         ttyctl       zparseopts
compdescribe     export        printf        type         zregexparse
compfiles        false         private       typeset      zstat
compgroups       fc            pushd         ulimit       zstyle
Count: 105
NOTE:
Print what now? The command % print -roC5 -- ${(k)builtins}; echo "Count: ${(k)#builtins}" can be a bit difficult to parse. Here’s a breakdown of what each part does:

  • %: This indicates that we’re (probably) in the Zsh shell.
  • print: This is a command in Zsh used to display text.
  • -roC5: These are options for the print command:
    • -r: Don’t treat backslashes as escape characters.
    • -o: Sort the printed list in alphabetical order.
    • C5: Format the output into 5 columns.
  • --: This signifies the end of the options for the command. Anything after this is treated as an argument, not an option.
  • ${(k)builtins}: This is a parameter expansion in Zsh:
    • ${…}: Parameter expansion syntax in Zsh.
    • (k): A flag to list the keys of an associative array.
    • builtins: Refers to an associative array in Zsh that contains all built-in commands.
  • echo "Count: ${(k)#builtins}": This part of the command prints the count of built-in commands.

In simple terms, this command lists all the built-in commands available in the Zsh shell, formats them into five columns, and then displays the total count of these commands.

We can see here that there are over 40 more builtins in zsh than there are in bash. Many of them are the same as what we see in bash, but the availability of builtin commands is something to validate when working with different shells. We’ll continue working with bash as it’s one of the more commonly used shells that we might encounter, but this is certainly worth bearing in mind. 

Now that we know a bit about the shell and shell builtins, let’s look at how we can use these for syscall evasion.

Syscall Evasion Techniques Using Bash Builtins

As we mentioned earlier, many security tools that monitor syscalls monitor for process execution via the execve() syscall. From a certain tool design perspective, this is a great solution as it limits the number of syscalls we need to watch and should catch most of the interesting things going on. For example, let’s use cat to read out the contents of a file and watch what happens with strace:

1 – First, we’ll echo a bit of data into the test file we used earlier so we have something to play with. Then, we’ll cat the file and we can see the output with the file contents.

5 – Now let’s do this again, but this time we’ll watch what happens with strace. We’ll spin up a new bash shell which we will monitor with strace. This time, we’ll also add the -f switch so strace will monitor subprocesses as well. This will result in a bit of extra noise in the output, but we need this in order to get a better view of what is happening as we’re operating in a new shell. Note that strace is now specifying the pid (process id) at the beginning of each syscall as we’re watching multiple processes.

6 – Here we have the execve() syscall taking place for the bash shell we just started. We can see the different subprocesses taking place as bash starts up.

34 – Now we’re dropped back to a prompt, but still operating inside the shell being monitored with strace. Let’s cat the file again and watch the output. 

37 – We can see the syscall for our cat here, along with the results of the command. This is all great, right? We were able to monitor the command with strace and see its execution. We saw the exact command we ran and the output of the command. 

$ echo supersecretdata >> testfile
$ cat testfile 
supersecretdata

$ strace -f -e trace=execve bash
execve("/usr/bin/bash", ["bash"], 0x7ffee6b6c710 /* 54 vars */) = 0
strace: Process 884939 attached
[pid 884939] execve("/usr/bin/lesspipe", ["lesspipe"], 0x55aa1d8a3090 /* 54 vars */) = 0
strace: Process 884940 attached
[pid 884940] execve("/usr/bin/basename", ["basename", "/usr/bin/lesspipe"],
0x55983907af68 /* 54 vars */) = 0
[pid 884940] +++ exited with 0 +++
[pid 884939] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED,
 si_pid=884940, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
strace: Process 884941 attached
strace: Process 884942 attached
[pid 884942] execve("/usr/bin/dirname", ["dirname", "/usr/bin/lesspipe"],
0x559839087108 /* 54 vars */) = 0
[pid 884942] +++ exited with 0 +++
[pid 884941] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=884942, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
[pid 884941] +++ exited with 0 +++
[pid 884939] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=884941, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
[pid 884939] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=884939,
si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
strace: Process 884943 attached
[pid 884943] execve("/usr/bin/dircolors", ["dircolors", "-b"], 0x55aa1d8a2d10 /* 54 vars
*/) = 0
[pid 884943] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=884943,
si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
$ cat testfile

strace: Process 884946 attached
[pid 884946] execve("/usr/bin/cat", ["cat", "testfile"], 0x55aa1d8a9520 /* 54 vars */) = 0
supersecretdata
[pid 884946] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=884946,
si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---

$ exit
exit
+++ exited with 0 +++


Let’s try being sneakier about things by using a little shell scripting with bash builtins and see what the results are: 

1 – We’ll start a new bash shell and watch it with strace, the same as we did previously.

3 – Here’s the execve() syscall for the bash shell, just as we expected.

31 – And we’re dropped back to the prompt. This time, instead of using cat, we’ll use two of the bash builtins to frankenstein a command together and replicate what cat does:

while IFS= read -r line; do echo "$line"; done < testfile

This uses the bash builtins read and echo to process our file line by line. We use read to fetch each line from testfile into the variable line, with the -r switch to ensure any backslashes are read literally. The IFS= (internal field separator) preserves leading and trailing whitespaces. Then, echo outputs each line exactly as it’s read.

35 – Zounds! We’re dropped back to the prompt with no output from strace at all.

$ strace -f -e trace=execve bash

execve("/usr/bin/bash", ["bash"], 0x7fff866fefc0 /* 54 vars */) = 0
strace: Process 884993 attached
[pid 884993] execve("/usr/bin/lesspipe", ["lesspipe"], 0x5620a56bf090 /* 54 vars */) = 0
strace: Process 884994 attached
[pid 884994] execve("/usr/bin/basename", ["basename", "/usr/bin/lesspipe"],
0x558950f6cf68 /* 54 vars */) = 0
[pid 884994] +++ exited with 0 +++
[pid 884993] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED,
si_pid=884994, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
strace: Process 884995 attached
strace: Process 884996 attached
[pid 884996] execve("/usr/bin/dirname", ["dirname", "/usr/bin/lesspipe"],
0x558950f79108 /* 54 vars */) = 0
[pid 884996] +++ exited with 0 +++
[pid 884995] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED,
si_pid=884996, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
[pid 884995] +++ exited with 0 +++
[pid 884993] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED,
si_pid=884995, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
[pid 884993] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=884993,
si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
strace: Process 884997 attached
[pid 884997] execve("/usr/bin/dircolors", ["dircolors", "-b"], 0x5620a56bed10 /* 54 vars
*/) = 0
[pid 884997] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=884997,
si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
$ while IFS= read -r line; do echo "$line"; done < testfile

supersecretdata

$ 
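
Incidentally, the read/echo loop isn’t the only builtin-only way to pull this off. As a further sketch (assuming bash), command substitution with a bare redirection also reads a file without exec’ing any external binary, so it’s equally invisible to a filter that only watches execve():

$ echo "$(<testfile)"
supersecretdata

Like the loop, though, it still has to open the file, which matters for what comes next.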

If we can’t see the activity while monitoring for process execution, how do we find it?

Looking for Syscalls in All the Right Places

The problem we were encountering with not seeing the sneaky bash builtin activity was largely due to looking in the wrong place. We couldn’t see anything happening with execve() because there was nothing to see. In this particular case, we know a file is being opened, so let’s try one of the open syscalls. We’re going to cheat and jump directly to looking at openat(), but it could very well be any of the open syscalls we discussed earlier. 

1 – We’ll start up the strace-monitored bash shell again. This time, our filter is based on openat() instead of execve().

2 – Note that we see a pretty different view of what is taking place when bash starts up this time since we’re watching for files being opened. 

72 – Back at the prompt, we’ll run our sneaky bit of bash script to read the file. 

73 – Et voilà, we see the openat() syscall for our file being opened and the resulting output. 

$ strace -f -e trace=openat bash
openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libtinfo.so.6", O_RDONLY|O_CLOEXEC) =
3
openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/dev/tty", O_RDWR|O_NONBLOCK) = 3
openat(AT_FDCWD, "/usr/lib/locale/locale-archive", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache",
O_RDONLY) = 3
openat(AT_FDCWD, "/etc/nsswitch.conf", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/etc/passwd", O_RDONLY|O_CLOEXEC) = 3
openat(AT_FDCWD, "/lib/terminfo/x/xterm-256color", O_RDONLY) = 3
openat(AT_FDCWD, "/etc/bash.bashrc", O_RDONLY) = 3
openat(AT_FDCWD, "/home/user/.bashrc", O_RDONLY) = 3
openat(AT_FDCWD, "/home/user/.bash_history", O_RDONLY) = 3
strace: Process 984240 attached
[pid 984240] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 984240] openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6",
O_RDONLY|O_CLOEXEC) = 3
[pid 984240] openat(AT_FDCWD, "/usr/bin/lesspipe", O_RDONLY) = 3
strace: Process 984241 attached
[pid 984241] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 984241] openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6",
O_RDONLY|O_CLOEXEC) = 3
[pid 984241] openat(AT_FDCWD, "/usr/lib/locale/locale-archive",
O_RDONLY|O_CLOEXEC) = 3
[pid 984241] +++ exited with 0 +++
[pid 984240] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED,
si_pid=984241, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
strace: Process 984242 attached
strace: Process 984243 attached
[pid 984243] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 984243] openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6",
O_RDONLY|O_CLOEXEC) = 3
[pid 984243] openat(AT_FDCWD, "/usr/lib/locale/locale-archive",
O_RDONLY|O_CLOEXEC) = 3
[pid 984243] +++ exited with 0 +++
[pid 984242] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED,
si_pid=984243, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
[pid 984242] +++ exited with 0 +++
[pid 984240] --- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED,
si_pid=984242, si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
[pid 984240] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=984240,
si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
strace: Process 984244 attached
[pid 984244] openat(AT_FDCWD, "/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
[pid 984244] openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6",
O_RDONLY|O_CLOEXEC) = 3
[pid 984244] openat(AT_FDCWD, "/usr/lib/locale/locale-archive",
O_RDONLY|O_CLOEXEC) = 3
[pid 984244] openat(AT_FDCWD,
"/usr/lib/x86_64-linux-gnu/gconv/gconv-modules.cache", O_RDONLY) = 3
[pid 984244] +++ exited with 0 +++
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=984244,
si_uid=1000, si_status=0, si_utime=0, si_stime=0} ---
openat(AT_FDCWD, "/usr/share/bash-completion/bash_completion", O_RDONLY) = 3
openat(AT_FDCWD, "/etc/init.d/",
O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
openat(AT_FDCWD, "/etc/bash_completion.d/",
O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
openat(AT_FDCWD, "/etc/bash_completion.d/apport_completion", O_RDONLY) = 3
openat(AT_FDCWD, "/etc/bash_completion.d/git-prompt", O_RDONLY) = 3
openat(AT_FDCWD, "/usr/lib/git-core/git-sh-prompt", O_RDONLY) = 3
openat(AT_FDCWD, "/dev/null", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 3
openat(AT_FDCWD, "/home/user/.bash_history", O_RDONLY) = 3
openat(AT_FDCWD, "/home/user/.bash_history", O_RDONLY) = 3
openat(AT_FDCWD, "/home/user/.inputrc", O_RDONLY) = -1 ENOENT (No such file or
directory)
openat(AT_FDCWD, "/etc/inputrc", O_RDONLY) = 3

$ while IFS= read -r line; do echo "$line"; done < testfile
openat(AT_FDCWD, "testfile", O_RDONLY)  = 3
supersecretdata

We can catch the activity from the shell builtins, in most cases, but it’s a matter of looking in the right places for the activity we want. It might be tempting to think we could just watch all the syscalls all the time, but doing so quickly becomes untenable. Our example above produces somewhere around 50 lines of strace output when we are filtering just for openat(). If we take the filtering off entirely and watch for all syscalls, it balloons out to 1,200 lines of output. 

This is being done inside a single shell with not much else going on. If we tried to do this across a running system, we would see exponentially more in the brief period of time before it melted down into a puddle of flaming goo from the load. In other words, there really isn’t any reasonable way to watch all the syscall activity all the time. The best we can do is to be intentional with what we choose to monitor. 
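
One practical way to be intentional without enumerating individual syscalls is strace’s syscall classes. For instance, trace=file (newer strace releases also accept the %file spelling) covers the calls that take a filename argument, which should include both execve() and openat(), so a single filter catches process execution and our sneaky builtin file read while staying far quieter than tracing everything:

$ # trace only filename-related syscalls for the shell and its children
$ strace -f -e trace=file bash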

Conclusion

This exploration into syscall evasion using bash shell builtins illuminates just a fraction of the creative and subtle ways in which system interactions can be manipulated to bypass security measures. Security tools that solely focus on process execution for monitoring are inherently limited in scope, and a more nuanced and comprehensive approach to monitoring system activity is needed to provide a better level of security.

The simple example we put together for replicating the functionality of cat dodged this entirely and allowed us to read the data from our file while flying completely under the radar of tools that were only looking for process execution. Unfortunately, this is the tip of the iceberg. 

Using the bash builtins in a similar fashion to what we did above, there are a number of similar ways we can combine them to replicate functionality of other tools and attacks. A very brief amount of Googling will turn up a well-known method for assembling a reverse shell using the bash builtins. Furthermore, we have all the various shells and all their different sets of builtins at our disposal to tinker with (we’ll leave this as an exercise for the reader). 

In the coming articles in this series, we’ll look at some other methods of syscall evasion. If you want to learn more, explore Defense evasion techniques with Falco.  

The post Exploring Syscall Evasion – Linux Shell Built-ins appeared first on Sysdig.

]]>
Resource Constraints in Kubernetes and Security https://sysdig.com/blog/resource-constraints-in-kubernetes-and-security/ Mon, 12 Feb 2024 15:15:00 +0000 https://sysdig.com/?p=84236 The Sysdig 2024 Cloud‑Native Security and Usage Report highlights the evolving threat landscape, but more importantly, as the adoption of...

The post Resource Constraints in Kubernetes and Security appeared first on Sysdig.

]]>
The Sysdig 2024 Cloud-Native Security and Usage Report highlights the evolving threat landscape, but more importantly, as the adoption of cloud-native technologies such as containers and Kubernetes continues to increase, not all organizations are following best practices. This ultimately hands attackers an advantage when it comes to exploiting container resources in platforms such as Kubernetes.

Balancing resource management with security is not just a technical challenge, but also a strategic imperative. Surprisingly, Sysdig’s latest research report identified that less than half of Kubernetes environments have alerts for CPU and memory usage, and that the majority lack maximum limits on these resources. This trend isn’t just about overlooking a security practice; it’s a reflection of prioritizing availability and development agility over potential security risks.

The security risks of unchecked resources

Unlimited resource allocation in Kubernetes pods presents a golden opportunity for attackers. Without constraints, malicious entities can exploit your environment, launching cryptojacking attacks or initiating lateral movements to target other systems within your network. The absence of resource limits not only escalates security risks but can also lead to substantial financial losses due to unchecked resource consumption by these attackers.

Resource Constraints in Kubernetes

A cost-effective security strategy

In the current economic landscape, where every penny counts, understanding and managing resource usage is as much a financial strategy as it is a security one. By identifying and reducing unnecessary resource consumption, organizations can achieve significant cost savings – a crucial aspect in both cloud and container environments.

Enforcing resource constraints in Kubernetes

Implementing resource constraints in Kubernetes is straightforward yet impactful. To apply resource constraints to an example atomicred tool deployment in Kubernetes, users can simply modify their deployment manifest to include resource requests and limits.

Here’s how the Kubernetes project recommends enforcing those changes:

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: atomicred
  namespace: atomic-red
  labels:
    app: atomicred
spec:
  replicas: 1
  selector:
    matchLabels:
      app: atomicred
  template:
    metadata:
      labels:
        app: atomicred
    spec:
      containers:
      - name: atomicred
        image: issif/atomic-red:latest
        imagePullPolicy: "IfNotPresent"
        command: ["sleep", "3560d"]
        securityContext:
          privileged: true
        resources:
          requests:
            memory: "64Mi"
            cpu: "250m"
          limits:
            memory: "128Mi"
            cpu: "500m"
      nodeSelector:
        kubernetes.io/os: linux
EOF

In this manifest, we set both requests and limits for CPU and memory as follows:

  • requests: Amount of CPU and memory that Kubernetes will guarantee for the container. In this case, 64Mi of memory and 250m CPU (where 1000m equals 1 CPU core).
  • limits: The maximum amount of CPU and memory the container is allowed to use.
    If the container tries to exceed these limits, it will be throttled (CPU) or killed and possibly restarted (memory). Here, it’s set to 128Mi of memory and 500m CPU.

This setup ensures that the atomicred tool is allocated enough resources to function efficiently while preventing it from consuming excessive resources that could impact other processes in your Kubernetes cluster. Those request constraints guarantee that the container gets at least the specified resources, while limits ensure it never goes beyond the defined ceiling. This setup not only optimizes resource utilization but also guards against resource depletion attacks.

Monitoring resource constraints in Kubernetes

To check the resource constraints of a running pod in Kubernetes, use the kubectl describe command. The command provided will automatically describe the first pod in the atomic-red namespace with the label app=atomicred.

kubectl describe pod -n atomic-red $(kubectl get pods -n atomic-red -l app=atomicred -o jsonpath="{.items[0].metadata.name}")
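
If you only want the resources section rather than the full describe output, a jsonpath query against the same label selector (a minimal sketch) returns just that block:

kubectl get pod -n atomic-red -l app=atomicred \
  -o jsonpath='{.items[0].spec.containers[0].resources}'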

What happens if we abuse these limits?

To test CPU and memory limits, you can run a container that deliberately tries to consume more resources than allowed by its limits. However, this can be a bit complex:

  • CPU: If a container attempts to use more CPU resources than its limit, Kubernetes will throttle the CPU usage of the container. This means the container won’t be terminated but will run slower.
  • Memory: If a container tries to use more memory than its limit, it will be terminated by Kubernetes once it exceeds the limit. This is known as an Out Of Memory (OOM) kill.

Creating a stress test container

You can create a new deployment that intentionally stresses the resources.
For example, you can use a tool like stress to consume CPU and memory deliberately:

kubectl apply -f - <<EOF
apiVersion: apps/v1
kind: Deployment
metadata:
  name: resource-stress-test
  namespace: atomic-red
spec:
  replicas: 1
  selector:
    matchLabels:
      app: resource-stress-test
  template:
    metadata:
      labels:
        app: resource-stress-test
    spec:
      containers:
      - name: stress
        image: polinux/stress
        resources:
          limits:
            memory: "128Mi"
            cpu: "500m"
        command: ["stress"]
        args: ["--vm", "1", "--vm-bytes", "150M", "--vm-hang", "1"]
EOF

The deployment specification defines a single container using the image polinux/stress, which is an image commonly used for generating workload and stress testing. Under the resources section, we define the resource limits for the container. The stress workload will try to allocate 150M of memory, but the maximum threshold for memory is fixed at a 128Mi limit. 

A command is run inside the container that tells the stress tool to spin up a 150 MB virtual memory workload and hang for one second. This is a common way to perform stress testing with this container image. 

Once the deployment is applied, the pod reports an OOMKilled status. This means that the container was killed due to being out of memory. If an attacker was running a cryptomining binary within the pod at the time of the OOMKilled action, they would be kicked out, the pod would go back to its original state (effectively removing any instance of the mining binary), and the pod would be recreated.
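
To confirm this from the command line, a couple of quick checks against the deployment’s label (a sketch reusing the manifest above) will show the kill and the restart count climbing:

# Watch the pod cycle as stress repeatedly exceeds its memory limit
kubectl get pods -n atomic-red -l app=resource-stress-test -w

# Inspect the container's last terminated state; expect Reason: OOMKilled
kubectl describe pod -n atomic-red -l app=resource-stress-test | grep -A5 "Last State"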

Alerting on pods deployed without resource constraints

You might be wondering whether you have to describe every pod to ensure it has proper resource constraints in place. While you could do that, it’s not exactly scalable. You could, of course, export Kubernetes state metrics to Prometheus and report on them accordingly. Alternatively, if you already have Falco installed in your Kubernetes cluster, you could apply the below Falco rule to detect instances where a pod is successfully deployed without resource constraints.

- rule: Create Pod Without Resource Limits
  desc: Detect pod created without defined CPU and memory limits
  condition: kevt and pod and kcreate 
             and not ka.target.subresource in (resourcelimits)
  output: Pod started without CPU or memory limits (user=%ka.user.name pod=%ka.resp.name resource=%ka.target.resource ns=%ka.target.namespace images=%ka.req.pod.containers.image)
  priority: WARNING
  source: k8s_audit
  tags: [k8s, resource_limits]

- list: resourcelimits
  items: ["limits"]

Please note: Depending on how your Kubernetes workloads are set up, this rule might generate some false positive alert detections for legitimate pods that are intentionally deployed without resource limits. In these cases, you may still need to fine-tune this rule or implement some exceptions in order to minimize those false positives. However, implementing such a rule can significantly enhance your monitoring capabilities, ensuring that best practices for resource allocation in Kubernetes are adhered to.
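
As a point-in-time complement to the Falco rule, a kubectl and jq one-liner (a sketch, assuming jq is installed) can list any pods that currently run a container without limits:

kubectl get pods --all-namespaces -o json \
  | jq -r '.items[] | select(any(.spec.containers[]; .resources.limits == null))
           | "\(.metadata.namespace)/\(.metadata.name)"'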

Sysdig’s commitment to open source

The lack of enforced resource constraints in Kubernetes in numerous organizations underscores a critical gap in current security frameworks, highlighting the urgent need for increased awareness. In response, we contributed our findings to the OWASP Top 10 framework for Kubernetes, addressing what was undeniably an example of insecure workload configuration. Our contribution, recognized for its value, was duly incorporated into the framework. Leveraging the inherently open source nature of the OWASP framework, we submitted a Pull Request (PR) on GitHub, proposing this novel enhancement. This act of contributing to established security awareness frameworks not only bolsters cloud-native security but also enhances its transparency, marking a pivotal step towards a more secure and aware cloud-native ecosystem.

Bridging Security and Scalability

The perceived complexity of maintaining, monitoring, and modifying resource constraints can often deter organizations from implementing these critical security measures. Given the dynamic nature of development environments, where application needs can fluctuate based on demand, feature rollouts, and scalability requirements, it’s understandable why teams might view resource limits as a potential barrier to agility. However, this perspective overlooks the inherent flexibility of Kubernetes’ resource management capabilities, and more importantly, the critical role of cross-functional communication in optimizing these settings for both security and performance.

The art of flexible constraints

Kubernetes offers a sophisticated model for managing resource constraints that does not inherently stifle application growth or operational flexibility. Through the use of requests and limits, Kubernetes allows for the specification of minimum resources guaranteed for a container (requests) and a maximum cap (limits) that a container cannot exceed. This model provides a framework within which applications can operate efficiently, scaling within predefined bounds that ensure security without compromising on performance.

The key to leveraging this model effectively lies in adopting a continuous evaluation and adjustment approach. Regularly reviewing resource utilization metrics can provide valuable insights into how applications are performing against their allocated resources, identifying opportunities to adjust constraints to better align with actual needs. This iterative process ensures that resource limits remain relevant, supportive of application demands, and protective against security vulnerabilities.
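
Assuming the metrics-server (or an equivalent metrics pipeline) is running in the cluster, kubectl top provides that quick point-in-time view of actual consumption to weigh against the configured requests and limits:

# Current CPU and memory usage per pod in the namespace
kubectl top pod -n atomic-red

# The same view per node, useful for spotting overall pressure
kubectl top node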

Fostering open communication lines

At the core of successfully implementing flexible resource constraints is the collaboration between development, operations, and security teams. Open lines of communication are essential for understanding application requirements, sharing insights on potential security implications of configuration changes, and making informed decisions on resource allocation.

Encouraging a culture of transparency and collaboration can demystify the process of adjusting resource limits, making it a routine part of the development lifecycle rather than a daunting task. Regular cross-functional meetings, shared dashboards of resource utilization and performance metrics, and a unified approach to incident response can foster a more integrated team dynamic. 

Simplifying maintenance, monitoring, and modification

With the right tools and practices in place, resource management can be streamlined and integrated into the existing development workflow. Automation tools can simplify the deployment and update of resource constraints, while monitoring solutions can provide real-time visibility into resource utilization and performance.

Training and empowerment, coupled with clear guidelines and easy-to-use tools, can make adjusting resource constraints a straightforward task that supports both security posture and operational agility.
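
One example of such a guardrail is a namespace-level LimitRange, which quietly applies default requests and limits to any container deployed without its own. The values below are purely illustrative, not a recommendation:

kubectl apply -f - <<EOF
apiVersion: v1
kind: LimitRange
metadata:
  name: default-container-limits
  namespace: atomic-red
spec:
  limits:
  - type: Container
    default:            # applied as limits when a container defines none
      cpu: 500m
      memory: 128Mi
    defaultRequest:     # applied as requests when a container defines none
      cpu: 250m
      memory: 64Mi
EOF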

Conclusion

Setting resource limits in Kubernetes transcends being a mere security measure; it’s a pivotal strategy that harmoniously balances operational efficiency with robust security. This practice gains even more significance in the light of evolving cloud-native threats, particularly cryptomining attacks, which are increasingly becoming a preferred method for attackers due to their low-effort, high-reward nature.

Reflecting on the 2022 Cloud-Native Threat Report, we observe a noteworthy trend. The Sysdig Threat Research team profiled TeamTNT, a notorious cloud-native threat actor known for targeting both cloud and container environments, predominantly for crypto-mining purposes. Their research underlines a startling economic imbalance: cryptojacking costs victims an astonishing $53 for every $1 an attacker earns from stolen resources. This disparity highlights the financial implications of such attacks, beyond the apparent security breaches.

TeamTNT’s approach reiterates why attackers are choosing to exploit environments where container resource limits are undefined or unmonitored. The lack of constraints or oversight of resource usage in containers creates an open field for attackers to deploy cryptojacking malware, leveraging the unmonitored resources for financial gain at the expense of its victim.

In light of these insights, it becomes evident that the implementation of resource constraints in Kubernetes and the monitoring of resource usage in Kubernetes are not just best practices for security and operational efficiency; they are essential defenses against a growing trend of financially draining cryptomining attacks. As Kubernetes continues to evolve, the importance of these practices only escalates. Organizations must proactively adapt by setting appropriate resource limits and establishing vigilant monitoring systems, ensuring a secure, efficient, and financially sound environment in the face of such insidious threats.

The post Resource Constraints in Kubernetes and Security appeared first on Sysdig.

]]>
Kernel Introspection from Linux to Windows https://sysdig.com/blog/kernel-introspection-from-linux-to-windows/ Fri, 02 Feb 2024 16:00:00 +0000 https://sysdig.com/?p=83933 The cybersecurity landscape is undergoing a significant shift, moving from security tools monitoring applications running within userspace to advanced, real-time...

The post Kernel Introspection from Linux to Windows appeared first on Sysdig.

]]>
The cybersecurity landscape is undergoing a significant shift, moving from security tools monitoring applications running within userspace to advanced, real-time approaches that monitor system activity directly and safely within the kernel by using eBPF. This evolution in kernel introspection is particularly evident in the adoption of projects like Falco, Tetragon, and Tracee in Linux environments. These tools are especially prevalent in systems running containerized workloads under Kubernetes, where they play a crucial role in real-time monitoring of dynamic and ephemeral workloads.

The open source project Falco exemplifies this trend. It employs various instrumentation techniques to scrutinize system workload, relaying security events from the kernel to user space. These instrumentations are referred to as ‘drivers’ within Falco, reflecting their operation in kernel space. The driver is pivotal as it furnishes the syscall event source, which is integral for monitoring activities closely tied to the syscall context. When deploying Falco, the kernel module is typically installed via the falco-driver-loader script included in the binary package. This process seamlessly integrates Falco’s monitoring capabilities into the system, enabling real-time detection and response to security threats at the kernel level.

How do system calls work?

System calls (syscalls for short) are a fundamental aspect of how software interacts with the operating system. They are essential mechanisms in any operating system’s kernel, serving as the primary interface between user-space applications and the kernel.

Syscalls are functions used by applications to request services from the operating system’s kernel. These services include operations like reading and writing files, sending network data, and accessing hardware devices.

  1. When a user-space application needs to perform an operation that requires the kernel’s intervention, it makes a syscall.
  2. The application typically uses a high-level API provided by the operating system, which abstracts the details of the syscall.
  3. The syscall switches the processor from user mode to kernel mode, where the kernel has access to protected system resources.
  4. The kernel executes the requested service and then returns the result to the user-space application, switching back to user mode.

Types of system calls

System calls can be categorized into several types, such as:

  • File Management: Operations like open, read, write, and close files
  • Process Control: Creation and termination of processes, and process scheduling
  • Memory Management: Allocating and freeing memory
  • Device Management: Requests to access hardware devices
  • Information Maintenance: System information requests and updates
  • Communication: Creating and managing communication channels

Examples of Linux system calls

  • open(): Used to open a file
  • read(): Used to read data from a file or a network
  • write(): Used to write data to a file or a network
  • fork(): Used to create a new process
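
A quick way to watch a few of these calls happen is to trace a trivial command with strace, assuming it is installed. Note that on a modern glibc system you will typically see openat() rather than open(), and clone() rather than fork():

$ # trace a handful of file-related syscalls while cat reads a file;
$ # expect some extra openat() noise from the dynamic loader and libc first
$ strace -e trace=openat,read,write,close cat /etc/hostname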

Why system calls are necessary for Kernel Introspection

System calls provide a controlled interface for user-space applications to access the hardware and resources managed by the kernel. They ensure security and stability by preventing applications from directly accessing critical system resources that could potentially harm the system if misused.

Kernel introspection performance considerations

System calls involve context switching between user mode and kernel mode, which can be relatively expensive in terms of performance. Therefore, efficient use of system calls is important in application development.

A Shift to eBPF in Linux

In summary, system calls are crucial for the operation of any computer system, acting as gateways through which applications request and receive services from the operating system’s kernel. They play a critical role in resource management, security, and abstraction, allowing applications to perform complex operations without needing to directly interact with the low-level details of the hardware and operating system internals.

In recent years, we have seen a shift towards a technology called extended Berkeley Packet Filter (eBPF for short). eBPF is a revolutionary technology with origins in the Linux kernel that can run sandboxed programs in a privileged context, such as the operating system kernel. It is used to safely and efficiently extend the capabilities of the kernel without requiring changes to kernel source code or the loading of kernel modules, which can prove to be a safer alternative to the traditional kernel module.

Historically, the operating system has always been an ideal place to implement observability, security, and networking functionality due to the kernel’s privileged ability to oversee and control the entire system. At the same time, an operating system kernel is hard to evolve due to its central role and high requirement towards stability and security. The rate of innovation at the operating system level has thus traditionally been lower compared to functionality implemented outside of the operating system.

The most noticeable impact on a host comes from the number of times an event has to be sent to user space, and the amount of work that needs to be done in user space to handle this event. In other words, the earlier an event can be confidently dropped and ignored, the better. This is why programmable solutions like eBPF or kernel modules are beneficial. Having the ability to develop fine grained in-kernel filters to control the amount of data sent from kernel space to user space is a huge benefit in Linux.

Falco, for example, has the ability to select specific syscalls to monitor through Adaptive Syscall Selection. This empowers users with granular control, optimizing system performance by reducing CPU load through selective syscall monitoring. After mapping the event strings from the rules to their corresponding syscall IDs, Falco uses a dedicated eBPF map to inject this information into the sys_enter and sys_exit tracepoints within the driver.

Falco’s modern eBPF probe is an alternative driver to the default kernel module. The main advantage it brings to the table is that it is embedded into Falco, which means that you don’t have to download or build anything. If your kernel is recent enough, Falco will automatically inject it, providing increased portability for end-users.

How to Handle Kernel Introspection in Windows & Linux

Syscalls in Windows and Linux fundamentally operate in the same way, providing an interface between user-space applications and the operating system’s kernel. However, there are notable differences in their implementation and usage, which also contribute to the variations in system call monitoring tools and the adoption of technologies like eBPF in these environments. Here are some of the clear differences in syscalls between Windows and Linux:

Implementation and API differences

  • Linux: Uses a consistent set of syscalls across different distributions.
    Linux system calls are well-documented and relatively stable across versions.
  • Windows: Windows syscalls, known as Win32 API calls, can be more complex due to the broader range of functionalities and legacy support. The Windows API includes a set of functions, interfaces, and protocols for building Windows applications.

Syscall invocation

  • In Linux, system calls are typically invoked using a software interrupt, which switches the processor from user mode to kernel mode. For example, when a Linux program needs to read a file, it directly invokes the read syscall, which is a straightforward interface to the kernel’s file reading capabilities.
  • In contrast, Windows uses a similar mechanism but also includes additional layers of APIs that can abstract the underlying system calls more significantly. For instance, in Windows, a program might use the ReadFile function from the Win32 API to read a file.

    This function, in turn, interacts with lower-level system calls to perform the operation. The Win32 API provides a more user-friendly interface and hides the complexity of direct system call usage, which is a common approach in Windows to provide additional functionality and manage compatibility across different versions of the operating system.

Syscall monitoring tools

  • Linux: The open source nature and the standardized system call interface in Linux make it easier to develop and use system call monitoring tools. Tools like auditd, Sysdig Inspect, and eBPF-based technologies are commonly used for monitoring system calls in Linux (see the short auditd sketch after this list).
  • Windows: System call monitoring tools are less common in Windows partly due to the complexity and variability of the Windows API and kernel. The closed source nature of Windows also limits the development of external monitoring tools. There are a couple of tools from the Sysinternals suite, such as Procmon and Sysmon, which have existed for a long time. Needless to say, both are closed source, Microsoft proprietary software. However, Windows does provide its own set of tools and APIs to extend Kernel visibility for monitoring, like Event Tracing for Windows (ETW) and Windows Management Instrumentation (WMI).
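
To make the Linux side concrete, here is a minimal auditd sketch (assuming the audit daemon and auditctl are available); the -k values are arbitrary labels used for searching later:

# Record every execve() and openat() on a 64-bit system, tagged with searchable keys
sudo auditctl -a always,exit -F arch=b64 -S execve -k proc_exec
sudo auditctl -a always,exit -F arch=b64 -S openat -k file_open

# Pull back recently recorded events for one of those keys
sudo ausearch -k file_open --start recent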

Implementing user-space hooking techniques in Windows

  • In addition to Procmon and Sysmon, many Windows products utilize kernel drivers, often augmented with user-space hooking techniques, to monitor system calls. User-space hooking refers to the method of intercepting function calls, messages, or events passed between software components in user space, outside the kernel. This technique allows for the monitoring and manipulation of interactions within an application without requiring changes to the underlying operating system kernel.
  • User-space hooking is particularly useful in scenarios where kernel-level access is either not feasible or too risky, such as when dealing with security applications, system utilities, or performance monitoring tools. By leveraging user-space hooking, developers can gather valuable data on application behavior, enhance security measures, or modify functionality without the need for deep integration into the operating system’s core.
  • Despite these approaches, Windows also offers its own set of tools and APIs to facilitate kernel visibility for monitoring purposes. ETW and WMI are the prime examples. ETW provides detailed event logging and tracing capabilities, allowing for the collection of diagnostic and performance information, while WMI offers a framework for accessing management information in an enterprise environment. Both are instrumental in extending visibility for kernel introspection; however, it’s still worth noting that many endpoint detection tools still rely on user-space hooking techniques that provide limited system visibility.

eBPF for Windows

The eBPF for Windows initiative is an ongoing project designed to bring the functionality of eBPF, a feature predominantly used in the Linux environment, to Windows. Essentially, this project integrates existing eBPF tools and APIs into the Windows platform. It does so by incorporating existing eBPF projects as submodules and creating an intermediary layer that enables their operation on Windows.

The primary goal of this project is to ensure compatibility at the source code level for programs that utilize standard hooks and helpers, which are common across different operating systems. In essence, eBPF for Windows aims to allow applications originally written for Linux to be compatible with Windows.

While Linux offers a wide array of hooks and helpers, some are highly specific to its internal structures and may not be transferable to other platforms. However, there are many hooks and helpers with more general applications, and the eBPF for Windows project focuses on supporting these in cross-platform eBPF programs.

Additionally, the project makes the Libbpf APIs available on Windows. This is intended to maintain source code compatibility for applications interacting with eBPF programs, further bridging the gap between Linux and Windows environments in terms of eBPF program development and execution.

As of 2024, the eBPF for Windows project is still a work in progress. There are, of course, challenges to adoption in Windows eBPF. The beta status of eBPF for Windows means that it has yet to see the widespread adoption otherwise observed in Linux systems. The challenges include ensuring compatibility with Windows kernel architecture, integrating with existing Windows security and monitoring tools, and adapting Linux-centric eBPF toolchains to the Windows environment. 

However, if successfully implemented, eBPF for Windows could bring powerful kernel introspection and programmability capabilities, similar to those in Linux, to Windows environments. This would significantly enhance the ability to monitor and secure Windows systems using advanced eBPF-based tools.

While there are inherent differences in how system calls are implemented and monitored in Windows and Linux, efforts like the eBPF for Windows project represent an ongoing endeavor to bridge these gaps. The potential for bringing Linux’s advanced monitoring capabilities to Windows could open up new possibilities in system security and management, although it faces significant developmental challenges. Currently, Windows cannot interpret Linux system calls.

Kernel Introspection for Windows

There are, of course, alternative approaches for Windows kernel introspection. The project Fibratus.io offers itself as a modern tool for Windows kernel exploration and observability with a focus on security. Fibratus uses an approach known as ETW (Event Tracing for Windows) for capturing system events. Many kernel developers will discover that the process of building a kernel driver in Windows is very tedious because of the various stringent Microsoft requirements regarding certification, quality lab testing, and more. Not just that, but writing kernel code is, in general, much more time consuming, and a crash in a single kernel driver may crash the entire system.

Right now, ETW looks like the best approach for deep kernel insights, since the eBPF for Windows implementation is still somewhat limited to network-stack observability use cases, such as eXpress Data Path (XDP) for DDoS mitigation. ETW is implemented in the Windows operating system and provides developers a fast, reliable, and versatile set of event tracing features with very little impact on performance. You can dynamically enable or disable tracing without rebooting your computer, or reloading your application or driver. Unlike debugging statements that you add to your code during development, you can use ETW in your production code. Similar to the syscall approaches we mentioned for Linux systems, ETW provides a mechanism to trace and log events that are raised by user-mode applications and kernel-mode drivers.

Kernel Introspection – A Conclusion

Windows security vendors typically maintain a level of confidentiality about the inner workings of their Endpoint Detection & Response (EDR) products. However, it’s widely recognized that many of these products leverage kernel drivers or the Event Tracing for Windows (ETW) framework, sometimes supplemented with user-space hooking techniques. The specific methodologies and implementations often remain under wraps, aligning with industry norms for proprietary technology.

The introduction of eBPF, a technology with roots in the Linux kernel, into Windows environments marks a significant and promising development. eBPF’s transition to Windows is particularly notable for its potential in production environments. Its capability to dynamically load and unload programs without necessitating a kernel restart is a major advancement. This feature greatly facilitates system administration, allowing for more efficient debugging and problem-solving in live environments. The gradual roll-out of eBPF in Windows signifies a step towards more flexible and powerful system diagnostics and management tools, mirroring some of the advanced capabilities long available in Linux systems. This evolution reflects the ongoing convergence of Linux and Windows operational paradigms and toolsets, enhancing the capabilities and utility of Windows systems in complex, production-grade applications.

The post Kernel Introspection from Linux to Windows appeared first on Sysdig.

]]>