Published on 00/00/0000
Last updated on 00/00/0000
Secure computing (seccomp) is a Linux kernel feature that restricts the system calls a process can make. In short, this reduces the “attack surface” you expose to the world, making your enterprise more secure.
But to understand how seccomp works, you first have to understand containers and how they relate to processes. Here’s what you’ll need to know.
Over the years, how we build our applications has changed from a monolithic paradigm to a microservices paradigm. Containerization gained momentum along the way, and with it came the success of Docker and Kubernetes.
What is a container in the secure computing (seccomp) context? Briefly, a container is a set of processes. Think of it as a separate room where processes handle their tasks, with special walls (namespaces) keeping them isolated from the other rooms and meters (cgroups) limiting how many resources each room can use.
If you have multiple containers, you will have multiple sets of processes that are isolated from each other using namespaces and cgroups. Namespaces are responsible for isolating processes (which processes can my process see?) and cgroups are responsible for allocating and limiting resources for the processes (how much CPU can my process use?).
In the image above, we have an example of the isolation provided by namespaces. The ubuntu pod, pictured, is executing a sleep command. For the pod, sleep is the process with PID 1. The container doesn't see any processes from the host or from other containers, so it considers the sleep process to be the first one. The host, on the other hand, can see the processes inside all containers, and for the host the PID of the sleep process is 7275. Who is responsible for creating the namespaces for each container?
The answer is simple: the container runtime handles this task (for example, runc in Docker). But if a process needs access to hardware, how is this done? Here, we need to understand the concepts of user space and kernel space.
Linux divides memory into user space and kernel space. User processes run in user space, and the kernel runs in kernel space. When a user process needs to interact with hardware, it makes system calls, or just syscalls, to the kernel.
Although processes inside a container cannot see those on the host due to the isolation created by namespaces, there is no analogous solution for the kernel. That is, our containers share the same kernel space among themselves and with the host. Thus, it would be possible for a container to erase file systems through syscalls, or to write to files that require privileges. This is far less secure than using virtual machines, where each service has its own user space and kernel space.
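As a rough illustration of that boundary, the Python sketch below uses thin wrappers from the standard os module. Each call hands control to the kernel through a syscall. The syscall names in the comments are the usual Linux ones and are an assumption about the platform; the exact calls can vary with the C library and kernel version.

```python
import os

# Each os.* call below is a thin wrapper that crosses from user space
# into kernel space through a syscall (names assume a typical Linux system).
read_end, write_end = os.pipe()            # pipe/pipe2: the kernel creates a pipe
written = os.write(write_end, b"hello")    # write: the kernel copies bytes into the pipe
data = os.read(read_end, written)          # read: the kernel copies them back out
os.close(read_end)                         # close: the kernel releases the descriptors
os.close(write_end)

print(data)
```

Running a program like this under a tracing tool such as strace is one way to see the underlying calls for yourself.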
The good news is that it is possible to restrict the syscalls a process can make. Seccomp is a Linux kernel feature, available since version 2.6.12, that does exactly this. Container runtimes configure seccomp through profiles, which are JSON files that state which system calls are allowed and which are not. The runtimes already ship a default seccomp profile that blocks some system calls. It’s possible to use the runtime's default profile on Kubernetes by enabling a feature gate on the kubelet. However, it is also possible to add more security by creating customized and more restrictive profiles for a specific Kubernetes pod.
You can apply seccomp profiles to your pods or containers through the securityContext field. Here's an example of a Pod that uses a seccomp profile (we'll use this example later). The profile can be specified at the Pod level or at the container level; the difference is that at the Pod level all the containers in the Pod use the same profile. It is also important to set allowPrivilegeEscalation to false to prevent the container from trying to acquire more privileges than it was granted. Otherwise, the seccomp profile won’t be applied.
apiVersion: v1
kind: Pod
metadata:
  creationTimestamp: null
  name: ubuntu-deny
spec:
  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: deny-list.json
  containers:
  - image: ubuntu
    name: ubuntu-deny
    command:
    - sleep
    - "90000"
    securityContext:
      allowPrivilegeEscalation: false
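To make the two settings discussed above easier to check, here is a minimal Python sketch of a manifest check. It is not part of Kubernetes or any existing tool: the function name seccomp_findings and its messages are hypothetical, and the Pod is written as a plain dictionary that mirrors the manifest above.

```python
def seccomp_findings(pod):
    """Return a list of findings for a Pod manifest given as a dict (hypothetical helper)."""
    findings = []
    spec = pod.get("spec") or {}
    pod_profile = (spec.get("securityContext") or {}).get("seccompProfile")
    for container in spec.get("containers") or []:
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        # A container-level profile takes precedence over the Pod-level one.
        profile = ctx.get("seccompProfile") or pod_profile
        if profile is None:
            findings.append(f"{name}: no seccompProfile set")
        elif profile.get("type") == "Localhost" and not profile.get("localhostProfile"):
            findings.append(f"{name}: Localhost profile without localhostProfile")
        if ctx.get("allowPrivilegeEscalation") is not False:
            findings.append(f"{name}: allowPrivilegeEscalation is not false")
    return findings


# The same Pod as in the manifest above, expressed as a dictionary.
pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {"name": "ubuntu-deny"},
    "spec": {
        "securityContext": {
            "seccompProfile": {"type": "Localhost", "localhostProfile": "deny-list.json"}
        },
        "containers": [
            {
                "image": "ubuntu",
                "name": "ubuntu-deny",
                "command": ["sleep", "90000"],
                "securityContext": {"allowPrivilegeEscalation": False},
            }
        ],
    },
}

print(seccomp_findings(pod))  # [] means nothing was flagged
```

A check like this only inspects the manifest; it says nothing about whether the profile file actually exists on the node.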
Seccomp is a powerful tool. But knowing which system calls a process should be permitted to make is a difficult task. A simple command like sleep can make a few dozen system calls, as shown in the image below.
One way to get around this is to use deny-type profiles. In a deny list, you prohibit some calls and allow all others. This way you can ensure that your container won't try to do strange things to the host, and you will be able to respond faster to new threats. Here is an example of a deny list.
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": [
    "SCMP_ARCH_X86_64",
    "SCMP_ARCH_X86",
    "SCMP_ARCH_X32"
  ],
  "syscalls": [
    {
      "names": ["clock_nanosleep"],
      "action": "SCMP_ACT_ERRNO"
    }
  ]
}
This deny list restricts the use of the clock_nanosleep system call. If you look at the previous example, this call is used by the sleep command. When this profile is applied to our pod, which executes the sleep command, the pod goes straight to the Error state. That is, sleep fails as soon as it makes the blocked call, so the container never does its work.
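The way such a profile is read can be sketched in a few lines of Python. This is a simplified model for illustration only: the real filtering happens inside the kernel, and the sketch ignores the architectures list and any argument conditions a rule may carry. The helper name action_for is hypothetical.

```python
import json

# The deny list from above, loaded as data.
PROFILE = json.loads("""
{
  "defaultAction": "SCMP_ACT_ALLOW",
  "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
  "syscalls": [
    {"names": ["clock_nanosleep"], "action": "SCMP_ACT_ERRNO"}
  ]
}
""")


def action_for(profile, syscall):
    """Return the action a profile assigns to a syscall name (simplified model)."""
    for rule in profile.get("syscalls", []):
        if syscall in rule.get("names", []):
            return rule["action"]
    # No rule names this syscall, so the profile's default applies.
    return profile["defaultAction"]


print(action_for(PROFILE, "clock_nanosleep"))  # SCMP_ACT_ERRNO
print(action_for(PROFILE, "read"))             # SCMP_ACT_ALLOW
```

Swapping the default to SCMP_ACT_ERRNO and listing only permitted calls with SCMP_ACT_ALLOW would turn the same structure into an allow list.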
If you want to know more about seccomp and how we handle processes like these at Cisco, I recommend the article “Hardening Kubernetes Containers Security with Seccomp” here on the Outshift blog.
In that article, you will learn more about best practices for creating and managing seccomp profiles and how to use Cisco Secure Firewall Cloud Native to enhance seccomp.