mount_namespaces - Man Page
overview of Linux mount namespaces
For an overview of namespaces, see namespaces(7).
Mount namespaces provide isolation of the list of mounts seen by the processes in each namespace instance. Thus, the processes in each of the mount namespace instances will see distinct single-directory hierarchies.
The views provided by the /proc/pid/mounts, /proc/pid/mountinfo, and /proc/pid/mountstats files (all described in proc(5)) correspond to the mount namespace in which the process with the PID pid resides. (All of the processes that reside in the same mount namespace will see the same view in these files.)
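One way to check whether two processes share a mount namespace (and therefore see the same view in these files) is to compare their /proc/pid/ns/mnt symbolic links, which are described in namespaces(7). The following is a minimal sketch, not a definitive tool. The function names and the PROC_ROOT override are hypothetical and exist only for illustration.

```shell
# Print the mount namespace identifier of a process, e.g. "mnt:[4026531840]".
# PROC_ROOT is a hypothetical override (default: /proc) that allows the
# helper to be exercised against a fake directory tree.
mount_ns_id() {
    readlink "${PROC_ROOT:-/proc}/$1/ns/mnt"
}

# Exit status 0 if both PIDs are in the same mount namespace, 1 if they
# are not, and 2 if either link could not be read.
same_mount_ns() {
    _ns_a=$(mount_ns_id "$1") || return 2
    _ns_b=$(mount_ns_id "$2") || return 2
    [ "$_ns_a" = "$_ns_b" ]
}
```

For example, "same_mount_ns 1 $$ && echo same" compares the calling shell with PID 1. Reading the link of another user's process may require additional privileges.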
A new mount namespace is created using either clone(2) or unshare(2) with the CLONE_NEWNS flag. When a new mount namespace is created, its mount list is initialized as follows:
- If the namespace is created using clone(2), the mount list of the child's namespace is a copy of the mount list in the parent process's mount namespace.
- If the namespace is created using unshare(2), the mount list of the new namespace is a copy of the mount list in the caller's previous mount namespace.
Subsequent modifications to the mount list (mount(2) and umount(2)) in either mount namespace will not (by default) affect the mount list seen in the other namespace (but see the following discussion of shared subtrees).
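The divergence of two mount lists can be observed by comparing mountinfo snapshots taken in each namespace. The sketch below assumes that the two snapshots have been saved to files (or are readable as /proc/pid/mountinfo). The function names are hypothetical.

```shell
# Print the mount points listed in a mountinfo-format file, sorted and
# without duplicates. Field 5 of each mountinfo line is the mount point.
mount_points() {
    awk '{ print $5 }' "$1" | sort -u
}

# Compare two mountinfo snapshots.
# Lines prefixed "<" are mount points found only in the first snapshot;
# lines prefixed ">" are mount points found only in the second.
mount_diff() {
    _md_a=$(mktemp) && _md_b=$(mktemp) || return 1
    mount_points "$1" > "$_md_a"
    mount_points "$2" > "$_md_b"
    comm -23 "$_md_a" "$_md_b" | sed 's/^/< /'
    comm -13 "$_md_a" "$_md_b" | sed 's/^/> /'
    rm -f "$_md_a" "$_md_b"
}
```

A possible invocation is "mount_diff /proc/1/mountinfo /proc/self/mountinfo", which prints nothing when both processes see the same set of mount points.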
Mount namespaces first appeared in Linux 2.4.19.
Namespaces are a Linux-specific feature.
The propagation type assigned to a new mount depends on the propagation type of the parent mount. If the mount has a parent (i.e., it is a non-root mount point) and the propagation type of the parent is MS_SHARED, then the propagation type of the new mount is also MS_SHARED. Otherwise, the propagation type of the new mount is MS_PRIVATE.
Notwithstanding the fact that the default propagation type for new mounts is in many cases MS_PRIVATE, MS_SHARED is typically more useful. For this reason, systemd(1) automatically remounts all mounts as MS_SHARED on system startup. Thus, on most modern systems, the default propagation type is in practice MS_SHARED.
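The propagation type actually in effect for each mount can be read from the optional fields of /proc/pid/mountinfo (see proc(5)), where shared mounts carry a "shared:N" tag, slave mounts a "master:N" tag, and private mounts carry no propagation tag. The following is a minimal sketch. The function name is hypothetical, and reporting a mount with no tags as "private" is this sketch's own convention.

```shell
# Read mountinfo-format lines on standard input and print, for each mount,
# its mount point followed by its propagation tags.
propagation_types() {
    awk '{
        tags = ""
        # Optional fields start at field 7 and end at the "-" separator
        # (or at end of line, if the separator has been stripped).
        for (i = 7; i <= NF && $i != "-"; i++)
            tags = tags (tags == "" ? "" : ",") $i
        if (tags == "")
            tags = "private"    # no propagation tag present
        print $5, tags
    }'
}
```

For example, "propagation_types < /proc/self/mountinfo" lists the propagation state of every mount visible to the calling shell.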
Since, when one uses unshare(1) to create a mount namespace, the goal is commonly to provide full isolation of the mounts in the new namespace, unshare(1) (since util-linux 2.27) in turn reverses the step performed by systemd(1), by making all mounts private in the new namespace. That is, unshare(1) performs the equivalent of the following in the new mount namespace:
mount --make-rprivate /
To prevent this, one can use the --propagation unchanged option to unshare(1).
An application that creates a new mount namespace directly using clone(2) or unshare(2) may desire to prevent propagation of mount events to other mount namespaces (as is done by unshare(1)). This can be done by changing the propagation type of mounts in the new namespace to either MS_SLAVE or MS_PRIVATE, using a call such as the following:
mount(NULL, "/", MS_SLAVE | MS_REC, NULL);
For a discussion of propagation types when moving mounts (MS_MOVE) and creating bind mounts (MS_BIND), see Documentation/filesystems/sharedsubtree.rst.
Restrictions on mount namespaces
Note the following points with respect to mount namespaces:
Each mount namespace has an owner user namespace. As explained above, when a new mount namespace is created, its mount list is initialized as a copy of the mount list of another mount namespace. If the new namespace and the namespace from which the mount list was copied are owned by different user namespaces, then the new mount namespace is considered less privileged.
When creating a less privileged mount namespace, shared mounts are reduced to slave mounts. This ensures that mounts performed in less privileged mount namespaces will not propagate to more privileged mount namespaces.
Mounts that come as a single unit from a more privileged mount namespace are locked together and may not be separated in a less privileged mount namespace. (The unshare(2) CLONE_NEWNS operation brings across all of the mounts from the original mount namespace as a single unit, and recursive mounts that propagate between mount namespaces propagate as a single unit.)
In this context, "may not be separated" means that the mounts are locked so that they may not be individually unmounted. Consider the following example:
    $ sudo sh
    # mount --bind /dev/null /etc/shadow
    # cat /etc/shadow       # Produces no output
The above steps, performed in a more privileged mount namespace, have created a bind mount that obscures the contents of the shadow password file, /etc/shadow. For security reasons, it should not be possible to umount(2) that mount in a less privileged mount namespace, since that would reveal the contents of /etc/shadow.
Suppose we now create a new mount namespace owned by a new user namespace. The new mount namespace will inherit copies of all of the mounts from the previous mount namespace. However, those mounts will be locked because the new mount namespace is less privileged. Consequently, an attempt to umount(2) the mount fails, as shown in the following step:
    # unshare --user --map-root-user --mount \
                   strace -o /tmp/log \
                   umount /etc/shadow
    umount: /etc/shadow: not mounted.
    # grep '^umount' /tmp/log
    umount2("/etc/shadow", 0)     = -1 EINVAL (Invalid argument)
The error message from umount(8) is a little confusing, but the strace(1) output reveals that the underlying umount2(2) system call failed with the error EINVAL, which is the error that the kernel returns to indicate that the mount is locked.
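Because the user-level message hides the real cause, it can help to pull the failing calls out of the strace(1) log directly. The sketch below uses a hypothetical function name. Note that EINVAL is also returned by umount2(2) in other cases (for example, when the target is not a mount point), so the output should be treated as a list of candidates rather than proof that a mount is locked.

```shell
# Print the target of every umount2() call in an strace log that failed
# with EINVAL. The pattern is not anchored to the start of the line, so
# logs written with "strace -f" (which prefixes each line with a PID)
# are handled as well.
failed_umounts() {
    sed -n 's/.*umount2("\([^"]*\)".*= -1 EINVAL.*/\1/p' "$1"
}
```

For the log produced in the example above, "failed_umounts /tmp/log" would report /etc/shadow.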
Note, however, that it is possible to stack (and unstack) a mount on top of one of the inherited locked mounts in a less privileged mount namespace:
    # echo 'aaaaa' > /tmp/a    # File to mount onto /etc/shadow
    # unshare --user --map-root-user --mount \
          sh -c 'mount --bind /tmp/a /etc/shadow; cat /etc/shadow'
    aaaaa
    # umount /etc/shadow
The final umount(8) command above, which is performed in the initial mount namespace, makes the original /etc/shadow file once more visible in that namespace.
Following on from the preceding point about locked mounts, note that it is possible to umount(2) an entire subtree of mounts that propagated as a unit into a less privileged mount namespace, as illustrated in the following example.
First, we create new user and mount namespaces using unshare(1). In the new mount namespace, the propagation type of all mounts is set to private. We then create a shared bind mount at /mnt, and a small hierarchy of mounts underneath that mount.
    $ PS1='ns1# ' sudo unshare --user --map-root-user \
                           --mount --propagation private bash
    ns1# echo $$        # We need the PID of this shell later
    778501
    ns1# mount --make-shared --bind /mnt /mnt
    ns1# mkdir /mnt/x
    ns1# mount --make-private -t tmpfs none /mnt/x
    ns1# mkdir /mnt/x/y
    ns1# mount --make-private -t tmpfs none /mnt/x/y
    ns1# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
    986 83 8:5 /mnt /mnt rw,relatime shared:344
    989 986 0:56 / /mnt/x rw,relatime
    990 989 0:57 / /mnt/x/y rw,relatime
Continuing in the same shell session, we then create a second shell in a new user namespace and a new (less privileged) mount namespace and check the state of the propagated mounts rooted at /mnt.
    ns1# PS1='ns2# ' unshare --user --map-root-user \
                           --mount --propagation unchanged bash
    ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
    1239 1204 8:5 /mnt /mnt rw,relatime master:344
    1240 1239 0:56 / /mnt/x rw,relatime
    1241 1240 0:57 / /mnt/x/y rw,relatime
Of note in the above output is that the propagation type of the mount /mnt has been reduced to slave, as explained above for shared mounts in a less privileged mount namespace. This means that submount events will propagate from the master /mnt in "ns1", but propagation will not occur in the opposite direction.
From a separate terminal window, we then use nsenter(1) to enter the mount and user namespaces corresponding to "ns1". In that terminal window, we then recursively bind mount /mnt/x at the location /mnt/ppp.
    $ PS1='ns3# ' sudo nsenter -t 778501 --user --mount
    ns3# mount --rbind --make-private /mnt/x /mnt/ppp
    ns3# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
    986 83 8:5 /mnt /mnt rw,relatime shared:344
    989 986 0:56 / /mnt/x rw,relatime
    990 989 0:57 / /mnt/x/y rw,relatime
    1242 986 0:56 / /mnt/ppp rw,relatime
    1243 1242 0:57 / /mnt/ppp/y rw,relatime shared:518
Because the propagation type of the parent mount, /mnt, was shared, the recursive bind mount propagated a small subtree of mounts under the slave mount /mnt into "ns2", as can be verified by executing the following command in that shell session:
    ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
    1239 1204 8:5 /mnt /mnt rw,relatime master:344
    1240 1239 0:56 / /mnt/x rw,relatime
    1241 1240 0:57 / /mnt/x/y rw,relatime
    1244 1239 0:56 / /mnt/ppp rw,relatime
    1245 1244 0:57 / /mnt/ppp/y rw,relatime master:518
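The master/slave relationships in listings such as these can be paired up mechanically by matching each "master:N" tag in the slave namespace against the "shared:N" tag with the same peer group ID in the master namespace. The following sketch assumes that the two mountinfo listings have been saved to two differently named files. The function name is hypothetical, and when several mounts in the first file share one peer group ID, only the last one is reported.

```shell
# $1: mountinfo listing from the namespace holding the shared mounts
# $2: mountinfo listing from the namespace holding the slave mounts
# Prints one line per slave mount whose master peer group appears in $1.
match_masters() {
    awk '
        FILENAME == ARGV[1] {
            # First file: remember the mount point of each shared peer group.
            for (i = 7; i <= NF && $i != "-"; i++)
                if ($i ~ /^shared:/) {
                    split($i, a, ":")
                    shared[a[2]] = $5
                }
            next
        }
        {
            # Second file: look up the master peer group of each slave mount.
            for (i = 7; i <= NF && $i != "-"; i++)
                if ($i ~ /^master:/) {
                    split($i, a, ":")
                    if (a[2] in shared)
                        print $5, "receives from", shared[a[2]], "(peer group " a[2] ")"
                }
        }
    ' "$1" "$2"
}
```

Applied to the "ns3" and "ns2" listings above, this pairs /mnt through peer group 344 and /mnt/ppp/y through peer group 518.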
While it is not possible to umount(2) a part of the propagated subtree (/mnt/ppp/y) in "ns2", it is possible to umount(2) the entire subtree, as shown by the following commands:
    ns2# umount /mnt/ppp/y
    umount: /mnt/ppp/y: not mounted.
    ns2# umount -l /mnt/ppp      # Succeeds...
    ns2# grep /mnt /proc/self/mountinfo | sed 's/ - .*//'
    1239 1204 8:5 /mnt /mnt rw,relatime master:344
    1240 1239 0:56 / /mnt/x rw,relatime
    1241 1240 0:57 / /mnt/x/y rw,relatime
The mount(2) flags MS_RDONLY, MS_NOSUID, MS_NOEXEC, and the "atime" flags (MS_NOATIME, MS_NODIRATIME, MS_RELATIME) settings become locked when propagated from a more privileged to a less privileged mount namespace, and may not be changed in the less privileged mount namespace.
This point is illustrated in the following example where, in a more privileged mount namespace, we create a bind mount that is marked as read-only. For security reasons, it should not be possible to make the mount writable in a less privileged mount namespace, and indeed the kernel prevents this:
    $ sudo mkdir /mnt/dir
    $ sudo mount --bind -o ro /some/path /mnt/dir
    $ sudo unshare --user --map-root-user --mount \
                   mount -o remount,rw /mnt/dir
    mount: /mnt/dir: permission denied.
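To see in advance which settings would become locked, one can list, for each mount, the per-mount options that correspond to the flags named above. This is a minimal sketch with a hypothetical function name. It assumes the mountinfo option spellings "ro", "nosuid", "noexec", "noatime", "nodiratime", and "relatime" for MS_RDONLY, MS_NOSUID, MS_NOEXEC, MS_NOATIME, MS_NODIRATIME, and MS_RELATIME, and it reads only the per-mount options in field 6.

```shell
# Read mountinfo-format lines on standard input and print, for each mount
# that has any of them, the options that would become locked when the mount
# is propagated to a less privileged mount namespace.
lockable_flags() {
    awk '
        BEGIN {
            n = split("ro nosuid noexec noatime nodiratime relatime", names, " ")
            for (i = 1; i <= n; i++)
                lockable[names[i]] = 1
        }
        {
            out = ""
            m = split($6, opts, ",")    # field 6: per-mount options
            for (i = 1; i <= m; i++)
                if (opts[i] in lockable)
                    out = out (out == "" ? "" : ",") opts[i]
            if (out != "")
                print $5, out
        }
    '
}
```

For example, "lockable_flags < /proc/self/mountinfo" can be run in the more privileged namespace before creating the less privileged one.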
A file or directory that is a mount point in one mount namespace but not in another may be renamed, unlinked, or removed (rmdir(2)) in a mount namespace in which it is not a mount point (subject to the usual permission checks). As a consequence, the mount is removed in the mount namespace where the file or directory was a mount point.
Previously (before Linux 3.18), attempting to unlink, rename, or remove a file or directory that was a mount point in another mount namespace would result in the error EBUSY. That behavior had technical problems of enforcement (e.g., for NFS) and permitted denial-of-service attacks against more privileged users (i.e., preventing individual files from being updated by bind mounting on top of them).
unshare(1), clone(2), mount(2), mount_setattr(2), pivot_root(2), setns(2), umount(2), unshare(2), proc(5), namespaces(7), user_namespaces(7), findmnt(8), mount(8), pam_namespace(8), pivot_root(8), umount(8)
Documentation/filesystems/sharedsubtree.rst in the kernel source tree.