[RFC] Linux virtualization system patch 0.4.2

From: Yuri Pudgorodsky (yur@asplinux.ru)
Date: Mon Aug 21 2000 - 19:48:51 BST


We're announcing the availability of Virtual Environment
system for Linux kernel. It is in early design and implementation
stage, however it's already quite useful and stable. It has not
been reviewed throughly by security experts, so we're pleased
to get a feedback about our code.

The main idea of the system is to provide a strict unbreakable
boundaries between groups of userspace processes, more complete
then FreeBSD 'jail'. These boundaries suppose to prevent a process from
one group to see any object created from / belong to another group,
as well as to prevent them from modifications to global kernel parameters.
We suppose it is secure to run even root processes inside this
environment, if we drop some dangerous capabilities for them.

Such system allows administrators to run several independent application
environments on a single computer with secure boundaries between them.
We believe that the code may significantly help application hosting and,
in any case, improve system security.

We call the isolated group of processes with their filesystem root as
"Virtual Environment", or VE. The kernel patch tries to make the existence
of VE transparent to processes. Each VE has its own `init', startup
scripts and so on. So, VE looks almost like a usual system, including
possibilities to do administration of the environment from inside (for
example, mounting filesystem, network configuration etc). Certainly,
all operations are restricted to objects "inside" the VE.

FreeBSD has "jail" call which partially address the need in restricted and
separated environments. The main feature of our system is the possibility
to have a user with a certain subset of administrative (root) rights inside
the environment. Thus, our access control measures are more sophisticated.
We're also trying to avoid the ugly and unnatural relationships between
filesystem roots and IP addresses that exist in "jail" interface and
implementation (however, to be honest, current network component of the
patch is a quick hack and will be reimplemented).

Implementation details
----------------------

Boundaries between different environments are implemented by a namespace
separation. All valuable objects (task_struct, sockets, skbuf, ...)
are marked by VE id and searching algorithms are changed to expose only
objects with VE id matching the VE id of calling process.

Information about virtual environment is kept in `struct ve_struct'.
Processes keep a reference to the structure corresponding to their VE.

Process table searching macro for_each_task() has been replaced by
for_each_task_all() and for_each_task_ve(). Similarly, find_task_by_pid()
was replaced by find_task_by_pid_all() and find_task_by_pid_ve().

The very first process spawned inside VE (actually, the process
performed env_create syscall) receives special handling: it becomes
an `init' process for this VE, handling its orphans.
To follow traditions, this process PID is 1.

Network changes ensure that processes may operate only with IP addresses
assigned to this VE, and don't see packets destined to others.
However, the network part of the code is under heavy development.

Also, there are changes in other places necessary for environment
isolation, namely:

  - some capabilities are dropped for VE root
    (CAP_SYS_MODULE, CAP_SYS_RAWIO, CAP_SYS_TIME, CAP_NET_ADMIN,
     CAP_NET_NICE)
  - /proc has been made read-only inside VE
  - private pty driver is created for each VE
  - Console is not accessible from VE
  - SysV IPC ids are multiplexed
  - utsname is made private for VE
  - sysctl is read-only inside VE, except for uts_name related ones
  - SIGIO never goes across VE boundary
  - chroot call is secured

Quality of Service
------------------

To limit consumption of valuable resources by a given VE, the code includes
User Beancounter patch
ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounter.html
In current state it accounts and limits a number of memory-related resources
consumed by a group of processes marked with given ID (LUID).

In nearest future we are going to experiment with fair scheduler (fairsched
project).

Also, enabling CBQ and limiting network bandwidth for each VE using
Linux Advanced IP Routing is a good idea.

Filesystem
----------

To minimize problems with separate chroot fs tree, a filesystem with
"copy-on-write" concept is under development. Basically, it will
allow combine/union mount of a read-only template tree and a private
VE tree:

/template -- r/o -->
                             + combine mount --> /ve_root/<VE-ID>
/private/<VE-ID> -- r/w -->

Additionally, when a process issues a modify/delete request to a
template object, this object will been copied to private area
and a request will be redirected to this private copy.

Playing with the virtual environment system online
--------------------------------------------------

It's possible to take a look at how the system works without
patching the kernel and manual configuration of an environment.
We run the system on our own server and provide free access to it for
testing purposes. Check http://beta.asplinux.ru. Right now it has
limited subscriber capacity (254 subscribers, bounded by a configured
IP address space), thought we're going to rewrite parts of the site
scripts to add dynamic IP assignment for active VEs only.

Links
-----

 * http://www.asplinux.ru/en/install/new2.shtml
 * ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounter.html
 * http://beta.asplinux.ru/

 * ftp://ftp.asplinux.com.sg/pub/aspcomplete/ (Singapore)
 * ftp://ftp.asp-linux.com/pub/aspcomplete/ (USA)
 * ftp://ftp.asplinux.ru/pub/aspcomplete/ (Russia)
 * ftp://ftp.asp-linux.co.kr/pub/aspcomplete/ (Korea)

Contacts
--------

We are pleased to hear your comments, suggestions, opinions at
info@asp-linux.com.

(C) SWSoft Pte Ltd., 2000, http://www.sw.com.sg/



This archive was generated by hypermail 2b29 : Fri Jun 01 2001 - 10:15:39 BST