<saw@sw.com.sg>
This patch provides accounting and allows limits to be configured
for users' consumption of exhaustible system resources.
The most important resource controlled by this patch is unswappable memory
(either mlock'ed or used by internal kernel structures and buffers).
The main goal of this patch is to protect processes
from running short of important resources because of accidental
misbehavior of processes or malicious activity aiming to ``kill'' the system.
It is worth mentioning that resource limits configured by setrlimit(2)
do not give an acceptable level of protection because they cover only a small
fraction of resources and work on a per-process basis. Per-process
accounting doesn't prevent malicious users from spawning a lot of
resource-consuming processes.
Although the main use of this patch is accounting and limiting the amount
of resources consumed by the processes of each user, it may be used
to control resource use by any group of processes with a
common ``luid''.
A ``luid'' is assigned to unaccounted processes (only) and is inherited over
fork.
The user beancounter patch modifies the core parts of the kernel (like the virtual to physical address translation code) and thus should be as compact and efficient as possible. Some functionality and system administrator convenience have been sacrificed to achieve this compactness and efficiency.
All accounting and limiting is provided on a per-luid basis.
A luid is assigned by the setluid system call and is inherited over
forks. Once assigned to a process, it cannot be revoked or
changed in the future. When a process creates new objects consuming resources
(like new processes, struct file, and so on), these objects also grab a
reference to the luid of the process, and the used resources are accounted.
Thus, objects do not change their luid reference and cannot get the
reference in the middle of their existence. Such an architecture simplifies
things a lot.
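The reference rules above can be illustrated with a small user-space model. This is only a sketch: the type and function names (beancounter, task_model, object_model, model_setluid, model_fork, model_create_object) are hypothetical and are not taken from the patch.

```c
#include <stddef.h>

/* Hypothetical model of a per-luid accounting record. */
struct beancounter {
    unsigned int luid;
    unsigned long refs;      /* number of tasks and objects referring to it */
};

struct task_model {
    struct beancounter *bc;  /* NULL while the task is unaccounted */
};

struct object_model {
    struct beancounter *bc;  /* fixed at creation, never changed later */
};

/* Assign a luid once; a second assignment is refused. */
static int model_setluid(struct task_model *t, struct beancounter *bc)
{
    if (t->bc != NULL)
        return -1;           /* already assigned: cannot be revoked or changed */
    t->bc = bc;
    bc->refs++;
    return 0;
}

/* The child inherits the parent's beancounter reference. */
static struct task_model model_fork(const struct task_model *parent)
{
    struct task_model child = { parent->bc };

    if (child.bc != NULL)
        child.bc->refs++;
    return child;
}

/* A new object grabs the creator's reference at creation time only. */
static struct object_model model_create_object(const struct task_model *creator)
{
    struct object_model obj = { creator->bc };

    if (obj.bc != NULL)
        obj.bc->refs++;
    return obj;
}
```

Because a reference is taken only at creation and never replaced, the model needs no code path that moves charges from one beancounter to another.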
Resource use limits are just limits, and do not provide ``wait-until-available'' functionality. The limits are organized as two thresholds. The exact meaning of these thresholds is resource-specific. In general, after the first threshold is reached, creation of new resource-consuming objects is denied, and the system tries to inform applications about the resource shortage gracefully. The second threshold is the upper bound for the resource consumption, which is maintained even by means of abruptly killing the offending process.
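The two-threshold policy can be sketched as a tiny user-space model. The names below (ub_resource, barrier, limit, ub_charge, ub_uncharge) are assumptions made for illustration and do not describe the patch's actual data structures.

```c
/* Hypothetical model of one accounted resource. */
struct ub_resource {
    unsigned long held;     /* amount currently charged */
    unsigned long barrier;  /* first threshold: new objects are refused */
    unsigned long limit;    /* second threshold: absolute upper bound */
};

enum ub_severity {
    UB_SOFT,  /* request that can fail gracefully (e.g. fork, mlock) */
    UB_HARD   /* request that cannot be reported back to the application */
};

/* Try to charge `amount`. Soft requests are checked against the barrier,
 * hard requests against the limit. Returns 0 on success and -1 if the
 * charge is denied; the caller never waits for the resource.
 * In the real kernel a denied hard request means the process is killed. */
static int ub_charge(struct ub_resource *r, unsigned long amount,
                     enum ub_severity sev)
{
    unsigned long bound = (sev == UB_SOFT) ? r->barrier : r->limit;

    /* Overflow-safe form of: held + amount > bound */
    if (amount > bound || r->held > bound - amount)
        return -1;
    r->held += amount;
    return 0;
}

/* Return previously charged units. */
static void ub_uncharge(struct ub_resource *r, unsigned long amount)
{
    r->held = (amount > r->held) ? 0 : r->held - amount;
}
```

With barrier 4 and limit 6, a soft request is refused as soon as it would push the total past 4, while a hard request still succeeds until the total would exceed 6.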
To clarify this policy, let's consider the limit for unswappable memory.
When the first threshold is reached, subsequent fork, mlock
and other calls start to fail.
The application should handle these failures
and terminate its work correctly. When the second threshold is reached,
all accounted kernel memory allocations will fail for this process. Such
an allocation may happen inside, for example, the page fault handler, which
creates memory images of mapped files under normal circumstances. When
the ``hard'' limit is reached, the kernel cannot notify the application and
has no other choice than to kill it.
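On the application side, handling a first-threshold failure means checking the result of calls such as mlock instead of assuming success. A minimal sketch using the standard mlock(2) call follows; the helper name lock_buffer is hypothetical.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Lock a buffer into memory and report failure instead of assuming success.
 * Returns 0 on success or the errno value reported by mlock on failure. */
static int lock_buffer(void *buf, size_t len)
{
    if (mlock(buf, len) != 0) {
        int err = errno;  /* save errno before calling other functions */

        fprintf(stderr, "mlock of %lu bytes failed: %s\n",
                (unsigned long)len, strerror(err));
        return err;
    }
    return 0;
}
```

A caller that gets a non-zero result can release what it already holds and exit cleanly, which is the graceful path the first threshold is meant to allow.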
The initial version of this patch was developed by Alan Cox and Andrey Savochkin for early 2.2 kernels after some discussion on the linux-kernel mailing list.
Currently, the patch is maintained by Andrey Savochkin, and for 2.4 kernels only. I try to keep it up-to-date, but I don't make ports to each new testing or pre-release kernel.
The latest version accounts for the following resources:
struct task, page directories, etc.
mlock'ed pages.
siginfo structures.
The really important resources are unswappable memory, IPC SHM segment size, and the number of processes. Other resources are rather auxiliary.
Unswappable memory is a resource consumed by applications indirectly.
Unswappable memory areas are created on fork calls (different internal
kernel structures like struct task), on memory management calls (page
directories for virtual to physical address translation), and so on.
Certain call patterns may lead to all available physical memory being occupied
by this kind of data, and to the inability to free enough physical memory by
swapping out or any other means.
The patch provides the basic protection, which needs to be extended by
accounting of more sources of unswappable memory allocations.
IPC SHM segment size is another resource for which the user beancounter patch provides efficient protection against IPC abuse and denial-of-service attacks. The IPC SHM API has several defects, one of which is the lack of automatic garbage collection. An automatic garbage collector keeps reference counters for objects and releases the resource when the object becomes unreferenced. Such garbage collection exists for files, for example. However, the IPC SHM API requires explicit deletion of SHM segments. Such a deletion may be accidentally or deliberately omitted, which leads to wasted memory. Creating a lot of SHM segments without deleting them may also work as a denial-of-service attack.
The number of processes is limited on the IA32 architecture. This limit exists because each process requires a GDT entry, the number of which is limited by the CPU architecture. The GDT entry limit is the main reason for accounting and limiting the number of processes run by each user.
Other accounted quantities do not correspond to exhaustible resources directly.
For example, the number of mlock'ed pages is included in the accounting
of the number of unswappable pages. However, administrators may wish to set the
unswappable page limit to large values to allow users to spawn a lot of
processes. In this case the administrator may limit the users' ability
to mlock pages to prevent abuse of the high unswappable memory limits.
The initial ideas are described in http://www.uwsg.indiana.edu/hypermail/linux/kernel/0006.2/0748.html, and my current view in MemoryManagement.html.
Current code does:
Memory is charged for the socket at the moment of its creation. It would definitely be better to charge the memory actually used, but this policy has several ideological problems that are under investigation now.
The current places of the accounting hooks are:
struct socket gets a reference to the beancounter (from current->login_bc) in sock_alloc;
sk_alloc call from protocol family specific creation routines;
on struct sock creation it gets a beancounter reference, and the amount of charged memory is stored in the structure;
sk_free uncharges the memory and drops the reference to the beancounter;
setsockopt calls charge the difference in the socket buffer size.
The core of the problem with accounting at the moment of buffer space consumption is the following.
If the memory limit is reached at a send call, the
call should sleep. Returning an error isn't an option here: applications
don't expect such errors from send, or treat them as fatal.
It means that
code like
consume_kernel_memory; result = send(...); free_memory;
will deadlock, being unable to progress and free memory back because of reaching the memory limit;
poll and a subsequent send must agree on
whether send would block, even if the poll and the send are
separated in time and other memory and buffer consumption calls are made
in between; that's not so easy from a technical point of view and may require
some forward reservation.
There are two possible policies of receive buffer management:
First of all, the summary of control of finite resources. There are
At this moment, basic protection exists for almost all obvious exhaustible resources (except TCP and UDP ports).
But we cannot be sure that all possibilities for denial-of-service attacks
are closed. From a theoretical point of view, it would be better to ensure that
each non-trivial operation, each kmalloc, is charged. In practice, it's
impossible. There are a lot of places where the subject the resource should
be charged to isn't obvious (not current!), or where the limit can't be
enforced. Socket buffer accounting (see the Sockets section) is a clear
example of such a situation.
So, the only possible way here is to spot suspicious places in the kernel and
add resource control calls suitable for them.
Certainly, comments and patches are welcome!
Some parts of user beancounter patch appear to need redesign because the tests have revealed problems. These parts are
kernel/user.c written by Linus and intended to deal with per-user resource control is necessary.
Administrators should also be given a way to implement some policy and to control memory management (i.e. how processes share the pageable memory and the page cache, and how swap-out works), then disk bandwidth, and so on. These matters are to be considered in the future.
This section describes user beancounter API for applications.
There is a well-known conflict between kernel and libc header files. The prototypes of the system calls below are presented as they may be used for making direct calls, without libc modifications.
long sys_getluid(void);
Returns the luid of the process.
Returns an error (ENOENT currently; please suggest a better code)
if a luid hasn't been assigned to this process yet.
Beware: this call (and all the subsequent ones) fails if the beancounter feature isn't compiled into the kernel. Do not assume that the call always succeeds, and do not make assumptions about which error codes you may get in return.
long sys_setluid(uid_t uid);
Sets the luid of the process.
The call succeeds only for privileged processes (CAP_SETUID currently)
and only if a luid hasn't been assigned to this process yet.
Returns 0 on success.
Documented error codes are EPERM and EINVAL.
long sys_setublimit(uid_t uid, unsigned long resource, unsigned long *limits);
Sets resource limit number resource for luid uid.
Returns 0 on success.
Documented error codes are EPERM and EINVAL.
The operation is privileged and requires the CAP_SYS_RESOURCE capability.
Currently, if the given luid hasn't been assigned to a living process, the call
fails with EINVAL.
The following constants are defined in linux/beancounter.h at this
moment.
#define UB_KMEMSIZE 0
#define UB_LOCKEDPAGES 1
#define UB_TOTVMPAGES 2
#define UB_SHMPAGES 3
#define UB_ZSHMPAGES 4
#define UB_NUMPROC 5
#define UB_RESPAGES 6
#define UB_SPCGUARPAGES 7
#define UB_OOMGUARPAGES 8
#define UB_NUMSOCK 9
#define UB_NUMFLOCK 10
#define UB_NUMPTY 11
#define UB_NUMSIGINFO 12
Their meaning is briefly described in section Current Status.
A short example:
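#include <errno.h>              /* the _syscallN macros store errors in errno */
#include <linux/unistd.h>
#include <linux/resource.h>
#include <linux/beancounter.h>

/* Direct system call stubs; libc has no wrappers for these calls. */
static _syscall0(long, getluid);
static _syscall1(long, setluid, uid_t, uid);
static _syscall3(long, setublimit, uid_t, uid, unsigned long, resource, unsigned long *, limits);

void f(void)
{
        unsigned long limits[2];

        /* Assign luid 500 to this process; this can be done only once. */
        setluid(500);
        /* Set both thresholds of the process count limit to 4. */
        limits[0] = 4;
        limits[1] = 4;
        /* setublimit takes a luid, so pass the luid just assigned. */
        setublimit(getluid(), UB_NUMPROC, limits);
}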
Libc doesn't have wrappers for the newly created system calls, so the code has to make the system calls directly.
The current version of the patch is available at
ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/user_beancounter-IV-current.
It is against the 2.4.0-test7 kernel.
The history of changes in the current branch of the patch (IV)
is available at
ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/UserBeancounterChangeLog.
The patch introduces two new kernel configuration options:
CONFIG_USER_RESOURCE and CONFIG_USER_RESOURCE_PROC.
The first one enables user beancounter functionality, and the second provides
information about used resources and limits through
/proc/user_beancounters.
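Reading the /proc file from a program can be sketched as follows. The layout of /proc/user_beancounters is not described here, so the sketch treats it as opaque text and only copies it; the helper name dump_beancounters is hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Copy the beancounter report at `path` to `out`.
 * Returns the number of newline-terminated lines copied, or -1 if the file
 * cannot be opened (for example, when the kernel was built without
 * CONFIG_USER_RESOURCE_PROC). */
static long dump_beancounters(const char *path, FILE *out)
{
    FILE *in = fopen(path, "r");
    char chunk[512];
    long lines = 0;

    if (in == NULL)
        return -1;
    while (fgets(chunk, sizeof chunk, in) != NULL) {
        fputs(chunk, out);
        if (strchr(chunk, '\n') != NULL)
            lines++;  /* count a line only when its end has been read */
    }
    fclose(in);
    return lines;
}
```

A call such as dump_beancounters("/proc/user_beancounters", stdout) prints the report, and the -1 result lets the caller tell a kernel without the option apart from an empty report.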
There is a small program to play with the patch:
ftp://ftp.sw.com.sg/pub/Linux/people/saw/kernel/user_beancounter/ulim4.c.
It takes the resource number and its ``soft'' and ``hard'' limits as
arguments and starts /bin/bash (check include/linux/beancounter.h
for resource numbers).
All child processes of the started shell
will have the same luid (i.e. belong to a single accounting group).
Watch resource use through /proc and try to exceed the limits!
Thanks to Marcelo Tosatti <marcelo@conectiva.com.br>,
Andrey Moruga,
Vlad Bolkhovitin, and
Alexey Raschepkin
for contributions to the patch.
$Id: UserBeancounter.sgml,v 1.10 2000/09/09 08:23:44 saw Rel $