
Protection rings within applications

From	"Marven Lee" <marven10@gmail.com>
Newsgroups	alt.os.development, comp.arch
Subject	Protection rings within applications
Date	2012-04-05 11:04 +0100
Message-ID	<9u593lF5htU1@mid.individual.net> (permalink)

Cross-posted to 2 groups.


(I decided to cross-post to comp.arch, as I've seen user-mode protection
rings mentioned there before, and the current talk of user-mode
interrupts and signal handling in the M68k thread is somewhat related,
so it might be of interest.)

I've been thinking of splitting user processes into 2 or more protection
rings in my OS. This might be useful for some kind of virtualization,
or for running a shell in the more privileged part of a process and
commands within the least privileged ring, possibly dividing the least
privileged ring into multiple sandboxes.

I've always assumed that privilege levels above user processes were
just slightly less privileged parts of the kernel. I guess that is how
the driver ring of OS/2 worked.  I'm not sure how the rings of VMS
work: whether the 3 more privileged rings are all global, or whether the
supervisor ring is per process (or the VMS equivalent of a process).

So I've thought of some ways to implement application protection rings
and sandboxes using either segmentation or paging.

With segmentation on x86 it is relatively easy to split a process into two
rings using PL2 and PL3. If user mode spans from 0gb to 2gb then it is
possible to set up the PL2 and PL3 code and data segments as:

PL0 code and data - base: 0gb   limit: 4gb (kernel)
PL2 code and data - base: 0gb   limit: 2gb (user)
PL3 code and data - base: 0gb   limit: 1gb (user)
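
A minimal sketch of how the three rows above could be encoded as 32-bit
x86 segment descriptors. The macro and function names are made up for
illustration, and loading the result into a GDT or LDT is left out; only
the standard 8-byte descriptor layout is assumed.

```c
#include <assert.h>
#include <stdint.h>

/* Access-byte bits of a code/data segment descriptor. */
#define SEG_PRESENT   0x80u
#define SEG_DPL(n)    (((n) & 3u) << 5)   /* descriptor privilege level */
#define SEG_CODE_DATA 0x10u               /* S=1: code or data segment  */
#define SEG_CODE_RX   0x0Au               /* code, execute/read         */
#define SEG_DATA_RW   0x02u               /* data, read/write           */

/* Build an 8-byte descriptor with 4 KiB granularity (hypothetical helper).
 * size_bytes is the segment size: a non-zero multiple of 4 KiB, at most
 * 4 GiB. */
static uint64_t make_descriptor(uint32_t base, uint64_t size_bytes,
                                uint8_t access)
{
    assert(size_bytes != 0 && (size_bytes & 0xFFFu) == 0);
    assert(size_bytes <= (1ULL << 32));

    uint32_t limit = (uint32_t)(size_bytes >> 12) - 1; /* last valid page */
    uint64_t d = 0;

    d |= (uint64_t)(limit & 0xFFFFu);             /* limit bits 0..15  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;      /* base bits 0..23   */
    d |= (uint64_t)access << 40;                  /* access byte       */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;  /* limit bits 16..19 */
    d |= (uint64_t)0xCu << 52;                    /* G=1, D/B=1        */
    d |= (uint64_t)(base >> 24) << 56;            /* base bits 24..31  */
    return d;
}
```

With this helper the PL3 code segment would be
make_descriptor(0, 1ULL << 30, SEG_PRESENT | SEG_DPL(3) | SEG_CODE_DATA | SEG_CODE_RX),
and the PL2 and PL0 segments differ only in the size and the DPL.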

Then add a call_monitor() trap gate to call into the PL2 ring and
ensure the iopl field in the eflags register is 0.

A better option would be to use only PL3 segments for the two
user rings and adjust the base and limit of the PL3 segments using
two system calls, call_monitor() and return(). These could be
implemented as signals, with the addition of a new signal, sigmonitor.
All other unmasked signals would trap into the more privileged ring
so preemption would be possible.

int call_monitor (int call_idx, void *args);
void _sigreturn (struct ret_context *ret_context);

call_idx could be passed to the signal handler in the
siginfo->si_code field and a pointer to the args in the
siginfo->si_value.sival_ptr.
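
A rough sketch of what the monitor-side handler might look like,
assuming the kernel fills in si_code and si_value as described. The
call table, the two example calls and the use of SIGUSR1 as a stand-in
for the proposed sigmonitor are all hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>

#define MONITOR_NCALLS 2              /* hypothetical number of calls */

typedef int (*monitor_call_fn)(void *args);

static int last_result;               /* result of the most recent call */

/* Two toy monitor calls; args points at an int array in the caller. */
static int mon_add(void *args) { int *a = args; return a[0] + a[1]; }
static int mon_neg(void *args) { int *a = args; return -a[0]; }

static const monitor_call_fn monitor_calls[MONITOR_NCALLS] = {
    mon_add, mon_neg
};

/* Dispatch on si_code (call_idx) and pass si_value.sival_ptr (args). */
static void sigmonitor_handler(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo;
    (void)ucontext;
    assert(info != NULL);

    int idx = info->si_code;
    if (idx < 0 || idx >= MONITOR_NCALLS) {
        last_result = -1;             /* unknown call index */
        return;
    }
    last_result = monitor_calls[idx](info->si_value.sival_ptr);
}

/* Register the handler; SIGUSR1 stands in for the new signal here. */
static int install_monitor(void)
{
    struct sigaction sa = {0};
    sa.sa_sigaction = sigmonitor_handler;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGUSR1, &sa, NULL);
}
```

On an existing POSIX system si_code is set by the kernel, so the handler
can only be exercised by calling it directly with a hand-built siginfo_t;
in the proposed design the kernel would fill in both fields when
call_monitor() traps.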

call_monitor() and other signals would expand the PL3 segment to the
full 0-2gb, obviously switch stacks and return into the signal handler.
Only the call_monitor() system call would be allowed from the
least privileged ring.

Returning from a signal via _sigreturn() would do the opposite,
setting the base and limit of the segment to 0-1gb and returning to
the stack within this ring.

For more flexibility ret_context can hold a base and limit value
and on each call to _sigreturn() the PL3 segment base and limits
can be adjusted.  So the least privileged ring doesn't have to
be from 0-1gb,  it can be from 64mb to 128mb for example.
The lower 1gb portion of the address space could be split into
several sandboxes and _sigreturn() would be used to switch to
a particular sandbox.  The address space could be laid out
like this:

Monitor   (segment base:     0  limit:  2gb)
...
Sandbox 3 (segment base: 128mb  limit: 64mb)
Sandbox 2 (segment base:  64mb  limit: 64mb)
Sandbox 1 (segment base:   0mb  limit: 64mb)
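
A small sketch of how the base and limit for such a layout might be
carried in ret_context and checked by the kernel on _sigreturn(). The
struct fields, the fixed 64mb slot size and both helper names are
assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MB(n)         ((uint32_t)(n) << 20)
#define USER_END      UINT32_C(0x80000000)  /* user space is 0-2gb     */
#define SANDBOX_AREA  MB(1024)              /* sandboxes live in 0-1gb */
#define SANDBOX_SIZE  MB(64)

/* Hypothetical layout; saved registers etc. would follow in practice. */
struct ret_context {
    uint32_t seg_base;    /* new PL3 segment base          */
    uint32_t seg_limit;   /* new PL3 segment size in bytes */
};

/* Fill ctx for sandbox n (1-based), using fixed 64mb slots. */
static int sandbox_context(unsigned n, struct ret_context *ctx)
{
    assert(ctx != NULL);
    if (n < 1 || n > SANDBOX_AREA / SANDBOX_SIZE)
        return -1;
    ctx->seg_base  = (n - 1) * SANDBOX_SIZE;
    ctx->seg_limit = SANDBOX_SIZE;
    return 0;
}

/* Kernel-side check: the segment must be non-empty and must not reach
 * beyond user space. Written to avoid 32-bit overflow. */
static int ret_context_valid(const struct ret_context *ctx)
{
    if (ctx->seg_limit == 0 || ctx->seg_base > USER_END)
        return 0;
    return ctx->seg_limit <= USER_END - ctx->seg_base;
}
```

Keeping a sandbox from overlapping the monitor's own memory would be the
monitor's job here; the kernel check only keeps the segment inside user
space.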

The monitor would have to implement copyin() and copyout() functions
to access the data in the less privileged ring. These would have to catch
sigsegv signals using setjmp()/longjmp() for example.
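
One possible shape for copyin(), sketched with sigsetjmp()/siglongjmp()
on an ordinary POSIX system. The function names and the single global
jump buffer are assumptions (so this is not thread-safe), and jumping
out of a SIGSEGV handler relies on behaviour that common Unix systems
provide but the C standard does not guarantee.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>

static sigjmp_buf copy_fault_env;
static volatile sig_atomic_t copy_in_progress;

/* On a fault during a copy, unwind back into copyin(). */
static void copy_fault_handler(int signo)
{
    if (copy_in_progress)
        siglongjmp(copy_fault_env, 1);
    /* Fault outside a copy: restore the default action so that the
     * faulting instruction kills the process when it is retried. */
    signal(signo, SIG_DFL);
}

/* Install the fault handler for SIGSEGV and SIGBUS. */
static int copy_init(void)
{
    struct sigaction sa = {0};
    sa.sa_handler = copy_fault_handler;
    sigemptyset(&sa.sa_mask);
    if (sigaction(SIGSEGV, &sa, NULL) != 0)
        return -1;
    return sigaction(SIGBUS, &sa, NULL);
}

/* Copy len bytes from the sandbox into monitor memory.
 * Returns 0 on success, -1 if the source was not readable. */
static int copyin(void *dst, const void *sandbox_src, size_t len)
{
    assert(dst != NULL);
    volatile unsigned char *d = dst;
    const volatile unsigned char *s = sandbox_src;

    /* savemask=1 so the signal mask is restored after siglongjmp(). */
    if (sigsetjmp(copy_fault_env, 1) == 0) {
        copy_in_progress = 1;
        for (size_t i = 0; i < len; i++)
            d[i] = s[i];
        copy_in_progress = 0;
        return 0;
    }
    copy_in_progress = 0;
    return -1;
}
```

copyout() would be the same loop with the roles of the two pointers
swapped. With the segment scheme the monitor would also have to add the
sandbox base to any pointer handed to it by the sandbox before copying.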

Not every CPU has segmentation, so to make this more portable,
paging alone can be used to implement the protection rings in user
mode. Using 2 page directories per process it is possible to implement
similar protection rings:

Page Dir 1 - maps user 0 - 2gb , kernel 2-4gb
Page Dir 2 - maps user 0 - 1gb , kernel 2-4gb

Again two system calls, call_monitor() and _sigreturn() are used to
transfer between rings by switching page directories. Of course this
is slower than using just segments or altering the segment base and
limits. The page directory entries of the 2nd page directory need to
be altered whenever the base and limit of a sandbox changes. Also the
granularity of the sandbox is limited to 4mb or whatever number
of pages a page table holds.
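
As a sketch of the bookkeeping this implies, assuming 32-bit x86 paging
without PAE (1024 page directory entries of 4mb each) and page tables
shared between the two directories. The function names and the plain
arrays standing in for page directories are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PDE_SHIFT    22                    /* each PDE covers 4mb    */
#define PDE_SPAN     (UINT32_C(1) << PDE_SHIFT)
#define PDE_COUNT    1024u
#define USER_PDES    512u                  /* user space is 0-2gb    */
#define USER_END     UINT32_C(0x80000000)

struct pde_range { uint32_t first; uint32_t count; };

/* Turn a sandbox base/size into a range of page directory entries.
 * Both values must be 4mb aligned, which is the granularity limit. */
static int sandbox_pde_range(uint32_t base, uint32_t size,
                             struct pde_range *out)
{
    assert(out != NULL);
    if (size == 0 || (base & (PDE_SPAN - 1)) || (size & (PDE_SPAN - 1)))
        return -1;
    if (base > USER_END || size > USER_END - base)
        return -1;
    out->first = base >> PDE_SHIFT;
    out->count = size >> PDE_SHIFT;
    return 0;
}

/* Rebuild the sandbox directory from the full one: kernel entries are
 * always copied, user entries only when they lie inside the sandbox. */
static void build_sandbox_dir(const uint32_t *full_dir,
                              uint32_t *sandbox_dir, struct pde_range r)
{
    for (uint32_t i = 0; i < PDE_COUNT; i++) {
        int user   = i < USER_PDES;
        int inside = i >= r.first && i < r.first + r.count;
        sandbox_dir[i] = (!user || inside) ? full_dir[i] : 0;
    }
}
```

A real kernel would also need to flush the affected TLB entries after
changing the sandbox directory.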

The ret_context of _sigreturn() could have a relocation flag that indicates
that the pages of a sandbox should always be mapped starting at address
zero. For example if the sandbox exists between 16mb to 20mb, the
relocation flag would map it between 0mb to 4mb in the sandbox page
directory. That way addresses would all be relative to 0 from within the
sandbox.
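
With relocation the monitor has to translate any address the sandbox
hands it. A minimal sketch of that translation, where struct sandbox and
the function name are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one sandbox as seen by the monitor. */
struct sandbox {
    uint32_t base;       /* where the sandbox lives in the monitor view */
    uint32_t size;       /* size in bytes                               */
    int      relocated;  /* non-zero: sandbox sees itself at address 0  */
};

/* Convert an address as seen inside the sandbox into the address the
 * monitor must use. Returns 0 on success, -1 if out of range. */
static int sandbox_to_monitor(const struct sandbox *sb, uint32_t sb_addr,
                              uint32_t *mon_addr)
{
    assert(sb && mon_addr);
    if (sb->relocated) {
        if (sb_addr >= sb->size)
            return -1;
        *mon_addr = sb->base + sb_addr;
    } else {
        if (sb_addr < sb->base || sb_addr - sb->base >= sb->size)
            return -1;
        *mon_addr = sb_addr;
    }
    return 0;
}
```

For the 16mb to 20mb example, sandbox address 0x1000 becomes 0x01001000
in the monitor's view when the relocation flag is set.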

I've read that user-mode Linux did something similar to some of the above,
protecting the guest OS by restricting the segment limits of guest OS 
processes to 1GB and supporting a page directory per guest OS
process.

It's a pity x86-64 long mode doesn't support segmentation. It could
have allowed many large 4GB+ sandboxes in a single address space.
Perhaps a mode a bit like virtual-8086 mode could have been added,
with a single base and limit.


-- 
Marv

