Copyright © 2021 Plan 9 Foundation.
Distributed under the MIT License.

The Plan 9 file protocol, 9P, defines per-connection state such as 
open files and pending requests.  When a connection is lost due to
network interruptions or machine failures, that state is lost.
The Plan 9 kernels make no attempt to reestablish per-connection state.
Instead, each application must, if it wishes to continue running
across a network failure, remount its file servers and reopen any files.
.PP
Modifying every program to accommodate connection failures
would be difficult and error-prone, especially in environments
such as the Internet, where connection failures are common.
.CW Recover 
was a program originally written by Russ Cox for Plan 9 second edition.
It interposes itself, on the client side, between a 9P client and a
9P server in order to recover broken connections and to help clients
survive server failures.  It decouples the client from the server by
keeping a record of the state held by the server and pushing that
state back if the connection breaks.  Instead of seeing an
``i/o on hungup channel'' error, clients observe only a momentary
pause in file system operations while the connection recovers.
.AE
.SH
Introduction
.LP
Plan 9 [Pike95] is a flexible distributed system. It owes its versatility to three simple principles.
First, resources are named and accessed like files in a hierarchical file system.
Second, there is a standard protocol, called 9P, for accessing these
resources.
Third, the disjoint hierarchies provided by different services are
joined together into a single private hierarchical file name space.
As resources are represented as files and universally accessed through 9P, recover
was written as an interposer between a 9P server and a 9P client.
.PP
A 9P
.I server
.I intro (5)¹
.FS
¹ From now on, a number in parentheses beside a word, like
.I proc (3),
is a reference to a section
with an entry for that word in the Plan 9 manual [9man].
.FE
is an agent that provides one or more hierarchical file systems
\(em file trees \(em
that may be accessed by clients.
A server responds to requests by
clients
to navigate the hierarchy,
and to create, remove, read, and write files.
The prototypical server is a separate machine that stores
large numbers of user files on permanent media;
such a machine is called, somewhat confusingly, a
.I file
.I server .
Another possibility for a server is to synthesize
files on demand, perhaps based on information in data structures
inside the kernel; the
.I proc (3)
kernel
device is a part of the Plan 9 kernel that does this.
User programs can also act as servers.
.PP
A
.I connection
to a server is a bidirectional communication path from the client to the server.
There may be a single client or
multiple clients sharing the same connection.
.PP
9P2000 is the most recent version of 9P, the Plan 9 
distributed resource protocol.  
It is a typical client/server protocol with request/response
semantics for each operation (or transaction).  9P can be used over any
reliable, in-order transport.  While the most common usage is over 
pipes 
.\" (footnote that pipes is a bit of a simplification of channels)
on the same machine or over TCP/IP to remote machines, 
it has been used on a variety of different media and encapsulated in several
different protocols.
.PP
9P has 12 basic operations, all of which are initiated by the clients.
Each request (or T-message) is satisfied by a single associated response 
(or R-message).  In the case of an error, a special response (R-error) is
returned to the client containing a variable-length string error message.
The operations summarized in the following table fall into
three categories: session management, file operations, and meta-data operations.
.DS
.TS
box, center;
cb | cb | cb
a | a | a .
class	op-code	desc
=
session	version	version & parameter negotiation
management	auth	authentication 
	attach	establish a connection
	flush	abort a request
	error	return an error
_
file	walk	lookup files and directories
operations	open	open a file 
	create	create and open a file 
	read	transfer data from a file
	write	transfer data to a file
	clunk	release a file
_
metadata	stat	read file attributes
operations	wstat	modify file attributes
.TE
.DE
.PP
The combined act of transmitting a request of a particular type
and receiving
its reply is called a transaction
of that type.
.PP
The 9P protocol is stateful.  Its state is represented by an abstraction called the fid,
a 32-bit unsigned integer that the client uses in a T-message to identify
the ``current file'' on the server.
Fids are somewhat like file descriptors in a user process,
but they are not restricted to files open for I/O:
directories being examined, files being accessed by
.I stat (2)
calls, and so on \(em all filesystem elements being manipulated by the operating
system \(em are identified by fids.
Fids are chosen by the client.
All requests on a connection share the same fid space;
when several clients share a connection,
the agent managing the sharing must arrange
that no two clients choose the same fid.
Fids are used as handles for navigating the hierarchy and, in general,
to multiplex the channel.  Each fid has its
own state on the server, including its current path, whether it is open,
and the current offset
for an open file or directory.  This makes
the implementation easier and faster, as only the current operation
relative to the fid's context
needs to be communicated to the server.
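.PP
The per-fid state such an interposer must record can be sketched as follows.
This is an illustrative sketch in portable C; the type and field names are
assumptions, not recover's actual declarations:

```c
#include <assert.h>
#include <string.h>

enum { MAXWELEM = 16 };

/* Sketch of the state an interposer must record for each fid.
   Hypothetical names; recover's real structure differs. */
typedef struct FidState FidState;
struct FidState {
	unsigned int	fid;		/* client-chosen 32-bit identifier */
	char	*path[MAXWELEM];	/* walk elements from the root */
	int	npath;
	int	isopen;			/* has the fid been opened? */
	int	mode;			/* open mode, if open */
	unsigned long long	offset;	/* current file offset */
	int	ready;			/* rewalked on the current connection? */
};

/* Record one element walked from the root; enough to replay
   the walk after a reconnection. */
void
fidwalk(FidState *f, char *elem)
{
	if(f->npath < MAXWELEM)
		f->path[f->npath++] = elem;	/* a real version would copy */
}
```

On reconnection the saved path would be replayed with Twalk requests and,
if the open flag is set, the file reopened with the saved mode.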

While statefulness has many benefits, it also has an important
downside.  If a connection breaks,
all the state of the fids is lost.  Discarding this state is essential
for garbage collection on the servers,
but when a connection breaks or a server reboots,
all the clients are left with a context which is no longer valid.  Any application
using a file system
in this state would see an ``i/o on hungup channel'' message in Plan 9 and would
need to be restarted.  A diskless machine whose root file system came over such a
connection would have the same problem and would need to reboot.
.PP
9P is now being used in environments requiring a higher
degree of robustness than the research environments for which it was designed.  A mechanism
for re-establishing broken connections and recovering state (particularly in
the face of server errors) is therefore important.  Clients must be able to
keep running across a network failure or a server reboot.  The ability
to fail over a connection between redundant file servers is also desirable.
.PP
We needed a 9P interposer capable of recording the state of the fids of a connection
and reestablishing that state after an error.  The interposer would run on the
client side and should work with an unmodified server and client.  Ideally the
client would simply block while the recovery takes place.  In the normal case
of no errors, the performance penalty added by the interposer should be negligible.
Recover is a program meant to provide this interposer.

.SH
Architecture
.LP
As mentioned previously,
recover interposes itself on a 9P connection, on the client side.
It dials a network connection to the server and serves a file in
.CW /srv
through which the original file system can be mounted via recover.
.PP
Recover is composed of two processes:
.I listensrv
and
.I listennet .
A shared
lock arbitrates access to shared resources.
.I Listensrv
listens for T-messages from the client via the srv file and forwards
them to the server through the network connection.
.I Listennet ,
on the other side,
listens for R-messages from the server via the network connection and
sends them to the client through the srv file.  Each T-R message pair
corresponds to a Request structure.  When a T-message arrives, it is processed
and the corresponding request is allocated, with the tag of the message
as identifier.  When the response arrives, the request can be looked up in a hash table
keyed by the tag.
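.PP
The tag-keyed request table can be sketched as follows (an illustrative
sketch in portable C; the names are hypothetical, not recover's own):

```c
#include <assert.h>
#include <stddef.h>

enum { NHASH = 64 };

/* One outstanding 9P transaction, keyed by its tag. */
typedef struct Request Request;
struct Request {
	unsigned short	tag;	/* tag carried by the T- and R-message */
	int	isinternal;	/* generated by recover, not the client */
	Request	*next;
};

static Request *hash[NHASH];

/* A T-message has been forwarded: remember its request under its tag. */
void
addreq(Request *r)
{
	Request **h = &hash[r->tag % NHASH];

	r->next = *h;
	*h = r;
}

/* An R-message arrived: find and unlink the matching request. */
Request*
delreq(unsigned short tag)
{
	Request **h;

	for(h = &hash[tag % NHASH]; *h != NULL; h = &(*h)->next)
		if((*h)->tag == tag){
			Request *r = *h;
			*h = r->next;
			return r;
		}
	return NULL;	/* e.g. response to a flushed request */
}
```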

.KF
.PS
copy "arch.pic"
.PE
.Cs
Figure 1. Recover Architecture
.CE
.KE

.PP
There are two different kinds of requests: internal and external.
External requests are generated by the client and forwarded
to the server.  Internal requests are generated by recover itself in the
event of a connection failure.  The functions internalresponse and
externalresponse are called by listennet when it reads an R-message from
the server.  The kind of response is specified by the flag
isinternal in the Request structure.
.PP
When
.I listensrv
wants to send a request, it calls
.I queuereq .
.I Queuereq
tries
to send the request unless the fid is not ready, which means
the connection is down or a recovery is in progress.  If the connection is down,
.I listennet
will eventually find out and call
.I redial ,
the function used to reconnect.
When
.I redial
is called, all
the queued requests which could not be transmitted are retried.
Before the retransmission, though, the remote fid state (the
state associated with the fid in the connection between recover and the server)
has to be looked up to
see if the fid has been rewalked.  If it has not, internal requests are sent to rewalk the saved
fid path
and the operation stays queued.  Once the Rwalks are received, all the
external requests relative to the fid are restarted until all operations complete.
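.PP
The decision made when queueing a request can be sketched as follows
(illustrative standard C with hypothetical names; recover's real
structures carry much more state):

```c
#include <assert.h>

/* Sketch of the queueing decision: a request is forwarded immediately
   only if the connection is up and its fid has been rewalked on the
   current connection; otherwise it stays queued until redial retries it. */
enum { QUEUED, SENT };

typedef struct Conn Conn;
struct Conn {
	int	up;		/* is the network connection alive? */
};

typedef struct Fid Fid;
struct Fid {
	int	rewalked;	/* reestablished on the current connection? */
};

int
queuereq(Conn *c, Fid *f)
{
	if(!c->up || !f->rewalked)
		return QUEUED;	/* redial will rewalk and retry */
	return SENT;
}
```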
.PP
Normally on
.I redial ,
after the
.CW version ,
.CW attach ,
and
.CW auth
messages exchanged to initiate the session, the only
extra work needed is to rewalk the fids and open the ones which were open in the lost connection.
After that, the outstanding requests can be sent.  There is a special case,
though.  Directories cannot be seeked in 9P.  As a consequence, there is
implicit state in the server associated with a directory fid which has to be
pushed back into it: the offset the last read got to.  There are two possible
solutions for this.  As it is not possible to start a read at an arbitrary offset,
one solution would be to read from the start
up to that offset and throw away the results once the offset is
restored.  This can get complicated if the directory has changed.  We opted for the
other solution, which is to reread the complete directory.  This is
done by rewriting the offset from the client to make it zero on the server.  The only
problem this approach could generate, repeated entries in the resent
reads, is already handled: binds in Plan 9 already produce duplicate directory
entries, so clients must cope with them anyway.
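.PP
The offset rewriting for directory reads can be sketched as follows
(illustrative standard C; the names are hypothetical):

```c
#include <assert.h>

/* Sketch of the directory-offset rewriting: after a redial the
   server's read offset is gone, so recover resets it to zero and
   rereads the directory from the start. */
typedef struct Dirfid Dirfid;
struct Dirfid {
	unsigned long long	srvoff;	/* next offset valid on the server */
};

/* Offset to place in the forwarded Tread for a directory. */
unsigned long long
mapoffset(Dirfid *d, unsigned long long clientoffset)
{
	if(clientoffset == 0)	/* the client itself restarts the read */
		d->srvoff = 0;
	return d->srvoff;	/* directory reads are sequential anyway */
}

/* Account for the bytes returned by the matching Rread. */
void
advance(Dirfid *d, unsigned int count)
{
	d->srvoff += count;
}

/* After redial: force a reread of the directory from offset zero. */
void
reconnected(Dirfid *d)
{
	d->srvoff = 0;
}
```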
.PP
Another interesting exception is ORCLOSE files, which the server removes
when their fid is clunked.  If the connection breaks,
they disappear.  Instead of having ORCLOSE files disappear under us, we
rewrite the open and create messages to turn them into normal files.  When the clunk for
the fid comes, we remove the file ourselves.  We also forbid opening
exclusive-use files under recover in order to prevent deadlock, which
can be a tough problem, for example with mail boxes.
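.PP
The mode rewriting can be sketched as follows (illustrative standard C;
the ORCLOSE value matches Plan 9's open(2), the other names are
hypothetical):

```c
#include <assert.h>

enum { ORCLOSE = 64 };	/* remove on clunk, as defined in open(2) */

typedef struct Fid Fid;
struct Fid {
	int	rclose;	/* client asked for ORCLOSE */
};

/* Strip ORCLOSE from a Topen/Tcreate mode before forwarding it,
   remembering that the client requested it. */
int
rewritemode(Fid *f, int mode)
{
	f->rclose = (mode & ORCLOSE) != 0;
	return mode & ~ORCLOSE;
}

/* On Tclunk: must recover remove the file itself? */
int
mustremove(Fid *f)
{
	return f->rclose;
}
```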

.PP
9P admits a specifier for
.CW attach
operations, which includes a user name and a string naming a file tree on
the server.  Only one mount specifier works with recover at the moment.  To support more than
one mount specifier, a new
.CW attach
would have to be processed, which would require authentication.
This is not easy to implement, because one would have to push the dialog
with factotum into a per-fid state machine.
All the authentication is done now at startup, with no client mixing operations with
ours: we just send the auth, negotiate the keys with factotum, do the auth RPCs over
the authentication fid, and finally make the attach transaction.  Once started, on receiving
an
.CW attach
from a client, we just convert it to a
.CW walk
to clone the root fid.  Doing the
.CW attach
and
.CW auth
only at startup is much easier, because it is simple to send a request and read the connection for a reply
without worrying about transactions for other fids getting in between.
For multiple specifiers this would have to be multiplexed over the different fids, mixed with other
operations, or at least one pair of listensrv and listennet processes would have to run per specifier, which then creates the problem
of managing and communicating among the processes.
.PP
New attach messages for the same specifier are rewritten as 
.CW walks. 
This is done in order to create
a new fid for the new client. As long as no new specifier is needed, this works well.
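.PP
The rewriting can be sketched as follows (illustrative standard C; the
message structure is simplified, though the type codes match fcall(2)):

```c
#include <assert.h>

enum { Tattach = 104, Twalk = 110 };	/* type codes from fcall(2) */

/* Simplified 9P message; a real Fcall has many more fields. */
typedef struct Msg Msg;
struct Msg {
	int	type;		/* Tattach or Twalk */
	unsigned int	fid;	/* fid proposed by the client */
	unsigned int	newfid;	/* for Twalk: fid to create */
	int	nwname;		/* for Twalk: number of path elements */
};

/* A Tattach for the already-attached specifier becomes a Twalk with
   no path elements, cloning recover's root fid into the client's fid. */
void
attach2walk(Msg *m, unsigned int rootfid)
{
	m->type = Twalk;
	m->newfid = m->fid;	/* the client's new fid... */
	m->fid = rootfid;	/* ...cloned from recover's root fid */
	m->nwname = 0;		/* zero-element walk = clone */
}
```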

.SH
Debugging
.PP
Probably the most difficult part of writing any software is debugging.  This is especially true for something
like recover, because the network can fail at any point.  If we view the recover server as a state
machine, the failure can come in any of the states of the machine, and the number of possible
states is very big.  We kept finding bugs in states the file system had not been to before.  This is compounded
by the fact that recover does its job when things fail, and normally things do not fail.

.PP
Two simple observations helped us develop a debugging test environment which has made
recover very stable with very little effort.  The first is that the state machine of recover
is big, but its state is pushed onto the server by the client.  The second is that the quantum
in which this state is pushed is a 9P message.  So the only thing we have to do to get an environment
which goes through many of the representative states of the software is to do simple operations on
the file system and break the connection after every N messages.  We can do this
more than once, breaking the connection by closing the file after every (n1, n2, n3...) messages, where each of these numbers ranges from one to the number of messages in the operation we are trying to
test.

.PP
We ended up with a vector of message numbers we can apply consistently to a newly run recover
and a test operation.  We applied this environment with (n1, n2) for every possible value of the tuple for
all the file system operations, like open, read, and stat, and some compound operations.  We found
some bugs which were very easy to correct, as they were completely deterministic.
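.PP
The fault-injection rule can be sketched as follows (illustrative
standard C with hypothetical names): a proxy counts 9P messages and
breaks the connection each time the count reaches the next element of
the vector.

```c
#include <assert.h>

typedef struct Breaker Breaker;
struct Breaker {
	const int	*vec;	/* break after these message counts */
	int	nvec;
	int	i;		/* next entry to apply */
	int	count;		/* messages seen since the last break */
};

/* Called once per forwarded 9P message; returns 1 when the
   connection should be broken now. */
int
shouldbreak(Breaker *b)
{
	if(b->i >= b->nvec)
		return 0;	/* vector exhausted: run normally */
	if(++b->count >= b->vec[b->i]){
		b->count = 0;	/* break here, restart the count */
		b->i++;
		return 1;
	}
	return 0;
}
```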

.SH
Performance
.PP
We have two versions of recover, which share almost all the code.  One is for Plan 9; the other
is for Plan9ports [P9ports].  Plan9ports is a port of most of the software running on Plan 9 to Unix-like
systems.  We measured the performance of both using Postmark [Postmark], which we
ported to Plan 9.  We did all the measurements with 16384 transactions.  In Plan 9 the results were what we expected.
Using recover on the loopback, we had roughly a factor of two degradation of
latency for each operation.  This is because most of the time is spent on context switches between kernel
and user space.  As the number of these doubles (recover runs in user space), the performance is halved.
These measurements can be seen in figure 2.  Over the network, the latency added by the network hides the latency
added by recover.
The effect of using recover is around ten percent or less performance degradation.  This is shown in
figure 3, which depicts measurements for 100 Mbps and gigabit Ethernet.

.PP
On the other hand, on Linux, the performance was worse.  We got roughly one third of the performance when
we added recover on the loopback.  We looked more deeply into the matter and came up with figure 4.
The first measurement in figure 4 stands for Postmark run over a directory mounted through the loopback
from a server running on the same host.
The second stands for a measurement of the same server but mounted over a named pipe
which serves the loopback connection.  This is the
normal way to use the network in Plan9ports; it emulates the behaviour of
the srv file system in Plan 9.  The third is the measurement of Postmark run
through recover, with recover going directly through the network.
The two measurements which are equivalent (both go through a named pipe) are the second
and the third, so we take a twenty percent performance loss because of recover.  The huge difference
between the first and the second measurement points to a problem in the way the named pipe is managed.
It could be argued that these results (as in the case of Plan 9) would be hidden by a fast network.  We did the
same measurements over gigabit Ethernet.  The results, shown in figure 5, show that this is a problem
even over gigabit Ethernet.  Recover gets even worse results in these measurements too.  It has to be
taken into account that this
performance loss does not appear in Plan 9, which shares almost all the code.  This points to a problem
in some of the libraries or some of the system infrastructure of Plan9ports.
We profiled the system using OProfile [Oprofile] and found that it spends most of its time in locks,
so it is probably a problem with the thread library, though this issue has to be investigated more thoroughly.

.WS 1
.KF
.BP p9_local.eps 2.04i
.Cs
Figure 2. Postmark with and without recover on the loopback on Plan 9
.CE
.KE

.WS 1
.KF
.BP p9.eps 2.04i
.Cs
Figure 3. Postmark with and without recover over the network on Plan 9
.CE
.KE

.WS 1
.KF
.BP linux_local.eps 2.04i
.Cs
Figure 4. Postmark with and without recover over the loopback on Linux
.CE
.KE

.WS 1
.KF
.BP linux_giga.eps 2.04i
.Cs
Figure 5. Postmark with and without recover on a gigabit network on Linux
.CE
.KE



.SH
Related Work
.PP
PhilW's kernel modifications try to reconnect broken connections from within the kernel.
.PP

.I Aan (8)
tunnels traffic between a client and a server through a
persistent network connection.  If the connection breaks, the aan client
re-establishes the connection by redialing the server.

.PP
Aan uses its own protocol to make sure no data is ever lost,
even when the connection breaks.  After a reconnection, aan
retransmits all unacknowledged data between client and
server.

.PP
Aan requires a modified server to establish the other end of the tunnel.  As a consequence,
it cannot be used with unmodified file servers.  Aan also works at the network level, so it does not
understand the meaning of the file operations running over it.  As a consequence, it does not work
in the event of the server hanging or rebooting, because the state of the aan connection is lost.
It cannot do failover either, for the same reason.
.PP
Redirfs is a program which exports a mounted file system over a 9P connection, with the same
purpose as recover.  Some of the applications we are using recover with do not have a Plan 9
kernel on the client side, just a lightweight library kernel and a 9P connection to the server.
We needed a 9P-to-9P interposer, so redirfs did not work for us.

.SH
Future Work
.LP
Some synthetic file systems cannot be used with recover as it is now, especially in the event of a
server reboot.  One example is /net (see, for example,
.I ip (3)).
In /net some operations,
like creating a connection, are composed of many file operations which in isolation do not mean anything.
Also, some files, like IP connections, are not replaceable.  We are trying to find ways
to handle the file systems we use.  Some of the ideas behind Plan B's /net [Net] may provide a
solution.

.PP
Recover is now a user-space program.  It could be integrated into the kernel to make it faster.
Given the results obtained on Plan 9, we do not think integrating recover into the kernel is
necessary for normal users.  Recover is normally used over a network whose latency is
high enough that the performance gain would not be worth it.
.PP
Users who run recover over the loopback and need very
high performance may be interested in doing so, because it would probably double the performance.
On Linux some other issues have to be dealt with first, so that the performance of recover becomes comparable
to that of Plan 9.

.PP
In some cases, applications may need to know that a reconnection has happened.  How this is
done is not clear.  One way would be to return an error, and perhaps write a library wrapper to hide it
and wait for it on a specific interface, so that legacy applications keep working.

.SH
Conclusion
.LP
9P is stateful, which makes it simpler and more
effective.  Recover removes the main downside of this approach
by providing high availability and failover for file systems
in the case of a server shutdown or a broken connection.  It provides a safety layer, effectively
isolating the client from the server losing state.
