Introduction
Build and Install /dev/fanout
How /dev/fanout Works
Security and Obsolescence
Purpose: The purpose of fanout is to give Linux a simple broadcast IPC mechanism.
Our own purpose for writing the module was to distribute log messages to one or more processes that want to be notified when an event occurs. We use /dev/fanout, a web server, and XMLHttpRequest on a web client to build a telephone answering machine with multiple web interfaces running simultaneously. One nice feature of our answering machine is that the web interfaces don't use polling but still display caller ID when a new call arrives.
Common Approaches to Broadcast: The two most common broadcast mechanisms in Linux are signals and UDP packets. You can broadcast a signal to a group of related processes using the kill command with a PID of zero. This works well if all of the processes are related and if the program knows what action is required on your signal.
Signals will not work for our application since there is no way to directly route a signal from a web server to a web client, and because web servers do not know that we want to redraw certain web screens on a particular signal.
We can also broadcast events using UDP or TCP. We've built event servers which accept TCP connections and broadcast event information down each accepted connection. We use XMLHttpRequest to request a PHP page that opens the TCP connection and waits for the event. While this approach works well, it requires yet another process and has the slight extra burden of an additional TCP connection for each web client.
A Better Broadcast Approach:
A better approach would be to have something like a FIFO, but instead
of having all of the listeners compete for the single copy of the input
message, have all of the listeners get their own copy. Consider the
following bash dialog:
mkfifo event_fifo
cat event_fifo &
cat event_fifo &
cat event_fifo &
echo "Hello World" > event_fifo
Hello World
The message appears only once, since only one instance of the cat
command is given the FIFO output. Now let's consider the same
experiment using fanout:
cat /dev/fanout &
cat /dev/fanout &
cat /dev/fanout &
echo "Hello World" > /dev/fanout
Hello World
Hello World
Hello World
The message now appears once for each of the three listening cat
commands. We use bash commands just to illustrate what fanout does.
Its real power lies in letting many different programs get identical
copies of a data stream.
Which fails: send or receive? No matter how hard we try to avoid it, one day we'll find a reading process that cannot keep up with the writing process. Allocating more memory postpones the problem but does not eliminate it. When this problem occurs we have two choices: apply back pressure to the writer, causing the writer to block, or let the readers miss some output.
The problem with blocking the writing process is that you may affect other parts of the system. Our original purpose was to build a telephone answering machine and we chose to route all caller ID information through syslogd. Since we have /dev/fanout as a target in the syslog.conf file, blocking the writer would block syslogd -- not a good thing.
The author of fanout very deliberately chose to cause the reader to fail when it cannot keep up. Data is stored in a circular buffer, and if a reader cannot keep up with the writer, it will eventually ask for data that is no longer in the circular buffer. The fanout device returns an EPIPE error to the reader when this happens. In our application for /dev/fanout we are happy to protect syslogd at the expense of the web clients when we are forced to choose one over the other.
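To make the reader's side of this trade-off concrete, here is a minimal userspace sketch of a reader loop that tells an overrun apart from other failures. The names copy_stream and enum copy_result are ours and purely illustrative; the only behavior taken from the text above is that read() fails with EPIPE when the reader has fallen behind.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Hypothetical result codes for this sketch. */
enum copy_result { COPY_EOF, COPY_OVERRUN, COPY_ERROR };

/*
 * Copy everything readable from in_fd to out_fd.
 * COPY_OVERRUN means read() failed with EPIPE, which is how the text
 * says fanout reports a reader that could not keep up.
 */
enum copy_result copy_stream(int in_fd, int out_fd)
{
    char buf[512];

    assert(in_fd >= 0 && out_fd >= 0);
    for (;;) {
        ssize_t n = read(in_fd, buf, sizeof buf);
        ssize_t done = 0;

        if (n == 0)
            return COPY_EOF;
        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted, just retry */
            return errno == EPIPE ? COPY_OVERRUN : COPY_ERROR;
        }
        while (done < n) {              /* handle short writes */
            ssize_t w = write(out_fd, buf + done, (size_t)(n - done));
            if (w < 0) {
                if (errno == EINTR)
                    continue;
                return COPY_ERROR;
            }
            done += w;
        }
    }
}
```

A caller that gets COPY_OVERRUN would typically close and reopen the device and carry on, accepting that some output was missed. That is exactly the cost we chose to pay in order to protect the writer.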
Build the module with the following commands:
cd /usr/src/linux
tar -xzf fanout.tgz
cd fanout
make
When you install the module you can set the size of the circular
buffer and can set the verbosity of the printk messages. The
default buffer size is 16k and the default debug level is 2. A
debug level of 3 traces all calls in the module and a debug level
of 0 suppresses all printk messages. Here is an example that
overrides the default values for buffer size and debug level:
insmod ./fanout.ko buffersize=8192 debuglevel=3
Fanout uses a kernel-assigned major number, so you need to look
at /proc/devices to see what was assigned. The following lines
create all five of the possible instances of a fanout device.
MAJOR=`grep fanout /proc/devices | awk '{print $1}'`
mknod /dev/fanout c $MAJOR 0
mknod /dev/fanout1 c $MAJOR 1
mknod /dev/fanout2 c $MAJOR 2
mknod /dev/fanout3 c $MAJOR 3
mknod /dev/fanout4 c $MAJOR 4
If all has gone well, the "Hello World" example given above should
now work for you.
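If you would rather do the major-number lookup from a program than from the shell, the following is a minimal sketch of the same idea in C. The function name find_major is hypothetical. It assumes the usual /proc/devices layout of one "major name" pair per line, and unlike the grep above it matches the device name exactly.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan text laid out like /proc/devices and return the major number
 * registered under `name`, or -1 if no line matches. Header lines such
 * as "Character devices:" do not parse as "<number> <name>" and are
 * skipped. The first match wins.
 */
int find_major(FILE *fp, const char *name)
{
    char line[128];

    assert(fp != NULL && name != NULL);
    while (fgets(line, sizeof line, fp) != NULL) {
        int major;
        char dev[64];

        if (sscanf(line, "%d %63s", &major, dev) == 2 &&
            strcmp(dev, name) == 0)
            return major;
    }
    return -1;
}
```

In real use you would pass it the stream returned by fopen("/proc/devices", "r") and then create the nodes with the returned major and minors 0 through 4, as the mknod lines above do.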
The key to understanding how fanout works is to know a little about how read() works. If you were to open a disk file and make five read() calls with each call reading a thousand bytes, you would expect the next read to give you the data starting with byte 5000. Internally, the operating system keeps a counter, called f_pos, that remembers where you are in the file. Once you've read the first 5000 bytes, you don't normally want to read them again, and since you aren't likely to ask for them again, fanout can forget them. The mechanism used to remember only the most recent data is a circular queue.
The fanout device uses the count variable to keep track of how many bytes have been written so far. At quiescence, the readers have all read count bytes (count and f_pos are equal), and the readers are now asking for data starting at *offset (which also equals count).
When a writer adds data to the queue, the count variable is incremented by the amount added. Each of the readers must now wake and read the bytes between *offset and count. After adding data to the queue, a writer wakes any sleeping readers with the call to wake_up_interruptible() in fanout_write().
Buffer overflow: One of the fundamental decisions to make in a design is what to do when a reader cannot keep pace with the writers. In many designs you would apply flow control to the writers to slow them down to keep pace with the slowest reader. The fanout device, however, returns an error to the slow reader. Specifically, the reader gets an EPIPE error when it requests data that is no longer in the circular buffer (i.e. *offset < count - buffersize, where buffersize is the number of bytes in the circular buffer).
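The bookkeeping described above can be tried out in userspace. What follows is a toy model, not the module's code: struct model, model_write and model_read are names invented for this sketch, the buffer is only eight bytes so that overflow is easy to provoke, and a caught-up reader gets 0 back where the real device would sleep.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define BUFFERSIZE 8            /* tiny on purpose; the module default is 16k */

struct model {
    char buf[BUFFERSIZE];       /* circular buffer */
    int indx;                   /* where the next byte goes */
    long long count;            /* total bytes ever written */
};

void model_init(struct model *m)
{
    memset(m, 0, sizeof *m);
}

/* Writer: store bytes, wrapping indx back to zero at BUFFERSIZE. */
void model_write(struct model *m, const char *src, int len)
{
    int i;

    for (i = 0; i < len; i++) {
        m->buf[m->indx] = src[i];
        m->indx = (m->indx + 1) % BUFFERSIZE;
    }
    m->count += len;
}

/*
 * Reader: copy up to len unread bytes to dst and advance *offset.
 * Returns the number of bytes copied, 0 if the reader is caught up,
 * or -EPIPE if the data it wants has already been overwritten.
 */
int model_read(const struct model *m, long long *offset, char *dst, int len)
{
    int n = 0;

    assert(*offset <= m->count);
    if (*offset < m->count - BUFFERSIZE)
        return -EPIPE;
    while (n < len && *offset < m->count) {
        /* byte number k of the stream always lives at k % BUFFERSIZE */
        dst[n++] = m->buf[*offset % BUFFERSIZE];
        (*offset)++;
    }
    return n;
}
```

With two readers at offset zero, writing six bytes, letting only one reader consume them, and then writing six more leaves the idle reader at offset 0 while count - BUFFERSIZE is 4. Its next read therefore fails with EPIPE, while the reader that kept up receives the second six bytes intact.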
A reader does not immediately get an EPIPE after opening a fanout device
that's been operating for a while because, in the file open routine,
fanout_open(), we explicitly force the reader to be caught
up with the writers. The line of code that does this is:
filp->f_pos = fodp->count;
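To see why that one assignment matters, here is a small userspace illustration. The struct and function names are hypothetical stand-ins for the kernel structures; only the assignment itself and the overrun test come from the text.

```c
#include <assert.h>

/* Hypothetical userspace stand-ins for the kernel structures. */
struct device { long long count; long long buffersize; };
struct reader { long long f_pos; };

/* Mirrors filp->f_pos = fodp->count: start at the newest byte. */
void reader_open(struct reader *r, const struct device *d)
{
    assert(r && d);
    r->f_pos = d->count;
}

/* Nonzero when the reader wants data that has been overwritten. */
int reader_overrun(const struct reader *r, const struct device *d)
{
    return r->f_pos < d->count - d->buffersize;
}
```

On a device that has already passed a million bytes through a 16k buffer, a reader starting at position zero would be overrun on its very first read. A reader initialized by reader_open() is not.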
Code notes: It is said that programmers can read code and know what it does, but they cannot read a variable and know what it means. So instead of reviewing the code, we are going to review the variables.
The fanout module supports up to NUM_FO_DEVS instances
of a fanout device. NUM_FO_DEVS is currently set to five.
Each instance of a fanout device is described by the following
data structure:
struct fo {
    char *buf;                /* points to circular buffer */
    int indx;                 /* where to put next char recv'd */
    loff_t count;             /* number chars received */
    wait_queue_head_t inq;    /* readers wait on this queue */
    struct semaphore sem;     /* lock to keep buf/indx sane */
};
Let's look at each of these variables in turn:
buf: The buf variable points to the start of the buffersize number of bytes allocated for the circular queue. The memory is not allocated until the first open() on the device, and the memory is allocated using kmalloc(). Allocated memory is not freed until the module is unloaded.
indx: This variable gives the location of where to place the next byte in the circular queue. It is updated by fanout_write() as bytes are added to the queue. When indx gets to buffersize, it wraps back to zero.
count: This variable is the total number of bytes written to the device. It is updated only by fanout_write(). A reader has data to read when count is not equal to *offset.
inq: When a reader has no new data to read it blocks until new data is available. Specifically, the reading process sleeps in a call to wait_event_interruptible(). The writer's call to wake_up_interruptible() causes the readers to wake and continue execution with the lines of code immediately after the wait_event_interruptible().
sem: While writers are writing to the circular queue, there is a short time during which the count and indx variables are not yet consistent with the data in the queue. During this window of inconsistency another writer might run and inadvertently corrupt the queue. The sem semaphore prevents this by locking out other writers while one writer is updating the queue.
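The sleep-and-wake relationship between inq and the writer has a close userspace analogue in POSIX threads, which may help if you have not worked with kernel wait queues. The sketch below is only an analogy: struct channel, struct waiter and the function names are invented here, a condition variable stands in for the wait queue, and a single mutex stands in for sem. Unlike the description above, the mutex is also taken by readers, because a condition variable requires it.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct channel {
    pthread_mutex_t lock;   /* stands in for sem */
    pthread_cond_t inq;     /* stands in for the inq wait queue */
    long long count;        /* total bytes written so far */
};

struct waiter {
    struct channel *ch;
    long long offset;       /* bytes this reader has already consumed */
    long long seen;         /* value of count observed on wake-up */
};

void channel_init(struct channel *ch)
{
    int rc = pthread_mutex_init(&ch->lock, NULL);

    assert(rc == 0);
    rc = pthread_cond_init(&ch->inq, NULL);
    assert(rc == 0);
    (void)rc;
    ch->count = 0;
}

/* Writer side: account for the new bytes, then wake every reader. */
void channel_publish(struct channel *ch, long long nbytes)
{
    pthread_mutex_lock(&ch->lock);
    ch->count += nbytes;
    pthread_cond_broadcast(&ch->inq);
    pthread_mutex_unlock(&ch->lock);
}

/* Reader side: sleep until count differs from offset, return count. */
long long channel_wait(struct channel *ch, long long offset)
{
    long long now;

    pthread_mutex_lock(&ch->lock);
    while (ch->count == offset)
        pthread_cond_wait(&ch->inq, &ch->lock);
    now = ch->count;
    pthread_mutex_unlock(&ch->lock);
    return now;
}

/* Thread entry point for one reader. */
void *waiter_main(void *arg)
{
    struct waiter *w = arg;

    w->seen = channel_wait(w->ch, w->offset);
    return NULL;
}
```

Starting several waiter_main threads at offset zero and then calling channel_publish() once wakes all of them, and each one observes the same new count. That is the broadcast behavior the module provides for file readers.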
One final note on the code is the use of the private_data
pointer in the file structure. Fanout uses this field to store a pointer
to the struct fo appropriate to that file. The FanOut
Device Pointer (fodp) is usually retrieved at the start of a
routine with a line of code like this:
struct fo *fodp = (struct fo *)filp->private_data;
Known bugs: While there may be several implementation bugs, the one possible design bug is that fanout assumes that the file offset counter never wraps. This should probably be fixed.
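To get a feel for how far away a wrap is, the following back-of-the-envelope helper estimates the time needed to exhaust a signed 64-bit byte counter at a steady write rate. It assumes the counter is a signed 64-bit value, as loff_t is on Linux. The function name is ours, and the result is an estimate, not a guarantee about any particular kernel.

```c
#include <assert.h>
#include <limits.h>

/*
 * Years of continuous writing at bytes_per_second before a signed
 * 64-bit byte counter would pass LLONG_MAX (about 9.22e18).
 */
double years_until_wrap(double bytes_per_second)
{
    const double seconds_per_year = 365.25 * 24.0 * 60.0 * 60.0;

    assert(bytes_per_second > 0.0);
    return (double)LLONG_MAX / bytes_per_second / seconds_per_year;
}
```

At a sustained one million bytes per second this works out to roughly 292,000 years, so the assumption is unlikely to bite a log distribution system in practice, although the code would still be cleaner without it.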
Edwin van den Oetelaar (http://www.oetelaar.com/) used fanout as part of an audio distribution appliance. He found that a lock is needed not just on the input queue but by the readers as well. He modified fanout.c to include some extra semaphores to prevent a race condition in the fanout read routines. He also added more comments and made the code more like the code in the book, "Linux Device Drivers, third edition". (Thanks, Edwin!)
We use /dev/fanout in an appliance where, after boot, we can at least drop the system capability CAP_MKNOD when we drop the other system capabilities.
The biggest problem with /dev/fanout is that there are a limited number of fanout devices and they need to be allocated. It would be nice to have the fanout functionality in the filesystem (like FIFOs) so they can be created as needed, and so developers do not need to worry about having two programs allocate the same fanout device.