Zumastor design and implementation notes

Snapshots versus the origin.

When a snapshot is taken, it is empty; all reads and writes get passed through to the origin. When an origin write happens, the affected chunk or chunks must be copied to any extant snapshots. (Such a copied-out chunk is referred to as an "exception" to the rule that all chunks belong to the origin.) The make_unique() function checks whether the chunk needs to be copied out, does so if necessary, and returns an indication of whether that happened.
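To make the copy-out decision concrete, here is a minimal sketch of the logic make_unique() embodies. Everything in it is illustrative: chunk_t, struct volume, chunk_is_shared(), copy_chunk_to_snapshots() and set_exception() are invented names, and the real ddsnap code is organized differently.

    #include <stdint.h>

    typedef uint64_t chunk_t;                 /* index of a fixed-size chunk */
    struct volume;                            /* per-volume state (opaque here) */

    int  chunk_is_shared(struct volume *v, chunk_t chunk);          /* still shared with snapshots? */
    void copy_chunk_to_snapshots(struct volume *v, chunk_t chunk);  /* perform the copy-out */
    void set_exception(struct volume *v, chunk_t chunk);            /* record that snapshots own a copy */

    /* Returns nonzero if a copy-out was performed, zero if the chunk
     * was already unique to the origin. */
    static int make_unique(struct volume *v, chunk_t chunk)
    {
        if (!chunk_is_shared(v, chunk))      /* already an "exception"? */
            return 0;                        /* nothing to copy */
        copy_chunk_to_snapshots(v, chunk);   /* copy to every extant snapshot */
        set_exception(v, chunk);             /* snapshot reads now use the copy */
        return 1;                            /* copy-out happened */
    }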

Path of a bio request through dm-ddsnap and ddsnapd.

The path a bio request takes as a result of a read from a snapshot device that goes to the origin (aka "snapshot origin read").

The path a bio request takes as a result of a write to the origin device (aka "origin write").

The path a bio request takes as a result of a write to a snapshot device (aka "snapshot write").

Startup.

The Zumastor system of processes is started as a side effect of creating a Zumastor volume with the "zumastor define volume" command. Zumastor starts the user-space daemons (the ddsnap agent and the ddsnap server), then commands the devmapper to create the actual devices; this eventually results in a call to the kernel function ddsnap_create() via the constructor callback invoked by the devmapper. This function creates the control, client and worker threads, which proceed as outlined below.
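The constructor has the usual device-mapper shape. The sketch below shows only that shape; the struct devinfo fields, thread functions and thread names are invented for illustration, argument parsing and error handling are omitted, and the real ddsnap_create() differs in detail.

    #include <linux/device-mapper.h>
    #include <linux/kthread.h>
    #include <linux/slab.h>

    struct devinfo {                          /* per-target state (simplified) */
        struct task_struct *control_thread, *client_thread, *worker_thread;
    };

    int control_thread_fn(void *data);        /* services the control socket */
    int client_thread_fn(void *data);         /* talks to the ddsnap server  */
    int worker_thread_fn(void *data);         /* processes queued bio work   */

    /* Constructor invoked by the devmapper when zumastor creates the devices. */
    static int ddsnap_create(struct dm_target *ti, unsigned int argc, char **argv)
    {
        struct devinfo *info = kzalloc(sizeof(*info), GFP_KERNEL);

        if (!info)
            return -ENOMEM;
        /* ... parse argv: devices, snapshot tag, agent socket, ... */

        info->control_thread = kthread_run(control_thread_fn, info, "ddsnap-control");
        info->client_thread  = kthread_run(client_thread_fn,  info, "ddsnap-client");
        info->worker_thread  = kthread_run(worker_thread_fn,  info, "ddsnap-worker");

        ti->private = info;
        return 0;
    }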

Generally, when it is started the ddsnap agent just waits for connections; it sends no messages, and all further operations are performed on behalf of clients. When a new connection arrives, the agent accepts it, allocates a client structure, and adds it to its client list. It adds the fd for that client to the poll vector and later uses the offset therein to locate the client information. After startup the ddsnap server (ddsnapd) operates the same way.
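That accept-and-poll pattern looks roughly like the following user-space sketch. MAX_CLIENTS, struct client, new_client() and handle_message() are placeholders and error handling is omitted; the point is only that each accepted fd is appended to the poll vector and its offset identifies the client from then on.

    #include <poll.h>
    #include <sys/socket.h>

    #define MAX_CLIENTS 64

    struct client;                                     /* per-connection state (opaque here) */
    struct client *new_client(int fd);                 /* allocate and register a client     */
    void handle_message(struct client *c, int fd);     /* service one incoming request       */

    static void serve(int listen_fd)
    {
        struct pollfd fds[MAX_CLIENTS + 1];
        struct client *clients[MAX_CLIENTS + 1];
        int nfds = 1;

        fds[0].fd = listen_fd;                         /* slot 0: the listening socket */
        fds[0].events = POLLIN;

        for (;;) {
            poll(fds, nfds, -1);
            if ((fds[0].revents & POLLIN) && nfds <= MAX_CLIENTS) {
                int cfd = accept(listen_fd, NULL, NULL);   /* new connection          */
                clients[nfds] = new_client(cfd);           /* allocate client struct  */
                fds[nfds].fd = cfd;                        /* add fd to poll vector   */
                fds[nfds].events = POLLIN;
                fds[nfds].revents = 0;                     /* not polled yet          */
                nfds++;                                    /* offset locates it later */
            }
            for (int i = 1; i < nfds; i++)
                if (fds[i].revents & POLLIN)
                    handle_message(clients[i], fds[i].fd); /* client at offset i */
        }
    }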

Locking

When a chunk is shared between the origin and one or more snapshots, any reads of that chunk must go to the origin device. A write to that chunk on the origin, however, may change the shared data and must therefore be properly serialized with any outstanding reads.

Locks are created when a read of a shared chunk takes place; a snaplock is allocated if necessary and a hold record is added to its "hold" list. When a write to such a chunk takes place, we first copy the chunk out to the snapshots with which it is shared, so that future snapshot reads no longer need to lock the chunk. We then check whether the chunk has already been locked. If it has, we create a "pending" structure and append it to the eponymous list on the snaplock, thereby queueing the write for later processing. When all outstanding reads of that chunk have completed, the chunk is unlocked and the queued writes are allowed to complete.
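The data involved might be pictured as below. These structure and field names are descriptive guesses for illustration, not the actual ddsnap definitions.

    #include <stdint.h>

    typedef uint64_t chunk_t;           /* index of a fixed-size chunk       */
    struct bio;                         /* a block I/O request (kernel type) */

    struct hold {                       /* one outstanding read of the locked chunk */
        struct bio *read_bio;           /* the read that was diverted to the origin */
        struct hold *next;
    };

    struct pending {                    /* an origin write queued behind those reads */
        struct bio *write_bio;          /* resubmitted when the chunk is unlocked    */
        struct pending *next;
    };

    struct snaplock {                   /* one lock per chunk with reads in flight */
        chunk_t chunk;                  /* the locked chunk                        */
        struct hold *holds;             /* the "hold" list: reads still pending    */
        struct pending *pending;        /* writes waiting for the unlock           */
        struct snaplock *next;          /* list/hash linkage                       */
    };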

Replication and Hostnames

Configuration

Zumastor currently stores replication target information on the source host by volume name and the name of the target host. For each volume, the replication targets are keyed by hostname, which is generated from the output of the "uname -n" command on the target host. That is, for a volume "examplevol," the configuration for replication to a volume on the host "targethost.dom.ain.com" is stored in the directory /var/lib/zumastor/volumes/examplevol/targethost.dom.ain.com/. This directory holds files recording the progress of an ongoing replication, as well as "port," which contains the port on which the target will listen for a snapshot to be transmitted, and "hold," which contains the name of the snapshot that was last successfully transmitted to the target and is therefore being "held" as the basis for future replication.

On the target host, replication source information is stored by volume name only, in the directory "source," e.g. /var/lib/zumastor/volumes/examplevol/source/. Files here include "hostname," which contains the name of the source host (and currently must match the output of "uname -n" on that host); "hold," which contains the name of the snapshot that was last successfully received from the source and corresponds directly to the source-side file of the same name; and "period," which contains the number of seconds between automatic replication cycles.
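An illustrative layout combining the two sides, for the example volume above:

    Source host:
        /var/lib/zumastor/volumes/examplevol/targethost.dom.ain.com/
            port      port the target will listen on for the snapshot transmission
            hold      last snapshot successfully sent; basis for future replication
            (plus files recording the progress of an ongoing replication)

    Target host:
        /var/lib/zumastor/volumes/examplevol/source/
            hostname  name of the source host (must match "uname -n" there)
            hold      last snapshot successfully received; mirrors the source's "hold"
            period    seconds between automatic replication cycles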

Sanity checking

The zumastor command checks the source hostname recorded on the target host given in a "replicate" command (its "source/hostname" file) against the actual output of "uname -n" on the source host. The output of "uname -n" is also used when triggering replication via the "nag" daemon, which runs a command on the source host to write the string "target/<hostname>" to the "trigger" named pipe.

Replication

When the "zumastor replicate" command is given (on the source host), it does an "ssh" to the target host (given on the command line) to get the contents of the "source/hostname" configuration file. If the contents of that file don't match the output of "uname -n" on the source host, it logs an error and aborts. Otherwise, after some setup it does another "ssh" to the target to run the command "zumastor receive start," giving the volume and the TCP port to which the data transmission process will connect. The target host prepares to receive the replication data and starts a "ddsnap delta listen" process in the background, to wait for the data connection. The source host, still running the "replicate" command, issues a "ddsnap transmit" command, giving in addition to other parameters the target hostname and port to which the data connection will be made. When the "ddsnap transmit" command completes, the source host does yet another "ssh" to the target to run the command "zumastor receive done," also giving the volume and TCP port.

Other uses of "uname -n"

During configuration on a target host, when the administrator runs the "zumastor source" command, the script does an "ssh" to the given source host to retrieve the size of the volume to be replicated. Also on the target host, the "nag" daemon (used to force a replication periodically) does an "ssh" to the source host and writes the string "target/<uname -n output>" to the "trigger" named pipe there, which triggers replication from the source to the target.

Glossary