.TH MD 4
.SH NAME
md \- Multiple Device driver aka Linux Software Raid
.SH SYNOPSIS
.BI /dev/md n
.br
.BI /dev/md/ n
.SH DESCRIPTION
The
.B md
driver provides virtual devices that are created from one or more
independent underlying devices.  This array of devices often contains
redundancy, and hence the acronym RAID which stands for a Redundant
Array of Independent Devices.
.PP
.B md
supports RAID levels 1 (mirroring), 4 (striped array with parity
device), 5 (striped array with distributed parity information) and 6
(striped array with distributed dual redundancy information).  If some
number of underlying devices fails while using one of these levels,
the array will continue to function; this number is one for RAID
levels 4 and 5, two for RAID level 6, and all but one (N-1) for RAID
level 1.
.PP
.B md
also supports a number of pseudo RAID (non-redundant) configurations
including RAID0 (striped array), LINEAR (catenated array) and
MULTIPATH (a set of different interfaces to the same device).
.SS MD SUPER BLOCK
With the exception of Legacy Arrays described below, each device that
is incorporated into an MD array has a
.I super block
written towards the end of the device.  This superblock records
information about the structure and state of the array so that the
array can be reliably re-assembled after a shutdown.

The superblock is 4K long and is written into a 64K aligned block that
starts at least 64K and less than 128K from the end of the device
(i.e. to get the address of the superblock, round the size of the
device down to a multiple of 64K and then subtract 64K).
The available size of each device is the amount of space before the
super block, so between 64K and 128K is lost when a device is
incorporated into an MD array.
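.PP
A minimal sketch of the address calculation described above (the
device size used here is a purely hypothetical example):
.RS
.nf

/* superblock offset = (device size rounded down to 64K) minus 64K */
unsigned long long dev_size  = 80ULL * 1024 * 1024 * 1024;  /* hypothetical 80 GiB device */
unsigned long long sb_offset = (dev_size & ~65535ULL) - 65536ULL;

.fi
.RE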
.PP
The superblock contains, among other things:
.TP
LEVEL
The manner in which the devices are arranged into the array
(linear, raid0, raid1, raid4, raid5, multipath).
.TP
UUID
a 128 bit Universally Unique Identifier that identifies the array that
this device is part of.
.SS LEGACY ARRAYS
Early versions of the
.B md
driver only supported Linear and Raid0 configurations and so
did not use an MD superblock (as there is no state that needs to be
recorded).  While it is strongly recommended that all newly created
arrays utilise a superblock to help ensure that they are assembled
properly, the
.B md
driver still supports legacy linear and raid0 md arrays that
do not have a superblock.
.SS LINEAR

A linear array simply catenates the available space on each
drive together to form one large virtual drive.

One advantage of this arrangement over the more common RAID0
arrangement is that the array may be reconfigured at a later time with
an extra drive, so the array can be made bigger without disturbing the
data that is on the array.  However, this cannot be done on a live
array.
.SS RAID0

A RAID0 array (which has zero redundancy) is also known as a
striped array.
A RAID0 array is configured at creation with a
.B "Chunk Size"
which must be a power of two, and at least 4 kibibytes.

The RAID0 driver assigns the first chunk of the array to the first
device, the second chunk to the second device, and so on until all
drives have been assigned one chunk.  This collection of chunks forms
a
.BR stripe .
Further chunks are gathered into stripes in the same way, and are
assigned to the remaining space in the drives.
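.PP
A minimal sketch of this chunk-to-device mapping, for the simple case
where all devices are the same size (an illustration only, not the
driver's internal code; the values are hypothetical):
.RS
.nf

/* Logical chunk number c of an n-device RAID0 array lands on   */
/* device (c % n), at chunk (c / n) within that device; chunks  */
/* sharing the same value of (c / n) together form one stripe.  */
long long c = 10, n = 4;              /* illustrative values      */
long long device_index = c % n;       /* chunk 10 -> device 2     */
long long device_chunk = c / n;       /* at chunk 2 on the device */

.fi
.RE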
.PP
If devices in the array are not all the same size, then once the
smallest device has been exhausted, the RAID0 driver starts
collecting chunks into smaller stripes that only span the drives which
still have remaining space.
.SS RAID1

A RAID1 array is also known as a mirrored set (though mirrors tend to
provide reflected images, which RAID1 does not) or a plex.

Once initialised, each device in a RAID1 array contains exactly the
same data.  Changes are written to all devices in parallel.  Data is
read from any one device.  The driver attempts to distribute read
requests across all devices to maximise performance.

All devices in a RAID1 array should be the same size.  If they are
not, then only the amount of space available on the smallest device is
used.  Any extra space on other devices is wasted.
.SS RAID4

A RAID4 array is like a RAID0 array with an extra device for storing
parity.  This device is the last of the active devices in the
array.  Unlike RAID0, RAID4 also requires that all stripes span all
drives, so extra space on devices that are larger than the smallest is
wasted.

When any block in a RAID4 array is modified, the parity block for that
stripe (i.e. the block in the parity device at the same device offset
as the stripe) is also modified so that the parity block always
contains the "parity" for the whole stripe, i.e. its contents are
equivalent to the result of performing an exclusive-or operation
between all the data blocks in the stripe.
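.PP
A minimal sketch of that parity calculation (the names and sizes used
here are illustrative only, not the driver's internal code):
.RS
.nf

enum { NDATA = 4, CHUNK_BYTES = 4096 };    /* illustrative values        */
unsigned char data[NDATA][CHUNK_BYTES];    /* data blocks of one stripe  */
unsigned char parity[CHUNK_BYTES];         /* the stripe's parity block  */

/* the parity block is the byte-wise exclusive-or of the data blocks */
for (int i = 0; i < CHUNK_BYTES; i++) {
        parity[i] = 0;
        for (int d = 0; d < NDATA; d++)
                parity[i] ^= data[d][i];
}

.fi
.RE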
.PP
This allows the array to continue to function if one device fails.
The data that was on that device can be calculated as needed from the
parity block and the other data blocks.
.SS RAID5

RAID5 is very similar to RAID4.  The difference is that the parity
blocks for each stripe, instead of being on a single device, are
distributed across all devices.  This allows more parallelism when
writing, as two different block updates will quite possibly affect
parity blocks on different devices, so there is less contention.

This also allows more parallelism when reading, as read requests are
distributed over all the devices in the array instead of all but one.
.SS RAID6

RAID6 is similar to RAID5, but can handle the loss of any \fItwo\fP
devices without data loss.  Accordingly, it requires N+2 drives to
store N drives worth of data.

The performance for RAID6 is slightly lower but comparable to RAID5 in
normal mode and single disk failure mode.  It is very slow in dual
disk failure mode, however.
.SS MULTIPATH

MULTIPATH is not really a RAID at all as there is only one real device
in a MULTIPATH md array.  However there are multiple access points
(paths) to this device, and one of these paths might fail, so there
are some similarities.

A MULTIPATH array is composed of a number of logically different
devices, often fibre channel interfaces, that all refer to the same
real device.  If one of these interfaces fails (e.g. due to cable
problems), the multipath driver will attempt to redirect requests to
another interface.
.SS FAULTY
The FAULTY md module is provided for testing purposes.  A faulty array
has exactly one component device and is normally assembled without a
superblock, so the md array created provides direct access to all of
the data in the component device.

The FAULTY module may be requested to simulate faults to allow testing
of other md levels or of filesystems.  Faults can be chosen to trigger
on read requests or write requests, and can be transient (a subsequent
read/write at the address will probably succeed) or persistent
(subsequent read/write of the same address will fail).  Further, read
faults can be "fixable" meaning that they persist until a write
request at the same address.

Fault types can be requested with a period.  In this case the fault
will recur repeatedly after the given number of requests of the
relevant type.  For example if persistent read faults have a period of
100, then every 100th read request would generate a fault, and the
faulty sector would be recorded so that subsequent reads on that
sector would also fail.

There is a limit to the number of faulty sectors that are remembered.
Faults generated after this limit is exhausted are treated as
transient.

The list of faulty sectors can be flushed, and the active list of
failure modes can be cleared.
.SS UNCLEAN SHUTDOWN

When changes are made to a RAID1, RAID4, RAID5 or RAID6 array there is a
possibility of inconsistency for short periods of time as each update
requires at least two blocks to be written to different devices, and
these writes probably won't happen at exactly the same time.
Thus if a system with one of these arrays is shut down in the middle of
a write operation (e.g. due to power failure), the array may not be
consistent.

To handle this situation, the md driver marks an array as "dirty"
before writing any data to it, and marks it as "clean" when the array
is being disabled, e.g. at shutdown.  If the md driver finds an array
to be dirty at startup, it proceeds to correct any possible
inconsistency.  For RAID1, this involves copying the contents of the
first drive onto all other drives.  For RAID4, RAID5 and RAID6 this
involves recalculating the parity for each stripe and making sure that
the parity block has the correct data.  This process, known as
"resynchronising" or "resync", is performed in the background.  The
array can still be used, though possibly with reduced performance.

If a RAID4, RAID5 or RAID6 array is degraded (missing at least one
drive) when it is restarted after an unclean shutdown, it cannot
recalculate parity, and so it is possible that data might be
undetectably corrupted.  The 2.4 md driver
.B does not
alert the operator to this condition.  The 2.5 md driver will fail to
start an array in this condition without manual intervention.
.SS RECOVERY

If the md driver detects any error on a device in a RAID1, RAID4,
RAID5 or RAID6 array, it immediately disables that device (marking it
as faulty) and continues operation on the remaining devices.  If there
is a spare drive, the driver will start recreating on one of the spare
drives the data that was on that failed drive, either by copying a
working drive in a RAID1 configuration, or by doing calculations with
the parity block on RAID4, RAID5 or RAID6.

While this recovery process is happening, the md driver will monitor
accesses to the array and will slow down the rate of recovery if other
activity is happening, so that normal access to the array will not be
unduly affected.  When no other activity is happening, the recovery
process proceeds at full speed.  The actual speed targets for the two
different situations can be controlled by the
.B speed_limit_min
and
.B speed_limit_max
control files mentioned below.
.SS KERNEL PARAMETERS

The md driver recognises three different kernel parameters.
.TP
.B raid=noautodetect
This will disable the normal detection of md arrays that happens at
boot time.  If a drive is partitioned with MS-DOS style partitions,
then if any of the 4 main partitions has a partition type of 0xFD,
then that partition will normally be inspected to see if it is part of
an MD array, and if any full arrays are found, they are started.  This
kernel parameter disables this behaviour.
.TP
.BI md= n , dev , dev ,...
This tells the md driver to assemble
.BI /dev/md n
from the listed devices.  It is only necessary to start the device
holding the root filesystem this way.  Other arrays are best started
once the system is booted.
.TP
.BI md= n , l , c , i , dev...
This tells the md driver to assemble a legacy RAID0 or LINEAR array
without a superblock.
.I n
gives the md device number,
.I l
gives the level, 0 for RAID0 or -1 for LINEAR,
.I c
gives the chunk size as a base-2 logarithm offset by twelve, so 0
means 4K, 1 means 8K.
.I i
is ignored (legacy support).
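.IP
For example, a legacy RAID0 array with a 4K chunk size might be
assembled at boot time with a parameter such as the following (the
device names here are purely illustrative):
.RS
.nf

md=0,0,0,0,/dev/hda1,/dev/hdb1

.fi
.RE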
.SH FILES
.TP
.B /proc/mdstat
Contains information about the status of currently running arrays.
.TP
.B /proc/sys/dev/raid/speed_limit_min
A readable and writable file that reflects the current goal rebuild
speed for times when non-rebuild activity is current on an array.
The speed is in Kibibytes per second, and is a per-device rate, not a
per-array rate (which means that an array with more drives will
shuffle more data for a given speed).  The default is 100.
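.IP
As a minimal sketch, the limit can be raised from a program by writing
a decimal number to the file (the value shown is purely illustrative):
.RS
.nf

#include <stdio.h>

int main(void)
{
        /* 500 is an arbitrary example value, in KiB/sec per device */
        FILE *f = fopen("/proc/sys/dev/raid/speed_limit_min", "w");
        if (f) {
                fprintf(f, "%d", 500);
                fclose(f);
        }
        return 0;
}

.fi
.RE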
.TP
.B /proc/sys/dev/raid/speed_limit_max
A readable and writable file that reflects the current goal rebuild
speed for times when no non-rebuild activity is current on an array.
The default is 100,000.
.SH SEE ALSO
.BR mdadm (8),
.BR mkraid (8).