This forcibly removed the bad-block log. There can be situations where it is hard to
remove bad blocks by writing to them - partiularly on RAID5.
Signed-off-by: NeilBrown <neilb@suse.com>
This commit does the following jobs:
1. rename is_clustered to dlm_funs_ready since it match the
function better.
2. st->cluster_name can't be use to identify the raid is a
clustered or not, we should check the bitmap's version to
perform the identification.
3. for cluster_get_dlmlock/cluster_release_dlmlock funcs, both
of them just need the lockid as parameter since the cluster
name can get by get_cluster_name().
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
"mdadm -X DISK" is used to report information about a bitmap
file, it is better to not display all the related infos if
bitmap is cleared with "--bitmap=none" under grow mode.
To do that, the locate_bitmap is changed a little to have a
return value based on MD_FEATURE_BITMAP_OFFSET.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
This patch tries recreates missing/faulty journal in mdadm.
Example:
./mdadm --fail /dev/md1 /dev/sdb2
mdadm: set /dev/sdb2 faulty in /dev/md1
./mdadm --stop /dev/md1
mdadm: stopped /dev/md1
./mdadm -A --scan --force
mdadm: Journal is missing or stale, starting array read only.
mdadm: /dev/md/1 has been started with 15 drives.
./mdadm --add-journal /dev/md1 /dev/sdb2
mdadm: added /dev/sdb2
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Add sysfs_array_state to struct mdinfo, and add GET_ARRAY_STATE to
options of sysfs_read.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
32 bit signed timestamps will overflow in the year 2038.
Change the user interface mdu_array_info_s structure timestamps:
ctime and utime values used in ioctls GET_ARRAY_INFO and
SET_ARRAY_INFO to unsigned int. This will extend the field to last
until the year 2106.
Add time_after/time_before and supporting typecheck from
the kernel to take care of unsigned time wraparound.
The long term plan is to get rid of ctime and utime values in
this structure as this information can be read from the on-disk
meta data directly.
v0.90 on disk meta data uses u32 for maintaining time stamps.
So this will also last until year 2106.
Assumption is that the usage of v0.90 will be deprecated by
year 2106.
Timestamp fields in the on disk meta data for v1.0 version already
use 64 bit data types.
Signed-off-by: NeilBrown <neilb@suse.com>
As discussed, standalone require_journal() in struct superswitch
is not a very good idea. Instead, journal related information
fits well in struct mdinfo.
This patch simplifies journal support code in Assemble and
Incremental as:
- Add journal_device_required and journal_clean to struct mdinfo;
- Remove function require_journal from struct superswitch;
- Update Assemble and Incremental to use journal_device_required
and journal_clean from struct mdinfo (instead of separate var).
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Let libcmap lib and related funs also only need one-time
setup during mdadm running period.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Modifying an exiting device's superblock or creating a new superblock
on an existing device needs to be checked because the device could be
in use by another node in another array. So, we check this by taking
all superblock locks in userspace so that we don't step onto an active
device used by another node and safeguard against accidental edits.
After the edit is complete, we release all locks and the lockspace so
that it can be used by the kernel space.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Example output:
./mdadm --assemble /dev/md0 /dev/sd[c-f] /dev/sdb1
mdadm: /dev/md0 has been started with 4 drives and 1 journal.
mdadm checks superblock for journal devices. If the journal device
is missing or faulty, mdadm will show warning
./mdadm --assemble /dev/md0 /dev/sd[c-q] /dev/sdb1
mdadm: Not safe to assemble with missing or stale journal device, consider --force.
User can insist to start the array (read only) with --force
./mdadm --assemble /dev/md0 /dev/sd[c-q] /dev/sdb1 --force
mdadm: Journal is missing or stale, starting array read only.
mdadm: /dev/md0 has been started with 15 drives.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
Specify the write journal device with --write-journal DEVICE
./mdadm --create -f /dev/md0 --assume-clean -c 32 --raid-devices=4 --level=5 /dev/sd[c-f] --write-journal /dev/sdb1
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
Only one journal device is allowed. If multiple --write-journal
are given, mdadm will use the first and ignore others
./mdadm --create -f /dev/md0 --assume-clean -c 32 --raid-devices=4 --level=5 /dev/sd[c-f] --write-journal /dev/sdb1 --write-journal /dev/sdx
mdadm: Please specify only one journal device for the array.
mdadm: Ignoring --write-journal /dev/sdx...
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
Signed-off-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Shaohua Li <shli@fb.com>
Signed-off-by: NeilBrown <neilb@suse.com>
If sysfs node existed, we should try to write "re-add" to it.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.com>
This does make mdassemble a bit bigger, but it also means
it actually works properly with named arrays.
Ref: https://bbs.archlinux.org/viewtopic.php?id=198196
Signed-off-by: NeilBrown <neilb@suse.com>
These both have the same value, and have done since the
'devnm' concept was introduced.
So discard the pointless duplicate.
Signed-off-by: NeilBrown <neilb@suse.de>
This extends nodes option for assemble mode, make the num of
cluster node could be change by user.
Before that, it is necessary to ensure there are enough space
for those nodes, calc_bitmap_size is introduced to calculate
the bitmap size of each node.
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
To support change the cluster name, the commit do the followings:
1. extend original write_bitmap function for new scenario.
2. add the scenarion to handle the modification of cluster's name
in write_bitmap1.
3. let the cluster name also show in examine_super1 and detail_super1
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A clustered disk is added by the traditional --add sequence.
However, other nodes need to acknowledge that they can "see"
the device. This is done by --cluster-confirm:
--cluster-confirm SLOTNUM:/dev/whatever (if disk is found)
or
--cluster-confirm SLOTNUM:missing (if disk is not found)
The node initiating the --add, has the disk state tagged with
MD_DISK_CLUSTER_ADD and the one confirming tag the disk with
MD_DISK_CANDIDATE.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The home-cluster is stored in the bitmap super block of the
array. The device can be assembled on a cluster with the
cluster name same as the one recorded in the bitmap.
If home-cluster is not specified, this is auto-detected using
dlopen corosync cmap library.
neilb: allow code to compile when corosync-devel is not installed.
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Specifies the maximum number of nodes in the cluster that may use
this device simultaneously. This is equivalent to the number of
bitmaps created in the internal superblock (patches to follow).
Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com>
Signed-off-by: Guoqing Jiang <gqjiang@suse.com>
Signed-off-by: NeilBrown <neilb@suse.de>
make dprintf() print program name and __func__, so that
this messaging is consistent.
Also remove all __func__ messages from pr_err(). We shouldn't
leak that internal data in error message.
If we really want function name there, we new pr_XXX might
be wanted.
Signed-off-by: NeilBrown <neilb@suse.de>
Sometimes mdadm prints messages with wrong name "mdmon",
and vice versa.
This patch solves this problem by changing method of determining
process name.
Now "Name" will be set in const at start of a program,
previously was hardcoded as #define.
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
CREATE bbl=no
in mdadm.conf will cause any devices added to an array
to not have a bad block list. By default they do for 1.x
metadata.
This is useful if you are suspicious of the bad-block-list
implementation.
Reported-by: Ethan Wilson <ethan.wilson@shiftmail.org>
Signed-off-by: NeilBrown <neilb@suse.de>
If 'prepare_update' fails for some reason there is little
point continuing on to 'process_update'.
For now only malloc failures are caught, but other failures
will be considered in future.
Signed-off-by: NeilBrown <neilb@suse.de>
Due to several changes in code assemble with disks
spanned between different controllers can be obtained
in some cases. After IMSM container will be assembled, check HBA of
disks, and print proper warning if mismatch is detected.
Signed-off-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: Artur Paszkiewicz <artur.paszkiewicz@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit 5e76dce1ac changed
Grow_continue to assume a fork had already happened, so that
mdadm --grow --continue
didn't fork. This is good, but it means that if Grow_continue
is run from Assemble, then
mdadm --assemble ....
can misbehave if the array was in the middle of a reshape.
So introduce finer control. Grow_continue only assumes it has
already forked if run from "mdadm --grow --continue".
Signed-off-by: NeilBrown <neilb@suse.de>
Subsequent patch will allow the background part of "mdadm --grow" to
be run from systemd. This can require the passing of a backup file
name.
To do this, store that name as a symlink in /run/mdadm (or MAP_DIR)
and look for it when appropriate.
It might be useful to also store the name across reboot, but that
would be a different patch. We would need to use the uuid to identify
it, and store it in stable storage.
Signed-off-by: NeilBrown <neilb@suse.de>
As soon as the array is assembled, udev or systemd might run
fsck and mount it. So we need to drop O_EXCL promptly.
Signed-off-by: NeilBrown <neilb@suse.de>
--incremental currently fails if the device name passed does not
textually match the names permitted by the DEVICE line in mdadm.conf.
This is problematic when "mdadm -I" is run by udev as the name given
can be a temp name.
This patch makes two improvements:
1/ We generate a list of all existing devices that match the names
in mdadm.conf, and allow rdev based matching
2/ We allows extra aliases to be provided on the command line, and
perform textual matching on those. This is particularly suitable
for udev usages as ${DEVLINKS} can be provided even though the links
make not yet be created.
Signed-off-by: NeilBrown <neilb@suse.de>
If --export is given with --incremental, then
MD_DEVNAME
is output which gives the name of the device (in /dev/md) that
is the array (or container) that the device would be added to.
Also
MD_STARTED
is set to one of
no
unsafe
yes
nothing
to indicate if the array was started. IF MD_STARTED=unsafe
then it may be appropriate to run
mdadm -R /dev/md/$MD_DEVNAME
after a timeout to ensure newly degraded array are started.
If
MD_FOREIGN=yes
it might be appropriate to suppress this as the array is
probably not critical.
Signed-off-by: NeilBrown <neilb@suse.de>
--add-spare is like --add, but a --re-add is never attempted.
So it is equivalent to two separate commands:
--zero-metadata
--add
Signed-off-by: NeilBrown <neilb@suse.de>
The bswap_*() macros return int values. Make sure we return the
equivalent types in same byteorder pass-through functions to avoid
problems with the original type leaking through to printf() etc.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We really need to make sure assemble_container_content()
gets called to finished the assembly of these.
Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Current "mdadm --run /dev/mdX" will not handle external metadata
properly. mdmon won't be started etc.
So use the code from "mdadm -IRs" instead - that already does all
the right things.
Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Now that mdmon responds fairly well to SIGTERM, stop lying to
systemd about being started on the initrd.
Note that if mdmon is rerun (--takeover) for some reason, and systemd
chooses to kill processes before remounting / readonly, then the
unmount will hang.
If systemd ever lets us tell it that we don't want to be killed until
root is readonly, then we should do that.
Signed-off-by: NeilBrown <neilb@suse.de>
The purpose od devid2devnm is to return a kernel name of an
md device, whether that device is a whole device or a partition,
we want the whole device. md4, never md4p2.
In one place I was using devid2devnm where I really wanted the
partition if there was one ... and wasn't really interested in it
being an md device.
So introduce a new 'devid2kname' for that case.
Signed-off-by: NeilBrown <neilb@suse.de>
Currently the extra space to leave before the data in the array
is calculated in two separate places, and they can be inconsistent.
Instead, do it all in validate_geometry. This records the
'data_offset' chosen which all other devices then use.
'write_init_super' now just uses the value rather than doing all the
calculations again.
This results in more consistent numbers.
Also, load_super sets st->data_offset so that it is used by "--add",
so the new device has a data offset matching a pre-existing device.
Signed-off-by: NeilBrown <neilb@suse.de>
Having a fix time for a wait is clumsy and can make us
wait much too long.
So use mdstat_wait and keep the mdstat_fd open.
This requires an 'mdstat_close' so it doesn't stay open
forever.
Signed-off-by: NeilBrown <neilb@suse.de>
When "containers" appears on the "DEVICES" line (which is does by
default), use names from the mdadm map file instead of kernel names,
when possible.
This mean that the name will be more likely to appear in mdadm.conf
and so more likely to match "container=" tags.
Signed-off-by: NeilBrown <neilb@suse.de>
To be able to revert-reshape of raid4/5/6 which is changing
the number of devices, the reshape must has been stopped on a multiple
of the old and new stripe sizes.
The kernel only enforces the new stripe size multiple.
So we enforce the old-stripe-size multiple by careful use of
"sync_max" and monitoring "reshape_position".
Signed-off-by: NeilBrown <neilb@suse.de>
We have several places that wait for activity on a sysfs
file. Combine most of these into a single 'sysfs_wait' function.
Signed-off-by: NeilBrown <neilb@suse.de>
Some people want to create truely enormous arrays.
As we sometimes need to hold one file descriptor for each
device, this can hit the NOFILE limit.
So raise the limit if it ever looks like it might be a problem.
Signed-off-by: NeilBrown <neilb@suse.de>
This allows the smooth conversion of legacy 0.90 arrays
to 1.0 metadata.
Old metadata is likely to remain but will be ignored.
It can be removed with
mdadm --zero-superblock --metadata=0.90 /dev/whatever
Signed-off-by: NeilBrown <neilb@suse.de>
raid10 currently uses the 'backup_blocks' field to store something
else: a minimum offset change.
This is bad practice, we will shortly need to have both for RAID5/6,
so make a separate field.
Signed-off-by: NeilBrown <neilb@suse.de>
This allows the metadata on a device to be saved and later restored.
This can be useful before experimenting on an array that is misbehaving.
Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
With the 'devnm' infrastructure fixed, it is quite easy to support
names like "md_home" for md arrays.
The currently defaults to "off" and can be enabled in mdadm.conf with
CREATE names=yes
This is incase other tools get confused by the new names.
Signed-off-by: NeilBrown <neilb@suse.de>
We widely use a "devnum" which is 0 or +ve for md%d devices
and -ve for md_d%d devices.
But I want to be able to use md_%s device names.
So get rid of devnum (a number) and use devnm (a 32char string).
eg.
md0
md_d2
md_home
Signed-off-by: NeilBrown <neilb@suse.de>
the code that was exposed on anything else than dietlibc and klibc
is entirely glibc specific and broke the build on musl libc.
Signed-off-by: John Spencer <maillist-mdadm@barfooze.de>
Signed-off-by: NeilBrown <neilb@suse.de>
We still allow --offroot to be given - for compatibility with scripts
- but ignore it.
The whole point of --offroot is to get systemd to not auto-kill mdmon,
and we always want that.
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
map_dev can be slow so it is best to not call it when
not necessary.
The final test in "find_free_devnum" is not relevant when
udev is being used, so remove the test in that case.
Signed-off-by: NeilBrown <neilb@suse.de>
--replace can be used to replace a device without completely failing
it. Once the replacement completes the device will be failed.
--with can indicate which of several spares to use.
Signed-off-by: NeilBrown <neilb@suse.de>
mdadm --create /dev/md0 .... /dev/sda1:1024 /dev/sdb1:2048 ...
The size is in K unless a suffix: K M G is given.
The suffix 's' means sectors.
Signed-off-by: NeilBrown <neilb@suse.de>
Some arrays (raid10) never need a backup file, so during assembly
we can avoid the whole Grow_continue check in that case.
Achieve this using a flag set by the metadata handler.
Also get "mdadm -I" to fail if a backup process would be
needed. It currently does fail as the kernel rejects things,
but it is nicer to have this explicit.
Signed-off-by: NeilBrown <neilb@suse.de>
This can be used to over-ride the automatic assignment of
data offset.
For --create, it is useful to re-create old arrays where different
defaults applied.
For --grow it may be able to force a reshape in the reverse direction.
Signed-off-by: NeilBrown <neilb@suse.de>
This is currently only useful for 1.x metadata and will allow an
explicit --data-offset request on command line.
Signed-off-by: NeilBrown <neilb@suse.de>
We will shortly introduce --data-offset= which is allowed to
be zero. We will want to use parse_size() so it needs to be
able to return '0' without it being an error.
So define INVALID_SECTORS to be an impossible value (currently '1')
and return and test for it consistently.
Signed-off-by: NeilBrown <neilb@suse.de>
1/ When printing the "name=" entry for --brief output,
enclose name in quotes if it contains spaces etc.
Quotes are already supported for reading mdadm.conf
2/ When a name is used as a device name, translate spaces
and tabs to '_', as well as the current translation of
'/' to '-'.
Signed-off-by: NeilBrown <neilb@suse.de>
When using human_size_brief, only IEC prefixes were supported. Now
it's possible to specify which format we want to see - either IEC
(kibi, mibi, gibi) or JEDEC (kilo, mega, giga).
Signed-off-by: Maciej Naruszewicz <maciej.naruszewicz@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
An error in parse_size() should be reported by 0, not -1,
because -1 is changed to the max value of unsigned long long
during calculations of size (e.g. at mdadm.c:412).
A negative value of size should be reported as error
(e.g. size equal -1 has been changed to the max value of
unsigned long long so far).
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Both are impossible, and '1' allows size to be unsigned,
which is neater.
Also #define MAX_SIZE to be '1' to make it all more explicit.
Signed-off-by: NeilBrown <neilb@suse.de>
The value of 'verbose' is sometimes mixed into 'brief', particularly
for Examine.
This is messy and confusing. So keep them separate.
'brief' still gets assumed when 'scan' is set, unless we are very
verbose.
Signed-off-by: NeilBrown <neilb@suse.de>
If we change some functions to accept 'verbose', where <0 means to be
quiet, in place of 'quiet', then we will be able to merge
'quiet' and 'verbose' together for simplicity.
Signed-off-by: NeilBrown <neilb@suse.de>
Rather than passing a long list of little flags etc to various
functions we will soon pass a small collection of structures.
This first step combines a collection of variables local to
'main' into a single structure.
Signed-off-by: NeilBrown <neilb@suse.de>
malloc should never fail, and if it does it is unlikely
that anything else useful can be done. Best approach is to
abort and let some super-daemon restart.
So define xmalloc, xcalloc, xrealloc, xstrdup which don't
fail but just print a message and exit. Then use those
removing all the tests for failure.
Also replace all "malloc;memset" sequences with 'xcalloc'.
Signed-off-by: NeilBrown <neilb@suse.de>