Currently "mdadm --run /dev/mdX" does not handle external metadata
properly: mdmon won't be started, etc.
So use the code from "mdadm -IRs" instead - that already does all
the right things.
Reported-by: Francis Moreau <francis.moro@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
commit 23bf42cc79
super1: simplify setting of array size.
removed the setting for sb->data_offset for 1.0 metadata for some reason,
and messed up the size calculation for 1.0 metadata too.
Signed-off-by: NeilBrown <neilb@suse.de>
It is possible that mdadm creates a new subarray containing failed
devices. This may happen if a device has failed, but the meta data
containing that information hasn't been written out yet.
This code tests for this situation, and handles it in the monitor.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
If a disk fails and simultaneously a new array is created, a race
condition may arise because the meta data on disk doesn't reflect
the disk failure yet. This is a test for that case.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
This is one more unit test for failure/recovery, this time with
double redundancy, which isn't covered by the other tests.
Signed-off-by: NeilBrown <neilb@suse.de>
An ext[234] filesystem larger than 2TB was being reported with
a negative size - which looks odd.
So fix it to use suitably large and unsigned values.
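As a rough illustration of the idea (not the actual patch), the sketch
below computes a filesystem size from a 32-bit block count and the
log2 block-size field that an ext[234] superblock stores. The helper
name fs_size_bytes is hypothetical; the point is only that the value
is widened to an unsigned 64-bit type before shifting.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not mdadm's code: size in bytes of a filesystem
 * described by a 32-bit block count and log2(block_size / 1024), which
 * is how an ext[234] superblock records them.
 */
static uint64_t fs_size_bytes(uint32_t blocks, uint32_t log_block_size)
{
	/* Widen to 64 bits before shifting so nothing wraps or turns negative. */
	return (uint64_t)blocks << (10 + log_block_size);
}
```

When printing such a value, a matching unsigned format (for example
"%llu" with a cast to unsigned long long) keeps it from showing up
as negative.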
Reported-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: NeilBrown <neilb@suse.de>
The recent change to skip over invalid conf entries was bad because
it could leave garbage on the disk.
But we don't want to write each entry separately, as the writes are
O_DIRECT and so synchronous, which takes way too long.
So allocate a large buffer (probably the one used to read the config records)
and fill that then write it all at once.
Reported-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Allow other device types for testing; this allows testing on
a larger variety of devices.
Option --dev=[loop|lvm|ram] selects loop device (default), lvm,
and ram disk, respectively. To use RAM disks with DDF,
the kernel parameter ramdisk_size=65536 must be used.
For LVM, use --volgroup=<vg> to specify the name of the volume
group in which the test LVs will be created.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
We should skip known failed disks when allocating space for
new arrays. This fixes the problem with 10ddf-fail-spare.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
This test has some randomness because it is not always deterministic
which of the two arrays gets the spare and which remains degraded.
Handle it.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
helper functions to determine the list of devices in an array,
etc.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
mdadm defaults to using /run/mdadm. However not all distros
provide /run yet. This can confuse people who build their own
mdadm.
So have "make" complain if the given directory doesn't exist.
This will make it harder to build an mdadm which doesn't work.
Reported-by: Albert Pauw <albert.pauw@gmail.com>
Signed-off-by: NeilBrown <neilb@suse.de>
It is possible for mdmon to see (in /proc/mdstat) an array
in 'inactive' state when "mdadm -S" has written "inactive" to
"array_state".
In this state values such as "raid_disk" are not meaningful
and so should be ignored by manage_member().
Reported-by: "Dorau, Lukasz" <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Commit c7079c84 arranged for DDF to forget about any device
that is failed and not still marked as part of any array.
However such devices could still be part of the container and this
removal and updating of 'pdnum' can result in multiple devices having
the same pdnum. This in turn easily leads to confusion and
corruption.
So only discard pd entries for devices which are failed, not listed in
any virtual device, and for which we don't have a handle on the
device.
pd entries will not get removed until a new device is added after
the device has been removed from the container, either by
"mdadm --remove" or by assembling without the failed devices.
Reported-by: Albert Pauw <albert.pauw@gmail.com>
Analysed-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
When testing we want to run mdmon directly, not use
systemctl to get systemd to run it.
So allow an environment variable to make that choice.
Signed-off-by: NeilBrown <neilb@suse.de>
This clearly should be 'st2'.
As it is the 'raid_disk' value being tested is completely
meaningless in the context of the new device.
Signed-off-by: NeilBrown <neilb@suse.de>
Recent commit 273989b93a
skipped writing some large blocks of 0xFF, but didn't seek
over the space, so subsequent data was written wrongly.
When we don't write, we need to seek.
Signed-off-by: NeilBrown <neilb@suse.de>
With the previous patch, mdmon will provide the layout property for us.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
commit 71d68ff62 uses the array layout. It needs to be initialized.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Now that mdmon responds fairly well to SIGTERM, stop lying to
systemd about being started on the initrd.
Note that if mdmon is rerun (--takeover) for some reason, and systemd
chooses to kill processes before remounting / readonly, then the
unmount will hang.
If systemd ever lets us tell it that we don't want to be killed until
root is readonly, then we should do that.
Signed-off-by: NeilBrown <neilb@suse.de>
The purpose of devid2devnm is to return the kernel name of an
md device. Whether that device is a whole device or a partition,
we want the whole device: md4, never md4p2.
In one place I was using devid2devnm where I really wanted the
partition if there was one ... and wasn't really interested in it
being an md device.
So introduce a new 'devid2kname' for that case.
Signed-off-by: NeilBrown <neilb@suse.de>
Telling systemd that mdadm was started from the initrd
is often a lie and never necessary. Now that the reshape monitoring
thread handles SIGTERM gracefully it is OK for systemd to kill
any mdadm that it finds running.
mdmon still has a bit of a question mark over it so I won't remove
the '@' from there just yet.
Signed-off-by: NeilBrown <neilb@suse.de>
If the mdadm thread that monitors a reshape gets SIGTERM it should
exit cleanly and clear the 'suspended' region of the array.
However it mustn't clear 'sync_max' as that would allow the
reshape to continue unmonitored.
If the thread ever does get killed, the array should really be
shut down soon after if possible.
Signed-off-by: NeilBrown <neilb@suse.de>
I forgot to check in this helper script, similar to the one for IMSM.
It is needed by tests/10ddf-create-fail-rebuild.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
This test adds a new unit test similar to 009imsm-create-fail-rebuild.
With the previous patches, it actually succeeds on my system.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
In order to track kernel state changes, the monitor needs to
notice changes in sysfs. If the changes are transient, and the
monitor is busy writing meta data, it can happen that the changes
are missed. This will cause the meta data to be inconsistent with
the real state of the array.
I can reproduce this in a test scenario with a DDF container and
two subarrays, where I set a disk to "failed" and then add a global
hot-spare. On a typical MD test setup with loop devices, I can
reliably reproduce a failure where the metadata show degraded members
although the kernel finished the recovery successfully.
This patch fixes this problem by applying two changes. First, when
a metadata update is queued, wait until it is certain that the monitor
actually applied these meta data (the for loop is actually needed to
avoid failures completely in my test case). Second, after triggering the
recovery, set prev_state of the changed array to "recover", in case
the monitor misses the transient "recover" state.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Correctly print out wake reason if it was a signal. Previous code
would print misleading select events (pselect(2) man page says the
fdsets become undefined in case of error).
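For illustration, a sketch of how a pselect() return value can be
classified without ever looking at the fd sets on the error path. The
enum, classify_wake and wait_for_event are hypothetical names, and the
use of a read set is an assumption made for the example; this is not
the mdmon code.

```c
#include <errno.h>
#include <signal.h>
#include <sys/select.h>
#include <time.h>

enum wake_reason { WAKE_SIGNAL, WAKE_TIMEOUT, WAKE_EVENT, WAKE_ERROR };

/*
 * Classify a pselect() result. 'rv' is its return value and 'err' the
 * errno observed right after the call. The fd sets are only meaningful
 * when rv > 0, so they are not consulted here at all.
 */
enum wake_reason classify_wake(int rv, int err)
{
	if (rv < 0)
		return err == EINTR ? WAKE_SIGNAL : WAKE_ERROR;
	if (rv == 0)
		return WAKE_TIMEOUT;
	return WAKE_EVENT;
}

/* Wait on a single fd and report why we woke up. */
enum wake_reason wait_for_event(int fd, const struct timespec *timeout,
				const sigset_t *mask)
{
	fd_set set;
	int rv;

	FD_ZERO(&set);
	FD_SET(fd, &set);
	rv = pselect(fd + 1, &set, NULL, NULL, timeout, mask);
	return classify_wake(rv, errno);
}
```

A debug message can then print "signal" for WAKE_SIGNAL and only
inspect the fd sets in the WAKE_EVENT case.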
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
read_and_act() currently prints a debug message only very late.
Print the status seen by mdmon right away, to track mdmon's
actions more closely. Add a time stamp to observe long delays
between read_and_act calls, e.g. caused by meta data writes.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Adds more verbose debugging in ddf_set_disk, to understand failures
better.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Try to determine problem if load_ddf_header fails. May be useful
for determining compatibility problems with Fake RAID BIOSes.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
I needed this for tracking a bug with wrong offsets after array
creation.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
In particular, include refnum for better tracking. This makes
it a little easier for humans to track what happened to which disk.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
Move the check for good drives in the dl loop - otherwise dl
may be NULL and mdmon may crash.
Signed-off-by: Martin Wilck <mwilck@arcor.de>
Signed-off-by: NeilBrown <neilb@suse.de>
For RAID10, 'sync' numbers go up to the array size rather than the
component size. is_resync_complete() needs to allow for this.
Reported-by: Pawel Baldysiak <pawel.baldysiak@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Coverity discovered a possible double close(fd2) in Grow.c. Avoided by
invalidating fd2 after the first close.
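The general pattern looks roughly like this; close_once is a
hypothetical helper used for illustration, not a function from Grow.c.

```c
#include <unistd.h>

/*
 * Close *fd if it is open and mark it invalid, so that a second call
 * on the same variable is a harmless no-op.
 */
static void close_once(int *fd)
{
	if (*fd >= 0) {
		close(*fd);
		*fd = -1;
	}
}
```

Calling close_once(&fd2) on every exit path is then safe even if an
earlier path has already closed the descriptor.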
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Currently the extra space to leave before the data in the array
is calculated in two separate places, and they can be inconsistent.
Instead, do it all in validate_geometry. This records the
'data_offset' chosen which all other devices then use.
'write_init_super' now just uses the value rather than doing all the
calculations again.
This results in more consistent numbers.
Also, load_super sets st->data_offset so that it is used by "--add",
so the new device has a data offset matching a pre-existing device.
Signed-off-by: NeilBrown <neilb@suse.de>
_avail_space1() is called from both avail_space1() and validate_geometry1()
and does slightly different things.
The partial code sharing doesn't really help. In particular the
responsibility for setting the size of the array is currently
confused.
So duplicate the code into the two locations - one where 'super' is
always NULL (validate_geometry1) and one where it is never NULL
(avail_space1), and simplify.
No behaviour change - just code re-organisation.
Signed-off-by: NeilBrown <neilb@suse.de>
This call to validate_geometry is really rather gratuitous.
It is purely about the fact that super0 cannot use more than 4TB.
So just make it an explicit test - less confusing that way.
With this, validate_geometry is only called from Create, which
makes it easier to reason about.
Also validate_geometry is now never passed NULL for the 'chunk'
parameter, so we can remove those annoying tests for NULL.
Signed-off-by: NeilBrown <neilb@suse.de>
Metadata updates for secondary RAID (RAID10) need to cover
all BVDs. Compare with code in write_init_super_ddf().
Signed-off-by: NeilBrown <neilb@suse.de>