Problem was found during reshaping 2 volumes /raid0 and raid5/ in container.
Sometimes mdmon throws core dump due to NULL pointer exception.
Problem occurs in scenario:
- managemon: is about spare activation (degraded raid4 volume == raid0 under takeover)
- managemon: detect level change and signals monitor (manage_member() calls replace_array())
- monitor: detects transition raid4/5->raid0 and sets a->container to NULL
to indicate array deactivation
- managemon : continues his work and tries to activate spare (a->check_degraded is set).
NULL pointer is passed to metadata handler activate_spare()
Core dump is generated.
To resolve this situation managemon (after monitor kick) checks again
a->container pointer to learn if current array is not to be deactivated.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Description of the bug:
Sometimes mdmon crashes after changing RAID level from 1 to 0 (takeover).
Cause of the bug:
The managemon marks an active_array for removal from monitoring
by assigning a->container to NULL value (in the "manage_member" function).
Sometimes (during stress test) it happens right when the monitor
is in the "read_and_act" function and a->container pointer is in use.
This causes the monitor crashes.
Solution:
The active array has to be marked for removal in another way
than setting NULL pointer when it can be in use.
A new field "to_remove" was added to the "active_array" structure.
It is used in the managemon to mark a container to remove
(instead of the old assigment: a->container = NULL)
and monitor checks it to determine if the array should be removed.
The field "to_remove" should be checked in some other places
to avoid managing of the array which is going to be removed.
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
The following test fails when the md_check_recovery() event triggered by
the ro->rw transition causes remove_and_add_spares() to run while mdmon
is attempting spare activation.
Result is that the kernel races to set the slot immediately after
sysfs_add_disk() writes new_dev. mdmon thinks the spare activation
failed and declines to send the monitor a new acitve_array. We show
degraded after the wait because the monitor cannot notify the metadata
that all disks are in_sync.
#!/bin/bash
i=0
false
while [ $? == 1 ]
do
i=$((i+1))
mdadm -Ss
mdadm -CR /dev/md0 /dev/loop[0-2] -n 3 -e imsm
mdadm -CR /dev/md1 /dev/loop[01] missing -n 3 -l 5
mdadm --wait /dev/md1
mdadm -E /dev/loop2 | grep -i degraded
done
echo "failed: $i"
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When raid0 expansion occurs, takeover operation is used.
After backward takeover monitor remains in memory.
This happens due to remaining just removed active array in mdmon structures.
If there is no other monitored arrays, mdmon has to finish his work.
Problem was introduced in patch (2011.03.22):
mdmon: Stop keeping track of RAID0 (and LINEAR) arrays.
Prior to this patch mdmon kicking occurs via replace_array() where
wakeup_monitor() was called.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Tracking RAID0 arrays doesn't really work. There is no need,
and there are some sysfs files which won't exist when the array
appears and then won't be opened when the level is changed.
So simply ignore RAID0 and LINEAR arrays - don't add them when they
appear and if an array we are monitoring turns into one of these,
discard it promptly.
Signed-off-by: NeilBrown <neilb@suse.de>
As monitor() can set ->container to NULL, we need to be careful
about dereferencing it.
So take a copy in manage_member, return if it is NULL, and only
use the copy.
Signed-off-by: NeilBrown <neilb@suse.de>
This is needed to remove devices from mdmon's knowledge when the
device is removed from the md container.
Now that ddf have a remove_from_super we don't need the code
that allows some personalities not to implement this.
Signed-off-by: NeilBrown <neilb@suse.de>
Manager thread shall pass the information to monitor thread (mdmon)
that some devices are removed from container. Otherwise, monitor
(mdmon) might use such devices (spares) to rebuild the array that has
gone degraded.
This problem happens for imsm containers, since a list of the
container disks is maintained in intel_super structure. When array
goes degraded, the list is searched to find a spare disks to start
rebuild. Without this fix the rebuild could be stared on the spare
device that was a member of the container, but has been removed from
it.
New super type function handler has been introduced to prepare
metadata format specific information about removed devices.
int (*remove_from_super)(struct supertype *st, mdu_disk_info_t *dinfo)
The message prepared in remove_from_super is later processed by
process_update handler in monitor thread.
Signed-off-by: Marcin Labun <marcin.labun@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Spare assignment requires full knowledge of array state. A pending
update might modify that state (such as a pending spare assignment)
so don't try while there are updates pending.
Signed-off-by: NeilBrown <neilb@suse.de>
This is needed to remove devices from mdmon's knowledge when the
device is removed from the md container.
Now that ddf have a remove_from_super we don't need the code
that allows some personalities not to implement this.
Signed-off-by: NeilBrown <neilb@suse.de>
last_checkpoint is variable that tracks sync_complete sysfs entry.
sync_complete is per disk counter, so initializing during starting from checkpoint
has to have this in mind and convert reshape position properly.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When reshape is restarted and active array in mdmon is being initialized,
mdmon has to know last checkpoint, otherwise reshape will be restarted
form '0' position.
mdadm when reshaped array is assembled stores reshape_position in sysfs
and runs mdmon. Initialize last_checkpoint in active array structure
to value present in sysfs for reshaped array start.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
As chunk_size in mdstat_ent is never set, we shouldn't copy
it into a->info.array.
In fact, it is safest to get rid of the field altogether.
Reported-by: "Kwolek, Adam" <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Calculation of size is almost ok, except concept of blocks.
Size for setting in md has to be divided by 2 to be correct.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
For level migration support it is necessary to allow mdmon to react for level changes.
It has to have ability to change configuration of active array,
and for array level change to raid0 finish array monitoring.
Signed-off-by: Maciej Trela <maciej.trela@intel.com>
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
We need to allow metadata to handle progress of reshape,
completion, and abort-before-start.
Include all those in ->set_array_state()
Signed-off-by: NeilBrown <neilb@suse.de>
Sometimes one metadata update will require allocating several
larger data structures. As 'monitor' cannot allocate, 'manager'
must, so it must be able to attach a list of allocates to the
update, and importantly it must be able to easily free them.
So add a 'space_list' element to metadata updates where each
element on the list starts with a pointer to the next.
Signed-off-by: NeilBrown <neilb@suse.de>
When mdadm starts a reshape, it might add some devices to the array
first. mdmon needs to notice the reshape starting and check for any
new devices. If there are any they need to be provided to be
monitored.
Signed-off-by: NeilBrown <neilb@suse.de>
Manager thread shall pass the information to monitor thread (mdmon)
that some devices are removed from container. Otherwise, monitor
(mdmon) might use such devices (spares) to rebuild the array that has
gone degraded.
This problem happens for imsm containers, since a list of the
container disks is maintained in intel_super structure. When array
goes degraded, the list is searched to find a spare disks to start
rebuild. Without this fix the rebuild could be stared on the spare
device that was a member of the container, but has been removed from
it.
New super type function handler has been introduced to prepare
metadata format specific information about removed devices.
int (*remove_from_super)(struct supertype *st, mdu_disk_info_t *dinfo)
The message prepared in remove_from_super is later processed by
process_update handler in monitor thread.
Signed-off-by: Marcin Labun <marcin.labun@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
sync_completed_fd handler has to be closed when array is closing.
This is in pair to open handler code.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
As chunk_size in mdstat_ent is never set, we shouldn't copy
it into a->info.array.
In fact, it is safest to get rid of the field altogether.
Reported-by: "Kwolek, Adam" <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In order to support reshape and atomic removal of spares from containers
we need to prevent mdmon from activating spares. In the reshape case we
additionally need to freeze sync_action while the reshape transaction is
initiated with the kernel and recorded in the metadata.
When reshaping a raid0 array we need to freeze the array *before* it is
transitioned to a redundant raid level. Since sync_action does not exist
at this point we extend the '-' prefix of a subarray string to flag
mdmon not to activate spares.
Mdadm needs to be reasonably certain that the version of mdmon in the
system honors this 'freeze' indication. If mdmon is not already active
then we assume the version that gets started is the same as the mdadm
version. Otherwise, we check the version of mdmon as returned by the
extended ping_monitor() operation. This is to catch cases where mdadm
is upgraded in the filesystem, but mdmon started in the initramfs is
from a previous release.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
...before introducing another open coded instace of this conversion.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
To accurately detect when an array has been split and is now being
recombined, we need to track which other devices each thinks is
working.
We should never include a device in an array if it thinks that the
primary device has failed.
This patch just allows get_info_super to return a list of devices
and whether they are thought to be working or not.
Signed-off-by: NeilBrown <neilb@suse.de>
...i.e. GET_DEVS == (GET_DEVS|SKIP_GONE_DEVS)
A null pointer dereference in Incremental.c can be triggered by
replugging a disk while the old name is in use. When mdadm -I is called
on the new disk we fail the call to sysfs_read(). I audited all the
locations that use GET_DEVS and it appears they can tolerate missing a
drive. So just make SKIP_GONE_DEVS the default behaviour.
Also fix up remaining unchecked usages of the sysfs_read() return value.
Reported-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The kernel updates and notifies md/sync_completed when it is time to
take a checkpoint. When this occurs (at 1/16 array size intervals)
write 'idle' to md/sync_action to have the current recovery position
updated in recovery_start and resync_start.
Requires the metadata handler to reset ->last_checkpoint when it has
determined that recovery has ended.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When activating a spare we neglect to open recovery_start and as such do
not see checkpoint events. Move disk initialization to common routine
to mitigate recurrence.
Reported-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Now that we don't "mdadm --takeover" until /var/run is writable
there is no need to continually try to create files in there.
So only create these files at startup and fail if they cannot be
made. This means that to start an array with externally managed
metadata, either /var/run or ALT_RUN (e.g. /lib/init/rw) must be
writable. To 'takeover' from a previous mdmon instance, /var/run
must be writable.
This means we don't need to worry about SIGHUP (which was once used to
tell us it was time to create .pid) and SIGALRM.
Signed-off-by: NeilBrown <neilb@suse.de>
Monitoring /proc/mounts and creating a .pid file as soon as /var/run
is writable is racy. Most distros clean all non-directories from
/var/run early in boot and if mdmon races with this it could
lose the files as soon as they are created.
Instead require that "mdmon --takeover" be run after /var is writable.
Signed-off-by: NeilBrown <neilb@suse.de>
/var/run probably doesn't persist from early boot.
So if necessary, store in in /lib/init/rw or somewhere else
that does persist.
Signed-off-by: NeilBrown <neilb@suse.de>
Creating /var/run in mdmon is really not justifiable.
If /var/run doesn't exist, then it is either deliberate and it should
be left that way to make sure the mapfile gets created in /dev, or
it is a configuration error and not our problem to fix.
Signed-off-by: NeilBrown <neilb@suse.de>
Minimal changes needed to permit reassembling partially recovered
external metadata arrays. The biggest logical change is that
->container_content() can now surface partially rebuilt members rather
than omitting them from the disk list.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We don't need to sprinkle reads of this attribute all over the place,
just once at the entry of read_and_act(). Also, the mdinfo structure
for the array already has a 'resync_start' member, so just reuse that.
Finally, rename get_resync_start() to read_resync_start to make it
consistent with the other sysfs accessors in monitor.c.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When killing a previous monitor be careful not to cause writes to the
filesystem until the reads necessary to get the monitor operational have
completed.
The code is already prepared for errors creating the pid and socket
files, so simply defer creation of these files until after the first
call to manage().
Cc: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When building container members with -IR, we need to ensure that
devices added to an active array preserve the 'in_sync' status so they
don't needlessly get rebuilt.
So allow sysfs_add_disk to do this (only works in kernels since
2.6.30) and pass the relevant flag down.
Signed-off-by: NeilBrown <neilb@suse.de>
If mdmon sees a device added to a container, it should assume it is
a new spare. It could be a part of the array that just hadn't been
assembled yet. So check first.
Signed-off-by: NeilBrown <neilb@suse.de>
Now that mdmon handles sigterm if another monitor wants to take over it
should wait until all managed arrays are clean. So make WaitClean()
available to mdmon and teach try_kill_monitor() to wait on each subarray
in the container.
...since we may be communicating with a dieing process, we need to
block SIGPIPE earlier.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
We generally don't want mdmon to be terminated, but if a SIGTERM gets
through try to leave the monitored arrays in a clean state, block
attempts to mark the array dirty, and stop servicing the socket.
When we are killed by sigterm don't remove the pidfile let that be
cleaned up by the next monitor.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>