map_dev can be slow so it is best to not call it when
not necessary.
The final test in "find_free_devnum" is not relevant when
udev is being used, so remove the test in that case.
Signed-off-by: NeilBrown <neilb@suse.de>
It is important to check for compatibility with 'platform' or
Option ROM when creating or changing and array. However there is no
real need when simply assembling the array.
On some systems there are situations where the platform information is
not available. e.g. on some UEFI systems, UEFI is not available
during 'kdump' handling. This makes it impossible to assemble
an IMSM array to receive the dump.
So remove the requirements that the platform be visible to assemble
an IMSM array.
Signed-off-by: NeilBrown <neilb@suse.de>
open_container should open a container which contains the device,
but sometimes it would open another volume which contains the
device. Be more careful in 'holder' selection.
Signed-off-by: NeilBrown <neilb@suse.de>
mdadm --create /dev/md0 .... /dev/sda1:1024 /dev/sdb1:2048 ...
The size is in K unless a suffix: K M G is given.
The suffix 's' means sectors.
Signed-off-by: NeilBrown <neilb@suse.de>
We will shortly introduce --data-offset= which is allowed to
be zero. We will want to use parse_size() so it needs to be
able to return '0' without it being an error.
So define INVALID_SECTORS to be an impossible value (currently '1')
and return and test for it consistently.
Signed-off-by: NeilBrown <neilb@suse.de>
The 'enough' function is written to work with 'near' arrays only
in that is implicitly assumes that the offset from one 'group' of
devices to the next is the same as the number of copies.
In reality it is the number of 'near' copies.
So change it to make this number explicit.
Reported-by: Jakub Husák <jakub@gooseman.cz>
Signed-off-by: NeilBrown <neilb@suse.de>
When using human_size_brief, only IEC prefixes were supported. Now
it's possible to specify which format we want to see - either IEC
(kibi, mibi, gibi) or JEDEC (kilo, mega, giga).
Signed-off-by: Maciej Naruszewicz <maciej.naruszewicz@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
It would be better if two size-calculating methods had the same
calculating algorithm. The human_size way of calculation seems
more readable, so let's use it for both methods.
Signed-off-by: Maciej Naruszewicz <maciej.naruszewicz@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
An error in parse_size() should be reported by 0, not -1,
because -1 is changed to the max value of unsigned long long
during calculations of size (e.g. at mdadm.c:412).
A negative value of size should be reported as error
(e.g. size equal -1 has been changed to the max value of
unsigned long long so far).
Signed-off-by: Lukasz Dorau <lukasz.dorau@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
This avoid code duplication for utilities that do not link to
util.c and everything that comes with it, such as test_restripe and
raid6check
Signed-off-by: NeilBrown <neilb@suse.de>
high-number names like "/dev/md126" shouldn't be in /etc/mdadm.conf,
but if they are they should be ignored when choosing an
unused number.
Signed-off-by: NeilBrown <neilb@suse.de>
malloc should never fail, and if it does it is unlikely
that anything else useful can be done. Best approach is to
abort and let some super-daemon restart.
So define xmalloc, xcalloc, xrealloc, xstrdup which don't
fail but just print a message and exit. Then use those
removing all the tests for failure.
Also replace all "malloc;memset" sequences with 'xcalloc'.
Signed-off-by: NeilBrown <neilb@suse.de>
When we can for devices using GET_DISK_INFO we currently
limit to 1024. But some arrays can have more than this.
So raise it to 4096 and make the constant a #define.
Signed-off-by: NeilBrown <neilb@suse.de>
It isn't sufficient to use '0' for 'error' as well will
later have fields that can validly be '0'.
So return "-1" on error.
Also fix parsing of --bitmap_check so that '0' is treated
as an error: we don't support 512B anyway.
Signed-off-by: NeilBrown <neilb@suse.de>
no point calling info_to_blocks_per_member when it just returns size*2 for level==1
calc_array_size can be used for all levels
Signed-off-by: Anna Czarnowska <anna.czarnowska@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When result from strchr() is NULL and it is assigned to subarray,
NULL pointer can be passed to strdup() function and coredump file
is generated.
Subarray is checked for NULL pointer, so it is assumed that it can
be NULL at this moment.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
It can easily be calculated from 'avail' and 'raid_disks', and we
will soon have a case where we don't have it easily available to pass
in.
Signed-off-by: NeilBrown <neilb@suse.de>
Per [1] GPT partition table entries are not guaranteed to be 128
bytes, in which case read() straight into a struct GPT_part_entry
would result in a buffer overflow corrupting the stack.
[1] http://en.wikipedia.org/wiki/GUID_Partition_Table
Signed-off-by: Jes Sorensen <Jes.Sorensen@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
readlink() returns the number of bytes in the buffer.
If we do something like
len = readlink(path, buf, sizeof(buf));
buf[len] = '\0';
we might write one byte past the end of the buffer.
Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: NeilBrown <neilb@suse.de>
A 0.90 array can use at most 4TB of each device - 2TB between
2.6.39 and 3.1 due to a kernel bug.
The test for this in validate_super0 is very wrong. 'size' is sectors
and the number it is compared against is just confusing.
So fix it all up and correct the spelling of terabytes and remove
a second redundant test on 'size'.
Signed-off-by: NeilBrown <neilb@suse.de>
Negative value must be returned to indicate error in open_subarray
Reviewed-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Anna Czarnowska <anna.czarnowska@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When forking mdmon we need to close all other fds because we don't
use O_CLOEXEC yet.
Any approach will be fairly arbitrary, but as we can expect fds to be
fairly dense, closing until we find a set number that don't need
closing is possible safer than only closing the first 100.
So keep closing until we find 20 that are already closed.
Signed-off-by: NeilBrown <neilb@suse.de>
There are some more times when we don't care that the hardware doesn't
support the metadata:
- when removing old metadata
- when reporting the metadata present before over-writing it.
So set ignore_hw_compat in these cases.
Signed-off-by: NeilBrown <neilb@suse.de>
The next version of Linux might be 3.0. If it is, get_linux_version
will fail.
So make it more robust.
Reported-by: Namhyung Kim <namhyung@gmail.com>
Reported-by: Milan Broz <mbroz@redhat.com>
Signed-off-by: NeilBrown <neilb@suse.de>
looking at the gpt code in util.c i found i did not like it at all, a
gpt partition entry is currently 128 bytes, but the spec does not say it
is a fixed value, so the code that reads into a buffer with 512bytes
chunk expecting this to be a multiplier of part_size is imho incorrect.
my fix was to read each partition entry directly into a struct
GPT_part_entry, the advantage is that the code is very simple to read,
the disadvantage it is 128 reads of 128 bytes each, which is
sub-optimal, but i believe readahead will mitigate this a lot.
Signed-off-by: NeilBrown <neilb@suse.de>
The loop over all member devices in enough_fd could easily stop
before it had found all devices. This would cause --re-add to
fail incorrectly.
So change the loop to be based on the reported number of devices
in the device - with a safe-guard limit of 1024.
Change some other loops to be more careful too.
Reported-by: "Schmidt, Annemarie" <Annemarie.Schmidt@stratus.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Some of util.c is dependent on lots of other code, some of it
is stand-alone.
Move some of the stand-alone stuff into a new lib.c so it can be used
by smaller utilities.
Signed-off-by: NeilBrown <neilb@suse.de>
For many operations we don't need a writable device. So if
opening O_RDWR fails in open_dev_excl, then try again O_RDONLY.
If we really needed write, a subsequent operation will failed. But
if we didn't, we succeed when otherwise we wouldn't have.
Signed-off-by: NeilBrown <neilb@suse.de>
Allow for loading metadata from disk attached to non-metadata compliant
system. Affects mdadm --examine and guess_super.
Added ignore_hw_compat in supertype to pass information to load_super
handler. If ignore_hw_compat is set the handler should load metadata
also from disks that do not comply with metadata requirements (i.e. disk is not
attached to native controller, etc).
Signed-off-by: Marcin Labun <marcin.labun@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If mdmon is shutting down because there are no devices
left to look at, then don't wait 5 seconds for an O_EXCL open,
and that can block progress of --grow.
Only wait for O_EXCL if we received a signal.
Signed-off-by: NeilBrown <neilb@suse.de>
If single-disk RAID0 or RAID1 array is created, user may preserve data on
disk. If array given size covers all partitions on disk, all data will be
available on created array. If array size is too small (not covers
all partitions), data will be not accessible.
This patch introduces warning message during array creation if given size
is too small. User may interrupt creation process to avoid data loss.
Signed-off-by: Krzysztof Wojcik <krzysztof.wojcik@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
When opening an array to manipulate it we never need to write to the
array and sometimes it might be read-only so the open for write will
fail.
So always open read-only.
Reported-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Sometime we will need to know the difference between no domains found
and domains didn't match.
So allow domain_test to return different values and fix up all callers
to maintain current behaviour.
Signed-off-by: NeilBrown <neilb@suse.de>
For containers, it is always appropriate to include a device in the
container.
Whether it should then be included in an array is a separate question.
Signed-off-by: NeilBrown <neilb@suse.de>
Arrays on partitions are not supported for external metadata
so do not take such spare from native array.
Signed-off-by: Anna Czarnowska <anna.czarnowska@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
# mdadm --detail --export /dev/md127p1
Before:
MD_LEVEL=raid5
MD_DEVICES=4
MD_METADATA=0.90
After:
MD_LEVEL=raid5
MD_DEVICES=4
MD_CONTAINER=/dev/md0
MD_MEMBER=0
MD_UUID=55746a20:925d24a7:4f9bd7e2:9c9a411f
We parse the symlink target with a format:
../../block/mdXXX/mdXXXpYY
...and need the second '/' from the end of the string to read detect a
'md' device.
Reported-by: Krzysztof Wasilewski <krzysztof.wasilewski@intel.com>
Cc: Przemyslaw Czarnowski <przemyslaw.hawrylewicz.czarnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
container_chose_spares in Monitor.c and
get_spares_for_grow in super-intel.c
do the same thing: search for spares in a container.
Another version will also be needed for Incremental
so a more general solution is presented here and
applied in two previous contexts.
Normally domlist==NULL would lead an empty list but
this is typically checked earlier so here it is interpreted
as "do not test domains".
Signed-off-by: Anna Czarnowska <anna.czarnowska@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
uuid_match_any is replaced by uuid_zero for imsm spares.
Function fixup_container_spare_uuid not needed as it gives
unwanted uuid to spares.
Signed-off-by: Anna Czarnowska <anna.czarnowska@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
Sometimes one metadata update will require allocating several
larger data structures. As 'monitor' cannot allocate, 'manager'
must, so it must be able to attach a list of allocates to the
update, and importantly it must be able to easily free them.
So add a 'space_list' element to metadata updates where each
element on the list starts with a pointer to the next.
Signed-off-by: NeilBrown <neilb@suse.de>
Sometimes we want to convert a devnum to a devname without allocating
memory. So provide function to do the formatting without allocation.
Signed-off-by: NeilBrown <neilb@suse.de>
Due to fact that IMSM Windows compatibility was not tested yet,
feature has to be treated as experimental until compatibility
verification will be performed.
Signed-off-by: Adam Kwolek <adam.kwolek@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
For consistency with makedev().
int is not sufficient.
Signed-off-by: Anna Czarnowska <anna.czarnowska@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
If a devices - typically in a mirrored set - is assembled
independently of the other devices, and then attempted to be brought
back into the set, it could contain inconsistent data. It should not
be included.
So detect this situation by ensuring that the 'most recent' device is
believed to be active by every other device. If a device is wayward,
it will only consider fellow wayward devices to be active and will
think all others are failed or missing.
This patches fixes --incremental, --assemble was done in an earlier
patch.
Signed-off-by: NeilBrown <neilb@suse.de>
Precludes needing to deduce this information later, like in Detail.c and
soon in Grow.c.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
In order to support reshape and atomic removal of spares from containers
we need to prevent mdmon from activating spares. In the reshape case we
additionally need to freeze sync_action while the reshape transaction is
initiated with the kernel and recorded in the metadata.
When reshaping a raid0 array we need to freeze the array *before* it is
transitioned to a redundant raid level. Since sync_action does not exist
at this point we extend the '-' prefix of a subarray string to flag
mdmon not to activate spares.
Mdadm needs to be reasonably certain that the version of mdmon in the
system honors this 'freeze' indication. If mdmon is not already active
then we assume the version that gets started is the same as the mdadm
version. Otherwise, we check the version of mdmon as returned by the
extended ping_monitor() operation. This is to catch cases where mdadm
is upgraded in the filesystem, but mdmon started in the initramfs is
from a previous release.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
...before introducing another open coded instace of this conversion.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: NeilBrown <neilb@suse.de>
It makes more sense to test for container_dev than for subarray
for several places in Create where it then uses container_dev.
This allows us to subsequently remove subarray.
Signed-off-by: NeilBrown <neilb@suse.de>
Rather than hiding this in the 'st', return it explicitly.
In the one case we still need it, copy it into st where needed.
This will disappear in a future patch.
Signed-off-by: NeilBrown <neilb@suse.de>
Rather than hiding this arg in the 'st' structure, pass it explicitly.
This is a first step to getting rid of 'subarray' from 'supertype'.
The strcpy in open_subarray should have better error checking, but it
will disappear soon so there is little point.
Signed-off-by: NeilBrown <neilb@suse.de.
To accurately detect when an array has been split and is now being
recombined, we need to track which other devices each thinks is
working.
We should never include a device in an array if it thinks that the
primary device has failed.
This patch just allows get_info_super to return a list of devices
and whether they are thought to be working or not.
Signed-off-by: NeilBrown <neilb@suse.de>
If an --add is requested and a re-add looks promising but fails or
cannot possibly succeed, then don't try the add. This avoids
inadvertently turning devices into spares when an array is failed but
the devices seem to actually work.
Signed-off-by: NeilBrown <neilb@suse.de>
To support incorpating a new bare device into a collection of arrays -
one partition each - mdadm needs a modest understanding of partition
tables.
The main needs to be able to recognise a partition table on one device
and copy it onto another.
This will be done using pseudo metadata types 'mbr' and 'gpt'.
Signed-off-by: NeilBrown <neilb@suse.de>
/dev could be read-only in which case we cannot make devices
there.
So dev_open should first try to use an existing device name,
and if that doesn't work try creating a node in /dev or /tmp.
Reported-by: Paweł Sikora <pluto@agmk.net>
Signed-off-by: NeilBrown <neilb@suse.de>
We now have 3 directory definitions: mdmon directory for its pid and
sock files (compile time define, not changable at run time), mdmonitor
directory which is for the mdadm monitor mode pid file (can only be
passed in via command line at the time mdadm is invoked in monitor mode),
and the directory for the mdadm incremental assembly map file (compile
time define, not changable at run time). Only the mdadm map file still
hunts multiple locations, and the number of locations has been reduced
to /var/run and the compile time specified location. Re-use of similar
sounding defines that actually didn't denote their actual usage at
compile time made it more difficult for a person to know what affect
changing the compile time defines would have on the resulting programs.
This patch renames the various defines to clearly identify which item
the define affects. It also reduces the number of various directories
which will be searched for these files as this has lead to confusion
in mdadm and mdmon in terms of which files should take precedence when
files exist in multiple locations, etc. It's best if the person
compiling the program intentionally and with planning selects the
right directories to be used for the various purposes. Which directory
is right depends on which items you are talking about and what boot
loader your system uses and what initramfs generation program your
system uses. Because of the inter-dependency of all these items it
would typically be up to the distribution that mdadm is being integrated
into to select the correct values for these defines.
Signed-off-by: Doug Ledford <dledford@redhat.com>
GET_ARRAY_INFO always succeeds on an inactive container, so we need to
be a bit more diligent about adding a disk to an active container.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Support for deleting a subarray out of a container. When all subarrays
are deleted the component devices are converted back into spares, a
--zero-superblock is still needed to kill the remaining metadata at this
point. This operation is blocked when the subarray is active and may
also be blocked by the metadata handler when deleting the subarray might
change the uuid of other active subarrays. For example, with imsm,
deleting subarray 'n' may change the uuid of subarrays with indexes > n.
Deleting a subarray needs to be a container wide event to ensure
disks that record the modified subarray list perceive other disks that
did not receive this change as out of date.
Notes:
The st->subarray parsing in super-intel.c and super-ddf.c is updated to
be more strict now that we are reading user supplied subarray values.
Offline container modification shares actions that mdmon typically
handles so promote is_container_member() and version_to_superswitch()
(formerly find_metadata_methods()) to generic utility functions for the
cases where mdadm performs the operation.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
devnum2devname() returns pointer to memory allocated with strdup.
It must be released to prevent memory leak.
Signed-off-by: Przemyslaw Czarnowski <przemyslaw.hawrylewicz.czarnowski@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
These metadata are not expected on partitions, and they have
no way of differentiation whether which is correct if they
are found both on the device and on the last partition.
So if the device is a partition, refuse to read the metadata.
Signed-off-by: NeilBrown <neilb@suse.de>
Code to check partition tables used some needless casts
and was broken, using a u8 when a u32 was wanted.
So create structure describing the tables rather than using offset,
and read into those tables instead.
Signed-off-by: NeilBrown <neilb@suse.de>
- when we waited for the old mdmon to exit, we didn't look
for the socket in the right place
- when we failed to find a pid file, we returned the wrong
value (code expected <0, but got ==0).
Signed-off-by: Luca Berra <bluca@comedia.it>
Signed-off-by: NeilBrown <neilb@suse.de>
/var/run probably doesn't persist from early boot.
So if necessary, store in in /lib/init/rw or somewhere else
that does persist.
Signed-off-by: NeilBrown <neilb@suse.de>
Unlike native md checkpointing some data about the geometry and type of
the migration process is coded into curr_migr_unit. Provide logic to
convert between md/{resync_start|recovery_start} and imsm/curr_migr_unit.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Minimal changes needed to permit reassembling partially recovered
external metadata arrays. The biggest logical change is that
->container_content() can now surface partially rebuilt members rather
than omitting them from the disk list.
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
When creating an array, check if the devices have partition
tables and print a warning if the table or the partitions might be
destroyed by array creation.
Signed-off-by: NeilBrown <neilb@suse.de>
The load_super() from an mdadm --detail call may race against an mdmon
update. When this happens the load_super sees an inconsistent metadata
block and returns an error. The fallback path to use the map file
contents lacks uuid reporting, so provide __fname_from_uuid for
generically printing a uuid.
Reported-by: Hans de Goede <hdegoede@redhat.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
Spares for imsm arrays do not have any info about the container in their
metadata records. If Detail() inadvertantly picks such a device for
->get_array_info() it will end up with less than useful info for the
container. So, continue to read from the disks until a non-spare device
is found.
This bug was found by timeouts waiting for udev to create the
user-friendly container name. To detect future UUID reporting problems
and a debug print to the timeout case in wait_for().
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
The family_number field can change. The option-rom will change the
family number when it starts a rebuild process (flags a container for
rebuild). This was not seen previously as mdadm would usually start the
rebuild process, preserving the family number.
This is the mechanism that helps to prevent a prodigal array member from
being returned to its original system and cause a rebuild to go in the
wrong direction. With the change we will end up with a container that
will fail to assemble unless the device with the incompatible family
number is left out of the assembly.
So, take several actions:
1/ Convert uuid generation to use orig_family_num, being careful to
preserve the existing uuid in the case where orig_family_num is not
set (i.e. previous mdadm created imsm arrays)
2/ Set orig_family_num at Create. For arrays created by mdadm prior to
this release orig_family_num will be zero, so set it to family_num at
the first metadata write.
3/ Add checks for orig_family_num to compare_super_imsm
4/ Update the family number when initiating rebuild
5/ The option-rom mixes some random data into the family number, add
this functionality to the mdadm implementation.
Reported-by: Marcin Labun <marcin.labun@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>