Wednesday, February 15, 2012

libblkid maintainer's brain dump

This article is about the low-level probing code in libblkid, and it really is just a dump, nothing more ;-)

High and Low level

The library contains two APIs.
  • high-level - this is the original library code from e2fsprogs. All results are cached in the file /etc/blkid.tab (or /run/blkid/blkid.tab). The advantage is that information about LABELs and UUIDs is accessible to non-root users and the cache has a positive impact on performance.

    This advantage no longer holds on many systems, where all the necessary information is stored in the udev db and things like LABEL and UUID are accessible through the /dev/disk/by-* udev symlinks.

    This is the reason why the blkid_evaluate_* functions are recommended for newly written programs; they are able to use the udev symlinks as well as the original libblkid cache. This functionality is also accessible from the command line through the blkid -L|-U command.

  • low-level - this part of the API completely bypasses the cache and allows you to work directly with the library's probing functions. The rest of this article is about the low-level part of the library.
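The udev symlink lookup mentioned for the high-level API can be approximated without the library at all. The following is a minimal Python sketch, not the way blkid_evaluate_* is implemented: the function name resolve_tag and its root parameter are made up for illustration, and it ignores the encoding udev applies to special characters in labels.

```python
import os

# Map tag names to the udev symlink directories under /dev/disk.
TAG_DIRS = {"LABEL": "by-label", "UUID": "by-uuid"}


def resolve_tag(tag, value, root="/dev/disk"):
    """Return the device path a LABEL/UUID symlink points to, or None.

    `root` (hypothetical parameter) exists only so the function can be
    tried against a scratch directory; on a real system it is /dev/disk.
    """
    subdir = TAG_DIRS.get(tag.upper())
    if subdir is None:
        raise ValueError("unsupported tag: %s" % tag)
    link = os.path.join(root, subdir, value)
    if not os.path.islink(link):
        # No udev symlink for this tag value.
        return None
    # Follow the symlink (e.g. ../../sda1) to the real device node.
    return os.path.realpath(link)
```

A call such as resolve_tag("UUID", "<some-uuid>") returns None when no matching symlink exists, which is the point where a real program would fall back to the libblkid cache or to probing.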
The library contains three chains of probing functions:
  1. superblocks
  2. partitions
  3. topology
Superblocks probing is enabled by default. The command "blkid -p -o udev" (or the built-in code in udevd) enables the partitions probing chain too.

There are two basic probing methods:
  • safeprobe - this is the recommended method. It checks for collisions between filesystems, RAIDs and partition tables.
  • fullprobe - doesn't check for conflicts; used for example in wipefs(8)
For superblocks, only a NAME=value based API is available. For topology and partitions, a binary interface is available too. See the docs link below.
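The NAME=value results are easy to consume from a script, for example by parsing what "blkid -p -o udev" prints. A small Python sketch follows; the function name parse_name_value is hypothetical and the sample input is made up for illustration, not captured from a real device.

```python
def parse_name_value(text):
    """Parse NAME=value lines into a dict, skipping lines without '='."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue
        # Split on the first '=' only; values may contain '='.
        name, value = line.split("=", 1)
        # Tolerate optional double quotes around the value.
        if len(value) >= 2 and value[0] == value[-1] == '"':
            value = value[1:-1]
        result[name] = value
    return result


# Illustrative input only; the values are invented.
SAMPLE = """\
ID_FS_UUID=0f3e2c1a-1111-2222-3333-444455556666
ID_FS_TYPE=ext4
ID_FS_USAGE=filesystem
ID_PART_ENTRY_SCHEME=gpt
ID_PART_ENTRY_NUMBER=1
"""

info = parse_name_value(SAMPLE)
print(info["ID_FS_TYPE"], info["ID_PART_ENTRY_SCHEME"])
```

The ID_FS_* keys come from the superblocks chain and the ID_PART_ENTRY_* keys from the partitions chain, so a missing ID_PART_ENTRY_* key simply means the partitions chain was not enabled or found nothing.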

Superblocks
  • four basic "usage" groups: filesystems, RAIDs, crypto and others
  • RAIDs (MD, LVM, ...) are probed before filesystems
  • don't check for filesystems when a RAID signature is detected
  • don't check for RAIDs or others (swap-area) on CD-ROMs
  • don't check for RAIDs on tiny devices (< 1 MiB)
  • don't read whole FAT root directory (to lookup LABEL) on tiny devices (< 1 MiB)
exceptions / extra cases:
  • MD RAID is ignored if detected within a valid partition during whole-disk probing

    [use case: partitioned disk where the last partition is used as a RAID member and the RAID has its metadata at the end of that partition (so at the end of the disk)]

  • LVM signature is ignored if another signature is detected within the first 8 KiB of the device (LVM wipes this area, so there should not be any filesystem superblock)

    [use case: disk with LVM, the user stops using LVM and creates a new partition table with fdisk; the result is an MBR and an obsolete LVM signature on the same device]
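The ordering rules in the list above can be summarized as a toy model. This Python sketch is not library code: first_match and its found argument are hypothetical, the crypto group and the extra cases (MD within a partition, stale LVM) are left out, and the relative order of filesystems and "others" is an assumption.

```python
MIB = 1024 * 1024


def first_match(found, size_bytes, is_cdrom=False):
    """Pick the signature that would be reported.

    `found` maps a usage group ("raid", "filesystem", "other") to the
    name of a signature present on the device (hypothetical input).
    """
    # RAID probing is skipped on CD-ROMs and on tiny devices (< 1 MiB).
    check_raid = not is_cdrom and size_bytes >= MIB
    if check_raid and "raid" in found:
        # RAIDs are probed first; a hit means filesystems are not checked.
        return ("raid", found["raid"])
    if "filesystem" in found:
        return ("filesystem", found["filesystem"])
    # "Others" (e.g. a swap area) are not checked on CD-ROMs.
    if not is_cdrom and "other" in found:
        return ("other", found["other"])
    return None
```

For example, a device that carries both a RAID member signature and a filesystem signature is reported as a RAID member, unless it is a CD-ROM or smaller than 1 MiB.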
Partitions
  • disabled by default, enabled for udev (see ID_PART_ENTRY_* in udev db)
  • parse partition tables (aix, minix, bsd, mbr, gpt, mac, sgi, solaris, sun, ultrix and unixware)
  • detect nested partition tables (e.g. BSD) within partitions
  • if the given device is a partition (e.g. sda1), open the whole disk (e.g. sda) to read details about the partition from the partition table. This feature has to be enabled with the BLKID_PARTS_ENTRY_DETAILS flag.
  • partition table is ignored if a valid RAID superblock is detected at the end of the device

    [use case: partitioned RAID1 (mirror) -- the partition table is visible from the underlying devices]
Topology
  • rarely used
  • designed for mkfs-like or fdisk-like programs to get info about I/O topology
  • for kernel >= 2.6.3x uses ioctl or sysfs
  • as fallback for old kernels uses code originally from xfsprogs

Tips for users

  • please use wipefs(8) before fdisk, mkfs or mkswap. The latest version is able to remove really all possible backup signatures, partition tables and things that are invisible at first glance. Don't rely on mkfs developers :-)
  • think twice before you start using complex setups (for example partitioned RAIDs), to avoid misinterpretation by the kernel or system tools.
  • don't forget that blkid without -p might return cached results
Tips for developers

.... I'll try to keep these notes updated.

3 comments:

  1. I notice that libblkid does not return a blkid_partlist when blkid_probe_get_partitions() is called on a mdadm device. gfdisk and fdisk both do show a partition table, though the partitions they list don't have corresponding /dev nodes. showpart(1) doesn't list either GPT or MBR partitions. Any idea what's up?

    1. A blog is probably a bad place for such a discussion... anyway, try

      LIBBLKID_DEBUG=0xffff partx /dev/sda

      Is it a top-level device or a RAID member?

    2. Top-level mdadm device. Feel free to hit me up at nick.black@sprezzatech.com.

      [skynet](0) $ sudo gfdisk /dev/md127
      GNU Fdisk 1.2.4
      Copyright (C) 1998 - 2006 Free Software Foundation, Inc.
      This program is free software, covered by the GNU General Public License.

      This program is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
      GNU General Public License for more details.

      Using /dev/md127
      Command (m for help): p

      Disk /dev/md127: 16003 GB, 16003162344960 bytes
      255 heads, 63 sectors/track, 1945607 cylinders
      Units = cylinders of 16065 * 512 = 8225280 bytes

      Device Boot Start End Blocks Id System
      /dev/md127p1 1 1945608 15628096228 83 Linux
      Warning: Partition 1 does not end on cylinder boundary.
      Command (m for help): q
      [skynet](0) $ sudo fdisk /dev/md127
      GNU Fdisk 1.2.4
      Copyright (C) 1998 - 2006 Free Software Foundation, Inc.
      This program is free software, covered by the GNU General Public License.

      This program is distributed in the hope that it will be useful,
      but WITHOUT ANY WARRANTY; without even the implied warranty of
      MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
      GNU General Public License for more details.

      Using /dev/md127
      Command (m for help): p

      Disk /dev/md127: 16003 GB, 16003162344960 bytes
      255 heads, 63 sectors/track, 1945607 cylinders
      Units = cylinders of 16065 * 512 = 8225280 bytes

      Device Boot Start End Blocks Id System
      /dev/md127p1 1 1945608 15628096228 83 Linux
      Warning: Partition 1 does not end on cylinder boundary.
      Command (m for help): q
      [skynet](0) $ sudo LIBBLKID_DEBUG=0xffff partx /dev/md127
      libblkid: debug mask set to 0xffff.
      allocate a new probe 0x133c050
      zeroize wiper
      ready for low-probing, offset=0, size=16003169779712
      whole-disk: YES, regfile: NO
      partlist reset
      parts: initialized partitions list (0x133cc70, size=0)
      --> starting probing loop [PARTS idx=-1]
      buffer read: off=0 len=1024 pr=0x133c050
      reuse buffer: off=0 len=1024 pr=0x133c050
      reuse buffer: off=0 len=1024 pr=0x133c050
      reuse buffer: off=0 len=1024 pr=0x133c050
      gpt: ---> call probefunc()
      reuse buffer: off=0 len=1024 pr=0x133c050
      gpt: <--- (rc = 1)
      reuse buffer: off=0 len=1024 pr=0x133c050
      ultrix: ---> call probefunc()
      buffer read: off=15872 len=512 pr=0x133c050
      ultrix: <--- (rc = 1)
      reuse buffer: off=0 len=1024 pr=0x133c050
      reuse buffer: off=0 len=1024 pr=0x133c050
      reuse buffer: off=0 len=1024 pr=0x133c050
      buffer read: off=28672 len=1024 pr=0x133c050
      reuse buffer: off=0 len=1024 pr=0x133c050
      reuse buffer: off=0 len=1024 pr=0x133c050
      <-- leaving probing loop (failed) [PARTS idx=10]
      partx: /dev/md127: failed to read partition table
      reseting probing buffers pr=0x133c050
      buffers summary: 2560 bytes by 3 read() call(s)
      free probe 0x133c050
      [skynet](0) $
