Wednesday, July 20, 2011

dmesg(1) changes for util-linux 2.20

I have rewritten dmesg(1). It is the first large change to the code in the last 18 years.

New features:
  • --decode to translate facility and level numbers to human-readable prefixes
$ dmesg --decode
kern :info : [26443.677632] ata1.00: configured for UDMA/100
kern :info : [26443.830225] PM: resume of devices complete after 2452.856 msecs
kern :debug : [26443.830606] PM: Finishing wakeup.
kern :warn : [26443.830608] Restarting tasks ... done.
  • filter out messages according to the --facility and --level options, for example
$ dmesg --level=err,warn

$ dmesg --facility=daemon,user

$ dmesg --facility=daemon --level=debug

  • -u, --userspace to print only userspace messages

  • -k, --kernel to print only kernel messages

  • -t, --notime to skip [...] timestamps

  • -T, --ctime to print human-readable timestamps in a ctime()-like format. Unfortunately, this is useless on laptops if you have used suspend/resume. (The kernel does not use the standard system time as the source for printk() timestamps, and that source is not updated after resume.)

  • --show-delta to print time delta between printed messages
$ dmesg --show-delta
[35523.876281 < 4.016887>] usb 1-4.1: new low speed USB device using hci_hcd and address 12
[35523.968398 < 0.092117>] usb 1-4.1: New USB device found, idVendor=413c, idProduct=2003
[35523.968408 < 0.000010>] usb 1-4.1: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[35523.968416 < 0.000008>] usb 1-4.1: Product: Dell USB Keyboard
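
A side note on what --decode translates: a syslog-style priority number packs the facility and the level together (priority = facility * 8 + level). The following is only a minimal shell sketch of that arithmetic, not the dmesg(1) implementation; the decode_prio name is made up for illustration and only the first four facilities are named.

```shell
#!/bin/sh
# Sketch: split a syslog-style priority number into facility and level.
# decode_prio is a hypothetical helper, not part of util-linux.
decode_prio() {
    prio=$1
    facility=$((prio >> 3))   # everything above the low three bits
    level=$((prio & 7))       # low three bits
    case $facility in
        0) fname=kern ;;
        1) fname=user ;;
        2) fname=mail ;;
        3) fname=daemon ;;
        *) fname="facility$facility" ;;   # remaining facilities left numeric
    esac
    case $level in
        0) lname=emerg ;;
        1) lname=alert ;;
        2) lname=crit ;;
        3) lname=err ;;
        4) lname=warn ;;
        5) lname=notice ;;
        6) lname=info ;;
        7) lname=debug ;;
    esac
    printf '%s:%s\n' "$fname" "$lname"
}

decode_prio 6     # facility 0, level 6
decode_prio 28    # facility 3, level 4
```

The two calls print kern:info and daemon:warn, which are the same kind of prefixes that --decode puts in front of each message.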

Wednesday, April 20, 2011

bind mounts, mtab and read-only

Bind mounts have been supported since Linux 2.4. That is a pretty long time, but many users still think that bind mounts are something completely different from normal mounts.

Example 1:
 # mount /dev/sdb1 /mnt/A
# mount /dev/sdb1 /mnt/B
This is not a bug. It's possible to mount the same filesystem in two places.

Example 2:
 # mount /dev/sdb1 /mnt/A
# mount --bind /mnt/A /mnt/B
The result from both examples is the same, see /proc/self/mountinfo:
 # grep mnt /proc/self/mountinfo
48 20 8:17 / /mnt/A rw,relatime - ext4 /dev/sdb1 rw,barrier=1,stripe=64,data=ordered
49 20 8:17 / /mnt/B rw,relatime - ext4 /dev/sdb1 rw,barrier=1,stripe=64,data=ordered
This is very important: from the kernel's point of view it is the same thing. The same filesystem is mounted in two places.

The kernel does not keep any record that /mnt/B was created by a bind mount (the MS_BIND flag of the mount(2) syscall). There is no dependency between /mnt/A and /mnt/B (for example, you can umount /mnt/A).
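
To see this for yourself, you can group the mount points in a mountinfo-style file by their major:minor number (third field). This is a minimal sketch; the same_fs_mounts name is made up for illustration, and it only compares device numbers, so it does not look at the root field (a bind mount of a subdirectory would be reported together with the rest).

```shell
#!/bin/sh
# Sketch: print every device (major:minor) that is mounted in more than
# one place, followed by its mount points.
# $1 = mountinfo-style file (defaults to the live /proc/self/mountinfo)
same_fs_mounts() {
    awk '{
             mounts[$3] = mounts[$3] " " $5   # field 3: major:minor, field 5: mount point
             count[$3]++
         }
         END {
             for (d in count)
                 if (count[d] > 1)
                     print d ":" mounts[d]
         }' "${1:-/proc/self/mountinfo}"
}

if [ -r /proc/self/mountinfo ]; then
    same_fs_mounts
fi
```

For the two mountinfo lines shown above it prints "8:17: /mnt/A /mnt/B", no matter whether /mnt/B came from example 1 or example 2.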

Unfortunately, the situation in the /etc/mtab file is completely different:
 # grep mnt /etc/mtab
/dev/sdb1 /mnt/A ext4 rw 0 0
/mnt/A /mnt/B none rw,bind 0 0
This is confusing for many users. Try:
 # umount /mnt/A
# rm -rf /mnt/A

# grep mnt /etc/mtab
/mnt/A /mnt/B none rw,bind 0 0
Does the information in mtab make any sense? I don't think so... Keeping this kind of information in userspace is a mistake. Yeah, mtab is evil.


Everyone who uses bind mounts on a system without mtab (where mtab is a symlink to /proc/mounts) has to understand that the "bind" flag is no longer stored anywhere. For example, you have to explicitly add the flag to the mount options if you want to use a read-only bind mount.
 # rm -f /etc/mtab
# ln -s /proc/mounts /etc/mtab
(or install Fedora 15:-)

Let's use findmnt(8) rather than grep /proc/self/mountinfo:
 # findmnt -o TARGET,VFS-OPTIONS,FS-OPTIONS /dev/sda1
TARGET VFS-OPTIONS FS-OPTIONS
/mnt/A rw,relatime rw,errors=continue,user_xattr,acl,barrier=0,data=ordered
/mnt/B rw,relatime rw,errors=continue,user_xattr,acl,barrier=0,data=ordered
What will happen if we try to remount with the bind flag? See:
  # mount -o remount,ro,bind /mnt/B

# findmnt -o TARGET,VFS-OPTIONS,FS-OPTIONS /dev/sda1
TARGET VFS-OPTIONS FS-OPTIONS
/mnt/A rw,relatime rw,errors=continue,user_xattr,acl,barrier=0,data=ordered
/mnt/B ro,relatime rw,errors=continue,user_xattr,acl,barrier=0,data=ordered
The filesystem (superblock) is still read-write, but the /mnt/B mountpoint is marked as read-only in the VFS.

And now the same thing without the bind flag:
 # mount -o remount,ro /mnt/B

# findmnt -o TARGET,VFS-OPTIONS,FS-OPTIONS /dev/sda1
TARGET VFS-OPTIONS FS-OPTIONS
/mnt/A rw,relatime ro,errors=continue,user_xattr,acl,barrier=0,data=ordered
/mnt/B ro,relatime ro,errors=continue,user_xattr,acl,barrier=0,data=ordered
The superblock has been remounted read-only, so the filesystem is read-only everywhere in the system.

Again, all of this works regardless of how /mnt/B was mounted (examples 1 and 2).

BTW, you can also set the block device read-only with blockdev --setro. So we have three layers (device -> FS -> VFS) where it is possible to set the read-only attribute :-)
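
The VFS and FS layers can also be read straight from a mountinfo-style file: the per-mountpoint (VFS) options are in field 6 and the superblock (FS) options are the third field after the "-" separator. A minimal sketch, assuming that layout; the ro_layers name is made up for illustration.

```shell
#!/bin/sh
# Sketch: report the read-only state of the VFS and FS layers for one
# mount point.
# $1 = mount point, $2 = mountinfo-style file (defaults to the live one)
ro_layers() {
    awk -v target="$1" '
        $5 == target {
            # optional fields may sit between field 6 and the "-" separator
            for (i = 7; i <= NF; i++) if ($i == "-") break
            vfs = ($6 ~ /(^|,)ro(,|$)/) ? "ro" : "rw"
            fs  = ($(i + 3) ~ /(^|,)ro(,|$)/) ? "ro" : "rw"
            print "vfs=" vfs " fs=" fs
        }' "${2:-/proc/self/mountinfo}"
}

if [ -r /proc/self/mountinfo ]; then
    ro_layers /
fi
```

After "mount -o remount,ro,bind /mnt/B" the call "ro_layers /mnt/B" should report vfs=ro fs=rw, and after a plain "mount -o remount,ro /mnt/B" both layers should be ro. The device layer is not covered here; blockdev --getro reports that one.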

Tuesday, January 4, 2011

findmnt(8) and submounts

I just applied (to the util-linux upstream) a patch that allows listing all submounts of the specified filesystem(s). For example:
$ findmnt --submounts /sys
TARGET SOURCE FSTYPE OPTIONS
/sys /sys sysfs rw,relatime
├─/sys/fs/cgroup tmpfs tmpfs rw,nosuid,nodev,noexec,relat
│ ├─/sys/fs/cgroup/systemd cgroup cgroup rw,nosuid,nodev,noexec,relat
│ ├─/sys/fs/cgroup/cpuset cgroup cgroup rw,nosuid,nodev,noexec,relat
│ ├─/sys/fs/cgroup/ns cgroup cgroup rw,nosuid,nodev,noexec,relat
│ ├─/sys/fs/cgroup/cpu cgroup cgroup rw,nosuid,nodev,noexec,relat
│ ├─/sys/fs/cgroup/cpuacct cgroup cgroup rw,nosuid,nodev,noexec,relat
│ ├─/sys/fs/cgroup/memory cgroup cgroup rw,nosuid,nodev,noexec,relat
│ ├─/sys/fs/cgroup/devices cgroup cgroup rw,nosuid,nodev,noexec,relat
│ ├─/sys/fs/cgroup/freezer cgroup cgroup rw,nosuid,nodev,noexec,relat
│ ├─/sys/fs/cgroup/net_cls cgroup cgroup rw,nosuid,nodev,noexec,relat
│ └─/sys/fs/cgroup/blkio cgroup cgroup rw,nosuid,nodev,noexec,relat
├─/sys/kernel/security systemd-1 autofs rw,relatime,fd=22,pgrp=1,tim
├─/sys/kernel/debug systemd-1 autofs rw,relatime,fd=24,pgrp=1,tim
└─/sys/fs/fuse/connections fusectl fusectl rw,relatime
returns info about /sys and all /sys submounts.

Now you can implement recursive umount in shell, something like:
for d in $(findmnt --list --submounts "$MOUNTPOINT" -o TARGET -n | tac); do
    umount "$d"
done
I hope that umount(8) will support something like this ASAP.

Thursday, December 16, 2010

lsblk(8)

lsblk(8) is a new util that will be available in util-linux-2.19 (coming soon;-). The util lists all block devices as a tree. This output is very useful on machines with a complicated storage setup (e.g. RAID, dm-crypt, ...). It's so useful that we'll probably backport lsblk(8) to RHEL-6 to make life easier for people who need to debug their systems.

The original idea comes from "dmsetup ls --tree", but lsblk(8) is better :-) It uses "holders" and "slaves" from /sys filesystem. This means that the util is usable without root permissions and it works for all types of block devices (dmsetup uses DM ioctls).
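
The "holders" directories are easy to walk by hand. The following is a minimal sketch of the idea, not how lsblk(8) is implemented; the print_tree name and its arguments are made up for illustration, and it follows only "holders" (it does not descend from a disk into its partitions).

```shell
#!/bin/sh
# Sketch: print a device and, indented below it, everything stacked on
# top of it according to the sysfs "holders" directories.
# $1 = kernel device name (e.g. sdc1), $2 = indent, $3 = sysfs root
print_tree() {
    printf '%s%s\n' "$2" "$1"
    for h in "${3:-/sys}/class/block/$1/holders"/*; do
        [ -e "$h" ] || continue          # no holders: the glob stays unexpanded
        print_tree "${h##*/}" "$2  " "${3:-/sys}"
    done
}

# Usage (device name is just an example): print_tree sda5 ""
```

No root permissions are needed, because everything comes from world-readable directories in /sys.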

For example my laptop with dm-crypted $HOME and partitioned RAID0 (md8) on scsi_debug device (sdc):
$ lsblk
NAME MAJ:MIN RA SIZE RO MOUNTPOINT
sda 8:0 0 93.2G 0
├─sda1 8:1 0 102M 0 /mnt/test
├─sda2 8:2 0 1K 0
├─sda3 8:3 0 2.3G 0 [SWAP]
├─sda4 8:4 0 76.2G 0 /
├─sda5 8:5 0 10G 0
│ └─kzak-home (dm-0) 253:0 0 10G 0 /home/kzak
└─sda6 8:6 0 4.7G 0 /boot
sdc 8:32 0 500M 0
├─sdc1 8:33 0 250M 0
│ └─md8 9:8 0 498.9M 0
│   ├─md8p1 259:0 0 100M 0
│   ├─md8p2 259:1 0 100M 0
│   └─md8p3 259:2 0 297.9M 0
└─sdc2 8:34 0 249M 0
  └─md8 9:8 0 498.9M 0
    ├─md8p1 259:0 0 100M 0
    ├─md8p2 259:1 0 100M 0
    └─md8p3 259:2 0 297.9M 0
You can also list more details about devices. The next example is from scsi_debug device with 4KiB sectors and enabled alignment offset:
$ lsblk --topology /dev/sdb
NAME ALIGNMENT MIN-IO OPT-IO PHY-SEC LOG-SEC ROTA SCHED
sdb 3584 4096 32768 4096 512 0 cfq
├─sdb1 0 4096 32768 4096 512 0 cfq
└─sdb2 0 4096 32768 4096 512 0 cfq
lsblk(8) is also usable in scripts, for example:
$ lsblk --nodeps --noheading --raw -o ALIGNMENT /dev/sdb
3584
returns only alignment-offset for device sdb.

Thanks to Milan Broz (from Red Hat), who is the author of the original lsblk(8) prototype.

Thursday, December 2, 2010

util-linux is without -ng

The util-linux-ng project was officially merged back into util-linux in the last few days. It means that there is no -ng fork anymore. Fortunately, the change was pretty simple, because the original util-linux project had been inactive for the last four years. It was enough to rename the mailing list, the git repository and the directories at kernel.org. Note that the old addresses still work and everything is redirected to the new locations.

The last remaining problem is to rename the mailing list at gmane.org (http://news.gmane.org/gmane.linux.utilities.util-linux-ng/). I don't want to lose the list history, so renaming seems better than removing the old list and adding a new one (without -ng). Let's hope that the gmane admins will be able to make the change.

See
http://article.gmane.org/gmane.linux.file-systems/49137
for more details about new URLs and addresses.

Thanks to Kay, Adrian, John and David.

Thursday, July 15, 2010

findmnt(8)

I released util-linux-ng 2.18 two weeks ago. There are many changes, for example a completely new libmount (no stable API yet), the new fsfreeze(8) and findmnt(8) utils, and some important changes in fdisk(8).

From my point of view, the most attractive feature for end users is findmnt(8). This new util is a command line interface to the libmount library; it is able to search in /etc/fstab, /etc/mtab or /proc/self/mountinfo.

Default output (mounted filesystems):
$ findmnt
TARGET SOURCE FSTYPE OPTIONS
/ /dev/sda4 ext3 rw,noatime,errors=co
├─/proc /proc proc rw,relatime
│ ├─/proc/bus/usb /proc/bus/usb usbfs rw,relatime
│ ├─/proc/sys/fs/binfmt_misc none binfmt_m rw,relatime
│ └─/proc/fs/nfsd nfsd nfsd rw,relatime
├─/sys /sys sysfs rw,relatime
├─/dev udev devtmpfs rw,relatime,size=197
│ ├─/dev/pts devpts devpts rw,relatime,gid=5,mo
│ └─/dev/shm tmpfs tmpfs rw,relatime
├─/boot /dev/sda1 ext3 rw,noatime,errors=co
├─/home/kzak /dev/mapper/kzak-home ext4 rw,noatime,barrier=1
│ └─/home/kzak/.gvfs gvfs-fuse-daemon fuse.gvf rw,nosuid,nodev,rela
├─/var/lib/nfs/rpc_pipefs sunrpc rpc_pipe rw,relatime
├─/mnt/foo //sr.net.home/foo cifs rw,relatime,mand,unc
└─/mnt/test /dev/sda6 btrfs rw,relatime
Get info about a mountpoint:
$ findmnt /home/kzak
TARGET SOURCE FSTYPE OPTIONS
/home/kzak /dev/mapper/kzak-home ext4 rw,noatime,barrier=1,data=ordered
Get all mounted extN filesystems:
$ findmnt -t ext4,ext3
TARGET SOURCE FSTYPE OPTIONS
/ /dev/sda4 ext3 rw,noatime,errors=continue,user_xattr
/boot /dev/sda1 ext3 rw,noatime,errors=continue,user_xattr
/home/kzak /dev/mapper/kzak-home ext4 rw,noatime,barrier=1,data=ordered
The same thing, but from fstab:
$ findmnt --fstab -t ext4,ext3
TARGET SOURCE FSTYPE OPTIONS
/ UUID=d3a8f783-df75-4dc8-9163-975a891052c0 ext3 noatime,defaults
/boot UUID=f1cd38fa-c887-4ab8-834b-c8ee659b97fe ext3 noatime,defaults
/home/kzak /dev/mapper/kzak-home ext4 noatime,defaults
Don't like LABELs/UUIDs?
$ findmnt --fstab --evaluate -t ext4,ext3
TARGET SOURCE FSTYPE OPTIONS
/ /dev/sda4 ext3 noatime,defaults
/boot /dev/sda1 ext3 noatime,defaults
/home/kzak /dev/mapper/kzak-home ext4 noatime,defaults
or convert UUID to mountpoint:
$ findmnt -o TARGET --noheadings UUID=f1cd38fa-c887-4ab8-834b-c8ee659b97fe
/boot

Wednesday, May 12, 2010

4096-byte sector hard drives

Maybe you have already read some blogs/articles about the new 4KiB disks and Linux. These articles usually share one important thing -- WDxxEARS hard drives. Unfortunately, it seems that WDC made a brown-paper-bag bug here. The disks report a 512-byte physical sector size instead of 4096...
# hdparm -I /dev/sdb

ATA device, with non-removable media
Model Number: WDC WD15EARS-00Z5B1
Serial Number: WD-WMAVUxxxxxxx
Firmware Revision: 80.00A80
[...]
Logical/Physical Sector size: 512 bytes
Fortunately, it seems that newer models correctly report 4K sectors:
ATA device, with non-removable media
Model Number: WDC WD15EARS-00Z5B1
Serial Number: WD-WMAVUxxxxxxx
Firmware Revision: 80.00A80
[...]
Logical Sector size: 512 bytes
Physical Sector size: 4096 bytes
Logical Sector-0 offset: 0 bytes
Now the good news. Fedora-13 and all related upstream projects are ready for 4096-byte sector disks. The libparted, fdisk, mkfs.{ext[234],xfs,gfs2} and cryptsetup (upstream, Fedora-14 and RHEL6) have been enhanced to use the new I/O topology to properly align things on the devices.

The I/O topology (aka "I/O limits") has been supported since kernel 2.6.31. The topology is exported to userspace via sysfs, for example:
  $ cat /sys/block/sdb/queue/physical_block_size
4096
$ cat /sys/block/sdb/queue/logical_block_size
512
$ cat /sys/block/sdb/queue/optimal_io_size
32768
The kernel has also supported topology ioctls since 2.6.32. parted, fdisk and mkfs.{ext,xfs} use libblkid to get the topology, but some other tools use the ioctls directly. So it's better to have kernel 2.6.32 or .33.

The fdisk(8) command uses a 1MiB offset and grain to align partitions by default, so the resulting partition table is usable on hard drives with 4096-byte sectors regardless of the topology the disk reports. This is good news for WDxxEARS users.
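
The arithmetic behind this is simple: a partition is aligned when its starting byte offset is a multiple of the physical sector size. A minimal sketch, assuming an alignment offset of 0; the is_aligned name is made up for illustration.

```shell
#!/bin/sh
# Sketch: succeed when a partition start falls on a physical sector boundary.
# $1 = start sector, $2 = logical sector size, $3 = physical sector size
is_aligned() {
    [ $(( ($1 * $2) % $3 )) -eq 0 ]
}

# 1MiB default offset: sector 2048 with 512-byte logical sectors
if is_aligned 2048 512 4096; then echo "2048: aligned"; else echo "2048: misaligned"; fi

# Traditional DOS-compatible start at sector 63
if is_aligned 63 512 4096; then echo "63: aligned"; else echo "63: misaligned"; fi
```

Sector 2048 is byte offset 1048576, a multiple of 4096, so the first line reports aligned. Sector 63 is byte offset 32256, which leaves a remainder of 3584, so the second one reports misaligned. That is why the 1MiB default is safe even when the drive lies about its physical sector size.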

If you want to use fdisk(8) then think twice and don't forget that fdisk is a low-level tool. Some fdisk(8) hints:
  • use fdisk from util-linux-ng >= 2.17.2
  • read warnings
  • don't use DOS-compatible mode (for backward compatibility this mode is enabled by default; you have to use the 'c' command or the '-c' command line option to disable DOS mode. Note that in the next major release the DOS mode will be disabled by default.)
  • use sectors as display units (command 'u' or '-u' command line option)
  • all default sizes/offsets are aligned to the physical block boundary (e.g. "First sector" dialog always provides aligned default)
  • use the +size{M,G} convention to specify "Last sector" (e.g. +5G to create a 5GiB partition); fdisk then aligns the size to the physical block boundary
  • don't forget that fdisk(8) always follows your wishes -- it means that if you explicitly define first/last sector number then the partition could be misaligned
  • the 'p' (print) command checks partition alignment
For more information about 4KiB sectors read:
https://ata.wiki.kernel.org/index.php/ATA_4_KiB_sector_issues
http://people.redhat.com/msnitzer/docs/io-limits.txt