Do the (disk) shuffle
Jul. 9th, 2014 11:30 am

My wanna-be server has an external disk tower which can hold 8 disks. In this I had 4*2Tb disks in a RAID5 (media rips, mostly) and 4*1Tb disks in a RAID10 (main filesystems, VM images, etc). It also has, internally, 2*500Gb in a RAID1 for root+boot. Since that left some spare space I also put a "backup" filesystem on there, where I store dumps from various important VMs and colo'd servers (linode, Panix, SYS).
The media disk ran out of space. Previously when this has happened I just replaced the RAID5 disks with bigger ones and repurposed the old ones into the RAID10 (that's why I have 500Gb, 1Tb and 2Tb disks; they all went through this process). So, the obvious move: 4*4Tb in a RAID5...
But I don't trust these new, bigger disks as much. The risk of a double failure during a rebuild is too big for my liking, so they'd have to go in a RAID6. And 4 disks in a RAID6 is not much of a win: two of the four go to parity, leaving only 2 disks' worth of usable space.
So I found a hotswap bay that I could plug into the main case; that gives 4 more disk slots. I can now do an 8*4Tb RAID6 and a 4*2Tb RAID10.
The fun part was making this all work with minimal downtime. I actually got two hotswap bays (they're cheap) and had one hanging loose off the side of the machine. So at one point I had 4*2Tb, 4*1Tb, 8*4Tb and 1*500Gb disks attached. Only one 500Gb disk... because I ran out of SATA ports! I temporarily degraded my root and backup arrays to perform this work.
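(For the curious: degrading a mirror just means failing and removing one half with mdadm. A minimal sketch, using made-up device names rather than my real ones:

mdadm /dev/md0 --fail /dev/sdX1 --remove /dev/sdX1
mdadm /dev/md2 --fail /dev/sdX3 --remove /dev/sdX3

After that the disk can be unplugged and its SATA port reused; the arrays carry on running on one disk each.)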
Now:
mdadm --create /dev/md6 -l 6 -n 8 -b internal /dev/sdc1 /dev/sdm2 /dev/sdn3 /dev/sdo4 /dev/sdb1 /dev/sdl2 /dev/sdp3 /dev/sdq4
(time passes)
pvcreate /dev/md6
vgextend /dev/Large /dev/md6
pvmove /dev/md5 /dev/md6
(even more time passes)
vgreduce /dev/Large /dev/md5
Hey, all my Media is now on the RAID6!
% vgs /dev/Large
  VG    #PV #LV #SN Attr   VSize  VFree
  Large   1   1   0 wz--n- 21.83t 16.37t
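(The "(time passes)" steps aren't silent, by the way. The RAID6 build shows up in /proc/mdstat, and pvmove prints a percentage as it goes; something like this, though I'm quoting from memory:

cat /proc/mdstat
lvs -a -o+copy_percent

The second one shows the temporary pvmove mirror LV and how far through it is.)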
Now I can repurpose the 2Tb disks:
mdadm --stop /dev/md5
mdadm --create /dev/md20 -l 10 -n 4 /dev/sdd1 /dev/sde2 /dev/sdf3 /dev/sdg4
(time passes)
pvcreate /dev/md20
vgextend /dev/Raid10 /dev/md20
pvmove /dev/md10 /dev/md20
(more time...)
vgreduce /dev/Raid10 /dev/md10
% vgs /dev/Raid10
  VG     #PV #LV #SN Attr   VSize VFree
  Raid10   1  11   0 wz--n- 3.64t 2.69t
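(Once md10 was empty I could retire it before pulling its disks; roughly this, with placeholder partition names since I didn't note down the real ones:

mdadm --stop /dev/md10
mdadm --zero-superblock /dev/sdX1 /dev/sdY2 /dev/sdZ3 /dev/sdW4

Zeroing the superblocks stops the old members being auto-assembled if they ever get plugged back in.)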
And, just to be safe, I updated /etc/mdadm.conf:
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 level=raid1 num-devices=2 UUID=fe93a001:36e719f2:994f2675:3f26f378
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=e0c936d6:5959922b:31421ce0:99b75db1
ARRAY /dev/md6 level=raid6 num-devices=8 UUID=b7db1262:f20170ed:57fd1f88:1cea0273
ARRAY /dev/md20 level=raid10 num-devices=4 UUID=a3d58575:7655548e:349f6442:71ce924e
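(No need to type those UUIDs by hand; mdadm will generate the ARRAY lines for you:

mdadm --detail --scan

prints one line per running array, ready to paste into the config.)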
All of this activity happened while the machine was up. The only downtime, so far, was to insert the PCIe SATA adapter into the motherboard and put the hotswap bay into the case (tight fit!).
At this point I needed to physically swap disks around: remove the 1Tb disks from the external chassis, move the 2Tb disks to the internal bay, and move the 4Tb disks to the external chassis. In theory this could have been done "live", but it would have caused raid rebuilds and taken too long. So I shut the machine down for 30 minutes and did all the swaps.
And it worked! A quick resync of the internal disk and we have success :-)
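(The resync is just the degrade in reverse: re-add the partitions that were failed out earlier, again with made-up device names:

mdadm /dev/md0 --add /dev/sdX1
mdadm /dev/md2 --add /dev/sdX3

and watch /proc/mdstat while the mirrors catch up.)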
Boot+root (md0 RAID 1)
/dev/sda1 1049kB 8590MB 8589MB primary ext4 boot, raid
/dev/sdb1 1049kB 8590MB 8589MB primary ext4 boot, raid
/dev/Internal (md2 RAID 1) -> Backups
/dev/sda3 8590MB 500GB 492GB primary raid
/dev/sdb3 8590MB 500GB 492GB primary raid
/dev/Raid10 (md20 RAID 10) -> Primary datadisk, VMs, data etc
/dev/sdc1 2097kB 2000GB 2000GB Linux RAID raid
/dev/sdd2 2097kB 2000GB 2000GB Linux RAID raid
/dev/sde3 2097kB 2000GB 2000GB Linux RAID raid
/dev/sdf4 2097kB 2000GB 2000GB Linux RAID raid
/dev/Large (md6 RAID 6) -> Media
/dev/sdg1 1049kB 4001GB 4001GB Linux RAID raid
/dev/sdh2 1049kB 4001GB 4001GB Linux RAID raid
/dev/sdi3 1049kB 4001GB 4001GB Linux RAID raid
/dev/sdj4 1049kB 4001GB 4001GB Linux RAID raid
/dev/sdk1 1049kB 4001GB 4001GB Linux RAID raid
/dev/sdl2 1049kB 4001GB 4001GB Linux RAID raid
/dev/sdm3 1049kB 4001GB 4001GB Linux RAID raid
/dev/sdn4 1049kB 4001GB 4001GB Linux RAID raid
External USB hotswap disk
/dev/sdo1 32.3kB 123GB 123GB primary ext2
People have previously asked me why I have different partition numbers on each disk; well, that's so I can identify a failed disk and verify I've removed the right one :-) Creating these partitions with parted was a bit of a pain, though (large disks need GPT partition tables, so fdisk doesn't work). In the end I created small dummy partitions and then deleted them:
# parted /dev/sdo
GNU Parted 2.1
Using /dev/sdo
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) mklabel gpt
Warning: The existing disk label on /dev/sdo will be destroyed and all data on
this disk will be lost. Do you want to continue?
Yes/No? y
(parted) mkpart primary 34s 34s
Warning: The resulting partition is not properly aligned for best performance.
Ignore/Cancel? i
(parted) mkpart primary 35s 35s
Warning: The resulting partition is not properly aligned for best performance.
Ignore/Cancel? i
(parted) mkpart primary 36s 36s
Warning: The resulting partition is not properly aligned for best performance.
Ignore/Cancel? i
(parted) mkpart primary 2048s 100%
(parted) rm 1
(parted) rm 2
(parted) rm 3
(parted) name 4 "Linux RAID"
(parted) set 4 raid on
(parted) print
Model: ATA ST4000DM000-1F21 (scsi)
Disk /dev/sdo: 4001GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Number Start End Size File system Name Flags
4 1049kB 4001GB 4001GB Linux RAID raid
(parted) unit s
(parted) p
Model: ATA ST4000DM000-1F21 (scsi)
Disk /dev/sdo: 7814037168s
Sector size (logical/physical): 512B/4096B
Partition Table: gpt
Number Start End Size File system Name Flags
4 2048s 7814035455s 7814033408s Linux RAID raid
(parted) q
Information: You may need to update /etc/fstab.
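(If I had to do this again I'd probably script it. An untested sketch of the same dummy-partition trick, assuming parted's -s script mode and -a none to avoid the interactive alignment prompts:

parted -s /dev/sdX mklabel gpt
parted -s -a none /dev/sdX mkpart dummy 34s 34s
parted -s -a none /dev/sdX mkpart dummy 35s 35s
parted -s -a none /dev/sdX mkpart dummy 36s 36s
parted -s /dev/sdX mkpart "Linux RAID" 2048s 100%
parted -s /dev/sdX rm 1 rm 2 rm 3
parted -s /dev/sdX set 4 raid on

where /dev/sdX is whichever disk is next in the pile.)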