Groups > comp.os.linux.misc > #605 > unrolled thread
| Started by | Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> |
|---|---|
| First post | 2011-04-05 19:39 -0700 |
| Last post | 2011-04-12 03:37 +0000 |
| Articles | 20 on this page of 49 — 12 participants |
linux raid vs hw raid Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> - 2011-04-05 19:39 -0700
Re: linux raid vs hw raid Tim Watts <tw@dionic.net> - 2011-04-06 08:01 +0100
Re: linux raid vs hw raid David Brown <david@westcontrol.removethisbit.com> - 2011-04-06 10:03 +0200
Re: linux raid vs hw raid Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> - 2011-04-06 14:00 -0700
Re: linux raid vs hw raid David Brown <david.brown@removethis.hesbynett.no> - 2011-04-06 23:42 +0200
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-08 10:45 +1000
Re: linux raid vs hw raid David Brown <david@westcontrol.removethisbit.com> - 2011-04-08 11:12 +0200
Re: linux raid vs hw raid Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> - 2011-04-08 08:22 -0700
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-09 09:51 +1000
Re: linux raid vs hw raid Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> - 2011-04-08 17:10 -0700
Re: linux raid vs hw raid David Brown <david.brown@removethis.hesbynett.no> - 2011-04-09 13:14 +0200
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-09 09:47 +1000
Re: linux raid vs hw raid David Brown <david.brown@removethis.hesbynett.no> - 2011-04-09 13:55 +0200
Re: linux raid vs hw raid Tris Orendorff <triso@remove-me.cogeco.ca> - 2011-04-12 18:04 +0000
Re: linux raid vs hw raid Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> - 2011-04-12 11:34 -0700
Re: linux raid vs hw raid The Natural Philosopher <tnp@invalid.invalid> - 2011-04-12 21:13 +0100
Re: linux raid vs hw raid David Brown <david@westcontrol.removethisbit.com> - 2011-04-13 09:45 +0200
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-14 13:42 +1000
Re: linux raid vs hw raid David Brown <david@westcontrol.removethisbit.com> - 2011-04-14 09:15 +0200
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-15 08:03 +1000
Re: linux raid vs hw raid Tim Watts <tw@dionic.net> - 2011-04-15 07:22 +0100
Re: linux raid vs hw raid David Brown <david@westcontrol.removethisbit.com> - 2011-04-15 09:28 +0200
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-19 11:20 +1000
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-14 13:38 +1000
Re: linux raid vs hw raid Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> - 2011-04-13 21:49 -0700
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-14 13:34 +1000
Re: linux raid vs hw raid Tris Orendorff <triso@remove-me.cogeco.ca> - 2011-04-15 21:59 +0000
Re: linux raid vs hw raid "Peter J. Holzer" <hjp-usenet2@hjp.at> - 2011-04-16 00:56 +0200
Re: linux raid vs hw raid The Natural Philosopher <tnp@invalid.invalid> - 2011-04-16 01:32 +0100
Re: linux raid vs hw raid Tauno Voipio <tauno.voipio@notused.fi.invalid> - 2011-04-08 21:38 +0300
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-09 09:53 +1000
Re: linux raid vs hw raid KR <kristian.rasmussen@broadpark.no.spam.com> - 2011-04-09 11:56 +0200
Re: linux raid vs hw raid Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> - 2011-04-09 10:32 -0700
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-10 11:12 +1000
Re: linux raid vs hw raid Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> - 2011-04-09 18:59 -0700
Re: linux raid vs hw raid KR <kristian.rasmussen@broadpark.no.spam.com> - 2011-04-10 04:32 +0200
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-10 12:46 +1000
Re: linux raid vs hw raid Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> - 2011-04-09 20:39 -0700
Re: linux raid vs hw raid Robert Riches <spamtrap42@jacob21819.net> - 2011-04-10 03:47 +0000
Re: linux raid vs hw raid Balwinder S Dheeman <bsd.SANSPAM@anu.homelinux.net> - 2011-04-10 11:11 +0530
Re: linux raid vs hw raid Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> - 2011-04-09 23:29 -0700
Re: linux raid vs hw raid Balwinder S Dheeman <bsd.SANSPAM@anu.homelinux.net> - 2011-04-10 14:05 +0530
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-10 20:16 +1000
Re: linux raid vs hw raid Tim Watts <tw@dionic.net> - 2011-04-10 11:28 +0100
Re: linux raid vs hw raid Balwinder S Dheeman <bsd.SANSPAM@anu.homelinux.net> - 2011-04-10 19:43 +0530
Re: linux raid vs hw raid Robert Riches <spamtrap42@jacob21819.net> - 2011-04-12 03:44 +0000
Re: linux raid vs hw raid Balwinder S Dheeman <bsd.SANSPAM@anu.homelinux.net> - 2011-04-12 13:56 +0530
Re: linux raid vs hw raid Grant <omg@grrr.id.au> - 2011-04-10 20:09 +1000
Re: linux raid vs hw raid Robert Riches <spamtrap42@jacob21819.net> - 2011-04-12 03:37 +0000
| From | Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> |
|---|---|
| Date | 2011-04-05 19:39 -0700 |
| Subject | linux raid vs hw raid |
| Message-ID | <fc0t68x5ci.ln2@goaway.wombat.san-francisco.ca.us> |
Hi all, I am attempting to build a snapshot server for a ~15TB fileserver with old fileserver hardware I have on hand. My initial plan was to use the hardware card in the old fileserver in a RAID50 (the card is old enough that it doesn't support RAID6 natively) using new 2TB enterprise hard drives. But, as you probably know, these drives are reasonably expensive. So, since this machine will not be used by end-users very much, I was contemplating using linux software raid instead, exporting desktop-class drives as JBODs and using mdadm to RAID them. The obvious advantage to this is cost: I can save almost 40% of my original estimate by using desktop drives instead, thus fulfilling the original meaning of the I of the RAID acronym. There are other advantages, as well, including being able to build a RAID6, which I slightly prefer over a RAID50, and having more flexibility later on if I want to move to bigger disks. (Yes, I have seen the documentation warning against too-large RAID arrays resulting in a failure during a rebuild.) A tertiary advantage would be that I would learn how to work with linux software RAID, a skill I haven't yet acquired. The disadvantages I can think of are: higher probability of disk failures, resulting in more work for me in swapping out and RMAing failed drives; potential degradation in performance, due both to RAID in software and slower disks; a learning curve for linux RAID; and a configuration less likely to be supported by the hardware RAID vendor. My counters to most of the disadvantages would be that performance only has to be decent, not great, on this box; the learning curve shouldn't be too bad; and this configuration shouldn't require support from the hardware RAID vendor anyway. The disk failures would be the only issue I couldn't counter, except by trying to determine if my labor costs would end up being more than the savings in moving to cheaper disks. 
My questions: 1) Has anyone done this before, and if so, what were the results? Was performance acceptable in this configuration? Are there any gotchas to an otherwise workable configuration? 2) From what I've read so far, using desktop-class disks with linux software RAID should not be a major problem, unlike using them on a true hardware RAID card. Is this reasonably accurate? If not, are there links that describe the difficulties? 3) Suppose that my RAID6 starts out using 12 2TB disks, with three free drive bays (one would be a hot spare). Later on, I want to seamlessly replace the 2TB disks with 3TB or larger disks. Can mdadm grow an array like this if, say, I replace one drive, rebuild, and repeat until I've replaced all 12 disks with larger ones? Or will the new 3TB disks only be used up to 2TB, the size of the original disks? Thanks for any advice or pointers you can provide! --keith -- kkeller-usenet@wombat.san-francisco.ca.us (try just my userid to email me) AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt see X- headers for PGP signature information
| From | Tim Watts <tw@dionic.net> |
|---|---|
| Date | 2011-04-06 08:01 +0100 |
| Message-ID | <dmft68-2q8.ln1@squidward.dionic.net> |
| In reply to | #605 |
Keith Keller wrote: > Hi all, > > I am attempting to build a snapshot server for a ~15TB fileserver with > old fileserver hardware I have on hand. My initial plan was to use the > hardware card in the old fileserver in a RAID50 (the card is old enough > that it doesn't support RAID6 natively) using new 2TB enterprise hard > drives. But, as you probably know, these drives are reasonably > expensive. So, since this machine will not be used by end-users very > much, I was contemplating using linux software raid instead, exporting > desktop-class drives as JBODs and using mdadm to RAID them. > > The obvious advantage to this is cost: I can save almost 40% of my > original estimate by using desktop drives instead, thus fulfilling the > original meaning of the I of the RAID acronym. There are other > advantages, as well, including being able to build a RAID6, which I > slightly prefer over a RAID50, and having more flexibility later on if I > want to move to bigger disks. (Yes, I have seen the documentation > warning against too-large RAID arrays resulting in a failure during a > rebuild.) A tertiary advantage would be that I would learn how to work > with linux software RAID, a skill I haven't yet acquired. > > The disadvantages I can think of are: higher probability of disk > failures, resulting in more work for me in swapping out and RMAing > failed drives; potential degradation in performance, due both to RAID in > software and slower disks; a learning curve for linux RAID; and a > configuration less likely to be supported by the hardware RAID vendor. Hi, Highly dependent on your server and RAID card of course, but you may find MD software raid is quicker. Even an older server has far more CPU horsepower available compared to a mediocre RAID card (and by mediocre, I mean anything costing less than hundreds of pounds).
> My counters to most of the disadvantages would be that performance only > has to be decent, not great, on this box; the learning curve shouldn't > be too bad; and this configuration shouldn't require support from the > hardware RAID vendor anyway. The disk failures would be the only issue > I couldn't counter, except by trying to determine if my labor costs > would end up being more than the savings in moving to cheaper disks. The learning curve is fairly easy with mdadm - furthermore, linux MD is now more functionally complete than all but the better end *modern* hardware RAID systems. Specifically, some things linux will do that a lot of older/cheaper HW RAID won't: 1) Attempt to rewrite a disk block that has failed to read, which triggers a bad block remap on most drives. 2) If you run the monitor daemon, linux will alert you if stuff goes bad, eg failed disk (OK, a crappy HW raid knows this, but can it alert you by email or just sit there with a flashing red LED?) 3) Perform a full sweep and parity verify on demand. There are more, but those are what I consider most useful. > My questions: > > 1) Has anyone done this before, and if so, what were the results? Was > performance acceptable in this configuration? Are there any gotchas to > an otherwise workable configuration? Yep - been running SW raid 5 at home on 1.5TB total for 3 years. I have used a lot of mid range RAID controllers too (Chaparrel, Infotrend, ARECA, Eurologic). > 2) From what I've read so far, using desktop-class disks with linux > software RAID should not be a major problem, unlike using them on a true > hardware RAID card. Is this reasonably accurate? If not, are there > links that describe the difficulties? Yep - desktop drives are fine. Enterprise class or "RAID Edition" may be better quality and/or quicker. Quicker is usually related to RPM and at least is checkable in the specifications. "Well built" is more abstract.
I prefer to use a mixture of makes in the same server (eg Hitachi, Seagate, Fujitsu, WD) - that way, you lessen the risk of the "Maxtor Deathstar" whole-bunch-failing-at-once syndrome. > 3) Suppose that my RAID6 starts out using 12 2TB disks, with three free > drive bays (one would be a hot spare). Later on, I want to seamlessly > replace the 2TB disks with 3TB or larger disks. Can mdadm grow an array > like this if, say, I replace one drive, rebuild, and repeat until I've > replaced all 12 disks with larger ones? Or will the new 3TB disks only > be used up to 2TB, the size of the original disks? RAID5/6 need to be spread over identically sized partitions. So you can't add a 3TB drive to a 2TB disk based array. You can partition the bigger disks and make a new RAID across the spare 1TB partitions. This is where ZFS gets clever, but that's not really an option for linux (BTRFS will probably get there one day). > Thanks for any advice or pointers you can provide! One thing, whichever system you go for: set it up and do some speed and breakage tests to make sure it all works correctly - pull a disk out live, be sure you know how to put the disk back and bring the array back to fault tolerant and stuff like that. It's good fun, enjoy :) Cheers Tim > --keith > -- Tim Watts
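Editorial sketch: the alerting point in Tim's list (item 2) can be illustrated without touching a real array. mdadm's own monitor mode is the proper tool for alerts; the Python below is only a minimal sketch of the underlying idea, reading the `[expected/working]` counter that /proc/mdstat prints for each array. The sample text, the device names, and the function name `degraded_arrays` are all made up for illustration.

```python
import re

# Illustrative /proc/mdstat excerpt. Device names and block counts are
# invented; a real file will differ in detail.
SAMPLE_MDSTAT = """\
Personalities : [raid1] [raid6] [raid5] [raid4]
md0 : active raid6 sdm1[11] sdl1[10] sdk1[9] sdj1[8] sdi1[7] sdh1[6] sdg1[5] sdf1[4] sde1[3] sdd1[2] sdc1[1] sdb1[0](F)
      19535104000 blocks level 6, 64k chunk, algorithm 2 [12/11] [_UUUUUUUUUUU]
md1 : active raid1 sdb2[0] sdc2[1]
      4194240 blocks [2/2] [UU]

unused devices: <none>
"""


def degraded_arrays(mdstat_text):
    """Return {array_name: (expected, working)} for arrays missing members."""
    result = {}
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            # Remember which array the following status line belongs to.
            current = header.group(1)
            continue
        counts = re.search(r"\[(\d+)/(\d+)\]", line)
        if current and counts:
            expected, working = int(counts.group(1)), int(counts.group(2))
            if working < expected:
                result[current] = (expected, working)
            current = None
    return result


print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live system the same function could be fed the contents of /proc/mdstat, though for real paging it is probably better to rely on mdadm's monitor mode than on a home-grown parser.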
| From | David Brown <david@westcontrol.removethisbit.com> |
|---|---|
| Date | 2011-04-06 10:03 +0200 |
| Message-ID | <Xe-dnXd4LLb1gwHQnZ2dnUVZ7sWdnZ2d@lyse.net> |
| In reply to | #608 |
On 06/04/2011 09:01, Tim Watts wrote: > Keith Keller wrote: > >> Hi all, >> >> I am attempting to build a snapshot server for a ~15TB fileserver with >> old fileserver hardware I have on hand. My initial plan was to use the >> hardware card in the old fileserver in a RAID50 (the card is old enough >> that it doesn't support RAID6 natively) using new 2TB enterprise hard >> drives. But, as you probably know, these drives are reasonably >> expensive. So, since this machine will not be used by end-users very >> much, I was contemplating using linux software raid instead, exporting >> desktop-class drives as JBODs and using mdadm to RAID them. >> >> The obvious advantage to this is cost: I can save almost 40% of my >> original estimate by using desktop drives instead, thus fulfilling the >> original meaning of the I of the RAID acronym. There are other >> advantages, as well, including being able to build a RAID6, which I >> slightly prefer over a RAID50, and having more flexibility later on if I >> want to move to bigger disks. (Yes, I have seen the documentation >> warning against too-large RAID arrays resulting in a failure during a >> rebuild.) A tertiary advantage would be that I would learn how to work >> with linux software RAID, a skill I haven't yet acquired. >> >> The disadvantages I can think of are: higher probability of disk >> failures, resulting in more work for me in swapping out and RMAing >> failed drives; potential degradation in performance, due both to RAID in >> software and slower disks; a learning curve for linux RAID; and a >> configuration less likely to be supported by the hardware RAID vendor. > > Hi, > > Highly dependant on your server and RAID card of course, but you may find MD > software raid is quicker. > > Even and older server has far more CPU horsepower available compared to a > mediocre RAID card (and by mediocre, I mean anything costing less than 100's > pounds. 
> I'd go further than that and say that software raid will be faster unless your hardware raid card costs many 1000's of pounds. Unless you are using the sort of raid card that comes with its own backup battery for caching, then mdadm raid is going to be faster with a modern processor. Even with such a card, mdadm raid is probably going to be faster for raid 5 or raid 6, simply because the host has access to more memory for caching stripes. A key bottleneck to consider is IO throughput, rather than CPU power. This is especially true for RAID1 setups - doing the RAID1 on a hardware card halves the IO on the host. However, if the server is old enough, there was a time when commonly used hardware raid cards were faster than doing it in software on the host. In particular, if the host is single core, or Intel's old and crappy shared bus SMP, then a hardware raid card will be faster. Not that this matters too much to the OP, of course! >> My counters to most of the disadvantages would be that performance only >> has to be decent, not great, on this box; the learning curve shouldn't >> be too bad; and this configuration shouldn't require support from the >> hardware RAID vendor anyway. The disk failures would be the only issue >> I couldn't counter, except by trying to determine if my labor costs >> would end up being more than the savings in moving to cheaper disks. > > The learning curve is fairly easy with mdadm - furthermore, linux MD is now > more functionally complete than all but the better end *modern* hardware > RAID systems. Specifically, some things linux will do that a lot of > older/cheaper HW RAID won't: > > 1) Attempt to rewrite a disck block that has failed to read<- triggers a > bad block remap on most drives. > > 2) If you run the monitor daemon, linux will alert you if stuff goes bad, eg > failed disk (OK, a crappy HW raid knos this, but can it alert you by email > or just sit there with a falshing red LED?) 
> > 3) Perform a full sweep and parity verify on demand? > > There are more, but those are what I consider most useful. > One hint about learning mdadm - with mdadm, you can build your arrays from partitions, not just whole disks. So you can give your disks a 4 GB partition at the start and use that when testing and learning - it's a lot easier to learn when your rebuild times are a couple of minutes, rather than most of the day! One thing to practice is identifying drives - when a drive fails, you want to be very sure of which one you should be replacing :-) >> My questions: >> >> 1) Has anyone done this before, and if so, what were the results? Was >> performance acceptable in this configuration? Are there any gotchas to >> an otherwise workable configuration? > > Yep - been running SW raid 5 at home on 1.5TB total for 3 years. I have used > a lot of mid range RAID controllers too (Chaparrel, Infotrend, ARECA, > Eurologic) > I haven't tried any hardware raid cards seriously, but I've used mdadm raid often on servers and desktops. Personally, I like RAID10 with "far" layout - it gives you greater safety than RAID5 or RAID6, and most of the speed of RAID0. It works well with 2 or 3 disks (something that no hardware raid card can do). >> 2) From what I've read so far, using desktop-class disks with linux >> software RAID should not be a major problem, unlike using them on a true >> hardware RAID card. Is this reasonably accurate? If not, are there >> links that describe the difficulties? > > Yep - desktop are fine. Enterprise class or "RAID Edition" may be better > quality and/or quicker. Quicker is usually related to RPM and at least is > checkable in the specifications. "Well built" is more abstract. I prefer to > use a mixture of makes in the same server, eg Hitachi, Seagate, Fujitsu, WD) > - that way, you lessen the risk of the "Maxtor Deathstar" whole buch failing > at once syndrome. 
> I am not convinced that enterprise class disks really offer much more than desktop disks if you have a reasonable environment (not too hot or cold, reliable power, etc.). There will be a difference in the expected lifetimes of the drives - but since disk failures are actually fairly rare, it won't show in the statistics unless you have hundreds of drives or drive them under very heavy load. >> 3) Suppose that my RAID6 starts out using 12 2TB disks, with three free >> drive bays (one would be a hot spare). Later on, I want to seamlessly >> replace the 2TB disks with 3TB or larger disks. Can mdadm grow an array >> like this if, say, I replace one drive, rebuild, and repeat until I've >> replaced all 12 disks with larger ones? Or will the new 3TB disks only >> be used up to 2TB, the size of the original disks? > > RAID5/6 need to be spread over identically sized partitions. So you can't > add a 3TB drive to a 2TB disk based array. You can partition and make a new > RAID across the 1TB partition. This is where ZFS gets clever, but that's not > really an option for linux (BTRFS will probably get there one day). > You can increase the size of the RAID5/6 devices (whole disks, or partitions) if you re-size them all. So if you replace one 2 TB drive with a 3 TB drive and let it rebuild, you can't use more than the first 2 TB. But if you continue the process and replace all of the drives, you can then "grow" the array to use the new space. Another option for growth is to use mdadm over partitions, rather than whole disks. Then when you add bigger disks, you have spare space that you can make into new partitions, and make another mdadm raid using them. If you are using LVM to organise your real partitions (which I highly recommend), then you can add your new raid as a new physical volume and extend your working space. One other thing to think about if you are planning to replace disks is that you are reducing your redundancy while it is happening.
For example, if you have a RAID5 array and you pull one drive to replace it with a bigger drive, then you have no redundancy during that operation. With RAID6 you have one drive redundancy rather than two. And like all rebuilds, the rebuild for the drive replacement is particularly stressful for the rest of the disks in the array - and you are going to do the whole operation 12 times in a row. But the beauty of md raid is its flexibility. Rather than use twelve disks in a RAID6, build twelve RAID1 pairs from a real drive and a missing drive. Then build your RAID6 on top of these "pairs". The result is the same in terms of speed, capacity and redundancy. But when you want to replace a drive with a bigger disk, you do it by adding the new drive to one of the pairs and letting the pair "rebuild". Then you remove the old disk from the pair. You keep the same redundancy over the whole array throughout the operation, and the rebuild is done as a mirror copy from one disk - the other drives are unaffected. You can happily do the replacement with multiple disks in parallel - as many as you have spare drive bays. (Future plans for md include "hot replace" functionality that will effectively automate this, but that's for the future.) >> Thanks for any advice or pointers you can provide! > > One thing, whichever system you go for: set it up and do some speed and > breakage tests to make sure it all works correctly - pull a disk out live, > be sure you know how to put the disk back and bring the array back to fault > tolerant and stuff like that. > > It's good fun, enjoy :) > > Cheers > > Tim > >> --keith >> >
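Editorial sketch: David's point about growing by replacement comes down to the array being limited by its smallest member. The Python below is a rough model of that, not a description of md internals; it assumes a RAID6 loses two members' worth of space to parity and ignores metadata overhead. The function name `raid6_usable_tb` is illustrative.

```python
def raid6_usable_tb(member_sizes_tb):
    """Usable RAID6 capacity: (n - 2) times the smallest member."""
    n = len(member_sizes_tb)
    if n < 4:
        raise ValueError("RAID6 needs at least four members")
    return (n - 2) * min(member_sizes_tb)


# Start with twelve 2 TB drives and swap them for 3 TB drives one by one.
members = [2] * 12
for i in range(12):
    members[i] = 3
    print(f"{i + 1:2d} drives replaced -> up to {raid6_usable_tb(members)} TB")
```

The extra space only becomes available once the last 2 TB member is gone, and even then the array still has to be grown explicitly before the space can be used.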
| From | Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> |
|---|---|
| Date | 2011-04-06 14:00 -0700 |
| Message-ID | <kr0v68xq47.ln2@goaway.wombat.san-francisco.ca.us> |
| In reply to | #610 |
Hello Tim, David, thanks so much for your comments. I do want to make specific comments, but in general, it seems like the take-home message is that I'm not completely stupid or insane for thinking about attempting this. That's what I suspected, but I do feel a little better having it confirmed. On 2011-04-06, David Brown <david@westcontrol.removethisbit.com> wrote: > On 06/04/2011 09:01, Tim Watts wrote: >> >> Highly dependant on your server and RAID card of course, but you may find MD >> software raid is quicker. Yes, and I probably should have mentioned the card: it's a 3ware 9550SX, with no BBU, on a 64bit dual-core machine. So, based on yours and David's comments, I probably shouldn't expect significantly worse performance, and may even be better. That's really all I desire given the intended purpose. >> The learning curve is fairly easy with mdadm - furthermore, linux MD is now >> more functionally complete than all but the better end *modern* hardware >> RAID systems. Specifically, some things linux will do that a lot of >> older/cheaper HW RAID won't: >> >> 1) Attempt to rewrite a disck block that has failed to read<- triggers a >> bad block remap on most drives. >> >> 2) If you run the monitor daemon, linux will alert you if stuff goes bad, eg >> failed disk (OK, a crappy HW raid knos this, but can it alert you by email >> or just sit there with a falshing red LED?) >> >> 3) Perform a full sweep and parity verify on demand? I believe the 9550 will do #2, and it definitely does #3, with email alerts (which I direct to my cell phone via my SMS gateway). I did have to work with RAID controllers which would simply blink, which was incredibly frustrating. > One hint about learning mdadm - with mdadm, you can build your arrays > from partitions, not just whole disks. 
So you can give your disks a 4 > GB partition at the start and use that when testing and learning - it's > a lot easier to learn when your rebuild times are a couple of minutes, > rather than most of the day! Great suggestion! > One thing to practice is identifying drives - when a drive fails, you > want to be very sure of which one you should be replacing :-) Oh boy, I learned that The Hard Way (TM) many years ago, when I accidentally pulled the wrong drive bay on a server with a failed disk. Now I number the drive bays, verify twice that I have the right bay, generate disk activity (or use the "identify drive" feature to blink the light) to be sure I'm pulling an inactive drive, do that again, go back and verify the right bay again, then pull the drive with fingers and toes crossed. (Fortunately, my mistake with the wrong drive wasn't catastrophic, but it definitely made extra work for me.) > Personally, I like RAID10 with "far" layout - it gives you greater > safety than RAID5 or RAID6, and most of the speed of RAID0. Is that the "far replicas" described in the man page for md(4)? My concern about RAID10 is that I'll lose too much capacity to redundancy. Because this is a snapshot server, I really need to maximize available storage space; if I have 12 drive bays, with 2TB drives I'd get only 12TB of usable space from a RAID10; even with 3TB drives that's only 18TB (if my math is right). Whereas, a RAID6 with 12 2TB drives gets me 20TB usable. (If this were my primary fileserver I'd be more likely to consider a RAID10.) > Another option for growth is to use mdadm over partitions, rather than > whole disks. Then when you add bigger disks, you have spare space that > you can make into new partitions, and make another mdadm raid using > them. If you are using LVM to organise your real partitions (which I > highly recommend), then you can add your new raid as a new physical > partition and extend your working space. Yes, I use LVM. 
Using partitions sounds like a great idea, and is definitely something that I can't get out of a hardware RAID controller (another reason I'm leaning this way). > But the beauty of md raid is its flexibility. Rather than use twelve > disks in a RAID6, build twelve RAID1 pairs from a real drive and a > missing drive. Then build your RAID6 on top of these "pairs". The > result is the same in terms of speed, capacity and redundancy. But when > you want to replace a drive with a bigger disk, you do it by adding the > new drive to one of the pairs and letting the pair "rebuild". Then you > remove the old disk from the pair. You keep the same redundancy over > the whole array throughout the operation, and the rebuild is done as a > mirror copy from one disk - the other drives are unaffected. You can > happily do the replacement with multiple disks in parallel - as many as > you have spare drive bays. Another fantastic idea! (Though I'm guessing the RAID1s will somehow show up as ''failed''; I would need to work around that for paging purposes.) Again, thanks for the thoughtful responses! --keith -- kkeller-usenet@wombat.san-francisco.ca.us (try just my userid to email me) AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt see X- headers for PGP signature information
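Editorial sketch: Keith's capacity arithmetic can be checked with a few lines of Python. This is a back-of-envelope calculation only; it assumes two copies of every block for RAID10 and two drives' worth of parity for RAID6, and it ignores the hot spare as well as metadata and filesystem overhead. The function names are illustrative.

```python
def raid10_usable_tb(drives, size_tb, copies=2):
    """RAID10 keeps `copies` copies of every block."""
    return drives * size_tb / copies


def raid6_usable_tb(drives, size_tb):
    """RAID6 gives up two drives' worth of space to parity."""
    return (drives - 2) * size_tb


print("RAID10, 12 x 2 TB:", raid10_usable_tb(12, 2), "TB")
print("RAID10, 12 x 3 TB:", raid10_usable_tb(12, 3), "TB")
print("RAID6,  12 x 2 TB:", raid6_usable_tb(12, 2), "TB")
```

Under these assumptions the figures in the post hold: 12 TB and 18 TB for RAID10, against 20 TB for RAID6 on the same twelve 2 TB drives.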
| From | David Brown <david.brown@removethis.hesbynett.no> |
|---|---|
| Date | 2011-04-06 23:42 +0200 |
| Message-ID | <ididnbdDi9KnQwHQnZ2dnUVZ8imdnZ2d@lyse.net> |
| In reply to | #612 |
On 06/04/11 23:00, Keith Keller wrote:
> Hello Tim, David, thanks so much for your comments.
>
> I do want to make specific comments, but in general, it seems like the
> take-home message is that I'm not completely stupid or insane for
> thinking about attempting this. That's what I suspected, but I do feel
> a little better having it confirmed.
>
> On 2011-04-06, David Brown<david@westcontrol.removethisbit.com> wrote:
>> On 06/04/2011 09:01, Tim Watts wrote:
>>>
>>> Highly dependent on your server and RAID card of course, but you may find MD
>>> software raid is quicker.
>
> Yes, and I probably should have mentioned the card: it's a 3ware 9550SX,
> with no BBU, on a 64bit dual-core machine. So, based on yours and
> David's comments, I probably shouldn't expect significantly worse
> performance, and may even be better. That's really all I desire given
> the intended purpose.
>
>>> The learning curve is fairly easy with mdadm - furthermore, linux MD is now
>>> more functionally complete than all but the better end *modern* hardware
>>> RAID systems. Specifically, some things linux will do that a lot of
>>> older/cheaper HW RAID won't:
>>>
>>> 1) Attempt to rewrite a disk block that has failed to read <- triggers a
>>> bad block remap on most drives.
>>>
>>> 2) If you run the monitor daemon, linux will alert you if stuff goes bad, eg
>>> failed disk (OK, a crappy HW raid knows this, but can it alert you by email
>>> or just sit there with a flashing red LED?)
>>>
>>> 3) Perform a full sweep and parity verify on demand?
>
> I believe the 9550 will do #2, and it definitely does #3, with email
> alerts (which I direct to my cell phone via my SMS gateway). I did have
> to work with RAID controllers which would simply blink, which was
> incredibly frustrating.
>
>> One hint about learning mdadm - with mdadm, you can build your arrays
>> from partitions, not just whole disks. So you can give your disks a 4
>> GB partition at the start and use that when testing and learning - it's
>> a lot easier to learn when your rebuild times are a couple of minutes,
>> rather than most of the day!
>
> Great suggestion!
>
>> One thing to practice is identifying drives - when a drive fails, you
>> want to be very sure of which one you should be replacing :-)
>
> Oh boy, I learned that The Hard Way (TM) many years ago, when I
> accidentally pulled the wrong drive bay on a server with a failed disk.

I see this as the number one reason for preferring RAID6 to RAID5. One
should never underestimate the risks of human error :-)

> Now I number the drive bays, verify twice that I have the right bay,
> generate disk activity (or use the "identify drive" feature to blink the
> light) to be sure I'm pulling an inactive drive, do that again, go back
> and verify the right bay again, then pull the drive with fingers and
> toes crossed. (Fortunately, my mistake with the wrong drive wasn't
> catastrophic, but it definitely made extra work for me.)
>
>> Personally, I like RAID10 with "far" layout - it gives you greater
>> safety than RAID5 or RAID6, and most of the speed of RAID0.
>
> Is that the "far replicas" described in the man page for md(4)?
>

No, it is a special layout choice for RAID10. Wikipedia has a reasonable
explanation:

<http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10>

If you are using RAID10, then it is a good choice for many workloads (it
is significantly faster for reads than traditional RAID10, but marginally
slower for writes).

> My concern about RAID10 is that I'll lose too much capacity to
> redundancy. Because this is a snapshot server, I really need to
> maximize available storage space; if I have 12 drive bays, with 2TB
> drives I'd get only 12TB of usable space from a RAID10; even with 3TB
> drives that's only 18TB (if my math is right). Whereas, a RAID6 with
> 12 2TB drives gets me 20TB usable. (If this were my primary fileserver
> I'd be more likely to consider a RAID10.)

Fair enough. You choose your balance between size, cost, speed,
redundancy, rebuild times, etc.

>
>> Another option for growth is to use mdadm over partitions, rather than
>> whole disks. Then when you add bigger disks, you have spare space that
>> you can make into new partitions, and make another mdadm raid using
>> them. If you are using LVM to organise your real partitions (which I
>> highly recommend), then you can add your new raid as a new physical
>> partition and extend your working space.
>
> Yes, I use LVM. Using partitions sounds like a great idea, and is
> definitely something that I can't get out of a hardware RAID controller
> (another reason I'm leaning this way).
>

I have only set up real systems with smaller numbers of drives - the last
one I did had three drives in a RAID10 layout. But grub won't boot from
an mdadm RAID10 set - it is pretty non-standard. So I put a small
partition at the start of each disk and made a three-way RAID1 using
those partitions (being small, the poor space efficiency doesn't matter).
I put /boot on that RAID1 and grub on the MBR of each disk. Then the rest
of each disk was a single large partition, with those all tied together
as RAID10.

>> But the beauty of md raid is its flexibility. Rather than use twelve
>> disks in a RAID6, build twelve RAID1 pairs from a real drive and a
>> missing drive. Then build your RAID6 on top of these "pairs". The
>> result is the same in terms of speed, capacity and redundancy. But when
>> you want to replace a drive with a bigger disk, you do it by adding the
>> new drive to one of the pairs and letting the pair "rebuild". Then you
>> remove the old disk from the pair. You keep the same redundancy over
>> the whole array throughout the operation, and the rebuild is done as a
>> mirror copy from one disk - the other drives are unaffected. You can
>> happily do the replacement with multiple disks in parallel - as many as
>> you have spare drive bays.
>
> Another fantastic idea! (Though I'm guessing the RAID1s will somehow
> show up as ''failed''; I would need to work around that for paging
> purposes.)
>

Yes, you will need to take these "failures" into account in your warning
system. It will also be an issue for hot spares - you will not want to
make a spare drive into a general hot spare for the RAID1's, or it will
quickly be grabbed by one of them. I think you would have to go back to
the old-fashioned way of using mdadm monitor to trigger a script when one
of the mirrors fails completely, and then "manually" add in the disk to
the correct mirror.

After a quick check of the mdadm man page, it seems you can make your
RAID1 sets consist of only one drive. Then your one-way "mirrors" are not
failed. When you want to migrate to a larger disk, you can simply "grow"
the "mirror" to being two disks, including the new one. Once you are
ready to remove the old one, you fail it, remove it, then "grow" the
"mirror" back to one disk. I suspect you would still need some fiddling
with mdadm-triggered scripts to get your hot spares working, as an
automatic hot spare will not work when a "mirror" set dies completely.

> Again, thanks for the thoughtful responses!
>
> --keith
>
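The one-drive "mirror" migration described above can be sketched as a command sequence. This is only a dry-run planner under stated assumptions: the device names (/dev/md10, /dev/sda2, /dev/sdm2) and the function names are hypothetical, and the script prints mdadm command lines without running anything. The flags used (--create, --add, --grow, --raid-devices, --force, --wait, --fail, --remove) are standard mdadm options, but check the sequence against your own mdadm man page and try it on small test partitions first.

```python
# Dry-run planner: it only builds and prints mdadm command lines and never
# executes them. All device names below are hypothetical placeholders.


def create_single_mirror(md_dev, member):
    """Command to create a RAID1 array that has a single member.

    mdadm treats a one-device array as a probable mistake, so --force is
    given before --raid-devices=1.
    """
    return ["mdadm", "--create", md_dev, "--level=1",
            "--force", "--raid-devices=1", member]


def migrate_mirror(md_dev, old_member, new_member):
    """Ordered commands to move a one-device mirror onto a new partition."""
    return [
        # The array is complete with one device, so the newcomer joins as a spare.
        ["mdadm", md_dev, "--add", new_member],
        # Growing to two devices pulls the spare in and starts the mirror copy.
        ["mdadm", "--grow", md_dev, "--raid-devices=2"],
        # Block until the resync has finished before touching the old member.
        ["mdadm", "--wait", md_dev],
        # Fail and remove the old member, leaving a degraded two-device mirror.
        ["mdadm", md_dev, "--fail", old_member, "--remove", old_member],
        # Shrink back to a healthy one-device "mirror".
        ["mdadm", "--grow", md_dev, "--force", "--raid-devices=1"],
    ]


if __name__ == "__main__":
    print(" ".join(create_single_mirror("/dev/md10", "/dev/sda2")))
    for cmd in migrate_mirror("/dev/md10", "/dev/sda2", "/dev/sdm2"):
        print(" ".join(cmd))
```

Keeping the plan as data rather than running it makes it easy to review the order of steps, which matters here because failing the old member before the resync completes would lose the only good copy.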
| From | Grant <omg@grrr.id.au> |
|---|---|
| Date | 2011-04-08 10:45 +1000 |
| Message-ID | <tllsp6ltftsq6ufp048hcc4ivufupgbmki@4ax.com> |
| In reply to | #612 |
On Wed, 6 Apr 2011 14:00:04 -0700, Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> wrote:
>Hello Tim, David, thanks so much for your comments.
>
>I do want to make specific comments, but in general, it seems like the
>take-home message is that I'm not completely stupid or insane for
>thinking about attempting this. That's what I suspected, but I do feel
>a little better having it confirmed.
>
>On 2011-04-06, David Brown <david@westcontrol.removethisbit.com> wrote:
>> On 06/04/2011 09:01, Tim Watts wrote:
>>>
>>> Highly dependent on your server and RAID card of course, but you may find MD
>>> software raid is quicker.
>
>Yes, and I probably should have mentioned the card: it's a 3ware 9550SX,
>with no BBU, on a 64bit dual-core machine. So, based on yours and
>David's comments, I probably shouldn't expect significantly worse
>performance, and may even be better. That's really all I desire given
>the intended purpose.
>
>>> The learning curve is fairly easy with mdadm - furthermore, linux MD is now
>>> more functionally complete than all but the better end *modern* hardware
>>> RAID systems. Specifically, some things linux will do that a lot of
>>> older/cheaper HW RAID won't:
>>>
>>> 1) Attempt to rewrite a disk block that has failed to read <- triggers a
>>> bad block remap on most drives.
>>>
>>> 2) If you run the monitor daemon, linux will alert you if stuff goes bad, eg
>>> failed disk (OK, a crappy HW raid knows this, but can it alert you by email
>>> or just sit there with a flashing red LED?)
>>>
>>> 3) Perform a full sweep and parity verify on demand?
>
>I believe the 9550 will do #2, and it definitely does #3, with email
>alerts (which I direct to my cell phone via my SMS gateway). I did have
>to work with RAID controllers which would simply blink, which was
>incredibly frustrating.
>
>> One hint about learning mdadm - with mdadm, you can build your arrays
>> from partitions, not just whole disks. So you can give your disks a 4
>> GB partition at the start and use that when testing and learning - it's
>> a lot easier to learn when your rebuild times are a couple of minutes,
>> rather than most of the day!
>
>Great suggestion!

RAID on partitions is a great idea. I'm using it here with 6 x 1TB drives
for the RAID, and a 2TB drive for backup and bounce buffer. At the moment
I'm growing from 5 to 6 x 1TB drives, with the aid of a borrowed 1.5TB
drive to keep it separate from my other stuff.

So I use the fast end for OS, a 4GB partition for RAID10 swap, then
2 partitions in the bulk of the space for data in two separate RAID6
arrays.

I've still to find the best settings. I'm running with a quad core CPU on
an Intel chipset (ICH9R) mobo for the 6 raid drives, and a dual SATA
controller card for backup and external (casual) SATA drives.

One thing I'm not seeing discussed enough is the need for adjusting
NCQ on the SATA drives. I'm using Seagate drives that have up to 31
queue slots, and switched them to use 1. But I've not yet scripted a
benchmark to find out if there's a better queue depth to use. The
theory is that the mdadm RAID software is fighting command queuing. I
have no idea what the impact is, but short tests indicate no queue is
better. I'd like more info or confirmation.
>
>> One thing to practice is identifying drives - when a drive fails, you
>> want to be very sure of which one you should be replacing :-)

Mark the cables and put the drives in order! Also spin down the
drive you want to pull if it's out where you can feel if it's
spinning?
>
>Oh boy, I learned that The Hard Way (TM) many years ago, when I
>accidentally pulled the wrong drive bay on a server with a failed disk.
>Now I number the drive bays, verify twice that I have the right bay,
>generate disk activity (or use the "identify drive" feature to blink the
>light) to be sure I'm pulling an inactive drive, do that again, go back
>and verify the right bay again, then pull the drive with fingers and
>toes crossed. (Fortunately, my mistake with the wrong drive wasn't
>catastrophic, but it definitely made extra work for me.)
>
>> Personally, I like RAID10 with "far" layout - it gives you greater
>> safety than RAID5 or RAID6, and most of the speed of RAID0.

I did that for the swap RAID10, but I'm unsure what to change it to from
RAID10 with spare, now that I have 6 drives in there. Not that I
plan to use a lot of swap, but it is the overload area for /tmp as
well (/tmp mounted in memory, expands to swap after it uses half of
memory, something like that; I soon forget the details when there are
no problems).
>
>Is that the "far replicas" described in the man page for md(4)?
>
>My concern about RAID10 is that I'll lose too much capacity to
>redundancy. Because this is a snapshot server, I really need to
>maximize available storage space; if I have 12 drive bays, with 2TB
>drives I'd get only 12TB of usable space from a RAID10; even with 3TB
>drives that's only 18TB (if my math is right). Whereas, a RAID6 with
>12 2TB drives gets me 20TB usable. (If this were my primary fileserver
>I'd be more likely to consider a RAID10.)

RAID6 for data, if you're on a budget :) RAID6 is slower than
RAID5, but that extra data protection is worth it, I think. You need
to cost loss of data vs speed and other factors relevant for your own
scenario.

To rebuild a RAID5 with a RAID5 after total data loss is madness,
yet I know a guy doing business systems who did that, 'cos the RAID
controller didn't do RAID6 (was on a windoze box). Madness?
>
>> Another option for growth is to use mdadm over partitions, rather than
>> whole disks. Then when you add bigger disks, you have spare space that
>> you can make into new partitions, and make another mdadm raid using
>> them. If you are using LVM to organise your real partitions (which I
>> highly recommend), then you can add your new raid as a new physical
>> partition and extend your working space.
>
>Yes, I use LVM. Using partitions sounds like a great idea, and is
>definitely something that I can't get out of a hardware RAID controller
>(another reason I'm leaning this way).

I tried telling mdadm to grow on partition size increase and it
refused :(

Probably me not up there on the learning curve, but I was
disappointed.

Since mdadm is under active development, I expect it to improve over
time.
>
>> But the beauty of md raid is its flexibility. Rather than use twelve
>> disks in a RAID6, build twelve RAID1 pairs from a real drive and a
>> missing drive. Then build your RAID6 on top of these "pairs". The
>> result is the same in terms of speed, capacity and redundancy. But when
>> you want to replace a drive with a bigger disk, you do it by adding the
>> new drive to one of the pairs and letting the pair "rebuild". Then you
>> remove the old disk from the pair. You keep the same redundancy over
>> the whole array throughout the operation, and the rebuild is done as a
>> mirror copy from one disk - the other drives are unaffected. You can
>> happily do the replacement with multiple disks in parallel - as many as
>> you have spare drive bays.
>
>Another fantastic idea! (Though I'm guessing the RAID1s will somehow
>show up as ''failed''; I would need to work around that for paging
>purposes.)

Swap space? RAID10 is best for that, from my reading. Got to be
careful with swap reliability because bad swap will crash the machine
and possibly eat your data. Same as bad memory.

Grant.
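For anyone wanting to script the NCQ experiment mentioned in the post above, here is a minimal sketch that reads and sets a drive's queue depth through sysfs. It assumes the usual Linux attribute /sys/block/DISK/device/queue_depth; the function names and the sysfs_root parameter are made up for illustration, and the demo runs against a throwaway directory rather than the real /sys. On a real system the write needs root, and the kernel may reject values the drive does not support.

```python
import tempfile
from pathlib import Path


def queue_depth_file(disk, sysfs_root="/sys"):
    """Path of the per-device queue depth attribute, e.g. for disk 'sda'."""
    return Path(sysfs_root) / "block" / disk / "device" / "queue_depth"


def get_queue_depth(disk, sysfs_root="/sys"):
    """Read the current queue depth as an integer."""
    return int(queue_depth_file(disk, sysfs_root).read_text().strip())


def set_queue_depth(disk, depth, sysfs_root="/sys"):
    """Ask the kernel for a new queue depth; a depth of 1 means no queuing.

    Only the lower bound is checked here. The upper limit depends on the
    drive and controller, so that check is left to the kernel.
    """
    if depth < 1:
        raise ValueError("queue depth must be at least 1")
    queue_depth_file(disk, sysfs_root).write_text(f"{depth}\n")


if __name__ == "__main__":
    # Demonstrate against a temporary directory tree, not the real /sys.
    with tempfile.TemporaryDirectory() as root:
        attr = queue_depth_file("sda", root)
        attr.parent.mkdir(parents=True)
        attr.write_text("31\n")
        set_queue_depth("sda", 1, root)
        print(get_queue_depth("sda", root))
```

A benchmark loop would call set_queue_depth for each candidate depth on every array member and then run the same workload each time; which workload to use is left open here.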
| From | David Brown <david@westcontrol.removethisbit.com> |
|---|---|
| Date | 2011-04-08 11:12 +0200 |
| Message-ID | <FqSdnWp-6szqTAPQnZ2dnUVZ8hednZ2d@lyse.net> |
| In reply to | #625 |
On 08/04/2011 02:45, Grant wrote:
> On Wed, 6 Apr 2011 14:00:04 -0700, Keith
> Keller<kkeller-usenet@wombat.san-francisco.ca.us> wrote:
>
>> Hello Tim, David, thanks so much for your comments.
<snip>
>>> One hint about learning mdadm - with mdadm, you can build your
>>> arrays from partitions, not just whole disks. So you can give
>>> your disks a 4 GB partition at the start and use that when
>>> testing and learning - it's a lot easier to learn when your
>>> rebuild times are a couple of minutes, rather than most of the
>>> day!
>>
>> Great suggestion!
>
> RAID on partitions is a great idea, I'm using it here with 6 x 1TB
> drives for the RAID, and a 2TB drive for backup, bounce buffer. At
> the moment growing from 5 to 6 x 1TB drives with the aid of a
> borrowed 1.5TB drive to keep it separate from my other stuff.
>
> So I use the fast end for OS, a 4GB partition for RAID10 swap, then
> 2 partitions in the bulk of the space for data in two separate RAID6
> arrays.
>

The flexibility is a big advantage of mdraid. Sometimes you want to
emphasise redundancy, sometimes speed, sometimes space efficiency - you
can do it all on the same disks using md raid over partitions.

Another thing you can do with software raid is use external USB (or
eSATA, if possible) drives in your raids. While you won't want to do
that for normal use, it can be a great way to add in a bit of extra
redundancy before doing operations such as moving over to larger drives.
Try doing that with hardware raid cards!

> I'm still to find the best settings, running with a quad core CPU on
> an Intel chipset (ICH9R) mobo for the 6 raid drives, a dual SATA
> controller card for backup and external (casual) SATA drives.
>
> One thing I'm not seeing discussed enough is the need for adjusting
> NCQ on the SATA drives. I'm using Seagate drives that have up to 31
> queue slots, and switched them to use 1. But I've not yet scripted a
> benchmark to find out if there's a better queue depth to use. The
> theory is that the mdadm RAID software is fighting command queuing, I
> have no idea what the impact is, but short tests indicate no queue is
> better. I'd like more info, confirmation.
>

I hadn't thought about that at all. I'm planning on setting up a couple
of new servers in the near future - maybe I'll get a chance to try that out.

>>
>>> One thing to practice is identifying drives - when a drive fails,
>>> you want to be very sure of which one you should be replacing
>>> :-)
>
> Mark the cables and put the drives in order! Also spin down the
> drive you want to pull if it's out where you can feel if it's
> spinning?

Marking the cables, as well as the drives, is a great idea. It is
obvious when you say it, of course, but worth saying out loud.

Spinning a drive down is a nice idea to identify them (especially if you
forgot to label the drives and cables...) - I will try that to see how
easy it is to feel the difference.

>>
>> Oh boy, I learned that The Hard Way (TM) many years ago, when I
>> accidentally pulled the wrong drive bay on a server with a failed
>> disk. Now I number the drive bays, verify twice that I have the
>> right bay, generate disk activity (or use the "identify drive"
>> feature to blink the light) to be sure I'm pulling an inactive
>> drive, do that again, go back and verify the right bay again, then
>> pull the drive with fingers and toes crossed. (Fortunately, my
>> mistake with the wrong drive wasn't catastrophic, but it definitely
>> made extra work for me.)
>>
>>> Personally, I like RAID10 with "far" layout - it gives you
>>> greater safety than RAID5 or RAID6, and most of the speed of
>>> RAID0.
>
> I did that for the swap RAID10, unsure how to change from RAID10
> with spare to what? now that I have 6 drives in there. Not that I
> plan to use a lot of swap, but it is the overload area for /tmp as
> well (/tmp mounted in memory, expands to swap after it uses half of
> memory, something like that, I soon forget the details when there's
> no problems).

I too like my /tmp (and /var/tmp, and sometimes other ad-hoc temporary
directories) on tmpfs, and so often have a large swap even when I have a
lot of ram. I haven't bothered using raid on the swap drives - mirroring
swap is a bit overkill on a desktop, though it's a good idea on a server.
You don't need to explicitly use raid0 for swap - the kernel does that
automatically if you have multiple swap drives/partitions with the same
priority.

I am not sure whether RAID10,far is the best choice for swap, as
compared to RAID10,near. RAID10,far is excellent for a read-mostly
array, but writes involve more head movement than in RAID10,near - and
swap involves writes as much as reads. Perhaps RAID10,offset is in fact
the best choice.

One disadvantage of RAID10 is that you can't change it after it is made
- you can't reshape it, grow it, or change the layout. But for swap
that shouldn't be a problem - just turn your swap off, break down the
existing array, and create a new one including the extra drives. Since
you have no data on the raid (assuming you are not using swap at the
time), you've nothing to lose.

>> Is that the "far replicas" described in the man page for md(4)?
>>
>> My concern about RAID10 is that I'll lose too much capacity to
>> redundancy. Because this is a snapshot server, I really need to
>> maximize available storage space; if I have 12 drive bays, with
>> 2TB drives I'd get only 12TB of usable space from a RAID10; even
>> with 3TB drives that's only 18TB (if my math is right). Whereas, a
>> RAID6 with 12 2TB drives gets me 20TB usable. (If this were my
>> primary fileserver I'd be more likely to consider a RAID10.)
>
> RAID6 for data, if you're on a budget :) RAID6 is slower than
> RAID5, but that extra data protection is worth it, I think. You need
> to cost loss of data vs speed and other factors relevant for your own
> scenario.
>

I doubt if RAID6 is noticeably slower than RAID5 for most operations.
Modern CPUs handle the calculations easily. The only slow point is that
partial stripe writes will be a little slower (if they miss the stripe
cache), since you need to read in and write out at least three blocks.
But these blocks are all on different disks, so they operate in parallel.

I think the days of RAID5 are numbered, except in cases where you have
additional protection (such as RAID1+5). Certainly RAID5 + hot spare is
a meaningless choice - RAID6 would definitely be better.

> To rebuild a RAID5 with a RAID5 after total data loss is madness,
> yet I know a guy doing business systems did that, 'cos the RAID
> controller didn't do RAID6 (was on a windoze box). Madness?

Many low-end hardware cards don't support RAID6.

>>
>>> Another option for growth is to use mdadm over partitions, rather
>>> than whole disks. Then when you add bigger disks, you have spare
>>> space that you can make into new partitions, and make another
>>> mdadm raid using them. If you are using LVM to organise your
>>> real partitions (which I highly recommend), then you can add your
>>> new raid as a new physical partition and extend your working
>>> space.
>>
>> Yes, I use LVM. Using partitions sounds like a great idea, and is
>> definitely something that I can't get out of a hardware RAID
>> controller (another reason I'm leaning this way).
>
> I tried telling mdadm to grow on partition size increase and it
> refused :(
>
> Probably me not up there on the learning curve, but I was
> disappointed.
>

It depends on the type of array you have - some can be grown, others
cannot. RAID 1, 5 and 6 can be grown when you have increased the
partition size of all components. But RAID 0 and 10 cannot (currently)
be grown. Resizing RAID 10 would be complicated because of its layout,
though I'm sure one day it will be supported. Resizing RAID 0 sounds
easy, but I gather that md RAID 0 is actually very general (it will work
with different sized disks, for example), which complicates resizing.

> Since mdadm is under active development, I expect it to improve over
> time.

Some of the plans discussed on the linux-raid@vger.kernel.org mailing
list are /very/ exciting.

>>
>>> But the beauty of md raid is its flexibility. Rather than use
>>> twelve disks in a RAID6, build twelve RAID1 pairs from a real
>>> drive and a missing drive. Then build your RAID6 on top of these
>>> "pairs". The result is the same in terms of speed, capacity and
>>> redundancy. But when you want to replace a drive with a bigger
>>> disk, you do it by adding the new drive to one of the pairs and
>>> letting the pair "rebuild". Then you remove the old disk from
>>> the pair. You keep the same redundancy over the whole array
>>> throughout the operation, and the rebuild is done as a mirror
>>> copy from one disk - the other drives are unaffected. You can
>>> happily do the replacement with multiple disks in parallel - as
>>> many as you have spare drive bays.
>>
>> Another fantastic idea! (Though I'm guessing the RAID1s will
>> somehow show up as ''failed''; I would need to work around that for
>> paging purposes.)
>
> Swap space? RAID10 is best for that, from my reading. Got to be
> careful with swap reliability because bad swap will crash the machine
> and possibly eat your data. Same as bad memory.
>
> Grant.
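The "at least three blocks" figure for a RAID6 partial-stripe write can be reproduced with a small counting model. This sketch assumes a generic read-modify-write update (read the old data and old parity, write the new data and new parity); whether md actually takes that path, or instead recomputes parity from the rest of the stripe, depends on the kernel and on what is already in the stripe cache. The function name is made up for illustration.

```python
def read_modify_write_ios(parity_blocks, data_blocks_updated=1):
    """Return (reads, writes) for a read-modify-write partial-stripe update.

    The old copies of the updated data blocks and of every parity block are
    read, then the new data and new parity are written back. parity_blocks
    is 1 for RAID5 and 2 for RAID6.
    """
    if parity_blocks < 1 or data_blocks_updated < 1:
        raise ValueError("need at least one parity block and one data block")
    touched = data_blocks_updated + parity_blocks
    return touched, touched


if __name__ == "__main__":
    for name, parity in (("RAID5", 1), ("RAID6", 2)):
        reads, writes = read_modify_write_ios(parity)
        print(f"{name}: {reads} reads, {writes} writes for a one-block update")
```

Under this model a one-block update costs two reads and two writes on RAID5 and three of each on RAID6, and since each block sits on a different disk the extra I/O runs in parallel.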
| From | Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> |
|---|---|
| Date | 2011-04-08 08:22 -0700 |
| Message-ID | <4ql378xm3b.ln2@goaway.wombat.san-francisco.ca.us> |
| In reply to | #626 |
On 2011-04-08, David Brown <david@westcontrol.removethisbit.com> wrote:
> On 08/04/2011 02:45, Grant wrote:
>>
>> Mark the cables and put the drives in order! Also spin down the
>> drive you want to pull if it's out where you can feel if it's
>> spinning?
>
> Marking the cables, as well as the drives, is a great idea. It is
> obvious when you say it, of course, but worth saying out loud.
>
> Spinning a drive down is a nice idea to identify them (especially if you
> forgot to label the drives and cables...) - I will try that to see how
> easy it is to feel the difference.

It sounds like these suggestions all assume a desktop-like case. Any
decent rackmount case with hot-swap drive bays should have some way to
label the drive bays, if the trays aren't already labeled.

> One disadvantage of RAID10 is that you can't change it after it is made
> - you can't reshape it, grow it, or change the layout. But for swap
> that shouldn't be a problem - just turn your swap off, break down the
> existing array, and create a new one including the extra drives. Since
> you have no data on the raid (assuming you are not using swap at the
> time), you've nothing to lose.

You could always create a swap file on some other disks (even your data
disks, if you really need to do this), swapon the new file, then swapoff
the RAID10 swap space. This might not be a lot of fun if you've got a
lot of swap in use, but that's an indicator of other problems. :)

> I think the days of RAID5 are numbered, except in cases where you have
> additional protection (such as RAID1+5). Certainly RAID5 + hot spare is
> a meaningless choice - RAID6 would definitely be better.

I think RAID5 isn't dead yet, but it's a smaller niche. Perhaps you
have redundant public-facing nodes with four drive bays. Maybe you want
the extra storage space, so you don't want RAID6, but you want some
protection against failure, so you don't want RAID0.

But yes, in general I wouldn't want to go RAID5 with more than four or
so disks, and RAID5 + hot spare is almost pointless.

> Many low-end hardware cards don't support RAID6.

Yep! The card in my original post doesn't support RAID6. It does
support RAID50, but I think RAID6 is a better option both space-wise and
safety-wise--RAID6 can always tolerate two disk failures, whereas some
RAID50 two-disk failures will destroy the array. (Yes, you get better
rebuild times on RAID50.)

--keith

--
kkeller-usenet@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information
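The safety comparison in the post above can be checked by brute force: enumerate every possible pair of failed disks and count how many pairs the array survives. This is a sketch under stated assumptions: 12 disks as in the bays mentioned in the thread, a RAID50 built from two 6-disk RAID5 sets (the split is an assumption, not something stated in the thread), and a hypothetical helper name. It treats a layout as groups of disks where each group tolerates a fixed number of losses.

```python
from itertools import combinations


def two_failure_survival(group_sizes, tolerated_per_group, failures=2):
    """Return (survived, total) over all ways of losing `failures` disks.

    group_sizes lists the number of disks in each redundancy group; the
    array is lost if any group loses more disks than tolerated_per_group.
    """
    # Map each disk index to the group it belongs to.
    group_of = [g for g, size in enumerate(group_sizes) for _ in range(size)]
    survived = total = 0
    for failed in combinations(range(len(group_of)), failures):
        total += 1
        losses = {}
        for disk in failed:
            losses[group_of[disk]] = losses.get(group_of[disk], 0) + 1
        if all(count <= tolerated_per_group for count in losses.values()):
            survived += 1
    return survived, total


if __name__ == "__main__":
    layouts = {
        "RAID6, 12 disks": ([12], 2),
        "RAID50, 2 x 6-disk RAID5": ([6, 6], 1),
        "RAID10, 6 mirrored pairs": ([2] * 6, 1),
    }
    for name, (groups, tolerated) in layouts.items():
        ok, total = two_failure_survival(groups, tolerated)
        print(f"{name}: survives {ok} of {total} two-disk failures")
```

With these assumptions RAID6 survives all 66 pairs, the two-set RAID50 survives 36 of 66 (any pair landing in the same RAID5 set is fatal), and mirrored pairs survive 60 of 66.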
| From | Grant <omg@grrr.id.au> |
|---|---|
| Date | 2011-04-09 09:51 +1000 |
| Message-ID | <hk7vp6lanc5k2b7fahvhcnscaolfaf9nqm@4ax.com> |
| In reply to | #635 |
On Fri, 8 Apr 2011 08:22:12 -0700, Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> wrote:
>On 2011-04-08, David Brown <david@westcontrol.removethisbit.com> wrote:
>> On 08/04/2011 02:45, Grant wrote:
>>>
>>> Mark the cables and put the drives in order! Also spin down the
>>> drive you want to pull if it's out where you can feel if it's
>>> spinning?
>>
>> Marking the cables, as well as the drives, is a great idea. It is
>> obvious when you say it, of course, but worth saying out loud.
>>
>> Spinning a drive down is a nice idea to identify them (especially if you
>> forgot to label the drives and cables...) - I will try that to see how
>> easy it is to feel the difference.
>
>It sounds like these suggestions all assume a desktop-like case. Any
>decent rackmount case with hot-swap drive bays should have some way to
>label the drive bays, if the trays aren't already labeled.

My server is crammed into a desktop case. I wish I had activity lights,
but I can't see where to connect them (Seagate cheapie SATA drives, no
mobo connections). Does one add LEDs some other way and write a little
driver? Always did want to add lots of flashing LEDs to a PC ;^)
>
>> One disadvantage of RAID10 is that you can't change it after it is made
>> - you can't reshape it, grow it, or change the layout. But for swap
>> that shouldn't be a problem - just turn your swap off, break down the
>> existing array, and create a new one including the extra drives. Since
>> you have no data on the raid (assuming you are not using swap at the
>> time), you've nothing to lose.
>
>You could always create a swap file on some other disks (even your data
>disks, if you really need to do this), swapon the new file, then swapoff
>the RAID10 swap space. This might not be a lot of fun if you've got a
>lot of swap in use, but that's an indicator of other problems. :)

Yes, swap is overflow space, so one should be able to quieten it on demand?
>
>> I think the days of RAID5 are numbered, except in cases where you have
>> additional protection (such as RAID1+5). Certainly RAID5 + hot spare is
>> a meaningless choice - RAID6 would definitely be better.
>
>I think RAID5 isn't dead yet, but it's a smaller niche. Perhaps you
>have redundant public-facing nodes with four drive bays. Maybe you want
>the extra storage space, so you don't want RAID6, but you want some
>protection against failure, so you don't want RAID0.
>
>But yes, in general I wouldn't want to go RAID5 with more than four or
>so disks, and RAID5 + hot spare is almost pointless.
>
>> Many low-end hardware cards don't support RAID6.
>
>Yep! The card in my original post doesn't support RAID6. It does
>support RAID50, but I think RAID6 is a better option both space-wise and
>safety-wise--RAID6 can always tolerate two disk failures, whereas some
>RAID50 two-disk failures will destroy the array. (Yes, you get better
>rebuild times on RAID50.)

What's RAID50? I guess two mirrored RAID5s? RAID6 seems more efficient?

Grant.
>
>--keith
| From | Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> |
|---|---|
| Date | 2011-04-08 17:10 -0700 |
| Message-ID | <kok478x7tj.ln2@goaway.wombat.san-francisco.ca.us> |
| In reply to | #645 |
On 2011-04-08, Grant <omg@grrr.id.au> wrote:
> On Fri, 8 Apr 2011 08:22:12 -0700, Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> wrote:
>
> What's RAID50, I guess two mirrored RAID5s? RAID6 seems more efficient?

RAID50 is two striped RAID5s. RAID51 would be a mirror of RAID5s.
RAID6 is an improvement over RAID50, but older hardware RAID controllers
(like the one I have) don't support RAID6.

--keith

--
kkeller-usenet@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information
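To put numbers on these levels, here is a minimal capacity sketch using the 12-bay, 2TB and 3TB figures from earlier in the thread. The function name is hypothetical, RAID10 is assumed to keep two copies of everything, and RAID50 and RAID51 are assumed to be built from two equal RAID5 sets; filesystem and metadata overhead are ignored.

```python
def usable_tb(level, disks, size_tb, raid5_sets=2):
    """Usable capacity in TB for equal-sized disks, ignoring overheads."""
    if level == "raid10":
        return disks * size_tb / 2             # two copies of everything
    if level == "raid5":
        return (disks - 1) * size_tb           # one disk's worth of parity
    if level == "raid6":
        return (disks - 2) * size_tb           # two disks' worth of parity
    if level == "raid50":
        return (disks - raid5_sets) * size_tb  # one parity disk per RAID5 set
    if level == "raid51":
        return (disks // 2 - 1) * size_tb      # mirror of two RAID5 sets
    raise ValueError(f"unknown level: {level}")


if __name__ == "__main__":
    for level in ("raid10", "raid6", "raid50", "raid51"):
        print(f"{level}: {usable_tb(level, 12, 2):g} TB from 12 x 2TB")
    print(f"raid10: {usable_tb('raid10', 12, 3):g} TB from 12 x 3TB")
```

On these assumptions a two-set RAID50 matches RAID6 for space (20TB from 12 x 2TB) while RAID51 gives only 10TB, so the case for RAID6 over RAID50 rests on failure tolerance rather than capacity.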
| From | David Brown <david.brown@removethis.hesbynett.no> |
|---|---|
| Date | 2011-04-09 13:14 +0200 |
| Message-ID | <sJGdnWu4CssWoj3QnZ2dnUVZ8sydnZ2d@lyse.net> |
| In reply to | #648 |
On 09/04/11 02:10, Keith Keller wrote:
> On 2011-04-08, Grant<omg@grrr.id.au> wrote:
>> On Fri, 8 Apr 2011 08:22:12 -0700, Keith Keller<kkeller-usenet@wombat.san-francisco.ca.us> wrote:
>>
>> What's RAID50, I guess two mirrored RAID5s? RAID6 seems more efficient?
>
> RAID50 is two striped RAID5s. RAID51 would be a mirror of RAID5s.
> RAID6 is an improvement over RAID50, but older hardware RAID controllers
> (like the one I have) don't support RAID6.
>

RAID50 has some advantages in terms of scalability and recovery, as
compared to a single wide RAID6 - quite aside from any limitations that
hardware controllers might have.

One is that it can be easier to manage a hierarchical setup if you have
a lot of drives. You might have a number of independent RAID5 boxes, and
stripe them together as RAID0. Or you could have more than one RAID
controller card in the same box, each managing a RAID5 array, with RAID0
handled in software.

There is also the issue of rebuilding. With a RAID5, a rebuild requires
continuous reading of all data in all the other drives in the array
(RAID6 is only slightly less bad). If you have your array split into
separate RAID5 arrays, then there will be less disk work during the
rebuild.

A RAID50 will also be more efficient for partial stripe writes than a
single wide RAID5 or RAID6, since you don't have to read in so many
blocks to calculate the parity. With very wide arrays, a higher
proportion of your writes will be partial stripes, so this can be a
bottleneck to scalability.

Of course, RAID50 doesn't give you any better worst-case redundancy than
RAID5 otherwise would - a second disk failure in the same set during a
rebuild means you lose everything. RAID6 gives you that extra
redundancy. However, RAID50 gives you better average-case reliability
than a single wide RAID5 would, if RAID6 is not an option.

It is also perfectly possible to do RAID60, and get the benefits of both
(at the cost of another disk in each set, obviously).
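The rebuild and partial-stripe points above come down to how many disks sit in each RAID5 set. The sketch below counts, for a hypothetical 12-disk box, how many surviving disks must be read in full to rebuild one failed member, and how many data blocks must be read to recompute parity from scratch when one block of a stripe is written. It assumes equal-sized RAID5 sets and a write path that reconstructs parity from the rest of the stripe; the function names are made up for illustration.

```python
def disks_per_set(total_disks, raid5_sets=1):
    """Number of disks in each RAID5 set, assuming equal-sized sets."""
    if raid5_sets < 1 or total_disks % raid5_sets:
        raise ValueError("disks must divide evenly into the RAID5 sets")
    per_set = total_disks // raid5_sets
    if per_set < 3:
        raise ValueError("each RAID5 set needs at least three disks")
    return per_set


def rebuild_source_disks(total_disks, raid5_sets=1):
    """Disks that must be read in full to rebuild one failed member."""
    return disks_per_set(total_disks, raid5_sets) - 1


def reconstruct_write_reads(total_disks, raid5_sets=1, blocks_written=1):
    """Data blocks read to recompute parity for a partial-stripe write."""
    data_blocks = disks_per_set(total_disks, raid5_sets) - 1
    if not 1 <= blocks_written <= data_blocks:
        raise ValueError("blocks_written must fit within one stripe")
    return data_blocks - blocks_written


if __name__ == "__main__":
    for sets in (1, 2, 3):
        print(f"{sets} set(s) of {disks_per_set(12, sets)} disks: "
              f"rebuild reads {rebuild_source_disks(12, sets)} disks, "
              f"one-block write reads {reconstruct_write_reads(12, sets)} blocks")
```

Under these assumptions, splitting 12 disks into two RAID5 sets cuts the rebuild from reading 11 disks to reading 5, and the one-block write from 10 reads to 4.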
| From | Grant <omg@grrr.id.au> |
|---|---|
| Date | 2011-04-09 09:47 +1000 |
| Message-ID | <tn3vp65b9h39pc81l6t65bhdleh3hi08c7@4ax.com> |
| In reply to | #626 |
On Fri, 08 Apr 2011 11:12:14 +0200, David Brown <david@westcontrol.removethisbit.com> wrote:
>On 08/04/2011 02:45, Grant wrote:
>> On Wed, 6 Apr 2011 14:00:04 -0700, Keith
>> Keller<kkeller-usenet@wombat.san-francisco.ca.us> wrote:
>>
>>> Hello Tim, David, thanks so much for your comments.
><snip>
>>>> One hint about learning mdadm - with mdadm, you can build your
>>>> arrays from partitions, not just whole disks. So you can give
>>>> your disks a 4 GB partition at the start and use that when
>>>> testing and learning - it's a lot easier to learn when your
>>>> rebuild times are a couple of minutes, rather than most of the
>>>> day!
>>>
>>> Great suggestion!
>>
>> RAID on partitions is a great idea, I'm using it here with 6 x 1TB
>> drives for the RAID, and a 2TB drive for backup, bounce buffer. At
>> the moment growing from 5 to 6 x 1TB drives with the aid of a
>> borrowed 1.5TB drive to keep it separate from my other stuff.
>>
>> So I use the fast end for OS, a 4GB partition for RAID10 swap, then
>> 2 partitions in the bulk of the space for data in two separate RAID6
>> arrays.
>>
>
>The flexibility is a big advantage of mdraid. Sometimes you want to
>emphasise redundancy, sometimes speed, sometimes space efficiency - you
>can do it all on the same disks using md raid over partitions.
>
>Another thing you can do with software raid is use external USB (or
>eSATA, if possible) drives in your raids. While you won't want to do
>that for normal use, it can be a great way to add in a bit of extra
>redundancy before doing operations such as moving over to larger drives.
> Try doing that with hardware raid cards!
I have a borrowed drive out on an eSATA right now for an extra bounce buffer :)
But I'm leery of making an external drive a RAID member; I prefer RAID members
to be bolted down.
>
>> I'm still to find the best settings, running with a quad core CPU on
>> an Intel chipset (ICH9R) mobo for the 6 raid drives, a dual SATA
>> controller card for backup and external (casual) SATA drives.
>>
>> One thing I'm not seeing discussed enough is the need for adjusting
>> NCQ on the SATA drives. I'm using Seagate drives that have up to 31
>> queue slots, and switched them to use 1. But I've not yet scripted a
>> benchmark to find out if there's a better queue depth to use. The
>> theory is that the mdadm RAID software is fighting command queuing, I
>> have no idea what the impact is, but short tests indicate no queue is
>> better. I'd like more info, confirmation.
>>
>
>I hadn't thought about that at all. I'm planning on setting up a couple
>of new servers in the near future - maybe I'll get a chance to try that out.
Takes a long time, I think. It's a case of writing a script to change the
queue depth and then call some benchmark exercises... One thing I'm no longer
sure of, after exploring the /sys/ area controls: my method of writing the NCQ
depth to the drives directly from rc.local is probably okay, but changing the
queue depth on the fly? It made me wonder if I caused data loss yesterday while
loading up the new RAID6, so I cleared the drives overnight and will start
again, and 'talk' to the drives through the kernel, which will presumably make
the adjustment without losing in-flight data.
Better safe than sorry.
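A minimal sketch of the benchmark loop described above, assuming the depth is changed through the kernel's per-device sysfs attribute /sys/block/<dev>/device/queue_depth. The helper names, the list of member disks and the benchmark placeholder are hypothetical, and the script only prints what it would do unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the sysfs writes unless DRY_RUN=0 is set (needs root then).
DRY_RUN=${DRY_RUN:-1}

# Path of the kernel's queue depth attribute for a SATA/SCSI disk, e.g. sda.
queue_depth_path() {
    echo "/sys/block/$1/device/queue_depth"
}

# Set the queue depth of one disk; a depth of 1 effectively turns NCQ off.
set_queue_depth() (
    dev=$1 depth=$2
    path=$(queue_depth_path "$dev")
    if [ "$DRY_RUN" = 1 ]; then
        echo "echo $depth > $path"
    else
        echo "$depth" > "$path"
    fi
)

# Step through a few depths on the (assumed) six RAID member disks.
for depth in 1 2 4 8 16 31; do
    for dev in sda sdb sdc sdd sde sdf; do
        set_queue_depth "$dev" "$depth"
    done
    # Placeholder: whatever benchmark is chosen would run here.
    echo "# run benchmark at depth $depth"
done
```

Going through sysfs lets the kernel apply the change itself, which is the "talk to the drives through the kernel" route mentioned above.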
>
>>>
>>>> One thing to practice is identifying drives - when a drive fails,
>>>> you want to be very sure of which one you should be replacing
>>>> :-)
>>
>> Mark the cables and put the drives in order! Also spin down the
>> drive you want to pull if it's out where you can feel if it's
>> spinning?
>
>Marking the cables, as well as the drives, is a great idea. It is
>obvious when you say it, of course, but worth saying out loud.
>
>Spinning a drive down is a nice idea to identify them (especially if you
>forgot to label the drives and cables...) - I will try that to see how
>easy it is to feel the difference.
It was good when I had several drives sitting outside the box a few months
ago. I don't have a removable drive cage, so I've no idea how well that works
in the box. I got seven drives into a four-drive tower: four in proper spots,
one where the floppy goes, and two up in the 5 1/4" bays with adapters. It has
a 600W power supply, but the UPS says the box is taking less than 150W.
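A minimal sketch of the two identification tricks discussed above: matching kernel names to the serial numbers under /dev/disk/by-id, and spinning a drive down with hdparm -y so it can be found by touch. The function names and the example device are hypothetical, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set (needs root then).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# List the model/serial based names that point at each sdX device,
# so the label on a drive can be matched to its kernel name.
list_disk_ids() {
    run ls -l /dev/disk/by-id/
}

# Put one drive into standby (spun down).
spin_down() {
    run hdparm -y "/dev/$1"
}

list_disk_ids
spin_down sdc
```

A drive that is still an active array member will probably spin up again at the next access, so this is likely to work best on a member that has already been failed out of the array.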
Got a UPS so I can run XFS safely, though I've yet to rewrite the crappy script
that came with the UPS for delayed shutdown. I think a UPS is an important part
of RAID discussions.
>
>
>>>
>>> Oh boy, I learned that The Hard Way (TM) many years ago, when I
...
>>> made extra work for me.)
>>>
>>>> Personally, I like RAID10 with "far" layout - it gives you
>>>> greater safety than RAID5 or RAID6, and most of the speed of
>>>> RAID0.
>>
>> I did that for the swap RAID10, unsure how to change from RAID10
>> with spare to what? now that I have 6 drives in there. Not that I
>> plan to use a lot of swap, but it is the overload area for /tmp as
>> well (/tmp mounted in memory, expands to swap after it uses half of
>> memory, something like that, I soon forget the details when there's
>> no problems).
>
>I too like my /tmp (and /var/tmp, and sometimes other ad-hoc temporary
>directories) on tmpfs, and so often have a large swap even when I have a
>lot of ram. I haven't bothered using raid on the swap drives -
>mirroring swap is a bit overkill on a desktop, though it's a good idea
>on a server. You don't need to explicitly use raid0 for swap - the
>kernel does that automatically if you have multiple swap drives/partitions.
Yes, I'm building a server, if you check my headers you'll see I write
from a windows box!
>
>I am not sure whether RAID10,far is the best choice for swap, as
>compared to RAID10,near. RAID10,far is excellent for a read-mostly
>array, but writes involve more head movement than in RAID10,near - and
>swap involves writes as much as reads. Perhaps RAID10,offset is in fact
>the best choice.
I don't recognise the RAID10,offset option. As you'll see below, I chose the f2
option from my reading, but this is the first RAIDed swap I've put in place,
and yes, I usually put a swap partition on each spindle and add ',pri=1'
to /etc/fstab to have them treated as RAID0.
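A minimal sketch of the equal-priority swap setup mentioned above. When several swap areas share the same priority the kernel allocates pages across them round-robin, which gives the RAID0-like striping. The partition names and the helper function are hypothetical; the script just prints candidate /etc/fstab lines.

```shell
#!/bin/sh
# Print /etc/fstab lines giving every listed swap partition the same priority.
# $1 = priority, remaining arguments = swap partitions.
swap_fstab_lines() (
    pri=$1; shift
    for part in "$@"; do
        printf '%s none swap sw,pri=%s 0 0\n' "$part" "$pri"
    done
)

# Hypothetical layout: one swap partition per spindle.
swap_fstab_lines 1 /dev/sda3 /dev/sdb3 /dev/sdc3 /dev/sdd3
```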
A while back somebody pointed out to me that a disk failure in the swap area
is the same as a memory failure, so a server should have redundancy
for the swap too. I agree, hence the RAID10; I'm happy to adjust it to a
better-performing one :)
Do you have a reference for the ',offset' argument? Or is it buried in
'man mdadm' somewhere?
Hmm, I had to check my notes for the RAID10 setup, I have:
/etc/mdadm:
# swap: RAID10 - 4 x 2GiB + spare
ARRAY /dev/md/pooh:swap
    UUID=0e3121d0:613689a2:228d5e7b:570357bf
    devices=/dev/sd[abcd]3
    spares=/dev/sde3
And from my setup notes, I used:
swap array:
mdadm --create /dev/md1 --metadata=1.2 --verbose --level=10 \
    --layout=f2 --chunk=64 --raid-devices=4 /dev/sd[abcd]5 \
    --spare-devices=1 /dev/sde5
Now I have six 4GB partitions to play with for the swap array. It's
probably good to keep redundancy there so I don't have to swap a disk
that fails in just that area. I expect total disk failure though.
First data RAID for me; I avoided them until I met RAID6. Too many horror
stories from people writing about losing two of three RAID5 disks, possibly
due to using two per IDE cable or something stupid... Drive goes down, takes
its mate on the same cable with it?
Only that's not quite it because I changed the name for the final one to
/dev/md/swap, which then had to be /dev/md/pooh:swap to keep /etc/fstab
happy.
/etc/fstab, with my notes from the time:
# 8GB RAID10 swap space
/dev/md/pooh:swap swap swap defaults 0 0
# RAID6 data areas
/dev/md/data1p1 /home/raid/a ext4 defaults 0 0
#
/dev/md/data2 /home/raid/b xfs defaults 0 0
#
# backup of the RAID data area, actually I think this is second backup,
# as I change my mind about duplicating lower half of this 2TB disk for
# connection with the 5 x 1TB RAID arrays. The size of this partition
# memorialises that early decision, as it has room for the 1TB partition
# layout found on the remaining drives
#
/dev/sdg1 /home/backup1 xfs defaults,ro 0 0
#
# okay, mount top half of the shiny new 2TB drive as temp holding place,
# lets me think about using the bottom half in the RAID, but I'm sure
# that's a bad idea. Alternately, since I don't yet need that space,
# it's ready to be pushed into service as a cold spare for the RAID6
# data partitions
#
/dev/sdg2 /home/backup2 xfs defaults,ro 0 0
#
# borrowed John's 1.5TB drive for temp data
/dev/sdh1 /home/backup3 ext4 defaults,ro 0 0
#
So I added a sixth 1TB drive a couple of days ago, and the 2TB backup or bounce
drive is there holding stuff that has to go onto the data RAID; then it'll
be a de-duplicated backup. My backups of stuff dating back to the 1990s are
a mess: for some things I have a dozen copies, and in one area I found only one
copy from the 90s floppy disk era.
>
>One disadvantage of RAID10 is that you can't change it after it is made
>- you can't reshape it, grow it, or change the layout. But for swap
>that shouldn't be a problem - just turn your swap off, break down the
>existing array, and create a new one including the extra drives. Since
>you have no data on the raid (assuming you are not using swap at the
>time), you've nothing to lose.
Exactly right, quiesce the machine as far as big jobs go and one can turn
swap off.
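A minimal sketch of the tear-down and rebuild sequence described above, assuming the swap array is /dev/md1 and the new members are the six sd[a-f]5 partitions. Those names and the helper functions are illustrative, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set (needs root then).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Tear down the old swap array and build a new one over more partitions.
# $1 = md device, $2 = layout (n2, f2 or o2), remaining args = member partitions.
recreate_swap_array() (
    md=$1 layout=$2; shift 2
    run swapoff "$md"
    run mdadm --stop "$md"
    run mdadm --create "$md" --metadata=1.2 --level=10 \
        --layout="$layout" --chunk=64 --raid-devices=$# "$@"
    run mkswap "$md"
    run swapon "$md"
)

recreate_swap_array /dev/md1 f2 \
    /dev/sda5 /dev/sdb5 /dev/sdc5 /dev/sdd5 /dev/sde5 /dev/sdf5
```

The new array gets a new UUID, so the ARRAY line in the mdadm configuration would need updating afterwards, and mdadm will likely ask for confirmation because the partitions still carry the old array's metadata.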
>
>
>>> Is that the "far replicas" described in the man page for md(4)?
>>>
>>> My concern about RAID10 is that I'll lose too much capacity to
>>> redundancy. Because this is a snapshot server, I really need to
>>> maximize available storage space; if I have 12 drive bays, with
>>> 2TB drives I'd get only 12TB of usable space from a RAID10; even
>>> with 3TB drives that's only 18TB (if my math is right). Whereas, a
>>> RAID6 with 12 2TB drives gets me 20TB usable. (If this were my
>>> primary fileserver I'd be more likely to consider a RAID10.)
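The capacity figures quoted above follow from simple arithmetic, sketched here under the assumption of equal-sized drives, a two-copy RAID10, and no allowance for hot spares or filesystem overhead. The function name is hypothetical.

```shell
#!/bin/sh
# Usable capacity in TB for n equal drives of size_tb each.
# RAID10 with two copies keeps half; RAID6 gives up two drives' worth
# of space to parity, RAID5 one.
usable_tb() (
    level=$1 n=$2 size_tb=$3
    case $level in
        raid10) echo $(( n * size_tb / 2 )) ;;
        raid6)  echo $(( (n - 2) * size_tb )) ;;
        raid5)  echo $(( (n - 1) * size_tb )) ;;
        *)      echo "unknown level: $level" >&2; exit 1 ;;
    esac
)

usable_tb raid10 12 2   # prints 12
usable_tb raid10 12 3   # prints 18
usable_tb raid6  12 2   # prints 20
```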
>>
>> RAID6 for data, if you're on a budget :) RAID6 is slower than
>> RAID5, but that extra data protection is worth it, I think. You need
>> to cost loss of data vs speed and other factors relevant for your own
>> scenario.
>>
>
>I doubt if RAID6 is noticeably slower than RAID5 for most operations.
30% slower for initial sync, I can do some comparative benchmarking on
the 'fast' RAID partitions (sd[abcdef]5), since that area is yet to be
rebuilt. I'm copying data from there to the sd[abcdef]6 RAID6 today,
via the external temp 1.5TB drive.
>Modern cpu's handle the calculations easily. The only slow point is
>that partial stripe writes will be a little slower (if they miss the
>stripe cache), since you need to read in and write out at least three
>blocks. But these blocks are all on different disks, so they operate in
>parallel.
Well, I put in a quad core, Q6600 CPU, with 4GB memory, and the top usage
is sitting between 2 and 3 for writing from external to the RAID6.
>
>I think the days of RAID5 are numbered, except in cases where you have
>additional protection (such as RAID1+5). Certainly RAID5 + hot spare is
>a meaningless choice - RAID6 would definitely be better.
Yup!
>
>> To rebuild a RAID5 with a RAID5 after total data loss is madness,
>> yet I know a guy doing business systems did that, 'cos the RAID
>> controller didn't do RAID6 (was on a windoze box). Madness?
>
>Many low-end hardware cards don't support RAID6.
Yes, that too, I didn't know about RAID6 until a friend asked me to look at
a NAS box he was buying. At the moment seems only Linux mdadm and high end
cards do RAID6? Intel motherboard chipsets I've seen don't know about it,
so I'm running six AHCI drives on the ICH9R 6 x SATA chipset.
>
>>>
>>>> Another option for growth is to use mdadm over partitions, rather
>>>> than whole disks. Then when you add bigger disks, you have spare
>>>> space that you can make into new partitions, and make another
>>>> mdadm raid using them. If you are using LVM to organise your
>>>> real partitions (which I highly recommend), then you can add your
>>>> new raid as a new physical partition and extend your working
>>>> space.
>>>
>>> Yes, I use LVM. Using partitions sounds like a great idea, and is
>>> definitely something that I can't get out of a hardware RAID
>>> controller (another reason I'm leaning this way).
>>
>> I tried telling mdadm to grow on partition size increase and it
>> refused :(
>>
>> Probably me not up there on the learning curve, but I was
>> disappointed.
>>
>
>It depends on the type of array you have - some can be grown, others
>cannot. RAID 1, 5 and 6 can be grown when you have increased the
>partition size of all components.
It was a RAID6 I tried to grow, but I deleted it and started over, thanks to
the plan of running two data RAID stripes. Seek time between them would be
lousy, so running both at once is not the planned operation; it's more an
active plus archive RAID. I could always merge them with LVM, but I read that
slows down access times markedly.
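A minimal sketch of growing a RAID 1, 5 or 6 array after every member partition has been enlarged, using mdadm's --grow with --size=max. The array name /dev/md2 and the helper functions are hypothetical, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set (needs root then).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Let the array use all the space now available on its members.
grow_array_to_max() {
    run mdadm --grow "$1" --size=max
}

# Show the state of the resync that follows.
show_progress() {
    run cat /proc/mdstat
}

grow_array_to_max /dev/md2
show_progress
```

The filesystem on top still has to be enlarged separately afterwards, for example with resize2fs for ext4 or xfs_growfs for XFS.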
> But RAID 0 and 10 cannot (currently)
>be grown. Resizing RAID 10 would be complicated because of its layout,
>though I'm sure one day it will be supported. Resizing RAID 0 sounds
>easy, but I gather that md RAID 0 is actually very general (it will work
>with different sized disks, for example), which complicates resizing.
>
>> Since mdadm is under active development, I expect it to improve over
>> time.
>
>Some of the plans discussed on the linux-raid@vger.kernel.org mailing
>list are /very/ exciting.
Hmm, I skim through lkml, dunno if I want to see a more detailed story ;)
Grant.
>
>>>
>>>> But the beauty of md raid is its flexibility. Rather than use
>>>> twelve disks in a RAID6, build twelve RAID1 pairs from a real
>>>> drive and a missing drive. Then build your RAID6 on top of these
>>>> "pairs". The result is the same in terms of speed, capacity and
>>>> redundancy. But when you want to replace a drive with a bigger
>>>> disk, you do it by adding the new drive to one of the pairs and
>>>> letting the pair "rebuild". Then you remove the old disk from
>>>> the pair. You keep the same redundancy over the whole array
>>>> throughout the operation, and the rebuild is done as a mirror
>>>> copy from one disk - the other drives are unaffected. You can
>>>> happily do the replacement with multiple disks in parallel - as
>>>> many as you have spare drive bays.
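A minimal sketch of the "RAID1 pair with a missing drive" trick described above, assuming mdadm's `missing` keyword for the absent half and its --add, --wait, --fail and --remove options for the later swap. All device names and function names are hypothetical, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set (needs root then).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Wrap one real partition in a two-way RAID1 whose second half is absent.
make_pair() {
    run mdadm --create "$1" --level=1 --raid-devices=2 "$2" missing
}

# Swap the disk behind a pair: add the new one, wait for the mirror
# copy to finish, then fail and remove the old one.
replace_in_pair() (
    pair=$1 old=$2 new=$3
    run mdadm "$pair" --add "$new"
    run mdadm --wait "$pair"
    run mdadm "$pair" --fail "$old" --remove "$old"
)

make_pair /dev/md10 /dev/sda1
replace_in_pair /dev/md10 /dev/sda1 /dev/sdm1
```

The RAID6 would then be created over the pair devices (here /dev/md10 and its siblings) rather than over the raw partitions.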
>>>
>>> Another fantastic idea! (Though I'm guessing the RAID1s will
>>> somehow show up as ''failed''; I would need to work around that for
>>> paging purposes.)
>>
>> Swap space? RAID10 is best for that, from my reading. Got to be
>> careful with swap reliability because bad swap will crash the machine
>> and possibly eat your data. Same as bad memory.
>>
>> Grant.
| From | David Brown <david.brown@removethis.hesbynett.no> |
|---|---|
| Date | 2011-04-09 13:55 +0200 |
| Message-ID | <V4-dnbjdPpe_1D3QnZ2dnUVZ7vydnZ2d@lyse.net> |
| In reply to | #644 |
On 09/04/11 01:47, Grant wrote:
> On Fri, 08 Apr 2011 11:12:14 +0200, David Brown<david@westcontrol.removethisbit.com> wrote:
>
>> On 08/04/2011 02:45, Grant wrote:
>>> On Wed, 6 Apr 2011 14:00:04 -0700, Keith
>>> Keller<kkeller-usenet@wombat.san-francisco.ca.us> wrote:
I've done some more snipping here - these posts are getting a bit too
long for convenience. There is lots of interest to discuss here.
>>
>> Another thing you can do with software raid is use external USB (or
>> eSATA, if possible) drives in your raids. While you won't want to do
>> that for normal use, it can be a great way to add in a bit of extra
>> redundancy before doing operations such as moving over to larger drives.
>> Try doing that with hardware raid cards!
>
> I have a borrowed drive out on an eSATA right now for an extra bounce buffer :)
>
> But I'm leery of making an external drive a RAID member, prefer RAID members
> to be bolted down.
Yes, but the extra external disk during such maintenance is a great
safety net.
>
> Got UPS so I can run XFS safely, though I've yet to rewrite the crappy script
> that came with the UPS for delayed shutdown. I think UPS is important part of
> RAID discussions.
I take an UPS for granted in a server situation. Using RAID without an
UPS is much like having a car airbag and then not wearing a seatbelt.
If your power dies while you are writing to the disk, then RAID will not
save you - and it will mean /very/ long check times on restart.
>>
>>
>>>>
>>>> Oh boy, I learned that The Hard Way (TM) many years ago, when I
> ...
>>>> made extra work for me.)
>>>>
>>>>> Personally, I like RAID10 with "far" layout - it gives you
>>>>> greater safety than RAID5 or RAID6, and most of the speed of
>>>>> RAID0.
>>>
>>> I did that for the swap RAID10, unsure how to change from RAID10
>>> with spare to what? now that I have 6 drives in there. Not that I
>>> plan to use a lot of swap, but it is the overload area for /tmp as
>>> well (/tmp mounted in memory, expands to swap after it uses half of
>>> memory, something like that, I soon forget the details when there's
>>> no problems).
>>
>> I too like my /tmp (and /var/tmp, and sometimes other ad-hoc temporary
>> directories) on tmpfs, and so often have a large swap even when I have a
>> lot of ram. I haven't bothered using raid on the swap drives -
>> mirroring swap is a bit overkill on a desktop, though it's a good idea
>> on a server. You don't need to explicitly use raid0 for swap - the
>> kernel does that automatically if you have multiple swap drives/partitions.
>
> Yes, I'm building a server, if you check my headers you'll see I write
> from a windows box!
If you check /my/ headers, you'll see that some of my posts are from a
windows machine at work, others from a linux machine at home.
But even for servers, redundancy on your swap partitions is perhaps only
an issue if you really need continuous service. For many uses, RAID is
about /reducing/ downtime - it is not necessary to try to /eliminate/
downtime. Still, making your swap space redundant is not exactly a big
cost - it's just a small sliver off each disk in your arrays.
>>
>> I am not sure whether RAID10,far is the best choice for swap, as
>> compared to RAID10,near. RAID10,far is excellent for a read-mostly
>> array, but writes involve more head movement than in RAID10,near - and
>> swap involves writes as much as reads. Perhaps RAID10,offset is in fact
>> the best choice.
>
> I don't recognise the RAID10,offset option, you see below I chose the f2
> option from my reading, but this is the first RAIDed swap I've put in place,
> and yes, I usually put a swap partition on each spindle and add the ',pri=1'
> to /etc/fstab to have them treated as RAID0.
> A while back somebody pointed out to me that a disk failure in swap area
> is same as memory failure, therefore for a server should have redundancy
> for the swap too. I agree, hence the RAID10, I'm happy to adjust it to
> better performing one :)
>
> Do you have a reference for the ',offset' argument? Or is it buried in
> 'man mdadm' somewhere.
>
Yes, the "offset" option is somewhere in the mdadm man page. It doesn't
get the same level of publicity as the "far" option, which is an
exclusive feature in Linux md raid, and is generally the fastest choice
("near" is pretty much standard RAID1+0, if you have 4 disks). I think
"offset" was added to md for compatibility with some other raid system,
but I suspect it might actually be the best choice for when you have
lots of writes, especially small writes, such as for swap space.
There are some layout diagrams here:
<http://en.wikipedia.org/wiki/Non-standard_RAID_levels#Linux_MD_RAID_10>
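A minimal sketch of how the three RAID10 layouts are selected on the mdadm command line, assuming two copies of each block (n2, f2, o2). The helper name and the device names are hypothetical, and the loop only prints the create commands so the variants can be compared side by side.

```shell
#!/bin/sh
# Map a layout name to mdadm's --layout value for two copies.
raid10_layout_flag() {
    case $1 in
        near)   echo n2 ;;
        far)    echo f2 ;;
        offset) echo o2 ;;
        *)      echo "unknown layout: $1" >&2; return 1 ;;
    esac
}

# Print one create command per layout; nothing is executed.
for name in near far offset; do
    echo "mdadm --create /dev/md1 --level=10" \
         "--layout=$(raid10_layout_flag "$name")" \
         "--raid-devices=4 /dev/sd[abcd]5"
done
```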
> Hmm, I had to check my notes for the RAID10 setup, I have:
>
> /etc/mdadm:
>
> # swap: RAID10 - 4 x 2GiB + spare
> ARRAY /dev/md/pooh:swap
> UUID=0e3121d0:613689a2:228d5e7b:570357bf
> devices=/dev/sd[abcd]3
> spares=/dev/sde3
>
> And from my setup notes, I used:
> swap array:
> mdadm --create /dev/md1 --metadata=1.2 --verbose --level=10 \
> --layout=f2 --chunk=64 --raid-devices=4 /dev/sd[abcd]5 \
> --spare-devices=1 /dev/sde5
>
> Now I have six by 4GB partitions to play with for the swap array. Which
> probably is good to keep redundancy there so I don't have to swap a disk
> that fails in just that area. I expect total disk failure though.
>
> First data RAID for me, I avoided them until I met RAID6. Too many horror
> stories people writing about losing two of three RAID5 disks, possibly due
> to using two per IDE cable or something stupid... Drive goes down, takes
> mate on same cable with it?
>
With IDE cables, it was certainly possible for one failure to bring down
the other disk on the same cable. You also suffered from low bandwidth
when you had two disks on the same cable. Parallel SCSI had similar
issues, but then you could have more than just two disks in the same
chain. On the other hand, the SCSI disks were more robust and less
likely to be affected by the failures of other disks in the chain.
But with serial cables (SATA or SAS), you don't get these problems any more.
However, there is still a risk of losing a second disk during a RAID5
rebuild. Rebuilds are the most stressful action you can have on a RAID5
(or RAID6) array, so if a second disk is feeling poorly, then a rebuild
might be the trigger that pushes it over the edge.
As always, make sure you have good independent backups of anything
important - even if you use RAID6!
> Only that's not quite it because I changed the name for the final one to
> /dev/md/swap, which then had to be /dev/md/pooh:swap to keep /etc/fstab
> happy.
>
> /etc/fstab, with my notes from the time:
>
> # 8GB RAID10 swap space
> /dev/md/pooh:swap swap swap defaults 0 0
> # RAID6 data areas
> /dev/md/data1p1 /home/raid/a ext4 defaults 0 0
> #
> /dev/md/data2 /home/raid/b xfs defaults 0 0
> #
> # backup of the RAID data area, actually I think this is second backup,
> # as I change my mind about duplicating lower half of this 2TB disk for
> # connection with the 5 x 1TB RAID arrays. The size of this partition
> # memorialises that early decision, as it has room for the 1TB partition
> # layout found on the remaining drives
> #
> /dev/sdg1 /home/backup1 xfs defaults,ro 0 0
> #
> # okay, mount top half of the shiny new 2TB drive as temp holding place,
> # lets me think about using the bottom half in the RAID, but I'm sure
> # that's a bad idea. Alternately, since I don't yet need that space,
> # it's ready to be pushed into service as a cold spare for the RAID6
> # data partitions
> #
> /dev/sdg2 /home/backup2 xfs defaults,ro 0 0
> #
> # borrowed John's 1.5TB drive for temp data
> /dev/sdh1 /home/backup3 ext4 defaults,ro 0 0
> #
>
> So I added a sixth 1TB drive a couple days ago, and the 2TB backup or bounce
> drive is there holding stuff that has to go onto the data RAID, then it'll
> be a de duplicated backup, my backups for stuff dating back to the 1990s is
> a mess, some things I have a dozen copies, one area I found only one copy
> from the 90s floppy disk era.
>>
>> One disadvantage of RAID10 is that you can't change it after it is made
>> - you can't reshape it, grow it, or change the layout. But for swap
>> that shouldn't be a problem - just turn your swap off, break down the
>> existing array, and create a new one including the extra drives. Since
>> you have no data on the raid (assuming you are not using swap at the
>> time), you've nothing to lose.
>
> Exactly right, quiesce the machine as far as big jobs go and one can turn
> swap off.
>>
>>
>>>> Is that the "far replicas" described in the man page for md(4)?
>>>>
>>>> My concern about RAID10 is that I'll lose too much capacity to
>>>> redundancy. Because this is a snapshot server, I really need to
>>>> maximize available storage space; if I have 12 drive bays, with
>>>> 2TB drives I'd get only 12TB of usable space from a RAID10; even
>>>> with 3TB drives that's only 18TB (if my math is right). Whereas, a
>>>> RAID6 with 12 2TB drives gets me 20TB usable. (If this were my
>>>> primary fileserver I'd be more likely to consider a RAID10.)
>>>
>>> RAID6 for data, if you're on a budget :) RAID6 is slower than
>>> RAID5, but that extra data protection is worth it, I think. You need
>>> to cost loss of data vs speed and other factors relevant for your own
>>> scenario.
>>>
>>
>> I doubt if RAID6 is noticeably slower than RAID5 for most operations.
>
> 30% slower for initial sync, I can do some comparative benchmarking on
> the 'fast' RAID partitions (sd[abcdef]5), since that area is yet to be
> rebuilt. I'm copying data from there to the sd[abcdef]6 RAID6 today,
> via the external temp 1.5TB drive.
>
Don't place too much emphasis on the initial sync time - that's only
done once, and doesn't matter in the long run. Rebuild times are a bit
more important, but you (hopefully!) don't have to rebuild often. It's
the speed of the array in real-time use that's important.
>> Modern cpu's handle the calculations easily. The only slow point is
>> that partial stripe writes will be a little slower (if they miss the
>> stripe cache), since you need to read in and write out at least three
>> blocks. But these blocks are all on different disks, so they operate in
>> parallel.
>
> Well, I put in a quad core, Q6600 CPU, with 4GB memory, and the top usage
> is sitting between 2 and 3 for writing from external to the RAID6.
>>
>> I think the days of RAID5 are numbered, except in cases where you have
>> additional protection (such as RAID1+5). Certainly RAID5 + hot spare is
>> a meaningless choice - RAID6 would definitely be better.
>
> Yup!
>>
>>> To rebuild a RAID5 with a RAID5 after total data loss is madness,
>>> yet I know a guy doing business systems did that, 'cos the RAID
>>> controller didn't do RAID6 (was on a windoze box). Madness?
>>
>> Many low-end hardware cards don't support RAID6.
>
> Yes, that too, I didn't know about RAID6 until a friend asked me to look at
> a NAS box he was buying. At the moment seems only Linux mdadm and high end
> cards do RAID6? Intel motherboard chipsets I've seen don't know about it,
> so I'm running six AHCI drives on the ICH9R 6 x SATA chipset.
The "raid" supported by motherboard chipsets is often known as
"fakeraid". It's a limited form of software raid, with all the
disadvantages of software raid and all the disadvantages of hardware
raid. It's okay as a quick and easy solution for a desktop with either
RAID0 or RAID1 for a pair of disks, and especially useful for OS's that
don't have particularly good software raid (guess which one...). But
it's a poor choice for a more serious setup.
>>
>>>>
>>>>> Another option for growth is to use mdadm over partitions, rather
>>>>> than whole disks. Then when you add bigger disks, you have spare
>>>>> space that you can make into new partitions, and make another
>>>>> mdadm raid using them. If you are using LVM to organise your
>>>>> real partitions (which I highly recommend), then you can add your
>>>>> new raid as a new physical partition and extend your working
>>>>> space.
>>>>
>>>> Yes, I use LVM. Using partitions sounds like a great idea, and is
>>>> definitely something that I can't get out of a hardware RAID
>>>> controller (another reason I'm leaning this way).
>>>
>>> I tried telling mdadm to grow on partition size increase and it
>>> refused :(
>>>
>>> Probably me not up there on the learning curve, but I was
>>> disappointed.
>>>
>>
>> It depends on the type of array you have - some can be grown, others
>> cannot. RAID 1, 5 and 6 can be grown when you have increased the
>> partition size of all components.
>
> It was a RAID6 I tried to grow, but I deleted it and started over, thanks to
> the plan of running two data RAID stripes, though seek time between them
> would be lousy, so that's not the planned operation, sort of active plus
> archive RAID, I could always merge them with LVM, but I read that slows down
> access times markedly.
>
LVM can slow down operations in a number of ways. The layers of
indirection will increase access times, and it is easy to get
non-contiguous logical partitions, especially if you have several
physical volumes, which can mess with the filesystem's optimisations.
But you get an enormous flexibility by using it. The usual attitude is
therefore to make your low-level RAID using md raid, getting the fastest
setup you can with the redundancy and space requirements you need. Then
you put LVM on top and accept the speed costs for the flexibility gains.
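A minimal sketch of putting LVM on top of md raid when a new array is added, using the standard pvcreate, vgextend and lvextend tools. The array /dev/md3, volume group vg0 and logical volume data are hypothetical names, and the commands are only printed unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set (needs root then).
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = 1 ]; then echo "$*"; else "$@"; fi
}

# Add a new md array to an existing volume group and hand all the
# new space to one logical volume.
# $1 = new md device, $2 = volume group, $3 = logical volume name.
extend_vg_with_array() (
    md=$1 vg=$2 lv=$3
    run pvcreate "$md"
    run vgextend "$vg" "$md"
    run lvextend -l +100%FREE "/dev/$vg/$lv"
)

extend_vg_with_array /dev/md3 vg0 data
```

As with growing an array, the filesystem inside the logical volume has to be enlarged as a separate step.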
>> But RAID 0 and 10 cannot (currently)
>> be grown. Resizing RAID 10 would be complicated because of its layout,
>> though I'm sure one day it will be supported. Resizing RAID 0 sounds
>> easy, but I gather that md RAID 0 is actually very general (it will work
>> with different sized disks, for example), which complicates resizing.
>>
>>> Since mdadm is under active development, I expect it to improve over
>>> time.
>>
>> Some of the plans discussed on the linux-raid@vger.kernel.org mailing
>> list are /very/ exciting.
>
> Hmm, I skim through lkml, dunno if I want to see a more detailed story ;)
Have a look at <http://neil.brown.name/blog/20110216044002>. Neil
writes a very clear and well-thought-out article (as well as writing
excellent software!).
| From | Tris Orendorff <triso@remove-me.cogeco.ca> |
|---|---|
| Date | 2011-04-12 18:04 +0000 |
| Message-ID | <Xns9EC58F37B7D8RepublicPicturesLtd@69.16.185.250> |
| In reply to | #625 |
Grant <omg@grrr.id.au> burped up warm pablum in news:tllsp6ltftsq6ufp048hcc4ivufupgbmki@4ax.com:
>
> Swap space? RAID10 is best for that, from my reading. Got to be
> careful with swap reliability because bad swap will crash the machine
> and possibly eat your data. Same as bad memory.
Swap space? Isn't that useless for a server? We've found it
next-to-useless on our desktops even with the fastest SSDs.
--
Tris Orendorff
[ Anyone naming their child should spend a few minutes checking rhyming
slang and dodgy sounding names. Brad and Angelina failed to do this when
naming their kid Shiloh Pitt. At some point, someone at school is going
to spoonerise her name.
Craig Stark ]
| From | Keith Keller <kkeller-usenet@wombat.san-francisco.ca.us> |
|---|---|
| Date | 2011-04-12 11:34 -0700 |
| Message-ID | <viie78x9sc.ln2@goaway.wombat.san-francisco.ca.us> |
| In reply to | #694 |
On 2011-04-12, Tris Orendorff <triso@remove-me.cogeco.ca> wrote:
>
> Swap space? Isn't that useless for a server? We've found it next-to-useless on our desktops even with the fastest SSDs.
It's not *useless* per se. The example I sometimes see cited (even in
this thread?) is with xfs repairs on large filesystems. These can take
a ton of memory, but much of it isn't active, and let's face it, you
probably really really want the xfs check or repair to work no matter
what, so you'd be willing to sacrifice performance for results. (Of
course you need plenty of free disk space on an unaffected fs!)
But even if you don't do that, it's still not completely useless. The
kernel will move things to swap if it hasn't been used in a while and
free memory is wanted; when the memory frees up, it can leave that data
in swap so that it can use more physical memory for other tasks (e.g.,
more disk buffers). In this use case, you're not actually using the
swap space very often.
It's true that if you're counting on swap to be useful as active memory
you're likely to be disappointed, but used as one-off space it can be
handy.
--keith
--
kkeller-usenet@wombat.san-francisco.ca.us
(try just my userid to email me)
AOLSFAQ=http://www.therockgarden.ca/aolsfaq.txt
see X- headers for PGP signature information
| From | The Natural Philosopher <tnp@invalid.invalid> |
|---|---|
| Date | 2011-04-12 21:13 +0100 |
| Message-ID | <io2bp3$9sk$1@news.albasani.net> |
| In reply to | #695 |
Keith Keller wrote:
> The
> kernel will move things to swap if it hasn't been used in a while and
> free memory is wanted; when the memory frees up, it can leave that data
> in swap so that it can use more physical memory for other tasks (e.g.,
> more disk buffers).
That is key. It means the rarely used admin processes that are
essentially asleep do not fill the RAM.
| From | David Brown <david@westcontrol.removethisbit.com> |
|---|---|
| Date | 2011-04-13 09:45 +0200 |
| Message-ID | <XcSdnU4wVYejyTjQnZ2dnUVZ8o-dnZ2d@lyse.net> |
| In reply to | #696 |
On 12/04/2011 22:13, The Natural Philosopher wrote:
> Keith Keller wrote:
>> The
>> kernel will move things to swap if it hasn't been used in a while and
>> free memory is wanted; when the memory frees up, it can leave that data
>> in swap so that it can use more physical memory for other tasks (e.g.,
>> more disk buffers).
>
> That is key. It means the rarely used admin processes that are
> essentially asleep do not fill the RAM.
Such processes are usually small, but it's a good principle none the less.
The reason I like swap space is as a backing store for tmpfs
filesystems. I usually put /tmp and /var/tmp on tmpfs, and sometimes
have additional tmpfs mounts for odd purposes (such as the build
directories for large compilations - though obviously that's more
desktop than server usage). Tmpfs is much faster and more efficient
than any other filesystem, even if the data is stored on a disk rather
than in memory, because it does not give the slightest care to data
reliability.
The Linux kernel is good at memory management, and at balancing what
goes in ram and what goes in swap. Clearly you are always faster with
x+y ram instead of x ram and y swap, but x+y ram and y swap is even better.
| From | Grant <omg@grrr.id.au> |
|---|---|
| Date | 2011-04-14 13:42 +1000 |
| Message-ID | <h2rcq6hvcf8rbb0rigptkf6alrtfhsvqp7@4ax.com> |
| In reply to | #698 |
On Wed, 13 Apr 2011 09:45:58 +0200, David Brown <david@westcontrol.removethisbit.com> wrote:
>On 12/04/2011 22:13, The Natural Philosopher wrote:
>> Keith Keller wrote:
>>> The
>>> kernel will move things to swap if it hasn't been used in a while and
>>> free memory is wanted; when the memory frees up, it can leave that data
>>> in swap so that it can use more physical memory for other tasks (e.g.,
>>> more disk buffers).
>>
>> That is key. It means the rarely used admin processes that are
>> essentially asleep do not fill the RAM.
>
>Such processes are usually small, but it's a good principle none the less.
>
>
>The reason I like swap space is as a backing store for tmpfs
>filesystems. I usually put /tmp and /var/tmp on tmpfs, and sometimes
>have additional tmpfs mounts for odd purposes (such as the build
>directories for large compilations - though obviously that's more
>desktop than server usage). Tmpfs is much faster and more efficient
>than any other filesystem, even if the data is stored on a disk rather
>than in memory, because it does not give the slightest care to data
>reliability.
Wonder if you could show your relevant /etc/fstab lines? I'm curious
how others do this.
>
>The Linux kernel is good at memory management, and at balancing what
>goes in ram and what goes in swap. Clearly you are always faster with
>x+y ram instead of x ram and y swap, but x+y ram and y swap is even better.
Grant.
| From | David Brown <david@westcontrol.removethisbit.com> |
|---|---|
| Date | 2011-04-14 09:15 +0200 |
| Message-ID | <-NmdnSetarUCAzvQnZ2dnUVZ8hKdnZ2d@lyse.net> |
| In reply to | #711 |
On 14/04/2011 05:42, Grant wrote:
> On Wed, 13 Apr 2011 09:45:58 +0200, David Brown<david@westcontrol.removethisbit.com> wrote:
>
>> On 12/04/2011 22:13, The Natural Philosopher wrote:
>>> Keith Keller wrote:
>>>> The
>>>> kernel will move things to swap if it hasn't been used in a while and
>>>> free memory is wanted; when the memory frees up, it can leave that data
>>>> in swap so that it can use more physical memory for other tasks (e.g.,
>>>> more disk buffers).
>>>
>>> That is key, It means the rarely used admin processes that are
>>> essentially asleep, do not fill the RAM.
>>
>> Such processes are usually small, but it's a good principle none the less.
>>
>>
>> The reason I like swap space is as a backing store for tmpfs
>> filesystems. I usually put /tmp and /var/tmp on tmpfs, and sometimes
>> have additional tmpfs mounts for odd purposes (such as the build
>> directories for large compilations - though obviously that's more
>> desktop than server usage). Tmpfs is much faster and more efficient
>> than any other filesystem, even if the data is stored on a disk rather
>> than in memory, because it does not give the slightest care to data
>> reliability.
>
> Wonder if you could show your relevant /etc/fstab lines? I'm curious
> how others do this?

Putting /tmp on tmpfs is not rocket science - if you thought I had some
cunning secret here, I have to disappoint you :

tmpfs /tmp tmpfs defaults 0 0
tmpfs /var/tmp tmpfs defaults 0 0

(Note that /var/tmp should really survive a reboot. However, I have
never heard of any programs that actually rely on that - but no
guarantees. /tmp should always be safe on tmpfs.)

You can make a new tmpfs on another directory:

mkdir t
mount -t tmpfs tmpfs t

By default, tmpfs mounts are limited in size to half your physical ram -
but you can change that with the "size" mount option. tmpfs takes
negligible space overhead - you only use ram/swap for the files stored
there.

>>
>> The Linux kernel is good at memory management, and at balancing what
>> goes in ram and what goes in swap. Clearly you are always faster with
>> x+y ram instead of x ram and y swap, but x+y ram and y swap is even better.
>
> Grant.
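As a rough illustration of the "half your physical ram" default mentioned in this post, the sketch below reads MemTotal from /proc/meminfo and halves it. The helper names are made up for this example, and the sizes in the commented mount lines are illustrative only; treat it as a sketch, not as how the kernel itself computes the limit.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of any standard tool.

# Default tmpfs limit as described in the post: half of physical RAM.
# $1: MemTotal in kB
default_tmpfs_kb() {
    echo $(( $1 / 2 ))
}

# MemTotal in kB, as reported by the Linux kernel.
mem_total_kb() {
    awk '/^MemTotal:/ { print $2 }' /proc/meminfo
}

if [ -r /proc/meminfo ]; then
    total=$(mem_total_kb)
    echo "MemTotal: ${total} kB, default tmpfs limit: $(default_tmpfs_kb "$total") kB"
fi

# Overriding the default needs root. Mount point 't' is the one from the
# post; the sizes here are arbitrary examples:
#   mount -t tmpfs -o size=1G tmpfs t
#   mount -o remount,size=2G t
```

Running `df -h` on the mount point afterwards is a simple way to confirm which limit is actually in effect.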
| From | Grant <omg@grrr.id.au> |
|---|---|
| Date | 2011-04-15 08:03 +1000 |
| Message-ID | <35qeq61o56nifjseocj5jf7c3h1k6snjv8@4ax.com> |
| In reply to | #715 |
On Thu, 14 Apr 2011 09:15:32 +0200, David Brown <david@westcontrol.removethisbit.com> wrote:
>On 14/04/2011 05:42, Grant wrote:
>> On Wed, 13 Apr 2011 09:45:58 +0200, David Brown<david@westcontrol.removethisbit.com> wrote:
>>
>>> On 12/04/2011 22:13, The Natural Philosopher wrote:
>>>> Keith Keller wrote:
>>>>> The
>>>>> kernel will move things to swap if it hasn't been used in a while and
>>>>> free memory is wanted; when the memory frees up, it can leave that data
>>>>> in swap so that it can use more physical memory for other tasks (e.g.,
>>>>> more disk buffers).
>>>>
>>>> That is key, It means the rarely used admin processes that are
>>>> essentially asleep, do not fill the RAM.
>>>
>>> Such processes are usually small, but it's a good principle none the less.
>>>
>>>
>>> The reason I like swap space is as a backing store for tmpfs
>>> filesystems. I usually put /tmp and /var/tmp on tmpfs, and sometimes
>>> have additional tmpfs mounts for odd purposes (such as the build
>>> directories for large compilations - though obviously that's more
>>> desktop than server usage). Tmpfs is much faster and more efficient
>>> than any other filesystem, even if the data is stored on a disk rather
>>> than in memory, because it does not give the slightest care to data
>>> reliability.
>>
>> Wonder if you could show your relevant /etc/fstab lines? I'm curious
>> how others do this?
>
>Putting /tmp on tmpfs is not rocket science - if you thought I had some
>cunning secret here, I have to disappoint you :
>
>tmpfs /tmp tmpfs defaults 0 0
>tmpfs /var/tmp tmpfs defaults 0 0
So from where did I get this?
...
tmpfs /dev/shm tmpfs defaults 0 0
#
# run /tmp in memory, use up to twice physical memory size, 8GB!
none /tmp tmpfs size=8096M,mode=1777,nodev,nosuid 0 0
#
It works too, in that dd'ing to a new file in /tmp will use half memory
then expand into swap space:
root@pooh:~# time (dd if=/dev/zero bs=1G count=6 of=/tmp/zeroes; sync)
6+0 records in
6+0 records out
6442450944 bytes (6.4 GB) copied, 25.6495 s, 251 MB/s
real 0m27.977s
user 0m0.003s
sys 0m9.449s
Why this confusion with GiB and GB? dd counts by GiB, reports in decimal
GB, a bet each way? And yes, running into swap space takes a lot of time ;)
Swap is on RAID10, now set to o2 :)
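On the GiB/GB question: with GNU dd, bs=1G means 1024^3 bytes per block, while the summary line is printed in decimal units (10^9 bytes). A small sketch of the arithmetic, with helper names invented for this example:

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# dd's "1G" block size suffix is binary: 1 GiB = 1024^3 bytes.
gib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# dd's summary line uses decimal gigabytes: 1 GB = 10^9 bytes.
bytes_to_decimal_gb() {
    awk -v b="$1" 'BEGIN { printf "%.1f\n", b / 1e9 }'
}

bytes=$(gib_to_bytes 6)
echo "$bytes bytes = $(bytes_to_decimal_gb "$bytes") GB"
```

So count=6 with bs=1G writes 6442450944 bytes, which dd then reports as 6.4 GB.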
root@pooh:~# ls -l /tmp/
total 6303772
-rw-r--r-- 1 root root 6442450944 2011-04-15 07:40 zeroes
root@pooh:~# cat /proc/swaps
Filename Type Size Used Priority
/dev/md127 partition 8386300 4815424 -1
root@pooh:~# free
total used free shared buffers cached
Mem: 4053296 2269572 1783724 0 47532 1754452
-/+ buffers/cache: 467588 3585708
Swap: 8386300 4815332 3570968
root@pooh:~# time (dd if=/dev/zero bs=1G count=6 of=/tmp/zeroes2; sync)
dd: writing `/tmp/zeroes2': No space left on device
2+0 records in
1+0 records out
2030231552 bytes (2.0 GB) copied, 6.60451 s, 307 MB/s
real 0m8.519s
user 0m0.000s
sys 0m4.110s
root@pooh:~# ls -l /tmp/
total 8290308
-rw-r--r-- 1 root root 6442450944 2011-04-15 07:40 zeroes
-rw-r--r-- 1 root root 2030231552 2011-04-15 07:50 zeroes2
Shouldn't I get 10GB into /tmp if it has 2GB of real memory plus the 8GB
swap space? No, because I set the /tmp size; I had to, to make it go larger
than the tmpfs default.
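The numbers can be checked by hand. A minimal sketch, using only the size=8096M option and the file size from the transcript (the helper name is made up for this example):

```shell
#!/bin/sh
# Hypothetical helper for illustration only.
mib_to_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}

limit=$(mib_to_bytes 8096)   # size=8096M from the fstab line
first=6442450944             # size of /tmp/zeroes

echo "tmpfs limit:      $limit bytes"
echo "left after first: $(( limit - first )) bytes"
echo "8 GiB would be:   $(mib_to_bytes 8192) bytes (size=8192M)"
```

The second dd stopped at 2030231552 bytes, a little under the computed remainder; the gap is presumably tmpfs accounting overhead, since ls already reports more blocks for the first file than its byte size alone would need. Note also that 8096M is slightly less than a full 8 GiB.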
root@pooh:~# rm /tmp/z*
Don't leave a saturated /tmp space!
>
>(Note that /var/tmp should really survive a reboot. However, I have
>never heard of any programs that actually rely on that - but no
>guarantees. /tmp should always be safe on tmpfs.)
Hmm, I don't do anything special for /var/tmp, but on a slack-11.0 box
that has been up 16 days, it's empty. On the 'pooh' box above, it's got old
crap surviving boot for KDE, 2.2MB for a single user, wonder why? I tend
towards wanting to flush that one on boot too, or put it on tmpfs.
root@pooh:~# ls -las /var/tmp
total 1
0 drwxrwxrwt 3 root root 80 2011-01-07 11:10 ./
1 drwxr-xr-x 19 root root 536 2011-02-10 08:03 ../
0 drwx------ 3 grant wheel 128 2011-01-07 11:12 kdecache-grant/
root@pooh:~# ls -las /var/tmp/kdecache-grant/
total 2166
0 drwx------ 3 grant wheel 128 2011-01-07 11:12 ./
0 drwxrwxrwt 3 root root 80 2011-01-07 11:10 ../
0 drwx------ 2 grant wheel 168 2011-01-07 11:13 kpc/
2162 -rw-r--r-- 1 grant wheel 2211743 2011-01-07 11:12 ksycoca4
4 -rw-r--r-- 1 grant wheel 358 2011-01-07 11:12 ksycoca4stamp
root@pooh:~# du -sh /var/tmp
2.2M /var/tmp
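If flushing at boot is the goal rather than moving /var/tmp to tmpfs, something like the sketch below could be called from a local boot script. The flush_dir helper is hypothetical, where to call it from depends on the distribution, and the demonstration runs on a scratch directory rather than the real /var/tmp.

```shell
#!/bin/sh
# flush_dir is a hypothetical helper, not part of any distribution.
flush_dir() {
    dir=$1
    # Refuse obviously dangerous targets.
    case "$dir" in
        ""|"/") echo "flush_dir: refusing to flush '$dir'" >&2; return 1 ;;
    esac
    [ -d "$dir" ] || return 1
    # Delete everything below $dir but keep $dir itself, so its
    # ownership and sticky-bit permissions are left alone.
    find "$dir" -mindepth 1 -delete
}

# Demonstration on a scratch directory with made-up contents.
demo=$(mktemp -d)
mkdir "$demo/kdecache-example"
touch "$demo/kdecache-example/ksycoca4"
flush_dir "$demo"
ls -A "$demo"    # prints nothing once the directory is empty
rmdir "$demo"
```

Keeping the directory itself avoids having to recreate it with mode 1777 afterwards.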
>
>You can make a new tmpfs on another directory:
>
>mkdir t
>mount -t tmpfs tmpfs t
>
>By default, tmpfs mounts are limited in size to half your physical ram -
>but you can change that with the "size" mount option. tmpfs takes
>negligible space overhead - you only use ram/swap for the files stored
>there.
Thanks.
Grant.