A VMware ESX question: Boot from SAN? By VSM News Staff published: Wednesday, November 02 2005
Expert Server Group is a VMware Authorized Consultant (VAC) with some of the most skilled and experienced System Engineers in North America in implementing Server Virtualization.
VSM had a chance to talk with them about how to decide whether to boot from SAN.
VSM: Are you hearing questions about this from customers?
ESG: We’ve been involved in a lot of conversations over the last year on whether to boot VMware ESX servers from SAN, as well as the general uses of SAN.
VSM: Do you hear this from advanced users or from people who are just beginning to get into virtualization?
ESG: From both. The answer is, it depends. As a general rule, we like to boot from the servers, with internal disks mirrored via hardware mirroring. If you have an HP server, you would be able to use the smart-array utility and take two drives in the server and mirror them for an operating system.
Size matters as far as the drive choice, but it’s relatively painless. Even with high-end servers, a 72GB pair of mirrored drives internally would be a wonderful choice.
Boot from SAN is a viable option, because it helps remove you even further from being dependent on any piece of hardware. You could have additional hardware sitting around, and if you lost a server, you could recover very quickly. In a disaster-recovery situation, you could recover rapidly if everything was SAN-enabled and SAN-managed.
The downside is the cost usually associated with the additional traffic and with booting from SAN itself. It’s about weighing the costs against the benefits that you get.
VSM: You’d prefer to boot from mirrored disks?
ESG: Internal to the system, yes.
VSM: But then you said there are big advantages in booting from a SAN?
ESG: Yes, but it does increase costs.
VSM: If cost was taken out of the equation, would you have a clear preference?
ESG: If cost was not part of the equation, and manageability was not part of the equation, putting everything on the SAN would be great. Then each physical server would only be responsible for running CPU cycles and providing memory.
VSM: What about manageability in that context?
ESG: When you install physical devices into a VMware ESX server, you have to allocate that hardware to be running from the service console - the VMware operating system that runs on the bare metal - or you have to allocate the hardware to virtual machines. Very few devices should be shared between the two, even though that option is available to you.
Fibre cards, for instance, should only be dedicated to one or the other. In a proper, redundant environment, two Fibre Channel cards with a minimum of two ports, one per card, would be the optimal solution, and that way you would have redundant paths through the SAN to your storage. That’s great, it’s an expected cost, and it’s no different from what people do today.
In the VMware model, let’s use the example of 20-to-1 consolidation on a 4-way server, where we took 20 physical servers and consolidated them down to one. To have 20 servers redundantly connected to fabric is a substantial cost. The Fibre Channel card cost, at $1,500 per card, would be $3,000 per server. Multiply that by 20 and you’re at $60,000 for hardware, just to make the connection to the SAN fabrics.
Not only that, you need to have a total of 40 Fibre Channel switch ports, which are still very costly. You have a huge cost if you are using physical servers, but in the VMware world you have two Fibre Channel cards costing a total of $3,000. You would eat up two Fibre Channel ports, but that cost can now be justified. If it was $10,000 total to connect one ESX server, that’s much more realistic than the cost to connect multiple servers. That justifies the SAN.
Those two cards would have to be dedicated for virtual machines to make that connection into the SAN. If we were going to be setting up the ESX operating system environment and make it properly redundant, we would need another two Fibre Channel ports.
That adds additional cost and complexity to the solution, because now the person who manages the SAN has to be aware that all the SAN disk that goes to the ESX servers for virtual machines is managed one way, and all of the ESX operating system drives on SAN are managed in a different way. Some of those disks would be shared, while the operating system for each of the ESX environments would not be shared.
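The connectivity arithmetic in this answer can be sketched in a few lines of Python. This is only an illustration of the figures quoted above: the function and variable names are hypothetical, and the sketch assumes each Fibre Channel card uses one port and that each port in use consumes one switch port. Switch-port prices are not given in the interview, so only port counts are shown.

```python
# Figures quoted in the interview; everything else here is an illustrative assumption.
CARD_COST_USD = 1_500     # per Fibre Channel card
CARDS_PER_SERVER = 2      # two cards, one port used per card, for redundant paths


def fc_connection(servers, cards_per_server=CARDS_PER_SERVER, card_cost=CARD_COST_USD):
    """Return card count, card cost, and switch ports needed for redundant SAN access."""
    cards = servers * cards_per_server
    return {
        "cards": cards,
        "card_cost_usd": cards * card_cost,
        "switch_ports": cards,  # assumption: one switch port per card port in use
    }


physical = fc_connection(20)     # 20 standalone servers
virtualized = fc_connection(1)   # one ESX host after 20-to-1 consolidation

# Booting ESX from SAN redundantly needs another two Fibre Channel ports on the host.
EXTRA_BOOT_PORTS = 2
boot_from_san_ports = virtualized["switch_ports"] + EXTRA_BOOT_PORTS

print(physical)
print(virtualized)
print("Switch ports with boot from SAN:", boot_from_san_ports)
```

Under these assumptions the card spend drops from $60,000 to $3,000 and the switch-port count from 40 to 2, rising to 4 if the ESX host also boots from SAN over its own redundant pair of ports.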
VSM: In a data center environment, when I hear the word “complexity,” I think of the word “risk.” By booting from the SAN, are you increasing the risk in your data center?
ESG: It’s not risk in that aspect. It truly is just the complexity of getting the initial setup up and running. It’s a new environment for most people, and it’s a new way to manage the environment. Change is really the obstacle.
In the VMware environment, you have to learn how to provide the platform, and what the platform can do, so you can run many virtual machines. Learning virtual machine operation is the same as learning physical machines for the most part. The key is learning how to get that infrastructure up and running and understanding what’s going on behind the scenes so you can address issues and not add risk into the solution as you’re implementing it.
VSM: If you’re using virtualization for a redundant data center, does that change your consideration?
ESG: In most situations, no. Not when you have multiple data centers and you’re considering site-based disaster-recovery solutions where you would lose the site and move to another physical location in another geography.
In a disaster scenario, you’re trying to weigh your costs against the need to come back online in a certain amount of time. Would there sometimes be a need to have a very highly available or very quick-to-recover system? Yes. But that can be addressed in many ways, with a substantial number of options.
VSM: If someone is considering an implementation, which way should they go? How should they evaluate cost against the need for a speedy boot?
ESG: That needs a lengthy answer. Let’s take a company that’s going to be bringing just three or four ESX servers into their environment initially, and they are going to connect to SAN. Cost is somewhat of an issue, but they’re willing to make the investment in a redundant SAN and redundant fabrics to make the SAN connect redundantly to different hosts. Each of the ESX servers would have two Fibre cards so that they would be completely redundant.
In a situation where it’s that controlled and that easy to maintain, it’s very easy just to keep internal disks in each of the four servers. It’s very smooth to install, and less of a learning curve initially, because people are comfortable installing operating systems onto physical hardware. All local, into the box.
Let’s say the cost associated with two internal disks is $1,000. To use a SAN-attached disk, you would need roughly $3,000 for Fibre cards, plus additional Fibre Channel ports, adding to the cost of making that connection. Then you would also need SAN disk, which can become very expensive.
If you had 72GB internally at $1,000, that might turn into $4,000 if it’s out of the SAN. The cost difference could be $1,000 for internal to upwards of $10,000 when you figure all the costs associated with putting a bootable environment on the SAN.
And we still haven’t talked about additional management people who have to spend time to manage that SAN configuration. As you add multiple hosts and multiple connections the storage group has to be aware of how they’re bringing that storage around and presenting it to the physical host. You add another layer of complexity beyond the standard VMware setup.
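The hardware figures in this answer can be laid out as a short Python sketch. It uses only the dollar amounts quoted above; the variable names are hypothetical, and the "unitemized" remainder is simply the gap between the itemized costs and the roughly $10,000 all-in figure, which the interview attributes to switch ports and other costs without pricing them individually.

```python
# Dollar figures quoted in the interview for a single ESX host's boot storage.
INTERNAL_BOOT_USD = 1_000        # two mirrored internal disks

san_boot_itemized = {
    "fibre_channel_cards": 3_000,  # cards needed to reach the SAN
    "san_disk_72gb": 4_000,        # the same 72GB carved out of the SAN
}
SAN_BOOT_ALL_IN_USD = 10_000     # "upwards of $10,000" once all costs are counted

itemized_total = sum(san_boot_itemized.values())
# Remainder not itemized in the interview (switch ports and similar costs).
unitemized = SAN_BOOT_ALL_IN_USD - itemized_total
multiple = SAN_BOOT_ALL_IN_USD / INTERNAL_BOOT_USD

print(f"Itemized SAN boot cost: ${itemized_total:,}")
print(f"Unitemized remainder:   ${unitemized:,}")
print(f"SAN boot vs internal:   {multiple:.0f}x")
```

None of these numbers include the staff time spent managing the SAN configuration, which the answer treats as a further cost on top.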
VSM: What kind of complexity are we talking about?
ESG: The VMware complexity is really just a new way of thinking of storage presentation. The storage administrator is often focused on presenting one disk for the SAN, and having one host as the recipient of that disk. That’s the way storage is typically presented.
With VMware we break that model, and say, “We have one disk that’s going to be running virtual machines. We need to present that to four or five different servers at the same time.” Typically that’s only done in clustering, not in standard server models. That’s why we say the VMware complexity will always be there.
VSM: Based on the questions you’ve fielded from clients, what other considerations should people be thinking about?
ESG: We’ve had people who argued that you can’t control cost and have your machine boot from SAN; it’s a contradiction. You’ve got to give up one or the other. You have to consolidate your environment and put all your disks in the SAN, or you have to choose not to spend the money.
Let’s say you had ESX servers and wanted a very quick DR site, and the ESX environment is booting from the SAN because cost wasn’t a concern when it was set up. That’s great, because if the SAN is replicating to a disaster recovery site in some other geography, you can have some gear there waiting and the storage group could have provided all the scripts.
In the event of a disaster you can fire off a couple of scripts to get through the reallocation of storage, so that at the DR site you can actually use the storage you’ve been replicating, and bring your solution back online. It adds a level of complexity, but any DR plan is going to have some level of complexity.
If we were not booting from SAN, the additional work that would need to be done at the DR site would include bringing virtual machines under scope of management. In the short term you can bring each virtual machine that you’d like to manage back in line in about a minute, so a well drawn-out plan can negate the benefit of having SAN replication.
It’s almost a moot point to boot from SAN in situations where you’re focused around DR. You can get those same objectives accomplished in most situations where the servers do not need to be recovered within the first few hours, first day, maybe even the first three days. Those types of servers don’t necessarily need to have that level of availability.
There are very high-end systems that may not even be part of the VMware solution if the requirement is to be that available. They may be higher-end boxes with global clustering enabled so there’s a cluster running at the primary site and a cluster running at the disaster site, and there’s replication occurring. You might see that in the financial sectors of the world, where if you lost a transaction there would be very serious consequences. But that’s a different conversation than just whether to boot from SAN in VMware.
Another consideration would be Blades. With Blades it’s very cost effective to aggregate all of your storage down a couple of Fibre ports. That’s one of the reasons people buy the Blade architecture, so they can aggregate the connections and simplify the cable management on the back of the boxes.
We had one customer who showed a very high interest in booting from SAN on Blade systems, and the logic was that he was going to have many remote data centers scattered around the globe, and in some of these centers there would be no technically adept person. He was putting in a three-quarters populated Blade chassis, and adding one additional Blade to sit as an idle failover Blade. Not automatic failover, just to have available when need be.
If they lost a physical piece of gear, they could fail over the operations onto that spare Blade, and just fire things back up. They would be able to do that by managing the SAN environment behind the scenes as a reactionary step, but there would be no internal disk. There would be nothing to install on that Blade. The failed Blade would be down, they would recognize that and power it off. They would bring on the failover Blade and point it to the disk to take over operation mode of that failed Blade.
Because he had no personnel, having this extra redundancy built in justified the cost of booting from SAN. That was a great argument, and it was a successful implementation as far as we’re aware.
But as a general rule, if you can boot internally and save yourself a lot of headaches and management woes in the SAN, do so; let’s not add management jobs or additional management tasks. Virtualization is supposed to make your life easier, and just because you can boot from SAN does not mean it will benefit you.
VSM: Should companies be making this decision early in the design of their architecture? Should it be architected into the implementation of virtualization?
ESG: Absolutely, because there will be costs that are significant, depending on your choices. Also, the methodologies will change depending on how you deploy and manage things. It might not be a make-or-break decision if you don’t make it on day one, but it could cause a lot of workarounds and a lot of extra cycles to be spent if you decided mid-course that you wanted a different solution.
Maybe you implemented boot-from-SAN initially, and you found that the cost was too high, but you found that out too late. If you found that on one server during your pilot, and you had already purchased all the other servers and then realized that booting from SAN was too expensive, in the large scope of things you didn’t spend much on the proof of concept, but you’ve affected the way the rest of the project is going to flow.
There’s no right or wrong answer, no one deciding factor. Some answers are more cost-effective than others, but cost does not always steer the solution, nor does functionality in a situation where DR kicks in. Of the implementations that we’ve seen that were concerned with simplifying the lives of the IT staff, easily 50% went with internal boot disks without a second consideration. The others explored it, but there wasn’t a high percentage that we’ve seen booting from SAN.
VSM: If 50% boot from internal disk, and even 25% boot from SAN, what are the other people doing?
ESG: 50% go with the choice to boot from the internal disk without much consideration, just because of the cost. The rest consider both ways, so that other 50% is up for consideration. But very few of our implementations have been focused on a boot-from-SAN environment.
*****
For more information, visit www.expertserver.com.