I’ve been thinking quite a bit lately about the role of cloud computing in scientific research (as hinted at by the title of this site). One possible flaw in my approach is that I’ve been delving into MPI-based compute as much as I can, to wrap my mind around how it works, with the notion of then applying that paradigm to cloud compute. I list it as a flaw only because I wonder whether it is time to think a bit further outside the proverbial box. I’ve been mulling over the following:
How do people actually use multiple machines to solve a problem? – This is really the root question behind all of this work. The first scenario is high-end shared-memory machines (à la Cray supercomputers), and I’m going to eliminate that type of compute from the conversation because it simply can’t be replicated well in the cloud as we currently know it. The far opposite end of the spectrum is “manual” clustering, a hand-rolled form of map/reduce: someone figures out a problem they want to solve, divvies it up amongst N nodes, individually runs a program on each node with the appropriate settings, and then manually aggregates the results. This extreme is most likely found in ad-hoc projects or among those not familiar with traditional HPC technologies and approaches. Between the two extremes sit Map/Reduce frameworks and traditional MPI programs targeted at distributed-memory systems.
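To make the “manual” end of the spectrum concrete, here is a minimal sketch of the divide, run, and aggregate steps. It is only an illustration: the function names (`split_range`, `node_job`, `run_manually`) and the sum-of-squares workload are made up for this post, and threads stand in for what would really be N separate machines with someone copying results back by hand.

```python
from concurrent.futures import ThreadPoolExecutor


def split_range(start, stop, n_nodes):
    """Divide [start, stop) into n_nodes contiguous chunks of near-equal size."""
    base, extra = divmod(stop - start, n_nodes)
    chunks = []
    lo = start
    for i in range(n_nodes):
        # The first `extra` chunks each take one leftover element.
        hi = lo + base + (1 if i < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def node_job(chunk):
    """Hypothetical per-node program: sum of squares over its own chunk."""
    lo, hi = chunk
    return sum(x * x for x in range(lo, hi))


def run_manually(start, stop, n_nodes):
    """Divvy up the work, run one job per 'node', then aggregate the partials."""
    chunks = split_range(start, stop, n_nodes)
    # Assumption: threads stand in for separate machines in this sketch.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        partials = list(pool.map(node_job, chunks))
    return sum(partials)
```

Everything a cluster scheduler or an MPI runtime would normally handle (placement, communication, failure handling) is missing here, which is exactly what makes this approach “manual”.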
Amazon’s EC2 – very easy to use for lower-throughput MPI-based HPC, given that you can get N Linux boxes for $0.10 per CPU-hour. Because of the vast community that has grown up around it, there are pre-packaged clusters (via AMIs) and even commercial vendors building businesses on top of providing HPC-style compute in EC2 in an “on-demand” fashion. Further, traditional grid computing platforms such as Nimbus have been radically adapted to provide a rather compelling local-to-cloud story for scientific HPC. It would seem that if you are working in HPC today and simply want to use an HPC cluster “in the cloud” (maybe because you lack access to sufficient hardware), Amazon’s EC2 and the toolsets that sit on top of it, such as Nimbus and others, are a natural solution.
Microsoft’s Azure – While it is a quickly changing platform (seeing as it hasn’t yet been released), and Microsoft has hinted at plans to adapt it based on customer demand, if you look at it as it currently stands there’s no obvious fit for the traditional HPC model. The Azure customer is given the choice of deploying web or worker roles, and one can imagine using worker roles in a fashion analogous to cluster nodes… but there currently isn’t any built-in infrastructure to bind those nodes into a single group/cluster. As it stands now, Azure seems to lean towards the manual approach to large-scale compute. What could change this story completely is if Microsoft decided to offer HPC Pack-enabled nodes as a type of resource you could request, although there’s been nothing to hint that they are planning anything like this.
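As a rough picture of what “worker roles used like cluster nodes” might look like, here is a small sketch of queue-driven workers. To be clear, this is not the Azure API: it uses Python’s in-process `queue.Queue` and threads as stand-ins for a cloud queue service and worker role instances, and the names `worker` and `run_workers` are hypothetical.

```python
import queue
import threading


def worker(tasks, results):
    """Stand-in for a worker role: pull tasks until a None sentinel arrives."""
    while True:
        task = tasks.get()
        if task is None:
            break
        # Hypothetical unit of work: square the input.
        results.put((task, task * task))


def run_workers(items, n_workers=3):
    """Feed items through a shared queue to n_workers independent workers."""
    tasks = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(tasks, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for item in items:
        tasks.put(item)
    for _ in threads:
        tasks.put(None)  # one shutdown sentinel per worker
    for t in threads:
        t.join()
    # All workers have exited, so draining the results queue here is safe.
    collected = {}
    while not results.empty():
        key, value = results.get()
        collected[key] = value
    return collected
```

Note that the workers never talk to each other, only to the queue. That is fine for independent tasks, but it is precisely the node-to-node binding that a traditional MPI program expects and that this model does not provide.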
Where do we go next? – I’ve been chewing on whether it makes any sense to try to push HPC-style work into Azure, or whether it should simply be relegated to the EC2s of the world… One could conceivably build an implementation of MPI that, rather than relying on the underlying cluster, would provide cloud-style/enabled communications between nodes. This could allow those most comfortable with MPI-style apps (or with large existing code bases of them) to continue using those libraries and applications. But one has to wonder: unless the Microsoft pricing (to be announced later this summer) is dramatically cheaper than that of EC2, why would anyone bother to build such a thing (other than out of academic interest, of course)? Again, this could be mitigated by Microsoft providing it itself, but the platform would have to offer additional compelling aspects to pull someone away from what would otherwise be a very comfortable transition (local cluster to an EC2-hosted cluster running the same software stacks).
What I’ve really been wondering – is whether it is time to throw MPI out altogether (or, more accurately, the programming paradigm that it represents). Is it time to look for ways to raise the level of abstraction for the computational researcher… and, if so, does something like Azure have a more interesting role? I’m wondering whether some of the abstraction tools (workflow engines, queue services, etc.) will begin to have a role, or whether we need to continue to stretch for every raw bit of horsepower from the system (accepting that abstraction layers are paid for in reduced raw system power). For many of the large simulation models it seems that the raw horsepower is indeed necessary. You also cannot simply ignore the vast collection of existing tools and libraries that target this paradigm. The flip side of the question: is there a body of computational research for which, if the cost per CPU-hour were low enough and the increase in development productivity great enough (assuming the proposed layers of abstraction actually delivered it), it would not really matter if the job took an additional 30-50% of the time to run? This is, of course, only salient if we live in a world wherein I can get however many compute nodes I want whenever I want them (no waiting in queue).
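The trade-off in that last question can be put as back-of-the-envelope arithmetic. The sketch below assumes the $0.10 per CPU-hour figure mentioned for EC2 and a slowdown of 30-50%; the node count, job length, and developer hourly rate in the usage note are invented numbers purely for illustration, as are the function names.

```python
def extra_compute_cost(nodes, base_hours, rate_per_cpu_hour, slowdown):
    """Extra money spent because the job runs `slowdown` (e.g. 0.3 to 0.5) longer.

    Assumes one CPU per node and billing strictly proportional to run time.
    """
    return nodes * base_hours * rate_per_cpu_hour * slowdown


def break_even_dev_hours(extra_cost, dev_rate_per_hour):
    """Developer hours the abstraction must save to pay for the slower run."""
    return extra_cost / dev_rate_per_hour
```

For example, with a hypothetical 100-node job that takes 10 hours on the MPI version, `extra_compute_cost(100, 10, 0.10, 0.5)` prices the worst-case 50% slowdown, and `break_even_dev_hours` of that amount at whatever hourly figure you assign to a researcher’s time tells you how much development effort the higher-level tooling would need to save per run before the slower job stops mattering.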
My gut tells me we aren’t quite there yet, but I wonder how far out it really is?