The Accelerator Era for AI/ML Data Centers
Major shifts in the data center market, led by Generative AI/ML, are driving several challenges for traditional data center designs. Public cloud costs and security concerns have pushed many companies to begin cloud repatriation to a colo or on-prem data center, while many new GPU cloud providers are building specialized cloud infrastructure for accelerator-based applications. Yet despite the emergence of the #AcceleratorEra for AI/ML data centers, data center architects are left with long-established interconnect architectures that are not well suited to the modern AI/ML data center.
The #AcceleratorEra presents modern data center architects with unprecedented scale challenges. Specialized workflow needs, connectivity, power, cooling, and team skill-set requirements all hinder AI/ML infrastructure deployment. Traditional vendor solutions force forklift upgrades and fail to address the root cause of the accelerator scale-out challenge: legacy vendors dictate infrastructure design, forcing users to squeeze their applications into it rather than the other way around. The result is networks built for general-purpose compute, the #HypervisorEra, which increase cost, power, and complexity while promoting resource inefficiency.
A Better Interconnect Design
Drut’s DynamicXcelerator gives customers the freedom to build data center solutions that match both their workload requirements and their budgetary constraints, all within a multi-vendor, dynamically reconfigurable computing infrastructure. Our key technology benefit, “vPODs” (virtual PODs), provides dynamic slicing of data center resources based on software workloads.
Bringing Brilliance to AI with the Power of Photonics
All this is made possible by the innovative use of Drut’s photonic fabric. The DynamicXcelerator is a protocol-agnostic connectivity solution that provides dynamically reconfigurable, low-latency direct paths between resource units inside a data center. Our industry has used optics in point-to-point links and talked about photonics for many years, but breakthrough advances in photonic technology now allow it to be built at enterprise price and data center scale. Connectivity solutions continue to follow the “all-to-all” and “spray-and-pray” model for a combination of reasons: it is how things have always been done, and incumbent suppliers protect the legacy business model. The AI revolution, however, is changing the dynamics in such a way that this legacy model has become a hindrance to the industry.
The characteristic that makes AI workloads so well suited to photonics is that the traffic patterns between GPUs are predictable within the model training and inference cycles. As a result, for shared environments you can build several smaller topologies, each of which matches these traffic patterns and each of which is far simpler than a single shared topology.
Photonic fabrics provide several compelling advantages to AI deployments. Beyond scalability and availability, the key advantages are the modular nature of the fabric, the lower power consumption of the photonic switches, faster workload scheduling, and better workload isolation. The dynamic nature of the fabric, its continued reuse, and its tunability deliver a further operating advantage.
The diagram on the right shows the physical construct of an optimized 3D Torus (16x16x16), as well as a representative example of topology slices. A total of 64 cubes, with 64 nodes each, forms a 4k (4096) node cluster as a single availability zone (AZ) with three additional AZs possible, bringing the overall data center capacity to 16k nodes.
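The capacity figures above can be checked with a few lines of arithmetic. This is an illustrative sketch only; the torus and cube dimensions are taken from the text (a 16x16x16 torus of 64-node cubes), and the variable names are ours, not Drut's.

```python
# Capacity math for the 3D Torus described above (illustrative sketch).
TORUS_DIMS = (16, 16, 16)   # nodes along each torus axis, per the text
CUBE_DIMS = (4, 4, 4)       # assumed cube axes giving 64 nodes per cube

nodes_per_az = TORUS_DIMS[0] * TORUS_DIMS[1] * TORUS_DIMS[2]  # 16^3 = 4096
nodes_per_cube = CUBE_DIMS[0] * CUBE_DIMS[1] * CUBE_DIMS[2]   # 4^3 = 64
cubes_per_az = nodes_per_az // nodes_per_cube                 # 4096 / 64 = 64
availability_zones = 4                                        # 1 AZ + 3 more
total_nodes = nodes_per_az * availability_zones               # 4 * 4096 = 16384

print(cubes_per_az, nodes_per_az, total_nodes)  # 64 4096 16384
```

The numbers line up with the text: 64 cubes of 64 nodes form a 4k-node AZ, and four AZs bring the data center to 16k nodes.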
AI / Machine Learning Software Life Cycle
The Drut approach recognizes that as application workloads require ever-changing resources, it is time to deploy a dynamic, open-looped resource scheduling architecture for your applications. How do you carve out the available resources? In the typical model used by many research institutions, resources are clustered by how and when they were deployed: a user can schedule cluster 1 with 4x GPUs, cluster 2 with 8x or 16x GPUs, and so on. The reservation of resources has little to do with the wants or needs of the user, and if the available clusters do not match those needs, significant effort is required to modify them. The result is a poor combination of underutilized and capacity-limited resources, leaving everyone dissatisfied.
The Drut solution introduces dynamic cluster creation into an open-looped workflow, allowing the research team building a data model to provide input on their needs, so that the Drut software and system can expand or shrink the cluster resources just in time for their operations to run. Because training is often a multi-phase operation, the results of each phase can be fed back to the research team, who adjust their input and allow Drut to modify the data center resources available to them.
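The expand-and-shrink loop described above can be sketched as a simple resource pool. This is a hedged illustration of the concept only; the class, method names, and GPU counts are hypothetical and do not represent Drut's actual software interface.

```python
# Minimal sketch of just-in-time cluster resizing (hypothetical API).
class ResourcePool:
    def __init__(self, total_gpus):
        self.free = total_gpus   # GPUs not yet assigned to any cluster
        self.clusters = {}       # cluster name -> GPUs currently assigned

    def resize(self, name, gpus):
        """Expand or shrink a named cluster to exactly `gpus` GPUs."""
        current = self.clusters.get(name, 0)
        delta = gpus - current
        if delta > self.free:
            raise RuntimeError("not enough free GPUs in the pool")
        self.free -= delta       # shrinking (delta < 0) returns GPUs
        self.clusters[name] = gpus

pool = ResourcePool(total_gpus=64)
pool.resize("train-phase-1", 16)  # carve out 16 GPUs just in time
pool.resize("train-phase-1", 32)  # expand for a heavier training phase
pool.resize("train-phase-1", 8)   # shrink back, returning 24 to the pool
print(pool.free)  # 56
```

The point of the sketch is the feedback loop: each phase's requirements drive a resize, rather than the user being forced into whatever fixed clusters happen to exist.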
Research at various academic organizations has shown that the parallelism strategy (data, tensor, or pipeline) influences the traffic matrix, which can be used as a critical criterion in building dynamic resources. This means we have already reached the point where software can aid researchers in determining the resources they need.
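The link between parallelism strategy and traffic matrix can be made concrete with two toy examples, assuming a ring-style all-reduce for data parallelism and stage-to-stage activation flow for pipeline parallelism. These functions are our own illustration, not part of any Drut or framework API.

```python
# Toy traffic matrices for two parallelism strategies (illustrative only).
# m[i][j] = 1 means GPU i sends traffic to GPU j during the cycle.

def ring_allreduce_matrix(n):
    """Data parallelism: each GPU exchanges gradients with its ring neighbour."""
    m = [[0] * n for _ in range(n)]
    for i in range(n):
        m[i][(i + 1) % n] = 1   # send to the next GPU in the ring
    return m

def pipeline_matrix(stages):
    """Pipeline parallelism: activations flow only from stage i to stage i+1."""
    m = [[0] * stages for _ in range(stages)]
    for i in range(stages - 1):
        m[i][i + 1] = 1
    return m
```

Both matrices are sparse and predictable, which is exactly the property that lets a small, direct-path topology carry the workload instead of an all-to-all fabric.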
Drut’s solution takes a well-rounded system approach, organizing constituent components into easily consumable blocks that can be built to mirror the physical requirements of your data center and are flexible enough to map onto your applications as needs change.
These blocks can be seen as movable parts: build the system as you desire, then expand and upgrade as needed. Living with the DynamicXcelerator is a much different experience than living with static legacy box architectures. Systems can be composed, upgraded, and altered on demand. Resources can be taken out of service; new resources, such as new GPUs, can be added and then composed into nodes. Need more GPUs, more FPGAs, or perhaps a new GPU vendor to try? Simply add the new GPUs to the Photonic Resource Unit (PRU) and put them into production. Need more bandwidth in the fabric five years from today? Most likely we have you covered: you will be upgrading the FIC 2500 to the FIC 4500, or something like that. The fabric is rate agnostic, and new FICs will operate just fine.
You can now think of your private cloud data center in terms of the aggregate resources available, knowing that you can carve out the sections of resources your workloads require. There is flexibility in the placement of the blocks, so you can meld them with the physical requirements of your data center. For example, you can place power-hungry, cooling-intensive components (such as GPUs) within racks that have the appropriate power and cooling available, saving you time and money since most of your data center can remain untouched. If you have upgraded power on one side of your data center but not on the other, the DynamicXcelerator can stretch the distance, allowing for better fidelity of power distribution among resources.
Drut Technologies Inc.
200 Innovative Way, Suite 1360, Nashua, New Hampshire 03062, United States
©2024 Drut Technologies Inc. All Rights Reserved.