
How to fix AI, machine learning bottlenecks


Ghassan Azzi, Sales Director for Africa at Western Digital, has discussed the transformative impact of artificial intelligence (AI) and machine learning (ML) on data analysis, and the insights and automation capabilities these technologies offer.

In his recent Thought Leadership article, “Three Architecture Tips for Storage Environments Primed for AI/ML,” Azzi noted that AI’s proliferation has disrupted traditional business models.

The data fuelling these technologies is spread across data warehouses, data lakes, the cloud, and on-premises data centres, and organisations must ensure this critical information remains accessible and analysable for their AI initiatives.


“Organizations increasingly rely on AI to enhance customer experiences, streamline operations, and drive innovation. To maximize AI’s benefits, it is crucial to adopt advanced storage architectures.

“NVMe over Fabrics (NVMe-oF™) provides the low-latency, high-throughput access needed for AI workloads, accelerating performance and reducing potential bottlenecks.

“Implementing disaggregated storage offers greater flexibility, enabling the independent scaling of storage and compute resources to maximize utilization. Failure to implement suitable architecture and integrate AI can leave businesses behind in a data-driven world.”
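For readers who want a concrete picture of the NVMe-oF access Azzi describes, below is a minimal sketch of attaching a remote namespace on a Linux host with nvme-cli installed. The transport, address, port, and subsystem NQN are illustrative placeholders, not values from the article:

```python
import subprocess

# Illustrative placeholders -- substitute your fabric's real values.
TRANSPORT = "tcp"                     # NVMe-oF also runs over RDMA and Fibre Channel
TARGET_ADDR = "192.0.2.10"            # documentation-range IP, not a real target
TARGET_PORT = "4420"                  # conventional NVMe/TCP port
SUBSYSTEM_NQN = "nqn.2024-01.example.com:ai-datasets"  # hypothetical subsystem NQN

def connect_nvmeof_namespace() -> None:
    """Attach a remote NVMe-oF namespace so it appears as a local block device."""
    subprocess.run(
        [
            "nvme", "connect",
            "--transport", TRANSPORT,
            "--traddr", TARGET_ADDR,
            "--trsvcid", TARGET_PORT,
            "--nqn", SUBSYSTEM_NQN,
        ],
        check=True,
    )

if __name__ == "__main__":
    connect_nvmeof_namespace()
    # Once connected, the namespace shows up as /dev/nvmeXnY and can be
    # mounted or handed to an AI data loader like any local NVMe drive.
```

The point of the disaggregated design is that this attachment is a fabric operation: storage capacity and bandwidth can grow on the target side without any change to the compute servers.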


Considerations in Deploying Machine Learning Models

Azzi noted that organisations are under constant pressure to extract maximum value from their data quickly and cost-efficiently, without disrupting regular business operations.

According to him, relying on commodity storage, whether on-premises or in the cloud, is no longer ideal. High-performance, flexible, and scalable computing environments are needed to support the real-time processing demands of modern AI workflows. Efficient, purpose-built data storage is critical, requiring considerations for data volume, velocity, variety, and veracity.
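As a rough illustration of the "velocity" consideration, a back-of-envelope sizing like the one below shows how quickly dataset size and training cadence translate into a sustained-throughput requirement that commodity storage may struggle to meet. All numbers are illustrative assumptions, not figures from the article:

```python
# Back-of-envelope storage sizing for a training workload.
# All inputs are illustrative assumptions, not figures from the article.

dataset_tb = 50              # training corpus size in terabytes
epoch_hours = 2              # target wall-clock time to stream one full epoch
num_gpus = 8                 # accelerators reading in parallel

dataset_gb = dataset_tb * 1000
required_gbps = dataset_gb / (epoch_hours * 3600)   # sustained GB/s for the whole job
per_gpu_gbps = required_gbps / num_gpus             # share each GPU's data path must sustain

print(f"Sustained read throughput needed: {required_gbps:.1f} GB/s")
print(f"Per-GPU share of that stream:     {per_gpu_gbps:.2f} GB/s")
```

Under these assumptions the job needs roughly 7 GB/s of sustained reads, which is why the velocity of data, not just its volume, drives the storage decision.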

Organisations can now build public cloud-like infrastructures in on-premises data centres, providing the flexibility and scalability of the cloud along with the control and cost efficiency of private infrastructure.

“Properly architected, these environments offer a more efficient way to support the high-performance, scalable requirements of storage environments primed for AI applications. Repatriating AI/ML datasets to on-premises data centers from the cloud can be an ideal option for organizations operating within certain performance or cost limits.”
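The repatriation argument is ultimately a break-even calculation. A toy comparison is sketched below; every price in it is a placeholder assumption chosen only to show the shape of the analysis, not a real cloud or hardware quote:

```python
# Toy break-even comparison for repatriating an AI/ML dataset on-premises.
# Every price below is a placeholder assumption for illustration only.

dataset_tb = 200
cloud_storage_per_tb_month = 20.0    # $/TB-month, hypothetical object-storage rate
cloud_egress_per_tb = 90.0           # $/TB, hypothetical one-time egress charge
onprem_capex = 150_000.0             # hypothetical up-front cost of on-prem NVMe storage
onprem_opex_month = 2_000.0          # hypothetical power/cooling/admin per month

def cumulative_cost_cloud(months: int) -> float:
    return dataset_tb * cloud_storage_per_tb_month * months

def cumulative_cost_onprem(months: int) -> float:
    one_time = onprem_capex + dataset_tb * cloud_egress_per_tb  # egress paid once
    return one_time + onprem_opex_month * months

# Find the first month where on-prem becomes cheaper, if it ever does.
for month in range(1, 121):
    if cumulative_cost_onprem(month) < cumulative_cost_cloud(month):
        print(f"On-prem breaks even at month {month}")
        break
else:
    print("No break-even within 10 years under these assumptions")
```

With these placeholder figures the crossover lands around month 85; real workloads sit on either side of such a line, which is the "certain performance or cost limits" caveat in Azzi's remark.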


Building an On-Premises Storage Environment for AI Applications

Azzi outlined three key considerations when building on-premises storage environments suited to the needs of today’s AI/ML-powered world.

He said that AI applications require significant computing resources to process and analyse ML datasets efficiently, making the selection of a suitable server architecture crucial, with particular attention to the ability to scale GPU resources without creating system bottlenecks.

“It is also important to include high-performance storage networking that can meet and exceed GPUs’ ever-increasing performance demands while providing scalable capacity and throughput to meet learning model data set sizes and performance requirements.

“Storage solutions that take advantage of direct path technology enable direct GPU-to-storage communication, bypassing the CPU to enhance data transfer speeds, reduce latency, and improve utilization.

“Finally, solutions should be hardware and protocol agnostic, providing multiple ways to connect servers and storage to the network, as the interoperability of the infrastructure is essential for building a flexible environment primed for AI applications.”
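The direct GPU-to-storage path described above is exposed to applications through NVIDIA's GPUDirect Storage. Below is a minimal sketch of what such a read into GPU memory can look like, assuming RAPIDS KvikIO (Python bindings over NVIDIA's cuFile API) and CuPy are installed and the file sits on a GDS-capable filesystem; the file path and buffer size are hypothetical:

```python
import cupy as cp
import kvikio  # RAPIDS KvikIO: Python bindings over NVIDIA's cuFile/GDS API

FILE_PATH = "/mnt/nvmeof/train_shard_000.bin"   # hypothetical dataset shard on NVMe-oF
NUM_FLOATS = 256 * 1024 * 1024                  # 1 GiB of float32 samples (illustrative)

# Allocate the destination buffer directly in GPU memory.
gpu_buffer = cp.empty(NUM_FLOATS, dtype=cp.float32)

# With GPUDirect Storage available, this read moves data from NVMe storage
# straight into GPU memory, bypassing a bounce buffer in host RAM; without
# GDS, KvikIO falls back to a CPU-mediated copy.
with kvikio.CuFile(FILE_PATH, "r") as f:
    bytes_read = f.read(gpu_buffer)

print(f"Read {bytes_read / 1e9:.2f} GB into GPU memory")
```

Bypassing the CPU in this way is what lets storage bandwidth, rather than host memory copies, become the scaling knob for feeding GPUs.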

Building a New Architecture

According to the sales director, building public cloud-like infrastructures on-premises can offer organisations the flexibility and scalability of the cloud while maintaining the control and cost efficiency of private infrastructure. However, the right storage architecture decisions must be made with AI considerations in mind, providing the necessary combination of compute power and storage capacity for AI applications to operate at the speed of business.


“One way to ensure proper resource allocation and reduce bottlenecks is through storage disaggregation, which allows for independent scaling of storage, ensuring GPU saturation and efficient performance in AI/ML workloads.

“Western Digital’s RapidFlex™ technology, Ingrasys’ ES2100 with integrated NVIDIA Spectrum™ Ethernet switches, and NVIDIA’s GPUs, Magnum IO™ GPUDirect Storage, and ConnectX® SmartNICs combine to offer the performance, scalability, and agnostic architecture required for building on-premises supercomputing environments for AI/ML applications.

“These technologies create a direct data path between NVMe-oF storage and GPU memory, driving high performance and efficient utilization of storage and GPU resources,” he said. “Western Digital’s proof of concept demonstrates simple independent scaling of storage bandwidth to maximize GPU workloads, achieving over 100 GB/s for multiple NVIDIA A100 GPUs.”
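To put that 100 GB/s figure in perspective, a quick sanity check shows how aggregate fabric bandwidth maps onto GPU saturation. The per-GPU ingest rate below is an assumed planning number for illustration, not a Western Digital or NVIDIA specification:

```python
# Sanity check: how much storage bandwidth keeps a GPU pool saturated?
# The per-GPU ingest rate is an assumed planning figure, not a vendor spec.

per_gpu_ingest_gbps = 12.5      # assumed sustained data appetite of one A100-class GPU
demonstrated_fabric_gbps = 100  # aggregate figure cited in the proof of concept

gpus_saturated = int(demonstrated_fabric_gbps // per_gpu_ingest_gbps)
print(f"{demonstrated_fabric_gbps} GB/s of fabric bandwidth can feed about "
      f"{gpus_saturated} GPUs at {per_gpu_ingest_gbps} GB/s each")

# Because storage is disaggregated, adding fabric bandwidth raises this number
# without touching the compute nodes -- the independent-scaling point above.
```

Under that assumption, the demonstrated bandwidth would keep roughly eight such GPUs fed, and scaling the storage tier alone would raise the ceiling further.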