Cerebras presents its 2-exaflop AI supercomputer

Generative AI is eating the world.

That’s how Andrew Feldman, CEO of Silicon Valley AI computer maker Cerebras, introduces his company’s latest achievement: an AI supercomputer capable of executing 2 billion billion operations per second (2 exaflops). The system, dubbed Condor Galaxy 1, is on track to double in size within 12 weeks. In early 2024, it will be joined by two more systems of twice that size. Cerebras plans to keep adding Condor Galaxy installations next year until it is operating a network of nine supercomputers capable of 36 exaflops in total.
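
For a rough sense of how those headline numbers fit together, here is a back-of-the-envelope sketch in Python. It uses only the figures quoted above (2 exaflops for the current 32 CS-2s, a doubling to 64, and nine planned systems); the per-CS-2 figure it derives is an approximation implied by those numbers, not an official Cerebras spec.

    # Back-of-the-envelope arithmetic using only the figures quoted above.
    EXAFLOP = 1e18                          # floating-point operations per second

    cg1_flops_now = 2 * EXAFLOP             # Condor Galaxy 1 today, with 32 CS-2s
    flops_per_cs2 = cg1_flops_now / 32      # implied per-CS-2 AI compute (approximate)

    cg1_flops_full = flops_per_cs2 * 64     # after doubling to 64 CS-2s
    network_flops = 9 * cg1_flops_full      # nine such systems

    print(f"~{flops_per_cs2 / 1e15:.1f} petaflops per CS-2")               # ~62.5
    print(f"~{cg1_flops_full / EXAFLOP:.0f} exaflops per full system")     # ~4
    print(f"~{network_flops / EXAFLOP:.0f} exaflops across nine systems")  # ~36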

If large language models and other generative AIs are eating the world, Cerebras’ plan is to help them digest it. And the Silicon Valley company is not alone. Other AI-focused computer makers are building massive systems around either their own specialized processors or Nvidia’s latest GPU, the H100. While it’s hard to judge the size and capabilities of most of these systems, Feldman says Condor Galaxy 1 is already among the largest.

Condor Galaxy 1, assembled and running in just ten days, consists of 32 Cerebras CS-2 computers and is set to expand to 64. The next two systems, to be built in Austin, Texas, and Asheville, North Carolina, will also house 64 CS-2s each.

At the heart of every CS-2 is the Wafer Scale Engine 2, an AI-specific processor with 2.6 trillion transistors and 850,000 AI cores, built on an entire silicon wafer. The chip is so large that the memory, bandwidth, and compute resources of the new supercomputers quickly reach somewhat ridiculous scales.

In case you didn’t find those numbers overwhelming enough, here’s another one: there are at least 166 trillion transistors in Condor Galaxy 1.
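
That figure lines up with the chip numbers above, assuming the fully expanded 64-CS-2 configuration and counting only the Wafer Scale Engine 2s themselves (supporting hardware adds more, hence the “at least”):

    # 2.6 trillion transistors per Wafer Scale Engine 2, 64 CS-2s once expanded
    wse2_transistors = 2.6e12
    total = 64 * wse2_transistors
    print(f"~{total / 1e12:.1f} trillion transistors")  # ~166.4 trillion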

One of Cerebras’ biggest advantages in building large AI supercomputers is how easily it can scale up resources, Feldman says. For example, a 40-billion-parameter network can be trained in roughly the same time as a 1-billion-parameter network if 40 times the hardware resources are devoted to it. Importantly, such scaling requires no additional lines of code. Achieving linear scaling has historically been very difficult because of the challenge of partitioning large neural networks so that they run efficiently. “We scale linearly from 1 to 32 [CS-2s] with a keystroke,” he says.
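
To make that linear-scaling claim concrete, here is a deliberately simplified cost model, not Cerebras software; the function name and the cost constant are hypothetical. It treats estimated training time as proportional to parameter count and inversely proportional to the number of CS-2s, so 40 times the model on 40 times the hardware takes roughly the same wall-clock time.

    # Toy illustration of the linear-scaling claim; not Cerebras software.
    def estimated_training_time(params_billions, cs2_count, hours_per_billion_per_cs2=10.0):
        """Idealized model: cost grows with parameter count and divides
        evenly across CS-2s, with no extra code or parallelization effort."""
        return params_billions * hours_per_billion_per_cs2 / cs2_count

    small = estimated_training_time(params_billions=1, cs2_count=1)
    large = estimated_training_time(params_billions=40, cs2_count=40)
    print(small, large)  # both 10.0 hours: same time under this idealized model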

The Condor Galaxy series is owned by Abu Dhabi-based G42, a holding company with nine AI-based businesses, including G42 Cloud, one of the largest cloud computing providers in the Middle East. However, Cerebras will manage the supercomputers and may lease out capacity that G42 does not use for its own work.

Demand for training large neural networks has soared, according to Feldman. The number of companies training neural network models with 50 billion or more parameters has risen from 2 in 2021 to more than 100 this year, he says.

Of course, Cerebras isn’t the only one going after companies that need to train really big neural networks. Big players like Amazon, Google, Meta, and Microsoft have their own offerings. Clusters built around Nvidia GPUs dominate much of this business, but some of these companies have developed their own AI silicon, such as Google’s TPU series and Amazon’s Trainium. Cerebras also has startup competitors building their own accelerators and AI computers, including Habana (now part of Intel), Graphcore, and SambaNova.

Meta, for example, built its AI Research SuperCluster with more than 6,000 Nvidia A100 GPUs; a planned second phase would push the cluster to 5 exaflops. Google built a system containing 4,096 of its TPU v4 accelerators totaling 1.1 exaflops. That system trained BERT, a natural-language network much smaller than today’s LLMs, in just over 10 seconds. Google also offers Compute Engine A3, powered by Nvidia H100 GPUs and a custom infrastructure processing unit built with Intel. Cloud provider CoreWeave, working with Nvidia, tested a 3,584-GPU H100 system that trained a benchmark representing the GPT-3 large language model in just over 10 minutes. And in 2024, Graphcore plans to build a 10-exaflop system called the Good Computer, made up of more than 8,000 of its Bow processors.

