
> Rather than each of K startups individually buying clusters of N gpus, together we buy a cluster with NK gpus... Then we set up a job scheduler to allocate compute

In theory, this sounds almost identical to the business model behind AWS, Azure, and other cloud providers: "Instead of everyone buying a fixed amount of hardware for individual use, we'll buy a massive pool of hardware that people can time-share." Outside of cloud providers having to mark up prices to give themselves a net margin, is there something else they are failing to do, hence creating the need for these projects?


Couple things, mostly pricing and availability:

1) Margins. Public cloud investors expect a certain margin profile. They can’t compete with Lambda/Fluidstack’s margins.

2) Networking. To an extent the big clouds also have worse networking for LLM training. I believe only Azure has InfiniBand. Oracle offers 3200 Gbps but not InfiniBand; same for AWS, I believe. Not sure about GCP, but I believe their A100 networking speeds were only 100 Gbps rather than 1600. Whereas Lambda, Fluidstack, and CoreWeave all have InfiniBand.

3) Availability. Nvidia isn’t giving big clouds the allocation they want.


What is your differentiator from Lambda? That you are smaller and in a single DC?

Sincere question.


I'm not OP/submitter, but the main differentiator is that Lambda doesn't have on-demand availability for lots of interlinked H100s - you have to reserve them.

Lambda has "Lambda Sprint" which is kinda similar,[1] but Sprint is $4.85/GPU/hr instead of <$2.

So if you want 128 GPUs for a week, you can't use Lambda reserved (3-year term) and you can't use Lambda on-demand (you can't get 128 A/H100s on-demand). Your options are Lambda Sprint or SF Compute, and SF Compute is offering significantly lower prices.
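To make the price gap concrete, here's the back-of-the-envelope math for that hypothetical 128-GPU, one-week job at the two rates quoted above (Sprint's $4.85/GPU/hr vs. the sub-$2 rate; actual prices will vary):

```python
# Cost comparison for a hypothetical 128-GPU, one-week training run,
# using the per-GPU-hour rates quoted in the comment above.
gpus = 128
hours = 7 * 24  # one week = 168 hours

sprint_cost = gpus * hours * 4.85  # Lambda Sprint rate
alt_cost = gpus * hours * 2.00     # ~$2/GPU/hr rate

print(f"Lambda Sprint: ${sprint_cost:,.0f}")  # $104,294
print(f"At $2/GPU/hr:  ${alt_cost:,.0f}")     # $43,008
print(f"Difference:    ${sprint_cost - alt_cost:,.0f}")
```

Roughly $60k of savings on a single week-long run, which is why the pricing matters so much at this scale.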

[1]: https://lambdalabs.com/service/gpu-cloud/reserved


Low margins and “will this thing still be around in 2 years” are negatively correlated.

Where’s the capital for upgrades, repairs, and replacements coming from?


Using investors' money to build something with low to zero margin until you capture enough value to make it profitable a few years down the line has been the core SV strategy for more than a decade now, so it's not an extraordinary plan.

Of course it doesn't always work, and it may be even harder to make it work in the current macroeconomic environment, but it's still pretty standard play.


They are working on this. All the major clouds have initiatives to do short term requests/reservations. It’s just not a feature that has ever been of much use pre-GenAI. How often do you need to request 1000 CPU nodes for 48 hours in a single zone?

Secondly, there is a fundamental question of resource sharing here. Even with this project by Evan and AI Grant (the second such cluster created by AI Grant btw), the question will arise — if one team has enough money to provision the entire cluster forever, why not do it? What are the exact parameters of fair use? In networking, we have algorithms around bandwidth sharing (TCP Fairness, etc.) that encode sharing mechanisms but they don’t work for these kinds of chunky workloads either.
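To illustrate why bandwidth-style fairness breaks down here, below is a minimal max-min fair ("water-filling") allocator, the kind of sharing TCP-style fairness encodes. The function name and the demand numbers are illustrative, not from any real scheduler:

```python
# Max-min fair (water-filling) allocation for *divisible* demands.
def max_min_fair(capacity, demands):
    """Split `capacity` across `demands`: nobody gets more than they asked
    for, and leftover capacity is shared equally among the unsatisfied."""
    alloc = [0.0] * len(demands)
    remaining = sorted(range(len(demands)), key=lambda i: demands[i])
    cap = float(capacity)
    while remaining:
        share = cap / len(remaining)
        i = remaining[0]
        if demands[i] <= share:
            alloc[i] = float(demands[i])  # fully satisfied; leftovers recycle
            cap -= demands[i]
            remaining.pop(0)
        else:
            for j in remaining:           # everyone left gets an equal share
                alloc[j] = share
            return alloc
    return alloc

# 512 GPUs, three teams demanding 100, 400, and 600 GPUs:
print(max_min_fair(512, [100, 400, 600]))  # [100.0, 206.0, 206.0]
```

This is fine for bandwidth, where 206 out of 400 is still useful throughput. But a training job that needs 400 interlinked GPUs to run at all can't do anything with a "fair" 206, which is exactly why fractional-share algorithms don't transfer to these chunky, all-or-nothing workloads.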

But over the next few months, AWS and others are working to release queueing services that let you temporarily provision a chunk of compute, probably with upfront payment, and at a high expense (perhaps above the on demand rate).


> It’s just not a feature that has ever been of much use pre-GenAI. How often do you need to request 1000 CPU nodes for 48 hours in a single zone?

I would argue this has always been a common case for cloud GPU compute.


AWS and Azure would slit their own throats before they created a way for their customers to pool instances to save money.

They want to do that themselves, and keep the customer relationship and the profits, instead of giving them to a middleman or the customer.


It’s just corporate profits combined with market forces, not some sort of malicious conspiracy.

You can rent a 2-socket AMD server with 120 available cores and RDMA for something like 50c to $2 per hour. That’s just barely above the cost of the electricity and cooling!

What do you want, free compute just handed to you out of the goodness of their hearts?

There is incredible demand for high-end GPUs right now, and market prices reflect that.


You mentioned malicious conspiracy, not me.

It's just business and I'd do the same if I was in charge of AWS.


> You can rent a 2-socket AMD server with 120 available cores and RDMA for something like 50c to $2 per hour.

Source required



Sorry, where exactly are these $0.50 many-core servers you speak of?


Azure's HB120rs_v3 size is about 36c per hour right now with Spot pricing in East US. These use 3rd generation AMD EPYC "Milan" processors.

The instances with the 4th generation "Genoa-X" processors (HB176rs_v4) cost about $2.88 per hour. The HX176rs_v4 model with 1.7 TB of memory is $3.46 per hour.
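Working those quoted prices out per core (taking the core counts implied by the size names, and noting that spot/list prices vary by region and over time):

```python
# Rough $/core-hour for the Azure HB/HX sizes quoted above.
# Prices are the ones stated in the comment; core counts follow the
# size names (120 and 176 cores).
instances = {
    "HB120rs_v3 (spot)": (0.36, 120),
    "HB176rs_v4":        (2.88, 176),
    "HX176rs_v4":        (3.46, 176),
}
for name, (price, cores) in instances.items():
    print(f"{name}: ${price / cores:.4f}/core-hr")
# HB120rs_v3 (spot): $0.0030/core-hr
# HB176rs_v4:        $0.0164/core-hr
# HX176rs_v4:        $0.0197/core-hr
```

So the spot HBv3 works out to about a third of a cent per core-hour, which is what the "barely above electricity and cooling" claim upthread is getting at.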

https://learn.microsoft.com/en-us/azure/virtual-machines/hbv...

https://learn.microsoft.com/en-us/azure/virtual-machines/hbv...

https://learn.microsoft.com/en-us/azure/virtual-machines/hx-...


Are these actually attainable, as in I can log in and launch an instance with these specifications right now, or are they just listings? I ask because literally last week I was unable to launch similar instances on AWS despite those specs being listed as available and online.


I could. Availability tends to be region-dependent with all clouds.


Where can you get 120 cores for $2/hr?



AWS and Azure both charge by the hour anyway, so pooling wouldn't change much; but if you wanted to, you could use Reserved Instances and just put the startups' accounts in the same organisation.

A large part of the profit comes from the upfront risk of buying machines. With this, you are just absorbing that risk yourself, which may be better if the startup expects to last.



