
> Rather than each of K startups individually buying clusters of N gpus, together we buy a cluster with NK gpus... Then we set up a job scheduler to allocate compute

In theory, this sounds almost identical to the business model behind AWS, Azure, and other cloud providers: "Instead of everyone buying a fixed amount of hardware for individual use, we'll buy a massive pool of hardware that people can time-share." Outside of cloud providers having to mark up prices to give themselves a net margin, is there something else they are failing to do, hence creating the need for these projects?


Couple things, mostly pricing and availability:

1) Margins. Public cloud investors expect a certain margin profile. They can’t compete with Lambda/Fluidstack’s margins.

2) Networking. To an extent the big clouds also have worse networking for LLM training. I believe only Azure has InfiniBand. Oracle offers 3200 Gbps but not InfiniBand; same for AWS, I believe. Not sure about GCP, but I believe their A100 networking speeds were only 100 Gbps rather than 1600. Whereas Lambda, Fluidstack, and CoreWeave all have InfiniBand.

3) Availability. Nvidia isn’t giving big clouds the allocation they want.


What is your differentiator from Lambda? That you are smaller and in a single DC?

Sincere question.


I'm not OP/submitter, but the main differentiator is that Lambda doesn't have on-demand availability for lots of interlinked H100s - you have to reserve them.

Lambda has "Lambda Sprint" which is kinda similar,[1] but Sprint is $4.85/GPU/hr instead of <$2.

So if you want 128 GPUs for a week, you can't use Lambda reserved (3-year term) and you can't use Lambda on-demand (you can't get 128 A/H100s on-demand). Your options are Lambda Sprint or SF Compute, and SF Compute is offering significantly lower prices.
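To make the price gap concrete, here's the back-of-the-envelope math for that hypothetical 128-GPU, one-week job at the two rates quoted above (Sprint's $4.85/GPU/hr vs. the sub-$2 rate; actual prices will vary):

```python
# Cost comparison for a hypothetical 128-GPU, one-week training run,
# using the per-GPU-hour rates quoted in the comment above.
gpus = 128
hours = 7 * 24  # one week = 168 hours

sprint_cost = gpus * hours * 4.85  # Lambda Sprint rate
alt_cost = gpus * hours * 2.00     # ~$2/GPU/hr rate

print(f"Lambda Sprint: ${sprint_cost:,.0f}")  # $104,294
print(f"At $2/GPU/hr:  ${alt_cost:,.0f}")     # $43,008
print(f"Difference:    ${sprint_cost - alt_cost:,.0f}")
```

Roughly $60k of savings on a single week-long run, which is why the pricing matters so much at this scale.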

[1]: https://lambdalabs.com/service/gpu-cloud/reserved


Low margins and “will this thing still be around in 2 years” are negatively correlated.

Where’s the capital for upgrades, repairs, and replacements coming from?


Using investors' money to build something with low to zero margin until you capture enough value to make it profitable a few years down the line has been the core SV strategy for more than a decade now, so it's not an extraordinary plan.

Of course it doesn't always work, and it may be even harder to make it work in the current macroeconomic environment, but it's still pretty standard play.


They are working on this. All the major clouds have initiatives to do short term requests/reservations. It’s just not a feature that has ever been of much use pre-GenAI. How often do you need to request 1000 CPU nodes for 48 hours in a single zone?

Secondly, there is a fundamental question of resource sharing here. Even with this project by Evan and AI Grant (the second such cluster created by AI Grant btw), the question will arise — if one team has enough money to provision the entire cluster forever, why not do it? What are the exact parameters of fair use? In networking, we have algorithms around bandwidth sharing (TCP Fairness, etc.) that encode sharing mechanisms but they don’t work for these kinds of chunky workloads either.
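To illustrate why bandwidth-style fairness breaks down here, below is a minimal max-min fair ("water-filling") allocator, the kind of sharing TCP-style fairness encodes. The function name and the demand numbers are illustrative, not from any real scheduler:

```python
# Max-min fair (water-filling) allocation for *divisible* demands.
def max_min_fair(capacity, demands):
    """Split `capacity` across `demands`: nobody gets more than they asked
    for, and leftover capacity is shared equally among the unsatisfied."""
    alloc = [0.0] * len(demands)
    remaining = sorted(range(len(demands)), key=lambda i: demands[i])
    cap = float(capacity)
    while remaining:
        share = cap / len(remaining)
        i = remaining[0]
        if demands[i] <= share:
            alloc[i] = float(demands[i])  # fully satisfied; leftovers recycle
            cap -= demands[i]
            remaining.pop(0)
        else:
            for j in remaining:           # everyone left gets an equal share
                alloc[j] = share
            return alloc
    return alloc

# 512 GPUs, three teams demanding 100, 400, and 600 GPUs:
print(max_min_fair(512, [100, 400, 600]))  # [100.0, 206.0, 206.0]
```

This is fine for bandwidth, where 206 out of 400 is still useful throughput. But a training job that needs 400 interlinked GPUs to run at all can't do anything with a "fair" 206, which is exactly why fractional-share algorithms don't transfer to these chunky, all-or-nothing workloads.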

But over the next few months, AWS and others are working to release queueing services that let you temporarily provision a chunk of compute, probably with upfront payment, and at a high expense (perhaps above the on demand rate).


> It’s just not a feature that has ever been of much use pre-GenAI. How often do you need to request 1000 CPU nodes for 48 hours in a single zone?

I would argue this has always been a common case for cloud GPU compute.


AWS and Azure would slit their own throats before they created a way for their customers to pool instances to save money.

They want to do that themselves, and keep the customer relationship and the profits, instead of giving them to a middleman or the customer.


It’s just corporate profits combined with market forces, not some sort of malicious conspiracy.

You can rent a 2-socket AMD server with 120 available cores and RDMA for something like 50c to $2 per hour. That’s just barely above the cost of the electricity and cooling!

What do you want, free compute just handed to you out of the goodness of their hearts?

There is incredible demand for high-end GPUs right now, and market prices reflect that.


You mentioned malicious conspiracy, not me.

It's just business and I'd do the same if I was in charge of AWS.


> You can rent a 2-socket AMD server with 120 available cores and RDMA for something like 50c to $2 per hour.

Source required



Sorry, where exactly are these $0.50 many-core servers you speak of?


Azure's HB120rs_v3 size is about 36c per hour right now with Spot pricing in East US. These use 3rd generation AMD EPYC "Milan" processors.

The instances with the 4th generation "Genoa-X" processors (HB176rs_v4) cost about $2.88 per hour. The HX176rs_v4 model with 1.7 TB of memory is $3.46 per hour.
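Working those quoted prices out per core (taking the core counts implied by the size names, and noting that spot/list prices vary by region and over time):

```python
# Rough $/core-hour for the Azure HB/HX sizes quoted above.
# Prices are the ones stated in the comment; core counts follow the
# size names (120 and 176 cores).
instances = {
    "HB120rs_v3 (spot)": (0.36, 120),
    "HB176rs_v4":        (2.88, 176),
    "HX176rs_v4":        (3.46, 176),
}
for name, (price, cores) in instances.items():
    print(f"{name}: ${price / cores:.4f}/core-hr")
# HB120rs_v3 (spot): $0.0030/core-hr
# HB176rs_v4:        $0.0164/core-hr
# HX176rs_v4:        $0.0197/core-hr
```

So the spot HBv3 works out to about a third of a cent per core-hour, which is what the "barely above electricity and cooling" claim upthread is getting at.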

https://learn.microsoft.com/en-us/azure/virtual-machines/hbv...

https://learn.microsoft.com/en-us/azure/virtual-machines/hbv...

https://learn.microsoft.com/en-us/azure/virtual-machines/hx-...


Are these actually attainable, as in I can log in and launch an instance with these specifications right now, or are they just listings? I ask because literally last week I was unable to launch similar instances on AWS despite those specs being listed as available and online.


I could. Availability tends to be region-dependent with all clouds.


Where can you get 120 cores for $2/hr?



AWS and Azure both charge by the hour anyway, so pooling wouldn't change much; but if you wanted to, you could use Reserved Instances and just put the startups' accounts in the same organisation.

A large part of the profit comes from the upfront risk of buying machines. With this, you are just absorbing that risk yourself, which may be better if the startup expects to last.



