
I hope you succeed. TPU research cloud (TRC) tried this in 2019. It was how I got my start.

In 2023 you can barely get a single TPU for more than an hour. Back then you could get literally hundreds, with an s.

I believed in TRC. I thought they’d solve it by scaling, and building a whole continent of TPUs. But in the end, TPU time was cut short in favor of internal researchers — some researchers being more equal than others. And how could it be any other way? If I made a proposal today to get these H100s to train GPT to play chess, people would laugh. The world is different now.

Your project has a youthful optimism that I hope you won’t lose as you go. And in fact it might be the way to win in the long run. So whenever someone comes knocking, begging for a tiny slice of your H100s for their harebrained idea, I hope you’ll humor them. It’s the only reason I was able to become anybody.


> Your project has a youthful optimism that I hope you won’t lose as you go. And in fact it might be the way to win in the long run.

This is the nicest thing anyone has said to us about this. We're gonna frame this and hang it on our wall.

> So whenever someone comes knocking, begging for a tiny slice of your H100s for their harebrained idea, I hope you’ll humor them.

Absolutely! :D


Optimism is (almost) always required in order to accomplish anything of significance. Those who lose it aren't living up to their potential.

I'm not encouraging the false belief that everything you do will work out. Instead I'm encouraging the realization that the greatest accomplishments almost always feel like long shots, and require significant amounts of optimism. Fear and pessimism, while helpful in appropriate doses, will limit you greatly in life if you let them rule you too significantly.

When I look back on my life, the greatest accomplishments I've achieved are ones where I was naive yet optimistic going into it. This was a good thing, because I would have been too scared to try had I really known the challenges that lay ahead.


>Optimism is (almost) always required in order to accomplish anything of significance. Those who lose it, aren't living up to their potential.

I argue that realism trumps optimism. It's perfectly normal in a realist framing to see something difficult, acknowledge the high risk and potential for failure, and still pursue it with the intent to succeed.

I've personally grown tired of the over-optimism everywhere, because it creates unrealistic situations and distributes the consequences of failure inequitably. The "visionary" is rewarded when the rare success occurs, while everyone else suffers the consequences of the many failures. No contingency plans for failure, no discussion of failure, and so on. Optimism just takes any idea and pursues it, with the consequences left as someone else's problem.

Pessimism isn't much better: you essentially think everything is too risky or too unlikely to succeed, so you never do anything. You live in a state of inaction because any level of risk or uncertainty is too much.

To me, realism is much better. You acknowledge the challenge. You acknowledge the risk. You make sure everyone involved understands it, but you still charge forward knowing you might succeed. Some think that if you're not naively optimistic (what most people in my experience mean by "optimism") you don't create enough pressure. I think that's nonsense.


While I've said something like this comment scores of times in my life, and it's definitely a necessary corrective for a lot of optimists who don't think too hard about how they think, I don't think it's a useful place to stop. It's not hard to get unanimous agreement with "be a realist!" because it's framed so the alternative is irrationality/delusion. But even among people who agree that the goal should be to reason under uncertainty and assess risks clearly, there will be a spectrum of risk tolerance, and I don't think it's the worst thing ever to describe that as "optimism" vs. "pessimism"! (I fully acknowledge this isn't the dominant usage, but I think some spaces lean this way)

In this context, I tend to read the parent claim as something like, "great success requires willingness to sometimes take worse-than-even odds or pursue modestly-negative-EV opportunities". I'm not sure I agree with the strongest version of that, but I think it's likely that the space of risky paths to great achievement is richer than that of cautious ones.


If everyone were a realist, we wouldn't have half the advances we do, because what counts as "real" keeps getting proven wrong through innovation. After all, isn't that disruption? :)

Sam Altman talks about this quite frequently: it's not intelligence or luck that's necessary for an enduring innovation. It's persistence in the face of inevitable setbacks, and a high tolerance for being proven wrong and still persisting.


Everybody knows socialism is impossible though. Can't work, not worth trying, don't even think about it.


Realism doesn't work in business. Business success requires 10 people to try for 1 person to succeed. If those 10 people were realists, they wouldn't try.


Depends how much that one person wins and how much the others lose.


You can be a realist visionary.


Absolutely, which is what I advocate for.


YC startup founder here,

Mostly agree, except the market is not an optimistic place — it’s the market.

There are a multitude of reasons you lose your optimism, mostly because people take it away: your optimism is their money.


I like this quote from Napoleon on taking risks: “If the art of war were nothing but the art of avoiding risks, glory would become the prey of mediocre minds... I have made all the calculations; fate will do the rest."


To me the payoff of failed projects is in what I learned. As long as that's the case I can carry my optimism over into new projects.


What a beautiful and articulate thought. Thank you.


Actually, the TPU Research Cloud program is still going strong! We've expanded the compute pool significantly to include Cloud TPU v4 Pod slices, and larger projects still use hundreds of chips at a time. (TRC capacity has not been reclaimed for internal use.)

Check out this list of recent TRC-supported publications: https://sites.research.google/trc/publications/

Demand for Cloud TPUs is definitely intense, so if you're using preemptible capacity, you're probably seeing more frequent interruptions, but reserved capacity is also available. Hope you email the TRC support team to say hello!


Zak, I love you buddy, but you should have some of your researchers try to use the TRC program. They should pretend to be a nobody (like I was in 2019) and try to do any research with the resources they’re granted. I guarantee you those researchers will all tell you “we can’t start any training runs anymore because the TPUs die after 45 minutes.”

This may feel like an anime betrayal, since you basically launched my career as a scientist. But it’s important for hobbyists and tinkerers to be able to participate in the AI ecosystem, especially today. And TRC just does not support them anymore. I tried, many times, over the last year and a half.

You don’t need to take my word for it. Here’s some unfiltered DMs on the subject: https://imgur.com/a/6vqvzXs

Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying.

I held out hope for so long. I thought it was temporary. It ain’t temporary, Zak. And I vividly remember when it happened. Some smart person in google proposed a new allocation algorithm back near the end of 2021, and poof, overnight our ability to make TPUs went from dozens to a handful. It was quite literally overnight; we had monitoring graphs that flatlined. I can probably still dig them up.

I’ve wanted to email you privately about this, but given that I am a small fish in a pond that’s grown exponentially bigger, I don’t think it would’ve made a difference. The difference is in your last paragraph: you allocate reserved instances to those who deserve it, and leave everybody else to fight over 45 minutes of TPU time when it takes 25 minutes just to create and fill your TPU with your research data.

Your non-preemptible TPUs are frankly a lie. I didn’t want to drop the L word, but a TPUv3 in euw4a will literally delete itself — aka preempt — after no more than a couple hours. I tested this over many months. That was some time ago, so maybe things have changed, but I wouldn’t bet on it.

There’s some serious “left hand doesn’t know that right hand detached from its body and migrated south for the winter” energy in the TRC program. I don’t know where it embedded itself, but if you want to elevate any other engineers from software devs to researchers, I urge you to make some big changes.

One last thing. The support staff of TRC is phenomenal. Jonathan Colton has worked more miracles than I can count, along with the rest of his crew. Ultimately he had to send me an email like “by the way, TRC doesn’t delete TPUs. This distinction probably won’t be too relevant, but I wanted to let you know” (paraphrasing). Translation: you took the power away from the people who knew where to put it (Jonathan) and gave it to some really important researchers, probably in Brain or some other division of Google. And the rest is history. So I don’t want to hear that one of the changes is “ok, we’ve punished the support staff” - as far as I can tell, they’ve moved mountains with whatever tools they had available, and I definitely wouldn’t have been able to do any better in their shoes.

Also, hello. Thanks for launching my career. Sorry that I had to leave this here, but my duty is to the open source community. The good news is that you can still recover, if only you’d revert this silly “we’ll slip you some reserved TPUs that don’t kamikaze themselves after 45 minutes if you ask in just the right way” stuff. That wasn’t how the program was in 2019, and I guarantee that I couldn’t have done the work I did then under the current conditions.


A few quick comments:

> But it’s important for hobbyists and tinkerers to be able to participate in the AI ecosystem

Totally agree! This was a big part of my original motivation for creating the TPU Research Cloud program. People sometimes assume that e.g. an academic affiliation is required to participate, but that isn't true; we want the program to be as open as possible. We should find a better way to highlight the work of TRC tinkerers - for now, the GitHub and Hugging Face search buttons near the top of https://sites.research.google/trc/publications/ provide some raw pointers.

I'm sorry to hear that you've personally had a hard time getting TPU v3 capacity in europe-west4-a. In general, TRC TPU availability varies by region and by hardware generation, and we've experimented with different ways of prioritizing projects. It's possible that something was misconfigured on our end if your TPU lifetimes were so short. Could you email Jonathan the name of the project(s) you were using and any other data you still have handy so we can figure out what was going wrong?

Also, thanks for the kind words for Jonathan and the rest of the TRC team. They haven't lost any power or control, and they are allocating a lot more Cloud TPU capacity than ever. However, now that everyone wants to train LLMs, diffusion models, and other exciting new things, demand for TPU compute is way up, so juggling all of the inbound TRC requests is definitely more challenging than it used to be.


It’s not euw4a. It’s everywhere. The allocation algorithm across the board kills off TPUs after no more than a couple hours. usc1f, usc1a, usc1c, euw4a; it makes no difference.

It would be funny if someone had flagged gpt-2-15b-poetry (our project) in some special way that prevents us from making TPUs that ever last more than a few hours, but from what I’ve heard from other people, this isn’t the case. That’s what I mean about the left hand not knowing what the right hand is doing. It’s not a misconfiguration. Again, pretend to be some random person who just wants to apply for TPU access, fill out your form, then try to do research with the TPUs that are available to you. You’ll have a rough time, but it’ll also cure this misconception that it’s a special case or was just me.

Again, no need to take my word for it; here’s an organic comment from someone who was rolling their eyes whenever I was cheerleading TRC, because their experience was so bad: https://news.ycombinator.com/item?id=36936782

I think that the experience is probably great for researchers who get special approval. And that’s fine, if that’s how the program is designed to be. But at least tell people that they shouldn’t expect more than an hour or two of TPU time.


It sounds like you're primarily using preemptible TPU quota, which doesn't come with any availability or uptime expectations at all.

By default, the TRC program grants both on-demand quota and preemptible quota. If you are able to create a TPU VM with your on-demand quota, it should last quite a bit longer than a few hours. (There are situations in which on-demand TRC TPU VMs can be interrupted, but these ought to be rare.) If your on-demand TPU VMs are being interrupted frequently, please email TRC support and provide the names of the TPU hosts that were interrupted so folks can try to help.

When there is very high demand for Cloud TPUs, it's certainly possible for preemptible TPU VMs to be interrupted frequently. It would be an interesting engineering project to make a very robust training system that could make progress even with low TPU VM uptime, and I hope someone does it! Until then, though, you should have a better experience with on-demand resources when you're able to create them. Reserved capacity is even better since it provides an expectation of both availability and uptime.
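The interruption-tolerant training system described above mostly comes down to checkpointing aggressively and resuming from the last checkpoint on restart. A minimal sketch of the idea, with purely illustrative names (`train_step`, `checkpoint.json`, and the step counts are all made up for this example, not any real TRC or TPU API):

```python
import json
import os

CKPT = "checkpoint.json"   # illustrative path; real runs would checkpoint to durable storage (e.g. GCS)
TOTAL_STEPS = 1000
SAVE_EVERY = 50

def load_step():
    # Resume from the last saved step if a checkpoint exists.
    if os.path.exists(CKPT):
        with open(CKPT) as f:
            return json.load(f)["step"]
    return 0

def save_step(step):
    # Write to a temp file, then rename, so a preemption mid-write
    # can't leave a corrupt checkpoint behind.
    tmp = CKPT + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"step": step}, f)
    os.replace(tmp, CKPT)

def train_step(step):
    pass  # real work (forward/backward/optimizer update) would go here

def run():
    step = load_step()
    while step < TOTAL_STEPS:
        train_step(step)
        step += 1
        if step % SAVE_EVERY == 0:
            save_step(step)
    save_step(step)
    return step
```

If the VM is killed at any point, relaunching `run()` loses at most `SAVE_EVERY` steps of work, which is what makes low-uptime TPU VMs usable at all for long runs.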


I was using on-demand TPUs primarily, and preemptible TPUs secondarily. Neither would last more than an hour or two. And two was something of a minor miracle by the end.


For future reference, the team looked into this, and it appears that the interruptions you experienced were specific to your project and a small number of other projects. The vast majority of TRC projects should see much longer Cloud TPU uptimes when they are able to create on-demand TPUs.

I'm sorry that you had such a frustrating time and that we weren't able to sort it out via email while it was happening. If you decide to try TRC again and run into issues like this, please be sure to engage with TRC support!


> You don’t need to take my word for it. Here’s some unfiltered DMs on the subject: https://imgur.com/a/6vqvzXs

> Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying.

Unless I'm misreading this they sound pretty happy and you sound pessimistic? Their last substantial comment was "I'm sure Zak could hook you up with something better"?


TRC is supposed to be the “something better”. This insider TPU stuff is for the birds. If TRC can only offer 4 hours with no preemptions, that’s fine, but they need to be up front about that. Saying that TPUs preempt every 24 hours and then killing them off after 45 minutes is… not very productive.

As for their comments, the third screenshot is the key; they’re agreeing that the situation is bad. They’re a friend, and they’re a little indirect in the way they phrase things. (If you’ve ever had a friend who really doesn’t want to be wrong, you know what I mean; they say things in a circular way in order to agree without agreeing. After a while it’s pretty cute and endearing, though.)

I was particularly pessimistic in those DMs because it came a couple months after I thought I’d give TRC one last try, back in January, which was roughly a year after I’d started my “ok, I’m losing hope, but I’ll wait and see” journey. In the meantime I kept cheerleading TRC and driving people to their signup page. But after the TPUs all died in less than two hours yet again, that was that.

I have a really high tolerance for faulty equipment. This is free compute; me complaining is just ungrateful. But I saw what things were like in 2019. “Different” would be the understatement of the century. If my baby wasn’t being incubated in the NICU today, I’d show the charts where our usage went from thousands of cores down to almost zero, and not for lack of trying.

It also would’ve been fine to say “sorry, this is unsustainable, the new limits are one tpu per person per project” and then give me a rock solid tpu. We had those in 2021. One of our TPUv3s stayed online for so long that I started to host my blog on it just to show people that TPUs were good for more than AI; the uptime was measured in months. Then poof, now you can barely fire one up.


I don't have a qualified opinion on the subject of TPU availability.

I'm just pointing out that your summary of the DMs ("Notice how their optimism dries up, and not because I was telling them how bad TRC has become. It’s because their TPUs kept dying") is the opposite of what the DMs show.


As mentioned in another comment, it sounds like you're using preemptible TRC TPU quota. If you use on-demand TRC TPU quota instead, that should improve your uptime substantially.


This is totally fascinating.

Frankly, it sounds to me like they're having severe yield+reliability problems with the TPUv4s that aren't getting caught by wafer-level testing, and have binned the flakiest ones for use by outsiders.

A lot of yield issues show up as spontaneous resets/crashes.


It's more likely Google preempting researchers who are on a preemptible research grant, and it's happening a lot more often because there are more paying customers.


"Preemptable money" sounds like the kind of bullshit I would use to cover up failed chips. And yes, I am a VLSI engineer.


The main problem with the TPU Research Cloud is that you get dragged down a LOT by the buggy TPU API: not just the Google Cloud API being awful, but the TensorFlow/JAX/PyTorch support being awful too. You also basically must use Google Cloud Storage, which is slow and can be really expensive to get anything into or out of.

The Googlers maintaining the TPU GitHub repo also just basically don't care about your PR unless it's somehow gonna help them in their own perf review.

In contrast with a GPU-based grid, you can not only run the latest & greatest out-of-the-box but also do a lot of local testing that saves tons of time.

Finally, the OP here appears to be offering real customer engagement, which is totally absent from my own GCloud experiences across several companies.


Could you share a few technical details about the issues you've encountered with TF / JAX / PyTorch on Cloud TPUs? The overall Cloud TPU user experience improved a whole lot when we enabled direct access to TPU VMs, and I believe the newer JAX and PyTorch integrations are improving very rapidly. I'd love to know which issues are currently causing the most friction.


Wow! I never thought you’d see the light. All I ever see from your posts is praise for TRC. As someone who got started way later on, I had infinitely more success with a gaming GPU I owned myself. Obviously not really comparable, but TRC was very very difficult to work with. I think I only ever had access to a TPUv3 once and that wasn’t nearly enough time to learn the ropes.

My understanding was that this situation changed drastically depending on what sort of email you had or how popular your Twitter handle was.


My experience has been different. Considering how easy the application is, I think they're still being fairly generous: I've been offered multiple v3-8s and v3-32s for 30 days, as well as preemptible v3-64s for 28 days, for a few different projects within the last 6 months.

Are you affiliated with an academic institution? Otherwise I'm not sure why they've been more generous with me; my projects have been mildly interesting at best.

They're certainly a lot stingier with larger pods than they used to be though.


What Shawn says is absolutely right. The race right now is way too hot for this stuff. A single customer will eat up 512 GPUs for 3 years.


> In 2023 you can barely get a single TPU for more than an hour

Oh come on, colab gives TPU access in the free tier for a whole half day. No need to exaggerate the shortage


> In 2023 you can barely get a single TPU for more than an hour.

Um. Can't you order them from coral.ai and put them in an NVMe slot? Or are the cloud TPUs more powerful?


TPU pods are not sold by Google; the Edge TPU is a different product.


So the cloud TPUs are more powerful...? Or what are you saying?


Yeah, it’s a silly branding thing.

One TPU (not even a pod, just a regular old TPUv2) has 96 CPU cores with 1.4TB of RAM, and that’s not even counting their hardware acceleration. I’d love to buy one.


Huh, this doesn't seem right. Based on those numbers you seem to be referring to pods, but even then I'm not familiar with such a configuration existing.

A single TPUv2 chip has 1 core and 8 GB of memory. A single device comes in the v2-8 configuration, with 8 cores and 64 GB of memory.

Pod variants come in v2-32 through v2-512 configurations.


A single TPUv2 host has 8 TPU cores with 64GB of total HBM (8GB per core), but like GPUs, TPUs can't directly access a network, so the host also needs CPUs and standard RAM to send data to them. They are fast, and the host has to be fast enough to keep them fed with data, so the host is pretty beefy. But FWIW, a TPUv2 host has somewhere around 330GB of RAM, not 1.4TB.
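To make the per-host memory math in this correction concrete (figures are the ones quoted in the comment; the host RAM number is described as approximate):

```python
# TPUv2 accelerator memory, per the figures above:
cores_per_host = 8
hbm_per_core_gb = 8
total_hbm_gb = cores_per_host * hbm_per_core_gb
print(f"{total_hbm_gb} GB of HBM per v2-8 host")
# The ~330 GB figure is separate host RAM for feeding data
# to the accelerators, not accelerator HBM.
```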


Thanks for clarifying, I misinterpreted the commenter as referring to the accelerator as the conversation was about TPU availability for purchase.

I know just enough about the architecture to facilitate using TPUs for research training runs but I'm not sure what's so special about the host?

Sure it's beefy but there are much beefier servers readily available.


There's nothing super-special about the host. The accelerators are the special part (and, as described elsewhere, they are orders of magnitude more powerful than the Edge TPU). However, if you're an academic/independent researcher, being able to access a system with that much system memory/CPU cores for free through TPU Research Cloud is potentially appealing even without the accelerators.


Edge TPUs are low cost, low power inference devices the size of a dime. I have a hundred of them sitting in a closet. (Alas. Anyone want to buy 100 coral minis? :-)

The TPUs you rent that are being discussed here are capable of training, consume hundreds of watts and have a heatsink bigger than your fist and really spectacular network links. They're analogous to Nvidia's highest end GPUs from a "what can you do with them" perspective.

Both are custom chips for deep learning but they're completely different beasts.


Can I hook a microphone up to a Coral Mini and run Whisper? I'd love to have a home assistant that wasn't on the cloud.

As for the rest of them, list them on Amazon and let them do the fulfillment. That $10k of hardware isn't going to sell itself from your closet. (Yet. LLMs are making great strides.)


It has a microphone built in.

And that's a good idea, thanks. I've been dreading the idea of using ebay.


They are entirely different chips - like an order of magnitude apart in terms of transistor count and die size.


yes



