No, they can't run it. Llama 70B with 4-bit quantization takes ~50 GB of VRAM for a decent context size. You need an A100, 2-3 V100s, or 4x 3090s, all of which rent for roughly $3-5/h.
Which model? I am clearly talking about 70B, as mentioned. 70B at 8-bit is 70 GB just for the model weights. How many tokens/second are you getting with a single 4090?
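Back-of-envelope, in case it helps (a rough sketch, not exact numbers; the Llama-2-70B architecture constants are assumptions from memory, and real setups add quantization metadata, activations, and framework overhead on top):

```python
# Rough VRAM estimate for a 70B model; all constants are approximate.
# Assumed Llama-2-70B architecture: 80 layers, 8 KV heads of dim 128
# (grouped-query attention), ~70e9 parameters.

PARAMS = 70e9

def weight_gb(bits_per_weight: float) -> float:
    """Memory for the weights alone, ignoring quantization overhead."""
    return PARAMS * bits_per_weight / 8 / 1e9

def kv_cache_gb(context_len: int, bytes_per_elem: int = 2) -> float:
    """FP16 KV cache: 2 (K and V) * layers * kv_heads * head_dim * context."""
    layers, kv_heads, head_dim = 80, 8, 128
    return 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem / 1e9

for bits in (4, 8, 16):
    print(f"{bits}-bit weights: ~{weight_gb(bits):.0f} GB")
print(f"KV cache at 4k context: ~{kv_cache_gb(4096):.1f} GB")
```

That works out to roughly 35 GB of weights at 4-bit and 70 GB at 8-bit, which is why a single 24 GB card isn't enough for 70B even before counting the cache and runtime overhead.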
That is more than an order of magnitude better than my experience; I get around 2 t/s with similar hardware. I had also seen others reporting similar figures to mine, so I assumed it was normal. Is there a secret to what you're doing?