Any time I see a claim that "our 7B model is better than GPT-4," I basically stop reading. If you're going to make that claim, give me several easily digestible examples of it actually happening.
Anecdotally, I finetuned Mistral 7B for a specific (and slightly unusual) natural language processing task just a few days ago. GPT-4 can do the task, but it needs a long complex prompt and only gets it right about 80-90% of the time - the finetuned model performs significantly better with fewer tokens. (In fact it does so well that I suspect I could get good results with an even smaller model.)
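For anyone curious what that kind of fine-tune looks like in practice, here's a rough sketch using Hugging Face transformers + peft. The dataset file and all hyperparameters are placeholders, not my actual setup:

```python
# Rough LoRA fine-tune sketch; "task.jsonl" and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# LoRA adapters keep the trainable parameter count tiny relative to 7B.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# One {"text": "<prompt + desired completion>"} object per line.
ds = load_dataset("json", data_files="task.jsonl")["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
            remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments("mistral-task-lora",
                           per_device_train_batch_size=2,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=ds,
    # mlm=False gives the standard causal-LM objective (labels = inputs).
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```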
I have a fine-tuned version of Mistral doing a really simple task and spitting out some JSON. I'm getting performance equivalent to GPT-4 on that specialized task, and it's lower latency, outputs more tokens/sec, and is more reliable, private, and completely free.
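Reliability for a JSON-emitting task like this is easy to measure mechanically. A minimal sketch (the required keys are a hypothetical schema, not my actual one):

```python
import json

REQUIRED_KEYS = {"name", "date", "amount"}  # hypothetical schema

def is_valid(output: str) -> bool:
    """True if the model emitted parseable JSON with the expected keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

# outputs = [run_model(p) for p in test_prompts]    # however you do inference
# print(sum(map(is_valid, outputs)) / len(outputs))  # reliability estimate
```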
I don't think we'll have an open-source GPT-4 for a long time, so this is sorta clickbait. But for small, specialized tasks, tuned on high-quality data, we are already in the "Linux" era of OSS models: they can do real, practical work.
What I think they’re claiming is that it’s a base model aimed at further fine-tuning, and that when tuned it might perform better than GPT-4 on certain tasks.
It’s an argument they make at least as much to market fine-tuning as to market their own model.
This is not a generic model that outperforms another generic model (GPT-4).
That can of course have useful applications because the resource/cost is then comparatively minuscule for certain business use cases.
IDK about GPT-4 specifically, but I recently witnessed a case where small fine-tuned 7Bs greatly outperformed larger models (Mixtral Instruct, Llama 70B fine-tunes) on a few very specific tasks.
There is nothing unreasonable about this. However, I do dislike it when that information is presented in a fishy way, implying that a model "outperforms GPT-4" without any qualification.
(Post author here). Totally fair concern. I'll find some representative examples on a sample task we've done some fine-tuning on and add them to the post.
EDIT: Ok so the prompt and outputs are long enough that adding them to the post directly would be kind of onerous. But I didn't want to leave you waiting, so I copied an example into a Notion doc you can see here: https://opipe.notion.site/PII-Redaction-Example-ebfd29939d25...
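For anyone who doesn't want to click through: the task is PII redaction. The shape of it is roughly the following (illustrative only; this is not the actual prompt or output from the linked doc, and the placeholder format is my own):

```python
# Illustrative only; not the real prompt or output from the Notion doc.
prompt = (
    "Replace every piece of personally identifiable information in the "
    "text below with a typed placeholder such as [NAME], [EMAIL], [PHONE].\n\n"
    "Text: Contact Jane Roe at jane.roe@example.com or 555-0142."
)
# A correct completion would be:
# "Contact [NAME] at [EMAIL] or [PHONE]."
```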
Yeah, a 7B foundation model is of course going to be worse when expected to perform on every task.
But finetuning on just a few tasks?
Depending on the task, it's totally reasonable to expect that a 7B model might eke out a win against stock GPT-4, especially if there's domain knowledge in the fine-tune and the task doesn't lean heavily on logical reasoning.
I agree, I think they need an example or two on that blog post to back up the claim. I'm ready to believe it, but I need something more than "diverse customer tasks" to understand what we're talking about.
You can fine-tune a small model yourself and see. GPT-4 is an amazing general model, but it won’t perform best at every task you throw at it out of the box. I have a fine-tuned Mistral 7B model that outperforms GPT-4 on a specific type of structured data extraction. Maybe if I fine-tuned GPT-4 it could win, but that costs a lot of money for what I can now do locally for the cost of electricity.
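If you want to try the "locally for the cost of electricity" part, a minimal sketch with the llama-cpp-python bindings (the GGUF filename and prompt format are placeholders, not my actual model):

```python
# Local inference sketch; the GGUF export and prompt are hypothetical.
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-extractor.Q4_K_M.gguf",
            n_ctx=4096, n_threads=8)

out = llm("Extract the fields below as JSON.\n\n<document text>\n\nJSON:",
          max_tokens=256, temperature=0.0)
print(out["choices"][0]["text"])
```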
Not for translation. I did a lot of experimenting with different local models; none come even close to the capabilities of ChatGPT. Most local models just output plainly wrong information. I'm still hoping one day it will be possible, because it would be a huge opportunity for our business.
The BTL (Bradley-Terry-Luce) model is just a way to infer "true" skill levels given some list of head-to-head comparisons, so the head-to-head comparisons/rankings are what matter most (see the sketch below). And in this case, the rankings come from GPT-4 itself, so take any subsequent score with all the grains of salt you can muster.
Their methodology also appears to be "try 12 different models and hope one of them wins out." Multiple-hypothesis adjustments come to mind here :)
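For anyone unfamiliar, here's a tiny sketch of what BTL fitting actually does, using the classic Zermelo/MM fixed-point updates. The win counts are toy numbers; the point is that the recovered scores are entirely downstream of whatever judge produced the wins:

```python
import numpy as np

# wins[i][j] = number of times model i beat model j (toy numbers).
wins = np.array([[0, 7, 9],
                 [3, 0, 6],
                 [1, 4, 0]], dtype=float)

p = np.ones(len(wins))            # initial skill estimates
n = wins + wins.T                 # total games between each pair
for _ in range(100):              # Zermelo/MM fixed-point iteration
    denom = (n / np.add.outer(p, p)).sum(axis=1)
    p = wins.sum(axis=1) / denom
    p /= p.sum()                  # scores are only defined up to scale

print(p)  # P(i beats j) is modeled as p[i] / (p[i] + p[j])
```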
Try a few blind comparisons: Mixtral 8x7B-Instruct and GPT-4 are 50-50 for me, and it outperforms GPT-3.5 almost every time. And you can run inference on it with a modern CPU and 64 GB of RAM on a personal device, lmfao. The instruct fine-tuning also had nowhere near the money and RLHF that OpenAI puts in. It's not a done deal, but people will be able to run models better than today's SOTA on <$1000 hardware in <3 months. I hope for their own sake that OpenAI is moving fast.
How are local models better in terms of trust? GPT-4 is the only model I've seen actually tuned to say no when it doesn't have the information being asked for. Though I do agree it ran better earlier this year.
The best open source has to offer is Mixtral, which will confidently make up a biography of a person it's never heard of, or write a script that imports nonexistent libraries.
I once asked Llama whether it’d heard of me. It came back with such a startlingly detailed and convincing biography of someone almost but not quite entirely unlike me that I began to wonder if there was some kind of Sliding Doors alternate reality thing going on.
Some of the things it said I’d done were genuinely good ideas, and I might actually go and do them at some point.
Their second sentence is at least the most honest framing I've seen so far: "averaged across 4 diverse customer tasks, fine-tunes based on our new model are _slightly_ stronger than GPT-4, as measured by GPT-4 itself."