Any time I see a claim that "our 7B model is better than GPT-4," I basically stop reading. If you're going to make that claim, give me several easily digestible examples of it actually happening.
Anecdotally, I finetuned Mistral 7B for a specific (and slightly unusual) natural language processing task just a few days ago. GPT-4 can do the task, but it needs a long complex prompt and only gets it right about 80-90% of the time - the finetuned model performs significantly better with fewer tokens. (In fact it does so well that I suspect I could get good results with an even smaller model.)
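For anyone curious what that kind of fine-tune looks like in practice, here's a rough sketch using Hugging Face transformers + peft. The dataset file and all hyperparameters are placeholders, not my actual setup:

```python
# Rough LoRA fine-tune sketch; "task.jsonl" and hyperparameters are placeholders.
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

base = "mistralai/Mistral-7B-v0.1"
tok = AutoTokenizer.from_pretrained(base)
tok.pad_token = tok.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# LoRA adapters keep the trainable parameter count tiny relative to 7B.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM"))

# One {"text": "<prompt + desired completion>"} object per line.
ds = load_dataset("json", data_files="task.jsonl")["train"]
ds = ds.map(lambda ex: tok(ex["text"], truncation=True, max_length=1024),
            remove_columns=ds.column_names)

Trainer(
    model=model,
    args=TrainingArguments("mistral-task-lora",
                           per_device_train_batch_size=2,
                           num_train_epochs=3, learning_rate=2e-4),
    train_dataset=ds,
    # mlm=False gives the standard causal-LM objective (labels = inputs).
    data_collator=DataCollatorForLanguageModeling(tok, mlm=False),
).train()
```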
I have a fine-tuned version of Mistral doing a really simple task and spitting out some JSON. I'm getting performance equivalent to GPT-4 on that specialized task, and it's lower latency, outputs more tokens/sec, and is more reliable, private, and completely free.
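Reliability for a JSON-emitting task like this is easy to measure mechanically. A minimal sketch (the required keys are a hypothetical schema, not my actual one):

```python
import json

REQUIRED_KEYS = {"name", "date", "amount"}  # hypothetical schema

def is_valid(output: str) -> bool:
    """True if the model emitted parseable JSON with the expected keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return False
    return isinstance(obj, dict) and REQUIRED_KEYS <= obj.keys()

# outputs = [run_model(p) for p in test_prompts]    # however you do inference
# print(sum(map(is_valid, outputs)) / len(outputs))  # reliability estimate
```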
I don't think we'll have an open-source GPT-4 for a long time, so this is sorta clickbait. But for small, specialized tasks, tuned on high-quality data, we are already in the "Linux" era of OSS models: they can do real, practical work.
What I think they’re claiming is that it’s a base model aimed at further fine-tuning, and that when tuned it might perform better than GPT-4 on certain tasks.
It’s an argument they make at least as much to market fine-tuning as to market their own model.
This is not a generic model that outperforms another generic model (GPT-4).
That can of course have useful applications because the resource/cost is then comparatively minuscule for certain business use cases.
IDK about GPT-4 specifically, but I recently witnessed a case where small fine-tuned 7Bs greatly outperformed larger models (Mixtral Instruct, Llama 70B fine-tunes) on a few very specific tasks.
There is nothing unreasonable about this. However, I do dislike it when that information is presented in a fishy way, implying that a model "outperforms GPT-4" without any qualification.
(Post author here). Totally fair concern. I'll find some representative examples on a sample task we've done some fine-tuning on and add them to the post.
EDIT: Ok so the prompt and outputs are long enough that adding them to the post directly would be kind of onerous. But I didn't want to leave you waiting, so I copied an example into a Notion doc you can see here: https://opipe.notion.site/PII-Redaction-Example-ebfd29939d25...
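For anyone who doesn't want to click through: the task is PII redaction. The shape of it is roughly the following (illustrative only; this is not the actual prompt or output from the linked doc, and the placeholder format is my own):

```python
# Illustrative only; not the real prompt or output from the Notion doc.
prompt = (
    "Replace every piece of personally identifiable information in the "
    "text below with a typed placeholder such as [NAME], [EMAIL], [PHONE].\n\n"
    "Text: Contact Jane Roe at jane.roe@example.com or 555-0142."
)
# A correct completion would be:
# "Contact [NAME] at [EMAIL] or [PHONE]."
```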
Yeah, a 7B foundation model is of course going to be worse when expected to perform on every task.
But finetuning on just a few tasks?
Depending on the task, it's totally reasonable to expect that a 7B model might eke out a win against stock GPT-4, especially if there's domain knowledge in the fine-tune and the task doesn't lean heavily on logical reasoning.
I agree, I think they need an example or two on that blog post to back up the claim. I'm ready to believe it, but I need something more than "diverse customer tasks" to understand what we're talking about.
You can fine-tune a small model yourself and see. GPT-4 is an amazing general model, but it won’t perform best at every task you throw at it out of the box. I have a fine-tuned Mistral 7B model that outperforms GPT-4 on a specific type of structured data extraction. Maybe if I fine-tuned GPT-4 it could win, but that costs a lot of money for what I can now do locally for the cost of electricity.
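If you want to try the "locally for the cost of electricity" part, a minimal sketch with the llama-cpp-python bindings (the GGUF filename and prompt format are placeholders, not my actual model):

```python
# Local inference sketch; the GGUF export and prompt are hypothetical.
from llama_cpp import Llama

llm = Llama(model_path="mistral-7b-extractor.Q4_K_M.gguf",
            n_ctx=4096, n_threads=8)

out = llm("Extract the fields below as JSON.\n\n<document text>\n\nJSON:",
          max_tokens=256, temperature=0.0)
print(out["choices"][0]["text"])
```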
Not for translation. I did a lot of experimenting with different local models; none come even close to the capabilities of ChatGPT. Most local models just output plainly wrong information. I'm still hoping one day it will be possible, because it would be a huge opportunity for our business.
The BTL (Bradley-Terry-Luce) model is just a way to infer "true" skill levels given some list of head-to-head comparisons, so the head-to-head comparisons/rankings are what matter most (see the sketch below). And in this case, the rankings come from GPT-4 itself, so take any subsequent score with all the grains of salt you can muster.
Their methodology also appears to be "try 12 different models and hope one of them wins out." Multiple-hypothesis adjustments come to mind here :)
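For anyone unfamiliar, here's a tiny sketch of what BTL fitting actually does, using the classic Zermelo/MM fixed-point updates. The win counts are toy numbers; the point is that the recovered scores are entirely downstream of whatever judge produced the wins:

```python
import numpy as np

# wins[i][j] = number of times model i beat model j (toy numbers).
wins = np.array([[0, 7, 9],
                 [3, 0, 6],
                 [1, 4, 0]], dtype=float)

p = np.ones(len(wins))            # initial skill estimates
n = wins + wins.T                 # total games between each pair
for _ in range(100):              # Zermelo/MM fixed-point iteration
    denom = (n / np.add.outer(p, p)).sum(axis=1)
    p = wins.sum(axis=1) / denom
    p /= p.sum()                  # scores are only defined up to scale

print(p)  # P(i beats j) is modeled as p[i] / (p[i] + p[j])
```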
Try a few blind comparisons: Mixtral 8x7B-Instruct and GPT-4 are 50-50 for me, and it outperforms GPT-3.5 almost every time. And you can run inference on it with a modern CPU and 64 GB of RAM on a personal device, lmfao. The instruct fine-tuning also had nowhere near the money and RLHF that OpenAI puts in. It's not a done deal, but people will be able to run models better than today's SOTA on <$1000 hardware in <3 months. I hope for their own sake that OpenAI is moving fast.
How are local models better in terms of trust? GPT-4 is the only model I've seen actually tuned to say no when it doesn't have the information being asked for. Though I do agree it ran better earlier this year.
The best open source has to offer is Mixtral, which will confidently make up a biography of a person it's never heard of, or write a script that imports nonexistent libraries.
I once asked Llama whether it’d heard of me. It came back with such a startlingly detailed and convincing biography of someone almost but not quite entirely unlike me that I began to wonder if there was some kind of Sliding Doors alternate reality thing going on.
Some of the things it said I’d done were genuinely good ideas, and I might actually go and do them at some point.
Their second sentence is at least the most honest framing I've seen so far: "averaged across 4 diverse customer tasks, fine-tunes based on our new model are _slightly_ stronger than GPT-4, as measured by GPT-4 itself."