The Llama 4 Artificial Analysis scores are out, and they're very bad.

The small model, Llama 4 Scout, is worse than Gemma 3 27B.

The 400B Maverick is atrocious for its size; it's worse than QwQ-32B.

The only good thing is that it has vision.

But what if we just route the vision queries to a different model, use DeepSeek V3 or QwQ-32B for the text-only queries, and call the combined model Samba-Omni?

It would rockkkkk
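To make that routing idea concrete, here is a minimal sketch of a modality-based router, assuming an OpenAI-compatible endpoint. The base URL, model IDs, and the `has_image` / `route_request` helpers are all placeholders I made up for illustration, not real endpoints or an actual Samba-Omni product.

```python
# Minimal sketch: send queries that include images to a vision-capable model,
# and plain text queries to a stronger text-only model.
# The endpoint and model names below are hypothetical placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://example.com/v1", api_key="YOUR_KEY")  # hypothetical endpoint

VISION_MODEL = "llama-4-maverick"  # handles image inputs
TEXT_MODEL = "deepseek-v3"         # stronger on text-only queries

def has_image(messages):
    """Return True if any message carries an image part (OpenAI-style content lists)."""
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list) and any(
            part.get("type") == "image_url" for part in content
        ):
            return True
    return False

def route_request(messages):
    """Dispatch to the vision model only when the query actually needs vision."""
    model = VISION_MODEL if has_image(messages) else TEXT_MODEL
    return client.chat.completions.create(model=model, messages=messages)

# Example: a text-only query lands on the text model.
reply = route_request([{"role": "user", "content": "Summarize QwQ-32B's strengths."}])
print(reply.choices[0].message.content)
```

The same dispatch logic could just as easily live behind an agentic workflow or an MCP tool, as mentioned below.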

I think the 10M context window is laudable and noteworthy, even if the model performance (and model performance relative to size) both struggle to keep up with other SOTA models. I hope to see larger context windows like this become more common across the whole LLM ecosystem, but it fundamentally has to start with the models, as Meta is doing here. I am optimistic that this will encourage Meta’s model-developing competitors, both open and closed, to feel pressured to create models that support longer context lengths as well.

2 Likes

@kollikiran456

This is where agentic workflows and coding of MCPs come in. You build your workflow, or call MCPs, to route to the best model choice for your application.

-Coby

1 Like

I’m not seeing the bad results on Artificial Analysis, but I’m curious how they got those results, since most providers are not returning outputs that would earn that score.

Artificial Analysis’s results say L4 Maverick beats out Claude 3.7 Sonnet, and there’s still a reasoning model to be released later this month. Since MoEs are easier to run, I’d be super happy with a local quantized version of Claude 3.7 that can be run with fast RAM, or a Claude 3.7-level API that’s priced well with a provider that has a good privacy policy.

1 Like

I love this discussion. Thank you, everyone, for sharing your opinions, especially on the latest results.

Just wondering: what sites do people visit to evaluate which model to use?

What benchmarks are important to you? I would love to know everyone’s thoughts!

1 Like

I pay the most attention to the newer coding benchmarks (SWE-bench, LiveCodeBench), though I pay attention to the older ones as well (HumanEval and MBPP), with tool-use benchmarks (TAU-bench) not escaping my attention either.

Additionally, they’re not quite benchmarks, but I also really appreciate Aider’s Polyglot leaderboard as well as LMArena’s WebDev Arena Leaderboard, which, while technically focused on web development, I find to be a better proxy for coding capability generally than LMArena’s main leaderboard.

1 Like

It’s not good for you then, because Meta is training the models on benchmark data.

I appreciate your perspective, but I’d respectfully like to share a counterpoint.

While there are valid concerns about models being trained specifically to perform well on benchmarks (and your point that Llama 4 models aren’t for me is 100% correct), I believe the AI community’s collective journey is more exciting when you focus on the innovations rather than the downsides.

Looking at the history of innovation, many labs and even users have contributed important pieces to the puzzle. Chain of Thought reasoning, by my understanding, was first discovered on GPT-3 by some creative users doing early work in prompt engineering. By today’s standards, GPT-3 is objectively a very weak model. But CoT stuck with the whole AI community and has become a boon for performance enhancements across a number of models. We kept the good and discarded the bad.

Mistral AI’s implementation with Mixtral 8x7B was groundbreaking in making MoE work effectively for open-source LLMs, and that architecture, though not invented by Mistral, is now widespread thanks in part to Mistral’s work showing how useful it can be. Again, by today’s standards, Mixtral 8x7B is objectively a very weak model. But MoE stuck with the whole AI community and pays dividends in compute efficiency that show up as lower latency, higher throughput, etc. We kept the good and discarded the bad.

Meta’s contribution with the 10M context window in Llama 4 Scout is genuinely impressive, and these larger context windows are going to be crucial for enabling better long-context applications like improved RAG and multi-document summarization. Think about what might be possible with LLMs that can ingest several entire medical textbooks and incorporate 10,000+ pages of dense medical information into analysis. I know we like to think of our doctors as highly skilled and knowledgeable, but they are still only human. There is no way for them to exhaustively remember every word of every paragraph of every page of every textbook they had to study, and medical mistakes cost millions of human lives around the globe. It’s entirely possible that going from 1M/2M (Google models) to 10M context windows is the “tipping point” / phase shift that enables a whole new wave of lives saved via doctor-assisting AIs. I suspect these larger context windows will stick with the whole AI community, allowing us to keep the good (much larger context windows) while discarding the bad (models over-fitted to benchmarks).

Another great example is GPU acceleration. Nvidia paved the way for the AI revolution we are getting to experience together right now. As a whole, the community learned a lot from technologies like NVLink that offer high-speed data transfers, while observing the importance of total memory capacity and the difficulty of processing-pipeline bottlenecks. While GPUs may not remain the most efficient solution forever, SambaNova was likewise able to apply these lessons in building the RDU, prioritizing ultra high-speed data transfers and a processing pipeline that puts Nvidia’s best to shame, and as with the other innovations, we’re all better off for it - even if SambaNova still doesn’t have the context window sizes we long for quite yet… though I’m assured they’re coming soon :wink:

I try to view the ecosystem as evolutionary - different labs push boundaries in different ways, and the whole community benefits. While some techniques may be employed to boost benchmark scores, the truly valuable innovations (like new hardware architectures, longer context windows, efficient software architectures, and reasoning capabilities) persist and propagate across the field, so even if a model scores well partly due to benchmark optimization, that doesn’t diminish the real breakthroughs that will help advance the entire field. I remain optimistic about how these technologies will continue to evolve collectively, with every lab, every hardware manufacturer, and even every user’s unique contributions moving us all forward together.

1 Like