I appreciate your perspective, but I’d respectfully like to share a counterpoint.
While there are valid concerns about models being trained specifically to perform well on benchmarks (and your point that the Llama 4 models aren’t for me is 100% correct), I believe the AI community’s collective journey is more exciting when we focus on the innovations rather than the downsides.
Looking at the history of innovation, many labs and even users have contributed important pieces to the puzzle. Chain-of-Thought reasoning, to my understanding, first surfaced on GPT-3, uncovered by creative users doing early prompt-engineering work. By today’s standards, GPT-3 is objectively a very weak model. But CoT stuck with the whole AI community and has become a boon for performance across a wide range of models. We kept the good and discarded the bad.
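(For anyone newer to this, here’s a tiny sketch of what that zero-shot CoT framing looks like in practice. The prompt wording and the example question are placeholders I made up; the only real point is asking the model to reason before answering.)

```python
def direct_prompt(question: str) -> str:
    """Baseline prompt: ask for the answer directly."""
    return f"Q: {question}\nA:"

def cot_prompt(question: str) -> str:
    """Chain-of-thought style prompt: ask the model to reason step by step
    before committing to an answer (the zero-shot 'let's think step by step' trick)."""
    return (
        f"Q: {question}\n"
        "A: Let's think step by step, then give the final answer on its own line."
    )

q = "A train covers 120 km in 1.5 hours. What is its average speed in km/h?"
print(direct_prompt(q))
print()
print(cot_prompt(q))
# Either string would be sent to whatever model/API you use; the only change
# is the instruction to reason out loud, which is what tends to lift accuracy
# on multi-step problems.
```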
Mistral AI’s implementation in Mixtral 8x7B was groundbreaking in making MoE work effectively for open-source LLMs, and that architecture, though not developed by Mistral, is now widespread thanks in part to Mistral’s work showing how useful it can be. Again, by today’s standards, Mixtral 8x7B is objectively a very weak model. But MoE stuck with the whole AI community and pays dividends in compute efficiency that show up as lower latency, higher throughput, and so on. We kept the good and discarded the bad.
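For readers who haven’t peeked inside an MoE layer, here’s a minimal, illustrative top-k routing sketch in PyTorch. The layer sizes and the plain feed-forward experts are arbitrary choices for the demo, not Mixtral’s actual configuration; the takeaway is just that each token activates only k of the n experts, which is where the compute savings come from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy sparse mixture-of-experts layer: a small router scores all experts,
    but each token is processed by only the top-k of them."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        scores = self.router(x)                        # (tokens, n_experts)
        topk_scores, topk_idx = torch.topk(scores, self.k, dim=-1)
        gates = F.softmax(topk_scores, dim=-1)         # renormalize over the chosen k
        out = torch.zeros_like(x)
        for slot in range(self.k):                     # dispatch token groups to experts
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TopKMoE()(tokens).shape)  # torch.Size([10, 64]); only 2 of 8 experts ran per token
```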
Meta’s contribution with the 10M-token context window in Llama 4 Scout is genuinely impressive, and these larger context windows are going to be crucial for enabling better long-context applications like improved RAG and multi-document summarization. Think about what might be possible with LLMs that can ingest several entire medical textbooks and incorporate 10,000+ pages of dense medical information into an analysis. I know we like to think of our doctors as highly skilled and knowledgeable, but they are still only human. There is no way for them to exhaustively remember every word of every paragraph of every page of every textbook they had to study, and medical mistakes cost millions of human lives around the globe. It’s entirely possible that going from 1M/2M context windows (Google’s models) to 10M is the “tipping point” / phase shift that enables a whole new wave of lives saved via doctor-assisting AIs. I suspect these larger context windows will stick with the whole AI community, allowing us to keep the good (much larger context windows) while discarding the bad (models over-fitted to benchmarks).
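A rough back-of-envelope check on that scale (the page count, words per page, and tokens-per-word ratio below are assumptions for illustration, not measurements):

```python
# Could ~10,000 pages of reference text fit in a 10M-token window?
PAGES           = 10_000
WORDS_PER_PAGE  = 500         # dense textbook page (assumption)
TOKENS_PER_WORD = 1.3         # typical English tokenizer ratio (assumption)
CONTEXT_WINDOW  = 10_000_000  # Llama 4 Scout's advertised window

tokens_needed = int(PAGES * WORDS_PER_PAGE * TOKENS_PER_WORD)
print(f"~{tokens_needed:,} tokens needed vs {CONTEXT_WINDOW:,} available "
      f"({tokens_needed / CONTEXT_WINDOW:.0%} of the window)")
# -> ~6,500,000 tokens needed vs 10,000,000 available (65% of the window)
```

Under those assumptions, an entire shelf of textbooks fits in a single prompt with room to spare, which is exactly the kind of workload that was unthinkable at 1M/2M tokens.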
Another great example is GPU acceleration. Nvidia paved the way for the AI revolution we are getting to experience together right now. As a whole, the community learned a lot from technologies like NVLink that offer high-speed data transfers, while also observing the importance of total memory capacity and the difficulties posed by processing-pipeline bottlenecks. While GPUs may not be the most efficient solution forever, SambaNova was likewise able to apply these lessons in building the RDU, prioritizing ultra-high-speed data transfers and a processing pipeline that puts Nvidia’s best to shame. As with the other innovations, we’re all better off for this - even if SambaNova still doesn’t have the context window sizes we long for quite yet… though I’m assured they’re coming soon.
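To make the bandwidth point concrete, here’s a toy calculation of how long it takes just to move a large model’s weights over different interconnects. The model size and bandwidth figures are rough ballpark assumptions on my part, not exact vendor specs:

```python
# Toy illustration of why interconnect bandwidth matters: time to move the
# weights of a large model between devices.
PARAMS      = 70e9   # 70B-parameter model (illustrative)
BYTES_PER_P = 2      # fp16 weights
weights_gb  = PARAMS * BYTES_PER_P / 1e9  # ~140 GB

links_gb_per_s = {
    "PCIe-class link (~64 GB/s, assumed)": 64,
    "NVLink-class link (~900 GB/s, assumed)": 900,
}
for name, bw in links_gb_per_s.items():
    print(f"{name}: {weights_gb / bw:.2f} s to move {weights_gb:.0f} GB of weights")
# Faster interconnects shrink that transfer time by an order of magnitude,
# which is why data movement, not just raw FLOPs, decides real throughput.
```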
I try to view the ecosystem as evolutionary - different labs push boundaries in different ways, and the whole community benefits. While some techniques may be employed to boost benchmark scores, the truly valuable innovations (like new hardware architectures, longer context windows, efficient software architectures, and reasoning capabilities) persist and propagate across the field. So even if a model scores well partly due to benchmark optimization, that doesn’t diminish the real breakthroughs that will help advance the entire field. I remain optimistic about how these technologies will continue to evolve collectively, with every lab, every hardware manufacturer, and even every user’s unique contributions moving us all forward together.