On Codeforces, AlphaCode 2 showed a roughly 2x improvement over the prior record-setting AlphaCode system, which solved 25 percent of the problems. Mapping this to competition rankings, the team estimated that AlphaCode 2 sits, on average, at the 85th percentile. In other words, it performs better than 85 percent of participants, ranking between the ‘Expert’ and ‘Candidate Master’ categories on Codeforces.
Codeforces is a platform that hosts competitive programming contests.
Compared with other AI code generators, such as GitHub Copilot (based on OpenAI Codex), Amazon CodeWhisperer, Replit, Code Llama, EleutherAI Llemma, and Salesforce CodeGen, AlphaCode 2 shows a unique strength in competitive programming, whereas the others serve mainly as coding assistants for general coding help and basic maths problems.
For OpenAI, Q* reportedly represents a significant advance in AI capabilities: solving maths problems it hadn’t seen before, and stronger problem-solving in general. Google’s AlphaCode 2, powered by Gemini, hints that Google is reaching the same level of advancement as Q*, or even surpassing it.
Enjoy the full story here.
The Need for Benchmarks?
Since the beginning of LLMs, benchmarks have been the litmus test for judging their efficiency, at least on paper. There are plenty of them right now. Popular ones include HumanEval (OpenAI), AGIEval (Microsoft), MMLU, and GSM8K, among others. However, companies often manipulate the data to project themselves at the top, and in this race a clear winner has yet to emerge.
For instance, the recent launch of Gemini, and its comparison with GPT-4 across different benchmarks, gives a glimpse of this benchmark manipulation. Google claimed that Gemini outperformed GPT-4 on the MMLU benchmark. However, it was later discovered that Google's figure used CoT@32 (chain-of-thought prompting with 32 samples) rather than the standard 5-shot setting, so the two scores were not measured the same way. Read the full story to find out what happened next.
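To see why the evaluation setting matters, here is a minimal Python sketch of the general idea behind a "@32" score: sample many answers and keep the most common one, instead of trusting a single response. This is an illustration only, not Google's actual procedure. The function names and the `sample_answer` callable are hypothetical stand-ins for a real model call.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]


def answer_single_pass(sample_answer: Callable[[str], str], question: str) -> str:
    """One model call, one answer: roughly how a plain few-shot score is produced."""
    return sample_answer(question)


def answer_with_cot_at_k(
    sample_answer: Callable[[str], str], question: str, k: int = 32
) -> str:
    """Sample k answers (hypothetical model call) and keep the majority answer."""
    answers = [sample_answer(question) for _ in range(k)]
    return majority_vote(answers)
```

Because voting over 32 samples can smooth out occasional wrong answers, a score obtained this way is not directly comparable to a single-pass score from another model.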
AI in Fashion with Snezhana