Meta’s Llama 3.1 405B: The New Open-Source AI Challenging GPT-4o and Claude 3.5 Sonnet

Open-source Large Language Models (LLMs) still have to catch up with their closed-source counterparts in capabilities and performance. Meta, a pioneer in making open LLMs available, is not giving up and is once again challenging closed models such as OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet. This time it does so by releasing Llama 3.1 405B, the most advanced and powerful openly available model. Can this model really rival the top closed AI models?

Introducing the Next Open-Source AI Generation: Llama 3.1 Models

Llama 3.1 405B is a significant advancement in open-source AI. With this 405B-parameter model, Meta challenges the best closed models in general knowledge, math, tool use, steerability, and multilingual translation. This release marks a new era of innovation and possibilities. Alongside the 405B, Meta introduces updated versions of the 8B and 70B models. These updates bring better multilingual support, a longer context length of 128K tokens, and stronger reasoning skills. As a result, the Llama 3.1 family should perform better on complex tasks such as summarizing long texts, holding multilingual conversations, and helping with code.

You can download this collection of Llama 3.1 models from llama.meta.com and Hugging Face.

Important Change in Meta’s License

Meta has also changed its license. The change matters because developers can now use the outputs of Llama models, including Llama 3.1 405B, to improve other models.


Model Card: Key Highlights

Meta Llama 3.1 is a series of pretrained generative, multilingual large language models (LLMs), available in 8B, 70B, and 405B sizes. The series was released on July 23, 2024.

Supported Languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai. (I hope Polish will be supported one day, too.)

| Training Data | Params | Input modalities | Output modalities | Context length | Knowledge cutoff |
|---|---|---|---|---|---|
| A new mix of publicly available online data. | 8B | Multilingual Text | Multilingual Text and code | 128k | December 2023 |
| A new mix of publicly available online data. | 70B | Multilingual Text | Multilingual Text and code | 128k | December 2023 |
| A new mix of publicly available online data. | 405B | Multilingual Text | Multilingual Text and code | 128k | December 2023 |
Llama 3.1 (text only). Source: Model Information

Training Data: Llama 3.1 was pretrained on approximately 15 trillion tokens of publicly available data. Fine-tuning used publicly available instruction datasets and over 25 million synthetically generated examples. The knowledge cutoff is December 2023.
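To get a feel for the scale of that pretraining run, the short sketch below divides the token count by each model size. It assumes that every model in the series saw the full ~15 trillion tokens (the article gives a single figure for the series) and uses the nominal parameter counts, so the results are rough estimates only.

```python
# Figures taken from this article; both are approximate.
PRETRAINING_TOKENS = 15e12  # ~15 trillion tokens
PARAMS = {"8B": 8e9, "70B": 70e9, "405B": 405e9}  # nominal sizes, not exact counts


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Pretraining tokens seen per model parameter."""
    return tokens / params


for name, n_params in PARAMS.items():
    ratio = tokens_per_parameter(PRETRAINING_TOKENS, n_params)
    print(f"Llama 3.1 {name}: ~{ratio:,.0f} tokens per parameter")
```

Under these assumptions, the 8B model sees far more tokens per parameter than the 405B model, simply because the same data budget is spread over fewer weights.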

Model Performance and Comparisons

In the following sections, I present evaluation results for the Llama 3.1 series compared with older Llama models and with two leading closed-source models, GPT-4o and Claude 3.5 Sonnet. The results cover a range of benchmarks, which I do not describe in this article. To learn more about the benchmarks and the evaluation metrics used for each of them, follow the links in the table captions.

Llama 3.1 vs. Llama 3

The Llama 3.1 series, especially the flagship 405B model, demonstrates significant advancements across various benchmarks, outperforming its predecessors and many closed models. Key improvements are visible in general knowledge, reasoning, and reading comprehension tasks. The 405B model posts the highest score on every benchmark in the table below, including MMLU (85.2), ARC-Challenge (96.1), and TriviaQA-Wiki (91.8). The updated 8B and 70B models remain highly competitive in their size classes, although their gains over Llama 3 are smaller and not uniform across benchmarks.

Comparing Llama 3.1 to Llama 3 is like comparing a real llama to an alpaca, at least when it comes to the size of these animals 🙂 Source: Link
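| Benchmark Category | Benchmark | Llama 3 8B | Llama 3.1 8B | Llama 3 70B | Llama 3.1 70B | Llama 3.1 405B |
|---|---|---|---|---|---|---|
| General | MMLU | 66.7 | 66.7 | 79.5 | 79.3 | 85.2 |
| | MMLU-Pro (CoT) | 36.2 | 37.1 | 55.0 | 53.8 | 61.6 |
| | AGIEval English | 47.1 | 47.8 | 63.0 | 64.6 | 71.6 |
| | CommonSenseQA | 72.6 | 75.0 | 83.8 | 84.1 | 85.8 |
| | Winogrande | - | 60.5 | - | 83.3 | 86.7 |
| | BIG-Bench Hard (CoT) | 61.1 | 64.2 | 81.3 | 81.6 | 85.9 |
| | ARC-Challenge | 79.4 | 79.7 | 93.1 | 92.9 | 96.1 |
| Knowledge reasoning | TriviaQA-Wiki | 78.5 | 77.6 | 89.7 | 89.8 | 91.8 |
| Reading comprehension | SQuAD | 76.4 | 77.0 | 85.6 | 81.8 | 89.3 |
| | QuAC (F1) | 44.4 | 44.9 | 51.1 | 51.1 | 53.6 |
| | BoolQ | 75.7 | 75.0 | 79.0 | 79.4 | 80.0 |
| | DROP (F1) | 58.4 | 59.5 | 79.7 | 79.6 | 84.8 |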
Base pretrained models. Comparison between Llama 3.1 and its predecessors. Source: Model Information
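To see how much the 3.1 update changed the two smaller models, the table can be reduced to per-benchmark differences. The sketch below hard-codes the scores from the table above and leaves out Winogrande, which has no Llama 3 scores listed. The function names are illustrative, and a difference of a few tenths of a point may be within evaluation noise.

```python
# Scores from the table above:
# (Llama 3 8B, Llama 3.1 8B, Llama 3 70B, Llama 3.1 70B)
SCORES = {
    "MMLU": (66.7, 66.7, 79.5, 79.3),
    "MMLU-Pro (CoT)": (36.2, 37.1, 55.0, 53.8),
    "AGIEval English": (47.1, 47.8, 63.0, 64.6),
    "CommonSenseQA": (72.6, 75.0, 83.8, 84.1),
    "BIG-Bench Hard (CoT)": (61.1, 64.2, 81.3, 81.6),
    "ARC-Challenge": (79.4, 79.7, 93.1, 92.9),
    "TriviaQA-Wiki": (78.5, 77.6, 89.7, 89.8),
    "SQuAD": (76.4, 77.0, 85.6, 81.8),
    "QuAC (F1)": (44.4, 44.9, 51.1, 51.1),
    "BoolQ": (75.7, 75.0, 79.0, 79.4),
    "DROP (F1)": (58.4, 59.5, 79.7, 79.6),
}


def deltas(scores):
    """Return {benchmark: (delta_8b, delta_70b)} in points, rounded to 0.1."""
    return {
        name: (round(new8 - old8, 1), round(new70 - old70, 1))
        for name, (old8, new8, old70, new70) in scores.items()
    }


def summarize(ds, index):
    """Count (improved, unchanged, regressed) for index 0 (8B) or 1 (70B)."""
    values = [d[index] for d in ds.values()]
    improved = sum(v > 0 for v in values)
    regressed = sum(v < 0 for v in values)
    return improved, len(values) - improved - regressed, regressed


changes = deltas(SCORES)
for name, (d8, d70) in changes.items():
    print(f"{name:22s} 8B: {d8:+.1f}   70B: {d70:+.1f}")
print("8B  (improved, unchanged, regressed):", summarize(changes, 0))
print("70B (improved, unchanged, regressed):", summarize(changes, 1))
```

On these numbers the 8B model improves on most benchmarks, while the 70B model is roughly level with its predecessor, with SQuAD showing the largest drop.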

Llama 3.1 405B vs GPT-4o and Claude 3.5 Sonnet

The evaluation of Llama 3.1 405B across various benchmarks indicates that it performs competitively with GPT-4o and Claude 3.5 Sonnet. It shows strong results in general tasks, particularly on the MMLU and IFEval benchmarks. On code-related benchmarks, Meta's new model is slightly behind Claude 3.5 Sonnet but close to GPT-4o, and ahead of it on MBPP EvalPlus. In mathematical reasoning, it leads on GSM8K but lags somewhat behind GPT-4o on MATH. Llama 3.1 405B also performs well in reasoning tasks, tool use, long-context tasks, and multilingual capabilities.

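| Benchmark Category | Benchmark | Llama 3.1 405B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|---|
| General | MMLU (0-shot, CoT) | 88.6 | 88.7 | 88.3 |
| | MMLU PRO (5-shot, CoT) | 73.3 | 74.0 | 77.0 |
| | IFEval | 88.6 | 85.6 | 88.0 |
| Code | HumanEval | 89 | 90.2 | 92 |
| | MBPP EvalPlus | 88.6 | 87.8 | 90.5 |
| Math | GSM8K (8-shot, CoT) | 96.8 | 96.1 | 96.4 |
| | MATH (0-shot, CoT) | 73.8 | 76.6 | 71.1 |
| Reasoning | ARC Challenge (0-shot) | 96.9 | 96.7 | 96.7 |
| | GPQA (0-shot, CoT) | 51.1 | 53.6 | 59.4 |
| Tool use | BFCL | 88.5 | 80.5 | 90.2 |
| | Nexus | 58.7 | 56.1 | 45.7 |
| Long Context | ZeroSCROLLS/QuALITY | 95.2 | 90.5 | 90.5 |
| | InfiniteBench/En.MC | 83.4 | 82.5 | - |
| | NIH/Multi-needle | 98.1 | 100 | 90.8 |
| Multilingual | Multilingual MGSM (0-shot) | 91.6 | 90.5 | 91.6 |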
Comparison between Llama 3.1 405B and two leading closed models. Source: Link
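One rough way to read the table is to count how often each model has the top score. The sketch below does that with the scores hard-coded from the table above; a missing score is stored as `None`, and a tie credits every tied model. Counting wins is a crude summary, since it ignores the size of each gap and the fact that leads of a few tenths of a point may not be meaningful.

```python
from collections import Counter

MODELS = ("Llama 3.1 405B", "GPT-4o", "Claude 3.5 Sonnet")

# Scores from the table above, in the order of MODELS. None = not reported.
SCORES = {
    "MMLU (0-shot, CoT)": (88.6, 88.7, 88.3),
    "MMLU PRO (5-shot, CoT)": (73.3, 74.0, 77.0),
    "IFEval": (88.6, 85.6, 88.0),
    "HumanEval": (89.0, 90.2, 92.0),
    "MBPP EvalPlus": (88.6, 87.8, 90.5),
    "GSM8K (8-shot, CoT)": (96.8, 96.1, 96.4),
    "MATH (0-shot, CoT)": (73.8, 76.6, 71.1),
    "ARC Challenge (0-shot)": (96.9, 96.7, 96.7),
    "GPQA (0-shot, CoT)": (51.1, 53.6, 59.4),
    "BFCL": (88.5, 80.5, 90.2),
    "Nexus": (58.7, 56.1, 45.7),
    "ZeroSCROLLS/QuALITY": (95.2, 90.5, 90.5),
    "InfiniteBench/En.MC": (83.4, 82.5, None),
    "NIH/Multi-needle": (98.1, 100.0, 90.8),
    "Multilingual MGSM (0-shot)": (91.6, 90.5, 91.6),
}


def leaders(row):
    """Return the model(s) with the highest reported score in one table row."""
    best = max(score for score in row if score is not None)
    return [model for model, score in zip(MODELS, row) if score == best]


def count_leads(scores):
    """Count, per model, the benchmarks on which it has the top score."""
    counts = Counter()
    for row in scores.values():
        counts.update(leaders(row))
    return counts


for model, wins in count_leads(SCORES).most_common():
    print(f"{model}: top score on {wins} of {len(SCORES)} benchmarks")
```

By this count, Llama 3.1 405B and Claude 3.5 Sonnet each lead on more benchmarks than GPT-4o, with the two sharing the top MGSM score.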

Today, on LinkedIn, I also saw a fascinating figure created by Maxime Labonne. It presents MMLU (5-shot) results for closed-source models such as GPT-4 and Claude 3.5 Sonnet against open-weight (open-source) models such as Llama 3. Maxime Labonne created the figure a few months ago and has now updated it with the results of the Llama 3.1 models. The figure, which you can find below, clearly shows that open-weight models are getting closer and closer in performance to closed-weight models over time. I see this as a promising trend, as I strongly support open-source AI.

Closed-source vs. open-weight (open-source) models. Image by Maxime Labonne

Conclusions

Meta’s release of Llama 3.1 represents a significant step in democratizing access to top-performing LLMs. By providing the powerful Llama 3.1 405B model, Meta challenges the closed-weight models of other tech giants and positions itself as a leader in the open-source AI movement. AI is changing before our eyes, and I expect Llama 3.1 to further influence AI and NLP research and development. This rapid change keeps me excited and glad to be working in NLP, AI, and Data Science.

References

  1. Model Information Card on GitHub, Link
  2. Introducing Llama 3.1: Our most capable models to date; Article by Meta
  3. Documentation by Meta
  4. Photo by Willian Justen de Vasconcellos on Unsplash
