Introduction to Llama3

A brief introduction to Llama3

Meta released Llama3 on April 18, 2024. It is evaluated on several benchmarks and achieves state-of-the-art results among open-source LLMs.

Introduction

Instruct model performance

The performance of Llama3 8B compared with Gemma and Mistral:

| Model | Llama3 8B | Gemma 7B - It | Mistral 7B Instruct |
|---|---|---|---|
| MMLU (5 shot) | 68.4 | 53.3 | 58.4 |
| GPQA (0 shot) | 34.2 | 21.4 | 26.3 |
| HumanEval (0 shot) | 62.2 | 30.5 | 36.6 |
| GSM-8K (8 shot, CoT) | 79.6 | 30.6 | 39.9 |
| MATH (4 shot, CoT) | 30.0 | 12.2 | 11.0 |

The performance of Llama3 70B compared with Gemini Pro 1.5 and Claude 3 Sonnet:

| Model | Llama3 70B | Gemini Pro 1.5 (Published) | Claude 3 Sonnet (Published) |
|---|---|---|---|
| MMLU (5 shot) | 82.0 | 81.9 | 79.0 |
| GPQA (0 shot) | 39.5 | 41.5 (CoT) | 38.5 (CoT) |
| HumanEval (0 shot) | 81.7 | 71.9 | 73.0 |
| GSM-8K (8 shot, CoT) | 93.0 | 91.7 (11 shot) | 92.3 (0 shot) |
| MATH (4 shot, CoT) | 50.4 | 58.5 (Minerva prompt) | 40.5 |

Pre-trained model performance

The performance of Llama3 8B compared with Gemma and Mistral:

| Model | Llama3 8B | Gemma 7B (Published) | Gemma 7B (Measured) | Mistral 7B (Published) | Mistral 7B (Measured) |
|---|---|---|---|---|---|
| MMLU (5 shot) | 66.6 | 64.3 | 64.4 | 62.5 | 63.9 |
| AGIEval English (3-5 shot) | 45.9 | 41.7 | 44.9 | - | 44.0 |
| BIG-Bench Hard (3 shot, CoT) | 61.1 | 55.1 | 59.0 | - | 56.0 |
| ARC-Challenge (25 shot) | 78.6 | 53.2 (0 shot) | 79.1 | 78.1 | 78.7 |
| DROP (3 shot, F1) | 58.4 | - | 56.3 | - | 54.4 |

The performance of Llama3 70B compared with Gemini Pro 1.0 and Mixtral 8 $\times$ 22B:

| Model | Llama3 70B | Gemini Pro 1.0 (Published) | Mixtral 8 $\times$ 22B (Measured) |
|---|---|---|---|
| MMLU (5 shot) | 79.5 | 71.8 | 77.7 |
| AGIEval English (3-5 shot) | 63.0 | - | 61.2 |
| BIG-Bench Hard (3 shot, CoT) | 81.3 | 75.0 | 79.2 |
| ARC-Challenge (25 shot) | 93.0 | - | 90.7 |
| DROP (3 shot, F1) | 79.7 | 74.1 (variable shot) | 77.6 |

Model Architecture

Llama3 makes several improvements over Llama2:

  1. Llama3 uses a tokenizer with a vocabulary of 128K tokens.
  2. Llama3 adopts grouped query attention (GQA) across both the 8B and 70B sizes.
  3. Llama3 uses a context window of 8192 tokens.
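
To make the grouped query attention item more concrete, here is a minimal sketch in plain Python. It is not Meta's implementation: the head counts, sequence length, and head dimension are toy values chosen for illustration, and the helper names (`grouped_query_attention`, `random_heads`) are hypothetical. The only idea it tries to show is that several query heads share one key/value head.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def grouped_query_attention(q, k, v):
    """Causal GQA on nested lists.

    q:    [n_q_heads][seq_len][head_dim]
    k, v: [n_kv_heads][seq_len][head_dim]
    Returns a list shaped like q.
    """
    n_q_heads, n_kv_heads = len(q), len(k)
    assert n_q_heads % n_kv_heads == 0, "query heads must split evenly into groups"
    group_size = n_q_heads // n_kv_heads
    head_dim = len(q[0][0])
    scale = 1.0 / math.sqrt(head_dim)

    out = []
    for h in range(n_q_heads):
        kv = h // group_size  # all query heads in a group read the same K/V head
        head_out = []
        for t, q_vec in enumerate(q[h]):
            # Causal mask: position t only attends to positions 0..t.
            scores = [
                scale * sum(a * b for a, b in zip(q_vec, k[kv][s]))
                for s in range(t + 1)
            ]
            weights = softmax(scores)
            head_out.append([
                sum(w * v[kv][s][i] for s, w in enumerate(weights))
                for i in range(head_dim)
            ])
        out.append(head_out)
    return out


def random_heads(n_heads, seq_len, head_dim, rng):
    """Toy inputs: Gaussian noise standing in for projected activations."""
    return [[[rng.gauss(0, 1) for _ in range(head_dim)]
             for _ in range(seq_len)] for _ in range(n_heads)]


rng = random.Random(0)
q = random_heads(4, 3, 2, rng)  # 4 query heads (toy value)
k = random_heads(2, 3, 2, rng)  # 2 shared key heads (toy value)
v = random_heads(2, 3, 2, rng)  # 2 shared value heads (toy value)
out = grouped_query_attention(q, k, v)
print(len(out), len(out[0]), len(out[0][0]))  # 4 3 2
```

With 4 query heads and 2 key/value heads, only half as many keys and values need to be cached during generation as in standard multi-head attention, which is the usual motivation for GQA.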

Training

Llama3 is pre-trained on 15T tokens. Compared to the dataset used for Llama2, this is seven times larger and includes four times more code.

5% of the training dataset is non-English data, to support multilingual use cases.

Data processing includes:

  1. Heuristic filters
  2. NSFW filters
  3. Semantic deduplication approaches
  4. Text classifiers to predict data quality. Llama2 is used to generate training data for the text classifiers.
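
The steps above can be sketched as a small filtering pipeline. This is only an illustration under loud assumptions: the thresholds are made up, the bag-of-words `embed` function is a crude stand-in for a real semantic embedding model, and the NSFW filter and the quality classifier are left out because the post gives no detail about them.

```python
import math
from collections import Counter


def passes_heuristics(text, min_words=5, max_repeat_ratio=0.5):
    """Toy heuristic filter: drop very short or highly repetitive documents.

    Both thresholds are illustrative assumptions, not values from Llama3.
    """
    words = text.split()
    if len(words) < min_words:
        return False
    most_common_count = Counter(words).most_common(1)[0][1]
    return most_common_count / len(words) <= max_repeat_ratio


def embed(text):
    """Stand-in embedding: lowercase bag-of-words counts (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def clean(docs, sim_threshold=0.9):
    """Apply the heuristic filter, then drop near-duplicates of kept docs."""
    kept, kept_vecs = [], []
    for doc in docs:
        if not passes_heuristics(doc):
            continue
        vec = embed(doc)
        if any(cosine(vec, other) >= sim_threshold for other in kept_vecs):
            continue  # too similar to a document we already kept
        kept.append(doc)
        kept_vecs.append(vec)
    return kept


docs = [
    "the cat sat on the mat today",
    "The cat sat on the mat today",                # near-duplicate of the first
    "spam spam spam spam spam spam",               # repetitive
    "too short",                                   # below the length threshold
    "llamas graze in the high mountain pastures",
]
print(clean(docs))
```

In a real pipeline the pairwise comparison would be replaced by an approximate nearest-neighbour search, since comparing every document against every kept document does not scale to trillions of tokens.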

The authors also explored data-mixing strategies to improve the performance of Llama3.

Scaling up pretraining

The Llama3 authors developed a series of scaling laws for downstream benchmark evaluations.

Scaling laws help:

  1. Select an optimal data mix and make informed decisions on how best to use training compute.
  2. Predict the performance of the largest models on key tasks before actually training them.

The authors find that model performance continues to improve log-linearly as the number of training tokens increases. Larger models can match the performance of smaller models with less training compute, but smaller models are generally preferred because they are much more efficient during inference.
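
As a rough illustration of what "log-linear" means here, the sketch below fits score = slope * log10(tokens) + intercept by ordinary least squares and extrapolates to a larger token budget. The data points are synthetic and invented purely for the example; they are not Llama3 measurements, and the function names are hypothetical.

```python
import math


def fit_log_linear(tokens, scores):
    """Least-squares fit of score = slope * log10(tokens) + intercept."""
    xs = [math.log10(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(tokens, slope, intercept):
    """Predicted score at a given number of training tokens."""
    return slope * math.log10(tokens) + intercept


# Synthetic checkpoints (made up): benchmark score vs. training tokens.
tokens = [1e11, 1e12, 1e13]
scores = [40.0, 50.0, 60.0]

slope, intercept = fit_log_linear(tokens, scores)
print(f"gain per 10x tokens: {slope:.1f} points")
print(f"extrapolated score at 15T tokens: {predict(15e12, slope, intercept):.1f}")
```

On a log-linear curve, each tenfold increase in training tokens adds a roughly constant number of benchmark points, so gains continue but each one costs ten times more data than the last.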

The authors combine three types of parallelization:

  1. Data parallelization
  2. Model parallelization
  3. Pipeline parallelization
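
Of the three, data parallelization is the easiest to show in a few lines. The sketch below simulates it on a single process for a one-parameter linear model: each "worker" computes a gradient on its own shard of the batch, and averaging those gradients (the job of an all-reduce) gives the same result as the full-batch gradient. The model, data, and function names are hypothetical, and model and pipeline parallelization are not covered.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, n_workers):
    """Simulate data parallelism: shard the batch, then average local gradients."""
    assert len(xs) % n_workers == 0, "batch must split evenly across workers"
    shard = len(xs) // n_workers
    local_grads = [
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(n_workers)
    ]
    # Averaging equal-sized shards plays the role of an all-reduce (mean).
    return sum(local_grads) / n_workers


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1]
w = 1.5

print(grad_mse(w, xs, ys))
print(data_parallel_grad(w, xs, ys, n_workers=4))
```

The two printed values agree up to floating-point rounding, which is why data parallelism can scale the batch across devices without changing the optimization step.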

Instruction fine-tuning

The fine-tuning of Llama3 combines:

  1. Supervised fine-tuning
  2. Rejection sampling
  3. Proximal Policy Optimization
  4. Direct Preference Optimization
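
Rejection sampling, in its generic best-of-N form, can be sketched as follows. The post does not say how Llama3 applies it, so this is only the textbook idea: draw several candidate responses, score them with a reward function, and keep the best one. `toy_generate` and `toy_reward` are hypothetical stand-ins for a language model and a reward model.

```python
import random


def rejection_sample(prompt, generate, reward, n_candidates=8, seed=0):
    """Draw n_candidates responses and keep the one with the highest reward.

    Returns (best_response, all_candidates).
    """
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n_candidates)]
    best = max(candidates, key=lambda c: reward(prompt, c))
    return best, candidates


# Toy stand-ins (hypothetical): a "policy" that guesses a digit and a
# "reward model" that prefers guesses close to the true sum.
def toy_generate(prompt, rng):
    return rng.randint(0, 9)


def toy_reward(prompt, response):
    a, b = prompt
    return -abs(response - (a + b))


best, candidates = rejection_sample((2, 2), toy_generate, toy_reward)
print("candidates:", candidates)
print("kept:", best)
```

The kept responses can then be used as additional supervised fine-tuning data, which is the usual reason for pairing rejection sampling with SFT.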

Learning from preference rankings via PPO and DPO also greatly improved the performance of Llama3 on reasoning and coding tasks, because preference rankings help the model select the right answer when it is torn between several candidates.
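
For DPO specifically, the loss on a single preference pair is short enough to write out. The sketch below follows the standard DPO formulation rather than anything stated in this post. The inputs are summed log-probabilities of the chosen and rejected responses under the policy being trained and under a frozen reference model, and `beta=0.1` is an assumed, commonly used default.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) = log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: the loss is log(2), about 0.693.
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))

# Policy has shifted probability toward the chosen response: the loss drops.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))

# Policy has shifted probability toward the rejected response: the loss rises.
print(dpo_loss(-12.0, -10.0, -10.0, -12.0))
```

Minimizing this loss pushes the policy to raise the likelihood of the preferred answer relative to the rejected one, measured against the reference model, without training a separate reward model as PPO requires.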

Licensed under CC BY-NC-SA 4.0