Overview of Qwen-VL series

Timeline

| Date    | Model series | Features |
| ------- | ------------ | -------- |
| 2023.08 | Qwen-VL      | Vision-centric understanding; multilingual; multi-image; fine-grained visual understanding |
| 2024.09 | Qwen2-VL     | Image understanding; video understanding (20 min); agent capability; multilingual support |
| 2025.02 | Qwen2.5-VL   | Document parsing; object grounding; long video understanding and grounding (1 hour); agent functionality |

Models

| Version | Name           | Size  | Vision      | Adapter                  | LLM               |
| ------- | -------------- | ----- | ----------- | ------------------------ | ----------------- |
| 1.0     | Qwen-VL        | 9.6B  | ViT (1.9B)  | Cross-attention (0.08B)  | Qwen-LLM (7B)     |
| 2.0     | Qwen2-VL-2B    | 2B    | ViT (675M)  | MLP (0.0341B)            | Qwen2-LLM (1.5B)  |
| 2.0     | Qwen2-VL-7B    | 7B    | ViT (675M)  | MLP (0.0446B)            | Qwen2-LLM (7.6B)  |
| 2.0     | Qwen2-VL-72B   | 72B   | ViT (675M)  | MLP (0.0682B)            | Qwen2-LLM (72B)   |
| 2.5     | Qwen2.5-VL-3B  | 3B    | ViT (675M)  | MLP (0.1817B)            | Qwen2.5-LLM (3B)  |
| 2.5     | Qwen2.5-VL-7B  | 7B    | ViT (675M)  | MLP (0.3179B)            | Qwen2.5-LLM (7.6B)|
| 2.5     | Qwen2.5-VL-72B | 72B   | ViT (675M)  | MLP (0.7267B)            | Qwen2.5-LLM (72B) |
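The Qwen2-VL adapter sizes in the table can be roughly reproduced from the published architecture: a 2×2 spatial merge of 1280-dim ViT patch features into a 5120-dim vector, then LayerNorm → Linear(5120, 5120) → GELU → Linear(5120, LLM hidden size). A minimal sketch, assuming that structure (`merger_params` is a hypothetical helper, not part of any library):

```python
def merger_params(vit_width: int = 1280, merge: int = 4, llm_hidden: int = 1536) -> int:
    """Approximate parameter count of the Qwen2-VL patch-merger MLP adapter."""
    ctx = vit_width * merge                   # 5120 after the 2x2 spatial merge
    ln = 2 * vit_width                        # LayerNorm weight + bias on 1280 dims
    fc1 = ctx * ctx + ctx                     # Linear(5120, 5120) weight + bias
    fc2 = ctx * llm_hidden + llm_hidden       # Linear(5120, llm_hidden) weight + bias
    return ln + fc1 + fc2

for llm_hidden, tag in [(1536, "2B"), (3584, "7B"), (8192, "72B")]:
    print(tag, round(merger_params(llm_hidden=llm_hidden) / 1e9, 4))
# ≈ 0.0341B / 0.0446B / 0.0682B, matching the table
```

The Qwen2.5-VL adapter counts in the table are noticeably larger and do not follow from this formula alone, so they are left out of the sketch.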

Remark

  • Qwen2-VL's MLP adapter uses LayerNorm.
  • Qwen2.5-VL's MLP adapter uses RMSNorm.
  • Except for Qwen-VL, every model also has a corresponding Instruct version, omitted here for brevity. Qwen-VL's counterpart is Qwen-VL-Chat.
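The LayerNorm vs. RMSNorm distinction noted above is small but concrete: RMSNorm drops the mean subtraction and normalizes by the root mean square only. A framework-free sketch of both (learnable scale and bias omitted for brevity; `eps` and the example values are arbitrary):

```python
import math

def layer_norm(x, eps=1e-6):
    # Centers by the mean, then scales by the standard deviation.
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def rms_norm(x, eps=1e-6):
    # No mean subtraction: scales by the root mean square only (cheaper).
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]
```

LayerNorm output is mean-zero; RMSNorm output preserves the relative proportions of the inputs, which is one reason it is favored in recent LLM stacks.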

Data & training

| Model series | Stage                     | Data                  | Vision | Adapter | LLM |
| ------------ | ------------------------- | --------------------- | ------ | ------- | --- |
| Qwen-VL      | Pretraining               | 1.4B image-text pairs | ✅     | ✅      |     |
|              | Multi-task pretraining    | 96.8M                 | ✅     | ✅      | ✅  |
|              | SFT                       | 350K                  |        | ✅      | ✅  |
| Qwen2-VL     | Pretraining               | 600B tokens           | ✅?    | ✅?     |     |
|              | Multi-task pretraining    | 800B tokens           | ✅?    | ✅?     | ✅? |
|              | SFT                       |                       |        | ✅?     | ✅? |
| Qwen2.5-VL   | Visual Pre-Training       | 1.5T tokens           | ✅?    | ✅?     |     |
|              | Multimodal Pre-Training   | 2T tokens             | ✅?    | ✅?     | ✅? |
|              | Long-Context Pre-Training | 0.6T tokens           | ✅?    | ✅?     | ✅? |
|              | SFT                       | 2M samples            |        | ✅?     | ✅? |
|              | DPO                       |                       |        | ✅?     | ✅? |

Remark

  1. According to its technical report, Qwen2-VL follows Qwen-VL's training recipe, so the same notation is used for it here.
  2. Cells marked with a question mark are uncertain and reflect personal guesses; treat them with caution.
  3. The uncertainty comes from the Qwen-VL series treating the projection layer and the ViT as a single module, which makes it hard to tell whether the projection layer was trained.
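For a scale comparison across generations, the pre-training token budgets in the table can be totaled. A quick sketch (SFT/DPO rows are sample counts, not tokens, so they are excluded; figures are taken from the table above, not independently verified):

```python
# Pre-training token budgets per stage, as listed in the table.
pretrain_tokens = {
    "Qwen2-VL":   [0.6e12, 0.8e12],          # 600B + 800B
    "Qwen2.5-VL": [1.5e12, 2.0e12, 0.6e12],  # 1.5T + 2T + 0.6T
}

for name, stages in pretrain_tokens.items():
    print(name, sum(stages) / 1e12, "T tokens")
# Qwen2-VL: 1.4T total; Qwen2.5-VL: 4.1T total, roughly a 3x increase
```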

Training framework diagrams from the technical reports

[Figure: qwen-vl-training]

[Figure: qwen2-5-training]
