Table of Contents

Google Searches

LLM Searches

Fine-Tuning

Embeddings

LLM Models

| Name | Release date | Developer | Number of parameters (billion) | Corpus size | Training cost (petaFLOP-day) | License | Notes |
|---|---|---|---|---|---|---|---|
| GPT-1 | June 2018 | OpenAI | 0.117 | | 1 | MIT | First GPT model; decoder-only transformer. Trained for 30 days on 8 P600 GPUs. |
| BERT | October 2018 | Google | 0.340 | 3.3 billion words | 9 | Apache 2.0 | An early and influential language model, but encoder-only and thus not built to be prompted or generative. |
| T5 | October 2019 | Google | 11 | 34 billion tokens | | Apache 2.0 | Base model for many Google projects, such as Imagen. |
| XLNet | June 2019 | Google | ~0.340 | 33 billion words | | Apache 2.0 | An alternative to BERT; designed as encoder-only. |
| GPT-2 | February 2019 | OpenAI | 1.5 | 40 GB (~10 billion tokens) | 28 | MIT | Trained on 32 TPUv3 chips for 1 week. |
| GPT-3 | May 2020 | OpenAI | 175 | 300 billion tokens | 3640 | Proprietary | A fine-tuned variant of GPT-3, termed GPT-3.5, was made available to the public through a web interface called ChatGPT in 2022. |
| GPT-Neo | March 2021 | EleutherAI | 2.7 | 825 GiB | | MIT | The first of a series of free GPT-3 alternatives released by EleutherAI. GPT-Neo outperformed an equivalent-size GPT-3 model on some benchmarks, but was significantly worse than the largest GPT-3. |
| GPT-J | June 2021 | EleutherAI | 6 | 825 GiB | 200 | Apache 2.0 | GPT-3-style language model. |
| Megatron-Turing NLG | October 2021 | Microsoft and Nvidia | 530 | 338.6 billion tokens | | Restricted web access | Standard architecture but trained on a supercomputing cluster. |
| Ernie 3.0 Titan | December 2021 | Baidu | 260 | 4 TB | | Proprietary | Chinese-language LLM. Ernie Bot is based on this model. |
| Claude | December 2021 | Anthropic | 52 | 400 billion tokens | | Beta | Fine-tuned for desirable behavior in conversations. |
| GLaM (Generalist Language Model) | December 2021 | Google | 1200 | 1.6 trillion tokens | 5600 | Proprietary | Sparse mixture-of-experts model, making it more expensive to train but cheaper to run inference compared to GPT-3. |
| Gopher | December 2021 | DeepMind | 280 | 300 billion tokens | 5833 | Proprietary | Later developed into the Chinchilla model. |
| LaMDA (Language Models for Dialog Applications) | January 2022 | Google | 137 | 1.56T words, 168 billion tokens | 4110 | Proprietary | Specialized for response generation in conversations. |
| GPT-NeoX | February 2022 | EleutherAI | 20 | 825 GiB | 740 | Apache 2.0 | Based on the Megatron architecture. |
| Chinchilla | March 2022 | DeepMind | 70 | 1.4 trillion tokens | 6805 | Proprietary | Reduced-parameter model trained on more data. Used in the Sparrow bot. Often cited for its neural scaling law. |
| PaLM (Pathways Language Model) | April 2022 | Google | 540 | 768 billion tokens | 29,250 | Proprietary | Trained for ~60 days on ~6000 TPU v4 chips. |
| OPT (Open Pretrained Transformer) | May 2022 | Meta | 175 | 180 billion tokens | 310 | Non-commercial research | GPT-3 architecture with some adaptations from Megatron. |
| YaLM 100B | June 2022 | Yandex | 100 | 1.7 TB | | Apache 2.0 | English-Russian model based on Microsoft's Megatron-LM. |
| Minerva | June 2022 | Google | 540 | 38.5B tokens from webpages filtered for mathematical content and from papers submitted to the arXiv preprint server | | Proprietary | For solving "mathematical and scientific questions using step-by-step reasoning". Based on the PaLM model, further trained on mathematical and scientific data. |
| BLOOM | July 2022 | Large collaboration led by Hugging Face | 175 | 350 billion tokens (1.6 TB) | | Responsible AI | Essentially GPT-3 but trained on a multilingual corpus (30% English, excluding programming languages). |
| Galactica | November 2022 | Meta | 120 | 106 billion tokens | Unknown | CC BY-NC 4.0 | Trained on scientific text and modalities. |
| AlexaTM (Teacher Models) | November 2022 | Amazon | 20 | 1.3 trillion | | Proprietary | Bidirectional sequence-to-sequence architecture. |
| Neuro-sama | December 2022 | Independent | Unknown | Unknown | | Privately owned | A language model designed for live-streaming on Twitch. |
| LLaMA (Large Language Model Meta AI) | February 2023 | Meta AI | 65 | 1.4 trillion | 6300 | Non-commercial research | Corpus has 20 languages. "Overtrained" (compared to the Chinchilla scaling law) for better performance with fewer parameters. |
| GPT-4 | March 2023 | OpenAI | Unknown | Unknown | Unknown | Proprietary | Available for ChatGPT Plus users and used in several products. |
| Cerebras-GPT | March 2023 | Cerebras | 13 | | 270 | Apache 2.0 | Trained with the Chinchilla formula. |
| Falcon | March 2023 | Technology Innovation Institute | 40 | 1 trillion tokens, from RefinedWeb (filtered web text corpus) plus some "curated corpora" | 2800 | Apache 2.0 | |
| BloombergGPT | March 2023 | Bloomberg L.P. | 50 | 363 billion tokens from Bloomberg's data sources, plus 345 billion tokens from general-purpose datasets | | Proprietary | Trained on financial data from proprietary sources, for financial tasks. |
| PanGu-Σ | March 2023 | Huawei | 1085 | 329 billion tokens | | Proprietary | |
| OpenAssistant | March 2023 | LAION | 17 | 1.5 trillion tokens | | Apache 2.0 | Trained on crowdsourced open data. |
| Jurassic-2 | March 2023 | AI21 Labs | Unknown | Unknown | | Proprietary | Multilingual. |
| PaLM 2 (Pathways Language Model 2) | May 2023 | Google | 340 | 3.6 trillion tokens | 85,000 | Proprietary | Was used in the Bard chatbot. |
| Llama 2 | July 2023 | Meta AI | 70 | 2 trillion tokens | 21,000 | Llama 2 license | 1.7 million A100-hours. |
| Claude 2 | July 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot. |
| Mistral 7B | September 2023 | Mistral AI | 7.3 | Unknown | | Apache 2.0 | |
| Claude 2.1 | November 2023 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Used in the Claude chatbot. Has a context window of 200,000 tokens, or ~500 pages. |
| Grok-1 | November 2023 | xAI | 314 | Unknown | Unknown | Apache 2.0 | Used in the Grok chatbot. Grok-1 has a context length of 8,192 tokens and has access to X (Twitter). |
| Gemini 1.0 | December 2023 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model; comes in three sizes. Used in the chatbot of the same name. |
| Mixtral 8x7B | December 2023 | Mistral AI | 46.7 | Unknown | Unknown | Apache 2.0 | Outperforms GPT-3.5 and Llama 2 70B on many benchmarks. Mixture-of-experts model, with 12.9 billion parameters activated per token. |
| Mixtral 8x22B | April 2024 | Mistral AI | 141 | Unknown | Unknown | Apache 2.0 | |
| Phi-2 | December 2023 | Microsoft | 2.7 | 1.4T tokens | 419 | MIT | Trained on real and synthetic "textbook-quality" data, for 14 days on 96 A100 GPUs. |
| Gemini 1.5 | February 2024 | Google DeepMind | Unknown | Unknown | Unknown | Proprietary | Multimodal model based on a mixture-of-experts (MoE) architecture. Context window above 1 million tokens. |
| Gemma | February 2024 | Google DeepMind | 7 | 6T tokens | Unknown | Gemma Terms of Use | |
| Claude 3 | March 2024 | Anthropic | Unknown | Unknown | Unknown | Proprietary | Includes three models: Haiku, Sonnet, and Opus. |
| DBRX | March 2024 | Databricks and MosaicML | 136 | 12T tokens | | Databricks Open Model License | Training cost 10 million USD. |
| Fugaku-LLM | May 2024 | Fujitsu, Tokyo Institute of Technology, etc. | 13 | 380B tokens | | | The largest model ever trained only on CPUs, on the Fugaku supercomputer. |
| Phi-3 | April 2024 | Microsoft | 14 | 4.8T tokens | | MIT | Microsoft markets them as "small language models". |
| Qwen2 | June 2024 | Alibaba Cloud | 72 | 3T tokens | | | Multiple sizes, the smallest being 0.5B. |
| Nemotron-4 | June 2024 | Nvidia | 340 | 9T tokens | 200,000 | NVIDIA Open Model License | Trained for 1 epoch on 6144 H100 GPUs between December 2023 and May 2024. |
| Llama 3.1 | July 2024 | Meta AI | 405 | 15.6T tokens | 440,000 | Llama 3 license | The 405B version took 31 million H100-80GB GPU-hours, about 3.8E25 FLOPs (see the worked compute estimate below the table). |
Large language model. (2024, July 25). In Wikipedia. https://en.wikipedia.org/wiki/Large_language_model
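
The parameter, token, and training-cost columns are roughly consistent with the commonly used rule of thumb that training compute is about 6 × parameters × tokens FLOPs, and with the Chinchilla finding of roughly 20 training tokens per parameter at compute-optimal scale. Neither formula comes from the table itself; the short Python sketch below just shows the back-of-the-envelope arithmetic using figures taken from the table (Llama 3.1, Chinchilla, and LLaMA).

```python
# Back-of-the-envelope check of the training-cost column, assuming the common
# approximation: training FLOPs ≈ 6 * N * D, where N = parameters and
# D = training tokens. These are rough estimates, not developer-reported figures.

PETAFLOP_DAY = 1e15 * 86_400  # FLOPs in one petaFLOP sustained for one day


def petaflop_days(params: float, tokens: float) -> float:
    """Estimate training compute in petaFLOP-days from parameter and token counts."""
    return 6 * params * tokens / PETAFLOP_DAY


# Llama 3.1: 405B parameters, ~15.6T training tokens (figures from the table).
flops = 6 * 405e9 * 15.6e12
print(f"Llama 3.1 estimated FLOPs: {flops:.2e}")  # ~3.79e+25, matching "3.8E25"
print(f"Llama 3.1 estimated cost: {petaflop_days(405e9, 15.6e12):,.0f} petaFLOP-days")
# -> ~438,750, close to the 440,000 listed in the table

# Chinchilla's rule of thumb is roughly 20 training tokens per parameter at
# compute-optimal scale: 70B parameters * 20 = 1.4T tokens, matching its row.
print(f"Chinchilla-optimal tokens for 70B params: {70e9 * 20:.1e}")  # 1.4e+12
# LLaMA 65B was trained on 1.4T tokens, a bit more than 65B * 20 ≈ 1.3T, which
# is what the table's note means by "overtrained" relative to that scaling law.
```

Because the estimate for Llama 3.1 lands within a few percent of the table's 440,000 petaFLOP-days, the 6ND approximation is a handy sanity check when comparing the compute columns across rows.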

Slides