Software Algorithms – Quantization

ViT
- Post-Training Quantization for Vision Transformer - PKU & Huawei Noah’s Ark Lab, NeurIPS 2021
- PTQ4ViT: Post-Training Quantization Framework for Vision Transformers - Houmo AI & PKU, ECCV 2022
- FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer - MEGVII Technology, IJCAI 2022
- Q-ViT: Fully Differentiable Quantization for Vision Transformer - MEGVII Technology & CASIA, arXiv 2022
- TerViT: An Efficient Ternary Vision Transformer - Beihang University & Shanghai Artificial Intelligence Laboratory, arXiv 2022
- Patch Similarity Aware Data-Free Quantization for Vision Transformers - CASIA, ECCV 2022
- PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers - CASIA, arXiv 2022

BERT
- Q8BERT: Quantized 8Bit BERT - Intel AI Lab, NeurIPS Workshop 2019
- TernaryBERT: Distillation-Aware Ultra-Low Bit BERT - Huawei Noah’s Ark Lab, EMNLP 2020
- I-BERT: Integer-only BERT Quantization - University of California, Berkeley, ICML 2021
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization - Qualcomm AI Research, EMNLP 2021
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers - Microsoft, arXiv 2022
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models - BUAA & SenseTime & PKU & UESTC, NeurIPS 2022

GPT
- Compression of Generative Pre-trained Language Models via Quantization - The University of Hong Kong & Huawei Noah’s Ark Lab, ACL 2022
- nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models - Pohang University of Science and Technology, arXiv 2022
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale - University of Washington & FAIR, NeurIPS 2022
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - MIT, arXiv 2022
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - IST Austria & ETH Zurich, arXiv 2022
- The Case for 4-bit Precision: k-bit Inference Scaling Laws - University of Washington, arXiv 2022
- Quadapter: Adapter for GPT-2 Quantization - Qualcomm AI Research, arXiv 2022
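
All of the papers above build on the same low-level primitive: mapping floating-point tensors onto a small integer grid via a scale factor. As a point of reference, below is a minimal sketch of symmetric per-channel INT8 weight quantization in NumPy. It is an illustrative baseline only, not the method of any specific paper in this list, and the function names and shapes are made up for the example.

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization of a weight matrix.

    w: float32 weights of shape (out_features, in_features).
    Returns the int8 weights and a per-channel float scale.
    """
    # One scale per output channel, chosen so the largest magnitude maps to 127.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 weight matrix."""
    return q.astype(np.float32) * scale

# Example: quantization error on a random weight matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64)).astype(np.float32)
q, s = quantize_per_channel_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```

The listed methods differ mainly in how they choose and correct these grids (calibration, outlier handling, second-order weight updates, activation-to-weight scale migration, etc.).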

Software Algorithms – Pruning
- A Fast Post-Training Pruning Framework for Transformers - UC Berkeley, arXiv 2022
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot - IST Austria, arXiv 2023
- What Matters in the Structured Pruning of Generative Language Models? - CMU & Microsoft, arXiv 2023
- ZipLM: Hardware-Aware Structured Pruning of Language Models - IST Austria, arXiv 2023
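
As a baseline for the pruning papers above, here is a minimal sketch of one-shot unstructured magnitude pruning in NumPy. The listed methods go well beyond this (e.g. SparseGPT reconstructs the remaining weights, ZipLM removes whole structures), so treat it only as the simplest point of comparison; all names in the snippet are illustrative.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """One-shot unstructured magnitude pruning.

    Zeroes out the `sparsity` fraction of weights with the smallest magnitude.
    """
    assert 0.0 <= sparsity < 1.0
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    # Threshold = k-th smallest magnitude; everything at or below it is pruned.
    threshold = np.partition(np.abs(w), k - 1, axis=None)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128)).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.5)
print("actual sparsity:", 1.0 - np.count_nonzero(w_pruned) / w_pruned.size)
```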

Hardware & System Implementations
- Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks - IBM, NeurIPS 2019
- A3: Accelerating Attention Mechanisms in Neural Networks with Approximation - Seoul National University & Hynix, HPCA 2020
- ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks - Seoul National University, ISCA 2021
- Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization - ECNU, ICCAD 2022
- Accelerating Attention through Gradient-Based Learned Runtime Pruning - UCSD & Google, ISCA 2022
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale - Microsoft, arXiv 2022
- FP8 Quantization: The Power of the Exponent - Qualcomm AI Research, arXiv 2022
- PETALS: Collaborative Inference and Fine-tuning of Large Models - Yandex, arXiv 2022
- Efficiently Scaling Transformer Inference - Google, arXiv 2022
- High-throughput Generative Inference of Large Language Models with a Single GPU - Stanford University et al., arXiv 2023
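
Two of the entries above (HFP8 and FP8 Quantization) concern low-bit floating-point formats rather than integer grids. The snippet below simulates rounding float32 values onto an E4M3-like FP8 grid in NumPy, purely for intuition about the format; it ignores NaN encodings and subnormal details and is not a faithful model of any particular hardware or of either paper's scheme.

```python
import numpy as np

def round_to_e4m3(x: np.ndarray) -> np.ndarray:
    """Round float32 values to an approximate E4M3 FP8 grid.

    E4M3: 1 sign bit, 4 exponent bits, 3 mantissa bits, largest normal 448.
    This simulation rounds the mantissa to 3 fractional bits and saturates;
    NaN/inf encodings and subnormals are ignored for simplicity.
    """
    x = np.asarray(x, dtype=np.float32)
    sign = np.sign(x)
    mag = np.abs(x)

    # Decompose |x| = m * 2**e with m in [0.5, 1), then round m so that the
    # reconstructed value has the form (1 + j/8) * 2**k, j = 0..7.
    m, e = np.frexp(mag)
    m_rounded = np.round(m * 16.0) / 16.0
    y = sign * np.ldexp(m_rounded, e)

    # Saturate to the largest E4M3 magnitude and flush tiny values to zero.
    y = np.clip(y, -448.0, 448.0)
    y = np.where(mag < 2.0 ** -7, 0.0, y)
    return y

x = np.array([0.1234, -3.7, 500.0, 1e-5], dtype=np.float32)
print(round_to_e4m3(x))  # -> [0.125, -3.75, 448.0 (saturated), 0.0]
```
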
Platform
updating …