Software Algorithms – Quantization

ViT
- Post-Training Quantization for Vision Transformer - PKU & Huawei Noah’s Ark Lab, NeurIPS 2021
- PTQ4ViT: Post-Training Quantization Framework for Vision Transformers - Houmo AI & PKU, ECCV 2022
- FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer - MEGVII Technology, IJCAI 2022
- Q-ViT: Fully Differentiable Quantization for Vision Transformer - MEGVII Technology & CASIA, arXiv 2022
- TerViT: An Efficient Ternary Vision Transformer - Beihang University & Shanghai Artificial Intelligence Laboratory, arXiv 2022
- Patch Similarity Aware Data-Free Quantization for Vision Transformers - CASIA, ECCV 2022
- PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers - CASIA, arXiv 2022

BERT
- Q8BERT: Quantized 8Bit BERT - Intel AI Lab, NeurIPS Workshop 2019
- TernaryBERT: Distillation-Aware Ultra-Low Bit BERT - Huawei Noah’s Ark Lab, EMNLP 2020
- I-BERT: Integer-only BERT Quantization - University of California, Berkeley, ICML 2021
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization - Qualcomm AI Research, EMNLP 2021
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers - Microsoft, arXiv 2022
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models - BUAA & SenseTime & PKU & UESTC, NeurIPS 2022

GPT
- Compression of Generative Pre-trained Language Models via Quantization - The University of Hong Kong & Huawei Noah’s Ark Lab, ACL 2022
- nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models - Pohang University of Science and Technology, arXiv 2022
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale - University of Washington & FAIR, NeurIPS 2022
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - MIT, arXiv 2022
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - IST Austria & ETH Zurich, arXiv 2022
- The Case for 4-bit Precision: k-bit Inference Scaling Laws - University of Washington, arXiv 2022
- Quadapter: Adapter for GPT-2 Quantization - Qualcomm AI Research, arXiv 2022
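
All of the papers above build on the same low-level primitive: mapping floating-point tensors onto a small integer grid via a scale factor. As a point of reference, below is a minimal sketch of symmetric per-channel INT8 weight quantization in NumPy. It is an illustrative baseline only, not the method of any specific paper in this list, and the function names and shapes are made up for the example.

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization of a weight matrix.

    w: float32 weights of shape (out_features, in_features).
    Returns the int8 weights and a per-channel float scale.
    """
    # One scale per output channel, chosen so the largest magnitude maps to 127.
    max_abs = np.abs(w).max(axis=1, keepdims=True)
    scale = np.maximum(max_abs, 1e-8) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 weight matrix."""
    return q.astype(np.float32) * scale

# Example: quantization error on a random weight matrix.
rng = np.random.default_rng(0)
w = rng.normal(size=(16, 64)).astype(np.float32)
q, s = quantize_per_channel_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```

The listed methods differ mainly in how they choose and correct these grids (calibration, outlier handling, second-order weight updates, activation-to-weight scale migration, etc.).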

Software Algorithms – Pruning
- A Fast Post-Training Pruning Framework for Transformers - UC Berkeley, arXiv 2022
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot - IST Austria, arXiv 2023
- What Matters in the Structured Pruning of Generative Language Models? - CMU & Microsoft, arXiv 2023
- ZipLM: Hardware-Aware Structured Pruning of Language Models - IST Austria, arXiv 2023
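
As a baseline for the pruning papers above, here is a minimal sketch of one-shot unstructured magnitude pruning in NumPy. The listed methods go well beyond this (e.g. SparseGPT reconstructs the remaining weights, ZipLM removes whole structures), so treat it only as the simplest point of comparison; all names in the snippet are illustrative.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """One-shot unstructured magnitude pruning.

    Zeroes out the `sparsity` fraction of weights with the smallest magnitude.
    """
    assert 0.0 <= sparsity < 1.0
    k = int(round(sparsity * w.size))
    if k == 0:
        return w.copy()
    # Threshold = k-th smallest magnitude; everything at or below it is pruned.
    threshold = np.partition(np.abs(w), k - 1, axis=None)[k - 1]
    mask = np.abs(w) > threshold
    return w * mask

rng = np.random.default_rng(0)
w = rng.normal(size=(128, 128)).astype(np.float32)
w_pruned = magnitude_prune(w, sparsity=0.5)
print("actual sparsity:", 1.0 - np.count_nonzero(w_pruned) / w_pruned.size)
```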

Hardware & System Implementations
- Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks - IBM, NeurIPS 2019
- A3: Accelerating Attention Mechanisms in Neural Networks with Approximation - Seoul National University & Hynix, HPCA 2020
- ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks - Seoul National University, ISCA 2021
- Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization - ECNU, ICCAD 2022
- Accelerating Attention through Gradient-Based Learned Runtime Pruning - UCSD & Google, ISCA 2022
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale - Microsoft, arXiv 2022
- FP8 Quantization: The Power of the Exponent - Qualcomm AI Research, arXiv 2022
- PETALS: Collaborative Inference and Fine-tuning of Large Models - Yandex, arXiv 2022
- Efficiently Scaling Transformer Inference - Google, arXiv 2022
- High-throughput Generative Inference of Large Language Models with a Single GPU - Stanford University et al., arXiv 2023
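
Two of the entries above (HFP8 and FP8 Quantization) concern low-bit floating-point formats rather than integer grids. The snippet below simulates rounding float32 values onto an E4M3-like FP8 grid in NumPy, purely for intuition about the format; it ignores NaN encodings and subnormal details and is not a faithful model of any particular hardware or of either paper's scheme.

```python
import numpy as np

def round_to_e4m3(x: np.ndarray) -> np.ndarray:
    """Round float32 values to an approximate E4M3 FP8 grid.

    E4M3: 1 sign bit, 4 exponent bits, 3 mantissa bits, largest normal 448.
    This simulation rounds the mantissa to 3 fractional bits and saturates;
    NaN/inf encodings and subnormals are ignored for simplicity.
    """
    x = np.asarray(x, dtype=np.float32)
    sign = np.sign(x)
    mag = np.abs(x)

    # Decompose |x| = m * 2**e with m in [0.5, 1), then round m so that the
    # reconstructed value has the form (1 + j/8) * 2**k, j = 0..7.
    m, e = np.frexp(mag)
    m_rounded = np.round(m * 16.0) / 16.0
    y = sign * np.ldexp(m_rounded, e)

    # Saturate to the largest E4M3 magnitude and flush tiny values to zero.
    y = np.clip(y, -448.0, 448.0)
    y = np.where(mag < 2.0 ** -7, 0.0, y)
    return y

x = np.array([0.1234, -3.7, 500.0, 1e-5], dtype=np.float32)
print(round_to_e4m3(x))  # -> [0.125, -3.75, 448.0 (saturated), 0.0]
```
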
Platform
updating …