Software Algorithms – Quantization
ViT
- Post-Training Quantization for Vision Transformer - PKU & Huawei Noah’s Ark Lab, NeurIPS 2021
- PTQ4ViT: Post-Training Quantization Framework for Vision Transformers - Houmo AI & PKU, ECCV 2022
- FQ-ViT: Post-Training Quantization for Fully Quantized Vision Transformer - MEGVII Technology, IJCAI 2022
- Q-ViT: Fully Differentiable Quantization for Vision Transformer - MEGVII Technology & CASIA, arXiv 2022
- TerViT: An Efficient Ternary Vision Transformer - Beihang University & Shanghai Artificial Intelligence Laboratory, arXiv 2022
- Patch Similarity Aware Data-Free Quantization for Vision Transformers - CASIA, ECCV 2022
- PSAQ-ViT V2: Towards Accurate and General Data-Free Quantization for Vision Transformers - CASIA, arXiv 2022
BERT
- Q8BERT: Quantized 8Bit BERT - Intel AI Lab, NeurIPS Workshop 2019
- TernaryBERT: Distillation-aware Ultra-low Bit BERT - Huawei Noah’s Ark Lab, EMNLP 2020
- I-BERT: Integer-only BERT Quantization - University of California, Berkeley, ICML 2021
- Understanding and Overcoming the Challenges of Efficient Transformer Quantization - Qualcomm AI Research, EMNLP 2021
- ZeroQuant: Efficient and Affordable Post-Training Quantization for Large-Scale Transformers - Microsoft, arXiv 2022
- Outlier Suppression: Pushing the Limit of Low-bit Transformer Language Models - BUAA & SenseTime & PKU & UESTC, NeurIPS 2022
GPT
- Compression of Generative Pre-trained Language Models via Quantization - The University of Hong Kong & Huawei Noah’s Ark Lab, ACL 2022
- nuQmm: Quantized MatMul for Efficient Inference of Large-Scale Generative Language Models - Pohang University of Science and Technology, arXiv 2022
- LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale - University of Washington & FAIR, NeurIPS 2022
- SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models - MIT, arXiv 2022
- GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers - IST Austria & ETH Zurich, arXiv 2022
- The Case for 4-bit Precision: k-bit Inference Scaling Laws - University of Washington, arXiv 2022
- Quadapter: Adapter for GPT-2 Quantization - Qualcomm AI Research, arXiv 2022
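
As a generic point of reference for the post-training quantization papers listed above, the sketch below shows symmetric per-output-channel INT8 weight quantization of a linear layer in plain NumPy. This is a minimal illustration of the shared primitive only, not the method of any specific paper above; the function and variable names (e.g. `quantize_per_channel_int8`) are invented for this example.

```python
import numpy as np

def quantize_per_channel_int8(w: np.ndarray):
    """Symmetric per-output-channel INT8 quantization of a weight matrix.

    w: float32 weights of shape (out_features, in_features).
    Returns int8 weights plus one float32 scale per output channel.
    """
    # The largest magnitude in each output channel determines that channel's scale.
    max_abs = np.max(np.abs(w), axis=1, keepdims=True)        # shape (out, 1)
    scale = max_abs / 127.0                                    # map to the symmetric int8 range [-127, 127]
    scale = np.where(scale == 0.0, 1.0, scale)                 # guard against all-zero channels
    w_q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return w_q, scale.astype(np.float32)

def dequantize(w_q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximate float32 weight matrix from int8 weights and scales."""
    return w_q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 8)).astype(np.float32)
    w_q, s = quantize_per_channel_int8(w)
    # The round-trip error should be small relative to the weight magnitudes.
    print("max abs error:", np.max(np.abs(w - dequantize(w_q, s))))
```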
Software Algorithms – Pruning
- A Fast Post-Training Pruning Framework for Transformers - UC Berkeley, arXiv 2022
- SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot - IST Austria, arXiv 2023
- What Matters in the Structured Pruning of Generative Language Models? - CMU & Microsoft, arXiv 2023
- ZipLM: Hardware-Aware Structured Pruning of Language Models - IST Austria, arXiv 2023
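
For the pruning entries above, the following is a minimal sketch of unstructured magnitude pruning at a fixed sparsity ratio, again a generic illustration rather than the procedure of any listed paper; `magnitude_prune` and its arguments are invented for this example.

```python
import numpy as np

def magnitude_prune(w: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights.

    w: float32 weight matrix; sparsity in [0, 1), e.g. 0.5 removes half the weights.
    Returns a pruned copy; a real pipeline would keep the mask and fine-tune or calibrate.
    """
    flat = np.abs(w).ravel()
    k = int(sparsity * flat.size)
    if k == 0:
        return w.copy()
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    mask = np.abs(w) > threshold                   # ties at the threshold are also dropped
    return w * mask

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4, 8)).astype(np.float32)
    w_pruned = magnitude_prune(w, 0.5)
    print("sparsity achieved:", np.mean(w_pruned == 0))
```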
Hardware & System Implementations
- Hybrid 8-bit Floating Point (HFP8) Training and Inference for Deep Neural Networks - IBM, NeurIPS 2019
- A3: Accelerating Attention Mechanisms in Neural Networks with Approximation - Seoul National University & Hynix, HPCA 2020
- ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks - Seoul National University, ISCA 2021
- Accelerating Framework of Transformer by Hardware Design and Model Compression Co-Optimization - ECNU, ICCAD 2022
- Accelerating Attention through Gradient-Based Learned Runtime Pruning - UCSD & Google, ISCA 2022
- DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale - Microsoft, arXiv 2022
- FP8 Quantization: The Power of the Exponent - Qualcomm AI Research, arXiv 2022
- PETALS: Collaborative Inference and Fine-tuning of Large Models - Yandex, arXiv 2022
- Efficiently Scaling Transformer Inference - Google, arXiv 2022
- High-throughput Generative Inference of Large Language Models with a Single GPU - Stanford University et al., arXiv 2023
Platform
updating …