👷 Readings in Digital Typography, Scientific Visualization, Information Retrieval and Machine Learning

[Marek Petrovič] Speed-Up and Compression of NNs with Quantization 18. 11. 2021

Abstract

Despite their high accuracy, neural networks face practical problems in deployment. Two of these are a high memory footprint and slow inference. Inference speed is an even greater problem for generative networks, since their inference time grows with the length of the input. Quantization can address both problems: by lowering the bit widths of the computations in the network, we achieve both network compression and faster inference. Existing results show that CNNs with 8-bit integer quantization reach accuracy close to their full-precision counterparts. My work will focus on implementing an evaluation framework that allows evaluating and comparing full-precision and quantized Transformers for NMT across various domains.
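To make the idea concrete, below is a minimal Python/NumPy sketch of symmetric linear (scale-only) quantization to 8-bit integers. It is an illustration only, not the exact scheme from the readings (Q8BERT, for example, combines symmetric quantization with quantization-aware training), and the function names are my own.

import numpy as np

def quantize_int8(x):
    # Symmetric linear quantization: map the largest absolute
    # value in x onto the int8 range [-127, 127].
    scale = float(np.max(np.abs(x))) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor; any scale works
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover an approximate float tensor from the int8 values.
    return q.astype(np.float32) * scale

# A float32 weight matrix stored as int8 takes 4x less memory;
# the price is a small rounding error.
w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print("max abs rounding error:", np.max(np.abs(w - dequantize(q, s))))

The speed-up comes from executing matrix multiplications directly on the int8 values (typically with int32 accumulation) rather than from the storage saving alone.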


Readings

1. Ofir Zafrir, Guy Boudoukh, Peter Izsak, and Moshe Wasserblat (2019). Q8BERT: Quantized 8Bit BERT. CoRR, https://arxiv.org/abs/1910.06188.

2. Gabriele Prato, Ella Charlaix, and Mehdi Rezagholizadeh (2019). Fully Quantized Transformer for Improved Translation. CoRR, https://arxiv.org/abs/1910.10485.