ORPTQ: An Improved Large Model Quantization Method Based on Optimal Quantization Range
DOI: https://doi.org/10.5755/j01.itc.54.2.40573

Keywords: Large Model Quantization, Optimal Quantization Range, Transformer, GPTQ, ORPTQ

Abstract
Quantization reduces model storage by representing a model in low-bit form. It can improve the applicability of transformer-based large models and make it possible to deploy them on resource-limited systems such as PCs and mobile devices. The best current weight-only quantization method uses second-order information to fine-tune the weights step by step during quantization, compensating for the quantization errors already incurred. At each step, it adjusts the remaining weight elements through algebraic transformations, minimizing the functional loss caused by quantization. However, the performance of this method deteriorates rapidly when the weight adjustment deviates too far from the starting point, especially in low-bit quantization (e.g., 4 bits or fewer). To satisfy the mathematical prerequisite of this method during quantization, this paper introduces two parameters, α and β, that adjust the quantization range on top of the second-order method, and presents three approaches to finding their optimal values. The experimental results show that the proposed method significantly outperforms the original second-order method in low-bit quantization. The code of this paper is available at github.com/t-scen/ORPTQ.
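To make the abstract's description concrete, the following minimal PyTorch sketch illustrates the general idea: a round-to-nearest quantizer whose clipping range is scaled by two parameters α and β, embedded in a GPTQ-style column-by-column loop that propagates quantization error through the inverse Hessian. The function names, the parameterization of the range (α scaling the maximum, β the minimum), and the grid search are illustrative assumptions, not the authors' implementation; see github.com/t-scen/ORPTQ for the actual code.

```python
import torch

def quantize_range(w, bits, alpha=1.0, beta=1.0):
    # Round-to-nearest quantization of one weight column with an
    # adjustable clipping range. Here alpha scales the maximum and
    # beta the minimum of the range -- a hypothetical parameterization;
    # the paper's exact formulation may differ.
    qmax = 2 ** bits - 1
    wmax, wmin = alpha * w.max(), beta * w.min()
    scale = torch.clamp(wmax - wmin, min=1e-8) / qmax
    zero = torch.round(-wmin / scale)
    q = torch.clamp(torch.round(w / scale) + zero, 0, qmax)
    return scale * (q - zero)  # dequantized weights

def second_order_quantize(W, Hinv, bits, alpha, beta):
    # Column-by-column quantization with second-order error
    # compensation in the spirit of GPTQ: after quantizing column j,
    # its error is propagated to the not-yet-quantized columns via
    # the (upper-triangular Cholesky factor of the) inverse Hessian.
    W = W.clone()
    Q = torch.zeros_like(W)
    for j in range(W.shape[1]):
        w = W[:, j]
        q = quantize_range(w, bits, alpha, beta)
        Q[:, j] = q
        err = (w - q) / Hinv[j, j]
        W[:, j + 1:] -= err.unsqueeze(1) * Hinv[j, j + 1:].unsqueeze(0)
    return Q

def search_alpha_beta(w, bits, grid=(0.80, 0.85, 0.90, 0.95, 1.00)):
    # Brute-force search for (alpha, beta) minimizing the squared
    # reconstruction error of a single column -- one plausible way to
    # seek an optimal range, not necessarily one of the paper's
    # three approaches.
    best, best_err = (1.0, 1.0), float("inf")
    for a in grid:
        for b in grid:
            err = torch.sum((w - quantize_range(w, bits, a, b)) ** 2).item()
            if err < best_err:
                best_err, best = err, (a, b)
    return best
```

For instance, calling `search_alpha_beta(W[:, 0], bits=4)` before the compensation loop would pick a range for the first column; whether the search is applied per layer, per column, or per group is a design choice this sketch leaves open.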
License
Copyright terms are indicated in the Republic of Lithuania Law on Copyright and Related Rights, Articles 4-37.