Quantization of Weights of Neural Networks with Negligible Decreasing of Prediction Accuracy

Quantization and compression of neural network parameters using uniform scalar quantization are carried out in this paper. The attractiveness of the uniform scalar quantizer lies in its low complexity and relatively good performance, which make it the most popular quantization model. We present a design approach for the memoryless Laplacian source with zero mean and unit variance, which is based on an iterative rule and uses the minimal mean-squared error distortion as the performance criterion. In addition, we derive closed-form expressions for SQNR (Signal to Quantization Noise Ratio) in a wide dynamic range of the variance of the input data. To show effectiveness on real data, the proposed quantizer is used to compress the weights of neural networks using bit rates from 9 to 16 bps (bits per sample) instead of the standardly used 32 bps full-precision bit rate. The impact of weight compression on the performance of the neural network (NN) is analyzed, indicating good matching with the theoretical results and showing a negligible decrease in the prediction accuracy of the NN, even in the case of a high mismatch between the variance of the NN weights and the variance used for the design of the quantizer, provided that the bit rate is chosen according to the rule proposed in the paper.


Introduction
In recent times, significant interest has been directed toward neural networks (NNs), mainly owing to the availability of powerful hardware [33]. The attractiveness of NNs lies in their increased potential to resolve challenges occurring in different research areas [33]. Some specific applications of NNs can be found in [25−27, 29−31], where promising results have been achieved. Namely, implementations in image processing and in a virtual reality environment have been presented in [25] and [26], respectively. In addition, the application of NNs to image classification has been investigated in [27], where the ship classification problem is considered. The paper [29] applies an NN to create the controller within an automatic control system, while [31] considers a pointer NN for the purpose of vehicle routing. In [30], an NN has been used in the context of solving a four-class motor imagery classification problem.
The state-of-the-art neural networks designed for tasks such as speech processing [5], image classification [15] and object recognition [28], just to name a few, represent very complex NN architectures with a large number of parameters, requiring expensive computational and storage resources. On the other hand, high complexity can be a limiting factor for application in portable and edge computing devices with limited memory and processing power, or in latency-critical services. Hence, compression of the NN is required. To this end, quantization is commonly employed, whereby the NN parameters (weights, biases, etc.), typically stored in 32-bit floating-point format (full precision), are mapped to fixed-point representations using lower bit lengths.
The influence of parameter quantization on NN model performance is an active area of research, where the NN parameters have been quantized with 16 bits [20,32], 8 bits [1,9], 4 bits [3] or 2 bits [4]. Moreover, ternary [34] and binary (1-bit) quantization [10,22] have also been considered. It has been shown that representations using a higher number of bits (e.g. 8 to 16) provide performance comparable to the full-precision case, while performance deteriorates as the code-word length decreases (e.g. 2 to 4 bits), although still offering competitive performance at very high compression ratios.
In the above-mentioned papers, uniform scalar quantization (USQ) has predominantly been used. USQ was theoretically considered in [8,13,18,19,23,24]. The main advantage of USQ is its design simplicity accompanied by relatively good performance compared to more complex non-uniform quantization. Nevertheless, a detailed design process for the quantizer, taking into account the assumed statistical distribution of the NN parameters, is missing in the above-mentioned papers on quantization of NN parameters [1,4,9,10,20,32,34]. In this paper we design a USQ for the compression of NN weights, assuming a Laplacian distribution of the weights and bit rates from 9 to 16 bps. Namely, the Laplacian probability density function (PDF) has already proved to be a relevant model for various data, including NN weights [2,11], speech [7,13] and images [13]. It is important to emphasize that we consider high-resolution quantization, where one can expect a high level of reconstructed data quality. However, this does not hold when the variance of the input data and the variance for which the quantizer is designed are mismatched (as happens with non-adaptive quantizers), since this mismatch effect can cause serious degradation in data quality. This fact motivated the authors to focus the research on discovering how the degree of mismatch affects the performance of the NN. In addition, using the proposed range of bit rates, compression ratios of up to 3.56:1 can be achieved.
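The compression ratio and the quantize-dequantize mapping described above can be illustrated with a minimal sketch. This is not the paper's exact design procedure: the midrise cell layout is standard USQ, and the value x_max = 7.0 is an illustrative support threshold, not a value taken from the paper's tables.

```python
import numpy as np

def uniform_quantize(w, R, x_max):
    """Midrise uniform scalar quantizer with N = 2**R levels on [-x_max, x_max].

    Samples outside the support region are clipped to the outermost cell
    (overload). Each sample is mapped to the midpoint of its cell.
    """
    N = 2 ** R
    step = 2 * x_max / N
    idx = np.floor(w / step)                  # cell index of each sample
    idx = np.clip(idx, -N // 2, N // 2 - 1)   # clip overloaded samples
    return (idx + 0.5) * step                 # cell midpoints

rng = np.random.default_rng(0)
# Laplacian weights with unit variance: for np.laplace, variance = 2 * scale**2.
w = rng.laplace(scale=1 / np.sqrt(2), size=100_000)
wq = uniform_quantize(w, R=9, x_max=7.0)

compression_ratio = 32 / 9   # 32-bit full precision vs. 9 bps
sqnr_db = 10 * np.log10(np.var(w) / np.mean((w - wq) ** 2))
```

Replacing 32-bit weights with 9-bit codewords gives 32/9 ≈ 3.56:1, the upper compression ratio quoted above.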
The analysis conducted in this paper is organized in two directions: development of a theoretical model of USQ using asymptotic formulas (since the number of quantization levels N is large), making the design process simple; and implementation of the designed USQ for the compression of NN weights. The main contributions of the paper are:
- A simple iterative design method of USQ for the memoryless Laplacian source with zero mean and unit variance is proposed.
- The influence of the granular and overload distortions on SQNR for different values of the variance is estimated, based on the derived closed-form expressions for performance evaluation in a wide dynamic range of the variance of the input data.
- The designed USQ is applied for the quantization of the weights of a neural network (Multi-Layer Perceptron) used for the classification of images from the MNIST database [21], showing very good matching between theoretical and experimental results. It should be highlighted that the variance-mismatched scenario (which often occurs in practice), meaning a mismatch between the variance of the NN weights and the variance used for the design of the quantizer, is analyzed. This variance-mismatched scenario has not been considered in any of the previous papers from the literature related to the quantization of neural networks.
- A connection between the SQNR of weight quantization and the prediction accuracy of the NN is shown, and a threshold for SQNR that assures a negligible decrease of the prediction accuracy is established for the specific NN. This is another new result that has not been presented in the literature yet.
- It is shown that a significant decrease of the bit rate R used for the representation of the weights, obtained by weight quantization, produces a negligible decrease of the NN prediction accuracy even in the case of a high degree of variance mismatch, if the value of the bit rate R is chosen in an appropriate way, according to the rule provided in the paper.
The rest of the paper is organized as follows. Section 2 provides a detailed description of USQ and proposes a simple design method. In Section 3, the performance of USQ in a variance mismatched scenario is analyzed. In Section 4, the application of the designed USQ in neural networks is presented and the obtained results are discussed. Finally, Section 5 concludes the paper.

Uniform Scalar Quantization
In this section we design the USQ for the symmetric zero-mean Laplacian PDF defined as [13]:

p(x) = \frac{1}{\sqrt{2}\,\sigma_q} \exp\left(-\frac{\sqrt{2}\,|x|}{\sigma_q}\right), (1)

where \sigma_q^2 denotes the signal variance. Without loss of generality, the design of the USQ is done for the unit variance \sigma_q^2 = 1, which is a standard approach in the literature [13]. Due to the symmetry of the considered PDF, the designed N-level USQ has thresholds x_i = i\Delta, i = 0, 1, \ldots, N/2, and representation levels y_i = (2i-1)\Delta/2, i = 1, \ldots, N/2, symmetrical around zero, where \Delta = 2x_{\max}/N is the quantization step size. The interval [-x_{\max}, x_{\max}] denotes the support region of the USQ, where the upper threshold of the support region x_{\max} is also known as the maximal amplitude of the quantizer. To completely define the USQ it is sufficient to know only one of these two parameters (x_{\max} or \Delta), since all required quantization parameters can be derived from it. The design goal of the USQ is therefore constrained to determining the optimal value of the parameter x_{\max} (or \Delta) for the assumed input data distribution and the established performance criterion.

The performance of a quantizer can be expressed by the distortion D [8,13], which represents the mean-squared error incurred during quantization. In calculating the distortion of the USQ for data modeled by the Laplacian PDF in this section, we assume the variance-matched situation [8,13,16], which means that the variance \sigma_p^2 of the data being quantized is equal to the variance \sigma_q^2 = 1 used for the design of the USQ. Since the USQ divides the real line into the granular region [-x_{\max}, x_{\max}] and the overload region outside of it, the introduced distortion D is composed of the granular distortion (denoted by D_g) and the overload distortion (denoted by D_{ov}). These components of the distortion can be evaluated according to [8,13]:

D_g = 2\sum_{i=1}^{N/2} \int_{x_{i-1}}^{x_i} (x - y_i)^2\, p(x)\, dx, (2)

D_{ov} = 2\int_{x_{\max}}^{\infty} (x - y_{N/2})^2\, p(x)\, dx. (3)

On the other hand, since N is large, it is appropriate to apply asymptotic quantization theory [13], where the following holds for the components of the distortion:

D_g = \frac{\Delta^2}{12} = \frac{x_{\max}^2}{3N^2}, (4)

D_{ov} = \exp(-\sqrt{2}\, x_{\max}). (5)

Clearly, for the total distortion we obtain:

D = D_g + D_{ov} = \frac{x_{\max}^2}{3N^2} + \exp(-\sqrt{2}\, x_{\max}). (6)

The quantizer designed in this way is referred to as the asymptotic USQ.
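The granular and overload terms of the asymptotic total distortion can be evaluated numerically with a short sketch, assuming the asymptotic expressions D_g = x_max^2/(3N^2) and D_ov = exp(−√2·x_max) for the unit-variance Laplacian source:

```python
import numpy as np

def distortion_asymptotic(x_max, N):
    """Total distortion of the asymptotic USQ for the unit-variance
    Laplacian source: granular term plus overload term."""
    d_g = x_max ** 2 / (3 * N ** 2)       # granular distortion, eq. (4)
    d_ov = np.exp(-np.sqrt(2) * x_max)    # overload distortion, eq. (5)
    return d_g + d_ov

# Example: R = 9 bps -> N = 2**9 levels; a larger support region shrinks
# the overload term but enlarges the granular term, hence a minimum exists.
N = 2 ** 9
for x_max in (5.0, 7.0, 9.0):
    print(f"x_max = {x_max}: D = {distortion_asymptotic(x_max, N):.3e}")
```

The trade-off printed by the loop is exactly what motivates optimizing x_max in the next step.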
Together with the distortion, another quantity used to express the performance of the quantizer is the SQNR, defined as [8,13]:

\mathrm{SQNR} = 10\log_{10}\frac{\sigma_p^2}{D}\ \text{[dB]}. (7)

Equation (6) shows that the distortion is a function of x_{\max}. Therefore, the aim is to find the optimal value of x_{\max} for which the distortion is minimal.

Lemma 1.
The optimal value of x_{\max} of the asymptotic USQ can be obtained using the following iterative rule:

x_{\max}^{(j)} = \frac{1}{\sqrt{2}} \ln\left(\frac{3N^2}{\sqrt{2}\, x_{\max}^{(j-1)}}\right), \quad j = 1, 2, \ldots (8)

Proof. By taking the first derivative of the distortion given by Eq. (6) with respect to x_{\max} and equating it with zero, we obtain:

\frac{2 x_{\max}}{3N^2} - \sqrt{2}\exp(-\sqrt{2}\, x_{\max}) = 0. (9)

Therefore, x_{\max} can be calculated as:

x_{\max} = \frac{1}{\sqrt{2}} \ln\left(\frac{3N^2}{\sqrt{2}\, x_{\max}}\right), (10)

which shows that x_{\max} can be specified iteratively, thus concluding the proof. As a good starting point x_{\max}^{(0)} of this iterative process we can choose the approximate solution for x_{\max} of the USQ that was proposed in [12]. In this way, applying the iterative process, we calculate x_{\max} in a more accurate manner than in [12].

Applying the iterative algorithm (8), we calculate the optimal values of x_{\max} for bit rates 9 ≤ R [bps] ≤ 16, where N = 2^R; the generated codeword contains one bit for the sign and R − 1 bits for the magnitude of the source sample. For these optimal x_{\max} we calculate the SQNR using (7). The calculated values of x_{\max} and SQNR are presented in Table 1. The dependences of the optimal values of x_{\max} and of the SQNR on the bit rate R are shown in Figure 1.
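The iterative rule (8) can be sketched as a simple fixed-point iteration. The starting point x0 = 1.0 here is a generic illustrative choice, not the approximate solution from [12], and the printed values are this sketch's outputs rather than a reproduction of Table 1:

```python
import numpy as np

def optimal_x_max(R, tol=1e-10, x0=1.0):
    """Fixed-point iteration for the optimal support threshold of the
    asymptotic USQ (unit-variance Laplacian source).

    Iterates x <- (1/sqrt(2)) * ln(3*N**2 / (sqrt(2)*x)) until convergence;
    the rule follows from setting the derivative of
    D(x) = x**2/(3*N**2) + exp(-sqrt(2)*x) to zero.
    """
    N = 2 ** R
    x = x0
    for _ in range(200):
        x_new = np.log(3 * N ** 2 / (np.sqrt(2) * x)) / np.sqrt(2)
        if abs(x_new - x) < tol:
            break
        x = x_new
    return x

def sqnr_db(R):
    """SQNR in dB at the optimal threshold, for the unit-variance source."""
    N = 2 ** R
    x = optimal_x_max(R)
    D = x ** 2 / (3 * N ** 2) + np.exp(-np.sqrt(2) * x)
    return 10 * np.log10(1.0 / D)

for R in range(9, 17):
    print(f"R = {R:2d} bps: x_max = {optimal_x_max(R):.4f}, SQNR = {sqnr_db(R):.2f} dB")
```

Because the logarithm varies slowly in x, the iteration converges in a handful of steps for any reasonable starting point.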
It can be seen that as R increases, both curves increase linearly with an approximately constant slope. In particular, a slope of nearly 5.5 dB/bit has been observed for the SQNR. To show the validity of the iterative process defined by (8), we can also perform a numerical optimization of x_{\max} for some specific value of R, by calculating the SQNR for different values of x_{\max} and finding the value of x_{\max} that gives the maximal SQNR. This numerical optimization of x_{\max} is shown in Figure 2 for R = 9 bps. The obtained pair of optimal values (x_{\max}, SQNR) perfectly matches the corresponding values from Table 1 obtained by the iterative process (8), proving its validity.

If we want to design USQ for some referent variance σ_q² ≠ 1, the maximal amplitude should be calculated by scaling with σ_q, i.e. as σ_q · x_max (11), where x_max represents the maximal amplitude from Table 1 obtained for the unit variance.

Uniform Scalar Quantizer in a Wide Dynamic Range
Let us consider a real situation in which the USQ designed for the Laplacian PDF with variance σ_q² = 1 is applied to data whose variance σ_p² differs from the designed-for variance, i.e. we have variance mismatch. The parameters of the quantizer (x_max, x_i, y_i) are the same as in Section 2, since they are determined for σ_q² = 1 during the design process of USQ. However, the variance mismatch will cause deterioration of performance (an increase of distortion and a decrease of SQNR) [16, 17], as examined below.
In the case of the variance mismatch, both the granular distortion D_g and the overload distortion D_ov will depend on σ_p², as given by (12) and (13).

If we define the degree of mismatch ρ = σ_p / σ_q as in [16], the total distortion becomes (14), from which SQNR can be expressed as (15).
Figure 3 analyzes SQNR of the optimal asymptotic USQ as a function of the degree of mismatch ρ in the range (−30 dB, 30 dB) for different bit rates (ranging from 9 to 16 bps).

Figure 4
Total, granular and overload signal-to-quantization noise ratio (SQNR, SQNR_g and SQNR_ov) versus ρ for the proposed USQ (for R = 9 bps).

Figure 3
SQNR versus ρ in a wide dynamic range of input data variances, for the proposed USQ with different bit rates (the optimal values of x_max from Table 1 are used).
We can see in Figure 3 that, as expected, higher SQNR values are obtained as the bit rate R increases. Note that SQNR peaks in the variance-matched case (σ_p² = σ_q², ρ = 1, corresponding to the 0 dB point on the log scale), but drops substantially if the variances are not matched, decreasing more rapidly for ρ > 1 (σ_p² > σ_q²) than for ρ < 1 (σ_p² < σ_q²), due to the dominance of the overload distortion for ρ > 0 dB, as will be seen from Figure 4.
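The asymmetry around ρ = 0 dB can be reproduced numerically. The sketch below quantizes Laplacian samples of standard deviation σ_p = ρ with a midrise USQ designed for unit variance; the support threshold x_max = 7.9 for R = 9 bps is an assumed value standing in for the corresponding Table 1 entry:

```python
import numpy as np

rng = np.random.default_rng(1)

def usq_quantize(x, x_max, R):
    """Midrise uniform quantizer with N = 2**R levels on [-x_max, x_max]."""
    N = 2 ** R
    delta = 2.0 * x_max / N
    idx = np.clip(np.floor((x + x_max) / delta), 0, N - 1)
    return -x_max + (idx + 0.5) * delta

def sqnr_db(x, xq):
    return 10.0 * np.log10(np.mean(x ** 2) / np.mean((x - xq) ** 2))

R, x_max = 9, 7.9            # x_max: assumed design value for sigma_q^2 = 1
results = {}
for rho_db in (-30, -10, 0, 10, 30):
    sigma_p = 10.0 ** (rho_db / 20.0)        # rho = sigma_p / sigma_q, sigma_q = 1
    x = rng.laplace(scale=sigma_p / np.sqrt(2.0), size=500_000)
    results[rho_db] = sqnr_db(x, usq_quantize(x, x_max, R))
    print(f"rho = {rho_db:+3d} dB: SQNR = {results[rho_db]:6.2f} dB")
```

The empirical SQNR peaks at ρ = 0 dB and falls off far more steeply on the positive side, where the overload distortion dominates, mirroring the behavior in Figure 3.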
Let us define SQNR_g, which depends only on D_g, as well as SQNR_ov, which depends only on D_ov, using (12) and (13), as given by (16) and (17). They are shown in Figure 4, together with the curve of the total SQNR defined by (15), which takes the total distortion D into account, with the aim of examining the influence of the granular distortion D_g and the overload distortion D_ov on SQNR. We can see very good matching of SQNR and SQNR_g for ρ < 1 (ρ [dB] < 0 dB), where the granular distortion is dominant and SQNR can be approximated by SQNR_g defined by (16). Conversely, for ρ > 1 (ρ [dB] > 0 dB), the overload distortion D_ov is dominant and SQNR matches SQNR_ov defined by (17) very well. In a small range of ρ around 1 (i.e. 0 dB), both distortion components contribute to the total SQNR, hence the full expression (15) should be used.

In order to compare the performance of the designed USQ, we employ the uniform quantizer used in fixed-point format representations [6, 14] as a baseline, conducting the analysis for R = 9 bps. In particular, the generated codeword of the baseline quantizer consists of one bit reserved for the sign (s = 1), n bits reserved for the integer part and m bits reserved for the fractional part of the fixed-point number. The maximal amplitude of this quantizer, denoted as x_max^fp, can be calculated as in (19).

Figure 5
SQNR vs. ρ for the proposed and baseline USQ with N = 512 levels (R = 9 bps).
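For intuition, the baseline can be sketched as rounding to a fixed-point grid. The split of the 8 magnitude bits into n integer and m fractional bits, the clipping amplitude 2^n, and the unit-variance test source are our illustrative assumptions (the paper's exact expression for x_max^fp is the one given in (19)):

```python
import numpy as np

rng = np.random.default_rng(2)

def fixed_point_quantize(x, n, m):
    """Signed fixed-point rounding: n integer bits, m fractional bits.
    The step is 2**-m; amplitudes are clipped at the assumed x_max_fp = 2**n."""
    step = 2.0 ** (-m)
    x_max_fp = 2.0 ** n
    return np.clip(np.round(x / step) * step, -x_max_fp, x_max_fp)

# R = 9 bps in total: one sign bit plus n + m = 8 magnitude bits
x = rng.laplace(scale=1.0 / np.sqrt(2.0), size=500_000)   # unit-variance source
results = {}
for n in (1, 2, 3, 4):
    m = 8 - n
    xq = fixed_point_quantize(x, n, m)
    results[n] = 10 * np.log10(np.mean(x ** 2) / np.mean((x - xq) ** 2))
    print(f"n = {n}, m = {m}: SQNR = {results[n]:5.2f} dB")
```

Under this sketch, the split n = 3, m = 5 (clipping at 8, close to the optimal x_max of the proposed USQ at R = 9 bps) gives the highest SQNR among the tested splits.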

Application in Neural Networks
This section deals with the application of the developed USQ for compression of NN weights and analyzes the effects of quantization on the performance of NN for the image classification task.

As a proof of concept, we use a Multi-Layer Perceptron (MLP) [33] that consists of input and output layers, with the goal of performing post-training quantization (i.e. quantizing the learned weights). The input of the NN is fed with the MNIST database [21], containing 60000 monochrome images of hand-written single digits of dimension 28 × 28 pixels, where 50000 images are used for training and 10000 images for testing. Note that the employed NN deals with the classification of grayscale images of hand-written digits into the corresponding category (0−9). Thus, the input layer and the output layer are constituted by 784 (28 × 28) and 10 (the number of digits) nodes, respectively. The softmax activation function is used at the output layer, while the learning rate and batch size are set to 0.5 and 250, respectively.

Figure 6
The histogram of weights of the trained NN.
The employed NN is trained for 20 epochs, achieving a prediction accuracy of 90.84%. The histogram of the learned weights (their total number amounts to 784 × 10 = 7840) is depicted in Figure 6. Observe that the distribution of weights can be approximated well by the Laplacian PDF with a mean value very close to zero.
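This Laplacian approximation can be sanity-checked by estimating the distribution parameters directly from the weights: for a zero-mean Laplacian, the maximum-likelihood scale estimate is the mean absolute deviation, and the variance equals 2b². The sketch below uses synthetic Laplacian samples as a stand-in for the 7840 trained weights (the scale 0.1 is an arbitrary illustrative value):

```python
import numpy as np

rng = np.random.default_rng(5)

# Stand-in for the 7840 trained weights (the real histogram is shown in Figure 6)
w = rng.laplace(loc=0.0, scale=0.1, size=7840)

mu_hat = np.mean(w)                   # should be close to zero
b_hat = np.mean(np.abs(w - mu_hat))   # ML estimate of the Laplacian scale b
var_hat = 2.0 * b_hat ** 2            # Laplacian variance equals 2 * b**2

print(f"mean ~ {mu_hat:.4f}, scale b ~ {b_hat:.4f}, variance ~ {var_hat:.4f}")
```

On real weights, mu_hat near zero and a histogram matching Laplace(mu_hat, b_hat) would justify designing the USQ for the Laplacian PDF.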
In practice, the variance of NN weights can vary in a wide range, hence the variance mismatch can occur between the variance of the weights and the variance used for the design of USQ. Hence, our aim is to examine the influence of this variance mismatch on the prediction accuracy of NN, applying the following procedure:
- firstly, design USQ for a specific value of R from 9 to 16 bps and quantize the learned weights, scaled to different variances so that the degree of mismatch ρ between the variance of the weights and the design variance is varied, measuring the quality of quantization by SQNR_w (20);
- apply the quantized weights for classification purposes on the test data (10000 images from the MNIST database [21]);
- calculate the prediction accuracy of NN with the quantized weights; just to recall, the prediction accuracy score obtained without quantization was 90.84%.
Based on the previous procedure, the influence of the variance mismatch on the quality of quantization of NN weights (i.e. on SQNR_w), as well as on the prediction accuracy of NN with quantized weights, can be found, as shown in Figures 7 and 8, respectively, for different values of ρ and for bit rates R from 9 to 16 bps.
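Lacking the trained model itself, the quantization step of this procedure can be sketched on synthetic data. The 7840 "weights" below are Laplacian samples with σ_p = 0.02 (ρ = −34 dB relative to a unit design variance), and the per-rate support thresholds x_max are assumed illustrative values approximating Table 1; a real experiment would additionally run the classifier with the quantized weights to obtain the accuracy:

```python
import numpy as np

rng = np.random.default_rng(3)

def usq_quantize(x, x_max, R):
    """Midrise uniform quantizer with N = 2**R levels on [-x_max, x_max]."""
    N = 2 ** R
    delta = 2.0 * x_max / N
    idx = np.clip(np.floor((x + x_max) / delta), 0, N - 1)
    return -x_max + (idx + 0.5) * delta

# Stand-in for the 784 x 10 = 7840 learned weights: zero-mean Laplacian
# samples with sigma_p = 0.02, i.e. rho = -34 dB w.r.t. the unit design variance
sigma_p = 0.02
weights = rng.laplace(scale=sigma_p / np.sqrt(2.0), size=7840)

# Assumed design values of x_max for sigma_q^2 = 1 (approximating Table 1)
x_max_design = {9: 7.9, 12: 10.6, 16: 14.3}
results = {}
for R, x_max in x_max_design.items():
    wq = usq_quantize(weights, x_max, R)   # variance-mismatched quantization
    results[R] = 10 * np.log10(np.mean(weights ** 2) /
                               np.mean((weights - wq) ** 2))
    print(f"R = {R:2d} bps: SQNR_w = {results[R]:5.1f} dB")
```

At this strong mismatch, R = 9 bps falls well below the 16 dB threshold discussed later, while R ≥ 12 bps stays above it, consistent with the mismatch analysis of Section 3.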
We can see from Figure 7 that SQNR_w increases approximately linearly with R for a given ρ, while the highest SQNR_w is achieved for ρ = 0 dB (the variance-matched scenario), as expected. Note also that, for a given R and ρ, SQNR_w from Figure 7 matches very well the theoretical SQNR presented in Figure 3.

Figure 8
The prediction accuracy for the image classification task on the MNIST dataset, after quantization of NN weights with different variances, using the designed USQ.

Figure 7
SQNR_w of USQ for bit rates in the range 9 ≤ R ≤ 16 [bps], for different values of ρ.
For this specific MLP neural network, it is empirically found that the decrease of the prediction accuracy due to quantization of weights is negligible if SQNR_w ≥ 16 dB. Based on (15), we can theoretically find the ranges of ρ where SQNR ≥ 16 dB, which are shown in Table 2 for the bit rates R from 9 to 16 bps. From Figures 7 and 8 and Table 2 we can derive the following conclusions:
- for ρ [dB] = 0 dB (i.e. ρ = 1), we obtain SQNR_w much higher than 16 dB for all 9 ≤ R [bps] ≤ 16 (Figure 7), providing excellent accuracy for all considered bit rates (Figure 8), almost the same as the accuracy in the full precision case; this is also theoretically expected, since it follows from Table 2 that ρ [dB] = 0 dB is acceptable for all considered bit rates;
- for ρ [dB] = -50 dB (i.e. ρ = 0.003), we have SQNR_w ≥ 16 dB for R ≥ 14 bps (Figure 7); also, the accuracy becomes acceptable for R ≥ 14 bps, while for R < 14 bps there is a significant drop of accuracy (Figure 8); this is fully in line with the theoretical results presented in Table 2, where ρ [dB] = -50 dB is acceptable for R ≥ 14 bps;
- for ρ [dB] = -34 dB (i.e. ρ = 0.02), we have SQNR_w ≥ 16 dB and a negligible loss of accuracy for R ≥ 11 bps, but a drop of accuracy for R < 11 bps; this is fully in line with the theoretical results from Table 2;
- for ρ [dB] = 20 dB (i.e. ρ = 10), a drop of accuracy is observed for all considered bit rates, in line with Table 2, where ρ [dB] = 20 dB is not acceptable for any of the considered bit rates.
We can see that there is a very good matching between the experimental results (shown in Figures 7 and 8) and the theoretical predictions presented in Table 2. Also, we can see that the variance mismatch is acceptable in a much wider range for negative ρ [dB] than for positive ρ [dB].
We can conclude that the range of the acceptable degree of variance mismatch ρ depends on the bit rate R. Increasing R allows a wider range of the variance mismatch degree ρ (while, on the other hand, decreasing the compression ratio). Hence, the bit rate R should be chosen based on the range of the degree of variance mismatch ρ for the specific application. We can define the following rule: we should choose the smallest R that allows maintaining high prediction accuracy over the given range of ρ for the specific application.
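This selection rule can be prototyped numerically. The sketch below designs x_max per rate by balancing assumed asymptotic granular (x²/(3N²)) and overload (e^(−√2·x)) distortion terms of the unit-variance Laplacian source — a stand-in for the paper's iterative rule (8), not necessarily identical to it — and returns the smallest R in 9–16 bps whose empirical SQNR stays above a 16 dB target at every specified mismatch point:

```python
import numpy as np

rng = np.random.default_rng(4)

def design_x_max(R, iters=20):
    """Support threshold balancing the assumed granular (x**2 / (3 N**2)) and
    overload (exp(-sqrt(2) x)) distortion terms; a stand-in for rule (8)."""
    N, x = 2 ** R, 1.0
    for _ in range(iters):
        x = np.log(3.0 * np.sqrt(2.0) * N ** 2 / (2.0 * x)) / np.sqrt(2.0)
    return x

def usq_quantize(x, x_max, R):
    N = 2 ** R
    delta = 2.0 * x_max / N
    idx = np.clip(np.floor((x + x_max) / delta), 0, N - 1)
    return -x_max + (idx + 0.5) * delta

def empirical_sqnr_db(rho, R):
    x = rng.laplace(scale=rho / np.sqrt(2.0), size=200_000)
    xq = usq_quantize(x, design_x_max(R), R)
    return 10 * np.log10(np.mean(x ** 2) / np.mean((x - xq) ** 2))

def smallest_rate(rho_db_points, target_db=16.0):
    """Smallest R in 9..16 bps keeping SQNR >= target at every mismatch point."""
    for R in range(9, 17):
        if all(empirical_sqnr_db(10.0 ** (r / 20.0), R) >= target_db
               for r in rho_db_points):
            return R
    return None

print(smallest_rate([-34, 0]))
```

For the endpoints ρ [dB] = −34 dB and 0 dB this prints 11, in line with the Table 2 finding that ρ [dB] = −34 dB is acceptable for R ≥ 11 bps.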
Finally, we provide the results for the baseline quantizer approach discussed in [6, 14], taking into account R = 9 bps. SQNR versus ρ, obtained from real data (the weights), can be found in Figure 9, where good agreement with the theoretical results in Figure 5 is observed. On the other hand, the prediction accuracy scores can be found in Figure 10, indicating that the MLP achieves better performance when using the USQ proposed in this paper.

Figure 10
The prediction accuracy scores of the MLP in the image classification task when the proposed USQ and two versions of the baseline USQ (R = 9 bps) are applied, for different values of ρ.


Conclusion

In this paper, USQ was designed for the Laplacian PDF and applied to the quantization of weights of an MLP neural network. Firstly, the quantizer was designed for a reference variance and its performance was evaluated for both the variance-matched and the variance-mismatched case. In particular, it should be highlighted that we proposed a very efficient iterative algorithm for calculating the most important parameter of the quantizer, x_max. Then, the designed USQ was applied to quantize the weights of an MLP used for classification of images from the MNIST database. A very good matching between experimental and theoretical results was demonstrated. It was also shown that almost the same prediction accuracy of the network as in the full precision case can be achieved using quantized weights, with a significant decrease of the bit rate R [bps]. A connection between the SQNR of weight quantization and the prediction accuracy of the neural network was established. Furthermore, variance-mismatched quantization of weights was considered (which is very important in practical applications, where the variance mismatch often occurs), showing that even in this case a negligible decrease of accuracy can be achieved by choosing an appropriate value of the bit rate R. Acceptable ranges of the degree of variance mismatch ρ [dB] were calculated for all considered bit rates 9 ≤ R [bps] ≤ 16, and the rule for choosing the right value of the bit rate R was defined: the smallest value of R that allows maintaining high prediction accuracy over the given range of ρ [dB] for the specific application should be chosen. In addition, the benefit of the proposed USQ over the baseline quantizers available in the literature has also been shown.