Golden Codes - armanexplorer planet

Practical code snippets for Django, Python, Bash, Git and All!

Quantization and layer fusion are two techniques used to make deep learning inference faster and more efficient. Here is a brief explanation of each:

Quantization: Reducing the numerical precision of a model's weights and activations, for example from 32-bit floats to 8-bit integers. This shrinks the model's memory footprint and lets hardware use faster integer arithmetic, usually at the cost of a small drop in accuracy.
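At its core, affine quantization maps floating-point values to low-precision integers through a scale and a zero point. A minimal sketch in plain Python (the function names and parameters here are illustrative, not a library API):

```python
def quantize_int8(values, scale, zero_point):
    """Affine int8 quantization: q = clamp(round(x / scale) + zero_point)."""
    quantized = []
    for x in values:
        q = round(x / scale) + zero_point
        quantized.append(max(-128, min(127, q)))  # clamp to the int8 range
    return quantized

def dequantize(quantized, scale, zero_point):
    """Recover approximate float values: x ~= (q - zero_point) * scale."""
    return [(q - zero_point) * scale for q in quantized]
```

In practice a framework such as PyTorch (`torch.ao.quantization`) chooses the scale and zero point from observed value ranges; the rounding and clamping above are where the small accuracy loss comes from.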

Layer Fusion: Merging adjacent operations (for example, a convolution followed by batch normalization and ReLU) into a single kernel. This removes intermediate memory reads and writes and reduces per-operation overhead, speeding up inference without changing the model's output.
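A classic instance of layer fusion is folding a batch-normalization layer into the preceding convolution or linear layer: since both are affine transforms, BN's scale and shift can be absorbed into the weights and biases ahead of time. A per-channel sketch with scalar weights for simplicity (the function name and argument layout are illustrative):

```python
import math

def fuse_linear_bn(weights, biases, gamma, beta, mean, var, eps=1e-5):
    """Fold per-channel batch norm into the preceding layer.

    BN(w*x + b) = gamma * ((w*x + b) - mean) / sqrt(var + eps) + beta
    which equals w'*x + b' with:
        w' = w * gamma / sqrt(var + eps)
        b' = (b - mean) * gamma / sqrt(var + eps) + beta
    """
    fused_w, fused_b = [], []
    for w, b, g, bt, m, v in zip(weights, biases, gamma, beta, mean, var):
        s = g / math.sqrt(v + eps)  # per-channel BN scale factor
        fused_w.append(w * s)
        fused_b.append((b - m) * s + bt)
    return fused_w, fused_b
```

The fused layer produces the same output as the two original layers but does half the work at inference time; PyTorch's `fuse_modules` recipe applies the same idea to real `Conv2d`/`BatchNorm2d` pairs.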

Overall, quantization and layer fusion are important techniques for optimizing deep learning inference. By reducing a model's memory footprint and computational requirements, they improve throughput and latency and lower energy consumption, typically with little or no loss of accuracy.
