<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Sig. Proc.</journal-id>
<journal-title>Frontiers in Signal Processing</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Sig. Proc.</abbrev-journal-title>
<issn pub-type="epub">2673-8198</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">1008812</article-id>
<article-id pub-id-type="doi">10.3389/frsip.2022.1008812</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Signal Processing</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>LVAC: Learned volumetric attribute compression for point clouds using coordinate based networks</article-title>
<alt-title alt-title-type="left-running-head">Isik et al.</alt-title>
<alt-title alt-title-type="right-running-head">
<ext-link ext-link-type="uri" xlink:href="https://doi.org/10.3389/frsip.2022.1008812">10.3389/frsip.2022.1008812</ext-link>
</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Isik</surname>
<given-names>Berivan</given-names>
</name>
<xref ref-type="aff" rid="aff1">
<sup>1</sup>
</xref>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1847966/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Chou</surname>
<given-names>Philip A.</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1913190/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Hwang</surname>
<given-names>Sung Jin</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Johnston</surname>
<given-names>Nick</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/1913189/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Toderici</surname>
<given-names>George</given-names>
</name>
<xref ref-type="aff" rid="aff2">
<sup>2</sup>
</xref>
</contrib>
</contrib-group>
<aff id="aff1">
<sup>1</sup>
<institution>Department of Electrical Engineering</institution>, <institution>Stanford University</institution>, <addr-line>Stanford</addr-line>, <addr-line>CA</addr-line>, <country>United States</country>
</aff>
<aff id="aff2">
<sup>2</sup>
<institution>Google</institution>, <addr-line>Mountain View</addr-line>, <addr-line>CA</addr-line>, <country>United States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1120289/overview">Frederic Dufaux</ext-link>, Universit&#xe9; Paris-Saclay, France</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1126641/overview">Giuseppe Valenzise</ext-link>, UMR8506 Laboratoire des Signaux et Syst&#xe8;mes (L2S), France</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1049768/overview">Stuart Perry</ext-link>, University of Technology Sydney, Australia</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Berivan Isik, <email>berivan.isik@stanford.edu</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Image Processing, a section of the journal Frontiers in Signal Processing</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>12</day>
<month>10</month>
<year>2022</year>
</pub-date>
<pub-date pub-type="collection">
<year>2022</year>
</pub-date>
<volume>2</volume>
<elocation-id>1008812</elocation-id>
<history>
<date date-type="received">
<day>01</day>
<month>08</month>
<year>2022</year>
</date>
<date date-type="accepted">
<day>26</day>
<month>09</month>
<year>2022</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2022 Isik, Chou, Hwang, Johnston and Toderici.</copyright-statement>
<copyright-year>2022</copyright-year>
<copyright-holder>Isik, Chou, Hwang, Johnston and Toderici</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>We consider the attributes of a point cloud as samples of a vector-valued volumetric function at discrete positions. To compress the attributes given the positions, we compress the parameters of the volumetric function. We model the volumetric function by tiling space into blocks, and representing the function over each block by shifts of a coordinate-based, or implicit, neural network. Inputs to the network include both spatial coordinates and a latent vector per block. We represent the latent vectors using coefficients of the region-adaptive hierarchical transform (RAHT) used in the MPEG geometry-based point cloud codec G-PCC. The coefficients, which are highly compressible, are rate-distortion optimized by back-propagation through a rate-distortion Lagrangian loss in an auto-decoder configuration. The result outperforms the transform in the current standard, RAHT, by 2&#x2013;4&#xa0;dB and a recent <italic>non-volumetric</italic> method, Deep-PCAC, by 2&#x2013;5&#xa0;dB at the same bit rate. This is the first work to compress volumetric functions represented by local coordinate-based neural networks. As such, we expect it to be applicable beyond point clouds, for example to compression of high-resolution neural radiance fields.</p>
</abstract>
<kwd-group>
<kwd>point cloud attribute compression</kwd>
<kwd>volumetric functions</kwd>
<kwd>implicit neural networks</kwd>
<kwd>end-to-end optimization</kwd>
<kwd>coordinate based networks</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Following the recent success of implicit networks, a.k.a. coordinate-based networks (CBNs), in representing a variety of signals such as neural radiance fields (<xref ref-type="bibr" rid="B57">Mildenhall et al., 2020</xref>; <xref ref-type="bibr" rid="B96">Yu A. et al., 2021</xref>; <xref ref-type="bibr" rid="B10">Barron et al., 2021</xref>; <xref ref-type="bibr" rid="B32">Hedman et al., 2021</xref>; <xref ref-type="bibr" rid="B39">Knodt et al., 2021</xref>; <xref ref-type="bibr" rid="B78">Srinivasan et al., 2021</xref>; <xref ref-type="bibr" rid="B101">Zhang et al., 2021</xref>), point clouds (<xref ref-type="bibr" rid="B24">Fujiwara and Hashimoto, 2020</xref>), meshes (<xref ref-type="bibr" rid="B60">Park et al., 2019a</xref>; <xref ref-type="bibr" rid="B54">Mescheder et al., 2019</xref>; <xref ref-type="bibr" rid="B77">Sitzmann et al., 2020</xref>; <xref ref-type="bibr" rid="B49">Martel et al., 2021</xref>; <xref ref-type="bibr" rid="B84">Takikawa et al., 2021</xref>), and images (<xref ref-type="bibr" rid="B49">Martel et al., 2021</xref>), an end-to-end compression framework for CBN-based representations has become necessary. Motivated by this, we propose the first <italic>end-to-end</italic> learned compression framework for volumetric functions represented by CBNs, focusing on point cloud attributes, since other signal types lack established baselines for comparison. We call our method Learned Volumetric Attribute Compression (LVAC). Point clouds are a fundamental data type for sampled 3D data, and hence play a critical role in applications such as mapping and navigation, virtual and augmented reality, telepresence, and cultural heritage preservation (<xref ref-type="bibr" rid="B52">Mekuria et al., 2017</xref>; <xref ref-type="bibr" rid="B61">Park et al., 2019a</xref>; <xref ref-type="bibr" rid="B65">Pierdicca et al., 2020</xref>; <xref ref-type="bibr" rid="B81">Sun et al., 2020</xref>).
Given the volume of data in such applications, compression is important for both storage and communication. Indeed, standards for point cloud compression are underway in both MPEG and JPEG (<xref ref-type="bibr" rid="B75">Schwarz et al., 2019</xref>; <xref ref-type="bibr" rid="B38">Jang et al., 2019</xref>; <xref ref-type="bibr" rid="B25">Graziosi et al., 2020</xref>; <xref ref-type="bibr" rid="B21">3DG, 2020a</xref>).</p>
<p>3D point clouds, such as those shown in <xref ref-type="fig" rid="F1">Figure 1</xref>, each consist of a set of points {(<bold>x</bold>
<sub>
<italic>i</italic>
</sub>, <bold>y</bold>
<sub>
<italic>i</italic>
</sub>)}, where <bold>x</bold>
<sub>
<italic>i</italic>
</sub> is the 3D position of the <italic>i</italic>th point and <bold>y</bold>
<sub>
<italic>i</italic>
</sub> is a vector of attributes associated with the point. Attributes typically include color components, e.g., RGB, but may alternatively include reflectance, normals, transparency, density, spherical harmonics, and so forth. Commonly (<xref ref-type="bibr" rid="B98">Zhang et al., 2014</xref>; <xref ref-type="bibr" rid="B16">Cohen et al., 2016</xref>; <xref ref-type="bibr" rid="B17">de Queiroz and Chou, 2016</xref>; <xref ref-type="bibr" rid="B88">Thanou et al., 2016</xref>; <xref ref-type="bibr" rid="B18">de Queiroz and Chou, 2017</xref>; <xref ref-type="bibr" rid="B63">Pavez et al., 2018</xref>; <xref ref-type="bibr" rid="B75">Schwarz et al., 2019</xref>; <xref ref-type="bibr" rid="B15">Chou et al., 2020</xref>; <xref ref-type="bibr" rid="B40">Krivoku&#x107;a et al., 2020</xref>), point cloud compression is broken into two steps: compression of the point cloud positions, called the <italic>geometry</italic>, and compression of the point cloud <italic>attributes</italic>. As illustrated in <xref ref-type="fig" rid="F2">Figure 2</xref>, once the decoder decodes the geometry (possibly with loss), the encoder encodes the attributes conditioned on the decoded geometry. In this work, we focus on this second step, namely attribute compression conditioned on the decoded geometry, assuming geometry compression (such as <xref ref-type="bibr" rid="B40">Krivoku&#x107;a et al., 2020</xref>; <xref ref-type="bibr" rid="B87">Tang et al., 2020</xref>) in the first step. It is important to note that this conditioning is crucial in achieving good attribute compression. This will become one of the themes of this paper.</p>
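<p>To make this data model concrete, the following minimal sketch (illustrative only, not the codec used in this paper) builds a point cloud as parallel arrays of positions and color attributes, quantizes the positions to an integer voxel grid as a geometry coder typically would, and merges the attributes of points falling into the same voxel by averaging:</p>

```python
import numpy as np

# A point cloud as parallel arrays: positions x_i in R^3 (geometry) and
# per-point attribute vectors y_i (here, RGB colors). Sizes are illustrative.
rng = np.random.default_rng(0)
num_points = 1000
positions = rng.uniform(0.0, 1.0, size=(num_points, 3))           # geometry
colors = rng.integers(0, 256, size=(num_points, 3)).astype(float)  # attributes

# Geometry coding typically quantizes positions to a 2^depth voxel grid;
# attribute coding is then conditioned on these decoded voxel positions.
depth = 10
voxels = np.floor(positions * (1 << depth)).astype(np.int64)

# Pack each voxel index into one integer key so duplicates can be merged;
# attributes of points sharing a voxel are averaged.
keys = (voxels[:, 0] << (2 * depth)) | (voxels[:, 1] << depth) | voxels[:, 2]
unique_keys, inverse = np.unique(keys, return_inverse=True)
counts = np.bincount(inverse)
merged_colors = np.stack(
    [np.bincount(inverse, weights=colors[:, c]) / counts for c in range(3)],
    axis=1)
```

<p>The decoded voxel positions (the unique keys) are exactly the side information on which the attribute coder is conditioned.</p>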
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>Point clouds <italic>rock</italic>, <italic>chair</italic>, <italic>scooter</italic>, <italic>juggling</italic>, <italic>basketball1</italic>, <italic>basketball2</italic>, and <italic>jacket</italic>.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g001.tif"/>
</fig>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>Point cloud codec: a geometry encoder and decoder, and an attribute encoder and decoder conditioned on the decoded geometry.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g002.tif"/>
</fig>
<p>Following the successful application of neural networks to image compression (<xref ref-type="bibr" rid="B6">Ball&#xe9; et al., 2016</xref>; <xref ref-type="bibr" rid="B89">Toderici et al., 2016</xref>; <xref ref-type="bibr" rid="B7">Ball&#xe9; et al., 2017</xref>; <xref ref-type="bibr" rid="B90">Toderici et al., 2017</xref>; <xref ref-type="bibr" rid="B4">Ball&#xe9;, 2018</xref>; <xref ref-type="bibr" rid="B8">Ball&#xe9; et al., 2018</xref>; <xref ref-type="bibr" rid="B58">Minnen et al., 2018</xref>; <xref ref-type="bibr" rid="B3">Balle et al., 2020</xref>; <xref ref-type="bibr" rid="B53">Mentzer et al., 2020</xref>; <xref ref-type="bibr" rid="B33">Hu et al., 2021</xref>), neural networks have also been applied to point cloud geometry compression, demonstrating significant gains over traditional techniques (<xref ref-type="bibr" rid="B95">Yan et al., 2019</xref>; <xref ref-type="bibr" rid="B68">Quach et al., 2019</xref>; <xref ref-type="bibr" rid="B26">Guarda et al., 2019a</xref>,<xref ref-type="bibr" rid="B28">b</xref>; <xref ref-type="bibr" rid="B27">Guarda et al., 2020</xref>; <xref ref-type="bibr" rid="B87">Tang et al., 2020</xref>; <xref ref-type="bibr" rid="B67">Quach et al., 2020b</xref>). However, the same cannot be said for point cloud attribute compression. To our knowledge, our work is among the first to use neural networks for point cloud attribute compression. Previous attempts have been hindered by the inability to properly condition the attribute compression on the decoded geometry, thus leading to poor results. In our work, we show that proper conditioning improves attribute compression, reducing the BD-rate by over 30%. This results in a gain of 2&#x2013;4&#xa0;dB in the reconstructed colors over region-adaptive hierarchical transform (RAHT) coding (<xref ref-type="bibr" rid="B17">de Queiroz and Chou, 2016</xref>), the transform used in MPEG&#x2019;s geometry-based point cloud compression standard, G-PCC. 
Additionally, we compare our method with Deep-PCAC (<xref ref-type="bibr" rid="B76">Sheng et al., 2021</xref>), a recent learned framework that is <italic>not volumetric</italic>, and outperform it by 2&#x2013;5&#xa0;dB.</p>
<p>Although learned image compression systems have been based on convolutional neural networks (CNNs), in this work we use what have come to be called <italic>coordinate based networks</italic> (CBNs), also called <italic>implicit networks</italic>. A CBN is a network, such as a multilayer perceptron (MLP), whose inputs include the coordinates of the spatial domain of interest, e.g., <inline-formula id="inf1">
<mml:math id="m1">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>. We use lightweight MLPs with one hidden layer as CBNs. Keeping the CBNs relatively small provides (1) efficient training/inference and (2) negligible overhead for representing the CBN. A CBN can directly represent a nonlinear function of the spatial coordinates <bold>x</bold>, possibly indexed with a latent or feature vector <bold>z</bold>, as <bold>y</bold> &#x3d; <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold>) or <bold>y</bold> &#x3d; <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold>; <bold>z</bold>). CBNs have recently come to the fore in accurately representing geometry and spatial phenomena such as radiance fields. However, while there has been an explosion of work using CBNs for <italic>representing</italic> specific objects and scenes (<xref ref-type="bibr" rid="B60">Park et al., 2019a</xref>; <xref ref-type="bibr" rid="B54">Mescheder et al., 2019</xref>; <xref ref-type="bibr" rid="B57">Mildenhall et al., 2020</xref>; <xref ref-type="bibr" rid="B77">Sitzmann et al., 2020</xref>; <xref ref-type="bibr" rid="B96">Yu A. et al., 2021</xref>; <xref ref-type="bibr" rid="B10">Barron et al., 2021</xref>; <xref ref-type="bibr" rid="B32">Hedman et al., 2021</xref>; <xref ref-type="bibr" rid="B39">Knodt et al., 2021</xref>; <xref ref-type="bibr" rid="B49">Martel et al., 2021</xref>; <xref ref-type="bibr" rid="B78">Srinivasan et al., 2021</xref>; <xref ref-type="bibr" rid="B84">Takikawa et al., 2021</xref>; <xref ref-type="bibr" rid="B101">Zhang et al., 2021</xref>), none of that work focuses on <italic>compressing</italic> those representations. (Two exceptions may be (<xref ref-type="bibr" rid="B11">Bird et al., 2021</xref>; <xref ref-type="bibr" rid="B36">Isik, 2021</xref>), which simply apply model compression to the CBNs.) Good lossy compression is nontrivial, and must make an optimal trade-off between the fidelity of the reconstruction and the number of bits used in its binary representation. We show that na&#xef;ve scalar quantization and entropy coding of the parameters <italic>&#x3b8;</italic> and/or latent vectors <bold>z</bold> lead to very poor results, and that superior results can be achieved by proper <italic>orthonormalization</italic> prior to uniform scalar quantization. In addition, for the best rate-distortion performance, the entropy model and CBN must be <italic>jointly</italic> trained to minimize a loss function that penalizes not only large distortion (or error) but also large bit rate. 
We achieve this <italic>via</italic> a <italic>rate-distortion Lagrangian loss</italic>. Our main contributions include the following:<list list-type="simple">
<list-item>
<p>&#x2022; We are the first to <italic>compress volumetric functions</italic> modeled by <italic>local coordinate based networks</italic>, by performing an <italic>end-to-end optimization</italic> of a rate-distortion Lagrangian loss function, thereby offering scalable, high fidelity reconstructions even at low bit rates. We show that na&#xef;ve uniform scalar quantization and entropy coding lead to poor results.</p>
</list-item>
<list-item>
<p>&#x2022; We apply our framework to compress point cloud attributes. (It is applicable to other signals as well such as neural radiance fields, meshes, and images.) Hence, we are the first to compress point cloud <italic>attributes</italic> using CBNs. Our solution allows the network to interpolate the reconstructed attributes <italic>continuously</italic> across space, and offers a 2&#x2013;5&#xa0;dB improvement over our learned baseline Deep-PCAC (<xref ref-type="bibr" rid="B76">Sheng et al., 2021</xref>) and a 2&#x2013;4&#xa0;dB improvement over our linear baseline, RAHT (<xref ref-type="bibr" rid="B17">de Queiroz and Chou, 2016</xref>) with adaptive Run-Length Golomb-Rice (RLGR) entropy coding&#x2014;the transform in the latest MPEG G-PCC standard.</p>
</list-item>
<list-item>
<p>&#x2022; We show formulas for orthonormalizing the coefficients to achieve over a 30% reduction in bit rate. Note that appropriate orthonormalization is an essential (and nontrivial) component of all compression pipelines.</p>
</list-item>
</list>
</p>
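<p>A coordinate-based network of the kind described above can be sketched as follows. This is a toy stand-in with illustrative layer sizes and random initialization, not the trained network used in LVAC: an MLP with one hidden layer mapping the concatenation of a 3D coordinate <bold>x</bold> and a per-block latent vector <bold>z</bold> to an attribute vector <bold>y</bold> &#x3d; <italic>f</italic><sub><italic>&#x3b8;</italic></sub>(<bold>x</bold>; <bold>z</bold>):</p>

```python
import numpy as np

# Toy coordinate-based network (CBN): an MLP with a single hidden layer
# mapping (coordinate x, latent z) -> attribute vector y. All sizes and
# the random initialization below are illustrative assumptions.
COORD_DIM, LATENT_DIM, HIDDEN_DIM, ATTR_DIM = 3, 8, 32, 3

rng = np.random.default_rng(0)
theta = {
    "W1": rng.normal(0.0, 0.1, (COORD_DIM + LATENT_DIM, HIDDEN_DIM)),
    "b1": np.zeros(HIDDEN_DIM),
    "W2": rng.normal(0.0, 0.1, (HIDDEN_DIM, ATTR_DIM)),
    "b2": np.zeros(ATTR_DIM),
}

def cbn(x, z, theta):
    """y = f_theta(x; z): concatenate coordinate and latent, one ReLU layer."""
    h = np.maximum(np.concatenate([x, z]) @ theta["W1"] + theta["b1"], 0.0)
    return h @ theta["W2"] + theta["b2"]

y = cbn(np.array([0.25, 0.5, 0.75]), np.zeros(LATENT_DIM), theta)
```

<p>Keeping the network this small means its parameters <italic>&#x3b8;</italic> add negligible overhead to the bitstream.</p>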
<p>
<xref ref-type="sec" rid="s2">Section 2</xref> provides a brief overview of our Learned Volumetric Attribute Compression (LVAC) framework without going into the details, <xref ref-type="sec" rid="s3">Section 3</xref> covers related work, <xref ref-type="sec" rid="s4">Section 4</xref> details our framework, <xref ref-type="sec" rid="s5">Section 5</xref> reports experimental results, and <xref ref-type="sec" rid="s6">Section 6</xref> discusses and concludes. We provide a list of notations used in the paper in <xref ref-type="sec" rid="s12">Supplementary Table S1</xref>.</p>
</sec>
<sec id="s2">
<title>2 Overview of the framework</title>
<p>The goal of this work is to develop a volumetric point cloud attribute compression framework that uses the decoded geometry as side information. Unlike standard linear transform coding approaches such as RAHT, our approach performs non-linear interpolation through the learned volumetric functions modeled by neural networks.</p>
<p>Our approach is summarized in <xref ref-type="fig" rid="F3">Figure 3</xref>, where we jointly train 1) the transform coefficients <bold>V</bold> for the point cloud blocks, 2) quantizer stepsizes, 3) an entropy coder, and 4) a CBN <italic>via</italic> backpropagation through a Lagrangian loss function <italic>D</italic> &#x2b; <italic>&#x3bb;R</italic>. Here <italic>D</italic> is the distortion between the reconstructed attributes and the true attributes (color attributes in this work), and <italic>R</italic> is the estimated entropy of the quantized transform coefficients <inline-formula id="inf2">
<mml:math id="m2">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>, computed by our neural entropy model, which serves as a differentiable proxy for the non-differentiable entropy coder. The quantized transform coefficients <inline-formula id="inf3">
<mml:math id="m3">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> are inverse transformed <italic>via</italic> a linear synthesis matrix <italic>T</italic>
<sub>
<italic>s</italic>
</sub> as in standard transform coding frameworks. Notice, however, that we omit the usual analysis transform prior to quantization. This is because we directly learn the transform coefficients <bold>V</bold> through optimization for each point cloud <xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref>. These learned transform coefficients <bold>V</bold> are then quantized and synthesized into <italic>latent vectors</italic> <inline-formula id="inf4">
<mml:math id="m4">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> as shown in the figure. While the synthesized latent vector <inline-formula id="inf5">
<mml:math id="m5">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> for the block in which a query point resides could be output directly as the reconstructed attributes for that point, we go one step further and introduce a non-linear operation: we feed the synthesized latent vector, together with the 3D location of the query point <bold>x</bold>, into a small neural network, namely a CBN. This network outputs the reconstructed attributes, which are used in our distortion calculation. Finally, our Lagrangian loss is calculated from the estimated rate and the distortion, and this loss is back-propagated through all the blocks in <xref ref-type="fig" rid="F3">Figure 3</xref>. As we will explain in <xref ref-type="sec" rid="s4">Section 4</xref>, the synthesis matrix <italic>T</italic>
<sub>
<italic>s</italic>
</sub> is not learned; rather, it is a fixed function of the geometry, as in RAHT. In fact, our synthesis transform can be regarded as the RAHT synthesis transform operating on latent vectors rather than attributes. Thus, we compress our latent vectors conditioned on the geometry as side information. All components besides the synthesis matrix, such as the transform coefficients <bold>V</bold>, the quantizer stepsizes, the entropy model, and the CBN, are jointly trained through the Lagrangian loss function, which optimizes both the reconstruction quality and the bit rate.</p>
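<p>The forward pass just described can be sketched as follows. This is a schematic with stand-ins, not the LVAC implementation: the synthesis matrix here is an identity placeholder (in our framework it is the fixed, geometry-dependent RAHT synthesis transform), the distortion is measured against a dummy target in place of the CBN output, and the rate proxy is a unit-Gaussian entropy model rather than the learned one:</p>

```python
import numpy as np

# Schematic of the forward pass: quantize learnable coefficients V,
# synthesize per-block latents Z_hat, form the Lagrangian loss D + lambda*R.
rng = np.random.default_rng(0)
num_blocks, latent_dim = 4, 8

V = rng.normal(size=(num_blocks, latent_dim))  # learnable transform coefficients
T_s = np.eye(num_blocks)                       # placeholder synthesis matrix
                                               # (fixed, geometry-dependent in LVAC)

V_hat = np.round(V)                            # uniform scalar quantization
Z_hat = T_s @ V_hat                            # synthesized per-block latents

# Stand-in distortion D (vs. a dummy attribute target) and rate R in bits
# under a unit-Gaussian entropy model.
target = rng.normal(size=(num_blocks, latent_dim))
D = float(np.mean((Z_hat - target) ** 2))
R = float(np.sum(0.5 * V_hat ** 2 + 0.5 * np.log(2.0 * np.pi)) / np.log(2.0))
lam = 0.01
loss = D + lam * R
```

<p>In training, differentiable proxies replace the rounding and the entropy coder so that this loss can be back-propagated to <bold>V</bold>.</p>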
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>Querying attributes at position <inline-formula id="inf6">
<mml:math id="m6">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>. The block <inline-formula id="inf7">
<mml:math id="m7">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> at target level <italic>L</italic> in which <bold>x</bold> is located is identified by traversing a binary space partition tree. The &#x201c;learnable&#x201d; transform coefficients <bold>V</bold> are first quantized by rounding to obtain <inline-formula id="inf8">
<mml:math id="m8">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>&#x230a;</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo>&#x2309;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, and then the latent vectors are reconstructed as <inline-formula id="inf9">
<mml:math id="m9">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>. <bold>V</bold> is optimized by back-propagating <italic>D</italic>(<italic>&#x3b8;</italic>, <bold>Z</bold>) &#x2b; <italic>&#x3bb;R</italic>(<italic>&#x3b8;</italic>, <bold>Z</bold>) through all the components in the figure. The pipeline uses differentiable proxies for the quantizer and entropy coder. In <xref ref-type="fig" rid="F4">Figure 4</xref>, we give more details on the quantization and the orthonormalization steps, which are omitted in this figure for simplicity.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g003.tif"/>
</fig>
<p>As we dive into the details of <xref ref-type="fig" rid="F3">Figure 3</xref> in the following sections, we try to address the following challenges:<list list-type="simple">
<list-item>
<p>&#x2022; It is essential to ensure that the coefficients are orthonormalized prior to quantization; otherwise, the quantization error would accumulate across different channels. To achieve this, we introduce orthonormalization and de-orthonormalization steps before and after the quantization.</p>
</list-item>
<list-item>
<p>&#x2022; Both quantization and entropy coding are non-differentiable operations. Thus, we need to use differentiable proxies to perform backpropagation during training.</p>
</list-item>
</list>
</p>
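<p>The second challenge is commonly addressed with the following trick from the learned-compression literature (a generic sketch, not LVAC&#x2019;s exact code): during training, the rounding quantizer is replaced by additive uniform noise of the same width, which is differentiable in its input, while hard rounding is used at test time:</p>

```python
import numpy as np

# Differentiable quantizer proxy: additive uniform noise in [-0.5, 0.5)
# during training (gradients pass through), hard rounding at test time.
rng = np.random.default_rng(0)

def quantize(v, training):
    if training:
        return v + rng.uniform(-0.5, 0.5, size=v.shape)  # noise proxy
    return np.round(v)                                   # real quantizer

v = rng.normal(size=1000)
train_err = quantize(v, training=True) - v
test_err = quantize(v, training=False) - v
# In both cases the error stays within half a quantization step.
```

<p>Because the noise matches the quantizer&#x2019;s error distribution, the training loss remains a faithful surrogate for the test-time rate-distortion behavior.</p>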
</sec>
<sec id="s3">
<title>3 Related work</title>
<sec id="s3-1">
<title>3.1 Learned image compression</title>
<p>Using neural networks for good compression is non-trivial. Simply truncating the latent vectors of an existing representation to a certain number of bits is likely to fail, if only because small quantization errors in the latents may easily map into large errors in their reconstructions. Moreover, the entropy of the quantized latents is a more important determinant of the bit rate than the total number of coefficients in the latent vectors or the number of bits in their binary representation. Early work on learned image compression could barely exceed the rate-distortion performance of JPEG on low-quality 32 &#xd7; 32 thumbnails (<xref ref-type="bibr" rid="B89">Toderici et al., 2016</xref>). However, over the years the rate-distortion performance has consistently improved (<xref ref-type="bibr" rid="B6">Ball&#xe9; et al., 2016</xref>; <xref ref-type="bibr" rid="B7">Ball&#xe9; et al., 2017</xref>; <xref ref-type="bibr" rid="B90">Toderici et al., 2017</xref>; <xref ref-type="bibr" rid="B4">Ball&#xe9;, 2018</xref>; <xref ref-type="bibr" rid="B8">Ball&#xe9; et al., 2018</xref>; <xref ref-type="bibr" rid="B58">Minnen et al., 2018</xref>; <xref ref-type="bibr" rid="B3">Balle et al., 2020</xref>; <xref ref-type="bibr" rid="B14">Cheng et al., 2020</xref>; <xref ref-type="bibr" rid="B33">Hu et al., 2021</xref>) to the point where the best learned image codecs outperform the latest video standard (VVC) in PSNR, albeit at much greater complexity (<xref ref-type="bibr" rid="B30">Guo et al., 2021</xref>), and greatly outperform conventional image codecs (by over 2&#xd7; reduction in bit rate) at the same perceptual distortion (<xref ref-type="bibr" rid="B53">Mentzer et al., 2020</xref>). 
Essentially all current competitive learned image codecs are versions of nonlinear transform coding (<xref ref-type="bibr" rid="B3">Balle et al., 2020</xref>), in which the bottleneck latents in an auto-encoder are uniformly scalar quantized and entropy coded, for transmission to a decoder. The decoder uses a convolutional neural network as a synthesis transform. The codec parameters <italic>&#x3b8;</italic> are trained end-to-end through a differentiable proxy for the quantizer, often modeled as additive uniform noise. The loss function is a Lagrangian <italic>L</italic>(<italic>&#x3b8;</italic>) &#x3d; <italic>D</italic>(<italic>&#x3b8;</italic>) &#x2b; <italic>&#x3bb;R</italic>(<italic>&#x3b8;</italic>), where <italic>D</italic>(<italic>&#x3b8;</italic>) and <italic>R</italic>(<italic>&#x3b8;</italic>) are the expected distortion and bit rate. In this work, we use the same kinds of proxies for uniform scalar quantization and entropy coding as in learned image compression, and train our representation using a similar loss function.</p>
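<p>The rate term <italic>R</italic>(<italic>&#x3b8;</italic>) in such a loss is typically estimated by integrating a continuous density over the unit-width quantization bin around each quantized value. The sketch below assumes, purely for illustration, a standard-normal density in place of a learned entropy model, and computes the resulting code length in bits:</p>

```python
import math

# Rate proxy: P(v_hat) = CDF(v_hat + 0.5) - CDF(v_hat - 0.5) is the
# probability mass of the unit bin around a quantized value v_hat, and
# -log2 P(v_hat) is its estimated code length in bits. The standard-normal
# density here is an illustrative stand-in for a learned entropy model.
def std_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def rate_bits(v_hat):
    p = std_normal_cdf(v_hat + 0.5) - std_normal_cdf(v_hat - 0.5)
    return -math.log2(p)
```

<p>Values near the mode of the model are cheap to code, while outliers cost many more bits; this is what drives the optimization toward compressible latents.</p>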
</sec>
<sec id="s3-2">
<title>3.2 Coordinate based networks</title>
<p>Early work that used coordinate based networks (<xref ref-type="bibr" rid="B60">Park et al., 2019a</xref>; <xref ref-type="bibr" rid="B54">Mescheder et al., 2019</xref>; <xref ref-type="bibr" rid="B77">Sitzmann et al., 2020</xref>), exemplified by DeepSDF (<xref ref-type="bibr" rid="B61">Park et al., 2019b</xref>), focused on representing geometry <italic>implicitly</italic>, e.g., as the <italic>c</italic>-level set <inline-formula id="inf10">
<mml:math id="m10">
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>:</mml:mo>
<mml:mi>c</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
<mml:mo>&#x2282;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> of a function <inline-formula id="inf11">
<mml:math id="m11">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>:</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2192;</mml:mo>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:math>
</inline-formula> modeled by a neural network, where <inline-formula id="inf12">
<mml:math id="m12">
<mml:mi mathvariant="bold">z</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> is a global latent vector. As a result, such networks were called &#x201c;implicit&#x201d; networks. Much of this work focused on auto-decoder architectures, in which the latent vector <bold>z</bold> was determined for each instance by back propagation through the loss function. The loss function <italic>L</italic> (<italic>&#x3b8;</italic>, <bold>z</bold>) measured a pointwise error between samples <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold>
<sub>
<italic>i</italic>
</sub>; <bold>z</bold>) of the network and samples <italic>f</italic> (<bold>x</bold>
<sub>
<italic>i</italic>
</sub>) of a ground truth function, such as the signed distance function (SDF).</p>
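As a minimal sketch of the auto-decoder idea described above, the following toy example freezes a tiny network <italic>f</italic><sub><italic>&#x3b8;</italic></sub> and fits only the latent vector <bold>z</bold> to SDF samples by gradient descent on the pointwise loss; the architecture, latent dimension, and finite-difference gradients are illustrative assumptions, not the setup of the cited works.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 8                                   # latent dimension (illustrative)
W1 = rng.normal(scale=0.1, size=(16, 3 + C))
w2 = rng.normal(scale=0.1, size=16)

def f_theta(x, z):
    """Tiny coordinate-based network mapping (x, z) to a scalar SDF value."""
    h = np.tanh(W1 @ np.concatenate([x, z]))
    return w2 @ h

def loss(z, xs, sdf_vals):
    """Pointwise squared error between network samples and ground-truth samples."""
    return sum((f_theta(x, z) - s) ** 2 for x, s in zip(xs, sdf_vals))

# Ground truth: SDF samples of a sphere of radius 0.5.
xs = rng.uniform(-1.0, 1.0, size=(32, 3))
sdf_vals = np.linalg.norm(xs, axis=1) - 0.5

# "Auto-decoding": theta (W1, w2) stays frozen; only the instance latent z
# is optimized, here with finite-difference gradients for brevity.
z = np.zeros(C)
for _ in range(200):
    grad = np.zeros(C)
    for j in range(C):
        e = np.zeros(C)
        e[j] = 1e-4
        grad[j] = (loss(z + e, xs, sdf_vals) - loss(z - e, xs, sdf_vals)) / 2e-4
    z -= 0.05 * grad
```

After the loop, the optimized latent fits the SDF samples better than the zero initialization, mirroring the per-instance optimization of <bold>z</bold> in the auto-decoder setting.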
<p>Later work that used CBNs, exemplified by NeRF (<xref ref-type="bibr" rid="B57">Mildenhall et al., 2020</xref>; <xref ref-type="bibr" rid="B10">Barron et al., 2021</xref>), used the networks to model not SDFs but rather other, vector-valued, volumetric functions, including color, density, normals, BRDF parameters, and specular features (<xref ref-type="bibr" rid="B96">Yu A. et al., 2021</xref>; <xref ref-type="bibr" rid="B32">Hedman et al., 2021</xref>; <xref ref-type="bibr" rid="B39">Knodt et al., 2021</xref>; <xref ref-type="bibr" rid="B78">Srinivasan et al., 2021</xref>; <xref ref-type="bibr" rid="B101">Zhang et al., 2021</xref>). Since these networks were no longer used to represent solutions implicitly, their name started to shift to &#x201c;coordinate-based&#x201d; networks, e.g., (<xref ref-type="bibr" rid="B86">Tancik et al., 2021</xref>). An important innovation from this cohort was measuring the loss <italic>L</italic>(<italic>&#x3b8;</italic>) not pointwise between samples of <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub> and some ground truth volumetric function <italic>f</italic>, but rather between volumetric renderings (to images) of <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub> and <italic>f</italic>, the latter renderings being ground truth images.</p>
<p>
<xref ref-type="bibr" rid="B57">Mildenhall et al. (2020)</xref> focused on training the CBN <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold>) to globally represent a single scene, without benefit of a latent vector <bold>z</bold>. However, subsequent work shifted towards using the CBN with different latent vectors for different objects (<xref ref-type="bibr" rid="B79">Stelzner et al., 2021</xref>; <xref ref-type="bibr" rid="B97">Yu H.-X. et al., 2021</xref>; <xref ref-type="bibr" rid="B43">Kundu et al., 2022a</xref>,<xref ref-type="bibr" rid="B44">b</xref>) or different regions (i.e., blocks or tiles) in the scene (<xref ref-type="bibr" rid="B13">Chen et al., 2021</xref>; <xref ref-type="bibr" rid="B20">DeVries et al., 2021</xref>; <xref ref-type="bibr" rid="B49">Martel et al., 2021</xref>; <xref ref-type="bibr" rid="B50">Mehta et al., 2021</xref>; <xref ref-type="bibr" rid="B69">Reiser et al., 2021</xref>; <xref ref-type="bibr" rid="B84">Takikawa et al., 2021</xref>; <xref ref-type="bibr" rid="B70">Rematas et al., 2022</xref>; <xref ref-type="bibr" rid="B85">Tancik et al., 2022</xref>; <xref ref-type="bibr" rid="B91">Turki et al., 2022</xref>). Partitioning the scene into blocks, and using a CBN with a different latent vector in each block, simultaneously achieves faster rendering (<xref ref-type="bibr" rid="B69">Reiser et al., 2021</xref>; <xref ref-type="bibr" rid="B84">Takikawa et al., 2021</xref>), higher resolution (<xref ref-type="bibr" rid="B13">Chen et al., 2021</xref>; <xref ref-type="bibr" rid="B49">Martel et al., 2021</xref>; <xref ref-type="bibr" rid="B50">Mehta et al., 2021</xref>), and scalability to scenes of unbounded size (<xref ref-type="bibr" rid="B20">DeVries et al., 2021</xref>; <xref ref-type="bibr" rid="B70">Rematas et al., 2022</xref>; <xref ref-type="bibr" rid="B85">Tancik et al., 2022</xref>; <xref ref-type="bibr" rid="B91">Turki et al., 2022</xref>). However, this puts much of the burden of the representation on the local latent vectors, rather than on the parameters of the CBN. 
This is analogous to conventional block-based image representations, in which the same set of basis functions (e.g., the 8 &#xd7; 8 DCT) is used in each block, and the activation of each basis function is specified by a vector of coefficients that differs from block to block.</p>
<p>In this work, we partition 3D space into blocks (hierarchically using trees, akin to (<xref ref-type="bibr" rid="B96">Yu A. et al., 2021</xref>; <xref ref-type="bibr" rid="B49">Martel et al., 2021</xref>; <xref ref-type="bibr" rid="B84">Takikawa et al., 2021</xref>)), and represent the color within each block volumetrically using a CBN <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold>; <bold>z</bold>), allowing fast, high-resolution, and scalable reconstruction. Unlike all previous CBN works, however, we train the representation not just for fit but for efficient compression as well <italic>via</italic> transform coding and a rate-distortion Lagrangian loss function. It is worth noting that (<xref ref-type="bibr" rid="B83">Takikawa et al., 2022</xref>), which cites our preprint <xref ref-type="bibr" rid="B35">Isik et al. (2021b)</xref>, recently adapted our approach (though without RD Lagrangian loss or orthonormalization) to use fixed-rate vector quantization across the transform coefficient channels.</p>
</sec>
<sec id="s3-3">
<title>3.3 Point cloud compression</title>
<p>MPEG is standardizing two point cloud codecs: video-based (V-PCC) and geometry-based (G-PCC) (<xref ref-type="bibr" rid="B38">Jang et al., 2019</xref>; <xref ref-type="bibr" rid="B75">Schwarz et al., 2019</xref>; <xref ref-type="bibr" rid="B25">Graziosi et al., 2020</xref>). V-PCC is based on existing video codecs, while G-PCC is based on new, but in many ways classical, geometric approaches. Like previous works (<xref ref-type="bibr" rid="B98">Zhang et al., 2014</xref>; <xref ref-type="bibr" rid="B16">Cohen et al., 2016</xref>; <xref ref-type="bibr" rid="B17">de Queiroz and Chou, 2016</xref>; <xref ref-type="bibr" rid="B88">Thanou et al., 2016</xref>; <xref ref-type="bibr" rid="B18">de Queiroz and Chou, 2017</xref>; <xref ref-type="bibr" rid="B63">Pavez et al., 2018</xref>; <xref ref-type="bibr" rid="B15">Chou et al., 2020</xref>; <xref ref-type="bibr" rid="B40">Krivoku&#x107;a et al., 2020</xref>), both V-PCC and G-PCC compress geometry first, then compress attributes conditioned on geometry. Neural networks have been applied with some success to geometry compression (<xref ref-type="bibr" rid="B95">Yan et al., 2019</xref>; <xref ref-type="bibr" rid="B68">Quach et al., 2019</xref>; <xref ref-type="bibr" rid="B26">Guarda et al., 2019a</xref>,<xref ref-type="bibr" rid="B28">b</xref>; <xref ref-type="bibr" rid="B27">Guarda et al., 2020</xref>; <xref ref-type="bibr" rid="B87">Tang et al., 2020</xref>; <xref ref-type="bibr" rid="B66">Quach et al., 2020a</xref>; <xref ref-type="bibr" rid="B55">Milani, 2020</xref>, <xref ref-type="bibr" rid="B56">2021</xref>; <xref ref-type="bibr" rid="B46">Lazzarotto et al., 2021</xref>), but not to lossy attribute compression. 
Exceptions may include (<xref ref-type="bibr" rid="B67">Quach et al., 2020b</xref>), which uses learned neural 3D &#x2192; 2D folding but compresses with conventional image coding, and Deep-PCAC (<xref ref-type="bibr" rid="B76">Sheng et al., 2021</xref>), which compresses attributes using a PointNet-style architecture that is <italic>not volumetric</italic> and underperforms our framework by 2&#x2013;5&#xa0;dB (see <xref ref-type="fig" rid="F12">Figure 12B</xref> and <xref ref-type="sec" rid="s12">Supplementary Material</xref>). The attribute compression in G-PCC uses linear transforms that adapt based on the geometry. A core transform is the region-adaptive hierarchical transform (RAHT) (<xref ref-type="bibr" rid="B17">de Queiroz and Chou, 2016</xref>; <xref ref-type="bibr" rid="B74">Sandri G. P. et al., 2019</xref>), which is a linear transform that is orthonormal with respect to a discrete measure whose mass is placed on the point cloud geometry (<xref ref-type="bibr" rid="B73">Sandri et al., 2019a</xref>; <xref ref-type="bibr" rid="B15">Chou et al., 2020</xref>). Thus RAHT compresses attributes conditioned on geometry. Beyond RAHT, G-PCC uses prediction (of the RAHT coefficients) and joint entropy coding to obtain superior performance (<xref ref-type="bibr" rid="B45">Lasserre and Flynn, 2019</xref>; <xref ref-type="bibr" rid="B22">3DG, 2020b</xref>; <xref ref-type="bibr" rid="B64">Pavez et al., 2021</xref>). Recently, <xref ref-type="bibr" rid="B23">Fang et al. (2020)</xref> used neural methods for lossless entropy coding of the RAHT transform coefficients. Our work exceeds the RD performance of classic RAHT by 2&#x2013;4&#xa0;dB by introducing the flexibility of learning non-linear volumetric functions. 
Our approach is orthogonal to the prediction and entropy coding in (<xref ref-type="bibr" rid="B45">Lasserre and Flynn, 2019</xref>; <xref ref-type="bibr" rid="B22">3DG, 2020b</xref>; <xref ref-type="bibr" rid="B64">Pavez et al., 2021</xref>; <xref ref-type="bibr" rid="B23">Fang et al., 2020</xref>), and all results could be further improved by combining our method with these techniques.</p>
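For concreteness, the standard two-node butterfly at the heart of RAHT can be written in a few lines: sibling attributes are combined by a 2 &#xd7; 2 transform that is orthonormal for any positive weights (point counts). The attribute values and counts below are made up for the example.

```python
import math

def raht_butterfly(a1, a2, w1, w2):
    """One RAHT merge: sibling attributes a1, a2 with weights (point counts)
    w1, w2 -> (DC, AC). The 2x2 matrix is orthonormal for any w1, w2 > 0."""
    s = math.sqrt(w1 + w2)
    dc = (math.sqrt(w1) * a1 + math.sqrt(w2) * a2) / s
    ac = (-math.sqrt(w2) * a1 + math.sqrt(w1) * a2) / s
    return dc, ac

def raht_butterfly_inverse(dc, ac, w1, w2):
    """Exact inverse: the transpose of the orthonormal 2x2 matrix."""
    s = math.sqrt(w1 + w2)
    a1 = (math.sqrt(w1) * dc - math.sqrt(w2) * ac) / s
    a2 = (math.sqrt(w2) * dc + math.sqrt(w1) * ac) / s
    return a1, a2

# Merge two sibling voxels: one holding 3 points of luma 100, one holding
# 1 point of luma 60. Energy (a1^2 + a2^2) is preserved, and the step
# inverts exactly, so attributes are coded conditioned on geometry (the weights).
dc, ac = raht_butterfly(100.0, 60.0, 3, 1)
```

Because the weights come from the point counts, the same code realizes a different orthonormal transform for each geometry, which is what the text means by RAHT compressing attributes conditioned on geometry.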
</sec>
</sec>
<sec id="s4">
<title>4 LVAC framework</title>
<sec id="s4-1">
<title>4.1 Approach to volumetric representation</title>
<p>A real-valued (or real vector-valued) function <inline-formula id="inf13">
<mml:math id="m13">
<mml:mi>f</mml:mi>
<mml:mo>:</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>d</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2192;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> is said to be <italic>volumetric</italic> if <italic>d</italic> &#x3d; 3. A volumetric function <italic>f</italic> may be approximated by another volumetric function <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub> in a parametric family of volumetric functions {<italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>:<italic>&#x3b8;</italic> &#x2208; &#x398;} by minimizing an error <italic>d</italic> (<italic>f</italic>, <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>) over <italic>&#x3b8;</italic> &#x2208; &#x398;. Suppose <inline-formula id="inf14">
<mml:math id="m14">
<mml:msubsup>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">{</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">}</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula> is a point cloud with point positions <inline-formula id="inf15">
<mml:math id="m15">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> and point attributes <inline-formula id="inf16">
<mml:math id="m16">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>r</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>. Point cloud attribute compression approximates the volumetric function <italic>f</italic>: <bold>x</bold>
<sub>
<italic>i</italic>
</sub>&#x21a6;<bold>y</bold>
<sub>
<italic>i</italic>
</sub> by finding an optimal or near-optimal parameter <italic>&#x3b8;</italic>. Different point clouds are represented by different volumetric attribute functions <italic>f</italic>. Therefore, the encoding procedure of LVAC consists of learning the parameter <italic>&#x3b8;</italic> for the given point cloud.</p>
<p>A simple example is linear regression. An affine function <bold>y</bold> &#x3d; <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold>) &#x3d; <bold>Ax</bold> &#x2b; <bold>b</bold>, with <italic>&#x3b8;</italic> &#x3d; (<bold>A</bold>, <bold>b</bold>), may be fit to the data by minimizing the squared error <italic>d</italic> (<italic>f</italic>, <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>) &#x3d; &#x2016;<italic>f</italic> &#x2212; <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>&#x2016;<sup>2</sup> &#x3d; <italic>&#x2211;</italic>
<sub>
<italic>i</italic>
</sub>&#x2016;<italic>f</italic> (<bold>x</bold>
<sub>
<italic>i</italic>
</sub>) &#x2212; <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold>
<sub>
<italic>i</italic>
</sub>)&#x2016;<sup>2</sup> over <italic>&#x3b8;</italic>. Although a linear or affine volumetric function may not adequately represent the complex spatial arrangement of colors of point clouds like those in <xref ref-type="fig" rid="F1">Figure 1</xref>, two strategies may be used to improve the fit:<list list-type="simple">
<list-item>
<p>1) The first is to expand the <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub> function family, e.g., to represent <italic>f</italic> with more expressive CBNs. LVAC accomplishes this by using neural networks and by increasing the number of network parameters. We describe this expansion in more detail in the following sections.</p>
</list-item>
<list-item>
<p>2) The second is to partition the scene into blocks. When restricted to sub-regions, functions may have less complexity and a good fit may be achieved without exploding the number of network parameters in the CBN. LVAC partitions the bounding box of the point cloud into cube blocks. Each block is associated with a latent vector, which is fed to <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub> as an additional input and serves as a local parameter for the block. The next section describes in more detail how these latent vectors are used.</p>
</list-item>
</list>
</p>
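The affine baseline above can be carried out with ordinary least squares. The sketch below uses synthetic positions and colors generated from a known affine map; the data, dimensions, and noise level are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic "point cloud": positions x_i in R^3, colors y_i in R^3 generated
# by a known affine map plus a little noise (illustrative data only).
X = rng.uniform(0.0, 1.0, size=(500, 3))
A_true = np.diag([0.5, 0.3, 0.8])
b_true = np.array([0.1, 0.2, 0.3])
Y = X @ A_true.T + b_true + 0.01 * rng.normal(size=(500, 3))

# Fit y = A x + b by least squares: append a constant 1 to x so that
# theta = (A, b) becomes a single matrix solved for in one call.
X1 = np.hstack([X, np.ones((500, 1))])
theta, *_ = np.linalg.lstsq(X1, Y, rcond=None)
A_fit, b_fit = theta[:3].T, theta[3]

# Squared error d(f, f_theta) summed over the points, as in the text.
residual = float(np.sum((X1 @ theta - Y) ** 2))
```

The recovered (<bold>A</bold>, <bold>b</bold>) closely match the generating map, but such a global affine fit saturates quickly on real color fields, which is exactly what motivates the two strategies listed above.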
</sec>
<sec id="s4-2">
<title>4.2 Latent vectors</title>
<p>In LVAC, the 3D volume is partitioned into blocks <inline-formula id="inf17">
<mml:math id="m17">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> as in <xref ref-type="fig" rid="F3">Figure 3</xref>. The attributes <bold>y</bold>
<sub>
<italic>i</italic>
</sub> in a block <inline-formula id="inf18">
<mml:math id="m18">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> at offset <bold>n</bold> are fit with a volumetric function <bold>y</bold> &#x3d; <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold> &#x2212; <bold>n</bold>; <bold>z</bold>
<sub>
<bold>n</bold>
</sub>) represented by a simple CBN, shifted to offset <bold>n</bold>. The CBN parameters <italic>&#x3b8;</italic> are learned for each point cloud. In addition to the global parameter <italic>&#x3b8;</italic>, each block <inline-formula id="inf19">
<mml:math id="m19">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> supplies its own latent vector <bold>z</bold>
<sub>
<bold>n</bold>
</sub>, which selects the exact volumetric function <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(&#x22c5;; <bold>z</bold>) used in the block. The role of <italic>&#x3b8;</italic> is to choose the <italic>subfamily</italic> of volumetric functions best for each point cloud. The role of <bold>z</bold> is to choose the member of the subfamily best for each block; <bold>z</bold> thus serves as a local parameter. This procedure is summarized in <xref ref-type="fig" rid="F3">Figure 3</xref>. The overall volumetric function may be expressed as<disp-formula id="e1">
<mml:math id="m20">
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:msub>
<mml:mrow>
<mml:mn mathvariant="double-struck">1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<label>(1)</label>
</disp-formula>where the sum is over all block offsets <bold>n</bold>, <inline-formula id="inf20">
<mml:math id="m21">
<mml:msub>
<mml:mrow>
<mml:mn mathvariant="double-struck">1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> is the indicator function for block <inline-formula id="inf21">
<mml:math id="m22">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> (i.e., <inline-formula id="inf22">
<mml:math id="m23">
<mml:msub>
<mml:mrow>
<mml:mn mathvariant="double-struck">1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:math>
</inline-formula> iff <inline-formula id="inf23">
<mml:math id="m24">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>), and <bold>Z</bold> &#x3d; [<bold>z</bold>
<sub>
<bold>n</bold>
</sub>] is the matrix whose rows <bold>z</bold>
<sub>
<bold>n</bold>
</sub> are the blocks&#x2019; latent vectors.</p>
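Assuming, for illustration, unit cube blocks aligned to the integer grid, Eq. 1 can be evaluated with one block lookup and one CBN call per point, since the indicator functions select exactly one term of the sum. The toy CBN and latent table below are stand-ins, not LVAC's actual network.

```python
import numpy as np

def eval_lvac(x, z_table, f_theta, block_size=1.0):
    """Evaluate y = f_{theta,Z}(x) as in Eq. 1. Each x lies in exactly one
    block B_n, so the sum over n collapses to the single term with
    indicator 1_{B_n}(x) = 1."""
    x = np.asarray(x, dtype=float)
    n = np.floor(x / block_size) * block_size   # offset n of the block containing x
    z_n = z_table[tuple(n)]                     # latent vector z_n of block B_n
    return f_theta(x - n, z_n)                  # CBN evaluated in local coordinates

# Toy CBN and latent table (illustrative stand-ins): the latent is the block's
# base color and the "network" adds a small ramp in local coordinates.
def f_theta(x_local, z):
    return z + 0.1 * x_local.sum()

z_table = {(0.0, 0.0, 0.0): 0.5, (1.0, 0.0, 0.0): 0.9}
y = eval_lvac([1.25, 0.5, 0.5], z_table, f_theta)   # selects block n = (1, 0, 0)
```

Note that the same <italic>f</italic><sub><italic>&#x3b8;</italic></sub> is shared by all blocks; only the latent <bold>z</bold><sub><bold>n</bold></sub> retrieved from the table changes from block to block.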
<p>To compress the point cloud attributes {<bold>y</bold>
<sub>
<italic>i</italic>
</sub>} given the geometry {<bold>x</bold>
<sub>
<italic>i</italic>
</sub>}, LVAC compresses and transmits <bold>Z</bold> and possibly <italic>&#x3b8;</italic> as quantized quantities <inline-formula id="inf24">
<mml:math id="m25">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf25">
<mml:math id="m26">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> using <inline-formula id="inf26">
<mml:math id="m27">
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> bits. This communicates the volumetric function <inline-formula id="inf27">
<mml:math id="m28">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> to the decoder. The decoder can then use <inline-formula id="inf28">
<mml:math id="m29">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> to reconstruct the attributes <bold>y</bold>
<sub>
<italic>i</italic>
</sub> at each point position <bold>x</bold>
<sub>
<italic>i</italic>
</sub> as <inline-formula id="inf29">
<mml:math id="m30">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, incurring distortion<disp-formula id="e2">
<mml:math id="m31">
<mml:mi>D</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>d</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mo stretchy="false">&#x2016;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>i</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
<mml:msup>
<mml:mrow>
<mml:mo stretchy="false">&#x2016;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>.</mml:mo>
</mml:math>
<label>(2)</label>
</disp-formula>The decoder can also use <inline-formula id="inf30">
<mml:math id="m32">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> to reconstruct the attributes <bold>y</bold> at an <italic>arbitrary</italic> position <inline-formula id="inf31">
<mml:math id="m33">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>. LVAC minimizes the distortion <inline-formula id="inf32">
<mml:math id="m34">
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> subject to a constraint on the bit rate, <inline-formula id="inf33">
<mml:math id="m35">
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2264;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>. This is done by minimizing the Lagrangian <inline-formula id="inf34">
<mml:math id="m36">
<mml:mi>J</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>D</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3bb;</mml:mi>
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> for some Lagrange multiplier <italic>&#x3bb;</italic> &#x3e; 0 matched to <italic>R</italic>
<sub>0</sub>.</p>
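The rate-distortion trade-off just described can be made concrete with a toy sweep over quantization steps. The uniform quantizer and Gaussian rate proxy below are stand-ins for LVAC's actual entropy model, chosen only to illustrate minimizing <italic>J</italic> &#x3d; <italic>D</italic> &#x2b; <italic>&#x3bb;R</italic>.

```python
import numpy as np

def quantize(Z, step):
    """Uniform scalar quantization of the latents (stand-in for LVAC's coder)."""
    return np.round(Z / step) * step

def rate_proxy(Z, step):
    """Idealized bit count: entropy of the quantization indices under a
    unit-variance Gaussian model. Only a stand-in to make D + lambda*R concrete."""
    q = np.round(Z / step) * step
    p = np.exp(-0.5 * q ** 2) * step / np.sqrt(2.0 * np.pi)
    return float(np.sum(-np.log2(np.maximum(p, 1e-12))))

rng = np.random.default_rng(2)
Z = rng.normal(size=(100, 8))   # latent vectors z_n, one row per block
lam = 0.01                      # Lagrange multiplier lambda matched to R_0

# Sweep the quantization step and keep the one minimizing J = D + lambda * R:
# a finer step lowers distortion but raises rate, and vice versa.
J_best, step_best = min(
    (float(np.sum((Z - quantize(Z, s)) ** 2)) + lam * rate_proxy(Z, s), s)
    for s in (0.05, 0.1, 0.2, 0.5, 1.0)
)
```

In LVAC the minimization is done end-to-end by back propagation rather than by a grid sweep, but the objective being minimized has this same D plus &#x3bb;R form.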
<p>In the regime of interest in our work, <italic>&#x3b8;</italic> has roughly 250 to 10&#xa0;K parameters, while <bold>Z</bold> has roughly 500&#xa0;K to 8&#xa0;M parameters. Hence, the focus of this paper is on compression of <bold>Z</bold>. We assume that the simple CBN parameterized by <italic>&#x3b8;</italic> can be compressed using model compression tools, e.g., (<xref ref-type="bibr" rid="B11">Bird et al., 2021</xref>; <xref ref-type="bibr" rid="B36">Isik, 2021</xref>), to a few bits per parameter with little loss in performance. Alternatively, we assume that the CBN may be trained to generalize across many point clouds, obviating the need to transmit <italic>&#x3b8;</italic>. In <xref ref-type="sec" rid="s5">Section 5</xref>, we explore conservative bounds on the performance under each assumption. In this section, however, we focus on compression of the latent vectors <bold>Z</bold> &#x3d; [<bold>z</bold>
<sub>
<bold>n</bold>
</sub>].</p>
<p>We first describe the linear components of our framework that many conventional methods share (<xref ref-type="bibr" rid="B17">de Queiroz and Chou, 2016</xref>; <xref ref-type="bibr" rid="B71">Sandri et al., 2018</xref>, <xref ref-type="bibr" rid="B74">Sandri G. P. et al., 2019</xref>; <xref ref-type="bibr" rid="B42">Krivokuca et al., 2021</xref>; <xref ref-type="bibr" rid="B64">Pavez et al., 2021</xref>) and then discuss how we achieve state-of-the-art compression with the additional non-linearity introduced by CBNs and an end-to-end optimization of the rate-distortion Lagrangian loss <italic>via</italic> back propagation.</p>
<sec id="s4-2-1">
<title>4.2.1 Linear components</title>
<p>Following RAHT (<xref ref-type="bibr" rid="B17">de Queiroz and Chou, 2016</xref>) and follow-ups (<xref ref-type="bibr" rid="B71">Sandri et al., 2018</xref>, <xref ref-type="bibr" rid="B74">Sandri G. P. et al., 2019</xref>; <xref ref-type="bibr" rid="B42">Krivokuca et al., 2021</xref>; <xref ref-type="bibr" rid="B64">Pavez et al., 2021</xref>), the problem of point cloud attribute compression can be modeled as compression of a piecewise constant volumetric function,<disp-formula id="e3">
<mml:math id="m37">
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mo>&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mn mathvariant="double-struck">1</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>.</mml:mo>
</mml:math>
<label>(3)</label>
</disp-formula>This is the same as (1) with an extremely simple CBN: <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold>; <bold>z</bold>) &#x3d; <bold>z</bold>. For the linear case, each latent <inline-formula id="inf35">
<mml:math id="m38">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> directly represents a color, which is constant across block <inline-formula id="inf36">
<mml:math id="m39">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>. It is clear that the squared error &#x2016;<italic>f</italic> &#x2212; <italic>f</italic>
<sub>
<bold>Z</bold>
</sub>&#x2016;<sup>2</sup> is minimized by setting every <bold>z</bold>
<sub>
<bold>n</bold>
</sub> to the average (DC) value of the colors of the points in <inline-formula id="inf37">
<mml:math id="m40">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>. It would be inefficient to quantize and entropy code the colors <bold>Z</bold> &#x3d; [<bold>z</bold>
<sub>
<bold>n</bold>
</sub>] directly without transforming them into a domain that separates important (DC) and unimportant components. Therefore, the convention is to first transform the <italic>N</italic> &#xd7; <italic>C</italic> matrix <bold>Z</bold> using a geometry-dependent <italic>N</italic> &#xd7; <italic>N</italic> analysis transform <bold>T</bold>
<sub>
<italic>a</italic>
</sub>, to obtain the <italic>N</italic> &#xd7; <italic>C</italic> matrix of transform coefficients <bold>V</bold> &#x3d; <bold>T</bold>
<sub>
<italic>a</italic>
</sub>
<bold>Z</bold>, most of which may be near zero. (<italic>N</italic> is the number of blocks <inline-formula id="inf38">
<mml:math id="m41">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> that are <italic>occupied</italic>, i.e., that contain points, and <italic>C</italic> is the number of latent features.) Then <bold>V</bold> is quantized to <inline-formula id="inf39">
<mml:math id="m42">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> and efficiently entropy coded. Finally <inline-formula id="inf40">
<mml:math id="m43">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> is recovered using the synthesis transform <inline-formula id="inf41">
<mml:math id="m44">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msubsup>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>a</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msubsup>
</mml:math>
</inline-formula>.</p>
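As a toy illustration of the piecewise-constant model in Equation 3, the sketch below (hypothetical helper names; NumPy assumed) assigns each query point <bold>x</bold> the latent <bold>z</bold> of the block containing it:

```python
import numpy as np

def eval_piecewise_constant(points, latents, block_size=1.0):
    """Evaluate y = f_Z(x) = sum_n z_n 1_{B_n}(x) (Eq. 3): each query
    point takes the latent of the single block B_n that contains it."""
    block_ids = np.floor(np.asarray(points) / block_size).astype(int)
    return np.stack([latents[tuple(n)] for n in block_ids])

# Two occupied unit blocks, each holding a constant RGB latent z_n.
latents = {(0, 0, 0): np.array([255.0, 0.0, 0.0]),
           (1, 0, 0): np.array([0.0, 255.0, 0.0])}
colors = eval_piecewise_constant([[0.2, 0.3, 0.9], [1.7, 0.1, 0.4]], latents)
```

The dictionary of occupied blocks plays the role of the sparse set of leaf blocks; unoccupied blocks simply have no entry.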
<p>The analysis and synthesis transforms <bold>T</bold>
<sub>
<italic>a</italic>
</sub> and <bold>T</bold>
<sub>
<italic>s</italic>
</sub> are defined in terms of a hierarchical space partition represented by a binary tree. The root of the tree (level <italic>&#x2113;</italic> &#x3d; 0) corresponds to a large block <inline-formula id="inf42">
<mml:math id="m45">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>,</mml:mo>
<mml:mn mathvariant="bold">0</mml:mn>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> containing the entire point cloud. The leaves of the tree (level <italic>&#x2113;</italic> &#x3d; <italic>L</italic>) correspond to the <italic>N</italic> blocks <inline-formula id="inf43">
<mml:math id="m46">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> in <xref ref-type="disp-formula" rid="e3">(Equation 3)</xref>, which are voxels of a voxelized point cloud. In between, for each level <italic>&#x2113;</italic> &#x3d; 0, 1, &#x2026; , <italic>L</italic> &#x2212; 1, each occupied block <inline-formula id="inf44">
<mml:math id="m47">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> at level <italic>&#x2113;</italic> is split into left and right child blocks of equal size, say <inline-formula id="inf45">
<mml:math id="m48">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> and <inline-formula id="inf46">
<mml:math id="m49">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>, at level <italic>&#x2113;</italic> &#x2b; 1. The split is along either the <italic>x</italic>, <italic>y</italic>, or <italic>z</italic> axis depending on whether <italic>&#x2113;</italic> mod&#x2009; 3 is 0, 1, or 2. Only child blocks that are occupied by any point in the point cloud are retained in the tree. To perform the linear analysis transform <bold>T</bold>
<sub>
<italic>a</italic>
</sub>
<bold>Z</bold>, one can start at level <italic>&#x2113;</italic> &#x3d; <italic>L</italic> &#x2212; 1 and work back to level <italic>&#x2113;</italic> &#x3d; 0, computing the average (DC) value of each block <inline-formula id="inf47">
<mml:math id="m50">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> as<disp-formula id="e4">
<mml:math id="m51">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mspace width="-0.17em"/>
<mml:mo>&#x2b;</mml:mo>
<mml:mspace width="-0.17em"/>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mspace width="-0.17em"/>
<mml:mo>&#x2b;</mml:mo>
<mml:mspace width="-0.17em"/>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:math>
<label>(4)</label>
</disp-formula>where <inline-formula id="inf48">
<mml:math id="m52">
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> and <inline-formula id="inf49">
<mml:math id="m53">
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> are the <italic>weights</italic> of, or number of points in, the left and right child blocks of <inline-formula id="inf50">
<mml:math id="m54">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>. The global DC value of the entire point cloud is <bold>z</bold>
<sub>0,<bold>0</bold>
</sub>. Along the way, the difference between the DC values of each child block and its parent are computed as<disp-formula id="e5">
<mml:math id="m55">
<mml:mi>&#x3b4;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mspace width="0.3333em"/>
<mml:mtext>&#x2009;and&#x2009;</mml:mtext>
<mml:mspace width="0.3333em"/>
<mml:mi>&#x3b4;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>.</mml:mo>
</mml:math>
<label>(5)</label>
</disp-formula>These differences are close to zero and hence cheap to entropy code. The transform coefficient matrix <bold>V</bold> &#x3d; <bold>T</bold>
<sub>
<italic>a</italic>
</sub>
<bold>Z</bold> consists of the global DC value <bold>z</bold>
<sub>0,<bold>0</bold>
</sub> in the first row, and <italic>N</italic> &#x2212; 1 <italic>right child</italic> differences <inline-formula id="inf51">
<mml:math id="m56">
<mml:mi>&#x3b4;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> in <xref ref-type="disp-formula" rid="e5">(Eq. 5)</xref> in the remaining rows.</p>
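One bottom-up merge of the analysis transform can be written out directly from Equations 4, 5; the following sketch (plain Python, illustrative only) computes the parent DC value and the child differences for a single split:

```python
def analysis_step(z_left, z_right, w_left, w_right):
    """One bottom-up merge of T_a: the parent DC value (Eq. 4) is the
    weighted average of the child DC values, and each child keeps its
    difference from the parent (Eq. 5). Only the right-child difference
    is stored in V; the constraint in Eq. 6 makes the left one redundant."""
    w = w_left + w_right
    z_parent = (w_left * z_left + w_right * z_right) / w
    dz_left = z_left - z_parent    # Eq. 5, left child
    dz_right = z_right - z_parent  # Eq. 5, right child (a row of V)
    return z_parent, dz_left, dz_right, w

# Left child holds 3 points with DC value 2.0, right child 1 point at 6.0.
z_p, d_l, d_r, w = analysis_step(2.0, 6.0, 3, 1)
```

Note that the weighted combination of the two differences vanishes, as Equation 6 requires, which is why only the right-child differences need to be coded.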
<p>To perform the linear synthesis transform <bold>T</bold>
<sub>
<italic>s</italic>
</sub>
<bold>V</bold>, one can start at level <italic>&#x2113;</italic> &#x3d; 0 and work up to level <italic>L</italic> &#x2212; 1, computing the <italic>left child</italic> differences <inline-formula id="inf52">
<mml:math id="m57">
<mml:mi>&#x3b4;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> <xref ref-type="disp-formula" rid="e5">(Eq. 5)</xref> from the <italic>right child</italic> differences <inline-formula id="inf53">
<mml:math id="m58">
<mml:mi>&#x3b4;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> <xref ref-type="disp-formula" rid="e5">(Eq. 5)</xref> in <bold>V</bold> using the constraint<disp-formula id="e6">
<mml:math id="m59">
<mml:mn mathvariant="bold">0</mml:mn>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mspace width="-0.17em"/>
<mml:mo>&#x2b;</mml:mo>
<mml:mspace width="-0.17em"/>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mi>&#x3b4;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mspace width="-0.17em"/>
<mml:mo>&#x2b;</mml:mo>
<mml:mspace width="-0.17em"/>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:mi>&#x3b4;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:math>
<label>(6)</label>
</disp-formula>which is obtained from <xref ref-type="disp-formula" rid="e4">(Eq. 4)</xref> using <xref ref-type="disp-formula" rid="e5">(Eq. 5)</xref>. Then the equations in <xref ref-type="disp-formula" rid="e5">(Eq. 5)</xref> are inverted to obtain <inline-formula id="inf54">
<mml:math id="m60">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> and <inline-formula id="inf55">
<mml:math id="m61">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> from <bold>z</bold>
<sub>
<italic>&#x2113;</italic>,<bold>n</bold>
</sub>, ultimately computing the values <bold>z</bold>
<sub>
<italic>L</italic>,<bold>n</bold>
</sub> &#x3d; <bold>z</bold>
<sub>
<bold>n</bold>
</sub> for blocks at level <italic>L</italic>.</p>
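The corresponding top-down split of the synthesis transform, sketched under the same toy assumptions, recovers the left-child difference from the right-child one via the constraint in Equation 6 and then inverts Equation 5:

```python
def synthesis_step(z_parent, dz_right, w_left, w_right):
    """One top-down split of T_s: Eq. 6 gives dz_left from dz_right,
    then inverting Eq. 5 recovers the two child DC values."""
    dz_left = -(w_right / w_left) * dz_right  # rearranged Eq. 6
    return z_parent + dz_left, z_parent + dz_right

# Inverts a merge of a parent with child weights (3, 1): recovers (2.0, 6.0).
z_l, z_r = synthesis_step(3.0, 3.0, w_left=3, w_right=1)
```

Applied recursively from the root down, this reproduces the leaf values <bold>z</bold><sub><italic>L</italic>,<bold>n</bold></sub> from the global DC value and the coded right-child differences.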
<p>Expressions for the matrices <bold>T</bold>
<sub>
<italic>a</italic>
</sub> and <bold>T</bold>
<sub>
<italic>s</italic>
</sub> can be worked out from the above linear operations. In particular, it can be shown that each row of <bold>T</bold>
<sub>
<italic>s</italic>
</sub> computes the color <bold>z</bold>
<sub>
<italic>L</italic>,<bold>n</bold>
</sub> of some leaf voxel <inline-formula id="inf56">
<mml:math id="m62">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> by summing the color <bold>z</bold>
<sub>0</sub> of the root block with the color differences <italic>&#x3b4;</italic>
<bold>z</bold>
<sub>
<italic>&#x2113;</italic>
</sub> at levels of detail <italic>&#x2113;</italic> &#x3d; 1, &#x2026; , <italic>L</italic> from the root to the leaf. One challenge in the quantization step is that each transform coefficient in <bold>V</bold> &#x3d; <bold>T</bold>
<sub>
<italic>a</italic>
</sub>
<bold>Z</bold> requires a different quantization step size; uniform quantization would be suboptimal, since the more important coefficients should be quantized with finer precision. We can avoid this complication by orthonormalizing <bold>T</bold>
<sub>
<italic>a</italic>
</sub> and <bold>T</bold>
<sub>
<italic>s</italic>
</sub>. In fact, <bold>T</bold>
<sub>
<italic>a</italic>
</sub> and <bold>T</bold>
<sub>
<italic>s</italic>
</sub> can be orthonormalized by multiplication by a diagonal matrix <bold>S</bold> &#x3d; diag (<italic>s</italic>
<sub>1</sub>, &#x2026; , <italic>s</italic>
<sub>
<italic>N</italic>
</sub>), where<disp-formula id="e7">
<mml:math id="m63">
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mtext>&#x23;&#x2009;points&#x2009;in&#x2009;point&#x2009;cloud</mml:mtext>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
</mml:math>
<label>(7)</label>
</disp-formula>
<disp-formula id="e8">
<mml:math id="m64">
<mml:msub>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mspace width="-0.17em"/>
<mml:mo>&#x2b;</mml:mo>
<mml:mspace width="-0.17em"/>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>w</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>/</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>,</mml:mo>
</mml:math>
<label>(8)</label>
</disp-formula>where element <italic>s</italic>
<sub>1</sub> of <bold>S</bold> corresponds to row one of <bold>V</bold> (the global DC value <bold>z</bold>
<sub>0,<bold>0</bold>
</sub>) and element <italic>s</italic>
<sub>
<italic>m</italic>
</sub> of <bold>S</bold> corresponds to row <italic>m</italic> &#x3e; 1 of <bold>V</bold> (a right child difference <inline-formula id="inf57">
<mml:math id="m65">
<mml:mi>&#x3b4;</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>R</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>). That is, <bold>S</bold>
<sup>&#x2212;1</sup>
<bold>T</bold>
<sub>
<italic>a</italic>
</sub> and <bold>T</bold>
<sub>
<italic>s</italic>
</sub>
<bold>S</bold> are orthonormal (and transposes of each other). This implies that every row of the normalized coefficients <inline-formula id="inf58">
<mml:math id="m66">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:math>
</inline-formula> should now be quantized <italic>uniformly</italic> with the same step size &#x394;, or equivalently that the rows of the unnormalized coefficients <bold>V</bold> &#x3d; <bold>T</bold>
<sub>
<italic>a</italic>
</sub>
<bold>Z</bold> should be quantized with scaled step sizes <italic>s</italic>
<sub>
<italic>m</italic>
</sub>&#x394;. This scaling is crucial as it quantizes with finer precision the coefficients that are more important. The more important coefficients are generally associated with blocks with more points.</p>
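The scaling of Equations 7, 8 can be sketched as follows (NumPy; hypothetical shapes, with one scalar channel per row of <bold>V</bold>): build the diagonal of <bold>S</bold> from the child weights, then round the normalized coefficients with a single step size, which is equivalent to quantizing row <italic>m</italic> of <bold>V</bold> with step size <italic>s</italic><sub><italic>m</italic></sub>&#x394;:

```python
import numpy as np

def scale_diagonal(num_points, child_weights):
    """Diagonal of S: s_1 (Eq. 7) for the global DC row, then one s_m
    (Eq. 8) per right-child-difference row, from its weights (w_L, w_R)."""
    s = [num_points ** -0.5]
    s += [(w_l * (w_l + w_r) / w_r) ** -0.5 for w_l, w_r in child_weights]
    return np.array(s)

def quantize_rows(V, s, delta):
    """Quantize the normalized coefficients V_bar = S^{-1} V uniformly
    with step size delta (so row m of V effectively uses s_m * delta)."""
    return np.round(V / (s[:, None] * delta)).astype(int)

# Point cloud with 4 points; one split with child weights (3, 1).
s = scale_diagonal(num_points=4, child_weights=[(3, 1)])
U_hat = quantize_rows(np.array([[3.0], [3.0]]), s, delta=1.0)
```

Rows with larger weights get smaller <italic>s</italic><sub><italic>m</italic></sub> and hence finer effective step sizes, matching the observation that coefficients of heavier blocks matter more.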
</sec>
<sec id="s4-2-2">
<title>4.2.2 Nonlinear components</title>
<p>Now, we provide more details on the nonlinear components in our framework and how they are jointly optimized (learned) with the linear components in the loop to quantize and entropy code the latent vectors <inline-formula id="inf59">
<mml:math id="m67">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> (where now <italic>C</italic> &#x226b; 3 typically) for the blocks <inline-formula id="inf60">
<mml:math id="m68">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> in <xref ref-type="disp-formula" rid="e1">(Eq. 1)</xref>.</p>
<p>LVAC performs joint optimization of distortion and bit rate by querying points <bold>x</bold> at a <italic>target level</italic> of detail <italic>L</italic>, lower (i.e., coarser) than the voxel level. Thus the blocks <inline-formula id="inf61">
<mml:math id="m69">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> contain not just one point but say <italic>N</italic>
<sub>
<italic>x</italic>
</sub> &#xd7; <italic>N</italic>
<sub>
<italic>y</italic>
</sub> &#xd7; <italic>N</italic>
<sub>
<italic>z</italic>
</sub> voxels, only some of which are occupied. Then the attributes (typically, colors) of the occupied voxels in <inline-formula id="inf62">
<mml:math id="m70">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="script">B</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> are represented by the volumetric function <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold> &#x2212; <bold>n</bold>; <bold>z</bold>
<sub>
<bold>n</bold>
</sub>) of a CBN at level <italic>L</italic>, which better models the attributes <italic>within</italic> the block at certain bit rates than a purely linear transform such as (<xref ref-type="bibr" rid="B17">de Queiroz and Chou, 2016</xref>; <xref ref-type="bibr" rid="B71">Sandri et al., 2018</xref>, <xref ref-type="bibr" rid="B74">Sandri et al., 2019 G. P.</xref>; <xref ref-type="bibr" rid="B42">Krivokuca et al., 2021</xref>; <xref ref-type="bibr" rid="B64">Pavez et al., 2021</xref>). Since the latent vectors <inline-formula id="inf63">
<mml:math id="m71">
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> are not themselves the attributes of the occupied voxels, they are not a direct input to the encoder (see <xref ref-type="fig" rid="F3">Figure 3</xref>). Hence the encoder cannot apply the analysis transform <bold>T</bold>
<sub>
<italic>a</italic>
</sub> to <bold>Z</bold> &#x3d; [<bold>z</bold>
<sub>
<bold>n</bold>
</sub>] to obtain the transform coefficients <bold>V</bold>. Instead, LVAC learns <bold>V</bold> through back-propagation, without an explicit <bold>T</bold>
<sub>
<italic>a</italic>
</sub>, first through the distortion measure and volumetric function (2), and then through the synthesis transform <bold>T</bold>
<sub>
<italic>s</italic>
</sub> and scaling matrix <bold>S</bold>. The parameters <italic>&#x3b8;</italic> of the CBN are <italic>jointly</italic> optimized at the same time. Learning gives LVAC the opportunity to optimize <bold>V</bold> not just to minimize the distortion <italic>D</italic> (i.e., to optimize the <italic>fit</italic> of the model to the data) but to minimize the ultimate rate-distortion objective <italic>D</italic> &#x2b; <italic>&#x3bb;R</italic>, which minimizes the distortion subject to a bit rate constraint.</p>
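A minimal sketch of the idea that <bold>V</bold> is learned by back-propagation rather than computed by an explicit analysis transform: gradient descent on a toy <italic>D</italic> &#x2b; <italic>&#x3bb;R</italic> objective, with identity synthesis, a uniform-noise surrogate for rounding, and an L1 proxy for the rate. All of these are hypothetical simplifications for illustration; LVAC itself back-propagates through the CBN, <bold>T</bold><sub><italic>s</italic></sub>, <bold>S</bold>, and a learned entropy model.

```python
import numpy as np

rng = np.random.default_rng(0)

def learn_coefficients(target, lam=0.01, lr=0.1, steps=300):
    """Optimize the coefficients V directly on D + lambda*R by gradient
    descent. Toy stand-ins: identity synthesis (reconstruction = V),
    squared-error distortion D, additive uniform noise as a
    differentiable surrogate for rounding, and |V| as a rate proxy R."""
    V = np.zeros_like(target)
    for _ in range(steps):
        noise = rng.uniform(-0.5, 0.5, size=V.shape)  # models quantization
        grad_D = 2.0 * (V + noise - target)           # d/dV squared error
        grad_R = lam * np.sign(V)                     # d/dV of L1 rate proxy
        V -= lr * (grad_D + grad_R)
    return V

V = learn_coefficients(np.array([4.0, -2.0, 0.0]))  # approaches the target
```

Raising <italic>&#x3bb;</italic> shrinks the learned coefficients toward zero (cheaper to code, higher distortion), which is exactly the rate-distortion trade-off the Lagrangian controls.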
<p>
<xref ref-type="fig" rid="F4">Figure 4</xref> shows the compression pipeline that produces <inline-formula id="inf64">
<mml:math id="m72">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> from <bold>V</bold>, through which the back-propagation must be performed. The diagonal matrix <bold>S</bold> (defined in <xref ref-type="disp-formula" rid="e7">Eqs. 7</xref>, <xref ref-type="disp-formula" rid="e8">8</xref>) scales the coefficients in <bold>V</bold> to produce <inline-formula id="inf65">
<mml:math id="m73">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">S</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:math>
</inline-formula>, but is constant across channels <italic>c</italic> &#x3d; 1, &#x2026; , <italic>C</italic>. The diagonal matrix <bold>&#x394;</bold> &#x3d; diag (&#x394;<sub>1</sub>, &#x2026; , &#x394;<sub>
<italic>C</italic>
</sub>) applies different step sizes &#x394;<sub>
<italic>c</italic>
</sub> to each channel in <inline-formula id="inf66">
<mml:math id="m74">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> to produce <inline-formula id="inf67">
<mml:math id="m75">
<mml:mi mathvariant="bold">U</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">&#x394;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, but is constant across coefficients. The quantizer rounds the real matrix <bold>U</bold> elementwise to produce the integer matrix <inline-formula id="inf68">
<mml:math id="m76">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">U</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>&#x230a;</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">U</mml:mi>
</mml:mrow>
<mml:mo>&#x2309;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, which is then entropy coded to produce a bit string of length <italic>R</italic> in total. The integer matrix <inline-formula id="inf69">
<mml:math id="m77">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">U</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> is also transformed by <bold>&#x394;</bold>, <bold>S</bold>, and <bold>T</bold>
<sub>
<italic>s</italic>
</sub> in sequence to produce <inline-formula id="inf70">
<mml:math id="m78">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">U</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi mathvariant="bold">&#x394;</mml:mi>
</mml:math>
</inline-formula>. Note that the learnable parameters in <xref ref-type="fig" rid="F4">Figure 4</xref> are <bold>V</bold>, <bold>&#x394;</bold>, and the parameters of the entropy coder. Mathematically, it does not matter whether we optimize <bold>V</bold> or the normalized version <inline-formula id="inf71">
<mml:math id="m79">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:math>
</inline-formula>. In our implementation, we optimize <inline-formula id="inf72">
<mml:math id="m80">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
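<p>The scaling, quantization, and reconstruction steps just described can be sketched in a few lines. The matrices below are illustrative placeholders (e.g., the synthesis transform is taken to be the identity), not the learned quantities of the paper.</p>

```python
import numpy as np

# Sketch of the compression pipeline on M coefficients x C channels.
rng = np.random.default_rng(0)
M, C = 8, 3

V_bar = rng.normal(size=(M, C))             # normalized coefficients V_bar = S^{-1} V
delta = np.array([0.5, 0.25, 0.25])         # per-channel step sizes (diagonal of Delta)
S = np.diag(rng.uniform(0.5, 2.0, size=M))  # per-coefficient scaling, same for all channels
T_s = np.eye(M)                             # synthesis transform placeholder

U = V_bar / delta                  # divide by step sizes across channels
U_hat = np.rint(U)                 # elementwise rounding (quantizer)
Z_hat = T_s @ S @ (U_hat * delta)  # reconstruction Z_hat = T_s S U_hat Delta

assert np.all(np.abs(U - U_hat) <= 0.5)  # rounding error is at most half a step
```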
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>LVAC pipeline for compressing latents <bold>Z</bold> &#x3d;[<bold>z</bold>
<sub>
<bold>n</bold>
</sub>]. <bold>Z</bold> is represented by difference latents <bold>V</bold>, normalized by <bold>S</bold> across levels and blocks to obtain <inline-formula id="inf73">
<mml:math id="m81">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>, divided by step sizes <bold>&#x394;</bold> across channels to obtain <bold>U</bold>, quantized by rounding to obtain <inline-formula id="inf74">
<mml:math id="m82">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">U</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>&#x230a;</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">U</mml:mi>
</mml:mrow>
<mml:mo>&#x2309;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, and reconstructed as <inline-formula id="inf75">
<mml:math id="m83">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi mathvariant="bold">T</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mi mathvariant="bold">S</mml:mi>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">U</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mi mathvariant="bold">&#x394;</mml:mi>
</mml:math>
</inline-formula>. <bold>V</bold> (or equivalently <inline-formula id="inf76">
<mml:math id="m84">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">V</mml:mi>
</mml:mrow>
<mml:mo>&#x304;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> in practice) is optimized by back-propagating through <italic>D</italic> (<italic>&#x3b8;</italic>, <bold>Z</bold>) &#x2b; <italic>&#x3bb;R</italic> (<italic>&#x3b8;</italic>, <bold>Z</bold>) and the pipeline using differentiable proxies for the quantizer and entropy coder.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g004.tif"/>
</fig>
<p>As the quantizer and entropy encoder are not differentiable, they must be replaced by differentiable <italic>proxies</italic> during optimization. There are various differentiable proxies for the quantizer (<xref ref-type="bibr" rid="B7">Ball&#xe9; et al., 2017</xref>; <xref ref-type="bibr" rid="B1">Agustsson and Theis, 2020</xref>; <xref ref-type="bibr" rid="B47">Luo et al., 2020</xref>), and we use the proxy<disp-formula id="e9">
<mml:math id="m85">
<mml:mi>Q</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">U</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi mathvariant="bold">U</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mo>,</mml:mo>
</mml:math>
<label>(9)</label>
</disp-formula>where <bold>W</bold> is i.i.d. uniform on (&#x2212;0.5, 0.5). Various differentiable proxies for the entropy coder are also possible. For the number of bits in the entropy code for <bold>U</bold> &#x3d; [<italic>u</italic>
<sub>
<italic>m</italic>,<italic>c</italic>
</sub>], we use the proxy <inline-formula id="inf77">
<mml:math id="m86">
<mml:mi>R</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">U</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mo movablelimits="false" form="prefix">&#x2211;</mml:mo>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2061;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>log</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:msub>
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>u</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, where<disp-formula id="e10">
<mml:math id="m87">
<mml:msub>
<mml:mrow>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>u</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext>CDF</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>0.5</mml:mn>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mtext>CDF</mml:mtext>
</mml:mrow>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mi>&#x3d5;</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x2113;</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>c</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi>u</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>0.5</mml:mn>
</mml:mrow>
</mml:mfenced>
</mml:math>
<label>(10)</label>
</disp-formula>
</p>
<p>(<xref ref-type="bibr" rid="B7">Ball&#xe9; et al., 2017</xref>). The CDF is modeled by a neural network with parameters <italic>&#x3d5;</italic>
<sub>
<italic>&#x2113;</italic>,<italic>c</italic>
</sub> that depend on the channel <italic>c</italic> and also the level <italic>&#x2113;</italic> (but not the offset <bold>n</bold>) of the coefficient <italic>u</italic>
<sub>
<italic>m</italic>,<italic>c</italic>
</sub>. At inference time, the bit rate is <italic>R</italic>(&#x230a;<bold>U</bold>&#x2309;) instead of <italic>R</italic>(<bold>U</bold>). These functions are provided by the Continuous Batched Entropy (<italic>cbe</italic>) model in (<xref ref-type="bibr" rid="B5">Ball&#xe9; et al., 2021</xref>).</p>
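<p>To make Eqs. 9, 10 concrete, the following sketch implements both proxies, with a fixed logistic CDF standing in for the learned CDF (an assumption for illustration only; in the paper the CDF is a small neural network with parameters <italic>&#x3d5;</italic><sub><italic>&#x2113;</italic>,<italic>c</italic></sub>).</p>

```python
import numpy as np

def logistic_cdf(u, loc=0.0, scale=1.0):
    # Stand-in for the learned CDF; any smooth, monotone CDF works here.
    return 1.0 / (1.0 + np.exp(-(u - loc) / scale))

def rate_proxy(U):
    # Eq. 10: p(u) = CDF(u + 0.5) - CDF(u - 0.5); rate = -sum log2 p(u)
    p = logistic_cdf(U + 0.5) - logistic_cdf(U - 0.5)
    return -np.sum(np.log2(p))

def noisy_quantizer(U, rng):
    # Eq. 9: training-time proxy Q(U) = U + W, with W iid Uniform(-0.5, 0.5)
    return U + rng.uniform(-0.5, 0.5, size=U.shape)

rng = np.random.default_rng(0)
U = rng.normal(scale=2.0, size=(16, 3))
train_rate = rate_proxy(noisy_quantizer(U, rng))  # differentiable surrogate
infer_rate = rate_proxy(np.rint(U))               # actual R(round(U)) at inference
```

At training time the noisy quantizer keeps gradients flowing through the rate term; at inference time the hard rounding is used instead.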
<p>Note that the parameters &#x394;<sub>
<italic>c</italic>
</sub> as well as the parameters <italic>&#x3d5;</italic>
<sub>
<italic>&#x2113;</italic>,<italic>c</italic>
</sub>, for all <italic>&#x2113;</italic> and <italic>c</italic>, must be transmitted to the decoder. However, the overhead for transmitting &#x394;<sub>
<italic>c</italic>
</sub> is negligible, and the overhead for transmitting <italic>&#x3d5;</italic>
<sub>
<italic>&#x2113;</italic>,<italic>c</italic>
</sub> can be circumvented by using a backward-adaptive entropy code in its place at inference time (see <xref ref-type="sec" rid="s5-4">Section 5.4</xref>).</p>
</sec>
</sec>
<sec id="s4-3">
<title>4.3 Coordinate based network</title>
<p>Any CBN can be used in the LVAC framework, but in our experiments we usually use a two-layer MLP,<disp-formula id="e11">
<mml:math id="m88">
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>&#x3c3;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mi>&#x3c3;</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mn>3</mml:mn>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:msup>
<mml:mfenced open="[" close="]">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<label>(11)</label>
</disp-formula>where <italic>&#x3b8;</italic> &#x3d; (<bold>b</bold>
<sup>3</sup>, <bold>W</bold>
<sup>3&#xd7;<italic>H</italic>
</sup>, <bold>b</bold>
<sup>
<italic>H</italic>
</sup>, <bold>W</bold>
<sup>
<italic>H</italic>&#xd7;(3&#x2b;<italic>C</italic>)</sup>), <italic>H</italic> is the number of hidden units, and <italic>&#x3c3;</italic>(&#x22c5;) is pointwise rectification (ReLU). (Here we take <bold>x</bold>, <bold>y</bold>, and <bold>z</bold> to be column vectors instead of the row vectors we use elsewhere.) Note that there is no positional encoding of <bold>x</bold>. Alternatively, we use a two-layer <italic>position-attention</italic> (PA) network,<disp-formula id="e12">
<mml:math id="m89">
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>&#x3d;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:mi mathvariant="bold">z</mml:mi>
<mml:mo>&#x2299;</mml:mo>
<mml:mi>sin</mml:mi>
<mml:mfenced open="(" close=")">
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">b</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2b;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="bold">W</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mi mathvariant="bold">x</mml:mi>
</mml:mrow>
</mml:mfenced>
<mml:mo>,</mml:mo>
</mml:math>
<label>(12)</label>
</disp-formula>where <italic>&#x3b8;</italic> &#x3d; (<bold>b</bold>
<sup>3</sup>, <bold>b</bold>
<sup>
<italic>C</italic>
</sup>, <bold>W</bold>
<sup>
<italic>C</italic>&#xd7;3</sup>) and &#x2299; is pointwise multiplication. The PA network is a simplified version of the modulated periodic activations in (<xref ref-type="bibr" rid="B50">Mehta et al., 2021</xref>), with far fewer parameters than the MLPs while still providing an efficient representation at low bit rates.</p>
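<p>A minimal sketch of the PA network of Eq. 12 follows (illustrative only; we take <italic>C</italic> &#x3d; 3 so that the pointwise product matches the 3-channel output, and the parameters are random rather than learned).</p>

```python
import numpy as np

# Position-attention CBN: y = b3 + z * sin(bC + W x), with the latent z
# modulating fixed spatial frequencies. C = 3 here so dimensions line up
# with the 3-channel output y.
rng = np.random.default_rng(0)
C = 3
b3 = rng.normal(size=3)      # output bias b^3
bC = rng.normal(size=C)      # phase bias b^C
W = rng.normal(size=(C, 3))  # frequency matrix W^{C x 3}

def pa(x, z):
    return b3 + z * np.sin(bC + W @ x)

x = np.array([0.1, 0.2, 0.3])   # query position within a block
z = np.array([0.5, -1.0, 2.0])  # latent vector attached to the block
y = pa(x, z)
assert y.shape == (3,)
```

With z set to zero the network outputs the constant b^3, so the latent controls how strongly each spatial frequency contributes to the attribute function.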
<p>Once the latent vectors <bold>Z</bold> &#x3d; [<bold>z</bold>
<sub>
<bold>n</bold>
</sub>] are decoded from <inline-formula id="inf78">
<mml:math id="m90">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">U</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> as <inline-formula id="inf79">
<mml:math id="m91">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">Z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> and <italic>&#x3b8;</italic> is decoded as <inline-formula id="inf80">
<mml:math id="m92">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>, the attributes <inline-formula id="inf81">
<mml:math id="m93">
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">y</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula> of any point <inline-formula id="inf82">
<mml:math id="m94">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> can be queried as illustrated in <xref ref-type="fig" rid="F3">Figure 3</xref>.</p>
</sec>
</sec>
<sec id="s5">
<title>5 Experimental results</title>
<sec id="s5-1">
<title>5.1 Dataset and experimental details</title>
<p>Our dataset comprises (i) seven full human body voxelized point clouds derived from meshes created in (<xref ref-type="bibr" rid="B29">Guo et al., 2019</xref>; <xref ref-type="bibr" rid="B51">Meka et al., 2020</xref>) (shown in <xref ref-type="fig" rid="F1">Figure 1</xref>) and (ii) seven point clouds&#x2014;four full human bodies and three objects of art&#x2014;from the MPEG PCC dataset (<xref ref-type="bibr" rid="B19">d&#x2019;Eon et al., 2017</xref>; <xref ref-type="bibr" rid="B2">Alliez et al., 2017</xref>) (see the <xref ref-type="sec" rid="s12">Supplementary Material</xref>). Integer voxel coordinates are used as the point positions <bold>x</bold>
<sub>
<italic>i</italic>
</sub>. The voxels (and hence the point positions) have 10-bit resolution. This results in an octree of depth 10, or alternatively a binary tree of depth 30, for every point cloud. For most experiments, we train all variables (latents, step sizes, an entropy model per binary level, and a CBN at the target level <italic>L</italic>) on a single point cloud, as the variables are specific to each point cloud. However, for the generalization experiments in <xref ref-type="sec" rid="s5-4">Section 5.4</xref>, we train only the latents, step sizes, and entropy models on the given point cloud, while using a CBN pre-trained on a different point cloud. Additional experimental details are given in the <xref ref-type="sec" rid="s12">Supplementary Material</xref>.</p>
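<p>The stated equivalence between a depth-10 octree and a depth-30 binary tree can be realized by bit interleaving. The sketch below (illustrative, not the paper's code) computes the 30-bit Morton code of a 10-bit voxel coordinate; each group of three bits selects one octree child, while each single bit selects one binary-tree child.</p>

```python
def morton_code(x, y, z, bits=10):
    # Interleave the bits of three 10-bit coordinates into one 30-bit key.
    code = 0
    for i in reversed(range(bits)):
        code = (code << 3) | (((x >> i) & 1) << 2) | (((y >> i) & 1) << 1) | ((z >> i) & 1)
    return code

assert morton_code(0, 0, 1) == 1                       # only the lowest z-bit is set
assert morton_code(1023, 1023, 1023) == (1 << 30) - 1  # deepest corner voxel
```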
<p>The entire point cloud constitutes one batch. All configurations are trained in about 25&#xa0;K steps using the Adam optimizer and a learning rate of 0.01, with low bit rate configurations typically taking longer to converge. Each step takes 0.5&#x2013;3.0&#xa0;s on an NVIDIA P100 class GPU in eager mode with various debugging checks in place. We will open-source our code at <ext-link ext-link-type="uri" xlink:href="https://github.com/tensorflow/compression/tree/master/models/lvac/">https://github.com/tensorflow/compression/tree/master/models/lvac/</ext-link> upon publication.</p>
<p>As the experimental results below will show, the relative performance gains of various LVAC configurations and the baselines are largely consistent over all human body point clouds as well as object point clouds. This consistency may be explained in part by all variables in LVAC being trained on the given point cloud; hence LVAC is instance-adaptive (except in our generalization studies). No average-case models are trained to fit all point clouds. Thus we expect consistent behavior across other types of point clouds, e.g., room scans. We acknowledge, however, that some types of point clouds, such as dynamically-acquired LIDAR point clouds, may have a special structure that our framework does not take advantage of. Indeed, MPEG G-PCC has special coding modes for such point clouds.</p>
</sec>
<sec id="s5-2">
<title>5.2 Baselines</title>
<sec id="s5-2-1">
<title>5.2.1 RAHT</title>
<p>Our first baseline is RAHT, which is the core transform in the MPEG G-PCC, coupled with the adaptive Run-Length Golomb-Rice (RLGR) entropy coder (<xref ref-type="bibr" rid="B48">Malvar, 2006</xref>). <xref ref-type="fig" rid="F5">Figure 5A</xref> shows the rate-distortion (RD) performance of <italic>RAHT &#x2b; RLGR</italic> in RGB PSNR (dB) vs<italic>.</italic> bit rate (bits per point, or bpp). As PSNR is a measure of quality, higher is better. In <italic>RAHT &#x2b; RLGR</italic>, the RAHT coefficients are uniformly scalar quantized. The quantized coefficients are concatenated by level from the root to the leaves and entropy coded using RLGR, independently for each color component. The RD performances using RGB and YUV (BT.709) colorspaces are shown in <xref ref-type="fig" rid="F5">Figure 5A</xref> in blue with filled and unfilled markers, respectively. At low bit rates, YUV provides a significant gain in RGB PSNR, but this falls off at high bit rates.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>
<bold>(A)</bold> Baselines. <italic>RAHT &#x2b; RLGR</italic> (<italic>RGB</italic>) and (<italic>YUV</italic>) are shown against 3&#xd7;3 linear models at levels 30, 27, 24, and 21, which optimize the colorspace by minimizing <italic>D</italic> &#x2b; <italic>&#x3bb;R</italic> using the <italic>cbe</italic> entropy model. Since <italic>level &#x3d; 30, model &#x3d; cbe &#x2b; linear(3x3)</italic> outperforms <italic>RAHT &#x2b; RLGR</italic> (<italic>YUV</italic>), we discard the latter and use the others as baselines for more complex CBNs. <bold>(B)</bold> RD performance (YUV PSNR vs. bit rate) comparison with <italic>RAHT &#x2b; RLGR</italic> (<italic>RGB</italic>) (<xref ref-type="bibr" rid="B17">de Queiroz and Chou, 2016</xref>) and <italic>Deep-PCAC</italic> (<xref ref-type="bibr" rid="B76">Sheng et al., 2021</xref>).</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g005.tif"/>
</fig>
</sec>
<sec id="s5-2-2">
<title>5.2.2 Deep-PCAC</title>
<p>As a secondary baseline, we provide a comparison with Deep-PCAC (<xref ref-type="bibr" rid="B76">Sheng et al., 2021</xref>)&#x2014;see <xref ref-type="fig" rid="F5">Figure 5B</xref>. As mentioned earlier, Deep-PCAC is based on PointNet, which is not volumetric. Therefore, it cannot be used in other scenarios, such as radiance fields, and it lacks point cloud features such as infinite zoom. We nevertheless compare LVAC with Deep-PCAC to show that learned point cloud attribute compression is not trivial and relies on the steps discussed in this work.</p>
</sec>
<sec id="s5-2-3">
<title>5.2.3 Linear LVAC</title>
<p>Finally, <italic>level &#x3d; 30, model &#x3d; cbe &#x2b; linear</italic> (3 &#xd7; 3) in <xref ref-type="fig" rid="F5">Figure 5A</xref> shows the RD performance of our LVAC framework when 3-channel latents (<italic>C</italic> &#x3d; 3) are quantized and entropy coded using the Continuous Batched Entropy (<italic>cbe</italic>) model with the Noisy Deep Factorized prior from TensorFlow Compression (<xref ref-type="bibr" rid="B5">Ball&#xe9; et al., 2021</xref>), followed by a simple 3 &#xd7; 3 linear matrix as the CBN, at binary target level 30. The performance of this simple linear model matches that of <italic>RAHT &#x2b; RLGR</italic> (<italic>YUV</italic>) at low rates and outperforms it at high rates. Therefore, it is useful as a pseudo-baseline, and we show it in all subsequent plots along with our first baseline, <italic>RAHT &#x2b; RLGR</italic> (<italic>RGB</italic>). <xref ref-type="fig" rid="F5">Figure 5A</xref> also shows that at lower target levels (27, 24, 21), LVAC with the 3 &#xd7; 3 matrix saturates at high rates, since the 3 &#xd7; 3 matrix has no positional input and thus represents the volumetric attribute function as a constant across each block. These constant functions serve as baselines for the more complex CBNs at these levels, described next.</p>
<p>Similar observations can be made from the plots in the <xref ref-type="sec" rid="s12">Supplementary Material</xref> for the ten other point clouds. Alternative baselines are considered in <xref ref-type="sec" rid="s5-9">Section 5.9</xref>.</p>
</sec>
</sec>
<sec id="s5-3">
<title>5.3 Coordinate based networks</title>
<p>We now compare configurations of the LVAC framework with four different CBNs: <italic>linear(3x3)</italic> with 9 parameters (as a baseline), <italic>mlp(35 &#xd7; 256 &#xd7; 3)</italic> with 9,987 parameters, <italic>mlp(35 &#xd7; 64 &#xd7; 3)</italic> with 2,499 parameters, and <italic>pa(3 &#xd7; 32 &#xd7; 3)</italic> with 227 parameters, at different target levels. The <italic>mlp(35 &#xd7; 256 &#xd7; 3)</italic> and <italic>mlp(35 &#xd7; 64 &#xd7; 3)</italic> CBNs are two-layer MLPs with 35 inputs (3 for position and 32 for a latent vector, i.e., <italic>C</italic> &#x3d; 32) and 3 outputs, having respectively 256 and 64 hidden nodes. The <italic>pa(3 &#xd7; 32 &#xd7; 3)</italic> CBN is a Position-Attention (PA) network also with 35 inputs (3 for position and 32 for a latent vector) and 3 outputs. All configurations use the Continuous Batched Entropy (<italic>cbe</italic>) model for quantization and entropy coding of the 32-channel latents.</p>
<p>
<xref ref-type="fig" rid="F6">Figures 6A&#x2013;C</xref> show (in green, red, purple) the RD performance of these CBNs at different target levels (27, 24, 21), along with the baselines (in blue, orange). We observe, first, that at each target level <italic>L</italic> &#x3d; 27, 24, 21, the CBNs with more parameters outperform the CBNs with fewer parameters. In particular, at higher bit rates, the MLP and PA networks at level <italic>L</italic> improve by 5&#x2013;10&#xa0;dB over the linear network at the same level, whose RD performance saturates as described earlier. Second, at each target level <italic>L</italic> &#x3d; 27, 24, 21, there is a range of bit rates over which the MLP and PA networks improve by 2&#x2013;3&#xa0;dB over even the <italic>level &#x3d; 30, model &#x3d; cbe &#x2b; linear(3x3)</italic> baseline, which does not saturate. The range of bit rates in which this improvement is achieved is higher for level 27 and lower for level 21, reflecting that higher quality requires CBNs with smaller block sizes. In the <xref ref-type="sec" rid="s12">Supplementary Material</xref>, we show these same data factored by CBN type instead of by level, to illustrate again that for each CBN type, each level is optimal for a different bit rate range. <xref ref-type="fig" rid="F5">Figure 5B</xref> demonstrates that LVAC provides a gain of 2&#x2013;5&#xa0;dB over our secondary baseline, Deep-PCAC (<xref ref-type="bibr" rid="B76">Sheng et al., 2021</xref>). Comparison plots for other point clouds are provided in the <xref ref-type="sec" rid="s12">Supplementary Material</xref>.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>Coordinate Based Networks, by target level. <bold>(A&#x2013;C)</bold> each show the <italic>mlp(35 &#xd7; 256 &#xd7; 3)</italic>, <italic>mlp(35 &#xd7; 64 &#xd7; 3)</italic>, and <italic>pa(3 &#xd7; 32 &#xd7; 3)</italic> CBNs, along with baselines, at levels 27, 24, and 21. More complex CBNs outperform less complex ones. Higher levels are better for higher bit rates.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g006.tif"/>
</fig>
<p>The nature of a volumetric function <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold>; <bold>z</bold>) represented by a CBN is illustrated in <xref ref-type="fig" rid="F7">Figure 7</xref>. To illustrate, we select the CBN <italic>mlp(35 &#xd7; 256 &#xd7; 3)</italic> trained on the <italic>rock</italic> point cloud at target level <italic>L</italic> &#x3d; 21, and we plot cuts through the volumetric function <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(&#x22c5;; <bold>z</bold>) represented by this CBN. Specifically, let <italic>n</italic> be a randomly selected node at the target level <italic>L</italic>, let <inline-formula id="inf83">
<mml:math id="m95">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> be the quantized cumulative latent at that node, and let <bold>x</bold>
<sub>
<italic>n</italic>
</sub> &#x3d; (<italic>x</italic>
<sub>
<italic>n</italic>
</sub>, <italic>y</italic>
<sub>
<italic>n</italic>
</sub>, <italic>z</italic>
<sub>
<italic>n</italic>
</sub>) be the position of a randomly selected point within the block at that node. Then we plot the first (red) component of the function <inline-formula id="inf84">
<mml:math id="m96">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, where <bold>x</bold> varies from (0, <italic>y</italic>
<sub>
<italic>n</italic>
</sub>, <italic>z</italic>
<sub>
<italic>n</italic>
</sub>) to (<italic>N</italic>
<sub>
<italic>x</italic>
</sub>, <italic>y</italic>
<sub>
<italic>n</italic>
</sub>, <italic>z</italic>
<sub>
<italic>n</italic>
</sub>), where <italic>N</italic>
<sub>
<italic>x</italic>
</sub> is the width of a block at level <italic>L</italic>. We do this for many randomly selected nodes <italic>n</italic> to get a sense of the distribution of volumetric functions represented at that level. (The distribution looks similar for green and blue components, and for cuts along <italic>y</italic> and <italic>z</italic> axes.) We observe that for many values of <inline-formula id="inf85">
<mml:math id="m97">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>, <inline-formula id="inf86">
<mml:math id="m98">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> is a roughly constant function. Thus, <inline-formula id="inf87">
<mml:math id="m99">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula> must encode the colors of the palette used for these functions. However, we also observe that for some values of <inline-formula id="inf88">
<mml:math id="m100">
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:math>
</inline-formula>, <inline-formula id="inf89">
<mml:math id="m101">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> is a ramp or some other nonlinear function across its domain. Finally, we observe almost no energy at frequencies higher than the Nyquist frequency (half the sampling rate), where the sampling occurs at units of voxels. We conclude that <inline-formula id="inf90">
<mml:math id="m102">
<mml:msub>
<mml:mrow>
<mml:mi>f</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>&#x3b8;</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mo>&#x22c5;</mml:mo>
<mml:mo>;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:mi mathvariant="bold">z</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">&#x302;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> acts like a codebook of volumetric functions defined on <italic>N</italic>
<sub>
<italic>x</italic>
</sub> &#xd7; <italic>N</italic>
<sub>
<italic>y</italic>
</sub> &#xd7; <italic>N</italic>
<sub>
<italic>z</italic>
</sub>, fit to the point cloud at hand.</p>
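As a concrete (toy) illustration of such a codebook of volumetric functions, the sketch below evaluates a miniature coordinate-based network along an x-cut through a block, conditioned on a block latent. All dimensions here (an 8-dimensional latent, a single hidden layer of width 16) are made up for illustration and do not match the architectures studied in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy coordinate-based network: f_theta(x; z) maps a 1-D coordinate plus a
# latent vector z_n to an RGB value. Weights are random stand-ins for the
# trained parameters theta.
W1 = rng.normal(size=(1 + 8, 16)); b1 = np.zeros(16)
W2 = rng.normal(size=(16, 3));     b2 = np.zeros(3)

def f_theta(x, z):
    h = np.maximum(np.concatenate([[x], z]) @ W1 + b1, 0.0)  # ReLU layer
    return h @ W2 + b2                                       # RGB output

# A cut along x through one block, conditioned on that block's latent z_n;
# varying z_n selects a different function from the "codebook".
z_n = rng.normal(size=8)
cut = np.array([f_theta(x, z_n) for x in np.linspace(0.0, 1.0, 32)])
```

Each distinct latent traces out a different curve of (R, G, B) values over the block, which is what the cuts in Figure 7 visualize for the trained network.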
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>Cuts through the volumetric function (<italic>R</italic>, <italic>G</italic>, <italic>B</italic>)&#x3d; <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>((<italic>x</italic>, <italic>y</italic>
<sub>
<italic>n</italic>
</sub>, <italic>z</italic>
<sub>
<italic>n</italic>
</sub>); <bold>z</bold>
<sub>
<italic>n</italic>
</sub>) represented by a CBN, along the <italic>x</italic>-axis through a random point <bold>x</bold>
<sub>
<italic>n</italic>
</sub> &#x3d;(<italic>x</italic>
<sub>
<italic>n</italic>
</sub>, <italic>y</italic>
<sub>
<italic>n</italic>
</sub>, <italic>z</italic>
<sub>
<italic>n</italic>
</sub>) in the point cloud within a node <italic>n</italic>, for various occupied nodes <italic>n</italic> at target level 21. It can be seen that the CBN specifies a codebook of volumetric functions defined on blocks, fit to the point cloud at hand.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g007.tif"/>
</fig>
</sec>
<sec id="s5-4">
<title>5.4 Generalization</title>
<p>We also explore the degree to which the CBNs can be generalized across point clouds; that is, whether they can be trained to represent a universal family of volumetric functions. <xref ref-type="fig" rid="F8">Figure 8</xref> shows that the CBNs can indeed generalize across point clouds at low bit rates. We provide the corresponding plots for other point clouds in the <xref ref-type="sec" rid="s12">Supplementary Material</xref>.</p>
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>Coordinate Based Networks with generalization, by level <bold>(A&#x2013;C)</bold> and by network <bold>(D&#x2013;F)</bold>. CBNs that are generalized (i.e., pre-trained on another point cloud) are able to outperform the baselines at low bit rates.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g008.tif"/>
</fig>
</sec>
<sec id="s5-5">
<title>5.5 Side information</title>
<p>When the latents, step sizes, entropy models, and CBN are all optimized for a specific point cloud, quantizing and entropy coding only the latent vectors [<bold>z</bold>
<sub>
<bold>n</bold>
</sub>] is insufficient for reconstructing the point cloud attributes. The step sizes [&#x394;<sub>
<italic>c</italic>
</sub>], entropy model parameters [<italic>&#x3d5;</italic>
<sub>
<italic>&#x2113;</italic>,<italic>c</italic>
</sub>], and CBN parameters <italic>&#x3b8;</italic> must also be quantized, entropy coded, and sent as <italic>side information</italic>. Sending side information incurs additional bit rate and distortion. Note that the side information for the step sizes is negligible, as there is only one step size for each of <italic>C</italic> &#x3d; 32 channels.</p>
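The quantization referred to above can be sketched minimally as follows. This is a generic uniform scalar quantizer with one step size per channel, not the paper's exact implementation:

```python
import numpy as np

def quantize(z, delta):
    """Uniform scalar quantization; delta may be a scalar or a
    per-channel array of C step sizes (C = 32 in the paper)."""
    return np.round(z / delta).astype(int)

def dequantize(q, delta):
    return q * delta

# Per-channel step sizes are cheap side information: one float per channel.
z = np.array([0.30, -1.20, 0.70])
delta = np.array([0.50, 0.50, 0.25])
q = quantize(z, delta)          # integer symbols to be entropy coded
z_hat = dequantize(q, delta)    # reconstruction
```

Because only one step size is sent per channel, the side information for the step sizes is negligible compared to that of the entropy model and CBN parameters.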
<sec id="s5-5-1">
<title>5.5.1 Side information for the entropy models</title>
<p>We first consider the side information for the entropy models. <xref ref-type="fig" rid="F9">Figure 9</xref> shows the penalty incurred by transmitting side information for the entropy model, for the point cloud <italic>rock</italic>. We use TensorFlow Compression&#x2019;s Continuous Batched Entropy (<italic>cbe</italic>) model with the Noisy Deep Factorized prior. For 32 channels, this model has 23,296 parameters. If each parameter is represented with 32 bits, then 0.89 bits per point of side information is required for the point cloud <italic>rock</italic>, which has 837,434 points. This would shift the RD performance from the solid green line to the dashed green line in the figure, for <italic>level &#x3d; 27, model &#x3d; cbe &#x2b; mlp(35 &#xd7; 256 &#xd7; 3)</italic>. Fortunately, this costly side information can be avoided by using <italic>cbe</italic> during training but the adaptive Run-Length Golomb-Rice (RLGR) entropy coder of <xref ref-type="bibr" rid="B48">Malvar (2006)</xref> at inference time. Since RLGR is backward adaptive, it can adapt to Laplacian-like distributions without sending any side information. Of course, its coding efficiency may suffer, but our experiments show that the degradation is almost negligible. Henceforth we report RD performance using only RLGR; the resulting RD performance is shown as the dotted line with unfilled markers. The corresponding plots for other point clouds are given in the <xref ref-type="sec" rid="s12">Supplementary Material</xref>.</p>
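The 0.89 bits-per-point figure follows directly from the counts given in the text; `side_info_bpp` is a hypothetical helper name used only for this illustration:

```python
def side_info_bpp(num_params, bits_per_param, num_points):
    """Bits per point spent on side information, amortized over the cloud."""
    return num_params * bits_per_param / num_points

# 23,296 cbe parameters at 32 bits each, amortized over the
# 837,434 points of "rock":
cost = side_info_bpp(23296, 32, 837434)  # about 0.89 bits per point
```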
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption>
<p>Side information for entropy model. Sending 32 bits per parameter for the <italic>cbe</italic> entropy model would reduce RD performance from solid to dashed green lines. But the backward-adaptive <italic>RLGR</italic> entropy coder (dotted, unfilled) obviates the need to send side information with almost no loss in performance.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g009.tif"/>
</fig>
</sec>
<sec id="s5-5-2">
<title>5.5.2 Side information for the CBNs</title>
<p>Next, we consider the side information for the CBNs. For each point cloud, there is one CBN, at the target level <italic>L</italic>. Allocating 32 bits per floating point parameter would give the most pessimistic estimate for the side information. However, it is likely that 32 bits per floating point parameter is an order of magnitude more than necessary. Prior work has shown that simple model compression can be performed at 8 bits (<xref ref-type="bibr" rid="B9">Banner et al., 2018</xref>; <xref ref-type="bibr" rid="B93">Wang et al., 2018</xref>; <xref ref-type="bibr" rid="B82">Sun et al., 2019</xref>) or even more aggressively at 1&#x2013;4 bits per floating point parameter (<xref ref-type="bibr" rid="B31">Han et al., 2015</xref>; <xref ref-type="bibr" rid="B94">Xu et al., 2018</xref>; <xref ref-type="bibr" rid="B59">Oktay et al., 2019</xref>; <xref ref-type="bibr" rid="B80">Stock et al., 2019</xref>; <xref ref-type="bibr" rid="B92">Wang et al., 2019</xref>; <xref ref-type="bibr" rid="B34">Isik et al., 2021a</xref>; <xref ref-type="bibr" rid="B37">Isik et al., 2022</xref>) with very low loss in performance, even with CBNs such as NeRF (<xref ref-type="bibr" rid="B11">Bird et al., 2021</xref>; <xref ref-type="bibr" rid="B36">Isik, 2021</xref>). Alternatively, the CBNs may be generalized by pre-training on other point clouds to avoid having to transmit any side information. <xref ref-type="fig" rid="F10">Figure 10</xref> shows RD performance under the 32-bit assumption as well as under generalization. We refer the reader to the <xref ref-type="sec" rid="s12">Supplementary Material</xref> for the corresponding plots for other point clouds.</p>
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption>
<p>Effect of side information for coordinate based networks <italic>mlp(35 &#xd7; 256 &#xd7; 3)</italic> <bold>(A&#x2013;C)</bold>, <italic>mlp(35 &#xd7; 64 &#xd7; 3)</italic> <bold>(D&#x2013;F)</bold>, and <italic>pa(3 &#xd7; 32 &#xd7; 3)</italic> <bold>(G&#x2013;I)</bold> at levels 27 <bold>(A,D,G)</bold>, 24 <bold>(B,E,H)</bold>, and 21 <bold>(C,F,I)</bold>. Sending 32 bits per parameter for the CBN would degrade RD performance from solid to dashed lines. The degradation would be inversely proportional to compression ratio if model compression is used. Alternatively, generalization (pre-training the CBN on one or more other point clouds), which works well at low bit rates, would obviate the need to transmit any side information. Generalization is indicated by &#x201c;gen&#x201d; in the legend.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g010.tif"/>
</fig>
<p>Now, we turn to a key ablation study.</p>
</sec>
</sec>
<sec id="s5-6">
<title>5.6 Orthonormalization</title>
<p>One of our main contributions is to show that na&#xef;ve uniform scalar quantization and entropy coding of the latents leads to poor results, and that properly normalizing the coefficients before quantization achieves over a 30% reduction in bit rate. In this ablation study, we remove our orthonormalization by setting the scale matrix <bold>S</bold> in <xref ref-type="disp-formula" rid="e7">(Eqs. 7</xref>, <xref ref-type="disp-formula" rid="e8">8)</xref> and <xref ref-type="fig" rid="F4">Figure 4</xref> to the identity matrix, thus removing any dependency of the attribute compression on the geometry. This corresponds to a na&#xef;ve approach to compression, for example by assuming a fixed number of bits per latent as in (<xref ref-type="bibr" rid="B84">Takikawa et al., 2021</xref>). <xref ref-type="table" rid="T1">Table 1</xref> shows that, compared to this na&#xef;ve approach, our normalization achieves over a 30% reduction in bit rate, computed using the BD-rate metric (<xref ref-type="bibr" rid="B12">Bj&#xf8;ntegaard, 2001</xref>; <xref ref-type="bibr" rid="B62">Pateux and Jung, 2007</xref>). This quantifies the gain from conditioning the attribute compression on the geometry. We give results averaged over all point clouds in <xref ref-type="table" rid="T1">Table 1</xref>, results for the <italic>rock</italic> point cloud in <xref ref-type="fig" rid="F11">Figure 11</xref>, and results for each point cloud in the <xref ref-type="sec" rid="s12">Supplementary Material</xref>.</p>
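The BD-rate computation cited above can be sketched as follows. This is the common cubic-fit formulation of Bjøntegaard's metric (log-rate fitted as a function of PSNR, averaged over the overlapping PSNR range); the exact variant used for the paper's tables may differ:

```python
import numpy as np

def bd_rate(rates_ref, psnrs_ref, rates_test, psnrs_test):
    """Bjontegaard delta rate: average percent bit-rate change of the
    test curve relative to the reference, over the overlapping PSNR
    range, using cubic fits of log10(rate) as a function of PSNR."""
    p_ref = np.polyfit(psnrs_ref, np.log10(rates_ref), 3)
    p_test = np.polyfit(psnrs_test, np.log10(rates_test), 3)
    lo = max(min(psnrs_ref), min(psnrs_test))   # overlapping PSNR range
    hi = min(max(psnrs_ref), max(psnrs_test))
    int_ref, int_test = np.polyint(p_ref), np.polyint(p_test)
    avg_ref = (np.polyval(int_ref, hi) - np.polyval(int_ref, lo)) / (hi - lo)
    avg_test = (np.polyval(int_test, hi) - np.polyval(int_test, lo)) / (hi - lo)
    return (10.0 ** (avg_test - avg_ref) - 1.0) * 100.0
```

For example, a test curve that spends exactly twice the rate of the reference at every PSNR yields a BD-rate of +100%; a negative value, as in Table 1, indicates a bit-rate saving.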
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>BD-Rate reductions due to normalization, averaged over point clouds. Normalization is crucial for good performance. Without normalization, there is no dependence on geometry.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th rowspan="2" align="left">CBN</th>
<th colspan="4" align="left">level</th>
</tr>
<tr>
<th align="left">30</th>
<th align="left">27</th>
<th align="left">24</th>
<th align="left">21</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">linear (3 &#xd7; 3)</td>
<td align="left">&#x2212;31.6%</td>
<td align="char" char=".">&#x2212;18.6%</td>
<td align="char" char=".">&#x2212;28.3%</td>
<td align="char" char=".">&#x2212;37.7%</td>
</tr>
<tr>
<td align="left">mlp (35 &#xd7; 256 &#xd7; 3)</td>
<td align="left">N/A</td>
<td align="char" char=".">&#x2212;29.9%</td>
<td align="char" char=".">&#x2212;34.3%</td>
<td align="char" char=".">&#x2212;27.4%</td>
</tr>
<tr>
<td align="left">mlp (35 &#xd7; 64 &#xd7; 3)</td>
<td align="left">N/A</td>
<td align="char" char=".">&#x2212;23.8%</td>
<td align="char" char=".">&#x2212;32.1%</td>
<td align="char" char=".">&#x2212;31.1%</td>
</tr>
<tr>
<td align="left">pa (3 &#xd7; 32 &#xd7; 3)</td>
<td align="left">N/A</td>
<td align="char" char=".">&#x2212;41.1%</td>
<td align="char" char=".">&#x2212;40.3%</td>
<td align="char" char=".">&#x2212;38.7%</td>
</tr>
</tbody>
</table>
</table-wrap>
<fig id="F11" position="float">
<label>FIGURE 11</label>
<caption>
<p>RD performance improvement due to normalization, corresponding to entries in <xref ref-type="table" rid="T1">Table 1</xref>, i.e., columns 1, 2, 3, 4 correspond to levels 30, 27, 24, 21, respectively, and rows 1, 2, 3, 4 correspond to <italic>linear(3 &#xd7; 3)</italic>, <italic>mlp(35 &#xd7; 256 &#xd7; 3)</italic>, <italic>mlp(35 &#xd7; 64 &#xd7; 3)</italic>, <italic>pa(3 &#xd7; 32 &#xd7; 3)</italic>, respectively.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g011.tif"/>
</fig>
</sec>
<sec id="s5-7">
<title>5.7 Convex hull</title>
<p>For different bit rate ranges and for different assumptions on the cost of side information, different configurations of the LVAC framework may be optimal. <xref ref-type="fig" rid="F12">Figure 12</xref> shows the convex hull, or Pareto frontier, of all configurations under assumptions of 0 (<xref ref-type="fig" rid="F12">Figure 12A</xref>), 8 (<xref ref-type="fig" rid="F12">Figure 12B</xref>), and 32 (<xref ref-type="fig" rid="F12">Figure 12C</xref>) bits per floating point parameter. All configurations that we have examined in this paper appear in <xref ref-type="fig" rid="F12">Figure 12</xref>. However, only those that participate in the convex hull appear in the legend and are plotted with solid lines; the others are dotted. The convex hull lies 2&#x2013;4&#xa0;dB above the baselines. We observe three things. First, when the side information costs nothing (0 bits per parameter), the convex hull contains exclusively the largest CBN (<italic>mlp(35 &#xd7; 256 &#xd7; 3)</italic>), at higher target levels for higher bit rates. Second, as the cost of the side information increases, the smaller CBNs (<italic>mlp(35 &#xd7; 64 &#xd7; 3)</italic> and <italic>pa(3 &#xd7; 32 &#xd7; 3)</italic>) begin to participate in the convex hull, especially at lower bit rates. Eventually, at 32 bits per parameter, the largest CBN is excluded entirely. Third, the generalized CBNs never participate in the convex hull, despite not incurring any penalty for side information. This could be because they are trained on only a single other point cloud in these experiments; training the CBNs on more representative data would likely improve their generalization performance, but this is left for future work. The corresponding plots for other point clouds are provided in the <xref ref-type="sec" rid="s12">Supplementary Material</xref>.</p>
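As a sketch of how such a frontier can be extracted from a set of (rate, PSNR) operating points, the function below computes the Pareto (staircase) frontier, of which the lower convex hull plotted in the figure is a subset. It is illustrative only, not the scripts used for the paper:

```python
def pareto_frontier(points):
    """Keep the operating points not dominated by any other:
    (rate, psnr) is dominated if some other point has lower-or-equal
    rate and higher-or-equal psnr, strictly better in at least one."""
    frontier = []
    # Sort by ascending rate; break rate ties with descending PSNR so the
    # best point at each rate is seen first.
    for rate, psnr in sorted(points, key=lambda t: (t[0], -t[1])):
        if not frontier or psnr > frontier[-1][1]:
            frontier.append((rate, psnr))
    return frontier
```

Applied to the RD points of all configurations and levels, only the configurations whose points survive this filter (and lie on the lower convex envelope) appear in the legend.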
<fig id="F12" position="float">
<label>FIGURE 12</label>
<caption>
<p>Convex hull (solid black line) of RD performances of all CBN configurations across all levels, including side information using 0 <bold>(A)</bold>, 8 <bold>(B)</bold>, and 32 <bold>(C)</bold> bits per CBN parameter. Configurations that participate in the convex hull are listed, with baselines, in the legend and appear as solid lines. Others are dotted. At 0 bits per parameter (bpp), the more complex CBNs dominate. At higher bpp, the less complex CBNs begin to participate, especially at lower bit rates. CBNs generalized from another point cloud never participate.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g012.tif"/>
</fig>
</sec>
<sec id="s5-8">
<title>5.8 Subjective quality</title>
<p>
<xref ref-type="fig" rid="F13">Figure 13</xref> shows compression quality at around 0.25&#xa0;bpp, under the assumption of 0 bits per floating point parameter. Additional bit rates are shown in the <xref ref-type="sec" rid="s12">Supplementary Material</xref>.</p>
<fig id="F13" position="float">
<label>FIGURE 13</label>
<caption>
<p>Subjective quality around 0.25&#xa0;bpp. <bold>(A)</bold> Original. <bold>(B)</bold> 0.258&#xa0;bpp, 24.6&#xa0;dB. <bold>(C)</bold> 0.255&#xa0;bpp, 25.9&#xa0;dB. <bold>(D)</bold> 0.255&#xa0;bpp, 28.0&#xa0;dB.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g013.tif"/>
</fig>
</sec>
<sec id="s5-9">
<title>5.9 Baselines, revisited</title>
<p>We now return to the matter of baselines. <xref ref-type="fig" rid="F14">Figure 14</xref> shows our previous baseline, <italic>RAHT &#x2b; RLGR</italic>, for both RGB and YUV color spaces (blue lines). Although RAHT is the transform used in MPEG G-PCC, the reference software TMC13 v6.0 (July 2019) offers improved RD performance (green lines) compared to <italic>RAHT &#x2b; RLGR</italic>, due principally to better entropy coding. In particular, TMC13 uses context-adaptive binary arithmetic coding with various coding modes, while <italic>RAHT &#x2b; RLGR</italic> uses RLGR. We use <italic>RAHT &#x2b; RLGR</italic> as our baseline because our experiments use RLGR as our entropy coder; the specific entropy coder used in TMC13 is difficult to extract from the standard. The latest version, TMC13 v14.0 (October 2021), offers even better RD performance by introducing, for example, joint coding modes for color channels that are all zero (orange lines). It also introduces predictive RAHT, in which the RAHT coefficients at each level are predicted from the decoded RAHT coefficients at the previous level (<xref ref-type="bibr" rid="B45">Lasserre and Flynn, 2019</xref>; <xref ref-type="bibr" rid="B22">3DG, 2020b</xref>; <xref ref-type="bibr" rid="B64">Pavez et al., 2021</xref>). The prediction residuals, instead of the RAHT coefficients themselves, are quantized and entropy coded. Predictive RAHT alone improves RD performance by 2&#x2013;3&#xa0;dB (red lines). Nevertheless, in the low bit rate regime, LVAC with RLGR and no RAHT prediction has better performance than even TMC13 v14.0 with predictive RAHT (solid black line, from <xref ref-type="fig" rid="F12">Figure 12A</xref>). We believe that the RD performance of LVAC can be further improved significantly. In particular, the principal advances of TMC13 over <italic>RAHT &#x2b; RLGR</italic>&#x2014;better entropy coding and predictive RAHT&#x2014;are equally applicable to the LVAC framework.
For example, better entropy coding could be done with a hyperprior (<xref ref-type="bibr" rid="B8">Ball&#xe9; et al., 2018</xref>), and predictive RAHT could be applied to the latent vectors. These explorations are left for future work.</p>
<fig id="F14" position="float">
<label>FIGURE 14</label>
<caption>
<p>Baselines, revisited. In both RGB and YUV color spaces, MPEG G-PCC reference software TMC13 v6.0 improves over <italic>RAHT &#x2b; RLGR</italic>, primarily due to context-adaptive (i.e., dependent) entropy coding. TMC13 v14.0 improves still further, primarily due to predictive RAHT. LVAC (black line, from <xref ref-type="fig" rid="F12">Figure 12A</xref>) outperforms all but TMC13 v14.0. However, better entropy coding (e.g., hyperprior) and predictive RAHT can also be applied to LVAC.</p>
</caption>
<graphic xlink:href="frsip-02-1008812-g014.tif"/>
</fig>
</sec>
</sec>
<sec id="s6">
<title>6 Discussion and conclusion</title>
<p>This work is the first to compress volumetric functions <bold>y</bold> &#x3d; <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold>) modeled by local coordinate-based networks. Though we focused on RGB attributes <bold>y</bold>, the extension to other attributes (signed distance, density, <italic>etc.</italic>) is straightforward. Also, though we focused on <inline-formula id="inf91">
<mml:math id="m103">
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mi mathvariant="double-struck">R</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>, extensions to hyper-volumetric functions (such as <bold>y</bold> &#x3d; <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>(<bold>x</bold>, <bold>d</bold>) where <bold>d</bold> is a view direction) is also straightforward. Thus LVAC should be applicable to plenoptic point clouds (<xref ref-type="bibr" rid="B41">Krivokuca et al., 2018</xref>; <xref ref-type="bibr" rid="B71">Sandri et al., 2018</xref>; <xref ref-type="bibr" rid="B99">Zhang et al., 2018</xref>; <xref ref-type="bibr" rid="B72">Sandri et al., 2019</xref>; <xref ref-type="bibr" rid="B100">Zhang et al., 2019</xref>) as well as radiance fields (<xref ref-type="bibr" rid="B57">Mildenhall et al., 2020</xref>; <xref ref-type="bibr" rid="B96">Yu A. et al., 2021</xref>; <xref ref-type="bibr" rid="B49">Martel et al., 2021</xref>; <xref ref-type="bibr" rid="B84">Takikawa et al., 2021</xref>; <xref ref-type="bibr" rid="B101">Zhang et al., 2021</xref>) under an appropriate distance measure. We believe that the main difference between plenoptic point clouds and radiance fields is the distortion measure <italic>d</italic> (<italic>f</italic>, <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>). For point clouds, <italic>d</italic> (<italic>f</italic>, <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>) is measured in the domain of <italic>f</italic>, such as the MSE between colors on points in 3D. For radiance fields, <italic>d</italic> (<italic>f</italic>, <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub>) is measured in the domain of <italic>projections</italic> or renderings of <italic>f</italic> onto 2D images, such as the MSE between colors of pixels that are renderings of <italic>f</italic> and <italic>f</italic>
<sub>
<italic>&#x3b8;</italic>
</sub> onto 2D images. In (<xref ref-type="bibr" rid="B18">de Queiroz and Chou, 2017</xref>), the former distortion measures are called <italic>matching distortions</italic>, while the latter are called <italic>projection distortions</italic>. A change in distortion measure may be all that is required to apply LVAC properly to radiance field compression. This work is also among the first to compress point cloud attributes using neural networks, outperforming RAHT, used in MPEG G-PCC, by 2&#x2013;4&#xa0;dB, and Deep-PCAC, a recent learned compression framework, by 2&#x2013;5&#xa0;dB. Although MPEG G-PCC uses additional coding tools to further improve compression, such as context-adaptive arithmetic coding, joint entropy coding of color, and predictive RAHT, these tools are also at our disposal, and may be the subject of further work. It should be recalled that learned image compression evolved over dozens of papers and a half dozen years, being competitive at first with only JPEG on thumbnails, and then successively with JPEG-2000, WebP, and BPG. Only recently has learned image compression been able to outperform the latest standard, VVC, in PSNR (<xref ref-type="bibr" rid="B30">Guo et al., 2021</xref>). Learned volumetric attribute compression (LVAC), like learned image compression, is a work in progress.</p>
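To make the distinction concrete, here is a minimal sketch of a matching distortion, assuming RGB attributes with peak value 255; a projection distortion would apply the same MSE/PSNR computation to pixel colors of 2D renderings of f and f<sub><italic>&#x3b8;</italic></sub> instead:

```python
import numpy as np

def matching_psnr(colors, colors_hat, peak=255.0):
    """MSE between attributes on corresponding 3-D points (a "matching
    distortion" in de Queiroz and Chou's terminology), reported as PSNR
    against the attribute peak value."""
    a = np.asarray(colors, dtype=float)
    b = np.asarray(colors_hat, dtype=float)
    mse = np.mean((a - b) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)
```

Smaller attribute errors on the matched points yield higher PSNR, matching the dB figures reported throughout the experiments.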
</sec>
</body>
<back>
<sec sec-type="data-availability" id="s7">
<title>Data availability statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: JPEG Pleno Database: 8i Voxelized Full Bodies (8iVFB v2)&#x2014;A Dynamic Voxelized Point Cloud Dataset: <ext-link ext-link-type="uri" xlink:href="http://plenodb.jpeg.org/pc/8ilabs/">http://plenodb.jpeg.org/pc/8ilabs/</ext-link>.</p>
</sec>
<sec id="s8">
<title>Ethics statement</title>
<p>Written informed consent was obtained from the individual(s) for the publication of any potentially identifiable images or data included in this article.</p>
</sec>
<sec id="s9">
<title>Author contributions</title>
<p>PC proposed the initial idea; BI finalized the proposed framework and implemented it. BI and PC wrote the paper. SH, NJ, and GT helped with the implementation and writing.</p>
</sec>
<ack>
<p>The authors would like to thank Eirikur Agustsson and Johannes Ball&#xe9; for helpful discussions.</p>
</ack>
<sec sec-type="COI-statement" id="s10">
<title>Conflict of interest</title>
<p>Authors PC, SH, NJ, and GT were employed by the company Google.</p>
<p>The remaining author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec sec-type="disclaimer" id="s11">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<sec id="s12">
<title>Supplementary material</title>
<p>The Supplementary Material for this article can be found online at: <ext-link ext-link-type="uri" xlink:href="https://www.frontiersin.org/articles/10.3389/frsip.2022.1008812/full#supplementary-material">https://www.frontiersin.org/articles/10.3389/frsip.2022.1008812/full&#x23;supplementary-material</ext-link>
</p>
<supplementary-material xlink:href="DataSheet1.PDF" id="SM1" mimetype="application/PDF" xmlns:xlink="http://www.w3.org/1999/xlink"/>
</sec>
<fn-group>
<fn id="fn1">
<label>1</label>
<p>We train the latents, quantizer stepsizes, neural entropy model, and the CBNs for each point cloud. However, we show the CBNs could be generalized across different point clouds.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Agustsson</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Theis</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Universally quantized neural compression</article-title>,&#x201d; in <conf-name>Advances in Neural Information Processing Systems</conf-name>. </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Alliez</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Forge</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>De Luca</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Pierrot-Deseilligny</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Preda</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Culture 3D cloud: A cloud computing platform for 3D scanning, documentation, preservation and dissemination of cultural heritage</article-title>. <source>HAL</source> <volume>64</volume>. </citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Ball&#xe9;</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Minnen</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Johnston</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Agustsson</surname>
<given-names>E.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Nonlinear transform coding</article-title>. <source>IEEE J. Sel. Top. Signal Process.</source> <volume>1</volume>, <fpage>339</fpage>&#x2013;<lpage>353</lpage>. <pub-id pub-id-type="doi">10.1109/JSTSP.2020.3034501</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ball&#xe9;</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Efficient nonlinear transforms for lossy image compression</article-title>. <conf-name>2018 Picture Coding Symp</conf-name>. <publisher-loc>San Francisco, CA, United States</publisher-loc>: <publisher-name>PCS</publisher-name>. <pub-id pub-id-type="doi">10.1109/PCS.2018.8456272</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ball&#xe9;</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hwang</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Agustsson</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2021</year>). <source>TensorFlow Compression: Learned data compression</source>. <comment>Available at: <ext-link ext-link-type="uri" xlink:href="http://github.com/tensorflow/compression">http://github.com/tensorflow/compression</ext-link>
</comment>. </citation>
</ref>
<ref id="B6">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ball&#xe9;</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Laparra</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Simoncelli</surname>
<given-names>E. P.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>End-to-end optimization of nonlinear transform codes for perceptual quality</article-title>. <conf-name>Picture Coding Symp</conf-name>. <publisher-loc>Nuremberg, Germany</publisher-loc>: <publisher-name>PCS</publisher-name>. <pub-id pub-id-type="doi">10.1109/PCS.2016.7906310</pub-id> </citation>
</ref>
<ref id="B7">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ball&#xe9;</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Laparra</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Simoncelli</surname>
<given-names>E. P.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>End-to-end optimized image compression</article-title>,&#x201d; in <conf-name>5th Int. Conf. on Learning Representations (ICLR)</conf-name>. </citation>
</ref>
<ref id="B8">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Ball&#xe9;</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Minnen</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Hwang</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Johnston</surname>
<given-names>N.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Variational image compression with a scale hyperprior</article-title>,&#x201d; in <conf-name>6th Int. Conf. on Learning Representations (ICLR)</conf-name>. </citation>
</ref>
<ref id="B9">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Banner</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Hubara</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Hoffer</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Soudry</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Scalable methods for 8-bit training of neural networks</article-title>,&#x201d; in <conf-name>Proceedings of the 32nd International Conference on Neural Information Processing Systems</conf-name>, <fpage>5151</fpage>&#x2013;<lpage>5159</lpage>. </citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Barron</surname>
<given-names>J. T.</given-names>
</name>
<name>
<surname>Mildenhall</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Tancik</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Hedman</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Martin-Brualla</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Srinivasan</surname>
<given-names>P. P.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Mip-nerf: A multiscale representation for anti-aliasing neural radiance fields</article-title>. <comment>ArXiv</comment>. <pub-id pub-id-type="doi">10.48550/arXiv.2103.13415</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Bird</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Ball&#xe9;</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>3d scene compression through entropy penalized neural representation functions</article-title>,&#x201d; in <conf-name>Picture Coding Symposium (PCS)</conf-name>. </citation>
</ref>
<ref id="B12">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bj&#xf8;ntegaard</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2001</year>). <source>Calculation of average PSNR differences between RD-curves</source>. <publisher-loc>Austin, Texas</publisher-loc>. <comment>Technical Report VCEG-M33, ITU-T SG16/Q6</comment>. </citation>
</ref>
<ref id="B13">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chen</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Learning continuous image representation with local implicit image function</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</conf-name>, <fpage>8628</fpage>&#x2013;<lpage>8638</lpage>. </citation>
</ref>
<ref id="B14">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Cheng</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Takeuchi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Katto</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Learned image compression with discretized Gaussian mixture likelihoods and attention modules</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <fpage>7939</fpage>&#x2013;<lpage>7948</lpage>. </citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Koroteev</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Krivoku&#x107;a</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A volumetric approach to point cloud compression&#x2014;Part I: Attribute compression</article-title>. <source>IEEE Trans. Image Process.</source> <volume>29</volume>, <fpage>2203</fpage>&#x2013;<lpage>2216</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2019.2908095</pub-id> </citation>
</ref>
<ref id="B16">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Cohen</surname>
<given-names>R. A.</given-names>
</name>
<name>
<surname>Tian</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Vetro</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Attribute compression for sparse point clouds using graph transforms</article-title>,&#x201d; in <conf-name>IEEE Int&#x2019;l Conf. Image Processing (ICIP)</conf-name>. </citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>de Queiroz</surname>
<given-names>R. L.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Compression of 3d point clouds using a region-adaptive hierarchical transform</article-title>. <source>IEEE Trans. Image Process.</source> <volume>25</volume>, <fpage>3947</fpage>&#x2013;<lpage>3956</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2016.2575005</pub-id> </citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>de Queiroz</surname>
<given-names>R. L.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Motion-compensated compression of dynamic voxelized point clouds</article-title>. <source>IEEE Trans. Image Process.</source> <volume>26</volume>, <fpage>3886</fpage>&#x2013;<lpage>3895</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2017.2707807</pub-id> </citation>
</ref>
<ref id="B19">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>d&#x2019;Eon</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Harrison</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Meyers</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
</person-group> (<year>2017</year>). <source>
<italic>8i voxelized full bodies &#x2014; a voxelized point cloud dataset</italic>. Input document WG11M40059 &#x26; WG1M74006</source>. <publisher-loc>Geneva, CH</publisher-loc>: <publisher-name>JPEG &#x26; MPEG</publisher-name>. <comment>ISO/IEC JTC1/SC29 WG1 &#x26; WG11</comment>. </citation>
</ref>
<ref id="B20">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>DeVries</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Bautista</surname>
<given-names>M. A.</given-names>
</name>
<name>
<surname>Srivastava</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Taylor</surname>
<given-names>G. W.</given-names>
</name>
<name>
<surname>Susskind</surname>
<given-names>J. M.</given-names>
</name>
</person-group> (<year>2021</year>). <source>Unconstrained scene generation with locally conditioned radiance fields</source>. </citation>
</ref>
<ref id="B21">
<citation citation-type="book">
<collab>DG 3</collab> (<year>2020a</year>). <source>Final call for evidence on JPEG Pleno point cloud coding. Approved WG 1 document N88014</source>. <comment>ISO/IEC JPEG JTC1/SC29/WG1, online</comment>. </citation>
</ref>
<ref id="B22">
<citation citation-type="book">
<collab>DG 3</collab> (<year>2020b</year>). <source>G-PCC Codec Description v12. Approved WG 11 document N18891</source>. <publisher-loc>Geneva, CH</publisher-loc>. <comment>ISO/IEC MPEG JTC1/SC29/WG11</comment>. </citation>
</ref>
<ref id="B23">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Fang</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Hu</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Xu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Guo</surname>
<given-names>Y.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>3dac: Learning attribute compression for point clouds</article-title>,&#x201d; in <conf-name>2022 IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>. </citation>
</ref>
<ref id="B24">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Fujiwara</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Hashimoto</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Neural implicit embedding for point cloud analysis</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <fpage>11734</fpage>&#x2013;<lpage>11743</lpage>. </citation>
</ref>
<ref id="B25">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Graziosi</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Nakagami</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Kuma</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Zaghetto</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Suzuki</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Tabatabai</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>An overview of ongoing point cloud compression standardization activities: Video-based (v-pcc) and geometry-based (g-pcc)</article-title>. <source>APSIPA Trans. Signal Inf. Process.</source> <volume>9</volume>, <fpage>e13</fpage>. <pub-id pub-id-type="doi">10.1017/ATSIP.2020.12</pub-id> </citation>
</ref>
<ref id="B26">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Guarda</surname>
<given-names>A. F. R.</given-names>
</name>
<name>
<surname>Rodrigues</surname>
<given-names>N. M. M.</given-names>
</name>
<name>
<surname>Pereira</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2019a</year>). &#x201c;<article-title>Deep learning-based point cloud coding: A behavior and performance study</article-title>,&#x201d; in <conf-name>2019 8th European Workshop on Visual Information Processing (EUVIP)</conf-name>, <fpage>34</fpage>&#x2013;<lpage>39</lpage>. <pub-id pub-id-type="doi">10.1109/EUVIP47703.2019.8946211</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Guarda</surname>
<given-names>A. F. R.</given-names>
</name>
<name>
<surname>Rodrigues</surname>
<given-names>N. M. M.</given-names>
</name>
<name>
<surname>Pereira</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Deep learning-based point cloud geometry coding: RD control through implicit and explicit quantization</article-title>,&#x201d; in <conf-name>2020 IEEE Int. Conf. on Multimedia &#x26; Expo Wksps. (ICMEW)</conf-name>. <pub-id pub-id-type="doi">10.1109/ICMEW46912.2020.9106022</pub-id> </citation>
</ref>
<ref id="B28">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Guarda</surname>
<given-names>A. F. R.</given-names>
</name>
<name>
<surname>Rodrigues</surname>
<given-names>N. M. M.</given-names>
</name>
<name>
<surname>Pereira</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2019b</year>). &#x201c;<article-title>Point cloud coding: Adopting a deep learning-based approach</article-title>,&#x201d; in <conf-name>2019 Picture Coding Symposium (PCS)</conf-name>, <fpage>1</fpage>&#x2013;<lpage>5</lpage>. <pub-id pub-id-type="doi">10.1109/PCS48520.2019.8954537</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guo</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Lincoln</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Davidson</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Busch</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Whalen</surname>
<given-names>M.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>The relightables: Volumetric performance capture of humans with realistic relighting</article-title>. <source>ACM Trans. Graph.</source> <volume>38</volume>, <fpage>1</fpage>&#x2013;<lpage>19</lpage>. <pub-id pub-id-type="doi">10.1145/3355089.3356571</pub-id> </citation>
</ref>
<ref id="B30">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Guo</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>Z.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Causal contextual prediction for learned image compression</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol.</source> <volume>1</volume>, <fpage>2329</fpage>&#x2013;<lpage>2341</lpage>. <pub-id pub-id-type="doi">10.1109/TCSVT.2021.3089491</pub-id> </citation>
</ref>
<ref id="B31">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Han</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Mao</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Dally</surname>
<given-names>W. J.</given-names>
</name>
</person-group> (<year>2015</year>). <source>Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding</source>. <comment>
<italic>arXiv preprint arXiv:1510.00149</italic>
</comment>. </citation>
</ref>
<ref id="B32">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Hedman</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Srinivasan</surname>
<given-names>P. P.</given-names>
</name>
<name>
<surname>Mildenhall</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Barron</surname>
<given-names>J. T.</given-names>
</name>
<name>
<surname>Debevec</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Baking neural radiance fields for real-time view synthesis</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>, <fpage>5875</fpage>&#x2013;<lpage>5884</lpage>. </citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Learning end-to-end lossy image compression: A benchmark</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source>. </citation>
</ref>
<ref id="B34">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Isik</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Choi</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zheng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Weissman</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Ermon</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Wong</surname>
<given-names>H.-S. P.</given-names>
</name>
<etal/>
</person-group> (<year>2021a</year>). <source>Neural network compression for noisy storage devices. NeurIPS deep learning through information geometry workshop</source>. <comment>
<italic>arXiv:2102.07725</italic>
</comment>. </citation>
</ref>
<ref id="B35">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Isik</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Hwang</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Johnston</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Toderici</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2021b</year>). <source>Lvac: Learned volumetric attribute compression for point clouds using coordinate based networks</source>. <comment>
<italic>arXiv preprint arXiv:2111.08988</italic>
</comment>. </citation>
</ref>
<ref id="B36">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Isik</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Neural 3d scene compression via model compression</article-title>,&#x201d; in <conf-name>IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) WiCV Workshop</conf-name>. <comment>
<italic>arXiv:2105.03120</italic>
</comment>. </citation>
</ref>
<ref id="B37">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Isik</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Weissman</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>No</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>An information-theoretic justification for model pruning</article-title>,&#x201d; in <conf-name>Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research</conf-name> (<publisher-loc>Valencia, Spain</publisher-loc>: <publisher-name>PMLR</publisher-name>), <fpage>3821</fpage>&#x2013;<lpage>3846</lpage>. </citation>
</ref>
<ref id="B38">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jang</surname>
<given-names>E. S.</given-names>
</name>
<name>
<surname>Preda</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Mammou</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Tourapis</surname>
<given-names>A. M.</given-names>
</name>
<name>
<surname>Kim</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Graziosi</surname>
<given-names>D. B.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Video-based point-cloud-compression standard in mpeg: From evidence collection to committee draft [standards in a nutshell]</article-title>. <source>IEEE Signal Process. Mag.</source> <volume>36</volume>, <fpage>118</fpage>&#x2013;<lpage>123</lpage>. <pub-id pub-id-type="doi">10.1109/MSP.2019.2900721</pub-id> </citation>
</ref>
<ref id="B39">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Knodt</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Baek</surname>
<given-names>S.-H.</given-names>
</name>
<name>
<surname>Heide</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2021</year>). <source>Neural ray-tracing: Learning surfaces and reflectance for relighting and view synthesis</source>. <comment>
<italic>arXiv preprint arXiv:2104.13562</italic>
</comment>. </citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krivoku&#x107;a</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Koroteev</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>A volumetric approach to point cloud compression&#x2014;Part II: Geometry compression</article-title>. <source>IEEE Trans. Image Process.</source> <volume>29</volume>, <fpage>2217</fpage>&#x2013;<lpage>2229</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2019.2957853</pub-id> </citation>
</ref>
<ref id="B41">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Krivoku&#x107;a</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Savill</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2018</year>). <source>
<italic>8i voxelized surface light field (8iVSLF) dataset</italic>. Input document m42914</source>. <publisher-loc>Ljubljana, Slovenia</publisher-loc>. <comment>ISO/IEC JTC1/SC29 WG11 (MPEG)</comment>. </citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Krivoku&#x107;a</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Miandji</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Guillemot</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Compression of plenoptic point cloud attributes using 6-d point clouds and 6-d transforms</article-title>. <source>IEEE Trans. Multimed.</source>, <fpage>1</fpage>. <pub-id pub-id-type="doi">10.1109/TMM.2021.3129341</pub-id> </citation>
</ref>
<ref id="B43">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kundu</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Genova</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Fathi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Pantofaru</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Guibas</surname>
<given-names>L.</given-names>
</name>
<etal/>
</person-group> (<year>2022a</year>). &#x201c;<article-title>Panoptic neural fields: A semantic object-aware neural scene representation</article-title>,&#x201d; in <source>CVPR</source>. </citation>
</ref>
<ref id="B44">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Kundu</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Genova</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Fathi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Pantofaru</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Guibas</surname>
<given-names>L. J.</given-names>
</name>
<etal/>
</person-group> (<year>2022b</year>). &#x201c;<article-title>Panoptic neural fields: A semantic object-aware neural scene representation</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name> (<publisher-loc>New Orleans, Louisiana, United States</publisher-loc>: <publisher-name>CVPR</publisher-name>), <fpage>12871</fpage>&#x2013;<lpage>12881</lpage>. </citation>
</ref>
<ref id="B45">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Lasserre</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Flynn</surname>
<given-names>D.</given-names>
</name>
</person-group> (<year>2019</year>). <source>On an improvement of RAHT to exploit attribute correlation. Input document m47378</source>. <publisher-loc>Geneva, CH</publisher-loc>. <comment>ISO/IEC MPEG JTC1/SC29/WG11</comment>. </citation>
</ref>
<ref id="B46">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Lazzarotto</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Alexiou</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Ebrahimi</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>On block prediction for learning-based point cloud compression</article-title>,&#x201d; in <conf-name>2021 IEEE International Conference on Image Processing</conf-name> (<publisher-loc>Anchorage, Alaska, United States</publisher-loc>: <publisher-name>ICIP</publisher-name>), <fpage>3378</fpage>&#x2013;<lpage>3382</lpage>. <pub-id pub-id-type="doi">10.1109/ICIP42928.2021.9506429</pub-id> </citation>
</ref>
<ref id="B47">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Luo</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Talebi</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Yang</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Elad</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Milanfar</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2020</year>). <source>The rate-distortion-accuracy tradeoff: Jpeg case study</source>. <comment>
<italic>arXiv preprint arXiv:2008.00605</italic>
</comment>. </citation>
</ref>
<ref id="B48">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Malvar</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2006</year>). &#x201c;<article-title>Adaptive run-length/Golomb-Rice encoding of quantized generalized Gaussian sources with unknown statistics</article-title>,&#x201d; in <conf-name>Data Compression Conference (DCC&#x2019;06)</conf-name>, <fpage>23</fpage>&#x2013;<lpage>32</lpage>. </citation>
</ref>
<ref id="B49">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Martel</surname>
<given-names>J. N.</given-names>
</name>
<name>
<surname>Lindell</surname>
<given-names>D. B.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>C. Z.</given-names>
</name>
<name>
<surname>Chan</surname>
<given-names>E. R.</given-names>
</name>
<name>
<surname>Monteiro</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wetzstein</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Acorn: Adaptive coordinate networks for neural scene representation</article-title>. <comment>
<italic>arXiv preprint arXiv:2105.02788</italic>
</comment>. </citation>
</ref>
<ref id="B50">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Mehta</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>Gharbi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Barnes</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Shechtman</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Ramamoorthi</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Chandraker</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Modulated periodic activations for generalizable local functional representations</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>, <fpage>14214</fpage>&#x2013;<lpage>14223</lpage>. </citation>
</ref>
<ref id="B51">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Meka</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Pandey</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Haene</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Orts-Escolano</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Barnum</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Davidson</surname>
<given-names>P.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Deep relightable textures - volumetric performance capture with neural rendering</article-title>. <source>ACM Trans. Graph.</source> <volume>39</volume>, <fpage>1</fpage>&#x2013;<lpage>21</lpage>. <pub-id pub-id-type="doi">10.1145/3414685.3417814</pub-id> </citation>
</ref>
<ref id="B52">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mekuria</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Blom</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Cesar</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Design, implementation, and evaluation of a point cloud codec for tele-immersive video</article-title>. <source>IEEE Trans. Circuits Syst. Video Technol.</source> <volume>27</volume>, <fpage>828</fpage>&#x2013;<lpage>842</lpage>. <pub-id pub-id-type="doi">10.1109/TCSVT.2016.2543039</pub-id> </citation>
</ref>
<ref id="B53">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mentzer</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Toderici</surname>
<given-names>G. D.</given-names>
</name>
<name>
<surname>Tschannen</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Agustsson</surname>
<given-names>E.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>High-fidelity generative image compression</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>33</volume>. </citation>
</ref>
<ref id="B54">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Mescheder</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Oechsle</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Niemeyer</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Nowozin</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Geiger</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Occupancy networks: Learning 3d reconstruction in function space</article-title>,&#x201d; in <conf-name>Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR)</conf-name>. </citation>
</ref>
<ref id="B55">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Milani</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>A syndrome-based autoencoder for point cloud geometry compression</article-title>,&#x201d; in <conf-name>2020 IEEE International Conference on Image Processing</conf-name> (<publisher-loc>Abu Dhabi, United Arab Emirates</publisher-loc>: <publisher-name>ICIP</publisher-name>), <fpage>2686</fpage>&#x2013;<lpage>2690</lpage>. <pub-id pub-id-type="doi">10.1109/ICIP40778.2020.9190647</pub-id> </citation>
</ref>
<ref id="B56">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Milani</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Adae: Adversarial distributed source autoencoder for point cloud compression</article-title>,&#x201d; in <conf-name>2021 IEEE International Conference on Image Processing</conf-name> (<publisher-loc>Anchorage, Alaska, United States</publisher-loc>: <publisher-name>ICIP</publisher-name>), <fpage>3078</fpage>&#x2013;<lpage>3082</lpage>. <pub-id pub-id-type="doi">10.1109/ICIP42928.2021.9506750</pub-id> </citation>
</ref>
<ref id="B57">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Mildenhall</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Srinivasan</surname>
<given-names>P. P.</given-names>
</name>
<name>
<surname>Tancik</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Barron</surname>
<given-names>J. T.</given-names>
</name>
<name>
<surname>Ramamoorthi</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). &#x201c;<article-title>Nerf: Representing scenes as neural radiance fields for view synthesis</article-title>,&#x201d; in <source>ECCV</source>. </citation>
</ref>
<ref id="B58">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Minnen</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ball&#xe9;</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Toderici</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Joint autoregressive and hierarchical priors for learned image compression</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>31</volume>. </citation>
</ref>
<ref id="B59">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Oktay</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Ball&#xe9;</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Shrivastava</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Scalable model compression by entropy penalized reparameterization</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name>. </citation>
</ref>
<ref id="B60">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Park</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Hwang</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2019a</year>). <article-title>Rate-utility optimized streaming of volumetric media for augmented reality</article-title>. <source>IEEE J. Emerg. Sel. Top. Circuits Syst.</source> <volume>9</volume>, <fpage>149</fpage>&#x2013;<lpage>162</lpage>. <pub-id pub-id-type="doi">10.1109/JETCAS.2019.2898622</pub-id> </citation>
</ref>
<ref id="B61">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Park</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Florence</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Straub</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Newcombe</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Lovegrove</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2019b</year>). &#x201c;<article-title>Deepsdf: Learning continuous signed distance functions for shape representation</article-title>,&#x201d; in <conf-name>2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <conf-loc>Long Beach, CA, USA</conf-loc>, <conf-date>15-20 June 2019</conf-date> (<publisher-name>IEEE</publisher-name>), <fpage>165</fpage>&#x2013;<lpage>174</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2019.00025</pub-id> </citation>
</ref>
<ref id="B62">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pateux</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Jung</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>An Excel add-in for computing Bjontegaard metric and its evolution</article-title>. <source>ITU-T SG16 Q.</source> <volume>6</volume>, <fpage>7</fpage>. </citation>
</ref>
<ref id="B63">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pavez</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>de Queiroz</surname>
<given-names>R. L.</given-names>
</name>
<name>
<surname>Ortega</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Dynamic polygon clouds: Representation and compression for VR/AR</article-title>. <source>APSIPA Trans. Signal Inf. Process.</source> <volume>7</volume>, <fpage>e15</fpage>. <pub-id pub-id-type="doi">10.1017/ATSIP.2018.15</pub-id> </citation>
</ref>
<ref id="B64">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Pavez</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Souto</surname>
<given-names>A. L.</given-names>
</name>
<name>
<surname>Queiroz</surname>
<given-names>R. L. D.</given-names>
</name>
<name>
<surname>Ortega</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Multi-resolution intra-predictive coding of 3d point cloud attributes</article-title>,&#x201d; in <conf-name>2021 IEEE International Conference on Image Processing (ICIP)</conf-name>, <fpage>3393</fpage>&#x2013;<lpage>3397</lpage>. <pub-id pub-id-type="doi">10.1109/ICIP42928.2021.9506641</pub-id> </citation>
</ref>
<ref id="B65">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Pierdicca</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Paolanti</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Matrone</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Martini</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Morbidoni</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Malinverni</surname>
<given-names>E. S.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <article-title>Point cloud semantic segmentation using a deep learning framework for cultural heritage</article-title>. <source>Remote Sens.</source> <volume>12</volume>, <fpage>1005</fpage>. <pub-id pub-id-type="doi">10.3390/rs12061005</pub-id> </citation>
</ref>
<ref id="B66">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Quach</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Valenzise</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Dufaux</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2020a</year>). &#x201c;<article-title>Folding-based compression of point cloud attributes</article-title>,&#x201d; in <conf-name>2020 IEEE International Conference on Image Processing (ICIP)</conf-name>, <fpage>3309</fpage>&#x2013;<lpage>3313</lpage>. <pub-id pub-id-type="doi">10.1109/ICIP40778.2020.9191180</pub-id> </citation>
</ref>
<ref id="B67">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Quach</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Valenzise</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Dufaux</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2020b</year>). &#x201c;<article-title>Improved deep point cloud geometry compression</article-title>,&#x201d; in <conf-name>IEEE 22nd International Workshop on Multimedia Signal Processing (MMSP)</conf-name>, <fpage>1</fpage>&#x2013;<lpage>6</lpage>. </citation>
</ref>
<ref id="B68">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Quach</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Valenzise</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Dufaux</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Learning convolutional transforms for lossy point cloud geometry compression</article-title>,&#x201d; in <conf-name>2019 IEEE International Conference on Image Processing (ICIP)</conf-name>. <pub-id pub-id-type="doi">10.1109/ICIP.2019.8803413</pub-id> </citation>
</ref>
<ref id="B69">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Reiser</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Peng</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Liao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Geiger</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Kilonerf: Speeding up neural radiance fields with thousands of tiny mlps</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>, <fpage>14335</fpage>&#x2013;<lpage>14345</lpage>. </citation>
</ref>
<ref id="B70">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Rematas</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Srinivasan</surname>
<given-names>P. P.</given-names>
</name>
<name>
<surname>Barron</surname>
<given-names>J. T.</given-names>
</name>
<name>
<surname>Tagliasacchi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Funkhouser</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <source>Urban radiance fields</source>. <publisher-loc>New Orleans, Louisiana, United States</publisher-loc>: <publisher-name>CVPR</publisher-name>. </citation>
</ref>
<ref id="B71">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Sandri</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>de Queiroz</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Compression of plenoptic point clouds using the region-adaptive hierarchical transform</article-title>,&#x201d; in <conf-name>25th IEEE International Conference on Image Processing</conf-name> (<publisher-loc>Athens, Greece</publisher-loc>: <publisher-name>ICIP</publisher-name>), <fpage>1153</fpage>&#x2013;<lpage>1157</lpage>. </citation>
</ref>
<ref id="B72">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sandri</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>de Queiroz</surname>
<given-names>R. L.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Compression of plenoptic point clouds</article-title>. <source>IEEE Trans. Image Process.</source> <volume>28</volume>, <fpage>1419</fpage>&#x2013;<lpage>1427</lpage>. <pub-id pub-id-type="doi">10.1109/tip.2018.2877486</pub-id> </citation>
</ref>
<ref id="B73">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Sandri</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Figueiredo</surname>
<given-names>V. F.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>de Queiroz</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2019a</year>). &#x201c;<article-title>Point cloud compression incorporating region of interest coding</article-title>,&#x201d; in <conf-name>2019 IEEE International Conference on Image Processing (ICIP)</conf-name>, <fpage>4370</fpage>&#x2013;<lpage>4374</lpage>. <pub-id pub-id-type="doi">10.1109/ICIP.2019.8803553</pub-id> </citation>
</ref>
<ref id="B74">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sandri</surname>
<given-names>G. P.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Krivoku&#x107;a</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>de Queiroz</surname>
<given-names>R. L.</given-names>
</name>
</person-group> (<year>2019b</year>). <article-title>Integer alternative for the region-adaptive hierarchical transform</article-title>. <source>IEEE Signal Process. Lett.</source> <volume>26</volume>, <fpage>1369</fpage>&#x2013;<lpage>1372</lpage>. <pub-id pub-id-type="doi">10.1109/LSP.2019.2931425</pub-id> </citation>
</ref>
<ref id="B75">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schwarz</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Preda</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Baroncini</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Budagavi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Cesar</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Emerging MPEG standards for point cloud compression</article-title>. <source>IEEE J. Emerg. Sel. Top. Circuits Syst.</source> <volume>9</volume>, <fpage>133</fpage>&#x2013;<lpage>148</lpage>. <pub-id pub-id-type="doi">10.1109/jetcas.2018.2885981</pub-id> </citation>
</ref>
<ref id="B76">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sheng</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Xiong</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>F.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Deep-pcac: An end-to-end deep lossy compression framework for point cloud attributes</article-title>. <source>IEEE Trans. Multimed.</source> <volume>24</volume>, <fpage>2617</fpage>&#x2013;<lpage>2632</lpage>. <pub-id pub-id-type="doi">10.1109/TMM.2021.3086711</pub-id> </citation>
</ref>
<ref id="B77">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Sitzmann</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Chan</surname>
<given-names>E. R.</given-names>
</name>
<name>
<surname>Tucker</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Snavely</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Wetzstein</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Metasdf: Meta-learning signed distance functions</source>. </citation>
</ref>
<ref id="B78">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Srinivasan</surname>
<given-names>P. P.</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Tancik</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Mildenhall</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Barron</surname>
<given-names>J. T.</given-names>
</name>
</person-group> (<year>2021</year>). &#x201c;<article-title>Nerv: Neural reflectance and visibility fields for relighting and view synthesis</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <fpage>7495</fpage>&#x2013;<lpage>7504</lpage>. </citation>
</ref>
<ref id="B79">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Stelzner</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Kersting</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Kosiorek</surname>
<given-names>A. R.</given-names>
</name>
</person-group> (<year>2021</year>). <source>Decomposing 3d scenes into objects via unsupervised volume segmentation</source>. <comment>
<italic>arXiv preprint arXiv:2104.01148</italic>
</comment>. </citation>
</ref>
<ref id="B80">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Stock</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Joulin</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Gribonval</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Graham</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>J&#xe9;gou</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>And the bit goes down: Revisiting the quantization of neural networks</article-title>,&#x201d; in <conf-name>International Conference on Learning Representations</conf-name>. </citation>
</ref>
<ref id="B81">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Kretzschmar</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Dotiwalla</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Chouard</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Patnaik</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Tsui</surname>
<given-names>P.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). &#x201c;<article-title>Scalability in perception for autonomous driving: Waymo open dataset</article-title>,&#x201d; in <conf-name>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name> (<publisher-loc>Seattle, WA, United States</publisher-loc>: <publisher-name>CVPR</publisher-name>), <fpage>2443</fpage>&#x2013;<lpage>2451</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00252</pub-id> </citation>
</ref>
<ref id="B82">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Sun</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Choi</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>C.-Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Venkataramani</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Srinivasan</surname>
<given-names>V. V.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Hybrid 8-bit floating point (hfp8) training and inference for deep neural networks</article-title>. <source>Adv. Neural Inf. Process. Syst.</source> <volume>32</volume>, <fpage>4900</fpage>&#x2013;<lpage>4909</lpage>. </citation>
</ref>
<ref id="B83">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Takikawa</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Evans</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Tremblay</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>M&#xfc;ller</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>McGuire</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Jacobson</surname>
<given-names>A.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). &#x201c;<article-title>Variable bitrate neural fields</article-title>,&#x201d; in <conf-name>SIGGRAPH &#x2019;22: Special Interest Group on Computer Graphics and Interactive Techniques Conference Proceedings</conf-name> (<publisher-loc>New York, NY, United States</publisher-loc>: <publisher-name>Association for Computing Machinery</publisher-name>). <pub-id pub-id-type="doi">10.1145/3528233.3530727</pub-id> </citation>
</ref>
<ref id="B84">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Takikawa</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Litalien</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Yin</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Kreis</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Loop</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Nowrouzezahrai</surname>
<given-names>D.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). &#x201c;<article-title>Neural geometric level of detail: Real-time rendering with implicit 3d shapes</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <fpage>11358</fpage>&#x2013;<lpage>11367</lpage>. </citation>
</ref>
<ref id="B85">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Tancik</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Casser</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Yan</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Pradhan</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Mildenhall</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Srinivasan</surname>
<given-names>P.</given-names>
</name>
<etal/>
</person-group> (<year>2022</year>). <source>Block-NeRF: Scalable large scene neural view synthesis</source>. <comment>
<italic>arXiv</italic>
</comment>. </citation>
</ref>
<ref id="B86">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tancik</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Mildenhall</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>T.</given-names>
</name>
<name>
<surname>Schmidt</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Hedman</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Barron</surname>
<given-names>J. T.</given-names>
</name>
<etal/>
</person-group> (<year>2021</year>). <article-title>Learned initializations for optimizing coordinate-based neural representations</article-title>. <comment>
<italic>arXiv</italic>
</comment>. <pub-id pub-id-type="doi">10.48550/arXiv.2012.02189</pub-id> </citation>
</ref>
<ref id="B87">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Tang</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Singh</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>H&#xe4;ne</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Dou</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Fanello</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). &#x201c;<article-title>Deep implicit volume compression</article-title>,&#x201d; in <conf-name>2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>. <pub-id pub-id-type="doi">10.1109/CVPR42600.2020.00137</pub-id> </citation>
</ref>
<ref id="B88">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Thanou</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Frossard</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Graph-based compression of dynamic 3d point cloud sequences</article-title>. <source>IEEE Trans. Image Process.</source> <volume>25</volume>, <fpage>1765</fpage>&#x2013;<lpage>1778</lpage>. <pub-id pub-id-type="doi">10.1109/tip.2016.2529506</pub-id> </citation>
</ref>
<ref id="B89">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Toderici</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>O&#x2019;Malley</surname>
<given-names>S. M.</given-names>
</name>
<name>
<surname>Hwang</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Vincent</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Minnen</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Baluja</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2016</year>). &#x201c;<article-title>Variable rate image compression with recurrent neural networks</article-title>,&#x201d; in <conf-name>4th International Conference on Learning Representations (ICLR)</conf-name>. </citation>
</ref>
<ref id="B90">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Toderici</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Vincent</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Johnston</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Hwang</surname>
<given-names>S. J.</given-names>
</name>
<name>
<surname>Minnen</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Shor</surname>
<given-names>J.</given-names>
</name>
<etal/>
</person-group> (<year>2017</year>). &#x201c;<article-title>Full resolution image compression with recurrent neural networks</article-title>,&#x201d; in <conf-name>2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>. <pub-id pub-id-type="doi">10.1109/CVPR.2017.577</pub-id> </citation>
</ref>
<ref id="B91">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Turki</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ramanan</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Satyanarayanan</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2022</year>). &#x201c;<article-title>Mega-nerf: Scalable construction of large-scale nerfs for virtual fly-throughs</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name> (<publisher-loc>New Orleans, Louisiana, United States</publisher-loc>: <publisher-name>CVPR</publisher-name>), <fpage>12922</fpage>&#x2013;<lpage>12931</lpage>. </citation>
</ref>
<ref id="B92">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Han</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Haq: Hardware-aware automated quantization with mixed precision</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</conf-name>, <fpage>8612</fpage>&#x2013;<lpage>8620</lpage>. </citation>
</ref>
<ref id="B93">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Choi</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Brand</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Chen</surname>
<given-names>C.-Y.</given-names>
</name>
<name>
<surname>Gopalakrishnan</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Training deep neural networks with 8-bit floating point numbers</article-title>,&#x201d; in <conf-name>Proceedings of the 32nd International Conference on Neural Information Processing Systems</conf-name>, <fpage>7686</fpage>&#x2013;<lpage>7695</lpage>. </citation>
</ref>
<ref id="B94">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Xu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Lin</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Xiong</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Deep neural network compression with single and multiple level quantization</article-title>,&#x201d; in <conf-name>Proceedings of the AAAI Conference on Artificial Intelligence</conf-name>. <pub-id pub-id-type="doi">10.1609/aaai.v32i1.11663</pub-id> </citation>
</ref>
<ref id="B95">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Yan</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Shao</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>T. H.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2019</year>). <source>Deep autoencoder-based lossy geometry compression for point clouds</source>. <comment>CoRR abs/1905.03691</comment>. </citation>
</ref>
<ref id="B96">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Tancik</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ng</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Kanazawa</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2021a</year>). &#x201c;<article-title>Plenoctrees for real-time rendering of neural radiance fields</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE/CVF International Conference on Computer Vision</conf-name>, <fpage>5752</fpage>&#x2013;<lpage>5761</lpage>. </citation>
</ref>
<ref id="B97">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Yu</surname>
<given-names>H.-X.</given-names>
</name>
<name>
<surname>Guibas</surname>
<given-names>L. J.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2021b</year>). <source>Unsupervised discovery of object radiance fields</source>. <comment>
<italic>arXiv preprint arXiv:2107.07905</italic>
</comment>. </citation>
</ref>
<ref id="B98">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Flor&#xea;ncio</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Loop</surname>
<given-names>C.</given-names>
</name>
</person-group> (<year>2014</year>). &#x201c;<article-title>Point cloud attribute compression with graph transform</article-title>,&#x201d; in <conf-name>2014 IEEE International Conference on Image Processing (ICIP)</conf-name>. </citation>
</ref>
<ref id="B99">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). &#x201c;<article-title>A framework for surface light field compression</article-title>,&#x201d; in <conf-name>IEEE International Conference on Image Processing (ICIP)</conf-name>, <fpage>2595</fpage>&#x2013;<lpage>2599</lpage>. </citation>
</ref>
<ref id="B100">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Chou</surname>
<given-names>P. A.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Tang</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>S.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <article-title>Surface light field compression using a point cloud codec</article-title>. <source>IEEE J. Emerg. Sel. Top. Circuits Syst.</source> <volume>9</volume>, <fpage>163</fpage>&#x2013;<lpage>176</lpage>. <pub-id pub-id-type="doi">10.1109/jetcas.2018.2883479</pub-id> </citation>
</ref>
<ref id="B101">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Srinivasan</surname>
<given-names>P. P.</given-names>
</name>
<name>
<surname>Deng</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Debevec</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Freeman</surname>
<given-names>W. T.</given-names>
</name>
<name>
<surname>Barron</surname>
<given-names>J. T.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Nerfactor: Neural factorization of shape and reflectance under an unknown illumination</article-title>. <source>ACM Trans. Graph.</source> <volume>40</volume>, <fpage>1</fpage>&#x2013;<lpage>18</lpage>. <pub-id pub-id-type="doi">10.1145/3478513.3480496</pub-id> </citation>
</ref>
</ref-list>
</back>
</article>