<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Archiving and Interchange DTD v2.3 20070202//EN" "archivearticle.dtd">
<article xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" article-type="methods-article">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Neurorobot.</journal-id>
<journal-title>Frontiers in Neurorobotics</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Neurorobot.</abbrev-journal-title>
<issn pub-type="epub">1662-5218</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fnbot.2023.1277160</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Neuroscience</subject>
<subj-group>
<subject>Methods</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>E2VIDX: improved bridge between conventional vision and bionic vision</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name><surname>Hou</surname> <given-names>Xujia</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/1530547/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name><surname>Zhang</surname> <given-names>Feihu</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<xref ref-type="corresp" rid="c001"><sup>&#x0002A;</sup></xref>
<uri xlink:href="http://loop.frontiersin.org/people/384178/overview"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Gulati</surname> <given-names>Dhiraj</given-names></name>
<xref ref-type="aff" rid="aff2"><sup>2</sup></xref>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Tan</surname> <given-names>Tingfeng</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
<contrib contrib-type="author">
<name><surname>Zhang</surname> <given-names>Wei</given-names></name>
<xref ref-type="aff" rid="aff1"><sup>1</sup></xref>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
</contrib>
</contrib-group>
<aff id="aff1"><sup>1</sup><institution>School of Marine Science and Technology, Northwestern Polytechnical University</institution>, <addr-line>Xi&#x00027;an</addr-line>, <country>China</country></aff>
<aff id="aff2"><sup>2</sup><institution>Siemens EDA</institution>, <addr-line>Munich</addr-line>, <country>Germany</country></aff>
<author-notes>
<fn fn-type="edited-by"><p>Edited by: Alois C. Knoll, Technical University of Munich, Germany</p></fn>
<fn fn-type="edited-by"><p>Reviewed by: Jingang Shi, Xi&#x00027;an Jiaotong University, China; Zhe Zhang, Taiyuan University of Technology, China</p></fn>
<corresp id="c001">&#x0002A;Correspondence: Feihu Zhang <email>feihu.zhang&#x00040;nwpu.edu.cn</email></corresp>
</author-notes>
<pub-date pub-type="epub">
<day>26</day>
<month>10</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>17</volume>
<elocation-id>1277160</elocation-id>
<history>
<date date-type="received">
<day>14</day>
<month>08</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>05</day>
<month>10</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#x000A9; 2023 Hou, Zhang, Gulati, Tan and Zhang.</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Hou, Zhang, Gulati, Tan and Zhang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/"><p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p></license></permissions>
<abstract>
<p>Common RGBD, CMOS, and CCD-based cameras produce motion blur and incorrect exposure under high-speed motion and improper lighting conditions. Event cameras, developed according to the bionic principle, offer low latency, a high dynamic range, and freedom from motion blur. However, due to their unique data representation, they encounter significant obstacles in practical applications. Image reconstruction algorithms for event cameras solve this problem by converting a series of &#x0201C;events&#x0201D; into common frames so that existing vision algorithms can be applied. Owing to the rapid development of neural networks, this field has made significant breakthroughs in the past few years. Based on the most popular Events-to-Video (E2VID) method, this study designs a new network called E2VIDX. The proposed network includes group convolution and sub-pixel convolution, which not only achieve better feature fusion but also reduce the network model size by 25%. Furthermore, we propose a new loss function with two parts: the first part measures the high-level features and the second part measures the low-level features of the reconstructed image. The experimental results clearly outperform the state-of-the-art method: compared with the original method, Structural Similarity (SSIM) increases by 1.3%, Learned Perceptual Image Patch Similarity (LPIPS) decreases by 1.7%, Mean Squared Error (MSE) decreases by 2.5%, and the network runs faster on both GPU and CPU. Additionally, we evaluate the results of E2VIDX in image classification, object detection, and instance segmentation. The experiments show that conversions using our method allow event cameras to directly apply existing vision algorithms in most scenarios.</p></abstract>
<kwd-group>
<kwd>image reconstruction</kwd>
<kwd>deep learning</kwd>
<kwd>dynamic vision sensor</kwd>
<kwd>event camera</kwd>
<kwd>image classification</kwd>
<kwd>object detection</kwd>
<kwd>instance segmentation</kwd>
</kwd-group>
<counts>
<fig-count count="11"/>
<table-count count="4"/>
<equation-count count="6"/>
<ref-count count="48"/>
<page-count count="14"/>
<word-count count="7387"/>
</counts>
</article-meta>
</front>
<body>
<sec sec-type="intro" id="s1">
<title>1. Introduction</title>
<p>Robots have become indispensable in modern society, capable of replacing manual labor in repetitive and hazardous tasks, thereby enhancing production efficiency and quality while reducing production costs (Jing et al., <xref ref-type="bibr" rid="B11">2022</xref>). Various research studies in the field of robotics are continuously carried out by Bing et al. (<xref ref-type="bibr" rid="B4">2022</xref>, <xref ref-type="bibr" rid="B2">2023a</xref>,<xref ref-type="bibr" rid="B3">b</xref>). In the realm of robotics, computer vision plays a pivotal role in tasks such as robot navigation, perception, and decision-making. The most commonly used camera sensors include CMOS (Sukhavasi et al., <xref ref-type="bibr" rid="B39">2021</xref>), CCD (Adam et al., <xref ref-type="bibr" rid="B1">2019</xref>), and RGBD (Liu et al., <xref ref-type="bibr" rid="B17">2022</xref>) cameras, all of which share a standard parameter: frame rate. These cameras capture images at consistent time intervals, synchronizing their data acquisition. However, they often yield suboptimal results in high-speed motion scenes or environments with inadequate lighting due to their imaging principles. To solve this problem, researchers (Posch et al., <xref ref-type="bibr" rid="B24">2014</xref>) have developed event cameras, sometimes called dynamic vision sensors (DVS). Instead of capturing images at a fixed frame rate, event cameras capture &#x0201C;events&#x0201D;, which are triggered when the cumulative brightness change of a pixel reaches a certain threshold. An event has three elements: timestamp, pixel coordinate, and polarity. An event therefore expresses when (i.e., at which time) and at which pixel an increase or decrease in brightness occurred. The event camera&#x00027;s imaging principle guarantees an output whenever the brightness change exceeds the threshold, and it requires little bandwidth. In other words, if objects move very fast in the camera&#x00027;s field of view, many events are generated per second; if there is no object motion or brightness change, no events are generated. At the same time, since the event camera captures brightness changes, it performs equally well in dark and intensely lit scenes. Therefore, compared with regular frame-based cameras, event cameras have the advantages of low latency, high dynamic range (140 vs. 60 dB), and low power consumption, and they are not affected by motion blur (Gallego et al., <xref ref-type="bibr" rid="B8">2020</xref>).</p>
<p>Although an event camera has been successfully used in SLAM (Vidal et al., <xref ref-type="bibr" rid="B40">2018</xref>), human detection (Xu et al., <xref ref-type="bibr" rid="B44">2020</xref>), and other fields (Zhou et al., <xref ref-type="bibr" rid="B48">2018</xref>; Perot et al., <xref ref-type="bibr" rid="B22">2020</xref>), the output format of an event camera is far from the familiar camera output format. Therefore, it does not easily lend itself to practical applications. Compared with events alone, reconstructing images from events (as shown in <xref ref-type="fig" rid="F1">Figure 1</xref>) provides a compact representation of the latest available data and enables the application of traditional computer vision to event cameras. In contrast to raw events, images possess a natural interpretability for humans and encompass a broader spectrum of information. Additionally, the reconstructed image offers a synthesis of several advantageous attributes, including high temporal resolution, spatial interpretability, and robust resistance to interference. Consequently, traditional vision algorithms can be seamlessly employed with reconstructed images, eliminating the necessity for the redesign of additional algorithms when integrating event cameras into applications.</p>
<fig id="F1" position="float">
<label>Figure 1</label>
<caption><p>A schematic of an image generated from the event stream (shot in high speed motion scene), with blue for negative polarity and red for positive polarity.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1277160-g0001.tif"/>
</fig>
<p>In the early days of this field, researchers derived the reconstruction formula by modeling the imaging principle of event cameras (Brandli et al., <xref ref-type="bibr" rid="B6">2014</xref>; Munda et al., <xref ref-type="bibr" rid="B19">2018</xref>; Scheerlinck et al., <xref ref-type="bibr" rid="B34">2018</xref>). However, due to sensor noise, the reconstruction was far from the ground truth images. With the advent of powerful deep learning methods in recent years, reconstructions have improved and the results converge toward the ground truth (Rebecq et al., <xref ref-type="bibr" rid="B26">2019a</xref>,<xref ref-type="bibr" rid="B27">b</xref>; Wang et al., <xref ref-type="bibr" rid="B41">2019</xref>; Scheerlinck et al., <xref ref-type="bibr" rid="B35">2020</xref>; Cadena et al., <xref ref-type="bibr" rid="B7">2021</xref>). While advancements in reconstruction techniques have led to improvements, deep neural networks often require substantial time and computational resources. Consequently, their application to edge or mobile devices is constrained. Furthermore, the network architectures of some current methods do not readily scale down to these resource-constrained devices. To address this challenge, this study proposes E2VIDX, a faster and stronger neural network for image reconstruction. By changing the feature fusion, the network is further optimized using group convolution and sub-pixel convolution. Simultaneously, this study proposes a simplified loss function to counter the excessive number of parameters. Furthermore, the effectiveness of the proposed E2VIDX is demonstrated by applying the reconstructed images to various high-level vision tasks, including image classification, object detection, and instance segmentation. These applications illustrate the practical utility of E2VIDX in real-world scenarios.</p>
<p>In summary, the main contributions of this study are as follows:</p>
<list list-type="bullet">
<list-item><p>This study proposes an improved event reconstruction method, E2VIDX. Compared with the state-of-the-art, E2VIDX not only performs better on the three evaluation indicators but also has a shorter reconstruction time.</p></list-item>
<list-item><p>An ablation study is presented to prove the effectiveness of the proposed modules.</p></list-item>
<list-item><p>High-level vision tasks are designed to qualitatively and quantitatively evaluate the reconstructed images obtained using E2VIDX.</p></list-item>
</list></sec>
<sec id="s2">
<title>2. Related work</title>
<p>In the domain of event processing, the mainstream image reconstruction algorithms can be divided into two types, namely, asynchronous event processing and synchronous batch processing.</p>
<sec>
<title>2.1. Asynchronous event processing</title>
<p>The idea is to exploit the sparsity of events: as soon as an event arrives, the new information is integrated into the existing state. Since a single event carries very little information, one focus of asynchronous algorithm research is how to fuse the existing information with the current event, which also requires that the algorithm be given an image or wait long enough during initialization. Brandli et al. (<xref ref-type="bibr" rid="B6">2014</xref>) first proposed using event streams for image reconstruction. They used the complementarity of regular cameras and event cameras to insert events marked with thresholds between two consecutive frames, with the threshold determined from the difference between the two frames and the events accumulated between them. This method has low computational overhead and can run in real time using only a CPU, but it requires frame-based images that are as dense as possible. Reinbacher (Munda et al., <xref ref-type="bibr" rid="B19">2018</xref>) treats image reconstruction as an energy minimization problem, models the noise with the generalized Kullback&#x02013;Leibler divergence to prevent noise accumulation, defines the optimization problem over an event stream containing timestamps, and finally solves it with a variational method. Scheerlinck et al. (<xref ref-type="bibr" rid="B34">2018</xref>) proposed using complementary filters to reconstruct intensity images from asynchronous events, with an option to incorporate information from image frames. Complementary filters perform temporal smoothing but not spatial smoothing, which dramatically improves computational efficiency and significantly increases reconstruction speed.</p>
<p>Although the above methods, based on mathematical and physical modeling, are theoretically reliable, the cumulative error of the reconstructed image grows with time because the sensor noise is affected by temperature, humidity, and electrical devices. Another non-negligible problem is that the contrast threshold of the event camera differs at each pixel and changes over time. Therefore, methods based on asynchronous event processing are limited in their usage scenarios.</p>
<sec>
<title>2.2. Synchronous batch processing</title>
<p>Batch image reconstruction aims to reconstruct an image or video from a batch of events rather than a single event, primarily using popular machine learning methods for modeling. To determine how the event stream is fed into the network, Wang et al. (<xref ref-type="bibr" rid="B41">2019</xref>) proposed two batch processing methods, namely, time-based and event-count-based input. They successfully used a Conditional Generative Adversarial Network (CGAN) to reconstruct images with high dynamic range and no motion blur. E2VID, proposed by Rebecq et al. (<xref ref-type="bibr" rid="B26">2019a</xref>,<xref ref-type="bibr" rid="B27">b</xref>), is the first method to combine a convolutional neural network (CNN) and a recurrent neural network (RNN) for image reconstruction. It achieves end-to-end video reconstruction with supervised learning from simulated event data, producing images with high temporal resolution even in high-speed motion scenes. Considering the low latency of events, Scheerlinck et al. (<xref ref-type="bibr" rid="B35">2020</xref>) modified E2VID by replacing the original U-Net (Ronneberger et al., <xref ref-type="bibr" rid="B32">2015</xref>) structure with a stacked structure, obtaining FireNet, which has fewer parameters and runs faster with almost the same accuracy. Because E2VID uses a recurrent neural network to fuse previous information, several frames are needed for initialization at the beginning of reconstruction. SPADE-E2VID (Cadena et al., <xref ref-type="bibr" rid="B7">2021</xref>) adds a SPADE module (Park et al., <xref ref-type="bibr" rid="B21">2019</xref>) to solve this problem, significantly reducing the initialization time. At the same time, a loss function without temporal consistency is proposed to speed up training.</p>
<p>Image reconstruction based on deep learning has made significant progress. However, considering the characteristics of the event camera itself, the designed neural network should consider both running time and reconstruction accuracy.</p></sec></sec>
<sec id="s3">
<title>3. E2VIDX method</title>
<p>This section outlines the specific implementation of E2VIDX. To feed a stream of events into a neural network, the data stream must first be encoded. The encoded tensors are then fed into E2VIDX, a convolutional recurrent neural network, for training. To fit the model to the training dataset efficiently, a convenient and efficient loss function is also designed.</p>
<sec>
<title>3.1. Event encoding</title>
<p>The event camera output is in the form of event streams, as shown in Equation 1.</p>
<disp-formula id="E1"><label>(1)</label><mml:math id="M1"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:msub><mml:mrow><mml:mi>e</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mi>p</mml:mi><mml:mo>,</mml:mo><mml:mi>t</mml:mi></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msubsup><mml:mi>c</mml:mi><mml:mi>&#x003B4;</mml:mi><mml:mrow><mml:mo stretchy="true">(</mml:mo><mml:mrow><mml:mi>t</mml:mi><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mi>p</mml:mi></mml:mrow></mml:msubsup></mml:mrow><mml:mo stretchy="true">)</mml:mo></mml:mrow><mml:mo>,</mml:mo><mml:mi>i</mml:mi><mml:mo>&#x02208;</mml:mo><mml:mn>1</mml:mn><mml:mo>,</mml:mo><mml:mn>2</mml:mn><mml:mo>,</mml:mo><mml:mn>3</mml:mn><mml:mo>&#x02026;</mml:mo></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>Here, we denote &#x003C3;&#x02208;{&#x02212;1, 1} as polarity, <italic>p</italic> &#x0003D; (<italic>x, y</italic>) as event coordinates, <italic>c</italic> as the contrast threshold that triggers an event, and &#x003B4; as the Dirac delta function. To enable the convolutional recurrent neural network to process the event stream, it is essential to encode the event stream into a fixed-size spatiotemporal tensor. The event stream is partitioned into groups based on their timestamp order, with each group containing <italic>N</italic> events, denoted as &#x003B5;<sub><italic>k</italic></sub> &#x0003D; {<italic>e</italic><sub><italic>i</italic></sub>}, <italic>i</italic>&#x02208;[0, <italic>N</italic>&#x02212;1]. This encoding transforms the event stream into a spatiotemporal stereo tensor, which serves as the input. For each event group denoted as &#x003B5;<sub><italic>k</italic></sub>, we quantize the time interval as <inline-formula><mml:math id="M2"><mml:mi>&#x00394;</mml:mi><mml:mi>T</mml:mi><mml:mo>=</mml:mo><mml:msubsup><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>N</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow><mml:mrow><mml:mi>k</mml:mi></mml:mrow></mml:msubsup></mml:math></inline-formula> and distribute it across <italic>B</italic> time channels. Within each event <italic>e</italic><sub><italic>i</italic></sub>, its polarity is associated with the same spatial location and its two closest time channels in the group <italic>E</italic><sub><italic>k</italic></sub>, as shown in Equation 2.</p>
<disp-formula id="E2"><label>(2)</label><mml:math id="M3"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mstyle mathvariant="bold"><mml:mtext>E</mml:mtext></mml:mstyle><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle="true"><mml:munder class="msub"><mml:mrow><mml:mo>&#x02211;</mml:mo></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>x</mml:mi></mml:mrow><mml:mrow><mml:mi>l</mml:mi></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>=</mml:mo><mml:msub><mml:mrow><mml:mi>y</mml:mi></mml:mrow><mml:mrow><mml:mi>m</mml:mi></mml:mrow></mml:msub></mml:mrow></mml:munder></mml:mstyle><mml:msub><mml:mrow><mml:mi>p</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo class="qopname">max</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:mn>0</mml:mn><mml:mo>,</mml:mo><mml:mn>1</mml:mn><mml:mo>-</mml:mo><mml:mo>|</mml:mo><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>n</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msubsup><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>|</mml:mo></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <inline-formula><mml:math id="M4"><mml:msubsup><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow><mml:mrow><mml:mo>*</mml:mo></mml:mrow></mml:msubsup><mml:mo>&#x0225C;</mml:mo><mml:mfrac><mml:mrow><mml:mi>B</mml:mi><mml:mo>-</mml:mo><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:mi>&#x00394;</mml:mi><mml:mi>T</mml:mi></mml:mrow></mml:mfrac><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mi>i</mml:mi></mml:mrow></mml:msub><mml:mo>-</mml:mo><mml:msub><mml:mrow><mml:mi>t</mml:mi></mml:mrow><mml:mrow><mml:mn>0</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:math></inline-formula> is the benchmark time after standardization. Following other methods (Wang et al., <xref ref-type="bibr" rid="B41">2019</xref>; Rebecq et al., <xref ref-type="bibr" rid="B26">2019a</xref>,<xref ref-type="bibr" rid="B27">b</xref>; Scheerlinck et al., <xref ref-type="bibr" rid="B35">2020</xref>; Cadena et al., <xref ref-type="bibr" rid="B7">2021</xref>), we also set <italic>B</italic> to 5 in our experiments.</p></sec>
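As a concrete illustration of this encoding, the following is a minimal NumPy sketch of Equation 2; the sensor resolution, variable names, and the assumption of timestamp-sorted events are ours, not the authors' implementation:

```python
import numpy as np

def events_to_voxel_grid(xs, ys, ts, ps, B=5, height=180, width=240):
    """Encode one group of events into a B-channel spatiotemporal tensor.

    Implements Equation 2: each event spreads its polarity over its two
    nearest time channels with bilinear weights max(0, 1 - |t_n - t_i*|).
    Assumes ts is sorted; height/width are placeholder sensor dimensions.
    """
    grid = np.zeros((B, height, width), dtype=np.float32)
    dT = max(ts[-1] - ts[0], 1e-9)            # Delta T, guarded against zero
    t_star = (B - 1) / dT * (ts - ts[0])      # normalized benchmark time t_i*
    left = np.floor(t_star).astype(int)       # lower of the two time channels
    for n in (left, left + 1):                # two closest time channels
        w = np.maximum(0.0, 1.0 - np.abs(n - t_star))
        valid = (n >= 0) & (n < B)
        np.add.at(grid, (n[valid], ys[valid], xs[valid]), ps[valid] * w[valid])
    return grid
```

Because the two bilinear weights of each event sum to 1, the tensor preserves the total signed event count while retaining temporal ordering across the <italic>B</italic> channels.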
<sec>
<title>3.2. Network design</title>
<p>The overall structure of E2VIDX is similar to U-Net and is divided into head, body, and prediction layers, as shown in <xref ref-type="fig" rid="F2">Figure 2</xref>. The body comprises a downsampling part and an upsampling part. Unlike E2VID, we add a group convolution branch to each downsampling layer, which helps feature fusion during upsampling. The original ResBlock is replaced by group convolution, and, by observing the output of each layer during training, part of the input to the output layer is modified for better fusion of low-level and high-level features. Meanwhile, learnable sub-pixel convolution is used in the upsampling part.</p>
<fig id="F2" position="float">
<label>Figure 2</label>
<caption><p>The network structure of E2VIDX. The U-Net network structure is used, which is divided into the downsampling part and upsampling part, and new feature fusion is added at the same time. In the downsampling part, ConvLSTM is used to fuse the previous reconstruction state information, and subpixel convolution is used to avoid checkerboard artifacts in the upsampling part.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1277160-g0002.tif"/>
</fig>
<sec>
<title>3.2.1. Head</title>
<p>After event encoding, the neural network gets fixed-size tensors with five channels as input. The primary purpose of the head layer is to expand the number of channels to facilitate subsequent feature extraction. The kernel size used in this layer is 3.</p></sec>
<sec>
<title>3.2.2. Body</title>
<p>The body part is the central part of the whole network, completing feature extraction and fusion. The downsampling part consists of three recurrent convolution modules with ConvLSTM (Shi et al., <xref ref-type="bibr" rid="B37">2015</xref>). Each convolutional block consists of a CBR (Conv&#x0002B;BatchNorm&#x0002B;ReLU) and a ConvLSTM module. The purpose of the ConvLSTM is to preserve the previous state information, which is combined with the current input to update the current state. Therefore, each convolutional block feeds its input into the CBR and then passes the output as a partial input to the ConvLSTM. The convolution kernel size in each convolution block is 5, the stride and padding are 2, and the number of output channels is twice that of the input. Therefore, each convolution block halves the width and height of the tensor and doubles the number of channels. We also feed the output of each convolutional block into a branch, each of which is made up of group convolutions (Xie et al., <xref ref-type="bibr" rid="B43">2017</xref>). We use group convolution instead of the original ResBlock because it not only effectively reduces the number of parameters but also speeds up training. After the bottom sampling layer, two group convolution layers are connected to fully extract the most abstract features. The group convolution we employ is shown in <xref ref-type="fig" rid="F3">Figure 3</xref>. The parameters involved are the input dimension <italic>N</italic>, the channel depth <italic>d</italic> in each group, the number of groups &#x003B7;, the total number of group convolution channels &#x003B6;, and the number of output channels <italic>P</italic>. In this study, the relationship between these parameters is: <italic>N</italic> &#x0003D; 2&#x003B6; &#x0003D; 8&#x003B7;, &#x003B6; &#x0003D; <italic>d&#x003B7;</italic>.</p>
<fig id="F3" position="float">
<label>Figure 3</label>
<caption><p>Schematic diagram of group convolution. Compared with the original ResBlock, we group the input channels and perform operations on each group before the confluence.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1277160-g0003.tif"/>
</fig>
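To make the parameter relations concrete, a grouped-convolution branch could be sketched in PyTorch as follows. Only the grouping scheme (&#x003B6; = N/2 channels, &#x003B7; = N/8 groups, hence d = 4 channels per group) follows the text; the 1 &#x000D7; 1 reduce/expand layers and the 3 &#x000D7; 3 kernel are illustrative assumptions:

```python
import torch
import torch.nn as nn

class GroupConvBranch(nn.Module):
    """Illustrative grouped-convolution branch using the stated relations
    N = 2*zeta = 8*eta and zeta = d*eta, i.e. zeta = N // 2 total channels
    split into eta = N // 8 groups of d = 4 channels each."""
    def __init__(self, n_in):
        super().__init__()
        zeta, eta = n_in // 2, n_in // 8
        self.reduce = nn.Conv2d(n_in, zeta, kernel_size=1)   # project to zeta channels
        self.group = nn.Conv2d(zeta, zeta, kernel_size=3,
                               padding=1, groups=eta)        # per-group 3x3 convolution
        self.expand = nn.Conv2d(zeta, n_in, kernel_size=1)   # merge groups back
    def forward(self, x):
        return self.expand(torch.relu(self.group(self.reduce(x))))
```

The grouped 3 &#x000D7; 3 layer stores &#x003B6; &#x000D7; d &#x000D7; 9 weights instead of the &#x003B6; &#x000D7; &#x003B6; &#x000D7; 9 a dense convolution would need, which is the parameter saving the text refers to.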
<p>Next follow three upsampling layers, where the input of each upsampling layer is the output of the corresponding downsampling layer, processed by the group convolution branch, together with the output of the previous upsampling layer. Traditional upsampling uses non-learnable methods such as linear interpolation; instead, we use sub-pixel convolution (Shi et al., <xref ref-type="bibr" rid="B36">2016</xref>) to replace the original interpolation. The schematic of the sub-pixel convolution is shown in <xref ref-type="fig" rid="F4">Figure 4</xref>. We use sub-pixel convolution for upsampling on each layer because it effectively reduces the number of parameters (the channel count becomes <inline-formula><mml:math id="M5"><mml:mfrac><mml:mrow><mml:mn>1</mml:mn></mml:mrow><mml:mrow><mml:msup><mml:mrow><mml:mi>r</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msup></mml:mrow></mml:mfrac></mml:math></inline-formula> of the original, where <italic>r</italic> is the upsampling factor). Additionally, the parameters of the sub-pixel convolution are learnable; their weight updates during training can effectively eliminate checkerboard artifacts (Shi et al., <xref ref-type="bibr" rid="B36">2016</xref>).</p>
<fig id="F4" position="float">
<label>Figure 4</label>
<caption><p>Schematic diagram of subpixel convolution. The expansion is realized by arranging the identical coordinate position tensors on the channel.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1277160-g0004.tif"/>
</fig></sec>
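A learnable sub-pixel upsampling step of the kind described above can be sketched in PyTorch; the channel sizes are placeholders, not the paper&#x00027;s exact configuration:

```python
import torch
import torch.nn as nn

class SubPixelUp(nn.Module):
    """Sub-pixel convolution (Shi et al., 2016): a convolution produces
    c_out * r^2 channels, and PixelShuffle rearranges same-position values
    across channels into an output r times larger in each spatial dimension."""
    def __init__(self, c_in, c_out, r=2):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out * r * r, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(r)  # (B, C*r^2, H, W) -> (B, C, rH, rW)
    def forward(self, x):
        return self.shuffle(self.conv(x))
```

Because the convolution runs at the low resolution and its weights are learned, this step both reduces computation relative to convolving after interpolation and avoids the fixed interpolation kernels that produce checkerboard artifacts.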
<sec>
<title>3.2.3. Prediction</title>
<p>At the end of the network is the prediction layer, which predicts a value between 0 and 1 for each pixel. The input to the prediction layer is the sum of the upsampled outputs <italic>R</italic><sup>1</sup> and <italic>U</italic><sup>1</sup>. The expected inputs to the prediction layer are high-quality deep and shallow features. <xref ref-type="fig" rid="F5">Figure 5</xref> shows the visual output of each layer in the network. The head layer&#x00027;s output is sparse, meaning the shallow features are insufficient, so we take <italic>R</italic><sup>1</sup> as the shallow-feature representative. <italic>U</italic><sup>1</sup> is the output after the upsampling iterations, which has higher-level feature properties and is used as the deep-feature representative. The input of the prediction layer is convolved with a kernel of size 1 &#x000D7; 1 and then fed into a BN layer. Finally, the output is obtained through a Sigmoid activation function.</p>
<fig id="F5" position="float">
<label>Figure 5</label>
<caption><p>Visualization of each layer of the network.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1277160-g0005.tif"/>
</fig></sec></sec>
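The prediction layer described above (1 &#x000D7; 1 convolution, batch normalization, sigmoid, applied to the sum of <italic>R</italic><sup>1</sup> and <italic>U</italic><sup>1</sup>) can be sketched in PyTorch; the channel count and spatial size are assumptions for illustration:

```python
import torch
import torch.nn as nn

# Prediction head: 1x1 conv collapses the feature channels to one intensity
# channel, BatchNorm stabilizes it, and the sigmoid maps each pixel to (0, 1).
prediction = nn.Sequential(
    nn.Conv2d(32, 1, kernel_size=1),  # 32 input channels is an assumed width
    nn.BatchNorm2d(1),
    nn.Sigmoid(),
)

r1 = torch.randn(1, 32, 180, 240)  # shallow features R^1 (illustrative shape)
u1 = torch.randn(1, 32, 180, 240)  # deep features U^1
img = prediction(r1 + u1)          # reconstructed frame, per-pixel values in (0, 1)
```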
<sec>
<title>3.3. Loss function</title>
<p>To obtain a reconstructed image rich in feature information, the loss function consists of two parts. The first part, LPIPS (Zhang et al., <xref ref-type="bibr" rid="B46">2018</xref>), measures the high-level features of the image. The second part, SSIM (Wang et al., <xref ref-type="bibr" rid="B42">2004</xref>), evaluates the low-level features. SSIM measures the similarity between two images, focusing mainly on the similarity of edges and textures. It is calculated as follows:</p>
<disp-formula id="E3"><label>(3)</label><mml:math id="M6"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mo class="qopname">SSIM</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mi>L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mi>C</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x000D7;</mml:mo><mml:mi>S</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>where <italic>X</italic><sub>1</sub> and <italic>X</italic><sub>2</sub> represent two images, <italic>L</italic> represents brightness similarity, <italic>C</italic> represents contrast similarity, and <italic>S</italic> represents structure score. <italic>L</italic>, <italic>C</italic>, and <italic>S</italic> are, respectively, calculated as follows:</p>
<disp-formula id="E4"><label>(4)</label><mml:math id="M7"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mtable style="text-align:axis;" equalrows="false" columnlines="none" equalcolumns="false" class="array"><mml:mtr><mml:mtd><mml:mi>L</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>u</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>C</mml:mi><mml:mrow><mml:mo 
stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:mn>2</mml:mn><mml:msub><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msubsup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msubsup><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msubsup><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mtd></mml:mtr><mml:mtr><mml:mtd><mml:mi>S</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo 
stretchy="false">)</mml:mo></mml:mrow></mml:mtd><mml:mtd><mml:mo>=</mml:mo><mml:mfrac><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:msub><mml:mrow><mml:mi>&#x003C3;</mml:mi></mml:mrow><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:msub><mml:mo>&#x0002B;</mml:mo><mml:msub><mml:mrow><mml:mi>C</mml:mi></mml:mrow><mml:mrow><mml:mn>3</mml:mn></mml:mrow></mml:msub></mml:mrow></mml:mfrac></mml:mtd></mml:mtr></mml:mtable></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
<p>In the above, <italic>u</italic><sub><italic>X</italic><sub>1</sub></sub> and <italic>u</italic><sub><italic>X</italic><sub>2</sub></sub> represent the means of images <italic>X</italic><sub>1</sub> and <italic>X</italic><sub>2</sub>, &#x003C3;<sub><italic>X</italic><sub>1</sub></sub> and &#x003C3;<sub><italic>X</italic><sub>2</sub></sub> represent their standard deviations, and &#x003C3;<sub><italic>X</italic><sub>1</sub><italic>X</italic><sub>2</sub></sub> represents their covariance. <italic>C</italic><sub>1</sub>, <italic>C</italic><sub>2</sub>, and <italic>C</italic><sub>3</sub> are constants used to avoid division by 0. Specifically, <italic>C</italic><sub>1</sub> = 0.01, <italic>C</italic><sub>2</sub> = 0.03, and <italic>C</italic><sub>3</sub> = 0.015.</p>
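<p>As a concrete illustration of Equations (3)&#x02013;(4), the sketch below computes a global, single-window SSIM in NumPy with the constants given above. Production implementations such as that of Wang et al. (2004) compute these statistics over local Gaussian windows and average the result, so this is a simplification.</p>

```python
import numpy as np

def ssim(x1, x2, c1=0.01, c2=0.03, c3=0.015):
    """Global SSIM sketch: luminance (L), contrast (C) and structure (S)
    terms from Eq. (4), multiplied together as in Eq. (3)."""
    u1, u2 = x1.mean(), x2.mean()
    s1, s2 = x1.std(), x2.std()
    s12 = ((x1 - u1) * (x2 - u2)).mean()              # covariance
    L = (2 * u1 * u2 + c1) / (u1 ** 2 + u2 ** 2 + c1)  # luminance similarity
    C = (2 * s1 * s2 + c2) / (s1 ** 2 + s2 ** 2 + c2)  # contrast similarity
    S = (s12 + c3) / (s1 * s2 + c3)                    # structure score
    return L * C * S

img = np.linspace(0, 1, 64).reshape(8, 8)
print(ssim(img, img))  # identical images score exactly 1.0
```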
<p>To increase the similarity of the two images, their error in high-level feature expression must also be made as small as possible; LPIPS is used to achieve this. LPIPS uses a VGG19 (Simonyan and Zisserman, <xref ref-type="bibr" rid="B38">2014</xref>) network trained on the MS-COCO dataset: both images are passed through the network, and the difference between the outputs of each layer is computed.</p>
<disp-formula id="E5"><label>(5)</label><mml:math id="M8"><mml:mrow><mml:mi>d</mml:mi><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msub><mml:mi>X</mml:mi><mml:mn>1</mml:mn></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mi>X</mml:mi><mml:mn>2</mml:mn></mml:msub></mml:mrow><mml:mo>)</mml:mo></mml:mrow><mml:mo>=</mml:mo><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mi>l</mml:mi></mml:munder><mml:mrow><mml:mfrac><mml:mn>1</mml:mn><mml:mrow><mml:msub><mml:mi>H</mml:mi><mml:mi>l</mml:mi></mml:msub><mml:msub><mml:mi>W</mml:mi><mml:mi>l</mml:mi></mml:msub></mml:mrow></mml:mfrac></mml:mrow></mml:mstyle><mml:mstyle displaystyle='true'><mml:munder><mml:mo>&#x02211;</mml:mo><mml:mrow><mml:mi>h</mml:mi><mml:mo>,</mml:mo><mml:mi>w</mml:mi></mml:mrow></mml:munder><mml:mrow><mml:msubsup><mml:mrow><mml:mrow><mml:mo>&#x02016;</mml:mo><mml:mrow><mml:msub><mml:mi>w</mml:mi><mml:mi>l</mml:mi></mml:msub><mml:mo>&#x02299;</mml:mo><mml:mrow><mml:mo>(</mml:mo><mml:mrow><mml:msubsup><mml:mover accent='true'><mml:mi>Y</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mrow><mml:mn>1</mml:mn><mml:mi>h</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup><mml:mo>&#x02212;</mml:mo><mml:msubsup><mml:mover accent='true'><mml:mi>Y</mml:mi><mml:mo>&#x0005E;</mml:mo></mml:mover><mml:mrow><mml:mn>2</mml:mn><mml:mi>h</mml:mi><mml:mi>w</mml:mi></mml:mrow><mml:mi>l</mml:mi></mml:msubsup></mml:mrow><mml:mo>)</mml:mo></mml:mrow></mml:mrow><mml:mo>&#x02016;</mml:mo></mml:mrow></mml:mrow><mml:mn>2</mml:mn><mml:mn>2</mml:mn></mml:msubsup></mml:mrow></mml:mstyle></mml:mrow></mml:math></disp-formula>
<p>where <italic>d</italic> is the mean difference between <italic>X</italic><sub>1</sub> and <italic>X</italic><sub>2</sub>. Feature pairs are extracted from layer <italic>l</italic> and unit-normalized in the channel dimension. <italic>w</italic><sub><italic>l</italic></sub> is the scaling factor, &#x02299; stands for the element-wise product, and &#x00176; is the output of the corresponding layer. The final loss function is as follows:</p>
<disp-formula id="E6"><label>(6)</label><mml:math id="M9"><mml:mtable class="eqnarray" columnalign="left"><mml:mtr><mml:mtd><mml:mrow><mml:mi mathvariant="script">L</mml:mi></mml:mrow><mml:mo>=</mml:mo><mml:mo class="qopname">SSIM</mml:mo><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow><mml:mo>&#x0002B;</mml:mo><mml:mi>d</mml:mi><mml:mrow><mml:mo stretchy="false">(</mml:mo><mml:mrow><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>1</mml:mn></mml:mrow></mml:msub><mml:mo>,</mml:mo><mml:msub><mml:mrow><mml:mi>X</mml:mi></mml:mrow><mml:mrow><mml:mn>2</mml:mn></mml:mrow></mml:msub></mml:mrow><mml:mo stretchy="false">)</mml:mo></mml:mrow></mml:mtd></mml:mtr></mml:mtable></mml:math></disp-formula>
</sec>
<sec>
<title>3.4. Training</title>
<p>Since ground truth is difficult to obtain when building a dataset with a real event camera, the datasets used by the mainstream methods (Rebecq et al., <xref ref-type="bibr" rid="B26">2019a</xref>,<xref ref-type="bibr" rid="B27">b</xref>; Scheerlinck et al., <xref ref-type="bibr" rid="B35">2020</xref>; Cadena et al., <xref ref-type="bibr" rid="B7">2021</xref>) are all generated in a simulator. For a fair evaluation, this study uses the same data. The ECOCO dataset, built from the MS-COCO dataset (Lin et al., <xref ref-type="bibr" rid="B16">2014</xref>), is used: the event simulator ESIM (Rebecq et al., <xref ref-type="bibr" rid="B25">2018</xref>) maps each image to a corresponding event stream and regular image. The image size used in the simulator is 240 &#x000D7; 180 pixels. The simulator was used to generate 1,000 sequences, each lasting 2 s; 950 sequences were randomly selected as the training set, and the rest were used as the test set. For all event streams, normally distributed random noise with a mean of 0.18 and a standard deviation of 0.03 is added. This mimics the noise of a real camera and avoids over-fitting during training, which would otherwise lead to poor reconstruction results in natural conditions.</p>
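<p>The noise-injection step can be sketched as follows, assuming the event stream has already been accumulated into a voxel-grid tensor; the exact event representation is an assumption here, and only the noise parameters come from the text above.</p>

```python
import numpy as np

def add_sensor_noise(events, mean=0.18, std=0.03, seed=None):
    """Mimic real-sensor noise on a simulated event tensor by adding
    normally distributed noise (mean 0.18, std 0.03, as in the paper)."""
    rng = np.random.default_rng(seed)
    return events + rng.normal(loc=mean, scale=std, size=events.shape)

voxel = np.zeros((5, 180, 240))       # 5 temporal bins, 240 x 180 sensor
noisy = add_sensor_noise(voxel, seed=0)
print(round(noisy.mean(), 2))         # sample mean close to 0.18
```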
<p>During training, the data were randomly rotated within [-20&#x000B0;, 20&#x000B0;], randomly flipped horizontally, and cropped to 128 &#x000D7; 128 to augment the dataset. Our experiments are conducted on the Ubuntu 18.04 LTS operating system using CUDA 11.0, Python 3.8, and PyTorch 1.3.0. The hardware setup includes an NVIDIA GTX 1080 (8 GB), 64 GB of RAM, and an Intel i7-12700 CPU. Training runs for 200 epochs with a batch size of 4; the ADAM (Kingma and Ba, <xref ref-type="bibr" rid="B12">2014</xref>) optimizer is used with a maximum learning rate of 5 &#x000D7; 10<sup>&#x02212;4</sup> and a warm-up learning-rate strategy.</p></sec></sec>
<sec id="s4">
<title>4. Experiment and analysis</title>
<p>In this section, we quantitatively and qualitatively evaluate E2VIDX against current mainstream methods and then apply it in practice.</p>
<sec>
<title>4.1. Reconstructed image evaluation</title>
<p>To measure the accuracy of each method, we use the same datasets as previous studies (dynamic_6dof, boxes_6dof, poster_6dof, office_zigzag, slider_depth, and calibration). The datasets were captured indoors in six scenarios and contain variable-speed free motion with six degrees of freedom as well as linear motion with a single degree of freedom. The camera used is a DAVIS240C, which outputs event streams and 240 &#x000D7; 180 frame images. Each reconstructed image is matched with the frame image with the closest timestamp, and the MSE, SSIM, and LPIPS of the two images are calculated as evaluation metrics. The quantitative results on each dataset are shown in <xref ref-type="table" rid="T1">Table 1</xref>. Our use of sub-pixel convolution and group convolution improves the low-level features of the image, so the reconstructed images perform better on SSIM and MSE. SPADE-E2VID adds weight to the LPIPS term in its loss function, so it performs best on LPIPS. On all metrics except LPIPS, our method outperforms both E2VID and FireNet.</p>
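<p>The timestamp matching used in this evaluation can be sketched as follows; <monospace>nearest_frame</monospace> is a hypothetical helper, not the authors' code, and assumes the ground-truth frame timestamps are sorted in ascending order.</p>

```python
import numpy as np

def nearest_frame(recon_ts, frame_ts):
    """For each reconstructed image timestamp, return the index of the
    ground-truth frame with the closest timestamp (frame_ts sorted)."""
    idx = np.searchsorted(frame_ts, recon_ts)
    idx = np.clip(idx, 1, len(frame_ts) - 1)
    left, right = frame_ts[idx - 1], frame_ts[idx]
    # pick whichever neighbour is closer in time
    return np.where(recon_ts - left <= right - recon_ts, idx - 1, idx)

frames = np.array([0.00, 0.04, 0.08, 0.12])   # frame timestamps (s)
recon = np.array([0.01, 0.05, 0.11])          # reconstruction timestamps (s)
print(nearest_frame(recon, frames))           # [0 1 3]
```

<p>The matched pairs are then scored with MSE, SSIM, and LPIPS as described above.</p>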
<table-wrap position="float" id="T1">
<label>Table 1</label>
<caption><p>The evaluation index scores of the reconstruction results.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Datasets</bold></th>
<th valign="top" align="center" colspan="4">&#x02191;<bold>SSIM</bold></th>
<th valign="top" align="center" colspan="4">&#x02193;<bold>LPIPS</bold></th>
<th valign="top" align="center" colspan="4">&#x02193;<bold>MSE</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<td/>
<td valign="top" align="center"><bold>E2VID</bold></td>
<td valign="top" align="center"><bold>FireNet</bold></td>
<td valign="top" align="center"><bold>SPADE-E2VID</bold></td>
<td valign="top" align="center"><bold>E2VIDX Ours</bold></td>
<td valign="top" align="center"><bold>E2VID</bold></td>
<td valign="top" align="center"><bold>FireNet</bold></td>
<td valign="top" align="center"><bold>SPADE-E2VID</bold></td>
<td valign="top" align="center"><bold>E2VIDX Ours</bold></td>
<td valign="top" align="center"><bold>E2VID</bold></td>
<td valign="top" align="center"><bold>FireNet</bold></td>
<td valign="top" align="center"><bold>SPADE-E2VID</bold></td>
<td valign="top" align="center"><bold>E2VIDX Ours</bold></td>
</tr> <tr>
<td valign="top" align="left">dynamic_6dof</td>
<td valign="top" align="center">0.3841</td>
<td valign="top" align="center">0.3737</td>
<td valign="top" align="center">0.3742</td>
<td valign="top" align="center"><bold>0.4256</bold></td>
<td valign="top" align="center">0.3621</td>
<td valign="top" align="center">0.3348</td>
<td valign="top" align="center"><bold>0.3208</bold></td>
<td valign="top" align="center">0.3472</td>
<td valign="top" align="center">0.1560</td>
<td valign="top" align="center">0.1457</td>
<td valign="top" align="center">0.1073</td>
<td valign="top" align="center"><bold>0.0759</bold></td>
</tr> <tr>
<td valign="top" align="left">boxes_6dof</td>
<td valign="top" align="center">0.5693</td>
<td valign="top" align="center">0.5143</td>
<td valign="top" align="center">0.5537</td>
<td valign="top" align="center"><bold>0.5700</bold></td>
<td valign="top" align="center">0.3111</td>
<td valign="top" align="center">0.3429</td>
<td valign="top" align="center"><bold>0.2921</bold></td>
<td valign="top" align="center">0.3023</td>
<td valign="top" align="center"><bold>0.0414</bold></td>
<td valign="top" align="center">0.0546</td>
<td valign="top" align="center">0.0446</td>
<td valign="top" align="center">0.0426</td>
</tr> <tr>
<td valign="top" align="left">poster_6dof</td>
<td valign="top" align="center"><bold>0.5616</bold></td>
<td valign="top" align="center">0.5420</td>
<td valign="top" align="center">0.5537</td>
<td valign="top" align="center">0.5567</td>
<td valign="top" align="center">0.2916</td>
<td valign="top" align="center"><bold>0.2860</bold></td>
<td valign="top" align="center">0.2877</td>
<td valign="top" align="center">0.3074</td>
<td valign="top" align="center">0.0638</td>
<td valign="top" align="center"><bold>0.0487</bold></td>
<td valign="top" align="center">0.0565</td>
<td valign="top" align="center">0.0624</td>
</tr> <tr>
<td valign="top" align="left">office_zigzag</td>
<td valign="top" align="center">0.4474</td>
<td valign="top" align="center">0.4261</td>
<td valign="top" align="center">0.4479</td>
<td valign="top" align="center"><bold>0.4635</bold></td>
<td valign="top" align="center">0.3208</td>
<td valign="top" align="center">0.3393</td>
<td valign="top" align="center"><bold>0.3031</bold></td>
<td valign="top" align="center">0.3209</td>
<td valign="top" align="center">0.0739</td>
<td valign="top" align="center">0.0813</td>
<td valign="top" align="center">0.0560</td>
<td valign="top" align="center"><bold>0.0515</bold></td>
</tr> <tr>
<td valign="top" align="left">slider_depth</td>
<td valign="top" align="center">0.2821</td>
<td valign="top" align="center">0.2683</td>
<td valign="top" align="center">0.2672</td>
<td valign="top" align="center"><bold>0.3006</bold></td>
<td valign="top" align="center">0.4095</td>
<td valign="top" align="center">0.4097</td>
<td valign="top" align="center">0.4077</td>
<td valign="top" align="center"><bold>0.3679</bold></td>
<td valign="top" align="center">0.1035</td>
<td valign="top" align="center">0.0824</td>
<td valign="top" align="center">0.0803</td>
<td valign="top" align="center"><bold>0.0696</bold></td>
</tr> <tr>
<td valign="top" align="left">calibration</td>
<td valign="top" align="center">0.3795</td>
<td valign="top" align="center">0.3613</td>
<td valign="top" align="center">0.3813</td>
<td valign="top" align="center"><bold>0.3889</bold></td>
<td valign="top" align="center">0.3544</td>
<td valign="top" align="center">0.3138</td>
<td valign="top" align="center"><bold>0.2598</bold></td>
<td valign="top" align="center">0.2987</td>
<td valign="top" align="center">0.0698</td>
<td valign="top" align="center">0.0617</td>
<td valign="top" align="center"><bold>0.0543</bold></td>
<td valign="top" align="center">0.0567</td>
</tr> <tr>
<td valign="top" align="left"><bold>Mean</bold></td>
<td valign="top" align="center">0.4373</td>
<td valign="top" align="center">0.4143</td>
<td valign="top" align="center">0.4297</td>
<td valign="top" align="center"><bold>0.4504</bold></td>
<td valign="top" align="center">0.3416</td>
<td valign="top" align="center">0.3378</td>
<td valign="top" align="center"><bold>0.3118</bold></td>
<td valign="top" align="center">0.3241</td>
<td valign="top" align="center">0.0847</td>
<td valign="top" align="center">0.0791</td>
<td valign="top" align="center">0.0665</td>
<td valign="top" align="center"><bold>0.0598</bold></td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>Higher SSIM scores are better; lower LPIPS and MSE scores are better. Bold values indicate the best score among the compared methods.</p>
</table-wrap-foot>
</table-wrap>
<p><xref ref-type="fig" rid="F6">Figure 6</xref> shows the reconstruction results of the various methods. The reconstructed images of E2VID and FireNet have a whitish foreground, causing a deviation in color saturation. SPADE-E2VID performs well in image reconstruction, but it requires the previous reconstructed image as input, so accumulated error often cannot be eliminated. Our method performs better in terms of color saturation and contrast and achieves the best SSIM and MSE.</p>
<fig id="F6" position="float">
<label>Figure 6</label>
<caption><p>Comparison of reconstruction results.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1277160-g0006.tif"/>
</fig>
<p>In addition, we measured the runtime of each method. We created a dataset at each of four resolutions and averaged three runs of each method on both GPU and CPU. The results are presented in <xref ref-type="table" rid="T2">Table 2</xref>. FireNet requires the least time owing to its lightweight network, but its reconstruction accuracy is low. Compared with E2VID and SPADE-E2VID, our method is approximately 10% and 60% faster, respectively, and has the best accuracy. FireNet is therefore necessary only when computing power is very limited. Our proposed method improves reconstruction accuracy while keeping latency as low as possible.</p>
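<p>The timing protocol (averaging repeated runs after a warm-up, as in Table 2) can be sketched generically. <monospace>average_latency</monospace> is a hypothetical helper, and the stand-in workload replaces an actual reconstruction network; for GPU models, a synchronisation call such as <monospace>torch.cuda.synchronize()</monospace> must precede each time reading.</p>

```python
import time

def average_latency(fn, n_runs=3, warmup=1):
    """Average wall-clock latency of fn in milliseconds over n_runs,
    after warm-up calls that absorb caching and lazy initialisation."""
    for _ in range(warmup):
        fn()
    t0 = time.perf_counter()
    for _ in range(n_runs):
        fn()
    return (time.perf_counter() - t0) / n_runs * 1e3

fake_model = lambda: sum(i * i for i in range(10000))  # stand-in workload
print(f"{average_latency(fake_model):.2f} ms")
```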
<table-wrap position="float" id="T2">
<label>Table 2</label>
<caption><p>Timing Performance (ms).</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Methods</bold></th>
<th valign="top" align="left"><bold>Resolution</bold></th>
<th valign="top" align="left"><bold>E2VID</bold></th>
<th valign="top" align="left"><bold>FireNet</bold></th>
<th valign="top" align="left"><bold>SPADE-E2VID</bold></th>
<th valign="top" align="left"><bold>E2VIDX Ours</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left" rowspan="1"><bold>GPU</bold></td>
<td valign="top" align="left">240 &#x000D7; 180</td>
<td valign="top" align="left">8.02</td>
<td valign="top" align="left"><bold>2.81</bold></td>
<td valign="top" align="left">22.02</td>
<td valign="top" align="left">8.19</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">480 &#x000D7; 320</td>
<td valign="top" align="left">22.28</td>
<td valign="top" align="left"><bold>9.46</bold></td>
<td valign="top" align="left">70.48</td>
<td valign="top" align="left">20.65</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">640 &#x000D7; 480</td>
<td valign="top" align="left">42.70</td>
<td valign="top" align="left"><bold>16.86</bold></td>
<td valign="top" align="left">138.44</td>
<td valign="top" align="left">38.52</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">1280 &#x000D7; 720</td>
<td valign="top" align="left">123.42</td>
<td valign="top" align="left"><bold>51.15</bold></td>
<td valign="top" align="left">375.42</td>
<td valign="top" align="left">108.72</td>
</tr> <tr>
<td valign="top" align="left" rowspan="1"><bold>CPU</bold></td>
<td valign="top" align="left">240 &#x000D7; 180</td>
<td valign="top" align="left">86.62</td>
<td valign="top" align="left"><bold>13.98</bold></td>
<td valign="top" align="left">294.04</td>
<td valign="top" align="left">63.18</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">480 &#x000D7; 320</td>
<td valign="top" align="left">296.53</td>
<td valign="top" align="left"><bold>65.28</bold></td>
<td valign="top" align="left">1042.35</td>
<td valign="top" align="left">242.59</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">640 &#x000D7; 480</td>
<td valign="top" align="left">588.39</td>
<td valign="top" align="left"><bold>150.28</bold></td>
<td valign="top" align="left">2210.71</td>
<td valign="top" align="left">496.44</td>
</tr>
 <tr>
<td/>
<td valign="top" align="left">1280 &#x000D7; 720</td>
<td valign="top" align="left">1870.22</td>
<td valign="top" align="left"><bold>581.61</bold></td>
<td valign="top" align="left">6672.57</td>
<td valign="top" align="left">1596.67</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>The GPU is an NVIDIA GTX 1080 (8 GB) and the CPU is an Intel i7-12700. Bold values indicate the best (lowest) time among the compared methods.</p>
</table-wrap-foot>
</table-wrap></sec>
<sec>
<title>4.2. Ablation study</title>
<p>To demonstrate the effectiveness of the network design, we conducted an ablation study testing the group convolution and sub-pixel convolution individually. In the same hardware environment, with all other network parameters kept the same, each variant was trained for the same number of epochs. The test results are shown in <xref ref-type="table" rid="T3">Table 3</xref>, and representative reconstruction results are shown in <xref ref-type="fig" rid="F7">Figure 7</xref>.</p>
<table-wrap position="float" id="T3">
<label>Table 3</label>
<caption><p>Score of ablation study evaluation index.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th valign="top" align="left"><bold>Datasets</bold></th>
<th valign="top" align="left" colspan="2">&#x02191;<bold>SSIM</bold></th>
<th valign="top" align="left" colspan="2">&#x02193;<bold>LPIPS</bold></th>
<th valign="top" align="left" colspan="2">&#x02193;<bold>MSE</bold></th>
</tr>
</thead>
<tbody>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<td/>
<td valign="top" align="left"><bold>E2VIDX_grp</bold></td>
<td valign="top" align="left"><bold>E2VIDX_sub</bold></td>
<td valign="top" align="left"><bold>E2VIDX_grp</bold></td>
<td valign="top" align="left"><bold>E2VIDX_sub</bold></td>
<td valign="top" align="left"><bold>E2VIDX_grp</bold></td>
<td valign="top" align="left"><bold>E2VIDX_sub</bold></td>
</tr> <tr>
<td valign="top" align="left">dynamic_6dof</td>
<td valign="top" align="left">0.3919</td>
<td valign="top" align="left">0.4015</td>
<td valign="top" align="left">0.3683</td>
<td valign="top" align="left">0.3301</td>
<td valign="top" align="left">0.1376</td>
<td valign="top" align="left">0.1069</td>
</tr> <tr>
<td valign="top" align="left">boxes_6dof</td>
<td valign="top" align="left">0.5595</td>
<td valign="top" align="left">0.5711</td>
<td valign="top" align="left">0.3140</td>
<td valign="top" align="left">0.3142</td>
<td valign="top" align="left">0.0450</td>
<td valign="top" align="left">0.0411</td>
</tr> <tr>
<td valign="top" align="left">poster_6dof</td>
<td valign="top" align="left">0.5630</td>
<td valign="top" align="left">0.5603</td>
<td valign="top" align="left">0.3072</td>
<td valign="top" align="left">0.3184</td>
<td valign="top" align="left">0.0642</td>
<td valign="top" align="left">0.0632</td>
</tr> <tr>
<td valign="top" align="left">office_zigzag</td>
<td valign="top" align="left">0.4519</td>
<td valign="top" align="left">0.4639</td>
<td valign="top" align="left">0.3349</td>
<td valign="top" align="left">0.3242</td>
<td valign="top" align="left">0.0676</td>
<td valign="top" align="left">0.0547</td>
</tr> <tr>
<td valign="top" align="left">slider_depth</td>
<td valign="top" align="left">0.2880</td>
<td valign="top" align="left">0.3023</td>
<td valign="top" align="left">0.3896</td>
<td valign="top" align="left">0.3762</td>
<td valign="top" align="left">0.0817</td>
<td valign="top" align="left">0.0739</td>
</tr> <tr>
<td valign="top" align="left">calibration</td>
<td valign="top" align="left">0.3691</td>
<td valign="top" align="left">0.3978</td>
<td valign="top" align="left">0.3142</td>
<td valign="top" align="left">0.3002</td>
<td valign="top" align="left">0.0645</td>
<td valign="top" align="left">0.0557</td>
</tr> <tr>
<td valign="top" align="left"><bold>Mean</bold></td>
<td valign="top" align="left">0.4372</td>
<td valign="top" align="left">0.4495</td>
<td valign="top" align="left">0.3380</td>
<td valign="top" align="left">0.3272</td>
<td valign="top" align="left">0.0768</td>
<td valign="top" align="left">0.0659</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>E2VIDX_grp represents the use of group convolution only, and E2VIDX_sub represents the use of subpixel convolution only.</p>
</table-wrap-foot>
</table-wrap>
<fig id="F7" position="float">
<label>Figure 7</label>
<caption><p>Reconstruction results of ablation study. E2VIDX_grp represents the use of group convolution only, and E2VIDX_sub represents the use of subpixel convolution only.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1277160-g0007.tif"/>
</fig>
<p>The table shows that both ablation variants decline to some degree in all three indices compared with E2VIDX. Among them, the variant without group convolution (E2VIDX_sub) scores better, indicating that sub-pixel convolution has the more significant influence on our model. Notably, even the ablation variants perform better than E2VID, indicating that our network design, loss function, and data processing are well chosen. Visually, the images reconstructed in the ablation study are close to E2VIDX in color and contrast and recover perceptually solid results; however, the images of E2VIDX_grp lack detail (burrs appear on the edges of objects).</p></sec>
<sec>
<title>4.3. Applications</title>
<p>In this section, the reconstructed images are used in various computer vision applications. Three popular vision applications of increasing task difficulty are evaluated: image classification, object detection, and instance segmentation. The hardware and software platforms are the same as those described in Section 3.4.</p>
<sec>
<title>4.3.1. Image classification</title>
<p>Image classification, one of the basic tasks in computer vision, aims to identify the objects in an image. With recent advancements in neural networks, this task has been well solved (accuracy can even exceed that of the human eye; Russakovsky et al., <xref ref-type="bibr" rid="B33">2015</xref>). Datasets in this domain, such as MNIST (LeCun et al., <xref ref-type="bibr" rid="B14">1998</xref>) and CIFAR-10 (Krizhevsky and Hinton, <xref ref-type="bibr" rid="B13">2009</xref>), contain regular images and labels. In contrast, the image classification task in this section is carried out on a dataset captured by an event camera. The Neuromorphic-MNIST (N-MNIST) dataset (Orchard et al., <xref ref-type="bibr" rid="B20">2015</xref>) is a &#x0201C;neuromorphic&#x0201D; version of the MNIST dataset. It was captured by mounting an Asynchronous Time-based Image Sensor (ATIS) (Posch et al., <xref ref-type="bibr" rid="B23">2010</xref>) on a motorized head unit and moving the sensor while it viewed the MNIST dataset on an LCD (<xref ref-type="fig" rid="F8">Figure 8</xref>). To fully demonstrate the reliability of image reconstruction, we train LeNet5 (LeCun et al., <xref ref-type="bibr" rid="B14">1998</xref>) on the MNIST dataset to obtain a weight file and then directly use this file to classify the images reconstructed from N-MNIST by each reconstruction algorithm. The corresponding reconstruction results are shown in <xref ref-type="fig" rid="F9">Figure 9</xref>, and the classification accuracy is shown in <xref ref-type="table" rid="T4">Table 4</xref>.</p>
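<p>The cross-dataset evaluation reduces to comparing the MNIST-trained classifier's predictions on the reconstructed digits with the N-MNIST labels. In the sketch below, the toy class scores stand in for LeNet5 outputs; only the accuracy computation itself is shown.</p>

```python
import numpy as np

def classification_accuracy(logits, labels):
    """Accuracy of a classifier (e.g. LeNet5 trained on MNIST) evaluated
    on reconstructed N-MNIST images: argmax of per-class scores vs labels."""
    preds = np.argmax(logits, axis=1)
    return float((preds == labels).mean())

# toy stand-in: 4 reconstructed digits, 10-way class scores each
logits = np.array([[0.1] * 10] * 4)
logits[0, 3] = 1.0; logits[1, 7] = 1.0; logits[2, 1] = 1.0; logits[3, 1] = 1.0
labels = np.array([3, 7, 1, 5])
print(classification_accuracy(logits, labels))  # 0.75
```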
<fig id="F8" position="float">
<label>Figure 8</label>
<caption><p>Overview of the N-MNIST dataset. The blue point clouds represent negative polarity and the red point clouds represent positive polarity. x and y are two-dimensional representations of the space.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1277160-g0008.tif"/>
</fig>
<fig id="F9" position="float">
<label>Figure 9</label>
<caption><p>N-MNIST dataset reconstruction results.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1277160-g0009.tif"/>
</fig>
<table-wrap position="float" id="T4">
<label>Table 4</label>
<caption><p>Classification accuracy of N-MNIST dataset.</p></caption> 
<table frame="box" rules="all">
<thead>
<tr style="background-color:&#x00023;919498;color:&#x00023;ffffff">
<th/>
<th valign="top" align="center"><bold>E2VID</bold></th>
<th valign="top" align="center"><bold>FireNet</bold></th>
<th valign="top" align="center"><bold>SPADE-E2VID</bold></th>
<th valign="top" align="center"><bold>E2VIDX (Ours)</bold></th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="left">Mean accuracy</td>
<td valign="top" align="center">85.78%</td>
<td valign="top" align="center">85.92%</td>
<td valign="top" align="center">84.03%</td>
<td valign="top" align="center"><bold>86.71</bold>%</td>
</tr></tbody>
</table>
<table-wrap-foot>
<p>The bold values show that the score is the best compared with other methods.</p>
</table-wrap-foot>
</table-wrap>
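The transfer evaluation summarized in Table 4, training LeNet5 on MNIST and applying the weights to reconstructed N-MNIST digits, can be sketched as follows. The architecture follows LeCun et al. (1998); the weight-file name and the input tensor are illustrative placeholders (reconstructed N-MNIST frames, natively 34&#x000D7;34, would first be resized to 28&#x000D7;28):

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Classic LeNet-5 (LeCun et al., 1998) for 28x28 grayscale digits."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5, padding=2),  # 28x28 -> 28x28
            nn.Tanh(), nn.AvgPool2d(2),                 # -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),            # -> 10x10
            nn.Tanh(), nn.AvgPool2d(2),                 # -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = LeNet5().eval()
# model.load_state_dict(torch.load("lenet5_mnist.pt"))  # hypothetical MNIST weights
reconstructed = torch.rand(1, 1, 28, 28)  # stand-in for one reconstructed digit
with torch.no_grad():
    logits = model(reconstructed)
predicted_digit = logits.argmax(dim=1).item()
```

The point of the experiment is that no fine-tuning happens on N-MNIST at all: the same weights that classify conventional MNIST images are applied unchanged to the reconstructions.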
<p>From <xref ref-type="fig" rid="F9">Figure 9</xref>, it can be observed that all four methods accurately recover the handwritten digits. The images from E2VID and FireNet remain slightly washed out, resulting in insufficient contrast. SPADE-E2VID needs more time to initialize at the start of a sequence because each step takes the output of the previous step as input. The proposed E2VIDX provides high-quality reconstructed images. It is worth mentioning that although LeNet5 was trained only on the MNIST dataset, the classification accuracy on the reconstructed N-MNIST images exceeds 84% for every method, with our proposed E2VIDX achieving the highest accuracy of 86.71%. This shows that the reconstruction methods are reliable and recover the corresponding feature information.</p></sec>
<sec>
<title>4.3.2. Object detection</title>
<p>Object detection has long been one of the challenging fields in computer vision. The task is to automatically identify the objects contained in an input image and return their pixel coordinates and categories. Object detection based on deep learning has been extensively researched, and many excellent detectors now exist, such as the R-CNN series (Girshick, <xref ref-type="bibr" rid="B9">2015</xref>; Ren et al., <xref ref-type="bibr" rid="B31">2015</xref>), the YOLO series (Redmon et al., <xref ref-type="bibr" rid="B28">2016</xref>; Redmon and Farhadi, <xref ref-type="bibr" rid="B29">2017</xref>, <xref ref-type="bibr" rid="B30">2018</xref>), and the SSD series (Liu et al., <xref ref-type="bibr" rid="B18">2016</xref>; Li and Zhou, <xref ref-type="bibr" rid="B15">2017</xref>; Yi et al., <xref ref-type="bibr" rid="B45">2019</xref>). This section aims to verify the reliability of each reconstruction algorithm. The popular YOLOv5 (Zhang et al., <xref ref-type="bibr" rid="B47">2022</xref>) detector is applied to the reconstructed images. As in the previous section, transfer learning is used: a model trained on conventional images is applied directly to the reconstructed images. Specifically, a YOLOv5s model trained on the COCO dataset is used for detection.</p>
<p>Since the ECOCO dataset has no corresponding labels, we can only present qualitative results, as shown in <xref ref-type="fig" rid="F10">Figure 10</xref>. It can be observed that all reconstruction methods allow the main objects to be identified directly, although the predicted classes and confidences differ. The detections on E2VIDX&#x00027;s reconstructions show higher confidence than those on the frame images, which indicates that our recovered images are strongly interpretable. E2VIDX and SPADE-E2VID outperform E2VID and FireNet in both object recognition and confidence, especially for small objects such as books.</p>
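To go beyond visual inspection, the agreement between detections on a reference frame image and on a reconstructed image can be quantified with a simple IoU-based matcher. The helper below is purely illustrative, it is not part of the paper's pipeline, and the detection tuples shown are made-up examples:

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) form."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def match_detections(ref_dets, rec_dets, iou_thresh=0.5):
    """Pair detections (label, confidence, box) that overlap and agree on class.

    Returns matched pairs and the confidence change on the reconstructed
    image relative to the reference image (positive = more confident).
    """
    matches, deltas = [], []
    for ref in ref_dets:
        for rec in rec_dets:
            if ref[0] == rec[0] and iou(ref[2], rec[2]) >= iou_thresh:
                matches.append((ref, rec))
                deltas.append(rec[1] - ref[1])
                break
    return matches, deltas

# Hypothetical detections on a frame image vs. its reconstruction.
frame_dets = [("book", 0.55, (10, 10, 60, 80)), ("person", 0.90, (100, 20, 180, 200))]
recon_dets = [("book", 0.71, (12, 11, 61, 78)), ("person", 0.92, (98, 22, 182, 198))]
matches, deltas = match_detections(frame_dets, recon_dets)
```

A positive confidence delta on a matched object corresponds to the behavior reported above, where the reconstructed image yields a more confident detection than the frame image.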
<fig id="F10" position="float">
<label>Figure 10</label>
<caption><p>YOLOv5 for reconstructed image detection.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1277160-g0010.tif"/>
</fig></sec>
<sec>
<title>4.3.3. Instance segmentation</title>
<p>Instance segmentation, one of the most difficult visual tasks, is also a focus of current research. It classifies an image pixel by pixel and therefore places high demands on image quality. In this section, we use the YOLACT (You Only Look At Coefficients) (Bolya et al., <xref ref-type="bibr" rid="B5">2019</xref>) instance segmentation model and again apply weight files trained on a conventional-camera dataset directly to our reconstructed images. The datasets used previously were all captured indoors, so the reconstruction of an outdoor scene is added in this section: motor vehicles on a highway. Frame images are captured with a Huawei P20 Pro, converted into event streams with v2e (Hu et al., <xref ref-type="bibr" rid="B10">2021</xref>), and then reconstructed. The segmentation results on our reconstructions are shown in <xref ref-type="fig" rid="F11">Figure 11</xref>.</p>
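The frame-to-event conversion step rests on the standard DVS generation model: a pixel fires an event each time its log-intensity change since the last event crosses a contrast threshold C. The sketch below illustrates this principle on a single frame pair; it is a simplified, noise-free toy (v2e additionally models sensor noise, bandwidth, and precise event timing), and the threshold value is arbitrary:

```python
import numpy as np

def events_from_frames(frame0, frame1, contrast_threshold=0.2, eps=1e-6):
    """Generate idealized DVS events from two intensity frames.

    A pixel emits k events of polarity sign(delta) when its log-intensity
    change |delta| crosses k multiples of the contrast threshold C.
    Returns a list of (x, y, polarity) tuples (timestamps omitted).
    """
    log0 = np.log(frame0.astype(np.float64) + eps)
    log1 = np.log(frame1.astype(np.float64) + eps)
    delta = log1 - log0
    events = []
    ys, xs = np.nonzero(np.abs(delta) >= contrast_threshold)
    for y, x in zip(ys, xs):
        polarity = 1 if delta[y, x] > 0 else -1
        count = int(np.abs(delta[y, x]) // contrast_threshold)
        events.extend([(x, y, polarity)] * count)
    return events

# A pixel that doubles in brightness crosses log(2)/0.2 ~ 3 thresholds,
# so it emits three ON events.
frame0 = np.full((4, 4), 0.5)
frame1 = frame0.copy()
frame1[1, 2] = 1.0
events = events_from_frames(frame0, frame1)
```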
<fig id="F11" position="float">
<label>Figure 11</label>
<caption><p>YOLACT for reconstructed image instance segmentation. <bold>(A)</bold> Instance segmentation of reconstructed images of indoor scenes. <bold>(B)</bold> Instance segmentation of reconstructed images of outdoor scenes.</p></caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fnbot-17-1277160-g0011.tif"/>
</fig>
<p>For indoor scenes, the segmentation produced on E2VIDX reconstructions is more continuous and accurate than that of the other methods; our method outlines most objects at the pixel level. The other methods do not reach the same performance because their less ideal reconstructions lead to missed or false detections. Owing to insufficient illumination, the false detection rate of instance segmentation on the frame images is high.</p>
<p>For outdoor scenes, E2VIDX performs image reconstruction equally well, and the reconstructed images are highly consistent with the high-quality original images. The segmentations of the two images (original and reconstructed) are almost identical, indicating that the recovered image shares the characteristics of a high-quality frame image. The outdoor segmentation results of the other methods are generally good but occasionally contain misdetections.</p></sec></sec></sec>
<sec sec-type="conclusions" id="s5">
<title>5. Conclusion</title>
<p>In this study, we propose a novel approach named E2VIDX for event camera-based image reconstruction. Specifically, our study proposes: (1) an optimization of the original network structure that strengthens the fusion of deep and shallow features; (2) the use of group convolution and sub-pixel convolution to further streamline the model, with an ablation study verifying their effectiveness; and (3) a simple loss function that is optimized over both semantic and low-level image features. Furthermore, we evaluate the reconstructed results in practical vision applications, including image classification, object detection, and instance segmentation. We conduct comprehensive quantitative and qualitative experiments to assess the performance of our approach. Through rigorous experimentation, E2VIDX surpasses the current state-of-the-art methods. Compared with E2VID, our approach exhibits notable improvements: a 1.3% increase in SSIM, a 1.7% reduction in LPIPS, a 2.5% decrease in MSE, and a 10% reduction in inference time. We also reduce the model size from 42.9 MB to 32.1 MB. After a series of comparative experiments, we demonstrate that E2VIDX offers enhanced robustness, enabling direct application of the reconstructed image data. This effectively narrows the gap between conventional computer vision and biomimetic vision. In the future, our research will primarily concentrate on the development of a lightweight network structure. We aim to enhance the efficiency of feature extraction by integrating advanced attention mechanisms into our model.</p></sec>
<sec sec-type="data-availability" id="s6">
<title>Data availability statement</title>
<p>The original contributions presented in the study are included in the article/supplementary material, further inquiries can be directed to the corresponding author.</p></sec>
<sec sec-type="author-contributions" id="s7">
<title>Author contributions</title>
<p>XH: Methodology, Writing&#x02014;original draft. FZ: Methodology, Writing&#x02014;review &#x00026; editing. DG: Writing&#x02014;review &#x00026; editing. TT: Software, Visualization, Writing&#x02014;review &#x00026; editing. WZ: Software, Visualization, Writing&#x02014;review &#x00026; editing.</p></sec>
</body>
<back>
<sec sec-type="funding-information" id="s8">
<title>Funding</title>
<p>The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This research was funded by the National Natural Science Foundation of China (52171322) and Graduate Innovation Seed Fund of Northwestern Polytechnical University (PF2023066 and PF2023067).</p>
</sec>
<ack><p>The authors would like to acknowledge the financial assistance provided by the Key Laboratory of Unmanned Underwater Transport Technology. DG provided guidance on experimental design and writing, and FZ provided great support.</p>
</ack>
<sec sec-type="COI-statement" id="conf1">
<title>Conflict of interest</title>
<p>DG was employed by Siemens EDA. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest. The author(s) declared that they were an editorial board member of Frontiers, at the time of submission. This had no impact on the peer review process and the final decision.</p>
</sec>
<sec sec-type="disclaimer" id="s9">
<title>Publisher&#x00027;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Adam</surname> <given-names>G. K.</given-names></name> <name><surname>Kontaxis</surname> <given-names>P. A.</given-names></name> <name><surname>Doulos</surname> <given-names>L. T.</given-names></name> <name><surname>Madias</surname> <given-names>E.-N. D.</given-names></name> <name><surname>Bouroussis</surname> <given-names>C. A.</given-names></name> <name><surname>Topalis</surname> <given-names>F. V.</given-names></name></person-group> (<year>2019</year>). <article-title>Embedded microcontroller with a ccd camera as a digital lighting control system</article-title>. <source>Electronics</source> <volume>8</volume>, <fpage>33</fpage>. <pub-id pub-id-type="doi">10.3390/electronics8010033</pub-id></citation>
</ref>
<ref id="B2">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bing</surname> <given-names>Z.</given-names></name> <name><surname>Knak</surname> <given-names>L.</given-names></name> <name><surname>Cheng</surname> <given-names>L.</given-names></name> <name><surname>Morin</surname> <given-names>F. O.</given-names></name> <name><surname>Huang</surname> <given-names>K.</given-names></name> <name><surname>Knoll</surname> <given-names>A.</given-names></name></person-group> (<year>2023a</year>). <article-title>&#x0201C;Meta-reinforcement learning in nonstationary and nonparametric environments,&#x0201D;</article-title> in <source>IEEE Transactions on Neural Networks and Learning Systems</source>, <fpage>1</fpage>&#x02013;<lpage>15</lpage>.<pub-id pub-id-type="pmid">37224358</pub-id></citation></ref>
<ref id="B3">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bing</surname> <given-names>Z.</given-names></name> <name><surname>Lerch</surname> <given-names>D.</given-names></name> <name><surname>Huang</surname> <given-names>K.</given-names></name> <name><surname>Knoll</surname> <given-names>A.</given-names></name></person-group> (<year>2023b</year>). <article-title>Meta-reinforcement learning in non-stationary and dynamic environments</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>45</volume>, <fpage>3476</fpage>&#x02013;<lpage>3491</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2022.3185549</pub-id><pub-id pub-id-type="pmid">35737617</pub-id></citation></ref>
<ref id="B4">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bing</surname> <given-names>Z.</given-names></name> <name><surname>Sewisy</surname> <given-names>A. E.</given-names></name> <name><surname>Zhuang</surname> <given-names>G.</given-names></name> <name><surname>Walter</surname> <given-names>F.</given-names></name> <name><surname>Morin</surname> <given-names>F. O.</given-names></name> <name><surname>Huang</surname> <given-names>K.</given-names></name> <etal/></person-group>. (<year>2022</year>). <article-title>Toward cognitive navigation: Design and implementation of a biologically inspired head direction cell network</article-title>. <source>IEEE Trans. Neural Netw. Learn. Syst</source>. <volume>33</volume>, <fpage>2147</fpage>&#x02013;<lpage>2158</lpage>. <pub-id pub-id-type="doi">10.1109/TNNLS.2021.3128380</pub-id><pub-id pub-id-type="pmid">34860654</pub-id></citation></ref>
<ref id="B5">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Bolya</surname> <given-names>D.</given-names></name> <name><surname>Zhou</surname> <given-names>C.</given-names></name> <name><surname>Xiao</surname> <given-names>F.</given-names></name> <name><surname>Lee</surname> <given-names>Y. J.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Yolact: Real-time instance segmentation,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF International Conference on Computer Vision</source>, <fpage>9157</fpage>&#x02013;<lpage>9166</lpage>.</citation>
</ref>
<ref id="B6">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Brandli</surname> <given-names>C.</given-names></name> <name><surname>Muller</surname> <given-names>L.</given-names></name> <name><surname>Delbruck</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>&#x0201C;Real-time, high-speed video decompression using a frame-and event-based davis sensor,&#x0201D;</article-title> in <source>2014 IEEE International Symposium on Circuits and Systems (ISCAS)</source>. Melbourne, VIC: IEEE, <fpage>686</fpage>&#x02013;<lpage>689</lpage>.</citation>
</ref>
<ref id="B7">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Cadena</surname> <given-names>P. R. G.</given-names></name> <name><surname>Qian</surname> <given-names>Y.</given-names></name> <name><surname>Wang</surname> <given-names>C.</given-names></name> <name><surname>Yang</surname> <given-names>M.</given-names></name></person-group> (<year>2021</year>). <article-title>Spade-e2vid: Spatially-adaptive denormalization for event-based video reconstruction</article-title>. <source>IEEE Trans. Image Process</source>. <volume>30</volume>, <fpage>2488</fpage>&#x02013;<lpage>2500</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2021.3052070</pub-id><pub-id pub-id-type="pmid">33502977</pub-id></citation></ref>
<ref id="B8">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Gallego</surname> <given-names>G.</given-names></name> <name><surname>Delbr&#x000FC;ck</surname> <given-names>T.</given-names></name> <name><surname>Orchard</surname> <given-names>G.</given-names></name> <name><surname>Bartolozzi</surname> <given-names>C.</given-names></name> <name><surname>Taba</surname> <given-names>B.</given-names></name> <name><surname>Censi</surname> <given-names>A.</given-names></name> <etal/></person-group>. (<year>2020</year>). <article-title>Event-based vision: a survey</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>44</volume>, <fpage>154</fpage>&#x02013;<lpage>180</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2020.3008413</pub-id><pub-id pub-id-type="pmid">32750812</pub-id></citation></ref>
<ref id="B9">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Girshick</surname> <given-names>R.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;Fast r-cnn,&#x0201D;</article-title> in <source>Proceedings of the IEEE International Conference on Computer Vision</source>, <fpage>1440</fpage>&#x02013;<lpage>1448</lpage>.</citation>
</ref>
<ref id="B10">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Hu</surname> <given-names>Y.</given-names></name> <name><surname>Liu</surname> <given-names>S.-C.</given-names></name> <name><surname>Delbruck</surname> <given-names>T.</given-names></name></person-group> (<year>2021</year>). <article-title>&#x0201C;v2e: From video frames to realistic DVS events,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>1312</fpage>&#x02013;<lpage>1321</lpage>.</citation>
</ref>
<ref id="B11">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Jing</surname> <given-names>G.</given-names></name> <name><surname>Qin</surname> <given-names>X.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Deng</surname> <given-names>C.</given-names></name></person-group> (<year>2022</year>). <article-title>Developments, challenges, and perspectives of railway inspection robots</article-title>. <source>Automat. Construct</source>. <volume>138</volume>, <fpage>104242</fpage>. <pub-id pub-id-type="doi">10.1016/j.autcon.2022.104242</pub-id></citation>
</ref>
<ref id="B12">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Kingma</surname> <given-names>D. P.</given-names></name> <name><surname>Ba</surname> <given-names>J.</given-names></name></person-group> (<year>2014</year>). <article-title>Adam: A method for stochastic optimization</article-title>. <source>arXiv preprint</source> arXiv:1412.6980.</citation>
</ref>
<ref id="B13">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Krizhevsky</surname> <given-names>A.</given-names></name> <name><surname>Hinton</surname> <given-names>G.</given-names></name></person-group> (<year>2009</year>). <source>Learning Multiple Layers of Features From Tiny Images</source>, 7.</citation>
</ref>
<ref id="B14">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>LeCun</surname> <given-names>Y.</given-names></name> <name><surname>Bottou</surname> <given-names>L.</given-names></name> <name><surname>Bengio</surname> <given-names>Y.</given-names></name> <name><surname>Haffner</surname> <given-names>P.</given-names></name></person-group> (<year>1998</year>). <article-title>Gradient-based learning applied to document recognition</article-title>. <source>Proc. IEEE</source> <volume>86</volume>, <fpage>2278</fpage>&#x02013;<lpage>2324</lpage>. <pub-id pub-id-type="doi">10.1109/5.726791</pub-id></citation>
</ref>
<ref id="B15">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Li</surname> <given-names>Z.</given-names></name> <name><surname>Zhou</surname> <given-names>F.</given-names></name></person-group> (<year>2017</year>). <article-title>FSSD: feature fusion single shot multibox detector</article-title>. <source>arXiv preprint</source> arXiv:1712.00960.</citation>
</ref>
<ref id="B16">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Lin</surname> <given-names>T.-Y.</given-names></name> <name><surname>Maire</surname> <given-names>M.</given-names></name> <name><surname>Belongie</surname> <given-names>S.</given-names></name> <name><surname>Hays</surname> <given-names>J.</given-names></name> <name><surname>Perona</surname> <given-names>P.</given-names></name> <name><surname>Ramanan</surname> <given-names>D.</given-names></name> <etal/></person-group>. (<year>2014</year>). <article-title>&#x0201C;Microsoft coco: Common objects in context,&#x0201D;</article-title> in <source>Computer Vision-ECCV 2014: 13th European Conference</source>. <publisher-loc>Zurich, Switzerland</publisher-loc>: <publisher-name>Springer</publisher-name>, <fpage>740</fpage>&#x02013;<lpage>755</lpage>.</citation>
</ref>
<ref id="B17">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>F.</given-names></name> <name><surname>Chen</surname> <given-names>D.</given-names></name> <name><surname>Zhou</surname> <given-names>J.</given-names></name> <name><surname>Xu</surname> <given-names>F.</given-names></name></person-group> (<year>2022</year>). <article-title>A review of driver fatigue detection and its advances on the use of rgb-d camera and deep learning</article-title>. <source>Eng. Appl. Artif. Intell</source>. <volume>116</volume>, <fpage>105399</fpage>. <pub-id pub-id-type="doi">10.1016/j.engappai.2022.105399</pub-id></citation>
</ref>
<ref id="B18">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Liu</surname> <given-names>W.</given-names></name> <name><surname>Anguelov</surname> <given-names>D.</given-names></name> <name><surname>Erhan</surname> <given-names>D.</given-names></name> <name><surname>Szegedy</surname> <given-names>C.</given-names></name> <name><surname>Reed</surname> <given-names>S.</given-names></name> <name><surname>Fu</surname> <given-names>C.-Y.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>&#x0201C;SSD: Single shot multibox detector,&#x0201D;</article-title> in <source>Computer Vision-ECCV 2016: 14th European Conference</source>. <publisher-loc>Amsterdam, Netherlands</publisher-loc>: <publisher-name>Springer</publisher-name>, <fpage>21</fpage>&#x02013;<lpage>37</lpage>.</citation>
</ref>
<ref id="B19">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Munda</surname> <given-names>G.</given-names></name> <name><surname>Reinbacher</surname> <given-names>C.</given-names></name> <name><surname>Pock</surname> <given-names>T.</given-names></name></person-group> (<year>2018</year>). <article-title>Real-time intensity-image reconstruction for event cameras using manifold regularisation</article-title>. <source>Int. J. Comput. Vis</source>. <volume>126</volume>, <fpage>1381</fpage>&#x02013;<lpage>1393</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-018-1106-2</pub-id></citation>
</ref>
<ref id="B20">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Orchard</surname> <given-names>G.</given-names></name> <name><surname>Jayawant</surname> <given-names>A.</given-names></name> <name><surname>Cohen</surname> <given-names>G. K.</given-names></name> <name><surname>Thakor</surname> <given-names>N.</given-names></name></person-group> (<year>2015</year>). <article-title>Converting static image datasets to spiking neuromorphic datasets using saccades</article-title>. <source>Front. Neurosci</source>. <volume>9</volume>, <fpage>437</fpage>. <pub-id pub-id-type="doi">10.3389/fnins.2015.00437</pub-id><pub-id pub-id-type="pmid">26635513</pub-id></citation></ref>
<ref id="B21">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Park</surname> <given-names>T.</given-names></name> <name><surname>Liu</surname> <given-names>M.-Y.</given-names></name> <name><surname>Wang</surname> <given-names>T.-C.</given-names></name> <name><surname>Zhu</surname> <given-names>J.-Y.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Semantic image synthesis with spatially-adaptive normalization,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>2337</fpage>&#x02013;<lpage>2346</lpage>.</citation>
</ref>
<ref id="B22">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Perot</surname> <given-names>E.</given-names></name> <name><surname>De Tournemire</surname> <given-names>P.</given-names></name> <name><surname>Nitti</surname> <given-names>D.</given-names></name> <name><surname>Masci</surname> <given-names>J.</given-names></name> <name><surname>Sironi</surname> <given-names>A.</given-names></name></person-group> (<year>2020</year>). <article-title>Learning to detect objects with a 1 megapixel event camera</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>33</volume>, <fpage>16639</fpage>&#x02013;<lpage>16652</lpage>.</citation>
</ref>
<ref id="B23">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Posch</surname> <given-names>C.</given-names></name> <name><surname>Matolin</surname> <given-names>D.</given-names></name> <name><surname>Wohlgenannt</surname> <given-names>R.</given-names></name></person-group> (<year>2010</year>). <article-title>A QVGA 143 db dynamic range frame-free pwm image sensor with lossless pixel-level video compression and time-domain cds</article-title>. <source>IEEE J. Solid-State Circ</source>. <volume>46</volume>, <fpage>259</fpage>&#x02013;<lpage>275</lpage>. <pub-id pub-id-type="doi">10.1109/JSSC.2010.2085952</pub-id></citation>
</ref>
<ref id="B24">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Posch</surname> <given-names>C.</given-names></name> <name><surname>Serrano-Gotarredona</surname> <given-names>T.</given-names></name> <name><surname>Linares-Barranco</surname> <given-names>B.</given-names></name> <name><surname>Delbruck</surname> <given-names>T.</given-names></name></person-group> (<year>2014</year>). <article-title>Retinomorphic event-based vision sensors: bioinspired cameras with spiking output</article-title>. <source>Proc. IEEE</source> <volume>102</volume>, <fpage>1470</fpage>&#x02013;<lpage>1484</lpage>. <pub-id pub-id-type="doi">10.1109/JPROC.2014.2346153</pub-id></citation>
</ref>
<ref id="B25">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rebecq</surname> <given-names>H.</given-names></name> <name><surname>Gehrig</surname> <given-names>D.</given-names></name> <name><surname>Scaramuzza</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Esim: an open event camera simulator,&#x0201D;</article-title> in <source>Conference on Robot Learning</source>. Zurich, Switzerland: <volume>PMLR</volume>, <fpage>969</fpage>&#x02013;<lpage>982</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rebecq</surname> <given-names>H.</given-names></name> <name><surname>Ranftl</surname> <given-names>R.</given-names></name> <name><surname>Koltun</surname> <given-names>V.</given-names></name> <name><surname>Scaramuzza</surname> <given-names>D.</given-names></name></person-group> (<year>2019a</year>). <article-title>&#x0201C;Events-to-video: Bringing modern computer vision to event cameras,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>3857</fpage>&#x02013;<lpage>3866</lpage>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Rebecq</surname> <given-names>H.</given-names></name> <name><surname>Ranftl</surname> <given-names>R.</given-names></name> <name><surname>Koltun</surname> <given-names>V.</given-names></name> <name><surname>Scaramuzza</surname> <given-names>D.</given-names></name></person-group> (<year>2019b</year>). <article-title>High speed and high dynamic range video with an event camera</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell</source>. <volume>43</volume>, <fpage>1964</fpage>&#x02013;<lpage>1980</lpage>. <pub-id pub-id-type="doi">10.1109/TPAMI.2019.2963386</pub-id><pub-id pub-id-type="pmid">31902754</pub-id></citation></ref>
<ref id="B28">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Redmon</surname> <given-names>J.</given-names></name> <name><surname>Divvala</surname> <given-names>S.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Farhadi</surname> <given-names>A.</given-names></name></person-group> (<year>2016</year>). <article-title>&#x0201C;You only look once: Unified, real-time object detection,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>779</fpage>&#x02013;<lpage>788</lpage>.</citation>
</ref>
<ref id="B29">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Redmon</surname> <given-names>J.</given-names></name> <name><surname>Farhadi</surname> <given-names>A.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Yolo9000: better, faster, stronger,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>7263</fpage>&#x02013;<lpage>7271</lpage>.</citation>
</ref>
<ref id="B30">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Redmon</surname> <given-names>J.</given-names></name> <name><surname>Farhadi</surname> <given-names>A.</given-names></name></person-group> (<year>2018</year>). <article-title>Yolov3: An incremental improvement</article-title>. <source>arXiv preprint</source> arXiv:1804.02767.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Ren</surname> <given-names>S.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Sun</surname> <given-names>J.</given-names></name></person-group> (<year>2015</year>). <article-title>Faster r-cnn: towards real-time object detection with region proposal networks</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>28</volume>, <fpage>2015</fpage>.<pub-id pub-id-type="pmid">27295650</pub-id></citation></ref>
<ref id="B32">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Ronneberger</surname> <given-names>O.</given-names></name> <name><surname>Fischer</surname> <given-names>P.</given-names></name> <name><surname>Brox</surname> <given-names>T.</given-names></name></person-group> (<year>2015</year>). <article-title>&#x0201C;U-net: Convolutional networks for biomedical image segmentation,&#x0201D;</article-title> in <source>Medical Image Computing and Computer-Assisted Intervention-MICCAI 2015: 18th International Conference</source>. <publisher-loc>Munich, Germany</publisher-loc>: <publisher-name>Springer</publisher-name>.</citation>
</ref>
<ref id="B33">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Russakovsky</surname> <given-names>O.</given-names></name> <name><surname>Deng</surname> <given-names>J.</given-names></name> <name><surname>Su</surname> <given-names>H.</given-names></name> <name><surname>Krause</surname> <given-names>J.</given-names></name> <name><surname>Satheesh</surname> <given-names>S.</given-names></name> <name><surname>Ma</surname> <given-names>S.</given-names></name> <etal/></person-group>. (<year>2015</year>). <article-title>Imagenet large scale visual recognition challenge</article-title>. <source>Int. J. Comput. Vis</source>. <volume>115</volume>, <fpage>211</fpage>&#x02013;<lpage>252</lpage>. <pub-id pub-id-type="doi">10.1007/s11263-015-0816-y</pub-id></citation>
</ref>
<ref id="B34">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Scheerlinck</surname> <given-names>C.</given-names></name> <name><surname>Barnes</surname> <given-names>N.</given-names></name> <name><surname>Mahony</surname> <given-names>R.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Continuous-time intensity estimation using event cameras,&#x0201D;</article-title> in <source>Asian Conference on Computer Vision</source>. <publisher-loc>Cham</publisher-loc>: <publisher-name>Springer</publisher-name>, <fpage>308</fpage>&#x02013;<lpage>324</lpage>.</citation>
</ref>
<ref id="B35">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Scheerlinck</surname> <given-names>C.</given-names></name> <name><surname>Rebecq</surname> <given-names>H.</given-names></name> <name><surname>Gehrig</surname> <given-names>D.</given-names></name> <name><surname>Barnes</surname> <given-names>N.</given-names></name> <name><surname>Mahony</surname> <given-names>R.</given-names></name> <name><surname>Scaramuzza</surname> <given-names>D.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Fast image reconstruction with an event camera,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>, <fpage>156</fpage>&#x02013;<lpage>163</lpage>. <pub-id pub-id-type="doi">10.1109/WACV45572.2020.9093366</pub-id></citation>
</ref>
<ref id="B36">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shi</surname> <given-names>W.</given-names></name> <name><surname>Caballero</surname> <given-names>J.</given-names></name> <name><surname>Husz&#x000E1;r</surname> <given-names>F.</given-names></name> <name><surname>Totz</surname> <given-names>J.</given-names></name> <name><surname>Aitken</surname> <given-names>A. P.</given-names></name> <name><surname>Bishop</surname> <given-names>R.</given-names></name> <etal/></person-group>. (<year>2016</year>). <article-title>&#x0201C;Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>1874</fpage>&#x02013;<lpage>1883</lpage>.</citation>
</ref>
<ref id="B37">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Shi</surname> <given-names>X.</given-names></name> <name><surname>Chen</surname> <given-names>Z.</given-names></name> <name><surname>Wang</surname> <given-names>H.</given-names></name> <name><surname>Yeung</surname> <given-names>D.-Y.</given-names></name> <name><surname>Wong</surname> <given-names>W.-K.</given-names></name> <name><surname>Woo</surname> <given-names>W.</given-names></name></person-group> (<year>2015</year>). <article-title>Convolutional lstm network: a machine learning approach for precipitation nowcasting</article-title>. <source>Adv. Neural Inf. Process. Syst</source>. <volume>28</volume>.</citation>
</ref>
<ref id="B38">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Simonyan</surname> <given-names>K.</given-names></name> <name><surname>Zisserman</surname> <given-names>A.</given-names></name></person-group> (<year>2014</year>). <article-title>Very deep convolutional networks for large-scale image recognition</article-title>. <source>arXiv preprint</source> arXiv:1409.1556.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Sukhavasi</surname> <given-names>S. B.</given-names></name> <name><surname>Sukhavasi</surname> <given-names>S. B.</given-names></name> <name><surname>Elleithy</surname> <given-names>K.</given-names></name> <name><surname>Abuzneid</surname> <given-names>S.</given-names></name> <name><surname>Elleithy</surname> <given-names>A.</given-names></name></person-group> (<year>2021</year>). <article-title>Cmos image sensors in surveillance system applications</article-title>. <source>Sensors</source> <volume>21</volume>, <fpage>488</fpage>. <pub-id pub-id-type="doi">10.3390/s21020488</pub-id><pub-id pub-id-type="pmid">33445557</pub-id></citation></ref>
<ref id="B40">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Vidal</surname> <given-names>A. R.</given-names></name> <name><surname>Rebecq</surname> <given-names>H.</given-names></name> <name><surname>Horstschaefer</surname> <given-names>T.</given-names></name> <name><surname>Scaramuzza</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>Ultimate slam? combining events, images, and imu for robust visual slam in hdr and high-speed scenarios</article-title>. <source>IEEE Robot. Automat. Lett</source>. <volume>3</volume>, <fpage>994</fpage>&#x02013;<lpage>1001</lpage>. <pub-id pub-id-type="doi">10.1109/LRA.2018.2793357</pub-id></citation>
</ref>
<ref id="B41">
<citation citation-type="book"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>L.</given-names></name> <name><surname>Ho</surname> <given-names>Y. S.</given-names></name> <name><surname>Yoon</surname> <given-names>K.-J.</given-names></name></person-group> (<year>2019</year>). <article-title>&#x0201C;Event-based high dynamic range image and very high frame rate video generation using conditional generative adversarial networks,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>. <publisher-loc>Long Beach, CA</publisher-loc>: <publisher-name>IEEE</publisher-name>, <fpage>10081</fpage>&#x02013;<lpage>10090</lpage>.</citation>
</ref>
<ref id="B42">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Wang</surname> <given-names>Z.</given-names></name> <name><surname>Bovik</surname> <given-names>A. C.</given-names></name> <name><surname>Sheikh</surname> <given-names>H. R.</given-names></name> <name><surname>Simoncelli</surname> <given-names>E. P.</given-names></name></person-group> (<year>2004</year>). <article-title>Image quality assessment: from error visibility to structural similarity</article-title>. <source>IEEE Trans. Image Process</source>. <volume>13</volume>, <fpage>600</fpage>&#x02013;<lpage>612</lpage>. <pub-id pub-id-type="doi">10.1109/TIP.2003.819861</pub-id><pub-id pub-id-type="pmid">15376593</pub-id></citation></ref>
<ref id="B43">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xie</surname> <given-names>S.</given-names></name> <name><surname>Girshick</surname> <given-names>R.</given-names></name> <name><surname>Doll&#x000E1;r</surname> <given-names>P.</given-names></name> <name><surname>Tu</surname> <given-names>Z.</given-names></name> <name><surname>He</surname> <given-names>K.</given-names></name></person-group> (<year>2017</year>). <article-title>&#x0201C;Aggregated residual transformations for deep neural networks,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>1492</fpage>&#x02013;<lpage>1500</lpage>.</citation>
</ref>
<ref id="B44">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Xu</surname> <given-names>L.</given-names></name> <name><surname>Xu</surname> <given-names>W.</given-names></name> <name><surname>Golyanik</surname> <given-names>V.</given-names></name> <name><surname>Habermann</surname> <given-names>M.</given-names></name> <name><surname>Fang</surname> <given-names>L.</given-names></name> <name><surname>Theobalt</surname> <given-names>C.</given-names></name></person-group> (<year>2020</year>). <article-title>&#x0201C;Eventcap: Monocular 3d capture of high-speed human motions using an event camera,&#x0201D;</article-title> in <source>Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>, <fpage>4968</fpage>&#x02013;<lpage>4978</lpage>.</citation>
</ref>
<ref id="B45">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Yi</surname> <given-names>J.</given-names></name> <name><surname>Wu</surname> <given-names>P.</given-names></name> <name><surname>Metaxas</surname> <given-names>D. N.</given-names></name></person-group> (<year>2019</year>). <article-title>Assd: attentive single shot multibox detector</article-title>. <source>Comp. Vision Image Underst</source>. <volume>189</volume>, <fpage>102827</fpage>. <pub-id pub-id-type="doi">10.1016/j.cviu.2019.102827</pub-id></citation>
</ref>
<ref id="B46">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>R.</given-names></name> <name><surname>Isola</surname> <given-names>P.</given-names></name> <name><surname>Efros</surname> <given-names>A. A.</given-names></name> <name><surname>Shechtman</surname> <given-names>E.</given-names></name> <name><surname>Wang</surname> <given-names>O.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;The unreasonable effectiveness of deep features as a perceptual metric,&#x0201D;</article-title> in <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, <fpage>586</fpage>&#x02013;<lpage>595</lpage>.</citation>
</ref>
<ref id="B47">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhang</surname> <given-names>Y.</given-names></name> <name><surname>Guo</surname> <given-names>Z.</given-names></name> <name><surname>Wu</surname> <given-names>J.</given-names></name> <name><surname>Tian</surname> <given-names>Y.</given-names></name> <name><surname>Tang</surname> <given-names>H.</given-names></name> <name><surname>Guo</surname> <given-names>X.</given-names></name></person-group> (<year>2022</year>). <article-title>Real-time vehicle detection based on improved yolo v5</article-title>. <source>Sustainability</source> <volume>14</volume>, <fpage>12274</fpage>. <pub-id pub-id-type="doi">10.3390/su141912274</pub-id></citation>
</ref>
<ref id="B48">
<citation citation-type="journal"><person-group person-group-type="author"><name><surname>Zhou</surname> <given-names>Y.</given-names></name> <name><surname>Gallego</surname> <given-names>G.</given-names></name> <name><surname>Rebecq</surname> <given-names>H.</given-names></name> <name><surname>Kneip</surname> <given-names>L.</given-names></name> <name><surname>Li</surname> <given-names>H.</given-names></name> <name><surname>Scaramuzza</surname> <given-names>D.</given-names></name></person-group> (<year>2018</year>). <article-title>&#x0201C;Semi-dense 3D reconstruction with a stereo event camera,&#x0201D;</article-title> in <source>Proceedings of the European Conference on Computer Vision (ECCV)</source>, <fpage>235</fpage>&#x02013;<lpage>251</lpage>.</citation>
</ref>
</ref-list>
</back>
</article>