<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" article-type="research-article" dtd-version="2.3" xml:lang="EN">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Mar. Sci.</journal-id>
<journal-title>Frontiers in Marine Science</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Mar. Sci.</abbrev-journal-title>
<issn pub-type="epub">2296-7745</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="doi">10.3389/fmars.2023.1280708</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Marine Science</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Fast ship radiated noise recognition using three-dimensional mel-spectrograms with an additive attention based transformer</article-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Wang</surname>
<given-names>Yan</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/2410362"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-original-draft/"/>
<role content-type="https://credit.niso.org/contributor-roles/formal-analysis/"/>
<role content-type="https://credit.niso.org/contributor-roles/resources/"/>
<role content-type="https://credit.niso.org/contributor-roles/software/"/>
<role content-type="https://credit.niso.org/contributor-roles/visualization/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Zhang</surname>
<given-names>Hao</given-names>
</name>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
<role content-type="https://credit.niso.org/contributor-roles/conceptualization/"/>
<role content-type="https://credit.niso.org/contributor-roles/funding-acquisition/"/>
<role content-type="https://credit.niso.org/contributor-roles/methodology/"/>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Huang</surname>
<given-names>Wei</given-names>
</name>
<xref ref-type="author-notes" rid="fn001">
<sup>*</sup>
</xref>
<uri xlink:href="https://loop.frontiersin.org/people/2008456"/>
<role content-type="https://credit.niso.org/contributor-roles/writing-review-editing/"/>
<role content-type="https://credit.niso.org/contributor-roles/data-curation/"/>
<role content-type="https://credit.niso.org/contributor-roles/investigation/"/>
<role content-type="https://credit.niso.org/contributor-roles/validation/"/>
</contrib>
</contrib-group>
<aff id="aff1">
<institution>Department of Electrical Engineering, Ocean University of China</institution>, <addr-line>Qingdao</addr-line>, <country>China</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>Edited by: Haixin Sun, Xiamen University, China</p>
</fn>
<fn fn-type="edited-by">
<p>Reviewed by: Naveed Ur Rehman Junejo, University of Lahore, Pakistan; Zeyad Qasem, Peking University, China</p>
</fn>
<fn fn-type="corresp" id="fn001">
<p>*Correspondence: Hao Zhang, <email xlink:href="mailto:zhanghao@ouc.edu.cn">zhanghao@ouc.edu.cn</email>; Wei Huang, <email xlink:href="mailto:hw@ouc.edu.cn">hw@ouc.edu.cn</email>
</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>24</day>
<month>11</month>
<year>2023</year>
</pub-date>
<pub-date pub-type="collection">
<year>2023</year>
</pub-date>
<volume>10</volume>
<elocation-id>1280708</elocation-id>
<history>
<date date-type="received">
<day>21</day>
<month>08</month>
<year>2023</year>
</date>
<date date-type="accepted">
<day>06</day>
<month>11</month>
<year>2023</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2023 Wang, Zhang and Huang</copyright-statement>
<copyright-year>2023</copyright-year>
<copyright-holder>Wang, Zhang and Huang</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.</p>
</license>
</permissions>
<abstract>
<p>Passive recognition of ship-radiated noise plays a crucial role in military and economic domains. However, underwater environments pose significant challenges due to inherent noise, reverberation, and time-varying acoustic channels. This paper introduces a novel approach for ship target recognition and classification by leveraging the power of three-dimensional (3D) Mel-spectrograms and an additive attention based Transformer (ADDTr). The proposed method utilizes 3D Mel-spectrograms to capture the temporal variations in both target signal and ambient noise, thereby enhancing both categories&#x2019; distinguishable characteristics. By incorporating an additional spatial dimension, the modeling of reverberation effects becomes possible. Through analysis of spatial patterns and changes within the spectrograms, distortions caused by reverberation can be estimated and compensated, so that the clarity of the target signals can be improved. The proposed ADDTr leverages an additive attention mechanism to focus on informative acoustic features while suppressing the influence of noisy or distorted components. This attention-based approach not only enhances the discriminative power of the model but also accelerates the recognition process. It efficiently captures both temporal and spatial dependencies, enabling accurate analysis of complex acoustic signals and precise predictions. Comprehensive comparisons with state-of-the-art acoustic target recognition models on the ShipsEar dataset demonstrate the superiority of the proposed ADDTr approach. Achieving an accuracy of 96.82% with the lowest computation costs, ADDTr outperforms other models.</p>
</abstract>
<kwd-group>
<kwd>underwater acoustic target recognition</kwd>
<kwd>deep learning</kwd>
<kwd>additive attention based transformer</kwd>
<kwd>3D mel-spectrogram</kwd>
<kwd>ship radiated noise</kwd>
</kwd-group>
<counts>
<fig-count count="10"/>
<table-count count="3"/>
<equation-count count="13"/>
<ref-count count="53"/>
<page-count count="14"/>
<word-count count="7483"/>
</counts>
<custom-meta-wrap>
<custom-meta>
<meta-name>section-in-acceptance</meta-name>
<meta-value>Ocean Observation</meta-value>
</custom-meta>
</custom-meta-wrap>
</article-meta>
</front>
<body>
<sec id="s1" sec-type="intro">
<label>1</label>
<title>Introduction</title>
<p>Since ship-radiated noise stands as a prominent source of oceanic noise, its recognition is of crucial importance across diverse domains, such as maritime security, navigation, environmental monitoring, and ocean research. However, the underwater environment is a challenging domain for passive target recognition. The performance is predominantly influenced by the presence of ambient noise interference, the time-varying acoustic channel, and the impact of reverberation. Additionally, ship-radiated noise is the result of vibrations from various ship components and possesses a relatively complex generation mechanism. It primarily involves mechanical noise, propeller noise, and hydrodynamic noise (<xref ref-type="bibr" rid="B21">Li and Yang, 2021</xref>). Hence, ship target recognition is a challenging task.</p>
<p>Feature extraction methods, such as the short-time Fourier transform (STFT) (<xref ref-type="bibr" rid="B13">Gabor, 1946</xref>), the discrete wavelet transform (DWT) (<xref ref-type="bibr" rid="B27">Mallat, 1989</xref>), the Hilbert&#x2013;Huang transform (<xref ref-type="bibr" rid="B49">Yu et&#xa0;al., 2016</xref>), and the limit cycle (<xref ref-type="bibr" rid="B15">Goldobin et&#xa0;al., 2010</xref>), have been proven to be simple yet effective in acoustic signal processing (<xref ref-type="bibr" rid="B51">Zeng and Wang, 2014</xref>; <xref ref-type="bibr" rid="B25">Liu et&#xa0;al., 2017</xref>; <xref ref-type="bibr" rid="B22">Li et&#xa0;al., 2021</xref>; <xref ref-type="bibr" rid="B39">Tuncer et&#xa0;al., 2021</xref>). These methods mainly focus on time-domain features and have succeeded due to the assumption of a homogeneous propagation environment, such as air, where the frequency characteristics of received signals remain constant over time (<xref ref-type="bibr" rid="B31">Salomons and Havinga, 2015</xref>). However, the underwater propagation environment is completely inhomogeneous in both time and space. Consequently, the amplitude and phase of received signals undergo changes with time and space (<xref ref-type="bibr" rid="B26">Lurton, 2010</xref>).</p>
<p>Mutual time-frequency feature extraction methods, including time-scale decomposition (<xref ref-type="bibr" rid="B11">Frei and Osorio, 2007</xref>), resonance-based sparse signal decomposition (<xref ref-type="bibr" rid="B33">Selesnick, 2011</xref>), multiresolution signal decomposition (<xref ref-type="bibr" rid="B27">Mallat, 1989</xref>), Mel-spectrogram (<xref ref-type="bibr" rid="B17">Hermansky, 1980</xref>), and adaptive sparse non-negative matrix factorization (<xref ref-type="bibr" rid="B18">Jia et&#xa0;al., 2021</xref>), have shown improved performance in signal analysis (<xref ref-type="bibr" rid="B41">Virtanen and Cemgil, 2009</xref>; <xref ref-type="bibr" rid="B14">Gao et&#xa0;al., 2014</xref>; <xref ref-type="bibr" rid="B42">Wang and Chen, 2019</xref>; <xref ref-type="bibr" rid="B28">Monaco et&#xa0;al., 2020</xref>). However, these conventional techniques often focus on stationary signals or specific signal properties (<xref ref-type="bibr" rid="B36">Su et&#xa0;al., 2020</xref>). Unfortunately, underwater ship-radiated noise signals are non-stationary and highly dependent on factors like ship speed, depth, and distance from the receiver. As a result, the accuracy will decrease and their application will be limited.</p>
<p>Multi-stage feature extraction methods have been proposed to mitigate the mentioned limitations. For example, the resonance-based time-frequency manifold (RTFM) (<xref ref-type="bibr" rid="B45">Yan et&#xa0;al., 2018</xref>) combines sparse signal decomposition and a time-frequency manifold to extract oscillatory information and mitigate noise. Additionally, <xref ref-type="bibr" rid="B8">Esmaiel et&#xa0;al. (2021)</xref> combine enhanced variational mode decomposition, weighted permutation entropy, local tangent space alignment, and particle swarm optimization-based support vector machine to improve ship-radiated noise feature extraction in passive sonar. <xref ref-type="bibr" rid="B52">Zhang et&#xa0;al. (2020)</xref> combine adaptive variational mode decomposition and Wigner-Ville Distribution to accurately extract local features and construct time-frequency images.</p>
<p>Inspired by the multi-stage methods, this paper introduces a feature extraction approach that combines Mel-spectrogram with temporal derivative analysis stage by stage. The generated multi-dimensional Mel-spectrograms can effectively capture the temporal variations of both the target signals and the ambient noise. Consequently, the unique characteristics of these signals become more distinguishable. Furthermore, the inclusion of an additional spatial dimension allows for the modeling of reverberation effects, enhancing the overall feature representation.</p>
<p>Previous studies have demonstrated the application of statistical classifiers in the field, showcasing notable achievements (<xref ref-type="bibr" rid="B10">Filho et&#xa0;al., 2011</xref>; <xref ref-type="bibr" rid="B46">Yang et&#xa0;al., 2016</xref>; <xref ref-type="bibr" rid="B38">Tong et&#xa0;al., 2020</xref>). However, achieving promising results often requires sophisticated feature engineering. Furthermore, this kind of approach entails a relatively complex process of partitioning the problem into multiple subsections and then accumulating the results (<xref ref-type="bibr" rid="B19">Khishe, 2022</xref>).</p>
<p>Deep learning has opened up new possibilities for ship-radiated noise recognition. One of the greatest advantages is that relevant features from the acoustic signal can be automatically extracted. In (<xref ref-type="bibr" rid="B29">Purwins et&#xa0;al., 2019</xref>), a multilayer perceptron (MLP) based algorithm successfully recognizes underwater acoustic radiated noise. Several studies (<xref ref-type="bibr" rid="B47">Yang et&#xa0;al., 1104</xref>; <xref ref-type="bibr" rid="B34">Shen et&#xa0;al., 2018</xref>; <xref ref-type="bibr" rid="B53">Zhao et&#xa0;al., 2019</xref>; <xref ref-type="bibr" rid="B6">Doan et&#xa0;al., 2020</xref>) demonstrate that a convolutional neural network (CNN) based model can model the original signal waveform directly and excels at capturing local spatial patterns. However, <xref ref-type="bibr" rid="B48">Yang et&#xa0;al. (2020)</xref> point out a limitation of CNNs in their ability to effectively capture the input data&#x2019;s long-range dependencies. The authors address the limitation by employing recurrent neural network (RNN) units to learn the temporal dependencies. By doing so, the classification accuracy is improved.</p>
<p>The Transformer framework was originally introduced in the field of natural language processing with the primary goals of reducing training time and effectively capturing long-range dependencies (<xref ref-type="bibr" rid="B40">Vaswani et&#xa0;al., 2017</xref>; <xref ref-type="bibr" rid="B5">Devlin et&#xa0;al., 2018</xref>; <xref ref-type="bibr" rid="B2">Brown et&#xa0;al., 2020</xref>). Unlike the RNN, the Transformer is a non-sequential architecture that does not rely on past hidden states, allowing for stronger global computation abilities and perfect memory capacity. The Transformer framework has demonstrated exceptional efficiency and outstanding performance in denoising and recognizing underwater acoustic signals (<xref ref-type="bibr" rid="B9">Feng and Zhu, 2022</xref>; <xref ref-type="bibr" rid="B23">Li et&#xa0;al., 2022</xref>; <xref ref-type="bibr" rid="B35">Song et&#xa0;al., 2022</xref>), despite being relatively new to ship-radiated noise recognition.</p>
<p>Within the Transformer, the self-attention mechanism enables global interactions between all positions in the input sequence, free from the limitations imposed by localized receptive fields and temporal/spatial distance. However, the self-attention mechanism employed by the Transformer has quadratic complexity with respect to the input length, resulting in wasted computational resources and inefficiency. Many studies have focused on accelerating the Transformer model (<xref ref-type="bibr" rid="B1">Beltagy et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B20">Kitaev et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B43">Wang et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B50">Zaheer et&#xa0;al., 2020</xref>; <xref ref-type="bibr" rid="B37">Tay et&#xa0;al., 2021</xref>), but they usually suffer from insufficient modeling of either global or local information (<xref ref-type="bibr" rid="B44">Wu et&#xa0;al., 2021</xref>).</p>
<p>To balance both modeling efficiency and modeling capability, we propose an efficient variant of the Transformer for ship-radiated noise recognition. This variant incorporates an additive attention mechanism rather than a self-attention mechanism, resulting in linear computational complexity. It also effectively addresses challenges present in the acoustic signal data received from the real ocean environment, including ambient noise interference and reverberation distortion. By doing so, the performance of ship-radiated noise recognition tasks is significantly enhanced, enabling more accurate and reliable results.</p>
<p>
<xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref> provides a comprehensive overview of the proposed model&#x2019;s technological process, encompassing three key stages: patching, embedding, and classification. In the subsequent section, each stage will be elaborated upon in detail.</p>
<fig id="f1" position="float">
<label>Figure&#xa0;1</label>
<caption>
<p>The overall technological process of the proposed model for acoustic signal recognition. The right side of the dotted line provides a detailed illustration of a single Transformer encoder.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-10-1280708-g001.tif"/>
</fig>
<p>The contributions in this paper can be summarized as:</p>
<list list-type="simple">
<list-item>
<p>1. In order to address the performance degradation resulting from long-term dependencies and noisy input data, we introduce an additive attention based Transformer approach, ADDTr. By utilizing the attention mechanism, our model can automatically assign higher importance to relevant information frames, thereby enabling improved modeling of spectral dependencies and capturing critical local dependencies.</p>
</list-item>
<list-item>
<p>2. In order to enhance both the modeling efficiency and modeling capability of the Transformer framework, we propose an additive attention mechanism that replaces the traditional self-attention mechanism. This substitution enables direct modeling of the interaction between global information and local frame representations. Hence it enables the model to attain attention scores with linear computational complexity, without sacrificing the modeling capacity of both global and local information.</p>
</list-item>
<list-item>
<p>3. In order to generate a more comprehensive feature representation of acoustic signals, we propose to use three-dimensional Mel-spectrograms, which are obtained by concatenating the delta and delta-delta features with the Mel-spectrogram. This approach facilitates the estimation and compensation of distortions caused by reverberation, thereby enhancing the clarity of the target signals.</p>
</list-item>
</list>
<p>The rest of the paper is structured as follows. Section II describes the methodology of feature extraction and the proposed neural network in detail. Section III presents the dataset used in the paper and the analyses of the experimental results. Finally, conclusions are given in Section IV.</p>
</sec>
<sec id="s2">
<label>2</label>
<title>Methodology</title>
<sec id="s2_1">
<label>2.1</label>
<title>System overview</title>
<p>In acoustic signal analysis, Mel-spectrograms are often adopted to extract relevant acoustic features that can be used as input for machine learning models. However, for accurate acoustic data classification, Mel-spectrograms themselves cannot provide enough information. They lack the incorporation of temporal dynamics and have a fixed resolution that may not capture fine details in complex scenes. Thus, they may not fully represent important acoustic characteristics such as spatial distribution and temporal evolution. To tackle these issues, we propose an approach to generate a more comprehensive feature representation by incorporating additional temporal and spatial dimensions with the original Mel-spectrograms. This is achieved by concatenating the delta features and the delta-delta features.</p>
<p>To reduce the negative impacts of irregular ocean noise interference, reverberation distortion, and the inherent deficiencies of traditional deep learning frameworks on ship target recognition accuracy, we propose a novel ADDTr model.</p>
<p>
<xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref> illustrates the overall process for handling three-dimensional Mel-spectrograms in the model. Initially, the input data undergoes a patching stage where the enriched Mel-spectrogram is flattened and divided into fixed-sized patches. Subsequently, in the embedding stage, the sequence of patches is augmented with a position embedding tensor that captures spatial information and a class token that summarizes the global information of the Mel-spectrogram. During the classification stage, the encoders utilize additive attention to dynamically prioritize essential information for accurate target recognition. Finally, the output from the Transformer encoder is passed to a classification head, enabling the input data to be classified into the appropriate category.</p>
<p>The architecture of ADDTr is inspired by the Vision Transformer (<xref ref-type="bibr" rid="B7">Dosovitskiy et&#xa0;al., 2020</xref>), with a notable modification. Instead of the traditional dot-product-based self-attention mechanism, ADDTr incorporates an innovative additive attention mechanism. This modification improves the efficiency and accelerates the computational speed of the model. More details are provided in Subsection 2.3.</p>
</sec>
<sec id="s2_2">
<label>2.2</label>
<title>Feature extraction</title>
<p>In the dataset, each recorded ship-radiated noise is stored as a one-dimensional array based on the audio length and sampling rate. To extract informative feature representations from the raw data, Mel-spectrograms are commonly used. However, Mel-spectrograms alone can only capture the static characteristics of the signal, limiting their ability to capture essential temporal dynamics for accurate feature extraction. To solve this problem, we propose three-dimensional Mel-spectrograms. By incorporating the dynamic characteristics of the signal, the resulting feature representations become more comprehensive, thereby enhancing robustness. <xref ref-type="fig" rid="f2">
<bold>Figure&#xa0;2</bold>
</xref> illustrates the process of extracting a three-dimensional Mel-spectrogram.</p>
<fig id="f2" position="float">
<label>Figure&#xa0;2</label>
<caption>
<p>Block diagram of the 3-D Mel spectrogram formation process. The process can be divided into three main parts. First, the original signal undergoes pre-emphasis, frame blocking, and windowing as a pre-processing step. Then, the Mel-spectrogram is extracted by performing operations such as N-point fast Fourier transform (FFT), squaring, cumulative sum, Mel-filter bank application, and logarithm. In the end, the delta and delta-delta features are obtained by calculating the temporal derivative with consecutive frames.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-10-1280708-g002.tif"/>
</fig>
<p>During the extraction process, the acoustic signal is initially subjected to a pre-emphasis filter. This filter plays a crucial role in equalizing the frequency spectrum of the signal by amplifying the amplitudes of higher-frequency components. This amplification is particularly beneficial as higher-frequency components tend to exhibit lower levels of noise in comparison to their lower-frequency counterparts. By mitigating the natural attenuation of high frequencies, the pre-emphasis filter effectively restores the balance of the frequency spectrum. As a result, the clarity of the signal is enhanced and the prominence of noise is diminished, thereby improving the overall quality of the raw data.</p>
<p>The following Fourier transformation constitutes a fundamental step in the conversion of acoustic signals into Mel-spectrograms, as it enables the analysis of frequency content. However, a direct application of the Fourier transform to the entire signal often leads to adverse effects, such as the generation of nonsensical results and the obliteration of the underlying frequency characteristics. It is widely acknowledged that the frequencies present in a signal tend to remain stationary over brief temporal windows. Accordingly, the frequency characteristics can be accurately captured by combining the outcomes of Fourier transform from neighboring frames. To minimize intra-frame fluctuations, a small frame size is commonly employed, typically on the order of milliseconds. Hence, in this paper, a frame size of 25ms is adopted for ship-radiated noise analysis, with feature aggregation conducted over a temporal interval of 1 second.</p>
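<p>For illustration, the pre-emphasis and framing steps described above can be sketched in Python as follows. This is a minimal numpy sketch rather than the authors&#x2019; implementation: the pre-emphasis coefficient of 0.97 and the hop length are assumptions (the hop is chosen so that one second of audio yields roughly the 32 frames used for the spectrograms later in this section).</p>
<preformat>import numpy as np

def preemphasis(x, coeff=0.97):
    # Boost high-frequency components: y[n] = x[n] - coeff * x[n-1].
    return np.append(x[0], x[1:] - coeff * x[:-1])

def frame_signal(x, sr=22050, frame_ms=25, hop_ms=31):
    # Split the signal into short frames within which the spectrum
    # can be treated as approximately stationary.
    frame_len = int(sr * frame_ms / 1000)
    hop_len = int(sr * hop_ms / 1000)
    n_frames = 1 + (len(x) - frame_len) // hop_len
    idx = (np.arange(frame_len)[None, :]
           + hop_len * np.arange(n_frames)[:, None])
    return x[idx]          # shape: (n_frames, frame_len)</preformat>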
<p>Spectral leakage occurs when the signal does not have an integer number of cycles within the chosen window length for the Fourier transform. To counteract spectral leakage and faithfully preserve the frequency characteristics inherent in the acoustic data, a Hanning window is incorporated into the methodology. It gently tapers the signal&#x2019;s edges, thereby mitigating the adverse effects of spectral leakage and enhancing frequency resolution. The power spectrum is subsequently computed using the equation:</p>
<disp-formula>
<label>(1)</label>
<mml:math display="block" id="M1">
<mml:mrow>
<mml:mi>P</mml:mi>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>|</mml:mo>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>F</mml:mi>
<mml:mi>T</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>|</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im1">
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mi>F</mml:mi>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> stands for N-point fast Fourier Transform, and <inline-formula>
<mml:math display="inline" id="im2">
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:math>
</inline-formula> is the <inline-formula>
<mml:math display="inline" id="im3">
<mml:mi>i</mml:mi>
</mml:math>
</inline-formula>th frame of signal <inline-formula>
<mml:math display="inline" id="im4">
<mml:mi>x</mml:mi>
</mml:math>
</inline-formula>. Subsequently, the power spectrum is subjected to the Mel filter bank consisting of 128 bins to extract the Mel-spectrogram. The choice of 128 bins is justified by its being a power of 2, which facilitates efficient computations within the neural network architecture. The Mel scale employed in this process is intentionally designed to exhibit higher resolution at lower frequencies while being less discriminative at higher frequencies. The conversion between Hertz<inline-formula>
<mml:math display="inline" id="im5">
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>f</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and Mel<inline-formula>
<mml:math display="inline" id="im6">
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> can be accomplished through the utilization of the following equations:</p>
<disp-formula>
<label>(2)</label>
<mml:math display="block" id="M2">
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2595</mml:mn>
<mml:msub>
<mml:mrow>
<mml:mi>log</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:msub>
<mml:mo stretchy="false">(</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo>+</mml:mo>
<mml:mfrac>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mn>700</mml:mn>
</mml:mrow>
</mml:mfrac>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula>
<label>(3)</label>
<mml:math display="block" id="M3">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>700</mml:mn>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">/</mml:mo>
<mml:mn>2595</mml:mn>
</mml:mrow>
</mml:msup>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
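<p>Equations (1)&#x2013;(3) can be made concrete with the following sketch, which continues the previous one: it windows each frame, computes the power spectrum, and converts between Hertz and Mel. The FFT size of 1024 points is an assumed value; the paper does not specify N.</p>
<preformat>def power_spectrum(frames, n_fft=1024):
    # Equation (1): apply a Hanning window to each frame, take the
    # N-point FFT, and normalize the squared magnitude by N.
    windowed = frames * np.hanning(frames.shape[1])
    spectrum = np.fft.rfft(windowed, n=n_fft)
    return (np.abs(spectrum) ** 2) / n_fft

def hz_to_mel(f):
    # Equation (2)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    # Equation (3)
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)</preformat>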
<p>The filter bank consists of triangular filters characterized by a response of 1 at their center frequencies. From the center frequency, the response linearly diminishes until it reaches 0 at the center frequencies of the two adjacent filters. This triangular response profile ensures that the filter bank can capture the frequency content of the signal in a localized manner, with higher sensitivity around the center frequencies and reduced sensitivity towards the neighboring frequencies. By employing such triangular filters, the Mel filter bank effectively partitions the frequency spectrum into distinct frequency bands, facilitating the extraction of relevant information for the subsequent generation of Mel-spectrograms. The process can be expressed by the following equation:</p>
<disp-formula>
<label>(4)</label>
<mml:math display="block" id="M4">
<mml:mrow>
<mml:msub>
<mml:mi>H</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&lt;</mml:mo>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>m</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mo>&#xa0;</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&gt;</mml:mo>
<mml:mi>L</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr> </mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im7">
<mml:mi>m</mml:mi>
</mml:math>
</inline-formula> is the filter index, and <inline-formula>
<mml:math display="inline" id="im8">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> is the list of Mel-spaced frequencies. <inline-formula>
<mml:math display="inline" id="im9">
<mml:mrow>
<mml:msub>
<mml:mi>H</mml:mi>
<mml:mi>m</mml:mi>
</mml:msub>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> is the <inline-formula>
<mml:math display="inline" id="im10">
<mml:mi>k</mml:mi>
</mml:math>
</inline-formula>th coefficient of the <italic>m</italic>th filter.</p>
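<p>A vectorized sketch of the triangular filters in Equation (4) is given below, reusing the hz_to_mel and mel_to_hz helpers from the previous sketch. Defining the filters over the FFT bin centre frequencies (rather than rounded bin indices) is an implementation choice made here for simplicity.</p>
<preformat>def mel_filter_bank(n_mels=128, n_fft=1024, sr=22050):
    # Mel-spaced edge frequencies L(0), ..., L(n_mels + 1), in Hz.
    mel_edges = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    hz_edges = mel_to_hz(mel_edges)
    # Centre frequency (Hz) of every FFT bin.
    fft_freqs = np.linspace(0.0, sr / 2.0, n_fft // 2 + 1)
    H = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        rise = (fft_freqs - hz_edges[m - 1]) / (hz_edges[m] - hz_edges[m - 1])
        fall = (hz_edges[m + 1] - fft_freqs) / (hz_edges[m + 1] - hz_edges[m])
        # Equation (4): the response rises from 0 to 1 on [L(m-1), L(m)]
        # and falls back to 0 on [L(m), L(m+1)]; it is 0 elsewhere.
        H[m - 1] = np.maximum(0.0, np.minimum(rise, fall))
    return H

# Log-Mel spectrogram: apply the filter bank, then take the logarithm.
# mel_spec = np.log(power_spec @ H.T + 1e-10)</preformat>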
<p>The coefficients obtained from the previous steps, known as static coefficients, exhibit a high degree of correlation and reflect the static characteristics of the signal. However, to capture the dynamic characteristics of the target, this paper incorporates additional features in the form of delta spectrograms and delta-delta spectrograms. The additional features are obtained by utilizing the following equation:</p>
<disp-formula>
<label>(5)</label>
<mml:math display="block" id="M5">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:msubsup>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:msubsup>
<mml:msup>
<mml:mi>n</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im11">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mi>t</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the delta coefficient of frame t, computed in terms of the static coefficients <inline-formula>
<mml:math display="inline" id="im12">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>+</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im13">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mrow>
<mml:mi>t</mml:mi>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. Setting <inline-formula>
<mml:math display="inline" id="im14">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>, the delta-delta coefficients can be calculated by applying the same equation to the delta coefficients. By analyzing the variations between adjacent frames, these dynamic features provide valuable information about the temporal changes in the signal. By including delta and delta-delta spectrograms, the model becomes capable of capturing and utilizing the evolving patterns and trends present in the acoustic data. This enhancement significantly improves the overall representation of the data, leading to more effective recognition and analysis of ship-radiated noise.</p>
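<p>Equation (5) can be implemented directly as shown below, with N = 2; padding the edge frames by repetition is a common convention that the paper does not specify.</p>
<preformat>def delta(coeffs, N=2):
    # Equation (5), applied along the time axis; coeffs has shape
    # (n_mels, n_frames). Re-applying the function to the delta
    # features yields the delta-delta features.
    denom = 2.0 * sum(n ** 2 for n in range(1, N + 1))
    padded = np.pad(coeffs, ((0, 0), (N, N)), mode="edge")
    d = np.zeros(coeffs.shape)
    for n in range(1, N + 1):
        c_plus = padded[:, N + n:padded.shape[1] - N + n]    # c_{t+n}
        c_minus = padded[:, N - n:padded.shape[1] - N - n]   # c_{t-n}
        d += n * (c_plus - c_minus)
    return d / denom</preformat>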
<p>In this paper, the sampling rate for each audio record is 22050 Hz. Hence, a one-second signal can generate a three-dimensional Mel-spectrogram with the size of <inline-formula>
<mml:math display="inline" id="im15">
<mml:mrow>
<mml:mn>128</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>32</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>3</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. <xref ref-type="fig" rid="f3">
<bold>Figure&#xa0;3</bold>
</xref> represents an original ship-radiated noise signal and its corresponding three-dimensional Mel-spectrogram.</p>
<fig id="f3" position="float">
<label>Figure&#xa0;3</label>
<caption>
<p>The original signal and its corresponding 3D Mel-spectrogram.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-10-1280708-g003.tif"/>
</fig>
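<p>For reference, an equivalent three-dimensional Mel-spectrogram can also be produced with the librosa library, as sketched below. The window and hop lengths are assumptions chosen to approximate the 128 &#xd7; 32 &#xd7; 3 shape reported above, and the file name is hypothetical.</p>
<preformat>import numpy as np
import librosa

def three_d_mel(path, sr=22050, n_mels=128):
    # Load one second of audio at the paper's 22050 Hz sampling rate.
    y, sr = librosa.load(path, sr=sr, duration=1.0)
    # 25 ms analysis window; hop chosen so one second gives about 32 frames.
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, win_length=int(0.025 * sr),
        hop_length=sr // 32, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)
    d1 = librosa.feature.delta(log_mel, order=1)   # delta features
    d2 = librosa.feature.delta(log_mel, order=2)   # delta-delta features
    return np.stack([log_mel, d1, d2], axis=-1)    # shape: (n_mels, T, 3)

# X = three_d_mel("ship_recording.wav")            # hypothetical file name</preformat>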
</sec>
<sec id="s2_3">
<label>2.3</label>
<title>Model architecture</title>
<p>ADDTr adopts the Transformer framework, which operates on input data represented as a one-dimensional sequence of embedded patches. In order to handle three-dimensional Mel-spectrograms denoted as <inline-formula>
<mml:math display="inline" id="im16">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>X</mml:mi>
</mml:mstyle>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211d;</mml:mi>
<mml:mrow>
<mml:mi>F</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, with <inline-formula>
<mml:math display="inline" id="im17">
<mml:mi>F</mml:mi>
</mml:math>
</inline-formula> representing the number of Mel filter bins, <inline-formula>
<mml:math display="inline" id="im18">
<mml:mi>T</mml:mi>
</mml:math>
</inline-formula> denoting the time dimension, <inline-formula>
<mml:math display="inline" id="im19">
<mml:mi>C</mml:mi>
</mml:math>
</inline-formula> indicating the number of spectrogram channels, and <inline-formula>
<mml:math display="inline" id="im20">
<mml:mi>&#x211d;</mml:mi>
</mml:math>
</inline-formula> standing for the real number space, the model initiates a patching stage, as illustrated in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>. In this stage, a trainable linear projection is utilized to transform the Mel-spectrograms. This projection reshapes the spectrograms into sequences of patches denoted as <inline-formula>
<mml:math display="inline" id="im21">
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>p</mml:mi>
</mml:mstyle>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211d;</mml:mi>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>H</mml:mi>
</mml:msub>
<mml:mo>&#xb7;</mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>W</mml:mi>
</mml:msub>
<mml:mo>&#xb7;</mml:mo>
<mml:msub>
<mml:mi>K</mml:mi>
<mml:mi>o</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula>
<mml:math display="inline" id="im22">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>H</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im23">
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>W</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> correspond to the height and width of each patch, which are typically set to be equivalent. The parameter <inline-formula>
<mml:math display="inline" id="im24">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>H</mml:mi>
</mml:msub>
<mml:mo>&#xb7;</mml:mo>
<mml:msub>
<mml:mi>X</mml:mi>
<mml:mi>W</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo stretchy="false">/</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>H</mml:mi>
</mml:msub>
<mml:mo>&#xb7;</mml:mo>
<mml:msub>
<mml:mi>P</mml:mi>
<mml:mi>W</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> represents the total number of patches, and serves as the effective input sequence length for the Transformer.</p>
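<p>The patching stage amounts to cutting the spectrogram into non-overlapping patches and flattening each one, as in the numpy sketch below. The patch size P = 8 is an assumption for illustration; with a 128 &#xd7; 32 &#xd7; 3 input it gives N = (128 &#xb7; 32)/(8 &#xb7; 8) = 64 patches.</p>
<preformat>def patchify(X, P=8):
    # X: 3D Mel-spectrogram of shape (F, T, C), e.g. (128, 32, 3).
    F, T, C = X.shape
    # Cut into non-overlapping P x P patches and flatten each patch
    # into a vector of length P * P * C.
    patches = (X.reshape(F // P, P, T // P, P, C)
                .transpose(0, 2, 1, 3, 4)
                .reshape(-1, P * P * C))
    return patches        # shape: (N, P * P * C), with N = (F * T) / (P * P)</preformat>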
<p>Within the Transformer, a constant latent vector size D is used across all layers. Then, the patches are flattened and transformed to D dimensions using another trainable linear projection known as the patch embedding, denoted as <inline-formula>
<mml:math display="inline" id="im25">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>E</mml:mi>
</mml:mstyle>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211d;</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi>P</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mo>&#xb7;</mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>. This mapping ensures that the patches are represented in a suitable format for subsequent processing within the Transformer layers.</p>
<p>The model proceeds to the second stage, known as embedding. Based on the approach described in (<xref ref-type="bibr" rid="B5">Devlin et&#xa0;al., 2018</xref>), our model first incorporates a learnable class token that is inserted at the beginning of the sequence of the flattened patches. This class token serves as a representation of the spectrogram. By consistently placing it at the start of the sequence, the Transformer encoder can easily locate and utilize this token without the need to search the entire sequence. This design choice ensures that the model can effectively capture and utilize the global information present in the spectrogram representation.</p>
<p>The model then incorporates a learnable position embedding tensor, denoted as <inline-formula>
<mml:math display="inline" id="im26">
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>E</mml:mi>
</mml:mstyle>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211d;</mml:mi>
<mml:mrow>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>+</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, during the embedding stage. The tensor is added to the patch sequence and enables the model to effectively capture the positional information of each patch within the original spectrogram. By including this positional information, the model can better preserve the higher-dimensional context of the input feature map, even when it undergoes dimensionality reduction, reshaping, and segmentation. This ensures that the model retains crucial spatial information during the subsequent processing stages. The whole process can be expressed as:</p>
<disp-formula>
<label>(6)</label>
<mml:math display="block" id="M6">
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mn>0</mml:mn>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>;</mml:mo>
<mml:msubsup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mi>p</mml:mi>
<mml:mn>1</mml:mn>
</mml:msubsup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>E</mml:mi>
</mml:mstyle>
<mml:mo>;</mml:mo>
<mml:msubsup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mi>p</mml:mi>
<mml:mn>2</mml:mn>
</mml:msubsup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>E</mml:mi>
</mml:mstyle>
<mml:mo>;</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>;</mml:mo>
<mml:msubsup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mi>p</mml:mi>
<mml:mi>N</mml:mi>
</mml:msubsup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>E</mml:mi>
</mml:mstyle>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>E</mml:mi>
</mml:mstyle>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
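<p>The embedding stage of Equation (6) can be sketched as follows, continuing the numpy sketches above. The random matrices stand in for the learned patch-embedding projection E, the class token, and the position embedding E_pos; the latent size D = 192 is an illustrative value, not one reported in this paper.</p>
<preformat>rng = np.random.default_rng(0)
D = 192                                   # illustrative latent vector size

def embed(patches, E, z_class, E_pos):
    # patches: (N, P*P*C); E: (P*P*C, D); z_class: (1, D); E_pos: (N + 1, D).
    tokens = patches @ E                  # linear projection of each patch
    z0 = np.concatenate([z_class, tokens], axis=0)   # prepend the class token
    return z0 + E_pos                     # Equation (6)

X = rng.standard_normal((128, 32, 3))     # stand-in for a 3D Mel-spectrogram
patches = patchify(X)                     # from the previous sketch
E = rng.standard_normal((patches.shape[1], D))
z_class = rng.standard_normal((1, D))
E_pos = rng.standard_normal((patches.shape[0] + 1, D))
Z0 = embed(patches, E, z_class, E_pos)    # shape: (N + 1, D)</preformat>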
<p>In the subsequent step, the model extracts more abstract features from the embedded patches through a series of encoder layers. As illustrated in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>, each encoder layer follows the same architecture, consisting of an attention layer, a feed-forward MLP layer, and a normalization layer (LN) positioned in between. By incorporating the attention mechanism, the proposed model gains the ability to automatically assign higher importance to relevant information frames within the input sequence. This allows for enhanced modeling of spectral dependencies and the capture of critical local dependencies. Consequently, the model becomes more resilient to the interference of ambient noise present in the raw data. By selectively concentrating on relevant features and suppressing irrelevant ones, the model can effectively filter out noise and focus on the salient aspects of the acoustic signals, leading to improved performance in the presence of challenging environmental conditions.</p>
<p>Different layers in a Transformer encoder are interconnected by residual connections, which effectively alleviate the vanishing gradient problem during back-propagation and ensure the preservation of the learned information. Additionally, the weight matrices employed in the proposed attention mechanism are protected from degeneration, ensuring their effectiveness throughout the learning process.</p>
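<p>A single encoder layer can therefore be sketched as below. A pre-norm arrangement (normalization before each sub-layer), as in the Vision Transformer on which ADDTr is based, is assumed here rather than taken from the paper; the attention and MLP sub-layers are passed in as callables.</p>
<preformat>def layer_norm(z, eps=1e-6):
    mu = z.mean(axis=-1, keepdims=True)
    sigma = z.std(axis=-1, keepdims=True)
    return (z - mu) / (sigma + eps)

def encoder_layer(z, attention, mlp):
    # Residual connections around the attention and feed-forward
    # sub-layers preserve learned information and ease gradient flow.
    z = z + attention(layer_norm(z))
    z = z + mlp(layer_norm(z))
    return z</preformat>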
<p>The detailed computing process of the proposed additive attention mechanism is depicted in <xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref>. The input <inline-formula>
<mml:math display="inline" id="im27">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211d;</mml:mi>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, derived from the embedding stage, is initially split into query, key, and value matrices by utilizing three independent linear transformation layers. The generated query matrix (<inline-formula>
<mml:math display="inline" id="im28">
<mml:mtext mathvariant="bold">Q</mml:mtext>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mo>&#x211d;</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula> ), key matrix (<inline-formula>
<mml:math display="inline" id="im29">
<mml:mtext mathvariant="bold">K</mml:mtext>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mo>&#x211d;</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:msup>
</mml:math>
</inline-formula>), and value matrix (<inline-formula>
<mml:math display="inline" id="im28b">
<mml:mrow>
<mml:mtext mathvariant="bold">V</mml:mtext>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mo>&#x211d;</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>) are written as <inline-formula>
<mml:math display="inline" id="im30">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Q</mml:mi>
</mml:mstyle>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math display="inline" id="im31">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>K</mml:mi>
</mml:mstyle>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>k</mml:mi>
</mml:mstyle>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>k</mml:mi>
</mml:mstyle>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>k</mml:mi>
</mml:mstyle>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> , and <inline-formula>
<mml:math display="inline" id="im32">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>V</mml:mi>
</mml:mstyle>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>v</mml:mi>
</mml:mstyle>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>v</mml:mi>
</mml:mstyle>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x22ef;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>v</mml:mi>
</mml:mstyle>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, respectively.</p>
<fig id="f4" position="float">
<label>Figure&#xa0;4</label>
<caption>
<p>The technical process of the additive attention mechanism. It first transforms the input into query, key, and value matrices, <bold>Q, K, V,</bold> via three independent linear transformations. <bold>Q</bold> is then summarized into a global query vector q&#x2032; by multiplying each vector <italic>q<sub>i</sub>
</italic> with its corresponding attention weight <italic>&#x3b1;<sub>i</sub>
</italic> and summing the results. Next, the interaction between the attention key <bold>K</bold> and <bold>q&#x2032;</bold> is modeled through element-wise product, yielding the global context-aware key matrix <bold>P. P</bold> is further summarized into a global key vector <bold>k&#x2032;</bold> by multiplying each vector <italic>p<sub>i</sub>
</italic> with its corresponding attention weight <italic>&#x3b2;<sub>i</sub>
</italic> and summing the results. Afterward, an element-wise product combines the global key and attention value <bold>V</bold>, resulting in an aggregated representation <bold>U. U</bold> is then processed through a linear transformation to generate the global context-aware attention value <bold>R</bold>. Finally, <bold>Q</bold> and <bold>R</bold> are added to form the final output. Notations: &#x2217; denotes element-wise product, &#xd7; denotes multiplication, and &#x2295; denotes summation.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-10-1280708-g004.tif"/>
</fig>
<p>Subsequently, the model summarizes the query matrix <inline-formula>
<mml:math display="inline" id="im33">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Q</mml:mi>
</mml:mstyle>
</mml:math>
</inline-formula> into a global query vector, denoted as <inline-formula>
<mml:math display="inline" id="im34">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:msup>
<mml:mi>q</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mstyle>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211d;</mml:mi>
<mml:mi>D</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>. This global query vector can effectively capture the consolidated global contextual information within the attention query. This summarization process is accomplished by multiplying each vector in the matrix by its corresponding attention weight <inline-formula>
<mml:math display="inline" id="im35">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b1;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and then aggregating the results. The left column of <xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref> visually illustrates this summarization process.</p>
<p>To be more specific, the attention weight <inline-formula>
<mml:math display="inline" id="im36">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b1;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> of the i-th query vector <inline-formula>
<mml:math display="inline" id="im37">
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is computed as:</p>
<disp-formula>
<label>(7)</label>
<mml:math display="block" id="M7">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b1;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>w</mml:mi>
</mml:mstyle>
<mml:mi>q</mml:mi>
<mml:mi>T</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">/</mml:mo>
<mml:msqrt>
<mml:mi>D</mml:mi>
</mml:msqrt>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:msubsup>
<mml:mi>e</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>w</mml:mi>
</mml:mstyle>
<mml:mi>q</mml:mi>
<mml:mi>T</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">/</mml:mo>
<mml:msqrt>
<mml:mi>D</mml:mi>
</mml:msqrt>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im28a">
<mml:mrow>
<mml:msub>
<mml:mtext mathvariant="bold">w</mml:mtext>
<mml:mi>q</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mo>&#x211d;</mml:mo>
<mml:mi>D</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is a learnable parameter vector and <inline-formula>
<mml:math display="inline" id="im38">
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>p</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> represents the exponential function. Then, the global query vector can be computed by:</p>
<disp-formula>
<label>(8)</label>
<mml:math display="block" id="M8">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:msup>
<mml:mi>q</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mstyle>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:munderover>
<mml:msub>
<mml:mi>&#x3b1;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</disp-formula>
<p>When modeling the interaction between the summarized global query vector and the key matrix, simply adding or concatenating the query to each vector in the key matrix yields unsatisfactory results, because such approaches fail to differentiate the influence of the global query on different keys. In other words, they treat every key in the same manner and lack the ability to allocate attention selectively. To address this issue, this paper employs the element-wise product, which proves effective in capturing the nonlinear relations between two vectors.</p>
<p>The global query vector undergoes an element-wise multiplication with the key matrix <inline-formula>
<mml:math display="inline" id="im39">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>K</mml:mi>
</mml:mstyle>
</mml:math>
</inline-formula>, resulting in the generation of a global context-aware key matrix denoted as <inline-formula>
<mml:math display="inline" id="im40">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>P</mml:mi>
</mml:mstyle>
</mml:math>
</inline-formula>. This matrix allows the model to differentiate the influence of the global query across different keys. <inline-formula>
<mml:math display="inline" id="im41">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>P</mml:mi>
</mml:mstyle>
</mml:math>
</inline-formula> is then summarized into a global key vector, represented as <inline-formula>
<mml:math display="inline" id="im42">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mstyle>
</mml:math>
</inline-formula>. This summarization is achieved by multiplying each vector in <inline-formula>
<mml:math display="inline" id="im43">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>P</mml:mi>
</mml:mstyle>
</mml:math>
</inline-formula> with its corresponding attention weight <inline-formula>
<mml:math display="inline" id="im44">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and summing the results. The middle column of <xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref> provides a visual depiction of this summarization process. By incorporating this approach, the model will be able to effectively capture relevant information and adapt its attention distribution based on the global context, ultimately leading to enhanced modeling capability and improved performance.</p>
<p>The attention weight of the i-th global context-aware key vector is computed as the following equation:</p>
<disp-formula>
<label>(9)</label>
<mml:math display="block" id="M9">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:mi>e</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mstyle mathvariant="bold"><mml:mi>w</mml:mi></mml:mstyle>
<mml:mi>k</mml:mi>
<mml:mi>T</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mstyle mathvariant="bold"><mml:mi>p</mml:mi></mml:mstyle>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">/</mml:mo>
<mml:msqrt>
<mml:mi>D</mml:mi>
</mml:msqrt>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:msubsup>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>j</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:msubsup>
<mml:mi>e</mml:mi>
<mml:mi>x</mml:mi>
<mml:mi>p</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mstyle mathvariant="bold"><mml:mi>w</mml:mi></mml:mstyle>
<mml:mi>k</mml:mi>
<mml:mi>T</mml:mi>
</mml:msubsup>
<mml:msub>
<mml:mstyle mathvariant="bold"><mml:mi>p</mml:mi></mml:mstyle>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo stretchy="false">/</mml:mo>
<mml:msqrt>
<mml:mi>D</mml:mi>
</mml:msqrt>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im45">
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>p</mml:mi>
</mml:mstyle>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:msup>
<mml:mi>q</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mstyle>
<mml:mo>*</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>k</mml:mi>
</mml:mstyle>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> (the symbol <inline-formula>
<mml:math display="inline" id="im46">
<mml:mo>*</mml:mo>
</mml:math>
</inline-formula> denotes the element-wise product) and <inline-formula>
<mml:math display="inline" id="im47">
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>w</mml:mi>
</mml:mstyle>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211d;</mml:mi>
<mml:mi>D</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is the attention parameter vector. The global key vector <inline-formula>
<mml:math display="inline" id="im48">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mstyle>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211d;</mml:mi>
<mml:mi>D</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is computed as follows:</p>
<disp-formula>
<label>(10)</label>
<mml:math display="block" id="M10">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mstyle>
<mml:mo>=</mml:mo>
<mml:munderover>
<mml:mo>&#x2211;</mml:mo>
<mml:mrow>
<mml:mi>i</mml:mi>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
<mml:mi>N</mml:mi>
</mml:munderover>
<mml:msub>
<mml:mi>&#x3b2;</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>p</mml:mi>
</mml:mstyle>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>The right column of <xref ref-type="fig" rid="f4">
<bold>Figure&#xa0;4</bold>
</xref> illustrates the process of modeling global dependencies through the interactions between the attention-value matrix and the global key vector. Similar to the query-key interaction, the global key vector is combined with each value vector through element-wise product, yielding the key-value interaction vector <inline-formula>
<mml:math display="inline" id="im49">
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>u</mml:mi>
</mml:mstyle>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mstyle>
<mml:mo>*</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>v</mml:mi>
</mml:mstyle>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. To capture the underlying information in these interaction vectors, a linear transformation layer is applied to each key-value interaction vector, enabling the learning of its hidden representation. The resulting output matrix <inline-formula>
<mml:math display="inline" id="im50">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>R</mml:mi>
</mml:mstyle>
<mml:mo>=</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>r</mml:mi>
</mml:mstyle>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>r</mml:mi>
</mml:mstyle>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>r</mml:mi>
</mml:mstyle>
<mml:mi>N</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi>&#x211d;</mml:mi>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> is then added to the query matrix, forming the final output.</p>
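<p>To make the computation above concrete, a minimal, illustrative sketch of how Eqs. (7)&#x2013;(10) and the residual output can be realized is given below. It is not the authors&#x2019; released implementation; the module name <italic>AdditiveAttentionSketch</italic>, the tensor shapes, and the parameter initialization are assumptions made purely for illustration.</p>
<preformat>
import torch
import torch.nn as nn


class AdditiveAttentionSketch(nn.Module):
    """Illustrative additive attention over Q, K, V of shape (batch, N, D)."""

    def __init__(self, dim):
        super().__init__()
        self.w_q = nn.Parameter(torch.randn(dim))  # w_q in Eq. (7)
        self.w_k = nn.Parameter(torch.randn(dim))  # w_k in Eq. (9)
        self.proj = nn.Linear(dim, dim)            # linear layer applied to k' * v_i
        self.scale = dim ** 0.5

    def forward(self, q, k, v):
        # Eq. (7): one scalar weight per query vector, softmax over the N positions.
        alpha = torch.softmax(q @ self.w_q / self.scale, dim=-1)   # (batch, N)
        # Eq. (8): global query vector q' as the weighted sum of the query vectors.
        q_global = torch.einsum('bn,bnd->bd', alpha, q)            # (batch, D)
        # Element-wise product with every key gives the context-aware key matrix P.
        p = k * q_global.unsqueeze(1)                              # (batch, N, D)
        # Eq. (9): attention weights over the vectors of P.
        beta = torch.softmax(p @ self.w_k / self.scale, dim=-1)    # (batch, N)
        # Eq. (10): global key vector k'.
        k_global = torch.einsum('bn,bnd->bd', beta, p)             # (batch, D)
        # Key-value interaction u_i = k' * v_i, a linear transformation giving R,
        # and the addition of R to the query matrix as the final output.
        r = self.proj(v * k_global.unsqueeze(1))                   # (batch, N, D)
        return q + r


# Quick shape check with arbitrary illustrative sizes.
x = torch.randn(2, 197, 256)
out = AdditiveAttentionSketch(256)(x, x, x)   # out.shape == (2, 197, 256)
</preformat>
<p>Every step in this sketch operates on length-<italic>D</italic> vectors rather than on an <italic>N</italic> &#xd7; <italic>N</italic> score matrix, which is the source of the linear complexity analyzed in Section 2.4.</p>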
<p>By stacking multiple encoders, the network is able to comprehensively model global attention and generate a representation for each input spectrogram based on the class token. This process facilitates the integration of both local and global information, resulting in a more informative and context-aware representation of the input data. The computations conducted in the Transformer Encoders can be expressed as follows:</p>
<disp-formula>
<label>(11)</label>
<mml:math display="block" id="M11">
<mml:mrow>
<mml:mtable columnalign="left">
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msubsup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mi>l</mml:mi>
<mml:mo>'</mml:mo>
</mml:msubsup>
</mml:mrow>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mi>A</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>v</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mrow>
<mml:mstyle mathvariant="bold"><mml:mtext>l</mml:mtext></mml:mstyle>
<mml:mo>&#x2212;</mml:mo>
<mml:mstyle mathvariant="bold"><mml:mn>1</mml:mn></mml:mstyle>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mrow>
<mml:mstyle mathvariant="bold"><mml:mtext>l</mml:mtext></mml:mstyle>
<mml:mo>&#x2212;</mml:mo>
<mml:mstyle mathvariant="bold"><mml:mn>1</mml:mn></mml:mstyle>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mi>l</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mi>M</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>P</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msubsup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mi>l</mml:mi>
<mml:mo>'</mml:mo>
</mml:msubsup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>+</mml:mo>
<mml:msubsup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mi>l</mml:mi>
<mml:mo>'</mml:mo>
</mml:msubsup>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr columnalign="left">
<mml:mtd columnalign="left">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Y</mml:mi>
</mml:mstyle>
</mml:mtd>
<mml:mtd columnalign="left">
<mml:mrow>
<mml:mo>=</mml:mo>
<mml:mi>L</mml:mi>
<mml:mi>N</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mrow>
<mml:mi>c</mml:mi>
<mml:mi>l</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>s</mml:mi>
<mml:mi>s</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im51">
<mml:mrow>
<mml:msub>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>Z</mml:mi>
</mml:mstyle>
<mml:mrow>
<mml:mstyle mathvariant="bold"><mml:mtext>l</mml:mtext></mml:mstyle>
<mml:mo>&#x2212;</mml:mo>
<mml:mstyle mathvariant="bold"><mml:mn>1</mml:mn></mml:mstyle>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> denotes the output of the previous layer, <inline-formula>
<mml:math display="inline" id="im52">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>v</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> denotes the additive attention mechanism that is employed in our ADDTr. <inline-formula>
<mml:math display="inline" id="im53">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mi>L</mml:mi>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula>
<mml:math display="inline" id="im54">
<mml:mrow>
<mml:mi>L</mml:mi>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> represent the feed-forward multilayer perceptron layer and the layer normalization (LN) layer, respectively.</p>
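<p>A possible encoder layer following Eq. (11), built on the attention sketch above, is outlined below. The hidden-layer width, the GELU activation, and the dropout placement are assumptions rather than details reported in this paper.</p>
<preformat>
import torch
import torch.nn as nn

# AdditiveAttentionSketch is the module defined in the previous sketch.


class EncoderBlockSketch(nn.Module):
    """One Transformer encoder layer following Eq. (11)."""

    def __init__(self, dim, mlp_ratio=4, dropout=0.3):
        super().__init__()
        self.attn = AdditiveAttentionSketch(dim)
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Dropout(dropout),
            nn.Linear(mlp_ratio * dim, dim),
            nn.Dropout(dropout),
        )

    def forward(self, z):
        # Eq. (11), read literally: the additive attention of Eq. (12) already
        # carries a query skip path, and a further residual is added here.
        z = self.attn(z, z, z) + z       # Z'_l = AdditiveAttention(Z_{l-1}) + Z_{l-1}
        z = self.mlp(self.norm(z)) + z   # Z_l  = MLP(LN(Z'_l)) + Z'_l
        return z


# Stacking several such blocks; the class-token row of the final output is then
# layer-normalized to give the representation Y passed to the classification head.
blocks = nn.Sequential(*[EncoderBlockSketch(256) for _ in range(8)])
</preformat>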
<p>The generated representation is then passed to a classification head, constructed using another MLP with one hidden layer, to fulfill the final stage depicted in <xref ref-type="fig" rid="f1">
<bold>Figure&#xa0;1</bold>
</xref>. LN layers and dropout layers are interspersed in the proposed Transformer in order to stabilize the model while deepening the network.</p>
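<p>A minimal sketch of such a classification head is shown below; the latent size and class count are placeholders, and only the dropout rate of 0.3 reflects the configuration reported later in this paper.</p>
<preformat>
import torch
import torch.nn as nn

D, NUM_CLASSES = 256, 12          # placeholder latent size and class count

head = nn.Sequential(
    nn.LayerNorm(D),              # LN applied to the class-token representation
    nn.Linear(D, D),              # the single hidden layer of the MLP head
    nn.GELU(),
    nn.Dropout(0.3),
    nn.Linear(D, NUM_CLASSES),
)

class_token = torch.randn(8, D)   # stands in for the representation Y
logits = head(class_token)        # (8, NUM_CLASSES)
</preformat>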
</sec>
<sec id="s2_4">
<label>2.4</label>
<title>Complexity analysis</title>
<p>In this subsection, all instances of <inline-formula>
<mml:math display="inline" id="im55">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
</mml:math>
</inline-formula>, <inline-formula>
<mml:math display="inline" id="im56">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>k</mml:mi>
</mml:mstyle>
</mml:math>
</inline-formula>, and <inline-formula>
<mml:math display="inline" id="im57">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>v</mml:mi>
</mml:mstyle>
</mml:math>
</inline-formula> mentioned in various equations refer to the same query, key, and value matrices. Additionally, <inline-formula>
<mml:math display="inline" id="im58">
<mml:mrow>
<mml:mi>N</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> represents the length of the input, while <inline-formula>
<mml:math display="inline" id="im59">
<mml:mi>D</mml:mi>
</mml:math>
</inline-formula> is a constant latent vector size that controls the input dimension, i.e., the dimension of the representations.</p>
<p>The proposed method deviates from the conventional approach of modeling global attention using matrix multiplication. Instead, it leverages element-wise multiplication to compute the additive attention <italic>AdditiveAttention</italic>(). The additive attention <inline-formula>
<mml:math display="inline" id="im60">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>v</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mo stretchy="false">(</mml:mo>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> is calculated as:</p>
<disp-formula>
<label>(12)</label>
<mml:math display="block" id="M12">
<mml:mrow>
<mml:mi>A</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>d</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>v</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>A</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>e</mml:mi>
<mml:mi>n</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>n</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
<mml:mo>,</mml:mo>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>k</mml:mi>
</mml:mstyle>
<mml:mo>,</mml:mo>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>v</mml:mi>
</mml:mstyle>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
<mml:mo>+</mml:mo>
<mml:mrow>
<mml:mo stretchy="false">[</mml:mo>
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mstyle>
<mml:mo>*</mml:mo>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>v</mml:mi>
</mml:mstyle>
</mml:mrow>
<mml:mo stretchy="false">]</mml:mo>
</mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>W</mml:mi>
</mml:mstyle>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where <inline-formula>
<mml:math display="inline" id="im61">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:msup>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:msup>
</mml:mstyle>
</mml:math>
</inline-formula> is the global key vector obtained from Eq. (10), and <inline-formula>
<mml:math display="inline" id="im62">
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>W</mml:mi>
</mml:mstyle>
</mml:math>
</inline-formula> is the learnable linear projection parameter. The computation complexity of the proposed network is <inline-formula>
<mml:math display="inline" id="im63">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. Note that, since the complexity of the linear transformations is not taken into account by many prior works, it is also omitted in this paper when calculating computational costs.</p>
<p>The dot-product-based self-attention score is computed by measuring the similarity between every pair of patches in the sequence. Its calculation process can be expressed as:</p>
<disp-formula>
<label>(13)</label>
<mml:math display="block" id="M13">
<mml:mrow>
<mml:mi>S</mml:mi>
<mml:mi>A</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
<mml:mo>,</mml:mo>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>k</mml:mi>
</mml:mstyle>
<mml:mo>,</mml:mo>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>v</mml:mi>
</mml:mstyle>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mo>=</mml:mo>
<mml:mi>s</mml:mi>
<mml:mi>o</mml:mi>
<mml:mi>f</mml:mi>
<mml:mi>t</mml:mi>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>q</mml:mi>
</mml:mstyle>
<mml:msup>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>k</mml:mi>
</mml:mstyle>
<mml:mi>T</mml:mi>
</mml:msup>
<mml:mo stretchy="false">/</mml:mo>
<mml:msqrt>
<mml:mi>D</mml:mi>
</mml:msqrt>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>v</mml:mi>
</mml:mstyle>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
</disp-formula>
<p>where softmax() denotes the application of the softmax function along the last axis of the matrix. The computational complexity of the original self-attention is therefore <inline-formula>
<mml:math display="inline" id="im64">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi>N</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, which is much higher than that of the proposed method since the sequence length <inline-formula>
<mml:math display="inline" id="im65">
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mo>&#x226B;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
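<p>As an illustrative comparison (the exact values of <italic>N</italic> and <italic>D</italic> depend on the configuration), taking <italic>N</italic> = 197 patches and <italic>D</italic> = 256 gives <italic>ND</italic> &#x2248; 5 &#xd7; 10<sup>4</sup> operations per attention stage for the additive attention, whereas <italic>N</italic><sup>2</sup><italic>D</italic> &#x2248; 10<sup>7</sup> for dot-product self-attention, a gap of roughly two orders of magnitude.</p>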
<p>An alternative attention mechanism commonly used in acoustic signal classification is the shifted window attention, initially introduced by the Swin Transformer (<xref ref-type="bibr" rid="B24">Liu et&#xa0;al., 2021</xref>). This mechanism restricts standard self-attention to local windows and shifts the windows between layers, so that global interactions are built up gradually while the computational complexity is mitigated. The computational complexity of this mechanism is <inline-formula>
<mml:math display="inline" id="im66">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mi>D</mml:mi>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula>
<mml:math display="inline" id="im67">
<mml:mi>w</mml:mi>
</mml:math>
</inline-formula> represents the window size. Since <inline-formula>
<mml:math display="inline" id="im68">
<mml:mi>w</mml:mi>
</mml:math>
</inline-formula> is a constant greater than one by definition, this window-based mechanism remains computationally more expensive than the additive attention employed in this paper.</p>
</sec>
</sec>
<sec id="s3">
<label>3</label>
<title>Experiment</title>
<sec id="s3_1">
<label>3.1</label>
<title>Dataset</title>
<p>The ShipsEar dataset (<xref ref-type="bibr" rid="B32">Santos-Dom&#xed;nguez et&#xa0;al., 2016</xref>), which consists of recordings of underwater vessel noise captured in real shallow oceanic environments, is utilized in this paper. This dataset encompasses a diverse range of natural and anthropogenic environmental noise sources. Without any preprocessing, the received signals are affected by reflections and echoes introduced by reverberation, leading to overlapping and smeared spectrograms. The dataset comprises 90 acoustic samples from 11 distinct vessel types, with each category containing one or more samples. The duration of the audio varies from 15 seconds to 10 minutes.</p>
<p>The dataset was divided into three subsets: training set, testing set, and validation set. The training set was allocated 70% of the data and used for model training and fitting. The testing set, comprising 20% of the data, was used to fine-tune the model&#x2019;s hyperparameters and perform an initial assessment of its performance. The remaining 10% constituted the validation set, which remained unknown to the model during training and testing, allowing for the evaluation of the model&#x2019;s generalization ability and robustness.</p>
<p>To ensure consistency in the dataset, a slicing method was applied during the data preprocessing stage, dividing all signals into fixed 1-second durations. This preprocessing step augmented the dataset, resulting in adequate samples to allocate to each category&#x2019;s three subsets. The samples were randomly selected and distributed among the subsets according to the predefined ratio.</p>
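<p>The slicing and splitting procedure described above can be sketched as follows. This is an illustrative outline rather than the authors&#x2019; code: the file layout, sample-rate handling, and helper names are assumptions, and a stratified per-category split (as described in the text) would additionally group segments by class before applying the 70/20/10 ratio.</p>
<preformat>
# Illustrative preprocessing sketch: slice each recording into fixed 1-second
# segments and split them 70/20/10. Paths and helper names are assumptions.
import glob
import random

import soundfile as sf

SEGMENT_SECONDS = 1
random.seed(0)

segments = []
for path in glob.glob("shipsear/*.wav"):                # assumed dataset layout
    audio, sr = sf.read(path)
    step = SEGMENT_SECONDS * sr
    for start in range(0, len(audio) - step + 1, step):
        segments.append((path, start, start + step))

random.shuffle(segments)
n = len(segments)
train = segments[: int(0.7 * n)]                        # 70%: model fitting
test = segments[int(0.7 * n): int(0.9 * n)]             # 20%: hyper-parameter tuning
val = segments[int(0.9 * n):]                           # 10%: held-out validation
</preformat>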
</sec>
<sec id="s3_2">
<label>3.2</label>
<title>Training and testing</title>
<p>In this paper, the training and testing of the proposed model were conducted utilizing Nvidia&#x2019;s RTX3090 GPU, which is equipped with 24 GB of G6X memory. <xref ref-type="table" rid="T1">
<bold>Table&#xa0;1</bold>
</xref> lists the parameters used during the training and testing stages, while <xref ref-type="fig" rid="f5">
<bold>Figure&#xa0;5</bold>
</xref> provides a detailed view of the model&#x2019;s performance in each epoch of these stages.</p>
<fig id="f5" position="float">
<label>Figure&#xa0;5</label>
<caption>
<p>The detailed training process and results. The trend of the loss and accuracy curves indicates an optimal fit.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-10-1280708-g005.tif"/>
</fig>
<table-wrap id="T1" position="float">
<label>Table&#xa0;1</label>
<caption>
<p>Parameters utilized in the proposed model during both the training and testing stages.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="center">Parameter Name</th>
<th valign="top" align="center">Parameter Value</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">Audio Segment Length (s)</td>
<td valign="top" align="center">1</td>
</tr>
<tr>
<td valign="top" align="center">Patch Size</td>
<td valign="top" align="center">16x16</td>
</tr>
<tr>
<td valign="top" align="center">Batch Size</td>
<td valign="top" align="center">128</td>
</tr>
<tr>
<td valign="top" align="center">Dropout Rate</td>
<td valign="top" align="center">0.3</td>
</tr>
<tr>
<td valign="top" align="center">Optimizer</td>
<td valign="top" align="center">Adam</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>The initial assessment of a deep learning model typically involves analyzing the training and testing losses, which measure the errors over the examples in their respective datasets. As depicted in the figure, both training and testing losses exhibit a decreasing trend, while training and testing accuracies steadily increase. They start to stabilize after ten epochs, and training stops after fourteen epochs. This behavior indicates the model&#x2019;s effective convergence to an optimal fit.</p>
<p>Overfitting and underfitting are common challenges in deep learning. They usually arise when the model struggles to generalize well to new data or exhibits large errors on the training data, and they often manifest as diverging loss curves caused by vanishing or exploding gradients. However, as evident from the figure, the converging loss curves of the proposed model demonstrate its ability to mitigate these problems and effectively learn the underlying data features. Consequently, the results demonstrate our model&#x2019;s high performance and its potential as a robust data analysis and prediction tool.</p>
</sec>
<sec id="s3_3">
<label>3.3</label>
<title>Evaluation</title>
<p>To comprehensively evaluate the effectiveness of our proposed model and feature extraction method, we conducted several experiments.</p>
<p>The results, as illustrated in <xref ref-type="table" rid="T2">
<bold>Table&#xa0;2</bold>
</xref>, demonstrate the superiority of three-dimensional Mel-spectrograms over their one-dimensional counterparts when employed with various widely used audio classification models. This improvement can be attributed to the incorporation of the signal&#x2019;s dynamic features, which enhances the representational power of the spectrograms. By capturing temporal variations in the acoustic signal, the three-dimensional Mel-spectrograms provide richer and more informative features for accurate classification across different acoustic signal recognition models. The experimental findings highlight the significance of considering dynamic characteristics in feature extraction for acoustic classification tasks.</p>
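<p>For readers who wish to reproduce this style of input, a minimal sketch of a three-channel Mel-spectrogram is shown below, assuming the three channels are the static log-Mel spectrogram and its first- and second-order deltas (the dynamic features discussed above); the library calls and parameter values are illustrative assumptions, not the exact configuration used in this paper.</p>
<preformat>
# Illustrative three-channel Mel-spectrogram: static log-Mel plus first- and
# second-order deltas. Parameter values are assumptions, not the paper's setup.
import librosa
import numpy as np


def mel_3d(wav_path, sr=22050, n_mels=128):
    y, sr = librosa.load(wav_path, sr=sr, duration=1.0)       # 1-second segment
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel)                        # static channel
    delta = librosa.feature.delta(log_mel)                    # first-order dynamics
    delta2 = librosa.feature.delta(log_mel, order=2)          # second-order dynamics
    return np.stack([log_mel, delta, delta2], axis=0)         # (3, n_mels, frames)
</preformat>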
<table-wrap id="T2" position="float">
<label>Table&#xa0;2</label>
<caption>
<p>Accuracy comparison between different models using 1D and 3D mel-spectrograms as input.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="center">Model Type</th>
<th valign="top" align="center">Accuracy with 1D Mel-spectrograms</th>
<th valign="top" align="center">Accuracy with 3D Mel-spectrograms</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">CRNN (<xref ref-type="bibr" rid="B12">Fu et&#xa0;al., 2019</xref>)</td>
<td valign="top" align="center">90%</td>
<td valign="top" align="center">93.23%</td>
</tr>
<tr>
<td valign="top" align="center">AST (<xref ref-type="bibr" rid="B16">Gong et&#xa0;al., 2021</xref>)</td>
<td valign="top" align="center">89%</td>
<td valign="top" align="center">93.5%</td>
</tr>
<tr>
<td valign="top" align="center">HTS-AT (<xref ref-type="bibr" rid="B3">Chen et&#xa0;al., 2022</xref>)</td>
<td valign="top" align="center">84.88%</td>
<td valign="top" align="center">89.32%</td>
</tr>
<tr>
<td valign="top" align="center">BEATs (<xref ref-type="bibr" rid="B4">Chen et&#xa0;al., 2022</xref>)</td>
<td valign="top" align="center">83.56%</td>
<td valign="top" align="center">86.25%</td>
</tr>
<tr>
<td valign="top" align="center">SepTr (<xref ref-type="bibr" rid="B30">Ristea et&#xa0;al., 2022</xref>)</td>
<td valign="top" align="center">88.42%</td>
<td valign="top" align="center">91.86%</td>
</tr>
<tr>
<td valign="top" align="center">ADDTr</td>
<td valign="top" align="center">
<bold>91.41%</bold>
</td>
<td valign="top" align="center">
<bold>96.82%</bold>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>The proposed work gained the highest accuracy and the performances of 3D Mel-spectrograms are generally better than the 1D Mel-spectrograms. The bold values represent the best performance/results.</p>
</fn>
</table-wrap-foot>
</table-wrap>
<p>We conducted a series of experiments to determine the optimal number of Transformer encoders required for effective learning of the three-dimensional Mel-spectrogram features. These experiments, whose results are illustrated in <xref ref-type="fig" rid="f6">
<bold>Figure&#xa0;6</bold>
</xref>, involved training and testing the model with varying numbers of encoders. The findings indicate that the best performance is achieved with eight Transformer encoders. With more or fewer encoders, the classification performance becomes suboptimal or the computational resources are used inefficiently. Hence, the selection of an appropriate number of encoders plays a crucial role in maximizing the model&#x2019;s learning capability and achieving superior classification outcomes.</p>
<fig id="f6" position="float">
<label>Figure&#xa0;6</label>
<caption>
<p>A comparison of model performance was conducted by employing 2, 4, 8, and 16 Transformer encoders during both training and testing phases. Each performance is distinguished by a distinct line color, with red, green, orange, and blue lines representing the different encoder counts, respectively. The results reveal that when using 8 encoders, the model demonstrates the highest learning capacity and attains the lowest loss value.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-10-1280708-g006.tif"/>
</fig>
<p>In order to determine the optimal hyper-parameter settings, we conducted experiments with different combinations of patch sizes, batch sizes, and audio segment lengths. The evaluation results, presented in <xref ref-type="fig" rid="f7">
<bold>Figure&#xa0;7</bold>
</xref>, offer a comprehensive analysis of the performance. The patch size refers to the size of the patches extracted from the input data by the Transformer block. Models with smaller patch sizes tend to be more computationally intensive, because the sequence length seen by the Transformer grows with the inverse square of the patch size. However, it should be noted that larger patch sizes do not necessarily lead to improved classification performance. In fact, a larger patch size results in fewer patches for the same input, limiting the model&#x2019;s learning opportunities and yielding poorer results, as depicted in <xref ref-type="fig" rid="f7">
<bold>Figure&#xa0;7</bold>
</xref>.</p>
<fig id="f7" position="float">
<label>Figure&#xa0;7</label>
<caption>
<p>Below the x-axis, the values 1, 3, and 5 represent the audio segment length in seconds. The x-axis displays the values 64, 128, and 256, indicating different batch sizes. The y-axis represents the classification accuracy of our proposed model. The best result is achieved with a patch size of 16x16, a batch size of 128, and an audio segment length of 1 second.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-10-1280708-g007.tif"/>
</fig>
<p>Another crucial factor to consider is the batch size, which determines the number of samples processed before the model&#x2019;s internal parameters are updated. It is generally recommended to choose a batch size that aligns with the number of physical processors on the GPU, typically a power of 2; deviating from this configuration may result in suboptimal performance. In <xref ref-type="fig" rid="f7">
<bold>Figure&#xa0;7</bold>
</xref>, the values 64, 128, and 256 along the x-axis denote the batch sizes, while the grouped labels 1, 3, and 5 below the axis denote the audio segment lengths in seconds.</p>
<p>The classification accuracy reaches an optimal value of approximately 97% with a patch size of 16x16, an audio segment length of 1 second, and a batch size of 128, as illustrated in the figure. This result highlights the model&#x2019;s ability to accurately identify almost all of the ship-radiated noise, even when recorded in challenging environments. For a more detailed view of the recognition performance across different classes, please refer to the graphical representation presented in <xref ref-type="fig" rid="f8">
<bold>Figure&#xa0;8</bold>
</xref>.</p>
<fig id="f8" position="float">
<label>Figure&#xa0;8</label>
<caption>
<p>Identification results for each class. All categories achieve an identification accuracy higher than 92%, and eight out of twelve categories exceed 95%.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-10-1280708-g008.tif"/>
</fig>
<p>
<xref ref-type="fig" rid="f9">
<bold>Figure&#xa0;9</bold>
</xref> provides insights into the classification performance of the proposed model, considering different optimizers and dropout rates. An optimizer plays a crucial role in updating the model&#x2019;s parameters based on the gradients of the loss function with respect to the weights. We evaluated five commonly used optimizers in acoustic deep learning models: adaptive moment estimation (Adam), root mean square propagation (RMSprop), stochastic gradient descent (SGD), adaptive gradient (Adagrad), and adaptive delta (Adadelta).</p>
<fig id="f9" position="float">
<label>Figure&#xa0;9</label>
<caption>
<p>Comparison of identification accuracies across different optimizers and dropout rates. The Adam optimizer reaches the local minimum most effectively in the ship target recognition task, while the dropout rate should be set to 0.3 to achieve the optimal result.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-10-1280708-g009.tif"/>
</fig>
<p>The results depicted in <xref ref-type="fig" rid="f9">
<bold>Figure&#xa0;9</bold>
</xref> reveal that Adam outperforms the other optimizers on the non-convex optimization problem posed by the underwater signal dataset. Underwater acoustic signals can be sparse and noisy, making it difficult to estimate accurate gradients during training. Both Adam and RMSprop adaptively adjust learning rates based on historical gradient information, enabling them to handle sparse and noisy gradients effectively. This adaptability leads to more stable and efficient optimization in the presence of such challenges. Adam combines momentum and adaptive learning rates, maintaining a separate learning rate for each parameter and using adaptive estimates of both the first-order (mean) and second-order (uncentered variance) moments of the gradients. RMSprop also adapts learning rates but tracks only the second-order moment of the gradients, lacking the first-order momentum term, which makes it slightly less effective than Adam on the underwater acoustic data.</p>
<p>On the other hand, SGD suffers from slow convergence because its learning rate is fixed and must be chosen carefully: a rate that is too high can cause oscillation or divergence, while a rate that is too low leads to slow convergence or suboptimal solutions. Adadelta struggles with sparse gradients, limiting its ability to update parameters effectively, and requires additional memory for its accumulated squared-gradient statistics. Adagrad&#x2019;s diminishing learning rates over time can hinder adaptation in the underwater acoustic target recognition model, and its accumulation of historical gradients reduces the relevance of recent gradients during optimization.</p>
<p>Dropout is a regularization technique that randomly drops nodes in a layer during training to mitigate overfitting. By incorporating dropout, the training process introduces noise and forces the remaining nodes to learn more robust and independent features. Through experiments, it has been determined that the optimal dropout rate is 0.3, indicating that approximately one-third of the inputs are randomly excluded from each update iteration.</p>
<p>A dropout rate lower than 0.3 can result in the model relying too heavily on specific nodes, discouraging the network from learning more diverse representations and degrading its generalization ability. Moreover, when a large number of nodes are randomly dropped, the remaining nodes need to compensate for the missing information. This can lead to slower convergence or difficulties in finding an optimal solution during the training process. Hence, a dropout rate greater than 0.3 can limit the model&#x2019;s ability to learn complex patterns and relationships in the data, also leading to a lower performance.</p>
<p>
<xref ref-type="fig" rid="f10">
<bold>Figure&#xa0;10</bold>
</xref> provides a comparison between the proposed method and several commonly used acoustic data classification models utilizing three-dimensional Mel-spectrograms as inputs. The graph displays different lines representing our proposed ADDTr, convolutional recurrent neural network (CRNN), BEATs, hierarchical token-semantic audio Transformer (HTS-AT), audio spectrogram Transformer (AST), and separable Transformer (SepTr). The results show that the proposed model achieves the optimal performance at epoch 14, outperforming the other models in terms of efficiency. Even after thorough training, the proposed model consistently maintains higher classification accuracies compared to the other models. <xref ref-type="fig" rid="f10">
<bold>Figure&#xa0;10B</bold>
</xref> shows that the proposed model exhibits smoother performance, indicating its robustness on unseen data.</p>
<fig id="f10" position="float">
<label>Figure&#xa0;10</label>
<caption>
<p>Comparison of identification accuracy between the proposed model and other mainstream neural networks based on the number of epochs. The number of epochs represents how many times the entire dataset is processed by the learning algorithm. Our proposed model attains the highest accuracy while requiring the least amount of time during the training, testing, and validation processes.</p>
</caption>
<graphic mimetype="image" mime-subtype="tiff" xlink:href="fmars-10-1280708-g010.tif"/>
</fig>
<p>The better performance of ADDTr can be attributed to the following reasons. While CRNNs have shown promise in audio processing tasks, their limited modeling capabilities due to the sequential nature of recurrent layers may make it challenging for them to capture long-term dependencies in audio data. BEATs, although it leverages acoustic tokenizers for audio pre-training, may not be optimized for underwater acoustic signals. The employed tokenization strategy may fail to capture the specific acoustic information relevant to underwater acoustics, leading to suboptimal representations and reduced classification performance. HTS-AT relies on the token-semantic audio transformer architecture, which incorporates hierarchical token semantics. However, this approach may not fully capture the complex temporal patterns and dependencies in underwater acoustic signals, resulting in reduced classification performance.</p>
<p>Similarly, SepTr and AST, like other transformer-based models, depend on self-attention mechanisms to capture long-range dependencies in audio signals. Yet, the complex temporal patterns in underwater acoustic data, such as non-linear dependencies and irregular sequences, may pose challenges for self-attention. This can compromise the models&#x2019; ability to accurately capture temporal dynamics, leading to suboptimal performance in tasks where such dynamics are crucial.</p>
<p>In contrast, our proposed model circumvents these deficiencies. It avoids acoustic tokenizers and self-attention mechanisms, instead utilizing additive attention to directly model the interactions between global and local representations. By summing the attended representations, it effectively suppresses noise and enhances relevant acoustic features. This robustness makes our model highly effective for handling underwater acoustic data.</p>
<p>
<xref ref-type="table" rid="T3">
<bold>Table&#xa0;3</bold>
</xref> presents a comparison of the parameter counts, computational complexities, and per-epoch time consumption of the different models. The results clearly highlight the efficiency of the proposed model, which outperforms the other models in terms of parameter count, computational complexity, and time consumed per epoch. This further emphasizes the superiority and effectiveness of the proposed model.</p>
<table-wrap id="T3" position="float">
<label>Table&#xa0;3</label>
<caption>
<p>Efficiency comparisons between several models.</p>
</caption>
<table frame="hsides">
<thead>
<tr>
<th valign="top" align="center">Model Type</th>
<th valign="top" align="center">Number of Parameters</th>
<th valign="top" align="center">Computational Complexity</th>
<th valign="top" align="center">Time Consumed per Epoch (s)</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top" align="center">CRNN</td>
<td valign="top" align="center">
<inline-formula>
<mml:math display="inline" id="im70">
<mml:mrow>
<mml:mn>21</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mn>6</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td valign="top" align="center">
<inline-formula>
<mml:math display="inline" id="im71">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:msup>
<mml:mi>D</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td valign="top" align="center">5</td>
</tr>
<tr>
<td valign="top" align="center">AST</td>
<td valign="top" align="center">
<inline-formula>
<mml:math display="inline" id="im72">
<mml:mrow>
<mml:mn>5</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mn>6</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td valign="top" align="center">
<inline-formula>
<mml:math display="inline" id="im73">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi>N</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td valign="top" align="center">4</td>
</tr>
<tr>
<td valign="top" align="center">HTS-AT</td>
<td valign="top" align="center">
<inline-formula>
<mml:math display="inline" id="im74">
<mml:mrow>
<mml:mn>30</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mn>6</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td valign="top" align="center">
<inline-formula>
<mml:math display="inline" id="im75">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mi>N</mml:mi>
<mml:mi>D</mml:mi>
<mml:msup>
<mml:mi>w</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td valign="top" align="center">4</td>
</tr>
<tr>
<td valign="top" align="center">BEATs</td>
<td valign="top" align="center">
<inline-formula>
<mml:math display="inline" id="im76">
<mml:mrow>
<mml:mn>90</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mn>6</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td valign="top" align="center">
<inline-formula>
<mml:math display="inline" id="im77">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi>N</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td valign="top" align="center">5</td>
</tr>
<tr>
<td valign="top" align="center">SepTr</td>
<td valign="top" align="center">
<inline-formula>
<mml:math display="inline" id="im78">
<mml:mrow>
<mml:mn>10</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mn>10</mml:mn>
</mml:mrow>
<mml:mn>6</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td valign="top" align="center">
<inline-formula>
<mml:math display="inline" id="im79">
<mml:mrow>
<mml:mi>O</mml:mi>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:msup>
<mml:mi>N</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
<mml:mi>D</mml:mi>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td valign="top" align="center">
<bold>3</bold>
</td>
</tr>
<tr>
<td valign="top" align="center">ADDTr</td>
<td valign="top" align="center">
<bold>4.5 &#xd7; 10<sup>6</sup>
</bold>
</td>
<td valign="top" align="center">
<inline-formula>
<mml:math display="inline" id="im82">
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>O</mml:mi>
</mml:mstyle>
<mml:mrow>
<mml:mo stretchy="false">(</mml:mo>
<mml:mrow>
<mml:mstyle mathvariant="bold" mathsize="normal">
<mml:mi>N</mml:mi>
<mml:mi>D</mml:mi>
</mml:mstyle>
</mml:mrow>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td valign="top" align="center">
<bold>3</bold>
</td>
</tr>
</tbody>
</table>
<table-wrap-foot>
<fn>
<p>
<italic>N</italic> is the input sequence length, <italic>D</italic> denotes the representation dimension, and <italic>w</italic> is the window size. The bold values represent the best performance/results.</p>
</fn>
</table-wrap-foot>
</table-wrap>
</sec>
</sec>
<sec id="s4" sec-type="conclusion">
<label>4</label>
<title>Conclusion</title>
<p>This paper addresses the challenges of passive recognition of ship-radiated noise in underwater environments, characterized by inherent noise, reverberation, and time-varying acoustic channels, through the proposed ADDTr. By utilizing three-dimensional mel-spectrograms, the approach captures the temporal variations of target signals and ambient noise, enabling better distinguishability. The additional spatial dimension in the spectrograms allows for modeling reverberation effects and compensating for distortions, resulting in enhanced clarity of target signals.</p>
<p>The proposed ADDTr, a deep learning Transformer framework with additive attention, effectively models long-term dependencies and spatial information, allowing the model to focus on informative features and suppress noise. By incorporating the additive attention mechanism, our proposed model achieves a significant reduction in computation complexity, transitioning from quadratic complexity to linear complexity. This improvement in computational efficiency enables more efficient and scalable processing of the input data, making the approach highly practical for real-world applications. Comparative evaluations with state-of-the-art models on the ShipsEar dataset demonstrate the superior performance of the proposed approach, achieving the highest accuracy of 96.82% with lower computation costs.</p>
</sec>
<sec id="s5" sec-type="data-availability">
<title>Data availability statement</title>
<p>Publicly available datasets were analyzed in this study. This data can be found here: <ext-link ext-link-type="uri" xlink:href="http://atlanttic.uvigo.es/underwaternoise/">http://atlanttic.uvigo.es/underwaternoise/</ext-link>.</p>
</sec>
<sec id="s6" sec-type="author-contributions">
<title>Author contributions</title>
<p>YW: Writing &#x2013; original draft, Formal Analysis, Resources, Software, Visualization. HZ: Writing &#x2013; review &amp; editing, Conceptualization, Funding acquisition, Methodology. WH: Writing &#x2013; review &amp; editing, Data curation, Investigation, Validation.</p>
</sec>
</body>
<back>
<sec id="s7" sec-type="funding-information">
<title>Funding</title>
<p>The author(s) declare financial support was received for the research, authorship, and/or publication of this article. This work was financially supported by National Natural Science Foundation of China (NSFC:62271459), National Defense Science and Technology Innovation Special Zone Project: Marine Science and Technology Collaborative Innovation Center (22-05-CXZX-04-01-02), the Qingdao Postdoctoral Science Foundation (QDBSH20220202061), and the Fundamental Research Funds for the Central Universities, Ocean University of China (202313036).</p>
</sec>
<sec id="s8" sec-type="COI-statement">
<title>Conflict of interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<sec id="s9" sec-type="disclaimer">
<title>Publisher&#x2019;s note</title>
<p>All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.</p>
</sec>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Beltagy</surname> <given-names>I.</given-names>
</name>
<name>
<surname>Peters</surname> <given-names>M. E.</given-names>
</name>
<name>
<surname>Cohan</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Longformer: The long-document transformer</article-title>. <source>CoRR</source>. <volume>abs/2004.05150</volume>. <uri xlink:href="https://arxiv.org/abs/2004.05150">https://arxiv.org/abs/2004.05150</uri>
</citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Brown</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Mann</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Ryder</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Subbiah</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Kaplan</surname> <given-names>J. D.</given-names>
</name>
<name>
<surname>Dhariwal</surname> <given-names>P.</given-names>
</name>
<etal/>
</person-group>. (<year>2020</year>). <article-title>Language models are few-shot learners</article-title>. <source>CoRR</source> <volume>33</volume>, <fpage>1877</fpage>&#x2013;<lpage>1901</lpage>.</citation>
</ref>
<ref id="B3">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chen</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Du</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Zhu</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Ma</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Berg-Kirkpatrick</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Dubnov</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>HTS-AT: A hierarchical token-semantic audio transformer for sound classification and detection</article-title>. <source>CoRR</source> <volume>2202</volume>, <fpage>00874</fpage>. doi: <pub-id pub-id-type="doi">10.1109/ICASSP43922.2022.9746312</pub-id>
</citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Chen</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Tompkins</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Z.</given-names>
</name>
<etal/>
</person-group>. (<year>2022</year>). <source>Beats: Audio pre-training with acoustic tokenizers</source>.</citation>
</ref>
<ref id="B5">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Devlin</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Chang</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Lee</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Toutanova</surname> <given-names>K.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>BERT: pre-training of deep bidirectional transformers for language understanding</article-title>. <source>CoRR</source> <volume>1810</volume>, <elocation-id>4805</elocation-id>.
</citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Doan</surname> <given-names>V.-S.</given-names>
</name>
<name>
<surname>Huynh-The</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Kim</surname> <given-names>D.-S.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Underwater acoustic target classification based on dense convolutional neural network</article-title>. <source>IEEE Geosci. Remote Sens. Lett.</source> <volume>19</volume>, <fpage>1</fpage>&#x2013;<lpage>5</lpage>. doi: <pub-id pub-id-type="doi">10.1109/LGRS.2020.3029584</pub-id>
</citation>
</ref>
<ref id="B7">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Dosovitskiy</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Beyer</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Kolesnikov</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Weissenborn</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Zhai</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Unterthiner</surname> <given-names>T.</given-names>
</name>
<etal/>
</person-group>. (<year>2020</year>). <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>. <source>CoRR</source> <volume>2010</volume>, <fpage>11929</fpage>.</citation>
</ref>
<ref id="B8">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Esmaiel</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Xie</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Qasem</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Qi</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Multi-stage feature extraction and classification for ship-radiated noise</article-title>. <source>Sensors</source> <volume>22</volume>, <fpage>12</fpage>. doi: <pub-id pub-id-type="doi">10.3390/s22010112</pub-id>
</citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Feng</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Zhu</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>A transformer-based deep learning network for underwater acoustic target recognition</article-title>. <source>IEEE Geosci. Remote Sens. Lett.</source> <volume>19</volume>, <fpage>1</fpage>&#x2013;<lpage>5</lpage>. doi: <pub-id pub-id-type="doi">10.1109/LGRS.2022.3201396</pub-id>
</citation>
</ref>
<ref id="B10">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Filho</surname> <given-names>W. S.</given-names>
</name>
<name>
<surname>de Seixas</surname> <given-names>J. M.</given-names>
</name>
<name>
<surname>de Moura</surname> <given-names>N. N.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Preprocessing passive sonar signals for neural classification</article-title>. <source>IET Radar, Sonar &amp; Navigation</source> (<publisher-name>IET</publisher-name>) <volume>5</volume>, <fpage>605</fpage>&#x2013;<lpage>612</lpage>. doi: <pub-id pub-id-type="doi">10.1049/iet-rsn.2010.0157</pub-id>
</citation>
</ref>
<ref id="B11">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Frei</surname> <given-names>M. G.</given-names>
</name>
<name>
<surname>Osorio</surname> <given-names>I.</given-names>
</name>
</person-group> (<year>2007</year>). <article-title>Intrinsic time-scale decomposition: time-frequency-energy analysis and real-time filtering of non-stationary signals</article-title>. <source>Proc. R. Soc. London Ser. A</source> <volume>463</volume> (<issue>2078</issue>), <fpage>321</fpage>&#x2013;<lpage>342</lpage>.</citation>
</ref>
<ref id="B12">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Fu</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Xu</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Mi</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Kong</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>H.</given-names>
</name>
<etal/>
</person-group>. (<year>2019</year>). <article-title>Multi model-based distillation for sound event detection</article-title>. <source>IEICE Trans. Inf. Syst.</source> <volume>E102.D</volume> (<issue>10</issue>), <fpage>2055</fpage>&#x2013;<lpage>2058</lpage>. doi: <pub-id pub-id-type="doi">10.1587/transinf.2019EDL8062</pub-id>
</citation>
</ref>
<ref id="B13">
<citation citation-type="journal">
<person-group person-group-type="author">    <name>
<surname>Gabor</surname> <given-names>D.</given-names>
</name>
</person-group> (<year>1946</year>). <article-title>The analysis of complex signals and communication systems</article-title>. <source>Journal of the Institution of Electrical Engineers-Part III: Radio and Communication Engineering</source> (<publisher-name>IET</publisher-name>) <volume>93</volume> (<issue>26</issue>), <fpage>429</fpage>&#x2013;<lpage>457</lpage>. doi: <pub-id pub-id-type="doi">10.1049/ji-3-2.1946.0074</pub-id>
</citation>
</ref>
<ref id="B14">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gao</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Woo</surname> <given-names>W. L.</given-names>
</name>
<name>
<surname>Khor</surname> <given-names>L.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Cochleagram-based audio pattern separation using two-dimensional non-negative matrix factorization with automatic sparsity adaptation</article-title>. <source>J. Acoustical Soc. America</source> <volume>135</volume> (<issue>3</issue>), <fpage>1171</fpage>&#x2013;<lpage>1185</lpage>. doi: <pub-id pub-id-type="doi">10.1121/1.4864294</pub-id>
</citation>
</ref>
<ref id="B15">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Goldobin</surname> <given-names>D. S.</given-names>
</name>
<name>
<surname>Teramae</surname> <given-names>J.-n.</given-names>
</name>
<name>
<surname>Nakao</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Ermentrout</surname> <given-names>G. B.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Dynamics of limit-cycle oscillators subject to general noise</article-title>. <source>Phys. Rev. Lett.</source> <volume>105</volume> (<issue>15</issue>), <elocation-id>154101</elocation-id>. doi: <pub-id pub-id-type="doi">10.1103/PhysRevLett.105.154101</pub-id>
</citation>
</ref>
<ref id="B16">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Gong</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Chung</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Glass</surname> <given-names>J. R.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>AST: audio spectrogram transformer</article-title>. <source>CoRR</source> <volume>2104</volume>, <fpage>01778</fpage>. doi: <pub-id pub-id-type="doi">10.21437/Interspeech.2021-698</pub-id>
</citation>
</ref>
<ref id="B17">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Hermansky</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>1980</year>). <article-title>A perceptual linear predictive (plp) analysis of speech</article-title>. <source>J. Acoustical Soc. America</source> (<publisher-name>Acoustical Society of America</publisher-name>) <volume>87</volume> (<issue>4</issue>), <fpage>1738</fpage>&#x2013;<lpage>1752</lpage>.</citation>
</ref>
<ref id="B18">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Jia</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Mei</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Combining adaptive sparse nmf feature extraction and soft mask to optimize dnn for speech enhancement</article-title>. <source>Appl. Acoustics</source> <volume>171</volume>, <fpage>107666</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.apacoust.2020.107666</pub-id>
</citation>
</ref>
<ref id="B19">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Khishe</surname> <given-names>M.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Drw-ae: A deep recurrent-wavelet auto encoder for underwater target recognition</article-title>. <source>IEEE J. Oceanic Eng.</source> <volume>47</volume> (<issue>4</issue>), <fpage>1083</fpage>&#x2013;<lpage>1098</lpage>. doi: <pub-id pub-id-type="doi">10.1109/JOE.2022.3180764</pub-id>
</citation>
</ref>
<ref id="B20">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kitaev</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Kaiser</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Levskaya</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Reformer: The efficient transformer</source>.</citation>
</ref>
<ref id="B21">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>The underwater acoustic target timbre perception and recognition based on the auditory inspired deep convolutional neural network</article-title>. <source>Appl. Acoustics</source> <volume>182</volume>, <fpage>108210</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.apacoust.2021.108210</pub-id>
</citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Yuan</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Guo</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Ma</surname> <given-names>Y.</given-names>
</name>
<etal/>
</person-group>. (<year>2021</year>). <article-title>Lungattn: advanced lung sound classification using attention mechanism with dual tqwt and triple stft spectrogram</article-title>. <source>Physiol. Measurement</source> <volume>42</volume> (<issue>10</issue>), <fpage>105006</fpage>. doi: <pub-id pub-id-type="doi">10.1088/1361-6579/ac27b9</pub-id>
</citation>
</ref>
<ref id="B23">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Li</surname> <given-names>P.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Lan</surname> <given-names>Q.</given-names>
</name>
<name>
<surname>Xiao</surname> <given-names>W.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Stm: Spectrogram transformer model for underwater acoustic target recognition</article-title>. <source>J. Mar. Sci. Eng.</source> <volume>10</volume> (<issue>10</issue>), <fpage>1428</fpage>. doi: <pub-id pub-id-type="doi">10.3390/jmse10101428</pub-id>
</citation>
</ref>
<ref id="B24">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Liu</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Lin</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Cao</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Hu</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Wei</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>Z.</given-names>
</name>
<etal/>
</person-group>. (<year>2021</year>). <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>. <source>CoRR</source> <volume>2103</volume>, <fpage>14030</fpage>. doi: <pub-id pub-id-type="doi">10.1109/ICCV48922.2021.00986</pub-id>
</citation>
</ref>
<ref id="B25">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Liu</surname> <given-names>Y.-X.</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Y.-H.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>Lung sound classification based on hilbert-huang transform features and multilayer perceptron network</article-title>. <source>2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)</source> (<publisher-name>IEEE</publisher-name>), <fpage>765</fpage>&#x2013;<lpage>768</lpage>.</citation>
</ref>
<ref id="B26">
<citation citation-type="web">
<person-group person-group-type="author">
<name>
<surname>Lurton</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2010</year>) <source>An introduction to underwater acoustics: Principles and applications</source>. Available at: <uri xlink:href="https://api.semanticscholar.org/CorpusID:109354879">https://api.semanticscholar.org/CorpusID:109354879</uri>.</citation>
</ref>
<ref id="B27">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Mallat</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>1989</year>). <article-title>A theory for multiresolution signal decomposition: the wavelet representation</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>11</volume> (<issue>7</issue>), <fpage>674</fpage>&#x2013;<lpage>693</lpage>. doi: <pub-id pub-id-type="doi">10.1109/34.192463</pub-id>
</citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Monaco</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Amoroso</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Bellantuono</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Pantaleo</surname> <given-names>E.</given-names>
</name>
<name>
<surname>Tangaro</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Bellotti</surname> <given-names>R.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Multi-time-scale features for accurate respiratory sound classification</article-title>. <source>Appl. Sci.</source> <volume>10</volume> (<issue>23</issue>), <elocation-id>8606</elocation-id>. doi: <pub-id pub-id-type="doi">10.3390/app10238606</pub-id>
</citation>
</ref>
<ref id="B29">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Purwins</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Virtanen</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Schl&#xfc;ter</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Chang</surname> <given-names>S.-Y.</given-names>
</name>
<name>
<surname>Sainath</surname> <given-names>T.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Deep learning for audio signal processing</article-title>. <source>IEEE Journal of Selected Topics in Signal Processing</source> (<publisher-name>IEEE</publisher-name>) <volume>13</volume> (<issue>2</issue>), <fpage>206</fpage>&#x2013;<lpage>219</lpage>. doi: <pub-id pub-id-type="doi">10.1109/JSTSP.2019.2908700</pub-id>
</citation>
</ref>
<ref id="B30">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ristea</surname> <given-names>N.-C.</given-names>
</name>
<name>
<surname>Ionescu</surname> <given-names>R. T.</given-names>
</name>
<name>
<surname>Khan</surname> <given-names>F. S.</given-names>
</name>
</person-group> (<year>2022</year>). <source>Septr: Separable transformer for audio spectrogram processing</source>.</citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Salomons</surname> <given-names>E. L.</given-names>
</name>
<name>
<surname>Havinga</surname> <given-names>P. J.</given-names>
</name>
</person-group> (<year>2015</year>). <article-title>A survey on the feasibility of sound classification on wireless sensor node</article-title>. <source>Sensors</source> <volume>15</volume> (<issue>4</issue>), <fpage>7462</fpage>&#x2013;<lpage>7498</lpage>. doi: <pub-id pub-id-type="doi">10.3390/s150407462</pub-id>
</citation>
</ref>
<ref id="B32">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Santos-Dom&#xed;nguez</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Torres-Guijarro</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Cardenal-L&#xf3;pez</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Pena-Gimenez</surname> <given-names>A.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Shipsear: An underwater vessel noise database</article-title>. <source>Appl. Acoustics</source> <volume>113</volume>, <fpage>64</fpage>&#x2013;<lpage>69</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.apacoust.2016.06.008</pub-id>
</citation>
</ref>
<ref id="B33">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Selesnick</surname> <given-names>I.</given-names>
</name>
</person-group> (<year>2011</year>). <article-title>Resonance-based signal decomposition: A new sparsity-enabled signal analysis method</article-title>. <source>Signal Process.</source> <volume>91</volume> (<issue>12</issue>), <fpage>2793</fpage>&#x2013;<lpage>2809</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.sigpro.2010.10.018</pub-id>
</citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Shen</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Yang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Xu</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Sheng</surname> <given-names>M.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Auditory inspired convolutional neural networks for ship type classification with raw hydrophone data</article-title>. <source>Entropy</source> <volume>20</volume> (<issue>12</issue>), <elocation-id>990</elocation-id>. doi: <pub-id pub-id-type="doi">10.3390/e20120990</pub-id>
</citation>
</ref>
<ref id="B35">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Song</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Liu</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Shen</surname> <given-names>T.</given-names>
</name>
</person-group> (<year>2022</year>). <article-title>Method of underwater acoustic signal denoising based on dual-path transformer network</article-title>. <source>IEEE Access</source>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2022.3224752</pub-id>
</citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Su</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>K.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Zhou</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Madani</surname> <given-names>K.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Performance analysis of multiple aggregated acoustic features for environment sound classification</article-title>. <source>Appl. Acoustics</source> <volume>158</volume>, <fpage>107050</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.apacoust.2019.107050</pub-id>
</citation>
</ref>
<ref id="B37">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Tay</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Bahri</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Metzler</surname> <given-names>D.</given-names>
</name>
<name>
<surname>Juan</surname> <given-names>D.-C.</given-names>
</name>
<name>
<surname>Zhao</surname> <given-names>Z.</given-names>
</name>
<name>
<surname>Zheng</surname> <given-names>C.</given-names>
</name>
</person-group> (<year>2021</year>). <source>Synthesizer: Rethinking self-attention in transformer models</source>.</citation>
</ref>
<ref id="B38">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Tong</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Zhang</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Ge</surname> <given-names>Y.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Classification and recognition of underwater target based on mfcc feature extraction</article-title>. <source>2020 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC)</source> (<publisher-name>IEEE</publisher-name>), <fpage>1</fpage>&#x2013;<lpage>4</lpage>.</citation>
</ref>
<ref id="B39">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Tuncer</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Akbal</surname> <given-names>E.</given-names>
</name>
<name>
<surname>Dogan</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Multileveled ternary pattern and iterative relieff based bird sound classification</article-title>. <source>Appl. Acoustics</source> <volume>176</volume>, <fpage>107866</fpage>. doi: <pub-id pub-id-type="doi">10.1016/j.apacoust.2020.107866</pub-id>
</citation>
</ref>
<ref id="B40">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Vaswani</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Shazeer</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Parmar</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Uszkoreit</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Jones</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Gomez</surname> <given-names>A. N.</given-names>
</name>
<etal/>
</person-group>. (<year>2017</year>). <article-title>Attention is all you need</article-title>. <source>CoRR</source> <volume>1706</volume>, <fpage>03762</fpage>.</citation>
</ref>
<ref id="B41">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Virtanen</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Cemgil</surname> <given-names>A. T.</given-names>
</name>
</person-group> (<year>2009</year>). <article-title>Mixtures of gamma priors for non-negative matrix factorization based speech separation</article-title>. <source>Independent Component Analysis and Signal Separation: 8th International Conference, ICA 2009, Paraty, Brazil, March 15-18, 2009. Proceedings 8</source> (<publisher-name>Springer</publisher-name>), <fpage>646</fpage>&#x2013;<lpage>653</lpage>.</citation>
</ref>
<ref id="B42">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>Z.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Feature extraction of ship-radiated noise based on intrinsic time-scale decomposition and a statistical complexity measure</article-title>. <source>Entropy</source> <volume>21</volume> (<issue>11</issue>), <fpage>1079</fpage>. doi: <pub-id pub-id-type="doi">10.3390/e21111079</pub-id>
</citation>
</ref>
<ref id="B43">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Wang</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>B. Z.</given-names>
</name>
<name>
<surname>Khabsa</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Fang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Ma</surname> <given-names>H.</given-names>
</name>
</person-group> (<year>2020</year>). <source>Linformer: Self-attention with linear complexity</source>.</citation>
</ref>
<ref id="B44">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wu</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>F.</given-names>
</name>
<name>
<surname>Qi</surname> <given-names>T.</given-names>
</name>
<name>
<surname>Huang</surname> <given-names>Y.</given-names>
</name>
</person-group> (<year>2021</year>). <article-title>Hi-transformer: Hierarchical interactive transformer for efficient and effective long document modeling</article-title>. <source>CoRR</source>. <volume>abs/2106.01040</volume>. doi: <pub-id pub-id-type="doi">10.18653/v1/2021.acl-short.107</pub-id>
</citation>
</ref>
<ref id="B45">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yan</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Junejo</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Cheng</surname> <given-names>E.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Resonance-based time-frequency manifold for feature extraction of ship-radiated noise</article-title>. <source>Sensors</source> <volume>18</volume>, <fpage>936</fpage>. doi: <pub-id pub-id-type="doi">10.3390/s18040936</pub-id>
</citation>
</ref>
<ref id="B46">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Yang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Gan</surname> <given-names>A.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Pan</surname> <given-names>Y.</given-names>
</name>
<name>
<surname>Tang</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Underwater acoustic target recognition using svm ensemble via weighted sample and feature selection</article-title>. <source>2016 13th International Bhurban Conference on Applied Sciences and Technology (IBCAST)</source> (<publisher-name>IEEE</publisher-name>), <fpage>522</fpage>&#x2013;<lpage>527</lpage>.</citation>
</ref>
<ref id="B47">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Li</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Shen</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Xu</surname> <given-names>G.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>A deep convolutional neural network inspired by auditory perception for underwater acoustic target recognition</article-title>. <source>Sensors</source> (<publisher-name>MDPI</publisher-name>) <volume>19</volume> (<issue>5</issue>), <elocation-id>1104</elocation-id>. doi:&#xa0;<pub-id pub-id-type="doi">10.3390/s19051104</pub-id>
</citation>
</ref>
<ref id="B48">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yang</surname> <given-names>C.-H.</given-names>
</name>
<name>
<surname>Wu</surname> <given-names>C.-H.</given-names>
</name>
<name>
<surname>Hsieh</surname> <given-names>C.-M.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Long short-term memory recurrent neural network for tidal level forecasting</article-title>. <source>IEEE Access</source> <volume>8</volume>, <fpage>1</fpage>&#x2013;<lpage>1</lpage>. doi: <pub-id pub-id-type="doi">10.1109/ACCESS.2020.3017089</pub-id>
</citation>
</ref>
<ref id="B49">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Yu</surname> <given-names>L.</given-names>
</name>
<name>
<surname>Ma</surname> <given-names>N.</given-names>
</name>
<name>
<surname>Gu</surname> <given-names>X.</given-names>
</name>
</person-group> (<year>2016</year>). <article-title>Early detection of parametric roll by application of the incremental real-time hilbert&#x2013;huang transform</article-title>. <source>Ocean Eng.</source> <volume>113</volume>, <fpage>224</fpage>&#x2013;<lpage>236</lpage>. doi: <pub-id pub-id-type="doi">10.1016/j.oceaneng.2015.12.050</pub-id>
</citation>
</ref>
<ref id="B50">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zaheer</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Guruganesh</surname> <given-names>G.</given-names>
</name>
<name>
<surname>Dubey</surname> <given-names>K. A.</given-names>
</name>
<name>
<surname>Ainslie</surname> <given-names>J.</given-names>
</name>
<name>
<surname>Alberti</surname> <given-names>C.</given-names>
</name>
<name>
<surname>Ontanon</surname> <given-names>S.</given-names>
</name>
<etal/>
</person-group>. (<year>2020</year>). <article-title>Big bird: Transformers for longer sequences</article-title>. <source>Advances in neural information processing systems</source> <volume>33</volume>, <fpage>17283</fpage>&#x2013;<lpage>17297</lpage>.</citation>
</ref>
<ref id="B51">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zeng</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Wang</surname> <given-names>S.</given-names>
</name>
</person-group> (<year>2014</year>). <article-title>Underwater sound classification based on gammatone filter bank and hilbert-huang transform</article-title>. <source>2014 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC)</source> (<publisher-name>IEEE</publisher-name>), <fpage>707</fpage>&#x2013;<lpage>710</lpage>.</citation>
</ref>
<ref id="B52">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Junejo</surname> <given-names>N. U. R.</given-names>
</name>
<name>
<surname>Sun</surname> <given-names>W.</given-names>
</name>
<name>
<surname>Chen</surname> <given-names>H.</given-names>
</name>
<name>
<surname>Yan</surname> <given-names>J.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Adaptive variational mode time-frequency analysis of ship radiated noise</article-title>. <source>2020 7th international conference on information science and control engineering (ICISCE)</source> (<publisher-name>IEEE</publisher-name>) <fpage>1652</fpage>&#x2013;<lpage>1656</lpage>. </citation>
</ref>
<ref id="B53">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zhao</surname> <given-names>M.</given-names>
</name>
<name>
<surname>Zhong</surname> <given-names>S.</given-names>
</name>
<name>
<surname>Fu</surname> <given-names>X.</given-names>
</name>
<name>
<surname>Tang</surname> <given-names>B.</given-names>
</name>
<name>
<surname>Pecht</surname> <given-names>M.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Deep residual shrinkage networks for fault diagnosis</article-title>. <source>IEEE Trans. Ind. Inf.</source> <volume>16</volume> (<issue>7</issue>), <fpage>4681</fpage>&#x2013;<lpage>4690</lpage>. doi:&#xa0;<pub-id pub-id-type="doi">10.1109/TII.2019.2943898</pub-id>
</citation>
</ref>
</ref-list>
</back>
</article>