<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//NLM//DTD Journal Publishing DTD v2.3 20070202//EN" "journalpublishing.dtd">
<article article-type="research-article" dtd-version="2.3" xml:lang="EN" xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink">
<front>
<journal-meta>
<journal-id journal-id-type="publisher-id">Front. Artif. Intell.</journal-id>
<journal-title>Frontiers in Artificial Intelligence</journal-title>
<abbrev-journal-title abbrev-type="pubmed">Front. Artif. Intell.</abbrev-journal-title>
<issn pub-type="epub">2624-8212</issn>
<publisher>
<publisher-name>Frontiers Media S.A.</publisher-name>
</publisher>
</journal-meta>
<article-meta>
<article-id pub-id-type="publisher-id">638299</article-id>
<article-id pub-id-type="doi">10.3389/frai.2021.638299</article-id>
<article-categories>
<subj-group subj-group-type="heading">
<subject>Artificial Intelligence</subject>
<subj-group>
<subject>Original Research</subject>
</subj-group>
</subj-group>
</article-categories>
<title-group>
<article-title>Learning Medical Materials From Radiography Images</article-title>
<alt-title alt-title-type="left-running-head">Molder et&#x20;al.</alt-title>
<alt-title alt-title-type="right-running-head">Learning Medical Materials From Images</alt-title>
</title-group>
<contrib-group>
<contrib contrib-type="author">
<name>
<surname>Molder</surname>
<given-names>Carson</given-names>
</name>
<uri xlink:href="https://loop.frontiersin.org/people/1048547/overview"/>
</contrib>
<contrib contrib-type="author">
<name>
<surname>Lowe</surname>
<given-names>Benjamin</given-names>
</name>
</contrib>
<contrib contrib-type="author" corresp="yes">
<name>
<surname>Zhan</surname>
<given-names>Justin</given-names>
</name>
<xref ref-type="corresp" rid="c001">&#x2a;</xref>
<uri xlink:href="https://loop.frontiersin.org/people/981501/overview"/>
</contrib>
</contrib-group>
<aff>Data Science and Artificial Intelligence Lab, Department of Computer Science and Computer Engineering, College of Engineering, University of Arkansas, <addr-line>Fayetteville</addr-line>, <addr-line>AR</addr-line>, <country>United&#x20;States</country>
</aff>
<author-notes>
<fn fn-type="edited-by">
<p>
<bold>Edited by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/59759/overview">Tuan D. Pham</ext-link>, Prince Mohammad bin Fahd University, Saudi Arabia</p>
</fn>
<fn fn-type="edited-by">
<p>
<bold>Reviewed by:</bold> <ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/204681/overview">Tiziana Sanavia</ext-link>, Harvard Medical School, United&#x20;States</p>
<p>
<ext-link ext-link-type="uri" xlink:href="https://loop.frontiersin.org/people/1316818/overview">Tehseen Zia</ext-link>, COMSATS University, Pakistan</p>
</fn>
<corresp id="c001">&#x2a;Correspondence: Justin Zhan, <email>jzhan@uark.edu</email>
</corresp>
<fn fn-type="other">
<p>This article was submitted to Medicine and Public Health, a section of the journal Frontiers in Artificial Intelligence</p>
</fn>
</author-notes>
<pub-date pub-type="epub">
<day>18</day>
<month>06</month>
<year>2021</year>
</pub-date>
<pub-date pub-type="collection">
<year>2021</year>
</pub-date>
<volume>4</volume>
<elocation-id>638299</elocation-id>
<history>
<date date-type="received">
<day>06</day>
<month>12</month>
<year>2020</year>
</date>
<date date-type="accepted">
<day>26</day>
<month>05</month>
<year>2021</year>
</date>
</history>
<permissions>
<copyright-statement>Copyright &#xa9; 2021 Molder, Lowe and Zhan.</copyright-statement>
<copyright-year>2021</copyright-year>
<copyright-holder>Molder, Lowe and Zhan</copyright-holder>
<license xlink:href="http://creativecommons.org/licenses/by/4.0/">
<p>This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these&#x20;terms.</p>
</license>
</permissions>
<abstract>
<p>Deep learning models have been shown to be effective for material analysis, a subfield of computer vision, on natural images. In medicine, deep learning systems have been shown to more accurately analyze radiography images than algorithmic approaches and even experts. However, one major roadblock to applying deep learning-based material analysis on radiography images is a lack of material annotations accompanying image sets. To solve this, we first introduce an automated procedure to augment annotated radiography images into a set of material samples. Next, using a novel Siamese neural network that compares material sample pairs, called D-CNN, we demonstrate how to learn a perceptual distance metric between material categories. This system replicates the actions of human annotators by discovering attributes that encode traits that distinguish materials in radiography images. Finally, we update and apply MAC-CNN, a material recognition neural network, to demonstrate this system on a dataset of knee X-rays and brain MRIs with tumors. Experiments show that this system has strong predictive power on these radiography images, achieving 92.8% accuracy at predicting the material present in a local region of an image. Our system also draws interesting parallels between human perception of natural materials and materials in radiography images.</p>
</abstract>
<kwd-group>
<kwd>computer vision</kwd>
<kwd>material analysis</kwd>
<kwd>convolutional neural networks</kwd>
<kwd>siamese neural networks</kwd>
<kwd>image classification</kwd>
<kwd>medical imaging</kwd>
<kwd>radiography</kwd>
</kwd-group>
</article-meta>
</front>
<body>
<sec id="s1">
<title>1 Introduction</title>
<p>Computer vision, the study of using computers to extract information from images and videos, has become embedded in new, broad medical applications due to the high accuracy that deep learning models can achieve. Recent deep learning models have been shown to be effective at solving a variety of vision tasks in medical image analysis, such as analyzing chest X-rays (<xref ref-type="bibr" rid="B16">Irvin et&#x20;al., 2019</xref>; <xref ref-type="bibr" rid="B33">Wang et&#x20;al., 2019</xref>), segmenting brain scans (<xref ref-type="bibr" rid="B20">Lai et&#x20;al., 2019</xref>), and annotating pressure wounds (<xref ref-type="bibr" rid="B36">Zahia et&#x20;al., 2018</xref>).</p>
<p>However, such deep learning models are greatly affected by the quality of the data used to train them and often sacrifice interpretability for increased accuracy. A lack of quality data, especially in expert domains like medicine, limits the tasks to which computer vision can be applied. One such task, material analysis, examines low-level details to learn about the textural and physical makeup of objects in images. To make this task feasible without relying on experts to create hand-crafted textural datasets, existing datasets need to be augmented to encode textural knowledge.</p>
<p>Medical images contain a great amount of textural data that has been underexplored. Intuitively, different regions of a medical image exhibit low-level characteristics that imply what kind of material is present in a portion of an image. <xref ref-type="fig" rid="F1">Figure&#x20;1</xref> demonstrates this for a knee X-ray and brain MRI. In this example, a &#x201c;spongy&#x201d; section of an X-ray image appears to indicate that the section contains bone, while a brighter region of a brain MRI indicates the presence of a tumor. Many medical image datasets capture such regions of textural interest but do not explicitly encode these textures. For example, brain MRI datasets often include segmentation masks for brain tumors (<xref ref-type="bibr" rid="B7">Cheng, 2017</xref>; <xref ref-type="bibr" rid="B26">Schmainda and Prah, 2018</xref>) that encode these regions, but without explicit textural context.</p>
<fig id="F1" position="float">
<label>FIGURE 1</label>
<caption>
<p>An example of image patches taken from a knee X-ray <bold>(A)</bold> and a brain MRI with a tumor <bold>(B)</bold> (<xref ref-type="bibr" rid="B7">Cheng, 2017</xref>). Although some categories, such as bone, have apparent material distinctions, others such as tumor (in red) and healthy (in blue) brain tissue may not have obvious material differences. However, our system can still discern these differences since the images are expertly labeled, while assigning smaller perceptual distances (similarity) between pairs of categories more similar to each&#x20;other.</p>
</caption>
<graphic xlink:href="frai-04-638299-g001.tif"/>
</fig>
<p>While these masks delineate the region where a tumor resides in an image, they give no textural information about the tumors themselves. To obtain this textural information, one must either hire experts to create a dataset of such textures, or leverage these pre-existing annotations in a way that automatically draws out their relationships with the underlying textures and materials. We propose a method to achieve the latter.</p>
<p>In this paper, we introduce a method to analyze medical radiography images with or without such generic annotations to generate a dataset of image patches representing different textures found in medical images. Our method additionally learns an encoding of the relationship between the textural categories in these images and generates a set of machine-discovered material attributes. These material categories and attributes are then used to classify textures found within medical images both locally and over an entire image. Finally, we evaluate our method on a composite dataset of knee X-rays and brain MRIs, observing the attributes learned while also examining how the network automatically performs knowledge transfer for textures between different image modalities.</p>
<p>Our method has the following novel contributions. First, we propose a method to automatically generate a medical material texture dataset from pre-annotated radiography images. Second, we propose a neural network, D-CNN, that can <italic>automatically</italic> learn a distance metric between different medical materials without human supervision. Third, we upgrade MAC-CNN, a material analysis neural network from prior work (<xref ref-type="bibr" rid="B28">Schwartz and Nishino, 2020</xref>), to use the ResNet (<xref ref-type="bibr" rid="B14">He et&#x20;al., 2015</xref>) architecture, which maintains its high accuracy while scaling better to deeper networks.</p>
<p>The remainder of the paper is structured as follows. In <xref ref-type="sec" rid="s2">Section 2</xref>, we discuss the methodology of our system. In <xref ref-type="sec" rid="s3">Section 3</xref>, we evaluate how our system performs on the composite dataset of knee X-rays and brain MRIs. Finally, in <xref ref-type="sec" rid="s4">Section 4</xref>, we review related work and conclude.</p>
</sec>
<sec id="s2">
<title>2 Materials and Methods</title>
<p>At a high level, our approach uses two convolutional neural network (CNN) architectures to predict the materials that appear in small image patches. These image patches are sourced from full radiography images. For material categories that require expertise to properly label, such as brain tumor tissue in a brain MRI, the patch&#x2019;s material label is sourced from an expert mask. For more recognizable materials, such as bone and the image background, these labels are sourced automatically based on a region&#x2019;s average brightness.</p>
<p>The CNNs learn these material classifications while respecting an embedding that encodes the relative difference of pairs of categories, analogous to word embeddings in natural language processing. The system&#x2019;s material category classification for each image patch is a <italic>K</italic>-long vector where <italic>K</italic> is the number of material categories to be classified, and the system&#x2019;s material attribute classification is an <italic>M</italic>-long vector where <italic>M</italic> is the selected number of material attributes to be discovered.</p>
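<p>As a minimal sketch of the two prediction vectors described above (using NumPy, with example sizes <italic>K</italic> = 4 and <italic>M</italic> = 8 and random stand-in logits rather than real network outputs):</p>

```python
import numpy as np

K, M = 4, 8  # number of material categories and discovered attributes (example sizes)

def softmax(z):
    # Normalize logits into a probability distribution over categories
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
logits_cat = rng.standard_normal(K)  # stand-in for the network's category head
logits_att = rng.standard_normal(M)  # stand-in for the network's attribute head

category_pred = softmax(logits_cat)                 # K-long material category vector
attribute_pred = 1.0 / (1.0 + np.exp(-logits_att))  # M-long material attribute vector
```

The category vector sums to one (a distribution over the <italic>K</italic> categories), while each attribute activation lies independently in (0, 1).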
<p>To ensure our network is using accurately categorized data, we introduce a thorough patch generation and categorization process on expertly annotated images in <xref ref-type="sec" rid="s2-1">Section 2.1.</xref> Then, the process to learn the perceptual distances between material categories and encode them in a distance matrix is discussed in <xref ref-type="sec" rid="s2-2">Section 2.2.</xref> In <xref ref-type="sec" rid="s2-3">Section 2.3</xref>, we present the discovery process for another matrix that encodes both the material categories&#x2019; distances stored in the distance matrix and a new set of material attributes. Finally, in <xref ref-type="sec" rid="s2-4">Section 2.4</xref>, we introduce the MAC-CNN, which uses this matrix to categorize local image patches into material categories and material attributes. A summary of the notations used is presented in <xref ref-type="table" rid="T1">Table&#x20;1</xref>.</p>
<table-wrap id="T1" position="float">
<label>TABLE 1</label>
<caption>
<p>Summary of notations.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="left">Notation</th>
<th align="center">Definition</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td align="left">
<italic>T</italic>
</td>
<td align="left">Mask tolerance for a given patch</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf119">
<mml:math id="m135">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">The average brightness value of a given patch</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf120">
<mml:math id="m136">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">The minimum and maximum average brightness allowed</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf121">
<mml:math id="m137">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mn>0</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">The maximum average brightness for the null class</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf122">
<mml:math id="m138">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">C</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">The set of patches of category <italic>i</italic>
</td>
</tr>
<tr>
<td align="left">
<italic>N</italic>
</td>
<td align="left">The number of patches generated</td>
</tr>
<tr>
<td align="left">
<italic>k</italic>, <italic>K</italic>
</td>
<td align="left">The number of material categories (human)</td>
</tr>
<tr>
<td align="left">
<italic>m</italic>, <italic>M</italic>
</td>
<td align="left">The number of material attributes (generated)</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf123">
<mml:math id="m139">
<mml:mi>&#x3b3;</mml:mi>
</mml:math>
</inline-formula>
</td>
<td align="left">Weight hyperparameter for minimization objectives</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf124">
<mml:math id="m140">
<mml:mi mathvariant="bold">&#x398;</mml:mi>
</mml:math>
</inline-formula>
</td>
<td align="left">Network parameters</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf125">
<mml:math id="m141">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Optimized network parameters</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf126">
<mml:math id="m142">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mi mathvariant="bold">r</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Set of reference images</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf127">
<mml:math id="m143">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mi mathvariant="bold">c</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Set of comparison images</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf128">
<mml:math id="m144">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi mathvariant="bold">n</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi mathvariant="bold">c</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf129">
<mml:math id="m145">
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">D-CNN prediction on reference and comparison sets</td>
</tr>
<tr>
<td align="left">
<italic>Y</italic>
</td>
<td align="left">True similarity value for reference and comparison patch</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf130">
<mml:math id="m146">
<mml:mi mathvariant="bold">p</mml:mi>
</mml:math>
</inline-formula>
</td>
<td align="left">D-CNN vector of binary similarity decisions</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf131">
<mml:math id="m147">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula>
</td>
<td align="left">
<inline-formula id="inf132">
<mml:math id="m148">
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>K</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> distance matrix between material categories</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf133">
<mml:math id="m149">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula>
</td>
<td align="left">
<inline-formula id="inf134">
<mml:math id="m150">
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> material category/attribute matrix</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf135">
<mml:math id="m151">
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">A</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Gaussian kernel density estimate of <inline-formula id="inf136">
<mml:math id="m152">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> at point <italic>p</italic>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf137">
<mml:math id="m153">
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Beta distribution with parameters <italic>a</italic>, <italic>b</italic> at point <italic>p</italic>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf138">
<mml:math id="m154">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold">A</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Optimized <inline-formula id="inf139">
<mml:math id="m155">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> matrix</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf140">
<mml:math id="m156">
<mml:mi mathvariant="bold">X</mml:mi>
</mml:math>
</inline-formula>
</td>
<td align="left">Training set of image patches for MAC-CNN</td>
</tr>
<tr>
<td align="left">
<italic>T</italic>
</td>
<td align="left">Pairs <inline-formula id="inf141">
<mml:math id="m157">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> of the set <inline-formula id="inf142">
<mml:math id="m158">
<mml:mi mathvariant="bold">X</mml:mi>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf143">
<mml:math id="m159">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Raw feature vectors of image patch <italic>i</italic>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf144">
<mml:math id="m160">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">One-hot encoded label of image patch <italic>i</italic>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf145">
<mml:math id="m161">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">MAC-CNN prediction on image patch <inline-formula id="inf146">
<mml:math id="m162">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
<tr>
<td align="left">
<inline-formula id="inf147">
<mml:math id="m163">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>T</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Equivalent to <inline-formula id="inf148">
<mml:math id="m164">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> but while also considering label <inline-formula id="inf149">
<mml:math id="m165">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<sec id="s2-1">
<title>2.1 Patch Selection and Categorization</title>
<p>The first component of the system is selecting and categorizing patches from the medical images so that every patch corresponds highly to its assigned category. Since images vary widely within medicine, such as the differences between X-rays and MRIs, it is important to normalize the images in such a way that the content and annotations are preserved while removing variations that may mislead the system.</p>
<p>Each specific image mode or dataset may use a different approach to patch generation depending on the nature of the source data. The following steps are used to generate patches of background, brain, bone, and tumor categories, but this system can be used to generate image patches in many different medical applications.</p>
<p>To generate the medical-category image patches used to evaluate the system, the first step is to invert negatives (images where the brightest regions indicate dark areas). Then, each image&#x2019;s raw features are normalized to the range <inline-formula id="inf2">
<mml:math id="m3">
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0,1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, and <xref ref-type="other" rid="alg1">Algorithm 1</xref> is used to generate patches.</p>
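<p>This preprocessing step can be sketched as follows (NumPy; the function name is our own illustration, not part of the published system):</p>

```python
import numpy as np

def normalize_image(img, is_negative=False):
    """Prepare one radiography image for patch generation."""
    img = img.astype(np.float64)
    if is_negative:
        # Invert negatives so bright values again indicate bright structures
        img = img.max() - img
    # Normalize the raw features to the range [0, 1]
    lo, hi = img.min(), img.max()
    if hi == lo:
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)
```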
<table-wrap id="alg1" position="float">
<label>Algorithm 1</label>
<caption>
<p>Patch categorization procedure</p>
</caption>
<table>
<tbody>
<tr>
<td>
<inline-graphic xlink:href="frai-04-638299-fx1.tif"/>
</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>Some images may have expertly annotated masks&#x2014;like a brain tumor in an MRI. Other images&#x2014;like the knee X-rays in our experiment&#x2014;may lack masks and labels, but the categories to be analyzed are simple enough to be assigned automatically. This reduces the detail of the dataset, but still yields useful categories for training, which may even be applicable in other image modes. We call material categories that are expertly annotated (such as &#x201c;tumor&#x201d;) <italic>expert categories</italic>, while non-annotated material categories (like &#x201c;bone&#x201d; for the knee X-rays) are called <italic>na&#xef;ve categories</italic>, since the na&#xef;ve assumption is made that the average brightness of an image region corresponds to its category.</p>
<p>A third type of material category, the <italic>null category</italic>, corresponds to a category that does not contain useful information, but when isolated can improve the model&#x2019;s ability to learn the other categories. For the cases of X-rays and MRIs, the null category is derived from the image background.</p>
<p>We believe that brightness constraints are a useful way to extract na&#xef;ve categories in most cases. Generally, extremely bright regions and dark regions lack interesting texture data&#x2014;for example, the image background. Meanwhile, moderately bright regions may contain some textural information of interest.</p>
<p>For instance, in identifying brain tumors, gray matter tissue, which may not be annotated with a mask, is not as significant as tumor tissue. However, separating gray matter textures from the background, which is much darker, allows for a classifier to make more specific predictions by preventing it from learning that background regions correspond with gray matter. Additionally, when using multiple image modalities with distinct categories to build a dataset, separating the dark background prevents an overlap in each category&#x2019;s texture&#x20;space.</p>
<p>Although we use brightness constraints, other constraints could be used depending on the imaging modality. For example, with a set of RGB color images, a set of constraints could be created from the average value of an RGB color channel.</p>
<p>To generate a material patch from a selected region of an image, the first step is to calculate the average brightness of the region using Eq. 9, which is the sum of all the region&#x2019;s normalized raw feature values divided by the number of raw features. The constraints <inline-formula id="inf3">
<mml:math id="m4">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf4">
<mml:math id="m5">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf5">
<mml:math id="m6">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mn>0</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, and <italic>T</italic> in <xref ref-type="other" rid="alg1">Algorithm 1</xref> can be altered at run time to create better-fitting categories.</p>
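<p>The average brightness computation can be sketched as follows (the helper name is ours; Eq. 9 in the text gives the precise definition):</p>

```python
import numpy as np

def average_brightness(patch):
    # Sum of the region's normalized raw feature values divided by the
    # number of raw features (cf. Eq. 9); assumes values already in [0, 1]
    patch = np.asarray(patch, dtype=np.float64)
    return patch.sum() / patch.size
```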
<p>For expert categories, like &#x201c;tumor&#x201d;, that are defined by a mask within the image, the patch generation process needs to ensure that a large enough percentage of the region is within the mask. This value is defined as the mask tolerance <italic>T</italic>, presented in Eq. 10. This value is included to avoid categorizing regions that are on the mask boundary, which may confuse the training of the system. We define a small value of <inline-formula id="inf6">
<mml:math id="m7">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x3e;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> since it allows for patches that intersect categories while still avoiding ambiguity. This increases the pool of eligible image patches, introduces variance to reduce overfitting, and allows for smaller masks (like for pituitary tumors, which are generally small) to be represented in the patch&#x20;set.</p>
<p>For any expert category patch, at least <inline-formula id="inf7">
<mml:math id="m8">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mo>&#xd7;</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> percent of the patch&#x2019;s source region is inside the mask boundary. For any na&#xef;ve category patch, at most <inline-formula id="inf8">
<mml:math id="m9">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mo>&#xd7;</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mn>100</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> percent of the mask is allowed to be within the patch&#x2019;s source region.</p>
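<p>The two tolerance checks can be sketched as follows (helper names are ours; the binary mask patch is assumed to hold 1 inside the mask and 0 outside):</p>

```python
import numpy as np

def mask_fraction(mask_patch):
    # Fraction of the patch's source region that lies inside the mask
    return float(np.asarray(mask_patch, dtype=np.float64).mean())

def passes_expert_tolerance(mask_patch, T):
    # Expert category: at least (1 - T) * 100 percent of the region is masked
    return mask_fraction(mask_patch) >= 1.0 - T

def passes_naive_tolerance(mask_patch, T):
    # Naive category: at most T * 100 percent of the region overlaps the mask
    return mask_fraction(mask_patch) <= T
```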
<p>To further normalize the patches, we also introduce the average brightness constraints <inline-formula id="inf9">
<mml:math id="m10">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf10">
<mml:math id="m11">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf11">
<mml:math id="m12">
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mn>0</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. Since each patch raw feature is normalized to the range <inline-formula id="inf12">
<mml:math id="m13">
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0,1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>, the average brightness constraints are likewise constrained to <inline-formula id="inf13">
<mml:math id="m14">
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0,1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>. First, if a region has an average brightness <inline-formula id="inf14">
<mml:math id="m15">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3c;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mn>0</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, the region&#x2019;s patch is automatically added to the null category. For any other patch to be included in the dataset, its average brightness must fall within the range <inline-formula id="inf15">
<mml:math id="m16">
<mml:mrow>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>i</mml:mi>
<mml:mi>n</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>B</mml:mi>
<mml:mo>&#xaf;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mrow>
<mml:mi>m</mml:mi>
<mml:mi>a</mml:mi>
<mml:mi>x</mml:mi>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
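As a concrete illustration, the brightness rules above can be sketched as a small filter function. This is a minimal sketch: the function name and the threshold values `b_min`, `b_max`, and `b_null` are hypothetical stand-ins for the constraints, not values used in this work.

```python
import numpy as np

def categorize_by_brightness(patch, b_min=0.2, b_max=0.9, b_null=0.05):
    """Apply the average-brightness constraints to a patch whose raw
    features are normalized to [0, 1].  Returns "null" for regions darker
    than the null threshold, "keep" for patches whose average brightness
    lies in [b_min, b_max], and None for patches to be discarded.
    Threshold values here are illustrative, not the paper's settings."""
    b = float(patch.mean())      # average brightness of the region
    if b < b_null:               # below the null threshold -> null category
        return "null"
    if b_min <= b <= b_max:      # within the allowed range -> eligible
        return "keep"
    return None                  # otherwise reject the patch
```

For example, an all-black 32 × 32 patch falls in the null category, while a mid-grey patch is eligible for the dataset.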
<p>Using the above constraints, for each iteration of <xref ref-type="other" rid="alg1">Algorithm 1</xref>, a random image in the set is selected, and within that image, a random point <inline-formula id="inf16">
<mml:math id="m17">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> from a set of points spaced <italic>p</italic> pixels apart is selected. For the selected point, patch <inline-formula id="inf17">
<mml:math id="m18">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is spliced from a <inline-formula id="inf18">
<mml:math id="m19">
<mml:mrow>
<mml:mn>32</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>32</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> section of the image below and to the right of <inline-formula id="inf19">
<mml:math id="m20">
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. This patch is evaluated against the constraints to determine whether it is eligible for inclusion in the patch set and, if so, which category it belongs to. If the image has a mask, the patch is categorized into the mask or non-mask category based on the mask tolerance value. Patch <inline-formula id="inf20">
<mml:math id="m21">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">P</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is added to its assigned category set <inline-formula id="inf21">
<mml:math id="m22">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">C</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> if it meets the constraints.</p>
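The loop described above can be sketched as follows. This is an illustrative sketch, not the actual implementation: the brightness constraints and the uniqueness check on source points are omitted for brevity, and all names and default values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_patches(images, masks, n_patches, p=8, size=32, tol=0.5):
    """One simplified version of the generation loop: pick a random image,
    pick a random grid point spaced p pixels apart, splice a size-by-size
    patch below and to the right of it, and categorize it by mask overlap.
    A patch joins the mask category when at least `tol` of its area
    overlaps the mask; brightness checks are omitted here."""
    patches = {"mask": [], "non_mask": []}
    for _ in range(n_patches):
        i = int(rng.integers(len(images)))        # random image in the set
        img, mask = images[i], masks[i]
        h, w = img.shape
        ys = np.arange(0, h - size, p)            # candidate points spaced
        xs = np.arange(0, w - size, p)            # p pixels apart
        y, x = int(rng.choice(ys)), int(rng.choice(xs))
        patch = img[y:y + size, x:x + size]       # splice below/right of (x, y)
        overlap = mask[y:y + size, x:x + size].mean()
        category = "mask" if overlap >= tol else "non_mask"
        patches[category].append(patch)
    return patches
```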
<p>The generation process ensures every saved patch originates from a unique point, meaning there are no duplicate patches in the dataset. Additionally, different image types containing different categories may use different constraint values when generating patches. The final patch set is used to form training, validation, and test datasets for both of the CNNs in the following sections.</p>
</sec>
<sec id="s2-2">
<title>2.2 Generating a Similarity Matrix for Material Categories</title>
<p>This section introduces a novel Siamese neural network, the <italic>distance matrix convolutional neural network</italic> (D-CNN), that learns to make similarity decisions between image patches to produce a distance matrix <inline-formula id="inf22">
<mml:math id="m23">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula> that encodes the similarities between pairs of material categories.</p>
<p>The D-CNN works by making binary similarity decisions between a reference image patch of a given category and a comparison patch of the same or a different category. This network assists in evaluating expert categories because its similarity decisions on na&#xef;ve categories are comparable to those of humans, while requiring none of the manual annotation that human judgments do.</p>
<p>The network architecture is based on a modified version of ResNet34 (<xref ref-type="bibr" rid="B14">He et&#x20;al., 2015</xref>) with custom linear layers that perform pairwise evaluation between patches.<xref ref-type="fn" rid="fn1">
<sup>1</sup>
</xref> <xref ref-type="fig" rid="F2">Figure&#x20;2</xref> shows the D-CNN network architecture. The network is trained on a large dataset of greyscale image patches, each having raw feature vectors&#x20;<inline-formula id="inf23">
<mml:math id="m24">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
<fig id="F2" position="float">
<label>FIGURE 2</label>
<caption>
<p>The D-CNN architecture. The two ResNet34 (<xref ref-type="bibr" rid="B14">He et&#x20;al., 2015</xref>) networks share the same weights, forming a Siamese neural network. The linear layers at the end of the network find a difference between the two networks&#x2019; values for each patch and give a binary similarity decision <inline-formula id="inf24">
<mml:math id="m25">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> based on this difference. The goal of training the D-CNN is to maximize its ability to make correct similarity decisions.</p>
</caption>
<graphic xlink:href="frai-04-638299-g002.tif"/>
</fig>
<p>The purpose of the D-CNN is to obtain binary similarity decisions <inline-formula id="inf25">
<mml:math id="m26">
<mml:mrow>
<mml:msub>
<mml:mi>s</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mn>0,1</mml:mn>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>n</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> between a reference image and each of a set of <italic>n</italic> images representing each class in the dataset. The Siamese D-CNN does this without human supervision, using a dataset with <italic>k</italic> material categories <inline-formula id="inf26">
<mml:math id="m27">
<mml:mrow>
<mml:mi>C</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mn>1,2</mml:mn>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>. The dataset is divided into batches of reference images <inline-formula id="inf27">
<mml:math id="m28">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mi mathvariant="bold">r</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> that are each associated with comparison images <inline-formula id="inf28">
<mml:math id="m29">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mi mathvariant="bold">c</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> of every class <inline-formula id="inf29">
<mml:math id="m30">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>. For each sample, the D-CNN is provided a set of <inline-formula id="inf30">
<mml:math id="m31">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2b;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> image patches, with the reference image patch <inline-formula id="inf31">
<mml:math id="m32">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi mathvariant="bold">r</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mi mathvariant="bold">r</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> having class <inline-formula id="inf32">
<mml:math id="m33">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and the <italic>k</italic> comparison image patches <inline-formula id="inf33">
<mml:math id="m34">
<mml:mrow>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">c</mml:mi>
<mml:mi mathvariant="bold">1</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">c</mml:mi>
<mml:mi mathvariant="bold">2</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:mo>&#x2026;</mml:mo>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">c</mml:mi>
<mml:mi mathvariant="bold">k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> having unique classes in shuffled&#x20;order.</p>
<p>A single pass through the D-CNN consists of the reference image <inline-formula id="inf34">
<mml:math id="m35">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi mathvariant="bold">r</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> being paired with one of the comparison images <inline-formula id="inf35">
<mml:math id="m36">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">c</mml:mi>
<mml:mi mathvariant="bold">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>. Each patch is sent through the D-CNN&#x2019;s convolutional layers with the same weights, and the two convolutional outputs are compared in the linear layers. The D-CNN returns <inline-formula id="inf36">
<mml:math id="m37">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> if it evaluates that the paired images are of the same class or <inline-formula id="inf37">
<mml:math id="m38">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> if it evaluates that the paired images are of different classes. This process repeats with <inline-formula id="inf38">
<mml:math id="m39">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi mathvariant="bold">r</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and each of the comparison images&#x20;<inline-formula id="inf39">
<mml:math id="m40">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">c</mml:mi>
<mml:mi mathvariant="bold">i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
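A structural sketch of this Siamese forward pass, assuming a PyTorch-style implementation (the paper does not name a framework). The tiny convolutional encoder below is a stand-in for the ResNet34 backbone, and all layer sizes are illustrative.

```python
import torch
import torch.nn as nn

class SiameseDCNN(nn.Module):
    """Sketch of the D-CNN: one shared convolutional encoder (a toy
    stand-in for ResNet34) applied to both patches, followed by linear
    layers that compare the two embeddings and emit a similarity score."""
    def __init__(self, embed_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(          # shared weights: the same
            nn.Conv2d(1, 8, 3, padding=1),     # module processes both
            nn.ReLU(),                         # the reference and the
            nn.AdaptiveAvgPool2d(4),           # comparison patch
            nn.Flatten(),
            nn.Linear(8 * 4 * 4, embed_dim),
        )
        self.head = nn.Sequential(             # pairwise comparison layers
            nn.Linear(embed_dim, 32), nn.ReLU(), nn.Linear(32, 1),
        )

    def forward(self, x_r, x_c):
        z_r, z_c = self.encoder(x_r), self.encoder(x_c)
        diff = torch.abs(z_r - z_c)            # difference of embeddings
        return torch.sigmoid(self.head(diff))  # score near 1 = "different"

model = SiameseDCNN()
x_r = torch.rand(4, 1, 32, 32)   # batch of greyscale reference patches
x_c = torch.rand(4, 1, 32, 32)   # batch of comparison patches
y_hat = model(x_r, x_c)          # shape (4, 1), values in (0, 1)
```

Thresholding the sigmoid output at 0.5 yields the binary decision described in the text.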
<p>For a D-CNN with network parameters <inline-formula id="inf40">
<mml:math id="m41">
<mml:mi mathvariant="bold">&#x398;</mml:mi>
</mml:math>
</inline-formula>, and predictions <inline-formula id="inf41">
<mml:math id="m42">
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi mathvariant="bold">r</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi mathvariant="bold">c</mml:mi>
</mml:msub>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> with corresponding similarity decision labels <italic>y</italic>, the training process can be formalized as the minimization problem described in <xref ref-type="disp-formula" rid="e1">Eq. 1</xref>.<disp-formula id="e1">
<mml:math id="m43">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mtext>argmin</mml:mtext>
</mml:mrow>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
</mml:munder>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>,</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>y</mml:mi>
<mml:mtext>ln</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>y</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mtext>ln</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>1</mml:mn>
<mml:mo>&#x2212;</mml:mo>
<mml:mrow>
<mml:mover accent="true">
<mml:mi>y</mml:mi>
<mml:mo>&#x5e;</mml:mo>
</mml:mover>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(1)</label>
</disp-formula>The minimization term represents the cross-entropy loss between the D-CNN&#x2019;s predicted value on the comparison between image sets <inline-formula id="inf42">
<mml:math id="m44">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mi mathvariant="bold">r</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and <inline-formula id="inf43">
<mml:math id="m45">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">X</mml:mi>
<mml:mi mathvariant="bold">c</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, and the actual values of the similarity decisions between the two sets. Minimizing this term helps the D-CNN more closely fit the target function, enabling it to evaluate more accurately whether two image patches belong to the same or different material categories.</p>
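The summand of Eq. 1 can be computed directly. A minimal numerical sketch; the clipping constant `eps` is a numerical safeguard we add, not part of the formulation.

```python
import numpy as np

def bce_loss(y, y_hat, eps=1e-12):
    """Cross-entropy term of Eq. 1 for binary similarity labels y and
    predictions y_hat.  Clipping guards against ln(0)."""
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return float(-np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))
```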
<p>We note that we selected cross-entropy loss despite many Siamese neural network models using triplet loss (<xref ref-type="bibr" rid="B6">Chechik et&#x20;al., 2010</xref>) in their minimization objective. Triplet loss is useful for tasks like facial recognition (<xref ref-type="bibr" rid="B27">Schroff et&#x20;al., 2015</xref>), where classes cannot be represented in a one-hot manner due to a large number of possibilities. In such cases, an <italic>n</italic>-dimensional non-binary embedding is learned. However, with medical materials, we expect only a small number of categories for each application. Cross-entropy loss greatly simplifies the comparison problem for such cases, as no anchor input is needed. We believe this is viable because the problem space has been simplified&#x2014;sample labels can only take two values (0 or 1). If one desires to learn a distance metric between a large number of medical material categories, the D-CNN could be tweaked to use triplet loss by adding an anchor input and changing the minimization objective.</p>
<p>Specifically, we train the D-CNN as follows. For a predetermined number of epochs, we train the network on a training set of patch comparison samples. At the end of each epoch, we then evaluate the network on a separate validation set of patch comparison samples. The loss on the validation set is tracked for each epoch, and if the current epoch&#x2019;s validation set loss is the lowest of all epochs so far, the D-CNN model&#x2019;s weights are saved. Ideally, the training regimen would converge to the lowest validation set loss on the final epoch, but this is not always the&#x20;case.</p>
<p>Saving the lowest-loss D-CNN model rather than the final epoch D-CNN model mitigates risks of overfitting the model. Overfitting occurs when, in later epochs of the training process, the validation set loss increases due to a model losing its ability to generalize features learned from the training set. Our procedure avoids this by ignoring any D-CNN model iterations that yield a larger validation set loss than earlier epochs.</p>
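The checkpointing rule can be expressed schematically. This sketch only tracks which epoch's weights would be saved, standing in for actually serializing the model; the function name is ours.

```python
def train_with_best_checkpoint(val_losses):
    """Sketch of the rule above: keep the weights from the epoch with
    the lowest validation loss seen so far, not from the final epoch.
    `val_losses` stands in for the per-epoch validation evaluation."""
    best_loss, best_epoch = float("inf"), None
    for epoch, loss in enumerate(val_losses):
        # ... train for one epoch, then evaluate on the validation set ...
        if loss < best_loss:          # new best -> save the model weights
            best_loss, best_epoch = loss, epoch
    return best_epoch, best_loss
```

For a loss curve that rises after epoch 1 (overfitting), the rule keeps the epoch-1 weights rather than the final ones.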
<p>After training, the network is evaluated with a testing set of patch comparison samples it has not seen before. As in training, the D-CNN makes binary similarity decisions between a reference patch and <italic>n</italic> comparison image patches. These similarity decisions are encoded in a <italic>K</italic>-dimensional vector <inline-formula id="inf44">
<mml:math id="m46">
<mml:mi mathvariant="bold">p</mml:mi>
</mml:math>
</inline-formula> using <xref ref-type="disp-formula" rid="e2">Eq. 2</xref>.<disp-formula id="e2">
<mml:math id="m47">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:msub>
<mml:mi>N</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:mfrac>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>n</mml:mi>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>k</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msub>
<mml:mi mathvariant="bold">s</mml:mi>
<mml:mi>n</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
<label>(2)</label>
</disp-formula>The distance matrix <inline-formula id="inf45">
<mml:math id="m48">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula> is built from the L2-norm between pairs of entries in <inline-formula id="inf46">
<mml:math id="m49">
<mml:mi mathvariant="bold">p</mml:mi>
</mml:math>
</inline-formula>. Each entry in <inline-formula id="inf47">
<mml:math id="m50">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula>, <inline-formula id="inf48">
<mml:math id="m51">
<mml:mrow>
<mml:msub>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, represents the perceptual distance the D-CNN has established between material categories <italic>k</italic> and <inline-formula id="inf49">
<mml:math id="m52">
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula>. The value of each entry of <inline-formula id="inf50">
<mml:math id="m53">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula> is presented in <xref ref-type="disp-formula" rid="e3">Eq. 3</xref>.<disp-formula id="e3">
<mml:math id="m54">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">D</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msub>
<mml:mo>&#x3d;</mml:mo>
<mml:msub>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">p</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
<label>(3)</label>
</disp-formula>While training the D-CNN, we define the &#x201c;optimal&#x201d; <inline-formula id="inf51">
<mml:math id="m55">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula> matrix as the one that is generated when the D-CNN has the lowest loss on the validation set. This optimal matrix is saved in addition to the model&#x2019;s weights and is used as the basis for generating the material attributes in later&#x20;steps.</p>
</sec>
<sec id="s2-3">
<title>2.3 Generating Material Attributes</title>
<p>The distance matrix <inline-formula id="inf52">
<mml:math id="m56">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula> introduced in <xref ref-type="sec" rid="s2-2">Section 2.2</xref> maps distances from material categories to other material categories. However, we are also interested in discovering a set of <italic>M</italic> novel material attributes that provide new, useful information that can improve the categorization and separation of image patches.</p>
<p>We reintroduce the method in <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref> for mapping material categories to material attributes. This procedure preserves the distances discovered in <inline-formula id="inf53">
<mml:math id="m57">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula> while introducing values for the mapping that reflect how humans generally perceive materials. This mapping is encoded in the <italic>material category-attribute matrix</italic>&#x20;<inline-formula id="inf54">
<mml:math id="m58">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula>.</p>
<p>
<inline-formula id="inf55">
<mml:math id="m59">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> is a <inline-formula id="inf56">
<mml:math id="m60">
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula> matrix, where <italic>K</italic> is the number of material categories encoded by <inline-formula id="inf57">
<mml:math id="m61">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula> and <italic>M</italic> is a freely selected value that defines the number of material attributes that are generated. The entries of <inline-formula id="inf58">
<mml:math id="m62">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> are bound to the range <inline-formula id="inf59">
<mml:math id="m63">
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0,1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:math>
</inline-formula> so that each entry represents a conditional probability. The minimization objective for <inline-formula id="inf60">
<mml:math id="m64">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> is presented in <xref ref-type="disp-formula" rid="e4">Eq. 4</xref>.<xref ref-type="fn" rid="fn2">
<sup>2</sup>
</xref>
<disp-formula id="e4">
<mml:math id="m65">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold">A</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mtext>argmin</mml:mtext>
</mml:mrow>
<mml:mi mathvariant="bold">A</mml:mi>
</mml:munder>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>C</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mrow>
<mml:mrow>
<mml:mo>&#x2016;</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">a</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">a</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>&#x2016;</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">D</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>k</mml:mi>
<mml:mo>&#x2032;</mml:mo>
</mml:mrow>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:math>
<label>(4)</label>
</disp-formula>
<disp-formula id="equ1">
<mml:math id="m66">
<mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:mi>&#x3b3;</mml:mi>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mi>&#x3b2;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mtext>ln</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mfrac>
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">A</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="e5">
<mml:math id="m67">
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">A</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mi>K</mml:mi>
<mml:mi>M</mml:mi>
</mml:mrow>
</mml:mfrac>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:munder>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:mi>&#x3c0;</mml:mi>
<mml:msup>
<mml:mi>h</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mn>2</mml:mn>
</mml:mfrac>
</mml:mrow>
</mml:msup>
<mml:mtext>exp</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mrow>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mi>p</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
<mml:mrow>
<mml:mn>2</mml:mn>
<mml:msup>
<mml:mi>h</mml:mi>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(5)</label>
</disp-formula>
</p>
<p>The first term of the objective captures the distances between material categories in <inline-formula id="inf61">
<mml:math id="m68">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula> and material attributes in <inline-formula id="inf62">
<mml:math id="m69">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> with a distance measure that takes the L2-distance between pairs of rows <inline-formula id="inf63">
<mml:math id="m70">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">a</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> of <inline-formula id="inf64">
<mml:math id="m71">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> and compares it against the corresponding entries of&#x20;<inline-formula id="inf65">
<mml:math id="m72">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula>.</p>
<p>The second term of the objective captures an important feature of the <inline-formula id="inf66">
<mml:math id="m73">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> matrix&#x2014;that its entries should conform to a reasonable distribution that mirrors human perception. Like <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref>, we use a beta distribution with parameters <inline-formula id="inf67">
<mml:math id="m74">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.5</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. The beta distribution is ideal because, in human perception, a material category usually either strongly exhibits a given material attribute or does not exhibit it at all. We assume that this observation, which holds for na&#xef;ve categories, also holds for expert categories.
<p>Since the beta distribution is continuous, it still permits intermediate cases where materials may be similar (as is the case for &#x201c;tumor&#x201d; and &#x201c;brain&#x201d;). The &#x3b3;-weighted term accomplishes this by embedding the <inline-formula id="inf68">
<mml:math id="m75">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> matrix in a Gaussian kernel density estimate <inline-formula id="inf69">
<mml:math id="m76">
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">A</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> and comparing it to the target beta distribution. This comparison is accomplished by evaluating the Kullback&#x2013;Leibler (KL) divergence between the two distributions. The Gaussian kernel density estimate of <inline-formula id="inf70">
<mml:math id="m77">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> at point <italic>p</italic> is presented in <xref ref-type="disp-formula" rid="e5">Eq.&#x20;5</xref>.</p>
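Eq. 5 and the &#x3b3;-weighted KL term can be sketched numerically. This is a minimal sketch under our own naming: the closed-form normalization B(0.5, 0.5) = &#x3c0; in `beta_pdf` is specific to a = b = 0.5, and the bandwidth and sample-point grid are illustrative choices.

```python
import numpy as np

def kde_q(p, A, h=0.1):
    """Eq. 5: Gaussian kernel density estimate of the entries of the
    K-by-M matrix A, evaluated at point p."""
    K, M = A.shape
    norm = (2 * np.pi * h ** 2) ** -0.5
    return norm / (K * M) * np.sum(np.exp(-(A - p) ** 2 / (2 * h ** 2)))

def beta_pdf(p, a=0.5, b=0.5):
    """Target beta density; the normalization B(0.5, 0.5) = pi used
    here is valid only for a = b = 0.5."""
    return p ** (a - 1) * (1 - p) ** (b - 1) / np.pi

def kl_term(A, grid, gamma=1.0, h=0.1):
    """Gamma-weighted KL-divergence term of Eq. 4, summed over sample
    points p, comparing the target beta density to the KDE of A."""
    return gamma * sum(
        beta_pdf(p) * np.log(beta_pdf(p) / kde_q(p, A, h)) for p in grid
    )
```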
<p>The optimized matrix <inline-formula id="inf71">
<mml:math id="m78">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold">A</mml:mi>
<mml:mo>&#x2217;</mml:mo>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula> from <xref ref-type="disp-formula" rid="e4">Eq. 4</xref> is held constant and used as the <inline-formula id="inf72">
<mml:math id="m79">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> matrix in further portions of the system.</p>
</sec>
<sec id="s2-4">
<title>2.4 Material Attribute-Category Convolutional Neural Network Architecture</title>
<p>The <italic>material attribute-category convolutional neural network</italic> (MAC-CNN) is an end-to-end convolutional neural network that seeks to directly learn the <italic>K</italic> material categories while simultaneously learning the <italic>M</italic> material attributes embedded by <inline-formula id="inf73">
<mml:math id="m80">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula>. We improve on the MAC-CNN design in <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref> by updating the architecture to classify medical materials more robustly. <xref ref-type="fig" rid="F3">Figure&#x20;3</xref> shows the architecture of our MAC-CNN.</p>
<fig id="F3" position="float">
<label>FIGURE 3</label>
<caption>
<p>The Material Attribute Classifier CNN (MAC-CNN) architecture. The network uses convolutional layers from ResNet34 (<xref ref-type="bibr" rid="B14">He et&#x20;al., 2015</xref>) followed by sequential 512-node and 2048-node fully connected layers to predict the material category <inline-formula id="inf74">
<mml:math id="m81">
<mml:mrow>
<mml:msub>
<mml:mi>c</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0,1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mi>K</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>. An auxiliary network of fully connected layers also predicts the material attribute probabilities <inline-formula id="inf75">
<mml:math id="m82">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi>x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
</caption>
<graphic xlink:href="frai-04-638299-g003.tif"/>
</fig>
<p>The MAC-CNN in <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref> used VGG-16 (<xref ref-type="bibr" rid="B29">Simonyan and Zisserman, 2014</xref>) as its backbone architecture. However, to maintain consistency with the D-CNN and to use a more powerful architecture, we introduce an updated version of the MAC-CNN built on ResNet34 (<xref ref-type="bibr" rid="B14">He et&#x20;al., 2015</xref>). ResNet remains reliable at greater depths because its residual connections mitigate the vanishing gradient problem. Consequently, a deeper version of ResNet could give the MAC-CNN greater predictive power than a comparably deep VGG network, which could be useful for complex medical material problems. As with any model with more parameters, this added capacity comes at the expense of training&#x20;time.</p>
<p>The fully connected layers in the ResNet network are replaced by two fully connected layers trained from random initialization. These layers determine the <italic>K</italic> material category predictions as shown in <xref ref-type="fig" rid="F3">Figure&#x20;3</xref>, and output a one-hot vector with the material category classification. If the D-CNN is effective at discerning expert categories and the <inline-formula id="inf76">
<mml:math id="m83">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> matrix encodes these categories well, then the MAC-CNN should be able to classify expert, na&#xef;ve, and null categories effectively.</p>
<p>To predict the <italic>M</italic> material attributes, the backbone network is augmented with multiple auxiliary classifier networks. The responses from each block of the ResNet backbone, along with the initial pooling layer, are used as inputs to individual auxiliary classifier networks. An additional auxiliary classifier is used to combine each module&#x2019;s prediction into a single <italic>M</italic>-dimensional prediction vector. The auxiliary network learns to give conditional probabilities that the patch fits each material attribute, allowing the MAC-CNN to retain features that are informative for predicting material attributes.</p>
<p>The goal of the MAC-CNN is realized through training the network on image patches, like the D-CNN. However, the patches&#x2019; material categories are learned directly instead of through similarity decisions. The MAC-CNN also learns material attributes. Therefore, the weights from the D-CNN cannot be directly transferred to the MAC-CNN.</p>
<p>To predict the <italic>M</italic> discovered material attributes, the MAC-CNN uses a learned auxiliary classifier <italic>f</italic> with parameters <inline-formula id="inf77">
<mml:math id="m84">
<mml:mi mathvariant="bold">&#x398;</mml:mi>
</mml:math>
</inline-formula> that maps an image patch with <italic>d</italic> raw features to the <italic>M</italic> attribute probabilities. The model <italic>f</italic>&#x2019;s mapping is given by <inline-formula id="inf78">
<mml:math id="m85">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>:</mml:mo>
<mml:msup>
<mml:mi mathvariant="normal">&#x211d;</mml:mi>
<mml:mi>d</mml:mi>
</mml:msup>
<mml:mo>&#x2192;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0,1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
<mml:mi>M</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>. Each term in the output is a conditional probability that the patch exhibits that particular attribute.</p>
<p>Given a <italic>D</italic>-dimensional feature vector output from a hidden layer of the MAC-CNN, the <italic>M</italic>-dimensional material attribute prediction is computed by <xref ref-type="disp-formula" rid="e6">Eq. 6</xref>. The network&#x2019;s weights and biases&#x20;<inline-formula id="inf79">
<mml:math id="m86">
<mml:mrow>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> have dimensionality <inline-formula id="inf80">
<mml:math id="m87">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi mathvariant="normal">&#x211d;</mml:mi>
<mml:mrow>
<mml:mi>H</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>D</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf81">
<mml:math id="m88">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi mathvariant="normal">&#x211d;</mml:mi>
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#xd7;</mml:mo>
<mml:mi>H</mml:mi>
</mml:mrow>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf82">
<mml:math id="m89">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi mathvariant="normal">&#x211d;</mml:mi>
<mml:mi>H</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, and <inline-formula id="inf83">
<mml:math id="m90">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi mathvariant="normal">&#x211d;</mml:mi>
<mml:mi>M</mml:mi>
</mml:msup>
</mml:mrow>
</mml:math>
</inline-formula>, where <italic>H</italic> is the dimensionality of the hidden layer.<disp-formula id="equ2">
<mml:math id="m91">
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mi>h</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mi>h</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">W</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">b</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="e6">
<mml:math id="m92">
<mml:mrow>
<mml:mi>h</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mtable>
<mml:mtr>
<mml:mtd>
<mml:mn>0</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mn>0</mml:mn>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mi>x</mml:mi>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mn>0</mml:mn>
<mml:mo>&#x3c;</mml:mo>
<mml:mi>x</mml:mi>
<mml:mo>&#x3c;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mtd>
</mml:mtr>
<mml:mtr>
<mml:mtd>
<mml:mn>1</mml:mn>
</mml:mtd>
<mml:mtd>
<mml:mrow>
<mml:mi>x</mml:mi>
<mml:mo>&#x2265;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:mtd>
</mml:mtr>
</mml:mtable>
</mml:mrow>
</mml:mrow>
</mml:mrow>
</mml:math>
<label>(6)</label>
</disp-formula>
</p>
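The two-layer mapping and the piecewise function h above translate directly into code; h is a hard clamp to [0, 1]. The dimensions and random parameter values below are placeholders for illustration.

```python
import torch

def h(x):
    # h(x) = 0 for x <= 0, x for 0 < x < 1, and 1 for x >= 1
    return torch.clamp(x, 0.0, 1.0)

d, H, M = 3 * 32 * 32, 64, 8  # placeholder dimensions
W1, b1 = 0.01 * torch.randn(H, d), torch.zeros(H)  # W1: H x D, b1: H
W2, b2 = 0.01 * torch.randn(M, H), torch.zeros(M)  # W2: M x H, b2: M

def f(x):
    # f(x; Theta) = h(W2 h(W1 x + b1) + b2), mapping R^d -> [0, 1]^M
    return h(W2 @ h(W1 @ x + b1) + b2)

p = f(torch.randn(d))  # M conditional attribute probabilities
```

Because h clamps its output, every entry of the prediction is a valid conditional probability in [0, 1].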
</sec>
<sec id="s2-5">
<title>2.5 Material Attribute-Category Convolutional Neural Network Training</title>
<p>The convolutional layers in the backbone network are pretrained on ImageNet (<xref ref-type="bibr" rid="B10">Deng et&#x20;al., 2009</xref>) for robust feature extraction, while the fully connected layers and auxiliary network are initialized with random weights. The training process optimizes these weights with respect to the target function and converges faster than training the entire network from random weights. Fast training is important if the MAC-CNN is to be used in many different expert domains with little correlation to each&#x20;other.</p>
<p>Like the D-CNN, we reduce overfitting by saving the MAC-CNN model from the training epoch with the lowest validation-set loss, which is not necessarily the model from the final epoch. This allows for the model to be trained for more epochs while mitigating potential overfitting later in the training process. To improve the MAC-CNN&#x2019;s training convergence, we also use a learning rate scheduler that reduces the learning rate by a factor of 10 following epochs where validation set loss increases.</p>
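The checkpointing and learning-rate schedule described above can be sketched in PyTorch as follows; the model, optimizer settings, and the simulated sequence of validation losses are placeholders, not values from the paper.

```python
import copy
import torch

model = torch.nn.Linear(10, 4)  # stand-in for the MAC-CNN
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# Cut the learning rate by a factor of 10 after any epoch where the
# validation loss fails to improve (patience=0 approximates "following
# epochs where validation set loss increases").
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.1, patience=0)

best_loss, best_state = float("inf"), None
for val_loss in [1.0, 0.8, 0.9, 0.7, 0.75]:  # simulated epoch losses
    # ... one epoch of training and validation would run here ...
    scheduler.step(val_loss)
    if val_loss < best_loss:                 # keep the lowest-loss model,
        best_loss = val_loss                 # not necessarily the last one
        best_state = copy.deepcopy(model.state_dict())

model.load_state_dict(best_state)            # restore the best epoch
```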
<p>We train the network parameters <inline-formula id="inf84">
<mml:math id="m93">
<mml:mi mathvariant="bold">&#x398;</mml:mi>
</mml:math>
</inline-formula>, dependent on the material attribute-category matrix <inline-formula id="inf85">
<mml:math id="m94">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula>, to classify patches into <italic>K</italic> material categories and <italic>M</italic> material attributes simultaneously. The training set <inline-formula id="inf86">
<mml:math id="m95">
<mml:mi mathvariant="bold">X</mml:mi>
</mml:math>
</inline-formula> is a set of <italic>N</italic> pairs of raw feature vectors and material category labels of the form <inline-formula id="inf87">
<mml:math id="m96">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>, where <inline-formula id="inf88">
<mml:math id="m97">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is the raw feature vector of image patch <italic>i</italic> and <inline-formula id="inf89">
<mml:math id="m98">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> is a one-hot encoded label vector for its <italic>K</italic> material categories. <xref ref-type="disp-formula" rid="e7">Equation 7</xref> formalizes the definition of these training pairs.<disp-formula id="e7">
<mml:math id="m99">
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>:</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mtext>&#x2009;</mml:mtext>
<mml:mn>1</mml:mn>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>i</mml:mi>
<mml:mo>&#x2264;</mml:mo>
<mml:mi>N</mml:mi>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mi mathvariant="normal">&#x211d;</mml:mi>
<mml:mi>d</mml:mi>
</mml:msup>
<mml:mo>,</mml:mo>
<mml:mtext>&#x2009;</mml:mtext>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msup>
<mml:mrow>
<mml:mrow>
<mml:mo>{</mml:mo>
<mml:mrow>
<mml:mn>0,1</mml:mn>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>K</mml:mi>
</mml:msup>
</mml:mrow>
<mml:mo>}</mml:mo>
</mml:mrow>
<mml:mo>.</mml:mo>
</mml:mrow>
</mml:math>
<label>(7)</label>
</disp-formula>The loss function and minimization objective for the MAC-CNN are given in <xref ref-type="disp-formula" rid="e8">Eq. 8</xref>, which follows from the loss function used in <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref>.<xref ref-type="fn" rid="fn3">
<sup>3</sup>
</xref> The loss function combines the negative log-likelihood of the <italic>K</italic> material category predictions for each image patch <inline-formula id="inf90">
<mml:math id="m100">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:math>
</inline-formula>.<disp-formula id="equ4b">
<mml:math id="m104">
<mml:mrow>
<mml:msup>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
<mml:mtext>&#x2a;</mml:mtext>
</mml:msup>
<mml:mo>&#x3d;</mml:mo>
<mml:munder>
<mml:mrow>
<mml:mtext>argmin</mml:mtext>
</mml:mrow>
<mml:mtext>&#x398;</mml:mtext>
</mml:munder>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>T</mml:mi>
</mml:mrow>
</mml:munder>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:munder>
<mml:mo>&#x2212;</mml:mo>
<mml:msub>
<mml:mi>y</mml:mi>
<mml:mi>j</mml:mi>
</mml:msub>
<mml:mtext>ln</mml:mtext>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>f</mml:mi>
<mml:msub>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mi>j</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="equ5">
<mml:math id="m105">
<mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>&#x3b3;</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>&#x2208;</mml:mo>
<mml:mi>P</mml:mi>
</mml:mrow>
</mml:munder>
<mml:mi>&#x3b2;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mtext>ln</mml:mtext>
<mml:mfrac>
<mml:mrow>
<mml:mi>&#x3b2;</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mrow>
<mml:mi>q</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>p</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi>T</mml:mi>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:mfrac>
</mml:mrow>
</mml:math>
</disp-formula>
<disp-formula id="e8">
<mml:math id="m106">
<mml:mrow>
<mml:mo>&#x2b;</mml:mo>
<mml:msub>
<mml:mi>&#x3b3;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
<mml:mrow>
<mml:mover accent="true">
<mml:mrow>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>1</mml:mn>
</mml:mrow>
</mml:munder>
</mml:mrow>
<mml:mi>K</mml:mi>
</mml:mover>
</mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">a</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x2212;</mml:mo>
<mml:mfrac>
<mml:mn>1</mml:mn>
<mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
<mml:mo>&#x7c;</mml:mo>
</mml:mrow>
</mml:mfrac>
<mml:munder>
<mml:mstyle displaystyle="true">
<mml:mo>&#x2211;</mml:mo>
</mml:mstyle>
<mml:mrow>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>,</mml:mo>
<mml:msub>
<mml:mi mathvariant="bold">y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x2208;</mml:mo>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:munder>
<mml:mi>f</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>;</mml:mo>
<mml:mi mathvariant="bold">&#x398;</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
<mml:mo>&#x7c;</mml:mo>
<mml:msubsup>
<mml:mo>&#x7c;</mml:mo>
<mml:mn>2</mml:mn>
<mml:mn>2</mml:mn>
</mml:msubsup>
</mml:mrow>
</mml:math>
<label>(8)</label>
</disp-formula>
</p>
<p>The <inline-formula id="inf91">
<mml:math id="m107">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b3;</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>-weighted term represents the KL-divergence between the <italic>M</italic> material attribute predictions for <inline-formula id="inf92">
<mml:math id="m108">
<mml:mrow>
<mml:msub>
<mml:mi mathvariant="bold">x</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> and a Beta distribution with <inline-formula id="inf93">
<mml:math id="m109">
<mml:mrow>
<mml:mi>a</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi>b</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>0.5</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula>. The Beta distribution is again chosen as a comparison distribution for reasons like those discussed in <xref ref-type="sec" rid="s2-2">Section&#x20;2.2</xref>.</p>
<p>The <inline-formula id="inf94">
<mml:math id="m110">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b3;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>-weighted term constrains the loss to the material attributes encoded in the <inline-formula id="inf95">
<mml:math id="m111">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> matrix. The term represents the mean squared error between rows of <inline-formula id="inf96">
<mml:math id="m112">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula>, where each row represents one category&#x2019;s probability distribution of attributes, and the material attribute predictions on the samples <inline-formula id="inf97">
<mml:math id="m113">
<mml:mrow>
<mml:msub>
<mml:mi>T</mml:mi>
<mml:mi>k</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> for each category.</p>
<p>The hyperparameters <inline-formula id="inf98">
<mml:math id="m114">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b3;</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>, <inline-formula id="inf99">
<mml:math id="m115">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b3;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula> assign weights to their respective loss terms and are chosen at training&#x20;time.</p>
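Putting the three terms together, a schematic PyTorch rendering of Eq. 8 might look like this. The histogram estimate of the empirical distribution q, the bin count, and the smoothing constant are our assumptions for illustration, not the authors' implementation.

```python
import math
import torch

gamma1, gamma2 = 1e-2, 1.0  # loss-term weights (see Table 2)

def mac_cnn_loss(cat_logp, y, attr_pred, A, n_bins=10):
    """cat_logp: (N, K) log-probabilities over material categories.
    y: (N,) integer category labels. attr_pred: (N, M) attribute
    probabilities in [0, 1]. A: (K, M) attribute-category matrix."""
    # Term 1: negative log-likelihood of the K category predictions.
    nll = -cat_logp[torch.arange(len(y)), y].sum()

    # Term 2: KL divergence between a discretized Beta(0.5, 0.5) and the
    # empirical distribution q of the attribute predictions.
    edges = torch.linspace(0, 1, n_bins + 1)
    mids = (edges[:-1] + edges[1:]) / 2
    b = 1.0 / (math.pi * torch.sqrt(mids * (1 - mids)))  # Beta(0.5, 0.5) pdf
    b = b / b.sum()
    q = torch.histc(attr_pred, bins=n_bins, min=0.0, max=1.0)
    q = (q + 1e-6) / (q + 1e-6).sum()  # normalized, smoothed histogram
    kl = (b * torch.log(b / q)).sum()

    # Term 3: squared error between each row a_k of A and the mean
    # attribute prediction over that category's samples T_k.
    mse = sum(((A[k] - attr_pred[y == k].mean(dim=0)) ** 2).sum()
              for k in range(A.shape[0]) if (y == k).any())

    return nll + gamma1 * kl + gamma2 * mse

# Illustrative inputs: N=20 patches, K=4 categories, M=8 attributes.
torch.manual_seed(0)
cat_logp = torch.log_softmax(torch.randn(20, 4), dim=1)
y = torch.randint(0, 4, (20,))
attr_pred = torch.rand(20, 8)
A = torch.rand(4, 8)
loss = mac_cnn_loss(cat_logp, y, attr_pred, A)
```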
</sec>
</sec>
<sec id="s3">
<title>3 Results</title>
<p>The patch generation procedure, D-CNN, and MAC-CNN were implemented using the PyTorch neural network library (<xref ref-type="bibr" rid="B24">Paszke et&#x20;al., 2019</xref>) and the Python programming language. The implementation was run on a system with an Intel Core i9 processor and two Nvidia Quadro RTX 8000 graphics cards. Our implementation is available on GitHub at <ext-link ext-link-type="uri" xlink:href="https://github.com/cmolder/medical-materials">https://github.com/cmolder/medical-materials</ext-link>.</p>
<p>To evaluate our methods on an expert domain, we compiled a dataset of local image patches of four categories&#x2014;background, tumor, bone, and brain&#x2014;using the procedure described in <xref ref-type="sec" rid="s2-1">Section 2.1.</xref> These patches were generated from a combination of medical image datasets of knee X-rays and brain MRIs with tumors. The dataset was divided into a 60-20-20 percent training, validation, and testing split to be evaluated using our system.</p>
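The 60-20-20 split can be sketched as a shuffled index partition; the patch count and random seed below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility
n_patches = 1000                # stand-in for the dataset size
idx = rng.permutation(n_patches)

n_train = int(0.6 * n_patches)
n_val = int(0.2 * n_patches)
train_idx = idx[:n_train]                 # 60% training
val_idx = idx[n_train:n_train + n_val]    # 20% validation
test_idx = idx[n_train + n_val:]          # 20% testing
```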
<sec id="s3-1">
<title>3.1 Dataset</title>
<p>For bone category material patches, a set of 300 knee X-rays was sampled from the Cohort Hip and Cohort Knee (CHECK) baseline dataset (<xref ref-type="bibr" rid="B4">Bijlsma and Wesseling, 2015</xref>). For healthy brain and brain tumor category material patches, two datasets were combined: 3,804 MRI scans with brain tumors were sourced from <xref ref-type="bibr" rid="B7">Cheng (2017)</xref>, and additional brain MRI scans were sourced from The Cancer Imaging Archive (<xref ref-type="bibr" rid="B9">Clark et&#x20;al., 2013</xref>; <xref ref-type="bibr" rid="B26">Schmainda and Prah, 2018</xref>).</p>
<p>These medical radiography scans were used to generate image patches using the procedure discussed in <xref ref-type="sec" rid="s2-1">Section 2.1.</xref> The raw feature vectors from these image patches were then used to train, validate, and test the D-CNN, optimize the material attribute-category matrix, and train, validate, and test the MAC-CNN. Fifty brain MRIs from <xref ref-type="bibr" rid="B7">Cheng (2017)</xref> were held out from the dataset to test the MAC-CNN&#x2019;s ability to evaluate images in a sliding-window manner in <xref ref-type="sec" rid="s3-5">Section&#x20;3.5.</xref>
</p>
<p>The patches were generated using the process described in <xref ref-type="sec" rid="s2-1">Section 2.1</xref> at a size of <inline-formula id="inf100">
<mml:math id="m116">
<mml:mrow>
<mml:mn>32</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>32</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> pixels.</p>
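A minimal version of tiling a scan into non-overlapping 32 x 32 patches is shown below; Section 2.1's actual sampling procedure may differ (for example, in stride or region selection), so this is only an illustrative sketch.

```python
import numpy as np

def extract_patches(image, size=32):
    """Tile a 2-D scan into non-overlapping size x size patches."""
    h, w = image.shape
    return np.stack([image[r:r + size, c:c + size]
                     for r in range(0, h - size + 1, size)
                     for c in range(0, w - size + 1, size)])

scan = np.zeros((256, 256), dtype=np.uint8)  # stand-in radiograph slice
patches = extract_patches(scan)              # 8x8 grid of 32x32 patches
```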
</sec>
<sec id="s3-2">
<title>3.2 Training Distance Matrix Convolutional Neural Network and Material Attribute-Category Convolutional Neural Network</title>
<p>To demonstrate that the D-CNN and MAC-CNN classifiers are trained effectively and do not overfit the training data, we present results from training multiple initializations of the D-CNN and MAC-CNN models. For reference, <xref ref-type="table" rid="T2">Table&#x20;2</xref> contains the list of parameters we selected to train the D-CNN and MAC-CNN.</p>
<table-wrap id="T2" position="float">
<label>TABLE 2</label>
<caption>
<p>D-CNN and MAC-CNN training parameters.</p>
</caption>
<table>
<thead valign="top">
<tr>
<th align="center">Notation</th>
<th align="center">Definition</th>
<th align="center">Value</th>
</tr>
</thead>
<tbody valign="top">
<tr>
<td colspan="3" align="center">
<bold>D-CNN</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>&#x2003;E</italic>
</td>
<td align="left">Number of epochs</td>
<td align="center">15</td>
</tr>
<tr>
<td align="left">
<italic>&#x2003;B</italic>
</td>
<td align="left">Batch size</td>
<td align="center">50</td>
</tr>
<tr>
<td align="left">
<italic>&#x2003;&#x3b7;</italic>
</td>
<td align="left">Learning rate</td>
<td align="center">10<sup>-3</sup>
</td>
</tr>
<tr>
<td colspan="3" align="center">
<bold>MAC-CNN</bold>
</td>
</tr>
<tr>
<td align="left">
<italic>&#x2003;E</italic>
</td>
<td align="left">Number of epochs</td>
<td align="center">15</td>
</tr>
<tr>
<td align="left">
<italic>&#x2003;B</italic>
</td>
<td align="left">Batch size</td>
<td align="center">50</td>
</tr>
<tr>
<td align="left">
<italic>&#x2003;</italic>
<inline-formula id="inf101">
<mml:math id="m117">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b7;</mml:mi>
<mml:mn>0</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Initial learning rate</td>
<td align="center">10<sup>-4</sup>
</td>
</tr>
<tr>
<td align="left">
<italic>&#x2003;</italic>
<inline-formula id="inf102">
<mml:math id="m118">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b3;</mml:mi>
<mml:mn>1</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">KL-divergence weight</td>
<td align="center">10<sup>-2</sup>
</td>
</tr>
<tr>
<td align="left">
<italic>&#x2003;</italic>
<inline-formula id="inf103">
<mml:math id="m119">
<mml:mrow>
<mml:msub>
<mml:mi>&#x3b3;</mml:mi>
<mml:mn>2</mml:mn>
</mml:msub>
</mml:mrow>
</mml:math>
</inline-formula>
</td>
<td align="left">Perceptual difference weight</td>
<td align="center">1</td>
</tr>
</tbody>
</table>
</table-wrap>
<p>To evaluate how the training process affects the D-CNN and MAC-CNN, we first evaluated the effects of training a single instance of each network. For each network, we plotted the resulting loss and accuracy from each training epoch on the training, testing, and validation datasets. <xref ref-type="fig" rid="F4">Figure&#x20;4</xref> presents these results.</p>
<fig id="F4" position="float">
<label>FIGURE 4</label>
<caption>
<p>D-CNN and MAC-CNN loss and accuracies per epoch, for one randomly initialized pair of models. The training procedure always saves the model with the lowest validation set loss, regardless of whether it is from the last epoch. For the D-CNN in this example, the model from epoch 5 is saved, as well as the <inline-formula id="inf104">
<mml:math id="m120">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula> matrix it generates on the validation set. For the MAC-CNN in this example, the model from the final epoch (epoch 14) is saved. Saving the lowest-loss model instead of the final model greatly mitigates the potential effects of overfitting in later epochs.</p>
</caption>
<graphic xlink:href="frai-04-638299-g004.tif"/>
</fig>
<p>The resulting losses and accuracies yield three main findings. First, our decision to save the lowest-loss model rather than the final model is justified, especially for the D-CNN: its validation and testing loss can vary significantly between epochs, and later epochs may yield noticeably higher losses and lower accuracies on the validation and testing sets. Second, testing and validation losses and accuracies trend very closely, as both sets are large and similar in size. Third, the learning rate scheduler used to train the MAC-CNN appears to better regulate its loss and accuracy in later epochs.</p>
<p>We also considered how the random initializations of the non-ResNet34 layers affect the training of the networks. While the convolutional layers for both the D-CNN and MAC-CNN are initialized with weights pretrained on ImageNet (<xref ref-type="bibr" rid="B10">Deng et&#x20;al., 2009</xref>), the fully connected and auxiliary layers are trained from scratch. Therefore, we trained 30 instances of both the D-CNN and MAC-CNN to see the loss and accuracy distributions on the training and validation sets.<xref ref-type="fn" rid="fn4">
<sup>4</sup>
</xref> <xref ref-type="fig" rid="F5">Figure&#x20;5</xref> presents these distributions over the 15-epoch training process. The center lines depict the median loss and accuracy, while the shaded regions depict the region between the 25th and 75th percentiles of loss and accuracy.</p>
<fig id="F5" position="float">
<label>FIGURE 5</label>
<caption>
<p>D-CNN and MAC-CNN loss and accuracies per epoch, sampled over 30 pretrained networks with random initializations of the fully connected layers. The lines represent the median loss/accuracy, and the shaded regions represent losses/accuracies between the 25th and 75th percentile for each epoch. As results from the validation and testing sets trend very closely, only results from the training and validation sets are presented.</p>
</caption>
<graphic xlink:href="frai-04-638299-g005.tif"/>
</fig>
<p>The distribution plots demonstrate that the results in <xref ref-type="fig" rid="F4">Figure&#x20;4</xref> are typical of training a D-CNN and MAC-CNN. That is, the D-CNN trains more sporadically, but still achieves a lower validation loss during training, while the MAC-CNN trains more regularly and achieves its lowest validation loss in later epochs. The MAC-CNN is unlikely to overfit, as its validation loss does not typically increase late in the training process. The D-CNN has a somewhat greater risk of overfitting, but the impact of any potential overfitting from the D-CNN is mitigated by saving the lowest-loss model. It may be possible to regularize the D-CNN training by using a learning rate scheduler like the one used for the MAC-CNN.</p>
<p>As mentioned in <xref ref-type="sec" rid="s2-5">Section 2.5</xref>, we would also like our models to have a short training time so they can be quickly applied to new expert medical domains. Therefore, we timed the training process of 10 instances of the D-CNN and MAC-CNN over 15 epochs. We found that the time required to train both the D-CNN and MAC-CNN, starting with pretrained convolutional layers, is relatively&#x20;short.</p>
<p>In a separate experiment with a single, consumer-grade Nvidia RTX 2080 Ti graphics card, we evaluated the training times for 10 instances of the D-CNN and MAC-CNN using our implementation. Training 10&#x20;D-CNN instances for 15 epochs required an average of 23.7&#xa0;min per instance (standard deviation 6.1&#xa0;s), while training 10&#x20;ResNet34-based MAC-CNN instances for 15 epochs required an average of 14.3&#xa0;min per instance (standard deviation 8.6&#xa0;s).</p>
</sec>
<sec id="s3-3">
<title>3.3 Evaluating Distance Matrix Convolutional Neural Network Performance</title>
<p>On a testing set of 42,768 patches with evenly split categories, the D-CNN achieved an accuracy of 90.79%, which is the percentage of times that it correctly determined whether a reference and comparison patch were from the same material category or different material categories.</p>
<p>Although the D-CNN is accurate at making similarity decisions in general, the most informative accuracy values are those for each pair of material categories, as these accuracy values are reflective of the similarity between categories. <xref ref-type="fig" rid="F6">Figure&#x20;6</xref> demonstrates the accuracy of the D-CNN on each pair of category groupings.</p>
<fig id="F6" position="float">
<label>FIGURE 6</label>
<caption>
<p>The accuracy of the D-CNN making correct similarity decisions between reference and comparison patches of every pair of categories. The null category, background, was easily determined to be similar or dissimilar to other patches due to its homogeneity and difference from other classes of patches. Meanwhile, the D-CNN was less accurate at classifying more similar pairs of categories, such as brain and tumor. The less accurate comparisons result in a smaller perceptual distance in the <inline-formula id="inf105">
<mml:math id="m121">
<mml:mi mathvariant="bold">D</mml:mi>
</mml:math>
</inline-formula> matrix.</p>
</caption>
<graphic xlink:href="frai-04-638299-g006.tif"/>
</fig>
<p>These accuracies follow human intuition on how perceptually different these materials are expected to be. For example, brain and tumor patches generally appear similar, and therefore the D-CNN is less likely to correctly determine if patches of&#x20;the two categories are the same or different. Meanwhile,&#x20;the network is far more accurate at evaluating material patches that appear highly different, such as brain and&#x20;bone.</p>
</sec>
<sec id="s3-4">
<title>3.4 Evaluating Material Attribute-Category Convolutional Neural Network Performance</title>
<p>The distance matrix generated during the training epoch where the D-CNN achieved the greatest validation accuracy was used as the basis for the material attribute-category matrix <inline-formula id="inf106">
<mml:math id="m122">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula>. The L-BFGS-B algorithm optimized an <inline-formula id="inf107">
<mml:math id="m123">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> matrix with a minimal distance <inline-formula id="inf108">
<mml:math id="m124">
<mml:mrow>
<mml:mi>d</mml:mi>
<mml:mrow>
<mml:mo>(</mml:mo>
<mml:mrow>
<mml:mi mathvariant="bold">D</mml:mi>
<mml:mo>,</mml:mo>
<mml:mi mathvariant="bold">A</mml:mi>
</mml:mrow>
<mml:mo>)</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula> &#x3d;&#x20;1.18.</p>
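<p>This optimization step can be sketched with SciPy&#x2019;s L-BFGS-B routine. We assume here, purely for illustration, that <italic>d</italic>(<bold>D</bold>, <bold>A</bold>) is a squared-error distance between <bold>D</bold> and the Euclidean distances among the rows of <bold>A</bold>; the exact objective in our implementation may be defined differently.</p>

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import pdist, squareform

def fit_attribute_matrix(D, n_attributes, seed=0):
    """Optimize a K x M matrix A (entries in [0, 1]) so that the Euclidean
    distances among its rows approximate the perceptual distance matrix D.
    The squared-error objective is an illustrative choice of d(D, A)."""
    K = D.shape[0]
    rng = np.random.default_rng(seed)
    a0 = rng.uniform(0.0, 1.0, size=K * n_attributes)  # randomized start

    def objective(a_flat):
        A = a_flat.reshape(K, n_attributes)
        return np.sum((squareform(pdist(A)) - D) ** 2)

    result = minimize(objective, a0, method="L-BFGS-B",
                      bounds=[(0.0, 1.0)] * a0.size)  # keep each a_km in [0, 1]
    return result.x.reshape(K, n_attributes), result.fun
```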
<p>Using this matrix, the MAC-CNN reached 92.82% accuracy at determining the material category of each image patch from a testing set. For reference, <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref> attained 84% accuracy at best for a given category. However, the smaller number of categories that our MAC-CNN evaluates may make the classification problem easier, yielding a higher accuracy: our network evaluates four categories, while <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref> evaluated&#x20;13.</p>
<p>When withholding the material attributes and calculating loss as a mean squared error between the predicted and actual image patch material category, the accuracy of the MAC-CNN for determining material categories on the testing set was 91.74%. This shows that including the <inline-formula id="inf109">
<mml:math id="m125">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> matrix&#x2019;s material attributes does not significantly alter the network&#x2019;s ability to predict material categories.</p>
<p>Additionally, we compared the performance of our ResNet34-based MAC-CNN to a variant based on VGG-16 (<xref ref-type="bibr" rid="B29">Simonyan and Zisserman, 2014</xref>). The VGG-16 variant reflects the MAC-CNN architecture proposed by <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref>, with convolutional sequential layers and an identical auxiliary network design. After training the VGG-16 model on the material patch dataset with the same learning parameters, the VGG-16 model had an accuracy of 93.39% for determining material categories on the testing set. This shows that the ResNet34 and VGG-16 models have comparable accuracy (within 0.6%). While these two smaller models perform similarly, ResNet&#x2019;s better scalability to more layers (<xref ref-type="bibr" rid="B14">He et&#x20;al., 2015</xref>) makes it advantageous for larger medical material datasets.</p>
<p>To evaluate the relationship between the material attributes learned by the MAC-CNN from <inline-formula id="inf110">
<mml:math id="m126">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> and the material categories of the image patches, a correlation matrix was generated to show how positively or negatively each learned material attribute related to the occurrence of the true label of a given material category. <xref ref-type="fig" rid="F7">Figure&#x20;7</xref> presents this matrix.</p>
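<p>Such a correlation matrix can be computed by correlating each predicted attribute with a one-hot indicator of each true category. A minimal sketch, with illustrative array names:</p>

```python
import numpy as np

def attribute_category_correlation(attr_preds, labels, n_categories):
    """Pearson correlation between each predicted attribute (columns of
    attr_preds, shape N x M) and the one-hot indicator of each category
    (labels is a length-N array of integer category labels)."""
    one_hot = np.eye(n_categories)[labels]          # N x K indicator matrix
    n_attrs = attr_preds.shape[1]
    corr = np.zeros((n_attrs, n_categories))
    for i in range(n_attrs):
        for k in range(n_categories):
            corr[i, k] = np.corrcoef(attr_preds[:, i], one_hot[:, k])[0, 1]
    return corr
```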
<fig id="F7" position="float">
<label>FIGURE 7</label>
<caption>
<p>The correlation of MAC-CNN categorizations between material categories and material attributes. The most strongly exhibited association is with attribute 0 and the background category, which may be attributable to its homogeneity as the null category. Attributes 1 and 2 do not greatly separate the brain and tumor material categories, likely due to their small perceptual distance.</p>
</caption>
<graphic xlink:href="frai-04-638299-g007.tif"/>
</fig>
<p>The matrix shows that attributes 1 and 3 are relatively uncorrelated to brain, bone, and tumor, and attribute 0 is negatively correlated to brain, bone, and tumor. Attribute 2 is moderately positively correlated with tumor and brain and slightly less positively correlated with bone. This matrix demonstrates that the attributes do not correspond one-to-one to given categories, meaning that the attributes encode different information than the categories.</p>
<p>An important factor in evaluating the MAC-CNN is determining if the material attributes encoded in <inline-formula id="inf111">
<mml:math id="m127">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> can accurately separate image patches by category. We used a method called <italic>t</italic>-SNE embedding (<xref ref-type="bibr" rid="B31">van der Maaten and Hinton, 2008</xref>), also used to evaluate the material attributes in <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref>, to determine how well the MAC-CNN&#x2019;s material attribute predictions separate material categories compared to the raw feature vectors of the patches. <italic>t</italic>-SNE embedding is a machine learning procedure that embeds the distributions of neighboring points in high-dimensional spaces to lower-dimension spaces, making the visualization of these high-dimensional spaces practical.</p>
<p>
<xref ref-type="fig" rid="F8">Figure&#x20;8</xref> shows the <italic>t</italic>-SNE embedding on the raw feature vectors and the <italic>M</italic> attributes learned by the MAC-CNN from <inline-formula id="inf112">
<mml:math id="m128">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> on the test set. The graphs demonstrate a much clearer separation of categories for the material attributes compared to the raw feature vectors, while also maintaining intuitive perceptual distances&#x2013;for example, brain and tumor are more closely grouped than brain and bone. This indicates that the MAC-CNN&#x2019;s learned attributes provide useful information that separates material patches by category beyond what the raw features alone provide.</p>
<fig id="F8" position="float">
<label>FIGURE 8</label>
<caption>
<p>The results from t-SNE embedding (<xref ref-type="bibr" rid="B31">van der Maaten and Hinton, 2008</xref>) on the raw feature set <bold>(A)</bold> and the learned <italic>M</italic> attribute predictions encoded in <inline-formula id="inf113">
<mml:math id="m129">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> <bold>(B)</bold>. Although some separation is apparent on the raw feature embedding, especially between less-correlated categories like bone and brain, the separation is far stronger on the <italic>M</italic> attribute embedding, which even separates similar categories like brain and tumor while maintaining perceptual distances. This shows that the MAC-CNN has effectively learned to distinguish categories of image patches using the <italic>M</italic> attributes.</p>
</caption>
<graphic xlink:href="frai-04-638299-g008.tif"/>
</fig>
</sec>
<sec id="s3-5">
<title>3.5 Expanding Material Recognition to Full Images</title>
<p>As shown in Section 3.4, the MAC-CNN can accurately distinguish material categories from localized image patches. However, it is worth exploring whether this localized information can still yield useful results in the context of an entire image. If so, the MAC-CNN could be a promising component of future image analysis systems. That said, it would not be reasonable to use the MAC-CNN alone, since it can only extract <italic>local</italic> information, losing the valuable information that comes from greater context.</p>
<p>To apply the MAC-CNN to full medical scans, patches were sampled in a sliding-window fashion from full images. A <inline-formula id="inf114">
<mml:math id="m130">
<mml:mrow>
<mml:mn>32</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>32</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> pixel window was used with a stride of 4 pixels.</p>
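<p>The sliding-window sampling can be sketched as follows; <monospace>classify</monospace> stands in for a call to the trained MAC-CNN on a single patch, and the window and stride match the values above.</p>

```python
import numpy as np

def sliding_window_labels(image, classify, window=32, stride=4):
    """Classify every window x window patch of a 2-D image, stepping by
    `stride` pixels, and return the grid of predicted category labels."""
    h, w = image.shape
    rows = (h - window) // stride + 1
    cols = (w - window) // stride + 1
    labels = np.zeros((rows, cols), dtype=int)
    for i in range(rows):
        for j in range(cols):
            y, x = i * stride, j * stride
            labels[i, j] = classify(image[y:y + window, x:x + window])
    return labels
```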
<p>The one-hot classification of material categories from performing a sliding-window analysis on the MAC-CNN was mapped to a matrix that contains the label of each patch sampled from the image. <xref ref-type="fig" rid="F9">Figure&#x20;9</xref> shows the MAC-CNN&#x2019;s output on four brain MRI images using this convolutional system.</p>
<fig id="F9" position="float">
<label>FIGURE 9</label>
<caption>
<p>The MAC-CNN&#x2019;s category decisions applied in a sliding-window manner to several full brain scans. The first column contains raw images with the expertly annotated mask (&#x201c;tumor&#x201d;) highlighted, while the second column contains raw images overlaid with the results from the MAC-CNN. The MAC-CNN is effective at detecting tumor regions but often picks up extraneous noise. The network also appears to exhibit knowledge transfer from the knee X-rays, recognizing the bone textures it learned there around the perimeter of the skull.</p>
</caption>
<graphic xlink:href="frai-04-638299-g009.tif"/>
</fig>
<p>The MAC-CNN is effective in most cases at isolating the expertly annotated mask, which in the case of a brain scan is the &#x201c;tumor&#x201d; category. However, the network is often too sensitive and miscategorizes some portions of the brain MRI as tumor even though they lie outside the expertly annotated mask. These miscategorizations likely occur because the network only views small image patches of the MRI, so it has no greater context when making its categorizations.</p>
<p>With that in mind, the network still generally identified tumors when they were present. This shows that the network successfully learned a variety of textures that indicate the presence of a brain tumor. Interestingly, some transfer learning also occurred from the knee X-ray image patches, as the sliding-window analysis sometimes identified the perimeter of the skull as having a bone texture. This shows that the MAC-CNN&#x2019;s predictive power is robust enough to transfer categorizations learned from one image type to other image types with similar textural appearance.</p>
<p>Learned material attributes may also provide insight into full image analysis. <xref ref-type="fig" rid="F10">Figure&#x20;10</xref> shows the MAC-CNN&#x2019;s sliding-window evaluation of a single brain MRI on each of the <italic>m</italic> material attributes. The attributes appear to pick up information that is distinct from, yet complementary to, the material categories. Attributes 0 and 1, for example, tend to identify regions of the scan that are <italic>not</italic> tumor, while attribute 3 tends to pick up likely tumor regions. Meanwhile, attribute 2 tends to pick up regions that are non-null. This behavior corresponds to the correlation values presented in <xref ref-type="fig" rid="F7">Figure&#x20;7</xref>.</p>
<fig id="F10" position="float">
<label>FIGURE 10</label>
<caption>
<p>The MAC-CNN&#x2019;s <inline-formula id="inf115">
<mml:math id="m131">
<mml:mrow>
<mml:mi>M</mml:mi>
<mml:mo>&#x3d;</mml:mo>
<mml:mn>4</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> attributes applied in a sliding-window manner across a single brain scan. Each row represents a different attribute being evaluated, with the raw output <bold>(B)</bold> and the output overlaid on the image <bold>(A)</bold>. A higher value means the given attribute is expressed more strongly at that location in the image. Each attribute picks up different aspects of the scan, and different attributes can either positively or negatively exhibit aspects of different material categories. For example, attributes 0 and 1 negatively correlate to the expertly annotated mask (tumor) while attribute 3 correlates highly to&#x20;it.</p>
</caption>
<graphic xlink:href="frai-04-638299-g010.tif"/>
</fig>
</sec>
</sec>
<sec id="s4">
<title>4 Discussion</title>
<sec id="s4-1">
<title>4.1 Related Work</title>
<p>Our methodology draws from many recent, relevant works about material analysis, computer vision, neural networks, and machine learning applications in medicine.</p>
<p>A few recent works in the medical field include applying machine learning to classify necrotic sections of pressure wounds (<xref ref-type="bibr" rid="B36">Zahia et&#x20;al., 2018</xref>), segment brain scans (<xref ref-type="bibr" rid="B20">Lai et&#x20;al., 2019</xref>), and segment chest X-rays (<xref ref-type="bibr" rid="B33">Wang et&#x20;al., 2019</xref>).</p>
<p>In material analysis, there has been significant research into leveraging fully and weakly-supervised learning systems. <xref ref-type="bibr" rid="B1">Bell et&#x20;al. (2015)</xref> introduced and evaluated the Materials in Context database, a large set of image patches with natural material category labels, in a fully supervised manner. <xref ref-type="bibr" rid="B2">Berg et&#x20;al. (2010)</xref> proposed a weakly supervised attribute discovery model for data mining images and text on the Internet, which did include some local attribute classification. However, their network&#x2019;s text annotations were associated with an entire image, and the images were not specific to an expert domain.</p>
<p>Material analysis has been performed in multiple expert domains with reduced data availability, including medicine. <xref ref-type="bibr" rid="B12">Gibert et&#x20;al. (2015)</xref> performed material analysis on photographs of railroad tracks using several domain-specific categories to detect decaying infrastructure. Their annotation uses a system of bounding boxes on photographs of railroad ties to determine regions of given categories. In medicine, <xref ref-type="bibr" rid="B22">Marvasti et&#x20;al. (2018)</xref> performed texture analysis on CT scans of liver lesions using a Bayesian network, evaluating features such as location, shape, proximity, and texture.</p>
<p>Specifically, our method is based on the material analysis method introduced by <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref>. That work proposed a dataset of natural material categories and used a weakly supervised learning method to generate material attributes. The proposed method differs from <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref> in several ways. First, we specialize our method to medical radiography images, while <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref> focused exclusively on natural materials found in common photographs. Second, our method automatically generates a material distance metric from material patches using the D-CNN, while <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref> used human annotators to manually make binary similarity decisions among pairs of material patches. We decided this was necessary because evaluating medical material similarity by hand requires domain experts, and doctors and similar experts are scarce and expensive to retain in most situations. Third, our method upgrades the MAC-CNN proposed by <xref ref-type="bibr" rid="B28">Schwartz and Nishino (2020)</xref> by using the more scalable ResNet (<xref ref-type="bibr" rid="B14">He et&#x20;al., 2015</xref>) architecture instead of VGG (<xref ref-type="bibr" rid="B29">Simonyan and Zisserman, 2014</xref>), letting larger, more augmented medical material datasets benefit from easier training on larger variants of the MAC-CNN.</p>
<p>We based the D-CNN on the Siamese neural network architecture as it has shown to be useful in a variety of similarity-evaluation problems. The Siamese neural network was first introduced by <xref ref-type="bibr" rid="B5">Bromley et&#x20;al. (1994)</xref> to detect forgeries in digital signatures. Since then, Siamese networks have been used for human re-identification (<xref ref-type="bibr" rid="B32">Varior et&#x20;al., 2016</xref>; <xref ref-type="bibr" rid="B8">Chung et&#x20;al., 2017</xref>), one-shot image classification (<xref ref-type="bibr" rid="B18">Koch et&#x20;al., 2015</xref>), object tracking (<xref ref-type="bibr" rid="B3">Bertinetto et&#x20;al., 2016</xref>; <xref ref-type="bibr" rid="B13">Guo et&#x20;al., 2017</xref>), and sentence similarity (<xref ref-type="bibr" rid="B23">Mueller and Thyagarajan, 2016</xref>). In medicine, Siamese networks have been used in similarity-evaluation tasks like gait recognition (<xref ref-type="bibr" rid="B37">Zhang et&#x20;al., 2016</xref>), spinal metastasis detection (<xref ref-type="bibr" rid="B34">Wang et&#x20;al., 2017</xref>), and to segment brain cytoarchitectonics (<xref ref-type="bibr" rid="B30">Spitzer et&#x20;al., 2018</xref>).</p>
<p>Many novel neural network architectures have been proposed for computer vision tasks, including ResNeXt (<xref ref-type="bibr" rid="B35">Xie et&#x20;al., 2017</xref>), DenseNet (<xref ref-type="bibr" rid="B15">Huang et&#x20;al., 2017</xref>), PNASNet (<xref ref-type="bibr" rid="B21">Liu et&#x20;al., 2018</xref>), and the Vision Transformer (ViT) (<xref ref-type="bibr" rid="B11">Dosovitskiy et&#x20;al., 2020</xref>). For both the D-CNN and MAC-CNN, the ResNet (<xref ref-type="bibr" rid="B14">He et&#x20;al., 2015</xref>) architecture is used. ResNet was selected over these other architectures for a few reasons.</p>
<p>For ViT, we do not believe the model is suitable for small texture patches. ViT divides its input into patches as tokens, and embeddings of these tokens are used as inputs into the model. While ViT achieves excellent performance on small-sized image datasets like CIFAR-10/100 (<xref ref-type="bibr" rid="B19">Krizhevsky, 2009</xref>), where each image is <inline-formula id="inf116">
<mml:math id="m132">
<mml:mrow>
<mml:mn>32</mml:mn>
<mml:mo>&#xd7;</mml:mo>
<mml:mn>32</mml:mn>
</mml:mrow>
</mml:math>
</inline-formula> pixels, such images contain more information than our texture patches. Each sub-patch of a CIFAR image sample may contain distinct information, but the sub-patches of a texture patch are not expected to do so because material patches only contain local context.</p>
<p>For PNASNet and other neural architecture search models, interpretability is sacrificed for accuracy. These discovered architectures are less interpretable than handcrafted architectures like ResNet. Interpretability is important in domains like medicine. For example, identifying causal relationships is key for doctors to diagnose conditions, and these causalities are easier to identify from interpretable models.</p>
<p>ResNet specifically has the following benefits. First, its structure, like VGG (<xref ref-type="bibr" rid="B29">Simonyan and Zisserman, 2014</xref>) and earlier convolutional architectures, allows for greater interpretability. The convolutional layers are stacked sequentially, and the feature maps of the hidden states can be visualized to determine what each convolutional filter detects. Second, unlike VGG, ResNet&#x2019;s skip connections allow for the training of a much deeper network, which could be useful for complex, large medical material datasets with dozens of categories. Third, unlike some recent architectures, the purely sequential layers of ResNet&#x2019;s design allow for an intuitive auxiliary network design. The sequential design allows for the auxiliary classifiers of the MAC-CNN to be placed so that each classifier processes a hidden state from a different stage of the network. With non-sequential models like ViT and PNASNet, finding an efficient placement of these auxiliary classifiers may be challenging. Fourth, ResNet models have a relatively small number of parameters compared to larger, more recent models, allowing for quicker training. This could be useful for specialized medical material problems, where a small group of researchers or doctors may not have many available computational resources.</p>
<p>U-Net (<xref ref-type="bibr" rid="B25">Ronneberger et&#x20;al., 2015</xref>) uses a fully convolutional network to predict segmentation maps from input images. A fundamental difference between U-Net and the proposed method is that U-Net requires segmentation maps as ground truth label data. In the proposed method, we do not use segmentation maps as ground truth label data because complete and complex segmentation maps are often not available for training. For example, in <xref ref-type="fig" rid="F9">Figure&#x20;9</xref>, to segment bone, brain tissue, brain tumor tissue, and the background, U-Net would require a 4-class segmentation map as the label data for each training image. The dataset created in the proposed method instead uses a 2-class segmentation map: brain tumor tissue and everything else. In the proposed method, the dataset used to train the network uses class labels only or simple derived labels as explained in <xref ref-type="sec" rid="s2-1">Section 2.1.</xref> The proposed method uses a patch generation process to create labeled material patches that can be used to train the network to pick up on local patterns relating to material type. This avoids the problem of expensive manual annotation.</p>
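<p>The derivation of patch labels from a 2-class mask can be sketched as a simple rule; the threshold and the naive fallback rule below are our own illustration, not the exact procedure used to build the dataset.</p>

```python
import numpy as np

def label_patch(image_patch, mask_patch, tumor_threshold=0.5):
    """Derive a patch label from an expert 2-class tumor mask.

    A patch mostly covered by the mask receives the expert category;
    otherwise a naive label is derived from the image content itself
    (illustrative rule: any nonzero intensity counts as tissue)."""
    if mask_patch.mean() >= tumor_threshold:
        return "tumor"
    if image_patch.mean() > 0.0:
        return "brain"
    return "background"
```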
</sec>
<sec id="s4-2">
<title>4.2 Conclusion and Future Work</title>
<p>The D-CNN and MAC-CNN demonstrate that medical material categories can be successfully evaluated from radiography images using local information. They also demonstrate that na&#xef;ve categories, such as healthy brain tissue in an MRI scan, are useful to augment expert categories, like brain tumors. We also demonstrated that such a system can be trained simultaneously on a range of expert, na&#xef;ve, and null categories and can robustly pick up relevant categories without being conditioned on a subset of categories or attributes.</p>
<p>The knowledge transfer demonstrated on the brain MRIs and knee X-rays suggests that a larger version of the model would be able to analyze a more detailed or broader set of materials. For example, training this network on brain MRI data with more detailed labeling could yield greater accuracy and less noise than merely comparing healthy brain tissue and tumors. More granular data could also reduce the number of inaccurate predictions and noise when attempting sliding-window material categorization of whole images.</p>
<p>Rather than relying on more expensive segmentation maps to act as ground truth, this model could be improved by modifying the patch generation procedure or the sliding-window approach. The patch generation procedure could be improved by iterating the process, using the model&#x2019;s predictions to create a new set of patches on which a new model is trained. The sliding-window approach uses a small but fixed window size, which makes it difficult to predict the labels of fine details in the image where multiple materials are present. The limitations of a fixed sliding window are avoided in U-Net (<xref ref-type="bibr" rid="B25">Ronneberger et&#x20;al., 2015</xref>) at the cost of requiring complete segmentation maps for ground&#x20;truth.</p>
<p>The D-CNN and MAC-CNN could also be extended to consider a larger context to further enhance material analysis. For example, a temporal dimension could be added to brain MRI data to model the progression of a brain tumor&#x2019;s texture over time. Additionally, the networks could be extended to parse three-dimensional voxel data to extract more information from&#x20;MRIs.</p>
<p>Overall, the D-CNN and MAC-CNN demonstrate the ability to perform expert material analysis from existing expertly annotated data without the need for experts to manually classify materials. The system also successfully demonstrates that intuitive observations about materials in nature can also hold in expert domains.</p>
</sec>
</sec>
</body>
<back>
<sec id="s5">
<title>Data Availability Statement</title>
<p>The original contributions presented in the study are included in the article/Supplementary Material; further inquiries can be directed to the corresponding author.</p>
<p>The datasets used to evaluate our models can be found in the following repositories:</p>
<p>- Cohort Knee and Cohort Hip (CHECK): <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:62955">https://easy.dans.knaw.nl/ui/datasets/id/easy-dataset:62955</ext-link>
</p>
<p>- brain tumor dataset: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://figshare.com/articles/brain_tumor_dataset/1512427">https://figshare.com/articles/brain_tumor_dataset/1512427</ext-link>
</p>
<p>- Brain-Tumor-Progression: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://wiki.cancerimagingarchive.net/display/Public/Brain-Tumor-Progression">https://wiki.cancerimagingarchive.net/display/Public/Brain-Tumor-Progression</ext-link>
</p>
<p>The code used to implement our method and experiments can be found in the following GitHub repository: <ext-link ext-link-type="uri" xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="https://github.com/cmolder/medical-materials">https://github.com/cmolder/medical-materials</ext-link>
</p>
</sec>
<sec id="s6">
<title>Author Contributions</title>
<p>CM was involved in the development, design, implementation, and testing of the methods and models. CM was the primary author of the manuscript. BL was involved in the development and design of the model and some of the implementation. BL contributed to the manuscript. JZ was involved in the development and design of the model and provided insight during the research and production of the manuscript.</p>
</sec>
<sec id="s7">
<title>Funding</title>
<p>This work was supported in part by the Arkansas Research Alliance, the National Science Foundation under grant 1946391, and the National Institutes of Health under grants P20GM139768, P20GM121293, and 7R01CA225773-03. This research was also supported by a University of Arkansas Honors College Research Team Grant. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.</p>
</sec>
<sec sec-type="COI-statement" id="s8">
<title>Conflict of Interest</title>
<p>The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.</p>
</sec>
<ack>
<p>We acknowledge Hadi Salman and Alycia Carey for their assistance in helping us develop our approach. We also&#x20;acknowledge Gabriel Schwartz for providing source code that we referred to during the implementation of this&#x20;paper.</p>
</ack>
<fn-group>
<fn id="fn1">
<label>1</label>
<p>The D-CNN uses the Adam optimizer (<xref ref-type="bibr" rid="B17">Kingma and Ba, 2014</xref>) for gradient descent and weight updates.</p>
</fn>
<fn id="fn2">
<label>2</label>
<p>The L-BFGS-B optimization algorithm is used to find a local minimum for the objective, starting from a randomized <inline-formula id="inf117">
<mml:math id="m133">
<mml:mi mathvariant="bold">A</mml:mi>
</mml:math>
</inline-formula> with entries <inline-formula id="inf118">
<mml:math id="m134">
<mml:mrow>
<mml:msub>
<mml:mi>a</mml:mi>
<mml:mrow>
<mml:mi>k</mml:mi>
<mml:mi>m</mml:mi>
</mml:mrow>
</mml:msub>
<mml:mo>&#x2208;</mml:mo>
<mml:mrow>
<mml:mo>[</mml:mo>
<mml:mn>0,1</mml:mn>
<mml:mo>]</mml:mo>
</mml:mrow>
</mml:mrow>
</mml:math>
</inline-formula>.</p>
</fn>
<fn id="fn3">
<label>3</label>
<p>The MAC-CNN uses the Adam optimizer (<xref ref-type="bibr" rid="B17">Kingma and Ba, 2014</xref>) for gradient descent and weight updates.</p>
</fn>
<fn id="fn4">
<label>4</label>
<p>The testing set loss and accuracy distributions are not included, as they have similar distributions to the validation&#x20;set.</p>
</fn>
</fn-group>
<ref-list>
<title>References</title>
<ref id="B1">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Bell</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Upchurch</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Snavely</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Bala</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Material Recognition in the Wild with the Materials in Context Database</article-title>,&#x201d; in <conf-name>The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</conf-name>, <fpage>3479</fpage>&#x2013;<lpage>3487</lpage>. </citation>
</ref>
<ref id="B2">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Berg</surname>
<given-names>T. L.</given-names>
</name>
<name>
<surname>Berg</surname>
<given-names>A. C.</given-names>
</name>
<name>
<surname>Shih</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Automatic Attribute Discovery and Characterization from Noisy Web Data</article-title>. <source>ECCV</source> <volume>6311</volume>, <fpage>663</fpage>&#x2013;<lpage>676</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-642-15549-9_48</pub-id> </citation>
</ref>
<ref id="B3">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bertinetto</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Valmadre</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Henriques</surname>
<given-names>J.&#x20;F.</given-names>
</name>
<name>
<surname>Vedaldi</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Torr</surname>
<given-names>P. H. S.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Fully-convolutional Siamese Networks for Object Tracking</article-title>,&#x201d; in <source>European Conference on Computer Vision</source> (<publisher-name>Springer</publisher-name>), <fpage>850</fpage>&#x2013;<lpage>865</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-48881-3_56</pub-id> </citation>
</ref>
<ref id="B4">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bijlsma</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Wesseling</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2015</year>). <source>Thematic Collection: Check (Cohort Hip &#x26; Cohort Knee)</source>. <pub-id pub-id-type="doi">10.17026/dans-zc8-g4cw</pub-id> </citation>
</ref>
<ref id="B5">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Bromley</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Guyon</surname>
<given-names>I.</given-names>
</name>
<name>
<surname>LeCun</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>S&#xe4;ckinger</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Shah</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>1994</year>). &#x201c;<article-title>Signature Verification Using a&#x201d; Siamese&#x201d; Time Delay Neural Network</article-title>,&#x201d; in <source>Advances In Neural Information Processing Systems</source>, <fpage>737</fpage>&#x2013;<lpage>744</lpage>. </citation>
</ref>
<ref id="B6">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Chechik</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Sharma</surname>
<given-names>V.</given-names>
</name>
<name>
<surname>Shalit</surname>
<given-names>U.</given-names>
</name>
<name>
<surname>Bengio</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2010</year>). <article-title>Large Scale Online Learning of Image Similarity through Ranking</article-title>. <source>J.&#x20;Machine Learn. Res.</source> <volume>11</volume>, <fpage>1109</fpage>&#x2013;<lpage>1135</lpage>. </citation>
</ref>
<ref id="B7">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Cheng</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2017</year>). <source>Brain Tumor Dataset</source>. <pub-id pub-id-type="doi">10.6084/m9.figshare.1512427.v5</pub-id> </citation>
</ref>
<ref id="B8">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Chung</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Tahboub</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Delp</surname>
<given-names>E. J.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>A Two Stream Siamese Convolutional Neural Network for Person Re-identification</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE International Conference on Computer Vision</conf-name>, <fpage>1983</fpage>&#x2013;<lpage>1991</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.218</pub-id> </citation>
</ref>
<ref id="B9">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Clark</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Vendt</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Smith</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Freymann</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Kirby</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Koppel</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2013</year>). <article-title>The Cancer Imaging Archive (TCIA): Maintaining and Operating a Public Information Repository</article-title>. <source>J.&#x20;Digital Imaging</source> <volume>26</volume>, <fpage>1045</fpage>&#x2013;<lpage>1057</lpage>. <pub-id pub-id-type="doi">10.1007/s10278-013-9622-7</pub-id> </citation>
</ref>
<ref id="B10">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Deng</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Dong</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Socher</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L.-J.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Fei-Fei</surname>
<given-names>L.</given-names>
</name>
</person-group> (<year>2009</year>). &#x201c;<article-title>ImageNet: A Large-Scale Hierarchical Image Database</article-title>,&#x201d; in <conf-name>2009 IEEE Conference on Computer Vision and Pattern Recognition</conf-name>
(<publisher-name>IEEE</publisher-name>), <fpage>248</fpage>&#x2013;<lpage>255</lpage>. <pub-id pub-id-type="doi">10.1109/CVPR.2009.5206848</pub-id> </citation>
</ref>
<ref id="B11">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Dosovitskiy</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Beyer</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Kolesnikov</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Weissenborn</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Zhai</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Unterthiner</surname>
<given-names>T.</given-names>
</name>
<etal/>
</person-group> (<year>2020</year>). <source>An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale</source>. <conf-name>International Conference on Learning Representations (2021)</conf-name>, <publisher-name>ICLR</publisher-name>.</citation>
</ref>
<ref id="B12">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Gibert</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Patel</surname>
<given-names>V. M.</given-names>
</name>
<name>
<surname>Chellappa</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Material Classification and Semantic Segmentation of Railway Track Images with Deep Convolutional Neural Networks</article-title>,&#x201d; in <conf-name>2015 IEEE International Conference on Image Processing (ICIP)</conf-name>, <fpage>621</fpage>&#x2013;<lpage>625</lpage>. <pub-id pub-id-type="doi">10.1109/ICIP.2015.7350873</pub-id> </citation>
</ref>
<ref id="B13">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Guo</surname>
<given-names>Q.</given-names>
</name>
<name>
<surname>Feng</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Zhou</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Huang</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Wan</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>S.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Learning Dynamic Siamese Network for Visual Object Tracking</article-title>,&#x201d; in <conf-name>The IEEE International Conference on Computer Vision (ICCV)</conf-name>, <fpage>1763</fpage>&#x2013;<lpage>1771</lpage>. <pub-id pub-id-type="doi">10.1109/ICCV.2017.196</pub-id> </citation>
</ref>
<ref id="B14">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>He</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zhang</surname>
<given-names>X.</given-names>
</name>
<name>
<surname>Ren</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sun</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2015</year>). <source>Deep Residual Learning for Image Recognition</source>. <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition (2016)</conf-name>, <conf-loc>Las Vegas</conf-loc>, <publisher-name>CVPR</publisher-name>.</citation>
</ref>
<ref id="B15">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Huang</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Van Der Maaten</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Weinberger</surname>
<given-names>K. Q.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Densely Connected Convolutional Networks</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name>, <fpage>4700</fpage>&#x2013;<lpage>4708</lpage>. </citation>
</ref>
<ref id="B16">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Irvin</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Rajpurkar</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Ko</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Yu</surname>
<given-names>Y.</given-names>
</name>
<name>
<surname>Ciurea-Ilcus</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Chute</surname>
<given-names>C.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). <source>CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison</source>.</citation>
</ref>
<ref id="B17">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Kingma</surname>
<given-names>D. P.</given-names>
</name>
<name>
<surname>Ba</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2014</year>). <source>Adam: A Method for Stochastic Optimization</source>. <conf-name>International Conference on Learning Representations (2015)</conf-name>, <conf-loc>San Diego</conf-loc>, <publisher-name>ICLR</publisher-name>.</citation>
</ref>
<ref id="B18">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Koch</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Zemel</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Salakhutdinov</surname>
<given-names>R.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>Siamese Neural Networks for One-Shot Image Recognition</article-title>,&#x201d; in <conf-name>ICML Deep Learning Workshop</conf-name>, <conf-loc>Lille, France</conf-loc>, <volume>2</volume> </citation>
</ref>
<ref id="B19">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Krizhevsky</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2009</year>). <source>Learning Multiple Layers of Features from Tiny Images</source>. <comment>Technical Rep., University of Toronto</comment>, <fpage>60</fpage>.</citation>
</ref>
<ref id="B20">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Lai</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Ling</surname>
<given-names>X.</given-names>
</name>
</person-group> (<year>2019</year>). <article-title>Segmentation of Brain MR Images by Using Fully Convolutional Network and Gaussian Mixture Model with Spatial Constraints</article-title>. <source>Math. Probl. Eng.</source> <volume>2019</volume>, <fpage>4625371</fpage>. <pub-id pub-id-type="doi">10.1155/2019/4625371</pub-id> </citation>
</ref>
<ref id="B21">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Liu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zoph</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Neumann</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Shlens</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Hua</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Li</surname>
<given-names>L.-J.</given-names>
</name>
<etal/>
</person-group> (<year>2018</year>). &#x201c;<article-title>Progressive Neural Architecture Search</article-title>,&#x201d; in <conf-name>Proceedings of the European conference on computer vision (ECCV)</conf-name>, <fpage>19</fpage>&#x2013;<lpage>34</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-01246-5_2</pub-id> </citation>
</ref>
<ref id="B22">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Marvasti</surname>
<given-names>N. B.</given-names>
</name>
<name>
<surname>Y&#xf6;r&#xfc;k</surname>
<given-names>E.</given-names>
</name>
<name>
<surname>Acar</surname>
<given-names>B.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Computer-aided Medical Image Annotation: Preliminary Results with Liver Lesions in CT</article-title>. <source>IEEE J.&#x20;Biomed. Health Inform.</source> <volume>22</volume>, <fpage>1561</fpage>&#x2013;<lpage>1570</lpage>. <pub-id pub-id-type="doi">10.1109/JBHI.2017.2771211</pub-id> </citation>
</ref>
<ref id="B23">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Mueller</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Thyagarajan</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Siamese Recurrent Architectures for Learning Sentence Similarity</article-title>,&#x201d; in <conf-name>Proceedings of the 30th AAAI Conference on Artificial Intelligence</conf-name>, <conf-loc>Phoenix</conf-loc> (<publisher-name>AAAI Press</publisher-name>), <fpage>2786</fpage>&#x2013;<lpage>2792</lpage>. </citation>
</ref>
<ref id="B24">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Paszke</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Gross</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Massa</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Lerer</surname>
<given-names>A.</given-names>
</name>
<name>
<surname>Bradbury</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Chanan</surname>
<given-names>G.</given-names>
</name>
<etal/>
</person-group> (<year>2019</year>). &#x201c;<article-title>PyTorch: An Imperative Style, High-Performance Deep Learning Library</article-title>,&#x201d; in <conf-name>Conference on Neural Information Processing Systems (NeurIPS) (2019)</conf-name>, <conf-loc>Vancouver, Canada</conf-loc> (<publisher-name>Curran Associates, Inc.</publisher-name>), <fpage>8026</fpage>&#x2013;<lpage>8037</lpage>. </citation>
</ref>
<ref id="B25">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Ronneberger</surname>
<given-names>O.</given-names>
</name>
<name>
<surname>Fischer</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Brox</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2015</year>). <source>U-Net: Convolutional Networks for Biomedical Image Segmentation</source>. <conf-name>International Conference on Medical Image Computing and Computer-Assisted Intervention (2015)</conf-name>, <conf-loc>Munich, Germany</conf-loc>, <publisher-name>MICCAI</publisher-name>.</citation>
</ref>
<ref id="B26">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Schmainda</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Prah</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2018</year>). <source>Data from Brain-Tumor-Progression</source>. <pub-id pub-id-type="doi">10.7937/K9/TCIA.2018.15quzvnb</pub-id> </citation>
</ref>
<ref id="B27">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Schroff</surname>
<given-names>F.</given-names>
</name>
<name>
<surname>Kalenichenko</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Philbin</surname>
<given-names>J.</given-names>
</name>
</person-group> (<year>2015</year>). &#x201c;<article-title>FaceNet: A Unified Embedding for Face Recognition and Clustering</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name>, <fpage>815</fpage>&#x2013;<lpage>823</lpage>. </citation>
</ref>
<ref id="B28">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Schwartz</surname>
<given-names>G.</given-names>
</name>
<name>
<surname>Nishino</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2020</year>). <article-title>Recognizing Material Properties from Images</article-title>. <source>IEEE Trans. Pattern Anal. Mach. Intell.</source> <volume>42</volume>, <fpage>1981</fpage>&#x2013;<lpage>1995</lpage>. <pub-id pub-id-type="doi">10.1109/tpami.2019.2907850</pub-id> </citation>
</ref>
<ref id="B29">
<citation citation-type="book">
<person-group person-group-type="author">
<name>
<surname>Simonyan</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Zisserman</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2014</year>). <source>Very Deep Convolutional Networks for Large-Scale Image Recognition</source>. <conf-name>International Conference on Learning Representations (2015)</conf-name>, <conf-loc>San Diego</conf-loc>, <publisher-name>ICLR</publisher-name>.</citation>
</ref>
<ref id="B30">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Spitzer</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Kiwitz</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Amunts</surname>
<given-names>K.</given-names>
</name>
<name>
<surname>Harmeling</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Dickscheid</surname>
<given-names>T.</given-names>
</name>
</person-group> (<year>2018</year>). &#x201c;<article-title>Improving Cytoarchitectonic Segmentation of Human Brain Areas with Self-Supervised Siamese Networks</article-title>,&#x201d; in <conf-name>International Conference on Medical Image Computing and Computer-Assisted Intervention</conf-name>
(<publisher-name>Springer</publisher-name>), <fpage>663</fpage>&#x2013;<lpage>671</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-030-00931-1_76</pub-id> </citation>
</ref>
<ref id="B31">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>van der Maaten</surname>
<given-names>L.</given-names>
</name>
<name>
<surname>Hinton</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2008</year>). <article-title>Visualizing Data Using t-SNE</article-title>. <source>J.&#x20;Machine Learn. Res.</source> <volume>9</volume>, <fpage>2579</fpage>&#x2013;<lpage>2605</lpage>. </citation>
</ref>
<ref id="B32">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Varior</surname>
<given-names>R. R.</given-names>
</name>
<name>
<surname>Haloi</surname>
<given-names>M.</given-names>
</name>
<name>
<surname>Wang</surname>
<given-names>G.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Gated Siamese Convolutional Neural Network Architecture for Human Re-identification</article-title>,&#x201d; in <conf-name>European conference on computer vision</conf-name>
(<publisher-name>Springer</publisher-name>), <fpage>791</fpage>&#x2013;<lpage>808</lpage>. <pub-id pub-id-type="doi">10.1007/978-3-319-46484-8_48</pub-id> </citation>
</ref>
<ref id="B33">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Wu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Khan</surname>
<given-names>Z. U.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Zhu</surname>
<given-names>M.</given-names>
</name>
</person-group> (<year>2019</year>). &#x201c;<article-title>Deep Convolutional Neural Network with Segmentation Techniques for Chest X-ray Analysis</article-title>,&#x201d; in <conf-name>2019 14th IEEE Conference on Industrial Electronics and Applications (ICIEA)</conf-name>, <fpage>1212</fpage>&#x2013;<lpage>1216</lpage>. <pub-id pub-id-type="doi">10.1109/ICIEA.2019.8834117</pub-id> </citation>
</ref>
<ref id="B34">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Wang</surname>
<given-names>J.</given-names>
</name>
<name>
<surname>Fang</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>Lang</surname>
<given-names>N.</given-names>
</name>
<name>
<surname>Yuan</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Su</surname>
<given-names>M.-Y.</given-names>
</name>
<name>
<surname>Baldi</surname>
<given-names>P.</given-names>
</name>
</person-group> (<year>2017</year>). <article-title>A Multi-Resolution Approach for Spinal Metastasis Detection Using Deep Siamese Neural Networks</article-title>. <source>Comput. Biol. Med.</source> <volume>84</volume>, <fpage>137</fpage>&#x2013;<lpage>146</lpage>. <pub-id pub-id-type="doi">10.1016/j.compbiomed.2017.03.024</pub-id> </citation>
</ref>
<ref id="B35">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Xie</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Girshick</surname>
<given-names>R.</given-names>
</name>
<name>
<surname>Doll&#xe1;r</surname>
<given-names>P.</given-names>
</name>
<name>
<surname>Tu</surname>
<given-names>Z.</given-names>
</name>
<name>
<surname>He</surname>
<given-names>K.</given-names>
</name>
</person-group> (<year>2017</year>). &#x201c;<article-title>Aggregated Residual Transformations for Deep Neural Networks</article-title>,&#x201d; in <conf-name>Proceedings of the IEEE conference on computer vision and pattern recognition</conf-name>, <fpage>1492</fpage>&#x2013;<lpage>1500</lpage>. </citation>
</ref>
<ref id="B36">
<citation citation-type="journal">
<person-group person-group-type="author">
<name>
<surname>Zahia</surname>
<given-names>S.</given-names>
</name>
<name>
<surname>Sierra-Sosa</surname>
<given-names>D.</given-names>
</name>
<name>
<surname>Garcia-Zapirain</surname>
<given-names>B.</given-names>
</name>
<name>
<surname>Elmaghraby</surname>
<given-names>A.</given-names>
</name>
</person-group> (<year>2018</year>). <article-title>Tissue Classification and Segmentation of Pressure Injuries Using Convolutional Neural Networks</article-title>. <source>Comp. Methods Programs Biomed.</source> <volume>159</volume>, <fpage>51</fpage>&#x2013;<lpage>58</lpage>. <pub-id pub-id-type="doi">10.1016/j.cmpb.2018.02.018</pub-id> </citation>
</ref>
<ref id="B37">
<citation citation-type="confproc">
<person-group person-group-type="author">
<name>
<surname>Zhang</surname>
<given-names>C.</given-names>
</name>
<name>
<surname>Liu</surname>
<given-names>W.</given-names>
</name>
<name>
<surname>Ma</surname>
<given-names>H.</given-names>
</name>
<name>
<surname>Fu</surname>
<given-names>H.</given-names>
</name>
</person-group> (<year>2016</year>). &#x201c;<article-title>Siamese Neural Network Based Gait Recognition for Human Identification</article-title>,&#x201d; in <conf-name>2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</conf-name>
(<publisher-name>IEEE</publisher-name>), <fpage>2832</fpage>&#x2013;<lpage>2836</lpage>. </citation>
</ref>
</ref-list>
</back>
</article>
