The mid-level vision toolbox for computing structural properties of real-world images

Walther, Dirk B.; Farzanfar, Delaram; Han, Seohee; Rezanejad, Morteza

doi:10.3389/fcomp.2023.1140723

TECHNOLOGY AND CODE article

Front. Comput. Sci., 13 September 2023

Sec. Computer Vision

Volume 5 - 2023 | https://doi.org/10.3389/fcomp.2023.1140723

Department of Psychology, University of Toronto, Toronto, ON, Canada

Abstract

Mid-level vision is the intermediate visual processing stage for generating representations of shapes and partial geometries of objects. Our mechanistic understanding of these operations is limited, in part, by a lack of computational tools for analyzing image properties at these levels of representation. We introduce the Mid-Level Vision (MLV) Toolbox, open-source software that automatically processes low- and mid-level contour features and perceptual grouping cues from real-world images. The MLV Toolbox takes vectorized line drawings of scenes as input and extracts structural contour properties. We also include tools for contour detection and tracing for the automatic generation of vectorized line drawings from photographs. Various statistical properties of the contours are computed: the distributions of orientations, contour curvature, and contour lengths, as well as counts and types of contour junctions. The toolbox includes an efficient algorithm for computing the medial axis transform of contour drawings and photographs. Based on the medial axis transform, we compute several scores for local mirror symmetry, local parallelism, and local contour separation. All properties are summarized in histograms that can serve as input into statistical models to relate image properties to human behavioral measures, such as esthetic pleasure, memorability, affective processing, and scene categorization. In addition to measuring contour properties, we include functions for manipulating drawings by separating contours according to their statistical properties, randomly shifting contours, or rotating drawings behind a circular aperture. Finally, the MLV Toolbox offers visualization functions for contour orientations, lengths, curvature, junctions, and medial axis properties on computer-generated and artist-generated line drawings.
We include artist-generated vectorized drawings of the Toronto Scenes image set, the International Affective Picture System, and the Snodgrass and Vanderwart object images, as well as automatically traced vectorized drawings of a set of architectural scenes and the Open Affective Standardized Image Set (OASIS).

Introduction

Visual processing relies on different transformations of a visual representation derived from the pattern of light on the retina. In the early stages of visual processing, primary visual cortex (V1) encodes a representation of natural scene statistics based on contrast, orientation, and spatial frequencies (Hubel and Wiesel, 1962; Olshausen and Field, 1996; Vinje and Gallant, 2000). In later stages of visual processing, the semantic content of a visual scene is encoded in scene-selective regions based on category information (Epstein and Kanwisher, 1998; Epstein et al., 2001). Yet, despite a mechanistic understanding of these representations, we know less about the intervening stages of visual processing.

Mid-level vision is the intermediate visual processing stage for combining elementary features into conjunctive feature sets representing shapes and partial geometries of objects and scenes (Peirce, 2015; Malcolm et al., 2016). Along the ventral visual pathway, anatomical regions V2, V3, and V4 are the likely biological substrate supporting these operations, whose contributions to visual processing are far less understood (Peirce, 2015). Some evidence suggests that V2 is sensitive to border ownership, and V4 encodes curvature and symmetry information (Peterhans and von der Heydt, 1989; Gallant et al., 1996; Pasupathy and Connor, 2002; Wilder et al., 2022). Drawing from physiologically plausible representations of mid-level visual processing, we offer a set of computational tools for image processing to help fill this gap in our understanding of visual perception and help uncover intermediate stages of visual processing. Understanding mid-level vision allows us to investigate how discrete percepts are constructed and used to facilitate goal-driven behaviors.

Many mid-level vision operations are qualitatively explained by Gestalt psychology (Koffka, 1935). Gestalt grouping cues are principles of perceptual organization that explain how basic visual elements are organized into meaningful whole percepts; these principles are proximity, similarity, continuity, closure, and figure/ground (Wertheimer, 1922). Empirical studies of Gestalt grouping cues frequently use stylized lab stimuli, such as clouds of dots (e.g., Kubovy, 1994; Wagemans, 1997; Norcia et al., 2002; Sasaki, 2007; Bona et al., 2014, 2015), Gabor patches (e.g., Field et al., 1993; Machilsen et al., 2009), or simple shapes (e.g., Elder and Zucker, 1993; Wagemans, 1993; De Winter and Wagemans, 2006). Typically, these stimuli are constructed to contain a specific amount of symmetry, contour integration, parallelism, closure, etc. By comparison, little empirical work has tested Gestalt grouping principles in the perception of complex, real-world scenes (but see Geisler et al., 2001; Elder and Goldberg, 2002). More recent research in human and computer vision has extended the work of Wertheimer to physiologically plausible representations of shapes using the medial axis transform (Blum, 1967; Ayzenberg and Lourenco, 2019; Rezanejad et al., 2019, 2023; Ayzenberg et al., 2022).

Underlying medial axis-based representations of shape is an understanding of vision in terms of contours and shapes. Contours and shapes form the basis of early theories of vision, such as Marr’s 2½-D sketch (Marr and Nishihara, 1978; Marr, 1982) or the recognition-by-components model (Biederman, 1987), as well as practical applications to the recognition of three-dimensional objects (Lowe, 1987). Perceptual organization is recognized to play an important role in these systems (Feldman and Singh, 2005; Lowe, 2012; Pizlo et al., 2014) as well as in computer vision more generally (Desolneux et al., 2004, 2007; Michaelsen and Meidow, 2019).

We here provide a software toolbox¹ for the study of mid-level vision using naturalistic images. This toolbox opens an avenue for testing the role of mid-level features in visual perception by measuring low- and mid-level image properties from contour drawings and real-world photographs. Our measurement techniques are rooted in biologically inspired computations for detecting geometric relationships between contours. Working on contour geometry has the clear advantage of yielding tractable, mechanistic algorithms for understanding mid-level vision. However, it has the disadvantage that contours are not directly available in image pixels. We overcome this difficulty by offering algorithms that detect contours in color photographs and trace them to arrive at vectorized representations.

Contour extraction

Most functions in the MLV Toolbox rely on vectorized contour drawings. These drawings can be generated by humans tracing the important contours in photographs, by importing existing vector graphics from an SVG file with the importSVG function (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/importSVG.html), or by automatically detecting edges in the photographs and tracing the contours in the extracted edge maps.

Edge detection is performed using a structured forest method known as the Dollár edge detector (Dollár and Zitnick, 2013, 2014). We here use the publicly available Structured Edge Detection Toolbox V3.0.² This computationally efficient edge detector achieves excellent accuracy by predicting local edge masks in a structured learning framework applied to random decision forests. Because the Structured Edge Detection Toolbox is written in Matlab, like the MLV Toolbox, it was a natural choice of edge detector for our toolbox. Using image-specific adaptive thresholding, we binarize the edge map according to its edge strength. The binarized edge map is then morphologically thinned to create one-pixel-wide contour segments.

Our method for tracing contours is adapted from the Edge Linking and Line Segment Fitting code sections of Peter Kovesi’s Matlab and Octave Functions for Computer Vision and Image Processing.³ These edge-linking functions take a binary edge image and create lists of connected edge points. Additional helper functions fill in small gaps in a given binary edge map image (edgelink). They also fit straight line segments to the lists of edge points, breaking a list wherever it deviates from a straight line by more than a given tolerance value (lineseg).

Functions: , https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/traceLinedrawingFromEdgeMap.html

The definition of the beginning and end of contours depends on the method of generation. We provide two data sets for which the contours were drawn by trained artists using a graphics tablet. For these data sets, a contour begins when the artist puts the pen on the graphics tablet and ends when they lift the pen up. For automatically traced contours, the beginning and end are defined by the beginning and end of lists of connected edge points.

Both methods result in vectorized line drawings that are represented as a set of contours (Figure 1). Each contour consists of a list of successive, connected straight line segments. The information is contained in a vecLD struct with the following fields:

originalImage:	the file name of the original photograph.
imsize:	[width, height].
lineMethod:	a descriptive string indicating how the line drawing was generated, e.g., ‘artist’, ‘importSVG’, ‘traceLineDrawingFromRGB’.
numContours:	the number of contours.
contours:	cell array of size (1, numContours) containing the individual contour information. Each cell is an N x 4 array. Each row of the array represents one contour line segment. The columns are the start and end coordinates of the line segments in the order: X1, Y1, X2, Y2. Note that the end point of one segment is the start point of the next segment. This way of storing the coordinates is somewhat redundant, but it allows for greater efficiency for plotting, processing, and splitting contours.
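To make this layout concrete, here is a minimal Python sketch of the core vecLD fields. It is not the toolbox's Matlab code. The class name VecLD and the helpers contour_from_points and is_connected are hypothetical. Unlike the Matlab struct, this sketch derives numContours from the contour list rather than storing it. The connectivity check tests the redundancy described above: each segment should start where the previous one ends.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# One straight segment in the toolbox's column order: (X1, Y1, X2, Y2).
Segment = Tuple[float, float, float, float]


@dataclass
class VecLD:
    """Hypothetical Python stand-in for the core fields of the Matlab vecLD struct."""
    originalImage: str
    imsize: Tuple[int, int]  # (width, height)
    lineMethod: str
    contours: List[List[Segment]] = field(default_factory=list)

    @property
    def numContours(self) -> int:
        # Derived rather than stored, so it can never disagree with `contours`.
        return len(self.contours)


def contour_from_points(points: List[Tuple[float, float]]) -> List[Segment]:
    """Convert a polyline (list of points) into the redundant N x 4 segment layout."""
    return [(x1, y1, x2, y2) for (x1, y1), (x2, y2) in zip(points, points[1:])]


def is_connected(contour: List[Segment], tol: float = 1e-9) -> bool:
    """True if every segment starts where the previous segment ends."""
    return all(
        abs(prev[2] - nxt[0]) <= tol and abs(prev[3] - nxt[1]) <= tol
        for prev, nxt in zip(contour, contour[1:])
    )
```

For example, contour_from_points([(10, 10), (50, 10), (50, 40)]) yields the two segments (10, 10, 50, 10) and (50, 10, 50, 40), which share the point (50, 10).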

Figure 1

More fields are added to the struct as image properties are computed. Vectorized line drawings can be plotted into a figure window using the drawLinedrawing function (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/drawLinedrawing.html). They can be rendered into a binary image using the renderLinedrawing function (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/renderLinedrawing.html).

Measuring contour properties

Analysis of properties of individual contours and contour segments follows the intuitive definitions outlined below.

The orientation of each contour segment is computed from the direction of the vector pointing from its start point (X1, Y1) to its end point (X2, Y2). Orientations are measured in degrees in the counterclockwise direction, starting from 0° at horizontal all the way to 360° back at horizontal. Orientations are stored in a direction-specific way so that 180° is not considered the same as 0°. This coding is important for computing curvature and junction angles correctly.

When computing histograms of orientation, however, orientation angles are computed modulo 180 degrees. Orientation histograms are weighted by the number of pixels (length) of each line segment. By default, histograms have eight bins with centers at 0, 22.5, 45, 67.5, 90, 112.5, 135, and 157.5 degrees.

Functions: , https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/getOrientationStats.html
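The orientation coding and the length-weighted histogram can be sketched in Python as follows. This illustrates the description above under stated assumptions; it is not the code of getOrientationStats. We assume image coordinates with y growing downward, so the y difference is negated to make angles increase counterclockwise on screen. We also center the first bin on 0° so that it wraps around 0°/180°. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, float, float]  # (X1, Y1, X2, Y2)


def segment_orientation(seg: Segment) -> float:
    """Directed orientation in degrees, in [0, 360), counterclockwise from horizontal.

    Assumption: y grows downward (image coordinates), so the y difference is
    negated to make angles counterclockwise as seen on screen.
    """
    x1, y1, x2, y2 = seg
    return math.degrees(math.atan2(-(y2 - y1), x2 - x1)) % 360.0


def orientation_histogram(segments: Sequence[Segment], num_bins: int = 8) -> List[float]:
    """Length-weighted histogram of orientation modulo 180 degrees.

    Bin k is centered at k * 180 / num_bins, so the first bin wraps around 0/180.
    """
    width = 180.0 / num_bins
    hist = [0.0] * num_bins
    for seg in segments:
        theta = segment_orientation(seg) % 180.0
        k = int(((theta + width / 2.0) % 180.0) // width)
        x1, y1, x2, y2 = seg
        hist[k] += math.hypot(x2 - x1, y2 - y1)  # weight by segment length in pixels
    return hist
```

Because orientations are reduced modulo 180° only when binning, the directed value returned by segment_orientation remains available for curvature and junction computations.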

The length of a contour segment is computed as the Euclidean distance between its start and end points:

$$ l = \sqrt{(X_2 - X_1)^2 + (Y_2 - Y_1)^2} $$

The length of an entire contour is the sum of the lengths of its individual segments. Length histograms are computed with bins equally spaced on a logarithmic scale within the bounds of 2 pixels and (width + height). For instance, an eight-bin histogram (the default) for images of size 800 × 600 pixels has bin centers located at 3.4, 8.5, 19.5, 43.2, 94.2, 204.2, 441.5, and 953.1 pixels.

Functions: , https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/getLengthStats.html
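A sketch of the contour length and the default length bins follows. The text does not spell out the exact logarithmic spacing. We assume bins equally spaced in log(1 + length), chosen because this reproduces the bin centers listed above for an 800 × 600 image to within rounding. The function names are hypothetical; this is not the getLengthStats code.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, float, float]  # (X1, Y1, X2, Y2)


def contour_length(contour: Sequence[Segment]) -> float:
    """Sum of the Euclidean lengths of a contour's straight segments."""
    return sum(math.hypot(x2 - x1, y2 - y1) for x1, y1, x2, y2 in contour)


def log1p_bin_centers(lo: float, hi: float, num_bins: int = 8) -> List[float]:
    """Bin centers equally spaced in log(1 + x) between lo and hi (assumed spacing)."""
    a, b = math.log1p(lo), math.log1p(hi)
    step = (b - a) / num_bins
    return [math.expm1(a + (k + 0.5) * step) for k in range(num_bins)]


def length_bin_centers(width: int, height: int, num_bins: int = 8) -> List[float]:
    """Length bins between 2 pixels and (width + height)."""
    return log1p_bin_centers(2.0, float(width + height), num_bins)
```

Under this assumption, length_bin_centers(800, 600) returns approximately 3.4, 8.5, 19.5, 43.2, 94.2, 204.2, 441.5, and 953.1.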

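Mathematically, curvature is defined as the change in angle per unit length. In calculus, it is the derivative of the tangent angle with respect to arc length. For a curve parameterized by arc length, this equals the magnitude of the second derivative of position. For the piecewise straight line segments in our implementation, we compute the curvature of each line segment i as the change in orientation from this segment to the next one, divided by the length of the segment:

$$ \kappa_i = \frac{\Delta\theta_i}{l_i} $$

where $\Delta\theta_i$ is the change in orientation between segments i and i+1 (between 0° and 180°) and $l_i$ is the length of segment i.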

For the last segment of a contour, we use the angle difference with the previous instead of the next segment. Histograms of contour curvature are computed with bins equally spaced on a logarithmic scale between 0 degrees / pixel (straight line; no curvature) and 90 degrees / pixel (a minimal line of 2 pixels length making a sharp 180-degree turn). For a default eight-bin histogram, bins are centered at 0.33, 1.33, 3.09, 6.20, 11.65, 21.23, 38.06, and 67.64 degrees per pixel.

Functions: , https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/getCurvatureStats.html
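The curvature rule can be sketched per contour as shown below. This is hypothetical Python rather than getCurvatureStats, and several choices in it are our own assumptions:

- Orientations use a y-down, counterclockwise convention.
- A contour consisting of a single segment is assigned zero curvature, a case the text does not cover.
- All segments are assumed to have nonzero length.

A logarithmic scale cannot start at 0, so we assume bins equally spaced in log(1 + curvature) between 0 and 90 degrees per pixel. This reproduces the default bin centers listed above to within rounding.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, float, float]  # (X1, Y1, X2, Y2)


def _orientation(seg: Segment) -> float:
    """Directed orientation in degrees (y-down image coordinates assumed)."""
    x1, y1, x2, y2 = seg
    return math.degrees(math.atan2(-(y2 - y1), x2 - x1)) % 360.0


def _turn(a: float, b: float) -> float:
    """Unsigned change between two directed orientations, in [0, 180] degrees."""
    d = abs(b - a) % 360.0
    return 360.0 - d if d > 180.0 else d


def segment_curvatures(contour: Sequence[Segment]) -> List[float]:
    """Curvature in degrees per pixel for each segment (segments of nonzero length)."""
    n = len(contour)
    if n < 2:
        return [0.0] * n  # assumption: a lone straight segment has zero curvature
    thetas = [_orientation(s) for s in contour]
    curv = []
    for i, (x1, y1, x2, y2) in enumerate(contour):
        j = i + 1 if i + 1 < n else i - 1  # the last segment uses the previous one
        curv.append(_turn(thetas[i], thetas[j]) / math.hypot(x2 - x1, y2 - y1))
    return curv


def curvature_bin_centers(num_bins: int = 8, max_curv: float = 90.0) -> List[float]:
    """Bin centers equally spaced in log(1 + curvature) over [0, max_curv]."""
    step = math.log1p(max_curv) / num_bins
    return [math.expm1((k + 0.5) * step) for k in range(num_bins)]
```

As a check on the upper bound, a 2-pixel segment followed by a complete reversal gives 180° / 2 px = 90 degrees per pixel.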

Contour junctions are detected at the intersections between contours. The intersections are computed algebraically from the coordinates of all contour segments. Intersections of contour segments within a contour are not considered. Junctions are still detected when contours do not meet exactly. Junction detection is controlled by two parameters, a relative epsilon (RE) and an absolute epsilon (AE). The relative epsilon controls the allowable gap between the end point of a contour segment and the hypothetical junction location as a fraction of the length of the contour segment. The absolute epsilon determines the maximum allowable gap in pixels. The smaller of the two gap measures is used as the threshold for detecting junctions. Hand tuning resulted in values of AE = 1 pixel and RE = 0.3. These two parameters can be set as optional arguments of the detectJunctions function (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/detectJunctions.html).
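The gap-tolerant intersection test can be illustrated for a single pair of segments. The sketch intersects the two supporting lines. It accepts the intersection only if each segment falls short of it by no more than min(AE, RE × segment length) pixels. It is a hypothetical Python reconstruction from the description above, not the detectJunctions code. It handles one pair of segments from different contours and does not sever segments or merge nearby intersections.

```python
import math
from typing import Optional, Tuple

Segment = Tuple[float, float, float, float]  # (X1, Y1, X2, Y2)


def _gap(t: float, length: float) -> float:
    """Distance (pixels) from a segment to the point at parameter t on its line."""
    if t < 0.0:
        return -t * length
    if t > 1.0:
        return (t - 1.0) * length
    return 0.0


def junction_point(s1: Segment, s2: Segment, ae: float = 1.0, re: float = 0.3
                   ) -> Optional[Tuple[float, float]]:
    """Intersection of two segments, tolerating small gaps at their ends."""
    px, py, qx, qy = s1
    ux, uy, vx, vy = s2
    rx, ry = qx - px, qy - py          # direction of s1
    sx, sy = vx - ux, vy - uy          # direction of s2
    denom = rx * sy - ry * sx          # 2D cross product r x s
    if abs(denom) < 1e-12:
        return None                    # parallel or collinear: no junction
    wx, wy = ux - px, uy - py
    t = (wx * sy - wy * sx) / denom    # position of the intersection along s1
    u = (wx * ry - wy * rx) / denom    # position of the intersection along s2
    len1, len2 = math.hypot(rx, ry), math.hypot(sx, sy)
    # Each segment may miss the intersection by at most min(AE, RE * its length).
    if _gap(t, len1) > min(ae, re * len1) or _gap(u, len2) > min(ae, re * len2):
        return None
    return (px + t * rx, py + t * ry)
```

With the default parameters, a vertical segment that stops 0.5 pixels short of a horizontal one still forms a junction. If it stops 2 pixels short, it does not.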

Contour segments participating in junctions are algorithmically severed into two new segments at the junction locations whenever junctions are detected far enough away from the end points. The angles between all segments participating in a junction are measured as the difference in (directed) orientation angle between adjacent segments. Inspired by previous literature on contour junctions (Biederman, 1987), junction types are classified according to how many contour segments participate in the junction and according to their angles as follows:

3 segments:	determine the maximum angle between any two segments.
	Y junctions:
	T junctions:, i.e., .
	Arrow junctions:.
4 segments:	X junctions
>4 segments:	Star junctions

Junctions between two contour segments are sometimes described as L junctions. Here, we do not consider L junctions as they would occur at every location where one contour segment ends and the next begins, making them too numerous to be useful.
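Given the directed orientations of the segments meeting at a junction, each pointing away from the junction, the classification can be sketched as below. The sketch does not reproduce the toolbox's exact angle thresholds. Instead it uses the conventional reading:

- Y junction: the largest angle between adjacent segments is below 180°.
- T junction: the largest angle is close to 180°, within a hypothetical tolerance t_tol of 10°.
- Arrow junction: the largest angle is above 180°.

Treat both the thresholds and the function names as assumptions.

```python
from typing import List, Sequence


def adjacent_angles(orientations: Sequence[float]) -> List[float]:
    """Angles (degrees) between angularly adjacent segments around one junction.

    `orientations` are directed orientations (0-360) of the segments, each
    pointing away from the junction; the returned angles sum to 360.
    """
    th = sorted(o % 360.0 for o in orientations)
    return [(th[(i + 1) % len(th)] - th[i]) % 360.0 for i in range(len(th))]


def junction_type(orientations: Sequence[float], t_tol: float = 10.0) -> str:
    """Classify a junction by its number of segments and, for three, its largest angle.

    t_tol is a hypothetical tolerance (degrees) around 180 for T junctions.
    """
    n = len(orientations)
    if n < 3:
        return "none"  # two-segment (L) junctions are not counted
    if n == 4:
        return "X"
    if n > 4:
        return "Star"
    largest = max(adjacent_angles(orientations))
    if abs(largest - 180.0) <= t_tol:
        return "T"
    return "Y" if largest < 180.0 else "Arrow"
```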

Integer counts of the number of junctions of each type are collected in junction histograms, which can optionally be normalized by the total number of pixels in a vectorized line drawing. The minimum angles between any of the contour segments at a junction, which are bounded between 0 and 120 degrees, are also counted as “Junction Angles” in a histogram with bin centers at 7.5, 22.5, 37.5, 52.5, 67.5, 82.5, 97.5, and 112.5 degrees.

Functions: , https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/getJunctionStats.html, https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/detectJunctions.html
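The junction angle histogram can be sketched as follows, assuming each junction is given by the directed orientations of its segments pointing away from the junction. The minimum angle between angularly adjacent segments is binned into eight 15° bins over [0°, 120°]. The function and constant names are hypothetical; this is not the getJunctionStats code.

```python
from typing import Iterable, List, Sequence

# Default bin centers: 7.5, 22.5, ..., 112.5 degrees.
JUNCTION_ANGLE_BIN_CENTERS = [(k + 0.5) * 15.0 for k in range(8)]


def min_junction_angle(orientations: Sequence[float]) -> float:
    """Smallest angle (degrees) between angularly adjacent segments at a junction."""
    th = sorted(o % 360.0 for o in orientations)
    return min((th[(i + 1) % len(th)] - th[i]) % 360.0 for i in range(len(th)))


def junction_angle_histogram(junctions: Iterable[Sequence[float]],
                             num_bins: int = 8, max_angle: float = 120.0) -> List[int]:
    """Count minimum junction angles in equal-width bins over [0, max_angle]."""
    width = max_angle / num_bins
    hist = [0] * num_bins
    for orientations in junctions:
        k = min(int(min_junction_angle(orientations) // width), num_bins - 1)
        hist[k] += 1  # an angle of exactly max_angle lands in the last bin
    return hist
```

A symmetric Y junction (three 120° angles) falls into the last bin, which explains the upper bound of 120 degrees.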

For each contour property, the following fields are added to the vecLD struct:

property:	(1, numContours) cell array; each cell contains a vector of properties for the corresponding contour segments
propertyHistograms:	(numContours, numBins) array with the property histograms for the individual contours
normPropertyHistograms:	the same but normalized by the total number of contour pixels
sumPropertyHistogram:	(1,numBins) array: the property histogram for the entire image – the sum of propertyHistograms
normSumPropertyHistogram:	the same but normalized by the total number of contour pixels
propertyBins:	(1,numBins) array: the centers of the histogram bins
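The relationship between these fields can be sketched in a few lines of Python. The function summarize_histograms is hypothetical. It takes the per-contour histograms and a total contour pixel count and derives the normalized and summed fields listed above. We assume pixels are counted as the summed contour length, which is our own reading of how the toolbox counts them.

```python
from typing import Dict, List


def summarize_histograms(contour_hists: List[List[float]], total_pixels: float
                         ) -> Dict[str, List]:
    """Derive image-level histogram fields from per-contour histograms.

    Keys mirror the field list for a generic `property`; `total_pixels` is the
    total number of contour pixels (assumed: summed contour length).
    """
    num_bins = len(contour_hists[0]) if contour_hists else 0
    sum_hist = [sum(h[k] for h in contour_hists) for k in range(num_bins)]
    return {
        "propertyHistograms": contour_hists,
        "normPropertyHistograms": [[v / total_pixels for v in h] for h in contour_hists],
        "sumPropertyHistogram": sum_hist,
        "normSumPropertyHistogram": [v / total_pixels for v in sum_hist],
    }
```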

To visualize the contour properties, use the drawLinedrawingProperty function (Figure 2; https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/drawLinedrawingProperty.html). The first argument to the function is a vecLD struct, the second a string denoting the contour property. The function draws the color-coded line drawing into the current figure window. Use drawAllProperties (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/drawAllProperties.html) to visualize all contour properties of a given vecLD struct, either in subplots or in separate figure windows. The renderLinedrawingProperty function (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/renderLinedrawingProperty.html) draws the contour properties into an image instead of a figure.

Figure 2

We have used these contour properties previously to explain behavior (Walther and Shen, 2014; Wilder et al., 2018) and neural mechanisms of scene categorization (Choo and Walther, 2016). We have also related statistics of these properties to the emotional content of scenes and generated artificial images with specific property combinations to elicit emotional responses (Damiano et al., 2021a), as well as to esthetic pleasure (Farzanfar and Walther, 2023).

Medial axis-based properties

Blum (1967) was probably the first to introduce medial axis-based representations and the method for producing them using a grassfire analogy. Imagine a shape cut out of a piece of paper set on fire around its border, where the fire front moves toward the center of the shape at a constant pace. Skeletal points are formed at the locations where the fire fronts collide. In other words, we can look at the Medial Axis Transform (MAT) as a method for applying the grassfire process to disclose its quench sites and associated radius values (Feldman and Singh, 2006). The MAT provides a complete visual representation of shapes, as it is applicable to all bounded shapes as well as the areas outside of closed shapes. Humans have been shown to rely on the shape skeleton defined by the MAT when they attend to objects (Firestone and Scholl, 2014), represent shapes in memory (Ayzenberg and Lourenco, 2019), or judge the esthetic appeal of shapes (Sun and Firestone, 2022).

In this toolbox, we compute measures of the relationships between contours using the medial axis transform. Visually, the medial axis transform is made up of a number of branches that come together at junction points to create a shape skeleton. A skeletal branch is a group of contiguous regular skeletal points located between two junction points, two end points, or an end point and a junction point. Rezanejad (2020) demonstrated how to identify skeletal points and classify them into these three types (regular, end, and junction points). The method tracks the average outward flux (AOF) of the gradient of the Euclidean distance function to the boundary of a 2D object through a shrinking disk. We go over this calculation in the following.

Imagine that the distance transform of a shape Ω is a signed distance function $D(\mathbf{p})$ that gives the distance from a point $\mathbf{p}$ to the closest point on the shape’s boundary (Figure 3A). Formally, the distance value is positive when $\mathbf{p}$ is inside the shape Ω and negative when $\mathbf{p}$ is outside of it. We can then define the distance function gradient vector $\nabla D(\mathbf{p})$ as the unit vector along the line connecting $\mathbf{p}$ to its closest boundary point. One way to identify skeletal points is to examine this gradient vector, which is multivalued at skeletal points. To do so, we use a measure called the average outward flux (AOF). To compute the AOF, we compute the outward flux of $\nabla D$ through shrinking circular neighborhoods all over the image:

$$ \mathrm{AOF}(\mathbf{p}) = \lim_{\epsilon \to 0} \frac{\oint_{\partial R_\epsilon(\mathbf{p})} \langle \nabla D, \mathbf{N} \rangle \, ds}{\left| \partial R_\epsilon(\mathbf{p}) \right|} $$

where $\partial R_\epsilon(\mathbf{p})$ is the boundary of the shrinking circular neighborhood of radius $\epsilon$ around $\mathbf{p}$ and $\mathbf{N}$ is the outward normal to that boundary (Figure 3B). By analyzing the behavior of the AOF, we can classify each point as a medial axis or non-medial axis point. In particular, any point that is not on the skeleton has a limiting AOF value of zero, so the medial axis is the set of points whose AOF values are non-zero (Figure 3C). The AOF value has a sinusoidal relationship with the object angle (the mid-angle between the spoke vectors that connect a skeletal point to its closest boundary points). In the discrete space, we threshold the AOF at a particular value, which means that we keep skeletal points whose object angles exceed a given threshold. The object angle can be provided as an optional parameter for the computeMAT function (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/computeMAT.html), with a default value of 28 degrees (Rezanejad, 2020). We compute three local relational properties based on the medial axis transform.
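The AOF can be checked numerically on a shape whose distance function is known in closed form. The sketch below is our own illustration, not the computeMAT implementation. It averages the outward flux of a unit vector field over n points on a small circle. For the strip |y| < 1, the vector pointing from a point toward its closest boundary is (0, ±1), so the AOF is about 2/π on the medial axis y = 0 and zero elsewhere. Pointing the vector toward the boundary makes the AOF positive on the axis; with the opposite orientation of the gradient, only the sign flips. To turn the default object angle of 28° into an AOF threshold, we assume the relationship |AOF| = 2 sin(α)/π between the limiting AOF and the object angle α, a known result from the flux-based skeletonization literature.

```python
import math
from typing import Callable, Tuple

Vec = Tuple[float, float]


def average_outward_flux(p: Vec, field: Callable[[float, float], Vec],
                         radius: float = 0.05, n: int = 3600) -> float:
    """Discrete AOF: mean of <field, N> over n points on a circle around p.

    N is the outward unit normal of the circle; a small fixed radius stands in
    for the limit of a shrinking neighborhood.
    """
    total = 0.0
    for k in range(n):
        phi = 2.0 * math.pi * k / n
        nx, ny = math.cos(phi), math.sin(phi)
        fx, fy = field(p[0] + radius * nx, p[1] + radius * ny)
        total += fx * nx + fy * ny
    return total / n


def strip_field(x: float, y: float) -> Vec:
    """Unit vector from (x, y) toward the closest boundary of the strip |y| < 1."""
    if y > 0:
        return (0.0, 1.0)
    if y < 0:
        return (0.0, -1.0)
    return (0.0, 0.0)  # on the axis itself the direction is not unique


def object_angle_threshold(object_angle_deg: float = 28.0) -> float:
    """|AOF| threshold for an object angle, assuming |AOF| = 2 sin(alpha) / pi."""
    return 2.0 * math.sin(math.radians(object_angle_deg)) / math.pi
```

A point on the axis, such as (0, 0), passes the default threshold of about 0.30. An off-axis point, such as (0, 0.5), yields an AOF of essentially zero.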

Figure 3

Parallelism is computed as the local ribbon symmetry by computing the first derivative of the radial distance function along the medial axis. A small first derivative (small change) indicates that the contours on either side of the medial axis are locally parallel to the medial axis and, thereby, to each other (Figure 3D).

Separation is computed as the absolute value of the radial distance function. Separation is related to the Gestalt grouping rule of proximity (Figure 3E).

Mirror symmetry is generally understood to be the symmetry generated by reflection over a straight axis. Reflections over bent axes are not perceived as mirror symmetric. We therefore compute the curvature of the medial axis as a measure of local mirror symmetry: the straighter the medial axis, the stronger the local mirror symmetry (Figure 3F).
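The three quantities can be sketched for a single skeletal branch sampled as (x, y, radius) triples. The Python below is a hypothetical illustration, not computeMATproperty. It uses central differences and reports raw values:

- Separation: the radius.
- Parallelism: |dr/ds|, where small values indicate parallel contours.
- Mirror symmetry: the turning angle of the axis per unit length, in radians per pixel, where small values indicate strong local mirror symmetry.

It does not reproduce the toolbox's projection onto the contours or its normalization to [0,1].

```python
import math
from typing import List, Sequence, Tuple

AxisPoint = Tuple[float, float, float]  # (x, y, radius) along one skeletal branch


def branch_properties(branch: Sequence[AxisPoint]) -> List[Tuple[float, float, float]]:
    """Raw (separation, |dr/ds|, axis curvature) at each interior branch point.

    separation     = radius r          (proximity of the two flanking contours)
    |dr/ds|        = radius slope      (small -> locally parallel contours)
    axis curvature = turn angle / ds   (small -> straight axis, mirror symmetry)
    """
    out = []
    for i in range(1, len(branch) - 1):
        (x0, y0, r0), (x1, y1, r1), (x2, y2, r2) = branch[i - 1], branch[i], branch[i + 1]
        ds = math.hypot(x2 - x0, y2 - y0) / 2.0   # approximate spacing around point i
        slope = abs(r2 - r0) / (2.0 * ds)         # central difference of the radius
        a1 = math.atan2(y1 - y0, x1 - x0)
        a2 = math.atan2(y2 - y1, x2 - x1)
        turn = abs((a2 - a1 + math.pi) % (2.0 * math.pi) - math.pi)  # radians, in [0, pi]
        out.append((r1, slope, turn / ds))
    return out
```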

These properties are initially computed on the medial axis, then projected back onto the contours of the line drawing and normalized to [0,1]. Note that not all contour pixels may receive a valid MAT property, since the projection from the medial axis back to the contour pixels is not a surjective function. As mentioned above, we apply a small threshold on the AOF values in discrete pixel space to compute a medial axis that is thin and smooth and does not cover the entire interior area of the shape. In the analytical formulation, the medial axis is a one-to-many mapping from skeletal points to the boundary that allows all boundary points to be reconstructed. In the discrete space, however, we may lose a small portion of the boundary points that are not associated with any skeletal point.

Functions: , https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/computeMATproperty.html, https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/mapMATtoContour.html, https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/computeAllMATproperties.html

To compute statistics over the MAT properties along the contours, the MAT properties are mapped onto the contours of the vectorized line drawing. Histograms with equally spaced bins (default: 8 bins) covering the interval [0,1] are computed over all contour pixels with valid MAT properties.

Functions: , https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/getMATpropertyStats.html, https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/computeAllMATfromVecLD.html
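Binning the valid per-pixel scores can be sketched as follows. The function name is hypothetical. We assume that pixels without a valid MAT property are represented as None or NaN and skipped.

```python
import math
from typing import List, Optional, Sequence


def mat_property_histogram(scores: Sequence[Optional[float]], num_bins: int = 8
                           ) -> List[int]:
    """Count contour-pixel MAT scores in equal-width bins covering [0, 1].

    Pixels without a valid score (None or NaN) are skipped; a score of exactly
    1.0 falls into the last bin.
    """
    hist = [0] * num_bins
    for s in scores:
        if s is None or math.isnan(s):
            continue
        k = min(int(s * num_bins), num_bins - 1)
        hist[k] += 1
    return hist
```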

Similar to the contour properties, the following fields are added to the vecLD struct for each MAT property:

property:	(1, numContours) cell array; each cell contains a vector of properties for the corresponding contour segments
propertyMeans:	(1, numContours) array with the means of property over each contour
property_allX:	x coordinates of all contour pixels with a property score
property_allY:	y coordinates of all contour pixels with a property score
property_allScores:	The property scores for all contour pixels
propertyHistograms:	(numContours, numBins) array with the property histograms for the individual contours
propertyNormHistograms:	the same but normalized by the total number of contour pixels
propertySumHistogram:	(1, numBins) array: the property histogram for the entire image – the sum of propertyHistograms
propertyNormSumHistogram:	the same but normalized by the total number of contour pixels
propertyBins:	(1, numBins) array: the centers of the histogram bins

MAT properties are easily visualized in a figure using drawMATproperty (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/drawMATproperty.html) or drawn into an image using renderMATproperty (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/renderMATproperty.html).

The histograms for all image properties can be written into a table for a set of images for further statistical analysis. The allLDHistogramsToTable function (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/allLDHistogramsToTable.html) generates such a table, which can then be used with Matlab’s statistical functions or be written to a CSV file for further processing in R or other analysis software.
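For readers working outside Matlab, the same wide table layout (one row per image, one column per property bin) can be sketched with Python's standard csv module. The function name and the column naming scheme (such as length_1) are our own assumptions. They need not match the output of allLDHistogramsToTable.

```python
import csv
import io
from typing import Dict, List


def histograms_to_csv(rows: Dict[str, Dict[str, List[float]]]) -> str:
    """One row per image, one column per (property, bin).

    Assumes a non-empty input in which every image has the same properties with
    the same number of bins; column names like 'length_1' are hypothetical.
    """
    first = next(iter(rows.values()))
    props = sorted(first)
    header = ["image"] + [f"{p}_{k + 1}" for p in props for k in range(len(first[p]))]
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    for image, hists in rows.items():
        writer.writerow([image] + [v for p in props for v in hists[p]])
    return buf.getvalue()
```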

We have used these functions to investigate the role of MAT-based features in computer vision (Rezanejad et al., 2019, 2023), eye movements (Damiano et al., 2019), and neural representations of symmetry (Wilder et al., 2022), as well as in esthetic appeal (Damiano et al., 2021b; Farzanfar and Walther, 2023) and image memorability (Han et al., 2023).

Splitting functions

Splitting the contours in a line drawing into two halves based on some statistical property allows for the empirical testing of the causal involvement of that property in some perceptual or cognitive function. We provide functions for separating contours into two drawings by different criteria:

: allows for the splitting according to one image property or a combination of image properties. The function also contains an option to generate a random split of the contours. We have used this function to split images by their MAT properties for behavioral and fMRI experiments as well as for computer vision analyses (Rezanejad et al., 2019, 2023; Wilder et al., 2022).

: allows for splitting the contours according to more fine-grained weights for the individual feature histograms.

: splits the contours by the output of a statistical model, trained with the contour and MAT properties as its features. This function has been used to split contours according to predicted esthetic appeal (Farzanfar and Walther, 2023) or according to predicted memorability (Han et al., 2023).

: Splits the contours into pieces near the contour junctions and middle segments. This method was used to investigate the role of junctions in scene categorization (Wilder et al., 2018).

For an input vecLD struct, these functions generate two new, disjoint vecLD structs, each with approximately half of the pixels (Figure 4). Contours that cannot be uniquely assigned to one or the other drawing are omitted from both.
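One simple way to split by a single property can be sketched as follows. This is a hypothetical ranking rule, not the toolbox's splitting algorithm. Contours are sorted by a per-contour score, for example a mean MAT property. High-scoring contours are assigned to the first drawing until it holds about half of the scored contour length. As a stand-in for contours that cannot be uniquely assigned, contours without a score are left out of both drawings.

```python
from typing import List, Optional, Sequence, Tuple


def split_by_score(lengths: Sequence[float], scores: Sequence[Optional[float]]
                   ) -> Tuple[List[int], List[int]]:
    """Split contour indices into a high-score half and a low-score half.

    Hypothetical rule: rank scored contours from high to low and fill the first
    drawing until it holds about half of the scored contour length. Contours
    with a score of None are omitted from both drawings.
    """
    scored = sorted((i for i, s in enumerate(scores) if s is not None),
                    key=lambda i: scores[i], reverse=True)
    half = sum(lengths[i] for i in scored) / 2.0
    high: List[int] = []
    low: List[int] = []
    acc = 0.0
    for i in scored:
        if acc < half:
            high.append(i)
            acc += lengths[i]
        else:
            low.append(i)
    return high, low
```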

Figure 4

Other image transformations

We include some other specific manipulations of line drawings that have proven useful in previous studies (Figure 5). The rotateLinedrawing function (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/rotateLinedrawing.html) rotates the coordinates of all contours in the input vecLD struct by a given angle. The applyCircularAperture function (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/applyCircularAperture.html) clips the contours to a circular aperture. In combination, these functions can be used to generate randomly rotated line drawings (Choo and Walther, 2016).
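The two geometric operations can be sketched on single segments. This is hypothetical Python, not the toolbox's functions. Rotation is about a given center, for example the image center. Because image y coordinates grow downward, a positive angle appears clockwise on screen. Clipping keeps the part of a segment that lies inside the circle. The sketch does not handle splitting a contour into separate pieces when it leaves and re-enters the aperture.

```python
import math
from typing import List, Tuple

Segment = Tuple[float, float, float, float]  # (X1, Y1, X2, Y2)


def rotate_segment(seg: Segment, angle_deg: float, cx: float, cy: float) -> Segment:
    """Rotate both end points of a segment about (cx, cy) by angle_deg."""
    c, s = math.cos(math.radians(angle_deg)), math.sin(math.radians(angle_deg))

    def rot(x: float, y: float) -> Tuple[float, float]:
        dx, dy = x - cx, y - cy
        return cx + c * dx - s * dy, cy + s * dx + c * dy

    x1, y1, x2, y2 = seg
    return (*rot(x1, y1), *rot(x2, y2))


def clip_to_circle(seg: Segment, cx: float, cy: float, r: float) -> List[Segment]:
    """Return the part of a segment inside the circle (an empty list or one segment)."""
    x1, y1, x2, y2 = seg
    dx, dy = x2 - x1, y2 - y1
    fx, fy = x1 - cx, y1 - cy
    a = dx * dx + dy * dy
    if a == 0.0:  # degenerate segment: keep it only if its point is inside
        return [seg] if fx * fx + fy * fy <= r * r else []
    # Solve |P1 + t (P2 - P1) - C|^2 = r^2 for t, then clamp to the segment [0, 1].
    b = 2.0 * (fx * dx + fy * dy)
    c = fx * fx + fy * fy - r * r
    disc = b * b - 4.0 * a * c
    if disc <= 0.0:
        return []  # the line misses or only touches the circle
    root = math.sqrt(disc)
    t0 = max(0.0, (-b - root) / (2.0 * a))
    t1 = min(1.0, (-b + root) / (2.0 * a))
    if t0 >= t1:
        return []  # the intersection interval lies outside the segment
    return [(x1 + t0 * dx, y1 + t0 * dy, x1 + t1 * dx, y1 + t1 * dy)]
```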

Figure 5

Randomly shifting individual contours within the image bounding box destroys the distribution of contour junctions while keeping the statistics of orientation, length, and curvature constant (Walther and Shen, 2014; Choo and Walther, 2016). This functionality is provided by the randomlyShiftContours function (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/randomlyShiftContours.html).
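The shifting operation can be sketched as follows (hypothetical Python, not the randomlyShiftContours code). Each contour is translated by its own random offset. The offset is drawn so that the contour's bounding box stays within the image, assuming the contour fits inside the image at all. Translation preserves every segment's orientation, length, and curvature, while contours generally no longer meet where they used to.

```python
import random
from typing import List, Sequence, Tuple

Segment = Tuple[float, float, float, float]  # (X1, Y1, X2, Y2)


def shift_contours(contours: Sequence[Sequence[Segment]], width: float, height: float,
                   seed: int = 0) -> List[List[Segment]]:
    """Translate each contour by its own random offset, keeping it inside the image."""
    rng = random.Random(seed)
    shifted = []
    for contour in contours:
        xs = [v for s in contour for v in (s[0], s[2])]
        ys = [v for s in contour for v in (s[1], s[3])]
        # Offsets that keep the bounding box within [0, width] x [0, height].
        dx = rng.uniform(-min(xs), width - max(xs))
        dy = rng.uniform(-min(ys), height - max(ys))
        shifted.append([(x1 + dx, y1 + dy, x2 + dx, y2 + dy) for x1, y1, x2, y2 in contour])
    return shifted
```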

Datasets

We provide five data sets already processed as vecLD structs:

A set of 475 images of six scene categories (beaches, cities, forests, highways, mountains, and offices). The line drawings were created by trained artists at the Lotus Hill Institute in Fudan, People’s Republic of China.
Line drawings of the 1,182 images in the International Affective Picture System (IAPS) (Lang et al., 2008), also created at the Lotus Hill Institute.
Hand-traced drawings of 260 objects from Snodgrass and Vanderwart (1980).
A set of 200 architectural scenes published by Vartanian et al. (2013), traced automatically using the traceLineDrawingFromRGB function (https://htmlpreview.github.io/?https://raw.githubusercontent.com/bwlabToronto/MLV_toolbox/main/doc/MLVcode/traceLineDrawingFromRGB.html).
Line drawings of the 900 images in the Open Affective Standardized Image Set (OASIS) (Kurdi et al., 2017).

We plan to add more datasets in the near future.

Conclusion

A major obstacle for research on the role of mid-level visual features in the perception of complex, real-world scenes has been the lack of tools to measure and selectively manipulate these features in scenes. We offer the Mid-level Vision Toolbox to the research community as a way to overcome this obstacle and enable future research on this topic. We include functionality for assessing a variety of low- and mid-level features based on the geometry of contours, as well as functions for generating contour line drawings from RGB images and functions for manipulating such drawings.

The easily accessible data structures and function interfaces of the MLV Toolbox allow for future expansions of its functionality. For instance, symmetry relationships, limited to the nearest contours in the current implementation, could be expanded to include symmetries across larger scales. Another expansion could be the detection of another important grouping cue, closure of contours. Figure-ground segmentation cues, such as border ownership, could be included in the future as well. Our group will continue to work on expanding the functionality of the toolbox, and we also invite contributions from other researchers.

Computational models of visual perception in humans and non-human primates have progressed rapidly in recent years, thanks to the advent of convolutional neural networks (Krizhevsky et al., 2012). Convolutional neural networks have been shown to correlate well with biological vision systems (e.g., Cadieu et al., 2014; Khaligh-Razavi and Kriegeskorte, 2014; Yamins et al., 2014; Güçlü and van Gerven, 2015), and they are getting closer to replicating the functionality of biological systems all the time (Schrimpf et al., 2020). Why, then, should we care about hand-coded algorithms for detecting mid-level features as in the MLV Toolbox? Despite these models’ increasing ability to match biological vision, the mechanisms underlying their impressive performance are barely any clearer than those underlying biological vision. We need to probe these deep neural networks empirically to better understand the mechanisms underlying their function (Bowers et al., 2022).

A century of empirical research as well as existing practice in design and architecture have unequivocally demonstrated the importance of Gestalt grouping rules for human perception (Wagemans et al., 2012) as well as esthetic appreciation (Arnheim, 1965; Leder et al., 2004; Chatterjee, 2022). To what extent convolutional neural networks learn to represent these grouping rules is an open question. We know from work with random dot patterns that the human brain represents symmetry relationships in fairly high-level areas, such as the object-sensitive lateral occipital complex (Bona et al., 2014, 2015). We are only starting to learn how the brain represents Gestalt grouping rules for perceiving complex scenes (Wilder et al., 2022).

Here we provide a set of computational tools that will enable studies of the mid-level representations that arise in both biological and artificial vision systems. Although the specific computations used here to measure mid-level visual properties are unlikely to accurately reflect the neural mechanisms of the human visual system, we nevertheless believe that measuring and manipulating mid-level visual cues in complex scenes will be instrumental in furthering our understanding of visual perception.

Funding

This work was supported by an NSERC Discovery grant (RGPIN-2020-04097) and an XSeed grant from the Faculty of Applied Science and Engineering and the Faculty of Arts and Science of the University of Toronto to DW, an Alexander Graham Bell Canada Graduate Scholarship from NSERC to DF, a Connaught International Scholarship to SH, and a Faculty of Arts and Science Postdoctoral Fellowship to MR.

Publisher’s note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

Statements

Data availability statement

The datasets presented in this study can be found at: http://mlvtoolbox.org.

Author contributions

DW and MR contributed to the algorithms and their implementation in the toolbox. DF and SH tested the code and provided suggestions and feedback for features and improvements. DW wrote the first draft of the manuscript. MR and DF wrote sections of the manuscript. All authors contributed to the article and approved the submitted version.

Conflict of interest

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Footnotes

1. http://mlvtoolbox.org

2. https://github.com/pdollar/edges

3. https://www.peterkovesi.com/matlabfns/#edgelink

References

Arnheim, R. (1965). Art and visual perception: A psychology of the creative eye. Berkeley, CA: University of California Press.
Ayzenberg, V., Kamps, F. S., Dilks, D. D., and Lourenco, S. F. (2022). Skeletal representations of shape in the human visual cortex. Neuropsychologia 164:108092. doi: 10.1016/j.neuropsychologia.2021.108092
Ayzenberg, V., and Lourenco, S. F. (2019). Skeletal descriptions of shape provide unique perceptual information for object recognition. Sci. Rep. 9:9359. doi: 10.1038/s41598-019-45268-y
Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychol. Rev. 94, 115–147. doi: 10.1037/0033-295X.94.2.115
Blum, H. (1967). A transformation for extracting new descriptions of shape. Cambridge, MA: MIT Press.
Bona, S., Cattaneo, Z., and Silvanto, J. (2015). The causal role of the occipital face area (OFA) and lateral occipital (LO) cortex in symmetry perception. J. Neurosci. 35, 731–738. doi: 10.1523/JNEUROSCI.3733-14.2015
Bona, S., Herbert, A., Toneatto, C., Silvanto, J., and Cattaneo, Z. (2014). The causal role of the lateral occipital complex in visual mirror symmetry detection and grouping: an fMRI-guided TMS study. Cortex 51, 46–55. doi: 10.1016/j.cortex.2013.11.004
Bowers, J. S., Malhotra, G., Duimovic, M., Montero, M. L., Tsvetkov, C., Biscione, V., et al. (2022). Deep problems with neural network models of human vision. Behav. Brain Sci. 1, 1–74. doi: 10.1017/S0140525X22002813
Cadieu, C. F., Hong, H., Yamins, D. L., Pinto, N., Ardila, D., Solomon, E. A., et al. (2014). Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Comput. Biol. 10:e1003963. doi: 10.1371/journal.pcbi.1003963
Chatterjee, A. (2022). "An early framework for a cognitive neuroscience of visual aesthetics," in Brain, beauty, & art: Essays bringing neuroaesthetics in focus, eds. A. Chatterjee and E. R. Cardillo (New York, NY: Oxford University Press).
Choo, H., and Walther, D. B. (2016). Contour junctions underlie neural representations of scene categories in high-level human visual cortex. NeuroImage 135, 32–44. doi: 10.1016/j.neuroimage.2016.04.021
Damiano, C., Walther, D. B., and Cunningham, W. A. (2021a). Contour features predict valence and threat judgements in scenes. Sci. Rep. 11, 1–12. doi: 10.1038/s41598-021-99044-y
Damiano, C., Wilder, J., and Walther, D. B. (2019). Mid-level feature contributions to category-specific gaze guidance. Atten. Percept. Psychophys. 81, 35–46. doi: 10.3758/s13414-018-1594-8
Damiano, C., Wilder, J., Zhou, E. Y., Walther, D. B., and Wagemans, J. (2021b). The role of local and global symmetry in pleasure, interest, and complexity judgments of natural scenes. Psychol. Aesthet. Creat. Arts 17, 322–337. doi: 10.1037/aca0000398
De Winter, J., and Wagemans, J. (2006). Segmentation of object outlines into parts: a large-scale integrative study. Cognition 99, 275–325. doi: 10.1016/j.cognition.2005.03.004
Desolneux, A., Moisan, L., and Morel, J.-M. (2004). "Gestalt theory and computer vision," in Seeing, thinking and knowing: Meaning and self-organisation in visual cognition and thought (Dordrecht: Springer Netherlands), 71–101.
Desolneux, A., Moisan, L., and Morel, J.-M. (2007). From Gestalt theory to image analysis: A probabilistic approach. New York, NY: Springer Science & Business Media.
Dollár, P., and Zitnick, C. L. (2013). "Structured forests for fast edge detection," in Proceedings of the IEEE International Conference on Computer Vision, 1841–1848.
Dollár, P., and Zitnick, C. L. (2014). Fast edge detection using structured forests. IEEE Trans. Pattern Anal. Mach. Intell. 37, 1558–1570. doi: 10.1109/TPAMI.2014.2377715
Elder, J. H., and Goldberg, R. M. (2002). Ecological statistics of Gestalt laws for the perceptual organization of contours. J. Vis. 2:5. doi: 10.1167/2.4.5
Elder, J., and Zucker, S. (1993). The effect of contour closure on the rapid discrimination of two-dimensional shapes. Vis. Res. 33, 981–991. doi: 10.1016/0042-6989(93)90080-G
Epstein, R., DeYoe, E. A., Press, D. Z., Rosen, A. C., and Kanwisher, N. (2001). Neuropsychological evidence for a topographical learning mechanism in parahippocampal cortex. Cogn. Neuropsychol. 18, 481–508. doi: 10.1080/02643290125929
Epstein, R., and Kanwisher, N. (1998). A cortical representation of the local visual environment. Nature 392, 598–601. doi: 10.1038/33402
Farzanfar, D., and Walther, D. B. (2023). Changing what you like: modifying contour properties shifts aesthetic valuations of scenes. Psychol. Sci. doi: 10.1177/09567976231190546 [Epub ahead of print].
Feldman, J., and Singh, M. (2005). Information along contours and object boundaries. Psychol. Rev. 112, 243–252. doi: 10.1037/0033-295X.112.1.243
Feldman, J., and Singh, M. (2006). Bayesian estimation of the shape skeleton. Proc. Natl. Acad. Sci. 103, 18014–18019. doi: 10.1073/pnas.0608811103
Field, D. J., Hayes, A., and Hess, R. F. (1993). Contour integration by the human visual system: evidence for a local "association field". Vis. Res. 33, 173–193. doi: 10.1016/0042-6989(93)90156-Q
Firestone, C., and Scholl, B. J. (2014). "Please tap the shape, anywhere you like": shape skeletons in human vision revealed by an exceedingly simple measure. Psychol. Sci. 25, 377–386. doi: 10.1177/0956797613507584
Gallant, J. L., Connor, C. E., Rakshit, S., Lewis, J. W., and Van Essen, D. C. (1996). Neural responses to polar, hyperbolic, and Cartesian gratings in area V4 of the macaque monkey. J. Neurophysiol. 76, 2718–2739. doi: 10.1152/jn.1996.76.4.2718
Geisler, W. S., Perry, J. S., Super, B. J., and Gallogly, D. P. (2001). Edge co-occurrence in natural images predicts contour grouping performance. Vis. Res. 41, 711–724. doi: 10.1016/S0042-6989(00)00277-7
Güçlü, U., and van Gerven, M. A. (2015). Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream. J. Neurosci. 35, 10005–10014. doi: 10.1523/JNEUROSCI.5023-14.2015
Han, S., Rezanejad, M., and Walther, D. B. (2023). Making memorability of scenes better or worse by manipulating their contour properties. J. Vis. 23:5494. doi: 10.1167/jov.23.9.5494
Hubel, D. H., and Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. J. Physiol. 160, 106–154. doi: 10.1113/jphysiol.1962.sp006837
Khaligh-Razavi, S.-M., and Kriegeskorte, N. (2014). Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Comput. Biol. 10:e1003915. doi: 10.1371/journal.pcbi.1003915
Koffka, K. (1935). Principles of Gestalt psychology. New York, NY: Harcourt Brace and Company.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). "ImageNet classification with deep convolutional neural networks," in Advances in neural information processing systems, eds. F. Pereira, C. J. Burges, L. Bottou, and K. Q. Weinberger (United States: Curran Associates, Inc.).
Kubovy, M. (1994). The perceptual organization of dot lattices. Psychon. Bull. Rev. 1, 182–190. doi: 10.3758/bf03200772
Kurdi, B., Lozano, S., and Banaji, M. R. (2017). Introducing the open affective standardized image set (OASIS). Behav. Res. Methods 49, 457–470. doi: 10.3758/s13428-016-0715-3
Lang, P. J., Bradley, M. M., and Cuthbert, B. N. (2008). International affective picture system (IAPS): Affective ratings of pictures and instruction manual. Gainesville, FL: University of Florida.
Leder, H., Belke, B., Oeberst, A., and Augustin, D. (2004). A model of aesthetic appreciation and aesthetic judgments. Br. J. Psychol. 95, 489–508. doi: 10.1348/0007126042369811
Lowe, D. G. (1987). Three-dimensional object recognition from single two-dimensional images. Artif. Intell. 31, 355–395. doi: 10.1016/0004-3702(87)90070-1
Lowe, D. (2012). Perceptual organization and visual recognition. New York, NY: Springer Science & Business Media.
Machilsen, B., Pauwels, M., and Wagemans, J. (2009). The role of vertical mirror symmetry in visual shape detection. J. Vis. 9:11. doi: 10.1167/9.12.11
Malcolm, G. L., Groen, I. I. A., and Baker, C. I. (2016). Making sense of real-world scenes. Trends Cogn. Sci. 20, 843–856. doi: 10.1016/j.tics.2016.09.003
Marr, D. (1982). Vision: A computational investigation into the human representation and processing of visual information.
Marr, D., and Nishihara, H. K. (1978). Representation and recognition of the spatial organization of three-dimensional shapes. Proc. R. Soc. Lond. B 200, 269–294. doi: 10.1098/rspb.1978.0020
Michaelsen, E., and Meidow, J. (2019). Hierarchical perceptual grouping for object recognition. New York, NY: Springer.
Norcia, A. M., Candy, T. R., Pettet, M. W., Vildavski, V. Y., and Tyler, C. W. (2002). Temporal dynamics of the human response to symmetry. J. Vis. 2:1. doi: 10.1167/2.2.1
Olshausen, B. A., and Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381, 607–609. doi: 10.1038/381607a0
Pasupathy, A., and Connor, C. E. (2002). Population coding of shape in area V4. Nat. Neurosci. 5, 1332–1338. doi: 10.1038/972
Peirce, J. W. (2015). Understanding mid-level representations in visual processing. J. Vis. 15:5. doi: 10.1167/15.7.5
Peterhans, E., and von der Heydt, R. (1989). Mechanisms of contour perception in monkey visual cortex. II. Contours bridging gaps. J. Neurosci. 9, 1749–1763. doi: 10.1523/JNEUROSCI.09-05-01749.1989
Pizlo, Z., Li, Y., and Sawada, T. (2014). Making a machine that sees like us. Oxford University Press, USA.
Rezanejad, M. (2020). Medial measures for recognition, mapping and categorization. McGill University, Canada.
Rezanejad, M., Downs, G., Wilder, J., Walther, D. B., Jepson, A., Dickinson, S., et al. (2019). "Scene categorization from contours: medial axis based salience measures," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4116–4124.
Rezanejad, M., Wilder, J., Walther, D. B., Jepson, A., Dickinson, S., and Siddiqi, K. (2023). Shape-based measures improve scene categorization. IEEE Trans. Pattern Anal. Mach. Intell. (under review).
Sasaki, Y. (2007). Processing local signals into global patterns. Curr. Opin. Neurobiol. 17, 132–139. doi: 10.1016/j.conb.2007.03.003
Schrimpf, M., Kubilius, J., Lee, M. J., Murty, N. A. R., Ajemian, R., and DiCarlo, J. J. (2020). Integrative benchmarking to advance neurally mechanistic models of human intelligence. Neuron 108, 413–423. doi: 10.1016/j.neuron.2020.07.040
Snodgrass, J. G., and Vanderwart, M. (1980). A standardized set of 260 pictures: norms for name agreement, image agreement, familiarity, and visual complexity. J. Exp. Psychol. Hum. Learn. Mem. 6, 174–215. doi: 10.1037/0278-7393.6.2.174
Sun, Z., and Firestone, C. (2022). Beautiful on the inside: aesthetic preferences and the skeletal complexity of shapes. Perception 51, 904–918. doi: 10.1177/03010066221124872
Vartanian, O., Navarrete, G., Chatterjee, A., Fich, L. B., Leder, H., Modroño, C., et al. (2013). Impact of contour on aesthetic judgments and approach-avoidance decisions in architecture. Proc. Natl. Acad. Sci. 110, 10446–10453. doi: 10.1073/pnas.1301227110
Vinje, W. E., and Gallant, J. L. (2000). Sparse coding and decorrelation in primary visual cortex during natural vision. Science 287, 1273–1276. doi: 10.1126/science.287.5456.1273
Wagemans, J. (1993). Skewed symmetry: a nonaccidental property used to perceive visual forms. J. Exp. Psychol. Hum. Percept. Perform. 19, 364–380. doi: 10.1037/0096-1523.19.2.364
Wagemans, J. (1997). Characteristics and models of human symmetry detection. Trends Cogn. Sci. 1, 346–352. doi: 10.1016/s1364-6613(97)01105-4
Wagemans, J., Elder, J. H., Kubovy, M., Palmer, S. E., Peterson, M. A., Singh, M., et al. (2012). A century of Gestalt psychology in visual perception: I. Perceptual grouping and figure–ground organization. Psychol. Bull. 138, 1172–1217. doi: 10.1037/a0029333
Walther, D. B., and Shen, D. (2014). Nonaccidental properties underlie human categorization of complex natural scenes. Psychol. Sci. 25, 851–860. doi: 10.1177/0956797613512662
Wertheimer, M. (1922). Untersuchungen zur Lehre von der Gestalt, I: Prinzipielle Bemerkungen [Investigations in Gestalt theory: I. The general theoretical situation]. Psychol. Forsch. 1, 47–58. doi: 10.1007/BF00410385
Wilder, J., Dickinson, S., Jepson, A., and Walther, D. B. (2018). Spatial relationships between contours impact rapid scene classification. J. Vis. 18:1. doi: 10.1167/18.8.1
Wilder, J., Rezanejad, M., Dickinson, S., Siddiqi, K., Jepson, A., and Walther, D. B. (2022). Neural correlates of local parallelism during naturalistic vision. PLoS One 17:e0260266. doi: 10.1371/journal.pone.0260266
Yamins, D. L., Hong, H., Cadieu, C. F., Solomon, E. A., Seibert, D., and DiCarlo, J. J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci. 111, 8619–8624. doi: 10.1073/pnas.1403112111

Summary

Keywords

mid-level vision, perceptual grouping, gestalt grouping rules, contour drawings, medial axis transform, symmetry, contour tracing

Citation

Walther DB, Farzanfar D, Han S and Rezanejad M (2023) The mid-level vision toolbox for computing structural properties of real-world images. Front. Comput. Sci. 5:1140723. doi: 10.3389/fcomp.2023.1140723

Received

09 January 2023

Accepted

18 August 2023

Published

13 September 2023

Volume

5 - 2023

Edited by

Ernest Greene, University of Southern California, United States

Reviewed by

William McIlhagga, University of Bradford, United Kingdom; Eckart Michaelsen, System Technologies and Image Exploitation IOSB, Germany


This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

*Correspondence: Dirk B. Walther, dirk.bernhardt.walther@utoronto.ca

†These authors have contributed equally to this work

