A proposed scenario to improve the Ncut algorithm in segmentation

Image segmentation methods typically partition an image into k clusters, but the number of clusters k must be specified before the process runs, usually from observation or application-specific knowledge. In this paper, we propose a new scenario that determines the number of clusters k automatically from histogram information. The scenario is applied to the Ncut algorithm, and the running time is reduced by parallel computing on the GPU with CUDA. The Ncut algorithm is improved in four steps: determining the number of clusters for segmentation, computing the similarity matrix W, computing the eigenvalues of the similarity matrix, and grouping with the Fuzzy C-Means (FCM) clustering algorithm. Experimental results show that our scenario is 20 times faster than the Ncut algorithm while keeping the same accuracy.

KEYWORDS GPU, CPU, parallel computing, Ncut, FCM

1. Introduction
Image segmentation is a key step for grouping objects that share characteristics or properties such as color, intensity, or texture. In an image, each object is called a domain or region, and its outline is called the boundary. Feature vectors of regions are created from their properties and are used to distinguish them. Image segmentation describes the components of an image in detail so that objects can be classified and recognized more easily (Nock and Nielsen, 2004;Starck et al., 2005;Yu et al., 2009;Belahcene et al., 2014;Dhanachandra et al., 2015;Minaee and Wang, 2019). For instance, face detection based on image segmentation contributes to better face recognition and user identification.
When segmenting high-resolution image data, parallel computation follows one of two approaches. The first is a parallel programming model on a hybrid CPU-GPU system (Agulleiro et al., 2012), a powerful co-processing arrangement in which the CPU and GPU complement each other, allowing many large applications to run with high performance. Specifically, the OpenMP, CUDA, and MPI libraries are used on CPUs and GPUs (Sirotković et al., 2012;Baker and Balhaf, 2017;Fakhi et al., 2017;Dalvand et al., 2020;Wang N. et al., 2020). Clustering on a multi-core architecture starts by dividing the image data into regions in a grid pattern and then parallelizes the segmentation over the regions. The second is a parallel file system approach with Hadoop, MapReduce, and Spark (Augustine and Raj, 2016;Li et al., 2016;Cao et al., 2018;Liu et al., 2019;Wang X. et al., 2020), built for high-resolution image segmentation. Both approaches increase the performance of the algorithm.
Among segmentation algorithms, the Ncut algorithm (Shi and Malik, 2000), which is based on graph theory, is one of the most efficient for image segmentation. It detects the boundary between two regions by partitioning and grouping based on both local and global features of the image. The algorithm computes a distinct parameter for dividing the input image into different regions. However, graph partitioning is an NP-complete problem, so no polynomial-time algorithm for it is known. In addition, the complexity of the Ncut algorithm grows with the size of the input image. For this reason, several parallelization methods have been applied to reduce the processing time of image segmentation. Shiloach and Vishkin (1982) developed the first parallel algorithm based on breadth-first search. Anderson and Setubal (1992) parallelized the numbering algorithm on workstations for better computation. Wassenberg et al. (2009) used the Minimum Spanning Tree with optimal function calculation and parallel execution on shared-memory machines. XianLou and ShuangYuan (2013) implemented the Ncut algorithm in parallel on GPUs. However, these methods mostly run parallel algorithms on small partitions of the image: the image is first divided into many small regions, and the segmentation algorithm is then applied to those regions in parallel. Therefore, the performance of these methods depends on the size of each small region, and choosing the right number of divisions and the proper size is itself a problem (XianLou and ShuangYuan, 2013).
The execution time of the Ncut algorithm is O(MN), where N is the number of pixels, which equals the number of nodes of the graph built from the image, and M is the number of steps that the Lanczos algorithm (Cullum and Willoughby, 2002) takes to find the eigenvalues. Since every node of the input image relates only to a few neighbors, the W matrix can be stored as a sparse matrix, which uses memory efficiently. Moreover, because computing the similarity matrix W and its eigenvalues takes a considerable amount of time, we propose a parallel computing method using CUDA on the GPU for solving the Ncut problem.
Our paper is organized into three main sections. The first section introduces our approach. The second section describes our proposed method in detail and consists of four subsections: determination of the number of clusters in segmentation, computing the similarity matrix W, computing the similarity matrix's eigenvalues, and grouping with FCM. The last section presents experimental results that compare the running time of our approach with that of conventional approaches, together with a comparison of segmentation accuracy.

2. Proposed method
The Ncut method proposed by Shi and Malik is as follows:
+ Set up a weighted graph G = (V, E), where the vertex set V consists of the points in the feature space, an edge in the edge set E is formed between every pair of nodes, and the weight on the edge connecting two nodes is a measure of the similarity between the two nodes.
+ Solve (D − W)x = λDx for the eigenvectors with the smallest eigenvalues, where D is an N×N diagonal matrix, W is an N×N symmetric matrix, x is an eigenvector, and λ is an eigenvalue.
+ Use the eigenvector with the second smallest eigenvalue to bipartition the graph by finding the splitting point such that Ncut is minimized.
+ Decide if the current partition should be subdivided, and recursively repartition the segmented parts if necessary.
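As an illustration of the bipartition step, the following minimal Python sketch evaluates the normalized cut of Shi and Malik (2000), Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V), and searches for the splitting point of a given vector that minimizes it. The function names `ncut_value` and `best_split`, the dense list-of-lists matrix, and the input vector `x` (standing in for the eigenvector with the second smallest eigenvalue) are assumptions made for this example; the eigen-solver itself is not shown.

```python
def ncut_value(W, in_a):
    """Normalized cut of the bipartition (A, B) of a weighted graph.

    W is a symmetric weight matrix (list of lists); in_a[i] is True when
    node i belongs to A. Ncut = cut/assoc(A, V) + cut/assoc(B, V).
    """
    n = len(W)
    cut = assoc_a = assoc_b = 0.0
    for i in range(n):
        for j in range(n):
            if in_a[i]:
                assoc_a += W[i][j]          # weight from A to any node
                if not in_a[j]:
                    cut += W[i][j]          # weight crossing from A to B
            else:
                assoc_b += W[i][j]          # weight from B to any node
    if assoc_a == 0.0 or assoc_b == 0.0:
        return float("inf")                 # one side is empty or isolated
    return cut / assoc_a + cut / assoc_b


def best_split(W, x):
    """Try each distinct value of x as a threshold; keep the minimum Ncut."""
    best_value, best_partition = float("inf"), None
    for t in sorted(set(x))[:-1]:           # the largest value would leave B empty
        in_a = [xi <= t for xi in x]
        value = ncut_value(W, in_a)
        if value < best_value:
            best_value, best_partition = value, in_a
    return best_value, best_partition
```

For a four-node chain with strong weights inside the pairs {0, 1} and {2, 3} and a weak weight between nodes 1 and 2, the search separates the two pairs, since cutting the weak edge gives the smallest normalized cut.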
The Ncut algorithm should be improved for computational performance in image segmentation. First, an automatic method for predicting k is needed to choose the number of partitions in image segmentation applications; we propose to predict the number of clusters k from the histogram of the image. Second, in one step of the algorithm, the K-means grouping method is used to group the eigenvectors found. Since the eigenvectors are real-valued data, errors arise during clustering because of the computer's finite numerical representation. We propose using the FCM algorithm, expecting that fuzzifying the data will tolerate these errors and yield better clustering. Furthermore, computing the similarity matrix and the eigenvalues sequentially takes up a considerable share of the execution time of the whole algorithm. We propose applying parallel computation on the GPU to these steps, expecting better computing performance on large image data.

2.1. Determination of number of clusters in segmentation
In most segmentation problems, deciding the number of objects to segment is crucial and indispensable. Generally, this number of groups is entered by the user, based on an estimate made by viewing an arbitrary image and choosing a number k. We propose an automatic approach that decides the number of clusters k from the histogram. The gray-level histogram provides many extreme points (minima and maxima). Counting the maxima that satisfy formula (1) is the key to deciding the number of clusters.
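Let $\delta_i \in \mathbb{R}$ and define $f$ through the following pair of conditions:

$$(P_1(x) - P_2(x)) > \delta_1 \quad \text{and} \quad (P_{1y} - P_{2y}) > \delta_2 \qquad (1)$$

where $(P_1(x) - P_2(x)) > \delta_1$ is the height deviation, i.e., the deviation of the total pixels in a gray level between two peaks, and $(P_{1y} - P_{2y}) > \delta_2$ is the distance between two extreme points. The values of $\delta_1$ and $\delta_2$ are estimated by statistics from a pre-selected set of images.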
Figure 1 shows a gray-level histogram and the number of clusters obtained from the distance and deviation criteria.
Algorithm 1 determines the number of segments in an image.
Input: Array holding the gray-level histogram, $\delta_1$, $\delta_2$
Output: K image segments
K := 0
Step 1: Find the largest local element $P_1$
Step 2: Find the smallest local element $P_2$ such that $P_{2y} < P_{1y}$ and formula (1) is satisfied
Step 3: Increase K by 1
Step 4: Repeat from Step 1
Algorithm 1. K-segment_Histogram.
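One possible reading of the histogram-based procedure can be sketched in Python as follows. This is an interpretation under stated assumptions, not the authors' implementation: the function names `local_maxima` and `estimate_k` are hypothetical, peaks are visited from tallest to shortest, and a peak is counted only if it lies more than `delta2` gray levels from every peak already counted and rises more than `delta1` above the lowest valley between them.

```python
def local_maxima(hist):
    """Indices of local maxima; a flat plateau contributes its first bin only."""
    peaks = []
    n = len(hist)
    for i in range(n):
        left = hist[i - 1] if i > 0 else -1
        right = hist[i + 1] if i < n - 1 else -1
        if hist[i] > left and hist[i] >= right:
            peaks.append(i)
    return peaks


def estimate_k(hist, delta1, delta2):
    """Count histogram peaks that pass the height and distance thresholds."""
    # Assumption: visit candidate peaks from the tallest to the shortest.
    candidates = sorted(local_maxima(hist), key=lambda i: hist[i], reverse=True)
    accepted = []
    for p in candidates:
        keep = True
        for q in accepted:
            lo, hi = min(p, q), max(p, q)
            valley = min(hist[lo:hi + 1])        # lowest bin between the peaks
            too_close = (hi - lo) <= delta2      # distance criterion
            too_flat = (hist[p] - valley) <= delta1  # height-deviation criterion
            if too_close or too_flat:
                keep = False
                break
        if keep:
            accepted.append(p)
    return len(accepted)
```

With this reading, a histogram that has two tall, well-separated peaks and one small bump near the first peak yields k = 2 for large thresholds, and k = 3 once both thresholds are lowered enough for the bump to count.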

2.2. Computing the similarity matrix W
The similarity matrix W, also called the affinity matrix, is the matrix representation of the relationships between the nodes of the original image. For example, the original image is converted to a graph and a W matrix as shown in Figure 2.
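We apply the Shi and Malik grouping algorithm to image segmentation based on brightness. We construct the graph G = (V, E) by taking each pixel as a node and define the edge weight $w_{ij}$ between nodes i and j as the product of a feature similarity term and a spatial proximity term, following Shi and Malik (2000), using formula (2):

$$w_{ij} = e^{-\|I(i) - I(j)\|_2^2 / \sigma_I^2} \times \begin{cases} e^{-\|X(i) - X(j)\|_2^2 / \sigma_X^2} & \text{if } \|X(i) - X(j)\|_2 < r \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where X(i) is the spatial location of node i, I(i) is its brightness intensity, and $\sigma_I$ and $\sigma_X$ are scale parameters. The weight is $w_{ij} = 0$ for any pair of nodes i and j that are more than r pixels apart.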
In practice, r is usually much smaller than the image size (Shi and Malik, 2000). Therefore, the similarity matrix contains many more zero elements than nonzero elements. In other words, the affinity matrix W is a sparse matrix. To save memory, we store W in coordinate (COO) format (Bell and Garland, 2008), which keeps only the indexes and values of the nonzero elements. Specifically, the input matrix A is stored in three arrays, row, col, and val, which hold the row index, column index, and value of each nonzero element, as shown in Figure 3.
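To make the COO layout concrete, the sketch below builds the three arrays row, col, and val sequentially for a small grayscale image. It is a minimal sketch under stated assumptions: the function name `similarity_coo` is hypothetical, the weight is the product of a Gaussian brightness term and a Gaussian distance term as in Shi and Malik (2000), and the scale parameters `sigma_i` and `sigma_x` are illustrative inputs that the text does not specify.

```python
import math


def similarity_coo(image, r, sigma_i, sigma_x):
    """Return (row, col, val) for the sparse similarity matrix W of an image.

    image is a list of rows of brightness values; pixel (y, x) becomes
    node y * ncol + x. Pairs at distance r or more get no entry (weight 0).
    """
    nrow, ncol = len(image), len(image[0])
    reach = int(math.ceil(r))
    row, col, val = [], [], []
    for y in range(nrow):
        for x in range(ncol):
            i = y * ncol + x
            for dy in range(-reach, reach + 1):
                for dx in range(-reach, reach + 1):
                    yy, xx = y + dy, x + dx
                    dist2 = dx * dx + dy * dy
                    inside = 0 <= yy < nrow and 0 <= xx < ncol
                    if not inside or dist2 >= r * r:
                        continue                      # outside image or too far
                    diff = image[y][x] - image[yy][xx]
                    w = (math.exp(-diff * diff / sigma_i ** 2)
                         * math.exp(-dist2 / sigma_x ** 2))
                    row.append(i)
                    col.append(yy * ncol + xx)
                    val.append(w)
    return row, col, val
```

For a 1 × 4 image with r = 1.5, only 10 of the 16 possible entries are stored; the saving grows quickly with image size because the number of entries per pixel stays bounded by the neighborhood size.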
For each pixel i, there are corresponding connected pixels j. In other words, there is a weight $w_{ij}$ for every two pixels i and j connected within the given connection distance r. Thus, with the input image I(row × col), it takes a long time to consider the vertices sequentially (Shi and Malik, 2000). Therefore, we propose Algorithm 2, which processes each vertex i ∈ (row × col) in parallel to find the connected vertices j and calculate the weight $w_{ij}$ using formula (2).
Step 1: Copy image I, dMaskx, dMasky, and dnMask from host (CPU memory) to device (GPU memory)
Step 2: Determine the total number n of non-zero elements of the weight matrix W corresponding to the input matrix I by Algorithm 6
Step 3: Allocate memory for the variables dRow, dCol, and dVal, with n × sizeof(int) each for dRow and dCol
Step 4: Build a grid of execution threads (gridDim = (floor((nrow × ncol)/r^2), 1, 1), blockDim = (r, r, 1))
Step 5: Call the kernel function, executing threads in parallel to calculate dVal, dCol, and dRow:
Step 5.1: Determine the index of the executing thread: idx = blockIdx.x * blockDim.x + threadIdx.x
Step 5.2: Calculate the position of the pixel under consideration: tx = idx / ncol; ty = idx % ncol
Step 5.3: Calculate the neighboring points at the pixel under consideration that will have
Step 5.4: Calculate the $w_{ij}$ values by formula (2) and store them in the variables dVal, dRow, dCol
Algorithm 2. Sparse matrix_Ncut_GPU.
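The per-thread index arithmetic can be emulated on the CPU with a short Python sketch. It shows one common way to let independent threads write into shared COO arrays: count each pixel's neighbors first, then take an exclusive prefix sum so that every thread knows where its slice starts. This write-offset scheme is an assumption for illustration; the text states only that the total number of non-zero elements is determined before memory is allocated. The function names are hypothetical.

```python
def neighbor_offsets(r):
    """Offsets (dy, dx) closer than r: a neighborhood mask shared by all threads."""
    reach = int(r)
    return [(dy, dx)
            for dy in range(-reach, reach + 1)
            for dx in range(-reach, reach + 1)
            if dy * dy + dx * dx < r * r]


def count_neighbors(idx, nrow, ncol, mask):
    """Number of in-image neighbors of the pixel handled by thread idx."""
    tx, ty = idx // ncol, idx % ncol     # row and column, as in Step 5.2
    return sum(1 for dy, dx in mask
               if 0 <= tx + dy < nrow and 0 <= ty + dx < ncol)


def write_offsets(nrow, ncol, mask):
    """Exclusive prefix sum of neighbor counts, plus the total entry count."""
    offsets, total = [], 0
    for idx in range(nrow * ncol):
        offsets.append(total)            # first slot owned by thread idx
        total += count_neighbors(idx, nrow, ncol, mask)
    return offsets, total
```

With these offsets, thread idx would fill positions offsets[idx] up to, but not including, offsets[idx + 1] of the row, column, and value arrays, so no two threads write to the same slot and the total equals the allocation size n.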

2.3. Computing the similarity matrix's eigenvalues
In the Ncut problem, finding the eigenvalues of the weight matrix W means finding the eigenvectors associated with the k smallest eigenvalues of the Laplacian matrix built from W. Since W is a symmetric sparse matrix of relatively large size, computing the eigenvectors takes a large part of the time of the whole image segmentation process. With larger images the cost rises further, because the matrix size grows quadratically with the number of pixels. We propose to compute the eigenvalues of the W matrix in parallel on the GPU using the Lanczos method (Cullum and Willoughby, 2002), an effective algorithm for finding the k smallest eigenvalues. At each step j, the matrix-vector multiplication and the vector-vector multiplications are computed in parallel according to Algorithm 3.
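As a rough illustration of the Lanczos iteration referred to above, the following sequential Python sketch uses the standard textbook recurrence and returns the diagonal and off-diagonal entries of the tridiagonal matrix T. It is a sketch under stated assumptions: the function names are hypothetical, the matrix is a dense list of lists rather than a sparse matrix on the GPU, no reorthogonalization is performed, and extracting eigenvalues from T is not shown. The two products marked in the comments are the operations that the text proposes to run in parallel.

```python
import math


def matvec(A, v):
    """Dense matrix-vector product (the step parallelized on the GPU)."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def dot(u, v):
    """Vector-vector product (also parallelized on the GPU)."""
    return sum(a * b for a, b in zip(u, v))


def lanczos(A, v1, m):
    """Run up to m Lanczos steps on the symmetric matrix A.

    Returns (alphas, betas): the diagonal and the off-diagonal of T.
    """
    norm = math.sqrt(dot(v1, v1))
    v = [x / norm for x in v1]           # normalized start vector
    v_prev = [0.0] * len(v)              # v_0 = 0
    beta = 0.0                           # beta_1 = 0
    alphas, betas = [], []
    for _ in range(m):
        w = matvec(A, v)
        w = [wi - beta * pi for wi, pi in zip(w, v_prev)]
        alpha = dot(w, v)
        w = [wi - alpha * vi for wi, vi in zip(w, v)]
        alphas.append(alpha)
        beta = math.sqrt(dot(w, w))
        if beta < 1e-12:                 # invariant subspace reached
            break
        betas.append(beta)
        v_prev, v = v, [wi / beta for wi in w]
    return alphas, betas[:len(alphas) - 1]
```

The eigenvalues of T approximate extreme eigenvalues of A, which is why only a small number of steps is needed when a few eigenvalues are sought.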

Input: The symmetric matrix $A_{n \times n}$, k
Output: The k smallest eigenvalues and their eigenvectors
Step 1: Choose a random vector $v_1 \in \mathbb{R}^n$ with $\|v_1\|_2 = 1$
Step 2: Set j = 1, $\beta_1 = 0$, $v_0 = 0$
Step 3: Compute the matrix-vector and vector-vector products of the Lanczos iteration for step j
Step 4: If j < k − 1, go back to Step 3; otherwise continue
Step 5: Calculate the k smallest eigenvalues and eigenvectors based on the tridiagonal matrix T
Algorithm 3. Lanczos_Ncut_GPU.

2.4. Applying FCM algorithm to eigenvector matrix
The FCM algorithm (Nayak et al., 2014) allows a point to belong to one or more groups, depending on the degree of membership of each point with respect to the centers of the groups (using fuzzy logic). Therefore, FCM is more flexible on data sets with overlapping clusters (high similarity with images). The algorithm is mainly based on optimizing the objective function given by formula (3).
$$J(Z; U, V) = \sum_{i=1}^{N} \sum_{j=1}^{C} (u_{ij})^m D_{ij} \qquad (3)$$

Here, N is the number of points in Z; the membership matrix $U = [u_{ij}] \in M_{fcm}$ is the fuzzy partition of the data set Z; $u_{ij} \in [0, 1]$ indicates the degree to which point $x_i$ belongs to the j-th cluster, with $\sum_{j=1}^{C} u_{ij} = 1, \forall i$; $V = [c_1, c_2, \ldots, c_C]$ is the vector of prototypes, or cluster centers, of the C groups, calculated according to the distance norm $D_{ij} = \|x_i - c_j\|^2$; and $m \in [1, \infty)$ is the exponent that determines the fuzziness of the clustering. Instead of using K-means in the Ncut algorithm, we propose using the FCM algorithm to group the image data obtained from the second smallest eigenvalue, fuzzifying the original data to reach better clusters, according to Algorithm 4. FCM represents data in a fuzzy way: it tolerates computational noise during clustering and is therefore better suited to real-valued data such as the eigenvector set from the Ncut problem.
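The alternating updates that minimize the FCM objective can be sketched in pure Python. This is a minimal sketch under stated assumptions: the function names are hypothetical, the initial centers are supplied by the caller, the loop runs a fixed number of iterations instead of testing convergence, and m must be strictly greater than 1 for the update formulas to be defined.

```python
def memberships(points, centers, m):
    """u_ij proportional to D_ij^(-1/(m-1)), normalized so each row sums to 1."""
    U = []
    for x in points:
        # Squared distances to every center, clamped to avoid division by zero.
        d = [max(sum((a - b) ** 2 for a, b in zip(x, c)), 1e-12)
             for c in centers]
        inv = [dj ** (-1.0 / (m - 1.0)) for dj in d]
        total = sum(inv)
        U.append([v / total for v in inv])
    return U


def update_centers(points, U, m):
    """Each center is the mean of all points weighted by u_ij^m."""
    dim = len(points[0])
    centers = []
    for j in range(len(U[0])):
        weights = [u[j] ** m for u in U]
        total = sum(weights)
        centers.append(tuple(
            sum(w * x[k] for w, x in zip(weights, points)) / total
            for k in range(dim)))
    return centers


def fcm(points, centers, m=2.0, iters=50):
    """Alternate membership and center updates; return (U, centers)."""
    for _ in range(iters):
        U = memberships(points, centers, m)
        centers = update_centers(points, U, m)
    return memberships(points, centers, m), centers
```

In the spectral setting described here, each point would be one row of the eigenvector matrix, and a hard segmentation can be obtained afterwards by assigning each point to the cluster with its largest membership.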

Input: Matrix of k eigenvectors
Output: Clustered matrix of k eigenvectors
Step 1: Let the matrix $U \in \mathbb{R}^{n \times k}$ be the matrix with the k eigenvectors $v_1, \ldots, v_k$ as columns
Step 2: For i = 1, ..., n, let $y_i \in \mathbb{R}^k$ be the vector corresponding to the i-th row of the matrix U
Step 3: Group the points $(y_i)_{i=1,\ldots,n} \in \mathbb{R}^k$ using FCM into clusters $C_1, \ldots, C_k$
Step 4: We have the result as clusters $A_1, \ldots$
Algorithm 4.

3. Experimental results

3.1. Determination of number of clusters in segmentation
We conducted an experiment on determining the number of clusters for segmentation on four datasets. We compared the number of clusters obtained by intuition with the number obtained by our histogram-based method. Table 1 shows the results on the four datasets, with the thresholds for height deviation and peak distance, $\delta_1$ and $\delta_2$, set to 50 and 100, respectively. From Table 1, it can be seen that the difference between visual determination and histogram-based determination is not significant. Therefore, predicting the number of clusters is feasible in dynamic image processing applications.

3.2. Computing the similarity matrix W
Figure 4 compares the time needed to calculate the W matrix by the Shi and Malik algorithm and by the sparse matrix_Ncut_GPU algorithm.
The vertical axis is the running time (s) and the horizontal axis is the image size, as listed in Table 2.
In the graph, when the image is 1024 × 1024, the Shi and Malik method runs out of memory. The running time of the sparse matrix_Ncut_GPU algorithm increases only slightly as the image size increases.

3.3. Computing the matrix's eigenvalues
In this section, the experiment is conducted with increasing image sizes, as listed in Table 3, for three clusters (k = 3). This corresponds to finding three eigenvalues and eigenvectors of the W matrix. Table 3 shows that the GPU time (the time to find the eigenvalues in parallel) is shorter than that of the Shi/Malik method.
Figure 5 compares the Shi/Malik method and the Lanczos_Ncut_GPU method by image size and number of segments.
The Lanczos_Ncut_GPU algorithm calculates the eigenvalues 17 times faster than the Shi/Malik algorithm as the image size increases.

3.4. Grouping on the eigenvalue by k-mean and FCM
After a set of eigenvalues has been determined, the Ncut algorithm uses them to form k clusters, which correspond to the k regions of an image. Figures 6, 7 show the clustering results of K-means and FCM. There are no differences between the two approaches in the resulting image segmentation. Figure 8 and Table 4 report the computational time of image segmentation for the Shi/Malik algorithm and for the version we improved on the GPU. Our GPU-improved algorithm runs 20 times faster than the Shi/Malik algorithm while keeping the same accuracy.

4. Conclusion
In this paper, we analyze the Ncut algorithm for clustering an image into regions. Because the Ncut algorithm requires the number of clusters k as input, it is not an automated process. For that reason, we predict k from the histogram, which gives a good result. Moreover, we improve the speed by parallel computing on the GPU. Specifically, we parallelize the calculation of the W matrix and the search for the eigenvector matrix in the Ncut algorithm. Finally, we use the FCM algorithm to cluster the data in this eigenvector matrix. Experimental results on image data sets show that our approach is 20 times quicker than the Shi and Malik approach in computing the eigenvalues and the similarity matrix.

FIGURE 8
The running time of image segmentation for the Shi/Malik algorithm and our algorithm.

Data availability statement
Publicly available datasets were analyzed in this study. These data can be found here: https://ccia.ugr.es/cvg/dbimagenes.

Author contributions
PB, HH, and NT designed the methods. NT implemented them and tested their performance. All authors contributed to the article and approved the submitted version.