GPU Accelerated Multiple-Relaxation-Time Lattice Boltzmann Simulation of Convective Flows in a Porous Media

A two-dimensional (2D) multiple-relaxation-time (MRT) lattice Boltzmann method (LBM) is used for porous media with the Brinkman-Forchheimer extended Darcy model to investigate the natural and mixed convection ﬂows in a square cavity. This Brinkman-Forchheimer model is directly used through the forcing moments as source term. A Tesla K40 NVIDIA graphics card has been used for the present Graphics Process Unit (GPU) parallel computing via Compute Uniﬁed Device Architecture (CUDA) C platform. The numerical results are presented in terms of the velocity, temperature, streamlines, isotherms and local and average Nusselt numbers. For the wide range of Rayleigh numbers, ( Ra = 10 3 to 10 10 ), Reynolds numbers, Darcy numbers and the porosities, the average Nusselt number is compared with the available results computed by ﬁnite element method (FEM) and single-relaxation-time (SRT) lattice Boltzmann method and the agreement shows quite well. The results are also validated with the previous experimental results. The simulations speed up maximum 144x using CUDA C in GPU comparing with the time of FORTRAN 90 code using a single core CPU simulation.


Introduction
In recent years, Graphics Process Units (GPUs) play a large role in using High-Performance Computing (HPC) due to the significant larger performance than the traditional Central Process Unit(CPU) based processors. A GPU has many slim processing units on a single chip and performs in parallel a very large number of operations on a correspondingly large number of operands. GPU computing is a heterogeneous simulation system, a small part is run sequentially by the CPU and the major part is run in GPU. It has already been established that the GPU parallel computing is well accepted due to the remarkable floating point arithmetic performance. Now the GPU is considered a computational accelerator processing a massively multi-threaded architecture which has been widely used for graphical and general purpose computations, such as molecular dynamics simulations and computational fluid dynamics [1,2]. Fluid flow and convection heat transfer through porous media is an interesting and useful research area due to its application in the fields of science and engineering, such as hydrology, civil and mechanical engineering, chemical engineering and petroleum engineering, thermal management of electronic cooling and the improvement of heat transfer system, (Guo and Zhao [3]). Guo and Zhao [4] first proposed Bhatnagar-Gross-Krook (BGK) or single-relaxationtime lattice Boltzmann method for the isothermal incompressible flow in porous media. In their model, porosity parameter introduced through equilibrium distribution function and added the forcing term into the evolution equation to account for the linear and nonlinear drag forces of media using Darcy's term and {blueForchheimer's term. They successfully applied their model for the two-dimensional (2D) Poiseuille flow, Couette flow, and lid-driven cavity flow and compared their LBM results with the analytical and finite difference solutions. Latter on, Guo and Zhao [3] extended their study of the convective heat transfer using the same single-relaxation-time lattice Boltzmann method (SRT-LBM) and successfully simulated the temperature field for the natural convection of side-heated cavity and mixed convection of the channel flow through porous media.
Following Guo and Zhao [4], Seta et al. [5,6] studied the Poiseuille flow and natural convection flow in a side-heated cavity filled with porous media using the SRT-LBM for the non-Darcy regime considering the Rayleigh number 10 3 ≤ Ra ≤ 10 6 . For the high Rayleigh number 10 3 ≤ Ra ≤ 10 10 , an extensive investigation has been done by Vishnampet et al. [7] considering the Darcy and non-Darcy regime using SRT-LBM and compared their results with available numerical and experimental results. The above-mentioned SRT-LBM has many advantages in the computational viewpoint, is intrinsic parallelism of the algorithm and the simplicity of programming over traditional finite difference, finite volume and finite element methods. It has been successfully used in various complex fluid system, such as, multiphase flow [8], suspension in fluids [9], magneto-hydrodynamics flow [10,11], nanofluids [12] and flow through porous media with variable porosity [3]- [7]. Even though these advantages, it has some shortcomings, leads numerical instability when the dimensionless relaxation time τ is close to 0.5 [13].
It has already been established that the shortcomings in SRT-LBM are removed by using the multiple-relaxation-time lattice Boltzmann method (MRT-LBM). The MRT-LBM is numerically more stable, and it has more degrees of freedom than the SRT-LBM. Initially, Lallemand and Luo [14] constructed a generalized lattice Boltzmann equation in moment space rather than in discrete velocity space based on d'Humieres [15], that is now familiar as multiple-relaxationtime lattice Boltzmann method. Lallemand and Luo [14] analyzed the stability of the model and compared with the BGK LB equation model and found that the mechanism of separate relaxations for the kinetic modes leads to a model which was much more stable than the BGK LB equation model. There are many studies have been done by MRT-LBM, and from them, some are cited in the following sections.
Hua et al. [16] simulated high Reynolds number flow in a lid-driven cavity by MRT-LBM for Re ranges from 20000 to 100000 and they mentioned that was the first attempt to simulate 2D cavity flow for maximum Re = 100000. Using the MRT-LBM two lid-driven cavity flow had been investigated by Guo et al. [17]. Du et al. [13] proposed an incompressible MRT-LB model and simulated the lid-driven cavity flow for Re = 100000. They also simulated the double shear layer flow for high Reynolds number and showed that the numerical results are better than that of SRT-LBM. Computation of channel flow past a square cylinder with an upstream control, bi-partition has been done using the MRT-LBM by Moussaoui et al. [18]. Du and Liu [19] investigated the natural convection flow in a side-heated cavity using the MRT-LBM for the fluid flow and SRT-LBM for the temperature field. Computation of heat transfer and fluid flow in an obstructed channel was done by Moussaoui et al. [20]. They solved the fluid flow LB equation using the MRT technique and the energy equation for temperature by the finite difference method. Few articles are available in the literature with double MRT-LBM for velocity and thermal flow simulation. Guo et al. [21] investigated the mixed convection flow in a slender rectangular cavity with the D2Q9 model for velocity and D2Q5 model for the temperature simulation. For the pure fluid flows and heat transfer simulation have been studied in [22,23,24,25,26] using double MRT-LBM. These studies are conducted the CPU simulations.
Multi relaxation time lattice Boltzmann simulations of transitional flows in a deep 2D liddriven cavity is investigated by [27] using GPU computing. They concluded that the GPU showed the efficient performance for larger grid size problems. Xu et al. [28] have studied the double MRT lattice Boltzmann simulation of lid-driven and side-heated cavity flows using GPU in the directive-based OpenACC platform and found that the optimized data layout gives a better performance over CPU. Ren and Chan [29,30] studied SRT-LBM simulation of natural convection flow using the GPU in CUDA platform. They also investigated the SRT-LBM for velocity and MRT-LBM for total enthalpy simulation of PCM melting process in an enclosure using GPU computing [31].
In this paper, a multiple-relaxation-time lattice Boltzmann method is used for porous media with the Brinkman-Forchheimer extended Darcy model to investigate the natural and mixed convection flows in a side or top heated square cavity using GPU computing via CUDA C platform. A FORTRAN 90 code has also been used for comparing the CPU simulation time with the simulation by CUDA C in GPU. The Brinkman-Forchheimer model is used as a source term as forcing moments . Following [21]- [25] in double MRT-LBM, the fluid flow is simulated by D2Q9 model and the temperature field by D2Q5 model. In the natural convection simulation, the Rayleigh number is considered from Ra = 10 3 to 10 10 , Darcy number Da = 10 −2 to 10 −7 and the porosity parameter is ǫ = 0.4 and 0.6. The Reynolds numbers Re = 400 and 1000 and the Grashof number Gr = 100 are considered for the mixed convection case with the same ǫ.

Governing Equations in Macro Scale
The non-dimensional governing equations for the laminar natural and mixed convection incompressible 2D flows for porous media are [32]: ∂u y ∂t where u x and u y are the velocities of the fluid along the x and y directions, p is the pressure,θ is the temperature of the fluid and ǫ is the porosity of the porous media. Here A, B and C varies for the natural and mixed convection case and D is the body force term for the porous media that is given in the next section. In the Natural convection case: A = P r, B = RaP r, C = 1 and in the Mixed convection case: A = 1/Re, B = Gr/Re 2 and C = 1/ReP r, where P r is the Prandtl number, Re is the Reynolds number, Ra is the Rayleigh number and Gr/Re 2 is the Richardson number respectively. For the present simulation, the corresponding lattice Boltzmann model of the above equations are described below:

Formulation of the problem in LBM
In this paper, it has been employed D2Q9 lattice model MRT-LBM for simulating the fluid velocity field and D2Q5 lattice model MRT-LBM (Trouette [25]) for simulating the temperature distribution, which is briefly described below:

Multiple-Relaxation-Time Lattice Boltzmann for Fluid Flow
The multiple-relaxation-time lattice Boltzmann equation for fluid flow the collision operation can be generalized as where Ω is the collision operator and F i are components of the body force which will be defined later on. It is convenient to perform the collision process in the momentum space instead of velocity space. So Eq. (5) can be transformed as where The collision matrix S in a moment space is a diagonal matrix. The nine eigen-values of S are all between 0 and 2 so as to maintain linear stability and the separation of scales, which means that the relaxation times of non-conserved quantities are much faster than the hydrodynamic time scales. The LBGK model is the special case in which the nine relaxation times are all equal, and the collision matrix S = 1 τ I, where I is the identity matrix. The diagonal collision matrix S can be written as where Here τ is the relaxation time relates to the kinematic viscosity of fluid where c 2 s = 1 3 . The body force F encompasses the viscous diffusion, the inertia due to the presence of a porous media, and an external term (Seta et al. [5]). The widely used Ergun's relation [33], the body force for the porous media can be defined as where K def = DaH 2 is the permeability of the porous medium and G = gβ(T − T m ) is the buoyancy term. Here Da is the Darcy number, H is the height of the cavity and T m = T h + T c is the reference temperature, T h and T c indicate the temperature for the heated and cold walls respectively. Including the porosity in the equilibrium moment vector m eq is where j x = ρu x and j y = ρu y . Here u x is the horizontal and u y is the vertical velocity components of the fluid.
The discrete velocity components of the two dimensional (D2Q9) lattice model are: (see Figure 1(a)), where, c def = ∆x/∆t is the lattice speed, where ∆x = ∆t = 1 . The macroscopic fluid density ρ and velocity u are obtained from the moments of the distribution function as follows:

Multiple-Relaxation-Time Thermal Lattice Boltzmann
Similarly, the multiple-relaxation-time thermal lattice Boltzmann the collision operation can be written as where g = (g 0 , g 1 , g 2 , ......, g n ) T is the thermal distribution function. Here D2Q5 model has been used for the thermal lattice Boltzmann method. The collision matrix N for D2Q5 is The five eigen-values of S are all between 0 and 2. The diagonal collision matrix S can be written as The values of s i is given in detail in Troutte [25]. They are These parameters corresponds to the thermal diffusivity For the D2Q5 thermal LB model, a depends on the thermal diffusivity α and is less than 1 to avoid the numerical instability.
Corresponding to the distribution function g i , the equilibrium moments m eq is In this case, for the 5-bit two dimensional (D2Q5)lattice the discrete velocities are (see The temperature T can be calculated as

Application of the MRT-LBM to convective flows in a square cavity
To show the applicability of this MRT-LBM for porous media for the convective flows two cases have been considered, one is for natural convection and the other one is mixed convection flow. For the present problem, the Boussinesq approximation is used in which all fluid properties are assumed to be constant, but the density whose variation with temperature is allowed through the buoyancy term.

Porous media
Adaibatic wall Adaibatic wall

Boundary Conditions
For the no-slip wall the bounce-back boundary condition is applied on velocity fields. As like Trouette [25], the incoming unknown distribution function At a wall with a fixed temperature T w , the following boundary condition is applied for the D2Q5 model MRT thermal lattice Boltzmann simulation: For an adiabatic wall the anti-bounce-back condition is applied

Average rate of heat transfer
In the heat transfer problem, the interest in calculating the rate of heat transfer in terms of the local and average Nusselt number that are defined below respectively: For natural convection flow: and For mixed convection flow: and where H is the height and L is the lenght of the cavity.

Implementation of Forcing Term in MRT-LBM
The implementation of the forcing term (11) by adding explicitly with nine forcing moments as Guo and Shu [34]: With the above forcing moments, the collision process of MRT-LBM is implemented in moment space as

CUDA C Programming in GPU
The GPU computing is a heterogeneous simulation system, a small part is done sequentially by the CPU and the major part is simulated by GPU parallel computing. CUDA programming model mainly depends on the concept of the kernel. A kernel is a function that is executed in concurrent threads on the GPU. In NVIDIA Tesla k40 architecture, maximum 1024 threads formed a block and blocks are grouped into execution grid ( Figure 3). In CUDA, there is two programming language, one is CUDA FORTRAN and the other one is CUDA C (or C++). In this study, we used CUDA C that is a slight modification of C programming language (NVIDIA [35]). For example, in a CUDA C programming, a new keyword is introduced as: global for the device (GPU) code and this function is calling from the main code using a triple angle <<< ... >>>. The heterogeneous simulation system in a CUDA code can be classified into four categories (Obrecht et. al. [36]): (i) Sequential functions run by the CPU (ii) Launching functions allowing the CPU to start a kernel (ii) Kernel run by GPU and (iv) Auxiliary functions that are inlined into the kernel at compile time.

Data structure modification: AoS to SoA
One of the most important changes of the data structure modification in the GPU code from the array of structure (AoS) to the structure of array (SoA). In CPU based LBM code, the data layout of the distribution function is usually arranged as AoS because CPU can use of cache memory. For example, the distribution function f i (x, t) is stored with index (i + 9 × x + 9 × N x × y) for the D2Q9 model, as depicted in Figure. 4. But GPU is functioning based on single instruction multiple threads (SIMT) execution model. In CUDA C, to meet the requirement of coalescing memory access, the data layout should be changed to SoA in LBM implementation. Then the distribution function f i (x, t) is stored with index (x + N x × y + N x × N y × i), so that the parallel threads running the same instruction can access consecutive locations in memory [37,28].

Results Discussion
In this paper, a new approach is proposed for the convective heat transfer from the fluid saturated porous media using the GPU parallel computing via CUDA C platform. Before conducting the main simulation the present CUDA C code has been validated for the liddriven cavity flow, natural convection flow for side heated square cavity and mixed convection lid-driven cavity flow. The different cases are given below: In order to validate the fluid flow and heat transfer case, the MRT-LBM results in terms of the average Nusselt number, N u, obtained from the simulations are compared with the available data of SRT-LBM [5], finite difference data of de Vahl Davis [40], finite Element data of Nithiarasu et al. [32] for the clear fluid (Pr =0.71) in a side-heated square cavity under the same boundary conditions for the laminar flow where 10 3 ≤ Ra ≤ 10 6 . For the transition-toturbulent flows, the present results are also compared with the available results of SRT-LBM of Vishnampet et al. [7], Dixit and Babu [42] and the spectral method of Le Quéré [43] where the Ra ranges from 10 7 to 10 9 . In these cases, the Darcy number Da = 10 8 and the porosity ǫ = 0.9999 have been considered. The results are given in Table 2 and the agreement shows quite well up to Ra = 10 8 . For Ra = 10 9 , the average Nusselt number of Dixit and Babu [42] is 57.35 but in the present case it is 54.56. Figure 7(a)-(c) depicts the streamlines and isotherms for the transition-to-turbulent flows while P r = 0.71 and Ra = 10 7 to Ra = 10 9 while Da = 10 8 and ǫ = 0.9999. From these figures, it has been seen clearly the pattern of the streamlines and isotherms for Ra = 10 7 and 10 8 coincides with the available results in the literature, but for Ra = 10 9 varies slightly. Since the results of Ra = 10 7 and 10 8 are for transitional flows, but the flow is turbulent for Ra = 10 9 (Dixit and Babu [42]).

Natural convection flow with porous media for
Ra = 10 3 to 10 10 In Table 3, the present numerical results are compared with the previous experimental results of Sathe et al. [41] for the different Ra, P r and Da while the aspect ratio A = 10. The comparison shows that the agreement is excellent and the maximum error is 2.2%. So, the present MRT-LBM method is suitable for simulating the flow phenomena and heat transfer for the porous media. The streamlines and isotherms for the low Rayleigh number, Ra ranges from 10 3 to 10 6 while Da = 10 −2 and ǫ = 0.6 are depicted in the Figure 8 Figure 9(a)-(d) show the streamlines and isotherms for the relative high Rayleigh number with Darcy (Da ≤ 10 −6 ) and non-Darcy (Da ≥ 10 −4 ) regime while ǫ = 0.4. The pattern of the streamlines and isotherms are clearly different for the Darcy and non-Darcy region and they are qualitatively similar with the results of (Dixit and Babu [42]).
The effects of the Rayleigh number on the velocity distribution are depicted in Figure 11(a)-(b) while Da = 10 −2 and ǫ = 0.4. It is seen that the velocity increase while increasing the Rayleigh numbers. For both velocity distribution, larger velocity occurs near the walls and the minimum velocity occurs at the center of the cavity where minimum values of stream function occur. The local rate of heat transfer in terms of the local Nusselt number N u is illustrated in the Figure 11 In the Table 4, the average Nusselt number, N u, has been inserted for the different Ra = 10 3 to 10 10 , Darcy number Da = 10 −2 to 10 −7 and the porosity ǫ = 0.4 and 0.6. The results obtained by present MRT-LBM are compared with the results obtained by the finite element method   [45,46], Khalek [47], Tiwari and Das [48] and Kefayati et al. [49] for Gr = 100, and P r = 0.71.

Re
Ri of Nithiarasu et al. [32], SRT-LBM of Guo and Zhao [3] and the SRT-LBM of Vishnampet et al. [7]. The comparison shows the agreement is quite excellent. A set of new results are presented for very high and very small Darcy numbers that are presented in the bottom the Table 4. For ǫ = 0.4 and 0.6 the Average Nusselt number increases while the Rayleigh increases from Ra = 10 8 to 10 10 keeping the Darcy number fixed at Da = 10 −7 .

Mixed convection flow in a lid-driven square cavity
In the mixed convection case, the fluid velocity are non-dimensionalised by the lid velocity U = 0.1 and the temperature as like natural convection flow. Firstly, the code is validated while the heated lid is moving along x-direction in pure fluid case considering the Gr = 100 and the Re = 100, 400 and 1000. Figure 12(a)-(c) shows the velocity and temperature distribution respectively while Re = 400, Gr = 100, Da = 10 8 and ǫ = 1. These velocity and temperature distribution are compared with the results of Iwatsu et al. [44] and Khanafer et al. [45] that shows an excellent agreement. Another comparison has been made regarding the average Nusselt number, N u, that is shown in Table 5 for the three different Reynolds number, Re = 100, 400 and 1000, and Gr = 100. The agreement of the average Nusselt number with the availbale results of Iwatsu et al. [44], Khanafer et al. [45,46], Khalek [47], Tiwari and Das [48] and Kefayati et al. [49] is quite acceptable. Figure 13(a)-(d) depict the streamlines (top) and isotherms (bottom) for the two different Re while P r = 0.71, Gr = 100. For Re = 400 and 1000 there three vortex, primary,secondary and tertiary. The primary vortex spans in the whole cavity except the right and left the bottom corner. In the isotherms, it is seen the maximum temperature occurs near the top wall and the minimum at the bottom wall and qualitatively these results agree with results of Iwatsu et al. [44] and Khanafer et al. [45].

Mixed convection flow in lid-driven square cavity with porous media
The effects of the Darcy number, Da on the flow field and temperature distribution is illustrated in the Figure 14 Table 5 for the different Reynolds number,Re, the Darcy number, Da and for ǫ = 0.4 and 0.6 while Gr = 100. For the higher Re, the N u increases but for decreasing Da, reducing the average rate of heat transfer. The effects of the porosity are also significant in the rate of heat transfer. The average rate of heat transfer decreases for increasing the values of porosity ǫ that is opposite to the natural convection case.  Table 7 shows a comparison of the GPU parallel performance over a sequential CPU performance in terms of the simulation time per iteration step. The GPU performance is higher than CPU and the speed up is calculated as a ratio of CPU simulation time over GPU time. From this table, it is interesting to see that for 128 × 128 lattice size the simulation in GPU speeds up approximately 19 times than the CPU but for 2048 × 2048 it is 144 times faster. The performance of the GPU implementation strongly depends on the grid size. So, for larger grid problems the GPU performance is better than the lower grid size. Achieving better numerical results using LBM requires larger computational grid size. This requirement is the crucial factor for the overall performance optimization. Since CPU has more latency in processor clock speed than GPU hardware, in the case of the smaller computational grid the performance of the numeric model using sequential programming in CPU surpasses the model implemented in GPU parallel computing environment. However, GPU has more throughput than CPU due to it's many processor architectures which are very suitable for parallel task execution of numerical calculations. GPU hardware consists an array of scalar processors which are executed in a group predefined in terms of GPU hardware. During the invocation of the parallel code in CUDA computing environment, GPU hardware requires a minimum amount of grid size allocation in global memory bus to occupy all it's processors for a synchronized task execution. Lower latency in device to host memory transfer is seen in smaller grid allocation which can potentially degrade the overall performance of the numerical analysis. This latency issue can be significantly evaded using larger grid allocation for hiding the memory issues by instantaneous data sharing among large amount of scalar processors. Moreover, the memory latency issue can be avoided with the assignment of the larger grid for the numerical calculation. Thus, the overall performance can be improved with gradually incrementing the grid size of the numerical model. Even, the performance speed up becomes substantially better in GPU than CPU programs whenever grid size is incremented. According to Lin et al. [27], for larger computational grids require more arithmetic operations, that would hide the memory latency and hence show a greater parallel performance.

Conclusions
In this paper, a double multiple-relaxation-time lattice Boltzmann method is proposed for the porous media with the Brinkman-Forchheimer extended Darcy model to simulate the natural and mixed convection flows in a square cavity. The numerical simulation has been done using the state-of-the-art GPU parallel computing via CUDA C platform. For the porous media, the Brinkman-Forchheimer model is directly used as the source term through the equilibrium distribution function. This approach is completely new that differs with the approach used in the single-relaxation-time lattice Boltzmann method. The numerical results for the natural convection case are computed for the wide range of Rayleigh numbers, 10 3 ≤ Re ≤ 10 10 , Darcy number 10 −2 ≤ Da ≤ 10 −7 and the porosity parameter ǫ = 0.4 and 0.6. In the mixed convectional case, the simulations are done for the Reynolds number Re = 400, 100 and Grashof number Gr = 100 with 10 −1 ≤ Da ≤ 10 −2 .
For increasing the Ra in natural convection and Re in mixed convection the velocity increases but in both cases the velocity decreases while the Darcy number decreases. The average Nusselt number, N u, increases for increasing the porosity ǫ in natural convection but in mixed convection case opposite phenomena has happened.
The present results are compared with the available results computed by the finite difference, finite element method and single-relaxation time lattice Boltzmann method, and the comparison shows better agreement indeed. It is also well known that the MRT-LBM is superior to SRT-LBM in terms of the numerical stability. The forcing term implemented by nine discrete forcing moments that are added separately with the moment's space. The present proposed approach for the porous media with MRT-LBM can be used for other application in fluid flow and heat transfer simulation.
Using the CUDA C, the existing MRT-LBM FORTRAN 90 code has been re-written for GPU parallel computing that speeds up the simulation significantly. Precisely, in this Tesla k40 GPU it is 144 times faster than the core i7 CPU simulation with 2048 × 2048 lattice size. For the larger grids problem, GPU is more efficient than the smaller grids problems.