Application of Video-to-Video Translation Networks to Computational Fluid Dynamics

In recent years, the evolution of artificial intelligence, especially deep learning, has been remarkable, and its application to various fields has grown rapidly. In this paper, I report the results of applying generative adversarial networks (GANs), specifically video-to-video translation networks, to computational fluid dynamics (CFD) simulations. The purpose of this research is to reduce the computational cost of CFD simulations with GANs. The GAN architecture in this research is a combination of an image-to-image translation network (the so-called "pix2pix") and Long Short-Term Memory (LSTM). It is shown that the results of high-cost, high-accuracy simulations (with high-resolution computational grids) can be estimated from those of low-cost, low-accuracy simulations (with low-resolution grids). In particular, the time evolution of the density distributions in the high-resolution cases is reproduced from that in the low-resolution cases through the GANs, and the density inhomogeneity estimated from the images generated by the GANs recovers the ground truth with good accuracy. Qualitative and quantitative comparisons of the results of the proposed method with those of several super-resolution algorithms are also presented.


Introduction
Artificial intelligence is advancing rapidly and has become comparable to, or even surpassed, humans in several tasks. In generic object recognition, deep convolutional neural networks have surpassed human-level performance (e.g., [5,6], [8]). Agents trained by reinforcement learning are capable of reaching a level comparable to professional human game testers ([14]). In machine translation, Google's neural machine translation system, using Long Short-Term Memory (LSTM) recurrent neural networks ([7,2]), is a typical and famous example, and its translation quality is becoming comparable to that of humans ([22]).
One of the hottest research topics in artificial intelligence is generative models, and one approach to implementing a generative model is generative adversarial networks (GANs), proposed by [3]. GANs consist of two models trained with conflicting objectives. [16] applied deep convolutional neural networks to those two models, an architecture called deep convolutional GANs (DCGAN). DCGAN can generate realistic synthesis images from vectors in the latent space. [9] proposed a network that learns the mapping from an input image to an output image to enable the translation between two images. This network, the so-called pix2pix, can convert black-and-white images into color images, line drawings into photo-realistic images, and so on.
The combination of deep learning and simulation has been researched recently. One such application is to use simulation results to improve the prediction performance of deep learning. Since deep learning requires a lot of data for training, numerical simulations that can generate various data by changing physical parameters could help compensate for the lack of training data. Another application is to speed up computational fluid dynamics (CFD) solvers. [4] used a convolutional neural network (CNN) to predict velocity fields approximately but quickly from the geometric representation of the object. In another example, velocity fields are predicted by a CNN from parameters such as source position, inflow speed, and time ([10]); their method can generate velocity fields up to 700 times faster than simulations. As a more general method, not limited to CFD problems, [17] proposed the physics-informed neural network (PINN), which utilizes a relatively simple deep neural network to find solutions to various types of nonlinear partial differential equations.
GANs have also been combined with numerical simulations to enable a new type of solution method. [1] used the conditional GAN (cGAN) to generate the solution of steady-state heat conduction and incompressible flow from boundary conditions and the calculation domain shape/size. [23] proposed a method for super-resolution fluid flow by a temporally coherent generative model (tempoGAN). They showed that tempoGAN can infer high-resolution, temporal, and volumetric physical quantities from low-resolution data.
The above-mentioned studies on the combination of GANs and simulations show that GANs can generate three-dimensional data of the solution of physical equations. The main topic in this research is the translation of images (distributions of a physical quantity) by GANs. When the accuracy of the simulation is particularly important, a large number of computational grid points is needed. Additionally, the number of simulation cases for design optimization is typically large. This means that the computational cost (machine power and time) becomes large. In such a case, it is important to reduce the computational cost, and one way to do so is to make effective use of low-cost simulations. Based on this idea, I investigated the feasibility of time-series image-to-image translation: translation from time-series distribution plots in the case of low-resolution computational grids to those in the case of high-resolution grids. A quantitative evaluation of the quality of the generated images was also performed.
The method proposed in this paper is a video (sequential images)-to-video translation in which the difference between the solutions of the high- and low-resolution grid simulations is learned. Meanwhile, the PINN constructs universal function approximators of physical laws by minimizing a loss function composed of the mismatch of state variables, including the initial and boundary conditions, and the residual of the partial differential equations ([13]). In other words, the PINN is an alternative to CFD, while the proposed method is a complement to CFD.
The paper is organized as follows. In section 2, I describe the outline of the simulations whose results are input to the GANs and the details of the network architecture. In section 3, I give the results of time-series image-to-image translation (in other words, video-to-video translation) of physical quantity distributions and a discussion mainly about the quality of the generated images. Conclusions are presented in section 4.

Numerical Simulations
I solved the following ideal magnetohydrodynamic (MHD) equations numerically in 2 dimensions to prepare input images to GANs:

$$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \boldsymbol{v}) = 0,$$
$$\frac{\partial (\rho \boldsymbol{v})}{\partial t} + \nabla \cdot \left( \rho \boldsymbol{v} \boldsymbol{v} + p_T \boldsymbol{I} - \boldsymbol{B} \boldsymbol{B} \right) = 0,$$
$$\frac{\partial \boldsymbol{B}}{\partial t} + \nabla \cdot \left( \boldsymbol{v} \boldsymbol{B} - \boldsymbol{B} \boldsymbol{v} \right) = 0,$$
$$\frac{\partial e}{\partial t} + \nabla \cdot \left[ (e + p_T) \boldsymbol{v} - \boldsymbol{B} (\boldsymbol{v} \cdot \boldsymbol{B}) \right] = 0,$$

where ρ, p, and v are the density, pressure, and velocity of the gas; B is the magnetic field (in units where the magnetic permeability is unity); γ represents the heat capacity ratio and is equal to 5/3 in this paper; p_T = p + |B|²/2 and e = p/(γ − 1) + ρ|v|²/2 + |B|²/2 represent the total pressure and the total energy density; I is the unit matrix. One of the typical test problems for MHD, the so-called Orszag-Tang vortex problem ([15]), was solved by the Roe scheme ([18]) with MUSCL (monotonic upstream-centered scheme for conservation laws; [20]). The initial conditions are summarized in Table 1. B_0 is a parameter controlling the magnetic field strength. The computational domain is 0 ≤ x ≤ 1 and 0 ≤ y ≤ 1. The periodic boundary condition is applied in both the x- and y-directions. Simulations for each condition were performed twice, on computational grids with different resolutions: the number of grid points is (N_x × N_y) = (51 × 51) or (251 × 251). In the case of (N_x × N_y) = (251 × 251), the calculation time is more than 70 times longer than in the other case, though the obtained solution is expected to be close to the true solution.
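As a sanity check on the quoted timing gap, the relative cost of the two grids can be estimated with back-of-the-envelope scaling, assuming an explicit scheme whose time step is limited by a CFL condition (so the number of time steps grows linearly with the one-dimensional resolution). The actual factor depends on the implementation; this is only a rough cross-check, not the paper's measured timing.

```python
# Rough cost scaling between the two grids: work per time step scales
# with Nx * Ny, and (under a CFL condition) the number of time steps
# scales with Nx, giving a cubic dependence on the 1-D resolution.
def relative_cost(n_coarse: int, n_fine: int) -> float:
    return (n_fine / n_coarse) ** 3

ratio = relative_cost(51, 251)  # roughly 119
```

The estimate (251/51)³ ≈ 119 is consistent with the observed factor of "more than 70 times".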

Generative Adversarial Network Architecture
After the original concept of GANs was proposed by [3], various GANs have been researched. Among such networks, I focused on pix2pix, which is a type of conditional GAN and a network for learning the relationship between input and output images. The feasibility of translating from the results of low-resolution grid simulations to those of high-resolution grid simulations has been investigated in this research. Furthermore, in order to enable the translation across two time series, an architecture combining pix2pix and LSTM has been constructed.
Figure 1 shows the schematic picture of the architecture of the generator in this research. The role of the LSTM layer is to adjust the image translation depending on the physical time of the simulation; for the initial state of the simulation (T = 0), no translation is needed at all, but as physical time passes, progressively larger translations are needed. Note that the weights of the encoder (decoder) before (after) the LSTM layer are shared in the time direction. Plots of the time evolution of the density in the low-resolution simulations are input to the generator (the plots are read as single-channel images). The input images are converted to vectors by the first-half of a U-shaped network (U-Net). In Figure 2, I show the architecture of the first-half of U-Net in detail. It consists of eight convolutional blocks with a kernel size of (4 × 4) or (2 × 2). Instance normalization ([19]) is applied except for the first and last blocks. The activation function is a leaky rectified linear unit (leaky ReLU; [12]) with a slope of 0.2 for all blocks. A 512-dimensional vector is generated at the end of this architecture.
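The encoder's feature-map bookkeeping can be sketched as follows: eight stride-2 convolutional blocks shrink a 256 × 256 single-channel input down to a 1 × 1 map that is read out as a 512-dimensional vector. The channel widths below are an assumption in the spirit of the original pix2pix encoder, not values taken from Figure 2.

```python
# Feature-map sizes through the encoder: each stride-2 convolution
# halves the spatial size; the widths double from 64 up to 512 and
# are then capped (an assumed pix2pix-style progression).
def encoder_shapes(size=256, widths=(64, 128, 256, 512, 512, 512, 512, 512)):
    shapes = []
    for ch in widths:
        size //= 2                 # stride-2 convolution halves the map
        shapes.append((size, size, ch))
    return shapes

shapes = encoder_shapes()          # last entry is (1, 1, 512)
```

The final (1, 1, 512) map is exactly the 512-dimensional vector that is handed to the LSTM layer.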
A series of 512-dimensional vectors converted from the time-series plots is input to the LSTM layer. An input vector x_t, originating from the plot at time t, is processed together with the hidden state h_{t−1} and memory cell c_{t−1}. A forget gate (f), an input gate (i), an output gate (o), and part of the term to be added to the memory cell (z) in Figure 3 are calculated as follows:

$$\boldsymbol{f} = \sigma \left( W_f \boldsymbol{x}_t + R_f \boldsymbol{h}_{t-1} + \boldsymbol{b}_f \right), \tag{7}$$
$$\boldsymbol{i} = \sigma \left( W_i \boldsymbol{x}_t + R_i \boldsymbol{h}_{t-1} + \boldsymbol{b}_i \right), \tag{8}$$
$$\boldsymbol{o} = \sigma \left( W_o \boldsymbol{x}_t + R_o \boldsymbol{h}_{t-1} + \boldsymbol{b}_o \right), \tag{9}$$
$$\boldsymbol{z} = \tanh \left( W_z \boldsymbol{x}_t + R_z \boldsymbol{h}_{t-1} + \boldsymbol{b}_z \right), \tag{10}$$

where σ is the sigmoid function and tanh is the hyperbolic tangent function; W_• and R_• are the input-to-hidden weight matrices and the recurrent weight matrices; b_• are bias vectors. The hidden state and memory cell are updated by:

$$\boldsymbol{c}_t = \boldsymbol{f} \odot \boldsymbol{c}_{t-1} + \boldsymbol{i} \odot \boldsymbol{z}, \qquad \boldsymbol{h}_t = \boldsymbol{o} \odot \tanh(\boldsymbol{c}_t),$$

where ⊙ denotes the element-wise product. The hidden state h_t is reshaped as (1, 1, 512). The reshaped hidden state h′_t is passed to the latter-half of U-Net and is decoded to image data (see Figure 4). This part consists of eight deconvolutional blocks, each with an upsampling of the feature map, a convolution with a kernel size of (2 × 2) or (4 × 4) (the size of the feature map does not change because the stride of the convolution is 1), instance normalization, and activation by the ReLU function, except for the last block. As seen in Figure 1, the generator outputs synthetic time-series plots of the density distribution.
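The gate and state updates above can be transcribed directly into NumPy. The weights here are random placeholders, so only the computation pattern of equations (7) to (10) and the subsequent state update is illustrated, not the trained behavior.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 512  # dimension of the encoded vector

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Placeholder input-to-hidden (W), recurrent (R) weights and biases (b)
W = {k: rng.standard_normal((d, d)) * 0.01 for k in "fioz"}
R = {k: rng.standard_normal((d, d)) * 0.01 for k in "fioz"}
b = {k: np.zeros(d) for k in "fioz"}

def lstm_step(x_t, h_prev, c_prev):
    f = sigmoid(W["f"] @ x_t + R["f"] @ h_prev + b["f"])  # forget gate
    i = sigmoid(W["i"] @ x_t + R["i"] @ h_prev + b["i"])  # input gate
    o = sigmoid(W["o"] @ x_t + R["o"] @ h_prev + b["o"])  # output gate
    z = np.tanh(W["z"] @ x_t + R["z"] @ h_prev + b["z"])  # cell candidate
    c_t = f * c_prev + i * z        # memory-cell update
    h_t = o * np.tanh(c_t)          # hidden-state update
    return h_t, c_t

# Run a short time series of encoded frames through the cell
h, c = np.zeros(d), np.zeros(d)
for _ in range(5):
    x = rng.standard_normal(d)
    h, c = lstm_step(x, h, c)
h_reshaped = h.reshape(1, 1, d)     # passed to the decoder half of U-Net
```

The reshape at the end mirrors the (1, 1, 512) hidden state handed to the latter-half of U-Net.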
The authenticity of the images is judged by the discriminator. Figure 5 shows the details of the architecture of the discriminator in this research. A real image (a plot of the density distribution in a high-resolution simulation) or a synthesis image is input to the discriminator. It consists of five convolutional blocks with a kernel size of (4 × 4). Instance normalization is applied except for the first and last blocks. Except for the last block, the leaky ReLU function with a slope of 0.2 is applied as the activation function. A 16 × 16 patch is eventually output, and the discriminator classifies each patch as real or synthetic. This architecture is called patchGAN ([9]).
The objective of the network is the same as that of the regular pix2pix:

$$G^{*} = \arg \min_{G} \max_{D} \; \mathcal{L}_{cGAN}(G, D) + \lambda \mathcal{L}_{L1}(G),$$
$$\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y} \left[ \log D(x, y) \right] + \mathbb{E}_{x} \left[ \log \left( 1 - D(x, G(x)) \right) \right],$$
$$\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y} \left[ \lVert y - G(x) \rVert_{1} \right],$$

where G and D denote the generator and discriminator, λ is the weighted-sum parameter and is equal to 100 in this research, and x and y denote the source and target images. G(x) returns a synthesis image, and D(x, y) or D(x, G(x)) returns the probability that y or G(x) is a real target image. L_L1(G) is the mean absolute error (L1 loss) calculated from the pixel-wise comparison between the real image and the synthetic image. The optimizer is Adam with a learning rate of 0.0002.
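The way the two loss terms combine can be spelled out on toy arrays. The discriminator outputs below are stand-in probabilities and the images are tiny constant arrays; nothing is trained, this only shows how the conditional-GAN term and the λ-weighted L1 term add up.

```python
import numpy as np

LAMBDA = 100.0  # the weighted-sum parameter used in this research

def l1_loss(y, g_x):
    # mean absolute error between target and synthesis image
    return float(np.mean(np.abs(y - g_x)))

def cgan_loss(d_real, d_fake):
    # E[log D(x, y)] + E[log(1 - D(x, G(x)))], with scalar stand-ins
    return float(np.log(d_real) + np.log(1.0 - d_fake))

y = np.ones((4, 4))            # stand-in target image
g_x = np.full((4, 4), 0.9)     # stand-in synthesis image
total = cgan_loss(d_real=0.8, d_fake=0.3) + LAMBDA * l1_loss(y, g_x)
```

With λ = 100, even a small pixel-wise error (0.1 per pixel here) dominates the adversarial term, which is the design intent of the L1 weighting in pix2pix.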
Figure 3: The architecture of the LSTM. The input to the LSTM (x_t) is the vector transformed from an image of the density distribution, and the output is the reshaped hidden state vector (h′_t) resulting from several operations. The vector c is the memory cell, and f, i, o, and z are a forget gate, an input gate, an output gate, and part of the term to be added to the memory cell (see equations (7) to (10) for details).
The architecture is implemented using Keras 2.5.0 with TensorFlow as a backend. The model was trained on Google Colaboratory with a Tesla P100-PCIe GPU. To apply the convolution and deconvolution to the sequential data, the sets of operations shown in Figures 2, 4, and 5 are wrapped in the TimeDistributed layer. The skip connections are implemented by concatenating the outputs of the previous upsampling block in the latter-half of U-Net with those of the same-level (that is, same feature-map size) convolutional block in the first-half.

Figure 4: The details of the latter-half of U-Net. The expression "Upsampling2x2" refers to an upsampling layer that doubles the size of the input by copying each value twice horizontally and vertically. From the first-half of U-Net displayed in Figure 2, feature maps are passed to the corresponding blocks and are concatenated to the feature maps output from the previous blocks.

Results and Discussion
In this section, I first show the results of the time-series image-to-image translation for the training datasets and then explain the way to evaluate the quality of the synthesis images quantitatively. The evaluation result of the synthesis images for the training datasets is presented next. Then, I show the results for the testing datasets. Finally, the quality of the synthesis images is compared with that of images upsampled by conventional super-resolution algorithms. The conditions (the magnetic field strength) of the simulations are shown in Table 2, which summarizes the details of the training and testing datasets. Sixteen cases were performed to prepare the training datasets, and nineteen cases were performed to prepare the testing datasets. For each case, two simulations were run, with the high-resolution and the low-resolution grids. Compared to the high-resolution grid cases, the density distributions in the low-resolution grid cases show less fine structure and are closer to uniform. Figure 7 displays the comparison of the inhomogeneity of the density between the high-resolution grid cases and the low-resolution grid cases. The inhomogeneity is defined by α = σ_ρ/⟨ρ⟩, where σ_ρ and ⟨ρ⟩ are the standard deviation and the average of the density. In the low-resolution grid, the numerical diffusion is larger than in the high-resolution grid, and therefore the inhomogeneity of the density tends to be smaller, especially from the middle stage of the vortex development and in the relatively strong magnetic field cases (see Figure 7-(b)). The synthesis images reproduce the fine structures of the density distributions and appear to be well consistent with the high-resolution grid results.
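The inhomogeneity measure α = σ_ρ/⟨ρ⟩ is straightforward to compute from density values on the grid; a minimal sketch with synthetic density fields:

```python
import numpy as np

def inhomogeneity(rho: np.ndarray) -> float:
    # alpha = sigma_rho / <rho>: standard deviation over mean of density
    return float(np.std(rho) / np.mean(rho))

# A uniform field gives alpha = 0; a field with structure gives alpha > 0.
uniform = np.ones((51, 51))
x = np.linspace(0.0, 1.0, 251)
wavy = 1.0 + 0.1 * np.sin(2 * np.pi * x)[None, :] * np.ones((251, 1))
```

As in Figure 7, stronger numerical diffusion (the low-resolution grid) smooths the field and drives α toward zero.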
To quantitatively evaluate the quality of the synthesis images, I estimated the density inhomogeneity from the distribution map. When calculating the density inhomogeneity from a simulation result, we can use the value of the density on each grid point; however, the density distribution maps (including the synthesis images in this research) contain only RGB values. Therefore, to estimate the density inhomogeneity from a distribution map, I trained a three-layer fully connected neural network with 196,608 (256 pixels × 256 pixels × 3 channels) inputs, two hidden layers of 1024 and 128 neurons, and one output layer. Figure 8 shows the result of the inhomogeneity prediction from the density distribution maps. The horizontal axis is the inhomogeneity calculated from the density values on the grids, and the vertical axis is the inhomogeneity predicted from the distribution maps by the trained neural network. The coefficient of determination R² is equal to 0.999. Thus, we conclude that the trained neural network provides an accurate estimation of the density inhomogeneity from the distribution maps and the synthesis images.
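A quick size check of this fully connected estimator, using the layer widths given above, shows the scale of the regression network:

```python
# Parameter count for a fully connected network: for each consecutive
# pair of layer widths, weights (n_in * n_out) plus biases (n_out).
def dense_params(layers):
    return sum(n_in * n_out + n_out for n_in, n_out in zip(layers, layers[1:]))

layers = [256 * 256 * 3, 1024, 128, 1]  # 196608 inputs -> 1024 -> 128 -> 1
n_params = dense_params(layers)         # about 2.0e8 parameters
```

Almost all of the roughly 200 million parameters sit in the first layer, which maps the flattened RGB image to the 1024-neuron hidden layer.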
We can quantitatively evaluate the quality of the synthesis images by inputting them into this neural network and comparing the output inhomogeneity with the inhomogeneity calculated from the high-resolution grid simulation results. Figure 9 shows that the inhomogeneity predicted from the synthesis images matches that calculated from the high-resolution grid simulation results with good accuracy; therefore, the quality of the synthesis images is definitely good for the training datasets.

Results for the Testing Datasets
In the previous subsection, I have shown that the results for the training datasets are quite good. However, the generalization ability needs to be investigated for practical use. The testing datasets (whose magnetic field strengths differ from those of the training datasets, as shown in Table 2), which were not used for training, are input to the trained model, and the synthesis images are output from the generator. Figures 10-(a), (b) show the comparison of the simulation results and the synthesis images for two example cases; from the 19 cases in the testing datasets, the cases with B_0 = 0.75 and 1.7 were selected for Figure 10. Figure 11 is almost the same as Figure 9 but for the testing datasets. The density inhomogeneity predicted from the synthesis images through the fully connected neural network (explained in the previous subsection) is in good agreement with the inhomogeneity calculated from the results of the high-resolution grid simulations. This result indicates that the method in this research achieves high generalization ability.

Comparison with conventional super-resolution algorithms
To demonstrate the effectiveness of the proposed method and the quality of the generated images, I compare the results with those obtained by conventional super-resolution algorithms. The algorithms investigated here are bicubic interpolation, Lanczos interpolation, and the Laplacian Pyramid Super-Resolution Network (LapSRN; [11]). The pixel size of the image used as the basis of the super-resolution is 64 × 64, and each algorithm quadruples the pixel size. These results were compared qualitatively and quantitatively with the result of the high-resolution grid simulation and the image generated by the proposed method. Plots of the density distribution in the high-resolution simulations in the training datasets were used to train LapSRN. I applied the super-resolution algorithms to the testing datasets (380 images). As an example, the results for the B_0 = 1.7 and T = 0.38 case are compared in Figure 12. In this case, none of the three conventional super-resolution algorithms works with a quality comparable to the method proposed in this research. To compare the proposed method with the others quantitatively, the pixel-wise mean squared error (MSE) and the structural similarity index measure (SSIM; [21]) are calculated between the ground truth image and the synthesis image or the result of super-resolution. Figure 13 shows that the quality of the synthesis images by the proposed method is significantly higher than that of the results of the conventional super-resolution algorithms.

Figure 11: Comparison of the inhomogeneity of the high-resolution grid simulation results and the inhomogeneity predicted from the synthesis images for the testing datasets.
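The two metrics can be sketched as follows. Note that this SSIM is a single-window (global) simplification of the locally windowed SSIM of [21]; the constants c1 and c2 follow the commonly used defaults.

```python
import numpy as np

def mse(a, b):
    # pixel-wise mean squared error
    return float(np.mean((a - b) ** 2))

def ssim_global(a, b, data_range=1.0):
    # Single-window SSIM: luminance/contrast/structure compared over the
    # whole image instead of sliding local windows (a simplification).
    c1, c2 = (0.01 * data_range) ** 2, (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return float(((2 * mu_a * mu_b + c1) * (2 * cov + c2))
                 / ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)))

rng = np.random.default_rng(1)
truth = rng.random((64, 64))   # stand-in ground-truth image
```

For identical images, MSE is 0 and SSIM is 1; lower MSE and higher SSIM both indicate closer agreement with the ground truth.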

Application of this research
In this subsection, I discuss an application of this research. As mentioned above, the results of high-computational-cost simulations can be estimated from those of low-cost simulations by the method in this paper. However, it is important to note that simulation results of quite a few cases are needed to train the network. Therefore, the method is not beneficial for a small number of simulations; the more simulations are required, the greater the benefit. One such case is optimization based on CFD simulations. As the number of objective variables to be optimized increases, the number of calculations required to obtain the desired performance is expected to increase; in some cases, several thousand cases must be evaluated. In such multi-objective optimization simulations, for example, the first dozens to several hundred cases are simulated on both high- and low-resolution grids, and the results are used to train the GANs. After the GANs are trained, low-resolution grid simulations are run, the results are input to the GANs to reproduce the results of high-resolution grid simulations, and the objective variables are estimated from the synthesis images by, for example, a neural network. I demonstrate the estimation of the computational cost reduction. If the number of simulations required originally and the number needed to train the GANs are N (several thousand in some cases) and N_t (N > N_t), the calculation times of the high- and low-resolution grid simulations are T_h and T_l (T_h > T_l), and the computational cost to train the GANs is T_t, the computational cost reduction is roughly equal to

$$N T_h - \left\{ N_t \left( T_h + T_l \right) + \left( N - N_t \right) T_l + T_t \right\},$$

where the first term corresponds to the computational cost in the case that all simulations are run on the high-resolution grid, and the second term corresponds to that in the case that the method in this research is applied (the cost of reproducing the results of high-resolution grid simulations by the GANs is negligible compared to performing the simulations). In this way, by substituting low-resolution grid simulations and the result conversion by the GANs for a large part of the high-resolution grid simulations, a great reduction of the computational cost should be achieved.
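The cost-reduction estimate above can be turned into a small helper. The numbers in the usage example are illustrative placeholders, not measured timings; the factor of 70 between T_h and T_l echoes the grid-timing ratio reported in section 2.

```python
# Cost reduction of the proposed workflow versus running every case on
# the high-resolution grid: N cases in total, the first n_t run on both
# grids to train the GANs (training cost t_train), the rest only on the
# low-resolution grid (GAN inference cost taken as negligible).
def cost_reduction(n, n_t, t_h, t_l, t_train):
    all_high = n * t_h
    with_gan = n_t * (t_h + t_l) + (n - n_t) * t_l + t_train
    return all_high - with_gan

# e.g. 3000 cases, 200 used for training, high-res runs 70x slower
saving = cost_reduction(n=3000, n_t=200, t_h=70.0, t_l=1.0, t_train=500.0)
```

With these placeholder numbers the saving is about 92% of the all-high-resolution cost, which illustrates why the approach pays off only when N is much larger than N_t.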

Conclusions
In this paper, I validated the idea of using GANs to reduce the computational cost of CFD simulations. I studied the idea of reproducing the results of high-resolution grid simulations, which have a high computational cost, from those of low-resolution grid simulations, which have a low computational cost. More specifically, time-series distribution maps of a physical quantity were reproduced using a combination of pix2pix and LSTM. The quality of the reproduced synthesis images was good for both the training and testing datasets. The conditions treated in this paper are simple: the computational region is a square with a constant grid interval, the boundary conditions are periodic, and the governing equations are the ideal MHD equations. In the next step, I need to examine the idea under more realistic conditions.

Figure 1 :
Figure 1: Schematic picture of the architecture of the generator in this research. The generator in the original pix2pix network is a U-shaped network (U-Net). In this research, the LSTM layer is inserted into the middle of U-Net. Skip connections from the first-half of U-Net to the latter-half, over the LSTM layer, are implemented.

Figure 2 :
Figure 2: The details of the first-half of U-Net. The expression "conv4x4 64" refers to a convolutional layer with a kernel size of (4 × 4) and 64 channels. Each feature map is copied and concatenated to the feature map of the corresponding block in the latter-half of U-Net shown in Figure 4.

Figure 5 :
Figure 5: The details of the architecture of the discriminator in this research.

Figure 6
Figure 6 shows two examples of the time-evolution of density distribution for the training datasets. The top and bottom images of Figure 6-(a), (b) show the simulation results, and the middle images are synthesis ones generated from the top ones (the results of low-resolution grid simulations) through the generator.

Figure 6 :
Figure 6: Two examples of the time-evolution of density distribution for the training datasets.

Figure 7 :
Figure 7: Comparison of the inhomogeneity of the density between the high-resolution grid cases and the low-resolution grid cases for the training datasets. (a) The inhomogeneities for all time-series and all magnetic field strength cases are plotted. (b) The inhomogeneities for T ≥ 0.12 and B_0 ≥ 0.6 are plotted.

Figure 8 :
Figure 8: Comparison of the inhomogeneity calculated from the density values on the grids and the inhomogeneity predicted from the distribution maps. The coefficient of determination R² is equal to 0.999.

Figure 9 :
Figure 9: Comparison of the inhomogeneity of the high-resolution grid simulation results and the inhomogeneity predicted from the synthesis images for the training datasets.

Figure 10 :
Figure 10: Examples of the time-evolution of density distribution for the testing datasets.

Figure 12 :
Figure 12: Comparison of the results of the conventional super-resolution algorithms with that of the proposed method and ground truth.

Figure 13 :
Figure 13: Box plots of the pixel-wise mean squared error (MSE) and the structural similarity index measure (SSIM) calculated in the testing datasets (380 images).

Table 1 :
The initial conditions of simulations

Table 2 :
The details of the training and testing datasets