A Numerical Study of Scalable Cardiac Electro-Mechanical Solvers on HPC Architectures

We introduce and study some scalable domain decomposition preconditioners for cardiac electro-mechanical 3D simulations on parallel HPC (High Performance Computing) architectures. The electro-mechanical model of the cardiac tissue is composed of four coupled sub-models: (1) the static finite elasticity equations for the transversely isotropic deformation of the cardiac tissue; (2) the active tension model describing the dynamics of the intracellular calcium, cross-bridge binding and myofilament tension; (3) the anisotropic Bidomain model describing the evolution of the intra- and extra-cellular potentials in the deforming cardiac tissue; and (4) the ionic membrane model describing the dynamics of ionic currents, gating variables, ionic concentrations and stretch-activated channels. This strongly coupled electro-mechanical model is discretized in time with a splitting semi-implicit technique and in space with isoparametric finite elements. The resulting scalable parallel solver is based on Multilevel Additive Schwarz preconditioners for the solution of the Bidomain system and on BDDC preconditioned Newton-Krylov solvers for the non-linear finite elasticity system. The results of several 3D parallel simulations show the scalability of both linear and non-linear solvers and their application to the study of both physiological excitation-contraction cardiac dynamics and re-entrant waves in the presence of different mechano-electrical feedbacks.


INTRODUCTION
In recent years, several areas of medicine, and in particular cardiology, have undergone a cultural revolution generated by new findings that have emerged from molecular biology. This new knowledge has helped to identify, for each disease and for each patient, the specific mechanisms of the disease and the resulting medical treatments, leading to the so-called personalized medicine. For example, the use of mathematical models with parameters for the individual patient-specific characteristics could allow cardiologists to predict the effectiveness of anti-arrhythmic drug treatments or the proper installation of implantable defibrillators (see e.g., Nordsletten et al., 2011;Constantino et al., 2012;Lamata et al., 2015;Trayanova and Chang, 2016).
The spatio-temporal evolution of the electrical impulse in the cardiac tissue and the subsequent process of cardiac contraction-relaxation are quantitatively described by the cardiac electromechanical coupling model, which consists of the following four sub-models: • the static finite elasticity model describing the deformation of cardiac tissue, derived from an anisotropic strain energy function which characterizes the passive mechanical properties of the myocardium; • the active tension system of non-linear ordinary differential equations (ODEs), describing the dynamics of the intracellular calcium, cross-bridge binding and myofilament tension; • the anisotropic Bidomain model of the cardiac tissue, which is a non-linear system of two partial differential equations (PDEs) of reaction-diffusion type, describing the spatiotemporal evolution of the intra-and extracellular electric potentials in the cardiac tissue; • the ionic membrane model of the cardiac myocyte, a stiff system of ODEs describing the dynamics of ionic currents, gating variables, ionic concentrations and stretch-activated channels.
The theoretical and numerical challenges posed by this complex non-linear electro-mechanical model are very interesting. Indeed, the theoretical analysis of the well-posedness of the cardiac electro-mechanical coupling model is still an open problem, as well as the convergence analysis of its finite element approximation. On the numerical level, the very different space and time scales associated with the electrical and mechanical submodels, as well as their non-linear and multiphysics interactions, make the approximation and simulation of the cardiac electromechanical coupling model a very demanding and expensive computational task. In the last decade, several groups have performed cardiac computational studies based on three-dimensional electrical and electro-mechanical simulations (see Pathmanathan and Whiteley, 2009;Göktepe and Kuhl, 2010;Keldermann et al., 2010;Gurev et al., 2011;Trayanova et al., 2011;Land et al., 2012b;Nobile et al., 2012;Rossi et al., 2012;Dal et al., 2013;Sundnes et al., 2014;Favino et al., 2016). However, the computational costs required by the solution of the mathematical models describing the cardiac bioelectrical and mechanical activity are still too high to allow their use in a clinical setting. Therefore, there is a strong effort in the research community to develop effective computational tools and to speedup the simulation of the cardiac electro-mechanical activity (see e.g., Vázquez et al., 2011;Lafortune et al., 2012;Washio et al., 2013;Aguado-Sierra et al., 2015;Gurev et al., 2015;Land et al., 2015;Augustin et al., 2016).
The goal of this work is to study the performance of our parallel electro-mechanical solver in three-dimensional left-ventricular simulations on two different HPC (High Performance Computing) architectures. The finite element parallel solver we have developed is based on Multilevel Additive Schwarz preconditioners accelerated by PCG for solving the discretized Bidomain system and on Newton-Krylov methods with Balancing Domain Decomposition by Constraints (BDDC) preconditioners for solving the discretized non-linear finite elasticity system. Extensive numerical simulations have shown the scalability of both linear and non-linear solvers and their effectiveness in the study of the physiological excitationcontraction cardiac dynamics and of re-entrant waves in the presence of different mechano-electrical feedbacks.
The paper is organized as follows. The main four electromechanical cardiac sub-models are briefly introduced in section 2 and discretized in time and space in section 3, where the main computational kernels, parallel solvers and preconditioners are also described. Section 4 contains the main results of the paper obtained in large-scale 3D simulations using high-performance parallel architectures.

ELECTRO-MECHANICAL CARDIAC MODELS
We conside a cardiac electro-mechanical coupling model consisting of the following four coupled sub-models; see also Figure 1.

Cardiac Tissue Mechanical Model
We assume a quasi-steady state regime and model the cardiac tissue as a non-linear hyperelastic material satisfying the equilibrium equation with appropriate boundary conditions, where we denote by x = x(X, t) the spatial coordinates of the deformed cardiac domain (t) at time t, by X = (X 1 , X 2 , X 3 ) T the material coordinates of the undeformed cardiac domain , by F(X, t) = ∂x ∂X the deformation gradient and by u(X, t) = x − X the displacement field. Following the active stress approach, the second Piola-Kirchhoff stress tensor S is written as the sum of passive (pas), volumetric (vol) and active (act) components, i.e., where E = 1 2 (C − I) is the Green-Lagrange strain tensor and W pas is an exponential strain energy function describing the myocardium as an hyperelastic material transversely isotropic (derived form the orthotropic law proposed in Holzapfel and Ogden, 2009;Eriksson et al., 2013) where a, b, a (l,n,ln) , b (l,n,ln) are positive material parameters and We did not employ an isochoric-deviatoric decomposition of the deformation gradient tensor. The volumetric term W vol = K (J − 1) 2 is a penalization term added to enforce the nearly incompressibility of the myocardium, where K is a positive bulk modulus and J = detF. The model is closed by imposing boundary conditions of mixed Dirichlet and traction type.

Mechanical Active Tension Model
The active tension generation model is based on calcium kinetic and myofilament dynamics. Here we consider the model proposed in Land et al. (2012a), where the active tension T a depends on the intracellular calcium concentration Ca i , the fiber stretch λ = a T l C a l , the fiber stretch-rate dλ dt and auxiliary variables included in vector z, i.e., The generated active force is assumed to act only along the fiber direction, so the active Cauchy stress is where a l is a unit vector parallel to the local fiber direction and T a is the active fiber stress associated to the deformed cardiac tissue. In the deformed configuration, the unit vector parallel to the local fiber direction can be written as where a l is the fiber direction in the reference configuration. Then the active stress component S act of the second Piola-Kirchhoff tensor is given by

The Bioelectrical Bidomain Model
We denote by v, u e , w, c the transmembrane potential, the extracellular potential, the gating and ionic concentrations variables on the deformed configuration and by v, u e , w, c the same quantities on reference configuration. The Bidomain model, written on the deformed configuration (t) is given in its parabolic-elliptic formulation by where c m and i ion are the membrane capacitance and ionic current per unit volume, respectively. We apply insulating boundary conditions on ∂ (t), i.e., In the Lagrangian framework, after the pull-back on the reference configuration × (0, T), this system becomes where V = ∂u ∂t is the rate of deformation; see Colli Franzone et al.
(2016a) for the detailed derivation. These two partial differential equations (PDEs) are coupled through the reaction term i ion with the ODE system of the membrane model, given in (t)×(0, T) by The bioelectrical system (Equations 6, 7) is completed by prescribing initial conditions on v, w, c, insulating boundary conditions on u e , u i = v + u e , and the intra-and extracellular applied current i app = i i app = − i e app . We recall that the extracellular potential u e is defined only up to a time dependent constant in space R(t), which can be determined by choosing a reference potential. Here we select as a reference potential the average of the extracellular potential over the cardiac volume, i.e., we require u e (X, t)J(X, t)dX = 0. Assuming transversely isotropic properties of the intra-and extracellular media, the conductivity tensors on the deformed configuration are given by where σ i,e l , σ i,e t are the the intra-and extracellular conductivity coefficients measured along the fiber direction a l and any cross fiber direction, respectively. From Equation (4), it follows that the tensors D i,e (x, t) written on the reference configuration are Frontiers in Physiology | www.frontiersin.org Therefore, the equivalent conductivity tensors appearing into the bidomain model written in the reference configuration are given by For the values of the conductivity coefficients of the Bidoman model (see Colli Franzone et al., 2016a).

The Ionic Membrane Model and Stretch-Activated Channel Currents
The ionic current in the Bidomain model (Equation 6) is given by i ion = χI ion , where χ is the membrane surface to volume ratio and the ionic current per unit area of the membrane surface I ion is given by the sum I ion (v, w, c, λ) = I m ion (v, w, c) + I sac of two terms: the ionic term I m ion (v, w, c) given by the ten Tusscher model (TP06) (ten Tusscher et al., 2004;ten Tusscher and Panfilov, 2006), available from the cellML depository (models.cellml.org/ cellml), and a stretch-activated current term I sac . The TP06 ionic model also specifies the functions R w (v, w) and R c (v, w, c) in the ODE system Equation (Equation 7), consisting of 17 ordinary differential equations modeling the main ionic currents dynamics.
The stretch-activated current (SAC) is modeled as the sum of a non-selective and a potassium selective currents as in Niederer and Smith (2007). The non-selective SAC current is defined by with γ sl (λ) = 10max(λ − 1, 0), g ns = 4.13 · 10 −3 mS/cm 2 and the value of r measures the relative conductance of the ions Na + and K + and determines the reversal potential v ns of I ns , varying the degree of expression of the ions Na + and K + . We have chosen r = 0.2.

Time Discretization
The time discretization of the electromechanical model is performed by the following semi-implicit splitting method, where different electrical and mechanical time steps could be used.
(a) given v n , w n , c n at time step t n , we compute the new variables w n+1 , c n+1 by solving the ODE system of the ionic membrane model (Equation 7) with a first order implicit-explicit (IMEX) method, i.e., (b) given the calcium concentration Ca n+1 i , which is part of the vector of concentration variables c n+1 , we compute the new deformed coordinates x n+1 , providing the new deformation gradient tensor F n+1 , by solving the variational formulation of the mechanical problem (Equation 1) and the active tension system, i.e., (c) given w n+1 , c n+1 , F n+1 and J n+1 = det(F n+1 ), we compute the new electric potentials v n+1 , u n+1 e by solving the variational formulation of the Bidomain system (Equation 6) with a first order IMEX and operator splitting method, consisting of decoupling the parabolic from the elliptic equation, i.e., In our simulations, we use the electrical time step size e t = 0.05 ms, and a mechanical times step five times larger, m t = 0.25 ms. In order to approximate the convective term in the variational formulation of Equation (6), an upwind discretization strategy is employed. We refer to Colli Franzone et al. (2015) and Colli Franzone et al. (2016a) for more details about the numerical scheme.

Space Discretization
The cardiac domain is discretized with a structured hexahedral grid T h m for the mechanical model (Equation 1) and T h e for the Bidomain model (Equation 6), where T h e is a refinement of T h m , i.e., the mechanical mesh size h m is an integer multiple of the electrical mesh size h e . We consider the variational formulations of both mechanical and bioelectrical models and then approximate all scalar and vector fields by isoparametric Q 1 finite elements in space. In all our simulations, we employ an electrical mesh size h e = 0.01 cm in order to properly resolve the sharp excitation front, while the smoother mechanical deformation allow us to use a coarse mechanical mesh of size h m = 0.08 cm. The resulting electrical mesh consists of N φ × N θ × N k elements, whose values will be specified in each numerical test reported in the Results section.

Computational Kernels and Parallel Solvers
At each time step of the space-time discretization described above, the two main computational kernels are: (a) the solution of a non-linear system arising from the discretization of the mechanical problem (1); to this end, we use a parallel Newton-Krylov-BDDC (NK-BDDC) solver, where the Krylov method chosen is GMRES and the BDDC preconditioner will be described in the next sections; (b) the solution of two linear systems deriving from the discretization of the elliptic and parabolic equations in the Bidomain model (Equation 6); to this eand, we use a parallel Preconditioned Conjugate Gradient (PCG) method, with Multilevel Additive Schwarz preconditioner for the very ill-conditioned elliptic system and with Block-Jacobi preconditioner for the easier parabolic system.
The parallelization of these two main computational kernels of our electro-mechanical solver is based on the parallel library PETSc (Balay et al., 2012) from the Argonne National Laboratory. All the parallel simulations have been performed on highperformance supercomputers and Linux clusters described in the Result section. For the parallel implementation of the BDDC preconditioner, see Zampini (2016).

Multilevel Additive Schwarz Preconditioners
We now describe the Multilevel Additive Schwarz preconditioner employed in the PCG solution of the elliptic kernel (b) associated with the Bidomain system. Let k , k = 0, ..., ℓ − 1 be a family of ℓ nested triangulations of , with finer mesh sizes from level 0 to ℓ−1, and let A k be the matrix obtained by discretizing the second equation of Equation (6) on k ; we have A ℓ−1 = A bid , where A bid is the stiffness matrix related to the elliptic equation of Equation (6) discretized on the fne mesh. Denote by R k the restriction operators from ℓ−1 to k . We decompose each grid k , for k = 1, ..., ℓ − 1, into N k overlapping subgrids k i for i = 1, ..., N k , such that the overlap size δ k at level k = 1, ..., ℓ − 1 equals the mesh size h k of the grid k . We denote by R k i the restriction operator from ℓ−1 to k i and define A k i : = R k i A k R k T i . The Multilevel Additive Schwarz (MAS(ℓ)) preconditioner is given by The resulting PCG algorithm has a convergence rate independent of the number of subdomains N k (scalability), the number of levels ℓ (multilevel optimality), while it depends linearly on the ratio H k /h k of subdomain to element size on level k (optimality); see Pavarino and Scacchi (2008), Scacchi (2008), and Pavarino and Scacchi (2011) for the theoretical details.

Iterative Substructuring, Schur Complement System and BDDC Preconditioners
We then turn to the BDDC preconditioner used in the mechanical computational kernel (a) above, i.e., the Jacobian system arising at each iteration of the Newton method applied to the non-linear elasticity system (Equation 1). For sake of simplicity, in the following sections we will denote the reference domain by instead of . We consider a decomposition of into N non-overlapping subdomains and set H = max H i . We first reduce the Jacobian system arising at each Newton step of the mechanical solver, to the interface Frontiers in Physiology | www.frontiersin.org by eliminating the interior degrees of freedom (dofs) associated with the basis functions having support in each subdomain's interior and obtaining the Schur complement system Here S Ŵ = K ŴŴ − K ŴI K −1 II K ŴI and g Ŵ = f Ŵ − K ŴI K −1 II f I are obtained from the global system (Equation 10) by reordering the finite element basis functions into interior (denoted by the subscript I) and interface (denoted by the subscript Ŵ) basis functions The Schur complement system (Equation 11) is solved iteratively by the GMRES method, where only the action of S Ŵ on a given vector is required and S Ŵ is never explicitly formed; instead, a block diagonal problem on the interior dofs is solved while computing the matrix vector product. Once the interface solution x Ŵ has been determined, the internior dofs x I can be found by solving local problems on each subdomain i . We then solve by the GMRES method the preconditioned Schur complement system where M −1 BDDC is the BDDC preconditioner, defined in Equation (17) below.
Balanced Domain Decomposition by Constraints (BDDC) preconditioners where introduced by Dohrmann (2003) and first analyzed by Mandel and Dohrmann (2003) and Mandel et al. (2005). In these methods all local and coarse problems are treated additively and the user selects the so-called primal continuity constraints across the subdomains' interface. Usual choices of primal constraints are e.g., point constraints at subdomain vertices and/or averages or moments over subdomains edges or faces. Closely related to BDDC methods are FETI and FETI-DP algorithms, as well as the previous balancing Neumann-Neumann methods; for more details, we refer the ineterested reader to the domain decomposition monograph (Toselli and Widlund, 2005, Ch. 6). See also Brands et al. (2008) and Klawonn and Rheinbach (2010) for FETI-DP algorithms applied in other fields of computational biomechanics.

Subspace Decompositions
Let V be the Q 1 finite element space for displacements and V (i) be the local finite element space defined on subdomain i that vanish on ∂ i ∩ ∂ D . This local space can be split into a direct sum of its interior (I) and interface (Ŵ) subspaces V (i) = V (i) I V (i) Ŵ and we can define the associated product spaces as While our finite element approximations are continuous across the interface Ŵ, the functions of V Ŵ are generally discontinuous across Ŵ, We then define the subspace V Ŵ : = {functions of V Ŵ that are continuous across Ŵ}, and the intermediate subspace defined by further splitting the interface dofs (denoted by the subscript Ŵ) into primal (subscript ) and dual (subscript ) dofs. Here: (a) the subspace V consists of functions which are continuous at selected primal variables. These can be e.g., the subdomain basis functions associated with subdomains' vertices and/or edge/face basis functions with constant values at the nodes of the associated edge/face. A change of basis can be performed so that each primal variable correspond to an explicit dof.
(b) the subspace V = N i=1 V (i) is the product space of the local subspaces V (i) of dual interface functions that vanish at the primal dofs.

Restriction and Scaling Operators
The definition of our dual-primal preconditioners require also the following restriction and interpolation operators, associated with boolean matrices (with {0, 1} elements): where V (i) is the local primal subspace. Moreover, we define the pseudo-inverse counting functions δ † i (x), which are defined at each dof x on the interface of subdomain i by with N x the number of subdomains sharing x. We finally define scaled local restriction operators R (i) D, by scaling by by δ † i the only nonzero element of each row of R (i) . We then define the scaling matrix

Choice of Primal Constraints
The efficiency of BDDC (and more in general dual-primal) preconditioners is strongly dependent of the choice of primal contraints. The simplest choice of selecting the subdomains vertices as primal dofs is not always sufficient to obtain scalable and fast preconditioners. Therefore, richer (and computationally more expensive) primal sets have been developed in order to obtain faster preconditioners. These stronger preconditioners are based on larger coarse problems employing also edge and/or face based primal dofs, see e.g., Toselli and Widlund (2005).

Matrix Form of the BDDC Preconditioner
Analogously to the dual-primal splitting introduced before, we partition the local dofs into interior (I), dual ( ), and primal ( ) dofs, so that the local stiffness matrix K (i) associated to subdomain i can be written as The BDDC preconditioner is then defined as where the scaled restriction matrix R D,Ŵ has been defined in Equations (14, 16), and (18) The first term in Equation (18) represent the sum of local problems on each subdomain i , with Neumann data on the local dual dofs and with zero Dirichlet data on the local primal dofs. The second term in Equation (18) represents a coarse problem for the primal variables involving the coarse matrix and a matrix mapping primal to interface dofs The columns of are associated with coarse basis functions defined as the minimum energy extension into the subdomains with respect to the original bilinear form and subject to the chosen set of primal constraints.
For compressible linear elasticity problems it can be shown that the BDDC algorithm is scalable and quasi-optimal, satisfying a condition number bound (see e.g., Toselli and Widlund, 2005, Ch. 6.4) as with C( H h m ) = α constant if the primal space is sufficiently rich, while C( H h ) = α H h if the primal space is the minimal one spanned by the dofs associated with the subdomain vertices. We recall that H is the characteristic subdomain size and h = h m is the characteristic mechanical mesh size defined in section 3.1. We could not prove a similar bound for the convergence rate of our non-symmetric NK-BDDC preconditioned operator, since our complex non-linear elasticity problem (Equation 1) involves an exponential strain energy function. Nevertheless, the numerical results presented in the next section suggests that such a bound holds also for our operator and demonstrate the effectiveness and scalability of the NK-BDDC method.

RESULTS
In this section, we report the results of several 3D parallel simulations with our electro-mechanical Bidomain solver, using two HPC architectures: We start by studying the performance of our electro-mechanical Bidomain solver on a closed ellipsoidal ventricular geometry during a double reentry dyamics initiated by an S1-S2 protocol. Figure 2 shows the snapshots of the transmembrane potential and mechanical deformation time evolution every 50 ms, computed on 256 KNL processors of Marconi-A2. At each time instant, we report the epicardial lateral view (top panel) and selected horizontal and vertical transmural sections (bottom panel). After three S1 stimulations applied at the apex every 500 ms (not shown), an S2 cross-gradient stimulation (visible as a vertical strip in the t = 0 panel) is applied 280 ms. after the last S1 stimulus, and this instant is taken as the reference time t = 0 ms for this simulation. Two counter-rotating scroll waves are generated by the S2 stimulus, with transmural filaments located near the apex and rotation period of about 250 ms (see the panels t = 0, 250, 500 ms). The lateral epicardial view of the upper panels shows mostly one of the two scroll waves, but the second almost-symmetric one is visible in the transmural sections of the lower panels. This reentry dynamics is visible also in Figure 3 that reports the waveforms at epicardial sites P1, P2, P3 (shown in Figure 3A) of the transmembrane potential V (Figure 3B), extracellular potential u e (Figure 3C), fiber stretch λ (Figure 3D), active tension T a (Figure 3E), intracellular calcium concentration Ca i (Figure 3F).

Test 2: Weak Scalability of the Elliptic
Bidomain -TP06 Solver (Figures 4, 5) Figures 4, 5 (left columns) report the results of weak scalability tests on MIRA BG/Q for the elliptic solver (PCG-MAS(4)) required by the bioelectrical Bidomain -TP06 model on a half ellipsoidal domain representing an idealized half left ventricle. The number of processors is increased from 1K to 163K cores of the Mira BG/Q supercomputer of the Argonne National Lab. Figure 4A1 reports the condition number (blue), iteration counts (red), solution times (yellow) of the PCG -MAS(4) solver. Both a fixed half ellipsoidal domain (Figure 4A1, top plot) and an increasing ellipsoidal domain (Figure 4A1, bottom plot) are considered, where in both cases the local meshsize (hence the local problem size on each processor) is kept fixed at H/h = 16. The results clearly show the very good scalability of the PCG -MAS(4) solver, since all quantities are bounded from above as the processor count is increased from 1K to 163K cores (a factor 163) and therefore the global problem size increases from about O(10 6 ) to O(10 8 ) degrees of freedom. In particular, we remark that in spite of this problem size increase of a factor 163, the CPU times are almost constant in the case of an increasing half ellipsoid ( Figure 4A1, bottom plot) or increase by only a factor 2-3 in the case of a fixed half ellipsoid (Figure 4A1, top plot), while being almost constant between 16K and 128K cores.
Analogously, Figures 4, 5 (right columns) report the results of weak scalability tests on Marconi -A2 for the elliptic solver (PCG-MAS(4)) and also the non-linear mechanical solver (NK-BDDC), described in section 4.3 below. As before, the results clearly show the very good scalability of the PCG-MAS(4) solver, since all quantities associated with the elliptic solver are bounded from above.
In order to study more in detail the weak scalability test on a fixed half ellipsoid (Figure 4A1, top plot), we report in Figures 5A1,B1,C1 the percent summary (given by the LogView PETSc subroutine) of the main PETSc functions called by the PCG -MAS(4) elliptic solver. These PETSc functions, shown in the legend of each plot, range from inner products (VecTDoc) and vector norms (VecNorm) to the whole PCG solver (KSPSolve) and application of the MAS(4) preconditioner (PCApply). In particular, we report the percent of: CPU time as a fraction of the KSPSolve time (Figure 5A1), flops (Figure 5B1), messages ( Figure 5C1). When one of these PETSc functions has a negligible percentage, the corresponding legend shows it equal to 0. After an initial increase in some cases, all reported quantities are very scalable up to 64K cores, and most up to 163K cores, except the VecTDot percent of flops (in Figure 5B1). As expected, the percentage of time ( Figure 5A1) and flops ( Figure 5B1) are dominated by the PCG solver (KSPSolve), followed by matrix multiplications (MatMult) and inner products (VecTDot). The percentage

Test 3: Weak Scalability of the Electro-Mechanical Solver (Figures 4, 5)
We then study the weak scalability of our electro-mechanical solver from 128 to 2048 KNL processors of Marconi-A2, in particular of the two main computational kernerls: the non-linear mechanical solver (NK-BDDC) and the linear elliptic Bidomain solver (PCG -MAS(4)). Figure 4A2 reports the CPU times and iteration counts for both solvers, while Figures 5A2,B2,C2 reports the percent summary of the main PETSc functions called by the electro-mechanical solver.
In this weak scaling test, the local meshsize (hence the local problem size on each processor) is kept fixed at H/h = 16, while the global problem size grows proportionally to the processor count by assigning one subdomain to each processor. Hence, the computational domain consists of increasing portions or an ellipsoidal domain. The results in Figure 4A2 clearly show the very good scalability of the PCG -MAS(4) elliptic linear solver, since both its CPU times and iteration counts are bounded from above as the processor count is increased to 2,048 cores. On the other hand, the timings of the nonlinear SNES solver are not scalable beyond 512 processors, even if the iteration counts are. This is due to the nonscalability of the coarse solver (Mumps) employed in the BDDC preconditioner. In order to study more in detail this scalability test, we report in Figures 5A2,B2,C2 the percent summary (given by the LogView PETSc subroutine) of the main PETSc functions called by the electro-mechanical solver. These PETSc functions, shown in the legend of each plot, range from inner products (VecTDoc) and vector norms (VecNorm) to the linear solvers (KSPSolve) and preconditioner applications (PCApply) required by both the linear (PCG-MAS(4)) and non-linear (NK-BDDC) solvers. In particular, we report the percent of: CPU time (Figures 5A2), flops (Figure 5B2) and messages ( Figure 5C2). When one of these PETSc functions has a negligible percentage, the corresponding legend shows it equal to 0). All reported pertentages are very scalable, showing quite flat plots, except the time percentages (Figure 5A2), where the KSPSolve and PCApply percentages grow considerably beyond 512 cores, due mostly to the growth of MatSolve and PCSetUp, which we know already from Figures 5A1,A2 are due to the nonscalable direct coarse solve (Mumps) of the BDDC preconditioner called by the non-linear SNES solver. As expected, the percentage of time ( Figure 5A2) and flops ( Figure 5B2)  The global mesh size is fixed to 384 × 192 × 48 finite elements while the number of processors is increased from 32 = 8 × 4 × 1 (with local mesh 48 × 48 × 48) to 256 = 16 × 8 × 2 (with local mesh 24 × 24 × 24). Figure 6A shows the timings of the NK-BDDC solver: after an initial superlinear speedup from 32 to 64 cores, the timings still reduce when going to 128 and 256 cores but with worse speedups (see also Figure 7A) and start to increase at 512 cores or more (not shown). Figure 7B shows the number of Newton iterations for each NK-BDDC solve, which remain constant at 4 iterations independently of the number of processors. Figure 7C reports the cumulative GMRES iterations for each NK-BDDC mechanical solve, which increase in time since the Jacobian mechanical system becomes increasingly ill-conditioned due to the spreading of the electrical activation front and subsequent mechanical contraction. The number of iterations is reduced when going from 64 to 128 and to 256 cores, but unexpectedly in the 32 core test we got the lowest iteration counts after 20 ms. Figure 7D shows that the number of PCG iterations for each Bidomain elliptic solve are almost constant independently of the number of processors used. The timings of each Bidomain elliptic ( Figure 7E) and parabolic (Figure 7F) solve show a reduction when the number of processors is increased, but with reduced speedup when using 256 cores or more.
As before, we now study in Figure 7 the percent summary (given by the LogView PETSc subroutine) of the main PETSc functions in this strong scaling test for the electro-mechanical solver. We report the percent of: flops (Figure 7C), CPU time ( Figure 7D), messages (Figure 7E), reductions ( Figure 7F). Again we find quite flat plots, except the time percentages (Figure 7D), where the the KSPSolve percentage grows considerably due mostly to the growth of PCApply and MatSolve, which again we attribute mostly to the nonscalable direct coarse solve (Mumps) of the BDDC preconditioner called by the

DISCUSSION
We have developed a high-performance parallel solver for cardiac electro-mechanical 3D simulations. After numerical discretization in space with Q 1 finite elements and IMEX operator splitting finite differences in time, the main computational kernels at each time step require: (a) the solution of a non-linear system deriving from the discretization of the cardiac mechanical problem (1) by a parallel Newton-Krylov-BDDC (NK-BDDC) solver, where the Krylov method chosen is GMRES; (b) the solution of the two linear systems deriving from the discretization of the elliptic and parabolic equations in the Bidomain model (Equation 6) by a parallel PCG method with Multilevel Additive Schwarz and Block-Jacobi preconditioners, respectively. The parallelization of our solver has been based on simulations have been performed on the parallel library PETSc (Balay et al., 2012) from the Argonne National Laboratory and large-scale 3D simulations have been run on high-performance supercomputers.
We have investigated the performance of the parallel electromechanical solver in both physiological excitation-contraction cardiac dynamics and pathological situations characterized by re-entrant waves.

Bidomain Solver
The results have shown that the electrical Bidomain solver is scalable, in terms of both weak and strong scaling, and is robust with respect to the deformation induced by the mechanical contraction. Bidomain weak scaling tests have been performed both on the Mira BG/Q and Marconi-A2 clusters. The two architectures and the number of cores used are different, although the load per core is the same. Thus, we can not compare fairly the performances obtained on the two architectures. However, the CPU times reported in Figure 4A, bottom and Figure 5A have the same order of magnitude, showing that the solution of the Bodomain linear systems on the two architectures exhibit comparable costs.

Mechanical Solver
The results have shown that also the mechanical NK-BDDC solver is scalable in terms of non-linear and linear iterations counts, but the CPU timings, especially in the weak scaling test, do not present a scalable behavior. Our results seem to indicate that this increase of CPU timings can be attributed to the increase of computational costs required by the BDDC coarse solver. A possible remedy would be to employ a multilevel BDDC solver, where the coarse problem is solved recursively by a BDDC method with additional local and coarse problems, or to employ an adaptive selection of BDDC primal constraints. The nonscalability and ill-conditioning of the nonlinear mechanical system could also be associated with: (a) the penalty formulation employed to enforce the almost incompressibility of the cardiac tissue; (b) the presence of the stress induced by the active tension contraction model; (c) the particular mechanical boundary condition enforcing zero displacements on a fixed endocardial basal ring and fixed intracavitary endocardial pressure.

Comparison With Previous Studies
So far, only few studies have developed and investigated parallel numerical solvers for cardiac electro-mechanics. Lafortune et al. (2012) have proposed a fully explicit Monodomain-mechanical solver, obtaining good strong scalability results up to 500 cores. The advantage of our approach with respect to that presented in Lafortune et al. (2012) is that our solver, resulting from a semiimplicit time discretization of the electro-mechanical model, allows larger time step sizes and time adaptivity. Augustin et al. (2016) have developed a very effective electro-mechanical solver, tested on highly accurate patient-specific geometric models and based on Algebraic Multigrid (AMG) preconditioners for both the Bidomain and mechanical systems. The strong scalability results they have reported show a very good performance of AMG applied to the non-linear mechanical system, whereas the AMG preconditioner is less effective for the Bidomain linear system. The advantage of our solver compared to that introduced in Augustin et al. (2016) is that both Multilevel Additive Schwarz and BDDC preconditioners should be more robust than AMG when high order finite elements or isogeometric analysis (see e.g., Charawi, 2017) discretizations are employed. On the other hand, while BDDC preconditioners can be easily constructed for unstructured meshes, Multilevel Additive Schwarz methods are more difficult to implement in case of such grids.

Future Work
In order to improve our mechanical solver, further studies could consider the following issues: (a) mixed formulations of the mechanical system based on inf-sup stable displacement-pressure discrete spaces; (b) alternative active tension contraction models; (c) alternative mechanical boundary conditions and pressurevolume relationships involving multielement Windkessel models.