Fast All-Electron Hybrid Functionals and Their Application to Rare-Earth Iron Garnets

Virtual materials design requires not only the simulation of a huge number of systems, but also of systems with ever larger sizes and through increasingly accurate models of the electronic structure. These can be provided by density functional theory (DFT) using not only simple local approximations to the unknown exchange and correlation functional, but also more complex approaches such as hybrid functionals, which include some part of Hartree–Fock exact exchange. While hybrid functionals allow many properties such as lattice constants, bond lengths, magnetic moments and band gaps, to be calculated with improved accuracy, they require the calculation of a nonlocal potential, resulting in high computational costs, that scale rapidly with the system size. This limits their wide application. Here, we present a new highly-scalable implementation of the nonlocal Hartree-Fock-type potential into FLEUR—an all-electron electronic structure code that implements the full-potential linearized augmented plane-wave (FLAPW) method. This implementation enables the use of hybrid functionals for systems with several hundred atoms. By porting this algorithm to GPU accelerators, we can leverage future exascale supercomputers which we demonstrate by reporting scaling results for up to 64 GPUs and up to 12,000 CPU cores for a single k-point. As proof of principle, we apply the algorithm to large and complex iron garnet materials (YIG, GdIG, TmIG) that are used in several spintronic applications.


INTRODUCTION
Materials science aims to understand and predict material properties ever more accurately, so that new sophisticated materials can be discovered to drive innovation in domains that rely on them. While materials science has been around for millennia, it was only at the beginning of the last century that the arrival of quantum mechanics enabled the exact description of the microscopic properties in materials. However, the cost of calculating the exact solution to the Schrödinger equation grows exponentially with the size of the system and is therefore limited to very small systems. Density functional theory (DFT) replaces the 3N-dimensional wave function as the central quantity with the 3-dimensional ground-state density and thereby reduces the exponential computational cost to a polynomial one. While DFT is in principle exact, a key ingredient, the so-called exchange-correlation energy, has no known analytical expression. The approximations used for this term determine the accuracy with which material properties can be predicted. While the most commonly used approximations, the local density approximation (LDA) and the generalized gradient approximation (GGA), can predict certain properties with a high precision at a very low computational cost, they fail to predict some essential electronic properties in particular of electronically complex materials (Alberi et al., 2018).
DFT is increasingly being used in the context of highthroughput calculations, where hundreds of thousands of material candidates are screened using automated workflows (Yan et al., 2017;Mounet et al., 2018;Rosen et al., 2019). However, all of these calculations are limited to material classes and properties for which the underlying exchangecorrelation functionals have a good predictive power. In order to enhance these calculations with material classes and properties for which LDA and GGA fail, it is necessary to rely on more accurate methods producing high-quality results. One class of accurate methods are the hybrid exchange-correlation functionals which are particularly suited to predicting electronic properties such as the band gap, the degree of charge localization and the polarization in materials with a stronger electron correlation (Cramer and Truhlar, 2009;Zhang et al., 2011;Burke, 2012;Becke, 2014;Garza and Scuseria, 2016).
Hybrid exchange-correlation functionals, such as PBE0 (Perdew et al., 1996) or HSE06 (Krukau et al., 2006) functionals, mix a portion of an orbital dependent exact exchange with the electron correlation described by other approximations, such as LDA or GGA (Becke, 1993). Their reliance on the orbital dependent exact exchange makes them computationally considerably more expensive than LDA or GGA. While an LDA or a GGA calculation grows with the 3rd power of the number of atoms, a hybrid exchange-correlation functional calculation typically grows with the 4th power of the number of atoms. Additionally, the computational cost of a hybrid calculation grows quadratically with the number of k-points used to sample the Brillouin zone, whereas for an LDA or a GGA calculation it only grows linearly. This large computational cost has prohibited precise predictions for systems with large unit cells, including a number of interesting material classes such as garnets (Rodic et al., 1999;Nakamoto et al., 2017) or materials of interest for solid-state batteries (Yu et al., 2016).
This article focuses on the implementation of hybrid functionals in the full-potential linearized augmented-planewave (FLAPW) (Wimmer et al., 1981) method as it is implemented in the open-source code FLEUR (Fleur, 2021). Unlike approaches employing the pseudopotential approximation, the FLAPW method treats all electrons explicitly and does not employ any approximations to represent the potential or density. It is therefore well suited for a wide range of systems, including systems containing heavy atoms that have d-and f-electrons. It is considered to be one of the most accurate DFT methods and has been used as a benchmark for other methods and codes (Lejaeghere et al., 2016). More specifically we focus here on the efficient implementation of the Hartree-Fock type exact exchange based on the bare Coulomb kernel as relevant for the PBE0 functional. Functionals based on the screened Coulomb kernel as HSE06 can be always expressed in terms of the matrix elements of the bare Coulomb kernel subtracted by matrix elements of a smooth function (Schlipf et al., 2011), whose numerical evaluation is not time critical and is not further discussed here.
There have been significant advances in bringing hybrid functionals to systems with hundreds or even thousands of atoms in other approaches, such as the projector augmented wave method (PAW) (Barnes et al., 2017;Carnimeo et al., 2019), the s-MTACE METHOD (Mandal et al., 2021), gaussian basis functions (Guidon et al., 2008) and atomic-orbitals method (Hakala and Foster, 2013;Lin et al., 2021). Even some all-electron methods have demonstrated their capability to calculate large systems with hybrid functionals (Ihrig et al., 2015;Levchenko et al., 2015). However, hybrid functionals within FLAPW have been constrained to very small systems (Betzinger et al., 2010;Schlipf et al., 2011;Blaha et al., 2020). The work presented here enables FLEUR's hybrid functional implementation to run on the world's most advanced supercomputers and use their immense computational power to investigate these large and interesting systems containing hundreds of atoms. Building on the pioneering work previously done on hybrid functionals in the FLAPW basis and in FLEUR specifically (Betzinger et al., 2010;Schlipf et al., 2011), we analyzed the performance and bottlenecks of this legacy implementation, and explored algorithmic improvements needed to calculate hundreds of atoms with the accuracy that FLAPW and hybrid functionals offer.

METHODS
The FLAPW method, implemented by FLEUR, partitions the unit cell of volume Ω into two kinds of domains. In spherical regions MT a centered around each atomic nucleus, a muffin-tin orbital basis (Andersen and Woolley, 1973) is used, relying on the products of spherical harmonics and radial functions. In between these spheres, in the so-called interstitial region (IS), a plane-wave basis is used. The resulting LAPW basis functions used to represent the wave functions are Here k and G are the Bloch-and reciprocal lattice vectors, while σ indicates the spin. α aσ lm and β aσ lm are coefficients chosen, such that the wave function is smooth and continuous at the boundary between the interstitial region and muffin-tin spheres. u and _ u are radial functions, where u is the solution to the radial Schrödinger equation for the spherically averaged muffin-tin potential and a fixed energy parameter and _ u is its energy derivative. a indicates the nucleus, l and m are the orbital-and magnetic quantum numbers of the spherical harmonic Y lm . r denotes the position, while r a ≔r − R a is the position relative to the Frontiers in Materials | www.frontiersin.org March 2022 | Volume 9 | Article 851458 center of the muffin-tin sphere and e a ≔r a /norm r a is the unit vector in direction of r a . In order to calculate the Hartree-Fock exact exchange, the Coulomb integral , containing four basis functions needs to be evaluated. It has previously been noted (Boys and Shavitt, 1959;Shavitt, 1959), that the product of the basis functions sharing the same argument φ p i (r)φ j (r) and φ p j (r′)φ i (r′) can be expressed in a more efficient way, since the basis function are already designed to be complete and the set of all products {φ p i (r)φ j (r)} makes a linear dependent set. In the case of the LAPW basis, this observation can be exploited by employing the so-called mixed-product basis. The mixed-product basis reduces the basis set for the muffin-tin regions separately from the interstitial region. In the muffin-tin regions, the overlap matrix of the products is calculated and diagonalized. The eigenvectors whose eigenvalues are above a certain threshold ] then provide a linear-independent representation of the product space. In the interstitial region products of plane-waves are also planes-waves, but with a higher cut-off G max ′ 2G max , where G max is the plane-wave cut-off of the LAPW basis. In practice reduced values of G max ′ have proven to provide accurate results. While this new mixed-product basis is neither continuous nor smooth, it provides a significant reduction in computational effort compared to the naive evaluation of the Coulomb integral. A detailed description of the mixed-product basis can be found in (Friedrich et al., 2009).

Exact Exchange
Using this basis set the nonlocal exact exchange can be expressed as where n, n′ and n′′ are band indices of the states ϕ σ nk , ϕ σ n′k and ϕ σ n ′′ k . I and J are indices enumerating the mixed-product basis and C is the Coulomb matrix expressed in this basis. A detailed derivation of M and C can be found in (Friedrich et al., 2009). Note that while the sum n′′ only stretches over occupied states, n and n′ cover all states. The square Coulomb matrix C with the dimensions of the size of the mixed-product basis is largely sparse, which allows for a significant reduction in the computational effort. The product C IJ (q)〈M q,J ϕ σ n ′′ k−q |ϕ σ n′k 〉 is referred to "Sparse matmul" in Figure 1; Figures 3-5. The projector matrix P σ,n,k (n ′′ , q, I) 〈ϕ σ nk |ϕ σ n ′′ k−q M q,I 〉 has the dimensions of the size of the mixed-product basis and the number of states. The evaluation of this term is split into two components, one called "inters. wave-prod" and one called "MT wave-prod", referring to the evaluation of this term either within the interstitial region or the muffin-tins. In order to apply V exact σ,nn′ (k) to the Hamiltonian in the LAPW basis it is transformed from the eigenspace to the LAPW basis by applying the overlap matrix of the LAPW matrix and the eigenvector matrix. In Figure 1; Figures 3-5 the full evaluation of the non-local potential, i.e., the setup of the Coulomb matrix, the evaluation of Eq. 2 and the transformation into the LAPW basis together, is denoted as "non-local pot." The numerical evaluation of Eq. 2 represents the majority of the computational effort in a FLEUR calculation using hybrid functionals. In particular, the projection of the products of wave functions given in the LAPW basis set (Eq. 1) onto the mixed-product basis and the multiplication of these projections with the Coulomb matrix provide significant computational challenges. The implementation developed for this work relies on collecting data processed in the same way. It allows to exploit data parallelism on multiple levels, be it the use of a SIMD instruction set or an efficient and balanced use of multiple threads. Two significant changes have been made to the basic algorithm introduced in (Betzinger et al., 2010). First, contrary to the previous implementation, the projection onto the mixed-product basis in the interstitial region is now calculated by Fourier transforming the wavefunctions into real-space and multiplying pairs of wavefunctions there, before transforming them back into Gspace. By employing fast Fourier transformations, this reduces the complexity of this calculation from O(n G 2 ) to O(n G log(n G )), where n G is the number of LAPW basis functions. Second, rather than calculating all elements of V exact σ (k) individually as vector-matrix-vector products of the Coulomb matrix and the mixed-product basis, the new implementation stacks groups of vectors of the mixedproduct-basis into matrices and then calculates blocks of V exact σ (k) as a single matrix-matrix-matrix product. While these operations are mathematically identical, this new block-wise implementation is twice as fast as the elementwise implementation even on a single CPU core, which is due to its better utilization of the core's vector units. Additionally, the element-wise evaluation of this term experiences almost no speedup if multiple cores are used, while the speedup of the modern implementation is shown in Figure 1 in blue. Calculating the non-local potential on a single NVIDIA A100 GPU results in a speedup of 4× compared to an AMD EPYC 7742 CPU for the NaCl system with 64 atoms.

Shared Memory Parallelization
To enable the utilization of supercomputers with complex memory hierarchies, we rely on two classes of parallelization. We use shared memory parallelization to make full use of many-core CPUs or GPUs. While distributed memory parallelization is employed to distribute the calculation over hundreds of compute nodes. Shared memory parallelization is realized by utilizing libraries for standard math problems such as matrix-multiplications or Fourier transformations whenever possible. Code parts that do not fall within the mold of any standard math problem were parallelized using OpenMP on CPUs and OpenACC on GPUs. In Figure 1 the strong scaling behavior on a single node is shown. For strong scaling a fixed-size problem is calculated with an increasing amount of resources and the resulting speedup is measured. Here, the speedup is defined as S n ≔ Tn min Tn , which in the case of ideal scaling behaviour is S ideal n n n min , where T n is the runtime with n cores, nodes or GPUs and n min is the minimal value of n that was used in a calculation. This can be used to define the parallel efficiency as τ n ≔ S n S ideal n . Frontiers in Materials | www.frontiersin.org March 2022 | Volume 9 | Article 851458 While some parts, such as the projection onto the mixedproduct basis in the muffin-tins or the setup of the Coulomb matrix show excellent scaling, the speedup of the projection on the mixed-product basis in the interstitial exhibits a plateau around a speedup of 4 (see orange line). As mentioned previously, this algorithm relies on fast Fourier transformations, which have a low algorithmic intensity, meaning that only few calculations are performed compared to the number of load and store operations, making the algorithm more likely to be limited by the memory bandwidth rather than the available computational power, explaining the plateau in the speedup seen in Figure 1. Up until 8 cores, the FFT still benefits from the additional compute resources, beyond that the FFTs are not limited anymore by the computational power, but rather by the memory bandwidth, which does not increase with the number of assigned cores.

Distributed Memory Parallelization
The parallelism shown so far is limited to a single shared memory node and thus limited by the number of cores on a given node. Therefore, in order to scale the computational challenge posed by the hybrid exchange-correlation functionals to hundreds of nodes, we implemented three additional levels of distributed memory parallelism using MPI. The first two levels distribute the computations for different k-and q-points, while the third level parallelizes that of different occupied bands n′′. The distributed memory parallelization scheme is shown in Figure 2.
The parallelization over k-and q-points requires very little communication and thus is very efficient, while the bandparallelization requires more communication. However, it allows us to limit the size of the largest matrix stored on a single node to n basis size × n total bands , which then has a size on the order of O(n atoms 2 ). This turns out to be the bottleneck that determines the largest system we can calculate on a given computing platform. With 90 GB of memory per node we were able to calculate systems with up to 200 atoms. Figure 3 singles out the strong scaling behavior of this 3rd MPI level for a single k-and q-point. All code parts except for the setup of the Coulomb matrix show a good scaling behavior with a parallel efficiency of over 50% on either 64 CPU nodes or 64 GPUs. The scaling behavior of the Coulomb-matrix setup is not critical, since it does not dominate the run time of the calculation even on 256 nodes. Additionally, it only scales linearly with the number of k-points whereas the other parts of the nonlocal potential scale quadratically with the number of k-points. To FIGURE 1 | Strong scaling behavior with OpenMP on a single AMD EPYC 7742 64 core processor. The overall FLEUR iteration is shown with brown pentagons, while the calculation of the nonlocal potential is shown in red triangles. The four remaining lines show the major parts of the nonlocal potential. The red triangles indicating the nonlocal potential largely coincide with the brown pentagons indicating the full runtime, making it difficult to see them. The parallel efficiency of slightly above 100% in the routine for the setup mixed-product basis for 4 and 8 nodes is explained by cache effects. Depending on the number of cores used the stride of the parallelized loops executed on each core is changed, which can reduce the number of cache misses if this stride coincides with certain array dimensions.
FIGURE 2 | The distributed-memory parallelization of Eq. 2 is divided into three levels. For each k-point the exact exchange is calculated as an independent problem. At a k-point k i , the exact exchange is calculated as a sum over all q-points associated with k i , building the kq-pairs. These first two levels require very little communication, i.e., copying the final results to their destination for the k-points and a reduction within the sub-communicator of each k-point for the q-point sum. The third level of distributed-memory parallelization calculates groups of occupied bands n′′ in parallel. Here, a lot of inter-dependencies create a much larger communication demand compared to the first two levels. In a typical calculation of the non-local potential the workload is not uniformly distributed: Some k-points have more q-points than others and some kq-pairs might have more associated bands than others. The algorithm attempts to compensate this by assigning more nodes to heavier calculations.
Frontiers in Materials | www.frontiersin.org March 2022 | Volume 9 | Article 851458 investigate the performance of our algorithm with multiple kpoints we show the strong scaling behavior for a NaCl supercell with 64 atoms, 10 k-points and 205 kq-pairs in Figure 4. Here the scaling behavior of the Coulomb-matrix setup is the weakest once again, but it only accounts for less than 10%, even with 410 nodes. All other code parts show nearly perfect scaling. This is due to the fact that each kq-pair represents a largely independent problem that only requires little communication and the 3rd MPI level is only used with 205 and 410 nodes, since the parallelization over the 205 kq-pairs is preferred. SuperMUC-NG has two CPUs per node and therefore we assigned two MPI ranks to each node, resulting in a better performance compared to a one rankper-node setup. FIGURE 4 | Strong scaling behavior of multiple k-points for a 64 atom NaCl supercell with a potassium (K) defect. The system has 10 k-points and 205 kq-pairs. The super-scalar behavior is caused by the fact, that the 205 kq-pairs are not evenly distributed on 10 nodes (20 MPI). Some nodes are assigned more kq-pairs and therefore need longer while the others sit idle. This effect disappears for 205 or 410 nodes, which allow for a perfectly even and therefore more efficient distribution. For 41 nodes, the speedup and parallel efficiency of the Coulomb-matrix setup drop drastically. This is due to the fact, that the Coulomb-matrix setup does not have a qdependence, while the number of nodes is chosen to be optimal for the evaluation of Eq. 2. For 10 nodes (20 MPI), all k-points can be calculated in parallel on 2 processes each, while for 41 nodes (82 MPI), it is only possible to calculate 2 k-points in parallel, so that each is distributed over 41 processes, leading to an inefficient parallelization. In practical calculations this is mitigated by including more nodes (e.g., 45), so that both k-and kq-parallelization are efficient. However, even with 41 nodes, the Coulomb-matrix setup only accounts for 6% of the total runtime.

Weak Scaling
While the meaning of strong scaling is very intuitive, it does not necessarily reflect real life applications. Being able to calculate a system with twenty atoms in a minute or less may not advance science significantly. Certainly, science is advanced by being able to calculate increasingly bigger, more inhomogeneous and more complex systems in a reasonable time frame. Weak scaling deals with the latter. As discussed in the introduction, the computational demand of a hybrid functional calculation scales with O n atom 4 .
( 3 ) For simplicity and to focus on the ultimately limiting parallelization level, we use a single kq-pair and neglect the very efficient parallelization over different k-and q-vectors. In Figure 5 a gallium arsenide (GaAs) setup was scaled into supercells with a single nitrogen defect. Then, they were calculated with the parallelization chosen such that n nodes n atoms min n atoms ( ) where min(n atoms ) is the number of atoms in the smallest supercell.
With ideal weak scaling behavior the runtime should be constant regardless of the size of the unit cell, since the computational cost in Eq. 3 is canceled out by the additional compute resources chosen in Eq. 4. Figure 5 shows that the hybrid functionals in FLEUR can be applied efficiently to a wide variety of system sizes. The time needed for the calculation of the nonlocal potential of the largest GaAs supercell is 9% larger compared to that of the smallest supercell and the full iteration runtime is 30% larger. The runtime does not monotonously increase as one would expect for the weak scaling of a simple algorithm performing a single task. In FLEUR, the situation is more complex, some parts of the code scale with O(n atom 3 ), while others scale with O(n atom 4 ): While the setup of the mixedproduct basis in the muffin-tin spheres grows with O(n atom 3 ), its counterpart in the interstitial region grows with O(n atom 3 log(n atom )). In the Coulomb-matrix setup, some parts such as the MT-MT interaction grow with O(n atom 2 ), while e.g., Γ-point correction for in the interstitial grows with O(n atom 4 ). For larger systems terms with a bigger scaling-exponent will be dominant, but in small systems the parts with the smaller scaling-exponents dominate the runtime. In these cases the choice of Eq. 4 is not suitable, because the compute resources are increased faster than the computational complexity grows, leading to the initial dip in the overall runtime in Figure 5.

APPLICATION TO RARE-EARTH IRON GARNETS
Yttrium iron garnet (Y 3 Fe 5 O 12 or short YIG) is a complex ferrimagnetic insulator with a number of remarkable properties and applications, in the fields of magnonics (Serga et al., 2010), ultra-low temperature physics (Demokritov et al., 2006) and quantum computing (Tabuchi et al., 2015). This success has sparked interest in a related class of materials, the so-called rare-earth-iron garnets (RIGs), where the yttrium atom in the YIG structure is replaced with an element of the lanthanide series. Here applications range from materials with giant magnetostriction (Sayetat, 1986) to spin Seebeck insulators (Uchida et al., 2010). Despite great interest in these materials there is only a limited number of theoretical studies of their electronic structure. This is most likely due to the large unit cells with 160 atoms in the conventional and 80 atoms in the primitive unit cell.
The typical unit cell of a garnet is shown in Figure 6. The iron atoms in this structure have two types of environments. They are either in the centre of an octahedron or a tetrahedron spanned by neighbouring oxygen atoms. These different iron environments have a strong effect on the electronic structure, which is discussed in detail later in this paper. YIG and most RIGs are ferrimagnets, such that the magnetic moments of the 8 octahedral iron atoms point in the opposite direction with respect to the 12 tetrahedral iron atoms, which, for the RIGs discussed here, are aligned in parallel with the rare-earth elements. Only a very minor magnetic moment is induced in the yttrium and oxygen atoms.

Electronic Structure
In order to understand how the choice of the exchangecorrelation functional affects the electronic structure of YIG we calculated the density-of-states (DOS) with PBE and with PBE0. All calculations shown in the paper were performed on a 2 × 2 × 2 k-point grid. We confirmed that the DOS is converged with this grid by comparing the PBE results to results on a denser kpoint grid. We use a smearing of σ = 0.136 eV for all DOS calculations shown. The muffin-tin radii of Y, Gd, Tm, Fe and O As expected, with a value of 0.44 eV, PBE massively underestimates the experimental band gap of 2.8 eV (Larsen and Metselaar, 1975), while PBE0 predicts an improved band gap of 1.83 eV. However, the experimental value relies on optical measurements, which are not sensitive to all transitions, potentially missing certain states and thus overestimating the real band gap.
In Figure 7A) the DOS of YIG is calculated using PBE as an exchange-correlation functional. In this figure, the antiferrimagnetic alignment of the iron atoms is visible: the occupied states associated with the tetrahedral iron atoms are mainly in the spin-up channel and the unoccupied ones are in the spin-down channel, while for the octahedral iron atoms the situation is reversed: below the Fermi level the octahedral iron states are mostly in the spin-down channel and above it in the spin-up one. Most states associated with the oxygen atoms are occupied, while the yttrium states are largely unoccupied. Below the Fermi level, the DOS in the interstitial region closely follows the oxygen DOS. Additionally, the DOS associated with both iron types also coincide with the oxygen and interstitial DOS. This indicates that the 2p-states of the oxygen and the 3d-states of iron hybridize for both iron environments. This analysis is supported by the number of valence electrons found in the different muffintin spheres, which are 6.5 and 6.2 electrons for iron atoms in the tetrahedral and octhedral environments, respectively, 1.1 valence electrons in the sphere of yttrium, and an average of 3.7 valence electrons in the spheres of oxygen. The large number of 164.1 electrons in the interstitial region additionally indicates a high degree of de-localization of these states. For the unoccupied octahedral iron states in contrast, we can see a clear signature of simple crystal field splitting of localized d-states: the three t 2gstates shift down and the two e g -states shift up leading to two distinct peaks with the lower one containing three and the higher one containing two states. Similarly, for the unoccupied tetrahedral Fe d-levels the e-states are shifted down, while the t 2 -states are shifted up. This separation however, is not so clear as the shifts are smaller, the peaks still overlap and another splitting due to next-nearest neighbors can be seen.
In Figure 7B) we show the DOS calculated using the hybrid exchange-correlation functional PBE0. The results are qualitatively different from the PBE results with the most significant change seen in the different behavior of the two types of Fe in the PBE results: While the occupied tetrahedral iron 3d-states still hybridize with the 2p-states of the surrounding oxygen atoms, most of the octahedral iron 3d-states are now strongly localized and form a double peak in the DOS at around −6.5 eV.
Such a localization effect of the d-states can also be reproduced in a PBE+U treatment (Chen et al., 2021). However, in these simulations the d-states of both Fe types show the same behavior. The different tendency to localize can also be seen in these simulation by the different values of U used for the different atoms to achieve the localization. Hence, the result strongly suggests that the local Coulomb interaction exhibits different strength in the two environments of Fe. This effect can be caused by the different initial localization of the d-states as well as by different interactions and screening effects of the surrounding. Such a difference can also be seen in the unoccupied spectrum which is again dominated by a crystal field splitting of the dstates. However, in the octahedral environment this effect is again much clearer while the tetrahedral d-states form a broad band with several peaks also indicating next-nearest neighbor effects.
Finally, we would like to point out that the octahedral Fe dstates show a rather complex sub-structure with a large peak at FIGURE 6 | Unit cell of a garnet. Oxygen atoms are shown in red, while iron atoms are shown inside the grey polyhedra. The rare-earth or yttrium atoms are shown inside the golden dodecahedra. While the yttrium or rare-earth atoms are all symmetry equivalent the iron is present in two different environments. Structure from (Y3Fe5O12 crystal structure, 2012) and plotted with (Momma and Izumi, 2011).
Frontiers in Materials | www.frontiersin.org March 2022 | Volume 9 | Article 851458 − 7 eV and a minor peak at − 8 eV. This is not a crystal field splitting but rather shows that different states with a different degree of localization are formed. While the lower peak is clearly separated from the O p-states, some remaining hybridization can be identified for the larger peak. Further investigations of the consequences of these differences between the Fe atoms is beyond the scope of this paper, but we expect this electronic structure to have some influence, e.g. on magnetic interactions and the transition temperature.

Magnetic Moment
In the introduction we discussed that some key applications of YIG are related to its magnetic properties. Therefore, we want to investigate the precision of our predictions for magnetic properties with different exchange-correlation functionals. In Table 1 we compare the magnetic moments predicted for the different iron atom types. We use the magnetic moment inside the muffin-tin sphere to assign the moment to a specific atom. Therefore, the magnetic moment depends on the choice of the radius of the muffin-tin sphere and, strictly speaking, is not uniquely defined. The magnetization calculated for the oxygen and yttrium atoms is negligible regardless of the computational method used. The total magnetic moment per unit formula was 5 μ B for every functional. This agreement is expected, since YIG is a magnetic insulator, which constrains the total magnetization per unit cell to integer values.
While PBE predicts the magnetic moments of the two iron types within only ±0.5 μ B of the experimental value for the R3 crystal structure, the predictions by PBE0 are even closer to the experimental those results. This again can be understood by the observed tendency to localize the Fe d-states and compares very well to magnetic moments predicted by Barker et al. (Barker et al., 2020) obtained using QSGW, another highly precise electronic structure method. The slight difference in the magnetic moments between the QSGW and PBE0 approach we account to the different choice of the muffin-tin radii and the different degree of localization of the Fe d-states. Interestingly, in the comparison of PBE0 to QSGW in the case of this octahedral iron, the magnetic moment for this octahedral Fe agrees better with the experimental results in the R3 crystal structure than the QSGW value, supporting our findings of a slightly stronger localized d-wave function in the case of the PBE0 functional. We note that all theoretical results were calculated for the cubic Ia3d crystal structure and the magnetic moments of Fe in the tetrahedral environment agree quite well with each other but also with the experimental values of Fe in the trigonal R3 structure. On the other hand, the experimental Fe moment in the Ia3d symmetry is completely off. It shows a moment of 5.37 μ B . This value seems unrealistic since it is higher than that of a free Fe 3+ ion (~5 μ B ), while the presence of hybridization with oxygen is expected to lower the moment further. We conclude that further experimental efforts are needed to analyze the structuremagnetism relationship of YIG.

Rare-earth-iron Garnets
As two representatives of the rare-earth-iron garnet group, we chose to examine Gd 3 Fe 5 O 12 (GdIG) and Tm 3 Fe 5 O 12 (TmIG) more closely. We selected these materials, because a lot of interesting experimental (Fechine et al., 2008;Phan et al., 2009;Lassri et al., 2011;Lee et al., 2020;Vilela et al., 2020;Vu et al., 2020) and even some theoretical work using the FLAPW method (Lassri et al., 2011) has been published for these materials.
In Figure 8 we present the density of states for GdIG and TmIG calculated with the PBE0 exchange-correlation functional. Reaching numerical self-consistency for TmIG was challenging with PBE, which is the starting point for any PBE0 calculation. We achieved self-consistency by using a few hundred straight mixing iterations with a low mixing parameter, followed by a set of Anderson mixing iterations until convergence was reached. With a converged PBE as a  starting density the convergence of PBE0 is straight forward. This difficult convergence is caused by the metallic behavior of TmIG with PBE as a functional. After the non-local potential is included, a gap opens up and all later density convergence cycles do not exhibit this problematic metallic behavior. GdIG converged without problems both for PBE and PBE0. For GdIG the band gap was calculated to be 1.7 eV with PBE0. Literature values obtained using PBE + U suggest a gap of 1.6 eV (Nakamoto et al., 2017). For TmIG we also predict a band gap of 1.7 eV using PBE0. To our knowledge, this is the first prediction for the band gap of TmIG. We are not aware of any experimental results regarding the band gap in either system.
The electronic structure of these two systems has a few striking similarities with that of YIG. The 3d-states of both types of iron atoms hybridize with the oxygen 2p-states in PBE, while with PBE0 the octahedral iron states show localization and a strong shift to lower energies. This again highlights the difference of the tetrahedral and octahedral oxygen environment of the iron atoms causing different effective interactions at these atoms and casting doubt on simple PBE+U predictions for these garnet systems. For the unoccupied octahedral iron states we can see the typical signature of crystal field splitting and in the tetrahedral case this signature is weaker. The additional 4f-states of the rare-earth elements in the spin-up channel are strongly localized in PBE. In PBE0 they show a slightly larger bandwidth, indicating increased hybridization with the oxygen 2p-states which could be understood due to the decrease of hybridization of these states with the octahedral Fe d-states. As expected, Gd has no occupied 4f-states in the spin-down channel, while the 4f-states of Tm are partially occupied, causing a metallic behavior in PBE. In PBE0 the increased interaction provided by the exchange term opens a gap in the Tm 4f-band.
In Table 2 the magnetic moments of all atom types are given. For GdIG we predict a total magnetization per formula unit of 16.0 μ B and for TmIG we predict 1.75 μ B for PBE as well as PBE0. Notice, that the formula unit contains 20 atoms, while the primitive unit cell contains 80. This means, while the magnetic moment per formula unit is not integer, it is integer per unit cell.
The predicted total magnetic moments are in exact agreement with experimental results for GdIG (Geller et al., 1965), while they are in good agreement with the experimental value of 1.2 μ B for the TmIG. This experimental value would correspond to a total magnetic moment of 4.8 μ B for the primitive unit cell. PBE + U shows a tendency to predict larger magnetic moments for almost all atoms: 4.2 μ B for the octahedral iron, − 4.1 μ B for the tetrahedral iron, 7.0 μ B for Gd and 1.9 μ B for Tm (Nakamoto et al., 2017).

CONCLUSION
In this article we presented a highly scalable implementation of hybrid exchange-correlation functionals in the LAPW basis. In this work we focused on the scalable implementation of the Hartree-Fock exact exchange, which corresponds to the implementation of the PBE0 functional, but screened functionals like HSE06 are related by an additional fast computation of a smooth FIGURE 8 | The density-of-states is calculated for GdIG in (A) and for TmIG in (B) using PBE0 on a 2 × 2 × 2 k-point grid. Both calculations were performed with a K max 4.5a 0 −1 and the mixed-product basis was setup using ] = 10 −4 and l MPB = 16. Both band gaps are 1.7 eV and marked in red. The Gd states are fully occupied for the majority spin-channel and fully unoccupied for the minority spin-channel. The Tm spin-up channel is also fully occupied while the minority spin channel is only partially occupied. function. The combination of shared and distributed memory parallelization allows to calculate a broad range of systems with high efficiency. Combining all three MPI levels gives us an outlook on the scaling potential of this algorithm. If we were for example to calculate the GaAs system with 120 atoms and we would use 8 kpoints we would get 125 kq-pairs. Figure 3 shows that for this system a single kq-pair has a good parallel performance even if distributed over more than 32 GPUs. Therefore, it is reasonable to assume that the calculation of the nonlocal potential for a system with 8 k-points would still have good scaling with 32 GPUs kq−pair × 125 GPUs 4000 GPUs, which is~250 more than the 44 PetaFLOP JUWELS Booster Module has to offer. This not only allows the code to run on the supercomputers currently available, it also gives us confidence that our code can make good practical use of future exascale machines. Here, making good practical use of a supercomputer does not necessarily mean sending jobs which queue for weeks-on-end and then scale to every single core which the machine has to offer, but rather that we can efficiently use significant portions of the machine to investigate interesting and meaningful systems.
Using the new implementation of the hybrid functional code, we performed simulations of the electronic structure of iron based garnet materials. The significant improvement in the obtained band gap as well as the changes in the electronic structures discussed in detail demonstrate the significance and power of this treatment for these technological relevant material class. Our results suggests an experimental reevaluation of the structuremagnetism relation of the yttrium iron garnet (YIG), Y 3 Fe 5 O 12 .

DATA AVAILABILITY STATEMENT
The raw data supporting the conclusions of this article will be made available by the authors, without undue reservation.

AUTHOR CONTRIBUTIONS
MR, DW, and GM performed and analyzed the initial performance measurements. MR, DW, and GM designed the 3-Level MPI parallelism with feedback form CT and MM. CT and MM suggested solutions for the performance bottlenecks in MPIs one-sided communication. MR, DW, GM implemented shared-memory parallelism using OpenMP, while CT contributed memory blocking for certain OpenMP kernels. MR performed the performance measurements and the electronic structure calculation for the garnet materials. JB and SB helped with the analysis and understanding of the electronic structure of the garnet materials. JB created