|Computer Modeling in Engineering & Sciences|
Matrix-Free Higher-Order Finite Element Method for Parallel Simulation of Compressible and Nearly-Incompressible Linear Elasticity on Unstructured Meshes
Dedicated to Professor Karl Stark Pister for his 95th birthday
1Department of Computer Science, University of Colorado Boulder, Boulder, CO, USA
2Department of Civil, Environmental, and Architectural Engineering, University of Colorado Boulder, Boulder, CO, USA
*Corresponding Author: Richard Regueiro. Email: email@example.com
Received: 13 May 2021; Accepted: 02 July 2021
Abstract: Higher-order displacement-based finite element methods are useful for simulating bending problems and potentially addressing mesh-locking associated with nearly-incompressible elasticity, yet are computationally expensive. To address the computational expense, the paper presents a matrix-free, displacement-based, higher-order, hexahedral finite element implementation of compressible and nearly-compressible () linear isotropic elasticity at small strain with p-multigrid preconditioning. The cost, solve time, and scalability of the implementation with respect to strain energy error are investigated for polynomial order p = 1, 2, 3, 4 for compressible elasticity, and p = 2, 3, 4 for nearly-incompressible elasticity, on different number of CPU cores for a tube bending problem. In the context of this matrix-free implementation, higher-order polynomials (p = 3, 4) generally are faster in achieving better accuracy in the solution than lower-order polynomials (p = 1, 2). However, for a beam bending simulation with stress concentration (singularity), it is demonstrated that higher-order finite elements do not improve the spatial order of convergence, even though accuracy is improved.
Keywords: Matrix-free; higher-order; finite element; parallel; linear elasticity; multigrid solvers; unstructured meshes
Modeling bending deformation and associated stress of compressible and nearly-incompressible linear isotropic elastic materials via the standard linear interpolation displacement-based Finite Element Method (FEM) may lead to inaccurate stress calculations (depending on element size h) as well as potential mesh-locking [1,2]. With respect to accuracy, either h-or p-refinement are adopted, but oftentimes not the two together because of the associated high computational cost. With respect to mesh-locking, several FEM approaches involving mixed formulations have been proposed to overcome volumetric strain locking of linear finite elements [3–6]. Most such mixed methods for linear elasticity at small strain solve for two fields (displacement and pressure), which are discretized separately via an additive split of the deviatoric and volumetric parts of the small strain tensor [7,8]. As an alternative to, and oftentimes in conjunction with, mixed formulations, higher-order finite elements have been adopted to attempt to overcome mesh locking . However, assembling a global Jacobian matrix based on higher-order finite element interpolation functions leads to expensive solve times as a result of higher memory requirements . Matrix-free formulations [11,12] provide the benefits of higher-order finite element methods but with more efficient memory usage. Concurrently, using tensor-product-basis evaluation further improves the performance of higher-order methods by supplying favorable storage and work estimates of O(pd) and O(pd+ 1) per p-order element, respectively, for discretizations in , where d denotes spatial dimension. Furthermore, tensor-product-based operator evaluations can be cast as matrix-matrix products. Iterative solvers, such as Krylov subspace methods, only require the result of matrix-vector products rather than an assembled matrix, which allows these methods to take advantage of the computational efficiencies offered by matrix-free formulations. Therefore, storage of large, sparse matrices is avoided with iterative solvers.
Krylov subspace methods require preconditioning  to efficiently and stably solve large-scale elliptic problems such as 3D linear isotropic elasticity within the static balance of linear momentum equations. Geometric multigrid is a robust preconditioning scheme that is an appealing choice for structured meshes, and has been considered in numerous studies using finite difference and finite element methods with tensor-product-bases implemented on CPUs and GPUs [14–16]. p-multigrid, developed by Rønquist et al. , is a version of geometric multigrid based on coarsening by decreasing the basis order in higher-order or spectral finite elements rather than coarsening by aggregating elements. As a result, p-multigrid is a natural approach for problems on unstructured meshes, whereby spatial convergence only weakly depends on polynomial order p, if properly implemented. Computational cost of the discretization scheme may be affected by polynomial order p and element size h, among other factors [18,19], such as adaptive r-refinement of the mesh .
In terms of engineering applications, compressible linear isotropic elasticity (e.g., ν = 0.3) is appropriate for modeling metals in their elastic regime, such as in serviceability studies to analyze the stress intensity factor at a crack tip to determine if the crack will propagate and thus estimate the life cycle of a metallic component. Nearly-incompressible linear isotropic elasticity () is appropriate for modeling small deformations (small strains and small rotations) of rubber-like materials such as solid rock propellant binders . When metals or rubber-like materials are loaded beyond their small strain linear elastic limit, then large deformation hyperelasticity or hyper-elasto-plasticity constitutive models are needed, which are beyond the scope of this paper.
In this work, we apply a previously-developed matrix-free, higher-order, FEM for solving three-dimensional (3D) linear isotropic elasticity with p-multigrid preconditioning [22,23] to (i) a tube geometry subjected to method of manufactured solutions (MMS), (ii) a tube bending problem for compressible (ν = 0.3) and nearly-incompressible (ν = 0.499999) elasticity and polynomial order p = 1, 2, 3, 4, and (iii) a beam bending problem under body force loading for which the beam is clamped on both ends. Simulations (ii) and (iii)—but simulation (iii) in particular—cause a stress concentration (singularity) in the clamped regions, where the higher-order FEM may not converge optimally as expected . Therefore, we investigate order of convergence of the matrix-free higher-order FEM for compressible elasticity when there is a stress concentration. We aim to answer the following questions: (1) With the matrix-free parallel implementation, what combination of mesh-refinement h and polynomial order p leads to a certain strain energy error that requires the least computational cost for compressible and nearly-incompressible elasticity using the standard displacement-based FEM? (2) What combination of mesh-refinement h and polynomial order p achieves a certain strain energy error for least amount of simulation time? (3) Does the higher-order, displacement-based FEM remain competitive as Poisson’s ratio approaches the nearly-incompressible limit ()? (4) Does the higher-order FEM converge as expected when there is a stress concentration present? The parallel computational toolkit PETSc  is employed for the linear solver as well as Algebraic MultiGrid (AMG) preconditioner for the coarse-grid solve, while libCEED  is used to perform efficient tensor-product-basis evaluation.
The rest of the paper is organized as follows. In Section 2, the constitutive model for compressible 3D linear isotropic elasticity and the variational form of static balance of linear momentum are presented. In Section 3, a preconditioning scheme used to accelerate convergence is briefly discussed. In Section 4, numerical results are presented, and in Section 5, observations and conclusions are provided.
2 Compressible Linear Isotropic Elasticity
For linear isotropic elastic materials, we consider the stored strain energy function as,
Therefore, its stress-strain relationship is given by its derivative with respect to strain as,
where and are stress and strain tensors, respectively; and λ and μ are the Lamé parameters. The small strain tensor is,
The strong form of the static balance of linear momentum is given by the following,
where the functions and are prescribed boundary displacement and traction, respectively, is the body force per unit mass (such as the gravitational acceleration vector), and ρ is the mass density. Introducing weighting function for the displacement field, the corresponding variational equation for (4) is written as,
2.1 Residual Evaluation
Discretization of the variational Eq. (5) can be written to facilitate matrix-free evaluation . The residual in discrete form can be expressed as,
where and are evaluations of the finite element shape functions and their derivatives at the quadrature points in x, y, and z directions, is the element e restriction operator that separates Degrees of Freedom (DoF) based on the elements to which they belong, and represents pointwise function evaluation. f0 and f1 are determined from the constitutive model and its tangent, where and , and is the total nodal displacement vector. The basis operators are represented as Kronecker products,
where and evaluations of the finite element shape functions and their derivatives at the quadrature points in 1D. Comparing the variational equation in (5) to its discrete form in (6) yields and , assuming .
2.2 Jacobian Evaluation
Similar to a residual evaluation, the action of the Jacobian can be computed using the notation proposed by [10,26], as,
In the small strain case, for Eq. (2), is not a function of or . Therefore, its derivative with respect to or is zero, (i.e., and . On the other hand, is a function of , but it is not a function of . Therefore, and ; due to linearity, , or equivalently, , which is equivalent to (2) applied to an increment within a nonlinear iterative solver. While these minor simplifications are possible for linear problems in the present work, our implementation solves the problem as though it were nonlinear, and we will continue using the corresponding terminology.
3 -Multigrid Preconditioning of Linear Isotropic Elasticity
Using the formulation in Eq. (8), we can compute the action of global Jacobian matrix on with arbitrary user defined polynomial order p. An iterative linear equation solver is required with matrix-free operators, which necessitates preconditioning, especially at higher-order. With unstructured meshes, a natural hierarchy of grids does not exist, so h-multigrid can be difficult to implement. Algebraic MultiGrid (AMG) is suitable for low order meshes for which the Jacobian matrix can be assembled. However, assembly of this matrix is prohibitively expensive for higher-order element meshes . Therefore, we use geometric multigrid with p coarsening while adopting AMG as the coarse grid solver. A Chebyshev polynomial smoother based upon the operator diagonal  is called in the multigrid cycle.
In p-multigrid, grid transfer operations increase or decrease the polynomial order of the element basis functions, and these operations can be implemented in a matrix-free fashion via of (8) with suitable basis evaluators . The coarse-to-fine (ctof) basis operation, , interpolates the DoFs on the nodes of a coarse grid element to the nodes of a fine grid element ( for , for example, where ). The corresponding coarse and fine grid element restriction operators, , are used in the grid transfer operators. The operator correctly computes the interior degrees of freedom but over-counts nodes on the element facets. We can count the multiplicity of each node on the fine grid by applying the transpose fine grid restriction to the unit vector, . Thus, the p-multigrid prolongation operator is given by,
and then the p-multigrid restriction operator is given by .
4 Numerical Examples
Using the 3D linear isotropic elasticity model and matrix-free, higher-order FEM presented in Sections 2 and 3, we present numerical results based on (i) method of manufactured solutions (MMS) applied to a tube geometry, (ii) a tube bending simulation for compressible and nearly-incompressible elasticity on unstructured meshes, and (iii) a beam bending simulation with stress concentration on structured meshes. Polynomial orders p = 1, 2, 3, 4 are applied for a range of meshes in the compressible case, and polynomial orders p = 2, 3, 4 are applied for the same range of meshes in the nearly-incompressible case. For the compressible case, Poisson’s ratio ν = 0.3 and Young’s modulus E = 69 × 109 Pa, that correspond to aluminum. For the nearly-incompressible case, ν = 0.499999 and E = 0.1 × 109 Pa, that correspond to rubber-like materials in the small elastic strain regime.
The parallel platform used to run all FE simulations with higher-order polynomial element meshes is an Intel Xeon E5-2680 v3 @2.50 GHz (2 CPUs/node, 24 cores/node) and 113 GB RAM per compute node . The PETSc toolkit v3.14 is called for mesh management, domain decomposition, parallel assembly operations, and communication over all MPI processes. The matrix-free Jacobian operator is implemented using libCEED v0.8 and PETSc for Eq. (8). Preconditioning with p-multigrid and AMG is enabled through PETSc’s multigrid interface with grid-transfer and operator application at each level implemented in libCEED. Vectorized tensor-product operations for 8-element batches in the matrix-free operator evaluation on AVX-2 extensions of 86 instruction sets for each local processor are conducted in libCEED. For simulations conducted with MMS and tube bending, an unstructured mesh consisting of 400 Hex8 elements (trilinear) is generated in the Trelis  software. The mesh is then refined 5 times to generate 3,200, 10,800, 25,600, 50,000, and 86,400 element meshes, respectively.
4.1 Achieving L2 Strain Energy Error via h- and p-Refinement in Tube Meshes Subjected to MMS
Considering the tube geometry in Fig. 1 with method of manufactured solutions (MMS)-generated displacement boundary conditions, the first four tube unstructured mesh refinements with 3,200, 10,800, 25,600, and 50,000 elements are used in the FE simulations for the compressible and nearly-incompressible cases. For the MMS, we produce a contrived, but smooth, right hand side ρ based on = [u1, u2, u3]T with u1 = e2x sin(3y) cos(4z)/108, u2 = e3x sin(4y) cos(2z)/108, and u3 = e4x sin(2y) cos(3z)/108. Every problem is run once to determine the L2 global strain energy Φh error from the MMS. Tables 1a and 1b summarize the size of each problem in terms of degrees of freedom (“#DoF”) based on h-and p-refinement and the corresponding L2 strain energy error based on MMS for the compressible case. Tables 2a and 2b summarize the size of each problem in terms of DoF based on h-and p-refinement and the corresponding L2 strain energy error based on MMS for the nearly-incompressible case. Considering the smoothness of the MMS, and relatively high number of DoF for even the coarsest mesh (3009 elements), the strain energy error is already quite small prior to refinement. Although, for p-refinement, we see for the compressible case some convergence, while for the nearly-incompressible case, the error is small and stays small (but not as small as the compressible case, 10−10 vs. 10−7) indicating stability for smooth deformations via MMS.
4.2 Achieving Reduced -Refinement in Tube Meshes Subjected to Bending
Tube meshes subjected to bending are simulated to study cost and spatial convergence of the matrix-free, higher-order FEM for h-refinement and p-refinement based on strain energy Φh for compressible and nearly-incompressible elasticity. The length of the tube is 100 mm, with circular cross-section of inner diameter 15 mm and outer diameter 20 mm. A coarse mesh representation of the tube is shown in Fig. 1. The left side of the tube is clamped by placing a zero displacement boundary condition on the surface, while a second displacement boundary condition is imposed on the right surface of the tube to bend it 3 mm in the negative y direction. Fig. 1 shows the deformed mesh after bending. All tube meshes are simulated for compressible and nearly-incompressible elasticity. Tables 3 and 4 summarize the problem sizes in terms of degrees of freedom (“#DoF”) for each refinement in conjunction with different polynomial orders p for compressible and nearly-incompressible elasticity, respectively. The size of problem (“#DoF”) increases when a finer mesh h or a higher-order polynomial p is used. As the FE mesh is refined in h, the norm of the strain energy computed from the FE solution converges to the exact norm of strain energy . Therefore, we use the discrete elastic strain energy Φh to compute relative error in the FE simulations with respect to smaller size problems. Achieving a certain strain energy error in the FE solution is problem-dependent. For example, problems with smooth and non-smooth solutions behave differently with mesh refinement h when used with different polynomial orders p . Therefore, with mesh refinement through decreasing h, we treat the strain energy from the finest mesh as the “exact” solution in the computation of relative L2 error for each polynomial order p to determine the effect of h-refinement in achieving a certain error. However, with p-refinement, we treat the strain energy from the highest polynomial order (p = 4) to compute the relative L2 strain energy error for each p-refinement. Tables 3a and 3b summarize the relative L2 error based on mesh refinement h and using higher-order polynomials p for the compressible tube bending FE simulations, respectively. In addition, depending on the problem, each simulation is run with 1, 6, 12, 24, 48, 96, 192, 384, and 768 CPU cores for the compressible case. Each simulation is run 3 times to reduce noise in the performance timing for a total of 297 simulations for the compressible case.
To address the first question raised in the Introduction with regard to minimal computational cost for a certain strain energy error in the FE solution for compressible elasticity, we compute the minimum amount of CPU work to reach a certain error based on mesh refinements and higher-order polynomials. To do this, for each polynomial order p we accumulate the time spent to reach the error. Then we determine the minimum time in the list and multiply it by the number of processors np that are executed for the minimum time. We generate Pareto optimal diagrams to present the results of the study for compressible and nearly-incompressible elasticity separately, where a point on the diagram is called Pareto optimal if strain energy error cannot be decreased (moving down the ordinate (y axis)) without increasing time (or cost) by moving to the right on the abscissa. The set of Pareto optimal configurations is known as the Pareto front.
Figs. 2a and 2b present the L2 strain energy error based on mesh refinement vs. cost, and the L2 error based on higher-order polynomials vs. cost, for compressible elasticity. Figs. 2a and 2b show that the minimal cost to achieve a certain error occurs in the lower left region of the diagram, indicating that higher order p is the most cost-effective for the compressible case. Each horizontal series of the same color represents a strong scaling study at fixed mesh refinement h and polynomial order p, but with changing number of processors np, with perfect strong scaling occurring when all color dots are collocated. The more expensive simulations tend to exhibit better strong scaling because they have more work over which to amortize the inherent communication costs, while smaller models (fewer DoFs) are more cost-efficient to run on a single core. The low order p = 1 cases are increasingly far from the Pareto front. In addition, solely calling larger number of CPU cores is less cost-efficient, although are faster with respect to solve time up to the point where the parallel overhead communication begin to dominate.
To answer the second question raised in the Introduction regarding the least amount of time to achieve a certain strain energy error given enough computational resources, we provide a second set of Pareto optimal diagrams in Figs. 3a and 3b for the compressible elasticity tube bending problem. The Pareto optimal configurations are toward the lower left region of the plot, whereby upon p-refinement, p = 3 exhibits faster solve times for certain error, while upon h-refinement, p = 2 delivers the fastest solve times for certain error, a result that was unexpected and requires further investigation. Each horizontal series of the same color represents strong scaling of a given mesh refinement h and polynomial order p. Larger number of CPU cores when enough work is amortized tend to reach a certain error in the solution faster with higher order polynomials. Similar to Pareto optimal for the cost, strong scaling occurs when the same color larger dots overlay smaller dots.
We notice that depending on the error, increasing the polynomial order p while fixing mesh size (h constant) may provide the fastest time to solution. For example, for strain energy error between 10−3 and 10−4, polynomial order p = 3 may be faster than lower order polynomials, but for error greater than 10−3, polynomial order p = 2 may be faster. This could be due to the number of DoFs for different polynomial orders when run with the same size mesh. In addition, we expected that higher order polynomials p = 3, 4 would be more cost-efficient than lower order polynomials. However, we observed that polynomial order p = 2 is more cost-efficient than higher order polynomials p = 3, 4 for this compressible tube bending problem.
For nearly-incompressible elasticity, we use the same meshes h and polynomial order p. The problem sizes in terms of DoFs and L2 strain energy error based on mesh refinement and polynomial orders are represented in Tables 4a and 4b. However, we instead use 24, 48, 96, 192, 384, 768, and 1536 CPU cores since nearly-incompressible elasticity is more computationally intensive for matrix-free solution. This resulted in 228 FE simulations. Figs. 4a and 4b represent Pareto optimal diagrams for the L2 error based on mesh refinement vs. cost, and the L2 error based on higher-order polynomials vs. cost, respectively. Pareto optimal consists of polynomial orders p = 2, 3 when L2 error is computed based on h-refinement, while polynomial order p = 3 is more cost-effective in achieving more accuracy in the solution. Similarly, the Pareto optimal diagrams are provided for L2 error based on mesh refinement h vs. solve time, and L2 error based on polynomial order p vs. solve time in Figs. 5a and 5b, respectively. The Pareto front consists of p = 2, 3, 4 in that order. In addition, larger number of CPU cores provide faster solutions for a certain strain energy error.
Next, we examine the relative parallel scalability of the matrix-free FE implementation in terms of throughput vs. solve time for p = 1, 2, 3, 4 in the compressible case, and p = 2, 3, 4 in the nearly-incompressible case. Using the tube geometry, new meshes are generated in Trelis. A mesh with 1,638,400 elements is used with polynomial order p = 1. Meshes with 204,800, 60,500, and 25,600 elements are used with polynomial order p = 2, 3, 4, respectively. These mesh sizes in conjunction with polynomial orders p = 1, 2, 3, 4 produce approximately 5.2M DoFs. This problem size is conveniently analyzed in the Pareto diagrams, which roughly occupies one compute node (24 CPU cores) on the parallel platform. For the compressible case, depending on the number of elements, 12, 24, 48, 96, 192, 384, and 768 CPU cores are executed where every problem is run three times for a total of 75 FE simulations. For the nearly-incompressible case, 192, 384, 768, and 1536 CPU cores are called. This resulted in 42 FE simulations for which each problem is run three times. Throughput is computed as millions of DoFs per second per CPU core for each FE simulation. The result of throughput vs. solve time for the compressible and nearly-incompressible cases are presented in Figs. 6a and 6b, respectively. We notice for the compressible case, the implementation scales up to 200–250 CPU cores (Fig. 7a) with polynomial orders 3 and 4. The sub-optimal performance in the code pertains to lack of work amortized to each CPU core for the problem size chosen. However, for the nearly-incompressible case, which is more computationally-intensive compared to the compressible case, the implementation scales up to 800–1000 CPU cores (Fig. 7b) for polynomial orders 2–4 for the problems sizes selected.
4.3 Beam Bending under Body Force
In this section, we consider a solid square-cross-section aluminum beam subject to a body force in the negative y direction simulated with p = 1, 2, 3, 4. Fig. 8 shows a coarse structured mesh of the beam. The length of the beam is 5 m, and each side of the square beam cross-section is 0.25 m. The beam is clamped on both ends. A body force of 200 N/m3 is applied in the negative y direction on the beam, which causes the beam to bend 9.1 × 10−4 mm in the negative y direction at the middle of the beam. Fig. 8 shows the deformed mesh of the beam. Similar to the tube bending simulation, the clamped BCs will generate stress concentration (singularity) in the clamped regions. When a stress concentration exists, it can be shown that the order of convergence of the FE solution becomes independent of polynomial order p . That is, the order of convergence of the FE solution cannot be improved by using higher-order finite elements. We investigate this phenomenon by conducting h- and p-refinement with our matrix-free FE implementation in libCEED/PETSc.
Five different refinements of the Hex8 beam mesh are generated with the Trelis software for 558, 4,464, 15,066, 35,712, and 69,750 elements. The finest mesh with 69,750 elements acts as the “exact” solution with respect to strain energy calculation. A total of 60 simulations are conducted where each simulation is repeated 3 times. An L2 strain energy error based on h-refinement per polynomial order is computed for p = 1, 2, 3, 4. In addition, an L2 strain energy error based on higher-order polynomial p for each mesh refinement h is computed. Tables 5a and 5b summarize the refinement (“#Refine”), polynomial degree p (“deg”), number of degrees of freedom (#DoF), strain energy, and the L2 error. The element size for each mesh is computed to be h = 0.1428, 0.0714, 0.0476, 0.0357 m for smallest to largest number of elements mesh, respectively. Fig. 9 represents the h-refinement order of convergence for polynomial orders 1 through 4. The calculated slopes (order of convergence) of the lines that are best fit in the least square sense for polynomial orders p = 1, 2, 3, 4 with h-refinement are 2.43, 2.48, 2.38, and 2.31, respectively. Therefore, we conclude that for this problem the order of convergence of the FE solution on structured meshes with stress concentration (singularity) cannot be improved by using higher-order polynomials. Thus, the lack of optimal spatial convergence with regard to higher-order FE when there is a stress concentration is confirmed, as opposed to when there is no stress concentration (or singularity) .
The paper investigated the use of displacement-based higher-order finite elements for compressible and nearly-incompressible linear isotropic elasticity using a matrix-free implementation with p-multigrid preconditioning. It was observed that varying the polynomial order p and mesh size h affects the cost and time it takes to achieve a certain strain energy error for Poisson’s ratios ν = 0.3, 0.499999 in a tube bending problem. Increasing the polynomial order p while fixing the mesh size (h constant) may provide the fastest time to converge to a certain strain energy error between 10−3 and 10−4. However, for strain energy error greater than 10−3, lower-order polynomials may achieve the solution faster. When there is a stress concentration present, higher-order polynomials do not improve the order of convergence in the finite element solution. In addition, polynomial order p = 2 proved to be more efficient than p = 3, 4 for the compressible elastic tube bending problem, which also has a stress concentration (singularity in the stress solution). Each mesh refinement was run with polynomial orders p = 1, 2, 3, 4 for compressible elasticity, and polynomial orders p = 2, 3, 4 for nearly-incompressible elasticity. Using the same mesh refinements h with higher-order polynomials (p = 3, 4) produce a much larger problem size (more DoFs) than when the same meshes are used with polynomial order 2. Therefore, polynomial orders p = 3, 4 appear to be more costly than polynomial order p = 2. An exhaustive search using different polynomial orders p and mesh sizes h may determine if polynomial order p = 2 is truly more cost efficient than polynomial orders p = 3, 4, which will be considered as future work. In addition, for nearly-incompressible elasticity (ν = 0.499999), the tube bending problem could not achieve an error better than 10−1 based on discrete global strain energy Φh computation because of the stress concentration; whereas a smooth method of manufactured solution (MMS) tube problem achieved strain energy error on the order of 10−7, illustrating the difficulty in convergence for h-and p-refinement for near-incompressibility when a stress concentration is present. Therefore, implementing a mixed formulation for handling appears to be expedient with respect to spatial convergence than solely a displacement-based, higher-order finite element method. Scalability of the matrix-free FEM for compressible elasticity is feasible up to a few hundred CPU cores, while scalability improves for the more computationally-expensive nearly-incompressible case to one thousand CPU cores for the problem sizes chosen in this study. Higher-order finite elements generally still out perform lower-order finite elements within the matrix-free implementation presented in the paper, for which the cost per degree of freedom decreases with increasing p, due to more structured computation and more efficient quadrature.
Notation: Boldface denotes vectors and tensors in symbolic form. Unless otherwise indicated, all vector and tensor products in symbolic form are assumed to be inner products, such as , and , where repeated indices denote a sum over those indices. Cartesian coordinates are assumed. The symbol tr(•) is the trace operator, such that . The symbol is the unit tensor, i.e., , where δij is the Kronecker delta operator, and is the fourth order identity tensor. p refers to polynomial basis order, h to characteristic element size, d to spatial dimension, and np to number of processors.
Funding Statement: The research relied on computational resources  provided by the University of Colorado Boulder Research Computing Group, which is supported by the National Science Foundation (Awards ACI-1532235 and ACI-1532236), University of Colorado Boulder, and Colorado State University.
Conflicts of Interest: The authors declare that they have no conflicts of interest to report regarding the present study.
|This work is licensed under a Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.|