Hash Functions and GPU Algorithm of Infinite Grid Method for Contact Search

The paper presents the memory-saving GPU algorithm of the infinite grid method with various hash functions for contact search in discrete element computations. The implemented hash table of fixed size has the added benefit of allowing the grid to be potentially infinite in size, which is particularly suitable for a large number of discrete particles moving in large computational domains with empty regions. Research on the application of various hash functions and hash table sizes was performed to preserve the computational performance of the developed memory-saving GPU algorithm and its OpenCL implementation. The performance of the developed software was evaluated by solving the hopper fill and discharge as well as the artificial avalanche problems on the NVIDIA® Tesla™ P100 GPU. The performance achieved by using the memory-saving implementation of contact search was compared with that attained by using the standard implementation of the uniform grid method. The performed analysis revealed that the developed GPU algorithm and its OpenCL implementation reduced the GPU memory consumed by the uniform grid method up to 69.7 times, which resulted in the contact search memory equal to 1.86% of the total memory required by DEM computations. Moreover, the application of the Morton hash function and the proper hash table size allowed for preserving high computational performance of the infinite grid method and the developed GPU software.


Introduction
The discrete element method (DEM) suggested in the pioneering work of Cundall and Strack [3] presents a numerical methodology providing a quantitative description of discrete particulate media. The key concepts of the DEM imply that the domain of interest is treated as an assemblage of rigid or deformable particles and that the contacts among them need to be identified and updated during the entire motion process. DEM has been applied to solve numerous problems related to granular flows [32], cohesive powders [31] and diluted aerosols [16]. However, the simulation of systems at the particle level of detail has the disadvantage of making DEM computationally very expensive. Naturally, to solve industrial-scale problems, parallel computing has become an obvious option for significantly increasing computational capabilities. However, the selection of an efficient parallel solution algorithm is highly dependent on the specific features of the considered problem and the numerical method [14,15,27]. Compared to CPU-based parallelization, GPUs have a massively parallel structure, which makes them more efficient for particle-based algorithms, where large blocks of data can be processed in parallel. Software environments, such as CUDA or OpenCL [4], are targeted at general-purpose GPU (GPGPU) programming.
Usually, the computing time of industrial applications can be significantly reduced by using the parallelization benefits of GPU, which is becoming increasingly more important as an alternative computational platform for DEM simulations [9]. Shigeto and Sakai [26] have proposed a new algorithm for multi-thread parallel computation of DEM, pointing out that their calculation speed ratio of the GPU to CPU was up to 3.4 when single-precision floating-point numbers were used. Durand et al. [5] have performed GPU simulation of an impact on a reinforced concrete slab with 14274 particles, which is insufficient in the case of civil engineering structures. Yue et al. [35] have made a GPU version of the Trubal code and demonstrated its application to die filling. In 3D simulations containing 20000 particles, an average speedup of 19.66 was achieved on the NVIDIA Tesla K40c card. Govender et al. [9] have designed the modular high-performance Blaze-DEMGPU framework for the GPU architecture and investigated the influence of shape non-uniformity and polydispersity of polyhedral particles on hopper discharge. Kelly et al. [18] have adopted an adimensionalization process combined with mixed-precision data to simulate 3D scenarios with up to 710 million spherical frictionless particles. To achieve a higher speedup ratio for a larger number of particles, a few efforts were made to use the combined GPU and MPI technology. Xu et al. [33] have achieved the quasi-real-time simulation of an industrial rotating drum, when about 9.6⋅10⁶ particles were treated with 270 GPUs. The one-dimensional domain decomposition with multiple GPUs has been applied to the simulation of 128 million particles by Tian et al. [29]. GPU-based DEM combined with MPI has been applied by Gan et al. [7] to study flow in different granular handling and processing systems related to the ironmaking industry.

Contact search is a very important and time-consuming part of DEM computations of granular materials [34]. Spatial partitioning techniques subdivide the domain into particular regions, such as cells. Since two particles can only be in contact if they are in the neighbouring regions, the number of pairwise contact checks is significantly decreased [22]. The uniform grid method (UGM) is the efficient space subdivision scheme often used to speed up the contact search queries on GPU in the case of nearly monosized particles. A typical grid representation stores a list of particles contained by each cell. However, in the cases of large grids, the mere storing of the lists can exceed the limits of the available GPU memory [2]. Thus, large uniform grids only partially filled with particles inefficiently allocate memory, and contact search can violate the limits of GPU resources, which restricts the application of DEM in industry. Most of the authors [9,29,35] have performed standard implementations of the UGM for contact search or its minor modifications [26,33]. Kalojanov and Slusallek [17] have developed the first GPU-based parallel algorithm of uniform grid construction for ray tracing. NVIDIA SDK provides a sample code [10] for simulating simple particle interactions by using a sorting algorithm and a uniform grid structure. Xu et al. [33] have used atomic functions to register particles to cells and a thread block per cell to perform the neighbour search. Durand et al. [5] have used a dynamic bounding box to reduce the number of cells in the uniform grid. Cai et al. [1] have modified the cell-linked list method using a dynamic mesh for better utilization of CPU memory.
Zheng et al. [36] have applied a complex and costly contact search algorithm, allowing a particle to overlap an arbitrary number of cells, to the self-compacting concrete flow simulations. The developed algorithm performed well up to 140063 heterogeneous particles in a simple case of the gravity packing problem. However, only 29580 particles were employed to simulate a complex self-compacting concrete flow. Gan et al. [8] have applied the hierarchical grids method for large-scale polydispersed particulate systems and noticed the decrease in efficiency caused by the load balance problem and the time-consuming MPI communication between GPUs on different nodes.

The main drawback of the UGM is that too much memory is allocated for buckets of predefined capacity. In a typical DEM simulation, many of these buckets are not used. Hence, the occupied GPU memory cannot be used for any other purpose. It seems that mapping of cells into a hash table of a fixed size requires less memory than storing the grid in a dense array [23]. Moreover, positions of particles are mapped by the hash function into a finite number of cells. The size of the used storage is associated with the number of particles. Memory necessary for the hash table does not depend on the number of grid cells, therefore, the grid can be potentially infinite in size. However, memory usage and performance of the GPU code might depend on the considered solution domain and dynamics of particles, as well as the applied hash function and the hash table size. Moreover, the performance of contact search algorithms on large or complex computational domains with unfilled regions, covered by a large number of grid cells, has rarely been investigated in the frame of DEM simulations on GPU.

The paper presents performance analysis of the developed memory-saving GPU algorithm of the infinite grid method (IGM) with various hash functions for contact search in DEM simulations in computational domains that can be covered only by a large number of uniform grid cells. The other parts of the paper are organized as follows: Section 2 describes the implemented DEM model and used software, Section 3 presents the developed GPU algorithm for contact search, Section 4 describes the solved applications, Section 5 provides the performance analysis, while Section 6 gives the concluding remarks.

The DEM Model and Software
In the present work, the employed DEM software models non-cohesive frictional visco-elastic particle systems. The dynamic behaviour of a discrete system is described by considering the motion and deformation of the interacting individual particles in the frame of Newtonian mechanics. An arbitrary particle is characterized by three translational and three rotational degrees of freedom. The forces acting on the particle may be classified into the forces induced by the external fields and the contact forces between the contacting particles. In the present work, the electromagnetic force [30], the aerodynamic force [16] and other external forces [20], except for the gravity force, are not considered. The normal contact force can be expressed as the sum of the elastic and viscous components. The normal elastic force is computed according to the Hertz contact model. The viscous counterpart of the contact force linearly depends on the relative velocity of the particles at the contact point. The tangential contact force is divided into the parts of static friction and dynamic friction. The dynamic friction force is directly proportional to the normal component of the contact force. The static friction force is calculated by summing up the elastic counterpart and the viscous damping counterpart. It is worth noting that the elastic counterpart includes the length of the tangential displacement, which depends on the time history. Thus, the considered friction model is incremental or time history-dependent, which requires storing the values of the tangential displacement during the contact between the neighbouring particles in the memory. The details of the applied DEM model can be found in [6,11].
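To make the normal force description above concrete, the sketch below assembles it from a Hertz elastic term and a viscous term proportional to the normal relative velocity. It is only a minimal illustration: the structure layout, variable names and the damping coefficient gamma_n are assumptions, and the tangential (friction) part as well as the details of the model in [6,11] are omitted.

```c
#include <math.h>

/* Minimal sketch of the normal visco-elastic contact force between two
 * spheres a and b: a Hertz elastic term plus a viscous term proportional to
 * the normal relative velocity. All names and gamma_n are illustrative. */
typedef struct { double x[3], v[3], r, E, nu; } particle_t;

static double hertz_normal_force(const particle_t *a, const particle_t *b,
                                 double gamma_n)
{
    double d[3], dist = 0.0;
    for (int k = 0; k < 3; ++k) { d[k] = b->x[k] - a->x[k]; dist += d[k] * d[k]; }
    dist = sqrt(dist);

    double overlap = a->r + b->r - dist;            /* penetration depth        */
    if (overlap <= 0.0) return 0.0;                 /* particles not in contact */

    /* Effective radius and effective Young's modulus of the pair. */
    double r_eff = a->r * b->r / (a->r + b->r);
    double e_eff = 1.0 / ((1.0 - a->nu * a->nu) / a->E +
                          (1.0 - b->nu * b->nu) / b->E);

    /* Hertz elastic term: (4/3) * E_eff * sqrt(R_eff) * overlap^(3/2). */
    double f_elastic = (4.0 / 3.0) * e_eff * sqrt(r_eff) * pow(overlap, 1.5);

    /* Viscous term, linear in the rate of change of the overlap. */
    double n[3] = { d[0] / dist, d[1] / dist, d[2] / dist };
    double overlap_rate = 0.0;
    for (int k = 0; k < 3; ++k) overlap_rate -= (b->v[k] - a->v[k]) * n[k];

    return f_elastic + gamma_n * overlap_rate;      /* repulsive magnitude along n */
}
```
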
The DEM code is developed by using OpenCL [4] to run the same software on GPUs and CPUs of different vendors. In the present research, the main attention is focused on contact search because it can take a large part of the computing time and the consumed memory of granular material simulations on GPU. Therefore, at the beginning the whole DEM algorithm is only briefly outlined for completeness. The developed DEM algorithm for shared-memory architectures can be described as follows. At the start of the simulation, preprocessing is performed on CPU and the initial data are copied into the GPU memory. No further memory transactions between the CPU and GPU memory are required, except for the result storage. All GPU kernels run one thread per particle, which takes advantage of the massive parallel computation capabilities of modern GPUs and can be considered to be the most suitable parallelism in the case of DEM computations. The initial kernel performs the predictor step according to the Gear predictor-corrector method, starting the time integration. The several following kernels run contact search, therefore, they are described in detail below. The other kernels handle the time history of contacts, compute the contact forces and evaluate the external forces. The last kernel completes the time integration by performing the Gear corrector step. The values of positions, velocities and accelerations of the particles are corrected. At the end of the time step, the particle data can be copied from the GPU memory to CPU and stored on the hard disk drive in HDF5 format. It is recommended to transfer the data to the CPU memory as seldom as possible because it is a time-consuming process.
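The kernel sequence of one time step can be summarized by the simplified host-side sketch below. The kernel handles and their names are placeholders for the stages described above; kernel argument setup, error handling and the Boost Compute radix sort call (between hashing and bucket construction) are omitted.

```c
#include <CL/cl.h>

/* Sketch of one DEM time step on the device, assuming one work-item per
 * particle. Kernel names are placeholders for the stages described above. */
static void dem_time_step(cl_command_queue queue, size_t n_particles,
                          cl_kernel predictor, cl_kernel hash,
                          cl_kernel find_buckets, cl_kernel contact_check,
                          cl_kernel forces, cl_kernel corrector)
{
    const size_t gws = n_particles;   /* global work size: one thread per particle */
    cl_kernel stages[] = { predictor, hash, /* <- sort particles by hash here */
                           find_buckets, contact_check, forces, corrector };

    for (size_t i = 0; i < sizeof(stages) / sizeof(stages[0]); ++i)
        clEnqueueNDRangeKernel(queue, stages[i], 1, NULL, &gws, NULL,
                               0, NULL, NULL);

    clFinish(queue);   /* in practice, synchronize only when results are copied back */
}
```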

The GPU Algorithm of Contact Search
In fact, a very effective space subdivision scheme is a regular grid, which is often applied to perform fast contact search. However, a serious drawback of the UGM is that too much memory is allocated for buckets covering a large number of uniform grid cells. To avoid storing the grid in a dense array and to save the GPU memory, each cell of the grid, or the particles located there, can be mapped into a hash table of a fixed set of M buckets. The main information is stored in two arrays (Figure 1), containing the list of pairs PARTICLE_ID(N) and PARTICLE_HASH(N), where N is the number of particles. The information for accessing the particles associated with a particular bucket is stored in two additional arrays BUCKET_START(M) and BUCKET_END(M), where M is the considered size of the hash table. The output of the contact search algorithm is stored in the arrays CONTACT_N(N) and CONTACT_IDS(K), where K is the number of particles N multiplied by the maximal number of contacts. In the case of packed monosized particles, the maximal number of contacts equals 12. These arrays are necessary for computing the contact forces in the relevant kernels. Figure 1 shows the scheme of the developed memory-saving GPU algorithm for the IGM. Kernel 1 runs a thread per particle and calculates hash values by using the considered hash function. In Figure 1, the values of the array PARTICLE_HASH are not calculated by any real hash function and are only of an illustrative character. Kernel 1 stores the results to the arrays PARTICLE_ID and PARTICLE_HASH in the global GPU memory. Kernel 2 sorts the particles according to their hash values. The sorting is performed by using the fast radix sort algorithm [25] provided by the Boost Compute library [21]. Kernel 3 fills the arrays BUCKET_START and BUCKET_END. The kernel compares the hash value of the current particle with the hash value of the previous particle in the sorted array PARTICLE_HASH. If the hash value is different, this indicates the end of the old bucket and the start of a new bucket. The end index and the start index are written to the arrays BUCKET_END and BUCKET_START, respectively. In the array BUCKET_END, minus one indicates that this bucket has no particles and the relevant loop through the particles will not be performed. This kernel also performs computations on a thread per particle basis.
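The following OpenCL C sketch illustrates Kernels 1 and 3 of this scheme. The helper cell_hash() is a placeholder for one of the hash functions discussed below (here a commonly used prime-based mapping), the array names mirror those of Figure 1, and the kernel signatures are illustrative assumptions rather than the exact code of the developed software.

```c
/* Minimal OpenCL C sketch of Kernels 1 and 3 (double precision assumed). */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

/* Placeholder spatial hash; any of the functions discussed in the text
 * (uniform, Morton or prime-based) can be substituted here. */
uint cell_hash(int3 c, uint table_size)
{
    return ((uint)c.x * 73856093u ^ (uint)c.y * 19349663u ^
            (uint)c.z * 83492791u) % table_size;
}

/* Kernel 1: one work-item per particle computes the hash of the cell that
 * contains the particle and stores the pair (id, hash). */
__kernel void compute_hashes(__global const double4 *pos,
                             __global uint *particle_id,
                             __global uint *particle_hash,
                             double cell_size, uint table_size)
{
    uint i = get_global_id(0);
    int3 cell = (int3)((int)floor(pos[i].x / cell_size),
                       (int)floor(pos[i].y / cell_size),
                       (int)floor(pos[i].z / cell_size));
    particle_id[i]   = i;
    particle_hash[i] = cell_hash(cell, table_size);
}

/* Kernel 3 (after sorting by hash): a bucket starts where the hash differs
 * from the previous particle's hash and ends where it differs from the next
 * one. bucket_end is assumed to be pre-filled with -1 for empty buckets. */
__kernel void find_bucket_bounds(__global const uint *particle_hash,
                                 __global int *bucket_start,
                                 __global int *bucket_end,
                                 uint n_particles)
{
    uint i = get_global_id(0);
    uint h = particle_hash[i];
    if (i == 0 || h != particle_hash[i - 1]) bucket_start[h] = i;
    if (i == n_particles - 1 || h != particle_hash[i + 1]) bucket_end[h] = i + 1;
}
```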


Figure 1
A scheme of the developed GPU algorithm of the infinite grid method.
Kernel 4 performs the final contact check, which is sometimes referred to as the narrow phase of contact search. The contact check of the particular particle is performed against the particles located only in the neighbouring grid cells. Thus, for each particle, hash values of 27 neighbouring cells are calculated by using the considered hash function. It is worth noting that, contrary to the UGM, some hash values of the neighbouring cells might be the same and should be removed to avoid duplicated contact checks. Then, the outer loop is performed over the determined buckets, while the inner loop is run over the particles of the current bucket. These neighbouring particles are accessed by using the indices to the array PARTICLE_ID that are stored in the arrays BUCKET_START and BUCKET_END. When the distance between the checked particles is smaller than the sum of the radii of the particles, the contact is identified. The number of the contacting particles is increased in the array CONTACT_N, while the particle ID is added to the contact list CONTACT_IDS. The arrays CONTACT_N and CONTACT_IDS serve as the output of Kernel 4 and the whole contact search algorithm.
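A minimal sketch of Kernel 4 is given below. It reuses the cell_hash() helper from the previous sketch, removes duplicated bucket hashes before the outer loop and writes the results to CONTACT_N and CONTACT_IDS; the constant MAX_CONTACTS = 12 corresponds to the packed monosized case mentioned above, while the argument list and bounds handling are illustrative assumptions.

```c
/* Minimal sketch of Kernel 4 (narrow phase); names are illustrative. */
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
#define MAX_CONTACTS 12

uint cell_hash(int3 c, uint table_size);   /* see the previous sketch */

__kernel void contact_check(__global const double4 *pos,
                            __global const double *radius,
                            __global const uint *particle_id,
                            __global const int *bucket_start,
                            __global const int *bucket_end,
                            __global uint *contact_n,
                            __global uint *contact_ids,
                            double cell_size, uint table_size)
{
    uint i = get_global_id(0);
    int3 home = (int3)((int)floor(pos[i].x / cell_size),
                       (int)floor(pos[i].y / cell_size),
                       (int)floor(pos[i].z / cell_size));
    uint hashes[27];
    uint n_buckets = 0, n_found = 0;

    /* Collect the hash values of the 27 neighbouring cells and drop
     * duplicates, so that the same bucket is not traversed twice. */
    for (int dz = -1; dz <= 1; ++dz)
      for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            uint h = cell_hash(home + (int3)(dx, dy, dz), table_size);
            uint known = 0;
            for (uint k = 0; k < n_buckets; ++k) if (hashes[k] == h) known = 1;
            if (!known) hashes[n_buckets++] = h;
        }

    for (uint b = 0; b < n_buckets; ++b) {        /* outer loop over buckets  */
        int end = bucket_end[hashes[b]];
        if (end < 0) continue;                    /* -1 marks an empty bucket */
        for (int s = bucket_start[hashes[b]]; s < end; ++s) {
            uint j = particle_id[s];              /* inner loop over bucket   */
            if (j == i) continue;
            double4 d = pos[j] - pos[i];
            double dist2 = d.x * d.x + d.y * d.y + d.z * d.z;
            double sum_r = radius[i] + radius[j];
            if (dist2 < sum_r * sum_r && n_found < MAX_CONTACTS)
                contact_ids[i * MAX_CONTACTS + n_found++] = j;
        }
    }
    contact_n[i] = n_found;                       /* number of found contacts */
}
```
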
The occupancy of the hash table and the number of potential contacts are highly dependent on the employed hash function, which can have a great influence on the required memory and computational performance of the contact search algorithm. The uniform function [28], the Morton function [2] and the universal function based on prime numbers [19] are considered for wrapping the world coordinates into a finite number of buckets. Thus, three different hash functions are considered to investigate the influence of the used hash function on the computational performance of contact search in the case of problem-specific DEM simulations.
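Minimal sketches of the three considered mappings are given below. The prime constants of the universal hash are the ones commonly used in spatial hashing in the spirit of [19], the Morton encoding interleaves 10 bits per coordinate, and the grid extents of the uniform mapping are illustrative assumptions.

```c
/* Sketches of the three considered mappings of a cell index (cx, cy, cz)
 * into a hash table of table_size buckets. Constants are illustrative. */

/* Uniform mapping: the linear cell index of a virtual nx x ny x nz grid,
 * wrapped into the table; nx and ny are illustrative grid extents. */
uint hash_uniform(int3 c, uint table_size)
{
    const uint nx = 1024u, ny = 1024u;
    return ((uint)c.x + (uint)c.y * nx + (uint)c.z * nx * ny) % table_size;
}

/* Universal mapping based on prime numbers: each coordinate is multiplied
 * by a large prime and the products are XOR-ed, in the spirit of [19]. */
uint hash_universal(int3 c, uint table_size)
{
    return ((uint)c.x * 73856093u ^ (uint)c.y * 19349663u ^
            (uint)c.z * 83492791u) % table_size;
}

/* Morton (Z-order) mapping: the bits of the three coordinates are
 * interleaved, so that spatially close cells tend to fall into close
 * buckets; 10 bits per coordinate are used here. */
uint spread_bits(uint v)      /* insert two zero bits between the low 10 bits */
{
    v &= 0x000003ffu;
    v = (v ^ (v << 16)) & 0xff0000ffu;
    v = (v ^ (v <<  8)) & 0x0300f00fu;
    v = (v ^ (v <<  4)) & 0x030c30c3u;
    v = (v ^ (v <<  2)) & 0x09249249u;
    return v;
}

uint hash_morton(int3 c, uint table_size)
{
    uint code = spread_bits((uint)c.x) |
                (spread_bits((uint)c.y) << 1) |
                (spread_bits((uint)c.z) << 2);
    return code % table_size;
}
```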

The Considered Applications
Three different problems are considered to evaluate the computational performance of the developed GPU implementation of contact search for problem-specific DEM simulations. The hopper fill problem has a relatively small and simple solution domain, therefore, it can be efficiently solved by using the UGM for contact search. The solution domains of the hopper discharge and artificial avalanche problems, on the contrary, have a relatively complex shape, involving several regions that require a large number of uniform grid cells. Moreover, in each time instance, the moving particles occupy only a small part of the computational domain, leaving empty regions and limiting the efficiency of standard contact search implementations.
In the case of the hopper fill and the hopper discharge problems, the pyramid-shaped hopper of the same geometry is considered. The dimensions of the hopper geometry are as follows: height 1.0m, width 1.0m, thickness 1.0m and the height of the pyramidal part 0.4m. The width of the orifice is considered to be 0.5m to obtain a faster discharge. At the initial time instant of the hopper fill problem, cubically packed particles are placed in the cubic part of the hopper. In the case of the hopper discharge problem, a large rectangular box, imitating the infinite plane at the bottom, has the following inward dimensions: height 0.5m, width 8.0m and thickness 8.0m.
In the initial state of the hopper discharge problem, the particulate material is obtained from the numerical solution of the hopper fill problem at t=0.5s. Granular material is presented by the assemblies of 117649, 970299, 1906624, 2863288 and 3723875 particles with the radii equal to 0.0100m, 0.0050m, 0.0040m, 0.0035m and 0.0032m, respectively. The material has the bulk particle density of 1290 kg/m³. The other physical data for the particles are as follows: the Poisson's ratio is equal to 0.2, the particle elasticity modulus is equal to 2.36⋅10⁸ Pa, the friction coefficient between the particles is equal to 0.4, while the coefficient of restitution is equal to 0.5. The parameters of the walls are considered to be as follows: the Poisson's ratio is 0.35, the elasticity modulus is 2.63⋅10⁹ Pa and the friction coefficient between the walls and the particles is 0.4.
The avalanche flow is considered for DEM analysis of damping effects. This analysis resulted in large numbers of particles moving in the partially filled computational domain covered by a large number of grid cells, which is a very suitable test case for the developed memory-saving GPU algorithm. The geometry of the considered avalanche problem is given in Figure 2. The main rectangular box has the following inward dimensions: height 9.5m, width 20.0m and length 32.0m. The cuboid of the cubically packed particles falls into the channel with an inclination of 10 degrees. The dimensions of the channel are as follows: width 4.0m, length 12.0m and the height of the walls equal to 5.5m. The left end and the right end of the channel are raised above the bottom of the rectangular box by 6.0m and 4.0m, respectively.
In the case of the avalanche problem, the granular material is presented by the assemblies of 108640, 963600, 1649520, 2053430, 3237300 and 4241517 particles with the radii equal to 0.037m, 0.018m, 0.015m, 0.014m, 0.012m and 0.011m, respectively. The material has the bulk particle density of 920 kg/m³. The other physical data for the particles are as follows: the Poisson's ratio equals 0.3, the particle elasticity modulus is 9.33⋅10⁶ Pa, the friction coefficient between the particles equals 0.05 and the coefficient of restitution equals 0.5. Figure 2 shows the avalanche flow visualized by the ParaView software [12]. The particles are coloured according to the magnitude of the particle velocity. It can be observed that the results of the avalanche flow simulations are highly dependent on the number of particles.

Figure 2
The avalanche flow at t=17s: (a) 108640 particles with R=0.037m; (b) 1649520 particles with R=0.015m.

The Performance Analysis
The considered benchmark problems were solved to evaluate memory consumption and computational performance of the developed memory-saving GPU implementation of contact search. In the frame of the IGM, the application of different hash functions was investigated, and the results were compared to those obtained by using the standard UGM. It is worth noting that the UGM is efficient in the case of monosized and nearly monosized particles. A large heterogeneity ratio can reduce the performance of the contact search based on the UGM. Thus, only monosized particles were considered to perform a correct performance comparison between the UGM and the developed GPU implementation of the IGM in the case of all the considered problems. All double-precision computations were performed on the NVIDIA® Tesla™ P100 GPU Computing Accelerator (56 Streaming Multiprocessors, 1792 FP64 CUDA Cores, 12GB HBM2, 549GB/s memory bandwidth), which was installed on the workstation with the following hardware characteristics: Intel® Xeon™ E5-2630 2.20GHz 2xCPU, 32GB DDR4 2133MHz RAM. The simulation and visualization were performed on the computational infrastructure [13,24] hosted by Vilnius Gediminas Technical University. The memory consumed by contact search based on the IGM was equal to …, … and ….9MB in the cases of the hopper fill problem with 3723875 particles, the hopper discharge problem with 3723875 particles and the artificial avalanche problem with 4241517 particles, respectively, which was up to 1.9% of the total memory required by DEM computations. Therefore, it was difficult to observe the difference between the BASE curve and the curves FIL IGM, DIS IGM and AVA IGM, representing the IGM. Moreover, it was hardly possible to distinguish these curves from each other.
In the case of the hopper fill problem, which has a simple and small solution domain, the UGM consumed only 1.2 times more memory for contact search than the IGM. Therefore, the curves FIL IGM and FIL UGM are also located very close to each other. However, the UGM applied to solve the hopper discharge problem (the curve DIS UGM) and the artificial avalanche problem (the curve AVA UGM) consumed up to 65.1 and 69.7 times more memory for contact search than the IGM, which resulted in 2.2 and 2.3 times more memory for the whole DEM computations, respectively. It is worth noting that the developed GPU implementation reduced the percentage of memory consumed by contact search from 55.2% to 1.86% and from 56.9% to 1.86% of the total benchmark memory in the case of the hopper discharge problem and the artificial avalanche problem, respectively.
The developed implementation significantly reduced the GPU memory required by the standard UGM, which could lead to a decrease in computational performance. The applications of the uniform function [28], the Morton function [2] and the universal function based on prime numbers [19] were investigated to evaluate and improve the computational performance of the developed GPU implementation of the IGM. In Figure 5, the abbreviations UGM, Morton, Uniform and Universal represent the solutions obtained by using the uniform grid method, the Morton hash function, the uniform hash function and the universal hash function, respectively. Initially, the table size was considered to be equal to the number of particles (the abbreviation S1). The memory required by the IGM was negligibly small compared to the memory consumed by the UGM, therefore, there was no sense to consider smaller hash tables. Thus, the use of two (the abbreviation S2) and five (the abbreviation S5) times larger hash tables was investigated to increase the computational performance. The presented results show that the increased size of the hash table allowed for reducing the contact search time in most cases. However, the observed performance increase often became very small in the case of the uniform and Morton functions with table size S5. Therefore, a further increase in the table size did not appear very promising for the considered size of the solved problems.

Figure 6 presents the performance scaling of the developed GPU software, which solved the considered problems, applying various hash functions for contact search. The average execution time of the OpenCL code per time step was measured by computing 40000 time steps. In the legend of Figure 6, the abbreviations have the same meanings as in the previous figures. Figure 6a shows the performance scaling measured by solving the hopper discharge problem and applying various hash functions. As expected, the lowest performance could be observed by using the universal hash function, while the highest performance was achieved by applying the Morton hash function for contact search. Moreover, the largest differences in the computing time could be observed in the cases of the largest numbers of particles.

Figure 5
The performance of contact search for various sizes of the hash tables.

The increase in the hash table size reduced the computing time by 29.9%, 21.8% and 28.1% compared with the computing time measured by using the smaller table size, when solving the hopper fill problem, the hopper discharge problem and the avalanche problem, respectively. It also allowed for achieving large Cundall numbers (FPS × the number of particles) equal to 6.11⋅10⁷, 3.05⋅10⁷ and 4.61⋅10⁷ for the considered problems. Thus, increasing the table size can significantly improve computational performance, especially in the case of large numbers of particles.
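For clarity, the Cundall number quoted above is simply the number of computed time steps per second (FPS) multiplied by the number of particles. The snippet below shows this arithmetic for a hypothetical average step time; the value of 0.061 s is an assumption chosen only to reproduce the order of magnitude of the reported numbers.

```c
#include <stdio.h>

/* Cundall number = (time steps computed per second) x (number of particles).
 * The step time below is a hypothetical measurement used for illustration. */
int main(void)
{
    const double n_particles   = 3723875.0;  /* largest hopper fill assembly       */
    const double step_time_sec = 0.061;      /* hypothetical average time per step */

    double fps     = 1.0 / step_time_sec;    /* time steps computed per second     */
    double cundall = fps * n_particles;

    printf("FPS = %.1f, Cundall number = %.2e\n", fps, cundall);
    return 0;
}
```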

Conclusions
The paper presents the developed memory-saving GPU algorithm of contact search for DEM simulations in computational domains covered by a large number of grid cells and filled by particles changing the occupied regions. The developed GPU algorithm and its OpenCL implementation significantly reduce the memory consumption of the standard UGM. Moreover, the percentage of memory consumed by contact search becomes almost negligible compared to the total benchmark memory, which is very important in the case of computational domains covered by a large number of grid cells. Computational performance of contact search highly depends on the applied hash function and the considered number of particles. The standard UGM demonstrates the highest computational performance for problems with the number of particles slightly exceeding 100000. The IGM with the Morton hash function is recommended when the number of particles approaches one million. The presented performance analysis reveals a high potential of the developed GPU algorithm and OpenCL software, which significantly reduce the memory consumption of contact search and preserve high computational performance, applying the Morton function and the proper hash table size.