NASA/CR-1999-209551 ICASE Report No. 99-36 Parallelization of a Multigrid Incompressible Viscous Cavity Flow Solver Using OpenMP Kevin Roe and Piyush Mehrotra ICASE, Hampton, Virginia Institute for Computer Applications in Science and Engineering NASA Langley Research Center Hampton, VA Operated by Universities Space Research Association National Aeronautics and Space Administration Langley Research Center Prepared for Langley Research Center Hampton, Virginia 23681-2199 under Contract NAS1-97046 September 1999 P ARALLELIZA TION OF A MUL TIGRID INCOMPRESSIBLE VISCOUS CA VITY FLO W SOL VER USING OPENMP KEVIN R OE AND PIYUSH MEHR OTRA Abstract W e describ e a m ultigrid sc heme for solving the viscous incompressible driv en ca vit y problem that has b een parallelized using Op enMP The incremen tal parallelization allo w ed b y Op enMP w as of great help during the parallelization pro cess Results sho w go o d parallel e ciencies for reasonable problem sizes on an SGI Origin Since Op enMP allo w ed us to sp ecify the n um b er of threads and in turn pro cessors at run time w e w ere able to impro v e p erformance when solving on smaller coarser meshes This w as ac complished b y giving eac h pro cessor a more reasonable amoun t of w ork rather than ha ving man y pro cessors w ork on v ery small segmen ts of the data and thereb y adding signi can t o v erhead Key w ords parallel computing SGI Origin Op enMP m ultigrid viscous driv en ca vit y Sub ject classi cation Computer Science In tro duction E ectiv e use of parallel mac hines requires easily main tainable and p ortable program ming mo dels that allo w users to exploit parallelism in applications written in a standard high lev el language MPI pro vides p ortabilit y ho w ev er it can b e more di cult to main tain and is not a high lev el programming mo del High P erformance F ortran HPF is p ortable and fairly easy to main tain Op enMP is also p ortable on shared memory arc hitectures and fairly easy to main tain although it can only b e used on shared memory mac hines Since HPF can b e used on b oth shared and distributed memory mac hines one w ould think that it w ould b e the parallel programming mo del of c hoice Ho w ev er Op enMP has some adv an tages o v er HPF if one is using a shared memory mac hine Most notably the Op enMP mo del allo ws users to incremen tally parallelize their co de whic h can b e in v aluable when parallelizing large co des Viscous uid o w inside a driv en ca vit y is a p opular test case for ev aluating n umerical tec hniques W e describ e the parallelization of a m ultigrid sc heme for solving the viscous incompressible driv en ca vit y problem using Op enMP Tw o dimensional incompressible viscous driv en ca vit y o ws are computed using a lo osely coupled implicit second order cen trally di erenced sc heme Mesh sequencing and a three lev el V cycle m ultigrid solv er with a simple Jacobi in tegration smo other are used in this study Although a m ulti lev el V cycle w ould more lik ely app ear in practice the three lev el V cycle w as used simply to test Op enMP s p erformance In this pap er w e explain the viscous driv en ca vit y problem the go v erning equations and the n umerical metho ds used W e then discuss the strategy used to parallelize the co de T ests w ere conducted to determine the p erformance of Op enMP when an equal n um b er of pro cessors are used for eac h grid lev el and to see if o v erhead could b e remo v ed b y using less pro cessors for solving on coarser grids Finally w e discuss p ossible future w ork in v olving additional tests for the m ultigrid algorithm when more than three grid lev els are allo w ed as w ell as p ossible p erformance comparisons b et w een Op enMP MPI and HPF Problem Domain A driven c avity is de ned as a rectangular ca vit y the cases in this pap er are all done on square ca vities where a plate of in nite length mo v es across the top of the ca vit y from left to This researc h w as supp orted b y the National Aeronautics and Space Administration under NASA Con tract No NAS while the authors w ere in residence at the Institute for Computer Applications in Science and Engineering ICASE NASA Langley Researc h Cen ter Mail Stop C Hampton Virginia fkproe pmg icase edu righ t with non dimensional sp eed U Figure The domain is discretized on a t w o dimensional structured Cartesian mesh Although the co de w as written to allo w for clustering of mesh p oin ts near the w alls the tests conducted w ere all done on meshes with uniform cell spacing U Circulation Fig Squar e c avity with an in nitely long plate moving acr oss the top of the c avity Go v erning Equations The non conserv ativ e form of the go v erning equations for unsteady incom pressible laminar viscous o w is as follo ws u v x y u uu v u P r u t x y x R e v uv v v P r v t x y x R e where R is the Reynolds n um b er e An alternativ e metho d to solving the go v erning equations can a v oid the explicit app earance of pressure b y using the v orticit y and stream function as dep enden t v ariables in the t w o dimensional case The t w o dimensional v orticit y transp ort equation is as follo ws u v r t x y R e where is the v orticit y de ned b y u v y x The stream function in t w o dimensions is de ned b y u and v y x Substituting this in to pro duces the P oisson equation for the stream function r The go v erning equations Equations are discretized using second order expressions for spatial deriv ativ es and a rst order implicit expression in time The equations can b e rewritten in t w o simpli ed forms f uf v f r f g t x y R e r f g where f u v P g P P g g and x y g g u v u u v v r u v x y y x x y x y t R e W e create a lo osely coupled set of equations b y ev aluating f implicitly at the adv anced time step while the non linear terms u v and g are lagged at the previous time step The resulting discretization is as follo ws n n n f f u v f f f f i i j j t x y f f f f f f g i i j j R x y e and f f f f f f g i i j j x y n where f represen ts f g is the discretized form of g n n P P i i x n n P P j j y g g g n where n n n n n n n n u u v v v v u u u v x y j j i i j j i i g g t x x y y n n m n n n u v u v u v u v u v u v x y x y x y x y x y x y j j i i R x y e Also note that n u v and x y n n n n v v u u j j i i n u v x y x y W e can rearrange Equations and in to A f A f A f A f A f B i j i j i j i j i j i j i j i j i j i j i j When w e are solving for u v and the co e cien ts are de ned as t t A i j R x R y e e t t A i j R x x e t t A and i j R y y e n B f t g i j When w e are solving for P and the co e cien ts are de ned as A i j x y A i j x A i j y B g i j No w w e can solv e the linearized system of go v erning equations in the form of Af b A more in depth discussion on the solution of the viscous driv en ca vit y problem can b e seen in other references Numerical Metho ds The o w solv er appro ximately in v erts a p en ta diagonal linear system at eac h time step in an attempt to ha v e the dep enden t v ariables reac h a steady state condition A Symmetric Gauss Seidel SGS iteration is used to appro ximately in v ert Af b A V cycle m ultigrid is used to more quic kly damp lo w frequency errors from the solution ho w ev er the V cycle is curren tly limited to three grid lev els ne medium and coarse In addition grid sequencing is used initially to sp eed the con v ergence of the solution on the nest grid A more detailed description of the m ultigrid algorithm can b e seen in other references P arallelization Strategy Ev en though the m ultigrid co de used here w as initially written without an y though t of b eing parallelized only a few c hanges to the co de w ere required prior to its parallelization Initially a Symmetric Gauss Seidel SGS algorithm w as used to sp eed con v ergence ho w ev er there w ere complications when attempting to parallelize this algorithm a red blac k v ersion of this algorithm w as found to b e unstable Similarly when either a regular or red blac k Gauss Seidel algorithm w as substituted in place of the SGS metho d the solution b ecame unstable Th us a simple Jacobi algorithm w as inserted to replace the SGS algorithm for testing purp oses Aside from this mo di cations to the co de w ere only in the form of parallel directiv es sp ecifying the n um b er of pro cessors to use in parallel sections whic h lo ops to parallelize and whic h v ariables should b e considered shared or priv ate The co de w as incremen tally parallelized whic h aided tremendously in the con v ersion pro cess for t w o reasons First w e did not ha v e to determine the data mapping for all the v ariables in the en tire co de Second using Op enMP in con trast to MPI or HPF made it easier to e cien tly parallelize sections of the co de and determine b ottlenec ks The next test in v olv ed using a di eren t n um b er of pro cessors for eac h grid lev el Although w e assign all a v ailable pro cessors when computing at the nest grid lev el this ma y not yield a su cien t amoun t of w ork for eac h pro cessor when computing at the coarse grid lev el In fact the sync hronization o v erhead when using all pro cessors at a coarser grid lev el will most lik ely degrade o v erall p erformance Since it is in shared memory w e do not ha v e to w orry ab out the cost asso ciated with reshaping the data if w e wish to use less threads pro cessors than at the ne grid lev el in Op enMP w e can assign a di eren t n um b er of pro cessors to eac h grid lev el in the m ultigrid algorithm to b etter optimize p erformance Figure This co de w as written so that w e can sp ecify the n um b er of pro cessors to use for eac h grid lev el in an input le Th us the co de can b e run with a di eren t n um b er of pro cessors p er grid lev el without recompilation Fine Mesh Assign 8 processors to fine mesh Medium Mesh May still be appropriate to assign 8 processors to medium mesh Coarse Mesh Assign only 4 processors to coarse mesh Fig Example Employing a di er ent numb er of pr o c essors for e ach mesh level Results The viscous driv en ca vit y o w solv er w as tested on the SGI Origin at NASA Ames Researc h Cen ter All timings w ere measured using clo c k calls within the co de rather than pro ling in order to remo v e as m uc h o v erhead as p ossible also results w ere the a v erage of ten runs Timings w ere started after initial I O and stopp ed when the solution had con v erged The co de w as compiled with the MIPSpr o f compiler using Op enMP directiv es Results sho w su cien t parallel e ciencies when reasonable problem sizes w ere tested on the SGI Origin T able and Figure sho w that when the coarse grid size is su cien tly large a factor of smaller than the ne grid w e are able to ac hiev e go o d parallel e ciencies When the ne grid is x the medium grid is only x and the coarse grid is only x Because there is little w ork to do for eac h pro cessor it is di cult to obtain m uc h b ene t from using m ultiple pro cessors When the ne grid is increased to x the medium grid is x and the coarse grid is x there is no w more w ork for eac h pro cessor to do ev en when m ultiple pro cessors are used F or the x grid t w o pro cessors yield an excellen t parallel e ciency of and then steadily drops o as more pro cessors are assigned Larger problems con tin ue this trend obtaining a parallel e ciency of for t w o pro cessors in the x ne grid case Also notice that when the co de w as parallelized with Op enMP it main tained go o d single no de p erformance This is an adv an tage o v er HPF since man y curren t HPF compilers are still ha ving di cult y obtaining go o d single no de T able Performanc e of driven c avity pr oblem using Op enMP Fine Grid Size Pro cessors Execution Time sec P arallel E ciencies x x x x x x x x seq 1.0 0.9 0.8 0.7 0.6 0.5 0.4 64x64 Parallel Efficiency 128x128 0.3 256x256 512x512 0.2 0.1 0.0 1 2 4 8 16 Processors Fig Performanc e of driven c avity pr oblem using Op enMP p erformance Initially w e b eliev ed that the p erformance w as lo w er than exp ected b ecause of the Origin s memory placemen t p olicy This is b ecause the arra ys w ere initialized serially th us the arra ys w ere assigned to the pro cessor whic h p erformed the initialization rather than b eing distributed among all the pro cessors whic h subsequen tly op erated on the arra ys The arra y w as then initialized in parallel and although this impro v ed p erformance it w as to a limited degree If only a few iterations w ere run then the impact on p erformance w ould b e signi can t When the co de w as run un til con v ergence on the order of iterations dep ending on the test conditions the impact on p erformance w as small enough to b e within the margin of error for our timings Op enMP also enables the user to sp ecify the n um b er of threads and hence pro cessors for eac h m ultigrid lev el if desired This allo w ed for an optimal n um b er of pro cessors to b e sp eci ed at eac h grid lev el T able and Figure sho w that execution time can b e reduced when tuning the n um b er of pro cessors for eac h grid lev el Although some b ene t w as obtained it w as not as ma jor an impro v emen t as originally hop ed for Plots of the solution to the viscous driv en ca vit y for R using a x mesh can b e seen e in Figure The streamline trace sho ws the main circulation zone close to the cen ter of the ca vit y and t w o recirculation zones in the b ottom corners of the ca vit y The v elo cit y v ector plot also sho ws the main T able Performanc e with a di er ent numb er of pr o c essors set p er grid level Pro cessors Fine Grid Size Execution Time sec P arallel E ciencies ne med coarse x x x x x x x x 1.0 0.9 0.8 0.7 0.6 0.5 64x64 128x128 0.4 256x256 Parallel Efficiency 512x512 0.3 64x64* 128x128* 0.2 256x256* 512x512* 0.1 0.0 1 2 4 8 16 Processors * Using a different # of processors for the grid levels 8/16 processors on fine and medium grid level; 4/8 on coarse level Fig Performanc e when a di er ent numb er of pr o c essors ar e sp e ci e d for e ach multigrid level circulation zone as w ell as the t w o recirculation zones although they are more di cult to see b ecause of the di erences in magnitude Conclusions F uture W ork The results sho w that Op enMP can ac hiev e su cien t parallel p er formance with simple m ultigrid co des The co de w as written suc h that the user can sp ecify the n um b er of pro cessors for eac h grid lev el in the con guration le so that re compilation is not necessary It is conceiv able that a routine could b e incorp orated in to the co de that w ould determine the optimal n um b er of threads to use for eac h grid lev el at run time The next stage w ould b e to remo v e the three lev el restriction and re test Op enMP s parallel p erformance This w ould most lik ely sho w ev en more of a reduction in o v erhead when v ery coarse grids are used relativ e to the ne grid A direct comparison to MPI and HPF equiv alen t co des w ould also b e desired ho w ev er it w ould b e di cult to compare the use of a v ariable n um b er of pro cessors for eac h grid lev el since there w ould b e implemen tation di culties when using MPI or HPF sp eci cally with reshaping of the data at run time whic h w ould b e prohibitiv ely exp ensiv e REFERENCES E A yguade M Gonzalez J Labar t a and X Mar torell Multi level and Dynamic Par al lelism Exploitation in Op enMP D A C CEPBA T ec hnical Rep ort Computer Arc hitecture Departmen t P olytec hnic Univ ersit y of Catalun y a 1 1 0.9 0.9 0.8 0.8 0.7 0.7 0.6 0.6 Y 0.5 Y 0.5 0.4 0.4 0.3 0.3 0.2 0.2 0.1 0.1 0 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 X X Fig Str e amline T r ac e and V elo city V e ctor Plot for R using a x Mesh e Op enMP A Pr op ose d Industry Standar d API for Shar e d Memory Pr o gr amming www op enmp org Oc tob er Op enMP F ortr an Applic ation Pr o gr am Interfac e V ersion www op enmp org Octob er G Golub and J Or tega Scienti c Computing A n Intr o duction with Par al lel Computing Academic Press Inc pp H Nishid a and N Sa tofuka Higher or der Solutions of Squar e Driven Cavity Flow Using a V ariable or der Multi grid Metho d In ternational Journal for Numerical Metho ds in Engineering pp K A Hoffmann and S T Chiang Computational Fluid Dynamics F or Engine ers V olume I Engi neering Education System W A W ood Multigrid Appr o ach to Inc ompr essible Visc ous Cavity Flows NASA TM W L Briggs A Multigrid T utorial So ciet y for Industrial and Applied Mathematics P Wesseling Intr o duction to multigrid metho ds ICASE Rep ort No D L Sond ak and J Perr y A Study of Memory Plac ement on an SGI Origin Pro ceedings of the th SIAM Conference on P arallel Pro cessing Silicon Graphics Inc Origin and Onyx Performanc e T uning and Optimization Guide Do c umen t Num b er