The Group Membership Problem in Asynchronous Systems Aleta Marie Ricciardi Ph.D Thesis 92-1313 November1992 Department of Computer Science Cornell University Ithaca, NY 14853-7501 THE GW)VP MEMflERSHIP PW)flLENI IN AsYNcHR?NOUS SYSTEMS N' Dissertatioii Prcscnt?:?d to the Faculty of the Graduate School of Cornell University in Partial Fulfilinient of the Requiren?ents for the Degre?? of Doctor of Philosophy by Aleta Niarie Ricciardi January i9?j:?? Qc Ali4a Niarie Ricciarcli i??9;? Ai?L R1CH?1?? RESERVED THE GROUP MEMBERSHIP PRORLEM IN ASYNCFlllONOIJS SYSTEN`IS Aleta Marie R?cciardi, Ph.D. (??ornell I niversity 1 99.? The thesis fornially defines the class of I roc s Group Afeti?bc??hip Frobienis (GMP) for asynchronons systen)s These problerris involve rnaintalning a list of processes belonging to the systeni, and npdating it as processes join (are started) and leave (terriiinate or fail). \Ninvestigate closely the strongest nieinber of the ()1MP class. Strong (?MP presents this list in a consistent n?anner to all processes nsing it: the sequence of joins and leaves are identical. \&`e show that despite preva- lent beliefs, strong consistency and efficiency are not conflicting goals. This shonld have significant in?plications f??r distribrited systen?s since the need for process fl)en-)- bership agreement arises in n?any canonical problen?s in distribnted con-ipnting. ?Ve present an inexpensive n?eans (the ?-GNiP algorithn?) of assnring con?plete systen?-'vide agreen?ent on process rrien?bership. NVe discnss the role of process nienibership in distribnted systenis and how to ilse S-GNIP to bnild a Menibership Resoiirce Manager (NIRM). Flie thesis also exaniines whether any w?,aker n?ember of the (.?N4P class suf- fices to specify an NIRNi. In doing so we jnstify using Strong C?NIP over two u?iich weaker C??NtP instana?s in three in?portant ways. P?irst., by con?paring Strong ?;xip and its iuinin?al sohition with tlie weaker instances and their niinirnal sollitions. we arrive at the surJ)rising restilt that Strong ??1Ni1? is often less expensive than the others, notably iii executions in which rriei?bership changes are freqnent. Sec- ond, we show that a n?ei?bership service defined by Strong (4NIP is n?ore robust, n?ore responsive. and n?ore adaptable than a niernbership service defined by weaker GN4P instances. Third, we conipare n?ei?bership services defined by the various GNI Ps according to the utility each service provides higher-level, distributed appli- cations. `I?hat is., ignoring irnplen?entation costs, how useful are the different C-?Ni l? consistency guarante(?s as a platforn? on which to build distributed soliitioiis to dis- tributed problenis? N\e show that the (onsistency guarantees of Strong C?1NIP u?ake it (i.e. anci the n?enibership service Strong C1? NIP defines) niore useful to higher-level distributed applications. ?in ally the thesis presents experimental results fron? in?pleriientings -c AIP. The data den?onstrate that a centralized Xleinbership Resource Nlanager is a non- iutru sive service around which to design distributed systen?5 and provide systeln- wide consistency. The data quantify aiid help clarify the tradeoffs between replica- tion degree overall s',?sten? size and process failure frequency. `!?hese initial results should guide fi?iture NiRN4 desi?n and developriient. Biographical Sketch Aleta Ricciardi ?vas born in the wilds of the Rronx., NY ---H without doubt responsible for her love of outdoor sports. Her future as a cofl)puter scientist was flited when tier father brought lionie a vintage lf3Ni teletype circa ?967 and n?ade a valiant stab at teaching her i?ortran., Cobol., and all about r. She raised ulore than her fair share of aitch-ec-double-ell in her ft?rrnative years (which, according to her parents, lasted far longer tliait tlie national average) but nonetheless i?an aged to graduate froni Cornell Jlniversity in ??.)?I., taking refuge in 13iology arid Niatheniatics until key punches had been rel(?gated to the dustbin of history. Niceting l?rakash Panangaden on a porch sorriewliere, sonie night, iii Collegetown awoke li(?i' early nieniories of Computer Science and she left lier wild dQys for the monastic existence of graduate school. As is only fitting, she met Xlichael while cycling around Caynga Lake on the only nice d??y of the Summer of `92. ill ??o rTiy parents Niarie and F?niii, and to Bngs Bnnny Iv Acknowledgements ?his ?vork ?vould not have been possible ?itliout the gilidance support entli'isiasn?. and generosity of I?en t3irnian. I&en?s coi?fidence has been a constant so'irce of inspiration and his patience in giiiding n?y evolution has been rernarkabli?. l&eith N4arzullo, too., has been both friend and advisor - I have benefited greatly fioni (and enjoyed infinitely) our many di?cussions., technical and political, from the lunatic fringe. NN'?ithout I?en's and T?eith's insights ?armth., and genuine (x)ncern my experiences here ?vould have been severely diminished. l?rakash Panangaden and Bob Constable gave me early encouragement and guidance., as well as the initial desire to pursue these studies. Robbert van Renesse has been inval'iabh iii all technical matters. Niore than any other, Robbert `s assiStance in the implementation stage was unflinching and enthusiastic; I have l(?arned a great deal from lii in. I liave also benefited from niany discussions (and bagels) with Pat ?tephenson, Brad Glade Nlike Reiter arid the rest of the Isis crowd who have been both friends and helpful colleagues. I??m Bressoud has been the perfect office-ii?ate; I cannot begin to thank hini for his patience, knowledge, and for laughing at all my dumb jokes. f)avid Cooper proof- v read the entire thesis and gave n?an,\ valuabli coi?rnent. Niany thanks also to our technical support staff, especially John Vinley, Doug Flanagan, Anne Louise C?ocke1, and ()J. I have also been fi?rtunate to have received funding fron? JR XI, the Sieniens Corporation, N?NS?A, and C?1eneral Electric, Rruce Donald has been a continual joy a dear friend ?vhose kindness and sense of humor have added brilliantly to my lifi?. From Dan Huttenlocher I have gotten ?varmth, laughter, generosity and a great deal of confidence. Daniela Ri's, in addition to her more obvious ?vorld-class talents, is als?? a ?vonderful listener and ?best woman'. Niichelle Duncan arid I have been through many interesting arid important events quite fittingly, all electronically. i?hanks too, to ?he NVomen's Cabal'. Joseph Nlooi.e `vas a tremendous source of corrifort; he made sure niy life was as filled with love, fun, aiid music as it was ?vith work rio small task. Niari ()rser and `Avigail F?isenberg, too, have been dear arid cheri shed friends. Susie Fortebraccie" Armstrong and I have lc?gged many kilometers in tiie saddle arid enjoyed every moment, which kind of goes without saying. Jn the ?`XIoldy Oldies" category are I)eirdre ?aub and Nathan Yoffa both have helped and watched me grow and made nie laugh for rilaily, many years. Chuck lYiller and Mark Theodore also occupy very special places. Above all my family, especially my parents and brothers, have given iue iriore support, encouragement, and inspiration than 1 believe they'll ever understand. I have felt equally and constantly their faith iri rrie, their love, and their pride iri ruy vi acconi??1ishi?ent?. t??ina'1y, Niichael Ogg ilas n?ade what ?iio'i1d have b(?efl a very trying period SC) very beaiitiful and exciting. He lias shown inc such wondrous joys I often find niyself beyond words. vii Table of Contents 1 Introduction 1 1 Exan?ples 1.2 N'Ien?bership ?;?????t???? and Distributed Systein J)esign 1.2.1 In?plicatioris for Distributed Systei?s 1.2.2 N4R-Ni Design Considerations . . 1.3 Related \Vork . . . . . . 1.1 ?liesis Contiibutions . . . . . . . 1.? Uliesis Orgai?ization 2 The System Model and Formal Logic 2.1 ?ysten? R?qitiren?euts and Niodel Assun?ptions 2.2 i?lie ?ystern Model 2.3 The Niodal l?ogic (lsyflc 2.3.-I Syut. J?acking better notation we also write c E c to abbreviate ?6 E 1)1) for sonic p nientioned in c'? and elaborate when tile context does not clearly distiT gnish the intention. Eron? I?ai?port [Lani7S] we give the following definitions of event caiisality and gh?bal, cansal coiisistencv. Defii?ition Events 6 and c' are irnn?cdiatc1i,? causally related in, eat c (written 6 ?+? e') if and only if 2,This is pi?reiy fr)r e?'se of expositioli. 24 I. 6 and (-I 1?tl ?? son?e con?I)one1)t l)p of c; or 2 c --H--H send?(q n?)? ? --H--H recvq(p n?) 6 C 6 and c;? c c. NVe ?vrite 6 Hc 6? to denote the binary relation 1?apprn.s befor? which is the reflexive transitive ch?snre of --H->? over the events of c. I Observe that, iii the abstract, send and receive events are always ( i. 6?, in all cnts) -related while oth??r events derive the relation fron? their hi stori cal context. \N?e will on?it the snbscript when the particular cut is clear froi? context and simply write 6 H 6 I)eflnition ,N c'it c is cousaily consistcnt if and only if whenever 61 c c and 6 in c, then 6 C c. I i:ignre 2.2 illustrates both a causally consistent cut, and one that does not respect cansality. Henceforth we restri(?t the discussion to consistent cuts, as they are the ones that are physically realizable. Consistent cuts represent the possible global states of an execution; while a giver) consistent cut need not have existed at any point ill real tirne in an execution, it is in?possible for a cut that is not causally consistent to ever exist at any point in i?eal tiiiie. A characterization of global, systeni-wide causality should incorporate the no- tion of progress between global states. .%pecifically, we desire that every process either riiakes local progress or ren?ains stationary, none should regress. A process rnakes local progress between the curnii ? ive states represented by hp and ii? ex- actly when liJ) is a prefix of h'p a N. (?onsistent k)ut: Vc H : (c! c ? ( C c) e c ?. Incon?istei?t Cut: c ? e'; ? C c; ? Fi?iire 2. ?i: (i?ausalty (?on?'istent aii&I Inco'i?istent Cuts. 26 1?1 Defii?ition Given c (h1 hn) and c' (?!i h'?), c caus(i-liy precrd?s c? (?vritten C ? rI) if and only if for each process, p, 1. bp h'?; or 2. /?? is a strict prelix of h'p ()?serve that there are (infiiiitely) iii any corripletions (or successors) for any given cut. In this sense, the future of any cut is uncertain; it niay branch out in many possible directions. ()n the other hand c ? c' implies that any execution in which c1is a prefix n?ust also contain c as a prefix. I)efinition lAt c (h1 , . . . , h1l??) and e' (k'1 . . . , h'p . , J?'n) be consistent cuts. ??hen 1. r strictly cansally pr?crdcs c? (written c ? c') if and only if c ? r1 and c ? CI (as in Figure 2.2N); 2. c icry strictly p ?;c?des c' (written c ? cl) if and only if l?? is a strict j?refix of for eaci mentioned (Figure 2.2.13). 2.3 The Modal Logic asynr N?Te now introdnce a formal laugnage, async , with which we will define the class of Asynchronons Pro(:ess Group N4einbership problems. a?ync is a combination of temporal and epistemic logics. Uni?:jue to this combination is its attention to asynchrony the basic semantic entities of the logic are consistent cuts. lii her 27 c A. c ? c' and c ? c c 13. c ? c': J?p i.? a st?4ct puJix of /?I `v. Vig'ire 2.2: Causal 13inary R?1ations on Consistent Cuts. 2S doctoral thesis [?i?y9O]? Taylor defined C?onc?irrent Fp?steir?ic Logi(:, Not only are forTnnl&) of her logic eval1?ated along consistent cilts., hnt she defines a new, weaker n?odal operator that embodies systen? asynchrony and captnres the resniting process `in certainty. The ten?poral logi? con?ponent of ?,?y??,. arises fro?n algebraically analyzing the set of consistent cnts in an asynclironons execntion, and then iriatching each of the individnal algebraic properties with specific characteristic axionis of tense logic. Consistent ctits and the binary relation ? defined in (?hapter 2 fi?rin a downward- closed lattice. See [Ric9Ol for a proof that the tense logic arising froi? the corre- sponding characteristic axtonis is complete for tile class of l?ripke niodels that are downward-closed latti ces 2.3.1 Syntax of t'async NVell-fornied forn?nlae 0f asyflc are constrncted froni a conntable set of prinlitive propositions Ia I a'2, .. J fonr niodal operators, and the set of process identifiers Proc as follows: 4 a I ?i A Y2 ?Y' I I ??)Q' A%y NVe `ise the standard cil?breviations v, ?. ? and ? fi?r readability. The ??nglish langnage interpretations of tile tense niodalities ?, E, ? and ? are, respectively, ??at all fiitiire points' (henceforth), `at all points in the past", "at sonic fntnre point", ? and ?at some point in the past??. ??he English interpretations of the the duality iiiterpretation for ?,` rather than the branching time iiiterpretatioa. lE branching tii?e,' ? is interpreted as iiievitabi!ity' at sonie future point in every 29 episten?ic modalities i?)) and ???? are `?p kno?s"., and "in a state indistinguishable to 2.3.2 Semantics of flasyr?e I?ripke n?odel sen?antics [I&ri63] of niodal logics requires an occ ib?i?tq relat?on on the basic sen?antic entities of the niodel. For ?sync the teniporal operators' accessibility relation is ? - causality bet??een consistent cuts. For the epistenlic operators ?e take the follo?ving definitions from [?)iy9()] Defii?itioi? [faylor] (luts c ( 11p 1?PLhp7 ) and c' ., . . . ` h'p? ) are ifldt%?tifl!]uishabi? to process j)j (?vritten c ? c1) exactVy ?vhen hpj Fori?ul?e are evahiated along consistent cuts of an asynclironous execution. An ?ni?rpret?tion for an infinite asynchroiious run 4? 1A assigns to each prin?itive proposition a subset of consistent cuts of A N'Ve ?vrite : ?aI.,a2,.. ,? 2A (A, c) ? (tj exactly when c C 1?(ai), but will oniit the run, and simply write c ? Q,', when it is clear fi'on? context. A consistent cut c satisfies a well-for?ned lorn?ila, k' (written c ? ?") according to the following structural rules : ?Q iff c ? 6; 0 $` :91 A9'9 iff ? ? 6i and c ? :30 o+ : E)y ilf fc?rall (! ? c. w ?5); o+ : ?Q iff forall c' ? c.(c ? o+ ?): ?? fflff forall c! ? c.(c W o+ .4,: Dy" iff in all runs for ?vhicli c is a 1?refix,exists c' > c. ? o+ : ?y' iff in all rnns for ?vhich c is a prefix,exists (? > c ? o+ 4"': Pp? iff exists c' C A (! p ?` ((A cl) w o+ 4): A))Y' iff forall c' ? c?(c' W 2.3.3 Inference Rules and Axioms Let Q, and `? be well-fornied fornuil?T of ?a?ync ??lie Inference R?nles of async are: Modus Ponens ? ? H ??. Temporal Generalization ? F ?y',, and y k ?y'. Epistemic Generalization ? k A?i?y' ??he axioi?s of ?asyflc begin with all substitutional ii) stances of trtith-fnnctional tantologies. J?et Q' and `?j) be well-fori??ed forninla?. ?be teniporal axion?s are: 3j TOa (Z(Q ? ? ? E4) Tob ? TOc ?(? ? > ? E]'?) Tod ? Tia E]q? ? Tib ?q$ E?y' T2a E)? ? T2b E]? ? T3 ?ffi V?HL T4 ?? _ T5a (?? A D4?) ? ?(?Q' A ??) T5b (?9' A ?V) ? ??o' A ?`?`) _ T6b?Q??o' I?lie cpi sten?i c ?xion? s (?VC frorri [?Fay9O] KO A)) (5 ? (j? Ki A?(9 ? ? ( A)) 0,' ? K 2 A)) 0,' ? Kp A)) ? K 3 A?p 0,' ? kp A)) 5 K4 o,' ? Ppo,' K5 PpPp0,' ? Pp? K6 Fpmi4?m5 ? Pp0' K7 PpQ,' ? ?`Th)?5 Chapter 3 Strong Group Membership Jn this chapter, ?? define formally tile Strong Ci1ronp N1en?bership ProJ?len? for asynclironons systenis. ?he definition sfionld specifjy how to coordinate local events among a gronp of processes so that the externally observable behavior of the gronp (that is, the behavior observed by a process not in the group) is that of a single, fantt-tolerant process. The ultimate goal is for any syst(?ni process to be able to query core NI RN1 rnen ibers' l(?cal views and observe `?one-copy ` [BUGS7b] behavior on the sequence of view reports so-obtained. Because responses to queries wilt be taken as refiectiiig the exact systen? coniposition, we will want to ensure that NIRNI core n?en?bers have ?-dr nt,cqi sequences of view transitions. ?ailed core niei??bers will observe only a prefix of all view transitions, but their local views when they are operational i?ust not be perniitted to diverge. Niore in?poi'tant, core n?embers that have been ousted but are not crashed, must not be pern?itted to misrepresent the system state. ??or example, we wish to preclnde executions in which a functioning :3:3 process that is no longer a core n?en-iber is able to change its local vie?v iiidepen- dently; if that process were qneried? its response conid diffc?r fron? the response a trne core nieinber (i.e. one that has been forced to coordinate changes to its local view) wonid give. 3.1 Formal Specification I?) specify the problen? forn?ally. we define the following events, logical fc?rii'nlt'e, and sets of processes. For clarity, we will always nse en?phasizcd text for events SMALL CAPS for h?gical forn?nl?, and sans serif for process sets. ?he forn?nla UP1) holds along a ci't c if and only if p has not executed qut'!13 in its local history, h1). (1onversely, DO?'N1) holds along c exactly when p has crashed in c. 4?he iiidexical set Up(c) in an asynchrono'is run 4 is the set of all processes that have not crashed along c Up(c) ?p c k UPpt Proc p executes the internal eveilt j)iai!?]p(q) as soon as it suspects q faulty; whether p conies to suspect q through son?e local observation or through gos- sip is in?i?aterial. ?on?e tii?e after recording f(inItyp(q), p will execute the event r?n1ovep(q). Notice the distinction bet???en the events ffl?uit?j1j(q) and rc'n?ovep(q); the for?ner represents ps belief in q's faultiness, which rnay be incorrect, while the latter is actual ren?oval of q fioiu the set of core rncn?bers p believes opera- tional. `l?he forn?ula FA?LTY?(q) holds along all cuts th?t include fciuify?(q), and `By deflnitioii quit? ?viiI be the l?st event ()f p's history in tile ciit. 3,4 RE?iOVE1?(q) along all cuts that include rcn?ovcp(q). Analogous staten?ents hold for events opcrat?nyp(q) and addp(q) arid forniula oPERATINGp(q) ADD?(q). 4?he local n?ei??bcr?h?p Vic?V for process p cut c (A1,An), (denoted LocaIView?(c)), is the set of processes p obtains by sequentially ruoditying its ini- tial n?enibership list according to the rc??ovep() and add?( ) events in 1) NVe use LocaIView? when the cut is clear fiorn context. ?Yivially, we require p C LocaIView?(c). ??he fc?rrnula IN-LocAL?(q) holds along all cuts, c, such that q E LocalView73(c). NVhen DO?VN)?) holds ah?ng c, LocaIVi'ew?(c) is undefined. Because Ap is linear, it niakes sense to talk about the ?th version of j)'5 local view, which we denote Localviewf;. Thinally 1N?LocAL?x(q) holds when q C LocaIVeWf? ?Ve extend local views to group views as follows. Given S c Proc arid a con- sistent cut c, if the l(?cal views of all the functional processes in S are identical, the group view is the agreed-upon local view; if S has no functioning members, the group view is enipty; and if the functionin? members of S have different local views, the group view is undefined. \N'e say that S dete?tiines a group view. Formally Definition Given a ?onsistent cut c and a set of processes, S c Proc the group V??u" determined by 5 along c is SflUp(c)--H? LocalView13(c) A(?. q C S A uP(c)) (LocalViewp(c) LocalViewi(c)) un defined otherwise. Gpviews(c) :35 Lastly the fornnjla IN?CPp holds along all ents, c, snch that p C Gpviews(c); OUT?CPp holds ?vhen p ? Gpviews(c) Observe that precise defi?ition of this for- rniila ar?ticipates liniqileness of the group vie?v along all consister?t cuts. 3.1.1 More About GpViews(c) 1'he definition of Gpviews(c) is crncial to the class of C?roup Nienibership Problen?s so it is ?vorth?vhile discussing a subtle point: how the sets S and Gpviews(c) relate. Recall that Gpviews(c) is the abstraction we are using to define the ultin?ate goat ---H the single, fault-tol?want process illusion we will use to build our NIRNI. In this light, NIR?I core rnern!?ers are precisely the n?ernbers of Gpviews(c), so any n?ernber of Gpviews(c) should be able to reply to ?1R1Ni client requests. Now if q E (GpViews(c) fl then q is a core n?en?ber whose local view is not used in deterniining the NIRNI composition and systen? state: specifically, q's local view is not constrained by the definition of GpViews(c), so LocalViewq(c) need not be ideiitical to GpviewS(c).2 Because q replies to NIRNi client requests based on its local view, its replies will contradict other core rner?bers' replies when LocalViewq(c) # Gpviews(c). ?llius the single-process ilbision falls apart, and the NIRNI fliQy easil?, fi?ll short of its stated aiiii in assuring global colisistency, when every core n?eriiber's l(?cal view is not a?coiiuted for in detern?ining the NIRNI group view. To avoid this, we iieed a clause in the specification requiring q to be iii S when- 21n practice, a core n?eiiiber's iocal vie?v includes in additioii to the NI RNI composition the cot ire systeni process coil iI)ositIoi). :36 ever it is in GpViewS(c) (Gpviews(c) n UP(c)) c n UP(c)) Since p C LocaIView?(c), A Up(c)) c (GPviews(h A uP(c)) and so the desired clause ?ill require S Gpviews(c). Note, too, that the NIRKI must be unique to maintain the single rocess illn- sion. ?`he desired clause ?vill incorporate the above-mentioned equality as ?ell as uniqueness ah?ng all consistent cuts of the set, S, satisfying equality. 3.1.2 Strong GMP Specification ?Ve now have the language a id formal tools necessary to formalize Strong C??NiP. G?P--HO (Base (lase) An initial group view exists along the initial cut: V (S() GPviewc0(5?)). SocProc GNlP--H1 (?%?lidity) Processes do not make changes to their local views capriciously: a. (?iN?LocA?p(q) A iN?LocALp(q)) ? FAUUrY?(q) (??iN?LocALv(q) A iN?LocALp(q)) OPERATI?C1(q). In contrast to fAULTY?(q), oPERATI?c?(q) is riot stable. GNlP-2 (j;niqueness) Non-null group views are unique along all consistent cuts. :37 For all cut? c: v (cPviews(c) s) A UNDEF'D(Gpviewsl(c)) 3(S GNIP-3 (Sequei?ce) ?li processes exhibit the sa?ie sequence of local vie?vs cori- teu?poraueousty provided tile views are defined: UNDEF'D(LOCaiVIeWp) I??LocAL1X;(q) ? DOWNq V (LocaIViewq LocaIViewpr?) GNlP--H4 (Liveness) For each eveut ffiuityp(q) (respe , operai?ngp(q)) and each process 1) cGpviewr?, eventually either p is ren?oved froi? the group view, or q is re?ioved fion? it (respective, added to it): a. FAULTY?(q) A IN?GPp ? (?oUT?cPq v ?nLT?GPp) b. oPLR?NfINGp(q) A IN?GPp ? v ?oUT?GPp). (;?NIJ?-3 is equivalent to requiring that each local view eventually becot?e a group view. `flie presence of ? forces a group view to exist along sonie consistent cut in an execution. `Fhis too, is why we cannot bind LocalViewq to a version nuiiiber. Local views when indexed by version nnn?bers are static --H- the couiposition of a process 5 j'th local view will never change. So suppose c is the witness for ?. Then oi?itting the version superscript forces LocalViewq (c) (for q E LOCaIVieWf?) to be identical to LocaIView1X; at least along c Had we iiicliided the version nnn?ber in the equality clause, we would not have been able to conclude that group views necessarily exist, since the local views Tieed not have been identical siniultaneously. Finally ?i'ncc each process execntes at least one event bet?veen local vie?vs w and y $ I the corresponding gronp views ??ill exist along cnts that are ?-related. Hence, it n?akes sense to talk abont the ylh gronp view, ?vhich we denote Gpview'r Chapter 4 A Protocol Solving Strong GMP This chapter presents a distrib'ited protocol that solves ?trong (11NIP and therefore achieves the single. fanlt-tolerant process ilhision ?ve discnssed in Section 1.2.2.1. The trong G1ronp Nlen?bership Protocol (hereafter s-c?IP) is both asylfln?etrlc and centralized: a distingnished core rneniber? denoted ?gr Is responsible fc?r coordinating updates to the outer (i.e uoil-n7gr ) core inenibers' local viev?s. In a syn?i?etric, distribn ted sol?ition all core 1uei?bers ?:onld behave identi c to denote an update to Gpviewr'--Hl in which ?` (the value proposed) is a set of process identifiers. NVhen the version nun?ber is clear fron? context we sin?ply write ??. \Ve will also sometimes want to identify the specific processes in t'. 4.1 Simple S-GMP ?Vhile we assunie in this section that nigj does not fail, the protocol we present is in fact niore coi?pli cated in coni niunication structure and degree of coordination than this assuniption warr'?nts. Indeed, if we kn?u' that nigr could not fail we would already have a single, fault-tolerant process and ?? could solve ?trong G?1P with a reliable broadcast. Instead, we anticipate rn?r' `s faihin to sin?plify presenting the reconfiguration algorithn'i in Section 4.2. \?`?hen ni?r becon?es aware of an outer nieiuber's, say `r's, failnie, it initiates a two-phase update algorithn?. In Phase 1 (Vigure 4.1) fli9T proposes r's renioval by 42 Ph(1sc I f)h?se if Tflflr M-sub(--Hr) M-com(--Hr) Figure 4.]: `1??vo-Pliase (?oi?mnnication Structnre of ?in?ple s-c;?iP. ninhicastirig2 a s'ubn??t niessage, M--Hsub(--H?), to the nienibers of its local view and awaits each outer core niertiber's response or its own i?elief in an c?iiter riiert'1?er's fariltiness, In this way all core nienibers that are not believed faulty by nig? believe r faulty at the (?ud of Phase I. If nigr receives responses Thoni a niajority of its current local view-, it niulticasts a conin?tI niessage, M--Hcom( --HT'), in 1?hase 1 j,:i If n?gr does not receive a niajority response. it `nust block. If local views are identical at the outset of Sinipte s-c;?ip then, because nigr is a single process, local views are identical at the end of Siniple S--HCNiP. fhe subniit rnessage coordinates belief ainong the group iri r's faultiness; the coniniit niessage tells outer n?ernbers that the group has reached (iqree'ncTit 0ri r's ?\4 ulticasts are iiot failure atoniic 3?I)vpicaIIy a phase of coijiuluilication consists of a n?'ilticast fron? a sii?gIe process t() a group of processes anci their responses back to the niessage initiator. in fact, Siiiiple s-c?N.iP is one?aod one?haIf phases but this is awkward. 43 Phase I (??n?presscd I?hase 11 and Phase I fl?J(? q N-sub( --Hr) M-com(?r):M-sub(--Hq) M-com(--Hq) A A 4 L crasli Fignre .2. ( onipressing Siic???ssive !i?stances of ?inipli s-e??iP. faihire and that they shonid now remove r' froni their own local views. `I?hi? agree- rnent., however, need not he con?plete: becanse ?7?gT does riot receive responses frorri onter members it l)eli??ves faulty, it does not know whether these n?enibers received its snhmit n?essage. Even though such a n?en?ber iiiay receive u?gr 5 messages, rngr certainly has no way of knowing this. F oni n?gr `s perspective, these riiernbers n?Qy not be aware of the current update to tlie group vie?v, rendering con?plete, core-wide agreement on the new view contingent upon the subsequent reu?ovat of these faulty core nienibers. Finally, Gossip ensures that operational outer processes become aware of sud? contingencies. observe that tlie I?hase I subriiit niessage is unnecessary if ??gr knows a niajor- 44 ity of the non-fi?nlty onter processes already believe a process, 5Qy q, fa'ilty ??hat is, rr?gT need not disseniinate its own belief in q's fanltii'ess if it knows that the onter n?ernbers already believe the san?e thing. in this light, the contingent npdates when piggy-backed ni?on a con?rnit fliessage, serve as the snbrnit fl?essage f??r the next view change. NVe can thns coflipress snccessive instances of Sin?ple S-GNIP if n-i?r niakes known ?vhen it multicasts the Phase it con?niit message. exacttv how it plans to change the group view next. Vor exaniple? in Figure 4.2, process q crashes before reponding to M--Hsub(--HT) causing nif]r to suspect q faulty. Hy appending M--Hsub(--Hq) to its con?n?it niessage M--Hcom(?'r), migr indicates that it will iiext at- tenipt to reniove q from the group view. Outer processes respond to the cou?bined con?niit-submit niessage as they do to plain subn?it niessages. 4.2 F'ull S-GMP \`N?hen n?gr fails (or is believed to have failed) the outer nieinbers execute a rccon- fiqur??fion algorifAni to select a neiv coordiiiator and, if necessary, reestabhsh the group vie?v. Fo see h()w this illay arise, coiisider Figure 1.3 in which I?gr fails in the riiiddle of a conini?t niulticast; because local views differ along the second cut, the group view is undefined. R?configuri ug safely and successfully involves solving two problenis: s access?'on ?vhich process(es) should initiate reconfiguration and which should assilnie the n?!]r role at the end; and progression which update should a reconfiguration niti ator propose to n?solve niembers' inconsistencies and niai ntain safety? ln practical tern?s. a correct reconfiguration depends on an initiator's ability to 45 mgr I? q M-com(?r) LocaIView?'?tI LocaIView'? Gpviewr Figure 4.3: nig'r `s Failure Results iii I?ndefiued Croup \7ie?v. deteru?iue the last defined group view and propagate the correct s'ibniission for the succeeding group vie?v. As P??igiire 4.3 slloivs., proposals n?ay also be partially known arnong the current group view, arid this caii pose problents in reconfiguration. `fhe n?ost intriguing and di fliciiit aspect of reconfiguration is accounting for in- tisibic coniniits. An invisible coi?niit is an update that is coi?uiitted by a set of processes, but can never be known to have been con?n?itted by then?. An invis- ible coi?ryiit occurs when the only processes receiving a corrin?it iliessage fail, or are believed faulty by the rest of the group. i?his is significant fiw reconfiguration: while no subsequent reconfiguration initi ator will ever An o ?t whether these processes received the coniniit n?essages and updated their local views, Strong CXlP (specif- icalky, c'1Nl F?-:3) requires that if an invisible coinriut did occur, the reniaining core ruenibers n?ust behave in a manner consistent with that event. It is iruperative, then, that every invisibly coniniitted update be detectable by every recorifigurer. Fhis is the core issue in nialutaining C;NlP-3. NVe can ensure this only if all initia- 46 tors (?lietlier n?gr or a reconfigurer) atterripting to install the yth group view vie f()r the reqnisite n?ajority responses froni aniong the n?e set of processes 4.2.1 Structure ofthe Reconfiguration Algorithm tnlike the n?gr -initiated algorithni, reconfiguration soinetin?e requires three phases In actuality three phases are necessary only in the ?vorst and highly unlikely case., but fc?r simplicit?' ?ve p - ent the algoritlim here as three phases and discuss the cases when t?vo suflic? in lapter 6. That reconfiguring is three-phase is interesting and important, though not surprising in light of Skeen's work on non-blocking commit protocols [SkeS2l. In the first phase, the initiator r, broadcasts a reconfiguration int(c'r?vgotc message, R--Hint(ver(r)), to all processes in its local view and awaits their responses or its own belief in their faultiness. Core members respond to the inter- rogate message with their current, local state, If a majority have responded, the initiator uses the inforniation it received to determine an update event, say `c, and version number, say o+r. whose execution would result in a well-defined GpView'? The initiator multicasts this event as the Phase II reconfiguration s?ibni?t mes- sage, R--Hsub(? tx >). After obtaining a second majority response, acknowledging receipt of the submission, the initiator broadcasts the Phase III reconfiguration conintiY message. R--Hcom(? v, x ?). Again majority approval of R--Hint(ver(r)) and R--Hsub(? v, x >) are ??ssential in maintainilig Sequence and I?niqueness; without either, the initiator must block. Defii?itioi? An update initiator (either rngr or a reconfigurer) is succ?:ssfni for its submission (M--Hsub(? ?`, x >) or R--Hsub(? `??. x >)) if it can reach the appropriate 47 R--Hint(x --H I) R--Hsub(? --Hm?-qr.?T >) Th A-- r2 Tn-gr ?r1 said.,?R-sub(? `r- ? ___ -_____ ----H--H? R--Hint(x ??ng?. sa1'd?'M?sub(? ??. Y >)- M--Hsub(? ??X >) Figtire 4.4: Recoiifigurer r2 I?earns of -? nilicting Proposals for GpView'?' S-CNlP con-ii?it phase; that is, if a niajority si?bset of its local vie?v respond to the proposal. A subi?itted value, ? v, x ?, is stable if and only if the initiator sub?nitting it is successful. I Iii this light. GNiI?-:3 requires that a-il s??ccessful initi-itors trying to install (or cornplete the installation of) the xth system vie? make identical proposals. ?he local state infor?uation collected dnring reconfiguration Phase I intist allo? the reconfiguration iiiitiator to detertiiitie the correct update proposal unai?bi guonsly. In s-c?iP all successful reconfigurers atten?pting to install (or coniplete the installation of) the o+i-.th group view propagate n-19r 5 proposal if they become aware of it., and propose fl'?fjT 5 renioval if they do i-lot. I?'.nfortunateLv, as Figure 4.?! n-?akes clear asynchrony and inopportune failures can result in there being two different proposals for the sanie instance of the group view. Correctuess requires that only one of theni beconie stable, and that any non-bl(?cking reconfigurer be able to deterri?ine which one it is by the end of reconfiguration Phase I. Hy propagating the stable subn?ission, a reconfigurer forces the entire group to act consistently with any invisible conin?its. 4.2.2 Rules of Succession ?N?e solve the sn(cession problen? by imposing a deterininistic, i?nc--?jr ranking on core merubers based on seniority in the group view -- ?older core n?en?bers are ranked higher.4 \N'?e ilse rank(j?)>rank(q) to denote that j) has higher raiik than q. \Vhenever a process is ren?oved froni the group view, the ranks of all higher- ranked processes are decreased by one. Observe that for as l()ng as p and q are both (conten?poraneously) core n?en?bers their 1.anking rc-1ativ? to each othC r `v 11 not change. A process initiates reconfiguration when it believes all those ranked higher than itself are faulty. [hat is, given cut c and LocaIView?(c), define the local proposition INITIATE( I)-) A $ank(q) rank(P)) FAULTY1) 4?LocaIView?(c) 4?I?1iis does liot i iiean ???e cannot join niiiIti??ie processes to tlie group vieiv in a single update 49 ?%4?le 4.1: Ii?itiating R?configuration: FAULTYp(n2gr ) A FAULTYq(fl?gr ) Process raiik up? FAULTYq(p) INIiIAiE(q) INIT!ATE(p) rai?k(n?gr )= p True False False True rank(p)= fi 1 False False Eventually False rank(q)= p --H 2 True True lYne True False False True False NVhile initiating reconfigtiration on INrrlATE(p) can lead to n?ultipli?, concurrent reconfiguration initiations, it guarantees at h?ast one process ??ill begin the reconfig- uration algorithu?. Consider ??bh 4.2.2 in ?vhich rank(n?qr ) = p, rank(p) = p --H 1, and rauk(q) = --H 2, and both p and q believe u?gr faulty. In the third scenario both p and q initiate reconfigurations. S-cNlP must ensure view uniqueness in the face of i?ultiple, concurrent reconfiguration attempts. In the second scenario q Cx- pects j), ?vhich has crashed, to ini?iate a reconfiguration. Any sohition must also ensure that q eventually comes to suspect p faulty. In S-G?IP, q times-out ?&?ting f:)r ps R--Hint( ) niessage, surmises FAULTYq(p), and then initiates reconfiguration. 4.2.3 Rules of Progression: Maintaining Safety fl? understand the difficulties encountered in reconfiguration we exa ne c;NiP-2 and GNlP-:3 more closely. I5niqueness requires that at most one group view exists along any consistent cut in an asynchronous execution. In the situation depicted in Figure `4.5, Q and R are subsets of Gpviewx?, aiid q and r are both initiating reconfiguration. If all niembers 50 q Q j? -? j?z R-int() = FAULi?Y(??) FAULTY(q) Yigure 4.5: Niajority of Responses Needed of Q believe ? fa?ilty., tl?ey ?vill receive neither r?s R--Hint( ) n?essage., nor its i?iiase II proposal., nor its Phase III corni?it. N'nalogous staten-ients hold for the n?ernbers of R regarding q. If T?'5 j?roposal differs from q's then the rnen?bers of R ?vill commit a different vahie than the men?bers of Q. ?ince Rczipfr?f ?vill e??ntnaliy rerriove all of Q u fqf., and Q U ? qJ ?vill eventnally reii?ove all of R LJ frt t?vo distinct gronp vie?vs ?vill exist. Naively it ?ould appear that the n?ajority reqilirelnent suilices to preclude this occurring. Hoivever as Figure `4:3 niakes clear initiators that fliQy end up installing (snbn?itting and coinn?itting) the san?e group version iieed not have identical local vie??s ??hen they "in dertake reconfiguration. Any solution niust en sure C??4P-2 ?vhen concurrent initiators a?e seeking niajority approval ftou? different sets of processes. 51 M-sub(?vx>) M-coin(?? I? q r R--Hint( --H 1) Vigiire 4.6: Ren?oval of q (lonirnjtted Invisibly to p and r Henceforth., let jtr(c) be the size of the sniallest niajority siibset of LocaIView? (c) and ?` be the size of the si?allest iriQority snbset of LocaIViewX? _______________ ? def LocalViewf' itr(c) (LA' LocalV2iewr(c) t I I?r 2 4.2.4 Reconfiguration Phase I Responses Onter processes responses to R--Hint(ver(r)) niiist allo?v r to detern?ine th( natiire and coniposition of all local vie?v inconsistencies, inclnding inconsistencies iiivolving core niembers that did not respond to ??. ?N?hile local views alone sufhce to detect inconsistencies aniong the processes responding to an initiator, this inforniation falls short of satisfyiiig GNiP-:3 (Seqnence) entirely as invisible coininits are riot detectable. 52 In Fignre 1.6 ? > is conirnitted invisibty to I), q., and r. Since all three have identical lo?al vie?vs, r ?vill riot detect the actnal discrepancy. Ho?vever, p is a?vare of Tng? 5 intention to cornn?it ? ?., X >. and p can envision a sitnation in ?vhich n?gr sncceeded in doing so and then failed (in this case, the sitnation that actually occurred). If p ??ere to forward n?gr `s inteiition to con?init ? v, x ?, v would then envision the san?e situation and propagate ? v, ? > as its Phase II subn?ission. Thus, in addition to its local view, an outer n?eniber iTilist also report how it expects to cliaiige its local view next. 4.2.4.1 Ren?arks Since fault deterniinations are pernianent (c.f. Section 2. I ) all processes, but par- ticularly initiators, should n?ake theu? cautiously. Any core n?eiriher believing a majority of its local view faulty can never succeed in con?rnitting an update. N4ore- over Gossip ensures that future cornrinini catiori with another core rriernber results in the latter also believing this n?ajority faulty. Recovering fron? such a (otnpict(? f(Ii1u?? is diffIcult; one could employ, for exaniple, Skeen?s algorithni [SkeSS], but a con?plete discussion of recovering froin total faihire is beyond the scope of this work. Ilefore presenting S-C?IP we should address issues arising at start up. In prac- tice, a process wanting to join the NIRNi will query the systen? rian?e service for the NIR?XI. If the nan?e service reports that the NIRNI is non-existent, this process executes an initiation sequence and becon?es the NIRNI. If the name service reports that the X?RN1 aheady exists, the process requests of the NIRNI that it be allo??d 53 to join. ???eIRNI core i?nst bnild to critical size, after ?vhich it begins operating as a system service, and nntil that poirft, the nan?e 5C1'ViCC 15 not fanlt-tolerant: it is either a single process that n?ay fail, or an nncoordinated gronp of processes that iTlay report inconsistent inforn?ation regarding the `\4RM. In the latter case. a race condition way occnr if more than orie process desiring to join the NIRNI is told it does riot exist. 4.3 The S-GMP Algorithm The ITiost irriportant, arid rather astonishing, aspect of onr algorithrn is the complete lack of restrictions on changes to a gronp view. Specifically, there is no upper lii?it on the number of processes (i.e. core iiienibers) one can add to Gpviewr to form Gpviewr',+l; if removilig processes from Gpviewr?, tile npper limit is the size of the largest minority stibset of GpView". I'llis flexibility has a n'imber of very irriportant rarriiflcations. (lompared to an NIRNI defined by a less-stritigeut ri-ienil?er of the class of Asynchrouons C?1ronp Nierribership Problems (we make tliis explicit in Chapter 7), an NIRNI defined by Strong C?1?Il? i-las a number of striking advantages: Fault-tolerance Th( Strorig CNil? ser?i - can tolerate nearly half its niernbers faili rig simtiltaneotisly. Nioreover. the fanit-tolerarice level is dynari?ic and depends oil1y oi? the size of the current group view. \Veaker versions can tolerate the simultaneous fitilure of only a fixed number of members, and given initial conditions, this will usually be otie or t?vo. 54 Responsiveness The Strong GNIP service can adapt n?nch n?ore quickly to severe changes in its ?vork load. In a single change to its core rrienibership? the Strong GNIP sei?vice can add an arbitrary uni-riber of ne??? nienibers to share a snddenly?increased load or remove nearly half its ?nenibers ?rhen its ?vork load drops quickly (thereby ridding itself of dead ?veight'). \?\`eaker protocols ??ill need to run as niany as n rounds to change n?en?bership by n processes. Additions require very little change to S-C4?IP as presented in the first t?vo sec- tions of this chapter. Let vi be the set of new n?enibers. An initiator say f sends a Join niessage, giving the nienibers of i;i permission to join. in-imediately before it n?uliicasts the commit message, com(? ??I, x t 1 >):M--Hsub(? i?, + 2 >), to GpView"'. The initiator also sends a State--HXfer message to ?i, informing the ne?7 members of all relevant core and system state. `Ib simplify bookkeeping, new members begin with local version equal to the group version tl-ieir addition resulted in. If j)I E vi had not already received the state transfer iriforniatiori from a previous initiator it initializes its local state and sends ack(Join) back to r. If p' has already joined at the behest of a previous intiator, say i.(), p' responds to r with NextUpdate?i as set by r(). This response is necessary to niaintain GNIP-2 and C?N?P-3; it deals with the case in u-'hi?4? r?i.'s commit niessage differed from T'5 in the coiitingent submission component, say M--Hsiib( ? `1)2 ? t 2 >). and r() niQy have been able to commit ? v2, x t 2 > invisibly to r and all menibers of Acks(i, R--Hint(ver(r))) (See Figure 4.3). 55 M-com(? ?;1,x +1 >): `L I Tflgr' )oi?n a??d ?tate-x??er - NextUpdate7'1 R--Hint(x) Nf ? DcterinincPropo?al --- - ? I ? v1 x + 1 > 12 = ? --Hu?gT?x + 2> - R-com(? ?i?? +1 >): M-sub(? 1?. X +2 >) Figure 4.7: A Situaticr? Requiring New (?ore XIen?ber? to Rq-?ort NextUpdate to tuitiator. 56 in ??hat follows ?` [ and r2 are sets of process identifers. For fnll generality we use ? to denote either of the operations that after h?cal views - ren?ovc'() or add( ). C;???? a set of processes, G, let `t?initicast?(G ri-i) denote the compound action Vq c (r" : (se?nd?(q, rn)). niiiiti?c?t tp(G rn) is an indivisii?le action only in the sense that p does not execute any other events until all n?essages are sent; it is not failure-aton?ic. ??he n?essage ack(n?) acknowledges receipt of message n? ?he sets FauIty? and Recovered? are, respectively the rneuibers of its k?cal view p believes faulty, arid the uieu?bers riot in its local view j) currently believes functional For siniplicity we do not explicitly show gossi?)ing, channel-disconnect or any error-checkirig, bnt asurrie these are done transparently. For example, we do riot show an outer process checking whether the process fron? which it has received an M--Hsub() niessage is, iii its belief, the current n?gr , whether the update indicated iri a com() niessage actually n?atches the update indicated in the n?ost receiit sub() n?essage, whether the processes to be reriioved (or added) are currently (not) in the group view, and so forth. 57 Task:mgr ?vhile (tr'ie) repeat GetUpdate(?'1); until (??1 nil-id); n?ulficastn?gr ( LocaIViewn?gr M--Hsub( ?vt)); while (??1 nil-id) 1* Con?pressed algorithi? loop ?/ forall p E LocaIView??91 TCCVn?gr (P. ack(M--Hsub(+?'I))) or f(?uUyj ?gr if (majority of LocaIViewn?9r didn?t respond) qnitn-?gr /? Update LocaIView?ngr according to I DoC?oinmit( ??i); (41et LT,pdate(t2); if (Joining new members) muifiaistmgr (vi Join : 3tate--HX?er) forall j)' C ??? r(cvn?gr (vi ack(Join) : NextUpdate?') or f(?uitymgv (p'); v2 = NextUpdate?;i in?i4icast'?ngr (LocaiViewmgr , M--Hcom( Iv t) M--Hsub(I?'2)); `2; 5S /* Called only l)y ?ngr. I?rocednre:C?1et J?ipdate(OIJ$Ivat); if (?ecoveredrng? ? ?) wil --H--H Recoveredmg? else vat FauIty?ngr; return(); Task:Otitert?rocessesp rccvp(fl?gr M--Hsub( ?v1)); /? Ni?iark the processes of ? 1 faiilty or operational as appropriate. ?/ Dot)reCoTnniit(??i ?); repeat scndp(n?gr ¶ ack(M--Hsub(???1))); t'(cvp(T17gr M--Hcom(Iv 1): N--Hsub(?i?2)) or faultqp(?ngl ); if (!FA?LTYp(Tngr)) DoPreCon?rnit( t'.2); DoConin?it(vt); ? 1 --H--H else NVai t- R econfiguration(); until (`L'i nil-id); 59 /? Execiited l%Y a proc `:5 destined to be an N4 RN4 n?en?ber. ?/ New-Process:PI nan)e-server- I ooknp( Xi H,Xi, contact-id); if (contact-id nil-id) /? Race (londitions Possible. Localview11 narne-server-Install(N'lRXl., PI); Regin mgr Task; repeat scndy (contact-id request-j oln-MRM); rccv?i(contact-id? Join(<: t., :r >) : State--HXfer) OR tinie-ont; if (time-out) ri an?e-se'.ver- Lookiip( Ni RN4 contact-id); else /" LocaJView?? is part of State-XIet 7 1)oCon?n?it(? `l.Y nntil (NIR Xi -n?eniber OR Ni RNi-un available); else Reconfiguration For R?conflg'irin ? ( introduce the follo?'ir'g variables: o+ NextUpdate? is a tuple of the forni Fids, ver(p) + 1, rank(init)lp , where Ids i? the set of processes by which p is expectin? to aft?w its current local view, ver(p) and rank(??iit) is the rank (in .ver(p) of the proces? LocaIView? LocaIView? from which p re??eived the proposal. \`Vhen p receives a subn?ission it changes NextUpdate? to reflect the valne proposed and the initiator proposing it. o+ LastCommi=t? is a tuple of the forn? [Ids, ver(p)]p, where Ids is the set of pro- cesses by which p altered its previons local view to obtain its cnrrent local view LocaIV ver(p) Iew? o+ Ahead? is the set vahies reported coi?i?ittcd f:?r versions nun?bered greater than ver(r). initiator r receives these values in response to its R--HInt(ver(r)) niessage. Behind? are reported local versions nuiiibered less than ver(r). Anticipating th(' correctness proofs of Chapter 5, the reported version in Ahead? is exactly ver(r) + 1, and that in Behind? is exactly ver(r) --H o+ SubCurrent? is the set of proposed next updates with proposed versions equal to ver(r)+ I; SubAheadr is tlie set with proposed versions greater than ver(r)t 1. 61 ??sk:ReconfigurationIniti?tor,r?ithver(r) m?it?casf? ( LocaIView? R--Hint()); forall p C LocaIView? recv? (fi, NextU pdate1? : LastCommit?) or faultyr(p); if (niajority of LocaIView? didn't re?pond) quit?; /? 1?etcrrnine tlie value and version to siibi?it fron? the responses received, ?/ Detern?ineProposal(t;i , ver,, v2); ?oPreComrnit(?' I); niulticasir ( LocaIView? R--Hsub( ? ?v I ?`cr forall p C LocalViewr recvr(p, ack(R--Hsub(? ?`?`1 , ?`r or fauUy?(p); if (inajority of LocaIView? didn't respoi?d) qnitr; f)oC'o11i1?it(v1); if (Joining ne?v nienit?ers) muiticastr(??1, Join : State--HXi'er); forall j)1 E ?1 i?cvr('vI,ack(Join) : NextUpdate?i) or faultijr(p'); v2 NextUpdate',1; multica 1r( Loca?View? R--Hcom( ? ?? 1' t'cr : R--Hsub(I?2)); n?gT' , ii r, ?2; Begin n?gr ??sk; 62 Task:OuterReconfig??ration.,p Tecvp(r., R--Hint(x)); /* If rank(p) > rank(?), I'i? thought faulty. ?/ scndp(?., NextUpdate? : LastCommit?); recvp(T%R?sub(? +vi., VEr or fauityp(r); if (!FA?LTYp(r)) Doi?reC?oinniit( i); ?r:nd?(m ack(R--Hsub(? I'?'1 ??r ir(cvp(r., R--Hcom(< t?t vcr : M--Hsub(+V2)) or fa?1iyp(r); if (!Fi?UL?Np(r)) DoCou?rn1't( vi); Tfl?r V 1 ??. else ?Vait- Recoufiguration (); else N'Vait-Recoufignratiou(); 63 7 Sets proposal ?nd invisible fioni aniorig the 7 7 NextUpdate and LastCommit information. k?7 7? ver(r) = ;r. 7 Procedure:Deterrninel)roposal(0?;`iproposal,()?4?UversionO????Finvisibk?); Ahead? --H f [Ids, ver(p)]p I ver(p) (? t I )J Behind? = [[Ids,ver(p)]p I ver(p) SubAhead? = [[Ids, ver(p) t 1 rank(in?[)]? I ver(i)) = t 1 )J SubCurrent? = [[Ids, vcr(p) + t,rank(?n?t)lp ver(p) = if (Ahead?# \b) 7? Partially committed version v + I. -?7 proposal = Ahead?; C41etStableProposal( invisible, Su bAh ea d ?); version = x + 1 retnrn(); if (Behind?# ?) 7 Partially con?mitted version x. proposal = [Ids, ver(r)lr; 7? LastCommit? ?7 GetStableProposal(invisible, SubCurrent?); version = retnrn(); 6'! Petern? ineProposal,continued. /? All respondents report the same local version. ?/ version = x t 1; if (SubCurrentr == proposal = ? --Hlflgr r t ? GetI??pdate(invisible); return (); if (SubCurrent? is a siiigleton) proposal SubCurrentr; C?1et?j?pdate(invisiJ?le); return ( ); /? SubCurrent? has two elements. 7 C?etStableProposal(proposal, SubCurrent?); C?1etUpdate(invisible); return( ); /? update-set has no niore than two elements. 7 Procedure:???etStable Proposal( 0 I?'f?`(1ic?>,iNupdate-set) ? ia!., i(:'i > = the elen?ent of update-set with the loivest ?Yinkcd ii?itiator. return( ); Chapter 5 Correctness In this chapter we prove S-GMP correctly solves Strong GNIP ?Ve prove correctness by induction. In the first section of this chapter we construct the inductive step; li?ll correctriess fi)llows easily fi'oi? there. 5.1 The Inductive Step fhroughont this chapter we use sub( and com() to denote subn?it and con?rnit rYlessages irrespective of the sender's role (n?gr or a reconfigurer). As in Sec- tion 4.3., NextUpdate? is the triple tids,ver(p) + 1 rank(?n?t)]p . E?or each p, C??ossip., Disconnect and INITIATL() u?ean that NextUpdate? is always the proposal of the 1ou'cst-?anked initiatoi' from which p has received proposed updates for group view ver(p) + 1. For process 1 n?nlticasting n?essage n-?,, Acks(T,?n) is the set of processes fioni 65 66 ?vh?ch r receives a n?essage ackno?vledging, or in response to fl2. As before, Ahead? dL? fp p E Acks(r,R--Hint(ver(r))) A ver(p) > ver(r)? Behindr def fp p C Acks(r, R--Hint(ver('r))) A ver(p) ? ver(r)? \Ve also assi-in?e Gpviewx?l is well-defined. NVe approach the inductive step by proving all successful initiators propose the same value for Gpviewr??, whidi goes a long WQy toward maintaining both C?1N1P-2 and GN4P-3 as this is the only value that can be conirnitted. ??ie n?ost difficult aspect is ensuring that reconfigurers are able to detect updates that are con?n?itted invisibly to them. Proposition 5.1.1 If ? is a reconfiguration initiator with ver(r) r, then q?Acks(r, R-int(?)) Pro of `Though GpView' i?ay not be defined globally, let p be the process re- sponsibh for r installing LocalViewf'. ?`ipp05C son?e q EAcks(r, R--Hint(x)) reports ver(q)? j: --H 1. `Fhen q has neither received nor responded to p's sub(? v. x so p believes FAULTY?(q). ??ossip ensures r also believes FAULTY?(q) `ipon re- ceipt of com(? t', x >) froni p. Since r received com(? ?), X >) before it n?ulticasted R--Hint(Q?), the f)isconiiect property n?al?es it in?possible for r to receive a response fron? q. Thus, q ? Acks(r,R--Hint(x)). So suppose son?e q' E Acks(r, R--Hint(r)) reports ver(q1) > + 1, and let J)' be responsible for qi installing ver(q1). Because it ? ver(q') --H 2, i' has neither received nor responded to i1,s sub(? v, ver(q!) t>), resulting in FAULTY?(r), and upon q1's receipt of coii( V FAULTYp(q). (5.1) ??liat is, froni ps j)oint of vie?, GpVlew1 is (or has been) defined. Of course Gpviewr rnay not be defined gli?bally as sonie process q that p believes faulty, rriay have ver(q)? x. NVith respect to a recoiifiguration initiator, r, GpvjewL is r?defined at the end of 1?hase 1, if every process in Acks(r, R--Hint(x)) --H EauIty? reported a local version at least as large as . Processes in Acks(r, R--Hint(r)) fl Faultyr may have reported local v(?rsions less than r, but r believes theni faulty by the end of Phase 1. Proposition 5.1.2 Let r be a reconfiguration initiator. Ihen r proposes version x if and only if GpView'?1 is the most recent (i.e. highest-numbered) r-defined systeni vie?' at the end of Reconfiguration Phase I. Proof Refer to procedure !)eterrninePi'oposa] in Section 4.3, o+ \Vhen Ahead? ? ?, r proposes version number ver(r) + ?I. \Vhile it n?a,y also be the case that Behind? ? ?, we can use Proposition 5.1.1 to show that ?ULTY?(q) holds for p C Aheadr and q C Behind? resulting in FAULTY?(q) at the end of Phas?' 1 and Gpviewver(r') being i-defined there. 68 o+ ?Vhen Ahead? ? arid Behind? ? ?, ?pvewver(q) for q E Behindr is the most recent r-defined system vie'vi; 7 proposes version number ver(r)--Hver(q)t4 o+ NVhen Ahead? Behind? ?, cpviewver(r) is the most recent r-defined system view and r proposes version ver(r) + t. Proposition 5.1.3 For any initiator, ?, if P submits sub(? v, x >), then r --H J ? ver(p) ? Proof By way of contradiction, suppose x ? ver(P). From Proposition 5.1.2, we know GpView??' is the rnost recent P-defined group view, so let q be aiiy outer member reporting ver(q) x --H 1. I?et p' be the process from which p received the comn?it message com(? il, ver(p) >). Since q has riot committed view r, it can neither have received sub(? v1, ver(?) ?) froni p , nor have responded back to p'. By the end of P1,5 submit phase FAULTY?i(q) holds, and by C?ossip so does FAULTYp(q) as soon as p receives coii(? 7)1, ver(?) >); P would not have received q's response to R--Hint(ver(p)) On the other hand, ver(P) ? r --H 1 is impossible if GpView"?' is the most recent p-defined view. 69 Proposition 5.1.4 if q with ver(q) x 2 receiv ? sub(? ?)?X >) fron? initiator `r then FAULTNJ?(q) holds ??hen i n?nlticasted the snhrnission: q? Loc)viewr (ver(q) r --H 2 A sEND?(q sub(? t, x ? IAULTY?(q). Proof Let ver(q) y--H2 arid suppose r proposes ? `??? .r >. it is not possible for r to be the same co-ordinator as the one fron? ?vhich q received com(? ???, x --H 2 >) be- canse Vl?O channels force q to receive sub(? ?)!, --H I >) before com(? v', :r --H 1 >) and the latter before sub(? ?`, Y >), resnltlng in ver(q) + So suppose r is a reconfiguration initiator. Proposition 5.1.2 shows that r pro- poses ? v, i > if and only if ? detects Gpviewx?las the n?ost current r-defined sys- ten? view. By definition then, for every p in Acks(r, R--Hint(r)), either ver(p) > x --H 1 or FAULTN'r(p) holds. Thus, FAULiY?(q) holds at the end of reconfiguration Phase I. iii S-G?iP q fails syi?pathetically in?n?ediately after receiving sub( ? v, x --H 1 >) froni r, arid then attenipts to rejoin the group view as a new process. Proposition 5.1.5 If ver('r) then the n?ost recent r-defined systen? vi at the end of Reconfiguration Phase I is no less than ;? --H 1 and no greater than r. Proof Gpviewl't] (or any view greater than x + I ) cannot be the niost recent v-defined systeni view as this would require ver(r) ? r + t contradicting the hy- pothesis. So suppose GpView"?2 is i?()st recent. Let q C Acks(r, R--Hint(x)) such that ver(q) ? --H 2, and h-t p be the process fron? which r received com(? v, x Froi? Propositioii 5.1.4, it is impossibli- for q to have received and responded to sub(? ??., X 5() FA?Ui?Yp(q) holds by the end of p's submit phase. t?roin C)1ossip? FAUL?N'r(q) holds upon r's receipt of com(? i?., x >). meaning that ? cannot receive q's response to R--Hint(x). 1?he teniporal cohcsion of outer inen?bers' local vie?vs guaranteed by l?roposi- tions 5.1.1 and 5.1.5 is fundan?ental to correctuess. if non-faulty core nierubers never receive snbniissions that ornit an update, they cannot cornrnit a vie\v ornit- ting an update. c; Ni P-:3 follo?vs by sho?ving successful subrnissions for the sarne group version are identical. If r, ?vith ver(r) x, is successful in R?econfiguration I?hase 1, i?roposition 5.i.?i tells us the largest version nuniber observed arnong i??'s respondents is .r + 1. So suppose Aheadr is non-null and let p be a process frorn ?hich soirie nieruber of Ahead? received com(? Vrtl, J + 1 >). If the conipressed update algorithni were used to install version r + -I then every rnen?ber of Aheadr also responded to sub(? V+2, j + 2 >). Nioreover, r can irnagine that tile processes in the corriple- nient, Acks(r, R--Hint(i?)), rnay also have done so niaking it possibli? that Ahead? and Acks(r,R--HInt(x)) (and ?"x+l if Gpview?tl GpView U ?)x+1) together forni the requisite niajority subset of GpView?+'to coinrnit view r +2 (See ?igure 4.(;). ?Yoii- ble can arise if Acks(i. R--HInt(:r.)) (and i'+i and i'+2 if applicable) also constitutes a niajority subset of Gpviewi:+2 since neither r nor any process in Acks(i, R--Hint(x)) can know what value p would propose for view i. + 3. If rernoving core nienibers is tlie only perniitted operation, and i is successful it is inipossible for another process to conirnit a version nunibered higher than x + 1. \Vhen additions are per- 7-! n?itted? p i?ay continue con?n?itting new group views and r n1Qy lag behind p, but if both are successful for a given groul) versioii? both con?ruit the sanie vabie. Proposition 5.1.6 Let r be a reconfignratiou initiator with ver(r) = r Ahead? C Acks(r.,R--Hint(x)) report li?cal version x + 1 Acks(??R--Hint(r)) be a rna- jority subset of LocalViewf' and let [) be a process fron? which sonic n?ei?ber of Aheadr received com(? Vx+1, x + 1 >). Then., when the gronp view is only altered by ren?ovals., /) cannot be successful for any view nun?bered higher than x + 1 Proof Let q C Aheadr and let p be as described. ?ince q received R-int(r) from r after coin(? Vr,tl X + i >) from p, INITIATE(r) means rank(r) ? rank(p)? so FAULTY*(p) holds for every member of Acks('r., R--Hint(r)) (by (?ossip). As a result initiator p is successful for +2 if and only if Acks(r,R--Hint(i)) is a niQiority subset of L0CaIV?eWj"'+1 = Localviewf; --H Observe that Gpviewr' = LocalViewx = Acks(r,R-int(?)) u Acks(r,R-int(x))., and that `r will not receive ack(R--Hint(?)) responses from the members of t?+I; that is `tx+I C Acks(r?R--Hint(;r)). t?or p to comniit Localview;;t2 (Acks(r,R--Hint(r)) --H must be a majority subset of Localview;;+1 . Then initiator p is successful for ? it + 2 > if and on?y if IAcks(r,R--Hint())--HIv?+iI _ IAcks(mR--Hint(x))I--H I Loca?View11;+1 I Gpviewr' --H I Acks(r,R--Hint()) I --H ?t+1 --H --H ____ I Acks(r,R--Hint(x)) I + I Acks(r,R--Hint(x)) I --H I Vr?+1 2 I Acks(i.,R--Hint(i')) I --H `?+i ? Acks(r?R--Hint(x)) I 72 but this contradicts the assuniption that Acks(r, R--Hint(x)) is a iu?iority subset of GpView' Proposition 5.1.7 Let r be a reconfiguration initiator ?Tith ver(r) Y. Let Ahead? be non-null a?id let p be a process that sent com( ? `L?r+1 r + 1 >) to sorne nieniber of Ahead?. Then if r is successful for ? x + I? >: 1 . If p- snbn?its ? i'xt2? r + 2 ? independent of the conirnit of ? Ur'?+1 x + 1 >., then either r or j), but not both? is successful for version r + 2. 2. If p conirnits ? x + 1 > contingent upon ? ?Ar-,+2.. x + 2 > being (onirnit- ted, then either or p, but not both, is successful for version :r + 3. Proof 1 . - r I?igure 5.1 (top). The split-arro?v message fioni I to ??+i represents the tivo possibilities for the arrival of r's cornulit message, R?com(??? ?`r+1, X + I ?) : M--Hsub(? --HAcks(r., R--Hint(x)), x + 2 >), at `?`???i. ps coniniit message, com(? `rtI j + 1 >). to ill is a dashed bue to represent the possibility that it n-'Qy riot be received. For presentation clarity, \VC elide r's Phase II subniit niessage. (?1Ni1?-:3 would be violated if both p and r ivere able to conirnit their proposed vallies for version x + 2. 73- (a) Snppose the menibers of ??xt1 receive in from r before they receive com(? ?xt1,X + I >) froni p. Since r s inessage gossips its belief in p's fatiltiiiess? the nienibers of t.xtl will never receive another niessage fioni p. In particular the nienil?ers of i+I ?vill not receive p?s subsequent M--Hsub(? ?j.+2,x + 2 >). NVe say r owns v'tl ?hen p is successful for version :r + 2 if and only if Acks(r,R--Hint(r)) ? Acks(r?R--Hint(?')) + I irti and r is successful for version x + 2 if and onky if I Acks(n.R--Hint(:t)) + I vi?,ti I I Acks(r R--Hint(r)) I I3oth conditions cannot hold. (b) The analysis is the sanie as the previous case; r owns c1--+i once the nienibers of that set receive in. (c) In the last case, p owns r;?+I if its M--Hsub(? "i+2 Y + 2 >) inessage ar- rives at -i'?+i befi?re r's R--Hcom() niessage and so gossips ps behef in r's faultiness. Then I) is snccessful for version ;i + 2 if arid only if I Acks(r,R--Hint(x)) + ?4?,+l > Acks(r,R--Hint(r)) and r successful for ` -- sion i + 2 if and onky if Acks(r, R--Hint(x)) > Acks(m R?--Hint(;r)) I + I Again both conditions cannot hold. 74 2. There are o consider but as the analyses arc quite sirnilar? ?ve depict arid discuss ouly four (Figures 5.2 aud 5.3). (a) Suppose r o\vns Vx,+I. This happeus if the n?er?bers of ??+1 receive o+r 5 con?niit n?essage fc?r version ? + 1 x+1 R--Hcom(? ?1+j.,L + 1 >) : M--Hsub(? tx T + 2>) before either p's con?rriit message for version x + 1 = com(? tti x + 1 >) : sub(? Vx+2 x + 2 >), or p 5 conirnit niessage for version x + 2 = com(? vx,t2.,X + 2>): M-sub(? --HAcks(mR-int(r)),r + 3 >) (kkence., the split-arrow). Further suppos 0??iS `?x+2, as happens when the nien?bers of tx+2 receive nif+2 = R--Hcom(? Vx+2?X+ 2>): M--Hsub(? --HAcks(t,R--Hint(x)).,i: + 3>) frorn i bef:?re receiving fl?f)t2 fiorri p. Then r is successff?l for x + 3 if and only if I Acks(r.,R--Hint(x)) + I I + I tx,,+2 I > I Acks(r?R--Hint(i)) I and p is successful for x + 3 if and onLv if I Acks(r.,R--Hint()) > Acks(r,R--Hint(x)) + I Vx+I + ?x+2 I Y\oth ??ondi tions cannot hold. 0 (b) Suppose p ()?VIi5 v1+? and ?`rt2' ??hen p is successful for x + 3 if arid or?ly if Acks(r R--Hint(x)) + + Acks(r, R--Hint(x)) and ? iS successful for x + 3 if ?d on?y if I Acks(r?R--Hint(x)) I ? Acks(r.,R--Hint()) It I I + t1t2 I (c) in this situatiou (Figure 5.3). p o?ns ?`x+l and r o?vus ?`x+2 Flien p is successful for i. + :3 if arid only if I Acks(r?R--Hint(x)) I + tr,+1 Acks(r.R--Hint()) + ?`i?+2 I and r is successful for j + 3 if and only if Acks(r.R--Hint(x)) I + Cr,+2 > Acks(r.,R--Hint(.i)) 1+ I I (d) In the last scenario p o??ns ??r+2 and r o?? )s ??j.t1 . ??hen p is successful for x + 3 if arid only if I Acks(r?R--Hint()) + V1?+2 I I Acks(r.,R--Hint(?')) I + I I and r is successful for r if and only if Acks(r,R--Hint()) I + I titi I I Acks('r,R-int(x)) I + I I 76 1) vI+i P com(? ti.,+? >) a. M-sub(? ?+2,+2>) R-int(i) Cases a. an?l b: r o?vns ?)xt I. M-sub(? ?`.xt2. ? + 2 ?) corn(? ?`-1;+1.Y + 1 >) fl? R--Hint(x) Case (;: p O?\ -- ?xtJ F?igiire 5.1: (lase Analysis for Proposition 5.1 .7? Part ?. 77 `rn - p p T -p i-'Ql fl-? ?ase a: r owns ?L-'tI and Vx+2 I) R-int(?) Case b: p owns `x+1 and Vx?t2 Fignre 5.2: C?ase Analysis for Proposition 5.1.7, Part 2. 78 Case d: r owns Vy+i j) Owns R--Hint(x) n2f+l vrt2 T12?+1 p Figure 5.3: Continued Case Analysis fc?r Proposition 5.1.7., Part 2. r-+1 Tn - p __?- ___ It9 -r v1+2 r R--Hint(r) - ---- Case c: p owns ?`-?+i. r owns t",t2. 79 1?roposition 5. 1.7 is a keystone to correctuess becanse it addresses exactly the situations in ?vhich it might appear possible for S-GNiP to violate C?NlJ?-2 and (4NiP- 3: ?vhen two initiators are able to propose? and theref:?re n?ight be able to commit., different vallies for th( san?e group version. Propositions 5.t.(i arid 5.1.7 ??()VC that even when the local vie?vs of processes installiug the sariie group vie?? differ ( i.e. the initiators do not vie for niajorities fioni aniong the sai?e set of core members) s-GMP is safe. It ren?ains to consider ho?r decides which of the various proposals it learns about could have been invisibly con?niitted. In s-G?IP this necessity arises during n?configuration in determining L'2 when either Aheadr $ ? or Aheadr \h and Behind? $ ?, and in deternAning ?`? when Ahead? BehIndr `lo ilhistrate the difficulty, suppose Gpviewx?l is the i?ost recent r-defined group view. I?et Submissionsr(x) be the set of proposed next updates f:?r version L' that r liarns about in response to its R--Hirt( ) n?essage: Submissions?() --H I ?p C Acks(r,R--Hint()).NextUpdatep --H \L?)i.,X,Z1? for sorne initiator Intuitively, if Gpviewx?l were corurnitted with an accorripanyilig proposal for GpViewx (?.r--. the corripressed algorithui were used), then Submissionsr(x) would be that proposal and Submissionsr() I--H--H Ilowever, it is also possible that. while there ulay have been no plans for a future update when rngr rnulticast M--Hcom( ? L", X- --H -t >), nigr (lid begin an update algorithru to forrii Gpviewx at 50TT1C later tirne. If, during that sarne interval, an outer core niember, p, began reconfig- uration it is pos?ible for p riot to receive any Phase ? responses indicating n?gr `s intention fc?r GpView"' in this case? p would propose n'?gr `s removal for version x. N' subsequent reconfigurer could ti-icri get i?hase I responses indicating both of these proposals. NN?e first describe tl-ie coniposition of Submissionsr(.i), sbo'ving that every reconfigurer proposing version i either propagates ingr `s proposal for version or proposes Tngr 5 rei?oval (See Figure o+4.-t in Chapter 4). Proposition 5.1.8 Pbr all versions x, I Submissions?(x) ? 2. Proof Inspecting procedure Detei'n?inei?roposal, different subnii ssions for the sai?e view can arise only fron? n?gr and from a reconfiguration initiator proposing fl?g'r `s ren?oval. `?he latter occurs if ar?d only if the initiator did not learn of any outstanding proposal n?ade by rngr that is if Submissions?(x) ?he set SubmissI=ons?(.c) is b?vai?nt if it contains two distinct values. Corollary 5.1.1 follows by exani ing [)etermineProposal. Corollary 5.1.1 Let r and r' be recoiihgurers proposing version .r. ihen if both are bivalent.' they are identical: Submissions?(x) I Submissionsri(j) I --H--H 2) ? Submissions?(x) --H--H SUbmiSSi0flS?'(x). Fliat is all reconfigtirers either propagate n?gr `s nniqlle snbrnission for view :r or propose `tnqr 5 renioval. ?Ve now show that only one of the two proposals for a given version coiild possibly have been coi?iinitted (invisibly or otherwise)., and that all reconfig'irers can distinguish which of the two it was. ?his proposition, too, is important in the inductive step. It shows that in going fioni Gpviewz?l to GpviewT one and only one vahie can be con'iniitted by any member of Gpviewx?l as the same value is proposed by any successful initiator for the xth group version. Proposition 5.1.9 Let r be a reconbguration initiator. If Acks(r,R--Hint()) is a majority subset of LocalViewr and Submissionsr(?) is bivalent, then r can distinguish which of the two valii??s proposed could not have been committed invisibly. Proof Let r be as described, and let ? ?`. X > and ? > be the two e -, -ients of Submissicnsr(?i?). Let J) be the process of least rank arrioi?g those reported to have submitted ? ?`, :t >. and let p1 be the )rocess of least rank among those reported to have subuiitted ? t?1. x >. ?4ien r reasons as follows: In order for either vah?e to have been committed its initiator must have garnered majority approval fiom its own local view for the submitted value. ??hat is, Acks(p, sub(? i?, x >)) c Maj(LocaView?) and/or Acks(p', sub(? t', y >)) ii Maj(LocaIView?1). Since both p and p' make version r submissions, both must have local versions either x --H I or 3 (Proposition 5.1.3). \Ve consider these by cases: S2 1 if ver(p) ver(p'), then both processes vie for majority approval from among the same set of processes, and the Phase I response sets of both n?ust intersect: Acks(m R--Hint(ver(r))) fl Acks(p', R--Hint(ver(p'))) # ?. Suppose p has higher rank than p'. Then J) can only get a majority response to its Phase I aiid Phase II messages if it does so hefi?re p'?1 If this `tvere the case, then then p? ?vonld have learned of sub( ? `c, x >) and have propagated it (c.f. l)e[eru?iiieJ3roposa]). Ho?ever, p' submitted ? t'?, . >, and v' ? t', so p must not have been successful for ? v, x >. 2 If ver(p) # ver(p'), then one of ?p or p' has version j and the other has version x --H 1 (Proposition 5.1.3), Suppose ver(p) x and let rc be the process from which p received com(? v, x Pecause it, too, proposed version .c, ro's h?cal version niust also be either --H 1 or x. Suppose ver(r0 ) x --H I, and consider the conditions under whi - - ji proposes < v', x (a) If Submissions??(x) is the singleton fv't or if Submissions?t(x) is null and ?`j? the singleton (n?gr ?. In either case, all members of Acks(p',R--Hint(r--H -!))flAcks(ri,sub(? v,r >)) 1it is Ilot difficult to see that ?vhen t'vo processes colilpete for fliajorities froiii all?ong tile san?e process pool one niust succeed in getting the 1-1-lajority before the other. Nioreover, Disconjiect C4ossip and !NITIAI?E() n?eaii that the process `vith higher rank can only get a n?ajority response if it does so before the lower?ranked process. n?nst have received p"s R--Hint( --H 1) l?efi?re ?? `s sub(? ?, >); if not., they woiild have forwarded ? ?, X > in their responses to P1 making Submissions?t (r) bivalent. N4oreover, because they received sub(? ?`, X ? fiom P', rarik(r0) ? rank(?1). C?ince Acks(?0, sub(? v, x >)) is a majority subset of GpView"'?1 a majority of GpView??' believe p1 faulty npon re- ceipt of sub( ? v, x >); p' could not have obtained the recluisite majority for its submission R--Hsub(? ,?:I, ?,. (1)) If Submissions?i(x) is bivalent then Submissions1g(;r) = tv1, ?t Suppose p1is the first' reconfigurer fc?r which Submissions?t(?) I = 2. Among the respoiises to R--Hint(x --H 1) indicating ? ?, x >. let vi be the member of least rank to propose < v, x >. Similarly, let v2 be the member of least rank that proposed ? v', x >. Ry assumption, both Submissionsv1 (r) and Submissionsv2 (v) are either null or a siughton. \`Vithout loss of generality, suppose ? v> is the proposal to remove fl?'gv (one of the two submissions must be). `then Submissionsri (v) is either empty or the singleton, ( ? --Hm']v . v >t and `I'2 must be the singleton f ? i'?, v > ?. `\Ve fl0\V sho?i' p' can deduce v? cannot have been successful hA' ? v', v > by considering the possible roles v? might have played. i. if v? were mgv we claim it could not have received a majority re- 5i)0fl5C from among Gpviewx?i; if it had, then vi would have learned of ? i-I, v > during its interrogation phase and thence propagated it. ii. So considei2 $ mgv . Since both v1 and v2 were ?`il?le to propose npdates, their Phase i respondents sets niiist intersect? and since the processes in that intersection received R--Hint( t) froiii ri it n?nst ?)C the case that rank( r1 ) ? rank( T'2 Fnrthern?ore? had r2 been able to commit ? v' y > its Phase II respondents Acks(v2,R--Hsub(? v', x ?)), wonid have had to to have been a i?Qjority snbset of GpView??'. In this way, if T2 had been successful for ? w (Acks(v2,R--Hsub(? v',x >))flAcks(vi,R--Hint(J --H 1))) # ?. Since Submissicnsr1 (x) does not contain ? v' x >, the processes in the intersection n?iist have received v1's R--Hint(v --H 1) before v2s R--Hsub( < x > ). But now, rank(vi ) < rank(v2) means that having received vi's R--Hint(v --H I ), no process will receive further niessages fron? v2 so v?'s R--Hsub( ? ``, x ?) could riot have reached the required in ajority. ??hus, p1 deduces v2 cannot have been successfiil for ? v1, x >, and p1 should decide to propagate ? v, j >. Exaniining procedure (iNet S[ah]ePvoposa], we see that this is indeed the case: ? v, x > is sub- niitted by the process ()f least rank, v1, among those reported in Submissions??(.v). NVe have arrived at a contradiction: Detei?nine- J?roposai and CetStal?JeI???oposa1 would not have resulted in p' sub- iTlitting ? v', x >, as hypothesized. ?Ve have just proven the claim for tlie case when ].)I is the first recon- hgnrer whose Phase I response set is bivalent. If j)''5 proposal reaches a niajority stibset then this value will be chosen by subsequent initia- tors (with bivalent response sets) in (?etSta/?1eProposa] Since now p?is the snbn?itter with least rank. If p 5 proposal does not reach a n?aiority subset, subsequent initiators will nonetheless reason as p' did and so also choose to propagate the correct val"e. t?inalty, recall that r() is the process fron? which p received com(? v, j-. >). \Ve have just proven the proposition for the case when ver(rc) r I . ?he last case t() consider is ver(vo) j-. But in tliis case we can apply the san?e argun?ent to the proce??, say ? from which ?o received com(? v, r >); this reduction and analysis can contiiine for only a finite nun?ber of steps befi)re we exhaust the nien?bers of GpView?? Proposition 5.1.9 proves that C1c - tah1eJ?roposa1 correctly chooses the only pro- posal for a given grou?) view that co aid have been coininitted invisibly to a recoufig- uration initiator when its 1?hase I respoilse set is bivalent. N?'hen the set is univalent or en?pty, it is not hai?d to see that I)etetrnineJ?i'oposa] is safe. Stabty-defined proposals (those submitted by successful initiators) are exactly the proposals that reconfigurers must view as possibly coiun? itted invisibl'\,?. Corollary 5.1.2 If Gpviewx?l is defined, there is at fl)ost one stably-defined pro- posal for group version r. Proof Proposition 5.1.9 proves that any reconfignrer reachiiig its proposal stage kno?vs exactly ???hich of the t?vo proposals fol' versioii x is riot stably-dehned. Procednres Detez'n?in??Proposa1 and G'etStah1eJN?oposa1 propagate the other one. If this iriitiator reaches its con?rnit stage, its proposal is stabty-dehned and identical to thc other stably defined proposals for version r. Theoren? 5.1.1 (Identical Local Views) lfGpviewr?l is defined, then allmen?- bers that stirvive to define local version x have identical k?cal rth vie?v?: A ((?ver(i?) = A (?ver(q) x)) p,q?GpView'?' (Localviewjr; = Localviewf;). Proof ?he resnlt follows fron? Corollary 5.1.2; no process con?n?its a local view for version .i that diIf??rs fioi? any other processes' versior) ? becanse all proposals that can possibly reach the cornrnitstage are identical. Note that Fheoreni 5.1.1 implies no ten-iporal constraints or) local views. nierely that if p ever defines an xth local view. and if q, too, ever defines an rth local view, then these two are identical. It does not require Localview;; and LocalView'q' to exist conten?poraneorisly. NVe need to do a bit rriore work to pn)ve ()??lP-:3. Remarks S-C?IP ensures that the state to which the systerri finally reconfigures represents the cu ni ulafive systeril progress. It accounts fi?r any previous npdates (and reconfignra- S7 tious) that inay have been only partially successful. and n?akes thei? stable, ?Vith respect to an interrupted coiriniit tlie gix?up vie?v is undefined until 501nC initiator succeeds in niulti casting its con?n?it message. To see that tlie new rngr is unique consider a process, p, that lias received an interrogate i?essage fi?om r. It believes faulty and disconnects from every process ranked higher than ?`. I?hus, p immediately begins to believe that r is tiie highest ranking non-faulty process. in particular, p will not receive subutit., conin'iit, or interrogate niessages fiorn processes ranked higher than i?. 5.2 S-GMP Correctness Proofs Theorem 5.2.1 S-GMP satisfies GMP-(). Proof Let j) be a f)?0(:C55 that queries the na?ie server and is told no NIRN? exists. J?ien LocaIView?(co) GpView??j(co) ?pt Uniqueness of GPVeW?pj(co) depends on race conditions arid assun?ptions on tiie nanie server. iJutil the XiRN?i reaches its critical n?ass and can be used to create a fluilt-tolerant nanie service, its reliance on the iian?e service niakes its uniqueness vulnerable at start-up. Theorern 5.2.2 S-GMP satisfies GMP-I. Proof A pn . j) executes r?fl?ovej?(q) only upon receipt of one of tiie fol- lowing: 1, M--Hcom(--Hq):M--Hsub(--Hq') frorri nigr ., in which case S--Hc?N1P gives cccv13(ni??r ,M--Hsub(?q)) H fa?iUyp(q) recvp(?ngr M--Hcom(--Hq) : M--Hsub(--Hq')) `rcn?ov(4?(q) Equation 5.2 holds irrespective of whether the submit message M--Hsub( --Hq) were the tail end of a contingent update or submitted independently of any accompanying commit message. 2. R--Hcom(--Hq):M--Hsub(q') from some reconfiguration initiator, r. In this c se rccv?(?, R--Hsub(--Hq)) fauit.4p(q) ??cv1(r, R--Hcom(--Hq)) ren?ovr;?(q). Analogous statements app?y to the events opr?rotingp(q) and add?(q). Fhe next t\vo theorems will make use of the following consistent cuts. I?et r, be the process responsible for completing the XlHX'I-wide commit of ? ix, ;i' >. iheorems 5.2.3 and 5.2.4 will show that Cx is the first cut (in the sense of th( partial ordering., ?, on cuts) along which GpVie?x exists. cx[pJ debues the p?con'ponent of Cx for p E Gpviewx Cx [?1 dJ i?cvp( Cx. com( ? Vx, X >)) f)oComn?it( ? ??x, J recvp(rx., com( ? ?`x, X J?o(loinmit(? ?`x, J ?) r(?;cvp(rx, com(? ?x, ? >)) DoComniit( ? ?`x, .C qnit? otherwise Theoren? 5.2.3 S-cMP satisfies GNiP-2. 89 Proof Recall GNiP-2: p\ V (GPViews(c) = /\ UNDEF'D(GpVI'ews!(c)), Sc Proc ?#S'#S th an&l define tlie inte?va1 of a rnn as follows: [cx,,crti) = I c1 ? c ? NVe prove tlie tlieorerri J?y induction over intervals Base Case Suppose exactly one process, p, is installed as the initial group view, GpviewU. Let S() = fp?. Uhen until p changes its local view (that is, along every c E [co, ci)), GpView50(c) = ?pJ = 5o J3y assun?ptiou, every other process, c?. that queries the nanie server during this interval is told the NIRNI exists, so LocalViewq(c) is undefined until q is allowed to join and this is 1)0 sooner than cl. C'aveat: The Base Case holds only when the NIRNI elicounters no race condi- tions at start-up. Inductive Hypothesis The staten?eiit of the theoreni holds during the interval [cr??i, c), with Si,--Hi the unique set satisf?'ing GpView5__ (c1?i) = S,--Hi = GpView??' and no other sel satisfying the equality along any cut in the interval. Inductive Step From Corollary 5.t.2 if GpView?' is uniquely defined, then there is a unique, stal?ly-defined T)roposal, ? i > for Gpviewx, Define S1 c Proc: i?hen, l?y defii?it ion = ? ?CGPV?W??(Cx) are unique. Further, if ? Vi, X > is the reuioval of processes fron? GpViewi?l 91 those processes' local vie?s, if they exist at all, are eqnal to Gpviewr'??l ?ince Gpviewr?--H[ contains all the n?en?)ers of S? GpView', GpViewv?(c) is undefined fi?r c E [c1,, cx+i) the local vie?vs of the functional nien?l?ers of GpView??1 ?? differ fion? the local vie?vs of the rnen?bers of Vx along Cx. In addition any process that queries the nan?e server during this interval cannot define a local vie?v until it is allowed to join the XIRN1 core, Finally, froiii thc definition of the partial order ? on cuts, U0ci [ci, ci+i) covers the entire asynclironous run. Theorern 5.2.4 S-GMP satisfies 1MP-:3. Proof Recall GXiP-3: t\ A?A (Cx V q mIN?LocALf3(q) ? UNDiF'T)( LocaIViewq?) IN-LocAL1f;(q) ? DOWNq V (LocalViewq Localviewpx) Let Cx be as defined above and let LocalViewp(cx) exist and be Localview;;. if q C LocalViewf;--H1 --H LocaIVI.ew1x; then q will not con?n?it any version LocaIViewq?t?.' for 0 ? ?; once q receives com(? Vx, X >) it learns ?"x is its own removal frorn Gpviewx--Hi and crashes. If q is in neither L0CaIVieWj???l nor Localview1x, theii LccalView'q' is undefined along Cx. 92 So suppose q C Localviewf;. If q crashed on or before Cx ?V(- are done. If q has not crashed the result follows frorn `J?heoi.eui 5.1.1 and the definition of Cx It ren?ains to prove Cx consistent. ?y way of contradiction suppose it is not? and let Cp and C ` be events violating consistency: c? ??i e??? e'p! E Cx. and C1) ? Cx. p and p' i?ust be differeift different processes., n?aking Cp a send event fiorri p to p and Cpl the corresponding receive event. ?ousider ?igures 5.4 and 5.5. a: j) ?x,P1 $ rx. ?he lIFO assnn?ption prechides this (Figure 5.4-a). b: ? $ ?x,P1 Tx. If Cx is inconsistent., then flillowing cycle ensues violating causality (Figure 5.4-b): sCiI(ipJ(P?R?com(? ii H r(c?(r?R--Hcom(? v, I ?)) cx[p] H SC ndp(j'/, n? CpJ ?CCVp?(P, fl1 SCTI(11)J (P R--Hcom( ? v, x c: P $ `rx, PI $ Tx. If p is a subsequent initiator, it n?ust have lower rank than Tx, making it iiT?possible for PI to receive rx 5 conin?it message (Figure 5.5-ci and c2). d: p $ vx, P' $ rx. FinalLy, if p' is a subsequent initiator. then p is responding to a n?essage froni it. Iii no case, however, will p receive the con?n?it message fron? rx; if p' has lower rank than Tx FAUL?'??pi(rx) holds, and if PI has higher rank than Tx P will fail upon receipt of 7?x ?? interrogate iliessage, and not receive its (?n?nIit message (Figure 5.5-Ct). I) I) = p R--Hcom(? ?`, (4) 6 I a: FIVO violation C1)I 61) J?: (?a'i?ality violat on i?ignr? 5.4: Proving Cx (?on?ist?nt - `the'oren? 7. 94 Theoren? 5.2.5 s-GMP satisfies GMP-1. Proof An outer process's local failnie and recovery beliefs are propagated by C?ossip in every ?nessage it sends. ??lins., an npdate initiator learns of, adopts., then acts on onter processes' beliefs. 5.3 Message Complexity l?he seqnence and tiniing of failnres affect the message complexity of S-G?iP. NVe qnantify the ?vorst and best case coniplexity for our protocol to install a new grotip view., as well as the gain achievable when we can take advantage of conipressing phases. Define ?r',d1f GpView? ` and Tx to be the nuniber of tolerable failures in GpView1: rr,d=(?? (?Ti;1 + 1). t?hen the worst case scenarios to install the (r + -I )st gron?? view occurs when there are Tx successive failed (or aborted) rec'onfigurations. ?his results in 0 `,` --H(Tx,) --H Tx () Gpviewx 9 , 9' niessages. i?ortunatety, this specific corIif?osition, ordering., and timing of faihires occurs with extreniely, low probability. 95 I?here are three s??-called best case scenario to install a vi r succeeds ilsing the standard two-phase S-G?IP; Tngr succeeds using co?ipressed S-G?1P' and one successful reconfignrer. The first case requires at i?ost 3(nx --H 1) n?essages; the second, at n?ost 2(nx --H 1); and the third, at most 5(nx --H 1). Finally, if we can take advantage of a seq'ience of con?pressed S-G?IP updates we save substantially in message complexity For exarnple, flj --H 1 successive singleton removals, none of which are mgr requires flx' I nx--HI (n1. --H 1) t 2 ? (nx --H y) (n1 --H 1) +2n?(n? --H 2) --H 2( Z y--H--H2 y=2 ?r2 --H 2?r --H 1 --H 1)2 messages, averaging to n? --H 1 messages per update. A standard two-phase algorithm would require an additional flx?Y --H 1 messages for the :qth removal, on the average. In each case, the number of actual failures may reduce the number messages. 96 cl: p' subs??quent initi??tor Ca?us??1ity violation Figurc 5.5: I)rovjng c1, Consistent -- Theorei? 5.2.4 contiijued. A ____ R--Hint(ver(p)) p cl: p subseqnent initiator ra?k(p) ? rank(?1,) - -____ R-int(ve )) c2: p snbsequent initiator rank(r?) ? rank(p) I) p p r1- I), =_R-int(y?(p)) Chapter 6 Protocol Optimality In this chapter, ?ve rephrase the Strong GNi l? nien?bership problen? as a kno?'ledge- theoretic, con?rnit-style probteni, viewilig any sointion to ?;xii3 as one of acquiring and propagating kno?vledge [11N19o]. This approach proved snperior to the previons, behavioral specihcation ?vith respect to fine-tuning the s-G?IP protocol. ?Ve sho? here how the episternic forn?utation niakes clear that any solution to Strong cN?NiP requires niajority corroboration, and where optin?izations to the s-c ?i p protocol are not possible. \Ve flirther show the knowledge-based formulation facilitates establishing a lower bound on the number of messages nee4(?d to solve Strong G?IP, construct a message-minimal solution, and finally quanti fjy the tradeoffs between the given, optimized solution and the niessage-mininial sohition. others have used a knowledge-based approach to analyze a variety of prob- lems in distributed computing ([flalS7] [flad9i], [NiF9()], [NIaz9?O], [H Fs5l). The work presented here differs both in the problem considered and in demonstrating the practical utility of ari episteI?ic fornuilation \`Ve give concrete evidence that such a forniulat ion greatly sin?plifies i?uildirig robust fast sollition5 to coriiri iit-style 1?robleriis. Remarks ??he iridistiuguisliabili ty of failures froii? coi?n?nnication delays seen? 5 to warrant using the doxastic ruodality [?\its9] to refer to local failure beliefs. However the (?oss4? and Isolation i)roperties result in a iTiucli stronger local interpretation than belief. in particular., once a process suspects anothei is faulty? it behaves as if it k??ows that process is fliulty. ??`?hereas the standard doxastic interpretation would give equal weight to belief iri faultiness and belief in non?faiiltiness? our rriodel favors one befi?r? the event faulty1?(q) and the other after. By closing its iricorriing channel frorn a suspected-faulty process, a process behaves as if the systeul were synchronous; as if it knew that a suspected process could not send further messages. Note that the forinulcT FAUUiY?(q) arid DO?VNq reflect this anibiguity. ?he forn?er is local to the suspecting process ?vhile the latter is local to no process. 1?orrriula ? is local to p if p alwQys knows whether it is true [C?NiS6] : I??? v I?j?mq' holds along all consist??nt cuts. In this way, the standard kno'vh?dge operator models exactly the behavior process pexhibits upoii executing [a?iltyp(q). ??hroughout this chapter, the staten?ent "p believes (or knows) q is faulty" should be taken as an artifact of our niodel`5 behavioral requirerrients riot ii terally. In contrast (?handra and foneg impart a pure doxastic interpretation to local faillire beliefs [(;?T9i 1. In their work, despite belief in a process's faultiness, all of 99 its n?essages niust be delivered, as n?ust all niessages to it. I?he difference could also be tern?ed one of optin-lisni versus pessin-iisni. F?inally ?ve note that s-c;NiF is a fnii-?nfo?ination protocol; all relevant state iriforniation is sent with each protocol niessage. In Strong (41N4P, relevant infornia- tioii are local beliefs about faillires, arid during reconfiguration, a process's local view and the vahie of a pending conin?it (if it exists). outer processes can infer an iriitiator's respondents from its failure beliefs. 6.1 Reformulating Strong GMP Strong GNIP, as a coniruit probleni, requires functional core n?en?bers in a given group vie?v to vote on a proposed update to their local views, and cornn?it the update when particular conditions are u?et. Let ?" be a SCt of values, one of which the processes in the (r --H 1 )st group tI? view niust con?niit to install the x group view : ? C (f-.+) ? 2Proc), Before conin?itting a given update, a process n??y V0t(r for n?ore than one update vahie. As in Aton?ic ()ornrriit, in rio execution are votes for any version pre-determined. `I'he notation ? ?`, X > refers to a particular value, ?`, and version of the group vie?v fr?r which r is submitted arid/or conirnitted. The san?e vallie niay be siibi?itted for different versions. i?he following an foru?ul?, events, arid notation used