A NOTE ON RABIN'S NEAREST-NEIGHBOR ALGORTTHM by Steve Fortunet John Hopcroftt TR78--H340 Department of Computer Science Cornell University Ithaca, N.Y. 14853 Research was supported under grant number ONR Noool4-76-C-0018. A NOTE ON RABIN 5 NEAREST-NEIGHBOR ALGORITHM by Steve Fortune John Hopcroft ?S--H34O Department of Computer Science Cornell University Ithaca, N.Y. 14853 Abstract Rabin has proposed a probabilistic algorithm for finding the closest pair of a set of points in Enclidean space. His algorithm is asymptotically linear whereas the best of the known deterministic algorithms take order niogn time. We show that at least part of the speed up is due to the model rather than the probabilistic nature of the algorithm. 1. Introduction - One notion tha? has received some attention recently is that of a probabilistic algorithm. An algorithn is prcb- aoilistic if at certain steps it chooses a number randomly to determine the next step and at all other steps it is deterministic. Rabin (3] has proposed a probabilistic algorithm fo? finding the closest pair of a set of points in Euclidean space His algorithm has the property that for any fixed input the expected running time (taken over all possible sequences of random choices) is linear. It is possible, of course, that for a particularly bad sequence of random choices the algorithm will take longer. The best of the known deterministic algorithms for this problem has running time C(n log n) (2, 4, 51 We wish to raise the question of whether the speed of Pabn? algorithm is due to its probabilistic nature or due to the assumed underlying model of computation. In this model the input data are real numbers; the usual arithmetic operations take unit time. In addition, there is a special operation, described below, which performs an operation similar to hashing but is guaranteed to only require unit time. We show, however, that under this model there is a deterministic O(n loglog n) algorithm. Hence at least part of the speedup is due to the model. It would be interesting to know if the deterministic version could be further improved. We note that other fast probabilistic algoritnms have been pro?,osed (1, 3, 61, but they h?ve tho 3)roi)erty th?t for --H2--H some sequence of random choices an incorrect answer could result. Rabin's algorithm is apparently the only known example of an error-free probabilistic algorithm which runs faster than the deterministic equivalent. It would be interesting to find other examples of this phenomenom. 2. The Algorithm We describe the special operation, call it FINDBUCXET, in more detail. Given a set of real numbers and an interval size, we wish to partition the set into blocks such that each block is nonempty and contains the reals falling in one of the intervals. To do this we have a table of buckets where the size of the table is as large as the set of numbers. FINDBL'CsET, given the arguments a real number r and the interval size, returns the index of the bucket into which r is to be stored. FINDBUCKET returns the same index for all reals falling in the same interval; it is guaranteed to return difrerent indices for reels falling in different intervals. ?e assume no other relationship between the real and the index returned by FINDBUCKET. Rabin suggests that ?i.'??3jC?T be implemented by dividing the real by the interval size, truncating the quotient to an integer, and hashing on the integer. This would require constant expected time. The algorithn we describe is completely general and the extension to k-dimensional space will be obvious. ror siT. plicity we will assume that the points are real nuT.?ers appearing on the line. --H3--H The algorithm is based on the following observation: suppose we can find an interval size such that at most one point falls within each interval, and such that tnere are two nonempty adjacent intervals. Then to determine the closest pair we need only examine, for each point, the intervals surrounding the point. The number of operations is bounded by the number of surrounding intervals; on the line it is 2n, in the plane Sn, and in k-di;ensional space (3k?l)m. We write a recursive procedure FINDINT to find the appropriate interval size. FINDINT is called with a set of n points as parameter. It selects an initial interval size by dividing the difference between the maximum and minimum point by n. It then removes a point from the set and uses FINDBUCKET to deterrnine into which bucket the point is to be placed. This process continues until some bucket has Vn points, at which time FINDI;T is called recursively for each bucket which contains two or more points. The new interval size is set to the minimum of the interval sizes returned by the recursive calls. At this point, all buckets are emptied and the scan begins again with the first point not yet processed. When all points have been scanned, and a tentative interval size found, the entire set of points is placed into buckets at the current inteeval size. FIDINT is then called recursively on all buckets containing two or more points; the output of the algorithm is the minimum of the size returned by all recursive calls. A pidgin ALGOL description of FI?DINT is --H4--H given below. Input: A set S of n points. Output: An interval size such that no two points fall in the same interval and such that there are two adjacent nonempty Intervals 1. 2. 3. 4. 6. 7- 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. PROCEDtRE FINDINT(S): intervalsize ? (max(s) - min(S)) / n; T 5; WHItE T is not empty DO WHILE all buckets have fewer than V?? points and T is not empty DO remove a point r from T; B FINDBJCKET(intervalsize, r); insert r into bucket B; END; FOR each bucket B containing more than one element DO intervalsize min(intervalsize. FINDINT(B)) END; empty all buckets; END; FOR each r in 5 DO s FINDBUCKET(intervalsize, r); insert r into bucket B; END; FOR each bucket B containing more than one element DO intervalsize min(intervalsize, FINDINT(B)); END; RETtRN(intervalsize); END --H5--H To analy:e the running time of the algorithm, first note that exclusive of recursive calls, FINDINT takes time cn, for some c. secursive calls of size less than 256 take only a constant amount of time; we will assume that the cost of such calls are absorbed into the constant c. We show by induction that for n?256, PINDINT takes total time dn loglogn, for some Let us call the process of one iteration of the WHIL? loop of lines 5-9, i.e. until some bucket gets ? points, a level. Clearly, there are at most ? levels. Let us call the interval size at line 15 the tentative interval size. It is possible that more than one point can fall into a bucket at the tentative interval size. This can happen if two points would fall into the same bucket but were processed at different levels. However, each such bucket can contain at most one point from each level. Hence the number of points per bucket at line 19 is bounded by the number of levels, i.e. ??. From the above discussion it is clear that FINDINT never solves a recursive problem of Size bigger than tn. If all subproblems were disjoint, it would be easy to show that the running time was O(n loglogn). However it is pos- sible that a point might appear in a recursi':e call at both lines 11 and 20. Hence a more complicated analysis is necessary. --H6--H For the following, let us also say that the processing of lines 19 through 22 is a level by itself. We wish to count the number of intervals which are guaranteed to be nonempty at the interval size at the end of each level. This will enable us to bound the total work involved in the recursive calls. Suppose the first recursive call at some level has b points. After the call, the points are all separated by the new interval size, But since all the points were in one interval at the start of the level, after the call there are b-l new nonempty intervals. Suppose the second call at the level has c points. It is possible that these points have already been partially separated by the first call. But since all c points were in one interval at the start of the level, after the second call there are a total of (b-l) + (c-l) new nonempty intervals. Continuing in this way for all recursive calls at a level and all levels, we get that there are at least T (bi?l) none:npty intervals, where k is the number of recursive i=i calls and b. the size of the ith recursive call. As at the end of the algorithm there are n nonempty intervals, we have k (I) ? (b?--Hl)sn i=l 1 The amount of work involved in the recursive calls is k s dbi loglog bi i=l bounded by (2) (Note that we have already absorbed the cost of rocursive calls of size less than 256 into general overhead. ?ut (2) is certainly an upper bound on the amount of work necessary.) We wish to maximize (2) subject to (1), and the constraint that 2sb,s ? for each i. We claim (2) is maximized when each bi except possibly one isLt?. To see this first note that if r?s>2 (r+l) loglog (r+l) + (s-l) loglog (s-l) ? r loglog r + a logLog a hence (2) is augmented by increasing a bi by 1 at the expense of decreasing a smaller b. by 1. Second, if b-=2, for some i, since 2 loglog 2 = 0, (2) is augmented by deleting b. and adding 1 to some other bj (1) is then still satisfied. In this case (all but possibly one b?=tn) ksF47j?j s? + 4 for sufficiently large n, So we have k Z db loglog b. s d tn loglog tn (tn + 4) i 1 s dn loglogn - dn + 4dAt (loglogn-l) - d ? dn loglogn where the last line holds as ns256. The total work involved is thus bounded by cn + dn loglogn - dn which is less than cbloglogn if d>2c. (lJ [2) t3] [4) [5) Angluin, D. and t.c. Valiant. "Fast Probaoilistic Algorithrns for Hamiltonian Circuits and Matchings", Proc. of the Ninth Annual ACM Symposium on the Theory of Com utin (1977), pp. 30-41. Bentley, J.L. and M.I. Sharnos. "Divide-and-Conquer in .!?ltidimensional Space", Proc. of the Eighth Annual AC.' Symposiurn on Theory of Computing (1976), pp. 220-230. Rabin, M. "Probabilistic Aleorithrrs", appearing in Algorithms and Complexity (1976), Academic Press, N.Y., N.Y., J. Traub, Ed. Shames, ,`?.I. "Ceometric Complexity", Proc. of the Seventh Annual ACM Symposium on Theory??Compu?g, (1975) , pp. 224--H233. Yuval, G. "Finding Nearest Neighbours", Information Proressing Letters, Vol. 5,3 (1976), pp. 63-65. Solovay, R. and V. Strassen. "Fast Monte-Carlo Test for Primality", SIA'i Journal on Computine Vol. 6,1 (1977) pp. 84--H85. [6]