Syntactic similarity: I completed a textbook for Morgan & Claypool on algorithms for detecting textual near-duplicates in large corpora. I first considered this problem in 1994, just before the public release of AltaVista. Our internal users noted a problem: the answers were usually correct and useful, but about a third of the time most of the search results were nearly identical pages. To enable public release on the marketing schedule, we chose a measure of page similarity (Jaccard similarity) and developed an initial implementation to evaluate it. Finding the technique workable, we then developed an efficient approximation algorithm for this measurement; when this still proved too costly for immediate deployment (its running time slightly exceeded that of the crawler and indexer), I found an improved implementation yielding an order-of-magnitude speedup, allowing our deduplication algorithm to use roughly an additional five percent of the computational effort already expended by the indexer. By running deduplication at index-generation time, we shrank the corpus by roughly a third, reducing the number of machines needed to produce search results.
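The measure itself is simple to state: represent each page as a set of overlapping word k-grams (in the spirit of the shingling used for web deduplication) and compare the sets. A minimal illustration, with hypothetical helper names, not the production code:

```python
import re

def shingles(text, k=4):
    """Set of overlapping k-word shingles of a document (illustrative helper)."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}

def jaccard(a, b):
    """Jaccard similarity |A intersect B| / |A union B| of two shingle sets."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox leaps over the lazy dog"
sim = jaccard(shingles(doc1), shingles(doc2))  # 2 shared of 10 distinct shingles
```

A one-word edit disturbs every shingle containing that word, so near-duplicate pages still score high while distinct pages score near zero.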
Working with MSN Search (now Bing), I reimplemented a variant of the sketching part of the AltaVista solution to run efficiently and compactly at crawl time, and added a small library, run at result-generation time, to deduplicate only those pages answering the query, so that deduplication no longer suppresses valid result pages on account of irrelevant near-duplicates. Bing has used this code in production ever since.
This led to work with Windows on reducing network and disk use in cluster file storage, using similarity techniques to seed a compression engine with the handful of files most likely to be useful in a compression dictionary. This has shipped in Windows Server (and, more recently, in client releases) since Server 2003 R2. Recently I worked once again with the Server file-system team on techniques to improve the performance of feature selection.
Work with the file system reinforced my interest in improving similarity detection to prefer larger matching file segments over smaller ones, so I began searching for efficient techniques for weight-proportional sampling from weighted distributions that nonetheless preserve the consistency properties allowing an unbiased estimator of, in this case, weighted Jaccard similarity. We (Manasse, McSherry, and Talwar) developed a randomized technique to draw a single sample efficiently, running in expected constant time.
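The weight-proportionality half of the requirement can be sketched in a few lines; this is not the consistent-sampling algorithm from our paper, only an illustration (with assumed names `_uniform` and `weighted_sample`) of selecting element e with probability proportional to its weight via keyed exponential clocks:

```python
import hashlib
import math

def _uniform(key, element):
    """Deterministic pseudo-uniform value in (0, 1) keyed by (key, element)."""
    h = hashlib.sha256(f"{key}:{element}".encode()).digest()
    return (int.from_bytes(h[:8], "big") + 1) / (2 ** 64 + 2)

def weighted_sample(weights, key=0):
    """Draw one element with probability proportional to its weight:
    element e gets arrival time -ln(u_e) / w_e; the earliest arrival wins."""
    return min(weights, key=lambda e: -math.log(_uniform(key, e)) / weights[e])
```

Over many independent keys the selection frequency of element e approaches w_e divided by the one-norm. The harder property, that two weightings agree on the sample with probability equal to their weighted Jaccard similarity, is exactly what the consistent-sampling construction adds on top of this.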
Recently, inspired by Ping Li's work on eliminating the running-time dependence on the number of samples to be drawn (which turns out to be almost exactly a reinvention of a technique due to Flajolet and Martin from the mid-1980s), we (in this case, Haeupler, Manasse, and Talwar) looked for a weighted analog. Attempts to generalize this to an unbiased estimator of weighted Jaccard similarity fell afoul of the difficulty of estimating a ratio from an estimate of the denominator. Nonetheless, experiments suggested that we could come close when the distributions A and B had roughly equal one-norms. The principal technique was to scale both distributions by a common power of two so as to put both norms in the range 2000-8000. Randomized rounding then replaces real-valued weights by integers, preserving the norm in expectation and yielding roughly 8000 non-zero entries. This gives a running time (using one of Ping Li's schemes for rapid production of a set of samples for unweighted Jaccard estimation, together with Broder et al.'s technique for converting integer weights into replicated entries) of O(N+8000) for inputs of length N with norm 8000. We found that, owing to poor estimation of the denominator, this yields a biased estimator, but with bias expected to be at most one part in 2000 (the lower bound on the norm), considerably smaller than the expected Chernoff-bound estimation error of one part in root K when using K samples; in typical usage this error is about one part in sixteen.
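The scaling and rounding step can be sketched as follows, under assumed helper names: choose the power-of-two exponent placing the one-norm into the target band, then round each scaled weight up or down with probability equal to its fractional part, so that the integer weight equals the scaled weight in expectation. The rounding decisions are keyed deterministically so that like weights receive like randomness:

```python
import hashlib
import math

def scale_exponent(weights, lo=2000.0):
    """Smallest s such that 2**s * ||w||_1 >= lo, so the scaled norm
    lands in [lo, 2*lo), a sub-band of the 2000-8000 range."""
    return math.ceil(math.log2(lo / sum(weights.values())))

def randomized_round(weights, s, key=0):
    """Replace each scaled weight 2**s * w_e by its floor or ceiling,
    choosing the ceiling with probability equal to the fractional part,
    keyed by (s, key, e) so identical inputs round identically."""
    rounded = {}
    for e, w in weights.items():
        x = (2.0 ** s) * w
        h = hashlib.sha256(f"{s}:{key}:{e}".encode()).digest()
        u = int.from_bytes(h[:8], "big") / 2 ** 64
        rounded[e] = math.floor(x) + (1 if u < x - math.floor(x) else 0)
    return {e: v for e, v in rounded.items() if v > 0}
```

Since each integer weight differs from its scaled value by less than one, the rounded norm stays within the number of non-zero entries of the scaled norm, and matches it in expectation.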
Generalizing a lemma of Theobald et al., we find that norms differing by more than a factor of two guarantee a Jaccard ratio below one half; two weightings with norms within a factor of two are guaranteed to share a common scale factor putting each norm into the range 2000-4000 or 4000-8000. The matching scale factor may place the two norms in adjacent ranges, so for each input weighting we produce the scale 2^s putting its norm in the range 2000-4000 and the scale 2^(s+1) putting it in the range 4000-8000. The randomized rounding and choice of samples are keyed by s or s+1, respectively, so that like weights receive like randomness when sketch elements are assigned.
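The two-scale bookkeeping can be sketched directly (function names assumed): each weighting publishes the exponent s placing its norm in [2000, 4000) together with s+1 placing it in [4000, 8000), and a short calculation shows that two norms within a factor of two always share one of their two exponents:

```python
import math

def scale_pair(norm, lo=2000.0):
    """Exponents (s, s+1): 2**s * norm lies in [lo, 2*lo)
    and 2**(s+1) * norm lies in [2*lo, 4*lo)."""
    s = math.ceil(math.log2(lo / norm))
    return (s, s + 1)

def common_scale(norm_a, norm_b):
    """Scale exponents the two weightings share; nonempty whenever
    the norms are within a factor of two of each other."""
    return set(scale_pair(norm_a)) & set(scale_pair(norm_b))
```

If norm_b lies in [norm_a, 2*norm_a], its exponent is either s_a or s_a - 1, so the pairs {s_a, s_a + 1} and {s_b, s_b + 1} always intersect; norms further apart than a factor of two may produce disjoint pairs, but by the lemma such pairs are already dissimilar.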
The specific value of 2000 depends on the number of samples to be drawn; this choice guarantees that each of 200 bins contains 10 candidates on average, so a given bin is empty with probability approximately e^-10. It is therefore likely that no bin is empty, and extremely rare for more than one to be. Thus, in each sketch, eight bits suffice to record the identity of the single empty bin if one exists, that none exists, or that more than one exists.
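The arithmetic behind this encoding can be checked directly, using the values from the text (the e^-10 figure is the Poisson approximation to the exact binomial probability):

```python
import math

bins, candidates = 200, 2000            # 10 candidates per bin on average
p_empty = (1 - 1 / bins) ** candidates  # exact probability a given bin is empty
expected_empty = bins * p_empty         # expected number of empty bins
poisson = math.exp(-candidates / bins)  # the e^-10 approximation from the text
```

Both probabilities come out near 4.5e-5 per bin, so the expected number of empty bins is below 0.01, and the 200 bin identities plus the two extra codes ("none empty", "more than one empty") fit comfortably in eight bits, since 202 <= 256.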