Msb: Nearest neighbor searching in metric spaces

Version 0.11 ("very preliminary release")
Jan 2003

Introduction

Msb is a collection of subroutines and drivers for nearest neighbor searching in metric spaces. It is written in ANSI C, and also compiles under C++. The routines are given as input a distance function dist(size_t, size_t), which satisfies the metric space conditions that

dist(a,a)=0,

dist(a,b)=dist(b,a);

dist(a,c) <= dist(a,b) + dist(b,c), (thetriangle inequality)

Here dist is assumed to give the distance between points: dist(1,2) is the distance between "point number 1" and "point number 2". Some examples of metric spaces include: points with the Euclidean distance measure or other L_p measures, strings under hamming or edit distance, and bit vectors under hamming distance. The drivers given here give examples of these measures.

The data structures are a means of potentially speeding up the nearest neighbor searching problem:

Given a set S of points (also called sites) in a metric space, build a data structure so that given a query point q, the nearest site to q can be found quickly.

The sb(S) data structure given here can be built for arbitrary metric spaces. Measurements for several cases have shown that it can be effective, and is, at least, does no harm.

The data structure needs typically a (small) constant amount of additional space per site. The data structures may well bog down and give no improvement over brute force for high dimensional points, or some sets of strings. However, this is not predictable, since a given set of sites may have a lot of structure, even if it lies in a very high-dimensional ambient space. So the data structure may be worth a try. There is also available a quadtree-like data structure, coded by Arya and Mount, for L_p spaces, which is faster in that setting.

The SB library also includes functions for fixed radius queries, where all sites closer than some given distance are desired, and k'th nearest queries, where the closest k sites are desired, for some input integer k. The code also internally supports inverse nearest neighbor queries, although that capability is not currently supported in the interface.

License

The SB library is provided under the Gnu General Public License,and the usual disclaimers apply:

/****************************************************************************
* Copyright (C)2002 Lucent Technologies                                     *
* (clarkson@research.bell-labs.com /who/clarkson)    *
*                                                                           *
* This program is free software; you can redistribute it and/or             *
* modify it under the terms of the GNU General Public License               *
* as published by the Free Software Foundation; either                      *
* version 2 of the License, or (at your option) any later version.          *
*                                                                           *
* This program is distributed in the hope that it will be useful,           *
* but WITHOUT ANY WARRANTY; without even the implied warranty of            *
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the             *
* GNU General Public License for more details.                              *
*                                                                           *
* You should have received a copy of the GNU General Public License         *
* along with this program; if not, write to the Free Software               *
* Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA,                 *
* or visit http://www.gnu.org/licenses/gpl.html                             *
*****************************************************************************/

Source Code, tar'd, gzipped

Source Code, zipfile

Slides from a talk

Installation

The .zip file can be downloaded and extracted using, for example, winzip. To extract the tar file Msb_0.1.tar.gz, run gunzip on it, and tar to create the directory Msb_0.1, and cd to that directory:

gunzip Msb_0.11.tar.gz

tar xf Msb_0.11.tar

cd Msb_0.11

Building using make. Now adjust the macros in Make_config to look in the appropriate directories, and use the appropriate compiler, random number generator, and library creation and use. The code has been compiled under ANSI C and C++ on SGI's and under Visual Studio on Windows. Finally, type

make library

to create the library, and

make drives

to produce the drivers for points under the Euclidean norm, strings under hamming distance, strings under edit distance, and bitvectors under hamming distance.

Building using Visual Studio 7. The folder 'win' contains the solution sb.sln, which contains projects to build the library and driver.

Using the library

The declarations for the functions for creating and using sb(S) are given in include/sb.h. So

#include <sb.h>

must be added to programs calling these functions, where the directory containing that file must be in the include path when compiling the program.

To load these functions into the resulting program, on most systems the linker should be told of the library libSB.a with the option -lSB, and using the -L option to tell the linker where that library is.

Using the drivers

The command

make drives

given in the Msb_0.11 directory, will produce drivers and distance functions for euclidean data in euclid, strings under hamming distance in hamming, strings under edit distance in string, and bitvectors in (you guessed it) bitvector. Some corresponding data sets are included in those directories also. The drivers take data from stdin, and print test results on stdout. They are included here mainly to show various applications of the library, and the user will need to examine and change the source code to adapt them to other needs.

The executable test_sb can be make'd in the euclid subdirectory, and will generate a variety of distributions of random sites, build the data structure, and then test the data structure on those sites.

A Variant Data Structure

Also included is the "MM" data structure, built with the build_M function, searched with the search_M function, and with driver drive_M. This data structure does not require the triangle inequality, but does need some additional points, that are used as representatives of typical query points. Most users would probably prefer the sb data structure.

Author

Ken Clarkson
Room 2C-455
Bell Laboratories
600 Mountain Ave
Murray Hill, NJ
07974
clarkson@bell-labs.com

Last Modified
Jan 2003