The data structures are a means of potentially speeding up the nearest neighbor searching problem:
Given a set S of points (also called sites) in a metric space, build a data structure so that given a query point q, the nearest site to q can be found quickly.The sb(S) data structure given here can be built for arbitrary metric spaces. Measurements for several cases have shown that it can be effective, and is, at least, does no harm.
The data structure needs typically a (small) constant amount of additional space per site. The data structures may well bog down and give no improvement over brute force for high dimensional points, or some sets of strings. However, this is not predictable, since a given set of sites may have a lot of structure, even if it lies in a very high-dimensional ambient space. So the data structure may be worth a try. There is also available a quadtree-like data structure, coded by Arya and Mount, for Lp spaces, which is faster in that setting.
The SB library also includes functions for fixed radius queries, where all sites closer than some given distance are desired, and k'th nearest queries, where the closest k sites are desired, for some input integer k. The code also internally supports inverse nearest neighbor queries, although that capability is not currently supported in the interface.
The SB library is provided under the Gnu General Public License,and the usual disclaimers apply:
/****************************************************************************
* Copyright (C)2002 Lucent Technologies *
* (clarkson@research.bell-labs.com /who/clarkson) *
* *
* This program is free software; you can redistribute it and/or *
* modify it under the terms of the GNU General Public License *
* as published by the Free Software Foundation; either *
* version 2 of the License, or (at your option) any later version. *
* *
* This program is distributed in the hope that it will be useful, *
* but WITHOUT ANY WARRANTY; without even the implied warranty of *
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the *
* GNU General Public License for more details. *
* *
* You should have received a copy of the GNU General Public License *
* along with this program; if not, write to the Free Software *
* Foundation, Inc., 675 Mass Ave, Cambridge, MA 02139, USA, *
* or visit http://www.gnu.org/licenses/gpl.html *
*****************************************************************************/
make libraryto create the library, and
make drives
to produce the drivers for points under the Euclidean norm, strings under hamming distance, strings under edit distance, and bitvectors under hamming distance.
Building using Visual Studio 7. The folder 'win' contains the solution sb.sln, which contains projects to build the library and driver.
Using the library
#include <sb.h>must be added to programs calling these functions, where the directory containing that file must be in the include path when compiling the program.
To load these functions into the resulting program, on most systems the linker should be told of the library libSB.a with the option -lSB, and using the -L option to tell the linker where that library is.
make drivesgiven in the Msb_0.11 directory, will produce drivers and distance functions for euclidean data in euclid, strings under hamming distance in hamming, strings under edit distance in string, and bitvectors in (you guessed it) bitvector. Some corresponding data sets are included in those directories also. The drivers take data from stdin, and print test results on stdout. They are included here mainly to show various applications of the library, and the user will need to examine and change the source code to adapt them to other needs.
The executable test_sb can be make'd in the euclid subdirectory,
and will generate a variety of distributions of random sites,
build the data structure, and then test the data structure on
those sites.
Also included is the "MM" data structure, built with the build_M function, searched with
the search_M function,
and with driver drive_M. This data structure does not require the
triangle inequality, but does need some additional points, that are
used as representatives of typical query points. Most users
would probably prefer the sb data structure.A Variant Data Structure
Author
Ken Clarkson
Last Modified
Room 2C-455
Bell Laboratories
600 Mountain Ave
Murray Hill, NJ
07974
clarkson@bell-labs.com
Jan 2003