Selected Projects and Papers
Brenda S. Baker
Selected projects
- Software tools and algorithms for analyzing source code
These are based on a theory of parameterized string pattern matching [1-6].
A parameterized match (p-match) models systematic name changes, e.g. replacing x
and foo by y and fun everywhere throughout
a region of code.
- Dup [1-3,7]
Given source code and a threshold length set by the user, Dup finds all pairs of regions
that are over the threshold length and are either identical textually or are p-matches.
Dup can also be used to find exact duplication
in text. A scatter plot produced by Dup for a million-line production subsystem
is shown here.
- Pdiff
Pdiff finds the smallest edit distance
between two source files based on edit operations of insertion, deletion, and
p-match. It generates HTML to display the two files side by side
with the differences marked. (The theory is described in [6].)
- Software tools for analyzing executables without access to source code
- Identifying related executables
Compiling related Java programs can result in very
different bytecode files, when these files are viewed
as raw bytes.
We present several techniques for detecting similarities of Java programs
from their bytecode files, without access to the source [8].
We use these techniques to adapt Dup (above), Udi Manber's tool Siff,
and the UNIX utility Diff to deal with bytecode files.
Our tools make it possible to find
similarities among thousands of bytecode files, or
to compare one new file to an index of many old ones.
Possible applications include detection of plagiarized code,
software reuse, program management, and uninstallers.
- Exediff - patch files for executables
When source code can't be given out
or when data lines are slow, it would be convenient
to distribute updates to executables as patch files as is commonly
done for source files.
The difficulty is that
when primary changes are made to source code,
secondary changes (e.g. in pointers or jumps), can propagate throughout the executable.
Exediff
uses heuristics to reconstruct secondary changes where possible
in order to keep patch files small.
For the Alpha architecture, a comparison of the sizes of gzipped patch files created by Exediff and
by the
Bindiff
utility (which compares raw bytes) showed
that Exediff generally saved a factor of two for
major version changes and a factor of five for minor
version changes [9]. I have also implemented Exediff for Java bytecode files.
- WWW Content-Filtering
Concerns about possible censorship of the Internet
led to a project to modify a web proxy server to do content-filtering,
with a firewall to prevent avoidance of filtering [10].
Through
this project, I also helped to start up the Lucent Managed Firewall,
which led to the
VPN Firewall Family.
- Load balancing in networks of workstations
Mutual exclusion scheduling is the problem of scheduling unit-time
tasks non-preemptively on m processors subject to constraints
represented by a graph, such that tasks represented
by adjacent vertices run in disjoint time intervals.
This problem arises in load-balancing the parallel solution
of partial differential equations by domain decomposition.
We show that the problem is NP-hard, and investigate
when it can be approximated [11].
- Approximation schemes for planar graphs
I showed that planar graphs have an underlying recursive structure
that can be derived from planar embeddings, and that this structure
can be used to obtain polynomial-time approximation schemes for NP-hard maximization
or minimization problems, i.e. approximation algorithms
whose solution sizes converge toward optimal as the problem size increases.
The class of problems for which this approach provides approximation
schemes includes maximum independent set, maximum tile salvage,
partition into triangles, maximum H-matching, minimum vertex cover,
minimum dominating set, and minimum edge dominating set. [12]
-
A Theory of Parameterized Pattern Matching:
Algorithms and Applications (Extended Abstract).
ACM STOC,
1993, 71-80.
-
Parameterized String Pattern Matching.
JCSS 52,1, Feb. 1996, 28-42.
-
Parameterized Duplication in Strings: Algorithms
and an Application to Software Maintenance,
SIAM J. on Computing 26,5, Oct. 1997, 1343-1362.
-
Parameterized Pattern Matching by Boyer-Moore-type
Algorithms,
ACM-SIAM SODA, 1995, 541-550.
-
Longest
Common Subsequence from Fragments via Sparse Dynamic Programming,
European Symposium on Algorithms,
August, 1998
(with Raffaele Giancarlo).
-
Parameterized Diff,
ACM-SIAM SODA, Jan. 1999.
-
On Finding Duplication and Near-Duplication in Large
Software Systems,
Second Working Conf. on Reverse Engineering,
1995. Received IEEE Outstanding Paper Award.
-
Deducing
Similarities in Java Sources from Bytecodes, in USENIX
Annual Technical Conference, 1998 (with Udi Manber).
-
Compressing Differences of Executable
Code, in ACM
SIGPLAN Workshop on Compiler Support for System Software (WCSSS'99),
1999 (with Udi Manber and Robert
Muth).
-
Local Control over Filtered Access to the WWW,
Fourth World Wide Web Conference, December, 1995, Boston, MA
(with Eric Grosse).
Also WWW J. 1,1, Jan., 1996.
HTML,
gzipped PostScript.
-
Mutual Exclusion Scheduling,
Theoretical Computer Science 162,2, Aug., 1996, pp. 225-243
(with E.G. Coffman, Jr.).
-
Approximation Algorithms for NP-complete Problems on
Planar Graphs, J. ACM 41,1, Jan., 1994, pp. 153-180.
Last modified: Tue Aug 7 21:11:17 EDT 2001
Brenda S. Baker
bsb@research.bell-labs.com
Copyright
© 2001 Lucent Technologies. All rights reserved.