Papers Available on the Web
Internet Traffic Tends Toward Poisson and Independent as the
Load Increases,
  Nonlinear Estimation and Classification,
eds. C. Holmes, D. Denison, M. Hansen, B. Yu, and
B. Mallick, Springer, New York, 2002.
Jin Cao,
William S. Cleveland,
Dong Lin,
and Don X. Sun
Abstract
Network devices put packets on an Internet link, and multiplex, or
superpose, the packets from different active connections.
Extensive empirical and theoretical studies of packet traffic variables
--- arrivals, sizes, and packet counts --- demonstrate that the
number of active connections has a dramatic effect on traffic
characteristics. At low connection loads on an uncongested link
--- that is, with little or no queueing on the
link-input router --- the traffic variables are long-range dependent,
creating burstiness: large variation in the traffic bit rate.
As the load increases, the laws of superposition of marked point
processes push the arrivals toward Poisson,
the sizes toward independence, and reduce the variability of the counts
relative to the mean. This begins a reduction in the burstiness;
in network parlance, there are multiplexing gains.
Once the connection load is sufficiently large,
the network begins pushing back on the attraction to Poisson and
independence by causing queueing on the link-input router.
But if the link speed is high enough, the traffic can get quite
close to Poisson and independence before the push-back begins in force;
while some of the statistical properties are changed in this high-speed case,
the push-back does not resurrect the burstiness.
These results reverse the commonly-held presumption that Internet traffic
is everywhere bursty and that multiplexing gains do not occur.
Very simple statistical time series models --- fractional sum-difference
(FSD) models --- describe the statistical variability of the traffic
variables and their change toward Poisson and independence before significant
queueing sets in, and can be used to generate open-loop packet arrivals
and sizes for simulation studies.
Both science and engineering are affected. The magnitude of multiplexing
needs to become part of the fundamental scientific framework that guides
the study of Internet traffic. The engineering of Internet devices and
Internet networks needs to reflect the multiplexing gains.
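
The superposition effect described in this abstract can be seen in a toy computation. The sketch below is a Python illustration under simplified assumptions (independent renewal connections with heavy-tailed Pareto inter-arrival times), not the models or measurements of the paper; it superposes c connections and reports how close the pooled inter-arrival times come to the exponential (Poisson) benchmark as c grows.

import numpy as np

rng = np.random.default_rng(0)

def connection_arrivals(n_packets, alpha=1.5):
    # One connection: cumulative sums of Pareto(alpha) inter-arrival gaps (heavy-tailed).
    gaps = rng.pareto(alpha, size=n_packets) + 1.0
    return np.cumsum(gaps)

def pooled_interarrivals(c, n_packets=20000):
    # Superpose c independent connections and return the inter-arrival times of the
    # merged stream, restricted to the window where all c connections are active.
    streams = [connection_arrivals(n_packets) for _ in range(c)]
    horizon = min(s[-1] for s in streams)
    merged = np.sort(np.concatenate(streams))
    return np.diff(merged[merged <= horizon])

print(f"exponential (Poisson) benchmark: CV = 1.00, P(gap > 3*mean) = {np.exp(-3):.3f}")
for c in (1, 4, 16, 64, 256):
    t = pooled_interarrivals(c)
    cv = t.std() / t.mean()
    tail = np.mean(t > 3 * t.mean())
    print(f"c = {c:3d}   CV = {cv:6.2f}   P(gap > 3*mean) = {tail:.3f}")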
[compressed postscript]  [pdf]

Internet Traffic: Statistical Multiplexing Gains
  DIMACS Workshop on Internet and WWW Measurement, Mapping and
Modeling, 2002
Jin Cao,
William S. Cleveland,
Dong Lin,
and Don X. Sun
Abstract
This note describes recent results on the effect
of statistical multiplexing on the long-range dependence (LRD) of Internet
packet traffic. Details and bibliographies may be found in
a series of papers at
/stat/InternetTraffic/webpapers.html.
Let a(j), for j = 1, 2, ... be the arrival times of packets on an
Internet link, let t(j) = a(j+1) - a(j) be the inter-arrival times, and let
q(j) be the packet sizes. Suppose we divide time up into equal-length,
consecutive intervals. Let p(i) be the packet counts in interval i.
The t(j) and q(j) are studied as time series in j, and the p(i) as
a time series in i.
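
As a concrete companion to these definitions, the sketch below forms the three series from a packet trace; the arrival times and sizes here are small made-up placeholder arrays, and the interval width is an arbitrary example value.

import numpy as np

# Placeholder trace: arrival times a(j) in seconds and packet sizes q(j) in bytes.
a = np.array([0.0012, 0.0031, 0.0048, 0.0101, 0.0105, 0.0212])
q = np.array([1500, 40, 576, 1500, 40, 1500])

t = np.diff(a)                       # inter-arrival times t(j) = a(j+1) - a(j)
width = 0.005                        # interval length in seconds (example value)
edges = np.arange(0.0, a[-1] + width, width)
p, _ = np.histogram(a, bins=edges)   # packet counts p(i) in consecutive intervals
b, _ = np.histogram(a, bins=edges, weights=q)   # byte counts in the same intervals,
                                                # the fourth variable used in a companion paper
print("t(j):", t)
print("p(i):", p)
print("byte counts:", b)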
The packet traffic on a link is the result of the
statistical multiplexing of packets from different active
connections. Let c be the mean number of active connections over
an interval of time during which link usage is stationary.
c serves as a measure of the magnitude of statistical multiplexing over the
interval. This note addresses the effect of an increasing
c on the LRD of the three traffic variables t(j), q(j), and p(i).
Results are based on the following: (1) the mathematical
theory of marked point processes; (2) empirical study of
live packet traces from 15 interfaces whose 5-min average traffic rates range
from about 2 kbps to 250 mbps; (3) empirical study of synthetic packet
traces from network simulation with NS;
(4) simple statistical models, FSD models and FSD-MA(1) models,
for the traffic variables.
[postscript]  [pdf]

S-Net: A Software System for Analyzing Packet Header Databases,
  Proc. Passive and Active Measurement, 34-44, 2002
Jin Cao,
William S. Cleveland,
and Don X. Sun
Abstract
S-Net is a software system designed for analyzing large databases of
Internet packet headers. S-Net runs on Unix, and analysis is carried out
using the S language for visualization and data analysis. There are three
goals in its design: (1) to allow, despite the large size of databases,
comprehensive analysis --- detailed analysis down to the level of
individual observations, together with analysis by summaries of behavior
that aggregate many observations; (2) to provide an extensible environment
so that analysts can readily tailor methods to the specific objectives
of their analyses; and (3) to provide an implementation based on public
domain software, making S-Net readily accessible to network researchers.
S-Net is available at /stat/InternetTraffic/S-Net.
[compressed postscript]  [pdf]

The Effect of Statistical Multiplexing on the
Long Range Dependence of Internet Packet Traffic,
  Bell Labs Tech Report, 2002
Jin Cao,
William S. Cleveland,
Dong Lin,
and Don X. Sun
Abstract
As the number of active connections (NAC) on an Internet link increases,
the long-range dependence of packet traffic changes due to increased
statistical multiplexing of packets from different connections.
Four packet traffic variables are studied as time series --- inter-arrival
times, sizes, packet counts in 100 ms intervals, and byte counts in
100 ms intervals.
Results are based on the following: (1) the mathematical
theory of marked point processes; (2) empirical study of 2526 packet traces,
5 min or 90 sec in duration, from 6 Internet monitors measuring
15 interfaces ranging from 100 mbps to 622 mbps;
(3) simple statistical models for the traffic variables;
and (4) network simulation with NS.
All variables have components of
long-range dependence at all levels of the NAC. But the variances of
the long-range dependent components of the sizes and of
the inter-arrivals decrease to zero as the NAC increases; the sizes tend
toward independent, and the inter-arrivals tend toward independent or
very short range dependent. These changes in the sizes and inter-arrivals
are not arrested by the increased upstream queueing that eventually
occurs as the NAC increases. The long-range dependence of the count variables
does not change with the NAC, but their standard deviations relative
to the means decrease like one over the square root of the NAC,
making the counts smooth relative to the mean.
Theory suggests that once the NAC is large enough,
increased upstream queueing should alter these properties of the counts,
but in the empirical study and in the simulation study the NAC was not large
enough to observe an alteration for 100 ms counts.
The change in the long-range dependence of the sizes and
inter-arrivals does not contradict the constancy of the long-range dependence
of the counts because the summation operations that
produce counts from the arrivals and sizes act on an increasing number of
packets as the NAC increases.
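
The one-over-the-square-root-of-the-NAC statement has a simple back-of-the-envelope reading: if the link counts were exactly a sum of c independent, identically distributed per-connection count series, the mean would grow like c and the standard deviation like the square root of c. The sketch below is a toy check of that scaling under the independence assumption; it is not the trace analysis of the paper.

import numpy as np

rng = np.random.default_rng(1)

def per_connection_counts(n_intervals):
    # A deliberately bursty per-connection count series: mostly idle slots,
    # occasional geometric-sized bursts of packets.
    bursty_slot = rng.random(n_intervals) < 0.1
    return bursty_slot * rng.geometric(0.2, size=n_intervals)

n_intervals = 20000
for c in (1, 4, 16, 64, 256):
    link = sum(per_connection_counts(n_intervals) for _ in range(c))
    ratio = link.std() / link.mean()
    print(f"c = {c:3d}   sd/mean = {ratio:6.3f}   sqrt(c) * sd/mean = {np.sqrt(c) * ratio:5.3f}")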
[compressed postscript]  [pdf]  [extended abstract: postscript]  [extended abstract: pdf]

A Poisson Limit for Buffer Overflow Probabilities
  Proc. IEEE INFOCOM, 2002, to appear
Jin Cao
and Kavita Ramanan
Abstract
A key criterion in the design of high-speed networks is the
probability that the buffer content exceeds a given threshold. We
consider n independent identical traffic sources modelled as point
processes, which are fed into a link with speed proportional to n.
Under fairly general assumptions on the input processes we show that
the steady state probability of the buffer content exceeding a
threshold b > 0 tends to the corresponding probability assuming
Poisson input processes. We verify the assumptions for a large class
of long-range dependent sources commonly used to model data traffic.
Our results show that with superposition, significant multiplexing
gains can be achieved for even smaller buffers than suggested by
previous results, which consider O(n) buffer size. Moreover,
simulations show that for realistic values of the exceedance
probability and moderate utilisations, convergence to the Poisson
limit takes place at reasonable values of the number of sources
superposed. This is particularly relevant for high-speed networks in
which the cost of high-speed memory is significant.
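
A rough feel for the result can be had from a discrete-time toy queue. The sketch below uses a simplified setup that is not the paper's model: n independent on/off Markov sources feed a buffer served at a rate proportional to n, and the estimated exceedance probability P(Q > b) is compared with that of Poisson input of the same mean rate.

import numpy as np

rng = np.random.default_rng(2)

def onoff_arrivals(n_sources, n_slots, stay_on=0.9, stay_off=0.975):
    # Each source is a two-state Markov chain emitting one packet per slot while on;
    # with these transition probabilities the stationary on-probability is 0.2 and
    # on-periods last 10 slots on average, so the aggregate input is bursty.
    state = rng.random(n_sources) < 0.2
    total = np.zeros(n_slots, dtype=int)
    for k in range(n_slots):
        total[k] = state.sum()
        u = rng.random(n_sources)
        state = np.where(state, u < stay_on, u > stay_off)
    return total

def exceedance(arrivals, service, b):
    # Discrete-time queue recursion; estimate the fraction of slots with Q > b.
    q, hits = 0.0, 0
    for a in arrivals:
        q = max(q + a - service, 0.0)
        hits += q > b
    return hits / len(arrivals)

n_slots, b, rho = 200_000, 10, 0.9
for n in (10, 50, 250):
    mean_rate = 0.2 * n              # packets per slot, on average
    service = mean_rate / rho        # link speed proportional to n, utilisation rho
    bursty = onoff_arrivals(n, n_slots)
    poisson = rng.poisson(mean_rate, size=n_slots)
    print(f"n = {n:3d}   P(Q > b): bursty = {exceedance(bursty, service, b):.4f},"
          f"  Poisson = {exceedance(poisson, service, b):.4f}")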
[postscript]  [pdf]

On the Nonstationarity of Internet Traffic,
  Proc. ACM SIGMETRICS '01, 102-112, 2001
Jin Cao,
William S. Cleveland,
Dong Lin,
and Don X. Sun
Abstract
Traffic variables on an uncongested Internet wire exhibit a pervasive
nonstationarity. As r, the TCP connection rate, changes, marginal
distributions and long range dependence change. As r increases (1)
packet and connection arrival processes tend toward Poisson; (2) time
series of packet sizes, round-trip times, and transferred file sizes
tend toward independence. Extensive empirical study of packet traces
demonstrates and elucidates the nonstationarity in two ways: (1) as r
changes, the parameters of comprehensive statistical models fitted to
traffic variables change; (2) as r changes, queueing characteristics
of the packet traces change. The cause of the nonstationarity is
superposition: the intermingling of sequences of connections between
different source-destination pairs, and the intermingling of sequences
of packets from different connections. Nonstationarity needs to be added
to long-range dependence and heavy-tailed marginal distributions as one
of the fundamental characteristics of Internet traffic.
[compressed postscript]  [pdf]

IP Packet Generation: Statistical Models for TCP
Start Times Based on Connection-Rate Superposition,
  Proc. ACM SIGMETRICS '00, 166-177, 2000
William S. Cleveland,
Dong Lin,
and Don X. Sun
Abstract
TCP start times for HTTP are nonstationary. The nonstationarity occurs
because the start times on a link, a point process, are a superposition
of source traffic point processes, and the statistics of the superposition
change as the number of superposed processes changes. The start time
rate is a measure of the number of traffic sources. The univariate
distribution of the inter-arrival times is approximately Weibull,
and as the rate increases, the Weibull shape parameter goes to 1, an
exponential distribution. The autocorrelation of the log inter-arrival
times is described by a simple, two-parameter process: white noise plus a
long-range persistent time series. As the rate increases, the variance of
the persistent series tends to zero, so the log times tend to white noise.
A parsimonious statistical model for the log inter-arrivals accounts for
the autocorrelation, the Weibull distribution, and the nonstationarity
of both with the rate. The model, whose purpose is to provide
stochastic input to a network simulator, has the desirable property
that the superposition point process is generated as a single stream.
The parameters of the model are functions of the rate, so to generate
start times, only the rate is specified. As the rate increases, the model
tends to a Poisson process. These results arise from theoretical and
empirical study based on the concept of connection-rate superposition.
The theory is the mathematics of superposed point processes, and the
empiricism is an analysis of 23 million TCP connections organized into
10704 blocks of approximately 15 minutes each.
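
The ingredients of the model described above can be sketched as follows. The code treats the log inter-arrival times as white noise plus a long-range persistent component (truncated fractionally differenced noise) and gives the inter-arrival times a Weibull marginal whose shape moves toward 1 as the rate grows; the Gaussian-copula combination and the particular functions tying the persistent weight and the Weibull shape to the rate are illustrative placeholders, not the fitted model of the paper.

import numpy as np
from scipy.special import gamma as gamma_fn
from scipy.stats import norm

rng = np.random.default_rng(3)

def persistent_noise(n, d=0.35, k_max=2000):
    # Truncated moving-average representation of fractionally differenced white
    # noise, standardized; d in (0, 0.5) gives long-range persistence.
    psi = np.ones(k_max)
    for k in range(1, k_max):
        psi[k] = psi[k - 1] * (k - 1 + d) / k
    x = np.convolve(rng.standard_normal(n + k_max), psi, mode="valid")[:n]
    return (x - x.mean()) / x.std()

def start_time_gaps(rate, n=50000):
    # Placeholder dependence on the rate (not the paper's fits): the persistent
    # weight w and the gap between the Weibull shape and 1 both shrink as the rate grows.
    w = 1.0 / (1.0 + 0.05 * rate)
    shape = 1.0 - 0.5 / (1.0 + 0.05 * rate)
    z = np.sqrt(1.0 - w) * rng.standard_normal(n) + np.sqrt(w) * persistent_noise(n)
    u = norm.cdf(z)                                     # Gaussian copula
    scale = (1.0 / rate) / gamma_fn(1.0 + 1.0 / shape)  # makes the mean gap 1 / rate
    gaps = scale * (-np.log1p(-u)) ** (1.0 / shape)     # Weibull quantile transform
    return gaps, shape, w

for rate in (1, 10, 100):
    t, shape, w = start_time_gaps(rate)
    r1 = np.corrcoef(np.log(t[:-1]), np.log(t[1:]))[0, 1]
    print(f"rate = {rate:3d}  shape = {shape:.2f}  persistent weight = {w:.2f}  "
          f"mean gap = {t.mean():.4f}  lag-1 corr of log gaps = {r1:+.3f}")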
[compressed postscript]  [pdf]

Internet Traffic Data,
  Journal of the American Statistical Association, 95, 979-985, 2000.
Reprinted in Statistics in the 21st Century, edited by
A. E. Raftery, M. A. Tanner, and M. T. Wells, Chapman & Hall/CRC,
New York, 2002.
William S. Cleveland
and Don X. Sun.
Abstract
Internet engineering and management depend on an understanding of the
characteristics of network traffic. Statistical models are needed
that can generate traffic that mimics closely the observed behavior on
live Internet wires. Models can be used on their own for some tasks
and combined with network simulators for others. But the challenge
of model development is immense. Internet traffic data are ferocious.
Their statistical properties are complex, databases are very large,
Internet network topology is vast, and the engineering mechanism
is intricate and introduces feedback into the traffic.
Packet header collection and organization of the headers into
connection flows yields data rich in information about
traffic characteristics and serves as
an excellent framework for modeling. Many existing statistical tools
and models --- especially those for time series, point processes,
and marked point processes --- can be used to describe and model
the statistical characteristics, taking into account the structure
of the Internet, but new tools and models are needed.
[compressed postscript]  [pdf]

A Scalable Method for Estimating Network Traffic Matrices from Link Counts,
  Bell Labs Tech Report, 2001
Jin Cao,
D. Davis,
Scott Vander Wiel,
Bin Yu, and Zhengyuan Zhu.
Abstract
Traffic matrices are extremely useful for network configuration,
management, engineering, and pricing. Direct measurement is, however,
expensive in general and impossible in some cases. This paper
proposes a scalable algorithm for statistically estimating a traffic
matrix from the readily available link counts. It relies on a
divide-and-conquer strategy to lower the computational cost without
losing estimation accuracy. The proposed algorithm is tested on a
real network with 18 nodes. The estimates are comparable to the
direct estimates but require dramatically less computation.
[postscript]

Time-Varying Network Tomography: Router Link Data,
  Journal of the American Statistical Association, 95, 1063-1075, 2000
Jin Cao,
D. Davis,
Scott Vander Wiel,
and Bin Yu.
Abstract
The origin-destination (OD) traffic matrix of a computer network is
useful for solving problems in design, routing, configuration
debugging, monitoring, and pricing. Directly measuring this matrix is
not usually feasible but less informative link measurements are easy
to obtain.
This work studies the inference of OD byte counts from link byte
counts measured at router interfaces under a fixed routing scheme. A
basic model of the OD counts assumes that they are independent normal
over OD pairs and iid over successive measurement periods. The normal
means and variances are functionally related through a power law. We
deal with the time-varying nature of the counts by fitting the basic
iid model locally using a moving data window. Identifiability of the
model is proved for router link data and maximum likelihood is used
for parameter estimation. The OD counts are estimated by their
conditional expectations given the link counts and estimated
parameters. OD estimates are forced to be positive and to harmonize
with the link count measurements and the routing scheme. Finally, ML
estimation is improved by using an adaptive prior.
Proposed methods are applied to two simple networks at Lucent
Technologies and found to perform well. Furthermore, the estimates
are validated in a single-router network for which direct measurements
of origin-destination counts are available through special software.
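
The conditional-expectation step can be illustrated on a toy single-router network. In the sketch below the routing matrix has three input and three output interfaces (nine OD flows, six link counts), the model parameters are taken as known rather than estimated by maximum likelihood over a moving window, the power-law exponent is an example value, and the positivity and harmonization adjustments of the paper are omitted.

import numpy as np

rng = np.random.default_rng(4)

n_in, n_out = 3, 3
pairs = [(i, o) for i in range(n_in) for o in range(n_out)]   # the 9 OD flows

# Routing matrix: each OD flow contributes to one input link and one output link.
A = np.zeros((n_in + n_out, len(pairs)))
for k, (i, o) in enumerate(pairs):
    A[i, k] = 1.0
    A[n_in + o, k] = 1.0

lam = rng.uniform(50.0, 500.0, size=len(pairs))   # OD mean byte counts (illustrative)
phi, power = 0.5, 2.0                             # power-law relation: var = phi * mean**power
Sigma = np.diag(phi * lam**power)

x_true = rng.normal(lam, np.sqrt(np.diag(Sigma)))  # OD counts for one measurement period
y = A @ x_true                                     # the observed link counts

# E[x | y] for jointly Gaussian (x, y = A x); a pseudo-inverse is used because the
# link equations are redundant (total input bytes equal total output bytes).
x_hat = lam + Sigma @ A.T @ np.linalg.pinv(A @ Sigma @ A.T) @ (y - A @ lam)

for (i, o), xt, xh in zip(pairs, x_true, x_hat):
    print(f"OD {i}->{o}:  true = {xt:7.1f}   conditional-expectation estimate = {xh:7.1f}")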
[postscript]

Broadcast Audience Estimation,
  Proceedings of IEEE INFOCOM '00, 952-960, 2000
Chuanhai Liu
and Jorg Nonnenmacher
Abstract
We investigate the estimation of the size of a broadcast audience. We
give the Maximum Likelihood (ML) estimators for different scenarios. Using
Poisson approximation, we obtain upper and lower bounds for the audience
size. The ML estimators are obtained in closed form, as are the bounds. We
treat the problem of a malfunctioning feedback flow control, where
the sender is overwhelmed by feedback messages. The ML estimate of
the audience size is obtained using the Expectation-Maximization (EM)
algorithm, even under this condition of values censored by feedback
implosion. Quantifying the influence of feedback implosion on the loss of
information available for audience estimation, we further investigate the
choice of parameters that optimizes the estimation results. Details
are given in the paper.
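
As a generic illustration of closed-form ML audience estimation (a stand-in scenario, not necessarily one of the scenarios treated in the paper): if each of N receivers independently returns feedback with a known probability p and k responses are observed, then k is Binomial(N, p) and the likelihood is maximized near k / p, as the sketch below checks numerically.

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(5)

N_true, p = 5000, 0.02            # true audience size and known response probability
k = rng.binomial(N_true, p)       # observed number of feedback messages

# Closed-form estimate and an explicit grid maximization as a sanity check.
n_grid = np.arange(k, int(3 * k / p) + 1)
loglik = binom.logpmf(k, n_grid, p)
print("observed responses k =", k)
print("closed-form estimate k / p =", k / p)
print("grid-maximized ML estimate =", n_grid[np.argmax(loglik)])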
[postscript]