Papers Available on the Web
Internet Traffic Tends Toward Poisson and Independent as the
Load Increases,
  Nonlinear Estimation and Classification,
eds. C. Holmes, D. Denison, M. Hansen, B. Yu, and
B. Mallick, Springer, New York, 2002.
Jin Cao,
William S. Cleveland,
Dong Lin,
and Don X. Sun
Abstract
Network devices put packets on an Internet link, and multiplex, or
superpose, the packets from different active connections.
Extensive empirical and theoretical studies of packet traffic variables
--- arrivals, sizes, and packet counts --- demonstrate that the
number of active connections has a dramatic effect on traffic
characteristics. At low connection loads on an uncongested link
--- that is, with little or no queueing on the
link-input router --- the traffic variables are long-range dependent,
creating burstiness: large variation in the traffic bit rate.
As the load increases, the laws of superposition of marked point
processes push the arrivals toward Poisson,
the sizes toward independence, and reduce the variability of the counts
relative to the mean. This begins a reduction in the burstiness;
in network parlance, there are multiplexing gains.
Once the connection load is sufficiently large,
the network begins pushing back on the attraction to Poisson and
independence by causing queueing on the link-input router.
But if the link speed is high enough, the traffic can get quite
close to Poisson and independence before the push-back begins in force;
while some of the statistical properties are changed in this high-speed case,
the push-back does not resurrect the burstiness.
These results reverse the commonly-held presumption that Internet traffic
is everywhere bursty and that multiplexing gains do not occur.
Very simple statistical time series models --- fractional sum-difference
(FSD) models --- describe the statistical variability of the traffic
variables and their change toward Poisson and independence before significant
queueing sets in, and can be used to generate open-loop packet arrivals
and sizes for simulation studies.
Both science and engineering are affected. The magnitude of multiplexing
needs to become part of the fundamental scientific framework that guides
the study of Internet traffic. The engineering of Internet devices and
Internet networks needs to reflect the multiplexing gains.
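
The superposition effect described in this abstract can be seen in a toy computation. The sketch below is a Python illustration under simplified assumptions (independent renewal connections with heavy-tailed Pareto inter-arrival times), not the models or measurements of the paper; it superposes c connections and reports how close the pooled inter-arrival times come to the exponential (Poisson) benchmark as c grows.

import numpy as np

rng = np.random.default_rng(0)

def connection_arrivals(n_packets, alpha=1.5):
    # One connection: cumulative sums of Pareto(alpha) inter-arrival gaps (heavy-tailed).
    gaps = rng.pareto(alpha, size=n_packets) + 1.0
    return np.cumsum(gaps)

def pooled_interarrivals(c, n_packets=20000):
    # Superpose c independent connections and return the inter-arrival times of the
    # merged stream, restricted to the window where all c connections are active.
    streams = [connection_arrivals(n_packets) for _ in range(c)]
    horizon = min(s[-1] for s in streams)
    merged = np.sort(np.concatenate(streams))
    return np.diff(merged[merged <= horizon])

print(f"exponential (Poisson) benchmark: CV = 1.00, P(gap > 3*mean) = {np.exp(-3):.3f}")
for c in (1, 4, 16, 64, 256):
    t = pooled_interarrivals(c)
    cv = t.std() / t.mean()
    tail = np.mean(t > 3 * t.mean())
    print(f"c = {c:3d}   CV = {cv:6.2f}   P(gap > 3*mean) = {tail:.3f}")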
[compressed postscript]  [pdf]

Internet Traffic: Statistical Multiplexing Gains
  DIMACS Workshop on Internet and WWW Measurement, Mapping and
Modeling, 2002
Jin Cao,
William S. Cleveland,
Dong Lin,
and Don X. Sun
Abstract
This note describes recent results on the effect
of statistical multiplexing on the long-range dependence (LRD) of Internet
packet traffic. Details and bibliographies may be found in
a series of papers at
/stat/InternetTraffic/webpapers.html.
Let a(j), for j = 1, 2, ... be the arrival times of packets on an
Internet link, let t(j) = a(j+1) - a(j) be the inter-arrival times, and let
q(j) be the packet sizes. Suppose we divide time up into equal-length,
consecutive intervals. Let p(i) be the packet counts in interval i.
The t(j) and q(j) are studied as time series in j, and the p(i) as
a time series in i.
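
As a concrete companion to these definitions, the sketch below forms the three series from a packet trace; the arrival times and sizes here are small made-up placeholder arrays, and the interval width is an arbitrary example value.

import numpy as np

# Placeholder trace: arrival times a(j) in seconds and packet sizes q(j) in bytes.
a = np.array([0.0012, 0.0031, 0.0048, 0.0101, 0.0105, 0.0212])
q = np.array([1500, 40, 576, 1500, 40, 1500])

t = np.diff(a)                       # inter-arrival times t(j) = a(j+1) - a(j)
width = 0.005                        # interval length in seconds (example value)
edges = np.arange(0.0, a[-1] + width, width)
p, _ = np.histogram(a, bins=edges)   # packet counts p(i) in consecutive intervals
b, _ = np.histogram(a, bins=edges, weights=q)   # byte counts in the same intervals,
                                                # the fourth variable used in a companion paper
print("t(j):", t)
print("p(i):", p)
print("byte counts:", b)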
The packet traffic on a link is the result of the
statistical multiplexing of packets from different active
connections. Let c be the mean number of active connections over
an interval of time during which link usage is stationary.
c serves as a measure of the magnitude of statistical multiplexing over the
interval. This note addresses the effect of an increasing
c on the LRD of the three traffic variables t(j), q(j), and p(i).
Results are based on the following: (1) the mathematical
theory of marked point processes; (2) empirical study of
live packet traces from 15 interfaces whose 5-min average traffic rates range
from about 2 kbps to 250 mbps; (3) empirical study of synthetic packet
traces from network simulation with NS;
(4) simple statistical models, FSD models and FSD-MA(1) models,
for the traffic variables.
[postscript]  [pdf]

S-Net: A Software System for Analyzing Packet Header Databases,
  Proc. Passive and Active Measurement, 34-44, 2002
Jin Cao,
William S. Cleveland,
and Don X. Sun
Abstract
S-Net is a software system designed for analyzing large databases of
Internet packet headers. S-Net runs on Unix, and analysis is carried out
using the S language for visualization and data analysis. There are three
goals in its design: (1) to allow, despite the large size of databases,
comprehensive analysis --- detailed analysis down to the level of
individual observations, together with analysis by summaries of behavior
that aggregate many observations; (2) to provide an extensible environment
so that analysts can readily tailor methods to the specific objectives
of their analyses; and (3) to provide an implementation based on public
domain software, making S-Net readily accessible to network researchers.
S-Net is available at /stat/InternetTraffic/S-Net.
[compressed postscript]  [pdf]

The Effect of Statistical Multiplexing on the
Long Range Dependence of Internet Packet Traffic,
  Bell Labs Tech Report, 2002
Jin Cao,
William S. Cleveland,
Dong Lin,
and Don X. Sun
Abstract
As the number of active connections (NAC) on an Internet link increases,
the long-range dependence of packet traffic changes due to increased
statistical multiplexing of packets from different connections.
Four packet traffic variables are studied as time series --- inter-arrival
times, sizes, packet counts in 100 ms intervals, and byte counts in
100 ms intervals.
Results are based on the following: (1) the mathematical
theory of marked point processes; (2) empirical study of 2526 packet traces,
5 min or 90 sec in duration, from 6 Internet monitors measuring
15 interfaces ranging from 100 mbps to 622 mbps;
(3) simple statistical models for the traffic variables;
and (4) network simulation with NS.
All variables have components of
long-range dependence at all levels of the NAC. But the variances of
the long-range dependent components of the sizes and of
the inter-arrivals decrease to zero as the NAC increases; the sizes tend
toward independent, and the inter-arrivals tend toward independent or
very short range dependent. These changes in the sizes and inter-arrivals
are not arrested by the increased upstream queueing that eventually
occurs as the NAC increases. The long-range dependence of the count variables
does not change with the NAC, but their standard deviations relative
to the means decrease like one over the square root of the NAC,
making the counts smooth relative to the mean.
Theory suggests that once the NAC is large enough,
increased upstream queueing should alter these properties of the counts,
but in the empirical study and in the simulation study the NAC was not large
enough to observe an alteration for 100 ms counts.
The change in the long-range dependence of the sizes and
inter-arrivals does not contradict the constancy of the long-range dependence
of the counts because the summation operations that
produce counts from the arrivals and sizes act on an increasing number of
packets as the NAC increases.
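
The one-over-the-square-root-of-the-NAC statement has a simple back-of-the-envelope reading: if the link counts were exactly a sum of c independent, identically distributed per-connection count series, the mean would grow like c and the standard deviation like the square root of c. The sketch below is a toy check of that scaling under the independence assumption; it is not the trace analysis of the paper.

import numpy as np

rng = np.random.default_rng(1)

def per_connection_counts(n_intervals):
    # A deliberately bursty per-connection count series: mostly idle slots,
    # occasional geometric-sized bursts of packets.
    bursty_slot = rng.random(n_intervals) < 0.1
    return bursty_slot * rng.geometric(0.2, size=n_intervals)

n_intervals = 20000
for c in (1, 4, 16, 64, 256):
    link = sum(per_connection_counts(n_intervals) for _ in range(c))
    ratio = link.std() / link.mean()
    print(f"c = {c:3d}   sd/mean = {ratio:6.3f}   sqrt(c) * sd/mean = {np.sqrt(c) * ratio:5.3f}")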
[compressed postscript]  [pdf]  [extended abstract: postscript]  [extended abstract: pdf]

A Poisson Limit for Buffer Overflow Probabilities
  Proc. IEEE INFOCOM, 2002, to appear
Jin Cao
and Kavita Ramanan
Abstract
A key criterion in the design of high-speed networks is the
probability that the buffer content exceeds a given threshold. We
consider n independent identical traffic sources modelled as point
processes, which are fed into a link with speed proportional to n.
Under fairly general assumptions on the input processes we show that
the steady state probability of the buffer content exceeding a
threshold b > 0 tends to the corresponding probability assuming
Poisson input processes. We verify the assumptions for a large class
of long-range dependent sources commonly used to model data traffic.
Our results show that with superposition, significant multiplexing
gains can be achieved for even smaller buffers than suggested by
previous results, which consider O(n) buffer size. Moreover,
simulations show that for realistic values of the exceedance
probability and moderate utilisations, convergence to the Poisson
limit takes place at reasonable values of the number of sources
superposed. This is particularly relevant for high-speed networks in
which the cost of high-speed memory is significant.
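
A rough feel for the result can be had from a discrete-time toy queue. The sketch below uses a simplified setup that is not the paper's model: n independent on/off Markov sources feed a buffer served at a rate proportional to n, and the estimated exceedance probability P(Q > b) is compared with that of Poisson input of the same mean rate.

import numpy as np

rng = np.random.default_rng(2)

def onoff_arrivals(n_sources, n_slots, stay_on=0.9, stay_off=0.975):
    # Each source is a two-state Markov chain emitting one packet per slot while on;
    # with these transition probabilities the stationary on-probability is 0.2 and
    # on-periods last 10 slots on average, so the aggregate input is bursty.
    state = rng.random(n_sources) < 0.2
    total = np.zeros(n_slots, dtype=int)
    for k in range(n_slots):
        total[k] = state.sum()
        u = rng.random(n_sources)
        state = np.where(state, u < stay_on, u > stay_off)
    return total

def exceedance(arrivals, service, b):
    # Discrete-time queue recursion; estimate the fraction of slots with Q > b.
    q, hits = 0.0, 0
    for a in arrivals:
        q = max(q + a - service, 0.0)
        hits += q > b
    return hits / len(arrivals)

n_slots, b, rho = 200_000, 10, 0.9
for n in (10, 50, 250):
    mean_rate = 0.2 * n              # packets per slot, on average
    service = mean_rate / rho        # link speed proportional to n, utilisation rho
    bursty = onoff_arrivals(n, n_slots)
    poisson = rng.poisson(mean_rate, size=n_slots)
    print(f"n = {n:3d}   P(Q > b): bursty = {exceedance(bursty, service, b):.4f},"
          f"  Poisson = {exceedance(poisson, service, b):.4f}")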
[postscript]  [pdf]

On the Nonstationarity of Internet Traffic,
  Proc. ACM SIGMETRICS '01, 102-112, 2001
Jin Cao,
William S. Cleveland,
Dong Lin,
and Don X. Sun
Abstract
Traffic variables on an uncongested Internet wire exhibit a pervasive
nonstationarity. As r, the TCP connection rate, changes, marginal
distributions and long range dependence change. As r increases (1)
packet and connection arrival processes tend toward Poisson; (2) time
series of packet sizes, round-trip times, and transferred file sizes
tend toward independence. Extensive empirical study of packet traces
demonstrates and elucidates the nonstationarity in two ways: (1) as r
changes, the parameters of comprehensive statistical models fitted to
traffic variables change; (2) as r changes, queueing characteristics
of the packet traces change. The cause of the nonstationarity is
superposition: the intermingling of sequences of connections between
different source-destination pairs, and the intermingling of sequences
of packets from different connections. Nonstationarity needs to be added
to long-range dependence and heavy-tailed marginal distributions as one
of the fundamental characteristics of Internet traffic.
[compressed postscript]  [pdf]

IP Packet Generation: Statistical Models for TCP
Start Times Based on Connection-Rate Superposition,
  Proc. ACM SIGMETRICS '00, 166-177, 2000
William S. Cleveland,
Dong Lin,
and Don X. Sun
Abstract
TCP start times for HTTP are nonstationary. The nonstationarity occurs
because the start times on a link, a point process, are a superposition
of source traffic point processes, and the statistics of the superposition
change as the number of superposed processes changes. The start time
rate is a measure of the number of traffic sources. The univariate
distribution of the inter-arrival times is approximately Weibull,
and as the rate increases, the Weibull shape parameter goes to 1, an
exponential distribution. The autocorrelation of the log inter-arrival
times is described by a simple, two-parameter process: white noise plus a
long-range persistent time series. As the rate increases, the variance of
the persistent series tends to zero, so the log times tend to white noise.
A parsimonious statistical model for the log inter-arrivals accounts for
the autocorrelation, the Weibull distribution, and the nonstationarity
of both with the rate. The model, whose purpose is to provide
stochastic input to a network simulator, has the desirable property
that the superposition point process is generated as a single stream.
The parameters of the model are functions of the rate, so to generate
start times, only the rate is specified. As the rate increases, the model
tends to a Poisson process. These results arise from theoretical and
empirical study based on the concept of connection-rate superposition.
The theory is the mathematics of superposed point processes, and the
empiricism is an analysis of 23 million TCP connections organized into
10704 blocks of approximately 15 minutes each.
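
The ingredients of the model described above can be sketched as follows. The code treats the log inter-arrival times as white noise plus a long-range persistent component (truncated fractionally differenced noise) and gives the inter-arrival times a Weibull marginal whose shape moves toward 1 as the rate grows; the Gaussian-copula combination and the particular functions tying the persistent weight and the Weibull shape to the rate are illustrative placeholders, not the fitted model of the paper.

import numpy as np
from scipy.special import gamma as gamma_fn
from scipy.stats import norm

rng = np.random.default_rng(3)

def persistent_noise(n, d=0.35, k_max=2000):
    # Truncated moving-average representation of fractionally differenced white
    # noise, standardized; d in (0, 0.5) gives long-range persistence.
    psi = np.ones(k_max)
    for k in range(1, k_max):
        psi[k] = psi[k - 1] * (k - 1 + d) / k
    x = np.convolve(rng.standard_normal(n + k_max), psi, mode="valid")[:n]
    return (x - x.mean()) / x.std()

def start_time_gaps(rate, n=50000):
    # Placeholder dependence on the rate (not the paper's fits): the persistent
    # weight w and the gap between the Weibull shape and 1 both shrink as the rate grows.
    w = 1.0 / (1.0 + 0.05 * rate)
    shape = 1.0 - 0.5 / (1.0 + 0.05 * rate)
    z = np.sqrt(1.0 - w) * rng.standard_normal(n) + np.sqrt(w) * persistent_noise(n)
    u = norm.cdf(z)                                     # Gaussian copula
    scale = (1.0 / rate) / gamma_fn(1.0 + 1.0 / shape)  # makes the mean gap 1 / rate
    gaps = scale * (-np.log1p(-u)) ** (1.0 / shape)     # Weibull quantile transform
    return gaps, shape, w

for rate in (1, 10, 100):
    t, shape, w = start_time_gaps(rate)
    r1 = np.corrcoef(np.log(t[:-1]), np.log(t[1:]))[0, 1]
    print(f"rate = {rate:3d}  shape = {shape:.2f}  persistent weight = {w:.2f}  "
          f"mean gap = {t.mean():.4f}  lag-1 corr of log gaps = {r1:+.3f}")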
[compressed postscript]  [pdf]

Internet Traffic Data,
  Journal of the American Statistical Association, 95, 979-985, 2000.
Reprinted in Statistics in the 21st Century, edited by
A. E. Raftery, M. A. Tanner, and M. T. Wells, Chapman & Hall/CRC,
New York, 2002.
William S. Cleveland
and Don X. Sun.
Abstract
Internet engineering and management depend on an understanding of the
characteristics of network traffic. Statistical models are needed
that can generate traffic that mimics closely the observed behavior on
live Internet wires. Models can be used on their own for some tasks
and combined with network simulators for others. But the challenge
of model development is immense. Internet traffic data are ferocious.
Their statistical properties are complex, databases are very large,
Internet network topology is vast, and the engineering mechanism
is intricate and introduces feedback into the traffic.
Packet header collection and organization of the headers into
connection flows yields data rich in information about
traffic characteristics and serves as
an excellent framework for modeling. Many existing statistical tools
and models --- especially those for time series, point processes,
and marked point processes --- can be used to describe and model
the statistical characteristics, taking into account the structure
of the Internet, but new tools and models are needed.
[compressed postscript]  [pdf]

A Scalable Method for Estimating Network Traffic Matrices from Link Counts,
  Bell Labs Tech Report, 2001
Jin Cao,
D. Davis,
Scott Vander Wiel,
Bin Yu, and Zhengyuan Zhu.
Abstract
Traffic matrices are extremely useful for network configuration,
management, engineering, and pricing. Direct measurement is, however,
expensive in general and impossible in some cases. This paper
proposes a scalable algorithm for statistically estimating a traffic
matrix from the readily available link counts. It relies on a
divide-and-conquer strategy to lower the computational cost without
losing estimation accuracy. The proposed algorithm is tested on a
real network with 18 nodes. The estimates are comparable to the
direct estimates but require dramatically less computation.
[postscript]

Time-Varying Network Tomography: Router Link Data,
  Journal of the American Statistical Association, 95, 1063-1075, 2000
Jin Cao,
D. Davis,
Scott Vander Wiel,
and Bin Yu.
Abstract
The origin-destination (OD) traffic matrix of a computer network is
useful for solving problems in design, routing, configuration
debugging, monitoring, and pricing. Directly measuring this matrix is
not usually feasible but less informative link measurements are easy
to obtain.
This work studies the inference of OD byte counts from link byte
counts measured at router interfaces under a fixed routing scheme. A
basic model of the OD counts assumes that they are independent normal
over OD pairs and iid over successive measurement periods. The normal
means and variances are functionally related through a power law. We
deal with the time-varying nature of the counts by fitting the basic
iid model locally using a moving data window. Identifiability of the
model is proved for router link data and maximum likelihood is used
for parameter estimation. The OD counts are estimated by their
conditional expectations given the link counts and estimated
parameters. OD estimates are forced to be positive and to harmonize
with the link count measurements and the routing scheme. Finally, ML
estimation is improved by using an adaptive prior.
Proposed methods are applied to two simple networks at Lucent
Technologies and found to perform well. Furthermore, the estimates
are validated in a single-router network for which direct measurements
of origin-destination counts are available through special software.
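
The conditional-expectation step can be illustrated on a toy single-router network. In the sketch below the routing matrix has three input and three output interfaces (nine OD flows, six link counts), the model parameters are taken as known rather than estimated by maximum likelihood over a moving window, the power-law exponent is an example value, and the positivity and harmonization adjustments of the paper are omitted.

import numpy as np

rng = np.random.default_rng(4)

n_in, n_out = 3, 3
pairs = [(i, o) for i in range(n_in) for o in range(n_out)]   # the 9 OD flows

# Routing matrix: each OD flow contributes to one input link and one output link.
A = np.zeros((n_in + n_out, len(pairs)))
for k, (i, o) in enumerate(pairs):
    A[i, k] = 1.0
    A[n_in + o, k] = 1.0

lam = rng.uniform(50.0, 500.0, size=len(pairs))   # OD mean byte counts (illustrative)
phi, power = 0.5, 2.0                             # power-law relation: var = phi * mean**power
Sigma = np.diag(phi * lam**power)

x_true = rng.normal(lam, np.sqrt(np.diag(Sigma)))  # OD counts for one measurement period
y = A @ x_true                                     # the observed link counts

# E[x | y] for jointly Gaussian (x, y = A x); a pseudo-inverse is used because the
# link equations are redundant (total input bytes equal total output bytes).
x_hat = lam + Sigma @ A.T @ np.linalg.pinv(A @ Sigma @ A.T) @ (y - A @ lam)

for (i, o), xt, xh in zip(pairs, x_true, x_hat):
    print(f"OD {i}->{o}:  true = {xt:7.1f}   conditional-expectation estimate = {xh:7.1f}")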
[postscript]

Broadcast Audience Estimation,
  Proceedings of IEEE INFOCOM '00, 952-960, 2000
Chuanhai Liu
and Jorg Nonnenmacher
Abstract
We investigate the estimation of the size of a broadcast audience. We
give the Maximum Likelihood (ML) estimators for different scenarios. Using
Poisson approximation, we obtain upper and lower bounds for the audience
size. The ML estimators are obtained in closed form, as are the bounds. We
treat the problem of a malfunctioning feedback flow control, where
the sender is overwhelmed by feedback messages. The ML estimate of
the audience size is obtained using the Expectation-Maximization (EM)
algorithm, even under this condition of values censored by feedback
implosion. Quantifying the influence of feedback implosion on the loss of
information available for audience estimation, we further investigate the
choice of parameters that optimizes the estimation results. Details
are given in the paper.
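
As a generic illustration of closed-form ML audience estimation (a stand-in scenario, not necessarily one of the scenarios treated in the paper): if each of N receivers independently returns feedback with a known probability p and k responses are observed, then k is Binomial(N, p) and the likelihood is maximized near k / p, as the sketch below checks numerically.

import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(5)

N_true, p = 5000, 0.02            # true audience size and known response probability
k = rng.binomial(N_true, p)       # observed number of feedback messages

# Closed-form estimate and an explicit grid maximization as a sanity check.
n_grid = np.arange(k, int(3 * k / p) + 1)
loglik = binom.logpmf(k, n_grid, p)
print("observed responses k =", k)
print("closed-form estimate k / p =", k / p)
print("grid-maximized ML estimate =", n_grid[np.argmax(loglik)])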
[postscript]