This unique volume introduces computer scientists to the formal study of theoretical algorithms for Big Data applications, preparing them to work on such algorithms in the future. It also serves as a useful reference guide for the general computer science population, providing a comprehensive overview of the fascinating world of such algorithms.
To achieve these goals, the algorithmic results presented have been carefully chosen to demonstrate the important techniques and tools used in Big Data algorithms, while avoiding tedious calculations and the need for a very deep mathematical background.
Sample Chapter(s)
Preface
Chapter 1: Introduction to Data Stream Algorithms
https://doi.org/10.1142/9789811204746_fmatter
https://doi.org/10.1142/9789811204746_0001
Modern computer networks often include monitoring elements which constantly analyze the network’s traffic. These monitoring elements can serve various purposes, such as studying changes in the network’s use over time or detecting malicious network activity. The algorithms employed by the monitoring elements have to process all the network traffic they receive, and then extract from this traffic the information they are looking for. The task faced by these algorithms is made more complicated by the fact that a monitoring element typically receives a huge amount of traffic and can store only a small fraction of it. In other words, an algorithm used by the monitoring element views the traffic received by the element packet by packet. The algorithm can then store some of these packets in memory for future reference; however, the memory available to the algorithm is relatively small, and thus, at any given time it can keep in memory only a small fraction of the packets it has viewed. Intuitively, this memory restriction means that in order to produce a meaningful output, the algorithm must find a way to “guess” which packets are important enough to be kept in memory, and separate them from the rest of the traffic…
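To make this setting concrete, here is a minimal Python sketch of the structure of such a one-pass algorithm (the names are illustrative, not taken from the book): the algorithm sees the stream token by token and keeps only a small, bounded state between tokens.

```python
class StreamAlgorithm:
    """Skeleton of a one-pass data stream algorithm (illustrative)."""

    def __init__(self):
        self.state = 0  # bounded memory: here, just a single counter

    def process(self, token):
        # Each token is inspected once as it arrives and is then gone;
        # only the small state survives to the next token.
        self.state += 1

    def output(self):
        return self.state  # e.g., the length of the stream

alg = StreamAlgorithm()
for packet in ["a", "b", "a", "c"]:  # stands in for the network traffic
    alg.process(packet)
print(alg.output())  # 4
```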
https://doi.org/10.1142/9789811204746_0002
In this chapter, we take a short break from the data stream model and study some tools from probability theory that we will use throughout the rest of the book. The chapter will begin with a review of basic probability theory. This review will be quite quick, as we assume that the reader has already taken a basic probability course at some point and, thus, is familiar with the reviewed material. After the review, we will study the topic of tail bounds, which is an essential tool for many of the results we will see in this book, but is not covered in depth by most basic probability courses.
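For concreteness, two classical tail bounds of the kind studied in this chapter are Markov's inequality and Chebyshev's inequality (both standard results, stated here as examples):

```latex
% Markov's inequality: for a non-negative random variable X and any a > 0,
\Pr[X \ge a] \;\le\; \frac{\mathbb{E}[X]}{a} .
% Chebyshev's inequality: for a random variable X with finite variance
% and any t > 0,
\Pr\bigl[\,|X - \mathbb{E}[X]| \ge t\,\bigr] \;\le\; \frac{\operatorname{Var}[X]}{t^{2}} .
```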
https://doi.org/10.1142/9789811204746_0003
In Chapter 1, we saw the data stream model and studied a few algorithms for this model, all of which produced exact solutions. Unfortunately, producing exact solutions often requires a large space complexity, and thus, most of the algorithms we will see in this book will be estimation and approximation algorithms. An estimation algorithm is an algorithm which produces an estimate for some value (such as the length of the stream). Similarly, an approximation algorithm is an algorithm which, given an optimization problem, finds a solution that approximately optimizes the objective function of the problem. For example, an approximation algorithm for the minimum weight spanning tree problem produces a spanning tree whose weight is approximately minimal…
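As a concrete taste of an estimation algorithm for the value mentioned in the example above (the length of the stream), the following Python sketch implements Morris's approximate counter, a standard textbook technique that stores roughly the logarithm of the count rather than the count itself. This is an illustration of the concept, not necessarily the exact algorithm presented in the book.

```python
import random

class MorrisCounter:
    """Approximate stream-length counter (a standard sketch)."""

    def __init__(self):
        self.x = 0  # stores roughly log2 of the stream length

    def process(self, token):
        # Increment x only with probability 2^(-x), so x grows
        # logarithmically with the true count.
        if random.random() < 2.0 ** (-self.x):
            self.x += 1

    def estimate(self):
        return 2 ** self.x - 1  # an unbiased estimate of the length

counter = MorrisCounter()
for _ in range(100_000):
    counter.process(None)
print(counter.estimate())  # around 100000, up to random deviation
```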
https://doi.org/10.1142/9789811204746_0004
Many data stream algorithms work in two steps. In the first step, the algorithm reads its input stream and produces a summary of this stream, while in the second step the algorithm processes the summary and produces an output. The summary used by such an algorithm should satisfy two seemingly contradictory requirements. On the one hand, it should be short so as to keep the space complexity of the algorithm small. On the other hand, the summary should capture enough information from the original stream to allow the algorithm to produce an (approximately) correct answer based on the summary alone. In many cases, it turns out that a summary achieving a good balance between these requirements can be produced by simply taking a random sample of tokens out of the algorithm’s input stream. In this chapter, we study data stream algorithms for extracting random samples of tokens from the stream, and present one application of these algorithms.
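A classic way to maintain such a sample is reservoir sampling, which keeps a uniform random sample of k tokens from a stream of unknown length using only O(k) memory. The following Python sketch illustrates the idea; the exact algorithms presented in the chapter and their analysis may differ.

```python
import random

def reservoir_sample(stream, k):
    """Return k tokens, each subset equally likely (reservoir sampling)."""
    reservoir = []
    for i, token in enumerate(stream):
        if i < k:
            reservoir.append(token)      # fill the reservoir first
        else:
            j = random.randint(0, i)     # uniform over {0, ..., i}
            if j < k:
                reservoir[j] = token     # keep token with probability k/(i+1)
    return reservoir

print(reservoir_sample(range(1_000), 5))  # a uniform sample of 5 tokens
```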
https://doi.org/10.1142/9789811204746_0005
In many settings, it is desirable to have access to a random function. Unfortunately, storing such a function requires keeping a table that maps every possible element to its image, which is often infeasible. A popular workaround for this problem is to use hash functions, which are functions that behave like random functions (in some sense) but can be stored using a small amount of space. This workaround is especially important for data stream algorithms, which are often required to have very low space complexities. In this chapter, we present one particularly useful kind of hash function. These hash functions will be used in many of the data stream algorithms that we will see in the following chapters.
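For illustration, one standard construction with this flavor (possibly different from the exact family presented in the chapter) is the pairwise-independent family h(x) = ((ax + b) mod p) mod m: drawing a and b at random selects a function from the family, and storing the function means storing only a and b.

```python
import random

P = 2_147_483_647  # a fixed prime larger than the universe of tokens

def random_hash(m):
    """Draw one function from the family; storing it means storing a, b."""
    a = random.randint(1, P - 1)
    b = random.randint(0, P - 1)
    return lambda x: ((a * x + b) % P) % m

h = random_hash(10)
print(h(42), h(42), h(43))  # equal inputs always hash consistently
```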
https://doi.org/10.1142/9789811204746_0006
Counting the distinct tokens in a data stream is an important problem with many real-world applications. For example, a router often tracks statistics such as the number of distinct IPs appearing in the packets that have passed through it, or the number of distinct URLs that have been requested. A juicier application of this problem is the detection of Denial of Service attacks on servers. Such attacks are often characterized by a large amount of traffic originating from a relatively limited number of computers or sub-networks, and thus, can sometimes be detected by simply counting the distinct traffic sources…
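A classical idea for this problem, shown below as a deliberately crude single-estimator Python sketch, is to hash every token (so that duplicates collide) and remember only the largest number of trailing zero bits seen among the hash values; roughly d distinct tokens must appear before a hash with log2(d) trailing zeros shows up. Practical algorithms combine many such estimators, and the chapter's algorithms may differ in detail.

```python
import random

def trailing_zeros(x, cap=32):
    """Number of trailing zero bits of x (capped for x = 0)."""
    if x == 0:
        return cap
    z = 0
    while x & 1 == 0:
        x >>= 1
        z += 1
    return z

P = 2_147_483_647
a, b = random.randint(1, P - 1), random.randint(0, P - 1)

max_z = 0
for token in [3, 7, 3, 9, 7, 3, 11]:   # a stream with 4 distinct tokens
    h = (a * token + b) % P            # duplicates get identical hashes
    max_z = max(max_z, trailing_zeros(h))

print(2 ** max_z)  # a crude estimate of the number of distinct tokens
```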
https://doi.org/10.1142/9789811204746_0007
As discussed in Chapter 4, many data stream algorithms work by first constructing a short summary of their input data stream, and then extracting their answer from this summary. This mode of operation naturally suggests the following question. Suppose we would like to speed up a calculation by using multiple machines, and suppose that we use each machine to construct a summary of a different part of the data stream. Then, is it possible to efficiently combine the summaries we get for the different parts of the stream into a new summary for the entire stream…
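As a toy illustration of the kind of positive answer one can hope for, some summaries are naturally "mergeable". For example (an illustrative sketch, not the chapter's general treatment), a summary consisting of the maximum hash value seen in a stream combines across parts of the stream by simply taking a maximum:

```python
def summarize(part):
    return max(part)        # toy summary of one part of the stream

def merge(s1, s2):
    return max(s1, s2)      # combine two partial summaries

stream = [4, 8, 15, 16, 23, 42]
part1, part2 = stream[:3], stream[3:]  # processed by two machines

# Merging the partial summaries gives exactly the summary that a single
# machine would have computed over the entire stream.
assert merge(summarize(part1), summarize(part2)) == summarize(stream)
print(merge(summarize(part1), summarize(part2)))  # 42
```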
https://doi.org/10.1142/9789811204746_0008
In the previous chapters, we have studied algorithms that considered the stream as a “collection” of abstract tokens and did not attribute any specific “meaning” to the values of the tokens. Accordingly, these algorithms estimated properties of the stream, such as the number of distinct token values, which make sense even when the token values have no meaning (the sole exception to this rule was that in some places we assumed the existence of a natural order between token values, which requires the token values to have some slight meaning)…
https://doi.org/10.1142/9789811204746_0009
While presenting the data stream model in Chapter 1, we explained it using two examples. In one example, the input for the algorithm appeared on a magnetic tape which can be efficiently accessed only in a sequential manner. In this example, the reason that we got the input in the form of a data stream is technical (it is the only way to efficiently read the input), and has nothing to do with the nature of the problem that the algorithm tried to solve. The other motivating example in Chapter 1 was a network monitoring element which views the packets going through the network, and has to perform certain calculations based on these packets. For example, the monitoring element might have to raise the alarm if it detects malicious network activity. Note that in this example the fact that the monitoring element receives the packets in the form of a data stream is a natural consequence of the nature of the problem itself. In other words, the input for this problem consists of a set of “events” (each event is the pass of a single packet through the network), and the stream contains these events in the order in which they occurred…
https://doi.org/10.1142/9789811204746_0010
Traditionally, an algorithm is considered efficient if it has a polynomial time complexity. However, as modern applications require algorithms to process more and more data, it is often necessary for a truly practical algorithm to have a much better time complexity guarantee than merely being polynomial. Accordingly, faster and faster algorithms have been developed over the last decades for many computational problems. Usually, the fastest algorithm that people hope for is a linear time algorithm, since this time complexity is necessary even for just reading the entire input. Nevertheless, in this part of the book we will consider sublinear time algorithms, i.e., algorithms that break the above barrier and use (significantly) less than linear time. As a sublinear time algorithm cannot even read all of its input, its output is always only an approximate solution in some sense. However, despite this clear weakness, sublinear time algorithms have gained importance because they often provide a practical way to solve problems on very large datasets.
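A minimal example of this flavor of algorithm (an illustration, not one taken from the book): the fraction of 1s in a huge binary array can be estimated by probing only a small random sample of its positions, so the running time is independent of the input length.

```python
import random

def estimate_fraction(array, samples=1_000):
    """Estimate the fraction of 1s by reading only `samples` positions."""
    hits = sum(array[random.randrange(len(array))] for _ in range(samples))
    return hits / samples  # close to the truth with high probability

data = [1] * 300_000 + [0] * 700_000  # true fraction of 1s is 0.3
print(estimate_fraction(data))        # roughly 0.3 after 1000 probes
```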
https://doi.org/10.1142/9789811204746_0011
Many computational problems ask an algorithm to check whether a given object has a certain property. For example, given a graph we might be interested in determining whether it is connected or not. Unfortunately, very often such problems cannot be solved by sublinear time algorithms because the difference between an object that has the property and an object that does not can be very subtle, which effectively forces any algorithm for testing the property to read most of the object. Property testing algorithms avoid this issue by relaxing their guarantee a bit. More specifically, a property testing algorithm should detect correctly that the object has the property when it has it, and it should also detect correctly that the object does not have the property when it is far from having it (in some sense). However, the property testing algorithm is allowed to answer either way when the object almost has the property. For example, a property testing algorithm for connectivity of graphs should indicate correctly that a connected graph is connected, and should also indicate correctly that a graph with many connected components is not connected. However, such an algorithm is allowed to answer either way given an unconnected graph with only a few connected components…
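The following toy Python sketch (an illustration, not an algorithm from the book) shows the shape of such a guarantee for a particularly simple property, namely being the all-zeros string: an all-zeros input is always accepted, an input in which at least an epsilon fraction of the positions are 1 is rejected with constant probability (which can be amplified by repetition), and an input with only a few 1s may be answered either way.

```python
import random

def test_all_zeros(s, epsilon=0.1):
    """Toy property tester for the all-zeros property."""
    for _ in range(int(2 / epsilon)):       # O(1/epsilon) random probes
        if s[random.randrange(len(s))] == 1:
            return False  # found a 1: definitely not all zeros
    return True           # no 1 found: all zeros, or close to it

print(test_all_zeros([0] * 10_000))               # always True
print(test_all_zeros([1] * 2_000 + [0] * 8_000))  # False w.h.p. (20% ones)
```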
https://doi.org/10.1142/9789811204746_0012
There are multiple natural representations for graphs. For example, a graph can be represented by its adjacency matrix, or by storing the adjacency list of each vertex. Traditionally, the designer of a graph algorithm is free to assume that the algorithm’s input graph is given in any one of the standard representations. This assumption often makes sense because the graph can be easily converted from one representation to another in polynomial (or often even linear) time. However, in the context of sublinear time algorithms, the algorithm does not have enough time to convert the input graph from one representation to another, and thus, the representation in which the input graph is given can have a major impact on the things that can be performed in sublinear time. For example, it might be possible to determine in sublinear time whether a graph is connected when it is given in one representation, but not when it is given in another representation…
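For intuition, the two representations mentioned above support different elementary queries at unit cost, which is exactly what matters when an algorithm can afford only a sublinear number of probes; a tiny illustrative sketch:

```python
# Adjacency matrix: "is (u, v) an edge?" costs a single probe, which
# suits dense graphs.
matrix = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
print(matrix[0][1] == 1)  # edge query for the pair (0, 1)

# Adjacency lists: "what is the i-th neighbor of u?" costs a single
# probe, which suits sparse (e.g., bounded degree) graphs.
adj = {0: [1], 1: [0, 2], 2: [1]}
print(adj[1][0])  # the first neighbor of vertex 1
```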
https://doi.org/10.1142/9789811204746_0013
As discussed in Chapter 12, sublinear time algorithms for graph problems have to be studied separately for every one of the standard graph representations. Chapter 12 presented a few algorithms for a specific graph representation that is appropriate for bounded degree graphs, which are a family of sparse graphs. In this chapter, we present a property testing algorithm for graphs represented by their adjacency matrix, which is a representation that is more appropriate for dense graphs.
https://doi.org/10.1142/9789811204746_0014
Boolean functions are functions which take (multiple) bits as their input and produce a (single) output bit based on these input bits. A few very simple examples of such functions are the standard logical gates AND, OR and NOT, but the relationship between the study of Boolean functions and computer science is much deeper than the superficial relationship represented by these logical gates. In particular, results about Boolean functions have found applications in diverse fields of computer science, such as coding theory and complexity…
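As a tiny concrete instance of this definition (illustrative only), here is the majority function on three bits together with its full truth table:

```python
from itertools import product

def majority(x1, x2, x3):
    """A Boolean function: three input bits, one output bit."""
    return int(x1 + x2 + x3 >= 2)

for bits in product([0, 1], repeat=3):
    print(bits, "->", majority(*bits))
```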
https://doi.org/10.1142/9789811204746_0015
Up to this point in this book, we have seen algorithms that allow a single computer to handle big data problems by either using a very small amount of memory (streaming algorithms) or reading only a very small part of the data (sublinear time and local computation algorithms). A natural alternative approach for handling big data problems is to use parallel algorithms, i.e., algorithms that use multiple computers (or CPUs). The study of parallel algorithms dates back to the late 1970s, but their importance has increased significantly over the last two decades because modern computer applications often require processing huge amounts of data in a relatively short timeframe, which is difficult to do with a single computer…
https://doi.org/10.1142/9789811204746_0016
Chapter 15 presented the Map-Reduce model and a few very simple algorithms for it. In this chapter, we will start to see more interesting Map-Reduce algorithms. All the algorithms that we will see in this chapter get either a list of words or a list of numbers as input.
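As a taste of what such computations look like, the following Python snippet simulates the classic word-count example in the map-shuffle-reduce style; this is a sketch of the programming model only, and the chapter's algorithms may differ.

```python
from collections import defaultdict

def map_phase(words):
    return [(w, 1) for w in words]   # map: emit a (key, value) pair per word

def shuffle(pairs):
    groups = defaultdict(list)       # shuffle: group values by key
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {k: sum(vs) for k, vs in groups.items()}  # reduce: sum per key

words = ["to", "be", "or", "not", "to", "be"]
print(reduce_phase(shuffle(map_phase(words))))  # {'to': 2, 'be': 2, ...}
```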
https://doi.org/10.1142/9789811204746_0017
Up to this point in the book, all the Map-Reduce algorithms we have seen operated on inputs with very little combinatorial structure (essentially, all the algorithms operated on either sets or ordered lists of elements). To demonstrate that the Map-Reduce framework can also handle inputs with a significant combinatorial structure, we present in this chapter Map-Reduce algorithms for two graph problems.
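To illustrate how a graph can be fed through this framework (a deliberately simple example of the style, not necessarily one of the chapter's two algorithms), the following sketch computes the degree of every vertex from an edge list:

```python
from collections import defaultdict

def map_phase(edges):
    pairs = []
    for u, v in edges:           # each edge contributes to both endpoints
        pairs.append((u, 1))
        pairs.append((v, 1))
    return pairs

def reduce_phase(pairs):
    degree = defaultdict(int)    # the reducer for key u sums u's counts
    for u, c in pairs:
        degree[u] += c
    return dict(degree)

edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
print(reduce_phase(map_phase(edges)))  # {0: 2, 1: 2, 2: 3, 3: 1}
```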
https://doi.org/10.1142/9789811204746_0018
An Internet search engine finds many results for a query. Often, some of these results are near-duplicates of other results. Given this situation, most search engines will try to eliminate near-duplicate search results to avoid producing a very repetitive output. To do so, the search engine must be able to detect pairs of results that are similar. Consider now an online shopping service. Such services often try to recommend items to users, and one of the easiest ways to do this is to recommend an item that was bought by one user to other users who share a similar taste. However, to use this strategy, the online shopping service must be able to detect pairs of users who bought similar items in the past, and thus, can be assumed to have a similar taste…
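A standard technique for detecting such similar pairs is MinHash: under a random hash function, two sets have the same minimum hash value with probability equal to their Jaccard similarity, so short signatures of minimum hash values yield similarity estimates. The Python sketch below illustrates the general technique; the book's treatment may differ in detail.

```python
import random

P = 2_147_483_647  # a fixed prime

def minhash_signature(items, num_hashes=100, seed=0):
    """One minimum hash value per hash function (same seed, same family)."""
    rng = random.Random(seed)
    coeffs = [(rng.randint(1, P - 1), rng.randint(0, P - 1))
              for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % P for x in items) for a, b in coeffs]

def estimated_similarity(sig1, sig2):
    # The fraction of agreeing positions estimates the Jaccard similarity.
    return sum(m1 == m2 for m1, m2 in zip(sig1, sig2)) / len(sig1)

a = {"apple", "banana", "cherry", "date"}
b = {"apple", "banana", "cherry", "fig"}
print(estimated_similarity(minhash_signature(a), minhash_signature(b)))
# roughly 0.6, the Jaccard similarity of the two sets (3 of 5 items shared)
```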
https://doi.org/10.1142/9789811204746_bmatter
"The book is useful to experienced researchers from different fields who have an interest in big data algorithms. To summarize, this book gives an excellent overview of big data algorithms from the perspective of theoretical computer science."