Revision-Controlled Data Structures

I'll start with an example, then explain it more precisely afterwards.

Example: linked lists

Suppose you have a list of elements $S = a_1 a_2 ... a_k$ and a sequence of sublists $s_1, s_2, ..., s_n$ of $S$, where each sublist consists of a subset of the elements in $S$, in the same order they appear in $S$. For example, let $S$ = ABCD, and let these be the sublists:

s_1: A
s_2: AB
s_3: AB D
s_4: B D
s_5: BCD
In this case, each sublist differs from the sublist before it by at most one element. We can represent these sublists as linked lists, with head 0 and tail $\epsilon$. And to these linked lists, we can apply pointer "revision control," producing something like this:

Image /Users/bparkis/v/webpages//flowRoot2290.png

In the above diagram, there is an arrow from an element $a_j$ to another element $a_q$ with a number i whenever next($a_j$) is changed to $a_q$ in the $i$th sublist. You can see that, for instance, to reconstruct s_3 you would start at 0 and trace through the diagram following the arrows with the largest numbers less than or equal to 3, following 0, then A, then B, then D, then $\epsilon$.

The basic idea here is that you have a sequence of data structures each consisting of nodes and pointers to nodes. Each data structure differs from the one before it in the sequence by the alteration of a small number of pointers. Since each data structure is only slightly different from the one before it, I'll call the sequence of data structures ``revisions'' of a data structure (for example, reflecting how multiple insertions and removals revise a binary tree over time). The objective is to access any node of any revision quickly, and also to store all the revisions using a small amount of space.

Associate with each possible outgoing pointer from a node a sequence of pointers $m_i$, each of which represents a revision of that pointer at a time $i$. For instance, in a singly linked list each node A has a ``next'' pointer, next(A). If next(A) = B at time 0, next(A) is changed to C at time 1.5, and next(A) is changed to D at time 2, then we would associate with next(A) the sequence $B_0, C_{1.5}, D_2$. The sequence associated with each possible pointer is maintained in increasing order of time. Now, given a time $t$, we can determine the value of the pointer at time $t$ by searching in the sequence for the $m_i$ such that $i$ is maximized but $i < t$. This represents the latest revision of the pointer at time $t$ and it can be found by a binary search in time $O(\log j)$, where $j$ is the number of revisions that have been performed on that pointer. In our previous example, if $t = 1.8$, the latest revision would be $C_{1.5}$, so we know that next(A) = C at time 1.8. Additionally, we can update the pointer to $E$ at a new time $r$ by inserting $E_r$ into the sequence, which can also be done in time $O(\log j)$ if we represent the sequence by a red-black tree.

Therefore any operation on the data structure at any time $t$ can be performed, simply by following the latest revision of each pointer at time $t$. Suppose that an operation X on the data structure accesses or updates $k$ pointers on the data structure, and that over the entire history of the data structure before this operation, the total number of times any pointer has been revised is $j$. Then if X would have been $\Theta(g)$ without revision control, it is $O(g + k \log j)$ with revision control. Since $g = \Omega(k)$ without revision control, this is not too bad a slowdown, with a worst case of $\Theta(g \log j)$.

The extra space required to store the $j$ revisions is only $\Theta(j)$, probably a large improvement over storing each revision of the data structure as a separate copy. Also, if revisions are stored as separate copies then every update operation on the data structure must create a completely new copy, probably slowing down the update dramatically.

Application

I started thinking about this in relation to a naive way of finding which rectangle of many on the screen contains a given point. Naively, you could sort all left and right vertical boundaries of rectangles by their x-coordinate. These boundaries divide up the screen into vertical strips, such that within a given strip, all vertical lines pass through the same rectangles. So within each vertical strip, you could sort those rectangles by their y-coordinate. Then to locate which rectangle a given point lies in, first do a binary search of the x-coordinate to locate the vertical strip it lies in, then do another binary search of the y-coordinate to find the rectangle. With $n$ rectangles this takes $O(\log n)$ time, but it requires $O(n^2)$ space (and preparation time). Instead, you could consider each vertical strip sorted by y-coordinate as a red-black tree. Each vertical strip differs from the one to its left by only the addition or removal of a small number of rectangles, so consider the strips as revisions of a single red-black tree (where ``time'' is the x-coordinate). Each insertion or removal of a rectangle involves the revision of $O(\log n)$ pointers, and there are $2n$ insertions and removals total, so $j = O(n \log n)$ and $\log j = O(\log n)$. So locating a point becomes $O(\log n \log j) = O((\log n)^2)$, a minor slowdown, and revision control dramatically reduces the space requirement to $O(n \log n)$.



© Bart Parkis
Last modified:Tue May 19 19:07:52 2009