Introduction
Implementing data structures in a functional setting poses two basic problems. First, from the point of view of designing and implementing efficient data structures, functional programming's stricture against destructive updates (i.e., assignments) is a staggering handicap, tantamount to confiscating a master chef's knives.
Imperative data structures often rely on assignments in crucial ways, and so different solutions must be found for functional programs. The second difficulty is that functional data structures are expected to be more flexible than their imperative counterparts. In particular, when we update an imperative data structure we typically accept that the old version will no longer be available, but when we update a functional data structure, we expect both the old and the new version to be available for further processing. A data structure that supports multiple versions is called persistent, while one that allows only a single version at a time is called ephemeral. And we are not surprised if the persistent version is more complicated, and even less efficient, than the ephemeral one.
The term data structure has at least four distinct, but related, meanings:
- An abstract data type, that is, a type and a collection of functions on that type; we can refer to this as an abstraction.
- A concrete realization of an abstract data type; we can refer to this as an implementation, but note that an implementation need not be actualized as code: a concrete design is sufficient.
- An instance of a data type, such as a particular list or tree. We can refer to such an instance generically as an object or a version. However, particular data types often have their own nomenclature, for example, we simply refer to stack or queue objects as stacks or queues.
- A unique identity that is invariant under updates. For example, in a stack-based interpreter, we often speak informally about “the stack” as if there were only one stack, rather than different versions at different times. We can refer to this identity as a persistent identity.
Persistence
A distinctive property of functional data structures is that they are always persistent: updating a functional data structure does not destroy the existing version, but rather creates a new version that coexists with the old one. Persistence is achieved by copying the affected nodes of a data structure and making all changes in the copy rather than in the original. Because nodes are never modified directly, all nodes that are unaffected by an update can be shared between the old and new versions of the data structure without worrying that a change in one version will inadvertently be visible to the other.
Lists
Linked lists are common in imperative programming and ubiquitous in functional programming. We can define their abstraction as follows:
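Following the text, the list abstraction can be sketched as a Stack signature; the particular names (isEmpty, cons, head, tail, the Empty exception) are conventional choices rather than anything the text fixes:

```ocaml
(* A minimal sketch of the list/stack abstraction. *)
module type Stack = sig
  type 'a t
  exception Empty
  val empty : 'a t
  val isEmpty : 'a t -> bool
  val cons : 'a -> 'a t -> 'a t
  val head : 'a t -> 'a     (* raises Empty if the stack is empty *)
  val tail : 'a t -> 'a t   (* raises Empty if the stack is empty *)
end
```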
It can be implemented trivially using either the built-in type of lists or a custom datatype:
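Two sketch implementations, one on top of the built-in list type and one as a custom datatype (module and constructor names are illustrative):

```ocaml
(* Implementation on the built-in list type. *)
module ListStack = struct
  type 'a t = 'a list
  exception Empty
  let empty = []
  let isEmpty s = s = []
  let cons x s = x :: s
  let head = function [] -> raise Empty | x :: _ -> x
  let tail = function [] -> raise Empty | _ :: s -> s
end

(* Implementation as a custom datatype. *)
module CustomStack = struct
  type 'a t = Nil | Cons of 'a * 'a t
  exception Empty
  let empty = Nil
  let isEmpty s = s = Nil
  let cons x s = Cons (x, s)
  let head = function Nil -> raise Empty | Cons (x, _) -> x
  let tail = function Nil -> raise Empty | Cons (_, s) -> s
end
```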
Another common function on lists is append (@ in OCaml), which we denote ++. It is easy to implement in O(n) time, whereas an imperative setting admits an O(1) implementation:
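A sketch of ++: it copies every node of the first list and shares the second list unchanged.

```ocaml
(* Append: O(n) in the length of the first list; ys is shared, not copied. *)
let rec ( ++ ) xs ys =
  match xs with
  | [] -> ys
  | x :: rest -> x :: (rest ++ ys)
```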
In a functional setting, we must copy the entire first list to preserve persistence, so that we are free to continue using the old lists as well as the new list. Although this is undeniably a lot of copying, notice that the second list ys is shared with the new list rather than copied. Another function that illustrates these twin concepts of copying and sharing is update, which changes the value of the node at a given index in the list:
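A sketch of update: the nodes on the path to index i are copied, everything after the updated node is shared. The Subscript exception name is an illustrative choice.

```ocaml
exception Subscript

(* Copy the first i nodes, replace node i, share the rest. *)
let rec update xs i y =
  match xs, i with
  | [], _ -> raise Subscript
  | _ :: rest, 0 -> y :: rest
  | x :: rest, _ -> x :: update rest (i - 1) y
```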
Notice that this version of update is not tail-recursive, so we can transform it into continuation-passing style (CPS):
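One possible CPS rendering: the continuation accumulates the copied prefix, and every recursive call is a tail call.

```ocaml
exception Subscript

(* Tail-recursive update via continuation-passing style. *)
let update xs i y =
  let rec go xs i k =
    match xs, i with
    | [], _ -> raise Subscript
    | _ :: rest, 0 -> k (y :: rest)
    | x :: rest, _ -> go rest (i - 1) (fun tl -> k (x :: tl))
  in
  go xs i (fun tl -> tl)
```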
so that tail-call optimization can be applied.
Binary search trees
Binary search trees provide a good example of sharing in data structures whose nodes have more than one pointer field. A binary search tree can implement sets or finite maps, whose minimal interfaces are:
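A minimal Set signature might look like this (the finite-map signature is given next):

```ocaml
(* A minimal Set abstraction over a fixed, totally ordered elem type. *)
module type Set = sig
  type elem
  type t
  val empty : t
  val insert : elem -> t -> t
  val member : elem -> t -> bool
end
```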
module type FiniteMap = sig
type 'a t
type key
exception NotFound
val empty : 'a t
val bind : key -> 'a -> 'a t -> 'a t
val lookup : key -> 'a t -> 'a (* raise NotFound if key is not found *)
end
where elem is some fixed type of totally ordered elements. A more realistic implementation would probably include many additional functions, such as deleting an element or enumerating all elements. An unbalanced set via a binary search tree can be implemented as:
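A sketch of the unbalanced set as a functor over Comparable, including the d + 1 comparison trick in member and the AlreadyInSet exception in insert that the discussion below refers to (the Comparable signature is repeated so the fragment stands alone):

```ocaml
module type Comparable = sig
  type t
  val compare : t -> t -> int
end

module UnbalancedSet (Elem : Comparable) = struct
  type elem = Elem.t
  type t = Empty | Node of t * elem * t

  exception AlreadyInSet

  let empty = Empty

  (* Track a candidate that may equal x; compare for equality only at
     the bottom, so a search makes at most d + 1 comparisons. *)
  let member x t =
    let rec go t candidate =
      match t with
      | Empty ->
        (match candidate with
         | None -> false
         | Some c -> Elem.compare x c = 0)
      | Node (l, y, r) ->
        if Elem.compare x y < 0 then go l candidate
        else go r (Some y)
    in
    go t None

  (* Raise AlreadyInSet to return the original tree unchanged when x is
     already present, avoiding needless copying of the search path. *)
  let insert x t =
    let rec go = function
      | Empty -> Node (Empty, x, Empty)
      | Node (l, y, r) ->
        let c = Elem.compare x y in
        if c < 0 then Node (go l, y, r)
        else if c > 0 then Node (l, y, go r)
        else raise AlreadyInSet
    in
    try go t with AlreadyInSet -> t
end
```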
where the Comparable signature is:
module type Comparable = sig
type t
val compare : t -> t -> int
end
Notice that in the member function we define an auxiliary function that keeps track of a candidate element that may be less than or equal to the query element, so that a search takes no more than d + 1 comparisons, where d is the depth of the tree, whereas a traditional implementation needs up to 2d comparisons in the worst case. And in the insert function, we introduce an AlreadyInSet exception so that inserting an element that is already present returns the original tree, avoiding the extra copying in that situation.
To sum up, we maintain persistence via copying and sharing.
Some Familiar Data Structures in A Functional Setting
Although many imperative data structures are difficult or impossible to adapt to a functional setting, some can be adapted quite easily.
Leftist heaps
Sometimes we need efficient access only to the minimum element; a data structure supporting this kind of access is called a priority queue or a heap:
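A minimal Heap signature might be sketched as follows (the exception and function names are conventional choices):

```ocaml
(* A minimal priority-queue (heap) abstraction. *)
module type Heap = sig
  type elem
  type t
  exception Empty
  val empty : t
  val isEmpty : t -> bool
  val insert : elem -> t -> t
  val merge : t -> t -> t
  val findMin : t -> elem    (* raises Empty if the heap is empty *)
  val deleteMin : t -> t     (* raises Empty if the heap is empty *)
end
```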
Heaps can be implemented as heap-ordered trees, in which the element at each node is no larger than the elements at its children. Under this ordering, the minimum element in a tree is always at the root. Leftist heaps are heap-ordered binary trees that satisfy the leftist property: the rank of any left child is at least as large as the rank of its right sibling. The rank of a node is defined to be the length of its right spine (the rightmost path from the node in question to an empty node). A simple consequence of the leftist property is that the right spine of any node is always the shortest path to an empty node.
We define the leftist heap as a functor over an ordered element type.
The key insight behind leftist heaps is that two heaps can be merged by merging their right spines as you would merge two sorted lists, and then swapping the children of nodes along this path as necessary to restore the leftist property.
where the makeTree function builds a new tree from two trees and a key, putting the higher-rank subtree on the left.
Because the length of each right spine is at most logarithmic, merge runs in O(log n) time. The remaining functions are trivial to define in terms of merge. To sum up, we have:
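A sketch of the complete leftist-heap functor: each node caches its rank, makeTree restores the leftist property, and everything else is defined in terms of merge. The Ordered signature name and the EmptyHeap exception are illustrative choices.

```ocaml
module type Ordered = sig
  type t
  val compare : t -> t -> int
end

module LeftistHeap (Elem : Ordered) = struct
  type elem = Elem.t
  (* Node (rank, element, left, right); rank = length of right spine. *)
  type t = Empty | Node of int * elem * t * t

  exception EmptyHeap

  let empty = Empty
  let isEmpty h = h = Empty

  let rank = function Empty -> 0 | Node (r, _, _, _) -> r

  (* Put the higher-rank subtree on the left. *)
  let makeTree x a b =
    if rank a >= rank b then Node (rank b + 1, x, a, b)
    else Node (rank a + 1, x, b, a)

  (* Merge along the right spines, like merging two sorted lists. *)
  let rec merge h1 h2 =
    match h1, h2 with
    | h, Empty | Empty, h -> h
    | Node (_, x, a1, b1), Node (_, y, a2, b2) ->
      if Elem.compare x y <= 0 then makeTree x a1 (merge b1 h2)
      else makeTree y a2 (merge h1 b2)

  let insert x h = merge (Node (1, x, Empty, Empty)) h
  let findMin = function Empty -> raise EmptyHeap | Node (_, x, _, _) -> x
  let deleteMin = function
    | Empty -> raise EmptyHeap
    | Node (_, _, a, b) -> merge a b
end
```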
Binomial heaps
Another common implementation of heaps is binomial queues, which we call binomial heaps to avoid confusion with FIFO queues. Binomial heaps are composed of more primitive objects known as binomial trees. Binomial trees are inductively defined as follows:
- A binomial tree of rank 0 is a singleton node.
- A binomial tree of rank k + 1 is formed by linking two binomial trees of rank k, making one tree the leftmost child of the other.
So it is easy to see that a binomial tree of rank r contains exactly $ 2^r $ nodes. Another, equivalent definition of binomial trees is: a binomial tree of rank r is a node with r children $ t_1, \cdots, t_r $, where each $ t_i $ is a binomial tree of rank $ r - i $. We can therefore represent a node in a binomial tree as an element and a list of children:
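One way to render this, caching the rank at each node as a convenience beyond the element-plus-children description in the text:

```ocaml
(* A binomial-tree node: cached rank, element, children in
   decreasing order of rank. *)
type 'a tree = Node of int * 'a * 'a tree list
```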
Each list of children is maintained in decreasing order of rank, and elements are stored in heap order. A binomial heap is then a collection of heap-ordered binomial trees in which no two trees have the same rank; this collection is represented as a list of trees in increasing order of rank.
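Assuming nodes cache their rank, the heap representation can be sketched as follows (the tree type is repeated so the fragment stands alone):

```ocaml
type 'a tree = Node of int * 'a * 'a tree list

(* A binomial heap: trees in increasing order of rank,
   no two trees of the same rank. *)
type 'a heap = 'a tree list
```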
Because each binomial tree contains $ 2^r $ elements and no two trees have the same rank, the trees in a binomial heap of size n correspond exactly to ones in the binary representation of n. For example, the binary representation of 21 is 10101 so a binomial heap of size 21 would contain one tree of rank 0, one of rank 2 and one of rank 4. Note that, just as the binary representation of n contains at most log(n+1) ones, a binomial heap of size n contains at most log(n+1) trees.
To insert a new element into a heap, we first create a new binomial tree of rank 0; we then step through the existing trees in increasing order of rank until we find a missing rank, linking trees of equal rank as we go:
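A sketch of insertion, assuming a representation that caches the rank at each node; the helper insTree steps through the list, and polymorphic comparison orders elements for brevity. All scaffolding is repeated so the fragment stands alone.

```ocaml
type 'a tree = Node of int * 'a * 'a tree list
type 'a heap = 'a tree list   (* increasing order of rank *)

let rank (Node (r, _, _)) = r

(* Link two trees of equal rank: the larger root becomes
   the leftmost child of the smaller. *)
let link (Node (r, x1, c1) as t1) (Node (_, x2, c2) as t2) =
  if x1 <= x2 then Node (r + 1, x1, t2 :: c1)
  else Node (r + 1, x2, t1 :: c2)

(* Step through the trees in increasing rank, linking equal ranks,
   until a missing rank is found. *)
let rec insTree t = function
  | [] -> [ t ]
  | t' :: ts' as ts ->
    if rank t < rank t' then t :: ts
    else insTree (link t t') ts'

let insert x h = insTree (Node (0, x, [])) h
```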
The worst case is insertion into a heap of size $ n = 2^k - 1 $, requiring a total of k links and $ O(k) = O(\log n) $ time. To merge two heaps, we step through both lists of trees in increasing order of rank, again linking trees of equal rank as we go:
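A sketch of merge: it works much like adding two binary numbers, with linking playing the role of the carry. The scaffolding is repeated so the fragment stands alone, and polymorphic comparison is used for brevity.

```ocaml
type 'a tree = Node of int * 'a * 'a tree list
type 'a heap = 'a tree list

let rank (Node (r, _, _)) = r

let link (Node (r, x1, c1) as t1) (Node (_, x2, c2) as t2) =
  if x1 <= x2 then Node (r + 1, x1, t2 :: c1)
  else Node (r + 1, x2, t1 :: c2)

let rec insTree t = function
  | [] -> [ t ]
  | t' :: ts' as ts ->
    if rank t < rank t' then t :: ts else insTree (link t t') ts'

(* Step through both lists in increasing rank; equal ranks are
   linked and carried into the rest of the merge. *)
let rec merge h1 h2 =
  match h1, h2 with
  | h, [] | [], h -> h
  | t1 :: ts1, t2 :: ts2 ->
    if rank t1 < rank t2 then t1 :: merge ts1 h2
    else if rank t2 < rank t1 then t2 :: merge h1 ts2
    else insTree (link t1 t2) (merge ts1 ts2)
```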
The deleteMin and findMin functions can be implemented simply by calling an auxiliary function removeMinTree, which finds the tree with the minimum root and removes it from the list, returning both the tree and the remaining list. Thus we have the complete implementation as follows:
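A complete self-contained sketch, using polymorphic comparison for brevity. Note that a removed root's children are stored in decreasing order of rank, so they must be reversed before being merged back.

```ocaml
type 'a tree = Node of int * 'a * 'a tree list
type 'a heap = 'a tree list

exception Empty

let rank (Node (r, _, _)) = r
let root (Node (_, x, _)) = x

let link (Node (r, x1, c1) as t1) (Node (_, x2, c2) as t2) =
  if x1 <= x2 then Node (r + 1, x1, t2 :: c1)
  else Node (r + 1, x2, t1 :: c2)

let rec insTree t = function
  | [] -> [ t ]
  | t' :: ts' as ts ->
    if rank t < rank t' then t :: ts else insTree (link t t') ts'

let insert x h = insTree (Node (0, x, [])) h

let rec merge h1 h2 =
  match h1, h2 with
  | h, [] | [], h -> h
  | t1 :: ts1, t2 :: ts2 ->
    if rank t1 < rank t2 then t1 :: merge ts1 h2
    else if rank t2 < rank t1 then t2 :: merge h1 ts2
    else insTree (link t1 t2) (merge ts1 ts2)

(* Find the tree with the minimum root and remove it from the list. *)
let rec removeMinTree = function
  | [] -> raise Empty
  | [ t ] -> (t, [])
  | t :: ts ->
    let t', ts' = removeMinTree ts in
    if root t <= root t' then (t, ts) else (t', t :: ts')

let findMin h = root (fst (removeMinTree h))

(* Reverse the children (decreasing rank) into a valid heap
   (increasing rank) before merging. *)
let deleteMin h =
  let Node (_, _, c), ts = removeMinTree h in
  merge (List.rev c) ts
```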
Each major operation requires O(log n) time in the worst case.
Red-black trees
As we know, a simple binary search tree performs very poorly on ordered data, for which any individual operation might take up to O(n) time. The solution to this problem is to keep each tree approximately balanced; such trees are known as balanced binary search trees, and red-black trees are one of the most popular families of balanced binary search trees. A red-black tree is a binary search tree in which every node is colored either red or black. Every red-black tree satisfies the following invariants:
- All empty nodes are black.
- No red node has a red child.
- Every path from the root to an empty node contains the same number of black nodes.
Taken together, these invariants guarantee that the longest possible path in a red-black tree, one with alternating black and red nodes, is no more than twice as long as the shortest possible path, one with black nodes only.
We can simply implement a red-black tree as:
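A sketch of the representation:

```ocaml
type color = Red | Black
type 'a tree = Empty | Node of color * 'a tree * 'a * 'a tree
```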
The member function is the same as for an ordinary binary search tree, because it ignores the colors of the nodes:
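A sketch using polymorphic comparison for brevity; the color field is matched with a wildcard and never inspected. The types are repeated so the fragment stands alone.

```ocaml
type color = Red | Black
type 'a tree = Empty | Node of color * 'a tree * 'a * 'a tree

(* Ordinary binary-search-tree lookup; colors are ignored. *)
let rec member x = function
  | Empty -> false
  | Node (_, l, y, r) ->
    if x < y then member x l
    else if x > y then member x r
    else true
```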
The insert function needs to maintain the two balance invariants.
First, when we create a new node in the Empty case, we initially color it red. Second, we force the final root to be black, regardless of its current color. Finally, we replace the calls to the Node constructor with calls to the balance function, which acts just like the Node constructor except that it massages its arguments as necessary to enforce the balance invariants.
Coloring the new node red maintains invariant 3 but violates invariant 2 whenever the parent of the new node is red. We allow a single red-red violation at a time and percolate it up the search path toward the root during rebalancing. The balance function detects and repairs each red-red violation when it processes the black parent of the red node with a red child. This black-red-red path can occur in any of four configurations, depending on whether each red node is a left or right child. However, the solution is the same in every case: rewrite the black-red-red path as a red node with two black children:
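A sketch of balance: the four black-red-red configurations are captured by or-patterns and all rewritten to the same red node with two black children. The types are repeated so the fragment stands alone.

```ocaml
type color = Red | Black
type 'a tree = Empty | Node of color * 'a tree * 'a * 'a tree

let balance = function
  | Black, Node (Red, Node (Red, a, x, b), y, c), z, d
  | Black, Node (Red, a, x, Node (Red, b, y, c)), z, d
  | Black, a, x, Node (Red, Node (Red, b, y, c), z, d)
  | Black, a, x, Node (Red, b, y, Node (Red, c, z, d)) ->
    Node (Red, Node (Black, a, x, b), y, Node (Black, c, z, d))
  | color, l, x, r -> Node (color, l, x, r)
```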
And this is much more elegant than any version written in an imperative setting. After balancing a given subtree, the red root of that subtree might now be the child of another red node, so we continue balancing all the way to the top of the tree. At the very top of the tree we might end up with a red node with a red child but no black parent; we handle this case by always recoloring the root black. Summing it all up:
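A self-contained sketch of the complete insert, using polymorphic comparison for brevity; a realistic version would be a functor over an ordered element type:

```ocaml
type color = Red | Black
type 'a tree = Empty | Node of color * 'a tree * 'a * 'a tree

(* Rewrite any black-red-red path as a red node with two black children. *)
let balance = function
  | Black, Node (Red, Node (Red, a, x, b), y, c), z, d
  | Black, Node (Red, a, x, Node (Red, b, y, c)), z, d
  | Black, a, x, Node (Red, Node (Red, b, y, c), z, d)
  | Black, a, x, Node (Red, b, y, Node (Red, c, z, d)) ->
    Node (Red, Node (Black, a, x, b), y, Node (Black, c, z, d))
  | color, l, x, r -> Node (color, l, x, r)

let insert x t =
  let rec ins = function
    | Empty -> Node (Red, Empty, x, Empty)   (* new nodes start out red *)
    | Node (color, l, y, r) as node ->
      if x < y then balance (color, ins l, y, r)
      else if x > y then balance (color, l, y, ins r)
      else node
  in
  (* the root is always recolored black *)
  match ins t with
  | Node (_, l, y, r) -> Node (Black, l, y, r)
  | Empty -> assert false   (* ins never returns Empty *)
```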
Even without optimization, this implementation of balanced binary search trees is one of the fastest around. With appropriate optimizations, such as eliminating redundant comparisons (as we did for member earlier) and eliminating redundant tests (while recursing into the left child there is no need to check for red-red violations involving the right child), it can be made faster still.