Saturday, January 30, 2016

Fast Object-Oriented Heap Implementation

There comes light to the darkness of your heaps.

This is the third post of a series of heap-related ones. See here and here for the back story.

In the last post, we found the heapq module lacking important features. Average Joe Dev doesn't want to clutter up his source code and and re-implement the same features all over the place to rectify the shortcomings of heapq. Understandably, Python core devs don't want to compromise on the performance of heapq either—being fast is the mission of a heap.

One issue with all proposals and implementations I've seen so far, is either their reduced feature set or their lack of performance. So, I went out to fix that once again and (I hope) for all.
What's different this time? Not much, I suppose, except that I made the following observation. Past implementations provided a single feature-enhanced and object-oriented implementation. So, you get object orientation but with an unnecessary slowdown. One solution would be to utilize heapq alongside with a such a feature-rich heap class. Then again using different interfaces for almost the same thing usually produces headaches in production. So, this approach isn't quite optimal.

If you need a feature, you will implement the corresponding logic (and slowness) anyway. From what can tell, cluttering your source does not make you and your program faster in the long run. Thus, hiding complexity in a feature-specific class make sense. Taking into account the performance penalty produced by more features, it makes further sense to split things up into different implementations.

So, I came to the conclusion that this problem cannot be solved by a single heap implementation but by a feature-complete heap suite: different classes with the same base interface for different use-cases.

xheap is such a heap suite. I further provided you with a repo at github to be used as an issue tracker. The remainder of this post will examine each class provided by xheap; a next post will investigate their performance.

Heap

The heap interface is pretty standard these days:
  • peek - get first item
  • push - insert an item
  • pop - remove first item
  • pushpop - push then pop; but faster
  • poppush/replace - pop then push; but faster
You somehow also expect it to work like an array/list (cf. heap invariant). All variants of Heap will support that interface. Some demo code:
from xheap import Heap

heap = Heap([5, 4, 3, 1])  # make heap from list
heap.push(2)
heap.pop()              # returns 1
heap.peek()             # returns 2
heap[0]                 # returns 2
heap.pushpop(6)         # returns 2
heap.replace(7)         # returns 3

Benefit? It's object oriented! And as fast as heapq since it's a thin wrapper.

You can replace it with a more feature-rich (and potentially slower) version if needed. And by the time that happens, the only thing you need to change is using another heap class that provides the wanted features.

OrderHeap

You guessed it: you can specify how items are compared. The key-parameter (analogous to builtin functions max or sorted) is specified during heap initialization:
from xheap import OrderHeap

heap = OrderHeap(key=lambda x: -ord(x))  # define the order
heap.push('a')
heap.push('z')
heap.push('t')
heap.pop()       # returns z

Benefit? Auto-wrapping new items into tuples by which the heap is sorted internally. You would have to re-implement it anyway in all places (without making a mistake btw.), so the heap class is the best location where such logic naturally belongs to.

Why not setting the key, each time you push an item? Right now, I couldn't imagine such a use-case. From what I can tell, the key is always derived from the item itself. So, a tuple might suffice for now; if you feel that's not quite true for your project, let me know so I can tweak OrderHeap to your needs.

Btw. pull requests are welcome as well. ;-)

RemovalHeap

Pretty obvious as well: you can remove an item from anywhere in the heap without manually keeping track of indexes.

Benefit? Keeping track of indexes, is not implemented easily—especially outside of a class. It further requires special tweaking to Heap.pop to enable removing an item from the middle of the heap.

There is an alternative approach using periodic sweeps but hey, the concrete implementation is encapsulated within the heap class. So, if I ever feel like changing it for an better approach, nobody will notice.

It's demo time:
from xheap import RemovalHeap

heap = RemovalHeap(['z', 'u', 'd', 'a'])
heap.remove('d')

XHeap

For the problem described here, I need both properties. Thus, XHeap is basically a conjunction of OrderHeap and RemovalHeap: you can remove items AND you can define arbitrary orderog.

What's left is investigating how fast or slow each class is compared to Python's original heapq module. To be continued …

Best,
Sven

1 comment: