Tuesday, January 26, 2016

Heaps in Python

This time, we dig into the matter of how we can use heaps in Python. As a starter, you want heaps when there's a job to do like the following:

queue=[]
while necessary:
    ...
    queue.append(an_item)   # or two
    ...
    ...
    next_item = min(heap)   # what's next
    ...
    queue.remove(next_item) # we're done; remove it
    ...



Heaps are the basic implementation for fast priority queues. As you can imagine the critical part here is fast. You can implement priority queues like shown above but then you get abysmal performance. You might even be tempted to use sorted if you need to work on more than the first item of the queue. Either way, the basic idea is the same: push tasks onto a queue, and remove them in order by their priority.

So, what do I mean by fast? Using min needs O(n) comparisons, using sorted for working on more than the first item needs O(n log n) and using list.remove also needs O(n) compare operations. With a heap, you get all of this for O(log n) - significantly fewer comparisons.

Moreover, a heap is the datastructure used for, you guessed it, heapsort. Basically pushing everything onto a heap, then popping off the items one by one (in the correct order) and string them up in a new list or doing it in place.


A heap of sawdust illustrates the structure of the eponymous datastructure.

Heaps are usually implemented using arrays with the following invariant:

heap[k] <= heap[2k+1] and heap[k] <= heap[2k+2]

Don't be afraid of the lofty word invariant. If you know, what a tree is, you are basically set. Heaps are trees where each node have at most two children and each child is bigger or equal than its parent. Given that definition, the invariant above makes pretty much sense and allows a space-efficient storage of the heap tree using an array (if you don't believe me, take a pen&paper and see for yourself).

Python's stdlib module, heapq, is a ludicrously fast implementation of a heap. It not only uses some important tricks to reduce the amount of compare operations but it also has a cache-friendly version of heapify. And before I forget it, it's implemented in C. This said, Python core devs really seem to care about the performance of heapq which is understandable given the mission-critical property of a heap to be fast.

So, how do we use heapq then? Using the example from above, you just need to do the following:

import heapq
queue=[]
while necessary:
    ...
    heapq.heappush(queue, an_item) # or two
    ...
    ...
    next_item = heap[0]            # what's next
    ...
    heapq.heappop(heap)            # we're done; remove it
    ...

That's it. Simple and efficient.

Stay tuned; there is more to say about heaps.

Best,
Sven

No comments:

Post a Comment