Turn on suggestions

Auto-suggest helps you quickly narrow down your search results by suggesting possible matches as you type.

Showing results for

- DevCentral
- Technical Articles
- Speeding Up Performance with Binary Array Search i...

Options

- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Email to a Friend
- Printer Friendly Page

Rodrigo_Albuque

MVP

- Subscribe to RSS Feed
- Mark as New
- Mark as Read
- Bookmark
- Subscribe
- Email to a Friend
- Printer Friendly Page

on 18-Mar-2020 10:03

Have you ever come across searching through timestamps or line numbers that are sorted?

What about searching for a number in a sorted list of millions of numbers?

Some might be tempted to use the **in** operator like this:

And it works, right?

However, for much larger numeric datasets, this might not be desirable because **in** operator searches sequentially through a list.

In this article, I'm going to show you a much more efficient way not only to find numbers in a sorted list efficiently, but also to insert numbers.

I'll start off by showing how to find an insertion point and then the difference between right and left insertion.

Lastly, I'll explain how to use bisect for searching only purposes and kick off a performance test comparing bisect with **in** operator by searching through a list of 900 million sorted numbers.

Bisect module was designed to insert numbers into a sorted list but not necessarily to find a number.

Let's find where we would insert the numbers 0.5 and 1.5 in A:

In above case, if we were to insert 0.5, bisect would insert it in the first position as it's the lowest number in the list.

Likewise, 1.5 would be inserted between numbers 1 and 2.

Make sense, right?

If number is already on the list, **bisect()** by default picks the insertion point to the right of existing number:

If we were to insert 56, it would insert 56 after the existing 56 on the list (__index 12__).

If we were to insert 1, it would insert 1 after the existing 1 (__index 1__).

However, bisect also gives us to the option to insert to the left if that's what we want:

The other interesting thing to note is that the regular **bisect() **function is an alias for **bisect_right()**:

Let's see how we can use it for searching purposes only.

We can return True/False if we find number or not:

The above function finds our insertion point using **Binary Array Search** technique and stores the value in **insert_point** variable.

As insertion point is 1 index after (to the right, remember?) highest number that is lower than what we're looking for, I subtracted **insertion_point **by 1.

Another function and maybe more useful one would be to return the index corresponding to the number we're looking for in case we want to retrieve the number at a later point:

It's similar to first function, but we return the index straightaway.

This is useful because retrieving a number with an array index is extremely fast (constant time operation).

When we use the **in** operator, Python starts searching at index 0, then 1, 2, 3 and so on.

When we use **bisect()**, Python starts by breaking up the list in 2 halves and comparing our number with the number in the middle of the list.

If number we're looking for is less than the middle element, we know it's in the left half, so we can exclude all numbers on the right half and keep going.

Similarly, if number we're looking for is greater than middle element, we know it's in the right half, so we can exclude all numbers on the left half and keep going.

What this means is that in a single operation, we can potentially exclude half of the list!

For those who like maths, while **in** operator performance is linear (0, 1, 2, etc), we can say that bisect()'s performance is logarithmic.

Here's our test script:

The function above creates a list of 900 million sorted numbers and then searches for a random number between 1 million and 900 million.

In our example below, the **in** operator will go through ~623 million steps before it can find number **623358246** but bisect will only go through **29** steps:

It took 58 seconds for **in** operator to find number **623358246 **but bisect was so fast that even 2 decimal numbers only were not enough to display the time it took below 1 second.

Labels:

Comments

devial

Nimbostratus