Have you ever come across searching through timestamps or line numbers that are sorted?
What about searching for a number in a sorted list of millions of numbers?
Some might be tempted to use the in operator like this:
And it works, right?
However, for much larger numeric datasets, this might not be desirable because in operator searches sequentially through a list.
In this article, I'm going to show you a much more efficient way not only to find numbers in a sorted list efficiently, but also to insert numbers.
I'll start off by showing how to find an insertion point and then the difference between right and left insertion.
Lastly, I'll explain how to use bisect for searching only purposes and kick off a performance test comparing bisect with in operator by searching through a list of 900 million sorted numbers.
Bisect module was designed to insert numbers into a sorted list but not necessarily to find a number.
Let's find where we would insert the numbers 0.5 and 1.5 in A:
In above case, if we were to insert 0.5, bisect would insert it in the first position as it's the lowest number in the list.
Likewise, 1.5 would be inserted between numbers 1 and 2.
Make sense, right?
If number is already on the list, bisect() by default picks the insertion point to the right of existing number:
If we were to insert 56, it would insert 56 after the existing 56 on the list (index 12).
If we were to insert 1, it would insert 1 after the existing 1 (index 1).
However, bisect also gives us to the option to insert to the left if that's what we want:
The other interesting thing to note is that the regular bisect() function is an alias for bisect_right():
Let's see how we can use it for searching purposes only.
We can return True/False if we find number or not:
The above function finds our insertion point using Binary Array Search technique and stores the value in insert_point variable.
As insertion point is 1 index after (to the right, remember?) highest number that is lower than what we're looking for, I subtracted insertion_point by 1.
Another function and maybe more useful one would be to return the index corresponding to the number we're looking for in case we want to retrieve the number at a later point:
It's similar to first function, but we return the index straightaway.
This is useful because retrieving a number with an array index is extremely fast (constant time operation).
When we use the in operator, Python starts searching at index 0, then 1, 2, 3 and so on.
When we use bisect(), Python starts by breaking up the list in 2 halves and comparing our number with the number in the middle of the list.
If number we're looking for is less than the middle element, we know it's in the left half, so we can exclude all numbers on the right half and keep going.
Similarly, if number we're looking for is greater than middle element, we know it's in the right half, so we can exclude all numbers on the left half and keep going.
What this means is that in a single operation, we can potentially exclude half of the list!
For those who like maths, while in operator performance is linear (0, 1, 2, etc), we can say that bisect()'s performance is logarithmic.
Here's our test script:
The function above creates a list of 900 million sorted numbers and then searches for a random number between 1 million and 900 million.
In our example below, the in operator will go through ~623 million steps before it can find number 623358246 but bisect will only go through 29 steps:
It took 58 seconds for in operator to find number 623358246 but bisect was so fast that even 2 decimal numbers only were not enough to display the time it took below 1 second.
Actually, 20x might be an underestimation on Cannon Lake, so you are correct that I am way off from the maximum on this processor. However, on skylake, I am not too far off. Official Site