Hands-on Python: Finding Nth most repeated words using sorted() and Lambda functions
This is a follow-along, learn-by-doing article and definitely not a production-ready solution.
However, it's meant to train the thought process that leads to a production-ready solution while giving you some hands-on exposure to Python.
In this article, I'm going to show the following:
- What my function is supposed to do (yes, this is the first step, not the how)
- Breaking down the words using Regex (Python's re module)
- Creating a dictionary that maps each word to its counter (i.e. how many times it repeats in the text)
- Returning the top N words of that dictionary, sorted by highest count
We should end up with a solution in just a few lines, but the key lesson here is using a lambda function as the key argument of the sorted() function.
How the Function is Supposed to Work
The function receives a string and the number of words we'd like to display, and returns a list of words sorted with the most repeated word first:
In the example above, Rodrigo is returned because it appears 3 times in the string.
The last argument (1 here) sets how many results to return, in case we want more than one.
For example, if we want the most repeated word (Rodrigo) but also the 2nd most repeated word (DevCentral), we can set it to 2:
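To make the expected behaviour concrete, here is a minimal sketch. The function name top_words and the sample text are assumptions made up for illustration, and the body is only a stand-in built on Counter.most_common(), not the approach we build later in this article:

```python
import re
from collections import Counter

# Hypothetical sample text: "Rodrigo" appears 3 times, "DevCentral" twice.
TEXT = "Rodrigo works at DevCentral. Rodrigo writes for DevCentral and Rodrigo codes."

def top_words(text, n):
    """Stand-in: return the n most repeated words, most repeated first."""
    words = re.findall(r"\w+", text)
    return [word for word, _ in Counter(words).most_common(n)]

print(top_words(TEXT, 1))  # ['Rodrigo']
print(top_words(TEXT, 2))  # ['Rodrigo', 'DevCentral']
```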
But how do we do it in Python in an efficient way?
Also, how do we code this in a more 'Pythonic' way?
So, here's what I'll do next:
- Quickly go through how to filter words/numbers (i.e. removing punctuation, spaces, etc.)
- Create a dictionary with each word as the key and the number of times it repeats as the value
- Return the top N most repeated words
Filtering only words or numbers using Regex
We first define and compile a regular expression pattern that matches only word characters (no spaces or punctuation):
\w is similar to [A-Za-z0-9_], i.e. it matches letters, digits and the underscore (for Python 3 str patterns it also matches Unicode letters and digits by default).
The plus sign means ONE or more repetitions, so each run of consecutive word characters is matched as a single word (do not confuse it with *, which means ZERO or more).
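As a rough sketch of the pattern described above (the variable names and the sample sentence are assumptions for illustration), we can compile \w+ and compare it against \w* to see why the plus sign matters:

```python
import re

# Compile once so the pattern can be reused; \w+ matches runs of word characters.
pattern = re.compile(r"\w+")

plus_matches = pattern.findall("top 10, please!")
star_matches = re.findall(r"\w*", "top 10, please!")

print(plus_matches)        # ['top', '10', 'please']
print("" in star_matches)  # True: * also produces empty matches
```

With *, the pattern is allowed to match nothing at all, so empty strings creep into the result. That's why we stick to +.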
We can confirm that it works by testing it with findall() function:
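A quick test could look like this, assuming a made-up sample sentence:

```python
import re

pattern = re.compile(r"\w+")

# Hypothetical sample sentence with punctuation and spaces.
sample = "Rodrigo works at DevCentral. Rodrigo codes!"

words = pattern.findall(sample)
print(words)  # ['Rodrigo', 'works', 'at', 'DevCentral', 'Rodrigo', 'codes']
```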
This is what we need.
The pattern above can be adjusted according to our needs but as this is mostly a learning example we'll stop here so it doesn't get too complex.
Let's now focus on counting the repeated words.
Creating a counter for words/numbers
We can create a dictionary to keep track of how many times each word appears:
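A minimal sketch of the manual approach, assuming the list of words has already been produced by the regex step (the list below is made up):

```python
# Hypothetical list of words, as findall() would return it.
words = ['Rodrigo', 'works', 'at', 'DevCentral', 'Rodrigo', 'codes']

counter = {}
for word in words:
    # Start at 0 the first time we see a word, then add 1.
    counter[word] = counter.get(word, 0) + 1

print(counter)
# {'Rodrigo': 2, 'works': 1, 'at': 1, 'DevCentral': 1, 'codes': 1}
```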
But why reinvent the wheel, right?
We can use collections.Counter instead:
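Here is a short sketch with the same made-up list of words. Counter is a dict subclass, so we can look up counts just like in a regular dictionary:

```python
from collections import Counter

# Hypothetical list of words, as findall() would return it.
words = ['Rodrigo', 'works', 'at', 'DevCentral', 'Rodrigo', 'codes']

counter = Counter(words)

print(counter['Rodrigo'])     # 2
print(counter['DevCentral'])  # 1
```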
Now that we have broken down the words into a list using Regex and added each word and corresponding counter to a dictionary, we're ready for the final solution.
How to return top Nth words
How do we find the top Nth words from highest to lowest counter?
The Not-so-pythonic Way
Sorting the words
The first thing we can do is reverse each dictionary item into a (count, word) tuple and collect the tuples in a list like this:
Then sort the list of tuples by the number of occurrences (from highest to lowest):
reverse=True means we're sorting from highest to lowest, as the default is lowest to highest.
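The two steps above can be sketched as follows. The word list and variable names are assumptions for illustration:

```python
from collections import Counter

# Hypothetical words: Rodrigo x3, DevCentral x2, Python x1.
words = ['Rodrigo', 'DevCentral', 'Rodrigo', 'Python', 'DevCentral', 'Rodrigo']
counter = Counter(words)

# Step 1: reverse each (word, count) item into a (count, word) tuple.
pairs = [(count, word) for word, count in counter.items()]

# Step 2: tuples compare by their first element, so this sorts by count.
sorted_pairs = sorted(pairs, reverse=True)

print(sorted_pairs)  # [(3, 'Rodrigo'), (2, 'DevCentral'), (1, 'Python')]
```

One side effect worth knowing: when two words have the same count, the tuples fall back to comparing the words themselves, so ties end up in reverse alphabetical order.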
Returning only first Nth words using Slicing and List comprehension
Let's imagine we want to list the 2 words with the highest count, and we set a variable n to hold that number.
We've got this:
But we want just the first 2 words with highest number of repetitions, so we can use slicing like this:
However, we're only interested in the words, not the counter.
In this case, we can use a list comprehension (a compact for loop) to return only the second element of each tuple in the list above:
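Putting slicing and the list comprehension side by side, a sketch with a made-up sorted list could look like this:

```python
# Hypothetical result of the sorting step: (count, word) tuples.
sorted_pairs = [(3, 'Rodrigo'), (2, 'DevCentral'), (1, 'Python')]
n = 2

# Slicing keeps only the first n tuples.
top_pairs = sorted_pairs[:n]
print(top_pairs)  # [(3, 'Rodrigo'), (2, 'DevCentral')]

# The list comprehension drops the count and keeps the word.
top_n_words = [word for _, word in top_pairs]
print(top_n_words)  # ['Rodrigo', 'DevCentral']
```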
Is there an easier or more efficient way?
The Pythonic Way
Is there a way to return Nth most repeated word without having to reverse the dictionary?
We could perhaps throw our counter dictionary straight into sorted function!
Before we jump to the solution we need to learn 2 things:
- How to use key argument from sorted() function
- How to use lambda function as the key to sorted() function
Using key argument from sorted function
The sorted() function lets us sort by pretty much any criterion using its key argument.
The key argument expects a function that takes one item and returns the value sorted() should compare, so the sort is based on that value.
Let's use a simple list of names:
By default, sorted() sorts it alphabetically from A-to-Z, or in reverse from Z-to-A if we set reverse=True:
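For instance, with a made-up list of names (an assumption for illustration):

```python
# Hypothetical list of names.
names = ['Rodrigo', 'Alice', 'Zack', 'Bob']

print(sorted(names))                # ['Alice', 'Bob', 'Rodrigo', 'Zack']
print(sorted(names, reverse=True))  # ['Zack', 'Rodrigo', 'Bob', 'Alice']
```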
Let's say we want to come up with a crazy sorting rule: sort alphabetically, but based on the last character of each word.
Can we do that? Not by default!
We can create a function that returns the last character [-1] of a string like this:
And then pass it on as the key to sorted() function:
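A minimal sketch, assuming a helper named last_char (a hypothetical name) and the same made-up list of names:

```python
def last_char(word):
    """Return the last character of a string."""
    return word[-1]

# Hypothetical list of names; last characters are o, e, k and b.
names = ['Rodrigo', 'Alice', 'Zack', 'Bob']

# We pass the function itself (no parentheses), and sorted() calls it per item.
by_last_char = sorted(names, key=last_char)
print(by_last_char)  # ['Bob', 'Alice', 'Zack', 'Rodrigo']
```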
Nice, isn't it?
But that's still a lot of work to create a separate function.
Is there an easier and more pythonic way to do this?
Using lambda function
Yes, there is.
If we just need to use a function as an argument to sorted() function, we can do it like this:
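With the same made-up list of names, the lambda version could look like this:

```python
# Hypothetical list of names; last characters are o, e, k and b.
names = ['Rodrigo', 'Alice', 'Zack', 'Bob']

# The lambda takes one word and returns its last character.
by_last_char = sorted(names, key=lambda word: word[-1])
print(by_last_char)  # ['Bob', 'Alice', 'Zack', 'Rodrigo']
```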
The lambda function above is equivalent to the function we created before, with the added benefit that it can be written directly in the key argument.
It is much more readable and reduces the number of lines in our code.
Let's now go back to our problem, as we also need a crazy rule to return a sorted list based on our dictionary values, i.e. the word counters.
Returning top Nth repeated words in one line
Let's check our dictionary again:
To create the key function, we need a function that returns the value we want sorted() to compare, so it can sort based on it.
In this case, we want sorted() to sort based on our dictionary values (the counters!).
Therefore, our function has to receive a dictionary key as an input and return its value:
When we apply it to sorted we should also remember to set reverse to True because we want highest to lowest count, remember?
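Here is a sketch of that idea. The dictionary contents and the function name get_count are assumptions for illustration. Iterating over a dictionary yields its keys, so sorted() hands each word to the key function:

```python
# Hypothetical counter dictionary: word -> number of repetitions.
counter = {'Rodrigo': 3, 'DevCentral': 2, 'Python': 1}

def get_count(word):
    """Receive a dictionary key (a word) and return its value (the count)."""
    return counter[word]

# Sort the keys by their counts, from highest to lowest.
ranked = sorted(counter, key=get_count, reverse=True)
print(ranked)  # ['Rodrigo', 'DevCentral', 'Python']
```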
However, we can use a lambda function instead like this:
We can now use n to do the slicing and keep only the top n words:
And this is our code in one line:
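The last three steps could be sketched like this, again with a made-up counter dictionary:

```python
# Hypothetical counter dictionary: word -> number of repetitions.
counter = {'Rodrigo': 3, 'DevCentral': 2, 'Python': 1}
n = 2

# Lambda as the key: for each word, sort by its count.
ranked = sorted(counter, key=lambda word: counter[word], reverse=True)
print(ranked)      # ['Rodrigo', 'DevCentral', 'Python']

# Slicing keeps the first n words.
print(ranked[:n])  # ['Rodrigo', 'DevCentral']

# Everything in one line.
top_n = sorted(counter, key=lambda word: counter[word], reverse=True)[:n]
print(top_n)       # ['Rodrigo', 'DevCentral']
```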
Putting it all together
This is what our Pythonic code could look like:
And here's the final output:
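Pulling the regex, the Counter and the sorted() call together, a sketch of the whole function might look like the following. The function name top_repeated_words and the sample text are assumptions for illustration, so treat this as one possible shape rather than the definitive version:

```python
import re
from collections import Counter

def top_repeated_words(text, n):
    """Return the n most repeated words in text, most repeated first."""
    words = re.findall(r"\w+", text)   # keep only runs of word characters
    counter = Counter(words)           # word -> number of repetitions
    return sorted(counter, key=lambda word: counter[word], reverse=True)[:n]

# Hypothetical sample text: "Rodrigo" appears 3 times, "DevCentral" twice.
text = "Rodrigo works at DevCentral. Rodrigo writes for DevCentral and Rodrigo codes."

print(top_repeated_words(text, 1))  # ['Rodrigo']
print(top_repeated_words(text, 2))  # ['Rodrigo', 'DevCentral']
```

Because sorted() is stable, words with the same count keep the order in which they first appeared in the text.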