
Quick Intro

This is a follow-along, learn-by-doing article and definitely not a production-ready solution.

However, it's meant to train your thought process toward a production-ready solution while giving you some hands-on exposure to Python.

In this article, I'm going to show the following:

  • What my function is supposed to do (yes, this is the first step, not the how)
  • Breaking the text down into words using Regex (Python's re module)
  • Creating a dictionary with each word and its corresponding counter (i.e. how many times it repeats in the string)
  • Returning the top N words of that dictionary, sorted by highest count

We should end up with a solution of just a few lines, but the key lesson here is how to use a lambda function as the key argument of the sorted() function.

How the Function is Supposed to Work

The function receives a string and the number of words we'd like to display, and returns a list of words sorted with the most repeated word first:

0151T000003liJkQAI.png

In the above example, Rodrigo is returned because it appears 3 times in the string.

The last argument, 1, is the number of words to return; we can increase it if we want more than one result.

For example, if we want the most repeated word (Rodrigo) but also the 2nd most repeated word (DevCentral), we can set it to 2:

0151T000003liJpQAI.png

But how do we do it in Python in an efficient way?

Also, how do we code this in a more 'Pythonic' way?

So, here's what I'll do next:

  • Quickly go through how to filter words/numbers (i.e. removing punctuation, spaces, etc.)
  • Create a dictionary with each word as the key and the number of times it repeats as the value
  • Return the top N most repeated words

Filtering only words or numbers using Regex

We first define and compile a regular expression pattern that matches only alphanumeric characters (no space or punctuation):

0151T000003liJuQAI.png

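\w is similar to [A-Za-z0-9_], i.e. it matches letters, digits and the underscore, but not spaces or punctuation. In Python 3, when the pattern is a str, it also matches Unicode letters and digits by default.

The plus sign means ONE or more repetitions, so a run of consecutive word characters is matched as a single word (do not confuse it with *, which means ZERO or more repetitions).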

We can confirm that it works by testing it with the findall() function:

0151T000003liJzQAI.png
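
As a rough sketch of this step, the snippet below compiles the pattern and runs findall() on a sentence. The sample sentence and the names pattern, text and words are illustrative assumptions, not taken from the original example.

```python
import re

# Compile a pattern that matches runs of one or more word characters.
pattern = re.compile(r"\w+")

# Hypothetical sample text, made up for illustration only.
text = "Rodrigo likes DevCentral. Rodrigo writes for DevCentral, says Rodrigo!"

# findall() returns every non-overlapping match as a list of strings.
words = pattern.findall(text)
print(words)
# ['Rodrigo', 'likes', 'DevCentral', 'Rodrigo', 'writes', 'for', 'DevCentral', 'says', 'Rodrigo']
```

Note that the spaces, the full stop, the comma and the exclamation mark are all gone, because none of them match \w.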

This is what we need.

The pattern above can be adjusted according to our needs but as this is mostly a learning example we'll stop here so it doesn't get too complex.

Let's now focus on counting the repeated words.

Creating a counter for words/numbers

We can create a dictionary to keep track of how many times each word appears:

0151T000003liJqQAI.png
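
One possible way to build such a dictionary by hand is sketched below. The word list is a made-up sample, and using dict.get() with a default of 0 is just one of several ways to write this loop.

```python
# Hypothetical list of words, e.g. the output of a regex findall().
words = ["Rodrigo", "likes", "DevCentral", "Rodrigo", "writes",
         "for", "DevCentral", "says", "Rodrigo"]

counter = {}
for word in words:
    # get() returns 0 the first time a word is seen, so we can always add 1.
    counter[word] = counter.get(word, 0) + 1

print(counter)
```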

But why reinvent the wheel, right?

We can use collections.Counter instead:

0151T000003liJrQAI.png
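
A minimal sketch of the same count using collections.Counter, again on a made-up word list. Counter is a dict subclass, so it can be used wherever the hand-built dictionary was used.

```python
from collections import Counter

# Hypothetical list of words, e.g. the output of a regex findall().
words = ["Rodrigo", "likes", "DevCentral", "Rodrigo", "writes",
         "for", "DevCentral", "says", "Rodrigo"]

# Counter does the counting loop for us.
counter = Counter(words)

print(counter["Rodrigo"])     # 3
print(counter["DevCentral"])  # 2
```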

Now that we have broken down the words into a list using Regex and added each word and corresponding counter to a dictionary, we're ready for the final solution.

How to return the top N words

How do we find the top N words, from highest to lowest count?

The Not-so-pythonic Way

Sorting the words

The first thing we can do is swap each dictionary item into a (count, word) tuple and collect the tuples in a list, like this:

0151T000003liK4QAI.png

Then sort the list of tuples by the number of occurrences (from highest to lowest):

0151T000003liK9QAI.png
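
The two steps above might look something like the sketch below. The sample counts and the names pairs and pairs_sorted are assumptions for illustration.

```python
from collections import Counter

# Hypothetical counter, built from a short made-up word list.
counter = Counter(["Rodrigo", "likes", "DevCentral",
                   "Rodrigo", "DevCentral", "Rodrigo"])

# Swap each (word, count) item into a (count, word) tuple.
pairs = [(count, word) for word, count in counter.items()]

# Tuples compare element by element, so the count is compared first.
pairs_sorted = sorted(pairs, reverse=True)
print(pairs_sorted)
# [(3, 'Rodrigo'), (2, 'DevCentral'), (1, 'likes')]
```

One side effect of this approach: when two words have the same count, the tuple comparison falls through to the word itself, so ties come out in reverse alphabetical order.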

reverse=True means we're sorting from highest to lowest, as the default is lowest to highest.

Returning only the first N words using slicing and a list comprehension

Let's imagine we want to list the 2 words with the highest count, and set the variable n to that number.

We've got this:

0151T000003liKAQAY.png

But we want just the first 2 words with highest number of repetitions, so we can use slicing like this:

0151T000003liKEQAY.png

However, we're only interested in the words, not the counter.

In this case, we can loop over the above list (here with a list comprehension) and keep only the second element of each tuple, i.e. the word:

0151T000003liJlQAI.png
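
A short sketch of the slicing and the list comprehension together, assuming a list of (count, word) tuples that is already sorted from highest to lowest count. The data is made up for illustration.

```python
# Hypothetical list of (count, word) tuples, already sorted by count.
pairs_sorted = [(3, "Rodrigo"), (2, "DevCentral"), (1, "likes")]

n = 2

# Slicing keeps only the first n tuples.
top_pairs = pairs_sorted[:n]
print(top_pairs)  # [(3, 'Rodrigo'), (2, 'DevCentral')]

# The list comprehension unpacks each tuple and keeps only the word.
top_words = [word for count, word in top_pairs]
print(top_words)  # ['Rodrigo', 'DevCentral']
```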

Is there an easier or more efficient way?

The Pythonic Way

Is there a way to return the N most repeated words without having to reverse the dictionary?

We could perhaps throw our counter dictionary straight into the sorted() function!

Before we jump to the solution we need to learn 2 things:

  • How to use the key argument of the sorted() function
  • How to use a lambda function as the key to the sorted() function

Using key argument from sorted function

The sorted() function lets us sort by pretty much any rule through its key argument.

The key argument takes a function: sorted() calls it once on each element and orders the elements by the values it returns.

Let's use a simple list of names:

0151T000003liJmQAI.png

By default, sorted() sorts it alphabetically from A-to-Z, or in reverse from Z-to-A if we set reverse=True:

0151T000003liKJQAY.png
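
For instance, with a made-up list of names (the names themselves are an assumption, not the ones from the original example):

```python
# Hypothetical list of names for illustration.
names = ["Rodrigo", "Alice", "Zed", "Maria"]

# sorted() returns a new list and leaves the original untouched.
ascending = sorted(names)
descending = sorted(names, reverse=True)

print(ascending)   # ['Alice', 'Maria', 'Rodrigo', 'Zed']
print(descending)  # ['Zed', 'Rodrigo', 'Maria', 'Alice']
```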

Let's say we want to come up with a crazy sorting rule: sort alphabetically, but based on the last character of each word.

Can we do that? Yes, but not with the default behaviour!

We can create a function that returns the last character [-1] of a string like this:

0151T000003liKKQAY.png

And then pass it on as the key to sorted() function:

0151T000003liJsQAI.png
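
A sketch of what this could look like, using a hypothetical helper called last_char and the same made-up list of names:

```python
def last_char(s):
    """Return the last character of a string."""
    return s[-1]

# Hypothetical list of names for illustration.
names = ["Rodrigo", "Alice", "Zed", "Maria"]

# sorted() calls last_char on each name and sorts by the returned letters:
# Maria -> 'a', Zed -> 'd', Alice -> 'e', Rodrigo -> 'o'
by_last = sorted(names, key=last_char)
print(by_last)  # ['Maria', 'Zed', 'Alice', 'Rodrigo']
```

Notice that we pass the function itself (last_char), not the result of calling it (last_char()).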

Nice, isn't it?

But that's still a lot of work to create a separate function.

Is there an easier and more pythonic way to do this?

Using lambda function

Yes, there is.

If we just need to use a function as an argument to sorted() function, we can do it like this:

0151T000003liKOQAY.png
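
With the same made-up list of names, the lambda version could be sketched like this:

```python
# Hypothetical list of names for illustration.
names = ["Rodrigo", "Alice", "Zed", "Maria"]

# The lambda takes one name and returns its last character,
# exactly like a small named function would.
by_last = sorted(names, key=lambda name: name[-1])
print(by_last)  # ['Maria', 'Zed', 'Alice', 'Rodrigo']
```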

The above lambda function is equivalent to the function we created before, with the added benefit that it can be written directly in the key argument.

It is much more readable and reduces the number of lines in our code.

Let's now go back to our problem, as we also need a crazy rule there: returning a list sorted by our dictionary values, i.e. the word counters.

Returning top Nth repeated words in one line

Let's check our dictionary again:

0151T000003liKLQAY.png

To build the key function, we need a function that returns the value we want sorted() to sort by.

In this case, we want sorted() to sort based on our dictionary values (the counters!).

Iterating over a dictionary yields its keys, so our function has to receive a dictionary key (a word) as input and return its value (the count):

0151T000003liKBQAY.png

When we apply it to sorted we should also remember to set reverse to True because we want highest to lowest count, remember?

However, we can use a lambda function instead like this:

0151T000003liKYQAY.png
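
Both variants are sketched below on a small made-up dictionary. The helper name count_of is hypothetical.

```python
# Hypothetical word counts for illustration.
counter = {"Rodrigo": 3, "likes": 1, "DevCentral": 2}

def count_of(word):
    """Return how many times the given word was counted."""
    return counter[word]

# Passing the dictionary to sorted() iterates over its keys (the words).
ranked_named = sorted(counter, key=count_of, reverse=True)

# Same thing, with the key function written inline as a lambda.
ranked_lambda = sorted(counter, key=lambda word: counter[word], reverse=True)

print(ranked_named)   # ['Rodrigo', 'DevCentral', 'likes']
print(ranked_lambda)  # ['Rodrigo', 'DevCentral', 'likes']
```

Because sorted() is stable, words with the same count keep the order in which they were first added to the dictionary.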

We can now use n to do the slicing and keep only the top N words:

0151T000003liKdQAI.png

And this is our code in one line:

0151T000003liKnQAI.png
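
Put together, the sort and the slice might be written in one line like this (sample counts and the name top are assumptions):

```python
# Hypothetical word counts for illustration.
counter = {"Rodrigo": 3, "likes": 1, "DevCentral": 2}
n = 2

# Sort the words by count, highest first, then keep the first n.
top = sorted(counter, key=lambda word: counter[word], reverse=True)[:n]
print(top)  # ['Rodrigo', 'DevCentral']
```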

Putting it all together

This is what our Pythonic code could look like:

0151T000003liKGQAY.png

And here's the final output:

0151T000003liKHQAY.png
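
For reference, here is one way the whole thing could be written as text. This is a sketch rather than the exact code from the original example: the function name top_words, its parameter names and the sample sentence are all assumptions.

```python
import re
from collections import Counter

def top_words(text, n):
    """Return the n most repeated words in text, most repeated first."""
    # Break the text into words and count how often each one appears.
    counter = Counter(re.findall(r"\w+", text))
    # Sort the words by their count (highest first) and keep the first n.
    return sorted(counter, key=lambda word: counter[word], reverse=True)[:n]

# Hypothetical sample text, made up for illustration only.
text = "Rodrigo likes DevCentral. Rodrigo writes for DevCentral, says Rodrigo!"

print(top_words(text, 1))  # ['Rodrigo']
print(top_words(text, 2))  # ['Rodrigo', 'DevCentral']
```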

Version history
Last update: 06-Mar-2020 07:15