Creating Word Clouds in Python
November 24, 2020
Welcome to an exciting article on word cloud generation. Word clouds are a great way to summarize vast amounts of information visually. They are typically used to depict metadata on websites: the bigger a keyword's font size, the higher its significance on the website. In this article, we will write a program to generate custom word clouds.
We’ll install the packages required for this tutorial in a virtual environment, which we’ll create using
conda. For more installation information, refer to the Anaconda Package Manager website.
After installing the Anaconda package manager using the instructions on Anaconda’s website, create a new virtual environment by typing the following command in the terminal:
conda create -n wordcloud python=3.6
This will create a virtual environment with Python 3.6.
We will be installing the following packages: matplotlib, nltk, and wordcloud.
Activate the virtual environment using the command
conda activate wordcloud. After activating the virtual environment, we’ll install these packages locally inside it. To use these packages, we must always activate the virtual environment named
wordcloud before proceeding. You may also choose a different name for the virtual environment; just replace
wordcloud with the name of your choice.
To install the packages, we will use the following commands:
pip3 install matplotlib
pip3 install nltk
pip3 install wordcloud
Note: If you get an error during installation, install version 1.19.3 of numpy with the command
pip3 install numpy==1.19.3. For more information on the error, refer to this discussion.
NLTK Installation: Depending on the installer used, you may or may not need to run the following commands. If you get an error related to the
punkt package while running the code, run the following commands:
>>> import nltk
>>> nltk.download('popular')
A Graphical User Interface pops up. If you are unsure of what to download, refer to this question on Stack Overflow.
Once installed, check that the packages are installed correctly. Run the following piece of code in a Python shell (started by running
python3 in your terminal), and you should get a valid version number as output.
>>> import nltk
>>> print(nltk.__version__)
>>> import matplotlib
>>> print(matplotlib.__version__)
>>> import wordcloud
>>> print(wordcloud.__version__)
If you get valid output, you have successfully installed the packages and can proceed with the rest of the article. We should get the following output:
>>> import matplotlib
>>> matplotlib.__version__
'3.3.3'
>>> import nltk
>>> nltk.__version__
'3.5'
>>> import wordcloud
>>> wordcloud.__version__
'1.8.1'
Word Cloud Generation
Let’s now look at the code to generate word clouds. The input to the program will be a paragraph copied from any website of your choice. With the paragraph as input, we’ll pre-process it and send it to the
wordcloud library to generate the word cloud.
As mentioned above, we use the following libraries:
- matplotlib: A visualization and plotting tool used extensively in Python.
- nltk.corpus.stopwords: The Natural Language Toolkit, known as
nltk, is a library built for performing various Natural Language Processing (NLP) tasks. It’s a vast library with tools for pre-processing, data cleaning, data visualization, data modeling, etc. We’ll use its list of stopwords for English. Stopwords are redundant words that don’t add significant meaning to the data.
- nltk.tokenize.word_tokenize: Tokenization is the process of breaking text down into smaller units called tokens. Tokens can be words, sub-words, or phrases. We’ll use the tokenizer available in
nltk.
- wordcloud: It’s a library that takes in the list of words and outputs a word cloud image. Developed by Andreas Mueller, it’s quite extensible and flexible with respect to the features.
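To see what stopword filtering does before we put it all together, here is a minimal sketch. It uses a small hand-picked stopword set and plain str.split instead of nltk's full English list and word_tokenize, so it runs without downloading any nltk data; the sentence is an arbitrary example:

```python
# a few hand-picked stopwords for illustration; nltk's English list is much larger
stop_words = {'the', 'is', 'a', 'of', 'and', 'to'}

sentence = 'The field of machine learning is a part of AI'
# lowercase and split on whitespace (a stand-in for nltk's word_tokenize)
tokens = [word.lower() for word in sentence.split()]
# keep only the words that are not stopwords
filtered = [word for word in tokens if word not in stop_words]
print(filtered)  # ['field', 'machine', 'learning', 'part', 'ai']
```

The surviving words are the ones that carry meaning, which is exactly what we want to feed into the word cloud.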
We define a class called
WordCloudGeneration with the following methods:
preprocessing: We pass the input
data through the tokenizer. The
data is converted to lower case and tokenized. Tokenization results in a list of words, which is then filtered: a word is copied to
preprocessed_data only if it is not a stopword.
create_word_cloud: This function takes in the processed list of words and calls the
WordCloud class. The
generate method of the
WordCloud class returns an image of the word cloud. Using the library
matplotlib, we plot the image.
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud

class WordCloudGeneration:
    def preprocessing(self, data):
        # convert all words to lowercase
        data = [item.lower() for item in data]
        # load the stop words for English
        stop_words = set(stopwords.words('english'))
        # concatenate all the data with spaces
        paragraph = ' '.join(data)
        # tokenize the paragraph using the inbuilt tokenizer
        word_tokens = word_tokenize(paragraph)
        # filter out words present in the stopwords list
        preprocessed_data = ' '.join([word for word in word_tokens if word not in stop_words])
        print("\n Preprocessed Data: ", preprocessed_data)
        return preprocessed_data

    def create_word_cloud(self, final_data):
        # initiate a WordCloud object with parameters width, height, maximum font size and background color
        # call the generate method of the WordCloud class to generate an image
        wordcloud = WordCloud(width=1600, height=800, max_font_size=200, background_color="black").generate(final_data)
        # plot the image generated by the WordCloud class
        plt.figure(figsize=(12, 10))
        plt.imshow(wordcloud)
        plt.axis("off")
        plt.show()

wordcloud_generator = WordCloudGeneration()
# you may uncomment the following line to use custom input
# input_text = input("Enter the text here: ")
input_text = 'These datasets are used for machine-learning research and have been cited in peer-reviewed academic journals. Datasets are an integral part of the field of machine learning. Major advances in this field can result from advances in learning algorithms (such as deep learning), computer hardware, and, less-intuitively, the availability of high-quality training datasets. High-quality labeled training datasets for supervised and semi-supervised machine learning algorithms are usually difficult and expensive to produce because of the large amount of time needed to label the data. Although they do not need to be labeled, high-quality datasets for unsupervised learning can also be difficult and costly to produce.'
input_text = input_text.split('.')
clean_data = wordcloud_generator.preprocessing(input_text)
wordcloud_generator.create_word_cloud(clean_data)
The output of the code pops in a separate window. It should look like the image shown below.
This was a fun experiment we coded in Python. We went over how to create a virtual environment using Anaconda and how to install the packages needed to generate word clouds. I encourage you to test the program with various inputs and experiment with the code. Happy learning.
Peer Review Contributions by: Adrian Murage
About the author: Lalithnarayan C
Lalithnarayan C is an ambitious and creative engineer pursuing his Masters in Artificial Intelligence at Defense Institute of Advanced Technology, DRDO, Pune. He is passionate about building tech products that inspire and make space for human creativity to flourish. He is on a quest to understand the infinite intelligence through technology, philosophy, and meditation.