Introduction to Data Analysis Using Pandas
October 10, 2020
Data Science and Data Analytics are some of the hottest topics in the Computer Science industry. The ability to analyze and make predictions based on data is nothing short of extraordinary.
Python is one of the most popular languages in the data science community. This is due to its ease of use and rich collection of libraries built to work with data. Pandas is a library that makes handling data easy and efficient. In this tutorial, we are going to understand how Pandas can be used to explore and draw insights from data.
What is Pandas?
According to the official documentation, Pandas is a fast, powerful, flexible, and easy to use open-source data analysis and manipulation tool. It is built on top of the Python programming language. Pandas is usually used in conjunction with Jupyter notebooks, making it more powerful and efficient for exploratory data analysis.
If you are new to Jupyter notebooks, this article walks you through the installation and basics of Jupyter notebooks.
Pandas provides a robust collection of functions that make it easy to read and process data. In this tutorial, we are going to explore some useful functions and techniques that are an integral part of a data scientist’s toolset. You can install Pandas by using Python’s package manager, pip.
Enter the following command on the terminal:
pip3 install pandas
Alternatively, if you want to install Pandas using a different method, this tutorial walks you through the various ways in which you can install Pandas.
Analyzing data using Pandas
Now that we have Pandas installed on our system, we can delve into data exploration and analysis. For this, I will be using the “wine dataset”. Navigate to this link to download the dataset from Kaggle.
The “wine” dataset is a beginner-friendly dataset that provides information on various factors that affect the quality of the wine. It has 12 columns describing different factors such as pH, the acidity of the wine, etc. I will be using Jupyter notebooks to execute Python code in this tutorial. However, you can execute the code in a different text editor or IDE of your choice. Jupyter notebooks make it easier to view and explore the data.
Create a Jupyter notebook by running the following command on the terminal:

jupyter notebook

This command opens a browser window and displays the Jupyter notebook UI.
# Import necessary libraries
import pandas as pd

# Read the data using Pandas
data = pd.read_csv("winequality-red.csv")
The first line imports the Pandas library and gives it an alias, pd. Therefore, every time we use pd, we are referring to Pandas. The read_csv function reads a CSV (comma-separated values) file and stores its contents in a variable called data.
Pandas stores the read data in a data structure called a Data Frame. According to the official documentation, a data frame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure that also contains labeled axes (rows and columns).
In simple terms, a data frame is like a table that contains named columns and rows of data similar to a table in a database. A data frame is powerful and has a lot of built-in functions that allow us to manipulate data. We are going to look at some of these functions. In the example above, the wine dataset is read and stored in a data frame called data.
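To make the idea concrete, here is a minimal sketch of building a data frame by hand from a Python dictionary. The column names mirror the wine dataset, but the values are made up for illustration:

```python
import pandas as pd

# A data frame is a table with labeled columns and rows.
# Each dictionary key becomes a column name; each list becomes that column's values.
df = pd.DataFrame({
    "pH": [3.51, 3.20, 3.26],
    "quality": [5, 5, 6],
})

print(df.shape)             # (3, 2) -> 3 rows, 2 columns
print(df["quality"].max())  # 6
```

Reading a CSV with read_csv produces the same kind of object, just with the rows and columns taken from the file instead of a dictionary.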
data.head()
data.columns
data.info()
The head() function prints the first 5 rows of the dataset by default. If a number ‘n’ is specified as an argument, it prints the first ‘n’ rows instead.
data.columns prints a list containing all the column names in the data.
The info() function provides useful information about the data, such as the number of rows and columns, the name and data type of each column, and memory usage.
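Since the wine CSV may not be at hand while reading, here is a small sketch of these three inspection calls on a toy data frame with made-up values:

```python
import pandas as pd

# Toy frame standing in for the wine dataset (values are illustrative only)
df = pd.DataFrame({
    "pH": [3.51, 3.20, 3.26, 3.16, 3.51, 3.30, 3.39],
    "quality": [5, 5, 6, 6, 5, 5, 7],
})

print(df.head())         # first 5 rows by default
print(df.head(2))        # first 2 rows when n=2 is passed
print(list(df.columns))  # all column names
df.info()                # row count, column names, dtypes, memory usage
```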
# Finding the min and max quality of the wine
print("Wine with maximum quality:", data.quality.max())
print("Wine with minimum quality:", data.quality.min())
data.quality.head(10)
data.quality.tail(5)
In a data frame, we can access individual columns by using the dot (.) operator.
For instance, in the example above, we access the ‘quality’ column in the data and print its minimum and maximum values. Similarly, we can access the ‘pH’ column by typing data.pH. Bracket notation, as in data['pH'], is another way to access individual columns.
data.describe()
data['pH'] = data['pH'].values.astype(int)
data.head()
The astype() function converts the data from its original type to the one specified in the argument. In the example above, we convert the ‘pH’ column, which holds float values, to integers by specifying int as the argument.
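The conversion can be sketched on a toy column (values made up); note that casting floats to int truncates toward zero rather than rounding:

```python
import pandas as pd

df = pd.DataFrame({"pH": [3.51, 3.20, 3.26]})

# astype(int) truncates the fractional part: 3.51 -> 3, 3.26 -> 3
df["pH"] = df["pH"].astype(int)
print(df["pH"].tolist())  # [3, 3, 3]
```

Calling astype directly on the column (without going through .values) produces the same result and keeps the data as a Pandas Series.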
data['good_wine'] = data['quality'] > 5
data['bad_wine'] = data['quality'] <= 5
data.head()

data = data.sort_values('alcohol', ascending=False)
data.head(10)
In the example above, we created two new columns, ‘good_wine’ and ‘bad_wine’. The ‘good_wine’ column holds True wherever the ‘quality’ of the wine is greater than 5, and ‘bad_wine’ holds True wherever the ‘quality’ is less than or equal to 5.
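This works because comparing a column to a scalar produces a boolean value for every row. A minimal sketch with made-up quality values:

```python
import pandas as pd

df = pd.DataFrame({"quality": [5, 6, 7, 4]})

# Comparing a column to a scalar yields a boolean Series, one value per row,
# which can be assigned directly as a new column
df["good_wine"] = df["quality"] > 5
df["bad_wine"] = df["quality"] <= 5
print(df["good_wine"].tolist())  # [False, True, True, False]
print(df["bad_wine"].tolist())   # [True, False, False, True]
```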
The sort_values() function sorts the data frame based on the specified column. In the example above, we specify the ‘alcohol’ column, so the data is sorted by alcohol content. ascending=False tells Pandas to sort the data in descending order; set it to True if you want the data sorted in ascending order.
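Here is a small sketch of sorting on a toy frame (values made up); note that sort_values returns a new, sorted data frame rather than modifying the original in place, which is why the result is assigned back to a variable:

```python
import pandas as pd

df = pd.DataFrame({
    "alcohol": [9.4, 11.2, 9.8],
    "quality": [5, 6, 5],
})

# Sort by 'alcohol', highest values first
sorted_df = df.sort_values("alcohol", ascending=False)
print(sorted_df["alcohol"].tolist())  # [11.2, 9.8, 9.4]
```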
data = data.drop(columns=['good_wine', 'bad_wine'])
data.head()
The drop() function can be used to get rid of unwanted columns in the dataset. You can specify a list of columns as an argument, and Pandas will delete them all. As you can see in the output above, the ‘good_wine’ and ‘bad_wine’ columns have been removed.
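A quick sketch of drop on a toy frame (column names mirror the example above, values are made up). Like sort_values, drop returns a new data frame, so the result is assigned back:

```python
import pandas as pd

df = pd.DataFrame({
    "quality": [5, 6],
    "good_wine": [False, True],
    "bad_wine": [True, False],
})

# drop returns a new frame without the listed columns
df = df.drop(columns=["good_wine", "bad_wine"])
print(list(df.columns))  # ['quality']
```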
Conclusion and further reading
In conclusion, whether you are a data scientist, data engineer, or a software developer, Pandas is an indispensable part of your toolkit. In this tutorial, we looked at how we can explore the wine dataset and how we can draw insights from it using Pandas and its built-in functions.
Now that we have a better understanding of data analytics basics, you can go to Kaggle, download any dataset of your choice, and use Pandas to read, explore, and gain insights from the dataset.
Peer Review Contributions by Saiharsha Balasubramaniam