One of the most fundamental libraries in the Machine Learning and Data Science landscape is NumPy (short for Numerical Python). Its significance has led to the creation of many other libraries built on top of it, such as Pandas, SciPy, and Matplotlib. Let us delve into the workings and the various functions of the library behind the first import line of nearly every Kaggle notebook.
Python is one of the most versatile programming languages in use today. It provides developers with an abundance of high-level data structures, such as lists and dictionaries, that serve as building blocks for other data structures.
However, these structures are not suited for high-performance numeric computation.
```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import scipy as sc
```
Table Of Contents
- Reason for the NumPy Library
- Enter NumPy
- NumPy Functions
- How is NumPy so fast?
Reason for the NumPy Library
To back that statement up, let’s take one of Python’s fundamental data structures, a list, and multiply every element in it by a constant, say 5.
```python
import time

# Create an empty list
list_a = []

# Append 100,000 elements to the list
for i in range(100000):
    list_a.append(i)

start_time = time.time()

# Multiply every element in list_a by 5
list_b = [5 * number for number in list_a]

# Calculate the time taken to multiply every element
elapsed_time = time.time() - start_time
print("Time taken for a list: " + str(elapsed_time))
# Output - Time taken for a list: 0.0039899349212646484
```
Using a real-world example, like that of a processing chip within a self-driving car, roughly 0.004 seconds for 100,000 multiplication operations is highly inefficient.
Such a system may need to complete millions of multiplication and addition operations every second, and those lost milliseconds could make the difference in a life-or-death situation.
Enter NumPy
A NumPy array is similar to an array in any other language: it consists of homogeneous elements. However, it is not restricted to two dimensions; a NumPy array can have as many dimensions as the situation at hand calls for. Whatever the number of dimensions, the elements are stored in a single block of computer memory, which makes them quick to access.
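These properties can be inspected directly. The following is a minimal sketch that relies only on standard ndarray attributes (ndim, shape, dtype, and flags); the variable names cube and mixed are illustrative choices, not taken from the article.

```python
import numpy as np

# A 3-dimensional array: 2 blocks, each with 3 rows and 4 columns
cube = np.zeros((2, 3, 4))

print(cube.ndim)   # number of dimensions: 3
print(cube.shape)  # size along each dimension: (2, 3, 4)
print(cube.dtype)  # one element type for the whole array: float64

# Mixing integers and floats still yields a single, common element type
mixed = np.array([1, 2, 3.5])
print(mixed.dtype)  # float64

# A freshly created array occupies one contiguous block of memory
print(cube.flags["C_CONTIGUOUS"])  # True
```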
Before delving into the functionality, let’s begin by importing the NumPy library.
```python
# Import the NumPy library
import numpy as np

# Check the version
print(np.__version__)
# Output - 1.18.4
```
np.__version__ returns the version of NumPy being used.
A NumPy array can be created in one of two ways: by converting a list to a NumPy array or by initializing a NumPy array directly.
```python
import numpy as np

# Method 1 - Converting a list to a NumPy array
list_a = [1, 2, 3, 4, 5, 6]
print(type(list_a))
# Output - <class 'list'>

np_list_a = np.array(list_a)
print(type(np_list_a))
# Output - <class 'numpy.ndarray'>
```
Initializing a linear NumPy array can go one of many ways:
```python
import numpy as np

array_of_zeros = np.zeros(10)
print(array_of_zeros)
# Output - [0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

array_of_ones = np.ones(10)
print(array_of_ones)
# Output - [1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]

array_of_random_numbers = np.random.rand(5)
print(array_of_random_numbers)
# Output - [0.84692907 0.58108508 0.77377301 0.95796771 0.61382531]
```
A NumPy array of random integers can be generated using the numpy.random.randint method. This method takes three inputs:
- low (lower bound of the range; inclusive),
- high (upper bound of the range; exclusive; default value is None) and
- size (size of the array).
The method returns random integers from the discrete uniform distribution in the half-open interval [low, high). If high is None, the results are from [0, low).
```python
import numpy as np

# With all three parameters - low, high and size
array_of_random_integers = np.random.randint(low=1, high=100, size=10)
print(array_of_random_integers)
# Output - [37 31 98 78 67 10 9 42 39 45]

# With high as None
array_of_random_integers = np.random.randint(low=5, high=None, size=10)
print(array_of_random_integers)
# Output - [3 4 1 4 4 1 3 1 3 2]
```
Yet another way to initialize a linear NumPy array is linspace, which returns an evenly spaced sequence in a specified interval.
This function takes the following parameters:
- start (beginning value of the sequence),
- stop (end value of the sequence; included when endpoint is True, excluded when endpoint is False),
- num (number of values to be generated; default: 50), and
- endpoint (whether stop is included as the last value; default: True)
It returns a numpy.ndarray with num equally spaced samples in the closed interval [start, stop] if endpoint is True. If endpoint is False, it returns num equally spaced samples in the half-open interval [start, stop).
```python
import numpy as np

even_spaced_array = np.linspace(start=0, stop=50, num=5, endpoint=True)
print(even_spaced_array)
# Output - [ 0. 12.5 25. 37.5 50. ]

even_spaced_array = np.linspace(start=0, stop=40, num=5, endpoint=True)
print(even_spaced_array)
# Output - [ 0. 10. 20. 30. 40.]
```
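The effect of the endpoint parameter is easiest to see side by side. The sketch below reuses the interval from 0 to 50 and also shows retstep, an optional linspace parameter that returns the spacing between samples; the variable names are illustrative.

```python
import numpy as np

# endpoint=False leaves `stop` out, so the step becomes (stop - start) / num
half_open = np.linspace(start=0, stop=50, num=5, endpoint=False)
print(half_open)  # [ 0. 10. 20. 30. 40.]

# retstep=True additionally returns the spacing between samples
closed, step = np.linspace(start=0, stop=50, num=5, retstep=True)
print(step)  # 12.5
```

With endpoint set to False the spacing is 10 rather than 12.5, because the same five samples no longer have to reach 50.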
Below are a few more array operations that are self-explanatory:
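As an illustrative selection (the arrays a and b and the particular operations chosen here are assumptions made for the sake of example), a handful of common operations look like this:

```python
import numpy as np

a = np.array([1, 2, 3, 4])
b = np.array([10, 20, 30, 40])

print(a + b)     # element-wise sum: [11 22 33 44]
print(a * b)     # element-wise product: [ 10  40  90 160]
print(a.sum())   # sum of all elements: 10
print(b.mean())  # arithmetic mean: 25.0
print(b.max())   # largest element: 40
print(a.dot(b))  # dot product: 300
```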
Apart from mathematical computations, there will be a constant need to reshape or manipulate data in arrays. One simple transformation is transposing a matrix. Doing this by hand for a list of lists (a matrix) is a tedious process:
```python
# Initialize the matrix to a set of values
matrix_A = [[5, 6, 7, 8] for _ in range(4)]
print(matrix_A)
"""
Output -
[[5, 6, 7, 8], [5, 6, 7, 8], [5, 6, 7, 8], [5, 6, 7, 8]]
"""

# Loop over the elements above the main diagonal
for i in range(4):
    for j in range(i + 1, 4):
        # Swap each element with its mirror image across the diagonal
        matrix_A[i][j], matrix_A[j][i] = matrix_A[j][i], matrix_A[i][j]
print(matrix_A)
"""
Output -
[[5, 5, 5, 5], [6, 6, 6, 6], [7, 7, 7, 7], [8, 8, 8, 8]]
"""
```
How does this work out in NumPy? Well, it’s pretty simple.
```python
import numpy as np

# Initialize the matrix to a set of values
matrix_A = [[1, 2, 3, 4] for _ in range(4)]
print(matrix_A)
"""
Output -
[[1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4], [1, 2, 3, 4]]
"""

# Convert it into a NumPy array
np_matrix_A = np.array(matrix_A)

# And transpose!
np_matrix_A = np_matrix_A.T
print(np_matrix_A)
"""
Output -
[[1 1 1 1]
 [2 2 2 2]
 [3 3 3 3]
 [4 4 4 4]]
"""
```
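Reshaping, mentioned earlier alongside transposition, follows the same pattern. The following is a small sketch using the standard arange and reshape calls; the variable names flat and grid are illustrative.

```python
import numpy as np

# Twelve consecutive integers, 0 through 11, in a flat array
flat = np.arange(12)

# View the same data as 3 rows by 4 columns
grid = flat.reshape(3, 4)
print(grid.shape)    # (3, 4)

# Transposing swaps the axes, giving 4 rows by 3 columns
print(grid.T.shape)  # (4, 3)

# Element (i, j) of the original is element (j, i) of the transpose
print(grid[1, 2], grid.T[2, 1])  # 6 6
```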
You can find more such functions in the official NumPy documentation.
With all that being said, let’s address the elephant in the room.
How is NumPy so fast?
Let’s analyze the example where we transposed a matrix. One key point to remember: in an interpreted language such as Python, explicit for loops are a major performance drain. A loop that performs a single operation (in this case, swapping two elements) across a large dataset pays the interpreter’s overhead on every iteration, and that cost adds up quickly.
Upon crunching a few numbers, transposing a (10000 x 10000) matrix using for loops takes 58.8596s, while NumPy takes significantly less time. The reason behind such high performance is a concept called vectorization. A vectorized operation is expressed once for the entire array rather than element by element, and NumPy carries out the underlying loop in compiled code instead of in the Python interpreter.
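To make the contrast concrete, here is a minimal sketch that computes the same result with an explicit loop and with a vectorized expression. The array values and the multiplier 5 are illustrative choices.

```python
import numpy as np

values = np.arange(10)

# Explicit loop: the interpreter handles one element per iteration
looped = np.empty_like(values)
for i in range(len(values)):
    looped[i] = values[i] * 5

# Vectorized: a single expression covers the whole array
vectorized = values * 5

print(np.array_equal(looped, vectorized))  # True
```

Both versions produce identical arrays; the difference lies only in where the loop runs.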
Figure: Vectorized Operations
This is the under-the-hood reason why NumPy’s calculations are so fast. Because NumPy’s core routines are written in C, an operation on an ndarray runs at a speed comparable to a loop over a plain C array, and the time saved over pure Python grows as the workload gets larger.
Let’s compare NumPy arrays and Python lists. The chart below puts the break-even point at around 200 elements; beyond that size, NumPy arrays pull ahead as the number of elements increases.
Figure: NumPy array vs Python List
Like all things that come full circle, let’s try to wrap this up with a performance comparison with our first example – multiplying every element of an array by 5.
But this time, the NumPy way:
```python
import time

import numpy as np

# Create a NumPy array having 100,000 elements
array_a = np.array([i for i in range(100000)])

start = time.time()

# Multiply by 5 the vectorized way
array_b = array_a * 5
print(time.time() - start)
# Output - 0.0004646778106689453
```
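A single time.time() measurement can be noisy, so the comparison is worth repeating. The sketch below uses the standard library’s timeit module to average over several runs; the run count of 100 and the variable names are assumptions, and the absolute timings will vary from machine to machine.

```python
import timeit

import numpy as np

n = 100_000
python_list = list(range(n))
numpy_array = np.arange(n)

# Run each version 100 times and report the average time per run
runs = 100
list_time = timeit.timeit(lambda: [5 * x for x in python_list], number=runs) / runs
array_time = timeit.timeit(lambda: numpy_array * 5, number=runs) / runs

print(f"list:  {list_time:.6f} s per run")
print(f"array: {array_time:.6f} s per run")
```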
NumPy is one of the most fundamental libraries in Machine Learning and Data Science. Its core is written in C and exposed through a Python interface, and it uses vectorized operations to perform calculations at an incredible speed. It also provides a wide range of built-in functions that come in handy for many programmers.
Peer Review Contributions by: Saiharsha Balasubramaniam