Pandas: from Zero to Hero

Gamze Yılan
8 min read · Dec 10, 2022

A thorough article that you can read while watching your favorite daytime TV, yet good enough to teach you all you need to know to actually get the work done.

Getting Started

Pandas is a Data Analysis and Processing library in Python. Pandas is widely used for Data Science and Machine Learning. It’s developed on top of another Python package called NumPy, which is used for scientific computing. If you don’t know much about NumPy, you might want to check it out quickly by reading my 5 minute article before you continue with Pandas. This is not a must-do but I highly recommend it.

In order to get started, first ensure you have Python, NumPy and Pandas on your system. How you get them depends on your system and your preferred IDE, but I recommend Anaconda. With the basic Anaconda installation you can simply open a terminal and run “conda install numpy”, and then “conda install pandas” & voila!

Now, let’s create an example.py file to start working with.

Note: If we were to create a project instead of a single example file to work with, we would simply have to create a directory and an __init__.py file inside it.

You should add the following lines to the top of the file in order to be able to work with Pandas within the file.

import numpy as np

import pandas as pd

Note: We generally import NumPy as np and Pandas as pd so that we don’t have to type numpy & pandas every time we call a function from the libraries. Keep in mind that although these names can be changed, np and pd are the general convention.

After that, you can run the code within the example file by simply going to the terminal and running the command:

python example.py

Series

Just as most programming languages have arrays and NumPy has matrices, Pandas has series. The main difference between a series and an ordinary array is that each item in a series has a label (an integer position by default, or a key you choose) that you can use to access its value. Think of it as a list of key-value pairs, like JSON data.

Fun Fact: Pandas is often described as “Excel for Data Science” due to the fact that it stores key-value pairs, displays them as a table and performs calculations on them in a similar manner to MS Excel.

You can create a Pandas series from a Python dictionary:

myDictionary = {"John": 35, "Jane": 40}

pd.Series(myDictionary)  # returns John 35, Jane 40

If you were to pass a list instead of a dictionary, the items within the series would have integers as keys:

myArray = [35, 40]

pd.Series(myArray)  # returns 0 35, 1 40

You can also combine two lists into one series, using one as the values and the other as the keys:

myArray1 = [35, 40]

myArray2 = ["John", "Jane"]

pd.Series(data=myArray1, index=myArray2)  # returns John 35, Jane 40
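To make the "key-value pairs" idea a bit more concrete, here is a minimal sketch of reading values back out of a series. The variable names are my own, picked just for illustration.

```python
import pandas as pd

# A series with string keys, built the same way as above
ages = pd.Series(data=[35, 40], index=["John", "Jane"])

john_age = ages["John"]    # access by key
first_age = ages.iloc[0]   # access by integer position (0-based)
as_dict = ages.to_dict()   # back to a plain Python dictionary
```

Using .iloc for positions keeps the lookup unambiguous when the keys themselves are not integers.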

If there are two different series with the same keys, you can perform mathematical operations on them as:

series1 = pd.Series({"John": 35, "Jane": 40})

series2 = pd.Series({"John": 100, "Jane": 50})

seriesSum = series1 + series2  # returns John 135, Jane 90

Note: When performing a mathematical operation on series with different keys, the matching keys will have the result of the operation as their value but the non-matching keys will have a value of NaN, as there’ll be no result for these keys.
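Here is a small sketch of what that looks like in practice, assuming two made-up series that share only one key. The fill_value argument of Series.add is an extra that the article doesn't cover, shown here only as one possible way around the NaN values.

```python
import pandas as pd

# Two series that only have the key "John" in common
ages = pd.Series({"John": 35, "Jane": 40})
scores = pd.Series({"John": 100, "Sarah": 50})

# Plain addition aligns on keys; a key missing from either side gives NaN
total = ages + scores

# Series.add with fill_value=0 treats a missing key as 0 instead
total_filled = ages.add(scores, fill_value=0)
```

Here total holds 135 for John and NaN for both Jane and Sarah, while total_filled keeps Jane's 40 and Sarah's 50.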

Data Frames

A data frame can be thought of as the matrix version of a series: a table in which every row and every column has a key, so each cell can be reached through a pair of keys.

You can create a data frame from any array or matrix as:

myMatrix = [ [10, 20, 30], [ 100, 200, 300] ]

dataFrame = pd.DataFrame(myMatrix)

You can assign keys to the rows and columns as:

myMatrix = [[1, 2, 3], [10, 20, 30], [100, 200, 300]]

rows = ["John", "Jane", "Sarah"]

columns = ["Tennis", "Basketball", "Pingpong"]

dataFrame = pd.DataFrame(myMatrix, index=rows, columns=columns)

You can call a certain column from the data frame as:

dataFrame["Tennis"]

If you were to call a row with the key as shown above, however, it would throw an error. You can call a row with the key as:

dataFrame.loc["John"]

The fact that there are keys to call columns and rows doesn’t mean that there are no indexes. Just as in matrices, the integer positions still exist, and they start from 0. You can, for example, call the first row as:

dataFrame.iloc[0]  # returns the row John with the values 1, 2, 3
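To see the three ways of selecting side by side, here is a short sketch that rebuilds the same table. The variable names are mine.

```python
import pandas as pd

scores = pd.DataFrame(
    [[1, 2, 3], [10, 20, 30], [100, 200, 300]],
    index=["John", "Jane", "Sarah"],
    columns=["Tennis", "Basketball", "Pingpong"],
)

tennis = scores["Tennis"]    # column by key
john = scores.loc["John"]    # row by key
first_row = scores.iloc[0]   # row by position (0-based), the same row as John
second_row = scores.iloc[1]  # the row Jane
```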

You can add a new column to the data frame as:

dataFrame["Football"] = [1000, 2000, 3000]

Conversely, you can delete a column or a row from the data frame as:

dataFrame.drop("Football", axis=1, inplace=True)  # to delete a column you must set axis to 1

dataFrame.drop("John", axis=0, inplace=True)  # to delete a row you must set axis to 0

Note: Pandas doesn’t want you to change an entire column or row by mistake, so by default drop leaves the original data frame untouched and returns a new one with the column or row removed. Maybe you want to keep the old data frame and store the trimmed version under another name? To change the data frame itself, pass the inplace=True parameter (all lowercase) to the function.
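The difference is easy to check for yourself. Below is a minimal sketch with a made-up table, showing that drop without inplace=True hands back a new data frame.

```python
import pandas as pd

scores = pd.DataFrame(
    {"Tennis": [1, 10, 100], "Football": [1000, 2000, 3000]},
    index=["John", "Jane", "Sarah"],
)

# Without inplace=True, drop returns a new data frame and leaves scores alone
without_football = scores.drop("Football", axis=1)

# Same idea for rows: the result is stored under a new name
without_john = scores.drop("John", axis=0)
```

After running this, scores still has the Football column and the John row; only the two new data frames are missing them.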

You can call a cell value as:

dataFrame.loc["John", "Tennis"]

You can filter the rows of a data frame by a condition as:

dataFrame[dataFrame["Tennis"] > 5]

# returns a new data frame with only the rows where the Tennis value is greater than 5, so the row John is left out; the original data frame is not changed
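Under the hood, the condition inside the brackets is just a series of True/False values. Here is a small sketch on the same table; combining two conditions with & is an extra step beyond the article, included only as an illustration.

```python
import pandas as pd

scores = pd.DataFrame(
    [[1, 2, 3], [10, 20, 30], [100, 200, 300]],
    index=["John", "Jane", "Sarah"],
    columns=["Tennis", "Basketball", "Pingpong"],
)

mask = scores["Tennis"] > 5   # John False, Jane True, Sarah True
filtered = scores[mask]       # keeps Jane and Sarah; scores is unchanged

# Conditions are combined with & (and) or | (or); each one needs parentheses
both = scores[(scores["Tennis"] > 5) & (scores["Basketball"] < 100)]
```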

To change the index of a data frame, first add the new index values as a column to the old data frame. Then you can refer to it by its column name and make its values the new index as:

dataFrame.set_index("New Index Column", inplace=True)
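As a quick sketch of those two steps, assuming a hypothetical "Player ID" column invented for this example:

```python
import pandas as pd

scores = pd.DataFrame({"Tennis": [1, 10, 100]}, index=["John", "Jane", "Sarah"])

# Step 1: add the new keys as an ordinary column
scores["Player ID"] = ["p1", "p2", "p3"]

# Step 2: turn that column into the index (it stops being a regular column)
scores = scores.set_index("Player ID")

jane_tennis = scores.loc["p2", "Tennis"]
```

Note that the old keys (John, Jane, Sarah) are gone after this, so keep them in a column first if you still need them.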

A multi-index data frame looks like this in Excel:

Here there are two levels of row index: the category (Fruits or Vegetables) AND the product name. We can also create multi-indexes like this via Pandas. In order to do that you must:

  • Define two lists, one of outer index values and one of inner index values, and then combine them into a list of tuples using the list and zip functions.
  • Then use the MultiIndex.from_tuples() method to transform the list of tuples into a multi-index.
  • Then create a list of columns.
  • After that, write the data as a list of lists and turn it into a matrix using the np.array() function.
  • Finally, use the DataFrame() function from Pandas to create the data frame with the multi-index.

In our given example above, we would need to do the following to actually create that:

outerIndexes = ["Vegetables", "Vegetables", "Fruits", "Fruits"]

innerIndexes = ["Tomatoes", "Cucumbers", "Oranges", "Blueberries"]

combinedIndexes = list(zip(outerIndexes, innerIndexes))

combinedIndexes = pd.MultiIndex.from_tuples(combinedIndexes)

listAmountPrice = [[2, 5], [1, 3], [2, 3], [3, 2]]  # each row is [price in USD, amount in lbs]

numpySeriesAmountPrice = np.array(listAmountPrice)

myColumns = ["Price", "Amount"]

multiIndexDataFrame = pd.DataFrame(numpySeriesAmountPrice, index=combinedIndexes, columns=myColumns)

And then, you can call the amount and the price of Blueberries as:

multiIndexDataFrame.loc["Fruits"].loc["Blueberries"]

You can give the levels of the multi-index some names as well, like this:

multiIndexDataFrame.index.names = ["Aisle", "Product Name"]
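Putting the whole recipe together, a compact sketch could look like the one below. The prices and amounts are plain numbers here, and the shorter variable names are my own. The tuple lookup and the xs() call are alternatives to the chained .loc above, shown only as options.

```python
import numpy as np
import pandas as pd

outer = ["Vegetables", "Vegetables", "Fruits", "Fruits"]
inner = ["Tomatoes", "Cucumbers", "Oranges", "Blueberries"]

# Build the two-level index and name its levels in one go
index = pd.MultiIndex.from_tuples(
    list(zip(outer, inner)), names=["Aisle", "Product Name"]
)

# Each row is [price in USD, amount in lbs]
data = np.array([[2, 5], [1, 3], [2, 3], [3, 2]])
products = pd.DataFrame(data, index=index, columns=["Price", "Amount"])

fruits = products.loc["Fruits"]  # every row under the outer key
blueberry_price = products.loc[("Fruits", "Blueberries"), "Price"]  # one cell
tomatoes = products.xs("Tomatoes", level="Product Name")  # look up by inner key
```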

Operations

  • In order to return a data frame from the original data frame with the rows that contain NaN values deleted:

myDataFrame.dropna()

  • In order to return a data frame from the original data frame with the columns that contain NaN values deleted:

myDataFrame.dropna( axis = 1)

  • In order to return a data frame from the original data frame keeping only the rows that have at least two non-NaN values (thresh is the minimum number of non-NaN values a row needs in order to stay):

myDataFrame.dropna(thresh=2)

  • In order to return a data frame from the original data frame with the cells that contain a NaN value changed to a certain value, let’s say, the integer 20:

myDataFrame.fillna(20)
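The four calls above are easier to compare on a tiny table with a few holes in it. This is a minimal sketch with made-up data.

```python
import numpy as np
import pandas as pd

# Row 0 has no NaN, row 1 has one NaN, row 2 has two NaNs
df = pd.DataFrame(
    {"A": [1.0, np.nan, np.nan], "B": [2.0, 5.0, np.nan], "C": [3.0, 6.0, 9.0]}
)

rows_clean = df.dropna()            # keeps only row 0
cols_clean = df.dropna(axis=1)      # keeps only column C
at_least_two = df.dropna(thresh=2)  # keeps rows 0 and 1 (two or more real values)
filled = df.fillna(20)              # every NaN becomes 20
```

None of these calls touch df itself; each one returns a new data frame.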

  • In order to run operations and gather data about a data frame based on a column, you should group the data frame by that column using groupby. Then you can perform operations on each group, such as counting the rows or taking the mean of the values.

group = employeeInfoDataFrame.groupby("Salary")

group.mean(numeric_only=True)  # mean of every other numeric column, for each distinct salary

group.count()  # number of non-NaN entries in each other column for each distinct salary, i.e. how many employees get that salary

group.describe()  # summary statistics (count, mean, min, max etc.) of the other numeric columns for each distinct salary
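Grouping usually makes the most sense on a column with repeating values. The sketch below therefore groups a made-up employee table by a hypothetical "Department" column instead of "Salary"; the table and its column names are assumptions for illustration only.

```python
import pandas as pd

employees = pd.DataFrame(
    {
        "Name": ["Ann", "Bob", "Cem", "Dee"],
        "Department": ["IT", "IT", "HR", "HR"],
        "Salary": [100, 200, 300, 500],
    }
)

by_department = employees.groupby("Department")

mean_salary = by_department["Salary"].mean()  # one mean per department
headcount = by_department["Name"].count()     # employees per department
```

Picking a single column after groupby, as done here, keeps the result to one series per question.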

  • If you’ve got two or more data frames that have the same columns, you can gather them together (one under another) like this:

pd.concat( [dataFrame1, dataFrame2] )

  • Likewise, if you’ve got two or more data frames that have the same rows, you can gather them together (side by side) like this:

pd.concat( [dataFrame1, dataFrame2] , axis=1)

  • If you want to combine two data frames that have different columns, though, you’d use the merge function as below. But keep in mind, in order to be able to do this, the data frames that you merge must have at least one column in common.

pd.merge(dataFrame1, dataFrame2, on="commonColumnName")
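Here is a short sketch of the three ways of combining shown above, using small made-up tables. The ignore_index argument is an optional extra that renumbers the rows after stacking.

```python
import pandas as pd

q1 = pd.DataFrame({"Name": ["John", "Jane"], "Sales": [10, 20]})
q2 = pd.DataFrame({"Name": ["Sarah"], "Sales": [30]})
ages = pd.DataFrame({"Name": ["John", "Jane"], "Age": [35, 40]})

# Same columns: stack one under another
stacked = pd.concat([q1, q2], ignore_index=True)

# Same rows: place side by side
side_by_side = pd.concat([q1, ages[["Age"]]], axis=1)

# Different columns with "Name" in common: merge on it
merged = pd.merge(stacked, ages, on="Name")
```

By default merge keeps only the rows whose key exists in both data frames, so Sarah (who has no age) is missing from merged.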

  • You can get only the unique values within a column in a data frame as:

myDataFrame["columnName"].unique()

  • You can calculate how many unique values there are within a column in a data frame as:

myDataFrame["columnName"].nunique()

  • You can define a function and apply it to each value within a column of a data frame with the .apply() method. To explain further in an example:

def priceWithTax(price):
    return price * 1.33  # adds 33% tax to the original price

priceDataFrame["Price"].apply(priceWithTax)  # returns a new series with the tax added; the data frame itself is not changed
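The last three operations can be tried together on one small table. The data and names below are made up for illustration; the final line is an alternative to apply() that goes slightly beyond the article.

```python
import pandas as pd

prices = pd.DataFrame(
    {"Product": ["Tomato", "Orange", "Tomato"], "Price": [100.0, 200.0, 100.0]}
)

unique_products = prices["Product"].unique()  # the distinct values
n_products = prices["Product"].nunique()      # how many distinct values


def price_with_tax(price):
    """Add 33% tax to a single price."""
    return price * 1.33


with_tax = prices["Price"].apply(price_with_tax)

# For simple arithmetic, operating on the whole column gives the same numbers
vectorised = prices["Price"] * 1.33
```

To keep the result, assign it back to a column, for example prices["Price With Tax"] = with_tax.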

Excel with Pandas

Even though we worked with data frames and series above, real life data will most likely be handed to you in the form of an Excel file.

In order to be able to work with an Excel file, the simplest way is to save it in the same directory as your python file (the python file in which you’ll run your Pandas functions and operations) and run the file from there, so that the file name alone is enough to find it. Otherwise, pass the full path to the file. Then, you can import your Excel file as:

myDataFrame = pd.read_excel("myExcelFile.xlsx")

The Excel files will be imported as type DataFrame by default, so you’re good to go!

Conclusion

There are many other pre-defined functions and easy-to-use structures within Pandas that you may need depending on the task, but what we’ve covered so far should be enough to help you get the job done. If you’re looking for a whole career in this or if you’re interested in learning further, you may continue with advanced level courses, or research as you come across a need for something else (something smarter!) while working on a task. Pandas has a solution for almost everything!
