Pandas#

Earlier#

  • Numpy

  • Scipy

  • Matplotlib

Today#

  • Pandas

#Pandas

Learning Objectives:

  • Gain an introduction to the DataFrame and Series data structures of the pandas library

  • Access and manipulate data within a DataFrame and Series

  • Import CSV data into a pandas DataFrame

pandas is a column-oriented data analysis API. It’s a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs. Although a comprehensive introduction to the pandas API would span many pages, the core concepts are fairly straightforward, and we’ll present them below. For a more complete reference, the pandas docs site contains extensive documentation and many tutorials.

##Setup

Install

!pip install pandas
Requirement already satisfied: pandas in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (2.2.3)
Requirement already satisfied: numpy>=1.22.4 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2025.1)
Requirement already satisfied: tzdata>=2022.7 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2025.1)
Requirement already satisfied: six>=1.5 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)

Import - External Package

import pandas as pd

The primary data structures in pandas are implemented as two classes:

  • DataFrame, which you can imagine as a relational data table, with rows and named columns.

  • Series, which is a single column. A DataFrame contains one or more Series and a name for each Series.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in Spark and R.

##Series

One way to create a Series is to construct a Series object. For example:

pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
0    San Francisco
1         San Jose
2       Sacramento
dtype: object

Creating Data#

DataFrame objects can be created by passing a dict mapping string column names to their respective Series. If the Series don’t match in length, missing values are filled with special NA/NaN values. Example:

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
pd.DataFrame({ 'City name': city_names, 'Population': population })
City name Population
0 San Francisco 852469
1 San Jose 1015785
2 Sacramento 485199

Missing Values:

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 485199])
pd.DataFrame({ 'City name': city_names, 'Population': population })
City name Population
0 San Francisco 852469.0
1 San Jose 485199.0
2 Sacramento NaN

A DataFrame is a table(excel sheet). It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

For example, consider the following simple DataFrame:

pd.DataFrame({'A': [50, 21], 'B': [131, 2]})
A B
0 50 131
1 21 2

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, …) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:

# Student list
students = pd.DataFrame({'Name': ['Ross', 'Monica','Chandler'],
              'Last Name': ['Geller', 'Geller','Bing'],
              'GPA': ['98', '99','90']},
               index=['id1', 'id2','id3'])
students
Name Last Name GPA
id1 Ross Geller 98
id2 Monica Geller 99
id3 Chandler Bing 90
students['Name']['id1'] = 'Joey'
students
/var/folders/vp/68ys1b7x6gs927m8j64n1z8m0000gn/T/ipykernel_25218/424575284.py:1: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  students['Name']['id1'] = 'Joey'
Name Last Name GPA
id1 Joey Geller 98
id2 Monica Geller 99
id3 Chandler Bing 90

##Reading Data

Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won’t actually be creating our own data by hand. Instead, we’ll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:

A,B,C,
30,21,9,
35,34,1,
41,11,11

So a CSV file is a table of values separated by commas. Hence the name: “Comma-Separated Values”, or CSV.

Let’s now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We’ll use the pd.read_csv() function to read the data into a DataFrame.

Reading from web directly:

houses = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")

We can use the shape attribute to check how large the resulting DataFrame is:

houses.shape
(17000, 9)

We can examine the contents of the resultant DataFrame using the head()

command, which grabs the first five rows:

# arg the desired number of rows
houses.head(8)
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
0 -114.31 34.19 15.0 5612.0 1283.0 1015.0 472.0 1.4936 66900.0
1 -114.47 34.40 19.0 7650.0 1901.0 1129.0 463.0 1.8200 80100.0
2 -114.56 33.69 17.0 720.0 174.0 333.0 117.0 1.6509 85700.0
3 -114.57 33.64 14.0 1501.0 337.0 515.0 226.0 3.1917 73400.0
4 -114.57 33.57 20.0 1454.0 326.0 624.0 262.0 1.9250 65500.0
5 -114.58 33.63 29.0 1387.0 236.0 671.0 239.0 3.3438 74000.0
6 -114.58 33.61 25.0 2907.0 680.0 1841.0 633.0 2.6768 82400.0
7 -114.59 34.83 41.0 812.0 168.0 375.0 158.0 1.7083 48500.0

Statistical overview of the dataset:

houses.describe()
longitude latitude housing_median_age total_rooms total_bedrooms population households median_income median_house_value
count 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000 17000.000000
mean -119.562108 35.625225 28.589353 2643.664412 539.410824 1429.573941 501.221941 3.883578 207300.912353
std 2.005166 2.137340 12.586937 2179.947071 421.499452 1147.852959 384.520841 1.908157 115983.764387
min -124.350000 32.540000 1.000000 2.000000 1.000000 3.000000 1.000000 0.499900 14999.000000
25% -121.790000 33.930000 18.000000 1462.000000 297.000000 790.000000 282.000000 2.566375 119400.000000
50% -118.490000 34.250000 29.000000 2127.000000 434.000000 1167.000000 409.000000 3.544600 180400.000000
75% -118.000000 37.720000 37.000000 3151.250000 648.250000 1721.000000 605.250000 4.767000 265000.000000
max -114.310000 41.950000 52.000000 37937.000000 6445.000000 35682.000000 6082.000000 15.000100 500001.000000

Reading from local storage:

! wget https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv
tips = pd.read_csv('tips.csv')
--2025-05-06 19:45:50--  https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... 
connected.
HTTP request sent, awaiting response... 
200 OK
Length: 9729 (9.5K) [text/plain]
Saving to: ‘tips.csv.3’


tips.csv.3            0%[                    ]       0  --.-KB/s               
tips.csv.3          100%[===================>]   9.50K  --.-KB/s    in 0.001s  

2025-05-06 19:45:51 (12.7 MB/s) - ‘tips.csv.3’ saved [9729/9729]
tips.sample(5)
total_bill tip sex smoker day time size
12 15.42 1.57 Male No Sun Dinner 2
240 27.18 2.00 Female Yes Sat Dinner 2
222 8.58 1.92 Male Yes Fri Lunch 1
6 8.77 2.00 Male No Sun Dinner 2
87 18.28 4.00 Male No Thur Lunch 2
tips[tips['day'] == 'Sat'].head(5)
total_bill tip sex smoker day time size
19 20.65 3.35 Male No Sat Dinner 3
20 17.92 4.08 Male No Sat Dinner 2
21 20.29 2.75 Female No Sat Dinner 2
22 15.77 2.23 Female No Sat Dinner 2
23 39.42 7.58 Male No Sat Dinner 4

##Accessing Data

Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you’ll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.

Each column is a property of the dataframe object and we can get columns using Python property access:

tips.total_bill
0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64

If we have a Python dictionary, we can access its values using the indexing ([]) operator. We can do the same with columns in a DataFrame:

tips['total_bill']
0      16.99
1      10.34
2      21.01
3      23.68
4      24.59
       ...  
239    29.03
240    27.18
241    22.67
242    17.82
243    18.78
Name: total_bill, Length: 244, dtype: float64

These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator [] does have the advantage that it can handle column names with reserved characters in them (e.g. if we had a total bill column, tips.total bill wouldn’t work).

Doesn’t a pandas Series look kind of like a fancy dictionary? It pretty much is, so it’s no surprise that, to drill down to a single specific value, we need only use the indexing operator [] once more:

type(tips['time'])
pandas.core.series.Series
tips['time'][200]
'Lunch'
tips['time'].unique()
array(['Dinner', 'Lunch'], dtype=object)

Indexing#

The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you’re supposed to be using.

Index-based selection Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the second last row of data in a DataFrame, we may use the following:

tips[-2]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/envs/software_project/lib/python3.9/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: -2

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[21], line 1
----> 1 tips[-2]

File ~/anaconda3/envs/software_project/lib/python3.9/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/anaconda3/envs/software_project/lib/python3.9/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: -2
tips.iloc[-2]

Both loc and iloc are row-first, column-second. To get a column with iloc, we can do the following:

tips.iloc[:,0]

select rows and columns:

tips.iloc[103:107,0]

Label-based selection. The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it’s the data index value, not its position, which matters.

tips.loc[103:107,'tip']
tips.sample(5)

Change the index

df = tips.set_index('day')
df.sample(7)
df.loc['Sun','total_bill']

More info in the documentation indexing and selection.

Manipulating Data#

You may apply Python’s basic arithmetic operations to Series. For example:

tips.head()
tips['new_calc'] = tips.total_bill / 10
tips.head()

Operation between columns:

tips.tip / tips.total_bill

calculating new column:

import numpy as np
tips['log_total'] = np.log(tips['total_bill'])
tips.head(5)

summarizing data

tips.tip.value_counts()

##Filtering

So far we’ve been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.

For example, suppose that we’re interested specifically in lunch times.

tips[tips['time'] == 'Lunch']

If we want to ask more questions, for example parties that had lunch and paid a tip higher than 5, we can use the ‘and’ operator:

tips[(tips['time'] == 'Lunch') & (tips['tip'] > 5)]

We can build any query that concerns us as in any database, this can be achieved using basic conditional operators (and, or …)

For multiple selection instead of using many ‘or’ operators we use isin() function that pandas provide:

tips[(tips['size'].isin([2, 3])) & (tips['tip'] > 5)]

Data cleaning, in modern apps and researches many time we have missing data due to technical or buisiness issues, in clinical research patient might not answer some question, it’s up to the researcher to decide how to handle missing data(this is a topic in ML). Naive way is to drop missings and it’s provided by Pandas:

# we also can drop by specific column(see documentation)
tips.dropna()

Grouping and Sorting#

In addition to aggregation built-ins (count(), sum(), nunique()…) we can group the data based on splitting the values of some index, then apply a calculation on it and presenting it the result in a combined matter:

avg_tip = tips.groupby(['smoker']).tip.mean()
avg_tip
avg_tip = tips.groupby(['day']).tip.mean()
avg_tip.sort_values(ascending=False)

Now, who is the higher payer ? Sorting dataset by the tip:

Sort by column to get the highest tipper:

tips.sort_values(by='tip', ascending=False)

Multiple group by:

tips.groupby(['smoker','day']).tip.mean()

Combining Data#

!pip install Faker
from faker import Faker
fake = Faker()
import numpy as np
import pandas as pd
ids = np.random.randint(1000, 2000, 10)
names = [fake.name() for i in range(10)]
ages = np.random.randint(18, 70, 10)
address = [fake.address() for i in range(10)]
company = [fake.company() for i in range(10)]
df_customer = pd.DataFrame({
    'id': ids,
    'name': names,
})
df_info = pd.DataFrame({
    'id': ids[:8],
    'age': ages[:8],
    'address': address[:8],
    'company': company[:8]
})
df_customer
df_info

Join/Merge#

Combining columns from different dataframes based on a key:

image.png

pd.merge(df_customer, df_info,on='id')
pd.merge(df_customer, df_info,how='left',on='id')

Union#

Assuming we have another store with it’s clients as follow:

ids = np.random.randint(1000, 2000, 4)
names = [fake.name() for i in range(4)]
df_customer_1 = pd.DataFrame({
    'id': ids,
    'name': names,
})
df_customer_1

We want to combine the rows of both tables to one table, this is called union in SQL and concat in Python:

pd.concat([df_customer, df_customer_1])
#.loc[2], iloc[2]

Apply#

Manipulate columns and rows in a DataFrame with custom functions.

csv_url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv'
df = pd.read_csv(csv_url)
df.head()

For the purpose of this exercise slice the dataframe from ‘school’ until the ‘guardian’ column

stud_alcoh = df.loc[: , "school":"guardian"]
stud_alcoh.head()

Create a lambda function that will capitalize strings.

capitalizer = lambda x: x.capitalize()

Capitalize both Mjob and Fjob

stud_alcoh['Mjob'] = stud_alcoh['Mjob'].apply(capitalizer)
stud_alcoh['Fjob'] = stud_alcoh['Fjob'].apply(capitalizer)
stud_alcoh.head()

Create a function called majority that returns a boolean value to a new column called legal_drinker (Consider majority as older than 17 years old)

def majority(x):
    if x > 17:
        return True
    else:
        return False
stud_alcoh['legal_drinker'] = stud_alcoh['age'].apply(majority)
stud_alcoh.head()

Apply multiple columns

def f(x):
   return (x['G1'] + x['G2'] +x['G3'])/3
 df.apply(f, axis=1)
def f1(a, b, c):
   return (a+b+c)/3
df.apply(lambda x: f1(x['G1'], x['G2'], x['G3']), axis=1)
df[['G1','G2','G3']].apply(lambda x: f1(*x), axis=1)

#Matplotlib

import matplotlib.pyplot as plt

By running this special iPython command, we will be displaying plots inline:

%matplotlib inline

##Plotting

isr_covid = covid[covid.country=='Israel']
isr_covid.date = pd.to_datetime(isr_covid.date.astype(str))
isr_covid.head()
plt.figure(figsize=(12,8))
plt.plot(isr_covid['date'], isr_covid['daily_vaccinations'])

##Subplots

You can plot different things in the same figure using the subplot function. Here is an example:

import requests
from datetime import date, timedelta

today = date.today()
yesterday = today - timedelta(days=660)
country = "israel"
endpoint = f"https://api.covid19api.com/country/{country}"
params = {"from": str(yesterday), "to": str(today)}
response = requests.get(endpoint, params=params).json()
response
covid_cases = pd.DataFrame(response)
covid_cases
# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.figure(figsize=(12,14))

# Make the first plot
plt.subplot(3, 1, 1)
plt.plot(pd.to_datetime( covid_cases['Date'].astype(str)), covid_cases['Confirmed'])
plt.title('Confirmed Cases')

# Set the second subplot as active, and make the second plot.
plt.subplot(3, 1, 2)
plt.plot(pd.to_datetime( covid_cases['Date'].astype(str)), covid_cases['Deaths'])
plt.title('Deaths')

# Set the second subplot as active, and make the second plot.
plt.subplot(3, 1, 3)
plt.plot(pd.to_datetime( covid_cases['Date'].astype(str)), covid_cases['Active'])
plt.title('Active')

# Show the figure.
plt.show()

You can read much more about the subplot function in the documentation.