Pandas#
Earlier#
Numpy
Scipy
Matplotlib
Today#
Pandas
#Pandas
Learning Objectives:
Gain an introduction to the
DataFrameandSeriesdata structures of the pandas libraryAccess and manipulate data within a
DataFrameandSeriesImport CSV data into a pandas
DataFrame
pandas is a column-oriented data analysis API. It’s a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs. Although a comprehensive introduction to the pandas API would span many pages, the core concepts are fairly straightforward, and we’ll present them below. For a more complete reference, the pandas docs site contains extensive documentation and many tutorials.
##Setup
Install
!pip install pandas
Requirement already satisfied: pandas in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (2.2.3)
Requirement already satisfied: numpy>=1.22.4 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2.0.2)
Requirement already satisfied: python-dateutil>=2.8.2 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2025.1)
Requirement already satisfied: tzdata>=2022.7 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2025.1)
Requirement already satisfied: six>=1.5 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)
Import - External Package
import pandas as pd
The primary data structures in pandas are implemented as two classes:
DataFrame, which you can imagine as a relational data table, with rows and named columns.Series, which is a single column. ADataFramecontains one or moreSeriesand a name for eachSeries.
The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in Spark and R.
##Series
One way to create a Series is to construct a Series object. For example:
pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
0 San Francisco
1 San Jose
2 Sacramento
dtype: object
Creating Data#
DataFrame objects can be created by passing a dict mapping string column names to their respective Series. If the Series don’t match in length, missing values are filled with special NA/NaN values. Example:
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
pd.DataFrame({ 'City name': city_names, 'Population': population })
| City name | Population | |
|---|---|---|
| 0 | San Francisco | 852469 |
| 1 | San Jose | 1015785 |
| 2 | Sacramento | 485199 |
Missing Values:
city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 485199])
pd.DataFrame({ 'City name': city_names, 'Population': population })
| City name | Population | |
|---|---|---|
| 0 | San Francisco | 852469.0 |
| 1 | San Jose | 485199.0 |
| 2 | Sacramento | NaN |
A DataFrame is a table(excel sheet). It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.
For example, consider the following simple DataFrame:
pd.DataFrame({'A': [50, 21], 'B': [131, 2]})
| A | B | |
|---|---|---|
| 0 | 50 | 131 |
| 1 | 21 | 2 |
The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, …) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.
The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:
# Student list
students = pd.DataFrame({'Name': ['Ross', 'Monica','Chandler'],
'Last Name': ['Geller', 'Geller','Bing'],
'GPA': ['98', '99','90']},
index=['id1', 'id2','id3'])
students
| Name | Last Name | GPA | |
|---|---|---|---|
| id1 | Ross | Geller | 98 |
| id2 | Monica | Geller | 99 |
| id3 | Chandler | Bing | 90 |
students['Name']['id1'] = 'Joey'
students
/var/folders/vp/68ys1b7x6gs927m8j64n1z8m0000gn/T/ipykernel_25218/424575284.py:1: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:
df["col"][row_indexer] = value
Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
students['Name']['id1'] = 'Joey'
| Name | Last Name | GPA | |
|---|---|---|---|
| id1 | Joey | Geller | 98 |
| id2 | Monica | Geller | 99 |
| id3 | Chandler | Bing | 90 |
##Reading Data
Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won’t actually be creating our own data by hand. Instead, we’ll be working with data that already exists.
Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:
A,B,C,
30,21,9,
35,34,1,
41,11,11
So a CSV file is a table of values separated by commas. Hence the name: “Comma-Separated Values”, or CSV.
Let’s now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We’ll use the pd.read_csv() function to read the data into a DataFrame.
Reading from web directly:
houses = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")
We can use the shape attribute to check how large the resulting DataFrame is:
houses.shape
(17000, 9)
We can examine the contents of the resultant DataFrame using the head()
command, which grabs the first five rows:
# arg the desired number of rows
houses.head(8)
| longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | -114.31 | 34.19 | 15.0 | 5612.0 | 1283.0 | 1015.0 | 472.0 | 1.4936 | 66900.0 |
| 1 | -114.47 | 34.40 | 19.0 | 7650.0 | 1901.0 | 1129.0 | 463.0 | 1.8200 | 80100.0 |
| 2 | -114.56 | 33.69 | 17.0 | 720.0 | 174.0 | 333.0 | 117.0 | 1.6509 | 85700.0 |
| 3 | -114.57 | 33.64 | 14.0 | 1501.0 | 337.0 | 515.0 | 226.0 | 3.1917 | 73400.0 |
| 4 | -114.57 | 33.57 | 20.0 | 1454.0 | 326.0 | 624.0 | 262.0 | 1.9250 | 65500.0 |
| 5 | -114.58 | 33.63 | 29.0 | 1387.0 | 236.0 | 671.0 | 239.0 | 3.3438 | 74000.0 |
| 6 | -114.58 | 33.61 | 25.0 | 2907.0 | 680.0 | 1841.0 | 633.0 | 2.6768 | 82400.0 |
| 7 | -114.59 | 34.83 | 41.0 | 812.0 | 168.0 | 375.0 | 158.0 | 1.7083 | 48500.0 |
Statistical overview of the dataset:
houses.describe()
| longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | |
|---|---|---|---|---|---|---|---|---|---|
| count | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 | 17000.000000 |
| mean | -119.562108 | 35.625225 | 28.589353 | 2643.664412 | 539.410824 | 1429.573941 | 501.221941 | 3.883578 | 207300.912353 |
| std | 2.005166 | 2.137340 | 12.586937 | 2179.947071 | 421.499452 | 1147.852959 | 384.520841 | 1.908157 | 115983.764387 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.790000 | 33.930000 | 18.000000 | 1462.000000 | 297.000000 | 790.000000 | 282.000000 | 2.566375 | 119400.000000 |
| 50% | -118.490000 | 34.250000 | 29.000000 | 2127.000000 | 434.000000 | 1167.000000 | 409.000000 | 3.544600 | 180400.000000 |
| 75% | -118.000000 | 37.720000 | 37.000000 | 3151.250000 | 648.250000 | 1721.000000 | 605.250000 | 4.767000 | 265000.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 37937.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |
Reading from local storage:
! wget https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv
tips = pd.read_csv('tips.csv')
--2025-05-06 19:45:50-- https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443...
connected.
HTTP request sent, awaiting response...
200 OK
Length: 9729 (9.5K) [text/plain]
Saving to: ‘tips.csv.3’
tips.csv.3 0%[ ] 0 --.-KB/s
tips.csv.3 100%[===================>] 9.50K --.-KB/s in 0.001s
2025-05-06 19:45:51 (12.7 MB/s) - ‘tips.csv.3’ saved [9729/9729]
tips.sample(5)
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 12 | 15.42 | 1.57 | Male | No | Sun | Dinner | 2 |
| 240 | 27.18 | 2.00 | Female | Yes | Sat | Dinner | 2 |
| 222 | 8.58 | 1.92 | Male | Yes | Fri | Lunch | 1 |
| 6 | 8.77 | 2.00 | Male | No | Sun | Dinner | 2 |
| 87 | 18.28 | 4.00 | Male | No | Thur | Lunch | 2 |
tips[tips['day'] == 'Sat'].head(5)
| total_bill | tip | sex | smoker | day | time | size | |
|---|---|---|---|---|---|---|---|
| 19 | 20.65 | 3.35 | Male | No | Sat | Dinner | 3 |
| 20 | 17.92 | 4.08 | Male | No | Sat | Dinner | 2 |
| 21 | 20.29 | 2.75 | Female | No | Sat | Dinner | 2 |
| 22 | 15.77 | 2.23 | Female | No | Sat | Dinner | 2 |
| 23 | 39.42 | 7.58 | Male | No | Sat | Dinner | 4 |
##Accessing Data
Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you’ll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.
Each column is a property of the dataframe object and we can get columns using Python property access:
tips.total_bill
0 16.99
1 10.34
2 21.01
3 23.68
4 24.59
...
239 29.03
240 27.18
241 22.67
242 17.82
243 18.78
Name: total_bill, Length: 244, dtype: float64
If we have a Python dictionary, we can access its values using the indexing ([]) operator. We can do the same with columns in a DataFrame:
tips['total_bill']
0 16.99
1 10.34
2 21.01
3 23.68
4 24.59
...
239 29.03
240 27.18
241 22.67
242 17.82
243 18.78
Name: total_bill, Length: 244, dtype: float64
These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator [] does have the advantage that it can handle column names with reserved characters in them (e.g. if we had a total bill column, tips.total bill wouldn’t work).
Doesn’t a pandas Series look kind of like a fancy dictionary? It pretty much is, so it’s no surprise that, to drill down to a single specific value, we need only use the indexing operator [] once more:
type(tips['time'])
pandas.core.series.Series
tips['time'][200]
'Lunch'
tips['time'].unique()
array(['Dinner', 'Lunch'], dtype=object)
Indexing#
The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you’re supposed to be using.
Index-based selection Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.
To select the second last row of data in a DataFrame, we may use the following:
tips[-2]
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
File ~/anaconda3/envs/software_project/lib/python3.9/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
3804 try:
-> 3805 return self._engine.get_loc(casted_key)
3806 except KeyError as err:
File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()
File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()
File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()
File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()
KeyError: -2
The above exception was the direct cause of the following exception:
KeyError Traceback (most recent call last)
Cell In[21], line 1
----> 1 tips[-2]
File ~/anaconda3/envs/software_project/lib/python3.9/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
4100 if self.columns.nlevels > 1:
4101 return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
4103 if is_integer(indexer):
4104 indexer = [indexer]
File ~/anaconda3/envs/software_project/lib/python3.9/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
3807 if isinstance(casted_key, slice) or (
3808 isinstance(casted_key, abc.Iterable)
3809 and any(isinstance(x, slice) for x in casted_key)
3810 ):
3811 raise InvalidIndexError(key)
-> 3812 raise KeyError(key) from err
3813 except TypeError:
3814 # If we have a listlike key, _check_indexing_error will raise
3815 # InvalidIndexError. Otherwise we fall through and re-raise
3816 # the TypeError.
3817 self._check_indexing_error(key)
KeyError: -2
tips.iloc[-2]
Both loc and iloc are row-first, column-second. To get a column with iloc, we can do the following:
tips.iloc[:,0]
select rows and columns:
tips.iloc[103:107,0]
Label-based selection. The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it’s the data index value, not its position, which matters.
tips.loc[103:107,'tip']
tips.sample(5)
Change the index
df = tips.set_index('day')
df.sample(7)
df.loc['Sun','total_bill']
More info in the documentation indexing and selection.
Manipulating Data#
You may apply Python’s basic arithmetic operations to Series. For example:
tips.head()
tips['new_calc'] = tips.total_bill / 10
tips.head()
Operation between columns:
tips.tip / tips.total_bill
calculating new column:
import numpy as np
tips['log_total'] = np.log(tips['total_bill'])
tips.head(5)
summarizing data
tips.tip.value_counts()
##Filtering
So far we’ve been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.
For example, suppose that we’re interested specifically in lunch times.
tips[tips['time'] == 'Lunch']
If we want to ask more questions, for example parties that had lunch and paid a tip higher than 5, we can use the ‘and’ operator:
tips[(tips['time'] == 'Lunch') & (tips['tip'] > 5)]
We can build any query that concerns us as in any database, this can be achieved using basic conditional operators (and, or …)
For multiple selection instead of using many ‘or’ operators we use isin() function that pandas provide:
tips[(tips['size'].isin([2, 3])) & (tips['tip'] > 5)]
Data cleaning, in modern apps and researches many time we have missing data due to technical or buisiness issues, in clinical research patient might not answer some question, it’s up to the researcher to decide how to handle missing data(this is a topic in ML). Naive way is to drop missings and it’s provided by Pandas:
# we also can drop by specific column(see documentation)
tips.dropna()
Grouping and Sorting#
In addition to aggregation built-ins (count(), sum(), nunique()…) we can group the data based on splitting the values of some index, then apply a calculation on it and presenting it the result in a combined matter:
avg_tip = tips.groupby(['smoker']).tip.mean()
avg_tip
avg_tip = tips.groupby(['day']).tip.mean()
avg_tip.sort_values(ascending=False)
Now, who is the higher payer ? Sorting dataset by the tip:
Sort by column to get the highest tipper:
tips.sort_values(by='tip', ascending=False)
Multiple group by:
tips.groupby(['smoker','day']).tip.mean()
Combining Data#
!pip install Faker
from faker import Faker
fake = Faker()
import numpy as np
import pandas as pd
ids = np.random.randint(1000, 2000, 10)
names = [fake.name() for i in range(10)]
ages = np.random.randint(18, 70, 10)
address = [fake.address() for i in range(10)]
company = [fake.company() for i in range(10)]
df_customer = pd.DataFrame({
'id': ids,
'name': names,
})
df_info = pd.DataFrame({
'id': ids[:8],
'age': ages[:8],
'address': address[:8],
'company': company[:8]
})
df_customer
df_info
Join/Merge#
Combining columns from different dataframes based on a key:
pd.merge(df_customer, df_info,on='id')
pd.merge(df_customer, df_info,how='left',on='id')
Union#
Assuming we have another store with it’s clients as follow:
ids = np.random.randint(1000, 2000, 4)
names = [fake.name() for i in range(4)]
df_customer_1 = pd.DataFrame({
'id': ids,
'name': names,
})
df_customer_1
We want to combine the rows of both tables to one table, this is called union in SQL and concat in Python:
pd.concat([df_customer, df_customer_1])
#.loc[2], iloc[2]
Apply#
Manipulate columns and rows in a DataFrame with custom functions.
csv_url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv'
df = pd.read_csv(csv_url)
df.head()
For the purpose of this exercise slice the dataframe from ‘school’ until the ‘guardian’ column
stud_alcoh = df.loc[: , "school":"guardian"]
stud_alcoh.head()
Create a lambda function that will capitalize strings.
capitalizer = lambda x: x.capitalize()
Capitalize both Mjob and Fjob
stud_alcoh['Mjob'] = stud_alcoh['Mjob'].apply(capitalizer)
stud_alcoh['Fjob'] = stud_alcoh['Fjob'].apply(capitalizer)
stud_alcoh.head()
Create a function called majority that returns a boolean value to a new column called legal_drinker (Consider majority as older than 17 years old)
def majority(x):
if x > 17:
return True
else:
return False
stud_alcoh['legal_drinker'] = stud_alcoh['age'].apply(majority)
stud_alcoh.head()
Apply multiple columns
def f(x):
return (x['G1'] + x['G2'] +x['G3'])/3
df.apply(f, axis=1)
def f1(a, b, c):
return (a+b+c)/3
df.apply(lambda x: f1(x['G1'], x['G2'], x['G3']), axis=1)
df[['G1','G2','G3']].apply(lambda x: f1(*x), axis=1)
#Matplotlib
import matplotlib.pyplot as plt
By running this special iPython command, we will be displaying plots inline:
%matplotlib inline
##Plotting
isr_covid = covid[covid.country=='Israel']
isr_covid.date = pd.to_datetime(isr_covid.date.astype(str))
isr_covid.head()
plt.figure(figsize=(12,8))
plt.plot(isr_covid['date'], isr_covid['daily_vaccinations'])
##Subplots
You can plot different things in the same figure using the subplot function. Here is an example:
import requests
from datetime import date, timedelta
today = date.today()
yesterday = today - timedelta(days=660)
country = "israel"
endpoint = f"https://api.covid19api.com/country/{country}"
params = {"from": str(yesterday), "to": str(today)}
response = requests.get(endpoint, params=params).json()
response
covid_cases = pd.DataFrame(response)
covid_cases
# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.figure(figsize=(12,14))
# Make the first plot
plt.subplot(3, 1, 1)
plt.plot(pd.to_datetime( covid_cases['Date'].astype(str)), covid_cases['Confirmed'])
plt.title('Confirmed Cases')
# Set the second subplot as active, and make the second plot.
plt.subplot(3, 1, 2)
plt.plot(pd.to_datetime( covid_cases['Date'].astype(str)), covid_cases['Deaths'])
plt.title('Deaths')
# Set the second subplot as active, and make the second plot.
plt.subplot(3, 1, 3)
plt.plot(pd.to_datetime( covid_cases['Date'].astype(str)), covid_cases['Active'])
plt.title('Active')
# Show the figure.
plt.show()
You can read much more about the subplot function in the documentation.