Pandas

Pandas#

Earlier#

Numpy
Scipy
Matplotlib

Today#

Pandas

#Pandas

Learning Objectives:

Gain an introduction to the DataFrame and Series data structures of the pandas library
Access and manipulate data within a DataFrame and Series
Import CSV data into a pandas DataFrame

pandas is a column-oriented data analysis API. It’s a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs. Although a comprehensive introduction to the pandas API would span many pages, the core concepts are fairly straightforward, and we’ll present them below. For a more complete reference, the pandas docs site contains extensive documentation and many tutorials.

##Setup

Install

!pip install pandas

Requirement already satisfied: pandas in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (2.2.3)
Requirement already satisfied: numpy>=1.22.4 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2.0.2)

Requirement already satisfied: python-dateutil>=2.8.2 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2025.1)
Requirement already satisfied: tzdata>=2022.7 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from pandas) (2025.1)
Requirement already satisfied: six>=1.5 in /Users/raminasser/anaconda3/envs/software_project/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas) (1.17.0)

Import - External Package

import pandas as pd

The primary data structures in pandas are implemented as two classes:

DataFrame, which you can imagine as a relational data table, with rows and named columns.
Series, which is a single column. A DataFrame contains one or more Series and a name for each Series.

The data frame is a commonly used abstraction for data manipulation. Similar implementations exist in Spark and R.

##Series

One way to create a Series is to construct a Series object. For example:

pd.Series(['San Francisco', 'San Jose', 'Sacramento'])

  San Francisco
       San Jose
     Sacramento
dtype: object

Creating Data#

DataFrame objects can be created by passing a dict mapping string column names to their respective Series. If the Series don’t match in length, missing values are filled with special NA/NaN values. Example:

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 1015785, 485199])
pd.DataFrame({ 'City name': city_names, 'Population': population })

	City name	Population
0	San Francisco	852469
1	San Jose	1015785
2	Sacramento	485199

Missing Values:

city_names = pd.Series(['San Francisco', 'San Jose', 'Sacramento'])
population = pd.Series([852469, 485199])
pd.DataFrame({ 'City name': city_names, 'Population': population })

	City name	Population
0	San Francisco	852469.0
1	San Jose	485199.0
2	Sacramento	NaN

A DataFrame is a table(excel sheet). It contains an array of individual entries, each of which has a certain value. Each entry corresponds to a row (or record) and a column.

For example, consider the following simple DataFrame:

pd.DataFrame({'A': [50, 21], 'B': [131, 2]})

	A	B
0	50	131
1	21	2

The dictionary-list constructor assigns values to the column labels, but just uses an ascending count from 0 (0, 1, 2, 3, …) for the row labels. Sometimes this is OK, but oftentimes we will want to assign these labels ourselves.

The list of row labels used in a DataFrame is known as an Index. We can assign values to it by using an index parameter in our constructor:

# Student list
students = pd.DataFrame({'Name': ['Ross', 'Monica','Chandler'],
              'Last Name': ['Geller', 'Geller','Bing'],
              'GPA': ['98', '99','90']},
               index=['id1', 'id2','id3'])
students

	Name	Last Name	GPA
id1	Ross	Geller	98
id2	Monica	Geller	99
id3	Chandler	Bing	90

students['Name']['id1'] = 'Joey'
students

/var/folders/vp/68ys1b7x6gs927m8j64n1z8m0000gn/T/ipykernel_25218/424575284.py:1: FutureWarning: ChainedAssignmentError: behaviour will change in pandas 3.0!
You are setting values through chained assignment. Currently this works in certain cases, but when using Copy-on-Write (which will become the default behaviour in pandas 3.0) this will never work to update the original DataFrame or Series, because the intermediate object on which we are setting values will behave as a copy.
A typical example is when you are setting values in a column of a DataFrame, like:

df["col"][row_indexer] = value

Use `df.loc[row_indexer, "col"] = values` instead, to perform the assignment in a single step and ensure this keeps updating the original `df`.

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

  students['Name']['id1'] = 'Joey'

	Name	Last Name	GPA
id1	Joey	Geller	98
id2	Monica	Geller	99
id3	Chandler	Bing	90

##Reading Data

Being able to create a DataFrame or Series by hand is handy. But, most of the time, we won’t actually be creating our own data by hand. Instead, we’ll be working with data that already exists.

Data can be stored in any of a number of different forms and formats. By far the most basic of these is the humble CSV file. When you open a CSV file you get something that looks like this:

A,B,C,
30,21,9,
35,34,1,
41,11,11

So a CSV file is a table of values separated by commas. Hence the name: “Comma-Separated Values”, or CSV.

Let’s now set aside our toy datasets and see what a real dataset looks like when we read it into a DataFrame. We’ll use the pd.read_csv() function to read the data into a DataFrame.

Reading from web directly:

houses = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv")

We can use the shape attribute to check how large the resulting DataFrame is:

houses.shape

(17000, 9)

We can examine the contents of the resultant DataFrame using the head()

command, which grabs the first five rows:

# arg the desired number of rows
houses.head(8)

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
0	-114.31	34.19	15.0	5612.0	1283.0	1015.0	472.0	1.4936	66900.0
1	-114.47	34.40	19.0	7650.0	1901.0	1129.0	463.0	1.8200	80100.0
2	-114.56	33.69	17.0	720.0	174.0	333.0	117.0	1.6509	85700.0
3	-114.57	33.64	14.0	1501.0	337.0	515.0	226.0	3.1917	73400.0
4	-114.57	33.57	20.0	1454.0	326.0	624.0	262.0	1.9250	65500.0
5	-114.58	33.63	29.0	1387.0	236.0	671.0	239.0	3.3438	74000.0
6	-114.58	33.61	25.0	2907.0	680.0	1841.0	633.0	2.6768	82400.0
7	-114.59	34.83	41.0	812.0	168.0	375.0	158.0	1.7083	48500.0

Statistical overview of the dataset:

houses.describe()

	longitude	latitude	housing_median_age	total_rooms	total_bedrooms	population	households	median_income	median_house_value
count	17000.000000	17000.000000	17000.000000	17000.000000	17000.000000	17000.000000	17000.000000	17000.000000	17000.000000
mean	-119.562108	35.625225	28.589353	2643.664412	539.410824	1429.573941	501.221941	3.883578	207300.912353
std	2.005166	2.137340	12.586937	2179.947071	421.499452	1147.852959	384.520841	1.908157	115983.764387
min	-124.350000	32.540000	1.000000	2.000000	1.000000	3.000000	1.000000	0.499900	14999.000000
25%	-121.790000	33.930000	18.000000	1462.000000	297.000000	790.000000	282.000000	2.566375	119400.000000
50%	-118.490000	34.250000	29.000000	2127.000000	434.000000	1167.000000	409.000000	3.544600	180400.000000
75%	-118.000000	37.720000	37.000000	3151.250000	648.250000	1721.000000	605.250000	4.767000	265000.000000
max	-114.310000	41.950000	52.000000	37937.000000	6445.000000	35682.000000	6082.000000	15.000100	500001.000000

Reading from local storage:

! wget https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv
tips = pd.read_csv('tips.csv')

--2025-05-06 19:45:50--  https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... 

connected.

HTTP request sent, awaiting response...

200 OK
Length: 9729 (9.5K) [text/plain]
Saving to: ‘tips.csv.3’


tips.csv.3            0%[                    ]       0  --.-KB/s               
tips.csv.3          100%[===================>]   9.50K  --.-KB/s    in 0.001s  

2025-05-06 19:45:51 (12.7 MB/s) - ‘tips.csv.3’ saved [9729/9729]

tips.sample(5)

	total_bill	tip	sex	smoker	day	time	size
12	15.42	1.57	Male	No	Sun	Dinner	2
240	27.18	2.00	Female	Yes	Sat	Dinner	2
222	8.58	1.92	Male	Yes	Fri	Lunch	1
6	8.77	2.00	Male	No	Sun	Dinner	2
87	18.28	4.00	Male	No	Thur	Lunch	2

tips[tips['day'] == 'Sat'].head(5)

	total_bill	tip	sex	smoker	day	time	size
19	20.65	3.35	Male	No	Sat	Dinner	3
20	17.92	4.08	Male	No	Sat	Dinner	2
21	20.29	2.75	Female	No	Sat	Dinner	2
22	15.77	2.23	Female	No	Sat	Dinner	2
23	39.42	7.58	Male	No	Sat	Dinner	4

##Accessing Data

Selecting specific values of a pandas DataFrame or Series to work on is an implicit step in almost any data operation you’ll run, so one of the first things you need to learn in working with data in Python is how to go about selecting the data points relevant to you quickly and effectively.

Each column is a property of the dataframe object and we can get columns using Python property access:

tips.total_bill

    16.99
    10.34
    21.01
    23.68
    24.59
       ...  
  29.03
  27.18
  22.67
  17.82
  18.78
Name: total_bill, Length: 244, dtype: float64

If we have a Python dictionary, we can access its values using the indexing ([]) operator. We can do the same with columns in a DataFrame:

tips['total_bill']

    16.99
    10.34
    21.01
    23.68
    24.59
       ...  
  29.03
  27.18
  22.67
  17.82
  18.78
Name: total_bill, Length: 244, dtype: float64

These are the two ways of selecting a specific Series out of a DataFrame. Neither of them is more or less syntactically valid than the other, but the indexing operator [] does have the advantage that it can handle column names with reserved characters in them (e.g. if we had a total bill column, tips.total bill wouldn’t work).

Doesn’t a pandas Series look kind of like a fancy dictionary? It pretty much is, so it’s no surprise that, to drill down to a single specific value, we need only use the indexing operator [] once more:

type(tips['time'])

pandas.core.series.Series

tips['time'][200]

'Lunch'

tips['time'].unique()

array(['Dinner', 'Lunch'], dtype=object)

Indexing#

The indexing operator and attribute selection are nice because they work just like they do in the rest of the Python ecosystem. As a novice, this makes them easy to pick up and use. However, pandas has its own accessor operators, loc and iloc. For more advanced operations, these are the ones you’re supposed to be using.

Index-based selection Pandas indexing works in one of two paradigms. The first is index-based selection: selecting data based on its numerical position in the data. iloc follows this paradigm.

To select the second last row of data in a DataFrame, we may use the following:

tips[-2]

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File ~/anaconda3/envs/software_project/lib/python3.9/site-packages/pandas/core/indexes/base.py:3805, in Index.get_loc(self, key)
   3804 try:
-> 3805     return self._engine.get_loc(casted_key)
   3806 except KeyError as err:

File index.pyx:167, in pandas._libs.index.IndexEngine.get_loc()

File index.pyx:196, in pandas._libs.index.IndexEngine.get_loc()

File pandas/_libs/hashtable_class_helper.pxi:7081, in pandas._libs.hashtable.PyObjectHashTable.get_item()

File pandas/_libs/hashtable_class_helper.pxi:7089, in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: -2

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
Cell In[21], line 1
----> 1 tips[-2]

File ~/anaconda3/envs/software_project/lib/python3.9/site-packages/pandas/core/frame.py:4102, in DataFrame.__getitem__(self, key)
   4100 if self.columns.nlevels > 1:
   4101     return self._getitem_multilevel(key)
-> 4102 indexer = self.columns.get_loc(key)
   4103 if is_integer(indexer):
   4104     indexer = [indexer]

File ~/anaconda3/envs/software_project/lib/python3.9/site-packages/pandas/core/indexes/base.py:3812, in Index.get_loc(self, key)
   3807     if isinstance(casted_key, slice) or (
   3808         isinstance(casted_key, abc.Iterable)
   3809         and any(isinstance(x, slice) for x in casted_key)
   3810     ):
   3811         raise InvalidIndexError(key)
-> 3812     raise KeyError(key) from err
   3813 except TypeError:
   3814     # If we have a listlike key, _check_indexing_error will raise
   3815     #  InvalidIndexError. Otherwise we fall through and re-raise
   3816     #  the TypeError.
   3817     self._check_indexing_error(key)

KeyError: -2

tips.iloc[-2]

Both loc and iloc are row-first, column-second. To get a column with iloc, we can do the following:

tips.iloc[:,0]

select rows and columns:

tips.iloc[103:107,0]

Label-based selection. The second paradigm for attribute selection is the one followed by the loc operator: label-based selection. In this paradigm, it’s the data index value, not its position, which matters.

tips.loc[103:107,'tip']

tips.sample(5)

Change the index

df = tips.set_index('day')

df.sample(7)

df.loc['Sun','total_bill']

More info in the documentation indexing and selection.

Manipulating Data#

You may apply Python’s basic arithmetic operations to Series. For example:

tips.head()

tips['new_calc'] = tips.total_bill / 10

tips.head()

Operation between columns:

tips.tip / tips.total_bill

calculating new column:

import numpy as np
tips['log_total'] = np.log(tips['total_bill'])

tips.head(5)

summarizing data

tips.tip.value_counts()

##Filtering

So far we’ve been indexing various strides of data, using structural properties of the DataFrame itself. To do interesting things with the data, however, we often need to ask questions based on conditions.

For example, suppose that we’re interested specifically in lunch times.

tips[tips['time'] == 'Lunch']

If we want to ask more questions, for example parties that had lunch and paid a tip higher than 5, we can use the ‘and’ operator:

tips[(tips['time'] == 'Lunch') & (tips['tip'] > 5)]

We can build any query that concerns us as in any database, this can be achieved using basic conditional operators (and, or …)

For multiple selection instead of using many ‘or’ operators we use isin() function that pandas provide:

tips[(tips['size'].isin([2, 3])) & (tips['tip'] > 5)]

Data cleaning, in modern apps and researches many time we have missing data due to technical or buisiness issues, in clinical research patient might not answer some question, it’s up to the researcher to decide how to handle missing data(this is a topic in ML). Naive way is to drop missings and it’s provided by Pandas:

# we also can drop by specific column(see documentation)
tips.dropna()

Grouping and Sorting#

In addition to aggregation built-ins (count(), sum(), nunique()…) we can group the data based on splitting the values of some index, then apply a calculation on it and presenting it the result in a combined matter:

avg_tip = tips.groupby(['smoker']).tip.mean()
avg_tip

avg_tip = tips.groupby(['day']).tip.mean()
avg_tip.sort_values(ascending=False)

Now, who is the higher payer ? Sorting dataset by the tip:

Sort by column to get the highest tipper:

tips.sort_values(by='tip', ascending=False)

Multiple group by:

tips.groupby(['smoker','day']).tip.mean()

Combining Data#

!pip install Faker
from faker import Faker
fake = Faker()

import numpy as np
import pandas as pd

ids = np.random.randint(1000, 2000, 10)
names = [fake.name() for i in range(10)]
ages = np.random.randint(18, 70, 10)
address = [fake.address() for i in range(10)]
company = [fake.company() for i in range(10)]
df_customer = pd.DataFrame({
    'id': ids,
    'name': names,
})
df_info = pd.DataFrame({
    'id': ids[:8],
    'age': ages[:8],
    'address': address[:8],
    'company': company[:8]
})

df_customer

df_info

Join/Merge#

Combining columns from different dataframes based on a key:

pd.merge(df_customer, df_info,on='id')

pd.merge(df_customer, df_info,how='left',on='id')

Union#

Assuming we have another store with it’s clients as follow:

ids = np.random.randint(1000, 2000, 4)
names = [fake.name() for i in range(4)]
df_customer_1 = pd.DataFrame({
    'id': ids,
    'name': names,
})

df_customer_1

We want to combine the rows of both tables to one table, this is called union in SQL and concat in Python:

pd.concat([df_customer, df_customer_1])
#.loc[2], iloc[2]

Apply#

Manipulate columns and rows in a DataFrame with custom functions.

csv_url = 'https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/04_Apply/Students_Alcohol_Consumption/student-mat.csv'
df = pd.read_csv(csv_url)
df.head()

For the purpose of this exercise slice the dataframe from ‘school’ until the ‘guardian’ column

stud_alcoh = df.loc[: , "school":"guardian"]
stud_alcoh.head()

Create a lambda function that will capitalize strings.

capitalizer = lambda x: x.capitalize()

Capitalize both Mjob and Fjob

stud_alcoh['Mjob'] = stud_alcoh['Mjob'].apply(capitalizer)
stud_alcoh['Fjob'] = stud_alcoh['Fjob'].apply(capitalizer)

stud_alcoh.head()

Create a function called majority that returns a boolean value to a new column called legal_drinker (Consider majority as older than 17 years old)

def majority(x):
    if x > 17:
        return True
    else:
        return False

stud_alcoh['legal_drinker'] = stud_alcoh['age'].apply(majority)
stud_alcoh.head()

Apply multiple columns

def f(x):
   return (x['G1'] + x['G2'] +x['G3'])/3

 df.apply(f, axis=1)

def f1(a, b, c):
   return (a+b+c)/3

df.apply(lambda x: f1(x['G1'], x['G2'], x['G3']), axis=1)

df[['G1','G2','G3']].apply(lambda x: f1(*x), axis=1)

#Matplotlib

import matplotlib.pyplot as plt

By running this special iPython command, we will be displaying plots inline:

%matplotlib inline

##Plotting

isr_covid = covid[covid.country=='Israel']
isr_covid.date = pd.to_datetime(isr_covid.date.astype(str))
isr_covid.head()

plt.figure(figsize=(12,8))
plt.plot(isr_covid['date'], isr_covid['daily_vaccinations'])

##Subplots

You can plot different things in the same figure using the subplot function. Here is an example:

import requests
from datetime import date, timedelta

today = date.today()
yesterday = today - timedelta(days=660)
country = "israel"
endpoint = f"https://api.covid19api.com/country/{country}"
params = {"from": str(yesterday), "to": str(today)}
response = requests.get(endpoint, params=params).json()

response

covid_cases = pd.DataFrame(response)

covid_cases

# Set up a subplot grid that has height 2 and width 1,
# and set the first such subplot as active.
plt.figure(figsize=(12,14))

# Make the first plot
plt.subplot(3, 1, 1)
plt.plot(pd.to_datetime( covid_cases['Date'].astype(str)), covid_cases['Confirmed'])
plt.title('Confirmed Cases')

# Set the second subplot as active, and make the second plot.
plt.subplot(3, 1, 2)
plt.plot(pd.to_datetime( covid_cases['Date'].astype(str)), covid_cases['Deaths'])
plt.title('Deaths')

# Set the second subplot as active, and make the second plot.
plt.subplot(3, 1, 3)
plt.plot(pd.to_datetime( covid_cases['Date'].astype(str)), covid_cases['Active'])
plt.title('Active')

# Show the figure.
plt.show()

You can read much more about the subplot function in the documentation.