Pandas - Basic
Create a DataFrame from List 从列表创建DataFrame
Code:
import pandas as pd
def createDataframe(student_data: List[List[int]]) -> pd.DataFrame:
result = pd.DataFrame(student_data, columns = ['student_id','age'])
return resultExample 1:
Input:
student_data:
[
[1, 15],
[2, 11],
[3, 11],
[4, 20]
]
Output:
+------------+-----+
| student_id | age |
+------------+-----+
| 1 | 15 |
| 2 | 11 |
| 3 | 11 |
| 4 | 20 |
+------------+-----+Display the First Three Rows 显示前三行
Code:
import pandas as pd
def selectFirstRows(employees: pd.DataFrame) -> pd.DataFrame:
return employees.head(3)Example 1:
Input:
DataFrame employees
+-------------+-----------+-----------------------+--------+
| employee_id | name | department | salary |
+-------------+-----------+-----------------------+--------+
| 3 | Bob | Operations | 48675 |
| 90 | Alice | Sales | 11096 |
| 9 | Tatiana | Engineering | 33805 |
| 60 | Annabelle | InformationTechnology | 37678 |
| 49 | Jonathan | HumanResources | 23793 |
| 43 | Khaled | Administration | 40454 |
+-------------+-----------+-----------------------+--------+
Output:
+-------------+---------+-------------+--------+
| employee_id | name | department | salary |
+-------------+---------+-------------+--------+
| 3 | Bob | Operations | 48675 |
| 90 | Alice | Sales | 11096 |
| 9 | Tatiana | Engineering | 33805 |
+-------------+---------+-------------+--------+Select Data 选择数据
loc函数:是一个非常强大的索引工具,用于通过标签(标签名、行或列的名称)进行数据选择和操作。loc函数允许你根据行和列的标签来访问或修改DataFrame中的数据。
Code:
import pandas as pd
def selectData(students: pd.DataFrame) -> pd.DataFrame:
return students.loc[students['student_id'] == 101, ['name','age']]Example 1:
Input:
+------------+---------+-----+
| student_id | name | age |
+------------+---------+-----+
| 101 | Ulysses | 13 |
| 53 | William | 10 |
| 128 | Henry | 6 |
| 3 | Henry | 11 |
+------------+---------+-----+
Output:
+---------+-----+
| name | age |
+---------+-----+
| Ulysses | 13 |
+---------+-----+Create a New Column 创建一个新列
Code:
import pandas as pd
def createBonusColumn(employees: pd.DataFrame) -> pd.DataFrame:
employees['bonus'] = employees['salary'] * 2
return employeesExample 1:
Input:
DataFrame employees
+---------+--------+
| name | salary |
+---------+--------+
| Piper | 4548 |
| Grace | 28150 |
| Georgia | 1103 |
| Willow | 6593 |
| Finn | 74576 |
| Thomas | 24433 |
+---------+--------+
Output:
+---------+--------+--------+
| name | salary | bonus |
+---------+--------+--------+
| Piper | 4548 | 9096 |
| Grace | 28150 | 56300 |
| Georgia | 1103 | 2206 |
| Willow | 6593 | 13186 |
| Finn | 74576 | 149152 |
| Thomas | 24433 | 48866 |
+---------+--------+--------+Drop Duplicate Rows 删除重复行
drop_duplicates函数: drop_duplicates函数是 pandas 库中 DataFrame 对象的一个方法。其目的是丢弃重复的记录,您可以指定将记录视为重复的标准
Code:
import pandas as pd
def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame
customers.drop_duplicates(subset='email', keep='first', inplace=True)
return customersExample 1:
Input:
+-------------+---------+---------------------+
| customer_id | name | email |
+-------------+---------+---------------------+
| 1 | Ella | emily@example.com |
| 2 | David | michael@example.com |
| 3 | Zachary | sarah@example.com |
| 4 | Alice | john@example.com |
| 5 | Finn | john@example.com |
| 6 | Violet | alice@example.com |
+-------------+---------+---------------------+
Output:
+-------------+---------+---------------------+
| customer_id | name | email |
+-------------+---------+---------------------+
| 1 | Ella | emily@example.com |
| 2 | David | michael@example.com |
| 3 | Zachary | sarah@example.com |
| 4 | Alice | john@example.com |
| 6 | Violet | alice@example.com |
+-------------+---------+---------------------+Drop Missing Data 删除缺失数据
dropna函数: dropna函数属于 pandas DataFrame,用于删除缺失值。在 pandas 中,缺失数据通常用NaN(Not a Number 的缩写)值表示,但在你的示例中,它显示为None,这也被 pandas 视为缺失值
Code:
import pandas as pd
def dropMissingData(students:pd.DataFrame) -> pd.DataFrame:
students.dropna(how='any',subset='name', inplace=True)
return studentsExample 1:
Input:
+------------+---------+-----+
| student_id | name | age |
+------------+---------+-----+
| 32 | Piper | 5 |
| 217 | None | 19 |
| 779 | Georgia | 20 |
| 849 | Willow | 14 |
+------------+---------+-----+
Output:
+------------+---------+-----+
| student_id | name | age |
+------------+---------+-----+
| 32 | Piper | 5 |
| 779 | Georgia | 20 |
| 849 | Willow | 14 |
+------------+---------+-----+Modify Columns 修改列
Code:
import pandas as pd
def modifySalaryColumn(employees: pd.DataFrame) -> pd.DataFrame:
employees['salary'] = employees['salary'] * 2
return employeesExample 1:
Input:
DataFrame employees
+---------+--------+
| name | salary |
+---------+--------+
| Jack | 19666 |
| Piper | 74754 |
| Mia | 62509 |
| Ulysses | 54866 |
+---------+--------+
Output:
+---------+--------+
| name | salary |
+---------+--------+
| Jack | 39332 |
| Piper | 149508 |
| Mia | 125018 |
| Ulysses | 109732 |
+---------+--------+Rename Columns 重名列
rename函数:当需要重命名列名或索引名时,pandas 中的rename函数是一个非常有用的工具。
Code:
import pandas as pd
def renameColumns(students: pd.DataFrame) -> pd.DataFrame:
students = students.rename(columns={
"id": "student_id",
"first": "first_name",
"last": "last_name",
"age": "age_in_years",
}
)
return studentsExample 1:
Input:
+----+---------+----------+-----+
| id | first | last | age |
+----+---------+----------+-----+
| 1 | Mason | King | 6 |
| 2 | Ava | Wright | 7 |
| 3 | Taylor | Hall | 16 |
| 4 | Georgia | Thompson | 18 |
| 5 | Thomas | Moore | 10 |
+----+---------+----------+-----+
Output:
+------------+------------+-----------+--------------+
| student_id | first_name | last_name | age_in_years |
+------------+------------+-----------+--------------+
| 1 | Mason | King | 6 |
| 2 | Ava | Wright | 7 |
| 3 | Taylor | Hall | 16 |
| 4 | Georgia | Thompson | 18 |
| 5 | Thomas | Moore | 10 |
+------------+------------+-----------+--------------+Change Data Type 改变数据类型
astype函数: astype函数用于将 pandas 对象转换为指定的 dtype(数据类型)。astype函数不会在原处修改原始 DataFrame。相反,它会返回一个带有指定数据类型更改的新 DataFrame。如果要反映原始 DataFrame 中的变化,则需要将结果重新赋值给原始 DataFrame 或使用相应的复制参数。该函数的语法如下
Code:
import pandas as pd
def changeDatatype(students: pd.DataFrame) -> pd.DataFrame:
students = students.astype({'grade': int})
return studentsExample 1:
Input:
DataFrame students:
+------------+------+-----+-------+
| student_id | name | age | grade |
+------------+------+-----+-------+
| 1 | Ava | 6 | 73.0 |
| 2 | Kate | 15 | 87.0 |
+------------+------+-----+-------+
Output:
+------------+------+-----+-------+
| student_id | name | age | grade |
+------------+------+-----+-------+
| 1 | Ava | 6 | 73 |
| 2 | Kate | 15 | 87 |
+------------+------+-----+-------+Fill Missing Data 填充缺失数据
fillna 函数: fillna是 pandas 库中的一个函数,主要用于 pandas Series 和 DataFrame 对象。它允许您使用指定的方法填充 NA/NaN 值。在本例中,我们使用它来替换None(或通常数据帧表示中的NaN)值。
Code:
import pandas as pd
def fillMissingValues(products: pd.DataFrame) -> pd.DataFrame:
products['quantity'].fillna(0, inplace=True)
return productsExample 1:
Input:+-----------------+----------+-------+
| name | quantity | price |
+-----------------+----------+-------+
| Wristwatch | None | 135 |
| WirelessEarbuds | None | 821 |
| GolfClubs | 779 | 9319 |
| Printer | 849 | 3051 |
+-----------------+----------+-------+
Output:
+-----------------+----------+-------+
| name | quantity | price |
+-----------------+----------+-------+
| Wristwatch | 0 | 135 |
| WirelessEarbuds | 0 | 821 |
| GolfClubs | 779 | 9319 |
| Printer | 849 | 3051 |
+-----------------+----------+-------+Reshape Data: Concatenate 重塑数据: 连接
pd.concat():pandas 中的一个便捷函数,用于纵向(按行)或横向(按列)连接 DataFrames。objs参数是要连接的系列或数据帧对象的序列或映射axis参数决定连接的方向:axis=0设置为默认值,这意味着它将垂直(按行)连接 DataFrames。
import pandas as pd
def concatenateTables(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
return pd.concat([df1,df2], axis=0)Example 1:
Input:
df1
+------------+---------+-----+
| student_id | name | age |
+------------+---------+-----+
| 1 | Mason | 8 |
| 2 | Ava | 6 |
| 3 | Taylor | 15 |
| 4 | Georgia | 17 |
+------------+---------+-----+
df2
+------------+------+-----+
| student_id | name | age |
+------------+------+-----+
| 5 | Leo | 7 |
| 6 | Alex | 7 |
+------------+------+-----+
Output:
+------------+---------+-----+
| student_id | name | age |
+------------+---------+-----+
| 1 | Mason | 8 |
| 2 | Ava | 6 |
| 3 | Taylor | 15 |
| 4 | Georgia | 17 |
| 5 | Leo | 7 |
| 6 | Alex | 7 |
+------------+---------+-----+Reshape Data: Pivot 重塑数据: 透视
Pivot函数:pandas 中的枢轴函数用于根据列值重塑数据并从中获取新的 DataFrame。pivot采用以下参数,我们将利用它们:index: 确定新 DataFrame 中的行数columns:确定新 DataFrame 中的列values: 指定重塑表格时使用的值
Code:
import pandas as pd
def pivotTable(weather: pd.DataFrame) -> pd.DataFrame:
ans = weather.pivot(index='month', columns='city', values='temperature')
return ansExample 1:
Input:
+--------------+----------+-------------+
| city | month | temperature |
+--------------+----------+-------------+
| Jacksonville | January | 13 |
| Jacksonville | February | 23 |
| Jacksonville | March | 38 |
| Jacksonville | April | 5 |
| Jacksonville | May | 34 |
| ElPaso | January | 20 |
| ElPaso | February | 6 |
| ElPaso | March | 26 |
| ElPaso | April | 2 |
| ElPaso | May | 43 |
+--------------+----------+-------------+
Output:
+----------+--------+--------------+
| month | ElPaso | Jacksonville |
+----------+--------+--------------+
| April | 2 | 5 |
| February | 6 | 23 |
| January | 20 | 13 |
| March | 26 | 38 |
| May | 43 | 34 |
+----------+--------+--------------+Reshape Data: Melt 重塑数据: 熔化
melt函数:pandas 的melt函数用于转换或重塑数据。它将 DataFrame 从列代表多个变量的宽格式转换为每行代表一个唯一变量的长格式。在我们的例子中,我们要将销售数据从每季度有单独的列转换为季度有一列、销售值有一列的格式。id_vars:指定保持不变的列value_vars:这指定了我们要 "熔化 "或重塑为行的列var_name:这是新列的名称,用于存储value_vars中的标题名称value_name:这是新列的名称,用于存储value_vars中的值
Code:
import pandas as pd
def meltTable(report: pd.DataFrame) -> pd.DataFrame:
report = report.melt(
id_vars = ['product'],
value_vars = ['quarter_1','quarter_2','quarter_3','quarter_4'],
var_name = "quarter",
value_name = "sales"
)
return reportExample 1:
Input:
+-------------+-----------+-----------+-----------+-----------+
| product | quarter_1 | quarter_2 | quarter_3 | quarter_4 |
+-------------+-----------+-----------+-----------+-----------+
| Umbrella | 417 | 224 | 379 | 611 |
| SleepingBag | 800 | 936 | 93 | 875 |
+-------------+-----------+-----------+-----------+-----------+
Output:
+-------------+-----------+-------+
| product | quarter | sales |
+-------------+-----------+-------+
| Umbrella | quarter_1 | 417 |
| SleepingBag | quarter_1 | 800 |
| Umbrella | quarter_2 | 224 |
| SleepingBag | quarter_2 | 936 |
| Umbrella | quarter_3 | 379 |
| SleepingBag | quarter_3 | 93 |
| Umbrella | quarter_4 | 611 |
| SleepingBag | quarter_4 | 875 |
+-------------+-----------+-------+Method Chaining 方法链
Filtering Data
Boolean Indexing: 使用布尔条件从数据帧中筛选行。
Sorting Data
sort_values()方法:用于根据一列或多列对数据进行排序by='weight':我们指定要根据权重列进行排序ascending=False:将此参数设置为False,表示我们希望按降序(从最重到最轻)排序
Column Selection
Subset Selection:过滤和排序后,我们从数据帧中选择列的子集
Method Chaining
Chaining Operations:通过用点连接方法,在一行中对 DataFrame 执行多个操作。这是 pandas 的一个强大功能,它可以使代码变得简洁,但对于新手来说,阅读起来可能比较复杂
#Code1:
import pandas as pd
def findHeavyAnimals(animals: pd.DataFrame) -> pd.DataFrame:
filtered_animals = animals[animals['weight'] > 100]
sorted_animals = filtered_animals.sort_values(by='weight', ascending=False)
name = sorted_animals[['name']]
return name
#Code2:
import pandas as pd
def findHeavyAnimals(animals: pd.DataFrame) -> pd.DataFrame:
return animals[animals['weight'] > 100].sort_values(by='weight', ascending=False)[['name']]Example 1:
Input:
DataFrame animals:
+----------+---------+-----+--------+
| name | species | age | weight |
+----------+---------+-----+--------+
| Tatiana | Snake | 98 | 464 |
| Khaled | Giraffe | 50 | 41 |
| Alex | Leopard | 6 | 328 |
| Jonathan | Monkey | 45 | 463 |
| Stefan | Bear | 100 | 50 |
| Tommy | Panda | 26 | 349 |
+----------+---------+-----+--------+
Output:
+----------+
| name |
+----------+
| Tatiana |
| Jonathan |
| Tommy |
| Alex |
+----------+Last updated
Was this helpful?