Pandas - Basic

Key Concepts 主要概念

DataFrame:类似电子表格或 SQL 表格的二维表格结构。每一行代表一条记录,每一列代表不同的属性。它的大小是可变的,旨在处理不同类型数据的混合。

Create a DataFrame from List 从列表创建DataFrame
Code:
import pandas as pd

def createDataframe(student_data: List[List[int]]) -> pd.DataFrame:
    result = pd.DataFrame(student_data, columns = ['student_id','age'])
    return result
Example 1:

Input:
student_data:
[
  [1, 15],
  [2, 11],
  [3, 11],
  [4, 20]
]
Output:
+------------+-----+
| student_id | age |
+------------+-----+
| 1          | 15  |
| 2          | 11  |
| 3          | 11  |
| 4          | 20  |
+------------+-----+
Display the First Three Rows 显示前三行
Code:
import pandas as pd

def selectFirstRows(employees: pd.DataFrame) -> pd.DataFrame:
    return employees.head(3)
Example 1:

Input:
DataFrame employees
+-------------+-----------+-----------------------+--------+
| employee_id | name      | department            | salary |
+-------------+-----------+-----------------------+--------+
| 3           | Bob       | Operations            | 48675  |
| 90          | Alice     | Sales                 | 11096  |
| 9           | Tatiana   | Engineering           | 33805  |
| 60          | Annabelle | InformationTechnology | 37678  |
| 49          | Jonathan  | HumanResources        | 23793  |
| 43          | Khaled    | Administration        | 40454  |
+-------------+-----------+-----------------------+--------+
Output:
+-------------+---------+-------------+--------+
| employee_id | name    | department  | salary |
+-------------+---------+-------------+--------+
| 3           | Bob     | Operations  | 48675  |
| 90          | Alice   | Sales       | 11096  |
| 9           | Tatiana | Engineering | 33805  |
+-------------+---------+-------------+--------+
Select Data 选择数据

loc函数:是一个非常强大的索引工具,用于通过标签(标签名、行或列的名称)进行数据选择和操作。loc函数允许你根据行和列的标签来访问或修改DataFrame中的数据。

Code:
import pandas as pd

def selectData(students: pd.DataFrame) -> pd.DataFrame:
    return students.loc[students['student_id'] == 101, ['name','age']]
Example 1:
Input:
+------------+---------+-----+
| student_id | name    | age |
+------------+---------+-----+
| 101        | Ulysses | 13  |
| 53         | William | 10  |
| 128        | Henry   | 6   |
| 3          | Henry   | 11  |
+------------+---------+-----+
Output:
+---------+-----+
| name    | age |
+---------+-----+
| Ulysses | 13  |
+---------+-----+
Create a New Column 创建一个新列
Code:
import pandas as pd

def createBonusColumn(employees: pd.DataFrame) -> pd.DataFrame:
    employees['bonus'] = employees['salary'] * 2
    return employees
Example 1:

Input:
DataFrame employees
+---------+--------+
| name    | salary |
+---------+--------+
| Piper   | 4548   |
| Grace   | 28150  |
| Georgia | 1103   |
| Willow  | 6593   |
| Finn    | 74576  |
| Thomas  | 24433  |
+---------+--------+
Output:
+---------+--------+--------+
| name    | salary | bonus  |
+---------+--------+--------+
| Piper   | 4548   | 9096   |
| Grace   | 28150  | 56300  |
| Georgia | 1103   | 2206   |
| Willow  | 6593   | 13186  |
| Finn    | 74576  | 149152 |
| Thomas  | 24433  | 48866  |
+---------+--------+--------+
Drop Duplicate Rows 删除重复行

drop_duplicates函数: drop_duplicates函数是 pandas 库中 DataFrame 对象的一个方法。其目的是丢弃重复的记录,您可以指定将记录视为重复的标准

Code:
import pandas as pd

def dropDuplicateEmails(customers: pd.DataFrame) -> pd.DataFrame
	customers.drop_duplicates(subset='email', keep='first', inplace=True)
	return customers
Example 1:
Input:
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 5           | Finn    | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+
Output:  
+-------------+---------+---------------------+
| customer_id | name    | email               |
+-------------+---------+---------------------+
| 1           | Ella    | emily@example.com   |
| 2           | David   | michael@example.com |
| 3           | Zachary | sarah@example.com   |
| 4           | Alice   | john@example.com    |
| 6           | Violet  | alice@example.com   |
+-------------+---------+---------------------+
Drop Missing Data 删除缺失数据

dropna函数: dropna函数属于 pandas DataFrame,用于删除缺失值。在 pandas 中,缺失数据通常用NaN(Not a Number 的缩写)值表示,但在你的示例中,它显示为None,这也被 pandas 视为缺失值

Code:
import pandas as pd
def dropMissingData(students:pd.DataFrame) -> pd.DataFrame:
   students.dropna(how='any',subset='name', inplace=True)
   return students
Example 1:

Input:
+------------+---------+-----+
| student_id | name    | age |
+------------+---------+-----+
| 32         | Piper   | 5   |
| 217        | None    | 19  |
| 779        | Georgia | 20  |
| 849        | Willow  | 14  |
+------------+---------+-----+
Output:
+------------+---------+-----+
| student_id | name    | age |
+------------+---------+-----+
| 32         | Piper   | 5   |
| 779        | Georgia | 20  | 
| 849        | Willow  | 14  | 
+------------+---------+-----+
Modify Columns 修改列
Code:
import pandas as pd

def modifySalaryColumn(employees: pd.DataFrame) -> pd.DataFrame:
    employees['salary'] = employees['salary'] * 2
    return employees
Example 1:

Input:
DataFrame employees
+---------+--------+
| name    | salary |
+---------+--------+
| Jack    | 19666  |
| Piper   | 74754  |
| Mia     | 62509  |
| Ulysses | 54866  |
+---------+--------+
Output:
+---------+--------+
| name    | salary |
+---------+--------+
| Jack    | 39332  |
| Piper   | 149508 |
| Mia     | 125018 |
| Ulysses | 109732 |
+---------+--------+
Rename Columns 重名列

rename函数:当需要重命名列名或索引名时,pandas 中的rename函数是一个非常有用的工具。

Code:
import pandas as pd

def renameColumns(students: pd.DataFrame) -> pd.DataFrame:
    students = students.rename(columns={
        "id": "student_id",
        "first": "first_name",
        "last": "last_name",
        "age": "age_in_years",
        }
    )
    return students
Example 1:
Input:
+----+---------+----------+-----+
| id | first   | last     | age |
+----+---------+----------+-----+
| 1  | Mason   | King     | 6   |
| 2  | Ava     | Wright   | 7   |
| 3  | Taylor  | Hall     | 16  |
| 4  | Georgia | Thompson | 18  |
| 5  | Thomas  | Moore    | 10  |
+----+---------+----------+-----+
Output:
+------------+------------+-----------+--------------+
| student_id | first_name | last_name | age_in_years |
+------------+------------+-----------+--------------+
| 1          | Mason      | King      | 6            |
| 2          | Ava        | Wright    | 7            |
| 3          | Taylor     | Hall      | 16           |
| 4          | Georgia    | Thompson  | 18           |
| 5          | Thomas     | Moore     | 10           |
+------------+------------+-----------+--------------+
Change Data Type 改变数据类型

astype函数: astype函数用于将 pandas 对象转换为指定的 dtype(数据类型)。astype函数不会在原处修改原始 DataFrame。相反,它会返回一个带有指定数据类型更改的新 DataFrame。如果要反映原始 DataFrame 中的变化,则需要将结果重新赋值给原始 DataFrame 或使用相应的复制参数。该函数的语法如下

Code:
import pandas as pd

def changeDatatype(students: pd.DataFrame) -> pd.DataFrame:
    students = students.astype({'grade': int})
    return students
Example 1:
Input:
DataFrame students:
+------------+------+-----+-------+
| student_id | name | age | grade |
+------------+------+-----+-------+
| 1          | Ava  | 6   | 73.0  |
| 2          | Kate | 15  | 87.0  |
+------------+------+-----+-------+
Output:
+------------+------+-----+-------+
| student_id | name | age | grade |
+------------+------+-----+-------+
| 1          | Ava  | 6   | 73    |
| 2          | Kate | 15  | 87    |
+------------+------+-----+-------+
Fill Missing Data 填充缺失数据

fillna 函数: fillna是 pandas 库中的一个函数,主要用于 pandas Series 和 DataFrame 对象。它允许您使用指定的方法填充 NA/NaN 值。在本例中,我们使用它来替换None(或通常数据帧表示中的NaN)值。

Code:
import pandas as pd

def fillMissingValues(products: pd.DataFrame) -> pd.DataFrame:
    products['quantity'].fillna(0, inplace=True)
    return products
Example 1:
Input:+-----------------+----------+-------+
| name            | quantity | price |
+-----------------+----------+-------+
| Wristwatch      | None     | 135   |
| WirelessEarbuds | None     | 821   |
| GolfClubs       | 779      | 9319  |
| Printer         | 849      | 3051  |
+-----------------+----------+-------+
Output:
+-----------------+----------+-------+
| name            | quantity | price |
+-----------------+----------+-------+
| Wristwatch      | 0        | 135   |
| WirelessEarbuds | 0        | 821   |
| GolfClubs       | 779      | 9319  |
| Printer         | 849      | 3051  |
+-----------------+----------+-------+
Reshape Data: Concatenate 重塑数据: 连接
  • pd.concat():pandas 中的一个便捷函数,用于纵向(按行)或横向(按列)连接 DataFrames。

    • objs参数是要连接的系列或数据帧对象的序列或映射

    • axis参数决定连接的方向:axis=0设置为默认值,这意味着它将垂直(按行)连接 DataFrames。

import pandas as pd

def concatenateTables(df1: pd.DataFrame, df2: pd.DataFrame) -> pd.DataFrame:
    return pd.concat([df1,df2], axis=0)
Example 1:

Input:
df1
+------------+---------+-----+
| student_id | name    | age |
+------------+---------+-----+
| 1          | Mason   | 8   |
| 2          | Ava     | 6   |
| 3          | Taylor  | 15  |
| 4          | Georgia | 17  |
+------------+---------+-----+
df2
+------------+------+-----+
| student_id | name | age |
+------------+------+-----+
| 5          | Leo  | 7   |
| 6          | Alex | 7   |
+------------+------+-----+
Output:
+------------+---------+-----+
| student_id | name    | age |
+------------+---------+-----+
| 1          | Mason   | 8   |
| 2          | Ava     | 6   |
| 3          | Taylor  | 15  |
| 4          | Georgia | 17  |
| 5          | Leo     | 7   |
| 6          | Alex    | 7   |
+------------+---------+-----+
Reshape Data: Pivot 重塑数据: 透视
  • Pivot函数:pandas 中的枢轴函数用于根据列值重塑数据并从中获取新的 DataFrame。pivot采用以下参数,我们将利用它们:

    • index: 确定新 DataFrame 中的行数

    • columns:确定新 DataFrame 中的列

    • values: 指定重塑表格时使用的值

Code:
import pandas as pd

def pivotTable(weather: pd.DataFrame) -> pd.DataFrame:
    ans = weather.pivot(index='month', columns='city', values='temperature')
    return ans
Example 1:
Input:
+--------------+----------+-------------+
| city         | month    | temperature |
+--------------+----------+-------------+
| Jacksonville | January  | 13          |
| Jacksonville | February | 23          |
| Jacksonville | March    | 38          |
| Jacksonville | April    | 5           |
| Jacksonville | May      | 34          |
| ElPaso       | January  | 20          |
| ElPaso       | February | 6           |
| ElPaso       | March    | 26          |
| ElPaso       | April    | 2           |
| ElPaso       | May      | 43          |
+--------------+----------+-------------+
Output:
+----------+--------+--------------+
| month    | ElPaso | Jacksonville |
+----------+--------+--------------+
| April    | 2      | 5            |
| February | 6      | 23           |
| January  | 20     | 13           |
| March    | 26     | 38           |
| May      | 43     | 34           |
+----------+--------+--------------+
Reshape Data: Melt 重塑数据: 熔化
  • melt函数:pandas 的melt函数用于转换或重塑数据。它将 DataFrame 从列代表多个变量的宽格式转换为每行代表一个唯一变量的长格式。在我们的例子中,我们要将销售数据从每季度有单独的列转换为季度有一列、销售值有一列的格式。

    • id_vars:指定保持不变的列

    • value_vars:这指定了我们要 "熔化 "或重塑为行的列

    • var_name:这是新列的名称,用于存储value_vars 中的标题名称

    • value_name:这是新列的名称,用于存储value_vars 中的值

Code:
import pandas as pd

def meltTable(report: pd.DataFrame) -> pd.DataFrame:
    report = report.melt(
        id_vars = ['product'],
        value_vars = ['quarter_1','quarter_2','quarter_3','quarter_4'],
        var_name = "quarter",
        value_name = "sales"
    )
    return report
Example 1:

Input:
+-------------+-----------+-----------+-----------+-----------+
| product     | quarter_1 | quarter_2 | quarter_3 | quarter_4 |
+-------------+-----------+-----------+-----------+-----------+
| Umbrella    | 417       | 224       | 379       | 611       |
| SleepingBag | 800       | 936       | 93        | 875       |
+-------------+-----------+-----------+-----------+-----------+
Output:
+-------------+-----------+-------+
| product     | quarter   | sales |
+-------------+-----------+-------+
| Umbrella    | quarter_1 | 417   |
| SleepingBag | quarter_1 | 800   |
| Umbrella    | quarter_2 | 224   |
| SleepingBag | quarter_2 | 936   |
| Umbrella    | quarter_3 | 379   |
| SleepingBag | quarter_3 | 93    |
| Umbrella    | quarter_4 | 611   |
| SleepingBag | quarter_4 | 875   |
+-------------+-----------+-------+
Method Chaining 方法链
  • Filtering Data

    • Boolean Indexing: 使用布尔条件从数据帧中筛选行。

  • Sorting Data

    • sort_values() 方法:用于根据一列或多列对数据进行排序

    • by='weight':我们指定要根据权重列进行排序

    • ascending=False:将此参数设置为False,表示我们希望按降序(从最重到最轻)排序

  • Column Selection

    • Subset Selection:过滤和排序后,我们从数据帧中选择列的子集

  • Method Chaining

    • Chaining Operations:通过用点连接方法,在一行中对 DataFrame 执行多个操作。这是 pandas 的一个强大功能,它可以使代码变得简洁,但对于新手来说,阅读起来可能比较复杂

#Code1:
import pandas as pd

def findHeavyAnimals(animals: pd.DataFrame) -> pd.DataFrame:
    filtered_animals = animals[animals['weight'] > 100]
    sorted_animals = filtered_animals.sort_values(by='weight', ascending=False)
    name = sorted_animals[['name']]
    return name
    
#Code2:
import pandas as pd

def findHeavyAnimals(animals: pd.DataFrame) -> pd.DataFrame:
    return animals[animals['weight'] > 100].sort_values(by='weight', ascending=False)[['name']]
Example 1:

Input: 
DataFrame animals:
+----------+---------+-----+--------+
| name     | species | age | weight |
+----------+---------+-----+--------+
| Tatiana  | Snake   | 98  | 464    |
| Khaled   | Giraffe | 50  | 41     |
| Alex     | Leopard | 6   | 328    |
| Jonathan | Monkey  | 45  | 463    |
| Stefan   | Bear    | 100 | 50     |
| Tommy    | Panda   | 26  | 349    |
+----------+---------+-----+--------+
Output: 
+----------+
| name     |
+----------+
| Tatiana  |
| Jonathan |
| Tommy    |
| Alex     |
+----------+

Last updated

Was this helpful?