Part III
Classes More Than 5 Students (the value_counts method)
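Before the problem itself, a quick look at what the value_counts pattern named in this heading does, on a made-up toy Series:

```python
import pandas as pd

# Made-up toy data, only to illustrate value_counts
s = pd.Series(["Math", "Math", "English", "Math"])
counts = s.value_counts()  # index: value, data: number of occurrences
print(counts["Math"])                      # 3
print(counts[counts >= 2].index.tolist())  # ['Math']
```

Filtering the counts Series with a boolean condition, then reading its index, is exactly the shape of the solution below.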
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| student | varchar |
| class | varchar |
+-------------+---------+
# Write a solution to find all classes that have at least five students.
Example 1:
Input:
Courses table:
+---------+----------+
| student | class |
+---------+----------+
| A | Math |
| B | English |
| C | Math |
| D | Biology |
| E | Math |
| F | Computer |
| G | Math |
| H | Math |
| I | Math |
+---------+----------+
Output:
+---------+
| class |
+---------+
| Math |
+---------+

import pandas as pd
def find_classes(courses: pd.DataFrame) -> pd.DataFrame:
    # value_counts() counts how many times each value appears in the column
    value_class = courses['class'].value_counts()
    result_df = value_class[value_class >= 5].reset_index()
    result_df.columns = ['class', 'count']
    result_df = result_df[['class']]
    return result_df

Customer Placing the Largest Number of Orders (group, then take the top row)
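The pattern this heading names — group, count, sort descending, take the top row — in isolation, on made-up data:

```python
import pandas as pd

# Made-up toy orders; customer 3 appears most often
df = pd.DataFrame({"customer_number": [1, 2, 3, 3]})
counts = df.groupby("customer_number").size().reset_index(name="count")
top = counts.sort_values("count", ascending=False).head(1)
print(top["customer_number"].iloc[0])  # 3
```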
+-----------------+----------+
| Column Name | Type |
+-----------------+----------+
| order_number | int |
| customer_number | int |
+-----------------+----------+
# order_number is the primary key (column with unique values) for this table.
# This table contains information about the order ID and the customer ID.
# Write a solution to find the customer_number of the customer who placed the largest number of orders.
Example 1:
Input:
Orders table:
+--------------+-----------------+
| order_number | customer_number |
+--------------+-----------------+
| 1 | 1 |
| 2 | 2 |
| 3 | 3 |
| 4 | 3 |
+--------------+-----------------+
Output:
+-----------------+
| customer_number |
+-----------------+
| 3 |
+-----------------+

import pandas as pd
def largest_orders(orders: pd.DataFrame) -> pd.DataFrame:
    # Group by customer_number and count the rows in each group
    customer_counts = orders.groupby('customer_number').size().reset_index(name='count')
    # Sort by count in descending order
    sorted_customers = customer_counts.sort_values(by='count', ascending=False)
    # Take the top customer_number
    most_frequent_customer = sorted_customers.head(1)[['customer_number']]
    return most_frequent_customer

Group Sold Products By The Date (multiple aggregations with agg)
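A minimal, made-up sketch of the agg idea this heading refers to: passing a list of (output_name, function) pairs to agg on a grouped column produces one named output column per aggregation:

```python
import pandas as pd

# Made-up data: two groups, with duplicate values inside group "a"
df = pd.DataFrame({"k": ["a", "a", "b"], "v": ["x", "x", "y"]})
out = df.groupby("k")["v"].agg([
    ("n_unique", "nunique"),                             # distinct count
    ("joined", lambda s: ",".join(sorted(s.unique()))),  # sorted, comma-joined
]).reset_index()
print(out.columns.tolist())  # ['k', 'n_unique', 'joined']
```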
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| sell_date | date |
| product | varchar |
+-------------+---------+
# There is no primary key (column with unique values) for this table. It may contain duplicates.
# Each row of this table contains the product name and the date it was sold in a market.
# Write a solution to find for each date the number of different products sold and their names.
# The sold products' names for each date should be sorted lexicographically.
# Return the result table ordered by sell_date.
Example 1:
Input:
Activities table:
+------------+------------+
| sell_date | product |
+------------+------------+
| 2020-05-30 | Headphone |
| 2020-06-01 | Pencil |
| 2020-06-02 | Mask |
| 2020-05-30 | Basketball |
| 2020-06-01 | Bible |
| 2020-06-02 | Mask |
| 2020-05-30 | T-Shirt |
+------------+------------+
Output:
+------------+----------+------------------------------+
| sell_date | num_sold | products |
+------------+----------+------------------------------+
| 2020-05-30 | 3        | Basketball,Headphone,T-Shirt |
| 2020-06-01 | 2 | Bible,Pencil |
| 2020-06-02 | 1 | Mask |
+------------+----------+------------------------------+

import pandas as pd
def categorize_products(activities: pd.DataFrame) -> pd.DataFrame:
    # agg applies several aggregation functions to each group created by groupby
    return activities.groupby('sell_date')['product'].agg([
        ('num_sold', 'nunique'),
        ('products', lambda x: ','.join(sorted(x.unique())))
    ]).reset_index()

Daily Leads and Partners (groupby, nunique)
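What groupby(...).nunique() does, sketched on made-up data: it counts the distinct values of every remaining column within each group:

```python
import pandas as pd

# Made-up data: group "a" has one distinct x value but two distinct y values
df = pd.DataFrame({"g": ["a", "a", "b"], "x": [1, 1, 2], "y": [5, 6, 7]})
out = df.groupby("g").nunique().reset_index()
print(out.loc[out["g"] == "a", ["x", "y"]].values.tolist())  # [[1, 2]]
```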
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| date_id | date |
| make_name | varchar |
| lead_id | int |
| partner_id | int |
+-------------+---------+
# There is no primary key (column with unique values) for this table. It may contain duplicates
# This table contains the date and the name of the product sold and the IDs of the lead and partner it was sold to
# The name consists of only lowercase English letters
# For each date_id and make_name, find the number of distinct lead_id's and distinct partner_id's.
Example 1:
Input:
DailySales table:
+-----------+-----------+---------+------------+
| date_id | make_name | lead_id | partner_id |
+-----------+-----------+---------+------------+
| 2020-12-8 | toyota | 0 | 1 |
| 2020-12-8 | toyota | 1 | 0 |
| 2020-12-8 | toyota | 1 | 2 |
| 2020-12-7 | toyota | 0 | 2 |
| 2020-12-7 | toyota | 0 | 1 |
| 2020-12-8 | honda | 1 | 2 |
| 2020-12-8 | honda | 2 | 1 |
| 2020-12-7 | honda | 0 | 1 |
| 2020-12-7 | honda | 1 | 2 |
| 2020-12-7 | honda | 2 | 1 |
+-----------+-----------+---------+------------+
Output:
+-----------+-----------+--------------+-----------------+
| date_id | make_name | unique_leads | unique_partners |
+-----------+-----------+--------------+-----------------+
| 2020-12-8 | toyota | 2 | 3 |
| 2020-12-7 | toyota | 1 | 2 |
| 2020-12-8 | honda | 2 | 2 |
| 2020-12-7 | honda | 3 | 2 |
+-----------+-----------+--------------+-----------------+

import pandas as pd
def daily_leads_and_partners(daily_sales: pd.DataFrame) -> pd.DataFrame:
    # nunique() counts distinct values of every non-key column within each group
    result_df = (
        daily_sales
        .groupby(['date_id', 'make_name'])
        .nunique()
        .reset_index()
        .rename(columns={'lead_id': 'unique_leads', 'partner_id': 'unique_partners'})
    )
    return result_df

Actor and Directors Who Cooperated At Least Three Times (group, size)
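The group-size-then-filter pattern from this heading, on made-up pairs:

```python
import pandas as pd

# Made-up pairs: (1, 1) occurs three times, (2, 2) once
df = pd.DataFrame({"a": [1, 1, 1, 2], "d": [1, 1, 1, 2]})
counts = df.groupby(["a", "d"]).size().reset_index(name="count")
pairs = counts[counts["count"] >= 3][["a", "d"]]
print(pairs.values.tolist())  # [[1, 1]]
```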
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| actor_id | int |
| director_id | int |
| timestamp | int |
+-------------+---------+
# timestamp is the primary key (column with unique values) for this table.
# Write a solution to find all the pairs (actor_id, director_id) where the actor has cooperated with the director at least three times
Example 1:
Input:
ActorDirector table:
+-------------+-------------+-------------+
| actor_id | director_id | timestamp |
+-------------+-------------+-------------+
| 1 | 1 | 0 |
| 1 | 1 | 1 |
| 1 | 1 | 2 |
| 1 | 2 | 3 |
| 1 | 2 | 4 |
| 2 | 1 | 5 |
| 2 | 1 | 6 |
+-------------+-------------+-------------+
Output:
+-------------+-------------+
| actor_id | director_id |
+-------------+-------------+
| 1 | 1 |
+-------------+

import pandas as pd
def actors_and_directors(actor_director: pd.DataFrame) -> pd.DataFrame:
    # Count how many rows each (actor_id, director_id) pair has
    actor_director_df = actor_director.groupby(['actor_id', 'director_id']).size().reset_index(name='count')
    result_df = actor_director_df[actor_director_df['count'] >= 3]
    result_df = result_df[['actor_id', 'director_id']]
    return result_df

Replace Employee ID With The Unique Identifier (merge)
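How a left merge produces the nulls this heading's problem needs, on made-up frames:

```python
import pandas as pd

# Made-up frames: id 1 has no match on the right, so uid becomes NaN
left = pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})
right = pd.DataFrame({"id": [2], "uid": [9]})
m = pd.merge(left, right, on="id", how="left")
print(m.loc[m["id"] == 2, "uid"].iloc[0])  # 9.0 (float, because the column now holds a NaN)
```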
# Table: Employees
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| id | int |
| name | varchar |
+---------------+---------+
# id is the primary key (column with unique values) for this table.
# Each row of this table contains the id and the name of an employee in a company.
# Table: EmployeeUNI
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| id | int |
| unique_id | int |
+---------------+---------+
# (id, unique_id) is the primary key (combination of columns with unique values) for this table.
# Each row of this table contains the id and the corresponding unique id of an employee in the company.
# Write a solution to show the unique ID of each user; if a user does not have a unique ID, show null instead.
# Example 1:
# Input:
Employees table:
+----+----------+
| id | name |
+----+----------+
| 1 | Alice |
| 7 | Bob |
| 11 | Meir |
| 90 | Winston |
| 3 | Jonathan |
+----+----------+
EmployeeUNI table:
+----+-----------+
| id | unique_id |
+----+-----------+
| 3 | 1 |
| 11 | 2 |
| 90 | 3 |
+----+-----------+
Output:
+-----------+----------+
| unique_id | name |
+-----------+----------+
| null | Alice |
| null | Bob |
| 2 | Meir |
| 3 | Winston |
| 1 | Jonathan |
+-----------+----------+

import pandas as pd
def replace_employee_id(employees: pd.DataFrame, employee_uni: pd.DataFrame) -> pd.DataFrame:
    # A left join keeps every employee; missing unique_ids become NaN (shown as null)
    merged_df = pd.merge(employees, employee_uni, on='id', how='left')
    result_df = merged_df[['unique_id', 'name']]
    return result_df

Students and Examinations (three tables; merge, groupby, fillna, astype)
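The merge/groupby/fillna/astype chain named here, reduced to made-up data (how='cross' needs pandas >= 1.2): build the full grid of combinations, left-join the counts, then turn the unmatched NaNs into integer zeros:

```python
import pandas as pd

a = pd.DataFrame({"s": [1, 2]})
b = pd.DataFrame({"c": ["x", "y"]})
grid = a.merge(b, how="cross")              # all 4 (s, c) combinations
hits = pd.DataFrame({"s": [1], "c": ["x"], "n": [5]})
out = grid.merge(hits, on=["s", "c"], how="left")
out["n"] = out["n"].fillna(0).astype(int)   # unmatched rows become 0
print(out["n"].tolist())  # [5, 0, 0, 0]
```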
# Table: Students
+---------------+---------+
| Column Name | Type |
+---------------+---------+
| student_id | int |
| student_name | varchar |
+---------------+---------+
# student_id is the primary key (column with unique values) for this table.
# Each row of this table contains the ID and the name of one student in the school.
# Table: Subjects
+--------------+---------+
| Column Name | Type |
+--------------+---------+
| subject_name | varchar |
+--------------+---------+
# subject_name is the primary key (column with unique values) for this table.
# Each row of this table contains the name of one subject in the school.
# Table: Examinations
+--------------+---------+
| Column Name | Type |
+--------------+---------+
| student_id | int |
| subject_name | varchar |
+--------------+---------+
# There is no primary key (column with unique values) for this table. It may contain duplicates.
# Each student from the Students table takes every course from the Subjects table.
# Each row of this table indicates that a student with ID student_id attended the exam of subject_name.
# Write a solution to find the number of times each student attended each exam.
# Return the result table ordered by student_id and subject_name.
Example 1:
Input:
Students table:
+------------+--------------+
| student_id | student_name |
+------------+--------------+
| 1 | Alice |
| 2 | Bob |
| 13 | John |
| 6 | Alex |
+------------+--------------+
Subjects table:
+--------------+
| subject_name |
+--------------+
| Math |
| Physics |
| Programming |
+--------------+
Examinations table:
+------------+--------------+
| student_id | subject_name |
+------------+--------------+
| 1 | Math |
| 1 | Physics |
| 1 | Programming |
| 2 | Programming |
| 1 | Physics |
| 1 | Math |
| 13 | Math |
| 13 | Programming |
| 13 | Physics |
| 2 | Math |
| 1 | Math |
+------------+--------------+
Output:
+------------+--------------+--------------+----------------+
| student_id | student_name | subject_name | attended_exams |
+------------+--------------+--------------+----------------+
| 1 | Alice | Math | 3 |
| 1 | Alice | Physics | 2 |
| 1 | Alice | Programming | 1 |
| 2 | Bob | Math | 1 |
| 2 | Bob | Physics | 0 |
| 2 | Bob | Programming | 1 |
| 6 | Alex | Math | 0 |
| 6 | Alex | Physics | 0 |
| 6 | Alex | Programming | 0 |
| 13 | John | Math | 1 |
| 13 | John | Physics | 1 |
| 13 | John | Programming | 1 |
+------------+--------------+--------------+----------------+

import pandas as pd
def students_and_examinations(students: pd.DataFrame, subjects: pd.DataFrame, examinations: pd.DataFrame) -> pd.DataFrame:
    # Step 1: Create a cross join of students and subjects
    result = students.merge(subjects, how='cross')
    # Step 2: Count the number of exams attended
    exam_count = examinations.groupby(['student_id', 'subject_name']).size().reset_index(name='attended_exams')
    # Step 3: Merge the exam count with the result
    result = result.merge(exam_count, on=['student_id', 'subject_name'], how='left')
    # Step 4: Fill NaN values with 0 for students who didn't attend any exams
    result['attended_exams'] = result['attended_exams'].fillna(0).astype(int)
    # Step 5: Sort the result by student_id and subject_name
    result = result.sort_values(['student_id', 'subject_name'])
    return result

Sales Person (three tables; merge; ~)
# Table: SalesPerson
+-----------------+---------+
| Column Name | Type |
+-----------------+---------+
| sales_id | int |
| name | varchar |
| salary | int |
| commission_rate | int |
| hire_date | date |
+-----------------+---------+
# sales_id is the primary key (column with unique values) for this table.
# Each row of this table indicates the name and the ID of a salesperson alongside their salary, commission rate, and hire date.
# Table: Company
+-------------+---------+
| Column Name | Type |
+-------------+---------+
| com_id | int |
| name | varchar |
| city | varchar |
+-------------+---------+
# com_id is the primary key (column with unique values) for this table.
# Each row of this table indicates the name and the ID of a company and the city in which the company is located.
# Table: Orders
+-------------+------+
| Column Name | Type |
+-------------+------+
| order_id | int |
| order_date | date |
| com_id | int |
| sales_id | int |
| amount | int |
+-------------+------+
# order_id is the primary key (column with unique values) for this table.
# com_id is a foreign key (reference column) to com_id from the Company table.
# sales_id is a foreign key (reference column) to sales_id from the SalesPerson table.
# Each row of this table contains information about one order. This includes the ID of the company, the ID of the salesperson, the date of the order, and the amount paid.
# Write a solution to find the names of all the salespersons who did not have any orders related to the company with the name "RED".
# Return the result table in any order.
# Example 1:
# Input:
# SalesPerson table:
+----------+------+--------+-----------------+------------+
| sales_id | name | salary | commission_rate | hire_date |
+----------+------+--------+-----------------+------------+
| 1 | John | 100000 | 6 | 4/1/2006 |
| 2 | Amy | 12000 | 5 | 5/1/2010 |
| 3 | Mark | 65000 | 12 | 12/25/2008 |
| 4 | Pam | 25000 | 25 | 1/1/2005 |
| 5 | Alex | 5000 | 10 | 2/3/2007 |
+----------+------+--------+-----------------+------------+
# Company table:
+--------+--------+----------+
| com_id | name | city |
+--------+--------+----------+
| 1 | RED | Boston |
| 2 | ORANGE | New York |
| 3 | YELLOW | Boston |
| 4 | GREEN | Austin |
+--------+--------+----------+
# Orders table:
+----------+------------+--------+----------+--------+
| order_id | order_date | com_id | sales_id | amount |
+----------+------------+--------+----------+--------+
| 1 | 1/1/2014 | 3 | 4 | 10000 |
| 2 | 2/1/2014 | 4 | 5 | 5000 |
| 3 | 3/1/2014 | 1 | 1 | 50000 |
| 4 | 4/1/2014 | 1 | 4 | 25000 |
+----------+------------+--------+----------+--------+
# Output:
+------+
| name |
+------+
| Amy |
| Mark |
| Alex |
+------+

import pandas as pd
def sales_person(sales_person: pd.DataFrame, company: pd.DataFrame, orders: pd.DataFrame) -> pd.DataFrame:
    # Collect the sales_ids that ever sold to company "RED", then exclude them with ~
    return sales_person[
        ~sales_person['sales_id'].isin(
            pd.merge(
                left=orders,
                right=company[company['name'] == 'RED'],
                how='inner',
                on='com_id'
            )['sales_id'].unique()
        )
    ][['name']]
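To close, the ~/isin anti-join from the last solution, checked on a trimmed, made-up version of the example tables:

```python
import pandas as pd

sales = pd.DataFrame({"sales_id": [1, 2, 4], "name": ["John", "Amy", "Pam"]})
company = pd.DataFrame({"com_id": [1, 2], "name": ["RED", "ORANGE"]})
orders = pd.DataFrame({"com_id": [1, 2, 1], "sales_id": [1, 2, 4]})

# sales_ids with at least one order for company "RED"
red_sellers = pd.merge(orders, company[company["name"] == "RED"],
                       how="inner", on="com_id")["sales_id"].unique()
out = sales[~sales["sales_id"].isin(red_sellers)][["name"]]
print(out["name"].tolist())  # ['Amy']
```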