CMU 15-445 Homework 1 SQL
这个是 Lecture 1 的 Homework,感觉 Lecture 1 的笔记没啥好写的。
作业准备
- 安装 sqlite3(以 macos 为例),版本最低要求为 3.25
brew install sqlite3
- 启动 sqlite3
$ sqlite3
SQLite version 3.37.0 2021-12-09 01:34:53
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
sqlite>
- 获取数据文件
$ wget https://15445.courses.cs.cmu.edu/fall2022/files/imdb-cmudb2022.db.gz
- 校验数据文件
$ md5sum imdb-cmudb2022.db.gz
d7cdf34f4ba029597c3774fc96bc3519 imdb-cmudb2022.db.gz
- 解压数据文件并导入
$ gunzip imdb-cmudb2022.db.gz
$ sqlite3 imdb-cmudb2022.db
- 查看所有表
$ sqlite3 imdb-cmudb2022.db
SQLite version 3.31.1
Enter ".help" for usage hints.
sqlite> .tables
akas crew episodes people ratings titles
- 添加索引
CREATE INDEX ix_people_name ON people (name);
CREATE INDEX ix_titles_type ON titles (type);
CREATE INDEX ix_titles_primary_title ON titles (primary_title);
CREATE INDEX ix_titles_original_title ON titles (original_title);
CREATE INDEX ix_akas_title_id ON akas (title_id);
CREATE INDEX ix_akas_title ON akas (title);
CREATE INDEX ix_crew_title_id ON crew (title_id);
CREATE INDEX ix_crew_person_id ON crew (person_id);
- 熟悉表结构,以 people 表为例
sqlite> .schema people
CREATE TABLE people (
person_id VARCHAR PRIMARY KEY,
name VARCHAR,
born INTEGER,
died INTEGER
);
CREATE INDEX ix_people_name ON people (name);
参考答案
Q2 5分
找到 10 部时长最长的科幻作品。
输出作品的名字、首映日、时长。时长应以 (mins)
为后缀,例如时长为 12 分钟,输出为 12(mins)
。如果作品的流派有多个,只要包含科幻即属于科幻作品。
结果的第一行应为:Cicak-Man 2: Planet Hitam|2008|999 (mins)
这个问题涉及的表为
titles
,表结构如下:sqlite> .schema titles CREATE TABLE titles ( title_id VARCHAR PRIMARY KEY, type VARCHAR, primary_title VARCHAR, original_title VARCHAR, is_adult INTEGER, premiered INTEGER, ended INTEGER, runtime_minutes INTEGER, genres VARCHAR ); CREATE INDEX ix_titles_type ON titles (type); CREATE INDEX ix_titles_primary_title ON titles (primary_title); CREATE INDEX ix_titles_original_title ON titles (original_title);
其中,输出涉及的字段为
primary_title
、premiered
、runtime_minutes
标记作品流派的字段为
genres
,使用,
分割多种流派,Sci-Fi
表示科幻流派
我的答案
SELECT
primary_title,
premiered,
runtime_minutes || '(mins)' AS runtime
FROM
titles
WHERE
genres LIKE '%Sci-Fi%'
ORDER BY
runtime_minutes DESC
LIMIT 10;
Q3 5分
找到出生于 1990 及以后最年长的人,如果死亡年份为空说明还在世。
输出每个人的姓名的年龄,输出结果首先按年龄排序,其次按姓名排序,返回前 20 个结果。
这个问题涉及的表为
people
,表结构如下:sqlite> .schema people CREATE TABLE people ( person_id VARCHAR PRIMARY KEY, name VARCHAR, born INTEGER, died INTEGER ); CREATE INDEX ix_people_name ON people (name);
其中,输出涉及的字段为
name
、born
、died
我的答案
SELECT
name,
JULIANDAY(IFNULL(died, STRFTIME('%Y', CURRENT_TIMESTAMP))) - JULIANDAY(born) AS age
FROM
people
WHERE
born >= 1900
ORDER BY
age DESC
LIMIT 20;
Q4 10分
找到出现次数最多的 20 名剧组人员,输出姓名和出场次数。
这个问题涉及的表为
crew
、people
,crew
表结构如下:sqlite> .schema crew CREATE TABLE crew ( title_id VARCHAR, -- REFERENCES titles (title_id), person_id VARCHAR, -- REFERENCES people (person_id), category VARCHAR, job VARCHAR, characters VARCHAR ); CREATE INDEX ix_crew_title_id ON crew (title_id); CREATE INDEX ix_crew_person_id ON crew (person_id);
我的答案
SELECT
num_appearances,
name
FROM
(
SELECT
COUNT(1) AS num_appearances,
person_id
FROM
crew
GROUP BY
person_id
ORDER BY
num_appearances DESC
LIMIT 20
) t
JOIN people p
WHERE t.person_id = p.person_id;
Q5 10分
输出每十年的作品发行量、平均分、最高分、最低分,平均分保留两位小数,忽略尚未上映的作品,即 premiered
为 null
的作品。
这个问题涉及的表为
ratings
、titles
,ratings
表结构如下:sqlite> .schema ratings CREATE TABLE ratings ( title_id VARCHAR PRIMARY KEY, -- REFERENCES titles (title_id), rating FLOAT, votes INTEGER );
我的答案
SELECT
(premiered-premiered%10) || 's' AS decade,
ROUND(AVG(rating), 2) AS avg_rating,
MAX(rating) AS top_rating,
MIN(rating) AS min_rating,
COUNT(1) AS num_releases
FROM
titles t
JOIN ratings r
WHERE
premiered IS NOT NULL
AND premiered >= premiered-premiered%10
AND premiered < premiered-premiered%10 + 10
AND t.title_id = r.title_id
GROUP BY decade
ORDER BY decade;
Q6 10分
找到名字包含 "Cruise" 且出生于 1962 年的人参与的票数最高的作品,输出作品名和票数。
这个问题涉及的表为
people
、titles
、crew
、ratings
,这四个表结构前文都展示过了。
我的答案
SELECT t.primary_title, MAX(r.votes) FROM ratings r, titles t WHERE r.title_id IN (
SELECT title_id FROM crew WHERE person_id IN (
SELECT person_id FROM people WHERE born = 1962 AND name LIKE '%Cruise%'
)
) AND r.title_id = t.title_id;
Q7 15分
列出与 "Army of Thieves" 同年首映的作品数量,结果包含 "Army of Thieves"。
这个问题涉及的表为
titles
,表结构前文已经展示过了。
我的答案
SELECT COUNT(DISTINCT(title_id)) FROM titles WHERE premiered = (
SELECT premiered FROM titles WHERE primary_title = 'Army of Thieves'
);
Q8 15分
列出所有与 "Nicole Kidman"(生于 1967)一起出演过一部作品的演员,按名字顺序输出,结果包含 "Nicole Kidman"。
这个问题涉及的表为
crew
、people
,crew.category
字段值为actor
或actresses
都是演员。
我的答案
SELECT DISTINCT(name) AS name FROM crew c JOIN people p WHERE title_id IN (
SELECT title_id FROM crew WHERE person_id = (
SELECT person_id FROM people WHERE name = 'Nicole Kidman' AND born = 1967
)
) AND (category = 'actor' OR category = 'actress') AND c.person_id = p.person_id ORDER BY name;
Q9 15分
对于所有生于 1955 年的人,获取他们的名字和其参演的所有电影的平均分,并输出其中排名第 9 的人的名字和其所有作品的平均分。建议使用 NTILE(10)
获取排名。
这个问题涉及的表为
crew
、titles
、people
、ratings
,其中titles.type
值为movie
表示电影。
我的答案
SELECT name, avg_rating FROM (
SELECT NTILE(10) OVER(ORDER BY avg_rating DESC) AS `rank`, name, avg_rating FROM (
SELECT ROUND(AVG(r.rating), 2) AS avg_rating, t.person_id FROM (
SELECT c.title_id, person_id FROM crew c JOIN titles t WHERE person_id IN (
SELECT person_id FROM people WHERE born = 1955
) AND c.title_id = t.title_id AND t.`type` = 'movie'
) t JOIN ratings r WHERE t.title_id = r.title_id GROUP BY t.person_id LIMIT 10
) t JOIN people p WHERE t.person_id = p.person_id
) WHERE `rank` = 9;
Q10 15分
这个完全没看懂题目是什么意思...这里贴一下原文,有人能理解题目意思还请评论告知。
Concatenate all the unique titles for the TV Series "House of the Dragon" as a string of comma-separated values in alphabetical order of the titles.
Details: Find all the unique dubbed titles for the new TV show "House of the Dragon" and order them alphabetically. Print a single string containing all these titles separated by commas.
Hint: You might find Recursive CTEs useful.
Note: Two titles are different even if they differ only in capitalization. Elements in the comma-separated value result should be separated with both a comma and a space, e.g. "foo, bar".
写在最后
这个 Homework 的截止时间是 Sep 11, 2022 @ 11:59pm,所以我直接把我的答案放在网上没有关系,因为没办法验证正确性,如果你的查询结果和我的不一致,欢迎评论交流。
上面写的参考答案有些很明显并不是很合适,哪怕结果是正确的,所以只能作为参考。平常开发时遇到复杂的查询场景,一般都用高级语言来处理逻辑,多次调用简单的 SQL 查询,理论上全部使用 SQL 完成应该是更好的。
转载自:https://juejin.cn/post/7142152700501688356