Data Analytics - Python & other tools
Fall 2020
HW 1: End-to-end analysis of TMDb data, Argo-Lite, SQLite, D3 Warmup, OpenRefine, Flask
Use HW1 Skeleton zip file and unzip to folders.
Homework Overview
In Question 1 (Q1), you will collect data using an API for The Movie Database (TMDb). You will construct a
graph representation of this data that will show which actors have acted together in various movies, and use
Argo Lite to visualize this graph and highlight patterns that you find. This exercise demonstrates
how visualizing and interacting with data can help with discovery.
In Q2, you will construct a TMDb database in SQLite, with tables capturing information such as how well
each movie did, which actors acted in each movie, and what the movie was about. You will also
partition and combine information in these tables in order to more easily answer questions such as "which
actors acted in the highest number of movies?".
In Q3, you will visualize temporal trends in movie releases, using a JavaScript-based library called D3. This
part will show how creating interactive rather than static plots can make data more visually appealing,
engaging and easier to parse.
In Q4, you will use OpenRefine to clean data from Mercari, and construct GREL queries to filter the entries
in this dataset.
In Q5, you will build a simple web application that displays a table of TMDb data on a single-page website.
To do this, you will use Flask, a Python framework for building web applications that allows you to connect
Python data processing on the back end with serving a site that displays these results.
Extremely Important: folder structure & content of submission zip file
The zip file’s folder structure must exactly be (when unzipped):
HW1-username/
Q1/
Q2/
Q3/
submission.py
Q2_SQL.py
index.(html / js / css)
q3.csv
lib/
d3/
d3.min.js
d3-fetch/
d3-fetch.min.js
d3-dsv/
d3-dsv.min.js
Q4/
Q5/
properties_clean.csv
changes.json
Q4Observations.txt
wrangling.py
2 Version 1
Q1 [40 points] Collect data from TMDb and visualize co-actor network
Q1.1 [30 points] Collect data from TMDb and build a graph
For this Q1.1, you will be using and submitting a python file - submission.py in folder Q1
Complete all tasks according to the instructions found in submission.py to complete the Graph class, the
TMDbAPIUtils class, and the two global functions. The Graph class will serve as a re-usable way to
represent and write out your collected graph data. The TMDbAPIUtils class will be used to work with the
TMDB API for data retrieval.
NOTE: You must only use a version of Python ≥ 3.7.0 and < 3.8 for this question. You must not use any
other versions (e.g., Python 3.8).
NOTE: You must only use the modules and libraries provided at the top of submission.py and modules
from the Python Standard Library. Pandas and Numpy CANNOT be used
NOTE: We will call each function once in submission.py during grading. You may lose some points if your
program runs for unreasonably long time, such as more than 10 minutes during “non-busy” times. The
average runtime of the code during grading is expected to take approximately 4 seconds.
a) [10 pts] Implementation of the Graph class according to the instructions in submission.py
b) [10 pts] Implementation of the TMDbAPIUtils class according to the instructions in
submission.py. You will use version 3 of the TMDb API to download data about actors and their
co-actors. To use the TMDb API:
o Use the TMDb API key to access the TMDb data: 74a32a264202ce8e09c7
o Refer to the TMDB API Documentation, which contains a helpful ‘try-it-out’ feature for
interacting with the API calls.
c) [10 pts] Producing correct nodes.csv and edges.csv. You must upload your nodes.csv and
edges.csv file as directed in Q1.2.
NOTE: Q1.2 builds on the results of Q1.1
3 Version 1
Q1.2 [10 points] Visualizing a graph of co-actors using Argo-Lite
Using Argo Lite, visualize a network of actors and their co-actors. You can access Argo Lite here
You will produce an Argo Lite graph snapshot your edges.csv and nodes.csv from Q1.1.c.
a. To get started, review Argo Lite’s readme on GitHub. Argo Lite has been open-sourced.
b. Importing your Graph
● Launch Argo Lite
● From the menu bar, click ‘Graph’ → ‘Import CSV’. In the dialogue that appears:
o Select ‘I have both nodes and edges file’
● Under Nodes, use ‘Choose File’ to select nodes.csv from your computer
o Leave 'Has Headers' selected
o Verify ‘Column for Node ID’ is ‘id’
● Under Edges, use ‘Choose File’ to select edges.csv from your computer
o Verify ‘Column for Source ID’ is ‘source’
o Select ‘Column for Target ID’ to ‘target’
o Verify ‘Selected Delimiter’ is ','
● At the bottom of the dialogue, verify that ‘After import, show’ is ‘All Nodes’
● The graph will load in the window. Note that the layout is paused by default; you can select
to 'Resume’ or ‘Pause’ layout as needed.
● Dragging a node will 'pin' it, freezing its position. Selecting a pinned node, right clicking it,
then choosing 'unpin selected' will unpin that node, so its position will once again be
computed by the graph layout algorithm. Experiment with pinning and unpinning nodes.
c. [7 points] Setting graph display options
● On “Graph Options” panel, under 'Nodes' → 'Modifying All Nodes', expand 'Color' menu
o Select Color by 'degree', with scale: ‘Linear Scale’
o Select a color gradient of your choice that will assign lighter colors to nodes with higher
node degrees, and darker colors to nodes with lower degrees
● Collapse the 'Color' options, expand the 'Size' options.
o Select 'Scale by' to 'degree', with scale: Linear Scale'
o Select meaningful Size Range values of your choice or use the default range.
● Collapse the 'Size' options
● On the Menu, click ‘Tools’ → ‘Data Sheet’
● Within the ‘Data Sheet’ dialogue:
o Click ‘Hide All’
o Set ‘10 more nodes with highest degree’
o Click ‘Show’ and then close the ‘Data Sheet’ dialogue
● Click and drag a rectangle selection around the visible nodes
● With the nodes selected, configure their node visibility by setting the following:
o Go to 'Graph Options' → 'Labels'
o Click ‘Show Labels of Selected Nodes’
o At the bottom of the menu, select 'Label By' to ‘name'
o Adjust the ‘Label Length’ so that the full text of the actor name is displayed
● On the Menu, click ‘Tools’ -> ‘Filters’ -> ‘Show All Nodes’ The result of this workflow yields a
graph with the sizing and coloring depending upon the node degree and the nodes with the
highest degree are emphasized by showing their labels.
●
d. [3 points] Designing a meaningful graph layout
4 Version 1
Using the following guidelines, create a visually meaningful and appealing layout:
● Reduce as much edge crossing as possible
● Reduce node overlap as much as possible
● Keep the graph compact and symmetric as possible
● Use the nodes’ spatial positions to convey information (e.g., “clusters” or groups)
● Experiment with showing additional node labels. If showing all node labels creates too much
visual complexity, show at least 10 “important” nodes. You may decide what “importance”
mean to you. For example, you may consider nodes (actors) having higher connectivity as
potentially more “important” (based on how the graph is built).
The objective of this task is to familiarize yourself with basic, important graph visualization features.
Therefore, this is an open-ended task, and most designs receive full marks. So please experiment
with Argo Lite’s features, changing node size and shape, etc. In practice, it is not possible to create
“perfect” visualizations for most graph datasets. The above guidelines are ones that generally help.
However, like most design tasks, creating a visualization is about making selective design
compromises. Some guidelines could create competing demands and following all guidelines may
not guarantee a “perfect” design.
If you want to save your Argo Lite graph visualization snapshot locally to your device, so you can
continue working on it later, we recommend the following workflow.
● Select 'Graph' → 'Save Snapshot'
o In the 'Save Snapshot` dialog, click 'Copy to Clipboard'
o Open an external text editor program such as TextEdit or Notepad. Paste the clipboard
contents of the graph snapshot, and save it to a file with a .json extension. You should
be able to accomplish this with a default text editor on your computer by overriding the
default file extension and manually entering ‘.json’.
o You may save your progress by saving the snapshot and loading them into Argo Lite to
continue your work.
● To load a snapshot, choose 'Graph' → 'Open Snapshot'
● Select the graph snapshot you created.
e. Publish and Share your graph snapshot
● Select 'Graph ' → 'Publish and Share Snapshot' → 'Share’
● Next, click 'Copy to Clipboard' to copy the generated URL
● Return the URL in the return_argo_lite_snapshot() function in submission.py
If you modify your graph after you publish and share a URL, you will need to re-publish and obtain a
new URL of your latest graph. Only the graph snapshot shared via the URL will be graded.
Deliverables: Place the files listed below in the Q1 folder.
? submission.py: the completed Python file
Q2 [35 points] SQLite
SQLite is a lightweight, serverless, embedded database that can easily handle multiple gigabytes of data. It
is one of the world’s most popular embedded database systems. It is convenient to share data stored in an
SQLite database — just one cross-platform file which does not need to be parsed explicitly (unlike CSV
5 Version 1
files, which have to be parsed).
You will modify the given Q2_SQL.py file in folder Q2 by adding SQL statements to it.
NOTE: You must only use a version of Python ≥ 3.7.0 and < 3.8 for this question. You must not use any
other versions (e.g., Python 3.8)
NOTE: Do not modify the import statements, everything you need to complete this question has been
imported for you. You may not use other libraries for this assignment.
A Sample class has been provided for you to see some sample SQL statements, you can turn off this output
by changing the global variable SHOW to False. NOTE: This must be set to false before uploading to
Gradescope and turning it in to Canvas.
username - use this name, e.g. mhull32
NOTE: For the questions in this section, you must only use INNER JOIN when performing a join
between two tables. Other types of joins may result in incorrect results.
a. [9 points] Create tables and import data.
i. [2 points] Create two tables (via two separate methods) named movies and movie_cast with
columns having the indicated data types:
1. movies
1. id (integer)
2. title (text)
3. score (real)
2. movie_cast
1. movie_id (integer)
2. cast_id (integer)
3. cast_name (text)
4. birthday (text)
5. popularity (real)
ii. [2 points] Import the provided movies.csv file into the movies table and movie_cast.csv into
the movie_cast table
1. You will write Python code that imports the .csv files into the individual tables. This will
include looping though the file and using the ‘INSERT INTO’ SQL command. Only use
relative paths while importing files since absolute/local paths are specific locations that
exist only on your computer and will cause the auto-grader to fail.
iii. [5 points] Vertical Database Partitioning. Database partitioning is an important technique that
divides large tables into smaller tables, which may help speed up queries. For this question you
will create a new table cast_bio from the movie_cast table (i.e., columns in cast_bio will
be a subset of those in movie_cast) Do not edit the movie_cast table. Be sure that when you
insert into the new cast_bio that the values are unique. Please read this page for an example
of vertical database partitioning.
cast_bio
1. cast_id (integer)
2. cast_name (text)
3. birthday (date)
6 Version 1
4. popularity (real)
b. [1 point] Create indexes. Create the following indexes for the tables specified below. This step increases
the speed of subsequent operations; though the improvement in speed may be negligible for this small
database, it is significant for larger databases.
i. movie_index for the id column in movies table
ii. cast_index for the cast_id column in movie_cast table
iii. cast_bio_index for the cast_id column in cast_bio table
c. [3 points] Calculate a proportion. Find the proportion of movies having a score > 50 and that has ‘war’ in
the name. Treat each row as a different movie. The proportion should only be based on the total number
of rows in the movie table. Format all decimals to two places using printf(). Do NOT use the
ROUND() function as it does not work the same on every OS.
Output format and sample value:
7.70
d. [4 points] Find the most prolific actors. List 5 cast members with the highest number of movie
appearances that have a popularity > 10. Sort the results by the number of appearances in descending
order, then by cast_name in alphabetical order.
Output format and sample values (cast_name,appearance_count):
Harrison Ford,2
e. [4 points] Find the highest scoring movies with the smallest cast. List the 5 highest-scoring movies that
have the fewest cast members. Sort the results by score in descending order, then by number of cast
members in ascending order, then by movie name in alphabetical order. Format all decimals to two
places using printf().
Output format and sample values (movie_title,movie_score,cast_count):
Star Wars: Holiday Special,75.01,12
War Games,58.49,33
f. [4 points] Get high scoring actors. Find the top ten cast members who have the highest average movie
scores. Format all decimals to two places using printf().
? Sort the output by average score in descending order, then by cast_name in alphabetical order.
? Do not include movies with score <25 in the average score calculation.
? Exclude cast members who have appeared in two or fewer movies.
Output format and sample values (cast_id,cast_name,average_score):
8822,Julia Roberts,53.00
g. [6 points] Creating views. Create a view (virtual table) called good_collaboration that lists pairs of
actors who have had a good collaboration as defined here. Each row in the view describes one pair of
actors who appeared in at least 3 movies together AND the average score of these movies is >= 40.
The view should have the format:
good_collaboration(
cast_member_id1,
7 Version 1
cast_member_id2,
movie_count,
average_movie_score)
For symmetrical or mirror pairs, only keep the row in which cast_member_id1 has a lower
numeric value. For example, for ID pairs (1, 2) and (2, 1), keep the row with IDs (1, 2). There
should not be any “self pair” where the value of cast_member_id1 is the same as that of
cast_member_id2.
NOTE: Full points will only be awarded for queries that use joins for part g.
Remember that creating a view will not produce any output, so you should test your view with a
few simple select statements during development. One such test has already been added to the
code as part of the auto-grading.
NOTE: Do not submit any code that creates a ‘TEMP’ or ‘TEMPORARY’ view that you may
have used for testing.
Optional Reading: Why create views?
i. [4 points] Find the best collaborators. Get the 5 cast members with the highest average scores
from the good_collaboration view, and call this score the collaboration_score. This
score is the average of the average_movie_score corresponding to each cast member,
including actors in cast_member_id1 as well as cast_member_id2. Format all decimals to
two places using printf().
? Sort your output by this score in descending order, then by cast_name alphabetically.
Output format (cast_id,cast_name,collaboration_score):
2,Mark Hamil,99.32
1920,Winoa Ryder,88.32
h. [4 points] SQLite supports simple but powerful Full Text Search (FTS) for fast text-based querying (FTS
documentation). Import movie overview data from the movie_overview.csv into a new FTS table called
movie_overview with the schema:
movie_overview
? id (integer)
? overview (text)
NOTE: Create the table using fts3 or fts4 only. Also note that keywords like NEAR, AND, OR and NOT
are case sensitive in FTS queries.
i. [1 point] Count the number of movies whose overview field contains the word ‘fight’. Matches
are not case sensitive. Match full words, not word parts/sub-strings.
e.g., Allowed: ‘FIGHT’, ‘Fight’, ‘fight’, ‘fight.’. Disallowed: ‘gunfight’, ‘fighting’, etc.
Output format:
12
ii. [2 points] Count the number of movies that contain the terms ‘space’ and ‘program’ in the
8 Version 1
overview field with no more than 5 intervening terms in between. Matches are not case
sensitive. As you did in h(i)(1), match full words, not word parts/sub-strings. e.g., Allowed: ‘In
Space there was a program’, ‘In this space program’. Disallowed: ‘In space you are not
subjected to the laws of gravity. A program.’, etc.
Output format:
6
Deliverables: Place all the files listed below in the Q2 folder
1. Q2_SQL.py: Modified file containing all the SQL statements you have used to answer parts a - h in
the proper sequence.
Q3 [15 points] D3 (v5) Warmup
Read chapters 4-8 of Scott Murray’s Interactive Data Visualization for the Web, 2nd edition . . This
simple reading provides important foundation you will need for Homework 2. This question uses D3
version v5, while the book covers D3 v4. What you learn from the book is transferable to v5.
NOTE the following important points:
1. We highly recommend that you use the latest Firefox browser to complete this question. We will grade
your work using Firefox 79.0 (or newer).
2. For this homework, the D3 library is provided to you in the lib folder. You must NOT use any D3 libraries
(d3*.js) other than the ones provided.
3. You may need to setup an HTTP server to run your D3 visualizations. The easiest way is to use
http.server for Python 3.x. Run your local HTTP server in the hw1-skeleton/Q3 folder.
4. We have provided sections of code along with comments in the skeleton to help you complete the
implementation. While you do not need to remove them, you may need to write additional code to make things
work.
5. All d3*.js files in the lib folder are referenced using relative paths in your html file. For example, since the
file “Q3/index.html” uses d3, its header contains:
It is incorrect to use an absolute path such as:
The 3 files that are referenced are:
- lib/d3/d3.min.js
- lib/d3-dsv/d3-dsv.min.js
- lib/d3-fetch/d3-fetch.min.js
9 Version 1
6. For a question that reads in a dataset, you are required to submit the dataset too (as part of your
deliverable). In your html / js code, use a relative path to read in the dataset file. For example, since Q3
requires reading data from the q3.csv file, the path should be ‘q3.csv’ and NOT an absolute path such as
“C:/Users/polo/HW1-skeleton/Q3/q3.csv”. Absolute/local paths are specific locations that exist only on your
computer, which means your code will NOT run on our machines when we grade (and you will lose points).
7. You can and are encouraged (though not required) to decouple the style, functionality and markup in the
code for each question. That is, you can use separate files for CSS, JavaScript and HTML — this is a good
programming practice in general.
Deliverables: Place all the files/folders listed below in the Q3 folder
● A folder named lib containing folders d3, d3-fetch, d3-dsv
● q3.csv: the file that we have provided you, in the hw1 skeleton under Q3 folder, which contains the
data that will be loaded into the D3 plot.
● index.(html / css / js) : when run in a browser, it should display a barplot with the following
specifications:
a. [1.5 points] Load the data from q3.csv using D3 fetch methods. We recommend d3.dsv().
b. [2 points] The barplot must display one bar per row in the q3.csv dataset. Each bar
corresponds to the running total of movies for a given year. The height of each bar
represents the running total. The bars are ordered by ascending time with the earliest
observation at the far left. i.e., 1880, 1890, ..., 2000
c. [1 point] The bars must have the same fixed width, and there must be some space between
two bars, so that the bars do not overlap.
d. [3 points] The plot must have visible X and Y axes that scale according to the generated
bars. That is, the axes are driven by the data that they are representing. Likewise, the ticks
on these axes must adjust automatically based on the values within the datasets, i.e., they
must not be hard-coded.
e. [2 point] Set x-axis label to ‘Year’ and y-axis label to ‘Running Total’.
f. [1 point] Use a linear scale for the Y axis to represent the running total (recommended
function: d3.scaleLinear()).
g. [3 points] Use a time scale for the X axis to represent year (recommended function:
d3.scaleTime()). It may be necessary to use time parsing / formatting when you load and
display the year data. The axis would be overcrowded if you display every year value so set
the X-axis ticks to display one tick for every 10 years.
h. [1 point] Set the HTML title tag and display a title for the plot.
■ Position the title “Running Total of TMDb Movies by Year” above the barplot.
■ Set the HTML title tag (i.e.,
版权所有:留学生编程辅导网 2021,All Rights Reserved 联系方式:QQ:99515681 电子信箱:99515681@qq.com
免责声明:本站部分内容从网络整理而来,只供参考!如有版权问题可联系本站删除。