Collections Module Day 2 of 3
Day 5 of 100
I really felt like I was in the weeds today and did not even get to use the collections tools we were supposed to. Instead, I spent my time learning the basics of dictionaries, lists, and dictionaries of lists... The challenge for the day was to read a CSV file of movie data and sort it by most popular director of films made after 1960. I stopped the tutorial video before Bob demonstrated how to use the collections tools because he said to challenge yourself to solve it without hints. Well, I took the challenge, but I don't really understand the tools in collections yet, so I had to resort to more basic methods.
First, I opened the CSV with csv.DictReader, which hands back each row as a dictionary keyed by the column headers:
import csv

# open the csv file
with open('movie_data.csv', 'r') as movie_data:
    # create a reader that yields each row as a dictionary
    csv_reader1 = csv.DictReader(movie_data)
...
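To make it concrete what DictReader hands back, here is a minimal sketch. It assumes a tiny in-memory stand-in for movie_data.csv with only the three columns used in this post (the real file has more), and the names sample_csv, reader, rows, and first are just for illustration.

```python
import csv
import io

# Tiny in-memory stand-in for movie_data.csv (assumption: only the three
# columns used in this post; the real file has many more).
sample_csv = io.StringIO(
    "director_name,movie_title,title_year\n"
    "Steven Spielberg,Jurassic Park,1993\n"
    "Clint Eastwood,Unforgiven,1992\n"
)

reader = csv.DictReader(sample_csv)
rows = list(reader)  # each row is a dictionary keyed by the header names

first = rows[0]
print(first['director_name'])  # Steven Spielberg
print(first['title_year'])     # 1993, but as a string, not a number

# Every value comes back as text, which is why the year needs int()
# before it can be compared with a number.
year_as_number = int(first['title_year'])
```

The string values are the reason the loop later in this post wraps title_year in int() before comparing it.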
Then, I initialized two dictionaries: one to keep track of how many movies each director had in the list, and another to hold a list of the movie title and release year for each movie the director made. The second one was pretty complex for me and it took me a while to figure out how to make a dictionary of lists (given that each inner list contained only two items, perhaps I should have used a tuple or a namedtuple, but this leaves room for improvement). I also hit a roadblock because initially I wanted to iterate through the DictReader object once to create the list of directors and then again to print out each of their movies, but it wouldn't let me work with the DictReader object twice. I even tried creating a second object and that didn't work. I fired up the PyCharm debugger and I could see the object was still there, but I couldn't work with it more than once. This is how it is supposed to work rather than a bug in my environment: a DictReader reads forward through the open file, so once the loop reaches the end there is nothing left to read. I worked around the problem by creating two dictionaries instead of one and populating both during the first read through the DictReader object:
...
    # Create dictionaries to store the list of directors and movies
    directors = {}
    movies = {}
    for line in csv_reader1:
        # Populate the dictionary of directors with the number of films
        if line['title_year'] and int(line['title_year']) > 1980:
            if line['director_name'] not in directors:
                directors[line['director_name']] = 0
            directors[line['director_name']] += 1
        # Populate the dictionary of movies by director
        if line['title_year'] and int(line['title_year']) > 1980:
            if line['director_name'] not in movies:
                movies[line['director_name']] = [[line['movie_title'], line['title_year']]]
            else:
                movies[line['director_name']].append([line['movie_title'], line['title_year']])
...
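On the DictReader roadblock: the sketch below reproduces the "can only loop once" behavior and shows one possible workaround. It assumes the second reader was created on the same open file, which is a guess about what happened, and it uses an in-memory sample in place of movie_data.csv. All variable names are illustrative.

```python
import csv
import io

# In-memory stand-in for movie_data.csv (assumption: trimmed to three columns).
sample_csv = io.StringIO(
    "director_name,movie_title,title_year\n"
    "Steven Spielberg,Hook,1991\n"
    "Woody Allen,Scoop,2006\n"
)

reader = csv.DictReader(sample_csv)
first_pass = [row['movie_title'] for row in reader]
second_pass = [row['movie_title'] for row in reader]  # reader is already used up

# A brand-new reader on the same open file starts at the end of the file,
# so it has nothing to read either.
another_reader = csv.DictReader(sample_csv)
third_pass = [row['movie_title'] for row in another_reader]

# One workaround: rewind the file, then store the rows in a list,
# which can be looped over as many times as needed.
sample_csv.seek(0)
rows = list(csv.DictReader(sample_csv))
reusable_first = [row['movie_title'] for row in rows]
reusable_second = [row['movie_title'] for row in rows]

print(first_pass)       # ['Hook', 'Scoop']
print(second_pass)      # []
print(third_pass)       # []
print(reusable_second)  # ['Hook', 'Scoop']
```

Storing the rows in a list keeps the whole file in memory, which should be fine for a few thousand rows. Filling both dictionaries in a single pass, as done above, avoids the problem without the extra list.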
Then I created a list of (director, number of films) pairs sorted in descending order by number of films (this is 100% copied):
...
# list of (director, film count) pairs in descending order by film count
sorted_directors = sorted(directors.items(),
                          key=lambda x: x[1], reverse=True)
...
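Here is a small sketch of what that sorted() call produces. The three counts are simply the number of films listed for each director in the output further down this post. The Counter comparison at the end is my assumption about a collections equivalent, not something taken from the tutorial.

```python
from collections import Counter

# Film counts taken from the output listed in this post.
directors = {'Woody Allen': 18, 'Steven Spielberg': 23, 'Clint Eastwood': 19}

# .items() gives (name, count) pairs; x[1] picks the count as the sort key.
sorted_directors = sorted(directors.items(),
                          key=lambda x: x[1], reverse=True)
print(sorted_directors)
# [('Steven Spielberg', 23), ('Clint Eastwood', 19), ('Woody Allen', 18)]

# The result is a list of tuples, not a dictionary, so it can be sliced.
top_two = sorted_directors[:2]

# collections.Counter offers most_common(n) for the same kind of ranking.
top_two_with_counter = Counter(directors).most_common(2)
print(top_two_with_counter)
# [('Steven Spielberg', 23), ('Clint Eastwood', 19)]
```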
Finally, I printed each director's rank along with the names and release years of their films:
...
# initiate a variable to track the rank of the directors
director_rank = 1
# print out the movies and release years for each of the top 20 directors
for director in sorted_directors[:20]:
    print(f'#{director_rank} {director[0]}')
    director_rank += 1
    # look up this director's films directly instead of scanning every key
    for film in movies[director[0]]:
        print(f'\t "{film[0].rstrip()}", {film[1]}')
    print("-" * 60)
The csv dataset has 5,000+ rows in it and my code was able to sort through it and output the following:
#1 Steven Spielberg
"Indiana Jones and the Kingdom of the Crystal Skull", 2008
"The BFG", 2016
"War of the Worlds", 2005
"The Adventures of Tintin", 2011
"Minority Report", 2002
"A.I. Artificial Intelligence", 2001
"The Lost World: Jurassic Park", 1997
"The Terminal", 2004
"Munich", 2005
"Hook", 1991
"War Horse", 2011
"Saving Private Ryan", 1998
"Lincoln", 2012
"Jurassic Park", 1993
"Catch Me If You Can", 2002
"Indiana Jones and the Last Crusade", 1989
"Bridge of Spies", 2015
"Amistad", 1997
"Indiana Jones and the Temple of Doom", 1984
"Schindler's List", 1993
"Raiders of the Lost Ark", 1981
"The Color Purple", 1985
"E.T. the Extra-Terrestrial", 1982
----------------------------------------
#2 Clint Eastwood
"Space Cowboys", 2000
"Invictus", 2009
"American Sniper", 2014
"Changeling", 2008
"Flags of Our Fathers", 2006
"Absolute Power", 1997
"Hereafter", 2010
"Blood Work", 2002
"Jersey Boys", 2014
"J. Edgar", 2011
"Midnight in the Garden of Good and Evil", 1997
"Mystic River", 2003
"Million Dollar Baby", 2004
"Gran Torino", 2008
"The Bridges of Madison County", 1995
"Firefox", 1982
"Unforgiven", 1992
"Letters from Iwo Jima", 2006
"Pale Rider", 1985
----------------------------------------
#3 Woody Allen
"Midnight in Paris", 2011
"The Curse of the Jade Scorpion", 2001
"To Rome with Love", 2012
"Bullets Over Broadway", 1994
"Deconstructing Harry", 1997
"Everyone Says I Love You", 1996
"Blue Jasmine", 2013
"Small Time Crooks", 2000
"Anything Else", 2003
"Vicky Cristina Barcelona", 2008
"Radio Days", 1987
"Hollywood Ending", 2002
"Match Point", 2005
"New York Stories", 1989
"Whatever Works", 2009
"You Will Meet a Tall Dark Stranger", 2010
"Celebrity", 1998
"Scoop", 2006
----------------------------------------
...
Now, I understand that what I created was not the challenge, exactly. It is too late at night to keep going and figure out how to rank the directors (with more than 5 films after 1960) by rating and then print a sorted list of movies by their rank. I feel I could do this, but it would just be more dictionaries, more lists, and more dictionaries of lists... To implement the more complex requirements of the challenge, I want to use the tools they suggest, and that means watching the rest of the tutorial tomorrow.
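For reference, here is a rough sketch of how the two hand-built dictionaries might look with collections tools. This is my own guess, not the tutorial's solution. The Movie namedtuple and the sample rows are made up for illustration (the last row is a placeholder with an empty year), and the 1980 cutoff matches the code in this post.

```python
from collections import Counter, defaultdict, namedtuple

# Hypothetical record type replacing the two-item inner lists.
Movie = namedtuple('Movie', 'title year')

# Sample rows shaped like what csv.DictReader yields (values are strings).
rows = [
    {'director_name': 'Steven Spielberg', 'movie_title': 'Hook', 'title_year': '1991'},
    {'director_name': 'Steven Spielberg', 'movie_title': 'Munich', 'title_year': '2005'},
    {'director_name': 'Woody Allen', 'movie_title': 'Scoop', 'title_year': '2006'},
    {'director_name': 'Woody Allen', 'movie_title': 'Placeholder', 'title_year': ''},
]

directors = Counter()       # missing keys start at 0
movies = defaultdict(list)  # missing keys start as an empty list

for row in rows:
    # Skip rows with no year, then apply the same cutoff as the code above.
    if row['title_year'] and int(row['title_year']) > 1980:
        name = row['director_name']
        directors[name] += 1
        movies[name].append(Movie(row['movie_title'].rstrip(),
                                  int(row['title_year'])))

for name, count in directors.most_common(20):
    print(f'{name} ({count} films)')
    for movie in movies[name]:
        print(f'\t "{movie.title}", {movie.year}')
```

With Counter and defaultdict the "if the key is not in the dictionary yet" checks disappear, and the namedtuple lets the print loop use movie.title and movie.year instead of film[0] and film[1].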