Movie versus Movie

What are your top ten favorite movies of all time? This is a very difficult question. But why?

In February 2014, I gave a talk on the challenges of measuring how much we like movies, books, songs, or products, combining insights from diverse sources such as the Netflix Prize, Duncan Watts's social experiments, and the early days of Facebook. The better we get at measuring and ranking enjoyment, the better we can customize websites, sort search results, find people with similar tastes, and recommend products. So can we overcome these challenges? Drumroll... Yes, we can, with Bayesian ranking algorithms.

I made this website as a proof of concept for these ideas. Now you can find out your top ten favorite movies as well.
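The site's actual algorithm is more involved, but the core idea can be sketched in a few lines: treat each movie's win rate in head-to-head matchups as a Beta-distributed unknown and rank by a conservative posterior estimate. The movie names and matchup counts below are made up for illustration.

```python
from math import sqrt

def bayesian_score(wins, losses, prior_wins=1, prior_losses=1):
    """Posterior mean of a Beta(1, 1) prior updated with win/loss counts,
    minus one posterior standard deviation to penalize small samples."""
    a = wins + prior_wins
    b = losses + prior_losses
    mean = a / (a + b)
    var = (a * b) / ((a + b) ** 2 * (a + b + 1))
    return mean - sqrt(var)

# matchup record: movie -> (wins, losses) in pairwise "movie versus movie" votes
matchups = {"Alien": (9, 1), "Gigli": (1, 9), "Heat": (5, 5)}
ranking = sorted(matchups, key=lambda m: bayesian_score(*matchups[m]), reverse=True)
print(ranking)
```

Subtracting a standard deviation keeps a movie that happens to win its only two matchups from leapfrogging one with a long, consistent record.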

Movie script clichés

Movies have their own language. Besides the whole "Let's go", "Yes sir", "Go go go" shenanigans that I usually don't use so much in my daily life (which is surprisingly bereft of machine guns and tanks), there are certain common sentences that you just keep hearing over and over again in movies.

I scraped ~1500 movie scripts to see which sentences are overused. Interestingly, "I love you" ended up pretty low, somewhere close to 90th place. Two ranks below "Shut up", in fact.
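The counting itself is straightforward; a minimal sketch (the two snippets below are invented stand-ins for real scripts) normalizes each sentence and tallies it across the corpus:

```python
import re
from collections import Counter

# toy stand-ins for scraped scripts
scripts = {
    "Movie A": "Let's go. I don't know. Shut up! I love you. I don't know...",
    "Movie B": "Go, go, go! I don't know. Shut up. Let's go.",
}

def sentences(text):
    # crude sentence splitter: break on ., !, ? and normalize case/whitespace
    parts = re.split(r"[.!?]+", text)
    return [p.strip().lower() for p in parts if p.strip()]

counts = Counter()
for script in scripts.values():
    counts.update(sentences(script))

print(counts.most_common(3))
```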

Also, it turns out Charlie Kaufman uses "I don't know", like, ALL the time: 19 times in Synecdoche, New York and 18 times in Eternal Sunshine of the Spotless Mind.

Feel free to explore which movies tend to use these clichés to the max.

Breast Cancer Surgery Risk

Breast cancer awareness and surgery techniques have come a long way in the last few decades. The five-year survival chances after surgery are much better these days than they were 30 years ago.

This is a tool that shows these chances based on the age of the patient, number of malignant lymph nodes, and the year of the surgery. The survival probability is based on a logistic regression model trained on patient data.
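For illustration, here is roughly what the prediction side of such a tool looks like. The coefficients below are made-up placeholders, not the ones fit to the actual patient data:

```python
from math import exp

# Hypothetical coefficients -- the real model was fit to patient data.
COEF = {"intercept": 2.0, "age": -0.02, "nodes": -0.35, "years_since_1980": 0.04}

def survival_probability(age, malignant_nodes, surgery_year):
    """5-year survival probability from a logistic regression model:
    apply the logistic function to a linear combination of the features."""
    z = (COEF["intercept"]
         + COEF["age"] * age
         + COEF["nodes"] * malignant_nodes
         + COEF["years_since_1980"] * (surgery_year - 1980))
    return 1.0 / (1.0 + exp(-z))

p = survival_probability(age=55, malignant_nodes=2, surgery_year=2010)
```

With these (illustrative) signs, the predicted probability drops with more malignant nodes and rises with later surgery years, matching the trend described above.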

I made this as part of a simple demonstration/tutorial of a d3 dashboard with AJAX requests for the Metis Data Science Bootcamp that I designed and taught.

Shortcomings of the Rotten Tomatoes Model

The assumptions behind Rotten Tomatoes' approach to aggregating movie ratings start looking a bit shaky when we bring out the magnifying glass. I wrote a blog post examining these problems.

Student Flows between Chicago Public Schools

School choice network for the Chicago Public School system. Chicago applies a free market system in education: high school students are assigned to their nearest neighborhood school, but they are free to apply to any other. There are also a few schools, like magnet schools, to which no students are assigned by default. The arcs show students leaving one school to enroll in another.

The idea is that demand for good schools will be high, bad schools will bleed students and eventually have to shut down, and the "rising tide" of competition will carry all schools up. But does this work as intended? That is not easy to answer. Let's look at these "demands" and how students flow.

The left figure (A) shows the student flows in 2001; only flows of more than 40 students are shown. The horrible mess in the right figure (B) is what happens when you look at the average flows over an entire decade (1994-2005).
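Thresholding the arcs for figure A is a one-liner once the flows are tabulated (the school names below are hypothetical):

```python
# flows[(origin, destination)] = number of students leaving origin to enroll in destination
flows = {("A", "B"): 120, ("B", "C"): 15, ("C", "A"): 48, ("A", "C"): 40}

# keep only arcs carrying more than 40 students, as in the left figure
visible = {edge: n for edge, n in flows.items() if n > 40}
```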

This is a complex system. If we want to understand it, we need to extract the underlying information in this mess.

The Underlying Map of Student Flows in Chicago

The network of student enrollment choices in Chicago over a decade is an unreadable mess. To gain insights about whether or not the school choice / free market idea is working as intended, we needed to understand the trends in these flows.

Using the positionality maximization method that I developed for directed, weighted networks (building upon the ideas of Roger Guimera), I detected groups of schools with similar flow patterns (left figure, A). In the right figure (B), I show these groups as single nodes and the flow biases as arcs; the thicknesses denote the deviation from what you would expect by random chance. These are the natural subdistricts of Chicago schools. As you can see, the flows are heterogeneous: there is isolation among different regions. School districts are usually analyzed as a whole, but this picture shows the pitfalls of that approach, since the flows behave differently in separate regions.

The Increasing Achievement Gap

In Chicago, free choice to enroll in any school is expected to foster competition and make all schools better over time.

Do students really show higher demand for better schools, as assumed?

This figure shows that the high-achieving, successful students do, but the low achievers, not so much. I defined the high and low achievers as the top and bottom quartiles in middle school standardized tests.
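Splitting students into those quartiles is simple once each student has a test score (toy data below; student IDs and scores are invented):

```python
def quartile_groups(scores):
    """Split students into bottom-quartile and top-quartile sets by test score."""
    ranked = sorted(scores, key=scores.get)
    q = len(ranked) // 4
    low = set(ranked[:q])    # bottom quartile: the "low achievers"
    high = set(ranked[-q:])  # top quartile: the "high achievers"
    return low, high

scores = {s: i for i, s in enumerate("abcdefgh")}  # 8 students, scores 0..7
low, high = quartile_groups(scores)
```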

Here we see the fraction of successful and unsuccessful students applying to schools of different qualities. The x axis shows each high school's PSAE score (a standardized high school test), which serves as a proxy for school quality: the higher the score, the better the school. These scores are published to all students so they can make informed choices.

However, even though all of these schools are open enrollment, more green (successful) students apply to the good schools, and more red (unsuccessful) students apply to the bad schools. This concentrates good students in good schools and bad students in bad schools, creating a feedback effect that widens the achievement gap further.

This insight should prompt us to acknowledge that the free market idea is not necessarily working in the ideal sense, at least for the already unsuccessful kids. We should take action before the achievement gap grows even wider.

The Concept of Network Positions

While modularity maximization techniques for networks (which can extract densely connected modules, like the friendship groups in A, B, and C) are well developed, they cannot model different interaction biases (such as the sexual relations exemplified in D, E, and F). I developed a new method, positionality maximization, for directed, weighted networks that can extract positional relations of any complexity, including the ones shown in C and F.

Challenges of Topic Modeling

Topic modeling is an unsupervised method to find persistent topics in the contents of a text corpus. Dirichlet-based topic modeling algorithms such as LDA lose quite a bit of accuracy when topic sizes in the corpus are unbalanced (which you would expect to be the case in the real world). They tend to find more or less equally prevalent topics (A).

One way to deal with this problem is to increase the number of topics to find: now LDA can find higher-resolution, almost "sub"-topics. Even if the subtopics are similarly sized, topics of different sizes will simply be represented by different numbers of subtopics. Unfortunately, as the number of topics increases, the already rough likelihood landscape becomes much rougher. As a result, different runs of the same algorithm end up at completely different local optima, killing LDA's reproducibility (B).

Andrea Lancichinetti, myself, Jane Wang, Daniel Acuna, Konrad Kording, and Luis Amaral developed a topic modeling algorithm called TopicMapping that does not suffer from these problems. Even in the case of unbalanced topic sizes, TopicMapping performs with high accuracy and high reproducibility (C).

How TopicMapping Works

TopicMapping is a new topic modeling algorithm that we developed (with Andrea Lancichinetti and others). It has high accuracy and high reproducibility, even with highly heterogeneous topic distributions over documents (a case where LDA fails). Here is how it works, explained with an example.

A. Let's say the corpus comprises six documents: three about biology and three about math. B. We build a network connecting words, with weights equal to their dot product similarity. C. We filter out non-significant weights, using a p-value threshold of 5%. Running Infomap on this network, we get two clusters and two isolated words ("study" and "research"). D. We refine the word clusters using a topic model: the two isolated words can now be found in both topics.
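Steps A and B can be sketched in plain Python on a toy corpus (the documents below are invented; the significance filter of step C and the Infomap run are omitted, so the bridging words here remain weakly connected to both topics rather than isolated):

```python
from itertools import combinations

# toy corpus: three "biology" documents and three "math" documents
docs = [
    "gene cell study", "cell gene research", "gene cell study",
    "theorem proof study", "proof theorem research", "theorem proof study",
]

# each word becomes a vector of its counts across documents
vocab = sorted({w for d in docs for w in d.split()})
vectors = {w: [d.split().count(w) for w2 in [w] for d in docs] for w in vocab}

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# connect word pairs whose co-occurrence (dot product of count vectors) is nonzero
edges = {(u, v): dot(vectors[u], vectors[v])
         for u, v in combinations(vocab, 2)
         if dot(vectors[u], vectors[v]) > 0}
```

Words that co-occur within the biology documents ("gene", "cell") get strong links to each other and none to the math words, which is what lets a community detection pass pull out the two topics.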

Two-way Mapping between Time Series and Networks

This figure illustrates the technique we developed with Andriana Campanharo, Dean Malmgren, and Luis Amaral to convert time series and networks into each other, allowing us to apply the analysis toolset developed in one domain to the other.

For example, networks made from unhealthy patients' heart EKGs are easily detectable by the existence of a small, separate network module that is absent in networks made from healthy EKG data.
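A minimal sketch of the time-series-to-network direction of the mapping, under the assumption that nodes are value quantiles and edges count transitions between consecutive points (a simplification of the published method; the series below is invented):

```python
from collections import Counter

def series_to_network(series, n_quantiles=4):
    """Map a time series to a weighted transition network:
    nodes are quantile bins, edge weights count consecutive transitions."""
    ranked = sorted(series)
    def quantile(x):
        # index of the quantile bin that x falls into
        rank = sum(1 for v in ranked if v < x)
        return min(n_quantiles - 1, rank * n_quantiles // len(series))
    labels = [quantile(x) for x in series]
    return Counter(zip(labels, labels[1:]))

series = [0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6]
network = series_to_network(series)
```

An oscillating series like this one concentrates its weight on a few back-and-forth edges, while a noisier series spreads weight across many edges, which is the kind of structural difference the EKG example above exploits.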

Read more in the paper.

d3 Manhattan Map Example with Colored Districts

This is a map of Manhattan with its districts colored (using fake data). It illustrates how to visualize district-resolution information on a borough map with d3.js.