Rotten Tomatoes and Selection Bias

The AVI sent along a link (from 2013) this week about movies that audiences loved and critics hated, as judged by their Rotten Tomatoes scores.

For those of you not familiar with Rotten Tomatoes, it’s a site that aggregates movie reviews so you can see overall what percentage of critics liked a movie. After a few years of that, they also allowed users to leave reviews so you can see what percentage of audience members liked a movie. This article pulled out every movie with a critic score and an audience score in their database and figured out which ones were most discordant. The top movies audiences loved/critics hated are here:

The movies most loved by critics/hated by audiences are here:
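(As an aside, if you wanted to build a list like this yourself, the underlying calculation is simple. Here is a minimal sketch with made-up scores and invented column names; I have no idea what the article’s actual code or Rotten Tomatoes’ real data format looks like.)

    # Made-up example of how a "most discordant" ranking could be computed; the
    # scores and column names here are invented, not Rotten Tomatoes' real data.
    import pandas as pd

    movies = pd.DataFrame({
        "title": ["Movie A", "Movie B", "Movie C"],
        "critic_score": [17, 95, 60],
        "audience_score": [47, 40, 62],
    })

    # Positive gap: audiences liked it more than critics; negative gap: the reverse.
    movies["gap"] = movies["audience_score"] - movies["critic_score"]
    print(movies.sort_values("gap", ascending=False))  # audience favorites first
    print(movies.sort_values("gap"))                   # critic favorites first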

The article doesn’t offer a lot of commentary on these numbers, but I was struck by how much selection bias goes into them. While movie critics are often (probably fairly) accused of favoring “art pieces” or “movies with a message” over blockbuster entertainment, I think there’s some skewing of audience reviews as well. Critic and audience scores are interesting because critics are basically assigned to everything, and are supposed to write their reviews with the general public in mind. Audience members select movies they are already interested in seeing, and then review them based solely on personal feelings.

For example, my most perfect moviegoing experience ever was seeing “Dude, Where’s My Car?” in the theater. I was in college when it came out, and had just finished some grueling final exams. My brain was toast. A friend suggested we go, and the theater was full of other college students who had also just finished their exams. It was a dumb movie, a complete stoner comedy from the early 2000s. We all laughed uproariously. I have very fond memories of this, and the movie in general. It was a great movie for a certain moment in my life, but I would probably never recommend it to anyone. It has a 17% critic score on Rotten Tomatoes, and a 47% audience score. This seems very right to me. No one walks into a movie with that title thinking they are about to see something highbrow, and critics were almost certainly not the target audience. Had more of the population been forced to go to that movie as part of their employment, the audience score would almost certainly dip. If only the critics who wanted to see it went, their score would go up.

This is key with lists like this, especially when we’re looking at movies that came out before the site existed. Rotten Tomatoes started in 1998, but a quick look at the top 20 audiences loved/critics hated shows that the top 3 most discordant movies all came out prior to that year. So essentially the user scores are all from people who cared enough about the movie to go in and rank it years after the fact.

For the critics loved/users hated movies, the top one came out in 1974. I was confused about the second one (Bad Biology, a sex/horror movie that came out in 2008), but noted that Rotten Tomatoes no longer assigns it a critic score. My suspicion is that “100%” might have been one review. From there, numbers 3-7 are all pre-1998 films. In the early days of Rotten Tomatoes you could sort movies by critic score, so I suspect some people decided to watch those movies based on the good critic score and got disappointed. Who knows.

It’s interesting to think about all of this and how websites can improve their counts. Rotten Tomatoes recently had to stop allowing users to rate movies before they came out, as they found too many people were using it to try to tank movies they didn’t like. I wonder if sending emails to users on a regular basis asking them to rate (or mark “I haven’t seen this” for) 10 random movies might help lower the bias in the audience score. I’m not sure, but as we crowdsource more and more of our rankings, bias prevention efforts may have to get a little more targeted. Interesting to think about.

 

5 Things About Appendicitis Rates Over Time

A close relative of mine had a bit of a scare this week when she ended up admitted to the hospital for (what was ultimately diagnosed as) acute appendicitis. She ended up in surgery with a partially ruptured appendix, though she’s doing fine now.

When I mentioned this saga to a coworker, she said she felt like she didn’t hear much about appendicitis anymore. We started wondering what the rates were, and if they were going down over time. Of course this meant I had to take a look, so here’s what I found:

  1. The rates have fallen over the decades, and no one is really sure why. This paper suggests that rates fell by 15% between 1970 and the mid-80s, but no one’s sure what happened. Did appendicitis become less common? Less deadly? Or did our diagnostic tools get better and some number of cases get reclassified? This is a valid question because of this next point….
  2. A surprisingly high number of appendectomies aren’t necessary. An interesting study from 2011 showed that about 12% of patients who get an appendectomy end up not getting diagnosed with appendicitis. They suggest that this rate has been falling over time, which could have helped the numbers in point #1. Is it the whole story? It’s not clear! But definitely something to keep in mind.
  3. The number of incorrectly removed appendixes may not be going down. Contrary to the assertions of the study above, it’s not certain that misdiagnosed appendicitis is going down. Despite better diagnostics, it appears that easier surgical techniques (i.e. laparoscopic surgeries) actually may have increased the rate of unnecessary surgeries. This sort of makes sense. If you have to do a big complicated surgery, you are going to really want to verify that it’s necessary before you go in. As the surgery gets easier, you may focus more on getting people to surgery more quickly.
  4. The data sources may not be great. One of the more interesting papers I found compared the administrative database (based on insurance coding) vs a pathology database and found that insurance coding consistently underestimated the number of cases of appendicitis. Since most studies have been done off of insurance code databases, it’s not clear how this has skewed our view of appendicitis rates.
  5. Other countries seem to be seeing a drop too. Whatever’s going on with appendicitis diagnosis, the whole world seems to be seeing a similar trend. Greece has seen a 75% decrease. England has also seen falling rates. To be fair though, some data shows it’s mixed: developed countries seem to be stabilizing, newly developed countries seem to see high rates.

So who knew how hard it was to get a handle on appendicitis rates? I certainly thought it would be a little more straightforward. Always fascinating to explore the limits of data.

What’s My Age Again?

One of my favorite weird genres of news story occurs when the journalist/editor/newsroom all forget how old they are in relation to the people they are writing about. This phenomenon is what often gives rise to articles about millennials that don’t actually quote millennials, or articles about millennial parents of small children that compare them to Boomer parents of teenage children. I also see this in the working world, where there are still seminars about “how to manage millennials”, even though the oldest millennials are nearing 40 (and age discrimination laws!) and new college grads are most likely “Gen Z”.

Anyway, given my love for this genre of story, I got a kick out of a Megan McArdle Tweet this week that pointed out a Mother Jones article that fell a bit into this trap.

She was pointing to this article that explained how Juul (an e-cigarette manufacturer) had been marketing to teens for several years. As proof, they cited this:

Now for many millennials, this makes perfect sense. How could you screen three teen movies like “Can’t Hardly Wait”, “SCREAM” and “Cruel Intentions” and say you were marketing to adults? Well, that depends on your perspective. Can’t Hardly Wait came out in 1998, SCREAM in 1996 and Cruel Intentions in 1999. Current 14-18 year olds were born between 2001 and 2005. Does a party featuring movies made 5 years before you were born sound like it is trying to attract current teens? Or is it more likely that it would draw those who were teens at the time they were released….i.e. those in their early 30s?

As a quick experiment, subtract 5 years from your birth year, Google “movies from ______”, take out the actual classics/Oscar winners and see how many of those movies you would have gone to an event to see at age 16. I just did it for myself and I’d have gone to see Rocky (though that’s an actual classic) and that’s pretty much it. I enjoyed The Omen, but not until later in college, ditto for Murder by Death and Network. In thinking back to my teen years, I did attend an event where Jaws was screened at a pool party, but I suspect the appeal of Jaws is more widespread/durable than “Can’t Hardly Wait”.

To be clear, I have very little insight into Juul’s marketing plan or anything about them other than what I’ve seen on the news. What I do know though is that some movies appeal to broad audiences, and some appeal to a very narrow band of people who saw them at the right age. Teen movies in particular do not tend to appeal endlessly to teens, but rather continue to appeal to the cohort who originally saw them.

There is an odd phenomenon with some movies where they do poorly at the box office, then pick up steam on DVD or cable broadcasts. The movie Hocus Pocus (1993) is a good example. It was a flop at the box office, but was rebroadcast on ABC Family and the Disney Channel and then landed on a kids’ “13 Nights of Halloween” special in the early 2000s. This has caused the very odd phenomenon of kids who weren’t born when it was released remembering it as a movie of their childhood more than those in the “right” cohort would have.

So basically I think it can be a bit of a challenge to triangulate what pop culture appeals to what age groups, particularly once you are out of that age group. Not that I’m judging. I struggled enough to figure out what was cool with teens when I actually was one. I have no idea how I’d figure it out now.

 

Diagnoses: Common and Uncommon

There was an interesting article in the Washington Post this week, about a man with a truly bizarre disorder. Among many other terrible symptoms, he essentially never has to go to the bathroom while he’s standing up and going about his day and appears to be dehydrated no matter how much he drinks, but the minute he lays down at night he has to urinate copiously and shows signs of being overhydrated. He has so many bizarre symptoms that he ended up in something called the Undiagnosed Disease Program, a fascinating group run by the NIH that seeks to find diagnoses for people who have baffled other physicians. They conduct all sorts of testing and try to either find people a diagnosis or to add their information to a database in the hopes that eventually they’ll get some information that will help them figure this out. The overall goal is to both help people and add to our collective knowledge about the human body.

Outlier medical cases are truly fascinating to many people, myself included. The WaPo column is actually part of a series called “medical mysteries”. Oliver Sacks made a whole career out of writing books about them. These cases make it into our textbooks in school, and they are the stories that stick in our minds. These aren’t even one in a million cases, they are often one in 10 or 100 million. The guy in the WaPo story might even be 1 in a billion or 10 billion.

I am also fascinated by these stories in part because last year I started in on a medical mystery of my own. It started innocuously enough: random bouts of nausea, random bouts of extreme fatigue, then noticeable increased sensitivity to smells, tastes and pain. I assumed I was pregnant. I wasn’t.

I followed up with my doctor who confirmed that my hormone and other blood levels were fine. She ran tests to see if I was being poisoned, if I had a weird vitamin deficiency or had ODed on something accidentally. She referred me to a couple of other doctors. The bouts came and went, but they actually started to get very disconcerting. My increased sense of smell meant that my car would frequently smell strongly of gas…something most of us take to mean there’s a problem. I couldn’t wear certain clothes because it felt like the seams or zippers were cutting my skin, but my skin showed no signs of redness. I couldn’t drink my coffee some mornings because I was convinced it was scalding my mouth. When I ate food I was convinced I could still taste the wrapper. Sensory information is supposed to help us make our way through the world, and to have it suddenly shifting around on you is incredibly disorienting.

Over the course of 6 months I saw 7 different doctors, all of whom were baffled. Since I work at a hospital I informally talked to half a dozen other NPs/PAs/MDs, and none of them had any idea either. The nausea and fatigue could come with hundreds of disorders, but nervous system hypersensitivity is a much less common symptom.

In the course of all this, the Assistant Village Idiot made a comment about how I should remember that strange symptoms were more likely to be an uncommon presentation of a common thing than an uncommon thing. The most experienced doctors I saw also mentioned the limitations of diagnosis. We build diagnoses based on the most common presentations of things, but we often don’t know if there are other possible presentations. We give names to clusters of symptoms because we see them together often, but it’s possible the biological underpinnings of a disorder could produce presentations we don’t see as often. One doctor mentioned that in 6 months or a year I might add more symptoms that made things much clearer.

After about 6 months I still had no answers, but got some relief when I discovered that a magnesium supplement I’d taken to help me sleep seemed to help my symptoms. My doctor told me I could increase the dose and take it daily, and over the course of 6 weeks it mostly worked. I had relief, even if I still had no answers.

That was in January, and for the last 8 months I’ve seen small flares of symptoms that magnesium seemed to help. Then, about a month ago a new symptom started that made the whole thing much clearer: I got a headache. A one-sided, splitting “gotta go lay down in a dark room” headache. A week or two later I got another one, then I got another one. I had always gotten a handful of migraines a year, but with the sudden change in frequency I started to notice something. For two days before I would be extra sensitive to light, pain, and smell. Sound too. Then during the migraine I would be incredibly nauseous, then the day after I would be so fatigued I could barely get out of bed. I looked back at the journal of my mystery symptoms I’d started keeping last year and realized it fit the same pattern. The symptoms that seemed so mysterious were actually part of the very classic migraine prodrome/aura/postdrome pattern. It was then that I learned about the existence of acephalgic or “silent” migraines…..migraines that occur with all of the symptoms except the classic headache. My doctor confirmed my suspicions. I had been having chronic migraines without the headache, which had now developed into chronic migraines with the headache. Once the headache appeared, my case was textbook. I got prescriptions for Imitrex and Fioricet along with a prophylactic medication.

Now per the Wiki page (and everything else I’ve read), acephalgic migraines are uncommon. It’s not particularly normal to get them as badly as I did without regular migraines, though they admit the data may be flawed. Since most people wouldn’t identify those symptoms as migraines, they might have an underreporting problem. Regardless, the AVI’s point stood: this was an uncommon presentation of a common thing, not an uncommon disorder.

I like this story both because I am relieved to have a diagnosis and because it is an interesting example of the entire concept of base rates. Migraines are the third most common disease in the world, after tension-type headaches and dental caries (cavities). One out of every 7 people gets them. If we assume that my symptoms are highly unusual for migraine sufferers….say 1% of cases….that still means about 15 out of 10,000 people will get them. For comparison, schizophrenia is 1.5 out of 10,000. Epilepsy is 120 out of 10,000, or about 10% the rate of migraine sufferers. A small percentage of a big number is often still a big number. An uncommon presentation of a common disorder can often be more common than uncommon disorders.
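To make that base rate arithmetic concrete, here’s a back-of-the-envelope sketch. The migraine prevalence and the schizophrenia/epilepsy figures are the ones quoted above; the 1% “unusual presentation” share is just the assumption I was playing with, not a measured number.

    # Back-of-the-envelope version of the base rate argument above. The migraine
    # prevalence (1 in 7) and the schizophrenia/epilepsy figures are the ones
    # quoted in the post; the 1% "unusual presentation" share is an assumption,
    # not a measured number.
    POPULATION = 10_000

    migraine_prevalence = 1 / 7
    unusual_share = 0.01  # assume only 1% of migraine sufferers present this way

    unusual_migraines = POPULATION * migraine_prevalence * unusual_share
    print(f"Unusual migraine presentations: ~{unusual_migraines:.0f} per 10,000")  # ~14
    print("Schizophrenia (figure quoted above): 1.5 per 10,000")
    print("Epilepsy (figure quoted above): 120 per 10,000")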

See, everything’s a stats lesson if you look hard enough. While I’m relieved to have a diagnosis, the downside of this is that the more frequent headaches are impacting my ability to sit in front of a screen as often, which may impact blogging. While we figure out what works to reduce the frequency of these, I may end up doing some more archives posting, maybe a top 100 post countdown like the AVI has been doing. We’ll see. While my doctor is great, any good resources are appreciated!

From the Archives: Birthday Math

Three years ago on my birthday, I put up a post of 5 fun math-related birthday things. One of these was the “Cheryl’s Birthday” math problem, which had gone viral the year prior. Here it is:

I was thinking about this recently, and found out it now had its own Wikipedia page.

The Wiki informed me that there had been a follow-up problem released by the same university:

Albert and Bernard now want to know how old Cheryl is.
Cheryl: I have two younger brothers. The product of all our ages (i.e. my age and the ages of my two brothers) is 144, assuming that we use whole numbers for our ages. 
Albert: We still don’t know your age. What other hints can you give us? 
Cheryl: The sum of all our ages is the bus number of this bus that we are on. 
Bernard: Of course we know the bus number, but we still don’t know your age. 
Cheryl: Oh, I forgot to tell you that my brothers have the same age. 
Albert and Bernard: Oh, now we know your age.

So what is Cheryl’s age?

It’s a fun problem if you have a few minutes. I thought it was easier than the first one, but it still requires actually sitting down and doing a few steps to get to the answer. Very hard to shortcut this one. It also retains the charm of the original problem of making you flip your thinking around a bit to think about what you don’t know and why you don’t know it.

The answer’s at the bottom of the Wiki page if you’re curious.
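If you’d rather let a computer do the enumeration, here’s a minimal brute-force sketch. It assumes ages are positive whole numbers and that Cheryl is strictly older than both brothers (which is how I read “two younger brothers”); it’s just a way to check your work, not an official solution.

    # Brute-force sketch of the follow-up puzzle. Assumptions: ages are positive
    # whole numbers and Cheryl is strictly older than both of her brothers.
    from collections import defaultdict
    from itertools import product as cartesian

    TARGET = 144

    # Every (cheryl, brother1, brother2) triple whose ages multiply to 144.
    triples = [
        (c, b1, b2)
        for c, b1, b2 in cartesian(range(1, TARGET + 1), repeat=3)
        if c * b1 * b2 == TARGET and b1 <= b2 < c  # brothers younger; b1 <= b2 avoids duplicates
    ]

    # "We still don't know your age": the bus number (the sum of the ages) must
    # fit more than one possible age for Cheryl.
    by_sum = defaultdict(list)
    for t in triples:
        by_sum[sum(t)].append(t)
    ambiguous = [t for group in by_sum.values() if len({c for c, _, _ in group}) > 1 for t in group]

    # "My brothers have the same age": that narrows it to a single triple, and
    # the first entry of that triple is Cheryl's age.
    print([t for t in ambiguous if t[1] == t[2]])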

There’s More to that Story: 4 Psych 101 Case Studies

Well it’s back-to-school time folks, and for many high schoolers and college students, this means “Intro to Psych” is on the docket. While every teacher teaches it a little differently, there are a few famous studies that pop up in almost every textbook. For years these studies were taken at face value; however, with the onset of the replication crisis many have gotten a second look and have been found to be a bit more complicated than originally thought. I haven’t been in a psych classroom for quite a few years so I’m hopeful the teaching of these has changed, but just in case it hasn’t, here’s a post with the extra details my textbooks left out.

Kitty Genovese and the bystander effect: Back in my undergrad days, I learned all about Kitty Genovese, murdered in NYC while 37 people watched and did nothing. Her murder helped coin the term “bystander effect”, where large groups of people do nothing because they assume someone else will. It also helped prompt the creation of “911”, the emergency number we can all call to report anything suspicious.

So what’s the problem? Well, the number 37 was made up by the reporter, and likely not even close to true. The New York Times had published the original article reporting on the crime, and in 2016 called their own reporting “flawed“. A documentary was made in 2015 by Kitty’s brother investigating what happened, and while there are no clear answers, what is clear is that a murder that occurred at 3:20am probably didn’t have 38 witnesses who saw anything, or even understood what they were hearing.

Zimbardo/Stanford Prison Experiment: The Zimbardo (or Stanford) Prison Experiment is a famous experiment in which study participants were asked to act as prisoners or guards in a multi-day recreation of a prison environment. However, things quickly got out of control, and the guards got so cruel and the prisoners so rowdy that the whole thing had to be shut down early. This showed the tendency of good people to immediately conform to expectations when they were put in bad circumstances.

So what’s the problem? Well, basically the researcher coached a lot of the bad behavior. Seriously, there’s audio of him doing it. This directly contradicts his own statements later that there were no instructions given. Reporter Ben Blum went back and interviewed some of the participants who said they were acting how they thought the researchers wanted them to act. One guy said he freaked out because he wanted to get back to studying for his GREs and thought the meltdown would make them let him go early. Can bad circumstances and power imbalances lead people to act in disturbing ways? Absolutely, but this experiment does not provide the straightforward proof it’s often credited with.

The Robbers Cave Study: A group of boys are camping in the wilderness and are divided into two teams. They end up fighting each other based on nothing other than assigned team, but then come back together when facing a shared threat. This shows how tribalism works, and how we can overcome it through common enemies.

So what’s the problem? The famous/most-reported-on study was take two of the experiment. In the first version the researchers couldn’t get the boys to turn on each other, so they did a second try, eliminating everything they thought had added group cohesion in the first try, and finally got the boys to behave as they wanted. There’s a whole book written about it, and it showcases some rather disturbing behavior on the part of the head researcher, Muzafer Sherif. He was never clear with the parents about what type of experiment the boys were subjected to, and he actually both destroyed personal belongings himself (to blame it on the other team) and egged the boys on in their destruction. When Gina Perry wrote her book she found that many of the boys who participated (and are now in their 70s) were still unsettled by the experiment. Not great.

Milgram’s electric shock experiment: A study participant is brought into a room and asked to administer an electric shock to a person they can’t see who is participating in another experiment. When the hidden person gets a question “wrong” they are supposed to zap them to help them learn. When they zap them, a recording plays of someone screaming in pain. It is found that 65% of people will administer a fatal shock to a person as long as the researcher keeps encouraging them to do so. This shows that our obedience to authority can override our own ethics.

So what’s the problem? Well, this one’s a little complicated. The original study was actually 1 of 19 studies conducted, all with varying rates of compliance. The most often reported findings were from the version of the experiment that resulted in the highest amount of compliance. A more recent study also reanalyzed participants’ behavior in light of their (self-reported) belief about whether the subject was actually in pain. One of the things the researchers told people to get them to continue was that the shocks were not dangerous, and it also appears many participants didn’t think what they were participating in was real, and it wasn’t. They found that those who either believed the researchers’ assurances or expressed skepticism about the entire experiment were far more likely to administer higher levels of voltage than those who believed the experiment was legit. To note though, there have been replication attempts that did find compliance rates comparable to Milgram’s, though the shock voltage has always been lower due to ethics concerns.

So overall, what can we learn from this? Well first and foremost that once study results hit psych textbooks, it can be really hard to correct the error. Even if kids today aren’t learning these things, many of us who took psych classes before the more recent scrutiny of these studies may keep repeating them.

Second, I think that we actually can conclude something rather dark about human nature, even if it’s not what we first thought. The initial conclusion of these studies is always something along the lines of “good people have evil lurking just under the surface”, when in reality the researchers had to try a few times to get it right. And yet this also shows us something….a person dedicated to producing a particular outcome can eventually get it if they get enough tries. One suspects that many evil acts were carried out after the instigators had been trying to inflame tensions for months or years, slowly learning what worked and what didn’t. In other words, random bad circumstances don’t produce human evil, but dedicated people probably can produce it if they try long enough. Depressing.

Alright, any studies you remember from Psych 101 that I missed?

Absolute Numbers, Proportions and License Suspensions

A few weeks ago I mentioned a new-ish Twitter account that was providing a rather valuable public service by Tweeting out the absolute vs relative risk as stated in various news articles. It’s a good account because far too often scientific news is reported with things like “Cancer risk doubled” (relative risk) when the absolute risk went from .02% to .04%. Ever since I saw that account I’ve wondered about starting an “absolute numbers vs proportions” type account, where you follow up news stories that cite absolute numbers and check the proportional rates to see if they tell a different story.
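To make the distinction concrete, here’s a tiny sketch using the made-up 0.02%/0.04% numbers above:

    # Toy illustration of relative vs. absolute risk, using the made-up numbers
    # from the paragraph above (a 0.02% baseline risk rising to 0.04%).
    baseline_risk = 0.0002  # 0.02%
    exposed_risk = 0.0004   # 0.04%

    relative_risk = exposed_risk / baseline_risk      # 2.0, i.e. "risk doubled"
    absolute_increase = exposed_risk - baseline_risk  # 0.0002, i.e. 2 extra cases per 10,000
    print(f"Relative risk: {relative_risk:.1f}x")
    print(f"Absolute increase: {absolute_increase:.4%}")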

I was thinking about this again today because I got a request from some of my New Hampshire-based readers this week to comment on a recent press conference held by the Governor of New Hampshire about their recent investigation into their license suspension practices.

Some background: A few months ago there was a massive crash in Randolph, New Hampshire that killed 7 motorcyclists, many of them former Marines. The man responsible for the accident was a truck driver from Massachusetts who crossed into their lane. In the wake of the tragedy, a detail emerged that made the whole thing even more senseless: he never should have been in possession of a valid driver’s license. In addition to infractions spread over several states, a recent DUI in Connecticut should have resulted in him losing his commercial driver’s license in Massachusetts. However, it appears that the Massachusetts RMV had never processed the suspension notice, so he was still driving legally. Would suspending his license have stopped him from driving that day? It’s not clear, but it certainly seems like things could have played out differently.

In the wake of this, the head of the Massachusetts RMV resigned, and both Massachusetts and New Hampshire ordered reviews of their processes for handling suspension notices sent to them by other states.

So back to the press conference. In it, Governor Sununu revealed the findings of their review, but took great care to emphasize that New Hampshire had done a much better job than Massachusetts in reviewing their out of state suspensions. He called the difference between the two states “night and day” and said “There was massive systematic failure in the state of Massachusetts. [The issue in MA was] so big; so widespread; that was not the issue here.”

He then provided more numbers to back up his claim. The two comparisons in the article above say that NH found their backlog of notices was 13,015, while MA’s was 100,000. NH had sent suspension notices to 904 drivers based on the findings; MA had to send 2,476. Definitely a big difference, but I’m sure you can see where I’m going with this. The population of MA is just under 7 million people, and NH is just about 1.3 million. Looking at just the number of licensed drivers, it’s 4.7 million vs 1 million. So basically we’ve got a 5:1 ratio of MA to NH people. Thus a backlog of 13,000 would proportionally be 65,000 in MA (agreeing with Sununu’s point), but the 904 suspensions are proportionally much higher than MA’s 2,476 (disagreeing with Sununu’s point). If you were to change it to the standard “per 100,000 people”, MA sent suspension notices to 52 people per 100,000 drivers, NH sent 90 per 100,000.
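Here’s a quick sketch of that per-driver arithmetic. The backlog and notice counts are the ones reported above; the driver counts are the rough 4.7 million (MA) and 1 million (NH) figures.

    # Quick sketch of the per-driver comparison above. Backlog and notice counts
    # are the ones reported; driver counts are the rough 4.7 million (MA) and
    # 1 million (NH) figures cited in the post.
    states = {
        "MA": {"drivers": 4_700_000, "backlog": 100_000, "notices": 2_476},
        "NH": {"drivers": 1_000_000, "backlog": 13_015, "notices": 904},
    }

    for name, s in states.items():
        backlog_rate = s["backlog"] / s["drivers"] * 100_000
        notice_rate = s["notices"] / s["drivers"] * 100_000
        print(f"{name}: backlog {backlog_rate:,.0f} per 100,000 drivers, "
              f"suspension notices {notice_rate:.0f} per 100,000 drivers")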

I couldn’t find the whole press conference video or the white paper they said they wrote, so I’m not sure if this proportionality issue was mentioned, but it wasn’t in anything I read. There were absolutely some absurd failures in Massachusetts, but I’m a bit leery of comparing absolute numbers when the base populations are so different. Base rates are an important concept, and one we should keep in mind, with or without a cleverly named Twitter feed.

Math aside, I do hope that all of these reforms help prevent similar issues in the future. This was a terrible tragedy, and unfortunately one that uncovered real gaps in the system that was supposed to deal with this sort of thing. Here’s hoping for peace for the victims’ families, and that everyone has a safe and peaceful Labor Day weekend!