Nonfiction palate cleanser. A really interesting look at big data and at how the internet is a wealth of information for those who want to study human behavior and trends. Some definitely creepy info about how companies (like Google and FB) run experiments all the time, without our knowledge, to see what we like better. It also contains a great discussion of morality in data collection, as well as an easily understood explanation of correlation vs. causation.

People’s search for information is, in itself, information. By searching for what people searched for, Google can highlight trends that might not be picked up anywhere else. The power of Google is that people tell the giant search engine things they might not tell anyone else. This is Big Data: data that concerns millions of people and cannot be seen from the ground, as it were.

- Good data science is surprisingly intuitive. At its core, data science is about spotting patterns and predicting how one variable will affect another. So if humans are naturally data scientists, and data science is intuitive, why do we need computers and statistical software? Sometimes there is insufficient experience for our unaided gut to draw upon. Also, gut instincts may give a general sense of how the world works, but data helps us to sharpen the picture. Further, our intuition alone is subject to certain biases that may be unseen to us; we tend to exaggerate the relevance of our own experience, overestimate the prevalence of anything that makes for a memorable story, etc.

- There are four unique powers of Big Data. There are many unique data sources that give us windows into areas about which we could previously just guess. Offering up new types of data is the first power of Big Data. Secondly, Big Data allows us to finally see what people really want and really do, not what they say they want and say they do. Providing honest data is the second power. Because there is now so much data, there is meaningful information on even tiny slices of a population. Allowing us to zoom in on small subsets of people is the third power. Big Data further allows us to undertake rapid, controlled experiments. This allows us to test for causality, not merely correlations. Allowing us to do many causal experiments is the fourth power.

- THE FIRST POWER: this revolution is less about collecting more and more data than about collecting the right data. Example time: historically, people have believed that the best way to predict whether a horse will win a race is to analyse its pedigree. While pedigree does matter, it explains only a small part of a racehorse’s success. Horse agents do use other information to see which horses might win future races; they might analyse the horses’ gaits and examine them visually. Jeff Seder was never interested in the traditional methods of evaluation; he cared about data. Seder decided to measure the size of winning horses’ internal organs and found that the size of the heart was a massive predictor of a horse’s success, and a better one than the techniques previously employed. If you want to predict the future, you don’t have to worry about why your model works, only that it works. Next example: how do you figure out which newspapers are liberal or conservative? Certain phrases are used more by liberals (estate tax, Rosa Parks, workers’ rights) and others more by conservatives (death tax, Saddam Hussein, government spending). By analysing the frequency with which such phrases are used, it’s possible to calculate the bias in the media. Why do some publications lean right and some lean left? The politics of a given area is instructive; the evidence strongly suggests that newspapers are inclined to give their readers what they want. Who owns the paper has much less effect on its political bias than we might think. Many people have viewed American journalism as controlled by rich people or corporations with the goal of influencing the masses; but the owners of the American press give their readers what they want because they are primarily driven by profit. Are the media liberal or conservative? Newspapers slant left, but there is no grand conspiracy; it’s just the workings of good old capitalism.

- THE SECOND POWER: People lie even on anonymous surveys; why? Because they lie to themselves, they want to make a good impression, etc. How can we learn what people are really thinking? Second power = certain online sources get people to admit things they would not admit anywhere else. Surveys tell us that there are far more gay men in tolerant states than in intolerant states; but is this the whole picture? We can measure the instances of searches for gay porn in tolerant and intolerant states and compare; the share of pornography searches by men that are for gay porn (5%) seems a reasonable estimate of the true size of the gay population in the US. Prejudice is another subject people are honest about with Google. Following the San Bernardino shooting, more than half of all searches about Muslims became hateful, whereas ‘only’ 20% were hateful before. Further, searches for the word ‘nigger’ shoot up whenever black people are in the news, when Obama got elected, and on Martin Luther King Jr. Day. Why? The dominant explanation is that, while blacks claim racism and whites deny this racism, there must be some implicit prejudice around. But hidden explicit racism is a more likely explanation; after all, people don’t unconsciously search for ‘nigger jokes’ on Google. Also, the internet isn’t as segregated as many people believe it is; people with strong political opinions visit sites of the opposite viewpoint all the time, purposefully or not. The internet actually brings people of different political views together, which is not what people ordinarily think. But there is a caveat; we should be wary of what people put on social networking sites such as Facebook. On Facebook, we show our cultivated selves, not our true selves. In Facebook world, the average adult seems to be happily married, vacationing in the Caribbean, and perusing the Atlantic. In the real world, a lot of people are angry, on supermarket checkout lines, peeking at the National Enquirer, ignoring the phone calls from their spouse, whom they haven’t slept with in years. If you’re a business, you should never trust what your customers tell you; trust what they do. Netflix learned this lesson: if you ask users what films they want to watch, they fill the queue with aspirational films, but a few days later they just watch what they always want to watch, low-brow comedies or romance films. Netflix then stopped asking people what they wanted and started recommending films based on what its data on millions of clicks suggested people actually wanted to watch.

- THE THIRD POWER: Big Data allows us to meaningfully zoom in on small segments of a dataset to gain new insights on who we are. Raj Chetty got hold of all Americans’ tax records since 1996, which allowed him to answer some interesting questions. Is America a land of opportunity? If you take America as a whole, then the average American who starts in the bottom 20% of income earners has a low chance of climbing to the top 20% (a 7.5% chance). But this assumes that America is uniform; if we zoom in on the data, we can see that in some parts of the US, the chance of a poor kid succeeding is as high as in any developed country in the world. In answer to the question: in some parts, America is a land of opportunity; in other parts, it’s not. Which parts are good for poor kids? Areas that spend more on education, that have more religious people and lower crime, and that have fewer black people. What places are best at giving people a chance to escape the grim reaper? For the wealthiest Americans, it doesn’t matter where you live; but for the poorest, life expectancy varies depending on where you live. What factor affects this variance? How many rich people also live in that area. More rich people in a city means the poor there live longer.

- THE FOURTH POWER: randomised experiments are the gold standard for proving causality, and Big Data makes randomised experiments, which can find truly causal effects, much easier to conduct, anytime and anywhere, as long as it’s online. Online, randomised controlled experiments are called A/B testing; you test situation A (say a background for a website) and situation B (a different background) and compare which gets more clicks; whichever gets more clicks (for whatever the reason may be), go with that. Some experiments can’t be conducted in the real world (because they are unethical, etc.), so causes can’t be established that way, but we can use naturally occurring situations as experiments instead. A great high school known as Stuy is ranked number one in the US. Can we compare what people’s lives would have been like had they entered a school they failed to enter? To test the causal effects of Stuy high school, we need to compare two groups that are almost identical apart from one tiny variable. We can compare students who just barely didn’t manage to make it into Stuy with those who just barely did. This category of natural experiments (using sharp numerical cutoffs) is called regression discontinuity. What did this study find? There was absolutely no difference between students who entered Stuy and those who just barely didn’t manage to get in. They ended up in equally prestigious universities. Stuy students achieve more in life than non-Stuy students because better students attend Stuy in the first place. Stuy doesn’t cause you to perform better. The factors that make you successful are your talent and your drive, not who gives your commencement speech.

- LIMITATIONS OF BIG DATA: Can Big Data predict which way stocks are headed? No. If there are many variables, one variable is bound to correspond in some statistically significant manner to an individual outcome; but this doesn’t mean that there is a causal relationship between the two variables. It just means that that variable got lucky. Take IQ and genes; is there one gene that can add a whole bunch of IQ points? Robert Plomin thought he had found the answer by comparing the DNA of geniuses to the DNA of those with average IQs. There was a striking difference between the two; the geniuses had a gene that the average group didn’t have. Was this the gene for IQ? No. A few years later, Plomin got access to another sample of people that included their DNA and IQ, but this time there was no correlation. The curse of dimensionality had struck again. How do you guard against this curse? Humility and detachment from results. Further, even if we can measure some statistics, the things we can measure are often not exactly what we care about. We can measure how well students do on multiple-choice questions, but we can’t easily measure curiosity, a trait much more important than how well someone does on a silly test. What is needed in this case is Big Data coupled with rational human judgement.
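
To make the "that variable got lucky" point concrete, here is a minimal simulation sketch. It is not from the book: the sample size (50 people), the number of variables (1000), and the 0.28 cutoff (roughly the conventional 5% significance level for a correlation on 50 observations) are all assumptions chosen for illustration, and `lucky_variables` is a hypothetical helper name. Every variable is pure noise, yet some still correlate with the outcome.

```python
import math
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def lucky_variables(n_people=50, n_variables=1000, threshold=0.28, seed=0):
    """Count noise variables whose correlation with the outcome beats the cutoff.

    All numbers here are illustrative assumptions, not figures from the book.
    """
    rng = random.Random(seed)
    # The "outcome" (think: an IQ score) is random and unrelated to anything.
    outcome = [rng.gauss(0, 1) for _ in range(n_people)]
    correlations = []
    for _ in range(n_variables):
        # Each "variable" (think: one gene) is independent random noise.
        noise = [rng.gauss(0, 1) for _ in range(n_people)]
        correlations.append(pearson(noise, outcome))
    lucky = [r for r in correlations if abs(r) > threshold]
    strongest = max(correlations, key=abs)
    return len(lucky), strongest


count, strongest = lucky_variables()
print(f"{count} of 1000 pure-noise variables look 'significant'")
print(f"strongest chance correlation: {strongest:+.2f}")
```

With a 5% cutoff, something like one variable in twenty should clear the bar by chance alone, which is why a striking result from one sample, like the Plomin gene above, needs checking against a second, independent sample.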

More than anything, I really enjoyed the writing in this. Like participating in a conversation with a friend. And since this was a fairly technical topic (or could be), I think it's a real feat to write so clearly and in such an approachable way. Maybe I was particularly grateful for this because I put down the previous book I was reading for being so impenetrable. The general lessons of the book were not particularly new to me, but nearly all of the specific examples and anecdotes were, and I appreciate now having them to hand. Worth the read.
informative inspiring fast-paced

Freakonomics on steroids. Yes, it's okay to compare Everybody Lies to Freakonomics: A Rogue Economist Explores the Hidden Side of Everything – the author admits it was the book that inspired him to go into this line of work. So what happens when you take the ideas behind Freakonomics, published in 2005, and apply them in the age of Google, Facebook, Twitter, PornHub, Tinder, OKCupid, etc? Probably when Steven Levitt published Freakonomics, not many people had thought about being able to collect the incredible quantities of data we now collect about ourselves every day. How much more can we learn about ourselves with all this hard evidence of the way millions of people really act? How much more honest, surprising, even shocking, will it be?

Surprising and even shocking insights about people's real behaviors are the hook that gets you interested and keeps you reading. What really happens to violent crime when violent movies are released? What kinds of porn are people really searching for on the internet? Who's really more insecure about their bodies, men or women? Does getting into a top magnet high school really affect your chances for a lucrative career? Did racism really propel Donald Trump to the White House? Sometimes the results confirm what we already thought, but sometimes they're shockers, and occasionally they're deeply disturbing.

But there's a bigger story to tell here, and it's the way that the Digital Age has revolutionized how we quantify ourselves. It's far more than just social media and online porn. We can and do aggregate incredible amounts of data about ourselves because it's so easy now to collect and store the data. And when you know where to look and how to do it, it's also easy to mine that data to identify trends and correlations. When all the records are online, data mining that would have taken months of legwork in the 20th century (assuming it was possible at all), now takes seconds. We have amassed and continue to amass an incredible and ever-growing storehouse of information on everything from crime reports to how many steps millions of Fitbit users walk every day to what color ad background gets people to click more often. We stand to revolutionize all of the social sciences because what people really do when they're “alone” on the internet is so much more honest than what they say they do in surveys, and also because of the sheer quantity of test subjects voluntarily participating in the world's biggest experiments every day. And it's not just seemingly silly stuff like people's real porn habits that are under the microscope. Take, for example, public health. The potential for improving epidemiological studies is mindblowing.

I have to wonder, though, how accurate insights based on Google searches really are. Raise your hand if you've ever typed some zany thing into Google just to see what insanity the internet has to offer, or if you've ever clicked, “Surprise me!” Doesn't the very fact that you can type in anything, and that what you type doesn't have to reflect your real self, undermine the reliability of this database? The author's answer would probably be, “In theory, yes, but your silly search is just one among millions, even hundreds of millions, of data points, and if tens of thousands of people are asking Google about their penis size or whether their husband is gay, that's a good indicator that tens of thousands of people are really worried about it, even if a few of them are just kidding around.”

Audio Notes: This book is one of those books that's probably better in print than audio. It's fine in audio, but it's obvious there are a lot of charts and graphics and they've had to adapt the text so their content is being described to you. So if you have a choice, probably choose print.

This book thought itself more edgy and cool than it actually was. "Watch me throw in porn search terms. Bet you didn't expect THAT in your economics book!" Well, no, other than the fact I heard the author promoting the book on 6 different podcasts, I wasn't expecting it. But he kind of sounded like an 8-year-old who's just learned the F word.

Also, I was kind of annoyed off the bat, in Chapter 1, where he says his Grandma is "big data" because she has so many years of experience to draw on. Isn't the ENTIRE POINT of your book that humans are really bad at interpreting data? You need anonymous information! Unbiased computers to interpret it! It was a really dumb analogy, which didn't even fit the narrative of his own book.

OK, *deep breath*, I have one more rant in me about how the audiobook reader mispronounced "doppelganger" but I'll spare you. Overall, I liked the book! Really! It might not be strictly necessary to read it if you've already heard the guy on nearly every interview-style podcast to which you're subscribed, but it wasn't a waste of time.

Listened to the audiobook. It kinda reads like a Google Trends advert. I enjoyed it. Very insightful.

Here's a great, painless, even entertaining romp through the world of big data sets. Best of all, the only charts and tables are easily understood, even if one hates statistics and econ books. Stats were always tough for me. I could do all the problem sets and thought I understood them... only to stare blankly at the problems on the exams. ugh.
Nevertheless, this was an important read. The rapidly expanding field of big data set analysis underlies the successes and failures of Google, FB, TikTok and all the others. While he doesn't suggest we should drop out of all those online sites, it isn't hard to formulate an argument for doing just that.
Of course, at the end, his last sentence is, "Every narrator is unreliable."
The Goodreads blurb gives a pretty good overview of his arguments.

I found this book to be absolutely fascinating. It has inspired me to learn more about the uses of big data.

Interesting lessons about what data can tell us about ourselves and our true behavior. Uses strange examples and data sets (talks a lot about sex and racial topics) that can draw attention away from what the author is trying to emphasize.