The dangerous story-first data analysts

The dangerous story-first data analysts have grabbed the people's data


The spectacle of Elon Musk and his proteges seizing the people’s data has brought uncomfortable truths about data collection to the fore. This day could hardly have arrived sooner. The people must now confront the perils of entrusting our data to corporations run by smooth-talking profit-maximizers.

According to media reports, the Musk team targeted specific federal agencies, demanding access to vast amounts of data (link, link), and in short order, fired boatloads of government employees, claimed to have found fraud everywhere, cancelled contracts, etc. They make announcements on social media and put up "receipts" on a hastily assembled website. One such tweet was “Maybe Twilight is real and there are a lot of vampires collecting Social Security,” with an accompanying frequency table showing ages up to 369.

These circumstances do not inspire confidence that high-quality data analyses will follow.

  1. The team is made up of “story-first” analysts. From the name of the department, and every word they have since published, it's evident that these data analysts had already written their headlines before they touched a single row of data. They even set a target bounty of $2 trillion of "savings" before they began their work. Is there any chance that this group finds any data showing that some parts of the federal government are run efficiently?
  2. Ironically, when all evidence points in the same direction, it's panic time for data analysts, not party time. If the Musk team weren't "story-first", by now we should have learned at least one finding that debunks an alleged source of fraud or waste. Wake me up when they make such a discovery. "Story-first" analysts behave differently from "data-first" analysts. They pursue different questions, and thus harness different methods. As a trivial example, consider the widely-publicized claim of fraudulent social security payouts to dead people (link). If that's the answer, what is the question? The appropriate workflow for this analysis starts with defining dead people, for example, as those whose age is above 120; extracting a list of such people; removing duplicates; and computing the total amount of payouts to them. By contrast, if the analyst wants to estimate the extent of fraud, she won't get anywhere by homing in on just one segment of alleged fraudsters.

  3. "Story-first" analysts are at the gravest risk of being fooled by confirmation bias, i.e. they skillfully find evidence for whatever it is they want to find. In the trivial example introduced above, they fail to inquire why there are "dead" people in the database with positive payouts (other than malfeasance). I don't have access to the data so I can only speculate. If I found dead people in the data, I'd seek answers to many questions before making accusations on social media. Are those payouts final? Could they have been caught by a downstream system? Are the birth dates of those people accurately entered? Are fake birth dates used as placeholders for missing data? Is that field the right one to compute age from, or is it an obsolete field? Any data analyst runs the risk of self-delusion, but the danger is amplified by a "story-first" mindset. Story-first promotes cherry-picking evidence, discarding or disregarding anything that disagrees with the story, as well as running down paths that lead to that story.

  4. "Story-first" analysts are also at great risk of being careless. At every stage of the analysis, they may fail to exercise self-doubt. Within days, the Musk team listed as a top catch on their website a saving of $8 billion by cancelling a project. They didn't bother to check that the maximum contract value of the cancelled project was $8 million (link). Of course, they also failed to pro-rate that number by the amount that has already been paid out for work that has already been delivered and accepted. And surely, they wouldn't think to check the contract to see if it mandates a breakup fee. It is hard to imagine that they included any litigation expenses and, if appropriate, settlement fees that would apply when the other party sues the government for breach of contract.

  5. By many accounts, the Musk team has obtained access to data that are not anonymized. This creates the opportunity for malfeasance. The blackmail happening at Columbia University, in which the government rescinded certain funding to coerce compliance with a list of demands (link), can happen to any entity that receives federal funding, not necessarily in the public eye. This is what I meant when I said "The people must now confront the perils of entrusting our data to corporations run by smooth-talking profit-maximizers." If we let corporations amass our data, the data may land in the laps of our enemies. Our enemies may not be known at the time of data collection; they may reveal themselves in the future. Witness how fast the likes of Mark Zuckerberg switch allegiance. Recall Musk was once allied with the Democrats.

  6. "Story-first" analysts who are aware of their cherry-picking often resort to hiding the data. It's always a red flag when analysts refuse to release the data. The Musk team has not released any data that can be independently vetted. Pharmaceutical companies also play this game of control. These entities frequently lean on the privacy excuse, but in fact, there are ways to provide the data without violating people's privacy; and if these entities truly cared about privacy, they would have stopped much of their privacy-busting marketing activities. Not releasing the datasets has serious consequences: not only is there no way to independently verify the conclusions, we can't even be sure that the findings weren't completely made up.

  7. The speed at which the Musk team is moving precludes the possibility of deeply understanding the data they're looking at, or gathering the background information required to interpret them properly. Take for instance the possibility that the 150-year-old people in the Social Security database may have missing age information. Missing age may have been coded as some very large number. The media has tried to blame this on the Musk team not being familiar with old programming languages. That's nonsense. The missing age problem also affects new computer code; it is a problem that exists in every database that collects age information; and the value chosen to represent missing age depends on the designer of the database. To be sure what this value is, the team would need to find someone who can look up the source code. Well, they probably fired most of the people who have this knowledge.

  8. Don't be distracted by "read" vs. "write" access (link). Read access is sufficient for the previously mentioned blackmail. Write access - which means the ability to edit the database - merely opens up more avenues for inflicting harm (e.g. by planting fake data or deleting inconvenient data).

  9. Don't celebrate bad analyses even if you agree with the mission of the Musk team. Everything I described so far can ricochet, should the roles be reversed. Further, you may be an ally today; you could become their enemy tomorrow.

  10. Don't believe the "we haven't done it" defense. If something is technically feasible, it will likely be done. Many tech companies have been caught with their pants down. The large-language model developers claim they did not use stolen copyrighted materials to train their models; ask any insider if they believe these words (link).
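
To make points #2, #3 and #7 concrete, here is a minimal Python sketch of the two workflows applied to the "dead people" example. Everything in it is an assumption made for illustration: the records are invented toy rows, the field names (person_id, birth_date, monthly_payout) are hypothetical, and the placeholder birth date of 1900-01-01 is a guess at what a missing-value code might look like, not a statement about the actual Social Security database. The first function follows the story-first workflow described in point #2; the second asks a few of the questions from point #3 before adding anything up.

```python
from datetime import date

# Toy rows invented for illustration; NOT real Social Security data.
# Field names are assumptions, not the actual database schema.
RECORDS = [
    {"person_id": "A1", "birth_date": date(1950, 3, 14), "monthly_payout": 1800},
    {"person_id": "B2", "birth_date": date(1875, 5, 20), "monthly_payout": 0},
    {"person_id": "C3", "birth_date": date(1900, 1, 1), "monthly_payout": 1200},
    {"person_id": "C3", "birth_date": date(1900, 1, 1), "monthly_payout": 1200},  # duplicate row
    {"person_id": "D4", "birth_date": date(1890, 7, 2), "monthly_payout": 950},
]

# Hypothetical placeholder for a missing birth date. The real value, if one
# exists, depends on whoever designed the database.
PLACEHOLDER_BIRTH_DATES = {date(1900, 1, 1)}

AS_OF = date(2025, 3, 1)   # arbitrary reference date for computing age
MAX_PLAUSIBLE_AGE = 120    # the "dead" threshold used in point #2


def age_on(birth_date, as_of):
    """Age in whole years on the as_of date."""
    had_birthday = (as_of.month, as_of.day) >= (birth_date.month, birth_date.day)
    return as_of.year - birth_date.year - (0 if had_birthday else 1)


def dedupe(records):
    """Keep the first row seen for each person_id."""
    first_seen = {}
    for row in records:
        first_seen.setdefault(row["person_id"], row)
    return list(first_seen.values())


def story_first_total(records, as_of=AS_OF):
    """Define 'dead' as age > 120, remove duplicates, add up the payouts."""
    return sum(
        row["monthly_payout"]
        for row in dedupe(records)
        if age_on(row["birth_date"], as_of) > MAX_PLAUSIBLE_AGE
    )


def data_first_breakdown(records, as_of=AS_OF):
    """Same flagged rows, split by whether they are paid and why they look old."""
    out = {"flagged": 0, "paid": 0, "placeholder_total": 0, "unexplained_total": 0}
    for row in dedupe(records):
        if age_on(row["birth_date"], as_of) <= MAX_PLAUSIBLE_AGE:
            continue
        out["flagged"] += 1
        if row["monthly_payout"] <= 0:
            continue  # on file, but not receiving anything
        out["paid"] += 1
        if row["birth_date"] in PLACEHOLDER_BIRTH_DATES:
            out["placeholder_total"] += row["monthly_payout"]
        else:
            out["unexplained_total"] += row["monthly_payout"]
    return out


print(story_first_total(RECORDS))     # 2150
print(data_first_breakdown(RECORDS))
```

On these toy rows, the story-first total is 2150, while the breakdown shows that one of the three flagged people receives nothing, and that 1200 of the total sits on a birth date that may simply mean "missing". Only 950 is left to investigate, and even that figure says nothing yet about whether the payout is final or was caught downstream.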


Here's the situation we're facing. A gaggle of story-first data analysts have seized the people's data. They are pursuing a pre-conceived agenda of uncovering and uprooting "wasteful spending" in the federal government. Laser-focused on this one idea, they publish cherry-picked evidence, everything pointing to the same conclusion. Their analyses show no sign of self-doubt, and exhibit amateurish carelessness. They are also shielded from third-party scrutiny, hiding the data. It is hard to take this group seriously but these conditions allow them to do serious harm, intentionally or not.
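
The carelessness in point #4 is easy to put into arithmetic. Below is a rough Python sketch of the checks I'd expect before anyone announces savings from a cancelled contract. Only the $8 billion claim and the $8 million contract ceiling come from the reporting cited above; the amounts already paid, the breakup fee and the legal costs are hypothetical numbers picked purely to show the calculation, and the function names are my own.

```python
CLAIMED_SAVINGS = 8_000_000_000   # the figure posted on the website
MAX_CONTRACT_VALUE = 8_000_000    # the contract ceiling, per the reporting

# Hypothetical inputs, for illustration only.
ALREADY_PAID = 2_500_000          # work delivered and accepted (assumed)
BREAKUP_FEE = 500_000             # termination fee, if the contract has one (assumed)
EXPECTED_LEGAL_COSTS = 250_000    # litigation and settlement (assumed)


def remaining_value(max_contract_value, already_paid):
    """The most that cancelling can avoid: the ceiling less what is already spent."""
    return max(max_contract_value - already_paid, 0)


def net_savings(max_contract_value, already_paid, breakup_fee=0, legal_costs=0):
    """Avoidable spending minus the costs triggered by the cancellation itself."""
    avoidable = remaining_value(max_contract_value, already_paid)
    return avoidable - breakup_fee - legal_costs


# Sanity check: a saving cannot exceed the contract ceiling.
print(CLAIMED_SAVINGS > MAX_CONTRACT_VALUE)     # True
print(CLAIMED_SAVINGS // MAX_CONTRACT_VALUE)    # 1000

print(net_savings(MAX_CONTRACT_VALUE, ALREADY_PAID, BREAKUP_FEE, EXPECTED_LEGAL_COSTS))  # 4750000
```

The first check alone, comparing the claim against the contract ceiling, catches the thousand-fold error. The rest shows how quickly even the corrected figure shrinks once money already spent and the costs of cancelling are taken into account.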

P.S. [4-1-2025] On point #8, they were just joking. In the latest news (link), Musk claimed to have "corrected" Social Security records. This proves that they have "write" access to the database. This development also proves point #10. "If something is technically feasible, it will likely be done." Gaining all access to databases is technically feasible, and now it's done.