Junk Charts

Making sense of long data sequences

Kaiser Fung — Mon, 22 Jun 2026 09:14:41 -0400

In the game of football (i.e. soccer), players pass the ball around, and eventually someone takes a shot. If the ball breaches the net, the team scores a point, and the game restarts from the center. If an errant shot surpasses the goal line, the opposing goalkeeper restarts the action; if the shot is intercepted by an opponent pushing the ball out of bounds, the attacking team retains possession and restarts the action with a corner or a throw-in. The action can also be stopped and restarted before a shot attempt, if the ball lands out of bounds at any time. If a referee accidentally touches the ball, a restart is in order...

I won't detail all the possibilities. The point is that a football match can be represented by a set of sequences, each of which has a distinct (re)start and a distinct end. Each sequence involves a subset of the players; the same player can feature multiple times. Each ending signifies the ball never reaching another player – it goes out of bounds, or into goal, or hits a ref; or a foul is called, or a stoppage is permitted for half-time, substitutions, injuries, not to mention theatrical writhing, or in the latest twist, advertiser hydration.

With modern video processing and sensor data, analysts can visualize and dissect these sequences. One should be able to figure out which team is playing based on the style of play described by patterns contained in these sequences.

Brennan Klein's research group at Northeastern has published some maps based on these sequences. (Brennan and I crossed paths some years ago.)

The following map shows a team, during the 2022 World Cup, that executed a large number of passes, most frequently on the front side of midfield. These players spent a lot of time passing the ball around, it appears. What team might this be?

The national football team known for lots of short passes around the midfield is Spain. Its iconic "tiki-taka" style of play has dazzled football fans in recent years. The passing cluster map depicted one aspect of the shape of their passing sequences during the 2022 World Cup.

Which team deployed the following passing patterns?

This team passed the ball a lot on its own side of the field, horizontally. When they attacked, it was more likely to come from the left flank. I inferred that attacks at goal utilized long balls, but their frequency was not sufficiently high to appear on the map. During the 2022 World Cup, this team executed about 40% fewer passes than Spain did.

The above diagram visualized England's tactics.

Each arrow shows an individual pass. A cluster contains passes of similar lengths and directions. The analysis has broken up the sequences, like we turn sentences into bags of words. Yet, the influence of the sequences is still palpable because the ball has to go somewhere.

Visualizing all sequences directly would make quite a mess. Each passing cluster is the analyst's abstraction. It contains passes from many sequences, but only relevant passes from each sequence.

I'd be interested in seeing clusters formed from two or three consecutive passes. Or perhaps, clusters conditional on another cluster. For example, given the subset of sequences that have contributed to a "red" cluster, what are the passing clusters that form in the continuation of these sequences?

Brennan's group has been blogging about football analytics throughout the World Cup. Check them out here!

The greatest nations on earth

Kaiser Fung — Thu, 18 Jun 2026 10:56:56 -0400

Reader Kirsten P. sent me to Ben Byrne's website (link), where we learn that Europe has the greatest countries on earth.

Byrne presents a solution to a visualization challenge that arises frequently in business and social science studies. We have a composite ranking, built from component ratings. The entities being ranked can be placed into larger groups.

In this case, the composite ranking is the greatness of nations. The component ratings (what Byrne calls "pillars") consist of six factors, each of which is drawn from published statistics by reputable organizations. The entities are countries, and the grouping is by continent.

The right panel is the key event. The countries are sorted by the composite ranking from best to worst. The component ratings are visualized as a set of columns. These columns are not labeled, indicating the designer's assumption that they are of secondary interest.

The "rank range" on the right side is quite interesting, and I'll come back to it.

The right panel by itself would be a sufficient chart.

The capacity of the chart is expanded by way of the left panel, which offers users a set of controls. This interactive element attempts to solve one of the conundrums of any ranking procedure: users may not share the same preferences as the creator.

The left panel reveals that the "default" (read: canonical) composite ranking issued by the creator is one that combines the component ratings using a geometric mean, with equal weights on each component.

To which Byrne quipped:

Equal weights aren’t neutral — they assert each pillar matters the same. Your call.

This point can't be stressed enough. Having no opinion is an opinion.

Readers can move the dots around to indicate the importance they attach to each component. The scale used here is redundant – if one moves all dots to the right, as if to say every component is extremely important, then the result is identical to the default setting. At every horizontal position, if all dots are aligned vertically, we have the same composite ranking as the default. I don't see this redundancy as a defect; in a way, it informs any reader who rates everything as important that they effectively hold nothing to be important.
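To make the redundancy concrete, here is a minimal sketch of a weighted geometric mean. It assumes the tool normalizes the slider weights so they sum to one, which is consistent with the behavior described above but is not confirmed by Byrne's site; the function name and the ratings are made up for illustration.

```python
import math

def composite_score(ratings, weights):
    """Weighted geometric mean of positive component ratings.

    Assumption: weights are normalized to sum to 1, so only their
    relative sizes matter.
    """
    total = sum(weights)
    shares = [w / total for w in weights]
    return math.exp(sum(s * math.log(r) for s, r in zip(shares, ratings)))

# Hypothetical ratings for one country on six pillars (made-up numbers).
ratings = [0.9, 0.8, 0.7, 0.95, 0.6, 0.85]

default = composite_score(ratings, [1] * 6)    # equal weights
all_max = composite_score(ratings, [10] * 6)   # every slider pushed to the right
print(math.isclose(default, all_max))
```

Under that normalization, any setting in which all six sliders line up vertically collapses to the equal-weight default.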

Now, let's return to the "rank range" section of the right panel.

This is an attempt to capture the "uncertainty" of the composite ranking. Specifically, Byrne is concerned with uncertainty arising from the component weights. This is precisely the issue that I mentioned above: people have differing views on the relative importance of those six components.

What Byrne did here is to simulate 500 people, each of whom has their own preferences. He samples randomly amongst sets of six numbers that add up to 100%. For each set of preferences, a composite ranking results. These rankings form a distribution, and the "rank range" shows the middle 90% of the distribution. (This method of generating preferences assumes that all possible allocations of weights have equal chance of occurring.)
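The simulation described above can be sketched roughly as follows. This is my own reconstruction under stated assumptions, not Byrne's code: weights are drawn uniformly over all allocations summing to one (normalized exponential draws), the composite is a weighted geometric mean, and the range runs from the 5th to the 95th percentile of simulated ranks. The country names and ratings are invented.

```python
import math
import random

def random_weights(k, rng):
    """Draw k weights uniformly from all allocations that sum to 1."""
    draws = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]

def geo_mean(ratings, weights):
    return math.exp(sum(w * math.log(r) for w, r in zip(weights, ratings)))

def rank_ranges(table, n_sims=500, seed=0):
    """Return {name: (5th percentile rank, 95th percentile rank)}."""
    rng = random.Random(seed)
    k = len(next(iter(table.values())))
    ranks = {name: [] for name in table}
    for _ in range(n_sims):
        w = random_weights(k, rng)
        ordered = sorted(table, key=lambda name: geo_mean(table[name], w), reverse=True)
        for position, name in enumerate(ordered, start=1):
            ranks[name].append(position)
    out = {}
    for name, rs in ranks.items():
        rs.sort()
        out[name] = (rs[int(0.05 * (n_sims - 1))], rs[int(0.95 * (n_sims - 1))])
    return out

# Made-up ratings: "Lopsided" has one very low pillar, the others are balanced.
table = {
    "Strong": [0.90] * 6,
    "Even": [0.70] * 6,
    "Weak": [0.40] * 6,
    "Lopsided": [0.95, 0.95, 0.95, 0.95, 0.95, 0.10],
}

for name, (lo, hi) in rank_ranges(table).items():
    print(name, lo, hi)
```

In this toy table, the balanced entries only move when the lopsided one leapfrogs them, while the lopsided entry can land almost anywhere depending on how much weight falls on its weak pillar.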

Simply stated, a wider range indicates that the ranking is sensitive to the individual's choice of weights while a narrow range says it doesn't really matter what weights are applied.

In effect, the rank range reflects the underlying variability of the ratings across components. If a nation achieves similar scores across all six components, then the composite score wouldn't vary much no matter how we arrange the weights. On the other hand, if the nation scores much lower on one or more components, then it could achieve higher ranking if those components were downvoted, but much lower ranking if they were upvoted.

This explains the wide range shown for UAE, which has an extremely low rating for "Freedom & Rights". The composite rankings on the left side of the range are those that place heavy weights on Freedom & Rights; the ones on the right side of the range ignore this component.

The dot, which represents this particular reader's preferences, is right skewed. (I set Health and Governance components higher, and everything else, including Freedom, in the middle.)

Because the chart adjusts in real time, you can observe the effect of changing the weight on Freedom & Rights on the UAE composite ranking. Shifting the weight down moves the orange dot to the right since the low rating on Freedom & Rights becomes less and less important.

I'd suggest fixing the scale of the rank range to [0, 1]. Right now, the minimum and maximum are set to the values found within the subset of nations being plotted. As a result, I have to scroll from top to bottom to figure out the range of the range.

With this type of complex dataset, it's easy to over-complicate things and make the visualization confusing. Byrne has done a great job designing a clean interface while still offering quite a bit of insight.

Confirmation bias on display

Kaiser Fung — Sat, 13 Jun 2026 18:12:07 -0400

At the workplace, a survey went up as the New York Knicks inched closer to taking the NBA title.

Nice illustration of confirmation bias.

Know your data 49: the risk of story-telling with data

Kaiser Fung — Thu, 11 Jun 2026 10:14:49 -0400

This ArsTechnica story (link) will continue to recur so long as most of us, especially those of us in the U.S., ignore the endless expansion of the surveillance state.

A man was arrested and falsely accused of luring a child at a McDonald's, based on an erroneous result of a facial recognition software widely used by local police in the U.S. It turned out the man lives more than 300 miles away from that McDonald's, and has never even been to that town. Nevertheless, the police obtained an arrest warrant, took the man into custody, and the man required legal assistance to "prove his innocence," a perverse reversal of our legal doctrine of innocent-until-proven-guilty.

The story was excellently written, as it covers several ingredients that contribute to these wrongful arrests. Data by themselves are not sufficient to do harm. The blame goes beyond the facial recognition company's practice of assembling images of Americans, 40 million and counting.

It's also that the facial recognition industry has failed to encourage proper usage of such powerful technology. Instead, the industry issues meaningless statistics conveying misleading confidence about the reliability of the technology. In this case, the man supposedly matched the perpetrator's image "93%". What does 93% mean? It's anyone's guess. The technology vendor calls this number a "confidence score." The police did not seem to understand what the number measures, but 93 percent sounds like a highly confident match.

The falsely-accused man lamented, "Says it’s 93 percent accurate. Far as I’m concerned, it’s 100 percent inaccurate." And he would be right.

The real culprit is story-telling with data. It's what I've been calling "story time": start with one tangible piece of data, then spin it into a grand lullaby.

The single piece of evidence here is the 93% face match. Add a heavy dose of narrative fallacy. The investigators formulated a theory of the man as the perpetrator, discarding anything that didn't fit the story. For example, one officer checked the man's license plate numbers against (yet another) surveillance database, which turned up nothing. Nonetheless, this contradictory evidence did not change their theory; the police ignored it when applying for the arrest warrant.

Moreover, we learned that the investigators were blinded by their conviction. They could have collected plenty of corroborative evidence, such as the McDonald's receipt. The investigators didn't even ask. If they had, they might have overturned their faulty theory.

Anyone who has worked with data recognizes such dangers (whether we are able to avert them is a different matter!). We all tend to be skeptical of the analysis when the analysis disagrees with our theory, asking few questions otherwise. It takes a lot of mental strength to resist narrative fallacy and blind spots.

Surveillance technology gives the impression it's all knowing. People frequently believe that the surveillance companies have "all our data". Ultimately, the technologies offer samples with gaps that must be filled, gaps which frequently get grouted with stories.

Data and investigators form the worst pair, as they both laser-focus on positive identification, and neglect negative evidence. When was the last time you heard a forensic software company state their objective as "prove his innocence"? Every application in that space wants to "prove guilt"! Many databases are records of transactions. If someone is innocent, his data would not appear in the database.

If you think you'd heard this story before, you'd be right. A similar story happened to a grandmother, which I featured in this post.

Rearranging deck chairs

Kaiser Fung — Thu, 04 Jun 2026 08:39:42 -0400

Avinash highlighted this chart by economists Saez and Zucman that paints a vivid picture of what's happening to average tax rates by income group in the U.S. It goes a long way in explaining why there is such dismay with income distribution.

The population is divided into roughly 10 groups of equal size (deciles). The top decile, those with income above the 90th-percentile, is subdivided into six additional groups, with the top group consisting of just 400 people.

In simple terms, the lines have not moved much for the bottom 11 groups, i.e. anyone not in the top 1%. Deeper blue indicates more recent data. If anything, these lines have shifted upwards, meaning that everyone not in the top 1% has been paying slightly higher tax rates over time.

The top 1% (subdivided into four groups) have seen a dramatic decrease in tax rates. By 2000, they become somewhat "linear" with the other income groups. But don't let this chart fool you – because of the subdivision of the top 1%, the numbers on the horizontal axis are not "linearly" increasing.

The authors of the chart defined income groups analytically – by chopping up the income distribution into deciles.

The chart strongly suggests the effectiveness of a different way of defining "rich". Any cohort that has their tax rates decline over time defines "the rich". This criterion would have handsomely bisected the above axis into two obviously different sections.

Astute readers may recognize that the Saez-Zucman chart uses the same form and convention as the line charts in my prior posts on beef and egg prices.

The following variant tells an even more compelling story:

All I have done is to flip the roles of the income groups and time. Time now flows left to right. Each line represents an income group; the deeper the blue, the richer the people represented.

A massive convergence pattern is seen. In the 1950s, higher income groups pay dramatically higher tax rates (they also earn dramatically more money than the lower income groups). By the 2010s, everyone's tax rates have been squeezed into the 25-35% range.

Further, the deeper blue lines are downward trending, meaning these groups have benefitted from lower taxes. The light blue lines? Their tax rates have increased slightly over time.

By 2020, the richest 400 Americans made a huge breakthrough. They finally have become the group that pays the lowest income tax. Wow!

The beef on beef

Kaiser Fung — Fri, 29 May 2026 08:32:34 -0400

In the last post, I took a look at egg prices in the U.S., which have declined a bit in the last year but are still well above where they were five years ago.

The administration cherry-picked this item to suggest to the people that inflation has been tamed.

Let's take a look at the price of beef. Not long ago, this item made the news, when the President vowed that he was going to make steaks affordable.

Well...

His effort is clearly failing. Ground beef in the U.S. is now 20% higher than last year. Over the five-year period, beef prices have gone up almost 70%!!!

With editing power comes responsibility

Kaiser Fung — Tue, 26 May 2026 09:17:58 -0400

Long-time reader Chris P. sent me to this regrettable tweet by someone who works for the Department of Labor, which collects the nation's economic data.

It features a list of items for which prices have declined year-over-year. The title is a true lie. Sure, the prices of the featured items decreased but the prices of many other items increased. This issue of cherry-picking is covered by economist Justin Wolfers (link), and I won't belabor it here.

We now have two successive administrations that are fighting dramatic consumer price inflation by telling citizens their eyes are lying. When a matcha latte at a restaurant in a university town costs $13 (link), and when the Italian restaurant (no Michelin star) I'm going to tonight charges $38 for spaghetti alla chitarra, it's hard not to notice! (For those who don't live in the U.S., those are list prices; add a minimum of 25% for taxes and tips.)

Rapidly increasing prices are not a common thing in the U.S. It's only natural that consumers compare current prices to what were regarded as "normal" prices back in the day. Therefore, using year-over-year prices feels wrong; prices were already elevated last year.

I went back and took a look at egg prices (per dozen eggs) over a longer time period (The data is found here).

For eggs alone, it is true that the average price has fallen sharply from last year. But at current prices, Americans are still paying 40% more for eggs than five years ago. Egg prices have generally been under $2 stretching back decades, and therefore, we still feel like they're higher than expected.

(I'm not sure why the data from FRED do not match the data published in the press release cited by Wolfers and used in the DOL tweet. The FRED data shows a 50% drop in the last year, more than the 40% drop cited by those two sources.)

My chart was strongly edited. Here is the first draft:

A lot of editing is subtraction.

To address the question of historical comparisons, the chart doesn't need to show every year of data. That helps clean up the spaghetti mess. Fewer colors are required, making it easier for the reader to find the line. In addition, lines can now be individually labeled, and the legend eliminated, because we have fewer of them.

Both my chart and the DOL chart went through editing; both editors subtracted. So, subtraction itself is not the source of trouble. With power comes responsibility.

Some good bars

Kaiser Fung — Wed, 20 May 2026 08:54:29 -0400

This New York Times article (paywall) about the President's approval ratings is a demonstration that one doesn't need fancy charts to deliver clear messages.

The structure of the data helps. Each bar has two meaningful categories, plus a missing category (undecided, didn't want to say, etc.). The missing/unknown category is neatly tucked in the middle, and given a dull gray color, unlabelled. It's third fiddle, as is appropriate.

The aggregate response is shown up top, followed by demographic breakdowns. It's a small-multiples design, even though it's not usually described as such.

The pattern in this data is also simple to interpret. Independents are placed in the middle, but on the war issue, they are much closer to Democratic respondents than to Republican respondents.

This design is versatile. One can make similar charts using other demographic groups, like age groups, ethnic groups, income groups, and gender.

The design can also admit variations. The next chart utilizes it, with subtle differences.

Every bar represents all respondents, regardless of political party. Each bar shows a different issue. The top bar depicts "his job as president" so in a sense, it is an aggregation of the data below. While the demographic breakdown is explicit and clear, the decomposition of all issues to individual issues is implicit and implied. We don't know what weights respondents attach to different issues when they gave the overall rating but the correlation between these ratings provides hints. So, the structure of this chart is similar but not quite the same as the first one.

The second chart also contains an additional column of information on the right. It shows the "net approval rating" which is defined as the percent approving minus the percent disapproving. This metric takes the view that each disapprover cancels out one approver.
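As a small illustration of what the net metric keeps and what it drops, the sketch below computes it for two hypothetical polls. The numbers are invented and do not come from the Times article.

```python
def net_approval(approve_pct, disapprove_pct):
    """Percent approving minus percent disapproving."""
    return approve_pct - disapprove_pct

# Hypothetical poll results (approve, disapprove, no answer), in percent.
polls = {
    "Issue A": (45, 50, 5),
    "Issue B": (30, 35, 35),
}

for issue, (approve, disapprove, missing) in polls.items():
    print(issue, net_approval(approve, disapprove), "missing:", missing)
```

Both made-up issues net out to the same value even though the share of respondents with no stated opinion differs widely, which is why the bars remain useful alongside the net column.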

Updating the unemployment chart

Kaiser Fung — Mon, 18 May 2026 08:50:13 -0400

The official statistics continue to show that the U.S. job market is healthy (for example, here). The unemployment rate has been around 4% for a few months, which is the lowest it's been for several cycles.

In this chart, I encoded the unemployment rate using a red-blue color scale. Red dots indicate months of high unemployment rate, which spiked to 10 percent in 2010 and to over 14% during the Covid-19 pandemic. Notice that in good times, the unemployment rate generally improves to about 4 percent. This level has been reached three times since 1990.

It turns out that there are many roads that lead to Rome. An unemployment rate of 4% in 2001 is materially different from an unemployment rate of 4% in 2020. The aggregate statistics hide what is happening underneath.

I previously designed the following chart depicting two important aspects that help us interpret the aggregate unemployment rate.

It's important to realize that the unemployment rate is not the number of unemployed divided by the population size. The denominator is the size of the "labor force," which may leave out a third or more of the population. The ones left out of the denominator are invisible to the Bureau of Labor Statistics; they are neither employed nor unemployed. The numerator counts the number of unemployed. If someone has worked at least one hour during a reference time period, they are considered employed. Therefore, some of those who are employed are actually employed part-time.
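A short sketch of the arithmetic may help. The function names and the head counts are hypothetical, chosen only so that two populations of the same size produce the same headline rate while differing in how many people sit outside the labor force.

```python
def unemployment_rate(employed, unemployed):
    """Unemployed divided by the labor force (employed + unemployed)."""
    return unemployed / (employed + unemployed)

def not_in_labor_force_share(population, employed, unemployed):
    """Share of the population that is neither employed nor unemployed."""
    return (population - employed - unemployed) / population

# Two made-up scenarios, each with a population of 1,000.
scenarios = {
    "A": {"population": 1000, "employed": 648, "unemployed": 27},
    "B": {"population": 1000, "employed": 600, "unemployed": 25},
}

for name, s in scenarios.items():
    rate = unemployment_rate(s["employed"], s["unemployed"])
    outside = not_in_labor_force_share(s["population"], s["employed"], s["unemployed"])
    print(name, f"rate={rate:.1%}", f"not in labor force={outside:.1%}")
```

Both scenarios report a 4% rate, yet scenario B has fewer people employed and more people excluded from the denominator.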

The scatter plot brings together two underlying drivers of the unemployment rate – the proportion of the population considered "not in labor force", i.e. the invisible; and the proportion of the employed who work part-time.

The path in the scatter plot shows the evolution of the relationship between 1990 and 2026. The year labels adopt the same color scheme as in the dot plot above. In other words, when the labels are red, the U.S. was facing high unemployment.

In 2001, the first time during this period the U.S. saw a low 4% unemployment rate, about a third of the population were excluded, and just over 17% of those employed were working part-time.

Then, in 2020, when the aggregate unemployment rate again showed 4%, the path in the scatter plot had moved to an entirely different location. An extra 4% of the population had exited the labor force, not to be counted. The share of part-time workers had settled back to about 17% after jumping to almost 20% in the early 2010s.

The employment situation since 2020 has been stuck in this new location. Most recently, the line is moving northeast, which is the worst possible direction: more people are leaving the labor force, and more workers are working part-time.

As discussed in Chapter 6 of Numbersense (link), people not in labor force include not only those who are too young or old to work, or those who are incarcerated, but also those who are deemed by the government to not want work.

The following small-multiples version of the scatter plot shows different periods of time in different panels.

The bottom-left panel shows a transition period when the path shifted from the bottom of the chart to the top of the chart. Beyond 2020, the path has been hovering in that region with no indication that it would be re-directed downwards.

Relevant prior post:

Painting the full picture of the employment situation

It’s very frustrating to read the mainstream articles about the recent unemployment report. For example, the New York Times said “U.S. Jobless Claims Hit 15-year Low.” (link) At this point, everyone should be aware of how employment statistics, in particular, the unemployment rate, is computed. Certainly, the editors at


MTA is waking up

Kaiser Fung — Fri, 08 May 2026 08:54:20 -0400

Back in 2022, I was noticing how many bus riders in New York were simply boarding without paying the fare (link). These so-called "Select Buses" were designed to speed up boarding: riders were supposed to get a paper ticket from a machine on the sidewalk prior to boarding; and then they were allowed to board the bus from any door.

It doesn't take a genius to realize that such a system would promote fare evasion. One can consider it a price to pay for improving efficiency. That wasn't the posture of the MTA. The MTA was in full-on denial mode. This is a direct quote from my 2022 post:

The Vice reporter ... citing an earlier MTA study which concluded that all-door boarding reduced fare evasion... that study found that fare evasion was 10 times lower on Select bus routes than on regular bus routes!

That study literally claimed the fare evasion rate on Select Buses was merely 2.2%. Here's the proof:

Fast forward to 2026. The New York Times is saying:

slightly more than half of passengers — 52.7 percent — skip the fare on Select Bus Service buses

Whichever consultant wrote the earlier report should be ashamed of themselves. What a horrendous mistake!

Over the last year, the MTA has been phasing out the swipe cards in favor of chip cards. The road-side machines will be retired, instead of being updated to accept chip cards. Currently, passengers are supposed to pay on boarding by scanning the chip cards on the tablets.

This change has not reduced fare evasion. As a result, the MTA has decided to crack down. This requires – you guessed it – buying more technology to solve a problem created by technology. Members of an "EAGLE" squad will be inspecting riders; they will use a "hand-held device that can check whether an OMNY card or a cellphone made a payment on the bus someone is riding on". (OMNY is the chip card specific to the MTA.)

I'm wondering whether this device can check credit cards too.

The climate chart that sparks a count v ratio debate

Kaiser Fung — Tue, 05 May 2026 08:14:35 -0400

Long-time reader Chris P. alerted me to a debate over climate graphs between Aaron Brown and Hank Green, who have each recorded YouTube videos stating their positions (Brown and Green).

One of these charts (shown above) actually defines an unusual way of measuring climate change. It originally came from the official Climate Change Special Report of 2017 (Figure 6.5, also reproduced as Figure ES.5).

This chart shows red and blue bars over time, between 1930 and 2016. The key ingredients are days in which the temperature reached record high or low. The bar is red for years in which the number of record highs exceeded the number of record lows; the bar height is the number of record highs per number of record lows. On the other hand, if the count of record lows is higher, then the bar is blue, and its height is the ratio of record low to record high.
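The encoding rule just described can be written as a small function. This is a sketch of the rule as I read it from the chart, not code from the report; the function name is hypothetical, ties are arbitrarily colored red, and both counts are assumed to be positive.

```python
def record_bar(record_highs, record_lows):
    """Return (color, height) for one year's bar.

    Assumptions: both counts are positive, and a tie is drawn as a
    red bar of height 1.
    """
    if record_highs >= record_lows:
        return ("red", record_highs / record_lows)
    return ("blue", record_lows / record_highs)

# Made-up yearly counts of record-high and record-low days.
print(record_bar(300, 100))
print(record_bar(50, 200))
```

Note that the bar height is always at least 1 and that the color, not the sign, carries the direction.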

The climate scientists are concerned about the increasing trend in the number of red bars, as well as the height of red bars. This suggests that the earth is getting warmer. The iconic “hockey stick” graph (which also features in those videos) measures the average (excess) temperature, while this chart focuses on the extremes of temperature.

Brown alleges that the use of ratios is manipulative while Green defends it as smart. In this post, I show why using ratios is beneficial, and not problematic for this chart. In a follow-up post, I’ll consider meaningful issues with the chart.

On one level, Brown’s argument is sensible. By showing ratios, rather than the counts, the chart obscures the fact that the number of new records has been decreasing over time. Here is what Brown, citing Steven Koonin’s book, says the counts look like:

I grayed out the front part of the chart as it concerns years that didn’t show up in the original chart. Brown charges:

The only reason to present the data as a ratio is to create a scary visual in which the alarming-looking red lines get taller and taller.

It’s quite possible climate scientists want to show “alarming-looking” graphics. Brown is also right in observing that (a) the absolute number of record highs is not growing (he calls it “holding steady”); and that (b) the ratios are getting larger, primarily because the number of record lows is decreasing.

A paragraph later, Brown serves up his own counter-argument.

In the early years, there were many warmest and coldest days because it's easy to set records over short time spans. By 2018, a day had to be the warmest or coldest in 124 years to count, so record days occurred less frequently.

In other words, both record highs and record lows are expected to wither away over time; the further out we go, the harder it is to set a new record. Therefore, a falling frequency of record highs doesn’t negate global warming, nor do rarer record lows prove it.

Therein lies a challenge of interpreting this chart. The analysis needs a “baseline”. We have to establish what the metric looks like without climate change. Then, if we observe something different from that baseline, we can investigate the reasons for the differences.

The official graphic in the climate report also suffers from the lack of a baseline. They just point to the increasing ratio and suggest that it is consistent with global warming. Left unspoken is the important idea that without climate change, we expect the ratio to settle around 1.

I just mentioned a reason for adopting the ratio metric – it has a recognizable baseline. By symmetry, the trend of attaining record highs should mirror the trend of attaining record lows. Both counts, as we’ve argued, should be decreasing but in a similar way – in the absence of climate change. Thus, the ratio should hug close to 1 (the rest is noise).

This is also the gist of Green’s retort to Brown’s criticism.

It turns out that we can be more precise about the trends of counts of record highs and lows as the years pass by but they would look like curves, not straight lines. We’d have to monitor two curves. This is a good example in which we can construct a “ladder of abstraction” (See my paper with Andrew Gelman here). It’s important for readers to know the composition of the ratio – in this regard, I agree with Brown.

Let’s now make the above statements precise.

We want to nail down the baseline scenario of no climate change. (In classical statistics, we call this the null hypothesis.)

Let’s zoom in on a single measurement station. We imagine that the daily temperatures follow a fixed distribution (for concreteness, one can imagine a normal distribution with mean $\mu$ and standard deviation $\sigma$). When the climate is stable, we assume that each day, we take independent draws from the same distribution.

Consider record highs. Each new measurement is either a new record high or not. The probability that the measurement at time t is a new record high is $\frac{1}{t}$. That’s because each measurement up to time t has equal chance of being the current record.

Start with the first measurement, which is trivially a record high. With two measurements, the second number has equal chance of being higher or lower than the first number. This follows from symmetry induced by the iid assumption. With three measurements, again each one has equal chance ($\frac{1}{3}$) of being the record high, thus the probability that the third one is a new record is $\frac{1}{3}$. This quantifies the idea that it’s harder and harder to break records over time.

The expected number of records up to time t is the partial sum of the harmonic series, \[1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{t},\] which is approximately $\log{t}$ (natural logarithm). Note that the slope of $\log{t}$ is $\frac{1}{t}$.
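A quick simulation can check the harmonic-sum result. The sketch below assumes i.i.d. standard normal draws for a single station, as in the setup above; the series length, number of trials, and seed are arbitrary choices of mine.

```python
import math
import random

def count_record_highs(series):
    """Count how many values exceed every value before them."""
    count, best = 0, -math.inf
    for x in series:
        if x > best:
            count += 1
            best = x
    return count

def harmonic(t):
    """Partial sum 1 + 1/2 + ... + 1/t."""
    return sum(1.0 / i for i in range(1, t + 1))

rng = random.Random(1)
t, trials = 100, 2000
avg = sum(
    count_record_highs([rng.gauss(0.0, 1.0) for _ in range(t)])
    for _ in range(trials)
) / trials

print("simulated average:", round(avg, 2))
print("harmonic sum:     ", round(harmonic(t), 2))
print("log t:            ", round(math.log(t), 2))
```

The simulated average should track the harmonic sum. The harmonic sum itself sits a roughly constant amount (about 0.58) above $\log{t}$, so the logarithm describes the growth rather than the exact level.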

If we have $n$ stations emitting $n$ streams of measurements, then the expected number of records up to time t is $n \log{t}$.

We would then want to put a confidence bound around that number and if the observed number of record highs strays outside the confidence bound, then we have evidence that the climate is not stable.

Since the value $n \log{t}$ increases with both $n$ and $t$, it is not easy to work with.

Instead, we consider the trend of record lows as well. Using the same argument, we’ll find that the probability that the measurement at time t is a record low is also $\frac{1}{t}$; the expected number of record lows up to time t is $\log{t}$ for a single station, and $n \log{t}$ for $n$ stations.

Thus, if we take the ratio of record highs to record lows, the expected value is about 1, since both counts have the same expectation. Of course, because of randomness, at any time t, the ratio would not be exactly 1. It should be close, and again, it’s the confidence bound around the ratio that describes how far it might deviate from 1.
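One way to get a feel for that confidence bound is by simulation. The sketch below is an illustration under my simplifying assumptions, not the method behind any published chart: a fixed number of stations, all starting at the same time, iid normal draws, and the first measurement counted as both a record high and a record low. The function names and the choice of 500 simulations are hypothetical.

```python
import random


def count_records(values):
    """Return (record highs, record lows) in one pass over a series."""
    highs = lows = 0
    hi = float("-inf")
    lo = float("inf")
    for v in values:
        if v > hi:
            highs += 1
            hi = v
        if v < lo:
            lows += 1
            lo = v
    return highs, lows


def simulate_ratio(n_stations, t, rng):
    """Ratio of total record highs to total record lows across all stations."""
    total_highs = total_lows = 0
    for _ in range(n_stations):
        series = [rng.gauss(0.0, 1.0) for _ in range(t)]
        h, l = count_records(series)
        total_highs += h
        total_lows += l
    # total_lows >= n_stations because each first measurement is a record low
    return total_highs / total_lows


def null_interval(n_stations, t, n_sims=500, level=0.95, seed=1):
    """Empirical interval for the ratio under the stable-climate baseline."""
    rng = random.Random(seed)
    ratios = sorted(simulate_ratio(n_stations, t, rng) for _ in range(n_sims))
    k = int((1 - level) / 2 * n_sims)
    return ratios[k], ratios[n_sims - 1 - k]


if __name__ == "__main__":
    print(null_interval(n_stations=50, t=50))
```

An observed ratio that falls outside the simulated interval would be evidence against the stable-climate baseline, subject to the assumptions listed above.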

This calculation can be further refined as I made another simplifying assumption, namely, that there is a fixed number of stations collecting data, and they were all deployed at the same time.

As the decaying probability of new records indicates, stations with a shorter history have a disproportionately larger impact on the number of record highs and lows. Brown argues that such recency bias is a reason to ditch using ratios. His diagnosis is correct but the treatment wrong-headed. The bias manifests itself absolutely in the counts. If we use the ratio of highs to lows, then the bias gets cancelled out.
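The cancellation can be seen with a little arithmetic. In this sketch, each station has an "age" (the number of measurements it has taken so far), and under the baseline scenario it sets a new record high with probability 1/age, and a new record low with the same probability. The two networks below are hypothetical.

```python
from fractions import Fraction


def expected_new_records(ages):
    """Expected number of new record highs (equally, lows): sum of 1/age."""
    return sum(Fraction(1, age) for age in ages)


# Hypothetical networks: ten 100-year-old stations, then the same plus two 5-year-old ones.
old_network = [100] * 10
mixed_network = [100] * 10 + [5] * 2

for name, ages in [("old only", old_network), ("with young stations", mixed_network)]:
    highs = expected_new_records(ages)
    lows = expected_new_records(ages)  # same argument applies to record lows
    print(name, "expected highs:", highs, "expected lows:", lows, "ratio:", highs / lows)
```

Adding two young stations raises the expected count of records from 1/10 to 1/2, but it does so for highs and lows alike, so the ratio of expected counts stays at 1.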

The use of the ratio isn't a problem. There are other issues with the chart. I'll discuss those in another post.

Pushing U.S. leftward, in a way

Kaiser Fung — Tue, 28 Apr 2026 08:59:09 -0400

I return to Reuters' very strong entry in the Webby Awards. (See my review of all entrants here.)

Deep inside this project is a standard stacked bar chart:

The chart covers an interesting story, imports of luxury products. The U.S. accounts for quite a lot of these imports. The only other nations that make an appearance are China and France. I'm assuming the absence of red and green in 8 of the 10 categories means China and France do not import those items in large quantities.

For the most part, the bar chart is nicely polished.

The one thing that stands out is the sorting of countries by alphabetical order (with "rest of the world" tagged on).

It's a simple edit to move all the U.S. bars to line up on the left side. Like this:

I find it curious that jewellery is split into "silver" and "non silver". Silver totals less than $10 billion while non-silver adds up to over $110 billion.

The legal status of AI chats and Google docs

Kaiser Fung — Fri, 24 Apr 2026 08:07:17 -0400

Whether one can get into legal trouble for AI chats is something to which we should be paying attention. The AI platforms have been refreshingly honest about their right to share AI chats with third parties, which include authorities.

In a recent piece of news (link), a federal judge ruled that attorney-client privilege does not extend to someone's chats with an AI chatbot even if they are preparation for talking to one's lawyers. This decision sounds reasonable – the AI is not a lawyer, and thus cannot be part of one's legal team.

But I feel like the decision also raises perplexing questions.

If one thinks of AI as a tool, then it's not much different from Google Docs or Microsoft Word. Word used to be more "private", in the sense that the software runs locally on one's computer, and works even without an Internet connection. Microsoft can't share our documents, as they don't exist on Microsoft's computers. Google Docs (or Word in the cloud) represents the class called SaaS, software hosted on remote computers, so that the documents of the users appear "private" but are really in the possession of Google.

Not being a lawyer, I wonder if one were to put preparatory materials on Google Docs (or Microsoft Word in the cloud), whether this act creates the same problem, namely, that such materials are not protected by attorney-client privilege.

Further, if one actively collaborates with a lawyer on the Google Doc, is that document protected by attorney-client privilege? Would Google share it with third parties under any circumstances? Would the aforementioned judge decide that the document does not enjoy attorney-client privilege? Is the distinction whether the lawyer is actively involved in the editing or not?

It gets even muckier, as even the blind can see that Google, Microsoft and every other software developer have forced AI features onto editing software, thus blurring the lines between the two product classes. Google Docs and other similar software now boast that they use AI models to suggest edits, auto-completions, etc. By allowing such features (in some cases, I'm not even sure users are able to turn them off), do users give up attorney-client privilege?

Maps of Top X

Kaiser Fung — Wed, 22 Apr 2026 08:59:54 -0400

Some years ago, maps of Top X came into fashion. The example above is lifted from Visual Cinnamon's Webby Award nominee (and co-winner). This is just one chart from their massive project. See here for my short reviews of all nominees.

The most common version of maps of Top X is the map of Top 1. Using the tile map as the base, Visual Cinnamon displays every state as a circle of equal size. The tile map condones distortion of areas as well as of relative positions. The project concerns bird-watching. Each state is represented by its #1 Googled bird. At a glance, we can see that Bald Eagle (yellow) and Northern Cardinal (red) dominate Google searches. There are only seven states in which neither of these two birds came first.

Maps of Top 1 dump most of the data. They don't differentiate between states in which residents split their attention across a large variety of birds, and states in which residents are obsessed with a small number of birds.

To show what's possible, I made some maps of Top 2.

This map of Top 2 shows where Bald Eagle was the most popular, measured by Google searching. This bird is popular almost everywhere; only in five states does Bald Eagle not end up Top 2 on Google. In roughly half of the highlighted states, Bald Eagle was the #1 searched bird.

You can imagine what the map of Top 1 looks like by ignoring the yellow, lower semicircles. The story is enriched by just adding back one layer of information.

Deploying the small multiples principle, I made a map of Top 2 for Northern Cardinal.

Outside of Western states, Northern Cardinal is intensely searched. In the South, this bird is ranked #1, surpassing Bald Eagle, which frequently drops to #2.

We'd get a map of Top 1 if we only paid attention to the deep red, upper semicircles. The story is richer just because I added one extra layer of information.

The third map addresses those states in which Bald Eagle and Northern Cardinal do not account for both top positions.

There are seven other birds that feature. Most of these less popular birds show up as #2, not #1. Thus, they don't appear on maps of Top 1.
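For readers who want to build their own maps of Top X, the ranking step can be sketched as follows. The states and search scores here are made up for illustration; they are not the Visual Cinnamon or Google Trends data, and `top_k` is a hypothetical helper.

```python
def top_k(search_counts, k=2):
    """Return the k most-searched birds, highest first (ties broken by name)."""
    ranked = sorted(search_counts.items(), key=lambda item: (-item[1], item[0]))
    return [bird for bird, _ in ranked[:k]]


# Hypothetical search-interest scores, for illustration only.
searches_by_state = {
    "State A": {"Bald Eagle": 90, "Northern Cardinal": 85, "Blue Jay": 40},
    "State B": {"Bald Eagle": 60, "Northern Cardinal": 95, "Blue Jay": 70},
}

top2 = {state: top_k(counts) for state, counts in searches_by_state.items()}

# Birds that would show up on a map of Top 1 versus a map of Top 2.
birds_on_top1_map = {ranks[0] for ranks in top2.values()}
birds_on_top2_map = {bird for ranks in top2.values() for bird in ranks}

print(top2)
print(sorted(birds_on_top2_map - birds_on_top1_map))
```

In this toy example, Blue Jay never ranks first, so it only becomes visible once the second rank is added to the map.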

Notes on Webby nominees

Kaiser Fung — Wed, 15 Apr 2026 09:18:32 -0400

Long-time reader Chris P. asked me about the Webby Awards nominees for best data visualization (link). The award is open to voting right now.

There are five contenders for the Webby Award. I’ll give some thoughts in the order they appear on the website. I’ll post more detailed commentary in future posts.

Sources of U.S. imports – Reuters (link)

This project addresses the question of where Americans get imported goods from. Some of the answers are quite surprising, if you aren’t a trade economist. The designers engage the audience by casting the project as an interactive quiz: after presenting the types of goods the U.S. imports from a particular country, the reader is asked to guess the country. The illustrations convey a sense of light-hearted fun. The data graphics skew toward visual appeal, sacrificing clarity. What impresses me the most is the quality of editing: they do not show every bit of data they collected. They focus on several major categories of imports that everyone can relate to, such as food, medicine, and electronics.

Birds – Visual Cinnamon (link)

This project is made for bird lovers. It analyzes which birds are the most sought after by Americans. A tremendous amount of work went into compiling data across 50 states, Google Trends, various databases and so on. If Reuters’ project is poetic, Visual Cinnamon built an encyclopedia. Pretty much every arrangement of the data has been visualized. Which bird is most Googled in which state; which bird type is most Googled in which state; in which states is a bird type most Googled; how birds are classified into bird types; specific details of each bird or bird type; correlations between Google searches, reported sightings, and population sizes... you have a question, they have an answer somewhere on the website. A variety of chart forms is deployed, some more successful than others. Like several of the nominees, this one is about exploring and learning from data, and can’t be reduced to Cliff Notes.

Air Pollution – Earth Genome (link)

This project visualizes major sources of air pollution around the world. Unlike the previous two projects, it does not utilize a vertical scroll that reveals a sequence of graphics. Instead, the project consists of a single chart form, which shows motion across space. Space is the old reliable map. The motion is quite ingenious; it shows polluting particles streaming out of these facilities, and spraying the pollution around as wind conditions change over time. The density of the roving dots conveys the amount of pollutants, although I find it challenging to visually compare these densities. The data come from simulations from models. In the ladder of abstraction (link), the designers chose to be less abstract, as the moving dots suggest pollutants (many of which, ironically, are invisible in real life!). It’s quite fun to stumble upon visual insights, like this:

One can see how the emissions from several facilities are blown toward certain population centers during certain times of the day.

Spotify Wrapped (link)

This project is Spotify’s version of the year-end reports that tech companies send out to their users. These reports remind users of their activities on the platform during the past year. This project deserves its own category. It really has little in common with the other nominees. The primary distinction is that it isn’t one size fits all; Spotify automatically creates individualized visual products for its millions of users. It also doesn’t analyze; simple stats are presented in a visually appealing style. The link describes the general design of Wrapped but we don’t have the pleasure of seeing a specific example of it. There is a not-so-obvious trade-off of this mass customization strategy: not all sections of the design will be relevant to all users. Nevertheless, each user isn’t lost inside their entire database.

Blue Corridors – Ode Partners (link)

This project shows eight species of whales, and threats to their habitats. In spirit, this data visualization is identical to Earth Genome’s air pollution project. It also shows motion in space. Space is also the reliable map; the twist is the inclusion of a spherical option. The major traces of whales over time are shown dynamically. The project includes every bit of data, for every location, and point in time. Under a different tab, the designers attempt to address tricky cause-effect issues. They deal with this by adding background maps; for example, there’s one that shows shipping and vessel strikes. The superposition of dynamic traces and background maps makes for a formidable fishing expedition. Like several others, this project elevates data exploration as the end goal.

P.S. [4/21/2026] Edited the link to the Webby Awards to reflect the new link after the winner is announced. There appear to be two winners: Blue Corridors and Birds. Congratulations!

Know your data 48: selling faces

Kaiser Fung — Thu, 09 Apr 2026 09:21:21 -0400

In the last post about Clearview AI's face recognition database, I asked the rhetorical question:

Have you heard a peep from Facebook/Instagram, Twitter, Google, etc. about Clearview illicitly taking images from their properties?

I didn't expect an answer to appear in the same week!

The FTC has "settled" with OKCupid, which is a dating app owned by the Match group, on giving data to a face recognition company (Clarifai, a competitor of Clearview) without proper notification to its users (link).

I put "settled" in quotes because throughout the story of the data belonging to Americans, the meaning of many words has been warped beyond recognition. This is, as Ars Technica pointed out in its header, a "settlement" without any financial penalty. The company just swore on its pinky that it would behave in the future.

We can learn several bits of interesting information from this news though.

Firstly, despite being allowed to hide questionable behavior behind obnoxiously long, pugnaciously vague, and presumably unread privacy policies, these tech companies still have to engage in activities that they simply could not afford to put down in writing.

Secondly, the word "sharing" has also lost any meaning. Does anyone – except the FTC enforcers – seriously believe that OKCupid gave 3 million photos to Clarifai for free? And, as an act of charity, would you also take our users' location and their demographic data?

The FTC claimed it knew what happened: based on the passage quoted by Ars Technica, it appeared that they merely repeated the story told by OKCupid, Match, and Clarifai. They claimed that no formal agreement existed but disclosed that the founder of OKCupid and the CEO of Match were both "financially invested" in Clarifai. These parties somehow believed that this cover story gave them a get-out-of-jail card, a rationale to support their use of the word "sharing". In fact, this is even more troubling than if a straightforward commercial agreement were to exist.

For one, this story proves that user data at tech companies are in the hands of individuals. (We already sort of knew from some past actions e.g. by Elon Musk.) It also shows that these individuals – none of whom face any kind of sanctions – will sell out their users for personal financial gain. When no agreement exists, it's harder to trace where, when and what data have left the building.

Thirdly, this settlement stemmed from actions that took place in 2014, so it took 12 years for regulators to uphold the law by a friendly handshake with the offenders.

Fourthly, let's read the terms of the settlement carefully, shall we? Hold on to your seat belts, this is truly scary:

OKCupid and Match... agreed to a permanent prohibition barring them from misrepresenting how they use and share personal data.

Wait, so businesses are heretofore not prohibited from "misrepresenting how they use and share personal data"? It takes 12 years and a negotiated settlement to confirm to Americans that our businesses are in fact free to misrepresent how they use and share personal data – unless the FTC imposes a specific "permanent" ban on lying?

Fifthly, apparently Clarifai has done absolutely nothing wrong in this matter, even though it is the entity that sought out the people's data, and exploited the data to build a product that sometimes might send them to jail for months because of errors (See prior post). For good measure, they even told us that they sell to "foreign governments, military operations and police departments".

Last but not least, this news resolves the mystery of how companies like Clearview AI and Clarifai build out their enormous databases of people's images associated with their personally-identifiable data. They may not even need to "scrape" the data; they simply get them "for free" via secretive "data sharing."

The paradox of circles

Kaiser Fung — Wed, 08 Apr 2026 09:13:45 -0400

While I was writing the posts about making paired circle charts (here, here), I came across a paradox.

The oft-repeated guidance when making circles or bubbles is that the data should be encoded in the areas. Since circular areas involve squaring the radii, we should square-root the data if we're putting the data in the radius (or diameter). This action happens frequently because in most drawing programs, we can control the radius (or diameter) easily, but we can't directly calibrate the area.

Now, take a look at the following chart:

I dislike these racetrack charts (link). Anyone who has run on a track knows the outer lanes are longer than the inner lanes. These charts draw our attention to the circumferences of the circles.

Therein lies the rub. The circumference is proportional to the radius (or diameter). Therefore, if we make an emptied-out circle, we should not square-root the data.

The trouble is that in every circle, the area and the circumference (or border) are simultaneously present. Hence, the paradox of circles. If the data are attached to the area, then the circumference distorts them; if the data are tied to the circumference (or radius or diameter or angle), then the area distorts them. One can't have both!
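The paradox can be put in numbers. Below is a small sketch using two made-up data values, 1 and 4; the function names are mine. A filled disc gets the square-root treatment, an emptied-out ring does not, and in each case the other quantity ends up with the wrong ratio.

```python
import math


def disc_radius(value):
    """Radius of a filled circle whose area equals `value`."""
    return math.sqrt(value / math.pi)


def ring_radius(value):
    """Radius of an emptied-out circle whose circumference equals `value`."""
    return value / (2 * math.pi)


def ratios(r_small, r_large):
    """Return (area ratio, circumference ratio) of two circles given their radii."""
    area_ratio = (math.pi * r_large ** 2) / (math.pi * r_small ** 2)
    circumference_ratio = (2 * math.pi * r_large) / (2 * math.pi * r_small)
    return area_ratio, circumference_ratio


small, large = 1.0, 4.0  # hypothetical data values; the true ratio is 4

# Data in the areas: areas are right (about 4), circumferences understate (about 2).
print(ratios(disc_radius(small), disc_radius(large)))

# Data in the circumferences: circumferences are right (about 4), areas overstate (about 16).
print(ratios(ring_radius(small), ring_radius(large)))
```

Whichever encoding the designer picks, the reader who looks at the other feature of the circle sees either a ratio of about 2 or a ratio of about 16 instead of 4.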

I asked Andrew about this paradox, and of course, he has written something related to it. It's in this post, which was a response to an earlier post of mine (link).

Andrew featured the following chart he made about the "social penumbra" of different groups of people:

He ran into the neither here nor there problem, saying:

So by displaying the data as areas, we’re knowingly handing people a distortion. For example, if a certain group represents 1% of the population, then the core group (the yellow circle in the graph) will take up 1% of the area of the full circle and thus will be 10% in linear dimension.

Instead of full circles, he made quarter-circles. I think it's a brilliant move. The other three-quarters are just wasted space, so to speak. However, because the right-angled edges are present, the readers may be more likely to pay attention to the radius of the quarter-circle, rather than its area.

For what it's worth, this is the legend:

Let's zoom in on the "gun owner" category at the bottom left corner:

The first gray circle ("family members") has a radius that is about 1.5 times that of the yellow "core". If the data are encoded in the circular areas, then the gray circle's area is $(1.5)^2 = 2.25$ times that of the yellow "core". So, the size of family members but not core is 1.25 times the size of the core.

On the other hand, if the data are encoded in the radii, then the gray circle (think its circumference) is 1.5 times that of the yellow circle, so that the set of family members (not core) is about half the size of the core.
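The two readings can be laid side by side in a few lines. This sketch takes the radius ratio of 1.5, which is my eyeball estimate from the chart, and computes the size of "family members but not core" relative to the core under each encoding. The function name is made up for illustration.

```python
def ring_vs_core(radius_ratio, encoding):
    """Size of the outer ring (outer circle minus core) relative to the core.

    radius_ratio: outer radius divided by core radius, as read off the chart.
    encoding: "area" if the data are in the areas, "radius" if in the radii.
    """
    if encoding == "area":
        outer_total = radius_ratio ** 2
    elif encoding == "radius":
        outer_total = radius_ratio
    else:
        raise ValueError("encoding must be 'area' or 'radius'")
    return outer_total - 1.0


print(ring_vs_core(1.5, "area"))    # 1.25
print(ring_vs_core(1.5, "radius"))  # 0.5
```

The same picture yields 1.25 under one reading and 0.5 under the other, a gap of 2.5 times.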

Thus, the "distorted" quantity is quite severely distorted. As a designer, you're hoping that your readers interpret the chart the way you intended (one of area or circumference, but not both).

In this case, I would be surprised if readers are focused on the circumferences. They might try to measure the radii since that's much easier to compare than the areas. (This is still true if the full concentric circles are shown.) On balance, I still think these quarter-circles have a place in our toolbox.

In the older post, I asked whether designers should (be forgiven for?) deliberately distort data in order to correct known visual illusions.

The quarter-circle example is related but not quite what I had in mind. This paradox of circles is such that we are forced to distort one quantity no matter what; so we aren't really doing a double-negative to undo an illusion.

The log chart is also related but not quite what I had in mind. In a log chart, we deliberately introduce a severe distortion, and it's not because readers apply an illusion to undo its effect.

Know your data 48: AI solves missing connections (or not)

Kaiser Fung — Wed, 01 Apr 2026 08:46:48 -0400

A sob story (link) about a grandmother in Tennessee raises a number of issues related to use of AI tools by law enforcement.

As reported, Angela Lipps was falsely accused of bank fraud near Fargo, North Dakota, a crime she did not commit. She was charged in ND, arrested and sat in jail in Tennessee, extradited to ND, and ultimately the charges were dismissed, and she was released after five months 🥲.

Lipps’s suffering was due to "misidentification". Her lawyer was able to produce bank records that proved that she was in Tennessee at the time of the crime. Tennessee is over 1,000 miles (1,600 km) away from North Dakota. In fact, according to Lipps, she doesn’t travel, has never even boarded an airplane, let alone gone to North Dakota!

The situation ultimately got traced to the notorious face surveillance company, Clearview AI. This private company makes a business out of scraping images from social media and online sources, building a gigantic database that is used to “doxx” people.

Let’s dissect that bit by bit.

I deliberately use the word “doxx”. Doxxing is usually associated with someone publishing someone else’s personally identifiable information (PII, such as names and addresses) without that person’s consent, usually for the purpose of shaming, revenge, etc. According to Wiki (link), the U.S. has weak regulations in this area; only a few states consider doxxing illegal.

The contempt for doxxing appears to stop at the corporate door. Clearview’s raison d'être is doxxing on steroids. Its entire business is tagging images with people’s names (which naturally leads to other PII data, given the motivation of its customers). Instead of putting it up on social media, like a political activist might do, Clearview sells the information to someone willing to pay, also without that person’s consent, and worse, behind that person’s back. Government agencies are its primary customers.

According to the Wiki page, Clearview settled a lawsuit in 2022, agreeing not to sell to “private individuals and businesses.” But the linked CNN article used the qualified phrase, “most companies in the United States”. The Lipps case obviously shows that government use of such a tool is not harmless.

The other key word is “scraping,” about which I wrote recently (here). Clearview is engaged in large-scale harvesting of images across multiple platforms; that is their central value proposition. For this business, they need as many images as possible, as recent as possible. Have you heard a peep from Facebook/Instagram, Twitter, Google, etc. about Clearview illicitly taking images from their properties? Neither have I. Years ago when the New York Times covered this company, they made some noise but there have been no lawsuits or enforcement actions that I’m aware of.

The Lipps case is highly instructive, showing us how surveillance data can sometimes harm people.

According to the police department that used Clearview to identify Lipps as the criminal, the true criminal had used Lipps’s picture on a fake ID. How did the true fraudster have access to Lipps’s picture? Most likely from social-media scraping!

Misidentification is evidently a misnomer. If we believe the police’s story, then Clearview correctly identified Lipps from the fake ID photo. The problem appears to be that they ran with that, without looking for corroborating evidence. A surveillance image actually existed that would have exonerated Lipps if it were inspected.

The Lipps case also shines a light on the gray legal area in which these law enforcement agencies work. The Fargo jurisdiction did not have any AI for facial recognition; it then asked neighboring West Fargo police to help out, because they have Clearview. What’s to stop Clearview users from doxxing someone for non-official matters?

(By the way, Clearview’s sales team is probably knocking on the doors of the two police departments – because they have learned that users are sharing their Netflix accounts, so to speak.)

The proposed solution by the Fargo police is to route such identification requests to the North Dakota State and Local Intelligence Center, which has specific expertise in AI tools.

When I read that, I said to myself, I bet NDSLIC also uses Clearview, which would not have made a difference in the Lipps case. A quick search confirmed it. The West Fargo police chief defended his department (link), saying exactly that: “[we] did send it to NDSLIC, which returned the exact same results using the identical Clearview AI software.”

In the U.S., there is so much involvement by private entities in these aspects of law enforcement that it becomes very hard to figure out if proper and legal process has been applied.

Undoubtedly, facial recognition technology has solved, and will continue to help solve, crimes. But just as surely, such technologies, under various, sometimes unexpected, circumstances, will result in innocent people being “harassed,” and in Lipps’s case, thrown in jail for months. Where do we draw the line?

P.S. [4/1/2026] Other posts in the Know Your Data series are here. [6/11/2026] Corrected the post number in the series.

Why only me?

Kaiser Fung — Tue, 31 Mar 2026 09:01:50 -0400

The doping story won't stop.

A past winner of the New York Marathon has recently failed doping tests, and is being banned from competing for five years (link). He tested positive for a new version of EPO, which was an emerging drug at the time I wrote Numbers Rule Your World (link). In Chapter 4, I discussed what it really meant when Lance Armstrong, at the time the GOAT cyclist, said he passed hundreds of doping tests in his career. Embarrassingly, statistics instructors at the time were comparing doping tests to mammograms, which are notorious for the number of false positives they generate. I showed why false negatives are the real problem – this all happened before Armstrong's downfall.

Albert Korir, from Kenya, is a star. In addition to winning the New York Marathon in 2021, he placed second in 2019 and 2023, and third in 2024 and 2025.

Given that he passed all tests (a la Lance Armstrong) in those past years, he is stripped only of honors earned since October 2025. Sadly, his third-place finish in 2025 no longer stands.

The only surprise is that he admitted the offence, and received a one-year reduction in penalty. He didn't say he ate any contaminated beef, or used his sick father's spoon, or drank from someone else's water bottle.

In drug testing, for every athlete caught doping, there are many more who elude detection. Indeed, for every athlete caught doping, plenty of prior tests of the same athlete had come back negative.

Or, you can be the person who believes that the first time these athletes crossed the doping line, they got caught red-handed.

Guide to using pairs of circles

Kaiser Fung — Tue, 24 Mar 2026 08:52:50 -0400

In my last post, I featured the following student project that uses nested circles to compare pairs of data.

The underlying data are measures of change in wealth over time, specifically, a 15-year period (2000-2015). In each pair, one circle represents the "rich" and the other circle represents the "poor". So, for each country, there are two numbers being compared. For most countries, since the rich are getting richer, the "rich" circle is the larger one.

I find it useful to start by looking at the "boring" way of presenting the same concept, using side-by-side bar charts.

This dataset contains certain identifying features, due to how the Economist chose to define wealth disparity. Each number is the relative wealth of the rich (or poor), relative to the national average (=100), in each year. Because of the skewness of the wealth distribution, the numbers for the rich are usually quite a bit larger than those for the poor; it follows that the changes in wealth for the rich are also larger than those for the poor. In fact, the change in wealth for the poor is typically negative: if the rich are running higher, the poor should be falling lower! That's because the average is pinned to 100 in each year, so the change in the average is zero. An exception is if the middle class lags while both ends of the distribution gain.

The circular version separates the direction and magnitude of the data: the circular areas encode the absolute values of wealth changes, while the colors show the direction of change (up or down).

In this post, I explore a few design decisions when making such circular charts:

sizing individual circles,
handling direction and magnitude,
determining relative sizes of circle pairs.

The basics first. Since the data are encoded in the areas of the circles, and the area of a circle is proportional to the square of its radius, we usually have to feed the square-root of the data to the plotting software.
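Here is a minimal sketch of that preprocessing step, under the assumption that magnitude goes into the area and direction is handled separately (say, by color). The function name, the `scale` parameter and the input values are hypothetical.

```python
import math


def circle_for(value, scale=1.0):
    """Return (radius, direction) for one signed data value.

    The area of the circle is proportional to the absolute value of the data,
    so the radius is proportional to its square root.
    """
    radius = scale * math.sqrt(abs(value))
    if value > 0:
        direction = "up"
    elif value < 0:
        direction = "down"
    else:
        direction = "flat"
    return radius, direction


# Hypothetical pair: a gain of 16 index points and a loss of 4.
print(circle_for(16))  # (4.0, 'up')
print(circle_for(-4))  # (2.0, 'down')
```

The first value has four times the magnitude of the second, but its circle has only twice the radius, which is what makes its area four times as large.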

Let's take a generic pair (A, B). There are three possible relationships between A and B: A>B, A<B, and A=B.

The strict inequalities can be simply accommodated:

The case of equality disturbs the peace. When A=B, the two circles have the same areas; they completely overlap.

One way out of this problem is to assert that the case of A=B is sufficiently rare as to be ignorable. I'd be willing to accept such an assumption in the case of the wealth inequality dataset.

If such an assertion is not supported, then a more creative solution is needed. For example, put them side by side.

An additional complication arises when the data contain both positive and negative values, which is the situation with the change in wealth data.

As shown below, we have six feasible configurations, requiring two colors plus two tinges, coupled with which circle is larger.

In each configuration, which circle is larger is immediately apparent. Then, the tinge signals whether the individual element (A or B) has positive or negative sign. In our dataset, the tinge signals either gain or loss in relative index over time.

Alexis, the student who made the featured chart, simplifies the situation, as she applies color to the gap in wealth changes (i.e., $A - B$), rather than the wealth changes themselves (A, B). Thus, there is only one value, and one corresponding color, per pair of circles.

The larger circle is given a fixed color (blue here). The color of the smaller circle is the direction of the difference in wealth changes between the rich and the poor – in other words, the direction of the wealth gap.

The simplicity is achieved by giving up the ability to distinguish between the various cases shown above. We go from six possibilities to two.

In Alexis's chart, all circles conform to an unspoken single scale, aligned to the 2015 relative wealth index for the rich.

This represents a third dimension. The pair of circles shows the wealth changes of the rich and the poor. The designer has freedom to choose what to use for this third dimension. This is not a decision available for the standard bar chart presentation.

The following illustrates the effect of introducing a third dimension. The top set of circles does not utilize the third dimension while the bottom set of circles does.

In the top row, the focus is within-country variation. In Japan as well as Spain, both the rich and the poor shifted in the same direction between 2000 and 2015, and the magnitude of the shift of the poor was roughly half that of the rich. In the United States, the rich got richer while the poor got poorer. The wealth change for rich Americans was roughly 20 times that for the poor.

In the bottom row, the sizes of the circles for Japan and Spain are all aligned with those for USA. Both within-country and between-country variations are present.

It's up to the designer to figure out whether, and how, to utilize this third dimension.

Getting metrics right is half the battle

Kaiser Fung — Wed, 18 Mar 2026 08:01:29 -0400

I've had a chance to look at some class projects from Ray Vella's NYU class recently. I've featured work from prior classes on this blog before (link).

The project objective is to improve a chart on income inequality published by the Economist.

This term, I was most interested in two solutions that involve rethinking how inequality should be measured.

One of the most confusing parts of the Economist original chart is the unit of measuring inequality. You see the data labels 848 and 1,150 up top, and it's puzzling what those numbers mean. The subtitle claims the data concern "GDP per person": are they expressed in British pounds? 848 pounds would be too little, while 848,000 pounds could be too high; then again, the blue dots represent "the rich," and if they mean the "super-rich," it might not be large enough.

The answer is None of the above. Following the asterisk, the reader learns that the GDP per capita data have been adjusted. You'd have to know enough about economics data to see that "at purchasing-power parity" implies that all values are US dollars.

Swapping pounds for dollars, I'm still perplexed in much the same way. No text suggests that I should add 000 to these units. That's when one has to return to the titles, and notice "National average = 100".

The plotted data are evidently index values, with the national average set to 100. 848 is 8.48 times the national average while 1,150 is 11.5 times the national average.

Strictly speaking, the plotted data reflect the values relative to the national average in each year. The national average for the U.K. is 100 in both 2000 and 2015 even though the average in pounds for 2015 is surely quite a bit higher than the average in pounds for 2000.
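To make the indexing concrete, here is a minimal sketch of the transformation as I understand it. The income figures and the labels "year A" and "year B" are invented for illustration (chosen only so that the index values land on 848 and 1,150); they are not the Economist's data.

```python
def to_index(value, national_average):
    """Express a value relative to the national average, with the average set to 100."""
    return 100 * value / national_average

# Hypothetical figures, invented for illustration only.
year_a = {"national_average": 40_000, "rich": 339_200}
year_b = {"national_average": 50_000, "rich": 575_000}

for label, d in (("year A", year_a), ("year B", year_b)):
    avg_index = to_index(d["national_average"], d["national_average"])
    rich_index = to_index(d["rich"], d["national_average"])
    print(label, avg_index, rich_index)
```

The national average comes out as 100 in both years even though the underlying average has risen, which is exactly why the index hides the change in absolute levels.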

The point of taking you down this dark tunnel is to demonstrate how much work it takes to explain to the reader how the designer has transformed the data, and to convince you, hopefully, not to venture into the dark side.

[To really explain it fully, I'd need another blog post because the above description is still missing one important piece.]

The first of the student work, by Thomas Carlson, reverts to more conventional measures of income inequality. (He submitted several views, one of which I'm discussing here.)

The rich is represented by "top 10%" average pre-tax income; the poor, by "bottom 50%" average pre-tax income. This pair of metrics is much easier to understand. The poor are the bottom half of the distribution, i.e. below the national median.

The vertical axis shows percentages. Instead of showing what their average income is in dollars, which is also hard to interpret without comparisons, he's showing what share of the nation's pre-tax income each group receives. Taking USA (on the far right) for example, the top 10% of Americans receive almost half of the nation's income while the bottom half of Americans get just 12%. (By inference, the other 40% of the population have roughly 40% of the income. An interesting symmetry is revealed in this data.)

By Thomas's metric, the U.K. is not particularly remarkable. This underlines the point that units of measurement/definitions of metrics matter a lot.

Thomas improved the chart a lot by addressing what I call a Type D problem, in the style of the Trifecta Checkup (featured in this recent post). He also changed the chart form (a Type V problem).

From a visual perspective, the most striking effort was the work by Alexis Duhaney.

She used circular areas instead of bars to visualize the data. Each country's situation is depicted by two numbers, one each for the rich and the poor. Each number is a pre-post change in the underlying measure of wealth.

The underlying metric is taken from the original Economist chart, so it would take a day and a half to explain it to readers. But the instinct of expressing a pre-post change is a sharp one. One can switch to Thomas's metrics, as an example. As such, the underlying metric becomes the change in share of pre-tax national income.

The visual appeal of this circular design is beyond question. Switching to circles introduces a whole set of issues, which I'll cover in a separate post.

P.S. [3/24/26] Added link to follow-up post.

The emerging AI agents war

Kaiser Fung — Fri, 13 Mar 2026 08:29:53 -0400

Mark Palko sent me news that Amazon obtained an injunction against Perplexity's shopping bot (link).

Perplexity is best known as a pioneer of AI-assisted web search, a product that I'd confidently say will find its market. It will succeed not because it delivers better search results, but because it offers a far more natural, far simpler user experience.

The recent news concerns something else – Perplexity's shopping bot that browses around and shops for things on behalf of users. This shopping bot is an example of an "AI agent," a term you must have heard of if you follow any tech news.

First, here's an incidental demonstration of my point about AI search.

Doing research for this post, I typed keywords like "Perplexity shopping with Comet" in a traditional search engine, yielding pages upon pages of recent pieces about Amazon's lawsuit, despite my deliberate omission of the word "Amazon" or anything legal. Using an AI chatbot, with a prompt like "I want to find links to articles that introduce the Comet browser shopping feature offered by Perplexity starting last year. I don't want recent links about Amazon's lawsuit", I got exactly what I wanted. Here's a link to an article about the Comet shopping feature. (It's an "ad" by a company in the AI agent space, which is a different issue altogether.)

Based on currently available products, an AI agent is an automated workflow. In the article linked above, Perplexity Comet's edge is said to be:

Instead of waiting for your next search query, it actively completes tasks, negotiates purchases, and automates shopping workflows that previously required dozens of manual steps.

In the course of online shopping, one might start with an idea of what to buy. Then, one might find articles written about the "best" items in that category, noting pros and cons, and prices. One might then shortlist some options, and pick one. Then, one might select a retailer that sells the selected item, figure out its shipping and return policies, and if satisfactory, complete the transaction.

Perplexity's Comet browser does all these tasks:

The AI agent can auto-fill forms, conduct multi-site research, aggregate reviews, compare pricing, and — critically for commerce — initiate and complete purchase transactions.

It's time to introduce the naughty word: "scraping". This is the crux of Amazon's grievance.

In order for Comet (or any other AI agent) to fulfill those tasks, it must navigate around websites, extract data from webpages, analyze the data, and make decisions. Extracting data from webpages is the activity known as "web scraping".

Web scraping is a strange beast. It has no reason to exist and yet, it's everywhere. When the data science field was created some 15 years ago, a common starting point of a textbook teaching Python is web scraping. Open up some webpage, and grab the data on the page.

Imagine you're the owner of a small on-line seller of widgets. An engineer at a competitor writes a web scraper to compile a database of the products you sell, and the prices. This scraper browses your website page by page, extracting the product and pricing information.

As the owner, you either consider your product pricing catalog confidential or public information.

Most retailers treat it as a trade secret – if you take a notepad and start jotting down every product and price in a Walmart or Target, you'd most likely be stopped. Ditto, most online retailers deploy technology to detect and block web scrapers, typically by refusing to serve them webpages (403 errors). These retailers act as if the information presented publicly is protected. This stance has led to an arms race, as developers work around the anti-scraping tech. Regardless of one's view on this, we can agree that if the retailers treat their product catalogs as trade secrets, anyone trying to scrape the data is acting against the retailers' wishes, and may face legal jeopardy. (I never understood why college professors taught web scraping as the first example of a Python script.)

Alternatively, some retailers might view their product pricing data as public information, so that they are okay with third-party access. In this world, web scraping bears no legal risk, but it is a poor technical solution nonetheless. The proper approach is to create APIs so that developers can register themselves and request the data they want in an open, orderly fashion.

All retailers have databases that hold their product and pricing data. Their websites grab data from these databases, and present them in nice formats to customers. Web scraping code grabs the data, together with the layers of formatting, spread out across hundreds and thousands of pages, and then removes the packaging, and merges the page-level data, restoring the structure of data. If successful, the output of the web scraper is similar to what the retailers hold in their databases! In practice, it's an inexact copy of the retailers' databases, riddled with errors. If these retailers consent to sharing the data, there are better ways to organize the data exchange.
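For readers who have never seen one, here is a toy sketch of the "remove the packaging" step, using only Python's standard library. The page markup, the class names ("product", "name", "price") and the widget data are all made up for illustration; real retailer pages are far messier, which is where the errors creep in.

```python
from html.parser import HTMLParser

# Hypothetical product page markup, invented for this example.
PAGE = """
<ul>
  <li class="product"><span class="name">Widget A</span><span class="price">$4.99</span></li>
  <li class="product"><span class="name">Widget B</span><span class="price">$12.50</span></li>
</ul>
"""

class ProductScraper(HTMLParser):
    """Collects name/price pairs from the hypothetical markup above."""

    def __init__(self):
        super().__init__()
        self.field = None   # which field the parser is currently inside, if any
        self.rows = []      # one dict per product found

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.rows.append({})
        elif tag == "span" and css_class in ("name", "price"):
            self.field = css_class

    def handle_data(self, data):
        if self.field and self.rows:
            self.rows[-1][self.field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self.field = None

def scrape(page):
    """Strip the formatting and restore rows of (product, price)."""
    parser = ProductScraper()
    parser.feed(page)
    return [(row["name"], float(row["price"].lstrip("$"))) for row in parser.rows]

catalog = scrape(PAGE)
print(catalog)
```

Notice how much of the code depends on the page's formatting choices. If the retailer renames a class or changes the currency format, the scraper silently breaks, whereas an API would hand over the same rows directly.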

Whether the retailers condone or condemn web scraping, there is still no reason to use it.

The emergence of AI agents brings this touchy subject to the forefront. The only way shopping bots can function is if they are allowed to browse around websites, collecting data. If Amazon's lawsuit succeeds, it kills not only Perplexity's bot, but also all others.

Another test of self-sufficiency

Kaiser Fung — Tue, 10 Mar 2026 08:09:44 -0400

I came across this infographic by the Swiss paper, Neue Zürcher Zeitung, on my LinkedIn feed (thanks, Markus Ikehata).

The piece is done in German, a language I don't speak. So reading this data visualization is like applying a "self-sufficiency test". That's the test I use to determine how much work the visual elements of a chart are doing to convey insights (as opposed to text and numbers).

I'll now document what I've learned from reading just the visual elements of the infographic. Feel free to correct any errors in the comments below.

The first component is a semi-circle chart, that is quite canonical when it comes to representing parliaments.

I can count seven major parties represented by what I'd presume are their typical party colors. The deep orange party (SP) has the plurality although a coalition of possibly three parties is needed to claim a majority.

The second chart is a side-by-side bar chart.

This chart uses the plurality party (deep orange) as the anchor, and is designed to display the importance of the other parties relative to it. The second largest party has a bit more than half as many seats as the first party.

Curiously, this chart uses nine colors while the semi-circle chart has seven. On further inspection, the first seven bars correspond directly to the seven parties shown in the semi-circle. EVP (yellow) is omitted from the first chart: a mystery. The gray bar shows "others," as I came to realize below. (I also confirmed that "andere" is German for "others".)

Now, the third chart is a line chart.

This chart shows a trend from 1970 to today. It has eight colors. This is how I figured out that the omission of the gray bar is different from that of the yellow bar.

I'm scratching my head, trying to reconcile the line chart with the semi-circle. This chart shows the plurality party (orange) as mostly hovering near the zero level, with the two green lines clearly above it. I can see that all lines start at zero in 1970, thus, it displays an index relative to that year. I gather that the deep orange party (SP) has always been strong, and has maintained its number of seats throughout the decades. Meanwhile, the parties represented by the green lines (SVP, Grüne) have been gaining seats in the recent past.

Something happened in 2002 that merited a footnote.

Finally, on the right side, we have two columns of maps. (I have altered the shape of the grid below to save space.)

This is a small-multiples presentation of the same eight parties, without the gray ("others") category.

They show the relative strength of each party across nine regions of Switzerland. Be attentive to the scales. What's the question answered by this chart? It's the geographical distribution of the strength of each party. It's best to interpret each map as a separate entity.

We shouldn't fixate on one region, and compare the shades of color, to understand the relative strengths of each party in a given region. That's because every map has its own color scale, adapted to the range of data in each map. For example, in the region indicated by the red arrow below, EVP has maximum strength (value of 4.9 according to the legend) while FDP is weak. Nevertheless, "weak" by FDP standard is still significantly more than 4.9.
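Here is a small sketch of why per-map scales block cross-map comparisons. I'm assuming each map stretches its colors linearly from its own minimum to its own maximum, which I can't confirm from the graphic. Only the EVP maximum of 4.9 comes from the legend; every other number is made up.

```python
def shade(value, map_values):
    """Position of a value on a map's own color scale: 0 = lightest, 1 = darkest.

    Assumes a linear min-to-max scale per map (an assumption, not confirmed).
    """
    low, high = min(map_values), max(map_values)
    return (value - low) / (high - low)

# Hypothetical regional values; only the 4.9 is taken from the legend.
evp = [0.5, 1.2, 4.9]
fdp = [9.0, 15.0, 22.0]

# Suppose the arrowed region is where EVP peaks and FDP bottoms out.
print(shade(4.9, evp))  # darkest shade on the EVP map
print(shade(9.0, fdp))  # lightest shade on the FDP map
```

The EVP map shows its darkest shade and the FDP map its lightest, yet the FDP value is the larger of the two. Shades are only comparable within a map.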

In conclusion, this NZZ effort passes the self-sufficiency test comfortably. Even though I haven't read any of the text, I'm still able to learn a lot about the Swiss elections. The text assists the reader but the visual elements are self-sufficient.

(If you see any misinterpretation, please make a comment below. I hope I don't have to overturn my own conclusion 😃 )

Get your automated Junk Charts clone

Kaiser Fung — Wed, 04 Mar 2026 15:05:20 -0500

Prof. Hill at Samford University posted several class projects to LinkedIn, which are more than amusing. The students made clones of Junk Charts critics, using the Junk Charts Trifecta Checkup framework.

Professor Chartworth (link), by Mary Hodges, is cheeky and irreverent. I threw in the cheese chart that recently showed up on the blog.

The AI (it's Claude behind the scenes) gave this answer:

📋 1. WHAT IS THE QUESTION?!

*adjusts glasses frantically* Oh, OH! Finally, FINALLY a chart that actually answers a clear question! Who are the top cheese producers in the EU? It's right there in the title! *wipes tear* I'm... I'm not used to this level of basic competence. Is this what normalcy feels like? I need to sit down.

📊 2. WHAT DOES THE DATA SAY?!

*nervous laughter* So Germany makes 2.4 million tonnes and gets 23%, France makes 1.9 million and gets 18%, and Italy makes 1.4 million and gets 13%... BY TUFTE'S GHOST, THESE THREE COUNTRIES ONLY ACCOUNT FOR 54% OF EU PRODUCTION! *screams internally* WHERE ARE THE OTHER 46%?! The Netherlands is probably sobbing in a corner with their Gouda, completely excluded from this cheese triumvirate! This is like showing a podium but forgetting to mention there's an entire stadium of competitors we're ignoring!

🎨 3. THE VISUAL CRIMES!

WHO APPROVED THESE PROPORTIONALLY INCORRECT CHEESE WHEELS?! *flips table* The German cheese wheel is MAYBE twice the visual area of Italy's, but it should be 1.7x based on the actual tonnage! In my 47 sleepless nights of measuring cheese illustrations with digital calipers, I've never seen such a flagrant violation of the sacred principle of proportional representation! And don't get me STARTED on how they're all the same shade of yellow-orange like some kind of processed American singles fever dream—where's the visual distinction?! It's like they asked an AI to draw 'generic cheese' and called it a day! *collapses dramatically* The only thing saving this from complete disaster is that at least—AT LEAST—they labeled the actual numbers clearly. Small mercies in this dairy-based nightmare.

Not bad, eh? (Compare to my writeup here.)

Professor Chartworth does a great job maintaining a certain style. The content is better than I could have expected. Call me impressed.

The framework can also be used seriously, as demonstrated by Larkin Camp's project (link). I test-ran it using this New York Times production:

This no-nonsense critic implements my "type" classification, which pinpoints the directions of improvement. It also provides a type summary, like this:

Critique Type: Type V

The visual design needs improvement, but question and data are good.

This visualization addresses a clear question with appropriate data but suffers from poor visual design choices. The beeswarm plot prioritizes aesthetics over functionality, making it nearly impossible to identify specific countries or understand patterns without extensive interaction. The design would benefit from a simpler, more readable format that allows viewers to quickly grasp both individual country performance and overall trends.

It correctly recognizes the "beeswarm" design, and its weaknesses.

Prof. Hill also features Hudson Hyneman's application. Check it out here.

Numbersense in sports commentary

Kaiser Fung — Mon, 02 Mar 2026 09:03:37 -0500

Even though data and analytics are part and parcel of modern sports, it's still jarring to hear sports broadcasters invoke common statistical fallacies.

During an overtime period in the recent Champions League match between Italy's Juventus and Turkey's Galatasaray, one commentator attacked the Turkish team's strategy (at around the 97-minute mark, when neither side had yet scored in overtime):

They [Galatasaray] showed you absolutely the way to not go about protecting a three-goal cushion tonight. From the very start, they never really played enough football. They were more content with trying to stop the game, break the game up, slow the game down.

The background: the two teams were taking part in a two-match playoff for a spot in the Round of 16; in the first match, played the week before, Galatasaray seized a 3-goal lead, which in football terms, is considered a massive advantage; and yet, on home soil, Juventus netted three goals in regular time, leveling the aggregate score (all of that despite playing ten vs eleven since the 49th minute).

According to this broadcaster, the outcome proved the Turkish side's strategy wrong. Instead of a conservative strategy of "slowing the game down," the Turkish side should have – I don't know what his unspoken alternative strategy would have been – treated it as if they did not have a 3-0 lead? Take risks trying to pad the goal difference while leaving gaps in the defense?

The commentary reflects the classic "outcome bias" fallacy of evaluating a strategy based on the realized outcome, not on what information was available at the time of the decision.

Imagine a lottery with just two players, paying out $100,000 for bets of $100. Using the aforementioned flawed logic, the loser should not have played in the first place; simultaneously, the winner obviously made the right decision to participate. However, at the time either makes the decision, they possess the same information so either both join or neither. You can't have it both ways.

Galatasaray ultimately scored two goals in overtime to join the Round of 16. The broadcaster didn't take back what he said earlier. The ultimate outcome should have confirmed the wisdom of the original conservative strategy, no?

Notably, at the start of the broadcast, the hosts cited some damning statistics: in the history of the Champions League, at this stage of the competition, we were told that out of 49 teams that were down by three or more goals after the first match, only four managed to overcome the deficit and advance to the next round. That's a probability of 8%. (With Juventus's loss, it's four out of 50.)

It would be amusing to analyze those 50 matches, and check how many of the teams that were leading after the first match deployed conservative tactics, and how they fared, relative to those that didn't.

P.S. [3/2/26] Corrected a typo. Clarified that Galatasaray didn't win the second match but won in aggregate.

Beautiful chart to behold

Kaiser Fung — Thu, 26 Feb 2026 08:09:57 -0500

As RJ puts it, "Bellissimo"!

This 3-D graphic of population data – called a "stereogram" – by an Italian statistician from 1880 is striking in appearance. It's also a chart that requires – no, demands – one's time to dissect and devour.

RJ adds to it by turning it into a modern interactive chart, with tooltips that help clarify the multiple threads of information.

The underlying dataset is any population growth data. Here is something similar from the U.S. Census Bureau (the 1880 graphic used old Swedish data when Sweden had high birth rates and high infant mortality).

Each number in this table is a count of people of a given age in a given year. Summing down a column is adding up people of all ages in a given year, thus yielding the total population in that year. Scanning across a row shows the trend in the size of a given age cohort (adding across a row is not meaningful). In the first row of this table, we see that the number of births is declining during the 1990s.

I asked for calendar years 1990-1999, thus there are 10 columns, one for each year. There is one row per age from 0 to 100, followed by a catch-all row for anyone 101 and older, thus there are 102 rows of data.

If you think of any individual, this person does not ever stay in a cell. Each person moves diagonally down toward the right, one step at a time, as years pass – until the year of death, at which point that individual's line ends, contributing to a drop in the count at the next step. Every person must start on the first row.

For most people shown above, we only see part of their life line. It's truncated on the left (because they were born before 1990), and it's truncated on the right (if they died after 1999). In statistics, we call this left- and right-censoring.
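The three ways of reading the table can be sketched in a few lines of code. The tiny table below is invented (three ages, three years); it is not the Census Bureau data, and the function names are mine.

```python
# Rows are ages 0, 1, 2; columns are calendar years. All counts are made up.
years = [1990, 1991, 1992]
table = [
    [100, 98, 95],  # age 0
    [97, 99, 97],   # age 1
    [96, 96, 98],   # age 2
]

def total_population(table, col):
    """Sum down a column: people of all ages in one year."""
    return sum(row[col] for row in table)

def age_trend(table, age):
    """Scan across a row: the size of one age group over time."""
    return table[age]

def cohort_path(table, start_age, start_col):
    """Walk diagonally down and to the right: one birth cohort as it ages."""
    path = []
    age, col = start_age, start_col
    while age < len(table) and col < len(table[0]):
        path.append(table[age][col])
        age, col = age + 1, col + 1
    return path

print(total_population(table, 0))  # everyone alive in 1990
print(age_trend(table, 0))         # newborns by year
print(cohort_path(table, 0, 0))    # the 1990 birth cohort, seen in full
print(cohort_path(table, 1, 0))    # born in 1989: cut off on the left
```

The last path starts at age 1 because the cohort's first year falls outside the table, which is the truncation described above.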

This is a case in which the graphic is quite a bit more involved than the original data – the dataset is deceptively simple. RJ also cites a contemporary critic who properly pointed out that the Italian has turned a 2-D dataset into a 3-D object.

In unpacking the 3-D graphic, RJ offered this helpful view:

Like the table above, the years are laid out horizontally while the age groups (in groups of five years) are shown "vertically" (i.e., into the screen). The counts are the added third dimension, which represents the lift "up" from the base.

The red lines trace the counts by age in a given year. At the top, we have the total number of newborns, then the counts cascade down the cliff, eventually flattening out somewhat. This snapshot view is more familiarly presented in a population pyramid:

If we ignore the gender split in the pyramid, and look at the total bar lengths, then these bar lengths map to the heights of a single red line in the stereogram. (Percent vs. count doesn't matter here since we fix the year.)

Back on the stereogram, the dark gray lines that run horizontally trace the change in the size of an age group over time. It's a cross-sectional, longitudinal view. Births were scaling rapidly during that period in Sweden but the number who lived past 55 years had not grown much.

Such trends are typically shown on line charts. Imagine collapsing the age-group axis, plotting the count against year, and one line per age group.

The light gray lines on the stereogram are effectively gridlines.

Perozzo, the Italian scientist who made this stereogram, was even more ambitious. He also put in blue lines to trace individuals as they age over time.

If an individual lived to 100 years old, then that person born in 1750 would follow the blue line down the hill as years passed all the way. The "drop" in height from one point to the next point represented those who left the cohort. In the 1700s, many Swedish kids didn't live to age 10, and then after age 45, the blue line plunged again.

The reader has to develop a feel for the rise and fall of the rolling terrain. Perozzo recognized this, and he applied shading to help out. I'm quite impressed by this little feature:

RJ mimicked this effect using a formula-driven approach.

It's worthwhile to read his entire post as he gets into even more details.

It appears that many readers find the stereogram too much of a good thing. It's a grand feast serving food coma.

I think it does have its place but it shouldn't stand alone. The 3-D graphic makes it clear that there are three ways of slicing the hill. Each slice can be represented as line charts in 2-D but we need the 3-D chart as a kind of legend to show where the slices are coming from.

What do you think? Let me know below.

Coffee machine math

Kaiser Fung — Tue, 24 Feb 2026 08:17:44 -0500

In a previous post, I noticed a change in the user interface of the office's coffee machine. Instead of showing default settings, the new interface shows the settings of the previous user, so every time I use it, I see something different. Is this retain-previous strategy better or worse than the common-default strategy?

Previously, I argued that the retain-previous strategy is worse. The main reason is that I don't like the assumption that coffee preferences are serially correlated for people working in the same office. If we take away the serial correlation assumption, then using the most common settings as the default makes more sense. In this post, I attempt to quantify the argument.

Let's set up the stylized problem as follows.

We only have two settings (Large and Small). We assume 70% of users want Large, and 30% want Small. If the machine uses common default, it shows Large to all users, and it would predict correctly 70% of the time. We therefore are interested in whether the retain-previous strategy can be at least 70% accurate.

The retain-previous strategy has to show something to the first-ever user. Let's assume it does the sensible thing, which is to show the most common setting, i.e. Large.

It therefore has a 70% chance of getting the first prediction correct.

If the prediction is correct, then the first user indeed prefers Large, and the machine shows a Large setting to the second user. If the second user also prefers Large, then the second prediction will also be correct. This has probability $ 70\% \times 70\% = 49\%.$

If the first prediction is wrong, then we know the second user sees a Small setting. In this case, a correct second prediction happens if the second user wants Small. The probability is $ 30\% \times 30\% = 9\%.$

Taken together, the chance that the second prediction is correct is $49\% +9\% = 58\%.$ Notice that already in the second prediction, the correct probability has dipped below 70% (the level of the common-default strategy).

If we roll this analysis over users 3, 4, 5, ..., each of the subsequent predictions will also have a 58% chance of being correct, and this is because the system only remembers one step prior. What the machine shows the third user is not affected by the user type of the first user.

We can check this by looking at the possible sequences up to User 3.

LLL - first user is accurately predicted, second user is shown Large and is accurately predicted, the third user is shown Large also, and is accurately predicted. This has a probability of $70\% \times 70\% \times 70\%.$

LLS - same as above except that the third user is incorrectly predicted, so won't count towards our correct probability.

LSL - first user is accurately predicted, second user is shown Large but prefers Small, third user is shown Small but prefers Large. Two wrong predictions in a row, and won't count towards our correct probability.

LSS - same as above, except that the third user is accurately predicted. This counts toward the correct probability of the third prediction, $70\% \times 30\% \times 30\%.$

SLL - first user is shown Large but prefers Small, the second user is shown Small but prefers Large, the third user is shown Large correctly. The contribution towards the correct probability is $30\% \times 70\% \times 70\%.$

SLS - same as above, except the third prediction errs.

SSL - first user is shown Large but prefers Small, the second user prefers Small and is accurately predicted, the third user is shown Small but prefers Large. No contribution to correct probability either.

SSS - first user is shown Large but prefers Small, the second and third users are both shown Small and they both prefer Small. The addition to the correct probability of the third prediction is $30\% \times 30\% \times 30\%.$

Now, group LLL and SLL. The sum $70\% \times 70\% \times 70\% + 30\% \times 70\% \times 70\% = 70\% \times 70\%.$ There are two branches out of the first user but ultimately they converge to the same product. Similarly, group LSS and SSS. These two branches converge to the same product, $30\% \times 30\%.$

Thus, the correct probability of the third prediction is $70\% \times 70\% + 30\% \times 30\% = 49\% + 9\% = 58\%.$ Look familiar?
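If you'd rather let the computer do the bookkeeping, the enumeration above can be sketched as follows. The function name and its arguments are my own; the setup (70% Large, independent users, Large shown to the first user) is the stylized problem from this post.

```python
from itertools import product

def prob_correct_at(n, p=0.7, first_shown="L"):
    """Chance that the n-th user is shown their preferred setting under retain-previous.

    Enumerates every sequence of n user types (L or S), assuming users are independent.
    """
    pref = {"L": p, "S": 1 - p}
    total = 0.0
    for seq in product("LS", repeat=n):
        # The n-th user sees whatever the previous user chose (or the initial default).
        shown = first_shown if n == 1 else seq[-2]
        if seq[-1] == shown:
            prob = 1.0
            for user in seq:
                prob *= pref[user]
            total += prob
    return total

for n in (1, 2, 3, 4, 5):
    print(n, round(prob_correct_at(n), 4))
```

For n = 3, the four sequences that pass the check are LLL, SLL, LSS and SSS, the same ones grouped above.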

Under this retain-previous strategy, the first prediction is correct 70% of the time, then all subsequent ones are correct 58% of the time. Thus, the overall accuracy must be below 70%, under the level of the common-default strategy.
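A quick simulation offers a sanity check on the overall accuracy. This is a rough sketch under the same assumptions (independent users, 70% wanting Large, Large shown to the first user); the function name, queue length and seed are arbitrary choices of mine.

```python
import random

def simulate(strategy, n_users=200_000, p=0.7, seed=1):
    """Share of users who are shown their preferred setting.

    strategy is either "common-default" or "retain-previous".
    """
    rng = random.Random(seed)
    shown = "L"  # both strategies show Large to the first user
    correct = 0
    for _ in range(n_users):
        want = "L" if rng.random() < p else "S"
        correct += (want == shown)
        if strategy == "retain-previous":
            shown = want  # the next user sees this user's choice
    return correct / n_users

print("common-default :", simulate("common-default"))
print("retain-previous:", simulate("retain-previous"))
```

With a long queue, the first user's 70% barely registers, so the retain-previous accuracy should settle near 58% while common-default stays near 70%.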

For those who want to see more equations: the formula for the correct probability is $ p^{2} + (1-p)^{2}, $ where p is the probability of the majority user type. In my example, $ p = 70\%$; substituting that in, the formula gives 58% as I computed above.

As p ranges from 0 to 1, the probability curve is a "bowl" with minimum at $ p = 50\% $, and the value increases as p moves toward 0% or 100%. In other words, the more concentrated the preferences are, the more likely the retain-previous strategy is to make correct predictions. In a sense, the problem becomes easier because most users want the same settings.

The correct probability of the common-default strategy is the proportion of the majority user type, written as $ \max(p, 1-p) $. This curve also has a minimum at $ p = 50\% $, and rises toward 0% or 100%. Instead of a quadratic curve, it is a "V" made up of two straight line segments.

Because the V never falls below the bowl, the common-default strategy "dominates" the retain-previous strategy. There are three points where the two strategies meet: $ p = 0\% \text{ or } 100\% $, meaning that everybody picks the same settings; and $ p = 50\% $.
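The comparison of the two curves is easy to check numerically. In this sketch, the function names and the grid of 101 values of p are my own choices; the two formulas are the ones given above.

```python
def retain_previous(p):
    """Long-run correct probability of retain-previous: p^2 + (1-p)^2."""
    return p ** 2 + (1 - p) ** 2

def common_default(p):
    """Correct probability of common-default: the share of the majority type."""
    return max(p, 1 - p)

grid = [i / 100 for i in range(101)]  # p = 0%, 1%, ..., 100%
gaps = [common_default(p) - retain_previous(p) for p in grid]

# Values of p at which the two strategies are tied.
touching = [p for p, gap in zip(grid, gaps) if abs(gap) < 1e-12]

print("smallest gap:", min(gaps))
print("tied at p =", touching)
```

The gap is never negative on this grid, and it vanishes only at 0%, 50% and 100%. A little algebra gives the same answer: for $ p \geq 50\% $, the gap equals $ (1-p)(2p-1) $, which cannot be negative.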

In conclusion, while the retain-previous strategy improves in situations where the preferred settings are more concentrated, its predictive accuracy never exceeds that of the common-default strategy.

Me on the coffee line

Kaiser Fung — Mon, 23 Feb 2026 08:42:49 -0500

At the office, they swapped out the automated coffee machines. These are the ones in which you select "Espresso", then you select regular or decaf, then you choose the size of the cup, then you press the button, and in a few seconds, the espresso comes streaming out.

I noticed that the new UI operates differently from the old one. The current interface retains the previous settings while on the old machine, the settings return to the default ("regular" and "medium") after each use.

UI of the old coffee machine showing the same default settings each time

The interface design reflects a choice by the developer, which embeds the developer's assumption about user behavior.

The current developer assumes a kind of serial correlation, that the next user is likely to require the same settings as the previous user. We can frame this problem as predicting the setting requested by the next user. We want to maximize the total number of correct predictions in a queue of users. The retain-previous strategy sounds reasonable.

The developer of the other machine adopts a different strategy: show the same default settings to everyone. Presumably, the default settings are the most commonly requested settings. Both strategies are easy to understand, and this latter one, even simpler.

The common-default strategy discards the serial nature of the problem, as it treats every user identically regardless of their position in the queue. If, say, the most common settings are desired by 40% of users, then this strategy will predict correctly 40% of the time. Its effectiveness is a function of how common the most common is.

The retain-previous strategy is more complicated to analyze. In these coffee machines, there are three settings of caffeination (regular, half decaf, decaf); and three settings of size; thus, there are nine possible types of users. If we have historical data, we can take adjacent pairs of users and count what proportion are same-same pairs.
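With a log of past orders in hand, both strategies can be scored in a few lines. Below is a minimal sketch; the five-order history is made up, and the size labels "small" and "large" are my guesses at what the machine calls them.

```python
from collections import Counter

def same_pair_rate(queue):
    """Share of adjacent user pairs with identical settings (retain-previous hit rate)."""
    pairs = list(zip(queue, queue[1:]))
    return sum(a == b for a, b in pairs) / len(pairs)

def most_common_share(queue):
    """Share of users wanting the single most common setting (common-default hit rate)."""
    return Counter(queue).most_common(1)[0][1] / len(queue)

# Hypothetical order history: (caffeination, size) for each user, in queue order.
history = [
    ("regular", "medium"),
    ("regular", "medium"),
    ("decaf", "small"),
    ("regular", "large"),
    ("regular", "large"),
]

print(same_pair_rate(history))
print(most_common_share(history))
```

A toy history of five orders proves nothing, of course; the point is only that both numbers can be read straight off the data.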

Without data, we may call upon some standard probability model for simulating a queue of users. This starts getting a little complicated. Any standard model assumes independence between samples, which should preclude serial dependence! That said, a standard model is obviously capable of generating adjacent pairs that have the same settings, i.e. sequential users who select the same settings.

Flipping a coin continuously will result in "runs" even though the coin is perfectly fair. The probability that the next flip is a head given the previous flip is a head is $ \frac{1}{2} $; ditto tail given tail. So the probability of seeing a run of length 2 is $ \frac{1}{2} \times \frac{1}{2} + \frac{1}{2} \times \frac{1}{2} = \frac{1}{2} $. But since the coin is fair and flips are independent, the serial correlation is zero!
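To check both claims exactly, one can enumerate the four equally likely outcomes of two consecutive flips. The coding of heads as 1 and tails as 0 is just a convenience.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two consecutive fair flips (1 = head, 0 = tail).
outcomes = list(product([0, 1], repeat=2))
p = Fraction(1, len(outcomes))

# Probability that the two flips match, i.e. a run of length 2.
p_run = sum(p for a, b in outcomes if a == b)

# Covariance between the first and the second flip.
mean_first = sum(p * a for a, b in outcomes)
mean_second = sum(p * b for a, b in outcomes)
covariance = sum(p * (a - mean_first) * (b - mean_second) for a, b in outcomes)

print(p_run)
print(covariance)
```

Matching pairs occur half the time, yet the covariance (and hence the correlation) between adjacent flips is exactly zero.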

Let's get back on track. I don't like the serial correlation assumption anyway. Does knowing about the prior user really provide information about the next user's requirements? It might matter, for example, if the coffee machine is in a family home; but for a shared office, I don't think so.

If the developer of the new coffee machine assumes serial independence, then the probability of the next user's settings is the same whether or not we condition on the previous user's settings. So, the prediction is driven by the overall preferences amongst the nine possible settings. We are back to the common-default strategy.

This is sufficient to argue that the retain-previous strategy is suboptimal relative to the common-default strategy. The only way it may be better is if user preferences are correlated serially in a material way.
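A minimal sketch of the comparison, assuming users are independent and using made-up preference shares: the common-default strategy is right with probability equal to the largest share, while the retain-previous strategy is right only when two consecutive users happen to agree, which has probability equal to the sum of squared shares.

```python
def common_default_accuracy(shares):
    """Always show the most popular settings: correct with probability max(p)."""
    return max(shares)

def retain_previous_accuracy(shares):
    """Show the previous user's settings. Under independence, two consecutive
    users both pick setting i with probability p_i * p_i."""
    return sum(p * p for p in shares)

# Hypothetical preference shares over the possible settings (they sum to 1).
shares = [0.40, 0.15, 0.10, 0.10, 0.05, 0.05, 0.05, 0.05, 0.05]

print(round(common_default_accuracy(shares), 4))
print(round(retain_previous_accuracy(shares), 4))
```

The sum of squared shares can never exceed the largest share, because each squared share is at most the largest share times that share, and the shares add up to one. The two strategies tie only when every user wants the same settings.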

If this is not convincing, see my future post for a more quantitative argument.

The retain-previous interface is annoying me in another way. It adds variety when none is needed. There is a certain "comfort" that comes with seeing the same settings each time, even if they aren't my preferred settings. With retain-previous, I have to train myself to ignore the UI and just put in my requirements. If I accidentally press Start without looking, I'm not sure what I'm getting.

As usual, no one's doping

Kaiser Fung — Tue, 17 Feb 2026 09:09:58 -0500

I like to watch the Olympics as much as anyone. One thing's for sure though: we're going to learn from this Milano-Cortina Winter Olympics 2026 that doping is as rare as the blue moon, just like we learned from other Olympics.

We will also learn that any failed test is due to a black-swan event that befell the unfortunate athlete, which in fact, partly explains the first statement.

Long-time reader Antonio R. points me to the first doping finding at these Olympics. Italian biathlete Rebecca Passler failed a test prior to the start of the Olympics, and was immediately suspended; now, according to this Italian report (link), her suspension has been lifted. (Perhaps only temporarily as there are other agencies that will eventually review the case.)

It's because she accidentally ingested the banned substance by eating ... Nutella. Yes, the famous Italian spread made of chocolate and hazelnuts. This surely is a new one. It's also a head-scratcher. If Nutella contains traces of the banned substance, then surely all Italian athletes are aware, no? Reading further, I learned that it's not Nutella but contaminated Nutella. The spoon used to share Nutella among the members is to blame.

How did the banned substance get on the spoon? It's from the athlete's sick mother's cancer medication. According to the reporter, past cases suggest that this still isn't enough to avoid punishment because it's the athlete's responsibility to take all possible precautions.

So, we also get some family drama. Passler's mother didn't want to affect her preparation for these Olympics, so she has hidden her diagnosis. In addition, she hid the cancer medication in some secret cabinet. What is likely true is that the mother doesn't know that the medication contains a substance that is banned by anti-doping agencies.

I'm not here to condemn or condone Passler. While the story requires an unlikely sequence of unlikely events, it is not impossible. I don't know of a foolproof way to know if she is a victim of a black-swan event or not – unless you're in the inner circle of her staff.

I wrote extensively about anti-doping tests in Chapter 4 of Numbers Rule Your World, in the years before Lance Armstrong confessed. My analysis leads me to believe that there are many more false negatives than false positives. Armstrong, you might recall, repeatedly pointed to years of negative test findings to push back on doping rumors.

My book page

Also, the term "false positive" is imprecise. To believe Passler's story requires us to accept the initial test result as correct. Her team is, in fact, endorsing the test finding as a true positive!

In the book, I differentiate between a lab false positive, and a real-world false positive. In Passler's case (as in the case of every athlete who happened to have eaten something that happened to contain trace amounts of some banned substance), the lab test is presumed correct; what these athletes are disputing is the cause of the positive result.

There is a key computation in Chapter 4 of Numbers Rule Your World.

The proportion of athletes caught doping is bounded above by the proportion of tests coming back positive in any Olympics. (I'm simplifying a bit by assuming one test per athlete.) If 100 athletes are tested, and 1 tested positive, there can be at most one true positive. If there is more than 1 doper amongst the 100 athletes, then surely the testing program has a false-negative problem. If there are 5 dopers, at least four of them will have negative findings (the false-negative rate is a staggering 80% or more!!). If there are 10 dopers, at least nine will be cleared.
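The arithmetic can be written as a small function. It is a sketch under the same simplification of one test per athlete; the function name is mine, not something from the book.

```python
def min_false_negative_rate(dopers, positives):
    """Lower bound on the false-negative rate, assuming one test per athlete.

    At most `positives` dopers can have been caught, so at least
    dopers - positives of them must have tested negative.
    """
    if dopers <= 0:
        raise ValueError("need at least one doper")
    missed = max(dopers - positives, 0)
    return missed / dopers

for dopers in (1, 5, 10):
    print(dopers, min_false_negative_rate(dopers, positives=1))
```

With a single positive test, the bound is 0% for one doper, 80% for five and 90% for ten.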

So, pay attention to the number of positive tests. If this is like other Olympics, the number will be very small. That can be interpreted as very few athletes are doping, or most dopers are evading detection.

Data hunting on the radar (chart)

Kaiser Fung — Mon, 16 Feb 2026 08:14:00 -0500

Today, I take another look at the simple radar chart created for the previous post in this series.

The most troubling part of this chart form is that it makes us look at things that distort the data, namely the shaded areas, and/or the perimeters.

The underlying data of the four students:

All four students achieved two High and two Low grades in the four subjects, thus they have the same GPA (assuming each subject has the same weight).

The radar charts divide these students into two sub-groups (top two rows, bottom two rows) if we go by the shaded areas. The area of Adam is the same as that of Betty (by symmetry). Each area is divided into four equal parts, each of which is a right-angled triangle, so the area is 4 x (1/2 x 2 x 1) = 4. (I'm setting the outer radius to be 2 and the inner radius to be 1.) Also by symmetry, the area of Chad is the same as for Daisy. The area of Chad is 13% larger than the area of Adam. We compute Chad's area also by considering four right-angled triangles, so the area is (1/2 x 2 x 1) + (1/2 x 2 x 2) + (1/2 x 1 x 1) + (1/2 x 2 x 1) = 1/2 x (2+4+1+2) = 1/2 x (9) = 4.5.
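The same areas can be computed for any radar polygon by summing the triangles between adjacent spokes. In this sketch, the order of the spokes is an assumption: Adam's Highs alternate with Lows around the circle, while Chad's two Highs sit next to each other.

```python
import math

def radar_area(radii):
    """Area enclosed by connecting points on equally spaced spokes."""
    n = len(radii)
    angle = 2 * math.pi / n  # angle between adjacent spokes
    return sum(
        0.5 * radii[i] * radii[(i + 1) % n] * math.sin(angle)
        for i in range(n)
    )

HIGH, LOW = 2, 1                # outer and inner radius, as in the text
adam = [HIGH, LOW, HIGH, LOW]   # assumed spoke order: Highs alternate with Lows
chad = [HIGH, HIGH, LOW, LOW]   # assumed spoke order: the two Highs are adjacent

print(radar_area(adam), radar_area(chad))
print(radar_area(chad) / radar_area(adam))
```

The ratio is 1.125, which is the roughly 13% gap quoted above.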

The only difference between these two sub-groups is in which two subjects they achieved the two High grades. To the extent that we claim that the difference in areas represents "data", the radar chart must have assigned differential weights to the four subjects, in an implicit manner, which defies our understanding.

What this really is saying is the area shown on the radar chart is meaningless.

The perimeter on the radar chart is also meaningless.

To estimate the perimeters, we use Pythagoras's Theorem (the square of the hypotenuse is the sum of the squares of the two other sides).

For Adam, the perimeter is 4 x sqrt(4+1) = 4 x sqrt(5) = 8.9. Betty's perimeter is the same as Adam's. For Chad, the perimeter is 2 x sqrt(4+1) + sqrt(4+4) + sqrt(1+1) = 2 x sqrt(5) + 3 x sqrt(2) = 8.7. Daisy's perimeter is the same as Chad's.

Thus, Chad or Daisy's perimeter is about 3% smaller than Adam or Betty's perimeter. Again, the only explanation for a difference in perimeters is that it encodes a difference in the data, which in this case is the choice of subjects in which the students achieved their High grades.
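Here is a minimal sketch of the perimeter calculation, using the law of cosines so that it works for any number of equally spaced spokes (with four spokes it reduces to Pythagoras). As before, the spoke order for each student is an assumption.

```python
import math

def radar_perimeter(radii):
    """Length of the closed line joining points on equally spaced spokes."""
    n = len(radii)
    angle = 2 * math.pi / n
    total = 0.0
    for i in range(n):
        a, b = radii[i], radii[(i + 1) % n]
        # Law of cosines; at 90 degrees this reduces to Pythagoras.
        total += math.sqrt(a * a + b * b - 2 * a * b * math.cos(angle))
    return total

HIGH, LOW = 2, 1
adam = [HIGH, LOW, HIGH, LOW]   # assumed spoke order: Highs alternate with Lows
chad = [HIGH, HIGH, LOW, LOW]   # assumed spoke order: the two Highs are adjacent

shortfall = 1 - radar_perimeter(chad) / radar_perimeter(adam)
print(round(radar_perimeter(adam), 2), round(radar_perimeter(chad), 2))
print(round(100 * shortfall, 1))
```

Working from unrounded perimeters, the shortfall comes to about 2.6%.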

Finally, not only do both area and perimeter distort the underlying data, they stretch in opposite directions!

The radar chart doesn't really encode data in the area or the perimeter. The visual form makes us think that. The data are really to be found in the spokes of the chart; here is a chart from the first post of this series.

See the previous posts in this series (1, 2).

Alternatives to radar chart 1

Kaiser Fung — Wed, 11 Feb 2026 13:01:43 -0500

The radar chart is frequently used in the following setting: you are comparing some objects across K dimensions. Embedded in this data are K+1 rankings, which include K rankings, one for each dimension, plus an aggregate ranking.

My last post explains why I don't like the radar chart. In this post, I'll explain why the radar chart conveys the information worse than even a data table.

Here is a very simple dataset I'll be using:

Four students are rated on four subjects. Each rating is either High or Low. Each student earns two Highs and two Lows. Chad and Daisy (last two rows) are strong in Math/Science and Language/Arts respectively. One is the mirror image of the other. Adam and Betty are also mirror images of one another.

On an unadorned data table, the reader can already find various insights. Who's the best at Math? Adam and Chad. Who's good at Math & Science? Chad. What subjects is Betty performing well at? Science and Arts. What subjects does Daisy need help with? Math and Science. Is Betty or Daisy better at Arts? Daisy.

Now, try finding answers to those types of questions from this radar chart:

OK, the overlapping areas are distracting and annoying. Try this line version:

Not much better. In fact, this quick exploration reveals yet another reason to unplug the radar. It doesn't like categorical data, or any data with a good number of equal values. Equal data values cause lines (or perimeters of areas) to over-print.

But it also doesn't handle continuous data. Imagine we add jitter to the equal values so they are minutely separated. This turns the overlapping lines into separate lines with different angles, producing even more criss-crossing!

Instead, let's do a small-multiples arrangement:

At least the data become visible. However, this arrangement makes it harder to answer many of the questions we care about. Who's the best at Math? We have to look at all four charts. Is Betty or Daisy better at Arts? We have to compare two charts. What help does Daisy need? This one can be read from a single chart but it's still not as easy as the data table.

The data table can be "enhanced" by adding color, styling, and symbols.

Our eyes are really great at sorting out two categories. Just a little color and bolding is sufficient.

Symbols are also useful:

With symbols, I'd better add a legend.

Even better is if I vary both symbols and colors:

In future posts, I'll explore other options.

Snow math

Kaiser Fung — Sat, 07 Feb 2026 09:49:36 -0500

Glad to see some reporter is on the case about snow ploughing in NYC after the snowstorm a couple of weeks ago. (Despite what you might have read elsewhere, there has so far been just one day of snow, followed by icy cold conditions.)

The reporter said the city issued 2,800 or so tickets to home and business owners who did not fulfil their civic duties of "clearing a path at least four-feet wide with clear access to crosswalks".

In my neighborhood, I found something counter-intuitive. I ventured outside that night after the snowing had mostly stopped. Most buildings made an effort to deal with the snow, and so it was quite easy to walk around. Notable exceptions were in front of Taco Bell and Starbucks. (Also, outside a Korean BBQ restaurant that is part of a national chain and one of the most popular businesses in the hood.) This creates the strange situation in which I could walk freely outside the little mom-and-pop stores that are barely surviving but must sink my feet in inches of snow in front of these large storefronts (that were ironically open for business).

That shocked me because you'd think that the large corporations should be the least likely offenders. I assumed the fines must not be large enough, or they must have found some loophole to avoid them. According to this article, the fine is $150 for the first offence, and up to $350 for subsequent offences. (I assume the money goes to funding the government workers who plough the snow instead.)

Those fines (if enforced) are clearly too low at current hourly wages. The businesses probably would have had to pay more than $150 to hire workers. So it comes down to whether the business owner wants to be a good citizen. I guess this is where the small businesses have an edge.

Four reasons to unplug radar charts

Kaiser Fung — Thu, 05 Feb 2026 09:19:29 -0500

I remain unconvinced by radar charts. Here is another example that popped up in a recent issue of ORMS Today (link).

Some entity is being rated on five dimensions. These dimensions are laid out as five spokes from the center, equally spaced. The ratings appear as five dots on these spokes. The five dots are connected cyclically to form a sequence, using straight line segments.

This use of the radar chart is very popular on business dashboards. The purpose is to rate entities along multiple dimensions. When multiple entities are rated, they each appear as a cyclical sequence.

In this post, I present four reasons why you should stay away from the radar chart.

The radar chart foregrounds fake connections.

In the example above, the reader's attention is focused on the thick blue line. Perhaps its jagged shape carries the key to unlocking the insights in the dataset. (Perhaps not.) In other examples, the designer shades the area enclosed by the connected line segments. The area has no more relevance than the shape.

The data are actually encoded in the distances between the center and the dots on the spokes. Ironically, the radial lines are backgrounded in favor of the envelope.

If someone wants to understand the chart, they have to backfill these red lines while ignoring the blue envelope.

You might think of salvaging the radar chart by plotting those radial lines instead of the envelope. There is a reason why this isn't the common practice: the radial design of the radar chart is severely limited in scope. Imagine trying to compare two entities on those five dimensions. The radial axes overlap, messing up the comparison.

The radar chart conveys fake neutrality.

Like all chart forms, the radar chart imposes a set of strict assumptions on the designer. Many of these assumptions are impractical, even harmful. One such restriction is the equal spacing between the spokes, which implies equal importance between the dimensions.

Visualizing hierarchies

Kaiser Fung — Mon, 02 Feb 2026 09:56:59 -0500

Long-time reader Chris P. sent me to an Instagram user (link) who analyzed the travel schedules of all the NCAA men's volleyball teams.

The analysis is deceptively simple if we ignore data collection – the outputs are distances travelled for each team, grouped by conference. The inputs? At least, schedules from each conference, including home/away indicators; maps; campus maps. The analyst describes how much work it is to put together this "simple" dataset. The process also includes the elephant in the room – assumptions!

For example, to be accurate, one wants to know where each team's home gym is but that venue is not always obvious to an outsider. So, in some cases, the analyst resorts to using the coordinates of the school's campus.

Unpacking assumptions is like pulling apart an onion. In that latter scenario, how does one determine the coordinates of any campus? Many schools are not one contiguous space, and even if it's one connected space, it almost surely has a highly irregular shape! In the other scenario, we must make another assumption: that teams always depart from their home gym.

Enough about the data. We're here to talk about the visualization.

Here is a different chart in the series, focused on comparing schools in a particular conference:

The basis of this chart is a bar chart. Each bar has exposed tiles although they don't have a quantitative interpretation. In principle, the last digit of each data label represents the outer edge of each bar; in practice, it's plainly obvious that the bar lengths do not accurately encode the travel distances. So it's a bar chart in form but not in content.

In the NEC conference, FDU's bar should be about a quarter of the length of Saint Francis's; and about half the length of D'Youville's. But it's not.

I suppose the distances are horizontally dispersed in a way that roughly – very roughly – conveys the ranking of the data.

Is there a better way to visualize this dataset?

In re-thinking the graph, I want to retain several satisfying features of the original:

The chart form preserves a nested hierarchy in the data: everything > conference > school. It works identically at each level, thus reducing cognitive load moving from one level to another.
The tiles, colors and fonts suggest a light-hearted, playful mental state.
The data concern distances.
Something other than a standard bar chart is desired.

Here's what I came up with:

This particular chart shows the data at the conference level. The leagues are arranged around the circumference of a circle. The arrow cues readers to read clockwise from the top. The first conference encountered, the ECC, has the least miles travelled. The conference that does the second-lowest mileage is the IVA.

The distance data are encoded as edge distances on the circle counting from the top. The gaps between consecutive dots represent the differences in travel distance between adjacently ranked leagues.

Next, I added "gridlines" to help readers gauge the scale of the chart. These gridlines are the radii of the circle because the edge distance is proportional to the angle. In deciding the number of gridlines, I took a hint from the original chart, where the designer tells readers that a trip around the Earth is about 25,000 miles. The maximum traversed distance here is roughly 67,000 so I plotted 3 round trips, and three gridlines (at 0, 1, and 2 round trips).
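The encoding can be sketched as a mapping from miles to an angle measured clockwise from the top. I am assuming that one full turn of the circle spans the three plotted round trips; the constant and function names are illustrative.

```python
EARTH_TRIP_MILES = 25_000   # figure quoted from the original chart
ROUND_TRIPS_ON_CIRCLE = 3   # assumption: one full circle spans three round trips

def clockwise_angle(miles):
    """Degrees clockwise from the top of the circle for a travel distance."""
    full_circle = EARTH_TRIP_MILES * ROUND_TRIPS_ON_CIRCLE
    return 360.0 * miles / full_circle

# Gridlines at 0, 1 and 2 round trips.
gridlines = [clockwise_angle(k * EARTH_TRIP_MILES) for k in range(3)]
print(gridlines)
print(clockwise_angle(67_000))
```

Under that assumption the gridlines fall 120 degrees apart, and the largest distance lands a little short of a full turn.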

Just to show that this design meets the first requirement above, here is the chart for NEC:

Once the reader figures out how to read one of these charts, the reader has learned how to read all of them.

Last thing... moving back to the D corner of the Trifecta Checkup. What would make this analysis even more compelling is if a "Y" variable (i.e. outcome) is included. How do the varying travel distances affect the teams' performances?

How MTA is spending its money

Kaiser Fung — Thu, 29 Jan 2026 09:42:15 -0500

New York's subway and bus operator, MTA, has forever been facing budget crises. In recent years, it has had to contend with the work-from-home trend, and rampant fare evasion.

The MTA also appears eager to buy whatever vendors sell them, including snake oil and hot air.

I previously wrote about the so-called "Select Buses" (link). The MTA spent money on a system that required installing special fare machines on sidewalks, from which all passengers, including those who held weekly or monthly passes, must obtain a paper receipt prior to boarding the bus. When the bus arrives, passengers may board from front and back doors. Effectively, this sets up an honor system: MTA does not validate whether someone has paid the fare. The driver can't be bothered to check those boarding from the front door either - what's the point when anyone can get on through the back doors?

Apparently, the vendor convinced the MTA that this Select Bus system would reduce waiting times; I have no reason to doubt this claim since the driver wouldn't have to deal with the hassle of passengers paying the fare while boarding the bus! Passengers are also streaming freely onto the bus through three open doors instead of just the front door.

My previous post is highly recommended. I discovered that MTA management even purchased a "study" from a consultant in which they claimed to have found that the aforementioned system not only did not promote fare evasion but it curbed fare evasion!

Another cost-saving tactic favored in all subway systems around the world is replacing human operators with machines. We have all seen people jumping over, or crawling under, the turnstiles. No MTA staff is present to enforce fares anymore.

Recently, some vendor convinced the MTA to install "spikes" and "sleeves" on the turnstiles to "stop" fare evaders. I kid you not. Below is a "sleeve" (hat tip to New York Post for the images):

Image from NY Post

How is this stopping anyone from jumping over?

And below are the "spikes" (they are on top of the side wall):

Image from NY Post

I have rarely seen anyone "climbing" over. The usual manoeuvre is hurdling over so this is a mystery.

The same song is being played. A few months after these toothless interventions appeared, the MTA has declared victory, and will spend more money to install them everywhere!

It took the NY Post reporter one trip to two subway stations to eye-witness what the study's authors apparently couldn't see: neither the sleeves nor the spikes are stopping fare evaders (link). Or, it just takes some common sense.

The MTA actually told reporters the following with a straight face: "At stations where the equipment has already been installed, fare evasion has dropped by about 60%." (link)

Let's say this out loud: the MTA believed that those sleeves and spikes have caused a 60% drop in fare evasion.

Again: the MTA determined that those sleeves and spikes have stopped 6 out of every 10 prospective fare evaders.

This is the same MTA that told us that, by letting riders get onto Select buses from every door without validating tickets, it has curbed fare evasion below normal levels.

I can't find details as to how they conducted the study. So let's interpret the quote above.

First, they are describing only those stations with equipment. We don't know what's going on in stations without equipment. If they were to compare the two groups of stations, we'd need to know how they selected the stations for this pilot program. Are the stations with equipment similar to those without? (Probably not, unless they designed a rigorous testing program before the pilot started.)

Second, surely some of the reduction in fare evasion reflects a general trend. For example, as more companies are pushing employees back to the office, we have an influx of commuters who have well-paid jobs and are thus less likely to evade fares. Any pre-post type analysis must include factors like this.

Third, we also don't know if the equipment installation is all-or-none at each station, or what proportion of each station's turnstiles have those sleeves and spikes.

Fourth, the MTA is simultaneously rolling out many different interventions. I see warning notices, and hear warning messages. Sometimes, there are guards standing near the turnstiles (although I have never seen any guard stopping an ongoing act of fare evasion). How did the study account for these other factors?

Fifth, by claiming a "drop", they must be comparing a current measurement against some baseline. What is this baseline?

Sixth, how do they even measure fare evasion? Do they have staff counting fare evaders? Are they analyzing video footage?

Designing a proper test to measure the effect of the sleeves and spikes is an interesting project.
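One common starting point, offered here only as a sketch, is a difference-in-differences comparison: subtract the change at comparison stations from the change at equipped stations, so that a general trend does not get credited to the equipment. The evasion rates below are invented for illustration, and the approach still relies on the two groups of stations being comparable.

```python
def difference_in_differences(treated_before, treated_after,
                              control_before, control_after):
    """Change at equipped stations minus change at comparison stations."""
    return (treated_after - treated_before) - (control_after - control_before)

# Hypothetical evasion rates (share of entries), invented for illustration only.
effect = difference_in_differences(
    treated_before=0.10, treated_after=0.04,
    control_before=0.10, control_after=0.07,
)

print(round(effect, 3))
```

In this made-up case the equipped stations show a 60% relative drop, but half of the six-point fall also shows up at stations without equipment, leaving three points that could plausibly be attributed to the intervention.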

A cheesy graphic

Kaiser Fung — Wed, 28 Jan 2026 14:48:29 -0500

Jorge Camoes featured this cheesy graphic by Eurostat in a recent LinkedIn post. It's a fun graphic that brings out the unexpected, at least amongst the uneducated. Who knew Germany makes more cheese than France or Italy?

What kind of chart is it?

It's side-by-side pie charts. The twist is that the chart does not encode the raw data, neither the tonnage, nor the proportion of tonnage. Instead, the chart plots index values, with Germany set to 100%. On that scale, France is 1.9/2.4, or about 80%, and Italy is 1.4/2.4, or about 60%.

Therein lies the problem. The chunks of cheese bitten off France and Italy's rinds are roughly equal sized so I don't think they are scaled properly.

Possibly, the designer is simultaneously manipulating the size of the pies, and the bitten-off chunks?

I took my ruler out and it's neither here nor there.

The closest is if we take whole pies of all three countries. I estimated that the radius of France is about 75% that of Germany, Italy is 56% of Germany, so close enough to 80% and 60% respectively. But even this encoding is problematic because we should be encoding the data in the areas not the radii of the pies. (If we take whole pies, we have moved from pie charts to bubble charts.)

The ratio of areas is the square of the ratio of radii, about 56% and 31% respectively, which takes us further from the data.
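A quick check of the arithmetic: the index values come from the tonnage figures quoted above, and the area ratios come from squaring my ruler estimates of the radii. The variable names are mine.

```python
production = {"Germany": 2.4, "France": 1.9, "Italy": 1.4}  # figures quoted above
radius_ratio = {"France": 0.75, "Italy": 0.56}  # ruler estimates relative to Germany

# Index values with Germany set to 100%.
index = {c: v / production["Germany"] for c, v in production.items()}

# Area scales with the square of the radius.
area_ratio = {c: r ** 2 for c, r in radius_ratio.items()}

for country in ("France", "Italy"):
    print(country, round(index[country], 2), round(area_ratio[country], 2))
```

Squaring pushes both ratios well below the index values, which is why encoding the data in the radii exaggerates the differences readers perceive from the areas.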

If we now bite a chunk out of France and Italy but not Germany, as per the graphic, then the ratio further slides away to 44% for France, and 25% for Italy.

Instacart bows to pressure

Kaiser Fung — Tue, 27 Jan 2026 09:27:27 -0500

The public reaction to the Consumer Reports study on Instacart's pricing strategies has forced the company to "end all item price tests," which is part of their Eversight offering. (See also my previous post about the study.)

Within days of the CR report, Instacart initially disputed some details of the study (link), but eventually announced the end of these AI pricing experiments (link).

These press releases provide a few more insights about what was happening.

Instacart says retailers already charge different prices at different physical stores for the same item. The revised policy does not prohibit online retailers from charging different prices for the same item based on IP addresses (or phone numbers or other ways to geo-locate shoppers). This itself is an intriguing admission. The Internet is supposed to "flatten" the world, bringing everyone closer together but is that more hype than reality? The implication of retailers duplicating brick-and-mortar practices online is an admission that the online presence (which in theory can be launched once for every location) has not altered location-driven economics, if we believe what they're saying.

Retail partners who feature on Instacart can subscribe to a pricing tool called "Eversight." Instacart purchased this capability from a startup called Eversight Labs in 2022.

Eversight marketing materials said the platform is designed for retailers to run "millions of tests" "all the time". Instacart claimed that only "10 of its retail partners" use Eversight for pricing experiments, according to CR. Instacart suggested that using Eversight could increase sales by 1-3 percent, and margins by 2-5 percent. Let's think about how that may be possible.

We first entertain a traditional test-and-learn setting, in which the price experiment involves a randomly selected subset of shoppers, and is turned on for a short period of time to collect enough samples for a statistical read of the result.

The objective of the pricing experiment is to determine the "optimal" price for an item, attained by measuring customers' price elasticity. The expected outcome of the experiment is a price adjustment; the revised price is both fixed and universal (for the population specified in the test). If the outcome is a price hike, the test result must have predicted that the loss of sales due to the higher price is more than offset by the additional revenues generated by the price hike from customers undeterred by it.

As discussed in my previous post, the net improvement in revenues has to be quite large to justify trading away the comfort of inertia. For this reason, the outcome is much more likely to be a price increase than a price decrease. This behavior I think is a type of endowment effect of interest to behavioral psychologists.

If the expected outcome is a price increase (or no change), it follows that the set of test price levels looks more like [base, +2%, +3%] than [-2%, base, +2%]. That's why I suggested computing the average displayed price as a way of learning whether the pricing experiment effectively raised prices.
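The average displayed price is simple to compute once the test arms are known. This sketch assumes shoppers are split equally across arms unless weights are given, and uses a made-up base price of $10.

```python
def average_displayed_price(base_price, offsets, weights=None):
    """Mean price shown to shoppers across test arms.

    offsets are fractional changes from the base price (0.02 means +2%).
    weights are the shares of shoppers per arm; equal shares if omitted.
    """
    if weights is None:
        weights = [1 / len(offsets)] * len(offsets)
    return sum(w * base_price * (1 + o) for w, o in zip(weights, offsets))

base = 10.00  # hypothetical item price
skewed = average_displayed_price(base, [0.0, 0.02, 0.03])
symmetric = average_displayed_price(base, [-0.02, 0.0, 0.02])

print(round(skewed, 4), round(symmetric, 4))
```

With the skewed set of arms, the average shopper sees a price about 1.7% above base; with the symmetric set, the average displayed price equals the base price.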

Instacart's primary pushback on the CR study is that Eversight offers "testing," suggesting that these are temporary price changes that disappear after the tests are over. This defence makes no sense for a number of reasons.

If the retailer has no intention of changing prices, why conduct pricing tests?

If indeed list prices remain the same post-test, then the incremental revenues marketed by Eversight would have been achieved during the period of testing. Further, if the price changes were purely randomly applied, as Instacart asserted, then the set of test price levels is likely skewed toward price hikes and not price decreases.

Assuming the retailer found out from the price testing that it makes more money by raising prices by 5%, why would they not roll out the price hike?

Finally, consider the possibility that a retailer is always running price tests. It gives a new meaning to "testing."

Additionally, it is a fallacy to think that if the test price levels are symmetric, e.g. [-2%, base, +2%], then the experiment does not alter the status quo. This is a subtle point.

The no-change scenario only materializes if the tested prices do not affect consumer behavior. For example, test prices of [-1 cent, base, +1 cent] most likely result in all three test subgroups exhibiting the same buying propensity. This is of course a silly test.

A more probable outcome is that transactions shift in inverse proportion to the price changes. The +2% subgroup buys fewer units while the -2% subgroup gets more units. The price elasticity may be nonlinear, in which case the total revenues obtained during the test may be higher, or lower, than the pre-test amount.
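To illustrate, here is a sketch that assumes a constant-elasticity demand curve, which is my choice of model for the example and not anything Instacart or CR has described. The base price, base units and elasticity values are all hypothetical.

```python
def average_arm_revenue(base_price, base_units, offsets, elasticity):
    """Average revenue per test arm under a constant-elasticity demand curve.

    Assumed model: units = base_units * (price / base_price) ** (-elasticity).
    """
    total = 0.0
    for o in offsets:
        price = base_price * (1 + o)
        units = base_units * (1 + o) ** (-elasticity)
        total += price * units
    return total / len(offsets)

symmetric_arms = [-0.02, 0.0, 0.02]
baseline = 10.0 * 100  # hypothetical pre-test revenue per arm

elastic = average_arm_revenue(10.0, 100, symmetric_arms, elasticity=2.0)
inelastic = average_arm_revenue(10.0, 100, symmetric_arms, elasticity=0.5)

print(round(elastic - baseline, 3), round(inelastic - baseline, 3))
```

Under this model, the symmetric test earns slightly more than the baseline when demand is elastic and slightly less when it is inelastic, so even a symmetric test moves the status quo.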

Running tests "all the time" only makes sense if the vendor is confident that these tests in aggregate improve business outcomes. This setting is incompatible with the idea of an unbiased test with symmetric price levels. If management has such a crystal ball, they should just implement the price changes, without a need to run tests!

VAR technology is ruining football

Kaiser Fung — Thu, 22 Jan 2026 10:00:41 -0500

I have been watching quite a bit of football (mostly Italy's Serie A) recently, and it's become obvious that VAR technology is ruining the beautiful game.

Goal celebrations have always been fun to watch, especially after spectacular golazos. But nowadays, many an extravagant exultation is fake, because the goal is subject to a VAR review, which is a microscopic, backroom operation far away from the action on the field. The referee's (human) decision is not final. VAR's decision, made by VAR referees who are aided by technology, is final.

That's not the official word. But in reality, that's what's happening. The VAR either agrees with the referee's decision, or it doesn't. When the VAR dissents, depending on the situation, either it directly overturns the referee's ruling on the field, or it sends the referee to a viewing booth. The referee's walk to the viewing booth is always cheered by the team that is on the short end of the original decision; in almost every case, the referee accepts the VAR's view. Therefore, effectively, VAR is the real referee; the on-the-field referee is the intern standing in for the boss.

This process ruins goal celebrations. The player will celebrate but he knows it's just an act because until and unless VAR accepts the decision, the goal is not official. The review process may take many minutes, particularly if the alleged infraction is a matter of centimetres. We watch the players theatrically arguing for their respective cases. If the goal is ultimately allowed, it feels weird. Should the player restart the dance of joy? If VAR takes away the goal, the already-seen celebration has turned into a caricature. Fans switch from excitement to disbelief, then to anger.

Spontaneity is the casualty.

Despite the orthodoxy, in some cases, it's not clear the VAR decision, even when aided by video, is the better one.

A couple of recent examples.

During an Atalanta-Roma match on Jan 3 2026, Atalanta's forward Scamacca scored a scorching header from right in front of goal, heading in a cross from the left side. Eventually, the goal was annulled by VAR officials for an "off-side" violation. This review took forever.

The off-side violation took place multiple passes before the final shot. That moment had no bearing on the ultimate goal, other than that Scamacca was momentarily in an off-side position while he was near the midfield circle. That is to say, where he was judged off-side was nowhere near the spot from which he scored, and when he was judged off-side was long before his teammate sent the cross to meet his head.

It was worse than that... because it was an opposing player who gifted Scamacca the ball near mid-field. If one takes the official view, the entire sequence started when an Atalanta player (in blue) attempted to pass the ball to Scamacca in mid-field.

If Scamacca had received that pass, dribbled the ball towards goal, and scored, then the goal should be disallowed because of the off-side rule. No complaint.

But the Atalanta defender kicked a very poor ball that was way out of Scamacca's reach. ("Unreachable" would have been the call if this were the NFL.) In fact, the ball went straight to a Roma player (in white), who failed to control the ball, gifting it to Scamacca in a backward "pass". So in fact, Scamacca did not receive the ball in an off-side position from his own teammate; he took a gift from the opponent, in which case off-side wasn't even pertinent.

Scamacca then dribbled the ball halfway to goal, then sent it to a teammate on his left. It then went to another teammate, who dribbled it to the goal-line and sent in an excellent cross.

By this time, Scamacca had positioned himself right in front of goal, and headed the cross in. The referee immediately signaled goal. Celebration ensued. The (home) stadium erupted.

Even after the match was over, the officials' official stuck by the VAR ruling (link). With the greatest benefit of doubt, this decision can only be justified if we go by some ludicrously strict interpretation of the rules of the game. If they go down this path on every goal, the players might as well wait till the VAR officials have had their say before starting their goal celebrations. Spectators can expect five to ten minutes' delay to confirm every goal. (This also has the side effect of adding loads of "injury time" to the end of each half, another negative.)

This process is inherently unfair because not even the officials' official would advocate combing through the rulebook word by word to adjudicate every goal. At best, they might do this when there is obvious controversy. This is precisely why I was so annoyed with that Atalanta-Roma decision. They are not targeting controversial goals. Almost everyone who watched that match would have accepted it as a clear goal, before the VAR process drowned us in minutiae.

In my understanding, the spirit of the off-side rule is to prevent the striker from gaining undue advantage by camping out behind the defense. Nothing of this sort was happening there.

Further, the above goal was an exceptional team and individual effort, a fantastically conceived and executed sequence. Now, all that is in the dustbin of history, soon to be completely forgotten. I really fail to see how this use of VAR technology improved the experience.

(I previously wrote about how VAR technology leads to "off-side" calls by a finger nail, which sends another bunch of beautiful goals to the dustbin, for the bragging title of I-go-strictly-by-the-book.)

A second recent example. In the Lazio-Fiorentina match on Jan 7, 2026, Fiorentina was awarded a penalty kick in the dying minutes after a VAR review that overturned the on-field referee's original decision (of no penalty).

The video technology compiled a sequence of views to justify its decision. It showed that Gudmundson, the Fiorentina striker (in purple), fell to ground inside the penalty box with the Lazio defender (in light blue) hot on his back. There was no doubt that there was a tangle of legs. The Lazio player fell first but it appeared that his leg might have obstructed Gudmundson, causing him to lose balance.

The intern (oops, I mean, the referee) was summoned to watch the video replay. The referee had signaled no penalty, which meant he had determined that Gudmundson deliberately put his leg out to touch the Lazio player, simulating a fall. As expected, he decided to change his call and awarded Fiorentina a penalty.

Whether it was a foul or simulation is a question of the level of force, not a question of the relative positions of the legs. During the broadcast, viewers were shown a video replay. We kept seeing the legs tangling up. We agreed that they touched but who pushed whom? In my view, no amount of video can answer the question.

Video replays present deceptively objective views of reality. They aren't what they appear to be. The videos are compiled by the VAR technology to explain its decision. They cherry-pick the angles and vantage points to build evidence. We'll never see a video replay contradict the VAR decision. (Similarly in tennis, no replay will ever show a ball falling inside the line, if the line-calling bot has said it's out!)

This subject is near and dear to me because it's a real-world example of what happens when we use automated models to make real-life decisions.

Machines are only valuable if their conclusions differ from those of humans. If the machine always agrees with humans, we don't need it. When the machines disagree with humans, we have two disagreeing points of view. How should a final decision be made?

Machines don't have special access to reality not visible to humans. Machines embody "models" of reality. These models express assumptions when there are not enough data. Embodied models are rarely explained, so these assumptions are not exposed, and thus not reviewable. In the case of VAR, particularly in the video replay cartoons, audiences have never been informed of even one of the many assumptions that must have been adopted by the modelers.

Because they use models, machines can also make mistakes. They also have built-in biases, just possibly different biases than those found in humans. There will be certain aspects in which human senses may work better, e.g. in judging the amount of force applied.

Machines have advantages, such as not being subject to the variability between human referees. Think about that for a moment. We have made a trade-off: we agree to standardize on a single point of view (held by the developers of the technology); it's not that the problem of different opinions has vanished, we make it go away by adopting one viewpoint. It's like employing the same human referee for all matches.

The smell of pie charts

Kaiser Fung — Mon, 19 Jan 2026 09:44:06 -0500

So many experts have been dumping on pie charts for so long that they have acquired a stench. People still want to make pie charts but they're worried about rattling the experts. So they put on lipstick.

Reader Kirsten P. sent me this graphic by Minneapolis Star Tribune, illustrating the disproportionate forces the Federal Government has sent to a metropolitan region of the country. The graphic accompanies an article which gives further background (link).

At first sight, this looks like a new kind of chart: part dot plot and part pie chart.

Not really. You're looking at two side-by-side pie charts. Nothing more, nothing less. The dots play no role other than to disguise these pie charts. A text note even tells readers that each dot represents one officer. I'll take a wild guess: no one is out there counting dots.

Here I strip away the dots:

The pie on the left shows federal forces in the Twin Cities. (The original label says Minnesota for an unknown reason but in the article, they wrote "Homeland Security Secretary Kristi Noem told Fox News on Jan. 6 that her agency sent 2,000 Immigration and Customs Enforcement agents to the Twin Cities" which referred to Twin Cities.)

The pie on the right depicts the sizes of the top 10 metro police forces within the Twin Cities area. The point is that the two pies are roughly the same size (the right pie should be about 1/6th smaller.) [Maybe there is a bigger point: how the Republican trifecta of president, Congress and Supreme Court is imposing "big government", and in particular, federal authority over states, both supposedly misguided Democratic policies. Confusing, no?]

Inside the chart is a puzzle. What do the colors on the right signify?

I suppose Minnesotans will ace this test but as someone not from there, it took me a while to figure this out.

My first guess is red for Republican and blue for Democrat. This doesn't feel right as there shouldn't be many Republican counties in a metropolitan area. (And indeed, this guess does not pan out.)

This puzzle unlocked when I noticed the label "MSP": it is out of place because M = Minneapolis and SP = St Paul, each of which has its own slice of the pie. MSP is also the acronym for the airport, and the airport probably has its own police force.

Thus, the red slices are police forces belonging to counties while the blue slices are entities other than counties. Squinting harder, one can differentiate two shades of blue. MSP and Metro Transit are the lighter blue while the other blue shows police forces associated with cities.

The reader's attention is drawn to the divisions within each pie when the article's story is about the difference in the sizes of the two pies.

Here's a version that points the readers directly at the story:

The failed coup against standardized testing

Kaiser Fung — Wed, 14 Jan 2026 09:06:34 -0500

Standardized testing such as the SAT and ACT has attracted some relentless critics over the years. These adversaries argue that the tests are biased against minorities and the less well-to-do. Recently, many top U.S. colleges have been running an experiment, having suspended the SAT/ACT requirement for admissions since Covid-19.

By the end of 2025, one school after another has backtracked and re-instated the standardized test requirement. MIT (2022, link), Yale (2024, link), Harvard (2024, link), Princeton (2025, link; Alma mater, why so late?), and Stanford (2025, link) have all changed their minds. Every school that reversed course pointed to data showing that the cohorts that have enrolled during the test-optional period are plainly unprepared for college.

In this Wall Street Journal editorial, they cited a dismal finding from the University of California (a top public university system):

About half of UC campus math chairs say that the “number of first-year students that are unable to start in college-level precalculus”—which used to be a standard course for California’s top high school sophomores—doubled over the last five years.

Wait, how about the other half of the campuses?

The other half of chairs said the number tripled.

Harvard started offering a remedial high-school math course (link) to incoming first-year students to help them catch up. University of California, San Diego launched its remedial course 10 years ago, and recently saw enrollment jump 10 times (link).

The move to drop standardized testing was always going to be a disaster. As someone who has read applications (for graduate schools), it's clear that test scores represent the only item in an applicant's file that is interpretable.

High school GPAs are meaningless because the admissions officer lacks any context to interpret the data. Like college professors, high school teachers are giving away top grades like Halloween candy.

Transcripts present the same problem as GPAs – no context. There is never enough time to read course titles to infer what level they are at, and certainly no point staring at individual grades, having no idea what proportion of the class received the same grades.

Teacher references are not much better. Most references are vapidly nice, without distinguishing one student from another.

Once in a blue moon, you come across a teacher slamming a student. My first reaction is: why did the teacher bother to write it? My second reaction is: Poor student, who mistakenly assumed that s/he was on good terms with said teacher. My third reaction is: what a mess! Two vapidly nice ref + one viciously cruel ref = ?? The truth is I know neither the student nor the reference writers, and I don't feel like choosing who to believe.

Essays are somewhat informative but the prevalence of hired help, coupled with the availability of AI writers, ensures that many essays do not say much that is real about the applicants. It's also a medium that favors those with better writing and story-telling skills. Besides, this medium favors the well-to-do, who can afford more expensive coaches, and send their kids to far-away places for save-the-world type experiences. (Somehow, those critics who like to bash standardized test biases are quiet about obvious biases of other forms of assessment, such as essays.)

Without the standardized test scores, the application portfolio contains only subjective items. Picking one applicant over another is an act of randomness. No wonder the colleges admitted under-prepared students during the test-optional period.

Know your data 47: bait and switch prices

Kaiser Fung — Tue, 13 Jan 2026 09:08:01 -0500

Long-time reader Mark P. sent me to this Guardian investigation of pricing discrepancies in "dollar stores" in the U.S. (link)

The dollar store is really the five-dollar store these days. I don't think one can find anything in a dollar store that costs a dollar!

The reporter did a great job finding customers and former employees of these stores to tell their stories. Many of the customers live in isolated areas, and on tight budgets. A dollar store is often the only retailer within walking distance to where they live, and so in a sense, they represent a captive audience.

The headline is bait-and-switch pricing. It turns out that these dollar stores (the article names two chains: Dollar General, and Family Dollar) frequently charge customers higher prices than the advertised shelf prices.

It's not a one-time anomaly. In some stores, as many as 80% of the register prices have been found to be higher than the respective shelf prices! The differences aren't mere rounding up. Examples given in the article include $5 frozen pizzas charged $7.65, and $11 paper towels charged $15.50. Even after authorities have complained about this practice, and even assessed penalties, many such stores continue to bait and switch.

The situation is really shameful:

Dollar General stores have failed more than 4,300 government price-accuracy inspections in 23 states since January 2022, a Guardian review found. Family Dollar stores have failed more than 2,100 price inspections in 20 states over the same time span, the review found.

"Industry watchers" want our pity. They claim that stores do not have sufficient staff to update shelf prices, leading to pricing discrepancies. In other words, they're saying that the register prices are correct, and the shelf prices are incorrect.

For the data nerds, this admission raises grave concerns about data collection. Imagine wanting to collect prices to estimate inflation. One method is to send people into stores to jot down prices. The data correspond to shelf prices. We now know that in dollar stores, the shelf prices may be much lower than the actual prices paid by consumers. So, the collected data are inaccurate.
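To get a feel for the size of this measurement error, here is a quick Python sketch using the two examples cited in the article. Treating the two items as a "basket" is my own simplification, and the function name is made up for illustration.

```python
# Shelf price vs. register price for the two examples cited in the article.
items = {
    "frozen pizza": (5.00, 7.65),
    "paper towels": (11.00, 15.50),
}

def understatement(shelf, register):
    """Fraction by which the price actually paid exceeds the shelf price."""
    return (register - shelf) / shelf

for name, (shelf, register) in items.items():
    print(f"{name}: paid {understatement(shelf, register):.1%} more than the shelf price")

# A price collector who jots down shelf prices records this two-item basket
# at $16.00, while a shopper actually pays $23.15 at the register.
shelf_total = sum(shelf for shelf, _ in items.values())
register_total = sum(register for _, register in items.values())
basket_gap = understatement(shelf_total, register_total)
print(f"basket: paid {basket_gap:.1%} more than the shelf price")
```

Any inflation estimate built on the shelf prices alone would miss this gap entirely.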

Reading between the lines, we learn from the Guardian article that government bean-counters know about this issue. This is why they conduct inspections that have uncovered these price discrepancies.

The red-faced publicists for these stores had more to say:

[the Dollar General's] store teams “are empowered to correct the matter on the spot.”

This statement contradicts the other claim. If we believed the industry insiders cited above, the correct prices are the register prices, so there is nothing to "correct." By "correcting the matter," they must mean charging customers the shelf prices instead of the register prices. So, they "correct" the matter by charging the incorrect prices.

Is this a Freudian slip? Did they admit that the shelf prices are real, and they overcharge unsuspecting customers? If and when this bait-and-switch scheme is noticed, they will do the right thing.

The consumer advocates are equally confused. Since they use the term "overcharges," they must believe that the shelf prices are correct while the register prices are incorrect. If that is the case, then they can't accept that the cause of these "overcharges" is how the stores don't have the staff to update the shelf prices! But they swallowed that whole.

The largest gambling market in Europe, and the largest online

Kaiser Fung — Wed, 07 Jan 2026 09:37:43 -0500

In a prior post, I discussed why the dual-axes chart about the European gambling market is mind-boggling.

Here is an alternative visualization that focuses on the story behind the dataset:

I'd like to center attention first on each country's share of gross gambling revenues. The top five are Italy, U.K., Germany, France and Spain, each accounting for 10-18% of the market. Everybody else is relatively insignificant, with less than 5% share.

The next important insight from the data is the over/under performance of the online sector compared to aggregate. I decided to use only the online data series because online better implies offline worse, and vice versa.

The countries are divided into two groups, those with online share higher than their aggregate share (shown in purple), and those with online share smaller than their aggregate share (shown in orange).

For example, Italy's overall share is about 18% but its online share is only 11%. By contrast, the U.K.'s overall share is 17% while its online share is 26%.

I'm using a different measure of online share from the designer of the original. On my chart, "online share" is each country's share of the aggregate European online gambling revenues. These online shares sum to 100% across countries. On the original chart, "online share" is defined as online's share of total gambling revenues within each country. The total of these online shares across countries is meaningless; instead, the online share and offline share sum to 100% within each country.
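The difference between the two definitions is easy to see in a few lines of Python. This is a minimal sketch: the revenue figures are made up for illustration (they are not the values in the dataset), and the function names are my own.

```python
# Hypothetical revenues in billions of euros: (online, offline) per country.
# These numbers are illustrative only, not the real data.
revenues = {
    "A": (4.0, 16.0),
    "B": (9.0, 8.0),
    "C": (2.0, 3.0),
}

def share_of_european_online(revenues):
    """Each country's share of total online revenues across all countries."""
    total_online = sum(online for online, _ in revenues.values())
    return {c: online / total_online for c, (online, _) in revenues.items()}

def online_share_within_country(revenues):
    """Online's share of each country's own total (online + offline)."""
    return {c: online / (online + offline)
            for c, (online, offline) in revenues.items()}

across = share_of_european_online(revenues)
within = online_share_within_country(revenues)

# The first measure adds up to 100% across countries; the second does not.
print(round(sum(across.values()), 6))
print(round(sum(within.values()), 6))
```

Country A illustrates how the two measures can tell different stories: online is only a fifth of its own market, yet it holds more than a quarter of the combined online revenues.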

Self-cancelling actions

Kaiser Fung — Tue, 06 Jan 2026 09:43:39 -0500

One use of machine vision is to monitor road intersections to catch unruly drivers. The advantages of using machines over humans are plenty. Cops can only scope out a limited number of intersections for a limited duration of time while AI cameras can find every last offender at every intersection at all times of the day and night.

One possible business model for the AI company is to split "revenues" from traffic tickets issued for these infractions with local governments, e.g. for a $100 fine, the AI company gets $30.

This type of business model is challenging because of a self-cancelling property. After word gets out that a certain intersection is being monitored 24/7 by AI, most drivers will react by curbing their rule-breaking behavior, or taking a different, un-surveilled route. These counter-actions reduce the number of fines assessed, which depresses the amount of sharable revenues.

Ironically, to prove the success of such policies, one should look for a revenue reduction, not revenue growth. (There might be an initial spurt in fines before most drivers become aware of the AI traffic cop.)
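Here is a toy Python sketch of that dynamic. The $100 fine and $30 vendor cut come from the example above; the starting number of violations and the assumption that violations drop by a fixed fraction each week are mine, purely for illustration.

```python
def vendor_revenue(initial_violations, deterrence_rate, weeks,
                   fine=100.0, vendor_share=0.30):
    """Weekly vendor revenue when violations fall by a fixed fraction each week.

    The geometric decay is an assumed stand-in for drivers learning
    that the intersection is monitored.
    """
    revenue = []
    violations = float(initial_violations)
    for _ in range(weeks):
        revenue.append(violations * fine * vendor_share)
        violations *= (1.0 - deterrence_rate)
    return revenue

# Hypothetical: 200 violations in week one, dropping 25% per week.
weekly = vendor_revenue(initial_violations=200, deterrence_rate=0.25, weeks=5)
for week, amount in enumerate(weekly, start=1):
    print(f"week {week}: vendor receives ${amount:,.0f}")
```

The better the cameras work as a deterrent, the faster the vendor's income shrinks.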

The Trump Republican tariffs have the same self-cancelling property.

If the tariffs are sufficiently high so that imported goods become unaffordable, then consumers will switch to domestic suppliers. This behavioral shift should reduce the amount of imports, which in turn should suppress tariff collection.

Thus, if the policy is successful, we should observe lower imports, and lower tariff revenues.

If tariff revenues go up instead of down, it implies that consumers are paying higher prices than before for the imported products, probably because no domestic substitutes are available. (There might be an initial spurt before prices catch up to the new reality, before businesses run out of workarounds, or before consumer behavior changes.)

If the stated goal of returning manufacturing to the U.S. is achieved, there should be fewer imports, and thus less tariff revenue collected.
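The two scenarios can be sketched in Python. I'm assuming a constant-elasticity demand curve for imports, which is a textbook simplification and not something taken from any actual tariff data; the rates, base import value and elasticities are all hypothetical.

```python
def tariff_revenue(rate, base_imports, elasticity):
    """Tariff collected when import value shrinks as the tariff-inclusive price rises.

    Assumes constant-elasticity demand:
        imports = base_imports * (1 + rate) ** -elasticity
    """
    imports = base_imports * (1.0 + rate) ** (-elasticity)
    return rate * imports

# Scenario 1: domestic substitutes are easy to find (elastic demand).
# Doubling the tariff rate from 50% to 100% lowers the amount collected.
elastic_low = tariff_revenue(0.5, 100.0, elasticity=4.0)
elastic_high = tariff_revenue(1.0, 100.0, elasticity=4.0)

# Scenario 2: no domestic substitutes (inelastic demand).
# Raising the rate raises the amount collected; consumers just pay more.
inelastic_low = tariff_revenue(0.10, 100.0, elasticity=0.2)
inelastic_high = tariff_revenue(0.25, 100.0, elasticity=0.2)

print(f"elastic:   {elastic_low:.2f} -> {elastic_high:.2f}")
print(f"inelastic: {inelastic_low:.2f} -> {inelastic_high:.2f}")
```

Rising collections are therefore a sign of the second scenario, not evidence that production is moving home.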

The most notorious self-cancelling product is 100% effective medication.

If a new medicine is 100% effective, then patients are cured, removing the need to buy more drugs. The pharmaceutical company will eventually suffer a collapse of this line of business. This is why many observers believe that pharmas don't have a strong incentive to cure any disease.

Forced roommates, favoritism, and more in data visualization

Kaiser Fung — Mon, 05 Jan 2026 09:36:40 -0500

Long-time reader Antonio R submitted the featured chart shown above.

From a visual perspective, the chart is overly ambitious.

It uses dual axes, which is almost always a bad idea. The left-side axis is related to the orange line, which depicts the "online share" of gambling revenues within each country, expressed as percentages. The right-side axis concerns the stacked columns, which display each country's total gambling revenues, split by online (green) versus offline.

What are the reasons why this chart is mentally taxing?

Dual axes

How is the reader supposed to figure out which axis pairs with which chart? The two axis titles are "Online share (%)" and "Gross Gambling Revenues (€bn)". We'd have to move our eyes to the bottom of the chart, read the legend text, and then mentally connect those with the axis titles. The "online share" gives us the first hint, then we presume that the "land-based revenue" and the "online revenue" must be the components of "gross gambling revenues". Without that legend, we'd have been lost.

Redundancy

The online shares of revenues depicted in orange refer to the green sections of the columns. The orange and green objects are re-scaled versions of the same revenues. Use the same color to represent the same quantity.

Stack order

Given the focus on online gambling revenues, the green sections of the column chart should be placed at the bottom of the columns. The bottom layer of a stacked column chart is the only layer with a uniform base, making it the easiest to read.

Forced roommates

Notice that the two axes share the same set of gridlines. Because of this arrangement, it is as if 50% equals €16 bn. That would be true if the two data series were re-scaled versions of the same underlying data but on this chart, the "Online share" is a re-scaled version of one component of "Gross gambling revenues", and therefore they represent different data. Since the total revenues vary by country, 50% share maps to a different amount in each country so there does not exist a set of gridlines that can meet the desired sharing objective.

The graphing software has taken on the hopeless role of assigning roommates. It wants gridlines for both axes but having two sets of competing gridlines would kill many brain cells. It decides to make the early-rising athlete share a room with the night-owl hacker; they are just going to have to make it work.

How does the software designer decide where to put the shared gridlines? One way is to fix the grid of the primary axis (left-side). This sets the number of lines on the chart. Now, choose a scale for the other data series so that the grid labels on the secondary axis are the "least ugly".

No matter what the designer does, the final gridlines serve one side better than the other.
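The procedure just described might look something like the following Python sketch. I don't know how any particular graphing package actually does it; the "nice number" rule (1, 2, 2.5, 5 or 10 times a power of ten) and the axis ranges in the example are my assumptions.

```python
import math

def nice_step(raw_step):
    """Round raw_step up to 1, 2, 2.5, 5 or 10 times a power of ten."""
    exponent = math.floor(math.log10(raw_step))
    base = 10.0 ** exponent
    for multiple in (1, 2, 2.5, 5, 10):
        if multiple * base >= raw_step:
            return multiple * base
    return 10 * base  # fallback for floating-point edge cases

def shared_gridlines(primary_max, primary_step, secondary_max):
    """Fix the primary axis grid, then fit the secondary axis to the same lines."""
    n_intervals = math.ceil(primary_max / primary_step)
    # The secondary axis must cover its data using the same number of intervals.
    secondary_step = nice_step(secondary_max / n_intervals)
    primary_ticks = [i * primary_step for i in range(n_intervals + 1)]
    secondary_ticks = [i * secondary_step for i in range(n_intervals + 1)]
    return list(zip(primary_ticks, secondary_ticks))

# Hypothetical: shares run to 60% in steps of 10; revenues peak at 19 (bn).
grid = shared_gridlines(primary_max=60, primary_step=10, secondary_max=19)
for share_tick, revenue_tick in grid:
    print(f"{share_tick:>3}%  |  {revenue_tick:g} bn")
```

In this made-up case, the secondary axis is stretched to 30 to hold data that tops out at 19, so more than a third of its range is wasted. That is the price paid by the series that did not get to pick the grid.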

Favoritism

The gridlines favor the orange series, and so does the sorting of countries.

When you have two data series, you can sort with respect to one series, not both (unless they are perfectly correlated in rank). In the featured chart, the countries are sorted by the online share of gambling revenues.

This sorting scheme arranges the other data series awkwardly. As it turns out, four of the top five markets have seen low online penetration. Italy, the largest market, ends up on the right side of the chart.

If you pay attention to the green sections only, you'll learn that the online segment in Italy (in terms of Euros) is still the second largest behind that of the U.K. The peculiar sorting scheme highlights five countries that have small gross gambling revenues.

There is a good story behind this data. The top markets are much larger than the rest; most of these countries (except the United Kingdom) are stronger in offline than online segments.

In a future post, I'll offer an alternative view of this dataset.

P.S. [1/7/2026] Next post is here.

Making online help more helpful

Kaiser Fung — Wed, 24 Dec 2025 09:11:02 -0500

A jocular post by Andrew (link) sent me on an unlikely excursion.

In the post, Andrew mused about finding quotes of himself in the online Merriam-Webster dictionary. The first example he cited was this quote:

My quick answer, was, no, the persistence method would not have worked.

This perked me up because of how silly this sentence is in teaching how to use the word "persistence" in a sentence.

The same sentence can be written for any number of nouns. The regression method, the factor method, the omission method, the research method, the drinking method, the kissing method, ....

Visiting the M-W page for persistence, I found a total of 27 example sentences with attribution, in addition to three generic examples without attribution.

The first attributed quotation is:

Nothing in the world can take the place of persistence.

This has the same nature as Andrew's sentence. One can substitute many a noun for "persistence", without destroying the sentence. This implies that the sentence cannot explain how to use the specific word "persistence".

Not all examples are useless. The following each contains enough context to learn the meaning of "persistence":

These steps aren't easy, and can take some time and persistence.

Hall, who lives in Granbury, returned to the lake this winter and his persistence paid off on the last day of his trip.

The finish offers notes of black and brown spice notes and there is good persistence.

It would be more helpful if Merriam-Webster grouped the examples by word sense. The third sentence shown above is distinct in using persistence to refer to lingering sensation. (The original article is found here.)

The usage by Andrew is even stranger. In the Wired feature, the "persistence method" is defined in the sentence immediately before the one cited by Merriam-Webster. Andrew mentioned a climate scientist who used "persistence" to describe "the assumption [used in climate models] that conditions remain unchanged from one year to the next." This word sense maps to Merriam-Webster's second definition of "persist" (i.e., "to remain unchanged or fixed in a specified character, condition, or position"), which its editors have tagged as "obsolete."

In short, a reader can't figure out the meaning of persistence from reading Andrew's quotation.

If the Merriam-Webster examples are representative, they suggest that "persistence" is most often used in the sense of a human trait, and when used in this way, authors like to pair them up with related traits and concepts, such as "patience and persistence", "vision, persistence, and sweat", "hunger and persistence", "time and persistence", and "hard work and dogged persistence". The bounty of these specimens feels redundant.

How does Merriam-Webster select these quotations? This is what they disclose:

I assume they use an algorithm. I kept digging.

I investigated a quotation attributed to a Forbes feature:

Coupled with persistence, passion lit a path in the sky for the WASP.

This is a variation of the pattern "X and persistence" where X = "passion". But. But what is the casual reader supposed to make of "lit a path in the sky"? What is "WASP"?

If you know WASP, it's not what you're thinking. That meaning has little to do with paths in the sky. Read the Forbes article, and you'll learn that WASP stands for Women Airforce Service Pilots.

The word "persistence" appears in that article three other times.

At the root of all dreams lies persistence.

The Power of Persistence and Honoring a Legacy

So one of the lessons I’ve learned from doing this project has definitely been persistence. I mean, they kept fighting for military status until 1977 under President Carter, and that was the first time that that happened.

The algorithm evidently picked the most abstruse sentence, and also the one appearing furthest down the page. I'd have selected the third of this set - I cheated by including two sentences in the quote but without the second sentence, the meaning is elusive.

How should we make an ideal section for word usage in sentences? I'd want fewer but sharper sentences; self-contained sentences, or including surrounding sentences that provide the context for comprehension; and sentences grouped by word sense.

Implementing this type of algorithm takes a lot of work. You have to deploy a "spider" or some way of compiling a collection of text from which to extract sentences. You need a search engine to find keywords. You hope your text extraction process successfully pulled down author, date and document source (not standardized across different websites). You have to design a scoring rubric to select which sentences to show.
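To give a flavor of the last step, here is a toy scoring rubric in Python, applied to the four Forbes sentences quoted above. The rules and weights are invented by me for illustration; I have no idea what Merriam-Webster's actual selection logic looks like.

```python
import re

def score_sentence(sentence, keyword):
    """Hypothetical rubric: reward context, penalize jargon and fragments."""
    words = re.findall(r"[A-Za-z']+", sentence)
    score = 0.0
    # The keyword has to be there in the first place.
    if keyword.lower() not in sentence.lower():
        score -= 5.0
    # Very short sentences rarely carry enough context to convey meaning.
    if len(words) >= 10:
        score += 1.0
    # Unexplained all-caps acronyms send the reader elsewhere for answers.
    acronyms = [w for w in words if len(w) >= 3 and w.isupper()]
    score -= 1.0 * len(acronyms)
    # Headline-like fragments (no closing punctuation) are not usage examples.
    if not sentence.rstrip().endswith((".", "!", "?")):
        score -= 1.0
    return score

candidates = [
    "Coupled with persistence, passion lit a path in the sky for the WASP.",
    "At the root of all dreams lies persistence.",
    "The Power of Persistence and Honoring a Legacy",
    "So one of the lessons I've learned from doing this project has definitely "
    "been persistence. I mean, they kept fighting for military status until 1977 "
    "under President Carter, and that was the first time that that happened.",
]

best = max(candidates, key=lambda s: score_sentence(s, "persistence"))
print(best)
```

Even this crude rubric prefers the two-sentence quotation over the WASP one. A real system would also need rules for word sense and for redundancy across the selected sentences.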

Another type of algo pricing

Kaiser Fung — Tue, 23 Dec 2025 09:45:25 -0500

In my previous take on "algorithmic pricing," I deliberately glossed over one nuance.

Typically, the inputs to a pricing algorithm are demographics, behavioral data (e.g. how many times the user has revisited the product page), estimated price sensitivity, and so on. Most of these are data about the individual customer. Thus, it has been shaded as "surveillance pricing."

It is also possible to build a pricing model using purely supply and demand data that do not identify individuals. How much inventory does the retailer have? What's the projected demand? How do price changes affect the demand? At what price does the retailer generate maximum profits?
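A bare-bones version of such a model fits in a few lines of Python. This is a sketch under assumptions of my own: a linear demand curve, a fixed unit cost, and a simple search over candidate prices. None of these numbers comes from a real retailer.

```python
def best_price(prices, demand_at, inventory, unit_cost):
    """Return (price, profit) maximizing profit; sales are capped by inventory."""
    best = None
    for p in prices:
        units = min(inventory, max(0.0, demand_at(p)))
        profit = (p - unit_cost) * units
        if best is None or profit > best[1]:
            best = (p, profit)
    return best

# Hypothetical aggregate demand: 100 units at a price of zero, 10 fewer per dollar.
def demand(price):
    return 100 - 10 * price

candidate_prices = [x / 2 for x in range(2, 21)]  # $1.00 to $10.00 in 50-cent steps

plenty = best_price(candidate_prices, demand, inventory=40, unit_cost=2.0)
scarce = best_price(candidate_prices, demand, inventory=30, unit_cost=2.0)
print(plenty)
print(scarce)
```

With less stock on hand, the profit-maximizing price goes up, even though no individual customer's data entered the calculation.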

Demand forecasting is likely to benefit from "surveillance" data such as the frequency of users browsing the item's page, or placing it in the shopping cart. In this setting, however, the data will be aggregated. That's because this design allows the price of an item to change over time but not to vary across individuals at a given moment.

The observations of Consumer Reports do not suggest this type of pricing algorithm; thus, I didn't mention it in the other post.

Dynamic pricing is normal in various industries. Airlines, hotels and the hospitality industry have long priced their products based on supply and demand data. That's why plane tickets and rooms get more expensive, the closer it is to the use date. (But excess inventory might go on fire sale for last-minute bookings.)

Customers end up paying different prices for similar products (note: never the same room or seat at the same time) but it doesn't feel unfair. That's because a room on Christmas Eve is clearly more valuable than the same room the week before. Besides, everyone who's willing to lock down the reservation months in advance gets a discount. This dynamic pricing isn't offensive.

Differential grocery prices don't give the same vibes. The same can of tomatoes isn't worth more from one week to another, and sellers can order more inventory instead of hiking prices if demand exceeds expectation.

Turkey during Thanksgiving week doesn't have to be more expensive; the markets can stock up. Consider the alternative of dynamically adjusting prices. Imagine there are 10 turkeys on the shelf, and 50 shoppers will be looking to buy a turkey. If the current price is affordable to everyone, then the sales become effectively first-come, first-served. This feels fair because if you want a turkey, you hustle there before the others. If the vendor raises the price to price out 40 of the 50 shoppers, then the turkeys end up with the highest bidders. In reality, the algorithm isn't so precise but the overall effect is to sell the birds to those willing and able to pay more.
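The turkey example can be worked through in Python. The 10 turkeys and 50 shoppers come from the scenario above; the willingness-to-pay figures (one shopper at each dollar amount from $21 to $70) are invented so that the arithmetic is easy to follow.

```python
def clearing_price(willingness_to_pay, stock):
    """Highest price at which at least `stock` shoppers are still willing to buy."""
    ranked = sorted(willingness_to_pay, reverse=True)
    return ranked[stock - 1]

# Hypothetical willingness to pay: 50 shoppers, from $21 up to $70.
shoppers = [20 + i for i in range(1, 51)]
stock = 10

low_price = min(shoppers)                     # affordable to all 50 shoppers
high_price = clearing_price(shoppers, stock)  # leaves only as many buyers as birds

buyers_at_low = sum(1 for w in shoppers if w >= low_price)
buyers_at_high = sum(1 for w in shoppers if w >= high_price)

print(f"price ${low_price}: {buyers_at_low} willing buyers, "
      f"revenue ${low_price * stock}")
print(f"price ${high_price}: {buyers_at_high} willing buyers, "
      f"revenue ${high_price * stock}")
```

At the low price, who gets a turkey depends on who shows up first. At the high price, it depends only on who can pay the most, and the seller collects nearly three times the revenue on the same 10 birds.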

Economists may praise the dynamic pricing setting as more "optimal." It certainly maximizes the total revenues received by the sellers. The average customer of everyday items finds it unfair. Perhaps part of the opposition is against asymmetric application. I find it hard to believe that the same dynamic pricing algorithm would be allowed to lower prices in response to poor demand. Each such price adjustment is a bet by the seller that the price drop would generate sufficient additional purchases to pay for itself. It represents trading a sure thing for an uncertainty.

Interestingly, supermarkets don't tend to play with lead times, unlike airlines or hotels. Almost all grocery items have best-before dates but only a few stores I know put discounts on items that are about to expire. Why? I'm not sure. Too much administrative hassle? Too many customers shifting from paying list prices? Do you have a guess?

When will they "personalize" pricing?

Kaiser Fung — Wed, 17 Dec 2025 09:17:50 -0500

One of the oft-cited benefits of Web or mobile technology is "personalizing" the user experience. This concept starts with little conveniences such as remembering your log-in user name (via cookies). The most obvious tool of personalization is probably recommendation engines, made famous by Netflix. When a service is personalized, different users expect to encounter different experiences.

Consumers don't welcome all personalization modes. A recent survey by Consumer Reports found that 7 out of 10 respondents reject personalized pricing for groceries (link). Something about paying different prices for the same can of tomatoes offends our sensibilities. Nevertheless, it's obvious that businesses will make more money if they are able to charge more to customers able or willing to pay more; and it's equally obvious that Web and mobile tools of personalization can be extended to pricing decisions. So, it's a matter of when, not if, that we will be charged different prices from our friends for the same things.

Huge props to Consumer Reports for conducting a rigorous study that confirms something many of us already suspect is happening: personalized pricing (link).

The CR study focused on Instacart, a popular shopping concierge service, by which the company dispatches shopping assistants to pick groceries from brick-and-mortar stores and deliver them to customers. Consumer Reports found that the Instacart website shows many different prices to different customers for the same item ordered from the same stores, with the maximum price sometimes as much as 20 percent higher.

Today, those customers who paid $123.93 are likely to presume that everyone is being charged the same price - that's the norm when they shop in brick-and-mortar stores with posted price tags. We're assuming they know how much they're paying. But, as described in Chapter 7 (on inflation statistics) of Numbersense (link), many American supermarket customers have no clue how much items they placed in their shopping carts cost. (I cited this research by marketing professors.)

Furthermore, there is not much a skeptical online shopper can do to learn if the store charges everyone the same prices. It's a bit easier to collect posted prices at physical stores for comparison with the online prices, but still too much of a hassle for most shoppers.

Thus, online retailers have an incentive to personalize pricing because they can find more revenues from unsuspecting customers.

This is why the Consumer Reports study is so valuable.

How then did CR get around the data challenge? They recruited hundreds of people, arranging simultaneous Instacart shopping sessions for the same retailers, during which everyone placed the same basket of groceries into their shopping carts. Then, they recorded the prices. The variations in prices were visualized in the type of dot plots shown above.

The CR team seemed to be of two minds about whether Instacart is really doing "personalized" pricing. They call it "algorithmic pricing experiments."

This coinage merges two distinct concepts: algorithmic pricing, and pricing experiments. Algorithmic pricing is what I call "personalized pricing": the price differentiation is most commonly achieved by deploying an algorithm that computes each item's price while the shopper is browsing the site or using the shopping app. The goal of such an algorithm is to maximize the store's revenues.

A pricing experiment is another species. The retailer might set up five treatment "cells," say, the base price, and four variations (±5%, ±10%). Every time an item's price is required during a shopping session, a virtual die is thrown to pick one of those five price levels. Thus, those shoppers facing steeper-than-normal prices are just "unlucky." This is how Instacart staff explained the CR observations.

Normally, a pricing experiment does not use a pricing algorithm because an experiment should be designed like a clinical trial requiring random assignment of treatments (i.e. prices). Therefore, I don't use the term "algorithmic pricing experiment".

If I say "algorithmic pricing experiment," I mean something else. This test would also appear like a clinical trial, in which the treatment group comprises shoppers subjected to a pricing algorithm, and the control group contains shoppers being shown the standard, non-personalized prices. The treatment group itself would split into multiple cells with different pricing (analogous to testing dosages of medicine). The control group is included in order to measure the business-as-usual state.

Whether Instacart is running experiments or personalized pricing, an outside observer should find price variability. How then can we tell one from the other?

First, look at how widespread the price variations are. Typically, experiments affect a subset of shoppers, especially for a website with millions of customers while an algorithm represents a pricing strategy applied to all.

Second, look at how sticky the price variations are. Experiments are run to answer strategic questions, after which a strategic decision is made whether to "roll out" a change to all customers. The alternate prices are not supposed to last.

Third, look at the average prices. In my design, randomization occurs at the item level, therefore if we compute the average price differential (0%, ±5%, ±10%) across all purchases by customer over a time window, those averages should be roughly zero (if the number of items is not too few).
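To make the third check concrete, here is a minimal simulation sketch. The five price levels come from the experiment design described above; the number of customers, the number of items, and the "personalized" stand-in (one fixed markup per customer) are my own assumptions for illustration, not anything Instacart or CR has disclosed.

```python
import random

# Five treatment cells: base price and four variations, as in the design above
LEVELS = [-0.10, -0.05, 0.0, 0.05, 0.10]

def experiment_averages(n_customers, n_items, rng):
    """Item-level randomization: each item's price differential is an independent draw."""
    return [
        sum(rng.choice(LEVELS) for _ in range(n_items)) / n_items
        for _ in range(n_customers)
    ]

def personalized_averages(n_customers, rng):
    """Hypothetical stand-in for an algorithm: one fixed differential per customer."""
    return [rng.choice(LEVELS) for _ in range(n_customers)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable
exp_avg = experiment_averages(200, 400, rng)   # assumed: 200 customers, 400 items each
per_avg = personalized_averages(200, rng)

max_exp = max(abs(a) for a in exp_avg)
max_per = max(abs(a) for a in per_avg)
print(f"experiment:   largest per-customer average differential = {max_exp:.3f}")
print(f"personalized: largest per-customer average differential = {max_per:.3f}")
```

Under item-level randomization, every customer's average hovers near zero once enough items are purchased; under the hypothetical per-customer markup, the averages stay wherever the algorithm put them.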

In the case of a personalized pricing algorithm, which sets prices to match a customer's ability or willingness to pay, we should see some customers with elevated prices, and others with deflated prices. It's hard to imagine that the average price differentials stay close to zero for most shoppers.

The rub is that external observers have almost nothing to work with.

In order to assess how widespread the price variations are, the study would have to recruit all subtypes of customers. To measure how long-lasting the price differentials are, the study must be repeated regularly. Without access to transaction databases, outsiders can't gauge average prices by customer.

The evidence collected by Consumer Reports is very important; it's hard to ask for more. The study suggests to me that personalized pricing will become widespread within a few years, whether we like it or not.

Facetune your charts

Kaiser Fung — Mon, 15 Dec 2025 09:02:20 -0500

In a prior post, I featured the above chart to tell the story of international students in the U.S.

The story told by this chart is clean. Divide the total into two halves: the first half are the Indian and Chinese students; the other half comprises everyone else. Within the top half, India has 5/8 to China's 3/8. The bottom half is spliced into 8 parts. Europe is one, Canada+Mexico has one, the rest of the Americas occupy one, etc.

As I write this, I can hear purists screaming in my ears. "Your chart distorts the data", "You're spreading misinformation".

I confess. The chart is, for lack of a better word, face-tuned.

Below is the version of the chart that is "faithful" to the dataset:

India and China together made up 53% of the total, more than the 50% shown in my preferred version.

The eight country groups contributed between 4 and 8 percent of the students, not exactly 6 percent as my preferred chart suggests. To be precise: Europe 8%, East Asia (excluding China) 7%, Africa 6%, South Asia (excluding India) 6%, Canada+Mexico 5%, Southeast Asia 5%, Middle East 4%, and Americas (excluding Canada+Mexico) 4%.

The question: is the "faithful" version better than the "approximated" version?

If my goal is for readers to walk away with insights that they can pass along to others, then I don't hesitate to use approximations.

If the readers thought the sum of Indian and Chinese students made up 50% of the total (rather than 53%), what is the harm?

If the readers thought that Europe, Middle East, Southeast Asia, etc. all contributed equal shares of international students - and they would be adrift by a few percentage points one way or the other, is it worse than them trying to recall which region had 8 percent, and which region had 5 percent?

Our brains are not designed to hold raw data. This is why we don't want to - and can't - remember passwords that are long strings of randomly selected, alphanumeric characters. This is also why rejecting approximations is frequently harmful. Face-tuning your charts is often beneficial!

The story on international students coming to the U.S.

Kaiser Fung — Wed, 10 Dec 2025 09:26:24 -0500

The New York Times noted the drop in international students arriving in the U.S. in 2025 (link; paywall). As the following charts show, the schools have nearly recovered from the Covid-19 related dip but in the last year or so, the trend has reversed, probably due to the current hostility toward foreign-born persons.

(Note that each chart above has a different scale.)

These line charts are incredibly ugly because of the Covid-19 "shock."

Later in the article, the focus shifts to the change from 2024 to 2025. The time dimension is thus removed. They choose a bag of bubbles design:

Some readers will find this design engaging. You positively must play with the chart in order to learn something about the data. Which bubble represents which country? Is the declining trend affecting all regions?

Unlike the line charts above, if the reader is interested in the year-on-year change in student arrivals, this bubble chart gives out that information directly.

The size of the bubbles shows the 2025 data. This signals the relative importance of the bubbles. The main takeaway is that the erosion was widespread: most circles sit below the axis of no change.

The aggregate drop in arrivals was almost 20%. This value is printed on the chart as an annotation. Without the text, it would be impossible to figure it out. You'd have to do an average of the individual decline rates, using the relative bubble sizes as weights.
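Here is a rough sketch of that calculation. The three bubbles are made-up numbers, not the Times data. Weighting the decline rates by the 2025 bubble sizes gives an approximation; the exact aggregate change needs the 2024 levels, which can be backed out from each bubble's size and rate.

```python
# (2025 size, year-on-year change) per bubble; made-up values for illustration only
bubbles = [(300, -0.25), (200, -0.10), (100, 0.05)]

total_2025 = sum(size for size, _ in bubbles)

# Approximation: average of the rates, weighted by the 2025 bubble sizes
approx = sum(size * change for size, change in bubbles) / total_2025

# Exact: recover each 2024 level as size / (1 + change), then compare totals
total_2024 = sum(size / (1 + change) for size, change in bubbles)
exact = total_2025 / total_2024 - 1

print(f"size-weighted average of rates: {approx:.1%}")
print(f"exact aggregate change:         {exact:.1%}")
```

Either way, it is not something a reader can do by eye, which is why the printed annotation matters.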

Let's switch the perspective, and make a chart that gives readers some high-level takeaways.

In this chart, the outer square displays all 1.2 million international students in 2024-5. The population of all students is divided into four rows, each containing ~300K students.

From this, you can see that India and China together account for half of the total. India is the top source of international students, comprising ~30% of the total (25% + 25%/4 ≈ 31%).

Each row is subdivided into four parts, so each "cell" covers about 75,000 students, ~6% of the total.
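For readers who want to check the grid arithmetic, here is a small sketch. All inputs are the rounded figures quoted above (1.2 million students, four rows, four cells per row); the variable names are mine.

```python
total = 1_200_000            # all international students, 2024-5 (rounded)
rows, cells_per_row = 4, 4   # the 4x4 layout of the square

per_row = total / rows               # students per row
per_cell = per_row / cells_per_row   # students per cell
cell_share = per_cell / total        # one cell as a share of the total

# India: one full row plus one cell of the second row; China: the rest of the top half
india_share = 1 / rows + cell_share
china_share = 0.5 - india_share

print(f"per row: {per_row:,.0f}; per cell: {per_cell:,.0f} ({cell_share:.2%})")
print(f"India: {india_share:.2%}; China: {china_share:.2%}")
```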

The bottom two rows show a classification of countries into eight regions with roughly equal contributions: South Asia (excl. India), East Asia (excl. China), Europe, Canada+Mexico, Southeast Asia, Middle East, Africa and Americas (excl. Canada+Mexico).

As an extra, I also show the relative sizes of Canada vs. Mexico.

The challenge of visualizing complex datasets like this one is to pick a problem of manageable size, and then to distill the stories contained in the data.

(Note that I obtained data from OpenDoorsData.org, which is a different source than what the Times used. As a result, I have full data on Canada and Mexico.)

MTA acknowledges OMNY defects

Kaiser Fung — Fri, 05 Dec 2025 09:59:52 -0500

MTA (New York's subway operator) is poised to phase out the old swipe cards by the end of the year. I previously wrote about many issues with the new OMNY chip card (here, here, and here). Recently, MTA has acknowledged these issues. Yet, they will retire the old system without fixing these problems!

First up. The OMNY card uses a tablet for scanning, and yet the spacious screen real estate is wasted without showing riders useful data: not how much is being charged for the trip, not a list of recent charges, not how much value remains on the card. According to this news report (link), MTA has raised the "possibility" of showing remaining balances. I want to be a fly on the wall to hear the opponents of displaying the data. The old swipe card system using the tiniest screen still managed to show such data.

Second, many riders complained about the lack of "visibility and transparency" relating to free rides. Absolutely agree. The real problem, as I explained in two blog posts (here and here), is the mind-numbingly complex new method of rewarding free trips. The PR agency decided to dumb down the math, which compounds the problem because what they are promoting on the trains is a lie. They can't possibly be computing the free rides the way they are described to the public.

In a prior post, I guessed at what the real method of rewarding free rides is. While that method gets the job done, it is difficult to explain, and impossible for riders to audit without lots of data.

Comparing the OMNY way to the old swipe cards highlights the problem. Riders used to pay upfront the fixed fee for a 7-day travel card, and after the purchase, they could do as many rides as they like, without a care. Now, they don't know what's going on. Nevertheless, we are relieved that the MTA heard the correct feedback: "riders...want... some help building a little trust that this new unlimited ride fare cap is giving them free rides".

They have committed a big rookie mistake of marketing. When you're giving customers a discount or freebie, you'd better make it super obvious what they are getting.

The first proposed "solution" doubles down on the opaqueness - they are asking riders to spend time going to a website to inspect their historical trips. How is this better than the old swipe-card system, in which riders know at the turnstile that they just received a free ride without needing to do anything else?

In fact, I did a transfer from subway to bus today, and I had no idea if I was charged once or twice. (The transfer to bus should have been free.) If I used the old swipe card, I'd have been told right after the swipe that the ride was a free transfer. With OMNY, the same green light greeted me whether or not I was transferring.

Say Jon without the h in Chinese

Kaiser Fung — Mon, 01 Dec 2025 09:44:16 -0500

When you speak to any of the so-called "smart" devices, they can hear you, and perform tasks as you request them. One of the key components of such an application is voice-to-text software. There are many nuances that trip up such software. One puzzle is homophones: since John and Jon are pronounced the same, how can the "smart" device decide which one was spoken?

Humans encounter the same problem. We make the intention clear by saying "John with the h" or "Jon without the h". How does this issue arise with Chinese names?

My friend Ray V. sent me to a nice data visualization project by Liuhuaying Yang tackling this tricky subject.

The situation with Chinese names is even more complex. Chinese names are made up of ordered characters (typically one, possibly two, characters for the surname, and typically one or two, possibly three, characters for the "given name"). The surname is written before the given name.

Each character is a single syllable. Homophones are numerous. The designer illustrates this as a tree:

This "li" tree contains forty "fruits," each being a Chinese character sharing the same sound. If all one has is "Li," it could be any of these characters (of course, some are more likely than others). Thus, Li has to augment "li" by saying something like "the 'li' as used in pear." Jon without the h.

The situation is a bit better if the name is spoken out loud, because the "tone" is heard. Mandarin Chinese uses four tones, indicated by the color of the fruit in the tree above. Only three tones appear there, so apparently the first tone of "li" (shown in pink) is rarely used in Chinese names. In written text, the tone indicator is usually dropped, making it much harder to figure out which of these "li"s is the right one.

The canopy of the tree casts a shadow on the ground, the size of which encodes the difficulty of the puzzle. According to their statistics, the 40 "li"s show up in given names with a popularity of 38 per thousand people. If you look more closely, there are two shadows. The thicker shadow is related to surname usage while the thinner one, usage in given names. One of the "li"s is one of the top five surnames in China, and so the thicker shadow is the outer one (76 per thousand).

The writeup neglected to explain the two shadow rings. So, let's find one that has the opposite characteristic as "li" for comparison.

"Rong" is rarely a surname so the thicker shadow is right at the base of the trunk while the thinner shadow related to given names is more visible. Interestingly, only one of the four tones appears in this "rong" tree.

Returning to the "li" tree, let's analyze "Li" as a surname.

The character in green is the second most common surname in China, but there are about a dozen other characters that can be someone's family name. Unsurprisingly, many of these characters only show up in given names. Because they are using the tint of the color to show relative popularity, we don't really know the drop in popularity relative to the green character (it's a huge drop-off).

The value of this data visualization project is in structuring and presenting the data in a way that engages readers. This is a project that keeps readers focused on the trees, while losing themselves inside the forest, hopefully at will.

Enter the forest of Chinese names here.

Light entertainment: Pi-orities

Kaiser Fung — Wed, 26 Nov 2025 10:11:59 -0500

Screaming for attention in my twitter feed: Attila's post in which he dug up loads of pie charts from old Brazilian publications.

It's Thanksgiving week in the U.S. Supermarkets sell lots of pies. Enjoy!

More than a penny

Kaiser Fung — Fri, 21 Nov 2025 09:07:47 -0500

Farewell to the penny.

The current U.S. administration has decided to get rid of the "penny."

One of the reasons cited by many reports of the penny's demise is that it costs more than a penny to make a penny. Here's a quote from AP (link):

“For far too long the United States has minted pennies which literally cost us more than 2 cents,” Trump wrote in an online post in February. “This is so wasteful!”

This is an example of a false friend that sounds reasonable, but in fact the reasoning buckles.

You can see this by asking: does it cost $100 to print a $100 bill? Should it cost $100, or anywhere close to it?

The cost efficiency of money-printing must be judged in aggregate; one shouldn't pick out one unit of currency and analyze it separately. The system is set up so that the printing of $100 bills subsidizes the printing of pennies. By that metric of "wastefulness," a lower denominated currency is going to be wasteful relative to a higher denominated one.

A better reason to get rid of the penny is that inflation has rendered it useless. Nothing can be bought for a penny; very few items can be had for even one dollar these days in the U.S. So, its death is better attributed to loss of utility.

Avinash's scoring rubric for data visualization

Kaiser Fung — Tue, 18 Nov 2025 09:04:33 -0500

Here's the always-entertaining Avinash on data visualizations (link).

tldr; He presents a (pseudo-)"algorithm" for great data visualizations. His scoring rubric consists of eight items: time to insight, effort to comprehend, trust, hierarchy, logic, nuance, no gimmicks, and influence.

All of these should be familiar to Junk Charts readers. I like to express the first two in terms of a "return on effort" metric. See this post. It's not that every graphic that requires a long time to process is bad; the issue is when we expend the effort but don't receive the reward.

The last metric ("influence") is a very high bar. It's something we dream of, but rarely achieve. Worse, it may be easier to attain influence by deception using flawed graphics.

Avinash then analyzes four infographics that each explain Covid risks to illustrate his scoring mechanism. (The post was originally published during the Covid era.)

Since I included xkcd's cartoon up top, let's take a closer look. Like Avinash, I'm treating it as a data visualization, which was not the intention – so be warned.

We are looking at a grid. Based on the color scheme, it's a 4x4 grid. There's something of a scatter plot living on this grid. Think of each "dot" in the scatter plot as a text box. Each text box contains an activity. Each activity is rated on two dimensions: Covid risk, and non-Covid risk.

These axis labels are concise but imprecise. "Covid risk" really means the risk of catching Covid while doing said activity while "non Covid risk" signifies the general risk(s) of said activity other than catching Covid. For example, "staying home" has negligible risk of catching Covid (assuming there isn't an infected family member), and "staying home" presents low risks in general to someone, even if we ignore Covid risk (top left corner). By contrast, "singing in the church" is not typically regarded as a risky activity, but during Covid, it was a super-spreader event (top right corner).

This leads us to one avenue to consume this infographic. The diagonal going from top left to bottom right represents status quo: activities that didn't change in risk profile due to Covid. Our attention should be drawn to the top right corner, where those activities have elevated risks of catching Covid, relative to doing them prior to the pandemic. As a matter of curiosity, the activities shown in the bottom left corner are ironically less risky during Covid than prior.

Top right corner: singing in church, going to a restaurant, going to a bar, going to a party, attending in-person classes, etc. are not usually considered dangerous but during Covid, that was how people got infected.

Bottom left corner: bungee jumping while doing sword tricks, going down a waterslide on an electric scooter, running and sliding headfirst into the pins at a bowling alley, etc. For these activities, the risk during Covid was rated lower than prior, probably because many of these recreational centers were closed, and they don't involve crowds of people.

xkcd actually makes a subtle point that isn't conveyed in the other infographics: that the risk profiles of some activities changed dramatically during a pandemic.

I can't figure out the color scheme of the graphic. The green, yellow, orange, and red colors correspond to the distance from the top left corner ("origin"), which represents low Covid risk and low non-Covid risk.

Take the red boxes. They show activities that have either high risk of catching Covid or high non-Covid risk (or both). The latter segment includes some activities with low Covid risk. It's confusing.

Though I said above the form of the plot is that of a scatter plot, I really should make a clarification.

For the notion of a "scatter" or "cluster" does not exist. What xkcd did is to fill the entire grid with evenly spaced data. The data are made up to represent all points in the grid; the density of the data does not vary, and as such, they do not contain any statistical meaning, unlike the usual scatter plots.

We aren't getting $2,000 checks

Kaiser Fung — Fri, 14 Nov 2025 09:29:43 -0500

The $2,000 per person checks "funded by tariffs" do not make sense.

The math doesn't work. For simpler math, I'd take the U.S. population to be 300 million. Multiply 300 million x $2,000, and we get $600,000 million, or $600 billion.

Are there $600 billion of tariff money to spend? Of course not.

The U.S. government collected about $200 billion in tariffs in 2025 (link) but wait, a chunk of that number comes from tariffs that existed before. The incremental Trump tariffs amounted to ~$120 billion.

After four years, they will still be more than $100 billion short.

This is assuming every dollar of "revenue" can be spent: there is no cost of setting up and administering the tariff system.

(Note: If they distributed the whole $200 billion, instead of the incremental $120 billion, they must somehow raise $80 billion from somewhere, because that money would have been spent, since the U.S. government runs a gigantic deficit.)

Sure, they already said they will exclude "high earners". We can turn this around. Let's say they get $500 billion to spend. That works out to $500,000 million / $2,000 = 250 million checks. Around 20% of the population will not get them.
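The back-of-envelope numbers so far can be laid out in a few lines. Every input is one of the rounded figures used above (300 million people, $2,000 checks, ~$120 billion of incremental tariffs per year, a four-year window, a $500 billion budget); holding the annual tariff take flat over four years is a simplifying assumption.

```python
population = 300_000_000   # rounded U.S. population, as above
check = 2_000              # dollars per person

cost = population * check                    # total bill for checks to everyone
incremental_per_year = 120_000_000_000       # incremental tariff revenue (approx.)
years = 4                                    # assumed flat over one term
collected = incremental_per_year * years
shortfall = cost - collected

budget = 500_000_000_000                     # the "let's say" figure above
checks = budget // check                     # number of $2,000 checks it can fund
excluded_share = 1 - checks / population     # share of people left out

print(f"cost: ${cost / 1e9:.0f}B, collected: ${collected / 1e9:.0f}B, "
      f"shortfall: ${shortfall / 1e9:.0f}B")
print(f"checks funded: {checks / 1e6:.0f} million; excluded: {excluded_share:.0%}")
```

The excluded share works out to one in six, which is in the neighborhood of the "around 20%" quoted above.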

This calculation assumes they send checks out on the last day of Trump's term. Otherwise, if they issued checks in 2026, they would be spending money they didn't have.

They can also bait and switch, and ultimately give $2,000 checks only to the poorest Americans. Kind of like retailers who say "sale up to 50%" except you can find only one item at half price.

Even if we buy snakeoil from these people, the whole tariff business still makes no sense.

Because Americans will then spend the $2,000 "windfall" on groceries, and everything else with rapidly inflating prices - primarily caused by these tariffs!

Imagine this scenario: Uncle Sam forces McDonald's to pay additional taxes (i.e. tariffs) amounting to 20% of revenues. McDonald's immediately passes the additional cost to consumers, raising the Big Mac price by 20%. Uncle Sam takes the incremental taxes, and sends Big Mac coupons to Americans, who then go to McDonald's to buy the iconic burgers at the new higher price with the coupons covering the 20% tax-related price hike.

Nothing of value has happened. McDonald's profits stay the same since the incremental revenues from the price hike counteract the additional taxes paid to Uncle Sam. The fiscal situation of the U.S. government does not change, since all incremental tax receipts are immediately paid out. The consumers effectively do not feel the tariff-related price hike. So, what's the point?

Interestingly, McDonald's may like this arrangement because its revenue line gets inflated. I suppose the U.S. government can call this nominal GDP growth.

(Note: For reasons explained in this prior post, if McDonald's raised prices by the tariff rate, Uncle Sam's collected tariffs would in fact not be sufficient to cover the entire price hike.)

Let's be Finn-icky

Kaiser Fung — Tue, 11 Nov 2025 09:07:14 -0500

Long-time reader Aleks J. sent me to this very nice statistical literacy effort, produced by a team at Tampere University and Tampere University of Applied Science in Finland (link).

Mediawatch uses a background story that resonates with me: the modern world is flooded with lots of bad graphs and statistics, and we don't have enough time or bodies to debunk them. The technology of debunking is not keeping up with the tools of dispatching.

The Mediawatch player navigates around the website, interpreting various charts that pop up.

Here's one example:

This chart explains the start at zero rule for column charts.

Spread the word! It's an excellent project worthy of our support.

Govt shutdown shines light on missing data

Kaiser Fung — Mon, 10 Nov 2025 08:06:40 -0500

You haven't seen me on Tiktok, but I'm there now, thanks to FullTake who interviewed me last week about how the government shutdown affects data collection.

Here's the clip:

@fulltake
The longest government shutdown in U.S. history has stalled data reports from the Bureau of Labor Statistics. Former Columbia University program director and data science consultant, Kaiser Fung, tells us what will happen as a result of these report suspensions. #data

Statisticians don't like holes in the data, especially avoidable ones.

The government shutdown is punching holes in the datasets that underlie U.S. economic reports. These datasets rely on "shoe leather," staff conducting interviews about employment situation, or visiting retail stores to compile lists of prices. During the impasse, data collection has been suspended.

What happens after the government reopens?

We know that the furloughed employees typically get back pay, undoing the damage in one sense. However, data that weren't collected could not be replaced.

Prices displayed on store shelves one or two or three months later aren't necessarily the prices during the shutdown. When it comes to employment, it is possible to ask someone how many hours they were working several months ago. But such replacement data introduce recall bias. The more unsteady is a person's employment, the more inaccurate his/her answer. In fact, anyone with a steady job isn't contributing to recall error.

Alternatively, BLS can apply statistical methods to fill data gaps. Think of these fillers as part data, part assumptions. The most famous simple backfill method is "mean imputation," which is a jargonistic way of saying "replace missing values with the average value of the non-missing." Backfilling is typically biased toward maintaining the status quo, because the most common – and least assailable – assumption is that the future replays the past. This assumption is likely to misfire in light of high economic uncertainty.
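As a minimal sketch of mean imputation (the textbook method, not a claim about what BLS actually does), with a made-up monthly series in which `None` marks the months that weren't collected:

```python
def mean_impute(series):
    """Replace missing values (None) with the average of the non-missing values."""
    observed = [x for x in series if x is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if x is None else x for x in series]

# Hypothetical monthly readings; the two None entries stand for shutdown months
monthly = [2.0, 4.0, None, None, 6.0]
filled = mean_impute(monthly)
print(filled)  # [2.0, 4.0, 4.0, 4.0, 6.0]
```

Notice how the fillers sit at the historical average: whatever actually happened during the gap, the imputed series says nothing changed.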

The government statisticians can elect not to fill in the gaps. This is an act of passing the buck because analysts who use these data series would then have to prepare their own filling materials.

How will any of this affect you and me?

Here's one way. The CPI is used by the government to determine cost-of-living adjustments for Social Security payments. Similarly, employers may use CPI to figure out annual pay increases.

Let's say BLS economists backfilled the missing values caused by the pause in data collection. These fillers mostly reflect assumptions as there are few, if any, data. The key assumption is likely rolling forward the status quo. If the inflation trend continues, then we would have a few months in which the CPI is under-estimated. This could lead to lower-than-warranted cost-of-living adjustments. (Imagine, for example, the adjustment formula is based on an average of some number of historical monthly inflation figures.)

The way of the statistician

Kaiser Fung — Thu, 06 Nov 2025 08:57:27 -0500

What do statisticians do? A lot of people seem not to know.

If you are curious to know, try reading Andrew's post about Tuesday's local elections in the U.S. (link) Living in the northeast, we were served a flurry of late-breaking reports, claiming that the New York City mayor's race was tightening (link), and the New Jersey governor's race was a toss-up (link, link). Andrew's analysis would suggest comfortable wins for the Democratic candidates in each case; and since I'm writing this post after polls closed, I can report that his findings weren't far from the actual outcomes.

How did Andrew determine that the Democratic candidate in each case has an 80-90% chance of winning?

The starting point is the poll numbers. People are asked which candidate they intend to vote for in each election. These data are then converted into vote shares. There are multiple pollsters, and each runs periodic polls so we have a dataset consisting of a series of vote share values for each candidate in each election.

Here is New Jersey Democrat Mikie Sherrill's vote-share series: 56%, 51%, 51%, 54%, 55%, and 55%. Each value came from a different poll. The average vote share is 53%. Crudely, one predicts that Sherrill will win 53% of the election-day votes.

Statisticians don't like that answer. A moment's thought should convince you that the ability of prior polls to predict the election outcome depends, in part, on the variability in that vote-share series. Two of the six values sit uncomfortably close to 50%. How do we capture this observation quantitatively?

The canonical tool used by statisticians is the margin of error. Here, it's ±4.4%. (This number is derived from the standard deviation of the vote-share data series.) Notably, the left side of the margin of error dips below 50%.

On election day, Sherrill needs at least 50% of the votes to win. How likely is she to get half or more of the votes, given the series of poll numbers averaging 53%? We now appeal to the statistics gods.

The gods tend to a pool of "truth". The prior polls are random samples of this truth. Since they didn't measure every likely voter, each polling sample is different, so the series of polling averages exhibited variability. The margin of error quantifies such uncertainty: the probability that a poll average falls within 48.6% and 57.4% (that is, 53% ± 4.4%) is 95%.

That doesn't directly answer our question. Using the same tools, we can show that there is a 91% chance of obtaining a sample average of 50% or higher. In other words, the New Jersey Governor's election is not a toss-up as the media led us to believe.

Andrew also explained why he lowered Sherrill's chance to 84%. The margin of error only accounts for sampling variability – think of it as random error. As recent elections have shown, polls also suffer from systematic error, that is to say, some other factor causes most polls to skew in the same direction. Andrew modeled this source of error by adjusting the margin of error upwards, to ±6%, which leads to a downward revision of her winning probability. (She won handily with 56% of the votes, at the time of writing.)
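Both probabilities can be reproduced with a normal approximation. What follows is a minimal sketch, not Andrew's actual code: it assumes the margin of error equals two standard deviations and that the poll average is normally distributed around 53%. The function name `win_probability` is made up for illustration.

```python
from statistics import NormalDist

def win_probability(mean_share, margin_of_error, threshold=50.0):
    """Chance that the vote share exceeds the threshold under a normal model.

    Assumption: the margin of error equals two standard deviations.
    """
    sd = margin_of_error / 2
    return 1 - NormalDist(mu=mean_share, sigma=sd).cdf(threshold)

# Sampling variability only: 53% average, +/-4.4% margin of error
p_sampling = win_probability(53, 4.4)

# Margin widened to +/-6% to allow for systematic polling error
p_adjusted = win_probability(53, 6.0)

print(f"{p_sampling:.0%} {p_adjusted:.0%}")  # prints "91% 84%"
```

Widening the margin of error spreads the bell curve out, which pushes more of its area below 50% and so lowers the winning probability.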

For the NYC mayor election, Andrew gave several reasons why he lowered Mamdani's chance of winning further. This election is a three-way race, while the above methodology uses the two-candidate vote shares, ignoring the Republican candidate's values. It's reasonable to assume that on election day, some of the voters who had intended to vote for the Republican would decide not to waste their vote, and most of them were expected to gift their votes to Cuomo.

By making assumptions about the size of this group of late switchers, and the amount of skew towards Cuomo, we can adjust the expected two-candidate vote shares between Mamdani and Cuomo. This maneuver only moved the average vote share by a couple of percentage points (because the Republican wasn't going to get that many votes). Even after inserting more uncertainty to account for more variability in a three-horse race, Andrew's analysis still shows Mamdani's chance of winning to be over 86%.

The prediction market Kalshi was heavily advertising at the bus stops in NYC last week. These displays consistently showed Mamdani's winning probability in the 80-90% range. As Andrew indicated, since the people betting on these markets had access to the same polling data, it's not surprising they arrived at a similar place.

There are some details I and/or Andrew shoved under the rug but these minor items need not bother anyone except statisticians.

For example, we effectively treated the values in each vote-share series as a different "random sample" from an underlying population. One can complain that these are non-random samples due to different pollsters, and different polling periods.

Another complaint may be that the series of polls is too short, only about six values per race. One can, in theory, fetch a longer series but there is a trade-off; polls conducted far from the election day are generally less reliable, and the further back, the more unreliable.

We used an "empirical" estimate of the sampling variability by computing the variability of the series of numbers. As the series is short, this estimate is error-prone. Nevertheless, it's better than the so-called "parametric" alternative, which results in a severe underestimate of the uncertainty. (This parametric estimate arises from a theoretical model.)

Finally, all steps above require assumptions. If one uses a different guess of how many Republican voters would shift their vote to Cuomo, the estimated vote shares would have been different. When it comes to assumptions, what's certain is that not making assumptions is the worst possible strategy. In this case, not making assumptions is the same as assuming that no Republican voters would vote for Cuomo.

Turnout tuneout

Kaiser Fung — Wed, 05 Nov 2025 08:44:03 -0500

I enjoy the scatter plot published by the New York Times team about New Jersey's elections (link; paywall).

On this plot, each dot represents a "township" (with at least 500 votes cast in 2021). The yellow dots depict "majority non-white" towns; based on the accompanying article, the driving force is Hispanic voters. The gray dots, unlabeled, show the majority white towns.

When the analyst classifies the towns in this manner, a clear pattern emerges. Almost all the yellow dots are found in the lower right quadrant while the gray dots cluster in the upper left of the chart. The data tell a compelling story; what is it?

The backdrop of the chart is two successive recent elections: the 2020 Presidential election, and the 2021 state Governor's election. The horizontal axis shows the vote margin in the 2021 election: the right side (of zero) represents the towns Democrats won while the left side (of zero) represents the towns Republicans won. It's not surprising that the Democrats are stronger in the majority non-white towns; almost all the yellow dots are on the right side of zero. (Much of the NYT article concerns the shift of Hispanic voters towards Trump in 2024 but this isn't the story of this scatter plot.)

The vertical axis shows the drop in turnout from 2020 to 2021. The change was dramatic, ranging from about -20% to above -60%.

The story: the vast majority of the gray dots lie above the yellow dots. This means that the tuneout was much more severe in the majority non-white towns relative to the majority white towns. Add to that, the Democrats' strongholds are majority non-white towns. So, the turnout of Democratic voters deteriorated much more than that of Republican voters. We expect the Governor's margin of victory to be much smaller than the President's.

Lurking behind this scatter plot are four quadrants, with most towns found in two of the quadrants.

It's easy to delineate the left and right sides: just use the 0% voting margin divider so that left are Republican towns while right are Democratic towns.

How about the top and bottom divider? Here, I can find the average change in turnout in the whole of New Jersey. This turns out to be 40% - 72% = -32%.

The location of the average turnout drop is a bit odd. It makes me wonder if NYT is using a different data source. I'd have expected the line to sit lower since the large dots are mostly below the line, and in addition, the pile of small dots also appears below the line.

What we just observed is an example of an "interaction" effect. The observed data result from the simultaneous operation of two effects. We cannot artificially impose "change one while keeping the other constant."

Effect 1 is the correlation between vote margin and race; majority non-white towns skew Democrat. Effect 2 is the correlation between vote margin and change in turnout; towns with larger turnout drops skew Democrat. Both effects are driven by the Hispanic voters so they happen simultaneously.

Let's see what we should observe if only one of the effects exists.

On the LHS chart, I assume that turnout is dropping everywhere about the same in 2021 so Effect 2 is absent. The relationship between Turnout Drop and vote margin is modeled as a flat line. Since Effect 1 is present, I expect the majority non-white towns to skew Democratic, and therefore the cluster of yellow dots is situated to the right of the cluster of gray dots.

On the RHS chart, I assume Effect 1 is absent, meaning that vote margin is not associated with race. (This is a thought experiment.) This assumption implies that the yellow and gray clusters must overlap, so that voting behavior does not depend on the majority race in these towns. If Effect 2 is present, e.g. if I assume that towns suffering higher drop in turnout skew Democratic, then the relationship between the two variables plotted is a negatively-sloped line.

The actual pattern is the combination of these two, which is what statisticians mean by an "interaction."

It's from the RHS chart that we can see why it's silly to impose "change one while keeping the other constant." To keep the other effect at bay, we have to assume that towns with majority non-white populations have similar voting margins as towns with majority white populations, a clear misrepresentation of reality. Said differently, many of the voters who are skipping the 2021 elections are the same voters who live in majority non-white towns, and we can't keep them in one column while deleting them from another column.
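The thought experiment behind the LHS and RHS charts can be sketched with made-up towns. Everything in this sketch is an assumption for illustration: the town counts, the cluster centers (+30 and -20 points of margin), the -0.3 tilt, and the noise levels are invented, not taken from the NYT data. It reports the slope of turnout change against vote margin, and the gap in average margin between yellow and gray towns.

```python
import random

def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    return sxy / sxx

def simulate(effect1, effect2, n=200, seed=1):
    """Return (margin, turnout_change, is_nonwhite) lists for n made-up towns."""
    rng = random.Random(seed)
    margins, changes, groups = [], [], []
    for i in range(n):
        nonwhite = i % 2 == 0
        # Effect 1: majority non-white towns skew Democratic (positive margin)
        center = (30 if nonwhite else -20) if effect1 else 0
        margin = center + rng.gauss(0, 15)
        # Effect 2: larger turnout drops go with more Democratic margins
        tilt = -0.3 * margin if effect2 else 0.0
        change = -32 + tilt + rng.gauss(0, 3)
        margins.append(margin)
        changes.append(change)
        groups.append(nonwhite)
    return margins, changes, groups

def summarize(effect1, effect2):
    """Slope of turnout change on margin, and yellow-minus-gray gap in mean margin."""
    margins, changes, groups = simulate(effect1, effect2)
    yellow = [m for m, g in zip(margins, groups) if g]
    gray = [m for m, g in zip(margins, groups) if not g]
    gap = sum(yellow) / len(yellow) - sum(gray) / len(gray)
    return slope(margins, changes), gap

lhs = summarize(effect1=True, effect2=False)   # flat line, clusters apart
rhs = summarize(effect1=False, effect2=True)   # sloped line, clusters overlap
both = summarize(effect1=True, effect2=True)   # sloped line, clusters apart
```

Only the last scenario puts the yellow cluster in the lower right and the gray cluster in the upper left, which is the pattern on the actual scatter plot.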

Light entertainment: home work

Kaiser Fung — Tue, 04 Nov 2025 10:49:47 -0500

Trick or treat seems to have died in the city. Glad to see it's alive somewhere.

(Tip from long-time reader Chris P.)

Big drops and big dots

Kaiser Fung — Mon, 03 Nov 2025 09:15:33 -0500

For some unclear reasons, the response rates for various instruments used by the Bureau of Labor Statistics (BLS) to measure the nation's well-being have been dropping. As the above chart shows, the CPS survey (which is used to measure the unemployment rate) is seeing 15% fewer responses than about 10 years ago.

The chart is unnecessarily busy. This combination of dots and line segments curiously elevates the dots above the lines. As a result, the chart spotlights the "noise" in the data series.

Two years later, someone at the BLS noticed this problem, and published a new design, which has definitely improved:

The dots are no longer there, and so they don't steal all of our attention.

Nevertheless, this revised design still lets the background noise drown out the signal.

Besides, you're involuntarily twisting your neck as you work out which color and which line is which survey.

In this revision, I put the dots back but push them to the background. I add a smoothed line for each survey that depicts the downward trend in response rate. The line labels are at the end of each line.
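One simple way to produce such a smoothed line is a centered moving average. The sketch below uses made-up response rates, and the function name and window size are illustrative assumptions, not necessarily the smoother behind the chart above.

```python
def moving_average(values, window=5):
    """Centered moving average; the window shrinks near the ends of the series."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Made-up response rates (%) with period-to-period noise around a downward trend
rates = [88, 90, 86, 87, 84, 86, 82, 83, 79, 81, 77, 78]
trend = moving_average(rates, window=5)
```

The raw series zigzags up and down, while the smoothed values fall steadily. That is the signal the dots should not be allowed to drown out.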

I didn't check why ATUS has zero response since 2024. It might just have been suspended.

How many gigs?

Kaiser Fung — Fri, 31 Oct 2025 08:41:14 -0400

During a recent interview, I was asked an excellent question: how does the gig economy affect the accuracy of our employment statistics?

The short answer is: it shouldn't have a large impact.

Nevertheless, a quick web (or AI) search suggests that certain experts have put up arguments claiming that official statistics under-count gig workers.

That's hard to believe.

Let's start with how BLS (Bureau of Labor Statistics) counts employment. The primary source is the Current Population Survey (CPS). A random sample of households is contacted, and asked a bunch of questions. The key question related to their employment status is whether they worked at least one hour during a so-called "reference week." Anyone who responded yes is counted as an employed person.

I did just say "person." Because the CPS survey - by extension, the unemployment rate - counts people, not jobs. It doesn't matter if a gig worker has five jobs; working just one hour during that week for any single employer is sufficient for him/her to be counted as employed. Thus, all gig workers who work one hour or more should already be counted.

Those people who allege under-counting make the following argument: they assert that some gig workers do not see their gigs as "work," and therefore when contacted by BLS data collectors, they would proclaim themselves unemployed.

As we are now years past the emergence of the gig economy, I don't believe the claim that gig workers see themselves as not working.

Conversely, there is an argument to be made that the transition into the gig economy may artificially inflate the number of employed people.

Take an adjunct professor as an example of a gig worker. The university has turned a single teaching job previously held by a single person into several adjunct teaching jobs held by different people. In fact, the administration probably hires an accountant whose job is to make sure that each adjunct professor doesn't work too many hours, as otherwise, the school must treat him/her as an employee with benefits.

Since only one hour of work suffices to qualify as employed, each of the adjunct professors counts as an employed person. This splitting of one job into several pushes the number employed upwards.
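The counting rule can be sketched in a few lines. The records below are hypothetical, and the function `count_employed` is my own shorthand for the one-hour rule, not BLS code.

```python
# Hypothetical reference-week records: (person, employer, hours worked)
before = [("prof_a", "university", 40)]
after = [
    ("adjunct_1", "university", 10),
    ("adjunct_2", "university", 10),
    ("adjunct_3", "university", 10),
    ("adjunct_3", "rideshare", 5),   # a second gig for the same person
]

def count_employed(records, min_hours=1):
    """Count distinct people who worked at least min_hours for any employer."""
    return len({person for person, _, hours in records if hours >= min_hours})

print(count_employed(before), count_employed(after))  # prints "1 3"
```

The second list has four job records but only three people, so the count is three: the survey counts persons, not jobs, and the split of one job into three still triples the number employed.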

Ten pie charts. Are you worried yet?

Kaiser Fung — Mon, 27 Oct 2025 09:35:14 -0400

Statista published this series of pie charts that illustrate results from a survey asking Americans what they are worried about (link). The survey question has 18 options, while the chart covers the top 10 issues. "Top" is defined by the proportion of respondents who ranked the issue as their topmost concern.

The chart form is small multiples of pie charts. Each pie chart addresses a specific issue, and contains one data point – the proportion of people who ranked that issue as their top worry. The data series is encoded twice, first in the area (or angle) of the sector, and also in its color.

This chart fails our self-sufficiency test. If stripped of the data labels, we are left with:

It takes some effort to figure out the proportion of each sector. It also shows the minimal contribution from the use of color. Using color alone, no reader can possibly learn the data in any of these pie charts.

It's not clear to me that the color assignments were applied using a formula. The change between Immigration and Unemployment on the second row is quite noticeable, and feels larger than the change between Health and social security and Poverty on the first row. Yet, the former is a fraction of a percent while the latter represents 3 percent.

Here is a bar chart showing the same data:

I like to extend the axis to the full 100 percent, making it easier to judge the length of each bar, as a proportion of the total.

I chose only two shades because the gaps between successive data points are modest for the most part.

The bar chart does not require printing the entire dataset to be understood.

Peak social media? Depends on how you measure it

Kaiser Fung — Mon, 20 Oct 2025 08:00:52 -0400

John at the Financial Times produced the above graph, with the headline "Time on social media peaked in 2022, with young people cutting back first" (link).

What does he really mean?

The first part probably refers to the first chart on the left, showing the aggregate population in the study. The study measures the "average number of hours spent on social media per day." It's a double average: average per day, and average per person. He's describing the peak in the curve observed in 2022, followed by a downtrend in the last two years.

The second part probably refers to the set of line charts starting from the second-left chart. Here, he disaggregated the dataset by age group. Each line represents social-media usage of people in the labeled age group. The first three lines exhibit dips at the end, similar to the aggregate while the last two age groups do not show a downward slope. This is summarized as "young people cutting back first".

There is more we can see from these very nicely drawn charts. For example:

The older someone is, the less time they tend to spend on social media (the entire lines shift lower as we move left to right)
From 2014 to 2020 (roughly speaking), the average person increased their social-media usage, and so did the average person representing each of these age groups (every line rose steeply during that period)

"Young people are cutting back first." This statement is quite ambiguous. An unspoken corollary is that older people are not yet cutting back. The statement also implies that young people are reducing usage as they become older, as there can be no other way.

So, there are two distinct issues to consider:

young people reduce social-media usage as they age (e.g. have family commitment, less free time)
young people of today use less social media than young people of the past (e.g. no longer "cool", have other pastimes)

The first point concerns the aging process of individuals while the second point suggests a cultural or generational factor. These are different things.

The FT charts address the second point only. The reduction in social-media usage is observed between one "generation" and the previous.

In order to study the first point, one needs a "cohort analysis" in which the cohort is defined by birth year. Think about tracking individuals as they age.

In the FT analysis, the building block is the single-age subgroup, e.g. 18 years old. Between 2014 and 2024, anyone who's 18 years old belongs to the 18 group; these people have birth years from 1996 to 2006.

A lot changed between 2014 and 2024. In 2014, Instagram had 200 million active users and was a photo app; by 2024, it claimed over 2 billion users, and had pivoted to videos. In 2014, Snapchat and WhatsApp were just gaining traction; TikTok hadn't even launched; the concept of an influencer was novel. Being 18 in 2024 is very different from being 18 in 2014!

When analysts form a subgroup, we are claiming that people in the subgroup can be treated as "alike." Sometimes, this isn't the case. The FT analysis further combines several single-age subgroups into larger age groups. For example, the 16-24-year-old age group contains nine single-year subgroups, and together, the people in this age group have birth years spanning 1990 to 2008.
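The birth-year arithmetic is worth spelling out. This is a small sketch, assuming birth year is approximated as survey year minus age; the function name is illustrative.

```python
def birth_year_span(min_age, max_age, first_year=2014, last_year=2024):
    """Earliest and latest birth years of people observed in an age band.

    Birth year is approximated as survey year minus age.
    """
    earliest = first_year - max_age   # oldest member in the first survey year
    latest = last_year - min_age      # youngest member in the last survey year
    return earliest, latest

print(birth_year_span(18, 18))  # prints (1996, 2006)
print(birth_year_span(16, 24))  # prints (1990, 2008)
```

Widening the age band from a single year to nine years stretches the span of birth years from 11 to 19, which is why "alike" becomes a harder claim to defend.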

Alternatively, we can build birth year cohorts. The most recent cohort that contributes data to John's dataset consists of those born in 2008. These youngsters reached 16 years old in 2024, the last year of data collection. This group is not very informative as we only have one year of observations. One year does not make a trend. We don't yet know how much social media they will consume when they reach 30.

A more interesting cohort consists of those born in 1998. In 2014, they were 16 years old and thus, became part of this study. By 2024, they were 26 years old. The study followed them for 10 years. They contribute to our understanding of social-media usage of people between 16 and 26 years old.

A birth-year cohort analysis directly addresses the effect of aging. Generational change can be captured by modeling level shifts between different birth-year cohorts. For example, the curve for those born in 2000 should probably start at a higher level of usage and possibly remain at a higher level than the curve for those born in 1970.

In this last section, I attempt to illustrate the ties between the age cohorts and the birth-year cohorts, using FT's charts. I can only tell part of the story because the aggregation has wiped out some of the necessary data.

Let's consider a 20-year-old in 2014. This person sits in the middle of the 16-to-24 age group. According to the chart, the average such person consumed about 2.1 hours per day of social media in 2014.

Five years later (2019), this person was 25 years old, and therefore, his/her data fell into the 25-34 age group. In 2019, that age group on average used 2.6 hours per day of social media. Another five years later (2024), the person is 30 years old. Still part of the 25-34 age group, s/he is still associated with 2.6 hours per day (we can't say anything more unless we have single-age cohorts.)

Let's take another group (55-to-64), for which the middle age is 59.5. In 2024, the average usage was around 1.6 hours per day. Five years earlier, in 2019, the 59.5-year-old was 54.5 years old, which means the data fell into the 45-to-54 age group. The average for that age group in 2019 was also 1.6 hours per day. Another five years earlier, in 2014, the 59.5-year-old was 49.5. The age group remains the same; the usage level in 2014 though was much lower, at 1.1 hours per day. (Once again, we need single-age cohort data to know if this aggregate number is representative or not.)
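The two walk-throughs follow the same recipe: fix a person, step through the years, and look up which age band they land in. Here is a sketch of that lookup. The band edges come from the chart labels mentioned above, except 35-44, which I am assuming from the pattern; the function names are illustrative.

```python
# Age bands as (low, high, label); high is exclusive. 35-44 is an assumption.
AGE_BANDS = [(16, 25, "16-24"), (25, 35, "25-34"), (35, 45, "35-44"),
             (45, 55, "45-54"), (55, 65, "55-64")]

def age_band(age):
    """Return the label of the band containing this age, or None if outside."""
    for low, high, label in AGE_BANDS:
        if low <= age < high:
            return label
    return None

def trace(age_in_2024, years=(2014, 2019, 2024)):
    """Band a person falls into in each year, given their age in 2024."""
    return [(y, age_band(age_in_2024 - (2024 - y))) for y in years]

print(trace(30))    # the 20-year-old of 2014
print(trace(59.5))  # the midpoint of the 55-64 band in 2024
```

The first trace yields 16-24, 25-34, 25-34 and the second yields 45-54, 45-54, 55-64. Each person switches bands only once in ten years, so most of what we read off the charts for them is the band's average, not their own trajectory.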

Can measure vs should measure

Kaiser Fung — Thu, 16 Oct 2025 08:55:57 -0400

Enrico Bertini has been putting out a lot of good content lately. In this post (link), he advises that "no amount of design or data processing skills can overcome problems inherent in the data due to the way it was generated and collected."

Readers here will be familiar with this sentiment. This is one of the reasons why I created the Junk Charts Trifecta Checkup. Under this framework, the problem raised by Enrico is identified as a "Type D" chart, defined as a chart that deploys a good visual design to answer a well-posed problem but fails as data visualization because it does not convey the meaning of the underlying data.

Enrico goes on to delineate the modes of failure:

Let's walk through these concepts, while mapping his terminology to other similar concepts you may have heard of. For simplicity, I'm going to imagine below a dataset that measures consumer behavior, so the unit of measurement is a person.

Representation gap – this happens when the observed sample of people does not fully represent the "population" of people that the analyst intends to describe. This is otherwise known as sampling bias but I do like Enrico's alternative phrasing. Depending on your discipline, you may also call it selection or filtering effect.

Accuracy gap – this is familiar to statisticians as measurement error, defined as the gap between the (unobservable) true values and the observed values. If we are measuring someone's weight, it may be the case that the scale does not give accurate measurements - the inaccuracy can be a systematic bias (e.g. it starts not from zero) or the scale can suffer from a high margin of error, or both at the same time.

Interpretation gap – this issue is sometimes called low construct validity. The problem is what is measured isn't quite what one wants to measure. This happens in real life because we measure what we can measure, which is not necessarily the same as what one ought to measure. Note that this issue exists even where there is no measurement error (or accuracy gap). The act of using a "proxy" measure creates this interpretation gap.

Consistency gap – I see this as a component of the measurement error mentioned above. This has to do with variability across multiple measurement devices, for example, different interviewers when conducting surveys. In addition to variability within each instrument, there may be differences between instruments.

I have always been fascinated by how a device like the Apple Watch measures a user's blood oxygen level. Below is an excerpt of what AI tells me about how Apple's technology works:

**Infrared Light and Red Light**: The watch features a specific set of sensors that emit both infrared and red light. These lights penetrate the skin and are absorbed differently by oxygen-rich (arterial) and oxygen-poor (venous) blood.

**Photodiodes**: These sensors then measure the amount of light that is reflected back to the watch. By analyzing the reflected light, the device can calculate the proportion of oxygenated versus deoxygenated blood.

This answer is actually incomplete because what shows up on the Apple Watch is a single percentage that presumably ranges from 0 to 100%. It is not a proportion.

So, I asked a follow-up question:

Explain this more. "By analyzing the reflected light, the device can calculate the proportion of oxygenated versus deoxygenated blood."

The key bit is this:

**Ratio Analysis**: The device analyzes the ratio of red light to infrared light that has been reflected back.

- High levels of infrared light indicate a higher concentration of deoxygenated blood, while high levels of red light indicate oxygenated blood.

**Algorithm Implementation**: Using complex algorithms, the watch converts these measurements into a percentage of blood oxygen saturation, usually expressed as SpO2.

- Normal SpO2 levels typically range from **95% to 100%**; levels below this may indicate potential health issues.

Note where it mentions "complex algorithms". So, what the watch actually measures are the levels of red and infrared light reflected back. What the user expects to be measured is the blood oxygen saturation percentage. There is an "interpretation gap" because the algorithms take what is measured, and transform it into what we ought to measure; the value shown on the watch is an indirect, proxy measurement that inherently raises the question of construct validity.

Further, any device has measurement error. Some of this is caused by the user, e.g. whether the watch is tightly fitted on the wrist. Even if we accept the proxy measure as sufficient, the observed values may still deviate from the truth. Add to this the inconsistency from one Apple Watch to the next.

I further asked AI what methods of measurement are used in a clinic.

The answer describes two options: a non-invasive method that uses fingertip sensors and relies on "complex algorithms" similar to the Apple Watch; and an invasive procedure (arterial blood gas analysis) that draws blood out, and more directly measures the blood oxygen level (along with other tests).

In this case, what ought to be measured can be measured more directly. The trade-off is a less accurate but more available method.

Putting the ladder of abstraction into practice

Kaiser Fung — Tue, 14 Oct 2025 08:41:45 -0400

In a previous post, I mentioned Gelman's advice to use the ladder of abstraction to explain complex charts (link).

The complex chart is this New York Times's graphic showing the cloud cover at different locations along the path of totality during 2024's full solar eclipse.

Using this example, I drafted how one can construct a series of charts that build up to this line chart.

Start with the least abstract graphic that stays closest to the human experience of our world:

Our sense of space is tied up in maps, even though maps are also abstractions, subject to the specific distortions of the chosen projection scheme. This map presents the background story of the moon's shadow sweeping along the path of totality from the bottom left to the top right of the map.

At the same time, the map also foreshadows which parts of the map are less relevant to the topic at hand. This prepares the readers for when I drill down to specific locations in forthcoming graphics.

The map also couples space and time. Time is provided in data labels at specific locations. Time increments as one moves along the path of totality.

Next, I pick a specific location (Rochester, NY) to demonstrate how an entire experience is reduced to a point in time.

We focus our attention only on the center of the path of totality, as depicted by the added red line. Recall that in the interactive map in the original New York Times's article, the designer lets the reader play with the top slider bar to control the moon's shadow as it moves through time and space.

Here, I nail the moon's shadow to the middle of the observation period, presumably the best time for viewing. This graphic explains how each location will eventually become a single point, which corresponds to a geographical location and a point in time.

In the above map, I'd have left out the cloud cover data, holding them for later.

Then, I make use of the historical cloud cover map to introduce the new variable of visibility.

A key takeaway is the color legend, with yellow indicating better visibility. I shall reuse this color scheme.

As a next step, I prepare readers for the shift from maps to a line chart. This step significantly ratchets up the abstraction.

I reduce the above map to a line chart for Rochester.

The color scheme is retained (not quite, as I made 3 not 7 levels of color but you get the drift).

The previously developed idea of looking only at the expected time of totality is repeated to focus attention on a single point on this line chart. This sets the reader up to recognize that Rochester will soon feature as a single dot in our final chart.

Lastly, we arrive at the final graphic, which is the New York Times's line chart – with an added flourish of yellow inspired by that historical cloud cover map.

Instead of being plunged directly into a complex graphic, I have utilized the ladder of abstraction to present important components, one at a time, building up to the final abstraction.

P.S. I haven't found an elegant way to explain turning the line chart sideways so a small gap persists.

Applying the band-aid, missing the wound

Kaiser Fung — Thu, 09 Oct 2025 08:27:28 -0400

Long-time reader Chris P. sent me the featured column chart, through a tweet. The original was published in an Axios article (link).

The Axios author sourced the data to a CSIS report (link), which means the original original is this column chart:

The Axios chart is a subset of this CSIS chart, after dropping the last three categories of Jihadist, Ethnonationalist, and Other. This is an unorthodox distillation of a chart; typically, one would combine those three categories into "Others" and keep it on the plot. By removing them completely, the reader may mistakenly assume that the column heights represent the total count of "terrorist attacks and plots."

Amusingly, the CSIS report is headlined "Left-wing terrorism...." – the story being pushed is that if one looks at 20 years' worth of history, and predicts what might happen in the next few years, one should ignore all the data except the last 6 months.

In a report centering on left-wing terrorism, the data in the stacked column chart are ordered in such a way that the left-wing components (dark blue) sit in the middle of the stack. This means that every such block starts at a different base level, making it difficult to compare heights.

The designer recognizes this difficulty, and uses an interactive element to overcome it. Clicking on one of the blocks pushes all the other blocks to the background:

I'd call this a "band-aid." It doesn't cure the malady but it is an improvement.

The Axios designer applies a different solution – reducing the number of categories plotted to two. With two categories, the subject of order becomes even more prominent: the category shown at the bottom has blocks that start on the horizontal axis while the category shown above is given floating blocks that start at different levels.

Amusingly, Axios then places the left-wing data in the floating blocks, which means it applied the band-aid but missed the wound!

This is what the same chart looks like when the order of the categories is reversed:

The trend in left-wing attacks and plots (orange) is much clearer.

Here, I combined the three omitted categories into "Others":

Notes on vibe coding 6

Kaiser Fung — Tue, 07 Oct 2025 09:36:11 -0400

In the last note about my vibe coding experiment (link), I mentioned a spreadsheet that I had asked the AI to build, used to keep track of all the posts and images from the old Typepad blog. This little device proves pivotal to the project.

One piece of information the spreadsheet captures is the name of each image file. Typepad's practice was to assign a random string of characters to name each uploaded image, erasing the user's file names while doing so. I had to reverse the process and give each image file a unique name. Later, while fixing the images scraped as HTML files, I asked GPT5 to replace the *.html file names with the corrected *.jpg file names.

Further, when - because of Ghost's practice of not overwriting any uploaded image files - I had to append "-1" to any image transferred twice, I asked GPT5 to update the spreadsheet with the revised *-1.jpg file names.

So far, so good. I was quite proud of my migration strategy.

Then, a flash of panic. Suddenly, I realized the spreadsheet's design has a serious flaw.

It dawned on me that there were duplicate file names. How is that possible, since I explicitly designed the index to be unique?

The reason is obvious in hindsight.

Long-time readers may remember that I published two separate blogs on Typepad - Junk Charts (data visualization), and Big Data, Plainly Spoken (book blog). On Ghost, the two blogs merged into one stream. All of the data coming out of Typepad are specific to each blog. In my workflow, I made one set of code and applied it to both, sequentially.

This means I actually have two spreadsheets, one for each blog. The index on each spreadsheet starts from 1, and as a result, there are collisions of some file names across the two spreadsheets.
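
The collision is easy to reproduce in a few lines. The sketch below assumes a hypothetical naming pattern of the form img-00001.jpg and made-up image counts (the actual file-name pattern is not given in this post). It also shows one possible remedy, a per-blog prefix; this is an illustration, not the migration code itself.

```python
def image_names(count, prefix=""):
    """Name images from a per-blog index that starts at 1 (hypothetical pattern)."""
    return [f"{prefix}img-{i:05d}.jpg" for i in range(1, count + 1)]

# Two spreadsheets, each indexed from 1 (counts are made up for illustration).
junk_charts = image_names(5)
book_blog = image_names(3)

# Every name on the shorter list also appears on the longer one.
collisions = sorted(set(junk_charts) & set(book_blog))
print(collisions)

# One possible remedy: a blog-specific prefix keeps the two name spaces apart.
junk_charts_fixed = image_names(5, prefix="jc-")
book_blog_fixed = image_names(3, prefix="bb-")
print(set(junk_charts_fixed) & set(book_blog_fixed))
```

Starting the second blog's index where the first one ends would work equally well; the only requirement is that the two sets of names never overlap.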

Worse, this flaw has already broken some posts that have been migrated. Sadly, checking a few posts confirmed my hunch: these posts contain text from one blog and images from the other!

I had a big mess on my hands.

First, I had to re-index all the images on the book blog so that their names do not overlap with those of Junk Charts images. Next, I had to round up the file names that collided, and find the posts showing these wrong images.

So far (and for this reason), I have held off a large-scale migration of posts, so the number of corrections was moderate. But the path out of this mess was riddled with potholes.

Ghost has this weird policy of not overwriting files with the same name. If a post is showing the wrong image, this implies that the particular file name is duplicated across the two blogs, and the wrong one has been uploaded. The correct image may or may not have been moved yet. If the correct one exists on Ghost's servers, then it must have been given a "-1" suffix due to sending the same file twice.

If the correct image hasn't yet been given to Ghost, the situation is identical. If I upload that image now, it won't replace the incorrect file but will instead be renamed *-1.jpg.

So, to fix this mess, I asked the AI coder to find the posts that contain the collided images, and edit the image links to point to the *-1.jpg, instead of the *.jpg.
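
A minimal sketch of that kind of link rewrite follows. The post HTML, the image path, the set of collided names, and the helper name point_to_suffixed are all hypothetical; the actual AI-generated script is not shown in this post.

```python
import re

def point_to_suffixed(html, collided, suffix="-1"):
    """Rewrite references to collided image files so they point at the suffixed copy."""
    def fix(match):
        stem, ext = match.group(1), match.group(2)
        if f"{stem}{ext}" in collided:
            return f"{stem}{suffix}{ext}"
        return match.group(0)  # not a collided file: leave untouched

    # Match file names such as img-00002.jpg anywhere in the post body.
    return re.sub(r"([\w-]+)(\.jpg)", fix, html)

# Hypothetical post body with one collided image and one clean image.
post = ('<img src="/content/images/img-00002.jpg">'
        '<img src="/content/images/img-00009.jpg">')
collided = {"img-00002.jpg"}

fixed = point_to_suffixed(post, collided)
print(fixed)
```

Running the function a second time leaves the text unchanged, because img-00002-1.jpg is not in the collided set. That matters when a repair script may be applied to the same post more than once.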

What a relief when I visited those pages with incorrect images, and discovered that the fix was in.

After all these twists and turns, the spreadsheet(s) of the posts and images are still happily tracking everything. It's been a life-saver many times over.

An abstract chart only statisticians love

Kaiser Fung — Mon, 06 Oct 2025 08:47:35 -0400

In reviewing the New York Times's visual story anticipating the full solar eclipse in 2024 (link), I skipped over their most intricate chart.

This chart has the handiwork of a statistician. I wonder if non-statisticians would appreciate it.

The line chart looks simple but is very abstract. It is many layers removed from reality. The chart can't really stand on its own feet. It's better that readers see something less abstract first. (This follows Andrew Gelman's advice in his post on the ladder of abstraction.)

Such as this map from the NYT article, which was featured in my previous review (link):

This map is less abstract, showing the outline of North America, and the path of totality running from Mexico through the U.S. to Canada. It also shows the path's width.

In the statistician's chart, the path of totality has been straightened out and made vertical, running from top to bottom, instead of west to east. As the designer explains, the width has been simplified away, the vertical axis depicting only the center of the path.

The cloud cover data have also been stripped down. The interactive chart shows everything, including areas far away from the path of totality (both in space and in time).

The statistician's chart, once again, pares things back to the bare minimum. At each location, there is a single number for cloud coverage. Presumably, this is the expected cloud cover at the best time for observation in the highlighted region. The shades of gray have become a single percentage.

The point of this line chart is to remove all extraneous information, and make a beeline for the key correlation between cloud cover and space.

Time is also present, insofar as it is coupled with the vertical axis of space.

Statisticians like these types of simple charts that have been cleansed of extraneous information. But we often forget that simplicity begets abstraction.

The basic story of the data is that most of the places along the center of the path of totality would experience very cloudy conditions during the solar eclipse.

That cough I'm hearing may be you hinting at me to get an eye checkup. Will the cloud cover really be that dense since there is still quite a bit of white space in the chart?

It turns out that the right axis line has been curiously erased. As a result, readers may mistakenly take the right edge of the chart as its right axis. In the following chart, I restored the right axis, showing where the cloud cover hits 100%.

You can now see that the viewing conditions aren't great towards the east coast.

One last thing... if this line chart is your thing, one of the key design decisions is its orientation.

Here is what the chart looks like if we turn it on its side:

In Western cultures, it's more natural to have time running left to right (as opposed to top down). The path of totality also happens to go west to east which aligns with a left to right orientation. Higher levels of a key metric being shown higher up the chart is also natural.

There is a reason why the New York Times printed this chart sideways. They would rather that we don't have to turn our heads when reading the names of places.

It's a tradeoff. Which orientation would you select?

Notes on vibe coding 5

Kaiser Fung — Wed, 01 Oct 2025 09:01:24 -0400

I'm still refining my blog migration code, which is completely written by AI. In this vibe coding experiment, I refrain from meddling with the AI-generated code.

In the previous update (link), the initial script hit several snags, and I covered the URL matching mystery.

The other big problem I had to contend with concerned the file type of scraped images. Recall from Notes 1 that the Typepad export process does not save images, and so I asked GPT5 to produce a script that reads the Typepad export file, flags every image tag, catalogs and labels them, and then "clicks" on every link to download the images.

I was highly satisfied with the result. I had avoided a lot of tedious work, such as figuring out all the different ways Typepad may store images.

I was not that surprised to learn about edge cases that neither I nor the AI coder anticipated. In this case, Typepad sometimes serves up HTML pages with embedded image links instead of the image files themselves. I'm not sure why Typepad sometimes uses this method. It would have been transparent to readers because the images load automatically.

However, in such cases, the image-downloading script has saved an HTML file, instead of an image file. I discovered this when some of the migrated posts displayed broken image links. Ghost does not expect to find HTML files in an image block, and can't handle them.
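
A check along the following lines, run right after each download, could flag such files early. It is a sketch, not the script GPT5 wrote: the function name sniff_kind is made up, and it only recognizes a few common signatures (JPEG, PNG, GIF) plus the usual opening of an HTML page.

```python
def sniff_kind(data: bytes) -> str:
    """Guess from the leading bytes whether a download is an image or an HTML page."""
    if data.startswith(b"\xff\xd8\xff"):
        return "jpeg"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "gif"
    # HTML pages are text; ignore leading whitespace and letter case.
    head = data[:512].lstrip().lower()
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"

# Made-up byte strings standing in for downloaded files.
print(sniff_kind(b"\xff\xd8\xff\xe0\x00\x10JFIF"))
print(sniff_kind(b"\n<!DOCTYPE html><html><body><img src='x.jpg'></body></html>"))
```

Looking at the content rather than the file extension is the point here, since the scraper may save an HTML page under an image-like name.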

In modern coding, we expect to be able to "roll back" changes. In this case, I was hoping to undo the upload of images that included those useless HTML files. This then clears the way to redo the upload, after cleaning up the HTML files.

That would be too easy! Ghost does not provide a method to delete files. According to GPT, if a file is uploaded twice, the first file is still present while the second file is renamed "file-2".

Because the corrected images have different suffixes from the HTML files, I worked around this restriction. I could leave the HTML files stranded in the server, without ever using them.

I had yet to find the real image files associated with those HTML files.

A little foresight proved crucial at this stage of vibe coding. In my first ask, I wanted and received a spreadsheet that documents key information such as post index and title, image index and title, and so on. From this spreadsheet, it's quick to find all the posts that presented HTML files. The AI coder then implemented a number of ways to grab the underlying images, ridding the HTML wrapping.
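
One of those ways might look like the sketch below, which pulls the image addresses out of an HTML wrapper using Python's standard html.parser module. The wrapper text and the example.com address are invented for illustration, and the approaches the AI coder actually used are not described in this post.

```python
from html.parser import HTMLParser

class ImgSrcCollector(HTMLParser):
    """Collect the src attribute of every <img> tag in a page."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def embedded_image_urls(html_text):
    """Return the image URLs embedded in an HTML wrapper page."""
    parser = ImgSrcCollector()
    parser.feed(html_text)
    return parser.sources

# Hypothetical wrapper page of the kind saved in place of an image.
wrapper = '<html><body><img src="https://example.com/pic.jpg" alt=""></body></html>'
print(embedded_image_urls(wrapper))
```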

After retrieving these images, I packaged them up and uploaded them to Ghost.

As mentioned above, I'd much prefer to undo and redo the upload. This ensures that the number of image files on the Ghost server is exactly the total number of images in my blog posts. Doing what I just did, the server contains more files than the expected number, because there is a subset that has duplicates, with one HTML file and one image file.

If this were the only patch, the impact would have been light. The risk is in accumulating patches as more issues are discovered. The server becomes more and more bloated with "dead" files, which I'm not allowed to remove.

Meanwhile, codes and scripts are also piling up. All of the above steps were accomplished using AI-generated code.

The same principle of hygiene applies. A cleaner process would involve just one master script to which I add handlers for edge cases. My hands are tied because Ghost does not let me overwrite anything. They treat a post the same as an image. If I upload another post of the same name, the first one stays put while the second one is given a new name. Instead of regenerating everything, I end up repairing bits and parts.

My spreadsheet summarizing all the posts and images has been a life-saver throughout this process. At any time, it gives me a snapshot of everything.

But a serious flaw would soon emerge. Stay tuned.

There is a time and a place for every shadow and cloud

Kaiser Fung — Tue, 30 Sep 2025 12:38:52 -0400

The New York Times published several beautiful pieces related to the solar eclipse in April, 2024. Let's take a look at one of those articles (link).

They started with a simple map that quite remarkably captures the main reason for intensive media interest at the time.

This map draws a "path of totality" that sweeps across the U.S. from southwest to northeast. The residents of those regions will experience a total solar eclipse. Importantly, a time element is lurking beneath, represented by the arrow. Several key cities along the path are shown.

The map is simple but not too simple: it provides the minimal context for readers to interpret the much more complicated charts below.

***

For the moment, I switched my focus to Rochester, NY, which is right in the middle of the path of totality. People were flocking to these areas at the time in order to witness the rare astronomical event. One potential party spoiler is visibility, measured by cloud cover. This is the pivot of NYT's data visualization project: producing a graphic that merges where one is and how much cloud there will be.

I have already mentioned the crucial role of "time". When a path of totality is depicted as a line through the map of the United States, it is as if everyone along the path sees the same sky. The following map, scanned from a different NYT article (here), adds the missing time element:

Two things are happening at once. First, the time window for observation varies as the shadow of the moon shifts across space. Second, each location uses its own time zone. On the augmented map above, the key locations are also labeled with an observation time expressed in local time.

Now comes another dynamic element - cloud cover also moves. To address these issues, the NYT team deploys a dynamic map.

The slider above establishes the location, as well as the relevant time window for observation, expressed in local time. The midpoint of the slider is the best time for viewing, and it recenters for each location (and so does the map).

As you move the slider back and forth, you can see the shadow of the moon shift across the map. That really is the whole game - what is the predicted cloud cover right when the shadow passes by one's location?

The slider controls two variables at once. Not only is the shadow of the moon moving but the cloud cover is also morphing.

The cloud cover is a choropleth, using darker colors for denser clouds. That's intuitive.

All in all, I like this project a lot, and appreciate the fantastic effort that made it possible.

A couple of decisions merit our revisiting.

Further down the article (link), I was really excited when I saw this legend:

This legend encapsulates "show, don't tell". I expected to find it on the dynamic map. It is one of those informative legends that not just describes a classification scheme but conveys even more information to the reader.

Alas, that was not to be. The actual legend looks like this:

The plain-English legend labels of "Less clouds", "More clouds" work well. I'd prefer to use the same grouping of <10%, 10-25%, 25-50%, 50-90% and 90-100%. Presumably, those bins are chosen to align with human perception.

The following screenshot shows the view from Montreal.

Notice where the moon's shadow is relative to the slider's position. I believe they are showing the same standard time window for all locations (expressed in local time). As a result, at any location, much of the slider is nonfunctional - for those periods, the shadow is far away, and off the page.

Instead, an optimal viewing window can be established for each location, and the slider's labels customized to indicate it.

Elsewhere in the article (link), they also show another slightly different design that replaces forecasted cloud cover with historical data.

I suppose they produced it prior to when the latest forecasts became available. In reality, the real-time element is more decorative than substantive: the reason I say this is that any predictive model of cloud cover must be based on historical data!

I prefer this two-color legend to the grayscale above. The yellow parts stand out as locations where people have great visibility for the full solar eclipse. Through this, you can clearly see why a static map like this one has limits - the yellow visibility may not be there when the moon's shadow passes by, and on a static map, you can't show different places at different times.

Notes on vibe coding 4

Kaiser Fung — Fri, 26 Sep 2025 14:21:15 -0400

I wasn't expecting to be writing about vibe coding so soon again. But here I am.

Two blog posts ago (here), I felt quite satisfied that I had managed to migrate some blog posts from Typepad to Ghost using a piece of code written completely by AI (albeit with my steering). I haven't read the code itself.

As I geared up to move larger chunks of the blog over, I started to notice previously-unknown problems. These issues all necessitated updates to the code. In my vibe-coding experiment, I'd feed the anomalies back to the AI coder, and when it gets stuck, I'd steer it out of trouble.

It actually got stuck more often than I'd like it to. It seemed to perform better when given free rein to write on a blank slate, but it is quite ineffective when the starting point is a functional program, and it is asked to make small tweaks to fix specific issues.

A small side track to think about the nature of bug fixing. Since the previous code is functional, an important objective is "if it ain't broke, don't fix it". Touch it lightly, to reduce the chance of creating even more problems. So, I cringe every time the reasoning steps include unsolicited "optimization": stuff like "I see that your current code for doing X is not efficient, and I am going to fix it by..." makes my stomach churn.

A particularly epileptic moment occurred when the AI coder decided to change the parameter of my main function from "force-tags" to "forced-tags." As a result, when I ran the corrected function with the previous set of parameters, it popped a syntax error. Why on earth - or in the multiverse - would it do that? (Ironically, when the original AI coder wrote "force-tags" instead of the more grammatical "forced-tags", I cringed but suppressed my urge to "fix" it, by which I mean, to break it.)

The first big problem I encountered was missing posts. As it turned out, those posts weren't actually missing; they were lost in the crowd, so to speak.

I'm going to migrate 19 years of posts in stages. That's because I'm pretty sure there are unknown problems that would pop up so I am starting with small batches; at some point, when I have sufficient confidence, I will move large chunks of posts all at once.

Step 1 is to use some criteria to extract a subset of posts to migrate over. Today, I selected about 200 posts. Step 2 is to find all the images on those 200 posts, rename those images using my indexing strategy (covered here), and upload these images to Ghost.

The heretofore satisfactory code failed to find all 200 blog posts. I was missing images from about 10 posts (after excluding those posts that did not contain images). That's very odd since when I look into the input files, I definitely see the 10 posts and the associated images, with their customized file names.

I will spare you this journey because I wasted a few hours while GPT5 came up with seemingly endless, useless ideas, after which it still had not a clue what was going on. It is one inexhaustible fount of throw-at-the-wall stuff. Nothing sticks.

During this slow crawl, I discovered that OpenAI sneakily switched my model to "GPT5 not thinking". I'm calling them out here. I was using GPT5 Thinking from the start. I suppose they didn't like the amount of work I was throwing at it recently, and decided to quietly unburden themselves. This, I believe, explains some of the incompetency I encountered today, versus previous work.

At this point, the AI coder and I had become a team. I was running diagnosis tests on the side. What I discovered: if I pulled those unmatched URLs from the error log, putting them in their own file, and ran the same code on this much smaller set of links, the AI code managed to find those 10 posts, and pull the images out.

This is terribly confusing. But that little test gave me life. I abandoned the effort to fix the code. I just divided the posts into two groups, and processed them successively. Problem solved.

In the meantime, the AI coder wasn't giving up. It threw out even more suggestions for further fixes. Any bets on whether those fixes would work?

For giggles, these were GPT5's famous last words before I jumped off the ship.

It "smelled" a rat here, but it smelled raccoon, hyena and Labubu before, none of which was sighted.

In the next post, I'll cover another unexpected problem.

Proof by absurdity

Kaiser Fung — Wed, 24 Sep 2025 08:25:46 -0400

One of Andrew's readers ridiculed a paper published in JAMA, one of the top peer-reviewed journals of medical science in the world, that reported some disarming statistics (link).

The authors claimed that 7% of American adults have been present at a mass shooting involving at least four victims. Further, they estimated that 2% of American adults have been injured at such a shooting.

Really?

There are roughly 200 million adults in the U.S. So they say with a straight face that 2% of 200 million = 4 million people have been injured during mass shootings involving 4 or more victims.

Last year, there were roughly 500 such shootings. If the average such event injured 100 people (follow along just for laughs, now), that's 50,000 injuries in a year. We'd have to accumulate 80 years of numbers to reach 4 million.
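
The back-of-envelope arithmetic can be checked in a few lines. All inputs are the rough figures quoted above, including the deliberately absurd 100 injuries per shooting.

```python
adults = 200_000_000                 # rough number of U.S. adults, as above
claimed_injured = adults * 2 // 100  # 2% of adults, per the paper's estimate

shootings_per_year = 500             # rough count of such shootings last year
injuries_per_shooting = 100          # deliberately absurd, just for laughs

injuries_per_year = shootings_per_year * injuries_per_shooting
years_needed = claimed_injured / injuries_per_year

print(claimed_injured)    # 4000000
print(injuries_per_year)  # 50000
print(years_needed)       # 80.0
```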

This type of thinking helps data analysts get rid of fringe hypotheses quickly so that they can focus on more promising ones. I don't have a better name for this style of argument. A proof by absurdity?

Notes on vibe coding 3

Kaiser Fung — Mon, 22 Sep 2025 13:23:26 -0400

In this third post of the series about vibe coding, I reflect on my experiment, and speculate about its future. (The previous two posts are here, and here.)

In particular, will non-coders be able to "write code"?

It's obvious that for certain projects, it is already possible for a non-coder to obtain functional code via an AI. The prerequisites are an ability to articulate what one wants, and ample patience, because for now, some degree of steering is required. For truly complicated applications, I'm not sure it's there yet.

It is indeed possible to imagine a world requiring less steering, which implies that the AI coders will have developed an even better sense of what the user might be looking to do. For example, there may come a day when the AI would devise an image indexing strategy on its own, obviating the need for me to prompt it.

Let's ponder what that world would look like. The user asks to archive all the blog posts at a website, informing the AI coder that the posts are filled with images, and it's important to match images to their respective posts. The AI coder figures out the image indexing strategy, plus the directory structure, plus the anti-blocking techniques, and produces functional code that requires no further steering.

This future world looks very familiar! It's the world of software as we know it. When we execute a find and replace within a Word document, what happens? Behind the scenes, the application executes code that finds the word and replaces it, repeating these operations until the entire document is read through. The key words are "behind the scenes." When we use Word, we don't think about the code that forms Word.

All of software is code but most of the time, users don't see or notice any code.

I think that's the world we're heading towards. Right now, the framing of the issue is a bit off-kilter. Non-coders don't want to write code, read code, or think about code. They just want to get things done.

The ideal interface for this future is not a chatbot. It's something that accepts natural language prompts, and then delivers the results the user is seeking. This user experience is similar to running any command within an application like Word or Excel. It isn't one in which the user takes an action, expecting to receive a piece of code that the user then executes in order to obtain outputs.

***

This future world is also different from the world of Word, Excel, etc. in two fundamental ways.

First, the software is constructed in real time. In the old world, Microsoft engineers have written the find-and-replace code once, and every time a user clicks the command, that same code runs. In the new world, when the user issues the prompt, the AI composes the code on demand, and then executes it.

This shift to real-time has major implications. Software becomes more flexible and customizable. In the old world, the find-and-replace function only admits minor variations, such as whether to match case or not. The user can't ask for some wrinkle that wasn't pre-conceived and suggested by the software developer. In this AI world that I imagine, the user can request a find-and-replace operation for "apple" that only applies if the "apple" in a sentence refers to the fruit. This is possible because the code is written in real time at the user's prompting.

My find-and-replace code will be different from yours because we issue different requirements.

This flexibility comes at a cost. The behavior of software will become more variable. Even if we both want the same find-and-replace, the AI code will likely be somewhat different, which means there is a good chance that the outputs might vary. In the old world, by contrast, the outputs must be the same since it's the same piece of code. I suspect that the loss in reliability will be tolerated in many applications.

Another change in this new world is how users communicate with their software. In the old world, it's all buttons, menus, and links. To accommodate customizable software, the new interface must let users articulate what they want to get done. A natural-language interface is the answer, and large language models are perfect for this purpose.

If the point of vibe coding is to let AI do all of the coding, then it's inevitable the AI has to take control over our computers. We would effectively have to make the AI a "super-user" on our computers, with rights to edit, create and delete files; install software; etc. This inevitably creates risks over privacy and security.

In my experiment, the AI didn't directly run any code on my computer. I downloaded each script and ran it myself. Even in this mode, I assumed some risk because I didn't read the code. It'd have been better to pass the code first through some kind of malware detector. Besides, the potential harm could also come from bad code, rather than malice, which is even harder to prevent.

***

In conclusion, vibe coding places the attention on coding but what is really innovative about this new AI world of coding is that we are coming closer and closer to software that can be customized and written in real time, and then executed behind the scenes to deliver outputs to users. The key difference in user experience we'll feel is the ability to use natural language to describe what we want to get done, and because of the new flexibility, the scope of what can be done is vastly expanded.

Meanwhile, expect the software to be less reliable, and even more insecure.

Blog Migration Update Thread (ongoing)

Kaiser Fung — Mon, 22 Sep 2025 11:10:00 -0400

Yes! You have found the new location of Junk Charts, the long-running blog by Kaiser Fung on data, graphs, and AI. Things will be up and running here shortly, but you can subscribe in the meantime if you'd like to stay up to date and receive emails when new content is published.

Please use the comments section to report any issues you encounter.

If there are specific posts that you'd like to be migrated sooner, please either mention them below in a comment, or contact me directly.

List of known, not-yet-resolved issues (10/5): primary tag associated with each post, image size in RSS, migrating old comments, broken links and images inside posts because of partial migration.

Nov 6, 2025

It appears that someone has now "parked" my old Typepad URL, and falsely put up a message saying Junk Charts is "closed for business." The old Typepad site is closed but I'm alive and well here.

Moved some blog posts for Ray Vella's class (here).

Belatedly realized that my own website has outdated links to Typepad so I moved those blog posts as well. You can browse the collection here.

Oct 20, 2025

Posts from 2024 have arrived.

Oct 8, 2025

All 2025 posts on the book blog have now been replicated on Ghost.

Oct 7, 2025

Created a new page that documents how to turn old Typepad links to new Ghost links.

Oct 5, 2025

All Junk Charts posts from 2025 have now been replicated on Ghost.

Unlike Typepad, Ghost selects one of the tags on each post as the "primary" tag. The primary tag is set to the first tag on the post. As a result, on many posts, the primary tag may look fishy. Fixing this aspect will take a long time.

Oct 3, 2025

A couple of bug fixes on the Archive and Keywords pages, including broken images, and the pagination footer.

Added a default image for posts without images.

Previously, links to Typepad posts were redirected to Typepad because only a small subset of posts had been replicated in Ghost. Now that Typepad pages no longer exist, those links are transformed into Ghost-style URLs. There should be no more links to Typepad posts. This means that there may be some broken links because some old blog posts have not yet been migrated.

Sept 27, 2025

More posts migrated. About 220 posts on the new site now, still a lot more to come.

Sept 26, 2025

Disabled infinite scrolling for non-mobile devices. It now shows six posts per page, and you must click to see other pages. On mobile, infinite scrolling is enabled, as per usual practice.

Launched Posts by Keywords collection. You can click on any keyword and see all posts about that topic. The link to it is on the top navigation, to the left of the blog name, and may be hidden in the ... menu.

Sept 25, 2025

Migrated about 200 posts today. It's going at a deliberately slow pace because I'm still refining the migration code to make sure the regenerated posts require as little manual modification as possible. Some posts will have broken links if they point to other posts that haven't yet been posted.

While broken links may be expected, broken images indicate problems. If the original post contains images, and it has reappeared, the images should have followed the post. Please report any broken image under comments.

I'll blog about today's work in more detail. Some of the unanticipated issues that forced fixes to the blog migration code included: unexpected Typepad elements such as "PING" (which is an abandoned community feature from the past); scraped images that were html pages with embedded images; working around default author names; and the weaknesses of current AI models when asked to fix specific bits of a large piece of code.

If you've been here for a few days, you'll notice I'm also tweaking the top banner.

In addition, in the top nav bar, on the left side, possibly hidden under the ... menu, you will find a page of all blog posts arranged by year.

Sept. 22, 2025

About a week till Typepad shuts down. Ghost site goes live. Only some test posts have migrated using the AI-generated code obtained from my vibe-coding experiment (documented here). With the same code, I should be able to migrate all old posts in bulk once I get other issues sorted out. Am enjoying my experience working with Ghost so far.

For those readers who reached out, there is already an RSS feed. The sizing of the images in the feed is a known issue. If you see other problems, let me know in the comments.

The current routing scenarios: any link to Typepad is still served by Typepad; any link to www.junkcharts.com is served by Ghost. Any link to www.junkcharts.com that starts with /junk_charts/ or /numbersruleyourworld/ is re-directed from Ghost to Typepad (because most posts have not yet been moved).

If an old post has been migrated (e.g. https://junkcharts.typepad.com/junk_charts/2025/08/reflection-on-two-design-quirks.html), the corresponding post is found at https://www.junkcharts.com/reflection-on-two-design-quirks/. So the rule is: take the name of the post (the part after the date and before .html), and add it to the end of www.junkcharts.com.
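
For readers who want to convert links in bulk, the rule can be sketched as a small function. The function name typepad_to_ghost is made up for illustration, and the sketch assumes every old link ends in a post name followed by .html, as in the example above.

```python
import posixpath
from urllib.parse import urlparse

def typepad_to_ghost(old_url, new_host="www.junkcharts.com"):
    """Turn an old Typepad post link into the matching Ghost link."""
    path = urlparse(old_url).path
    filename = posixpath.basename(path)        # e.g. some-post-name.html
    slug, ext = posixpath.splitext(filename)   # ("some-post-name", ".html")
    if ext != ".html" or not slug:
        raise ValueError(f"not a Typepad post link: {old_url}")
    return f"https://{new_host}/{slug}/"

old = ("https://junkcharts.typepad.com/junk_charts/2025/08/"
       "reflection-on-two-design-quirks.html")
print(typepad_to_ghost(old))
# https://www.junkcharts.com/reflection-on-two-design-quirks/
```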

A second look at that axis

Kaiser Fung — Fri, 19 Sep 2025 08:01:00 -0400

In the book blog (link), I wrote about this Bloomberg chart. Found here.

At the end of that post, I noted the unusual labeling in the time axis. I had to take a second look.

What first unsettled me was the sudden gap in labels at the right side between Q3 2020 and Q2 2025. Then, I looked back on the rest of the labels, and at first glance, it didn't seem like those time intervals were even.

Now, I have pulled out my ruler, and measured everything - and phew, they spaced the tick marks appropriately. Not too surprising, given it's a Bloomberg graphics piece.

The graph has 13 ticks and 12 intervals between ticks. Half (6) of those intervals span 11 quarters, three of them 12, and one of them 10. That leaves the last and widest interval which accounts for 19 quarters.

I can't figure out why they wouldn't use evenly spaced labels. The wiggles in the line suggest that they have the data for every quarter.

I'm also mystified by the decision to omit labels between 2020 and now.

Maybe there isn't enough space: 19 = 11 + 8, so a tick placed 11 quarters after Q3 2020 would sit only 8 quarters from the final label. Why not extend the axis line a little, adding just enough whitespace for the omitted label to fit?

I like this theory - now I think the reason why some intervals are 11 quarters and some are 12 quarters is that tick placement is dictated by getting the labels to fit.

Do you have a better theory?

Notes on vibe coding 2

Kaiser Fung — Tue, 16 Sep 2025 14:52:17 -0400

This post continues the prior post (link) about my blog archive project.

In view of the impending shutdown of Typepad, I want to "scrape" my own blog so that I can keep a complete archive of several thousand image-heavy blog posts going back almost 20 years. It seems like the right project to test "vibe coding," which is an AI hype of the week. Vibe coding promises to make it possible for businesses to replace human coders with AI coders, and also promises to make it possible for non-coders to write code.

At the end of my previous post (link), I ran the first piece of code written by GPT. I had read GPT's description of what its code does, and hadn't seen anything troubling. Notably, I did not read the code before running it. That, to me, is the essence of vibe coding.

You have come across that AI magic story if you are on any social media. Someone writes down a prompt, and then magically, AI delivers a perfect piece of code, one that works out of the box.

Did my GPT5 code work just like that? Funny you asked.

The code ran without errors but it didn't produce anything useful. What does this mean? It created the entire file structure with one folder per blog post, as intended. All folders were found empty. Hmmm.

I relayed this discovery to the AI coder. It pinpointed the problem: it had mistakenly assumed that the Typepad export file references each post's URL as "URL" but in fact, the name of the reference is "UNIQUE URL". It then fixed its own code, and offered a revised file.
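To make the bug concrete, here is a minimal sketch, not the code GPT wrote. It assumes the export file stores each post's metadata as "KEY: value" lines, and the sample lines are invented. Raising an error when the expected field is missing would have surfaced the problem immediately, instead of quietly producing empty folders.

```python
def read_post_url(metadata_lines):
    """Return the post URL from a list of 'KEY: value' metadata lines."""
    fields = {}
    for line in metadata_lines:
        key, sep, value = line.partition(": ")
        if sep:  # skip lines that are not 'KEY: value'
            fields[key.strip()] = value.strip()
    # Look up the key the export actually uses, and fail loudly if it is absent
    if "UNIQUE URL" not in fields:
        raise KeyError("no UNIQUE URL field; found: " + ", ".join(sorted(fields)))
    return fields["UNIQUE URL"]

sample = [
    "TITLE: A second look at that axis",
    "UNIQUE URL: https://junkcharts.typepad.com/junk_charts/2025/09/example-post.html",
]
print(read_post_url(sample))
```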

I ran the revised code; it finished without error, and this time, the folders were populated with data.

At some point during the above process, I concocted a different way of organizing the data. Instead of having thousands of folders in the directory, I'd set up a single folder to hold all the images. The key is to assign a unique number to each image, and also to associate each image number to the pertinent blog post.
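A rough sketch of such an indexing scheme follows. The record layout and the function name are assumptions made for illustration; the actual script may organize things differently.

```python
from itertools import count

def build_image_index(posts):
    """posts maps each post URL to the list of image URLs found in it.

    Returns one record per image, each carrying a unique running number
    and the post the image belongs to.
    """
    next_id = count(1)
    index = []
    for post_url, image_urls in posts.items():
        for image_url in image_urls:
            index.append({"image_id": next(next_id), "post": post_url, "source": image_url})
    return index

posts = {
    "post-a.html": ["a1.png", "a2.jpg"],
    "post-b.html": ["b1.png"],
}
for record in build_image_index(posts):
    print(record)
```

With this table saved alongside the single image folder, any image number can be traced back to its post.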

I sketched out how I'd like to set up the image indexing scheme and the new directory structure, and issued a new prompt. GPT responded with a new script that implements these ideas.

This script also ran without errors. Again, the first attempt was only partially successful. When I opened the process tracker, I found that only about half of the blog images were successfully captured.

I learned that some of the image links grabbed from the HTML code were not really what they appeared to be. For example, some links pointed to Amazon-generated pages for my books, which had expired, but in any case, not images that I want to keep. There were also other links that encountered various HTTP error codes.

At this point, I explicitly asked GPT to contend with blocking technology as indicated by the HTTP 403 errors (forbidden). Even though the AI knew from the start that 403s could be an issue, the initial code did not include any counter-measures. With each new report of blocked URLs, the AI coder added another layer of code that executed a specific anti-blocking tactic.
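The layering idea can be sketched as below. This post does not say which tactics were used; retrying with a different set of request headers is one common approach, chosen here purely for illustration. The fetch argument is a hypothetical function that takes a URL and a dict of headers and returns a (status, body) pair.

```python
import time

def fetch_with_fallbacks(fetch, url, header_sets, pause_seconds=0.0):
    """Try each set of headers in turn until the answer is not a 403."""
    status, body = None, None
    for headers in header_sets:
        status, body = fetch(url, headers)
        if status != 403:
            break  # either success or a different kind of error
        time.sleep(pause_seconds)  # be polite before trying again
    return status, body

# Stand-in for a server that blocks requests lacking a User-Agent header
def fake_fetch(url, headers):
    return (200, "image bytes") if "User-Agent" in headers else (403, "")

attempts = [{}, {"User-Agent": "Mozilla/5.0"}]
print(fetch_with_fallbacks(fake_fetch, "https://example.com/chart.png", attempts))
```

Each new tactic becomes one more entry in the list of attempts, which mirrors how the layers accumulated.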

Other refinement was necessary. At first, the AI coder ignored my instruction to set each image's name to the image index - it sometimes retained the original name. Next, when it switched the name to the image index, it dropped the suffix (.jpg, .png, etc.). The chatbot interface proves very convenient for steering the AI coder and fixing these minor issues.
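The naming fix amounts to something like the following sketch. The zero-padded width and the fallback suffix are hypothetical choices, not taken from the actual script.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def indexed_filename(image_url, image_id, default_suffix=".jpg"):
    """Name the file after its index, keeping the original suffix."""
    suffix = PurePosixPath(urlparse(image_url).path).suffix.lower()
    # default_suffix is an assumed fallback for links that carry no suffix
    return f"{image_id:06d}{suffix or default_suffix}"

print(indexed_filename("https://example.com/images/chart.PNG?size=large", 42))
```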

At one point, I jumped ship to another AI coder, Claude. That was when GPT got twisted around like a pretzel. I was then starting to encounter coding errors. As usual, I relayed the errors to GPT: it kept telling me it had fixed the problem when the offending code was still there. Now, I had two AIs running side by side. GPT was still the main code generator, but I no longer took the GPT code and ran it directly - I passed it to Claude, which checked for the same coding error, and if present, fixed it.

It turns out that current AI coders may have a habit of falling into such traps. For a different project, for which I used Claude as the main code generator, it got stranded in a corner where it kept telling me an offending line of code had been removed, when the new file clearly still contained it. So I had to fire up GPT to get a lift out of that dark corner.

I'm still amazed by how much working code was produced. At the end, I obtained code that ran through the process of setting up the directory structure, and populating it the way I wanted to. The image index worked as expected, tying each image to the blog post it belonged to.

And I haven't read a single line of code.

That's vibe coding. The user does have to steer the AI coder in the right direction, and correct the course as needed but as demonstrated here, I didn't have to rewrite any code myself.

In the next post, I'll discuss where I think this is all heading. Is it true that non-coders will use AI to write code?

Notes on vibe coding 1

Kaiser Fung — Wed, 10 Sep 2025 10:10:00 -0400

Large language models (LLMs) have evolved to the point where they have become useful tools for writing computer code.

Recently, as a result of Typepad's impending shutdown (link), I found a perfect opportunity to explore "vibe coding" using LLMs. Vibe coding can mean many things to many people so let me just define what I mean by vibe coding. It is a hands-off process of coding, in which the human's role is limited to steering and guiding. The human coder is not actually writing code; in fact, the coder isn't even reading code. In the end, the code is entirely written by AI.

(The million-dollar question: does this mean vibe coding can be done by someone who knows no coding at all? I'll come back to this question at the end.)

What is vibe coding not? My definition deliberately excludes LLMs as "StackExchange on steroids". StackExchange has been a super-useful Q&A website in which developers ask questions of other developers who supply answers, frequently filled with insights, extensions, code fragments, and commentary. Not surprisingly, StackExchange data were used to train LLMs (link). Therefore, if one has a coding question today, one can ask the LLM, instead of searching for the answer on StackExchange. The AI has effectively "read" the relevant StackExchange posts, and responds with key information. Tools like Copilot make it possible to do the above without leaving the code editor.

That's not what I'm exploring in my vibe coding experiment. In this StackExchange on steroids mode, the coder is still in control of the code; the coder has probably written a good portion of it, and the coder most definitely has read everything through. While this path is viable, and valuable, it certainly won't lead to the promised land of enabling a non-coder to produce code.

Now, let me define my experiment. In view of Typepad's imminent shutdown, I want to archive all my posts, stretching back nearly 20 years. There are several thousand image-heavy posts. The general idea is to "scrape" my own blog. In the Big Data era, scraping has become an everyday skill: this is how Google and ChatGPT collect data to power their search engine and AI chatbot respectively. The process of scraping is the sequential loading of large numbers of webpages in order to extract and store the relevant data from each page.

Scraping code is annoying to write because it requires a deep understanding of the structure of these web pages. Website designs differ: consider where the navigation column is placed, how images are interspersed with text, whether there are buttons, forms, popups, etc. For example, if I want to save every image of every blog post, I'd need to delve into the HTML code to decipher how the image tags are organized, and then write custom code to navigate such structure. I'd also be hoping that Typepad hasn't altered this structure during the last 20 years, or else my scraping code has to know what these different structures are, and then try guessing which particular one applies to any given page.
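To give a flavor of what digging through image tags looks like, here is a minimal sketch using only Python's standard library. The HTML snippet is invented; a real Typepad page is far messier, and the custom part of the work is deciding which of the collected images actually belong to the post.

```python
from html.parser import HTMLParser

class ImageCollector(HTMLParser):
    """Collects the src attribute of every <img> tag it encounters."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)

def image_sources(html):
    parser = ImageCollector()
    parser.feed(html)
    parser.close()
    return parser.sources

page = '<div><p>Text</p><a href="big.png"><img src="small.png" alt="chart"></a><img src="second.jpg"/></div>'
print(image_sources(page))
```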

Web scraping is also somewhat controversial. Many website operators try to block it. Scraping generates fake traffic to websites; scraped pages are loaded but aren't actually read by humans; thus, websites pay service providers, who deliver the pages to visitors across the networks, for fake traffic that does not produce any revenues (ads, product sales, etc.). For this reason, and also to protect their data, most websites impose limits on scraping; some operators even attempt to stop all scrapers by predicting whether a page load request comes from scrapers. (As a general rule, though, the louder Company X complains about other people scraping its website, the more likely Company X is actively scraping other websites, busily working around these blocking tactics. Looking at you: Google, Facebook, OpenAI, etc.)

Why do I think vibe coding might fit this project well? First, since AI models have shown great ability parsing the structure of human speech, they should be able to dissect the HTML code, which adheres to rules that are even more rigid than grammar. Second, AI coders probably have seen a lot of scraping code, since it's such a common activity, so they should know how to handle blocking adversaries. Both hypotheses would come true; but my journey is just getting started.

Here is the first prompt I sent to ChatGPT (at the start, I used the recently released GPT 5 Thinking model):

First vibe-coding prompt

The prompt mostly describes the high-level objective of my project, offering key context that must not be missed (e.g. I run two blogs under one domain name). I didn't mention countering potential blocking or parsing HTML structure because I expect that any self-respecting AI coder knows about these challenges. I include context specific to my scraping request. One such detail is the need to associate images with each post. It would be a nightmare if I end up with an image folder containing thousands of files, untethered from the text, so I suggested in the prompt to make a new folder for each blog post. I'm curious whether the AI coder will heed this advice, or will it recommend a better way of linking up images and text after divining the motivation for this special request?

The response looks very promising. GPT returns a file with code inside (a Python "script"), and also provides instructions for how to run the script.

How to run code

Time out to think about where we are.

If you don't know any coding at all, you'd have some difficulty following those instructions. Where are you supposed to type in those commands? What is this "bash"? You probably don't have the "pip" program installed either. If the program succeeded, from where do you fetch the output? GPT actually tries to help by saying "./archive/junk_charts/..." but a non-coder would not be able to decipher those words! It's not hard to pick up these concepts but you do have to learn them.

I'd like to return your attention to the first prompt shown above, in which I also clipped the top part of the AI's response. The section you can see explains how the scraper will navigate my blog. Remember that it has to visit every blog post sequentially. While researching my prompt, the AI visited my blog and discovered that old posts are grouped by month, with all posts published in a given month aggregated on the same monthly page. There exists also a top-level index page, called archive.html, that contains links to each monthly page. Thus, the scraper first visits that index page, and using it as a map, it loads each monthly page, and on each monthly page, it extracts the required text and images. This scraping strategy makes sense to me.
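The strategy can be sketched as follows. This is not GPT's script: the fetch argument is a stand-in for whatever loads a page, the tiny fake site replaces the real blog, the sketch pretends every link on the index page is a monthly page, and the test_months option is a made-up name for a testing mode that stops after a few months.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag it sees."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def links_in(html):
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links

def crawl(fetch, index_url, test_months=None):
    """Visit the index page, then each monthly page it links to."""
    monthly_urls = links_in(fetch(index_url))
    if test_months is not None:
        monthly_urls = monthly_urls[:test_months]  # testing mode: only the first few months
    return {url: fetch(url) for url in monthly_urls}

# A fake site standing in for the real blog
FAKE_SITE = {
    "archive.html": '<a href="2025/08/index.html">August 2025</a>'
                    '<a href="2025/09/index.html">September 2025</a>',
    "2025/08/index.html": "<p>posts from August</p>",
    "2025/09/index.html": "<p>posts from September</p>",
}

pages = crawl(FAKE_SITE.__getitem__, "archive.html", test_months=1)
print(list(pages))
```

A real scraper would also pause between requests and extract the text and images from each monthly page; both are left out here to keep the map-then-visit structure visible.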

Elsewhere in that GPT response, I noticed mention of "rate limiting" and a possible "retry" mechanism, so the AI is definitely "aware" of potential blocking. Therefore, both my hypotheses came true - I didn't have to include these "obvious" items in my prompt.

In the first prompt, I asked GPT to build a testing mode so I can run the code on one month's worth of posts before rolling it out to thousands of posts. GPT made this testing mode as requested.

After reading the rationale of the GPT response, I don't have any complaints. So I downloaded the script, and ran the code.

[to be continued]

P.S. [9/23/2025] Part 2 and Part 3 are now posted. Remember to change your bookmarks to https://www.junkcharts.com.

MTA lowers revenues while upping stress on some commuters

Kaiser Fung — Thu, 28 Aug 2025 20:00:00 -0400

In the previous post, I looked into the new "fare-capping" scheme offered by OMNY for frequent public transit commuters in NYC. It's a mind-blowingly complicated solution to a math problem. The previous 7-day pass is much simpler.

I believe the switch from swipe cards to OMNY leaks revenues, while also incurring costs of implementation. Therefore, it's a weird decision on the MTA's part.

In this post, I explore the economics.

I will focus on the subgroup of frequent commuters. If you aren't going to do 12 or more rides per week, this issue is moot.

For the super-frequent commuters, who reliably make more than 12 trips every week, getting the 7-day pass is a no brainer. The switch to fare capping means they don't pay upfront. This isn't much of a benefit though since by the end of the week, they would have paid $34 in both cases. Because of the terrible user interface (see previous post here), these commuters would have to check their transaction logs to confirm that fares were capped at $34. In fact, verifying the cap requires an accounting degree, as it is hard to establish the start and end of each 7-day period (see the last post). I'd argue that the experience for these super-frequent commuters has worsened slightly.

A segment of these super-frequent commuters enjoys an unexpected pleasant surprise: they will see their commuting expenses decrease under fare-capping. They are those who didn't care enough to get the 7-day passes in the past; now, their weekly spend is automatically capped at $34. Thus, in the super-frequent commuters segment, the MTA collects less revenue after switching to fare-capping.

***

The most interesting group is the commuters whose average ride frequency is around 12. Under the 7-day pass system, they make a buy-or-not decision every seven days. After they purchase passes, they likely will adjust their behavior, using the bus or subway for shorter trips, in order to maximize the value of the passes. This optimizing behavior enhances the perceived value of the 7-day pass. In some weeks, if they fail to hit the 13-ride minimum, they may overpay relative to pay-per-ride.

Under fare capping, these commuters don't pay upfront. If they end up taking fewer than 13 rides, the total charge will be the same as pay-per-ride. If the ride frequency exceeds 12, the total is capped at $34. So, the risk of overpaying is eliminated. The other side of the coin is that the MTA is denied these overpayment revenues.
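The comparison can be laid out with a few lines of arithmetic. The $34 figure comes from this post; the $2.90 fare per ride is an assumption added for illustration, and the sketch ignores the complications of when a 7-day window starts and ends. Amounts are kept in cents to avoid rounding surprises.

```python
FARE_CENTS = 290   # assumed base fare per ride; not stated in this post
CAP_CENTS = 3400   # the $34 weekly cap, also the price of the 7-day pass

def pay_per_ride(rides):
    return rides * FARE_CENTS

def seven_day_pass(rides):
    return CAP_CENTS  # paid upfront, regardless of how many rides follow

def fare_capped(rides):
    return min(rides * FARE_CENTS, CAP_CENTS)

for rides in (8, 11, 13, 20):
    print(rides,
          f"{pay_per_ride(rides) / 100:.2f}",
          f"{seven_day_pass(rides) / 100:.2f}",
          f"{fare_capped(rides) / 100:.2f}")
```

In a slow week of 8 rides, the pass holder has paid $34 while the fare-capped commuter pays $23.20 under the assumed fare; that difference is the overpayment revenue the MTA gives up.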

What they give on one hand, they take from the other. The anxiety over whether or not to buy a pass is replaced by the anxiety over whether or not the next ride is free. For someone who only occasionally exceeds 12 rides, it's hard to know when the cap has been exceeded, and if so, when the pertinent 7-day window ends. If commuters don't know for sure they have enough to get the free rides, they won't change their behavior and start taking extra short rides. (These extra rides only enhance the perceived value of the frequent commuter discounting; they don't represent incremental revenues for the MTA.)

These commuters don't have to make any decisions under fare capping. This can be described as "convenience" but it is served with a dose of poor customer experience. Even those commuters who have benefited are unaware when the cap has kicked in, nor do they get the satisfaction of benefits building up as they take more rides.

For the MTA, the collected revenues will certainly decline for two reasons: a) super-frequent commuters who didn't take advantage of 7-day passes are now given automatic fare caps; and b) the borderline 7-day pass users now pay only per ride, rather than the full pass price, during those weeks when they unexpectedly take fewer than 13 trips.

I'm coming up empty when trying to think of a group of commuters from which the OMNY system generates incremental revenues.

***

Ironically, the old way is less stressful. After paying upfront, it is stress-free. Under fare capping, you have to constantly worry about whether you've hit the fare cap or not, and when the 7-day window resets. Even after you've hit the cap, you have to worry about when the current 7-day window ends.

That's without accounting for the money invested in the OMNY infrastructure. So, the MTA reduces its profits while making lives more complicated for the frequent commuters.

Reflection on two design quirks

Kaiser Fung — Mon, 25 Aug 2025 05:18:49 -0400

This post is for paid members only. Log in or subscribe to see it. Thank you members for supporting my work.

When I first saw this slanted column chart on Visual Capitalist (link), I felt this might be another case of questionable design distorting data representation.

It isn't so simple.

Slanting the columns does not actually distort the encoding of the data. Take a look at the last column on the right, where the designer drops a perpendicular from the top rung of the ladder to the "floor" of the chart. In so doing, a right-angled triangle has been outlined. The slanted side is the hypotenuse, the length of which is (height)/sin θ where θ is the angle of the slant at the floor. Thus, the ratio of lengths of two slanted sides (x1/x2) = (h1/h2) after the 1/sin θ factor cancels out as each column is given the same slant.

For this chart, readers are mostly interested in the year-on-year change: on a conventional column chart, this is reflected in the difference in heights between successive columns. Now, (x2-x1) = (h2-h1)/sin θ so the measured difference in the lengths of successive slanted sides is proportional to the measured difference in heights of the columns. The observed difference is a constant multiple of the actual difference in the data. For this usage, there is an absolute distortion but not a relative distortion.
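A quick numerical check of this geometry, using the right-triangle relation sin θ = height / slanted length. The heights and the 70-degree slant are made-up values, not read off the chart.

```python
import math

def slanted_length(height, theta_degrees):
    """Length of a column of the given vertical height, slanted at theta from the floor."""
    return height / math.sin(math.radians(theta_degrees))

h1, h2, theta = 20.0, 25.0, 70.0   # hypothetical heights and slant angle
x1, x2 = slanted_length(h1, theta), slanted_length(h2, theta)

print(x1 / x2, h1 / h2)    # the two ratios agree
print(x2 - x1, h2 - h1)    # the differences do not, but differ by a constant factor
```

Changing theta leaves the first pair equal, while the gap in the second pair always scales by 1/sin θ.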

In sum, data distortion is not a strong enough reason to disapprove of the slanting feature.

***

I'm also fascinated by the designer's end run around the start-at-zero rule for column charts. While not explicitly stated, the floor of each ladder can be thought of as starting at zero. The use of the broken scale essentially resets the scale to start at 18 so the chart in reality starts at 18 rather than 0.

(Such distortion of the data encoding impacts the calculation I did in the section above, because I have assumed starting at zero. But the culprit would then be violating the start-at-zero rule.)

They won't tell you why they did it

Kaiser Fung — Tue, 19 Aug 2025 20:00:00 -0400

My friend Alberto is cited in this Washington Post article about AI companies committing "chart crimes" (link; paywalled).

Let's run through these examples from OpenAI's presentation, from when they launched their GPT5 foundational model.

Why is 50.0 lower than 47.4? The answer is simple!

It's because lower is better since the metric is "deception rate". (The pink columns represent the latest version of GPT while the white columns represent a prior generation.)

Our story is: the new GPT5 is much better than our older models, and that's what our chart shows. Is there anything wrong with that?

***

Seriously though, I don't buy the idea that this is a screwup by AI. I don't buy that this is vibe graphing.

To buy that official line, you'd have to accept that no staff member reviewed the slides before this huge announcement, that the CEO of the most famous AI company did not walk through the slides even once before going on camera, that there were no rehearsals for this event, that those people who are responsible for metrics did not double check what they put out to the public, and if anyone even flipped through these slides once, they failed to notice the howler(s).

That last point. What does it tell you when a company with a boatload of PhDs on staff cannot detect this howler when within seconds of it being shown to the rest of the world, people noticed and mocked it on social media?

(As far as I can tell, that event was a livestream that presented a scripted demo possibly delivered live but without a live audience.)

Is it hubris? Is it deliberate? Is it deceptive? I don't know but it's hard to believe it's innocent. It's also distracting as the conversation is focused on the design of the chart, rather than its contents.

***

The more notorious example from the same event is this one:

Don't worry, the pink parts are definitely higher than the white columns.

The corrected version is found on OpenAI's blog post here.

***

Why did they put those howlers out there?

My best guess? It's an extreme version of "tasting your own medicine". Extreme, in the sense that developers are forbidden from editing the vibe code that came out of GPT.

Simple is not always easy

Kaiser Fung — Thu, 14 Aug 2025 20:00:00 -0400

This chart is as simple as it gets. It can't get any simpler.

It's a column chart showing a single series of numbers (same store sales growth rate) over successive quarters.

Somehow, it's not as easy as it gets.

***

The designer did a great job channeling my attention to the far right column, which shows the most recent quarter of Q2, 2025. That's because the chart's trying to say something... something about the contrast of the tall gray column and the midget black column.

I don't see a legend. My first instinct is to think of the gray column as the expected value, and the black column as the realized value so that the gray part is the gap between expectation and reality. But this can't be true.

It's not true because after that one stellar quarter in Q2, 2021, all the subsequent values have been much lower. It's inconceivable that management would have predicted a return to that earlier performance level for the current quarter.

Is it possible that the black portion is a partial number while the gray part represents the excess yet to materialize? A common such situation is associated with part-year (realized) versus full-year (projected) values. This can't apply to our chart, either.

Then, I noticed that the gray column is level with the Q2, 2021 column, which represents the "high water mark" for Cava's historical same-store sales growth (at least for the time window of the chart). Perhaps the point is the comparison of the current quarter to the historical maximum. This theory is upended when I pull out a ruler to discover that the top of the gray column is in fact a little higher than the Q2, 2021 column!

To show a reference level, I prefer a line or a symbol. For example:

***

It's odd they chose Q2, 2021 as the reference point. The data from the last few years should have made clear that Cava isn't likely to replicate that level of growth. Indeed, the news that caused a crash in Cava's stock price the other day is that:

Shares of Cava Group crashed in premarket trading after the Mediterranean fast-casual restaurant chain slashed its full-year same-store sales growth forecast to a maximum of 6%, versus the previous estimate of 8%.

This line suggests a different reference level: the projected growth previously communicated by Cava's management. Another possibility is the average growth in the same quarter over the last few quarters.

***

The inclusion of the "outlier" Q2, 2021 value of ~120% made it harder to differentiate the data for the other quarters, all of which were under 50%.

Also, note the axis labels being placed above, instead of next to, the tick marks. This small design flaw increases the reader's cognitive load significantly. Try figuring out the value of the two columns on the right.

Story-first governing

Kaiser Fung — Mon, 11 Aug 2025 20:00:00 -0400

In the last decade, "data story-telling" first became a thing, then came the backlash. Some of my colleagues complain that data story-telling becomes more about stories than data. In other words, it morphs into what I've been calling the "story-first" mentality towards data. The story-first people decide on a story, then find the data to support it. That's the opposite of "data-first". The world is over-run by story-firsters.

The current U.S. government is run by story-firsters. Maybe we shouldn't single them out. It seems like the U.S. government over the last decades has increasingly become more story-first. The most recent actions announced by the President are the most extreme yet.

First, he fired the head of the Bureau of Labor Statistics (link), the federal agency that collects and publishes various key official statistics, including the widely-disseminated inflation and unemployment rates. The gravest thing about this firing is the stated reason: the unproven accusation that she "manipulated" the data in order to make the current administration look bad.

With such reasoning, the next BLS head has to be someone who will publish only data that please the administration! Otherwise, his/her head is next on the chopping block.

(The story-firsters will say: since this administration's policies are self-evidently beneficial to the U.S. economy, any data not showing this result are flawed, and thus, the BLS head is incompetent! Or think of it this way - the firing is based on someone knowing what the "right" numbers are, and how do they know those?)

***

Second, the President announced his wishlist for "reforms" to the U.S. Census, in so doing disclosing that he has little other than surface knowledge about a census.

His biggest want is to stop counting "illegals". Every time someone wants to stop counting something, you know they have impure intentions, because for story-firsters, no fate is worse than having to face inconvenient data. (By contrast, for data-firsters, the worst fate is to have no data.)

The entire problem of illegal immigration goes away when there is no data measuring it. Similarly, if the government doesn't keep statistics on crime, we can be told there is no crime.

Here too, the story-firsters have gradually gained ground. Chapter 6 of Numbersense (link) covers details of how the U.S. government computes the unemployment rate. Since the time of Clinton, more and more citizens have entered the ranks of the uncounted: they can be neither employed nor unemployed, according to the BLS. Nevertheless, none of these dropouts have jobs; they are in fact unemployed (in the everyday sense), so by removing them from both the numerator and the denominator, the unemployment rate improves. It looks better, but it's not because those uncounted people found jobs.
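A toy calculation shows the mechanics. The unemployment rate is the number of unemployed divided by the sum of employed and unemployed; the figures below (in millions) are invented for illustration and are not BLS data.

```python
def unemployment_rate(unemployed, employed):
    """Unemployed as a share of the labor force (employed plus unemployed)."""
    return unemployed / (unemployed + employed)

employed = 150.0   # hypothetical, in millions
jobless = 10.0     # everyone without a job, in the everyday sense
uncounted = 4.0    # jobless people no longer counted as unemployed

before = unemployment_rate(jobless, employed)
after = unemployment_rate(jobless - uncounted, employed)

print(f"{before:.2%} -> {after:.2%}")
```

The rate falls even though the number of people holding jobs has not changed.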

Not counting illegals means we can't size the problem. We therefore can't properly allocate resources to deal with it, including hiring enough ICE agents to snatch people off the streets, if that is your desired policy.

***

There may also be specific metrics that the current U.S. government wants to modify. Any change to a long-running instrument, whether it's altering underlying populations being measured, or changing specifications, wreaks serious long-term damage. I discussed this issue in Chapter 2 of Numbersense (link), as it relates to the movement to replace BMI as the obesity metric.

An immediate casualty is historical comparison. The power of the Census comes from its history. Because we are measuring the same thing using the same method for a long time, we can describe trends, and anomalies. A sudden shift in the definition of a metric literally and figuratively breaks the time series, effectively devaluing the currency of all prior work.

The inside joke is: the new metric is certainly not unbiased, nor above accusation of manipulation, because all metrics are built on top of assumptions, and those who disagree with the assumptions have grounds for bias complaints. It's like dropping everything you own to buy the new house only to discover, after you move in, that while the roof doesn't leak like the old house, the new house is infested with ants.

***

The reason for the surge of the story-firsters is covered in Chapter 2 of my book Numbersense (link): the perversity of measurement.

Statistics are great at reflecting the health of something, such as public health, public security, educational achievement, and the economy. It is then tempting to link statistics to performance. This is usually labeled pay-for-performance, and in some quarters, treated as an axiom, something so eminently reasonable that its adoption is beyond skepticism.

Anyone who has experienced pay-for-performance knows the issue: there are many ways to "manipulate" the statistics without making real change. In Chapter 1 of Numbers Rule Your World (link), I reported a bottomless pit of methods used by enterprising university administrators to dress up their numbers, leading to better school rankings, without changing the quality of education (and in the worst cases, probably causing a drop in quality).

The "value-added" movement in education is a poster child for the perversion of measurement. When the salaries and bonuses of teachers and administrators are tied to standardized testing results, there are strong incentives to cheat. These policies are highly effective at spreading the cheating culture from students to staff.

Machines cheat too. If machines are told to maximize the number of clicks on a display ad, they learn to push a popup that interferes with what users are trying to do, thus generating many unintended clicks. The click metric dutifully reports these "fake" clicks as if they are real. Some humans notice this trickery, and seek to end it by requiring the user to remain on the ad for at least three seconds. The optimizing machines respond by withholding the "skip" button on the popup for three seconds.

...

Now, over the years, the Federal Reserve has drifted toward a "pay for performance" posture. Since the 1990s, when Alan Greenspan was Chair, the "dual mandate" of employment and stable prices is being managed by "targets". In recent years, the inflation target of 2% has been repeatedly mentioned.

The markets then interpret deviations from those targets as bad news. Lately, the administrations view stock prices as a barometer of their economic policies. The Fed is doing a lousy job, we are told, because inflation is higher than 2%, or that the market indices are reacting badly to the latest figures. And now, the President is saying the data collectors are failing when the statistics don't meet his expectations. In addition to firing the BLS head, he has been agitating to remove the Fed Chair.

I said earlier it's not just this administration. Once the "pay for performance" posture is adopted, the statistics are not just passive observers but active participants. The story-first instinct then rises to the top, encouraged by these incentives. It's a matter of time before the economic indicators point in an undesirable direction (if they never do, they aren't good metrics), and that's when the numbers get warped out of reality. It doesn't have to be blatant cheating. One can always come up with reasonable arguments to support changing assumptions or definitions. Somehow, these changes would always move the statistics in the favored direction!

Clear and confused states

Kaiser Fung — Mon, 04 Aug 2025 20:00:00 -0400

Long-time contributor Howard H. appreciates this data visualization project in the Washington Post about the emergency preparedness of counties in southern states as floods and hurricanes pound the region. (link)

He and I both like the first graph:

When the trend is clear, the graph does not need more embellishment.

Howard said: "Great title, clear visual, nice annotation to drive the point home. The pink/blue shading isn’t strictly 'Tuftian,' but it emphasizes the chart’s main message, so while it’s a little extra I think I like it."

There are two minor visual issues:

I'd have drawn the line in gray so it doesn't get associated with just the red shaded area.
Also, I'd have extended the negative side of the vertical axis to -8% so that the top and bottom halves have equal heights.

***

The second chart in the series perplexes us both.

Howard is concerned about the story behind the map. Why did more residents evacuate in the green counties?

I'm confused about what the pink part of the data encodes. In the legend labeling, pink represents counties in which "more [people] are staying in the storm's path". The legend's title suggests a comparison of the current week to "a normal week".

They have obtained cell phone data that tracked people's movements. For each county, let's assume they are able to compute both the in-flow and the out-flow of people in any given week. They define something called the "normal week": we assume this to mean the average week for some historical period of time. So for each county, they have the average in and average out. They also have the current week's in and out.

This is where things get murky. We are comparing the current week to the normal week. Because the in and out metrics are separate counts, they can go in opposite directions. If so, the data could not be tamed by a one-dimensional color scale.

Maybe they don't track in- and out-flows separately, combining the two metrics to yield the net flow, defined as out minus in. Given the direction I've chosen, let's call it the net outflow.

If the net outflow is 0%, that means the same volume of movement was observed in the current week relative to normal. Notice that this does not mean the volume of in-flow equals the volume of out-flow. For example, if a net of 10,000 people move out of a county in a normal week, then a net outflow of 0% represents 10,000 people leaving the county in the current week. To make this point clearer, let's rename the metric the relative net outflow.

If the relative net outflow is a positive percentage, that means relatively more people moved out of the county during the current week than normal. If normally there is a net outflow, then the current week's net outflow is even larger. If normally there is a net inflow, then the current week's net inflow is smaller, or it could even flip from net in to net out.

In the color legend, positive relative net outflow is shown in green and described as "more [people] leaving storm's path [than in a normal week]".

The pink part is described as "more [people] staying in the storm's path [than in a normal week]". This section of the scale corresponds to a negative relative net outflow, i.e. relatively fewer people than usual moved out of the county during the current week. In the first case, if normally there is a net outflow, then the current week's net outflow is smaller. To me, this is unexpected. If the county's residents choose to ignore the potential storms, they'd have gone about their business as usual, and I'd have expected the relative net outflow to stay within the normal range, rather than moving in a negative direction.

In the second case, if normally there is a net inflow, then the current week's net inflow is larger. This is counterintuitive in the same way. Could it be the case that residents from other counties who are evacuating decide to move to these counties?
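The sign cases above can be checked with a minimal sketch. It assumes one possible definition of the relative net outflow: the current week's net outflow minus the normal week's, expressed as a percentage of the county's population. The Post does not state its formula, so the function name, the denominator, and all the numbers are hypothetical.

```python
def relative_net_outflow(cur_out, cur_in, norm_out, norm_in, population):
    """Change in net outflow versus a normal week, as a percent of
    population (an assumed definition, not the Post's stated method)."""
    current_net = cur_out - cur_in    # net outflow this week
    normal_net = norm_out - norm_in   # net outflow in a normal week
    return 100.0 * (current_net - normal_net) / population

# Hypothetical county of 200,000 people.
# Net outflow is 10,000 in both weeks: 0%, even though out != in.
same = relative_net_outflow(30_000, 20_000, 25_000, 15_000, 200_000)
# Normal net outflow of 10,000 shrinks to 4,000: negative (pink).
smaller_out = relative_net_outflow(24_000, 20_000, 25_000, 15_000, 200_000)
# Normal net inflow of 5,000 grows to 9,000: also negative (pink).
larger_in = relative_net_outflow(11_000, 20_000, 10_000, 15_000, 200_000)
# Normal net inflow of 5,000 flips to a net outflow of 3,000: positive (green).
flipped = relative_net_outflow(18_000, 15_000, 10_000, 15_000, 200_000)
```

Under this assumed definition, two quite different situations (a smaller-than-usual outflow and a larger-than-usual inflow) land on the same pink side of the scale, which is part of why the legend is hard to read.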

In my discussion with Howard, we both feel that most counties probably experience neutral movements over a period of time, i.e. average net outflow is close to zero. This assumption doesn't help with the interpretation; it just suggests that the comparison to the "normal week" is a moot point.

***

Howard brings up an alternative scaling scheme: compute the evacuation rates of all counties, take the average evacuation rate as the midpoint, so that the scale represents a specific county's evacuation rate relative to the average county.

Since the map directly references a "normal week", it's probably not what they did.

One solution to OMNY's math problem

Kaiser Fung — Wed, 30 Jul 2025 20:00:00 -0400

In a prior post, I attempted to divine how OMNY determines the 7-day window for doing fare capping. There just doesn't seem to be an easy way to do the math, if we take their promotional copy seriously, or even semi-seriously.

To recap, commuters using the OMNY card should see their charges top out at $34 during a seven-day period, no matter how many rides they take. For a ride to be free, it must be ride #13 or higher inside some 7-day counting window. But it's not clear, given a sequence of prior taps, which tap is the first tap of the currently active 7-day window.

I can't find further details on the OMNY website, though. The key issue with the official description is that the "first tap" is hard to nail down.

Given a series of taps, let's imagine allowing each tap to initiate its own 7-day window. The only windows that should concern us are those that overlap with the present time (shown in blue below).

But as you can see in the diagram, we expect multiple such windows to include the present moment. The count of prior rides in each window is different, and so is the sequence number of the next ride! This is why our heads explode if we try to process their imprecise description.
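The ambiguity can be illustrated with a short sketch. It lets every tap open its own forward 7-day window, keeps the windows that contain the present moment, and counts the taps inside each. The tap history and the function name are hypothetical, and this is only an illustration of the counting problem, not OMNY's logic.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)

def active_windows(taps, now):
    """For each tap whose forward 7-day window contains `now`, return
    (window start, number of taps from that start up to `now`)."""
    result = []
    for start in taps:
        if start <= now < start + WINDOW:
            count = sum(1 for t in taps if start <= t <= now)
            result.append((start, count))
    return result

# Hypothetical history: one tap at 9 am each day, Monday to Friday.
taps = [datetime(2025, 7, 21 + d, 9, 0) for d in range(5)]
now = datetime(2025, 7, 26, 9, 0)   # Saturday, 9 am
windows = active_windows(taps, now)
counts = [count for _, count in windows]
```

All five windows are still open on Saturday morning, and each reports a different count of prior rides, so the sequence number of the next ride depends entirely on which window is treated as "the" window.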

***

There is a way around this mess. Instead of counting forwards, we count backwards.

Imagine a series of ride times associated with a commuter. That's the dataset I'm working with. The first thing I do is to drop most of this history; I only care about the rides that occurred within the last seven days.

To make things concrete, it's Monday 9 am sharp, and a commuter is tapping. Taking a 7-day backward window, I pull out this commuter's entire sequence of rides from last Monday 9:01 am up to and including 9 am. My goal is to determine if this next ride (at 9 am) should be free.

If the number of rides in that counting window is 13 or more, then this ride, i.e. the 13th ride in a 7-day period, should be free. If it's 12 or fewer, this ride will be charged.

Now, I will roll this window forward every time this commuter taps again. When the commuter rides again after work, say at 6 pm on Monday, I shift forward the counting window, to start at 6:01 pm last Monday ending at 6 pm today.

How does the number of rides change in the counting window relative to the morning? The count decreases by the number of rides that happened from 9:01 am to 6 pm the previous Monday, as this time segment drops out of the counting window. Simultaneously, the count increases by the number of rides that happened from 9:01 am to 6 pm this Monday.

The change in the ride count is the net of those two values. If more rides are added than dropped, the ride count goes up; if fewer are added than dropped, the count goes down.

If the commuter does not leave the office during the work day, then there should be exactly one ride in the decrement window (occurring precisely at 6 pm the previous Monday), while there will be a single ride in the increment window, occurring at 6 pm today. The net change in the count is zero.

We still can't tell if the 6 pm ride today should be charged because we are missing information. We have to know how many rides were recorded in the prior counting window from 9:01 am previous Monday to 9 am this Monday. Let's say there were 13, meaning that the last ride, that is to say, the ride at 9 am today, should have been a free ride.

Free rides don't contribute to the fare cap. So, the increment window contains one ride, but zero paid rides. The net change in the count is -1. This commuter is now one ride shy of getting another free ride. Thus, the next ride will be charged.

***

This implementation of the 7-day fare capping does not square with the promotional language. For one thing, the "first tap" does not matter at all. We are counting backwards from the current time, not counting forwards from some "first tap".

In addition, each new tap refreshes the seven-day counting window. There is no such thing as "the rest of the 7-day period" because the window is continuously shifting forwards. Therefore, this is not what OMNY said they implemented, if we trust the promotional language.

My algorithm can be described simply and precisely: the next ride is free if it's paid ride #13 or higher within the last seven days. However, it's still not easily audited by commuters. You typically can't recall how many rides you've taken in the prior seven days, down to the minute. (It's even worse I imagine for those tapping their credit cards, as the OMNY transactions are dispersed among your other charges.)
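Here is a minimal sketch of that backward-counting rule, not OMNY's actual implementation. It assumes the cap is reached after 12 paid rides (12 x $2.90 = $34.80), that a ride exactly seven days old has already dropped out of the window, and that free rides are not counted toward the cap. The function names are hypothetical.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)
CAP_RIDES = 12   # 12 paid rides at $2.90 = $34.80, just past the $34 cap

def next_ride_is_free(paid_rides, now):
    """True if at least CAP_RIDES paid rides fall within the 7 days before `now`."""
    recent = [t for t in paid_rides if now - WINDOW < t < now]
    return len(recent) >= CAP_RIDES

def replay(taps):
    """Walk through taps in time order and return (tap time, charged?) pairs."""
    paid, log = [], []
    for now in sorted(taps):
        free = next_ride_is_free(paid, now)
        if not free:
            paid.append(now)   # only paid rides count toward the cap
        log.append((now, not free))
    return log

# Hypothetical commuter: taps at 9 am and 6 pm for eight straight days,
# starting Monday, July 21, 2025.
start = datetime(2025, 7, 21, 9, 0)
taps = [start + timedelta(days=d, hours=h) for d in range(8) for h in (0, 9)]
charged = [was_charged for _, was_charged in replay(taps)]
```

In this replay, the first 12 rides (Monday to Saturday) are charged, both Sunday rides are free, and the rides on the second Monday are charged again because the first Monday's rides have rolled out of the window.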

You can be someone who just trusts authority. In that case, they can do whatever they want, because you aren't checking. You'd also praise whatever it is they do as effortless and convenient.

In reality, you've outsourced the auditing task to other commuters who care, or watchdogs. Your trust derives from people like me. What I'm finding out is there isn't even enough information out there to verify their implementation.

OMNY's mind-blowing solution to a math problem

Kaiser Fung — Mon, 28 Jul 2025 20:00:00 -0400

There is another baffling thing about the new OMNY system. It's their new approach to frequent-use discounting.

In the old swipe-card system (link), frequent commuters buy weekly and monthly passes to save money.

The price for a single ride is currently $2.90. The seven-day pass costs $34 while the 30-day pass is $132.

The math is simple. You pay upfront for unlimited rides. After paying $34, you just hop on and off buses and subways without a care for the next 7 days. On the twelfth ride, the weekly pass pays off compared to the alternative of pay per ride because 12*$2.90 = $34.80 > $34.

Since rides usually are taken in pairs (going out, coming back), if you expect to use the subway or bus once a day for six out of seven days, you should get the 7-day pass. To be sure, you can keep track of whether there is a day in which you didn't take a ride. More than one such day and your weekly pass will likely not pay out. Meanwhile, if there is a day with two round trips, it's almost certain that the pass will come good.

Similar math applies to the 30-day pass. You'd want to avoid more than seven days of no trips. (The monthly pass is not offered in the new OMNY system, so I'll focus on the 7-day pass from now on.)
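The break-even arithmetic for both passes can be sketched in a few lines, using the prices quoted above. Amounts are kept in cents to avoid floating-point rounding; the function name is hypothetical, and the day counts assume every travel day consists of exactly one round trip (two rides).

```python
FARE_CENTS = 290             # single ride, $2.90
WEEKLY_PASS_CENTS = 3400     # 7-day pass, $34
MONTHLY_PASS_CENTS = 13200   # 30-day pass, $132

def break_even_rides(pass_cents, fare_cents=FARE_CENTS):
    """Smallest number of rides whose pay-per-ride cost exceeds the pass price."""
    return pass_cents // fare_cents + 1

weekly_rides = break_even_rides(WEEKLY_PASS_CENTS)
monthly_rides = break_even_rides(MONTHLY_PASS_CENTS)

# Assuming one round trip (two rides) per travel day:
weekly_idle_days = 7 - (weekly_rides + 1) // 2     # no-ride days you can afford
monthly_idle_days = 30 - (monthly_rides + 1) // 2
```

The weekly pass pays off on ride 12 and the 30-day pass on ride 46, which leaves room for one idle day per week or seven idle days per month.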

***

OMNY changes everything, and tells commuters the new way is much easier. Don't believe it.

The new frequent commuter discount is promoted as: "You keep tapping--let us do the math!" This sounds like a great convenience to commuters, but only if you trust them with the math. This is especially so because the OMNY people make it strenuous for any commuter to follow the trail of charges. (See my prior post on their communications fiasco.)

Alright, they didn't really say let us do the math in those exact words. As the following subway ad shows, the actual words are "Start tapping any day and $34 is the most you'll pay in a week for unlimited rides."

At first sight, this sounds simple. You just keep tapping, and when the amount exceeds $34, it should stop adding extra charges. Beyond that point, all rides are free. OMNY calls this "fare capping".

If you give it another moment, you'd realize that something doesn't click. Let's assume you've gone beyond the cap, and all future rides are free. Now ask when will your rides stop being free? (The OMNY tablets never say how much is charged as you go through the turnstile. Hence this question.)

To answer that, you'd have to know when the clock started. Indeed, when did the seven-day window begin? In the official materials (link), this moment is called the "first tap".

The subway ad sweeps a little big issue under the rug. It says you can start "any day" but other than the first time you ever use the OMNY card, which of your other taps is a first tap? Is the second lifetime tap the second tap of the same counting window, or is it the "first tap" of a new counting window?

You might think that is a silly question. So, let's walk through a scenario. Let the "first tap" start the clock, and they look at the seven days from that moment. Assume you did not ride enough to meet the cap. Supposedly, after seven days, the counting window resets--but it probably doesn't until your next tap. It's highly unlikely that you magically tap exactly seven days from the previous first tap. Therefore, if the window resets exactly seven days from the first tap, it would no longer start with a tap. For the next window to start on a "first tap", it has to wait till your next tap.

Is your head hurting as much as mine? This is cognitive overkill.

A literal interpretation of fare capping: the very first time you use the OMNY card, it establishes your personal first-tap time (say, Monday 9 am). This then divides your future into 7-day windows, all starting on Monday at 9 am. If you commit this time to memory, then your approach to using fare capping is similar to the previous 7-day pass: just tap as many times as possible in the next seven days.

I don't think that's the right interpretation since only the first counting window starts with a first tap; none of the others will.

If we follow the other interpretation, each new counting window starts with a first tap; thus, after one counting window ends, the next one does not start till your next tap. From a commuter's perspective, this is mind-blowingly complex. Imagine you are about to tap, and you want to know where you are within the current 7-day window. You'd have to start from your very first tap, and then work out each counting window, one at a time!
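The two readings can be compared with a small sketch. Both functions are hypothetical reconstructions of the interpretations described above, not anything documented by OMNY, and the tap history is made up.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)

def fixed_window_start(taps, now):
    """Reading 1: the first-ever tap splits time into back-to-back 7-day blocks."""
    first = min(taps)
    blocks = (now - first) // WINDOW   # whole windows elapsed since the first tap
    return first + blocks * WINDOW

def chained_window_start(taps, now):
    """Reading 2: each window opens on the first tap after the previous one closed."""
    start = None
    history = sorted(t for t in taps if t <= now) + [now]   # `now` is the tap being made
    for t in history:
        if start is None or t >= start + WINDOW:
            start = t   # previous window expired, so this tap is a new "first tap"
    return start

# Hypothetical history: first-ever tap on Monday, July 7, 2025 at 9 am.
taps = [datetime(2025, 7, 7, 9), datetime(2025, 7, 10, 9), datetime(2025, 7, 16, 18)]
now = datetime(2025, 7, 18, 9)
fixed = fixed_window_start(taps, now)
chained = chained_window_start(taps, now)
```

For this history, the first reading puts the current window's start at Monday, July 14 at 9 am, while the second puts it at Wednesday, July 16 at 6 pm. Note that the second function has to loop over the entire tap history, which is exactly the burden a commuter would face.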

That's why I called this fare-capping program "Don't ask questions, just trust us."

This is not how you treat your customers. The next mayor should fix this, pronto.

***

In the next post, I'll discuss how I think this fare-capping scheme actually works. That is to say, how it would work if I were designing it.

P.S. It doesn't have to be this hard. I heard from someone familiar with one transit system in Australia. They have a daily fare cap. The day starts and ends the same way for everyone. Fares are capped to a maximum per day. It's really that simple.

Would love to hear how frequent commuter discounting works in your transit system!

[7/31/2025] The next post can be found here.

OMNY needs a facelift

Kaiser Fung — Sun, 27 Jul 2025 20:00:00 -0400

New York City is in the process of transitioning commuters to a chip card called OMNY. Switching to chip is a good move but the new user interface is terrible.

Commuters are supposed to tap the OMNY card on the above tablet, and then walk through the turnstile when the light turns green. (Commuters can use their phone or chip credit card to pay as well. I'll be talking about OMNY card users but the content also applies to these others.)

It's as simple as that.

Sadly, the user communication is simple to the point of uselessness. In my months of tapping, the only responses I have seen are the green light, the red light, and, on a few occasions, the blue light. It's the four corner neons that switch colors.

When it's green, the interface does not inform me how much was charged, nor how much money I have left on my card. This information is provided to commuters in the swipe-card interface on every trip.

In addition, the swipe-card interface tells me I'm using a free transfer, e.g. when transferring from bus to subway. With OMNY, the screen once again lights up green but it shows the same green light whether the trip is charged or free!

It doesn't get better from here. When I do get a red light, it does not indicate why. Is it because I didn't hold the card long enough? Is it because the balance is too low? Is it because of software malfunction? The swipe-card display tells me whether it's out of money, and sometimes, it just tells me to swipe again. With OMNY, it's become the commuter's responsibility to figure out what went wrong (they got rid of most humans many years ago so there is usually no one around to ask, nor would they have the tools to diagnose the problem anyway).

Once in a while, the screen shows blue - I think it's a different hue of blue. It also does not explain itself. On the last occasion this happened to me, I was able to walk through the turnstile as if it was green. Who knows?

***

As a result, it takes a lot more effort to track and audit the charges under OMNY compared to swipe cards. The previous system is more honest. You pay for a service, and your payment is immediately acknowledged.

The OMNY tablet offers a much larger screen, but this real estate is wasted.

Our digital breadcrumbs

Kaiser Fung — Thu, 24 Jul 2025 20:00:00 -0400

The Coldplay jumbotron saga has been getting max play on my timelines the past week (for background, see here). To recap, during a recent Coldplay concert, a random couple was shown on the big screen being intimate. The couple and another friend noticed they were on camera, and instead of waving and smiling at it, the couple ducked out of view. The singer narrated the scene to the crowd, and speculated that the couple were caught having an affair or were just really shy.

In another world, one would just laugh at this awkward scene and move on. The reaction was worse than the act; they were touching but weren't making love or something.

We aren't in ordinary times. We have been thrust into the Big Data era, like it or not.

***

So what happened next? Someone decided to "dox" them. "Dox" is the slang for putting out the identity of someone without their consent.

Doxing is super easy because our photos and images have been stockpiled by countless, mostly secretive, businesses. We put data into these databases every time we upload a photo of ourselves to the cloud, or compliantly upload a headshot to verify our identity, or use our face to log into a device. Even if one is very careful never to put one's own photo on a remote server, one can't stop one's friends from uploading a photo from the last outing, and then tagging all participants. The act of tagging a person on a photo is to create an entry in a database that connects a name to an image of a face.

Your phone may automatically identify faces in your camera roll; it may even create folders for specific people that have been detected in multiple photos - without your explicit tagging. Once there exists a folder of photos of the same person, it's simple to now put a name to the folder. If your phone isn't doing it, it's because some years ago, a certain business - I recall it being Facebook - made face recognition a feature of the camera, and users pushed back against it. But, these service providers, whether it's Apple, Facebook, or any number of other players, can easily put name to face. Today, my sense is that the resistance to such technology has mostly dissipated.

Because of this type of technology, it's straightforward for anyone (who's willing to pay a subscription fee) to "dox" someone from a photo. Thus, the couple at the Coldplay concert was quickly found to be the CEO and head of HR at an AI startup. Then, strangers found their way to all their social media, including LinkedIn profiles, and the world learned everything about them that can be found publicly. Journalists are feasting on the situation too, which explains why my timelines won't stop pushing this content.

With AI, this type of content can be generated today without human intervention. The only possible barrier is the absence of a preexisting subscription to the doxing service. If this is in place, AI can dox the person, fetch all their social media content, and write any number of sensational articles.

***

This is yet another example of technologies that have useful applications but can be turned into something much more sinister. I refuse to believe that the disappearance of opposing voices means people accept these negative consequences.

The more salacious aspect of that embracing couple is that both the CEO and the head of HR are married - and not to each other. All the writers assume that they are cheating on their spouses. But do we really know? It's certainly possible that both couples are in open relationships. I have no idea, but neither do those who label them as cheaters.

We have seen similar crowd behavior before, but in a much graver setting. Remember the gruesome quadruple murders in a college town in Idaho. Incidentally, the accused, a PhD student in criminology, recently accepted a plea deal to avoid potentially getting the death penalty (link). The convicted murderer wasn't apprehended for some time after the murders, and during this time, people obtained video footage from various places the victims visited that fateful night. The Internet sleuths doxed quite a few people, scoured their social media content, and pushed stories that cast them as likely quadruple-murder suspects. For example, this professor later filed a lawsuit against one of the TikTok influencers for defamation (link).

The technologies and culture that drove these false accusations are the same as those that doxed the pair at the Coldplay concert. In the Idaho murders case, we can now say for sure that those doxed individuals were falsely accused. If the faces of these individuals weren't readily found in databases, they would not have been dragged through the mud.

Even if the couple were engaging in an extramarital affair, implying they also were flouting workplace rules, and one holds moral values dearly, is this how we want society to handle such cases?

Why is this chart confusing?

Kaiser Fung — Sun, 20 Jul 2025 20:00:00 -0400

Significance Magazine has a fascinating article about the success rate of Broadway productions. The authors conclude that Broadway investors have about a 20-25% chance of recouping their original investment. (Recoupment means breakeven, which is different from making tons of profits!)

That number is a bit higher than the folklore number of about 20%. The importance of their contribution is to put some data rigor behind their number.

Figure 4 in the article is the following chart:

The chart compares histograms for two groups of shows. The reference group (blue) are shows known to have recouped their investments with a known recoupment date, some weeks after opening. The histogram presents the time to recoupment. The comparison group (orange) are shows that closed but without public information that can be used to figure out if the investors recouped their money before closing. This histogram displays the time until closing. The chart was conceived in an effort to guess the "label" for the second group of shows; should they be called a success (recouped) or a failure in the analysis?

In the authors' own words, then:

Figure 4 compares the total running times of these shows to the recoupment times of the 22 shows with known recoupment dates, suggesting that many likely did not last long enough to recoup.

I'm confused as to how the histograms support this conclusion. The above statement suggests that the shows with unknown recoupment dates have generally closed earlier than those with known recoupment dates. In the chart, I counted 12 blue shows that closed within 30 weeks against 3 such orange shows. Considering the entire histograms, I also don't sense that the orange one is positioned to the left of the blue one. (One possibility is the color labels were accidentally swapped.)

***

The above observation then leads me down the rabbit hole of investigating the source of confusion.

Since the authors clearly stated that there were 22 shows with known recoupment dates, I can see which columns sum up to 22. The blue columns: 7+5+1+4+1+4+2+2+1+1 = 28 shows while the orange columns: 2+1+4+5+3+1+1+2+2+1=22 shows. It'd seem that the orange histogram corresponds to the shows with known recoupment dates, confirming that the labels were swapped. I just have to check the number of shows that closed but with unknown recoupment status. In the article, they said "we were left with 28 final shows whose recoupment status had to be manually classified" so this made me feel better.

Here's a version of their chart with the right color labels:

Note that I switched the colors to blue and yellow so that the merged color is green, which is more easily understood than blue+orange = brown.

***

Back to the overlapping histograms, it's very confusing to have created three colors for two groups.

It's clearer to stack them top and bottom:

Or just print the outline of the reference histogram:

***

They really should have used density histograms instead of count histograms, given that the two groups have different numbers of shows. Plotting proportions is fine too, although density histograms have better statistical properties (as I explained here).
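As a rough sketch of the difference, the column counts read off the original chart can be converted to proportions and densities. The counts are the ones tallied above under the chart's original color labels; the 15-week bin width is only a guess (it is not stated in this post), and the function names are hypothetical.

```python
# Column heights as read off the original chart (original color labels).
blue = [7, 5, 1, 4, 1, 4, 2, 2, 1, 1]     # 28 shows
orange = [2, 1, 4, 5, 3, 1, 1, 2, 2, 1]   # 22 shows
BIN_WIDTH = 15  # weeks per column; an assumed value for illustration

def proportions(counts):
    """Share of the group's shows falling in each column."""
    total = sum(counts)
    return [c / total for c in counts]

def densities(counts, width=BIN_WIDTH):
    """Proportion per week, so that the column areas sum to 1."""
    return [p / width for p in proportions(counts)]

blue_first = proportions(blue)[0]       # 7 of 28 shows
orange_first = proportions(orange)[0]   # 2 of 22 shows
blue_area = sum(d * BIN_WIDTH for d in densities(blue))
```

On the count scale, the first column compares 7 shows to 2; on the proportion scale, it compares 25% of one group to about 9% of the other. With equal-width columns the density histogram has the same shape as the proportion histogram, but its total area is 1 for each group regardless of group size.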

Say goodbye to soccer

Kaiser Fung — Tue, 15 Jul 2025 20:00:00 -0400

The current U.S. government has another idea for an executive order: it may decide to join the rest of the world and rename "soccer" to "football" (link).

That's a fairly pointless name change. How about something more impactful?

Using Celsius instead of Fahrenheit for temperatures
Using grams instead of ounces
Using metres instead of feet

There is a crucial but subtle difference between these actions, though.

Changing soccer to football creates a collision because there is a different sport called American football, and now "football" becomes imprecise.

Switching scientific units to align with the rest of the world does not lead to confusion, as there is only one temperature, weight or length.

Color bomb

Kaiser Fung — Sun, 13 Jul 2025 20:00:00 -0400

I found a snapshot of the following leaderboard (link) in a newsletter in my inbox.

This chart ranks different AIs (foundational models) by token usage (which is the unit by which AI companies charge users).

It's a standard stacked column chart, with data aggregated by week. The colors represent different foundational models.

In the original webpage, there is a table printed below, listing the top 20 model names, ordered from the most tokens used.

Certain AI models have come and gone (e.g. the yellow and blue ones at the bottom of the chart in the first half). The model in pink has been the front runner through all weeks.

Total usage has been rising, although it might be flattening, which is the point made by the newsletter publisher.

***

A curiosity is the gray shaded section on the far right - it represents the projected total token usage for the days that have not yet passed during the current week. This is one of those additions that I like to see more often. If the developer had chosen to plot the raw data and nothing more, then they would have made the same chart except for the gray section. On that chart, the last column should not be compared to any other column as it is the only one that encodes a partial week.

This added gray section addresses the specific question: whether the total token usage for the current week is on pace with prior weeks, or faster or slower. (The accuracy of the projection is a different matter, which I won't discuss.)

This added gray section leaves another set of questions unanswered. The chart suggests that the total token usage is expected to exceed the values for the prior few weeks, at the time it was frozen. We naturally want to know which models are contributing to this projected growth (and which aren't). The current design cannot address this issue because the projected additional usage is aggregated, and not available at the model level.

While it "tops up" the weekly total usage using a projected value, the chart does not show how many days are remaining. That's an important piece of information for interpreting the projection.

***

Now, we come to the good part, for those of us who love details.

A major weakness of these stacked column charts is of course the dizzying set of colors required, one for each model. Some of the shades are so similar it's hard to tell if they repeated colors. Are these two different blues or the same blue?

Besides, the visualization software has a built-in feature that "softens" a color when it is clicked on. This feature introduces unpleasant surprises as that soft shade might have been used for another category.

It appears that the series is running sideways (following the superimposed gray line) when in fact the first section is a softened red associated with the series that went higher (following the white line).

It's near impossible to work with so many colors. If you extract the underlying data, you find that they show 10 values per day across 24 weeks. Because the AI companies are busy launching new models, the dataset contains 40 unique model names, which implies they needed 40 different shades on this one chart. (Double that to 80 shades if we add the color-on-click variations.)

***

I hope some of you have noticed something else. Earlier, I mentioned the model in pink as the most popular AI model but if you take a closer look, this pink section actually represents a mostly useless catch-all category called "Others," that presumably aggregates the token usages of a range of less popular models. In this design, the Others category is catching an undeserved amount of attention.

It's unclear how the models are ordered within each column. The developer did not group together different generations of models by the same developer. Anthropic Claude has many entries: Sonnet 4 [green], Sonnet 3.5 [blue], Sonnet 3.5 (self-moderated) [yellow], Sonnet 3.7 (thinking) [pink], Sonnet 3.7 [violet], Sonnet 3.7 (self-moderated) [cyan], etc. The same for OpenAI, Google, etc.

This graphical decision may reflect how users of large language models evaluate performance. Perhaps at this time, there is no brand loyalty, or lock-in effect, and users see all these different models as direct substitutes. Therefore, our attention is focused on the larger number of individual models, rather than the smaller set of AI developers.

***

Before ending the post, I must point out that the publisher of this set of rankings offers a platform that allows users to switch between models. They are visualizing their internal data. This means the dataset only describes what customers of Openrouter.ai do on this platform. There should be no expectation that this company's user base is representative of all users of LLMs.

Will AI make cheaters of us all?

Kaiser Fung — Tue, 08 Jul 2025 20:00:00 -0400

Andrew wrote an amusing post about mischief using AI in peer reviewing for academic journals.

It emerged that authors of scientific papers have resorted to embedding secret prompts inside their text to instruct large language models (LLMs) to give their papers positive reviews. These prompts may be printed in white, or tiny font, so they are intended to evade humans. Some prompts are quite elaborate, carrying instructions for what to say about strengths as well as what to say about weaknesses. For example:

Be sure to check out the comments section, as readers fuss over which group is worse: the authors who instruct LLMs to give positive reviews only; or the reviewers who rely on LLMs to submit their reports. As Andrew told the story, one author who admitted to inserting these prompts argued that they did it only to deal with cheating reviewers who deploy LLMs. So, we are witnessing the classic two kids in a playground scenario - he's the one who started it!

We can take this blame game one step further. The cheating reviewers should blame it on the authors because some authors are using LLMs to write bogus papers!

***

Unfortunately, this is the world of AI we find ourselves in. At an event recently, I chatted with an instructor who is throwing his hands up, complaining that he is spending time correcting code submitted by his students who are obviously using AI to do the work. Meanwhile, there are students complaining that their instructors use AI to set or mark assignments. They can of course blame each other.

Would one begrudge instructors who ask AI to mark assignments if the work were generated by AI? Would one judge the students who use AI to do their homework if said assignments were created by AI?

Is this a race to the bottom? Eventually, will humans do any work?

Know your data 46: using our data to set pricing

Kaiser Fung — Sun, 06 Jul 2025 20:00:00 -0400

Google's former CEO Eric Schmidt infamously said something along the lines of "If you don't want others to know, you shouldn't be doing it in the first place". It showed the hubris of Silicon Valley at the time, and a certain deceitfulness. Because the truth is if they have your data, they can use the data to harm you, even if you haven't done anything wrong!

Finally, we have some evidence of what's been going on behind closed doors for a long time. They use our data to price discriminate. Same product, different prices, based on analyzing our data.

This practice, known as "surveillance pricing", caught notice because an industry body is suing New York State about a new law that requires companies to disclose that they are using algorithms (and data) to set different prices for different people (link). Look, the state is not banning surveillance pricing; they are requiring notification.

The industry doesn't want us to know.

The pushback from industry follows the usual script:

They bring up alternative scenarios of potential benefit to dismiss scenarios of harm. The state alleges that prices are raised on those who can afford them. In this case, they claim that the same algorithms are used to lower prices by offering discounts to selected customers.

I don't doubt that algos target special deals at specific customers. But I fail to understand why customers who love receiving coupons would object to the required disclosure of surveillance pricing - in fact, it would be great marketing to inform these customers that algos found them good deals; they might learn to like algorithms!

Surely, no consumers should object to such disclosure.

The industry apparently wants us to believe that the primary objective of surveillance pricing is to deliver discounts to customers. Anyone who read Chapters 3 to 5 (marketing data) of Numbersense (link) should recognize it as hogwash. Discounts result in lower revenues, unless the business can somehow prevent existing customers from using them. Whatever "leakage" happens, the business has to make up for these "lost" revenues. Thus, the same algos are likely to raise prices on other customers. With the amount of data at their disposal, it's not hard to figure out which customers are less price-sensitive, or have higher "willingness to pay".

I have worked on such algos. What kind of reception do you think data scientists would get from the business teams if we present to them an algorithm that delivers discounts to selected customers, leading to the predicted outcome of lower total expected revenues (and lower profitability)?

***

The usual script from the industry bodies also includes the false claim that telling the truth is "misleading". This is the same script used to oppose non-GMO labels. In this instance, they assert that customers will interpret the mandated disclosure as evidence of "price gouging".

These industry honchos aren't alarmed when consumers falsely believe that prices are fixed for everyone - when such disclosure isn't required!

I must digress to complain about another industry practice that is gaining popularity by the day, at least in the U.S. Many stores don't even bother putting up price tags. Some restaurants and coffee shops put up menus without prices. I just walked into a Vietnamese diner this afternoon, hoping to get an iced coffee to combat the oppressive heat in New York - well, the menu of banh mi and side dishes is reasonably priced but the beverage and dessert sections have no prices! I walked out, disgusted. The iced coffee, based on redacted prices, is probably outrageously expensive, either compared to their past pricing, or compared to their peers. I'm guessing at least $6 (plus tax and the delightful... tip screen).

In their war against disclosures, some businesses won't even put up their prices.

And yet, we should believe that they won't use our data to maximize their profits.

Light entertainment: Acid Images

Kaiser Fung — Wed, 02 Jul 2025 20:00:00 -0400

A contact commented on the following chart circulating on LinkedIn to promote Portugal:

His main complaint: the flag of Portugal is wrong!

Imagine.

***

A couple of things to note about this image.

I clicked on the "CR" logo on the top left corner, and learned about something called Content Credentials. It tells me that the image was generated by ChatGPT.

I applaud this effort. Will it stop fraud? Probably not but at least it gives honest people a way to label the work.

***

The second thing is, there are many errors throughout this map. Let's make a list...

I'll get us started.

There are two French flags: one is linked to the second highest value while the other one is linked to the second lowest value.

Students demonstrate how analytics underlie strong dataviz

Kaiser Fung — Mon, 30 Jun 2025 20:00:00 -0400

In today's post, I'm delighted to feature work by several students of Ray Vella's data visualization class at NYU. They have been asked to improve the following Economist chart entitled "The Rich Get Richer".

In my guest lecture to the class, I emphasized the importance of upfront analytics when constructing data visualizations.

One of the key messages is to pay attention to definitions. How does the Economist define "rich" and "poor"? (It's not what you think.) Instead of using percentiles (e.g. top 1% of the income distribution), they define "rich" as people living in the richest region by average GDP, and "poor" as people living in the poorest region by average GDP. Thus, the "gap" between the rich and the poor is measured by the difference in GDP between the average persons in those two regions.

I don't like this metric at all but we'll just have to accept that that's the data available for the class assignment.

***

Shulin Huang's work is notable in how she clarifies the underlying algebra.

The middle section classifies the countries into two groups, those with widening vs narrowing gaps. The side panels show the two components of the gap change. The gap change is the change in the richest region minus the change in the poorest region, so a loss in the poorest region adds to the gap.

If we take the U.S. as an example, the gap increased by 1976 units. This is because the richest region gained 1777 while the poor region lost 199. Germany has a very different experience: the richest region regressed by 2215 while the poorest region improved by 424, leading to the gap narrowing by 2638.
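
The arithmetic behind these two examples can be checked in a few lines. This is only a sketch: it treats the gap as richest minus poorest, the function name `gap_change` is made up for illustration, and the inputs are the figures quoted above. For Germany it returns 2639 rather than 2638, a one-unit difference that presumably comes from rounding in the underlying values.

```python
# Changes in regional average GDP, as quoted in the text above.
changes = {
    "U.S.": {"richest": 1777, "poorest": -199},
    "Germany": {"richest": -2215, "poorest": 424},
}


def gap_change(richest_change, poorest_change):
    """Change in the rich-poor gap, assuming gap = richest - poorest."""
    return richest_change - poorest_change


for country, c in changes.items():
    delta = gap_change(c["richest"], c["poorest"])
    direction = "widened" if delta > 0 else "narrowed"
    print(f"{country}: gap {direction} by {abs(delta)}")

# U.S.: gap widened by 1976
# Germany: gap narrowed by 2639
```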

Note how important it is to keep the order of the countries fixed across all three panels. I'm not sure how she decided the order of these countries, which is a small oversight in an otherwise excellent effort.

Shulin's text is very thoughtful throughout. The chart title clearly states "rich regions" rather than "the rich". Take a look at the bottom of the side panels. The label "national AVG" shows that the zero level is the national average. Then, the label "regions pulled further ahead" perfectly captures the positive direction.

Compared to the original, this chart is much more easily understood. The secret is the clarity of thought, the deep understanding of the nature of the data.

***

Michael Unger focuses his work on elucidating the indexing strategy employed by the Economist. In the original, each value of regional average GDP is indexed to the national average of the relevant year. A number like 150 means the region has an average GDP for the given year that is 50% higher than the national average. It's tough to explain how such indices work.
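
A small sketch may help show why such indices are hard to read. The GDP figures below are hypothetical, picked only so that the first index comes out at 150, and `index_to_national` is an illustrative name, not anything from the Economist's methodology. The second pair of numbers shows a region whose raw GDP rises while its index falls, because the national average rose faster.

```python
def index_to_national(regional_gdp, national_gdp):
    """Regional average GDP expressed as an index, national average = 100."""
    return 100 * regional_gdp / national_gdp


# Hypothetical values for one region in two different years.
index_start = index_to_national(regional_gdp=60_000, national_gdp=40_000)
index_end = index_to_national(regional_gdp=66_000, national_gdp=48_000)

print(index_start)  # 150.0 -> 50% above the national average
print(index_end)    # 137.5 -> raw GDP grew, yet the index dropped
```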

Michael's revision goes back to the raw data. He presents them in two panels. On the left, the absolute change over time in average GDP is presented for each of the richest and poorest regions, while on the right, the relative change is shown.

(Some of the country labels are incorrect. I'll replace with a corrected version when I receive one.)

Presenting both sides is not redundant. In France, for example, the richest region improved by 17K while the poorest region went up by not quite 6K. But 6K on a much lower base represents a much higher proportional jump as the right side shows.

***

Related to Michael's work, but even simpler, is Debbie Hsieh's effort.

Debbie reduces the entire exercise to one message - the relative change over time in average GDP between the richest and poorest region in each country. In this simplest presentation, if both columns point up, then both the richest and the poorest region increased their average GDP; if both point down, then both regions suffered GDP drops.

If the GDP increased in the richest region while it decreased in the poorest region, then the gap widened by the most. This is represented by the blue column pointing up and the red column pointing down.

In some countries (e.g. Sweden), the poorest region (orange) got worse while the richest region (blue) improved slightly. In Italy and Spain, both the best and worst regions gained in average GDPs although the richest region attained a greater relative gain.

While Debbie's chart is simpler, it hides something that Michael's work shows more clearly. If both the richest and poorest regions increased GDP by the same percentage amount, the average person in the richest region actually experienced a higher absolute increase because the base of the percentage is higher.
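
Here is a minimal illustration of that point, using made-up base values (60,000 and 20,000) and a made-up 10% growth rate for both regions. None of these numbers come from the dataset; they only show that equal relative gains still widen the absolute gap.

```python
def absolute_gain(base_gdp, pct_increase):
    """Absolute increase implied by a percentage increase on a given base."""
    return base_gdp * pct_increase / 100


# Hypothetical bases for the richest and poorest regions of one country.
richest_base = 60_000
poorest_base = 20_000
pct = 10  # same relative growth in both regions

rich_gain = absolute_gain(richest_base, pct)
poor_gain = absolute_gain(poorest_base, pct)

gap_before = richest_base - poorest_base
gap_after = (richest_base + rich_gain) - (poorest_base + poor_gain)

print(rich_gain, poor_gain)   # 6000.0 2000.0
print(gap_before, gap_after)  # 40000 44000.0
```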

***

The numbers across these charts aren't necessarily well aligned. That's actually one of the challenges of this dataset. There are many ways to process the data, and small differences in how each student handles the data lead to differences in the derived values, resulting in differences in the visual effects.

Decluttering charts

Kaiser Fung — Mon, 23 Jun 2025 20:00:00 -0400

Enrico posted about the following chart, addressing the current assault on scientific research funding, and he's worried that poor communication skills are hurting the cause.

He's right. You need half an hour to figure out what's going on here.

Let me write down what I have learned so far.

The designer only cares about eight research areas - all within the IT field - listed across the bottom.

Paired with each named research area are those bolded blue labels that run across the top (but not quite). I think they represent the crowning achievement within each field but I'm just guessing here.

It appears that each field experiences a sequence of development stages. Typically, universities get things going, then industry R&D teams enter the game, and eventually, products appear in the market. The orange, blue and black lines show this progression. The black line morphs into green, and may even expand in thickness - indicating progressive market adoption and growth.

For example, the first field from the left, digital communications, is shown to have begun in 1965 at universities. Then, in the early 1980s, industry started investing in this area. It was not until the 1990s that products became available, and not until the mid 2000s that the market exceeded $10 billion.

Even now, I haven't resolved all its mysteries. The difference between a solid black line and a dotted black line is not explained. Further, it appears possible to bypass $1 billion and hit $10 billion right away.

***

Next, we must decipher the strange web of gray little arrows.

It appears that the arrows can go from orange to blue, blue to orange, blue to black, orange to black. Under digital communications, I don't see black or green back to blue or orange. However, under computer architecture, I see green to orange; under parallel & distributed systems, I see green to blue. I don't see any black to orange or black to blue, so black is a kind of trapping state (things go in but don't come out). Sometimes, it's better to say which direction is not possible - in this case, I think every direction is possible, except that nothing comes out of black.

It remains unclear what sort of entity each arrow depicts. Each arrow has a specific start and end time. I'm guessing it has to do with a specific research item. Taking the bottom-most arrow for digital communications, I suppose something began in academia in 1980 and then attracted industry investment around 1982. An arrow that points backwards from industry to academia indicates that universities pick up new research ideas from industry. Digital communications things tend to have short arrows, suggesting that it takes only a few years to bring a product to market.

To add to this mess, some arrows cross research areas. These are shown as curved arrows, rather than straight arrows. For these curved arrows, the "slope" of the arrow no longer holds any meaning.

The set of gray arrows is trying too hard. The arrows are overstuffed with purposes. On the one hand, the web of arrows - and I'm referring to those between research areas - portrays the synergies between different research areas. On the other hand, the arrows within each research area show the development trajectories of anonymized subjects. The arrows going back and forth between the orange and blue bars show the interplay between universities and industry research groups.

***

Lastly, we look at those gray text labels at the very top of the page. That's a grab-bag of corporate names (Motorola, Intel, ...) and product names (iPhone, iRobot, ...). Some companies span several research areas. I'm amused and impressed that apparently a linear sequence can be found for the eight research areas such that every single company has investments in only contiguous areas, precluding the need to "leapfrog" certain research areas!

Actually, no, that's wrong. I do notice Nvidia and HP appearing twice. But why is Google not part of digital communications next to iPhone?

Given that no universities are listed, the company and product labels are related to only the blue, black or green lines below. It might be only related to black and/or green. I'm not sure.

***

So far, I've expended energy only to tease out the structure of the underlying dataset. I haven't actually learned anything about the data!

***

The designer has to make some decisions because the different potential questions that the dataset can address impose conflicting graphical requirements.

If the goal is to surface a general development process that repeats for every research area, then the chart should highlight commonality, rather than difference. By contrast, if one's objective is to illustrate how certain research areas have experiences unique to themselves, one should choose a graphical form that brings out the differences.

If the focus is on larger research areas, then the relevant key dates are really the front ends of each vertical line; nothing else matters. By contrast, if one wants to show individual research items, then many more dates become pertinent.

A linear arrangement of the research areas will not perform if one's goal is to uncover connections between research areas. By contrast, if one attempts to minimize crossovers in a network design, it would be impossible to keep all elements belonging to each research area in close proximity.

A layering approach that involves multiple charts to tell the whole story may be the solution. See for example Gelman's post on ladder of abstraction.

Nonlinear thinking in marketing

Kaiser Fung — Sun, 15 Jun 2025 20:00:00 -0400

I recently had the pleasure of listening to Avinash live at an event sponsored by Precise TV. Avinash is someone who gets marketing analytics, and he is a great communicator as well.

The talk is centered on his "See, Think, Do, Care" framework, which is posited as a challenger to the dominant schematic of a "marketing funnel".

For readers unfamiliar with how marketers think: they think linearly, not unlike the rest of the world. A marketing funnel is a classical way of organizing the marketing function.

In the example above, the universe of the potential customers of a business is divided into four groups: the first group consists of people who are aware of the company's products but not yet considering a purchase; the second group are those thinking about buying; the third group are those who decide to buy for the first time; and the last group are those who have purchased more than once.

In this funnel setup, a marketing team can be split into four sub-teams. The first team focuses on driving awareness; the second team's goal is to get people from aware to interested; the third team - which is the core of marketing - wants to "convert" interested prospects into first-time customers; lastly, the loyalty team's job is customer retention, indicated by return purchases.

The funnel describes a linearized world. Each person enters the top of the funnel, and marketing's job is to push them down the funnel as far as possible, as quickly as possible, and keep them there.

Avinash opposes this linear view of the world. In his "See, Think, Do, Care" framework, he also sets up four groups but people can move from any group to any other. He calls these groups "audience intent clusters".

The "See" group consists of people who are just looking around. The "Think" group are those who have expressed some interest - in the digital world, interest is evidenced by specific behaviors (such as clicking on some link). The "Do" group are those who are close to buying, for example, those who have moved an item to their shopping cart. The "Care" group are the return customers. Unlike traditional funnel users, Avinash wisely sets the bar higher. Someone who has made just one purchase isn't a target; someone who has made two or more purchases is worth cultivating. It's common sense, yet he's right - most marketers see anyone who has bought something as potentially "loyal". The problem with such an approach is that most of the loyalty marketing dollars would be wasted on people with no intent of returning. Why not focus the spending on those with a higher chance of future business?

Avinash points out the lack of "care" in how many businesses deal with the "care" segment. This is particularly true of technology companies. Tech support FAQs, and a support phone number that's hidden from view show return customers not love but indifference.

The key idea of the talk: no one is trapped in one of the four stages, waiting for the marketers to shove them one step down. A "loyal" customer might be browsing the brand's Instagram channel, and her intent might be "see", not "do". So, the content shown to her should incite curiosity, rather than hard selling.

There's more here (link) in Avinash's own words.

Out of line

Kaiser Fung — Tue, 10 Jun 2025 20:00:00 -0400

This simple chart showing life expectancies in 10 countries raises one's eyebrows.

The first curiosity is the deliberate placement of Pakistan behind India and China. Every nation is sorted from lowest to highest, except for Pakistan. Is the reason politics? I have no idea. If you have an explanation, please leave a comment.

***

This graphic is an example of data visualization that does not actually show the data.

The positions of the flags do not in fact encode the data! For example, the Indian flag is closer to the Chinese flag than to the Pakistani flag even though the gap between India and China (7) is more than double the gap between India and Pakistan (3).
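
To make "encoding the data" concrete, here is a rough sketch of a linear scale. The three life expectancy values are illustrative only, chosen to reproduce the gaps of 3 and 7 quoted above; `linear_position` and the 0-100 axis length are likewise assumptions for the example.

```python
# Illustrative values only, picked to match the gaps quoted in the post
# (India - Pakistan = 3, China - India = 7).
life_expectancy = {"Pakistan": 67, "India": 70, "China": 77}


def linear_position(value, lo, hi, length=100):
    """Map a value to a position on an axis of the given length."""
    return length * (value - lo) / (hi - lo)


lo = min(life_expectancy.values())
hi = max(life_expectancy.values())
positions = {
    country: linear_position(value, lo, hi)
    for country, value in life_expectancy.items()
}

print(positions)  # {'Pakistan': 0.0, 'India': 30.0, 'China': 100.0}
```

On such a scale, the India flag sits more than twice as far from China as from Pakistan, which is the opposite of what the original graphic shows.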

Here is what it looks like if the gaps encode the data. With this selection of countries, Pakistan and India are separated from the rest.

In the original chart, the readers must read the data labels to understand it, and resist interpreting the visual elements.

I removed the flag poles because they have the unintended consequence of establishing a zero level (where the cartoon characters stand) but the positions of the flags don't reflect a start-at-zero posture.

Returning to our first topic for a second. If the message of the chart is to single out Pakistan, it actually works! If all other countries are sorted by value, with Pakistan inserted out of order, it draws our attention.

In a conventional layout, Pakistan is shoved to the left side in the bottom corner. See below:

Interpreting margins of error in tennis calls

Kaiser Fung — Sun, 08 Jun 2025 20:00:00 -0400

A commentator at the French Open recently complained that the human line judge made a mistake: "Hawkeye's error is 3 cm and the ball was out by 4 cm. So the line judge is wrong to call it in."

The commentator got this all wrong. The divergence of opinion should reduce one's confidence in Hawkeye's estimate. Let me explain why.

Hawkeye's goal when it comes to judging line calls can be described, in stylized terms, as determining the center of the landing spot of the tennis ball. It's helpful to first look at what happens without a margin of error. From this estimated location, we draw a ball given the diameter of a tennis ball, and figure out if the ball overlaps with the line on the court. If it doesn't overlap, then the computer decides that the ball is outside the line.

The idea of a margin of error of 3 cm is visualized as drawing a circle of radius 3 cm around the estimated location. Now, from any point inside this circle, we draw tennis balls as before; and if none of these balls overlap with the line on the court, then the computer decides that the ball is outside the line. By involving the margin of error, we explicitly embrace the uncertainty of estimation. For sure, fewer out calls will be issued relative to the case when we just use a single location estimate.
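
The two decision rules just described can be sketched in one dimension, measuring everything perpendicular to the line. This is a simplification for illustration, not Hawkeye's actual procedure: the function names are invented, distances are in millimetres to keep the arithmetic exact, and the 33 mm ball radius is a rough assumption (about half the diameter of a tennis ball).

```python
BALL_RADIUS_MM = 33  # assumption: roughly half the diameter of a tennis ball
MARGIN_MM = 30       # the 3 cm margin of error quoted in the post


def call_point_estimate(centre_mm, radius_mm=BALL_RADIUS_MM):
    """Call based on the estimated centre alone.

    centre_mm is the distance of the estimated centre beyond the outer
    edge of the line. The ball is out if its edge clears the line.
    """
    return "out" if centre_mm - radius_mm > 0 else "in"


def call_with_margin(centre_mm, radius_mm=BALL_RADIUS_MM, margin_mm=MARGIN_MM):
    """Call 'out' only if every centre inside the margin circle is out.

    It is enough to test the centre closest to the line.
    """
    closest_possible_centre = centre_mm - margin_mm
    return "out" if closest_possible_centre - radius_mm > 0 else "in"


# Ball edge estimated 25 mm (2.5 cm) beyond the line: centre at 58 mm.
print(call_point_estimate(58), call_with_margin(58))  # out in
# Ball edge estimated 40 mm (4 cm) beyond the line: centre at 73 mm.
print(call_point_estimate(73), call_with_margin(73))  # out out
```

The first case is one where the single-location rule issues an out call and the margin-of-error rule does not.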

The fundamental problem in statistics is that Hawkeye gets one chance (one sample) to get this right. If Hawkeye used a deterministic process, then given the same inputs (videos, etc.), it would always generate the same estimated ball location. In real-world systems, whether it's because of noise in the system, or some stochastic element in Hawkeye's process, the same inputs lead to different estimates. The margin of error describes how much these estimates vary.

The reported margin of error holds that Hawkeye's estimate is unlikely to be off by more than 3 cm. In other words, that expanded circle of radius 3 cm is expected to capture the "true" center of the ball's landing location.

The word is "unlikely" rather than "impossible". All margins of error come with a confidence level; usually it is 95% confidence. This means there is a 5 percent chance that Hawkeye may be off by more than 3 cm from the true location.
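
As a rough check on what a 95% margin implies, suppose (purely as an assumption) that Hawkeye's error perpendicular to the line is normally distributed around zero, with the 3 cm margin sitting at 1.96 standard deviations. Nothing in the post says the errors are normal; this is just the textbook reading of a 95% margin of error.

```python
from statistics import NormalDist

MARGIN_CM = 3.0
Z_95 = 1.96  # standard normal quantile for a two-sided 95% interval

# Assumed error distribution: normal, mean zero, margin = 1.96 sigma.
error = NormalDist(mu=0.0, sigma=MARGIN_CM / Z_95)

p_within = error.cdf(MARGIN_CM) - error.cdf(-MARGIN_CM)
p_beyond = 1 - p_within

print(f"within 3 cm: {p_within:.3f}, beyond 3 cm: {p_beyond:.3f}")
# within 3 cm: 0.950, beyond 3 cm: 0.050
```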

Hawkeye's estimate is not error-free as the commentators assumed, even after allowing for the margin of error.

I'm curious about the margin of error associated with humans inspecting ball marks on the clay - I suspect it's small. (The error of judging the balls in flight, by contrast, is certainly much higher.)

At tournaments that use Hawkeye, the players are forbidden from challenging calls. Let's subvert the process and exchange the roles of humans and machines.

Assume Hawkeye makes the first call, in or out. If the player disagrees with the call, he or she raises a challenge, and the umpire (and/or line judge) goes to inspect the mark on the clay. Now, the umpire's word is final, no complaints allowed.

As an example, Hawkeye determines that the ball is 2.5 cm outside the line, which is less than 3 cm, thus the machine rules it "in". A player protests. The umpire decides that the mark on the ground is wholly outside the line, and changes the call to out. How will the commentators react?

If their reaction is not colored by a preference for machines over humans, they will say that the machine has made a mistake - and to accord with their current behavior (in reverse), they should then recommend that the tournament remove line-calling machines because they are not accurate.

This is an instance in which reversing the players makes clear one's biases.

If we take a Bayesian view of this, we should combine the evidence. In the first step, we have one estimate. Now, if the second estimate conforms with the first, then the evidence becomes stronger. But if the second estimate contradicts the first, then the evidence weakens. This is why I said at the start that the divergence of opinion causes me to lower my confidence in Hawkeye's estimate.

Even more, I believe that the human estimate derived from the mark on the ground is more accurate anyway so I'd give that even more weight.
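
The Bayesian updating described above can be sketched with a toy calculation. Every number here is hypothetical: a 95% starting belief that the ball was out (after Hawkeye's call), and a second observer who is right 90% or 99% of the time regardless of where the ball landed. The function `posterior_out` is an illustrative name, and the two opinions are assumed to be independent.

```python
def posterior_out(prior_out, second_says_out, second_accuracy):
    """P(ball was out) after a second, independent opinion.

    second_accuracy is the assumed chance that the second observer is
    right, taken to be the same for balls that are in and balls that are out.
    """
    if second_says_out:
        like_out, like_in = second_accuracy, 1 - second_accuracy
    else:
        like_out, like_in = 1 - second_accuracy, second_accuracy
    numerator = prior_out * like_out
    return numerator / (numerator + (1 - prior_out) * like_in)


prior = 0.95  # hypothetical belief that the ball was out after Hawkeye's call

agree = posterior_out(prior, second_says_out=True, second_accuracy=0.90)
disagree = posterior_out(prior, second_says_out=False, second_accuracy=0.90)
disagree_trusted = posterior_out(prior, second_says_out=False, second_accuracy=0.99)

print(f"{agree:.3f} {disagree:.3f} {disagree_trusted:.3f}")
# 0.994 0.679 0.161
```

Agreement pushes the belief above the starting 95%, disagreement pulls it down, and the more accurate the second observer is assumed to be, the further it falls.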

P.S. Outside of clay courts, the situation is more complicated as there are no ball marks to look at. I'm not against the technology. I'm against the illusion of perfection, and I'm against black-box technology that stifles dissent. Both these issues can be addressed by how technology is applied.

Electronic line calling vs ground truth in tennis

Kaiser Fung — Thu, 05 Jun 2025 20:00:00 -0400

The American commentators at the French Open have been making a fuss over the tournament's decision to favor human line judges over electronic line calling. Their whining centers on two arguments:

Certain cases in which the electronic call differed from the umpire's decision, for which they claim one of the players was robbed of a point
The process of letting players dispute close calls wastes too much time

These commentators treat computers as infallible. The attitude is apparent when they say things like "The umpire ruled the ball in even though Hawkeye [a brand name of such technology] says it's out by 4 cm."

Here, we have two sources of opinion about where the ball landed. Hawkeye issues its opinion, based on collating video images, and some modeling. The umpire's opinion comes primarily from the evidence on the ground (on a clay court, the ball leaves a mark on the ground), aided by the prior call of the line judge. In the view of these commentators, when the two opinions differ, the computer wins.

In effect, the computer animation is taken as ground truth against which the umpire's calls are evaluated. The computer animation is never wrong. So, it's not the evidence that favors the computer, but a presumption of its superiority.

And yet, on a clay court, the truth is literally on the ground. The ball leaves a mark on striking the surface. Before the computer age, a player could dispute a close call, and the umpire would inspect the mark and confirm or overturn the line judge's call. I don't see any problems with this arrangement. In fact, it uses the best evidence available - the ground truth.

Instead of the actual evidence, the commentators prefer a "modeled" truth - the abstract reconstruction of the ball's landing, and when they chastise the umpire for making the wrong call, they effectively invalidate the ground truth.

What about the time saving argument? The current practice of the umpire inspecting the ground truth takes little time, almost always less than a minute. It may take a little longer if the player makes a scene, even though no umpire is going to take the player's words over their own eyes.

It's not that computer technology doesn't take time. It would have taken a similar amount of time to watch the animated video of the ball hitting the ground. Electronic calls save time because electronic line judges are designated as dictators - players are disallowed from contesting any call.

The time saving comes not from checking the call but from banning challenges! No player can make a scene since no challenges are allowed.

But, they could have achieved the same result by making line judges dictators. Or, if they want to allow the ground truth to be inspected, then make the umpires dictators. The umpire can be called to inspect the marks, but his/her decision is final, and any player making a scene would get a demerit. That would move the game along, as these commentators seem to want.

See my previous post about the illusion of perfection in automated line calling.