Evan A. Sultanik, Ph.D.

Evan's First Name @ Sultanik .com

Computer Security Researcher
Trail of Bits

Recent Content:

PoC‖GTFO Issue 0x00

International Journal of PoC‖GTFO Issue 0x00, a CFP with PoC

An epistle from the desk of Rt. Revd. Pastor Manul Laphroaig


Copyright Quandary

Ceci n'est pas une pirate.

A couple years ago I wrote about some thought experiments related to copyright law. Let's say you have a huge number, $n$, two documents, $D_1$ & $D_2$, and a cipher, $f$, that, given the number and the first document, yields the second document: $f(n, D_1) = D_2$. Now let's say $D_1$ is some book in the public domain, whereas $D_2$ is a copyrighted book. In my previous article I wondered,

What is the copyright status of $n$? Can a number be copyrighted?

The difficulty arises from the fact that we have reconstructed a copyrighted artifact solely using artifacts in the public domain. I subsequently posed this scenario to some lawyers, and their opinion was that either the cipher $f$ would be in violation of $D_2$'s copyright, or the copyright status would depend on the intent of the cipher’s creator.

But what if, for the same number $n$, there is another pair of documents, $D_3$ & $D_4$, for which $f(n, D_3) = D_4$? What if $D_3$ and $D_4$ are both in the public domain? Is such an occurrence even possible? It turns out that it is, and it's probably more likely than you might think. Read on to find out.

I decided to construct a general purpose cipher $f$. It actually is quite simple using a Running Key Cipher. The idea is that we use the first document, $D_1$, as a sort of public key: the huge number $n$ is factored into smaller numbers which are indexed into $D_1$ to produce $D_2$. Furthermore, for long enough documents, it is possible to find a number that will both produce $D_2$ from $D_1$ and will produce $D_4$ from $D_3$!
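
To make the idea concrete, here is a minimal toy sketch of the scheme (the helper names and toy strings are purely illustrative; my actual proof-of-concept script appears at the end of this post):

# Toy illustration of the pair-indexing idea; not the real script.
def build_pair_table(key1, key2):
    # Map each (char from key1, char from key2) pair to the first 1-based
    # position at which that pair occurs when the keys are read in parallel.
    table = {}
    for i, pair in enumerate(zip(key1, key2), 1):
        table.setdefault(pair, i)
    return table

def encrypt(table, plain1, plain2):
    # Encode two equal-length plaintexts as a single stream of indices.
    return [table[(a, b)] for a, b in zip(plain1, plain2)]

def decrypt(indices, key):
    # The same index stream decodes to a different text for each key.
    return "".join(key[i - 1] for i in indices)

# Toy "public domain" keys constructed so that every character pair occurs.
alphabet = "abcdefgh"
key1 = "".join(c * len(alphabet) for c in alphabet)  # "aaaaaaaabbbbbbbb..."
key2 = alphabet * len(alphabet)                      # "abcdefghabcdefgh..."

n = encrypt(build_pair_table(key1, key2), "bead", "face")
print(n)                 # one shared number stream
print(decrypt(n, key1))  # "bead" when decrypted against key1
print(decrypt(n, key2))  # "face" when decrypted against key2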

I wrote a short Python script, below, that implements the cipher. It currently reads books from Project Gutenberg as the documents for a proof-of-concept, but any document (including copyrighted works!) would work. As an example, I selected Newton's Opticks as Document 1, Russell's Analysis of the Mind as Document 2, Plato's Republic as Document 3, and Paine's Age of Reason as Document 4. Running my cipher results in the following output:

1 2 1 2 1 2 1 2 155 33 103 108 68 69 56 108 38 1540 11894 617 2446 59 1080 332 110 386 722 3860 15 170 207 6966 186 62 262 261 505 454 184 183 145 454 6743 96 1023 492 411 1567 110 115 1020 4872 834 59 834 617 103 104 38 1054 157 1756 2446 162 353 141 4081 645 28 1431 492 10 382 47 470 110 306 112 52843 543 867 1409 15 249 1020 120 149 142 145 3816 24344 24506 49059 120 63 353 1910 351 5106 496 141 351 145 1851 5710 713 356 277 214 372 589 17 986 488 94 101 217 176 3841 189 345 215 1414 355 268 1081 67 435 848 187 155 5169 1164 273 165 183 141 62 172 582 1977 411 46 8260 2168 49 2126 24 559 369 386 17 379 618 217 391 15 4970 114 60 2402 17 435 215 702 285 122 199 112 24 180 1282 90 861 331 597 17 285 821 389 99 4529 934 1668 47 3100 510 378 315 345 188 2394 155 5169 318273 341197 230969 345 355 48959 321398 315037 316250 1911 181 460 262 355 101 221 1353 229 244 662 589 241 809 988 169 77 277 214 296 200 84 176 208 391 112 27 244 968 119 331 47 1260 702 285 724 492 189 775 170 96 440 702 355 554 19927 842 182 1896 379 345 2023 241 19506 3467 726 286 401 189 602 68 59 92...

While I realize that does not amount to the single, huge number I spoke of earlier, note that it can easily be converted to a large number, e.g., by treating each number as a digit in a very large base. This number is basically meaningless without $D_1$ and $D_3$. But with those documents we can recreate $D_2$ and $D_4$:

$ ./decrypt.py huge_number opticks.txt
the analysis of mind
by bertrand russell

1921




muirhead library of philosophy


an admirable statement of the aims of the library of phiosophy was
provided by the first editor, the late professor j. h. muirhead, in his
description of the original programme printed in erdmanns history of
philosophy under the date 1890. this was slightly modified in subsequent
volumes to take the form of the following statement...
$ ./decrypt.py huge_number republic.txt
                               the writings

                                    of

                               thomas paine

                         collected and edited by

                            moncure daniel conway

                                 volume iv.
...

So here I have demonstrated a method for generically generating a huge number that, when applied to two books in the public domain, produces two other books in the public domain. Note that any of these books could just as easily have been a copyrighted text. What would happen if $D_4$, instead of being Paine's Age of Reason, were some copyrighted text that I did not have a license to distribute? Would I be able to distribute $n$ as long as I didn't let anyone know that, when combined with Plato's Republic, it produced the copyrighted material?

Many laws and legal decisions hinge on one's intent. But what if one legitimately only knew of the use of $n$ with respect to $D_1$ and $D_2$, and didn't know of its nefarious use to produce $D_4$ from $D_3$? I therefore asked myself, “How likely would it be for one ciphertext $n$ to be able to produce more than one result document?” The answer surprised me!

Let’s say we have a ciphertext $n$ that we generated from $D_1$ and $D_2$. How likely is it that there exist two other books $D_3$ & $D_4$ such that $n$ will also generate the copyrighted text $D_4$ from $D_3$? Note that I am not taking relative letter frequency into account: I assume that each character in a document is an independent random variable. This is a rather unrealistic assumption, but we shall see that the asymptotic properties of the problem should nullify it. (This assumption could be relaxed by instead applying the Lovász local lemma.) First, let’s tackle the problem of figuring out the probability that a given pair of documents $D_3$ and $D_4$ will work. Let $m$ be the length of the documents in characters and let $k \lt m$ be the minimum required length of a string for that text to be considered a copyright violation (i.e., outside of fair use). The probability that $f(n, D_3)$ contains no substrings of length at least $k$ from $D_4$ is

$\left(1 - p^k\right)^{(m - k + 1)}$,

where $p$ is the probability that a pair of characters is equal. Here we have to take into account letter frequency in English. Using the table from Wikipedia, I calculated $p$ to be roughly 6.5 percent (it’s the sum of squares of the values in the table). According to Google, there are about 130 million books that have ever been written. Let’s be conservative and say that 2 million of them are in English. Therefore, the probability that at least one pair of those books will produce a copyrighted passage from $n$ is

$1-\left(\left(1 - p^k\right)^{(m - k + 1)}\right)^{2000000 \choose 2}$,

which is going to be extremely close to 1.0 for $k \lt m \ll 2000000$.

Therefore, for any choice of $n$ it is almost certain that there is a pair of books that one can choose that will cause a copyright violation, even if those books are unknown!
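
As a quick sanity check on that 6.5 percent figure, here is a short snippet that computes $p$ as the sum of squared letter frequencies (the frequency values below are the commonly cited approximate English letter frequencies, so the result is itself approximate):

# Approximate relative frequencies of English letters (rounded, commonly
# cited values); the exact result depends on which table you use.
freq = {
    'e': 0.127, 't': 0.091, 'a': 0.082, 'o': 0.075, 'i': 0.070, 'n': 0.067,
    's': 0.063, 'h': 0.061, 'r': 0.060, 'd': 0.043, 'l': 0.040, 'c': 0.028,
    'u': 0.028, 'm': 0.024, 'w': 0.024, 'f': 0.022, 'g': 0.020, 'y': 0.020,
    'p': 0.019, 'b': 0.015, 'v': 0.0098, 'k': 0.0077, 'j': 0.0015,
    'x': 0.0015, 'q': 0.00095, 'z': 0.00074,
}

# The probability that two independently drawn English letters match is the
# sum of the squared frequencies (essentially the index of coincidence).
p = sum(f * f for f in freq.values())
print(p)  # roughly 0.065, i.e., about 6.5 percent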

A newer version of the code is now available as the Lenticrypt project!

Here is the old proof-of-concept code:

#!/usr/bin/env python
# Old proof-of-concept for the running key cipher described above (Python 2).
import sys, re, itertools

def character_stream(f):
    # Yield the normalized (lowercased, whitelisted) characters of a Project
    # Gutenberg ebook, skipping everything before the start-of-text marker.
    found_start = False
    # The hyphen is escaped so that it is treated as a literal character
    # rather than as part of an unintended character range.
    prog = re.compile(r'[\w\s\n.,;:\-\\!\\?&]')
    for line in f:
        if not found_start:
            found_start = (line.startswith("*** START OF THIS PROJECT GUTENBERG EBOOK") or line.startswith("*END THE SMALL PRINT!"))
            continue
        for char in line.lower():
            if prog.match(char):
                yield char

# Maps a pair of characters (one from each certificate) to the first
# 1-based position at which that pair occurs.
pairs = {}

def encrypt(certificate1, certificate2, to_encrypt1, to_encrypt2):
    # First pass: build the pair -> index table from the two "certificate"
    # documents, read in parallel.
    f1 = open(certificate1, 'r')
    f2 = open(certificate2, 'r')
    i = 0
    for c1, c2 in itertools.izip(character_stream(f1), character_stream(f2)):
        i += 1
        if (c1, c2) not in pairs:
            pairs[(c1, c2)] = i
    f1.close()
    f2.close()
    # Second pass: encode the two plaintexts, read in parallel, as a single
    # space-separated stream of indices into the certificates.
    f1 = open(to_encrypt1, 'r')
    f2 = open(to_encrypt2, 'r')
    for c1, c2 in itertools.izip(character_stream(f1), character_stream(f2)):
        if (c1, c2) in pairs:
            print pairs[(c1, c2)],
        else:
            sys.stderr.write("Warning: pair " + str((c1, c2)) + " was not in the certificate!\n")
    f1.close()
    f2.close()

def decrypt(key, certificate):
    # The same index stream decrypts to a different plaintext depending on
    # which certificate is supplied.
    with open(certificate) as stream:
        cert = list(character_stream(stream))
    with open(key) as stream:
        k = stream.read()
    for i in k.split():
        sys.stdout.write(cert[int(i) - 1])

if len(sys.argv) == 5:
    encrypt(sys.argv[1], sys.argv[2], sys.argv[3], sys.argv[4])
elif len(sys.argv) == 3:
    decrypt(sys.argv[1], sys.argv[2])
else:
    sys.stderr.write("Usage: %s cert1 cert2 plaintext1 plaintext2 > key\n"
                     "       %s key certificate\n" % (sys.argv[0], sys.argv[0]))
    exit(1)

Streams Data Processing Workshop

A Presentation Introducing Distributed Combinatorial Optimization

I will be presenting a talk introducing distributed combinatorial optimization at the Streams Data Processing Workshop tomorrow. Here are the slides:

Handouts for the talk are also available here.

Sushi Elitism

Three reasons why you've probably never had an authentic sushi experience.

There are a number of myths and misconceptions surrounding both the creation and consumption of sushi that I think are important enough to devote an entire blog entry to clarifying.

Proper sushi has a small amount of wasabi applied to the underside of the fish. Some people claim that "it is to prevent parasites" (via the natural antiseptic properties of the wasabi), but I find this explanation a bit dubious (I would think that the wasabi would have to be uniformly applied to the entire fish to have any measurable effect). The wasabi is really there to add flavor. In really high-end sushi restaurants in Japan, for example, it is relatively uncommon for the guest to be served a mound of grated wasabi; that's the sushi equivalent to serving an un-tossed Caesar salad with the dressing on the side. Instead, the chef applies the perfect amount of wasabi to each piece of sushi. Some chefs will not even serve soy sauce, instead individually brushing the sauce onto each piece. If wasabi is served separately from the fish, it is generally also considered bad form to mix it with the soy sauce (as is commonly done in the US).

So, why might you have never seen this before at your local sushi joint?

  1. It takes more time/skill/effort to do it properly.
  2. Many sushi restaurants in the US do not have properly trained sushi chefs. In fact, in most areas of the country with little or no Japanese population, don't be surprised if your sushi chef is a Salvadorian who learned from the Oaxacan who learned from the Korean owner of the restaurant. Not that there's anything wrong with that; one of my favorite sushi places is run by an Indonesian. Just keep in mind that the people making you your sushi may have never experienced the "real thing" themselves.
  3. "Real," fresh wasabi is very rare and expensive. Most of the stuff that is used in the US is basically horseradish paste plus food coloring. Restaurants that can both procure and afford to use the real stuff will want to use it sparingly; they wouldn't want to waste it by putting a mound of it on each plate. Therefore, they might be more inclined to use the "proper" technique.

Here is a video in which you can see famous chef Naomichi Yasuda using wasabi inside the sushi. It all happens very quickly, but you can clearly see him dip into the bin of wasabi, rub it on the underside of the fish, and then apply the fish to the rice. Here is another video in which Anthony Bourdain explains the "rules" of high-end sushi eating, while dining at the super-famous and super-expensive Sukiyabashi Jiro (the proprietor of which is actually designated as a "living treasure" by the Japanese government). That second video shows the chef brushing on the soy sauce, and the general lack of both soy and wasabi on the plates.

And on to the matter of chopsticks. Some purists claim that you're not supposed to use them, instead using your hands. I've been to Japan, and it really varies by the type of establishment. Here is my take on it: I think it really depends on the formality of the restaurant. In the more formal places, they won't serve any separate wasabi or soy (only gari) and they'll give you a small moist towel (in addition to the oshibori) that you are supposed to use to clean your fingers between pieces of sushi, under the expectation that you will use your hands. If they don't give a towel to clean your fingers, then I think it is acceptable to use chopsticks.

When I say "high-end", I mean a place that specializes in just sushi. In Japan, these places are usually tiny, consisting of only a bar. Each sushi chef services at most 5 customers at a time. Each chef is supported by perhaps one apprentice (who might prepare some of the seafood) and several other apprentices in the kitchen whose primary job is to prepare the rice. Sushi is all about the rice, after all; an apprentice will spend the better part of the first decade of his training just learning how to cook the rice, before even touching any seafood. That's why most casual restaurants in Japan (e.g., izakaya) will forego even serving sushi, instead offering preparations that require less skill/training (e.g., sashimi).

Will eating late at night make you fat?

TL;DR: No.

Despite the awkwardness caused by his recent rants on the topics of modern cooking techniques and, more recently, fandom, Alton Brown is still one of my favorite sources of culinary quotes. One such quote, related to nutrition, is

There are no "bad foods", only bad food habits.

In keeping with my recent posts' unintentional gastronomic theme [1,2], I am going to dedicate this post to debunking a myth related to the above quote. I claim that there is no bad time to eat, only bad eating habits. Specifically, I'd like to dispel the common misconception that the time of one's meal alone can affect weight gain.

I recall hearing a story on NPR a year or two ago about a study which specifically tried to test the claim that eating late at night is unhealthy. The study concluded that there was absolutely no correlation between the proximity of mealtime to sleep and weight gain, other than the fact that people tend to choose to eat more unhealthy foods late at night. I can’t seem to find a reference to that study, however, so I am going to have to do some research myself. Right. Let’s get stuck in.

The majority of our food is actually digested during sleep, so the common argument that “eating late at night is bad because our metabolism [slows or shuts down] during sleep” is incorrect. With that said, there is a correlation between night eating, low self-esteem, reduced daytime hunger, and weight loss among people who are already obese or prone to obesity; however, this correlation does not necessarily imply causation (i.e., the act of eating a late meal does not necessarily provoke these conditions). It may simply be the case that the types of foods that people prefer to eat late at night are less healthy. There is still much debate on the subject; however, many scientists agree that meal frequency, as opposed to time, is one of the best predictors for weight gain. For example, the time between meals is highly correlated with one’s waist size. This makes some intuitive sense, since eating smaller, more frequent meals will help regulate insulin levels, and spikes in insulin levels (which can be caused by large meals and/or large gaps in time between meals) have been linked to weight gain.

A newer study followed the eating and sleeping patterns of 52 subjects over one week. They found a correlation between “late sleepers” (i.e., people who go to sleep late and wake up late) and high body mass index, and that eating after 8pm was associated with higher body mass index. A relatively recent New York Times article summarizing the results of the study makes the further claim that eating late at night leads to weight gain; however, I disagree with that claim on the grounds that correlation does not imply causation. In fact, the original study noted:

Late sleepers consumed more calories at dinner and after 8:00 PM, had higher fast food, full-calorie soda and lower fruit and vegetable consumption.

Therefore, I think the results of the study can be interpreted to mean that there is a correlation between eating/sleeping late and a poor diet.

Furthermore, back in 2006, the same research team conducted a study in which monkeys were fed a diet similar to the average (i.e., high-fat) diet common in the USA. The only variable was the time of day at which the monkeys were fed. With all else remaining constant, the researchers found no correlation between weight gain and time of feeding.

It was really interesting to see that the monkeys who ate most of their food at night were no more likely to gain weight than monkeys who rarely ate at night. This suggests that calories cause weight gain no matter when you eat them.

While I’m on a roll here, let me quickly dispel yet another myth. I know many people that adhere to the strict “calorie-in/calorie-out” nutritional theory. This seems particularly popular among mathy/engineering types. The idea is that if your body burns fewer calories than what it takes in through food then it will gain weight. This theory, in general, is fallacious. The body doesn’t necessarily consume all of the calories of the food we stick down our throats. The time between meals will, in effect, affect the way one’s body “decides” to digest the food. Furthermore, as this study and others like it suggest, not all types of calorie sources are equal when it comes to how the body processes them!

In conclusion, if you’re overweight, it’s not necessarily legitimate to blame your late-night eating schedule. You can’t even blame your high calorie diet (only the types of calories you choose to eat).

The Economics of Eating Poorly

Is it cheaper to eat fast food than to cook a meal from scratch?

Is it cheaper to eat fast food than to cook a meal from scratch? Answering such a question is difficult, given that costs will vary greatly from country to country. Why? Well, raw commodity prices are very volatile due to varying government subsidies, differences in climate, extreme climatic events, supply chains, &c.

The supermarket industry in the US is extremely competitive. SuperValu, for example, is a giant corporation that owns many supermarket chains. They are lucky to eke out a 1.5% profit margin. (In other words, at most 1.5% of their gross income is profit.) That means they could lower their prices by at most ~1.5% without taking a loss. Bulk sellers like Costco and even the giant Wal-Mart are lucky to reach 3%. Successful fast food restaurants like McDonalds, on the other hand, easily reach a profit margin of 20%. That means, in effect, McDonalds could reduce their prices by ~20% and still stay afloat.

Why is this? One major reason is that companies like McDonalds can and do have complete vertical integration of their supply chains: McDonalds raises their own cattle, grows their own potatoes, transports their own raw ingredients, and largely owns the real estate of their restaurants. In fact, McDonalds even makes a profit off of leasing their real estate to McDonalds franchises. That flexibility is one reason why a Big Mac will be worth the equivalent of US\$10 in Brazil while the same Big Mac will be worth US\$1.50 in Croatia. Supermarkets don’t really have much opportunity for vertical integration unless they actually buy and operate their own farms and suppliers.

So, is eating fast food really cheaper? As I mentioned in my previous blog entry about food deserts, in the US there is a definitive link between poverty, obesity, and lack of proximity to a supermarket. There was also a study in the UK that discovered a correlation between the density of fast food restaurants and poverty:

Statistically significant positive associations were found between neighborhood deprivation [i.e., poverty] and the mean number of McDonald’s outlets per 1000 people for Scotland ($p < 0.001$), England ($p < 0.001$), and both countries combined ($p < 0.001$). These associations were broadly linear with greater mean numbers of outlets per 1000 people occurring as deprivation levels increased.

Let’s have some fun now and look at the Consumer Price Index (CPI), which estimates average expenditures for various goods over a month. The current CPI in the US for food and beverages is \$227.5 (meaning that the average consumer spends \$227.5 per month on food and beverages). Now let’s assume that the cost of non-fast-food is cheaper than eating fast food and proceed with a proof by contradiction. Under this assumption, the CPI of \$227.5 is obviously a lower bound on the amount one would spend if one only ate fast food (since the CPI includes all types of food-related expenditures). This equates to about \$8 per day. In 2008, the average price of a Big Mac in the US was \$3.57, and it is certainly possible for one to subsist off of two Big Macs per day. That’s especially true since a Big Mac is one of the more expensive items on the McDonalds menu. A McDouble, for example, costs only \$1. This is a contradiction, i.e., we have shown that it is in fact possible to live off of less than \$8 a day of fast food, thus breaking the lower bound and invalidating our assumption that eating non-fast-food is cheaper than eating fast food.∎
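
Spelling out the arithmetic behind that contradiction: the CPI figure works out to $227.5 / 30 \approx 7.58$ dollars per day, while two Big Macs cost $2 \times 3.57 = 7.14$ dollars per day, which is already below the supposed lower bound.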

This suggests that it is at least plausible that eating fast food could be cheaper than grocery shopping in the US.

Food Deserts

In which I argue that not only desserts but also deserts are correlated with obesity.

What is a food desert? That’s “desert” as in “Sahara,” not the sweet thing you eat at the end of a meal. According to Wikipedia, it's

any area in the industrialised world where healthy, affordable food is difficult to obtain,

specifically associated with low income. There is a good amount of debate about whether such things even exist. The problem, as I see it, is that this definition is very subjective: One can always have access to high-quality or nutritious food if one is willing to spend enough time traveling to it. If a person lives a couple of miles away from a grocery store but has "access" via expensive (to them) public transport, does that constitute "access"? Technically, of course, yes. But here is what I really think the question should be:

Is there any statistical correlation between proximity to full-service grocery stores, obesity, and poverty?

I think the answer to that is "yes". Answering why is a much more difficult (and perhaps open) question. Read on to hear my reasoning.

This was inspired by a recent blog post on a similar topic by my friend Matt (of cooking fame) who is currently working on his Ph.D. in Policy and Management at CMU.

Nearly 13% of all households in Washington D.C. were struggling with hunger in 2007–2009. The district is divided into wards, much like townships. According to D.C. Hunger Solutions,

Wards 7 and 8, which have the District’s highest poverty rates, also have the city’s highest obesity rates and are home to large “food deserts.”
Of the city’s 43 full-service grocery stores*, only two are located in Ward 4, four in Ward 7, and three in Ward 8. By contrast, Ward 3—the highest-income Ward—has eleven full-service stores.

* There are lots of stores—especially in big US cities—that call themselves “grocery stores” that are really bodegas or convenience stores; they don’t sell much in the way of groceries, and usually don’t sell anything in the way of fresh produce or raw meats. I interpret a “full-service grocery store” to be any store that sells fresh produce, raw meats, and perhaps fish.

Ward 8’s poverty rate in 2009 was 35%. I couldn’t find an exact statistic for the area of Ward 8, but it appears to be at least 1/8th the area of the entire district, which is 100 square miles (260 km²). Assuming each of the ward’s three grocery stores serves an equal 100/8/3 ≈ 4 square mile area, I think it is plausible that a good number of the ward’s residents live at least one mile from a supermarket.
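
As a rough sanity check (assuming, purely for illustration, that each store’s territory is a disc of that area): a disc of area $100/8/3 \approx 4.17$ square miles has radius $r = \sqrt{4.17/\pi} \approx 1.15$ miles, so residents near the edge of a store’s territory would indeed be more than a mile away.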

A mile walk to a grocery store isn’t really very far, right? The problem is that, at least in urban environments, there are usually much more convenient and much less healthy options that are closer. Why would I walk 2+ miles to buy some veggies, fruit, and raw ingredients if I could walk to the end of my block and get a prepared fast food hamburger or fried chicken for likely the same price? (When I lived in a not-so-savory part of Philadelphia, I could buy a whole fried chicken at the end of my block for the same price as a raw chicken from the 1-mile-away grocery store). And if one is already obese, that walk to the store is even harder.

Many of these urban centers have extensive public transport systems that would allow car-less residents to commute back-and-forth to a supermarket. Here are some counter-arguments:

  • For people that live below the poverty line (the average per capita income in Camden, NJ, for example, is less than \$12k), a public transit ride for as little as \$3 is a significant expense. It is likely even more expensive in DC.
  • How many shopping bags can one person reasonably carry home without a car? Enough for a week's worth of food for a family of four? I know that at least extrapolating from the way I shop, I'd have to make multiple trips per week to feed a family of 4, which is a further expense.
  • Many of these factors are likely social/cultural in nature; however, that only speaks to the underlying cause and does not change the fact that there is a correlation between availability of produce, obesity, and poverty.

Townsend, et al., did a study on the correlation between food insecurity and obesity. Here is a summary from Oregon State University:

[Obesity] may also result from periodic episodes of food insecurity. For many people, food stamps and money for food run out before the end of the month. Among respondents to the 2004 Oregon Hunger Factors Assessment, 95 percent ran out of food stamps at least 1 week before the end of the month. When money and food stamps become available again, some may overeat low-cost, high-calorie foods that have limited nutrient density. This could result in gradual weight gain over time, especially for mothers with dependents in the household.

In conclusion, I think it is safe to say that food deserts do exist, and they're correlated with poverty and obesity.

Pronunciation of foreign words in American vs. British English

In which my etymological conjectures are repudiated by Peter Shor.

One of the differences between modern US English (hereafter referred to as "American English") and British English is the way in which we pronounce foreign words, particularly those of French origin and/or related to food. For example, Americans…

  • drop the "h" on "herb" and "Beethoven";
  • rhyme "fillet" and "valet" with "parlay" as opposed to "skillet"; and
  • pronounce foods like paella as /paɪˈeɪə/ (approximating a Castilian or Mexican accent), whereas the British pronounce it as /pʌɪˈɛlə/.

In general, the British seem to pronounce foreign/loan words as they would be phonetically pronounced if they were English, whereas Americans tend to approximate the original pronunciation. I've heard some people claim that this trend is due to the melting pot nature of America, and others claim that the French pronunciation, in particular, is due to America's very close relations with France during its infancy. This latter hypothesis, however, seems to be contradicted by the following:

Avoid the habit of employing French words in English conversation; it is in extremely bad taste to be always employing such expressions as ci-devant, soi-disant, en masse, couleur de rose, etc. Do not salute your acquaintances with bon jour, nor reply to every proposition, volontiers. In speaking of French cities and towns, it is a mark of refinement in education to pronounce them rigidly according to English rules of speech. Mr. Fox, the best French scholar, and one of the best bred men in England, always sounded the x in Bourdeaux, and the s in Calais, and on all occasions pronounced such names just as they are written.

I wondered: At what point did the USA drop the apparent British convention of pronouncing foreign words as they are spelled?

I asked this question on English.sx, to little fanfare; most people—including Peter Shor!!!1—had trouble accepting that this phenomenon even exists. Therefore, I extend my question to the Blogosphere!

I did some more digging, and it is interesting to note that there seems to be a trend in "upper-class" (U) English to substitute words that have an obvious counterpart in French with words that are either of Germanic origin or those that do not have a direct equivalent in modern French. For example, in U English:

  • scent is preferred over perfume;
  • looking glass is preferable to mirror;
  • false teeth is preferable to dentures;
  • graveyard > cemetery;
  • napkin > serviette;
  • lavatory > toilet;
  • drawing-room > lounge;
  • bus > coach;
  • stay > corset;
  • jam > preserve;
  • lunch > dinner (for a midday meal); and
  • what? > pardon?

This is admittedly a stretch, but perhaps there is some connection between the US's lack (and some might say derision) of a noble class and its preference toward non-U/French pronunciation?

Are no two snowflakes alike?

A mathematical argument for the negative.

Note: This post contains a lot of math that may not render properly unless you are reading it on my website itself; if you are reading this from an RSS feed, you may want to switch to my website.

I think the claim that "no two snowflakes are alike" is fairly common. The idea is that there are so many possible configurations of snow crystals that it is improbable that any two flakes sitting next to each other would have the same configuration. I wanted to know: Is there any truth to this claim?

Kenneth Libbrecht, a physics professor at Caltech, thinks so, and makes the following argument:

Now when you look at a complex snow crystal, you can often pick out a hundred separate features if you look closely.

He goes on to explain that there are $10^{158}$ different configurations of those features. That's a 1 followed by 158 zeros, which is roughly $10^{78}$ times larger than the total number of atoms in the observable universe. Professor Libbrecht concludes

Thus the number of ways to make a complex snow crystal is absolutely huge. And thus it's unlikely that any two complex snow crystals, out of all those made over the entire history of the planet, have ever looked completely alike.

Being the skeptic that I am, I decided to rigorously investigate the true probability of two snowflakes having possessed the same configuration over the entire history of the Earth.

Let's assume Professor Libbrecht's analysis is correct and there are $10^{158}$ possible snowflake configurations. Roughly $10^{23}$ snow crystals fall on Earth per year. Therefore, by the pigeonhole principle, two similar flakes must fall by the time the Earth reaches the age $10^{158} \div 10^{23} = 10^{135}$ years, which isn't going to happen any time soon. This seems to support Professor Libbrecht's conjecture.

Now let’s assume that each one of the $10^{158}$ possible snowflake configurations is equiprobable; I am not sure if this is a reasonable assumption, but I’ll slightly relax the assumption below. Since the Earth is roughly 4.54 billion years old, let's further assume that at least $(4.54\times 10^9) \times 10^{23} = 4.54\times 10^{32}$ snowflakes have fallen on Earth. We therefore want to know: Given $4.54\times 10^{32}$ items that are randomly chosen from a set of $10^{158}$, what is the probability that no two items are the same?

This is an instance of the classic "Birthday Paradox." Imagine a room full of people. Assuming birthdays are uniformly distributed across the calendar, the probability that any given pair of people in the room has a shared birthday is simply 1 in 365. By the pigeonhole principle, if there are at least 366 (or 367 if we include leap years) people in the room then there must be at least one pair of people that share a birthday. The Birthday Paradox is the fact that the probability that there exists a pair of people in the room that share a birthday rapidly converges to 1 as the number of people increases. For example, with only 23 people in the room, the probability that there are two people with the same birthday is over 50%. With at least 60 people in the room, the probability that there is a shared birthday is nearly 100%. This is a "paradox" because 60 people is a long way away from the guaranteed quantity of 366.
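
Those two percentages are easy to verify numerically; here is a quick sketch (assuming 365 equally likely birthdays) that computes the exact probability of at least one shared birthday among $n$ people:

def birthday_collision(n, days=365):
    # Probability that at least two of n people share a birthday, assuming
    # birthdays are uniformly distributed over `days` possible days.
    p_all_distinct = 1.0
    for i in range(n):
        p_all_distinct *= float(days - i) / days
    return 1.0 - p_all_distinct

print(birthday_collision(23))  # ~0.507: just over 50%
print(birthday_collision(60))  # ~0.994: nearly 100%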

There has been a lot of study of the Birthday Paradox over the years, and a number of neat analysis techniques exist for calculating probabilities of similarity. The problem is that numerically calculating the probability that two snowflakes will be similar is unfortunately intractable given the astronomical numbers with which we're dealing. Therefore, we need to use an approximation. One common approximation is to cast the problem as a "collision problem": Given $n$ random integers drawn from a discrete uniform distribution with range $[1,d]$, what is the probability $P(n;d)$ that at least two numbers are the same? The solution to this problem is

$P(n;d) \approx 1 - \left(\frac{d-1}{d}\right)^{\frac{n(n-1)}{2}}.$

Substituting our values results in $P(4.54\times 10^{32};10^{158}) \approx 1.03\times 10^{-93}$. In other words, under these assumptions, the probability that there have been at least two similar snowflakes thus far in the entire history of the Earth is basically zero.
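
That figure is easy to reproduce. Naive floating point fails here (with $d = 10^{158}$, $\frac{d-1}{d}$ rounds to exactly 1), so the short sketch below evaluates the approximation from above in log space:

from math import expm1, log1p

n = 4.54e32  # estimated snowflakes fallen over the Earth's history
d = 1e158    # Libbrecht's number of possible configurations

num_pairs = n * (n - 1) / 2.0
# log of ((d - 1) / d) ** num_pairs, computed without catastrophic rounding
log_p_all_distinct = num_pairs * log1p(-1.0 / d)
p_collision = -expm1(log_p_all_distinct)  # = 1 - ((d - 1) / d) ** num_pairs
print(p_collision)  # ~1.03e-93, matching the value above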

But wait! Do snowflakes really occur according to a uniform distribution? I’d imagine that some configurations are more rare than others. Like many other natural phenomena, I might guess that the distribution of configurations is closer to a Zipfian (a.k.a. “Power Law”) distribution. Using this distribution unfortunately forces us to abandon the handy approximation we used above; we’ll have to derive the probability from scratch. The probability that all $n$ flakes are different is then this nasty expression:

$\frac{1}{n!}\sum_{x_1=1}^{10^{158}} \sum_{x_2=x_1+1}^{10^{158}} \sum_{x_3=x_2+1}^{10^{158}} \cdots \sum_{x_n=x_{n-1}+1}^{10^{158}} \prod_{i=1}^n \frac{x_i^{-2}}{\sum_{j=1}^{10^{158}} j^{-2}},$

which is all but impossible to solve numerically. Fortunately, we really only need an upper bound on this value. First, let’s rewrite it as

$\frac{1}{n!}\sum_{x_1=1}^{10^{158}} \frac{x_1^{-2}}{\sum_{j=1}^{10^{158}} j^{-2}} \sum_{x_2=x_1+1}^{10^{158}} \frac{x_2^{-2}}{\sum_{j=1}^{10^{158}} j^{-2}} \cdots \sum_{x_n=x_{n-1}+1}^{10^{158}} \frac{x_n^{-2}}{\sum_{j=1}^{10^{158}} j^{-2}}.$

Next, note that

$\frac{x^{-2}}{\sum_{j=1}^{10^{158}} j^{-2}} \leq \frac{x^{-2}}{\frac{10^{158}}{{10^{158}}^2}} = \frac{10^{158}}{x^2},$

allowing us to further bound the original expression by

$\cdots \leq \frac{10^{158}}{n!}\sum_{x_1=1}^{10^{158}} x_1^{-2} \sum_{x_2=x_1+1}^{10^{158}} x_2^{-2} \cdots \sum_{x_n=x_{n-1}+1}^{10^{158}} x_n^{-2}.$

Finally, we arrive at the ultimate bound:

$\cdots \lt \frac{10^{158}}{n!}\left(H_{10^{158},2}\right)^n \lt \frac{10^{158}}{n!}\zeta(2)^n,$

where “$H_{n,r}$” is the generalized harmonic number of order $n$ of $r$ and “$\zeta$” is the Riemann Zeta function. Substituting $n = 10^{23}$ (i.e., the number of snowflakes that fall in a single year), this upper bound is extremely small (incredibly close to zero). Keep in mind that this is the probability that all of the snowflakes have been different. Therefore, assuming snowflakes are configured according to a Zipfian distribution, the probability that at least two snowflakes are alike in a single year is extremely close to 100%!

We can get a similar result using a normal distribution. Since there is only a discrete number of configurations, though, we really have to approximate a truncated discrete normal distribution using a binomial distribution $B(10^{158}, 0.5)$. The probability that all $n$ flakes are different is then this nasty expression:

$\frac{1}{n!}\sum_{x_1=1}^{10^{158}} \sum_{x_2=x_1+1}^{10^{158}} \cdots \sum_{x_n=x_{n-1}+1}^{10^{158}} \prod_{i=1}^n {10^{158} \choose x_i} 2^{-x_i} 2^{x_i-10^{158}},$

which is also all but impossible to solve numerically. Let’s then look at this symbolically, writing $k = 10^{158}$ for brevity and solving for increasingly large values of $n$ (ignoring for the time being the $n!$ term):

Probability that $n$ normally distributed flakes are distinct (ignoring the $n!$ term):

  • $n = 1$: $1-2^{-k}$
  • $n = 2$: $\frac{1}{2}-2^{-k-1}$
  • $n = 3$: $\frac{1}{4}-\frac{1}{4}2^{-k}$
  • $n = 4$: $\frac{1}{8}-\frac{1}{8}2^{-k}$
  • $n = 5$: $\frac{1}{16}-\frac{1}{16}2^{-k}$

Notice the trend? In general, adding back the $n!$ term, this simplifies to

$\frac{2^{1-n}-2^{1-n-k}}{n!}.$

Note, however, that this is the expression for the probability that all snowflakes will be different; we are interested in the probability that at least two are the same, so we need to subtract this value from one:

$1-\frac{2^{1-n}-2^{1-n-k}}{n!} = \frac{n!+2^{1-n-k}-2^{1-n}}{n!}.$

Setting $k = 10^{158}$, this rapidly converges toward 1 even for very small values of $n$. Therefore, if the configurations of snowflakes are non-uniformly distributed, there is a very high probability that at least two snowflakes have been similar!

Revisiting the Ballmer Peak

Might the Ballmer Peak be an actual phenomenon?

Last year I made a rather esoteric joke about a supposed phenomenon called the "Ballmer Peak" that was popularized by a web comic. The idea is that alcohol, at low doses, can actually increase the productivity of a computer programmer. The original claim was obviously in jest since, among other reasons, Randall Munroe (the web comic's author) claimed that this peak occurs at exactly 0.1337% blood alcohol content. This got me thinking: Could there be any truth to this claim? Read on to find out; the results may surprise you.

This article by Norlander specifically studies the relationship between moderate alcohol consumption (1.0 ml/kg body weight) and creativity. It concludes

…modest alcohol consumption inhibits aspects of creativity based mainly on the secondary process (preparation, certain parts of illumination, and verification), and disinhibits those based mainly on the primary process (incubation, certain parts of illumination, and restitution).

In other words, moderate alcohol consumption does improve certain types of creative thinking, while inhibiting other types of creative thinking. Since the skills required for computer programming are solely cognitive in nature (discounting the motor skills required to type, of course), and given that creativity is a large part of computer programming, it is at least plausible that one might gain some amount of improvement from alcohol consumption.

There have also been studies on the relationship between alcohol consumption and creative output. One such study examined 34 well-known, heavy-drinking, 20th-century writers, artists, and composers/performers. It concludes:

Analysis of this information yielded a number of interesting findings. Alcohol use proved detrimental to productivity in over 75% of the sample, especially in the latter phases of their drinking careers. However, it appeared to provide direct benefit for about 9% of the sample, indirect benefit for 50% and no appreciable effect for 40% at different times in their lives. Creative activity, conversely, can also affect drinking behavior, leading, for instance, to increased alcohol consumption in over 30% of the sample. Because of the complexities of this relationship, no simplistic conclusions are possible.

So for a small portion of people there was a notable increase in creative output as a result of alcohol intake. It does appear that the study did not control for the quantity of alcohol intake, though, so this may not be directly applicable to the Ballmer Peak.

The best study I was able to find on the subject was by Lapp, Collins, and Izzo. They gave subjects vodka tonics of varying strengths (by varying the ratio of tonic to vodka), some of which did not even contain any alcohol. The subjects believed that they were drinking a standard-strength vodka tonic. The subjects then were asked to perform a number of cognitively and creatively challenging tasks. Here is what they conclude:

The present results support the idea that creative people probably gain inspiration from consuming alcohol …, but show that this effect may be due to the expected rather than the pharmacological effects of the drug. … A convergence of evidence supported the idea that creativity is enhanced (at least in some aspects) by the expected effects of alcohol.

In other words, alcohol can improve certain aspects of one's cognitive ability, but this effect is not likely due to any pharmacological process (i.e., it is often sufficient to merely believe that one is drinking alcohol in order to achieve the same benefit).

It looks like there may be some truth in the Ballmer Peak after all!