Ceci c’est ne pas une pirate.

A couple years ago I wrote about some thought experiments related to copyright law. Let's say you have a huge number, $n$, two documents, $D_1$ & $D_2$, and a cipher, $f$, that, given the number and the first document, yields the second document: $f(n, D_1) = D_2$. Now let's say $D_1$ is some book in the public domain, whereas $D_2$ is a copyrighted book. In my previous article I wondered,

What is the copyright status of $n$? Can a number be copyrighted?

The difficulty arises in the fact that we have reconstructed a copyrighted artifact solely using artefacts in the public domain. I subsequently posed this scenario to some lawyers, and their opinion was that either the cipher $f$ would be in violation of $D_2$'s copyright, or the copyright status would depend on the intent of the cipher’s creator.

But what if, for the same number $n$, there is another pair of documents, $D_3$ & $D_4$ for which $f(n, D_3) = D_4$? What if $D_3$ and $D_4$ are both in the public domain? Is such an occurrence even possible? It turns out that it is, and it's probably more probable that you might think. Read on to find out.

I decided to construct a general purpose cipher $f$. It actually is quite simple using a Running Key Cipher. The idea is that we use the first document, $D_1$, as a sort of public key: the huge number $n$ is factored into smaller numbers which are indexed into $D_1$ to produce $D_2$. Furthermore, for long enough documents, it is possible to find a number that will both produce $D_2$ from $D_1$ and will produce $D_4$ from $D_3$!

I wrote a short python script, below, that implements the cipher. It currently reads books from Project Gutenberg as the documents for a proof-of-concept, but any document (including copyrighted works!) would work. As an example, I selected Newton's Opticks as Document 1, Russell's Analasis of the Mind as Document 2, Plato's Republic as Document 3, and Paine's Age of Reason as Document 4. Running my cipher results in the following output:

1 2 1 2 1 2 1 2 155 33 103 108 68 69 56 108 38 1540 11894 617 2446 59 1080 332 110 386 722 3860 15 170 207 6966 186 62 262 261 505 454 184 183 145 454 6743 96 1023 492 411 1567 110 115 1020 4872 834 59 834 617 103 104 38 1054 157 1756 2446 162 353 141 4081 645 28 1431 492 10 382 47 470 110 306 112 52843 543 867 1409 15 249 1020 120 149 142 145 3816 24344 24506 49059 120 63 353 1910 351 5106 496 141 351 145 1851 5710 713 356 277 214 372 589 17 986 488 94 101 217 176 3841 189 345 215 1414 355 268 1081 67 435 848 187 155 5169 1164 273 165 183 141 62 172 582 1977 411 46 8260 2168 49 2126 24 559 369 386 17 379 618 217 391 15 4970 114 60 2402 17 435 215 702 285 122 199 112 24 180 1282 90 861 331 597 17 285 821 389 99 4529 934 1668 47 3100 510 378 315 345 188 2394 155 5169 318273 341197 230969 345 355 48959 321398 315037 316250 1911 181 460 262 355 101 221 1353 229 244 662 589 241 809 988 169 77 277 214 296 200 84 176 208 391 112 27 244 968 119 331 47 1260 702 285 724 492 189 775 170 96 440 702 355 554 19927 842 182 1896 379 345 2023 241 19506 3467 726 286 401 189 602 68 59 92...


While I realize that does not amount to the single, huge number I spoke of earlier, note that it can easily be converted to a large number, e.g., by concatinating each number as a digit in a very large base. This number is basically meaningless without $D_1$ and $D_3$. But with those documents we can recreate $D_2$ and $D_4$:

$./decrypt.py huge_number opticks.txt the analysis of mind by bertrand russell 1921 muirhead library of philosophy an admirable statement of the aims of the library of phiosophy was provided by the first editor, the late professor j. h. muirhead, in his description of the original programme printed in erdmanns history of philosophy under the date 1890. this was slightly modified in subsequent volumes to take the form of the following statement...  $ ./decrypt.py huge_number republic.txt
the writings

of

thomas paine

collected and edited by

moncure daniel conway

volume iv.
...


So here I have demonstrated a method for generically generating a huge number that when applied to two books in the public domain produces two other books in the public domain. Note that any of these books could have just as easily been a copyrighted text. What would happen if $D_4$, instead of being Paine's Age of Reason, was instead some copyrighted text that I had not the license to distribute? Would I be able to distribute $n$ as long as I didn't let anyone know that, when combined with Analysis of the Mind, it produced the copyrighted material?

Many laws and legal decisions amount to one's intent. But what if one legitimately only knew of the use of $n$ with respect to $D_1$ and $D_2$, and didn't know of its nefarious use to produce $D_4$ from $D_3$? I therefore asked myself, “How likely would it be for one ciphertext $n$ to be able to produce more than one result document?” The answer surprised me!

Let’s say we have a ciphertext $n$ that we generated from $D_1$ and $D_2$. How likely is it that there exists two other books $D_3$ & $D_4$ such that $n$ will also generate some copyrighted text of $D_4$ from $D_3$? Note that I am not taking relative letter frequency into account: I assume that each character in a document is an independent random variable. This is a rather unrealistic assumption, but we shall see that the asymptotic properties of the problem should nullify it. (This assumption could be relaxed by instead applying Lovász’s local lemma.) First, let’s tackle the problem of figuring out the probability that a given pair of documents $D_3$ and $D_4$ will work. Let $m$ be the length of the documents in characters and let $k \lt m$ be the minimum required length of a string for that text to be considered a copyright violation (i.e., outside of fair use). The probability that $f(n, D_3)$ contains no substrings of length at least $k$ from $D_4$ is

$\left(1 - p^k\right)^{(m - k + 1)}$,

where $p$ is the probability that a pair of characters is equal. Here we have to take into account letter frequency in English. Using the table from Wikipedia, I calculated $p$ to be roughly 6.5 percent (it’s the sum of squares of the values in the table). According to Google, there are about 130 million books that have ever been written. Let’s be conservative and say that 2 million of them are in English. Therefore, the probability that at least one pair of those books will produce a copyrighted passage from $n$ is

$1-\left(\left(1 - p^k\right)^{(m - k + 1)}\right)^{2000000 \choose 2}$,

which is going to be extremely close to 1.0 for $k \lt m \ll 2000000$.

Therefore, for any choice of $n$ it is almost certain that there is a pair of books that one can choose that will cause a copyright violation, even if those books are unknown!

Here is my python script: A newer version of the code is now available as the Lenticrypt project!

Here is the old proof-of-concept code:

#!/usr/bin/env python
import sys, re, itertools
def character_stream(f):
found_start = False
prog = re.compile(r’[\w\s\n.,;:-\\!\\?&]’)
for line in f:
found_start = (line.startswith(“*** START OF THIS PROJECT GUTENBERG EBOOK”) or line.startswith(”*END THE SMALL PRINT!”))
continue
for char in line.lower():
if prog.match(char):
yield char
pairs = {}
def encrypt(certificate1, certificate2, to_encrypt1, to_encrypt2):
f1 = open(certificate1, ‘r’)
f2 = open(certificate2, ‘r’)
i = 0
for c1, c2 in itertools.izip(character_stream(f1), character_stream(f2)):
i += 1
if (c1, c2) not in pairs:
pairs[(c1, c2)] = i
#print str(len(pairs)) + “\t” + str(i) + “\t” + c1 + “\t” + c2
f1.close()
f2.close()
f1 = open(to_encrypt1, ‘r’)
f2 = open(to_encrypt2, ‘r’)
for c1, c2 in itertools.izip(character_stream(f1), character_stream(f2)):
if (c1, c2) in pairs:
print pairs[(c1, c2)],
else:
sys.stderr.write(”Warning: pair “ + str((c1, c2)) + “ was not in the certificate!\n”)
f1.close()
f2.close()
def decrypt(key, certificate):
with open(certificate) as stream:
cert = list(character_stream(stream))
with open(key) as stream: