Tuesday, January 13, 2015

[Updated 1/24/2015] Defeat n-gram snoopers with a Scrabble Bag

Now that the onslaught on end-to-end encryption is sooo on, let's have a bit of Scrabble fun.

Because, as you will see, if any government bans my ability to communicate in private via the internet, they'd better ban Scrabble, too.

The somewhat abstract "strong" encryption is not something this writer uses all the time. The reasons are manifold and well regurgitated everywhere.

It is that I love to have a choice, and the choice is really more a matter of dignity than anything else. And the fact that there are lots of countries that are even less free than the flagship democracies.

Okay, sneaking in the little qualifier less is not nice but allow me to be a bit grumpy today. It's the weather, a winter that does not deserve its name, too hot, too rainy, you know.

Let's assume the great defeat of reliable encryption becomes a reality, does it mean you are defenseless?

There is one huge problem with the great encryption tools and algorithms we rely on: they are incredibly hard to design and just as hard to turn into reliable software. Even the pros get it wrong in astonishing ways, as the heartbleed bug in SSL showed recently.

Given that the number of geniuses I could call on to whip up some great encryption for me, if some government bans the good stuff, is limited and given these folks are very busy solving the world's bigger problems, like, for example, optimizing the NetFilx queueing system, what can an average Joe or Jane do?

First, make some text message envelopes which totally abuse common file formats to hide text. Kind of simple stuff such as putting a message into an xml file that - to humans and computers looks like it contains only a copyright free version of Huckleberry Finn. Or a little video with snowflakes or something that looks like Pacman.

Done: http://www.unchartedcharters.com

Incidentally, if you send encrypted messages as attachments, you still want an envelope so that the highway robber computers along the way cannot simply look for "GPG v. 2.0" and save your message for later decryption.

The next step is to see if old encryption methods can be used. That's Hide-a-KeyText on that same website.

A running text or Beale cipher is pretty cool but has a problem called statistics. In simple terms, a language contains a limited number of characters which need to be combined into  words.
The permissible combinations are not a free for all as you know if you have played Scrabble and felt the desire for a few extra points.

The guys who want to break these ciphers know this and use n-grams to get there.
The "n" is just a placeholder for a number. So, you can say 2-gram, 3-gram, 4-gram, and it means simply a sequence of that many characters, for example:

This is an example.

Breaking it into 3-grams gives you
Thi
his
is<space>
s<space>i

Spaces are generally removed but it is not important to the concept.

But, and this is the beauty for average coders like meeself, you can mess with n-grams with little programming.

Assign a position to each letter:
This is
1234567

Then sort the characters of "This is" alphabetically while taking the numbers along for the ride [it is called a two dimensional array or a map, but you do not need this jargon]:
 iihssT
753461

In a simple world, you can now separate the numbers and the letters. Send the numbers to your buddy (753461). Wait a couple of hours, send the letters ( iihssT).

Okay, this very short example can be easily resolved by just looking at it. But if you have a longish message (remember to add a chunk of a Daily Mail or Guardian article at the end for more letters), you will get sequences like "aaaaaaaaabbbbbccccddfffffffgggghhhhhhhhhiiiiiiiiii"

That's your very own Scrabble Bag. Just letters, no words, no n-grams.

In a not so simple world, you could do much more, for example, do some mathematical operation on the letters. Add or subtract the position number, XOR something, lift them out of the expected alphabet, whatever you fancy.

If you do a running cipher encryption on "aaaaaaaaabbbbbccccddfffffffgggghhhhhhhhhiiiiiiiiii", the n-gram snoopers will hate you. Which is what you want.
You will also want to do a running cipher on the numbers, for which they will hate you even more.

While you are  at it, don't start the position count with 1. If you have a text of 20 000 characters, that gives you a spread of 4 digits. Start high at 100 000 or so. With a text of 20 000 characters, you begin with a 6 digit position number and and end with a 6 digit position number, hence there is no "meta" information to be gleaned from the size of the numbers.

So, yes, communicating with a good degree of privacy for message content will be a lot harder if the anti encryption camp wins but hey, we are not in China, so Scrabble will survive and with it will the Scrabble Bag.

[Update 1/19]
A fancier term for this approach would be Double Strand encryption.  The row of sorted letters and the row of position numbers form two strands, as opposed to the original text (or other data, by the way). The original data is a single strand in which the letters and their positions form a unit. The "T" in our example is implicitly at position [0]. Creating two strands allows us to first explicitly associate a position to each letter and, as a second step, to pry them apart into one strand of letters and one strand of positions.

If they are long enough, unlike in our example, the strands are safe.

This is true even without encryption.

According to this website, your standard English Scrabble Bag has 100 letters (tiles). A text message of several thousand alphabetically ordered letters becomes very hard to reconstitute. Do the math if you like really big numbers.

As mentioned above, you can make things even harder by adding superfluous text, something not common but done in obfuscation in order to introduce data to mask the size and the content of a data set.
The best fluff to add would be from a different alphabet or script, very much like dumping two Scrabble Bags into a new unmarked bag, say English and Czech, or Hebrew.

Fun with different length strands
Until now we have implied that your letter and number strand are of equal length, right?
But there is no compelling reason for it. So, instead of adding some extra text, you could just append more, preferably random, numbers to the end of the numbers strand once you separate it from the text strand.
When they are combined by the recipient, the dangling extra numbers are ignored.
You can do the same to the letters strand, but make sure the extras are themselves alphabetically ordered.

Public transmission
Given that the double strand creates to separate carriers of information (content is in one, position in the other), you could even transmit one of them via a public means in a pinch.
As long as no unauthorized person gets hold of the other strand.

Why has nobody thought of this before?
Actually, we don't know that. Maybe someone has but was too shy to ruin their reputation. Maybe somebody has and the technique is currently being used - to save the world, or the end it, who knows.
But two simple reasons come to mind. The first one is efficiency. Short messages, securely encrypted were what was needed. Who would want to decrypt half of Mark Twain's Huckleberry Finn to get at the real message that says "meet me at the bar at 5"?
Only with a computer, you can do it.
The other reason is versatility and overhead. You don't want internet infrastructure to run on Huck Finn.

[End update 1/19]


[Update 1/24]
Simply sorting a text alphabetically will remove all position information from the collection of letters, but the sequence of numbers is still to some extent vulnerable to the old statistics hack. Say, you have the number 7429615 and the bad guys know there is an alphabet "underneath"? They would go and guess a language, say English, assume that many normal English words contain an "a" and try to set 7 = a. That's a pretty good guess, and with enough time, they could solve the Double Strand from the number strand.

For a good real life solution, you would want to "shake, not sort", which means introducing random draw from the Scrabble Bag.

[End update 1/24]

[Update] Fixed spelling, grammar, made a couple of sentences for readable.


No comments:

Post a Comment