Sunday, February 12, 2017

More hay for the Internet haystack - send text in images and defeat optical character recognition

Providers of ad-supported email scan the text of your mails to target ads. Sometimes, like Yahoo, they do it for a government. We have previously written about Message Blur, a small JavaScript image processing tool that defeats standard optical character recognition (OCR) software. Since there is a new version out, we did some testing against OCR, specifically the very reliable free site ocronline.

Below are the results of the tests. They range from 100% correct text recognition to partial recognition to no recognition, where the site gives up and says "No recognized text".
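To put a rough number on "100% correct", "partial" and "no recognition", one could compare the text that went into the image with whatever the OCR returns. The following is a minimal sketch using only Python's standard library; the function name `recognition_ratio` is our own, and the score is a generic string similarity, not a measure that ocronline or Message Blur reports.

```python
from difflib import SequenceMatcher


def recognition_ratio(original: str, extracted: str) -> float:
    """Return a similarity score between 0.0 (nothing matched) and 1.0 (identical).

    Hypothetical helper: it compares the text typed into the tool with the
    text an OCR service handed back, ignoring case and whitespace layout.
    """
    # Normalize case and collapse runs of whitespace so that line breaks
    # introduced by the OCR do not count as recognition errors.
    a = " ".join(original.lower().split())
    b = " ".join(extracted.lower().split())
    if not a:
        # Nothing to recognize: only an empty result counts as a match.
        return 1.0 if not b else 0.0
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    # Both strings are made-up placeholders, not real OCR output.
    typed = "Type your message here"
    returned = "Type y0ur m3ss"
    print(f"similarity: {recognition_ratio(typed, returned):.2f}")
```

A score near 1.0 corresponds to the clean extraction case, and an empty OCR result ("No recognized text") scores 0.0.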

Message Blur now has its own OCR feature that you use by clicking "Tesseract OCR test extraction", intended to give you an idea of how well the obfuscation works. Right now, the built-in feature only recognizes Japanese, Korean, Devanagari (Hindi), Cyrillic (Russian), Arabic, Hebrew and Latin (English). This means you will not see accented characters (French and others) or German umlauts.

This is a screenshot of the new tool version. The sample text in the text area has been "moved" to the image area.




The next image is a small version of the "Exported to file" image from the screen above. There is nothing special about it: none of the various modifications the tool can make have been applied.

OCRONLINE nicely extracts the text from the .png file as shown here:

We then tried to defeat the OCR. Still with the sample text, we used the "Line" tool and hit "Draw random lines" for a few seconds. Just for fun, we added some very transparent lines manually and then exported the image:

You can see that a human can still read the text easily, and we guessed that some of the continuous text without any crossing lines would be recognized. The output of ocronline was:

For hiding the text, this is a substantial improvement over the clear image from the first round. Interestingly, the OCR software is smart enough not to get fooled by the faint transparent overlay on the very first word, "Type". But the rest does not go so well for the character recognition software.
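For readers curious what "Draw random lines" amounts to, here is a minimal sketch in plain Python. It is not Message Blur's code (the tool is JavaScript and we have not looked at how it picks its lines); it assumes a grayscale image stored as a list of rows, and the names `draw_line` and `add_random_lines` are our own.

```python
import random


def draw_line(pixels, x0, y0, x1, y1, value=0):
    """Draw a one-pixel-wide line with Bresenham's algorithm (in place)."""
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)
    sx = 1 if x0 < x1 else -1
    sy = 1 if y0 < y1 else -1
    err = dx + dy
    while True:
        pixels[y0][x0] = value
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:
            err += dy
            x0 += sx
        if e2 <= dx:
            err += dx
            y0 += sy


def add_random_lines(pixels, count, rng=None):
    """Overlay `count` dark lines between random points of the image."""
    rng = rng or random.Random()
    height, width = len(pixels), len(pixels[0])
    for _ in range(count):
        draw_line(
            pixels,
            rng.randrange(width), rng.randrange(height),
            rng.randrange(width), rng.randrange(height),
        )
    return pixels


if __name__ == "__main__":
    # A blank white 40x12 "image"; 255 is white, 0 is black.
    image = [[255] * 40 for _ in range(12)]
    add_random_lines(image, 6)
    for row in image:
        print("".join("#" if p == 0 else "." for p in row))
```

Each line crosses whatever glyphs lie in its path, which is the kind of clutter that trips up character segmentation in the test above.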

So, only a few seconds of work by the computer is enough to degrade a message to the point that simple scans will very likely discard it.

What ad might Google serve based on this? Maybe one for coffee, because the algorithm thinks you were too sleepy to be coherent?

For the next test, we "Reset" and "Moved" the text into the empty image again, then we played with random lines and drew a few thick lines manually. Note that most of the text is not obstructed. It is clear that OCR would easily extract more than in the previous example.

This is where we "Peel" the image into two. Click "Peel", then export to file twice. To reassemble the original image and read the text, you load the two images into an empty Message Blur image canvas. The saved split images are:

Message Blur "cut" the text lines in the middle, then put half of the image in one file and the other half into the second file.
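The idea behind the split can be sketched in a few lines of Python. This is our own illustration, not the tool's implementation: it assumes a grayscale image stored as a list of rows and alternates horizontal bands between the two output images, with `band_height` as a made-up parameter (roughly half a text line would cut each line in the middle). How Message Blur actually chooses where to cut is not something we have verified.

```python
WHITE = 255


def peel(pixels, band_height):
    """Split an image into two images that each hold alternating bands.

    Rows belonging to the other image are filled with white, so neither
    half on its own contains complete characters.
    """
    first, second = [], []
    for y, row in enumerate(pixels):
        blank = [WHITE] * len(row)
        if (y // band_height) % 2 == 0:
            first.append(list(row))
            second.append(blank)
        else:
            first.append(blank)
            second.append(list(row))
    return first, second


def reassemble(first, second):
    """Overlay two peeled halves by keeping the darker pixel at each position."""
    return [
        [min(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(first, second)
    ]


if __name__ == "__main__":
    # Tiny stand-in for a text image: 0 is ink, 255 is paper.
    image = [
        [0, 255, 0],
        [255, 0, 255],
        [0, 0, 0],
        [255, 255, 0],
    ]
    half_a, half_b = peel(image, band_height=1)
    print(reassemble(half_a, half_b) == image)
```

Overlaying the two halves restores the original, which mirrors loading both exported files into an empty canvas.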

Submitting each of these files to ocronline gives us the desired result, a "No text" message:

Simply splitting a text image like this will defeat even good standard OCR software. Artificial intelligence (AI) folks have been working on the most common images that contain manipulated text: the awful captchas that seek to separate humans from machines. AI has defeated some of them, as this article on a Yahoo captcha illustrates.

Message Blur does not let you change the image size via a menu, but there is a trick to make it suitable for long text.

You can load images of any size.

Here is a reduced-size example, a screenshot from the website of the German magazine Der Spiegel. The actual size in Message Blur was about 2000 by 1200 pixels, scaled down to a third of that for display here.

We let the computer add some random lines, then gave the image to OCRONLINE for extraction.

As expected, OCRONLINE does catch some of the text, especially part of the text under the large photo showing the chancellor in a red chair.



If the OCR software does not "know" the language it is supposed to extract, it gets a lot less, as this result for Der Spiegel with the language left on the default English shows:


















