Cracking captchas with neural networks
For educational purposes
So you'd like to break a captcha, huh? It might come as a surprise, but it's actually fairly easy to do - even in your browser (as long as you don't set your expectations all that high). The trick is to use a neural network to find the best match for each character, after training it extensively with sample data. Luckily for us, there already exists an excellent implementation of neural networks in JavaScript, called brain.js. For this guide we'll be breaking one of the variants generated by Securimage, as they can be made pretty weak (other services, such as captchacreator, are much weaker by default, but I couldn't find a decent dataset for those). Breaking a more complex one, not to mention reCaptcha, quickly becomes close to impossible. It should go without saying that this is only for educational purposes, even if it can be used on some real captchas.
Step 1. Gathering a data set
The first thing we need is a collection of images we can work on. As Securimage is a free captcha implementation, this is as simple as downloading it and generating a bunch. For the sake of keeping this guide relatively short, I've chosen to disable a lot of the obfuscation, leaving us with a result similar to this:
The captchas are very predictable: they always have 6 characters, the character set is known (A-Z, a-z and 0-9), the characters are easily distinguishable from the background, and the characters aren't warped - all characteristics that make them very easy to break. They're also very rare in the wild, but that doesn't really matter since we're doing this for educational purposes ;)
I generated 500 of these images, converted them to their base64 encoding, and embedded them in some HTML. The base64 encoding avoids tainting the canvas when the images are loaded on CodePen, but it also makes the HTML huge. Therefore, I've only included the first 10 images in the pen.
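The conversion itself isn't shown here, so the following is only a rough sketch of how it could be done. It assumes Node.js (for `Buffer`) and PNG files; `toDataUri` and `toImgTag` are hypothetical helper names.

```javascript
// Hypothetical helper: wraps raw image bytes in a data URI.
// Assumes Node.js (for Buffer); the MIME type is passed in by the caller.
function toDataUri(bytes, mimeType) {
  return "data:" + mimeType + ";base64," + Buffer.from(bytes).toString("base64");
}

// Builds an <img> tag that can be pasted straight into the HTML.
function toImgTag(bytes, mimeType) {
  return '<img src="' + toDataUri(bytes, mimeType) + '">';
}
```

The bytes would typically come from something like `require("fs").readFileSync(path)`, once per generated captcha.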
Step 2. Extracting the data
Once we have our dataset, we're ready to start working on our actual script. As neural networks quickly get expensive, we'd like to simplify each image some - namely, we want to extract the parts of each image that actually contain the letters, and convert them to binary (black/white) instead of grayscale. In the end, we can scale each of those selections down to a standard size, and run them through our neural network.
Since we're working with JavaScript, we need to extract the image data from the images. This is done by creating a canvas element, drawing the image to it, and then extracting the image data. This will give us an ImageData object, which has width and height properties, as well as a data property containing an array (actually a Uint8ClampedArray) of the image data - more on that below.
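As a minimal sketch of that extraction step (not necessarily how the pen does it), assuming the image has already finished loading. The `doc` parameter is a hypothetical addition so the function can be tried outside a browser; in a page you'd pass `document`.

```javascript
// Sketch: read the pixels of an already-loaded image.
// `doc` is a hypothetical parameter standing in for the browser's `document`.
function getImageData(img, doc) {
  var canvas = doc.createElement("canvas");
  canvas.width = img.width;
  canvas.height = img.height;
  var ctx = canvas.getContext("2d");
  ctx.drawImage(img, 0, 0);
  // returns an ImageData object: { width, height, data }
  return ctx.getImageData(0, 0, canvas.width, canvas.height);
}
```

In the browser, `getImageData(someImgElement, document)` would then return the ImageData for that image.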
Threshold
Step one is to convert the image to a binary (black/white) format. This is most easily done by running a simple threshold algorithm on it, comparing each pixel's color value to a threshold value, and then assigning it either black or white depending on whether it exceeds the threshold or not. As we're working with image data on a canvas element, each pixel has four values (one red channel, one green channel, one blue channel and one alpha channel). Luckily for us, our image is grayscale, so we'll just ignore everything but the first channel.
A simple threshold function would then look something like this:
function thresholdData( imgData , threshold ) {
  for (var i = 0, j = imgData.data.length; i < j; i+=4) {
    if (imgData.data[i] > threshold) {
      imgData.data[i] = 0; // black
    } else {
      imgData.data[i] = 255; // max value (red, as we're working on the red channel)
    }
  }
  return imgData;
}
For my images, I found 150 to be a good threshold value (as the grey of the letters is the color rgb(147,147,147)) - with this setting, all letter pixels would become red, while all non-letter pixels would become black (or rather turquoise, as the blue and green channels still had values of 147). The resulting image looks like this:
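In numbers rather than pixels, here's a small variant of the threshold idea (not the function used in the pen) that writes the result to all three color channels, so the output is pure black/white instead of red/turquoise. The background value of 230 is an assumed number, just for the demonstration.

```javascript
// Variant of the threshold idea: writes the result to the red, green and
// blue channels, so the image becomes pure black/white.
function thresholdToBinary(imgData, threshold) {
  var d = imgData.data;
  for (var i = 0; i < d.length; i += 4) {
    // pixels at or below the threshold are letters -> white (255)
    var value = d[i] > threshold ? 0 : 255;
    d[i] = d[i + 1] = d[i + 2] = value;
  }
  return imgData;
}

// Two hand-made pixels: a letter pixel (147) and a light background pixel (230, assumed).
var sample = {
  width: 2,
  height: 1,
  data: new Uint8ClampedArray([147, 147, 147, 255, 230, 230, 230, 255])
};
thresholdToBinary(sample, 150);
// sample.data is now [255, 255, 255, 255, 0, 0, 0, 255]
```

The alpha channel (every fourth value) is left untouched.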
Separating letters
Great! We now have an image that's very clearly split up into two sections: non-letter pixels with a red color value of 0, and letter pixels with a red color value of 255. But there are still a lot of pixels that we'd have to process - our next job is to reduce that amount, by splitting the image up into 6 smaller images, each containing one letter.
There are a few different ways to detect the boundaries for each letter. One is to flood fill from every pixel we detected as belonging to a letter, and then pick the 6 biggest fills. For us, there's an even easier way though: there is no image in our dataset in which two letters have pixels in the same x-coordinate (that is, no column of pixels contains pixels from more than one letter). This means that we can simply run through our columns, find the first/last column in which we have a letter pixel (times 6, once for each letter), and crop it. There are some character combinations - such as "rf" - where this doesn't work. If you feel like it, you can try to improve it! In pseudo-code, this would look like this:
for every column
  for every pixel in column
    if it's a letter pixel
      we're now searching through a letter
      update the boundaries for this letter to include this pixel
  if we didn't find a letter pixel in this column, and we're currently searching through a letter
    save every pixel in letter boundary to an array (for output)
    we're no longer searching through a letter
    reset letter boundaries for next letter
If you'd like to see an implementation, the ImageParser.prototype.extract function in my pen (at the bottom of this post) does this exact thing.
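For a rough idea of what the column scan can look like in code, here's a minimal sketch. It's not the pen's implementation: it only returns bounding boxes instead of copying pixels, and it assumes the red channel is exactly 255 for letter pixels (as after thresholding). The name `findLetterBounds` and the box fields are made up for this example.

```javascript
// Sketch of the column scan. Returns one {left, right, top, bottom} box
// per letter, with inclusive pixel coordinates.
function findLetterBounds(imgData) {
  var boxes = [], current = null;
  for (var x = 0; x < imgData.width; x++) {
    var found = false;
    for (var y = 0; y < imgData.height; y++) {
      // red channel of the pixel at (x, y)
      if (imgData.data[(y * imgData.width + x) * 4] === 255) {
        found = true;
        if (!current) current = { left: x, right: x, top: y, bottom: y };
        current.right = x;
        current.top = Math.min(current.top, y);
        current.bottom = Math.max(current.bottom, y);
      }
    }
    // an empty column (or the last column) closes the letter we were in
    if (current && (!found || x === imgData.width - 1)) {
      boxes.push(current);
      current = null;
    }
  }
  return boxes;
}
```

Like the pseudo-code, this will merge letters that share a column, so combinations such as "rf" still come out as a single box.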
Once this is done, we should have a set of 6 ImageData objects:






Scaling
Finally, we'd like to further minimize the amount of pixels to process, all while normalizing the letters - this is done by scaling them down to a square of a fixed size. Note that this will stretch some letters, but since they're stretched equally each time, that doesn't really matter much.
We have two options: we could either draw the image data to a canvas, and then use its 2D context's .scale() method, or we could implement a custom scaling function. Since I'd like to avoid interpolation (in order to keep the image data binary), I opted for the custom version - another choice would be to use the native method, and then simply threshold it again afterwards.
The pseudo-code for my custom scaling function looks like this:
for every pixel in small square
  find the top-left pixel of the corresponding area in the large image data
  if it's a letter pixel
    set as letter pixel in small square
  otherwise
    set as background pixel in small square
This could of course be improved - for instance by checking the average value of the corresponding area, instead of a single pixel - but it suits our needs well enough for the task at hand.
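The pseudo-code above could be sketched like this. It's a simplified stand-in, not the pen's code: it returns a flat array of 0/1 values rather than image data, and the `box` argument with its `{left, right, top, bottom}` fields is a hypothetical format for the letter boundaries.

```javascript
// Sketch of nearest-neighbour scaling of one letter box to size x size.
// `box` holds inclusive pixel coordinates of the letter in the large image.
// Returns a flat array with one 0/1 value per pixel of the small square.
function scaleLetter(imgData, box, size) {
  var boxWidth = box.right - box.left + 1;
  var boxHeight = box.bottom - box.top + 1;
  var out = [];
  for (var y = 0; y < size; y++) {
    for (var x = 0; x < size; x++) {
      // top-left pixel of the area this small pixel corresponds to
      var srcX = box.left + Math.floor(x * boxWidth / size);
      var srcY = box.top + Math.floor(y * boxHeight / size);
      var red = imgData.data[(srcY * imgData.width + srcX) * 4];
      out.push(red === 255 ? 1 : 0);
    }
  }
  return out;
}
```

Because only a single source pixel is sampled, thin strokes can disappear when a letter is much larger than the target square; averaging the area would be more robust.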
This is done for each letter we extracted in the previous step, giving us a total of six 16x16 squares (because of the way my implementation works, this also conveniently discards the blue/green channels):






Step 3. The neural network
Finally! It's time for us to throw our image data into a neural network. Before we get started on that, let me briefly explain what neural networks actually are.
Neural networks are essentially a collection of nodes - called neurons - which are connected. The neurons are organized in layers, always with one "input" layer, and one "output" layer. There can be one or more "hidden" layers in-between those two, or the two can be connected directly - for our use case, we'll be adding a few hidden layers.
The input layer has one neuron for each input (in our case, one neuron for each pixel), which in turn is connected to every neuron in the next layer. These connections have "weights" - numbers to scale the neuron value by. If we give our first neuron a value of 255, it might only send 25 on to the first neuron in the next layer, and 0 on to the second neuron in the next layer.
In the hidden layers, each neuron receives a certain value from the previous layer. Based on these values (for example by summing them up and comparing them to a threshold), the neuron's own value is decided. This value is then sent on to every neuron in the next layer, and the process repeats itself until we reach the output layer.
In the output layer, each neuron corresponds to a possible output (in our case, this is our character set: A-Z, a-z and 0-9, giving 62 outputs in total). The neuron value for each output neuron can be thought of as the likelihood of said neuron being the correct output.
Training the network happens by changing the weights of the connections between neurons. The training data is passed through the network, the output is compared to what we wanted, and the weights are then nudged a bit in the direction that reduces the error (brain.js does this using backpropagation). This process repeats until we've decided the result is good enough - if you want a precise network and have a lot of training data, this can take a long time, often several hours.
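To make the idea of training by adjusting weights a bit more concrete, here's a toy sketch: a single neuron with two inputs, trained with gradient descent to compute a logical OR. This is only an illustration under those assumptions, not brain.js's actual code, and all names in it are made up for the example.

```javascript
// Sigmoid squashes any number into the 0..1 range.
function sigmoid(x) { return 1 / (1 + Math.exp(-x)); }

// One neuron with two inputs, trained on logical OR (a toy stand-in for real data).
var samples = [
  { input: [0, 0], target: 0 },
  { input: [0, 1], target: 1 },
  { input: [1, 0], target: 1 },
  { input: [1, 1], target: 1 }
];
var weights = [0, 0], bias = 0, learningRate = 0.5;

function predict(input) {
  return sigmoid(weights[0] * input[0] + weights[1] * input[1] + bias);
}

for (var epoch = 0; epoch < 5000; epoch++) {
  samples.forEach(function (s) {
    var out = predict(s.input);
    // how far off we were, scaled by the slope of the sigmoid
    var delta = (s.target - out) * out * (1 - out);
    // nudge each weight in the direction that reduces the error
    weights[0] += learningRate * delta * s.input[0];
    weights[1] += learningRate * delta * s.input[1];
    bias += learningRate * delta;
  });
}
```

After the loop, `predict([0, 0])` ends up below 0.5 and the other three inputs above it. A real network does the same kind of nudging, just for every weight in every layer.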
brain.js
Luckily for us, we don't have to actually implement the neural network logic, since brain.js has that all figured out. All we need to do is tell it how many hidden layers we want, what our inputs are, and what outputs we expect. It'll then train the network, eventually spitting out a set of weights. Once trained, we can start running new captcha images, unknown to the network, through it and see how well it guesses.
But first, we have to actually code it. Since we're using brain.js, we need to play by its rules: brain expects all of its inputs and outputs to be either arrays or objects, containing numbers ranging from 0 to 1. Therefore, we have to format our image data from an array of pixel values, to an array of numbers, one for each pixel:
function formatForBrain( imgData ) {
  var output = [];
  for (var i = 0, j = imgData.data.length; i < j; i+=4) {
    // imgData.data has four channels, but we're only interested in the first
    // we also normalize our numbers from 0-255 to 0-1
    output[i/4] = imgData.data[i] / 255;
  }
  return output;
}
And that's basically it. What's left to do is a lot of grunt work: manually entering the expected outputs for each image in our dataset (I filled out the captcha for 250 images - the data can be found here), followed by extracting the relevant image data and giving it to the neural network for training:
var trainingData = [];
for (var i = 0; i < imgs.length; ++i) {
  var data = extractImageData(imgs[i]), // extract letter images
      answer = answers[i]; // manually entered
  // format image data for brain.js (one 0-1 number per pixel)
  var formattedData = data.map(function(imgData){ return formatForBrain(imgData); });
  // pair each formatted letter image with its expected character
  var outp = formattedData.map(function(input,index){
    // `output` property must be an object
    var outputObj = {};
    outputObj[answer[index]] = 1;
    return {
      input: input, // the formatted array, not the raw image data
      output: outputObj
    };
  });
  // add image+answer to training data
  trainingData = trainingData.concat(outp);
}
Finally, we can create and train our neural network. This is made just about as easy as it can be by brain.js:
var net = new brain.NeuralNetwork({hiddenLayers: [128,128]});
net.train(trainingData, {
  errorThresh: 0.0001, // error threshold to reach
  iterations: 200, // maximum training iterations
  log: true, // console.log() progress periodically
  logPeriod: 10 // number of iterations between logging
});
I chose to use a neural network with two hidden layers, each with 128 nodes in them; this choice was fairly arbitrary. Playing around with the number and size of hidden layers could probably lead to better performance. I trained the network for roughly 20 minutes, which led to a training error of ~0.00021 (training to 0.0005 took less than a minute).
With the training finished, the weights can be exported either as a JSON object (net.toJSON()), or as a function (net.toFunction()), which simply includes the JSON object and implements the logic to generate and run the network, all without requiring brain.js to be loaded:
function run(input/**/) {
var net = {"layers":[{"0":{},"1":{},"2":{},"3":{},"4":{},"5":{},"6":{},"7":{},"8":{},"9":{},"10":{},"11":{},"12":{},"13":{},"14":{},"15":{},"16":{},"17":{},"18":{},"19":{},"20":{},"21":{},"22":{},"23":{},"24":{},"25":{},"26":{},"27":{},"28":{},"29":{},"30":{},"31":{},"32":{},"33":{},"34":{},"35":{},"36":{},"37":{},"38":{},"39":{},"40":{},"41":{},"42":{},"43":{},"44":{},"45":{},"46":{},"47":{},"48":{},"49":{},"50":{},"51":{},"52":{},"53":{},"54":{},"55":{},"56":{},"57":{},"58":{},"59":{},"60":{},"61":{},"62":{},"63":{},"64":{},"65":{},"66":{},"67":{},"68":{},"69":{},[...],"127":0.07311768650887056}}}],"outputLookup":true,"inputLookup":false};
  for (var i = 1; i < net.layers.length; i++) {
    var layer = net.layers[i];
    var output = {};
    for (var id in layer) {
      var node = layer[id];
      var sum = node.bias;
      for (var iid in node.weights) {
        sum += node.weights[iid] * input[iid];
      }
      output[id] = (1 / (1 + Math.exp(-sum)));
    }
    input = output;
  }
  return output;
}
This function can then be thrown in wherever you want your neural network to work - such as in the below pen - and it'll easily give you your guesses. Keep in mind that your guesses are still in the object format { key: guessProbability, ... }, so you'll have to find the most likely one:
var guess = {string: "", probability: 0};
for (var k in guesses) {
  if (guesses[k] > guess.probability) {
    guess = {string: k, probability: guesses[k] };
  }
}
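To turn that into a guess for a whole captcha, the same search can be wrapped in a function and run once per letter. This is a small sketch under the assumption that you've collected the network output for each of the six letters in an array; `bestGuess`, `guessCaptcha` and `letterOutputs` are hypothetical names.

```javascript
// Picks the most likely character from one output object { key: probability, ... }.
function bestGuess(guesses) {
  var best = { string: "", probability: 0 };
  for (var k in guesses) {
    if (guesses[k] > best.probability) {
      best = { string: k, probability: guesses[k] };
    }
  }
  return best;
}

// Joins the best guess for every letter into the full captcha text.
// `letterOutputs` is assumed to be an array of network outputs, one per letter.
function guessCaptcha(letterOutputs) {
  return letterOutputs.map(function (o) { return bestGuess(o).string; }).join("");
}
```

For example, `guessCaptcha([{ a: 0.1, B: 0.9 }, { "7": 0.8, x: 0.3 }])` returns `"B7"`.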
And that's it! We've just created a "small" script that can guess simple captchas with close to 100% accuracy (and more advanced filtering than a simple threshold would likely improve it further)!