{"id":461860,"date":"2025-06-03T09:00:14","date_gmt":"2025-06-03T09:00:14","guid":{"rendered":"http:\/\/savepearlharbor.com\/?p=461860"},"modified":"-0001-11-30T00:00:00","modified_gmt":"-0001-11-29T21:00:00","slug":"","status":"publish","type":"post","link":"https:\/\/savepearlharbor.com\/?p=461860","title":{"rendered":"<span>How an AI CAPTCHA Solver Works: From OCR to Deep Learning<\/span>"},"content":{"rendered":"<div><!--[--><!--]--><\/div>\n<div id=\"post-content-body\">\n<div>\n<div class=\"article-formatted-body article-formatted-body article-formatted-body_version-2\">\n<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<p>CAPTCHA has become a familiar part of the internet: distorted text, \u201cfind all the traffic lights\u201d images, audio riddles, and other challenges that distinguish humans from machines. Every developer of bot systems, and every QA engineer automating web scenarios, has probably seen a script suddenly stall on a CAPTCHA at least once. A natural question arises: can a program be taught to solve CAPTCHAs as quickly and reliably as a human? In this article I look at how AI CAPTCHA solvers are built, from classical OCR methods to modern neural networks.<\/p>\n<figure class=\"full-width\"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/fcd\/aff\/3f7\/fcdaff3f75f38f0b72b6e77aa8d9b320.jpg\" width=\"572\" height=\"430\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/fcd\/aff\/3f7\/fcdaff3f75f38f0b72b6e77aa8d9b320.jpg 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/fcd\/aff\/3f7\/fcdaff3f75f38f0b72b6e77aa8d9b320.jpg 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<h2>CAPTCHA Types and Why Bots Find Them Hard<\/h2>\n<p>Before breaking a CAPTCHA, let\u2019s look at the kinds that exist and why algorithms have trouble with them. 
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) comes in many forms. The main types are:<\/p>\n<ul>\n<li>\n<p>Text CAPTCHAs: classic images with distorted characters (letters, digits) that the user must type in. Such tasks used to be solved with OCR (Optical Character Recognition), but as distortions and noise intensified, simple recognition stopped working: characters overlap and bend, and interference is added deliberately to defeat segmentation and recognition. For a bot this is a real problem, since it must isolate and recognize every character, which is non-trivial without a trained model.<\/p>\n<\/li>\n<li>\n<p>reCAPTCHA v2 (Google): the familiar \u201cI\u2019m not a robot\u201d checkbox, which analyzes user behavior and, if that behavior looks suspicious, opens a pop-up grid of images (typically 3 \u00d7 3) where you must pick every picture that matches a criterion (e.g., all squares with cars). It combines behavioral analysis with a visual object-classification task, so a bot has to understand image content, which is a computer-vision problem.<\/p>\n<\/li>\n<li>\n<p>reCAPTCHA v3 and Cloudflare Turnstile: invisible, next-generation CAPTCHAs. They require no user action: the backend analyzes many behavioral, environmental, and browser signals and assigns a hidden \u201csuspicion rating.\u201d A low rating means the user is treated as human; a high one may trigger an extra check. This is a tough barrier for a bot, because it has to imitate human behavior across many signals instead of solving one specific puzzle.<\/p>\n<\/li>\n<li>\n<p>hCaptcha (by Intuition Machines, used by Cloudflare for a time) and FunCaptcha (Arkose Labs): alternatives that work much like reCAPTCHA v2. The user gets either a set of images to classify or an interactive task (e.g., rotate a 3-D object or find an element). 
Each system adds its own variations of visual puzzles.<\/p>\n<\/li>\n<li>\n<p>GeeTest and other puzzles: popular on Asian services, these include puzzle pieces (drag a fragment into the correct position), shuffled image tiles, simple questions, and sliders. A puzzle CAPTCHA, for example, asks you to align a cut-out fragment with the matching hole in the picture. Bots find this hard because it requires both image understanding and convincingly human input (mouse movement).<\/p>\n<\/li>\n<li>\n<p>Audio CAPTCHA: usually offered alongside a visual CAPTCHA for visually impaired users. A distorted recording of numbers or words plays, and the user must type what they hear. The assumption is that humans understand noisy speech better than machines do. Yet these CAPTCHAs are not flawless: Stanford researchers automatically solved audio CAPTCHAs with success rates of up to 75 % using speech-recognition algorithms, and with today\u2019s powerful ASR (Automatic Speech Recognition) models, audio riddles no longer guarantee protection.<\/p>\n<\/li>\n<li>\n<p>Behavioral and invisible CAPTCHAs: besides reCAPTCHA v3 and Turnstile, mentioned above, there are hidden tests built into a site (for example, honeypot fields that only bots fill in, or analysis of form-filling speed). All of them check how natural the user\u2019s actions are. The bot faces not a specific puzzle but the need to pass for a real user: move the mouse \u201clike a human,\u201d wait random delays, and run a \u201cclean\u201d browser. 
Such methods are hard to defeat with an algorithm alone, so workarounds are used instead, e.g., obtaining a reCAPTCHA token via an API or using browser-fingerprint databases.<\/p>\n<\/li>\n<\/ul>\n<figure class=\"full-width\"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/927\/e1d\/62c\/927e1d62c232b3a13a8b3451f7874c21.jpg\" width=\"996\" height=\"720\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/927\/e1d\/62c\/927e1d62c232b3a13a8b3451f7874c21.jpg 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/927\/e1d\/62c\/927e1d62c232b3a13a8b3451f7874c21.jpg 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>My previous <a href=\"https:\/\/habr.com\/ru\/articles\/847302\/\" rel=\"noopener noreferrer nofollow\">article about CAPTCHA types<\/a> covers them in more detail.<\/p>\n<p>Each CAPTCHA type requires its own solving approach, which is why a universal CAPTCHA solver remains a challenge: depending on the check it meets, the bot must read text, classify pictures, or synthesize behavior. Let\u2019s move on to how algorithms try to solve CAPTCHAs and how these methods have evolved.<\/p>\n<h2>From OCR to Neural Networks: The Evolution of CAPTCHA Bypassing<\/h2>\n<p>The first attempts at automatic CAPTCHA solving were closely tied to OCR (Optical Character Recognition). A classical text CAPTCHA (distorted letters or digits on a noisy background) was essentially designed to defeat OCR systems. Old CAPTCHA versions could be cracked with relatively simple methods: filter the image, extract contours, segment it into individual characters, and then recognize each one by template matching or with a standard OCR engine such as Tesseract. 
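<\/p>
<p>To make the classical pipeline concrete, here is a minimal sketch in Python (standard library only). It rests on simplifying assumptions that are mine, not the article\u2019s: images are lists of rows of gray values (0\u2013255, dark ink on a light background), characters never touch, and every glyph already has the same size as its template. The helper names (binarize, segment_columns, build_template, recognize) are hypothetical. The code binarizes the image, splits it on fully blank columns, and labels each glyph with the template whose pixels agree with it most. Each template is built by overlaying a few samples of a character and keeping the pixels that are ink in at least half of them.<\/p>
```python
# Minimal sketch of the classical (pre-deep-learning) pipeline:
# binarize, segment on blank columns, match each glyph against templates.
# Assumptions (hypothetical): images are lists of rows of gray values 0-255
# with dark ink on a light background, characters never touch, and every
# glyph already has the same size as the templates.

def binarize(gray, threshold=128):
    # 1 = ink (dark pixel), 0 = background
    return [[0 if px >= threshold else 1 for px in row] for row in gray]

def segment_columns(binary):
    # Return (start, end) column ranges separated by fully blank columns.
    width = len(binary[0])
    spans, start = [], None
    for x in range(width):
        has_ink = any(row[x] for row in binary)
        if has_ink and start is None:
            start = x
        elif not has_ink and start is not None:
            spans.append((start, x))
            start = None
    if start is not None:
        spans.append((start, width))
    return spans

def build_template(samples):
    # Overlay same-size binary samples of one character: a pixel is ink in
    # the template if it is ink in at least half of the samples.
    h, w = len(samples[0]), len(samples[0][0])
    return [[1 if 2 * sum(s[y][x] for s in samples) >= len(samples) else 0
             for x in range(w)] for y in range(h)]

def similarity(glyph, template):
    # Fraction of pixels that agree; a size mismatch scores 0.
    if len(glyph) != len(template) or len(glyph[0]) != len(template[0]):
        return 0.0
    agree = sum(g == t for grow, trow in zip(glyph, template)
                for g, t in zip(grow, trow))
    return agree / (len(glyph) * len(glyph[0]))

def recognize(gray, templates):
    binary = binarize(gray)
    text = ''
    for start, end in segment_columns(binary):
        glyph = [row[start:end] for row in binary]
        text += max(templates, key=lambda ch: similarity(glyph, templates[ch]))
    return text

def from_art(rows):
    # Demo helper: '#' becomes a dark pixel (0), anything else light (255).
    return [[0 if c == '#' else 255 for c in r] for r in rows]

ones = [['.#.', '##.', '.#.', '###'],
        ['.#.', '.#.', '.#.', '###'],
        ['.#.', '##.', '.#.', '.#.']]
sevens = [['###', '..#', '.#.', '.#.'],
          ['###', '..#', '.#.', '#..'],
          ['###', '.##', '.#.', '.#.']]
templates = {
    '1': build_template([binarize(from_art(s)) for s in ones]),
    '7': build_template([binarize(from_art(s)) for s in sevens]),
}
captcha = from_art(['###..#.',
                    '..#.##.',
                    '.#...#.',
                    '.#..###'])
print(recognize(captcha, templates))  # prints 71
```
<p>Scoring by the share of agreeing pixels is the crudest possible similarity measure. It is still enough to show why the approach is fragile: as soon as characters touch or rotate, blank-column segmentation returns glyphs that match no template well.<\/p>
<p>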
For some simple CAPTCHAs you could skip \u201cintelligence\u201d altogether: overlay several samples of each digit to get a mask unique to that character, then search the picture for matches. Such tricks, however, only work on the most primitive and uniform CAPTCHAs.<\/p>\n<p>CAPTCHA Complication vs. Algorithm Improvement. CAPTCHA creators responded to cracks by increasing complexity: characters were distorted more heavily, color noise and background patterns were added, and fonts became inconsistent. Then came CAPTCHAs with stuck-together characters, where letters overlap. All this hindered segmentation, the key step for classical OCR. Machine learning entered the game: researchers trained models to recognize CAPTCHA characters even in noise, and back in the 2000s papers were already applying SVMs and other algorithms to specific CAPTCHA generators. But the real breakthrough came with deep learning.<\/p>\n<p>In 2014 Google announced a sensational result: its neural network solved the toughest distorted-text reCAPTCHAs with 99.8 % accuracy. The irony is that Google was cracking its own defense: reCAPTCHA, which it had acquired in 2009. The machine outperformed humans at a task that had been designed to be solvable only by humans! This made text distortions useless overnight, because if an algorithm reads characters better than people do, the protection loses its meaning. That is probably why Google soon moved reCAPTCHA from noisy text to pictures and behavioral evaluation.<\/p>\n<blockquote>\n<p>Thinking about this arms race, another thought comes to mind: if not for the vanity of some and the foolishness of others, we might still be at the stage when reCAPTCHA, or even simple text CAPTCHAs, first appeared. The moment some enthusiast found a solution, he promptly published it for everyone to see, which in turn pushed CAPTCHA developers to make it harder\u2026 But perhaps I should hold my tongue\u2026<\/p>\n<\/blockquote>\n<p>A similar story is unfolding with image CAPTCHAs. 
It was initially thought that recognizing, say, traffic lights in photos is harder for computers than for people, because humans have superior visual perception. The revolution in computer vision (deep convolutional networks) erased this asymmetry. A modern image-classification model detects objects with high accuracy; just consider how reliably your phone recognizes animals, signs, and other objects in photos. A fresh example: in 2024 researchers reported that a YOLO-based model solved 100 % of reCAPTCHA v2 image challenges (traffic lights and the other object classes), whereas the best earlier results were around 70 %. Moreover, the bot needed about as many challenges as an average human before the system let it through. It is tempting to proclaim that \u201cwe have officially entered the post-CAPTCHA era,\u201d where classical checks can no longer tell a human from a smart machine, but it does not feel like the end of the story yet.<\/p>\n<p>It is important to understand that deep learning did not just increase accuracy; it changed the approach itself. Previously, a script had to follow a rigid sequence of steps: filter the background, split the characters, recognize each one separately. Now an end-to-end neural network can be trained instead: you feed it a CAPTCHA picture, and it outputs the text string (or, for an image CAPTCHA, the probability of the required class). Intermediate tasks such as segmentation are learned inside the network, with no hand-coded rules. 
For example, DFCR, a DenseNet variant from China (where else, if not the nation that once gave the world gunpowder, right?), achieved &gt; 99.9 % accuracy on CAPTCHAs with noise and stuck-together characters, because the deep convolutional network learned to pick out separate symbols even in difficult cases and classify them confidently.<\/p>\n<p>For clarity, here is a small comparison table:<\/p>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<td>\n<p align=\"left\">Criterion<\/p>\n<\/td>\n<td>\n<p align=\"left\">Classical (OCR, scripts)<\/p>\n<\/td>\n<td>\n<p align=\"left\">Modern (AI \/ neural nets)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Requirements<\/p>\n<\/td>\n<td>\n<p align=\"left\">Filtering rules, OCR libraries, templates. Requires manual tuning for each CAPTCHA.<\/p>\n<\/td>\n<td>\n<p align=\"left\">A trained ML model (CNN\/RNN). Needs a training dataset, but then generalizes better.<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Character Segmentation<\/p>\n<\/td>\n<td>\n<p align=\"left\">Required: the boundary of each character must be found before recognition. Often fails on interference or merged letters.<\/p>\n<\/td>\n<td>\n<p align=\"left\">Not explicitly needed: an end-to-end model reads the full text at once, segmenting implicitly through its internal features. Even stuck-together symbols are usually recognized correctly.<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Accuracy on Difficult CAPTCHAs<\/p>\n<\/td>\n<td>\n<p align=\"left\">Limited, often &lt; 90 % with heavy distortions. Improving it means adding heuristics for the specific case.<\/p>\n<\/td>\n<td>\n<p align=\"left\">Near 100 % with sufficient training. 
Makes fewer mistakes than humans on typical tasks, but may fail on entirely new types absent from the training data.<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Adaptability<\/p>\n<\/td>\n<td>\n<p align=\"left\">Poor transfer to new CAPTCHA types: the code and logic must be reworked.<\/p>\n<\/td>\n<td>\n<p align=\"left\">Can be fine-tuned on new data; universal architectures (e.g., ResNet, Transformer) carry over to various tasks.<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Solving Speed<\/p>\n<\/td>\n<td>\n<p align=\"left\">High (milliseconds), since the algorithms are simple, but difficult CAPTCHAs may waste time on repeated segmentation attempts.<\/p>\n<\/td>\n<td>\n<p align=\"left\">High: neural nets recognize a CAPTCHA in tens of milliseconds on a GPU. The bottlenecks are data preparation and, for solving services, the task queue.<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<p>As we can see, AI has surpassed classical methods in flexibility and efficiency. But how exactly do neural networks solve CAPTCHAs? Let\u2019s examine the main architectures used to crack different CAPTCHA types.<\/p>\n<h2>Neural Networks vs. CAPTCHA: CNN, RNN, CRNN, Transformers (How AI CAPTCHA Solvers Work)<\/h2>\n<p>Modern AI CAPTCHA solvers draw on a rich arsenal of deep-learning models, and the choice of architecture usually depends on the CAPTCHA type. 
Here are the main approaches:<\/p>\n<figure class=\"full-width\"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/ef6\/66f\/d54\/ef666fd545fc71b792ee1e59253093cf.jpg\" width=\"939\" height=\"497\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/ef6\/66f\/d54\/ef666fd545fc71b792ee1e59253093cf.jpg 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/ef6\/66f\/d54\/ef666fd545fc71b792ee1e59253093cf.jpg 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>Convolutional Neural Networks (CNNs) specialize in images. They learn to extract meaningful features from a picture: letter contours, textures, object shapes. In CAPTCHA solving they are therefore used mainly for character recognition or image classification. The simpler option is to train a CNN on individual characters (0\u20139, A\u2013Z), in which case the CAPTCHA image must first be sliced into symbols. The more advanced option runs the entire CAPTCHA through a CNN and obtains a feature vector for each region of the image. CNNs alone, however, do not model a sequence of characters, so for complex CAPTCHAs they are supplemented with recurrent layers.<\/p>\n<p>Recurrent Neural Networks (RNNs) are a family of networks that process sequences (data series). In CAPTCHA solving, RNNs read text left to right, the way a person does. For example, you can first extract image features (vector representations of image columns) and feed them into an RNN, which \u201creads\u201d them one after another and outputs a sequence of characters. The classic LSTM and GRU modules can remember context, which helps when characters influence each other (say, when the model takes the probabilities of letter combinations into account). 
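<\/p>
<p>To show what reading image columns in sequence means, here is a minimal Python sketch (standard library only) of the simplest recurrent cell, an Elman RNN, run over the columns of a tiny binary image. The class and function names are illustrative. Raw pixel columns stand in for the CNN features a real solver would use, and the weights are random and untrained, so the sketch demonstrates the data flow rather than recognition. In practice these would be LSTM or GRU layers from a framework such as PyTorch, and a final layer would map each hidden state to character probabilities.<\/p>
```python
import math
import random

# Minimal sketch of an RNN reading an image column by column.
# Assumptions (illustration only): raw pixel columns stand in for the CNN
# feature columns a real solver would use, and the weights are random and
# untrained, so this shows the data flow, not actual recognition.

def columns(image):
    # Turn a list of rows into a left-to-right sequence of column vectors.
    return [[row[x] for row in image] for x in range(len(image[0]))]


class ElmanRNN:
    # The simplest recurrent cell; LSTM and GRU add gates on top of this idea.
    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        self.w_x = [[rng.uniform(-0.5, 0.5) for _ in range(input_size)]
                    for _ in range(hidden_size)]
        self.w_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden_size)]
                    for _ in range(hidden_size)]
        self.b = [0.0] * hidden_size

    def step(self, x, h):
        # h_t = tanh(W_x x_t + W_h h_(t-1) + b): the previous hidden state
        # carries context from the columns that were already read.
        return [math.tanh(sum(w * xi for w, xi in zip(wx_row, x))
                          + sum(w * hi for w, hi in zip(wh_row, h)) + b)
                for wx_row, wh_row, b in zip(self.w_x, self.w_h, self.b)]

    def run(self, sequence):
        h = [0.0] * len(self.b)
        states = []
        for x in sequence:
            h = self.step(x, h)
            states.append(h)
        return states  # one hidden state per column


image = [[0, 1, 1, 0, 1],
         [1, 0, 1, 0, 1],
         [0, 1, 1, 0, 1]]
rnn = ElmanRNN(input_size=3, hidden_size=4)
states = rnn.run(columns(image))
print(len(states))  # 5, one state per column

# Context: changing only the first column also changes the final state.
changed = [row[:] for row in image]
changed[0][0] = 1
print(rnn.run(columns(changed))[-1] != states[-1])
```
<p>The second check illustrates the point of recurrence: the state after the last column also depends on the first one, which is how an RNN carries context from left to right. A BiLSTM does the same in both directions.<\/p>
<p>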
RNNs are especially helpful for dynamic or sequential CAPTCHAs, e.g., when several digits appear one by one, or for audio CAPTCHAs, where a sound sequence must be converted into a character sequence. On their own, though, RNNs handle pictures poorly, so they are usually combined with CNNs.<\/p>\n<p>CRNN (Convolutional Recurrent Neural Network) is a CNN and RNN combination that has become the de facto standard for recognizing text CAPTCHAs, and text in images in general. In a typical scheme, a convolutional network (e.g., several conv + pooling layers) extracts a feature map from the CAPTCHA, which is treated as a sequence of feature vectors along the image width. A recurrent block (often a BiLSTM, i.e., a bidirectional LSTM) then processes this sequence, taking neighboring context into account, and its output is turned into a predicted sequence of characters. Such a model is usually trained with CTC loss (Connectionist Temporal Classification), which aligns the network\u2019s per-column predictions with the target text without requiring the position of each character to be labeled. CTC adds a special \u201cblank\u201d symbol; at inference, repeated symbols are merged and blanks are removed, so the model learns to \u201cstretch\u201d its output to the needed length by itself and does not need perfect character segmentation. As a result, a CRNN can read an entire CAPTCHA even when characters overlap or their number varies from one CAPTCHA to the next.<\/p>\n<p>In real projects CRNNs have repeatedly proven their effectiveness. For example, a CNN+BiLSTM model trained on 20 k synthetic CAPTCHAs (random letters and digits with various fonts and noise lines) showed high accuracy on previously unseen CAPTCHAs and correctly read even symbols with unfamiliar distortions. Compared with a classical approach (splitting a CAPTCHA image into five parts and classifying each fragment with a separate CNN), the end-to-end LSTM model was much more reliable and generalized better across fonts.<\/p>\n<p>Transformers and attention: the newest class of models, which has conquered NLP and CV. 
For CAPTCHAs, transformers are still mostly at the research stage, but their potential is huge. A transformer handles sequences without recurrence, thanks to the self-attention mechanism. For example, you can take a Vision Transformer (ViT): split the CAPTCHA image into patches and pass them through self-attention layers to obtain feature vectors. A text decoder (another transformer) then generates the text from these features, \u201cattending\u201d to the relevant image regions at each step. This is essentially how large models describe pictures in text today. Transformers have already been applied to CAPTCHAs successfully: a Swin-Transformer-based architecture showed &gt; 90 % accuracy on complex text CAPTCHAs, outperforming the classic CNN+RNN. In 2023 there were also attempts to apply large language models (LLMs) to logic CAPTCHAs, though accuracy was a modest ~63 %. Still, the trend is clear: transformers combine vision and language, which may let them solve even CAPTCHAs built on scene descriptions or complex questions.<\/p>\n<figure class=\"full-width\"><img decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/cbf\/22a\/4b4\/cbf22a4b4d8103b37222eccb09539cd1.jpg\" width=\"668\" height=\"500\" sizes=\"auto, (max-width: 780px) 100vw, 50vw\" srcset=\"https:\/\/habrastorage.org\/r\/w780\/getpro\/habr\/upload_files\/cbf\/22a\/4b4\/cbf22a4b4d8103b37222eccb09539cd1.jpg 780w,&#10;       https:\/\/habrastorage.org\/r\/w1560\/getpro\/habr\/upload_files\/cbf\/22a\/4b4\/cbf22a4b4d8103b37222eccb09539cd1.jpg 781w\" loading=\"lazy\" decode=\"async\"\/><\/figure>\n<p>Generative Adversarial Networks (GANs) do not \u201csolve\u201d CAPTCHAs directly, but they contribute from another angle. Adversarial training between a generator and a discriminator has been used to generate CAPTCHAs that resemble real ones, in order to improve solver training. 
The idea is as follows: a generator creates CAPTCHA images, and a discriminator tries to tell generated images from real ones. As training proceeds, the generator\u2019s output becomes ever harder to distinguish from the real CAPTCHA scheme. If the generator is conditioned on the text it renders, each synthetic image comes with a free label, so a separate solver can be trained on as many realistic examples as needed, which raises recognition accuracy. The approach provides practically unlimited training data and adapts to new CAPTCHA distortions.<\/p>\n<h2>Practice: Tools and Services for Solving CAPTCHA Using AI<\/h2>\n<p>Theory is interesting and at times even not boring, but how do you apply all of the above in practice? Let\u2019s look at existing tools, from open-source libraries to commercial services that market themselves as AI CAPTCHA solvers.<\/p>\n<p>Open-Source: GitHub and Research Communities<\/p>\n<p>In recent years the community has published many projects demonstrating how to bypass different CAPTCHAs. A simple GitHub search for \u201ccaptcha solver ai\u201d or \u201cAI captcha solver GitHub\u201d yields dozens of repositories. As a rule, these are either research projects or utilities tailored to a particular CAPTCHA.<\/p>\n<p>Neural-network solvers for text CAPTCHAs. For example, one project (<a href=\"https:\/\/github.com\/jameskokoska\/CAPTCHA-Solver\" rel=\"noopener noreferrer nofollow\">the CAPTCHA-Solver repository<\/a>) describes building a CNN+BiLSTM model in detail. The authors generate ~20 k synthetic CAPTCHAs (random letters and digits with different fonts and noise lines) and train the model to recognize 5-character sequences. The code uses PyTorch and TensorFlow, plus OpenCV and Pillow for image processing, and pytesseract serves as a baseline for comparing quality. The trained model solves &gt; 95 % of test CAPTCHAs in fractions of a second, whereas standard OCR fails on most heavily distorted ones. 
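<\/p>
<p>A model like this emits, for every column of its feature map, a probability distribution over the alphabet. If it is trained with CTC, as CRNNs usually are, the distribution also includes the \u201cblank\u201d symbol, and the per-column outputs still have to be turned into a string. The repository\u2019s exact decoding code is not described here, so below is a generic Python sketch (standard library only) of greedy CTC decoding. It assumes index 0 is the blank and uses a hypothetical digits-and-capitals alphabet. The sketch also computes whole-string accuracy (a CAPTCHA counts as solved only if every character is right) alongside the more forgiving per-character accuracy. The distinction matters: with five characters and independent errors, 99 % per-character accuracy corresponds to only about 95 % whole-string accuracy, since 0.99^5 \u2248 0.95.<\/p>
```python
# Minimal sketch of greedy (best-path) CTC decoding plus two accuracy metrics.
# Assumptions (hypothetical): index 0 is the CTC blank, the alphabet is
# digits plus uppercase letters, and each row of probs is the model output
# for one feature column (time step) of the CAPTCHA image.

ALPHABET = '-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ'  # '-' is the blank
BLANK = 0


def ctc_greedy_decode(probs):
    # 1) pick the most likely symbol at every time step,
    # 2) merge consecutive repeats, 3) drop blanks.
    best = [max(range(len(row)), key=row.__getitem__) for row in probs]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(ALPHABET[idx])
        prev = idx
    return ''.join(out)


def accuracies(predictions, labels):
    # Whole-string accuracy: a CAPTCHA counts only if every character is right.
    # Per-character accuracy is more forgiving; missing characters count as errors.
    exact = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
    hits = sum(a == b for p, t in zip(predictions, labels) for a, b in zip(p, t))
    return exact, hits / sum(len(t) for t in labels)


def path_to_probs(path, confidence=0.9):
    # Demo helper: turn a per-step path such as '-77-7' into probability
    # rows that peak on the given symbols.
    n = len(ALPHABET)
    rows = []
    for ch in path:
        row = [(1 - confidence) / (n - 1)] * n
        row[ALPHABET.index(ch)] = confidence
        rows.append(row)
    return rows


print(ctc_greedy_decode(path_to_probs('--77-7-11--')))  # 771
print(accuracies(['7K2Q9', 'AB1CD'], ['7K2Q9', 'AB7CD']))  # (0.5, 0.9)
```
<p>The blank between the two 7s is what lets CTC output a doubled character. Without it, consecutive repeats would merge into a single symbol. Beam search decoding, often combined with a character-level language model, is a common and more accurate alternative to this greedy version.<\/p>
<p>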
Similar projects publish datasets too, for example this <a href=\"https:\/\/github.com\/drandule\/mail.ru__captcha_dataset\" rel=\"noopener noreferrer nofollow\">open dataset of ~30 k CAPTCHAs<\/a> from mail.ru, along with scripts to train a model on it.<\/p>\n<p>Scripts based on OpenCV + OCR. Some repositories offer non-deep-learning solutions for simple CAPTCHAs: they find contours, extract the characters, and run Tesseract, or even match against bitmap templates, as mentioned above. Such projects are appealing in their simplicity and make a good baseline: if a CAPTCHA is simple, there is no need to build neural networks. However, such CAPTCHAs have almost disappeared from popular sites, since spammers defeated them long ago, so today more intelligent algorithms are what matters.<\/p>\n<p>Browser extensions and scripts to bypass CAPTCHAs. Web automation has tools that can be plugged into Selenium \/ Puppeteer. For example, the open-source extension <a href=\"https:\/\/chromewebstore.google.com\/detail\/buster-captcha-solver-for\/mpbjkejclgfgadiemmefgebjfooflfhl?hl=en\" rel=\"noopener noreferrer nofollow\"><strong>Buster<\/strong> <\/a>(for Chrome \/ Firefox) presses the \u201cplay audio\u201d button in reCAPTCHA, sends the recording to the Google Speech-to-Text API, and enters the recognized text back, bypassing the CAPTCHA for free. The <a href=\"https:\/\/chromewebstore.google.com\/detail\/captcha-solver-auto-recog\/nghbiefcnamlpkjagnhoknkklkfiganp?hl=en\" rel=\"noopener noreferrer nofollow\"><strong>SolveCaptcha<\/strong><\/a> extension, by contrast, solves CAPTCHAs with its own internal AI algorithms. Another example is the <a href=\"https:\/\/github.com\/rucaptcha\/2captcha-solver\" rel=\"noopener noreferrer nofollow\"><strong>2captcha-solver<\/strong><\/a> library (npm, Python), which integrates with the <strong>2Captcha<\/strong> service: you send it a CAPTCHA and receive the answer in code. 
GitHub has a \u201ccaptcha-solving\u201d topic that collects such tools. Many of them handle several CAPTCHA types at once (reCAPTCHA, hCaptcha, FunCaptcha, etc.) and detect the type on a page automatically. An open-source tool usually exposes a convenient API and, \u201cunder the hood,\u201d relies either on external services or on built-in models (like Buster for audio or SolveCaptcha for other CAPTCHA types).<\/p>\n<p>GitHub hosts not only CAPTCHA-solving projects but also CAPTCHA generators. For example, the <strong>captcha<\/strong> library for Python generates typical text CAPTCHAs that can be used to train models.<\/p>\n<h2>Commercial Services: Humans (Human CAPTCHA Solver) vs. Machines (AI CAPTCHA Solver)<\/h2>\n<p>If you have neither the desire nor the means to develop your own neural network, ready-made services come to the rescue. Historically they relied on human labor: you send a CAPTCHA image to the server, a real person looks at it and types the answer within a few seconds, and you get that answer back. Classic examples are <strong>2Captcha<\/strong> (aka RuCaptcha), <strong>SolveCaptcha<\/strong>, and <strong>DeathByCaptcha<\/strong>. Now AI-only services are entering the market, offering CAPTCHA solving without human participation, faster and cheaper. 
The table below briefly compares the main options, AI services versus human ones:<\/p>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<td>\n<p align=\"left\">Service \/ Approach<\/p>\n<\/td>\n<td>\n<p align=\"left\">Example<\/p>\n<\/td>\n<td>\n<p align=\"left\">Solution Method<\/p>\n<\/td>\n<td>\n<p align=\"left\">Time and Success<\/p>\n<\/td>\n<td>\n<p align=\"left\">Approximate Cost<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Human<\/p>\n<\/td>\n<td>\n<p align=\"left\">2Captcha, Anti-Captcha<\/p>\n<\/td>\n<td>\n<p align=\"left\">Paid human workers around the world solve the CAPTCHAs.<\/p>\n<\/td>\n<td>\n<p align=\"left\">~7\u201320 s on reCAPTCHA (simple image CAPTCHAs are faster). ~99 % accuracy (several workers answer if necessary).<\/p>\n<\/td>\n<td>\n<p align=\"left\">~$2\u20133 per 1 000 solved reCAPTCHAs (~$0.002\u20130.003 each). About $0.5\u20131 per 1 000 simple text CAPTCHAs.<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Artificial Intelligence<\/p>\n<\/td>\n<td>\n<p align=\"left\">noCaptchaAi<\/p>\n<\/td>\n<td>\n<p align=\"left\">Specialized neural networks and browser emulation.<\/p>\n<\/td>\n<td>\n<p align=\"left\">~5 s on reCAPTCHA v2 (often limited by the CAPTCHA\u2019s own minimum time). Accuracy up to 99 % on supported types; brand-new types may fail.<\/p>\n<\/td>\n<td>\n<p align=\"left\">~$0.8\u20131 per 1 000 solved reCAPTCHAs (~$0.0008\u20130.001 each).<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Hybrid<\/p>\n<\/td>\n<td>\n<p align=\"left\">SolveCaptcha (extension), others<\/p>\n<\/td>\n<td>\n<p align=\"left\">Tries AI first and brings in a human if that fails.<\/p>\n<\/td>\n<td>\n<p align=\"left\">Combines the strengths of both: AI instantly solves the easy 80\u201390 %, and humans finish the rest. 
Total time ~5\u201315 s, success ~99.9 %.<\/p>\n<\/td>\n<td>\n<p align=\"left\">~$1\u20132 per 1 000 (depending on the share of tasks passed to humans)<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<blockquote>\n<p>Note that human services occasionally run short of available workers, while AI services may need model updates whenever CAPTCHA algorithms change.<\/p>\n<\/blockquote>\n<p>As you can see, AI CAPTCHA solvers are already cheaper in many cases, so it is no surprise that even classic services are adopting machine learning to avoid losing the market to competitors.<\/p>\n<h2>The Future of AI CAPTCHA Solvers in the AI Era<\/h2>\n<p>The CAPTCHA market appears to have reached a tipping point: classic Turing tests for users are doing an ever poorer job. AI has learned to read distorted text, spot objects in photos, and decipher audio, sometimes with superhuman accuracy. Add synthetic data, GANs, and distributed computing, and sooner or later a machine will crack any specific CAPTCHA.<\/p>\n<p>CAPTCHA developers, of course, are not sitting idle. There is a clear shift toward invisible checks (behavioral factors) and extensive contextual data (browser history, device parameters, even mouse movements, phone tilt angle, and more). Ideally, the check happens without the user noticing it. Cloudflare Turnstile, for example, asks no questions at all and runs its \u201csecurity check\u201d in the background, which Cloudflare considers more effective than classical CAPTCHAs. Another trend is multilayer authentication: before showing a CAPTCHA, the system checks whether it already knows the user (whether they are logged in, hold a token, where the request comes from). 
Perhaps the CAPTCHA of the future will move entirely from the UI to the backend, with measures such as SMS verification or biometrics reserved for suspicious cases (which already goes beyond the classical CAPTCHA).<\/p>\n<p>As web protocols and identification evolve, we may eventually access the web through trusted attestation (a government portal account, a device, or a digital passport), and then extra checks will no longer be needed. But there are people who voluntarily hand their government accounts over to scammers\u2026<\/p>\n<p>No, this is definitely not the end!<\/p>\n<\/div>\n<\/div>\n<\/div>\n<p><!----><!----><\/div>\n<p><!----><!----><br \/> Link to the original article: <a href=\"https:\/\/habr.com\/ru\/articles\/915078\/\"> https:\/\/habr.com\/ru\/articles\/915078\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<div><!--[--><!--]--><\/div>\n<div id=\"post-content-body\">\n<div>\n<div class=\"article-formatted-body article-formatted-body article-formatted-body_version-2\">\n<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<p>CAPTCHA has become a familiar part of the internet: distorted text, \u201cfind all the traffic lights\u201d images, audio riddles, and other challenges that distinguish humans from machines. Every developer of bot systems, and every QA engineer automating web scenarios, has probably seen a script suddenly stall on a CAPTCHA at least once. A natural question arises: can a program be taught to solve CAPTCHAs as quickly and reliably as a human? 
In this article I look at how AI CAPTCHA solvers are built, from classical OCR methods to modern neural networks.<\/p>\n<h2>CAPTCHA Types and Why Bots Find Them Hard<\/h2>\n<p>Before breaking a CAPTCHA, let\u2019s look at the kinds that exist and why algorithms have trouble with them. CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) comes in many forms. The main types are:<\/p>\n<ul>\n<li>\n<p>Text CAPTCHAs: classic images with distorted characters (letters, digits) that the user must type in. Such tasks used to be solved with OCR (Optical Character Recognition), but as distortions and noise intensified, simple recognition stopped working: characters overlap and bend, and interference is added deliberately to defeat segmentation and recognition. For a bot this is a real problem, since it must isolate and recognize every character, which is non-trivial without a trained model.<\/p>\n<\/li>\n<li>\n<p>reCAPTCHA v2 (Google): the familiar \u201cI\u2019m not a robot\u201d checkbox, which analyzes user behavior and, if that behavior looks suspicious, opens a pop-up grid of images (typically 3 \u00d7 3) where you must pick every picture that matches a criterion (e.g., all squares with cars). It combines behavioral analysis with a visual object-classification task, so a bot has to understand image content, which is a computer-vision problem.<\/p>\n<\/li>\n<li>\n<p>reCAPTCHA v3 and Cloudflare Turnstile: invisible, next-generation CAPTCHAs. They require no user action: the backend analyzes many behavioral, environmental, and browser signals and assigns a hidden \u201csuspicion rating.\u201d A low rating means the user is treated as human; a high one may trigger an extra check. 
This is a tough barrier for a bot, because it has to imitate human behavior across many signals instead of solving one specific puzzle.<\/p>\n<\/li>\n<li>\n<p>hCaptcha (by Intuition Machines, used by Cloudflare for a time) and FunCaptcha (Arkose Labs): alternatives that work much like reCAPTCHA v2. The user gets either a set of images to classify or an interactive task (e.g., rotate a 3-D object or find an element). Each system adds its own variations of visual puzzles.<\/p>\n<\/li>\n<li>\n<p>GeeTest and other puzzles: popular on Asian services, these include puzzle pieces (drag a fragment into the correct position), shuffled image tiles, simple questions, and sliders. A puzzle CAPTCHA, for example, asks you to align a cut-out fragment with the matching hole in the picture. Bots find this hard because it requires both image understanding and convincingly human input (mouse movement).<\/p>\n<\/li>\n<li>\n<p>Audio CAPTCHA: usually offered alongside a visual CAPTCHA for visually impaired users. A distorted recording of numbers or words plays, and the user must type what they hear. The assumption is that humans understand noisy speech better than machines do. Yet these CAPTCHAs are not flawless: Stanford researchers automatically solved audio CAPTCHAs with success rates of up to 75 % using speech-recognition algorithms, and with today\u2019s powerful ASR (Automatic Speech Recognition) models, audio riddles no longer guarantee protection.<\/p>\n<\/li>\n<li>\n<p>Behavioral and invisible CAPTCHAs: besides reCAPTCHA v3 and Turnstile, mentioned above, there are hidden tests built into a site (for example, honeypot fields that only bots fill in, or analysis of form-filling speed). All of them check how natural the user\u2019s actions are. The bot faces not a specific puzzle but the need to pass for a real user: move the mouse \u201clike a human,\u201d wait random delays, and run a \u201cclean\u201d browser. 
Such checks are hard to defeat with a recognition algorithm alone, so workarounds are used instead, e.g., obtaining a reCAPTCHA token via an API or using browser-fingerprint databases.<\/p>\n<\/li>\n<\/ul>\n<p>For more detail, see my previous <a href=\"https:\/\/habr.com\/ru\/articles\/847302\/\" rel=\"noopener noreferrer nofollow\">article about CAPTCHA types<\/a>.<\/p>\n<p>Each CAPTCHA type calls for its own solving approach. A universal CAPTCHA solver remains a challenge: depending on the check it encounters, the bot must read text, classify pictures, or synthesize behavior. Let\u2019s move on to how algorithms try to solve CAPTCHAs and how these methods have evolved.<\/p>\n<h2>From OCR to Neural Networks: The Evolution of CAPTCHA Bypassing<\/h2>\n<p>The first attempts at automatic CAPTCHA solving were closely tied to the development of OCR. A classical text CAPTCHA (distorted letters and digits on a noisy background) was designed precisely to confuse OCR systems. Early CAPTCHAs could be cracked with relatively simple methods: filter the image, extract contours, segment it into individual characters, and recognize each one by template matching or with a standard OCR engine such as Tesseract. For some simple CAPTCHAs you could skip \u201cintelligence\u201d altogether: overlay several samples of each digit to build a mask unique to that character, then search the picture for matches. But such tricks work only on the most primitive and uniform CAPTCHAs.<\/p>\n<p>CAPTCHA Complication vs. Algorithm Improvement. CAPTCHA creators responded to cracks by increasing complexity: characters were distorted more heavily, color noise and background patterns were added, and fonts became inconsistent. CAPTCHAs appeared in which characters were glued together so that letters overlapped. All of this hindered segmentation, the key step in classical OCR.
Machine learning entered the game: researchers trained models to recognize CAPTCHA characters even under noise. Back in the 2000s, papers already applied SVMs and other algorithms to specific CAPTCHA generators. But the breakthrough came with deep learning.<\/p>\n<p>In 2014 Google announced a sensational result: its neural network learned to solve the toughest text CAPTCHAs with 99.8 % accuracy. How ironic: Google itself pioneered breaking the very defense it owned, namely reCAPTCHA, which it had acquired in 2009. The machine outperformed humans at what had once been meant as a purely human task! This made text distortion practically useless: if an algorithm reads characters better than people do, the protection loses its meaning. Probably for this reason Google soon moved reCAPTCHA from noisy text to image selection and behavior analysis.<\/p>\n<blockquote>\n<p>Regarding the first sentence of that paragraph, another thought suggests itself: if not for the vanity of some and the foolishness of others, we might still be at the stage of early reCAPTCHA, or even simple text CAPTCHAs. The moment some enthusiast found a solution, he promptly posted it for all to see, which in turn prompted CAPTCHA developers to make their challenges harder\u2026 Me and my big mouth\u2026<\/p>\n<\/blockquote>\n<p>A similar plot is unfolding with image CAPTCHAs. It was initially thought that, say, recognizing traffic lights in photos is harder for computers than for people, because humans have advanced visual perception. But the revolution in computer vision (deep convolutional networks) erased this asymmetry. A modern vision model can detect objects with high accuracy; consider how accurately your phone recognizes animals, signs, and other objects in photos. A fresh example: a 2024 study used a YOLO object-detection model to solve 100 % of the reCAPTCHA v2 image challenges it was given, whereas the best earlier results were around 70 %.
Moreover, the bot needed about as many challenge rounds as an average human before the system let it through. It is tempting to declare that \u201cwe have officially entered the post-CAPTCHA era,\u201d in which classical checks can no longer tell a human from a smart machine, but it feels like this is not the end of the story yet.<\/p>\n<p>It is important to understand that deep learning did not just raise accuracy; it changed the approach itself. Previously, a script had to follow a set of rigid steps: filter the background, split the characters, recognize them separately. Now an end-to-end neural network can be trained: you feed it a CAPTCHA picture, and it outputs the text string (or, for an image CAPTCHA, the probability of the target class). The network learns intermediate tasks such as segmentation internally, without hand-coded rules. For example, DFCR, a DenseNet variant from China (where else, if not the nation that once gave the world gunpowder, right?), achieved &gt; 99.9 % accuracy on CAPTCHAs with noise and touching characters, because the deep convolutional network learned to see separate symbols even in difficult cases and classify them confidently.<\/p>\n<p>For clarity, a small comparison table:<\/p>\n<div>\n<div class=\"table\">\n<table>\n<tbody>\n<tr>\n<td>\n<p align=\"left\">Criterion<\/p>\n<\/td>\n<td>\n<p align=\"left\">Classical (OCR, scripts)<\/p>\n<\/td>\n<td>\n<p align=\"left\">Modern (AI \/ neural nets)<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Requirements<\/p>\n<\/td>\n<td>\n<p align=\"left\">Filtering rules, OCR libraries, templates. Needs manual tuning for each CAPTCHA type.<\/p>\n<\/td>\n<td>\n<p align=\"left\">Trained ML model (CNN\/RNN).
Needs a training dataset, but then generalizes better across CAPTCHAs.<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Character Segmentation<\/p>\n<\/td>\n<td>\n<p align=\"left\">Required: the boundaries of each character must be found before recognition. Often fails on interference or merged letters.<\/p>\n<\/td>\n<td>\n<p align=\"left\">Not needed explicitly: an end-to-end model reads the full text at once, segmenting implicitly through its internal features. Even touching characters are recognized correctly.<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Accuracy on Difficult CAPTCHAs<\/p>\n<\/td>\n<td>\n<p align=\"left\">Limited, often &lt; 90 % with heavy distortion. Improving it requires heuristics tailored to the specific case.<\/p>\n<\/td>\n<td>\n<p align=\"left\">Near 100 % with sufficient training. Makes fewer mistakes than humans on typical tasks, but may fail on entirely new CAPTCHA types absent from the training data.<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Adaptability<\/p>\n<\/td>\n<td>\n<p align=\"left\">Poor transfer to new CAPTCHA types: the code and logic must be reworked.<\/p>\n<\/td>\n<td>\n<p align=\"left\">Can be fine-tuned on new data. General-purpose architectures (e.g., ResNet, Transformer) carry over to different tasks.<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p align=\"left\">Solving Speed<\/p>\n<\/td>\n<td>\n<p align=\"left\">High (milliseconds) since the algorithms are simple, but difficult CAPTCHAs may waste time on repeated segmentation attempts.<\/p>\n<\/td>\n<td>\n<p align=\"left\">High: neural nets perform recognition in tens of milliseconds on a GPU.
The bottleneck is data preparation.<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-461860","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/461860","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=461860"}],"version-history":[{"count":0,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/461860\/revisions"}],"wp:attachment":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=461860"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=461860"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=461860"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}