Advanced CAPTCHA Bypass Techniques for SEO Specialists with Code Examples


CAPTCHA Bypass in SEO: What It Is, and Is Its Importance Overrated?

Every SEO professional has encountered CAPTCHA. If they haven't, they are either not really professionals, misunderstand the term SEO (possibly confusing it with SMM or CEO), or are very new to this challenging field.

One could argue endlessly that CAPTCHA is overrated and doesn't deserve significant resources. But those arguments end the moment you need data from a search engine results page, say Yandex's, without knowing how to work with XML requests… Or when a client wants to scrape the entire Amazon (just because they do) and offers good pay…

When the client pays well: "Say no more..."

In short — it might not always be needed, but it’s better to be prepared, just in case. This is not akin to apocalypse prep; I’d liken it to swapping summer tires for winter ones in the South. Theoretically, the South has above-freezing temperatures year-round, so you could skip the swap. And during those few snowy days, you could just stay home. But what if you suddenly need to leave the region? And your tires are still summer ones… This is similar to having CAPTCHA bypass skills. You never know when they might come in handy.

We can’t avoid CAPTCHA, but we can prepare to bypass it.

Why CAPTCHA is Used Despite Available Bypass Methods

In reality, the situation is more nuanced than it might seem. Securing a site against data scraping can be challenging, especially for non-commercial projects or "hamster sites." Most owners lack the time or desire to allocate resources to some vague CAPTCHA. It's a different story if you own a major portal generating millions in profit; then protecting it is worth considering. Even basic protection against unscrupulous competitors who might start DDoSing your site is a valid reason.

Take Amazon as an example — they know a thing or two about eccentric protection methods. Not only do they have three types of CAPTCHA on the site, each appearing in a different scenario, but they also randomly change the design to keep automation tools and scrapers from reusing outdated methods. The design changes, the scripts get rewritten (and judging by the number of people trying to scrape Amazon, we're talking not one script but hundreds or thousands). And so the cycle goes.

Amazon and its bot protection

And what about smaller webmasters? Webmasters aren't fools; they realize that by implementing a tricky CAPTCHA, they create extra hurdles for genuine visitors. The harder the CAPTCHA, the higher the chance the user will leave for a competitor in the search results. So modern site owners try not to overdo CAPTCHA and look for a balance.

Leaving a site completely unprotected is unwise — low-level bots that lack CAPTCHA bypass skills but can still execute mass actions will swarm the site. Hence, they opt for universal solutions like reCAPTCHA or hCaptcha. Mission accomplished (meaning, the site is protected) without overly stressing users.

Website Protection Level: Acceptable

More complex CAPTCHAs come into play when bots start overrunning the site, but that’s another article’s topic.

In conclusion, leaving a site without protection is poor form, and using a middle-ground solution is entirely acceptable. It won’t deter those targeting the content but will discourage less sophisticated bots.

Why SEO Specialists Need CAPTCHA Bypass

Now, let’s consider it from the SEO specialist’s point of view: why and for what purpose would they need to bypass CAPTCHA?

CAPTCHA bypass may be necessary for the most basic task — analyzing search engine rankings. Yes, there are third-party services for this, which charge for daily monitoring, and aside from paying the service fee, you would also need to pay an external CAPTCHA recognition service.

When researching competitor sites, CAPTCHA is less of an issue: bypassing it on a regular site is easier than when collecting search rankings (the protection levels differ somewhat).

For automating routine tasks — this is a niche area. Not everyone uses it, but for committed SEO specialists, it can be useful.

In general, it’s essential to consider the economics — is it cheaper to pay a ranking monitoring service and a CAPTCHA recognition service, or to create your solution and reduce costs? Naturally, if you only have 1–2 projects, the second option sounds cumbersome, and if the client pays for everything, calculating costs is pointless. But if you own multiple projects and foot the bill yourself… You need to think…

Most SEO specialists resort to CAPTCHA bypass for Google or Yandex work, using ready-made solutions like CAPTCHA recognition services or API keys. But some specialists are not inclined to outsource everything; they have their own solutions for routine tasks.

Primary CAPTCHA Bypass Methods

Let's consider methods that take a bit more effort than simply adding an API key in Key Collector: they require deeper knowledge than finding a key on the service's homepage and pasting it into the right field.

The CAPTCHA changes its design, we update the script. And so it goes on endlessly.

Third-Party CAPTCHA Recognition Services

The most popular method is sending CAPTCHA to a specialized service (2captcha, ruCaptcha, etc.), which returns a ready solution. These services charge only for solved CAPTCHAs.

Everything can be automated, as I learned after days of struggling with various tasks. For instance, finding the "sitekey" on a page once by hand is manageable, but in a multi-threaded setup it's unrealistic. And so the code starts expanding: a module gets written to locate the required parameter on the page.

We move on: the parameter is found, sent to the third-party service, and we get the result. But something still needs to be done with the result… Insert it into the correct field… Code expansion again.

So the code below won't solve your problem at the snap of a finger; certain actions are still required, at minimum those described above.

Example of Standard (Universal) Code for Solving reCAPTCHA V2 in Python:

import requests
import time

API_KEY = 'YOUR_2CAPTCHA_KEY'
SITE_KEY = 'YOUR_SITE_KEY'
PAGE_URL = 'https://example.com'

def get_captcha_solution():
    captcha_id_response = requests.post("http://2captcha.com/in.php", data={
        'key': API_KEY,
        'method': 'userrecaptcha',
        'googlekey': SITE_KEY,
        'pageurl': PAGE_URL,
        'json': 1
    }).json()

    if captcha_id_response['status'] != 1:
        print(f"Error: {captcha_id_response['request']}")
        return None

    captcha_id = captcha_id_response['request']
    print(f"CAPTCHA sent. ID: {captcha_id}")

    for attempt in range(30):
        time.sleep(5)
        result = requests.get("http://2captcha.com/res.php", params={
            'key': API_KEY,
            'action': 'get',
            'id': captcha_id,
            'json': 1
        }).json()

        if result['status'] == 1:
            print(f"CAPTCHA solved: {result['request']}")
            return result['request']
        elif result['request'] == 'CAPCHA_NOT_READY':
            print(f"Waiting for solution... attempt {attempt + 1}/30")
        else:
            print(f"Error: {result['request']}")
            return None
    return None

captcha_solution = get_captcha_solution()

if captcha_solution:
    print('CAPTCHA solution:', captcha_solution)
else:
    print('Solution failed.')

Me at 3 a.m.: "Just one more line of code, and the CAPTCHA will bypass itself automatically."

In this example, you need to find one parameter on the page (in the code it's called SITE_KEY), which is usually labeled "sitekey" in the CAPTCHA markup. This parameter, along with the page URL and the 2captcha API key, is sent to the service, and after receiving the solved CAPTCHA token, you must insert it into the CAPTCHA response field.
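
As a rough sketch of that first step, the sitekey can often be pulled from the page automatically. The regular expression and the data-sitekey attribute below are assumptions that hold for a typical reCAPTCHA V2 embed and may need adjusting for a specific site:

import re
import requests

PAGE_URL = 'https://example.com'  # page that shows the CAPTCHA

def find_sitekey(page_url):
    # Download the page HTML and look for the data-sitekey attribute
    # that a typical reCAPTCHA V2 widget carries
    html = requests.get(page_url, timeout=10).text
    match = re.search(r'data-sitekey=["\']([\w-]+)["\']', html)
    return match.group(1) if match else None

sitekey = find_sitekey(PAGE_URL)
print('Found sitekey:', sitekey)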

Next, you can resume actions paused due to CAPTCHA.
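
To resume them, the received token usually has to be written into the hidden field that reCAPTCHA V2 reads from. Below is a minimal Selenium sketch of that step; the g-recaptcha-response ID is standard for reCAPTCHA V2, while the form selector and the captcha_solution value are assumptions tied to the example above:

from selenium import webdriver

PAGE_URL = 'https://example.com'
captcha_solution = 'TOKEN_FROM_THE_SERVICE'  # token returned by get_captcha_solution()

# In a real flow you would reuse the browser session where the CAPTCHA appeared
browser = webdriver.Chrome()
browser.get(PAGE_URL)

# Write the token into the hidden textarea that reCAPTCHA V2 reads from
browser.execute_script(
    "document.getElementById('g-recaptcha-response').innerHTML = arguments[0];",
    captcha_solution
)

# Submit the surrounding form (the selector is site-specific)
browser.execute_script("document.querySelector('form').submit();")
browser.quit()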

In any CAPTCHA that uses a token, the code will be similar, with differences in the set of parameters sent, and for some (like Amazon CAPTCHA), there is also a maximum resolution time.
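
For instance, if the target uses hCaptcha, the 2captcha submission endpoint stays the same and, as far as I know, only the parameter names change (method and sitekey instead of userrecaptcha and googlekey). A minimal sketch, worth double-checking against the service docs:

import requests

API_KEY = 'YOUR_2CAPTCHA_KEY'
SITE_KEY = 'YOUR_HCAPTCHA_SITE_KEY'
PAGE_URL = 'https://example.com'

# Same submission endpoint as for reCAPTCHA, different parameter set
response = requests.post("http://2captcha.com/in.php", data={
    'key': API_KEY,
    'method': 'hcaptcha',   # instead of 'userrecaptcha'
    'sitekey': SITE_KEY,    # instead of 'googlekey'
    'pageurl': PAGE_URL,
    'json': 1
}).json()
print(response)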

Advantages — easy setup and fast recognition

Disadvantages — paid, since it relies on a third-party service


CAPTCHA Bypass with Proxy and IP Rotation

The second option to bypass CAPTCHA for SEO tasks is based on using a large number of proxies that rotate either after a set period, upon encountering a CAPTCHA, or upon an error from the resource being accessed.

By rotating proxies, each request to the scraping source appears as a unique user, making actions less suspicious to the source (at least for a while).

To implement this method, you need an IP pool (which can be relatively expensive, sometimes even more costly than the CAPTCHA bypass through a third-party service). You can use rotating residential proxies, which eliminate the need for a large IP pool if the rotation is properly configured. Alternatively, mobile proxies, though more expensive than residential ones, provide higher quality. There is a solution for every budget and requirement.

Here is a code example that won’t work autonomously, but can be integrated into your script (program):

import requests
from itertools import cycle
import time
import urllib.parse

# List of proxies with individual logins and passwords
proxies_list = [
    {"proxy": "2captcha_proxy_1:port", "username": "user1", "password": "pass1"},
    {"proxy": "2captcha_proxy_2:port", "username": "user2", "password": "pass2"},
    {"proxy": "2captcha_proxy_3:port", "username": "user3", "password": "pass3"},
    {"proxy": "2captcha_proxy_4:port", "username": "user4", "password": "pass4"},
    # Add more proxies as needed
]

# Proxy rotation cycle
proxy_pool = cycle(proxies_list)

# Target URL to work with
url = "https://example.com"  # Replace with the desired site

# Headers to simulate a real user
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0) Gecko/20100101 Firefox/129.0"
}

# Sending several requests with proxy rotation
for i in range(5):  # Specify the desired number of requests
    proxy_info = next(proxy_pool)  # Select the next proxy
    proxy = proxy_info["proxy"]
    username = urllib.parse.quote(proxy_info["username"])
    password = urllib.parse.quote(proxy_info["password"])

    # Create a proxy URL with authorization
    proxy_with_auth = f"http://{username}:{password}@{proxy}"

    try:
        response = requests.get(
            url,
            headers=headers,
            proxies={"http": proxy_with_auth, "https": proxy_with_auth},
            timeout=10
        )

        # Check response status
        if response.status_code == 200:
            print(f"Request {i + 1} via proxy {proxy} successful. Status: {response.status_code}")
        else:
            print(f"Request {i + 1} via proxy {proxy} ended with status code {response.status_code}")

    except requests.exceptions.RequestException as e:
        print(f"Error with proxy {proxy}: {e}")

    # Delay between requests for natural behavior
    time.sleep(2)  # Adjust delay based on requirements

In this example, rotating residential proxies are used, which can be configured for 2captcha.

A good example of using this method is on LinkedIn, where it’s configured so you can log in three times from a single IP without a CAPTCHA, but on the fourth attempt, a CAPTCHA appears. Knowing this, you can set up proxy rotation and enjoy circumventing the system… at least until it’s patched.
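
As a sketch of that trigger logic, the rotation can be tied to a simple request counter plus a check of the response body. The three-requests threshold, the proxy addresses, and the "captcha" marker string are all assumptions to tune for the specific target:

import requests
from itertools import cycle

proxy_pool = cycle([
    "http://user1:pass1@proxy1.example.com:8080",
    "http://user2:pass2@proxy2.example.com:8080",
])

MAX_REQUESTS_PER_IP = 3  # assumed threshold, tune per target
current_proxy = next(proxy_pool)
requests_on_proxy = 0

def fetch(url):
    global current_proxy, requests_on_proxy
    # Switch the proxy once the per-IP threshold is reached
    if requests_on_proxy >= MAX_REQUESTS_PER_IP:
        current_proxy = next(proxy_pool)
        requests_on_proxy = 0
    response = requests.get(
        url,
        proxies={"http": current_proxy, "https": current_proxy},
        timeout=10
    )
    requests_on_proxy += 1
    # Switch the proxy immediately if the page looks like a CAPTCHA challenge
    # (in a real script you would also retry the request on the new proxy)
    if "captcha" in response.text.lower():
        current_proxy = next(proxy_pool)
        requests_on_proxy = 0
    return response

print(fetch("https://example.com").status_code)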

Pros — Does not require integration with third-party services if you have your own IP pool

Cons — Less stable than the first method, and high-quality proxies can be more expensive than using a CAPTCHA bypass service


CAPTCHA Bypass Using Headless Browsers

Of course, using proxy rotation alone without additional tactics is insufficient. Simple IP switching is no longer enough to prevent CAPTCHA in modern conditions, so the third method naturally follows the second and combines them.

Using headless browsers is not some new gimmick. They have been used for a long time to simulate user actions and make bots look like real people. In other words, you go to great lengths to look like a live person consuming content on the page, solving your SEO tasks as you go. When detected, you switch proxies and start over.

This clever method, given an adequate IP pool, allows scraping of large projects.

Another interesting aspect is that now you must not only make the bot look like a person but also hide the fact that you are using a headless browser. It’s like a spy game, rather than an advanced SEO task solution.

Below is a sample code for working with a headless browser:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import random
from itertools import cycle

# List of proxies with login and password
proxies_list = [
    {"proxy": "proxy1.example.com:8080", "username": "user1", "password": "pass1"},
    {"proxy": "proxy2.example.com:8080", "username": "user2", "password": "pass2"},
    {"proxy": "proxy3.example.com:8080", "username": "user3", "password": "pass3"},
    # Add more proxies as needed
]

# Proxy rotation cycle
proxy_pool = cycle(proxies_list)

# Settings for the headless browser
def create_browser(proxy=None):
    chrome_options = Options()
    chrome_options.headless = True  # Enable headless mode (on Selenium 4.10+ use add_argument("--headless=new") instead)
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")  # Disable automation detection

    # Set up proxy
    if proxy:
        chrome_options.add_argument(f'--proxy-server=http://{proxy["proxy"]}')

    # Additional arguments to hide headless mode
    chrome_options.add_argument("start-maximized")
    chrome_options.add_argument("disable-infobars")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option("useAutomationExtension", False)

    # Initialize the Chrome driver
    browser = webdriver.Chrome(options=chrome_options)
    browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """
    })

    # Return the initialized browser
    return browser

# Simulate user behavior
def mimic_user_behavior(browser):
    actions = [
        lambda: browser.execute_script("window.scrollBy(0, 300);"),  # Scroll down
        lambda: browser.execute_script("window.scrollBy(0, -300);"),  # Scroll up
        lambda: browser.execute_script(f"window.scrollBy(0, {random.randint(0, 500)});")  # Random scroll
    ]
    random.choice(actions)()  # Choose a random action
    time.sleep(random.uniform(1, 3))  # Random delay

# Main function to bypass CAPTCHA
def bypass_captcha(url, num_attempts=5):
    for i in range(num_attempts):
        proxy_info = next(proxy_pool)  # Get a new proxy
        browser = create_browser(proxy_info)  # Start browser with proxy

        try:
            # Go to the site
            browser.get(url)

            # Simulate user actions on the site
            mimic_user_behavior(browser)

            # Example check for an element on the page
            try:
                element = browser.find_element(By.XPATH, "//h1")
                print(f"Element found: {element.text}")
            except Exception:
                print("Element not found, maybe CAPTCHA")

            # Check if the visit was successful
            print(f"Attempt {i + 1} via proxy {proxy_info['proxy']} successful")

        except Exception as e:
            print(f"Error with proxy {proxy_info['proxy']}: {e}")

        finally:
            browser.quit()

        # Pause between attempts
        time.sleep(random.uniform(2, 5))

# Run CAPTCHA bypass on the target site
url = "https://example.com"  # Replace with the target site
bypass_captcha(url)

This code integrates proxy rotation with headless Chrome settings for bypassing CAPTCHA, simulating natural browsing behavior.

Proxy Rotation with Authentication

The code uses a list of proxies with login credentials. Proxy rotation via itertools.cycle allows sequential use of proxies, creating the illusion of new connections and complicating bot detection.

Headless Chrome Configuration

Setting chrome_options.headless = True enables headless mode, with additional settings to hide signs of automation such as --disable-blink-features=AutomationControlled. The excludeSwitches and useAutomationExtension options remove visible automation indicators. Adding a script that clears navigator.webdriver makes the browser less detectable as headless, similar to the stealth plugin in Puppeteer.
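
A quick way to verify that the patch took effect is to read the property back from the page. The sketch below assumes the create_browser function from the example above:

# Assumes the create_browser() function from the example above
browser = create_browser()
browser.get("https://example.com")

# Prints None (i.e. undefined) if the navigator.webdriver patch was applied
print(browser.execute_script("return navigator.webdriver"))
browser.quit()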

Simulating User Behavior

The mimic_user_behavior function performs random scrolling to create an illusion of user interaction with the content, reducing the risk of being flagged by anti-bot systems.

Basic CAPTCHA Handling

The bypass_captcha function launches a browser session with proxy rotation, simulating page access and user behavior. Checking for elements (find_element) helps verify successful access.
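
In practice it also helps to check explicitly whether a CAPTCHA widget has appeared rather than inferring it from a missing heading. Here is a minimal sketch that assumes the standard reCAPTCHA iframe markup and the browser object from the example above:

from selenium.webdriver.common.by import By

def captcha_present(browser):
    # reCAPTCHA V2 is normally embedded as an iframe pointing at the recaptcha domain
    return len(browser.find_elements(By.CSS_SELECTOR, "iframe[src*='recaptcha']")) > 0

# Example use inside bypass_captcha():
# if captcha_present(browser):
#     ...  # switch proxy or hand the CAPTCHA off to a solving service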

Pros and Cons

Pros:

  • Relatively low cost and straightforward setup for proxy rotation to avoid bans.

Cons:

  • Less effective when parsing highly protected sites, particularly search engines with complex anti-bot algorithms.

My bot when the site ramps up security: "Who, me? I was just passing by."

CAPTCHA Bypass Using More Complex Methods: Machine Learning

For those aiming to cut CAPTCHA bypass costs to zero, machine learning (training your own model on hundreds of CAPTCHA samples) is an option, or you can use ready-made solutions like Tesseract.

 Bot: "I can bypass any CAPTCHA!"Reality: "You can't even get past reCAPTCHA v2"

Bot: «I can bypass any CAPTCHA!»
Reality: «You can’t even get past reCAPTCHA v2»

In both cases, solving complex CAPTCHAs won't work; this approach is suitable only for text-recognition CAPTCHAs or image/receipt classification.

Where image classification might be needed in SEO is uncertain (feel free to come up with your own examples), but this approach is viable for simple CAPTCHAs.

It’s free, and it’s effective.
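
For a plain text CAPTCHA, a minimal sketch built on pytesseract might look like the following; the file name, the binarization threshold, and the --psm mode are assumptions, and real CAPTCHAs usually need their own preprocessing:

from PIL import Image
import pytesseract  # requires the Tesseract binary to be installed

# Load the CAPTCHA image and convert it to grayscale
image = Image.open("captcha.png").convert("L")

# Simple binarization to strip background noise (threshold is an assumption, tune per CAPTCHA)
image = image.point(lambda px: 255 if px > 140 else 0)

# Recognize the text, treating the image as a single line (--psm 7)
text = pytesseract.image_to_string(image, config="--psm 7")
print("Recognized text:", text.strip())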

Beyond a basic sketch like the one above, there is no universal code applicable to the different variants; it's a rather niche topic, and each CAPTCHA type needs its own preprocessing.

Advantages — free and effective

Disadvantages — not suitable for complex CAPTCHAs


Conclusion

In conclusion, if you have some time and the desire to dig into the code, a combination of methods 1, 2, and 3 is your best option. If you prefer maximum simplicity, look for services that provide tools for the job. However, I won’t recommend any specific services because, truthfully, this is a topic for an entirely separate article!

Wishing you CAPTCHA-free access!


Link to the original article: https://habr.com/ru/articles/856126/

