CAPTCHA Bypass in SEO: What It Is, and Is Its Importance Overrated?
Every SEO professional has encountered CAPTCHA. If not, they are either not professionals, misunderstand the term SEO (possibly confusing it with SMM or CEO), or are very new to this challenging field.
One could argue endlessly that CAPTCHA is overrated and doesn't deserve significant resources. But those arguments end the moment you need data from a search engine results page, say Yandex, without knowing how its XML requests work… Or, say, a client wants to scrape the entire Amazon (just because) and offers good pay…
In short — it might not always be needed, but it’s better to be prepared, just in case. This is not akin to apocalypse prep; I’d liken it to swapping summer tires for winter ones in the South. Theoretically, the South has above-freezing temperatures year-round, so you could skip the swap. And during those few snowy days, you could just stay home. But what if you suddenly need to leave the region? And your tires are still summer ones… This is similar to having CAPTCHA bypass skills. You never know when they might come in handy.
Why CAPTCHA is Used Despite Available Bypass Methods
In reality, the situation is more nuanced than it might seem. Securing a site against data scraping can be challenging, especially for non-commercial projects or small "hamster sites." Most owners lack the time or desire to allocate resources to some vague CAPTCHA. It's a different story if you own a major portal generating millions in profit; then protection is worth considering. Even basic defense against unscrupulous competitors who might start DDoSing your site is a valid reason.
Take Amazon as an example: they know a thing or two about elaborate protection methods. Not only do they run three types of CAPTCHA on their site, each appearing in different scenarios, but they also randomly change the design so automation tools and scrapers can't reuse outdated methods. The design changes, the scripts change (and judging by the number of people trying to scrape Amazon, we're talking hundreds or thousands of scripts, not one), and the cycle repeats.
And what about smaller webmasters? They aren't fools; they realize that a tricky CAPTCHA creates additional hurdles for genuine visitors. The harder the CAPTCHA, the higher the chance a user leaves for a competitor in the search results. So modern site owners try not to overdo CAPTCHA and look for a balance.
Leaving a site completely unprotected is unwise — low-level bots that lack CAPTCHA bypass skills but can still execute mass actions will swarm the site. Hence, they opt for universal solutions like reCAPTCHA or hCaptcha. Mission accomplished (meaning, the site is protected) without overly stressing users.
More complex CAPTCHAs come into play when bots start overrunning the site, but that’s another article’s topic.
In conclusion, leaving a site without protection is poor form, and using a middle-ground solution is entirely acceptable. It won’t deter those targeting the content but will discourage less sophisticated bots.
Why SEO Specialists Need CAPTCHA Bypass
Now, let’s consider it from the SEO specialist’s point of view: why and for what purpose would they need to bypass CAPTCHA?
CAPTCHA bypass may be necessary for the most basic task — analyzing search engine rankings. Yes, there are third-party services for this, which charge for daily monitoring, and aside from paying the service fee, you would also need to pay an external CAPTCHA recognition service.
When researching competitor sites, CAPTCHA is less of an obstacle: bypassing it on a regular site is easier than during rank collection (the protection levels differ somewhat).
For automating routine tasks — this is a niche area. Not everyone uses it, but for committed SEO specialists, it can be useful.
In general, it’s essential to consider the economics — is it cheaper to pay a ranking monitoring service and a CAPTCHA recognition service, or to create your solution and reduce costs? Naturally, if you only have 1–2 projects, the second option sounds cumbersome, and if the client pays for everything, calculating costs is pointless. But if you own multiple projects and foot the bill yourself… You need to think…
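As a rough sketch of that calculation (all prices here are hypothetical placeholders, not quotes from any real service), the break-even point depends mostly on how many projects you run:

```python
# Rough break-even estimate: third-party services vs. a self-built solution.
# All figures are hypothetical assumptions for illustration only.

def monthly_service_cost(projects, rank_service_fee=20.0, captcha_fee=5.0):
    """Cost of paying a rank tracker plus a CAPTCHA-solving service, per month."""
    return projects * (rank_service_fee + captcha_fee)

def monthly_diy_cost(proxy_cost=30.0, dev_hours=10, hourly_rate=25.0, months_amortized=12):
    """Cost of a DIY setup: proxies plus development time amortized over a year."""
    return proxy_cost + (dev_hours * hourly_rate) / months_amortized

for n in (1, 2, 5, 10):
    svc, diy = monthly_service_cost(n), monthly_diy_cost()
    print(f"{n} project(s): services ${svc:.2f}/mo vs DIY ${diy:.2f}/mo -> "
          f"{'DIY wins' if diy < svc else 'services win'}")
```

With these made-up numbers, services win at one or two projects and a DIY setup wins from roughly five onward; plug in your own figures before deciding.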
Most SEO specialists resort to CAPTCHA bypass for Google or Yandex work, using ready-made solutions like CAPTCHA recognition services or API keys. But some specialists are not inclined to outsource everything; they have their own solutions for routine tasks.
Primary CAPTCHA Bypass Methods
Let's consider methods that take a bit more effort than simply adding an API key in Key Collector: they require deeper knowledge than finding the key on a service's homepage and pasting it into the right field.
Third-Party CAPTCHA Recognition Services
The most popular method is sending CAPTCHA to a specialized service (2captcha, ruCaptcha, etc.), which returns a ready solution. These services charge only for solved CAPTCHAs.
Everything here can be automated, as I learned through days of struggling with various tasks. For instance, finding the "sitekey" on a page once by hand is manageable, but with multi-threading it's unrealistic. And so the code starts expanding: a module gets written to locate the required parameter on the page.
We move on: the parameter is found, sent to the third-party service, and we get the result. But something still needs to be done with the result… Insert it into the correct field… Code expansion again.
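As a sketch of such a module, assuming the standard reCAPTCHA widget markup: the sitekey usually sits in a data-sitekey attribute, so a regex over the fetched HTML often suffices (pages that render the widget via JavaScript would need a real browser instead). The sample HTML and key below are hypothetical:

```python
import re

def extract_sitekey(html):
    """Pull the reCAPTCHA sitekey out of raw HTML, if present.

    Looks for the data-sitekey="..." attribute carried by the
    standard reCAPTCHA widget markup. Returns None when not found.
    """
    match = re.search(r'data-sitekey=["\']([\w-]+)["\']', html)
    return match.group(1) if match else None

# Example with a hypothetical page fragment:
sample = '<div class="g-recaptcha" data-sitekey="6LdAbCdEfGhIjKlMnOpQrStUvWxYz123456789-_"></div>'
print(extract_sitekey(sample))  # → 6LdAbCdEfGhIjKlMnOpQrStUvWxYz123456789-_
```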
So, the code below won't solve your problem at the snap of a finger: certain actions are still required, at minimum those described above.
Example of Standard (Universal) Code for Solving reCAPTCHA V2 in Python:
```python
import requests
import time

API_KEY = 'YOUR_2CAPTCHA_KEY'
SITE_KEY = 'YOUR_SITE_KEY'
PAGE_URL = 'https://example.com'

def get_captcha_solution():
    # Submit the CAPTCHA parameters to the 2captcha API
    captcha_id_response = requests.post("http://2captcha.com/in.php", data={
        'key': API_KEY,
        'method': 'userrecaptcha',
        'googlekey': SITE_KEY,
        'pageurl': PAGE_URL,
        'json': 1
    }).json()

    if captcha_id_response['status'] != 1:
        print(f"Error: {captcha_id_response['request']}")
        return None

    captcha_id = captcha_id_response['request']
    print(f"CAPTCHA sent. ID: {captcha_id}")

    # Poll for the solution (up to 30 attempts, 5 seconds apart)
    for attempt in range(30):
        time.sleep(5)
        result = requests.get("http://2captcha.com/res.php", params={
            'key': API_KEY,
            'action': 'get',
            'id': captcha_id,
            'json': 1
        }).json()

        if result['status'] == 1:
            print(f"CAPTCHA solved: {result['request']}")
            return result['request']
        elif result['request'] == 'CAPCHA_NOT_READY':
            print(f"Waiting for solution... attempt {attempt + 1}/30")
        else:
            print(f"Error: {result['request']}")
            return None

    return None

captcha_solution = get_captcha_solution()
if captcha_solution:
    print('CAPTCHA solution:', captcha_solution)
else:
    print('Solution failed.')
```
In this example, you need to find one parameter on the page (in the code, it’s called SITE_KEY), which is usually labeled “sitekey” on the CAPTCHA page. This parameter, along with the CAPTCHA URL and 2captcha service API key, is sent to the server, and after receiving the solved CAPTCHA token, you must insert it into the CAPTCHA field.
Next, you can resume actions paused due to CAPTCHA.
In any CAPTCHA that uses a token, the code will be similar; the differences lie in the set of parameters sent, and some (like Amazon's CAPTCHA) also impose a maximum solving time.
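What "inserting the token into the CAPTCHA field" means in practice varies by site, but for a plain HTML form the token goes into the hidden g-recaptcha-response field submitted along with the form. A minimal sketch of assembling such a payload (the other field names here are hypothetical):

```python
def build_form_payload(form_fields, captcha_token):
    """Merge the page's form fields with the solved CAPTCHA token.

    reCAPTCHA expects the token in the hidden 'g-recaptcha-response'
    field; some sites also mirror it into their own hidden inputs.
    """
    payload = dict(form_fields)  # copy so the original dict is untouched
    payload["g-recaptcha-response"] = captcha_token
    return payload

# Hypothetical form fields scraped from the page:
fields = {"query": "site:example.com", "csrf_token": "abc123"}
payload = build_form_payload(fields, "03AGdBq27-hypothetical-solved-token")
print(sorted(payload))  # → ['csrf_token', 'g-recaptcha-response', 'query']
```

The resulting dict is what you would pass as `data=` to a `requests.post` call that submits the form.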
Advantages: easy to set up, fast recognition.
Disadvantages: paid, since it relies on a third-party service.
CAPTCHA Bypass with Proxy and IP Rotation
The second option to bypass CAPTCHA for SEO tasks is based on using a large number of proxies that rotate either after a set period, upon encountering a CAPTCHA, or upon an error from the resource being accessed.
By rotating proxies, each request to the scraping source appears as a unique user, making actions less suspicious to the source (at least for a while).
To implement this method, you need an IP pool (which can be relatively expensive, sometimes even more costly than the CAPTCHA bypass through a third-party service). You can use rotating residential proxies, which eliminate the need for a large IP pool if the rotation is properly configured. Alternatively, mobile proxies, though more expensive than residential ones, provide higher quality. There is a solution for every budget and requirement.
Here is a code example that won’t work autonomously, but can be integrated into your script (program):
```python
import requests
from itertools import cycle
import time
import urllib.parse

# List of proxies with individual logins and passwords
proxies_list = [
    {"proxy": "2captcha_proxy_1:port", "username": "user1", "password": "pass1"},
    {"proxy": "2captcha_proxy_2:port", "username": "user2", "password": "pass2"},
    {"proxy": "2captcha_proxy_3:port", "username": "user3", "password": "pass3"},
    {"proxy": "2captcha_proxy_4:port", "username": "user4", "password": "pass4"},
    # Add more proxies as needed
]

# Proxy rotation cycle
proxy_pool = cycle(proxies_list)

# Target URL to work with
url = "https://example.com"  # Replace with the desired site

# Headers to simulate a real user
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:129.0) Gecko/20100101 Firefox/129.0"
}

# Sending several requests with proxy rotation
for i in range(5):  # Specify the desired number of requests
    proxy_info = next(proxy_pool)  # Select the next proxy
    proxy = proxy_info["proxy"]
    username = urllib.parse.quote(proxy_info["username"])
    password = urllib.parse.quote(proxy_info["password"])

    # Create a proxy URL with authorization
    proxy_with_auth = f"http://{username}:{password}@{proxy}"

    try:
        response = requests.get(
            url,
            headers=headers,
            proxies={"http": proxy_with_auth, "https": proxy_with_auth},
            timeout=10
        )
        # Check response status
        if response.status_code == 200:
            print(f"Request {i + 1} via proxy {proxy} successful. Status: {response.status_code}")
        else:
            print(f"Request {i + 1} via proxy {proxy} ended with status code {response.status_code}")
    except requests.exceptions.RequestException as e:
        print(f"Error with proxy {proxy}: {e}")

    # Delay between requests for natural behavior
    time.sleep(2)  # Adjust delay based on requirements
```
In this example, rotating residential proxies are used, which can be configured for 2captcha.
A good example of using this method is on LinkedIn, where it’s configured so you can log in three times from a single IP without a CAPTCHA, but on the fourth attempt, a CAPTCHA appears. Knowing this, you can set up proxy rotation and enjoy circumventing the system… at least until it’s patched.
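That observation translates into a simple rule: rotate the proxy just before the observed per-IP limit is reached. A minimal sketch of such a limit-aware rotator (the limit of three mirrors the behavior described above and can change at any time):

```python
from itertools import cycle

class LimitAwareRotator:
    """Hand out the same proxy until a per-IP request limit is reached,
    then switch to the next proxy in the pool."""

    def __init__(self, proxies, requests_per_ip=3):
        self._pool = cycle(proxies)
        self._limit = requests_per_ip
        self._used = 0
        self._current = next(self._pool)

    def get_proxy(self):
        if self._used >= self._limit:        # limit hit: move to the next IP
            self._current = next(self._pool)
            self._used = 0
        self._used += 1
        return self._current

rotator = LimitAwareRotator(["ip1:8080", "ip2:8080"], requests_per_ip=3)
print([rotator.get_proxy() for _ in range(7)])
# → ['ip1:8080', 'ip1:8080', 'ip1:8080', 'ip2:8080', 'ip2:8080', 'ip2:8080', 'ip1:8080']
```

Each request then asks the rotator for its proxy instead of cycling blindly on every call.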
Pros: does not require integration with third-party services if you have your own IP pool.
Cons: less stable than the first method, and high-quality proxies can cost more than a CAPTCHA-solving service.
CAPTCHA Bypass Using Headless Browsers
Of course, using proxy rotation alone without additional tactics is insufficient. Simple IP switching is no longer enough to prevent CAPTCHA in modern conditions, so the third method naturally follows the second and combines them.
Using headless browsers is not some new gimmick. They have long been used to simulate user actions so that bots look like real people. In other words, you go to great lengths to appear as a live person consuming content on the page while solving your SEO tasks along the way. When detected, you switch proxies and start over.
This clever method, given an adequate IP pool, allows scraping of large projects.
Another interesting aspect is that now you must not only make the bot look like a person but also hide the fact that you are using a headless browser. It’s like a spy game, rather than an advanced SEO task solution.
Below is a sample code for working with a headless browser:
```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
import time
import random
from itertools import cycle

# List of proxies with login and password
proxies_list = [
    {"proxy": "proxy1.example.com:8080", "username": "user1", "password": "pass1"},
    {"proxy": "proxy2.example.com:8080", "username": "user2", "password": "pass2"},
    {"proxy": "proxy3.example.com:8080", "username": "user3", "password": "pass3"},
    # Add more proxies as needed
]

# Proxy rotation cycle
proxy_pool = cycle(proxies_list)

# Settings for the headless browser
def create_browser(proxy=None):
    chrome_options = Options()
    chrome_options.headless = True  # Enable headless mode (newer Selenium: add_argument("--headless=new"))
    chrome_options.add_argument("--disable-blink-features=AutomationControlled")  # Hide automation flag

    # Set up proxy
    if proxy:
        chrome_options.add_argument(f'--proxy-server=http://{proxy["proxy"]}')

    # Additional arguments to hide headless mode
    chrome_options.add_argument("start-maximized")
    chrome_options.add_argument("disable-infobars")
    chrome_options.add_experimental_option("excludeSwitches", ["enable-automation"])
    chrome_options.add_experimental_option("useAutomationExtension", False)

    # Initialize the Chrome driver
    browser = webdriver.Chrome(options=chrome_options)
    # Clear navigator.webdriver so the browser is less detectable as automated
    browser.execute_cdp_cmd("Page.addScriptToEvaluateOnNewDocument", {
        "source": """
            Object.defineProperty(navigator, 'webdriver', {
                get: () => undefined
            })
        """
    })
    return browser

# Simulate user behavior
def mimic_user_behavior(browser):
    actions = [
        lambda: browser.execute_script("window.scrollBy(0, 300);"),   # Scroll down
        lambda: browser.execute_script("window.scrollBy(0, -300);"),  # Scroll up
        # Random scroll distance (computed in Python, then passed into the page's JS)
        lambda: browser.execute_script(f"window.scrollBy(0, {random.randint(0, 500)});"),
    ]
    random.choice(actions)()          # Choose a random action
    time.sleep(random.uniform(1, 3))  # Random delay

# Main function to bypass CAPTCHA
def bypass_captcha(url, num_attempts=5):
    for i in range(num_attempts):
        proxy_info = next(proxy_pool)         # Get a new proxy
        browser = create_browser(proxy_info)  # Start browser with proxy
        try:
            # Go to the site
            browser.get(url)

            # Simulate user actions on the site
            mimic_user_behavior(browser)

            # Example check for an element on the page
            try:
                element = browser.find_element(By.XPATH, "//h1")
                print(f"Element found: {element.text}")
            except Exception:
                print("Element not found, maybe CAPTCHA")

            print(f"Attempt {i + 1} via proxy {proxy_info['proxy']} successful")
        except Exception as e:
            print(f"Error with proxy {proxy_info['proxy']}: {e}")
        finally:
            browser.quit()

        # Pause between attempts
        time.sleep(random.uniform(2, 5))

# Run CAPTCHA bypass on the target site
url = "https://example.com"  # Replace with the target site
bypass_captcha(url)
```
This code integrates proxy rotation with headless Chrome settings for bypassing CAPTCHA, simulating natural browsing behavior.
Proxy Rotation with Authentication
The code uses a list of proxies with login credentials. Proxy rotation via itertools.cycle allows sequential use of proxies, creating the illusion of new connections and complicating bot detection.
Headless Chrome Configuration
Setting chrome_options.headless = True enables headless mode (newer Selenium releases prefer chrome_options.add_argument("--headless=new")), while options such as --disable-blink-features=AutomationControlled hide signs of automation. The excludeSwitches and useAutomationExtension options remove visible automation indicators, and the injected script that clears navigator.webdriver makes the browser less detectable as headless, similar to the stealth plugin in Puppeteer.
Simulating User Behavior
The mimic_user_behavior function performs random scrolling to create an illusion of user interaction with the content, reducing the risk of being flagged by anti-bot systems.
Basic CAPTCHA Handling
The bypass_captcha function launches a browser session with proxy rotation, simulating page access and user behavior. Checking for elements (find_element) helps verify successful access.
Pros and Cons
Pros:
- Relatively low cost and straightforward setup for proxy rotation to avoid bans.
Cons:
- Less effective when scraping highly protected sites, particularly search engines with complex anti-bot algorithms.
CAPTCHA Bypass Using More Complex Methods: Machine Learning
For those aiming to minimize CAPTCHA bypass costs entirely, machine learning (training your system on hundreds of CAPTCHAs) is an option, or you can use ready solutions like Tesseract.
In both cases, such a system won't crack complex CAPTCHAs; this approach suits only text-recognition CAPTCHAs or simple image-classification tasks.
Where image classification might be needed in SEO is uncertain (feel free to come up with your own examples), but this approach is viable for simple CAPTCHAs.
It’s free, and it’s effective.
There's no code example here, as it's a rather niche topic, and there is no universal code that fits the different variants.
Advantages: free and effective.
Disadvantages: not suitable for complex CAPTCHAs.
Conclusion
In conclusion, if you have some time and the desire to dig into the code, a combination of methods 1, 2, and 3 is your best option. If you prefer maximum simplicity, look for services that provide tools for the job. However, I won’t recommend any specific services because, truthfully, this is a topic for an entirely separate article!
Wishing you CAPTCHA-free access!
Link to the original article: https://habr.com/ru/articles/856126/