{"id":430927,"date":"2024-09-02T21:00:03","date_gmt":"2024-09-02T21:00:03","guid":{"rendered":"http:\/\/savepearlharbor.com\/?p=430927"},"modified":"-0001-11-30T00:00:00","modified_gmt":"-0001-11-29T21:00:00","slug":"","status":"publish","type":"post","link":"https:\/\/savepearlharbor.com\/?p=430927","title":{"rendered":"<span>Amazon parsing on easy level and all by yourself<\/span>"},"content":{"rendered":"<div><!--[--><!--]--><\/div>\n<div id=\"post-content-body\">\n<div>\n<div class=\"article-formatted-body article-formatted-body article-formatted-body_version-2\">\n<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<p>I came across a script on the Internet that allows you to parse product cards from Amazon. And I just needed a solution to a problem like that.<\/p>\n<p>I wracked my brain while looking for a way to parse product cards from Amazon. The problem is that Amazon uses different design options for different outputs, in particular \u2013 if you need to parse the cards with the search query &#171;bags&#187; \u2013 the cards will be arranged vertically, as I need it, but if you take, for example, &#171;t-shirts&#187; \u2013 then the cards will be arranged horizontally, and in such way the script falls into an error, it works out opening the page, but does not want to scroll.<\/p>\n<figure class=\"full-width\"><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/habrastorage.org\/r\/w780q1\/getpro\/habr\/upload_files\/216\/ae6\/c05\/216ae6c050d634b34aabcf364042f0d7.jpg\" width=\"1000\" height=\"500\" data-src=\"https:\/\/habrastorage.org\/getpro\/habr\/upload_files\/216\/ae6\/c05\/216ae6c050d634b34aabcf364042f0d7.jpg\" data-blurred=\"true\"\/><\/figure>\n<p>Moreover, after reading various articles where users are puzzling over <a href=\"https:\/\/2captcha.com\/api-docs\/recaptcha-v3\" rel=\"noopener noreferrer nofollow\">how to bypass captcha<\/a> on Amazon, I upgraded the script and now it can bypass the captcha if it occurs (it works with 2captcha). 
The script checks for a captcha after each new page load; if one appears, it sends a request to the 2captcha server and, after receiving the solution, substitutes it into the page and continues working.<\/p>\n<p>However, bypassing the captcha is not the hardest part, since nowadays that is a trivial task. The more pressing question is how to make the script work not only with the vertical arrangement of product cards, but also with the horizontal one.<\/p>\n<p>Below I will describe in detail what the script consists of and demonstrate how it works. If you know what to add or change so that it also handles the horizontal card layout, I will be grateful for your help.<\/p>\n<p>For now, the script may help someone even with its limited functionality.<\/p>\n<p>So, let&#8217;s take the script apart piece by piece!<\/p>\n<h3>Preparation<\/h3>\n<p>First, the script imports the modules needed for the task:<\/p>\n<pre><code class=\"python\">from selenium import webdriver\nfrom selenium.webdriver.common.by import By\nfrom selenium.webdriver.common.keys import Keys\nfrom selenium.webdriver.common.action_chains import ActionChains\nfrom selenium.webdriver.support.ui import WebDriverWait\nfrom selenium.webdriver.support import expected_conditions as EC\nimport csv\nimport os\nfrom time import sleep\nimport requests<\/code><\/pre>\n<p>Let&#8217;s go through it line by line:<\/p>\n<pre><code class=\"python\">from selenium import webdriver<\/code><\/pre>\n<p>This imports <code>webdriver<\/code>, which lets the script control the browser (Firefox in my case).<\/p>\n<pre><code class=\"python\">from selenium.webdriver.common.by import By<\/code><\/pre>\n<p>This imports the <code>By<\/code> class, with which the script locates the elements to parse by XPath (it can also search by other attributes, but here XPath is used).<\/p>\n<pre><code class=\"python\">from selenium.webdriver.common.keys import Keys<\/code><\/pre>\n<p>This imports the <code>Keys<\/code> class, used to simulate keystrokes; in this script it scrolls the page down with <code>Keys.PAGE_DOWN<\/code>.<\/p>\n<pre><code class=\"python\">from selenium.webdriver.common.action_chains import ActionChains<\/code><\/pre>\n<p>This imports the <code>ActionChains<\/code> class for building sequences of actions \u2013 in our case, pressing <code>PAGE_DOWN<\/code> and waiting for the elements on the page to load (Amazon loads cards as the page is scrolled).<\/p>\n<pre><code class=\"python\">from selenium.webdriver.support.ui import WebDriverWait<\/code><\/pre>\n<p>This imports the <code>WebDriverWait<\/code> class, which waits until the information we are looking for \u2013 for example, a product description located by XPath \u2013 has loaded.<\/p>\n<pre><code class=\"python\">from selenium.webdriver.support import expected_conditions as EC<\/code><\/pre>\n<p>This imports the <code>expected_conditions<\/code> module (abbreviated <code>EC<\/code>), which works together with the previous class and tells <code>WebDriverWait<\/code> exactly which condition to wait for.
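<\/p>\n<p>To illustrate the mechanics: <code>WebDriverWait.until<\/code> simply calls a condition function with the driver over and over until it returns something truthy or the timeout expires, and the <code>EC<\/code> helpers are factories that build such condition functions. A small sketch with a stand-in driver object (no real browser needed; the names here are illustrative):<\/p>\n<pre><code class=\"python\">from selenium.webdriver.support.ui import WebDriverWait\n\nclass FakeDriver:\n    # Stand-in \"driver\": WebDriverWait only needs an object to hand to the condition\n    calls = 0\n\ndef element_appears(driver):\n    # An expected condition is just a callable taking the driver;\n    # EC.presence_of_element_located(...) builds one of these for you\n    driver.calls += 1\n    return \"found\" if driver.calls == 3 else False\n\nresult = WebDriverWait(FakeDriver(), timeout=2, poll_frequency=0.01).until(element_appears)\nprint(result)  # \"found\" on the third poll\n<\/code><\/pre>\n<p>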
This increases the reliability of the script: it will not start interacting with content that has not loaded yet.<\/p>\n<pre><code class=\"python\">import csv<\/code><\/pre>\n<p>This imports the <code>csv<\/code> module for working with CSV files.<\/p>\n<pre><code class=\"python\">import os<\/code><\/pre>\n<p>This imports the <code>os<\/code> module for working with the operating system (creating directories, checking whether files exist, and so on).<\/p>\n<pre><code class=\"python\">from time import sleep<\/code><\/pre>\n<p>This imports the <code>sleep<\/code> function, which pauses the script for a given time (2 seconds in my case, but you can set more) so that elements can load while scrolling.<\/p>\n<pre><code class=\"python\">import requests<\/code><\/pre>\n<p>This imports the <code>requests<\/code> library for sending HTTP requests to the 2captcha recognition service.<\/p>\n<h3>Configuration<\/h3>\n<p>Once everything is imported, the script configures the browser. First, it sets the API key for the 2captcha service:<\/p>\n<pre><code class=\"python\"># API key for 2Captcha\nAPI_KEY = \"Your API Key\"<\/code><\/pre>\n<p>The script also sets a user agent for the browser (you can change it, of course), after which the browser is started with the specified settings. Note that Firefox does not understand the Chrome-style <code>user-agent=...<\/code> command-line argument, so the user agent is set through a profile preference instead:<\/p>\n<pre><code class=\"python\">user_agent = \"Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/91.0.4472.124 Safari\/537.36\"\n\noptions = webdriver.FirefoxOptions()\noptions.set_preference(\"general.useragent.override\", user_agent)\n\ndriver = webdriver.Firefox(options=options)<\/code><\/pre>\n<p>Next comes the captcha-solving module. This is exactly the part people are looking for when they search for how to solve a captcha.
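<\/p>\n<p>One caveat worth knowing: the 2captcha endpoints reply with plain text \u2013 <code>OK|&lt;id&gt;<\/code> or <code>OK|&lt;token&gt;<\/code> on success, <code>CAPCHA_NOT_READY<\/code> while a solution is pending, and an <code>ERROR_...<\/code> code on failure. The module below calls <code>.split('|')[1]<\/code> directly, which raises a confusing <code>IndexError<\/code> on an error code such as <code>ERROR_WRONG_USER_KEY<\/code>. A small, hedged helper (the function name is mine) makes the cases explicit:<\/p>\n<pre><code class=\"python\">def parse_2captcha_response(text):\n    # Plain-text protocol of 2captcha's in.php \/ res.php endpoints:\n    #   \"OK|payload\"       - success, payload is the captcha id or the token\n    #   \"CAPTCHA_NOT_READY\"\/\"CAPCHA_NOT_READY\"-style codes - keep polling\n    #   \"ERROR_...\"        - permanent failure, retrying will not help\n    if text == \"CAPCHA_NOT_READY\":\n        return None  # not ready yet, poll again\n    if text.startswith(\"OK|\"):\n        return text.split(\"|\", 1)[1]\n    raise RuntimeError(f\"2captcha error: {text}\")\n\nprint(parse_2captcha_response(\"OK|12345\"))  # 12345\n<\/code><\/pre>\n<p>Back to the script&#8217;s own module. 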
We will not dwell on it for long, since it caused no particular problems.<\/p>\n<p>In short: after each page load the script checks whether a captcha is present. If it finds one, it solves it by sending it to the 2captcha server; if not, it simply continues execution.<\/p>\n<pre><code class=\"python\">def solve_captcha(driver):\n    # Check for the presence of a captcha on the page\n    try:\n        captcha_element = driver.find_element(By.CLASS_NAME, 'g-recaptcha')\n        if captcha_element:\n            print(\"Captcha detected. Solving...\")\n            site_key = captcha_element.get_attribute('data-sitekey')\n            current_url = driver.current_url\n\n            # Send captcha request to 2Captcha\n            captcha_id = requests.post(\n                'http:\/\/2captcha.com\/in.php',\n                data={\n                    'key': API_KEY,\n                    'method': 'userrecaptcha',\n                    'googlekey': site_key,\n                    'pageurl': current_url\n                }\n            ).text.split('|')[1]\n\n            # Wait for the captcha to be solved\n            recaptcha_answer = ''\n            while True:\n                sleep(5)\n                response = requests.get(f\"http:\/\/2captcha.com\/res.php?key={API_KEY}&amp;action=get&amp;id={captcha_id}\")\n                if response.text == 'CAPCHA_NOT_READY':\n                    continue\n                if 'OK|' in response.text:\n                    recaptcha_answer = response.text.split('|')[1]\n                    break\n\n            # Inject the captcha answer into the page\n            driver.execute_script(f'document.getElementById(\"g-recaptcha-response\").innerHTML = \"{recaptcha_answer}\";')\n            driver.find_element(By.ID, 'submit').click()\n            sleep(5)\n            print(\"Captcha solved.\")\n    except Exception as e:\n        print(\"No captcha found or error occurred:\", e)<\/code><\/pre>\n<h3>Parsing<\/h3>\n<p>Next comes the section of code responsible for iterating over the pages, loading them, and scrolling them:<\/p>\n<pre><code class=\"python\">try:\n    base_url = \"https:\/\/www.amazon.in\/s?k=bags\"\n\n    for page_number in range(1, 10):\n        page_url = f\"{base_url}&amp;page={page_number}\"\n        driver.get(page_url)\n        driver.implicitly_wait(10)\n\n        solve_captcha(driver)\n\n        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '\/\/span[@class=\"a-size-medium a-color-base a-text-normal\"]')))\n\n        for _ in range(5):\n            ActionChains(driver).send_keys(Keys.PAGE_DOWN).perform()\n            sleep(2)<\/code><\/pre>\n<p>The next piece is the collection of product data \u2013 the most important part. Here the script examines the loaded page and extracts the specified data: the <code>product name<\/code>, <code>number of reviews<\/code>, <code>price<\/code>, <code>URL<\/code>, and <code>product rating<\/code>.<\/p>\n<pre><code class=\"python\">        product_name_elements = driver.find_elements(By.XPATH, '\/\/span[@class=\"a-size-medium a-color-base a-text-normal\"]')\n        rating_number_elements = driver.find_elements(By.XPATH, '\/\/span[@class=\"a-size-base s-underline-text\"]')\n        star_rating_elements = driver.find_elements(By.XPATH, '\/\/span[@class=\"a-icon-alt\"]')\n        price_elements = driver.find_elements(By.XPATH, '\/\/span[@class=\"a-price-whole\"]')\n        product_urls = driver.find_elements(By.XPATH, '\/\/a[@class=\"a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal\"]')\n\n        product_names = [element.text for element in product_name_elements]\n        rating_numbers = [element.text for element in rating_number_elements]\n        star_ratings = [element.get_attribute('innerHTML') for element in star_rating_elements]\n        prices = [element.text for element in price_elements]\n        urls = [element.get_attribute('href') for element in product_urls]<\/code><\/pre>\n<p>Next, the collected data is written to disk: a CSV file is created for each page and saved to the &#171;output files&#187; folder. If the folder does not exist, the script creates it.<\/p>\n<pre><code class=\"python\">        output_directory = \"output files\"\n        if not os.path.exists(output_directory):\n            os.makedirs(output_directory)\n\n        with open(os.path.join(output_directory, f'product_details_page_{page_number}.csv'), 'w', newline='', encoding='utf-8') as csvfile:\n            csv_writer = csv.writer(csvfile)\n            csv_writer.writerow(['Product Urls', 'Product Name', 'Product Price', 'Rating', 'Number of Reviews'])\n            for url, name, price, star_rating, num_ratings in zip(urls, product_names, prices, star_ratings, rating_numbers):\n                csv_writer.writerow([url, name, price, star_rating, num_ratings])<\/code><\/pre>\n<p>The final stage is finishing the work and releasing resources:<\/p>\n<pre><code class=\"python\">finally:\n    driver.quit()<\/code><\/pre>\n<p>The full script:<\/p>\n<pre><code class=\"python\">from selenium import webdriver\nfrom selenium.webdriver.common.by import By\nfrom selenium.webdriver.common.keys import Keys\nfrom selenium.webdriver.common.action_chains import ActionChains\nfrom selenium.webdriver.support.ui import WebDriverWait\nfrom selenium.webdriver.support import expected_conditions as EC\nimport csv\nimport os\nfrom time import sleep\nimport requests\n\n# API key for 2Captcha\nAPI_KEY = \"Your API Key\"\n\n# Set a custom user agent to mimic a real browser\nuser_agent = \"Mozilla\/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/91.0.4472.124 Safari\/537.36\"\n\noptions = webdriver.FirefoxOptions()\n# Firefox ignores Chrome-style \"user-agent=...\" arguments; use a preference instead\noptions.set_preference(\"general.useragent.override\", user_agent)\n\ndriver = webdriver.Firefox(options=options)\n\ndef solve_captcha(driver):\n    # Check for the presence of a captcha on the page\n    try:\n        captcha_element = driver.find_element(By.CLASS_NAME, 'g-recaptcha')\n        if captcha_element:\n            print(\"Captcha detected. Solving...\")\n            site_key = captcha_element.get_attribute('data-sitekey')\n            current_url = driver.current_url\n\n            # Send captcha request to 2Captcha\n            captcha_id = requests.post(\n                'http:\/\/2captcha.com\/in.php',\n                data={\n                    'key': API_KEY,\n                    'method': 'userrecaptcha',\n                    'googlekey': site_key,\n                    'pageurl': current_url\n                }\n            ).text.split('|')[1]\n\n            # Wait for the captcha to be solved\n            recaptcha_answer = ''\n            while True:\n                sleep(5)\n                response = requests.get(f\"http:\/\/2captcha.com\/res.php?key={API_KEY}&amp;action=get&amp;id={captcha_id}\")\n                if response.text == 'CAPCHA_NOT_READY':\n                    continue\n                if 'OK|' in response.text:\n                    recaptcha_answer = response.text.split('|')[1]\n                    break\n\n            # Inject the captcha answer into the page\n            driver.execute_script(f'document.getElementById(\"g-recaptcha-response\").innerHTML = \"{recaptcha_answer}\";')\n            driver.find_element(By.ID, 'submit').click()\n            sleep(5)\n            print(\"Captcha solved.\")\n    except Exception as e:\n        print(\"No captcha found or error occurred:\", e)\n\ntry:\n    # Starting page URL\n    base_url = \"https:\/\/www.amazon.in\/s?k=bags\"\n\n    for page_number in range(1, 2):  # parses page 1 only; raise the upper bound for more pages\n        page_url = f\"{base_url}&amp;page={page_number}\"\n        driver.get(page_url)\n        driver.implicitly_wait(10)\n\n        # Attempt to solve captcha if detected\n        solve_captcha(driver)\n\n        # Explicit wait\n        WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '\/\/span[@class=\"a-size-medium a-color-base a-text-normal\"]')))\n\n        for _ in range(5):\n            ActionChains(driver).send_keys(Keys.PAGE_DOWN).perform()\n            sleep(2)\n\n        product_name_elements = driver.find_elements(By.XPATH, '\/\/span[@class=\"a-size-medium a-color-base a-text-normal\"]')\n        rating_number_elements = driver.find_elements(By.XPATH, '\/\/span[@class=\"a-size-base s-underline-text\"]')\n        star_rating_elements = driver.find_elements(By.XPATH, '\/\/span[@class=\"a-icon-alt\"]')\n        price_elements = driver.find_elements(By.XPATH, '\/\/span[@class=\"a-price-whole\"]')\n        product_urls = driver.find_elements(By.XPATH, '\/\/a[@class=\"a-link-normal s-underline-text s-underline-link-text s-link-style a-text-normal\"]')\n\n        # Extract the text content of each product name, number of ratings, star rating and url\n        product_names = [element.text for element in product_name_elements]\n        rating_numbers = [element.text for element in rating_number_elements]\n        star_ratings = [element.get_attribute('innerHTML') for element in star_rating_elements]\n        prices = [element.text for element in price_elements]\n        urls = [element.get_attribute('href') for element in product_urls]\n\n        sleep(5)\n\n        output_directory = \"output files\"\n        if not os.path.exists(output_directory):\n            os.makedirs(output_directory)\n\n        with open(os.path.join(output_directory, f'product_details_page_{page_number}.csv'), 'w', newline='', encoding='utf-8') as csvfile:\n            csv_writer = csv.writer(csvfile)\n            csv_writer.writerow(['Product Urls', 'Product Name', 'Product Price', 'Rating', 'Number of Reviews'])\n            for url, name, price, star_rating, num_ratings in zip(urls, product_names, prices, star_ratings, rating_numbers):\n                csv_writer.writerow([url, name, price, star_rating, num_ratings])\n\nfinally:\n    
driver.quit()<\/code><\/pre>\n<p>As it stands, the script runs without errors, but only for vertically arranged product cards.<\/p>\n<p>I will be glad to discuss it in the comments if you have something to say about it.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<p>Link to the original article: <a href=\"https:\/\/habr.com\/ru\/articles\/840208\/\"> https:\/\/habr.com\/ru\/articles\/840208\/<\/a><\/p>\n","protected":false},"excerpt":{"rendered":"<div><!--[--><!--]--><\/div>\n<div id=\"post-content-body\">\n<div>\n<div class=\"article-formatted-body article-formatted-body article-formatted-body_version-2\">\n<div xmlns=\"http:\/\/www.w3.org\/1999\/xhtml\">\n<p>I came across a script on the Internet that allows you to parse product cards from Amazon. And I just needed a solution to a problem like that.<\/p>\n<p>I wracked my brain while looking for a way to parse product cards from Amazon. 
The problem is that Amazon uses different card layouts for different search results, and the script currently handles only the vertical layout.<\/p>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[],"tags":[],"class_list":["post-430927","post","type-post","status-publish","format-standard","hentry"],"_links":{"self":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/430927","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=430927"}],"version-history":[{"count":0,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=\/wp\/v2\/posts\/430927\/revisions"}],"wp:attachment":[{"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=430927"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=430927"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/savepearlharbor.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=430927"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}