
Web scraping at scale inevitably runs into anti-bot defenses, with IP bans being a primary obstacle. Puppeteer, despite its power, exposes a distinctive browser fingerprint that modern anti-bot systems are designed to detect. Merely rotating IP addresses without proper authentication and accompanying fingerprint spoofing is a recipe for swift blocking. This guide delivers a complete walkthrough for implementing secure proxy routing via both HTTP/S and SOCKS proxies, and seamlessly integrating them into your automation workflows. You'll gain secure configuration templates, practical code samples, and critical best practices to achieve reliable, uninterrupted data extraction.
Websites defend against bots using rate limiting and IP tracking. Because headless browsers leave unique fingerprints, simply rotating IPs isn't sufficient—without comprehensive identity management, your scraper will still get blocked.
Proxies solve the network layer of this problem. A proxy server intermediates requests, distributing them across multiple IP addresses to avoid rate limits and bans. However, IP rotation must always be paired with fingerprint spoofing to truly mimic organic traffic.
There are two primary proxy protocols used in web scraping:

- HTTP/S proxies (e.g., Squid), which understand web traffic and can cache responses or manipulate headers.
- SOCKS proxies (e.g., Dante), which tunnel arbitrary TCP connections without inspecting them.

Both are covered in detail below.
Without a proxy, every request comes from your machine's IP, quickly triggering blocks. With a properly configured proxy pool, requests originate from diverse IPs, enabling sustained data extraction.
Request flow through a proxy:
Puppeteer → Proxy Server → Target Website (the response follows the reverse path).
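The proxy-pool idea above can be sketched as a simple round-robin selector. This is a minimal illustration, not a library API; the class name and proxy addresses are placeholders:

```javascript
// Minimal round-robin proxy pool: each call hands out the next proxy,
// spreading requests evenly across all exit IPs.
class ProxyPool {
  constructor(proxies) {
    this.proxies = proxies;
    this.index = 0;
  }

  next() {
    const proxy = this.proxies[this.index];
    this.index = (this.index + 1) % this.proxies.length;
    return proxy;
  }
}

// Placeholder proxy addresses (TEST-NET range)
const pool = new ProxyPool([
  'http://203.0.113.10:3128',
  'http://203.0.113.11:3128',
  'http://203.0.113.12:3128',
]);

// Each new browser launch would then use a different exit IP, e.g.:
// puppeteer.launch({ args: [`--proxy-server=${pool.next()}`] })
```

More sophisticated pools weight proxies by observed success rate, but round-robin is a reasonable starting point.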
Let's dive into setting up your own authenticated HTTP/S proxy using Squid.
```bash
sudo apt update
sudo apt install squid apache2-utils -y
```
Create a proxy user with htpasswd (this creates the password file /etc/squid/passwords):

```bash
sudo htpasswd -c /etc/squid/passwords proxyuser
```

Note: The -c flag creates the file; omit it when adding subsequent users.

Next, configure squid.conf. Edit /etc/squid/squid.conf and add the authentication directives (you can also pin upstream resolvers with dns_nameservers):

```
auth_param basic program /usr/lib/squid/basic_ncsa_auth /etc/squid/passwords
auth_param basic realm proxy
acl authenticated proxy_auth REQUIRED
http_access allow authenticated
```
Configure your firewall (ufw) to allow the Squid port:

```bash
sudo ufw allow 3128/tcp
```
Restart Squid to apply the changes:

```bash
sudo systemctl restart squid
```

Test your setup using curl:
```bash
curl -x http://proxyuser:password@your-server-ip:3128 http://example.com
```

Security is critical. Always use strong, unique passwords. For production environments, restrict source IPs via ACLs, and consider tunneling Squid behind an HTTPS reverse proxy (such as Nginx) if you handle sensitive data. This secured HTTP proxy is now ready for your scripts.
| Proxy Port | Authentication Method | Notes |
|---|---|---|
| 3128 | basic_ncsa_auth (htpasswd) | Standard, widely compatible, recommended for Squid. |
| 8080 | None or custom | Alternative port; avoid leaving open without authentication. |
A secure squid.conf enforces both authentication and network-level access controls. Use this baseline for Ubuntu 24.04 to prevent unauthorized relaying.
```
# Squid secure configuration
http_port 3128
cache_dir ufs /var/spool/squid 100 16 256
dns_nameservers 8.8.8.8 8.8.4.4

# Hide client IP for better anonymity
forwarded_for delete
request_header_access Via deny all

# Authentication
auth_param basic program /usr/lib/squid/basic_ncsa_auth /etc/squid/passwords
auth_param basic realm proxy
acl authenticated proxy_auth REQUIRED

# Access Control Lists (ACLs)
acl localnet src 10.0.0.0/8 192.168.0.0/16  # Replace with your scraper's IPs
http_access allow localnet authenticated
http_access deny all
```

Key directives explained:
- `forwarded_for delete`: Strips the X-Forwarded-For header, hiding your scraper's real IP address from the target website.
- `acl localnet`: Restricts incoming connections to your trusted networks.
- `http_access allow localnet authenticated`: Only allows connections that match both the trusted IP range and valid credentials.

While HTTP proxies suffice for standard web scraping, SOCKS proxies like Dante provide greater flexibility for handling lower-level TCP connections.
A SOCKS proxy like Dante tunnels any TCP traffic, making it ideal for non-HTTP protocols and certain scraping edge cases. Unlike HTTP/S proxies, which parse application traffic, SOCKS operates lower in the stack at the session layer (OSI Layer 5), relaying sockets without inspecting their contents.
| Feature | HTTP/S Proxy (Squid) | SOCKS Proxy (Dante) |
|---|---|---|
| OSI Layer | 7 (Application) | 5 (Session) |
| Protocols | HTTP/HTTPS | Any TCP/UDP |
| Authentication | Basic, NTLM | SOCKS5: username/password; SOCKS4: none |
| Typical Use | Web scraping, caching, header manipulation | Raw socket tunneling, IP rotation |
Install Dante server:
```bash
sudo apt update && sudo apt install dante-server -y
```

Configure /etc/danted.conf. Here is a recommended SOCKS5 setup with authentication:
```
logoutput: syslog
user.privileged: root
user.unprivileged: nobody

# The interface and port Dante listens on
internal: 0.0.0.0 port = 1080

# The interface Dante uses for outgoing traffic
external: eth0

method: username

client pass {
    from: 0.0.0.0/0 to: 0.0.0.0/0
    log: connect disconnect error
}

pass {
    from: 0.0.0.0/0 to: 0.0.0.0/0
    protocol: tcp
    log: connect disconnect error
}
```

Note: Replace eth0 with your server's actual external network interface (e.g., ens3 or eth1).
Create a proxy user:
```bash
sudo adduser proxyuser
```

Set a strong password when prompted; Dante authenticates directly against system users.
Allow port 1080 in your firewall:
```bash
sudo ufw allow 1080/tcp
```

Restart and enable the service:
```bash
sudo systemctl restart danted
sudo systemctl enable danted
```

Test the setup:
```bash
curl --socks5-hostname localhost:1080 --socks5-user proxyuser:password http://example.com
```

To route your Puppeteer traffic through an authenticated HTTP/S proxy, pass the proxy server address in the puppeteer.launch arguments and supply the credentials via page.authenticate(). Do not embed credentials directly in the proxy URL string: Chromium ignores them, and you will get a 407 error.
```javascript
const puppeteer = require('puppeteer');

(async () => {
  // 1. Launch browser pointing to the proxy server
  const browser = await puppeteer.launch({
    args: ['--proxy-server=http://proxy.example.com:3128']
  });
  const page = await browser.newPage();

  // 2. Provide the username and password for the proxy
  await page.authenticate({
    username: 'proxyuser',
    password: 'securepassword123'
  });

  // 3. Navigate to the target
  await page.goto('https://api.ipify.org'); // Check your IP

  await browser.close();
})();
```

Note: Chromium does not support username/password authentication for SOCKS5 proxies, so page.authenticate() will not help there. For a SOCKS proxy like Dante, restrict access by source IP instead (client pass { from: YOUR_IP ... }) and disable password auth.

Effective proxy management depends on balancing evasion tactics with computational overhead. Rotating proxies per request maximizes stealth but increases latency; rotating per session reduces load but creates detectable behavior patterns.
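The per-session trade-off can be made concrete with a small rotator that reuses one proxy for a fixed number of requests before moving to the next. This is a sketch; the function name, proxy addresses, and threshold are illustrative:

```javascript
// Per-session rotation: hand out the same proxy for `requestsPerSession`
// calls, then advance to the next proxy in the list.
function makeSessionRotator(proxies, requestsPerSession) {
  let index = 0; // which proxy is currently "in session"
  let used = 0;  // how many requests the current session has consumed

  return function nextProxy() {
    if (used >= requestsPerSession) {
      used = 0;
      index = (index + 1) % proxies.length; // start a new session
    }
    used += 1;
    return proxies[index];
  };
}

// Placeholder proxies (TEST-NET range); five requests per "session"
const nextProxy = makeSessionRotator(
  ['http://198.51.100.1:3128', 'http://198.51.100.2:3128'],
  5
);
```

Setting requestsPerSession to 1 degenerates into per-request rotation; larger values keep a consistent IP per browsing "session" at the cost of more requests per exit IP.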
| Error | Common Cause | Fix |
|---|---|---|
| ERR_PROXY_CONNECTION_FAILED | Proxy is offline or blocked by firewall | Check ufw rules on the proxy server; ensure the service is running. |
| 407 Proxy Auth Required | Invalid credentials or missing page.authenticate() | Do not put credentials in the --proxy-server string; use page.authenticate(). |
| DNS leaks (HTTP proxy) | DNS queries bypass the proxy, revealing your location | Enforce DNS over proxy, or use SOCKS5. |
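One way to avoid the 407 pitfall from the table is to keep the server address and credentials separate from the start. This hypothetical splitProxyUrl helper (not part of Puppeteer's API) splits a combined proxy URL using Node's built-in URL class:

```javascript
// Split "http://user:pass@host:port" into the two pieces Puppeteer needs:
// the bare server for --proxy-server, and credentials for page.authenticate().
// Chromium ignores user:pass embedded in the --proxy-server string.
function splitProxyUrl(proxyUrl) {
  const u = new URL(proxyUrl);
  return {
    server: `${u.protocol}//${u.hostname}:${u.port}`,
    credentials: {
      username: decodeURIComponent(u.username),
      password: decodeURIComponent(u.password),
    },
  };
}

const { server, credentials } = splitProxyUrl(
  'http://proxyuser:securepassword123@proxy.example.com:3128'
);
// server      → pass as `--proxy-server=${server}`
// credentials → pass to page.authenticate(credentials)
```

Using decodeURIComponent also handles passwords containing characters like @ or :, which must be percent-encoded inside a URL.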
Implementing robust retry logic is highly recommended. Proxies drop connections frequently. Use a fallback mechanism to switch proxies if a page fails to load.
```javascript
async function fetchWithProxyFallback(url, proxyList, credentials) {
  for (const proxy of proxyList) {
    let browser;
    try {
      browser = await puppeteer.launch({ args: [`--proxy-server=${proxy}`] });
      const page = await browser.newPage();
      await page.authenticate(credentials);
      await page.goto(url, { waitUntil: 'domcontentloaded' });
      return { page, browser };
    } catch (err) {
      console.warn(`Proxy ${proxy} failed. Retrying...`);
      if (browser) await browser.close();
      if (proxy === proxyList[proxyList.length - 1]) throw err; // Rethrow if the last proxy fails
    }
  }
}
```

Proxy Server Management Cheatsheet
| Action | HTTP/S (Squid) | SOCKS (Dante) |
|---|---|---|
| Install (Ubuntu) | sudo apt install squid apache2-utils | sudo apt install dante-server |
| Create User | sudo htpasswd -c /etc/squid/passwords user | sudo adduser proxyuser |
| Check Status | systemctl status squid | systemctl status danted |
| Test Connection | curl -x http://user:pass@localhost:3128 http://example.com | curl --socks5 localhost:1080 --socks5-user user:pass http://example.com |
Effective web scraping with Puppeteer hinges on sophisticated proxy management. We've covered setting up secure HTTP/S and SOCKS servers, integrating them correctly using page.authenticate(), and navigating Chromium's infamous SOCKS5 limitations. Remember: proxies alone aren't a silver bullet. Combine reliable proxy routing with robust fingerprint spoofing (using libraries like puppeteer-extra-plugin-stealth) to truly bypass modern bot protections. Monitor your success rates, rotate your IP pools frequently, and adapt to evolving countermeasures.