# What are the main types of anti-robot mechanisms?

*Stephen · Mon, 20 May 2024 · Topic: Knowledge base*

You may have read about tricks to defeat scraping detection, such as changing your scraping browser's "User-Agent" setting, however that's only the start of the story.

![anti-robot blocker](/sites/changedetection.io/files/inline-images/image_80.png)

In reality there are multiple ways that a service can detect your scraping attempts; changing the user agent is just one small trick that may or may not help.

It's important to look at the whole session not just from an "am I detected or not?" perspective, but as a **score between 1 and 10**, with 10 being "looks like a robot!" and 1 being "hey, nice browser and pretty human!"

So let's break it down a bit further: what are the main factors?

### Browser fingerprinting

*Anti-robot score importance: **High***

This is a tricky subject, and probably the fastest-evolving part of anti-robot technology, so we'll break it down into a few parts.

##### Browser headers - User-Agent

*Anti-robot score importance: **Medium/Low***

Headers such as "User-Agent" are sent with every request and may identify you as a robot. The problem is that since Chrome 89 there is a **new set of headers** (User-Agent Client Hints) that are sent as well:

https://developer.chrome.com/docs/privacy-security/user-agent-client-hints

Browsers also emit headers such as `Sec-CH-UA`, not just the plain old-school "User-Agent" header.

For example:

```
Sec-CH-UA: "Chromium";v="93", "Google Chrome";v="93", " Not;A Brand";v="99"
Sec-CH-UA-Mobile: ?0
Sec-CH-UA-Platform: "macOS"
```

Fortunately these can be turned off with the Chrome flag `--disable-features=UserAgentClientHint`.

So overriding only the "User-Agent" header without disabling Client Hints increases the chance of being detected as a robot, because the two will contradict each other - but so does disabling the feature altogether, since a Chrome that sends no Client Hints at all also stands out.
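If you drive the page with a real browser, you can combine both tweaks. Below is a minimal sketch using Playwright for Python - purely an illustration of launching Chromium with the Client Hints feature disabled and a custom User-Agent, not how changedetection.io is wired internally; the User-Agent string and URL are placeholders.

```python
# Sketch only: launch Chromium with Client Hints disabled and a custom User-Agent.
# Requires `pip install playwright` and `playwright install chromium`.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        # Stop Chromium sending the Sec-CH-UA* headers described above
        args=["--disable-features=UserAgentClientHint"],
    )
    context = browser.new_context(
        # Placeholder UA string - pick one that matches the Chromium build you actually run
        user_agent="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/120.0.0.0 Safari/537.36",
    )
    page = context.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```

Keep in mind that the flag and the header need to agree with the rest of the browser's behaviour - claiming one Chrome version in the User-Agent while the browser otherwise behaves like another is exactly the kind of mismatch fingerprinting scripts look for.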
##### Browser TLS/SSL/TCP-IP fingerprinting

*Anti-robot score importance: **Medium/High***

There are also a few methods for fingerprinting the connection "pattern" that the browser makes when performing the SSL/TLS handshake with the webserver; JA3, and more recently JA4+, are just two of the many libraries/methods out there that perform this.

Because the browser is compiled against a certain set and version of SSL libraries, the connection itself can be fingerprinted before a single page request is even made.

More here: https://github.com/salesforce/ja3

> *"TLS and its predecessor, SSL, I will refer to both as "SSL" for simplicity, are used to encrypt communication for both common applications, to keep your data secure, and malware, so it can hide in the noise. To initiate a SSL session, a client will send a SSL Client Hello packet following the TCP 3-way handshake. This packet and the way in which it is generated is dependant on packages and methods used when building the client application. The server, if accepting SSL connections, will respond with a SSL Server Hello packet that is formulated based on server-side libraries and configurations as well as details in the Client Hello. Because SSL negotiations are transmitted in the clear, it's possible to fingerprint and identify client applications using the details in the SSL Client Hello packet."*
>
> *"With JA3S it is possible to fingerprint the entire cryptographic negotiation between client and its server by combining JA3 + JA3S. That is because servers will respond to different clients differently but will always respond to the same client the same."*
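To make the idea concrete, here is a rough Python sketch of how a JA3 string is assembled. The field values below are invented for illustration (a real implementation parses them out of the raw Client Hello and strips GREASE values first); the point is simply that two clients built from the same TLS stack produce the same hash, no matter what User-Agent they claim.

```python
import hashlib

def ja3(tls_version, ciphers, extensions, curves, point_formats):
    """Build the JA3 string (decimal values, '-' within a field, ',' between
    fields) from already-parsed Client Hello values, then MD5-hash it."""
    ja3_string = ",".join([
        str(tls_version),
        "-".join(map(str, ciphers)),
        "-".join(map(str, extensions)),
        "-".join(map(str, curves)),
        "-".join(map(str, point_formats)),
    ])
    return ja3_string, hashlib.md5(ja3_string.encode()).hexdigest()

# Invented example values, purely for illustration
ja3_string, ja3_hash = ja3(771, [4865, 4866, 49195], [0, 11, 10, 35, 16], [29, 23, 24], [0])
print(ja3_string)  # 771,4865-4866-49195,0-11-10-35-16,29-23-24,0
print(ja3_hash)    # the 32-character fingerprint a server-side filter can match on
```

Because stock Puppeteer/Playwright installs all bundle essentially the same Chromium build, they tend to collapse onto a handful of well-known hashes, which is why this signal is hard to shake off.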
We're not aware of an easy work-around for this one; a specialist proxy service (or something similar that terminates the TLS connection for you) may help. It especially affects Puppeteer-type browsers, which all ship with very similar SSL/Chrome libraries.

##### Browser fingerprinting in general

*Anti-robot score importance: **Medium***

There are many technologies here. For example, some anti-robot services will attempt to detect the actual hardware you're running on and compare it against a database of known configurations; if it looks "weird", your robot rating goes up.

### The IP address that you're calling from

*Anti-robot score importance: **Medium***

The *IP address* plays a fairly important role. Many services such as Cloudflare protect their customers by rating the IP address you are calling from: if you are calling from a cheap data-centre proxy, for example (the most common kind of paid proxy), they already know those IP ranges and will rate it more likely that you're a robot.

Use quality residential proxies, preferably via a "SOCKS" connection so that it does not give away to the remote end that you're using a proxy at all (a short sketch of wiring one up follows at the end of this article).

### Scraping behaviour in general

*Anti-robot score importance: **Low***

With computing power now cheap and plentiful this is not as important as it used to be, but many services will still block you if you scrape too many pages concurrently or too fast sequentially. Always be a "good internet scraper" and limit your scraping rate (see the throttling sketch at the end of this article).

### Deny-at-start and allow-later

*Anti-robot score importance: **Hard to say***

Some services assume EVERYONE is a robot and block you at the very start, then use fingerprinting to identify you going forward. This is usually some kind of CAPTCHA or similar challenge; once you pass it, the browser is fingerprinted and you're allowed in from then on - which is why reusing the same browser session afterwards matters (see the last sketch below).
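Picking up the proxy point from the IP address section: most headless browsers can be pointed at a SOCKS5 proxy directly. A minimal Playwright sketch, assuming you already have a residential SOCKS5 endpoint (the hostname and port below are placeholders):

```python
# Sketch only: route the headless browser through a residential SOCKS5 proxy.
# The hostname and port are placeholders; note that Chromium generally expects
# SOCKS endpoints to authenticate by allow-listed IP rather than username/password.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(
        headless=True,
        proxy={"server": "socks5://proxy.example.net:1080"},
    )
    page = browser.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```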
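And for the scraping-behaviour point: even a crude cap on concurrency and pacing goes a long way. A minimal asyncio sketch (the URLs, the concurrency of 2 and the one-second pause are arbitrary placeholders):

```python
# Sketch only: cap concurrent fetches and pause between requests so the target
# sees a polite, human-ish request rate. Requires `pip install aiohttp`.
import asyncio
import aiohttp

CONCURRENCY = 2       # never more than 2 pages in flight at once
DELAY_SECONDS = 1.0   # breathing room after each request

async def fetch(session, semaphore, url):
    async with semaphore:
        async with session.get(url) as response:
            body = await response.text()
        await asyncio.sleep(DELAY_SECONDS)
        return url, len(body)

async def main(urls):
    semaphore = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, semaphore, u) for u in urls))
    for url, size in results:
        print(url, size)

asyncio.run(main(["https://example.com/page1", "https://example.com/page2"]))
```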
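Finally, one practical consequence of the deny-at-start pattern: once the challenge has been passed (manually or otherwise), the resulting cookies and storage are usually what let you back in, so it is worth persisting and reusing them rather than arriving as a fresh, never-seen-before browser every time. A hedged Playwright sketch, assuming the site sets cookies after the challenge and that the challenge was already solved during the first visit:

```python
# Sketch only: save the browser state (cookies, localStorage) after a challenge has
# been passed once, then reuse it on later visits instead of starting from scratch.
from playwright.sync_api import sync_playwright

STATE_FILE = "state.json"  # placeholder path

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)

    # First visit: pass the challenge (interactively, via a solver, etc.), then save state.
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://example.com")
    context.storage_state(path=STATE_FILE)
    context.close()

    # Later visits: start from the saved state so the site recognises the session.
    context = browser.new_context(storage_state=STATE_FILE)
    page = context.new_page()
    page.goto("https://example.com")
    print(page.title())
    browser.close()
```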