Why User Agents Matter in Web Scraping (with Python Examples)

Most beginner scrapers send a request, get blocked, and blame the website. The real culprit is almost always the same: a missing or default User Agent. That single HTTP header tells servers what kind of client you are — and if you announce yourself as python-requests/2.x, many sites will refuse to serve real content.

This post explains what a User Agent is, why it has such a strong effect on scraping, and shows three working Python examples comparing desktop, mobile, and feature-phone responses from a real site.

What is a User Agent?

The User Agent is a plain-text string sent in the HTTP request headers that identifies the device, OS, and browser making the request. Servers use it to decide which version of a page to serve: a mobile-optimized layout for phones, a heavier JavaScript SPA for modern desktop browsers, a stripped-down HTML for older devices.

Because the header is plain text, it’s trivial to manipulate: the server treats whatever value you send as the truth. That’s exactly what makes it so useful for scraping.

Why your scraper needs one

If you don’t set a User Agent, libraries fall back to defaults like python-requests/2.31.0 or curl/8.x — both obvious bot signatures. Many sites block them outright (HTTP 403, captcha walls, or empty bodies).
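
You can see this for yourself by asking an echo service what your scraper actually sends. The sketch below uses httpbin.org, which simply reflects the request headers back; the exact version number in the output depends on your installed requests release:

import requests

# httpbin.org/headers echoes back the headers it received,
# so this shows exactly what your scraper announces by default.
resp = requests.get("https://httpbin.org/headers", timeout=10)
print(resp.json()["headers"]["User-Agent"])
# Typically something like: python-requests/2.31.0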

When to use this: always. Even if a site doesn’t block you today, sending a realistic User Agent is the lowest-effort improvement you can make to a scraper. Pair it with rotation across a small pool to mimic organic traffic patterns.
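
A minimal rotation sketch might look like the following; the pool entries here are just representative browser strings, so maintain your own list of current ones:

import random
import requests

# Small pool of realistic desktop User Agents (representative examples;
# swap in current strings and refresh them periodically).
UA_POOL = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:125.0) Gecko/20100101 Firefox/125.0",
]

def get_with_random_ua(url):
    # Pick a different UA per request to avoid one static fingerprint.
    headers = {"User-Agent": random.choice(UA_POOL)}
    return requests.get(url, headers=headers, timeout=10)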

Setup: dependencies

pip install requests beautifulsoup4 lxml

Example 1 — Desktop User Agent

Start with a modern desktop User Agent string. Many sites respond to it with the heavy JavaScript SPA version of the page, which means requests alone won’t render the actual content:

import requests
from bs4 import BeautifulSoup

# A current desktop Chrome on Windows User Agent string
desktop_ua = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Safari/537.36"
)
headers = {"User-Agent": desktop_ua}
resp = requests.get("https://twitter.com/billgates", headers=headers, timeout=10)

if resp.status_code == 200:
    soup = BeautifulSoup(resp.text, "lxml")
    print(soup.prettify()[:2000])
else:
    print(f"Error: {resp.status_code}")

You’ll see a <noscript> form telling you to enable JavaScript — Twitter (now X) loads its timeline via JS, and requests doesn’t execute scripts.
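
If you want to detect this case programmatically rather than eyeball the HTML, a crude check for the <noscript> fallback works. Continuing from the snippet above:

soup = BeautifulSoup(resp.text, "lxml")
noscript = soup.find("noscript")
if noscript:
    # The page is a JS-only shell; the real content loads client-side.
    print("JS-required shell:", noscript.get_text(" ", strip=True)[:120])
else:
    print("Got server-rendered HTML")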

Example 2 — Smartphone User Agent

# A current mobile Chrome on Android User Agent string
smartphone_ua = (
    "Mozilla/5.0 (Linux; Android 14; Pixel 8) "
    "AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/124.0.0.0 Mobile Safari/537.36"
)
headers = {"User-Agent": smartphone_ua}
resp = requests.get("https://twitter.com/billgates", headers=headers, timeout=10)

Modern mobile sites also rely on JS, so the response looks similar to desktop. Mobile UAs help mostly on older sites or those that maintain a separate m. subdomain.

Example 3 — Feature-phone User Agent

This is where it gets interesting. Older feature-phone UAs trigger the lightweight, JS-free HTML version of many social sites — perfect for raw HTML scraping with requests:

# A legacy Nokia feature-phone / UC Browser User Agent string
old_phone_ua = (
    "Nokia5310XpressMusic_CMCC/2.0 (10.10) Profile/MIDP-2.1 "
    "Configuration/CLDC-1.1 UCWEB/2.0 (Java; U; MIDP-2.0; en-US; "
    "Nokia5310XpressMusic) U2/1.0.0 UCBrowser/9.5.0.449 U2/1.0.0 Mobile"
)
headers = {"User-Agent": old_phone_ua}
resp = requests.get("https://twitter.com/billgates", headers=headers, timeout=10)

The response includes actual tweet content embedded in plain HTML, no JavaScript required. This is the kind of “User Agent leverage” that turns a job that would otherwise demand a headless browser into a few lines of requests.
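
To see the difference side by side, a quick sanity check (reusing the three UA strings defined in the examples above) compares what each one gets back; exact byte counts vary by account and over time:

import requests

# Reuses desktop_ua, smartphone_ua and old_phone_ua from the examples above.
for name, ua in [
    ("desktop", desktop_ua),
    ("smartphone", smartphone_ua),
    ("feature phone", old_phone_ua),
]:
    r = requests.get(
        "https://twitter.com/billgates",
        headers={"User-Agent": ua},
        timeout=10,
    )
    has_noscript = "<noscript" in r.text
    print(f"{name:>13}: HTTP {r.status_code}, {len(r.text)} bytes, "
          f"noscript tag: {has_noscript}")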

Comparison: which User Agent for which job?

Use case | Recommended UA type | Why
General scraping | Modern desktop Chrome/Firefox | Most realistic; lowest block rate
Mobile-only sites | Recent Android/iOS | Triggers mobile layout
JS-heavy sites without a headless browser | Feature-phone or old smartphone | Forces lightweight HTML version
Search engine respect | Mozilla/5.0 (compatible; MyBot/1.0; +https://your-site.example/bot) | Honest, identifies your scraper

Best practices

  • Keep User Agents fresh. Browser versions roll out monthly. A UA from 2019 is itself a bot signal.
  • Rotate across a small pool (5–10 UAs) instead of using one fixed string.
  • Match other headers: Accept, Accept-Language, Accept-Encoding. A request with a Chrome UA but missing those headers looks suspicious; see the sketch after this list.
  • Respect robots.txt. The legality and ethics of scraping vary by site and jurisdiction.
  • For aggressive anti-bot sites (Cloudflare, DataDome, PerimeterX), User Agent alone won’t be enough — you’ll need a real headless browser like Playwright.
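
Here is a minimal sketch of a fuller, browser-like header set to pair with a desktop Chrome UA. The values are representative, not an exact copy of what any specific Chrome build sends:

import requests

# A fuller, browser-like header set. Values are typical of a desktop
# Chrome request; adjust Accept-Language to your target audience.
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/124.0.0.0 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate",
}
resp = requests.get("https://example.com/", headers=headers, timeout=10)
print(resp.status_code)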

Final thoughts

The User Agent header is one of the cheapest and highest-leverage things to get right when scraping. Most beginner blocks vanish after switching from the default to a realistic desktop User Agent — and a few clever choices (like feature-phone UAs) can unlock content that would otherwise require a full headless browser.

For a complete scraping example, see Python Facebook Posts Scraper with Requests and BeautifulSoup4. Once your data is collected, you may want to expose it as an API — How to get more exposure for your API.
