The Importance of Using a User Agent When Scraping Data

Many scraper and crawler developers don’t bother to set a user agent. But it is important to know that using the right one can make scraping many websites much easier.

What is a User Agent?

The user agent is a text string that the client sends in the headers of a request. It identifies the type of device, operating system, and browser being used. This information tells the server that, for example, we are using the Google Chrome 80 browser on a computer with Windows 10, so the server can prepare a response intended for that type of device.
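
To make this concrete, here is a minimal sketch of how a program could read the browser and operating system out of a user agent string. The helper name `describe_user_agent` and its patterns are assumptions made for illustration; real servers use far more complete detection rules.

```python
import re


def describe_user_agent(user_agent):
    """Return a rough (browser, os) guess for a user agent string.

    Hypothetical helper for illustration only; it knows a handful of patterns.
    """
    browser = 'unknown'
    # Order matters: Chrome user agents also contain the word "Safari".
    patterns = (
        ('Firefox', r'Firefox/(\d+)'),
        ('Chrome', r'Chrome/(\d+)'),
        ('Safari', r'Version/(\d+).*Safari/'),
    )
    for name, pattern in patterns:
        match = re.search(pattern, user_agent)
        if match:
            browser = f'{name} {match.group(1)}'
            break

    os_name = 'unknown'
    android = re.search(r'Android (\d+)', user_agent)
    if 'Windows NT 10.0' in user_agent:
        os_name = 'Windows 10'  # newer Windows versions may report this too
    elif android:  # check Android before Linux, since Android UAs contain "Linux"
        os_name = f'Android {android.group(1)}'
    elif 'Mac OS X' in user_agent:
        os_name = 'macOS'
    elif 'Linux' in user_agent:
        os_name = 'Linux'
    return browser, os_name


chrome_on_windows = ('Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                     'AppleWebKit/537.36 (KHTML, like Gecko) '
                     'Chrome/80.0.3987.149 Safari/537.36')
print(describe_user_agent(chrome_on_windows))
```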

[Figure: User Agent Flow]

The response that Facebook, Twitter, or Google sends when we visit from an Android or iOS smartphone is not the same as the one we get on a computer running Windows, macOS, or Linux. Their servers tell the two apart through the user agent.

Because the user agent is a plain-text string, it is easy to manipulate and thus trick the web server into believing that we are visiting from a different device.

Why is it recommended to set a User Agent?

If we don’t set a user agent in our requests, our tools fall back to a default one that in many cases announces our presence as a bot. Many websites don’t allow bots, so they can easily ban us.
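
As a quick check, the snippet below prints the default user agent that the requests library sends when we don’t set one. It makes no network call; the exact version number in the output depends on the installed release.

```python
import requests

# The User-Agent that requests sends when we don't set one ourselves.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # something like python-requests/<version>

# A new Session carries the same value in its default headers.
session = requests.Session()
print(session.headers['User-Agent'])
```

A server that blocks bots only needs to look for the `python-requests` prefix to spot us.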

It is recommended to always use a popular user agent so that our requests go unnoticed. The following website contains a huge user agent database, but I find it easier to use the user agent of our own browser. For a Windows 10 PC running Google Chrome version 80, it looks something like this:

Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36

Example: testing different user agents

We will be using Python 3 for this example; you can download it here if you don’t already have it.

Necessary libraries:

  • requests
  • BeautifulSoup4
  • lxml

Install them with this command in a terminal:

pip install requests BeautifulSoup4 lxml

First, we will classify user agents by the kind of content a website could serve us when we visit with each of them.

  • For desktops or laptops: computers in general.
  • For smartphones: Android, iOS, Windows Phone.
  • For feature phones: Nokia 5310 XpressMusic, Sony Ericsson, and other phones from our childhood.
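
The three categories above can be sketched as a small classifier. The function name `classify_user_agent` and the marker lists are assumptions made for illustration; they are naive substring checks, not a complete detection scheme.

```python
def classify_user_agent(user_agent):
    """Sort a user agent into 'featurephone', 'smartphone', or 'desktop'.

    Hypothetical helper: the marker lists are illustrative, not exhaustive.
    """
    ua = user_agent.lower()

    # Old Java-based phones usually advertise MIDP/CLDC profiles.
    if any(marker in ua for marker in ('midp', 'cldc', 'ucweb')):
        return 'featurephone'

    # Modern mobile platforms name themselves in the string.
    if any(marker in ua for marker in ('android', 'iphone', 'windows phone')):
        return 'smartphone'

    # Everything else is treated as a desktop or laptop browser.
    return 'desktop'


nokia = ('Nokia5310XpressMusic_CMCC/2.0 (10.10) Profile/MIDP-2.1 '
         'Configuration/CLDC-1.1 UCWEB/2.0 (Java; U; MIDP-2.0; en-US; '
         'Nokia5310XpressMusic) U2/1.0.0 UCBrowser/9.5.0.449 U2/1.0.0 Mobile')
print(classify_user_agent(nokia))
```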

In the following tests, I trim the server’s responses to keep the post short.

Desktop computers:

Let’s take a desktop user agent, Windows 10 with Google Chrome, and run the request:

import requests # Import requests
from bs4 import BeautifulSoup # Import BeautifulSoup4

# Windows 10 with Google Chrome
user_agent_desktop = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '\
'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 '\
'Safari/537.36'

headers = {'User-Agent': user_agent_desktop}

url_twitter = 'https://twitter.com/billgates'
resp = requests.get(url_twitter, headers=headers)  # Send request

code = resp.status_code  # HTTP response code
if code == 200:
    soup = BeautifulSoup(resp.text, 'lxml')  # Parsing the HTML
    print(soup.prettify())
else:
    print(f'Failed to load Twitter: {code}')

What does the response of the Twitter server look like if we send a request with this User Agent?

<!-- ... -->
<body>
  <noscript>
   <form action="https://mobile.twitter.com/i/nojs_router?path=%2Fbillgates" method="POST" style="background-color: #fff; position: fixed; top: 0; left: 0; right: 0; bottom: 0; z-index: 9999;">
    <div style="font-size: 18px; font-family: Helvetica,sans-serif; line-height: 24px; margin: 10%; width: 80%;">
     <p>
      We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?
     </p>
<!-- ... -->

As we can see in the result, the server returns a page that is loaded dynamically with JavaScript and that shows nothing useful if the client doesn’t have JavaScript enabled. Python Requests doesn’t execute JavaScript, so we will not be able to see the information that interests us. Let’s try another user agent.
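
A script can spot this situation on its own instead of relying on us to read the HTML. The sketch below is a hypothetical helper, `requires_javascript`, that looks for the `<noscript>` notice shown in the response above; the wording it searches for is taken from that response and may differ on other sites.

```python
def requires_javascript(html):
    """Guess whether a page is a JavaScript-only shell.

    Hypothetical helper: it only looks for a <noscript> block together with
    the notice text seen in the response above.
    """
    lowered = html.lower()
    return '<noscript' in lowered and 'javascript is disabled' in lowered


sample = """
<body>
  <noscript>
    <p>We've detected that JavaScript is disabled in your browser.</p>
  </noscript>
</body>
"""
print(requires_javascript(sample))
```

When the check is true, the scraper can retry with a different user agent rather than parse an empty page.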

Smartphones:

Now let’s try an Android 9 phone and Google Chrome.

import requests # Import requests
from bs4 import BeautifulSoup # Import BeautifulSoup4

# Android 9 with Google Chrome
user_agent_smartphone = 'Mozilla/5.0 (Linux; Android 9; SM-G960F '\
'Build/PPR1.180610.011; wv) AppleWebKit/537.36 (KHTML, like Gecko) '\
'Version/4.0 Chrome/74.0.3729.157 Mobile Safari/537.36'

headers = {'User-Agent': user_agent_smartphone}

url_twitter = 'https://twitter.com/billgates'
resp = requests.get(url_twitter, headers=headers)  # Send request

code = resp.status_code  # HTTP response code
if code == 200:
    soup = BeautifulSoup(resp.text, 'lxml')  # Parsing the HTML
    print(soup.prettify())
else:
    print(f'Failed to load Twitter: {code}')

The response is very similar to the desktop one, and for the same reason: the server expects a smartphone to run JavaScript in order to display the page content.

<!-- ... -->
<body>
  <noscript>
   <form action="https://mobile.twitter.com/i/nojs_router?path=%2Fbillgates" method="POST" style="background-color: #fff; position: fixed; top: 0; left: 0; right: 0; bottom: 0; z-index: 9999;">
    <div style="font-size: 18px; font-family: Helvetica,sans-serif; line-height: 24px; margin: 10%; width: 80%;">
     <p>
      We've detected that JavaScript is disabled in your browser. Would you like to proceed to legacy Twitter?
     </p>
<!-- ... -->

Featurephone:

Finally, let’s see the response when we send the request as an old mobile phone:

import requests # Import requests
from bs4 import BeautifulSoup # Import BeautifulSoup4

# Nokia 5310 with UC Browser
user_agent_old_phone = 'Nokia5310XpressMusic_CMCC/2.0 (10.10) Profile/MIDP-2.1 '\
'Configuration/CLDC-1.1 UCWEB/2.0 (Java; U; MIDP-2.0; en-US; '\
'Nokia5310XpressMusic) U2/1.0.0 UCBrowser/9.5.0.449 U2/1.0.0 Mobile'

headers = {'User-Agent': user_agent_old_phone}

url_twitter = 'https://twitter.com/billgates'
resp = requests.get(url_twitter, headers=headers)  # Send request

code = resp.status_code  # HTTP response code
if code == 200:
    soup = BeautifulSoup(resp.text, 'lxml')  # Parsing the HTML
    print(soup.prettify())
else:
    print(f'Failed to load Twitter: {code}')

Let’s see the response:

<!-- ... -->
<table class="tweet" href="/BillGates/status/1249497817900433408?p=v">
  <tr class="tweet-header">
    <td class="avatar" rowspan="3">
      <a href="/BillGates?p=i"><img alt="Bill Gates" src="https://pbs.twimg.com/profile_images/988775660163252226/XpgonN0X_normal.jpg" /></a>
    </td>
    <td class="user-info">
      <a href="/BillGates?p=s">
        <strong class="fullname">Bill Gates</strong>
        <div class="username"> <span>@</span>BillGates</div>
      </a>
    </td>
    <td class="timestamp">...</td>
  </tr>
  <tr class="tweet-container">
    <td class="tweet-content" colspan="2">
      <div class="tweet-text" data-id="1249497817900433408">
        <div class="dir-ltr" dir="ltr">
          .
          <a class="twitter-atreply dir-ltr" data-mentioned-user-id="17004618" data-screenname="NickKristof" dir="ltr"
            href="/NickKristof">
            @NickKristof
          </a>
          does an amazing job capturing the heroism of the health care workers on the front lines of the
          coronavirus fight.
          <a class="twitter_external_link dir-ltr tco-link"
            data-expanded-url="https://twitter.com/NickKristof/status/1248996159491919873"
            data-url="https://twitter.com/NickKristof/status/1248996159491919873" dir="ltr"
            href="https://t.co/x1TgE2oNXE" rel="nofollow noopener" target="_blank"
            title="https://twitter.com/NickKristof/status/1248996159491919873">
            twitter.com/NickKristof/st…
          </a>
        </div>
      </div>
    </td>
  </tr>
  <tr>
    <td class="meta-and-actions" colspan="2">...
      <span class="tweet-actions">...</span>
    </td>
  </tr>
</table>
<!-- ... -->

This time we get a usable response. Popular sites like Twitter, Facebook, or Google keep versions for all these devices because they started out when such devices were common, and they want their services to remain available to users on any device.
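
Since this response is plain HTML, BeautifulSoup can pull the tweet data out directly. The sketch below works on a shortened, hard-coded copy of the markup above, so it runs without a network call. The function name `extract_tweets` and the sample text are assumptions for illustration, the class names come from the response shown, and the built-in `html.parser` is used here in place of `lxml`.

```python
from bs4 import BeautifulSoup

# Shortened copy of the legacy markup shown above, with placeholder tweet text.
SAMPLE_HTML = """
<table class="tweet" href="/BillGates/status/1249497817900433408?p=v">
  <tr class="tweet-header">
    <td class="user-info">
      <a href="/BillGates?p=s">
        <strong class="fullname">Bill Gates</strong>
        <div class="username"> <span>@</span>BillGates</div>
      </a>
    </td>
  </tr>
  <tr class="tweet-container">
    <td class="tweet-content" colspan="2">
      <div class="tweet-text" data-id="1249497817900433408">
        <div class="dir-ltr" dir="ltr">Example tweet text.</div>
      </div>
    </td>
  </tr>
</table>
"""


def extract_tweets(html):
    """Return one dict per <table class="tweet"> found in the HTML."""
    soup = BeautifulSoup(html, 'html.parser')
    tweets = []
    for table in soup.find_all('table', class_='tweet'):
        fullname = table.find('strong', class_='fullname')
        username = table.find('div', class_='username')
        text = table.find('div', class_='tweet-text')
        tweets.append({
            'id': text.get('data-id') if text else None,
            'fullname': fullname.get_text(strip=True) if fullname else None,
            'username': username.get_text(strip=True) if username else None,
            # Collapse the indentation whitespace inside the tweet body.
            'text': ' '.join(text.get_text().split()) if text else None,
        })
    return tweets


for tweet in extract_tweets(SAMPLE_HTML):
    print(tweet)
```

In a real run, `resp.text` from the feature phone request would take the place of `SAMPLE_HTML`.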

Final thoughts

User agents can be used to request a page from the server in a particular predefined format. In the examples, we saw how popular websites respond differently depending on the device that visits them, and we can use this to our advantage to scrape them.

In practice, it is usually not necessary to use mobile user agents. Rotating between common desktop user agents is enough if all we want is to avoid being detected during our scraping tasks.
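
Rotation can be as simple as picking a random entry from a list for each request. In the sketch below, the list `DESKTOP_USER_AGENTS` and the helper `random_headers` are hypothetical names; the Firefox and Safari strings are illustrative examples and should be replaced with current ones.

```python
import random

# Illustrative pool of desktop user agents; keep it up to date in real use.
DESKTOP_USER_AGENTS = [
    # Chrome 80 on Windows 10 (the string used earlier in this post)
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
    '(KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    # Firefox 74 on Windows 10 (illustrative)
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) '
    'Gecko/20100101 Firefox/74.0',
    # Safari 13 on macOS (illustrative)
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/605.1.15 '
    '(KHTML, like Gecko) Version/13.1 Safari/605.1.15',
]


def random_headers():
    """Build request headers with a randomly chosen desktop user agent."""
    return {'User-Agent': random.choice(DESKTOP_USER_AGENTS)}


# Each call may return a different user agent.
for _ in range(3):
    print(random_headers()['User-Agent'])
```

The result can be passed straight to requests, for example `requests.get(url, headers=random_headers())`.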

I also recommend checking out this book, where I learned some cool web scraping tricks: https://amzn.to/3umlGuc.

I hope this guide has helped you understand the differences between user agents.
