Python urllib.robotparser Module



Parser for the robots.txt File - urllib.robotparser


The urllib.robotparser module provides a single class, RobotFileParser, which tells the user whether or not a given URL may be fetched from a website that publishes a robots.txt file.

Introduction to robots.txt file

WWW robots (also called crawlers or spiders) are programs that traverse many pages on the World Wide Web by recursively retrieving linked pages.

In the early days, robots often overloaded servers with requests or retrieved the same files repeatedly. They also sometimes wandered into parts of servers that were not suitable for crawling. Such incidents created the need for an established mechanism that tells robots which parts of a server they may access.

The robots.txt file is the method through which a site specifies its access policy for robots. The file is accessible through HTTP at the local URL "/robots.txt". This approach was chosen because it could be implemented easily on any existing WWW server, and a robot could discover the access policy with a single document retrieval.
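
For illustration, a minimal, purely hypothetical robots.txt file might look like the one below: it bars every robot from one directory, asks crawlers to wait between requests, and advertises a sitemap.

User-agent: *
Disallow: /private/
Crawl-delay: 10

Sitemap: https://example.com/sitemap.xml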

Methods of RobotFileParser Class

The RobotFileParser class provides methods to read, parse, and answer questions about the robots.txt file at the selected URL. A sketch exercising several of these methods follows the list.

1. set_url(url)

This method sets the URL that refers to the robots.txt file.

2. read()

It reads the robots.txt file from that URL and feeds it to the parser.

3. parse(lines)

It parses the given lines argument, an iterable of lines taken from a robots.txt file.

4. can_fetch(useragent, url)

Based on the rules in the parsed robots.txt file, this method returns True if the useragent is allowed to fetch the URL, and False otherwise.

5. mtime()

This method returns the time when the robots.txt file was last fetched.

6. modified()

It sets the time the robots.txt file was last fetched to the current time.

7. crawl_delay(useragent)

For the useragent in question, this method returns the value of the Crawl-delay parameter from the robots.txt file. It returns None if there is no such parameter or if it does not apply to that useragent.

8. request_rate(useragent)

This method returns the contents of the Request-rate parameter from the robots.txt file as a named tuple RequestRate(requests, seconds). It returns None if there is no such parameter or if it does not apply to that useragent.

9. site_maps()

This method returns the contents of the Sitemap parameter from the robots.txt file in the form of a list, or None if there is no such parameter.
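
The short sketch below ties several of these methods together. It parses the same kind of hypothetical robots.txt shown earlier, supplied as an in-memory string so it does not depend on any particular website being reachable; note that site_maps() is only available in Python 3.8 and newer, and all URLs here are made up for illustration.

import urllib.robotparser

# A hypothetical robots.txt, used here only for illustration.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
Request-rate: 3/20

Sitemap: https://example.com/sitemap.xml
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())        # parse(lines) feeds the rules to the parser

print(rp.can_fetch("*", "https://example.com/index.html"))   # True
print(rp.can_fetch("*", "https://example.com/private/x"))    # False
print(rp.crawl_delay("*"))               # 10
print(rp.request_rate("*"))              # RequestRate(requests=3, seconds=20)
print(rp.site_maps())                    # ['https://example.com/sitemap.xml']
print(rp.mtime())                        # time at which the rules were parsed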

Understanding the Function of urllib.robotparser Through an Example

SAMPLE CODE:

import urllib.robotparser

parser = urllib.robotparser.RobotFileParser()
parser.set_url("https://cppsecrets.com/robots.txt")   # point the parser at the site's robots.txt
parser.read()                                          # download and parse the file

print(parser.crawl_delay("*"))                         # Crawl-delay for any user agent
print(parser.can_fetch("*", "https://www.cppsecrets.com/"))

OUTPUT:

6
True

For a better understanding of how the module works, try the remaining methods in the same way as in the example above.
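
In practice, a polite crawler checks can_fetch() before every download and respects any crawl_delay(). A minimal sketch of that pattern, reusing the site from the example above (the user-agent string "MyCrawler/1.0" is only an assumption for illustration):

import time
import urllib.request
import urllib.robotparser

target = "https://www.cppsecrets.com/"

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://cppsecrets.com/robots.txt")
rp.read()

if rp.can_fetch("MyCrawler/1.0", target):
    delay = rp.crawl_delay("MyCrawler/1.0")
    if delay:
        time.sleep(delay)                          # respect the site's Crawl-delay between requests
    with urllib.request.urlopen(target) as response:
        html = response.read()
    print(len(html), "bytes fetched")
else:
    print("robots.txt disallows fetching", target)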

