Python Accessing the Web Through urllib

What is urllib?

The urllib module in Python 3 lets your program access websites. It provides built-in functions and classes for working with URLs. Python 3's urllib differs slightly from Python 2, where the functionality was split between urllib and urllib2, but it is mostly the same. Through urllib, you can access websites, download data, parse data, modify your headers, and make any GET and POST requests you might need. With the urllib library you can also access and retrieve data from the internet in formats such as XML, HTML, and JSON.

You can also use Python to work with this data directly. Let us see, through an example, how to access a webpage with urllib and read its contents.
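Before looking at each submodule, here is a minimal sketch of the custom-header and POST usage mentioned above. The target URLs use httpbin.org purely as a placeholder test service; substitute any endpoint you like.

import urllib.request
import urllib.parse

# GET with a custom User-Agent header (httpbin.org is only a placeholder URL)
req = urllib.request.Request('https://httpbin.org/get',
                             headers={'User-Agent': 'my-app/0.1'})
with urllib.request.urlopen(req) as resp:
    print(resp.getcode())                     # 200 on success

# POST: passing a data argument makes urlopen issue a POST request
data = urllib.parse.urlencode({'name': 'value'}).encode('utf-8')
with urllib.request.urlopen('https://httpbin.org/post', data=data) as resp:
    print(resp.read()[:100])                  # first 100 bytes of the response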

Urllib is a package that collects several modules for working with URLs:

  • urllib.request for opening and reading URLs.

  • urllib.parse for parsing URLs.

  • urllib.error for the exceptions raised by urllib.request.

  • urllib.robotparser for parsing robots.txt files.

 

urllib.request

Here is an example of how to retrieve information from a webpage through urllib.request, using the urlopen method.

Example:


import urllib.request
import urllib.error
import pprint


class urllibDemo:
    def __init__(self, url):
        self.url = url
        try:
            self.parse = urllib.request.urlopen(self.url)
        except urllib.error.URLError:           # catch a specific exception, not a bare except
            print('Url not Found.')

    def GetData(self):                          # Getting Data from the Webpage
        parse_data = self.parse.read()
        pp = pprint.PrettyPrinter()
        pp.pprint(parse_data)
        print(f'Result Code : {self.parse.getcode()}')


def main():
    url = 'https://cppsecrets.com'
    r = urllibDemo(url)
    r.GetData()


if __name__ == '__main__':
    main()



Output:

The raw bytes of the page's HTML are printed, followed by the result code (200 on success).
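Note that read() returns bytes, not a string. A minimal sketch of decoding the response, assuming the page is UTF-8 encoded:

import urllib.request

with urllib.request.urlopen('https://cppsecrets.com') as resp:
    html = resp.read().decode('utf-8')        # bytes -> str, assuming UTF-8
    print(html[:200])                         # first 200 characters of the page
    print(resp.getcode())                     # 200 on success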

urllib.parse

This module defines a standard interface to break Uniform Resource Locator (URL) strings up into components (addressing scheme, network location, path, etc.), to combine the components back into a URL string, and to convert a relative URL to an absolute URL given a base URL.

The URL parsing functions operate on strings, and also on bytes or bytearray objects. Here is an example of parsing a URL string.

 

Example:


from urllib.parse import urlparse


class urllibDemo:
    def __init__(self, url):
        self.url = url
        try:
            self.parse = urlparse(self.url)
        except ValueError:                      # urlparse raises ValueError for malformed URLs
            print('Invalid Url.')

    def GetData(self):
        print(f'Parsed Data : {self.parse}')


def main():
    url = 'https://cppsecrets.com'
    r = urllibDemo(url)
    r.GetData()


if __name__ == '__main__':
    main()


Output:

Parsed Data : ParseResult(scheme='https', netloc='cppsecrets.com', path='', params='', query='', fragment='')
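As the description above notes, urllib.parse can also reassemble components and resolve relative URLs. A minimal sketch using urlunparse and urljoin (the paths are only illustrative):

from urllib.parse import urlparse, urlunparse, urljoin

parts = urlparse('https://cppsecrets.com/users/python')
print(urlunparse(parts))                      # rebuilds the original URL from its components
print(urljoin('https://cppsecrets.com/users/', 'python'))
# -> https://cppsecrets.com/users/python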


urllib.error

The urllib.error module defines the exception classes for exceptions that are raised by urllib.request.

These exceptions are classified as follows:

  •  URLError, which is raised when the URL is incorrect or there is no Internet connectivity.

  •  HTTPError, a subclass of URLError, which is raised for HTTP errors such as 404 (Not Found) and 403 (Forbidden). Because it is a subclass, it must be caught before URLError.

 

Example:


import urllib.request
from urllib.error import HTTPError, URLError


class urllibDemo:
    def __init__(self, url):
        self.url = url
        try:
            self.parse = urllib.request.urlopen(self.url)
        except HTTPError as e:                  # checked first: HTTPError is a subclass of URLError
            print(f'Http Error : {e}')
        except URLError as e:
            print(f'Url Error : {e}')


def main():
    url = 'https://cppsecrets.com/nishchint'
    r = urllibDemo(url)


if __name__ == '__main__':
    main()


Output:

An error message such as Http Error : HTTP Error 404: Not Found, depending on how the server responds to the URL.
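The example above only prints the exception object. An HTTPError also carries the numeric status code, the reason phrase, and the response headers. A minimal sketch (the URL is a hypothetical placeholder assumed to return 404):

import urllib.request
from urllib.error import HTTPError, URLError

try:
    urllib.request.urlopen('https://cppsecrets.com/some-missing-page')   # placeholder URL
except HTTPError as e:
    print(e.code)       # numeric status code, e.g. 404
    print(e.reason)     # human-readable reason phrase
    print(e.headers)    # headers sent with the error response
except URLError as e:
    print(e.reason)     # e.g. name resolution failure, no connectivity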
urllib.robotparser

robotparser implements a parser for the robots.txt file format, including a function that checks whether a given user agent can access a resource. It is intended for use in well-behaved spiders or other crawler applications that need to be throttled or otherwise restricted.
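For throttling, RobotFileParser can also report the Crawl-delay and Request-rate directives declared in robots.txt; both directives are optional, so these calls may return None. A minimal sketch:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://cppsecrets.com/robots.txt')
rp.read()
print(rp.crawl_delay('*'))      # Crawl-delay for agent '*', or None if absent
print(rp.request_rate('*'))     # Request-rate as a named tuple, or None if absent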

This is the robots.txt file for https://cppsecrets.com:

           # robots.txt

           Sitemap: https://cppsecrets.com/sitemap.xml

           User-Agent: *
           Disallow: /admin/
           Disallow: /google/
           Disallow: /animesh/
           Disallow: /backup/
           Disallow: /forgot/
           Disallow: /mail/


Example:


import urllib.robotparser


class Example:
    def __init__(self, url):
        self.url = url
        self.rp = urllib.robotparser.RobotFileParser()

    def can_access(self):
        self.rp.set_url(self.url + 'robots.txt')        # point the parser at the robots.txt file
        self.rp.read()
        access = self.rp.can_fetch('*', self.url)       # Allowed URL for Users
        print(f"User '*' can access {self.url} : {access}")
        paths = ['/admin/', '/google/', '/animesh/', '/backup/']   # Disallowed URLs for Users
        for path in paths:
            access = self.rp.can_fetch('*', path)
            print(f"User '*' can access {path} : {access}")


def main():
    url = 'https://cppsecrets.com/'
    ex = Example(url)
    ex.can_access()


if __name__ == '__main__':
    main()


Output:

User '*' can access https://cppsecrets.com/ : True
User '*' can access /admin/ : False
User '*' can access /google/ : False
User '*' can access /animesh/ : False
User '*' can access /backup/ : False
