
Spider

Spider for crawling sites


Spiders.

A spider, or web crawler, is an internet bot that crawls a site for information, usually in the form of text. A spider works by first visiting a seed link and then parsing the text on that web page to find more links that point to other web pages. These freshly found links are added to a stack or queue, and the spider traverses all the links or URLs in that stack or queue one by one.

So, from the initial web page the spider moves on to another web page and continues its journey recursively. In this way it crawls through the web. In principle, starting from one page the spider should be able to reach every page on the World Wide Web, but my spider does not do that because I have limited it to the links present on the initial web page, as sketched below.
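As a rough illustration of this loop, here is a minimal Python 3 sketch of a queue-based crawl restricted to the seed page's links. The function name crawl and the limit_to_seed flag are illustrative only, not the project's actual code, and the project itself may target the older urllib2/urlparse modules.

    # Minimal sketch of the crawl loop described above (Python 3).
    from collections import deque
    from urllib.request import urlopen
    from urllib.parse import urljoin

    from bs4 import BeautifulSoup

    def crawl(seed_url, limit_to_seed=True):
        queue = deque([seed_url])      # links waiting to be visited
        visited = set()

        while queue:
            url = queue.popleft()
            if url in visited:
                continue
            visited.add(url)

            try:
                html = urlopen(url).read()
            except OSError:
                continue               # skip pages that fail to load

            # Parse the page and collect the links it points to.
            soup = BeautifulSoup(html, "html.parser")
            for anchor in soup.find_all("a", href=True):
                link = urljoin(url, anchor["href"])
                # As described above, only links found on the seed page
                # are followed; links discovered later are not enqueued.
                if not limit_to_seed or url == seed_url:
                    queue.append(link)

        return visited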

Spiders are usually used for web indexing. Other uses include updating web content or indexes. A spider can copy all the content of a web page (if the page permits it). A search engine processes this data at a later stage for indexing, so that searching and information retrieval become faster and easier.

In order to get data from a site I usually use one of the following four methods:

These are not the only ways to get links or URLs from a site. An easier and more direct way is to use Python's parsing libraries, namely urlparse and urllib, along with Beautiful Soup. Another alternative is to use HTTP GET and POST requests.
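As a sketch of that alternative, the page can be fetched with an explicit HTTP GET request (here using the third-party requests library as an assumed stand-in) and the HTML handed to Beautiful Soup; the helper name get_links is hypothetical.

    # Fetch a page with an HTTP GET request and collect its links.
    import requests
    from bs4 import BeautifulSoup
    from urllib.parse import urljoin

    def get_links(url):
        response = requests.get(url, timeout=10)
        response.raise_for_status()        # stop on 4xx/5xx responses

        soup = BeautifulSoup(response.text, "html.parser")
        # urljoin turns relative hrefs into absolute URLs.
        return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]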

Privacy Policy

The robots.txt file tells a crawler which parts of a site it is permitted to fetch. It is essential that scraping be done only on web pages that allow it. To read this file we can use Python's robotparser module. If permission is granted, the spider creates a separate thread and visits the site to parse it.
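A minimal sketch of that check, using Python's robot-parsing module (urllib.robotparser in Python 3, robotparser in Python 2); the helper name allowed_to_crawl is illustrative.

    # Check robots.txt before crawling a URL.
    from urllib.parse import urljoin
    from urllib.robotparser import RobotFileParser

    def allowed_to_crawl(url, user_agent="*"):
        parser = RobotFileParser()
        parser.set_url(urljoin(url, "/robots.txt"))
        parser.read()                      # fetch and parse robots.txt
        return parser.can_fetch(user_agent, url)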

Tags

The biggest advantage of using Beautiful Soup is its soup objects, which are flexible enough to be used in many different ways. I have used them by sorting according to tags. This lets me differentiate between the various HTML tags, which is important because tags form the basis of every web page. I have used the ‘href’ attribute to find the URL of the next page, which is then appended to the list of URLs. If instead I had taken only the ‘a’ tag itself, I would have got the anchor text of the link rather than the hyperlink, so reading the ‘href’ attribute was the better option.
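To illustrate the distinction (the HTML snippet is made up for the example):

    # An <a> tag's text is only the anchor text; the 'href' attribute
    # holds the actual hyperlink.
    from bs4 import BeautifulSoup

    html = '<a href="http://example.com/next">Next page</a>'
    soup = BeautifulSoup(html, "html.parser")

    for anchor in soup.find_all("a"):
        print(anchor.get_text())    # "Next page" -- the anchor text only
        print(anchor.get("href"))   # "http://example.com/next" -- the link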