What is a search engine?
By definition, an Internet search engine is an information retrieval system that helps us find information on the World Wide Web. The World Wide Web is a universe of information that is accessible over the network, and it facilitates the global sharing of information. However, the WWW is an unstructured database, and it is growing exponentially into an enormous store of information. Searching for information on the web is therefore a difficult task. There is a need for a tool to manage, filter and retrieve this vast body of information. A search engine serves this purpose.
How does a search engine work?
Internet search engines, or web search engines as they are also called, are engines that search and retrieve information on the web. Most of them use a crawler-indexer architecture and depend on their crawler modules. Crawlers, also referred to as spiders, are small programs that browse the web.
Crawlers are given an initial set of URLs whose pages they retrieve. They extract the URLs that appear on the crawled pages and pass this information to the crawler control module. The crawler control module decides which pages to visit next and gives their URLs back to the crawlers.
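The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the "web" is a hypothetical in-memory dictionary (FAKE_WEB), fetch_links stands in for a real crawler that would download and parse a page, and a simple first-in, first-out queue plays the part of the crawler control module.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the URLs found on its page.
FAKE_WEB = {
    "http://a.example/": ["http://b.example/", "http://c.example/"],
    "http://b.example/": ["http://c.example/", "http://d.example/"],
    "http://c.example/": ["http://a.example/"],
    "http://d.example/": [],
}


def fetch_links(url):
    """Stand-in for a crawler: return the URLs that appear on the page."""
    return FAKE_WEB.get(url, [])


def crawl(seed_urls, max_pages=100):
    """Visit pages breadth-first, starting from the initial set of URLs."""
    frontier = deque(seed_urls)   # URLs waiting to be visited
    seen = set(seed_urls)         # URLs already queued, to avoid repeats
    visited = []                  # pages retrieved, in visiting order
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()  # the control step: choose the next page
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return visited


print(crawl(["http://a.example/"]))
```

A real control module would choose the next URL with a more elaborate policy than "oldest first"; the queue is simply the easiest policy to show.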
The topics covered by different search engines vary according to the algorithms they use. Some search engines are programmed to search sites on a particular topic, while the crawlers of others visit as many sites as possible.
The crawler control module may use the link graph of a previous crawl, or usage patterns, to guide its crawling strategy.
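As one possible illustration of using a previous crawl's link graph, the sketch below orders candidate URLs by how many distinct pages linked to them last time. The in-link count as a priority, and the function name prioritise, are assumptions made for the example; the text does not say which strategy any particular engine follows.

```python
from collections import Counter


def prioritise(candidate_urls, previous_link_graph):
    """Order candidates by in-link count from a previous crawl (highest first).

    previous_link_graph maps each source URL to the URLs it linked to.
    """
    in_links = Counter()
    for targets in previous_link_graph.values():
        for target in set(targets):  # count each source page once per target
            in_links[target] += 1
    # Sort by descending in-link count; the URL itself breaks ties.
    return sorted(candidate_urls, key=lambda url: (-in_links[url], url))


graph = {"a": ["b", "c"], "b": ["c"], "d": ["c", "b"]}
print(prioritise(["a", "b", "c"], graph))
```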
The indexer module extracts the words from each page the crawlers visit and records the page's URL. The result is a large lookup table that gives, for each word, a list of URLs pointing to the pages where that word occurs. The table covers only those pages that were visited in the crawling process.
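The lookup table described above is commonly called an inverted index. A minimal sketch follows, assuming that pages arrive as a dictionary from URL to plain text and that a "word" is any run of letters or digits; both are simplifications chosen for the example.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Build a word -> list-of-URLs lookup table from {url: text}."""
    index = defaultdict(set)
    for url, text in pages.items():
        # Lower-case the text and split it into alphanumeric words.
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    # Sorted lists make the table easy to read and compare.
    return {word: sorted(urls) for word, urls in index.items()}


pages = {
    "http://a.example/": "Search engines crawl the web",
    "http://b.example/": "Engines index pages",
}
index = build_index(pages)
print(index["engines"])
```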
A collection analysis module is another important part of the search engine architecture. It creates a utility index. A utility index may, for example, provide access to pages of a given length or to pages containing a certain number of pictures.
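The two examples of a utility index mentioned above could look like the following sketch. The input format (URL mapped to a pair of text and picture count) and the use of word count as "length" are assumptions made purely for illustration.

```python
from collections import defaultdict


def build_utility_index(pages):
    """Group URLs by page length (in words) and by number of pictures.

    pages maps each URL to a (text, image_count) pair.
    """
    by_length = defaultdict(list)
    by_images = defaultdict(list)
    for url, (text, image_count) in pages.items():
        by_length[len(text.split())].append(url)
        by_images[image_count].append(url)
    return dict(by_length), dict(by_images)


pages = {
    "http://a.example/": ("a short page", 2),
    "http://b.example/": ("another short page", 0),
    "http://c.example/": ("a slightly longer page here", 2),
}
by_length, by_images = build_utility_index(pages)
print(by_images[2])
```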
During the process of crawling and indexing, a search engine stores the pages it retrieves. They are temporarily stored in a page repository. Search engines also maintain a cache of the pages they visit so that already visited pages can be retrieved more quickly.
The query module of a search engine receives search requests from users in the form of keywords. The ranking module sorts the results.
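A rough sketch of the query step is shown below. It assumes the index is a plain dictionary from word to list of URLs and that a page must contain every keyword to match; that "AND" behaviour is an assumption of the example, not something the text specifies. Sorting by URL here is only a placeholder for the ranking module.

```python
def search(index, query):
    """Return the URLs of pages that contain every keyword in the query."""
    keywords = query.lower().split()
    if not keywords:
        return []
    # One set of URLs per keyword; unknown words yield an empty set.
    postings = [set(index.get(word, [])) for word in keywords]
    # Keep only the URLs present in every set (assumed AND semantics).
    return sorted(set.intersection(*postings))


index = {
    "search": ["http://a.example/", "http://b.example/"],
    "engine": ["http://b.example/", "http://c.example/"],
}
print(search(index, "search engine"))
```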
The crawler-indexer architecture has many variants. One of them is the distributed architecture, which consists of gatherers and brokers. Gatherers collect indexing information from web servers, while brokers provide the indexing mechanism and the query interface. Brokers update their indices on the basis of information received from gatherers and from other brokers, and they can filter information. Many search engines of today use this type of architecture.
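The division of labour between gatherers and brokers can be pictured with the toy classes below. The class names, method names and the accept filter are all hypothetical, invented for this sketch; a real distributed engine would exchange this information over the network rather than through method calls.

```python
class Gatherer:
    """Collects indexing information (word -> URLs) from one web server."""

    def __init__(self, pages):
        self.pages = pages  # hypothetical server content: {url: text}

    def gather(self):
        records = {}
        for url, text in self.pages.items():
            for word in set(text.lower().split()):
                records.setdefault(word, set()).add(url)
        return records


class Broker:
    """Maintains an index and answers queries; may filter what it accepts."""

    def __init__(self, accept=lambda word: True):
        self.index = {}
        self.accept = accept  # filter deciding which words to keep

    def update(self, records):
        """Merge records received from a gatherer or from another broker."""
        for word, urls in records.items():
            if self.accept(word):
                self.index.setdefault(word, set()).update(urls)

    def export(self):
        """Hand a copy of the index to another broker."""
        return {word: set(urls) for word, urls in self.index.items()}

    def query(self, word):
        return sorted(self.index.get(word.lower(), set()))


broker = Broker()
broker.update(Gatherer({"http://a.example/": "web search"}).gather())
broker.update(Gatherer({"http://b.example/": "web crawler"}).gather())
print(broker.query("web"))
```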
Search engines and page ranking
When we submit a query to a search engine, the results are displayed in a particular order. Most of us tend to visit the pages at the top of the list and ignore those beyond the first few, because we consider the top few pages to be the most relevant to our query. This is why people want their pages to rank among the first ten results of a search engine.
The words you type into the query interface of a search engine are the keywords that the search engine looks for. It presents a list of pages relevant to those keywords. During this process, search engines retrieve the pages in which the keywords occur frequently. They also look for interrelationships between keywords. The location of the keywords is considered when ranking the pages that contain them: keywords that occur in page titles or in URLs are given greater weight. Links pointing to a page make it more popular. If many other sites link to a page, it is regarded as valuable and more relevant.
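The factors listed above (keyword frequency, keywords in the title or URL, and inbound links) can be combined into a toy scoring function. The weights below are arbitrary values chosen for illustration, and the page format is hypothetical; real ranking algorithms are far more elaborate and are not described in this text.

```python
def score(page, keywords, inbound_links,
          title_weight=3.0, url_weight=2.0, link_weight=0.5):
    """Score one page for a query; every weight is an illustrative assumption.

    page is a dict with "url", "title" and "body" keys.
    """
    body_words = page["body"].lower().split()
    title_words = page["title"].lower().split()
    url = page["url"].lower()
    total = 0.0
    for keyword in keywords:
        keyword = keyword.lower()
        total += body_words.count(keyword)  # frequency in the page text
        if keyword in title_words:
            total += title_weight           # keyword appears in the title
        if keyword in url:
            total += url_weight             # keyword appears in the URL
    total += link_weight * inbound_links    # popularity from inbound links
    return total


def rank(pages, keywords, inbound_counts):
    """Sort pages from highest to lowest score."""
    return sorted(
        pages,
        key=lambda p: score(p, keywords, inbound_counts.get(p["url"], 0)),
        reverse=True,
    )


pages = [
    {"url": "http://y.example/news", "title": "News",
     "body": "crawler crawler crawler"},
    {"url": "http://x.example/crawler", "title": "Crawler basics",
     "body": "a crawler visits pages"},
]
inbound = {"http://y.example/news": 2}
for page in rank(pages, ["crawler"], inbound):
    print(page["url"])
```

With these particular weights, the page that mentions the keyword in its title and URL outranks the page that merely repeats it in the body, which mirrors the point that keyword location matters as well as frequency.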
Every search engine uses a ranking algorithm. The algorithm is a computerised formula devised to match relevant pages with a user query. Each search engine may have a different ranking algorithm, which analyses the pages in the engine's database to determine relevant responses to search queries. Different search engines also index information differently. As a result, the same query submitted to two different search engines may return pages in different orders or may retrieve different pages. Both keyword matching and website popularity are factors that determine relevance. The click-through popularity of a site, a measure of how often the site is visited, is another determinant of its rank.
Webmasters sometimes try to trick search engine algorithms to raise the ranks of their websites. The tricks include stuffing the home page of a site with keywords or using meta tags to deceive search engine ranking strategies. However, search engines are smart enough to respond. They keep revising their algorithms and reprogramming their systems to counter such tricks, so that we as researchers do not fall prey to illegal or unethical practices.