STORAGE AND RETRIEVAL MODULE



1 INTRODUCTION

The storage and retrieval module is a submodule of the search engine. Its task is to store the web pages in compressed format. It then parses each document, extracts keywords, anchor text, and related information, and stores this information in 'barrels'. All of this is done with the distributed environment in mind.

2 OPERATING ENVIRONMENT

The system is implemented in a distributed environment. The network consists of 16 computers connected by a 100 Mbps Ethernet network. Each computer has a 35 GB hard disk and a Pentium 4/3 processor. The machines run the Linux operating system. The network is connected to the Internet through a dedicated 2 Mbps line.

3 SYSTEM FEATURES

          1. Word to wordID mapper

          2. Document storage and retrieval

          3. Parser for documents

          4. Barrel storage subsystem

          5. Client-server interface

3.1 WORD TO WORD-ID MAPPER

The mapper generates a wordID using the MD5 hashing algorithm. MD5 produces a 16-byte digest for each word, and the most significant 4 bytes of that digest are used as the wordID. The choice of 4 bytes rests on the assumption that the number of distinct words will not exceed 4 giga (2^32) words. The template for interacting with the wordID mapper is specified under internal interfaces (section 5.2).
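As an illustration, the mapping can be sketched in C as follows, assuming OpenSSL's MD5() routine is available; the function name is ours, not the module's.

#include <stdint.h>
#include <string.h>
#include <openssl/md5.h>

uint32_t word_to_wordID(const char *word)
{
    unsigned char digest[MD5_DIGEST_LENGTH];   /* 16-byte MD5 digest */

    MD5((const unsigned char *)word, strlen(word), digest);

    /* Keep only the most significant 4 bytes of the digest as the wordID. */
    return ((uint32_t)digest[0] << 24) | ((uint32_t)digest[1] << 16) |
           ((uint32_t)digest[2] << 8)  |  (uint32_t)digest[3];
}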

3.2 DOCUMENT STORAGE AND RETRIEVAL

The crawlers divide the crawling of URLs based on the docID to which each URL maps. A document is stored on the same machine that fetched it. The storage and retrieval module exports Put and Get functions to store and retrieve documents, respectively: Put stores a document under its docID, and Get returns the document for a given docID. The crawler is provided with a client interface to send the data to the server executing on each machine.
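Plausible templates for these calls, in the style of the other templates in this document (the exact parameter types are assumptions; Get is assumed to fill a caller-supplied buffer):

Put(int docID, char* doc);

Get(int docID, char* doc);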

The server holds the responsibility of storing the data in compressed format. The gzip utility is used for compression; it is faster than the bzip, compress, and pack utilities, and its compression ratio is around 40-60%.
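The module invokes the gzip utility itself; purely as an illustration, the same DEFLATE compression can be performed in-process with the zlib library, as in this sketch (compile with -lz; the buffer sizes are arbitrary):

#include <stdio.h>
#include <zlib.h>

int main(void)
{
    const char doc[] = "<html><body>An example page body.</body></html>";
    Bytef out[256];
    uLongf outlen = sizeof(out);

    /* compress() applies the same DEFLATE algorithm that gzip uses. */
    if (compress(out, &outlen, (const Bytef *)doc, sizeof(doc)) != Z_OK)
        return 1;

    printf("compressed %lu bytes down to %lu\n",
           (unsigned long)sizeof(doc), (unsigned long)outlen);
    return 0;
}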

3.3 PARSER FOR DOCUMENTS

The objective of the parser is to convert a given HTML document into a list of wordIDs and a HITLIST for each wordID. It also extracts the ANCHORTEXT for each document referred to in the current document. The parser only parses documents sent to it by the local crawler process.


3.3.1 FUNCTIONALITY

The deliverable for this component is a single parsing module. The input to the parsing module is an ASCII text page, which is usually tagged (via HTML). The output from the parser is a list of text record representations of (wordID, HITLIST) tuples. There are also records corresponding to the anchor text in the page being parsed. These records have an identical structure except for an additional field holding the docID of the page pointed to. Anchor text records are kept separate because they will be stored separately.


3.3.2 IMPLEMENTATION DETAILS

We implemented the parser in PERL because of the flexibility of its text-processing functions. The input to the parser is an HTML page in a known shared memory segment. The use of shared memory segments reduces disk I/O as well as unnecessary copy operations, which improves the performance of the overall system. The parser is triggered by a signal from the crawler, which places the downloaded page in a shared memory segment. The parser then processes the page as discussed below.
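A sketch of this handoff, using System V shared memory and SIGUSR1 as the wake-up signal; the key, segment size, and signal number are assumptions, and the sketch is in C even though the actual parser is written in PERL.

#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

#define PAGE_KEY  0x50415253   /* assumed key agreed on with the crawler */
#define PAGE_SIZE (1 << 20)    /* assumed 1 MB segment per downloaded page */

static volatile sig_atomic_t page_ready = 0;

static void on_signal(int sig) { (void)sig; page_ready = 1; }

int main(void)
{
    int id = shmget(PAGE_KEY, PAGE_SIZE, IPC_CREAT | 0600);
    char *page = shmat(id, NULL, 0);   /* read the page in place: no copy */

    signal(SIGUSR1, on_signal);        /* crawler raises this after a download */

    for (;;) {
        pause();                       /* sleep until the crawler signals */
        if (page_ready) {
            page_ready = 0;
            printf("parsing %zu-byte page\n", strlen(page));
        }
    }
}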

An alternative implementation of the parser could have exploited the features of parser generators like YACC, but due to performance constraints we restricted ourselves to PERL.


At a higher level, the parser performs the following steps.

The parser is modeled as a state machine. It moves through the page character by character, changing state as it goes; depending on the current state, the appropriate actions are taken.


For every word that the state machine considers, the following process is applied. The word is checked against a lexicon of stop words and rejected if it is a stop word. Otherwise, the word is passed through a stemming algorithm; in particular, we use Porter's stemming algorithm. The stemmed word is converted into a wordID and checked against a hash table of wordIDs for the document being parsed. A new (wordID, HITLIST) tuple is created if necessary, and an appropriate entry is added to the corresponding HITLIST. The position of the word and its font size description constitute the HITLIST entry for a given word.
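A C sketch of this per-word pipeline; stem() is a stub standing in for Porter's algorithm, hash_word() stands in for the MD5-based mapper of section 3.1, and the stop list is illustrative.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static const char *stop_words[] = { "the", "a", "an", "of", "and", NULL };

static int is_stop_word(const char *w)
{
    for (int i = 0; stop_words[i] != NULL; i++)
        if (strcmp(w, stop_words[i]) == 0)
            return 1;
    return 0;
}

/* Stub: the real parser applies Porter's stemming algorithm here. */
static const char *stem(const char *w) { return w; }

/* Stub: the real mapper keeps 4 bytes of an MD5 digest (section 3.1). */
static uint32_t hash_word(const char *w)
{
    uint32_t h = 5381;
    while (*w) h = h * 33 + (unsigned char)*w++;
    return h;
}

static void process_word(const char *w, int position, int fontsize)
{
    if (is_stop_word(w))
        return;                        /* stop words are rejected outright */
    uint32_t id = hash_word(stem(w));
    /* Here the parser looks id up in the per-document hash table, creates
       a (wordID, HITLIST) tuple if absent, and appends a hit recording
       the position and the font size. */
    printf("wordID %08x: position %d, fontsize %d\n",
           (unsigned)id, position, fontsize);
}

int main(void)
{
    process_word("the", 0, 3);         /* rejected as a stop word */
    process_word("retrieval", 1, 3);   /* stemmed, hashed, recorded */
    return 0;
}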


For example, encountering a <script> tag causes the parser to move into the IGNORE state. Characters encountered in the IGNORE state are not considered for processing. In this example the IGNORE state is exited on encountering a </script> tag; thus we prune the scripts in the page.
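The IGNORE state can be sketched as follows; the snippet only recognizes literal <script> and </script> tags, whereas the real state machine handles arbitrary tags.

#include <stdio.h>
#include <string.h>

int main(void)
{
    const char *page = "hello <script>var x = 1;</script> world";
    const char *p = page;
    int ignore = 0;                      /* 1 while inside <script>...</script> */

    while (*p) {
        if (!ignore && strncmp(p, "<script>", 8) == 0) {
            ignore = 1;                  /* enter the IGNORE state */
            p += 8;
        } else if (ignore && strncmp(p, "</script>", 9) == 0) {
            ignore = 0;                  /* leave the IGNORE state */
            p += 9;
        } else {
            if (!ignore)
                putchar(*p);             /* only non-script text is processed */
            p++;
        }
    }
    putchar('\n');                       /* prints: hello  world */
    return 0;
}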


Similarly, when we are inside an anchor (<a href=...>) tag, we extract the anchor text words and put them in temporary storage; these words are then stemmed and converted into wordIDs. The anchor text wordIDs are stored in the anchor text file via the interface provided by the storage module. The storage module provides a "send" call which takes the data and a port number; depending on the port number, the data is sent either to the anchor text file or to the main data file. Both files can be seen as relations keyed on docID.


This information is stored as a tuple in the main data file. The formats of the main data file and the anchor text file are explained in section 5. The more important problem of extracting keywords from the document is beyond the scope of our implementation.
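A plausible template for the "send" call, in the style of the other templates in this document (the parameter types are assumptions; the text above specifies only that the call takes the data and a port number):

send(char* data, int port);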

3.4 BARREL STORAGE SUBSYSTEM

The barrel storage subsystem gets <docID, wordID, hitlistinfo> records from the parser. Based on the wordID, it distributes each record to the appropriate machine. It distinguishes between words appearing in anchor tags and in normal text: the former are stored in the anchor text file, along with the docID of the document pointed to by the URL, while normal text is stored in the keyword file. The storage system provides a client that the parser module uses to send each tuple to the server, which stores the tuple in the respective file.
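The distribution step can be sketched as follows; the modulo scheme is an assumption, but any mapping that spreads wordIDs uniformly over the 16 machines of section 2 would do.

#include <stdint.h>
#include <stdio.h>

#define NUM_MACHINES 16

/* Pick the machine whose barrel will hold records for this wordID. */
static int barrel_machine(uint32_t wordID)
{
    return wordID % NUM_MACHINES;   /* uniform if wordIDs are uniform */
}

int main(void)
{
    printf("wordID 0x9f3a21c7 -> machine %d\n", barrel_machine(0x9f3a21c7));
    return 0;
}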



3.5 CLIENT-SERVER INTERFACE

The external and internal interfaces to the storage and retrieval subsystem follow the client-server model. In this model, we provide a client to the various modules that want to store and retrieve data; a server running on each machine services the clients' requests. Different ports are assigned for different kinds of data, such as main text records and anchor text records (see section 3.3).

Client template

client(char *IP_ADDRESS, int port, char* buffer)
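An illustrative invocation; the IP address, port number, and buffer are hypothetical:

client("192.168.0.5", 5001, record_buffer);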

4 EXTERNAL INTERFACE REQUIREMENTS

The module interfaces with the following:

          1. Crawler Module

          2. Indexer Module

          3. Query Processing Module

4.1 CRAWLER MODULE

The inputs from the crawler module are the docID, the URL, and the document.

Templates to interact with the crawler:

Store(int docID, char* URL, char* doc);

Graph(char* adjnodelist);
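Illustrative invocations; the argument values are hypothetical, and Graph is assumed (from its parameter name) to receive the adjacency list of the crawled page's out-links:

Store(1024, "http://www.example.com/", doc_buffer);

Graph(adjnodelist_buffer);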

4.2 INDEXER MODULE

4.3 QUERY PROCESSING MODULE

The input from the query processing module is a word, for which the corresponding wordID is returned (see the wordID mapper template in section 5.2).

5 INTERNAL INTERFACE REQUIREMENTS

5.1 PARSER - STORAGE

The parser sends the following data to the storage module using the same client-server model that is used for the external interfaces. The format of the data sent by the parser is as follows:


STRUCTURE OF MAIN TEXT TUPLE:

<docID (8 bytes), pointer (doc location), page rank (1 byte)>

<wordID, position, fontsize, capitalization, metatag>



STRUCTURE OF ANCHOR WORD TUPLE:

<docID (8 bytes), wordID>
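A hedged C rendering of the two tuple layouts; only the widths stated above (docID: 8 bytes, page rank: 1 byte, wordID: 4 bytes per section 3.1) come from this document, and all other field widths are assumptions.

#include <stdint.h>

struct main_doc_tuple {
    uint64_t docID;          /* 8 bytes, per the spec */
    uint64_t doc_location;   /* pointer to the stored document (width assumed) */
    uint8_t  page_rank;      /* 1 byte, per the spec */
};

struct hit_tuple {
    uint32_t wordID;         /* 4-byte MD5 prefix, section 3.1 */
    uint16_t position;       /* width assumed */
    uint8_t  fontsize;       /* width assumed */
    uint8_t  capitalization; /* width assumed */
    uint8_t  metatag;        /* width assumed */
};

struct anchor_tuple {
    uint64_t docID;          /* docID of the page pointed to, 8 bytes */
    uint32_t wordID;
};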

5.2 PARSER - wordID MAPPER

The wordID mapper takes a word as input and returns its wordID. The template of the function is as follows:

void wordID_mapper(char* word, int* wordID)
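An illustrative call; the variable names are hypothetical:

char word[] = "retrieval";
int id;
wordID_mapper(word, &id);   /* id now holds the 4-byte wordID */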