
The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page
{sergey, page}@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305

Abstract

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/
To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date.
Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google

1. Introduction

(Note: There are two versions of this paper -- a longer full version and a shorter printed version. The full version is available on the web and the conference CD-ROM.)
The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high quality human maintained indices such as Yahoo! or with search engines. Human maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines. We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search engines.

1.1 Web Search Engines -- Scaling Up: 1994 - 2000

Search engine technology has had to scale dramatically to keep up with the growth of the web. In 1994, one of the first web search engines, the World Wide Web Worm (WWWW) [McBryan 94] had an index of 110,000 web pages and web accessible documents. As of November, 1997, the top search engines claim to index from 2 million (WebCrawler) to 100 million web documents (from Search Engine Watch). It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents. At the same time, the number of queries search engines handle has grown incredibly too. In March and April 1994, the World Wide Web Worm received an average of about 1500 queries per day. In November 1997, Altavista claimed it handled roughly 20 million queries per day. With the increasing number of users on the web, and automated systems which query search engines, it is likely that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.

1.2. Google: Scaling with the Web

Creating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.

These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress such as disk seek time and operating system robustness. In designing Google, we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access (see Section 4.2). Further, we expect that the cost to index and store text or HTML will eventually decline relative to the amount that will be available (see Appendix B). This will result in favorable scaling properties for centralized systems like Google.

1.3 Design Goals

1.3.1 Improved Search Quality

Our main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. According to Best of the Web 1994 -- Navigators, 'The best navigation service should make it easy to find almost anything on the Web (once all the data is entered).' However, the Web of 1997 is quite different. Anyone who has used a search engine recently can readily testify that the completeness of the index is not the only factor in the quality of search results. 'Junk results' often wash out any results that a user is interested in. In fact, as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results). One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude, but the user's ability to look at documents has not. People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision (number of relevant documents returned, say in the top tens of results). Indeed, we want our notion of 'relevant' to only include the very best documents since there may be tens of thousands of slightly relevant documents. This very high precision is important even at the expense of recall (the total number of relevant documents the system is able to return). There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]. In particular, link structure [Page 98] and link text provide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2).

1.3.2 Academic Search Engine Research

Aside from tremendous growth, the Web has also become increasingly commercial over time. In 1993, 1.5% of web servers were on .com domains. This number grew to over 60% in 1997. At the same time, search engines have migrated from the academic domain to the commercial. Up until now most search engine development has gone on at companies with little publication of technical details. This causes search engine technology to remain largely a black art and to be advertising oriented (see Appendix A). With Google, we have a strong goal to push more development and understanding into the academic realm.

Another important design goal was to build systems that reasonable numbers of people can actually use. Usage was important to us because we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems. For example, there are many tens of millions of searches performed every day. However, it is very difficult to get this data, mainly because it is considered commercially valuable.

Our final design goal was to build an architecture that can support novel research activities on large-scale web data. To support novel research uses, Google stores all of the actual documents it crawls in compressed form. One of our main goals in designing Google was to set up an environment where other researchers can come in quickly, process large chunks of the web, and produce interesting results that would have been very difficult to produce otherwise. In the short time the system has been up, there have already been several papers using databases generated by Google, and many others are underway. Another goal we have is to set up a Spacelab-like environment where researchers or even students can propose and do interesting experiments on our large-scale web data.

2. System Features

The Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank and is described in detail in [Page 98]. Second, Google utilizes link text to improve search results.

2.1 PageRank: Bringing Order to the Web

The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's 'PageRank', an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). For the type of full text searches in the main Google system, PageRank also helps a great deal.

2.1.1 Description of PageRank Calculation

Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one.

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation. There are many other details which are beyond the scope of this paper.
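As a concrete illustration, the iterative calculation can be sketched in a few lines of Python. The paper gives no code; the four-page link graph and the function below are purely illustrative.

```python
# Iterative PageRank as defined above: PR(A) = (1-d) + d * sum(PR(T)/C(T)).

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it points to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # start from a uniform guess
    for _ in range(iterations):
        new = {}
        for page in pages:
            # Sum contributions PR(T)/C(T) over every page T linking to `page`.
            incoming = sum(pr[t] / len(links[t])
                           for t in pages if page in links[t])
            new[page] = (1 - d) + d * incoming
        pr = new
    return pr

# A hypothetical link graph; the paper itself gives no example graph.
links = {
    'A': ['B', 'C'],
    'B': ['C'],
    'C': ['A'],
    'D': ['C'],
}
ranks = pagerank(links)
# C gathers links from A, B, and D, so it ranks highest; D has no
# inbound links at all, so its rank settles at the floor value 1 - d.
```

Note that with this formulation the ranks average to one rather than sum to one; dividing by the number of pages recovers the probability-distribution view.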

2.1.2 Intuitive Justification

PageRank can be thought of as a model of user behavior. We assume there is a 'random surfer' who is given a web page at random and keeps clicking on links, never hitting 'back' but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. And, the d damping factor is the probability at each page the 'random surfer' will get bored and request another random page. One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank, again see [Page 98].
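The random surfer model can also be simulated directly. The Monte Carlo sketch below (ours, not from the paper; the graph is a toy) estimates visit frequencies, which approximate PageRank up to normalization.

```python
import random

def simulate_surfer(links, d=0.85, steps=200_000, seed=0):
    """Estimate visit frequency: with probability d follow a random outlink,
    otherwise (or at a dead end) jump to a random page -- the 'bored' case."""
    rng = random.Random(seed)
    pages = list(links)
    visits = {p: 0 for p in pages}
    page = rng.choice(pages)
    for _ in range(steps):
        visits[page] += 1
        if links[page] and rng.random() < d:
            page = rng.choice(links[page])   # keep clicking, never 'back'
        else:
            page = rng.choice(pages)         # get bored, restart at random
    return {p: visits[p] / steps for p in pages}

# Illustrative toy graph, not from the paper.
links = {'A': ['B', 'C'], 'B': ['C'], 'C': ['A'], 'D': ['C']}
freq = simulate_surfer(links)
# The heavily cited page C is visited most often; the uncited page D least.
```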

Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo's homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.

2.2 Anchor Text

The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.

This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm [McBryan 94] especially because it helps search non-text information, and expands the search coverage with fewer downloaded documents. We use anchor propagation mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data which must be processed. In our current crawl of 24 million pages, we had over 259 million anchors which we indexed.

2.3 Other Features

Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font are weighted higher than other words. Third, full raw HTML of pages is available in a repository.

3 Related Work

Search research on the web has a short and concise history. The World Wide Web Worm (WWWW) [McBryan 94] was one of the first web search engines. It was subsequently followed by several other academic search engines, many of which are now public companies. Compared to the growth of the Web and the importance of search engines there are precious few documents about recent search engines [Pinkerton 94]. According to Michael Mauldin (chief scientist, Lycos Inc) [Mauldin], 'the various services (including Lycos) closely guard the details of these databases'. However, there has been a fair amount of work on specific features of search engines. Especially well represented is work which can get results by post-processing the results of existing commercial search engines, or produce small scale 'individualized' search engines. Finally, there has been a lot of research on information retrieval systems, especially on well controlled collections. In the next two sections, we discuss some areas where this research needs to be extended to work better on the web.

3.1 Information Retrieval

Work in information retrieval systems goes back many years and is well developed [Witten 94]. However, most of the research on information retrieval systems is on small well controlled homogeneous collections such as collections of scientific papers or news stories on a related topic. Indeed, the primary benchmark for information retrieval, the Text Retrieval Conference [TREC 96], uses a fairly small, well controlled collection for their benchmarks. The 'Very Large Corpus' benchmark is only 20GB compared to the 147GB from our crawl of 24 million web pages. Things that work well on TREC often do not produce good results on the web. For example, the standard vector space model tries to return the document that most closely approximates the query, given that both query and document are vectors defined by their word occurrence. On the web, this strategy often returns very short documents that are the query plus a few words. For example, we have seen a major search engine return a page containing only 'Bill Clinton Sucks' and a picture from a 'Bill Clinton' query. Some argue that on the web, users should specify more accurately what they want and add more words to their query. We disagree vehemently with this position. If a user issues a query like 'Bill Clinton' they should get reasonable results since there is an enormous amount of high quality information available on this topic. Given examples like these, we believe that the standard information retrieval work needs to be extended to deal effectively with the web.

3.2 Differences Between the Web and Well Controlled Collections

The web is a vast collection of completely uncontrolled heterogeneous documents. Documents on the web have extreme variation internal to the documents, and also in the external meta information that might be available. For example, documents differ internally in their language (both human and programming), vocabulary (email addresses, links, zip codes, phone numbers, product numbers), type or format (text, HTML, PDF, images, sounds), and may even be machine generated (log files or output from a database). On the other hand, we define external meta information as information that can be inferred about a document, but is not contained within it. Examples of external meta information include things like reputation of the source, update frequency, quality, popularity or usage, and citations. Not only are the possible sources of external meta information varied, but the things that are being measured vary many orders of magnitude as well. For example, compare the usage information from a major homepage, like Yahoo's which currently receives millions of page views every day with an obscure historical article which might receive one view every ten years. Clearly, these two items must be treated very differently by a search engine.

Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic, and companies which deliberately manipulate search engines for profit become a serious problem. This is a problem that has not been addressed in traditional closed information retrieval systems. Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. There are even numerous companies which specialize in manipulating search engines for profit.

4 System Anatomy

First, we will provide a high level discussion of the architecture. Then, there are some in-depth descriptions of important data structures. Finally, the major applications -- crawling, indexing, and searching -- will be examined in depth.

4.1 Google Architecture Overview

In this section, we will give a high level overview of how the whole system works as pictured in Figure 1. Further sections will discuss the applications and data structures not mentioned in this section. Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.

In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of 'barrels', creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
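In miniature, the sorter's resorting step looks like the following sketch. The dict-based layout and hit names are illustrative; the real barrels are packed binary structures with compressed hit lists.

```python
# A toy forward barrel: docID -> list of (wordID, hit list) postings.
forward_barrel = {
    1: [(7, ['hit_a']), (3, ['hit_b'])],
    2: [(3, ['hit_c'])],
    3: [(7, ['hit_d']), (9, ['hit_e'])],
}

def invert(barrel):
    """Resort by wordID: wordID -> doclist of (docID, hit list) pairs."""
    inverted = {}
    for doc_id, postings in barrel.items():
        for word_id, hits in postings:
            inverted.setdefault(word_id, []).append((doc_id, hits))
    # Keep each doclist sorted by docID, which makes multi-word
    # queries a cheap merge of sorted lists.
    for doclist in inverted.values():
        doclist.sort()
    return inverted

inverted_index = invert(forward_barrel)
```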

4.2 Major Data Structures

Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although CPUs and bulk input output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.

4.2.1 BigFiles

BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles allocation and deallocation of file descriptors, since the operating systems do not provide enough for our needs. BigFiles also support rudimentary compression options.

4.2.2 Repository

Figure 2. Repository Data Structure
The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC 1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL as can be seen in Figure 2. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.
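A simplified version of this record layout can be written in a few lines of Python. The field widths and byte order below are our own illustrative choices, not the paper's actual on-disk format; only the scheme (zlib-compressed pages stored back to back, each prefixed by docID, length, and URL) follows the text.

```python
import io
import struct
import zlib

def write_record(out, doc_id, url, html):
    """Append one record: docID, URL length, compressed length, URL, page."""
    payload = zlib.compress(html.encode())
    url_bytes = url.encode()
    out.write(struct.pack('<QII', doc_id, len(url_bytes), len(payload)))
    out.write(url_bytes)
    out.write(payload)

def read_records(buf):
    """Scan the repository sequentially -- no other structure is needed."""
    while True:
        header = buf.read(16)
        if not header:
            return
        doc_id, url_len, data_len = struct.unpack('<QII', header)
        url = buf.read(url_len).decode()
        html = zlib.decompress(buf.read(data_len)).decode()
        yield doc_id, url, html

repo = io.BytesIO()
write_record(repo, 1, 'http://example.com/', '<html>hello</html>')
write_record(repo, 2, 'http://example.org/', '<html>world</html>')
repo.seek(0)
records = list(read_records(repo))
```

Because every record is self-describing, the whole repository can be rebuilt into other structures by a single sequential pass, which is exactly the consistency property the paper highlights.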

4.2.3 Document Index

The document index keeps information about each document. It is a fixed width ISAM (Index sequential access mode) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title. Otherwise the pointer points into the URLlist which contains just the URL. This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seek during a search.

Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL checksums with their corresponding docIDs and is sorted by checksum. In order to find the docID of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file to find its docID. URLs may be converted into docIDs in batch by doing a merge with this file. This is the technique the URLresolver uses to turn URLs into docIDs. This batch mode of update is crucial because otherwise we must perform one seek for every link, which, assuming one disk, would take more than a month for our 322 million link dataset.
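An in-memory sketch of the single-URL lookup: the paper does not name the checksum function, so crc32 stands in for it here, and `bisect` plays the role of the binary search over the sorted on-disk file.

```python
import bisect
import zlib

def checksum(url):
    # crc32 is a stand-in; the paper does not specify the checksum used.
    return zlib.crc32(url.encode())

# The conversion file: (checksum, docID) pairs sorted by checksum.
urls = ['http://a.example/', 'http://b.example/', 'http://c.example/']
table = sorted((checksum(u), doc_id) for doc_id, u in enumerate(urls))
keys = [c for c, _ in table]

def url_to_docid(url):
    """One binary search per URL -- the per-link seek that batching avoids."""
    i = bisect.bisect_left(keys, checksum(url))
    if i < len(keys) and keys[i] == checksum(url):
        return table[i][1]
    return None
```

The batch path the paper describes is the natural extension: sort the query URLs by checksum too, then merge the two sorted streams in one pass instead of seeking once per link.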

4.2.4 Lexicon

The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts -- a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions, the list of words has some auxiliary information which is beyond the scope of this paper to explain fully.

4.2.5 Hit Lists

A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization -- simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding. The details of the hits are shown in Figure 3.

Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096). Font size is represented relative to the rest of the document using three bits (only 7 values are actually used because 111 is the flag that signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docIDhash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font.
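A plain hit packs into 16 bits with ordinary shifts and masks. The particular bit ordering below is our assumption, since the paper only gives the field widths:

```python
def pack_plain_hit(capitalized, font_size, position):
    """1 capitalization bit + 3 font bits + 12 position bits = 16 bits.
    Font value 7 (binary 111) is reserved as the fancy-hit flag, and any
    position beyond the 12-bit range is clamped to the top value."""
    assert 0 <= font_size <= 6, 'font size 7 flags a fancy hit'
    position = min(position, 4095)
    return (capitalized << 15) | (font_size << 12) | position

def unpack_plain_hit(hit):
    return (hit >> 15) & 0x1, (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_plain_hit(capitalized=1, font_size=3, position=42)
# the packed value fits in two bytes and round-trips losslessly
```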

Figure 3. Forward and Reverse Indexes and the Lexicon

The length of a hit list is stored before the hits themselves. To save space, the length of the hit list is combined with the wordID in the forward index and the docID in the inverted index. This limits it to 8 and 5 bits respectively (there are some tricks which allow 8 bits to be borrowed from the wordID). If the length is longer than would fit in that many bits, an escape code is used in those bits, and the next two bytes contain the actual length.

4.2.6 Forward Index

The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordID's. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordID's with hitlists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs but the difference is very small for a reasonable number of buckets and saves considerable time and coding complexity in the final indexing phase done by the sorter. Furthermore, instead of storing actual wordID's, we store each wordID as a relative difference from the minimum wordID that falls into the barrel the wordID is in. This way, we can use just 24 bits for the wordID's in the unsorted barrels, leaving 8 bits for the hit list length.
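The relative-difference trick is simple bit packing. As a sketch (the exact layout is assumed, not specified in the paper beyond the 24/8 split):

```python
def pack_posting(word_id, barrel_min, hit_count):
    """24-bit relative wordID plus 8-bit hit list length in one 32-bit word."""
    delta = word_id - barrel_min
    assert 0 <= delta < 1 << 24 and 0 <= hit_count < 1 << 8
    return (delta << 8) | hit_count

def unpack_posting(packed, barrel_min):
    return barrel_min + (packed >> 8), packed & 0xFF

barrel_min = 5_000_000            # smallest wordID assigned to this barrel
packed = pack_posting(5_000_123, barrel_min, hit_count=17)
```

Longer hit lists would use the escape-code mechanism described in Section 4.2.5 rather than a wider length field.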

4.2.7 Inverted Index

The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. It points to a doclist of docID's together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.

An important issue is in what order the docID's should appear in the doclist. One simple solution is to store them sorted by docID. This allows for quick merging of different doclists for multiple word queries. Another option is to store them sorted by a ranking of the occurrence of the word in each document. This makes answering one word queries trivial and makes it likely that the answers to multiple word queries are near the start. However, merging is much more difficult. Also, this makes development much more difficult in that a change to the ranking function requires a rebuild of the index. We chose a compromise between these options, keeping two sets of inverted barrels -- one set for hit lists which include title or anchor hits and another set for all hit lists. This way, we check the first set of barrels first and if there are not enough matches within those barrels we check the larger ones.
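In outline, the two-tier lookup reduces to a short fallback function over doclists. The dict layout and threshold below are a toy; real doclists carry hit lists and are intersected by merging the docID-sorted lists in step.

```python
def intersect(doclists):
    """DocIDs present in every doclist (sets stand in for a sorted merge)."""
    result = set(doclists[0])
    for doclist in doclists[1:]:
        result &= set(doclist)
    return sorted(result)

def search(words, fancy_barrels, full_barrels, enough=2):
    """Check title/anchor barrels first; fall back to all hits if too few."""
    fancy = intersect([fancy_barrels.get(w, []) for w in words])
    if len(fancy) >= enough:
        return fancy
    return intersect([full_barrels.get(w, []) for w in words])

# Toy barrels: word -> docID-sorted doclist.
fancy_barrels = {'web': [1], 'search': [1, 4]}
full_barrels = {'web': [1, 2, 3, 4], 'search': [1, 3, 4, 5]}
matches = search(['web', 'search'], fancy_barrels, full_barrels)
# only doc 1 matches on title/anchor hits, so the larger barrels are
# consulted as the fallback
```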

4.3 Crawling the Web

Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
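A toy illustration of the queues-and-states design (not Google's crawler code): each fetch moves between per-state queues, and a shared DNS cache avoids repeated lookups. Network and DNS IO are stubbed out so the sketch is self-contained, and only two of the four states are modeled.

```python
# Illustrative queue-per-state crawler skeleton. A real crawler would
# use asynchronous IO and track four states (DNS, connect, request,
# response); here the fetch states are collapsed and IO is faked.

from collections import deque

DNS_CACHE = {}  # host -> IP; shared across all fetches by one crawler

def resolve(host):
    """Stub resolver: fabricate a stable fake IP per host."""
    if host not in DNS_CACHE:
        DNS_CACHE[host] = f"10.0.0.{len(DNS_CACHE) % 254 + 1}"
    return DNS_CACHE[host]

def crawl(urls):
    dns_q, fetch_q, done = deque(urls), deque(), []
    while dns_q or fetch_q:
        if dns_q:                        # state 1: DNS lookup (cached)
            url = dns_q.popleft()
            host = url.split("/")[2]
            fetch_q.append((url, resolve(host)))
        if fetch_q:                      # states 2-4: connect/request/response
            url, ip = fetch_q.popleft()
            done.append((url, ip))       # stand-in for the fetched page
    return done
```

The point of the structure is that hundreds of fetches can sit in different queues at once, so a slow DNS lookup or server never blocks the rest.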

It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries, generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, 'Wow, you looked at a lot of pages from my web site. How did you like it?' There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, 'This page is copyrighted and should not be indexed', which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up.

4.4 Indexing the Web

  • Parsing -- Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser which runs at a reasonable speed and is very robust involved a fair amount of work.
  • Indexing Documents into Barrels -- After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table -- the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordID's, their occurrences in the current document are translated into hit lists and are written into the forward barrels. The main difficulty with parallelization of the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run in parallel and then the small log file of extra words can be processed by one final indexer.
  • Sorting -- In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage. Also, we parallelize the sorting phase to use as many machines as we have simply by running multiple sorters, which can process different buckets at the same time. Since the barrels don't fit into main memory, the sorter further subdivides them into baskets which do fit into memory based on wordID and docID. Then the sorter loads each basket into memory, sorts it, and writes its contents into the short inverted barrel and the full inverted barrel.
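The basket trick in the Sorting step can be sketched as below, with toy in-memory entries standing in for the on-disk barrels; the partitioning-by-wordID-range detail is an assumption consistent with the description above.

```python
# Sketch of sorting one forward barrel via memory-sized baskets:
# partition entries by wordID range, sort each basket independently,
# and emit baskets in order, yielding a barrel sorted by (wordID, docID).

def sort_barrel(entries, num_baskets, max_word_id):
    """entries: list of (docID, wordID) pairs from one forward barrel.
    Returns them sorted by (wordID, docID), basket by basket."""
    span = (max_word_id + num_baskets) // num_baskets
    baskets = [[] for _ in range(num_baskets)]
    for doc_id, word_id in entries:        # partition: cheap, streaming
        baskets[word_id // span].append((doc_id, word_id))
    inverted = []
    for basket in baskets:                 # each basket fits in memory
        basket.sort(key=lambda e: (e[1], e[0]))
        inverted.extend(basket)
    return inverted
```

Because wordID ranges are disjoint and processed in order, concatenating the sorted baskets gives a fully sorted barrel without ever holding the whole barrel in memory.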

4.5 Searching

The goal of searching is to provide quality search results efficiently. Many of the large commercial search engines seemed to have made great progress in terms of efficiency. Therefore, we have focused more on quality of search in our research, although we believe our solutions are scalable to commercial volumes with a bit more effort. The Google query evaluation process is shown in Figure 4.
  1. Parse the query.
  2. Convert words into wordIDs.
  3. Seek to the start of the doclist in the short barrel for every word.
  4. Scan through the doclists until there is a document that matches all the search terms.
  5. Compute the rank of that document for the query.
  6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
  7. If we are not at the end of any doclist go to step 4.
  8. Sort the documents that have matched by rank and return the top k.
Figure 4. Google Query Evaluation
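The loop of Figure 4 can be condensed into a schematic, assuming barrels modeled as wordID-to-doclist dicts, a stubbed ranking function, and the 40,000-match cutoff the text describes; the real searcher scans doclists on disk rather than intersecting sets in memory.

```python
# Schematic query evaluation: try the short (title/anchor) barrels
# first, fall back to the full barrels, rank matches, stop early once
# enough matches are found, then sort by rank (step 8).

MAX_MATCHES = 40_000   # response-time cutoff from the text

def evaluate(word_ids, short_barrels, full_barrels, rank):
    ranked = {}                                     # docID -> rank
    for barrels in (short_barrels, full_barrels):   # short first
        doclists = [barrels.get(w, []) for w in word_ids]
        if not all(doclists):
            continue                                # a word is absent here
        common = set(doclists[0])                   # docs matching all terms
        for dl in doclists[1:]:
            common &= set(dl)
        for doc_id in common:
            ranked.setdefault(doc_id, rank(doc_id, word_ids))
            if len(ranked) >= MAX_MATCHES:
                break
        if len(ranked) >= MAX_MATCHES:
            break
    return sorted(ranked, key=ranked.get, reverse=True)
```

The two-tier barrel layout shows up as the outer loop: if the short barrels already yield enough matches, the larger full barrels are never touched.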

To put a limit on response time, once a certain number (currently 40,000) of matching documents are found, the searcher automatically goes to step 8 in Figure 4. This means that it is possible that sub-optimal results would be returned. We are currently investigating other ways to solve this problem. In the past, we sorted the hits according to PageRank, which seemed to improve the situation.

4.5.1 The Ranking System

Google maintains much more information about web documents than typical search engines. Every hitlist includes position, font, and capitalization information. Additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult. We designed our ranking function so that no particular factor can have too much influence. First, consider the simplest case -- a single word query. In order to rank a document with a single word query, Google looks at that document's hit list for that word. Google considers each hit to be one of several different types (title, anchor, URL, plain text large font, plain text small font, ...), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list. Then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off so that more than a certain count will not help. We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.
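The single-word scoring just described amounts to a small dot product. In the sketch below the specific type-weights, the damping cap, and the PageRank mixing (a weighted sum here) are all illustrative assumptions; the paper does not publish its actual values or combination rule.

```python
# Single-word IR score: count hits per type, damp each count so extra
# hits stop helping past a cap, then dot-product with the type-weights.
# All numeric values below are made up for illustration.

from collections import Counter

TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "large_font": 3.0, "plain": 1.0}        # assumed weights

def count_weight(n, cap=8):
    """Linear at first, then flat: more than `cap` hits does not help."""
    return min(n, cap)

def ir_score(hit_types):
    """hit_types: list of hit-type labels from one document's hit list."""
    counts = Counter(hit_types)
    return sum(TYPE_WEIGHTS[t] * count_weight(c) for t, c in counts.items())

def final_rank(hit_types, pagerank, alpha=0.5):
    # The paper only says the IR score is "combined with PageRank";
    # a weighted sum is one plausible combination.
    return alpha * ir_score(hit_types) + (1 - alpha) * pagerank
```

The cap on count-weights is what keeps any single factor -- say, a page that repeats a word thousands of times -- from dominating the rank.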

For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned through at once so that hits occurring close together in a document are weighted higher than hits occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are matched together. For every matched set of hits, a proximity is computed. The proximity is based on how far apart the hits are in the document (or anchor) but is classified into 10 different value 'bins' ranging from a phrase match to 'not even close'. Counts are computed not only for every type of hit but for every type and proximity. Every type and proximity pair has a type-prox-weight. The counts are converted into count-weights and we take the dot product of the count-weights and the type-prox-weights to compute an IR score. All of these numbers and matrices can be displayed with the search results using a special debug mode. These displays have been very helpful in developing the ranking system.

4.5.2 Feedback

The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. In order to do this, we have a user feedback mechanism in the search engine. A trusted user may optionally evaluate all of the results that are returned. This feedback is saved. Then when we modify the ranking function, we can see the impact of this change on all previous searches which were ranked. Although far from perfect, this gives us some idea of how a change in the ranking function affects the search results.

5 Results and Performance

Query: bill clinton

http://www.whitehouse.gov/
100.00% (no date) (0K)
http://www.whitehouse.gov/

Office of the President
99.67% (Dec 23 1996) (2K)
http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html

Welcome To The White House
99.98% (Nov 09 1997) (5K)
http://www.whitehouse.gov/WH/Welcome.html

Send Electronic Mail to the President
99.86% (Jul 14 1997) (5K)
http://www.whitehouse.gov/WH/Mail/html/Mail_President.html

mailto:president@whitehouse.gov
99.98%

mailto:President@whitehouse.gov
99.27%

The 'Unofficial' Bill Clinton
94.06% (Nov 11 1997) (14K)
http://zpub.com/un/un-bc.html

Bill Clinton Meets The Shrinks
86.27% (Jun 29 1997) (63K)
http://zpub.com/un/un-bc9.html

President Bill Clinton - The Dark Side
97.27% (Nov 10 1997) (15K)
http://www.realchange.org/clinton.htm

$3 Bill Clinton
94.73% (no date) (4K)
http://www.gatewy.net/~tjohnson/clinton1.html

Figure 4. Sample Results from Google
The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches. As an example which illustrates the use of PageRank, anchor text, and proximity, Figure 4 shows Google's results for a search on 'bill clinton'. These results demonstrate some of Google's features. The results are clustered by server. This helps considerably when sifting through result sets. A number of results are from the whitehouse.gov domain which is what one may reasonably expect from such a search. Currently, most major commercial search engines do not return any results from whitehouse.gov, much less the right ones. Notice that there is no title for the first result. This is because it was not crawled. Instead, Google relied on anchor text to determine this was a good answer to the query. Similarly, the fifth result is an email address which, of course, is not crawlable. It is also a result of anchor text.

All of the results are reasonably high quality pages and, at last check, none were broken links. This is largely because they all have high PageRank. The PageRanks are the percentages in red along with bar graphs. Finally, there are no results about a Bill other than Clinton or about a Clinton other than Bill. This is because we place heavy importance on the proximity of word occurrences. Of course a true test of the quality of a search engine would involve an extensive user study or results analysis which we do not have room for here. Instead, we invite the reader to try Google for themselves at http://google.stanford.edu.

5.1 Storage Requirements

Aside from search quality, Google is designed to scale cost effectively to the size of the Web as it grows. One aspect of this is to use storage efficiently. Table 1 has a breakdown of some statistics and storage requirements of Google. Due to compression the total size of the repository is about 53 GB, just over one third of the total data it stores. At current disk prices this makes the repository a relatively cheap source of useful data. More importantly, the total of all the data used by the search engine requires a comparable amount of storage, about 55 GB. Furthermore, most queries can be answered using just the short inverted index. With better encoding and compression of the Document Index, a high quality web search engine may fit onto a 7GB drive of a new PC.
Storage Statistics
Total Size of Fetched Pages                  147.8 GB
Compressed Repository                         53.5 GB
Short Inverted Index                           4.1 GB
Full Inverted Index                           37.2 GB
Lexicon                                        293 MB
Temporary Anchor Data (not in total)           6.6 GB
Document Index Incl. Variable Width Data       9.7 GB
Links Database                                 3.9 GB
Total Without Repository                      55.2 GB
Total With Repository                        108.7 GB

Web Page Statistics
Number of Web Pages Fetched               24 million
Number of URLs Seen                     76.5 million
Number of Email Addresses                1.7 million
Number of 404's                          1.6 million

Table 1. Statistics

5.2 System Performance

It is important for a search engine to crawl and index efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly. For Google, the major operations are Crawling, Indexing, and Sorting. It is difficult to measure how long crawling took overall because disks filled up, name servers crashed, or any number of other problems stopped the system. In total it took roughly 9 days to download the 26 million pages (including errors). However, once the system was running smoothly, it ran much faster, downloading the last 11 million pages in just 63 hours, averaging just over 4 million pages per day or 48.5 pages per second. We ran the indexer and the crawler simultaneously. The indexer ran just faster than the crawlers. This is largely because we spent just enough time optimizing the indexer so that it would not be a bottleneck. These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.
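The crawl-rate figures above check out arithmetically:

```python
# Verify the stated crawl rate: 11 million pages in 63 hours.
pages, hours = 11_000_000, 63
per_second = pages / (hours * 3600)
per_day = per_second * 86400

assert abs(per_second - 48.5) < 0.1   # "48.5 pages per second"
assert per_day > 4_000_000            # "just over 4 million pages per day"
```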

5.3 Search Performance

Improving the performance of search was not the major focus of our research up to this point. The current version of Google answers most queries in between 1 and 10 seconds. This time is mostly dominated by disk IO over NFS (since disks are spread over a number of machines). Furthermore, Google does not have any optimizations such as query caching, subindices on common terms, and other common optimizations. We intend to speed up Google considerably through distribution and hardware, software, and algorithmic improvements. Our target is to be able to handle several hundred queries per second. Table 2 has some sample query times from the current version of Google. They are repeated to show the speedups resulting from cached IO.
                    Initial Query                Same Query Repeated (IO mostly cached)
Query               CPU Time(s)   Total Time(s)  CPU Time(s)   Total Time(s)
al gore                 0.09          2.13           0.06          0.06
vice president          1.77          3.84           1.66          1.80
hard disks              0.25          4.86           0.20          0.24
search engines          1.31          9.63           1.16          1.16

Table 2. Search Times

6 Conclusions

Google is designed to be a scalable search engine. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality including page rank, anchor text, and proximity information. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

6.1 Future Work

A large-scale web search engine is a complex system and much remains to be done. Our immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. Some simple improvements to efficiency include query caching, smart disk allocation, and subindices. Another area which requires much research is updates. We must have smart algorithms to decide what old web pages should be recrawled and what new ones should be crawled. Work toward this goal has been done in [Cho 98]. One promising area of research is using proxy caches to build search databases, since they are demand driven. We are planning to add simple features supported by commercial search engines like boolean operators, negation, and stemming. However, other features are just starting to be explored such as relevance feedback and clustering (Google currently supports a simple hostname based clustering). We also plan to support user context (like the user's location), and result summarization. We are also working to extend the use of link structure and link text. Simple experiments indicate PageRank can be personalized by increasing the weight of a user's home page or bookmarks. As for link text, we are experimenting with using text surrounding links in addition to the link text itself. A Web search engine is a very rich environment for research ideas. We have far too many to list here so we do not expect this Future Work section to become much shorter in the near future.

6.2 High Quality Search

The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users' horizons, they are often frustrating and consume precious time. For example, the top result for a search for 'Bill Clinton' on one of the most popular commercial search engines was the Bill Clinton Joke of the Day: April 14, 1997. Google is designed to provide higher quality search so as the Web continues to grow rapidly, information can be found easily. In order to accomplish this Google makes heavy use of hypertextual information consisting of link structure and link (anchor) text. Google also uses proximity and font information. While evaluation of a search engine is difficult, we have subjectively found that Google returns higher quality search results than current commercial search engines. The analysis of link structure via PageRank allows Google to evaluate the quality of web pages. The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high quality) results. Finally, the use of proximity information helps increase relevance a great deal for many queries.

6.3 Scalable Architecture

Aside from the quality of search, Google is designed to scale. It must be efficient in both space and time, and constant factors are very important when dealing with the entire Web. In implementing Google, we have seen bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk capacity, and network IO. Google has evolved to overcome a number of these bottlenecks during various operations. Google's major data structures make efficient use of available storage space. Furthermore, the crawling, indexing, and sorting operations are efficient enough to be able to build an index of a substantial portion of the web -- 24 million pages, in less than one week. We expect to be able to build an index of 100 million pages in less than a month.

6.4 A Research Tool

In addition to being a high quality search engine, Google is a research tool. The data Google has collected has already resulted in many other papers submitted to conferences and many more on the way. Recent research such as [Abiteboul 97] has shown a number of limitations to queries about the Web that may be answered without having the Web available locally. This means that Google (or a similar system) is not only a valuable research tool but a necessary one for a wide range of applications. We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology.

7 Acknowledgments

Scott Hassan and Alan Steremberg have been critical to the development of Google. Their talented contributions are irreplaceable, and the authors owe them much gratitude. We would also like to thank Hector Garcia-Molina, Rajeev Motwani, Jeff Ullman, and Terry Winograd and the whole WebBase group for their support and insightful discussions. Finally we would like to recognize the generous support of our equipment donors IBM, Intel, and Sun, and our funders. The research described here was conducted as part of the Stanford Integrated Digital Library Project, supported by the National Science Foundation under Cooperative Agreement IRI-9411306. Funding for this cooperative agreement is also provided by DARPA and NASA, and by Interval Research, and the industrial partners of the Stanford Digital Libraries Project.

References

  • Best of the Web 1994 -- Navigators http://botw.org/1994/awards/navigators.html
  • Bill Clinton Joke of the Day: April 14, 1997. http://www.io.com/~cjburke/clinton/970414.html.
  • Bzip2 Homepage http://www.muraroa.demon.co.uk/
  • Google Search Engine http://google.stanford.edu/
  • Harvest http://harvest.transarc.com/
  • Mauldin, Michael L. Lycos Design Choices in an Internet Search Service, IEEE Expert Interview http://www.computer.org/pubs/expert/1997/trends/x1008/mauldin.htm
  • The Effect of Cellular Phone Use Upon Driver Attention http://www.webfirst.com/aaa/text/cell/cell0toc.htm
  • Search Engine Watch http://www.searchenginewatch.com/
  • RFC 1950 (zlib) ftp://ftp.uu.net/graphics/png/documents/zlib/zdoc-index.html
  • Robots Exclusion Protocol: http://info.webcrawler.com/mak/projects/robots/exclusion.htm
  • Web Growth Summary: http://www.mit.edu/people/mkgray/net/web-growth-summary.html
  • Yahoo! http://www.yahoo.com/
  • [Abiteboul 97] Serge Abiteboul and Victor Vianu, Queries and Computation on the Web. Proceedings of the International Conference on Database Theory. Delphi, Greece 1997.
  • [Bagdikian 97] Ben H. Bagdikian. The Media Monopoly. 5th Edition. Publisher: Beacon, ISBN: 0807061557
  • [Chakrabarti 98] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.
  • [Cho 98] Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient Crawling Through URL Ordering. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.
  • [Gravano 94] Luis Gravano, Hector Garcia-Molina, and A. Tomasic. The Effectiveness of GlOSS for the Text-Database Discovery Problem. Proc. of the 1994 ACM SIGMOD International Conference On Management Of Data, 1994.
  • [Kleinberg 98] Jon Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proc. ACM-SIAM Symposium on Discrete Algorithms, 1998.
  • [Marchiori 97] Massimo Marchiori. The Quest for Correct Information on the Web: Hyper Search Engines. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.
  • [McBryan 94] Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. First International Conference on the World Wide Web. CERN, Geneva (Switzerland), May 25-26-27 1994. http://www.cs.colorado.edu/home/mcbryan/mypapers/www94.ps
  • [Page 98] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Manuscript in progress. http://google.stanford.edu/~backrub/pageranksub.ps
  • [Pinkerton 94] Brian Pinkerton, Finding What People Want: Experiences with the WebCrawler. The Second International WWW Conference, Chicago, USA, October 17-20, 1994. http://info.webcrawler.com/bp/WWW94.html
  • [Spertus 97] Ellen Spertus. ParaSite: Mining Structural Information on the Web. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.
  • [TREC 96] Proceedings of the fifth Text REtrieval Conference (TREC-5). Gaithersburg, Maryland, November 20-22, 1996. Publisher: Department of Commerce, National Institute of Standards and Technology. Editors: D. K. Harman and E. M. Voorhees. Full text at: http://trec.nist.gov/
  • [Witten 94] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.
  • [Weiss 96] Ron Weiss, Bienvenido Velez, Mark A. Sheldon, Chanathip Manprempre, Peter Szilagyi, Andrzej Duda, and David K. Gifford. HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering. Proceedings of the 7th ACM Conference on Hypertext. New York, 1996.

Vitae


Sergey Brin received his B.S. degree in mathematics and computer science from the University of Maryland at College Park in 1993. Currently, he is a Ph.D. candidate in computer science at Stanford University where he received his M.S. in 1995. He is a recipient of a National Science Foundation Graduate Fellowship. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data.

Lawrence Page was born in East Lansing, Michigan, and received a B.S.E. in Computer Engineering at the University of Michigan Ann Arbor in 1995. He is currently a Ph.D. candidate in Computer Science at Stanford University. Some of his research interests include the link structure of the web, human computer interaction, search engines, scalability of information access interfaces, and personal data mining.

8 Appendix A: Advertising and Mixed Motives

Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is 'The Effect of Cellular Phone Use Upon Driver Attention', a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page 98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian 97], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.

Since it is very difficult even for experts to evaluate search engines, search engine bias is particularly insidious. A good example was OpenText, which was reported to be selling companies the right to be listed at the top of the search results for particular queries [Marchiori 97]. This type of bias is much more insidious than advertising, because it is not clear who 'deserves' to be there, and who is willing to pay money to be listed. This business model resulted in an uproar, and OpenText has ceased to be a viable search engine. But less blatant bias is likely to be tolerated by the market. For example, a search engine could add a small factor to search results from 'friendly' companies, and subtract a factor from results from competitors. This type of bias is very difficult to detect but could still have a significant effect on the market. Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline's homepage when the airline's name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines. However, there will always be money from advertisers who want a customer to switch products, or have something that is genuinely new. But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.

9 Appendix B: Scalability

9.1 Scalability of Google

We have designed Google to be scalable in the near term to a goal of 100 million web pages. We have just received disk and machines to handle roughly that amount. All of the time consuming parts of the system are parallelizable and roughly linear time. These include things like the crawlers, indexers, and sorters. We also think that most of the data structures will deal gracefully with the expansion. However, at 100 million web pages we will be very close up against all sorts of operating system limits in the common operating systems (currently we run on both Solaris and Linux). These include things like addressable memory, number of open file descriptors, network sockets and bandwidth, and many others. We believe expanding to a lot more than 100 million pages would greatly increase the complexity of our system.

9.2 Scalability of Centralized Indexing Architectures

As the capabilities of computers increase, it becomes possible to index a very large amount of text for a reasonable cost. Of course, other more bandwidth intensive media such as video are likely to become more pervasive. But, because the cost of production of text is low compared to media like video, text is likely to remain very pervasive. Also, it is likely that soon we will have speech recognition that does a reasonable job converting speech into text, expanding the amount of text available. All of this provides amazing possibilities for centralized indexing. Here is an illustrative example. We assume we want to index everything everyone in the US has written for a year. We assume that there are 250 million people in the US and they write an average of 10k per day. That works out to be about 850 terabytes. Also assume that indexing a terabyte can be done now for a reasonable cost. We also assume that the indexing methods used over the text are linear, or nearly linear in their complexity. Given all these assumptions we can compute how long it would take before we could index our 850 terabytes for a reasonable cost assuming certain growth factors. Moore's Law was defined in 1965 as a doubling every 18 months in processor power. It has held remarkably true, not just for processors, but for other important system parameters such as disk as well. If we assume that Moore's law holds for the future, we need only 10 more doublings, or 15 years, to reach our goal of indexing everything everyone in the US has written for a year for a price that a small company could afford. Of course, hardware experts are somewhat concerned Moore's Law may not continue to hold for the next 15 years, but there are certainly a lot of interesting centralized applications even if we only get part of the way to our hypothetical example.
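The back-of-the-envelope arithmetic above can be made explicit. Reading '10k per day' as 10 kilobytes and a terabyte as 2^40 bytes (both assumptions about the paper's units) reproduces the ~850 terabyte figure and the 10-doublings, 15-year estimate.

```python
# Reproduce the estimate: 250 million people writing 10 KB/day for a year,
# then the Moore's-Law doublings needed to close the resulting cost gap.
import math

people = 250_000_000
bytes_per_day = 10 * 1024             # "10k per day", read as kilobytes
total_bytes = people * bytes_per_day * 365
terabytes = total_bytes / 2**40       # about 850 TB

# Doublings to go from indexing 1 TB cheaply to ~850 TB cheaply,
# at one doubling per 18 months:
doublings = math.ceil(math.log2(terabytes))   # 10
years = doublings * 18 / 12                   # 15.0
```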

Of course a distributed system like Gloss [Gravano 94] or Harvest will often be the most efficient and elegant technical solution for indexing, but it seems difficult to convince the world to use these systems because of the high administration costs of setting up large numbers of installations. Of course, it is quite likely that reducing the administration cost drastically is possible. If that happens, and everyone starts running a distributed indexing system, searching would certainly improve drastically.

Because humans can only type or speak a finite amount, and as computers continue improving, text indexing will scale even better than it does now. Of course there could be an infinite amount of machine generated content, but just indexing huge amounts of human generated content seems tremendously useful. So we are optimistic that our centralized web search engine architecture will improve in its ability to cover the pertinent text information over time and that there is a bright future for search.