Category: Java
2006-05-17 17:01:41
Distributed search engine search.minty, the Dowser clustering engine and the Larbin spider
search.minty.org:
Open, Distributed Web Search
Dowser:
Dowser is a research tool for the web. It clusters results from major search engines, associates words that appear in previous searches, and keeps a local cache of all the results you click on in a searchable database. It helps you keep track of what you find on the web.
Larbin:
crawler
Open, Distributed Web Search
Murray Walker,
Consider this an RFC (Request For Comments).
Please add your comments to the wiki.
--------------------------------------------------------------------------------
Abstract
This article discusses the concepts, goals and ideas for an open source, community-run (distributed) web search service that allows anyone to search, and thus navigate around the internet. A key tenet is impartiality and objectivity - avoiding bias, corruption and "spamming" of the service.
An outline of one possible technical architecture is presented. This does not cover the details of any specific algorithms or search techniques. No working code or implementation exists.
Introduction
The ability to find things on the internet, and particularly the world wide web, is often taken for granted by the vast majority of people who (often many times a day) are "searching the web".
Whilst there are a huge number of tools and technologies available to do this, a tiny handful dominate the market:
Google
Yahoo
Microsoft [1]
They, along with many other similar services, share a number of key features:
Their algorithms, or rules for deciding which results to display, are "closed" and often tightly guarded secrets.
They generate money, selling various forms of advertising around the search results.
They are commercial companies whose first priority is to generate value and profit for their shareholders.
There are no obligations or requirements for them to be objective, impartial, comprehensive or even accurate. And there is very little anyone can do to measure them against these metrics.
Given the power and influence these services have over everyone that uses them, it could be argued there is scope for abuse and at least the possibility for a conflict of interest.
Which leads to the idea for an "open" [2] web search service that is free to use - both financially and (more importantly) in the sense that it can be used by anyone, for any purpose. This means that the technical code and information needed to run such a system are "open sourced", but also that the service is provided by the community of users without imposing any restrictions on how the service is used.
It is not known if this idea would work, or if it is simply a description of a hypothetical, impractical or impossible scenario. Hopefully by reading and adding to the debate, it will be possible to find out.
The rest of this article discusses some issues that such a project might encounter, and outlines very briefly one possible technical architecture that might be used to implement such a system.
If you would like to comment, you can do so on the accompanying wiki.
Assumptions and Scope
This article will discuss only a plain text web search service, including HTML files. This excludes features like image search for the sake of simplicity.
Many existing search services, including the "big three" mentioned above, have paid advertisements around the search results. They also all have separate, custom technology for delivering such advertisements [3]. The ability to do, or technology of, "ad serving" is not discussed here.
Custom search services, such as searching for books on Amazon or auctions on ebay, are out of scope. A generic web search service must be able to handle unstructured, plain text data.
The primary focus is a search service that requires no manual intervention or classification on a site by site, or page by page basis. Namely, once it is set up and running, it is fully autonomous.
Advanced features of existing services, such as viewing cached versions of pages, spell checking, etc. are also out of scope.
Resources : Bandwidth, Disk space and CPU
Consider an index of around 4 billion items, roughly the size of Google's current web index. Taking an average of 30 KB per item or url (only looking at the plain text, including HTML source), around 111 Terabytes of data would have to be downloaded to visit each document exactly once.
Then consider updates, which could range from quarterly or monthly for resources infrequently used or rarely changed, to weekly, daily or even hourly in the case of very popular or highly dynamic sites, such as news sites.
It is not the goal, at least initially, to store complete copies of everything downloaded. Thus, if a 30 KB item could be reduced to 300 bytes of information, disk space requirements would be around 1,100 Gigabytes. It is also possible that producing fast results when a query is run would require storing more than 30 KB per item. If it was a ten-fold increase, it would put disk storage requirements at just over 1 Petabyte.
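The download and storage estimates above can be checked with a few lines of arithmetic. This is only a sketch: it assumes binary units (1 KB = 1024 bytes), which is what reproduces the 111 Terabyte figure, and the constant and function names are made up for illustration.

```python
# Back-of-envelope check of the download and storage estimates.
# Assumption: binary units (1 KB = 1024 bytes, 1 TB = 1024**4 bytes).
KB, GB, TB, PB = 1024, 1024**3, 1024**4, 1024**5

ITEMS = 4_000_000_000          # urls in the index
RAW_BYTES_PER_ITEM = 30 * KB   # average plain text / HTML size per url


def total_bytes(items: int, bytes_per_item: int) -> int:
    """Total size if every item costs bytes_per_item bytes."""
    return items * bytes_per_item


# One complete pass over the web: every document downloaded once.
download = total_bytes(ITEMS, RAW_BYTES_PER_ITEM)
# Index reduced to 300 bytes of information per item.
compact_index = total_bytes(ITEMS, 300)
# Index that needs ten times the raw size per item.
expanded_index = total_bytes(ITEMS, 10 * RAW_BYTES_PER_ITEM)

print(f"one full crawl : {download / TB:.1f} TB")
print(f"compact index  : {compact_index / GB:.0f} GB")
print(f"expanded index : {expanded_index / PB:.2f} PB")
```

With decimal units instead (1 KB = 1000 bytes) the same calculation gives 120 TB for one full crawl, so the figures should be read as orders of magnitude rather than precise requirements.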
CPU or raw processing power is even harder to predict, but it breaks down into two distinct areas. There is the task of crawling or indexing pages. This can be done "offline" or in a batch approach. For the sake of argument, if a typical 2 GHz Pentium 4 computer could process 100 urls per second, that computer could do just over 8.6 million urls in a day. (Assuming the data could be downloaded this fast.) It would take that (single) computer about 463 days, or roughly a year and a quarter, to process the 4 billion urls. A hundred such computers could do it in just under 5 days.
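The crawl-time estimate is simple enough to reproduce directly. The sketch below uses the rates assumed in the paragraph above (100 urls per second per machine, 4 billion urls); the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60
TOTAL_URLS = 4_000_000_000


def urls_per_day(urls_per_second: float) -> float:
    """How many urls one machine processes in a day at a steady rate."""
    return urls_per_second * SECONDS_PER_DAY


def days_to_crawl(total_urls: int, urls_per_second: float, machines: int = 1) -> float:
    """Days needed to process total_urls, assuming perfectly linear scaling."""
    return total_urls / (urls_per_day(urls_per_second) * machines)


print(f"urls per machine per day: {urls_per_day(100):,.0f}")
print(f"one machine:              {days_to_crawl(TOTAL_URLS, 100):.0f} days")
print(f"one hundred machines:     {days_to_crawl(TOTAL_URLS, 100, machines=100):.1f} days")
```

Linear scaling is the optimistic case; it ignores bandwidth limits and any co-ordination overhead between machines.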
Then there is the task of returning answers to queries in real time. Again, for the sake of argument, assume our typical 2 GHz Pentium computer could deliver 100 requests per second. Assuming it can scale close to linearly, the CPU resource required here would then become a function of how popular the service became.
Considering a worst case, where our benchmark machine is capable of crawling only 1 url per second, and answering queries at a rate of 1 request per second, it could easily require thousands of computers.
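To put a rough number on that worst case, the sketch below counts machines under two assumptions that the paragraph itself does not state: that every url is refreshed once every 30 days, and that the service must sustain 100 queries per second. Both values, and the function names, are illustrative only.

```python
import math


def crawl_machines(total_urls: int, urls_per_second: float, refresh_days: int) -> int:
    """Machines needed to visit every url once per refresh period."""
    seconds = refresh_days * 24 * 60 * 60
    return math.ceil(total_urls / (urls_per_second * seconds))


def query_machines(queries_per_second: float, per_machine_qps: float) -> int:
    """Machines needed to answer the query load, assuming linear scaling."""
    return math.ceil(queries_per_second / per_machine_qps)


# Worst case from the text: 1 url and 1 query per second per machine.
crawlers = crawl_machines(4_000_000_000, urls_per_second=1, refresh_days=30)
searchers = query_machines(queries_per_second=100, per_machine_qps=1)
print(crawlers, searchers, crawlers + searchers)
```

Shortening the refresh period or raising the query load pushes the total up proportionally.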
It is clear that the resources required are considerable.
Historically, approaches to this problem used a small number of massively powerful computers (the Mainframe approach). More recent approaches, including the technique used by Google, Yahoo and most likely Microsoft, involve many hundreds or thousands of "commodity" computers (cheap, standard, generic machines) working in some degree of parallelism, using the network to communicate.
Taking this model one step further, it would appear that a distributed, "peer-to-peer" model could potentially be used to provide these resources. There is the potential for hundreds of thousands of computers, distributed across the internet all working together.
Existing (working) examples of peer-to-peer networks:
Traditional file-sharing networks such as Kazaa, which already allow millions of users to search across millions of files, representing terabytes of data, and then download those files simultaneously from multiple sources.
Internet Telephony networks, like Skype, which use peer-to-peer networks to route calls for free around the world in real time.
Less traditional (newer?) file sharing networks, such as BitTorrent which implement an "open" service similar to the traditional service offered by Akamai. Also Freenet which implements an entirely decentralised hosting network where both publishers and consumers are anonymous.
Spamming, corruption and abuse
Because it will drive more people to a web site, many people and organisations go to considerable lengths and cost to try and ensure their sites appear as high as possible in search results. This is commonly referred to as "Search Engine Optimisation".
An example might be trying to ensure your web site appears above those of your competitors - even when the search query was for your competitor's name, or product.
However this is only the tip of the iceberg when it comes to bias and corruption in search results. For instance, a technique often referred to as "Google Bombing" resulted in Microsoft's website appearing at the top of the results when someone searched for "more evil than satan himself" towards the end of 1999.
And then, of course, there are laws, lawyers, legal systems and governments. Consider libel and propaganda to name but two.
There are a number of ways to address these problems, which are typically coded into the algorithms and rules used by the search service and are often highly guarded secrets. Although the specific implementation details are not public knowledge, many of the concepts are generally well understood and documented. PageRank(tm), as used by Google, is a good example of this. Various forms of hashing, used to try and identify similar documents, are another.
Discussion of these techniques, while a critical and very complex part of any web search solution, is beyond the scope of this document. There are however two problems specific to a distributed web search service worth noting.
Open algorithm
The rules and algorithms used in an open web search service would be available to anyone, including those wanting to corrupt it. This is often cited as a key reason for keeping these rules secret in existing services.
Without trying, it is impossible to know if this would indeed be an insurmountable problem.
However, Microsoft and others have argued that proprietary "closed" software is more secure because the source code is not public, and thus it is harder to find and exploit bugs or weaknesses. The Linux community takes the opposite approach, arguing that making everything open will both promote a more secure design and allow problems to be found and fixed faster.
Without wishing to get involved in this specific debate, it would be fair to say that neither side has categorically proved their case, and most likely never will. Which raises the question, could the same be true of the algorithms used in web search engines?
There are a number of additional measures that can be used to avoid spamming and corruption. Many sites, including Amazon and slashdot.org, employ community-rating schemes. Yahoo Mail allows users to easily mark emails as spam. Used collectively, this information can help identify and deal with problems and can be a difficult system to abuse on a consistent and repetitive basis.
Corrupt nodes or clients
Existing models such as those used by Google, Yahoo or Microsoft obviously involve a massive number of computers. However, all of Google's computers are under the full control of Google. They can be reasonably sure all their machines are running the same code and that none of their machines are intentionally trying to corrupt their data or service.
In a distributed, community-driven service, where code would be running on a vast array of different computers around the internet, it would be very easy for someone to obtain the source code for an open, distributed web search service, adjust it to suit their specific purposes (such as always returning porn sites top of the results) and then add this new node or client to the network.
A good example of this is when the music industry added corrupt or invalid files to file sharing networks such as Kazaa. Because the file names and sizes matched, users thought they were downloading the latest Madonna tracks. Instead, they got a special message from the popstar.
Again, it is impossible to know without trying whether this would be a fundamental problem that could not be resolved.
That said, and assuming there is more good than evil, one possible solution would be to introduce redundancy and cross checking.
Imagine there are a number of independent machines (or nodes in our open, distributed network) that visit, crawl and analyse a url or web page. Let's call these "Crawlers". See Fig 1.
The key point here is redundancy - more than one Crawler will visit and analyse a single url or page.
Having completed their task, the crawlers do nothing more than pass the information "up the chain" to another set of machines (or nodes in the network) that hold the index data for a whole set of urls or pages. These are called "Indexers". See Fig 2.
An Indexer would typically get information about a single url from a number of disparate Crawlers. It could compare these various copies, and all being well, each copy would be identical. If one of the copies differed, it would know something was wrong with that crawler.
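The comparison an Indexer makes could be as simple as a majority vote over a digest of what each Crawler reported. The sketch below is one possible reading of the idea, not a specification: the function names, the use of SHA-256 and the "strict majority" rule are all assumptions.

```python
import hashlib
from collections import Counter
from typing import Dict, Optional, Set, Tuple


def digest(content: str) -> str:
    """Fingerprint of the data a Crawler reported for one url."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def cross_check(reports: Dict[str, str]) -> Tuple[Optional[str], Set[str]]:
    """Compare the copies of one url received from several Crawlers.

    reports maps a crawler id to the content it reported.
    Returns (agreed digest, crawlers that disagreed). If no strict
    majority agrees, returns (None, empty set) and nobody is blamed.
    """
    if not reports:
        return None, set()
    digests = {crawler: digest(content) for crawler, content in reports.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(reports):
        return None, set()
    suspects = {crawler for crawler, d in digests.items() if d != winner}
    return winner, suspects


agreed, suspects = cross_check({
    "crawler-1": "<html>car hire</html>",
    "crawler-2": "<html>car hire</html>",
    "crawler-3": "<html>something else entirely</html>",
})
print(agreed is not None, suspects)
```

An exact digest treats any difference as disagreement, so this only works for pages that do not change between visits.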
Indexer nodes could be thought of as a collection of lots of little databases. No one node would contain the entire database. It is unlikely any of the nodes on the network would have enough storage available.
In addition to splitting the whole index across multiple Indexer nodes, there would be many different Indexer nodes all storing the same section of the index.
The third type of node is a Searcher node. These accept "end user" queries - for example a search for "car hire in london". The Searcher node breaks this down, and passes it along to a number of different Indexer nodes. Each Indexer node is responsible for returning a partial set of the overall results. See Fig 3.
The Searcher nodes combine the results from the various Indexer nodes and additionally cross check the results. In this way, a Searcher node can quickly identify an Indexer node that is misbehaving or returning bad results. Such nodes can then be factored out of the current results and not used in future.
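One way a Searcher might combine and cross check replies is sketched below. It assumes each section of the index is held by several Indexer nodes, that honest replicas return identical ranked lists, and that a strict majority decides; the data layout, section labels and names are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Result = Tuple[str, float]  # (url, score)


def merge_results(
    replies: Dict[str, Dict[str, List[Result]]]
) -> Tuple[List[Result], Set[str]]:
    """Merge partial results and flag Indexers that disagree with their peers.

    replies maps an index section to {indexer id: ranked results}.
    Returns (merged results sorted by score, suspect indexer ids).
    """
    merged: List[Result] = []
    suspects: Set[str] = set()
    for by_indexer in replies.values():
        if not by_indexer:
            continue
        counts = Counter(tuple(results) for results in by_indexer.values())
        winner, votes = counts.most_common(1)[0]
        if votes * 2 <= len(by_indexer):
            continue  # no strict majority: leave this section out
        merged.extend(winner)
        suspects.update(
            indexer for indexer, results in by_indexer.items()
            if tuple(results) != winner
        )
    merged.sort(key=lambda item: item[1], reverse=True)
    return merged, suspects


replies = {
    "section-1": {
        "indexer-1": [("http://example.org/hire", 0.9)],
        "indexer-2": [("http://example.org/hire", 0.9)],
        "indexer-3": [("http://example.net/spam", 9.9)],
    },
    "section-2": {
        "indexer-4": [("http://example.org/london", 0.5)],
        "indexer-5": [("http://example.org/london", 0.5)],
    },
}
merged, suspects = merge_results(replies)
print(merged, suspects)
```

The suspect list is what would let a Searcher factor misbehaving Indexers out of future queries.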
Obviously, any such system would be vastly more complex. But it should be clear that there are various ways to deal with badly-behaved nodes and act accordingly.
Making Access Easy
While it would be important that as many people as possible installed and ran the software (and acted as a node), these users are unlikely ever to be the majority, since most people search via a web site.
Consider a portal site that offers the ability to search the web. If it were to contribute sufficient resources to the network, it ought to be able to use the network to power its web search offering. The portal would be contributing resources to the network, which could be used by others.
Leeches
Most distributed networks have to deal with "leeches" - people that suck resources from the network without contributing back. This problem is related to the problem of spammers and people trying to corrupt the network. The most obvious type of leech would be a large site operating many Searcher nodes but no Indexer or Crawler ones.
Indexer nodes would quickly spot any one Searcher node that was requesting abnormally high levels of traffic and penalise it accordingly. This would force the Searcher node onto another Indexer node, which would eventually also penalise the Searcher node for its high levels of traffic. As this progresses, a greedy Searcher node finds fewer and fewer Indexer nodes willing to communicate with it.
It would be possible to taint your local node to only act as a Searcher, but in order to send more than a trivial number of queries, you would need to set up a very high number of these nodes. In addition, because Indexer nodes communicate with both Searcher and Crawler nodes, and because one physical machine ought to be capable of acting as multiple types of nodes simultaneously, an Indexer node that receives high numbers of queries from a discrete location ought also to receive Crawler data from that same location. We penalise nodes that do not meet these criteria.
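A very simple form of this rule is a ledger that each Indexer keeps per origin: queries are only answered while they stay in proportion to the Crawler data received from the same place. The class below is a sketch under that assumption; the allowance values and all names are invented for illustration.

```python
from collections import defaultdict


class ContributionLedger:
    """Per-origin record of queries served and Crawler reports received."""

    def __init__(self, free_queries: int = 100, queries_per_report: int = 10):
        self.free_queries = free_queries              # allowance for newcomers
        self.queries_per_report = queries_per_report  # credit earned per report
        self.queries = defaultdict(int)
        self.reports = defaultdict(int)

    def record_report(self, origin: str) -> None:
        """Called when Crawler data arrives from this origin."""
        self.reports[origin] += 1

    def allow_query(self, origin: str) -> bool:
        """Serve the query only if the origin has contributed enough."""
        allowance = self.free_queries + self.queries_per_report * self.reports[origin]
        if self.queries[origin] >= allowance:
            return False  # greedy origin: refuse until it contributes
        self.queries[origin] += 1
        return True


ledger = ContributionLedger(free_queries=2, queries_per_report=3)
print([ledger.allow_query("portal") for _ in range(3)])
ledger.record_report("portal")
print(ledger.allow_query("portal"))
```

What counts as an "origin" (an address, a network block, a signed node identity) is left open here, and is probably the hard part.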
Which sites to crawl?
How do Crawlers know which sites to visit and how often?
Individual Indexer nodes are aware of a subset of the overall index and also which pages are matching most in current queries. When Crawler nodes are passing information about specific sites to the Indexers, the Indexers can be passing back statistical information about where to crawl next.
The Crawler nodes can cross reference the statistical models they receive about a small sub-set of the internet from various Indexer nodes. Then, by a combination of weighting popular urls and randomly picking a starting point from the statistical model, a Crawler can find a point to begin.
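As a sketch of that selection step, suppose each Indexer hands back a table of urls with a popularity weight. A Crawler could add the tables together and then draw a starting point at random in proportion to the combined weight. The table format and function names are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List


def combine_hints(hints: List[Dict[str, float]]) -> Dict[str, float]:
    """Add up the weights different Indexers reported for each url."""
    combined: Dict[str, float] = defaultdict(float)
    for hint in hints:
        for url, weight in hint.items():
            combined[url] += weight
    return dict(combined)


def pick_start(combined: Dict[str, float], rng=random) -> str:
    """Pick a url at random, favouring those with a higher weight."""
    urls = list(combined)
    weights = [combined[url] for url in urls]
    return rng.choices(urls, weights=weights, k=1)[0]


combined = combine_hints([
    {"http://example.org/news": 2.0, "http://example.org/about": 1.0},
    {"http://example.org/news": 1.0},
])
print(pick_start(combined))
```

Because the choice is random rather than strictly by rank, less popular urls are still visited occasionally.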
In addition, it ought to be possible to monitor the results end users are clicking on. This would allow sites that get more traffic to be considered more worthy of additional, or more frequent, crawling. It might be most appropriate for the Crawler nodes themselves to gather this data. Note that we do not need to record every click, only a statistical sample. See Fig 4.
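Recording only a statistical sample of clicks could look something like the sketch below: each click is kept with a fixed probability, and the kept count is scaled back up to estimate the real total. The sampling rate and class name are illustrative assumptions.

```python
import random
from collections import Counter
from typing import Optional


class ClickSampler:
    """Keeps roughly `rate` of all clicks and estimates totals from them."""

    def __init__(self, rate: float, rng: Optional[random.Random] = None):
        if not 0 < rate <= 1:
            raise ValueError("rate must be in (0, 1]")
        self.rate = rate
        self.rng = rng or random.Random()
        self.sampled = Counter()

    def observe(self, url: str) -> None:
        """Record this click with probability `rate`, otherwise drop it."""
        if self.rng.random() < self.rate:
            self.sampled[url] += 1

    def estimate(self, url: str) -> float:
        """Estimated number of real clicks on url."""
        return self.sampled[url] / self.rate


sampler = ClickSampler(rate=0.1, rng=random.Random(42))
for _ in range(10_000):
    sampler.observe("http://example.org/popular")
print(round(sampler.estimate("http://example.org/popular")))
```

The estimate is noisy for rarely clicked urls, which is acceptable if the goal is only to decide which sites deserve more frequent crawling.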
Dynamic Pages
Many pages on the web change frequently. This could present a problem for Indexer nodes when cross referencing data from multiple Crawler nodes.
Crawler 1 visits a page, indexes it and passes the data on to various Indexer nodes.
The site's owner updates the content of the page.
Crawler 2 visits the same page, indexes it and passes the data on to various Indexer nodes.
When the various Indexer nodes compare the data about the page from the various Crawlers, they will notice a discrepancy. Pages that include the current time (including seconds), the user name or a "random" headline will all suffer this fate. It would also be a problem whenever the code or systems of the various nodes were updated.
Once again, it is hard to know if this problem could be solved sufficiently, but a number of techniques may help.
When information is received from a Crawler that does not match the existing data for a site, the Crawler is not immediately banned. The receiving node would simply note that a given Crawler had presented possibly corrupt data. If a Crawler consistently gives bad data for a given site (or for all sites) only then would it be treated as corrupt.
By knowing the time a Crawler visited a page, a node could sort the various different copies of data about a site by time. This would help to spot a typical site update. If there were five versions of information about one site, we would use the version that was most consistent across the available sample.
Hashing algorithms exist that could be used to spot two documents that are substantially identical except for a small amount of text. If all the various copies of information received about a site differed in a similar way, it would be safe to assume there was no problem.
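The three techniques above can be sketched together. Everything here is an assumption made for illustration: the thresholds, the names, and in particular the similarity measure, which compares overlapping word runs ("shingles") with Jaccard similarity as a simple stand-in for the hashing algorithms mentioned.

```python
from collections import Counter, defaultdict
from itertools import groupby
from typing import List, Set, Tuple


# Technique 1: tolerate occasional mismatches before calling a Crawler corrupt.
class CrawlerReputation:
    def __init__(self, min_reports: int = 20, max_bad_ratio: float = 0.5):
        self.min_reports = min_reports
        self.max_bad_ratio = max_bad_ratio
        self.total = defaultdict(int)
        self.bad = defaultdict(int)

    def record(self, crawler: str, matched: bool) -> None:
        self.total[crawler] += 1
        if not matched:
            self.bad[crawler] += 1

    def is_corrupt(self, crawler: str) -> bool:
        total = self.total[crawler]
        if total < self.min_reports:
            return False  # too little evidence to judge
        return self.bad[crawler] / total > self.max_bad_ratio


# Technique 2: order the copies by visit time to spot a normal site update.
def looks_like_update(observations: List[Tuple[float, str]]) -> bool:
    """True if the content changed over time and never changed back."""
    ordered = [content for _, content in sorted(observations)]
    runs = [key for key, _ in groupby(ordered)]
    return len(runs) > 1 and len(runs) == len(set(runs))


def most_consistent(observations: List[Tuple[float, str]]) -> str:
    """The version reported most often across the sample."""
    return Counter(content for _, content in observations).most_common(1)[0][0]


# Technique 3: treat documents that differ only slightly as the same.
def shingles(text: str, size: int = 3) -> Set[Tuple[str, ...]]:
    words = text.lower().split()
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(a: str, b: str) -> float:
    """Jaccard similarity of the shingle sets; 1.0 means identical."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


page = " ".join(f"word{i}" for i in range(30))
print(similarity(page + " 10:01", page + " 10:02"))
```

In this example two copies of a page that differ only in a trailing timestamp score above 0.9, so an Indexer could accept them as the same document.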
Final Thoughts
Searcher nodes might expose a vast array of features and configuration abilities. Date ranges, geography, language and profanity might be variables that could be adjusted. A client (including a web site) could mask these from the end user by simply hard-coding values for them.
This would allow the generic algorithm to focus on providing generic "features" and still allow for a huge amount of customisation and tweaking for end users.
Indexer nodes would appear to be doing the most work, so it is possible that this conceptual block might be sub-divided into smaller units, each of which could be a distinct node. Alternatively, the vast majority of nodes would need to be Indexers.
It ought to be possible for one physical machine to function as many different types of nodes. Namely, your computer could be a Crawler, Indexer and Searcher node simultaneously. In fact, it is probably better if this is the case.
The Apache HTTP Server, especially version 2, is a very mature multi platform application development framework that is well suited to a project such as this. Despite its name, it need not be limited to the HTTP protocol for communication between nodes.
Next Steps
Maths, Statistics and Probability
From a theoretical point of view, could this model stand up? Or would it require so many nodes to operate that it would be unfeasible to build? Would the communication overhead be too high?
How do we split the index up across multiple nodes? For instance, it could be split by the words people might search on, leaving a set of Indexer nodes with information on all the urls that mention a word starting with the letter A, another for words starting B, etc. In reality, sub-dividing by much more than the 26 letters of the alphabet would be required - but how far? Would this work?
Another approach might be to split based on the domains of sites. Thus, one group of Indexer nodes would know all about sites whose domains began "..." and another group of Indexer nodes could handle sites beginning "..."
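Both ways of splitting the index can be expressed as a function from a key to a group of Indexer nodes. The sketch below hashes the key instead of using its first letter, because letters are very unevenly distributed; that substitution, the shard counts and the function names are assumptions, not part of the proposal.

```python
import hashlib
from urllib.parse import urlsplit


def stable_hash(key: str) -> int:
    """A hash that gives the same value on every node and every run."""
    return int.from_bytes(hashlib.sha1(key.encode("utf-8")).digest()[:8], "big")


def shard_for_term(term: str, shards: int) -> int:
    """Split by search word: all urls mentioning a word live on one shard."""
    return stable_hash(term.lower()) % shards


def shard_for_url(url: str, shards: int) -> int:
    """Split by domain: every page of a site lives on one shard."""
    host = urlsplit(url).hostname or ""
    return stable_hash(host) % shards


print(shard_for_term("london", 64))
print(shard_for_url("http://example.org/car-hire", 64))
```

Splitting by word means a multi-word query touches several shards, while splitting by domain means every query touches all of them; which costs less in communication is exactly the open question.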
Given that a typical node will not have a huge amount of resources available to it, how many of each type of node (Crawler, Indexer and Searcher) would be required to index 4 billion urls, refreshing them all once per month? And to handle a sustained rate of 100 queries per second?
What mathematical or statistical techniques could be used to analyse sites? Problems here include looking for similar (but maybe not identical) documents and identifying likely spam. Would Bayesian classifiers be of use? Perhaps adapting other de-spamming technologies would be useful?
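To make the Bayesian question concrete, here is a minimal naive Bayes classifier of the kind used in e-mail spam filters. It is a toy: the training phrases are invented for illustration, the class and label names are hypothetical, and nothing here shows that the approach would cope with real web spam.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")


class NaiveBayes:
    """Word-count naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = {label: Counter() for label in LABELS}
        self.doc_counts = Counter()

    def train(self, label: str, text: str) -> None:
        self.doc_counts[label] += 1
        self.word_counts[label].update(text.lower().split())

    def score(self, label: str, text: str) -> float:
        """Log probability of the text under label (up to a constant)."""
        vocabulary = set().union(*self.word_counts.values())
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.doc_counts.values())
        log_prob = math.log(self.doc_counts[label] / total_docs)
        for word in text.lower().split():
            count = self.word_counts[label][word]
            log_prob += math.log((count + 1) / (total_words + len(vocabulary)))
        return log_prob

    def classify(self, text: str) -> str:
        return max(LABELS, key=lambda label: self.score(label, text))


# Invented training data; both labels must be trained before classifying.
model = NaiveBayes()
model.train("spam", "cheap pills buy now")
model.train("spam", "buy cheap watches now")
model.train("ham", "open source search engine")
model.train("ham", "distributed web crawler design")

print(model.classify("buy cheap pills"))
print(model.classify("web search engine"))
```

A real classifier would need far more training data and features beyond single words, such as link patterns.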
Weakness and Risk
How would you attack such a network in order to improve the ranking of your site?
Where are the weak spots in the design?
Could an update to the code used by the nodes result in a deadlock, such that the entire network failed?
Proof of concept
By limiting ourselves to a couple of small domains to crawl and index and not worrying too much about the complexities of dealing with spam and abuse, it ought to be possible to implement a working prototype.
Which language to use? This could be contentious at best, but it would be reasonable to assume that it must be an open source language. A strong collection of networking and text analysis tools would be of much use, as would an abundance of programmers willing to hack.
It could be argued that there is the most value in rapid application development, focusing on building the APIs and system design. A language such as Perl might be appropriate for this. From here, faster, more optimised and efficient clones could evolve in other languages.
Feedback
Some hearty discussion on the wiki might be in order. Please try to keep it constructive.
--------------------------------------------------------------------------------
[1] Microsoft already operate some of the top internet sites and many of these, including their MSN portal, offer the ability to search the web. Behind the scenes, they currently rely heavily on much of the same technology owned by and used for Yahoo web search results. Microsoft have said they plan to develop and use their own technologies instead.
[2] This takes the Debian project as a basis for the license model, social contract, etc. A similar, although not identical starting point could be found at . Free as in beer, as well as in speech.
[3] Google have AdSense; Yahoo use (and own) Overture, who are currently used by Microsoft.