Crawl 'em all
Google has 3 billion pages in its database, with AlltheWeb and Inktomi close behind. But there may be a trillion more pages hiding in plain sight, locked away in online databases such as WebMD and The New York Times' archive, where they can't be reached by hopping from one link to another. To get at them, a search engine needs to submit a query to each site, then consolidate the results onto one page. CompletePlanet lets visitors search more than 100,000 such databases, but only a few subjects at a time. No current service is powerful enough to churn through all trillion possible results at once.
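For the curious, here's a rough sketch in Python of that fan-out-and-merge step. The endpoint URLs and the JSON response format are invented for illustration; every real deep-web database has its own query syntax and result layout.

```python
import concurrent.futures
import json
import urllib.parse
import urllib.request

# Hypothetical search endpoints for a couple of deep-web databases.
ENDPOINTS = {
    "example-medical": "https://search.example-medical.test/api?q=",
    "example-news":    "https://archive.example-news.test/search?query=",
}

def query_one(name, base_url, term, timeout=10):
    """Send the query to a single database and return its hits, tagged by source."""
    url = base_url + urllib.parse.quote(term)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        hits = json.load(resp)  # assume each endpoint answers with a JSON list
    return [{"source": name, **hit} for hit in hits]

def federated_search(term):
    """Fan the query out to every endpoint in parallel, then merge the results."""
    results = []
    with concurrent.futures.ThreadPoolExecutor() as pool:
        futures = [pool.submit(query_one, name, url, term)
                   for name, url in ENDPOINTS.items()]
        for future in concurrent.futures.as_completed(futures):
            try:
                results.extend(future.result())
            except Exception:
                pass  # one slow or broken database shouldn't sink the whole search
    return results

if __name__ == "__main__":
    for hit in federated_search("aspirin"):
        print(hit["source"], hit.get("title", ""))
```

The hard part isn't the fan-out, it's doing this across a trillion results with ranking that makes the merged page useful.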
Keep 'em all
Google is fast replacing Lexis-Nexis as the research tool professionals turn to first. But Google lets you search only its most recently crawled version of the Web. Pages that were changed or deleted prior to the last crawl are lost forever. The Internet Archive's Wayback Machine preserves a fraction of the Web's page history. What if you could search every version of every page ever posted?
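The Wayback Machine already exposes its captures programmatically through a CDX lookup service, which hints at what a full page-history search could build on. A minimal sketch in Python, assuming the Internet Archive's public CDX endpoint and replay-URL pattern as commonly documented:

```python
import json
import urllib.parse
import urllib.request

CDX_API = "https://web.archive.org/cdx/search/cdx"

def list_snapshots(page_url, limit=20):
    """Return (timestamp, archived_url) pairs for a page's captured versions."""
    query = urllib.parse.urlencode({"url": page_url, "output": "json", "limit": limit})
    with urllib.request.urlopen(f"{CDX_API}?{query}") as resp:
        rows = json.load(resp)
    if not rows:
        return []
    header, entries = rows[0], rows[1:]
    ts, orig = header.index("timestamp"), header.index("original")
    return [(row[ts], f"https://web.archive.org/web/{row[ts]}/{row[orig]}")
            for row in entries]

if __name__ == "__main__":
    for stamp, url in list_snapshots("nytimes.com"):
        print(stamp, url)
```

Listing snapshots of one URL is easy; full-text search across every version of every page is the part nobody offers.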
Follow the feeds
News sites and blogs are supplementing their pages with RSS feeds, a format that delivers new content to subscribers as soon as it's published. Google doesn't track RSS feeds, and bloggers gripe that their posts take two to three days to show up in search results. An engine to which Web site owners could submit their RSS feeds would always have the latest version of every page.
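Part of what makes the lag so galling is that RSS 2.0 is plain XML; pulling the newest items out of a feed takes only a few lines. A minimal polling sketch in Python, with a placeholder feed URL:

```python
import urllib.request
import xml.etree.ElementTree as ET

def latest_items(feed_url, limit=10):
    """Poll an RSS 2.0 feed and return its newest items as (title, link, pubDate)."""
    with urllib.request.urlopen(feed_url) as resp:
        root = ET.fromstring(resp.read())
    items = []
    for item in root.findall("./channel/item")[:limit]:
        items.append((
            item.findtext("title", default=""),
            item.findtext("link", default=""),
            item.findtext("pubDate", default=""),
        ))
    return items

if __name__ == "__main__":
    # Any RSS 2.0 feed URL works here; this one is just a placeholder.
    for title, link, published in latest_items("https://example.com/feed.xml"):
        print(published, title, link)
```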
Don't give away the formula
When Google debuted in 1998, its search results were free of the marketing pages that clogged other engines. Even though Sergey Brin and Larry Page had published their PageRank formula while at Stanford, it was tougher to fool than other scoring systems. In 2000, Google gave out a free PC toolbar that displayed the PageRank value of any Web page, unwittingly handing Google gamers a cheat sheet.
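The published formula is simple to restate: a page's rank is a small constant share plus a damped sum of the ranks of the pages linking to it, each divided by the linking page's number of outgoing links, iterated until the scores settle. A toy power-iteration sketch in Python (the three-page "web" is invented purely for illustration):

```python
def pagerank(links, damping=0.85, iterations=50):
    """Compute PageRank by power iteration over a {page: [pages it links to]} dict."""
    # Include pages that are only ever linked to and never link out.
    pages = set(links) | {p for targets in links.values() for p in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly across the whole graph.
                for p in pages:
                    new_rank[p] += damping * rank[page] / n
        rank = new_rank
    return rank

if __name__ == "__main__":
    toy_web = {
        "home": ["about", "blog"],
        "about": ["home"],
        "blog": ["home", "about"],
    }
    for page, score in sorted(pagerank(toy_web).items(), key=lambda kv: -kv[1]):
        print(f"{page}: {score:.3f}")
```

Knowing the formula is one thing; knowing your exact score for any page, courtesy of the toolbar, is what turned gaming PageRank into an industry.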