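  What prompted this post was the word "robots", which kept coming up in the previous post. It reminded me of spiders and crawlers. Are the two the same thing, or not?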

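  I once read an article saying that the two are different, or at least different strictly speaking. I searched the web again just now, and most opinions say they are the same. I won't repeat the majority view here; it is easy to find online, and there is plenty of it. In this post I will cover the view that the two are different. Right or wrong, treat it as a reference only: let a hundred schools of thought contend and a hundred flowers bloom.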

  On WebmasterWorld there was once a thread about exactly this: spider versus crawler. It opens with the following description:

  Search engines consist of five discrete software components:

  Spider : a robotic browser like program that downloads webpages.

  Crawler : a wandering spider that automatically follows links found on pages.

  Indexer : a blender like program that dissects webpages that are downloaded by spiders.

  The Database : a warehouse of the pages downloaded and processed.

  Search Engine Results Engine : digs search results out of the database.

  To sum up its meaning in one sentence: a spider and a crawler are not the same thing.

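  The thread also puts forward the view that there are five kinds of robots. Their names and roles, in order: spider, downloads web pages; crawler, follows the links on a page to visit whatever is at the other end; indexer, indexes the downloaded pages; database, the warehouse of downloaded and processed pages; results engine, digs the search results out of the database. Five kinds? I don't know whether this view is correct, but to me at least it is a novel one.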
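  To make the spider/crawler distinction from the thread concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: `FAKE_WEB`, `spider`, `LinkParser` and `crawler` are hypothetical names made up for this post, and the "web" is an in-memory dictionary rather than real HTTP, so the example runs without a network. The spider fetches exactly one page; the crawler is a "wandering spider" that keeps calling it on every link it finds.

```python
from collections import deque
from html.parser import HTMLParser

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "/a": '<a href="/b">b</a> <a href="/c">c</a>',
    "/b": '<a href="/a">a</a>',
    "/c": "no links here",
}


def spider(url):
    """Spider: download a single page and return its HTML (empty if missing)."""
    return FAKE_WEB.get(url, "")


class LinkParser(HTMLParser):
    """Collect the href values of all <a> tags in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawler(start):
    """Crawler: repeatedly send the spider to links found on downloaded pages."""
    seen = {start}
    queue = deque([start])
    pages = {}
    while queue:
        url = queue.popleft()
        html = spider(url)          # the spider does the downloading
        pages[url] = html
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:   # the crawler does the wandering
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return pages


pages = crawler("/a")
```

  In this sketch the `pages` dictionary plays the role of the thread's "database"; the indexer and the results engine are left out.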

  Someone else posted:

  Let’s talk about how robots interpret your page for a bit. If I follow Brett’s historical topic, you have three different types of robots, a spider, crawler and indexer.

  First the Spider comes around and requests the URI. It reads server header information and other on page information. Then the Crawler follows all the links within that domain (those that are found and allowed). Then the Indexer reads the html while making heads and tails of it.

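  This poster holds that there are three kinds of robots: spider, crawler and indexer. First the spider arrives by requesting the URI and reads the server's header information and the page's head tags. Then the crawler follows the links that the spider found on the page, visiting what is at the other end of each one. Finally the indexer reads the HTML code.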
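  The quoted post says the crawler follows links "within that domain (those that are found and allowed)". One possible reading of "allowed" is "not disallowed by robots.txt"; that reading is my assumption, not something the poster spells out. Under that assumption, the filtering step could be sketched in Python roughly as follows. The function name `allowed_same_domain_links`, the agent name `ExampleBot` and the example URLs are all hypothetical.

```python
from urllib.parse import urljoin, urlparse
from urllib.robotparser import RobotFileParser


def allowed_same_domain_links(page_url, hrefs, robots_lines, agent="ExampleBot"):
    """Keep only links on the page's own domain that robots.txt permits.

    robots_lines is the robots.txt content as a list of lines; passing it in
    directly (an assumption of this sketch) avoids any network access.
    """
    robots = RobotFileParser()
    robots.parse(robots_lines)
    domain = urlparse(page_url).netloc

    result = []
    for href in hrefs:
        absolute = urljoin(page_url, href)        # resolve relative links
        if urlparse(absolute).netloc != domain:   # "within that domain"
            continue
        if not robots.can_fetch(agent, absolute):  # "allowed"
            continue
        result.append(absolute)
    return result


links = allowed_same_domain_links(
    "http://example.com/index.html",
    ["/about.html", "/private/secret.html", "http://other.example.org/"],
    ["User-agent: *", "Disallow: /private/"],
)
```

  With these inputs only `/about.html` survives: the `/private/` link is disallowed by the robots rules, and the third link points to a different domain.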