PHP Classes

What is the best PHP web crawler class?: Get content from one site to store in my database



by kash - 7 years ago (2016-05-26)

Get content from one site to store in my database


I want an efficient Web crawler that can get the contents of a Web site.

  • 2 Clarification requests
  • 7. by David - 7 years ago (2016-06-04)

    Kash,

    Take a look at Goutte. I looked for it here but could not find it.

    It's a screen scraper that wraps Symfony's BrowserKit, CssSelector and DomCrawler components.

    You can crawl the whole document with Goutte, and because you have access to every element on the page, you can store the content in any storage medium (e.g. a database, the file system, ...).
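    Goutte itself needs a Composer install, but the extract-and-store idea described above can be sketched with PHP's built-in DOM extension; the HTML string and the $rows array (standing in for a database table) are made up for illustration:

```php
<?php
// Sketch of the extract-and-store idea: parse a page, reach any element
// via XPath, and collect what you need for storage. An in-memory array
// stands in for the database; the HTML is a made-up sample page.

$html = '<html><head><title>Example</title></head>'
      . '<body><h1>Hello</h1><p>First paragraph.</p><p>Second paragraph.</p></body></html>';

$doc = new DOMDocument();
$doc->loadHTML($html);
$xpath = new DOMXPath($doc);

$rows = [];                                 // stand-in for DB rows
$rows['title'] = $xpath->evaluate('string(//title)');
foreach ($xpath->query('//p') as $p) {      // every element on the page is reachable
    $rows['paragraphs'][] = trim($p->textContent);
}

print json_encode($rows);
```

    With Goutte the same filtering is done through its CSS-selector API on a page fetched over HTTP, but the storage step is identical: once you have the node text, writing it to a database is ordinary PHP.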

    • 1. by Manuel Lemos - 7 years ago (2016-05-27)

      Do you want the Web site page contents or just the links to the crawled page URLs?

      • 2. by kash - 7 years ago (2016-05-27) in reply to comment 1 by Manuel Lemos

        I want the website content.

      • 3. by Axel Hahn - 7 years ago (2016-05-27) in reply to comment 2 by kash

        You can use a non-blocking crawler - I used "rolling-curl" for my own crawler (but it is too early to make my crawler public). In a callback function you get the content; there you fetch the title and meta description from the head, and the body for the page content.
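A rough sketch of what such a callback could do with the fetched content, using only PHP's built-in DOM extension; extractPage and the sample HTML are illustrative, not part of rolling-curl:

```php
<?php
// What the callback would do with a fetched response: pull the title and
// meta description from <head> and the visible text from <body>.
// $html is a made-up stand-in for content delivered by the crawler.

function extractPage(string $html): array
{
    $doc = new DOMDocument();
    @$doc->loadHTML($html);                 // @ silences warnings on sloppy real-world HTML
    $xpath = new DOMXPath($doc);

    return [
        'title'       => $xpath->evaluate('string(//head/title)'),
        'description' => $xpath->evaluate('string(//head/meta[@name="description"]/@content)'),
        'body'        => trim($xpath->evaluate('string(//body)')),
    ];
}

$html = '<html><head><title>Demo</title>'
      . '<meta name="description" content="A demo page"></head>'
      . '<body>Some body text</body></html>';

$page = extractPage($html);
print json_encode($page);
```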

        If you want to follow links (recursive scan), you must parse the content to find new crawlable links. There you need to:
        - check that you stay on the domain
        - make relative links absolute
        - check whether the found URL was already added for crawling
        - check the depth of the URL path (maybe)

        If you want to fetch foreign domains, you should respect robots.txt and the index/follow rules in the HTML head and on a tags.
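The link checks listed above can be sketched in plain PHP with parse_url; absolutize and the sample URLs are illustrative names, and the sketch deliberately ignores ../ segments, query strings and robots.txt:

```php
<?php
// Sketch of the per-link checks: make relative links absolute, keep only
// same-domain URLs, and skip URLs already queued for crawling.

function absolutize(string $href, string $base): string
{
    if (parse_url($href, PHP_URL_SCHEME) !== null) {
        return $href;                              // already absolute
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host'];
    if ($href !== '' && $href[0] === '/') {
        return $origin . $href;                    // root-relative link
    }
    $dir = rtrim(dirname($parts['path'] ?? '/'), '/');
    return $origin . $dir . '/' . $href;           // path-relative link
}

$base   = 'https://example.com/docs/index.html';
$queued = [];                                      // URLs already scheduled
foreach (['/about', 'page2.html', 'https://other.org/x', '/about'] as $href) {
    $url = absolutize($href, $base);
    if (parse_url($url, PHP_URL_HOST) !== 'example.com') {
        continue;                                  // stay on the domain
    }
    if (isset($queued[$url])) {
        continue;                                  // already added for crawling
    }
    $queued[$url] = true;
}
print implode("\n", array_keys($queued));
```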

      • 4. by Manuel Lemos - 7 years ago (2016-05-27) in reply to comment 2 by kash

        I have seen crawlers that extract all site page links and store them in a database, but I have not yet found any that store the actual page contents.

      • 5. by Manuel Lemos - 7 years ago (2016-05-27) in reply to comment 3 by Axel Hahn

        Axel, there are some crawlers that retrieve the content. If your package is going to be able to store content in a database, that seems to be what kash wants.

      • 6. by scott Winterstein - 7 years ago (2016-05-31) in reply to comment 5 by Manuel Lemos

        You may not find one here. I have used HTTrack over the years for this kind of work.
