What is the best PHP web crawler clas...
by kash - 7 years ago (2016-05-26)
I want an efficient Web crawler that can get the contents of a Web site.
7. by David - 7 years ago (2016-06-04) Reply
Kash,
Take a look at Goutte. I looked for it but could not find it here.
It's a screen scraper that wraps BrowserKit, CssSelector and DomCrawler.
You can crawl the whole document using Goutte, and because you have access to every element on the page, you can store them in any storage medium (e.g. database, file system, ...).
1. by Manuel Lemos - 7 years ago (2016-05-27) Reply
Do you want the Web site page contents or just the links to the crawled page URLs?
2. by kash - 7 years ago (2016-05-27) in reply to comment 1 by Manuel Lemos Comment
I want the website content.
3. by Axel Hahn - 7 years ago (2016-05-27) in reply to comment 2 by kash Comment
You can use a non-blocking crawler. I used "rolling-curl" for my own crawler (but it is too early to make my crawler public). In a callback function you get the content, from which you extract the title and meta description from the head, and the body for the content.
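To illustrate what such a callback might do with a fetched page, here is a minimal sketch. It is written in Python with only the standard library, because the extraction logic does not depend on the language; it is not rolling-curl's API, and the names `PageExtractor` and `on_page_fetched` are hypothetical. The fetching itself is left out.

```python
from html.parser import HTMLParser


class PageExtractor(HTMLParser):
    """Collects <title>, the meta description and visible body text."""

    SKIPPED = {"script", "style"}  # their contents are never visible text

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self.body_parts = []
        self._stack = []  # open tags we care about, innermost last

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content") or ""
        elif tag in ("title", "body") or tag in self.SKIPPED:
            self._stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack and self._stack.pop() != tag:
                pass

    def handle_data(self, data):
        if not self._stack:
            return
        current = self._stack[-1]
        if current == "title":
            self.title += data
        elif current == "body" and data.strip():
            self.body_parts.append(data.strip())


def on_page_fetched(url, html):
    """Hypothetical callback: turn raw HTML into a record ready for storage."""
    parser = PageExtractor()
    parser.feed(html)
    return {
        "url": url,
        "title": parser.title.strip(),
        "description": parser.description,
        "content": " ".join(parser.body_parts),
    }
```

The returned dictionary could then be written to a database or a file. In PHP, the same extraction could be done with `DOMDocument`.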
If you want to follow links (recursive scan), you must parse the content to find new crawlable links. There you need to:
- check that you stay on the domain
- make relative links absolute
- check whether the found URL was already added for crawling
- check the depth of the URL path (maybe)
If you want to fetch foreign domains, you should respect robots.txt and the index and follow rules in the HTML head and in a tags.
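The link checks listed above can be sketched as follows. This is a rough illustration in Python's standard library, not code from any crawler mentioned in this thread. The names `LinkParser`, `new_links`, `path_depth` and `allowed_by_robots`, and the default `max_depth=3`, are assumptions made for the example. It skips links marked `rel="nofollow"` but does not handle the robots meta tag in the head.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse
from urllib.robotparser import RobotFileParser


class LinkParser(HTMLParser):
    """Collects href values of <a> tags, skipping rel="nofollow" links."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        rel = (attrs.get("rel") or "").lower().split()
        if attrs.get("href") and "nofollow" not in rel:
            self.hrefs.append(attrs["href"])


def path_depth(url):
    """Number of non-empty segments in the URL path."""
    return len([seg for seg in urlparse(url).path.split("/") if seg])


def new_links(page_url, html, seen, max_depth=3):
    """Return links found in `html` that pass all checks; updates `seen`."""
    parser = LinkParser()
    parser.feed(html)
    start_host = urlparse(page_url).netloc.lower()
    accepted = []
    for href in parser.hrefs:
        # Make relative links absolute and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # mailto:, javascript:, ...
        if parts.netloc.lower() != start_host:
            continue  # stay on the domain
        if path_depth(absolute) > max_depth:
            continue  # optional depth limit
        if absolute in seen:
            continue  # already queued or crawled
        seen.add(absolute)
        accepted.append(absolute)
    return accepted


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Check a URL against the text of an already downloaded robots.txt."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    return rules.can_fetch(user_agent, url)
```

A crawl loop would keep one `seen` set for the whole run, seed it with the start URL, and feed every page it fetches through `new_links` to grow the queue. PHP has no built-in equivalent of `urljoin`, so a port would combine `parse_url()` with its own logic for resolving relative paths.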
4. by Manuel Lemos - 7 years ago (2016-05-27) in reply to comment 2 by kash Comment
I have seen crawlers that extract all of a site's page links and store them in a database, but I have not yet found any that store the actual page content.
5. by Manuel Lemos - 7 years ago (2016-05-27) in reply to comment 3 by Axel Hahn Comment
Axel, there are some crawlers that retrieve the content. If your package is going to be able to store content in a database, that seems to be what kash wants.
6. by scott Winterstein - 7 years ago (2016-05-31) in reply to comment 5 by Manuel Lemos Comment
You may not find one. I have used HTTrack over my years of development.