Admun's Tech Journey

thoughts, ideas, projects, and discoveries on technologies


21 Sep

A simple web crawler in PHP

Posted by Admun  Tags: php, web, crawler
I was talking to a colleague about how to write a web crawler, and spent an hour spitting out this simple piece of code as a proof of concept.

<?php
// A simple web crawler example
// v1.1

$toplink = "http://edmondhui.homeip.net/blog";
$depth = 2;

walkIt($toplink, 1);

// Fetch $link, print every absolute http:// link found on the page, and
// recurse into each link until $depth levels have been crawled.
function walkIt($link, $level) {
    global $depth;
    if ($level > $depth) {
        return;
    }

    // file_get_contents() returns false when the page cannot be fetched
    $content = file_get_contents($link);
    if ($content === false) {
        echo "failed to crawl - $link\n";
        return;
    }
    $content = html_entity_decode($content);

    //
    // IF NEEDED, PROCESS THE PAGE CONTENTS AND SAVE THEM TO THE DB HERE
    //

    // Capture the value of every double-quoted href attribute
    preg_match_all('/href="(.*?)"/i', $content, $matches);
    foreach (array_unique($matches[1]) as $url) {
        // Follow only links that start with http://
        if (strpos($url, "http://") === 0) {
            // Indent by one space per level to show the crawl depth
            echo str_repeat(' ', $level) . "$url\n";
            walkIt($url, $level + 1);
        }
    }
}
?>


Next exercise: write this in Python!
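As a starting point for that exercise, here is a rough Python 3 sketch of the same idea. It is only one possible translation, not a finished port: the helper names fetch_page and extract_links and the fetch parameter are my own additions for illustration, and like the PHP version it only follows double-quoted href values that start with http://.

```python
import html
import re
import urllib.request

# Same pattern as the PHP version: the value of a double-quoted href attribute
HREF_RE = re.compile(r'href="(.*?)"', re.IGNORECASE)


def fetch_page(link):
    """Return the page body as text, or None if the request fails."""
    try:
        with urllib.request.urlopen(link, timeout=10) as response:
            return response.read().decode("utf-8", errors="replace")
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError; a malformed URL is ValueError
        return None


def extract_links(content):
    """Return the unique http:// links in content, in page order."""
    urls = HREF_RE.findall(html.unescape(content))
    # dict.fromkeys drops duplicates while keeping first-seen order
    return [url for url in dict.fromkeys(urls) if url.startswith("http://")]


def walk_it(link, level, depth, fetch=fetch_page):
    """Print the links found on link, recursing until depth is exceeded."""
    if level > depth:
        return
    content = fetch(link)
    if content is None:
        print(f"failed to crawl - {link}")
        return
    #
    # If needed, process the page contents and save them to the DB here
    #
    for url in extract_links(content):
        # Indent by one space per level to show the crawl depth
        print(" " * level + url)
        walk_it(url, level + 1, depth, fetch)
```

Calling walk_it("http://edmondhui.homeip.net/blog", 1, depth=2) mirrors the PHP call above. Passing the fetch function in as a parameter is just a convenience so the crawler can be tried against canned pages without touching the network.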

About Me

admun: My passion has always been software development, and I have known it since I wrote my first program on an Apple II. I worked on cellular wireless systems in the past (C/C++) and now focus on web applications (LAMP, PHP, MySQL, CakePHP, Symfony, jQuery, Google App Engine/Python).


Powered by LMNucleus CMS v3.66 | Copyright Edmond Hui