Admun's Tech Journey

thoughts, ideas, projects, and discoveries on technologies

Sep

A simple web crawler in PHP

I was talking to a colleague about how to write a web crawler, and spent an hour to spit out this simple piece of code as a proof of concept.




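<?php
// a simple web crawler example
// v1.1
//

$toplink = "http://edmondhui.homeip.net/blog";
$depth = 2;

walkIt($toplink, 1);

// Fetch $link, print every absolute http:// link found on it, and recurse
// into each one until $depth levels have been visited.
function walkIt($link, $level) {
	global $depth;
	if ($level <= $depth) {
		// file() returns the page as an array of lines, or false on failure
		$content = file($link);
		if ($content !== false) {
			$content = html_entity_decode(implode($content));
			//
			// IF NEEDED, PROCESS THE PAGE CONTENTS AND SAVE TO DB HERE
			//
			// collect every double-quoted href="..." attribute
			preg_match_all("/href=\".*?\"/i", $content, $matches);
			$matches = array_unique($matches[0]);
			foreach ($matches as $idx => $url) {
				// strip the leading href=" and the trailing quote
				$url = substr($url, 6, -1);
				// follow only URLs that start with http://
				if (strpos($url, "http://") === 0) {
					// indent the output by the current depth
					for ($i = 0; $i < $level; $i++) { echo '  '; }
					echo "$url\n";
					walkIt($url, $level+1);
				}
			}
		} else {
			echo "failed to crawl - $link\n";
		}
	}
}
?>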
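
The link-extraction step inside walkIt() can also be tried on its own, without touching the network. What follows is a minimal sketch rather than part of the original script: extractLinks() is a hypothetical helper name, and the sample HTML is made up for illustration. It applies the same href pattern as walkIt() to a string and keeps only URLs that start with http://.

```php
<?php
// Hypothetical helper (not in the original script): return the unique
// absolute http:// links found in double-quoted href attributes of $html.
function extractLinks($html) {
	$html = html_entity_decode($html);
	// same pattern as walkIt(): lazily match href="..."
	preg_match_all("/href=\".*?\"/i", $html, $matches);
	$links = array();
	foreach ($matches[0] as $attr) {
		// strip the leading href=" (6 characters) and the trailing quote
		$url = substr($attr, 6, -1);
		// keep only URLs that start with http://
		if (strpos($url, "http://") === 0) {
			$links[] = $url;
		}
	}
	// de-duplicate after stripping, so href= and HREF= count as the same link
	return array_values(array_unique($links));
}

// Made-up sample page, for illustration only
$sample = '<a href="http://example.com/a">A</a> '
	. '<a href="/relative">B</a> '
	. '<a HREF="http://example.com/a">A again</a> '
	. '<a href="http://example.com/b?x=1&amp;y=2">C</a>';

print_r(extractLinks($sample));
```

On this sample the helper returns two links: the relative link is skipped, and the repeated link is collapsed even though its attribute name is written in upper case. Separating extraction from fetching like this is one possible way to check the regex before pointing the crawler at a live site.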

Next exercise: write this in Python!

About Me

admun My passion has always been software development, and I have known it since I wrote my first program on an Apple II. I worked on cellular wireless systems in the past (C/C++) and now focus on web applications (LAMP, PHP, MySQL, CakePHP, Symfony, jQuery, Google App Engine/Python).

Powered by LMNucleus CMS v3.66 | Copyright Edmond Hui