Hudzilla.org - the homepage of Paul Hudson

15.1.2     Making a simple search engine

This is NOT the latest copy of this book; click here for the latest version.

We've now looked at fopen() and fsockopen(), both of which are great for reading in content from websites. However, thanks to the way streams work in PHP, you can read remote data with a huge selection of functions, right down to the relatively lowly file_get_contents(). To show off this functionality, I wrote a very simple search engine that spiders websites: it fetches each page, stores its contents in a MySQL table, and pulls out the hyperlinks to find more pages to fetch. The code is very simple and very naive. It's here to demonstrate a point, not to be a perfect search engine, so please don't base your own efforts on it!

<?php
    $urls = array("http://www.slashdot.org");
    $parsed = array();

    $sitesvisited = 0;

    mysql_connect("localhost", "phpuser", "alm65z");
    mysql_select_db("phpdb");

    mysql_query("DROP TABLE simplesearch;");
    mysql_query("CREATE TABLE simplesearch (URL CHAR(255), Contents TEXT);");
    mysql_query("ALTER TABLE simplesearch ADD FULLTEXT(Contents);");

    function parse_site() {
        global $urls, $parsed, $sitesvisited;

        $newsite = array_shift($urls);

        echo "\n Now parsing $newsite...\n";

        // the @ is because not all URLs are valid, and we don't want
        // lots of errors being printed out
        $ourtext = @file_get_contents($newsite);
        if (!$ourtext) return;

        $newsite = addslashes($newsite);
        $ourtext = addslashes($ourtext);

        mysql_query("INSERT INTO simplesearch VALUES ('$newsite', '$ourtext');");

        // this site has been successfully indexed; increment the counter
        ++$sitesvisited;

        
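        // this extracts all hyperlinks in the document
        preg_match_all("/http:\/\/[A-Z0-9_\-\.\/\?\#\=\&]*/i", $ourtext, $matches);

        // $matches[0] holds the full matches; $matches itself is never empty,
        // so it is the inner array that needs counting
        if (count($matches[0])) {
            $matches = $matches[0];
            $nummatches = count($matches);

            echo "Got $nummatches from $newsite\n";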

            // we want to ignore any URL that contains one of these strings
            $ignore = array(
                ".exe", ".zip", ".rar", ".wmv", ".wav", ".mp3", ".sit", ".mov",
                ".avi", ".msi", ".rpm", ".rm", ".ram", ".asf", ".mpg", ".mpeg",
                ".tar", ".tgz", ".bz2", ".deb", ".pdf", ".jpg", ".jpeg", ".gif",
                ".tif", ".png", ".swf", ".svg", ".bmp", ".dtd", ".xml", ".js",
                ".vbs", ".css", ".ico", ".rss", "w3.org",

                // yes, these next two are very vague, but they do cut out
                // the vast majority of advertising links.  Like I said,
                // this indexer is far from perfect!
                "ads.", "ad.",

                "doubleclick"
            );

            foreach ($matches as $match) {
                foreach ($ignore as $needle) {
                    // "continue 2" skips to the next URL in the outer loop
                    if (stripos($match, $needle) !== false) continue 2;
                }

                
                // this URL looks safe
                if (!in_array($match, $parsed)) { // we haven't already parsed this URL...
                    if (!in_array($match, $urls)) { // we don't already plan to parse this URL...
                        array_push($urls, $match);
                        echo "Adding $match...\n";
                    }
                }
            }
        } else {
            echo "Got no matches from $newsite\n";
        }

        // add this site to the list we've visited already
        $parsed[] = $newsite;
    }

    while ($sitesvisited < 500 && count($urls) != 0) {
        parse_site();

        // this stops us from overloading web servers
        sleep(5);
    }
?>
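
One weakness of the substring checks in the spider is that they match anywhere in the URL: ".rm" also rejects a host such as www.rmit.example, and ".js" rejects every .jsp page. What follows is a minimal sketch of a stricter filter that compares only the file extension of the path and checks the host name separately. The function should_skip_url() and its shortened lists are hypothetical and not part of the original script:

```php
<?php
// Hypothetical helper (not in the original script): returns true if the
// spider should ignore this URL.
function should_skip_url($url) {
    // a shortened version of the spider's ignore list, without the leading dots
    $skip_extensions = array("exe", "zip", "rar", "mp3", "avi", "pdf",
                             "jpg", "jpeg", "gif", "png", "js", "css", "rm");

    // host name fragments to ignore
    $skip_hosts = array("w3.org", "doubleclick");

    $parts = parse_url($url);
    if ($parts === false || !isset($parts['host'])) {
        return true; // malformed URL, nothing useful to fetch
    }

    foreach ($skip_hosts as $fragment) {
        if (stripos($parts['host'], $fragment) !== false) {
            return true;
        }
    }

    // pathinfo() returns whatever follows the last dot in the file name,
    // or an empty string if there is no extension
    $path = isset($parts['path']) ? $parts['path'] : "";
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));

    // in_array() compares whole values, so it needs the bare extension
    return in_array($extension, $skip_extensions, true);
}

var_dump(should_skip_url("http://example.com/files/setup.EXE"));
var_dump(should_skip_url("http://www.rmit.example/index.html"));
var_dump(should_skip_url("http://example.com/page.jsp"));
```

The first call is rejected because of its extension; the other two pass, whereas a plain substring test for ".rm" or ".js" would have thrown them away.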

The spider script is commented throughout, so it shouldn't be a problem to understand. It is pre-programmed to stop after indexing 500 URLs, but even that takes a while: the script is single-threaded and sleeps for five seconds after every page, so the pauses alone add up to 500 x 5 = 2,500 seconds, or roughly 42 minutes, before any download time is counted. Once you have run the script, you'll need a way to search through what it has indexed - here's the corresponding file:

<?php
    if (isset($_POST['criteria'])) {
        mysql_connect("localhost", "phpuser", "alm65z");
        mysql_select_db("phpdb");

        $criteria = addslashes($_POST['criteria']);

        $result = mysql_query("SELECT URL FROM simplesearch WHERE MATCH(Contents) AGAINST ('$criteria') ORDER BY URL ASC;");

        // $result is false if the query failed, so check it before counting rows
        if ($result && mysql_num_rows($result)) {
            echo "Search found the following matches...<BR /><BR />";

            echo "<UL>";

            while ($r = mysql_fetch_assoc($result)) {
                // escape the URL so stray quotes or angle brackets can't break the page
                $url = htmlspecialchars($r['URL']);
                echo "<LI><A HREF=\"$url\">$url</A></LI>";
            }

            echo "</UL>";
        } else {
            // show what the user typed, escaped for HTML rather than for SQL
            $shown = htmlspecialchars($_POST['criteria']);
            echo "No matches found for the criteria '$shown'.<BR /><BR />";
        }
    }
?>

<FORM METHOD="POST">
Search for: <INPUT TYPE="TEXT" NAME="criteria" />
<INPUT TYPE="SUBMIT" VALUE="Go" />
</FORM>
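
The mysql_* functions used in these scripts were removed in PHP 7.0, so on a current PHP the search page needs a different database API. Here is a minimal sketch using mysqli with a prepared statement, assuming the same simplesearch table and credentials as above. The helper names find_urls() and render_results() are hypothetical and introduced only for illustration:

```php
<?php
// Hypothetical helper: run the full-text search and return the matching URLs.
// The ? placeholder keeps the user's text out of the SQL string entirely,
// so no addslashes() call is needed.
function find_urls(mysqli $db, $criteria) {
    $stmt = $db->prepare(
        "SELECT URL FROM simplesearch WHERE MATCH(Contents) AGAINST (?) ORDER BY URL ASC");
    if (!$stmt) {
        return array(); // the query could not be prepared
    }

    $stmt->bind_param("s", $criteria);
    $stmt->execute();
    $stmt->bind_result($url);

    $urls = array();
    while ($stmt->fetch()) {
        $urls[] = $url;
    }

    $stmt->close();
    return $urls;
}

// Hypothetical helper: build the HTML for a result list, escaping every value.
function render_results(array $urls, $criteria) {
    if (count($urls) == 0) {
        return "No matches found for the criteria '"
             . htmlspecialchars($criteria) . "'.<BR /><BR />";
    }

    $html = "Search found the following matches...<BR /><BR /><UL>";
    foreach ($urls as $url) {
        $safe = htmlspecialchars($url);
        $html .= "<LI><A HREF=\"$safe\">$safe</A></LI>";
    }
    return $html . "</UL>";
}

// Usage, assuming the same database and credentials as the scripts above:
// $db = new mysqli("localhost", "phpuser", "alm65z", "phpdb");
// echo render_results(find_urls($db, $_POST['criteria']), $_POST['criteria']);
```

Keeping the query and the HTML in separate functions means the output code can be tried without a database connection.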

Anyway, that was just a short example to show how easy network programming is in PHP. Like I said, as a search engine it's about as simplistic as they come, and there are numerous problems in there. At the very least, a good search engine should cache the URLs of media items like MP3s and AVI files instead of ignoring them as this script does. Furthermore, 500 URLs take up about 16MB of disk space, which is an enormous amount for so little payback. There are almost certainly faster regular expressions for link matching, too. So, if you really want to make your own search engine, look somewhere else!
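
On the link-matching point, one alternative to tuning the regular expression is to let PHP's DOM extension parse the page and read the HREF attributes directly. This is a minimal sketch, assuming the DOM extension is available. The function extract_links() is a hypothetical helper, and unlike the spider above it only returns links that appear in <A> tags:

```php
<?php
// Hypothetical helper: return the unique absolute http:// links found in
// the <A> tags of an HTML document.
function extract_links($html) {
    if ($html === "") {
        return array(); // loadHTML() rejects an empty string
    }

    $doc = new DOMDocument();

    // real-world HTML is rarely valid, so collect parser warnings quietly
    // rather than printing them
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $links = array();
    foreach ($doc->getElementsByTagName("a") as $anchor) {
        $href = $anchor->getAttribute("href");

        // keep absolute http:// links only, as the regular expression did
        if (stripos($href, "http://") === 0) {
            $links[$href] = true; // array keys remove duplicates
        }
    }

    return array_keys($links);
}

$html = '<p><a href="http://example.com/a">A</a> '
      . '<A HREF="http://example.com/a">again</A> '
      . '<a href="/local">local</a></p>';

print_r(extract_links($html));
```

The sample page contains the same absolute link twice plus one relative link, so only a single URL comes back. Whether this is actually faster than the regular expression would need measuring; its main benefit is that it ignores URLs that merely appear in text or scripts.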






Comments from other readers
A PHP User - 29 Aug 2008

Hung,
Your problem is that it isn't connecting to the server at all. See your first error.

daevid@daevid.com - 29 Aug 2008

Just a tip here.

I think it would be more sane to put all those extensions in a big array and then use the php built in "in_array()"

if (in_array($match, $extArray)) continue;

rather than eight billion of these:

if (stripos($match, ".exe") !== false) continue;

lines of code. ;-)

but that's just me.

A PHP User - 29 Aug 2008

Hi,

I submitted my note, but without the error messages, so I am resubmitting
them below:


Warning: mssql_connect() [function.mssql-connect]: Unable to connect to server: localhost in C:\Program Files\xampp\htdocs\SAMPLE\WEB2\Search2\search.php on line 7

Warning: mssql_query() [function.mssql-query]: message: 'MATCH' is not a recognized function name. (severity 15) in C:\Program Files\xampp\htdocs\SAMPLE\WEB2\Search2\search.php on line 12

Warning: mssql_query() [function.mssql-query]: Query failed in C:\Program Files\xampp\htdocs\SAMPLE\WEB2\Search2\search.php on line 12

Warning: mssql_num_rows(): supplied argument is not a valid MS SQL-result resource in C:\Program Files\xampp\htdocs\SAMPLE\WEB2\Search2\search.php on line 15
No matches found for the criteria 'search'.


Thanks,
Hung

A PHP User - 29 Aug 2008

Hi,

I am a new PHP user.
I am using MS SQL Server, so I modified the code from
mysql_query() to mssql_query(), because I do
not have MySQL installed on my system.

I got some errors, and it seems the MATCH function
does not work for MS SQL.

Any help resolving the problem is appreciated.

Thanks,
Hung

my email address: hungcphan@gmail.com
