PHP Cookbook/Web Automation

Introduction

Most of the time, PHP is part of a web server, sending content to browsers. Even when you run it from the command line, it usually performs a task and then prints some output. PHP can also be useful, however, playing the role of a web browser — retrieving URLs and then operating on the content. Most recipes in this chapter cover retrieving URLs and processing the results, although there are a few other tasks in here as well, such as using templates and processing server logs.

There are four ways to retrieve a remote URL in PHP: fopen( ), fsockopen( ), the cURL extension, and the HTTP_Request class from PEAR. Choosing one method over another depends on your needs for simplicity, control, and portability.

Using fopen( ) is simple and convenient. We discuss it in Recipe 11.2. The fopen( ) function automatically follows redirects, so if you use this function to retrieve the directory http://www.example.com/people and the server redirects you to http://www.example.com/people/, you'll get the contents of the directory index page, not a message telling you that the URL has moved. The fopen( ) function also works with both HTTP and FTP. The downsides to fopen( ) include: it can handle only HTTP GET requests (not HEAD or POST), you can't send additional headers or any cookies with the request, and you can retrieve only the response body with it, not response headers.

Using fsockopen( ) requires more work but gives you more flexibility. We use fsockopen( ) in Recipe 11.3. After opening a socket with fsockopen( ), you need to print the appropriate HTTP request to that socket and then read and parse the response. This lets you add headers to the request and gives you access to all the response headers. However, you need to have additional code to properly parse the response and take any appropriate action, such as following a redirect.

If you have access to the cURL extension or PEAR's HTTP_Request class, you should use those rather than fsockopen( ). cURL supports a number of different protocols (including HTTPS, discussed in Recipe 11.6) and gives you access to response headers. We use cURL in most of the recipes in this chapter. To use cURL, you must have the cURL library installed, available at http://curl.haxx.se. Also, PHP must be built with the --with-curl configuration option.

PEAR's HTTP_Request class, which we use in Recipe 11.3, Recipe 11.4, and Recipe 11.5, doesn't support HTTPS, but does give you access to headers and can use any HTTP method. If this PEAR module isn't installed on your system, you can download it from http://pear.php.net/get/HTTP_Request. As long as the module's files are in your include_path, you can use it, making it a very portable solution.

Recipe 11.7 helps you go behind the scenes of an HTTP request to examine the headers in a request and response. If a request you're making from a program isn't giving you the results you're looking for, examining the headers often provides clues as to what's wrong.

Once you've retrieved the contents of a web page into a program, use Recipe 11.8 through Recipe 11.12 to help you manipulate those page contents. Recipe 11.8 demonstrates how to mark up certain words in a page with blocks of color. This technique is useful for highlighting search terms, for example. Recipe 11.9 provides a function to find all the links in a page. This is an essential building block for a web spider or a link checker. Converting between plain ASCII and HTML is covered in Recipe 11.10 and Recipe 11.11. Recipe 11.12 shows how to remove all HTML and PHP tags from a web page.

Another kind of page manipulation is using a templating system. Discussed in Recipe 11.13, templates give you freedom to change the look and feel of your web pages without changing the PHP plumbing that populates the pages with dynamic data. Similarly, you can make changes to the code that drives the pages without affecting the look and feel. Recipe 11.14 discusses a common server administration task — parsing your web server's access log files.

Two sample programs use the link extractor from Recipe 11.9. The program in Recipe 11.15 scans the links in a page and reports which are still valid, which have been moved, and which no longer work. The program in Recipe 11.16 reports on the freshness of links. It tells you when a linked-to page was last modified and if it's been moved.

Fetching a URL with the GET Method

Problem

You want to retrieve the contents of a URL. For example, you want to include part of one web page in another page's content.

Solution

Pass the URL to fopen( ) and get the contents of the page with fread( ):

$page = '';
$fh = fopen('http://www.example.com/robots.txt','r') or die($php_errormsg);
while (! feof($fh)) {
    $page .= fread($fh,1048576);
}
fclose($fh);

You can use the cURL extension:

$c = curl_init('http://www.example.com/robots.txt');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);

You can also use the HTTP_Request class from PEAR:

require 'HTTP/Request.php';

$r = new HTTP_Request('http://www.example.com/robots.txt');
$r->sendRequest();
$page = $r->getResponseBody();

Discussion

You can put a username and password in the URL if you need to retrieve a protected page. In this example, the username is david, and the password is hax0r. Here's how to do it with fopen( ):

$page = '';
$fh = fopen('http://david:hax0r@www.example.com/secrets.html','r')
    or die($php_errormsg);
while (! feof($fh)) {
    $page .= fread($fh,1048576);
}
fclose($fh);

Here's how to do it with cURL:

$c = curl_init('http://www.example.com/secrets.html');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_USERPWD, 'david:hax0r');
$page = curl_exec($c);
curl_close($c);

Here's how to do it with HTTP_Request:

$r = new HTTP_Request('http://www.example.com/secrets.html');
$r->setBasicAuth('david','hax0r');
$r->sendRequest();
$page = $r->getResponseBody();

While fopen( ) follows redirects in Location response headers, HTTP_Request does not. cURL follows them only when the CURLOPT_FOLLOWLOCATION option is set:

$c = curl_init('http://www.example.com/directory');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_FOLLOWLOCATION, 1);
$page = curl_exec($c);
curl_close($c);

cURL can do a few different things with the page it retrieves. If the CURLOPT_RETURNTRANSFER option is set, curl_exec( ) returns a string containing the page:

$c = curl_init('http://www.example.com/files.html');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);

To write the retrieved page to a file, open a file handle for writing with fopen( ) and set the CURLOPT_FILE option to the file handle:

$fh = fopen('local-copy-of-files.html','w') or die($php_errormsg);
$c = curl_init('http://www.example.com/files.html');
curl_setopt($c, CURLOPT_FILE, $fh);
curl_exec($c);
curl_close($c);

To pass the cURL resource and the contents of the retrieved page to a function, set the CURLOPT_WRITEFUNCTION option to the name of the function:

// save the URL and the page contents in a database
function save_page($c,$page) {
    $info = curl_getinfo($c);
    mysql_query("INSERT INTO pages (url,page) VALUES ('" .
                mysql_escape_string($info['url']) . "', '" .
                mysql_escape_string($page) . "')");
    // the write callback must return the number of bytes it handled,
    // or cURL aborts the transfer
    return strlen($page);
}

$c = curl_init('http://www.example.com/files.html');
curl_setopt($c, CURLOPT_WRITEFUNCTION, 'save_page');
curl_exec($c);
curl_close($c);

If none of CURLOPT_RETURNTRANSFER, CURLOPT_FILE, or CURLOPT_WRITEFUNCTION is set, cURL prints out the contents of the returned page.

The fopen() function and the include and require directives can retrieve remote files only if URL fopen wrappers are enabled. URL fopen wrappers are enabled by default and are controlled by the allow_url_fopen configuration directive. On Windows, however, include and require can't retrieve remote files in versions of PHP earlier than 4.3, even if allow_url_fopen is on.
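
If you're not sure how a particular server is configured, you can check the directive before relying on it. This is a minimal sketch (the fallback message is illustrative):

if (ini_get('allow_url_fopen')) {
    $page = '';
    $fh = fopen('http://www.example.com/robots.txt','r') or die($php_errormsg);
    while (! feof($fh)) {
        $page .= fread($fh,1048576);
    }
    fclose($fh);
} else {
    die('URL fopen wrappers are disabled; use cURL or HTTP_Request instead');
}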

See Also

Recipe 11.3 for fetching a URL with the POST method; Recipe 8.13 discusses opening remote files with fopen(); documentation on fopen( ) at http://www.php.net/fopen, include at http://www.php.net/include, curl_init( ) at http://www.php.net/curl-init, curl_setopt( ) at http://www.php.net/curl-setopt, curl_exec( ) at http://www.php.net/curl-exec, and curl_close( ) at http://www.php.net/curl-close; the PEAR HTTP_Request class at http://pear.php.net/package-info.php?package=HTTP_Request.

Fetching a URL with the POST Method

Problem

You want to retrieve a URL with the POST method, not the default GET method. For example, you want to submit an HTML form.

Solution

Use the cURL extension with the CURLOPT_POST option set:

$c = curl_init('http://www.example.com/submit.php');
curl_setopt($c, CURLOPT_POST, 1);
curl_setopt($c, CURLOPT_POSTFIELDS, 'monkey=uncle&rhino=aunt');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);

If the cURL extension isn't available, use the PEAR HTTP_Request class:

require 'HTTP/Request.php';

$r = new HTTP_Request('http://www.example.com/submit.php');
$r->setMethod(HTTP_REQUEST_METHOD_POST);
$r->addPostData('monkey','uncle');
$r->addPostData('rhino','aunt');
$r->sendRequest();
$page = $r->getResponseBody();

Discussion

Sending a POST method request requires special handling of any arguments. In a GET request, these arguments are in the query string, but in a POST request, they go in the request body. Additionally, the request needs a Content-Length header that tells the server the size of the content to expect in the request body.

Because of the argument handling and additional headers, you can't use fopen( ) to make a POST request. If neither cURL nor HTTP_Request is available, use the pc_post_request( ) function, shown in Example 11-1, which makes the connection to the remote web server with fsockopen( ).

Example 11-1. pc_post_request( )

function pc_post_request($host,$url,$content='') {
    $timeout = 2;
    $a = array();
    if (is_array($content)) {
        foreach ($content as $k => $v) {
            array_push($a,urlencode($k).'='.urlencode($v));
        }
    }
    $content_string = join('&',$a);
    $content_length = strlen($content_string);
    $request_body = "POST $url HTTP/1.0
Host: $host
Content-type: application/x-www-form-urlencoded
Content-length: $content_length

$content_string";

    $sh = fsockopen($host,80,$errno,$errstr,$timeout)
        or die("can't open socket to $host: $errno $errstr");

    fputs($sh,$request_body);
    $response = '';
    while (! feof($sh)) {
        $response .= fread($sh,16384);
    }
    fclose($sh) or die("Can't close socket handle: $php_errormsg");

    list($response_headers,$response_body) = explode("\r\n\r\n",$response,2);
    $response_header_lines = explode("\r\n",$response_headers);
        
    // first line of headers is the HTTP response code
    $http_response_line = array_shift($response_header_lines);
    if (preg_match('@^HTTP/[0-9]\.[0-9] ([0-9]{3})@',$http_response_line,
                   $matches)) {
        $response_code = $matches[1];
    }

    // put the rest of the headers in an array 
    $response_header_array = array();
    foreach ($response_header_lines as $header_line) {
        list($header,$value) = explode(': ',$header_line,2);
        $response_header_array[$header] = $value;
    }
    
    return array($response_code,$response_header_array,$response_body);
}

Call pc_post_request( ) like this:

list($code,$headers,$body) = pc_post_request('www.example.com','/submit.php',
                                             array('monkey' => 'uncle',
                                                   'rhino' => 'aunt'));

Retrieving a URL with POST instead of GET is especially useful if the URL is very long, more than 200 characters or so. The HTTP 1.1 specification in RFC 2616 doesn't place a maximum length on URLs, so behavior varies among different web and proxy servers. If you retrieve URLs with GET and receive unexpected results or results with status code 414 ("Request-URI Too Long"), convert the request to a POST request.
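
For example, if a search form produces an unwieldy query string, you can move those arguments into the request body with cURL. This is a minimal sketch; the URL and the $args array are illustrative:

// arguments that would otherwise go in a long query string
$args = array('q' => $search_term, 'page' => 1);
$pairs = array();
foreach ($args as $k => $v) {
    $pairs[] = urlencode($k).'='.urlencode($v);
}

$c = curl_init('http://www.example.com/search.php');
curl_setopt($c, CURLOPT_POST, 1);
curl_setopt($c, CURLOPT_POSTFIELDS, join('&',$pairs));
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);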

See Also

Recipe 11.2 for fetching a URL with the GET method; documentation on curl_setopt( ) at http://www.php.net/curl-setopt and fsockopen( ) at http://www.php.net/fsockopen; the PEAR HTTP_Request class at http://pear.php.net/package-info.php?package=HTTP_Request; RFC 2616 is available at http://www.faqs.org/rfcs/rfc2616.html.

Fetching a URL with Cookies

Problem

You want to retrieve a page that requires a cookie to be sent with the request for the page.

Solution

Use the cURL extension and the CURLOPT_COOKIE option:

$c = curl_init('http://www.example.com/needs-cookies.php');
curl_setopt($c, CURLOPT_VERBOSE, 1);
curl_setopt($c, CURLOPT_COOKIE, 'user=ellen; activity=swimming');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);

If cURL isn't available, use the addHeader( ) method in the PEAR HTTP_Request class:

require 'HTTP/Request.php';

$r = new HTTP_Request('http://www.example.com/needs-cookies.php');
$r->addHeader('Cookie','user=ellen; activity=swimming');
$r->sendRequest();
$page = $r->getResponseBody();

Discussion

Cookies are sent to the server in the Cookie request header. The cURL extension has a cookie-specific option, but with HTTP_Request, you have to add the Cookie header just as with other request headers. Multiple cookie values are sent in a semicolon-delimited list. The examples in the Solution send two cookies: one named user with value ellen and one named activity with value swimming.

To request a page that sets cookies and then make subsequent requests that include those newly set cookies, use cURL's "cookie jar" feature. On the first request, set CURLOPT_COOKIEJAR to the name of a file to store the cookies in. On subsequent requests, set CURLOPT_COOKIEFILE to the same filename, and cURL reads the cookies from the file and sends them along with the request. This is especially useful for a sequence of requests in which the first request logs into a site that sets session or authentication cookies, and then the rest of the requests need to include those cookies to be valid:

$cookie_jar = tempnam('/tmp','cookie');

// log in
$c = curl_init('https://bank.example.com/login.php?user=donald&password=b1gmoney$');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_COOKIEJAR, $cookie_jar);
$page = curl_exec($c);
curl_close($c);

// retrieve account balance
$c = curl_init('http://bank.example.com/balance.php?account=checking');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_COOKIEFILE, $cookie_jar);
$page = curl_exec($c);
curl_close($c);

// make a deposit
$c = curl_init('http://bank.example.com/deposit.php');
curl_setopt($c, CURLOPT_POST, 1);
curl_setopt($c, CURLOPT_POSTFIELDS, 'account=checking&amount=122.44');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_COOKIEFILE, $cookie_jar);
$page = curl_exec($c);
curl_close($c);

// remove the cookie jar
unlink($cookie_jar) or die("Can't unlink $cookie_jar");

Be careful where you store the cookie jar. It needs to be in a place your web server has write access to, but if other users can read the file, they may be able to poach the authentication credentials stored in the cookies.
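
On a UNIX system, one precaution is to tighten the file's permissions as soon as you create it. This is a minimal sketch; tempnam( ) already creates the file readable only by its owner on most systems, so the chmod( ) is insurance:

$cookie_jar = tempnam('/tmp','cookie');
chmod($cookie_jar, 0600) or die("Can't set permissions on $cookie_jar");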

See Also

Documentation on curl_setopt( ) at http://www.php.net/curl-setopt; the PEAR HTTP_Request class at http://pear.php.net/package-info.php?package=HTTP_Request

Fetching a URL with Headers

Problem

You want to retrieve a URL that requires specific headers to be sent with the request for the page.

Solution

Use the cURL extension and the CURLOPT_HTTPHEADER option:

$c = curl_init('http://www.example.com/special-header.php');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_HTTPHEADER, array('X-Factor: 12', 'My-Header: Bob'));
$page = curl_exec($c);
curl_close($c);

If cURL isn't available, use the addHeader( ) method in HTTP_Request:

require 'HTTP/Request.php';

$r = new HTTP_Request('http://www.example.com/special-header.php');
$r->addHeader('X-Factor',12);
$r->addHeader('My-Header','Bob');
$r->sendRequest();
$page = $r->getResponseBody();

Discussion

cURL has special options for setting the Referer and User-Agent request headers — CURLOPT_REFERER and CURLOPT_USERAGENT:

$c = curl_init('http://www.example.com/submit.php');
curl_setopt($c, CURLOPT_VERBOSE, 1);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_REFERER, 'http://www.example.com/form.php');
curl_setopt($c, CURLOPT_USERAGENT, 'CURL via PHP');
$page = curl_exec($c);
curl_close($c);

See Also

Recipe 11.14 explains why "referrer" is often misspelled "referer" in web programming contexts; documentation on curl_setopt( ) at http://www.php.net/curl-setopt; the PEAR HTTP_Request class at http://pear.php.net/package-info.php?package=HTTP_Request.

Fetching an HTTPS URL

Problem

You want to retrieve a secure URL.

Solution

Use the cURL extension with an HTTPS URL:

$c = curl_init('https://secure.example.com/accountbalance.php');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);

Discussion

To retrieve secure URLs, the cURL extension needs access to an SSL library, such as OpenSSL. This library must be available when PHP and the cURL extension are built. Aside from this additional library requirement, cURL treats secure URLs just like regular ones. You can provide the same cURL options to secure requests, such as changing the request method or adding POST data.
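
For example, the POST technique from Recipe 11.3 works unchanged over SSL. This is a minimal sketch; the URL and form fields are illustrative:

$c = curl_init('https://secure.example.com/login.php');
curl_setopt($c, CURLOPT_POST, 1);
curl_setopt($c, CURLOPT_POSTFIELDS, 'user=ellen&password=s3cret');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);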

See Also

The OpenSSL Project at http://www.openssl.org/.

Debugging the Raw HTTP Exchange

Problem

You want to analyze the HTTP request a browser makes to your server and the corresponding HTTP response. For example, your server doesn't supply the expected response to a particular request, so you want to see exactly what the components of the request are.

Solution

For simple requests, connect to the web server with telnet and type in the request headers:

% telnet www.example.com 80
Trying 10.1.1.1...
Connected to www.example.com.
Escape character is '^]'.
GET / HTTP/1.0
Host: www.example.com

HTTP/1.1 200 OK
Date: Sat, 17 Aug 2002 06:10:19 GMT
Server: Apache/1.3.26 (Unix) PHP/4.2.2 mod_ssl/2.8.9 OpenSSL/0.9.6d
X-Powered-By: PHP/4.2.2
Connection: close
Content-Type: text/html

// ... the page body ...

Discussion

When you type in request headers, the web server doesn't know that it's just you typing and not a web browser submitting a request. However, some web servers have timeouts on how long they'll wait for a request, so it can be useful to pretype the request and then just paste it into telnet. The first line of the request contains the request method (GET), a space, the path of the file you want (/), a space, and the protocol you're using (HTTP/1.0). The next line, the Host header, tells the server which virtual host to use if many are sharing the same IP address. A blank line tells the server that the request is over; it then spits back its response: first headers, then a blank line, and then the body of the response.
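
You can also script the same exchange with fsockopen( ), which avoids racing the server's timeout. This is a minimal sketch of the request shown above:

$sh = fsockopen('www.example.com',80,$errno,$errstr,30)
    or die("Can't connect: $errno $errstr");
fputs($sh,"GET / HTTP/1.0\r\nHost: www.example.com\r\n\r\n");
// the response comes back as headers, a blank line, and then the body
while (! feof($sh)) {
    print fgets($sh,16384);
}
fclose($sh);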

Pasting text into telnet can get tedious, and it's even harder to make requests with the POST method that way. If you make a request with HTTP_Request, you can retrieve the response headers and the response body with the getResponseHeader( ) and getResponseBody( ) methods:

require 'HTTP/Request.php';

$r = new HTTP_Request('http://www.example.com/submit.php');
$r->setMethod(HTTP_REQUEST_METHOD_POST);
$r->addPostData('monkey','uncle');
$r->sendRequest();

$response_headers = $r->getResponseHeader();
$response_body    = $r->getResponseBody();

To retrieve a specific response header, pass the header name to getResponseHeader( ). Without an argument, getResponseHeader( ) returns an array containing all the response headers. HTTP_Request doesn't save the outgoing request in a variable, but you can reconstruct it by calling the private _buildRequest( ) method:

require 'HTTP/Request.php';

$r = new HTTP_Request('http://www.example.com/submit.php');
$r->setMethod(HTTP_REQUEST_METHOD_POST);
$r->addPostData('monkey','uncle');

print $r->_buildRequest();

The request that's printed is:

POST /submit.php HTTP/1.1
User-Agent: PEAR HTTP_Request class ( http://pear.php.net/ )
Content-Type: application/x-www-form-urlencoded
Connection: close
Host: www.example.com
Content-Length: 12

monkey=uncle

With cURL, to include response headers in the output from curl_exec( ), set the CURLOPT_HEADER option:

$c = curl_init('http://www.example.com/submit.php');
curl_setopt($c, CURLOPT_HEADER, 1);
curl_setopt($c, CURLOPT_POST, 1);
curl_setopt($c, CURLOPT_POSTFIELDS, 'monkey=uncle&rhino=aunt');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$response_headers_and_page = curl_exec($c);
curl_close($c);

To write the response headers directly to a file, open a file handle with fopen( ) and set CURLOPT_WRITEHEADER to that file handle:

$fh = fopen('/tmp/curl-response-headers.txt','w') or die($php_errormsg);
$c = curl_init('http://www.example.com/submit.php');
curl_setopt($c, CURLOPT_POST, 1);
curl_setopt($c, CURLOPT_POSTFIELDS, 'monkey=uncle&rhino=aunt');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_WRITEHEADER, $fh);
$page = curl_exec($c);
curl_close($c);
fclose($fh) or die($php_errormsg);

The cURL module's CURLOPT_VERBOSE option causes curl_exec( ) and curl_close( ) to print out debugging information to standard error, including the contents of the request:

$c = curl_init('http://www.example.com/submit.php');
curl_setopt($c, CURLOPT_VERBOSE, 1);
curl_setopt($c, CURLOPT_POST, 1);
curl_setopt($c, CURLOPT_POSTFIELDS, 'monkey=uncle&rhino=aunt');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
$page = curl_exec($c);
curl_close($c);

This prints:

* Connected to www.example.com (10.1.1.1)
> POST /submit.php HTTP/1.1
Host: www.example.com
Pragma: no-cache
Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, */*
Content-Length: 23
Content-Type: application/x-www-form-urlencoded

monkey=uncle&rhino=aunt* Connection #0 left intact
* Closing connection #0

Because cURL prints the debugging information to standard error and not standard output, it can't be captured with output buffering, as Recipe 10.11 does with print_r( ). You can, however, open a file handle for writing and set CURLOPT_STDERR to that file handle to divert the debugging information to a file:

$fh = fopen('/tmp/curl.out','w') or die($php_errormsg);
$c = curl_init('http://www.example.com/submit.php');
curl_setopt($c, CURLOPT_VERBOSE, 1);
curl_setopt($c, CURLOPT_POST, 1);
curl_setopt($c, CURLOPT_POSTFIELDS, 'monkey=uncle&rhino=aunt');
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_STDERR, $fh);
$page = curl_exec($c);
curl_close($c);
fclose($fh) or die($php_errormsg);

See Also

Recipe 10.11 for output buffering; documentation on curl_setopt( ) at http://www.php.net/curl-setopt; the PEAR HTTP_Request class at http://pear.php.net/package-info.php?package=HTTP_Request; the syntax of an HTTP request is defined in RFC 2616 and available at http://www.faqs.org/rfcs/rfc2616.html.

Marking Up a Web Page

Problem

You want to display a web page, for example a search result, with certain words highlighted.

Solution

Use preg_replace( ) with an array of patterns and replacements:

$patterns = array('/\bdog\b/', '/\bcat\b/');
$replacements = array('<b style="color:black;background-color:#FFFF00">dog</b>',
                      '<b style="color:black;background-color:#FF9900">cat</b>');
while ($page) {
    if (preg_match('{^([^<]*)?(</?[^>]+?>)?(.*)$}s',$page,$matches)) {
        print preg_replace($patterns,$replacements,$matches[1]);
        print $matches[2];
        $page = $matches[3];
    }
}

Discussion

The regular expression used with preg_match( ) matches as much text as possible before an HTML tag, then an HTML tag, and then the rest of the content. The text before the HTML tag has the highlighting applied to it, the HTML tag is printed out without any highlighting, and the rest of the content has the same match applied to it. This prevents any highlighting of words that occur inside HTML tags (in URLs or alt text, for example), which would keep the page from displaying properly.

The following program retrieves the URL in $url and highlights the words in the $words array. Words are not highlighted when they are part of larger words, because each pattern is anchored with \b, the Perl-compatible regular expression word-boundary assertion.

$colors = array('FFFF00','FF9900','FF0000','FF00FF',
                '99FF33','33FFCC','FF99FF','00CC33'); 

// build search and replace patterns for regex 
$patterns = array();
$replacements = array();
for ($i = 0, $j = count($words); $i < $j; $i++) {
    $patterns[$i] = '/\b'.preg_quote($words[$i], '/').'\b/';
    $replacements[$i] = '<b style="color:black;background-color:#' .
                         $colors[$i % 8] .'">' . $words[$i] . '</b>';
}

// retrieve page
$s = '';
$fh = fopen($url,'r') or die($php_errormsg);
while (! feof($fh)) {
    $s .= fread($fh,4096);
}
fclose($fh);

if ($j) {
    while ($s) {
        if (preg_match('{^([^<]*)?(</?[^>]+?>)?(.*)$}s',$s,$matches)) {
            print preg_replace($patterns,$replacements,$matches[1]);
            print $matches[2];
            $s = $matches[3];
        }
    }
} else {
    print $s;
}

See Also

Recipe 13.8 for information on capturing text inside HTML tags; documentation on preg_match( ) at http://www.php.net/preg-match and preg_replace( ) at http://www.php.net/preg-replace.

Extracting Links from an HTML File

Problem

You need to extract the URLs that are specified inside an HTML document.

Solution

Use the pc_link_extractor( ) function shown in Example 11-2.

Example 11-2. pc_link_extractor( )

function pc_link_extractor($s) {
  $a = array();
  if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',
                     $s,$matches,PREG_SET_ORDER)) {
    foreach($matches as $match) {
      array_push($a,array($match[1],$match[2]));
    }
  }
  return $a;
}

For example:

$links = pc_link_extractor($page);

Discussion

The pc_link_extractor( ) function returns an array. Each element of that array is itself a two-element array. The first element is the target of the link, and the second element is the text that is linked. For example:

$links=<<<END
Click <a href="http://www.oreilly.com">here</a> to visit a computer book 
publisher. Click <a href="http://www.sklar.com">over here</a> to visit 
a computer book author.
END;

$a = pc_link_extractor($links);
print_r($a);
Array
(
    [0] => Array
        (
            [0] => http://www.oreilly.com
            [1] => here
        )

    [1] => Array
        (
            [0] => http://www.sklar.com
            [1] => over here
        )

)

The regular expression in pc_link_extractor( ) won't work on all links, such as those that are constructed with JavaScript or some hexadecimal escapes, but it should function on the majority of reasonably well-formed HTML.

See Also

Recipe 13.8 for information on capturing text inside HTML tags; documentation on preg_match_all( ) at http://www.php.net/preg-match-all.

Converting ASCII to HTML

Problem

You want to turn plaintext into reasonably formatted HTML.

Solution

First, encode entities with htmlentities( ); then transform the text into various HTML structures. The pc_ascii2html( ) function shown in Example 11-3 has basic transformations for links and paragraph breaks.

Example 11-3. pc_ascii2html( )

function pc_ascii2html($s) {
  $s = htmlentities($s);
  $grafs = split("\n\n",$s);
  for ($i = 0, $j = count($grafs); $i < $j; $i++) {
    // Link to what seem to be http or ftp URLs
    $grafs[$i] = preg_replace('/((ht|f)tp:\/\/[^\s&]+)/',
                              '<a href="$1">$1</a>',$grafs[$i]);

    // Link to email addresses
    $grafs[$i] = preg_replace('/([^@\s]+@(?:[-a-z0-9]+\.)+[a-z]{2,})/i',
        '<a href="mailto:$1">$1</a>',$grafs[$i]);

    // Begin with a new paragraph 
    $grafs[$i] = '<p>'.$grafs[$i].'</p>';
  }
  return join("\n\n",$grafs);
}

Discussion

The more you know about what the ASCII text looks like, the better your HTML conversion can be. For example, if emphasis is indicated with *asterisks* or /slashes/ around words, you can add rules that take care of that, as follows:

$grafs[$i] = preg_replace('/(\A|\s)\*([^*]+)\*(\s|\z)/',
                          '$1<b>$2</b>$3',$grafs[$i]);
$grafs[$i] = preg_replace('{(\A|\s)/([^/]+)/(\s|\z)}',
                          '$1<i>$2</i>$3',$grafs[$i]);

See Also

Documentation on preg_replace( ) at http://www.php.net/preg-replace.

Converting HTML to ASCII

Problem

You need to convert HTML to readable, formatted ASCII text.

Solution

If you have access to an external program that formats HTML as ASCII, such as lynx, call it like so:

$file = escapeshellarg($file);
$ascii = `lynx -dump $file`;

Discussion

If you can't use an external formatter, the pc_html2ascii( ) function shown in Example 11-4 handles a reasonable subset of HTML (no tables or frames, though).

Example 11-4. pc_html2ascii( )

function pc_html2ascii($s) {
  // convert links
  $s = preg_replace('/<a\s+.*?href="?([^\" >]*)"?[^>]*>(.*?)<\/a>/i',
                    '$2 ($1)', $s);

  // convert <br>, <hr>, <p>, <div> to line breaks
  $s = preg_replace('@<(b|h)r[^>]*>@i',"\n",$s);
  $s = preg_replace('@<p[^>]*>@i',"\n\n",$s);
  $s = preg_replace('@<div[^>]*>(.*)</div>@i',"\n".'$1'."\n",$s);
  
  // convert bold and italic
  $s = preg_replace('@<b[^>]*>(.*?)</b>@i','*$1*',$s);
  $s = preg_replace('@<i[^>]*>(.*?)</i>@i','/$1/',$s);

  // decode named entities
  $s = strtr($s,array_flip(get_html_translation_table(HTML_ENTITIES)));

  // decode numbered entities
  $s = preg_replace('/&#([0-9]+);/e','chr(\\1)',$s);
  
  // remove any remaining tags
  $s = strip_tags($s);
  
  return $s;
}

See Also

Recipe 9.9 for more on get_html_translation_table(); documentation on preg_replace( ) at http://www.php.net/preg-replace, get_html_translation_table( ) at http://www.php.net/get-html-translation-table, and strip_tags( ) at http://www.php.net/strip-tags.

Removing HTML and PHP Tags

Problem

You want to remove HTML and PHP tags from a string or file.

Solution

Use strip_tags( ) to remove HTML and PHP tags from a string:

$html = '<a href="http://www.oreilly.com">I <b>love computer books.</b></a>';
print strip_tags($html);
I love computer books.

Use fgetss( ) to remove them from a file as you read in lines:

$fh = fopen('test.html','r') or die($php_errormsg);
while ($s = fgetss($fh,1024)) {
    print $s;
}
fclose($fh)                  or die($php_errormsg);

Discussion

While fgetss( ) is convenient if you need to strip tags from a file as you read it in, it may get confused if tags span lines or if they span the buffer that fgetss( ) reads from the file. At the price of increased memory usage, reading the entire file into a string provides better results:

$no_tags = strip_tags(join('',file('test.html')));

Both strip_tags( ) and fgetss( ) can be told not to remove certain tags by specifying those tags as a last argument. The tag specification is case-insensitive, and for pairs of tags, you only have to specify the opening tag. For example, this removes all but <b></b> tags from $html:

$html = '<a href="http://www.oreilly.com">I <b>love</b> computer books.</a>';
print strip_tags($html,'<b>');
I <b>love</b> computer books.

See Also

Documentation on strip_tags( ) at http://www.php.net/strip-tags and fgetss( ) at http://www.php.net/fgetss.

Using Smarty Templates

Problem

You want to separate code and design in your pages. Designers can work on the HTML files without dealing with the PHP code, and programmers can work on the PHP files without worrying about design.

Solution

Use a templating system. One easy-to-use template system is called Smarty. In a Smarty template, strings between curly braces are replaced with new values:

Hello, {$name}

The PHP code that creates a page sets up the variables and then displays the template like this:

require 'Smarty.class.php';

$smarty = new Smarty;
$smarty->assign('name','Ruby');
$smarty->display('hello.tpl');

Discussion

Here's a Smarty template for displaying rows retrieved from a database:

<html>
<head><title>cheeses</title></head>
<body>
<table border="1">
<tr>
  <th>cheese</th>
  <th>country</th>
  <th>price</th>
</tr>
{section name=id loop=$results}
<tr>
  <td>{$results[id]->cheese}</td>
  <td>{$results[id]->country}</td>
  <td>{$results[id]->price}</td>
</tr>
{/section}        
</table>
</body>
</html>

Here's the corresponding PHP file that loads the data from the database and then displays the template, stored in food.tpl:

require 'Smarty.class.php';

mysql_connect('localhost','test','test');
mysql_select_db('test');

$r = mysql_query('SELECT * FROM cheese');
$results = array();
while ($ob = mysql_fetch_object($r)) {
    $ob->price = sprintf('$%.02f',$ob->price);
    $results[] = $ob;
}
$smarty = new Smarty;
$smarty->assign('results',$results);
$smarty->display('food.tpl');

After including the base class for the templating engine (Smarty.class.php), you retrieve and format the results from the database and store them in an array. To generate the templated page, just instantiate a new $smarty object, tell $smarty to pay attention to the $results variable, and then tell $smarty to display the template.

Smarty is easy to install: just copy a few files to your include_path and make a few directories. You can find full instructions at http://smarty.php.net/manual/en/installing.smarty.basic.html. Use Smarty with discipline to preserve the value of having templates in the first place — separating your logic and your presentation. A template engine has its own scripting language you use to interpolate variables, execute loops, and do other simple logic. Try to keep that to a minimum in your templates and load up your PHP files with the programming.
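
For example, a small presentation decision can stay in the template as a Smarty {if} block. This is a minimal sketch using the $results variable assigned above:

{if $results}
<p>Found some cheeses.</p>
{else}
<p>No cheeses found.</p>
{/if}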

See Also

The Smarty home page at http://smarty.php.net/.

Parsing a Web Server Log File

Problem

You want to do calculations based on the information in your web server's access log file.

Solution

Open the file and parse each line with a regular expression that matches the log file format. This regular expression matches the NCSA Combined Log Format:

$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$/';

Discussion

This program parses the NCSA Combined Log Format lines and displays a list of pages sorted by the number of requests for each page:

$log_file = '/usr/local/apache/logs/access.log';
$pattern = '/^([^ ]+) ([^ ]+) ([^ ]+) (\[[^\]]+\]) "(.*) (.*) (.*)" ([0-9\-]+) ([0-9\-]+) "(.*)" "(.*)"$/';

$fh = fopen($log_file,'r') or die($php_errormsg);
$i = 1;
$requests = array();
while (! feof($fh)) {
    // read each line and trim off leading/trailing whitespace
    if ($s = trim(fgets($fh,16384))) {
        // match the line to the pattern
        if (preg_match($pattern,$s,$matches)) {
            /* put each part of the match in an appropriately-named
             * variable */
            list($whole_match,$remote_host,$logname,$user,$time,
                 $method,$request,$protocol,$status,$bytes,$referer,
                 $user_agent) = $matches;
             // keep track of the count of each request 
            $requests[$request]++;
        } else {
            // complain if the line didn't match the pattern 
            error_log("Can't parse line $i: $s");
        }
    }
    $i++;
}
fclose($fh) or die($php_errormsg);

// sort the array (in reverse) by number of requests 
arsort($requests);

// print formatted results
foreach ($requests as $request => $accesses) {
    printf("%6d   %s\n",$accesses,$request);
}

The pattern used in preg_match( ) matches Combined Log Format lines such as:

10.1.1.162 - david [20/Jul/2001:13:05:02 -0400] "GET /sklar.css HTTP/1.0" 200 
278 "-" "Mozilla/4.77 [en] (WinNT; U)"
10.1.1.248 - - [14/Mar/2002:13:31:37 -0500] "GET /php-cookbook/colors.html 
HTTP/1.1" 200 460 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0)"

In the first line, 10.1.1.162 is the IP address that the request came from. Depending on the server configuration, this could be a hostname instead. When the $matches array is assigned to the list of separate variables, the hostname is stored in $remote_host. The next hyphen (-) means that the remote host didn't supply a username via identd,[1] so $logname is set to -.

The string david is a username provided by the browser using HTTP Basic Authentication and is put in $user. The date and time of the request, stored in $time, is in brackets. This date and time format isn't understood by strtotime( ), so if you want to do calculations based on request date and time, you have to do some further processing to extract each piece of the formatted time string. Next, in quotes, is the first line of the request. This is composed of the method (GET, POST, HEAD, etc.), which is stored in $method; the requested URI, which is stored in $request; and the protocol, which is stored in $protocol. For GET requests, the query string is part of the URI. For POST requests, the request body that contains the variables isn't logged.
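
If you need a timestamp anyway, a few lines of parsing suffice. This is a minimal sketch (the month map and variable names are illustrative, not part of the program above); mktime( ) interprets the pieces in the server's time zone, so adjust by the logged offset if you need UTC:

$months = array('Jan' => 1, 'Feb' => 2, 'Mar' => 3, 'Apr' => 4,
                'May' => 5, 'Jun' => 6, 'Jul' => 7, 'Aug' => 8,
                'Sep' => 9, 'Oct' => 10, 'Nov' => 11, 'Dec' => 12);

// $time looks like [20/Jul/2001:13:05:02 -0400]
if (preg_match('@^\[(\d\d)/(\w\w\w)/(\d{4}):(\d\d):(\d\d):(\d\d) ([-+]\d{4})\]$@',
               $time, $t)) {
    $timestamp = mktime($t[4], $t[5], $t[6], $months[$t[2]], $t[1], $t[3]);
}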

After the request comes the request status, stored in $status. Status 200 means the request was successful. After the status is the size in bytes of the response, stored in $bytes. The last two elements of the line, each in quotes, are the referring page, if any, stored in $referer,[2] and the user agent string identifying the browser that made the request, stored in $user_agent.

Once the log file line has been parsed into distinct variables, you can do the needed calculations. In this case, just keep a counter in the $requests array of how many times each URI is requested. After looping through all lines in the file, print out a sorted, formatted list of requests and counts.

Calculating statistics this way from web server access logs is easy, but it's not very flexible. The program needs to be modified for different kinds of reports, restricted date ranges, report formatting, and many other features. A better solution for comprehensive web site statistics is to use a program such as analog, available for free at http://www.analog.cx. It has many types of reports and configuration options that should satisfy just about every need you may have.

See Also

Documentation on preg_match( ) at http://www.php.net/preg-match; information about common log file formats is available at http://httpd.apache.org/docs/logs.html.

Program: Finding Stale Links

The stale-links.php program in Example 11-5 produces a list of links in a page and their status. It tells you if the links are okay, if they've been moved somewhere else, or if they're bad. Run the program by passing it a URL to scan for links:

% stale-links.php http://www.oreilly.com/
http://www.oreilly.com/index.html: OK
http://www.oreillynet.com: OK
http://conferences.oreilly.com: OK
http://international.oreilly.com: OK
http://safari.oreilly.com: MOVED: mainhom.asp?home
...

The stale-links.php program uses the cURL extension to retrieve web pages. First, it retrieves the URL specified on the command line. Once a page has been retrieved, the program uses the pc_link_extractor( ) function from Recipe 11.9 to get a list of links in the page. Then, after prepending a base URL to each link if necessary, the link is retrieved. Because we need just the headers of these responses, we use the HEAD method instead of GET by setting the CURLOPT_NOBODY option. Setting CURLOPT_HEADER tells curl_exec( ) to include the response headers in the string it returns. Based on the response code, the status of the link is printed, along with its new location if it's been moved.

Example 11-5. stale-links.php

function_exists('curl_exec') or die('CURL extension required');

function pc_link_extractor($s) {
    $a = array();
    if (preg_match_all('/<A\s+.*?HREF=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/A>/i',
                       $s,$matches,PREG_SET_ORDER)) {
        foreach($matches as $match) {
            array_push($a,array($match[1],$match[2]));
        }
    }
    return $a;
}

$url = $_SERVER['argv'][1];

// retrieve URL
$c = curl_init($url);
curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($c, CURLOPT_FOLLOWLOCATION,1);
$page = curl_exec($c);
$info = curl_getinfo($c);
curl_close($c);

// compute base url from url
// this doesn't pay attention to a <base> tag in the page
$url_parts = parse_url($info['url']);
if ('' == $url_parts['path']) { $url_parts['path'] = '/'; }
$base_path = preg_replace('<^(.*/)([^/]*)$>','\\1',$url_parts['path']);
$base_url = sprintf('%s://%s%s%s',
                    $url_parts['scheme'],
                    ($url_parts['user'] || $url_parts['pass']) ?
                    "$url_parts[user]:$url_parts[pass]@" : '',
                    $url_parts['host'],
                    $base_path);

// keep track of the links we visit so we don't visit each more than once
$seen_links = array();

if ($page) {
    $links = pc_link_extractor($page);
    foreach ($links as $link) {
        // resolve relative links
        if (! (preg_match('{^(http|https|mailto):}',$link[0]))) {
            $link[0] = $base_url.$link[0];
        }
        // skip this link if we've seen it already
        if ($seen_links[$link[0]]) {
            continue;
        } 
        
        // mark this link as seen
        $seen_links[$link[0]] = true;

        // print the link we're visiting
        print $link[0].': ';
        flush();
        
        // visit the link
        $c = curl_init($link[0]);
        curl_setopt($c, CURLOPT_RETURNTRANSFER, 1);
        curl_setopt($c, CURLOPT_NOBODY, 1);
        curl_setopt($c, CURLOPT_HEADER, 1);
        $link_headers = curl_exec($c);
        $curl_info = curl_getinfo($c);
        curl_close($c);

        switch (intval($curl_info['http_code']/100)) {
        case 2:
            // 2xx response codes mean the page is OK
            $status = 'OK';
            break;
        case 3:
            // 3xx response codes mean redirection
            $status = 'MOVED';
            if (preg_match('/^Location: (.*)$/m',$link_headers,$matches)) {
                $location = trim($matches[1]);
                $status .= ": $location";
            }
            break;
        default:
            // other response codes mean errors
            $status = "ERROR: $curl_info[http_code]";
            break;
        }

        print "$status\n";
    }
}

Program: Finding Fresh Links

Example 11-6, fresh-links.php, is a modification of the program in Recipe 11.15 that produces a list of links and their last modified time. If the server on which a URL lives doesn't provide a last modified time, the program reports the URL's last modified time as the time the URL was requested. If the program can't retrieve the URL successfully, it prints out the status code it got when it tried to retrieve the URL. Run the program by passing it a URL to scan for links:

% fresh-links.php http://www.oreilly.com
http://www.oreilly.com/index.html: Fri Aug 16 16:48:34 2002
http://www.oreillynet.com: Mon Aug 19 10:18:54 2002
http://conferences.oreilly.com: Fri Aug 16 19:41:46 2002
http://international.oreilly.com: Fri Mar 29 18:06:32 2002
http://safari.oreilly.com: 302
http://www.oreilly.com/catalog/search.html: Tue Apr  2 19:05:57 2002
http://www.oreilly.com/oreilly/press/: 302
...

This output is from a run of the program at about 10:20 A.M. EDT on August 19, 2002. The link to http://www.oreillynet.com is very fresh, but the others are of varying ages. The link to http://www.oreilly.com/oreilly/press/ doesn't have a last modified time next to it; it has, instead, an HTTP status code (302). This means it's been moved elsewhere, as reported by the output of stale-links.php in Recipe 11.15.

The program to find fresh links is conceptually almost identical to the program to find stale links. It uses the same pc_link_extractor( ) function from Recipe 11.9; however, it uses the HTTP_Request class instead of cURL to retrieve URLs. The code to get the base URL specified on the command line is inside a loop so that it can follow any redirects that are returned.

Once a page has been retrieved, the program uses the pc_link_extractor( ) function to get a list of links in the page. Then, after prepending a base URL to each link if necessary, sendRequest( ) is called on each link found in the original page. Since we need just the headers of these responses, we use the HEAD method instead of GET. Instead of printing out a new location for moved links, however, it prints out a formatted version of the Last-Modified header if it's available.

Example 11-6. fresh-links.php

require 'HTTP/Request.php';

function pc_link_extractor($s) {
    $a = array();
    if (preg_match_all('/<A\s+.*?HREF=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/A>/i',
                       $s,$matches,PREG_SET_ORDER)) {
        foreach($matches as $match) {
            array_push($a,array($match[1],$match[2]));
        }
    }
    return $a;
}

$url = $_SERVER['argv'][1];

// retrieve URLs in a loop to follow redirects 
$done = 0;
while (! $done) {
    $req = new HTTP_Request($url);
    $req->sendRequest();
    if ($response_code = $req->getResponseCode()) {
        if ((intval($response_code/100) == 3) &&
            ($location = $req->getResponseHeader('Location'))) {
            $url = $location;
        } else {
            $done = 1;
        }
    } else {
        return false;
    }
}

// compute base url from url
// this doesn't pay attention to a <base> tag in the page 
$base_url = preg_replace('{^(.*/)([^/]*)$}','\\1',$req->_url->getURL());

// keep track of the links we visit so we don't visit each more than once
$seen_links = array();

if ($body = $req->getResponseBody()) {
    $links = pc_link_extractor($body);
    foreach ($links as $link) {
        // skip https URLs
        if (preg_match('{^https://}',$link[0])) {
            continue;
        }
        // resolve relative links
        if (! (preg_match('{^(http|mailto):}',$link[0]))) {
            $link[0] = $base_url.$link[0];
        }
        // skip this link if we've seen it already
        if ($seen_links[$link[0]]) {
            continue;
        } 
        
        // mark this link as seen
        $seen_links[$link[0]] = true;

        // print the link we're visiting
        print $link[0].': ';
        flush();
        
        // visit the link
        $req2 = new HTTP_Request($link[0],
                                 array('method' => HTTP_REQUEST_METHOD_HEAD));
        $now = time();
        $req2->sendRequest();
        $response_code = $req2->getResponseCode();
        
        // if the retrieval is successful
        if ($response_code == 200) {
            // get the Last-Modified header
            if ($lm = $req2->getResponseHeader('Last-Modified')) {
                $lm_utc = strtotime($lm);
            } else {
                // or set Last-Modified to now
                $lm_utc = $now;
            }
            print strftime('%c',$lm_utc);
        } else {
            // otherwise, print the response code
            print $response_code;
        }
        print "\n";
    }
}

Notes

  1. identd, defined in RFC 1413, is supposed to be a good way to identify users remotely. However, it's not very secure or reliable. A good explanation of why is at http://www.clock.org/~fair/opinion/identd.html.
  2. The correct way to spell this word is "referrer." However, since the original HTTP specification (RFC 1945) misspelled it as "referer," the three-R spelling is frequently used in context.