PHP Cookbook/Regular Expressions

Current revision: 13:36, 7 March 2008 (initial conversion from Docbook)
 

Introduction

Regular expressions are a powerful tool for matching and manipulating text. While not as fast as plain-vanilla string matching, regular expressions are extremely flexible; they allow you to construct patterns to match almost any conceivable combination of characters with a simple, albeit terse and somewhat opaque syntax.

In PHP, you can use regular expression functions to find text that matches certain criteria. Once located, you can choose to modify or replace all or part of the matching substrings. For example, this regular expression turns text email addresses into mailto: hyperlinks:

$html = preg_replace('/[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}/i',
                     '<a href="mailto:$0">$0</a>', $text);

As you can see, regular expressions are handy when transforming plain text into HTML and vice versa. Luckily, since these are such popular subjects, PHP has many built-in functions to handle these tasks. Recipe 9.9 tells how to escape HTML entities, Recipe 11.12 covers stripping HTML tags, and Recipe 11.10 and Recipe 11.11 show how to convert ASCII to HTML and HTML to ASCII, respectively. For more on matching and validating email addresses, see Recipe 13.7.

Over the years, the functionality of regular expressions has grown from its basic roots to incorporate increasingly useful features. As a result, PHP offers two different sets of regular-expression functions. The first set includes the traditional (or POSIX) functions, all beginning with ereg (for extended regular expressions; the ereg functions themselves are already an extension of the original feature set). The other set includes the Perl family of functions, prefaced with preg (for Perl-compatible regular expressions).

The preg functions use a library that mimics the regular expression functionality of the Perl programming language. This is a good thing because Perl allows you to do a variety of handy things with regular expressions, including nongreedy matching, forward and backward assertions, and even recursive patterns.

In general, there's no longer any reason to use the ereg functions. They offer fewer features, and they're slower than the preg functions. However, the ereg functions existed in PHP for many years prior to the introduction of the preg functions, so many programmers still use them because of legacy code or out of habit. Thankfully, the prototypes for the two sets of functions are nearly identical, so it's easy to switch from one to the other without too much confusion. (We list how to do this while avoiding the major gotchas in Recipe 13.2.) Note that the ereg functions were deprecated in PHP 5.3 and removed in PHP 7.0, so code that must run on current PHP versions has to use preg.

The basics of regular expressions are simple to understand. You combine a sequence of characters to form a pattern. You then compare strings of text to this pattern and look for matches. In the pattern, most characters represent themselves. So, to find if a string of HTML contains an image tag, do this:

if (preg_match('/<img /', $html)) {
    // found an opening image tag
}

The preg_match( ) function compares the pattern "<img " against the contents of $html. If it finds a match, it returns 1; if it doesn't, it returns 0. The / characters are called pattern delimiters; they set off the start and end of the pattern.

A few characters, however, are special. The special nature of these characters is what lifts regular expressions beyond the feature set of strstr( ) and strpos( ). These characters are called metacharacters. The most frequently used metacharacters include the period (.), asterisk (*), plus (+), and question mark (?). To match an actual metacharacter, precede the character with a backslash (\).

  • The period matches any character, so the pattern /.at/ matches bat, cat, and even rat.
  • The asterisk means match 0 or more of the preceding object. (Right now, the only objects we know about are characters.)
  • The plus is similar to the asterisk, but it matches 1 or more instead of 0 or more. So, /.+at/ matches brat, sprat, and even catastrophe, but not at. To match at, replace the + with a *.
  • The question mark matches 0 or 1 objects.
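
The four quantifier rules above can be checked with a short script. This is a minimal sketch, not part of the original recipe: the word list and the colou?r pattern are our own illustrations, and each cell of $results simply records what preg_match( ) returns.

```php
<?php
// Illustrative patterns and words (our own choices, not from the recipe).
$patterns = array('/.at/', '/.+at/', '/.*at/', '/colou?r/');
$words    = array('at', 'cat', 'brat', 'color', 'colour');

// $results[pattern][word] is 1 when the pattern matches somewhere in the word.
$results = array();
foreach ($patterns as $pattern) {
    foreach ($words as $word) {
        $results[$pattern][$word] = preg_match($pattern, $word);
    }
}

foreach ($results as $pattern => $row) {
    // array_filter() drops the 0 entries, leaving only the matching words
    $hits = array_keys(array_filter($row));
    print "$pattern matches: " . implode(', ', $hits) . "\n";
}
// /.at/ matches: cat, brat
// /.+at/ matches: cat, brat
// /.*at/ matches: at, cat, brat
// /colou?r/ matches: color, colour
```

Only /.*at/ accepts the bare word at, because * is the only quantifier here that allows zero characters before the a.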

To apply * and + to objects greater than one character, place the sequence of characters inside parentheses. Parentheses allow you to group characters for more complicated matching and also capture the part of the pattern that falls inside them. A captured sequence can be referenced in preg_replace( ) to alter a string, and all captured matches can be stored in an array that's passed as a third parameter to preg_match( ) and preg_match_all( ). The preg_match_all( ) function is similar to preg_match( ), but it finds all possible matches inside a string, instead of stopping at the first match. Here are some examples:

if (preg_match('/<title>.+<\/title>/', $html)) {
    // page has a title
}

if (preg_match_all('/<li>/', $html, $matches)) {
    print 'Page has ' . count($matches[0]) . " list items\n";
}

// turn bold into italic
$italics = preg_replace('/(<\/?)b(>)/', '$1i$2', $bold);

If you want to match strings with a specific set of letters, create a character class with the letters you want. A character class is a sequence of characters placed inside square brackets. The caret (^) and the dollar sign ($) anchor the pattern at the beginning and the end of the string, respectively. Without them, a match can occur anywhere in the string. So, to match only vowels, make a character class containing a, e, i, o, and u; start your pattern with ^; and end it with $:

preg_match('/^[aeiou]+$/', $string); // only vowels

If it's easier to define what you're looking for by its complement, use that. To make a character class match the complement of what's inside it, begin the class with a caret. A caret outside a character class anchors a pattern at the beginning of a string; a caret inside a character class means "match everything except what's listed in the square brackets":

preg_match('/^[^aeiou]+$/', $string); // only non-vowels

Note that the opposite of [aeiou] isn't [bcdfghjklmnpqrstvwxyz]. The character class [^aeiou] also matches uppercase vowels such as AEIOU, numbers such as 123, URLs such as http://www.cnpq.br/, and even emoticons such as :).
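
To see this for yourself, here is a small sketch that runs the non-vowel pattern over the strings mentioned above. The two extra samples, rhythm and audio, are our own additions for contrast.

```php
<?php
// Strings from the paragraph above, plus two extra samples of our own.
$samples = array('AEIOU', '123', 'http://www.cnpq.br/', ':)', 'rhythm', 'audio');

$only_non_vowels = array();
foreach ($samples as $s) {
    // 1 if every character is something other than lowercase a, e, i, o, u
    $only_non_vowels[$s] = preg_match('/^[^aeiou]+$/', $s);
}
print_r($only_non_vowels);
```

Everything except audio yields 1. Uppercase vowels pass because the class lists only lowercase letters and the pattern has no i modifier.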

The vertical bar (|), also known as the pipe, specifies alternatives. For example:

// find a gif or a jpeg
preg_match('/(gif|jpeg)/', $images);

Besides metacharacters, there are also metasymbols. Metasymbols are like metacharacters, but are longer than one character in length. Some useful metasymbols are \w (match any word character, [a-zA-Z0-9_]); \d (match any digit, [0-9]); \s (match any whitespace character); and \b (match a word boundary). Here's how to find all numbers that aren't part of another word:

// find digits not touching other words
preg_match_all('/\b\d+\b/', $html, $matches);

This matches the 123 in 123, the 76 in 76!, and the 38 in 38-years-old, but nothing in 2nd, because no word boundary falls between the 2 and the n.
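
As a quick check, the following sketch runs that pattern over a sentence of our own invention that contains all four cases:

```php
<?php
// Sample sentence is our own; it contains the four cases discussed above.
$text = 'Call 123 now, 76! of you are 38-years-old, and this is the 2nd try.';

preg_match_all('/\b\d+\b/', $text, $matches);
print implode(',', $matches[0]) . "\n";   // 123,76,38
```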

Here's a pattern that is the regular expression equivalent of trim( ):

// delete leading whitespace or trailing whitespace
$trimmed = preg_replace('/(^\s+)|(\s+$)/', '', $string);
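
A short sketch comparing the two approaches on a sample string of our own. For ordinary spaces, tabs, and newlines, the results are the same:

```php
<?php
// Sample string (our own) with leading and trailing whitespace.
$string = "  \t hello world \n";

$via_regex = preg_replace('/(^\s+)|(\s+$)/', '', $string);
$via_trim  = trim($string);

var_dump($via_regex === $via_trim);   // bool(true)
```

The space between hello and world survives because neither alternative can match there: it is not at the start of the string, and it is not followed by the end.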

Finally, there are pattern modifiers. Modifiers affect the entire pattern, not just a character or group of characters. Pattern modifiers are placed after the trailing pattern delimiter. For example, the letter i makes a regular expression pattern case-insensitive:

// strict match lower-case image tags only (XHTML compliant)
if (preg_match('/<img[^>]+>/', $html)) {
   ...
}

// match both upper and lower-case image tags
if (preg_match('/<img[^>]+>/i', $html)) {
   ...
}

We've covered just a small subset of the world of regular expressions. We provide some additional details in later recipes, but the PHP web site also has some very useful information on POSIX regular expressions at http://www.php.net/regex and on Perl-compatible regular expressions at http://www.php.net/pcre. The links from this last page to "Pattern Modifiers" and "Pattern Syntax" are especially detailed and informative.

The best books on this topic are Mastering Regular Expressions by Jeffrey Friedl, and Programming Perl by Larry Wall, Tom Christiansen, and Jon Orwant, both published by O'Reilly. (Since the Perl-compatible regular expressions are based on Perl's regular expressions, we don't feel too bad suggesting a book on Perl.)

Switching From ereg to preg

Problem

You want to convert from using ereg functions to preg functions.

Solution

First, you have to add delimiters to your patterns:

preg_match('/pattern/', 'string');

For eregi( ) case-insensitive matching, use the /i modifier instead:

preg_match('/pattern/i', 'string');

When using integers instead of strings as patterns or replacement values, convert the number to hexadecimal and specify it using an escape sequence:

$hex = dechex($number);
preg_match("/\x$hex/", 'string');

Discussion

There are a few major differences between ereg and preg. First, when you use preg functions, the pattern isn't just the string pattern; it also needs delimiters, as in Perl, so it's /pattern/ instead.[1] So:

ereg('pattern', 'string');

becomes:

preg_match('/pattern/', 'string');

When choosing your pattern delimiters, don't put your delimiter character inside the regular-expression pattern, or you'll close the pattern early. If you can't find a way to avoid this problem, you need to escape any instances of your delimiters using the backslash. Instead of doing this by hand, call addcslashes( ).

For example, if you use / as your delimiter:

$ereg_pattern = '<b>.+</b>';
$preg_pattern = addcslashes($ereg_pattern, '/');

The value of $preg_pattern is now <b>.+<\/b>.
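
The sketch below contrasts the three options on a sample string of our own: escaping the slash by hand, choosing a delimiter that doesn't occur in the pattern (here #), and letting addcslashes( ) do the escaping.

```php
<?php
// Sample HTML is our own.
$html = '<p>Some <b>bold</b> text</p>';

// With / as the delimiter, the slash in </b> has to be escaped.
$with_slash = preg_match('/<b>.+<\/b>/', $html);

// With # as the delimiter, the ereg-style pattern can be used as written.
$with_hash = preg_match('#<b>.+</b>#', $html);

// addcslashes() adds the backslash for us.
$escaped = addcslashes('<b>.+</b>', '/');
$with_addcslashes = preg_match('/' . $escaped . '/', $html);

print "$with_slash $with_hash $with_addcslashes\n";   // 1 1 1
```

Picking a different delimiter is usually the least error-prone choice when the pattern is a fixed string in your code.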

The preg functions don't have a parallel series of case-insensitive functions. They have a case-insensitive modifier instead. To convert, change:

eregi('pattern', 'string');

to:

preg_match('/pattern/i', 'string');

Adding the i after the closing delimiter makes the change.

Finally, there is one last obscure difference. If you use a number (not a string) as a pattern or replacement value in ereg_replace( ), it's assumed you are referring to the ASCII value of a character. Therefore, since 9 is the ASCII code for tab (i.e., \t), this code inserts a tab at the beginning of the string:

$tab = 9;
$replaced = ereg_replace('^', $tab, $string);

Here's how to replace linefeeds (ASCII 10) with form feeds (ASCII 12):

$converted = ereg_replace(10, 12, $text);

To avoid this feature in ereg functions, use this instead:

$tab = '9';

On the other hand, preg_replace( ) treats the number 9 as the number 9, not as a tab substitute. To convert these character codes for use in preg_replace( ), convert them to hexadecimal and prefix them with \x. For example, 9 becomes \x9 or \x09, and 12 becomes \x0c. Alternatively, you can use \t, \r, and \n for tabs, carriage returns, and linefeeds, respectively.
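
Here is a sketch of what the preg versions of those conversions might look like. The sample string, the m modifier (which lets ^ match at the start of every line), and the use of sprintf( ) to build a two-digit escape are our own choices, not part of the original recipe.

```php
<?php
// Sample text is our own.
$string = "first line\nsecond line";

// Insert a tab at the start of every line. /m makes ^ match after each newline.
$tabbed = preg_replace('/^/m', "\t", $string);

// Replace linefeeds (decimal 10, \x0a) with form feeds (decimal 12, \x0c).
$converted = preg_replace('/\x0a/', "\x0c", $string);

// Build a pattern from an integer character code. Single quotes keep \x literal.
$code    = 9;
$pattern = sprintf('/\x%02x/', $code);   // the string /\x09/
$found   = preg_match($pattern, "a\tb"); // 1, because the string contains a tab
```

Padding the hex value to two digits with %02x avoids the ambiguity of a one-digit escape that happens to be followed by another hex digit in the pattern.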

See Also

Documentation on ereg( ) at http://www.php.net/ereg, preg_match( ) at http://www.php.net/preg-match, and addcslashes( ) at http://www.php.net/addcslashes.

Matching Words

Problem

You want to pull out all words from a string.

Solution

The key to this is carefully defining what you mean by a word. Once you've created your definition, use the special character types to create your regular expression:

/\S+/         // everything that isn't whitespace
/[A-Z'-]+/i   // all upper and lowercase letters, apostrophes, and hyphens

Discussion

The simple question "what is a word?" is surprisingly complicated. While the Perl-compatible regular expressions have a built-in word character type, specified by \w, it's important to understand exactly how PHP defines a word. Otherwise, your results may not be what you expect.

Normally, because it comes directly from Perl's definition of a word, \w encompasses all letters, digits, and underscores; this means a_z is a word, but the email address php@example.com is not.

In this recipe, we only consider English words, but other languages use different alphabets. Because Perl-compatible regular expressions use the current locale to define their settings, altering the locale can switch the definition of a letter, which then redefines the meaning of a word.

To combat this, you may want to explicitly enumerate the characters belonging to your words inside a character class. To add a nonstandard character, use \ddd, where ddd is a character's octal code.
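
The following sketch shows how much the choice of definition matters. It applies the two Solution patterns and \w+ to one sample sentence (our own) and counts the "words" each one finds:

```php
<?php
// Sample sentence is our own.
$text = "Don't e-mail php@example.com about a_z, OK?";

preg_match_all('/\S+/', $text, $by_space);            // runs of non-whitespace
preg_match_all("/[A-Z'-]+/i", $text, $by_letters);    // letters, ' and -
preg_match_all('/\w+/', $text, $by_word_chars);       // letters, digits, _

print count($by_space[0]) . "\n";        // 6
print count($by_letters[0]) . "\n";      // 9
print count($by_word_chars[0]) . "\n";   // 10
```

With \S+, the email address is one word and the trailing comma and question mark stay attached. With \w+, Don't splits into Don and t, but a_z stays whole.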

See Also

Recipe 16.3 for information about setting locales.

Finding the nth Occurrence of a Match

Problem

You want to find the nth word match instead of the first one.

Solution

Use preg_match_all( ) to pull all the matches into an array; then pick out the specific matches you're interested in:

preg_match_all("/$pattern/$modifiers", $string, $matches);

foreach($matches[1] as $match) {
    print "$match\n";
}

Discussion

Unlike in Perl, PHP's Perl-compatible regular expressions don't support the /g modifier that allows you to loop through the string one match at a time. You need to use preg_match_all( ) instead of preg_match( ).

The preg_match_all( ) function fills its third argument with a two-dimensional array. The first element holds an array of matches of the complete pattern. The second element also holds an array of matches, but of the first parenthesized submatch within each complete match. So, to get the third potato, you access the third element of the second element of the $matches array:

$potatoes = 'one potato two potato three potato four';
preg_match_all("/(\w+)\s+potato\b/", $potatoes, $matches);
print $matches[1][2];
three
            

Instead of filling the array with full matches and then submatches, preg_match_all( ) can fill it match by match, with each match's submatches inside. To trigger this, pass PREG_SET_ORDER in as the fourth argument. Now, three isn't in $matches[1][2], as previously, but in $matches[2][1].

Check the return value of preg_match_all( ) to find the number of matches:

print preg_match_all("/(\w+)\s+potato\b/", $potatoes, $matches);
3
            

Note that there are only three matches, not four, because there's no trailing potato after the word four in the string.
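
Putting the two orderings side by side makes the difference easier to see. This sketch reuses the potato string; the variable names $sets and $count are our own.

```php
<?php
$potatoes = 'one potato two potato three potato four';

// Default ordering: $matches[1] lists the first submatch of every match.
preg_match_all('/(\w+)\s+potato\b/', $potatoes, $matches);
print $matches[1][2] . "\n";   // three

// PREG_SET_ORDER: one entry per match, with its submatches inside.
$count = preg_match_all('/(\w+)\s+potato\b/', $potatoes, $sets, PREG_SET_ORDER);
print $sets[2][1] . "\n";      // three
print $sets[2][0] . "\n";      // three potato
print $count . "\n";           // 3
```

PREG_SET_ORDER tends to be more convenient when you loop over matches and need several submatches from each one.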

See Also

Documentation on preg_match_all( ) at http://www.php.net/preg-match-all.

Choosing Greedy or Nongreedy Matches

Problem

You want your pattern to match the smallest possible string instead of the largest.

Solution

Place a ? after a quantifier to alter that portion of the pattern:

// find all bolded sections
preg_match_all('#<b>.+?</b>#', $html, $matches);

Or, use the U pattern modifier ending to invert all quantifiers from greedy to nongreedy:

// find all bolded sections
preg_match_all('#<b>.+</b>#U', $html, $matches);

Discussion

By default, all regular expressions in PHP are what's known as greedy. This means a quantifier always tries to match as many characters as possible.

For example, take the pattern p.*, which matches a p and then 0 or more characters, and match it against the string php. A greedy regular expression finds one match, because after it grabs the opening p, it continues on and also matches the hp. A nongreedy regular expression, on the other hand, finds a pair of matches. It matches the opening p, but because a nongreedy .* is satisfied with zero characters, it stops there instead of continuing on. The h can't begin a match, so the second match takes the closing p.

The following code shows that the greedy match finds only one hit; the nongreedy ones find two:

print preg_match_all('/p.*/', "php");  // greedy
print preg_match_all('/p.*?/', "php"); // nongreedy
print preg_match_all('/p.*/U', "php"); // nongreedy
1
2
2

Greedy matching is also known as maximal matching and nongreedy matching can be called minimal matching, because these options match either the maximum or minimum number of characters possible.

Initially, all regular expressions were strictly greedy. Therefore, you can't use this syntax with ereg( ) or ereg_replace( ). Nongreedy matching isn't supported by the older engine that powers these functions; instead, you must use Perl-compatible functions.

Nongreedy matching is frequently useful when trying to perform simplistic HTML parsing. Let's say you want to find all text between bold tags. With greedy matching, you get this:

$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';
preg_match_all('#<b>(.+)</b>#', $html, $bolds);
print_r($bolds[1]);
Array
(
    [0] => I am bold.</b> <i>I am italic.</i> <b>I am also bold.
)

Because there's a second set of bold tags, the pattern extends past the first </b>, which makes it impossible to correctly break up the HTML. If you use minimal matching, each set of tags is self-contained:

$html = '<b>I am bold.</b> <i>I am italic.</i> <b>I am also bold.</b>';
preg_match_all('#<b>(.+?)</b>#', $html, $bolds);
print_r($bolds[1]);
Array
(
    [0] => I am bold.
    [1] => I am also bold.
)

Of course, this can break down if your markup isn't 100% valid, and there are stray bold tags lying around.[2] If your goal is just to remove all (or some) HTML tags from a block of text, you're better off not using a regular expression. Instead, use the built-in function strip_tags( ); it's faster and it works correctly. See Recipe 11.12 for more details.

Finally, even though the idea of nongreedy matching comes from Perl, the U modifier is incompatible with Perl and is unique to PHP's Perl-compatible regular expressions. It inverts all quantifiers, turning them from greedy to nongreedy and also the reverse. So, to get a greedy quantifier inside of a pattern operating under a trailing /U, just add a ? to the end, the same way you would normally turn a greedy quantifier into a nongreedy one.
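
The inversion is easy to confirm with a sketch. The sample string is our own, and # is used as the delimiter so the slash in the closing tag needs no escaping.

```php
<?php
// Sample HTML is our own.
$html = '<b>one</b> <b>two</b>';

preg_match_all('#<b>.+</b>#',   $html, $greedy);        // greedy by default
preg_match_all('#<b>.+</b>#U',  $html, $ungreedy);      // U makes .+ nongreedy
preg_match_all('#<b>.+?</b>#U', $html, $greedy_again);  // ? under U is greedy

print count($greedy[0]) . "\n";         // 1
print count($ungreedy[0]) . "\n";       // 2
print count($greedy_again[0]) . "\n";   // 1
```

Because the combination of U and ? is easy to misread, it may be clearer to leave U off and mark individual quantifiers with ? instead.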

See Also

Recipe 13.9 for more on capturing text inside HTML tags; Recipe 11.12 for more on stripping HTML tags; documentation on preg_match_all( ) at http://www.php.net/preg-match-all.

Matching a Valid Email Address

Problem

You want to check if an email address is valid.

Solution

This is a popular question and everyone has a different answer, depending on their definition of valid. If valid means a mailbox belonging to a legitimate user at an existing hostname, the real answer is that you can't do it correctly, so don't even bother. However, sometimes a regular expression can help weed out some simple typos and obvious bogus attempts. That said, our favorite pattern that doesn't require maintenance is:

/^[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}$/i

If the IMAP extension is enabled, you can also use imap_rfc822_parse_adrlist( ):

$parsed = imap_rfc822_parse_adrlist($email_address, $default_host);
// the function returns an array of objects, one per address
if ('INVALID_ADDRESS' == $parsed[0]->mailbox) {
    // bad address
}

Ironically, because this function is so RFC-compliant, it may not give the results you expect.

Discussion

The pattern in the Solution accepts any email address whose name is any sequence of characters other than @ and whitespace. After the @, you need at least one domain name consisting of the letters a-z, the numbers 0-9, and the hyphen, followed by a period; you can precede it with as many subdomains as you want. Finally, you end with either a two-letter country code or another top-level domain, such as .com or .edu.

The solution pattern is handy because it still works if ICANN adds new top-level domains. However, it does allow through a few false positives. This stricter pattern explicitly enumerates the current noncountry top-level domains:

/
    ^               # anchor at the beginning
    [^@\s]+         # name is all characters except @ and whitespace
    @               # the @ divides name and domain
    (
        [-a-z0-9]+  # (sub)domains are letters, numbers, and hyphens
        \.          # separated by a period
    )+              # and we can have one or more of them
    (
        [a-z]{2}    # TLDs can be a two-letter alphabetical country code
        |com|net    # or one of 
        |edu|org    # many 
        |gov|mil    # possible
        |int|biz    # three-letter
        |pro        # combinations
        |info|arpa  # or even
        |aero|coop  # a few 
        |name       # four-letter ones
        |museum     # plus one that's six-letters long!
    )
    $               # anchor at the end
/ix                 # and everything is case-insensitive

Both patterns are intentionally liberal in what they accept, because we assume you're only trying to make sure someone doesn't accidentally leave off their top-level domain or type in something fake such as "not telling." For instance, there's no domain "-.com", but "foo@-.com" flies through without a blip. (It wouldn't be hard to modify the pattern to correct this, but that's left as an exercise for you.) On the other hand, it is legal to have an address of "Tim O'Reilly@oreilly.com", and our pattern won't accept this. However, spaces in email addresses are rare; because a space almost always represents a mistake, we flag that address as bad.
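
To make those trade-offs concrete, here is a sketch that wraps the Solution pattern in a helper and runs it over the addresses just discussed. The function name looks_like_email( ) and the user@example sample are our own inventions.

```php
<?php
// Hypothetical helper; the pattern itself is the one from the Solution.
function looks_like_email($address) {
    return preg_match('/^[^@\s]+@([-a-z0-9]+\.)+[a-z]{2,}$/i', $address) === 1;
}

$tests = array(
    'tim@oreilly.com',           // accepted
    'foo@-.com',                 // accepted, although no such domain exists
    "Tim O'Reilly@oreilly.com",  // rejected because of the space
    'not telling',               // rejected: no @ at all
    'user@example',              // rejected: no top-level domain
);

foreach ($tests as $address) {
    print $address . ': ' . (looks_like_email($address) ? 'ok' : 'bad') . "\n";
}
```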

The canonical definition of what's a valid address is documented in RFC 822; however, writing code to handle all cases isn't a pretty task. Here's one example of what you need to consider: people are allowed to embed comments inside addresses! Comments are set inside parentheses, so it's valid to write:

Tim (is the man @ computer books) @ oreilly.com

That's equivalent to "tim@oreilly.com". (So, again, the pattern fails on that address.)

Alternatively, the IMAP extension has an RFC 822-compliant address parser. This parser correctly navigates through whitespace, comments, and other oddities, but it allows obvious mistakes because it assumes that addresses without hostnames are local:

$email = 'stephen(his account)@ example(his host)';
$parsed = imap_rfc822_parse_adrlist($email,'');
print_r($parsed);
Array
(
    [0] => stdClass Object
        (
            [mailbox] => stephen
            [host] => example
            [personal] => his host
        )

)

Reassembling the mailbox and host, you get "stephen@example", which probably isn't what you want. The empty string you must pass in as the second argument defeats your ability to check for valid hostnames.

Some people like behind-the-scenes processing, such as DNS lookups, to check if the address is valid. This doesn't make much sense because that technique won't always work, and you may end up rejecting perfectly valid people from your site, due to no fault of their own. (Also, it's unlikely a mail administrator would fix his mail handling just to work around one web site's email validation scheme.)

Another consideration when validating email addresses is that it doesn't take too much work for a user to enter a completely legal and working address that isn't his. For instance, one of the authors used to have a bad habit of entering "billg@microsoft.com" when signing up for Microsoft's web sites because "Hey! Maybe Bill doesn't know about that new version of Internet Explorer?"

If the primary concern is to avoid typos, make people enter their address twice, and compare the two. If they match, it's probably correct. Also, filter out popular bogus addresses, such as "president@whitehouse.gov" and the previously mentioned "billg@microsoft.com". (This does have the downside of not letting The President of the United States of America or Bill Gates sign up for your site.)

However, if you need to ensure people actually have access to the email address they provide, one technique is to send a message to their address and require them to either reply to the message or go to a page on your site and type in a special code printed in the body of the message to confirm their sign-up. If you do choose the special code route, we suggest that you don't generate a random string of letters, such as HSD5nbADl8. Since it looks like garbage, it's hard to retype it correctly. Instead, use a word list and create code words such as television4coatrack. While, on occasion, it's possible to divine hidden meanings in these combos, you can cut the error rate and your support costs.
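
A minimal sketch of generating such a code word follows. The word list and the function name make_code_word( ) are hypothetical, and random_int( ) assumes PHP 7 or later; a real list would be far longer so that codes are hard to guess.

```php
<?php
// Hypothetical word list; a real one would contain many more entries.
$words = array('television', 'coatrack', 'pencil', 'harbor', 'violin', 'meadow');

// Hypothetical helper: two random words joined by a random digit.
function make_code_word(array $words) {
    $last   = count($words) - 1;
    $first  = $words[random_int(0, $last)];
    $second = $words[random_int(0, $last)];
    return $first . random_int(0, 9) . $second;
}

$code = make_code_word($words);
print $code . "\n";   // something of the form television4coatrack
```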

See Also

Recipe 8.6 for information about generating good passwords; Recipe 8.27 for a web site account deactivation program; documentation on imap_rfc822_parse_adrlist( ) at http://www.php.net/imap-rfc822-parse-adrlist.

Finding All Lines in a File That Match a Pattern

Problem

You want to find all the lines in a file that match a pattern.

Solution

Read the file into an array and use preg_grep( ).

Discussion

There are two ways to do this. Here's the faster method:

$pattern = "/\bo'reilly\b/i"; // only O'Reilly books
$ora_books = preg_grep($pattern, file('/path/to/your/file.txt'));

Use the file( ) function to automatically load each line of the file into an array element and preg_grep( ) to filter the bad lines out.

Here's the more memory-efficient method:

$ora_books = array();
$fh = fopen('/path/to/your/file.txt', 'r') or die($php_errormsg);
// fgets( ) returns false at the end of the file, which ends the loop
while (($line = fgets($fh, 4096)) !== false) {
    if (preg_match($pattern, $line)) { $ora_books[] = $line; }
}
fclose($fh);

Since the first method reads in everything all at once, it's about three times faster than the second way, which parses the file line by line but uses less memory. One downside of both methods, however, is that because the regular expression works on only one line at a time, neither finds strings that span multiple lines.
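
To try preg_grep( ) without a file on disk, you can hand it an array directly. In this sketch, the three catalog lines are hypothetical stand-ins for what file( ) would return, and PREG_GREP_INVERT is an optional flag that returns the non-matching lines instead.

```php
<?php
// Hypothetical lines standing in for file('/path/to/your/file.txt').
$lines = array(
    "PHP Cookbook, O'Reilly\n",
    "Some Other Book, Another Publisher\n",
    "Programming Perl, o'reilly\n",
);
$pattern = "/\bo'reilly\b/i";

$ora_books = preg_grep($pattern, $lines);                    // matching lines
$others    = preg_grep($pattern, $lines, PREG_GREP_INVERT);  // everything else

print_r(array_keys($ora_books));   // keys 0 and 2
print_r(array_keys($others));      // key 1
```

Note that preg_grep( ) preserves the original array keys, so the result can double as a list of matching line numbers (counting from 0).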

See Also

Recipe 18.6 on reading files into strings; documentation on preg_grep( ) at http://www.php.net/preg-grep.

Capturing Text Inside HTML Tags

Problem

You want to capture text inside HTML tags. For example, you want to find all the headings in an HTML document.

Solution

Read the HTML file into a string and use nongreedy matching in your pattern:

$html = join('',file($file));
preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);

In this example, $matches[2] contains an array of captured headings.

Discussion

True parsing of HTML is difficult using a simple regular expression. This is one advantage of using XHTML; it's significantly easier to validate and parse.

For instance, the pattern in the Solution is smart enough to find only matching headings, so <h1>Dr. Strangelove</h1> is okay, because it's wrapped inside <h1> tags, but not <h2>How I Learned to Stop Worrying and Love the Bomb</h3>, because the opening tag is an <h2> while the closing tag is not.

This technique also works for finding all text inside bold and italic tags:

$html = join('',file($file));
preg_match_all('#<([bi])>(.+?)</\1>#is', $html, $matches);

However, it breaks on nested tags. Using that regular expression on:

<b>Dr. Strangelove or: <i>How I Learned to Stop Worrying and Love the Bomb</i></b>

doesn't capture the text inside the <i> tags as a separate item.

This wasn't a problem earlier; because headings are block level elements, it's illegal to nest them. However, as inline elements, nested bold and italic tags are valid.
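
The nested case can be reproduced with a short sketch. It uses preg_match_all( ) so that every match is collected, and shows that only the outer bold element is found:

```php
<?php
$html = '<b>Dr. Strangelove or: <i>How I Learned to Stop Worrying and Love the Bomb</i></b>';

preg_match_all('#<([bi])>(.+?)</\1>#is', $html, $matches);

print count($matches[0]) . "\n";   // 1
print $matches[1][0] . "\n";       // b
print $matches[2][0] . "\n";       // the bold text, <i> tags included
```

The italic text is swallowed by the outer match, and the search resumes after the closing </b>, so the <i> element never gets an entry of its own.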

Captured text can be processed by looping through the array of matches. For example, this code parses a document for its headings and pretty-prints them with indentation according to the heading level:

$html = join('',file($file));
preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);

for ($i = 0, $j = count($matches[0]); $i < $j; $i++) {
  print str_repeat(' ', 2 * ($matches[1][$i] - 1)) . $matches[2][$i] . "\n";
}

So, with one representation of this recipe in HTML:

$html =<<<_END_
<h1>PHP Cookbook</h1>

Other Chapters
<h2>Regular Expressions</h2>

Other Recipes
<h3>Capturing Text Inside of HTML Tags</h3>

<h4>Problem</h4>
<h4>Solution</h4>
<h4>Discussion</h4>
<h4>See Also</h4>

_END_;

preg_match_all('#<h([1-6])>(.+?)</h\1>#is', $html, $matches);

for ($i = 0, $j = count($matches[0]); $i < $j; $i++) {
  print str_repeat(' ', 2 * ($matches[1][$i] - 1)) . $matches[2][$i] . "\n";
}

You get:

PHP Cookbook
  Regular Expressions
    Capturing Text Inside of HTML Tags
      Problem
      Solution
      Discussion
      See Also

By capturing the heading level and heading text separately, you can directly access the level and treat it as an integer when calculating the indentation size. To avoid a two-space indent for all lines, subtract 1 from the level.

See Also

Recipe 11.8 for information on marking up a web page and Recipe 11.9 for extracting links from an HTML file; documentation on preg_match( ) at http://www.php.net/preg-match and str_repeat( ) at http://www.php.net/str-repeat.

Escaping Special Characters in a Regular Expression

Problem

You want to have characters such as * or + treated as literals, not as metacharacters, inside a regular expression. This is useful when allowing users to type in search strings you want to use inside a regular expression.

Solution

Use preg_quote( ) to escape Perl-compatible regular-expression metacharacters:

$pattern = preg_quote('The Education of H*Y*M*A*N K*A*P*L*A*N').':(\d+)';
if (preg_match("/$pattern/",$book_rank,$matches)) {
    print "Leo Rosten's book ranked: ".$matches[1];
}

Use quotemeta( ) to escape POSIX metacharacters:

$pattern = quotemeta('M*A*S*H').':([0-9]+)';
if (ereg($pattern,$tv_show_rank,$matches)) {
    print 'Radar, Hot Lips, and the gang ranked: '.$matches[1];
}

Discussion

Here are the characters that preg_quote( ) escapes (PHP 5.3 and later also escape the hyphen, and PHP 7.3 and later also escape #):

. \ + * ? ^ $ [ ] ( ) { } < > = ! | :

Here are the characters that quotemeta( ) escapes:

. \ + * ? ^ $ [ ] ( )

These functions escape the metacharacters with backslash.

The quotemeta( ) function doesn't escape all POSIX metacharacters. The characters {, }, and | are also valid metacharacters but aren't converted. This is another good reason to use preg_match( ) instead of ereg( ).

You can also pass preg_quote( ) an additional character to escape as a second argument. It's useful to pass your pattern delimiter (usually /) as this argument so it also gets escaped. This is important if you incorporate user input into a regular-expression pattern. The following code expects $_REQUEST['search_term'] from a web form and searches for words beginning with $_REQUEST['search_term'] in a string $s:

$search_term = preg_quote($_REQUEST['search_term'],'/');
if (preg_match("/\b$search_term/i",$s)) {
   print 'match!';
}

Using preg_quote( ) ensures the regular expression is interpreted properly if, for example, a Magnum, P.I. fan enters t.c. as a search term. Without preg_quote( ), this matches tick, tucker, and any other word whose first letter is t and third letter is c, as long as at least one more character follows. Passing the pattern delimiter to preg_quote( ) as well makes sure that user input with forward slashes in it, such as CP/M, is also handled correctly.
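
The sketch below shows the difference on a sample sentence of our own. Fixed strings stand in for $_REQUEST['search_term'] so the example can run outside a web server.

```php
<?php
// Sample text is our own; $raw stands in for user input from a form.
$s   = 'Rick called T.C. about the tucker account and CP/M.';
$raw = 't.c.';

$quoted = preg_quote($raw, '/');   // t\.c\.

$unquoted_hits = preg_match_all("/\b$raw/i", $s, $m1);     // dots match anything
$quoted_hits   = preg_match_all("/\b$quoted/i", $s, $m2);  // dots are literal

print $unquoted_hits . "\n";   // 2 (T.C. and the tuck in tucker)
print $quoted_hits . "\n";     // 1 (T.C. only)

// Input containing the delimiter is also safe once it has been quoted.
$slash      = preg_quote('CP/M', '/');   // CP\/M
$slash_hits = preg_match("/\b$slash/i", $s);
print $slash_hits . "\n";      // 1
```

Without the second argument to preg_quote( ), the slash in CP/M would end the pattern early and the remaining M would be treated as an unknown modifier.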

See Also

Documentation on preg_quote( ) at http://www.php.net/preg-quote and quotemeta( ) at http://www.php.net/quotemeta.

Reading Records with a Pattern Separator

Problem

You want to read in records from a file, in which each record is separated by a pattern you can match with a regular expression.

Solution

Read the entire file into a string and then split on the regular expression:

$filename = '/path/to/your/file.txt';
$fh = fopen($filename, 'r') or die($php_errormsg);
$contents = fread($fh, filesize($filename));
fclose($fh);

$records = preg_split('/[0-9]+\) /', $contents);

Discussion

This breaks apart a numbered list and places the individual list items into array elements. So, if you have a list like this:

1) Gödel 
2) Escher
3) Bach

You end up with a four-element array, with an empty opening element. That's because preg_split( ) assumes the delimiters are between items, but in this case, the numbers are before items:

Array
(
    [0] => 
    [1] => Gödel
    [2] => Escher
    [3] => Bach
)

From one point of view, this can be a feature, not a bug, since the nth element holds the nth item. But, to compact the array, you can eliminate the first element:

$records = preg_split('/[0-9]+\) /', $contents);
array_shift($records);

Another modification you might want is to strip newlines from the elements and substitute the empty string instead:

$records = preg_split('/[0-9]+\) /', str_replace("\n",'',$contents));
array_shift($records);

PHP doesn't allow you to change the input record separator to anything other than a newline, so this technique is also useful for breaking apart records divided by strings. However, if you find yourself splitting on a string instead of a regular expression, substitute explode( ) for preg_split( ) for a more efficient operation.
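
As an alternative to array_shift( ) and str_replace( ), preg_split( ) accepts flags of its own. This sketch uses an in-memory string in place of the file contents; combining PREG_SPLIT_NO_EMPTY with trim( ) is our own variation, not the recipe's method.

```php
<?php
// In-memory stand-in for the contents of the file.
$contents = "1) Gödel\n2) Escher\n3) Bach\n";

// -1 means no limit; PREG_SPLIT_NO_EMPTY drops the empty leading element.
$records = preg_split('/[0-9]+\) /', $contents, -1, PREG_SPLIT_NO_EMPTY);

// trim() removes the newline left at the end of each record.
$records = array_map('trim', $records);

print_r($records);
```

This yields a three-element array holding Gödel, Escher, and Bach. Unlike the str_replace( ) approach, trim( ) removes only leading and trailing whitespace, so newlines inside a multiline record are kept.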

See Also

Recipe 18.6 for reading from a file; Recipe 1.12 for parsing CSV files.

Notes

  1. Or {}, <>, ||, ##, or whatever your favorite delimiters are. PHP supports them all.
  2. It's possible to have valid HTML and still get into trouble, for instance, if you have bold tags inside a comment. A true HTML parser ignores this section, but our pattern won't.