Extracting Links from an HTML File (PHP Cookbook)

11.9.1. Problem

You need to extract the URLs that are specified inside an HTML document.

11.9.2. Solution

Use the pc_link_extractor( ) function shown in Example 11-2.

Example 11-2. pc_link_extractor( )

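function pc_link_extractor($s) {
  $a = array();
  // Capture group 1 is the href value (quotes optional); group 2 is the
  // text between the opening <a ...> tag and the closing </a>.
  if (preg_match_all('/<a\s+.*?href=[\"\']?([^\"\' >]*)[\"\']?[^>]*>(.*?)<\/a>/i',
                     $s,$matches,PREG_SET_ORDER)) {
    // PREG_SET_ORDER yields one array per matched link.
    foreach($matches as $match) {
      array_push($a,array($match[1],$match[2]));
    }
  }
  return $a;
}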

For example:

$links = pc_link_extractor($page);

11.9.3. Discussion

The pc_link_extractor( ) function returns an array. Each element of that array is itself a two-element array. The first element is the target of the link, and the second element is the text that is linked. For example:

$links=<<<END
Click <a href="http://www.oreilly.com">here</a> to visit a computer book
publisher. Click <a href="http://www.sklar.com">over here</a> to visit 
a computer book author.
END;

$a = pc_link_extractor($links);
print_r($a);

Array
(
    [0] => Array
        (
            [0] => http://www.oreilly.com
            [1] => here
        )
    [1] => Array
        (
            [0] => http://www.sklar.com
            [1] => over here
        )
)
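
Because each element is a two-element array, the result is easy to walk with foreach. The following is a minimal sketch that hard-codes the array printed above instead of calling pc_link_extractor( ), so it runs on its own. The helper name format_links( ) is hypothetical and is not part of the recipe.

```php
<?php
// The same structure that pc_link_extractor() returned above,
// hard-coded here so the sketch is self-contained.
$a = array(
    array('http://www.oreilly.com', 'here'),
    array('http://www.sklar.com', 'over here'),
);

// format_links() is a hypothetical helper: it turns each
// (target, text) pair into a single "text -> target" line.
function format_links($links) {
    $lines = array();
    foreach ($links as $link) {
        list($url, $text) = $link;   // [0] is the target, [1] is the linked text
        $lines[] = "$text -> $url";
    }
    return $lines;
}

$lines = format_links($a);
print implode("\n", $lines) . "\n";
```

Unpacking each pair with list( ) keeps the loop body readable and avoids scattering the numeric indexes 0 and 1 through the rest of the code.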

The regular expression in pc_link_extractor( ) doesn't catch every link. It misses links that are constructed with JavaScript and links written with some hexadecimal escapes, but it should work on most reasonably well-formed HTML.
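
When the regular expression is not enough, one alternative is to let a parser read the markup. The sketch below swaps in a different technique from the recipe: it uses PHP's DOM extension (DOMDocument) rather than preg_match_all( ). The function name dom_link_extractor( ) is hypothetical, and the sketch assumes the DOM extension is available. The sample tag spreads its attributes over two lines, a case the pattern above does not match because . does not match a newline by default.

```php
<?php
// Hypothetical alternative to the regex: parse the markup with the DOM
// extension, then read each <a> element's href attribute and text.
function dom_link_extractor($s) {
    $links = array();
    if ($s === '') {
        return $links;               // loadHTML() rejects an empty string
    }
    $doc = new DOMDocument();
    // Collect parser warnings about sloppy markup instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $loaded = $doc->loadHTML($s);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    if (!$loaded) {
        return $links;
    }
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        // Skip named anchors that have no href attribute.
        if ($anchor->hasAttribute('href')) {
            $links[] = array($anchor->getAttribute('href'),
                             $anchor->textContent);
        }
    }
    return $links;
}

// The attributes of this tag span two lines.
$html = '<p>Visit <a class="ext"' . "\n"
      . '   href="http://www.sklar.com">over <b>here</b></a>.</p>';
$a = dom_link_extractor($html);
print_r($a);
```

The return value has the same shape as that of pc_link_extractor( ), with one difference: textContent holds only the text of the link, so nested tags such as <b> are dropped rather than returned as markup. Like the regular expression, this approach does not see links that JavaScript builds at runtime.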

11.9.4. See Also

Recipe 13.8 for information on capturing text inside HTML tags; documentation on preg_match_all( ) at http://www.php.net/preg-match-all.

