Parse Web Pages with PHP Simple HTML DOM Parser

PHP Simple HTML DOM Parser

PHP Simple HTML DOM Parser is a dream utility for developers who work with both PHP and the DOM, because it lets them find DOM elements easily from PHP. Here are a few sample uses of PHP Simple HTML DOM Parser:

// Include the library
include('simple_html_dom.php');
 
// Retrieve the DOM from a given URL
$html = file_get_html('http://davidwalsh.name/');

// Find all "A" tags and print their HREFs
foreach($html->find('a') as $e) 
    echo $e->href . '<br>';

// Retrieve all images and print their SRCs
foreach($html->find('img') as $e)
    echo $e->src . '<br>';

// Find all images, print their text with the "<>" included
foreach($html->find('img') as $e)
    echo $e->outertext . '<br>';

// Find the DIV tag with an id of "myId"
foreach($html->find('div#myId') as $e)
    echo $e->innertext . '<br>';

// Find all SPAN tags that have a class of "myClass"
foreach($html->find('span.myClass') as $e)
    echo $e->outertext . '<br>';

// Find all TD tags with "align=center"
foreach($html->find('td[align=center]') as $e)
    echo $e->innertext . '<br>';
    
// Extract all text from the second matching cell (the index passed to find() is zero-based)
echo $html->find('td[align="center"]', 1)->plaintext.'<br><hr>';

As I said earlier, this library is a dream for finding elements, much like the selector engines that JavaScript frameworks made popular. Armed with the ability to pick content from DOM nodes with PHP, it’s time to analyze websites for changes.

The Script

The following script checks two websites for changes:

// Pull in PHP Simple HTML DOM Parser
include("simplehtmldom/simple_html_dom.php");

// Settings on top
$sitesToCheck = array(
          // url is the page to fetch; selector picks the block of the page to watch
          array("url" => "http://www.arsenal.com/first-team/players", "selector" => "#squad"),
          array("url" => "http://www.liverpoolfc.tv/news", "selector" => "ul[style='height:400px;']")
        );
$savePath = "cachedPages/";
$emailContent = "";

// For every page to check...
foreach($sitesToCheck as $site) {
  $url = $site["url"];
  
  // Calculate the cachedPage name, reset old and current content
  $fileName = md5($url);
  $oldContent = "";
  $currentContent = "";
  
  // Get the URL's current page content; skip this site if the fetch fails
  $html = file_get_html($url);
  if(!$html) continue;
  
  // Find content by querying with a selector, just like a selector engine!
  // (if several elements match, the last one wins)
  foreach($html->find($site["selector"]) as $element) {
    $currentContent = $element->plaintext;
  }
  
  // If a cached file exists
  if(file_exists($savePath.$fileName)) {
    // Retrieve the old content
    $oldContent = file_get_contents($savePath.$fileName);
  }
  
  // If different, notify!
  if($oldContent && $currentContent != $oldContent) {
    // Here's where we can do a whoooooooooooooole lotta stuff
    // We could tweet to an address
    // We can send a simple email
    // We can text ourselves
    
    // Append to the email content so every changed page lands in one digest
    $emailContent .= "David, the following page has changed!\n\n".$url."\n\n";
  }
  
  // Save new content
  file_put_contents($savePath.$fileName,$currentContent);
}

// Send the email if there's content!
if($emailContent) {
  // Sendmail!
  mail("david@davidwalsh.name","Sites Have Changed!",$emailContent,"From: alerts@davidwalsh.name");
  // Debug
  echo $emailContent;
}

The code and comments should be self-explanatory. I’ve set the script up so that I get one "digest" alert if several of the pages change. Writing the script is the hard part; to put it to work, I’ve set up a cron job that runs it every 20 minutes.
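
If you want to reuse the cache-and-compare step outside the loop, it can be pulled into a small function. The sketch below is only one way to do it: the name contentHasChanged() and its parameters are hypothetical and not part of the script above, and it assumes the cache directory already exists and is writable.

```php
<?php
// Hypothetical helper (not part of the original script): compares fresh
// content with the cached copy for a URL, then refreshes the cache.
// Returns true only when an earlier copy existed and differs.
function contentHasChanged(string $url, string $currentContent, string $savePath): bool
{
    // Same naming scheme as the script: one cache file per URL, named by md5
    $cacheFile = rtrim($savePath, '/') . '/' . md5($url);

    // First run (or unreadable file): treat the old content as empty
    $oldContent = is_file($cacheFile) ? (string) file_get_contents($cacheFile) : '';

    // Always store the latest content for the next run
    file_put_contents($cacheFile, $currentContent);

    // No alert on the very first run, only when a previous copy differs
    return $oldContent !== '' && $oldContent !== $currentContent;
}
```

Inside the loop you would call it with the URL, the text pulled out by the selector, and $savePath, and append to $emailContent whenever it returns true.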

This solution isn’t specific to spying on footy; you could use this type of script on any number of sites. The script is, however, a bit simplistic for some cases. If you wanted to spy on a website with highly dynamic markup (e.g. a timestamp in the code), you would want to create a regular expression that isolates the content to just the block you’re looking for. Since each website is constructed differently, I’ll leave it up to you to create page-specific isolators. Have fun spying on websites though, and be sure to let me know if you hear a good, reliable footy rumor!
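
To give a rough idea of what a page-specific isolator might look like, here is a minimal sketch that strips clock times and ISO-style dates and collapses whitespace before the comparison. The function name normalizeContent() and both patterns are assumptions for illustration only; a real site will almost certainly need its own patterns.

```php
<?php
// Hypothetical isolator: removes volatile fragments so they don't trigger alerts.
// The patterns are illustrative assumptions, not taken from any real site.
function normalizeContent(string $text): string
{
    $patterns = array(
        // Clock times such as 14:05, 14:05:59 or 2:05 pm
        '/\b\d{1,2}:\d{2}(?::\d{2})?(?:\s*(?:am|pm)\b)?/i',
        // ISO-style dates such as 2011-03-14
        '/\b\d{4}-\d{2}-\d{2}\b/',
    );
    $text = preg_replace($patterns, '', $text);

    // Collapse whitespace runs so layout-only changes are ignored
    return trim(preg_replace('/\s+/', ' ', $text));
}
```

In the script, you would pass $element->plaintext through this function before comparing it with the cached copy and before saving it, so both sides of the comparison are normalized the same way.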

 
