August 2007


Real Age calculators have become all the rage lately, so I decided to reverse engineer (and improve) a popular one (http://www.poodwaddle.com/realage.htm). The poodwaddle calc is made in Flash, but I made mine in PHP and JavaScript with an XML backend. Below is the code I used to read the Real Age XML into a PHP array. (Oh, and you’ll also need the PHP XML parsing class from php.net.)

(http://healthtech.accordingtome.com/parseData.inc.phps)

if (!isset($_SESSION['questionArray'])) //cache: only parse the XML once per session
{
  // First child of the root: lookup rows mapping an input value to an output value
  $lookupTable=array();
  foreach($xml->children[0]->children as $lookup)
  {
    if ($lookup->name=='row')
      $lookupTable[$lookup->attributes["input"]]=$lookup->attributes["output"];
  }

  // Second child of the root: one element per question
  $questionArray=array();
  foreach($xml->children[1]->children as $questions)
  {
    $tempQuestion=new Question();
    $tempQuestion->title=$questions->children[0]->content;
    $tempQuestion->prompt=$questions->children[1]->content;
    $tempQuestion->genderSpecific=$questions->children[2]->content;
    $tempQuestion->controllable=$questions->children[4]->content;
    // children[3] holds the answer options: prompt text => effect on Real Age
    $tempQuestion->options=array();
    foreach($questions->children[3]->children as $option)
    {
      $optionPrompt=$option->attributes["prompt"];
      $RAEffect=$option->attributes["RA-effect"];
      $tempQuestion->options[$optionPrompt]=$RAEffect;
    }
    $questionArray[]=$tempQuestion;
  }
  $_SESSION['questionArray']=$questionArray;
  $_SESSION['lookupTable']=$lookupTable;
}
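
For anyone who wants to try the parsing step without the php.net XML class, here is a minimal sketch that uses PHP's built-in SimpleXML instead. The Question class, the element names, and the sample values are all assumptions inferred from the child indexes used above; the real XML file may well be laid out differently.

```php
<?php
// Hypothetical stand-in for the Question class the parsing code instantiates.
class Question
{
    public $title;
    public $prompt;
    public $genderSpecific;
    public $controllable;
    public $options = array();
}

// Assumed layout: element names and values are invented for illustration only.
$sampleXml = <<<XML
<realage>
  <lookup>
    <row input="0" output="75"/>
  </lookup>
  <questions>
    <question>
      <title>Smoking</title>
      <prompt>Do you smoke?</prompt>
      <genderSpecific>no</genderSpecific>
      <options>
        <option prompt="Never" RA-effect="2"/>
        <option prompt="Daily" RA-effect="-8"/>
      </options>
      <controllable>yes</controllable>
    </question>
  </questions>
</realage>
XML;

// Build an array of Question objects from the XML string.
function load_questions($xmlString)
{
    $xml = simplexml_load_string($xmlString);
    $questions = array();
    foreach ($xml->questions->question as $node) {
        $q = new Question();
        $q->title = (string) $node->title;
        $q->prompt = (string) $node->prompt;
        $q->genderSpecific = (string) $node->genderSpecific;
        $q->controllable = (string) $node->controllable;
        foreach ($node->options->option as $option) {
            // Attributes are read with array syntax, which also copes with the hyphen
            $q->options[(string) $option['prompt']] = (string) $option['RA-effect'];
        }
        $questions[] = $q;
    }
    return $questions;
}

$questionArray = load_questions($sampleXml);
```

Addressing elements by name rather than by position means the parser keeps working if the order of the child elements changes.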

The actual Real Age code is pretty ugly, but I’ll post it as soon as I clean it up. Yes, it works in IE and FF. Yes, the JavaScript slider code is slow (but it works!). It’s from the Dojo toolkit, which had the only vertical slider I could find that allowed custom labels.

I recently stumbled upon a JavaScript/Canvas graph library that lets you create cool network graphs with nodes and interconnections. Now that I had a graphics library, I needed something with lots of connections to graph. That’s when I thought of mapping health IT blog sites. Now all I needed was a spider…

Spiders, robots, and crawlers all work on the same principle: start with one or more seed URLs, request each URL, and then parse the resulting HTML for link tags, whose targets are pushed onto the queue for crawling.

Time for some code:

(http://www.ryanbyrd.net/linkSpider.phps)

<?php
$tldList=file("tlds.txt");
$masterList=file("masterList.txt");
$acceptableNodes=array();
foreach($masterList as $node)
{
  $node="http://".trim($node);
  $acceptableNodes[]=parse_url_domain ($node);
}
//print_r($acceptableNodes);
$connections=array();
$nodeCounter=array();
foreach ($masterList as $webpage)
{
  $webpage=trim($webpage); // file() keeps the trailing newline on each line
  $url="http://".$webpage;
  $strippedWebpage=parse_url_domain ($url);
  $input = file_get_contents($url); //or die("Could not access file: $url");
  if (!$input) continue;
  // Capture the href value of every <a> tag in $match[2]
  $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
  $URLs=array();
  if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER))
  {
    foreach($matches as $match)
    {
      //echo($match[2]."\t");
      $currentURL=parse_url_domain ($match[2]);
      //echo($currentURL);
      //echo("\n");
      // Keep each plausible outside domain once per page
      if ($currentURL!=$webpage && !in_array($currentURL,$URLs) && check_domain($currentURL) && check_tld($currentURL))
      {
        $URLs[]=$currentURL;
        // Record an edge only when the target is another blog on the master list
        if (in_array($currentURL,$acceptableNodes) && ($strippedWebpage!=$currentURL))
        {
          $connections[]="$strippedWebpage -> $currentURL";
          if (!isset($nodeCounter[$currentURL])) $nodeCounter[$currentURL]=0;
          $nodeCounter[$currentURL]++;
        }
      }
    }
  }
  echo("$webpage : \n");
  print_r($URLs);
}
echo("connections: ");
 print_r($connections);
echo("nodeCounter: ");
print_r($nodeCounter);
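echo("javascript edges: ");
$multiplier=3; // scales the incoming-link count passed with each edge
foreach($connections as $connection)
{
  list($from,$to)=explode(" -> ",$connection);
  // A node that nobody links to has no counter entry; treat it as 0
  $fromCtr=$multiplier*(isset($nodeCounter[$from]) ? $nodeCounter[$from] : 0);
  $toCtr=$multiplier*$nodeCounter[$to];
  echo("g.addEdge($('".$from."'), $('".$to."'),".$fromCtr.",".$toCtr.");\n");
}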
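function parse_url_domain ($url)
{
  // parse_url() returns false for seriously malformed URLs
  $raw_url= parse_url($url);
  if ($raw_url === false)
  {
    return '';
  }
  // Links without a scheme have no host; fall back to the path
  if (empty($raw_url['host']))
  {
    $raw_url['host'] = isset($raw_url['path']) ? $raw_url['path'] : '';
  }
  return strtolower($raw_url['host']);
}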
function check_domain ($url)
{
  // ereg() was removed in PHP 7, so use preg_match(); a domain needs at least one dot
  if (!preg_match('/^.*\..*$/', $url))
  {
    return false;
  }
  // Check every dot-separated label against the allowed character set
  $local_array = explode(".", $url);
  for ($i = 0; $i < count($local_array); $i++)
  {
    if (!preg_match("/^(([A-Za-z0-9!#$%&'*+\/=?^_`{|}~-][A-Za-z0-9!#$%&'*+\/=?^_`{|}~\.-]{0,63})|(\"[^(\\\\|\")]{0,62}\"))$/", $local_array[$i]))
    {
      return false;
    }
  }
  return true;
}
function check_tld($url)
{
  global $tldList;
  $parts=explode(".",$url);
  $lastpart=trim($parts[count($parts)-1]);
  foreach($tldList as $item)
  {
    if (trim($item)==$lastpart)
      return true;
  }
  return false;
}
?>
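
Note that the script above only visits the pages on the master list; it never queues the links it discovers. For comparison, here is a rough sketch of the seed-and-queue principle described earlier. The function names, the `$fetch` callable, and the `$maxPages` cap are all hypothetical, and the link pattern is deliberately simpler than the one in the script (it does not resolve relative links).

```php
<?php
// Pull the href value out of every <a> tag in an HTML string.
function extract_links($html)
{
    $links = array();
    if (preg_match_all('/<a\s[^>]*href=["\']?([^"\' >]+)/i', $html, $matches)) {
        $links = $matches[1];
    }
    return $links;
}

// Breadth-first crawl. $fetch is any callable that takes a URL and
// returns the page HTML, or false when the page cannot be fetched.
function crawl(array $seeds, $fetch, $maxPages = 50)
{
    $queue = $seeds;                         // URLs waiting to be requested
    $seen = array_fill_keys($seeds, true);   // URLs already queued at some point
    $visited = array();                      // URLs requested, in order
    while ($queue && count($visited) < $maxPages) {
        $url = array_shift($queue);
        $html = call_user_func($fetch, $url);
        $visited[] = $url;
        if ($html === false) continue;
        foreach (extract_links($html) as $link) {
            if (!isset($seen[$link])) {      // queue each URL only once
                $seen[$link] = true;
                $queue[] = $link;
            }
        }
    }
    return $visited;
}
```

To crawl real pages you could pass 'file_get_contents' as `$fetch`; passing a function that reads from an array of canned pages makes the loop easy to try without touching the network.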

Unless otherwise noted, Copyright 2004-2013, Ryan Byrd. All Rights Reserved.