Friday, October 22, 2004

Userfriendly scraper

#!/usr/bin/perl -w

use strict;
use LWP::Simple;
use XML::RSS;
use Date::Format;

# These regexes taken from Dailystrips
my $patternpre = "<img.+?src=\"(http://www\.userfriendly\.org/cartoons/archives/%y.+?/uf.+?\.gif)\"";
my $urlpre = "http://ars.userfriendly.org/cartoons/?id=%Y%m%d&mode=classic";

my $pattern = time2str ($patternpre, time);
my $url = time2str ($urlpre, time);

my $page = get($url);

my $rss = new XML::RSS (version => '1.0');

$rss->channel(title       => 'User Friendly',
	      link	  => 'http://userfriendly.org/',
	      description => 'User Friendly the Comic Strip');

if ($page =~ /$pattern/ig)
{
	$rss->add_item(title       => time2str("CARTOON FOR %a %b, %Y",time),
	               link        => "$url",
       	               description => "&lt;img src='$1' /&gt;");
}
	
	
print $rss->as_string;
I got fed up of using overblown commands to HTMLise these things, so I have a little sed script now:
s/&/\&amp;/g
s/</\&lt;/g
s/>/\&gt;/g
s/"/\&quot;/g
I changed sun-pic.pl. The line
foreach (split ("\r\n", $page))
is now
foreach (split ("\n", $page))

0 Comments:

Post a Comment

<< Home