PDF module for SpamAssassin

July 19, 2007

In response to the recent torrent of PDF spam, we created a SpamAssassin module that scans PDF attachments. The module, linked below, works as follows:

  1. Email bodies are scanned upon connection, and checked for PDF attachments.
  2. Text is extracted from the PDF via pdftotext, and scanned by SpamAssassin.
  3. Should the PDF contain images, the gocr binary is called to extract the text content.
  4. The total spam score of the PDF is compared against the global required_score setting; if it meets or exceeds that threshold, the score configured in pdf.cf (10 by default) is added to the overall score of the email message.
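
The decision in step 4 can be sketched in a few lines of Perl. This is a minimal illustration, not code from the module: the helper name pdf_rule_points and its arguments are hypothetical, and the comparison is written as >= to match the check visible in the patch quoted in the comments.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Pdf.pm): decide how many points the
# PDF rule contributes to the message.
#   $pdf_score      - score SpamAssassin gave the text extracted from the PDF
#   $required_score - the global required_score threshold
#   $rule_score     - score configured for the rule in pdf.cf (assumed default 10)
sub pdf_rule_points {
    my ($pdf_score, $required_score, $rule_score) = @_;
    $rule_score = 10 unless defined $rule_score;

    # The rule fires only when the PDF content on its own would count as spam.
    return $pdf_score >= $required_score ? $rule_score : 0;
}

# Example run with a required_score of 5 and the default rule score.
for my $pdf_score (2.1, 5.0, 7.4) {
    printf "PDF score %.1f -> %d points added\n",
        $pdf_score, pdf_rule_points($pdf_score, 5);
}
```

Note that the points added are fixed: a PDF scoring 7.4 against a required_score of 5 contributes the same 10 points as one scoring exactly 5, while one scoring 2.1 contributes nothing.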

This approach departs from the usual method: it runs the extracted content through the SpamAssassin engine itself rather than matching it against a word list.

To install the module, download it from: http://atmail.com/kb/attach/PDFassassin.tgz

Installation directions can be found in the README file inside the archive.

PDFassassin forum: http://forum.atmail.com/viewforum.php?id=10


Filed under: Frontpage — John Contad @ 11:01 pm

7 Comments

  1. Thanks, guys, this is great.

    Comment by Joe Peterson — July 27, 2007 @ 3:04 pm
  2. Pretty nice stuff and the most important thing is that it works!

    Comment by Michael — July 30, 2007 @ 7:46 am
  3. PDFassassin has been featured at www.linuxlinks.com

    Comment by Steve — August 5, 2007 @ 11:52 am
  4. Is the image text scanned in the same way the PDF text is?

    Comment by Mark Hannessen — August 8, 2007 @ 7:37 am
  5. Hello!

    Just installed your great stuff; I get a lot of spam with PDF documents attached.

    Maybe this is not the right place, but I would like to suggest an improvement:

    - replace pdftotext with pstotext.

    The reason is that 99% of these attached PDF documents arrive password protected, and pdftotext cannot deal with them. I have successfully replaced pdftotext with pstotext (http://freshmeat.net/projects/pstotext/) in the code, and now it really works!

    Cheers!

    Comment by zooly — August 8, 2007 @ 8:50 am
  6. Great module!

    I briefly looked over the code. I noticed you’re using the PID as a unique identifier to create temporary files.

    Perl has ‘File::Temp’ to properly and safely create temporary files.

    Comment by Pascal de Bruijn — August 13, 2007 @ 1:16 am
  7. Hi,
    Thanks, I am using your plugin.
    However, I would like the plugin to return the score computed by the nested SpamAssassin scan rather than the fixed value of 10.
    I created a patch for that. After patching, the plugin keeps its existing behavior by default. To activate the dynamic score, add the following to pdf.cf:

    pdf_dynamic_score 1

    retrieving revision 1.1
    retrieving revision 1.2
    diff -u -r1.1 -r1.2
    --- Pdf.pm 6 Feb 2008 12:38:48 -0000 1.1
    +++ Pdf.pm 6 Feb 2008 12:39:33 -0000 1.2
    @@ -11,6 +11,7 @@
    use Mail::SpamAssassin::Plugin;

    our @ISA = qw (Mail::SpamAssassin::Plugin);
    +our $dynamic_score = 0;

    # constructor: register the eval rule
    sub new {
    @@ -22,6 +23,13 @@
    return $self;
    }

    +sub parse_config {
    + my ( $self, $opts ) = @_;
    + if ( $opts->{key} =~ /^pdf_dynamic_score/i ){
    + $dynamic_score= $opts->{value};
    + }
    +}
    +
    sub check_pdf {
    my ( $self, $pms ) = @_;

    @@ -108,13 +116,29 @@
    my $mail = $spamtest->parse($message);
    my $status = $spamtest->check($mail);
    my $score = $status->get_body_only_points();
    +$self->_clean_pdf_tmp();

    +if($dynamic_score == 1){
    + if($score) {
    + my $description = $pms->{conf}->{descriptions}->{PDF};
    + $description .= " -- with score $score";
    + $pms->{conf}->{descriptions}->{PDF} = $description;
    +
    + #$pms->got_hit("PDF", "HEADER: ", score => $score);
    + $pms->_handle_hit("PDF",$score, "HEADER: ", $pms->{conf}->{descriptions}->{PDF});
    +
    + for my $set (0..3) {
    + $pms->{conf}->{scoreset}->[$set]->{"PDF"} =
    + sprintf("%0.3f", $score);
    + }
    + }
    + }
    + return 0;
    +}
    # Message is marked as Spam
    -if($score >= $req) {
    +elsif($score >= $req) {
    #print STDERR "MARKED";
    #print STDERR $status->get_report();
    -$self->_clean_pdf_tmp();
    -return 1;
    + return 1;
    #$message = $status->rewrite_mail();

    } else {

    Comment by Noah Heusser — February 6, 2008 @ 6:25 am
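
Comment 6 above suggests File::Temp in place of PID-based temporary file names. A minimal sketch of that idea follows; the helper name write_attachment_to_temp and the pdfscan_ template are hypothetical and not taken from Pdf.pm, and only the core File::Temp module is assumed.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper (not part of Pdf.pm): write an attachment's raw bytes
# to a private temporary file and return the file's path.
sub write_attachment_to_temp {
    my ($bytes) = @_;

    # tempfile() chooses an unpredictable name and creates the file itself,
    # avoiding the guessable names a PID-based scheme produces.
    # UNLINK => 1 asks File::Temp to remove the file when the program exits.
    my ($fh, $path) = tempfile(
        'pdfscan_XXXXXX',
        SUFFIX => '.pdf',
        TMPDIR => 1,
        UNLINK => 1,
    );

    binmode $fh;
    print {$fh} $bytes or die "write failed: $!";
    close $fh or die "close failed: $!";
    return $path;
}

my $path = write_attachment_to_temp("%PDF-1.4\n");
my $size = -s $path;
print "wrote $size bytes to $path\n";
```

The returned path could then be handed to the text extractor in place of a name built from the process ID.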
