Joho the Blog
An Entry from the Archives


May 18, 2004

Please ignore this MT-Blacklist URL extractor

NOTE: I updated this script on June 2 so that it now pulls out all (?) of the URLs embedded in the selected spams. The code listed below is updated, but not my comment about it not doing what I just updated it to do. So to speak. (The version without the wordwrap issues has also been updated. But what you really want is this pure text version, suitable for copying and pasting.)

I just received 200 comment spams. They each listed a different URL and came from a different IP address, which means the invaluable MT-Blacklist (thank you, Jay Allen) has to be told to delete each one, one at a time.

Instead, I cobbled together an Outlook script — yes, I use OL on my desktop machine, although I've been happy with Thunderbird on my laptop — that looks through the messages you have highlighted in your inbox and builds a list of the URLs that people listed in the URL field of the comment. You then paste these into the text box in MT-Blacklist's "Add" tab. (It also shows a list of the IP addresses, although I don't know why I bothered.)

Here are just some of the caveats you need to take very seriously: I am fumbling around in the dark when it comes to VBA for Outlook. And there's almost no (= NO) error checking in this little program, so you could end up banning your mother; you must carefully inspect each of the URLs to make sure you really want to delete the comment that contains it. Further, I don't really understand how MT-Blacklist works. And there are probably some bad line wraps in the code below which will totally break it. Finally, this does NOT find any of the URLs in the body of the message because that's too hard. Well, finding the beginning of the URLs isn't hard, but figuring out where they end is.

So with that warning (WARNING: read the warning!), here's the script:

Sub FindURLStoBAN()
' walks through the selected
' messages to find bad urls
 Dim objApp As Application
 Dim objSel As Selection
 Dim objItem As Object
 Dim msgtxt As String
 Dim ipstr As String
 Dim urlstr As String
 Dim ips As String
 Dim us As String
 Dim u As String
 Dim p As Long, p1 As Long, p2 As Long, p3 As Long
 Dim prevp As Long, x As Long
 Dim udone As Boolean

 Set objApp = CreateObject("Outlook.Application")
 ' get the selected msgs
 Set objSel = objApp.ActiveExplorer.Selection

 x = 0
 For Each objItem In objSel
   If objItem.Class = 43 Then ' 43=mailitem
     msgtxt = objItem.Body ' get msg text
     ' Is this msg from mt-blacklist?
     p = InStr(msgtxt, "MT-Blacklist")
     If p > 0 Then ' yes it is
       ' get the ip to ban
       p1 = InStr(msgtxt, "IP Address:")
       p1 = p1 + 12
       p2 = InStr(p1, msgtxt, vbCr)
       ips = Mid(msgtxt, p1, p2 - p1)
       ipstr = ipstr & vbCr & vbLf & ips
       ' get the url listed for the name
       p1 = InStr(msgtxt, "URL: ") + 5
       p2 = InStr(p1, msgtxt, vbCr)
       us = Mid(msgtxt, p1, p2 - p1)
       urlstr = urlstr & vbCr & vbLf & us

       ' ---- Get urls in the text
       udone = False: prevp = 1
       ' uppercase it because I'm lazy
       msgtxt = UCase(msgtxt)
       While Not udone
         u = ""
         p2 = 0: p3 = 0
         ' get next a href
         p1 = InStr(prevp, msgtxt, "<A HREF=")
         If p1 > 0 Then
           ' find end of href (closing quote plus bracket)
           p3 = InStr(p1 + 1, msgtxt, """>")
           ' get /a
           p2 = InStr(p1 + 1, msgtxt, "</A>")
         End If
         ' if it has an end of href and an end /a
         If p3 >= p1 + 9 And p2 > 0 Then
           ' extract the string between <A HREF=" and ">
           ' (assumes the href value is wrapped in double quotes)
           u = Mid(msgtxt, p1 + 9, p3 - (p1 + 9))
           ' note where it ended for next loop
           prevp = p2
           ' is it already in the string? (ignore case)
           If InStr(1, urlstr, u, vbTextCompare) = 0 Then
             urlstr = urlstr & vbCr & vbLf & u
           End If
         Else
           ' out of links, or hit a malformed one: stop looking
           udone = True
         End If
       Wend

     End If ' if p > 0 msg from mtblacklist
     x = x + 1

   End If

 Next

 ' Fill the two textboxes

 mtblacklistfrm.iptxt.Text = ipstr
 mtblacklistfrm.urltxt.Text = urlstr
 mtblacklistfrm.Show
 Set objItem = Nothing

End Sub

(Here's a version that shouldn't have word-wrap problems.)

To make this work, you have to create a form called mtblacklistfrm and stick into it a text box that you name iptxt and one that you name urltxt. Set the text boxes' scroll bars to on and make sure that they're set to multiline.
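On the caveat about not knowing where a URL in the message body ends: one rough way to handle it is to start at each "http://" and walk forward until a character that cannot plausibly be part of a URL. The sketch below is only an illustration under that assumption. FindBareURLs and its list of stopper characters are hypothetical and are not part of the script above; trailing punctuation (a period at the end of a sentence, say) will still be swept into the URL, so the output needs the same careful inspection as everything else here.

```vba
' Hypothetical helper, not part of the original script.
' Returns the bare http:// URLs found in txt, one per line.
Function FindBareURLs(ByVal txt As String) As String
    Dim stoppers As String
    Dim result As String
    Dim p As Long, q As Long

    ' Assumption: a URL ends at the first whitespace character,
    ' quote mark, or angle bracket.
    stoppers = " " & vbTab & vbCr & vbLf & """" & "'" & "<>"

    p = InStr(1, txt, "http://", vbTextCompare)
    Do While p > 0
        ' Walk forward until a stopper or the end of the text
        q = p
        Do While q <= Len(txt)
            If InStr(stoppers, Mid(txt, q, 1)) > 0 Then Exit Do
            q = q + 1
        Loop
        If result <> "" Then result = result & vbCrLf
        result = result & Mid(txt, p, q - p)
        ' Look for the next URL after the end of this one
        p = InStr(q, txt, "http://", vbTextCompare)
    Loop

    FindBareURLs = result
End Function
```

Calling FindBareURLs(objItem.Body) inside the loop would give a second list to compare against the one built from the "a href" tags.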

If you don't know how to stick a script like this into OL, then you shouldn't. If you do, then you could have done this better yourself.

Warning: Do not trust this script! It undoubtedly is embarrassingly wrong and dangerous. Have pity on me. I'm a humanities major.

Thank you.

Posted by D. Weinberger at May 18, 2004 11:09 PM


Comments

Could you be so kind as to add all of those IPs to Jay's Blacklist Clearinghouse?

http://www.jayallen.org/comment_spam/submit

Thanks!

Posted by: timsamoff | May 19, 2004 11:09 AM


We're testing a very simple/stupid anti-spam mechanism: add another checkbox to the "Post a comment" form that states: "I am not a bot" (or a less parsable variant thereof) and don't post anything that doesn't have the box checked. Maybe I'm being stupid spreading this among MT users, but most bots are too naive to overcome that very simple step.

Posted by: Gene Koo | May 25, 2004 06:19 PM


Having just spent *hours* cleaning up the blacklist over at corante.com, let me add a caveat to all of this.

There are two problems with blindly grabbing the URLs from the messages and adding them.

The first is the proliferation of "throwaway" subdomains in order to make blacklisting harder. Thus, you might get 100+ versions of offensive-string-of-text.baddomain.com -- all the blacklist needs is "baddomain.com", and if you add the subdomains, not only will you clog up the blacklist, you'll still get spam from offensive-string-of-text2.baddomain.com.

The second is the "poison pill" issue--I've had to remove a bunch of "good" URLs and text strings from the blacklist because it was preventing people from posting legitimate comments. "yahoo.com" for example, which blocked everyone with a yahoo mail account.

Posted by: Liz Lawley | June 2, 2004 11:24 AM
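
Both problems raised in the comment above could be handled, at least roughly, before the list ever reaches MT-Blacklist. What follows is a minimal sketch, not a tested fix: BaseDomain, IsProtected, CollapseURLList, and the contents of SAFE_LIST are all hypothetical names invented for illustration. The "last two labels" rule is deliberately naive and gives the wrong answer for hosts such as example.co.uk, so the result still needs to be read by a human.

```vba
' Hypothetical helpers, not part of the original script or of MT-Blacklist.

' Reduces a URL to its last two host labels, e.g.
' "http://spam-text.baddomain.com/page" -> "baddomain.com".
' Naive on purpose: wrong for hosts like "example.co.uk",
' and numeric IP addresses are not special-cased.
Function BaseDomain(ByVal url As String) As String
    Dim host As String
    Dim labels() As String
    Dim p As Long, n As Long

    host = LCase(Trim(url))
    ' Drop the scheme, if any
    p = InStr(host, "://")
    If p > 0 Then host = Mid(host, p + 3)
    ' Drop the path, then the query string, then the port
    p = InStr(host, "/")
    If p > 0 Then host = Left(host, p - 1)
    p = InStr(host, "?")
    If p > 0 Then host = Left(host, p - 1)
    p = InStr(host, ":")
    If p > 0 Then host = Left(host, p - 1)

    labels = Split(host, ".")
    n = UBound(labels)
    If n >= 1 Then
        BaseDomain = labels(n - 1) & "." & labels(n)
    Else
        BaseDomain = host
    End If
End Function

' True if domain is on a list of domains that must never be
' blacklisted. The list contents here are only examples.
Function IsProtected(ByVal domain As String) As Boolean
    Const SAFE_LIST As String = "yahoo.com,google.com"
    IsProtected = InStr(1, "," & SAFE_LIST & ",", _
                        "," & domain & ",", vbTextCompare) > 0
End Function

' Takes a CR/LF-separated list of URLs and returns the distinct,
' unprotected base domains, one per line.
Function CollapseURLList(ByVal urlstr As String) As String
    Dim lines() As String
    Dim result As String, d As String
    Dim i As Long

    lines = Split(urlstr, vbCrLf)
    For i = LBound(lines) To UBound(lines)
        d = BaseDomain(lines(i))
        If d <> "" And Not IsProtected(d) Then
            ' Skip domains already collected
            If InStr(1, vbCrLf & result & vbCrLf, vbCrLf & d & vbCrLf) = 0 Then
                If result <> "" Then result = result & vbCrLf
                result = result & d
            End If
        End If
    Next i
    CollapseURLList = result
End Function
```

In the script from the post, this would slot in where the form is filled, for example mtblacklistfrm.urltxt.Text = CollapseURLList(urlstr).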


Liz, how serious is the clogging problem? I know theoretically it's good to keep the list shorter rather than longer, but after about a year, I have about 4,500 entries on my list. Do you have a sense of how many names it takes to truly slow the system down?

As for the poison pills: Yup yup yup! My script creates an editable display of the urls partially for that reason. Thanks for the warning!

Posted by: David Weinberger | June 2, 2004 12:30 PM


Hallo friends! Really nice place here. I found a lot of interesting stuff all around. Just what I was looking for. Great joy!

Posted by: Josi Denise | September 21, 2004 04:51 AM


In response to the earlier comment regarding throwaway subdomains: a better approach is to resolve the IP address of each website, since all of the subdomains would have the same IP address. Large organisations like Geocities / Yahoo may use server farms, and thus different IP addresses.

I've an article on resolving DNS names at my website.

Posted by: Joe Mc Laughlin | November 6, 2004 11:25 AM


It's just more coding and more memory, and more stupid irritating problems.

Posted by: John Yajer | June 6, 2005 05:13 PM


Thank you very much for the link that helped a TON.

Posted by: tucex | January 23, 2006 10:37 PM


My main concern is that you can't guarantee every page of your website will be included in the SERPs. Considering I'm constantly adding new products to my company's website, I need to be sure that customers can find them as soon as possible.http://www.seoptimizerz.com

Posted by: SEO | July 23, 2007 09:42 AM

