13 March, 2006


At times, I have thought it would be handy to run a search engine that trolls government web pages and -- in real time, not with some 6-month delay -- lets users see how a page has changed. In particular, it would automatically alert users when web content disappears.

(As someone shows in Jonathan's comments, the page didn't disappear from everywhere, just from all *.mil sites.)

The U.S. government is in a censorship frenzy*. Sometimes I think a very clever person could divine what they're worried about from seeing what they censor. But the page Jonathan refers to is mystifying. Would the Pentagon really take down an entire interview with their Secretary in order to (ineffectively) hide one not-very-embarrassing sentence? It's hard to believe and it's also possible, given the current environment. When in doubt, leave it out -- of the public record.


Man, I love the commenters there.

I thought Jonathan was kidding when he listed the interviewer as Plum TV--as in, what a plum interview arrangement, they don't even point it out when you say idiotic things. But no, it really is a network for kissing up to people like Rumsfeld.

I think such a search engine would be incredibly costly, scale badly, and be subject to anti-bot scripts.

In the meantime we can give an occasional hand to Steven Aftergood et al.

Posted by Saheli

Actually it would be pretty trivial to write such a spider. I could probably do it by breaking robots.txt support in 'wget' and writing a few perl scripts in about two hours. After that you just need a sufficient amount of archive space. If you're just keeping diffs and excluding PDFs, etc., this isn't really a HUGE amount of space... I don't think it would be undoable, though it would require way more resources than any individuals like us have to spare.
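To make the "just keeping diffs" idea concrete, here is a rough sketch in Python rather than wget and perl. It is an illustration under assumptions, not the scripts described above: the names `fetch` and `check_page` are made up for this example, the archive is a plain in-memory dict standing in for real storage, and crawling, robots.txt handling, and rate limiting are all left out.

```python
import difflib
import urllib.error
import urllib.request


def fetch(url, timeout=30):
    """Return the page text, or None if the server says the page is gone."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except urllib.error.HTTPError as err:
        if err.code in (404, 410):
            return None
        raise


def check_page(url, archive, fetcher=fetch):
    """Compare the live page with the archived copy.

    `archive` is a hypothetical dict mapping URL -> last seen text.
    Returns (status, diff_text), where status is one of
    "new", "unchanged", "changed", "disappeared", or "unavailable".
    """
    old = archive.get(url)
    new = fetcher(url)

    if new is None:
        # Keep the last good copy so the vanished content is not lost.
        return ("disappeared", "") if old is not None else ("unavailable", "")
    if old is None:
        archive[url] = new
        return ("new", "")
    if old == new:
        return ("unchanged", "")

    # Only the diff needs to be reported; the archive holds the latest copy.
    diff = "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile="previous",
            tofile="current",
        )
    )
    archive[url] = new
    return ("changed", diff)
```

Running `check_page` on a schedule over a list of URLs would flag a "disappeared" status the first time a previously archived page returns 404 or 410, which is the alert the post asks for. Persisting the diffs and the archive to disk is the part that needs the storage.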

Posted by saurabh

excluding PDFs,

Hmm. That seems problematic. I'm always shocked at how very much information is stored in this format.

How much space would you need just to track changes? How would you stop from being blocked? 

Posted by Saheli

Yeah, PDFs are a pain. They're everywhere.

* I don't remember why I put that asterisk there, but I need to close the tag, as it were.

Posted by hedgencrisy

