
Yet Another Scraper

Kabitzin
Lvl: 4
Posts: 41
10/22/2007 10:29 AM EDT

I noticed an unrecognized site leaving a pingback on one of our site news posts, and thought it quite odd. It turns out that the site (anime.blogscene dot net) is just a scraper site that republishes various feeds in full, in a way that makes it look as if the posts were all written by the splog author. I have already sent an email to blogscene asking that the page be removed, but I thought I'd make the announcement, since they are stealing from a lot of other anime sites.

Kabitzin
Lvl: 4
Posts: 41
10/22/2007 02:30 PM EDT

We forwarded your mail to the "World of Anime" blogauthor, he promised us to remove immediately all your posts from his blog.

We apologize for this problem, we don't allow to use a feedposter but if they run a script that simulates a normal post this is more difficult to discover. And if we warn them they always tell us they can do this because of the linkback to the original author.

======

At least Blogscene was fairly quick in replying, although I am a bit skeptical about the resolution.  The linkback does not even contain the name of the source post, and there is no indication on the posts that the posts are taken from someone else's site.

Christopher Fritz
Lvl: 2
Posts: 3
10/23/2007 06:46 AM EDT

I must admit, Hung's own solution is the kind I've always imagined I would use were it my content being taken.

Zyl
Lvl: 2
Posts: 10
11/05/2007 06:50 PM EST

I feel like I'm under siege from scrapers and sploggers! A new one (which also scraped Sea Slugs! today) just popped up but has no ads yet.

Even LiveJournal has gotten in on the act (www.plagiarismtoday.com/2007/04/03/six-apartrojo-now-spam-bloggers/), which is particularly bad for me since I use the Full Text Feed plugin. Or make that past tense. InsaneJournal also does it.

Basically I'm putting an IP deny on anything pulling my feed that doesn't show a User-Agent. Not sure how much collateral damage this is going to cause, but I'm so pissed off right now.
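
A minimal sketch of a related approach, assuming Apache with mod_rewrite enabled and a feed served at /feed/ (the path is an assumption, not something stated in this thread). Instead of denying individual IPs found in the logs, it refuses any feed request that arrives without a User-Agent header:

```apache
# Sketch only: assumes mod_rewrite is available and the feed lives at /feed/
RewriteEngine On

# True when the client sent no User-Agent header (or an empty one)
RewriteCond %{HTTP_USER_AGENT} ^$

# In .htaccess the pattern is matched without the leading slash;
# "-" means no substitution, and [F] answers with 403 Forbidden
RewriteRule ^feed/?$ - [F]
```

Any legitimate aggregator that sends no User-Agent would be refused as well, which is the collateral damage mentioned above.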

Hung
Lvl: 12
Posts: 462
11/05/2007 11:02 PM EST

Incidentally, I think one of those scrapers was using Anime Nano's RSS feed. So I blocked it. Or rather, did the thing I did before that Christopher linked to.

Hmm. Hopefully Anime Nano shows a user agent. I can't really remember if it does. 

Zyl
Lvl: 2
Posts: 10
11/06/2007 06:23 AM EST

Ah, so that's why all those *ahem* naughty posts were showing up on that scraper site. Hung, good job! :)

Anime Nano does show a user agent (sporkmonger FeedTools), though it doesn't identify itself as AN, whereas AB.net's bot identifies itself as such in the logs. Don't sweat it though, it was pretty easy to identify AN's feed pull via the IP anyway.

Hung
Lvl: 12
Posts: 462
11/06/2007 07:35 PM EST

Oh yeah! I remember setting up that FeedTools thing now. I'm not sure exactly how to do the agent thing though. Oh well, looks like it all worked out anyway.

Zyl
Lvl: 2
Posts: 10
11/07/2007 03:58 PM EST

Gah, another problem. I'm being scraped by two car-related spammers. My guess is that they are using this scraping tool because the trackback/pingback enables them to bypass the usual anti-spam plugins.

I'm normally just satisfied to use an IP deny but these two jokers were just too much. Made the requisite modifications, based on Hung's post, to my .htaccess but got a 500 Internal Server Error as a result. I must be doing something wrong but I have no idea what. ;_;

Kurisu
Lvl: 5
Posts: 52
11/07/2007 05:05 PM EST

Do you have access to your error.log?

Is your syntax correct?

Does your hoster allow the use of .htaccess?

Did you upload it as ANSI in ASCII mode? 

Could it be an AllowOverride problem?

Zyl
Lvl: 2
Posts: 10
11/08/2007 04:27 AM EST

Thanks, Kurisu, for the checklist!

Error log (via cPanel): yes, but only the most recent 300 entries. Will try again soon and see what turns up.

Hoster allows .htaccess, and the upload (via FileZilla) was on auto. I assume it was in the correct form, as .htaccess worked fine when it just contained the lines for IP deny and hotlink protection (and exceptions).

Syntax: that might be it... Some of the other websites I checked also included a RewriteBase / after RewriteEngine On, but I don't know if that is necessary in my case. I don't really know anything about Apache, so I'm not sure what the AllowOverride problem is. Trying to read up on it now. This has been educational. :)

Kurisu
Lvl: 5
Posts: 52
11/08/2007 05:14 AM EST

If you can use your error log, it gets easier:

[alert] [client ...] LOCAL_PATH_TO/.htaccess: RewriteRule: bad flag delimiters
-> bad syntax

[alert] [client ...] /www/user1/htdocs/.htaccess: Options not allowed here
-> probably AllowOverride problem

[error] [client ...] mod_rewrite: maximum number of internal redirects reached. Assuming configuration error. Use 'RewriteOptions MaxRedirects' to increase the limit if neccessary.
-> infinite redirect loop (a rule keeps matching its own output)
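
For reference, a rough sketch of what the first and third messages usually point to, assuming a root-level .htaccess with mod_rewrite enabled. The paths /moved.html and /mirror/ are made up for illustration and do not come from anyone's actual configuration:

```apache
RewriteEngine On

# Flags go inside one pair of square brackets, comma-separated.
# Leaving the brackets off (e.g. writing R=302,L) triggers
# "RewriteRule: bad flag delimiters".
RewriteRule ^old\.html$ /moved.html [R=302,L]

# Without this condition the rule below would match its own output
# (/mirror/x -> /mirror/mirror/x -> ...) until the redirect limit is hit.
RewriteCond %{REQUEST_URI} !^/mirror/
RewriteRule ^(.*)$ /mirror/$1 [L]
```

The second message is a different matter: what an .htaccess file may contain is decided by the AllowOverride directive in the main server configuration, which on shared hosting only the hoster can change.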

 

 

Zyl
Lvl: 2
Posts: 10
11/10/2007 06:29 PM EST

I've finally managed to try the redirect again. No 500 Internal Server Error this time - I did make a change to the syntax by removing the L flag.

Still waiting to see if the redirect will work though. Keeping fingers crossed! (^_^)_x
