From: Aurelien Jarno via Gdb <gdb@sourceware.org>
To: Jonathan Wakely <jwakely.gcc@gmail.com>
Cc: Guinevere Larsen <guinevere@redhat.com>,
Mark Wielaard <mark@klomp.org>,
binutils@sourceware.org, elfutils-devel@sourceware.org,
gcc@gcc.gnu.org, gdb@sourceware.org, libc-alpha@sourceware.org,
libabigail@sourceware.org, newlib@sourceware.org,
overseers@sourceware.org
Subject: Re: scraperbot protection - Patchwork and Bunsen behind Anubis
Date: Tue, 22 Apr 2025 23:39:36 +0200
Message-ID: <aAgMmOYZGKW_Oqyn@aurel32.net>
In-Reply-To: <CAH6eHdTz4ybqj6mXVaK4ONaaZA0wFj9tKBAXJPc43M_Ox_NE2w@mail.gmail.com>
On 2025-04-22 14:06, Jonathan Wakely wrote:
> On Tue, 22 Apr 2025 at 13:36, Guinevere Larsen via Gcc <gcc@gcc.gnu.org> wrote:
> >
> > On 4/21/25 12:59 PM, Mark Wielaard wrote:
> > > Hi hackers,
> > >
> > > TLDR; When using https://patchwork.sourceware.org or Bunsen
> > > https://builder.sourceware.org/testruns/ you might now have to enable
> > > javascript. This should not impact any scripts, just browsers (or bots
> > > pretending to be browsers). If it does cause trouble, please let us
> > > know. If this works out we might also "protect" bugzilla, gitweb,
> > > cgit, and the wikis this way.
> > >
> > > We don't like to have to do this, but as some of you might have noticed
> > > Sourceware has been fighting the new AI scraperbots since the start of
> > > the year. We are not alone in this.
> > >
> > > https://lwn.net/Articles/1008897/
> > > https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
> > >
> > > We have tried to isolate services more and block various ip-blocks
> > > that were abusing the servers. But that has helped only so much.
> > > Unfortunately the scraper bots are using lots of ip addresses
> > > (probably by installing "free" VPN services that use normal user
> > > connections as exit point) and pretending to be common
> > > browsers/agents. We seem to have to make access to some services
> > > depend on solving a javascript challenge.
> >
> > Jan Wildeboer, on the fediverse, has a pretty interesting lead on how AI
> > scrapers might be doing this:
> > https://social.wildeboer.net/@jwildeboer/114360486804175788 (this is the
> > last post in the thread because it was hard to actually follow the
> > thread given the number of replies, please go all the way up and read
> > all 8 posts).
> >
> > Essentially, there's a library developer that pays developers to just
> > "include this library and a few more lines in your TOS". This library
> > then allows the app to sell the end-user's bandwidth to clients of the
> > library developer, allowing them to make requests. This is how big
> > companies are managing to have so many IP addresses, so many of those
> > being residential IP addresses, and it also means that by blocking those
> > IP addresses we will be - necessarily - blocking real user traffic to
> > our platforms.
>
> It seems to me that blocking real users *who are running these shady
> apps* is perfectly reasonable.
How do you detect them? From my experience at other hosting places,
those IPs just make a few requests per hour or per day, with a standard
User-Agent. As such it's difficult to differentiate them from normal
users.

The problem is that you suddenly have hundreds of thousands of requests
per hour from only a slightly smaller number of IPs. And in the middle
you also have legit users using IPs from the same net block.
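
To illustrate the point above, here is a minimal sketch on synthetic data,
not taken from any real server log. The addresses (from the 192.0.2.0/24
documentation range), the threshold of 10 requests per hour, and the
function names are all assumptions made up for the example. It shows how a
per-IP limit flags nothing while the aggregate for the net block is large,
and how the one "legit" address is indistinguishable from the rest.

```python
from collections import Counter
import ipaddress


def per_ip_counts(requests):
    """Count requests per source IP. `requests` is a list of (ip, timestamp)."""
    return Counter(ip for ip, _ in requests)


def per_block_counts(requests, prefix=24):
    """Count requests per enclosing network block (default /24)."""
    blocks = Counter()
    for ip, _ in requests:
        # strict=False lets us pass a host address and get its network.
        net = ipaddress.ip_network(f"{ip}/{prefix}", strict=False)
        blocks[str(net)] += 1
    return blocks


def flagged_ips(requests, max_per_ip):
    """Return the IPs that exceed a (hypothetical) per-IP hourly limit."""
    return {ip for ip, n in per_ip_counts(requests).items() if n > max_per_ip}


# Synthetic hour of traffic: 200 addresses, 3 requests each (assumed numbers).
scraper_ips = [f"192.0.2.{i}" for i in range(1, 201)]
requests = [(ip, t) for ip in scraper_ips for t in range(3)]

# One ordinary user in the same block, making 2 requests.
requests += [("192.0.2.250", 0), ("192.0.2.250", 1)]

print(len(requests))                   # total requests in the hour
print(flagged_ips(requests, 10))       # per-IP limit: nobody exceeds it
print(per_block_counts(requests))      # the whole block looks heavy
```

Blocking on the per-block count would stop the load, but it would also
block 192.0.2.250, which is exactly the trade-off described above.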
--
Aurelien Jarno GPG: 4096R/1DDD8C9B
aurelien@aurel32.net http://aurel32.net