Mirror of the gdb mailing list
From: Aurelien Jarno via Gdb <gdb@sourceware.org>
To: Jonathan Wakely <jwakely.gcc@gmail.com>
Cc: Guinevere Larsen <guinevere@redhat.com>,
	Mark Wielaard <mark@klomp.org>,
	binutils@sourceware.org, elfutils-devel@sourceware.org,
	gcc@gcc.gnu.org, gdb@sourceware.org, libc-alpha@sourceware.org,
	libabigail@sourceware.org, newlib@sourceware.org,
	overseers@sourceware.org
Subject: Re: scraperbot protection - Patchwork and Bunsen behind Anubis
Date: Tue, 22 Apr 2025 23:39:36 +0200
Message-ID: <aAgMmOYZGKW_Oqyn@aurel32.net>
In-Reply-To: <CAH6eHdTz4ybqj6mXVaK4ONaaZA0wFj9tKBAXJPc43M_Ox_NE2w@mail.gmail.com>

On 2025-04-22 14:06, Jonathan Wakely wrote:
> On Tue, 22 Apr 2025 at 13:36, Guinevere Larsen via Gcc <gcc@gcc.gnu.org> wrote:
> >
> > On 4/21/25 12:59 PM, Mark Wielaard wrote:
> > > Hi hackers,
> > >
> > > TLDR; When using https://patchwork.sourceware.org or Bunsen
> > > https://builder.sourceware.org/testruns/ you might now have to enable
> > > javascript. This should not impact any scripts, just browsers (or bots
> > > pretending to be browsers). If it does cause trouble, please let us
> > > know. If this works out we might also "protect" bugzilla, gitweb,
> > > cgit, and the wikis this way.
> > >
> > > We don't like to have to do this, but as some of you might have noticed
> > > Sourceware has been fighting the new AI scraperbots since the start of
> > > the year. We are not alone in this.
> > >
> > > https://lwn.net/Articles/1008897/
> > > https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
> > >
> > > We have tried to isolate services more and block various ip-blocks
> > > that were abusing the servers. But that has helped only so much.
> > > Unfortunately the scraper bots are using lots of ip addresses
> > > (probably by installing "free" VPN services that use normal user
> > > connections as exit point) and pretending to be common
> > > browsers/agents.  We seem to have to make access to some services
> > > depend on solving a javascript challenge.
> >
> > Jan Wildeboer, on the fediverse, has a pretty interesting lead on how AI
> > scrapers might be doing this:
> > https://social.wildeboer.net/@jwildeboer/114360486804175788 (this is the
> > last post in the thread because it was hard to actually follow the
> > thread given the number of replies, please go all the way up and read
> > all 8 posts).
> >
> > Essentially, there's a library developer that pays developers to just
> > "include this library and a few more lines in your TOS". This library
> > then allows the app to sell the end-user's bandwidth to clients of the
> > library developer, allowing them to make requests. This is how big
> > companies are managing to have so many IP addresses, so many of those
> > being residential IP addresses, and it also means that by blocking those
> > IP addresses we will be - necessarily - blocking real user traffic to
> > our platforms.
> 
> It seems to me that blocking real users *who are running these shady
> apps* is perfectly reasonable.

How do you detect them? From my experience at other hosting places,
those IPs make just a few requests per hour or per day, with a standard
User-Agent. As such, it's difficult to differentiate them from normal
users.

The problem is that you suddenly have hundreds of thousands of requests
per hour from only a slightly smaller number of IPs. And in the middle,
you also have legitimate users on IPs from the same net block.
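
As a rough illustration of why per-IP thresholds do not help here, the
following is a minimal sketch. All numbers, addresses, and the
flag_heavy_ips helper are hypothetical and chosen only for illustration;
they are not measurements from Sourceware or any other host.

```python
from collections import Counter


def flag_heavy_ips(request_ips, per_ip_limit):
    """Return the set of IPs whose request count exceeds per_ip_limit."""
    counts = Counter(request_ips)
    return {ip for ip, n in counts.items() if n > per_ip_limit}


# Hypothetical hour of traffic: 100,000 distinct scraper IPs making
# 3 requests each, plus one regular user browsing 20 pages.
scraper_ips = [
    f"10.{i // 65536}.{(i // 256) % 256}.{i % 256}" for i in range(100_000)
]
requests = [ip for ip in scraper_ips for _ in range(3)]
requests += ["192.0.2.1"] * 20  # the regular user (documentation address)

total = len(requests)
flagged = flag_heavy_ips(requests, per_ip_limit=10)

print(total)    # total request volume seen by the server
print(flagged)  # IPs that a per-IP threshold would block
```

In this made-up scenario the only address over the threshold is the
regular user, while every scraper IP stays far below it even though
together they account for almost all of the load.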

-- 
Aurelien Jarno                          GPG: 4096R/1DDD8C9B
aurelien@aurel32.net                     http://aurel32.net

