From: Jonathan Wakely via Gdb <gdb@sourceware.org>
To: Guinevere Larsen <guinevere@redhat.com>
Cc: Mark Wielaard <mark@klomp.org>,
binutils@sourceware.org, elfutils-devel@sourceware.org,
gcc@gcc.gnu.org, gdb@sourceware.org, libc-alpha@sourceware.org,
libabigail@sourceware.org, newlib@sourceware.org,
overseers@sourceware.org
Subject: Re: scraperbot protection - Patchwork and Bunsen behind Anubis
Date: Tue, 22 Apr 2025 14:06:08 +0100
Message-ID: <CAH6eHdTz4ybqj6mXVaK4ONaaZA0wFj9tKBAXJPc43M_Ox_NE2w@mail.gmail.com>
In-Reply-To: <138abdfe-29db-45c8-a9e4-e7210e847ce7@redhat.com>

On Tue, 22 Apr 2025 at 13:36, Guinevere Larsen via Gcc <gcc@gcc.gnu.org> wrote:
>
> On 4/21/25 12:59 PM, Mark Wielaard wrote:
> > Hi hackers,
> >
> > TLDR; When using https://patchwork.sourceware.org or Bunsen
> > https://builder.sourceware.org/testruns/ you might now have to enable
> > javascript. This should not impact any scripts, just browsers (or bots
> > pretending to be browsers). If it does cause trouble, please let us
> > know. If this works out we might also "protect" bugzilla, gitweb,
> > cgit, and the wikis this way.
> >
> > We don't like to have to do this, but as some of you might have noticed
> > Sourceware has been fighting the new AI scraperbots since the start of the
> > year. We are not alone in this.
> >
> > https://lwn.net/Articles/1008897/
> > https://arstechnica.com/ai/2025/03/devs-say-ai-crawlers-dominate-traffic-forcing-blocks-on-entire-countries/
> >
> > We have tried to isolate services more and block various ip-blocks
> > that were abusing the servers. But that has helped only so much.
> > Unfortunately the scraper bots are using lots of ip addresses
> > (probably by installing "free" VPN services that use normal user
> > connections as exit point) and pretending to be common
> > browsers/agents. We seem to have to make access to some services
> > depend on solving a javascript challenge.
>
> Jan Wildeboer, on the fediverse, has a pretty interesting lead on how AI
> scrapers might be doing this:
> https://social.wildeboer.net/@jwildeboer/114360486804175788 (this is the
> last post in the thread because it was hard to actually follow the
> thread given the number of replies, please go all the way up and read
> all 8 posts).
>
> Essentially, there's a library developer that pays developers to just
> "include this library and a few more lines in your TOS". This library
> then allows the app to sell the end-user's bandwidth to clients of the
> library developer, allowing them to make requests. This is how big
> companies are managing to have so many IP addresses, so many of those
> being residential IP addresses, and it also means that by blocking those
> IP addresses we will be - necessarily - blocking real user traffic to
> our platforms.

It seems to me that blocking real users *who are running these shady
apps* is perfectly reasonable.

They might not realise it, but those users are part of the problem. If
we block them, maybe they'll be incentivised to stop using the shady
apps. And if users stop using those apps, maybe those app developers
will stop bundling the libraries that piggyback on users' bandwidth.

>
> I'm happy to see that Sourceware is moving to a more comprehensive
> solution, and if this is successful, I'd suggest that we also try to do
> that to the forgejo instance, and remove the IPs blocked because of this
> scraping.

For now, maybe. This thread already explained how to get around Anubis
by changing the User-Agent string - how long will it be until these
peer-to-business network libraries figure that out?