Re: Emigration from Usenet [was: Re: PTD was the most-respected of the AUE regulars ...]

Message-ID <66a39428@news.ausics.net>
From Computer Nerd Kev <not@telling.you.invalid>
Subject Re: Emigration from Usenet [was: Re: PTD was the most-respected of the AUE regulars ...]
Newsgroups comp.misc
References (9 earlier) <v7tvu0$2c8e9$1@dont-email.me> <66a2d000@news.ausics.net> <20240726013343.02805fe30e4853cf7cd40797@gmail.moc> <66a31b29@news.ausics.net> <cceeb788-c131-4a84-588d-044e917fa810@example.net>
Date 2024-07-26 22:18 +1000
Organization Ausics - https://newsgroups.ausics.net

D <nospam@example.net> wrote:
> 
> Read only sounds very simple. I usually scrape in python with the requests 
> library and the beautiful soup library. A simple scraping loop could look 
> like this (modify per web board of course):
> 
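> import requests
> from bs4 import BeautifulSoup
> 
> email_body = ""
> for page in range(100, 150):
>     html = requests.get("https://www.svt.se/text-tv/" + str(page))
>     soup = BeautifulSoup(html.text, 'html.parser')
>     # Class name is specific to this site; adjust per web board.
>     div_bs4 = soup.find('div', {"class": "Content_screenreaderOnly__3Cnkp"})
>     try:
>         email_body += div_bs4.string + "\n"
>     except (AttributeError, TypeError):
>         pass  # div missing, or it has no single string; skip this page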
> 
> So basically a range of pages, then loop over those pages,

You need to sync it to the messages in the forum index though.
Otherwise, when they get a spam flood that the admin deletes, or
the thread counter jumps around for some other reason, the scraper
is stuck looking for the next 25 threads after the last one it saw
when it needs to jump forwards 150. I guess you could interpret the
deleted thread pages and crawl through them, but then the crawler
needs to remember the gap that was left, so it doesn't forget to
check for new posts in the threads from before the spam flood.

So even if it's possible to iterate over threads that way on all
forum platforms (which I'm not sure about), I think it would be
more reliable in the long run to parse the index pages to determine
which threads to retrieve. Also less risk of getting blocked by web
servers for too many requests.
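
A rough sketch of the index-driven approach, using only regular
expressions. The index markup, the THREAD_RE pattern, and the
threads_to_fetch name are made up for illustration; a real forum
would need its own pattern. It only fetches threads that are new or
have gained replies since the last run, so gaps in the thread
numbering don't matter:

```python
import re

# Hypothetical index markup (an assumption, not any real forum's HTML):
#   <a href="/thread/123">Title</a> ... <span class="replies">4</span>
THREAD_RE = re.compile(
    r'<a href="/thread/(\d+)">.*?</a>.*?<span class="replies">(\d+)</span>',
    re.DOTALL,
)

def threads_to_fetch(index_html, seen):
    """Return IDs of threads that are new or have more replies than before.

    `seen` maps thread ID -> reply count recorded on the previous run.
    It is updated in place so it can be saved for the next run.
    """
    todo = []
    for thread_id, replies in THREAD_RE.findall(index_html):
        thread_id, replies = int(thread_id), int(replies)
        if seen.get(thread_id, -1) < replies:
            todo.append(thread_id)
            seen[thread_id] = replies
    return todo

index_html = """
<li><a href="/thread/101">Old thread</a> <span class="replies">3</span></li>
<li><a href="/thread/250">After the spam gap</a> <span class="replies">0</span></li>
"""
seen = {101: 2}
print(threads_to_fetch(index_html, seen))  # [101, 250]
```

Only the index page is requested on each run, plus whichever threads
actually changed, which also keeps the request count down.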

But thanks for the example. I'm not really sure whether an HTML
parser library would be helpful or just a pointless extra layer
of complexity. So far I've just used regular expressions for
scraping webpages. I was thinking along the lines of a template
system defining strings that indicate the start/end of fields (and
any key features in between), ideally allowing new forum parsers to
be added without needing to touch the code. There must be things
like that around already...
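
As a minimal sketch of what such a template system might look like:
each field is described by a pair of start/end marker strings, and
the extractor just walks the page with plain string searches. The
extract_fields function, the EXAMPLE_TEMPLATE dict, and the markup
are all hypothetical; in practice the template could live in a
per-forum config file rather than in the code:

```python
def extract_fields(text, template):
    """Pull each field out of `text` using (start, end) marker strings.

    Fields are searched in template order, each search starting where
    the previous match ended. A field whose markers are not found is
    set to None.
    """
    result = {}
    pos = 0
    for name, (start, end) in template.items():
        begin = text.find(start, pos)
        if begin == -1:
            result[name] = None
            continue
        begin += len(start)
        finish = text.find(end, begin)
        if finish == -1:
            result[name] = None
            continue
        result[name] = text[begin:finish].strip()
        pos = finish + len(end)
    return result

# Hypothetical template for one forum platform (markers are assumptions).
EXAMPLE_TEMPLATE = {
    "author": ('<span class="author">', "</span>"),
    "date": ('<span class="date">', "</span>"),
    "body": ('<div class="post-body">', "</div>"),
}

page = (
    '<span class="author">kev</span>'
    '<span class="date">2024-07-26</span>'
    '<div class="post-body"> Hello </div>'
)
print(extract_fields(page, EXAMPLE_TEMPLATE))
# {'author': 'kev', 'date': '2024-07-26', 'body': 'Hello'}
```

Supporting another forum would then mean writing a new template
rather than new code, though nested or repeated markup would need
something smarter than this.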

Perhaps I'm determined to make it hard for myself, but if it broke
all the time and was complicated to fix, then that would be worse.

Anyhow, now that I've got onto thinking about that, I've wasted all
the time I was actually going to spend finishing a PHP static site
generator to format data that I scraped off a website last week.
That seemed simple at first too...

-- 
__          __
#_ < |\| |< _#
