Groups > comp.lang.python > #61622 > unrolled thread

Re: Movie (MPAA) ratings and Python?

Started by: Dan Stromberg <drsalists@gmail.com>
First post: 2013-12-11 15:07 -0800
Last post: 2013-12-12 08:56 -0500
Articles: 14 (10 participants)

This discussion starts older than the indexed window; earlier articles aren't shown. The article labeled Started by below is the oldest one visible, not the original post.


Contents

  Re: Movie (MPAA) ratings and Python? Dan Stromberg <drsalists@gmail.com> - 2013-12-11 15:07 -0800
    Re: Movie (MPAA) ratings and Python? Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-11 23:24 +0000
      Re: Movie (MPAA) ratings and Python? Dan Stromberg <drsalists@gmail.com> - 2013-12-11 15:39 -0800
      Re: Movie (MPAA) ratings and Python? Ned Batchelder <ned@nedbatchelder.com> - 2013-12-11 20:01 -0500
      Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?) Ben Finney <ben+python@benfinney.id.au> - 2013-12-12 12:12 +1100
        Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?) rusi <rustompmody@gmail.com> - 2013-12-11 19:23 -0800
          Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?) Chris Angelico <rosuav@gmail.com> - 2013-12-12 15:27 +1100
          Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?) Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2013-12-12 11:05 +0000
            Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?) Steve Hayes <hayesstw@telkomsa.net> - 2013-12-12 15:36 +0200
      Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?) Ian Kelly <ian.g.kelly@gmail.com> - 2013-12-11 18:31 -0700
      Re: Movie (MPAA) ratings and Python? Ian Kelly <ian.g.kelly@gmail.com> - 2013-12-11 18:37 -0700
      Re: Movie (MPAA) ratings and Python? Dan Stromberg <drsalists@gmail.com> - 2013-12-11 19:52 -0800
      Re: Movie (MPAA) ratings and Python? Michael Torrie <torriem@gmail.com> - 2013-12-11 23:22 -0700
      Re: Movie (MPAA) ratings and Python? Dave Angel <davea@davea.name> - 2013-12-12 08:56 -0500

#61622 — Re: Movie (MPAA) ratings and Python?

From: Dan Stromberg <drsalists@gmail.com>
Date: 2013-12-11 15:07 -0800
Subject: Re: Movie (MPAA) ratings and Python?
Message-ID: <mailman.3936.1386803257.18130.python-list@python.org>

[Multipart message; attachments visible in raw view]

On Wed, Dec 11, 2013 at 10:35 AM, Ned Batchelder <ned@nedbatchelder.com> wrote:

> On 12/10/13 6:50 PM, Dan Stromberg wrote:
> > Now the question becomes: Why did chardet tell me it was windows-1255?  :)
>
> It probably told you it was Windows-1252 (I'm assuming the last 5 is a
> typo).
>
> Windows-1252 is a super-set of ISO-8859-1, so any text that is correct
> ISO-8859-1 is also correct Windows-1252.  In addition, it's not uncommon to
> find text marked as ISO-8859-1 that in fact has characters that make it
> Windows-1252.
>

 $ chardet mpaa-ratings-reasons.list
mpaa-ratings-reasons.list: windows-1255 (confidence: 0.97)

I'm aware that chardet is playing guessing games, though one would hope it
would guess well most of the time, and give a reasonable confidence rating.
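
The superset relationship Ned describes in the quoted text can be checked against Python's own codec tables. This is a minimal sketch that assumes only the standard `latin-1` and `cp1252` codecs; the variable names are illustrative and do not come from the thread.

```python
# Bytes 0xA0-0xFF: the printable upper half that both encodings share.
upper_half = bytes(range(0xA0, 0x100))
same_upper_half = upper_half.decode("latin-1") == upper_half.decode("cp1252")

# Byte 0x80 is one place where they differ: ISO-8859-1 maps it to a C1
# control character, while Windows-1252 assigns it the euro sign.
as_latin1 = b"\x80".decode("latin-1")
as_cp1252 = b"\x80".decode("cp1252")

print(same_upper_half)
print(repr(as_latin1), repr(as_cp1252))
```

In other words, the two encodings agree on printable text and differ only in the 0x80-0x9F range, which is where text mislabelled as ISO-8859-1 tends to give itself away.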


#61627

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-12-11 23:24 +0000
Message-ID: <52a8f410$0$29992$c3e8da3$5496439d@news.astraweb.com>
In reply to: #61622

On Wed, 11 Dec 2013 15:07:35 -0800, Dan Stromberg wrote:

>  $ chardet mpaa-ratings-reasons.list
> mpaa-ratings-reasons.list: windows-1255 (confidence: 0.97)
> 
> I'm aware that chardet is playing guessing games, though one would hope
> it would guess well most of the time, and give a reasonable confidence
> rating. 

What reason do you have for thinking that Windows-1255 isn't a reasonable 
guess? If the bulk of the text is Latin-1 except perhaps for one or two 
Hebrew characters (or what chardet thinks are Hebrew characters), it may 
actually be a reasonable guess.

If it is a poor guess, perhaps you ought to report it to the chardet 
maintainers as a good example of a poor guess.


By the way, this forum is a text-only newsgroup and so-called "Rich 
Text" (actually HTML) posts are frowned upon because most people don't 
appreciate having to read gunk like this:

> <div dir="ltr"><br><div class="gmail_extra"><div
> class="gmail_quote"> ... <br>
> <blockquote class="gmail_quote" style="margin:0px 0px 0px
> 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div
> class="im"> ... <br></div></div></div></div>

If you can, would you please turn off rich text posting when you post 
here please?

Thank you.



-- 
Steven


#61629

From: Dan Stromberg <drsalists@gmail.com>
Date: 2013-12-11 15:39 -0800
Message-ID: <mailman.3940.1386805193.18130.python-list@python.org>
In reply to: #61627

[Multipart message; attachments visible in raw view]

On Wed, Dec 11, 2013 at 3:24 PM, Steven D'Aprano <
steve+comp.lang.python@pearwood.info> wrote:

> On Wed, 11 Dec 2013 15:07:35 -0800, Dan Stromberg wrote:
>
> >  $ chardet mpaa-ratings-reasons.list
> > mpaa-ratings-reasons.list: windows-1255 (confidence: 0.97)
> >
> > I'm aware that chardet is playing guessing games, though one would hope
> > it would guess well most of the time, and give a reasonable confidence
> > rating.
>
> What reason do you have for thinking that Windows-1255 isn't a reasonable
> guess? If the bulk of the text is Latin-1 except perhaps for one or two
> Hebrew characters (or what chardet thinks are Hebrew characters), it may
> actually be a reasonable guess.
>

I get a traceback if I try to read the file as Windows-1255.  I don't get a
traceback if I read it as ISO-8859-1.


> If it is a poor guess, perhaps you ought to report it to the chardet
> maintainers as a good example of a poor guess.
>
I was considering that, and may do so.

I've also been wondering if ISO-8859-1 is just an octet-oriented codec, so
it'll read about anything.  There are clearly non-7-bit-ASCII characters in
the file that look like line noise in an mrxvt.
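
The hunch about ISO-8859-1 reading "about anything" is easy to confirm: the `latin-1` codec maps each byte value directly to the code point with the same number, so no byte sequence can fail to decode. A small sketch, with illustrative variable names:

```python
# Every possible byte value, 0x00 through 0xFF.
every_byte = bytes(range(256))

# Decoding as ISO-8859-1 never raises: byte N becomes code point N.
text = every_byte.decode("latin-1")
code_points = [ord(ch) for ch in text]

# Encoding back gives the original bytes, so the round trip is lossless.
round_trip = text.encode("latin-1")

print(code_points == list(range(256)))
print(round_trip == every_byte)
```

A clean decode as ISO-8859-1 therefore says nothing about whether the file really is ISO-8859-1; it only means the bytes were accepted.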

> By the way, this forum is a text-only newsgroup and so-called "Rich
> Text" (actually HTML) posts are frowned upon because most people don't
> appreciate having to read gunk like this:
>
> > <div dir="ltr"><br><div class="gmail_extra"><div
> > class="gmail_quote"> ... <br>
> > <blockquote class="gmail_quote" style="margin:0px 0px 0px
> > 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex"><div
> > class="im"> ... <br></div></div></div></div>
>
> If you can, would you please turn off rich text posting when you post
> here please?
>
> Thank you.
>
Apologies.  I didn't realize gmail was doing this.   I had thought it would
only do so if I used the formatting options in the composer, but perhaps it
does so even when just typing text.

I formerly used MH; are you using MH?  There isn't a lot of e-mail programs
that don't do HTML anymore.  Even mutt can do HTML with very slight
configuration; it's actually quite powerful and ISTR it can do MH folders.

I found a "remove formatting" button in gmail's composer, and used it on
this message.  Does this message look like plain text?

I'm not really prepared to give up gmail's quick searching; I used to index
my e-mails using pyindex and dovecot, but happily I don't need to anymore.


#61633

From: Ned Batchelder <ned@nedbatchelder.com>
Date: 2013-12-11 20:01 -0500
Message-ID: <mailman.3943.1386810121.18130.python-list@python.org>
In reply to: #61627

On 12/11/13 6:39 PM, Dan Stromberg wrote:
>
> On Wed, Dec 11, 2013 at 3:24 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info
> <mailto:steve+comp.lang.python@pearwood.info>> wrote:
>
>     On Wed, 11 Dec 2013 15:07:35 -0800, Dan Stromberg wrote:
>
>      >  $ chardet mpaa-ratings-reasons.list
>      > mpaa-ratings-reasons.list: windows-1255 (confidence: 0.97)
>      >
>      > I'm aware that chardet is playing guessing games, though one would hope
>      > it would guess well most of the time, and give a reasonable confidence
>      > rating.
>
>     What reason do you have for thinking that Windows-1255 isn't a reasonable
>     guess? If the bulk of the text is Latin-1 except perhaps for one or two
>     Hebrew characters (or what chardet thinks are Hebrew characters), it may
>     actually be a reasonable guess.
>
>
> I get a traceback if I try to read the file as Windows-1255.  I don't
> get a traceback if I read it as ISO-8859-1.
>
>     If it is a poor guess, perhaps you ought to report it to the chardet
>     maintainers as a good example of a poor guess.
>
> I was considering that, and may do so.
>
> I've also been wondering if ISO-8859-1 is just an octet-oriented codec,
> so it'll read about anything.  There are clearly non-7-bit-ASCII
> characters in the file that look like line noise in an mrxvt.

Both ISO-8859-1 and Windows-1255 are octet-oriented, I don't see why one 
would raise an exception when the other didn't.  Unless the exception 
isn't on the decode, but instead on your attempt to output the result. 
Can you show the full traceback you're seeing?

-- 
Ned Batchelder, http://nedbatchelder.com


#61636 — Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)

From: Ben Finney <ben+python@benfinney.id.au>
Date: 2013-12-12 12:12 +1100
Subject: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)
Message-ID: <mailman.3945.1386810776.18130.python-list@python.org>
In reply to: #61627

Dan Stromberg <drsalists@gmail.com> writes:

> On Wed, Dec 11, 2013 at 3:24 PM, Steven D'Aprano <
> steve+comp.lang.python@pearwood.info> wrote:
> > By the way, this forum is a text-only newsgroup and so-called "Rich
> > Text" (actually HTML) posts are frowned upon […]
> > If you can, would you please turn off rich text posting when you post
> > here please?
>
> Apologies. I didn't realize gmail was doing this. I had thought it
> would only do so if I used the formatting options in the composer, but
> perhaps it does so even when just typing text.

Thanks for taking measures to send messages in plain text.

> I found a "remove formatting" button in gmail's composer, and used it
> on this message. Does this message look like plain text?

Still sent with an HTML part, so some other change must be needed to
disable that.

> There isn't a lot of e-mail programs that don't do HTML anymore.

Many of the better mail clients allow the user to explicitly stop
rendering HTML (but still have it available, as Steven points out).

Disabling HTML in messages is a good idea: HTML rarely adds anything
useful to a message in a discussion forum, but it can cause the mail
program to do actions unwanted by the user (e.g. fetch images from
elsewhere, or run ECMAScript, or invoke HTML rendering bugs).

Plain text doesn't have those problems, which is why it's more courteous
to stop sending HTML messages in most cases.

Because it's inefficient to poll many recipients for whether their
system can work with HTML messages, avoiding sending HTML altogether is
especially advisable with multiple recipients, such as discussion
forums.

> I'm not really prepared to give up gmail's quick searching; I used to
> index my e-mails using pyindex and dovecot, but happily I don't need
> to anymore.

You will be pleased to know, then, that ‘notmuch’ is a client-side
system providing very quick email indexing and searching
<URL:http://notmuchmail.org/>.

Notmuch is available directly from several operating systems (e.g.
Debian) or install it yourself. It works with numerous existing mail
clients, and brings the significant advantage of organising one's email
by search, not by exclusive folders.

-- 
 \        “Intellectual property is to the 21st century what the slave |
  `\                              trade was to the 16th.” —David Mertz |
_o__)                                                                  |
Ben Finney


#61654 — Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)

From: rusi <rustompmody@gmail.com>
Date: 2013-12-11 19:23 -0800
Subject: Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)
Message-ID: <c2985805-4e08-43f7-9b6b-5298f4913f4f@googlegroups.com>
In reply to: #61636

On Thursday, December 12, 2013 6:42:42 AM UTC+5:30, Ben Finney wrote:
> Dan Stromberg writes:

> > I found a "remove formatting" button in gmail's composer, and used it
> > on this message. Does this message look like plain text?

> Still sent with an HTML part, so some other change must be needed to
> disable that.

> > There isn't a lot of e-mail programs that don't do HTML anymore.

> Many of the better mail clients allow the user to explicitly stop
> rendering HTML (but still have it available, as Steven points out).

> Disabling HTML in messages is a good idea: HTML rarely adds anything
> useful to a message in a discussion forum, but it can cause the mail
> program to do actions unwanted by the user (e.g. fetch images from
> elsewhere, or run ECMAScript, or invoke HTML rendering bugs).

When you click on send/reply in gmail, there's a small down-triangle
next to the dustbin, inside which you will find a plain text option

The problem is that then your other mails (may) become plain text and
your friends/recipients will wonder whether you've entered a time-machine
and gone back to 1990!!

Many people find it simpler to just use Google groups.  It also has its
problems (as do all methods!) but in sum it's the easiest option to use.


#61660 — Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)

From: Chris Angelico <rosuav@gmail.com>
Date: 2013-12-12 15:27 +1100
Subject: Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)
Message-ID: <mailman.3959.1386822428.18130.python-list@python.org>
In reply to: #61654

On Thu, Dec 12, 2013 at 2:23 PM, rusi <rustompmody@gmail.com> wrote:
> When you click on send/reply in gmail, there's a small down-triangle
> next to the dustbin, inside which you will find a plain text option
>
> The problem is that then your other mails (may) become plain text and
> your friends/recipients will wonder whether you've entered a time-machine
> and gone back to 1990!!

Or maybe they'll wonder if you've just magically changed your font
settings to be exactly what they most want to read, because you're no
longer sending text that's too large / too small for them to
comfortably read.

ChrisA


#61696 — Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)

From: Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date: 2013-12-12 11:05 +0000
Subject: Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)
Message-ID: <52a9987e$0$29992$c3e8da3$5496439d@news.astraweb.com>
In reply to: #61654

On Wed, 11 Dec 2013 19:23:39 -0800, rusi wrote:

> The problem is that then your other mails (may) become plain text and
> your friends/recipients will wonder whether you've entered a
> time-machine and gone back to 1990!!

Not everything that's changed since 1990 has been an improvement.


> Many people find it simpler to just use Google groups.  It also has its
> problems (as do all methods!) but in sum it's the easiest option to use.

How ironic. After mocking those of us who prefer to send and receive 
plain text, you then recommend that people use a delivery mechanism which 
sends plain text.



-- 
Steven


#61707 — Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)

From: Steve Hayes <hayesstw@telkomsa.net>
Date: 2013-12-12 15:36 +0200
Subject: Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)
Message-ID: <eueja9h2qdj0uo8tbihh62u7ibh673lo10@4ax.com>
In reply to: #61696

On 12 Dec 2013 11:05:35 GMT, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:

>On Wed, 11 Dec 2013 19:23:39 -0800, rusi wrote:
>
>> The problem is that then your other mails (may) become plain text and
>> your friends/recipients will wonder whether you've entered a
>> time-machine and gone back to 1990!!
>
>Not everything that's changed since 1990 has been an improvement.

And vice versa. 


-- 
Steve Hayes from Tshwane, South Africa
Web:  http://www.khanya.org.za/stevesig.htm
Blog: http://khanya.wordpress.com
E-mail - see web page, or parse: shayes at dunelm full stop org full stop uk


#61639 — Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2013-12-11 18:31 -0700
Subject: Re: Disable HTML in forum messages (was: Movie (MPAA) ratings and Python?)
Message-ID: <mailman.3948.1386811939.18130.python-list@python.org>
In reply to: #61627

On Wed, Dec 11, 2013 at 6:12 PM, Ben Finney <ben+python@benfinney.id.au> wrote:
>> I found a "remove formatting" button in gmail's composer, and used it
>> on this message. Does this message look like plain text?
>
> Still sent with an HTML part, so some other change must be needed to
> disable that.

Check the default formatting in the settings, or perhaps instead of
"no signature" there is an empty signature selected that is adding
formatting?

>> There isn't a lot of e-mail programs that don't do HTML anymore.
>
> Many of the better mail clients allow the user to explicitly stop
> rendering HTML (but still have it available, as Steven points out).

Unfortunately, Gmail has recently moved away from the explicit toggle
and now only has that "Remove formatting" command, which will remove
any existing formatting from the draft but won't necessarily prevent
it from accidentally slipping back in.


#61641

From: Ian Kelly <ian.g.kelly@gmail.com>
Date: 2013-12-11 18:37 -0700
Message-ID: <mailman.3949.1386812304.18130.python-list@python.org>
In reply to: #61627

On Wed, Dec 11, 2013 at 6:01 PM, Ned Batchelder <ned@nedbatchelder.com> wrote:
>> I've also been wondering if ISO-8859-1 is just an octet-oriented codec,
>> so it'll read about anything.  There are clearly non-7-bit-ASCII
>> characters in the file that look like line noise in an mrxvt.
>
>
> Both ISO-8859-1 and Windows-1255 are octet-oriented, I don't see why one
> would raise an exception when the other didn't.  Unless the exception isn't
> on the decode, but instead on your attempt to output the result. Can you
> show the full traceback you're seeing?

There are gaps in CP 1255 (see
http://en.wikipedia.org/wiki/Code_page_1255), so I presume the file
contains one or more of those octets that don't map to anything at
all.
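
The gaps can be listed directly by asking each codec to decode every single byte value in turn. A rough sketch using only the standard library; `undefined_bytes` is a hypothetical helper written for this illustration, not something from the thread.

```python
def undefined_bytes(encoding):
    """Return the byte values that a single-byte codec refuses to decode."""
    missing = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            missing.append(value)
    return missing


cp1255_gaps = undefined_bytes("cp1255")
latin1_gaps = undefined_bytes("latin-1")

print([hex(b) for b in cp1255_gaps])  # the unassigned slots in CP 1255
print(latin1_gaps)                    # [] because ISO-8859-1 has no gaps
```

If the file contains even one byte from the first list, a strict cp1255 decode fails, while the ISO-8859-1 decode of the same bytes goes through.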


#61657

From: Dan Stromberg <drsalists@gmail.com>
Date: 2013-12-11 19:52 -0800
Message-ID: <mailman.3957.1386820342.18130.python-list@python.org>
In reply to: #61627

[Multipart message; attachments visible in raw view]

On Wed, Dec 11, 2013 at 5:01 PM, Ned Batchelder <ned@nedbatchelder.com> wrote:

> On 12/11/13 6:39 PM, Dan Stromberg wrote:
>
>>
>> On Wed, Dec 11, 2013 at 3:24 PM, Steven D'Aprano
>> <steve+comp.lang.python@pearwood.info
>> <mailto:steve+comp.lang.python@pearwood.info>> wrote:
>>
>>     On Wed, 11 Dec 2013 15:07:35 -0800, Dan Stromberg wrote:
>>
>>      >  $ chardet mpaa-ratings-reasons.list
>>      > mpaa-ratings-reasons.list: windows-1255 (confidence: 0.97)
>>      >
>>      > I'm aware that chardet is playing guessing games, though one would hope
>>      > it would guess well most of the time, and give a reasonable confidence
>>      > rating.
>>
>>     What reason do you have for thinking that Windows-1255 isn't a reasonable
>>     guess? If the bulk of the text is Latin-1 except perhaps for one or two
>>     Hebrew characters (or what chardet thinks are Hebrew characters), it may
>>     actually be a reasonable guess.
>>
>>
>> I get a traceback if I try to read the file as Windows-1255.  I don't
>> get a traceback if I read it as ISO-8859-1.
>>
>>     If it is a poor guess, perhaps you ought to report it to the chardet
>>     maintainers as a good example of a poor guess.
>>
>> I was considering that, and may do so.
>>
>> I've also been wondering if ISO-8859-1 is just an octet-oriented codec,
>> so it'll read about anything.  There are clearly non-7-bit-ASCII
>> characters in the file that look like line noise in an mrxvt.
>>
>
> Both ISO-8859-1 and Windows-1255 are octet-oriented, I don't see why one
> would raise an exception when the other didn't.  Unless the exception isn't
> on the decode, but instead on your attempt to output the result. Can you
> show the full traceback you're seeing?
>

$ ./movie-ratings
Traceback (most recent call last):
  File "./movie-ratings", line 85, in <module>
    main()
  File "./movie-ratings", line 68, in main
    ratings = get_ratings('/home/dstromberg/src/home-svn/movie-ratings/trunk/mpaa-ratings-reasons.list')
  File "./movie-ratings", line 52, in get_ratings
    for line in ratings_file:
  File "/usr/local/cpython-3.3/lib/python3.3/encodings/cp1255.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0xfc in position 1225: character maps to <undefined>

BTW, other than satisfying our respective curiosities, I consider this
project finished.  It's probably not getting ratings for my entire movie
collection, but it is getting them for a significant fraction, which is all
I was really looking for.  Now I know which ones are rated PG, so I can
decide whether to let my 8 year old watch them.

This is with cpython-3.3.

Thanks.  ^_^
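
The traceback shows a strict cp1255 decode failing on byte 0xfc. One way to cope with a detector's wrong guess is to try the guessed encoding first and fall back to ISO-8859-1, which accepts any byte. This is only a sketch: `decode_with_fallback` and the sample bytes are made up for illustration, and the actual contents of the ratings file are not known from the thread.

```python
def decode_with_fallback(data, encodings=("cp1255", "latin-1")):
    """Try each encoding in order and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # this guess was wrong; try the next candidate
    raise ValueError("none of the candidate encodings could decode the data")


# Hypothetical sample containing 0xfc, the byte named in the traceback.
sample = b"M\xfcnchen"
text, used = decode_with_fallback(sample)
print(text, used)
```

When reading a file line by line, as the script in the traceback does, `open(path, encoding=..., errors="replace")` is another option; it keeps the guessed encoding and substitutes U+FFFD for any undecodable byte instead of raising.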


#61669

From: Michael Torrie <torriem@gmail.com>
Date: 2013-12-11 23:22 -0700
Message-ID: <mailman.3965.1386829372.18130.python-list@python.org>
In reply to: #61627

On 12/11/2013 04:39 PM, Dan Stromberg wrote:
>> If you can, would you please turn off rich text posting when you post
>> here please?

> Apologies.  I didn't realize gmail was doing this.   I had thought it would
> only do so if I used the formatting options in the composer, but perhaps it
> does so even when just typing text.

From what I can see, gmail is producing a multipart message that has a
plain text part and an html part.  As far as I know that's RFC-compliant,
and it's what gmail always does.
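
The difference between a plain-text post and the two-part structure described here can be seen with the standard `email` package. This is a sketch, not a reconstruction of what gmail sends: `build_message` and `content_types` are hypothetical helpers, and the message bodies are placeholders.

```python
from email.message import EmailMessage


def build_message(body, html=None):
    """Build a text/plain message, optionally with an HTML alternative."""
    msg = EmailMessage()
    msg["Subject"] = "Re: Movie (MPAA) ratings and Python?"
    msg.set_content(body)  # the text/plain part
    if html is not None:
        # Turns the message into multipart/alternative with both parts.
        msg.add_alternative(html, subtype="html")
    return msg


def content_types(msg):
    """List the MIME type of the message and of each part inside it."""
    return [part.get_content_type() for part in msg.walk()]


plain_only = build_message("Does this message look like plain text?")
with_html = build_message(
    "Does this message look like plain text?",
    html='<div dir="ltr">Does this message look like plain text?</div>',
)

print(content_types(plain_only))  # ['text/plain']
print(content_types(with_html))   # ['multipart/alternative', 'text/plain', 'text/html']

# A reader that prefers plain text can pick that part out explicitly.
plain_part = with_html.get_body(preferencelist=("plain",))
print(plain_part.get_content())
```

A message of the second kind is what the thread calls "sent with an HTML part", even when the visible text has no formatting at all.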


#61710

From: Dave Angel <davea@davea.name>
Date: 2013-12-12 08:56 -0500
Message-ID: <mailman.3989.1386856520.18130.python-list@python.org>
In reply to: #61627

On Wed, 11 Dec 2013 23:22:14 -0700, Michael Torrie <torriem@gmail.com> wrote:
> From what I can see gmail is producing a multipart message that has a
> plain text part and an html part.  This is what gmail normally does and
> as far as I know it's RFC-compliant and that's what gmail always does.

"Always does" doesn't mean it's a good idea on a text newsgroup. 

Very often the pretty text in the html part is mangled in the text 
part. Most often this is just indentation,  but for Python that's a 
biggie. It also means that we don't all see the same thing. 

Including both makes the download slower and more expensive. 

Some text newsreaders refuse to show anything if there's an html 
part.  Mine (groundhog on android) apparently shows the text part if 
it follows the html part.

-- 
DaveA

