Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #85369 > unrolled thread

Python 3.x stuffing utf-8 into SQLite db

Started bySkip Montanaro <skip.montanaro@gmail.com>
First post2015-02-08 21:44 -0600
Last post2015-02-09 19:41 +0000
Articles 8 — 6 participants

Back to article view | Back to comp.lang.python


Contents

  Python 3.x stuffing utf-8 into SQLite db Skip Montanaro <skip.montanaro@gmail.com> - 2015-02-08 21:44 -0600
    Re: Python 3.x stuffing utf-8 into SQLite db Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-02-09 15:51 +1100
      Re: Python 3.x stuffing utf-8 into SQLite db Skip Montanaro <skip.montanaro@gmail.com> - 2015-02-09 11:32 -0600
      Re: Python 3.x stuffing utf-8 into SQLite db Chris Angelico <rosuav@gmail.com> - 2015-02-10 04:41 +1100
      Re: Python 3.x stuffing utf-8 into SQLite db Zachary Ware <zachary.ware+pylist@gmail.com> - 2015-02-09 14:05 -0600
      Re: Python 3.x stuffing utf-8 into SQLite db Skip Montanaro <skip.montanaro@gmail.com> - 2015-02-09 18:36 -0600
        Re: Python 3.x stuffing utf-8 into SQLite db wxjmfauth@gmail.com - 2015-02-10 00:23 -0800
    Re: Python 3.x stuffing utf-8 into SQLite db mm0fmf <none@mailinator.com> - 2015-02-09 19:41 +0000

#85369 — Python 3.x stuffing utf-8 into SQLite db

FromSkip Montanaro <skip.montanaro@gmail.com>
Date2015-02-08 21:44 -0600
SubjectPython 3.x stuffing utf-8 into SQLite db
Message-ID<mailman.18555.1423453501.18130.python-list@python.org>

[Multipart message — attachments visible in raw view] — view raw

I am trying to process a CSV file using Python 3.5 (CPython tip as of a
week or so ago). According to chardet[1], the file is encoded as utf-8:

>>> s = open("data/meets-usms.csv", "rb").read()
>>> len(s)
562272
>>> import chardet
>>> chardet.detect(s)
{'encoding': 'utf-8', 'confidence': 0.99}

so I created the reader like so:

        rdr = csv.DictReader(open(csvfile, encoding="utf-8"))

This seems to work. The rows are read and records added to a SQLite3
database. When I go into sqlite3, I get what looks to be raw utf-8 on
output:

% LANG=en_US.UTF-8 sqlite3 topten.db
SQLite version 3.8.5 2014-08-15 22:37:57
Enter ".help" for usage hints.
sqlite> select * from swimmeet where meetname like '%Barracuda%';
sqlite> select count(*) from swimmeet;
0
sqlite> select count(*) from swimmeet;
4171
sqlite> select meetname from swimmeet where meetname like
'%Barracuda%Patrick%';
Anderson Barracudas St. Patrick's Day Swim Meet
Anderson Barracuda Masters - 2010 St. Patrick’s Day Swim Meet
Anderson Barracuda Masters 2011 St. Patrick’s Day Swim Meet
Anderson Barracuda Masters St. Patrick's Day Meet
Anderson Barracuda Masters St. Patrick's Day Meet 2014
Anderson Barracuda Masters 2015 St. Patrick’s Day Swim Meet

Note the wacky three bytes where the apostrophe in "St. Patrick's" should
be. The data came to me as an XLSX spreadsheet, which I dumped to CSV using
LibreOffice. That's how the character was encoded at that point.

I tweaked my CSV-to-SQLite script to print the meet name and id for those
meets with "Barracuda" and "Patrick" in their name:

                if dry_run or verbose:
                    if ("Barracuda" in row["MeetTitle"] and
                        "Patrick" in row["MeetTitle"]):
                        print("Insert", n, row["MeetTitle"], row["MeetID"])

When I run it, I see raw bytes instead of a properly rendered apostrophe:

% LANG=en_US.utf-8 python3.5 src/usmsmeets2db.py -v data/meets-usms.csv
topten.db
Insert 1173 Anderson Barracudas St. Patrick's Day Swim Meet 20090321ABMSTPY
Insert 1559 Anderson Barracuda Masters - 2010 St. Patrick’s Day Swim Meet
20100320CUDASY
Insert 1995 Anderson Barracuda Masters 2011 St. Patrick’s Day Swim Meet
20110319ANDERSY
Insert 3012 Anderson Barracuda Masters St. Patrick's Day Meet
20130316AndersY
Insert 3562 Anderson Barracuda Masters St. Patrick's Day Meet 2014
20140315ANDERSY
Insert 4114 Anderson Barracuda Masters 2015 St. Patrick’s Day Swim Meet
20150321AndersY
Read 4962 rows, inserted 4171 records

Why am I not seeing what I believe to be a non-ASCII apostrophe of some
sort properly printed? This is running on a Mac (Yosemite) in its Terminal
app, with its encoding preference set to utf-8. It appears just as shown
above, "a" with a caret, the Euro symbol, then the "TM" symbol. Have I
perhaps lost the properly encoded bytes somewhere, and now it's just
spewing the bogus bytes (mojibake)?

Thanks,

Skip

--
[1] https://pypi.python.org/pypi/chardet/2.3.0

[toc] | [next] | [standalone]


#85370

FromSteven D'Aprano <steve+comp.lang.python@pearwood.info>
Date2015-02-09 15:51 +1100
Message-ID<54d83ced$0$12895$c3e8da3$5496439d@news.astraweb.com>
In reply to#85369
Skip Montanaro wrote:

> sqlite> select meetname from swimmeet where meetname like
> '%Barracuda%Patrick%';
> Anderson Barracudas St. Patrick's Day Swim Meet
> Anderson Barracuda Masters - 2010 St. Patrick’s Day Swim Meet
> Anderson Barracuda Masters 2011 St. Patrick’s Day Swim Meet
> Anderson Barracuda Masters St. Patrick's Day Meet
> Anderson Barracuda Masters St. Patrick's Day Meet 2014
> Anderson Barracuda Masters 2015 St. Patrick’s Day Swim Meet
> 
> Note the wacky three bytes where the apostrophe in "St. Patrick's" should
> be. The data came to me as an XLSX spreadsheet, which I dumped to CSV
> using LibreOffice. That's how the character was encoded at that point.

The first question I would ask is whether SQLite's command shell supports 
UTF-8. I think it does, but aren't sure. The second question is, are you 
using Windows? If so, the Windows shell may be broken and mis-printing the 
string, or otherwise doing something weird:

http://stackoverflow.com/questions/13405071/sqlite-converts-all-unicode-
characters-into-ansi

Can you confirm that the strings appear correctly in Python before you put 
them into sqlite?


-- 
Steve

[toc] | [prev] | [next] | [standalone]


#85383

FromSkip Montanaro <skip.montanaro@gmail.com>
Date2015-02-09 11:32 -0600
Message-ID<mailman.18567.1423503158.18130.python-list@python.org>
In reply to#85370
On Sun, Feb 8, 2015 at 10:51 PM, Steven D'Aprano
<steve+comp.lang.python@pearwood.info> wrote:
> The second question is, are you
> using Windows?

No, I'm on a Mac (as, I think I indicated in my original note). All
transformations occurred on a Mac. LibreOffice spit out a CSV file
(with those three odd bytes). My script sucked in the CSV file and
inserted data into my SQLite db.

Skip

[toc] | [prev] | [next] | [standalone]


#85385

FromChris Angelico <rosuav@gmail.com>
Date2015-02-10 04:41 +1100
Message-ID<mailman.18569.1423503703.18130.python-list@python.org>
In reply to#85370
On Tue, Feb 10, 2015 at 4:32 AM, Skip Montanaro
<skip.montanaro@gmail.com> wrote:
> On Sun, Feb 8, 2015 at 10:51 PM, Steven D'Aprano
> <steve+comp.lang.python@pearwood.info> wrote:
>> The second question is, are you
>> using Windows?
>
> No, I'm on a Mac (as, I think I indicated in my original note). All
> transformations occurred on a Mac. LibreOffice spit out a CSV file
> (with those three odd bytes). My script sucked in the CSV file and
> inserted data into my SQLite db.

Did the file originally come from Windows, or was it always on a Mac?

ChrisA

[toc] | [prev] | [next] | [standalone]


#85399

FromZachary Ware <zachary.ware+pylist@gmail.com>
Date2015-02-09 14:05 -0600
Message-ID<mailman.18575.1423512371.18130.python-list@python.org>
In reply to#85370
On Mon, Feb 9, 2015 at 11:32 AM, Skip Montanaro
<skip.montanaro@gmail.com> wrote:
> LibreOffice spit out a CSV file
> (with those three odd bytes). My script sucked in the CSV file and
> inserted data into my SQLite db.

If all else fails, you can try ftfy to fix things:
http://ftfy.readthedocs.org/en/latest/

   >>> import ftfy
   >>> ftfy.fix_text('Anderson Barracuda Masters - 2010 St.
Patrick’s Day Swim Meet')
   "Anderson Barracuda Masters - 2010 St. Patrick's Day Swim Meet"

It also seems to agree that there was a bad (en|de)coding with cp1252
at some point.

   >>> ftfy.fixes.fix_encoding_and_explain('Anderson Barracuda Masters
- 2010 St. Patrick’s Day Swim Meet')
   ('Anderson Barracuda Masters - 2010 St. Patrick’s Day Swim Meet',
[('encode', 'sloppy-windows-1252'), ('decode', 'utf-8')])

-- 
Zach

[toc] | [prev] | [next] | [standalone]


#85414

FromSkip Montanaro <skip.montanaro@gmail.com>
Date2015-02-09 18:36 -0600
Message-ID<mailman.18584.1423528619.18130.python-list@python.org>
In reply to#85370

[Multipart message — attachments visible in raw view] — view raw

On Mon, Feb 9, 2015 at 2:38 PM, Skip Montanaro <skip.montanaro@gmail.com>
wrote:

> On Mon, Feb 9, 2015 at 2:05 PM, Zachary Ware
> <zachary.ware+pylist@gmail.com> wrote:
> > If all else fails, you can try ftfy to fix things:
> > http://ftfy.readthedocs.org/en/latest/
>
> Thanks for the pointer. I would prefer to not hand-mangle this stuff
> in case I get another database dump from my USMS friends. Something
> like ftfy should help things "just work".
>

And indeed it did. Thanks Zachary.

Skip

[toc] | [prev] | [next] | [standalone]


#85437

Fromwxjmfauth@gmail.com
Date2015-02-10 00:23 -0800
Message-ID<7ce36cdc-2cb6-41f9-964d-227248c9d33c@googlegroups.com>
In reply to#85414
Le mardi 10 février 2015 01:37:15 UTC+1, Skip Montanaro a écrit :
> On Mon, Feb 9, 2015 at 2:38 PM, Skip Montanaro <skip.mo...@gmail.com> wrote:
> On Mon, Feb 9, 2015 at 2:05 PM, Zachary Ware
> 
> <zachary.w...@gmail.com> wrote:
> 
> > If all else fails, you can try ftfy to fix things:
> 
> > http://ftfy.readthedocs.org/en/latest/
> 
> 
> 
> Thanks for the pointer. I would prefer to not hand-mangle this stuff
> 
> in case I get another database dump from my USMS friends. Something
> 
> like ftfy should help things "just work".
> 
> 
> 
> And indeed it did. Thanks Zachary.
> 

%%%%%%

ftfy: a mountain of absurdities. On top of this: ~buggy.

Everything works fine if it's done correctly. There is
nothing to fix. I have the feeling you are destroying a
correct data file, and later you try to correct what you have
destroyed.

Basically the same experiment from Matthew Ruffalo:
Office suite --> csv file saved as pd.txt.

From my GUI interactive interpreter (py32).

>>> with open('pd.csv', encoding='utf-8') as f:
...     r = f.read()
...     
>>> print(r)
"Patrick's Day A1","Patrick's Day B1","Patrick's Day C1"
"Patrick's Day A2","Patrick's Day C2","Patrick's Day C2"

>>>


Now what may happen, is that the terminal (the host system)
may not display all these chars correctly (Windows, Russion *x, ...).
In that case, one has to code correctly (Windows, Russion *x, ...)

Still with the same GUI interpreter:

>>> sys.stdout.sethostencoding('cp850')
>>> outenc = sys.stdout.encoding
>>> print(r.encode(outenc, 'replace').decode(outenc))
"Patrick?s Day A1","Patrick?s Day B1","Patrick?s Day C1"
"Patrick?s Day A2","Patrick?s Day C2","Patrick?s Day C2"


>>> sys.stdout.sethostencoding('iso-8859-5')
>>> outenc = sys.stdout.encoding
>>> print(r.encode(outenc, 'replace').decode(outenc))
"Patrick?s Day A1","Patrick?s Day B1","Patrick?s Day C1"
"Patrick?s Day A2","Patrick?s Day C2","Patrick?s Day C2"

This is exactly what can be observed in a web browser.

Just for the fun, in fact a no-op.

>>> sys.stdout.sethostencoding('utf-32-le')
>>> outenc = sys.stdout.encoding
>>> print(r.encode(outenc, 'replace').decode(outenc))
"Patrick's Day A1","Patrick's Day B1","Patrick's Day C1"
"Patrick's Day A2","Patrick's Day C2","Patrick's Day C2"

[toc] | [prev] | [next] | [standalone]


#85398

Frommm0fmf <none@mailinator.com>
Date2015-02-09 19:41 +0000
Message-ID<A38Cw.919306$E64.201025@fx13.am4>
In reply to#85369
On 09/02/2015 03:44, Skip Montanaro wrote:
> I am trying to process a CSV file using Python 3.5 (CPython tip as of a
> week or so ago). According to chardet[1], the file is encoded as utf-8:
>
>  >>> s = open("data/meets-usms.csv", "rb").read()
>  >>> len(s)
> 562272
>  >>> import chardet
>  >>> chardet.detect(s)
> {'encoding': 'utf-8', 'confidence': 0.99}
>
> so I created the reader like so:
>
>          rdr = csv.DictReader(open(csvfile, encoding="utf-8"))
>
> This seems to work. The rows are read and records added to a SQLite3
> database. When I go into sqlite3, I get what looks to be raw utf-8 on
> output:
>
> % LANG=en_US.UTF-8 sqlite3 topten.db
> SQLite version 3.8.5 2014-08-15 22:37:57
> Enter ".help" for usage hints.
> sqlite> select * from swimmeet where meetname like '%Barracuda%';
> sqlite> select count(*) from swimmeet;
> 0
> sqlite> select count(*) from swimmeet;
> 4171
> sqlite> select meetname from swimmeet where meetname like
> '%Barracuda%Patrick%';
> Anderson Barracudas St. Patrick's Day Swim Meet
> Anderson Barracuda Masters - 2010 St. Patrick’s Day Swim Meet
> Anderson Barracuda Masters 2011 St. Patrick’s Day Swim Meet
> Anderson Barracuda Masters St. Patrick's Day Meet
> Anderson Barracuda Masters St. Patrick's Day Meet 2014
> Anderson Barracuda Masters 2015 St. Patrick’s Day Swim Meet
>

How is meetname defined? Is it a varchar or nvarchar?

My only experience is with MS-SQL and C# but reading from a utf-8 
encoded file with a StreamReader set to utf-8 and trying to insert that 
into varchar fields results in similar issues to what you are showing. I 
changed to using nvarchar and it all start working as expected.


[toc] | [prev] | [standalone]


Back to top | Article view | comp.lang.python


csiph-web