| Started by | CM <cmpython@gmail.com> |
|---|---|
| First post | 2016-04-24 11:58 -0700 |
| Last post | 2016-04-25 17:59 +0000 |
| Articles | 6 — 4 participants |
Scraping email to make invoice CM <cmpython@gmail.com> - 2016-04-24 11:58 -0700
Re: Scraping email to make invoice Friedrich Rentsch <anthra.norell@bluewin.ch> - 2016-04-24 22:38 +0200
Re: Scraping email to make invoice Michael Torrie <torriem@gmail.com> - 2016-04-24 17:19 -0600
Re: Scraping email to make invoice Grant Edwards <grant.b.edwards@gmail.com> - 2016-04-25 14:39 +0000
Re: Scraping email to make invoice Michael Torrie <torriem@gmail.com> - 2016-04-25 11:16 -0600
Re: Scraping email to make invoice Grant Edwards <grant.b.edwards@gmail.com> - 2016-04-25 17:59 +0000
| From | CM <cmpython@gmail.com> |
|---|---|
| Date | 2016-04-24 11:58 -0700 |
| Subject | Scraping email to make invoice |
| Message-ID | <e75f5681-6e6f-424f-8697-b01c94d0f3ce@googlegroups.com> |
I would like to write a Python script to automate a tedious process and could use some advice.
The source content will be an email that has 5-10 PO (purchase order) numbers and information for freelance work done. The target content will be an invoice. (There will be an email like this every week).
Right now, the "recommended" way to go (from the company) from source to target is manually copying and pasting all the tedious details of the work done into the invoice. But this is laborious, error-prone...and just begging for automation. There is no human judgment necessary whatsoever in this.
I'm comfortable with "scraping" a text file and have written scripts for this, but could use some pointers on other parts of this operation.
1. INPUT: What's the best way to scrape an email like this? The email is to a Gmail account, and the content shows up in the email as a series of basically 6x7 tables (HTML?), one table per PO number/task. I know if the freelancer were to copy and paste the whole set of tables into a text file and save it as plain text, Python could easily scrape that file, but I'd much prefer to save the user those steps. Is there a relatively easy way to go from the Gmail email to generating the invoice directly? (I know there is, but wasn't sure what is state of the art these days).
2. OUTPUT: The invoice will have boilerplate content on top and then an Excel table at bottom that is mostly the same information from the source content. Ideally, so that the invoice looks good, the invoice should be a Word document. For the first pass at this, it looked best by laying out the entire invoice in Excel and then copy and pasting it into a Word doc as an image (since otherwise the columns ran over for some reason). In any case, the goal is to create a single page invoice that looks like a clean, professional looking invoice.
3. UI: I am comfortable with making GUI apps, so could use this as the interface for the (somewhat computer-uncomfortable) user. But the less user actions necessary, the better. The emails always come from the same sender, and always have the same boilerplate language ("Below please find your Purchase Order (PO)"), so I'm envisioning a small GUI window with a single button that says "MAKE NEWEST INVOICE" and the user presses it and it automatically searches the user's email for PO # emails and creates the newest invoice. I'm guessing I could keep a sqlite database or flat file on the computer to just track what is meant by "newest", and then the output would have the date created in the file, so the user can be sure what has been invoiced.
I'm hoping I can write this in a couple of days.
Any suggestions welcome! Thanks.
| From | Friedrich Rentsch <anthra.norell@bluewin.ch> |
|---|---|
| Date | 2016-04-24 22:38 +0200 |
| Message-ID | <mailman.58.1461530366.32212.python-list@python.org> |
| In reply to | #107569 |
On 04/24/2016 08:58 PM, CM wrote:
> I would like to write a Python script to automate a tedious process and could use some advice.
>
> The source content will be an email that has 5-10 PO (purchase order) numbers and information for freelance work done. The target content will be an invoice. (There will be an email like this every week).
>
> Right now, the "recommended" way to go (from the company) from source to target is manually copying and pasting all the tedious details of the work done into the invoice. But this is laborious, error-prone...and just begging for automation. There is no human judgment necessary whatsoever in this.
>
> I'm comfortable with "scraping" a text file and have written scripts for this, but could use some pointers on other parts of this operation.
>
> 1. INPUT: What's the best way to scrape an email like this? The email is to a Gmail account, and the content shows up in the email as a series of basically 6x7 tables (HTML?), one table per PO number/task. I know if the freelancer were to copy and paste the whole set of tables into a text file and save it as plain text, Python could easily scrape that file, but I'd much prefer to save the user those steps. Is there a relatively easy way to go from the Gmail email to generating the invoice directly? (I know there is, but wasn't sure what is state of the art these days).
>
> 2. OUTPUT: The invoice will have boilerplate content on top and then an Excel table at bottom that is mostly the same information from the source content. Ideally, so that the invoice looks good, the invoice should be a Word document. For the first pass at this, it looked best by laying out the entire invoice in Excel and then copy and pasting it into a Word doc as an image (since otherwise the columns ran over for some reason). In any case, the goal is to create a single page invoice that looks like a clean, professional looking invoice.
>
> 3. UI: I am comfortable with making GUI apps, so could use this as the interface for the (somewhat computer-uncomfortable) user. But the less user actions necessary, the better. The emails always come from the same sender, and always have the same boilerplate language ("Below please find your Purchase Order (PO)"), so I'm envisioning a small GUI window with a single button that says "MAKE NEWEST INVOICE" and the user presses it and it automatically searches the user's email for PO # emails and creates the newest invoice. I'm guessing I could keep a sqlite database or flat file on the computer to just track what is meant by "newest", and then the output would have the date created in the file, so the user can be sure what has been invoiced.
>
> I'm hoping I can write this in a couple of days.
>
> Any suggestions welcome! Thanks.
INPUT: What's the best way to scrape an email like this? -- Like what? You need to explain what exactly your input is or show an example.
Frederic
| From | Michael Torrie <torriem@gmail.com> |
|---|---|
| Date | 2016-04-24 17:19 -0600 |
| Message-ID | <mailman.61.1461540002.32212.python-list@python.org> |
| In reply to | #107569 |
On 04/24/2016 12:58 PM, CM wrote:
> 1. INPUT: What's the best way to scrape an email like this? The
> email is to a Gmail account, and the content shows up in the email as
> a series of basically 6x7 tables (HTML?), one table per PO
> number/task. I know if the freelancer were to copy and paste the
> whole set of tables into a text file and save it as plain text,
> Python could easily scrape that file, but I'd much prefer to save the
> user those steps. Is there a relatively easy way to go from the Gmail
> email to generating the invoice directly? (I know there is, but
> wasn't sure what is state of the art these days).
I would configure Gmail to allow IMAP access (you'll have to set up a
special password for this most likely), and then use an imap library
from Python to directly find the relevant messages and access the email
message body. If the body is HTML-formatted (sounds like it is) I would
use either BeautifulSoup or lxml to parse it and get out the relevant
information.
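As a rough sketch of that parsing step, assuming the message has already been fetched as raw bytes and that each PO arrives as a simple (non-nested) HTML table: the standard library's `email` and `html.parser` modules are used here only to keep the example dependency-free, and BeautifulSoup or lxml would take the place of the `TableParser` class. The sample message, its sender address, and its column names are made up.

```python
from email import message_from_bytes, policy
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <table> as a list of rows, each row a list of cell strings.

    Nested tables are not handled; this assumes flat tables only.
    """

    def __init__(self):
        super().__init__()
        self.tables = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def tables_from_email(raw_bytes):
    """Return the tables found in the HTML body of a raw RFC 822 message."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)
    body = msg.get_body(preferencelist=("html",))
    if body is None:  # no HTML part at all
        return []
    parser = TableParser()
    parser.feed(body.get_content())
    parser.close()
    return parser.tables


# Made-up sample message standing in for one PO email.
SAMPLE = (
    b"From: po@example.com\r\n"
    b"Subject: Purchase Orders\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/html; charset=utf-8\r\n"
    b"\r\n"
    b"<html><body><p>Below please find your Purchase Order (PO)</p>"
    b"<table><tr><th>PO</th><th>Amount</th></tr>"
    b"<tr><td>12345</td><td>150.00</td></tr></table></body></html>\r\n"
)

if __name__ == "__main__":
    print(tables_from_email(SAMPLE))
```

Each table comes back as a list of rows, so mapping the 6x7 layout onto invoice fields is then a matter of indexing into those rows.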
> 2. OUTPUT: The invoice will have boilerplate content on top and then
> an Excel table at bottom that is mostly the same information from
> the source content. Ideally, so that the invoice looks good, the
> invoice should be a Word document. For the first pass at this, it
> looked best by laying out the entire invoice in Excel and then copy
> and pasting it into a Word doc as an image (since otherwise the
> columns ran over for some reason). In any case, the goal is to create
> a single page invoice that looks like a clean, professional looking
> invoice.
There are several libraries for creating Excel and Word files,
especially the XML-based formats, though I have little experience with
them. There are also nice libraries for emitting PDF if that would work
better.
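For a first pass that needs no third-party package, the line items could be written as CSV, which Excel opens directly. This is only a stand-in for the Word/Excel/PDF libraries mentioned above and does nothing for page layout; the column names and sample rows are invented for illustration.

```python
import csv
import io
from decimal import Decimal

# Hypothetical column names; the real ones depend on the PO email.
FIELDS = ["po_number", "description", "amount"]


def write_invoice_csv(items, out):
    """Write line items plus a TOTAL row to the file-like object `out`.

    Returns the total as a Decimal.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    total = Decimal("0")
    for item in items:
        writer.writerow(item)
        total += Decimal(item["amount"])  # Decimal avoids float rounding on money
    writer.writerow({"po_number": "", "description": "TOTAL", "amount": str(total)})
    return total


if __name__ == "__main__":
    sample = [
        {"po_number": "12345", "description": "Editing", "amount": "150.00"},
        {"po_number": "12346", "description": "Layout", "amount": "80.50"},
    ]
    buf = io.StringIO()
    write_invoice_csv(sample, buf)
    print(buf.getvalue())
```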
> 3. UI: I am comfortable with making GUI apps, so could use this as
> the interface for the (somewhat computer-uncomfortable) user. But
> the less user actions necessary, the better. The emails always come
> from the same sender, and always have the same boilerplate language
> ("Below please find your Purchase Order (PO)"), so I'm envisioning a
> small GUI window with a single button that says "MAKE NEWEST
> INVOICE" and the user presses it and it automatically searches the
> user's email for PO # emails and creates the newest invoice. I'm
> guessing I could keep a sqlite database or flat file on the computer
> to just track what is meant by "newest", and then the output would
> have the date created in the file, so the user can be sure what has
> been invoiced.
Once you have a working script, your GUI interface would be pretty easy.
Though it seems to me that it would be unnecessary. This process
sounds like it should just run automatically from a cron job or something.
> I'm hoping I can write this in a couple of days.
The automated part should be possible, but personally I'd give myself a
week.
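Run unattended (from cron, say), the script needs some record of which emails it has already turned into invoices. Here is a minimal sketch of the sqlite approach CM mentions, assuming each email is identified by its Message-ID header; the table name, column names, and function names are invented here.

```python
import sqlite3
from datetime import datetime, timezone


def open_db(path=":memory:"):
    """Open (or create) the tracking database. Pass a file path in real use."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoiced ("
        " message_id TEXT PRIMARY KEY,"
        " invoiced_at TEXT NOT NULL)"
    )
    return conn


def not_yet_invoiced(conn, message_ids):
    """Return the message IDs with no row in the table, in input order."""
    seen = {row[0] for row in conn.execute("SELECT message_id FROM invoiced")}
    return [mid for mid in message_ids if mid not in seen]


def mark_invoiced(conn, message_id):
    """Record that an invoice was produced for this message (idempotent)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO invoiced VALUES (?, ?)",
            (message_id, datetime.now(timezone.utc).isoformat()),
        )
```

The stored timestamp doubles as the "date created" record, so the user can check what has been invoiced and when.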
| From | Grant Edwards <grant.b.edwards@gmail.com> |
|---|---|
| Date | 2016-04-25 14:39 +0000 |
| Message-ID | <mailman.80.1461595206.32212.python-list@python.org> |
| In reply to | #107569 |
On 2016-04-24, Michael Torrie <torriem@gmail.com> wrote:
> On 04/24/2016 12:58 PM, CM wrote:
>
>> 1. INPUT: What's the best way to scrape an email like this? The
>> email is to a Gmail account, and the content shows up in the
>> email as a series of basically 6x7 tables (HTML?), one table per
>> PO number/task. I know if the freelancer were to copy and paste
>> the whole set of tables into a text file and save it as plain
>> text, Python could easily scrape that file, but I'd much prefer
>> to save the user those steps. Is there a relatively easy way to
>> go from the Gmail email to generating the invoice directly? (I
>> know there is, but wasn't sure what is state of the art these
>> days).
>
> I would configure Gmail to allow IMAP access (you'll have to set up a
> special password for this most likely),
Your normal gmail password is used for IMAP.
> and then use an imap library from Python to directly find the
> relevant messages and access the email message body. If the body is
> HTML-formatted (sounds like it is) I would use either BeautifulSoup
> or lxml to parse it and get out the relevant information.
Warning: don't use the basic imaplib. IMAP is a miserable protocol,
and imaplib is too thin a wrapper. It'll make you bleed from the ears
and wish you were dead. Use imapclient or imaplib2. I've used both
(with Gmail's IMAP server), and IMO both are pretty good. Either one
is miles ahead of plain imaplib.
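A sketch of what the search-and-fetch step might look like with imapclient, assuming a recent version of that package (one where `IMAPClient` works as a context manager) and that IMAP access is enabled on the account. The function names and parameters are invented for illustration, and the sender address and phrase would come from the real PO emails.

```python
def build_search_criteria(sender, phrase):
    """IMAP SEARCH criteria: messages from `sender` whose body contains `phrase`."""
    return ["FROM", sender, "BODY", phrase]


def fetch_po_messages(user, password, sender, phrase, host="imap.gmail.com"):
    """Return {uid: raw RFC 822 bytes} for the matching messages in INBOX."""
    # Imported here so build_search_criteria works without the package installed.
    from imapclient import IMAPClient

    with IMAPClient(host, ssl=True) as client:
        client.login(user, password)
        # readonly=True so fetching does not mark the messages as read
        client.select_folder("INBOX", readonly=True)
        uids = client.search(build_search_criteria(sender, phrase))
        if not uids:
            return {}
        fetched = client.fetch(uids, ["RFC822"])
        return {uid: data[b"RFC822"] for uid, data in fetched.items()}
```

The raw bytes returned for each UID can be handed straight to the `email` package for parsing.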
--
Grant Edwards <grant.b.edwards at gmail.com>
Yow! But they went to MARS around 1953!!
| From | Michael Torrie <torriem@gmail.com> |
|---|---|
| Date | 2016-04-25 11:16 -0600 |
| Message-ID | <mailman.87.1461604629.32212.python-list@python.org> |
| In reply to | #107569 |
On 04/25/2016 08:39 AM, Grant Edwards wrote:
> Your normal gmail password is used for IMAP.
Actually, no, unless you explicitly tell Google to allow "less-secure"
authentication. Otherwise you are required to set up a special,
application-specific password.
https://support.google.com/accounts/answer/185833?hl=en
> Warning: don't use the basic imaplib. IMAP is a miserable protocol,
> and imaplib is too thin a wrapper. It'll make you bleed from the ears
> and wish you were dead. Use imapclient or imaplib2. I've used both
> (with Gmail's IMAP server), and IMO both are pretty good. Either one
> is miles ahead of plain imaplib.
| From | Grant Edwards <grant.b.edwards@gmail.com> |
|---|---|
| Date | 2016-04-25 17:59 +0000 |
| Message-ID | <mailman.88.1461607161.32212.python-list@python.org> |
| In reply to | #107569 |
On 2016-04-25, Michael Torrie <torriem@gmail.com> wrote:
> On 04/25/2016 08:39 AM, Grant Edwards wrote:
>> Your normal gmail password is used for IMAP.
>
> Actually, no, unless you explicitly tell Google to allow "less-secure"
> authentication. Otherwise you are required to set up a special,
> application-specific password.
>
> https://support.google.com/accounts/answer/185833?hl=en
You're right. I should have said your normal gmail password _can_be_
used for IMAP.
--
Grant Edwards <grant.b.edwards at gmail.com>
Yow! TAILFINS!! ... click ...