| Started by | "Veek. M" <vek.m1234@gmail.com> |
|---|---|
| First post | 2016-01-31 10:28 +0530 |
| Last post | 2016-02-01 11:40 -0700 |
| Articles | 13 — 6 participants |
x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up "Veek. M" <vek.m1234@gmail.com> - 2016-01-31 10:28 +0530
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up Chris Angelico <rosuav@gmail.com> - 2016-01-31 16:23 +1100
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up "Veek. M" <vek.m1234@gmail.com> - 2016-01-31 11:59 +0530
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up "Veek. M" <vek.m1234@gmail.com> - 2016-01-31 12:01 +0530
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up Steven D'Aprano <steve@pearwood.info> - 2016-01-31 20:21 +1100
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up Chris Angelico <rosuav@gmail.com> - 2016-01-31 20:40 +1100
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up Steven D'Aprano <steve@pearwood.info> - 2016-01-31 21:14 +1100
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up Chris Angelico <rosuav@gmail.com> - 2016-02-01 00:27 +1100
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up Peter Otten <__peter__@web.de> - 2016-01-31 11:40 +0100
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up Larry Hudson <orgnut@yahoo.com> - 2016-01-31 13:27 -0800
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up Steven D'Aprano <steve@pearwood.info> - 2016-01-31 18:22 +1100
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up "Veek. M" <vek.m1234@gmail.com> - 2016-01-31 20:55 +0530
Re: x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up Vincent Davis <vincent@vincentdavis.net> - 2016-02-01 11:40 -0700
| From | "Veek. M" <vek.m1234@gmail.com> |
|---|---|
| Date | 2016-01-31 10:28 +0530 |
| Subject | x=something, y=somethinelse and z=crud all likely to fail - how do i wrap them up |
| Message-ID | <n8k448$3pd$1@dont-email.me> |
I'm parsing html and i'm doing:

    x = root.find_class(...
    y = root.find_class(..
    z = root.find_class(..

all 3 are likely to fail so typically i'd have to stick it in a try. This is a huge pain for obvious reasons.

    try:
        ....
    except something:
        x = 'default_1'
    (repeat 3 times)

Is there some other nice way to wrap this stuff up?
I can't do:

    try:
        x=
        y=
        z=
    except:

because here if x fails, y and z might have succeeded.

Pass the statement as a string to a try function? Any other way?
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-01-31 16:23 +1100 |
| Message-ID | <mailman.152.1454217800.2338.python-list@python.org> |
| In reply to | #102332 |
On Sun, Jan 31, 2016 at 3:58 PM, Veek. M <vek.m1234@gmail.com> wrote:
> I'm parsing html and i'm doing:
>
> x = root.find_class(...
> y = root.find_class(..
> z = root.find_class(..
>
> all 3 are likely to fail so typically i'd have to stick it in a try. This is
> a huge pain for obvious reasons.
>
> try:
> ....
> except something:
> x = 'default_1'
> (repeat 3 times)
>
> Is there some other nice way to wrap this stuff up?
I'm not sure what you're using to parse HTML here (there are several
libraries for doing that), but the first thing I'd look for is an
option to have it return a default if it doesn't find something - even
if that default has to be (say) None.
But failing that, you can always write your own wrapper:
    def find_class(root, ...):
        try:
            return root.find_class(...)
        except something:
            return 'default_1'
Or have the default as a parameter, if it's different for the different ones.
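A minimal sketch of that wrapper with the default passed in as a parameter. `FakeRoot` and `find_class_or` are hypothetical names used only for illustration, and the stand-in raises `LookupError` on a miss, which is an assumption; swap in whatever your parser actually raises.

```python
class FakeRoot:
    """Hypothetical stand-in for a parsed document (not a real parser API)."""

    def __init__(self, classes):
        self._classes = classes

    def find_class(self, name):
        # Assumption: the lookup raises when the class is missing.
        if name not in self._classes:
            raise LookupError(name)
        return self._classes[name]


def find_class_or(root, name, default):
    """Return root.find_class(name), or default if the lookup fails."""
    try:
        return root.find_class(name)
    except LookupError:  # replace with the exception your parser really raises
        return default


root = FakeRoot({"vip": "Some description"})
x = find_class_or(root, "vip", "default_1")      # class present
y = find_class_or(root, "lvprice", "default_2")  # class missing
```

Each assignment is then a single line, and a failure in one lookup cannot affect the others.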
ChrisA
| From | "Veek. M" <vek.m1234@gmail.com> |
|---|---|
| Date | 2016-01-31 11:59 +0530 |
| Message-ID | <n8k9f4$gjf$1@dont-email.me> |
| In reply to | #102334 |
I'm using lxml.html
    def parse_page(self, root):
        for li_item in root.xpath('//li[re:test(@id, "^item[a-z0-9]+$")]',
                namespaces={'re': "http://exslt.org/regular-expressions"}):
            description = li_item.find_class('vip')[0].text_content()
            link = li_item.find_class('vip')[0].get('href')
            price_dollar = li_item.find_class('lvprice prc')[0].xpath('span')[0].text
            bids = li_item.find_class('lvformat')[0].xpath('span')[0].text
            tme_time = li_item.find_class('tme')[0].xpath('span')[0].get('timems')
            if tme_time:
                time_hrs = int(tme_time)/1000 - time.time()
            else:
                time_hrs = 'No time found'
            shipping = li_item.find_class('lvshipping')[0].xpath('span/span/span')[0].text_content()
            print('{} {} {} {} {}'.format(link, price_dollar, time_hrs,
                                          shipping, bids))
            print('-----------------------------------------------------------------')
| From | "Veek. M" <vek.m1234@gmail.com> |
|---|---|
| Date | 2016-01-31 12:01 +0530 |
| Message-ID | <n8k9j7$gjf$2@dont-email.me> |
| In reply to | #102335 |
Someone suggested i refactor the find_class/xpath into wrapper functions but
i tried it and it didn't look all that great..
Just give me a general idea of how to deal with messy crud like this..
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-01-31 20:21 +1100 |
| Message-ID | <56add21a$0$1593$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #102335 |
On Sun, 31 Jan 2016 05:29 pm, Veek. M wrote:
> I'm using lxml.html
Hmmm. Well, I've never used lxml, but the first obvious problem I see is
that your lines:
description = li_item.find_class('vip')[0].text_content()
link = li_item.find_class('vip')[0].get('href')
price_dollar = li_item.find_class('lvprice prc')[0].xpath('span')[0].text
bids = li_item.find_class('lvformat')[0].xpath('span')[0].text
look suspiciously like a violation of the Liskov Substitution Principle.
("Talk to your dog, not to the dog's legs!") A long series of chained dot
accesses (or equivalent getitem, call, getitem, dot, etc) is a code-smell
suggesting that you are trying to control your dog's individual legs,
instead of just calling the dog.
But, I'll assume that this is part of the design of lxml, and so allowed. So
let's refactor by adding some helper methods and tidying the parse_page
method. This will also make it easier to test, refactor and maintain the
code, especially if the format of the XML file changes.
    def extract(self, item, clsname, extractor, default="unknown"):
        """Return the class of item, or default if unknown."""
        try:
            cls = item.find_class(clsname)
        except lxml.ClassNotFoundError:  # what should this be?
            return default
        return extractor(cls)

    def get_time(self, item, clsname, default='No time found'):
        extractor = lambda obj: obj[0].xpath('span')[0].get('timems')
        t = self.extract(item, clsname, extractor, None)
        if t is None:
            return default
        return int(t)/1000 - time.time()

    def parse_page(self, root):
        for li_item in root.xpath(
                '//li[re:test(@id, "^item[a-z0-9]+$")]',
                namespaces={'re': "http://exslt.org/regular-expressions"}
                ):
            description = self.extract(li_item, 'vip',
                lambda obj: obj[0].text_content(), "no description")
            link = self.extract(li_item, 'vip',
                lambda obj: obj[0].get('href'))
            price_dollar = self.extract(li_item, 'lvprice prc',
                lambda obj: obj[0].xpath('span')[0].text)
            bids = self.extract(li_item, 'lvformat',
                lambda obj: obj[0].xpath('span')[0].text)
            time_hrs = self.get_time(li_item, 'tme')
            shipping = self.extract(li_item, 'lvshipping',
                lambda obj: obj[0].xpath(
                    'span/span/span')[0].text_content()
                )
            print('{} {} {} {} {}'.format(
                link, price_dollar, time_hrs, shipping, bids))
            print('-'*70)
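On the "what should this be?" question: as far as I know, `find_class` in lxml.html returns a (possibly empty) list rather than raising, so a miss would surface as an `IndexError` from the `[0]` inside the extractor. Below is a rough sketch under that assumption, with the extractor call moved inside the `try`. `FakeItem` is a hypothetical stand-in so the example runs without lxml, and the strings stand in for elements.

```python
class FakeItem:
    """Hypothetical stand-in for an lxml element; only find_class is modelled."""

    def __init__(self, by_class):
        self._by_class = by_class

    def find_class(self, clsname):
        # Assumption: a missing class yields an empty list, not an exception.
        return self._by_class.get(clsname, [])


def extract(item, clsname, extractor, default="unknown"):
    """Apply extractor to the elements with class clsname, or return default."""
    try:
        return extractor(item.find_class(clsname))
    except (IndexError, AttributeError):
        # IndexError: nothing matched, so obj[0] failed.
        # AttributeError: a match was there but lacked the expected attribute.
        return default


item = FakeItem({"vip": ["  Nice motherboard  "]})
description = extract(item, "vip", lambda obj: obj[0].strip(), "no description")
price = extract(item, "lvprice prc", lambda obj: obj[0].strip())
```

Catching only these two exception types keeps genuine bugs (a typo in a lambda, say) from being silently turned into defaults.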
#######################
If you prefer a more Java-style object-oriented solution:
    def get_class(self, item, clsname):
        """Return the class of item, or None if unknown."""
        try:
            return item.find_class(clsname)
        except lxml.ClassNotFoundError:  # what should this be?
            return None

    def get_description(self, maybe_cls, default="unknown"):
        if maybe_cls is None:
            return default
        return maybe_cls[0].text_content()

    def get_link(self, maybe_cls, tag='href', default='none'):
        if maybe_cls is None:
            return default
        return maybe_cls[0].get(tag)

    def get_text(self, maybe_cls, default='unknown'):
        if maybe_cls is None:
            return default
        return maybe_cls[0].xpath('span')[0].text

    def get_time(self, maybe_cls, default='No time found'):
        if maybe_cls is None:
            return default
        t = maybe_cls[0].xpath('span')[0].get('timems')
        return int(t)/1000 - time.time()

    def get_shipping(self, maybe_cls, default='unknown shipping'):
        if maybe_cls is None:
            return default
        return maybe_cls[0].xpath('span/span/span')[0].text_content()

    def parse_page(self, root):
        for li_item in root.xpath(
                '//li[re:test(@id, "^item[a-z0-9]+$")]',
                namespaces={'re': "http://exslt.org/regular-expressions"}
                ):
            description = self.get_description(
                self.get_class(li_item, 'vip'), "no description")
            link = self.get_link(self.get_class(li_item, 'vip'))
            price_dollar = self.get_text(
                self.get_class(li_item, 'lvprice prc'))
            bids = self.get_text(
                self.get_class(li_item, 'lvformat'))
            time_hrs = self.get_time(self.get_class(li_item, 'tme'))
            shipping = self.get_shipping(
                self.get_class(li_item, 'lvshipping'))
            print('{} {} {} {} {}'.format(
                link, price_dollar, time_hrs, shipping, bids))
            print('-'*70)
Obviously I haven't tested this code.
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-01-31 20:40 +1100 |
| Message-ID | <mailman.153.1454233251.2338.python-list@python.org> |
| In reply to | #102338 |
On Sun, Jan 31, 2016 at 8:21 PM, Steven D'Aprano <steve@pearwood.info> wrote:
> Hmmm. Well, I've never used lxml, but the first obvious problem I see is
> that your lines:
>
> description = li_item.find_class('vip')[0].text_content()
>
> link = li_item.find_class('vip')[0].get('href')
>
> price_dollar = li_item.find_class('lvprice prc')[0].xpath('span')[0].text
>
> bids = li_item.find_class('lvformat')[0].xpath('span')[0].text
>
>
> look suspiciously like a violation of the Liskov Substitution Principle.
> ("Talk to your dog, not to the dog's legs!") A long series of chained dot
> accesses (or equivalent getitem, call, getitem, dot, etc) is a code-smell
> suggesting that you are trying to control your dog's individual legs,
> instead of just calling the dog.
(Isn't that the Law of Demeter, not LSP?)
The principle of "one dot maximum" is fine when dots represent a form
of ownership. The dog owns his legs; you own (or, have a relationship
with) the dog. But in this case, the depth of subscripting is more
about the inherent depth of the document, and it's more of a data
thing than a code one. Imagine taking a large and complex JSON blob
and loading it into a Python structure with nested lists and dicts -
it wouldn't violate software design principles to call up
info["records"][3]["name"], even though that's three indirections in a
row. Parsing HTML is even worse, as there's generally going to be
numerous levels of structure that have no semantic meaning (they're
there for layout) - so instead of three levels, you might easily have
a dozen.
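For data shaped like that JSON blob, one possible way to keep the deep subscripting while still getting a default on a miss is a small helper that walks the keys one at a time. `dig` is a hypothetical name, not a standard-library function; this is only a sketch.

```python
import json


def dig(data, *keys, default=None):
    """Follow keys/indexes into nested dicts and lists, or return default."""
    for key in keys:
        try:
            data = data[key]
        except (KeyError, IndexError, TypeError):
            # KeyError: missing dict key; IndexError: list too short;
            # TypeError: tried to subscript something that is not a container.
            return default
    return data


info = json.loads('{"records": [{"name": "a"}, {"name": "b"}]}')
found = dig(info, "records", 1, "name")
missing = dig(info, "records", 3, "name", default="<no name>")
```

The depth of the lookup still mirrors the depth of the document; the helper only centralises what happens when a level is absent.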
ChrisA
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-01-31 21:14 +1100 |
| Message-ID | <56adde86$0$1601$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #102340 |
On Sun, 31 Jan 2016 08:40 pm, Chris Angelico wrote:
> On Sun, Jan 31, 2016 at 8:21 PM, Steven D'Aprano <steve@pearwood.info>
> wrote:
>> Hmmm. Well, I've never used lxml, but the first obvious problem I see is
>> that your lines:
>>
>> description = li_item.find_class('vip')[0].text_content()
>>
>> link = li_item.find_class('vip')[0].get('href')
>>
>> price_dollar = li_item.find_class('lvprice prc')[0].xpath('span')[0].text
>>
>> bids = li_item.find_class('lvformat')[0].xpath('span')[0].text
>>
>>
>> look suspiciously like a violation of the Liskov Substitution Principle.
>> ("Talk to your dog, not to the dog's legs!") A long series of chained dot
>> accesses (or equivalent getitem, call, getitem, dot, etc) is a code-smell
>> suggesting that you are trying to control your dog's individual legs,
>> instead of just calling the dog.
>
> (Isn't that the Law of Demeter, not LSP?)
D'oh!
I mean, yes, excellent, you have passed my test!
*wink*
> The principle of "one dot maximum" is fine when dots represent a form
> of ownership. The dog owns his legs; you own (or, have a relationship
> with) the dog. But in this case, the depth of subscripting is more
> about the inherent depth of the document, and it's more of a data
> thing than a code one.
Yes, that's right. That's why I said it was a code smell, not necessarily
wrong. But you do make a good point -- the Law of Demeter is not
*necessarily* about the number of dots. But the number of dots is a good
hint that you're looking too deeply into an object.
> Imagine taking a large and complex JSON blob
> and loading it into a Python structure with nested lists and dicts -
> it wouldn't violate software design principles to call up
> info["records"][3]["name"], even though that's three indirections in a
> row. Parsing HTML is even worse, as there's generally going to be
> numerous levels of structure that have no semantic meaning (they're
> there for layout) - so instead of three levels, you might easily have
> a dozen.
This might not be a Law of Demeter violation, but it's certainly a violation
of "Flat Is Better Than Nested".
--
Steven
| From | Chris Angelico <rosuav@gmail.com> |
|---|---|
| Date | 2016-02-01 00:27 +1100 |
| Message-ID | <mailman.168.1454246871.2338.python-list@python.org> |
| In reply to | #102343 |
On Sun, Jan 31, 2016 at 9:14 PM, Steven D'Aprano <steve@pearwood.info> wrote:
>> Imagine taking a large and complex JSON blob
>> and loading it into a Python structure with nested lists and dicts -
>> it wouldn't violate software design principles to call up
>> info["records"][3]["name"], even though that's three indirections in a
>> row. Parsing HTML is even worse, as there's generally going to be
>> numerous levels of structure that have no semantic meaning (they're
>> there for layout) - so instead of three levels, you might easily have
>> a dozen.
>
> This might not be a Law of Demeter violation, but it's certainly a violation
> of "Flat Is Better Than Nested".
Oh, absolutely! But "Flat is better than nested" is a principle of
design, and when you're parsing someone else's data structure, you
follow their design, not yours. This isn't the best of designs, but
it's what works with the file he has.
ChrisA
| From | Peter Otten <__peter__@web.de> |
|---|---|
| Date | 2016-01-31 11:40 +0100 |
| Message-ID | <mailman.156.1454236823.2338.python-list@python.org> |
| In reply to | #102335 |
When you use XPath instead of the chained function calls your initial
> Pass the statement as a string to a try function?
idea works out naturally:
    def parse_page(self, root):
        def get_xpath(path, default="<not available>"):
            result = li_item.xpath(path)
            if result:
                return " ".join(part.strip() for part in result)
            return default

        for li_item in root.xpath(
                '//li[re:test(@id, "^item[a-z0-9]+$")]',
                namespaces={'re': "http://exslt.org/regular-expressions"}):
            description = get_xpath("*[@class='vip']//text()")
            link = get_xpath("*[@class='vip']/@href")
            price = get_xpath("*[@class='lvprice prc']/span/text()")
            bids = get_xpath("*[@class='lvformat']/span/text()")
            tme_time = get_xpath("*[@class='tme']/span/@timems", None)
            if tme_time is not None:
                time_hrs = int(tme_time)/1000 - time.time()
            else:
                time_hrs = "No time found"
            shipping = get_xpath(
                "*[@class='lvshipping']/span/span/span//text()")
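A side note on the `time_hrs` computation that appears in the code in this thread: assuming `timems` is an end time in milliseconds since the epoch (which the division by 1000 suggests, though nothing in the thread confirms it), `int(tme_time)/1000 - time.time()` yields seconds remaining, not hours. A small sketch of the conversion, where `hours_remaining` is a hypothetical helper name:

```python
import time


def hours_remaining(timems, now=None, default="No time found"):
    """Convert an epoch-milliseconds string to hours left, or return default."""
    if timems is None:
        return default
    if now is None:
        now = time.time()
    seconds_left = int(timems) / 1000 - now
    return seconds_left / 3600


# A fixed 'now' keeps the result reproducible: 10800 s minus 3600 s is 7200 s.
left = hours_remaining("10800000", now=3600.0)
absent = hours_remaining(None)
```

Passing `now` explicitly also makes the function easy to test without depending on the wall clock.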
| From | Larry Hudson <orgnut@yahoo.com> |
|---|---|
| Date | 2016-01-31 13:27 -0800 |
| Message-ID | <yu6dnafWWIa34TPLnZ2dnUU7-KWdnZ2d@giganews.com> |
| In reply to | #102335 |
On 01/30/2016 10:29 PM, Veek. M wrote:
[snip]
Trivial comment (and irrelevant to your question)...
Replace your
print('-----------------------------------------------------------------')
with the shorter
print('-' * 65)
Of course, feel free to disagree if you think the longer version is visually more obviously a line.
-=- Larry -=-
| From | Steven D'Aprano <steve@pearwood.info> |
|---|---|
| Date | 2016-01-31 18:22 +1100 |
| Message-ID | <56adb63a$0$1614$c3e8da3$5496439d@news.astraweb.com> |
| In reply to | #102332 |
On Sun, 31 Jan 2016 03:58 pm, Veek. M wrote:
> Is there some other nice way to wrap this stuff up?
The answer to "how do I wrap this stuff up?" is nearly always:
- refactor your code so you don't need to;
- subclass and extend the method;
- write a function;
- write a delegate class.
Pick whichever is more relevant to your specific situation.
--
Steven
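As one illustration of the last option in that list, here is a rough sketch of a delegate class that forwards method calls to a wrapped object and returns a default when a chosen exception is raised. `Defaulting` is a hypothetical name, and a plain list stands in for whatever object you would actually wrap.

```python
class Defaulting:
    """Delegate that forwards method calls and swallows chosen exceptions."""

    def __init__(self, target, default, exceptions=(LookupError,)):
        self._target = target
        self._default = default
        self._exceptions = exceptions

    def __getattr__(self, name):
        # Only called for names not found on the delegate itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def wrapper(*args, **kwargs):
            try:
                return attr(*args, **kwargs)
            except self._exceptions:
                return self._default

        return wrapper


safe = Defaulting(["a", "b"], default=-1, exceptions=(ValueError,))
hit = safe.index("b")   # forwarded to list.index
miss = safe.index("z")  # list.index raises ValueError, so the default is used
```

This is heavier than a plain function, so it probably only pays off when many different methods of the same object need the same treatment.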
| From | "Veek. M" <vek.m1234@gmail.com> |
|---|---|
| Date | 2016-01-31 20:55 +0530 |
| Message-ID | <n8l8ra$ppc$1@dont-email.me> |
| In reply to | #102332 |
Thanks guys: you've given me some good ideas - I really need to re-read the lxml docs for xpath. (basically trying to scrape ebay and score a mobo - ebaysdk doesn't work) Also need to google those principles :) thanks! (i knew one shouldn't overly rely on chained attribute lookups - didn't figure that had a name :))
| From | Vincent Davis <vincent@vincentdavis.net> |
|---|---|
| Date | 2016-02-01 11:40 -0700 |
| Message-ID | <mailman.5.1454352055.3032.python-list@python.org> |
| In reply to | #102332 |
On Sat, Jan 30, 2016 at 9:58 PM, Veek. M <vek.m1234@gmail.com> wrote:
> Is there some other nice way to wrap this stuff up?
> I can't do:
> try:
> x=
> y=
> z=
> except:
I happened to have just been doing something similar. You can put x, y, z in a list and loop over it. In my case a dict was better. See the example here:
https://github.com/vincentdavis/USAC_data/blob/master/tools.py#L24
Vincent Davis
720-301-3003
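A minimal sketch of the dict-and-loop idea; it is not taken from the linked file. Each entry pairs a callable that may fail with its own default, which also covers the "pass the statement to a try function" idea from the original question by passing a callable rather than a string. The names `attempt`, `row` and `lookups` are made up for illustration.

```python
def attempt(func, default, exceptions=(Exception,)):
    """Call func() and return its result, or default if it raises."""
    try:
        return func()
    except exceptions:
        return default


row = {"price": "12.50", "bids": "n/a"}

# name -> (callable that may fail, default to use if it does)
lookups = {
    "x": (lambda: float(row["price"]), 0.0),
    "y": (lambda: int(row["bids"]), 0),          # int("n/a") raises ValueError
    "z": (lambda: row["shipping"], "default_3"), # missing key raises KeyError
}

results = {name: attempt(func, default, (KeyError, ValueError))
           for name, (func, default) in lookups.items()}
```

Because every lookup runs in its own `try`, a failure in one leaves the other results intact, which was the sticking point with a single `try` around all three assignments.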