Groups > comp.lang.java.programmer > #8418 > unrolled thread
| Started by | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| First post | 2011-09-30 09:04 -0700 |
| Last post | 2011-11-06 18:06 -0500 |
| Articles | 19 — 8 participants |
Phht! on screenscaping Roedy Green <see_website@mindprod.com.invalid> - 2011-09-30 09:04 -0700
Re: Phht! on screenscaping markspace <-@.> - 2011-09-30 10:10 -0700
Re: Phht! on screenscaping Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-09-30 10:24 -0700
Re: Phht! on screenscaping markspace <-@.> - 2011-09-30 10:30 -0700
Re: Phht! on screenscaping Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-09-30 10:40 -0700
Re: Phht! on screenscaping Arne Vajhøj <arne@vajhoej.dk> - 2011-09-30 21:19 -0400
Re: Phht! on screenscaping Arne Vajhøj <arne@vajhoej.dk> - 2011-09-30 21:21 -0400
Re: Phht! on screenscaping Arved Sandstrom <asandstrom3minus1@eastlink.ca> - 2011-09-30 22:53 -0300
Re: Phht! on screenscaping Arne Vajhøj <arne@vajhoej.dk> - 2011-09-30 22:11 -0400
Re: Phht! on screenscaping Arved Sandstrom <asandstrom3minus1@eastlink.ca> - 2011-10-01 09:09 -0300
Re: Phht! on screenscaping Arne Vajhøj <arne@vajhoej.dk> - 2011-10-01 15:48 -0400
Re: Phht! on screenscaping Roedy Green <see_website@mindprod.com.invalid> - 2011-10-01 19:22 -0700
Re: Phht! on screenscaping Movable Hype <mhype101@snortwad.net> - 2011-10-02 03:40 +0000
Re: Phht! on screenscaping Arved Sandstrom <asandstrom3minus1@eastlink.ca> - 2011-10-02 10:20 -0300
Re: Phht! on screenscaping Lew <lewbloch@gmail.com> - 2011-10-02 08:43 -0700
Re: Phht! on screenscaping Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2011-10-02 16:22 -0700
Re: Phht! on screenscaping Martin Gregorie <martin@address-in-sig.invalid> - 2011-10-02 12:11 +0000
Re: Phht! on screenscaping Roedy Green <see_website@mindprod.com.invalid> - 2011-10-01 19:03 -0700
Re: Phht! on screenscaping Arne Vajhøj <arne@vajhoej.dk> - 2011-11-06 18:06 -0500
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-09-30 09:04 -0700 |
| Subject | Phht! on screenscaping |
| Message-ID | <kjpb87ppeu4etii296ulk595m26poim048@4ax.com> |
On my website I have links to 20 different bookstores. The problem is that there is no guarantee every bookstore actually carries any given book. I wanted to grey out links to bookstores that don't currently carry that particular book.

This means probing every bookstore with every ISBN to see if they have it. I discovered I needed an average of 8 marker strings per bookstore to analyse the response. There are about 4 different ways they say they have the book and 4 to say they do not. I found this by trial and error: adding more and more strings and seeing if there were responses that could not be categorised, then translating and examining the responses for likely markers, then looking at the original. This was complicated somewhat since some of the bookstores are in German, French, Italian and Spanish.

As the bookstores change their wordings, I will have to keep adjusting my program to track.

All this would be so much easier if the bookstores would offer an alternate computer-friendly API. You could give them an ISBN, and they could give you back some XML, JSON, CSV etc., with a single Yes/No instock field. It would take them all of an hour to cook something up. Sometimes they do it, but make it so complicated and so volatile you might as well screenscrape.

Ditto companies that sell posters, or sell anything else via affiliates: they need that sort of API.

--
Roedy Green Canadian Mind Products
http://mindprod.com
It should not be considered an error when the user starts something already started or stops something already stopped. This applies to browsers, services, editors... It is inexcusable to punish the user by requiring some elaborate sequence to atone, e.g. open the task editor, find and kill some processes.
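The marker-string approach described in this post might look roughly like the Java sketch below. It is not Roedy's actual program: the class name `StockMarkerClassifier` and every marker string are hypothetical, chosen only for illustration. Each store gets one list of "has it" markers and one list of "does not have it" markers, and anything that matches neither is reported as unknown so that a new wording can be spotted and added.

```java
import java.util.List;
import java.util.Locale;

/** Classifies one bookstore's HTML reply by looking for marker strings (hypothetical sketch). */
final class StockMarkerClassifier {
    enum Stock { IN_STOCK, NOT_IN_STOCK, UNKNOWN }

    private final List<String> inStockMarkers;
    private final List<String> notInStockMarkers;

    StockMarkerClassifier(List<String> inStockMarkers, List<String> notInStockMarkers) {
        this.inStockMarkers = inStockMarkers;
        this.notInStockMarkers = notInStockMarkers;
    }

    Stock classify(String html) {
        String page = html.toLowerCase(Locale.ROOT);
        // Negative markers are checked first because a phrase such as
        // "not in stock" also contains the positive marker "in stock".
        for (String marker : notInStockMarkers) {
            if (page.contains(marker.toLowerCase(Locale.ROOT))) {
                return Stock.NOT_IN_STOCK;
            }
        }
        for (String marker : inStockMarkers) {
            if (page.contains(marker.toLowerCase(Locale.ROOT))) {
                return Stock.IN_STOCK;
            }
        }
        // No marker matched: the store probably changed its wording.
        return Stock.UNKNOWN;
    }

    public static void main(String[] args) {
        // Marker strings below are invented examples, not taken from any real store.
        StockMarkerClassifier store = new StockMarkerClassifier(
                List.of("add to cart", "in stock"),
                List.of("out of stock", "not in stock", "nicht lieferbar"));
        System.out.println(store.classify("<p>Currently out of stock</p>"));
    }
}
```

Returning `UNKNOWN` rather than guessing matches the trial-and-error process in the post: uncategorised responses are the signal that another marker string is needed.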
| From | markspace <-@.> |
|---|---|
| Date | 2011-09-30 10:10 -0700 |
| Message-ID | <j64t5u$1kc$1@dont-email.me> |
| In reply to | #8418 |
On 9/30/2011 9:04 AM, Roedy Green wrote:
> It would take them all of an hour to cook something
> up.

Uh, right. You didn't used to work in management, perhaps?
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Date | 2011-09-30 10:24 -0700 |
| Message-ID | <4Xmhq.126$x14.66@newsfe22.iad> |
| In reply to | #8418 |
On 9/30/11 9:04 AM, Roedy Green wrote:
> All this would be so much easier if the bookstores would offer an
> alternate computer-friendly api. You could give them an ISBN, and
> they could give you back some XML, JSON, CSV etc, with a single Yes/No
> instock field. It would take them all of an hour to cook something
> up. Sometimes they do it, but make it so complicated and so volatile
> you might as well screenscrape.
>
> Ditto companies that sell posters, or sell anything else via
> affiliates need that sort of API.

As an employee of a company that has introduced an XML API (which is used both internally and externally), I can say from experience that it takes far more than an hour to cook up. Not only that, but it requires constant maintenance and operational support.

It *is* worth it for the company because it provides benefits (it is easier to support a front-end webapp which doesn't connect to databases, easy to provide data to partners, etc.). For an average bookstore, however, providing that data through an API may actually cost them money rather than save it. The data itself probably has value, and the maintenance of the system to provide that data has a cost.

I would love it if all data was available freely (as in free speech and free beer). I would also love it if all data could be standardized and normalized appropriately. I'd also like a unicorn and world piece. I think all of those things come together, but I'd expect to see a unicorn before any of the others. Genetically engineer a narwhal crossed with a pony. The rest is much more complicated.
| From | markspace <-@.> |
|---|---|
| Date | 2011-09-30 10:30 -0700 |
| Message-ID | <j64ucg$9bt$1@dont-email.me> |
| In reply to | #8421 |
On 9/30/2011 10:24 AM, Daniel Pitts wrote:
> I'd also like a unicorn and world piece.

Which piece would you like? ;-)
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Date | 2011-09-30 10:40 -0700 |
| Message-ID | <hanhq.575$C11.439@newsfe15.iad> |
| In reply to | #8422 |
On 9/30/11 10:30 AM, markspace wrote:
> On 9/30/2011 10:24 AM, Daniel Pitts wrote:
>> I'd also like a unicorn and world piece.
>
> Which piece would you like? ;-)

The part that has homonym checkers ;-) And world peace :-)
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-09-30 21:19 -0400 |
| Message-ID | <4e866aad$0$288$14726298@news.sunsite.dk> |
| In reply to | #8422 |
On 9/30/2011 1:30 PM, markspace wrote:
> On 9/30/2011 10:24 AM, Daniel Pitts wrote:
>> I'd also like a unicorn and world piece.
>
> Which piece would you like? ;-)

He should pick something with either gold or oil! :-)

Arne
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-09-30 21:21 -0400 |
| Message-ID | <4e866b0c$0$288$14726298@news.sunsite.dk> |
| In reply to | #8418 |
On 9/30/2011 12:04 PM, Roedy Green wrote:
> On my website I have links to 20 different bookstores. The problem
> is, there is no guarantee all the bookstores actually carry any given
> book. I wanted to grey-out links to bookstores that don't for now
> carry that particular book.
>
> This means probing every bookstore with every ISBN to see if they have
> it. I discovered I needed an average of 8 marker strings to analyse
> the response. There are about 4 different ways they say they have the
> book and 4 to say they do not. I found this by trial and error,
> adding more and more strings and seeing if there were responses that
> could not be categorised, then translating and examining the responses
> for likely markers, then looking at the original. This was complicated
> somewhat since some of the bookstores are in German, French, Italian
> and Spanish.
>
> As the bookstores change their wordings, I will have to keep adjusting
> my program to track.

A lot of work. And potentially not legal.

> All this would be so much easier if the bookstores would offer an
> alternate computer-friendly api. You could give them an ISBN, and
> they could give you back some XML, JSON, CSV etc, with a single Yes/No
> instock field. It would take them all of an hour to cook something
> up.

If you have worked in professional software development then you would know that there is no such thing as adding a new feature for 1 hour of work.

Arne
| From | Arved Sandstrom <asandstrom3minus1@eastlink.ca> |
|---|---|
| Date | 2011-09-30 22:53 -0300 |
| Message-ID | <Oouhq.716$jh2.114@newsfe19.iad> |
| In reply to | #8418 |
On 11-09-30 01:04 PM, Roedy Green wrote:
> On my website I have links to 20 different bookstores. The problem
> is, there is no guarantee all the bookstores actually carry any given
> book. I wanted to grey-out links to bookstores that don't for now
> carry that particular book.
>
> This means probing every bookstore with every ISBN to see if they have
> it. I discovered I needed an average of 8 marker strings to analyse
> the response. There are about 4 different ways they say they have the
> book and 4 to say they do not. I found this by trial and error,
> adding more and more strings and seeing if there were responses that
> could not be categorised, then translating and examining the responses
> for likely markers, then looking at the original. This was complicated
> somewhat since some of the bookstores are in German, French, Italian
> and Spanish.
>
> As the bookstores change their wordings, I will have to keep adjusting
> my program to track.
>
> All this would be so much easier if the bookstores would offer an
> alternate computer-friendly api. You could give them an ISBN, and
> they could give you back some XML, JSON, CSV etc, with a single Yes/No
> instock field. It would take them all of an hour to cook something
> up. Sometimes they do it, but make it so complicated and so volatile
> you might as well screenscrape.
>
> Ditto companies that sell posters, or sell anything else via
> affiliates need that sort of API.

An hour? Even if one single bookstore decided to do that with their own proprietary API, and they owned their own server and had a dedicated developer on staff, it still wouldn't happen quite that quickly. And how would you, the consumer, then find out about this API? You don't really believe in things like UDDI still, right? And assuming you did have some way of discovering the API, you'd still have to adapt your own client code for it.

Way over an hour.

And _whose_ API is that? Individual bookstore API? Not practical. So does a chain decide to do that instead? Committees, approvals. Months of work. Industry-wide consortium, conflicting with existing proprietary APIs? Years or never.

You're actually better off screenscraping. I definitely don't see how this would be more work than dealing with thousands of different APIs.

AHS
--
I tend to watch a little TV... Court TV, once in a while. Some of the cases I get interested in.
-- O. J. Simpson
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-09-30 22:11 -0400 |
| Message-ID | <4e8676f3$0$291$14726298@news.sunsite.dk> |
| In reply to | #8432 |
On 9/30/2011 9:53 PM, Arved Sandstrom wrote:
> You're actually better off screenscraping. I definitely don't see how
> this would be more work than dealing with thousands of different APIs.

I can see two advantages of API over screen scraping for the consuming side of the service:
* more robust in regard to handling unusual data
* easier to see what to change when a new version comes out (they may even announce changes to an API in advance)

Arne
| From | Arved Sandstrom <asandstrom3minus1@eastlink.ca> |
|---|---|
| Date | 2011-10-01 09:09 -0300 |
| Message-ID | <0qDhq.1519$jh2.616@newsfe19.iad> |
| In reply to | #8433 |
On 11-09-30 11:11 PM, Arne Vajhøj wrote:
> On 9/30/2011 9:53 PM, Arved Sandstrom wrote:
>> You're actually better off screenscraping. I definitely don't see how
>> this would be more work than dealing with thousands of different APIs.
>
> I can see two advantages of API over screen scraping for the consuming
> side of the service:
> * more robust in regard to handling unusual data
> * easier to see what to change when a new version comes out (they
>   may even announce changes to an API in advance)
>
> Arne

Well, according to Roedy he's got a not overly-complicated-sounding screenscraping algorithm that works for roughly 20 bookstore websites, and there's no reason to believe that if he added another 20 sites to the list that the algorithm would change substantially. Unless all of the bookstores that he is interested in offered the same useful API, he'd still have to have the screenscraping code handy.

Besides, assuming it was legal, *Roedy* could offer the API as a service. He's the aggregating screenscraper, does all the heavy-lifting, and other people can query *his* web service.

AHS
--
I tend to watch a little TV... Court TV, once in a while. Some of the cases I get interested in.
-- O. J. Simpson
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-10-01 15:48 -0400 |
| Message-ID | <4e876ead$0$293$14726298@news.sunsite.dk> |
| In reply to | #8439 |
On 10/1/2011 8:09 AM, Arved Sandstrom wrote:
> On 11-09-30 11:11 PM, Arne Vajhøj wrote:
>> On 9/30/2011 9:53 PM, Arved Sandstrom wrote:
>>> You're actually better off screenscraping. I definitely don't see how
>>> this would be more work than dealing with thousands of different APIs.
>>
>> I can see two advantages of API over screen scraping for the consuming
>> side of the service:
>> * more robust in regard to handling unusual data
>> * easier to see what to change when a new version comes out (they
>>   may even announce changes to an API in advance)
>
> Well, according to Roedy he's got a not overly-complicated-sounding
> screenscraping algorithm that works for roughly 20 bookstore websites,
> and there's no reason to believe that if he added another 20 sites to
> the list that the algorithm would change substantially. Unless all of
> the bookstores, that he is interested in, offered the same useful API,
> he'd still have to have the screenscraping code handy.

Until everybody does it right, he would still need the hack.

But the number of cases and changes should decrease as the number of non-API sites gets smaller.

Arne
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-10-01 19:22 -0700 |
| Message-ID | <qmhf87h6hrerra99rsf9u758ub5p3vt22t@4ax.com> |
| In reply to | #8439 |
On Sat, 01 Oct 2011 09:09:31 -0300, Arved Sandstrom <asandstrom3minus1@eastlink.ca> wrote, quoted or indirectly quoted someone who said :
>Besides, assuming it was legal, *Roedy* could offer the API as a
>service. He's the aggregating screenscraper, does all the heavy-lifting,
>and other people can query *his* web service

There are all kinds of companies doing just that, though they don't think of themselves that way.

See http://mindprod.com/jgloss/bookstores.html

There are many services that let you find out which bookstores carry a given book. If you order through them, they get a finder's fee.

They have my problem magnified many times, since they may be polling 200+ bookstores. I poll only 20.

If we had a common API to get info about a book from a store, this programming task would be trivial and would not require constant maintenance. Further, it would not fail in production. No bookstore gives any warning that it is changing the format of its pages, or is adding or changing wordings.

Further, you would not need to deal with many languages in your code.

It would not be that hard to come up with a format for the file and an API to fetch it, and even to write a sample client and server app. The hard part is political: selling it. Perhaps Google might ask its customers to implement it, or the ISBN people.

Perhaps somebody has already done that. It would just take inquiries to bookstores asking them for the URL to access the XXX API.

--
Roedy Green Canadian Mind Products
http://mindprod.com
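To make the contrast with scraping concrete, here is a minimal sketch of what a client for such a common API could look like on the consuming side. No such API exists in this thread: the reply layout, the `<instock>` element name, the `yes`/`no` values, the class name `StockReplyParser` and the placeholder ISBN are all assumptions made up for illustration. Only standard JDK XML classes are used.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

/** Reads the single yes/no field from a hypothetical bookstore reply. */
final class StockReplyParser {

    static boolean isInStock(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("reply is not well-formed XML", e);
        }
        // Only the one field we care about is read; unknown elements are
        // ignored, so a store could add fields later without breaking clients.
        NodeList nodes = doc.getElementsByTagName("instock");
        if (nodes.getLength() != 1) {
            throw new IllegalArgumentException("expected exactly one <instock> element");
        }
        String value = nodes.item(0).getTextContent().trim();
        if (value.equalsIgnoreCase("yes")) {
            return true;
        }
        if (value.equalsIgnoreCase("no")) {
            return false;
        }
        throw new IllegalArgumentException("unexpected instock value: " + value);
    }

    public static void main(String[] args) {
        // Hypothetical reply; the ISBN is a placeholder, not a real book.
        String reply = "<book><isbn>9780000000002</isbn><instock>yes</instock></book>";
        System.out.println(isInStock(reply));
    }
}
```

The point of the sketch is that the same few lines would serve every store and every language, whereas the scraping approach needs a separate set of marker strings per site.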
| From | Movable Hype <mhype101@snortwad.net> |
|---|---|
| Date | 2011-10-02 03:40 +0000 |
| Message-ID | <j68mg1$k28$1@dont-email.me> |
| In reply to | #8474 |
Roedy Green <see_website@mindprod.com.invalid> writes:
> asking them the URL to access the XXX API.

I've never heard it called *that* before!
| From | Arved Sandstrom <asandstrom3minus1@eastlink.ca> |
|---|---|
| Date | 2011-10-02 10:20 -0300 |
| Message-ID | <jyZhq.1121$Gu.344@newsfe14.iad> |
| In reply to | #8474 |
On 11-10-01 11:22 PM, Roedy Green wrote:
> On Sat, 01 Oct 2011 09:09:31 -0300, Arved Sandstrom
> <asandstrom3minus1@eastlink.ca> wrote, quoted or indirectly quoted
> someone who said :
>
>> Besides, assuming it was legal, *Roedy* could offer the API as a
>> service. He's the aggregating screenscraper, does all the heavy-lifting,
>> and other people can query *his* web service
>
> there are all kinds of companies doing just that, though they don't
> think of themselves that way.
>
> See http://mindprod.com/jgloss/bookstores.html
>
> There are many services that let you find out which bookstores carry a
> given book. If you order through them, they get a finder's fee.
>
> They have my problem magnified many times, since they may be polling
> 200+ bookstores. I poll only 20.
>
> If we had a common API to get info about a book from a store, this
> programming task would be trivial and would not require constant
> maintenance. Further, it would not fail in production. No bookstore
> gives any warning that is changing the format of its pages, or is
> adding or changing wordings.
>
> Further, you would not need to deal with many languages in your code.
>
> It would not be that hard to come up with a format of the file and an
> API to fetch it, and even write a sample client and server app. The
> hard part is political, selling it. Perhaps Google might ask its
> customers to implement it, or the ISBN people.
>
> Perhaps somebody has already done that. It would just take inquiries
> to bookstore asking them the URL to access the XXX API.

You have described the problem well. I am no expert in this domain, but two existing APIs that stand out in this discussion are the Google Books API (http://code.google.com/apis/books/docs/v1/getting_started.html) and the Amazon Product Advertising API (https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html).

For example, in the Amazon API the ItemSearch and SimilarityLookup web service operations are just your ticket. Google Books API has 'list' and 'get' as REST actions.

Neither of these actually helps with our problem; they are just examples of what we would like to have. You're right that the problem is primarily political; it's getting myriad bookstores to adopt a Simple Bookstore API.

It's not completely trivial technologically, though: your WSDL will be uniform, but you'd need to write and provide implementations for PHP and Java and all your other target languages. And _those_ implementations would probably need to be written as SPIs, so that appropriate code in each bookstore's backend logic (for their existing website) can be identified and plugged in (likely with adapters).

AHS
--
I tend to watch a little TV... Court TV, once in a while. Some of the cases I get interested in.
-- O. J. Simpson
| From | Lew <lewbloch@gmail.com> |
|---|---|
| Date | 2011-10-02 08:43 -0700 |
| Message-ID | <10974580.2771.1317570197215.JavaMail.geo-discussion-forums@preb19> |
| In reply to | #8482 |
Arved Sandstrom wrote:
> Roedy Green wrote:
>> If we had a common API to get info about a book from a store, this
>> programming task would be trivial and would not require constant
>> maintenance. Further, it would not fail in production. No bookstore
>> gives any warning that is changing the format of its pages, or is
>> adding or changing wordings.
>>
>> Further, you would not need to deal with many languages in your code.
>>
>> It would not be that hard to come up with a format of the file and an
>> API to fetch it, and even write a sample client and server app. The
>> hard part is political, selling it. Perhaps Google might ask its
>> customers to implement it, or the ISBN people.
>>
>> Perhaps somebody has already done that. It would just take inquiries
>> to bookstore asking them the URL to access the XXX API.
>
> You have described the problem well. I am no expert in this domain, but
> two existing APIs that stand out in this discussion are the Google Books
> API (http://code.google.com/apis/books/docs/v1/getting_started.html) and
> the Amazon Product Advertising API
> (https://affiliate-program.amazon.com/gp/advertising/api/detail/main.html).
>
> For example, in the Amazon API the ItemSearch and SimilarityLookup web
> service operations are just your ticket. Google Books API has 'list' and
> 'get' as REST actions.
>
> Neither of these actually help our problem; they are just examples of
> what we would like to have. You're right that the problem is primarily
> political; it's getting myriad bookstores to adopt a Simple Bookstore API.
>
> It's not completely trivial technologically, though: your WSDL will be
> uniform, but you'd need to write and provide implementations for PHP and
> Java and all your other target languages. And _those_ implementations
> would probably need to be written as SPIs, so that appropriate code in
> each bookstore's backend logic (for their existing website) can be
> identified and plugged in (likely with adapters).

http://xkcd.com/927/

--
Lew
| From | Daniel Pitts <newsgroup.nospam@virtualinfinity.net> |
|---|---|
| Date | 2011-10-02 16:22 -0700 |
| Message-ID | <fn6iq.940$6S1.933@newsfe16.iad> |
| In reply to | #8484 |
On 10/2/11 8:43 AM, Lew wrote:
> http://xkcd.com/927/

lol, I actually thought of that comic when the thread started.
| From | Martin Gregorie <martin@address-in-sig.invalid> |
|---|---|
| Date | 2011-10-02 12:11 +0000 |
| Message-ID | <j69kd8$ren$2@localhost.localdomain> |
| In reply to | #8474 |
On Sat, 01 Oct 2011 19:22:40 -0700, Roedy Green wrote:
> It would not be that hard to come up with a format of the file and an
> API to fetch it, and even write a sample client and server app. The hard
> part is political, selling it. Perhaps Google might ask its customers
> to implement it, or the ISBN people.

It already exists. Take a look at UNIMARC, a machine-readable way of formatting bibliographic data for interchange, which was developed by IFLA (The International Federation of Library Associations and Institutions). To my eyes it looks remarkably similar to the SWIFT financial message formats: both use field ID tags, fields may have subfields, and there is a field terminator symbol (@ in UNIMARC, newline in SWIFT).

There's an introduction to UNIMARC with a clear example here:
http://archive.ifla.org/VI/3/p1996-1/unimarc.htm

This type of thing works pretty well: SWIFT message formats are a de facto financial message interchange format despite originally being intended for use only on the SWIFT network, and years ago I was part of a team that used the music cataloguing version of UNIMARC as the basis of an online searchable music catalogue that formed part of the BBC Radio 3 music planning and production system.

--
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |
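The tag/subfield/terminator structure described in this post is simple to parse, which is part of its appeal. The Java sketch below handles a deliberately simplified, text-only stand-in for that structure and should not be read as a real UNIMARC reader: it assumes each field is a tag followed by subfields introduced by `$` and a one-character code, with `@` ending the field, and it ignores indicators and the record header entirely. The class name `TaggedRecordParser` and the sample record contents are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Parses a simplified tagged record: TAG$aValue$bValue@TAG$aValue@ (illustrative only). */
final class TaggedRecordParser {

    /** One field: its tag plus subfields keyed by their one-character code. */
    static final class Field {
        final String tag;
        final Map<Character, String> subfields = new LinkedHashMap<>();

        Field(String tag) {
            this.tag = tag;
        }
    }

    static List<Field> parse(String record) {
        List<Field> fields = new ArrayList<>();
        // '@' terminates each field in this simplified notation.
        for (String raw : record.split("@")) {
            if (raw.isEmpty()) {
                continue;
            }
            // '$' introduces a subfield; the text before the first '$' is the tag.
            String[] parts = raw.split("\\$");
            Field field = new Field(parts[0].trim());
            for (int i = 1; i < parts.length; i++) {
                if (parts[i].isEmpty()) {
                    continue;
                }
                field.subfields.put(parts[i].charAt(0), parts[i].substring(1));
            }
            fields.add(field);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Invented record: tags, codes and values are placeholders.
        String record = "010$a0000000000@200$aAn Example Title$fA. N. Author@";
        for (Field field : parse(record)) {
            System.out.println(field.tag + " " + field.subfields);
        }
    }
}
```

Because every field announces its own tag, a client can skip tags it does not understand, which is the same "sits still or is upward compatible" property argued for elsewhere in the thread.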
| From | Roedy Green <see_website@mindprod.com.invalid> |
|---|---|
| Date | 2011-10-01 19:03 -0700 |
| Message-ID | <idhf87pjoue8pon7pq3bmast6d36p66qul@4ax.com> |
| In reply to | #8432 |
On Fri, 30 Sep 2011 22:53:49 -0300, Arved Sandstrom <asandstrom3minus1@eastlink.ca> wrote, quoted or indirectly quoted someone who said :
>Way over an hour.

The whole process, yes. Writing the code to generate the file, given that the info was in a database, I doubt would take me more than an hour.

On the other hand, when you total up the hours spent on client and server side code, it is orders of magnitude easier to create and maintain a programmer API.

The big advantage is not so much the format, but that it sits still (or is XML-style upward compatible) and has no undocumented variants.

--
Roedy Green Canadian Mind Products
http://mindprod.com
| From | Arne Vajhøj <arne@vajhoej.dk> |
|---|---|
| Date | 2011-11-06 18:06 -0500 |
| Message-ID | <4eb712f3$0$287$14726298@news.sunsite.dk> |
| In reply to | #8473 |
On 10/1/2011 10:03 PM, Roedy Green wrote:
> On Fri, 30 Sep 2011 22:53:49 -0300, Arved Sandstrom
> <asandstrom3minus1@eastlink.ca> wrote, quoted or indirectly quoted
> someone who said :
>
>> Way over an hour.
>
> The whole process yes, writing the code to generate the file given the
> info was in a database I doubt would take me more than an hour.

Well, those companies have to pay for the whole process, not just for the code typing.

But if you offer them money for providing the API so it will not be a cost to them, then they may consider doing it.

Arne