Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.java.programmer > #19526 > unrolled thread

startsWith datastructure

Started byRoedy Green <see_website@mindprod.com.invalid>
First post2012-10-27 03:50 -0700
Last post2012-10-27 13:56 +0200
Articles 4 — 3 participants

Back to article view | Back to comp.lang.java.programmer


Contents

  startsWith datastructure Roedy Green <see_website@mindprod.com.invalid> - 2012-10-27 03:50 -0700
    Re: startsWith datastructure Roedy Green <see_website@mindprod.com.invalid> - 2012-10-27 04:13 -0700
    Re: startsWith datastructure "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-10-27 12:50 +0100
    Re: startsWith datastructure Robert Klemme <shortcutter@googlemail.com> - 2012-10-27 13:56 +0200

#19526 — startsWith datastructure

FromRoedy Green <see_website@mindprod.com.invalid>
Date2012-10-27 03:50 -0700
SubjectstartsWith datastructure
Message-ID<rhen88pmvervpnbdkg32cgokujup957b08@4ax.com>
Lets say you had a list of book publishers and the block of ISBN
numbers they own. e.g. Prentice-Hall owns all isbns of the form
013XXXXXXXXXX,
or put another way, own all ISBNs that startsWith "013".
Some publishers such as O'Reilly own multiple blocks.

     013, Prentice Hall
     014, Penguin
    0201, Addison-Wesley
    0393, H. W. Norton
    0471, John Wiley & Sons
    0553, Bantam
  059600, O'Reilly
  156592, O'Reilly
    0060, Harper
   06723, Sams
   07356, Microsoft Press
   07432, Simon & Schuster
   07710, McClelland & Stewart
   08125, Tor
  086571, New Society Publishers
 0915972, Love Line
15795106, Ronin Publishing

etc.

Your list is incomplete. It has only the major publishers.

The problem is, given an ISBN, can you make at guess at its publisher?

What sort of  datastructure/lookup mechanism would you use?

Obviously, you could just do a linear search looking for a match, and
unless the list got very long, that should suffice,  but, just for
fun, say you wanted something faster, what would you do?

This is just an example of a class of problem I wanted to write a
little essay on for the Java glossary with possible solutions.
-- 
Roedy Green Canadian Mind Products http://mindprod.com
There are four possible ways to poke a card into a slot.
Nearly always, only one way works. To me that betrays a 
Fascist mentality, demanding customers conform to some 
arbitrary rule, and hassling them to discover the magic 
orientation. The polite way to do it is to design the reader 
slot so that all four ways work, or so that all the customer 
has to do is put the card in the vicinity of the reader. 

[toc] | [next] | [standalone]


#19527

FromRoedy Green <see_website@mindprod.com.invalid>
Date2012-10-27 04:13 -0700
Message-ID<7dgn889jf31tm6skhck3kojeabvt4a1tf4@4ax.com>
In reply to#19526
On Sat, 27 Oct 2012 03:50:53 -0700, Roedy Green
<see_website@mindprod.com.invalid> wrote, quoted or indirectly quoted
someone who said :

>     013, Prentice Hall
>     014, Penguin

More precisely, I should have written that as:

>    978013, Prentice Hall
>    978014, Penguin
-- 
Roedy Green Canadian Mind Products http://mindprod.com
Let all things be done decently and in order. ~ I Corinthians 14:40.
Apparently Jehovah disapproves of Java Threads.

[toc] | [prev] | [next] | [standalone]


#19529

From"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Date2012-10-27 12:50 +0100
Message-ID<opidneaQWYhVVxbNnZ2dnUVZ8tSdnZ2d@bt.com>
In reply to#19526
Roedy Green wrote:
> Lets say you had a list of book publishers and the block of ISBN
> numbers they own. e.g. Prentice-Hall owns all isbns of the form
> 013XXXXXXXXXX,
> or put another way, own all ISBNs that startsWith "013".
> Some publishers such as O'Reilly own multiple blocks.
> [...]
> What sort of  datastructure/lookup mechanism would you use?

Some sort of trie ?  A 10-way Paticia(-ish) trie might be fun.

Or maybe a sorted array of prefixes (maybe in pairs of start- and end-points), 
and binary search on the array ?

Or use your favorite regexp compiler to generate a linear-time characterisation 
function from your list of ISBN prefixes (it's basically the same as lexical 
analysis).  You could even try to use the platform-supplied Java regexp stuff 
...if you can stomach that.

Or stuff it into a SQL (or similar) database and use:
    select publisher_name
    from Publishers
    where <input> between first_ISBN and last_ISBN;

(Or even, if your DBMs supports it (efficiently):
    where concat(ISBN_prefix, '%') like <input>
or, with the same caveat:
    where left(<input>, length(ISBN_prefix)) = <input>
or...
)

    -- chris 

[toc] | [prev] | [next] | [standalone]


#19530

FromRobert Klemme <shortcutter@googlemail.com>
Date2012-10-27 13:56 +0200
Message-ID<af20epFsifeU1@mid.individual.net>
In reply to#19526
On 27.10.2012 12:50, Roedy Green wrote:
> Lets say you had a list of book publishers and the block of ISBN
> numbers they own. e.g. Prentice-Hall owns all isbns of the form
> 013XXXXXXXXXX,
> or put another way, own all ISBNs that startsWith "013".
> Some publishers such as O'Reilly own multiple blocks.
>
>       013, Prentice Hall
>       014, Penguin
>      0201, Addison-Wesley
>      0393, H. W. Norton
>      0471, John Wiley & Sons
>      0553, Bantam
>    059600, O'Reilly
>    156592, O'Reilly
>      0060, Harper
>     06723, Sams
>     07356, Microsoft Press
>     07432, Simon & Schuster
>     07710, McClelland & Stewart
>     08125, Tor
>    086571, New Society Publishers
>   0915972, Love Line
> 15795106, Ronin Publishing
>
> etc.
>
> Your list is incomplete. It has only the major publishers.
>
> The problem is, given an ISBN, can you make at guess at its publisher?
>
> What sort of  datastructure/lookup mechanism would you use?
>
> Obviously, you could just do a linear search looking for a match, and
> unless the list got very long, that should suffice,  but, just for
> fun, say you wanted something faster, what would you do?
>
> This is just an example of a class of problem I wanted to write a
> little essay on for the Java glossary with possible solutions.

That's a typical use case for a Trie:
http://en.wikipedia.org/wiki/Trie

Other than that you could use a TreeSet and make use of method subSet() 
where the lower bound is the fragment you got and the upper bound is the 
fragment with the last character increased "by one" or a "max" character 
appended.

Or you use a sorted array and use Arrays.binarySearch().  The Trie is 
more efficient though.

Cheers

	robert


-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/

[toc] | [prev] | [standalone]


Back to top | Article view | comp.lang.java.programmer


csiph-web