Path: csiph.com!usenet.pasdenom.info!weretis.net!feeder4.news.weretis.net!usenet.ukfsn.org!not-for-mail
From: Martin Gregorie <martin@address-in-sig.invalid>
Newsgroups: comp.lang.java.programmer
Subject: Re: Pattern suggestion
Date: Sun, 15 Apr 2012 19:56:23 +0000 (UTC)
Organization: UK Free Software Network
Lines: 67
Message-ID: <jmf957$8p7$1@localhost.localdomain>
References: <jmel0t$jrh$1@news2.carnet.hr> <WradnUhB_qEbaRfSnZ2dnUVZ_i2dnZ2d@earthlink.com> <b6Dir.3201$bU5.353@newsfe04.iad>
NNTP-Posting-Host: 84.45.235.129
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: localhost.localdomain 1334519783 8999 84.45.235.129 (15 Apr 2012 19:56:23 GMT)
X-Complaints-To: usenet@localhost.localdomain
NNTP-Posting-Date: Sun, 15 Apr 2012 19:56:23 +0000 (UTC)
User-Agent: Pan/0.135 (Tomorrow I'll Wake Up and Scald Myself with Tea; GIT 30dc37b master)
Xref: csiph.com comp.lang.java.programmer:13562

On Sun, 15 Apr 2012 13:57:42 -0300, Arved Sandstrom wrote:

> On 12-04-15 01:17 PM, Patricia Shanahan wrote:
>> On 4/15/2012 7:11 AM, FrenKy wrote:
>>> Hi *,
>>> I have a huge file (~10GB) which I'm reading line by line. Each line
>>> has to be analyzed by many number of different analyzers. The problem
>>> I have is that to make it at least a bit performance optimized due to
>>> sometimes time consuming processing (usually because of delays due to
>>> external interfaces) i would need to make it heavily multithreaded.
>>> File should be read only once to reduce IO on disks.
>>>
>>> So I need "1 driver to many workers" pattern where workers are
>>> multithreaded.
>>>
>>> I have a solution now based on Observable/Observer that I use (and it
>>> works) but I'm not sure if it is the best way.
>> 
>> I suggest taking a look at java.util.concurrent.ThreadPoolExecutor and
>> related classes.
>> 
>> Try to minimize ordering relationships between processing on the lines,
>> so that you can overlap work on multiple lines as much as possible.
>> 
>> Patricia
> 
> I agree. A problem description like this, java.util.concurrent is the
> first thing that pops into my head. markspace mentioned map-reduce, and
> there is specifically fork-join in Java 1.7 (they are similar insofar as
> they are algorithms for dividing problems); I don't know if any of
> that's involved because the line analysis may be independent. IOW, this
> may not be a distributable problem, this may be millions of individual
> problems.
> 
> java.util.concurrent will definitely have something. It could well be
> that the processing of each line is isolated, and I'd assuredly be
> thinking of ThreadPoolExecutor or something similar for managing these.
> It has a lot of tuning options including queues. If the analyzers for
> each line have to coordinate (and maybe there's some final processing
> after all complete) there are classes for that too, like CyclicBarrier.
> 
Yes. Since the OP gives no indication that the analysers needed for each 
line can be selected by some sort of fast, simple inspection, about all 
you can do is:

    foreach line l
        foreach analyser a
            start a thread for a(l)
        wait for all threads to finish
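
In Java that loop might look something like the sketch below. It is only
an illustration under a few assumptions: an analyser is modelled as a
Function<String, String>, the class name LineFanOut and the method
analyseLine are made up for the example, and a StringReader stands in for
the real 10GB file. It also hands a(l) to a fixed thread pool, as
Patricia suggested, rather than starting a raw thread each time.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

public class LineFanOut {

    /** Runs every analyser on one line in parallel and waits for all of them. */
    static List<String> analyseLine(String line,
                                    List<Function<String, String>> analysers,
                                    ExecutorService pool) {
        // One task per analyser: this is the "start a thread for a(l)" step.
        List<Callable<String>> tasks = new ArrayList<>();
        for (Function<String, String> a : analysers) {
            tasks.add(() -> a.apply(line));
        }
        List<String> results = new ArrayList<>();
        try {
            // invokeAll blocks until every task is done ("wait for all threads
            // to finish") and returns the futures in task order.
            for (Future<String> f : pool.invokeAll(tasks)) {
                results.add(f.get());
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while analysing", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("analyser failed", e.getCause());
        }
        return results;
    }

    public static void main(String[] args) throws IOException {
        // Two trivial stand-in analysers; the real ones would be the slow part.
        List<Function<String, String>> analysers = new ArrayList<>();
        analysers.add(l -> "length=" + l.length());
        analysers.add(l -> "upper=" + l.toUpperCase());

        ExecutorService pool = Executors.newFixedThreadPool(analysers.size());
        // The file is read exactly once, line by line.
        try (BufferedReader in = new BufferedReader(
                new StringReader("first line\nsecond line\n"))) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(analyseLine(line, analysers, pool));
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

Each line still takes as long as its slowest analyser, which is the
point: the pool only saves the cost of creating threads, it doesn't
change the shape of the loop.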

At first glance you might think using a queue per analyser would help but, 
with the data volumes quoted, that will soon fall apart if any analyser is 
more than trivially slower than the rest. As the OP has already said that 
some analysers can be much slower due to external interface delays (I 
presume that means waiting for DNS queries, etc.), I think he's stuck 
with the sort of logic I sketched out. Once processing has got under way 
and any analyser-specific queues have filled up (with ~10GB of input they 
can't be allowed to grow without limit), the reader has to wait for the 
slowest analyser anyway, so the performance of any more complex logic 
will degrade to the above long before the input has been completely read 
and processed. 

In summary, don't try to do anything more sophisticated than the above.
 

-- 
martin@   | Martin Gregorie
gregorie. | Essex, UK
org       |