Path: csiph.com!news.mixmin.net!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Newsgroups: comp.programming
Subject: Re: Another little puzzle
Date: Sun, 25 Dec 2022 01:05:06 -0800
Organization: A noiseless patient Spider
Lines: 144
Message-ID: <86mt7byhtp.fsf@linuxsc.com>
References: <puzzle-20221214131815@ram.dialup.fu-berlin.de> <algorithm-20221221130021@ram.dialup.fu-berlin.de> <86pmcczcak.fsf@linuxsc.com> <87wn6krsc5.fsf@bsb.me.uk> <868riwzxu3.fsf@linuxsc.com> <87y1qwpb5p.fsf@bsb.me.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: reader01.eternal-september.org; posting-host="bacb00f09a0305cdb2c72c0e332a04cb"; logging-data="2756965"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18djjpYAnJxjlAvVq3a7YNL9mym0R9HbUc="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:5QvaTxhIfPI0IdxUZHxv0k55fTI= sha1:5QLJS6aHQ1v+r2DmcBoPwqBSdGE=
Xref: csiph.com comp.programming:16152

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>
>> Ben Bacarisse <ben.usenet@bsb.me.uk> writes:
>>
>>> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>>>
>>>> ram@zedat.fu-berlin.de (Stefan Ram) writes:
>>>>
>>>>> ram@zedat.fu-berlin.de (Stefan Ram) writes:
>>>>>
>>>>>> Given n times of the 24-hour day, print their average.
>>>>>> For example, the average of "eight o'clock" and
>>>>>> "ten o'clock" (n=2) would be "nine o'clock".
>>>>>> (You can choose any representation, for example "HH:MM"
>>>>>> or "seconds since midnight".)
>>>>>
>>>>>   Thanks for all replies!
>>>>>
>>>>>   I waited a few days before answering to allow
>>>>>   sufficient time to think about the problem.
>>>>>
>>>>>   There were not enough tests written and run.  As a result,
>>>>>   the puzzle has not yet been solved (unless I have overlooked
>>>>>   a contribution or misworded expectations).
>>>>>
>>>>>   So, here are two possible test cases.
>>>>>
>>>>> average( 23.5,  1.5 )==  0.5
>>>>> average( 11.5, 13.5 )== 12.5
>>>>>
>>>>>   (I use hours as units, so "0.5" means, "half past midnight".)
>>>>>
>>>>>   I hope that these test cases encode sensible expectations
>>>>>   for an average of two times on a 24-hour clock in the spirit
>>>>>   of the example given in the OP, which was, "the average of
>>>>>   eight o'clock and ten o'clock would be nine o'clock", since
>>>>>   these test cases just have rotated that example by 3.5 and
>>>>>   15.5 hours.
>>>>>
>>>>>   I believe that I have not seen an algorithm so far in this
>>>>>   thread that would pass these tests.
>>>>
>>>> As before, the problem is underspecified.
>>>
>>> Some remarks not specifically in reply to you, Tim...
>>
>> I didn't really say very much.  Your comments are a lot more
>> interesting.
>>
>>> The input is a collection, t(n), of n > 1 numbers in [0, 24).  The
>>> average should be a number, A, in [0, 24) that minimises
>>>
>>>   Sum_{i=1,n} distance(A, t(i))
>>>
>>> (or Sum_{i=1,n} difference(A, t(i))^2 if you prefer to think in terms
>>> of variance).  So far, this is just what an average is.  The key point
>>> is what is the distance (or difference) whose sum (or sum of squares)
>>> we want to minimise?  For times, I would say it is the length of the
>>> shorter arc round an imaginary 24-hour clock face.
>>
>> Minimizing the square of the distance (aka length of the shorter arc)
>> is one metric, and I think a reasonable one.  (Minimizing the absolute
>> value of the distance gives a result more like the median than the
>> mean, IIANM.)
>>
>>> The problem has a natural interpretation in terms of angles.  Whatever
>>> the circular quantity is, convert the values to unit vectors round a
>>> circle.  For times of day, just scale [0, 24) to [0, 2*pi).  The
>>> average is then just the direction of the average vector, converted
>>> back to a time of day.
>>
>> Averaging the "time unit vectors" is another plausible metric.
>>
>>> Sometimes that vector has zero length, and the average is undefined,
>>> but otherwise the length of the vector gives an indication of the
>>> variability of the data.
>>>
>>> Why do I consider this a reasonable interpretation of the problem?
>>> Well, given a list of times of day when a train is observed to pass
>>> some station, the circular 24-hour-time average should be our best
>>> estimate of the scheduled time.
>>>
>>> Obviously there are other possible readings of the problem, but I was
>>> not able to justify any of them as useful for any real-world
>>> applications.  This is a case where I hope I am wrong and there /are/
>>> other circular averages with practical interpretations.
>>
>> A nice analysis.
>
> A generous appraisal, thanks.  I was trying to be vague enough not to
> put my foot in it, by was too specific with the formula.
>
>> Some observations (all of which I believe are correct, but no
>> guarantees are offered).
>>
>> The "vector average" metric is easier to calculate and gives an
>> explicit indication when the data produces a degenerate result.
>>
>> Considering only cases that do not produce a degenerate result:
>>
>> (1) the "conventional average" metric and the "vector average"
>> metric can produce different results when there are more than two
>> data points.  (For one or two data points I think they are always
>> the same.)
>>
>> (2) the "vector average" metric can be calculated in O(n).  The
>> "conventional average" metric can be calculated in O(n*n).
>
> By which I presume you mean and /not/ in O(n).  (Big O is just an
> upper bound.)

What I mean is I have an algorithm that runs in time proportional
to n*n.  I suspect that there is an algorithm with better big-O
behavior (probably O( n * log n )), but I don't actually have one
to show, so I simply gave the best big-O result that I knew.

>> (3) when the times are all reasonably close together, the
>> "conventional average" metric seems more likely to produce a result
>> corresponding to what most people expect.  Another way to say this
>> is that the "vector average" metric will sometimes surprise people.
>
> I'm sorry to be obtuse, but what is the "conventional average"?  The
> name makes it sound trivial, but the quadratic time makes me certain
> that is isn't.

Sorry, I meant to refer to your formulation of average 

    A that minimizes { Sum_{i=1,n} difference(A, t(i))^2 }

where 'difference' means the shorter arc length.  This formula
matches the result for 'mean' on real numbers.

> My "conventional average" algorithm (which is not well thought out) was
> to (a) rotate the data set to avoid the 23/0 boundary (not always
> possible), (b) take the arithmetic mean, and then (c) rotate the result
> back.  E.g. [23,0,1] -> [0,1,2] by adding one, and the average is
> mean[0,1,2] - 1 = 0.

Yes, if you know where to split the cycle then the answer can be
found in O(n) time.  But how can we figure out where to split the
cycle?  (Incidentally I got that part wrong the first way I tried
to do it.)