Path: csiph.com!news.mixmin.net!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: Tim Rentsch <tr.17687@z991.linuxsc.com>
Newsgroups: comp.programming
Subject: Re: Another little puzzle
Date: Sat, 24 Dec 2022 06:21:40 -0800
Organization: A noiseless patient Spider
Lines: 112
Message-ID: <868riwzxu3.fsf@linuxsc.com>
References: <puzzle-20221214131815@ram.dialup.fu-berlin.de> <algorithm-20221221130021@ram.dialup.fu-berlin.de> <86pmcczcak.fsf@linuxsc.com> <87wn6krsc5.fsf@bsb.me.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: reader01.eternal-september.org; posting-host="4fac2999906db85e6e2faf120b1d87d4"; logging-data="2421808"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19DN8yZtSnUCcUP5S7efl4dsN3EgmViqlI="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:CgFytGo5aZI0OfhFxjx1rhT0NkY= sha1:eh+jLu61AwtljfsO/cXw1T80gKs=
Xref: csiph.com comp.programming:16134

Ben Bacarisse <ben.usenet@bsb.me.uk> writes:

> Tim Rentsch <tr.17687@z991.linuxsc.com> writes:
>
>> ram@zedat.fu-berlin.de (Stefan Ram) writes:
>>
>>> ram@zedat.fu-berlin.de (Stefan Ram) writes:
>>>
>>>> Given n times of the 24-hour day, print their average.
>>>> For example, the average of "eight o'clock" and
>>>> "ten o'clock" (n=2) would be "nine o'clock".
>>>> (You can choose any representation, for example "HH:MM"
>>>> or "seconds since midnight".)
>>>
>>>   Thanks for all replies!
>>>
>>>   I waited a few days before answering to allow
>>>   sufficient time to think about the problem.
>>>
>>>   There were not enough tests written and run.  As a result,
>>>   the puzzle has not yet been solved (unless I have overlooked
>>>   a contribution or misworded expectations).
>>>
>>>   So, here are two possible test cases.
>>>
>>> average( 23.5,  1.5 )==  0.5
>>> average( 11.5, 13.5 )== 12.5
>>>
>>>   (I use hours as units, so "0.5" means, "half past midnight".)
>>>
>>>   I hope that these test cases encode sensible expectations
>>>   for an average of two times on a 24-hour clock in the spirit
>>>   of the example given in the OP, which was, "the average of
>>>   eight o'clock and ten o'clock would be nine o'clock", since
>>>   these test cases just have rotated that example by 3.5 and
>>>   15.5 hours.
>>>
>>>   I believe that I have not seen an algorithm so far in this
>>>   thread that would pass these tests.
>>
>> As before, the problem is underspecified.
>
> Some remarks not specifically in reply to you, Tim...

I didn't really say very much.  Your comments are a lot more
interesting.

> The input is a collection, t(n), of n > 1 numbers in [0, 24).  The
> average should be a number, A, in [0, 24) that minimises
>
>   Sum_{i=1,n} distance(A, t(i))
>
> (or Sum_{i=1,n} difference(A, t(i))^2 if you prefer to think in terms
> of variance).  So far, this is just what an average is.  The key point
> is what is the distance (or difference) whose sum (or sum of squares)
> we want to minimise?  For times, I would say it is the length of the
> shorter arc round an imaginary 24-hour clock face.

Minimizing the square of the distance (aka length of the shorter arc)
is one metric, and I think a reasonable one.  (Minimizing the absolute
value of the distance gives a result more like the median than the
mean, IIANM.)

> The problem has a natural interpretation in terms of angles.  Whatever
> the circular quantity is, convert the values to unit vectors round a
> circle.  For times of day, just scale [0, 24) to [0, 2*pi).  The
> average is then just the direction of the average vector, converted
> back to a time of day.

Averaging the "time unit vectors" is another plausible metric.

> Sometimes that vector has zero length, and the average is undefined,
> but otherwise the length of the vector gives an indication of the
> variability of the data.
>
> Why do I consider this a reasonable interpretation of the problem?
> Well, given a list of times of day when a train is observed to pass
> some station, the circular 24-hour-time average should be our best
> estimate of the scheduled time.
>
> Obviously there are other possible readings of the problem, but I was
> not able to justify any of them as useful for any real-world
> applications.  This is a case where I hope I am wrong and there /are/
> other circular averages with practical interpretations.

A nice analysis.

Some observations (all of which I believe are correct, but no
guarantees are offered).

The "vector average" metric is easier to calculate and gives an
explicit indication when the data produces a degenerate result.

Considering only cases that do not produce a degenerate result:

(1) the "conventional average" metric and the "vector average"
metric can produce different results when there are more than two
data points.  (For one or two data points I think they are always
the same.)

(2) the "vector average" metric can be calculated in O(n).  The
"conventional average" metric can be calculated in O(n*n).

(3) when the times are all reasonably close together, the
"conventional average" metric seems more likely to produce a result
corresponding to what most people expect.  Another way to say this
is that the "vector average" metric will sometimes surprise people.

Incidentally, I had implemented a method that is isomorphic to
the "vector average" metric (at least I think it is), and I was
surprised to discover that it gave different results than the
"conventional average" metric in some cases.