Path: csiph.com!news.mixmin.net!eternal-september.org!reader01.eternal-september.org!.POSTED!not-for-mail
From: Tim Rentsch
Newsgroups: comp.programming
Subject: Re: Another little puzzle
Date: Sun, 25 Dec 2022 01:05:06 -0800
Organization: A noiseless patient Spider
Lines: 144
Message-ID: <86mt7byhtp.fsf@linuxsc.com>
References: <86pmcczcak.fsf@linuxsc.com> <87wn6krsc5.fsf@bsb.me.uk> <868riwzxu3.fsf@linuxsc.com> <87y1qwpb5p.fsf@bsb.me.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Injection-Info: reader01.eternal-september.org; posting-host="bacb00f09a0305cdb2c72c0e332a04cb"; logging-data="2756965"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18djjpYAnJxjlAvVq3a7YNL9mym0R9HbUc="
User-Agent: Gnus/5.11 (Gnus v5.11) Emacs/22.4 (gnu/linux)
Cancel-Lock: sha1:5QvaTxhIfPI0IdxUZHxv0k55fTI= sha1:5QLJS6aHQ1v+r2DmcBoPwqBSdGE=
Xref: csiph.com comp.programming:16152
Ben Bacarisse writes:
> Tim Rentsch writes:
>
>> Ben Bacarisse writes:
>>
>>> Tim Rentsch writes:
>>>
>>>> ram@zedat.fu-berlin.de (Stefan Ram) writes:
>>>>
>>>>> ram@zedat.fu-berlin.de (Stefan Ram) writes:
>>>>>
>>>>>> Given n times of the 24-hour day, print their average.
>>>>>> For example, the average of "eight o'clock" and
>>>>>> "ten o'clock" (n=2) would be "nine o'clock".
>>>>>> (You can choose any representation, for example "HH:MM"
>>>>>> or "seconds since midnight".)
>>>>>
>>>>> Thanks for all replies!
>>>>>
>>>>> I waited a few days before answering to allow
>>>>> sufficient time to think about the problem.
>>>>>
>>>>> There were not enough tests written and run. As a result,
>>>>> the puzzle has not yet been solved (unless I have overlooked
>>>>> a contribution or misworded expectations).
>>>>>
>>>>> So, here are two possible test cases.
>>>>>
>>>>> average( 23.5, 1.5 )== 0.5
>>>>> average( 11.5, 13.5 )== 12.5
>>>>>
>>>>> (I use hours as units, so "0.5" means, "half past midnight".)
>>>>>
>>>>> I hope that these test cases encode sensible expectations
>>>>> for an average of two times on a 24-hour clock in the spirit
>>>>> of the example given in the OP, which was, "the average of
>>>>> eight o'clock and ten o'clock would be nine o'clock", since
>>>>> these test cases just have rotated that example by 3.5 and
>>>>> 15.5 hours.
>>>>>
>>>>> I believe that I have not seen an algorithm so far in this
>>>>> thread that would pass these tests.
>>>>
>>>> As before, the problem is underspecified.
>>>
>>> Some remarks not specifically in reply to you, Tim...
>>
>> I didn't really say very much. Your comments are a lot more
>> interesting.
>>
>>> The input is a collection, t(n), of n > 1 numbers in [0, 24). The
>>> average should be a number, A, in [0, 24) that minimises
>>>
>>> Sum_{i=1,n} distance(A, t(i))
>>>
>>> (or Sum_{i=1,n} difference(A, t(i))^2 if you prefer to think in terms
>>> of variance). So far, this is just what an average is. The key point
>>> is what is the distance (or difference) whose sum (or sum of squares)
>>> we want to minimise? For times, I would say it is the length of the
>>> shorter arc round an imaginary 24-hour clock face.
>>
>> Minimizing the square of the distance (aka length of the shorter arc)
>> is one metric, and I think a reasonable one. (Minimizing the absolute
>> value of the distance gives a result more like the median than the
>> mean, IIANM.)
>>
>>> The problem has a natural interpretation in terms of angles. Whatever
>>> the circular quantity is, convert the values to unit vectors round a
>>> circle. For times of day, just scale [0, 24) to [0, 2*pi). The
>>> average is then just the direction of the average vector, converted
>>> back to a time of day.
>>
>> Averaging the "time unit vectors" is another plausible metric.
>>
>>> Sometimes that vector has zero length, and the average is undefined,
>>> but otherwise the length of the vector gives an indication of the
>>> variability of the data.
>>>
>>> Why do I consider this a reasonable interpretation of the problem?
>>> Well, given a list of times of day when a train is observed to pass
>>> some station, the circular 24-hour-time average should be our best
>>> estimate of the scheduled time.
>>>
>>> Obviously there are other possible readings of the problem, but I was
>>> not able to justify any of them as useful for any real-world
>>> applications. This is a case where I hope I am wrong and there /are/
>>> other circular averages with practical interpretations.
>>
>> A nice analysis.
>
> A generous appraisal, thanks. I was trying to be vague enough not to
> put my foot in it, by was too specific with the formula.
>
>> Some observations (all of which I believe are correct, but no
>> guarantees are offered).
>>
>> The "vector average" metric is easier to calculate and gives an
>> explicit indication when the data produces a degenerate result.
>>
>> Considering only cases that do not produce a degenerate result:
>>
>> (1) the "conventional average" metric and the "vector average"
>> metric can produce different results when there are more than two
>> data points. (For one or two data points I think they are always
>> the same.)
>>
>> (2) the "vector average" metric can be calculated in O(n). The
>> "conventional average" metric can be calculated in O(n*n).
>
> By which I presume you mean and /not/ in O(n). (Big O is just an
> upper bound.)
What I mean is I have an algorithm that runs in time proportional
to n*n. I suspect that there is an algorithm with better big-O
behavior (probably O( n * log n )), but I don't actually have one
to show, so I simply gave the best big-O result that I knew.
>> (3) when the times are all reasonably close together, the
>> "conventional average" metric seems more likely to produce a result
>> corresponding to what most people expect. Another way to say this
>> is that the "vector average" metric will sometimes surprise people.
>
> I'm sorry to be obtuse, but what is the "conventional average"? The
> name makes it sound trivial, but the quadratic time makes me certain
> that is isn't.
Sorry, I meant to refer to your formulation of average
A that minimizes { Sum_{i=1,n} difference(A, t(i))^2 }
where 'difference' means the shorter arc length. This formula
matches the result for 'mean' on real numbers.
> My "conventional average" algorithm (which is not well thought out) was
> to (a) rotate the data set to avoid the 23/0 boundary (not always
> possible), (b) take the arithmetic mean, and then (c) rotate the result
> back. E.g. [23,0,1] -> [0,1,2] by adding one, and the average is
> mean[0,1,2] - 1 = 0.
Yes, if you know where to split the cycle then the answer can be
found in O(n) time. But how can we figure out where to split the
cycle? (Incidentally I got that part wrong the first way I tried
to do it.)