From: "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Newsgroups: comp.programming
References: <12217875.401.1335542191031.JavaMail.geo-discussion-forums@ynjj38> <W5udnasYne_KmQDSnZ2dnUVZ7q2dnZ2d@bt.com> <1rnzov5qdfjg9$.1xzgbukwvzdqc$.dlg@40tude.net> <oOOdncDIyY_CCwLSnZ2dnUVZ7vCdnZ2d@bt.com> <ojotwsvdqgff$.am755c0uxjj2$.dlg@40tude.net>
Subject: Re: quantifying bloat
Date: Sat, 5 May 2012 10:03:29 +0100
Message-ID: <RfudnQbbhYaxcDnSnZ2dnUVZ8vCdnZ2d@bt.com>

Dmitry A. Kazakov wrote:

> > [me]
> > I think you've misunderstood the word "information" in the phrase
> > "information theory".  In that context, it doesn't have the normal
> > English meaning (something similar to "knowledge" -- which must
> > certainly have a "knower"), but has a narrow technical (jargon) meaning
> > which is very roughly -- the information in a message is the [size of
> > the] set of other messages which might have been transmitted instead.
>
> There are technical terms to describe what you mean, e.g. code density,
> bandwidth etc.

The "information" in "information theory" /is/ a technical term.  I know it has 
other meanings in other contexts, but mixing those meanings up (as you are 
doing) will not help communication here.  The idea under consideration is that 
the tools of information theory might help identify code bloat.  Dragging in 
meanings of the word "information" that are irrelevant here will only obscure 
that question.

> > So, consider a program made of many function definitions (or lines, or
> > classes or whatever).  If knowing the text of all the other function
> > definitions gives you a better guess of the text of some arbitrarily
> > chosen remaining one than you would have if you did the same exercise
> > with a different program, then the first is definitely more
> > redundant/compressible than the second.  The hypothesis here is that
> > similar reasoning might justify the claim that the first was more
> > "bloated" than the second.
>
> My point was that this particular issue, provided OP indeed meant that,
> has very little to do with information theory (coding theory). It does
> with psychology and linguistics, with how human beings sense, comprehend,
> feel about programs.

But if we (or someone) /can/ use the tools of information theory (objectively, 
perhaps even automatically) to identify factors in code which correlate well 
with (some aspects of) what we informally describe as code bloat, then that 
would constitute a disproof of your claim.  /Starting/ with the assertion that 
it's impossible is begging the question.
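
A minimal sketch of the "guess one function from all the others" idea, assuming 
zlib's compressed size is an acceptable stand-in for information content.  The 
names compressed_size, extra_cost and redundancy_score are hypothetical, and a 
general-purpose compressor is only a crude model of program text:

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed text (a rough information estimate)."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def extra_cost(context: str, target: str) -> int:
    """Bytes added to the compressed size when target is appended to context."""
    return compressed_size(context + target) - compressed_size(context)


def redundancy_score(functions: list[str]) -> float:
    """Mean of (cost given all the other functions) / (cost standing alone).

    Values near 1 mean the rest of the program says little about each
    function; values near 0 mean each function is largely predictable
    from the others, as with copy-paste code.
    """
    ratios = []
    for i, func in enumerate(functions):
        others = "\n".join(functions[:i] + functions[i + 1:])
        ratios.append(extra_cost(others, func) / compressed_size(func))
    return sum(ratios) / len(ratios)
```

Comparing the score of two programs is the comparison described in the quoted 
paragraph; whether a low score tracks what people call "bloat" is exactly the 
open question.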


> > I think that one could use that sort of technique to identify programs
> > where a lot of copy-paste repetition exists, and that is certainly
> > something one /could/ label as "bloat" -- for all it's not the only
> > meaning of "bloat", nor does that label really capture the essence of
> > what's wrong with the code.
>
> Yes, the level of reuse can be considered as one characteristic. However,
> in any language reuse comes at the cost of means used to factor out the
> repeated piece of code. Be it a class, a subprogram, a template, it always
> "bloats" a bit.

If I refactor (or redesign) a 1 MLOC program into an equivalent 200 KLOC program, 
then I don't think that /anyone/ could call that bloating the program.  Adding 
layers of abstraction?  For sure.  Making it harder for a total newbie to 
understand?  Quite possibly.  But "bloated"?  Never!

(Note: there's nothing in the above paragraph to claim that information theory 
would help me refactor, or identify the need/possibility to refactor, the 
original code.  But it /might/ and that's the question.)

Another place where one might try to apply information theory is the 
external interface to the program.  Say it's a GUI application.  You could 
identify a "language" which described the legal sequences of 
commands/gestures/inputs and responses/outputs.  Perhaps a context-free 
grammar, perhaps something more complex, perhaps something simpler.  The more 
tightly the language captures the actual sequences the better, but even a loose 
characterisation has value.  You then build some sort of a probabilistic model 
of the relative frequencies of sentences and clauses in that grammar as 
actually used by the real users of the app.  That gives you /a/ model to plug 
into your information-theoretic calculations.  (Note: /a/ model, not /the/ 
model -- it may be good or bad).
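
As a rough illustration, here is a sketch that swaps the grammar for something 
much cruder: a bigram model over command names, estimated from logged user 
sessions with add-one smoothing.  The class name CommandModel, the "<start>" 
marker and the session format (a list of command-name lists) are all 
assumptions made for the example:

```python
import math
from collections import Counter

START = "<start>"  # hypothetical marker for the beginning of a session


class CommandModel:
    """Bigram model of user commands, estimated from observed sessions."""

    def __init__(self, sessions):
        self.vocab = set()
        self.counts = {}  # previous command -> Counter of following commands
        for session in sessions:
            prev = START
            for cmd in session:
                self.counts.setdefault(prev, Counter())[cmd] += 1
                self.vocab.add(cmd)
                prev = cmd

    def probability(self, prev, cmd):
        """P(cmd | prev) with add-one smoothing over the known commands."""
        seen = self.counts.get(prev, Counter())
        return (seen[cmd] + 1) / (sum(seen.values()) + len(self.vocab))

    def surprisal(self, prev, cmd):
        """Information content of cmd after prev, in bits, relative to the model."""
        return -math.log2(self.probability(prev, cmd))

    def bits_per_command(self, sessions):
        """Average surprisal per command over the given sessions."""
        total_bits, n = 0.0, 0
        for session in sessions:
            prev = START
            for cmd in session:
                total_bits += self.surprisal(prev, cmd)
                prev = cmd
                n += 1
        return total_bits / n if n else 0.0
```

A tighter model (a real grammar with rule probabilities) would give smaller 
and more meaningful numbers; the bigram version only shows where the model 
plugs into the calculation.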

Armed with that model you can talk about the "information" content (relative 
/to/ the model) of commands issued to the app.  Two ways you might use that 
are:

1) compare that model with another one for the same app ten years ago (or a 
current competitor).  If the user is required to supply (or consume) 
significantly more "information" (as calculated objectively using the model) 
when using the later app compared to the earlier, then we can say that (in some 
technical, objective, sense) the interface has become more complicated to use. 
If technical, objective, facts like that correlate well with what people think 
of as "bloat" in the UI (think MS Office here ;-) then we have a tool for the 
automatic identification of "bloat" (admittedly heuristic, but we've never been 
aiming for mathematical proof).

2) if the information density of the language the user uses to control the app 
stays more or less constant, but the code base size increases significantly, or 
the memory/cpu/disk-space required by the app increase significantly, then 
again we can say that the app has become bloated in its implementation. 
Support from more examples (and not too many counter-examples) of the same 
thing would justify us in using the (objective) heuristic to detect 
implementation bloat.
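
The two comparisons above reduce to a little arithmetic once each version of 
the app has been measured.  A sketch, assuming each version is summarised by 
the average "information" a user supplies per task (under some model) and by 
its code size; Snapshot and both function names are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Measurements for one version of the app (fields are assumptions)."""
    bits_per_task: float   # average information the user supplies per task
    lines_of_code: int     # size of the code base


def interface_growth(old: Snapshot, new: Snapshot) -> float:
    """Comparison 1: factor by which the user-supplied information has grown."""
    return new.bits_per_task / old.bits_per_task


def implementation_growth(old: Snapshot, new: Snapshot) -> float:
    """Comparison 2: code growth relative to interface growth.

    A value well above 1 means the code base grew much faster than the
    information content of the interface it implements.
    """
    code_growth = new.lines_of_code / old.lines_of_code
    return code_growth / interface_growth(old, new)
```

Memory, CPU or disk footprint could stand in for lines_of_code in the second 
comparison.  Neither number proves anything by itself; the heuristic is only as 
good as its correlation with what people actually call "bloat".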

Personally, I'm more interested in the possibility of analysing the code base 
directly (because I'm a programmer, not a user, I suppose).  I've got too many 
projects on hand already, but it's tempting to go find some chunk of software 
where the source history is available, and which is commonly supposed to have 
bloated (the Linux kernel, perhaps, or the JRE, or even just *IX "cat") and do 
some modelling and analysis.  (If only with the gzip hack I mentioned earlier.)  I'd 
need some examples of code that has grown or changed /without/ bloating too, of 
course -- not quite so easy to think of candidates ;-)
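
One way the gzip-style measurement could be run over a source history is 
sketched below, assuming each release is unpacked into its own directory. 
The function names and the suffix filter are hypothetical, and zlib (the same 
DEFLATE algorithm gzip uses) is only one possible compressor:

```python
import zlib
from pathlib import Path


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by raw size; lower means more redundancy."""
    if not data:
        raise ValueError("no data to compress")
    return len(zlib.compress(data, 9)) / len(data)


def tree_bytes(root, suffixes=(".c", ".h")) -> bytes:
    """Concatenate all source files under root, in a stable (sorted) order."""
    parts = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix in suffixes:
            parts.append(path.read_bytes())
    return b"".join(parts)


def ratios_by_release(release_dirs):
    """Map each release directory to the compression ratio of its sources."""
    return {str(d): compression_ratio(tree_bytes(d)) for d in release_dirs}
```

DEFLATE only looks back 32 KiB, so repetition between distant files goes 
unnoticed; a compressor with a larger window (Python's lzma module, for 
instance) might be a better fit for whole source trees.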

    -- chris