
quantifying bloat

Started by	bob <bob@coolfone.comze.com>
First post	2012-04-27 08:56 -0700
Last post	2012-04-29 11:01 +0100
Articles	20 on this page of 26 — 10 participants


  quantifying bloat bob <bob@coolfone.comze.com> - 2012-04-27 08:56 -0700
    Re: quantifying bloat hopcode <hopcode@invalid.de> - 2012-04-27 18:16 +0200
      Re: quantifying bloat Nomen Nescio <nobody@dizum.com> - 2012-04-29 16:22 +0200
    Re: quantifying bloat "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-04-29 10:03 +0100
      Re: quantifying bloat "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> - 2012-04-29 11:36 +0200
        Re: quantifying bloat Daniel Pitts <newsgroup.nospam@virtualinfinity.net> - 2012-04-29 15:09 -0700
          Re: quantifying bloat "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> - 2012-04-30 10:09 +0200
        Re: quantifying bloat "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-05-01 08:53 +0100
          Re: quantifying bloat "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> - 2012-05-01 10:52 +0200
            Re: quantifying bloat hopcode <hopcode@invalid.de> - 2012-05-02 04:02 +0200
            Re: quantifying bloat "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-05-05 10:03 +0100
              Re: quantifying bloat "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> - 2012-05-05 12:50 +0200
                Re: quantifying bloat hopcode <hopcode@invalid.de> - 2012-05-05 16:23 +0200
                  Re: quantifying bloat "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> - 2012-05-05 17:43 +0200
        Re: quantifying bloat gremnebulin <peterdjones@yahoo.com> - 2012-05-03 09:27 -0700
          Re: quantifying bloat "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> - 2012-05-03 18:50 +0200
            Re: quantifying bloat Willem <willem@toad.stack.nl> - 2012-05-04 13:52 +0000
              Re: quantifying bloat "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> - 2012-05-04 16:05 +0200
                Re: quantifying bloat hopcode <hopcode@invalid.de> - 2012-05-04 20:44 +0200
                  Re: quantifying bloat Willem <willem@toad.stack.nl> - 2012-05-04 20:32 +0000
                    Re: quantifying bloat "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-05-05 10:16 +0100
      Re: quantifying bloat James Dow Allen <jdallen2000@yahoo.com> - 2012-05-02 02:44 -0700
        Re: quantifying bloat "Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org> - 2012-05-05 10:11 +0100
          Re: quantifying bloat "Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de> - 2012-05-05 13:22 +0200
            Re: quantifying bloat hopcode <hopcode@invalid.de> - 2012-05-05 16:27 +0200
    Re: quantifying bloat rossum <rossum48@coldmail.com> - 2012-04-29 11:01 +0100


#1504 — quantifying bloat

From	bob <bob@coolfone.comze.com>
Date	2012-04-27 08:56 -0700
Subject	quantifying bloat
Message-ID	<12217875.401.1335542191031.JavaMail.geo-discussion-forums@ynjj38>

Has anyone ever tried to apply information theory to source code to quantitatively determine if code is bloated or not?


#1506

From	hopcode <hopcode@invalid.de>
Date	2012-04-27 18:16 +0200
Message-ID	<jnegp6$qjn$1@dont-email.me>
In reply to	#1504

On 27.04.2012 17:56, bob wrote:
> Has anyone ever tried to apply information theory to source code to quantitatively determine if code is bloated or not?
>
Me.
I am still working on a few formulations.
Here is the goal:

http://sites.google.com/site/x64lab/home/uncategorized/programs-by-code-languages-by-semantics

Cheers,

-- 
.:mrk[hopcode]
   .:x64lab:.
  group http://groups.google.com/group/x64lab
  site http://sites.google.com/site/x64lab


#1513

From	Nomen Nescio <nobody@dizum.com>
Date	2012-04-29 16:22 +0200
Message-ID	<2a56ef3122afad094c80ae86801158ff@dizum.com>
In reply to	#1506

hopcode <hopcode@invalid.de> wrote:

> On 27.04.2012 17:56, bob wrote:
> > Has anyone ever tried to apply information theory to source code to
> quantitatively determine if code is bloated or not?

That is pretty easy. I use the following logic:

if (env_windows || env_unix_gnu)
    code_bloat = yes;


#1510

From	"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Date	2012-04-29 10:03 +0100
Message-ID	<W5udnasYne_KmQDSnZ2dnUVZ7q2dnZ2d@bt.com>
In reply to	#1504

bob wrote:

> Has anyone ever tried to apply information theory to source code to
> quantitatively determine if code is bloated or not?

tar -cf - $codebase | gzip -v > /dev/null

;-)

More seriously (though the above certainly isn't entirely silly), it depends on
what you mean by "bloat".  Wordy/verbose language design?  Verbose APIs?
Copy-paste redundancy?  Missing abstractions[*]?  Excess features?  Dead code
left unpruned?  ...

Some of those could be attacked, I think, with information theory.

But note that the closer you get to some kind of information-theoretic ideal,
with no "wasted" bandwidth, the nearer you get to the situation where any error
in transmission results in a /different/, but /still valid/ message.  Not
something that I'd like in a programming environment.

    -- chris

[*] abstraction can be thought of as a compression technique.
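
For readers who want to try the gzip measurement without a shell, here is a minimal Python sketch of the same pipeline. It is an illustration under stated assumptions, not something posted in the thread: the function name `compression_ratio` is made up, and zlib at level 9 stands in for gzip (both use DEFLATE, so the ratio should be close but not byte-identical).

```python
import io
import tarfile
import zlib
from pathlib import Path


def compression_ratio(codebase: str) -> float:
    """Return compressed size / raw size for an in-memory tar of `codebase`."""
    buf = io.BytesIO()
    # Build an uncompressed tar archive, like `tar -cf - $codebase`.
    with tarfile.open(fileobj=buf, mode="w") as tar:
        tar.add(codebase, arcname=Path(codebase).name)
    raw = buf.getvalue()
    # Compress the whole archive in one go, like piping it through gzip.
    packed = zlib.compress(raw, 9)
    return len(packed) / len(raw)
```

A ratio near 1.0 means the tree is already dense; a much lower ratio suggests redundancy of some kind, though the number alone cannot say which of the kinds of bloat listed above is responsible.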


#1511

From	"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Date	2012-04-29 11:36 +0200
Message-ID	<1rnzov5qdfjg9$.1xzgbukwvzdqc$.dlg@40tude.net>
In reply to	#1510

On Sun, 29 Apr 2012 10:03:46 +0100, Chris Uppal wrote:

>> Has anyone ever tried to apply information theory to source code to
>> quantitatively determine if code is bloated or not?
> 
> tar -cf - $codebase | gzip -v > /dev/null
> 
> ;-)
> 
> 
> More seriously (though the above certainly isn't entirely silly), it depends on
> what you mean by "bloat".  Wordy/verbose language design?  Verbose APIs?
> Copy-paste redundancy?  Missing abstractions[*]?  Excess features?  Dead code
> left unpruned?  ...
> 
> Some of those could be attacked, I think, with information theory.

The notion of information complexity is just rubbish. There is no
information without an observer. So there is no complexity in raw data.

> But note that the closer you get to some kind of information-theoretic ideal,
> with no "wasted" bandwidth, the nearer you get to the situation where any error
> in transmission results in a /different/, but /still valid/ message.  Not
> something that I'd like in a programming environment.

Yes, that is one problem. There is fundamentally no way to distinguish noise
from tightly encoded messages.

Another is the meaning of the message. For example, Pi is incomputable, but
there is no problem to pass a message "Pi" to a recipient knowing what Pi
is. Is Pi complex? A meaningless question.

The bottom line: the complexity of code is not a subject of information theory.
It is a subject of psychology if we consider how complex it is for an
average programmer to understand the code. It is a subject of compiler
construction if we consider how to build a compiler to translate the code,
etc.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


#1514

From	Daniel Pitts <newsgroup.nospam@virtualinfinity.net>
Date	2012-04-29 15:09 -0700
Message-ID	<S_inr.17525$zA4.4715@newsfe19.iad>
In reply to	#1511

On 4/29/12 2:36 AM, Dmitry A. Kazakov wrote:
> On Sun, 29 Apr 2012 10:03:46 +0100, Chris Uppal wrote:
>
>>> Has anyone ever tried to apply information theory to source code to
>>> quantitatively determine if code is bloated or not?
>>
>> tar -cf - $codebase | gzip -v > /dev/null
>>
>> ;-)
>>
>>
>> More seriously (though the above certainly isn't entirely silly), it depends on
>> what you mean by "bloat".  Wordy/verbose language design?  Verbose APIs?
>> Copy-paste redundancy?  Missing abstractions[*]?  Excess features?  Dead code
>> left unpruned?  ...
>>
>> Some of those could be attacked, I think, with information theory.
>
> The notion of information complexity is just rubbish. There is no
> information without an observer. So there is no complexity in raw data.
The information is there, if it *can* be observed, not if it *is* observed.
Well, maybe Erwin Schrödinger would disagree, but the point is that the 
information stored has a certain amount of entropy in it, and a specific 
piece of information, to be discernible from any other piece, has a 
specific minimum amount of space needed to encode it.
Though, I think bloat in software terms is a bit different.  See below...

> The bottom line, complexity of code is not a subject of information theory.
> It is a subject of psychology if we consider how complex is it for an
> average programmer to understand the code. It is a subject of compiler
> construction if we consider how to build a compiler to translate the code.
> etc.

I would characterize software bloat as the difference in resource usage
between the "most optimal" and the "actual" implementation.  Resources
being memory, CPU, disk space, etc...

With this definition, most software is slightly bloated, since it relies
on abstraction layers (such as the O.S., standard libraries, etc...).
Bloat isn't entirely bad, as allowing for bloat allows for these 
abstractions, and these abstractions make it easier to produce software 
that is complex and reasonably correct.

Good tools can provide these abstractions with less bloat than a raw
translation of the abstractions would strictly require (optimizing
compilers, for example). Some bloat, though, is caused by poor use of
abstractions, or by poor abstractions themselves.

The difficulty sometimes is determining what the "optimum" resource 
usage actually is.  Some operations have several algorithms that perform 
slightly differently depending on input, and some operations don't have 
a "proven minimum" big O.

I do think you can quantify bloat, but that value will have large 
error-bars for most real-world situations.


#1515

From	"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Date	2012-04-30 10:09 +0200
Message-ID	<687fvjke62hx$.1hlswyyvuygns.dlg@40tude.net>
In reply to	#1514

On Sun, 29 Apr 2012 15:09:53 -0700, Daniel Pitts wrote:

> On 4/29/12 2:36 AM, Dmitry A. Kazakov wrote:

>> The notion of information complexity is just rubbish. There is no
>> information without an observer. So there is no complexity in raw data.

> The information is there, if it *can* be observed, not if it *is* observed.

No, a system of N independent states contains nothing without an
association of these states with meanings by a concrete observer. 123 *can*
mean 123 employees or 123 beer bottles or "S". As such it means nothing.

> Well, maybe Erwin Schrödinger would disagree, but the point is that the 
> information stored has a certain amount of entropy in it, and a specific 
> piece of information, to be discernible from any other piece, has a 
> specific minimum amount of space needed to encode it.

This is wrong too. A specific piece of information (just one) needs no
space to encode. Consider a medium which has only one state. You assign
your piece to that state and you are done. You can encode the whole Britannica
this way. Which returns us to the difference between *can* and *is*.

> Though, I think bloat in software terms is a bit different.  See below...
> 
>> The bottom line, complexity of code is not a subject of information theory.
>> It is a subject of psychology if we consider how complex is it for an
>> average programmer to understand the code. It is a subject of compiler
>> construction if we consider how to build a compiler to translate the code.
>> etc.
> 
> I would characterize software bloat as the difference in resources usage 
> between the "most optimal" and the "actual" implementation.  Resources 
> being memory, cpu, disk space, etc...

This is a case when the observer is the machine hardware. I think most
people would disagree with this definition of "bloat," because except for
very specific embedded and heavy duty applications machine resources play
a far lesser role than maintenance costs, safety, security and other
non-functional requirements.

> With this definition, most software is slightly bloated, since they rely 
> on abstraction layers (such as the O.S., standard libraries, etc...).

No, it is hugely "bloated" in this sense. Compare the resources of a PC now
with those of a workstation 20 years ago, against the capabilities of the
software used for typical activities (word processing, editing source code,
compiling): the capabilities are almost the same. 99.9% of the resource gain
is just wasted. Of course gaming and other computation-intensive stuff is
another beast. But there too, development is not focused on resources as it
was before. Mass Effect is far more "bloated" than Pac-Man running on a 32K
PDP-11.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


#1517

From	"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Date	2012-05-01 08:53 +0100
Message-ID	<oOOdncDIyY_CCwLSnZ2dnUVZ7vCdnZ2d@bt.com>
In reply to	#1511

Dmitry A. Kazakov wrote:

> The notion of information complexity is just rubbish. There is no
> information without an observer. So there is no complexity in raw data.

I think you've misunderstood the word "information" in the phrase "information 
theory".  In that context, it doesn't have the normal English meaning 
(something similar to "knowledge" -- which must certainly have a "knower"), but 
has a narrow technical (jargon) meaning which is very roughly -- the 
information in a message is the [size of the] set of other messages which might 
have been transmitted instead.

That's very rough, of course, but it captures the important point that 
"information theory" isn't about information, as that word is normally 
understood, at all.
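
To make the paraphrase concrete, the standard textbook definitions can be written out. This is added for illustration and is not part of the original post; the symbols $N$, $p(m)$ and $H$ are introduced here. If a message is drawn from $N$ equally likely alternatives, its information content is the logarithm of the size of that set rather than the size itself:

```latex
I = \log_2 N \ \text{bits}, \qquad
I(m) = -\log_2 p(m), \qquad
H = -\sum_{m} p(m) \log_2 p(m)
```

The second form covers messages that are not equally likely, and $H$ is the average over the whole set. With a single possible message, $N = 1$ and $I = 0$.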

That definition (the real version, or my paraphrase) only applies when there is 
a known set of potential messages to consider.  So it doesn't directly apply to 
just one program (what set is that program drawn from?), but it is very common 
to wave one's hands a bit, and treat individual passages from the text as if
drawn from a set which is exemplified by the whole (available) program.  In 
which case, it becomes possible to talk of the information-density of "the 
program" (I don't like this misuse of words myself, I think it's confusing, 
although there is a perfectly well-defined concept there)

So, consider a program made of many function definitions (or lines, or classes 
or whatever).  If knowing the text of all the other function definitions gives 
you a better guess of the text of some arbitrarily chosen remaining one than 
you would have if you did the same exercise with a different program, then the 
first is definitely more redundant/compressible than the second.  The 
hypothesis here is that similar reasoning might justify the claim that the 
first was more "bloated" than the second.

I think that one could use that sort of technique to identify programs where a 
lot of copy-paste repetition exists, and that is certainly something one 
/could/ label as "bloat" -- for all it's not the only meaning of "bloat", nor 
does that label really capture the essence of what's wrong with the code.

    -- chris
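
The guessing exercise described above can be approximated with an off-the-shelf compressor. The following is a minimal sketch under stated assumptions, not Chris's method: `conditional_cost` and `redundancy_score` are hypothetical names, zlib stands in for an ideal predictor, and the program is assumed to be already split into function-sized byte strings.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Size in bytes of `data` after zlib compression at the highest level."""
    return len(zlib.compress(data, 9))


def conditional_cost(chunk: bytes, context: bytes) -> int:
    """Extra compressed bytes needed for `chunk` once `context` is known."""
    return compressed_size(context + chunk) - compressed_size(context)


def redundancy_score(functions: list[bytes]) -> float:
    """Mean ratio of conditional cost to standalone cost over all functions.

    Values near 1.0 mean the other functions barely help in predicting
    each one; lower values mean the program is more redundant.
    """
    ratios = []
    for i, fn in enumerate(functions):
        # Everything except the function under test serves as context.
        others = b"".join(functions[:i] + functions[i + 1:])
        ratios.append(conditional_cost(fn, others) / compressed_size(fn))
    return sum(ratios) / len(ratios)
```

Comparing the score of two programs gives a rough version of the "more redundant than" relation. One caveat: zlib only looks back 32 KB, so for large programs most of the context is ignored and a compressor with a longer window would be a better stand-in.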


#1518

From	"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Date	2012-05-01 10:52 +0200
Message-ID	<ojotwsvdqgff$.am755c0uxjj2$.dlg@40tude.net>
In reply to	#1517

On Tue, 1 May 2012 08:53:12 +0100, Chris Uppal wrote:

> Dmitry A. Kazakov wrote:
> 
>> The notion of information complexity is just rubbish. There is no
>> information without an observer. So there is no complexity in raw data.
> 
> I think you've misunderstood the word "information" in the phrase "information 
> theory".  In that context, it doesn't have the normal English meaning 
> (something similar to "knowledge" -- which must certainly have a "knower"), but 
> has a narrow technical (jargon) meaning which is very roughly -- the 
> information in a message is the [size of the] set of other messages which might 
> have been transmitted instead.

There are technical terms to describe what you mean, e.g. code density,
bandwidth etc.

> So, consider a program made of many function definitions (or lines, or classes 
> or whatever).  If knowing the text of all the other function definitions gives 
> you a better guess of the text of some arbitrarily chosen remaining one than 
> you would have if you did the same exercise with a different program, then the 
> first is definitely more redundant/compressible than the second.  The 
> hypothesis here is that similar reasoning might justify the claim that the 
> first was more "bloated" than the second.

My point was that this particular issue, provided OP indeed meant that, has
very little to do with information theory (coding theory). It does with
psychology and linguistics, with how human beings sense, comprehend, feel
about programs.

> I think that one could use that sort of technique to identify programs where a 
> lot of copy-paste repetition exists, and that is certainly something one 
> /could/ label as "bloat" -- for all it's not the only meaning of "bloat", nor 
> does that label really capture the essence of what's wrong with the code.

Yes, the level of reuse can be considered as one characteristic. However,
in any language reuse comes at the cost of means used to factor out the
repeated piece of code. Be it a class, a subprogram, a template, it always
"bloats" a bit. Additionally it needs a variance, which leads to all sorts of
substitutability issues and ways to formalize them, and thus to more code.
Furthermore, it also requires the reader to understand the abstraction
behind, e.g. roughly speaking the software pattern applied. If the reader
does not recognize the pattern, the code would appear extremely bloated to
him. I.e. the result again depends heavily on the observer.

Another issue is that you have to consider a set of equivalent programs,
ones having same semantics in order to compare them for bloating. This
alone is a problem (undecidable).

All in all, I think we have to live with empirical software metrics for a
long while...

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


#1519

From	hopcode <hopcode@invalid.de>
Date	2012-05-02 04:02 +0200
Message-ID	<jnq4k9$ise$1@dont-email.me>
In reply to	#1518

Hi all,
I agree 100% with Dmitry. I would add that the *can* or *is* of
information establishes a complexity that cannot be resolved without
relying on probabilistic methods, *nor* without facing the enigma of two
or more overlapping states of an observed piece of information/event.
AFAIK the "black cat" of it has been clearly solved by quantum
computing.

So I suggest staying out of Schrödinger's realm, because most of the
time I see that concept confusing people who use it as a tautological
confirmation of their own beliefs. The conceptual abuse consists in
arguing that, because the black cat in the box may have two or more
overlapping states, this is enough to justify stopping heuristics
about it, since two or more overlapping states can /already/ fully
satisfy every answer we expect from the analysis of the whole
(das Ganze).

So we should stay firmly on this planet and concretely consider the
machine as actor and "observer", and the information itself as a
"vector" of itself, meaningful when observed in a well-defined context.

Instead I would sum up some relevant points outlined by Dmitry,
because they are fundamental for practical reasons.

Dmitry starts in general from information theory:

1) the states (attributes) of information have meaning only when observed.
2) information has no direct relation to space-time
    archetypes (= categories; it may have no encoding space/time)

But he then relates the two points above in detail, back on the
planet, to the "observer", that is to say, to the machine.
And that is the way I would enter.

On 01.05.2012 10:52, Dmitry A. Kazakov wrote:
>
> Yes, the level of reuse can be considered as one characteristic. However,
> in any language reuse comes at the cost of means used to factor out the
> repeated piece of code. Be it a class, a subprogram, a template, it always
> "bloats" a bit. Additionally it needs a variance, which leads to all sorts of
> substitutability issues and ways to formalize them, and thus to more code.
> Furthermore, it also requires the reader to understand the abstraction
> behind, e.g. roughly speaking the software pattern applied. If the reader
> does not recognize the pattern, the code would appear extremely bloated to
> him. I.e. the result again depends heavily on the observer.
>
> Another issue is that you have to consider a set of equivalent programs,
> ones having same semantics in order to compare them for bloating. This
> alone is a problem (undecidable).

If the meaning given to "semantics" is "functionalities"/"aims",
as when the Opera browser shares the same "semantics" as Firefox, well, we
need to distinguish functionalities: for example the "Anonymous
Browsing Session" or the capability to install plug-ins, when both
turn out to implement that functionality.
Anyway, they are fully "decidable", provided that we have previously
built a standard scale.
It seems obvious that they do not share the same performance,
nor waste the same resources, nor share the same "skin". But
all this needs evaluation too!

>
> All in one, I think we have to live with empirical software metrics for a
> long while...
>
And I propose a new metric based on the difference between
output code and language semantics. Consider the observer, the machine,
and tell me what the difference is, from the observer's point of view,
between the two lines

   if (alpha == beta)

and

   cmp eax,ebx

because this will give some hints about the reasons for my position,
"Programs by code, languages by semantics".


Cheers,

-- 
.:mrk[hopcode]
   .:x64lab:.
  group http://groups.google.com/group/x64lab
  site http://sites.google.com/site/x64lab


#1537

From	"Chris Uppal" <chris.uppal@metagnostic.REMOVE-THIS.org>
Date	2012-05-05 10:03 +0100
Message-ID	<RfudnQbbhYaxcDnSnZ2dnUVZ8vCdnZ2d@bt.com>
In reply to	#1518

Dmitry A. Kazakov wrote:

> > [me]
> > I think you've misunderstood the word "information" in the phrase
> > "information theory".  In that context, it doesn't have the normal
> > English meaning (something similar to "knowledge" -- which must
> > certainly have a "knower"), but has a narrow technical (jargon) meaning
> > which is very roughly -- the information in a message is the [size of
> > the] set of other messages which might have been transmitted instead.
>
> There are technical terms to describe what you mean, e.g. code density,
> bandwidth etc.

The "information" in "information theory" /is/ a technical term.  I know it has 
other meanings in other contexts, but mixing those meanings up (as you are 
doing) will not help communication here.  The idea under consideration is that 
the tools of information theory might help identify code bloat.  Confounding 
the irrelevant (here) meanings of the word "information" will only obscure that 
question.

> > So, consider a program made of many function definitions (or lines, or
> > classes or whatever).  If knowing the text of all the other function
> > definitions gives you a better guess of the text of some arbitrarily
> > chosen remaining one than you would have if you did the same exercise
> > with a different program, then the first is definitely more
> > redundant/compressible than the second.  The hypothesis here is that
> > similar reasoning might justify the claim that the first was more
> > "bloated" than the second.
>
> My point was that this particular issue, provided OP indeed meant that,
> has very little to do with information theory (coding theory). It does
> with psychology and linguistics, with how human beings sense, comprehend,
> feel about programs.

But if we (or someone) /can/ use the tools of information theory (objectively, 
perhaps even automatically) to identify factors in code which correlate well 
with (some aspects of) what we informally describe as code bloat, then that 
would constitute a disproof of your claim.  /Starting/ with the assertion that 
it's impossible is begging the question.

> > I think that one could use that sort of technique to identify programs
> > where a lot of copy-paste repetition exists, and that is certainly
> > something one /could/ label as "bloat" -- for all it's not the only
> > meaning of "bloat", nor does that label really capture the essence of
> > what's wrong with the code.
>
> Yes, the level of reuse can be considered as one characteristic. However,
> in any language reuse comes at the cost of means used to factor out the
> repeated piece of code. Be it a class, a subprogram, a template, it always
> "bloats" a bit.

If I refactor (or redesign) a 1 MLOC program into an equivalent 200 KLOC program,
then I don't think that /anyone/ could call that bloating the program.  Adding 
layers of abstraction? For sure.  Making it harder for a total newbie to 
understand? Quite possibly.  But "bloated" ?  Never!

(Note: there's nothing in the above paragraph to claim that information theory 
would help me refactor, or identify the need/possibility to refactor, the 
original code.  But it /might/ and that's the question.)

Another place where one might try to apply information theory, would be in the 
external interface to the program.  Say it's a GUI application.  You could 
identify a "language" which described the legal sequences of 
commands/gestures/inputs and responses/outputs.  Perhaps a context-free 
grammar, perhaps something more complex, perhaps something simpler.  The more 
tightly the language captures the actual sequences the better, but even a loose 
characterisation has value.  You then build some sort of a probabilistic model 
of the relative frequencies of sentences and clauses in that grammar as 
actually used by the real users of the app.  That gives you /a/ model to plug 
into your information-theoretic calculations.  (Note: /a/ model, not /the/ 
model -- it may be good or bad).

Armed with that model you can talk about the "information" content (relative 
/to/ the model) of commands issued to the app.  Two ways you might use that 
are:

1) compare that model with another one for the same app ten years ago (or a 
current competitor).  If the user is required to supply (or consume) 
significantly more "information" (as calculated objectively using the model) 
when using the later app compared to the earlier, then we can say that (in some 
technical, objective, sense) the interface has become more complicated to use. 
If technical, objective, facts like that correlate well with what people think 
of as "bloat" in the UI (think MS Office here ;-) then we have a tool for the 
automatic identification of "bloat" (admittedly heuristic, but we've never been 
aiming for mathematical proof).

2) if the information density of the language the user uses to control the app
stays more or less constant, but the code base size increases significantly, or
the memory/cpu/disk-space required by the app increases significantly, then
again we can say that the app has become bloated in its implementation.
Support from more examples (and not too many counter-examples) of the same 
thing, would justify us in using the (objective) heuristic to detect 
implementation bloat.

Personally, I'm more interested in the possibility of analysing the code base 
directly (because I'm a programmer, not a user, I suppose).  I've got too many 
projects on hand already, but it's tempting to go find some chunk of software 
where the source history is available, and which is commonly supposed to have 
bloated (the Linux kernel, perhaps, or the JRE, or even just *IX "cat") and do 
some modelling and analysis.  (If only the gzip hack I mentioned earlier.)  I'd 
need some examples of code that has grown or changed /without/ bloating too, of 
course -- not quite so easy to think of candidates ;-)

    -- chris
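
The first comparison in the list above can be sketched with the crudest possible model: treat each logged command as an independent symbol and measure the entropy of the observed frequencies. This ignores the grammar of command sequences that the post calls for, so it is only a starting point; the function name `bits_per_command` and the sample logs are invented for illustration.

```python
import math
from collections import Counter


def bits_per_command(log: list[str]) -> float:
    """Shannon entropy, in bits per command, of the empirical frequencies.

    Assumes a non-empty log; an empty one would divide by zero.
    """
    counts = Counter(log)
    total = len(log)
    return -sum(n / total * math.log2(n / total) for n in counts.values())


# Hypothetical usage logs for an old and a new release of the same app.
old_log = ["open", "save"] * 50
new_log = ["open", "save", "share", "sync"] * 25

print(bits_per_command(old_log))  # 1.0
print(bits_per_command(new_log))  # 2.0
```

Under this toy model the newer release asks the user for twice as many bits per command. Whether such a number tracks what people perceive as UI bloat is exactly the open question in the post.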


#1540

From	"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Date	2012-05-05 12:50 +0200
Message-ID	<1kct04w5vsodz.kh2ib37r71em$.dlg@40tude.net>
In reply to	#1537

On Sat, 5 May 2012 10:03:29 +0100, Chris Uppal wrote:

> Dmitry A. Kazakov wrote:
> 
>>> [me]
>>> I think you've misunderstood the word "information" in the phrase
>>> "information theory".  In that context, it doesn't have the normal
>>> English meaning (something similar to "knowledge" -- which must
>>> certainly have a "knower"), but has a narrow technical (jargon) meaning
>>> which is very roughly -- the information in a message is the [size of
>>> the] set of other messages which might have been transmitted instead.
>>
>> There are technical terms to describe what you mean, e.g. code density,
>> bandwidth etc.
> 
> The "information" in "information theory" /is/ a technical term.

It is a buzz word used here and there.

> The idea under consideration is that 
> the tools of information theory might help identify code bloat.

If "information theory" means here a theory that utilizes mathematical
statistics to deal with encoding issues of signal transmission, then the
clear answer is no.

> Confounding 
> the irrelevant (here) meanings of the word "information" will only obscure that 
> question.

The irrelevant meanings here are ones of the information theory. Which is
why it does not work here.

>>> So, consider a program made of many function definitions (or lines, or
>>> classes or whatever).  If knowing the text of all the other function
>>> definitions gives you a better guess of the text of some arbitrarily
>>> chosen remaining one than you would have if you did the same exercise
>>> with a different program, then the first is definitely more
>>> redundant/compressible than the second.  The hypothesis here is that
>>> similar reasoning might justify the claim that the first was more
>>> "bloated" than the second.
>>
>> My point was that this particular issue, provided OP indeed meant that,
>> has very little to do with information theory (coding theory). It does
>> with psychology and linguistics, with how human beings sense, comprehend,
>> feel about programs.
> 
> But if we (or someone) /can/ use the tools of information theory (objectively, 
> perhaps even automatically) to identify factors in code which correlate well 
> with (some aspects of) what we informally describe as code bloat, then that 
> would constitute a disproof of your claim.

Any theory has its application domain. In order to be useful or just
meaningful certain premises have to be met. [ The burden of proof is on the
applicant. ]

As for the "tools", these are just those of mathematical statistics and
nothing else. It is applied mathematics, which by definition has no
fundamental merit of its own. As for mathematical statistics applied to
code analysis, I doubt it could be of any use here, because:

1. Properties of the code are not random. In the overwhelming majority of
relevant cases it is all about the deterministic behavior of the program.

2. Human perception of the code as being bloated or not is not stochastic
either.

[ Perception of the code by a population (the only case where statistics
may apply) is, firstly, of no interest, and, secondly, would be a subject
of sociology. ]

>>> I think that one could use that sort of technique to identify programs
>>> where a lot of copy-paste repetition exists, and that is certainly
>>> something one /could/ label as "bloat" -- for all it's not the only
>>> meaning of "bloat", nor does that label really capture the essence of
>>> what's wrong with the code.
>>
>> Yes, the level of reuse can be considered as one characteristic. However,
>> in any language reuse comes at the cost of means used to factor out the
>> repeated piece of code. Be it a class, a subprogram, a template, it always
>> "bloats" a bit.
> 
> If I refactor (or redesign) a 1MLoC program into an equivalent 200KLOC program, 
> then I don't think that /anyone/ could call that bloating the program. Adding 
> layers of abstraction? For sure.  Making it harder for a total newbie to 
> understand? Quite possibly.  But "bloated" ?  Never!

Under the presumption of a reduction to 1/5 of the code size? A rather
strong one. But even so, 200K is a lot of code, more than can be seen as a
whole.

The perception of code is local. Being "bloated" is felt about the pieces
the programmer is aware of right now. It is quite possible to feel that
these pieces are bloated even if the code as a whole is actually shorter.
It is irrelevant whether the code is "objectively" shorter, because the
negative effects of bloat are inflicted on the programmers, not on the hard
drives (with possible exceptions of course). Note also that code is
difficult to isolate from the libraries and frameworks it relies on.
"Bloat" can migrate. Is a 10-line program bloated when it requires a
1 GB OS in RAM?

> Another place where one might try to apply information theory, would be in the 
> external interface to the program.

But there is already an applied science to deal with that: ergonomics.

> 1) compare that model with another one for the same app ten years ago (or a 
> current competitor).

Windows 2000 vs Windows 8 metro? (:-))

> If the user is required to supply (or consume) 
> significantly more "information" (as calculated objectively using the model)

Two problems:

1. Measurement. Counting mouse clicks and the distance the mouse travelled
is the minor concern. The real problem is how to compare mouse clicks with
keystrokes: how do you quantify this? [ Stochastic models are known not to
work here. ]

2. The model itself. Much "simpler" problems, like ranking, have not been
satisfactorily solved to date.

> 2) if the information density of the language the user uses to control the app 
> stays more or less constant, but the code base size increases significantly, or 
> the memory/cpu/disk-space required by the app increase significantly, then 
> again we can say that the app has become bloated in its implementation.

[ I leave aside the meaningless "information density". ]
The vendor could say that the application spends more resources addressing
non-functional issues, e.g. being more user friendly, more secure, easier
to maintain etc.

You need a measure, and this measure has to be more or less additive in
order to allow quantification. Information theory is no help here; it is
not even on the subject.

> Personally, I'm more interested in the possibility of analysing the code base 
> directly (because I'm a programmer, not a user, I suppose).  I've got too many 
> projects on hand already, but it's tempting to go find some chunk of software 
> where the source history is available, and which is commonly supposed to have 
> bloated (the Linux kernel, perhaps, or the JRE, or even just *IX "cat") and do 
> some modelling and analysis.  (If only the gzip hack I mentioned earlier.)  I'd 
> need some examples of code that has grown or changed /without/ bloating too, of 
> course -- not quite so easy to think of candidates ;-)

Yes, much (all?) code migrates as the libraries change. Even kernel code
does because the hardware changes too. To have a model invariant to this...
You already know how sceptical I am.
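
One possible reading of the "gzip hack" referred to in the quoted text (its original description is not in this part of the thread) is to track how compressible the source is from one release to the next. The sketch below assumes each version is available as a single string. `redundancy` and `redundancy_history` are hypothetical names. zlib only sees repetition inside a 32 KiB window, so a real code base would likely need a compressor with a larger window, such as the standard `lzma` module.

```python
import zlib


def redundancy(source):
    """Fraction of bytes removed by compression.

    Close to 1.0 means highly repetitive text. Values near 0.0 (or slightly
    negative for very short inputs, because of format overhead) mean the
    compressor found nothing to exploit.
    """
    raw = source.encode("utf-8")
    if not raw:
        return 0.0
    return 1.0 - len(zlib.compress(raw, 9)) / len(raw)


def redundancy_history(versions):
    """versions: iterable of (label, source_text) pairs, oldest first.

    Returns a list of (label, size_in_bytes, redundancy) tuples.
    """
    return [
        (label, len(text.encode("utf-8")), redundancy(text))
        for label, text in versions
    ]
```

Growth in size alone says little. The interesting case would be size and redundancy rising together across versions, and whether that counts as "bloat" remains the open question of this thread.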

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


#1543

From	hopcode <hopcode@invalid.de>
Date	2012-05-05 16:23 +0200
Message-ID	<jo3d62$5an$1@dont-email.me>
In reply to	#1540

On 05.05.2012 12:50, Dmitry A. Kazakov wrote:
> As for the "tools", these are just those of mathematical statistics and
> nothing else. It is applied mathematics, which by definition has no
> fundamental merit of its own. As for applying mathematical statistics to
> code analysis, I doubt it could be of any use here, because:
>
> 1. Properties of the code are not random. In the overwhelming majority of
> relevant cases it is all about the deterministic behavior of the program.
>
> 2. Human perception of the code as being bloated or not is not stochastic
> either.

You are treating the matter as an identity ;-)
In fact we just want to trace what the "bloat" is, and how it arises,
within that very deterministic behavior. Also, those mathematical tools
turn out to be useless when used in a biased way.

The deterministic behavior of a program
is a function of some well-known variables, for example the compiler
market, or the habit of using this toolchain instead of that one.

In exactly the same way, for natural languages the
information (as a useful acknowledgment) is a function of some other
well-known variables such as gesture recognition etc., things
belonging to semiology. But these are variables "without" time and space:
they are there, meaning something precise, yet concretely unutterable, as
if they were practically random in their significance!

Isn't it "random" that most people like C's toolchains?

To be concrete: an application that contains the same "things" and behaves
the same way as its counterpart on x86 is damaging on ARM,
because ARM, being low-power etc., does not like the same "bloat" that runs
on the x86 platform, which they say is useful. But it is not obvious
that ARM will force users to reduce those "bloated things" as used
on x86.

And now human perception comes onto the scene. Whether or not it is
stochastic, it is an instinctive guideline, not to be neglected.
In fact C's toolchains have been adapted to ARM for the sake of a
presumed *perception* of people used to C's toolchains. This is in
order to preserve x86 user habits on ARM, they say.
Consequently, when generating output for ARM, the same compiler
converts/hides and inserts/cuts/adapts a lot of behavior/information
automagically, they say. They.

I would like to treat the two points above as working hypotheses, not as
obviously accepted reasons/limits. Information theory does not seem to me
such a perfect branch. It may be extended, IMO.

Cheers,

-- 
.:mrk[hopcode]
   .:x64lab:.
  group http://groups.google.com/group/x64lab
  site http://sites.google.com/site/x64lab


#1545

From	"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Date	2012-05-05 17:43 +0200
Message-ID	<1c1b2jzia4k4m.d3v0gcvuts69$.dlg@40tude.net>
In reply to	#1543

On Sat, 05 May 2012 16:23:56 +0200, hopcode wrote:

> On 05.05.2012 12:50, Dmitry A. Kazakov wrote:
>> As for the "tools", these are just those of mathematical statistics and
>> nothing else. It is applied mathematics, which by definition has no
>> fundamental merit of its own. As for applying mathematical statistics to
>> code analysis, I doubt it could be of any use here, because:
>>
>> 1. Properties of the code are not random. In the overwhelming majority of
>> relevant cases it is all about the deterministic behavior of the program.
>>
>> 2. Human perception of the code as being bloated or not is not stochastic
>> either.
> 
> You are treating the matter as an identity ;-)
> In fact we just want to trace what the "bloat" is, and how it arises,
> within that very deterministic behavior. Also, those mathematical tools
> turn out to be useless when used in a biased way.

Statistics gets misused all the time. As I said, the burden of proof is on
the applicant's side. There is a set of axioms (the Kolmogorov axioms) that
a probability has to satisfy. Anybody who wants to apply probability
theory and the methods of mathematical statistics to program behavior,
human perception or whatever is obliged to show what the elementary events
are, in what way they are independent, random etc.

> Isn't it "random" that most people like C's toolchains?

Aren't you confusing "random" with "illogical"?

My pet hypothesis is that people's love of C is somehow related to
original sin. Though I must admit that my knowledge of theology is rather
superficial. (:-))

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


#1527

From	gremnebulin <peterdjones@yahoo.com>
Date	2012-05-03 09:27 -0700
Message-ID	<1dae75e0-2ddc-425f-99e4-3af9f7406926@k13g2000vbm.googlegroups.com>
In reply to	#1511

On Apr 29, 10:36 am, "Dmitry A. Kazakov" <mail...@dmitry-kazakov.de>
wrote:

> Another is the meaning of the message. For example, Pi is incomputable, but
> there is no problem to pass a message "Pi" to a recipient knowing what Pi
> is. Is Pi complex? A meaningless question.

Pi is computable. You could pass a finite string of code for computing
Pi to a recipient as well. Check out Chaitin and Kolmogorov.
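
To make "a finite string of code for computing Pi" concrete, here is a small sketch using Machin's formula, pi = 16 arctan(1/5) - 4 arctan(1/239), with plain integer arithmetic. It is one of many possible programs and not one proposed in this thread. `arctan_inv`, `pi_digits` and the `guard` parameter are illustrative choices, and the fixed number of guard digits is assumed to be enough for modest n rather than backed by a rigorous error bound.

```python
def arctan_inv(x, unity):
    """Return arctan(1/x) scaled by unity, using the Gregory series.

    arctan(1/x) = 1/x - 1/(3 x^3) + 1/(5 x^5) - ...
    All arithmetic is on Python integers, so precision is set by unity.
    """
    term = unity // x          # unity / x^(2k+1), starting with k = 0
    total = term
    x2 = x * x
    n = 1
    sign = -1
    while term:
        term //= x2
        n += 2
        total += sign * (term // n)
        sign = -sign
    return total


def pi_digits(n, guard=10):
    """Return pi truncated to n decimal places, as a string like '3.1415'."""
    unity = 10 ** (n + guard)  # extra guard digits absorb truncation error
    scaled = 16 * arctan_inv(5, unity) - 4 * arctan_inv(239, unity)
    digits = str(scaled // 10 ** guard)
    return digits[0] + "." + digits[1:]
```

The program text is fixed and finite, yet it delivers as many digits as are asked for. That is the sense in which the usual (Turing machine) definition calls a real number computable.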


#1528

From	"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Date	2012-05-03 18:50 +0200
Message-ID	<3leyi3uyxhlh$.vl287d3q1va2.dlg@40tude.net>
In reply to	#1527

On Thu, 3 May 2012 09:27:22 -0700 (PDT), gremnebulin wrote:

> On Apr 29, 10:36 am, "Dmitry A. Kazakov" <mail...@dmitry-kazakov.de>
> wrote:
> 
>> Another is the meaning of the message. For example, Pi is incomputable, but
>> there is no problem to pass a message "Pi" to a recipient knowing what Pi
>> is. Is Pi complex? A meaningless question.
> 
> Pi is computable.

Not its decimal representation by a FSM.

> You could pass a finite string of code for computing Pi
> to a recipient as well.

I already did, quoting myself: "Pi."

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


#1531

From	Willem <willem@toad.stack.nl>
Date	2012-05-04 13:52 +0000
Message-ID	<slrnjq7np3.280u.willem@toad.stack.nl>
In reply to	#1528

Dmitry A. Kazakov wrote:
) On Thu, 3 May 2012 09:27:22 -0700 (PDT), gremnebulin wrote:
)> Pi is computable.
)
) Not its decimal representation by a FSM.

'computable' has a specific mathematical definition,
by which pi is computable.


SaSW, Willem
-- 
Disclaimer: I am in no way responsible for any of the statements
            made in the above text. For all I know I might be
            drugged or something..
            No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT


#1532

From	"Dmitry A. Kazakov" <mailbox@dmitry-kazakov.de>
Date	2012-05-04 16:05 +0200
Message-ID	<1saien0an92og.iuio4t54i82a$.dlg@40tude.net>
In reply to	#1531

On Fri, 4 May 2012 13:52:35 +0000 (UTC), Willem wrote:

> Dmitry A. Kazakov wrote:
> ) On Thu, 3 May 2012 09:27:22 -0700 (PDT), gremnebulin wrote:
> )> Pi is computable.
> )
> ) Not its decimal representation by a FSM.
> 
> 'computable' has a specific mathematical definition,
> by which pi is computable.

That definition requires specification of a formal computation model.

-- 
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de


#1534

From	hopcode <hopcode@invalid.de>
Date	2012-05-04 20:44 +0200
Message-ID	<jo182e$93n$1@dont-email.me>
In reply to	#1532

On 04.05.2012 16:05, Dmitry A. Kazakov wrote:
> On Fri, 4 May 2012 13:52:35 +0000 (UTC), Willem wrote:
>
>> >  Dmitry A. Kazakov wrote:
>> >  ) On Thu, 3 May 2012 09:27:22 -0700 (PDT), gremnebulin wrote:
>> >  )>  Pi is computable.
>> >  )
>> >  ) Not its decimal representation by a FSM.
>> >
>> >  'computable' has a specific mathematical definition,
>> >  by which pi is computable.
> That definition requires specification of a formal computation model.

pi is not computable. IIRC from school, pi is a real number;
it has an infinite number of decimal digits, just like
the result of the division 10/3.
The digits of pi may be countable (enumerable?). The count of its digits
after the integral part is a function of the limits/resources/algorithm
implemented for it on a Turing machine, i.e., broadly speaking, bound to a
computation model.

But please, don't forget the subject, which is interesting IMHO:

"...apply information theory to source code to quantitatively determine 
if code is bloated or not"

Cheers,

-- 
.:mrk[hopcode]
   .:x64lab:.
  group http://groups.google.com/group/x64lab
  site http://sites.google.com/site/x64lab


#1535

From	Willem <willem@toad.stack.nl>
Date	2012-05-04 20:32 +0000
Message-ID	<slrnjq8f72.upi.willem@toad.stack.nl>
In reply to	#1534

hopcode wrote:
) On 04.05.2012 16:05, Dmitry A. Kazakov wrote:
)> On Fri, 4 May 2012 13:52:35 +0000 (UTC), Willem wrote:
)>
)>> >  Dmitry A. Kazakov wrote:
)>> >  ) On Thu, 3 May 2012 09:27:22 -0700 (PDT), gremnebulin wrote:
)>> >  )>  Pi is computable.
)>> >  )
)>> >  ) Not its decimal representation by a FSM.
)>> >
)>> >  'computable' has a specific mathematical definition,
)>> >  by which pi is computable.
)> That definition requires specification of a formal computation model.
)
) pi is not computable.

Again: According to mathematicians, it *is* computable.


SaSW, Willem
-- 
Disclaimer: I am in no way responsible for any of the statements
            made in the above text. For all I know I might be
            drugged or something..
            No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT
