From: Eric Sosman <esosman@ieee-dot-org.invalid>
Newsgroups: comp.lang.java.programmer
Subject: Re: capitalize problem
Date: Wed, 07 Sep 2011 21:51:00 -0400

On 9/7/2011 8:21 PM, bob wrote:
> What's the easiest way to capitalize the first letter of every word in
> a string and lowercase the rest?

     "Easiest" seems very important to you.  Are you afraid of work?

     First, decide what you mean by "word."  This is not as trivial
as it may appear: How many "words" are there in "e.g.", for instance,
or in "soi-disant?"

     Let's take an easy-to-implement definition of "word:" A "word"
for our purposes is defined as a sequence of adjacent alphabetic
characters beginning at the start of the String or after a
non-alphabetic character.  This definition is far from satisfactory,
as you'll find if you count the "words" in

	What's wrong with this definition?

Most people would say "five," but the definition says "six" (the
third word being "wrong").  If you're all right with that, fine --
if not, you'll have to come up with a better definition, and an
implementation to match.
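
     If you'd like to check that count rather than take my word for
it, a small sketch along these lines will do.  The class WordCounter
and the method countWords are names made up for illustration; the
only assumption is that "alphabetic" means whatever
Character.isLetter() says it means.

```java
// Hypothetical helper (not part of the original post): counts "words"
// under the letters-only definition, i.e. maximal runs of letters.
final class WordCounter {
    static int countWords(String text) {
        int count = 0;
        boolean inWord = false;   // true while we are inside a run of letters
        for (int i = 0; i < text.length(); ++i) {
            if (Character.isLetter(text.charAt(i))) {
                if (!inWord) {
                    ++count;      // a new run of letters begins here
                    inWord = true;
                }
            } else {
                inWord = false;   // any non-letter ends the current run
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // The apostrophe splits "What's" into "What" and "s", so this prints 6.
        System.out.println(countWords("What's wrong with this definition?"));
    }
}
```

By the same rule "e.g." and "soi-disant" each count as two "words."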

     Okay.  There's our inadequate definition, now on to the code.
Let's see: We want to modify (potentially) a String, but Strings
are immutable, so let's get it into a mutable form to start with:

	String orig = "What's wrong with THIS definition?";
	StringBuilder buff = new StringBuilder(orig);

Our plan is to scan through buff, character by character, and modify
each as needed.

     Our definition has a position-dependent aspect ("beginning at
the start of the String or after a non-alphabetic character"), so
when we arrive at a character we'll need some notion of what came
before it.  There are only two possibilities: This character either
might be the start of a word, or there is no way it could be; the
either-or suggests using a boolean:

	boolean couldStartWord;

That's sort of incomplete, because it's almost always the case that
a variable should be initialized at the point of declaration.  Let's
see: We'll sweep across buff from left to right, and could the very
first character start a word?  Yes, it could, so instead:

	boolean couldStartWord = true;

     Okay, we're now ready to examine characters.  We know how to look
at all the characters in buff from left to right, with a loop like:

	for (int i = 0; i < buff.length(); ++i) {
	    char ch = buff.charAt(i);
	    // to be determined
	}

     What goes inside the loop?  There are many ways it might be
laid out, but certainly we'll need to decide whether a character is
or is not alphabetic -- that's part of our definition of "word."  So
let's ask the Character class about each character we find:

	for (int i = 0; i < buff.length(); ++i) {
	    char ch = buff.charAt(i);
	    if (Character.isLetter(ch)) {
	        // to be determined
	    } else {
	        // to be determined
	    }
	}

     Suppose we've found a letter.  Is it at the start of a "word,"
or somewhere later?  Let's ask our boolean!  And after having seen
a letter, is it possible that the next character (if there is one)
could be the start of a word?  No!  So we have

	if (Character.isLetter(ch)) {
	    if (couldStartWord) {
	        // to be determined
	    } else {
	        // to be determined
	    }
	    couldStartWord = false;
	} ...

     What do we want to do with the letter at the start of a "word?"
Convert it to upper case.  With later letters?  Lower case.  So now
we can flesh out those most recent two missing pieces, again using
the Character class to do the dirty work:

	if (Character.isLetter(ch)) {
	    if (couldStartWord) {
	        buff.setCharAt(i, Character.toUpperCase(ch));
	    } else {
	        buff.setCharAt(i, Character.toLowerCase(ch));
	    }
	    couldStartWord = false;
	} ...

     Fine.  Now what about non-letters, as detected in the outermost
if test?  As for the character itself, we want to leave it alone.
Having seen a non-letter, is the next character (if there is one)
eligible to start a word?  Yes.  So we get

	for (int i = 0; i < buff.length(); ++i) {
	    char ch = buff.charAt(i);
	    if (Character.isLetter(ch)) {
	        // done already
	    } else {
	        couldStartWord = true;
	    }
	}

     At the end, all we need to do is call buff.toString() to obtain
the modified String.

     Are we done?  NO!  We've assembled pieces "from the ground up,"
and now we should look "from the top down" for opportunities to
regularize or simplify what we've got, which is, at the moment:

	String orig = "What's wrong with THIS definition?";
	StringBuilder buff = new StringBuilder(orig);
	boolean couldStartWord = true;
	for (int i = 0; i < buff.length(); ++i) {
	    char ch = buff.charAt(i);
	    if (Character.isLetter(ch)) {
	        if (couldStartWord) {
	            buff.setCharAt(i, Character.toUpperCase(ch));
	        } else {
	            buff.setCharAt(i, Character.toLowerCase(ch));
	        }
	        couldStartWord = false;
	    } else {
	        couldStartWord = true;
	    }
	}
	String result = buff.toString();

     One obvious thing is that `couldStartWord = false' is being
executed even when the flag is already false: We can move that
statement into the first branch of the `if', because if the other
branch is taken it's unnecessary:

	if (Character.isLetter(ch)) {
	    if (couldStartWord) {
	        buff.setCharAt(i, Character.toUpperCase(ch));
	        couldStartWord = false;
	    } else {
	        buff.setCharAt(i, Character.toLowerCase(ch));
	    }
	} ...

If we run this code a gazillion times, avoiding the unnecessary
execution will save us fourteen-point-two milliquivers.  Not worth
the trouble in this case, but this is something to look for in
others: Are you doing something on both branches of an `if' that
only actually needs to be done on one of them?  More generally,
are you doing some big complicated operation X on both branches,
when you might be doing simpler Y on one and Z on the other?

     Another thing you might notice is that all letters get replaced,
while non-letters go untouched.  Since we're thinking about "words"
it seems likely we expect the input to be "text," in which letters
outnumber non-letters by a sizable margin.  So maybe instead of
putting all the characters into buff only to overwrite most of them,
it might make sense to start with buff empty and add all the characters
(letters and non-letters) one by one.  It'd still be a small win to
size buff appropriately, so we get to

	String orig = "What's wrong with THIS definition?";
	StringBuilder buff = new StringBuilder(orig.length());
	boolean couldStartWord = true;
	for (int i = 0; i < orig.length(); ++i) {
	    char ch = orig.charAt(i);
	    if (Character.isLetter(ch)) {
	        if (couldStartWord) {
	            ch = Character.toUpperCase(ch);
	            couldStartWord = false;
	        } else {
	            ch = Character.toLowerCase(ch);
	        }
	    } else {
	        couldStartWord = true;
	    }
	    buff.append(ch);
	}
	String result = buff.toString();
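
     If you'd rather have something you can paste into a file and
run, here is one way to package the loop as a method.  Treat it as a
sketch: the class name Capitalizer and the method name capitalizeWords
are inventions for illustration, and it still rests on the same
inadequate definition of "word."

```java
// Hypothetical packaging of the loop above; names are illustrative only.
final class Capitalizer {
    static String capitalizeWords(String orig) {
        // Start empty, but pre-sized so the builder never has to grow.
        StringBuilder buff = new StringBuilder(orig.length());
        boolean couldStartWord = true;   // the first character may start a word
        for (int i = 0; i < orig.length(); ++i) {
            char ch = orig.charAt(i);
            if (Character.isLetter(ch)) {
                if (couldStartWord) {
                    ch = Character.toUpperCase(ch);
                    couldStartWord = false;
                } else {
                    ch = Character.toLowerCase(ch);
                }
            } else {
                couldStartWord = true;   // a non-letter ends the current word
            }
            buff.append(ch);
        }
        return buff.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("What's wrong with THIS definition?"));
    }
}
```

The sample String comes out as "What'S Wrong With This Definition?"
-- note the capital S, courtesy of the apostrophe and our definition.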

     The loop now has the *EXTREMELY* useful form

	for (each element of the source) {
	    transform the element somehow
	    emit the transformed element to the destination
	}

This is a form you will encounter every day, over and over and
over again.  And there's a reason for its ubiquity: It's freakin'
useful, and freakin' powerful.  Learn to recognize and use it.
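
     To show that the form has nothing to do with capitalization as
such, here is a second, unrelated instance of it.  The task (replace
every digit with '#') and the names TransformEmit and maskDigits are
made up purely for illustration.

```java
// Hypothetical example of the same transform-and-emit loop shape.
final class TransformEmit {
    static String maskDigits(String source) {
        StringBuilder dest = new StringBuilder(source.length());
        for (int i = 0; i < source.length(); ++i) {
            char ch = source.charAt(i);    // each element of the source
            if (Character.isDigit(ch)) {
                ch = '#';                  // transform the element somehow
            }
            dest.append(ch);               // emit it to the destination
        }
        return dest.toString();
    }

    public static void main(String[] args) {
        System.out.println(maskDigits("call 555-0100"));   // prints "call ###-####"
    }
}
```

Only the middle of the loop body changed; the skeleton around it is
the same.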

     (We now see that the variable `i' is almost useless, and wish
we could iterate over the String with `for (char ch : orig)' --
but we can't, so we sigh and move on.  Calling orig.toCharArray()
would allow us to use that form of `for', but the cure is worse
than the disease.)
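
     For the curious, the toCharArray() variant would look roughly
like the sketch below.  The class name ForEachCapitalizer is invented
for illustration.  The loop header is prettier, but toCharArray()
allocates a fresh char[] holding a copy of every character in the
String, just so we can avoid writing `i'.

```java
// Hypothetical variant using the enhanced for loop over a copied char[].
final class ForEachCapitalizer {
    static String capitalizeWords(String orig) {
        StringBuilder buff = new StringBuilder(orig.length());
        boolean couldStartWord = true;
        for (char ch : orig.toCharArray()) {   // copies orig's characters first
            if (Character.isLetter(ch)) {
                if (couldStartWord) {
                    ch = Character.toUpperCase(ch);
                    couldStartWord = false;
                } else {
                    ch = Character.toLowerCase(ch);
                }
            } else {
                couldStartWord = true;
            }
            buff.append(ch);
        }
        return buff.toString();
    }
}
```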

     Further refinements are probably possible -- but, hey: This all
started with a quest for the "easiest," and I wouldn't want to overtax
you.

-- 
Eric Sosman
esosman@ieee-dot-org.invalid