From: Eric Sosman
Newsgroups: comp.lang.java.programmer
Subject: Re: capitalize problem
Date: Wed, 07 Sep 2011 21:51:00 -0400

On 9/7/2011 8:21 PM, bob wrote:
> What's the easiest way to capitalize the first letter of every word in
> a string and lowercase the rest?

"Easiest" seems very important to you. Are you afraid of work?

First, decide what you mean by "word." This is not as trivial as it
may appear: How many "words" are there in "e.g.", for instance, or in
"soi-disant?"

Let's take an easy-to-implement definition of "word:" A "word" for our
purposes is defined as a sequence of adjacent alphabetic characters
beginning at the start of the String or after a non-alphabetic
character. This definition is far from satisfactory, as you'll find if
you count the "words" in

    What's wrong with this definition?

Most people would say "five," but the definition says "six" (the third
word being "wrong"). If you're all right with that, fine -- if not,
you'll have to come up with a better definition, and an implementation
to match.

Okay. There's our inadequate definition, now on to the code.
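The six-versus-five count can be checked mechanically. What follows is only a minimal sketch, assuming the definition above is taken literally and that "alphabetic" means whatever Character.isLetter() accepts; the names WordCount, countWords and prevWasLetter are made up for illustration.

```java
class WordCount {
    // Counts "words" under the definition above: a run of adjacent letters
    // that begins at the start of the String or right after a non-letter.
    static int countWords(String s) {
        int count = 0;
        boolean prevWasLetter = false; // hypothetical helper flag
        for (int i = 0; i < s.length(); ++i) {
            boolean isLetter = Character.isLetter(s.charAt(i));
            if (isLetter && !prevWasLetter) {
                ++count; // a letter with no letter before it starts a word
            }
            prevWasLetter = isLetter;
        }
        return count;
    }

    public static void main(String[] args) {
        // The apostrophe splits "What's" into "What" and "s".
        System.out.println(countWords("What's wrong with this definition?")); // prints 6
    }
}
```

By the same rule "e.g." and "soi-disant" each count as two words, which is one way to see how crude the definition is.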
Let's see: We want to modify (potentially) a String, but Strings are
immutable, so let's get it into a mutable form to start with:

    String orig = "What's wrong with THIS definition?";
    StringBuilder buff = new StringBuilder(orig);

Our plan is to scan through buff, character by character, and modify
each as needed. Our definition has a position-dependent aspect
("beginning at the start of the String or after a non-alphabetic
character"), so when we arrive at a character we'll need some notion of
what came before it. There are only two possibilities: This character
either might be the start of a word, or there is no way it could be;
the either-or suggests using a boolean:

    boolean couldStartWord;

That's sort of incomplete, because it's almost always the case that a
variable should be initialized at the point of declaration. Let's see:
We'll sweep across buff from left to right, and could the very first
character start a word? Yes, it could, so instead:

    boolean couldStartWord = true;

Okay, we're now ready to examine characters. We know how to look at
all the characters in buff from left to right, with a loop like:

    for (int i = 0; i < buff.length(); ++i) {
        char ch = buff.charAt(i);
        // to be determined
    }

What goes inside the loop? There are many ways it might be laid out,
but certainly we'll need to decide whether a character is or is not
alphabetic -- that's part of our definition of "word." So let's ask
the Character class about each character we find:

    for (int i = 0; i < buff.length(); ++i) {
        char ch = buff.charAt(i);
        if (Character.isLetter(ch)) {
            // to be determined
        } else {
            // to be determined
        }
    }

Suppose we've found a letter. Is it at the start of a "word," or
somewhere later? Let's ask our boolean! And after having seen a
letter, is it possible that the next character (if there is one) could
be the start of a word? No! So we have

    if (Character.isLetter(ch)) {
        if (couldStartWord) {
            // to be determined
        } else {
            // to be determined
        }
        couldStartWord = false;
    } ...
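To watch the flag at work before the blanks are filled in, here is a small sketch that changes nothing and only marks the letters found while couldStartWord is true. It assumes the non-letter branch sets the flag back to true, which is what the definition ("after a non-alphabetic character") asks for; WordStartTrace and markWordStarts are made-up names.

```java
class WordStartTrace {
    // Returns a String as long as s, with '^' under every letter that
    // starts a "word" under the definition, and ' ' everywhere else.
    static String markWordStarts(String s) {
        StringBuilder marks = new StringBuilder(s.length());
        boolean couldStartWord = true; // the first character could start a word
        for (int i = 0; i < s.length(); ++i) {
            char ch = s.charAt(i);
            if (Character.isLetter(ch)) {
                marks.append(couldStartWord ? '^' : ' ');
                couldStartWord = false; // the next character cannot start a word
            } else {
                marks.append(' ');
                couldStartWord = true;  // assumption: a non-letter re-arms the flag
            }
        }
        return marks.toString();
    }

    public static void main(String[] args) {
        String orig = "What's wrong with THIS definition?";
        System.out.println(orig);
        System.out.println(markWordStarts(orig));
    }
}
```

Printed one above the other, the two lines put a caret under each letter that is due for upper-casing, including the "s" after the apostrophe.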
What do we want to do with the letter at the start of a "word?"
Convert it to upper case. With later letters? Lower case. So now we
can flesh out those most recent two missing pieces, again using the
Character class to do the dirty work:

    if (Character.isLetter(ch)) {
        if (couldStartWord) {
            buff.setCharAt(i, Character.toUpperCase(ch));
        } else {
            buff.setCharAt(i, Character.toLowerCase(ch));
        }
        couldStartWord = false;
    } ...

Fine. Now what about non-letters, as detected in the outermost if
test? As for the character itself, we want to leave it alone. Having
seen a non-letter, is the next character (if there is one) eligible to
start a word? Yes. So we get

    for (int i = 0; i < buff.length(); ++i) {
        char ch = buff.charAt(i);
        if (Character.isLetter(ch)) {
            // done already
        } else {
            couldStartWord = true;
        }
    }

At the end, all we need to do is call buff.toString() to obtain the
modified String.

Are we done? NO! We've assembled pieces "from the ground up," and now
we should look "from the top down" for opportunities to regularize or
simplify what we've got, which is, at the moment:

    String orig = "What's wrong with THIS definition?";
    StringBuilder buff = new StringBuilder(orig);
    boolean couldStartWord = true;
    for (int i = 0; i < buff.length(); ++i) {
        char ch = buff.charAt(i);
        if (Character.isLetter(ch)) {
            if (couldStartWord) {
                buff.setCharAt(i, Character.toUpperCase(ch));
            } else {
                buff.setCharAt(i, Character.toLowerCase(ch));
            }
            couldStartWord = false;
        } else {
            couldStartWord = true;
        }
    }
    String result = buff.toString();

One obvious thing is that `couldStartWord = false' is being executed
even when the flag is already false: We can move that statement into
the first branch of the `if', because if the other branch is taken it's
unnecessary:

    if (Character.isLetter(ch)) {
        if (couldStartWord) {
            buff.setCharAt(i, Character.toUpperCase(ch));
            couldStartWord = false;
        } else {
            buff.setCharAt(i, Character.toLowerCase(ch));
        }
    } ...
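Put together with that statement moved, the pieces so far can be wrapped in a method and run. This is just a sketch of the in-place version as it stands at this point; the class name Capitalizer and the method name capitalizeWords are invented for the example.

```java
class Capitalizer {
    // Upper-cases the first letter of each "word" and lower-cases the
    // rest, editing a StringBuilder copy of the input in place.
    static String capitalizeWords(String orig) {
        StringBuilder buff = new StringBuilder(orig);
        boolean couldStartWord = true;
        for (int i = 0; i < buff.length(); ++i) {
            char ch = buff.charAt(i);
            if (Character.isLetter(ch)) {
                if (couldStartWord) {
                    buff.setCharAt(i, Character.toUpperCase(ch));
                    couldStartWord = false; // only needed on this branch
                } else {
                    buff.setCharAt(i, Character.toLowerCase(ch));
                }
            } else {
                couldStartWord = true; // non-letters are left untouched
            }
        }
        return buff.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("What's wrong with THIS definition?"));
        // prints What'S Wrong With This Definition?
    }
}
```

The capital S after the apostrophe is the inadequate definition showing through, not a bug in the loop.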
If we run this code a gazillion times, avoiding the unnecessary
execution will save us fourteen-point-two milliquivers. Not worth the
trouble in this case, but this is something to look for in others: Are
you doing something on both branches of an `if' that only actually
needs to be done on one of them? More generally, are you doing some
big complicated operation X on both branches, when you might be doing
simpler Y on one and Z on the other?

Another thing you might notice is that all letters get replaced, while
non-letters go untouched. Since we're thinking about "words" it seems
likely we expect the input to be "text," in which letters outnumber
non-letters by a sizable margin. So maybe instead of putting all the
characters into buff only to overwrite most of them, it might make
sense to start with buff empty and add all characters (letters and
non-letters) one by one. It'd still be a small win to size buff
appropriately, so we get to

    String orig = "What's wrong with THIS definition?";
    StringBuilder buff = new StringBuilder(orig.length());
    boolean couldStartWord = true;
    // buff starts out empty, so the loop reads from orig
    for (int i = 0; i < orig.length(); ++i) {
        char ch = orig.charAt(i);
        if (Character.isLetter(ch)) {
            if (couldStartWord) {
                ch = Character.toUpperCase(ch);
                couldStartWord = false;
            } else {
                ch = Character.toLowerCase(ch);
            }
        } else {
            couldStartWord = true;
        }
        buff.append(ch);
    }
    String result = buff.toString();

The loop now has the *EXTREMELY* useful form

    for (each element of the source) {
        transform the element somehow
        emit the transformed element to the destination
    }

This is a form you will encounter every day, over and over and over
again. And there's a reason for its ubiquity: It's freakin' useful,
and freakin' powerful. Learn to recognize and use it.

(We now see that the variable `i' is almost useless, and wish we could
iterate over the String with `for (char ch : orig)' -- but we can't, so
we sigh and move on.
Calling orig.toCharArray() would allow us to use that form of `for',
but the cure is worse than the disease.)

Further refinements are probably possible -- but, hey: This all started
with a quest for the "easiest," and I wouldn't want to overtax you.

-- 
Eric Sosman
esosman@ieee-dot-org.invalid