From: Chris Angelico
Newsgroups: comp.lang.python
Subject: Re: hex dump w/ or w/out utf-8 chars
Date: Tue, 9 Jul 2013 04:07:22 +1000

On Tue, Jul 9, 2013 at 3:53 AM, wrote:
>>> All characters are UTF-8, characters. "a" is a UTF-8 character. So is "ă".
> Not using python 3, for me (a programmer which was present at the beginning of
> computer science, badly interacting with many languages from assembler to
> Fortran and from c to Pascal and so on) it was an hard job to arrange the
> abrupt transition from characters only equal to bytes to some special
> characters defined with 2, 3 bytes and even more.

Even back then, bytes and characters were different. 'A' is a
character, 0x41 is a byte. And they correspond 1:1 if and only if you
know that your characters are represented in ASCII. Other encodings
(e.g. EBCDIC) mapped things differently. The only difference now is
that more people are becoming aware that there are more than 256
characters in the world.

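
The distinction can be sketched in a few lines of Python 3. This is only an illustration, not anything from the thread: hex_dump is a hypothetical helper name, and cp037 is assumed here as a representative EBCDIC code page, since no particular one is named above.

```python
# Python 3: str holds characters (code points), bytes holds raw byte values.
# The same character becomes different bytes depending on the encoding.

def hex_dump(data: bytes) -> str:
    """Return the bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02x}" for b in data)

# 'A' is a character; it only becomes the byte 0x41 once an
# ASCII-compatible encoding is chosen.
ascii_bytes = "A".encode("ascii")

# Assumption: cp037 stands in for "EBCDIC"; other EBCDIC code pages exist.
ebcdic_bytes = "A".encode("cp037")

# U+0103 (a with breve) is one character but needs two bytes in UTF-8.
utf8_bytes = "ă".encode("utf-8")

print("A  as ASCII :", hex_dump(ascii_bytes))
print("A  as cp037 :", hex_dump(ebcdic_bytes))
print("ă  as UTF-8 :", hex_dump(utf8_bytes))
print("characters:", len("ă"), "bytes:", len(utf8_bytes))
```

Decoding goes the other way: utf8_bytes.decode("utf-8") gives back the one-character string, but only because the encoding is known. Decode the same two bytes as Latin-1 and you get two unrelated characters instead.
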
Like Magic 2014 and its treatment of Slivers, at some point you're
going to have to master the difference between bytes and characters,
or else be eternally hacking around stuff in your code, so now is as
good a time as any.

ChrisA