Path: csiph.com!usenet.pasdenom.info!gegeweb.org!de-l.enfer-du-nord.net!feeder1.enfer-du-nord.net!feeds.phibee-telecom.net!newsfeed.xs4all.nl!newsfeed2.news.xs4all.nl!xs4all!newsgate.cistron.nl!newsgate.news.xs4all.nl!post.news.xs4all.nl!not-for-mail Return-Path: X-Original-To: python-list@python.org Delivered-To: python-list@mail.python.org X-Spam-Status: OK 0.000 X-Spam-Evidence: '*H*': 1.00; '*S*': 0.00; 'algorithm': 0.04; 'encoding': 0.05; 'output': 0.05; 'pop': 0.05; 'subject:Python': 0.06; 'compiler': 0.07; 'explicit': 0.07; 'failing': 0.07; 'variables': 0.07; '#define': 0.09; '++i)': 0.09; 'aliases': 0.09; 'differing': 0.09; 'executed': 0.09; 'imply': 0.09; 'logic': 0.09; 'pointers': 0.09; 'sub': 0.09; 'typed': 0.09; 'worse': 0.09; 'xor': 0.09; 'template': 0.14; 'changes': 0.15; '-16': 0.16; '-20': 0.16; '2-4': 0.16; 'assembler': 0.16; 'character.': 0.16; 'cmp': 0.16; 'did,': 0.16; 'dl,': 0.16; 'dword': 0.16; 'eax': 0.16; 'eax,': 0.16; 'ebp': 0.16; 'ebp,': 0.16; 'ecx': 0.16; 'ecx,': 0.16; 'edx': 0.16; 'edx,': 0.16; 'encodings': 0.16; 'endp': 0.16; 'esi': 0.16; 'esi,': 0.16; 'esp': 0.16; 'esp,': 0.16; 'jmp': 0.16; 'jne': 0.16; 'lea': 0.16; 'less,': 0.16; 'loop.': 0.16; 'mov': 0.16; 'pointers,': 0.16; 'proc': 0.16; 'ptr': 0.16; 'received:74.208.4.195': 0.16; 'registers.': 0.16; 'regression': 0.16; 'segment': 0.16; 'template,': 0.16; 'index': 0.16; 'so.': 0.16; 'wrote:': 0.18; 'code.': 0.18; 'looked': 0.18; 'seems': 0.21; 'appears': 0.22; 'coding': 0.22; 'putting': 0.22; 'python?': 0.22; 'separate': 0.22; 'tests': 0.22; 'header:User- Agent:1': 0.23; "aren't": 0.24; 'byte': 0.24; 'comparing': 0.24; 'either.': 0.24; 'decide': 0.24; 'looks': 0.24; 'sort': 0.25; 'equivalent': 0.26; 'push': 0.26; 'switch': 0.26; 'post': 0.26; 'least': 0.26; 'subject:/': 0.26; 'values': 0.27; 'header:In- Reply-To:1': 0.27; 'am,': 0.29; 'operations,': 0.30; "i'm": 0.30; 'code': 0.31; '3.2': 0.31; 'coded': 0.31; 'comparison': 0.31; 'disabled': 0.31; 'gcc': 0.31; 'values.': 0.31; 'void': 0.31; 'probably': 0.32; 'checking': 0.33; 'raw': 0.33; 'skip:_ 10': 0.34; "i'd": 0.34; 'test': 0.35; 'but': 0.35; 'add': 0.35; 'there': 0.35; 'c++': 0.36; 'data,': 0.36; 'doing': 0.36; 'two': 0.37; 'being': 0.38; 'checks': 0.38; 'version,': 0.38; 'to:addr :python-list': 0.38; 'rather': 0.38; 'that,': 0.38; 'short': 0.38; 'to:addr:python.org': 0.39; 'changed': 0.39; 'either': 0.39; 'skip:p 20': 0.39; 'how': 0.40; 'even': 0.60; 'skip:u 10': 0.60; 'dave': 0.60; 'full': 0.61; 'skip:* 10': 0.61; 'simply': 0.61; 'first': 0.61; 'back': 0.62; 'times': 0.62; 'kind': 0.63; 'different': 0.65; 'here': 0.66; 'between': 0.67; 'received:74.208': 0.68; 'registers': 0.68; 'caused': 0.69; 'specialize': 0.74; 'hand': 0.80; 'skip:$ 10': 0.81; 'case?': 0.84; 'compiling': 0.84; "it'd": 0.84; 'preventing': 0.84; 'subject:long': 0.84 Date: Wed, 03 Apr 2013 07:52:42 -0400 From: Dave Angel User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130308 Thunderbird/17.0.4 MIME-Version: 1.0 To: python-list@python.org Subject: Re: Performance of int/long in Python 3 References: <87dff083-14d8-4163-89f3-d78a9be6c802@c15g2000vbl.googlegroups.com> <3qadncD4-6fcPsbMnZ2dnUVZ_rqdnZ2d@westnet.com.au> <515bbedb$0$29891$c3e8da3$5496439d@news.astraweb.com> In-Reply-To: Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit X-Provags-ID: V02:K0:2HZ/KdGk+xKEiHAV4GT8Gst1+iOKf0WeWsP4ZnD2kpB 1D8fDwSHt4cLtZzqBNxxYrqLfbAZ0g9bdcHbTXfa1xCULXgqQR wsLb5PSp3VPUWcfebXMRNFUj3Ke8lnNanpNdWKOyPtu3m5V/ib tk75NYsOUad2TFAKqaJtnMQnmDNBWsyMH0BrBe9xUS21KEM9jK MWqV3jNJavIDKsoV5MKZLuOtQGB/i++ZjHdakHaHB2F55FOiSw osANYVM5l9CPa6R8setQwL1cnjixlsnlkaflFMCLohO9nEg0XS a8icONqHnyQ5EsTGzKuJQioK6BKRds1kwPG/AIVIxqMZLTPqA= = X-BeenThere: python-list@python.org X-Mailman-Version: 2.1.15 Precedence: list List-Id: General discussion list for the Python programming language List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Newsgroups: comp.lang.python Message-ID: Lines: 319 NNTP-Posting-Host: 2001:888:2000:d::a6 X-Trace: 1364989980 news.xs4all.nl 6855 [2001:888:2000:d::a6]:39684 X-Complaints-To: abuse@xs4all.nl Xref: csiph.com comp.lang.python:42656 On 04/03/2013 07:05 AM, Neil Hodgson wrote: > Dave Angel: > >> That would seem to imply that the speed regression on your data is NOT >> caused by the differing size encodings. Perhaps it is the difference in >> MSC compiler version, or other changes made between 3.2 and 3.3 > > Its not caused by there actually being different size encodings but > that the code is checking encoding size 2-4 times for each character. > > Back in 3.2 the comparison loop looked like: > > while (len1 > 0 && len2 > 0) { > Py_UNICODE c1, c2; > > c1 = *s1++; > c2 = *s2++; > > if (c1 != c2) > return (c1 < c2) ? -1 : 1; > > len1--; len2--; > } > > For 3.3 this has changed to > > for (i = 0; i < len1 && i < len2; ++i) { > Py_UCS4 c1, c2; > c1 = PyUnicode_READ(kind1, data1, i); > c2 = PyUnicode_READ(kind2, data2, i); > > if (c1 != c2) > return (c1 < c2) ? -1 : 1; > } > > with PyUnicode_READ being > > #define PyUnicode_READ(kind, data, index) \ > ((Py_UCS4) \ > ((kind) == PyUnicode_1BYTE_KIND ? \ > ((const Py_UCS1 *)(data))[(index)] : \ > ((kind) == PyUnicode_2BYTE_KIND ? \ > ((const Py_UCS2 *)(data))[(index)] : \ > ((const Py_UCS4 *)(data))[(index)] \ > ) \ > )) > > There are either 1 or 2 kind checks in each call to PyUnicode_READ > and 2 calls to PyUnicode_READ inside the loop. A compiler may decide to > move the kind checks out of the loop and specialize the loop but MSVC > 2010 appears to not do so. I don't know how good MSC's template logic is, but it seems this would be a good case for an explicit template, typed on the 'kind's values. Or are all C++ features disabled when compiling Python? Failing that, just code up 9 cases, and do a switch on the kinds. I'm also puzzled. I thought that the sort algorithm used a hash of all the items to be sorted, and only reverted to a raw comparison of the original values when the hash collided. Is that not the case? Or is the code you post here only used when the hash collides? The assembler (32-bit build) for each > PyUnicode_READ looks like > > mov ecx, DWORD PTR _kind1$[ebp] > cmp ecx, 1 > jne SHORT $LN17@unicode_co@2 > lea ecx, DWORD PTR [ebx+eax] > movzx edx, BYTE PTR [ecx+edx] > jmp SHORT $LN16@unicode_co@2 > $LN17@unicode_co@2: > cmp ecx, 2 > jne SHORT $LN15@unicode_co@2 > movzx edx, WORD PTR [ebx+edi] > jmp SHORT $LN16@unicode_co@2 > $LN15@unicode_co@2: > mov edx, DWORD PTR [ebx+esi] > $LN16@unicode_co@2: It appears that the compiler is keeping the three pointers in three separate registers (eax, esi and edi) even though those are 3 aliases for the same pointer. This is preventing it from putting other values in those registers. It'd probably do better if the C code manipulated the pointers, rather than using an index i each time. But if it did, perhaps gcc would generate worse code. If I were coding the assembler by hand (Intel only), I'd be able to avoid the multiple cmp operations, simply by comparing first to 2, then doing a jne and a ja. I dunno whether the compiler would notice if I coded the equivalent in C. (make both comparisons to 2, one for less, and one for more) > > The kind1/kind2 variables aren't even going into registers and at > least one test+branch and a jump are executed for every character. Two > tests for 2 and 4 byte kinds. len1 and len2 don't get to go into > registers either. > > Here's the full assembler output for unicode_compare: > > ; COMDAT _unicode_compare > _TEXT SEGMENT > _kind2$ = -20 ; size = 4 > _kind1$ = -16 ; size = 4 > _len2$ = -12 ; size = 4 > _len1$ = -8 ; size = 4 > _data2$ = -4 ; size = 4 > _unicode_compare PROC ; COMDAT > ; _str1$ = ecx > ; _str2$ = eax > > ; 10417: { > > push ebp > mov ebp, esp > sub esp, 20 ; 00000014H > push ebx > push esi > mov esi, eax > > ; 10418: int kind1, kind2; > ; 10419: void *data1, *data2; > ; 10420: Py_ssize_t len1, len2, i; > ; 10421: > ; 10422: kind1 = PyUnicode_KIND(str1); > > mov eax, DWORD PTR [ecx+16] > mov edx, eax > shr edx, 2 > and edx, 7 > push edi > mov DWORD PTR _kind1$[ebp], edx > > ; 10423: kind2 = PyUnicode_KIND(str2); > > mov edx, DWORD PTR [esi+16] > mov edi, edx > shr edi, 2 > and edi, 7 > mov DWORD PTR _kind2$[ebp], edi > > ; 10424: data1 = PyUnicode_DATA(str1); > > test al, 32 ; 00000020H > je SHORT $LN9@unicode_co@2 > test al, 64 ; 00000040H > je SHORT $LN7@unicode_co@2 > lea ebx, DWORD PTR [ecx+24] > jmp SHORT $LN10@unicode_co@2 > $LN7@unicode_co@2: > lea ebx, DWORD PTR [ecx+36] > jmp SHORT $LN10@unicode_co@2 > $LN9@unicode_co@2: > mov ebx, DWORD PTR [ecx+36] > $LN10@unicode_co@2: > > ; 10425: data2 = PyUnicode_DATA(str2); > > test dl, 32 ; 00000020H > je SHORT $LN13@unicode_co@2 > test dl, 64 ; 00000040H > je SHORT $LN11@unicode_co@2 > lea edx, DWORD PTR [esi+24] > jmp SHORT $LN30@unicode_co@2 > $LN11@unicode_co@2: > lea eax, DWORD PTR [esi+36] > mov DWORD PTR _data2$[ebp], eax > mov edx, eax > jmp SHORT $LN14@unicode_co@2 > $LN13@unicode_co@2: > mov edx, DWORD PTR [esi+36] > $LN30@unicode_co@2: > mov DWORD PTR _data2$[ebp], edx > $LN14@unicode_co@2: > > ; 10426: len1 = PyUnicode_GET_LENGTH(str1); > > mov edi, DWORD PTR [ecx+8] > > ; 10427: len2 = PyUnicode_GET_LENGTH(str2); > > mov ecx, DWORD PTR [esi+8] > > ; 10428: > ; 10429: for (i = 0; i < len1 && i < len2; ++i) { > > xor eax, eax > mov DWORD PTR _len1$[ebp], edi > mov DWORD PTR _len2$[ebp], ecx > test edi, edi > jle SHORT $LN2@unicode_co@2 > > ; 10426: len1 = PyUnicode_GET_LENGTH(str1); > > mov esi, edx > mov edi, edx > > ; 10428: > ; 10429: for (i = 0; i < len1 && i < len2; ++i) { > > sub ebx, edx > jmp SHORT $LN4@unicode_co@2 > $LL28@unicode_co@2: > mov edx, DWORD PTR _data2$[ebp] > $LN4@unicode_co@2: > cmp eax, ecx > jge SHORT $LN29@unicode_co@2 > > ; 10430: Py_UCS4 c1, c2; > ; 10431: c1 = PyUnicode_READ(kind1, data1, i); > > mov ecx, DWORD PTR _kind1$[ebp] > cmp ecx, 1 > jne SHORT $LN17@unicode_co@2 > lea ecx, DWORD PTR [ebx+eax] > movzx edx, BYTE PTR [ecx+edx] > jmp SHORT $LN16@unicode_co@2 > $LN17@unicode_co@2: > cmp ecx, 2 > jne SHORT $LN15@unicode_co@2 > movzx edx, WORD PTR [ebx+edi] > jmp SHORT $LN16@unicode_co@2 > $LN15@unicode_co@2: > mov edx, DWORD PTR [ebx+esi] > $LN16@unicode_co@2: > > ; 10432: c2 = PyUnicode_READ(kind2, data2, i); > > mov ecx, DWORD PTR _kind2$[ebp] > cmp ecx, 1 > jne SHORT $LN21@unicode_co@2 > mov ecx, DWORD PTR _data2$[ebp] > movzx ecx, BYTE PTR [eax+ecx] > jmp SHORT $LN20@unicode_co@2 > $LN21@unicode_co@2: > cmp ecx, 2 > jne SHORT $LN19@unicode_co@2 > movzx ecx, WORD PTR [edi] > jmp SHORT $LN20@unicode_co@2 > $LN19@unicode_co@2: > mov ecx, DWORD PTR [esi] > $LN20@unicode_co@2: > > ; 10433: > ; 10434: if (c1 != c2) > > cmp edx, ecx > jne SHORT $LN31@unicode_co@2 > mov ecx, DWORD PTR _len2$[ebp] > inc eax > add edi, 2 > add esi, 4 > cmp eax, DWORD PTR _len1$[ebp] > jl SHORT $LL28@unicode_co@2 > $LN29@unicode_co@2: > mov edi, DWORD PTR _len1$[ebp] > $LN2@unicode_co@2: > > ; 10436: } > ; 10437: > ; 10438: return (len1 < len2) ? -1 : (len1 != len2); > > cmp edi, ecx > jge SHORT $LN23@unicode_co@2 > pop edi > pop esi > or eax, -1 > pop ebx > > ; 10439: } > > mov esp, ebp > pop ebp > ret 0 > $LN31@unicode_co@2: > > ; 10435: return (c1 < c2) ? -1 : 1; > > sbb eax, eax > pop edi > and eax, -2 ; fffffffeH > pop esi > inc eax > pop ebx > > ; 10439: } > > mov esp, ebp > pop ebp > ret 0 > $LN23@unicode_co@2: > > ; 10436: } > ; 10437: > ; 10438: return (len1 < len2) ? -1 : (len1 != len2); > > xor eax, eax > cmp edi, ecx > pop edi > pop esi > setne al > pop ebx > > ; 10439: } > > mov esp, ebp > pop ebp > ret 0 > _unicode_compare ENDP > > Neil -- DaveA