Path: csiph.com!xmission!news.snarked.org!news.linkpendium.com!news.linkpendium.com!panix!usenet.stanford.edu!not-for-mail From: George Jones Newsgroups: gnu.bash.bug Subject: Re: [PATCH] Implement rehashing for associative arrays (Re: speeding up hash_search?) Date: Sun, 19 Apr 2020 21:00:08 -0400 Lines: 102 Approved: bug-bash@gnu.org Message-ID: References: NNTP-Posting-Host: lists.gnu.org Mime-Version: 1.0 Content-Type: text/plain; charset="UTF-8" X-Trace: usenet.stanford.edu 1587344426 6098 209.51.188.17 (20 Apr 2020 01:00:26 GMT) X-Complaints-To: action@cs.stanford.edu Cc: bug-bash@gnu.org To: Koichi Murase Envelope-to: bug-bash@gnu.org DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=OoCrE/aJ6as0h3tocQ18IpGRjiL+dqVuxCh9gUJ4yYk=; b=MYqN8hjNFlFdcDdLKD1F2ZP9LfLFB22v+am7jjP7f2KJc4UviwT5VJh1kL6MjU9/DM Inf00yOWiIoEW4aFO7eKaiFH03KqI3x1eC3czV5a6z3+TIykOzFwi1QecwPEK8KzD+0u uQ6XvMe+RIsvpi/awpClrtRPyyACOuPoZVJZ01ODNcpV6Da/QTKXu3oFt3buZ65Vsqq6 pviSvUIfEeBTeYwTvj7iU3vU2iVQMP6mUYdJC5lA6eF+SXAvQgU+ZFvt/0lqoPiKo+7g MoNoRxTCnZG3LOeo8NmeWZNMTU2nCtggCH4yQH5z4bu2u8SO5KDfH4LxluQ3JjE9hHPF dPsw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=OoCrE/aJ6as0h3tocQ18IpGRjiL+dqVuxCh9gUJ4yYk=; b=uRTYuYqQ4qgczfkb16Pp/ikCEmm66ECqOTfm6CVPaTLXpCIxjCtarOjMdpBWEQVYui RrQdc4XRHmOrJPKJoSo277MziRCmfqZ8zOKdAcU3COnd4+wWBYsRAJOQpoEz+WjjImdU FG21TnDn3dcZtCc93Caw8/Hg372QJWyuS+VFDJH8Z48iVmJiDpxVsUK/Y41vjYla3KCv Zt7RM+MdGuPO2yYzytA7i/lPG/ygfSvmXnrrMFXlK3ewr5YNhqW173wxTHYXt/dPCLIt 8J91yCV3aIm7ipHgSTdlKXfNZzzVNGIjpI7AF+LrahPW+z2i2dmDzofgYkWBz+SsnAlT qODg== X-Gm-Message-State: AGi0PuY+ybOuEwJ332d2Cu4+JizY5EOr6LDToXV3W14eV93f23j1eNH4 zCO1uu0Rkpb28s2npD+bNmu7yfI9KOKUaGGv8wU= X-Google-Smtp-Source: APiQypIJnrGIIqPuHVwq6RnJq5lrFm0YlWxR4YbZOx5HD9NrPfgR6yr47ra4Tia4CFQzRsoOjTEc7kG6zgoX+/MpDss= X-Received: by 2002:aca:f50e:: with SMTP id t14mr8825515oih.156.1587344419133; Sun, 19 Apr 2020 18:00:19 -0700 (PDT) In-Reply-To: Received-SPF: pass client-ip=2607:f8b0:4864:20::244; envelope-from=fooologist@gmail.com; helo=mail-oi1-x244.google.com X-detected-operating-system: by eggs1p.gnu.org: Genre and OS details not recognized. X-Received-From: 2607:f8b0:4864:20::244 X-Content-Filtered-By: Mailman/MimeDel 2.1.23 X-BeenThere: bug-bash@gnu.org X-Mailman-Version: 2.1.23 Precedence: list List-Id: Bug reports for the GNU Bourne Again SHell List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , X-Mailman-Original-Message-ID: X-Mailman-Original-References: Xref: csiph.com gnu.bash.bug:16181 Thank you. Patch applied and (performance) tested with come tests I was working on https://github.com/eludom/snippits/tree/master/bash/tests .... bottom line: Before: ... lines_per_sec 993.37 LINES 2700000 ELAPSED 2718 lines_per_sec 955.95 LINES 2800000 ELAPSED 2929 lines_per_sec 921.51 LINES 2900000 ELAPSED 3147 lines_per_sec 889.41 LINES 3000000 ELAPSED 3373 lines_per_sec 859.43 LINES 3100000 ELAPSED 3607 After: ... lines_per_sec 50000.00 LINES 2600000 ELAPSED 52 lines_per_sec 50000.00 LINES 2700000 ELAPSED 54 lines_per_sec 50000.00 LINES 2800000 ELAPSED 56 lines_per_sec 50000.00 LINES 2900000 ELAPSED 58 lines_per_sec 50000.00 LINES 3000000 ELAPSED 60 lines_per_sec 50000.00 LINES 3100000 ELAPSED 62 ---George Jones On Sun, Apr 19, 2020 at 3:52 PM Koichi Murase wrote: > 2020-04-19 23:54 George Jones : > > It looks like hash_search just does a linear walk if array entries > > to find elements in a list. > > https://eludom.github.io/blog/20200418/ > > and there it is, the linear search walking the list in hash_search() > > > > ``` > > [...] > > > > bucket = HASH_BUCKET (string, table, hv); > > > > for (list = table->bucket_array ? table->bucket_array[bucket] : 0; > list; list = list->next) > > { > > [...] > > ``` > > The associative arrays in `hashlib.c' are implemented by hash tables > as is clear from its name. The main lookup of hash table algorithm is > done by the following line > > bucket = HASH_BUCKET (string, table, hv); > > but not by the subsequent linear search. The linear search is just a > workaround for the collision of hashes. As far as the load factor of > the hash table is maintained properly, the linear search is O(1) > because the length of the list is O(1). > > 2020-04-19 23:54 George Jones : > > This slows down (order N?) new inserts when the number of entries > > gets large. > > I looked into `hashlib.c' and found that rehashing is actually not > implemented. I just added a function to perform rehashing, and > the performance has been improved. > > > Would there be any interest in merging a patch to add an option for > > making this faster (maybe using b-trees?) > > https://eludom.github.io/blog/20200418/ > > TODO Look for appropriate in-memory hash insert/lookup functions > > - btrees ? > > The B-tree is not the most appropriate option here. The hash table is > more appropriate. The B-tree can be used in the case that we want to > keep the ordering of the keys of associative arrays (e.g. we want to > enumerate items in ascending/descending order). Bash associative > arrays do not ensure the ordering of the items, so the hash table can > be used as a more efficient choice and is already implemented in > `hashlib.c'. We can simply add rehashing. > > ------------------------------------------------------------------------ > > I attach a patch `0001-hashlib-Implement-rehash.patch' which > implements the rehashing. > > I also tested the performance. The attached `test1.png' shows the > computational time of insertion of items before and after the fix. > The lines are fitted by the function `Time = C Size^alpha' where alpha > is the exponent. Before the fix, there are two regimes depending on > the number of items: linear regime O(N) (alpha ~ 1) and quadratic > regime O(N^2) (alpha ~ 2). The quadratic regime signals the linear > scaling of a single insertion. After the fix, the prformance has been > improved, and the computational time scales linearly in entire region. > I also attach a script `test1.sh' which was used to measure the time. > > -- > Koichi >