X-Received: by 10.52.100.102 with SMTP id ex6mr40389293vdb.0.1433316071869; Wed, 03 Jun 2015 00:21:11 -0700 (PDT) X-Received: by 10.140.95.109 with SMTP id h100mr363386qge.6.1433316071805; Wed, 03 Jun 2015 00:21:11 -0700 (PDT) Path: csiph.com!v102.xanadu-bbs.net!xanadu-bbs.net!usenet.blueworldhosting.com!feeder01.blueworldhosting.com!peer02.iad.highwinds-media.com!news.highwinds-media.com!feed-me.highwinds-media.com!z60no12983qgd.1!news-out.google.com!4ni164qgh.1!nntp.google.com!z60no12981qgd.1!postnews.google.com!glegroupsg2000goo.googlegroups.com!not-for-mail Newsgroups: comp.lang.ruby Date: Wed, 3 Jun 2015 00:21:11 -0700 (PDT) In-Reply-To: <60ab7dcb-8c62-4e6a-b2bc-412c6c612789@googlegroups.com> Complaints-To: groups-abuse@google.com Injection-Info: glegroupsg2000goo.googlegroups.com; posting-host=65.49.68.195; posting-account=a_BjagoAAAA6awvngiyeSEJGIwHilPnK NNTP-Posting-Host: 65.49.68.195 References: <60ab7dcb-8c62-4e6a-b2bc-412c6c612789@googlegroups.com> User-Agent: G2/1.0 MIME-Version: 1.0 Message-ID: <8efe9407-fb75-4561-ae23-e2e052869db8@googlegroups.com> Subject: Re: The core of the core of the big data solutions -- Map From: Wenwei Peng Injection-Date: Wed, 03 Jun 2015 07:21:11 +0000 Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Received-Bytes: 3545 X-Received-Body-CRC: 1341388437 Xref: csiph.com comp.lang.ruby:7118 Title: The core of the big data solutions -- Map. Author: pengwenwei Email: pww71@foxmail.com=20 Language: c++ Platform: Windows, linux Technology: Perfect hash algorithm Level: Advanced Description: A high performance map algorithm Section MFC c++ map stl SubSection c++ algorithm License: (GPLv3) Map is widely used in c++ programs. Its performance is critical to programs= ' performance. Especially in big data and the scenarios which can't realiz= e data distribution and parallel processing. I have been working on big data analysis for many years in telecommunition = and information security industry. The data analysis is so complicated that= they can't work without map. Especially in information security industry, = the data is much more complicated than others. For example, ip table, mac t= able, telephone numbers table, dns table etc. Currently, the STL map and Google's hash map are the most popular maps. But= they have some disadvantages. The STL map is based on binary chop, which c= auses a bad performance. Google Hash map has the best performance at presen= t, but it has probability of collision. For big data analysis, the collisio= n probability is unacceptable. Now I would like to publish pwwMap. It includes three different maps for di= fferent scenarios: 1. Memory Map(memMap): It has a good access speed. But its size is limited = by memory size. 2. Harddisk Map(diskMap): It utilizes hard disk to store data. So it could = accept much more data than memory map. 3. Hashmap(hashMap): It has the best performance and a great lookup speed, = but it doesn't have 'insert' and 'delete' functionality. MemMap and diskMap could be converted to hashMap by function memMap2HashMap= and diskMap2HashMap. According to the test result, my algorithms' collisio= n probability is zero. About performance, memMap has a comparable performan= ce with google, and hashMap's performance is 100 times better than Google's= hashmap. In summary, pwwhash are perfect hash algorithms with zero collision probabi= lity. You can refer to following artical to find the key index and compress= algorithm theory: http://blog.csdn.net/chixinmuzi/article/details/1727195 Source code and documents: https://sourceforge.net/projects/pwwhashmap/files/?source=3Dnavbar