From: Ian Kelly
Newsgroups: comp.lang.python
Subject: Re: non printable (moving away from Perl)
Date: Fri, 11 Mar 2016 10:08:07 -0700

On Fri, Mar 11, 2016 at 9:34 AM, Wolfgang Maier wrote:
> On 11.03.2016 15:23, Fillmore wrote:
>>
>> On 03/11/2016 07:13 AM, Wolfgang Maier wrote:
>>>
>>> One lesson for Perl regex users is that in Python many things can be
>>> solved without regexes.
>>> How about defining:
>>>
>>> printable = {chr(n) for n in range(32, 127)}
>>>
>>> then using:
>>>
>>> if (set(my_string) - set(printable)):
>>>     break
>>
>>
>> seems computationally heavy. I have a file with about 70k lines, of
>> which only 20 contain "funny" chars.
>>
>
> Not sure what you call computationally heavy. I just test-parsed a 30 MB
> file (28k lines) with:
>
> with open(my_file) as i:
>     for line in i:
>         if set(line) - printable:
>             continue
>
> and it finished in less than a second.

Did your test file contain on the order of 100 unique characters, or
on the order of 100,000? Granted that most input data would likely
fall into the former category.
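
To make the question concrete, here is a rough sketch for comparing the two cases. It is not what anyone in this thread actually ran: `has_nonprintable` and `make_line` are names made up for illustration, and the line length, alphabet sizes, and repeat count are arbitrary assumptions. The large alphabet is capped at 50,000 code points only to stay below the surrogate range. The check also strips the trailing newline, on the assumption that lines come from iterating over a file: '\n' is not in range(32, 127), so it would otherwise flag every line.

```python
import random
import timeit

# Printable ASCII, as defined earlier in the thread.
printable = {chr(n) for n in range(32, 127)}


def has_nonprintable(line):
    """Return True if line contains a character outside printable ASCII.

    The trailing newline is stripped first (an assumption about the input),
    because '\n' is not in the printable set.
    """
    return bool(set(line.rstrip("\n")) - printable)


def make_line(length, alphabet_size, seed=0):
    """Build a hypothetical test line drawn from alphabet_size distinct code points."""
    rng = random.Random(seed)
    # Start at U+0020 so that an alphabet of 95 stays inside printable ASCII.
    alphabet = [chr(32 + n) for n in range(alphabet_size)]
    return "".join(rng.choice(alphabet) for _ in range(length))


if __name__ == "__main__":
    # Compare a line with about 100 distinct characters against one with
    # tens of thousands. Columns: alphabet size, distinct chars, seconds.
    for alphabet_size in (95, 50_000):
        line = make_line(100_000, alphabet_size)
        seconds = timeit.timeit(lambda: has_nonprintable(line), number=10)
        print(alphabet_size, len(set(line)), round(seconds, 4))
```

The set difference walks the distinct characters of the line, not every character, so I would expect the second case to be the slower one. How much slower, and whether it matters for a 70k-line file, is something to measure on your own data.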