Step further with filebasedMessages

Started by	Cecil Westerhof <Cecil@decebal.nl>
First post	2015-05-05 10:52 +0200
Last post	2015-05-05 13:10 -0400
Articles	10 — 6 participants

  Step further with filebasedMessages Cecil Westerhof <Cecil@decebal.nl> - 2015-05-05 10:52 +0200
    Re: Step further with filebasedMessages Chris Angelico <rosuav@gmail.com> - 2015-05-05 19:20 +1000
      Re: Step further with filebasedMessages Cecil Westerhof <Cecil@decebal.nl> - 2015-05-05 12:14 +0200
    Re: Step further with filebasedMessages Steven D'Aprano <steve+comp.lang.python@pearwood.info> - 2015-05-05 20:41 +1000
      Re: Step further with filebasedMessages Cecil Westerhof <Cecil@decebal.nl> - 2015-05-05 13:27 +0200
        Re: Step further with filebasedMessages Ian Kelly <ian.g.kelly@gmail.com> - 2015-05-05 07:57 -0600
          Re: Step further with filebasedMessages Cecil Westerhof <Cecil@decebal.nl> - 2015-05-05 16:51 +0200
    Re: Step further with filebasedMessages Peter Otten <__peter__@web.de> - 2015-05-05 13:08 +0200
      Re: Step further with filebasedMessages Cecil Westerhof <Cecil@decebal.nl> - 2015-05-05 17:25 +0200
        Re: Step further with filebasedMessages Dave Angel <davea@davea.name> - 2015-05-05 13:10 -0400

#89946 — Step further with filebasedMessages

From	Cecil Westerhof <Cecil@decebal.nl>
Date	2015-05-05 10:52 +0200
Subject	Step further with filebasedMessages
Message-ID	<87oalzh0d5.fsf@Equus.decebal.nl>

I now defined get_message_slice:
    ### Add step
    def get_message_slice(message_filename, start, end):
        """
        Get a slice of messages, where 0 is the first message
        Works with negative indexes
        The values can be ascending and descending
        """

        message_list    = []
        real_file       = expanduser(message_filename)
        nr_of_messages  = get_nr_of_messages(real_file)
        if start < 0:
            start += nr_of_messages
        if end < 0:
            end += nr_of_messages
        assert (start >= 0) and (start < nr_of_messages)
        assert (end   >= 0) and (end   < nr_of_messages)
        if start > end:
            tmp             = start
            start           = end
            end             = tmp
            need_reverse    = True
        else:
            need_reverse    = False
        with open(real_file, 'r') as f:
            for message in islice(f, start, end + 1):
                message_list.append(message.rstrip())
        if need_reverse:
            message_list.reverse()
        return message_list

Is that a good way?

I also had:
    def get_indexed_message(message_filename, index):
        """
        Get index message from a file, where 0 gets the first message
        A negative index gets messages indexed from the end of the file
        Use get_nr_of_messages to get the number of messages in the file
        """

        real_file       = expanduser(message_filename)
        nr_of_messages  = get_nr_of_messages(real_file)
        if index < 0:
            index += nr_of_messages
        assert (index >= 0) and (index < nr_of_messages)
        with open(real_file, 'r') as f:
            [line] = islice(f, index, index + 1)
            return line.rstrip()

But changed it to:
    def get_indexed_message(message_filename, index):
        """
        Get index message from a file, where 0 gets the first message
        A negative index gets messages indexed from the end of the file
        Use get_nr_of_messages to get the number of messages in the file
        """

        return get_message_slice(message_filename, index, index)[0]

Is that acceptable? I am a proponent of DRY.
Or should I at least keep the assert in it?

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof

#89947

From	Chris Angelico <rosuav@gmail.com>
Date	2015-05-05 19:20 +1000
Message-ID	<mailman.113.1430817623.12865.python-list@python.org>
In reply to	#89946

On Tue, May 5, 2015 at 6:52 PM, Cecil Westerhof <Cecil@decebal.nl> wrote:
> I now defined get_message_slice:

You're doing a lot of work involving flat-file storage of sequential
data. There are two possibilities:

1) Your files are small, so you shouldn't concern yourself with
details at all - just do whatever looks reasonable, nothing will
matter; or
2) Your files are bigger than that, performance might be a problem
(especially when your Big Oh starts looking bad), and you should move
to a database.

Maybe even with small files, a database would be cleaner. You can grab
whichever rows you want based on their IDs, and the database will do
the work for you. Grab SQLite3 or PostgreSQL, give it a whirl - you
may find that it does everything you need, right out of the box.
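A minimal sketch of what "grab whichever rows you want based on their IDs" could look like with the standard-library sqlite3 module. The table name `messages`, its columns, and the helper names `load_messages` and `fetch_range` are assumptions made for illustration; nothing in the thread prescribes this schema.

```python
import sqlite3


def load_messages(conn, lines):
    """Create a hypothetical 'messages' table and store one row per line."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS messages "
        "(id INTEGER PRIMARY KEY, body TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO messages (body) VALUES (?)",
        [(line,) for line in lines],
    )
    conn.commit()


def fetch_range(conn, first_id, last_id):
    """Return the bodies whose id lies in [first_id, last_id], in id order."""
    rows = conn.execute(
        "SELECT body FROM messages WHERE id BETWEEN ? AND ? ORDER BY id",
        (first_id, last_id),
    )
    return [body for (body,) in rows]


# An in-memory database keeps the demo free of files on disk.
conn = sqlite3.connect(":memory:")
load_messages(conn, ["first tip", "second tip", "third tip", "fourth tip"])
selected = fetch_range(conn, 2, 3)
```

Note that SQLite numbers the rows from 1, and that BETWEEN is inclusive at both ends, unlike a Python slice.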

ChrisA

#89958

From	Cecil Westerhof <Cecil@decebal.nl>
Date	2015-05-05 12:14 +0200
Message-ID	<87k2wngwl8.fsf@Equus.decebal.nl>
In reply to	#89947

Op Tuesday 5 May 2015 11:20 CEST schreef Chris Angelico:

> On Tue, May 5, 2015 at 6:52 PM, Cecil Westerhof <Cecil@decebal.nl> wrote:
>> I now defined get_message_slice:
>
> You're doing a lot of work involving flat-file storage of sequential
> data. There are two possibilities:
>
> 1) Your files are small, so you shouldn't concern yourself with
> details at all - just do whatever looks reasonable, nothing will
> matter;

In my case the files are very small. Biggest is 150 lines. But if you
publish your code, you never know in which situation it will be used.

The suggestion to implement a slice came from this newsgroup. ;-) And I
found it a good idea.

It is also to get reacquainted with Python.

> or 2) Your files are bigger than that, performance might be
> a problem (especially when your Big Oh starts looking bad), and you
> should move to a database.
>
> Maybe even with small files, a database would be cleaner. You can
> grab whichever rows you want based on their IDs, and the database
> will do the work for you. Grab SQLite3 or PostgreSQL, give it a
> whirl - you may find that it does everything you need, right out of
> the box.

That is a next step. I want to use PostgreSQL with SQLAlchemy.

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof

#89957

From	Steven D'Aprano <steve+comp.lang.python@pearwood.info>
Date	2015-05-05 20:41 +1000
Message-ID	<55489e80$0$12913$c3e8da3$5496439d@news.astraweb.com>
In reply to	#89946

On Tuesday 05 May 2015 18:52, Cecil Westerhof wrote:

> I now defined get_message_slice:
>     ### Add step
>     def get_message_slice(message_filename, start, end):
>         """
>         Get a slice of messages, where 0 is the first message
>         Works with negative indexes
>         The values can be ascending and descending
>         """

What's a message in this context? Just a line of text?

>         message_list    = []
>         real_file       = expanduser(message_filename)
>         nr_of_messages  = get_nr_of_messages(real_file)
>         if start < 0:
>             start += nr_of_messages
>         if end < 0:
>             end += nr_of_messages
>         assert (start >= 0) and (start < nr_of_messages)
>         assert (end   >= 0) and (end   < nr_of_messages)

You can write that as:

    assert 0 <= start < nr_of_messages

except you probably shouldn't, because that's not a good use for assert: 
start and end are user-supplied parameters, not internal invariants.

You might find this useful for understanding when to use assert:

http://import-that.dreamwidth.org/676.html

>         if start > end:
>             tmp             = start
>             start           = end
>             end             = tmp
>             need_reverse    = True

You can swap two variables like this:

    start, end = end, start

The language guarantees that the right hand side will be evaluated before 
the assignments are done, so it will automatically do the right thing.
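As a small illustration, the swap can be folded into a helper that orders the two bounds and remembers whether they arrived descending. The function name `ordered_bounds` is hypothetical and only serves the example.

```python
def ordered_bounds(start, end):
    """Return (low, high, need_reverse) for a possibly descending pair."""
    need_reverse = start > end
    if need_reverse:
        # The tuple (end, start) is built first and only then unpacked,
        # so no temporary variable is needed.
        start, end = end, start
    return start, end, need_reverse


descending = ordered_bounds(5, 2)
ascending = ordered_bounds(1, 4)
```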

>         else:
>             need_reverse    = False

Your behaviour when start and end are in opposite order does not match the 
standard slicing behaviour:

py> "abcdef"[3:5]
'de'
py> "abcdef"[5:3]
''

That doesn't mean your behaviour is wrong, but it will surprise anyone who 
expects your slicing to be like the slicing they are used to.
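For comparison, this is how standard slicing expresses a descending selection: the caller asks for it with a negative step, and the stop index stays exclusive. A short sketch on the same string:

```python
letters = "abcdef"

ascending = letters[3:5]             # indexes 3 and 4
empty = letters[5:3]                 # start past stop with step 1 selects nothing
descending = letters[5:3:-1]         # indexes 5 and 4, walking backwards
inclusive_descending = letters[5:2:-1]  # indexes 5, 4 and 3
```

The last line is the standard spelling of what get_message_slice(filename, 5, 3) returns with its inclusive bounds.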

>         with open(real_file, 'r') as f:
>             for message in islice(f, start, end + 1):
>                 message_list.append(message.rstrip())
>         if need_reverse:
>             message_list.reverse()
>         return message_list
> 
> Is that a good way?

I think what I would do is:

def get_message_slice(message_filename, start=0, end=None, step=1):
    real_file = expanduser(message_filename)
    with open(real_file, 'r') as f:
        messages = f.readlines()
    return messages[start:end:step]

until such time that I could prove that I needed something more 
sophisticated. Then, and only then, would I consider your approach, except 
using a slice object:

# Untested.
def get_message_slice(message_filename, start=0, end=None, step=1):
    real_file = expanduser(message_filename)
    messages = []
    # FIXME: I assume this is expensive. Can we avoid it?
    nr_of_messages = get_nr_of_messages(real_file)
    the_slice = slice(start, end, step)
    # Calculate the indexes in the given slice, e.g.
    # start=1, stop=7, step=2 gives [1,3,5].
    indices = range(*(the_slice.indices(nr_of_messages)))
    with open(real_file, 'r') as f:
        for i, message in enumerate(f):
            if i in indices:
                messages.append(message)
    return messages

There is still room for optimization: e.g. if the slice is empty, don't 
bother iterating over the file. I leave that to you.
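One possible shape for that optimization, as a sketch rather than a tested replacement. It assumes one message per line, uses a hypothetical `count_lines` helper in place of get_nr_of_messages, and reads only the lines between the lowest and highest wanted index. Unlike the untested version above, it reverses the result for a negative step so that it matches list slicing.

```python
import os
import tempfile
from itertools import islice
from os.path import expanduser


def count_lines(path):
    """Hypothetical stand-in for get_nr_of_messages(): one message per line."""
    with open(path, 'r') as f:
        return sum(1 for _ in f)


def get_message_slice(message_filename, start=0, end=None, step=1):
    real_file = expanduser(message_filename)
    the_slice = slice(start, end, step)
    indices = range(*the_slice.indices(count_lines(real_file)))
    if not indices:
        # Empty slice: nothing to read, so skip the second pass over the file.
        return []
    # Lowest and highest wanted line numbers, whatever the step direction.
    low, high = sorted((indices[0], indices[-1]))
    with open(real_file, 'r') as f:
        numbered = enumerate(islice(f, low, high + 1), low)
        messages = [message for i, message in numbered if i in indices]
    if indices.step < 0:
        # Lines were read in file order; list slicing returns them reversed.
        messages.reverse()
    return messages


# Demo on a throwaway file holding the lines m0 .. m9.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, 'messages.txt')
    with open(demo_file, 'w') as f:
        f.writelines('m%d\n' % i for i in range(10))
    every_other = get_message_slice(demo_file, 1, 7, 2)
    nothing = get_message_slice(demo_file, 5, 3)
    backwards = get_message_slice(demo_file, 7, 1, -2)
```

The messages keep their trailing newline here, as in the versions above; add rstrip() if that is not wanted.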

-- 
Steve

#89962

From	Cecil Westerhof <Cecil@decebal.nl>
Date	2015-05-05 13:27 +0200
Message-ID	<87fv7bgt6t.fsf@Equus.decebal.nl>
In reply to	#89957

Op Tuesday 5 May 2015 12:41 CEST schreef Steven D'Aprano:

> On Tuesday 05 May 2015 18:52, Cecil Westerhof wrote:
>
>> I now defined get_message_slice:
>>     ### Add step
>>     def get_message_slice(message_filename, start, end):
>>         """
>>         Get a slice of messages, where 0 is the first message
>>         Works with negative indexes
>>         The values can be ascending and descending
>>         """
>
> What's a message in this context? Just a line of text?

Yes. In this case it is. I use it to post on Twitter. (OmgangMetTijd)
It was done with an old PHP script. But I switched it to Python, just
to get some needed exercise in Python. But I found it very easy to
extend the functionality.

In my twitter application I use '^' to have the possibility to create
newlines.

>
>
>>         message_list    = []
>>         real_file       = expanduser(message_filename)
>>         nr_of_messages  = get_nr_of_messages(real_file)
>>         if start < 0:
>>             start += nr_of_messages
>>         if end < 0:
>>             end += nr_of_messages
>>         assert (start >= 0) and (start < nr_of_messages)
>>         assert (end   >= 0) and (end   < nr_of_messages)
>
> You can write that as:
>
> assert 0 <= start < nr_of_messages

I was taught that my way is clearer.


> except you probably shouldn't, because that's not a good use for
> assert: start and end are user-supplied parameters, not internal
> invariants.
>
> You might find this useful for understanding when to use assert:
>
> http://import-that.dreamwidth.org/676.html

I will read it.


>>         if start > end:
>>             tmp             = start
>>             start           = end
>>             end             = tmp
>>             need_reverse    = True
>
> You can swap two variables like this:
>
> start, end = end, start
>
> The language guarantees that the right hand side will be evaluated
> before the assignments are done, so it will automatically do the
> right thing.

That is a lot clearer. Thanks.


> Your behaviour when start and end are in opposite order does not
> match the standard slicing behaviour:
>
> py> "abcdef"[3:5]
> 'de'
> py> "abcdef"[5:3]
> ''
>
> That doesn't mean your behaviour is wrong, but it will surprise
> anyone who expects your slicing to be like the slicing they are used
> to.

Good point: something to think about.


> I think what I would do is:
>
> def get_message_slice(message_filename, start=0, end=None, step=1):
>     real_file = expanduser(message_filename)
>     with open(real_file, 'r') as f:
>         messages = f.readlines()
>     return messages[start:end:step]

The idea is that this is expensive for large files.


> until such time that I could prove that I needed something more
> sophisticated. Then, and only then, would I consider your approach,
> except using a slice object:

Because I publish it, I should take reasonable care of the
possibilities of its use.


> # Untested.
> def get_message_slice(message_filename, start=0, end=None, step=1):
>     real_file = expanduser(message_filename)
>     messages = []
>     # FIXME: I assume this is expensive. Can we avoid it?
>     nr_of_messages = get_nr_of_messages(real_file)

If I want to give the possibility to use negative values also, I need
the value.


>     the_slice = slice(start, end, step)
>     # Calculate the indexes in the given slice, e.g.
>     # start=1, stop=7, step=2 gives [1,3,5].
>     indices = range(*(the_slice.indices(nr_of_messages)))
>     with open(real_file, 'r') as f:
>         for i, message in enumerate(f):
>             if i in indices:
>                 messages.append(message)
>     return messages

I will look into it.

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof

#89966

From	Ian Kelly <ian.g.kelly@gmail.com>
Date	2015-05-05 07:57 -0600
Message-ID	<mailman.125.1430834230.12865.python-list@python.org>
In reply to	#89962

On May 5, 2015 5:46 AM, "Cecil Westerhof" <Cecil@decebal.nl> wrote:
>
> Op Tuesday 5 May 2015 12:41 CEST schreef Steven D'Aprano:
>
> > # Untested.
> > def get_message_slice(message_filename, start=0, end=None, step=1):
> >     real_file = expanduser(message_filename)
> >     messages = []
> >     # FIXME: I assume this is expensive. Can we avoid it?
> >     nr_of_messages = get_nr_of_messages(real_file)
>
> If I want to give the possibility to use negative values also, I need
> the value.

You could make this call only if one of the boundaries is actually
negative. Then callers that provide positive values don't need to pay the
cost of that case.

Alternatively, consider that it's common for slices of iterators to
disallow negative indices altogether, and question whether you really need
that.
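A sketch of the first suggestion: count the lines only when a bound is negative. The names `count_lines` and `get_message_slice_lazy` are hypothetical, the bounds are half-open, and one message per line is assumed.

```python
import os
import tempfile
from itertools import islice
from os.path import expanduser


def count_lines(path):
    """Hypothetical stand-in for get_nr_of_messages(): one message per line."""
    with open(path, 'r') as f:
        return sum(1 for _ in f)


def get_message_slice_lazy(message_filename, start=0, end=None):
    """Return the half-open slice [start, end) of lines, stripped."""
    real_file = expanduser(message_filename)
    if start < 0 or (end is not None and end < 0):
        # Only negative bounds need the total, so only they pay for counting.
        start, end, _ = slice(start, end).indices(count_lines(real_file))
    with open(real_file, 'r') as f:
        return [line.rstrip() for line in islice(f, start, end)]


# Demo on a throwaway file holding the lines m0 .. m9.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_file = os.path.join(tmp_dir, 'messages.txt')
    with open(demo_file, 'w') as f:
        f.writelines('m%d\n' % i for i in range(10))
    first_three = get_message_slice_lazy(demo_file, 0, 3)
    last_two = get_message_slice_lazy(demo_file, -2)
    middle = get_message_slice_lazy(demo_file, 7, -1)
```

Callers that pass only non-negative bounds read the file once; itertools.islice itself rejects negative arguments with a ValueError, which is why the conversion has to happen first.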

> >     the_slice = slice(start, end, step)
> >     # Calculate the indexes in the given slice, e.g.
> >     # start=1, stop=7, step=2 gives [1,3,5].
> >     indices = range(*(the_slice.indices(nr_of_messages)))
> >     with open(real_file, 'r') as f:
> >         for i, message in enumerate(f):
> >             if i in indices:
> >                 messages.append(message)
> >     return messages

I approve of using slice.indices instead of calculating the indices
manually, but otherwise, the islice approach feels cleaner to me. This
reads like a reimplementation of that.

#89967

From	Cecil Westerhof <Cecil@decebal.nl>
Date	2015-05-05 16:51 +0200
Message-ID	<87bnhzgjqr.fsf@Equus.decebal.nl>
In reply to	#89966

Op Tuesday 5 May 2015 15:57 CEST schreef Ian Kelly:

> On May 5, 2015 5:46 AM, "Cecil Westerhof" <Cecil@decebal.nl> wrote:
>>
>> Op Tuesday 5 May 2015 12:41 CEST schreef Steven D'Aprano:
>>
>>> # Untested.
>>> def get_message_slice(message_filename, start=0, end=None, step=1):
>>>     real_file = expanduser(message_filename)
>>>     messages = []
>>>     # FIXME: I assume this is expensive. Can we avoid it?
>>>     nr_of_messages = get_nr_of_messages(real_file)
>>
>> If I want to give the possibility to use negative values also, I
>> need the value.
>
> You could make this call only if one of the boundaries is actually
> negative. Then callers that provide positive values don't need to
> pay the cost of that case.

You have a point there. I will think about it.


> Alternatively, consider that it's common for slices of iterators to
> disallow negative indices altogether, and question whether you
> really need that.

It was an idea I got from this newsgroup. :-D And I liked it,
otherwise I would not have implemented it.

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof

#89960

From	Peter Otten <__peter__@web.de>
Date	2015-05-05 13:08 +0200
Message-ID	<mailman.123.1430824107.12865.python-list@python.org>
In reply to	#89946

Cecil Westerhof wrote:

> I now defined get_message_slice:
>     ### Add step
>     def get_message_slice(message_filename, start, end):

Intervals are usually half-open in Python. I recommend that you follow that 
convention.
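A short illustration of why the half-open convention is convenient: the length of a slice is simply end minus start, and adjacent slices join up without overlap or gap. The list contents are made up for the example.

```python
messages = ["m0", "m1", "m2", "m3", "m4", "m5"]
start, end = 1, 4

chunk = messages[start:end]   # indexes 1, 2 and 3; index 4 is excluded
head = messages[:end]         # everything before index 4
tail = messages[end:]         # everything from index 4 on
nothing = messages[2:2]       # equal bounds give an empty slice
```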

>         """
>         Get a slice of messages, where 0 is the first message
>         Works with negative indexes
>         The values can be ascending and descending
>         """
> 
>         message_list    = []
>         real_file       = expanduser(message_filename)
>         nr_of_messages  = get_nr_of_messages(real_file)
>         if start < 0:
>             start += nr_of_messages
>         if end < 0:
>             end += nr_of_messages
>         assert (start >= 0) and (start < nr_of_messages)

You should raise an exception. While asserts are rarely switched off in 
Python you still have to be prepared, and an IndexError would be a better 
fit anyway.
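A sketch of such a check, using an explicit exception instead of assert so that it still runs under python -O. The helper name `normalize_index` is hypothetical.

```python
def normalize_index(index, nr_of_messages):
    """Map a possibly negative index to a non-negative one, or raise."""
    if not -nr_of_messages <= index < nr_of_messages:
        # Same exception a list raises for a bad index.
        raise IndexError("message index out of range: %d" % index)
    if index < 0:
        index += nr_of_messages
    return index


last = normalize_index(-1, 5)
first = normalize_index(0, 5)
```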

>         assert (end   >= 0) and (end   < nr_of_messages)
>         if start > end:
>             tmp             = start
>             start           = end
>             end             = tmp
>             need_reverse    = True
>         else:
>             need_reverse    = False
>         with open(real_file, 'r') as f:
>             for message in islice(f, start, end + 1):
>                 message_list.append(message.rstrip())
>         if need_reverse:
>             message_list.reverse()
>         return message_list
> 
> Is that a good way?
> 
> I also had:
>     def get_indexed_message(message_filename, index):
>         """
>         Get index message from a file, where 0 gets the first message
>         A negative index gets messages indexed from the end of the file
>         Use get_nr_of_messages to get the number of messages in the file
>         """
> 
>         real_file       = expanduser(message_filename)
>         nr_of_messages  = get_nr_of_messages(real_file)
>         if index < 0:
>             index += nr_of_messages
>         assert (index >= 0) and (index < nr_of_messages)
>         with open(real_file, 'r') as f:
>             [line] = islice(f, index, index + 1)
>             return line.rstrip()
> 
> But changed it to:
>     def get_indexed_message(message_filename, index):
>         """
>         Get index message from a file, where 0 gets the first message
>         A negative index gets messages indexed from the end of the file
>         Use get_nr_of_messages to get the number of messages in the file
>         """
> 
>         return get_message_slice(message_filename, index, index)[0]
> 
> Is that acceptable? 

Yes.

> I am a proponent of DRY.

But note that you are implementing parts of the slicing logic that Python's 
sequences already have. Consider becoming a proponent of DRWTOGAD*.

> Or should I at least keep the assert in it?
 
No.

I see you have a tendency to overengineer. Here's how I would approach case
(1) in Chris' answer, where memory is not a concern:

import os

def read_messages(filename):
    with open(os.path.expanduser(filename)) as f:
        return [line.rstrip() for line in f]

# get_messages_slice(filename, start, end)
print(read_messages(filename)[start:end+1])

# get_indexed_message(filename, index)
print(read_messages(filename)[index])

Should you later decide that a database is a better fit you can change 
read_messages() to return a class that transparently accesses that database. 
Again, most of the work is already done:

import collections.abc

class Messages(collections.abc.Sequence):
    def __init__(self, filename):
        self.filename = filename
    def __getitem__(self, index):
        # read record(s) from db
        raise NotImplementedError
    def __len__(self):
        # return num-records in db
        raise NotImplementedError

def read_messages(filename):
    return Messages(filename)
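To make the skeleton concrete, here is one way it might be filled in with the standard-library sqlite3 module. This is a sketch under assumptions: the `messages` table and its columns are invented for the example, and the constructor takes an open connection rather than a filename. Only __getitem__ and __len__ are written by hand; iteration, `in`, and reversed() come from the Sequence base class.

```python
import sqlite3
from collections.abc import Sequence


class Messages(Sequence):
    """Read-only sequence view over a hypothetical 'messages' table."""

    def __init__(self, conn):
        self.conn = conn

    def __len__(self):
        (count,) = self.conn.execute("SELECT COUNT(*) FROM messages").fetchone()
        return count

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Simple but slow: one query per selected row.
            return [self[i] for i in range(*index.indices(len(self)))]
        size = len(self)
        if not -size <= index < size:
            raise IndexError("message index out of range")
        if index < 0:
            index += size
        row = self.conn.execute(
            "SELECT body FROM messages ORDER BY id LIMIT 1 OFFSET ?", (index,)
        ).fetchone()
        return row[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE messages (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO messages (body) VALUES (?)", [("a",), ("b",), ("c",), ("d",)]
)
msgs = Messages(conn)
```

Callers can then keep writing msgs[index] and msgs[start:end] exactly as they would with the list returned by the file-based read_messages().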

By the way, where do you plan to use your functions? And where do the 
indices you feed them come from?

(*) Don't repeat what those other guys already did. Yeah sorry, I have a 
soft spot for lame jokes...

#89969

From	Cecil Westerhof <Cecil@decebal.nl>
Date	2015-05-05 17:25 +0200
Message-ID	<877fsngi7i.fsf@Equus.decebal.nl>
In reply to	#89960

Op Tuesday 5 May 2015 13:08 CEST schreef Peter Otten:

> Cecil Westerhof wrote:
>
>> I now defined get_message_slice:
>>     ### Add step
>>     def get_message_slice(message_filename, start, end):
>
> Intervals are usually half-open in Python. I recommend that you
> follow that convention.

I will change it.


>>         """
>>         Get a slice of messages, where 0 is the first message
>>         Works with negative indexes
>>         The values can be ascending and descending
>>         """
>>
>>         message_list    = []
>>         real_file       = expanduser(message_filename)
>>         nr_of_messages  = get_nr_of_messages(real_file)
>>         if start < 0:
>>             start += nr_of_messages
>>         if end < 0:
>>             end += nr_of_messages
>>         assert (start >= 0) and (start < nr_of_messages)
>
> You should raise an exception. While asserts are rarely switched off
> in Python you still have to be prepared, and an IndexError would be
> a better fit anyway.

OK.


> Should you later decide that a database is a better fit you can
> change read_messages() to return a class that transparently accesses
> that database. Again, most of the work is already done:
>
> class Messages(collections.Sequence):
>     def __init__(self, filename):
>         self.filename = filename
>     def __getitem__(self, index):
>         # read record(s) from db
>     def __len__(self):
>         # return num-records in db
>
> def read_messages(filename):
>     return Messages(filename)

Another thing I have to look into. :-D


> By the way, where do you plan to use your functions? And where do
> the indices you feed them come from?

In my case from:
    get_random_message
and:
    dequeue_message

Both also in:
    https://github.com/CecilWesterhof/PythonLibrary/blob/master/filebasedMessages.py

I should write some documentation for those functions. ;-)


I have a file with quotes and a file with tips. I want to place random
messages from those two (without them being repeated too soon) on my
Twitter page. This I do with ‘get_random_message’. I also want to put
the first message of another file and remove it from the file. For
this I use ‘dequeue_message’.

-- 
Cecil Westerhof
Senior Software Engineer
LinkedIn: http://www.linkedin.com/in/cecilwesterhof

#89978

From	Dave Angel <davea@davea.name>
Date	2015-05-05 13:10 -0400
Message-ID	<mailman.128.1430845859.12865.python-list@python.org>
In reply to	#89969

On 05/05/2015 11:25 AM, Cecil Westerhof wrote:

>
> I have a file with quotes and a file with tips. I want to place random
> messages from those two (without them being repeated too soon) on my
> Twitter page. This I do with ‘get_random_message’. I also want to put
> the first message of another file and remove it from the file. For
> this I use ‘dequeue_message’.
>

Removing lines from the start of a file is an n-squared operation overall,
since every removal rewrites the rest of the file.
Sometimes it pays to reverse the file once, and just remove from the 
end.  Truncating a file doesn't require the whole thing to be rewritten, 
nor risk losing the file if the make-new-file-rename-delete-old isn't 
quite done right.
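A rough sketch of the remove-from-the-end idea, assuming the file has already been reversed so that the next message to post is its last line. The function name `pop_last_line` is hypothetical, and the code assumes UTF-8 text with '\n' line endings.

```python
import os
import tempfile


def pop_last_line(path, block_size=4096):
    """Remove and return the last line of the file by truncating it."""
    with open(path, 'rb+') as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        if end == 0:
            raise IndexError("pop from an empty message file")
        # Skip the newline that terminates the last line, if there is one.
        f.seek(end - 1)
        content_end = end - 1 if f.read(1) == b'\n' else end
        # Walk backwards block by block until the previous newline is found.
        line_start = 0
        pos = content_end
        while pos > 0:
            read_from = max(0, pos - block_size)
            f.seek(read_from)
            block = f.read(pos - read_from)
            newline_at = block.rfind(b'\n')
            if newline_at != -1:
                line_start = read_from + newline_at + 1
                break
            pos = read_from
        f.seek(line_start)
        line = f.read(content_end - line_start)
        # Cut the file just before the line; earlier bytes are not rewritten.
        f.truncate(line_start)
        return line.decode('utf-8')


# Demo: the queue is stored reversed, so 'first' is the last line.
with tempfile.TemporaryDirectory() as tmp_dir:
    queue_file = os.path.join(tmp_dir, 'queue.txt')
    with open(queue_file, 'wb') as f:
        f.write(b'third\nsecond\nfirst\n')
    popped = [pop_last_line(queue_file) for _ in range(2)]
    with open(queue_file, 'rb') as f:
        remaining = f.read()
```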

Alternatively, you could overwrite the line, or more especially the 
linefeed before it.  Then you always do two readline() calls, using the 
second one's result.

Various other games might include storing an offset at the beginning of
the file, so you start by reading that, doing a seek to the place you want, 
and then reading the new line from there.

Not recommending any of these, just bringing up alternatives.

-- 
DaveA
