
Getting a value that follows string.find()

Started by: englishkevin110@gmail.com
First post: 2013-08-13 15:51 -0700
Last post: 2013-08-14 15:58 +0000
Articles: 8 (5 participants)



Contents

  Getting a value that follows string.find() englishkevin110@gmail.com - 2013-08-13 15:51 -0700
    Re: Getting a value that follows string.find() Joel Goldstick <joel.goldstick@gmail.com> - 2013-08-13 18:58 -0400
      Re: Getting a value that follows string.find() englishkevin110@gmail.com - 2013-08-13 16:03 -0700
        Re: Getting a value that follows string.find() Joel Goldstick <joel.goldstick@gmail.com> - 2013-08-13 19:18 -0400
        Re: Getting a value that follows string.find() Joel Goldstick <joel.goldstick@gmail.com> - 2013-08-13 19:40 -0400
        Re: Getting a value that follows string.find() Steven D'Aprano <steve@pearwood.info> - 2013-08-14 06:29 +0000
    Re: Getting a value that follows string.find() Dave Angel <davea@davea.name> - 2013-08-14 01:31 +0000
    Re: Getting a value that follows string.find() John Gordon <gordon@panix.com> - 2013-08-14 15:58 +0000

#52474 — Getting a value that follows string.find()

From: englishkevin110@gmail.com
Date: 2013-08-13 15:51 -0700
Subject: Getting a value that follows string.find()
Message-ID: <40816fed-38d4-4baa-92cc-c80cd8febd82@googlegroups.com>

I know the title doesn't make much sense, but I didn't know how to explain my problem.

Anywho, I've opened a page's source in URLLIB
starturlsource = starturlopen.read()
string.find(starturlsource, '<a href="/profile.php?id=')
And I used string.find to find a specific area in the page's source.
I want to store what comes after ?id= in a variable.
Can someone help me with this?



#52475

From: Joel Goldstick <joel.goldstick@gmail.com>
Date: 2013-08-13 18:58 -0400
Message-ID: <mailman.548.1376434691.1251.python-list@python.org>
In reply to: #52474

Look up urlparse for your answer.

On Tue, Aug 13, 2013 at 6:51 PM,  <englishkevin110@gmail.com> wrote:
> I know the title doesn't make much sense, but I didnt know how to explain my problem.
>
> Anywho, I've opened a page's source in URLLIB
> starturlsource = starturlopen.read()
> string.find(starturlsource, '<a href="/profile.php?id=')
> And I used string.find to find a specific area in the page's source.
> I want to store what comes after ?id= in a variable.
> Can someone help me with this?



-- 
Joel Goldstick
http://joelgoldstick.com



#52476

From: englishkevin110@gmail.com
Date: 2013-08-13 16:03 -0700
Message-ID: <5b73c6fe-a282-4d28-ab29-2e1dfdd09290@googlegroups.com>
In reply to: #52475

On Tuesday, August 13, 2013 5:58:07 PM UTC-5, Joel Goldstick wrote:
> lookup urlparse for you answer

I don't want to do any kind of HTML parsing.



#52479

From: Joel Goldstick <joel.goldstick@gmail.com>
Date: 2013-08-13 19:18 -0400
Message-ID: <mailman.551.1376435925.1251.python-list@python.org>
In reply to: #52476

On Tue, Aug 13, 2013 at 7:03 PM,  <englishkevin110@gmail.com> wrote:
> On Tuesday, August 13, 2013 5:58:07 PM UTC-5, Joel Goldstick wrote:
>> lookup urlparse for you answer
>
> I dont want to do any kind of HTML parsing.

Aside from the fact that I really want a pony, and you seem to want
your work done for you, look here:

http://stackoverflow.com/questions/11600681/parse-query-part-from-url
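
For the query-string half of the problem, here is a minimal sketch of what the urlparse suggestion amounts to. It assumes Python 3, where the functions live in urllib.parse (in Python 2 the module is named urlparse), and it assumes the href has already been isolated from the page; the href value below is made up for illustration.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical href value, as it might appear in the page source.
href = "/profile.php?id=12345&tab=info"

# Split the URL into components and keep only the query string.
query = urlparse(href).query          # "id=12345&tab=info"

# parse_qs maps each parameter name to a list of its values.
params = parse_qs(query)              # {"id": ["12345"], "tab": ["info"]}

# Take the first "id" value, or None when the parameter is missing.
profile_id = params.get("id", [None])[0]
print(profile_id)                     # prints 12345
```

This does not locate the link inside the HTML; it only pulls the parameter out of a URL you already hold.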



-- 
Joel Goldstick
http://joelgoldstick.com



#52481

From: Joel Goldstick <joel.goldstick@gmail.com>
Date: 2013-08-13 19:40 -0400
Message-ID: <mailman.553.1376437227.1251.python-list@python.org>
In reply to: #52476

On Tue, Aug 13, 2013 at 7:18 PM, Joel Goldstick
<joel.goldstick@gmail.com> wrote:
> On Tue, Aug 13, 2013 at 7:03 PM,  <englishkevin110@gmail.com> wrote:
>> On Tuesday, August 13, 2013 5:58:07 PM UTC-5, Joel Goldstick wrote:
>>> lookup urlparse for you answer
>>
>> I dont want to do any kind of HTML parsing.
>
> Aside from the fact that I really want a pony, and you seem to want
> your work done for you, look here:
>
> http://stackoverflow.com/questions/11600681/parse-query-part-from-url

I may have been too quick in reading your question.  You wanted to get
the value of the parameter, but also to find the URL in the page.  You
want to do this without parsing, if I understand you.  The good news
is that there is a module called Beautiful Soup that will do the
parsing for you.  The tutorial is excellent, and you should be up and
running within half an hour of downloading the module:

http://www.crummy.com/software/BeautifulSoup/bs4/doc/
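
Beautiful Soup is a third-party package. If installing it is not an option, the same idea (let a real parser find the links, then read the query string) can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not Beautiful Soup itself; the class name ProfileLinkParser and the sample markup are made up for illustration, and Python 3 is assumed.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse, parse_qs


class ProfileLinkParser(HTMLParser):
    """Collect the id parameter of every <a href="/profile.php?..."> link."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        parts = urlparse(href)
        if parts.path == "/profile.php":
            # parse_qs returns a list of values per parameter name.
            self.ids.extend(parse_qs(parts.query).get("id", []))


# Hypothetical page fragment standing in for starturlsource.
sample = '<p><a href="/profile.php?id=42">Al</a> <a href="/about.php">About</a></p>'
parser = ProfileLinkParser()
parser.feed(sample)
print(parser.ids)   # prints ['42']
```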

-- 
Joel Goldstick
http://joelgoldstick.com



#52497

From: Steven D'Aprano <steve@pearwood.info>
Date: 2013-08-14 06:29 +0000
Message-ID: <520b23d3$0$29885$c3e8da3$5496439d@news.astraweb.com>
In reply to: #52476

On Tue, 13 Aug 2013 16:03:46 -0700, englishkevin110 wrote:


> On Tuesday, August 13, 2013 5:58:07 PM UTC-5, Joel Goldstick wrote:
[fixing Joel's top-posting]

>> On Tue, Aug 13, 2013 at 6:51 PM,  <> wrote:
>> 
>> > I know the title doesn't make much sense, but I didnt know how to
>> > explain my problem.
>> 
>> 
>> >
>> > Anywho, I've opened a page's source in URLLIB
>> 
>> > starturlsource = starturlopen.read()
>> 
>> > string.find(starturlsource, '<a href="/profile.php?id=')
>> 
>> > And I used string.find to find a specific area in the page's source.
>> 
>> > I want to store what comes after ?id= in a variable.
>> 
>> > Can someone help me with this?


>> lookup urlparse for you answer


> I dont want to do any kind of HTML parsing.


What you are doing *is* HTML parsing, or at least a half-baked, fragile, 
likely to go wrong form of parsing.

But if you insist, the algorithm is simple: after calling find(), you 
have the offset to the search string. You know the length of the search 
string. Therefore you can calculate the index of the first character that 
follows the search string:

text = "blah blah blah blah spam spam... blah blah blah blah..."
needle = "spam spam"  # what we search for

i = text.find(needle)
if i == -1:
    print("not found")
else:
    print(text[i+len(needle):])


Of course, the problem is, you need to know not just the *start* offset 
of the bit that follows, but the *ending* offset as well. Which brings 
you into the realm of half-arsed parsing.
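
For completeness, one way to get the ending offset is a second find() that starts where the first match ended. This is only a sketch: the helper name value_after and the sample markup are made up, and it assumes the value is terminated by the closing double quote of the href attribute.

```python
def value_after(text, needle, terminator='"'):
    """Return the text between needle and the next terminator, or None."""
    i = text.find(needle)
    if i == -1:
        return None                      # needle is not in the text
    start = i + len(needle)              # first character after the needle
    end = text.find(terminator, start)   # search only past the needle
    if end == -1:
        return None                      # needle found but never terminated
    return text[start:end]


# Hypothetical page fragment.
page = '<td><a href="/profile.php?id=98765">Kevin</a></td>'
profile_id = value_after(page, '<a href="/profile.php?id=')
print(profile_id)   # prints 98765
```

If the link carries further parameters (say `&tab=info`), they end up in the result as well, which is exactly the fragility described above.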



-- 
Steven



#52487

From: Dave Angel <davea@davea.name>
Date: 2013-08-14 01:31 +0000
Message-ID: <mailman.557.1376443939.1251.python-list@python.org>
In reply to: #52474

englishkevin110@gmail.com wrote:

> I know the title doesn't make much sense, but I didnt know how to explain my problem.
>
> Anywho, I've opened a page's source in URLLIB
> starturlsource = starturlopen.read()
> string.find(starturlsource, '<a href="/profile.php?id=')
> And I used string.find to find a specific area in the page's source.
> I want to store what comes after ?id= in a variable.
> Can someone help me with this?

Python 3.3.0 (default, Mar  7 2013, 00:24:38) 
[GCC 4.6.3] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import string
>>> help(string.find)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'module' object has no attribute 'find'

There is no find function in the string module [1].  But assuming
starturlsource is a str, you could do:

pattern =  '<a href="/profile.php?id='
index = starturlsource.find( pattern )

index will then be -1 if there's no match, or have a non-negative value
if a match is found.

In the latter case, you can extract the next 17 characters (assuming
the value is always that long) with

newstr = starturlsource[index+len(pattern):index+len(pattern)+17]

You are of course making several assumptions about the web page, which
are perfectly reasonable since it's a page under your control.  Or is
it?
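
The fixed 17-character slice is one of those assumptions. If the length of the value can vary, a regular expression is one alternative that captures however many characters are there. This is a sketch of a different technique from the slice above, and it assumes the id consists of digits only; the sample markup is made up.

```python
import re

# Assumption: the id is made of digits only; adjust the character class otherwise.
PROFILE_ID = re.compile(r'<a href="/profile\.php\?id=(\d+)')

# Hypothetical page fragment with ids of different lengths.
page = ('<a href="/profile.php?id=7">a</a>'
        '<a href="/profile.php?id=1234567">b</a>')

# findall returns the captured group for every match, in document order.
ids = PROFILE_ID.findall(page)
print(ids)   # prints ['7', '1234567']
```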


[1]  Assuming Python 3.3 since you omitted stating the version you're
using.  But even in Python 2.7, using the string.find function is
deprecated in favor of the str method.

-- 
DaveA



#52523

From: John Gordon <gordon@panix.com>
Date: 2013-08-14 15:58 +0000
Message-ID: <kug9ea$267$1@reader1.panix.com>
In reply to: #52474

In <40816fed-38d4-4baa-92cc-c80cd8febd82@googlegroups.com> englishkevin110@gmail.com writes:

> I know the title doesn't make much sense, but I didnt know how to explain my problem.

> Anywho, I've opened a page's source in URLLIB
> starturlsource = starturlopen.read()
> string.find(starturlsource, '<a href="/profile.php?id=')
> And I used string.find to find a specific area in the page's source.
> I want to store what comes after ?id= in a variable.
> Can someone help me with this?

starturlsource = starturlopen.read()

match_string = '<a href="/profile.php?id='

# str.find() returns -1 when match_string is not present
match_index = starturlsource.find(match_string)

if match_index != -1:
    # everything after the match, up to the end of the page
    url = starturlsource[match_index + len(match_string):]
else:
    print('not found')
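
Note that the slice runs to the end of the page, so the value still has to be cut off somewhere. A possible sketch using str.partition, assuming the value ends at the closing double quote of the href attribute (the sample markup is made up):

```python
# Hypothetical page fragment standing in for starturlsource.
page = '... <a href="/profile.php?id=555">Profile</a> ...'
match_string = '<a href="/profile.php?id='

# partition() splits at the first occurrence; "found" is empty if there is none.
before, found, rest = page.partition(match_string)

if found:
    # Keep only what precedes the closing quote of the attribute.
    url_id = rest.partition('"')[0]
    print(url_id)   # prints 555
else:
    print('not found')
```

If the closing quote is missing, the second partition() returns everything that is left, so the result should be sanity-checked before use.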

-- 
John Gordon                   A is for Amy, who fell down the stairs
gordon@panix.com              B is for Basil, assaulted by bears
                                -- Edward Gorey, "The Gashlycrumb Tinies"


