Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]


Groups > comp.lang.python > #64461 > unrolled thread

Separate Address number and name

Started byShane Konings <shane.konings@gmail.com>
First post2014-01-21 15:49 -0800
Last post2014-01-22 15:40 +0000
Articles 13 — 8 participants

Back to article view | Back to comp.lang.python


Contents

  Separate Address number and name Shane Konings <shane.konings@gmail.com> - 2014-01-21 15:49 -0800
    Re: Separate Address number and name Anders Wegge Keller <wegge@wegge.dk> - 2014-01-22 00:55 +0100
      Re: Separate Address number and name Shane Konings <shane.konings@gmail.com> - 2014-01-21 16:01 -0800
    Re: Separate Address number and name Shane Konings <shane.konings@gmail.com> - 2014-01-21 16:06 -0800
      Re: Separate Address number and name Anders Wegge Keller <wegge@wegge.dk> - 2014-01-22 02:04 +0100
      Re: Separate Address number and name Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-22 10:08 +0000
      Re: Separate Address number and name Denis McMahon <denismfmcmahon@gmail.com> - 2014-01-22 17:35 +0000
        Re: Separate Address number and name Denis McMahon <denismfmcmahon@gmail.com> - 2014-01-23 18:11 +0000
    Re: Separate Address number and name Asaf Las <roegltd@gmail.com> - 2014-01-21 16:08 -0800
    Re: Separate Address number and name Ben Finney <ben+python@benfinney.id.au> - 2014-01-22 11:08 +1100
    Re: Separate Address number and name John Gordon <gordon@panix.com> - 2014-01-22 02:46 +0000
      Re: Separate Address number and name Tim Chase <python.list@tim.thechases.com> - 2014-01-21 21:03 -0600
    Re: Separate Address number and name Denis McMahon <denismfmcmahon@gmail.com> - 2014-01-22 15:40 +0000

#64461 — Separate Address number and name

FromShane Konings <shane.konings@gmail.com>
Date2014-01-21 15:49 -0800
SubjectSeparate Address number and name
Message-ID<9fe1b47b-65ce-4063-9188-07b81cdba49f@googlegroups.com>
I have the following sample from a data set and I am looking to split the address number and name into separate headings as seen below.

FarmID	Address
1	1067 Niagara Stone
2	4260 Mountainview
3	25 Hunter
4	1091 Hutchinson
5	5172 Green Lane
6	500 Glenridge
7	471 Foss
8	758 Niagara Stone
9	3836 Main
10	1025 York


FarmID	AddressNum    AddressName
1	1067          Niagara Stone
2	4260          Mountainview
3	25            Hunter
4	1091          Hutchinson
5	5172          Green Lane
6	500           Glenridge
7	471           Foss
8	758           Niagara Stone
9	3836          Main
10	1025          York

I have struggled with this for a while and know there must be a simple method to achieve this result.

[toc] | [next] | [standalone]


#64462

FromAnders Wegge Keller <wegge@wegge.dk>
Date2014-01-22 00:55 +0100
Message-ID<87r4803n5s.fsf@huddi.jernurt.dk>
In reply to#64461
Shane Konings <shane.konings@gmail.com> writes:

> I have struggled with this for a while and know there must be a
> simple method to achieve this result.

 There are several. But without seeing the code you have already
written, it's har to help you improve it.

-- 
/Wegge

Leder efter redundant peering af dk.*,linux.debian.*

[toc] | [prev] | [next] | [standalone]


#64463

FromShane Konings <shane.konings@gmail.com>
Date2014-01-21 16:01 -0800
Message-ID<23a0ec1e-bcc1-4251-9b0c-33be185969ff@googlegroups.com>
In reply to#64462
> I don't have any code to split that part up. There is other information following the street name such as street suffix, city, province, postal code, etc. I have been able to split the rest of it up based on certain criteria but have had no luck with splitting up the street name from the street number.

[toc] | [prev] | [next] | [standalone]


#64465

FromShane Konings <shane.konings@gmail.com>
Date2014-01-21 16:06 -0800
Message-ID<a218f6ef-37c1-4eaf-a127-5ea4846cb332@googlegroups.com>
In reply to#64461
inHandler = open(inFile, 'r')
outHandler = open(outFile, 'w')
outHandler.write('ID\tAddress\tStreetNum&Name\tSufType\tDir\tCity\tProvince\tPostalCode\n')
for line in inHandler:
    str = line.replace('FarmID\tAddress','')
    outHandler.write(str[0:-1])
    str = str.replace(', ON', '\t ON\t')
    str = str.replace(' Rd,', '\t Rd\t \t')
    str = str.replace(' Rd ', '\t Rd\t \t')
    str = str.replace(' St,', '\t St\t \t')
    str = str.replace(' St ', '\t St\t \t')
    str = str.replace(' Ave', '\t Ave\t \t')
    str = str.replace(' Pky', '\t Pky\t \t')
    str = str.replace(' Lane, ', '\t Lane\t \t')
    str = str.replace(', Lane, , Rd,', ' Lane\t Rd\t')
    str = str.replace(' Dr', '\t Dr\t \t')
    str = str.replace(' Sq', '\t Sq\t \t')
    str = str.replace(' Pl', '\t Pl\t \t')
    str = str.replace('\t \tN,', '\tN\t')
    str = str.replace('\t \t N,', '\tN\t')
    str = str.replace(' , North ', ' N\t')
    str = str.replace(' ,S ', ' S\t')
    str = str.replace(' , South ', ' S\t')
    str = str.replace('\t \tE,', '\tE\t')
    str = str.replace(' , East ', ' E\t')
    str = str.replace('\t \tW,', '\tW\t')
    str = str.replace('\t \t West', '\tW\t')
    str = str.replace(',.', '.')
    str = str.replace(',', '\t')
    str = str.replace(',,', '\t')
    str = str.replace(', Service', ' Service')
    str = str.replace('\t \t\t', '\t\t')
    outHandler.write(str[1:])
inHandler.close()
outHandler.close()


That is the code i currently have.

The following is a sample of the data. There are hundreds of lines that need to have an automated process of splitting the strings into headings to be imported into excel with theses headings 

ID  Address  StreetNum  StreetName  SufType  Dir   City  Province  PostalCode


1	1067 Niagara Stone Rd, W, Niagara-On-The-Lake, ON L0S 1J0
2	4260 Mountainview Rd, Lincoln, ON L0R 1B2
3	25 Hunter Rd, Grimsby, E, ON L3M 4A3
4	1091 Hutchinson Rd, Haldimand, ON N0A 1K0
5	5172 Green Lane Rd, Lincoln, ON L0R 1B3
6	500 Glenridge Ave, East, St. Catharines, ON L2S 3A1
7	471 Foss Rd, Pelham, ON L0S 1C0
8	758 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0
9	3836 Main St, North, Lincoln, ON L0R 1S0
10	1025 York Rd, W, Niagara-On-The-Lake, ON L0S 1P0

[toc] | [prev] | [next] | [standalone]


#64471

FromAnders Wegge Keller <wegge@wegge.dk>
Date2014-01-22 02:04 +0100
Message-ID<87mwio3jym.fsf@huddi.jernurt.dk>
In reply to#64465
Shane Konings <shane.konings@gmail.com> writes:

...

> The following is a sample of the data. There are hundreds of lines
> that need to have an automated process of splitting the strings into
> headings to be imported into excel with theses headings

> ID  Address  StreetNum  StreetName  SufType  Dir   City  Province  PostalCode
> 
> 
> 1	1067 Niagara Stone Rd, W, Niagara-On-The-Lake, ON L0S 1J0
> 2	4260 Mountainview Rd, Lincoln, ON L0R 1B2
> 3	25 Hunter Rd, Grimsby, E, ON L3M 4A3
> 4	1091 Hutchinson Rd, Haldimand, ON N0A 1K0
> 5	5172 Green Lane Rd, Lincoln, ON L0R 1B3
> 6	500 Glenridge Ave, East, St. Catharines, ON L2S 3A1
> 7	471 Foss Rd, Pelham, ON L0S 1C0
> 8	758 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0
> 9	3836 Main St, North, Lincoln, ON L0R 1S0
> 10	1025 York Rd, W, Niagara-On-The-Lake, ON L0S 1P0

 The input doesn't look consistent to me. Is Dir supposed to be an
optional value? If that is the only optional, it can be worked
around. But if the missing direction (I'm guessing) is due to
malformed input data, you have a hell of a job in front of you.

 What do you want to do with incomplete or malformed data? Try to
parse it as a "best effort", or simply spew out an error message for
an operator to look at?

 In the latter case, I suggest a stepwise approach:

* Split input by ',' ->res0

* Split the first result by ' ' -> res

-> Id = res[0]
-> Address = res[1:]
-> StreetNum = res[1]
-> StreetName= res [2:]
-> SufType = res[-1]

* Check if res0[1] looks like a cardinal direction
 If so Dir = res0[1]
 Otherwise, croak or use the default direction. Insert an element in
 the list, so the remainder is shifted to match the following steps.

-> City = res0[2]

* Split res0[3] by ' ' -> respp

respp[0] -> Province
respp[1:] -> Postcode


 And put in som basic sanitation of the resulting values, before
committing them as a parsed result. Provinces and post codes, should
be easy enough to validate against a fixed list. 

-- 
/Wegge

Leder efter redundant peering af dk.*,linux.debian.*

[toc] | [prev] | [next] | [standalone]


#64489

FromMark Lawrence <breamoreboy@yahoo.co.uk>
Date2014-01-22 10:08 +0000
Message-ID<mailman.5833.1390385319.18130.python-list@python.org>
In reply to#64465
On 22/01/2014 00:06, Shane Konings wrote:

> The following is a sample of the data. There are hundreds of lines that need to have an automated process of splitting the strings into headings to be imported into excel with theses headings
>

See here for code that could simplify your extire task 
http://www.python-excel.org/

-- 
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.

Mark Lawrence

[toc] | [prev] | [next] | [standalone]


#64518

FromDenis McMahon <denismfmcmahon@gmail.com>
Date2014-01-22 17:35 +0000
Message-ID<lbovgq$668$4@dont-email.me>
In reply to#64465
On Tue, 21 Jan 2014 16:06:56 -0800, Shane Konings wrote:


> The following is a sample of the data. There are hundreds of lines that
> need to have an automated process of splitting the strings into headings
> to be imported into excel with theses headings
> 
> ID  Address  StreetNum  StreetName  SufType  Dir   City  Province 
> PostalCode

Ok, the following general method seems to work:

First, use a regex to capture two numeric groups and the rest of the line 
separated by whitespace. If you can't find all three fields, you have 
unexpected data format.

re.search( r"(\d+)\s+(\d+)\s+(.*)", data )

Second, split the rest of the line on a regex of comma + 0 or more 
whitespace.

re.split( r",\s+", data )

Check that the rest of the line has 3 or 4 bits, otherwise you have an 
unexpected lack or excess of data fields.

Split the first bit of the rest of the line into street name and suffix/
type. If you can't split it, use it as the street name and set the suffix/
type to blank.

re.search( r"(.*)\s+(\w+)", data )

If there are 3 bits in rest of line, set direction to blank, otherwise 
set direction to the second bit.

Set the city to the last but one bit of the rest of the line.

Capture one word followed by two words in the last bit of the rest of the 
line, and use these as the province and postcode.

re.search( r"(\w+)\s+(\w+\s+\w+)", data )

Providing none of the searches or the split errored, you should now have 
the data fields you need to write. The easiest way to write them might be 
to assemble them as a list and use the csv module.

I'm assuming you're capable of working out from the help on the python re 
module what to use for each data, and how to access the captured results 
of a search, and the results of a split. I'm also assuming you're capable 
of working out how to use the csv module from the documentation. If 
you're not, then either go back and ask your lecturer for help, or tell 
your boss to hire a real programmer for his quick and easy coding jobs.

-- 
Denis McMahon, denismfmcmahon@gmail.com

[toc] | [prev] | [next] | [standalone]


#64627

FromDenis McMahon <denismfmcmahon@gmail.com>
Date2014-01-23 18:11 +0000
Message-ID<lbrm0h$dl1$2@dont-email.me>
In reply to#64518
On Wed, 22 Jan 2014 17:35:22 +0000, Denis McMahon wrote:

> On Tue, 21 Jan 2014 16:06:56 -0800, Shane Konings wrote:

>> The following is a sample of the data.

> A mechanism using regexes

Just to follow up, using regexes I transformed the sample data that I 
believe is as follows:

inputData = [
"1	1067 Niagara Stone Rd, W, Niagara-On-The-Lake, ON L0S 1J0",
"2	4260 Mountainview Rd, Lincoln, ON L0R 1B2",
"3	25 Hunter Rd, Grimsby, E, ON L3M 4A3",
"4	1091 Hutchinson Rd, Haldimand, ON N0A 1K0",
"5	5172 Green Lane Rd, Lincoln, ON L0R 1B3",
"6	500 Glenridge Ave, East, St. Catharines, ON L2S 3A1",
"7	471 Foss Rd, Pelham, ON L0S 1C0",
"8	758 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0",
"9	3836 Main St, North, Lincoln, ON L0R 1S0",
"10	1025 York Rd, W, Niagara-On-The-Lake, ON L0S 1P0"
]

Into the following:

1,1067,Niagara Stone,Rd,W,Niagara-On-The-Lake,ON,L0S 1J0
2,4260,Mountainview,Rd,,Lincoln,ON,L0R 1B2
3,25,Hunter,Rd,Grimsby,E,ON,L3M 4A3
4,1091,Hutchinson,Rd,,Haldimand,ON,N0A 1K0
5,5172,Green Lane,Rd,,Lincoln,ON,L0R 1B3
6,500,Glenridge,Ave,East,St. Catharines,ON,L2S 3A1
7,471,Foss,Rd,,Pelham,ON,L0S 1C0
8,758,Niagara Stone,Rd,,Niagara-On-The-Lake,ON,L0S 1J0
9,3836,Main,St,North,Lincoln,ON,L0R 1S0
10,1025,York,Rd,W,Niagara-On-The-Lake,ON,L0S 1P0

Which should then read into Excel as CSV just fine.

One final question though, why are you using excel to manipulate data 
that looks as if it would be better held in and manipulated by a database?

-- 
Denis McMahon, denismfmcmahon@gmail.com

[toc] | [prev] | [next] | [standalone]


#64466

FromAsaf Las <roegltd@gmail.com>
Date2014-01-21 16:08 -0800
Message-ID<32b20aae-5888-4011-878b-a1762589b12b@googlegroups.com>
In reply to#64461
On Wednesday, January 22, 2014 1:49:16 AM UTC+2, Shane Konings wrote:
> I have the following sample from a data set and I am 
> looking to split the address number and name into separate headings 
>as seen below.
> I have struggled with this for a while and know there must be a simple method to achieve this result.

input = '''1        1067 Niagara Stone
2        4260 Mountainview
3        25 Hunter
4        1091 Hutchinson
5        5172 Green Lane
6        500 Glenridge
7        471 Foss
8        758 Niagara Stone
9        3836 Main
10        1025 York '''

tlist = input.splitlines()
for k in tlist:
    print(k.split())



do with 'k' whatever you wish 

[toc] | [prev] | [next] | [standalone]


#64467

FromBen Finney <ben+python@benfinney.id.au>
Date2014-01-22 11:08 +1100
Message-ID<mailman.5826.1390349308.18130.python-list@python.org>
In reply to#64461
Shane Konings <shane.konings@gmail.com> writes:

> I have the following sample from a data set and I am looking to split
> the address number and name into separate headings as seen below.
>
> FarmID	Address
> 1	1067 Niagara Stone
> 2	4260 Mountainview
> 3	25 Hunter
> 4	1091 Hutchinson
> 5	5172 Green Lane
> 6	500 Glenridge
> 7	471 Foss
> 8	758 Niagara Stone
> 9	3836 Main
> 10	1025 York

If it is *always* the case that you just want to split the string on the
first space (' ', U+0020) character, then this will do the job:

    >>> street_address = "1067 Niagara Stone"
    >>> (street_number, street_name) = street_address.split(' ', 1)
    >>> street_number
    '1067'
    >>> street_name
    'Niagara Stone'

But that's a very dubious assumption. Are you sure your data contains no
examples like:

    PO Box 27
    Unit 6, 52 Watford Avenue
    Lot D property 107 Sandusky Road

etc.? What is your policy for dealing with data which isn't structured
as you assume?

-- 
 \         “All my life I've had one dream: to achieve my many goals.” |
  `\                                            —Homer, _The Simpsons_ |
_o__)                                                                  |
Ben Finney

[toc] | [prev] | [next] | [standalone]


#64474

FromJohn Gordon <gordon@panix.com>
Date2014-01-22 02:46 +0000
Message-ID<lbnbdn$69j$1@reader1.panix.com>
In reply to#64461
In <9fe1b47b-65ce-4063-9188-07b81cdba49f@googlegroups.com> Shane Konings <shane.konings@gmail.com> writes:

> I have the following sample from a data set and I am looking to split the address number and name into separate headings as seen below.

> FarmID	Address
> 1	1067 Niagara Stone
> 2	4260 Mountainview
> 3	25 Hunter
> 4	1091 Hutchinson
> 5	5172 Green Lane
> 6	500 Glenridge
> 7	471 Foss
> 8	758 Niagara Stone
> 9	3836 Main
> 10	1025 York


> FarmID	AddressNum    AddressName
> 1	1067          Niagara Stone
> 2	4260          Mountainview
> 3	25            Hunter
> 4	1091          Hutchinson
> 5	5172          Green Lane
> 6	500           Glenridge
> 7	471           Foss
> 8	758           Niagara Stone
> 9	3836          Main
> 10	1025          York

> I have struggled with this for a while and know there must be a simple
> method to achieve this result.

for line in input_lines:
    fields = line.split()
    farm_id = fields[0]
    address_num = fields[1]
    address_name = ' '.join(fields[2:])

-- 
John Gordon         Imagine what it must be like for a real medical doctor to
gordon@panix.com    watch 'House', or a real serial killer to watch 'Dexter'.

[toc] | [prev] | [next] | [standalone]


#64477

FromTim Chase <python.list@tim.thechases.com>
Date2014-01-21 21:03 -0600
Message-ID<mailman.5828.1390359774.18130.python-list@python.org>
In reply to#64474
On 2014-01-22 02:46, John Gordon wrote:
> > FarmID	AddressNum    AddressName
> > 1	1067          Niagara Stone
> > 2	4260          Mountainview
> > 3	25            Hunter
> > 4	1091          Hutchinson
> 
> > I have struggled with this for a while and know there must be a
> > simple method to achieve this result.
> 
> for line in input_lines:
>     fields = line.split()
>     farm_id = fields[0]
>     address_num = fields[1]
>     address_name = ' '.join(fields[2:])

Or, you can split of just the parts you need:

  for line in input_lines:
    farm_id, street_no, street_name = line.split(None, 2)

It doesn't address the issues that Ben raised about the crazy formats
you can find in addresses, but address-parsing is an twisty maze of
passages all alike.

-tkc


[toc] | [prev] | [next] | [standalone]


#64503

FromDenis McMahon <denismfmcmahon@gmail.com>
Date2014-01-22 15:40 +0000
Message-ID<lboopb$668$2@dont-email.me>
In reply to#64461
On Tue, 21 Jan 2014 15:49:16 -0800, Shane Konings wrote:

> I have the following sample from a data set and I am looking to split
> the address number and name into separate headings as seen below.
> 
> FarmID	Address 1	1067 Niagara Stone 2	4260 Mountainview 
3	25 Hunter 4
> 1091 Hutchinson 5	5172 Green Lane 6	500 Glenridge 7	471 Foss 
8	758
> Niagara Stone 9	3836 Main 10	1025 York
> 
> 
> FarmID	AddressNum    AddressName 1	1067          Niagara 
Stone 2	4260   
>       Mountainview 3	25            Hunter 4	1091          Hutchinson 5
> 5172          Green Lane 6	500           Glenridge 7	
471           Foss
> 8	758           Niagara Stone 9	3836          Main 10	
1025          York
> 
> I have struggled with this for a while and know there must be a simple
> method to achieve this result.

Unfortunately the vagaries of nntp, my client and the google nntp posting 
host are such that I can't discern the format of your data from your post.

However, if as I think you have a text field that is always:

<1 or more digits><1 or more spaces><the rest>

where you want to capture the initial "1 or more digits" and "the rest" 
as 2 data elements then this should be possible with a simple re:

(\d+)\s+(.*)

If you have numeric id, whitespace, numeric addr bit, whitespace, the 
rest, then you may need something more like:

(\d+)\s+(\d+)\s+(.*)

The assumption is that it's not necessary to hold your hand through the 
whole looping through the input and applying the re to each line, then 
reading the captured bits and using them process.

-- 
Denis McMahon, denismfmcmahon@gmail.com

[toc] | [prev] | [standalone]


Back to top | Article view | comp.lang.python


csiph-web