Groups | Search | Server Info | Keyboard shortcuts | Login | Register [http] [https] [nntp] [nntps]
Groups > comp.lang.python > #64461 > unrolled thread
| Started by | Shane Konings <shane.konings@gmail.com> |
|---|---|
| First post | 2014-01-21 15:49 -0800 |
| Last post | 2014-01-22 15:40 +0000 |
| Articles | 13 — 8 participants |
Back to article view | Back to comp.lang.python
Separate Address number and name Shane Konings <shane.konings@gmail.com> - 2014-01-21 15:49 -0800
Re: Separate Address number and name Anders Wegge Keller <wegge@wegge.dk> - 2014-01-22 00:55 +0100
Re: Separate Address number and name Shane Konings <shane.konings@gmail.com> - 2014-01-21 16:01 -0800
Re: Separate Address number and name Shane Konings <shane.konings@gmail.com> - 2014-01-21 16:06 -0800
Re: Separate Address number and name Anders Wegge Keller <wegge@wegge.dk> - 2014-01-22 02:04 +0100
Re: Separate Address number and name Mark Lawrence <breamoreboy@yahoo.co.uk> - 2014-01-22 10:08 +0000
Re: Separate Address number and name Denis McMahon <denismfmcmahon@gmail.com> - 2014-01-22 17:35 +0000
Re: Separate Address number and name Denis McMahon <denismfmcmahon@gmail.com> - 2014-01-23 18:11 +0000
Re: Separate Address number and name Asaf Las <roegltd@gmail.com> - 2014-01-21 16:08 -0800
Re: Separate Address number and name Ben Finney <ben+python@benfinney.id.au> - 2014-01-22 11:08 +1100
Re: Separate Address number and name John Gordon <gordon@panix.com> - 2014-01-22 02:46 +0000
Re: Separate Address number and name Tim Chase <python.list@tim.thechases.com> - 2014-01-21 21:03 -0600
Re: Separate Address number and name Denis McMahon <denismfmcmahon@gmail.com> - 2014-01-22 15:40 +0000
| From | Shane Konings <shane.konings@gmail.com> |
|---|---|
| Date | 2014-01-21 15:49 -0800 |
| Subject | Separate Address number and name |
| Message-ID | <9fe1b47b-65ce-4063-9188-07b81cdba49f@googlegroups.com> |
I have the following sample from a data set and I am looking to split the address number and name into separate headings as seen below. FarmID Address 1 1067 Niagara Stone 2 4260 Mountainview 3 25 Hunter 4 1091 Hutchinson 5 5172 Green Lane 6 500 Glenridge 7 471 Foss 8 758 Niagara Stone 9 3836 Main 10 1025 York FarmID AddressNum AddressName 1 1067 Niagara Stone 2 4260 Mountainview 3 25 Hunter 4 1091 Hutchinson 5 5172 Green Lane 6 500 Glenridge 7 471 Foss 8 758 Niagara Stone 9 3836 Main 10 1025 York I have struggled with this for a while and know there must be a simple method to achieve this result.
[toc] | [next] | [standalone]
| From | Anders Wegge Keller <wegge@wegge.dk> |
|---|---|
| Date | 2014-01-22 00:55 +0100 |
| Message-ID | <87r4803n5s.fsf@huddi.jernurt.dk> |
| In reply to | #64461 |
Shane Konings <shane.konings@gmail.com> writes: > I have struggled with this for a while and know there must be a > simple method to achieve this result. There are several. But without seeing the code you have already written, it's har to help you improve it. -- /Wegge Leder efter redundant peering af dk.*,linux.debian.*
[toc] | [prev] | [next] | [standalone]
| From | Shane Konings <shane.konings@gmail.com> |
|---|---|
| Date | 2014-01-21 16:01 -0800 |
| Message-ID | <23a0ec1e-bcc1-4251-9b0c-33be185969ff@googlegroups.com> |
| In reply to | #64462 |
> I don't have any code to split that part up. There is other information following the street name such as street suffix, city, province, postal code, etc. I have been able to split the rest of it up based on certain criteria but have had no luck with splitting up the street name from the street number.
[toc] | [prev] | [next] | [standalone]
| From | Shane Konings <shane.konings@gmail.com> |
|---|---|
| Date | 2014-01-21 16:06 -0800 |
| Message-ID | <a218f6ef-37c1-4eaf-a127-5ea4846cb332@googlegroups.com> |
| In reply to | #64461 |
inHandler = open(inFile, 'r')
outHandler = open(outFile, 'w')
outHandler.write('ID\tAddress\tStreetNum&Name\tSufType\tDir\tCity\tProvince\tPostalCode\n')
for line in inHandler:
str = line.replace('FarmID\tAddress','')
outHandler.write(str[0:-1])
str = str.replace(', ON', '\t ON\t')
str = str.replace(' Rd,', '\t Rd\t \t')
str = str.replace(' Rd ', '\t Rd\t \t')
str = str.replace(' St,', '\t St\t \t')
str = str.replace(' St ', '\t St\t \t')
str = str.replace(' Ave', '\t Ave\t \t')
str = str.replace(' Pky', '\t Pky\t \t')
str = str.replace(' Lane, ', '\t Lane\t \t')
str = str.replace(', Lane, , Rd,', ' Lane\t Rd\t')
str = str.replace(' Dr', '\t Dr\t \t')
str = str.replace(' Sq', '\t Sq\t \t')
str = str.replace(' Pl', '\t Pl\t \t')
str = str.replace('\t \tN,', '\tN\t')
str = str.replace('\t \t N,', '\tN\t')
str = str.replace(' , North ', ' N\t')
str = str.replace(' ,S ', ' S\t')
str = str.replace(' , South ', ' S\t')
str = str.replace('\t \tE,', '\tE\t')
str = str.replace(' , East ', ' E\t')
str = str.replace('\t \tW,', '\tW\t')
str = str.replace('\t \t West', '\tW\t')
str = str.replace(',.', '.')
str = str.replace(',', '\t')
str = str.replace(',,', '\t')
str = str.replace(', Service', ' Service')
str = str.replace('\t \t\t', '\t\t')
outHandler.write(str[1:])
inHandler.close()
outHandler.close()
That is the code i currently have.
The following is a sample of the data. There are hundreds of lines that need to have an automated process of splitting the strings into headings to be imported into excel with theses headings
ID Address StreetNum StreetName SufType Dir City Province PostalCode
1 1067 Niagara Stone Rd, W, Niagara-On-The-Lake, ON L0S 1J0
2 4260 Mountainview Rd, Lincoln, ON L0R 1B2
3 25 Hunter Rd, Grimsby, E, ON L3M 4A3
4 1091 Hutchinson Rd, Haldimand, ON N0A 1K0
5 5172 Green Lane Rd, Lincoln, ON L0R 1B3
6 500 Glenridge Ave, East, St. Catharines, ON L2S 3A1
7 471 Foss Rd, Pelham, ON L0S 1C0
8 758 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0
9 3836 Main St, North, Lincoln, ON L0R 1S0
10 1025 York Rd, W, Niagara-On-The-Lake, ON L0S 1P0
[toc] | [prev] | [next] | [standalone]
| From | Anders Wegge Keller <wegge@wegge.dk> |
|---|---|
| Date | 2014-01-22 02:04 +0100 |
| Message-ID | <87mwio3jym.fsf@huddi.jernurt.dk> |
| In reply to | #64465 |
Shane Konings <shane.konings@gmail.com> writes: ... > The following is a sample of the data. There are hundreds of lines > that need to have an automated process of splitting the strings into > headings to be imported into excel with theses headings > ID Address StreetNum StreetName SufType Dir City Province PostalCode > > > 1 1067 Niagara Stone Rd, W, Niagara-On-The-Lake, ON L0S 1J0 > 2 4260 Mountainview Rd, Lincoln, ON L0R 1B2 > 3 25 Hunter Rd, Grimsby, E, ON L3M 4A3 > 4 1091 Hutchinson Rd, Haldimand, ON N0A 1K0 > 5 5172 Green Lane Rd, Lincoln, ON L0R 1B3 > 6 500 Glenridge Ave, East, St. Catharines, ON L2S 3A1 > 7 471 Foss Rd, Pelham, ON L0S 1C0 > 8 758 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0 > 9 3836 Main St, North, Lincoln, ON L0R 1S0 > 10 1025 York Rd, W, Niagara-On-The-Lake, ON L0S 1P0 The input doesn't look consistent to me. Is Dir supposed to be an optional value? If that is the only optional, it can be worked around. But if the missing direction (I'm guessing) is due to malformed input data, you have a hell of a job in front of you. What do you want to do with incomplete or malformed data? Try to parse it as a "best effort", or simply spew out an error message for an operator to look at? In the latter case, I suggest a stepwise approach: * Split input by ',' ->res0 * Split the first result by ' ' -> res -> Id = res[0] -> Address = res[1:] -> StreetNum = res[1] -> StreetName= res [2:] -> SufType = res[-1] * Check if res0[1] looks like a cardinal direction If so Dir = res0[1] Otherwise, croak or use the default direction. Insert an element in the list, so the remainder is shifted to match the following steps. -> City = res0[2] * Split res0[3] by ' ' -> respp respp[0] -> Province respp[1:] -> Postcode And put in som basic sanitation of the resulting values, before committing them as a parsed result. Provinces and post codes, should be easy enough to validate against a fixed list. -- /Wegge Leder efter redundant peering af dk.*,linux.debian.*
[toc] | [prev] | [next] | [standalone]
| From | Mark Lawrence <breamoreboy@yahoo.co.uk> |
|---|---|
| Date | 2014-01-22 10:08 +0000 |
| Message-ID | <mailman.5833.1390385319.18130.python-list@python.org> |
| In reply to | #64465 |
On 22/01/2014 00:06, Shane Konings wrote: > The following is a sample of the data. There are hundreds of lines that need to have an automated process of splitting the strings into headings to be imported into excel with theses headings > See here for code that could simplify your extire task http://www.python-excel.org/ -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence
[toc] | [prev] | [next] | [standalone]
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2014-01-22 17:35 +0000 |
| Message-ID | <lbovgq$668$4@dont-email.me> |
| In reply to | #64465 |
On Tue, 21 Jan 2014 16:06:56 -0800, Shane Konings wrote: > The following is a sample of the data. There are hundreds of lines that > need to have an automated process of splitting the strings into headings > to be imported into excel with theses headings > > ID Address StreetNum StreetName SufType Dir City Province > PostalCode Ok, the following general method seems to work: First, use a regex to capture two numeric groups and the rest of the line separated by whitespace. If you can't find all three fields, you have unexpected data format. re.search( r"(\d+)\s+(\d+)\s+(.*)", data ) Second, split the rest of the line on a regex of comma + 0 or more whitespace. re.split( r",\s+", data ) Check that the rest of the line has 3 or 4 bits, otherwise you have an unexpected lack or excess of data fields. Split the first bit of the rest of the line into street name and suffix/ type. If you can't split it, use it as the street name and set the suffix/ type to blank. re.search( r"(.*)\s+(\w+)", data ) If there are 3 bits in rest of line, set direction to blank, otherwise set direction to the second bit. Set the city to the last but one bit of the rest of the line. Capture one word followed by two words in the last bit of the rest of the line, and use these as the province and postcode. re.search( r"(\w+)\s+(\w+\s+\w+)", data ) Providing none of the searches or the split errored, you should now have the data fields you need to write. The easiest way to write them might be to assemble them as a list and use the csv module. I'm assuming you're capable of working out from the help on the python re module what to use for each data, and how to access the captured results of a search, and the results of a split. I'm also assuming you're capable of working out how to use the csv module from the documentation. If you're not, then either go back and ask your lecturer for help, or tell your boss to hire a real programmer for his quick and easy coding jobs. -- Denis McMahon, denismfmcmahon@gmail.com
[toc] | [prev] | [next] | [standalone]
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2014-01-23 18:11 +0000 |
| Message-ID | <lbrm0h$dl1$2@dont-email.me> |
| In reply to | #64518 |
On Wed, 22 Jan 2014 17:35:22 +0000, Denis McMahon wrote: > On Tue, 21 Jan 2014 16:06:56 -0800, Shane Konings wrote: >> The following is a sample of the data. > A mechanism using regexes Just to follow up, using regexes I transformed the sample data that I believe is as follows: inputData = [ "1 1067 Niagara Stone Rd, W, Niagara-On-The-Lake, ON L0S 1J0", "2 4260 Mountainview Rd, Lincoln, ON L0R 1B2", "3 25 Hunter Rd, Grimsby, E, ON L3M 4A3", "4 1091 Hutchinson Rd, Haldimand, ON N0A 1K0", "5 5172 Green Lane Rd, Lincoln, ON L0R 1B3", "6 500 Glenridge Ave, East, St. Catharines, ON L2S 3A1", "7 471 Foss Rd, Pelham, ON L0S 1C0", "8 758 Niagara Stone Rd, Niagara-On-The-Lake, ON L0S 1J0", "9 3836 Main St, North, Lincoln, ON L0R 1S0", "10 1025 York Rd, W, Niagara-On-The-Lake, ON L0S 1P0" ] Into the following: 1,1067,Niagara Stone,Rd,W,Niagara-On-The-Lake,ON,L0S 1J0 2,4260,Mountainview,Rd,,Lincoln,ON,L0R 1B2 3,25,Hunter,Rd,Grimsby,E,ON,L3M 4A3 4,1091,Hutchinson,Rd,,Haldimand,ON,N0A 1K0 5,5172,Green Lane,Rd,,Lincoln,ON,L0R 1B3 6,500,Glenridge,Ave,East,St. Catharines,ON,L2S 3A1 7,471,Foss,Rd,,Pelham,ON,L0S 1C0 8,758,Niagara Stone,Rd,,Niagara-On-The-Lake,ON,L0S 1J0 9,3836,Main,St,North,Lincoln,ON,L0R 1S0 10,1025,York,Rd,W,Niagara-On-The-Lake,ON,L0S 1P0 Which should then read into Excel as CSV just fine. One final question though, why are you using excel to manipulate data that looks as if it would be better held in and manipulated by a database? -- Denis McMahon, denismfmcmahon@gmail.com
[toc] | [prev] | [next] | [standalone]
| From | Asaf Las <roegltd@gmail.com> |
|---|---|
| Date | 2014-01-21 16:08 -0800 |
| Message-ID | <32b20aae-5888-4011-878b-a1762589b12b@googlegroups.com> |
| In reply to | #64461 |
On Wednesday, January 22, 2014 1:49:16 AM UTC+2, Shane Konings wrote:
> I have the following sample from a data set and I am
> looking to split the address number and name into separate headings
>as seen below.
> I have struggled with this for a while and know there must be a simple method to achieve this result.
input = '''1 1067 Niagara Stone
2 4260 Mountainview
3 25 Hunter
4 1091 Hutchinson
5 5172 Green Lane
6 500 Glenridge
7 471 Foss
8 758 Niagara Stone
9 3836 Main
10 1025 York '''
tlist = input.splitlines()
for k in tlist:
print(k.split())
do with 'k' whatever you wish
[toc] | [prev] | [next] | [standalone]
| From | Ben Finney <ben+python@benfinney.id.au> |
|---|---|
| Date | 2014-01-22 11:08 +1100 |
| Message-ID | <mailman.5826.1390349308.18130.python-list@python.org> |
| In reply to | #64461 |
Shane Konings <shane.konings@gmail.com> writes:
> I have the following sample from a data set and I am looking to split
> the address number and name into separate headings as seen below.
>
> FarmID Address
> 1 1067 Niagara Stone
> 2 4260 Mountainview
> 3 25 Hunter
> 4 1091 Hutchinson
> 5 5172 Green Lane
> 6 500 Glenridge
> 7 471 Foss
> 8 758 Niagara Stone
> 9 3836 Main
> 10 1025 York
If it is *always* the case that you just want to split the string on the
first space (' ', U+0020) character, then this will do the job:
>>> street_address = "1067 Niagara Stone"
>>> (street_number, street_name) = street_address.split(' ', 1)
>>> street_number
'1067'
>>> street_name
'Niagara Stone'
But that's a very dubious assumption. Are you sure your data contains no
examples like:
PO Box 27
Unit 6, 52 Watford Avenue
Lot D property 107 Sandusky Road
etc.? What is your policy for dealing with data which isn't structured
as you assume?
--
\ “All my life I've had one dream: to achieve my many goals.” |
`\ —Homer, _The Simpsons_ |
_o__) |
Ben Finney
[toc] | [prev] | [next] | [standalone]
| From | John Gordon <gordon@panix.com> |
|---|---|
| Date | 2014-01-22 02:46 +0000 |
| Message-ID | <lbnbdn$69j$1@reader1.panix.com> |
| In reply to | #64461 |
In <9fe1b47b-65ce-4063-9188-07b81cdba49f@googlegroups.com> Shane Konings <shane.konings@gmail.com> writes:
> I have the following sample from a data set and I am looking to split the address number and name into separate headings as seen below.
> FarmID Address
> 1 1067 Niagara Stone
> 2 4260 Mountainview
> 3 25 Hunter
> 4 1091 Hutchinson
> 5 5172 Green Lane
> 6 500 Glenridge
> 7 471 Foss
> 8 758 Niagara Stone
> 9 3836 Main
> 10 1025 York
> FarmID AddressNum AddressName
> 1 1067 Niagara Stone
> 2 4260 Mountainview
> 3 25 Hunter
> 4 1091 Hutchinson
> 5 5172 Green Lane
> 6 500 Glenridge
> 7 471 Foss
> 8 758 Niagara Stone
> 9 3836 Main
> 10 1025 York
> I have struggled with this for a while and know there must be a simple
> method to achieve this result.
for line in input_lines:
fields = line.split()
farm_id = fields[0]
address_num = fields[1]
address_name = ' '.join(fields[2:])
--
John Gordon Imagine what it must be like for a real medical doctor to
gordon@panix.com watch 'House', or a real serial killer to watch 'Dexter'.
[toc] | [prev] | [next] | [standalone]
| From | Tim Chase <python.list@tim.thechases.com> |
|---|---|
| Date | 2014-01-21 21:03 -0600 |
| Message-ID | <mailman.5828.1390359774.18130.python-list@python.org> |
| In reply to | #64474 |
On 2014-01-22 02:46, John Gordon wrote:
> > FarmID AddressNum AddressName
> > 1 1067 Niagara Stone
> > 2 4260 Mountainview
> > 3 25 Hunter
> > 4 1091 Hutchinson
>
> > I have struggled with this for a while and know there must be a
> > simple method to achieve this result.
>
> for line in input_lines:
> fields = line.split()
> farm_id = fields[0]
> address_num = fields[1]
> address_name = ' '.join(fields[2:])
Or, you can split of just the parts you need:
for line in input_lines:
farm_id, street_no, street_name = line.split(None, 2)
It doesn't address the issues that Ben raised about the crazy formats
you can find in addresses, but address-parsing is an twisty maze of
passages all alike.
-tkc
[toc] | [prev] | [next] | [standalone]
| From | Denis McMahon <denismfmcmahon@gmail.com> |
|---|---|
| Date | 2014-01-22 15:40 +0000 |
| Message-ID | <lboopb$668$2@dont-email.me> |
| In reply to | #64461 |
On Tue, 21 Jan 2014 15:49:16 -0800, Shane Konings wrote: > I have the following sample from a data set and I am looking to split > the address number and name into separate headings as seen below. > > FarmID Address 1 1067 Niagara Stone 2 4260 Mountainview 3 25 Hunter 4 > 1091 Hutchinson 5 5172 Green Lane 6 500 Glenridge 7 471 Foss 8 758 > Niagara Stone 9 3836 Main 10 1025 York > > > FarmID AddressNum AddressName 1 1067 Niagara Stone 2 4260 > Mountainview 3 25 Hunter 4 1091 Hutchinson 5 > 5172 Green Lane 6 500 Glenridge 7 471 Foss > 8 758 Niagara Stone 9 3836 Main 10 1025 York > > I have struggled with this for a while and know there must be a simple > method to achieve this result. Unfortunately the vagaries of nntp, my client and the google nntp posting host are such that I can't discern the format of your data from your post. However, if as I think you have a text field that is always: <1 or more digits><1 or more spaces><the rest> where you want to capture the initial "1 or more digits" and "the rest" as 2 data elements then this should be possible with a simple re: (\d+)\s+(.*) If you have numeric id, whitespace, numeric addr bit, whitespace, the rest, then you may need something more like: (\d+)\s+(\d+)\s+(.*) The assumption is that it's not necessary to hold your hand through the whole looping through the input and applying the re to each line, then reading the captured bits and using them process. -- Denis McMahon, denismfmcmahon@gmail.com
[toc] | [prev] | [standalone]
Back to top | Article view | comp.lang.python
csiph-web