


Python to do CDC on XML files

Started by: Bruce Kirk <bruce.kirk24@gmail.com>
First post: 2016-03-23 13:16 -0700
Last post: 2016-03-24 09:19 +0100
Articles 5 — 4 participants



Contents

  Python to do CDC on XML files Bruce Kirk <bruce.kirk24@gmail.com> - 2016-03-23 13:16 -0700
    Re: Python to do CDC on XML files Bob Gailer <bgailer@gmail.com> - 2016-03-23 16:47 -0400
    Re: Python to do CDC on XML files Bruce Kirk <bruce.kirk24@gmail.com> - 2016-03-23 19:57 -0400
    Re: Python to do CDC on XML files Chris Angelico <rosuav@gmail.com> - 2016-03-24 18:00 +1100
    Re: Python to do CDC on XML files Peter Otten <__peter__@web.de> - 2016-03-24 09:19 +0100

#105569 — Python to do CDC on XML files

From: Bruce Kirk <bruce.kirk24@gmail.com>
Date: 2016-03-23 13:16 -0700
Subject: Python to do CDC on XML files
Message-ID: <833ad88a-4840-4a23-8ab3-b736068b49fe@googlegroups.com>

Does anyone know of any existing projects that generate a change data capture (CDC) from two very large XML files?

The XML structures are the same; only the data within the files may differ.

I need to take an XML file from yesterday, compare it to the XML file produced today, and note which XML records have changed.

I have done a Google search and I am not able to find much on the subject other than software vendors trying to sell me their products. :-)

Regards



#105571

From: Bob Gailer <bgailer@gmail.com>
Date: 2016-03-23 16:47 -0400
Message-ID: <mailman.68.1458766031.2244.python-list@python.org>
In reply to: #105569

On Mar 23, 2016 4:20 PM, "Bruce Kirk" <bruce.kirk24@gmail.com> wrote:
>
> Does anyone know of any existing projects that generate a change data
> capture (CDC) from two very large XML files?
>
> The XML structures are the same; only the data within the files may
> differ.
>
It should not be too difficult to write a program that locates the tags
delimiting each record and then compares the records.
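
A minimal sketch of this idea, not a tested solution for the real files. It assumes each record is an element named `record` with a unique `id` attribute; both names are hypothetical, since the thread never shows the actual schema. It streams each file with the standard library's `xml.etree.ElementTree.iterparse`, keeps one digest per record, and compares the two digest maps.

```python
import hashlib
import io
import xml.etree.ElementTree as ET


def record_digests(source, record_tag="record", key_attr="id"):
    """Stream an XML file and map each record's key to a digest of its content.

    record_tag and key_attr are assumptions about the schema.
    """
    digests = {}
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            # The tail (text after the closing tag) may not be parsed yet at
            # the "end" event, so exclude it to keep the digest deterministic.
            elem.tail = None
            payload = ET.tostring(elem, encoding="utf-8")
            digests[elem.get(key_attr)] = hashlib.sha1(payload).hexdigest()
            # Drop the record's children and text; the emptied element itself
            # stays attached to its parent.
            elem.clear()
    return digests


def compare(old, new):
    """Return (added, removed, changed) key lists for two digest maps."""
    added = sorted(k for k in new if k not in old)
    removed = sorted(k for k in old if k not in new)
    changed = sorted(k for k in new if k in old and new[k] != old[k])
    return added, removed, changed


# Tiny made-up inputs standing in for yesterday's and today's files.
OLD_XML = (b"<root><record id='1'><v>a</v></record>"
           b"<record id='2'><v>b</v></record>"
           b"<record id='3'><v>c</v></record></root>")
NEW_XML = (b"<root><record id='1'><v>a</v></record>"
           b"<record id='2'><v>B</v></record>"
           b"<record id='4'><v>d</v></record></root>")

added, removed, changed = compare(
    record_digests(io.BytesIO(OLD_XML)),
    record_digests(io.BytesIO(NEW_XML)),
)
print(added, removed, changed)
```

`record_digests` also accepts a filename. Comparing serialized bytes treats whitespace or attribute-order changes as differences, so real data may need normalizing first.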



#105588

From: Bruce Kirk <bruce.kirk24@gmail.com>
Date: 2016-03-23 19:57 -0400
Message-ID: <mailman.79.1458801774.2244.python-list@python.org>
In reply to: #105569

I agree. The challenge is the volume of data to compare: 13 million records, so it needs to be very fast.




#105590

From: Chris Angelico <rosuav@gmail.com>
Date: 2016-03-24 18:00 +1100
Message-ID: <mailman.81.1458802854.2244.python-list@python.org>
In reply to: #105569

On Thu, Mar 24, 2016 at 10:57 AM, Bruce Kirk <bruce.kirk24@gmail.com> wrote:
> I agree. The challenge is the volume of data to compare: 13 million records, so it needs to be very fast.

13M records is a good lot. To what extent can the data change? You may
find it easiest to do some sort of conversion to text, throwing away
any information that isn't "interesting", and then use the standard
'diff' utility to compare the text files. It's up to you to figure out
what differences are "uninteresting"; it'll depend on your exact data.

As long as you can do the conversion to text in a simple and
straightforward way, the overall operation will be reasonably fast.
If this is a periodic thing (e.g. you're constantly checking today's
file against yesterday's), saving the dumped text file will mean you
generally need to convert just one file, halving your workload.
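
A rough sketch of the convert-to-text step, under stated assumptions: records are `record` elements with an `id` attribute, and a child named `updated` is the "uninteresting" field to discard (all three names are hypothetical). For a self-contained example it compares the dumps with the standard library's `difflib`; for millions of records you would more likely write the lines to files and run the external `diff` utility as described above.

```python
import difflib
import io
import xml.etree.ElementTree as ET


def dump_records(source, record_tag="record", skip=("updated",)):
    """Return one text line per record, omitting the child tags listed in skip."""
    lines = []
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            fields = [
                f"{child.tag}={(child.text or '').strip()}"
                for child in elem
                if child.tag not in skip
            ]
            lines.append(f"{elem.get('id')}\t" + "\t".join(fields) + "\n")
            elem.clear()  # drop the record's children once it is dumped
    # Sort so that a mere reordering of records is not reported as a change.
    lines.sort()
    return lines


def text_diff(old_lines, new_lines):
    """Unified diff of two dumps with no context lines."""
    return list(difflib.unified_diff(
        old_lines, new_lines,
        fromfile="yesterday.txt", tofile="today.txt", n=0,
    ))


# Made-up sample data: only record 2 differs in an "interesting" field.
OLD_XML = (b"<root><record id='1'><v>a</v><updated>mon</updated></record>"
           b"<record id='2'><v>b</v><updated>mon</updated></record></root>")
NEW_XML = (b"<root><record id='1'><v>a</v><updated>tue</updated></record>"
           b"<record id='2'><v>B</v><updated>tue</updated></record></root>")

diff = text_diff(dump_records(io.BytesIO(OLD_XML)),
                 dump_records(io.BytesIO(NEW_XML)))
print("".join(diff), end="")
```

Saving yesterday's dumped lines to disk means only today's file has to be converted on the next run.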

This isn't a solution so much as a broad pointer... hope it's at least a start!

ChrisA



#105591

From: Peter Otten <__peter__@web.de>
Date: 2016-03-24 09:19 +0100
Message-ID: <mailman.82.1458807583.2244.python-list@python.org>
In reply to: #105569

Bruce Kirk wrote:

> Does anyone know of any existing projects that generate a change data
> capture (CDC) from two very large XML files?
>
> The XML structures are the same; only the data within the files may
> differ.
>
> I need to take an XML file from yesterday, compare it to the XML file
> produced today, and note which XML records have changed.
>
> I have done a Google search and I am not able to find much on the subject
> other than software vendors trying to sell me their products. :-)

There is

http://www.logilab.org/project/xmldiff

As an alternative, you may try to log the changes as they occur instead of
inspecting the result. If the application generating the file is not under
your control, does it offer other output formats, e.g. CSV?

Or, if the XML file is basically a sequence of one type of node, you may
convert it to a database (SQLite will do) to match and compare the
"records".

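One way the database approach might look, as a sketch only. It assumes every record is a `record` element with a unique `id` attribute (hypothetical names), loads each file into its own SQLite table via the standard library's `sqlite3` module, and lets SQL do the matching. The table names `yesterday` and `today` and the column names are made up for illustration.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET


def records(source, record_tag="record", key_attr="id"):
    """Yield (id, serialized record) pairs while streaming the XML source."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == record_tag:
            elem.tail = None  # the tail is not part of the record's content
            yield elem.get(key_attr), ET.tostring(elem, encoding="unicode")
            elem.clear()  # drop the record's children once it is stored


def load(conn, table, source):
    """Create a table and fill it; table must be a trusted identifier."""
    with conn:  # commit on success
        conn.execute(f"CREATE TABLE {table} (rec_id TEXT PRIMARY KEY, body TEXT)")
        conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", records(source))


def changes(conn):
    """Return (added, removed, changed) id lists between the two tables."""
    def ids(sql):
        return [row[0] for row in conn.execute(sql)]

    added = ids("SELECT rec_id FROM today WHERE rec_id NOT IN "
                "(SELECT rec_id FROM yesterday) ORDER BY rec_id")
    removed = ids("SELECT rec_id FROM yesterday WHERE rec_id NOT IN "
                  "(SELECT rec_id FROM today) ORDER BY rec_id")
    changed = ids("SELECT t.rec_id FROM today AS t "
                  "JOIN yesterday AS y ON y.rec_id = t.rec_id "
                  "WHERE t.body <> y.body ORDER BY t.rec_id")
    return added, removed, changed


# Made-up sample data standing in for the two large files.
OLD_XML = (b"<root><record id='1'><v>a</v></record>"
           b"<record id='2'><v>b</v></record>"
           b"<record id='3'><v>c</v></record></root>")
NEW_XML = (b"<root><record id='1'><v>a</v></record>"
           b"<record id='2'><v>B</v></record>"
           b"<record id='4'><v>d</v></record></root>")

conn = sqlite3.connect(":memory:")
load(conn, "yesterday", io.BytesIO(OLD_XML))
load(conn, "today", io.BytesIO(NEW_XML))
print(changes(conn))
```

With a database file on disk instead of `:memory:`, yesterday's table can be kept between runs so that only the new file has to be loaded each day.
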



