


Re: Python to do CDC on XML files

From Chris Angelico <rosuav@gmail.com>
Newsgroups comp.lang.python
Subject Re: Python to do CDC on XML files
Date 2016-03-24 18:00 +1100
Message-ID <mailman.81.1458802854.2244.python-list@python.org> (permalink)
References <833ad88a-4840-4a23-8ab3-b736068b49fe@googlegroups.com> <CAP1rxO79Rzo3tAhR9E5djkhWB79x2QrHB-+0rStW_girQumobg@mail.gmail.com> <683FF696-8223-46FB-9A72-55839A8B4241@gmail.com>



On Thu, Mar 24, 2016 at 10:57 AM, Bruce Kirk <bruce.kirk24@gmail.com> wrote:
> I agree. The challenge is the volume of data to compare: 13 million records. So it needs to be very fast.

13M records is a good lot. To what extent can the data change? You may
find it easiest to do some sort of conversion to text, throwing away
any information that isn't "interesting", and then use the standard
'diff' utility to compare the text files. It's up to you to figure out
what differences are "uninteresting"; it'll depend on your exact data.
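
A minimal sketch of the conversion-to-text step, under assumptions that are not in the original post: each record is an element named "record", its fields are direct child elements, and a child named "timestamp" stands in for "uninteresting" data. It streams the file with xml.etree.ElementTree.iterparse and writes one normalized line per record, so that 'diff' can compare two dumps line by line.

```python
import xml.etree.ElementTree as ET

def dump_records(source, out, record_tag="record", ignore=("timestamp",)):
    """Write one normalized text line per record element; return the count.

    record_tag and ignore are hypothetical; adjust them to the real schema.
    """
    count = 0
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != record_tag:
            continue
        fields = []
        for child in elem:
            if child.tag in ignore:
                continue  # drop the "uninteresting" information
            fields.append(f"{child.tag}={(child.text or '').strip()}")
        # Sort fields so element order inside a record does not show up as a change.
        out.write("\t".join(sorted(fields)) + "\n")
        elem.clear()  # drop the record's children to limit memory use
        count += 1
    return count
```

A line-based diff of two such dumps is only meaningful if records appear in a stable order in both files; if they do not, sorting each dump first (for example with the standard 'sort' utility) is one option.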

As long as you can do the conversion to text in a simple and
straightforward way, the overall operation will be reasonably fast.
If this is a periodic thing (e.g. you're constantly checking today's
file against yesterday's), saving the dumped text file will mean you
generally need to convert just one file, halving your workload.
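
One way to keep the dumped text around between runs is sketched below. The helper name cached_dump, the ".txt" suffix, and the convert callable (any function that takes an XML path and an open text file) are all assumptions for illustration. The dump is rebuilt only when it is missing or older than its XML source.

```python
import os

def cached_dump(xml_path, convert, dump_path=None):
    """Return the path of the text dump for xml_path, converting only if needed.

    convert(xml_path, out) is a hypothetical callable that writes the text form.
    """
    dump_path = dump_path or xml_path + ".txt"
    stale = (not os.path.exists(dump_path)
             or os.path.getmtime(dump_path) < os.path.getmtime(xml_path))
    if stale:
        tmp = dump_path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as out:
            convert(xml_path, out)
        os.replace(tmp, dump_path)  # swap in the finished dump in one step
    return dump_path
```

With yesterday's dump already on disk, only today's file is converted, and the two resulting text files can then be handed to 'diff'.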

This isn't a solution so much as a broad pointer... hope it's at least a start!

ChrisA



