
Non-correcting library for parsing/modifying broken HTML/PHP files?

Started by: Markus Fischer <markus@fischer.name>
First post: 2011-04-05 03:56 -0500
Last post: 2011-04-05 08:27 -0500
Articles: 3 (2 participants)

Contents

  Non-correcting library for parsing/modifying broken HTML/PHP files? Markus Fischer <markus@fischer.name> - 2011-04-05 03:56 -0500
    Re: Non-correcting library for parsing/modifying broken HTML/PHP files? Robert Klemme <shortcutter@googlemail.com> - 2011-04-05 07:59 -0500
      Re: Non-correcting library for parsing/modifying broken HTML/PHP files? Markus Fischer <markus@fischer.name> - 2011-04-05 08:27 -0500

#2308 — Non-correcting library for parsing/modifying broken HTML/PHP files?

From: Markus Fischer <markus@fischer.name>
Date: 2011-04-05 03:56 -0500
Subject: Non-correcting library for parsing/modifying broken HTML/PHP files?
Message-ID: <4D9AD939.8050600@fischer.name>

Hi,

Does anyone know of a library which can work with broken/malformed 
HTML/PHP and still produce output identical to the input?

So far I've tried Nokogiri and Hpricot. They're absolutely amazing and 
excel at their purpose, but they fail to meet my requirement that, when 
saving the HTML, nothing I haven't changed through DOM manipulation 
should change in the output.

The thing is that I have to work with such horribly broken HTML (or 
rather, PHP) documents that those libraries are über-tempted to correct 
them. This is troublesome for me, as I have to fix a few hundred, maybe 
up to thousands of, documents, and their versioned history should really 
only reflect the change I'm making and not what the library needs to 
change so it can work with them. I searched RubyGems but could not find 
any other libraries; did I miss some?

Enough words; here's an example:

$ cat test.php
<?php include_once('whatever.php'); ?>
<html><title> anything</title>
         <?php includeHtmlHeader(' blabla',',')?>
         <body topmargin="0" bgcolor="#ffffff" leftmargin="0" 
link="#003366" marginheight="0" marginwidth="0" vlink="#003366" 
alink="#800000" >
                 <?includeFile('/application/templates/whatever.shtml')?>
                 <br>
         <?php echo more::code("andsuch"); ?>


<script type="text/javascript">OAS_AD('Position1');</script>


$ ruby -v ; gem list|grep nokogi
ruby 1.9.2p180 (2011-02-18 revision 30909) [x86_64-linux]
nokogiri (1.4.4)


$ ruby -rnokogiri -e 'html = Nokogiri::HTML::Document.parse( 
open("test.php").read) ; open("test2.php", "w") { |f| f.write( 
html.to_html)}'


$ diff -u test.php  test2.php
--- test.php    2011-04-05 10:50:00.000000000 +0200
+++ test2.php   2011-04-05 10:52:31.000000000 +0200
@@ -1,10 +1,11 @@
-<?php include_once('whatever.php'); ?>
-<html><title> anything</title>
-       <?php includeHtmlHeader(' blabla',',')?>
-       <body topmargin="0" bgcolor="#ffffff" leftmargin="0" 
link="#003366" marginheight="0" marginwidth="0" vlink="#003366" 
alink="#800000" >
-               <?includeFile('/application/templates/whatever.shtml')?>
-               <br>
-        <?php echo more::code("andsuch"); ?>
-
-
-<script type="text/javascript">OAS_AD('Position1');</script>
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" 
"http://www.w3.org/TR/REC-html40/loose.dtd">
+<?php include_once('whatever.php'); ?><html>
+<head>
+<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+<title> anything</title>
+<?php includeHtmlHeader(' blabla',',')?>
+</head>
+<body topmargin="0" bgcolor="#ffffff" leftmargin="0" link="#003366" 
marginheight="0" marginwidth="0" vlink="#003366" alink="#800000">
+               <?includeFile 
('/application/templates/whatever.shtml')?><br><?php echo 
more::code("andsuch"); ?><script 
type="text/javascript">OAS_AD('Position1');</script>
+</body>
+</html>


Now with Hpricot:

$ gem list|grep hpri
hpricot (0.8.4)


$ ruby -rhpricot -e 'html = Hpricot( open("test.php").read) ; 
open("test2.php", "w") { |f| f.write( html.to_html)}'


$ diff -u test.php  test2.php
--- test.php    2011-04-05 10:50:00.000000000 +0200
+++ test2.php   2011-04-05 10:53:19.000000000 +0200
@@ -1,10 +1,11 @@
  <?php include_once('whatever.php'); ?>
  <html><title> anything</title>
         <?php includeHtmlHeader(' blabla',',')?>
-       <body topmargin="0" bgcolor="#ffffff" leftmargin="0" 
link="#003366" marginheight="0" marginwidth="0" vlink="#003366" 
alink="#800000" >
+       <body topmargin="0" bgcolor="#ffffff" leftmargin="0" 
link="#003366" marginheight="0" marginwidth="0" vlink="#003366" 
alink="#800000">
                 <?includeFile('/application/templates/whatever.shtml')?>
-               <br>
+               <br />
          <?php echo more::code("andsuch"); ?>


  <script type="text/javascript">OAS_AD('Position1');</script>
+</body></html>
\ No newline at end of file


Much better, still ... as the real documents are more complex than this 
sample, the changes made by the libraries grow bigger.

thanks,
- Markus
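
One possible way to get this kind of non-correcting edit, sketched here 
as an assumption and not as an existing library: skip the DOM entirely, 
locate only the tag you want to change with a regular expression, and 
copy every other byte through untouched. The helper names 
edit_open_tags and remove_attribute are hypothetical, and the pattern 
assumes the targeted tag contains no ">" inside its attributes (so no 
embedded <?php ... ?> within the tag itself).

```ruby
# Hypothetical helper: rewrite only the opening tags named `tag_name`.
# Every byte outside a matched tag is copied through unchanged, so PHP
# blocks, odd whitespace and unclosed elements are never "corrected".
def edit_open_tags(source, tag_name)
  # Matches an opening tag such as <body ...>, including attributes that
  # wrap over several lines. Assumption: no ">" inside attribute values.
  pattern = /<#{Regexp.escape(tag_name)}\b[^>]*>/mi
  source.gsub(pattern) { |tag| yield tag }
end

# Hypothetical helper: drop one attribute (name="value", name='value' or
# name=value) from a single tag string, leaving the rest of the tag as is.
def remove_attribute(tag, attr_name)
  tag.sub(/\s+#{Regexp.escape(attr_name)}\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)/i, "")
end

# Small in-memory sample modelled on test.php from the post.
sample = "<?php include_once('whatever.php'); ?>\n" \
         "<html><title> anything</title>\n" \
         "\t<body topmargin=\"0\" bgcolor=\"#ffffff\" >\n" \
         "\t\t<br>\n"

edited = edit_open_tags(sample, "body") do |tag|
  remove_attribute(tag, "topmargin")
end

puts edited
```

For real files, reading with File.binread and writing with 
File.binwrite avoids encoding or newline conversion, so a diff should 
show only the tags that the block actually changed.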



#2336

From: Robert Klemme <shortcutter@googlemail.com>
Date: 2011-04-05 07:59 -0500
Message-ID: <BANLkTimtjZkdv8MPxHGXbNf913ofQtvXoA@mail.gmail.com>
In reply to: #2308

On Tue, Apr 5, 2011 at 10:56 AM, Markus Fischer <markus@fischer.name> wrote:

> Does anyone know of a library which can work with broken/malformed HTML/PHP
> and still produce output identical to the input?
>
> So far I've tried Nokogiri and Hpricot. They're absolutely amazing and excel
> at their purpose, but they fail to meet my requirement that, when saving the
> HTML, nothing I haven't changed through DOM manipulation should change in
> the output.
>
> The thing is that I have to work with such horribly broken HTML (or rather,
> PHP) documents that those libraries are über-tempted to correct them. This
> is troublesome for me, as I have to fix a few hundred, maybe up to thousands
> of, documents, and their versioned history should really only reflect the
> change I'm making and not what the library needs to change so it can work
> with them. I searched RubyGems but could not find any other libraries; did
> I miss some?

What about one initial rework to get proper (X)HTML, submit it to your
version control and then create those modifications that you need to
do?  That approach has served me quite well for example when enforcing
a particular source code formatting.

Cheers

robert

-- 
remember.guy do |as, often| as.you_can - without end
http://blog.rubybestpractices.com/



#2339

From: Markus Fischer <markus@fischer.name>
Date: 2011-04-05 08:27 -0500
Message-ID: <4D9B18A8.4050906@fischer.name>
In reply to: #2336

Hi Robert,

On 05.04.2011 14:59, Robert Klemme wrote:
> What about one initial rework to get proper (X)HTML, submit it to your
> version control and then create those modifications that you need to
> do?  That approach has served me quite well for example when enforcing
> a particular source code formatting.

I considered this approach too; unfortunately, it turns out to pollute 
the history too much, specifically the blame output. Nothing gets 
"broken", but when you blame/annotate, which we do, you get irrelevant 
noise in it, and that is what I'm really trying to avoid.

thanks,
- Markus
