
View Full Version : Resolved regex!



traq
06-06-2010, 01:54 AM
Okay: I want to, globally and entirely and permanently, convert a CMS from xhtml output to html 4.01. Therefore, I'm trying to use a regular expression (in notepad++) to find self-closing tags that have html equivalents.

I can find self-closing tags in general:
(<)(\w+)([^>]*\/>)
However, there are significant portions of script that generate actual xml (not xhtml) markup, which must remain unchanged (well-formed).

My approach so far has been to simply limit my expression to known, self-closing xhtml tags: <br/>, <hr/>, <img/>, <link/>, <meta/> (any others?). However, I can't figure out how to add this list of options into my regex.

Any help/ideas would be appreciated!
Thanks.


I'm no longer using notepad++ to solve this; I've moved to php (preg_replace). Skip to the end (http://www.dynamicdrive.com/forums/showpost.php?p=234550&postcount=23) to see where the problem is at now...

thanks

james438
06-06-2010, 03:29 AM
I generally work with PCRE regular expressions, which may work in your case. What I am not familiar with is "CMS from xhtml output to html 4.01".

Could you post some examples of what you want the script to match and what it would be converted to? In the meantime here is a link to a tutorial I wrote for PCRE. http://www.animeviews.com/article.php?ID=66 At the bottom of the page are several more links that may help. regexadvice.com is a great website with a forum that helps people out with simple and complex regular expressions of all kinds. I have also collected several of what I figure are some of the most requested regex patterns.

djr33
06-06-2010, 03:46 AM
I believe "CMS" is irrelevant-- it's just regular HTML that should be XHTML. He wants to add the self-closing tags so it's valid.

I'm not sure how to approach this logically.

To solve the issue of not double-closing tags you just need to add a 'not /' parameter. But I'm not sure how you'd know that there isn't a closing tag somewhere else. What I mean is that if you have <x>, that may be a self-closing tag, or there may be, at some unpredictable location, a </x> later in the code. Unless you actually parse the entire file to work out the hierarchy, it'll be really hard to do this.


The two possibilities I can think of:
1. Just look for a solution that already exists: surely someone must have tried to code this before. You just need a function called something like 'html2xhtml()', and that's gotta be out there somewhere..... I think.

2. Continue with your current method of specifying types of tags that must be self-closing. There's no other way I can think of doing this.
So just replace <selfclosingtag*> with <selfclosingtag* />.
Create an array, loop through each selfclosingtag (img, br, etc) and setup some basic regex (right?) to replace that.

traq
06-06-2010, 04:06 AM
thanks, guys.
james: What I mean is that I have a content management system that was designed to output xhtml: but I want it to output html 4.01.

I could manually go through all the files and replace all instances of <sometag /> with <sometag>, but there are thousands of such instances, in hundreds of files, and I'd have to go find them all first.

I'm reading your tutorial now.

djr: I've been looking for such stuff, but this is complicated by two issues:

1) I'm not using php. I'm using my text editor (notepad++) because it has the built-in ability to search and replace across entire directory structures. It's also easier to work with locally. I've considered turning to php for the solution (and I will, if I have to), but I'd rather not if I don't have to.

More importantly:
2) I'm not working with html. I'm working with the php files that will generate it. The complicating issue here is that those same php files also generate xml documents, and I don't want to screw up all of the <xmltags/> that would be screwed up if they were suddenly re-written as <xmltags>. Furthermore, I don't want to accidentally mess with some line of php that contains /> (offhand, I can't actually think of a situation where that might happen, but it seems to be a risk anyway. I tried the blanket search-replace of /> with > and it definitely screwed something up :) ).

So, examples for the both of you:

<?php echo "<img src=\"$imgsrc\" />"; ?> needs to be replaced with <?php echo "<img src=\"$imgsrc\">"; ?>,
whereas <?php $xmlstring = "<xml><emptytag/></xml>"; ?> must be left alone.



I've been trying variations on (<)\b(img|hr)\b(\w+)([^>]*\/>) or (<)img|hr(\w+)([^>]*\/>) or (<)(img|hr)(\w+)([^>]*\/>), which seems like it ought to be the right direction, but they either don't work at all or they match anything with i, m, g, |, h, or r in them... :(
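For anyone following along in PHP, here is a rough sketch of why those variations misbehave in PCRE. The sample strings are made up for illustration. The point is that "|" has the lowest precedence, so it has to be wrapped in a group, and the extra (\w+) insists on a word character directly after the tag name.

```php
<?php
// Hypothetical sample strings, not taken from the CMS.
$xhtml = '<img src="a.png" />';
$xml   = '<xml><emptytag/></xml>';

// Ungrouped: "|" splits the whole pattern into "(<)img" OR "hr(\w+)([^>]*\/>)".
$ungrouped = '/(<)img|hr(\w+)([^>]*\/>)/';
// Grouped, but (\w+) still demands a word character right after the tag name.
$extra_word = '/(<)(img|hr)(\w+)([^>]*\/>)/';
// Grouped alternation followed directly by the attribute run.
$grouped = '/(<)(img|hr|br|link|meta)([^>]*\/>)/';

preg_match($ungrouped, $xhtml, $m);
echo $m[0], "\n";                             // <img  (only the first branch matched)
echo preg_match($extra_word, $xhtml), "\n";   // 0
echo preg_match($grouped, $xhtml), "\n";      // 1
echo preg_match($grouped, $xml), "\n";        // 0
```

One caveat with the grouped form: a tag whose name merely starts with a listed name, such as <metadata/>, would also match, so the xml in the CMS should be checked for names like that.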

james438
06-06-2010, 04:12 AM
I almost wish I understood XHTML, but I have never really seen the use for it. In this case it would help me to write out a regex if needed, but like you said djr33 it looks like str_replace would take care of what he needs. Why would regex be needed in this case?

I am almost certain I have seen regex that converts HTML to XHTML. I'm gonna take a quick look. I'm pretty sure I know where I saw it.

james438
06-06-2010, 04:15 AM
<?php
$text="this is some <hi>ordinary<hi/> text. this is another <hi>ordinary<hi/> <br>ok<br/>text<meta>text<meta/>.";
$text=preg_replace('/(<)([^>]*br\/>|hr\/>|img\/>|link\/>|meta\/>)/',"XXXX",$text);
echo "$text";
?>
produces (shown as rendered, so the tags themselves are not visible)

this is some ordinary text. this is another ordinary
okXXXXtexttextXXXX.
dunno if it is what you are looking for though.

I need to finish reading your post traq. Sorry!

traq
06-06-2010, 04:31 AM
well... I need to finish my research.

I'm beginning to think that I may need to just move to php to do this. I've been looking on the notepad++ forums (http://sourceforge.net/search/?group_id=95717&type_of_search=forums&words=regex&search=Search), and it seems that notepad++ might not support more advanced regex constructs (if you can call one|other "advanced" :p ).

I figured out a regex that will work (in php): (<)(img|hr|br|link|meta)([^>]*\/>)
I just have to write something now that will recursively find *every* php and html file in the CMS directory structure.

...
i hate recursion...
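The directory walk does not have to be hand-written recursion. A minimal sketch, assuming PHP 5.3 or later and using the SPL iterators; the function name find_source_files() is made up for this example:

```php
<?php
// Collect every .php and .html file below $root.
// find_source_files() is a hypothetical helper name, not part of any CMS.
function find_source_files($root) {
    $found = array();
    // RecursiveIteratorIterator flattens the directory tree; by default it
    // yields only the leaves (files), never the directories themselves.
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $fileinfo) {
        if ($fileinfo->isFile() && preg_match('/\.(php|html)$/i', $fileinfo->getFilename())) {
            $found[] = $fileinfo->getPathname();
        }
    }
    sort($found);
    return $found;
}
```

Each returned path could then be read with file_get_contents(), run through preg_replace(), and written back. Working on a copy of the directory first seems wise.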

james438
06-06-2010, 04:43 AM
Glad you figured it out. I wish I could have been more help.

What does your complete regex line look like? I am curious what the replacement looks like :).

djr33
06-06-2010, 04:58 AM
One more thought: 1. replace <tag /> with <tag>. 2. Replace <tag> with <tag />.

For <html> it will skip (1) and make it <html /> in (2).
For <xhtml /> it will change it to <xhtml> in (1) and then back to <xhtml /> in (2).

Hope that helps.

traq
06-06-2010, 05:32 AM
actually, putting this into preg_replace has uncovered a new issue. It currently looks like this:

$newtag = preg_replace('/(<)(img|hr|br|link|meta)([^>]*)( ?\/>)/', '$1$2$3>', $tag);
I added the fourth group (slash, greater than, optional leading whitespace) so I could match both tags <like this/> and tags <like this /> and replace them with tags <like this> (no trailing whitespace).

However, the third group (zero or more of not-greater-than) also matches the whitespace, so the whitespace is preserved when I do my replacement anyway.

I think this could be solved with lookahead/lookbehind assertions? I haven't quite figured out how to use those, though. Is there a way to make sure that ([^>]*) does not end with whitespace? ([^> ]*) and ([^>^ ]*) don't seem to be what I want.

djr:
I think I see what you're getting at - but I think it would lose the distinction between xml and xhtml at some point - part of my problem is that I wanted to replace certain <tags/>, but not <others/>. (I want to do xhtml -> html, btw, not vice-versa.) It's getting late, though, I'm starting to have trouble thinking clearly.

I appreciate your help, guys! Keep throwing out ideas, and I'll pick this back up in the AM. Thanks!

james438
06-06-2010, 09:08 AM
Groups are generally known as captures or, more commonly, subpatterns.
Whitespace can be matched with \s.

to answer your problem, try:

$newtag = preg_replace('/(<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/>/', '$1$2$3>', $tag);
or

$newtag = preg_replace('/(<)(img|hr|br|link|meta)([^>]*\b)\/>| \/>/', '$1$2$3>', $tag);
\b matches the boundary between a word character and a non-word character, in this case a space.

traq
06-06-2010, 04:26 PM
That works great on <tags />, but for <tags/> the slash remains after the replacement. I think the slash is getting captured inside the ([^>]*\b) word boundary...?

How can I test this? And does \b count " / " as a word character?

james438
06-06-2010, 06:38 PM
a word character is A-Za-z0-9 and the underscore.

The / is not being captured by the \b. A boundary is the position between two characters; it does not consume anything itself. On my end it works fine. Can I see the text that is failing? Sometimes PCRE will fail when dealing with large amounts of data, in which case you may need to break it up.

The best way to test out your PCRE is to create a test page, but I suspect you are already doing that. I am not sure if I understand your question.

Gonna do a few things, so I'll be back online later on tonight.

djr33
06-06-2010, 07:43 PM
James, I think you have a typo that may be misleading: A_Za-z0-9 should be A-Za-z0-9.
It isn't that big of a problem, but if someone copies and pastes that, it'll give weird results. Aside from that, you're beyond my knowledge of regex.

traq
06-06-2010, 07:49 PM
script:

$tags = array(
'<img src="http://myimg.jpg" />',
'<hr />',
'<br />',
'<meta name="content-type" content="text/html" />',
'<input type="button" />',
'<link rel="stylesheet" href="my.css" />',
'<img src="http://myimg.jpg"/>',
'<hr/>',
'<br/>',
'<meta name="content-type" content="text/html"/>',
'<input type="button"/>',
'<link rel="stylesheet" href="my.css"/>');
function pre_var_dump($whatever){ echo '<pre>'; var_dump($whatever); echo '</pre>'; }
$c = count($tags);
$newtags = array();
for($x=0;$x<$c;$x++){
$newtags[$x][0] = 'Test tag: '.$tags[$x];
$newtags[$x][1] = 'regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : '.preg_replace('/(<)(img|hr|br|link|meta)([^>]*\b)\/>| \/>/', '$1$2$3>', $tags[$x]);
$newtags[$x][2] = 'regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : '.preg_replace('/(<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/>/', '$1$2$3>', $tags[$x]);
}
pre_var_dump($newtags);

outputs:

array(12) {
[0]=>
array(3) {
[0]=>
string(40) "Test tag: <img src="http://myimg.jpg" />"
[1]=>
string(78) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <img src="http://myimg.jpg">"
[2]=>
string(79) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <img src="http://myimg.jpg">"
}
[1]=>
array(3) {
[0]=>
string(16) "Test tag: <hr />"
[1]=>
string(54) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <hr>"
[2]=>
string(55) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <hr>"
}
[2]=>
array(3) {
[0]=>
string(16) "Test tag: <br />"
[1]=>
string(54) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <br>"
[2]=>
string(55) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <br>"
}
[3]=>
array(3) {
[0]=>
string(58) "Test tag: <meta name="content-type" content="text/html" />"
[1]=>
string(96) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <meta name="content-type" content="text/html">"
[2]=>
string(97) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <meta name="content-type" content="text/html">"
}
[4]=>
array(3) {
[0]=>
string(33) "Test tag: <input type="button" />"
[1]=>
string(71) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <input type="button">"
[2]=>
string(72) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <input type="button">"
}
[5]=>
array(3) {
[0]=>
string(49) "Test tag: <link rel="stylesheet" href="my.css" />"
[1]=>
string(87) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <link rel="stylesheet" href="my.css">"
[2]=>
string(88) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <link rel="stylesheet" href="my.css">"
}
[6]=>
array(3) {
[0]=>
string(39) "Test tag: <img src="http://myimg.jpg"/>"
[1]=>
string(79) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <img src="http://myimg.jpg"/>"
[2]=>
string(80) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <img src="http://myimg.jpg"/>"
}
[7]=>
array(3) {
[0]=>
string(15) "Test tag: <hr/>"
[1]=>
string(54) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <hr>"
[2]=>
string(55) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <hr>"
}
[8]=>
array(3) {
[0]=>
string(15) "Test tag: <br/>"
[1]=>
string(54) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <br>"
[2]=>
string(55) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <br>"
}
[9]=>
array(3) {
[0]=>
string(57) "Test tag: <meta name="content-type" content="text/html"/>"
[1]=>
string(97) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <meta name="content-type" content="text/html"/>"
[2]=>
string(98) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <meta name="content-type" content="text/html"/>"
}
[10]=>
array(3) {
[0]=>
string(32) "Test tag: <input type="button"/>"
[1]=>
string(72) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <input type="button"/>"
[2]=>
string(73) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <input type="button"/>"
}
[11]=>
array(3) {
[0]=>
string(48) "Test tag: <link rel="stylesheet" href="my.css"/>"
[1]=>
string(88) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>| \/> : <link rel="stylesheet" href="my.css"/>"
[2]=>
string(89) "regex (<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/> : <link rel="stylesheet" href="my.css"/>"
}
}

As you can see, preg_replace() catches the / (with either regex) when there is leading whitespace, but misses it when there is no leading whitespace.


Whoop! Just caught this: preg_replace() does catch the / when there's no leading whitespace, but not when there's no whitespace and it is preceded by a double-quote.

I bet that's a clue... :D
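A reduced test that seems to explain the clue. The input strings are shortened stand-ins for the tags above. \b only matches between a word character and a non-word character, and both the double-quote and the slash are non-word characters, so the first alternative cannot match when a quote comes directly before the slash.

```php
<?php
$pattern = '/(<)(img|hr|br|link|meta)([^>]*\b)\/>|\s\/>/';

// Word character before the slash: \b fits between "r" and "/".
echo preg_replace($pattern, '$1$2$3>', '<hr/>'), "\n";                 // <hr>

// Quote before the slash: no word boundary there, so nothing matches.
echo preg_replace($pattern, '$1$2$3>', '<img src="a.png"/>'), "\n";    // <img src="a.png"/>

// Whitespace before the slash: only the second alternative matches " />",
// and since groups 1-3 are empty the replacement is just ">".
echo preg_replace($pattern, '$1$2$3>', '<img src="a.png" />'), "\n";   // <img src="a.png">

// The second alternative is not tied to the tag list at all.
echo preg_replace($pattern, '$1$2$3>', '<xmltag />'), "\n";            // <xmltag>
```

The last line is why <input type="button" /> was converted in the dump above even though input is not in the list, and it would do the same to any xml tag written with a space before the slash.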

fileserverdirect
06-06-2010, 10:14 PM
Ummm, can't you just "Find & Replace"? ... Find <br /> Replace <br> and repeat for each tag. Change search mode to normal and that should be it. Also you could search and replace " />" but it may interfere with the php code... So...

traq
06-06-2010, 11:36 PM
Yes, that's also an option. It would take longer (more manual operations), but it would work.



another difficulty is the img, link and input tags. I can't replace them blindly because I need to preserve all the attributes="values" while getting rid of the ending /



Of course, I get stubborn when there's a solution that ought to work, but is escaping me somehow. Even if I use some other approach, I've got to figure this one out too. I know, a little obsessive. :)

http://imgs.xkcd.com/comics/nerd_sniping.png

james438
06-07-2010, 01:20 AM
Oops! Thanks djr33! If I am past you in regex it isn't by much and you are way past me in everything else ;)

I think I found the problem Traq. Try this.


$newtag = preg_replace('/<(img|hr|br|link|meta)([^>]*?)\s?\/>/', '<$1$2>', $tag);

notice the "?" I added. "*" means 0 or more. "?" is 0 or one and is used to mean optional, which is the way we have been using it. When placed after the "*" like "*?" it modifies the "*" quantifier to make it ungreedy. In this case the pattern will now match the least amount necessary as opposed to being greedy and matching the longest valid string that it can find.

I looked at your script Traq, but I am terrible with functions. They make my eyes glaze over, sorta like javascript does.
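A small side-by-side of the greedy and ungreedy quantifier, using a made-up img tag and a pattern trimmed down to one tag name. It only illustrates the "*?" point above.

```php
<?php
$tag = '<img src="a.png" />';

// Greedy: [^>]* grabs as much as it can, so it keeps the trailing space
// and \s? ends up matching nothing.
preg_match('/<(img)([^>]*)\s?\/>/', $tag, $greedy);

// Ungreedy: [^>]*? stops as early as it can, which leaves the space for \s?.
preg_match('/<(img)([^>]*?)\s?\/>/', $tag, $lazy);

var_dump($greedy[2]);   // string(13) " src="a.png" "
var_dump($lazy[2]);     // string(12) " src="a.png""

echo preg_replace('/<(img)([^>]*)\s?\/>/', '<$1$2>', $tag), "\n";    // <img src="a.png" >
echo preg_replace('/<(img)([^>]*?)\s?\/>/', '<$1$2>', $tag), "\n";   // <img src="a.png">
```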

traq
06-07-2010, 01:52 AM
I looked at your script Traq, but I am terrible with functions...
hey, as long as you're good with regex!

That one's a lot cleaner just to look at. I'm starting testing now. thanks! and if you ever need a function, let me know :)

traq
06-07-2010, 02:45 AM
Regex: <(img|hr|br|link|meta|input)([^>]*?)\s?\/>
Results:

array(12) {
[0]=>
array(2) {
[0]=>
string(40) "Test tag: <img src="http://myimg.jpg" />"
[1]=>
string(37) "result : <img src="http://myimg.jpg">"
}
[1]=>
array(2) {
[0]=>
string(16) "Test tag: <hr />"
[1]=>
string(13) "result : <hr>"
}
[2]=>
array(2) {
[0]=>
string(16) "Test tag: <br />"
[1]=>
string(13) "result : <br>"
}
[3]=>
array(2) {
[0]=>
string(58) "Test tag: <meta name="content-type" content="text/html" />"
[1]=>
string(55) "result : <meta name="content-type" content="text/html">"
}
[4]=>
array(2) {
[0]=>
string(33) "Test tag: <input type="button" />"
[1]=>
string(30) "result : <input type="button">"
}
[5]=>
array(2) {
[0]=>
string(49) "Test tag: <link rel="stylesheet" href="my.css" />"
[1]=>
string(46) "result : <link rel="stylesheet" href="my.css">"
}
[6]=>
array(2) {
[0]=>
string(39) "Test tag: <img src="http://myimg.jpg"/>"
[1]=>
string(37) "result : <img src="http://myimg.jpg">"
}
[7]=>
array(2) {
[0]=>
string(15) "Test tag: <hr/>"
[1]=>
string(13) "result : <hr>"
}
[8]=>
array(2) {
[0]=>
string(15) "Test tag: <br/>"
[1]=>
string(13) "result : <br>"
}
[9]=>
array(2) {
[0]=>
string(57) "Test tag: <meta name="content-type" content="text/html"/>"
[1]=>
string(55) "result : <meta name="content-type" content="text/html">"
}
[10]=>
array(2) {
[0]=>
string(32) "Test tag: <input type="button"/>"
[1]=>
string(30) "result : <input type="button">"
}
[11]=>
array(2) {
[0]=>
string(48) "Test tag: <link rel="stylesheet" href="my.css"/>"
[1]=>
string(46) "result : <link rel="stylesheet" href="my.css">"
}
}

thanks a million, James :)

traq
08-16-2010, 01:30 AM
well, it's been a while, but I've got another problem. :D

<(img|hr|br|link|meta)([^>]*?)\s?\/>

The regex works, but fails on lines like this:

<link rel="stylesheet" media="screen" type="text/css" href="<?php echo $this->getStyleSheet('main.css')?>" />

the <?php blahblah ?> ends the comparison prematurely. the ending /> is not replaced with >

just keeps getting more and more complicated, huh? I'm thinking that an optional ?> needs to be added...? anyone know how? thanks
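A cut-down reproduction, with the line shortened for readability. The class [^>] cannot step over any ">", so the match attempt dies at the "->" inside the embedded php, long before it reaches the closing slash.

```php
<?php
$pattern = '/<(img|hr|br|link|meta)([^>]*?)\s?\/>/';
$line = '<link rel="stylesheet" href="<?php echo $this->getStyleSheet(\'main.css\')?>" />';

// Everything [^>]*? is allowed to cross: the text up to the first ">".
$reachable = substr($line, 0, strpos($line, '>'));
echo $reachable, "\n";   // <link rel="stylesheet" href="<?php echo $this-

// No "/>" inside that stretch, so the line comes back untouched.
var_dump(preg_replace($pattern, '<$1$2>', $line) === $line);   // bool(true)
```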

james438
08-16-2010, 03:57 AM
I'm not sure, but it might help to see a little more code.

You might want to try debugging it a little yourself first by looking to see what the variable contains before it is processed by your PCRE by doing an echo/print of said variable if you have not already. The <?php contents ?> should not disrupt your PCRE any.

traq
08-16-2010, 04:50 AM
clarification:

I'm not processing the output; I'm running the regex on the php file contents as text. e.g., in
<?php

// Rewrite one file in place, turning self-closing xhtml tags into html tags.
function file_2html($file){

$xhtmlfile = file_get_contents($file);
$htmlfile = preg_replace('/<(img|hr|br|link|meta|input)([^>]*?)\s?\/>/', '<$1$2>', $xhtmlfile);
file_put_contents($file, $htmlfile);
}

?>
$xhtmlfile holds the text of the php file (a literal string). So, no parsing going on in my example above. What you see is what preg_replace() sees.

(
I'm running this on my local machine, it's not going to go on to the production website.

To recap -in case someone doesn't remember what's going on- I've got a CMS that outputs XHTML, and I want it to output html. Aside from the doctype, that basically means no self-closing tags. Requirements:

1. needs to replace all valid, self-closing xhtml tags (e.g., <img />) with valid, non-self closing html tags (<img>).
2. needs to leave self-closing xml (non-xhtml, e.g., <xmldoc><xmltag /></xmldoc>) alone.
3. needs to catch possible whitespace (e.g., both <tag /> and <tag/> are converted to <tag>).
4. as I have now discovered, needs to ignore <?php?> tags (I can't blindly replace /> fragments because it might break xml -see #2 above).
5. needs to not take too long in processing over 3,000 files :D
)

james438
08-16-2010, 06:00 PM
Yes, PCRE can do this. This is one of those areas of PCRE that people familiar with regex, e.g. those at regexadvice.com (http://regexadvice.com/), find rather easy, but that I have had a more difficult time understanding. It involves a negative lookahead of a pattern within a pattern.

Your previous post did clear things up though.

traq
08-16-2010, 07:39 PM
thanks for the link!

traq
08-18-2010, 04:18 AM
Hi everyone, Susan (on regexadvice.com) figured this out for me:
<(img|hr|br|link|meta|input)((\s+\w+(\s*=\s*("[^"]*"|\'[^\']*\'|\w+))?)*)\s*\/>
worked perfectly! [ link ] (http://regexadvice.com/forums/thread/70877.aspx)

Thanks again, james, for pointing me there. great place.
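For reference, a quick check of that pattern against the cases from this thread. The replacement string '<$1$2>' is an assumption carried over from the earlier preg_replace() calls, and to_html() is just a made-up wrapper name.

```php
<?php
$pattern = '/<(img|hr|br|link|meta|input)((\s+\w+(\s*=\s*("[^"]*"|\'[^\']*\'|\w+))?)*)\s*\/>/';

// to_html() is a hypothetical helper; '<$1$2>' is assumed from the earlier posts.
function to_html($text, $pattern) {
    return preg_replace($pattern, '<$1$2>', $text);
}

// Quoted attribute values are matched as whole units, so the ">" characters
// inside the embedded php no longer stop the match.
$link = '<link rel="stylesheet" media="screen" type="text/css" href="<?php echo $this->getStyleSheet(\'main.css\')?>" />';
echo to_html($link, $pattern), "\n";

echo to_html('<br/> and <hr />', $pattern), "\n";         // <br> and <hr>
echo to_html('<xml><emptytag/></xml>', $pattern), "\n";   // unchanged
echo to_html('<metadata/>', $pattern), "\n";              // unchanged
```

One thing to watch for: attribute names are matched with \w+, so a tag with a hyphenated or namespaced attribute (for example http-equiv on a meta tag) will not match and will keep its slash.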

james438
08-18-2010, 07:34 AM
Yep, it is a great site and, as far as I can tell, the only one of its kind. Susan is awesome at answering posts with detailed answers. There are a few others there that have been quite helpful as well.

Thank you for posting the solution as well as a link to the thread you created so that I can study what she wrote further :) although honestly I do not research PCRE nearly as much as I used to.

P.S. I use the term PCRE to distinguish it from the other forms of regex out there like Perl or MySQL. I'm betting the javascript version of regex is different as well.

traq
08-18-2010, 02:46 PM
It was a very detailed answer (incidentally, she didn't think I should be using regex at all, but a DOM library - I'm not sure that would be very efficient in this situation), but easy to follow.