Resolved PCRE match hyphens between HTML tags, but not javascript or CSS



james438
02-09-2010, 08:15 PM
I am trying to create a PCRE that will replace all of the hyphens in a document unless they are inside JavaScript or CSS, or inside a URL or the div and span tags themselves.

I have tried many different things and have discovered so many "anomalies" in PCRE that I don't really know where to begin, but many of my questions would probably be better put into separate threads.

Here is the test script I am using:


<?php
$text="-te-st-te-st-";
$text=preg_replace('/(.*?)(st.*?te)(.*?)/es',"str_replace('-','&ndash;','$1$3')",$text);
echo "$text";
?>
In the above I am attempting to change all of the hyphens to HTML en dashes (&ndash;) unless they are located between "st" and "te". The string might also be in the form of "-te-yy-st-", where all of the hyphens would be replaced, or "-pstp --- kktep--p-", where the three hyphens in a row would not be replaced because they are between "st" and "te".
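
One possible way to approach the test case above, offered as a sketch rather than a definitive answer: instead of capturing everything around the protected span in a single pattern, split the string on the "st...te" spans with preg_split() and PREG_SPLIT_DELIM_CAPTURE, then replace hyphens only in the pieces that fall outside them. The function name replace_outside_st_te is made up for this illustration, and the type declarations assume PHP 7 or later.

```php
<?php
// Hypothetical helper: replace hyphens everywhere except between "st" and "te".
function replace_outside_st_te(string $text): string
{
    // With one capture group and PREG_SPLIT_DELIM_CAPTURE the pieces alternate:
    // even index = ordinary text, odd index = a protected "st...te" span.
    $parts = preg_split('/(st.*?te)/s', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = str_replace('-', '&ndash;', $part);
        }
    }
    return implode('', $parts);
}

echo replace_outside_st_te('-te-st-te-st-'), "\n";
```

With the three strings from this post, "-te-st-te-st-" keeps only the hyphen inside "st-te", "-te-yy-st-" has every hyphen replaced (there is no "te" after the "st"), and "-pstp --- kktep--p-" keeps the three hyphens in a row.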


Here is some actual code/text I would like to test against:




<style type="text/css">
.q
{
font-family:courier;
border-style: solid;
border-width: 3px;
border-color: #525252;
padding-left:25px;
padding-right:25px;
PADDING-TOP:20px;
PADDING-BOTTOM:20px;
color:#ffffff;
background:#434343;
margin: 12px 80px 12px 40px;
}
</style>
<form name="count">
<input type="text" size="69" name="count2">
</form>


<script>

/*
Count down until any date script-
By JavaScript Kit (www.javascriptkit.com)
Over 200+ free scripts here!
*/


//change the text below to reflect your own,
var before="Christmas!"
var current="Today is Christmas. Merry Christmas!"
var montharray=new Array("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")

function countdown(yr,m,d){
theyear=yr;themonth=m;theday=d
var today=new Date()
var todayy=today.getYear()
if (todayy < 1000)
todayy+=1900
var todaym=today.getMonth()
var todayd=today.getDate()
var todayh=today.getHours()
var todaymin=today.getMinutes()
var todaysec=today.getSeconds()
var todaystring=montharray[todaym]+" "+todayd+", "+todayy+" "+todayh+":"+todaymin+":"+todaysec
futurestring=montharray[m-1]+" "+d+", "+yr
dd=Date.parse(futurestring)-Date.parse(todaystring)
dday=Math.floor(dd/(60*60*1000*24)*1)
dhour=Math.floor((dd%(60*60*1000*24))/(60*60*1000)*1)
dmin=Math.floor(((dd%(60*60*1000*24))%(60*60*1000))/(60*1000)*1)
dsec=Math.floor((((dd%(60*60*1000*24))%(60*60*1000))%(60*1000))/1000*1)
if(dday==0&&dhour==0&&dmin==0&&dsec==1){
document.forms.count.count2.value=current
return
}
else
document.forms.count.count2.value="Only "+dday+ " days, "+dhour+" hours, "+dmin+" minutes, and "+dsec+" seconds left until "+before
setTimeout("countdown(theyear,themonth,theday)",1000)
}
//enter the count down date using the format year/month/day
countdown(2002,12,25)
</script>
<p align="center"><font face="arial" size="-2">This free script provided by</font><br>
<font face="arial, helvetica" size="-2"><a href="http://javascriptkit.com">JavaScript
Kit</a></font></p>
<div class='q'><a href="http://www.marvel.com/universe/X-23">X-23</a> is a new character in the <span style="background-color:tan;">Marvel universe</span></div>


Edit: I reformatted this post to give better examples and to clarify what I am looking for.

james438
02-10-2010, 12:40 AM
I talked with a lady at regexadvice (http://regexadvice.com/forums/ShowThread.aspx?PostID=59528) and she helped to correct one or two of my misunderstandings about PCRE and also set me thinking in new directions. Here is what I have so far:


$text=preg_replace("/-(?!((?!\/script|\style).)*\/script>|\/style)/is", "&ndash;", $text);
$text=preg_replace('/&ndash;(?=(.*?>))/',"-",$text);

james438
02-15-2010, 01:29 AM
Here is what I have thus far.


$text=preg_replace("/-(?!((?!<\/?(script|style|a)\b).)*(<\/(script|style|a>)))/is","&ndash;",$text);

It will replace all of the hyphens in a string while avoiding those within a style block, a script block, or a hyperlink. Inline styling such as <span style=font-size:24px;>x-x</span> will produce <span style=font&ndash;size:24px;>x&ndash;x</span>, which is a bit of a problem, but is easily fixed by using a class instead of inline styling. I suspect that the expression would also replace the hyphen in img src="/x-x.jpg".

Pretty much all of the credit for this script goes to the kind people at regexadvice.com as I had very little to do with this PCRE.

If I do come up with something better I will post here in the forums, but for now I am rather satisfied and want to take some time off from this script.

Update: Here is the complete expression:

$text=preg_replace("/-(?!((?!<\/?(script|style|a)\b).)*(<\/(script|style|a>)))/is","&ndash;",$text);
$text=preg_replace('/(<a.+>)(.+)(<\/a>)/Ue',"'$1'.str_replace('-','&ndash;','$2').'$3'",$text);
$text=preg_replace('/(style=(\"|\'))(.+)(\'|\")/Ue',"'$1'.str_replace('&ndash;','-','$3').'$4'",$text);
It will replace all of the hyphens in a string unless they are located within JavaScript or a style block. It also ignores hyphens located inside the anchor tag itself (such as the href URL) and in inline styling.

It will replace the hyphens in img src, as in <img src="/image/x-x.jpg">. This is not a good thing, but I have not as yet had a need to update the code to accommodate this.

This thread helped: http://www.webmasterworld.com/php/3698684.htm
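
The img src problem mentioned above could perhaps be sidestepped by never looking inside tags at all. The following is only a sketch under that assumption, not the method used in this thread: it splits the document into protected pieces (whole script and style blocks, and every individual tag with its attributes) and plain text, and replaces hyphens in the plain text only. The function name dash_text_nodes is hypothetical, and the pattern is naive: a stray "<" in ordinary text, or a ">" inside an attribute value, would confuse it.

```php
<?php
// Hypothetical helper: replace hyphens only in text that sits between tags.
function dash_text_nodes(string $html): string
{
    // Protected spans: <script>...</script>, <style>...</style>, or any single tag.
    $pattern = '~(<script\b.*?</script>|<style\b.*?</style>|<[^>]*>)~is';
    // Even index = text between tags, odd index = protected span.
    $parts = preg_split($pattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = str_replace('-', '&ndash;', $part);
        }
    }
    return implode('', $parts);
}

echo dash_text_nodes('<img src="/image/x-x.jpg"> a-b <a href="http://www.marvel.com/universe/X-23">X-23</a>'), "\n";
```

Because attributes are never touched, inline style, href and src values keep their hyphens without a second clean-up pass, while the link text X-23 is still converted.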

james438
02-15-2010, 12:50 PM
Updated the script a little more.


$text=preg_replace("/-(?!((?!<\/?(script|style|a|object|iframe)\b).)*(<\/(script|style|a>|object|iframe)))/is","&ndash;",$text);
$text=preg_replace('/(<a.+>)(.+)(<\/a>)/Ue',"'$1'.str_replace('-','&ndash;','$2').'$3'",$text);
$text=preg_replace('/(style=(\"|\'))(.+)(\'|\")/Ue',"'$1'.str_replace('&ndash;','-','$3').'$4'",$text);
$text=preg_replace('/(<button\sonclick=(\"|\'))(.+)(\'|\")/Ue',"'$1'.str_replace('&ndash;','-','$3').'$4'",$text);

Line 1 will replace hyphens with dashes, but ignore those found within object, iframe, style, script, and anchor tags.
Line 2 will replace hyphens with dashes found between the anchor tags.
Line 3 will revert the &ndash; found in inline styling to hyphens.
Line 4 is just extra, but reverts the &ndash; found within the button tags to hyphens.
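
Lines 2 to 4 rely on the /e modifier, which was deprecated in PHP 5.5 and removed in PHP 7. For anyone running this on a newer PHP, here is a sketch of how Line 3 might be written with preg_replace_callback() instead. It keeps the same pattern and group numbering; the function name revert_dashes_in_style is made up for this example.

```php
<?php
// Hypothetical equivalent of Line 3 without the /e modifier.
function revert_dashes_in_style(string $text): string
{
    $result = preg_replace_callback(
        '/(style=("|\'))(.+)(\'|")/U',
        function (array $m): string {
            // $m[1] = style=" (or style='), $m[3] = attribute value, $m[4] = closing quote
            return $m[1] . str_replace('&ndash;', '-', $m[3]) . $m[4];
        },
        $text
    );
    // preg_replace_callback() returns null on a PCRE error; fall back to the input.
    return $result ?? $text;
}

echo revert_dashes_in_style('<span style="font&ndash;size:24px;">x&ndash;x</span>'), "\n";
```

A callback also avoids the quote escaping that /e applies to backreferences before evaluating the replacement. Lines 2 and 4 could be converted the same way.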

EDIT: I notice that the code does not work with strings that are approximately 11633 characters in length or longer. The exact threshold is actually slightly less than this.
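
A guess at the cause, not something confirmed in this thread: the length limit may be PCRE running into its backtracking or recursion limit, since Line 1 runs a lookahead over the rest of the document for every hyphen. When that happens preg_replace() returns null, which prints as an empty string. The sketch below wraps the call and reports preg_last_error() so the failure is at least visible; the function name replace_or_fail is hypothetical.

```php
<?php
// Hypothetical wrapper: run preg_replace() and surface PCRE failures
// instead of silently returning null.
function replace_or_fail(string $pattern, string $replacement, string $text): string
{
    $result = preg_replace($pattern, $replacement, $text);
    if ($result === null) {
        $code = preg_last_error();
        $names = [
            PREG_INTERNAL_ERROR        => 'internal error',
            PREG_BACKTRACK_LIMIT_ERROR => 'backtrack limit exhausted',
            PREG_RECURSION_LIMIT_ERROR => 'recursion limit exhausted',
            PREG_BAD_UTF8_ERROR        => 'malformed UTF-8',
        ];
        throw new RuntimeException('preg_replace failed: ' . ($names[$code] ?? "code $code"));
    }
    return $result;
}

echo replace_or_fail('/-/', '&ndash;', 'x-x'), "\n";
```

If the reported error is one of the limit errors, raising pcre.backtrack_limit or pcre.recursion_limit with ini_set() may help, although splitting the document into smaller pieces first is likely the more robust fix.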