
Thread: PCRE match hyphens between HTML tags, but not javascript or CSS

  1. #1
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,211
    Thanks
    96
    Thanked 101 Times in 99 Posts


    I am trying to create a PCRE that will replace all of the hyphens in a document unless they are inside JavaScript or CSS, or inside a URL or the div and span tags themselves.

    I have tried many different things and have discovered so many "anomalies" in PCRE that I don't really know where to begin, but many of my questions would probably be better put into separate threads.

    Here is the test script I am using:

    Code:
    <?php
    $text="-te-st-te-st-";
    $text=preg_replace('/(.*?)(st.*?te)(.*?)/es',"str_replace('-','&ndash;','$1$3')",$text);
    echo "$text";
    ?>
    In the above I am attempting to change all of the hyphens to HTML en dashes (&ndash;) unless they are located between "st" and "te". The string might also be in the form of "-te-yy-st-", where all of the hyphens would be replaced, or "-pstp --- kktep--p-", where the three hyphens in a row would not be replaced because they are between "st" and "te".
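
    For the simplified "st"/"te" test, one possible approach is to split the string on the protected spans and only touch the pieces in between. This is a minimal sketch, not the method the thread ends up using; the function name dash_outside_st_te is made up for illustration.

```php
<?php
// Hypothetical helper (not from the original script): replace hyphens with
// &ndash; everywhere except inside the first-to-next "st ... te" spans.
function dash_outside_st_te(string $text): string
{
    // With one capture group and PREG_SPLIT_DELIM_CAPTURE, the result
    // alternates: even indexes are ordinary text, odd indexes are the
    // captured "st...te" spans that must stay untouched.
    $parts = preg_split('/(st.*?te)/s', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $text; // leave the input alone if the split fails
    }
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = str_replace('-', '&ndash;', $part);
        }
    }
    return implode('', $parts);
}

echo dash_outside_st_te('-pstp --- kktep--p-'), "\n";
// prints: &ndash;pstp --- kktep&ndash;&ndash;p&ndash;
```

    With "-te-yy-st-" there is no "st" followed later by "te", so nothing is protected and every hyphen is replaced.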


    Here is some actual code/text I would like to test against:



    Code:
    <style type="text/css">
    .q
    {
    font-family:courier;
    border-style: solid;
    border-width: 3px;
    border-color: #525252;
    padding-left:25px; 
    padding-right:25px;
    PADDING-TOP:20px;
    PADDING-BOTTOM:20px;
    color:#ffffff;
    background:#434343;
    margin: 12px 80px 12px 40px;
    }
    </style>
    <form name="count">
    <input type="text" size="69" name="count2">
    </form>
    
    
    <script>
    
    /*
    Count down until any date script-
    By JavaScript Kit (www.javascriptkit.com)
    Over 200+ free scripts here!
    */
    
    
    //change the text below to reflect your own,
    var before="Christmas!"
    var current="Today is Christmas. Merry Christmas!"
    var montharray=new Array("Jan","Feb","Mar","Apr","May","Jun","Jul","Aug","Sep","Oct","Nov","Dec")
    
    function countdown(yr,m,d){
    theyear=yr;themonth=m;theday=d
    var today=new Date()
    var todayy=today.getYear()
    if (todayy < 1000)
    todayy+=1900
    var todaym=today.getMonth()
    var todayd=today.getDate()
    var todayh=today.getHours()
    var todaymin=today.getMinutes()
    var todaysec=today.getSeconds()
    var todaystring=montharray[todaym]+" "+todayd+", "+todayy+" "+todayh+":"+todaymin+":"+todaysec
    futurestring=montharray[m-1]+" "+d+", "+yr
    dd=Date.parse(futurestring)-Date.parse(todaystring)
    dday=Math.floor(dd/(60*60*1000*24)*1)
    dhour=Math.floor((dd%(60*60*1000*24))/(60*60*1000)*1)
    dmin=Math.floor(((dd%(60*60*1000*24))%(60*60*1000))/(60*1000)*1)
    dsec=Math.floor((((dd%(60*60*1000*24))%(60*60*1000))%(60*1000))/1000*1)
    if(dday==0&&dhour==0&&dmin==0&&dsec==1){
    document.forms.count.count2.value=current
    return
    }
    else
    document.forms.count.count2.value="Only "+dday+ " days, "+dhour+" hours, "+dmin+" minutes, and "+dsec+" seconds left until "+before
    setTimeout("countdown(theyear,themonth,theday)",1000)
    }
    //enter the count down date using the format year/month/day
    countdown(2002,12,25)
    </script>
    <p align="center"><font face="arial" size="-2">This free script provided by</font><br>
    <font face="arial, helvetica" size="-2"><a href="http://javascriptkit.com">JavaScript
    Kit</a></font></p>
    <div class='q'><a href="http://www.marvel.com/universe/X-23">X-23</a> is a new character in the <span style="background-color:tan;">Marvel universe</span></div>

    Edit: I reformatted this post to give better examples and to clarify what I am looking for.
    Last edited by james438; 02-15-2010 at 06:24 AM. Reason: resolved and renamed thread title
    To choose the lesser of two evils is still to choose evil. My personal site

  2. #2

    I talked with a lady at regexadvice and she helped to correct one or two of my misunderstandings about PCRE and also set me thinking in new directions. Here is what I have so far:

    PHP Code:
    $text=preg_replace("/-(?!((?!\/script|\/style).)*\/script>|\/style)/is","&ndash;",$text);
    $text=preg_replace('/&ndash;(?=(.*?>))/',"-",$text);
    Last edited by james438; 02-10-2010 at 12:47 AM.

  3. #3

    Here is what I have thus far.

    PHP Code:
    $text=preg_replace("/-(?!((?!<\/?(script|style|a)\b).)*(<\/(script|style|a>)))/is","&ndash;",$text); 
    It replaces all of the hyphens in a string while avoiding those within a style block, a script block, or a hyperlink. Inline styling such as <span style=font-size:24px;>x-x</span> will produce <span style=font&ndash;size:24px;>x&ndash;x</span>, which is a bit of a problem, but it is easily avoided by using a class instead of inline styling. I suspect that the expression would also match the hyphen in img src="/x-x.jpg".
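
    To see the behaviour described here, the expression can be run against a few short strings. This is only a quick check; the sample strings and the wrapper function dash_first_pass are made up for illustration.

```php
<?php
// Hypothetical wrapper around the expression quoted above.
function dash_first_pass(string $text): ?string
{
    // A hyphen is replaced unless, scanning forward, a closing
    // </script, </style or </a> is reached before any other opening
    // or closing script/style/a tag.
    $pattern = '/-(?!((?!<\/?(script|style|a)\b).)*(<\/(script|style|a>)))/is';
    return preg_replace($pattern, '&ndash;', $text);
}

$samples = array(
    'a-b',                                    // plain text: replaced
    '<style>.a-b{}</style> c-d',              // CSS kept, text replaced
    '<a href="x-y">X-23</a>',                 // whole link left alone
    '<span style=font-size:24px;>x-x</span>', // inline style is damaged
    '<img src="/x-x.jpg">',                   // src is damaged as well
);

foreach ($samples as $sample) {
    echo dash_first_pass($sample), "\n";
}
```

    The last two samples show the weak spots: the lookahead only knows about script, style, and a, so hyphens inside any other tag's attributes are still converted.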

    Pretty much all of the credit for this script goes to the kind people at regexadvice.com as I had very little to do with this PCRE.

    If I do come up with something better I will post it here in the forums, but for now I am rather satisfied and want to take some time off from this script.

    Update: Here is the complete expression:
    Code:
    $text=preg_replace("/-(?!((?!<\/?(script|style|a)\b).)*(<\/(script|style|a>)))/is","&ndash;",$text);
    $text=preg_replace('/(<a.+>)(.+)(<\/a>)/Ue',"'$1'.str_replace('-','&ndash;','$2').'$3'",$text);
    $text=preg_replace('/(style=(\"|\'))(.+)(\'|\")/Ue',"'$1'.str_replace('&ndash;','-','$3').'$4'",$text);
    It replaces all of the hyphens in a string unless they are located within a script or style block. It also ignores hyphens located inside the anchor tags themselves and in inline styles.

    It will replace the hyphens in img src, as in <img src="/image/x-x.jpg">. This is not a good thing, but I have not as yet had a need to update the code to accommodate this.
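
    Note for anyone running this on a newer PHP: the /e modifier used in the second and third lines was deprecated in PHP 5.5 and removed in PHP 7.0. A rough equivalent of those two lines using preg_replace_callback() might look like the sketch below. The function name fix_links_and_inline_styles and the example.com URL are made up for illustration, and this has only been checked against short strings like the ones shown.

```php
<?php
// Hypothetical port of the two /e lines to preg_replace_callback().
function fix_links_and_inline_styles(string $text): string
{
    // Second line: turn hyphens in the visible link text into &ndash;.
    $text = preg_replace_callback(
        '/(<a.+>)(.+)(<\/a>)/U',
        function (array $m): string {
            return $m[1] . str_replace('-', '&ndash;', $m[2]) . $m[3];
        },
        $text
    );

    // Third line: turn &ndash; inside a quoted style attribute back
    // into a plain hyphen.
    $text = preg_replace_callback(
        '/(style=("|\'))(.+)(\'|")/U',
        function (array $m): string {
            return $m[1] . str_replace('&ndash;', '-', $m[3]) . $m[4];
        },
        $text
    );

    return $text;
}

echo fix_links_and_inline_styles('<a href="http://example.com/X-23">X-23</a>'), "\n";
// prints: <a href="http://example.com/X-23">X&ndash;23</a>
```

    The callback receives the matches as an array, so the matched text is never evaluated as PHP code.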

    This thread helped: http://www.webmasterworld.com/php/3698684.htm
    Last edited by james438; 02-15-2010 at 06:26 AM.

  4. #4

    Updated the script a little more.

    PHP Code:
    $text=preg_replace("/-(?!((?!<\/?(script|style|a|object|iframe)\b).)*(<\/(script|style|a>|object|iframe)))/is","&ndash;",$text);
    $text=preg_replace('/(<a.+>)(.+)(<\/a>)/Ue',"'$1'.str_replace('-','&ndash;','$2').'$3'",$text);
    $text=preg_replace('/(style=(\"|\'))(.+)(\'|\")/Ue',"'$1'.str_replace('&ndash;','-','$3').'$4'",$text);
    $text=preg_replace('/(<button\sonclick=(\"|\'))(.+)(\'|\")/Ue',"'$1'.str_replace('&ndash;','-','$3').'$4'",$text); 
    Line 1 replaces hyphens with dashes, but ignores those found within object, iframe, style, script, and anchor tags.
    Line 2 replaces hyphens with dashes in the text between the anchor tags.
    Line 3 reverts the &ndash; found in inline styling to hyphens.
    Line 4 is just extra, but reverts the &ndash; found within the onclick attribute of button tags to hyphens.
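
    A different technique from the four lines above, and possibly simpler, is to split the document into tags and text and only replace hyphens in the text pieces. That way attributes such as href, src, style, and onclick are never touched and do not have to be reverted afterwards. The sketch below is an assumption about how that could look, not the script from this thread. The function name dash_text_only is made up, and the pattern will misjudge tags whose attribute values contain a ">" character.

```php
<?php
// Hypothetical alternative: protect whole script/style/object/iframe
// blocks and every individual tag, then convert hyphens in what is left.
function dash_text_only(string $html): string
{
    $pattern = '~(<script\b.*?</script>|<style\b.*?</style>'
             . '|<object\b.*?</object>|<iframe\b.*?</iframe>'
             . '|<[^>]*>)~is';

    // One capture group plus PREG_SPLIT_DELIM_CAPTURE gives alternating
    // pieces: even indexes are text, odd indexes are protected markup.
    $parts = preg_split($pattern, $html, -1, PREG_SPLIT_DELIM_CAPTURE);
    if ($parts === false) {
        return $html; // leave the input alone if the split fails
    }
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            $parts[$i] = str_replace('-', '&ndash;', $part);
        }
    }
    return implode('', $parts);
}

echo dash_text_only('<style>.a-b{}</style>c-d<img src="/x-x.jpg">'), "\n";
// prints: <style>.a-b{}</style>c&ndash;d<img src="/x-x.jpg">
```

    Link text still gets its dashes, because the text between <a ...> and </a> is an ordinary text piece while the tags around it are protected.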

    EDIT: I notice that the code does not work with strings of roughly 11,633 characters or longer. The exact threshold is slightly lower than this.
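
    The thread does not establish why long strings fail. One guess, and it is only an assumption, is that the lookahead-per-character construct runs into PCRE's internal limits (the pcre.backtrack_limit and pcre.recursion_limit ini settings). When a limit is hit, preg_replace() returns NULL instead of a string, and preg_last_error() reports the reason. A small wrapper like the following would at least make such a failure visible. The name replace_checked is made up for illustration.

```php
<?php
// Hypothetical wrapper: fail loudly instead of silently losing the text
// when preg_replace() returns NULL.
function replace_checked(string $pattern, string $replacement, string $subject): string
{
    $result = preg_replace($pattern, $replacement, $subject);
    if ($result === null) {
        // preg_last_error() returns one of the PREG_*_ERROR constants,
        // for example PREG_BACKTRACK_LIMIT_ERROR.
        throw new RuntimeException(
            'preg_replace failed with error code ' . preg_last_error()
        );
    }
    return $result;
}

echo replace_checked('/-/', '&ndash;', 'a-b'), "\n";
// prints: a&ndash;b
```

    If the reported code turns out to be PREG_BACKTRACK_LIMIT_ERROR or PREG_RECURSION_LIMIT_ERROR, that would support the guess above. Any other code points to a different cause.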
    Last edited by james438; 02-16-2010 at 07:13 AM.
