
Thread: regex!

  1. #1
    Join Date
    Apr 2008
    Location
    So.Cal
    Posts
    3,643
    Thanks
    63
    Thanked 516 Times in 502 Posts
    Blog Entries
    5

    Default regex!

    Okay: I want to, globally and entirely and permanently, convert a CMS from xhtml output to html 4.01. Therefore, I'm trying to use a regular expression (in notepad++) to find self-closing tags that have html equivalents.

    I can find self-closing tags in general:
    Code:
    (<)(\w+)([^>]*\/>)
    however, there are significant portions of script that generate actual xml (not xhtml) markup, which must remain unchanged (well-formed).

    My approach so far has been to simply limit my expression to known, self-closing xhtml tags: <br/>, <hr/>, <img/>, <link/>, <meta/> (any others?). However, I can't figure out how to add this list of options into my regex.

    Any help/ideas would be appreciated!
    Thanks.

    Edit:
    I'm no longer using notepad++ to solve this; I've moved to php (preg_replace). Skip to the end to see where the problem is at now...

    thanks
    Last edited by djr33; 08-18-2010 at 04:57 PM. Reason: user request

  2. #2
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,385
    Thanks
    100
    Thanked 113 Times in 111 Posts

    Default

    I generally work with PCRE regular expressions, which may work in your case. What I am not familiar with is
    CMS from xhtml output to html 4.01
    Could you post some examples of what you want the script to match and what it would be converted to? In the meantime here is a link to a tutorial I wrote for PCRE. http://www.animeviews.com/article.php?ID=66 At the bottom of the page are several more links that may help. regexadvice.com is a great website with a forum that helps people out with simple and complex regular expressions of all kinds. I have also collected several of what I figure are some of the most requested regex patterns.
    To choose the lesser of two evils is still to choose evil. My personal site

  3. #3
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    I believe "CMS" is irrelevant-- it's just regular HTML that should be XHTML. He wants to add the self-closing tags so it's valid.

    I'm not sure how to approach this logically.

    To solve the issue of not double-closing tags you just need to add a 'not /' parameter. But I'm not sure how you'd know that there isn't a closing tag somewhere else. What I mean is that if you have <x>, that may be a self-closing tag, or there may be a </x> at some unpredictable location later in the code. Unless you actually parse the entire file to work out the hierarchy, it'll be really hard to do this.


    The two possibilities I can think of:
    1. Just look for a solution that already exists: surely someone must have tried to code this before. You just need a function called something like 'html2xhtml()', and that's gotta be out there somewhere..... I think.

    2. Continue with your current method of specifying types of tags that must be self-closing. There's no other way I can think of doing this.
    So just replace <selfclosingtag*> with <selfclosingtag* />.
    Create an array, loop through each selfclosingtag (img, br, etc) and setup some basic regex (right?) to replace that.
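
    The array-and-loop idea in option 2 can be sketched as below. Note that this goes in the HTML -> XHTML direction described in this post, which is the reverse of what the original poster is after. The function name addSelfClosing and the sample markup are made up for illustration, and the negative lookbehind (?<!\/) is one possible way to express the 'not /' check, not something taken from the thread.

```php
<?php
// Hypothetical helper: append " /" to the listed tags unless they
// already end in "/>". Tag list taken from the first post.
function addSelfClosing(string $html, array $tags): string
{
    foreach ($tags as $tag) {
        // \b stops "br" from matching "<broken>".
        // (?<!\/) rejects a ">" that is directly preceded by "/",
        // so tags that are already self-closed are skipped.
        $pattern = '/<' . preg_quote($tag, '/') . '\b([^>]*)(?<!\/)>/';
        $html = preg_replace($pattern, '<' . $tag . '$1 />', $html);
    }
    return $html;
}

$voidTags = ['img', 'br', 'hr', 'link', 'meta'];
$sample   = '<p>line<br>next<img src="a.png"><br /></p>';

echo addSelfClosing($sample, $voidTags), PHP_EOL;
```

    Because each tag name is handled separately, other elements such as <p> are never touched, which is the main appeal of listing the tags explicitly.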
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  4. #4
    Join Date
    Apr 2008
    Location
    So.Cal
    Posts
    3,643
    Thanks
    63
    Thanked 516 Times in 502 Posts
    Blog Entries
    5

    Default

    thanks, guys.
    james: What I mean is that I have a content management system that was designed to output xhtml: but I want it to output html 4.01.

    I could manually go through all the files and replace all instances of <sometag /> with <sometag>, but there are thousands of such instances, in hundreds of files, and I'd have to go find them all first.

    I'm reading your tutorial now.

    djr: I've been looking for such stuff, but this is complicated by two issues:

    1) I'm not using php. I'm using my text editor (notepad++) because it has the built-in ability to search and replace across entire directory structures. It's also easier to work with locally. I've considered turning to php for the solution (and I will, if I have to), but I'd rather not if I don't have to.

    More importantly:
    2) I'm not working with html. I'm working with the php files that will generate it. The complicating issue here is that those same php files also generate xml documents, and I don't want to screw up all of the <xmltags/> that would be screwed up if they were suddenly re-written as <xmltags>. Furthermore, I don't want to accidentally mess with some line of php that contains /> (offhand, I can't actually think of a situation where that might happen, but it seems to be a risk anyway. I tried the blanket search-replace of /> with > and it definitely screwed something up ).

    So, examples for the both of you:

    <?php echo "<img src=\"$imgsrc\" />"; ?> needs to be replaced with <?php echo "<img src=\"$imgsrc\">"; ?>,
    whereas <?php $xmlstring = "<xml><emptytag/></xml>"; ?> must be left alone.

    Edit:

    I've been trying variations on (<)\b(img|hr)\b(\w+)([^>]*\/>) or (<)img|hr(\w+)([^>]*\/>) or (<)(img|hr)(\w+)([^>]*\/>), which seems like it ought to be the right direction, but they either don't work at all or they match anything with i, m, g, |, h, or r in them...
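
    A quick way to see why those three variations misbehave is to run them under PCRE (here via PHP's preg_match) against the two examples above. This is only a sketch: the third sample, <chrome/>, is a hypothetical XML tag added to show an accidental match, and pattern D is one possible correction rather than the only one. In short: in A and C the (\w+) demands extra word characters right after the tag name, so a tag name followed by a space can never match; in B the | is not grouped, so the pattern splits into "(<)img" OR "hr(\w+)([^>]*\/>)".

```php
<?php
// The first two samples are the examples from the post; the third is
// a hypothetical XML tag used to show an accidental match.
$samples = [
    'xhtml img'  => '<?php echo "<img src=\"$imgsrc\" />"; ?>',
    'xml empty'  => '<?php $xmlstring = "<xml><emptytag/></xml>"; ?>',
    'xml chrome' => '<ui><chrome/></ui>',
];

$patterns = [
    // A: \b after the name needs a non-word char next, \w+ needs a word char: never matches.
    'A' => '/(<)\b(img|hr)\b(\w+)([^>]*\/>)/',
    // B: ungrouped alternation: "(<)img" OR "hr(\w+)([^>]*\/>)".
    'B' => '/(<)img|hr(\w+)([^>]*\/>)/',
    // C: grouped, but (\w+) still demands letters directly after the tag name.
    'C' => '/(<)(img|hr)(\w+)([^>]*\/>)/',
    // D: grouped alternation plus a word boundary, no extra \w+.
    'D' => '/(<)(img|hr)\b([^>]*\/>)/',
];

$results = [];
foreach ($patterns as $label => $pattern) {
    foreach ($samples as $name => $subject) {
        $results[$label][$name] = preg_match($pattern, $subject) === 1;
    }
}

// Print one line per pattern showing which samples it matched.
foreach ($results as $label => $row) {
    echo $label, ': ', json_encode($row), PHP_EOL;
}
```

    Only D matches the <img ... /> line while leaving both XML samples alone; B also hits the "hrome/>" inside <chrome/>.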

    Last edited by traq; 06-06-2010 at 04:13 AM.

  5. #5
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,385
    Thanks
    100
    Thanked 113 Times in 111 Posts

    Default

    I almost wish I understood XHTML, but I have never really seen the use for it. In this case it would help me to write out a regex if needed, but like you said djr33 it looks like str_replace would take care of what he needs. Why would regex be needed in this case?

    I am almost certain I have seen regex that converts HTML to XHTML. I'm gonna take a quick look. I'm pretty sure I know where I saw it.

  6. #6
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,385
    Thanks
    100
    Thanked 113 Times in 111 Posts

    Default

    PHP Code:
    <?php
    $text = "this is some <hi>ordinary<hi/> text.  this is another <hi>ordinary<hi/> <br>ok<br/>text<meta>text<meta/>.";
    $text = preg_replace('/(<)([^>]*br\/>|hr\/>|img\/>|link\/>|meta\/>)/', "XXXX", $text);
    echo "$text";
    ?>
    produces
    this is some ordinary text. this is another ordinary
    okXXXXtexttextXXXX.
    dunno if it is what you are looking for though.

    Edit: I need to finish reading your post traq. Sorry!
    Last edited by james438; 06-06-2010 at 04:36 AM.

  7. #7
    Join Date
    Apr 2008
    Location
    So.Cal
    Posts
    3,643
    Thanks
    63
    Thanked 516 Times in 502 Posts
    Blog Entries
    5

    Default

    well... I need to finish my research.

    I'm beginning to think that I may need to just move to php to do this. I've been looking on the notepad++ forums, and it seems that notepad++ might not support more advanced regex constructs (if you can call one|other "advanced" ).

    I figured out a regex that will work (in php): (<)(img|hr|br|link|meta)([^>]*\/>)
    I just have to write something now that will recursively find *every* php and html file in the CMS directory structure.
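
    For the directory walk, PHP's SPL iterators can do the recursion for you. The following is a minimal dry-run sketch, not a finished tool: the function names findSourceFiles and countSelfClosingTags and the example path are invented here, and the \b after the tag name is an addition to the regex above (without it, a hypothetical XML tag such as <metadata/> would also match). It only counts matches per file, so nothing is rewritten until the list has been checked.

```php
<?php
// Hypothetical helper: collect every .php/.html file below $root.
function findSourceFiles(string $root, array $extensions = ['php', 'html']): array
{
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($root, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        // $file is an SplFileInfo object for each directory entry.
        if ($file->isFile() && in_array(strtolower($file->getExtension()), $extensions, true)) {
            $files[] = $file->getPathname();
        }
    }
    sort($files);
    return $files;
}

// Hypothetical helper: dry run that reports how many self-closing
// img/hr/br/link/meta tags each file contains, without changing anything.
function countSelfClosingTags(string $root): array
{
    $pattern = '/(<)(img|hr|br|link|meta)\b([^>]*\/>)/';
    $counts = [];
    foreach (findSourceFiles($root) as $path) {
        $counts[$path] = preg_match_all($pattern, file_get_contents($path));
    }
    return $counts;
}

// Example call (the path is a placeholder):
// print_r(countSelfClosingTags('/path/to/cms'));
```

    Once the counts look right, the same loop could read each file, run preg_replace, and write the result back, ideally on a copy of the CMS directory.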

    ...
    i hate recursion...
    Last edited by traq; 06-06-2010 at 04:47 AM.

  8. #8
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,385
    Thanks
    100
    Thanked 113 Times in 111 Posts

    Default

    Glad you figured it out. I wish I could have been more help.

    What does your complete regex line look like? I am curious what the replacement looks like .

  9. #9
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Default

    One more thought: 1. replace <tag /> with <tag>. 2. Replace <tag> with <tag />.

    For <html> it will skip (1) and make it <html /> in (2).
    For <xhtml /> it will change it to <xhtml> in (1) and back to <xhtml /> in (2).

    Hope that helps.

  10. #10
    Join Date
    Apr 2008
    Location
    So.Cal
    Posts
    3,643
    Thanks
    63
    Thanked 516 Times in 502 Posts
    Blog Entries
    5

    Default

    actually, putting this into preg_replace has uncovered a new issue. It currently looks like this:
    PHP Code:
    $newtag = preg_replace('/(<)(img|hr|br|link|meta)([^>]*)( ?\/>)/', '$1$2$3>', $tag);
    I added the fourth group (slash, greater than, optional leading whitespace) so I could match both tags <like this/> and tags <like this /> and replace them with tags <like this> (no trailing whitespace).

    However, the third group (zero or more of not-greater-than) is greedy, so it also swallows the whitespace before the slash. The optional space in the fourth group then matches nothing, and the whitespace is preserved when I do my replacement anyway.

    I think this could be solved with lookahead/lookbehind assertions? I haven't quite figured out how to use those, though. Is there a way to make sure that ([^>]*) does not end with whitespace? ([^> ]*) and ([^>^ ]*) don't seem to be what I want.
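
    One way to get there without lookarounds is to make the third group lazy and let \s* in front of the slash absorb the whitespace. The sketch below assumes that approach: the function name stripSelfClose is made up, and the \b after the tag name is an extra safeguard that is not in the pattern above. As for the two attempts mentioned: ([^> ]*) excludes every space, so it stops at the first attribute, and inside a character class a second ^ is just a literal caret, so ([^>^ ]*) merely excludes that character as well.

```php
<?php
// Hypothetical helper: turn <tag ... /> into <tag ...> for the listed
// tags only, dropping any whitespace that sits before the slash.
function stripSelfClose(string $source): string
{
    // ([^>]*?) is lazy: it grows one character at a time and stops as
    // soon as "optional whitespace + />" can match, so group 3 never
    // ends in whitespace. \b keeps <metadata/> from matching as "meta".
    $pattern = '/(<)(img|hr|br|link|meta)\b([^>]*?)\s*\/>/';
    return preg_replace($pattern, '$1$2$3>', $source);
}

$inputs = [
    '<br/>',
    '<br />',
    '<img src="/img/a.png"   />',
    '<xml><emptytag/></xml>',
];

foreach ($inputs as $input) {
    echo $input, '  =>  ', stripSelfClose($input), PHP_EOL;
}
```

    Slashes inside attribute values are safe here, because the pattern only finishes when a slash is immediately followed by >.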

    djr:
    I think I see what you're getting at - but I think it would lose the distinction between xml and xhtml at some point - part of my problem is that I wanted to replace certain <tags/>, but not <others/>. (I want to do xhtml -> html, btw, not vice-versa.) It's getting late, though, and I'm starting to have trouble thinking clearly.

    I appreciate your help, guys! Keep throwing out ideas, and I'll pick this back up in the AM. Thanks!
