
Thread: Recursive regex pattern replace

  1. #1
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts

    Recursive regex pattern replace

    I have recently started working with regex. It is nice and useful, but still confusing.
    I've accomplished most of what I want, but now I need a bbcode tag to be able to nest inside itself and still have the proper HTML generated.

    In other words:
    <tag><tag></tag></tag>

    This can get complicated in a lot of ways. I know that recursive regex is possible, but I have no idea how to write it.


    Code:
    //I want this:
    [tag=value]text [tag=value]text[/tag] text[/tag]
    //To become:
    <tag property="value">text <tag property="value">text</tag> text</tag>
    This may need to go many layers deep, so a recursive function seems best.


    I can work out the details from an example. A link to a tutorial is fine as well, as long as it covers this exact setup (with different tag names, of course), since I'm not sure my current knowledge of regex is enough to adapt one from another scenario.

    BTW, I do want it to verify that the tags are properly paired so it doesn't generate unclosed tags (and invalidate the page), so I don't think a simple search and replace will be enough.


    Thanks!
    Daniel - Freelance Web Design | <?php?> | <html>| español | Deutsch | italiano | português | català | un peu de français | some knowledge of several other languages: I can sometimes help translate here on DD | Linguistics Forum

  2. #2
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,172
    Thanks
    96
    Thanked 99 Times in 97 Posts


    Sorry, this is not a complete answer, but take a look at the following, which is example 3 from the PHP manual page for preg_replace_callback:

    PHP Code:
    <?php
    $input = "plain [indent] deep [indent] deeper[/indent]deep[/indent]plain";

    function parseTagsRecursive($input)
    {
        // (?R) recurses into the whole pattern, so nested [indent] pairs are matched as units.
        $regex = '#\[indent]((?:[^[]|\[(?!/?indent])|(?R))+)\[/indent]#';

        // When called as the callback, $input is the array of matches.
        if (is_array($input)) {
            $input = '<div style="margin-left: 10px">'.$input[1].'</div>';
        }

        return preg_replace_callback($regex, 'parseTagsRecursive', $input);
    }

    $output = parseTagsRecursive($input);
    echo $output;
    ?>
    I have not had the need to deal with recursive patterns before, so this is a bit new to me as well. The above example also uses a user-defined function, which I never learned how to write. I think I can come up with something for you tomorrow. I have dealt with a very similar situation a couple of times before, so I have a pretty good idea how to go about this. The example I write won't use a user-defined function, though. Either way, it should give me some hands-on experience with recursive patterns, unless you or someone else writes one sooner.
    Last edited by james438; 07-05-2010 at 08:39 AM. Reason: minor formatting
    To choose the lesser of two evils is still to choose evil. My personal site

  3. #3
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts


    I will look at this soon.

    I've been using callback functions and they are very useful.

    The problem, though, is that I don't understand the "greediness" of the code. Consider this example:
    [tag][tag]level 2[/tag][/tag][tag]level 1[/tag]

    If greediness is on full, it will match from the first open tag to the last close tag, leaving a messy remainder in the middle. If greediness is on least, it will match from the first open tag to the first close tag (the third tag overall), leaving two uneven parts.
    It's like there should be a "smart greediness" setting, or some way to have it automatically parse outward.
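For illustration, the three behaviours can be compared directly on the example string above. This is only a sketch: the recursive pattern is adapted from the manual's [indent] example to a plain [tag], and the variable names are made up here.

```php
<?php
$text = '[tag][tag]level 2[/tag][/tag][tag]level 1[/tag]';

// Greedy: .* runs on to the LAST [/tag] in the string.
preg_match('#\[tag\](.*)\[/tag\]#', $text, $greedy);

// Lazy: .*? stops at the FIRST [/tag] it meets.
preg_match('#\[tag\](.*?)\[/tag\]#', $text, $lazy);

// Recursive: (?R) re-applies the whole pattern, so a nested pair is consumed as one unit.
preg_match('#\[tag\]((?:[^\[]|\[(?!/?tag\])|(?R))*)\[/tag\]#', $text, $balanced);

echo $greedy[1], "\n";   // [tag]level 2[/tag][/tag][tag]level 1
echo $lazy[1], "\n";     // [tag]level 2
echo $balanced[1], "\n"; // [tag]level 2[/tag]
```

Only the recursive version captures a body in which every open tag has its own close tag, which is roughly the "smart greediness" being asked for.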

    Actually, I was able to think of a way to do this using non-regex means, and that might be simpler overall.
    1. Replace open tags (and count them). Regex is fine for this, but only for the open tags; don't worry about the close tags yet. (String functions would be a little harder, but would also work.)
    2. Replace all close tags (and count them) with a simple str_replace().
    3. Now, here's the trick: compare the counts, and if they don't match, add the missing close tags to the end. That will at least force well-formed HTML, if not exactly the desired result.
    (One issue is making sure that close tags come after their open tags, but there's probably a way to fix that.)
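The three counting steps above might be sketched as follows. This is only a rough sketch: the function name convertByCounting is made up for illustration, it assumes a single tag name (tag) with a \w+ value, and it does nothing about close tags that come before their open tags or about surplus close tags.

```php
<?php
// Hypothetical helper: count-and-pad conversion for one tag name.
function convertByCounting(string $text): string
{
    // Step 1: replace open tags; preg_replace() reports the count in $opened.
    $text = preg_replace('#\[tag=(\w+)\]#', '<tag property="$1">', $text, -1, $opened);

    // Step 2: replace close tags; str_replace() reports the count in $closed.
    $text = str_replace('[/tag]', '</tag>', $text, $closed);

    // Step 3: if some open tags were never closed, append the missing closers.
    if ($opened > $closed) {
        $text .= str_repeat('</tag>', $opened - $closed);
    }

    return $text;
}

echo convertByCounting('[tag=a]x [tag=b]y[/tag] z');
// <tag property="a">x <tag property="b">y</tag> z</tag>
```

The padding keeps the markup balanced, but the closer lands at the very end of the text rather than where the author meant it to go.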

    Regex would certainly be simpler (shorter as code, but harder to create), though I'm not sure I'm quite ready for that yet.

  4. #4
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,172
    Thanks
    96
    Thanked 99 Times in 97 Posts


    I am not quite sure how to do it yet either, but I am pretty sure I can come up with something when I have a bit more free time. As for smart greediness, what you are suggesting sounds nice at first glance, but the number of permutations the engine has to try grows exponentially, which quickly leads to memory allocation problems.

    I want to create a regex that does what you suggest anyway, because it would be handy to add to my regex collection. I am going to try not to work on it too hard because of other studies I should be doing right now, though. When I come up with a solution I like, I will try to post it here.

    I actually came up with the following last night, but it is not recursive yet.
    PHP Code:
    <?php
    $text = '[this=value]text [tag=value]text[/tag] text[/this]';
    $text = preg_replace('/\[(.*?)=(.*?)\](.*?)\[\/\1\]/', '<$1 property="$2">$3</$1>', $text);
    echo $text;
    ?>
    Regex can be handy, but it is certainly complicated, since it is written almost entirely in symbols. I considered posting a tutorial on PCRE in the blog section, but realized that it would be rather long even if it covered just the basics. One of the more difficult things about PCRE is that good information on it is rather lacking. In my experience PCRE is not often needed, and plain PHP can usually be used to come up with a better solution, but not always.

  5. #5
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts


    I realize this is complicated, but there must be some reasonably efficient way to do it because this is how HTML/XML is parsed and that works fine. I have no idea what that is, though.

    What I'm doing won't be particularly hard on the server because this is going to take a few paragraphs of text and replace some sections with tagged notes (for hovering the mouse). It will be for translations so you can hover over a section and read what it means.
    But of course this is also very useful in general for any type of bbcode, or even for writing an xml parser, etc.


    I have no idea how this is done in regex, but if there is a way to "count" instances, then here's some basic logic:
    open tag; for each further open tag found, skip one more close tag; close tag
    So the parts will be: 1. open, 2. an equal number of opens and closes, 3. close

    But I don't know if that's even possible. There may be another approach. I might do some more research on this and see if I can find something.
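For what it's worth, the open / balanced middle / close logic above maps fairly directly onto PCRE recursion. The following is a sketch under some assumptions: a single tag name (tag), a \w+ value, PHP 7 or later, and a made-up function name convertBalanced. Only properly closed pairs are converted; an open tag with no matching close tag is left as literal text.

```php
<?php
// Hypothetical helper: convert only balanced [tag=value]...[/tag] pairs.
function convertBalanced(string $text): string
{
    // 1. open tag, 2. body made of plain characters, unrelated brackets,
    //    or complete nested pairs via (?R), 3. close tag.
    $regex = '#\[tag=(\w+)\]((?:[^\[]|\[(?!/?tag\b)|(?R))*)\[/tag\]#';

    return preg_replace_callback($regex, function (array $m): string {
        // Convert any pairs nested inside the body, then wrap the result.
        return '<tag property="' . $m[1] . '">' . convertBalanced($m[2]) . '</tag>';
    }, $text);
}

echo convertBalanced('[tag=a]x [tag=b]y[/tag] z[/tag]'), "\n";
// <tag property="a">x <tag property="b">y</tag> z</tag>

echo convertBalanced('[tag=a]x [tag=b]y[/tag]'), "\n";
// [tag=a]x <tag property="b">y</tag>
```

Because the body is matched one character at a time, very long inputs could run into PCRE's backtracking or recursion limits, so this is probably best kept to short passages.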

    I'm pretty sure that I can write a linear parser (loop through character by character) that will handle this. I've done this before and though it's not fun, it's the most basic way to approach the problem.
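A linear parser along those lines could look roughly like this. It is a sketch, not a finished implementation: it walks over tokens produced by preg_split() instead of individual characters, the function name convertWithStack is made up, tag names and values are assumed to be \w+, and PHP 7 or later is assumed. A stack of open tag names is what provides the validation: a close tag is only honoured if it matches the innermost open tag, and anything left open at the end is closed automatically.

```php
<?php
// Hypothetical helper: one left-to-right pass with a stack of open tag names.
function convertWithStack(string $text): string
{
    // Split into tag tokens and plain-text tokens, keeping the tags.
    $tokens = preg_split(
        '#(\[/?\w+(?:=\w+)?\])#',
        $text,
        -1,
        PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY
    );

    $stack = []; // names of the tags that are currently open
    $out = '';

    foreach ($tokens as $token) {
        if (preg_match('#^\[(\w+)=(\w+)\]$#', $token, $m)) {
            // Open tag: emit it and remember that it needs closing.
            $stack[] = $m[1];
            $out .= '<' . $m[1] . ' property="' . $m[2] . '">';
        } elseif (preg_match('#^\[/(\w+)\]$#', $token, $m) && end($stack) === $m[1]) {
            // Close tag that matches the innermost open tag.
            array_pop($stack);
            $out .= '</' . $m[1] . '>';
        } else {
            // Plain text, or a stray or mismatched tag: keep it as escaped text.
            $out .= htmlspecialchars($token);
        }
    }

    // Close whatever is still open so the HTML stays balanced.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }

    return $out;
}

echo convertWithStack('[tag=a]x [tag=b]y[/tag] z[/tag]'), "\n";
// <tag property="a">x <tag property="b">y</tag> z</tag>

echo convertWithStack('[tag=a]x'), "\n";
// <tag property="a">x</tag>
```

Each token is visited once, so the cost grows linearly with the length of the text, and mismatched tags degrade to visible text rather than to broken markup.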

  6. #6
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,172
    Thanks
    96
    Thanked 99 Times in 97 Posts


    Will this work?

    PHP Code:
    <?php
    $text = '[one=value]text [two=value]text [three=value]text[/three] text[/two] text[/one]';
    $text = preg_replace('#\[(\w+)=(\w+)\]#', '<$1 property="$2">', $text);
    $text = preg_replace('#\[/(\w+)\]#', '</$1>', $text);
    echo $text;
    ?>
    I posted this problem on the regexadvice forum in an effort to learn more about recursion, but the lady who responded solved the problem with the simple answer you see above without any use of recursion at all.

  7. #7
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts


    That doesn't seem to verify that the tags are correctly paired. It may generate invalid HTML if someone forgets a close tag, for example.

  8. #8
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,172
    Thanks
    96
    Thanked 99 Times in 97 Posts


    This is beyond my understanding of PCRE, but I still want to figure this out. Will the regex match if the tags are incorrectly nested? For example, the following incorrectly nested bbcode:

    Code:
    [one=value]text [two=value]text [three=value]text[/one] text[/two] text[/thre]
    will become:

    <one property="value">text <two property="value">text [three=value]text</one> text</two> text[/thre]

  9. #9
    Join Date
    Jan 2007
    Location
    Davenport, Iowa
    Posts
    2,172
    Thanks
    96
    Thanked 99 Times in 97 Posts


    Still working on this script.

    Here is the latest regex, but it does not fully solve the problem:

    PHP Code:
    <?php
    $text = '[one]text [two]text [three]text[/thre] text[/two] text[/one]';
    $regexp = '{((\[([^\]]+)\])((?:(?:(?!\[/?\3\]).)*+|(?1))*)(\[/\3\]))}si';
    while (preg_match($regexp, $text, $match)) {
        $text = preg_replace($regexp, '<$3>$4</$3>', $text);
    }
    echo $text;
    ?>
    I got this from regexadvice.com on this thread. I am trying to learn a bit more about lookbehinds and recursion. You may get better results, though, if you post your problem there yourself.

    If you do, I suggest you look at their rules first; they are sticklers for them. Namely: state the language you are using, give actual code when posting code and never use made-up code, and state what you want the output to be and how you want your regex to operate. I think they like it when you post your efforts thus far as well.

    They are pretty helpful and usually get back to you within 24-48 hours.

    Or you can wait till I get a hold of the full solution and post it here.

  10. #10
    Join Date
    Mar 2006
    Location
    Illinois, USA
    Posts
    12,164
    Thanks
    265
    Thanked 690 Times in 678 Posts


    I think that's beyond my knowledge of regex at the moment. I'm not in a hurry with this and when I do need it, I'll just work out a string-function linear parser.

    Thanks for the update, and let me know if this goes anywhere. Don't worry if you can't figure it out. I know there must be a way, though...