Recursive regex pattern replace
djr33
07-05-2010, 06:10 AM
I have recently started working with regex and it is nice/useful, but still confusing.
I've accomplished most of what I want, but now I have a situation where I need a bbcode tag to be able to exist within itself and still have the proper HTML generated.
In other words:
<tag><tag></tag></tag>
There are a lot of complications that can arise from this, and I know that recursive regex is possible, but I have no idea how to write it.
//I want this:
[tag=value]text [tag=value]text[/tag] text[/tag]
//To become:
<tag property="value">text <tag property="value">text</tag> text</tag>
And this needs to go many layers deep (possibly) so a recursive function is best.
From an example, I can work out the details. A link to a tutorial is fine as well, as long as it has this exact setup (varied tag names of course), since with my current knowledge of regex I'm not sure I could adapt an example from another scenario.
BTW, I do want it to verify that this is valid so it doesn't generate unclosed tags (and invalidate the page), so a simple search and replace won't be enough I don't think.
Thanks!
james438
07-05-2010, 08:33 AM
Sorry, this is not a complete answer, but take a look at example 3 from the PHP manual page for preg_replace_callback (http://us.php.net/manual/en/function.preg-replace-callback.php):
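<?php
$input = "plain [indent] deep [indent] deeper [/indent] deep [/indent] plain";
function parseTagsRecursive($input)
{
    // An [indent]...[/indent] pair whose content is any mix of: characters other
    // than "[", a "[" that does not begin an indent tag, or (via (?R)) another
    // complete pair.
    $regex = '#\[indent]((?:[^[]|\[(?!/?indent])|(?R))+)\[/indent]#';
    // When called as the callback, $input is the array of matches.
    if (is_array($input)) {
        $input = '<div style="margin-left: 10px">'.$input[1].'</div>';
    }
    return preg_replace_callback($regex, 'parseTagsRecursive', $input);
}
$output = parseTagsRecursive($input);
echo $output;
?>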
I have not had the need to deal with recursive patterns before, so this is a bit new to me as well. That and the above example uses a user defined function, which I never learned how to write. I think I can come up with something for you tomorrow. I have dealt with a very similar situation a couple times before, so I have a pretty good idea how to go about this. The example I will write won't use a user defined function though. Either way it should give me some hands on experience with recursive patterns unless you or someone else writes one sooner ;).
djr33
07-05-2010, 01:46 PM
I will look at this soon.
I've been using callback functions and they are very useful.
The problem, though, is that I don't understand the "greediness" of the code-- consider this example:
[tag][tag]level 2[/tag]level 1[/tag]
With a greedy quantifier, the pattern matches from the first tag to the last and leaves a messy remainder in the middle. With a lazy (non-greedy) quantifier, it matches from the first tag to the third, leaving two uneven parts.
It's like there should be a "smart greediness" setting, or some way to have it automatically parse outward.
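To make the greediness problem concrete, here is a small sketch. The sample strings and the tag name "tag" are assumptions chosen for illustration; the two patterns differ only in the quantifier (.* is greedy, .*? is lazy).

```php
<?php
// Assumed sample inputs: one nested pair and two sibling pairs.
$nested   = '[tag][tag]level 2[/tag]level 1[/tag]';
$siblings = '[tag]a[/tag] b [tag]c[/tag]';

// Greedy: runs from the first open tag to the LAST close tag.
preg_match('#\[tag\](.*)\[/tag\]#', $nested, $greedy);
// Lazy: runs from the first open tag to the FIRST close tag.
preg_match('#\[tag\](.*?)\[/tag\]#', $nested, $lazy);
// Greedy on siblings: swallows both pairs as if they were one.
preg_match('#\[tag\](.*)\[/tag\]#', $siblings, $greedySiblings);

echo $greedy[1], PHP_EOL;          // [tag]level 2[/tag]level 1
echo $lazy[1], PHP_EOL;            // [tag]level 2
echo $greedySiblings[1], PHP_EOL;  // a[/tag] b [tag]c
```

Neither setting pairs the tags correctly in every case, which is why the rest of the thread looks at recursion and counting instead.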
Actually, I was able to think of a way to do this using non-regex means, and that might be simpler overall.
1. Replace open tags (and count them). Using regex is fine, but just for open, don't worry about close. (string functions would be a little hard, but work also).
2. Replace all close tags (and count them). simple str_replace().
3. Now, here's the trick: compare the counts, then if they don't match add some extra close tags to the end and that will at least force properly formatted html, if not exactly the desired result.
(One issue is to make sure that close tags come after open tags, but there's probably a way to fix that.)
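The three steps above can be sketched roughly as follows. The function name replaceAndBalance is hypothetical, and the tag name "tag" with a property="..." attribute is borrowed from the example in the first post. It relies on the by-reference count arguments of preg_replace() and str_replace(). As noted, it does not check that close tags come after open tags.

```php
<?php
// Hypothetical helper illustrating the count-and-pad idea; not a full parser.
function replaceAndBalance(string $text): string
{
    // 1. Replace opening tags; the 5th argument receives how many were replaced.
    $text = preg_replace('#\[tag=(\w+)\]#', '<tag property="$1">', $text, -1, $opened);
    // 2. Replace closing tags; the 4th argument receives how many were replaced.
    $text = str_replace('[/tag]', '</tag>', $text, $closed);
    // 3. If more tags were opened than closed, append the missing close tags.
    if ($opened > $closed) {
        $text .= str_repeat('</tag>', $opened - $closed);
    }
    return $text;
}

// The inner tag is closed, the outer one is not, so one </tag> is appended.
echo replaceAndBalance('[tag=value]text [tag=value]text[/tag] text');
```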
Regex would certainly be simpler [as code, shorter, but harder to create], though I'm not sure I'm quite ready for that yet.
james438
07-05-2010, 11:49 PM
I am not quite sure how to do it yet either, but I am pretty sure I can come up with something when I have a bit more free time. As for smart greediness, what you are suggesting sounds nice at first glance, but the number of permutations the engine has to try grows exponentially, which quickly leads to memory problems.
I want to create a regex that does what you suggest anyway, because it sounds like it would be handy to add to my regex collection. I am going to try to avoid working on it too hard due to some of the other studies I should be doing right now though. When I do come up with a solution I like I will try to post it here.
I actually came up with the following last night, but it is not recursive yet.
<?php
$text = '[tag=value]text [tag=value]text[/tag] text[/tag]';
$text=preg_replace('/\[(.*?)=(.*?)\](.*?)\[\/\1\]/','<$1 property="$2">$3</$1>',$text);
echo "$text";
?>
Regex can be handy, but it is certainly complicated, since it is written almost entirely in symbols. I considered posting a tutorial on PCRE in the blog section, but realized that it would be rather long even if it only covered the basics. One of the more difficult things about PCRE is that good information on it is rather lacking. In my experience PCRE is not often needed, and plain PHP can usually be used to come up with a better solution, but not always.
djr33
07-05-2010, 11:56 PM
I realize this is complicated, but there must be some reasonably efficient way to do it because this is how HTML/XML is parsed and that works fine. I have no idea what that is, though.
What I'm doing won't be particularly hard on the server because this is going to take a few paragraphs of text and replace some sections with tagged notes (for hovering the mouse). It will be for translations so you can hover over a section and read what it means.
But of course this is also very useful in general for any type of bbcode, or even for writing an xml parser, etc.
I have no idea how this is done in regex, but if there is a way to "count" instances, then here's some basic logic:
open tag; if open tag, then skip as many close tags; close tag
So the parts will be: 1. open, 2. equal number of open then close, 3. close
But I don't know if that's even possible. There may be another approach. I might do some more research on this and see if I can find something.
I'm pretty sure that I can write a linear parser (loop through character by character) that will handle this. I've done this before and though it's not fun, it's the most basic way to approach the problem.
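For what it's worth, the "open, then an equal number of open and close, then close" logic above is roughly what PCRE recursion expresses. Below is a minimal sketch that only checks whether a string is balanced; it does no replacing. The function name isBalanced and the single tag name "tag" are assumptions for illustration. (?1) re-enters group 1, which is how the pattern "counts" nested pairs.

```php
<?php
// Hypothetical helper: true when every [tag=...] has a matching [/tag].
function isBalanced(string $text): bool
{
    $pattern = '~
        ^
        (?:
            [^\[]++                  # plain text without brackets
          | \[ (?! /?tag[=\]] )      # a bracket that does not start one of our tags
          | (                        # group 1: one complete element
                \[tag=[^\]]*\]       #   open tag
                (?: [^\[]++ | \[ (?! /?tag[=\]] ) | (?1) )*   # text or nested elements
                \[/tag\]             #   close tag
            )
        )*
        \z
    ~x';
    return preg_match($pattern, $text) === 1;
}

var_dump(isBalanced('[tag=a]x [tag=b]y[/tag] z[/tag]')); // bool(true)
var_dump(isBalanced('[tag=a]x [tag=b]y[/tag] z'));       // bool(false)
```

A check like this could run before a plain search and replace, so that unbalanced input is rejected rather than turned into invalid HTML.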
james438
07-06-2010, 05:53 AM
Will this work?
<?php
$text = 'text text text text text';
$text=preg_replace('#\[(\w+)=(\w+)\]#','<$1 property="$2">',$text);
$text=preg_replace('#\[/(\w+)\]#','</$1>',$text);
echo "$text";
?>
I posted this problem on the regexadvice forum in an effort to learn more about recursion, but the lady who responded solved the problem with the simple answer you see above without any use of recursion at all.
djr33
07-06-2010, 04:17 PM
That doesn't seem to verify that there are correct pairs of tags. It may generate invalid HTML if someone forgets a close tag for example.
james438
07-07-2010, 03:23 AM
This is beyond my understanding of PCRE, but I still want to figure this out. Will the regex match if the tags are incorrectly nested? For example in the following incorrectly nested bb code:
[one=value]text [two=value]text text[/one] text[/two] text
will become:
<one property="value">text <two property="value">text text</one> text</two> text
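One possible way to catch this kind of mis-nesting, sketched under assumptions: walk the tags in order and keep a stack of the names that are currently open, converting a close tag only when it matches the innermost open one. The function name bbcodeToHtmlChecked is hypothetical, and the [name=value] syntax and property="..." output follow the examples in this thread.

```php
<?php
// Hypothetical helper: converts [name=value]...[/name] to HTML and leaves any
// close tag that would break the nesting as literal text.
function bbcodeToHtmlChecked(string $text): string
{
    $stack = [];  // names of currently open tags, innermost last
    $out = preg_replace_callback(
        '#\[(\w+)=(\w+)\]|\[/(\w+)\]#',
        function (array $m) use (&$stack): string {
            if ($m[1] !== '') {                        // opening tag: [name=value]
                $stack[] = $m[1];
                return '<' . $m[1] . ' property="' . $m[2] . '">';
            }
            if ($stack && end($stack) === $m[3]) {     // closes the innermost open tag
                array_pop($stack);
                return '</' . $m[3] . '>';
            }
            return $m[0];                              // stray or mis-nested close tag
        },
        $text
    );
    // Close whatever is still open, innermost first, so the HTML stays valid.
    while ($stack) {
        $out .= '</' . array_pop($stack) . '>';
    }
    return $out;
}

echo bbcodeToHtmlChecked('[one=value]text [two=value]text text[/one] text[/two] text');
```

With the mis-nested input above, the [/one] that arrives while "two" is still open is left as text, and the still-open "one" is closed at the end.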
james438
07-10-2010, 04:59 AM
Still working on this script.
Here is the latest regex, but it does not fully solve the problem:
<?php
$text = 'text text text text text';
$regexp = '{((\[([^\]]+)\])((?:(?:(?!\[/?\3\]).)*+|(?1))*)(\[/\3\]))}si';
while(preg_match($regexp,$text,$match)){
$text = preg_replace($regexp,'<$3>$4</$3>',$text);
}
echo $text;
?>
I got this from regexadvice.com on this thread (http://regexadvice.com/forums/thread/69343.aspx). I am trying to learn a bit more about lookbehinds and recursion. You may get a bit better results though if you post your problem there yourself.
If you do, I suggest you look at their rules; they are sticklers for them. Namely: post the language that you are using, give actual code when posting code, and never use made-up code. State what you want the output to be and how you want your regex to operate. I think they like it when you post your efforts thus far as well.
They are pretty helpful and usually get back to you within 24-48 hours.
Or you can wait till I get a hold of the full solution and post it here.
djr33
07-10-2010, 04:49 PM
I think that's beyond my knowledge of regex at the moment. I'm not in a hurry with this and when I do need it, I'll just work out a string-function linear parser.
Thanks for the update, and let me know if this goes anywhere. Don't worry if you can't figure it out. I know there must be a way, though...
djr33
07-11-2010, 04:43 PM
<?php
function textrepl($matches) {
if (strtolower($matches[0])=='[/label]') {
if ($GLOBALS['level']>0) { //valid level?
$GLOBALS['level']--; //going down one level
return '</span>';
}
else {
return $matches[0]; //default
}
}
else {
$GLOBALS['level']++; //going up one level
return '<span title="'.$matches[1].'">';
}
}
$s = '[label=A long sentence.]Una [label=sentence]frase[/label] lunga.[/label] [label=Goodbye.]Ciao.[/label]';
$level = 0;
$s = preg_replace_callback('/\[label=(.+)\]|\[\/label\]/Uiu', 'textrepl', $s);
for(;$level>0;$level--) {
$s .= '</span>';
}
echo $s;
?>
The generated HTML is:
<span title="A long sentence.">Una <span title="sentence">frase</span> lunga.</span> <span title="Goodbye.">Ciao.</span>
This works.
Note that my example is with translations: allowing layered titles on spans so you can see what something means. But this can be done in many ways and could generate any type of HTML while keeping the output valid-- making the tags properly line up.
The logic is this: Use regex to match an opening OR closing tag, then use a function to handle the replacements. In this function, track what level the tags are at and if it can't go to a wider level (less than 0), then it must be a mistake so it ignores that close tag and outputs it as text. At the end of this it determines the final level and if it is not 0 it will add close tags until it is.
This way the html output is valid and there's some very basic error correction: ignore extra close tags, and close unclosed open tags.
The code is somewhat ugly and requires a lot of setup, so I might look at rewriting it as an anonymous function or perhaps creating one function that can be used in any such replacement.
Then make a function that holds it so that the for loop at the end is also accounted for.
Then it can be called just like parsetohtml($text,$tagname,$htmltagname), or something like that.
But this works. The trick was figuring out how to count and using "or" is what allows for this.
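A rough sketch of what such a wrapper might look like, keeping the same counting logic. The parsetohtml() name and parameter order come from the line above; the fixed title="..." attribute, the preg_quote() call and the htmlspecialchars() escaping are assumptions added for illustration.

```php
<?php
// Sketch only: same level counting as the callback above, wrapped in one function.
function parsetohtml(string $text, string $tagname, string $htmltagname): string
{
    $level = 0;                          // how many tags are currently open
    $tag = preg_quote($tagname, '/');    // make the tag name safe inside the pattern
    $open = '<' . $htmltagname;
    $close = '</' . $htmltagname . '>';
    $text = preg_replace_callback(
        '/\[' . $tag . '=([^\]]+)\]|\[\/' . $tag . '\]/i',
        function (array $m) use (&$level, $open, $close): string {
            if (isset($m[1])) {          // group 1 is only set for an opening tag
                $level++;
                return $open . ' title="' . htmlspecialchars($m[1], ENT_QUOTES) . '">';
            }
            if ($level > 0) {            // a close tag with an open tag to match
                $level--;
                return $close;
            }
            return $m[0];                // stray close tag: leave it as text
        },
        $text
    );
    // Close anything still open so the HTML stays valid.
    return $text . str_repeat($close, $level);
}

echo parsetohtml('[label=A long sentence.]Una [label=sentence]frase[/label] lunga.', 'label', 'span');
```

Keeping $level as a local variable captured by reference avoids the $GLOBALS dependency, so the function can be called more than once without resetting anything.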
djr33
07-11-2010, 04:55 PM
Hmm, here it is rewritten with an anonymous function. I'm not sure if this is simpler or more complex, but at least it puts everything in one place:
<?php
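$s = '[label=A long sentence.]Una [label=sentence]frase[/label] lunga.[/label] [label=Goodbye.]Ciao.[/label]';
$level = 0;
$s = preg_replace_callback('/\[label=(.+)\]|\[\/label\]/Uiu',
function ($matches) {
if (strtolower($matches[0])=='[/label]') {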
if ($GLOBALS['level']>0) { //valid level?
$GLOBALS['level']--; //going down one level
return '</span>';
}
else {
return $matches[0]; //default
}
}
else {
$GLOBALS['level']++; //going up one level
return '<span title="'.$matches[1].'">';
}
},
$s);
for(;$level>0;$level--) {
$s .= '</span>';
}
echo $s;
?>
That does exactly the same thing as above.
djr33
07-11-2010, 11:09 PM
I've now rewritten this as a function that takes six parameters: two search patterns (open and close), a replacement for each, the string itself, and an optional "flags" parameter so you can add "caseless" etc.
This works very well and it has a lot of options and seems organized, at least compared to the versions in my previous posts. However, it is still very specific and may have problems if someone's needs differ too greatly from mine-- basically if it's anything other than standard "tags" where there's an open and a close tag, and the close tag doesn't have any variable parts. It actually would work fine, I think, but it could get complex because of the for loop at the end. That is the only part that would (I think) cause problems.
<?php
//replace recursively requiring matching pairs, such as html tags
function preg_replace_recursivepair($po,$pc,$ro,$rc,$s,$flags='') {
//pattern open, pattern close, replace open, replace close, string, preg flags
$level = 0;
$s = preg_replace_callback('/'.$po.'|'.$pc.'/'.$flags,
function ($matches) use (&$level,$po,$pc,$ro,$rc,$flags) { //$flags must be imported too, or it is undefined inside the closure
if (preg_match('/'.$po.'/'.$flags,$matches[0])==1) {
$level++; //going up one level
return preg_replace('/'.$po.'/'.$flags,$ro,$matches[0]);
}
else {
if ($level>0) { //valid level?
$level--; //going down one level
return preg_replace('/'.$pc.'/'.$flags,$rc,$matches[0]);
}
else {
return $matches[0]; //no changes, invalid
}
}
},
$s);
for(;$level>0;$level--) {
$s .= $rc;
}
return $s;
}
$s = 'Una frase grande.[/label] [label=Goodbye.]Adiós.';
echo preg_replace_recursivepair('\[label=(.+)\]','\[\/label\]','<span title="$1">','</span>',$s,'Uiu');
?>
Note: I changed my translated text from Italian to Spanish now that I set my file's format to UTF8 and that works....
I'd like some feedback on this and perhaps some ideas on standardizing it.
And I still know there must be a simpler/normal way to do this...