PHP Code:
function fixMarkup($input) {
/* It's anticipated that much of the $input text will be copied verbatim as formatting
* elements are unlikely to occur frequently. In an attempt to minimise the number of
* string operations the input will be copied in chunks. The $mark variable marks the
* beginning of each of these chunks as they are encountered.
*/
$mark = 0;
/* The transformed input. If no problems were encountered, this will be equal to
* $input.
*/
$result = '';
/* As the input is examined, a stack of start-tags will be maintained. This stack is
* stored as an array assigned to $tags. Every start-tag will be pushed on to the end
* of the array. If an end tag is encountered, it will be compared with the tag name
* at the top of the stack. If the tag names do not match, names will be popped off
* until a match is found or the stack is empty, writing each one into the result.
*/
$tags = array();
/* Find all of the target start- and end-tags in the input, storing them in $matches.
* The return value is the number of matches; zero means no matches at all, and false
* indicates an error (a condition that should never happen, here).
* The PREG_SET_ORDER flag groups each match together, so the first array element
* contains all of the information for the first match, the second element contains
* all of the information for the second match, and so forth.
* The PREG_OFFSET_CAPTURE flag adds an index into the string where each part of the
* match information was found. With only the previous flag, each match would have
* only simple strings as elements. For example, [ '<b>', '', 'b' ], or
* [ '</u>', '/', 'u' ]. Now, each element is an array, with element zero the string,
* and element one the offset: [ [ '<b>', 2 ], [ '', 3 ], [ 'b', 3 ] ], or
* [ [ '</u>', 2 ], [ '/', 3 ], [ 'u', 4 ] ], where the initial 2 in each case would
* indicate that the match was found at index 2 (the third character) within the input
* string.
*/
if (preg_match_all('|<(/?)([biu])>|', $input, $matches,
PREG_OFFSET_CAPTURE | PREG_SET_ORDER)) {
foreach ($matches as $match) {
$tagName = $match[2][0];
/* If the tag was an end-tag, the first capturing parentheses in the regular
* expression would contain the solidus (/). If it was a start-tag, the
* parentheses would match an empty string.
*/
if ($match[1][0] === '/') {
/* If an end-tag is found, first determine the length of the input chunk
* to copy: the distance from the last mark, stored in $mark, to the
* offset of the end-tag itself.
*/
$length = $match[0][1] - $mark;
/* Add this chunk (which excludes the end-tag) to the result. */
$result .= substr($input, $mark, $length);
/* Update the mark to point beyond the end-tag. We aren't interested in
* copying the end-tag because we can infer the tag when we examine the
* start-tag stack.
* For simplicity, the strlen function call could be replaced with the
* numeric literal, 4, as all currently accepted end-tags are four
* characters in length. However, you might want to change this in the
* future, so the tag length is computed at run-time.
*/
$mark += $length + strlen($match[0][0]);
/* Here we begin to examine the start-tags encountered so far. The most
* recently matched tag name is stored in the $tagName variable, and we
* hope to find it at the top of the stack.
* We remove the top-most item with the array_pop function. If the stack
* is empty, this function will return null; this will type-convert to
* false and cause the loop below to end, if necessary.
*/
while (($previousTagName = array_pop($tags))) {
/* First, we write the end-tag that corresponds to the top-most tag
* name.
*/
$result .= "</{$previousTagName}>";
/* If that tag name was the one we were looking for, we exit and move
* on to the next match. If not, we have a situation like:
* <i><b></i>
* As 'b' would have been the top-most tag name - it was the most
* recent start-tag added - we would have responded in the previous
* step by writing a corresponding end-tag to the result. In the next
* iteration, we would find 'i', write its end-tag, and break out as it
* is what we are looking for.
*/
if ($tagName == $previousTagName) {
break;
}
}
/* If an end-tag wasn't encountered, */
} else {
/* ... then it was a start-tag, and we push it on to the stack. */
$tags[] = $tagName;
}
}
}
/* At this point, all start- and end-tags in the input will have been found, leaving
* only plain text, so we finish off writing that remaining text into the result.
*/
$result .= substr($input, $mark);
/* However, that isn't necessarily the end of the story: start-tags that had no
* matching end-tag and weren't forced out when processing other end-tags may remain.
* So, if there are still tag names on the stack, we write them out in reverse order;
* the tag name at the bottom of the stack - the earliest added - gets written out
* last.
*/
$index = count($tags);
while ($index--) {
$result .= "</{$tags[$index]}>";
}
return $result;
}
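If the shape of $matches is hard to picture from the comments, here is a minimal standalone sketch that runs the same pattern with the same two flags and prints what comes back. The sample string is just something I made up so that the first tag lands at index 2; it isn't part of the function above.

```php
<?php
// Hypothetical sample input: the start-tag begins at index 2.
$input = 'a <b>bold</b>';

// Same pattern and flags as in fixMarkup.
preg_match_all('|<(/?)([biu])>|', $input, $matches,
    PREG_OFFSET_CAPTURE | PREG_SET_ORDER);

foreach ($matches as $match) {
    // Each element of $match is a [string, offset] pair:
    //   $match[0] is the whole tag,
    //   $match[1] is the optional solidus,
    //   $match[2] is the tag name.
    printf("%s at offset %d is %s-tag for '%s'\n",
        $match[0][0],
        $match[0][1],
        $match[1][0] === '/' ? 'an end' : 'a start',
        $match[2][0]);
}
```

Each pass through the loop sees one tag, in the order the tags appear in the input, which is what lets the function treat $match[0][1] as the place to stop copying.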
Hope that helps,