tagIsClosed function help



shachi
11-17-2006, 07:40 PM
Hello everyone,

I was playing with some scripts when I came across a really confusing question. Can PHP check if a tag in a string is properly closed? (For example, if there is only one tag, as in <b>string, can PHP change it to <b>string</b>?)

I have been playing with regular expressions to solve this problem for a while, but unfortunately I can't get it to work.

Any help would be appreciated.
Thanks.

Twey
11-17-2006, 08:04 PM
Possible, but not easy. A fairly small task, but not simple. Basically, you'll need to iterate through the string and keep track of what tags are open and in what order they come. You'll also need a list of singleton tags that don't require closing tags.
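
The approach described above (walk the string, keep a stack of open tags, and skip singleton tags) might be sketched roughly as follows. This is only an illustration, not code from the thread: the function name findUnclosedTags, the default list of singleton tags, and the simplified tag regex are all assumptions, and the regex is not a full HTML parser.

```php
<?php
// Hypothetical helper (not from the thread): returns the names of tags
// that are still open at the end of $html, in the order they were opened.
function findUnclosedTags(string $html, array $voidTags = ['br', 'hr', 'img', 'input']): array
{
    $open = [];
    // Match simple start- and end-tags; attributes are tolerated but ignored.
    preg_match_all('~<(/?)([a-z][a-z0-9]*)\b[^>]*>~i', $html, $matches, PREG_SET_ORDER);

    foreach ($matches as $m) {
        $name = strtolower($m[2]);
        if (in_array($name, $voidTags, true)) {
            continue; // singleton tags never need a closing tag
        }
        if ($m[1] === '') {
            $open[] = $name; // start-tag: push on to the stack
        } else {
            // End-tag: find the most recent matching start-tag, if any,
            // and discard it together with everything opened after it.
            $pos = array_search($name, array_reverse($open, true), true);
            if ($pos !== false) {
                array_splice($open, $pos);
            }
        }
    }
    return $open;
}

print_r(findUnclosedTags('<b><i>text</i><br>more')); // only 'b' is left open
```

An empty result means every non-singleton tag found was closed; the remaining names, read in reverse, give the order in which closing tags would have to be appended.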

mwinter
11-17-2006, 10:49 PM
I was playing with some scripts when I came across a really confusing question. Can PHP check if a tag in a string is properly closed?

For what purpose exactly?

If you're trying to repair broken HTML, you'd be better off using the Tidy extension than trying to roll your own version.

Mike

shachi
11-18-2006, 07:22 AM
If you're trying to repair broken HTML, you'd be better off using the Tidy extension than trying to roll your own version.


What is the Tidy extension?



For what purpose exactly?


I am building something like a comment system, and I have used strip_tags to remove all HTML tags except <b>, <i> and <u>. Unfortunately, if someone writes an incomplete tag (<b>comment), then everything after that comment becomes bold, italic or underlined. The user may also have left multiple tags unclosed (<b><i><u>comment<b><u>comment2), so I need a function to close all those tags. I just have no clue how to do it.
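
To make the problem concrete, here is a small sketch of the filtering step described above. The sample comment text and variable names are made up for illustration; strip_tags and its second argument (the list of allowed tags) are standard PHP.

```php
<?php
// Illustrative input: an unclosed <b> plus a tag that is not on the allow list.
$comment = '<b>comment <a href="#">link</a>';

// Keep only <b>, <i> and <u>; every other tag is removed, its text is kept.
$clean = strip_tags($comment, '<b><i><u>');

echo $clean, "\n"; // <b>comment link
```

The disallowed link is stripped, but the <b> is still unclosed, so anything rendered after this comment would stay bold. Note also that strip_tags leaves attributes on allowed tags untouched, which is worth keeping in mind for user-supplied input.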

Twey
11-18-2006, 12:46 PM
What is the Tidy extension?

http://www.php.net/tidy
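
For reference, repairing a fragment with the Tidy extension might look roughly like the sketch below. It assumes the extension is installed and loaded; the wrapper name repairFragment is hypothetical, while tidy_repair_string and the show-body-only option belong to Tidy itself. The exact whitespace of Tidy's output can vary, so none is shown here.

```php
<?php
// Hypothetical wrapper: returns the repaired fragment, or null if Tidy
// is not available or the repair fails.
function repairFragment(string $html): ?string
{
    if (!function_exists('tidy_repair_string')) {
        return null; // tidy extension not loaded
    }
    // 'show-body-only' asks Tidy for just the fragment, not a whole document.
    $fixed = tidy_repair_string($html, ['show-body-only' => true], 'utf8');

    return $fixed === false ? null : trim($fixed);
}

var_dump(repairFragment('<b>comment'));
```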

mwinter
11-19-2006, 07:54 PM
Checking for tag pairs and proper nesting isn't especially hard, nor is simple correction. However, if correction is required, it might be a good idea to show a rendered preview of the input so that the user can check that the corrections are what was intended. For that reason, I wouldn't bother with a client-side port.

The following hasn't been exhaustively tested, but it covers the cases I can think of. It requires at least PHP 4.3.0 due to use of the PREG_OFFSET_CAPTURE flag.



function fixMarkup($input) {
    $index = 0;
    $result = '';
    $tags = array();

    if (preg_match_all('|<(/?)([biu])>|', $input, $matches, PREG_OFFSET_CAPTURE | PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $tagName = $match[2][0];

            if ($match[1][0]) {
                $length = $match[0][1] - $index;
                $result .= substr($input, $index, $length);
                $index += $length + strlen($match[0][0]);

                while(($previousTagName = array_pop($tags))) {
                    $tag = "</{$previousTagName}>";
                    $result .= $tag;
                    /* Note: this adjustment is removed in the corrected version
                     * posted later in the thread.
                     */
                    $index += strlen($tag);
                    if ($tagName == $previousTagName) {
                        break;
                    }
                }
            } else {
                $tags[] = $tagName;
            }
        }
    }
    $result .= substr($input, $index);

    $index = count($tags);
    while ($index--) {
        $result .= "</{$tags[$index]}>";
    }
    return $result;
}

Mike

shachi
11-20-2006, 11:02 AM
mwinter: What if the user intentionally does it?

shachi
11-20-2006, 02:55 PM
mwinter: The function you made works great too. By the way, if possible, can you also put comments in it describing what the lines do? Thanks again.

mwinter
11-21-2006, 02:03 AM
What if the user intentionally does it?

Then I suppose it's up to you to explain why it's necessary. At the very least, the user needs to include an end-tag if the element ends before the end of the input - the code I posted will add any end-tags for unclosed elements.



By the way, if possible, can you also put comments in it describing what the lines do?

Sure. Note that this code contains a correction in the inner-most while loop, specifically the removal of the change to the $mark variable (renamed from $index), as well as a couple of other minor changes.



function fixMarkup($input) {
    /* It's anticipated that much of the $input text will be copied verbatim as
     * formatting elements are unlikely to occur frequently. In an attempt to minimise
     * the number of string operations the input will be copied in chunks. The $mark
     * variable marks the beginning of each of these chunks as they are encountered.
     */
    $mark = 0;
    /* The transformed input. If no problems were encountered, this will be equal to
     * $input.
     */
    $result = '';
    /* As the input is examined, a stack of start-tags will be maintained. This stack
     * is stored as an array assigned to $tags. Every start-tag will be pushed on to
     * the end of the array. If an end-tag is encountered, it will be compared with the
     * tag name at the top of the stack. If the tag names do not match, names will be
     * popped off until a match is found or the stack is empty, writing an end-tag for
     * each one into the result.
     */
    $tags = array();

    /* Find all of the target start- and end-tags in the input, storing them in
     * $matches. The return value is the number of matches; zero means no matches at
     * all, and false indicates an error (a condition that should never happen, here).
     * The PREG_SET_ORDER flag groups each match together, so the first array element
     * contains all of the information for the first match, the second element contains
     * all of the information for the second match, and so forth.
     * The PREG_OFFSET_CAPTURE flag adds an index into the string where each part of
     * the match information was found. With only the previous flag, each match would
     * have only simple strings as elements. For example, [ '<b>', '', 'b' ], or
     * [ '</u>', '/', 'u' ]. Now, each element is an array, with element zero the
     * string, and element one the offset: [ [ '<b>', 2 ], [ '', 3 ], [ 'b', 3 ] ], or
     * [ [ '</u>', 2 ], [ '/', 3 ], [ 'u', 4 ] ], where the initial 2 in each case
     * would indicate that the match was found at index 2 (the third character) within
     * the input string.
     */
    if (preg_match_all('|<(/?)([biu])>|', $input, $matches,
            PREG_OFFSET_CAPTURE | PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $tagName = $match[2][0];

            /* If the tag was an end-tag, the first capturing parentheses in the
             * regular expression would contain the solidus (/). If it was a
             * start-tag, the parentheses would match an empty string.
             */
            if ($match[1][0] === '/') {
                /* If an end-tag is found, first determine the length of the input
                 * chunk to copy; the position from the last mark stored in $mark, to
                 * the offset of the end-tag itself.
                 */
                $length = $match[0][1] - $mark;
                /* Add this chunk (which excludes the end-tag) to the result. */
                $result .= substr($input, $mark, $length);
                /* Update the mark to point beyond the end-tag. We aren't interested
                 * in copying the end-tag because we can infer the tag when we examine
                 * the start-tag stack.
                 * For simplicity, the strlen function call could be replaced with the
                 * numeric literal, 4, as all currently accepted end-tags are four
                 * characters in length. However, you might want to change this in the
                 * future, so the tag length is computed at run-time.
                 */
                $mark += $length + strlen($match[0][0]);

                /* Here we begin to examine the start-tags encountered so far. The
                 * most recently matched tag name is stored in the $tagName variable,
                 * and we hope to find it at the top of the stack.
                 * We remove the top-most item with the array_pop function. If the
                 * stack is empty, this function will return null; this will
                 * type-convert to false and cause the loop below to end, if necessary.
                 */
                while (($previousTagName = array_pop($tags))) {
                    /* First, we write the end-tag that corresponds to the top-most
                     * tag name.
                     */
                    $result .= "</{$previousTagName}>";
                    /* If that tag name was the one we were looking for, we exit and
                     * move on to the next match. If not, we have a situation like:
                     *   <i><b></i>
                     * As 'b' would have been the top-most tag name - it was the most
                     * recent start-tag added - we would have responded in the
                     * previous step by writing a corresponding end-tag to the result.
                     * In the next iteration, we would find 'i', write its end-tag,
                     * and break out as it is what we are looking for.
                     */
                    if ($tagName == $previousTagName) {
                        break;
                    }
                }
            /* If an end-tag wasn't encountered, */
            } else {
                /* ... then it was a start-tag, and we push it on to the stack. */
                $tags[] = $tagName;
            }
        }
    }
    /* At this point, all start- and end-tags in the input will have been found,
     * leaving only plain text, so we finish off writing that remaining text into the
     * result.
     */
    $result .= substr($input, $mark);

    /* However, that isn't necessarily the end of the story: start-tags that had no
     * matching end-tag and weren't forced out when processing other end-tags may
     * remain. So, if there are still tag names on the stack, we write them out in
     * reverse order; the tag name at the bottom of the stack - the earliest added -
     * gets written out last.
     */
    $index = count($tags);
    while ($index--) {
        $result .= "</{$tags[$index]}>";
    }
    return $result;
}

Hope that helps,
Mike
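
As a quick illustration of how the corrected function behaves, the sketch below runs it on a few sample inputs. The function body is a condensed copy of the commented version above, repeated only so that the example runs on its own; the sample strings are illustrative, apart from the multi-tag one, which comes from earlier in the thread.

```php
<?php
// Condensed copy of the corrected fixMarkup function from the post above.
function fixMarkup($input) {
    $mark = 0;
    $result = '';
    $tags = array();

    if (preg_match_all('|<(/?)([biu])>|', $input, $matches,
            PREG_OFFSET_CAPTURE | PREG_SET_ORDER)) {
        foreach ($matches as $match) {
            $tagName = $match[2][0];
            if ($match[1][0] === '/') {
                // Copy text up to the end-tag, then skip the end-tag itself.
                $length = $match[0][1] - $mark;
                $result .= substr($input, $mark, $length);
                $mark += $length + strlen($match[0][0]);
                // Close open tags until the matching start-tag is reached.
                while (($previousTagName = array_pop($tags))) {
                    $result .= "</{$previousTagName}>";
                    if ($tagName == $previousTagName) {
                        break;
                    }
                }
            } else {
                $tags[] = $tagName; // start-tag: push on to the stack
            }
        }
    }
    $result .= substr($input, $mark);

    // Close whatever is still open, innermost first.
    $index = count($tags);
    while ($index--) {
        $result .= "</{$tags[$index]}>";
    }
    return $result;
}

echo fixMarkup('<b>x</b>yz'), "\n";   // <b>x</b>yz (already balanced, unchanged)
echo fixMarkup('<i><b>x</i>y'), "\n"; // <i><b>x</b></i>y
echo fixMarkup('x</b>y'), "\n";       // xy (stray end-tag is dropped)
echo fixMarkup('<b><i><u>comment<b><u>comment2'), "\n";
// <b><i><u>comment<b><u>comment2</u></b></u></i></b>
```

Because the pattern only matches bare <b>, <i> and <u> tags, a tag carrying attributes or written in upper case would pass through untouched and would not be balanced.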

shachi
11-21-2006, 09:36 AM
Thanks a lot mwinter, Twey. It helped a lot. Thanks again.