Removing extra <tr><td><table> and </tr></td></table> with preg_replace



qwikad.com
05-23-2013, 07:28 PM
Hi there!

Sometimes people post ads with HTML tables in them, and more often than not they contain extra <tr><td><table> and </tr></td></table> tags, which interfere with my site's layout.

Is it possible to strip all EXTRA <tr><td><table> and </tr></td></table> with preg_replace?


Thank you for any input.

traq
05-23-2013, 07:53 PM
You might look at HTMLpurifier (http://htmlpurifier.org/). It's a good idea "anyway" if you're going to allow users to post html.

qwikad.com
05-23-2013, 08:30 PM
OK, I downloaded it, but it's all too complex for me. I just need a simple preg_replace that does one thing: strip extra <tr> <td> <table> and </tr> </td> </table> tags. Any ideas? The thing is, I don't have issues with anything else, just tables.

qwikad.com
05-23-2013, 09:55 PM
The function should work like this: first it counts all opening and closing tags, and if there are extra tags, it removes them.

For instance:


<table border="0">
<tr>
<td>

Some text

</td>
<td>

Some text

</td>
</tr>
</table> <tr> <td> <table>

The last <tr> <td> <table> are extra and should be removed since they have no matching closing </tr> </td> </table> tags. I think it's feasible. No?
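
The counting step described above could be sketched roughly like this. It is only an illustration: the function name count_table_tags is made up for this example, and raw counts say nothing about which tag is the stray one or whether the tags appear in a valid order.

```php
<?php
// Hypothetical helper: counts opening and closing <table>, <tr> and <td> tags.
// Counts alone cannot tell which tag is extra or whether the order is valid.
function count_table_tags(string $html): array
{
    $counts = [];
    foreach (['table', 'tr', 'td'] as $tag) {
        // \b keeps "<tr" from matching "<track"; openers may carry attributes.
        $open  = preg_match_all('~<' . $tag . '\b[^>]*>~i', $html);
        $close = preg_match_all('~</' . $tag . '\s*>~i', $html);
        $counts[$tag] = ['open' => (int) $open, 'close' => (int) $close];
    }
    return $counts;
}

$sample = '<table border="0"><tr><td>Some text</td><td>Some text</td></tr></table> <tr> <td> <table>';
print_r(count_table_tags($sample));
```

For the sample markup, each of the three tags comes out with one more opener than closer, which flags the problem but does not yet fix it.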

traq
05-23-2013, 11:03 PM
It's more complicated than you think.

The best way to do this is by parsing/tokenizing it - that means DOMDocument or similar, and HTMLpurifier is far easier than that. A regex approach (preg_match/preg_replace) looks easier, but the regex you'd need would get very complicated, and you'd still run the risk of making things worse by accident.
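
As a rough idea of what the parsing route can look like, here is a minimal sketch that round-trips a fragment through PHP's DOMDocument (it needs the DOM extension). The helper name rebalance_fragment is made up for this example. Note the assumptions: the underlying libxml parser is not an HTML5 parser, and it tends to close unclosed elements rather than delete them, so a stray <table> may come back as an empty table instead of disappearing.

```php
<?php
// Hypothetical helper: parses a fragment and serializes it again, so every
// element in the result has both its opening and its closing tag.
function rebalance_fragment(string $fragment): string
{
    $previous = libxml_use_internal_errors(true); // collect warnings instead of printing them
    $doc = new DOMDocument();
    $doc->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head><body>'
        . $fragment . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialize only the children of <body>, i.e. the repaired fragment.
    $out = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }
    return $out;
}

echo rebalance_fragment('<table><tr><td>Some text</td></tr></table> <tr> <td> <table>');
```

The output is balanced markup, but deciding what to do with the leftover empty elements is still up to you.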

qwikad.com
05-23-2013, 11:26 PM
I see. Thanks for explaining.

qwikad.com
05-24-2013, 04:38 AM
"It's more complicated than you think.

The best way to do this is by parsing/tokenizing it - that means DOMDocument or similar, and HTMLpurifier is far easier than that. A regex approach (preg_match/preg_replace) looks easier, but the regex you'd need would get very complicated, and you'd still run the risk of making things worse by accident."

I've looked into this issue again, and what I actually need is to strip all extra </table> tags. Just those. Once the extra closing table tags are gone, everything seems to be well formed. Would that make a preg_replace pattern easier to write?

traq
05-24-2013, 07:51 PM
You'd still have to count them (both opening and closing tags) and make sure they're in the right order. That means lookaheads and sub-pattern matching - no, it's not any less complicated. In fact, once you implement it for one kind of tag, it's not really much more work to do it for all of them.
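
To show what "count them in order" means for the </table> case only, here is a minimal sketch that walks the table tags left to right with a depth counter and drops any closing tag that has no opener before it. The function name is made up, and the assumptions are loud ones: it ignores comments, scripts, and attribute values containing ">", and it does nothing about unclosed opening tags.

```php
<?php
// Hypothetical helper: removes </table> tags that have no preceding unmatched <table>.
function strip_unmatched_table_closers(string $html): string
{
    $depth = 0; // number of <table> tags currently open
    $result = preg_replace_callback(
        '~<(/?)table\b[^>]*>~i',
        function (array $m) use (&$depth): string {
            if ($m[1] === '') {   // opening tag: one level deeper
                $depth++;
                return $m[0];
            }
            if ($depth > 0) {     // closing tag with a matching opener: keep it
                $depth--;
                return $m[0];
            }
            return '';            // closing tag with nothing to close: drop it
        },
        $html
    );
    return $result ?? $html;      // preg_replace_callback returns null on error
}

echo strip_unmatched_table_closers('<table><tr><td>a</td></tr></table></table></table>');
```

Even this small version needs state between matches, which is why a single plain pattern does not cover it.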

Also consider that extra <table> tags aren't the only thing that can ruin your markup; and a ruined layout isn't the only risk of allowing users to input HTML. I *highly* recommend using HTMLpurifier if you allow user-submitted HTML, if only for the security benefits.

You might ask at RegexAdvice (http://regexadvice.com/forums/) if you really want to pursue a preg_match solution.

qwikad.com
05-24-2013, 11:18 PM
"I *highly* recommend using HTMLpurifier if you allow user-submitted HTML, if only for the security benefits."

I AM putting security in place. I am "training" my markdown to filter out anything that can launch an attack or be used to take advantage of the site. I never thought I'd have an issue with tables. I am seriously considering just stripping them all if I can't resolve this. Why is it so hard to resolve something so simple? I've seen preg_match and preg_replace scripts that do AMAZING and complicated things. Here, all I need is a script that removes extra closing </table> tags, and yet it is such a hassle? I posted this same question on two other forums and everybody seems to have a "just forget it!" attitude. It's kinda frustrating.

djr33
05-24-2013, 11:23 PM
"You'd still have to count them (both opening and closing tags) and make sure they're in the right order."

That's what I was going to say, in response to the latest post here. It's not an easy task. It's possible-- browsers manage to do this. But you'd need to fully parse the HTML of the page. One option would be to limit the scope of what you're doing to something like single (or dual) level tables, so that you only have one table (or two) at most, and you don't need to worry so much about subpatterns, but this really is complicated.

The issue isn't that preg_match can't do this relatively well, but that to get a perfect script (with zero exceptions) it would be incredibly complicated-- as I said, you'd have to parse all of the HTML on the page to be certain nothing conflicts.

So your options:
1. Do nothing.
2. Fully parse all of the HTML.
3. Simplify the parameters (such as not allowing nested tables).
4. Settle for an imperfect (but generally working) solution that covers maybe 75-95% of the possible problems, depending on how you write it.




"I am 'training' my markdown to filter out anything that can launch an attack or anything that can be used to take advantage of the site."

The problem here is the difference between a whitelist and a blacklist. If you use a whitelist, then you will only allow those things that are approved and known to cause no problems (while blocking everything else, harmful or harmless). If you use a blacklist, as you are suggesting, then it will block all known bad things (while letting everything else, good or bad, through); the problem with that is that if you just don't know about something (or some new hacking technique is invented) then you will have no defenses at all. There *are* ways to create a working blacklist by overdenying possibly good code, such as removing all HTML, but it doesn't sound like that's what you want either.
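
A small sketch of the difference, using made-up function names. The blacklist here only knows about <script> tags; the whitelist uses PHP's strip_tags() with a short list of allowed tags chosen just for this example. One caution that matters: strip_tags() leaves attributes on allowed tags untouched, so it is not a complete sanitizer by itself.

```php
<?php
// Blacklist: remove only what is known to be bad (here, <script> tags).
function blacklist_filter(string $html): string
{
    return preg_replace('~</?script\b[^>]*>~i', '', $html) ?? $html;
}

// Whitelist: keep only an approved set of tags and drop every other tag.
// Caution: attributes on the allowed tags are NOT filtered by strip_tags().
function whitelist_filter(string $html): string
{
    return strip_tags($html, '<table><tr><td><b><i>');
}

$attack = '<img src="x" onerror="alert(1)">';
var_dump(blacklist_filter($attack) === $attack); // the blacklist never heard of onerror
var_dump(whitelist_filter($attack) === '');      // <img> is simply not on the list
```

The blacklist lets the unknown attack through unchanged, while the whitelist rejects it without ever needing to know what it was.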

traq
05-25-2013, 03:56 AM
Are you sure HTMLpurifier is "too complex" for what you want to do?



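// Load the HTML Purifier autoloader (adjust the path to your install).
require_once '/path/to/HTMLPurifier.auto.php';

// Create a purifier with the default configuration and run the markup through it.
$config = HTMLPurifier_Config::createDefault();
$purifier = new HTMLPurifier($config);
$clean_html = $purifier->purify($dirty_html);

($dirty_html would be the markup your users submit; $clean_html would be the purified markup you save/use.)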
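
If only a handful of tags should be allowed, the configuration can be narrowed with the HTML.Allowed directive. This is a sketch, not a recommendation: the allowed list below is an assumption picked for illustration, and the sample input is made up. With the library's default settings, a closing tag that has nothing to close should be dropped from the output.

```php
<?php
require_once '/path/to/HTMLPurifier.auto.php'; // same placeholder path as above

$config = HTMLPurifier_Config::createDefault();
// Whitelist of elements and attributes; this particular list is only an example.
$config->set('HTML.Allowed', 'table[border],tr,td,th,p,br,b,i,a[href]');

$purifier = new HTMLPurifier($config);

// Made-up input with one extra closing tag at the end.
$dirty_html = '<table><tr><td>Some text</td></tr></table></table>';
$clean_html = $purifier->purify($dirty_html);
echo $clean_html;
```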

qwikad.com
05-25-2013, 11:57 AM
It's not as easy as you think in my case. The ad content already goes through my markdown, and there isn't just a single $dirty_html to pass in: the HTML is echoed and has two different components in it. I will need professional help to make it work.

qwikad.com
05-25-2013, 12:18 PM
I am also wondering how the purifier is going to help me with HTML tables. And if it can, why can't I do the same thing through my markdown? Obviously the purifier is using something to straighten out HTML tables, so why can't I just take the same regex (or whatever it is) and make it part of my markdown?

traq
05-25-2013, 07:48 PM
I'm sure it's possible to reverse-engineer HTMLpurifier and integrate the relevant parts with your code. I'm also sure that it would not be a "simpler" approach.

If you'd like to share further details - what user input you receive, how you need to process it, how it needs to be output/saved - I'd be happy to continue to help you find a solution.

Otherwise, if you want to hire someone, you're welcome to post in the Paid Work forum.

qwikad.com
05-26-2013, 05:06 PM
traq

You may wonder why I need this fixed so badly. Here's an example of a table-based template a user posted on the site:

http://qwikad.com/0/posts/8-real-estate/198-real-estate-wanted/39238-Sell-Your-Problem-House-FAST-.html

That's why I want to perfect the whole table thing so that others could enjoy good looking templates and never encounter the ones that mess up the site. :)

qwikad.com
05-26-2013, 09:34 PM
I guess I've taken the easiest way out for now. I've tried different lines and ended up with this one. It does what I want... for now:


$text = preg_replace('/(\s*<\/table\s*\/?>\s*)+/', "</table>", $text);
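
A quick sketch of what this pattern does and does not catch, written with \s* at the start of the group and wrapped in a made-up function name. It merges a run of adjacent </table> tags (and the whitespace around them) into one; it does not count opening tags, so a stray closer that is separated from the others by other content stays where it is.

```php
<?php
// Collapses a run of adjacent </table> tags (optional whitespace between them) into one.
function collapse_table_closers(string $text): string
{
    return preg_replace('/(\s*<\/table\s*\/?>\s*)+/', '</table>', $text) ?? $text;
}

// Adjacent duplicates are merged into a single closing tag:
echo collapse_table_closers("<table><tr><td>a</td></tr></table> </table>\n</table>"), "\n";

// A stray closer separated by other content is left in place:
echo collapse_table_closers('<table><tr><td>a</td></tr></table><p>b</p></table>'), "\n";
```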

Thanks everyone for your input!



By the way, in the process I've discovered this site. Some of you (regex fans) may find it useful: http://regex101.com