View Full Version : RegExp weirdness



jscheuer1
03-30-2009, 03:50 PM
Consider this:


<?php
$theBack = $_SERVER["HTTP_REFERER"];
$thePat1 = "/^http:\/(\/www\.)|(\/)some\.com/";
$thePat2 = "/^(http:\/\/some\.com)|(http:\/\/www\.some\.com)/";
if(eregi($thePat1, $theBack))
echo $theBack . ' pat one';
if(eregi($thePat2, $theBack))
echo $theBack . ' pat two';
?>

Shouldn't these two patterns match the same strings:

http://www.some.com/whatever

and:

http://some.com/whatever

If not, why not?

In my test environment, thePat1 matches http://some.com and thePat2 matches http://www.some.com

The result is that I'm not 100% sure that either can be relied upon to exclude other domains. Is there a better approach to this? I just want to make sure that the referring document is from the same domain before writing out a back button to that page.

JasonDFR
03-30-2009, 04:33 PM
$theBack = 'http://www.|/some.com/'; // Matches pat1 also

I'll try to work out one pattern that matches both later. I think that when using the | pipe you need to be inside parentheses.

For example:



(this | that) // not (this) | (that)

I'm no regex expert, but I love trying to figure them out.

Just to be clear, you would like one regex that matches both http://some.com and http://www.some.com ?

JasonDFR
03-30-2009, 05:47 PM
I came up with this:


<?php

$match = 'http://some.com/';

$pat = '/^http:\/\/(www\.|)some\.com\//'; // dots escaped so they match only a literal "."

if ( preg_match($pat, $match) ) {
echo 'Match!<br>';
echo 'Pat: ' . $pat . '<br>';
echo 'Match: ' . $match;
} else {
echo 'no match';
}

exit;

As always I would love to see other regexes. Especially if they are more elegant. hehe...

jscheuer, let me know if this works for you. And BTW, use preg, not ereg.

Schmoopy
03-30-2009, 10:10 PM
So the "(www.|)" is like saying "www." or "" right? Either this or nothing?

Master_script_maker
03-30-2009, 10:29 PM
yes (www\.|) is like (www\.)?
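
A minimal sketch of that equivalence, assuming the dots are escaped and using made-up sample URLs: an empty alternative and the ? quantifier should accept and reject the same strings.

```php
<?php
// Two spellings of "optional www.": an empty alternative vs. the ? quantifier.
$withEmptyAlternative = '/^http:\/\/(www\.|)some\.com\//';
$withQuantifier       = '/^http:\/\/(www\.)?some\.com\//';

// Hypothetical sample URLs, chosen for illustration only.
$samples = array(
    'http://some.com/',
    'http://www.some.com/',
    'http://wwwXsome.com/',
    'http://www.someother.com/',
);

foreach ($samples as $url) {
    $a = preg_match($withEmptyAlternative, $url);
    $b = preg_match($withQuantifier, $url);
    // Both columns should agree for every sample.
    echo $url, ' => ', $a, ' / ', $b, "\n";
}
```

The first two samples match under both patterns and the last two match under neither.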
Added below:
Here is a useful little code snippet to test regexes. You can add as many regexes as you want:

$pat=Array(
"com"=> "/^(http:\/\/)?(www\.)?([^.]*?)\.com/",
"s com"=> "/^(http(s)?:\/\/)?(www\.)?([^.]*?)\.com/",
"sub"=> "/^(http(s)?:\/\/)?(www\.)?([^.]*?\.)?([^.]*?)\.com/",
"multiple domain"=> "/^(http(s)?:\/\/)?(www\.)?([^.]*\.)?([^.]*?)(\..{2}\..{2}|\..{3})/"
);
$padding="&nbsp;&nbsp;";
$string="https://bob.site.co.uk";

echo "String: <strong>".$string."</strong><br><br>";

foreach($pat as $k=>$v) {
$pad=$padding;
echo 'Pattern <strong>'.$k.'</strong> <em>' . $v . '</em><br>'.$pad;
if ( preg_match($v, $string, $matches) ) {
echo 'Match: ';
echo cust_print($matches, $pad, 2);
} else {
echo 'No Match';
}
echo "<br><br>";
}

function cust_print($a, $p="", $n=1) {
if(!is_array($a)) {
return $a;
}
$s="Array(";
foreach($a as $k=>$v) {
$s.="<br>".str_repeat($p, $n).'['.$k.'] => '.cust_print($v, $p, $n+1);
}
$s.="<br>".str_repeat($p, $n-1).")";
return $s;
}

jscheuer1
03-31-2009, 02:31 AM
This is all pretty much nonsense or at least inefficient to me. In javascript we can just do:


<script type="text/javascript">
var dr = 'http://www.some.com/whatever.htm',
r = new RegExp('^http:\\/(\\/www\\.|\\/)some\\.com');
alert(r.test(dr)); //alerts true
dr = 'http://some.com/whatever.htm';
alert(r.test(dr)); //alerts true
dr = 'http://www.someother.com/whatever.htm';
alert(r.test(dr)); //alerts false
dr = 'http://someother.com/whatever.htm';
alert(r.test(dr)); //alerts false
</script>

Is there no equivalent in PHP?

If I have the document.referrer string as the var dr, a simple test of either one of these (not both - only one of them is required - it's just that either will work):


/^http:\/(\/www\.|\/)some\.com/.test(dr)

or:


/^http:\/(\/www\.)|(\/)some\.com/.test(dr)

will tell me if it comes from the some.com domain or not. There must be something equally simple in PHP, perhaps even simpler, as in many cases PHP is simpler than javascript - perhaps something as simple as:


isSameDomain($_SERVER["HTTP_REFERER"])

Is there no equivalent of the javascript test() method in PHP - or better yet - a more efficient way of telling if a given URL comes from the same domain as the present page?

Nile
03-31-2009, 02:50 AM
I guess there is, although I don't know much about regexps. You could use preg_match(): it returns 1 when the pattern matches and 0 when it does not.
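
A rough PHP counterpart to the JavaScript test() calls above, as a sketch only. The helper name regex_test is hypothetical, and the pattern is one possible grouped form rather than anything settled in this thread.

```php
<?php
// Hypothetical helper mirroring JavaScript's RegExp.prototype.test().
function regex_test($pattern, $subject) {
    // preg_match() returns 1 on a match, 0 on no match, and false on error,
    // so compare against 1 explicitly to get a strict boolean.
    return preg_match($pattern, $subject) === 1;
}

// Optional "www.", then the host, then either a slash or the end of the string.
$pattern = '/^http:\/\/(www\.)?some\.com(\/|$)/';

var_dump(regex_test($pattern, 'http://www.some.com/whatever.htm'));
var_dump(regex_test($pattern, 'http://some.com/whatever.htm'));
var_dump(regex_test($pattern, 'http://www.someother.com/whatever.htm'));
var_dump(regex_test($pattern, 'http://someother.com/whatever.htm'));
```

The four calls give true, true, false, false, the same results as the alerts in the JavaScript version.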

jscheuer1
03-31-2009, 04:42 AM
You guess? What a wimpy answer! Anyways, I found this rather simple approach after much trial and error:


<?php
$thePat = '^http://' . $_SERVER['HTTP_HOST'];
$theBack = $_SERVER['HTTP_REFERER'];
if (ereg($thePat, $theBack))
echo $theBack;
else
echo 'other domain';
?>

I'm thinking I will either echo a link back to the previous page ($_SERVER["HTTP_REFERER"]) or (if the test fails) include a file that has a menu of various on site pages that might be appropriate choices. I'm just wondering if there is anything that looks unworkable/dangerous/stupid/etc. here or not.
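
One thing that may be worth checking in the pattern above: the dots in the host name are regex wildcards, and nothing marks where the host ends. The ereg functions were also deprecated in PHP 5.3 and removed in PHP 7. A possible preg-based sketch, with a hypothetical helper name and with $host standing in for $_SERVER['HTTP_HOST']:

```php
<?php
// Hypothetical helper; $host stands in for $_SERVER['HTTP_HOST'].
function referrer_starts_with_host($referrer, $host) {
    // preg_quote() escapes regex metacharacters (such as the dots in the
    // host name) and, via its second argument, the chosen delimiter.
    // The trailing (/|$) requires the host to end right there.
    $pattern = '#^http://' . preg_quote($host, '#') . '(/|$)#';
    return preg_match($pattern, $referrer) === 1;
}

var_dump(referrer_starts_with_host('http://www.some.com/page.php', 'www.some.com'));
var_dump(referrer_starts_with_host('http://wwwXsome.com/page.php', 'www.some.com'));
var_dump(referrer_starts_with_host('http://www.some.com.evil.example/', 'www.some.com'));
```

Only the first call returns true. Without the escaping and the end-of-host check, the second and third would also pass.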

jscheuer1
03-31-2009, 01:14 PM
Perhaps even more to the point:


<?php
$theHost = $_SERVER['HTTP_HOST'];
$theBack = $_SERVER['HTTP_REFERER'];
$theBParse = parse_url($theBack);

if ($theBParse['host'] == $theHost)
echo $theBack; // or do whatever if referrer is from the same domain
else
echo 'other domain'; // or do whatever if the referrer is from another domain
?>

As I said, this sort of thing is usually simpler in PHP than in javascript, which is why I couldn't understand how complicated the answers were getting.
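
A slightly more defensive sketch of the same parse_url() idea, for the case where no referrer is sent at all. The function name is hypothetical. It also assumes HTTP_HOST carries no port number: parse_url() reports the port separately from the host, so a host such as some.com:8080 would need extra handling.

```php
<?php
// Hypothetical helper; pass $_SERVER as $server in real use.
// Returns the referrer URL if it is on the same host, otherwise null.
function same_host_referrer(array $server) {
    if (empty($server['HTTP_REFERER']) || empty($server['HTTP_HOST'])) {
        return null; // no referrer (or no host) was sent
    }
    $parts = parse_url($server['HTTP_REFERER']);
    if ($parts === false || !isset($parts['host'])) {
        return null; // malformed URL or no host component
    }
    // Host names are case-insensitive, so compare them that way.
    if (strcasecmp($parts['host'], $server['HTTP_HOST']) !== 0) {
        return null;
    }
    return $server['HTTP_REFERER'];
}

$back = same_host_referrer($_SERVER);
// Escape before output, since the referrer is client-supplied.
echo $back !== null ? htmlspecialchars($back, ENT_QUOTES, 'UTF-8') : 'other domain';
```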

Twey
03-31-2009, 04:14 PM
The thing to consider is that | has a very low operator precedence — that's why you usually see an alternative enclosed in brackets.


$thePat1 = "/^http:\/(\/www\.)|(\/)some\.com/";
'Match either "http://www." at the start of the string, or "/some.com" anywhere in the string.'


$thePat2 = "/^(http:\/\/some\.com)|(http:\/\/www\.some\.com)/";
'Match either the start of the string followed by "http://some.com" or "http://www.some.com/" anywhere in the string.'
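
The precedence point can be checked with a short sketch using preg_match(). The grouped pattern and the sample URLs are assumptions added for illustration.

```php
<?php
// Ungrouped: the | splits the whole pattern into two alternatives,
// '^http:/(/www\.)' and '(/)some\.com'.
$ungrouped = '/^http:\/(\/www\.)|(\/)some\.com/';
// Grouped: only the "www." part is optional; the rest applies to both cases.
$grouped = '/^http:\/\/(www\.)?some\.com(\/|$)/';

// Hypothetical sample URLs.
$urls = array(
    'http://www.some.com/whatever',
    'http://some.com/whatever',
    'http://www.other.com/whatever',
    'http://other.com/some.com',
);

foreach ($urls as $url) {
    printf(
        "%-32s ungrouped=%d grouped=%d\n",
        $url,
        preg_match($ungrouped, $url),
        preg_match($grouped, $url)
    );
}
```

The ungrouped pattern matches all four URLs, including the two on other.com, while the grouped one matches only the first two. Separately, and as an assumption about the original test rather than something stated in this thread: the ereg functions take no delimiters, so the surrounding slashes in the first post's patterns would be treated as literal characters, which may be why each pattern matched only one form of the URL.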

jscheuer1
03-31-2009, 04:58 PM
Thanks, Twey. Anyways, what do you think about my latest approach vs. the one right before it:


<?php
$theHost = $_SERVER['HTTP_HOST'];
$theBack = $_SERVER['HTTP_REFERER'];
$theBParse = parse_url($theBack);

if ($theBParse['host'] == $theHost)
echo $theBack; // or do whatever if referrer is from the same domain
else
echo 'other domain'; // or do whatever if the referrer is from another domain
?>

vs:


<?php
$thePat = '^http://' . $_SERVER['HTTP_HOST'];
$theBack = $_SERVER['HTTP_REFERER'];
if (ereg($thePat, $theBack))
echo $theBack;
else
echo 'other domain';
?>

I also had another approach using substr() and strlen(), and an equality comparison between:


'http://' . $_SERVER['HTTP_HOST']

and a substr of equal length counting from the beginning of:


$_SERVER['HTTP_REFERER']
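
A sketch of what that substr()/strlen() comparison might look like. The helper name is hypothetical and $host stands in for $_SERVER['HTTP_HOST'].

```php
<?php
// Hypothetical helper: does $referrer begin with 'http://' . $host ?
function referrer_has_prefix($referrer, $host) {
    $prefix = 'http://' . $host;
    // Compare the leading strlen($prefix) characters with the prefix itself.
    return substr($referrer, 0, strlen($prefix)) === $prefix;
}

var_dump(referrer_has_prefix('http://some.com/a.php', 'some.com'));
var_dump(referrer_has_prefix('http://other.com/a.php', 'some.com'));
var_dump(referrer_has_prefix('http://some.com.evil.example/', 'some.com'));
```

The first and third calls return true and the second returns false. The third is the weak spot of a plain prefix test: a longer host name that merely begins with the expected one also passes.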

But I think the best is the one at the top of this post. I'm still a great novice at PHP, but I think it combines maximum accuracy with the lowest possible overhead for that accuracy.

Unfortunately, the server I'm working on does not support the component parameter of parse_url(), otherwise it could have been even simpler, or at least more direct looking.
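
For servers that do support it, the component form might look like the sketch below. The helper name is hypothetical. PHP_URL_HOST is the constant that selects the host component.

```php
<?php
// Hypothetical helper: return just the host of a URL, or null if there is none.
function referrer_host($referrer) {
    // With a component argument, parse_url() returns that one component
    // instead of an array (null if it is absent, false if the URL is malformed).
    $host = parse_url($referrer, PHP_URL_HOST);
    return is_string($host) ? $host : null;
}

var_dump(referrer_host('http://www.some.com/whatever.htm'));
var_dump(referrer_host('/relative/path'));
```

The first call gives the string www.some.com and the second gives NULL, so the result can be compared directly with $_SERVER['HTTP_HOST'].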

Twey
04-01-2009, 08:34 AM
Yes, parse_url() is the way to go — it's designed for just this situation, and can be optimised based on that.

jscheuer1
04-01-2009, 03:43 PM
Yes, parse_url() is the way to go — it's designed for just this situation, and can be optimised based on that.

Yes, that's what I thought. Now, I'm a little surprised you haven't mentioned potential problems or issues with the basic idea of even looking for the referrer, let alone making a link based upon it.

Do I take it that you are silently saying that if gotten in this manner and used only to make a link back to the current site itself, that it's OK?

In limited testing, if there is no referrer, there is no error, and my code will follow the 'other domain' path.

I'm sure this could be abused, but I'm just thinking of offering a back button in certain cases, with other generic (hard coded) choices, or if no on site referrer is available, just the generic choices.

One other thing, I'm thinking that as long as server security is up to snuff, no one can spoof a referrer in this situation as being from the same domain as the site and have a link created that goes off site or anywhere on site they couldn't reach via the address bar. Or am I mistaken?

Twey
04-01-2009, 08:29 PM
Well, there's nothing intrinsically wrong with testing the referrer :) So long as you bear in mind that it may be nonsensical or non-existent, and always provide an alternative means of accessing anything important, you'll be fine.

Referrer-spoofing is not a server issue but a client issue. It is a client header and as such the client can send anything it wants in that field, providing it isn't stripped or otherwise filtered by some intermediary. This means it needs to be treated with the same paranoia and skepticism as any other user input. Remember, too, that it may not be the browser directly but some piece of malicious software on the user's machine that sets the referrer, so XSS is a possibility.

Your idea of using the referrer to add a 'back' button to your page is a terrible one. The browser already has such facilities; there's no need to duplicate them. The only time it should be necessary to provide your own navigation controls is when your site has its own internal structure that may not be related to the order of pages traversed by the browser.

jscheuer1
04-02-2009, 12:26 AM
I was thinking about this in relation to a site I master, but then just got interested in the concept. At the same time, I'm formulating a more comprehensive approach to this phase of that site. It probably won't use what we've been discussing here. However, it (what we have been discussing) could help drive a pretty mean PHP breadcrumb scenario. But you don't think that would be secure?

Twey
04-02-2009, 07:43 AM
It's perfectly secure so long as you escape the input before displaying it on the page, as you would with any other user input. In fact, even if you don't, it would require some peculiar setup to take advantage of it: a malicious filter, perhaps, or some form of malware on the client system, or a browser bug or feature (as far as I know, none such exists) that allowed one to specify the referrer in a link.

I don't think that breadcrumbs based on the referrer are very useful. Again, it already exists in the browser (right-click your 'back' or 'forward' button some time). Breadcrumbs are about where you are in the site. For example, notice the breadcrumbs at the top of this page: DD Forums > General Coding > PHP > RegExp weirdness, even though the actual path I took to get here was more like New Posts > Page 2 > RegExp weirdness (four keystrokes to get from desktop to DD new posts, oh yes :D). It simply wouldn't be useful to duplicate that.

jscheuer1
04-02-2009, 08:34 AM
OK, could you define:


escape the input before displaying it on the page

Does this mean filtering out anything that could be javascript and/or HTML code? If not, please be more specific. If so, is there a PHP function already for that, or must one make one's own?

I'm pretty clever with code, but I'm really a novice at best when it comes to PHP.

Also, in my code, nothing from the user gets displayed on the page until it has been determined that the referrer is from the same domain, and then only a link to that referrer. If there is no referrer, or if the referrer is from another domain, or doesn't exist, that's when the fall back hard coded include or echoed content would be shown - not secure enough in and of itself though I take it?

Twey
04-02-2009, 11:54 AM
Does this mean filtering out anything that could be javascript and/or HTML code? If not, please be more specific. If so, is there a PHP function already for that, or must one make one's own?

Not necessarily filtering out entirely, but making sure the input is safe to be used in the context in which you intend to output it. This is a general principle that should be applied to all user input. For HTML, the PHP function htmlspecialchars() (http://www.php.net/htmlspecialchars) should do the task (basically, that means replacing < with &lt;, & with &amp;, and " with &quot;). A different filter would need to be applied if you intended the input to be used as part of an SQL query or a shell command, for example.

Also, in my code, nothing from the user gets displayed on the page until it has been determined that the referrer is from the same domain, and then only a link to that referrer. If there is no referrer, or if the referrer is from another domain, or doesn't exist, that's when the fall back hard coded include or echoed content would be shown - not secure enough in and of itself though I take it?

Really, this is hardly a security issue at all: as I said above, the circumstances for a third party to inject harmful code using this feature would have to be exceptional to the point of considering the client machine effectively compromised already. If, hypothetically, an attacker was capable of altering the referrer on a whim, and you failed to handle it with the proper paranoia, it would be possible to write some session-stealing XSS code to the page and thereby hijack the user's account on your site. The string used to do so could very well contain a completely valid referrer, so simply checking for that will not suffice in terms of checking for validity of the whole (for example, http://www.johnssite.com/innocent/page.php#"><script>stealCookies();</script><br style="display:none;" class=").

PHP is not a hard language to grasp, at least at a fundamental level (there's nothing particularly complicated in it, but it hasn't been thought out well and so there are a lot of non-obvious, inelegant, inconsistent, or otherwise completely stupid things to remember about more advanced features), but the main thing to remember is that it is completely content-agnostic. It doesn't know or care about the content you are writing; to PHP, it's just strings of bytes. As such, it doesn't try to make the content safe in any way: all untrusted content must be verified and/or escaped by you. Forgetting to do so is one of the biggest causes of security holes in PHP-powered sites. You have to think of what you would do when writing that code by hand, and remember to consider exceptional cases like invalid characters. By hand we would have to change a < to a &lt; if we wanted to make sure that parts of the text weren't interpreted as something we didn't intend, so in PHP we have to as well, although the stakes are higher, because in many cases failure to do so will allow someone else to decide what gets interpreted and how.
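
A minimal sketch of escaping the referrer before writing it into a link, using the example string from this post. The helper name back_link and the link text are hypothetical. ENT_QUOTES is passed so that single quotes are escaped as well as double quotes.

```php
<?php
// Hypothetical helper: build a back link from an untrusted referrer string.
function back_link($referrer) {
    // Escape for an HTML attribute context: <, >, & and both quote styles.
    $safe = htmlspecialchars($referrer, ENT_QUOTES, 'UTF-8');
    return '<a href="' . $safe . '">Back</a>';
}

// The example referrer from the post above: a valid-looking URL with markup appended.
$evil = 'http://www.johnssite.com/innocent/page.php#"><script>stealCookies();</script><br style="display:none;" class="';

echo back_link($evil), "\n";
```

After escaping, the quote and angle brackets arrive as &quot;, &lt; and &gt;, so the injected text stays inside the href value instead of closing the attribute and starting a script element.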