Help with a preg_match



nicmo
11-19-2010, 04:16 PM
Hello,

I am trying to find links in members' pages, making sure a predetermined link is present in the HTML. I am having some problems, though.

Some members use a full URL for the link, some only the base dir/file.

So I use the function basename() on the predetermined full URL I want to check, to make sure I only check the base dir.

I am using preg_match as such:

preg_match ("|<[aA] (.+?)".basename($links_full_url)."(.+?)>(.+?)<\/[aA]>|i", $result, $matches);

This works for links like http://www.dynamicdrive.com/forums/. I end up searching for "forums", and if the page I look into has:


<a href="/forums/">whatever</a>
or

<a href="http://www.dynamicdrive.com/forums/">whatever</a>
or

<a href="forums">whatever</a>

These are all valid links, and it finds them fine.

However!

If the full URL is something like http://www.dynamicdrive.com/forums.php, I end up searching for "forums.php" and my preg_match can't find it. It can find "forums.p" though, strange :(

Any help please?

bluewalrus
11-20-2010, 04:11 PM
"/<a href=\"(.*?)"/i"

I don't see how running basename() on the full URL relates to finding the link. The full link should be contained in the href of the a tag (assuming you're not using JS).

nicmo
11-20-2010, 10:55 PM
ok i should have just asked why with:


$result = '<a href="mypage.php">whatever</a>';

the following preg_match


preg_match ("|<[aA] (.+?)mypage.php(.+?)>(.+?)<\/[aA]>|i", $result, $matches);

has no matches found.

Looking only at the href would not work for me.

imagine:


<link rel='prev' title='whatever' href='whatever' />

the link would be found, but not in an A tag, cheating my lookup.

bluewalrus
11-21-2010, 04:12 PM
What are you looking for in that example? Your first bracket finds the href=", then your second gets the closing ", and the third grabs the contents of the "a".

The preg_match as I know it doesn't use the pipes (|) but uses the forward slash (/) around the expression.

My example also requires the element be an "a" which wouldn't find your "link" example.

james438
11-22-2010, 10:39 AM
hypothetically what sort of results do you want to get with

$result = '<a href="mypage.php">whatever</a>';

When I use


<pre><?php
$result = '<a href="mypage.php">whatever</a>';
preg_match ("|<[aA] (.+?)mypage.php(.+?)>(.+?)<\/[aA]>|i", $result, $matches);
print_r($matches);
?></pre>
I get the following:

Array
(
[0] => <a href="mypage.php">whatever</a>
[1] => href="
[2] => "
[3] => whatever
)
In a browser, element [0] shows up as a hyperlinked "whatever" because it holds the whole <a> tag.

As a side note, you can use the pipe for a delimiter, but I would highly recommend you do not and stick with the old standby of the forward slash / like bluewalrus suggested.


"/<a href=\"(.*?)"/i"
needs to have the third quote escaped. The href in this example is not optional though.
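
The quoting problem is easier to see in runnable form. This is only a minimal sketch: the sample $html value is made up for illustration, and it assumes a single-quoted PHP string is acceptable for the pattern, which avoids the escaping altogether.

```php
<?php
// Hypothetical sample input, not taken from the posts above.
$html = '<a href="this.com">yo</a>';

// Double-quoted PHP string: every literal " inside the pattern needs a backslash.
$escaped = "/<a href=\"(.*?)\"/i";

// Single-quoted PHP string: the double quotes need no escaping at all.
$single = '/<a href="(.*?)"/i';

preg_match($escaped, $html, $m1);
preg_match($single, $html, $m2);

echo $m1[1], "\n"; // href value captured by the escaped pattern
echo $m2[1], "\n"; // same capture from the single-quoted pattern
```

Both variables end up holding the same pattern text, so which form to use is mostly a matter of readability.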

Just so I understand what you are looking for: you are looking for web addresses, correct? Addresses that could be in the form of:

1 http://www.this.com
2 <a href="this.com">yo</a>
3 <a href="www.this.com">yo</a>
4 <a href="https://www.this.com">yo</a>

the corresponding matches you want will be the following:

1 http://www.this.com
2 this.com
3 www.this.com
4 https://www.this.com

Is that correct?

james438
11-22-2010, 07:04 PM
Actually, why not just use str_replace() to format the data or substr() to detect if there is an anchor being used? The users will know that the field is for formatting links, so the range of things people will attempt to enter into the field will be limited. If this were a 10 page document then we might want to use a complicated pcre to locate and/or format the web addresses, but that does not appear to be the case here.

nicmo
11-22-2010, 11:48 PM
OK, I kind of confused the whole thing, sorry about that. By mistake I kept putting HTML in $result, and that's wrong.

$result only contains a URL like "http://www.dynamicdrive.com/forums.php" on which I run basename() and end up searching only for "forums.php".

And the problem is, my preg_match can't find "forums.php"

on


<a href="forums.php?whatever=1">whatever</a>

or


<a href="forums.php">whatever</a>

BUT

for links like "http://www.dynamicdrive.com/forums/"

preg_match can find "forums"

on


<a href="/forums/">whatever</a>

or


<a href="http://www.dynamicdrive.com/forums/">whatever</a>

So sorry about my previous mistake, I bet I made everything look very confusing hehe

james438
11-23-2010, 05:54 AM
Give this a whirl:


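<?php
$result = '<a href="http://www.dynamicdrive.com/forums/?this=3">whatever</a>';
// Capture the href value up to a "?", a trailing "/" before the closing quote, or the closing quote itself.
preg_match ('/(<a href=\")(.*?)(\?|\/\"|\")/i', $result, $matches);
$match=$matches[2];
// Drop a trailing slash so the last path segment is not empty.
if (substr($match,-1,1)=='/') $match=substr_replace($match,'',-1,1);
// Split on "/" and keep the last segment.
$match=explode("/","$match");
$found=end($match);
print $found;
?>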

I can't help but think that there must be an easier way to do this without regex, but I am not understanding exactly what your script is supposed to do. I am sure that the pcre could be tightened up a bit, but it works.

Try a few different things and let us know what fails.

nicmo
11-23-2010, 10:41 AM
I have a reciprocal link system, and I need this to check whether a user is linking back to another member. I use cURL to fetch his page and then look in the HTML for the link; if I find it, all is OK.

this is what i have right now:



<?php
$result = 'more html <a href="forums.php?asd=1">whatever</a> more html';

$full_url = 'http://www.dynamicdrive.com/forums.php?asd=1';

preg_match ("|<[aA] (.+?)".basename($full_url)."(.+?)>(.+?)<\/[aA]>|i", $result, $matches);


echo basename($full_url)."<br><br>";
if (count($matches) > 0)
{
echo "found";
}
else
{
echo "not found";
}
?>



copy and paste that into a php file and you will see: not found.
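
A likely reason for the "not found" here is that the value returned by basename() goes into the pattern unescaped, so the ? in "forums.php?asd=1" acts as a quantifier instead of a literal question mark. A minimal sketch of one possible fix, using preg_quote(); the helper name linkPattern() is made up for this example and the rest of the pattern is left exactly as posted:

```php
<?php
// linkPattern() is a hypothetical helper name, not something from the thread.
function linkPattern(string $fullUrl): string
{
    // preg_quote() backslash-escapes regex metacharacters such as . and ?,
    // plus the delimiter passed as the second argument.
    $needle = preg_quote(basename($fullUrl), '|');
    return "|<[aA] (.+?)" . $needle . "(.+?)>(.+?)<\/[aA]>|i";
}

$result   = 'more html <a href="forums.php?asd=1">whatever</a> more html';
$full_url = 'http://www.dynamicdrive.com/forums.php?asd=1';

// Original approach: "p?" means "an optional p", so the literal "?" is never matched.
$raw = "|<[aA] (.+?)" . basename($full_url) . "(.+?)>(.+?)<\/[aA]>|i";

$rawHit    = preg_match($raw, $result);
$quotedHit = preg_match(linkPattern($full_url), $result);

echo $rawHit ? "found" : "not found", "<br>";
echo $quotedHit ? "found" : "not found", "<br>";
```

The same escaping also stops the dot in "forums.php" from matching any character.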

nicmo
11-23-2010, 10:55 AM
Another example where preg_match fails. strpos, however, can find it. My problem is that I need to make sure it's inside an A tag, to make sure it's a proper link.



<?php
$result = '<strong></strong><font face="Verdana" size="2"><br>
» </font><strong><a href="OtherMethods.php"><font face="Verdana" size="2">Other
Methods</font></a></strong>';

$full_url = 'http://www.site.com/OtherMethods.php';

preg_match ("|<[aA] (.+?)".basename($full_url)."(.+?)>(.+?)<\/[aA]>|i", $result, $matches);


echo basename($full_url)."<br><br>";

$mystring = $result;
$findme = basename($full_url);
$pos = strpos($mystring, $findme);

if ($pos === false) {
echo "The string was not found in the string <br>";
} else {
echo "The string was found in the string";
echo " and exists at position $pos <bR>";
}

if (count($matches) > 0)
{
echo "preg_match found";
}
else
{
echo "preg_match not found";
}
?>
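
One plausible explanation for this second failure is the line break inside the link text: by default the dot in (.+?) does not match a newline, so the pattern can never reach </a>. The sketch below runs the same pattern with and without the s modifier on a shortened copy of the markup. It assumes the only change wanted is the modifier; preg_quote() is added so the dot in the filename is matched literally.

```php
<?php
// Shortened copy of the markup above: the link text wraps onto a second line.
$result = "<strong><a href=\"OtherMethods.php\"><font face=\"Verdana\" size=\"2\">Other\nMethods</font></a></strong>";
$needle = preg_quote(basename('http://www.site.com/OtherMethods.php'), '|');

// Without the s modifier the dot stops at the line break inside the link text.
$withoutS = preg_match("|<[aA] (.+?)" . $needle . "(.+?)>(.+?)<\/[aA]>|i", $result);

// With the s modifier the dot also matches newlines.
$withS = preg_match("|<[aA] (.+?)" . $needle . "(.+?)>(.+?)<\/[aA]>|is", $result);

echo $withoutS ? "found" : "not found", "<br>";
echo $withS ? "found" : "not found", "<br>";
```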

james438
11-23-2010, 06:46 PM
I notice you are still using your pcre as opposed to the ones bluewalrus and myself suggested. At least try not to use the pipe: | as a delimiter. The pipe has special meaning in pcre as the OR operator.

nicmo
11-23-2010, 07:29 PM
how do i use this?


"/<a href=\"(.*?)"/i"

Where do I tell it what to look for?

james438
11-23-2010, 09:31 PM
Don't use it. It does not work. See my earlier posts and take a closer look at this post (http://www.dynamicdrive.com/forums/showpost.php?p=242052&postcount=8)

nicmo
11-23-2010, 09:52 PM
Isn't that just doing what basename() does?

james438
11-24-2010, 05:42 AM
preg_match ("|<[aA] (.+?)".basename($full_url)."(.+?)>(.+?)<\/[aA]>|i", $result, $matches);

When you insert basename($full_url) into a pcre like the above example, PHP concatenates its return value into the pattern string before preg_match ever sees it. The value goes in as is, so any character in it that has a special meaning in pcre, such as the dot or the question mark in "forums.php?asd=1", is treated as part of the expression rather than as literal text. If you build a pattern this way, run the value through preg_quote() first.

"/<a href=\"(.*?)"/i" does not work for a couple reasons. The most obvious is that the second double quote is not escaped so that it is not recognized as a literal double quote. It should be "/<a href=\"(.*?)\"/i"

The other problem is that it is too simple and only accounts for very limited types of input. He is, however, on the right track. I also know bluewalrus has a fair amount of experience with pcre as well.

The example script I gave you in an earlier post should find what you are looking for and put it into the variable $found. With $found compare it to your basename($full_url).

For example try adding the following:


$basename_url=basename($full_url);
if ($found=="$basename_url") echo "they matched";
It should look something like this:

<?php
$result = '<strong></strong><font face="Verdana" size="2"><br>
» </font><strong><a href="OtherMethods.php"><font face="Verdana" size="2">Other
Methods</font></a></strong>';
preg_match ('/(<a href=\")(.*?)(\?|\/\"|\")/i', $result, $matches);
$match=$matches[2];
if (substr($match,-1,1)=='/') $match=substr_replace($match,'',-1,1);
$match=explode("/","$match");
$found=end($match);
$full_url='http://www.site.com/OtherMethods.php';
$basename_url=basename($full_url);
print $found;
if ($found=="$basename_url") echo "<br>they matched";
else echo "<br>not found";
?>

nicmo
11-25-2010, 06:42 PM
It won't work: $result is a full HTML page with many links, pulled by cURL. I need something that looks inside all links for my basename(); that's what my broken preg_match does.

james438
11-26-2010, 01:08 AM
just change preg_match to preg_match_all

However, now you will need to put $match into a loop of some sort, probably a "while" loop applying

$full_url='http://www.site.com/OtherMethods.php';
$basename_url=basename($full_url);
if ($found[$i]=="$basename_url") echo "<br>they matched"; // $i: index of the current match inside the loop
else echo "<br>not found";
to each part of the array. The above will need to be modified slightly.

If you are having trouble with this let me know.

nicmo
11-27-2010, 10:14 PM
Hey James, this is almost working. Right now it only fails on links that contain variables.

Like: "http://www.site.com/OtherMethods.php?error=1&whatever=2"

preg_match only saves "http://www.site.com/OtherMethods.php?" from that URL. Any way to keep the entire URL?

james438
11-28-2010, 04:12 AM
can I see what the script you are using currently looks like? Remember, preg_match will only pull the first result it finds and that's it. preg_match_all will pull all of the results and store it into a multidimensional array, which I am not a big fan of pulling results from to be honest, but it all depends on your needs.

The script I presented only pulls
OtherMethods.php?error=1&whatever=2 from
http://www.site.com/OtherMethods.php?error=1&whatever=2 which is what I thought you wanted. If you want the entire url, I can probably write something up for you, but it will be different than what I was writing. I thought you specifically wanted the basename($full_url) as opposed to the full url for comparison purposes... Sorry about the lack of understanding of what you want on my part.
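
For reference, here is a small sketch of the two array layouts preg_match_all can produce. The two-link sample string and the simplified pattern are made up for illustration; PREG_SET_ORDER is a standard flag that groups the captures per match, which some people find easier to loop over than the default layout.

```php
<?php
// Hypothetical two-link sample page.
$html    = '<a href="one.php">1</a> <a href="two.php">2</a>';
$pattern = '/<a href="(.*?)"/i';

// Default layout (PREG_PATTERN_ORDER): $byGroup[1] lists every capture of group 1.
preg_match_all($pattern, $html, $byGroup);

// PREG_SET_ORDER: one sub-array per match, holding the full match and its groups.
preg_match_all($pattern, $html, $byMatch, PREG_SET_ORDER);

foreach ($byMatch as $match) {
    echo $match[1], "<br>"; // href value of this link
}
```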

nicmo
11-28-2010, 10:50 AM
preg_match_all ('/(<a href=\")(.*?)(\?|\/\"|\")/i', $result, $matches);

for ($i=0;$i<sizeof($matches[2]);$i++)
{
    if (basename($matches[2][$i]) == basename($links_full_url))
    {
        echo "found on link $i<br>";
        break;
    }
}


Where $result is some site's full HTML source, fetched via cURL.

This is what is working for me, but I have that variables problem. I do want the basename(), or whatever is after the last "/". basename() on "http://www.site.com/OtherMethods.php?error=1&whatever=2" still returns "OtherMethods.php?error=1&whatever=2", and that's what I need to look for, because some sites use WordPress or another CMS and their linking systems use variables to find the right pages.
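
One caveat with calling basename() on a whole URL: it simply splits on "/", so a slash inside the query string changes the result. A possible alternative, sketched here under the assumption that "last path segment plus query string" is the value to compare, is to split the URL with parse_url() first. The helper name urlTail() is made up for this example.

```php
<?php
// urlTail() is a hypothetical helper: last path segment plus the query string, if any.
function urlTail(string $url): string
{
    $parts = parse_url($url);
    if ($parts === false) {
        return ''; // parse_url() returns false for seriously malformed URLs
    }
    // basename() also ignores a trailing slash, so "/forums/" yields "forums".
    $tail = basename($parts['path'] ?? '');
    if (isset($parts['query'])) {
        $tail .= '?' . $parts['query'];
    }
    return $tail;
}

echo urlTail('http://www.site.com/OtherMethods.php?error=1&whatever=2'), "\n";
echo urlTail('/forums/'), "\n";
```

For URLs without slashes in the query string, this gives the same value as basename() alone.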

james438
11-28-2010, 01:58 PM
like this?


preg_match_all ('/(<a href=\")(.*?)(\/\"|\")/i', $result, $matches);
for ($i=0;$i<sizeof($matches[2]);$i++)
{
if (basename($matches[2][$i]) == basename($links_full_url))
{
echo "found on link $i<br>";
break;
}
}

nicmo
11-28-2010, 04:52 PM
No, preg_match already stores the URLs in the array without the variables :(

The array also seems to be broken when printed with print_r(), so I guess the "&" from the URLs is breaking it?

james438
11-28-2010, 07:53 PM
It is working for me. When using print_r() try using your browser to view the source. When you do you will see that the results are a little bit different. You will also notice that the match is found.

Here is the script I was working with:

<pre><?php
$links_full_url="http://www.site.com/OtherMethods.php?error=1&whatever=2";
$g=basename($links_full_url);
$result = '<strong></strong><font face="Verdana" size="2"><br>
» </font><strong><a href="http://www.site.com/OtherMethods.php?error=1&whatever=2"><font face="Verdana" size="2">Other
Methods</font></a></strong>';
preg_match_all ('/(<a href=\")(.*?)(\/\"|\")/i', $result, $matches);
for ($i=0;$i<sizeof($matches[2]);$i++)
{
if (basename($matches[2][$i]) == basename($links_full_url))
{
echo "found on link $i<br>";
break;
}
}
print_r($matches);
?></pre>
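
As an alternative to viewing the page source, the print_r() output can be escaped before it is echoed, so the browser shows the captured tags as text instead of rendering them. A minimal sketch, assuming the output goes to an HTML page; the sample markup is shortened from the script above.

```php
<?php
$result = '<a href="http://www.site.com/OtherMethods.php?error=1&whatever=2">Other Methods</a>';
preg_match_all('/(<a href=\")(.*?)(\/\"|\")/i', $result, $matches);

// print_r(..., true) returns the dump as a string instead of printing it;
// htmlspecialchars() turns <, >, & and " into entities so they display literally.
$dump = htmlspecialchars(print_r($matches, true));

echo '<pre>', $dump, '</pre>';
```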

nicmo
11-28-2010, 08:58 PM
Thanks James, it's working. I had the wrong preg_match, for god knows why...

I started testing, and after 4 positive site checks I got my first problem.



<a class="fadeThis " href="http://www.site.com/OtherMethods.php">


Because of the class, it is not found. Also, some sites use:



<a
href="http://www.site.com/OtherMethods.php">


No space between a and href, just a line break. That also cannot be found, of course. I guess this can be done by adding some extras to the preg_match; could you help me once more? :P

james438
11-28-2010, 10:41 PM
My goal is to try and keep the preg_match_all simple as it is with most scripting. We can get more complicated as needed. For now try replacing


preg_match_all ('/(<a href=\")(.*?)(\/\"|\")/i', $result, $matches);
with

preg_match_all ('/(href=\")(.*?)(\/\"|\")/i', $result, $matches);

nicmo
11-29-2010, 12:24 AM
Would you have any other suggestion that keeps the "<a "? I really need to make sure it's a proper link, or else people will cheat my system.

james438
11-29-2010, 03:27 AM
how about

preg_match_all ('/(<a.*?href=\")(.*?)(\/\"|\")/i', $result, $matches);

nicmo
11-29-2010, 11:24 AM
That does work with classes, onclicks, etc. However, for sites that for some reason lay out their HTML like http://www.secnews.gr/ does, the preg_match can't find any links. Does .*? include "" or line breaks?



<a
href="http://www.site.com/OtherMethods.php">

james438
11-29-2010, 03:37 PM
try this

preg_match_all ('/(<a.*?href=\")(.*?)(\/\"|\")/is', $result, $matches);

In the pcre that we have been working with thus far we have been using the i modifier, which makes the pcre case insensitive. The s modifier tells the pcre to let the dot metacharacter match newlines as well. By default the dot matches any character except a newline.

Sorry about earlier. When you mentioned line breaks I thought you were talking about <br>.
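
One side effect may be worth checking: nothing stops .*? from running past the end of the <a> tag until it finds an href somewhere later, and with the s modifier it can do so across lines. The sketch below compares the pattern above with a stricter variant that is only a suggestion, not something tested against real member pages: \s requires whitespace after "<a", and [^>]*? cannot cross the closing > of the tag. The sample markup is made up.

```php
<?php
// Hypothetical page: a named anchor and a <link> element, but no real hyperlink.
$html = "<a name=\"top\"></a>\n<link rel='prev' href=\"http://www.site.com/OtherMethods.php\" />";

// Pattern from the post above: .*? may run past the end of the <a> tag.
$loose = '/(<a.*?href=\")(.*?)(\/\"|\")/is';

// Stricter variant: whitespace (a space or a newline) must follow "<a",
// and [^>]*? stays inside that one tag.
$strict = '/(<a\s[^>]*?href=\")(.*?)(\/\"|\")/is';

$looseHits  = preg_match_all($loose, $html, $m1);
$strictHits = preg_match_all($strict, $html, $m2);

// A real link split across two lines is still found by the stricter pattern.
$wrapped     = "<a\nhref=\"http://www.site.com/OtherMethods.php\">";
$wrappedHits = preg_match_all($strict, $wrapped, $m3);

echo "loose: $looseHits, strict: $strictHits, wrapped: $wrappedHits";
```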

nicmo
11-29-2010, 06:36 PM
That's awesome James, you saved my bacon :) Regex is a little too much info for me right now; hopefully I will grasp it soon.

So far, all the tests are working.

nicmo
11-29-2010, 07:43 PM
OK, finally found another case where this won't work; should be simple. Some sites use:


<a href='http://www.site.com/OtherMethods.php'>

instead of


<a href="http://www.site.com/OtherMethods.php">

' for "

How should we account for this as well?

james438
11-29-2010, 09:08 PM
try this:
'/(<a.*?href=\"|\')(.*?)(\/\"|\'|\")/is'

regex can be a bit complicated. What makes it more complicated is that there is seldom a reason to use it, so it's easy to get out of practice. It is also best to use it as little as possible since it is processor heavy.

getting back to the regex at hand though. I really am only barely grasping at what you are trying to do, but it seems well enough that I am able to be of some use to you ;) When using href it is best to use the double quotes so as to avoid potential errors. I believe the errors I speak of have to do with relative versus absolute urls, but this has not been an issue for long enough that I forget exactly what the problem was with using single quotes with href.

Let me know if the regex I posted above fails and with what and we will address that when the time comes.

nicmo
11-29-2010, 10:35 PM
That kind of did its job, but the array was getting all messed up with the extra ' all over the place. So I did this:


$result = str_replace("'", '"', $result);

just before preg_match_all, seems it should be fixed. For now... :)

james438
11-29-2010, 10:47 PM
This time I found an error. try the following instead

preg_match_all ('/(<a.*?href=\"|<a.*?href=\')(.*?)(\/\"|\'|\")/is', $result, $matches);
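Otherwise, in the earlier version, the first group's second alternative is just a bare ', so any text after a stray single quote can match, even outside an <a> tag.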

The pcre created thus far could probably still be improved in several ways, so if you see something that needs to be modified just let me know.
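
One such improvement, offered only as a sketch: capture the opening quote and require the same character to close the value with a backreference, so single-quoted and double-quoted hrefs are handled without rewriting the page with str_replace. The helper name hasLinkTo() and the stricter `<a\s[^>]*?` prefix are assumptions made for this example, not part of the patterns posted above.

```php
<?php
// hasLinkTo() is a hypothetical helper combining the pieces discussed in this thread.
function hasLinkTo(string $html, string $fullUrl): bool
{
    // (["\']) captures the opening quote; \1 requires the same quote to close the value.
    $pattern = '/<a\s[^>]*?href\s*=\s*(["\'])(.*?)\1/is';
    preg_match_all($pattern, $html, $matches);

    $wanted = basename($fullUrl);
    foreach ($matches[2] as $href) {
        // basename() ignores a trailing slash, so "/forums/" compares as "forums".
        if (basename($href) === $wanted) {
            return true;
        }
    }
    return false;
}

$page = "<a class=\"fadeThis \" href='http://www.site.com/OtherMethods.php'>Other Methods</a>";
echo hasLinkTo($page, 'http://www.site.com/OtherMethods.php') ? "found" : "not found";
```

This still compares only the basename, as in the loop earlier in the thread, so two different sites with the same filename would count as a match.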