
View Full Version : PHP preg_replace adding http:// only when missing



qwikad.com
09-12-2013, 06:31 PM
Often people post links in our classifieds that start with just www. or with nothing at all (for instance www.blahblah.com or blahblah.com).

I need a preg_replace function that will add http:// to any link that doesn't have it.

So if a link is: www.blahblah.com it should become http://www.blahblah.com
If a link is blahblah.com it should become http://blahblah.com

This is the preg_replace line that needs to be filled: $text = preg_replace( '//', "", $text);

Thank you for any suggestions.

Beverleyh
09-12-2013, 06:48 PM
Have you tried anything from Google? This came up first for me and looks like it does what you want: http://stackoverflow.com/questions/1932292/need-preg-replace-help-in-php

Deadweight
09-12-2013, 08:05 PM
Are you wanting something like this:

<?php

$string = 'blahblah.com';
$add = '';

$add .= strpos($string,'http://') !== false ? '' : 'http://';
$add .= strpos($string,'www.') !== false ? '' : 'www.';

$add .= $string;

echo $add;

?>

Oh, since you don't want to add www., delete this line:

$add .= strpos($string,'www.') !== false ? '' : 'www.';

OOps i should have read the whole thing. You need to use a specific tag xD

qwikad.com
09-12-2013, 08:22 PM
I searched for 20-30 minutes before I posted that question here. I don't know how you're able to find the right solution so quickly. :)

Can you verify that this is a viable solution:

$text = preg_replace( '/^(?:http:\/\/)?(.*)/', "http://$1", $text);


Do you think it's going to add http:// to any domain name that doesn't have it?

What about a domain name that already has a http://? Is it still going to work? On that page: http://stackoverflow.com/questions/1932292/need-preg-replace-help-in-php it's not clear if the guy who posted the solution has solved that issue.


Thanks.

Deadweight
09-12-2013, 08:42 PM
$string = 'blahblah.com';
if(preg_match('/http:\/\//',$string)){
echo 'found';
}else{
echo 'not found';
}

I have a reading problem today..... You need replace xD

Also it does work with the solution above... Do you have to use replace or may you use anything?

Good free place to test my php if you like:
http://www.compileonline.com/execute_php_online.php

Beverleyh
09-12-2013, 08:46 PM
I haven't tested anything (currently on iPhone) but you could feed in a few sample variables on a test page and see what pops out the other side? That's all I'd do.

traq
09-12-2013, 09:39 PM
Can you verify that this is a viable solution:

$text = preg_replace( '/^(?:http:\/\/)?(.*)/', "http://$1", $text);

It depends on your problem. This will take any string (which may or may not have "http://" at the beginning) and make sure it has "http://" at the beginning. So, for example, "blahblah.com" will become "http://blahblah.com".

However, "Please visit blahblah.com today!" will become "http://Please visit blahblah.com today!".

So, have you already captured the domains as their own strings? If so, yes, this will work fine. If not, then no.

URLs are typically identified by "http://" and/or "www." at the beginning. It gets very difficult to reliably identify them without one of those.
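
To make that concrete, here is a minimal sketch that can be run locally. The helper name forceHttpPrefix is made up for illustration; the pattern is the one-liner quoted above, unchanged.

```php
<?php
// Hypothetical helper wrapping the one-liner under discussion.
function forceHttpPrefix(string $text): string
{
    // ^ anchors the match to the start of the whole string, so the prefix
    // always lands there, whether or not the string is a bare URL.
    return preg_replace('/^(?:http:\/\/)?(.*)/', 'http://$1', $text);
}

echo forceHttpPrefix('blahblah.com'), PHP_EOL;
echo forceHttpPrefix('http://blahblah.com'), PHP_EOL;
echo forceHttpPrefix('Please visit blahblah.com today!'), PHP_EOL;
```

The first two calls give the same clean URL; the third shows why the pattern is only safe once the URL has been isolated from the surrounding text.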

qwikad.com
09-12-2013, 10:01 PM
Funny, I just tested it and it does exactly what you've said.

Ok, can you suggest a partial solution to this then? Forget about blahblah.com. Let's say I only want to target URLs that have www. in them. It will take care of 95% of the cases.

Can you suggest something workable that will add http:// to URLs that have www. but are missing http:// and if URLs already have http:// they should not be changed in any shape or form?

Thank you.

traq
09-12-2013, 10:15 PM
So, the URLs are not yet separated from the rest of the text, is that correct?

I will figure something out when I get home today. In the meantime, you might look through Google (or stackoverflow) for "PCRE" + "URL".

qwikad.com
09-12-2013, 10:26 PM
Right. Just picture a classified ad. Lots of text and somewhere in the middle there's a URL. Like in this example:

http://qwikad.com/0/posts/12-business-opps/268-network-marketing/80718-Restore-Renew-Revive-with-Trevo.html

Looking forward to checking out your solution. Will google what you've suggested.

traq
09-13-2013, 01:46 AM
Try this - #\b(?:http://)?(www\.)?(([a-z0-9_-]{2,}\.)+[a-z]{2,}(/[\w\+\-\?\&\;]*)*)\b#i



# delimiter (start of pattern)
\b word boundary
(?:http://)? optional, non-capturing "http://"
(www\.)? optional, capturing (will be $1) "www."
( start of capturing group (will be $2)
( start of sub-pattern (will be $3)
[a-z0-9_-]{2,} something that looks like a domain name
\. followed by a dot
)+ end of $3 - one or more matches
[a-z]{2,} followed by something that looks like a TLD
( start of sub-pattern (will be $4)
/ a slash
[\w\+\-\?\&\;]* followed by something that looks like a path and/or query string
)* end of $4 - zero or more matches
) end of $2 - one match only ($2 also contains $3 and $4)
\b word boundary
#i delimiter (end of pattern), case-insensitive
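
The same breakdown can also live inside the pattern itself by using PCRE's x (extended) modifier. This is only a sketch of that idea: the delimiter is switched from # to ~ because # starts a comment in extended mode, and the sample text is made up for illustration.

```php
<?php
// Same pattern as above, written with the x modifier so whitespace is
// ignored and each piece can carry its own comment.
$regexp = '~
    \b                        # word boundary
    (?:http://)?              # optional, non-capturing "http://"
    (www\.)?                  # $1: optional "www."
    (                         # $2: host plus optional path
        ([a-z0-9_-]{2,}\.)+   # $3: one or more dot-terminated labels
        [a-z]{2,}             # something that looks like a TLD
        (/[\w\+\-\?\&\;]*)*   # $4: zero or more path segments
    )
    \b                        # word boundary
~ix';

$text = 'Visit www.SoCal.trevobuilder.com or trevocorporate.com/coach/sjahr today.';
echo preg_replace($regexp, 'http://$1$2', $text), PHP_EOL;
```

The capture groups keep the same numbers as in the breakdown, so the replacement strings used elsewhere in this thread should work with either form.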


tested:

$text = "Looking for marketers who want to work. Ready to change your life? Trevo offers one product - an all natural, vegan, kosher nutritional supplement with 174 nutraceuticals. Visit www.SoCal.trevobuilder.com for product info & to purchase. Find me on facebook at Trevo SoCal or @TrevoSoCal on twitter. Low start up, cost covers first 3 bottles or larger packages available. No registration fees. Start making money this week! Visit trevocorporate.com/coach/sjahr to register on the Presidential Elite team.";

$regexp = "#\b(?:http://)?(www\.)?([a-z0-9_-]{2,}\.[a-z]{2,}(/[\w\+\-\?\&\;]*)*)\b#i";

$hypertext = preg_replace( $regexp,'<a href="http://$1$2">$1$2</a>',$text );

print htmlentities( $hypertext );

Note that this pattern will match most (not all) valid URLs, and will not match most (not all) non-URL text. It might be a good, workable compromise for your purpose.

I changed this after posting it. Compare and make sure you're using this version.

james438
09-13-2013, 02:22 AM
Why do you want to add the http:// to each url if it is not there? Are you looking to hyperlink the urls or do you have a script that already does that? I am also interested, because I like to collect pcre patterns/scripts that relate to manipulating urls.

EDIT: Actually, I notice that traq's script does just that, but I do not believe that was specifically asked for, but it may be what you were looking for. I have one that does what traq posted, but mine specifically avoids hyperlinking urls that are missing the www prefix.

traq
09-13-2013, 02:39 AM
I notice that traq's script does just that [turns urls into hyperlinks], but I do not believe that was specifically asked for...

Whup, you're right. I wrote that without re-reading the thread. If you don't want the URLs hyperlinked, just replace
preg_replace( $regexp,'<a href="http://$1$2">$1$2</a>',$text );
with
preg_replace( $regexp,'http://$1$2',$text );

qwikad.com
09-13-2013, 02:55 AM
traq I realized that was an issue and changed it to


preg_replace( $regexp,'http://$1$2',$text );

Still it's not doing what I want it to do. You see, I've made a lot (and I mean a lot) of changes to my markdown over the months and I am wondering if something else is now interfering with your code. The way the URLs looked was something like this:

http://www.someurl.comwww.someurl.com

...and despite the fact they had http:// in them, they stopped being clickable (all URLs in the entire website stopped being clickable).

I wonder if the "print htmlentities" had something to do with that.

I also wonder if a one liner (like the one I mentioned before) if modified can solve the issue. Is there a way to tell this line to add http:// only if www. is present in a URL and ignore a URL altogether if it already has http:// ? (sorry for being redundant).


$text = preg_replace( '/^(?:http:\/\/)?(.*)/', "http://$1", $text);


Thanks!

traq
09-13-2013, 05:46 AM
...and despite the fact they had http:// in them, they stopped being clickable (all URLs in the entire website stopped being clickable).
I wonder if the "print htmlentities" had something to do with that.
I used htmlentities (http://php.net/htmlentities) for the test so you could see the HTML output. Simply do print instead.



...and despite the fact they had http:// in them, they stopped being clickable (all URLs in the entire website stopped being clickable).

the text "http://" does not create hyperlinks. If you removed the anchor markup from the replacement pattern (i.e., you replaced '<a href="http://$1$2">$1$2</a>' with 'http://$1$2'), then I would not expect the URLs to be hyperlinked.

You hadn't mentioned markdown earlier. Are you using a markdown parser? At what point?



I also wonder if a one liner (like the one I mentioned before) if modified can solve the issue. Is there a way to tell this line to add http:// only if www. is present in a URL and ignore a URL altogether if it already has http:// ? (sorry for being redundant).

I would hesitate to say until I know more about what you are actually doing - the above makes it clear that we do not have all the necessary information.

qwikad.com
09-13-2013, 01:05 PM
traq, I spent some time browsing the web and I came upon this little function:


<?php
/*** example usage ***/
$string='http://www.phpro.org';
echo makelink($string);

/**
*
* Function to make URLs into links
*
* @param string The url string
*
* @return string
*
**/
function makeLink($string){

/*** make sure there is an http:// on all URLs ***/
$string = preg_replace("/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i", "$1http://$2",$string);
/*** make all URLs links ***/
$string = preg_replace("/([\w]+:\/\/[\w-?&;#~=\.\/\@]+[\w\/])/i","<a target=\"_blank\" href=\"$1\">$1</A>",$string);
/*** make all emails hot links ***/
$string = preg_replace("/([\w-?&;#~=\.\/]+\@(\[?)[a-zA-Z0-9\-\.]+\.([a-zA-Z]{2,3}|[0-9]{1,3})(\]?))/i","<A HREF=\"mailto:$1\">$1</A>",$string);

return $string;
}

?>


If I take that first line and re-do it to fit my markdown, do you think it may do what I want it to?



$text = preg_replace('/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i', "$1http://$2", $text);

Is it going to add http:// to all links that have www and skip those that already have http://? I'd appreciate if you test it for me. I can't test it live on our classifieds until late at night when I don't have so many people posting ads. Don't want to make their links go all berserk if something is wrong with the code, even temporarily.


Thank you.

Deadweight
09-13-2013, 06:40 PM
$string = 'Looking for marketers who want to work. Ready to change your life? Trevo offers one product - an all natural, vegan, kosher nutritional supplement with 174 nutraceuticals. Visit www.SoCal.trevobuilder.com for product info & to purchase. Find me on facebook at Trevo SoCal or @TrevoSoCal on twitter. Low start up, cost covers first 3 bottles or larger packages available. No registration fees. Start making money this week! Visit trevocorporate.com/coach/sjahr to register on the Presidential Elite team. Welcome to facebook.com.';
$new_string = '';
$string_explode = explode(' ',$string);

foreach($string_explode as $key){
$check = strpos($key,'.') !== false?weblink($key):$key.' ';
echo $check;
}

function weblink($test){
$ch = curl_init();

curl_setopt($ch, CURLOPT_URL, $test);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);

$data = curl_exec($ch);
$info = curl_getinfo($ch);
$code = $info['http_code'];

curl_close($ch);

$good_array = array(100,101,200,201,202,203,206,300,301,302,304);

if(in_array($code,$good_array)){
$add = preg_match('/http:\/\//',$test)?'':'http://';
return '<a href="'.$add.$test.'">'.$test.'</a> ';
}else{
return $test.' ' ;
}
}

FYI, I added something to test whether it works. At the very end I added "Welcome to facebook.com." to check whether the trailing period would break my code, and it doesn't. However, if the URL is followed by ? or ! or some other odd character, the site check will fail. I will have to fix that. If you want me to fix this problem, let me know. It's easy.

Another note: "trevocorporate.com/coach/sjahr" doesn't work, so it will not place a link around the website.
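
For the ? and ! case mentioned above, one possible fix (a sketch, not part of the code posted here) is to peel sentence punctuation off each token before handing it to weblink() and to put it back afterwards. The helper name splitTrailingPunctuation and the list of punctuation characters are assumptions.

```php
<?php
// Hypothetical helper: split a whitespace-delimited token into the part that
// might be a URL and any sentence punctuation stuck to its end.
function splitTrailingPunctuation(string $token): array
{
    $core = rtrim($token, '.,!?;:');
    // The cast guards against substr() returning false on older PHP versions.
    $tail = (string) substr($token, strlen($core));
    return [$core, $tail];
}

[$core, $tail] = splitTrailingPunctuation('facebook.com!');
echo $core, ' | ', $tail, PHP_EOL;
```

The loop could then check $core and append $tail after whatever comes back.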

traq
09-14-2013, 02:07 AM
traq, I spent sometime browsing the web and I came upon this little function:

[ . . . code . . . ]

If I take that first line and re-do it to fit my markdown, do you think it may do what I want it to?
TBH, that seems to attempt the same task as my code snippet, but doesn't do as well. It won't catch URLs without "www", for example, nor URLs with a path component.
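
A quick way to see what that one-liner does and does not touch, without going near a live site, is a small local script. This is only a sketch: the helper name addHttpToWww and the sample strings are made up for illustration, and the pattern is copied from the post above.

```php
<?php
// Hypothetical wrapper around the one-liner so it can be tried on samples.
function addHttpToWww(string $text): string
{
    // Group 1 must match one character that is neither a word character
    // nor a slash, so "www." right after "http://" is left alone.
    return preg_replace(
        '/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i',
        '$1http://$2',
        $text
    );
}

$samples = [
    'Visit www.example.com today',        // www. in mid-sentence
    'Visit http://www.example.com today', // already has http://
    'Visit example.com today',            // no www.
    'www.example.com starts this line',   // www. at the very start
];

foreach ($samples as $sample) {
    echo addHttpToWww($sample), PHP_EOL;
}
```

Only the first sample is changed. The last one is skipped because the pattern needs a character in front of "www.", so a URL at the very start of the text is not picked up.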

As for your "markdown," I don't know, because you have not explained how/ when/ what you are doing with markdown (beyond the fact that you're using it).



I'd appreciate if you test it for me. I can't test it live on our classifieds until late at night when I don't have so many people posting ads. Don't want to make their links go all berserk if something is wrong with the code, even temporarily.

I thought you might be a "cowboy coder." It would be well worth your while to do your dev/ testing locally, instead of on your live site. It is much less stressful, and much easier to recover from mistakes. It is not difficult at all to set up Apache + PHP on your local computer.



[ ... ]
FYI, I added something to test whether it works. At the very end I added "Welcome to facebook.com." to check whether the trailing period would break my code, and it doesn't. However, if the URL is followed by ? or ! or some other odd character, the site check will fail. I will have to fix that. If you want me to fix this problem, let me know. It's easy.

Another note: "trevocorporate.com/coach/sjahr" doesn't work, so it will not place a link around the website.

@Crazykld69, I am not sure what you are trying to contribute with this post. The code you offered doesn't seem to address any of the OP's questions.

If you do have something to contribute, please edit your post to more clearly explain.

qwikad.com
09-14-2013, 04:37 AM
traq, I just tested it. The code is doing exactly what I want it to do. I will still need to post a bunch of ads to ensure it works in all possible situations, but as of now it seems to be working perfectly. It's this one:


$text = preg_replace('/([^\w\/])(www\.[a-z0-9\-]+\.[a-z0-9\-]+)/i', "$1http://$2", $text);

In all honesty, I tried to implement the function you suggested, it wouldn't work. That's why I was looking for a one-liner to at least help the URLs that start with www. I am pretty sure if you had access to my files you would make it work, but unfortunately you are limited by my weak attempts to explain what I want to accomplish. But thank you for nudging me in the right direction nonetheless.

traq
09-14-2013, 06:18 AM
You can use my regex as a "one-liner" - no problem. I didn't realize that's why you were still looking for another answer. (You just want "http://" added, correct? no HTML added?)


$text = preg_replace(
"#\b(?:http://)?(www\.)?([a-z0-9_-]{2,}\.[a-z]{2,}(/[\w\+\-\?\&\;]*)*)\b#i"
,"http://$1$2"
,$text
);

If you're still having trouble integrating it with markdown, you'll need to explain how you're using it.

In any case, glad to have helped.

Deadweight
09-14-2013, 08:51 AM
Actually, you are incorrect. Not sure why you have to be so stuck up but anyways I'll explain it to you.
I think the main point of this is for him to find, in a string, all the websites that are actually websites. You are using preg_replace(); however, there is more than one way of doing something. If you actually view the output of your code, it doesn't fully wrap the URL. On another note, if you add "team.error" to the end of the string, well, it looks like we have a new type of website.

What my code does is actually check whether the website is REAL and FULLY wrap it in HTML. If he doesn't want the HTML and just the string, well, that's easy to do.
Why don't you look at and test the code before stating:

@Crazykld69, I am not sure what you are trying to contribute with this post. The code you offered doesn't seem to address any of the OP's questions.

If you do have something to contribute, please edit your post to more clearly explain.
I find that utterly rude, because the only reason I am here is to help people out and contribute.

Not sure how my code doesn't help him out, but it:
1. gets the string and checks whether anything in it could be a website
2. if it is an actual website, FULLY and CORRECTLY wraps the string in HTML
3. doesn't mark fake websites as websites.

Edited it a little:

$string = 'Looking for marketers who want to work. Ready to change your life? Trevo offers one product - an all natural, vegan, kosher nutritional supplement with 174 nutraceuticals. Visit SoCal.trevobuilder.com for product info & to purchase. Find me on facebook at Trevo SoCal or @TrevoSoCal on twitter. Low start up, cost covers first 3 bottles or larger packages available. No registration fees. Start making money this week! Visit trevocorporate.com/coach/sjahr to register on the Presidential Elite team.error';
$new_string = '';
$string_explode = explode(' ',$string);

foreach($string_explode as $key){
$check = strpos($key,'.') !== false ? weblink($key):$key.' ';
echo $check;
}

function weblink($test){
$ch = curl_init();
$www = preg_match('/www\./',$test) ? $test : 'www.'.$test; // only prepend "www." when it is missing

curl_setopt($ch, CURLOPT_URL, $www);
curl_setopt($ch, CURLOPT_HEADER, true);
curl_setopt($ch, CURLOPT_NOBODY, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
curl_setopt($ch, CURLOPT_MAXREDIRS, 10);

$data = curl_exec($ch);
$info = curl_getinfo($ch);
$code = $info['http_code'];

curl_close($ch);

$good_array = array(100,101,200,201,202,203,206,300,301,302,304);

if(in_array($code,$good_array)){
$add = preg_match('/http:\/\//',$test)?'':'http://';
return '<a href="'.$add.$test.'">'.$test.'</a> ';
}else{
return $test.' ';
}
}

thanks
-DW

Beverleyh
09-14-2013, 11:01 AM
Thank you for the code explanation Crazykid69 - I believe that is all traq was asking for. He's a really helpful chap and wouldn't have meant anything to come across as stuck-up or rude - try not to read it that way (I didn't).

But in future, it would be great if you could provide the details as part of an all-inclusive answer because you may have interpreted something differently from the OP's question that might not have been immediately obvious to other contributors, or indeed the OP, if their information was a bit ambiguous.

I think it's good that you're taking time to help (there should be more helpful folks in life, don't you think?). We regular posters are always here to contribute, and part of that is asking for clarification at times, either from the OP or from other contributors. It's the best way to work towards an appropriate/comprehensive answer in this diverse community with so many ideas, backgrounds and strengths doing the rounds.

traq
09-15-2013, 03:15 AM
@Crazykld69, I think I phrased my post badly. I was concerned because I wasn't sure what you were trying to offer - thank you for your explanation. I understand wanting to be helpful - I appreciate it. I certainly didn't mean to be rude.

To answer your points,

... I'm not sure what you mean by my code "not fully wrapping." preg_replace does operate on the entire string.

... I did try out your code before asking for clarification. One of the OP's concerns was matching URLs with or without "http://" and/or "www." I agree; you're right that this will sometimes lead to false positives.
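
One possible way to cut down false positives such as "team.error" without a network lookup is to accept a match only when its last label is on an allowlist of TLDs. This is a different technique from anything posted in this thread, and only a sketch: the function name addSchemeIfKnownTld and the three-entry list are assumptions, and a real list would need to be far longer.

```php
<?php
// Hypothetical allowlist for illustration only.
$knownTlds = ['com', 'net', 'org'];

// The multi-subdomain pattern from earlier in the thread.
$regexp = '#\b(?:http://)?(www\.)?(([a-z0-9_-]{2,}\.)+[a-z]{2,}(/[\w\+\-\?\&\;]*)*)\b#i';

function addSchemeIfKnownTld(string $text, string $regexp, array $knownTlds): string
{
    return preg_replace_callback($regexp, function (array $m) use ($knownTlds): string {
        // $m[2] is host plus optional path; keep only the host.
        $host = explode('/', $m[2])[0];
        // Text after the last dot of the host, lower-cased.
        $tld = strtolower(substr(strrchr($host, '.'), 1));
        if (!in_array($tld, $knownTlds, true)) {
            return $m[0]; // not a known TLD: leave the text untouched
        }
        return 'http://' . $m[1] . $m[2];
    }, $text);
}

echo addSchemeIfKnownTld(
    'Visit trevocorporate.com/coach/sjahr on the team.error',
    $regexp,
    $knownTlds
), PHP_EOL;
```

The trade-off is that any TLD missing from the list is silently skipped, so this swaps false positives for possible false negatives.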

Deadweight
09-15-2013, 09:36 PM
@traq check your anchor wrap in this "SoCal.trevobuilder.com"

traq
09-16-2013, 01:30 AM
ahh, I see what you're talking about. That happened for two reasons:

First, as I mentioned in this post (http://www.dynamicdrive.com/forums/showthread.php?75060-PHP-preg_replace-adding-http-only-when-missing&p=299424#post299424), I modified the regex I was using after my initial response. The changes I made were specifically to allow matching URLs with multiple subdomains and subdomains other than "www". When I edited my post to show those changes, I changed the regex itself as well as my commentary on it. However, I overlooked the regex in the example I gave.

Second, when I posted the regex, I forgot to use [noparse] tags, so VBulletin thought I was posting a hyperlink and added [url] tags to parts of the regex (in a discussion about parsing URLs, the irony does not escape me). As you can see, the regex doesn't work properly with those tags inserted. What it comes down to is I tested the code before posting it, but it didn't occur to me to test it after posting.

Here's the regex, without the extraneous BBCode tags:


#\b(?:http://)?(www\.)?(([a-z0-9_-]{2,}\.)+[a-z]{2,}(/[\w\+\-\?\&\;]*)*)\b#i

And here's the correct version of the test:
<?php

$text = "Looking for marketers who want to work. Ready to change your life? Trevo offers one product - an all natural, vegan, kosher nutritional supplement with 174 nutraceuticals. Visit www.SoCal.trevobuilder.com for product info & to purchase. Find me on facebook at Trevo SoCal or @TrevoSoCal on twitter. Low start up, cost covers first 3 bottles or larger packages available. No registration fees. Start making money this week! Visit trevocorporate.com/coach/sjahr to register on the Presidential Elite team.";

$regexp = "#\b(?:http://)?(www\.)?(([a-z0-9_-]{2,}\.)+[a-z]{2,}(/[\w\+\-\?\&\;]*)*)\b#i";

$hypertext = preg_replace( $regexp,'<a href="http://$1$2">$1$2</a>',$text );

print htmlentities( $hypertext );

and the results:
Looking for marketers who want to work. Ready to change your life? Trevo offers one product - an all natural, vegan, kosher nutritional supplement with 174 nutraceuticals. Visit <a href="http://www.SoCal.trevobuilder.com">www.SoCal.trevobuilder.com</a> for product info & to purchase. Find me on facebook at Trevo SoCal or @TrevoSoCal on twitter. Low start up, cost covers first 3 bottles or larger packages available. No registration fees. Start making money this week! Visit <a href="http://trevocorporate.com/coach/sjahr">trevocorporate.com/coach/sjahr</a> to register on the Presidential Elite team.

thanks for pointing that out.