regex - extract data from web page



php_techy
06-12-2009, 05:41 AM
Hi,
My HTML looks like this:


<meta name="description" content="New info! Code: http://www.example/index.html Code: http://testing.com/fil" />
<!-- message -->
<div id="post_message_510223" class="vb_postbit"><font color="green"><font size="3">Temp</font></font><br />
<br />
<br />
<img src="http://sample/test.jpg" border="0" alt="" onload="NcodeImageResizer.createOn(this);" /><br />
<br />
<br />
info!<br />
<br />

<div style="margin:20px; margin-top:5px">
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 34px;
text-align: left;
overflow: auto">http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html</pre>
</div><br />

<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="
margin: 0px;
padding: 6px;
border: 1px inset;
width: 470px;
height: 1490px;
text-align: left;
overflow: auto">http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar</pre>

</div></div>


I want all the values that come after Code:</div> and sit between the <pre> tags, e.g.
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
http://www.sample1.com/part1.html
and
http://www.sample1.com/part1/sample_code.part01.rar
http://www.sample1.com/part1/sample_code.part01.rar

Please note that the meta tag at the start also contains the string Code:, and I don't want the value from it.
Thanks in advance.
Regards

Jesdisciple
06-19-2009, 01:43 AM
Please try before asking next time. I can't speak for everyone else, but I hate the feeling that I'm operating like a vending machine or a printer. It might not be so bad if I were paid...
http://www.regular-expressions.info/tutorial.html
http://www.regular-expressions.info/php.html

Try this. Note that I'm using % instead of / as the delimiter because slashes are in the regex. Also, I'm sacrificing a bit of efficiency so HTML can be inside the <pre> element. If you're sure it won't be, replace the .*? with [^<]*
$matches = array();
// The s modifier lets . match newlines, since the <pre> contents span several lines.
preg_match_all('%Code:</div>[^<]*<pre[^>]*>(.*?)</pre>%s', $html, $matches);
foreach ($matches[1] as $match) {
    // $match holds the raw text of one <pre> block
}
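
For anyone who wants to try the pattern end to end, here is a minimal sketch. It assumes a trimmed-down copy of the markup from the question as input and that each <pre> block holds one URL per line. The helper name extractCodeLinks() and the sample $html string are made up for illustration; only preg_match_all() and preg_split() are real PHP functions.

```php
<?php
// Hypothetical sample input, trimmed from the markup in the question.
$html = <<<'HTML'
<meta name="description" content="New info! Code: http://www.example/index.html" />
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr" style="margin: 0px">http://www.sample1.com/part1.html
http://www.sample1.com/part1.html</pre>
<div class="smallfont" style="margin-bottom:2px">Code:</div>
<pre class="alt2" dir="ltr">http://www.sample1.com/part1/sample_code.part01.rar</pre>
HTML;

// extractCodeLinks() is an illustrative helper, not part of PHP.
function extractCodeLinks($html)
{
    $links = array();
    // % is the delimiter; the s modifier lets . match newlines inside <pre>.
    // The meta tag is skipped because its "Code:" is not followed by </div>.
    if (preg_match_all('%Code:</div>[^<]*<pre[^>]*>(.*?)</pre>%s', $html, $matches)) {
        foreach ($matches[1] as $block) {
            // Assumes one URL per line; split on any newline style.
            $lines = preg_split('/\R/', trim($block), -1, PREG_SPLIT_NO_EMPTY);
            foreach ($lines as $line) {
                $line = trim($line);
                if ($line !== '') {
                    $links[] = $line;
                }
            }
        }
    }
    return $links;
}

print_r(extractCodeLinks($html));
```

If the real pages nest other tags inside <pre>, or the markup varies between posts, a DOM parser is probably the safer route than a regex.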