I posted it here because I did not know where else to post this. Anyway, does anyone know of a brief regexp tutorial? I could not understand those things. Any help would be greatly appreciated. Thanks!!! :)
Been there, could not get a thing though. Thanks anyways.
If you can't get through that, you probably aren't gonna find a tutorial that you can get through.
shachi, regular expression syntax isn't difficult to understand. The problem is just that when they get long, they require careful reading. Perhaps if you suggest a pattern of text you'd like to match (make it relatively simple, to start with :)), we can show you the expression and walk through the various parts.
Once you understand the various parts - repetition, grouping, character classes - it's simply a matter of adding them together.
Mike
mwinter: Yes that's the problem, I can't understand long regular expressions, I try and I try but unfortunately I fail to understand what it means for e.g something like (/\d+-\d)(/\d-\d)...//1//2 or something similar. It's kinda confusing.:(
Can you tell me some more about repetition,grouping, character classes, etc. ??
Sorry for not posting long but I just realized that I started a thread about regexps too.
Quote:
(/\d+-\d)(/\d-\d)...//1//2
This is an invalid regular expression. The delimiting character (always / in JavaScript) appears unescaped only twice in a regular expression: once at the beginning and once at the end, before any switches. If we assume you meant:
Code:
/(\/\d+-\d)(\/\d-\d)...\/\/1\/\/2/
... then that expression means:
- A forward slash, then
- any digit repeated one or more times, then
- a hyphen, then
- another digit, then
- a forward slash, then
- another digit, then
- a hyphen, then
- any digit, then
- any character, then
- any character, then
- any character, then
- a forward slash, then
- a forward slash, then
- 1, then
- a forward slash, then
- a forward slash, then 2.
There's a very useful tool called kregexpeditor available with KDE. It's handy for breaking down long regular expressions into something a bit more readable.
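If you want to check a walk-through like this yourself, a quick test in JavaScript helps. This is only a sketch: the two sample strings are made up for illustration and are not from the original question.

```javascript
// The corrected expression, written as a JavaScript regular expression literal.
const walkthroughPattern = /(\/\d+-\d)(\/\d-\d)...\/\/1\/\/2/;

// Hypothetical inputs, invented purely for this example.
const shouldMatch = "/12-3/4-5abc//1//2";
const shouldNotMatch = "/12-3/4-5abc/1/2"; // only single slashes before 1 and 2

console.log(walkthroughPattern.test(shouldMatch));    // true
console.log(walkthroughPattern.test(shouldNotMatch)); // false
```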
Twey: I am sure it was something like \\1 and \\2 ...
let me check it.
It's something like this:
\[(\d+-\d+)\] and \\1
I don't think [] should be there. Without them it means:
- ( , open capturing group one
- \d+ , one or more of any digit
- - , a hyphen
- \d+ , one or more of any digit
- ) , close capturing group one
- \\1 , recall the value captured by group one (backreference one)
The first part should match something that looks like 987-573 or 8-1.
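Here is a rough JavaScript sketch of that capture-and-recall idea. I'm assuming the \\1 was used as the replacement in a search-and-replace (JavaScript spells that $1 in the replacement string), and the sample text is made up. With the escaped brackets kept in, the literal [ and ] are matched but only the part inside the parentheses is captured.

```javascript
// Escaped brackets match literal "[" and "]"; the parentheses capture what is inside.
const bracketedRange = /\[(\d+-\d+)\]/;

// Hypothetical sample text, invented for this example.
const sampleText = "see [987-573] and [8-1]";

const firstMatch = sampleText.match(bracketedRange);
console.log(firstMatch[0]); // [987-573]  (the whole match)
console.log(firstMatch[1]); // 987-573    (captured group one)

// Assumption: $1 here plays the role of \\1 in the original replacement string.
const unbracketed = sampleText.replace(/\[(\d+-\d+)\]/g, "$1");
console.log(unbracketed); // see 987-573 and 8-1
```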
To me, understanding regular expressions is a matter of deconstruction; the reverse of (how I go about) writing them.
The parts of regular expressions are quite simple. The power comes from combining them. They can be understood by breaking an expression back down into smaller, simpler groups of those parts.
Quote:
Originally Posted by shachi
Can you tell me some more about repetition,grouping, character classes, etc. ??
You've no doubt seen them all.
Repetition is signalled by the characters "*", "+", and "?". There's also the form, "{n,m}", where n and m are numbers that define the lower and upper range of the repeats, and m, or both the comma and m, is optional. For example: "a{3}" would match exactly three "a"s; "a{3,5}" would match 3, 4, or 5 "a"s; and "a{4,}" would match four or more "a"s.
If a repetition operator follows a character, it applies only to that character. That is, "word{2}" would match "wordd", not "wordword". However, repetition can be applied to a group: "(word){2}" matches "wordword". Character classes can also be repeated.
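The repetition rules can be tried out with a short JavaScript sketch like the one below. Here "?" means zero or one, "*" zero or more, and "+" one or more. The ^ and $ anchors are my addition so that the whole input has to match, and the sample inputs are made up for illustration.

```javascript
// Each entry: [pattern, input, expected result of pattern.test(input)].
const repetitionChecks = [
  [/^colou?r$/, "color", true],      // "?" makes the "u" optional
  [/^ab*$/, "a", true],              // "*" allows zero "b"s
  [/^ab+$/, "a", false],             // "+" needs at least one "b"
  [/^a{3}$/, "aaa", true],           // exactly three
  [/^a{3}$/, "aaaa", false],
  [/^a{3,5}$/, "aaaa", true],        // three to five
  [/^a{4,}$/, "aaaaaaa", true],      // four or more
  [/^word{2}$/, "wordd", true],      // {2} applies only to the "d"
  [/^word{2}$/, "wordword", false],
  [/^(word){2}$/, "wordword", true], // {2} applies to the whole group
  [/^[ab]{2}$/, "ba", true],         // a repeated character class
];

for (const [re, input, expected] of repetitionChecks) {
  console.log(`${re} on "${input}": ${re.test(input)} (expected ${expected})`);
}
```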
As shown above, grouping is the use of parentheses. There are two forms: capturing "(...)" and non-capturing "(?:...)". Repetition can be used with either, and both can be used to limit the scope of alternates. For example, "^foo|bar$" would match either a string beginning with "foo", or ending with "bar"; other characters could occur in the string. On the other hand, "^(foo|bar)$" would only match a string that exactly equalled "foo" or "bar".
In addition, the capturing form allows the use of back-references (matching a previously captured character sequence), and obtaining parts of the string after evaluating it. For example, in "([ab])c\1", the \1 is a back-reference. Whatever that back-reference evaluates to must follow "c". So, if the first character was "a", the expression would match "aca". If it was "b", it would match "bcb".
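A small JavaScript sketch of grouping, alternation, and back-references, using the expressions above. The input strings are made up for illustration.

```javascript
// Each entry: [pattern, input, expected result of pattern.test(input)].
const groupingChecks = [
  [/^foo|bar$/, "food", true],      // begins with "foo"
  [/^foo|bar$/, "crowbar", true],   // ends with "bar"
  [/^foo|bar$/, "a foo b", false],  // neither begins with "foo" nor ends with "bar"
  [/^(foo|bar)$/, "food", false],   // must equal "foo" or "bar" exactly
  [/^(foo|bar)$/, "bar", true],
  [/^(?:foo|bar)$/, "bar", true],   // non-capturing form, same scoping effect
  [/([ab])c\1/, "aca", true],       // \1 must repeat whatever group one matched
  [/([ab])c\1/, "bcb", true],
  [/([ab])c\1/, "acb", false],
];

for (const [re, input, expected] of groupingChecks) {
  console.log(`${re} on "${input}": ${re.test(input)} (expected ${expected})`);
}

// Obtaining the captured part after a match.
const captured = "xbcbx".match(/([ab])c\1/);
console.log(captured[0], captured[1]); // bcb b
```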
In general, non-capturing parentheses are best as the number of allowable capturing parentheses is limited to fewer than 100 (though it would be a very complex expression to use all of them). However, in client-side scripts, it's best to avoid the non-capturing kind for now as they'll cause syntax exceptions in older browsers. Feel free to use them server-side, though.
Character classes are used to represent a set of characters. Any of those characters would match in place of the class. Consider "[a-f]": any of the characters "a" through "f" would be matched by that class. There is also a predefined set of character classes that can be specified using escape sequences: \d, \D, \s, \S, \w, \W. \d (digits) is the same as [0-9]. \s (whitespace) is essentially the same as [ \t\v\f\r\n], though the exact set varies slightly between engines. \w (word) is the same as [a-zA-Z0-9_] - letters, digits, and underscore (_). The uppercase version is the inverse, so \D, for instance, is all characters except digits: [^0-9].
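The character classes can be checked the same way. Again, this is just a sketch in JavaScript with made-up inputs, and the ^ and $ anchors are my addition so the whole input has to match.

```javascript
// Each entry: [pattern, input, expected result of pattern.test(input)].
const classChecks = [
  [/^[a-f]$/, "c", true],
  [/^[a-f]$/, "g", false],          // outside the a-f range
  [/^\d+$/, "2024", true],          // digits only
  [/^\D+$/, "abc", true],           // no digits at all
  [/^\D+$/, "ab1", false],
  [/^\w+$/, "snake_case9", true],   // letters, digits, underscore
  [/^\w+$/, "two words", false],    // the space is not a word character
  [/\s/, "two words", true],        // contains whitespace somewhere
  [/^\S+$/, "two words", false],    // \S is the inverse of \s
];

for (const [re, input, expected] of classChecks) {
  console.log(`${re} on "${input}": ${re.test(input)} (expected ${expected})`);
}
```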
As an exercise in creating regular expressions, consider domain names. Each part (label) is a sequence of characters separated by a dot (.). They must start and end with either letters or numbers, though they can also contain hyphens (-). Finally, to distinguish them from IP addresses, the top-level domain will be alphabetic.
The last part is the easiest: "[a-z]+" (one or more letters).
Next come the other labels. They must start with letters or numbers: "[a-z0-9]". They must end with the same, but only one character is necessary within a label, so this part will be optional: "[a-z0-9]?". In addition, they can contain hyphens: "[a-z0-9-]?". This would give us:
  [a-z0-9][a-z0-9-]?[a-z0-9]?
which isn't quite right as it would allow the label to end with a hyphen, so we make the last two classes combined optional by grouping them:
  [a-z0-9]([a-z0-9-]?[a-z0-9])?
Better, but that only allows a maximum of three characters; we want the second class to be repeated.
  [a-z0-9]([a-z0-9-]*[a-z0-9])?
Now, labels are separated by dots (.), and including the final alphabetic label, there must be at least one of them, so we start by adding a dot to the end of the previous expression, and repeat the whole thing:
  ([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)*
Finally, we add the last alphabetic label:
  ([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)*[a-z]+
That's it. On the Web, we'd want the domain names of remote machines to have at least two labels, so we change the zero-or-more (*) to one-or-more (+):
  ([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z]+
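To see the finished expression at work, here is a minimal JavaScript sketch. The ^ and $ anchors are my addition (without them the expression would also match a valid-looking piece of a longer string), the inputs are assumed to be lowercase already, and the sample names are made up.

```javascript
// The final expression from above, anchored so the whole string must be a domain name.
const domainPattern = /^([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z]+$/;

// Each entry: [input, expected result]. All sample names are hypothetical.
const domainChecks = [
  ["example.com", true],
  ["www.my-site.co.uk", true],
  ["localhost", false],    // only one label
  ["-bad.com", false],     // label starts with a hyphen
  ["bad-.com", false],     // label ends with a hyphen
  ["192.168.0.1", false],  // last label is not alphabetic
];

for (const [name, expected] of domainChecks) {
  console.log(`${name}: ${domainPattern.test(name)} (expected ${expected})`);
}
```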
Now let's try the process backwards.
  ((2[0-4]|1[0-9]|[1-9])?[0-9]|25[0-5])(\.((2[0-4]|1[0-9]|[1-9])?[0-9]|25[0-5])){3}
This validates IPv4 addresses. The first thing we need to do is break it down into smaller chunks. There's a single group (let's call it SG):
  ((2[0-4]|1[0-9]|[1-9])?[0-9]|25[0-5])
and a repeated group:
  (\.((2[0-4]|1[0-9]|[1-9])?[0-9]|25[0-5])){3}
They're more or less the same, so we'll start with the first one, SG. Within the outer brackets, there's an alternate (|) that separates:
  (2[0-4]|1[0-9]|[1-9])?[0-9]
and:
  25[0-5]
From the latter, it's clear that the larger chunk (SG) will match 250, 251, 252, 253, 254, and 255.
Looking at the former, there's another opportunity to break it down by separating the optional group from the required class that follows, giving:
  (2[0-4]|1[0-9]|[1-9])?
and:
  [0-9]
So far, we've established that SG will match 250 through 255, and 0 through 9. Now, what can occur in front of this latter set of digits? Within the brackets, we again have a set of alternates:
  2[0-4]
A 2, followed by the digits 0 through 4. Combined, that would be "2[0-4][0-9]", or the numbers 200 through 249. Next:
  1[0-9]
A 1, followed by the digits 0 through 9. Combined, that would be "1[0-9][0-9]", or the numbers 100 through 199. Finally:
  [1-9]
Combined, this would be "[1-9][0-9]", or the numbers 10 through 99.
In all, that gives us 0-9, 10-99, 100-199, 200-249, and 250-255, or the numbers 0-255, without leading zeros.
Going back to the repeated group, it's exactly the same except on each repeat, that same number range is prefixed by a dot (.). That gives us four numbers in the range 0-255, separated by dots.
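The result of that deconstruction can be confirmed with a short JavaScript sketch. As before, the ^ and $ anchors are my addition so that the whole string has to be an address, and the sample addresses are made up.

```javascript
// The IPv4 expression from above, anchored at both ends.
const ipv4Pattern =
  /^((2[0-4]|1[0-9]|[1-9])?[0-9]|25[0-5])(\.((2[0-4]|1[0-9]|[1-9])?[0-9]|25[0-5])){3}$/;

// Each entry: [input, expected result].
const ipv4Checks = [
  ["192.168.0.1", true],
  ["255.255.255.255", true],
  ["0.0.0.0", true],
  ["256.1.1.1", false],  // 256 is outside 0-255
  ["01.2.3.4", false],   // leading zero is not allowed
  ["1.2.3", false],      // only three numbers
];

for (const [address, expected] of ipv4Checks) {
  console.log(`${address}: ${ipv4Pattern.test(address)} (expected ${expected})`);
}
```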
One thing that's quite important when working with things like this is a good text editor that performs bracket matching. Even when programming in general, it's a good thing to be able to see which bracket is paired with another. Those that do (such as Crimson Editor) also tend to provide a keystroke shortcut for moving between them (I find Ctrl+] to be common).
Mike
Wow, that was awesome. Thanks mwinter, let me read all that (sorry I am at school now).
Hmmm... a lot to work through if new, but that's a good explanation.
Thanks, Mike.
Yea, a lot of work. I have even saved a copy of that (as a .txt file) for future reference. Thanks again mwinter. :D
I'm testing this out (actually it's helping me with a script that I'm writing), and I was wondering what's wrong with this:
PHP Code:
if (preg_match("([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z]+", $url) == 1) {
    dostuff();
}
It's not being true.
Firstly, you should use delimiters:
Code:
/([a-z0-9]([a-z0-9-]*[a-z0-9])?\.)+[a-z]+/
Subdomains, anyone? .co.uk?