Perl Tutorial - Practical Extraction and Reporting Language (Perl)

Perl Regular Expressions



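The basic method for applying a regular expression is to use the pattern binding operators =~ and !~. The =~ operator binds the string on its left to the match, substitution, or transliteration on its right; for a match it yields true when the pattern is found. The !~ operator is the negation: it yields true when the pattern is not found.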
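
As a minimal sketch of the two binding operators (the variable names and the sample string are made up for illustration):

```perl
use strict;
use warnings;

my $line = "Error: disk full";

# =~ binds $line to the match and is true when the pattern is found.
my $is_error = ($line =~ /^Error/) ? 1 : 0;

# !~ is the negation: true when the pattern is NOT found.
my $no_warning = ($line !~ /^Warning/) ? 1 : 0;

# With s///, =~ modifies the bound variable in place,
# so work on a copy to keep the original intact.
(my $copy = $line) =~ s/disk/memory/;

print "$is_error $no_warning $copy\n";   # prints "1 1 Error: memory full"
```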

There are three operators within Perl that are used with the binding operators.

  • Match Regular Expression - m//
  • Substitute Regular Expression - s///
  • Transliterate - tr/// (also written y///); it works on lists of characters, not on a regular expression
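
The three operators also differ in what they return. The following sketch, using an invented sample string, shows one call of each:

```perl
use strict;
use warnings;

my $text = "hello world";

# m// tests for a match.
my $found = ($text =~ m/world/) ? 1 : 0;

# s/// returns the number of substitutions it made.
my $replaced = ($text =~ s/o/0/g);

# tr/// returns the number of characters it matched; with an empty
# replacement list it only counts and leaves the string unchanged.
my $counted = ($text =~ tr/a-z//);

print "$found $replaced $counted $text\n";   # prints "1 2 8 hell0 w0rld"
```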

Pattern-matching options.

    
Option          Description 
g               Match all possible patterns 
i               Ignore case 
m               Treat string as multiple lines 
o               Only evaluate once 
s               Treat string as single line 
x               Ignore white space in pattern 

Pattern Modifiers

Code    Description
g       Global: match all occurrences of the regular expression
i       Ignore case: match any case
m       Multiple lines: process the input as multiple lines, so ^ and $ match at every line
o       Only once: compile the regular expression only the first time
s       Single line: let . match newline characters as well
x       Extra spaces: allow comments and spaces in regular expression syntax
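
A short sketch of the g, i, m, s, and x modifiers on one invented two-line string:

```perl
use strict;
use warnings;

my $text = "First line\nsecond LINE\n";

# g + i: collect every match, ignoring case.
my @all = $text =~ /line/gi;                          # ("line", "LINE")

# m: ^ also matches right after an embedded newline.
my $with_m    = ($text =~ /^second/m) ? 1 : 0;        # 1
my $without_m = ($text =~ /^second/)  ? 1 : 0;        # 0

# s: . also matches "\n", so the match can span both lines.
my $with_s    = ($text =~ /First.*LINE/s) ? 1 : 0;    # 1
my $without_s = ($text =~ /First.*LINE/)  ? 1 : 0;    # 0

# x: white space inside the pattern is ignored.
my $with_x = ($text =~ / (\w+) \s+ line /x) ? $1 : '';   # "First"
```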


Modifier  Operator                                  Description
i         $_ =~ s/PATTERN/REPLACEMENT/i;            Makes the match case insensitive.
          $_ =~ m/PATTERN/i;
m         $_ =~ s/PATTERN/REPLACEMENT/m;            If the string contains newline characters, ^ and $
          $_ =~ m/PATTERN/m;                        match at each line boundary instead of only at the
                                                    string boundary.
o         $_ =~ s/PATTERN/REPLACEMENT/o;            Compiles the expression only once, even if variables
          $_ =~ m/PATTERN/o;                        interpolated into it change later.
s         $_ =~ s/PATTERN/REPLACEMENT/s;            Allows use of '.' to match a newline character.
          $_ =~ m/PATTERN/s;
x         $_ =~ s/PATTERN/REPLACEMENT/x;            Allows you to use white space and comments in the
          $_ =~ m/PATTERN/x;                        expression for clarity.
g         $_ =~ s/PATTERN/REPLACEMENT/g;            Globally finds all matches.
          $_ =~ m/PATTERN/g;
cg        $_ =~ m/PATTERN/cg;                       Keeps the current search position when a global match
                                                    fails, so the search can continue from there.
e         $_ =~ s/PATTERN/REPLACEMENT/e;            Evaluates the replacement as a Perl expression and
                                                    uses its value as the replacement text.
c         $_ =~ tr/SEARCHLIST/REPLACEMENTLIST/cds;  Complements SEARCHLIST.
d         $_ =~ tr/SEARCHLIST/REPLACEMENTLIST/cds;  Deletes found but unreplaced characters.
s         $_ =~ tr/SEARCHLIST/REPLACEMENTLIST/cds;  Squashes duplicate replaced characters.
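
The less common modifiers are easier to see in action. This is only a sketch with invented data; it covers e, cg (written gc here, the order does not matter), and the tr modifiers c, d, and s:

```perl
use strict;
use warnings;

# e: the replacement is evaluated as Perl code.
(my $doubled = "apple 3, pear 5") =~ s/(\d+)/$1 * 2/ge;   # "apple 6, pear 10"

# gc: a failed match keeps pos() instead of resetting it.
my $input  = "12ab";
my $first  = ($input =~ /\G\d+/gc) ? 1 : 0;   # matches "12", pos is now 2
my $second = ($input =~ /\G\d+/gc) ? 1 : 0;   # fails, pos stays at 2
my $pos    = pos($input);
my $rest   = ($input =~ /\G([a-z]+)/gc) ? $1 : '';   # continues at 2: "ab"

# tr modifiers
(my $deleted  = "banana")    =~ tr/aeiou//d;   # d: delete matched chars, "bnn"
(my $squashed = "aaabbbccc") =~ tr/a-c//s;     # s: squash runs, "abc"
(my $masked   = "abc-123")   =~ tr/a-z/*/c;    # c: everything NOT a-z, "abc****"
```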

Simple Matching

m/abc/;      # find 'abc'
m#abc#;      # same, with # as the delimiter
m{abc};      # same, with braces as delimiters
/abc/;       # same, the m is optional with slashes
/abc def/;   # find 'abc def'

/^abc/;      # abc at beginning
/abc$/;      # abc at the end
/^$/;        # empty line
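
A small sketch that applies these patterns to a made-up list of lines with grep:

```perl
use strict;
use warnings;

my @lines = ("abc def", "xyz abc", "", "no match here");

my @starts = grep { /^abc/ } @lines;   # ("abc def")
my @ends   = grep { /abc$/ } @lines;   # ("xyz abc")
my @empty  = grep { /^$/ }   @lines;   # ("")
my @any    = grep { m{abc} } @lines;   # ("abc def", "xyz abc")

printf "%d %d %d %d\n", scalar @starts, scalar @ends, scalar @empty, scalar @any;
# prints "1 1 1 2"
```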

Substitution

s/a/b/;       # first a->b
s/a/b/g;      # all a->b

s/Hi!/Ho!/g;  # all 'Hi!' -> 'Ho!'

s/[[:cntrl:]]//g; # remove control chars 
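
The substitutions above act on $_. As a sketch, the same operations bound to named variables (the sample strings are invented) look like this:

```perl
use strict;
use warnings;

my $word = "banana";

(my $first = $word) =~ s/a/b/;    # only the first a: "bbnana"
(my $all   = $word) =~ s/a/b/g;   # every a:          "bbnbnb"

(my $greeting = "Hi! Hi!") =~ s/Hi!/Ho!/g;   # "Ho! Ho!"

# The POSIX class for control characters is spelled [:cntrl:].
(my $clean = "tab\there\x07") =~ s/[[:cntrl:]]//g;   # "tabhere"
```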

Translation

tr/a/b/;      # all a->b
y/a/b/;       # all a->b
tr/abc/x/;    # a->x,b->x,c->x
tr/xxx/abc/;  # only x->a
tr/a-z/A-Z/;       # upper case (no brackets: tr takes character lists, not classes)
tr/A-Za-z/N-ZA-Mn-za-m/; # ROT13
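
A sketch of the same transliterations applied to named variables rather than $_ (sample strings are invented):

```perl
use strict;
use warnings;

(my $upper = "hello")  =~ tr/a-z/A-Z/;                # "HELLO"
(my $rot13 = "Hello")  =~ tr/A-Za-z/N-ZA-Mn-za-m/;    # "Uryyb"

# A short replacement list is padded with its last character.
(my $many  = "aabbcc") =~ tr/abc/x/;                  # "xxxxxx"

# When a character is listed more than once, only the first mapping counts.
(my $dup   = "xyz")    =~ tr/xxx/abc/;                # "ayz"

# tr returns the number of characters it matched.
my $count = ("banana" =~ tr/a//);                     # 3
```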

Quantities

/^\s?\S/;   # 0..1 spaces
/^\s*\S/;   # 0..n spaces
/^\s+\S/;   # 1..n spaces
/a{3}/;     # 3 times 'a'
/ab{3}/;    # 3 times 'b'
/(ab){3}/;  # 3 times 'ab'
/a{3,4}/;   # 3..4 times 'a'
/a{3,}/;    # 3..n times 'a'
/a.+b/;     # maximal match
/a.+?b/;    # non-greedy match
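
What a quantifier applies to is a common source of mistakes. The following sketch, with invented test strings, checks a few of the patterns above:

```perl
use strict;
use warnings;

my $b3   = ("abbb"   =~ /ab{3}/)   ? 1 : 0;   # 1: {3} applies to 'b' only
my $ab3  = ("ababab" =~ /(ab){3}/) ? 1 : 0;   # 1: {3} applies to the group
my $none = ("ababab" =~ /ab{3}/)   ? 1 : 0;   # 0: no run of three b's

my $opt  = ("  x" =~ /^\s?\S/) ? 1 : 0;       # 0: \s? allows at most one space
my $any  = ("  x" =~ /^\s*\S/) ? 1 : 0;       # 1: \s* allows any number
my $one  = ("x"   =~ /^\s+\S/) ? 1 : 0;       # 0: \s+ needs at least one
```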

Grouping and Alternatives

/(abc)def/;       # $1='abc'
/(a)b(cd)/;       # $1='a',$2='cd'
/(a)(?:b)(c)/;    # $1='a',$2='c'
/(start|begin)/;  # either 'start'
                  # or 'begin'
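
In list context a match returns its captured groups, which is often handier than reading $1 and $2 afterwards. A minimal sketch with made-up strings:

```perl
use strict;
use warnings;

# Two capturing groups fill two variables.
my ($first, $second) = "abcd" =~ /(a)b(cd)/;      # ("a", "cd")

# (?:...) groups without capturing, so only two values come back.
my @captured = "abc" =~ /(a)(?:b)(c)/;            # ("a", "c")

# Alternation inside a group: whichever branch matched is captured.
my ($verb) = "begin now" =~ /(start|begin)/;      # "begin"
```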

Special Characters
 
\d Digit
\D Non-Digit
\w Word Character
\W Non-Word Character
\s Whitespace
\S Non-Whitespace

Character Classes

:alpha: alphabetic
:alnum: alpha numeric
:upper: upper case
:lower: lower case
:digit: \d
:xdigit: hex number
:print: printable
:space: \s
:blank: space, tab
:punct: punctuation
:graph: alnum and punct
:word: \w
:ascii: ASCII chars
:cntrl: control chars
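
These names are only valid inside a bracketed character class, so they are written as [[:alpha:]], [^[:alpha:]], and so on. A brief sketch on an invented string:

```perl
use strict;
use warnings;

my $text = "Hello, World 42!";

# Remove everything that is not alphabetic.
(my $letters = $text) =~ s/[^[:alpha:]]//g;   # "HelloWorld"

# Collect every digit.
my @digits = $text =~ /[[:digit:]]/g;         # ("4", "2")

# Count punctuation characters (the "= () =" idiom forces list context).
my $punct = () = $text =~ /[[:punct:]]/g;     # 2
```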

Greedy searches

Greedy means that each quantifier will try to match as much as it can.
The pattern /a.*a/ matches as many characters as possible between the first a and the last a.
If your text string is ababacdea, /a.*a/ will match the whole string.
You can control the greediness by putting a question mark after the quantifier.
With the question mark, the quantifier matches the minimum number of times that still lets the whole pattern succeed.
The following table shows how to minimize the greediness.

Syntax         Means
*?             Match zero or more times, minimal number of times
+?             Match one or more times, minimal number of times
??             Match zero or one time, minimal number of times
{num}?         Match exactly num times, minimal number of times
{num,}?        Match at least num times, minimal number of times
{num,max}?     Match at least num but not more than max times, minimal number of times
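
A sketch comparing greedy and minimal matching on the string from the text, plus one invented HTML-like string:

```perl
use strict;
use warnings;

my $text = "ababacdea";

my ($greedy) = $text =~ /(a.*a)/;    # "ababacdea": runs to the last a
my ($lazy)   = $text =~ /(a.*?a)/;   # "aba": stops at the first a it can

my $html = "<b>bold</b>";
my ($greedy_tag) = $html =~ /(<.+>)/;    # "<b>bold</b>"
my ($lazy_tag)   = $html =~ /(<.+?>)/;   # "<b>"
```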

Anchoring Metacharacters

 Metacharacter   What It Matches
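^               Matches at the beginning of the string, or of each line with the m modifier
$               Matches at the end of the string (or before a final newline), or of each line with the m modifier
\A              Matches the beginning of the string only
\Z              Matches the end of the string, or just before a newline at the end of the string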
\z              Matches the end of string only
\G              Matches where previous m//g left off
\b              Matches a word boundary (when not inside [ ])
\B              Matches a nonword boundary
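
The differences between the end-of-string anchors only show up when the string ends in a newline. A sketch with invented strings:

```perl
use strict;
use warnings;

my $text = "one\ntwo\n";

my $big_z   = ($text =~ /two\Z/)  ? 1 : 0;   # 1: \Z tolerates the final newline
my $small_z = ($text =~ /two\z/)  ? 1 : 0;   # 0: \z is the very end of the string
my $caret   = ($text =~ /^two/m)  ? 1 : 0;   # 1: ^ matches at a line start with m
my $big_a   = ($text =~ /\Atwo/m) ? 1 : 0;   # 0: \A ignores the m modifier

# \b keeps "cat" from matching inside "concat".
my @cats = "cat concat cat" =~ /\bcat\b/g;   # two matches
```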

Character Class: Miscellaneous Characters

Metacharacter  What It Matches
\12            Matches that octal value, up to \377
\x81           Matches that hex value (two hex digits)
\cX            Matches that control character; e.g., \cC is <Ctrl>-C and \cV is <Ctrl>-V
\e             Matches the ASCII ESC character, not backslash
\E             Marks the end of a \U, \L, or \Q sequence
\l             Lowercase the next character only
\L             Lowercase characters until the end of the string or until \E
\N             Matches that named character; e.g., \N{greek:Beta}
\p{PROPERTY}   Matches any character with the named property; e.g., \p{IsAlpha}
\P{PROPERTY}   Matches any character without the named property
\Q             Quote metacharacters until \E
\u             Titlecase next character only
\U             Uppercase until \E
\x{NUMBER}     Matches Unicode NUMBER given in hexadecimal
\X             Matches Unicode "combining character sequence" string
\[             Matches that metacharacter
\\             Matches a backslash
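
Several of these escapes change how text is interpolated rather than matching a character themselves. The sketch below uses invented variable names to show the case escapes and \Q...\E:

```perl
use strict;
use warnings;

my $name = "world";

my $title = "\u$name";       # "World": \u affects the next character only
my $shout = "\U$name\E!";    # "WORLD!": \U runs until \E
my $lower = "\LABC\E";       # "abc"

# \Q...\E makes interpolated text match literally.
my $needle   = "a.b";
my $literal  = ("a.b" =~ /^\Q$needle\E$/) ? 1 : 0;   # 1
my $quoted   = ("axb" =~ /^\Q$needle\E$/) ? 1 : 0;   # 0: the dot is literal
my $unquoted = ("axb" =~ /^$needle$/)     ? 1 : 0;   # 1: the dot is a wildcard

my $is_tab   = ("\cI" eq "\t")          ? 1 : 0;     # 1: Ctrl-I is the tab character
my $is_alpha = ("a" =~ /\p{IsAlpha}/)   ? 1 : 0;     # 1
```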

Character Class: Remembered Characters

Metacharacter               What It Matches
(string)                    Captures the matched text for backreferencing (see Examples 9.38 and 9.39)
\1 or $1                    Matches first set of parentheses (\1 inside the pattern, $1 after the match or in the replacement)
\2 or $2                    Matches second set of parentheses
\3 or $3                    Matches third set of parentheses
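
As a sketch of the difference between \1 and $1 (the sample strings are invented): \1 refers back to a group inside the same pattern, while $1 and $2 are read after the match or in the replacement part of s///.

```perl
use strict;
use warnings;

# \1 inside the pattern: find a word that is immediately repeated.
my ($repeated) = "this is is a test" =~ /\b(\w+) \1\b/;   # "is"

# $1 and $2 in the replacement: swap two words.
(my $swapped = "John Smith") =~ s/(\w+) (\w+)/$2, $1/;    # "Smith, John"
```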

Character Class: Repeated Characters

 Metacharacter   What It Matches

x?              Matches 0 or 1 x
x*              Matches 0 or more occurrences of x
x+              Matches 1 or more occurrences of x
(xyz)+          Matches 1 or more patterns of xyz
x{m,n}          Matches at least m occurrences of x and no more than n occurrences of x

Character Class: Whitespace Characters

 Metacharacter   What It Matches
\s              Matches a whitespace character, such as spaces, tabs, and newlines
\S              Matches nonwhitespace character
\n              Matches a newline
\r              Matches a return
\t              Matches a tab
\f              Matches a form feed
[\b]            Matches a backspace (\b means backspace only inside [ ]; elsewhere it is a word boundary)
\0              Matches a null character

Checking for multiple occurrences

Pattern     Interpretation
/a{1,4}/    Matches one, two, three, or four a's.
/a{2}/      Matches two a's.
/a{0,2}/    Matches zero, one, or two a's.
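
A quick sketch of these three patterns on invented strings. Note that because /a{0,2}/ accepts zero a's, it matches any string at all:

```perl
use strict;
use warnings;

my ($run) = "caaaaaat" =~ /(a{1,4})/;        # "aaaa": at most four are taken
my $two   = ("cat" =~ /a{2}/)   ? 1 : 0;     # 0: only one a in a row
my $zero  = ("xyz" =~ /a{0,2}/) ? 1 : 0;     # 1: zero a's is an acceptable match
```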



