Regex Reference
The regular expression (regex) syntax and semantics implemented in BareGrep are common to PHP, Perl and Java.
Characters and Escapes
Logical Operators
Character Classes
Quantifiers
Assertions
There is also an example.
Characters and Escapes
.
|
Any character
"." matches any character.
For example:
...
would match any three character sequence.
To specify a literal ".", escape it with "\". For example:
www\.baremetalsoft\.com
would match "www.baremetalsoft.com".
|
x
|
The literal character x
All characters which do not have a special meaning in a regex match themselves.
For example:
fred
would match the string "fred".
|
\a
|
Alert character (bell)
BEL - ASCII code 07.
|
\cx
|
Control-x character
For example:
\cM
would be equivalent to the key sequence Control-M, that is, the character with ASCII code 0D hexadecimal (carriage return).
|
\d
|
A digit
A digit from 0 to 9.
This is equivalent to the regex:
[0-9]
|
\D
|
Any non-digit
Any character which is not a digit.
This is equivalent to the regex:
[^0-9]
|
\e
|
Escape character
ESC - ASCII code 27 (1B hexadecimal).
|
\f
|
Form feed character
FF - ASCII code 12 (0C hexadecimal).
|
\r
|
Carriage return character
CR - ASCII code 13 (0D hexadecimal). Carriage return characters are automatically stripped
from the ends of lines. However, this escape can be used to match a carriage return character
which is not followed by a new-line character (ASCII code 10, 0A hexadecimal).
Note: to match the start-of-line use the "^" assertion. To match the end-of-line
use the "$" assertion.
|
\s
|
Any whitespace character
The whitespace characters include space, tab, new-line, carriage-return and form-feed.
This is equivalent to the regex:
[ \t\r\n\f]
|
\S
|
Any non-whitespace character
This is equivalent to the regex:
[^ \t\r\n\f]
|
\t
|
Tab character
A horizontal tab character.
HT - ASCII code 09.
|
\nnn
|
The character with octal value nnn
|
\w
|
Any word character
Any word character (in the set "A" to "Z", "a"
to "z", "0" to "9" and "_").
This is equivalent to the regex:
[0-9_A-Za-z]
|
\W
|
Any non-word character
Any character which is not a word character. This is equivalent to the regex:
[^0-9_A-Za-z]
|
\xhh
|
The character with hexadecimal value hh
|
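The escapes above can be tried out with a short sketch. It uses Python's "re" module, which is an assumption made for illustration: BareGrep has its own engine, and Python does not support "\cx" or "\e", so only escapes common to both are shown. The re.ASCII flag keeps "\d" and "\w" within the ASCII ranges listed in the table.

```python
import re

# An escaped "\." matches only a literal dot; an unescaped "." matches any character.
dot_pattern = re.compile(r"www\.baremetalsoft\.com")
literal_hit = dot_pattern.fullmatch("www.baremetalsoft.com") is not None
wildcard_hit = dot_pattern.fullmatch("wwwXbaremetalsoftXcom") is not None
any_three = re.fullmatch(r"...", "abc") is not None

# Shorthand classes and their bracket equivalents select the same text.
sample = "id_42\tok"
digits_short = re.findall(r"\d", sample, re.ASCII)
digits_long = re.findall(r"[0-9]", sample)
words_short = re.findall(r"\w+", sample, re.ASCII)
words_long = re.findall(r"[0-9_A-Za-z]+", sample)

# Octal and hexadecimal escapes name a character by its code (09 is the tab).
tab_by_octal = re.search(r"\011", sample) is not None
tab_by_hex = re.search(r"\x09", sample) is not None

print(literal_hit, wildcard_hit, digits_short, words_short)
```

Note that Python's "\s" also accepts the vertical tab character, which is not in the whitespace set listed above, so that equivalence is not exact outside BareGrep.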
Logical Operators
XY
|
Catenation
Regex X followed by regex Y.
For example:
abc
would match the string "abc".
|
X|Y
|
Alternation
X or Y
For example:
ERROR|FATAL
would match "ERROR" or "FATAL".
|
(?:X)
|
Group
Grouping and operator precedence override. For example:
(?:A|B)(?:C|D)
would match "AC", "BC", "AD" or "BD". Whereas:
A|BC|D
would match "A", "BC", or "D".
|
(X)
|
Capturing group
Grouping and capturing of the regex X.
Capturing causes the string which matched the regex X to be displayed
in a separate column in BareGrep.
Capturing groups also imply operator precedence override. For example:
(A|B)(C|D)
would match "AC", "BC", "AD" or "BD". Whereas:
A|BC|D
would match "A", "BC", or "D".
Note: Using capturing involves a significant performance overhead (the search runs slower),
so it is preferable to use non-capturing groups instead, if capturing is not required.
Nesting of capturing groups can result in regexes which are particularly
slow to execute.
|
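The effect of grouping on precedence can be checked with a small sketch. Python's "re" module is assumed here as a stand-in for BareGrep's engine, and the helper name full_matches is made up for this example. In BareGrep captured text appears in a separate column; in this sketch it is read back with groups().

```python
import re

candidates = ["AC", "BC", "AD", "BD", "A", "D"]

def full_matches(pattern, strings):
    """Return the strings that the pattern matches in their entirety."""
    return [s for s in strings if re.fullmatch(pattern, s)]

# With groups, each alternation is confined to its own parentheses.
grouped = full_matches(r"(?:A|B)(?:C|D)", candidates)
# Without groups, "|" splits the whole regex into A, BC and D.
ungrouped = full_matches(r"A|BC|D", candidates)

# A capturing group also records the text it matched.
match = re.search(r"(ERROR|FATAL): (\S+)", "12:00 FATAL: disk")
captured = match.groups()

print(grouped, ungrouped, captured)
```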
Character Classes
[abc]
|
Character set
A single a, b or c character.
For example:
[0123456789ABCDEFabcdef]
would match any hexadecimal digit character
(in the set "0" to "9", "A" to "F" and "a" to "f").
|
[^abc]
|
Inverse character set
Any character other than a, b or c.
For example:
[^0123456789ABCDEFabcdef]
would match any character which is not a hexadecimal digit character.
|
[a-b]
|
Character set range
A character in the range a to b.
For example:
[0-9_A-Za-z]
would match any word character (in the set "A" to "Z", "a"
to "z", "0" to "9" and "_").
|
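As a quick illustration of the three forms of character class, the sketch below again assumes Python's "re" module in place of BareGrep's engine. The sample string is invented for the example.

```python
import re

text = "0x1F, g7, #"

# Explicit set: every hexadecimal digit is listed one by one.
hex_digits = re.findall(r"[0123456789ABCDEFabcdef]", text)
# Inverse set: everything that is not a hexadecimal digit.
non_hex = re.findall(r"[^0123456789ABCDEFabcdef]", text)
# Ranges express the same set as the explicit list, more compactly.
hex_by_range = re.findall(r"[0-9A-Fa-f]", text)

print(hex_digits, non_hex)
```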
Quantifiers
X*
|
Set closure
The regex X zero or more times.
For example:
.*
would match anything (or nothing, because it may match zero times).
For example:
A\s*=\s*B
would match "A=B", "A = B" or even "A= B" (ignoring whitespace around the "=").
|
X+
|
Positive closure (Kleene plus)
The regex X one or more times.
For example:
\d+
would match a sequence of digits that is at least one character in length.
|
X?
|
Zero or one
The regex X zero or one times.
For example:
\d?
would match zero or one digits only.
|
X{n}
|
Exactly n times
The regex X exactly n times.
For example:
\d{4}
would match exactly 4 digits.
|
X{n,}
|
At least n times
The regex X at least n times.
For example:
\d{4,}
would match 4 or more digits.
|
X{n,m}
|
Between n and m times
The regex X at least n times, but no more than m times.
For example:
\d{4,6}
would match 4, 5 or 6 digits.
|
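The quantifiers can be compared side by side by asking which string lengths each one accepts. This is a sketch using Python's "re" module as an assumed stand-in for BareGrep; accepted_lengths is a hypothetical helper that tests strings of 0 to 7 digits against the whole pattern.

```python
import re

def accepted_lengths(pattern, max_len=7):
    """Return the lengths n (0..max_len) for which n digits match the pattern exactly."""
    return [n for n in range(max_len + 1) if re.fullmatch(pattern, "1" * n)]

one_or_more = accepted_lengths(r"\d+")
zero_or_one = accepted_lengths(r"\d?")
exactly_four = accepted_lengths(r"\d{4}")
at_least_four = accepted_lengths(r"\d{4,}")
four_to_six = accepted_lengths(r"\d{4,6}")

# "\s*" lets whitespace around "=" be present or absent.
inputs = ["A=B", "A = B", "A= B", "A-B"]
assignments = [s for s in inputs if re.fullmatch(r"A\s*=\s*B", s)]

print(four_to_six, assignments)
```

Whole-string matching is used so that the upper bound of "{n,m}" is visible; a plain search would still find a match inside a longer run of digits.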
Assertions
^
|
Start-of-line
The start of a line.
For example:
^Status
would match "Status" only at the start of a line.
|
$
|
End-of-line
The end of a line.
For example:
Status$
would match "Status" only at the end of a line.
|
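The two assertions can be seen at work in the sketch below, which assumes Python's "re" module rather than BareGrep itself. Each string in the list stands for one line of input, mirroring a line-by-line search; the sample lines are invented for the example.

```python
import re

lines = ["Status OK", "Last Status", "Status", "No status here"]

# "^" anchors the match to the start of the line.
starts = [ln for ln in lines if re.search(r"^Status", ln)]
# "$" anchors the match to the end of the line.
ends = [ln for ln in lines if re.search(r"Status$", ln)]

print(starts, ends)
```

If the whole text were searched as a single string in Python, the re.MULTILINE flag would be needed for "^" and "$" to apply at every line rather than only at the start and end of the text.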
Example
Question
Given the following lines:
#Fields: date time c-ip cs-username s-computername s-ip cs-method cs-uri-stem cs-uri-query sc-status sc-bytes cs-bytes time-taken s-port cs(User-Agent)
2005-01-04 00:31:32 10.67.65.57 - VENUS 10.7.40.91 GET /xpedio/ - 302 288 241 0 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:32 10.67.65.57 - VENUS 10.7.40.91 GET /xpedio/login.html - 200 1337 242 125 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:32 10.67.65.57 - VENUS 10.7.40.91 GET /xpedio/images/FAHC/sm_idoclogo2.gif - 200 1898 310 16 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:32 10.67.65.57 - VENUS 10.7.40.91 GET /intradoc-cgi/idc_cgi_isapi.dll IdcService=LOGIN&Action=GetTemplatePage&Page=HOME_PAGE&Auth=Intranet 200 15431 546 141 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:32 10.67.65.57 FAHC\KioskUser VENUS 10.7.40.91 GET /intradoc-cgi/idc_cgi_isapi.dll IdcService=LOGIN&Action=GetTemplatePage&Page=HOME_PAGE&Auth=Intranet 200 23943 768 390 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 FAHC\KioskUser VENUS 10.7.40.91 GET /xpedio/images/xpedio/enthome2.gif - 200 650 494 32 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 - VENUS 10.7.40.91 GET /xpedio/images/xpedio/enthome.gif - 200 662 493 62 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 FAHC\KioskUser VENUS 10.7.40.91 GET /xpedio/images/xpedio/home.gif - 200 523 490 62 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 - VENUS 10.7.40.91 GET /xpedio/images/xpedio/home2.gif - 200 525 491 16 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 FAHC\KioskUser VENUS 10.7.40.91 GET /xpedio/images/xpedio/library2.gif - 200 698 494 47 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 - VENUS 10.7.40.91 GET /xpedio/images/xpedio/library.gif - 200 701 493 31 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 FAHC\KioskUser VENUS 10.7.40.91 GET /xpedio/images/xpedio/search2.pdf - 200 570 493 31 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 - VENUS 10.7.40.91 GET /xpedio/images/xpedio/help2.gif - 200 553 491 31 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
2005-01-04 00:31:33 10.67.65.57 FAHC\KioskUser VENUS 10.7.40.91 GET /xpedio/images/xpedio/search.gif - 200 574 492 16 80 Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)
I need to generate a report with the FAHC\username and the .pdf file
they accessed. I can do either one individually, but not sure how to do both.
Here's the regex that works for the username:
(FAHC\\\S+)
and the regex that works for the .pdf file:
(\S+\.pdf)
but how do I format the "find" field for both?
Answer
In this case, as every line has the same format, I would first try to
construct a regex which matches the entire line.
So I'd start with something like:
\S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+ \S+
Then I'd pick out the columns I'm interested in:
\S+ \S+ \S+ (\S+) \S+ \S+ \S+ (\S+) \S+ \S+ \S+ \S+ \S+ \S+ \S+
You can then refine the sub-regex for the two columns you're
interested in:
\S+ \S+ \S+ FAHC\\(\S+) \S+ \S+ \S+ (\S+\.pdf) \S+ \S+ \S+ \S+ \S+ \S+ \S+
There are various other ways this could be done, but this is the
first way that sprang to mind.
Another way would be:
FAHC\\(\S+).* (\S+\.pdf)
This uses ".*" in the middle, which matches any run of characters (including none), so everything between the username and the .pdf column is skipped.
|
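Both answers can be checked against two of the log lines from the question. The sketch below assumes Python's "re" module in place of BareGrep, and the names AGENT, PREFIX and extract are made up for this example; the two captured groups correspond to the two columns BareGrep would display.

```python
import re

# Two lines from the question, rebuilt from their shared prefix and user agent.
AGENT = ("Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.1;"
         "+.NET+CLR+1.0.3705;+.NET+CLR+1.1.4322)")
PREFIX = "2005-01-04 00:31:33 10.67.65.57"
lines = [
    PREFIX + r" FAHC\KioskUser VENUS 10.7.40.91 GET"
             " /xpedio/images/xpedio/search2.pdf - 200 570 493 31 80 " + AGENT,
    PREFIX + " - VENUS 10.7.40.91 GET"
             " /xpedio/images/xpedio/help2.gif - 200 553 491 31 80 " + AGENT,
]

# Whole-line regex: one "\S+" per column, with the two columns of interest captured.
FULL_LINE = (r"\S+ \S+ \S+ FAHC\\(\S+) \S+ \S+ \S+ (\S+\.pdf)"
             r" \S+ \S+ \S+ \S+ \S+ \S+ \S+")
# Shorter alternative: ".*" skips whatever lies between the two columns.
SHORT = r"FAHC\\(\S+).* (\S+\.pdf)"

def extract(pattern, log_lines):
    """Return (username, pdf_path) for every line the pattern matches."""
    found = (re.search(pattern, ln) for ln in log_lines)
    return [m.groups() for m in found if m]

print(extract(FULL_LINE, lines))
print(extract(SHORT, lines))
```

Only the first line has both a FAHC username and a .pdf path, so it is the only one either regex reports.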