C# Professional - Processing Text

talent-agile

Regular Expressions

Regular expressions are a powerful tool for working with text. They use patterns to apply different operations to text.

With regular expressions, you can:

  • Parse text to find specific character patterns
  • Edit, replace or delete substrings of a text
  • Extract text matching specific character patterns
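
As a rough illustration of these three operations, here is a minimal sketch built on the static Regex.Match, Regex.Replace and Regex.Matches methods. The class and method names (RegexOperations, FindFirstNumber, MaskDigits, ExtractNumbers) are made up for this example, and the pattern syntax it uses ([0-9] and +) is explained in the sections that follow.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class RegexOperations
{
    // Find: returns the first run of digits in the text, or "" if there is none.
    public static string FindFirstNumber(string text) =>
        Regex.Match(text, "[0-9]+").Value;

    // Replace: substitutes '#' for every single digit.
    public static string MaskDigits(string text) =>
        Regex.Replace(text, "[0-9]", "#");

    // Extract: collects every run of digits found in the text, in order.
    public static string[] ExtractNumbers(string text) =>
        Regex.Matches(text, "[0-9]+").Cast<Match>().Select(m => m.Value).ToArray();
}
```

For the input "order 42, item 7", FindFirstNumber returns "42", MaskDigits returns "order ##, item #", and ExtractNumbers returns the two strings "42" and "7".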

Pattern definition

The most basic pattern is a sequence of literal characters: it matches any text that contains exactly that sequence.

// {...}
// This pattern will match any string containing an 'ab' substring
var regularExpression = new Regex("ab");
Console.WriteLine($"IsMatch 'abc': {regularExpression.IsMatch("abc")}");
Console.WriteLine($"IsMatch 'aze': {regularExpression.IsMatch("aze")}");
Console.WriteLine($"IsMatch 'bab': {regularExpression.IsMatch("bab")}");
// {...}

You can define a class of multiple characters using [ and ]. For example, [aeiou] will match one character that can be any vowel. You can use - to include a range of consecutive characters in the class: [a-z] will match any single lowercase letter from a to z.

// {...}
var regularExpression = new Regex("b[aeiou]c");
Console.WriteLine($"IsMatch 'bac': {regularExpression.IsMatch("bac")}");
Console.WriteLine($"IsMatch 'bbc': {regularExpression.IsMatch("bbc")}");
Console.WriteLine($"IsMatch 'bec': {regularExpression.IsMatch("bec")}");
// {...}

Note that regular expressions are case-sensitive by default. The .NET Regex class accepts an option (RegexOptions.IgnoreCase) when creating a new Regex to specify that case should be ignored; however, it is better to state in the pattern itself which cases are accepted.
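
The difference between the two approaches can be sketched as follows. The class and method names (CaseHandling, Default, WithOption, InPattern) are hypothetical; RegexOptions.IgnoreCase is the standard .NET option.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CaseHandling
{
    // Case-sensitive by default: "Hello" does not match the pattern "hello".
    public static bool Default(string text) =>
        new Regex("hello").IsMatch(text);

    // Option passed to the constructor: case is ignored for the whole pattern.
    public static bool WithOption(string text) =>
        new Regex("hello", RegexOptions.IgnoreCase).IsMatch(text);

    // Cases spelled out in the pattern: only the first letter may be 'H' or 'h'.
    public static bool InPattern(string text) =>
        new Regex("[Hh]ello").IsMatch(text);
}
```

With the option, "HELLO" is accepted; with the class [Hh], "Hello" is accepted but "HELLO" is not, so the pattern documents precisely which variations are allowed.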

Here are the most common character classes for simple regular expression patterns.

Pattern   Matching characters
t         Single character t
[aei]     A single character of: a, e or i
[a-z]     A single character in the range from a to z
[^a-z]    A single character not in the range from a to z
\d        A decimal character (digit), equivalent to [0-9] for ASCII text
\w        A word character, equivalent to [a-zA-Z_0-9] for ASCII text
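
A few of these classes in use, as a minimal sketch; the class and method names are invented for the example. The patterns are written as verbatim strings (@"...") so that the backslash in \d and \w does not need to be doubled.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class CharacterClasses
{
    // \d\d : two consecutive digits anywhere in the text.
    public static bool HasTwoDigits(string text) =>
        Regex.IsMatch(text, @"\d\d");

    // [^a-z] : at least one character that is not a lowercase letter from a to z.
    public static bool HasNonLowercase(string text) =>
        Regex.IsMatch(text, "[^a-z]");

    // [a-f]\w : a letter from a to f immediately followed by a word character.
    public static bool HasLetterThenWordChar(string text) =>
        Regex.IsMatch(text, @"[a-f]\w");
}
```

For instance, HasTwoDigits is true for "room 42" and false for "room 4", while HasNonLowercase is false for "abc" and true for "abC".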

For any character, you can use quantifiers to specify how many repetitions of the character should be matched.

Quantifier   Definition
*            Will match zero or more repetitions
?            Will match zero or one repetition
+            Will match one or more repetitions
{N}          Will match exactly N repetitions
{N,}         Will match at least N repetitions
{M,N}        Will match between M and N repetitions

// {...}
// Pattern definition
// b? -> 0 or 1 occurrence of 'b'
// a{2,5} -> between 2 and 5 occurrences of 'a'
// .* -> 0 or more occurrences of any character
var regularExpression = new Regex("b?a{2,5}.*");
Console.WriteLine($"IsMatch 'baaac': {regularExpression.IsMatch("baaac")}");
Console.WriteLine($"IsMatch 'abc': {regularExpression.IsMatch("abc")}");
Console.WriteLine($"IsMatch 'aaaaart': {regularExpression.IsMatch("aaaaart")}");
Console.WriteLine($"IsMatch 'bbbaaaaaty': {regularExpression.IsMatch("bbbaaaaaty")}");
// {...}
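
IsMatch only reports whether the pattern matches somewhere in the text. To see which part of the text the quantifiers actually consumed, it can help to inspect the Match object instead. The sketch below assumes a hypothetical helper named TryMatch; Regex.Match, Match.Success and Match.Value are the standard .NET members.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class QuantifierDemo
{
    // Returns true when the pattern matches and exposes the matched substring.
    public static bool TryMatch(string pattern, string text, out string matched)
    {
        Match match = Regex.Match(text, pattern);
        matched = match.Value; // empty string when the match fails
        return match.Success;
    }
}
```

With the pattern "b?a{2,5}" and the text "aaaaaaa" (seven times 'a'), the matched substring is "aaaaa": the quantifier stops after five repetitions, but the match still succeeds because the pattern is not required to cover the whole text. With the pattern "b?a{2,5}.*" and the text "bbbaaaaaty", the matched substring is "baaaaaty", which starts at the third 'b'.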

You can define anchors to match the beginning or the end of the text or a word.

Anchor   Definition
^        Will match the beginning of the text
$        Will match the end of the text
\b       Will match the boundary of a word (beginning or end)

// {...}
// This pattern will match all text that starts with 'he'
var regularExpression = new Regex("^he");
Console.WriteLine($"IsMatch 'hello': {regularExpression.IsMatch("hello")}");
Console.WriteLine($"IsMatch 'chemical': {regularExpression.IsMatch("chemical")}");
Console.WriteLine($"IsMatch 'he left': {regularExpression.IsMatch("he left")}");
// {...}
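
Anchors are often combined with the character classes and quantifiers described earlier. The sketch below shows two common uses: ^ and $ together to require that the whole text fits the pattern, and \b on both sides to match a whole word. The class and method names are hypothetical.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class AnchorDemo
{
    // ^ and $ together: the whole text must consist of exactly four digits.
    public static bool IsFourDigits(string text) =>
        Regex.IsMatch(text, @"^\d{4}$");

    // \b on both sides: 'he' must appear as a whole word.
    public static bool ContainsWordHe(string text) =>
        Regex.IsMatch(text, @"\bhe\b");
}
```

IsFourDigits accepts "2024" but rejects "20245" and "year 2024". ContainsWordHe accepts "he left" but rejects "hello" and "chemical", where 'he' is only part of a longer word.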