Text Extraction and Processing with Regular Expressions

Conceived in the 1950s, a regular expression (regex or regexp) is a sequence of characters that defines a search pattern, usually used by string-searching algorithms for "find" or "find and replace" operations on strings. Regular expressions are used in search engines, in the search-and-replace dialogs of word processors and text editors, in text-processing utilities such as sed and AWK, and in lexical analysis. Many programming languages provide regex capabilities, either built in or via libraries. (from Wikipedia)

Why would you want to learn regular expressions? Imagine:

⁃ Extracting dates from 1 million lines of text in a few minutes

⁃ Pulling specific text data or tables from PDFs

⁃ Stripping away HTML markup and content to obtain the URLs of the images on a page

⁃ Reformatting phone numbers and email addresses 

⁃ Or inserting a blank line after every 3 lines of text

These are only a few of the things you can do with regular expressions.
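
To make a few of the tasks listed above concrete, here is a minimal sketch in Python using the standard `re` module. The function names, the sample formats (ISO-style `YYYY-MM-DD` dates, 10-digit US-style phone numbers), and the patterns themselves are illustrative assumptions, not material from the class; real-world data usually needs more forgiving patterns, and a proper HTML parser is often a safer choice for web pages.

```python
import re

def extract_dates(text):
    # Assumption: dates are written as YYYY-MM-DD; other formats are ignored.
    return re.findall(r"\b\d{4}-\d{2}-\d{2}\b", text)

def image_urls(html):
    # Capture the quoted src value of each <img> tag (simple pages only).
    pattern = r'<img\b[^>]*?\bsrc=["\']([^"\']+)["\']'
    return re.findall(pattern, html, flags=re.IGNORECASE)

def reformat_phone(text):
    # Assumption: 10-digit numbers with optional parentheses, spaces, dots, or dashes.
    pattern = r"\(?\b(\d{3})\)?[\s.-]?(\d{3})[\s.-]?(\d{4})\b"
    return re.sub(pattern, r"(\1) \2-\3", text)

def blank_after_every_three(text):
    # Match three newline-terminated lines at a time and append an empty line.
    return re.sub(r"((?:[^\n]*\n){3})", r"\1\n", text)
```

For example, `reformat_phone("617.555.0123")` rewrites the hypothetical number as `(617) 555-0123`, and `blank_after_every_three` leaves a trailing group of fewer than three lines untouched.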

This class will introduce you to the syntax of regular expressions and walk you through several hands-on exercises to help you master RegEx Foo!

For additional details, please contact research@hbs.edu.