Day 30 of 100 - Regular Expressions Day 3 of 3


Regular expressions are a concept that I have heard many things about on my coding journey. The two most common things I can recall are (1) they are incredibly powerful and useful, and (2) they are very difficult to understand, write, and master. Up to this point I have relied on string methods for working with text and have gotten by, for the most part. Avoiding regular expressions has also limited the type and complexity of applications that I am able to write and understand, so thanks to this code challenge, I look forward to many new and exciting things to come.

The videos for this challenge were good, but they just scratched the surface. As with the other three-day challenges, they are designed for beginners who are willing to dig into the content and fill in their own knowledge gaps. I appreciate this approach because it lets me dig into the areas I know I need, and covering everything that everyone needs would probably require many more hours of video.

My code for this challenge is three regular expressions that return the expected text from some test strings. The first regular expression is expected to find and return all of the times in the sample text in mm:ss form. For this regular expression, I am looking for a pattern of two digits, followed by a ":", and then two more digits. Using the findall function returns the results in a list, which is the format required.
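To see that pattern in isolation, here is a small sketch (the sample string is mine, shortened from the course listing below):

```python
import re

# \d{2}:\d{2} matches two digits, a colon, then two more digits, e.g. '01:47'.
# findall returns every non-overlapping match as a list of strings.
sample = 'Introduction 1 Lecture 01:47 The Basics 4 Lectures 32:03'
times = re.findall(r'\d{2}:\d{2}', sample)
print(times)  # ['01:47', '32:03']
```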

The second regular expression was more difficult because it required finding both URLs and hashtags in a sample text. The hashtags were not too difficult because the pattern just starts with the hashtag character followed by one or more word characters. Writing a pattern for a URL was harder, and the one I wrote is probably only reliable enough for the sample string. The pattern begins by looking for characters starting with "http://" (for a better pattern I would have to include "https://" with a | to look for one or the other), but then it gets difficult. After the beginning of the URL, you really have to keep matching any character until a space, since letters, numbers, and symbols can all appear in a URL. I tried .* but it was too "greedy" and I found that it continued to match characters past the space at the end of the URL. I had to change it to .*? and then add the lookahead pattern, (?=\s), so it would end on the correct space and not include it. Finally, I had to combine the URL and hashtag patterns using | so the expression would match one pattern or the other.
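A quick sketch of the greedy vs lazy difference (the sample string here is my own, not one of the test strings):

```python
import re

text = 'see http://pybit.es/requests-cache.html #python #APIs'

# Greedy .* runs past the space that ends the URL and swallows the hashtags too.
greedy = re.search(r'http://.*', text).group()

# Lazy .*? plus the lookahead (?=\s) stops at the first whitespace
# without consuming it.
lazy = re.search(r'http://.*?(?=\s)', text).group()

print(greedy)  # 'http://pybit.es/requests-cache.html #python #APIs'
print(lazy)    # 'http://pybit.es/requests-cache.html'
```

One nice side effect of putting the URL branch first in the alternation: when a URL itself contains a "#", the URL branch consumes it before the hashtag branch ever sees it.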

I thought the third pattern would be easier, but the challenge was more subtle. I started by writing a pattern to look for the <p> and </p> tags and grab the text in between using .*. When I tested this regular expression, I expected to have nailed it, but the tests failed because the results included the opening and closing tags, and the challenge was to return only the text between the tags. After some further research, it looked like the lookahead and lookbehind features of regular expressions were the tools I was looking for. I added them to the pattern and it returned the expected results, passing the tests.
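Here is a small before-and-after sketch of that fix (the second paragraph in the sample is mine, added so findall has more than one match):

```python
import re

html = '<p>pybites != greedy</p><p>second paragraph</p>'

# Without lookarounds, the tags are part of each match.
with_tags = re.findall(r'<p>.*?</p>', html)

# (?<=<p>) and (?=</p>) assert the tags are present without consuming them,
# so only the text between the tags is returned.
without_tags = re.findall(r'(?<=<p>).*?(?=</p>)', html)

print(with_tags)     # ['<p>pybites != greedy</p>', '<p>second paragraph</p>']
print(without_tags)  # ['pybites != greedy', 'second paragraph']
```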

import re

COURSE = ('Introduction 1 Lecture 01:47'
          'The Basics 4 Lectures 32:03'
          'Getting Technical!  4 Lectures 41:51'
          'Challenge 2 Lectures 27:48'
          'Afterword 1 Lecture 05:02')
TWEET = ('New PyBites article: Module of the Week - Requests-cache '
         'for Repeated API Calls - http://pybit.es/requests-cache.html '
         '#python #APIs')
HTML = ('<p>pybites != greedy</p>'
        '<p>not the same can be said REgarding ...</p>')
TWEET2 = ('PyBites My Reading List | 12 Rules for Life - #books '
          'that expand the mind! '
          'http://pbreadinglist.herokuapp.com/books/'
          'TvEqDAAAQBAJ#.XVOriU5z2tA.twitter'
          ' #psychology #philosophy')


def extract_course_times(course=COURSE):
    """Return the course timings from the passed in
       course string. Timings are in mm:ss (minutes:seconds)
       format, so taking COURSE above you would extract:
       ['01:47', '32:03', '41:51', '27:48', '05:02']
       Return this list.
    """
    # Two digits, a colon, then two more digits, e.g. '01:47'
    matches = re.findall(r'\d{2}:\d{2}', course)
    return matches


def get_all_hashtags_and_links(tweet=TWEET2):
    """Get all hashtags and links from the tweet text
       that is passed into this function. So for TWEET
       above you need to extract the following list:
       ['http://pybit.es/requests-cache.html',
        '#python',
        '#APIs']
       Return this list.
    """
    # Either a URL (lazy match up to, but not including, the next
    # whitespace) or a hashtag (# followed by one or more word characters)
    matches = re.findall(r'http://.*?(?=\s)|#\w+', tweet)
    return matches


def match_first_paragraph(html=HTML):
    """Extract the first paragraph of the passed in
       html, so for HTML above this would be:
       'pybites != greedy' (= content of first paragraph).
       Return this string.
    """
    # Lookbehind and lookahead keep the <p> and </p> tags out of the match
    matches = re.findall(r'(?<=<p>)(.*?)(?=</p>)', html)
    return matches[0]

With this challenge complete, I can see that regular expressions are powerful and complex tools for working with text. It will be one of the topics I return to once the 100 Days is complete. Since the syntax for regular expressions seems to be the same across multiple programming languages, I found a lot of helpful resources and examples for the specific patterns I was looking for. While this is helpful and could be an excuse not to spend more time on them, I also learned that regular expressions can be tricky: if I don't test them thoroughly, a pattern can match the text I want but also other text. For that reason, using a regular expression that someone else wrote is fine, but I have to understand how it works so I can know whether it will only match the patterns I am looking for. For this challenge, I have to thank Al Sweigart and his 2017 PyCon talk on regular expressions, as well as the chapter in Automate the Boring Stuff, for helping me through this. I also leaned heavily on the excellent videos from Corey Schafer, specifically the Learning Regular Expressions and re Module - How to Write and Match Regular Expressions videos. On to logging!