surl

How does surl recognize links?

Created by Savvas Radevic on 2009-08-28

Keywords:

Last updated by:: Savvas Radevic on 2009-09-02

I have integrated a simple way of recognizing and capturing full URL links, using regular expressions (regex):

    def urlRegexString(self):
        """ Returns the regular expressions string used for catching URLs """
        urlregex = {
            'protocol': '(?:(?:https?|ftp)://)?',
            'byip' : '\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}',
            'bydomain': '\S+\.(?:com|org|net|gov|mil|biz|info|mobi|name|aero|jobs|museum|travel|[a-z]{2})',
            'rest2' : '(?:/\S*|/?(?=\s))',
        }
        urlregex_string = urlregex['protocol'] + \
            '(?:' + \
                urlregex['byip'] + '|' + urlregex['bydomain'] + \
            ')' + \
            urlregex['rest2']
        return urlregex_string

*** NOTE ***
You should be aware of one thing: it doesn't detect whitespace characters (space, enter/new line, etc.). For example, if the link is "http://www.example.com/space and something/" it will detect "http://www.example.com/space".
There is nothing I can do about that right now - the proper link should be "http://www.example.com/space%20and%20something/" (where space use %20).

List all FAQs

surl

How does surl recognize links?

Related questions