Faup stands for Finally An Url Parser and is a library and command line tool to parse URLs and normalize fields with two constraints:
- Work with real-life URLs (resilient to badly formatted ones)
- Be fast: no allocation for string parsing and read characters only once
- Mailing List: libfaup@googlegroups.com
- Documentation: [Library] - [Faup Tool]
What can Faup do for me?
Extract various elements from a URL with no pain. The fields we get are: scheme, credential, subdomain, domain, host, tld, port, resource path, query string and fragment ('#'). Ever dreamed of handing URLs to a command line tool that does this for you? Faup exists!
$ faup -f domain www.github.com
github.com
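To make the field names above concrete, here is a minimal Python sketch that splits a URL with the standard library's `urllib.parse`. It is not faup's implementation and does not call faup. The function name `split_url` and the dictionary keys are assumptions chosen to mirror the field list. The subdomain, domain and tld fields are left out because they need suffix data that the standard library does not provide.

```python
from urllib.parse import urlsplit


def split_url(url):
    """Illustrative only: map urllib.parse results onto faup-like field names."""
    # urlsplit only recognises the host part when a "//" separator is present,
    # so bare inputs such as "www.github.com" get one prepended.
    if "://" not in url:
        url = "//" + url
    parts = urlsplit(url)
    # Everything before the last "@" in the network location is the credential.
    credential = parts.netloc.rpartition("@")[0]
    return {
        "scheme": parts.scheme,
        "credential": credential,
        "host": parts.hostname,
        "port": parts.port,
        "resource_path": parts.path,
        "query_string": parts.query,
        "fragment": parts.fragment,
    }


if __name__ == "__main__":
    fields = split_url("http://user:pw@www.example.co.uk:8080/a/b?x=1#top")
    for name, value in fields.items():
        print(f"{name}: {value}")
```

Note that this sketch only covers the easy part. Telling `example.co.uk` apart from `co.uk` is where the real work is.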
Why Yet Another URL Extraction Library?
Because they all suck. Find a library that can extract, say, a TLD even if you have an IP address, or http://localhost, or anything that may confuse your regex so much that you end up with an unmaintainable one.
You can see all those failures on the regex library webpage here.
Here's a bunch of examples with faup run on various URLs to extract the TLD:
URL | Faup TLD extraction | Comments |
---|---|---|
www.example.co.uk | co.uk | TLD > 1 |
www.example.bl.uk | uk | bl is an exception in uk TLD extraction |
192.168.0.42 | | IPv4 address, no TLD |
www.tagada.42 | 42 | This is not an IP address, 42 is right |
www.example.paris | paris | GTLD extracted smoothly |
حكومة.امارات | امارات | United Arab Emirates IDN ccTLD |
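The logic behind several rows of the table can be sketched in a few lines of Python. This is not faup's code. The three rule sets below are a hypothetical, hand-written miniature in the spirit of a public suffix list, with just enough entries to reproduce the `co.uk`, `bl.uk`, IPv4, `.42` and `.paris` rows, and `extract_tld` is a made-up name.

```python
import ipaddress

# Hypothetical miniature rule set (assumption, not the real suffix data).
SUFFIXES = {"uk", "paris"}   # plain rules: the name itself is a suffix
WILDCARDS = {"uk"}           # "*.uk": any single label under uk is a suffix
EXCEPTIONS = {"bl.uk"}       # "!bl.uk": overrides the wildcard, suffix is uk


def extract_tld(host):
    """Return the suffix of host, or None when host is an IP address."""
    try:
        ipaddress.ip_address(host)
        return None  # IP addresses have no TLD
    except ValueError:
        pass  # not an IP address, carry on with suffix matching

    labels = host.lower().split(".")
    # Walk from the longest candidate to the shortest so the longest rule wins.
    for i in range(len(labels)):
        candidate = ".".join(labels[i:])
        parent = ".".join(labels[i + 1:])
        if candidate in EXCEPTIONS:
            return parent
        if candidate in SUFFIXES or parent in WILDCARDS:
            return candidate
    # No rule matched: fall back to the last label.
    return labels[-1]


if __name__ == "__main__":
    for host in ("www.example.co.uk", "www.example.bl.uk", "192.168.0.42",
                 "www.tagada.42", "www.example.paris"):
        print(host, "->", extract_tld(host))
```

The IP check comes first on purpose: `www.tagada.42` is not a valid address, so it falls through to suffix matching and ends up with `42`, while `192.168.0.42` is rejected before any suffix rule is consulted.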
How fast?
We ran a bunch of tests against a few libraries and a regex. The regex used was this one:
^(http|https|ftp)\://([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)*((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\.(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])|localhost|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\.(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum|[a-zA-Z]{2}))(\:[0-9]+)*(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$
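To get a feel for what this regex accepts, here is a small Python sketch that compiles it with the standard `re` module and tries it on a few inputs. This is not the benchmark harness, only a quick check of the pattern's behaviour. The pattern is split across several string literals purely for readability, and the helper name `regex_accepts` is an assumption.

```python
import re

# The regex from above, split into adjacent raw strings for readability.
PATTERN = (
    r"^(http|https|ftp)\://"
    r"([a-zA-Z0-9\.\-]+(\:[a-zA-Z0-9\.&%\$\-]+)*@)*"
    r"((25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9])\."
    r"(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\."
    r"(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[1-9]|0)\."
    r"(25[0-5]|2[0-4][0-9]|[0-1]{1}[0-9]{2}|[1-9]{1}[0-9]{1}|[0-9])"
    r"|localhost"
    r"|([a-zA-Z0-9\-]+\.)*[a-zA-Z0-9\-]+\."
    r"(com|edu|gov|int|mil|net|org|biz|arpa|info|name|pro|aero|coop|museum"
    r"|[a-zA-Z]{2}))"
    r"(\:[0-9]+)*"
    r"(/($|[a-zA-Z0-9\.\,\?\'\\\+&%\$#\=~_\-]+))*$"
)
URL_RE = re.compile(PATTERN)


def regex_accepts(url):
    """True when the whole URL matches the pattern."""
    return URL_RE.match(url) is not None


if __name__ == "__main__":
    samples = (
        "http://www.github.com",
        "http://192.168.0.42",
        "http://localhost",
        "www.github.com",            # no scheme
        "http://www.example.paris",  # TLD outside the hard-coded list
    )
    for url in samples:
        print(url, "->", regex_accepts(url))
```

The last two samples are rejected: the pattern insists on a scheme, and its TLD list is hard-coded to a handful of names plus any two-letter code, so newer TLDs such as `.paris` never match.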
This is the result graph against 1 million URLs:
Building faup
To get and build faup, you need cmake. As cmake doesn't allow building the binary in the source directory, you have to create a separate build directory.
git clone https://github.com/stricaud/faup.git
cd faup
mkdir build
cd build
cmake .. && make
sudo make install