Sunday, June 28, 2009

Adventures in Prolog – A Simple Tokenizer

Today’s post presents a simple tokenizer that I wrote during the last couple of weeks (not full-time, of course). The tokenizer’s goal is, naturally, to tokenize a provided sentence or input sequence of characters. For example, the following sentence:

“This is my world!”

should be split up into individual descriptors/tokens, each covering the respective part of the string, such as:

Word(This), Space(5), Word(is), Space(8), Word(my), Space(11), Word(world), EndOfSentence(!)

Now, given that the position of a space can be computed from the length of the character sequence preceding it plus one, we do not need to store the Space tokens explicitly. Also, to avoid a verbose representation we might shorten the names of the individual tokens a bit. Since Prolog reserves identifiers starting with an upper-case character for variables, we change the casing of the token names too. With this, the above token stream collapses to:

w(This), w(is), w(my), w(world), eos(!)
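
The space-position bookkeeping can be sketched as follows (a hypothetical helper for illustration, not part of the tokenizer’s actual code):

```prolog
% Hypothetical sketch: the position of the space following a token can be
% recomputed from the token's start offset and its length, so space
% tokens need not be stored explicitly.
space_after(Start, Word, SpacePos) :-
    atom_length(Word, Len),
    SpacePos is Start + Len.
```

For instance, with "This" starting at position 1, space_after(1, 'This', P) yields P = 5, which matches the Space(5) token above.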

Now, what difference does it make whether this is written with a capital T or not? Clearly, it marks the beginning of a sentence, but since we know that it is at the beginning of the sentence, we can change the casing there. Transforming everything to lower case also simplifies the implementation of the tokenizer somewhat, since we do not need to deal with different representations of a single character. Therefore, all characters are transformed to lower case, and the above sequence of tokens reduces to:

w(this), w(is), w(my), w(world), eos(!)
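
The lower-casing step can be sketched using SWI-Prolog’s built-in downcase_atom/2 (just an illustration; the tokenizer’s actual normalization code may work on character codes instead, and normalize_word/2 is a made-up name):

```prolog
% Normalize a word token to lower case. normalize_word/2 is a
% hypothetical name; it is not taken from the tokenizer's source.
normalize_word(w(Word), w(Lower)) :-
    downcase_atom(Word, Lower).
```

For example, normalize_word(w('This'), T) yields T = w(this).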


Adding a complement to the end-of-sentence indicator, i.e. a beginning-of-sentence marker, marks each distinct sentence, so that the above token stream looks like:

bos(i), w(this), w(is), w(my), w(world), eos(!)

The i in the bos(i) token identifies the i-th sentence in a set of sentences. As you may see from the example above, there are a few situations one has to deal with in order to tokenize real-world sentences. Therefore, the tokenizer is spread across several Prolog files, each covering a separate aspect or token type. The tokens covered are:

  1. Words
    Description: A sequence of letters, possibly containing "-" (hyphen) or "'" (apostrophe).
    Example: Hello, world, top-down, Mike's, Andreas', don't
  2. Simple Numbers (integer, float)
    Description: A sequence of digits, possibly containing a "." (dot).
    Example: 1234, 12.343
  3. Names, Quantities and Abbreviations
    Description: Special words, stored in a name/quantity/abbreviation map.
    Example: 5m (five meters), MS (Microsoft)
  4. Unknown and unrecognized character streams
    Description: A sequence of arbitrary characters terminated by a " " (space).
    Example: 34sd=+
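
As an illustration of how such token rules might look in Prolog, here is a hypothetical DCG sketch of the word token (names and structure are made up; this is not the tokenizer’s actual code):

```prolog
% Hypothetical DCG sketch: a word is a non-empty sequence of letters,
% possibly containing hyphens or apostrophes (as in top-down or don't).
word(w(Word)) -->
    word_codes(Codes),
    { atom_codes(Word, Codes) }.

word_codes([C|Cs]) --> word_code(C), word_codes(Cs).
word_codes([C])    --> word_code(C).

word_code(C)    --> [C], { code_type(C, alpha) }.
word_code(0'-)  --> [0'-].   % hyphen, as in top-down
word_code(0'\') --> [0'\'].  % apostrophe, as in Mike's
```

A query such as phrase(word(T), `top-down`) would then bind T to w('top-down').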

Please note that in order to be identified, a sentence must be properly terminated with one of the following: '.' (dot), '?' (question mark) or '!' (exclamation mark). Other types of (sub)sentence termination, such as '-' (dash), ';' (semicolon) or ',' (comma), are handled as sentence-fragment terminators. If you want to use the tokenizer, you may tokenize either a file containing sentences or a string. For this, you may use one of the following predicates:

  • tokenize_string(+InString, -OutTokenList)
  • tokenize_file(+InFile, -OutTokenList)

For example, you might ask:

3 ?- tokenize_string("this is my world!", X).
X = [[bos(0), w(this), w(is), w(my), w(world), eos(!)]].

Or you might ask:

tokenize_file('Path to file', X).

The Prolog source code of the tokenizer can be found here; it is stored as a zip container. Inside the zip file you will find the following:

  • Prolog source files (extension .pl)
  • Directory "Corpora" with two text files containing some text to test the tokenizer on

Please note that I do not claim robustness or coverage of 99% of all possible sentences. Also, performance has not been a major concern for me; hence you will not find the use of difference lists or optimized database access. This might be a future task as well; in fact, it is one of the many open points. However, in case you have found a bug or an error, or have a recommendation, please do contact me. Currently, I am trying to add features to the tokenizer and to make it more stable. Documentation and code cleanup are also items on the to-do list.

Sunday, June 07, 2009

Adventures in Prolog - LinkList (2)

Some time ago, I wrote a small tokenizer, which I will present in one of the upcoming posts. For now, however, this post is another link to some Prolog tutorials:

http://www.intranet.csupomona.edu/~jrfisher/www/prolog_tutorial/intro.html

All posted links, and links that I will post in the future, will be collected on my Prolog link site at:

http://sites.google.com/site/bittermanpage/Home/prolog-links

This link will be available in the side bar as well.

Thursday, June 04, 2009

Adventures in Prolog - LinkList

Here is a link to the Prolog directory of CMU's AI group:

http://www.cs.cmu.edu/Groups/AI/lang/prolog/