Adventures in Prolog – A Simple Tokenizer
Today’s post presents a simple tokenizer that I have written over the last couple of weeks (not full-time, of course). The tokenizer’s goal is to tokenize a provided sentence or input sequence of characters. For example, the following sentence:

“This is my world!”

should be split up into individual descriptors/tokens of the respective string part they cover, such as:

Word(This), Space(5), Word(is), Space(8), Word(my), Space(11), Word(world), EndOfSentence(!)

Now, given that the position of a space can be computed by taking the length of the character sequence that precedes it and adding one, we might not need to store the Space explicitly. Also, to avoid a verbose representation of tokens, we might shorten the names of the individual tokens a bit. Since Prolog treats names that start with an upper case letter as variables, we might lower-case the token names too. For example, the above token stream might be collapsed to:

w(This), w(is), w(my), w(world), eos(!)

Now,...
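To make the collapsed token stream a bit more concrete, here is a minimal sketch of how such a tokenizer could be written as a DCG. It assumes SWI-Prolog, and the predicate names (tokenize/2, tokens//1, word_chars//1, eos_mark/1) are made up for this illustration; they are not necessarily the ones used in the actual tokenizer. The set of end-of-sentence marks beyond “!” is also an assumption. Words are returned as atoms, so a capitalised word shows up quoted ('This') instead of being read as a variable.

```prolog
% Minimal sketch, assuming SWI-Prolog. All predicate names are illustrative.

% tokenize(+Text, -Tokens)
% Text is an atom; Tokens is a list of w(Word) and eos(Mark) terms.
% Spaces are consumed but not stored as tokens.
tokenize(Text, Tokens) :-
    atom_chars(Text, Chars),
    phrase(tokens(Tokens), Chars).

% tokens//1: walk the character list and emit one token per word or mark.
tokens([]) --> [].
tokens(Ts) --> [' '], !, tokens(Ts).                    % skip a space
tokens([eos(M)|Ts]) --> [M], { eos_mark(M) }, !, tokens(Ts).
tokens([w(W)|Ts]) -->
    word_chars(Cs),
    { Cs = [_|_], atom_chars(W, Cs) },                  % at least one character
    tokens(Ts).

% word_chars//1: greedily collect consecutive word characters.
word_chars([C|Cs]) --> [C], { char_type(C, alpha) }, !, word_chars(Cs).
word_chars([]) --> [].

% eos_mark(?Char): characters treated as end-of-sentence marks (assumed set).
eos_mark('!').
eos_mark('.').
eos_mark('?').
```

With this sketch, the query `?- tokenize('This is my world!', Tokens).` binds Tokens to `[w('This'), w(is), w(my), w(world), eos(!)]`. Any character that is neither a space, a word character, nor one of the assumed marks makes the query fail, which keeps the example short but is clearly too strict for real input.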