“This is my world!”
Should be split up into individual descriptors (tokens), one for each part of the string they cover, such as:
Word(This), Space(5), Word(is), Space(8), Word(my), Space(11), Word(world), EndOfSentence(!)
Since the position of a space can be computed by taking the length of the text that precedes it and adding one, we do not need to store the Space tokens explicitly. To keep the representation compact, we can also shorten the token names. And since Prolog reserves names that start with an upper case character for variables, we write the token names in lower case. For example, the above token stream can be collapsed to:
w(This), w(is), w(my), w(world), eos(!)
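To illustrate why the Space tokens carry no extra information, here is a minimal sketch that recomputes their positions from the words alone. It assumes SWI-Prolog, exactly one space between words, and 1-based positions; the predicate name space_positions/2 is hypothetical and not part of the tokenizer.

```prolog
% Hypothetical helper, not taken from the tokenizer's sources.
% space_positions(+Words, -Positions): 1-based positions of the single
% spaces that separate Words when they are joined into one sentence.
space_positions(Words, Positions) :-
    space_positions_(Words, 0, Positions).

space_positions_([], _, []).
space_positions_([_], _, []) :- !.        % no space after the last word
space_positions_([W|Ws], Offset, [Pos|Ps]) :-
    atom_length(W, Len),                  % length of the current word
    Pos is Offset + Len + 1,              % preceding text plus one
    space_positions_(Ws, Pos, Ps).

% ?- space_positions(['This', is, my, world], P).
% P = [5, 8, 11].
```

The result matches Space(5), Space(8) and Space(11) from the verbose token stream above.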
Now, what difference does it make whether "this" is written with a capital T or not? It is capitalized only because it is at the beginning of a sentence, and since we know that already, we can change the casing here. Transforming everything to lower case also simplifies the implementation of the tokenizer somewhat, since we don’t need to deal with two representations of the same character. For this, all characters are transformed to lower case and the above sequence of tokens reduces to:
w(this), w(is), w(my), w(world), eos(!)
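There is also a Prolog-specific reason to prefer lower case: an unquoted This is read as a variable, so the capitalized word would have to be written as w('This'). A minimal sketch of the conversion, assuming SWI-Prolog and using the hypothetical helper name word_token/2 (not part of the tokenizer):

```prolog
% Hypothetical helper, not taken from the tokenizer's sources.
% word_token(+Text, -Token): lower-case Text and wrap it in w/1.
word_token(Text, w(Word)) :-
    downcase_atom(Text, Word).            % result is a lower-case atom

% ?- word_token("This", T).
% T = w(this).
%
% ?- maplist(word_token, ["This", "is", "my", "world"], Ts).
% Ts = [w(this), w(is), w(my), w(world)].
```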
Adding a counterpart to the end of sentence indicator, i.e. a beginning of sentence indicator, marks each distinct sentence, such that the above token stream looks like:
bos(i), w(this), w(is), w(my), w(world), eos(!)
The i in the bos() token identifies the i-th sentence in a set of sentences. As you can see from the above example, there are a few situations one has to deal with in order to tokenize real-world sentences. For this reason, the tokenizer is spread across several Prolog files, each covering a separate aspect or token. Tokens covered are:
- Words
Description: A sequence of letters, possibly containing “-” (hyphen) or “’” (apostrophe).
Example: Hello, world, top-down, Mike’s, Andreas’, don’t
- Simple Numbers (integer, float)
Description: A sequence of digits, possibly containing a “.” (dot).
Example: 1234, 12.343
- Names, Quantities and Abbreviations
Description: Special words, stored in a name, quantity, or abbreviation map.
Example: 5m (five meters), MS (Microsoft)
- Unknown and unrecognized character streams
Description: A sequence of arbitrary characters terminated by a “ ” (space).
Example: 34sd=+
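The token classes above can be sketched as a small classifier for a single space-free chunk. This is only an illustration under stated assumptions, not the tokenizer's actual code: it assumes SWI-Prolog, handles ASCII letters only, leaves out the name, quantity and abbreviation map, and the token shapes n/1 and unknown/1 as well as all predicate names are hypothetical.

```prolog
% Hypothetical sketch, not taken from the tokenizer's sources.
letter(C) :- C >= 0'a, C =< 0'z, !.
letter(C) :- C >= 0'A, C =< 0'Z.

digit(C) :- C >= 0'0, C =< 0'9.

word_code(C) :- letter(C), !.
word_code(45) :- !.                       % "-" (hyphen)
word_code(39) :- !.                       % ASCII apostrophe
word_code(0x2019).                        % typographic apostrophe

% A word consists of word characters and contains at least one letter.
is_word(Codes) :-
    maplist(word_code, Codes),
    once(( member(C, Codes), letter(C) )).

all_digits([C|Cs]) :- maplist(digit, [C|Cs]).

% A number is a run of digits, optionally with one "." (code 46) inside.
is_number(Codes) :-
    (   append(Int, [46|Frac], Codes)
    ->  all_digits(Int), all_digits(Frac)
    ;   all_digits(Codes)
    ).

% classify(+Chunk, -Token): Chunk is a string without spaces.
classify(Chunk, Token) :-
    string_codes(Chunk, Codes),
    (   is_word(Codes)
    ->  downcase_atom(Chunk, Word), Token = w(Word)
    ;   is_number(Codes)
    ->  number_codes(Number, Codes), Token = n(Number)
    ;   Token = unknown(Chunk)
    ).

% ?- classify("top-down", T).   % T = w('top-down').
% ?- classify("12.343", T).     % T = n(12.343).
% ?- classify("34sd=+", T).     % T = unknown("34sd=+").
```

Checking for a word first and a number second means a chunk that is neither falls through to the unknown class, which mirrors the order of the list above.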
Please note that a sentence is only identified if it is properly terminated with one of the following: ‘.’ (dot), ‘?’ (question mark) or ‘!’ (exclamation mark). Other types of (sub)sentence termination, such as ‘-’ (hyphen), ‘;’ (semicolon) and ‘,’ (comma), are handled as sentence fragment terminators.
If you want to use the tokenizer, you may tokenize either a string or a file containing sentences. For this, you may use either of the following predicates:
tokenize_string(+InString, -OutTokenList)
tokenize_file(+InFile, -OutTokenList)
For example, you might ask:
3 ?- tokenize_string("this is my world!", X).
X = [[bos(0), w(this), w(is), w(my), w(world), eos(!)]].
Or you might ask:
?- tokenize_file('Path to file', X).
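To give a rough idea of how a token list of this shape can be produced, here is a minimal sketch, assuming SWI-Prolog. It is not the tokenizer's implementation: the name sketch_tokenize/2 and its helpers are hypothetical, every chunk is treated as a word, only ‘.’, ‘?’ and ‘!’ are recognized as terminators (so a dot inside a number would wrongly end a sentence), and text without a terminator is dropped.

```prolog
% Hypothetical sketch, not taken from the tokenizer's sources.
terminator('.').
terminator('?').
terminator('!').

% sketch_tokenize(+Text, -Sentences): one token list per sentence.
sketch_tokenize(Text, Sentences) :-
    string_chars(Text, Chars),
    sentences(Chars, 0, Sentences).

% Cut the text at the first terminator and number the sentence with I.
sentences(Chars, I, [Sentence|Rest]) :-
    append(Body, [T|After], Chars),
    terminator(T),
    !,
    words(Body, Words),
    append([bos(I)|Words], [eos(T)], Sentence),
    I1 is I + 1,
    sentences(After, I1, Rest).
sentences(_, _, []).                      % unterminated rest is dropped

% Split on spaces, skip empty parts, lower-case each word.
words(Chars, Words) :-
    string_chars(String, Chars),
    split_string(String, " ", " ", Parts),
    exclude(==(""), Parts, NonEmpty),
    maplist(word_token, NonEmpty, Words).

word_token(Part, w(Word)) :-
    downcase_atom(Part, Word).

% ?- sketch_tokenize("this is my world!", X).
% X = [[bos(0), w(this), w(is), w(my), w(world), eos(!)]].
```

For the simple example above, the sketch yields the same list as tokenize_string/2; for real text the actual tokenizer covers more cases, such as numbers and sentence fragment terminators.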
The Prolog source code of the tokenizer can be found here. It is packaged as a zip archive, which contains the following:
- Prolog source files (extension pl)
- Directory “Corpora” with two text files containing some text to test the tokenizer on.
Please note that I do not claim that the tokenizer is robust or that it covers 99% of the possible sentences. Also, performance has not been a major concern for me, so you will not find difference lists or optimized database access. This might be a future task and is one of the many open points. However, if you have found a bug or an error, or have a recommendation, please do contact me. Currently, I am trying to add features to the tokenizer and make it more stable. Documentation and code cleanup are also items on the to-do list.