GSoC Report Week 4-5: Improving search function and adding tests

I continued to work on search with translation, wrote tests for different cases, and reused the existing tests for the translation function to test search in all of the supported languages. This revealed some problems that I needed to fix.

First of all, I needed to take date-related abbreviations into account at the sentence splitting stage. For example, Swedish "fredag d. 3 september 2014" (meaning “Friday, September 3, 2014”) is a single date chunk, and we don’t want it to be split at the full stop. So all words with a full stop from the Language object’s dictionary are automatically gathered and added to the sentence splitting regex.
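The idea can be sketched roughly as follows. This is only an illustration under my own assumptions, not the actual implementation: the function names `build_sentence_splitter` and `split_sentences` are hypothetical, and the abbreviation list is passed in by hand instead of being gathered from the Language object.

```python
import re


def build_sentence_splitter(abbreviations):
    """Build a regex that splits on whitespace after '.', '!' or '?',
    except when the full stop belongs to a known abbreviation."""
    # One fixed-width negative lookbehind per abbreviation, because
    # Python's re module rejects variable-width lookbehinds.
    guards = "".join(
        r"(?<!\b{})".format(re.escape(abbr))
        for abbr in sorted(set(abbreviations))
    )
    return re.compile(guards + r"(?<=[.!?])\s+", re.IGNORECASE)


def split_sentences(text, abbreviations):
    """Split text into sentences, keeping abbreviations intact."""
    splitter = build_sentence_splitter(abbreviations)
    return [part for part in splitter.split(text) if part]


# Hypothetical input: "d." stands in for an abbreviation taken
# from the language dictionary.
sentences = split_sentences(
    "Vi ses fredag d. 3 september 2014. Det blir bra.", ["d."]
)
```

Here the text is split after "2014." but not after "d.", so the date chunk stays in one sentence.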

The other difficulty related to sentence splitting is that in some languages a full stop can also follow a digit token within a single date expression: "1. tammikuuta, 2016" (Finnish for “January 1, 2016”). There are two ways of solving this problem: treating these cases as abbreviations at the sentence splitting stage, or dealing with them at the parsing stage. The main problem with treating them as abbreviations is that a date expression ending in digits at the end of a sentence would then not be separated from a date at the beginning of the next sentence (as in “Game 1 is July 12, 2017. July 13th is the date of the second game.”), and such behavior is undesirable.
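A small experiment makes the trade-off visible. The two patterns below are simplified stand-ins that I wrote for this illustration, not the regexes used in the project: the first splits after any sentence-ending punctuation, the second additionally refuses to split after a digit followed by a full stop.

```python
import re

# Variant A: split after any sentence-ending punctuation.
PLAIN = re.compile(r"(?<=[.!?])\s+")
# Variant B: same, but treat "<digit>." like an abbreviation.
DIGIT_GUARDED = re.compile(r"(?<!\d\.)(?<=[.!?])\s+")

finnish = "1. tammikuuta, 2016"
english = "Game 1 is July 12, 2017. July 13th is the date of the second game."

# Variant A breaks the Finnish date in two pieces.
plain_fi = PLAIN.split(finnish)
# Variant B keeps the Finnish date together...
guarded_fi = DIGIT_GUARDED.split(finnish)
# ...but no longer separates the two English sentences.
guarded_en = DIGIT_GUARDED.split(english)
plain_en = PLAIN.split(english)
```

With the guard, the Finnish date survives as one piece, but the two English sentences are glued together, so both dates end up in the same chunk.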

Secondly, there remains a problem related to tokenization. It occurs when applying the simplify function, which replaces some natural language expressions with their synonyms from the Language object’s dictionary. For example, "midnight" turns into "00:00". The difficulty is that this function may change the number of tokens in the sentence, which makes it hard to align tokens from the simplified sentence with tokens from the original one. This alignment seems necessary because, in my opinion, the search function should return substrings of the original text and not modified versions of them. So I added a function that aligns tokens from the original sentence with tokens from its simplified version, so that both have an equal number of tokens, with missing ones replaced by an empty string.
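One possible way to do such an alignment is sketched below. It is not the function from the project: it uses `difflib.SequenceMatcher` from the standard library as a swapped-in technique, the name `align_tokens` is hypothetical, and the token lists are invented for the example (I am only assuming that simplification can turn one token into several).

```python
from difflib import SequenceMatcher


def align_tokens(original, simplified):
    """Pad both token lists with empty strings so that they have the
    same length and matching tokens share the same index."""
    aligned_original, aligned_simplified = [], []
    matcher = SequenceMatcher(a=original, b=simplified, autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        left, right = original[i1:i2], simplified[j1:j2]
        # Within each block, pad the shorter side with empty strings.
        width = max(len(left), len(right))
        aligned_original.extend(left + [""] * (width - len(left)))
        aligned_simplified.extend(right + [""] * (width - len(right)))
    return aligned_original, aligned_simplified


# Invented example: one original token becomes three simplified tokens.
original_tokens = ["meet", "at", "noon", "tomorrow"]
simplified_tokens = ["meet", "at", "12", ":", "00", "tomorrow"]
aligned = align_tokens(original_tokens, simplified_tokens)
```

After alignment both lists have six entries, and the non-empty entries of the first list are still the original tokens, so the original substring can be recovered.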

Overall, I can say that I’ve accomplished almost all of what was planned for this period, and the remaining small issues will be fixed very soon.
