GSoC Report Week 2-3: Translation approach to search



These weeks were dedicated to the exploration of possible ways to implement search.

I tried two different approaches: search with regular expressions and the translation approach, which was recommended by my mentors and seems more typical for dateparser.

In dateparser, every supported language is represented as an object of the Language class, which holds all the information about that language, including a dictionary of words and expressions related to date and time. To parse a date string, dateparser first translates it to English using this dictionary and a set of simplification rules. So the main idea of search with translation is to go through the given text word by word and drop every word that has no translation and contains no digits.
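The filtering idea can be sketched in a few lines of Python. This is only an illustration: `TOY_DICTIONARY` and `keep_date_like_words` are hypothetical names made up for this post, and the tiny dictionary stands in for the real language data, which is much larger and comes with simplification rules.

```python
# Hypothetical stand-in for a language dictionary (Spanish -> English).
# It is NOT dateparser's real data, just enough for the example.
TOY_DICTIONARY = {
    "enero": "january",
    "lunes": "monday",
    "ayer": "yesterday",
    "de": "of",
}


def keep_date_like_words(text, dictionary):
    """Keep words that have a translation or contain digits; drop the rest."""
    kept = []
    for word in text.split():
        # Strip surrounding punctuation so "enero," still matches "enero".
        token = word.strip(".,;:!?").lower()
        has_digit = any(ch.isdigit() for ch in token)
        if token in dictionary or has_digit:
            # Translate when possible, otherwise keep the numeric token as is.
            kept.append(dictionary.get(token, token))
    return kept


print(keep_date_like_words("Llegamos el 15 de enero, por la tarde.", TOY_DICTIONARY))
```

Words such as "Llegamos" or "tarde" are dropped because the toy dictionary has no entry for them, while "15" survives because it contains digits.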

The first problem was to separate dates from one another when a text contains several of them. For the moment, the solution is to first split the original text into sentences and then, within each sentence, to split date chunks wherever there is a word with no translation or a ' - ' (hyphen with spaces) between them.
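Here is a minimal sketch of that chunking step for a single sentence. The names `KNOWN_WORDS` and `split_date_chunks` are hypothetical, and the word set is a toy replacement for a real language dictionary; the actual implementation may differ.

```python
# Hypothetical set of date-related words; a real dictionary is far larger.
KNOWN_WORDS = {"january", "february", "monday", "of"}


def split_date_chunks(sentence, known_words):
    """Group consecutive date-like words; break on unknown words or ' - '."""
    chunks, current = [], []
    for word in sentence.split():
        token = word.strip(".,;:!?").lower()
        is_date_like = token in known_words or any(ch.isdigit() for ch in token)
        # A standalone "-" is the hyphen with spaces on both sides.
        if word == "-" or not is_date_like:
            if current:
                chunks.append(" ".join(current))
                current = []
        else:
            current.append(token)
    if current:
        chunks.append(" ".join(current))
    return chunks


print(split_date_chunks("We work from 10 January - 12 February and then rest.", KNOWN_WORDS))
```

Each returned chunk is a candidate date string that can then be handed to the parser separately.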

Separating sentences isn’t a trivial task, because sentence delimiters vary from language to language and this had to be taken into account. As a result, all supported languages were divided into several groups depending on their punctuation system, and each group was assigned a regular expression that defines how the text should be split into sentences. (Everything concerning this can be found in this PR).
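To give an idea of what such grouping might look like, here is a rough sketch. The group names, the language-to-group mapping and the regular expressions below are my own illustrative assumptions, not the ones from the PR. It relies on `re.split` handling patterns that can match an empty string, which works in Python 3.7 and later.

```python
import re

# Illustrative groups only: each maps to a regex matching a sentence boundary.
SENTENCE_SPLITTERS = {
    # Terminator followed by whitespace, so "5.10.2017" is not split.
    "latin_like": r"(?<=[.!?])\s+",
    # Full-width terminators are often not followed by any space.
    "cjk": r"(?<=[。！？])\s*",
    # The danda is used as a full stop in Hindi.
    "devanagari": r"(?<=[।!?])\s+",
}

# Hypothetical assignment of language codes to groups.
LANGUAGE_GROUPS = {
    "en": "latin_like",
    "es": "latin_like",
    "zh": "cjk",
    "ja": "cjk",
    "hi": "devanagari",
}


def split_sentences(text, language):
    """Split text into sentences using the regex of the language's group."""
    group = LANGUAGE_GROUPS.get(language, "latin_like")
    parts = re.split(SENTENCE_SPLITTERS[group], text)
    return [part.strip() for part in parts if part.strip()]


print(split_sentences("We met on 5.10.2017. See you tomorrow!", "en"))
print(split_sentences("今天是5月3日。明天见！", "zh"))
```

Requiring whitespace after the terminator in the Latin-like group is a deliberate choice here: it keeps dotted numeric dates in one piece, at the cost of missing sentences that are not separated by a space.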

Another difficulty is to distinguish the numeric parts of date expressions from plain numbers that may also occur in the text. A possible solution is to check whether substrings that contain digits fit known numeric date patterns.
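A small sketch of such a check is shown below. The list of patterns is an assumption made for illustration (a few common numeric formats), and `looks_like_numeric_date` is a hypothetical helper, not part of dateparser.

```python
import re

# Illustrative patterns only; a real list would need to cover more formats.
NUMERIC_DATE_PATTERNS = [
    re.compile(r"\d{1,2}[./-]\d{1,2}[./-]\d{2,4}"),  # 12.05.2017, 12/05/17
    re.compile(r"\d{4}-\d{1,2}-\d{1,2}"),            # 2017-05-12
    re.compile(r"\d{1,2}:\d{2}(:\d{2})?"),           # 18:30, 18:30:15
]


def looks_like_numeric_date(token):
    """Return True if the whole token matches one of the known patterns."""
    return any(pattern.fullmatch(token) for pattern in NUMERIC_DATE_PATTERNS)


for token in ["12.05.2017", "2017-05-12", "18:30", "3.14159", "42"]:
    print(token, looks_like_numeric_date(token))
```

Note that a bare number such as "42" is rejected by this check on its own, so numbers like the day in "15 January" would still have to be accepted based on the translated words around them.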

In sum, we concluded that the translation approach (improved with the help of regular expressions) should work well and should be the main strategy for search in this project.
