GSoC 2017 Final Evaluation
Project overview
My aim for this Summer of Code was to add support to dateparser for searching for and parsing date expressions in large chunks of text. The work resulted in a function that detects the language of a string, finds all substrings that represent dates, parses them, and returns a list of tuples, each pairing a substring with the corresponding datetime.datetime object.
The work went through the following stages:
- Implementing search based on translating the string using the words from the language dictionary.
- Figuring out how to properly split dates from one another. In my approach, sentence boundaries and words that aren’t present in the dictionary are treated as splitters. All the language files were updated with information on the sentence splitters used in each language. Dates that are still not separated from one another are then split on each of the characters from a list (which includes whitespace, the comma and similar symbols) using several different methods (at every splitter, at every second splitter, etc.). The best of the possible splits is chosen by the number of resulting valid date expressions.
- Implementing another language detector that is able to work with large chunks of text. It uses the count of words from the language dictionary as well as the presence of characters that are also gathered from the dictionary. If a new language is added to dateparser, it should automatically become available to the detector.
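The split selection from the second stage can be illustrated with a minimal sketch. This is not the code from the pull request: the splitter list, the toy `parse_date` helper (built on `datetime.strptime` with a few fixed formats instead of dateparser itself) and the names `split_every_nth` and `best_split` are all assumptions made for illustration.

```python
from datetime import datetime

# Assumed splitter characters; the real list lives in the project code.
SPLITTERS = [" ", ",", ";"]
# Toy formats standing in for the real parser (English month names, C locale).
FORMATS = ["%d %B %Y", "%B %Y", "%Y-%m-%d"]


def parse_date(text):
    """Hypothetical stand-in for the parser: return a datetime or None."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except ValueError:
            continue
    return None


def split_every_nth(text, splitter, n):
    """Split text at every n-th occurrence of splitter."""
    pieces = text.split(splitter)
    groups = [splitter.join(pieces[i:i + n]) for i in range(0, len(pieces), n)]
    return [group.strip() for group in groups if group.strip()]


def best_split(text, max_step=3):
    """Try each splitter with each step and keep the split with most valid dates."""
    best = [text]
    best_score = 1 if parse_date(text) else 0
    for splitter in SPLITTERS:
        for n in range(1, max_step + 1):
            candidate = split_every_nth(text, splitter, n)
            score = sum(1 for part in candidate if parse_date(part))
            if score > best_score:
                best, best_score = candidate, score
    # Return (substring, datetime) pairs for the parts that parsed.
    return [(part, parse_date(part)) for part in best if parse_date(part)]


print(best_split("10 March 2017 12 April 2017"))
```

In this toy setup, splitting on every whitespace yields no valid dates, splitting on every second whitespace yields one, and splitting on every third yields two, so the third candidate wins.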
All of the new code added to dateparser is covered by tests. The resulting function and the class it uses are provided with docstrings.
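The idea behind the language detector from the third stage can be sketched roughly as follows. The tiny `LANGUAGE_DATA` dictionaries, the scoring formula (word hits plus the share of known characters) and the function names are assumptions chosen for illustration; the detector in the pull request works with the full dateparser language files and may weigh the two signals differently.

```python
import re

# Hypothetical, heavily trimmed dictionaries; the real ones come from
# dateparser's language files.
LANGUAGE_DATA = {
    "en": {"words": {"january", "march", "monday", "ago", "yesterday", "in", "on"}},
    "ru": {"words": {"января", "марта", "понедельник", "назад", "вчера", "в"}},
    "de": {"words": {"januar", "märz", "montag", "vor", "gestern", "am"}},
}


def build_alphabets(language_data):
    """Collect the characters used in each language's dictionary words."""
    return {lang: set("".join(data["words"])) for lang, data in language_data.items()}


def detect_language(text, language_data=LANGUAGE_DATA):
    """Score each language by dictionary word hits plus share of known characters."""
    alphabets = build_alphabets(language_data)
    tokens = re.findall(r"\w+", text.lower())
    letters = {ch for token in tokens for ch in token if ch.isalpha()}
    best_lang, best_score = None, 0.0
    for lang, data in language_data.items():
        word_hits = sum(1 for token in tokens if token in data["words"])
        char_ratio = len(letters & alphabets[lang]) / len(letters) if letters else 0.0
        score = word_hits + char_ratio
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


print(detect_language("Встреча состоится 5 марта, а вчера был понедельник"))
print(detect_language("The meeting is on Monday, 6 March"))
```

Because the alphabets are derived from whatever is in the dictionary, adding a new entry to `LANGUAGE_DATA` makes that language available to the detector without further changes.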
Future work
Before it can finally be merged, this project’s code needs to be reconciled with the other GSoC 2017 project for dateparser, the integration of the Unicode CLDR database. Both projects changed dateparser’s native files, not to mention adding a huge amount of new code, so making it all work together will take some time, but I think the result will be awesome.
Links to the code
All the work that was done can be found in the pull requests that I made in the dateparser repository on GitHub:
- #324 - [WIP] search with translate (Open) - contains the main part of my work: everything concerning search, the language detector for full texts, and the corresponding tests
- #340 - possible fix of 'ago' problem in Russian (Open) - a solution for a bug that I found while working on search
- #318 - tests for validation.py, utils/__init__.py, strptime.py (Merged) - tests that I wrote as a warm-up task to increase coverage of the existing code
- #322 - [WIP] search numeric dates (Open) - an alternative approach to searching for dates (with regular expressions) that I considered at the beginning; the translation approach seems more effective