GSoC Report Week 6-7: Switching to parsing stage and adding more tests
I fixed the remaining issues in the search function and added a search-and-parse function. It parses every substring found by the search function (substrings that contain digits and/or words present in the current language’s dictionary) and drops the ones that fail to parse.
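To illustrate the idea, here is a minimal sketch of the search-then-parse flow. It is not the project's actual code: the names `search`, `try_parse` and `search_and_parse`, the tiny month-only dictionary, and the single "day month year" format are all assumptions made for the example.

```python
import re
from datetime import datetime

# Hypothetical toy "language dictionary": English month names only.
MONTHS = {
    name: number
    for number, name in enumerate(
        ["january", "february", "march", "april", "may", "june", "july",
         "august", "september", "october", "november", "december"],
        start=1,
    )
}


def search(text):
    """Collect runs of consecutive tokens that are digits or dictionary words."""
    tokens = re.findall(r"[A-Za-z]+|\d+", text)
    candidates, current = [], []
    for token in tokens:
        if token.isdigit() or token.lower() in MONTHS:
            current.append(token)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:
        candidates.append(" ".join(current))
    return candidates


def try_parse(candidate):
    """Toy parser: accepts only 'day month year'; returns None otherwise."""
    parts = candidate.lower().split()
    if len(parts) != 3:
        return None
    day, month, year = parts
    if not (day.isdigit() and month in MONTHS and year.isdigit()):
        return None
    try:
        return datetime(int(year), MONTHS[month], int(day))
    except ValueError:  # e.g. day 31 in a 30-day month
        return None


def search_and_parse(text):
    """Parse every candidate substring and drop the ones that fail."""
    results = []
    for candidate in search(text):
        parsed = try_parse(candidate)
        if parsed is not None:
            results.append((candidate, parsed))
    return results


print(search_and_parse("We met on 12 May 2015 and bought 3 apples in room 404."))
```

Here the candidates "3" and "404" are found by the search step because they contain digits, but they are dropped because the toy parser cannot turn them into a date.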
Of course, all of this needs to be tested on full natural-language texts, so I wrote tests with short texts (one to three sentences) containing dates for each of the supported languages.
I ran into some difficulties, described in issues #259 and #330, where words that aren’t actually valid date expressions are parsed into valid datetime objects. When searching for dates in large chunks of text this issue becomes even more significant, because the probability of encountering such “infelicitous” words is very high, so the search-and-parse function will behave undesirably. For example, every indefinite article “a” in an English text will be treated by this function as a correct date expression and parsed into a valid datetime object.
Now I’m working on a function that will set the correct relative base for incomplete dates. By default, incomplete dates are parsed with a relative base equal to the current date and time, so, for example, “April 1st” will be interpreted as “April 1st, 2017”. But when parsing a text with multiple dates, where a complete date is followed by incomplete ones, we expect different behaviour. In a text like “April 1st, 1995 was a good day. April 2nd was also a good day” we want “April 2nd” to be placed in 1995, not in the current year. So I think that, in general, incomplete dates should receive the previous complete date as their relative base.
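A rough sketch of that rule, using only the standard library, might look like this. The function name `resolve_dates`, the regular expression, and the restriction to English "Month day[, year]" expressions are assumptions for illustration; the real function is still in progress and may work differently.

```python
import re
from datetime import datetime

MONTH_NAMES = ["January", "February", "March", "April", "May", "June", "July",
               "August", "September", "October", "November", "December"]
MONTHS = {name: number for number, name in enumerate(MONTH_NAMES, start=1)}

# Matches "April 1st, 1995" (complete) as well as "April 2nd" (incomplete).
DATE_PATTERN = re.compile(
    r"(" + "|".join(MONTH_NAMES) + r") (\d{1,2})(?:st|nd|rd|th)?(?:, (\d{4}))?"
)


def resolve_dates(text, default_base):
    """Give incomplete dates the year of the previous complete date.

    Until a complete date has been seen, `default_base` (for example the
    current date) supplies the missing year. A missing year combined with
    February 29 could raise ValueError; this sketch does not handle that.
    """
    base = default_base
    results = []
    for match in DATE_PATTERN.finditer(text):
        month = MONTHS[match.group(1)]
        day = int(match.group(2))
        if match.group(3):
            date = datetime(int(match.group(3)), month, day)
            base = date  # a complete date becomes the new relative base
        else:
            date = datetime(base.year, month, day)
        results.append((match.group(0), date))
    return results


text = "April 1st, 1995 was a good day. April 2nd was also a good day"
for expression, date in resolve_dates(text, default_base=datetime(2017, 6, 1)):
    print(expression, "->", date.date())
```

With this rule “April 2nd” lands in 1995, because the preceding complete date replaces the default base.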