GSoC Report Week 8-9: Splitting and setting relative base

The function that sets the correct relative base, which I described in the previous post, has been added along with the corresponding tests. To each parsed item it assigns, as the relative base, the most recent preceding date expression that is exact (like “25th of November, 2000”) rather than relative (like “2 days ago”). The reason is that when several relative dates appear in a row in natural language, they usually share a common relative base and should not be based on one another.
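Here is a minimal sketch of that idea, not the actual implementation. The function name `parse_with_bases`, the single "N days ago" pattern and the single exact date format are all assumptions made for illustration; the real parser handles far more.

```python
import re
from datetime import datetime, timedelta

# Toy pattern for relative expressions (assumption: only "N day(s) ago").
RELATIVE = re.compile(r"(\d+) days? ago")


def parse_with_bases(items, default_base):
    """Parse items in order; relative ones use the last exact date as base."""
    base = default_base
    results = []
    for text in items:
        match = RELATIVE.fullmatch(text)
        if match:
            # Relative expression: resolve it against the current base,
            # but do NOT let its result become the base for later items.
            results.append(base - timedelta(days=int(match.group(1))))
        else:
            # Exact expression (toy format "25 November 2000"):
            # parse it and make it the new relative base.
            base = datetime.strptime(text, "%d %B %Y")
            results.append(base)
    return results


dates = parse_with_bases(
    ["25 November 2000", "2 days ago", "3 days ago"],
    default_base=datetime(2020, 1, 1),
)
for date in dates:
    print(date.date())
```

Both relative items are counted back from 25 November 2000, so "3 days ago" is not computed from the result of "2 days ago".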

Another important addition is a function that splits chunks of text containing more than one date object when those objects aren’t separated by sentence boundaries or by words that aren’t in the language dictionary. Such chunks reach the parsing stage joined together and cannot be parsed in that form.

Let’s have a look at these examples:
  1. “July 13th July 14th” should become “July 13th” and “July 14th”
  2. “July 13th, July 14th” should become “July 13th” and “July 14th”
  3. “July 13th, 2014 July 14th, 2014” should become “July 13th, 2014” and “July 14th, 2014”
All of them contain whitespace, and the last two also contain commas. None of them were split, because a comma or a whitespace doesn’t indicate a sentence boundary.

The first example needs to be split by whitespace, but not at every whitespace the string contains. The second example needs to be split by a comma, and the third one should be split by a whitespace, but only at the one in the middle of the string. We should also take into account cases where there are 3 or more date objects in the string to split.

To behave this way, the program uses the following steps:
  1. There is a list of possible splitters, and the string is split by each splitter that occurs in it.
  2. For each splitter it makes several possible splits:
    • at every occurrence in the string
    • at every second occurrence
    • at every third occurrence
  3. All possible splits are stored and parsed, and the best one goes to the output.
  4. The program chooses the best split by two criteria:
    • the largest number of valid date objects (those that are parsed correctly)
    • the substrings resulting from the split are as long as possible, that is, there are as few of them as possible. This is because we want “June 13th, 2014” to stay in one piece, even though it can be divided into two valid date expressions.
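
The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual code: `is_date` is a toy regex validator standing in for the real parser, the splitter list is shortened to a comma and a whitespace, and the names `split_every_nth` and `best_split` are hypothetical. The ranking is one possible reading of the two criteria: prefer the split with the fewest unparsable pieces, then the one with the fewest pieces. Ranking purely by the count of valid pieces would break “June 13th, 2014” in two.

```python
import re

SPLITTERS = [", ", " "]  # assumption: a shortened list of possible splitters

MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")
# Toy validator (assumption): "July 13th", "July 13th, 2014" or a bare year.
DATE_RE = re.compile(
    r"(?:" + MONTHS + r") \d{1,2}(?:st|nd|rd|th)(?:, \d{4})?|\d{4}"
)


def is_date(text):
    return DATE_RE.fullmatch(text) is not None


def split_every_nth(text, splitter, n):
    """Split text at every n-th occurrence of splitter."""
    parts = text.split(splitter)
    return [splitter.join(parts[i:i + n]) for i in range(0, len(parts), n)]


def best_split(text):
    # The unsplit string is always a candidate.
    candidates = [[text]]
    for splitter in SPLITTERS:
        if splitter in text:
            # Every occurrence, every second, every third.
            for n in (1, 2, 3):
                candidates.append(split_every_nth(text, splitter, n))

    def rank(parts):
        invalid = sum(not is_date(part) for part in parts)
        # Fewest unparsable pieces first, then fewest (longest) pieces.
        return (invalid, len(parts))

    return min(candidates, key=rank)


print(best_split("July 13th July 14th"))
print(best_split("July 13th, 2014 July 14th, 2014"))
print(best_split("June 13th, 2014"))
```

With this ranking, the first two inputs are each split into two dates, while “June 13th, 2014” comes back as a single piece.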
