GSoC Report Week 10-11: Another language detector


I’m coming to the final stage of my project: the language detector. I need a new one because the detector that dateparser currently uses for parsing checks whether every token in the date string is either a number or belongs to the dictionary of one of the languages. That is fine for strings that contain only a date expression, but it won’t work correctly on strings that contain anything else, and search is likely to be performed on full natural-language sentences.

I discussed it with my mentor and we decided that the detector should, first of all, be universal and automatic. This means that if a new language is added to dateparser, it should automatically become available to the detector without any extra steps. So only information from the language dictionaries (which are stored as YAML files and loaded automatically by the program) should be used.

I’m starting with a straightforward approach: counting how many words from the string are contained in each language’s dictionary. The language with the highest count will be the result of the detection.
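Here is a minimal sketch of that counting idea, not the actual dateparser code. The `DICTIONARIES` mapping is a hypothetical toy stand-in for the word sets that would be built from the YAML language files, and the names `tokenize` and `detect_language` are made up for illustration.

```python
import re

# Hypothetical toy dictionaries. In the real project these word sets
# would be built from dateparser's YAML language files.
DICTIONARIES = {
    "en": {"monday", "january", "week", "ago", "on", "in"},
    "fr": {"lundi", "janvier", "semaine", "il", "y", "a"},
    "de": {"montag", "januar", "woche", "vor", "am", "im"},
}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens (digits are skipped)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def detect_language(text, dictionaries=DICTIONARIES):
    """Return the language whose dictionary contains the most tokens of text.

    Returns None when no token matches any dictionary. Ties are resolved
    in favour of the language that comes first in the mapping.
    """
    tokens = tokenize(text)
    counts = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in dictionaries.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else None


print(detect_language("I will arrive on Monday in January"))
print(detect_language("Nous partons lundi 3 janvier"))
```

Unknown words such as "arrive" or "partons" are simply ignored, which is the point: unlike the current detector, a sentence does not have to consist only of dictionary words and digits to be assigned a language.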

If this doesn’t work well, I will try adding lists of character n-grams, also extracted from the language dictionaries, since n-grams are in general one of the most effective approaches to language detection.
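To make the n-gram idea concrete, here is a rough sketch under my own assumptions: character trigrams, words padded with a space on each side, and a score equal to the share of the text's trigrams that appear in a language's profile. The word lists and all function names are hypothetical, and a real implementation may well weight n-grams differently.

```python
from collections import Counter

# Hypothetical toy word lists standing in for the language dictionaries.
WORDS = {
    "en": ["monday", "january", "february"],
    "de": ["montag", "januar", "februar"],
}


def char_ngrams(word, n=3):
    """Return the character n-grams of word, padded with one space per side."""
    padded = f" {word} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(words, n=3):
    """Count every character n-gram that occurs in the given words."""
    profile = Counter()
    for word in words:
        profile.update(char_ngrams(word.lower(), n))
    return profile


def score(text, profile, n=3):
    """Fraction of the text's n-grams that are present in the profile."""
    grams = [g for token in text.lower().split() for g in char_ngrams(token, n)]
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in profile) / len(grams)


def detect_by_ngrams(text, profiles, n=3):
    """Return the best-scoring language, or None if nothing matches."""
    scores = {lang: score(text, profile, n) for lang, profile in profiles.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


PROFILES = {lang: build_profile(words) for lang, words in WORDS.items()}

# "mondays" is in neither word list, but most of its trigrams occur in "monday".
print(detect_by_ngrams("mondays", PROFILES))
```

The advantage over plain word counting is that inflected or slightly misspelled words, which are missing from the dictionaries, can still contribute evidence for a language.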
