Title: Multi-language document search and retrieval system
Type: issued patent
Patent number: 7,174,290
Issue date: February 6, 2007
Filing date: July 7, 2003 (priority to November 30, 1998)
AIPW Summary: A system for multilingual indexing and searching. In conventional indexing, a string of text is separated into individual words (tokens), non-indexable tokens are removed, and the remaining words are reduced to their grammatical stems and indexed. The problem with this process is that it is language-dependent. The invention makes both the tokenization and stemming phases multilingual by removing accent marks from words and by removing word endings drawn from multiple languages. For example, during the tokenization phase, a word that would not normally be indexed in one language is not indexed in any language. The patent gives the example of the word “the” in English (which would not normally be indexed) and the word “thé” in French (which normally would be). Under this invention, removing the accent mark reduces “thé” to the string “the”, which is not indexed because it would not be indexed in English (see column 5, lines 21-41).
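
Illustrative sketch (not taken from the patent): the pipeline summarized above can be approximated in a few lines of Python. The stop-word list, the suffix list, and every function name here are hypothetical placeholders chosen for the example; the patent's actual word lists and stemming rules are not given in this summary. Accent removal is done with Unicode decomposition, which is one common way to do it and is an assumption, not the patented method.

```python
import re
import unicodedata

# Hypothetical, deliberately tiny lists for illustration only.
STOP_WORDS = {"the", "a", "of", "le", "la", "de"}   # merged across languages
SUFFIXES = ("ing", "es", "s")                       # illustrative word endings


def strip_accents(text):
    """Remove accent marks by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def tokenize(text):
    """Lowercase, remove accents, and split the text into word tokens."""
    return re.findall(r"\w+", strip_accents(text.lower()))


def stem(word):
    """Strip the first matching ending, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Return the terms that would be indexed for the given text."""
    return [stem(tok) for tok in tokenize(text) if tok not in STOP_WORDS]
```

With these placeholder lists, index_terms("The thé searches") returns ["search"]: both "The" and "thé" reduce to "the" and are dropped, matching the example in the summary, and "searches" is reduced to its stem.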
