Recognizing Text Similarity (2008)
Abstract
Overview: There are a variety of circumstances under which it would be useful to determine that two documents contain similar text, including detecting plagiarism and copyright infringement, and filtering and organizing documents returned by a search engine. The vast amount of digital information available on the Web makes it necessary to deal with these issues. The ease of copying facilitates both plagiarism and copyright infringement, while the volume of information available increases the difficulty of finding the right information quickly.

Approach: To automatically evaluate text similarity, we look at both the content and the expression of documents. By content we refer to the facts in documents; expression refers to the linguistic choices made by the authors in presenting these facts. Expression, thus, includes biases of the authors towards or against particular linguistic constructs but does not include layout or generic genre characteristics of documents. For example, sentence 1 is similar in content to sentence 2, which summarizes sentence 1, but differs in expression, while it is similar to sentence 3 in expression but differs in content.

Sentence 1: A wealthy Texan developer convicted of killing Simon Prankerd, a British yacht captain, and his five passengers in a high-speed drink-driving boat crash has been sentenced to 85 years in prison.

Sentence 2: An American developer who was found responsible for the deaths of 6 people involved in a boat crash was sentenced to 85 years in jail.

Sentence 3: A wealthy Californian tourist convicted of killing Francois Robert, a French bus driver, and his 8 passengers in a high-speed drunk-driving car crash was sentenced to 70 years in prison.

Most tasks related to text similarity focus on content similarity and the algorithms created for these
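
As a rough illustration of why content and expression need to be told apart, the sketch below scores the three example sentences with plain word-set (Jaccard) overlap. This is not the method of the work described here; the `tokens` and `jaccard` helpers are hypothetical names introduced only for this example, and lowercasing plus splitting on non-alphanumeric characters is an assumed tokenization.

```python
import re

# The three example sentences quoted in the abstract.
S1 = ("A wealthy Texan developer convicted of killing Simon Prankerd, "
      "a British yacht captain, and his five passengers in a high-speed "
      "drink-driving boat crash has been sentenced to 85 years in prison.")
S2 = ("An American developer who was found responsible for the deaths of "
      "6 people involved in a boat crash was sentenced to 85 years in jail.")
S3 = ("A wealthy Californian tourist convicted of killing Francois Robert, "
      "a French bus driver, and his 8 passengers in a high-speed "
      "drunk-driving car crash was sentenced to 70 years in prison.")


def tokens(text):
    """Assumed tokenization: lowercase, keep runs of letters and digits."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a, b):
    """Word-set overlap: |A intersect B| / |A union B|."""
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb)


if __name__ == "__main__":
    print(f"sentence 1 vs 2 (same content):    {jaccard(S1, S2):.2f}")
    print(f"sentence 1 vs 3 (same expression): {jaccard(S1, S3):.2f}")
```

Under these assumptions, sentence 3 overlaps more with sentence 1 (about 0.40) than the summary sentence 2 does (about 0.23), even though only sentence 2 reports the same event. In this example a purely lexical measure rewards shared expression rather than shared content.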
Publication details