Critical issues presentations/AI-based plagiarism detection to identify and triage copyvio insertions to Wikipedia

From Wikimania 2016 • Esino Lario, Italy
Submission no. 186
Title of the submission

AI-based plagiarism detection to identify and triage copyvio insertions to Wikipedia

Author of the submission
  • Gediz Aksit
Country of origin

Belgium

Topics

Other, Projects, Research, Technical

Keywords
  • Artificial Intelligence
  • Copyrights
  • Machine Learning
  • Plagiarism Detection
  • Quality Control
  • Wikipedia
Abstract

In the 2015 Community Wishlist Survey, Wikimedians ranked "Improve the copy and paste detection bot" as the ninth most critical issue.[1] The problem: users copy and paste copyrighted content from other sources, including websites, into Wikipedia. Wikimedians have traditionally crowdsourced the detection of these insertions using resources such as Copyscape, Turnitin, or simply their own intuition. However, a good share of the cases are obvious ones, which makes dealing with them continuously monotonous and tedious.


This is where Artificial Intelligence (AI) would be of great benefit. AI is a branch of computer science that uses computers to perform tasks we commonly associate with intelligent humans. While we don't yet have AIs smart enough to replace a human Wikipedia editor, many AI techniques are very good at automating the more monotonous and voluminous tasks. Using AI, we would be able to flag obvious cases automatically and triage the rest, by urgency and likelihood, for human editors to review.
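
To make the idea concrete, the sketch below shows one way such triage could work, assuming a trained model that returns the probability that an edit introduces copied text. The function name and thresholds are illustrative assumptions, not part of any existing system.

    def triage(edits, copyvio_probability):
        """Split edits into auto-flagged, a human review queue sorted
        most-urgent-first, and ignored. `copyvio_probability` is a
        hypothetical model callable returning a value in [0, 1]."""
        auto_flag, review_queue, ignore = [], [], []
        for edit in edits:
            p = copyvio_probability(edit)
            if p >= 0.95:    # obvious case: flag automatically
                auto_flag.append(edit)
            elif p >= 0.50:  # uncertain: queue for human review
                review_queue.append((p, edit))
            else:            # very likely fine: skip
                ignore.append(edit)
        # Most likely copyvios surface first in the review queue.
        review_queue.sort(key=lambda pair: pair[0], reverse=True)
        return auto_flag, [edit for _, edit in review_queue], ignore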


Something similar was recently achieved for edit quality control on many language editions of Wikipedia, as well as Wikidata, using machine learning in a project called Revision Scoring as a Service,[2] where the vast majority of edits were triaged as not needing review. The system is able to distinguish productive edits from damaging ones, and also classifies intent (good faith/bad faith) so that damaging edits that are merely newbie mistakes can be separated from malicious ones. Its output is used by a variety of third-party developers, including but not limited to Huggle,[3] Raun,[4] Real-Time Recent Changes,[5] and Dexbot[6] (the automatic vandalism-revert bot of Persian Wikipedia). The tool achieved this by gathering feedback from the local community through Wiki Labels[7] to learn what such edits look like; this way, the system is trained on the needs of each local community.
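
For readers curious what consuming such scores looks like, here is a hedged sketch of querying the project's public scoring service (ORES) for an edit's damaging/good-faith probabilities. The endpoint path and response shape follow the ORES v3 API as publicly documented, but the exact details should be checked against the current documentation rather than taken from this sketch.

    import requests

    def score_revision(rev_id, wiki="enwiki"):
        """Fetch damaging/goodfaith probabilities for one revision."""
        url = "https://ores.wikimedia.org/v3/scores/%s/%d" % (wiki, rev_id)
        resp = requests.get(url, params={"models": "damaging|goodfaith"},
                            timeout=30)
        resp.raise_for_status()
        data = resp.json()
        # Assumed response shape:
        #   data[wiki]["scores"][rev_id][model]["score"]["probability"]["true"]
        scores = data[wiki]["scores"][str(rev_id)]
        return {model: scores[model]["score"]["probability"]["true"]
                for model in ("damaging", "goodfaith")}

    # Example usage:
    #   score_revision(123456)  ->  {"damaging": 0.03, "goodfaith": 0.97}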


Likewise, the plagiarism detection problem (the identification of copyvios) can be handled with cutting-edge AI algorithms. Bear in mind that, just as we have a history of reverted vandalism edits on which to train AI for vandalism detection, we also have a history of content deleted over copyright/plagiarism concerns, which would serve as a starting point for the AI implementation. Such labelled data is a goldmine for AI algorithms to train on, so even less community effort would be needed during the training phase of the implementation.
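
As an illustration of the kind of model that such labelled data could bootstrap, the sketch below scores an inserted passage against candidate source texts using TF-IDF cosine similarity. This is one plausible baseline among many, not the method the project would necessarily use, and the threshold is an assumption rather than a tuned value.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def likely_copyvio(inserted_text, source_texts, threshold=0.8):
        """Return (flag, best_similarity): whether the inserted passage
        closely matches any candidate source text."""
        vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english")
        matrix = vectorizer.fit_transform([inserted_text] + list(source_texts))
        # Compare the inserted passage (row 0) against every candidate source.
        sims = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
        best = float(sims.max()) if len(sims) else 0.0
        return best >= threshold, best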


This submission intends to inform local communities of the capabilities of cutting-edge research and of how such an AI-based system could be implemented to handle plagiarism detection. The feedback gathered would be very useful in guiding the direction of this possible future project.


[1] https://meta.wikimedia.org/wiki/2015_Community_Wishlist_Survey/Bots_and_gadgets#Improve_the_.22copy_and_paste_detection.22_bot

[2] https://meta.wikimedia.org/wiki/Research:Revision_scoring_as_a_service

[3] https://www.mediawiki.org/wiki/Huggle

[4] https://tools.wmflabs.org/raun/?language=pt&project=wikipedia&userlang=en

[5] https://github.com/Krinkle/mw-gadget-rtrc/pull/43

[6] https://fa.wikipedia.org/wiki/User:Dexbot

[7] https://meta.wikimedia.org/wiki/Wiki_labels

Result

Accepted as reserve