Notes on the duplicate merging process

This is a (very slightly) modified copy of the list of requirements for the rewrite of the duplicate merging script. The original list was compiled by alanf, gillux, Link Mauve, liori and "two unnamed authors".

Requirements

Requirements for the deduplication code (script, runtime code, or a combination of the two):

  • Does not lock the database.
  • Does not mess up the database if it is interrupted before it finishes (for example, if the server shuts down).
  • Supports incremental runs (for example, "only consider sentences that changed in the contributions table between date X and date Y", not just sentences that were added after a certain date or ID).
  • Integration with PHP code? (For example, a run at the moment a sentence is inserted.) This could be a way for normal users to link sentences.
  • Does not use an O(n²) algorithm.
  • Handles cycles intelligently (sentences that link to themselves directly or indirectly) without getting stuck in infinite loops or trying to add the same comment or link to a sentence multiple times.
  • Maybe provides warnings before deletion? Like: first run only adds a comment saying "this sentence is a duplicate of that sentence", and second run actually removes them?
  • Who owns the merged sentence?
  • Sentences with audio are prioritized as the target sentence. (Need to decide what to do if multiple duplicate sentences have audio: which one wins? If we use the "dry run" technique, the first run could check for this situation so we could listen for ourselves in case the audio for one is better. Or we could just use a rule: newest wins, or oldest wins.) Why not allow multiple audio recordings per sentence? Multiple speakers would add value to the audio anyway. Good idea, but I don’t think it’s currently possible. Of course. :-)
  • Merges comments, links, and logs.
  • Merge tags except for "@duplicate", which should be dropped.
  • Updates users' favorite sentences.
  • Add a new type of contribution to the contributions table: ' sentence'. Any time a sentence is deleted due to the deduplication process, a new entry in the contributions table is added to log the deletion.
  • Can run on dev machine as well as on server. Username, password, etc. are read from command line or config files. (We can just do it as a Cake console script which is run from the command line.) (We should basically go over the database schema and look at all the places where sentence ID shows up, and decide what to do in those places.)
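
Several of the requirements above (no O(n²) algorithm, audio priority, tag merging, no repeated or self-referencing links) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the row layout, the field names (`lang`, `text`, `has_audio`), the function names, and the "lowest ID wins" tie-break are hypothetical and are not taken from the actual schema or script. It also assumes that duplicates are sentences with identical text in the same language.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for database rows (not the real schema).
sentences = [
    {"id": 1, "lang": "eng", "text": "Hello.", "has_audio": False},
    {"id": 2, "lang": "eng", "text": "Hello.", "has_audio": True},
    {"id": 3, "lang": "fra", "text": "Bonjour.", "has_audio": False},
]
tags = {1: {"greeting", "@duplicate"}, 2: {"audio"}}
links = {(1, 3), (3, 1), (1, 2), (2, 1)}  # directed links between sentence IDs


def find_duplicate_groups(rows):
    """Group rows by (lang, text) in a single pass instead of comparing all pairs."""
    groups = defaultdict(list)
    for row in rows:
        groups[(row["lang"], row["text"])].append(row)
    return [group for group in groups.values() if len(group) > 1]


def pick_target(group):
    """Prefer a sentence with audio; break ties by lowest ID (an assumed rule)."""
    return min(group, key=lambda row: (not row["has_audio"], row["id"]))


def merge_tags(group, tags):
    """Union the tags of all duplicates, dropping "@duplicate"."""
    merged = set()
    for row in group:
        merged |= tags.get(row["id"], set())
    merged.discard("@duplicate")
    return merged


def merge_links(group, target_id, links):
    """Repoint links at the target; skip self-links; the set removes repeats."""
    duplicate_ids = {row["id"] for row in group}
    merged = set()
    for a, b in links:
        a = target_id if a in duplicate_ids else a
        b = target_id if b in duplicate_ids else b
        if a != b:  # a link between two duplicates would become a self-link
            merged.add((a, b))
    return merged


group = find_duplicate_groups(sentences)[0]
target = pick_target(group)
merged_tags = merge_tags(group, tags)
merged_links = merge_links(group, target["id"], links)
```

Here sentences 1 and 2 form one group, sentence 2 becomes the target because it has audio, and the links between the two duplicates disappear rather than turning into links from the target to itself. A real implementation would still have to do the same for comments, logs, favorites, and every other table that stores a sentence ID.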

Nice to have

  • Some way of handling the comment threads on the existing sentences so that they can still be understood when the sentences are merged.
  • Can be executed as a cron job (in other words, reliable enough that it doesn't require preparation before it works, observation while it works, or cleanup after it works).
  • Prioritize contributions by (self-identified) native speakers over those who do not self-identify as native. (suggested by CK, who points out that some users and "re-users" of the corpus place more trust in sentences by self-identified natives)
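
If the native-speaker preference were adopted, it could be combined with the audio rule from the requirements as a single sort key. The following is only a sketch: the `Candidate` type, its field names, and the ordering (audio first, then native speakers, then lowest ID) are assumptions, since the notes do not settle how these criteria should rank against each other.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    # Hypothetical view of one duplicate sentence; field names are illustrative.
    sentence_id: int
    has_audio: bool
    by_native: bool  # owner self-identifies as native in the sentence's language


def rank_key(candidate):
    # False sorts before True, so the preferred properties are negated.
    return (not candidate.has_audio, not candidate.by_native, candidate.sentence_id)


def choose_target(candidates):
    """Return the candidate that should survive the merge under the assumed ranking."""
    return min(candidates, key=rank_key)


candidates = [
    Candidate(10, has_audio=False, by_native=False),
    Candidate(11, has_audio=False, by_native=True),
    Candidate(12, has_audio=True, by_native=False),
]
with_audio = choose_target(candidates)
without_audio = choose_target(candidates[:2])
```

With all three candidates, sentence 12 is chosen because it has audio; if only the first two are considered, sentence 11 is chosen because its owner self-identifies as native. Changing the order of the tuple in `rank_key` is enough to try a different policy.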

Nonrequirements

  • Does not handle sentences that come in while the script is running(?)
  • Normalize punctuation (’ vs. ', ！ vs. ! etc.); this is a task for a separate script.

Additional notes

  • Another approach is to prevent the addition of duplicates in the first place. (see point 3 of requirements) The question is whether that can be done efficiently. Actually you need to lock the database, check for an existing sentence, add the new one, and unlock (not a big deal given the rate at which sentences are added).
    • Adding this might require quite a lot of changes to the UI and such so that it won't be confusing to the user. For example, when a user accidentally creates a sentence that is a duplicate, even if they meant to edit the sentence later.
    • For adding a translation: we could just link the existing sentence to the one the user intended to translate. But this could be used as a way to bypass normal user rights and link two sentences randomly. Isn’t that ultimately wanted, and was disabled only because no good UI was found?
    • For adding a new sentence: just show an error.
    • For editing sentence to some already existing one: ???
  • Can a sentence be written as the same text while having different meanings in different languages, thus actually requiring separate sentences?
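
The "lock, check, add, unlock" sequence for preventing duplicates at insertion time could look roughly like the sketch below. It uses Python's built-in `sqlite3` module purely as a stand-in for the real database, and the table layout and the `add_sentence` function are hypothetical. It also assumes that a duplicate means the same text in the same language, so identical text in two languages is kept as two sentences.

```python
import sqlite3


def add_sentence(conn, lang, text):
    """Return (sentence_id, created); reuse an existing row instead of duplicating it."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before checking
    try:
        row = conn.execute(
            "SELECT id FROM sentences WHERE lang = ? AND text = ?", (lang, text)
        ).fetchone()
        if row is not None:
            conn.execute("ROLLBACK")  # nothing to write, release the lock
            return row[0], False
        cursor = conn.execute(
            "INSERT INTO sentences (lang, text) VALUES (?, ?)", (lang, text)
        )
        conn.execute("COMMIT")
        return cursor.lastrowid, True
    except Exception:
        conn.execute("ROLLBACK")
        raise


# isolation_level=None lets the code issue BEGIN/COMMIT/ROLLBACK itself.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE sentences (id INTEGER PRIMARY KEY, lang TEXT, text TEXT)")

first = add_sentence(conn, "eng", "Hello.")
second = add_sentence(conn, "eng", "Hello.")
third = add_sentence(conn, "fra", "Hello.")
```

The second call returns the ID of the existing sentence with `created` set to `False`, which is the point where the UI questions above come in: the caller has to decide whether to show an error, link to the existing sentence, or do something else.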