Board Thread:Code Review/@comment-1757994-20151023054248

In a forum thread on Community Central (19 Oct - 21 Oct 2015), there was a question about how to find duplicate files. One poster recommended DupImageList/code.js. However, I noticed that DupImageList has bugs that can cause it to miss some duplicate files and to list other duplicates more than once.

Therefore, I propose several non-trivial changes to fix it and to increase its efficiency. I know this code has existed for a while, and many people have looked at it and used it over the years, so I realize some may be skeptical. My objective here is to make the minimal set of changes that preserves the structure of the existing code, rather than replace it with something wholly rewritten, while still fixing the shortcomings and bugs.

DupImageList uses the MediaWiki API with the following HTTP query parameters.
 * format = json
 * action = query
 * prop = duplicatefiles
 * generator = allimages
 * gailimit = 500
 * gaifrom = 

There are two query parameters that are not used (but should be).
 * dflimit
 * dfcontinue

The duplicatefiles module uses dflimit and dfcontinue, while the (generator) allimages module uses (g)ailimit and (g)aifrom. Using the query values above, the allimages module feeds 500 file names to the duplicatefiles module. The duplicatefiles module looks for up to dflimit duplicate files, then stops. The default value of dflimit is 10. Note that dflimit does not mean up to dflimit duplicates per file; it means at most dflimit duplicates in total across the whole batch.
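
To make the interplay concrete, here is a minimal sketch (not the gadget's actual code) of a single request that also sends the two missing parameters; it assumes jQuery and mediawiki.util are available, and the function name and parameter object are mine:

 // Illustrative only: one iteration of the query, with the duplicatefiles
 // limit and continuation parameters sent alongside the generator's.
 function queryDuplicates(gaifrom, dfcontinue, callback) {
     var params = {
         format: 'json',
         action: 'query',
         prop: 'duplicatefiles',
         generator: 'allimages',
         gailimit: 500,
         dflimit: 500, // see the dflimit section below
         gaifrom: gaifrom || ''
     };
     if (dfcontinue) {
         params.dfcontinue = dfcontinue;
     }
     $.getJSON(mw.util.wikiScript('api'), params, callback);
 }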

Let's say there are 1000 files: A001 through A500, Z001 through Z500. A001 and Z001 are duplicates of each other; A002 and Z002 are duplicates of each other; etc., up to A500 and Z500.
dfcontinue

In the initial request, the API would return a list of the 500 A files with the first 10 Z duplicates. The other 490 Z duplicates would not be returned. The API would also return query-continue values as follows:
 * gaifrom = Z001
 * dfcontinue = A011|Z011

DupImageList would output the duplication information for A001/Z001 through A010/Z010, then use gaifrom to send a second request beginning at Z001. The API would return a list of the 500 Z files with the first 10 A duplicates. The API would also return a query-continue value as follows:
 * dfcontinue = Z011|A011

There is no gaifrom returned from the second request because, as far as allimages is concerned, there are no more files. DupImageList would try not to output the duplication information again, since it is the same information it already output, but the list it uses to detect repeats is reset every time findDupImages iterates. Therefore DupImageList would output the duplication information for Z001/A001 through Z010/A010 a second time. It would then see that there is a query-continue value, assume that it is gaifrom, and die.
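
For reference, in the old-style continuation format that this query uses (query-continue rather than the newer continue), each module reports its own continuation value, so the code has to look at both rather than assume any continuation is gaifrom. A sketch of reading them, assuming I remember the nesting correctly (the function name is mine):

 // Rough shape of the first response in the example above:
 //   "query-continue": {
 //       "allimages":      { "gaifrom": "Z001" },
 //       "duplicatefiles": { "dfcontinue": "A011|Z011" }
 //   }
 function getContinuation(response) {
     var qc = response['query-continue'] || {};
     return {
         gaifrom: qc.allimages && qc.allimages.gaifrom,
         dfcontinue: qc.duplicatefiles && qc.duplicatefiles.dfcontinue
     };
 }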

If you prefer a live example, you can try an XML query from the RS Wiki. At the time of this writing, the query returns query-continue values of
 * gaifrom = Abyssal book.png
 * dfcontinue = A key (Gnome Village Dungeon).png|Toban&#039;s key.png

You can count the total number of duplicate files listed. It's 10. The next duplicate file, "Toban's key.png", is a duplicate of "A key (Gnome Village Dungeon).png", which already has several (but fewer than 10) other duplicates listed. Starting a second query from "Abyssal book.png" would skip "Toban's key.png" (and a lot of other duplicates). Notice also that the apostrophe (') is returned as a character entity number (&#039;).

 * Edit 1
 * Edit 2
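
On the character-entity point above: any code that compares or reuses titles taken from the response has to decode numeric character references first, or "Toban&#039;s key.png" and "Toban's key.png" look like different files. One way to do it, as an illustration only (not the text of Edit 1 or Edit 2):

 // Decode numeric character references such as &#039; back to characters
 // before comparing titles or building the next request.
 function decodeNumericEntities(s) {
     return s.replace(/&#(\d+);/g, function (match, code) {
         return String.fromCharCode(Number(code));
     });
 }
 // decodeNumericEntities('Toban&#039;s key.png') === "Toban's key.png"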

If the dflimit cut-off truncates a list of multiple duplicates for any file (whether dflimit is 10 or anything else), the list of duplicates of the current title being processed would resume in the next iteration. DupImageList should output a list of duplicates only when it is known that the list is not truncated. Therefore, the code needs to pass its state information to the next iteration. The state information includes the current title, the current list of duplicates for the title (or the HTML of it), and the cumulative list of duplicates.
Continuation state
 * Edit 3
 * Edit 4
 * Edit 5 (Edit 1 revisited)
 * Edit 6
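
As a rough illustration of what that state might look like (property names are mine; this is the idea, not the proposed edits themselves):

 // State carried into the next iteration so that a possibly-truncated list
 // of duplicates is only output once it is known to be complete.
 var state = {
     title: null,    // file currently being processed
     titleDups: [],  // duplicates collected so far for that file (or their HTML)
     allDups: []     // cumulative list of every duplicate already reported
 };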

Finding only 10 duplicate files per iteration of 500 titles can be restrictive. There's no reason not to increase the maximum number of duplicates to 500. Increasing the maximum should also mean less network traffic. Strictly speaking, this edit is not required to make anything work; it should just make things work better.
dflimit
 * Edit 7

dils (duplicate image list string) is a string made from the array dil (duplicate image list). findDupImages does a substring search on dils to see if a title has already been listed as a duplicate file and skips the title if it has. A substring search is inefficient (and possibly error-prone). These edits also are not required to make anything work; they should just make things work better.
dils
 * Edit 8 (Edit 3 revisited)
 * Edit 9
 * Edit 10
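
One common alternative to a substring search over a concatenated string is a plain object used as a lookup table. As an illustration of that idea only (not the text of Edits 8-10):

 // Track already-listed duplicates in an object keyed by title instead of
 // searching one long concatenated string for each new title.
 var listed = {};
 function alreadyListed(title) {
     return listed.hasOwnProperty(title);
 }
 function markListed(title) {
     listed[title] = true;
 }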

There are other changes that could be made, like wrapping everything in a closure, using strict mode, and using mw.config for stylepath. I think those changes are secondary compared to just getting the code to work the way everyone thinks it already does.
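
For what it's worth, that cleanup would look roughly like this (a sketch; whether mw.config actually exposes stylepath on a given wiki would need checking):

 // Wrap the gadget in a closure, opt in to strict mode, and read stylepath
 // from mw.config instead of relying on the bare global.
 (function () {
     'use strict';
     var stylepath = mw.config.get('stylepath');
     // ... rest of DupImageList ...
 }());
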
Result

I also know that editing the file (any js file) is currently restricted to Wikia staff and possibly some mysterious selection of non-staff personnel (but apparently no one who can edit here). That's a separate problem for discussion elsewhere.

Applying all the above edits, rationalizing the use of ' versus ", and linting a little (but not too much), the updated code is

Comments welcome. 