Wikipedia:Bots/Requests for approval/BareRefBot: Difference between revisions



====Trial 3====
<span class="anchor" id="202203271248 Primefac"></span>{{BotExtendedTrial|edits=50}} [[User:Primefac|Primefac]] ([[User talk:Primefac|talk]]) 12:48, 27 March 2022 (UTC)
: {{ping|Rlink2}} Has this trial happened? [[User:Pppery|* Pppery *]] [[User talk:Pppery|<sub style="color:#800000">it has begun...</sub>]] 01:42, 17 April 2022 (UTC)
::@[[User:Pppery|Pppery]] Not yet, busy with IRL stuff. But will get to it soon (by end of next week latest) [[User:Rlink2|Rlink2]] ([[User talk:Rlink2|talk]]) 02:37, 17 April 2022 (UTC)

Revision as of 06:46, 23 August 2022


Operator: Rlink2 (talk · contribs · SUL · edit count · logs · page moves · block log · rights log · ANI search)

Time filed: 21:35, Thursday, January 20, 2022 (UTC)

Function overview: The function of this bot is to fill in bare references. A bare reference is a reference with no information about the source included in the citation; for example, <ref>https://wikipedia.org</ref> instead of <ref>{{cite web | url = https://encarta.microsoft.com | title = Microsoft Encarta}}</ref>. More detail can be found at Wikipedia:Bare_URLs and User:BrownHairedGirl/Articles_with_bare_links.

Automatic, Supervised, or Manual: Automatic; mistakes will be corrected as it goes.

Programming language(s): Multiple.

Source code available: Not yet.

Links to relevant discussions (where appropriate): WP:Bare_URLs. Citation bot already fills bare refs, and is approved to do so.

Edit period(s): Continuous.

Estimated number of pages affected: around 200,000 pages, maybe less, maybe more.

Namespace(s): Mainspace.

Exclusion compliant (Yes/No): Yes.

Function details: The purpose of the bot is to provide a better way of fixing bare refs. As explained by Enterprisey, our citation tools could do better. Citation bot is overloaded, and Reflinks consistently fails to get the title of the webpage. ReFill is slightly better but is very buggy due to architectural failures in the software pointed out by the author of the tool.

As evidenced by my AWB run, my script can get the title of many sites that Reflinks, reFill, or Citation Bot cannot get. The tool is like a "booster" to other tools like Citation bot: it picks up where those tools leave off.

There are a few exceptions for when the bot will not fill in the title. For example, if the title is shorter than 5 characters, the bot will not fill it in, since it is highly unlikely that such a title has any useful information. Twitter links will be left alone, as TheSandDoctor has a bot that can do a more complete filling.

There has been discussion over the "incompleteness" of the filling of these refs. For example, the bot wouldn't fill in the "work="/"website=" parameter unless it's a whitelisted site (NYT, YouTube, etc.). This is similar to what Citation bot does, IIRC. While these other parameters would usually not be filled, the consensus is that "perfect is the enemy of the good" and that any sort of filling represents an improvement in the citation. Any filled cites can always be improved even further by editors or another bot.
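As a rough illustration of those rules, here is a minimal sketch in Python of how a title-filling step along these lines might work; the whitelist entries, helper name, and length threshold are illustrative assumptions, not the bot's actual code.

<syntaxhighlight lang="python">
# Minimal sketch of the filling rules described above (title length check plus a
# website whitelist). The whitelist entries and function name are illustrative
# assumptions, not the bot's published code.
from urllib.parse import urlparse

WEBSITE_WHITELIST = {
    "nytimes.com": "The New York Times",   # illustrative entries only
    "youtube.com": "YouTube",
}

def build_cite(url, title):
    """Return a {{cite web}} string, or None if the title is unusable."""
    if not title or len(title.strip()) < 5:
        return None                        # too short to carry useful information
    cite = "{{cite web |url=" + url + " |title=" + title.strip()
    domain = urlparse(url).netloc.lower().removeprefix("www.")
    if domain in WEBSITE_WHITELIST:        # only fill website= for whitelisted sites
        cite += " |website=" + WEBSITE_WHITELIST[domain]
    return cite + "}}"
</syntaxhighlight>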


Examples:

Special:Diff/1066367156

Special:Diff/1066364250

Special:Diff/1066364589


===Discussion===

====Pre-trial discussion====

{{BotOnHold}} pending closure of Wikipedia:Administrators'_noticeboard/Incidents#Rlink2. ProcrastinatingReader (talk) 23:25, 20 January 2022 (UTC)[reply]

@ProcrastinatingReader: The ANI thread has been closed. Rlink2 (talk) 15:03, 25 January 2022 (UTC)[reply]

Initial questions and thoughts (in no particular order):

  1. I would appreciate some comments on why Citation Bot is trigger-only (i.e. it will only edit individual articles on which it is triggered) rather than approved to mass-edit any article with bare URLs. Assuming the affected page count is accurate, it seems like there's no active and approved task for this job, and since this seems like a task that's obviously suitable for bot use I'm curious to know why that isn't the case.
  2. How did you come to the figure of 200,000 affected pages?
  3. Exactly which values of the citation template will this bot fill in? I gather that it will fill in |title= -- anything else?

ProcrastinatingReader (talk) 23:25, 20 January 2022 (UTC)[reply]

@ProcrastinatingReader: it's not really accurate to say that Citation bot will only edit individual articles on which it is triggered. Yes, it needs to be triggered, but it also has a batch mode of up to 2,200 articles at a time. In the last 6 months I have used that facility to feed the bot ~700,000 articles with bare URLs.
The reason that Citation bot needs targeting is simply scope. Citation bot can potentially make an improvement to any of the 6.4 million articles on Wikipedia, but since it can process only a few thousand per day, it would need about 4 years to process them all. That is why Citation bot needs editors to target the bot at high-priority cases.
By contrast, BareRefBot's set of articles is about 200,000. That's only 3% of the total, and in each case BareRefBot will skip most of the refs on the page (whereas Citation bot processes all the refs, taking up to 10 minutes per page if there are hundreds of refs). The much simpler and more selective BareRefBot can process an article much much faster than Citation bot ... so it is entirely feasible for BareRefBot to process the lot at a steady 10 edits/min running 24/7, in only 14 days (10 × 60 × 24 × 14 = 201,600). It may be desirable to run it more slowly, but basically this job could clear the backlog in a fortnight. Hence no need for further selectivity.
I dunno the source of Rlink2's data, but 200,000 non-PDF bare URLs is my current estimate. I have scanned all the database dumps for the last few months, and that figure is derived from the numbers I found in the last database dump (20220101), minus an estimate of the progress since then. I will get data from the 20220120 dump within the next few days, and will add it here.
Note that my database scans show new articles with bare URLs being added at a rate of about 300 per day. (Probably some are filled promptly, but that's what remains at the end of the month). So there will be ongoing work every month on about 9k–10k articles. Some of that work will be done by Citation bot, which on first pass can usually fill all bare URL refs on about 30% of articles. BareRefBot can handle most of the rest. BrownHairedGirl (talk) • (contribs) 01:40, 21 January 2022 (UTC)[reply]
Numbers of articles. @ProcrastinatingReader: I have now completed my scans of the 20220120 database dump, and have the following headline numbers as of 20220120 :
  • Articles with untagged non-PDF bare URL refs: 221,824
  • Articles with untagged non-PDF bare URL refs in the 20220120 dump which were not in the 20220101 dump: 5,415 (an average of 285 additions per day)
My guesstimate had slightly overestimated the progress since 20220101. However, the 20220120 total of articles with untagged non-PDF bare URL refs is 30,402 lower than the 20220101 total of 252,226. So in 19 days, the total of articles with untagged bare URLs was reduced by just over 12%, which is great progress.
Those numbers do not include refs tagged with {{Bare URL inline}}. That tally fell from 33,794 in 20220101 to 13,082 in 20220120. That is a fall of 20,712 (61%), which is phenomenal progress, and it is overwhelmingly due to @Rlink2's very productive targeting of those inline-tagged bare URL refs.
There is some overlap between the sets of articles with tagged and untagged bare URLs, because some articles have both tagged and untagged bare URL refs. A further element of fuzziness comes from the fact that some of the articles with inline-tagged bare URLs are only to PDFs, which tools cannot fill.
Combining the two lists gives 20220120 total of 231,316 articles with tagged or untagged bare URL refs, including some PDFs. So I guesstimate a total of 230,000 articles with tagged or untagged non-PDF bare URLs refs.
Taking both tagged and untagged bare URL refs, the first 19 days of January saw the tally fall by about 40,000. I estimate that about 25,000 of that is due to the work of Rlink2, which is why I am so keen that Rlink2's work should continue. BrownHairedGirl (talk) • (contribs) 18:06, 22 January 2022 (UTC)[reply]
Update. I now have the data from my scans of the 20220201 database dump:
  • Articles with untagged non-PDF bare URL refs: 215,177 (down from 221,824)
  • Articles with untagged non-PDF bare URL refs in the 20220201 dump which were not in the 20220120 dump: 3,731 (an average of 311 additions per day)
  • Articles with inline-tagged bare URL refs: 13,162 (slightly up from 13,082 in 20220120)
So in this 12-day period, the number of tagged and untagged non-PDF bare URLs fell by 6,567. That average net cleanup of 547 per day in late January is way down from over 2,000 per day in the first period of January.
In both periods, I was keeping Citation bot fed 24/7 with bare URL cleanup; the difference is that in early January, Rlink2's work turbo-charged progress. When this bot is authorised, the cleanup will be turbo-charged again. BrownHairedGirl (talk) • (contribs) 20:44, 5 February 2022 (UTC)[reply]
Thank you for the update. Provided everything goes well, we'll be singing the victory polka sooner than we think, meaning we can redirect our attention to bare URL PDFs (yes - I have some ideas of how to deal with PDFs, but let's focus on this right now). Rlink2 (talk) 04:10, 7 February 2022 (UTC)[reply]
@Rlink2: Sounds good.
I also have ideas for bare URL PDF refs. When this bot discussion is finished, let's chew over our ideas on how to proceed. BrownHairedGirl (talk) • (contribs) 09:57, 7 February 2022 (UTC)[reply]
  • Scope. @Rlink2: I ask that PDF bare URLs should be excluded from this task. {{Bare URL PDF}} is a useful tag, but I think that there are better ways of handling PDF bare URLs. I will launch a discussion elsewhere on how to proceed. They are easily excluded in database scans, and easily filtered out of other lists (AWB: skip if page does NOT match the regex <ref[^>]*?>\s*\[?\s*https?://[^>< \|\[\]]+(?<!\.pdf)\s*\]?\s*<\s*/\s*ref\b), so the bot can easily pass by them. --BrownHairedGirl (talk) • (contribs) 02:20, 21 January 2022 (UTC)[reply]
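For illustration, a minimal Python sketch of how a list-making or bot run could apply that same skip rule; the regex is the non-PDF bare-URL pattern quoted above, and the function name is an illustrative assumption.

<syntaxhighlight lang="python">
# Sketch of the "skip if page does NOT match" filter described above; should_process()
# is an illustrative helper name, not part of the bot's published code.
import re

NON_PDF_BARE_REF = re.compile(
    r"<ref[^>]*?>\s*\[?\s*https?://[^>< \|\[\]]+(?<!\.pdf)\s*\]?\s*<\s*/\s*ref\b",
    re.IGNORECASE,
)

def should_process(wikitext):
    # Work on the page only if it still contains at least one non-PDF bare URL ref.
    return NON_PDF_BARE_REF.search(wikitext) is not None
</syntaxhighlight>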
@BrownHairedGirl: Ok, I took it out of the proposal. The proposal is on hold due to the ANI, and it has not yet been transcluded on the main BRFA page, so I felt that it was OK to do so to clean up the clutter. Rlink2 (talk) 22:10, 21 January 2022 (UTC)[reply]
@Rlink2: I have had a rethink on the PDF bare URLs, and realise that I had fallen into the trap of letting the best be the enemy of the good.
Yes, I reckon that there probably are better ways to handle them. But as a first step, it is better to have them tagged than not to have them tagged ... and better to have them tagged with the specific {{Bare URL PDF}} than with the generic {{Bare URL inline}}.
So, please may I change my mind, and ask you to reinstate the tagging of PDF bare URLs? Sorry for messing you around. BrownHairedGirl (talk) • (contribs) 09:36, 1 March 2022 (UTC)[reply]
@BrownHairedGirl: No problem. I will make the change and update the source code to reflect it. Thanks for the feedback. Rlink2 (talk) 14:36, 1 March 2022 (UTC)[reply]
@Rlink2: that's great, and thanks for being so nice about my change-of-mind.
In the meantime. I have updated User:BrownHairedGirl/BareURLinline.js so that it uses {{Bare URL PDF}} for PDFs. I have also done an AWB run on the existing uses of {{Bare URL inline}} for PDFs, converting them to {{Bare URL PDF}}. BrownHairedGirl (talk) • (contribs) 16:15, 1 March 2022 (UTC)[reply]

Opening comments: I've seen <!--Bot generated title--> inserted in similar initiatives. Would that be a useful sort of thing to do here? It is acknowledged that the titles proposed to be inserted by this bot can be verbose and repetitive, terse or plainly wrong. Manual improvements will be desired in many cases. How do we help editors interested in doing this work?

The bot has a way of identifying bad and unsuitable titles, and will not fill in the citation if that is the case. I am using the list from Citation bot plus some other ones I have come across in my AWB runs. Rlink2 (talk) 22:06, 21 January 2022 (UTC)[reply]

Like ProcrastinatingReader I am interested in understanding bot permission precedence here. I'm not convinced that these edits are universally productive. I believe there has been restraint exercised in the past on bot jobs for which there is not a strong consensus that the changes are making significant improvements. I think improvements need to be large enough to overcome the downside of all the noise this will be adding to watchlists. I'm not convinced that bar is cleared here. See User_talk:Rlink2#A_little_mindless for background. ~Kvng (talk) 16:53, 21 January 2022 (UTC)[reply]

@Kvng: I think that a ref like {{cite web | title = Wikipedia - Encarta, from Microsoft | url=https://microsoft.com/encarta/shortcode/332d}} is better than simply a link like <ref>https://microsoft.com/encarta/shortcode/332d</ref>. The consensus is that a bare ref filled incompletely (i.e. without the website parameter) is still better than a link that is 100% bare, as long as the filling leaves the new ref more informative. It's impractical to go from OK to perfect 100% of the time.
I understand that some people may want perfection, and I think if there is room for improvement, we should take it. I recently made an upgrade to the script (the upgrade wasn't active for that edit) that does a better job of filling in the website parameter when it can. With the new script update, the ref you talked about on my page (http://encyclopedia2.thefreedictionary.com/leaky+bucket+counter) would be converted into {{cite web |url=http://encyclopedia2.thefreedictionary.com/leaky+bucket+counter |title = Leaky bucket counter | website = TheFreeDictionary.com}}. This is better than the old filling, which was {{cite web |url=http://encyclopedia2.thefreedictionary.com/leaky+bucket+counter. |title = Leaky bucket counter. {{!}} Article about leaky bucket counter. by The Free Dictionary}}. It does not work for all sites yet, but it is a start. Rlink2 (talk) 22:06, 21 January 2022 (UTC)[reply]
Rlink2 and BrownHairedGirl make the argument that these replacements are good and those opposing them are seeking perfect. In most cases, these are clear incremental improvements (good). In a few cases and aspects, they arguably don't improve or even degrade things (not good). Because the bot relies on external metadata (HTML titles) of highly variable quality and format, there doesn't seem to be a reliable way to separate the good from the not good. One solution is to have human editors follow the bot around and fix these but we don't have volunteers lined up to do that. Another solution is to tolerate the few not good contributions in appreciation of the overall good accomplished but I don't know how we do that value calculation. ~Kvng (talk) 16:14, 22 January 2022 (UTC)[reply]
@Kvng: I already explained that I upgraded the script to use more information than just HTML titles, for an even more complete filling; see my response above. Regarding "there doesn't seem to be a reliable way to separate": I have developed ways to detect bad titles. In those cases, the bot will not fill in the ref. There is a difference between a slightly ugly title (like the free dictionary one) and a non-informative title (like "Website", "Twitter", "News story"). The former provides more information to the reader, while the latter provides less. So if the title is too generic, the bot won't fill in the ref. Rlink2 (talk) 16:18, 22 January 2022 (UTC)[reply]
Sure, we can make improvements as we go but because HTML titles are so varied, there will be more discovered along the way. Correct me if I misunderstand, but the approval I believe you're seeking is to crawl all of Wikipedia at unlimited rate and apply the replacements. With that approach, we'll only know how to avoid problems after all the problems have been introduced. ~Kvng (talk) 16:55, 22 January 2022 (UTC)[reply]
@Kvng: as requested below, please provide diffs showing the alleged problems. BrownHairedGirl (talk) • (contribs) 17:02, 22 January 2022 (UTC)[reply]
@Kvng: Regarding "with that approach, we'll only know how to avoid problems after all the problems have been introduced": not necessarily. I save all the titles to a file before applying them. I look over the file and see if there are any problem titles. If there are, I remove them, and modify the script to not place that bad title. And even when the bot is in action, I'll still look at some diffs after the fact to catch any possible mistakes. Rlink2 (talk) 17:24, 22 January 2022 (UTC)[reply]
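A rough sketch of that review step, assuming the proposed titles are written to a tab-separated file for manual inspection before any edits are applied; the file names and format are illustrative, not the operator's actual workflow.

<syntaxhighlight lang="python">
# Sketch of the review workflow described above: dump proposed titles to a TSV for
# manual review, then apply only the rows that survive the review. Illustrative only.
import csv

def dump_for_review(proposed, path="proposed_titles.tsv"):
    """proposed maps URL -> fetched title."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        for url, title in sorted(proposed.items()):
            writer.writerow([url, title])

def load_reviewed(path="reviewed_titles.tsv"):
    """Read back the hand-reviewed file; rows deleted during review are simply absent."""
    with open(path, newline="", encoding="utf-8") as f:
        return {url: title for url, title in csv.reader(f, delimiter="\t")}
</syntaxhighlight>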
@Kvng: please post diffs which identify the cases where you believe that Rlink2's filling of the ref has:
  1. not improved the ref
  2. degraded the ref
I don't believe that these cases exist. You claim that they do exist, so please provide multiple examples of each type. BrownHairedGirl (talk) • (contribs) 16:29, 22 January 2022 (UTC)[reply]
My previous anecdotal complaints were based on edits I reviewed on my watchlist. I have now reviewed the 37 most recent (a screenful) bare reference edits by Rlink2 and find the following problems. 10 of 37 edits I don't consider to be improvements.
  1. [1] introduces WP:PEACOCK issue
  2. [2] broken link, uses title of redirect page
  3. [3] broken link, uses title of redirect page
  4. [4] broken link, uses title of redirect page
  5. [5] broken link, uses title of redirect page
  6. [6] broken link, uses title of redirect page
  7. [7] website name, not article title
  8. [8] incorrect title
  9. [9] new title gives less context than bare URL
  10. [10] new title gives less context than bare URL ~Kvng (talk) 17:44, 22 January 2022 (UTC)[reply]
@Kvng: So that means there were 27 improvements? Of course there are bugs and stuff, but we can always work through it.
  • [11] An informative but WP:PEACOCK title is better than a bare ref IMO.
Regarding the next set of links (uses title of redirect page), the upgrades I have made will fix those. If two different URLs have the same title, the bot will assume that it is a generic one. Most of these URL redirects are dead links anyway, so they will be left alone.
  • [12] This has been fixed in the upgrade.
  • [13] Don't see an issue.
  • [14] Easily fixed; didn't catch that one, but it is kept in mind for future edits.
  • [15] The bare URL arguably didn't have much information (there is a difference between "https://nytimes.com/what-you-should-do-2022" and "NY times" versus "redwater.ca/pebina-place" and "Pembina Place"). Nevertheless, the upgrade should have tackled some of these issues, so this should happen less and less.
So now there are only one or two problem edits that I have not addressed yet (like the WP:PEACOCK one). Not bad. Rlink2 (talk) 18:09, 22 January 2022 (UTC)[reply]
The plan is for the bot to do 200,000 edits, and at 1-2 issues for every 37 edits, we'd potentially be introducing 5,000-10,000 unproductive edits. I'm not sure that's acceptable. ~Kvng (talk) 19:21, 22 January 2022 (UTC)[reply]
@Kvng: I said 1-2 issues in your existing set, not that there would literally be 1-2 issues for every 37 edits. As more issues get fixed, the rate of bad edits will get less and less. The bot will run slowly at first, to catch any mistakes, then speed up. Sound good? Rlink2 (talk) 19:24, 22 January 2022 (UTC)[reply]
I'm extrapolating from a small sample. To find out more accurately what you're up against, we do need a larger review. Looking at just 50 edits, I've seen many ways this can go wrong. That leads me to assume there are still many others that have not been uncovered. You need to add some sort of QA plan to your proposal to address this. ~Kvng (talk) 00:31, 23 January 2022 (UTC)[reply]
@Kvng: You identified many edits with the same problem. The same problems that have been fixed. You didn't find 10 different errors; you found 5 issues, 4 of which have been fixed already/will be fixed, and 1 which I don't think is an issue: even if the title is WP:PEACOCK, it is still more informative than the original ref (I will look into this, however). Remember, this is all about incremental improvement. These citations have no information attached to them at all. There is nothing. It is important to add "something"; even if not perfect, it will always be more informative than having nothing. If you were very thirsty and in need of a drink of water right now, would you deny the store-brand water because you prefer Fiji Water? It's also like saying you would rather have no car if you can't afford a Ferrari or Lamborghini.
I have a QA plan already in action, as explained before. Rlink2 (talk) 00:56, 23 January 2022 (UTC)[reply]
I assume you're referring to "I save all the titles to a file before applying them. I look over the file and see if there are any problem titles. If there are, I remove them, and modify the script to not place that bad title. And even when the bot is in action, I'll still look at some diffs after the fact to catch any possible mistakes." This didn't seem to work well for the meatbot edits you've already done. Despite your script improvements, I'm not confident this will go better with the real bot. How about some sort of a trial run and review of edit quality by independent volunteers? ~Kvng (talk) 21:18, 25 January 2022 (UTC)[reply]
Can you do something about the 30 pages now found by insource:"title = Stocks - Bloomberg"? ~~~~ User:1234qwer1234qwer4 (talk) 20:56, 27 January 2022 (UTC)[reply]
@1234qwer1234qwer4: Nice to see you around here; thanks for reviewing my BRFA. Your opinion is very much appreciated and respected; you know a lot and have lots to say. Regarding "bloomberg": some (but not all) of those titles were placed by me. It appears that those 30 links with the generic title are dead links in the first place. I can go through them and replace them manually. The script has an upgrade to look for, and not place, any title that has been shared across multiple URLs, to help prevent the placement of generic titles. Rlink2 (talk) 21:11, 27 January 2022 (UTC)[reply]
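A minimal sketch of that duplicate-title check, assuming the fetched titles for a run are collected before any edits are applied; the function and variable names are illustrative, not the bot's published code.

<syntaxhighlight lang="python">
# Sketch of the check described above: a title that comes back identical for several
# distinct URLs (e.g. "Stocks - Bloomberg") is treated as generic and never placed.
from collections import defaultdict

def find_generic_titles(fetched_titles):
    """fetched_titles maps URL -> fetched title for one run."""
    urls_by_title = defaultdict(set)
    for url, title in fetched_titles.items():
        urls_by_title[title.strip().lower()].add(url)
    # Any title shared by more than one URL is flagged as generic.
    return {title for title, urls in urls_by_title.items() if len(urls) > 1}
</syntaxhighlight>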

It looks like a lot of cites use {{!}} with spammy content, for example from the first example |title = Blur {{!}} full Official Chart History {{!}} Official Charts Company. This is hard, as you don't know which sub-string is spam vs. the actual title ("Blur"). One approach: split the string into three along the pipe boundary and add each as a new line in a very long text file. Then sort the file with counts for each string, e.g. "450\tOfficial Charts Company" indicates it found 450 titles containing that string along a pipe boundary, i.e. it is spam that can be safely removed. Add those strings to a squelch file so whenever they are detected in a title they are removed (along with the leading pipe). The squelch data would be invaluable to other bot writers as well. It can be run on existing cites on-wiki first to build up some data. You'd probably want to manually review the data for false positives, but these spam strings are pretty obvious and you can get a lot of them this way pretty quickly. -- GreenC 07:10, 22 January 2022 (UTC)[reply]
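A rough sketch of that counting step, assuming the fetched titles are already collected in a list; the split pattern, threshold, and names are illustrative.

<syntaxhighlight lang="python">
# Sketch of the squelch-list idea above: split each title on pipe/dash boundaries,
# count how often each segment recurs, and review the most frequent segments as
# likely site boilerplate (e.g. "Official Charts Company"). Threshold is illustrative.
import re
from collections import Counter

SPLIT = re.compile(r"\s*(?:\||–|—| - )\s*")

def squelch_candidates(titles, min_count=50):
    counts = Counter()
    for title in titles:
        for segment in SPLIT.split(title):
            if segment:
                counts[segment] += 1
    # Segments that recur across many unrelated titles are probably boilerplate;
    # review manually before adding them to the squelch file.
    return [(seg, n) for seg, n in counts.most_common() if n >= min_count]
</syntaxhighlight>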

If this gets done, I am leaning towards being amenable to a trial run; I don't expect this will get approved after only a single run but as mentioned in Kvng's thread above some of the concerns/issues likely won't pop until the bot actually starts going. Primefac (talk) 16:19, 25 January 2022 (UTC)[reply]
@Primefac: @GreenC: I have already done this. All the titles are saved into a file, and if more than one title from the same site has common parts after the "|" symbol, the script can remove them, provided the website parameter can be filled. Detection and filling of the "website=" parameter is also a lot better than before, as I explained above.
some concerns/issues likely won't pop until the bot actually starts going. Yeah I agree. It will go slow at first, then speed up. Rlink2 (talk)
I'm not sure if you missed it (or if I've missed your response), but can you confirm your answer to my third initial question? ProcrastinatingReader (talk) 18:04, 25 January 2022 (UTC)[reply]
@ProcrastinatingReader: When I first made the script, it would only fill in the "title=" parameter. Some editors said that they would like to see the "website=" parameter, and while there is consensus that even filling in the "title=" parameter alone is better than nothing, I added the capability to fill that parameter when possible. It is successful at adding "website=" for some, but not all, websites.
However this bot will leave the dead links bare for now. Rlink2 (talk) 18:29, 25 January 2022 (UTC)[reply]
@Rlink2: please please can the bot tag with {{dead link}} (dated) any bare URL refs which return a 404 error?
This would be a huge help for other bare-URL-fixing processes, because such refs can be excluded at list-making stage, saving a lot of time.
Note that there are other situations where a link should be treated as dead, but they may require multiple checks. A 404 is fairly definitive, so it can be safely tagged on first pass. BrownHairedGirl (talk) • (contribs) 19:07, 25 January 2022 (UTC)[reply]
@BrownHairedGirl: Ok, I can definitely do that. Rlink2 (talk) 20:03, 25 January 2022 (UTC)[reply]
Thanks! BrownHairedGirl (talk) • (contribs) 20:09, 25 January 2022 (UTC)[reply]
PS @Rlink2: my experimentation with bare URL PDFs shows that while HTTP status 410 ("Gone") is rarely used, it does have a non-zero usage.
Since 410 is a definitively dead link, please can the bot treat it like a 404, i.e. tag any such URL as a {{dead link}}?
Also pinging @GreenC, in case they have any caveats to add about 410. BrownHairedGirl (talk) • (contribs) 01:52, 8 February 2022 (UTC)[reply]
@BrownHairedGirl: Sounds good. I will add this. Thanks for bringing up the issue with the highest standards of civility and courteousness, as you always do.
Just to make sure the change works in the bot, could you link to some of the diffs where 410 is the code returned? Thank you again. Rlink2 (talk) 01:59, 8 February 2022 (UTC)[reply]
Many thanks, @Rlink2. I have not been tracking them so far, just tagging them as dead in the hour or so since I added 410 to my experimental code. That leaves no trace of whether the error was 404 or 410.
I will now start logging them as part of my tests, and will get back to you when I have a set. (There won't be any diffs, just page name, URL, HTTP code, and HTTP message). BrownHairedGirl (talk) • (contribs) 02:09, 8 February 2022 (UTC)[reply]
@Rlink2: I have posted[16] at User talk:Rlink2#HTTP_410 a list of 9 such URLs. That is all my script found since I started logging them a few hours ago.
Hope this helps. BrownHairedGirl (talk) • (contribs) 11:25, 8 February 2022 (UTC)[reply]
Accurately determining web page status is harder than it looks. For example, forbes.com uses bot blocking, and if you check their site more than X times in a row without sufficient pause it will return 404s (or 403?) even though the page is 200. It's a CloudFlare service, I think, so lots of sites use it. A robust general-purpose dead link checker is quite difficult. IABot, for example, checks a link three times over at least a 3-week period to allow for network variances. -- GreenC 20:34, 25 January 2022 (UTC)[reply]
Regarding "For example, forbes.com uses bot blocking and if you check their site more than X times in a row without sufficient pause it will return 404s (or 403?) even though the page is 200": to be exact, it does not return a 404; it returns something else. BHG was just talking about 404 links, which are pretty clear-cut in their dead-or-alive status. Rlink2 (talk) 20:40, 25 January 2022 (UTC)[reply]
Maybe that will work, keep an eye out because websites do all sorts of unexpected nonstandard and illogical things with headers and codes. -- GreenC 21:19, 25 January 2022 (UTC)[reply]
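To make the intended behaviour concrete, here is a minimal sketch of a status check along the lines discussed above, in which only a definitive 404 (or, per the later discussion, 410) leads to a {{dead link}} tag and every other outcome is left alone; the helper name is an assumption, and real-world handling (retries, servers that mistreat HEAD requests) would need more care.

<syntaxhighlight lang="python">
# Sketch of the dead-link rule discussed above: tag only firm 404/410 responses;
# timeouts, DNS failures, 403/503s etc. are not treated as dead. Illustrative only.
import datetime
import requests

def dead_link_tag(url):
    try:
        # Some servers mishandle HEAD requests, so a GET fallback may be needed.
        status = requests.head(url, allow_redirects=True, timeout=30).status_code
    except requests.RequestException:
        return None                      # network errors are inconclusive; skip
    if status in (404, 410):             # definitively gone
        return "{{dead link|date=" + datetime.date.today().strftime("%B %Y") + "}}"
    return None
</syntaxhighlight>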
This project has so far been marked by underappreciation of the complexity of the work. We should keep the scope tight and gain some more experience with the primary task. I do not support adding dead link detection and tagging to the bot's function. ~Kvng (talk) 21:26, 25 January 2022 (UTC)[reply]
@Kvng: Regarding "This project has so far been marked by underappreciation of the complexity of the work": don't be confused, I have been fine-tuning the script for some time now. I am aware of the nooks and crannies. Adding dead link detection is uncontroversial and keeps the tool even more within scope. So why don't you support it? Rlink2 (talk) 21:39, 25 January 2022 (UTC)[reply]
Because assuring we get a usable title is hard enough. We don't need the distraction. The bot is not very likely to be adding a title and dead link tag in the same edit so there will be few additional edits if we do dead link tagging as a separate task later. ~Kvng (talk) 22:44, 25 January 2022 (UTC)[reply]
Regarding "Because assuring we get a usable title is hard enough": except it isn't. You have identified 5 bugs, which we have already fixed. Regarding "The bot is not very likely to be adding a title and dead link tag in the same edit": the title and dead link detection are similar but not the same. If the title is unsuitable, the bot will leave the ref alone. If the link is dead, it will place the Dead link template. Rlink2 (talk) 22:58, 25 January 2022 (UTC)[reply]
@Rlink2: I don't know how much software development experience you have, but my experience tells me that the number of remaining bugs is directly related to the number of bugs already reported. It is wrong to assume that all programs have a similar number of bugs and that the more you've found and fixed, the better the software is. The reality is that the quality of software and the complexity of problems vary greatly, and some software has an order of magnitude more issues than others. I found several problems in your work quickly, so I think it is responsible to assume there are many more yet to be found. ~Kvng (talk) 14:12, 26 January 2022 (UTC)[reply]
@Kvng: There is zero distraction. The info needed to decide to tag a URL as dead will always be available to the bot, because the first step in trying to fill the URL is to make an HTTP request. If that request fails with a 404 error, then we have a dead link. It's a very simple binary decision.
Your claim about low coincidence is the complete opposite of my experience of months of working almost solely on bare URLs. There is a very high incidence of pages with both live and dead bare URLs. So not doing it here will mean a lot of additional edits, and -- even more importantly -- a much higher number of wasted human and bot jobs repeatedly trying to fill bare URLs which are actually dead. BrownHairedGirl (talk) • (contribs) 23:01, 25 January 2022 (UTC)[reply]
PS Just for clarification, a 404 error reliably indicates a dead URL. As GreenC notes there are many other results where a URL is definitively dead but a 404 is not returned, and those may take multiple passes. But I haven't seen any false 404s. (There may be some, but they are very rare). BrownHairedGirl (talk) • (contribs) 04:59, 26 January 2022 (UTC)[reply]
@BrownHairedGirl: I respect your experience on this. I did not find any of those cases in the 50 edits I have reviewed. Perhaps that's because of the state of Rlink2's tool.
I don't agree that there is zero distraction. We were already distracted discussing the details of implementing this before I came in and suggested we stay focused. ~Kvng (talk) 14:12, 26 January 2022 (UTC)[reply]
@Kvng: That talk of distraction is disingenuous. There were two brief posts on this before you created a distraction by turning it into a debate which required explanation of things you misunderstood. BrownHairedGirl (talk) • (contribs) 14:22, 26 January 2022 (UTC)[reply]
Happy to take the heat for drawing out the process. It's the opposite of what I'm trying to do so apparently I'm not doing it well. I still think we should fight scope creep and stick to filling in missing titles. ~Kvng (talk) 00:21, 27 January 2022 (UTC)[reply]
As I already explained, tagging dead links is an important part of the process of filling titles, because it removes unfixables from the worklist.
And as I already explained, it is a very simple task which uses info which the bot already has. BrownHairedGirl (talk) • (contribs) 00:45, 27 January 2022 (UTC)[reply]
Yes, you did explain and I read and it did not persuade me to change my position. I appreciate that being steadfast about this doesn't mean I get my way. ~Kvng (talk) 00:56, 27 January 2022 (UTC)[reply]

====Source code====

Speaking of fine tuning, do you intend to publish your source code? I think we may be able to identify additional gotchas though code review. ~Kvng (talk) 22:44, 25 January 2022 (UTC)[reply]
Hopefully, but not right now. It wouldn't be very useful for "code review" in the way you are thinking. If there are bugs, though, you can always report them. Rlink2 (talk) 22:54, 25 January 2022 (UTC)[reply]
@Rlink2: I have to disagree with you on this. As a general principle, I am very much in favour of open-source code. That applies even more strongly in a collaborative environment such as Wikipedia, so I approach bots with a basic presumption that the code should be available, unless there is very good reason to make an exception.
Publishing the code brings several benefits:
  1. it allows other editors to verify that the code does what it claims to do
  2. it allows other editors to help find any bugs
  3. it helps others who may want to develop tools for related tasks
So if a bot-owner does not publish the source code, I expect a good explanation of why it is being withheld. BrownHairedGirl (talk) • (contribs) 00:35, 26 January 2022 (UTC)[reply]
@BrownHairedGirl: Ok, nice to see your perspective on it. I will definitely be making it open source then. When should I make it available? I can provide a link later in the week, or should I wait until the bot enters trial? Where would I even post the code, anyway? Thanks for your opinion. Rlink2 (talk) 00:39, 26 January 2022 (UTC)[reply]
@Rlink2: Up to you, but my practice is to make it available whenever I am ready to start a trial. That is usually before a trial is authorised.
I usually put the code in a sub-page (or pages) of the BRFA page. BrownHairedGirl (talk) • (contribs) 01:06, 26 January 2022 (UTC)[reply]
@BrownHairedGirl: Sounds good, I will follow your example and make it available as soon as I can (later this week). A subpage sounds great; good idea, and it keeps everything on wiki. Rlink2 (talk) 01:11, 26 January 2022 (UTC)[reply]
There is preliminary code up at Wikipedia:Bots/Requests_for_approval/BareRefBot/Code. There is more to the script than that (e.g. networking code, wikitext code), but this is the core of it. Will be releasing more as time goes on and I have time to comment the additional portions. Rlink2 (talk) 20:08, 26 January 2022 (UTC)[reply]
Code review comments and discussion at Wikipedia talk:Bots/Requests for approval/BareRefBot/Code

===Trial===

====Trial 1====

Approved for trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. As I mentioned above, this is most likely not going to be the only time the bot ends up in trial, and even if there is 100% success in this first round it might get shipped for a larger trial anyway depending on feedback. Primefac (talk) 14:12, 26 January 2022 (UTC)[reply]

@Rlink2: Please can the report on the trial include not just a list of the edits, but also the list of pages which the bot skipped. That info is very useful in evaluating the bot. BrownHairedGirl (talk) • (contribs) 14:25, 26 January 2022 (UTC)[reply]
@BrownHairedGirl: Ok. Rlink2 (talk) 20:07, 26 January 2022 (UTC)[reply]
@Primefac: could you please enable AWB for the bot for the trial? Thank you. Rlink2 (talk) 21:43, 26 January 2022 (UTC)[reply]
@Rlink2: I don't see any problem with doing the trial edits from your own account, with an edit summary linking to the BRFA: e.g.
[[WP:BRFA/BareRefBot|BareRefBot]] trial: fill 3 [[WP:Bare URLs]]
... which renders as: BareRefBot trial: fill 3 WP:Bare URLs
That is what I have done with my BRFAs. BrownHairedGirl (talk) • (contribs) 18:06, 27 January 2022 (UTC)[reply]
@BrownHairedGirl: Ok, I will do this later today. Thank you for the tips. Rlink2 (talk) 18:11, 27 January 2022 (UTC)[reply]
Trial complete. See edits here (the page is a bit slow to load). The ones the bot skipped already had the bare refs filled in by Citation bot, since I am working from the older database dump. If it skipped/skips one due to a bug in the script, I would have listed and noted that. Rlink2 (talk) 03:18, 28 January 2022 (UTC)[reply]
Here is the list of edits via the conventional route of a contribs list: https://en.wikipedia.org/w/index.php?title=Special:Contributions/Rlink2&offset=202201280316&dir=next&target=Rlink2&limit=53
Note that there were 53 edits, rather than the authorised 50. BrownHairedGirl (talk) • (contribs) 03:27, 28 January 2022 (UTC)[reply]
Whoops! AWB said 50, so I think the edit counter is slightly off with AWB. Maybe I accidentally stopped the session, which reset the edit counter or something. Not sure how it works exactly. Sorry about that. But it's just 2 more edits (the actual amount seems to be 52, not 53), so I don't think it should make a big difference. Rlink2 (talk) 03:38, 28 January 2022 (UTC)[reply]
Sorry, it's 52. My contribs list above included one non-article edit. Here's fixed contribs list: https://en.wikipedia.org/w/index.php?target=Rlink2&namespace=0&tagfilter=&start=&end=&limit=52&title=Special%3AContributions
I don't think that it's a big deal of itself. However, when the bot is under scrutiny, the undisclosed counting error is not a great look. --BrownHairedGirl (talk) • (contribs) 13:50, 28 January 2022 (UTC)[reply]
Well, if anything, it was my human mistake of overcounting, not an issue with the bot code. Next time I'll make sure it's exactly 50 edits. Sorry about that. Rlink2 (talk) 14:03, 28 January 2022 (UTC)[reply]
I don't know much about this but I thought the way this was done was to program the bot to stop after making 50 edits? Levivich 18:46, 28 January 2022 (UTC)[reply]
I did the trial with AWB manually, and apparently the AWB counter is slightly bugged. If I were using the bot frameworks I could have made it exactly 50. Rlink2 (talk) 21:28, 28 January 2022 (UTC)[reply]
@Rlink2: I think that an AWB bug is very very unlikely. I have done about 1.5 million AWB edits over 16 years, and have never seen a bug in its counter.
I think that the error is most likely to have arisen from the bot saving a page with no changes. That would increment AWB's edit counter, but the server would see it as a WP:Null edit, and not create a new revision.
One technique that I use to avoid this is to make the bot copy the variable ArticleText to FixedArticleText. All changes are applied to FixedArticleText. Then, as a final sanity check after all processing is complete, I test whether ArticleText == FixedArticleText ... and if they are equal, I skip the page. BrownHairedGirl (talk) • (contribs) 00:17, 29 January 2022 (UTC)[reply]
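The same sanity check translates straightforwardly to other bot frameworks; a minimal sketch in Python, with apply_fixes and save_page as illustrative stand-ins for whatever actually processes and writes the page:

<syntaxhighlight lang="python">
# Sketch of the null-edit guard described above: apply all fixes to a copy of the
# page text and only save when something actually changed. apply_fixes() and
# save_page() are illustrative stand-ins, not part of any particular framework.
def maybe_save(title, article_text, apply_fixes, save_page):
    fixed_text = apply_fixes(article_text)
    if fixed_text == article_text:
        return False   # skip: saving would be a null edit but would still bump the counter
    save_page(title, fixed_text)
    return True
</syntaxhighlight>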
I think that the error is most likely to have arisen from the bot saving a page with no changes. This is the most likely explanation. Rlink2 (talk) 01:13, 29 January 2022 (UTC)[reply]
Not sure I understand this, since that would seem to result in fewer edits being made rather than more. ~~~~ User:1234qwer1234qwer4 (talk) 17:53, 29 January 2022 (UTC)[reply]
Well, if anything, it was my human error that made it go above 50, since I manually used the script with AWB. It is not a problem with the bot or the script. Rlink2 (talk) 17:59, 29 January 2022 (UTC)[reply]

Couple thoughts:

  • It looks like if there is |title=xxx - yyy and |url=zzz.com, and zzz is equal to either xxx or yyy, it should be safe to remove it from the title and add to |website=. (or {{!}} or long dash instead of dash). Appears to be a common thing: A1, A2, A3, A4, A5
  • Similar to above could check for abbreviated versions of zzz: B1, B2, B3
  • FindArticles.com is a common site on Wikipedia: C1 It also serves many soft-404s. Looks like that is the case here: a "dead" link resulting in the wrong title.
  • GoogleNews is a common site that could have a special rule: D1

-- GreenC 15:18, 28 January 2022 (UTC)[reply]

Regarding "and zzz is equal to either xxx or yyy, it should be safe to remove it from the title and add to |website=": like I said, the script does do this when it can. See this diff as one example out of many. Some of the diffs you link also exhibit this behavior. Emphasis on "when it can": it errs on the side of caution, since a sensible title field is better than a possibly malformed website field. Also, some of the diffs you linked to are cites of the main page of the website, so in that case a "generic" title is expected.
Even in some of the ones you linked, there is no obvious way to tell the difference between "| Article | RPGGeek" and just " |RPGGeek" since there are two splices and not just one.
findarticles.com - Looks like that is the case here, "dead" link resulting in the wrong title. Ok, good to know. The script does have something to detect when the same title is used across multiple sites.
Regarding "GoogleNews is a common site that could have a special rule": I saw that. I thought it was fine, because the title has more information than the original URL, which is the entire point, right? What special rule are you proposing? Rlink2 (talk) 15:36, 28 January 2022 (UTC)[reply]
  • A1: title {{!}} RPGGeek == url of rpggeek .. thus if the title is split along {{!}} this match shows up. It's a literal match (other than case), so it should be safe.
  • A2: same as A1. Split along "-" and there is a literal match with the URL. Adjust for spaces and case.
  • A3: same with title {{!}} Air Journal and url air-journal .. in this case flatten case and test for space or dash in URL.
  • A4: same with iReadShakespeare in the title and url.
  • A5: another RPGGeek
  • D1: for example, a site-specific rule if "Google News Archive Search" in the title, remove and set the work "Google News"
-- GreenC 18:09, 28 January 2022 (UTC)[reply]
A1, A2, A3: I didn't make the script split every splice blindly, with one half going into the title and the other half into the website field. That opens up a can of bugs, since sites can put anything in there. If the script is going to split, it needs quality assurance. When it has that quality assurance, it will split the title and place the website parameter, like it did with some of the URLs in the airport article diff.
If the source is used a lot on enwiki, it is easy to remove the common portions without much thought, thanks to the list. But the common portions of a title are not necessarily suitable for a website parameter (for example: the website above is RPGgeek.com, but the common part of the title is "| Article | RPGgeek.com"). Of course, you could say "just take the last splice", but what if there is another site that does "| RPGgeek.com | Article"? There are a lot of website configurations, so we need to follow Postel's law and play it safe.
Compare this to IMDB, where the part after the dash is suitable for the website parameter. So the script is not going to just remove common parts of the title if it's not sure where that extra information should go. We want to make the citation more informative, not less.
A4: The website name is part of the title as a pun; look at it closely. That's one case where we don't want to remove the website title; if we just go around removing and splitting stuff blindly, this is one of the problems we would be creating. And it's a cite of a main webpage too.
D1 - OK, that sounds fine. Good suggestion. Rlink2 (talk) 18:49, 28 January 2022 (UTC)[reply]
but for A1-A3 it's not just anything, it's a literal match. Test for the literal match. To be more explicit with A1:
title found = whatever {{!}} whatever {{!}} RPGGeek. And the existing |url=rpggeek.com. Split the string along {{!}} (or dash). Now there are three strings: "whatever", "whatever", "RPGGeek". For each of the three strings, compare with the base URL string, in this case "rpggeek". Comparison 1: "whatever" != "rpggeek". Comparison 2: "whatever" != "rpggeek". Comparison 3: "RPGGeek" == "rpggeek": we found a match! Thus you can safely do two things: remove {{!}} RPGGeek from the title; and add |website=RPGGeek. This rule/system should work for every example. You may need to remove spaces and/or replace them with "-" and/or lower-case the title string when doing the URL string comparison. I see what you're saying about A4: you don't want to mangle existing titles when it's a legit usage along a split boundary; I guess the question is how common that is. -- GreenC 19:14, 28 January 2022 (UTC)[reply]
BTW, if you're not comfortable doing it, don't do it. It's the sort of thing that may be correct 95% of the time and wrong 5%, so you have to weigh the utility of that versus doing nothing for the 95%. -- GreenC 19:51, 28 January 2022 (UTC)[reply]
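For concreteness, here is a minimal sketch of the literal-match rule GreenC outlines, assuming the title is split on {{!}}/pipe/dash boundaries and a segment is only moved into |website= when it matches the URL's base domain; the normalisation details and names are illustrative assumptions, not the bot's code.

<syntaxhighlight lang="python">
# Sketch of the literal-match split described above. A title segment is moved to
# |website= only when it equals the URL's base domain after normalising case,
# spaces and dashes (so "Air Journal" matches air-journal.fr). Illustrative only.
import re
from urllib.parse import urlparse

SPLIT = re.compile(r"\s*(?:\{\{!\}\}|\||–|—| - )\s*")

def normalise(s):
    return re.sub(r"[\s\-]", "", s).lower()

def split_title(title, url):
    base = urlparse(url).netloc.removeprefix("www.").rsplit(".", 1)[0]  # e.g. "rpggeek"
    kept, website = [], None
    for segment in SPLIT.split(title):
        if website is None and segment and normalise(segment) == normalise(base):
            website = segment          # literal match with the domain: safe to move
        elif segment:
            kept.append(segment)
    return " | ".join(kept), website
</syntaxhighlight>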
@GreenC: Thank you for your insight; I will have to think about implementing this. I have already kind of done this: see the updated source code I uploaded. I can implement what you are asking for domains that come after the splices. For example, if the website is "encarta.com" and the title is "Wiki | Encarta.com", then the "encarta.com" can be split off, but if the title is "Wiki | Encarta Encyclopedia - the Paid Encyclopedia", with no other metadata to help retrieve the website name, then it's a harder situation to deal with, so I don't split at all. Rlink2 (talk) 21:28, 28 January 2022 (UTC)[reply]

I went through all 52 so that my contribution to this venture wouldn't be limited to re-enacting the Spanish Inquisition at ANI.

  1. Special:Diff/1068376250 - The bare link was more informative to the reader than the citation template, because the bare link at least said "Goodreads.com", whereas the citation template just gives the title, which is the title of the book, and the same as the title of the Wikipedia article (and the title was in the URL anyway). So in this case, the bot removed (or hid, behind a citation template) useful information, rather than adding useful information. I don't see how this edit is an improvement.
  2. Special:Diff/1068372499 - Similarly, here the bot replaced a bare URL to aviancargo.com with a citation template with the title "Pagina sin titulo" ("page without title"). This hides the useful information of the domain name and replaces it with a useless page title. This part of this edit is not an improvement.
  3. Special:Diff/1068369653 - Replaces English-language domain name in bare URL with citation template title using foreign language characters. Not an improvement; the English-speaking reader will learn more from the bare URL than the citation template.
  4. Special:Diff/1068369064 - |website=DR tells me less than www.dr.dk, but maybe that's a problem with the whitelist?
  5. Special:Diff/1068369849 - an example of promo being added via website title, in this case the source's tagline, "out loud and in community!"
    • I have a similar concern about Special:Diff/1068369121 because we're adding "Google News Archive Search" prominently in citation templates. However, news.google.com was already in the bare URL, and the bot is also adding the name of the newspaper, so it is adding useful information. My promo concern here is thus weak.
  6. Special:Diff/1068368882, Special:Diff/1068375545, Special:Diff/1068369433 (first one) - tagged as dead URLs, but they are not dead; they all go to live websites for me.
  7. Special:Diff/1068375631 and Special:Diff/1068369185 - tagged as dead URL but coming back to me as 503 not 404. Similarly Special:Diff/1068372071 is 522 not 404. Special:Diff/1068376097 is coming back to me as a timeout not a 404. Special:Diff/1068376127 as a DNS error, not a 404. This may not be a problem if "dead URL" also applies to 503 and 522s and timeouts and DNS errors and all the rest, and not just 404s, but thought I'd mention it.

I wonder if the concerns in #1-4 could be addressed by simply adding |website=[domain name] to the citation template? That would at least preserve the useful domain name from the bare URL. No. 5 is concerning to me as this came up in previous runs. Even if this promo problem only occurs 2% of the time, if we run this on 200k pages, that's 4,000 promo statements we'll be adding to the encyclopedia. Personally, I don't know if that is, or is not, too high a price to pay for the benefit of converting bare URLs into citation templates. (I am biased on this issue, though, as I don't see much use in citation templates personally.) No. 6 is a problem, and I question whether tagging something as dead based on one ping is sufficient, as mentioned above. #7 may not be a problem at all, I recognize. Hope this helps, and thank you to everyone involved for your work on this, especially Rlink. Levivich 18:19, 28 January 2022 (UTC)[reply]

@Levivich: Regarding "I went through all 52 so that my contribution to this venture wouldn't be limited to re-enacting the Spanish Inquisition at ANI": thank you for taking the time to review those edits, and thank you for your civility and good faith both here and at ANI. Hopefully we avoided Wikimedia Archive War 2. Wikimedia Archive War 1 was the war to end all wars; there were lots of casualties, and we don't need another one. As much as I think arguments about archive sites are stupid (and this comment was made before the conflict started), let's respect everyone who is suffering through a very real war right now.... Off-topic banter aside....
Special:Diff/1068369064 - the difference between DR and DR.dk is very minimal. Besides, "DR" is the name of the news agency/website, so that is the more accurate one IMO.
And regarding the not-404s: I have explained before that I just recently upgraded the getter to only catch 404 links and not anything else. While the diffs you linked that are not "404" are mostly actually still dead links, the consensus here was to only mark firm 404 status code returns as "dead links", so I made that change. The "dead link" data used in this set was collected before that change was made to reflect just 404s, and I only realized that after the fact. Regarding the completely live links being marked as dead, that might just be a timeout error (not that it matters now, because anything that is not a 404 but doesn't work for me will just be left alone).
Regarding "Even if this promo problem only occurs 2% of the time": it's less than that. I think there was only one diff showing this. And if it's a big, big issue, I can blacklist puff words and not fill those in.
Regarding "I don't see much use in citation templates personally": well, normally it wouldn't matter. But we are adding information to the citation, and the cite template is the preferred way to do so.
Regarding "Hope this helps, and thank you to everyone involved for your work on this, especially Rlink": thank you, and also thanks for all the hard work you do around here on the wiki as well. But none of this would be possible without BHG. She laid the foundations of all this stuff. Without her involvement this would have been impossible. Her role in fixing bare refs is far, far greater than mine. I am just playing my small part, "helping" out. But she has all the expertise. Rlink2 (talk) 21:28, 28 January 2022 (UTC)[reply]

I've taken the time to review the first 25 edits. My findings:

  1. [17] is a certificate issue (SSL_ERROR_BAD_CERT_DOMAIN) and is presumably accessible if you want to risk it. Is it right to mark this as dead link?
  2. [18], [19], [20], [21] I don't understand why there is no |website= on these.


  1. [22] first link does not appear to be dead.
  2. [23] first link does not appear to be dead.
  3. [24] first appears to be a dead link.
  4. [25] https://www.vueling.com/en/book-your-flight/flight-timetables does not appear to be a dead link.
  5. [26] https://thetriangle.org/news/apartment-complex-will-go-up-at-38th-and-chestnut/ is reporting a Cloudflare connection timeout. Is it right to mark this as a dead link?

Problems with bare link titles are mostly about the |website= parameter. The code that sorts this out is in a library that is not posted, so I don't know how it works, and I'm not convinced it's doing what we want it to do. See the code review page for further discussion. ~Kvng (talk) 18:25, 28 January 2022 (UTC)[reply]


Is it right to mark this as dead link? (regarding SSL_ERROR_BAD_CERT_DOMAIN) I saw that one. If you click through the SSL error (type in "thisisunsafe" in Chrome or Chromium-based browsers) you see it redirected to another page. If you looked even closer, adding any sort of random characters to the URL redirects to the same page, meaning that there is a blanket redirect with that website. So yes, I think it is right to mark it as dead.
Regarding the findarticles thing: yes, it has already been reported. I think I have to add a redirect check to it: if multiple URLs redirect to the same one, mark them as dead. So thank you for reporting that one.
Regarding "I don't understand why there is no website": as explained before, the bot will only add the website parameter when it is absolutely sure it has a correct and valid website parameter. It is not as simple as splitting on any character like "|" and "-"; that seems obvious, but there are a lot of bugs that could arise just from that.
Is it right to mark this as a dead link? That link does not work for me. I tested on multiple browsers. Rlink2 (talk) 18:47, 28 January 2022 (UTC)[reply]
@Kvng and Levivich: I have always believed that the approach to the |website= parameter should be to:
  1. Use the name of the website if it can be reliably determined (either from the webpage or from a lookup table)
    or
  2. If the name of the website is not available, use the domain name from the URL.
For example, take a bare URL ref to https://www.irishtimes.com/news/world/europe/munich-prosecutors-sent-child-abuse-complaint-linked-to-pope-benedict-1.4788161
If the bot can reliably determine that the name of the website is "The Irish Times", then the cite template should include |website=The Irish Times
... but if the bot cannot reliably determine the name of the website, then the cite template should include |website=www.irishtimes.com.
I take that view because without a name, we have two choices on how to form the cite:
  • A {{cite web |url=https://www.irishtimes.com/news/world/europe/munich-prosecutors-sent-child-abuse-complaint-linked-to-pope-benedict-1.4788161 |title=Munich prosecutors sent child abuse complaint linked to Pope Benedict}}
  • B {{cite web |url=https://www.irishtimes.com/news/world/europe/munich-prosecutors-sent-child-abuse-complaint-linked-to-pope-benedict-1.4788161 |title=Munich prosecutors sent child abuse complaint linked to Pope Benedict |website=www.irishtimes.com}}
Those two options render as:
  • A: "Munich prosecutors sent child abuse complaint linked to Pope Benedict".
  • B: "Munich prosecutors sent child abuse complaint linked to Pope Benedict". www.irishtimes.com.
Option A is to my mind very annoying, because it gives no indication of i) whether the articles appears on a website from Japan or Zambia or Russia or Bolivia, ii) whether the source is a reputable newspaper, a partisan politics site, a blog, a porn site, a satire site or an ecommerce site. That deprives the reader of crucial info needed to make a preliminary assessment of the reliability of the source.
In my view, option B is way more useful, because it gives a precise description of the source. Not as clear as the name, but way better than nothing: in many cases the source can be readily identified from the domain name, and this is one of them.
This is the practice followed by many editors. Unfortunately, a small minority of purists prefer no value for |website= instead |website=domain name. Their perfection-or-nothing approach significantly undermines the utility of bare URL filling, by letting the best (full website name) become the enemy of the good (domain name).
I know that @Rlink2 has had some encounters with those perfection-or-nothing purists, and I fear that Rlink2's commendable willingness to accommodate concerns has led them to accept the demands of this fringe group of zealots. I hope that Rlink2 will dismiss that perfectionism, and prioritise utility to readers ... by reconfiguring the bot to add the domain name. BrownHairedGirl (talk) • (contribs) 19:45, 28 January 2022 (UTC)[reply]
I agree. Not only is B better than A in your example, but I would even say that the bare link is better than A in your example, because the bare link has both the title and the website name in it, but A only gives the title. I honestly struggle to see how anyone could think that a blank |website parameter in a citation template is better than having the domain name in the |website parameter. Levivich 19:50, 28 January 2022 (UTC)[reply]
@Levivich: puritanism can lead people to take very strange stances. I have seen some really bizarre stuff in other discussions on filling bare URLs.
As to this particular link, its URL is formed as a derivative of the article title, so the bare URL is quite informative. So it's a bit of tossup whether filling it with only the title is actually an improvement.
However, some major websites form the URL by numerical formulae (e.g. https://www.bbc.co.uk/news/uk-politics-60166997) or alphanumerical formulae (e.g. https://www.ft.com/content/8f1ec868-7e60-11e6-bc52-0c7211ef3198). In those (alpha)numerical examples, the title alone is more informative.
However, title+website is always more informative than bare URL, provided that the title is not generic. BrownHairedGirl (talk) • (contribs) 20:12, 28 January 2022 (UTC)[reply]
On the subject of |website=, one way of determining the correct website title is relying on redirects from domain names. That is, since irishtimes.com redirects to The Irish Times, the bot can know to add |website=The Irish Times. That is likely to be more comprehensive than any manually maintained database. * Pppery * it has begun... 20:42, 28 January 2022 (UTC)[reply]
That is a good idea, thanks for letting me know @Pppery:. Your thoughts are always welcome here. I kinda have to agree with BHG, as usual. I just didn't know what the consensus was on it, but BHG and Levivich make a clear case for the website parameter. I will add this to the script. One of the community wishlist items should have been to bring VE to non article spaces. Replying to this chain is difficult for me. Rlink2 (talk) 21:28, 28 January 2022 (UTC)[reply]
@Rlink2: to make replying much much easier, go to Special:Preferences#mw-prefsection-betafeatures enable "Discussion tools" (4th item from the top).
That will give you "reply" link after every sig. BrownHairedGirl (talk) • (contribs) 22:05, 28 January 2022 (UTC)[reply]
Thanks for that, this way is so much easier. Rlink2 (talk) 22:07, 28 January 2022 (UTC)[reply]
@Pppery: the problem with that approach is that some domain names host more than one publication, e.g.
It would be easy to over-complicate this bot's task by trying to find the publication's name. But better to KISS by just using the domain name. BrownHairedGirl (talk) • (contribs) 22:01, 28 January 2022 (UTC)[reply]
Makes sense. I have no objection to just using the domain name. * Pppery * it has begun... 22:10, 28 January 2022 (UTC)[reply]
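For illustration, a rough sketch of the two approaches discussed above (Pppery's redirect lookup, and the plain domain-name fallback that was preferred). It is written in Python with the requests library; the lookup uses the standard MediaWiki API, and the function names are illustrative rather than the bot's actual code:

# Sketch only, not the bot's code: derive a |website= value for a bare URL,
# trying a Wikipedia redirect lookup first and falling back to the domain name.
from urllib.parse import urlparse
import requests

def domain_of(url):
    """Host part of a URL, without a leading 'www.'."""
    host = urlparse(url).netloc.lower()
    return host[4:] if host.startswith("www.") else host

def website_via_redirect(domain):
    """Ask the MediaWiki API whether the domain name redirects to an article,
    e.g. 'irishtimes.com' -> 'The Irish Times'. Returns None if it does not."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={"action": "query", "titles": domain, "redirects": 1, "format": "json"},
        timeout=10,
    )
    redirects = resp.json().get("query", {}).get("redirects", [])
    return redirects[0]["to"] if redirects else None

def website_param(url):
    domain = domain_of(url)
    return website_via_redirect(domain) or domain

As noted above, a redirect lookup can mislead where one domain hosts more than one publication, which is one reason the plain domain name was preferred as the fallback.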

Most of my concerns have to do with dead link detection. This is turning out to be the distraction I predicted. There were only 3 articles with bare link and dead link edits: [27], [28], [29]. Running these as separate tasks will require 12% more edits and I don't think that's a big deal. I again request we disable dead link detection and marking and focus on filling bare links now.

Many of the links you linked are actually dead. And regarding the ones that weren't, I think it's using the data from when the script was more liberal with tagging a dead link (the code is now much stricter: 404s only). I said I will be adding more source code as we go along, with complete comments. Rlink2 (talk) 18:47, 28 January 2022 (UTC)[reply]
@Rlink2: you could have avoided a lot of drama by publishing all the bot's code, rather than just a useless fragment. I suggest that you do so without delay, ideally with sufficient completeness that another competent editor could use AWB to fully replicate the bot?
Yes, I will do this as soon as I finish my responses to these questions. Rlink2 (talk) 19:17, 28 January 2022 (UTC)[reply]
Also please note that on the 25th, I specifically requested[30] that the bot tag with {{dead link}} (dated) any bare URL refs which return a 404 error ... and you replied[31] just under 1 hour later to say Ok, I can definitely do that.
Now it seems that in your trial run, dead link tagging was not in fact restricted to 404 errors. I do not see any point in this discussion at which you disclosed that you would use some other basis for tagging a link as dead. Can you see how that undeclared change of scope undermines trust in the bot operator? BrownHairedGirl (talk) • (contribs) 19:15, 28 January 2022 (UTC)[reply]
Yeah, I said it was a bug from stale data from before I updated the script. I am sorry. I only realized after the fact. Rlink2 (talk) 19:17, 28 January 2022 (UTC)[reply]
The posted code was not useless. It helped me understand the project and I pointed out a few things that helped Rlink2 make small improvements.
I'm not upset about a gap between promises and performance on this trial because that is the observation that originally brought me into this. Rlink2 is clearly working in good faith; thank you! Progress has been made and we'll get there soon. ~Kvng (talk) 21:29, 28 January 2022 (UTC)[reply]
Thank you for the kind words. In response to BHG I do not see any point in this discussion at which you disclosed that you would use some other basis for tagging a link as dead. I said that when I was running the script on my main account I did use a wider basis for tagging links as dead. However, when we started the BRFA, we limited the scope to just 404 responses. What I should have done is run a new batch and use data with the dead link criteria listed in the BRFA, but I forgot to do so and used bare ref data collected from before the BRFA, hence why other conditions that indicated a dead link (but not a 404) were marked as dead links. I am so so sorry to have disappointed you. I will do better next time and be careful. The fix for this is to not place a "dead link" template for any of the old data, and only do it for the new data going forward, to make sure the scope is defined.
Can you see how that undeclared change of scope undermines trust in the bot operator? It was not my intent to be sneaky or try to bypass the scope.
The most important thing is what Kvng said. Progress has been made and we'll get there soon. Yes, there are always improvements to be made. Rlink2 (talk) 21:44, 28 January 2022 (UTC)[reply]
@Rlink2: thanks for the collaborative reply, but this is not yet resolved.
You appear to be saying that the bot relies on either 1) the list of articles which it is fed not including particular articles, or 2) on cached data from previous http requests to that URL.
Neither approach is safe. The code should make a fresh check of each URL for a 404 error, and apply the {{Dead link}} tag to that URL and only that URL.
  1. The list of pages which the bot processes should be irrelevant to its actions. Pre-selection is a great way of making the bot more efficient by avoiding having it skip thousands of pages where it has nothing to do. However, pre-selection is no substitute for code which ensures that the bot can accurately handle any page it processes.
  2. Caching HTTP requests for this task is a bad idea. It adds a further vector for errors, which are not limited to this instance of the cache remaining unflushed after a change of criteria. BrownHairedGirl (talk) • (contribs) 22:19, 28 January 2022 (UTC)[reply]
I've not had a chance to fully review the additional code Rlink2 has recently posted but a brief look shows that it uses a database of URLs which is apparently populated by a different process. That database should have been rebuilt for the trial and wasn't but there is nothing fundamentally wrong with this sort of two-stage approach to the problem. The list of pages is indeed relevant if this approach is used. ~Kvng (talk) 23:03, 28 January 2022 (UTC)[reply]
the list of articles which it is fed not including particular articles Well, it should work on either batches or individual articles.
Caching HTTP requests for this task is a bad idea. Despite my use of 'cache' as a variable name and the database, the way the script is supposed to work is to retrieve the titles, save them, and then retrieve them immediately after, which would constitute a "fresh check" while saving the title for further analysis. So there is one script that gets the title, and another that places it within the article. I released the code for the latter already, and will release the code for the former shortly. I did try to run the getter in advance for some of them (like now), but I won't do this anymore thanks to your feedback. Rlink2 (talk) 23:11, 28 January 2022 (UTC)[reply]
@Rlink2: my point did not relate to batches vs individual articles. It was about something different: that is, not relying on any pre-selection process.
As to the rest, I remain unclear about how the bot actually works. Posting all the code and AWB settings could resolve that.
@Kvng: the fundamental problem with the two stage approach is as I described above: that it creates extra opportunity for error, as happened in the trial run. BrownHairedGirl (talk) • (contribs) 00:04, 29 January 2022 (UTC)[reply]
I have posted the "getter" code at Wikipedia:Bots/Requests_for_approval/BareRefBot/Code2. If I missed something or something needs clarification let me know. I am a bit tired right now, and have been working all day on this, so it is entirely possible I forgot to explain something.
Again, the delay in releasing the code was to get it commented and cleaned up so you can understand it and be clear about how the bot actually works. Rlink2 (talk) 01:08, 29 January 2022 (UTC)[reply]
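For clarity, the "fresh check" being asked for amounts to something like the following (a minimal sketch in Python with the requests library, limited to hard 404 responses as agreed above; it is not the posted getter code):

# Sketch only: test a URL at edit time and report whether it should be tagged
# with {{dead link}}. Only a hard 404 counts; timeouts, certificate errors and
# other failures are ambiguous and leave the ref untouched.
import requests

def returns_404(url, timeout=15):
    try:
        resp = requests.get(url, timeout=timeout, allow_redirects=True,
                            headers={"User-Agent": "BareRefBot-sketch/0.1"})
    except requests.RequestException:
        return False  # ambiguous, do not tag
    return resp.status_code == 404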
  • @Rlink2: I have just begun assessing the trial and noticed two minor things.
  1. the bot is filling the cite templates with a space either side of the equals sign in each parameter, e.g. |website = Cricbuzz.
    That makes the template harder to read, because when the wikimarkup is word-wrapped in the edit window, the spaces can cause the parameter and value to be on different lines. Please can you omit those spaces, e.g. |website=Cricbuzz
  2. in some cases, parameter values are followed by more than one space. Please can you eliminate this by adding some extra code to process each template, replacing multiple successive whitespace characters with one space?
Thanks. --BrownHairedGirl (talk) • (contribs) 01:18, 29 January 2022 (UTC)[reply]
the bot is filling the cite templates with a space either side of the equals sign in each parameter Fixed, and reflected in posted source code.
parameter values are followed by more than one space Done, and reflected in posted source code. Rlink2 (talk) 01:22, 29 January 2022 (UTC)[reply]
Thanks, @Rlink2. That was quick! BrownHairedGirl (talk) • (contribs) 01:24, 29 January 2022 (UTC)[reply]
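For reference, the whitespace cleanup requested above can be done with two small substitutions (a sketch, not the bot's actual code):

# Sketch only: tidy the whitespace inside a cite template.
import re

def tidy_template_whitespace(template_text):
    # collapse runs of whitespace into a single space
    text = re.sub(r"\s{2,}", " ", template_text)
    # '|website = Cricbuzz' -> '|website=Cricbuzz'
    text = re.sub(r"\|\s*(\w+)\s*=\s*", r"|\1=", text)
    return text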
  • Comment Cite templates should only be added to articles that use that style of referencing, what are you doing to detect the referencing style and to keep ammended references in the article style? Keith D (talk) 21:43, 30 January 2022 (UTC)[reply]
    @Keith D: So you are saying that if an article is using references like [https://google.com google], then the bare ref <ref>https://duckduckgo.com</ref> should be converted to <ref>[https://duckduckgo.com Duckduckgo]</ref> style instead of the cite template? I can code that in. Rlink2 (talk) 21:50, 30 January 2022 (UTC)[reply]
    That is what I would expect in that case. Keith D (talk) 21:53, 30 January 2022 (UTC)[reply]
    @Keith D: I have added this in, but will have to update the source code posted here to reflect that. Rlink2 (talk) 15:54, 1 February 2022 (UTC)[reply]
    @Rlink2: Hang on. Please do not implement @Keith D's request.
    Citation bot always converts bracketed bare URLs to cite templates. I don't see why this bot should work differently.
    There are a few articles which deliberately use the bracketed style [https://google.com google], but they are very rare. The only cases I know of are the deaths by month series, e.g. Deaths in May 2020, which use the bracketed style because they have so many refs that cite templates slow them down. It would be much better to simply skip those pages, or apply the bracketed format to a defined set. BrownHairedGirl (talk) • (contribs) 17:21, 1 February 2022 (UTC)[reply]
    Citation bot should not be converting references to templates if that is not the citation style used in the article. It should be honouring the established style of the article. Keith D (talk) 17:37, 1 February 2022 (UTC)[reply]
    This is why {{article style}} exists, which only has 54 transclusions in 6 years! All we need is a new option for square-link-only, editors to use it, and bots to honor it. It's like CSS, a central mechanism to determine any style settings for a page. -- GreenC 18:06, 1 February 2022 (UTC)[reply]
    Citation templates radically improve the maintainability of refs, and ensure consistency of style. There are a very few cases where they are impractical due to the server load of hundreds of refs, but those pages are rare.
    In most cases where the square bracket refs dominate, it is simply because refs have been added by editors who don't know how to use the cite templates and/or don't like the extra work involved. We should be working to improve those other refs, not degrading the work of the bot. BrownHairedGirl (talk) • (contribs) 19:20, 1 February 2022 (UTC)[reply]
    See WP:CITEVAR which states "Editors should not attempt to change an article's established citation style merely on the grounds of personal preference, to make it match other articles, or without first seeking consensus for the change." Keith D (talk) 23:45, 1 February 2022 (UTC)[reply]
    It would be foolish to label the results of quick-and-dirty referencing as a "style". BrownHairedGirl (talk) • (contribs) 01:54, 2 February 2022 (UTC)[reply]
    As above, I would much prefer that the bot always use cite templates.
    But if it is going to try to follow the bracketed style where that is the established style, then please can it use a high threshold to determine the established style. I suggest that the threshold should be
    1. Minimum of 5 non-bare refs using the bracketed style (i.e. [http://example.com/foo Fubar] counts as bracketed, but [http://example.com/foo] doesn't)
    2. The bracketed, non-bare refs must be more than 50% of the inline refs on the page.
    I worry about the extra complexity this all adds, but if the bot is not going to use cite templates every time, then it needs to be careful not to use the bracketed format excessively. BrownHairedGirl (talk) • (contribs) 20:27, 2 February 2022 (UTC)[reply]
    As above, I would much prefer that the bot always use cite templates. As usual I have to agree with BHG here, if only to reduce bugs and complexity. The majority of articles are using citation templates anyway.

    While it is technically possible to implement BHG's criteria, it would cause extra complexity. For that reason I would prefer following BHG's advice and always using templates, but I am open to anything. Rlink2 (talk) 20:44, 2 February 2022 (UTC)[reply]
    There is currently no mechanism to inform automated tools what style to use. It's so uncommon not to use CS1|2 these days, as a conscious choice, that it should be the responsibility of the page to tell tools how to behave rather than engaging in error-prone and complex guesswork. I'm working on a solution to adapt {{article style}}, but it won't be ready before this BRFA closes. In the meantime, if you run into editors who remain CS1|2 holdouts (do they exist?) they will revert and we can come up with a simple and temporary solution to flag the bot, similar to how {{cbignore}} works - an empty template that does nothing; the bot just checks for its existence anywhere on the page and skips if so. -- GreenC 21:12, 2 February 2022 (UTC)[reply]
    99% of articles are using the citation templates. I agree with BHG, we want to avoid "scope creep" where most of the code is solving 1% of the problems.
    I personally don't have any skin in the citation game, but again, basically all of the articles are using them.
    In the mean time, if you run into editors who remain holdouts (do they exist?) they will revert and we can come up with a simple and temporary solution to flag the bot Yes. Rlink2 (talk) 15:40, 4 February 2022 (UTC)[reply]
    @Rlink2: that is a much better solution. I suspect that such cases will be very rare, much less than 1% of pages. BrownHairedGirl (talk) • (contribs) 02:32, 5 February 2022 (UTC)[reply]
    I have noticed recently through an article on my watchlist that BrownHairedGirl has manually been tagging the dead 404 refs herself. If she and others can focus on tagging all the dead refs, then we can take dead link tagging out of the bot. What do people here think? Rlink2 (talk) 14:20, 5 February 2022 (UTC)[reply]
    My tagging is very slow work. I have been doing some of it on an experimental basis, but that is no reason to remove the functionality from this bot. If this bot is processing the page, and already has the HTTP error code, then why not use it to tag? BrownHairedGirl (talk) • (contribs) 18:34, 5 February 2022 (UTC)[reply]

It's now over 7 days since the trial edits. @Rlink2: have you made list of what changes have been proposed, and which you have accepted?

I think that a review of that list would get us closer to a second trial. --BrownHairedGirl (talk) • (contribs) 20:47, 5 February 2022 (UTC)[reply]

Here are the big ones:
  • PDF tagging was excluded before the trial, and will continue to stay that way. There was no tagging of PDF refs during the 1st trial.
  • Previous to the trial, the consensus was for the bot to mark refs with the "dead link" template if and only if the link returned a "404" status code at the time of filling. If the link was not 404 but had issues (service unavailable, "cloudflare", generic redirect, invalid HTTPS certificate, etc...) the bare ref would simply be left alone at that moment. During the trial, several links that were not necessarily alive but did not return a 404 status error were marked with the "dead link" template, which was not the intended goal. The first change was to make sure the 404 detection was working properly and didn't cache the inaccurate data. Other than the marking of these links, the bot will do nothing regarding dead links in references or archiving, "broadly construed".
  • There was a proposal to use bracketed refs when converting from bare to non-bare in articles that predominantly used bracketed refs, but there was no consensus to implement this. Editors pointed out that "bracketed ref" articles are very rare and usually special cases. In cases like this, the editors of the article make it clear that citation templates are not to be used, and use bot exclusions, so the bot wouldn't have even processed those articles. GreenC pointed out that a template to indicate the citation style of the article exists, but it only has 54 transclusions, and other editors expanded on this by explaining that it would be difficult for a bot to determine the citation style of an article.
  • BrownHairedGirl pointed out two minor nitpicks regarding spacing of parameters, which was fixed.
  • There was some discussion about the possibility of WP:PEACOCK titles, but I explained that such instances are rare, and trying to get a bot to understand what a "peacock" title even is would be difficult. The people who brought this up seemed to be satisfied with my answer, and so there was no consensus to do anything regarding this.
  • There was some argument over what to do regarding the website parameter. The bot is able to extract a proper website parameter and split the website and title parameters for some but not all websites. There was some debate over how far the bot could go regarding the website parameter, but I expressed a need to "play it safe" and not dwell too much on this aspect since we are dealing with unstructured data. There was consensus that if the bot could not extract the website name, it should just use the domain name for the website parameter (e.g. {{cite web | title = Search Results | website=duckduckgo.com}} instead of {{cite web | title = Search Results }}) so the resulting ref still has important info about the website being cited. This change has been made. Rlink2 (talk) 21:58, 5 February 2022 (UTC)[reply]
Many thanks, @Rlink2, for that prompt and detailed reply. It seems to me to be a good summary of where we have got to.
It seems to me that on that basis the bot should proceed to a second trial run, to test whether the changes resolve the concerns raised by the first trial. @Primefac, what do you think? Are we ready for that? BrownHairedGirl (talk) • (contribs) 23:13, 5 February 2022 (UTC)[reply]
Small update to this: the bot now catches 410 "gone" status codes, as explained above. 410 is basically a less-used way to indicate that the content is no longer available. The number of sites using 410 status codes to indicate a dead link is not large, but there are some, so it has been implemented in the bot. Rlink2 (talk) 21:09, 8 February 2022 (UTC)[reply]
Thanks, @Rlink2. After a long batch of checks, I now estimate that about 0.5% of pages with bare URLs have one or more bare URLs which return a 410 error. That suggests that there are about 1,300 such bare URLs to be tagged as {{dead link}}s, so this addition will be v helpful. BrownHairedGirl (talk) • (contribs) 00:29, 9 February 2022 (UTC)[reply]

Trial 2

Approved for extended trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Sorry for the delay here, second trial looks good. Primefac (talk) 14:35, 13 February 2022 (UTC)[reply]

Trial complete. Adding for the record Primefac (talk) 14:46, 21 March 2022 (UTC)[reply]
@Primefac: @BrownHairedGirl:
Diffs can be found here: https://en.wikipedia.org/w/index.php?target=Rlink2&namespace=all&tagfilter=&start=2022-02-16&end=2022-02-16&limit=50&title=Special%3AContributions

The articles skipped were either PDF URLs or dead URLs that did not return 404 or 410 (for example: expired domain, connection timeout). One site had some strange website misconfiguration so it didn't work in Chrome, Safari, Pale Moon, Seamonkey, or Firefox. (I could only view it in some obscure browser.) As agreed by consensus, the bot will not touch these non-404/410 dead links, and it did not during the 2nd trial.

I think there was also a non-Wayback Archive.org URL (as you know, archive.org has more than just archived webpages; they have books and scans of documents as well), along with a bare ref with the "Webarchive" template right next to it. As part of "broadly construed" these were not filled. The number of archive bare refs is small, I think, so this should not be an issue.

The rest of the sites skipped had junk titles (like "please wait ....." or "403 forbidden")

As requested, the website parameter was added when the "natural" name of the website could not be determined and the website name was not in the title. There was extra care taken to avoid a situation where there is a cite like
{{Cite web | title = Search results {{!}} duckduckgo.com | website=www.duckduckgo.com }}
which would look like
"Search results | duckduckgo.com". www.duckduckgo.com. Rlink2 (talk) 04:18, 16 February 2022 (UTC)[reply]
Thanks, @Rlink2.
The list of exactly 50 Trial2 edits can also be found at https://en.wikipedia.org/w/index.php?title=Special:Contributions&dir=prev&offset=20220214042814&target=Rlink2&namespace=0&tagfilter=AWB BrownHairedGirl (talk) • (contribs) 05:09, 16 February 2022 (UTC)[reply]
Yes, this time I tried to make it exactly 50 for preciseness and to avoid drama. Rlink2 (talk) 15:26, 16 February 2022 (UTC)[reply]
  • Big problem. I just checked the first 6 diffs. One of them is a correctly tagged dead link, but in the other 5 cases ([32], [33], [34], [35], [36]) there is no |website= parameter. Instead, the website name is appended to the title.
    This is not what was agreed after Trial1 (and summarised here[37] by @Rlink2) ... so please revert all trial2 edits which filled a ref without adding the website field. BrownHairedGirl (talk) • (contribs) 05:22, 16 February 2022 (UTC)[reply]
    I see that some edits — [38], [39] — did add a website field, filling it with the domain name.
    It appears that what has been happening is that when the bot figures out the website name, it is wrongly appending that to the title, rather than the correct step of placing it in the |website= parameter. BrownHairedGirl (talk) • (contribs) 05:41, 16 February 2022 (UTC)[reply]
    Hi @BrownHairedGirl:
    The reason the website parameter was not added in those diffs is because the website name is in the title (for example, the NYT link has "New York Times" right within the title; you can check for yourself in your browser). The bot did not modify or change the title to add the website name; if it could extract the website name it would have been added to the "website=" parameter, as we have agreed to do.
    There are three possibilities:
    • The website name can be extracted from the website, hence there is no need to use the domain name for the website parameter, since a more accurate name is available. An example of this would be:
    "Article Story". New York Times.
    • The bot could detect that the website name is included in the title, but for some reason could not extract it. As stated before, extracting the website name from a title can be difficult sometimes, so even if it is able to detect the website name is included, it may not be able to get a value suitable for the "website=" parameter. In this case, adding a website parameter would look like:
    "The Battle of Tripoli-Versailles - The Green-Brown Times". www.thegreenbrowntimes.com.
    in which case the website parameter is just repeating information so the bot just did a cite like this instead:
    "The Battle of Tripoli-Versailles - The Green-Brown Times".
    • The bot could not detect the website name and so added the website parameter with the domain name (as evidenced by the additional diffs you provided above). The cite would look like this:
    "Search results". www.duckduckgo.com. Rlink2 (talk) 15:25, 16 February 2022 (UTC)[reply]
    @Rlink2, I think you are over-complicating something quite simple, which I thought had been clearly agreed: the |website= parameter should always be present and filled. The points above should determine what its value is, but it should never be omitted. BrownHairedGirl (talk) • (contribs) 16:04, 16 February 2022 (UTC)[reply]
    @BrownHairedGirl:
    The reasoning behind adding the "website=" parameter was to make sure the name of the website is always present in the citation. In the first comment where you asked for the website param, the example title did not have the website name, so in that case it was clear that the website parameter should be added. In addition, in the "website=" example I gave in my final list before we started Trial 2, the website name was not included in the title. In the citations where the bot did not add the "website=" parameter, the name of the website was still present.

    Personally, I am fine with following your advice and always including the website parameter, even if the website name is in the title. However, I feared it could have caused anger amongst some factions of the citation game who would claim that the bot was "bloating" refs with possibly redundant info, so this was done to keep them happy. Rlink2 (talk) 18:19, 16 February 2022 (UTC)[reply]
    @Rlink2: the name of the work in which the article is included is always a key fact in any reference. If it is available in any form, it should be included as a separate field ... and for URLs, it is always available in some form, even if only as a domain name. The "separate field" issue is crucial, because the whole aim of cite templates is to provide consistently structured data rather than unstructured text of the form [http://exmple.com/foo More foo in Ballybeg next year -- Daily Example 39th of March 2031]
    If there is any bloating, it is the addition of the site name to the title, where it doesn't belong. If you can reliably remove any such redundancy from the title, then great ... but I don't think you will satisfy anyone at all by dumping all the data into the |title= parameter.
    I am a bit concerned by this, because it doesn't give me confidence that you fully grasp what citation templates are for. They are about consistently structured data, and issues of redundancy are secondary to that core purpose. BrownHairedGirl (talk) • (contribs) 18:37, 16 February 2022 (UTC)[reply]
    @BrownHairedGirl:
    the name of the work in which the article is included is always a key fact in any reference. If it is available in any form, it should be included as a separate field ... and for URLs, it is always available in some form, even if only as a domain name. Ok.
    If you can reliably remove any such redundancy from the title, then great I was actually about to suggest this idea in my first reply, because the bot should be able to reliably remove website titles if that is what is desired. That way we have something like
    {{Cite web | title = Article Title | website=nytimes.com}}
    instead of
    {{Cite web | title = Article Title {{!}} The New York Times }}
    or
    {{Cite web | title = Article Title {{!}} The New York Times | website=nytimes.com }}
    I am a bit concerned by this, because it doesn't give me confidence that you fully grasp what citation templates are for. They are about consistently structured data, and issues of redundancy are secondary to that core purpose. You'd be right, I know relatively little about citation templates compared to people like you, who have been editing even before the citation templates were created, but I am learning as time goes on. Thanks for telling me all this, I really appreciate it. Rlink2 (talk) 18:57, 16 February 2022 (UTC)[reply]
    @Rlink2; thanks for the long reply, but we are still not there. Please do NOT remove website names entirely.
    The ideal output is to have the name of the website in the website field. If that isn't possible, use the domain name.
    If you can determine the website's name with enough reliability to strip it from the |title= parameter, don't just dump the info -- use it in the website field, 'cos it's better than the domain name.
    And if you are not sure, then some redundancy is better than omission.
    Taking your examples above:
    1. {{Cite web | title = Article Title | website=nytimes.com}}
      bad: you had the website's name, but dumped it
    2. {{Cite web | title = Article Title {{!}} The New York Times }}
      bad: no website field
    3. {{Cite web | title = Article Title {{!}} The New York Times | website=nytimes.com }}
      not ideal, but least worst of these three
    In this case, the best would be {{Cite web | title = Article Title |website= The New York Times}}
    I think it might help if I set out in pseudocode what's needed:
VAR thisURL = "http://exmple.com/fubar"
VAR domainName = FunctionGetDomainNamefromURL(thisURL)
VAR articleTitle = FunctionGetTitleFromURL(thisURL)
// start by setting default value for websiteParam 
VAR websiteParam = domainName // e.g. "magicseaweed.com"
// now see if we can get a website name
VAR foundWebsiteName = FunctionToFindWebsiteNameAndDoAsanityCheck()
IF foundWebsiteName  IS NOT BLANK // e.g. "Magic Seaweed" for https://magicseaweed.com/ 
     THEN BEGIN
         websiteParam = foundWebsiteName
         IF articleTitle INCLUDES foundWebsiteName
            THEN BEGIN
                VAR trimmedArticleTitle = articleTitle - foundWebsiteName
                IF trimmedArticleTitle IS NOT BLANK OR CRAP
                    THEN articleTitle = trimmedArticleTitle
                ENDIF 
             END
         ENDIF
     END
ENDIF
FunctionMakeCiteTemplate(thisURL, articleTitle, websiteParam)
  • Hope this helps BrownHairedGirl (talk) • (contribs) 20:25, 16 February 2022 (UTC)[reply]
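Rendered as Python, the logic above might look roughly like this (a sketch only; the title and website-name lookups are assumed to exist elsewhere in the bot, and the 5-character guard is just an illustrative threshold):

# Rough Python rendering of the pseudocode above, not the bot's actual code.
from urllib.parse import urlparse

def strip_www(host):
    return host[4:] if host.startswith("www.") else host

def build_cite(url, article_title, website_name=None):
    """Default |website= to the domain name, upgrade it to the site's real name
    when one was found, and trim that name from the title only if something
    meaningful is left over."""
    website_param = strip_www(urlparse(url).netloc)   # default: domain name
    if website_name:                                  # e.g. "Magic Seaweed"
        website_param = website_name
        if website_name in article_title:
            trimmed = article_title.replace(website_name, "").strip(" -|")
            if len(trimmed) >= 5:                     # avoid leaving a blank or junk title
                article_title = trimmed
    return "{{cite web |url=%s |title=%s |website=%s}}" % (url, article_title, website_param)

# build_cite("https://magicseaweed.com/spot", "Surf Report - Magic Seaweed", "Magic Seaweed")
# -> "{{cite web |url=https://magicseaweed.com/spot |title=Surf Report |website=Magic Seaweed}}"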
    @BrownHairedGirl: Ok, this makes sense. I will keep this in mind from here on out. So the website parameter will always be present from now on. Rlink2 (talk) 23:28, 16 February 2022 (UTC)[reply]
    @Rlink2: I was hoping that rather than just keep this in mind, you'd be telling us that the code had been restructured on that basis, and that the revised code had been uploaded. BrownHairedGirl (talk) • (contribs) 13:40, 19 February 2022 (UTC)[reply]
    @BrownHairedGirl: Yes, precise language is not my strong suit ;)
    Done, and reflected in the source code (all the other bug fixes, like the 410 addition, should also be uploaded now as well). So now, if the website name cannot be extracted or is not present, the domain name will always be used instead.
    And if you are not sure, then some redundancy is better than omission. I agree. Rlink2 (talk) 14:16, 19 February 2022 (UTC)[reply]
    Ok, it's been some time, and this is the only issue that has been brought up (and has been fixed). Should we have one more trial? Rlink2 (talk) 13:56, 22 February 2022 (UTC)[reply]
    @Rlink2: where is the revised code? BrownHairedGirl (talk) • (contribs) 10:01, 23 February 2022 (UTC)[reply]
    @BrownHairedGirl: Code can be found at the same place, Wikipedia:Bots/Requests_for_approval/BareRefBot/Code Rlink2 (talk) 12:48, 23 February 2022 (UTC)[reply]
    @Rlink2: code dated // 2.0 - 2022 Febuary 27.
    Some time-travelling? BrownHairedGirl (talk) • (contribs) 13:39, 23 February 2022 (UTC)[reply]
    @BrownHairedGirl: LOL, I meant 17th. Thank you ;) Rlink2 (talk) 13:44, 23 February 2022 (UTC)[reply]
    @Rlink2, no prob. Tipos happon tu us oll.
    I haven't fully analysed the revised code, but I did look over it. In principle it looks like it's taking a sound approach.
    I think that trial of this new code would be a good idea, and also that this trial should be of a bigger set (say 250 or 500 edits) to test a wider variety of cases. Some webmasters do really weird stuff with their sites. BrownHairedGirl (talk) • (contribs) 20:12, 23 February 2022 (UTC)[reply]
  • Problem2. In the edits which tagged link as dead (e.g. [40], [41]), the tag added is {{Dead link|bot=bareref|date=February 2022}}.
    This is wrong. The bot's name is BareRefBot, so the tag should be {{Dead link|bot=BareRefBot|date=February 2022}}. BrownHairedGirl (talk) • (contribs) 05:33, 16 February 2022 (UTC)[reply]
    I have fixed this. Rlink2 (talk) 15:26, 16 February 2022 (UTC)[reply]
  • I have not checked either trial to see if this issue has arisen, but domain reselling pages and similar should not be populated; instead the links should be marked as dead, as they need human review to find a suitable archive or new location. AFAIK there is no reliable way to automatically determine whether a page is a domain reseller or not, but the following strings are common examples:
    • This website is for sale
    • Deze website is te koop
    • HugeDomains.com
    • Denna sida är till salu
    • available at DomainMarket.com
    • 主婦が消費者金融に対して思う事
  • In addition, the following indicate errors and should be treated as such (I'd guess leaving the bare URL is going to be the best option):
    • page not found
    • ACTUAL ARTICLE TITLE BELONGS HERE
    • Website disabled
  • The string "for sale!" is frequently found in the titles of domain reselling pages and other unsuitable links, but there might be some false positives? If someone has the time (I don't atm) and desire it would be useful to see what the proportion is to determine whether it's better to skip them as more likely unsuitable or accept that we'll get a few unsuitable links alongside many more good ones. In all cases your code should allow the easy addition or removal of strings from each category as they are detected. Thryduulf (talk) 11:44, 23 February 2022 (UTC)[reply]
    @Thryduulf: Thank you for the feedback. I already did this (as in, detect domain-for-sale titles). Anything with "for sale" in it is usually a junk title, and it is better to skip the ref for later than to fill it with a bad title. Rlink2 (talk) 12:45, 23 February 2022 (UTC)[reply]
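A sketch of the kind of filter being discussed, using some of the marker strings listed above (illustrative only; the bot's real list lives in its own code):

# Sketch only: skip filling a ref when the fetched page title looks like a
# domain-reseller page, an error page, or other boilerplate.
JUNK_MARKERS = [
    "this website is for sale",
    "deze website is te koop",
    "hugedomains.com",
    "denna sida är till salu",
    "available at domainmarket.com",
    "主婦が消費者金融に対して思う事",
    "page not found",
    "actual article title belongs here",
    "website disabled",
    "for sale!",
    "404",
    "403 forbidden",
    "please wait",
]

def looks_like_junk_title(title):
    if title is None or len(title.strip()) < 5:
        return True
    lowered = title.lower()
    return any(marker in lowered for marker in JUNK_MARKERS)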
    This approach seems sound, but there will always be unexpected edge cases. I suggest that the bot's first few thousand edits be run at a slow pace on a random sample of articles, to facilitate checking.
    It would also be a good idea to
    1. not follow redirected URLs. That facility is widely abused by webmasters, and can lead to very messy outcomes
    2. maintain a blacklist of usurped domains, to accommodate cases which evade the filters above.
    Hope that helps. BrownHairedGirl (talk) • (contribs) 20:18, 23 February 2022 (UTC)[reply]
    @BrownHairedGirl: I suggest that the bot's first few thousand edits be run at a slow pace on a random sample of articles, to facilitate checking. Yes, this is a good idea. While filling out bare refs manually with AWB I saw first hand many of the edge cases and "gotchas", so more checking is always a good thing.
    not follow redirected URLs. This could actually be a good idea. I don't know the data on how many URLs are redirects and how many of those are valid, but there are many dead links that use a redirect to the front page instead of throwing a 404. There can be an exception placed for redirects that just go from HTTP to HTTPS (since that usually does not indicate a change or removal of content). Again, I will have to do some data collection and see if this approach is feasible, but it looks like a good idea that will work.
    maintain a blacklist of usurped domains I already have a list of "blacklisted" domains that will not be filled, yes this is a good idea. Rlink2 (talk) 19:39, 24 February 2022 (UTC)[reply]
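The redirect exception mentioned above could be expressed along these lines (a sketch with the requests library; not the bot's code):

# Sketch only: treat a redirect as a possible soft 404 and skip the ref,
# unless it is merely an HTTP -> HTTPS upgrade of the same address.
from urllib.parse import urlparse
import requests

def harmless_upgrade(original_url, final_url):
    orig, final = urlparse(original_url), urlparse(final_url)
    return ((orig.scheme, final.scheme) == ("http", "https")
            and orig.netloc.lower() == final.netloc.lower()
            and orig.path == final.path)

def skip_because_of_redirect(url):
    resp = requests.get(url, allow_redirects=True, timeout=15)
    if resp.history:  # at least one redirect happened
        return not harmless_upgrade(url, resp.url)
    return False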
    When it comes to soft 404 string detection, they are all edge cases. There is near infinite variety. For example there are titles in foreign languages: "archivo no encontrado|pagina non trovata|página não encontrada|erreur 404|något saknas" .. it goes on and on and on.. -- GreenC 21:40, 24 February 2022 (UTC)[reply]
    @GreenC: well the number "404" is in there for one of them, which would be filtered. Of course there will always be an infinite variety but we can get 99.9% of them. During my run the only soft "404"s I remember seeing after my already existing filtering were fake redirects to the same page (discussed above). Rlink2 (talk) 22:05, 24 February 2022 (UTC)[reply]
    Well, I've been working on a soft 404 detector for over 4 years as a sponsored employee of Internet Archive and at best I can get 85%. That's after years of effort finding strings to filter on. There is a wall at that point because the last 15% are all mostly unique cases, one offs, so you can't really predict them. I appreciate you strive for 99% but nobody gets that. The best soft 404 filter in existence is made by Google and I don't think they get 99%. There are academic papers on this topic, AI programs, etc.. I wish you luck, please appreciate the problem, it's non-trivial. -- GreenC 23:11, 24 February 2022 (UTC)[reply]
    @GreenC:
    Yes, I agree that soft 404 detection is a very difficult problem. However, in this case, we may not even need to solve it.
    So I'm guessing it's 85 percent of 99%. Let's just say, because of my relative lack of experience, my script is 75% or even 65%. So out of all the "soft 404s" (of which there are not many when it comes to Wikipedia bare refs, which is the purpose of the bot) it can still get a good chunk.
    The soft 404s I've seen are things like redirects to the same page. Now some redirects could be legitimate, and some could not be. That's a hard problem to figure out, like you said. But we know that if there is a redirect, there may or may not be a soft 404, hence we avoid the problem of detection by just leaving it alone at that moment.
    Another example could be when multiple pages have the same title. There is a possibility at that moment of a soft 404, or maybe not. But if we avoid doing anything under this circumstance at all, we don't have to worry about "detecting" a soft 404.
    It's kinda like asking "what is the hottest place to live in Antarctica" and the answer being "Let's avoid the place all together, we'll deal with Africa or South America". Not a perfect analogy, but you get the point.
    The only thing that I have no idea how to deal with is foreign language 404s, but again, there are not too many of them.
    My usage of "99%" was not literal, it was was an exaggeration ("allteration"). Nothing will even come close to 100% because there are an infinite amount of websites with an endless amount of configurations and stuff. It is impossible to plan out for all those websites, but at the same time those types of websites are rare. Rlink2 (talk) 05:20, 26 February 2022 (UTC)[reply]
    User:Rlink2: Some domains have few to none, others have very high rates like as much as 50% (looking at you ibm.com ugh). What constitutes a soft-404 can itself be difficult to determine because the landing page may have relevant content but is not the same as original only detectable by comparing with the archive URL. One method: determine the date the URL was added to the wiki page. Examine the archive URL for the date, and use the title from there. That's what I would do if writing a title bot. All URLs eventually in time revert to 404 or soft-404 so getting a snapshot close to the time it was added to wiki will be the most reliable data. -- GreenC 15:19, 2 March 2022 (UTC)[reply]
    "determine the date the URL was added to the wiki page. Examine the archive URL for the date, and use the title from there.". This is actually a good idea, I think I thought this once actually but forgot, thanks for telling (or reminding) me.

    However, as part of "broadly construed" I don't want the bot to do anything with archive sites; it would create unnecessary drama that would take away from the goal of filling bare refs. Also, the website could have changed the title to be more descriptive, or maybe the content moved, so the archived title may not be the best one all of the time. Maybe if there is some mismatch between the archive title and the current URL title, it should be a signal to leave the ref alone for the moment.

    If any site in particular has high soft 404 rates, we will simply blacklist it and the bot will not fill any refs from those domains. Rlink2 (talk) 16:18, 2 March 2022 (UTC)[reply]
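For completeness, the method GreenC describes above (looking up an archived snapshot near the date the URL was added, which the bot does not adopt here) could be queried along these lines, as a sketch using the Wayback Machine availability API:

# Sketch only: find the archived snapshot closest to a given date so that the
# title could be read from the page as it was at that time.
import requests

def closest_snapshot(url, timestamp="20220101"):
    resp = requests.get(
        "https://archive.org/wayback/available",
        params={"url": url, "timestamp": timestamp},  # timestamp is YYYYMMDD
        timeout=15,
    )
    closest = resp.json().get("archived_snapshots", {}).get("closest", {})
    return closest.get("url") if closest.get("available") else None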
    And regarding foreign titles, there is a very small number of them in my runs. At most I saw 10 of them during my 50,000+ bare ref edit run. Rlink2 (talk) 22:50, 24 February 2022 (UTC)[reply]
    Are you saying foreign language websites account for about 10 out of every 50k? -- GreenC 23:24, 24 February 2022 (UTC)[reply]
    Actually, maybe there were like 50 articles with foreign-language titles, but I can only remember like 5 or 10 of them. I filtered out some of the Cyrillic characters since they were creating cite errors due to the way the script handled them, so the actual amount the bot has to decide on is less than that. Rlink2 (talk) 05:22, 26 February 2022 (UTC)[reply]

@Rlink2 and Primefac: it is now 4 weeks since the second trial, and Rlink2 has resolved all the issues raised. Isn't it time for a third trial? I suggest that this trial should be bigger, say 250 edits, to give a higher chance of detecting edge cases. --BrownHairedGirl (talk) • (contribs) 23:14, 12 March 2022 (UTC)[reply]

@BrownHairedGirl, Yes, I think its time. Rlink2 (talk) 02:33, 13 March 2022 (UTC)[reply]

BareRefBot as a secondary tool

I would like to ask that BareRefBot be run as a secondary tool, i.e. that it should be targeted as far as possible to work on refs where the more polished Citation bot has tried and failed.

This is a big issue which I should probably have raised at the start. The URLs-that-Citation-bot-cannot-fill are why I have been so keen to get BareRefBot working, and I should have explained this in full earlier on. Pinging the other contributors to this BRFA: @Rlink2, Primefac, GreenC, ProcrastinatingReader, Kvng, Levivich, Pppery, 1234qwer1234qwer4, and Thryduulf, whose input on this proposal would be helpful.

I propose this because on the links which Citation bot can handle, it does a very thorough job. It uses the zotero servers to extract a lot of metadata such as date and author which BareRefBot cannot get, and it has a large and well-developed set of lookups to fix issues with individual sites, such as using {{cite news}} or {{cite journal}} when appropriate. It also has well-developed lookup tables for converting domain names to work titles.

So ideally, all bare URLs would be filled by the well-polished Citation bot. Unfortunately, there are many websites which Citation bot cannot fill, because the zotero provides no data. Other tools such as WP:REFLINKS and WP:REFILL often can handle those URLs, but none of them works in batch mode and individual editors cannot do the manual work fast enough to keep up with Citation bot's omissions.

The USP of BareRefBot is that thanks to Rlink2's cunning programming, it can do this followup work in batch mode, and that is where it should be targeted. That way we get the best of both worlds: Citation bot does a polished job if it can, and BareRefBot does the best it can with the rest.

I am systematically feeding Citation bot with long lists of articles with bare URLs, in two sets:

  1. User:BrownHairedGirl/Articles with new bare URL refs, consisting of the Articles with bare URL refs (ABURs) which were in the latest database dump but not in the previous dump. The 20220220 dump had 4,904 new ABURs, of which 4,518 still had bare URLs.
  2. User:BrownHairedGirl/Articles with bare links, consisting of articles not part of my Citation bot lists since a cutoff date. The bot is currently about halfway through a set of 33,239 articles which Citation bot had not processed since 1 December 2021.

If BareRefBot is targeted at these lists after Citation bot has done them, we get the best of both worlds. Currently, these lists are easily accessed: all my use of Citation bot is publicly logged in the pages linked, and I will happily email Rlink2 copies of the full (unsplit) lists if that is more convenient. If I get run over by a bus or otherwise stop feeding Citation bot, then it would be simple for Rlink2 or anyone else to take over the work of first feeding Citation bot.

What do others think? --BrownHairedGirl (talk) • (contribs) 11:25, 2 March 2022 (UTC)[reply]

Here is an example of what I propose.
Matt Wieters is page #2178 in my list Not processed since 1 December - part 6 of 11 (2,847 pages), which is currently being processed by Citation bot.
Citation bot edited the article at 11:26, 2 March 2022, but it didn't fill any bare URL refs. I followed up by using WP:REFLINKS to fill the 1 bare URL ref, in this edit.
That followup is what I propose that BareRefBot should do. BrownHairedGirl (talk) • (contribs) 11:42, 2 March 2022 (UTC)[reply]
I think first and foremost you should look both ways before crossing the road so you don't get run over by a bus. :-D It strikes me as more efficient to have BRB follow CB as suggested. I don't see any downside. Levivich 19:28, 2 March 2022 (UTC)[reply]
@BrownHairedGirl
This makes sense; I think that citation bot is better at filling out refs completely. One thing that would be interesting to know is whether Citation Bot can improve already-filled refs. For example, let's say we have a source that citation bot can get the author, title, name, and date for, but BareRefBot can only get the title. If BareRefBot only fills in the title, and citation bot comes after it, would citation bot fill in the rest?
and it has a large and well-developed set of lookups to fix issues with individual sites, such as using cite news or cite journal when appropriate. I agree.
It uses the zotero servers to extract a lot of metadata such as date and author which BareRefBot cannot get, and it has a large and well-developed set of lookups to fix issues with individual sites Correct.
It also has well-developed lookup tables for converting domain names to work titles. Yes, do note that list could be ported to Bare Ref Bot (list can be found here)
That way we get the best of both worlds: Citation bot does a polished job if it can, and BareRefBot does the best it can with the rest. I agree. Let's see what others have to say Rlink2 (talk) 19:38, 2 March 2022 (UTC)[reply]
Glad we agree in principle, @Rlink2. You raise some useful questions:
One thing that would be intresting to know is if Citation Bot can improve already filled refs.
yes, it can and does. But I don't think it overwrites all existing data, which is why I think it's better to give it the first pass.
For example, let's say we have a source that citation bot can get the author, title, name, and date for, but BareRefBot can only get the title. If BareRefBot only fills in the title, and citation bot comes after it, would citation bot fill in the rest?
If an existing cite has only |title= filled, Citation Bot often adds many other parameters (see e.g. [42]).
However, I thought we had agreed that BareRefBot was always going to add and fill a |website= parameter?
My concern is mostly with the |title=. Citation Bot does quite a good job of stripping extraneous stuff from the title when it fills a bare ref, but I don't think that it re-processes an existing title. So I think it's best to give Citation Bot the first pass at filling the title.
Hope that helps. Maybe CB's maintainer AManWithNoPlan can check my evaluation and let us know if I have misunderstood anything about how Citation Bot handles partially-filled refs. BrownHairedGirl (talk) • (contribs) 20:27, 2 March 2022 (UTC)[reply]
I think you are correct. Citation bot relies mostly on the wikipedia zotero - there are a few places where we go beyond zotero: IEEE might be the only one. A big thing that the bot does is extensive error checking (bad dates, authors of "check the rss feed" and such). Also, it almost never overwrites existing data. AManWithNoPlan (talk) 20:35, 2 March 2022 (UTC)[reply]
Many thanks to @AManWithNoPlan for that prompt and helpful clarification. --BrownHairedGirl (talk) • (contribs) 20:51, 2 March 2022 (UTC)[reply]
@BrownHairedGirl @AManWithNoPlan
But I don't think it overwrites all existing data, which is why I think it's better to give it the first pass. Yeah, I think John raised this point at the Citation Bot talk page, and AManWithNoPlan has said above that it can add new info but not overwrite the old.
However, I thought we had agreed that BareRefBot was always going to add and fill a Yes, this hasn't changed. I forgot to say "title and website", while Citation Bot can get author, title, website, date, etc.
So I think it's best to give Citation Bot the first pass at filling the title. This makes sense.
Citation Bot does quite a good job of stripping extraneous stuff from the title when it fills a bare ref, I agree. Maybe AManWithNoPlan could share the techniques used so they can be ported to BareRefBot? Or is the stripping done on the Zotero servers? He would have more information regarding this.
I also have a question about the turnaround of the list making process. How long does it usually take for Citation Bot to finish a batch of articles? Rlink2 (talk) 20:43, 2 March 2022 (UTC)[reply]
See https://en.wikipedia.org/api/rest_v1/#/Citation/getCitation and https://github.com/ms609/citation-bot/blob/master/Zotero.php it has list of NO_DATE_WEBITES, tidy_date function, etc. AManWithNoPlan (talk) 20:45, 2 March 2022 (UTC)[reply]
@Rlink2: Citation Bot processes my lists of ABURs at a rate of about 3,000 articles per day. There's quite a lot of variation in that (e.g. big lists are slooow, wee stubs are fast), but 3k/day is a good ballpark.
The 20220301 database dump contains 155K ABURs, so we are looking at ~50 days to process the backlog. BrownHairedGirl (talk) • (contribs) 20:47, 2 March 2022 (UTC)[reply]
@BrownHairedGirl
So every 50 days there will be a new list, or you will break the list up into pieces and give the list of articles citation bot did not fix to me incrementally? Rlink2 (talk) 21:01, 2 March 2022 (UTC)[reply]
@Rlink2: it's in batches of up to 2,850 pages, which is the limit for Citation Bot batches.
See my job list pages: User:BrownHairedGirl/Articles with bare links and User:BrownHairedGirl/Articles with new bare URL refs. I can email you the lists as they are done, usually about one per day. BrownHairedGirl (talk) • (contribs) 21:27, 2 March 2022 (UTC)[reply]
  • Duh @me.
@Rlink2, I just realised that in order to follow Citation Bot, BareRefBot's worklist does not need to be built solely off my worklists.
Citation Bot has 4 channels, so my lists comprise only about a quarter of Citation Bot's work. The other edits are done on behalf of other editors, both as batch jobs and as individual requests. Most editors do not publish their work lists like I do, but Citation Bot's contribs list is a record of the pages which the bot edited on their behalf, so it is a partial job list (obviously, it does not include pages which Citation bot processed but did not edit).
https://en.wikiscan.org/user/Citation%20bot shows the bot averaging ~2,500 edits per day. So if BareRefBot grabs, say, the last 10,000 edits by Citation Bot, that will usually amount to about four days' work by CB, which would be a good list to work on. Most editors do not choose their Citation bot jobs on the basis of bare URLs, so the incidence of bare URLs in those lists will be low ... but any bare URLs which are there will have been recently processed by Citation Bot.
Also, I don't see any problem with BareRefBot doing a run in which the bot does no filling, but just applies {{Bare URL PDF}} where appropriate. A crude search shows that there are currently over 30,000 such refs to be tagged, which should keep the bot busy for a few days: just disable filling, and let it run in tagging mode.
Hope this helps. BrownHairedGirl (talk) • (contribs) 21:20, 4 March 2022 (UTC)[reply]
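A worklist of pages recently edited by Citation bot could be pulled from the API along these lines (a sketch; the limits and continuation handling shown are standard MediaWiki API behaviour, not BareRefBot's actual code):

# Sketch only: collect the titles of mainspace pages recently edited by
# Citation bot, to use as a follow-up worklist for BareRefBot.
import requests

def recent_citation_bot_pages(limit=10000):
    session = requests.Session()
    params = {
        "action": "query",
        "list": "usercontribs",
        "ucuser": "Citation bot",
        "ucnamespace": 0,     # mainspace only
        "uclimit": 500,       # per-request maximum for ordinary accounts
        "format": "json",
    }
    titles, cont = [], {}
    while len(titles) < limit:
        data = session.get("https://en.wikipedia.org/w/api.php",
                           params={**params, **cont}, timeout=30).json()
        titles += [c["title"] for c in data["query"]["usercontribs"]]
        if "continue" not in data:
            break
        cont = data["continue"]
    return list(dict.fromkeys(titles))[:limit]   # de-duplicate, keep order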
@BrownHairedGirl:
BareRefBot's worklist does not need to be built solely off my worklists. Oh yes, I forgot about the contribution list as well.
So if BareRefBot grab says the last 10,000 edits by Citation Bot, that will usually amount to about four days work by CB, which would be a good list to work on. I agree.
Most editors do not choose their Citation bot jobs on the basis of bare URLs, so the incidence of bare URLs in those lists will be low ... but any bare URLs which are there will have been recently processed by Citation Bot. True. Just note that tying the bot to Citation bot will mean that the bot can only go as fast as citation bot goes; that's fine with me since there isn't really a big rush, but it is something to note.
Also, I don't see any problem with BareRefBot doing a run in which the bot does no filling, Me neither. Rlink2 (talk) 01:44, 5 March 2022 (UTC)[reply]
Thanks, @Rlink2.
I had kinda hoped that once BareRefBot was authorised, it could start working around the clock. At say 7 edits per minute, it would do ~10,000 pages per day, and clear the backlog in under 3 weeks.
By making it follow Citation bot, we restrict it to about 3,000 pages per day. That means that it may take up to 10 weeks, which is a pity. But I think we will get better results this way. BrownHairedGirl (talk) • (contribs) 01:58, 5 March 2022 (UTC)[reply]
@BrownHairedGirl: Maybe a hybrid model could work; for example, it could avoid filling in refs for websites where the bot knows citation bot could possibly get better data (e.g. nytimes, journals, websites with metadata tags that barerefbot doesn't understand, etc.). That way we have the best of both worlds - the speed of barerefbot, and the (higher) quality of citation bot. Rlink2 (talk) 02:02, 5 March 2022 (UTC)[reply]
@Rlink2: that is theoretically possible, but I think it adds a lot of complexity with no gain.
The problem that BareRefBot exists to resolve is the opposite of that set, viz. the URLs which Citation bot cannot fill, and we can't get a definitive list of those. My experience of trying to make such a list for Reflinks was daunting: the sub-pages of User:BrownHairedGirl/No-reflinks websites list over 1400 sites, and it's far from complete. BrownHairedGirl (talk) • (contribs) 02:16, 5 March 2022 (UTC)[reply]
  • Some numbers. @Rlink2: I did some analysis of the numbers, using AWB's list comparer and pre-parser. The TL;DR is that there are indeed very slim pickings for BareRefBot in the other articles processed by Citation bot: ~16 per day.
I took CB's latest 10,000 edits, as of about midday UTC today. That took me back to just two hours short of five days, on 28 Feb. Of those 10K, only 4,041 were not from my list. Only 13 of them still have a {{Bare URL inline}} tag, and 93 have an untagged, non-PDF bare URL ref. After removing duplicates, that left 104 pages, but 25 of those were drafts, leaving only 79 mainspace articles.
So CB's contribs list gives an average of only 16 non-BHG-suggested articles per day for BareRefBot to work on.
In those 5 days, I fed CB with 14,168 articles, on which the bot made just short of 6,000 edits. Of those 14,168 articles, 2,366 still have a {{Bare URL inline}} tag, and 10,107 have an untagged, non-PDF bare URL ref. After removing duplicates, that left 10,143 articles for BareRefBot to work on. That is about 2,000 per day.
So in those 5 days, Citation bot filled all the bare URLs on 28.5% of the articles I fed it. (There are more articles where it filled some but not all bare refs.) It will be great if BareRefBot can make a big dent in the remainder.
Hope this helps. --BrownHairedGirl (talk) • (contribs) 20:03, 5 March 2022 (UTC)[reply]
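For anyone wanting to reproduce this kind of count, a rough sketch of the filtering steps described above; the regexes and the draft check are assumptions, not AWB's actual pre-parser rules:

  # Illustrative only: keep mainspace pages that still have a {{Bare URL inline}} tag
  # or an untagged, non-PDF bare URL ref.
  import re

  BARE_URL_INLINE = re.compile(r"\{\{\s*Bare URL inline", re.IGNORECASE)
  BARE_REF = re.compile(r"<ref[^>/]*>\s*https?://[^\s<]+\s*</ref>", re.IGNORECASE)

  def needs_bare_ref_work(title, wikitext):
      if title.startswith("Draft:"):          # drop drafts, as in the numbers above
          return False
      if BARE_URL_INLINE.search(wikitext):
          return True
      ref = BARE_REF.search(wikitext)
      return bool(ref) and ".pdf" not in ref.group(0).lower()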
  • For what it's worth, I dislike the idea of having a bot whose sole task is to clean up after another bot; we should be improving the other bot in that case. If this bot can edit other pages outside of those done by Citation bot, then it should do so. Primefac (talk) 12:52, 27 March 2022 (UTC)[reply]
    @Primefac, well that's also a good way of thinking about it. I'm personally fine with any of the options (work on its own or follow Citation bot); it's up to others to come to a consensus over what is best. Rlink2 (talk) 12:55, 27 March 2022 (UTC)[reply]
    @Primefac: my proposal is not "clean up after another bot", which would describe one bot fixing errors made by another.
    My proposal is different: that this bot should do the tasks that Citation bot has failed to do. BrownHairedGirl (talk) • (contribs) 03:37, 28 March 2022 (UTC)[reply]
    BrownHairedGirl is right: the proposal is not about cleaning up the other bot's errors, it is about what Citation Bot is not doing (more specifically, the bare refs not being filled). Rlink2 (talk) 17:55, 28 March 2022 (UTC)[reply]
    @Primefac: Also, there seems to me to be no scope for extending the range of URLs Citation bot can fill. CB uses the Zotero servers for its info on the bare URLs, and if Zotero doesn't provide the info, CB is helpless.
    It is of course theoretically conceivable that CB could be extended with a whole bunch of its own code to gather data about the URLs which the Zotero servers can't handle. But that would be a big job, and I don't see anyone volunteering to do that.
    But what we do have is a very willing editor who has developed a separate tool to do some of what CB doesn't do. Please don't let the ideal of an all-encompassing Citation Bot (which is not even on the drawing board) become the enemy of the good, i.e. of the ready-to-roll BareRefBot.
    This BRFA is now in its tenth week. Rlink2 has been very patient, but please let's try to get this bot up and running without further long delay. BrownHairedGirl (talk) • (contribs) 18:25, 28 March 2022 (UTC)[reply]
    Maybe I misread your initial idea, but you have definitely misread my reply. I was saying that if this were just a case of cleaning up after CB, then CB should be fixed. Clearly, there are other pages to be dealt with, which makes that entire statement void, and I never suggested that CB be expanded purely to take over this task. Primefac (talk) 18:31, 28 March 2022 (UTC)[reply]
    @Primefac: maybe we went the long way around, but it's good to find that in the end we agree that there is a job for BareRefBot to do. Please can we try to get it over the line without much more delay? BrownHairedGirl (talk) • (contribs) 20:11, 28 March 2022 (UTC)[reply]

Trial 3

Approved for extended trial (50 edits). Please provide a link to the relevant contributions and/or diffs when the trial is complete. Primefac (talk) 12:48, 27 March 2022 (UTC)[reply]

@Rlink2: Has this trial happened? * Pppery * it has begun... 01:42, 17 April 2022 (UTC)[reply]
@Pppery Not yet, busy with IRL stuff. But will get to it soon (by end of next week latest) Rlink2 (talk) 02:37, 17 April 2022 (UTC)[reply]
@Rlink2, now? ―  Qwerfjkltalk 20:08, 3 May 2022 (UTC)[reply]
@Qwerfjkl Not yet, I am still catching up after my mini wikibreak. I will try to get to it next week. At the absolute latest it will be done by the middle of next month (it will probably be done way sooner, but I would rather provide a definite upper bound than say "maybe this week" and pass the deadline). Rlink2 (talk) 12:29, 4 May 2022 (UTC)[reply]
@Rlink2: any news?
It's now almost mid-June, which was your absolute latest target.
What is your current thinking? Are you losing interest in this task? Or just busy with other things?
We are all volunteers, so if you no longer want to put your great talents into this task, that's absolutely fine. But it's been on hold now for three months, so it would be helpful to know where it's going. BrownHairedGirl (talk) • (contribs) 09:19, 12 June 2022 (UTC)[reply]

I have done extensive testing since the 2nd trial and I think we are finally ready for a third one, after some turbulence. What do people here think? Rlink2 (talk) 08:35, 10 August 2022 (UTC)[reply]

@Rlink2, Have you done the trial approved above? If so, can you link to the edits here? Otherwise, you should complete the trial and post the results here, then wait for feedback. — Qwerfjkltalk 06:44, 23 August 2022 (UTC)[reply]