Wikipedia:Articles for deletion/Statsmodels

The following discussion is an archived debate of the proposed deletion of the article below. Please do not modify it. Subsequent comments should be made on the appropriate discussion page (such as the article's talk page or in a deletion review). No further edits should be made to this page.

The result was delete. WP:GNG / WP:V slakr^\ talk / 03:43, 8 March 2014 (UTC)

Statsmodels

Notability not established. The main source, a SciPy conference paper, has been cited only three times according to GScholar. The other source is the topic's website. QVVERTYVS (hm?) 13:12, 22 February 2014 (UTC)

statsmodels is used in industry and research, often without being cited, for example:

Dabdoub, S. M., A. A. Tsigarida, and P. S. Kumar. 2013. “Patient-Specific Analysis of Periodontal and Peri-Implant Microbiomes.” Journal of Dental Research 92 (12 suppl): 168S–175S. doi:10.1177/0022034513504950. Quote:"Single and multiple comparisons of distributions were carried out with the statistical facilities provided by JMP (SAS Institute Inc.), as well as the Python libraries SciPy, pandas, and statsmodels." — Preceding unsigned comment added by 96.127.225.218 (talk) 14:14, 22 February 2014 (UTC)

That paper only has one citation. We need something better to satisfy WP:NSOFT. QVVERTYVS (hm?) 15:46, 22 February 2014 (UTC)

statsmodels is an established tool used by many researchers, including Nobel Prize winners. Many researchers use it without giving it proper credit in their publications. It is part of the Enthought distribution package for scientists: [1]. Nobel Prize Laureate Prof. Thomas Sargent mentions it on his website as one of the most useful Python modules: [2]. It is part of the open source movement, and it would be a mistake for Wikipedia to remove this article. Matplotlib (talk) 15:05, 22 February 2014 (UTC)

That webpage only mentions statsmodels once, in a list, and WP:NSOFT clearly states that "Inclusion of software in lists of similar software generally does not count as deep coverage" and is not sufficient to establish notability. The rest of your argument is irrelevant, I'm afraid. QVVERTYVS (hm?) 15:46, 22 February 2014 (UTC)

I respectfully disagree. There are well over 10,000 Python modules, and he is only citing 4 modules. Clearly, it is a great endorsement by one of the most relevant academics of our time. Econometricians reading this discussion would be rolling their eyes. Matplotlib (talk) 02:22, 23 February 2014 (UTC)

Absolutely agree with the significance of this citation. Cerberus (talk) 23:08, 3 March 2014 (UTC)

For your information: I started to collect a list of references that mention or use statsmodels. Many of those do not cite the conference paper. https://github.com/statsmodels/statsmodels/wiki/Users-and-Citations — Preceding unsigned comment added by Josefpktd (talk • contribs) 15:38, 22 February 2014 (UTC)

Ok, that might be useful. QVVERTYVS (hm?) 15:46, 22 February 2014 (UTC)

I collected the list from what I found with Google Scholar. There are two kinds of articles. The first kind uses parts of statsmodels and usually mentions the statsmodels homepage in brackets or a footnote. The second kind mentions statsmodels as part of the Python ecosystem and in some cases for further analysis. I will add more comments about this. — Preceding unsigned comment added by Josefpktd (talk • contribs) 14:49, 23 February 2014 (UTC)

I found another one that was not on Google Scholar: they mention using Python and R in the main article, but statsmodels is only cited in the Supplementary Material which is not indexed by Google Scholar, as far as I can see. http://bioinformatics.oxfordjournals.org/content/29/14/1825.full?sid=46bb91f0-38f6-493c-a38c-c202b0dbfc34 — Preceding unsigned comment added by Josefpktd (talk • contribs) 15:46, 23 February 2014 (UTC)

statsmodels is just a traditional statistics and econometrics package written in Python. It has less coverage than R or Stata but covers most of the commonly used models and hypothesis tests (together with scipy.stats). There is no hype associated with it. For a bit of background see http://stats.stackexchange.com/questions/47913/pandas-statsmodel-scikits-learn/48578#48578

The number of articles that use or mention statsmodels shows that statsmodels has found acceptance in the research communities of various fields. Of course the citation or usage count is much smaller than that of long-established packages like R or Stata. We, the statsmodels developers, never emphasized getting citations. As pointed out on our mailing list, we don't even have the conference article citation displayed prominently on the documentation website. Statsmodels is also used in a few university courses on using Python in the field, but I don't have a list of those.

Eco-system: Referring to "It is not unreasonable to allow relatively informal sources for free and open source software, if significance can be shown" WP:NSOFT.
I think what Matplotlib pointed out in the comment above is important. statsmodels is an established and important part of the Python-in-science and Python-for-data-analysis ecosystems. Numpy, scipy, pandas, scikit-learn, statsmodels, pymc, and some others are the general purpose packages, which are complemented by field or application specific packages. So lists of recommended packages will often include statsmodels and the other main packages. statsmodels has participated in each of the last five years in the Google Summer of Code under the umbrella of the Python Software Foundation, the first year or two as a scipy project.
Statsmodels is included in all science oriented python distributions, but most of the "spreading the word" goes through blogs and mailing lists.
One example as illustration: http://www.automatedtrader.net/articles/software-review/144328/utopian-quantopian reviews an open source package in finance written by a startup. (Automated Trader Magazine Issue 30 Q3 2013) It mentions statsmodels next to scipy, and then points out a limitation of statsmodels in the next paragraph.

Statsmodels is treated as a tool library, which is necessary but does not require special emphasis. — Preceding unsigned comment added by Josefpktd (talk • contribs) 18:06, 23 February 2014 (UTC)

FYI: I tried to add statsmodels to Wikipedia in 2011, see User:Josefpktd/Statsmodels for my draft. At the time I did not try to show notability because, although we were already well established in the numerical Python community, we did not have a wider user base yet. After an additional two and a half years of growth, I think statsmodels is widespread and known well enough to justify "notability" for Wikipedia. Also note that this time it is not a statsmodels developer that started the Wikipedia page. Josefpktd (talk) 22:02, 23 February 2014 (UTC)

I don't know what you consider a "reliable source" for significance of open source software. So here are three more examples. http://work.thaslwanter.at/Stats/html/index.html is an online "book" or set of notes written for a university course in Austria by a user of Python. The first part is statistics with scipy.stats, the last part is with statsmodels. And here are two blog posts written by data analysts working out the examples of statistics/machine learning books in Python: http://slendermeans.org/pages/will-it-python.html and http://www.datarobot.com/blog/statistical-learning-in-python/ . The first, statistics-oriented part uses statsmodels; the second, machine-learning-oriented part uses scikit-learn. Josefpktd (talk) 00:41, 24 February 2014 (UTC)

Note: This debate has been included in the list of Software-related deletion discussions. • Gene93k (talk) 03:01, 24 February 2014 (UTC)

Blogs are usually not accepted per WP:SPS, unless they're company blogs or the blog of a major figure in industry, academia or the OSS world. Same goes for StackExchange and similar crowdsourced Q&A websites. I've cited automatedtrader.net. To be honest, I'm convinced that statsmodels is a relatively major library; the question is whether an encyclopedic article can be written about it (but I'm moving towards a "yes" on that question). QVVERTYVS (hm?) 09:21, 24 February 2014 (UTC)

Related to WP:SPS: as far as I understand, this refers to publications like blogs written, in the case of software, by the developers or the developing company. I referenced the three blogs (or rather one set of lecture notes and two blogs) to **illustrate** the significance of statsmodels for open source statistical analysis. Those were written by users who are not directly involved in statsmodels. However, since it's open source, the first author contacted the mailing list when he was writing his course notes. The second improved a function in statsmodels when he found during his blog writing that our previous version was slow. I only found the last blog while searching now for establishing notability. I emphasized "illustrate" because statsmodels doing traditional statistics and econometrics is not newsworthy or hyped enough to make it into the New York Times or Wall Street Journal, and most of the examples and comments on statsmodels are on blogs. Josefpktd (talk) 16:05, 24 February 2014 (UTC)

Actually, R did make it to the NYT. But coming back to SPS, it makes the exception that "Self-published expert sources may be considered reliable when produced by an established expert on the subject matter, whose work in the relevant field has previously been published by reliable third-party publications." To me, that means that Thomas Sargent's website is an acceptable source, but J. Random User's blog is not, regardless of whether they're involved with statsmodels. The reason for SPS, as I've always understood it, is that it's too easy to create a blog, post what you want on Wikipedia on the blog, then cite it — not so much to stop promotional editing. QVVERTYVS (hm?) 16:43, 24 February 2014 (UTC)

I thought "Self-" in SPS refers to the subject of the Wikipedia page, and we (self) didn't write those blogs so we can get into Wikipedia. (Aside: R made it into the NYT after 16 years plus another 17 years of S as background. I hope statsmodels makes it sooner. :) I know that many of our sources are not strictly defined as "reliable sources". I'm not sure what "relatively informal sources for free and open source software" means. However, what I tried to show with the wide range of sources is that statsmodels has been "noted" by academic researchers, data analysts and companies, so it should be "notable" enough for Wikipedia in my opinion. Josefpktd (talk) 20:02, 24 February 2014 (UTC)

QVVERTYVS, is there anything missing that would help to convince you? Or should we wait another year, for another 10 or 30 publications that use statsmodels, and until Tom Sargent includes some statsmodels examples in his quant-econ site? Josefpktd (talk) 20:02, 24 February 2014 (UTC)

I usually pay more attention to content in blogs than origin. I just saw that the London School of Economics "syndicated" an article on the use of python for statistical analysis, http://blogs.lse.ac.uk/impactofsocialsciences/2014/02/24/on-the-future-of-statistical-languages/ (Note it contains the disclaimer that it's not an official position) — Preceding unsigned comment added by Josefpktd (talk • contribs) 12:42, 24 February 2014 (UTC)

I'm adding one more example of blogs. There are several companies and startups that use or are starting to use Python for data analysis. I know of a few but do not have any overview of who is using statsmodels as one of the backend tools. cbinsights looks like an analytics company that has never been in contact with statsmodels development, as far as I know: http://www.cbinsights.com/team-blog/python-tools-machine-learning/ Josefpktd (talk) 16:14, 24 February 2014 (UTC)

Just for completeness, a WP:SPS: This is my blog http://jpktd.blogspot.ca/ where I add on and off some explanations or descriptions of statistics that is under development. Except for the release announcement, it is mostly technical and "boring" statistics. I am one of the two main developers and maintainers of statsmodels. http://www.ohloh.net/p/statsmodels/contributors?query=&sort=commits_12_mo Josefpktd (talk) 16:53, 24 February 2014 (UTC)

Related to adding a page on statsmodels. I'm a frequent user of statistics pages on Wikipedia, and if statsmodels has its own page, then it will be easier to add it to statistics pages that have an implementation section. For example, searching Wikipedia "~statsmodels" shows several pages where Wikipedia contributors have added statsmodels for the Python implementation. — Preceding unsigned comment added by Josefpktd (talk • contribs) 12:16, 24 February 2014 (UTC)

I just realized that maybe I misunderstood something here: My comments and links on this "Articles for deletion/Statsmodels" page are for establishing notability for the Wikipedia process of including a new page. I did not provide the links with the intention that they be included on the actual Wikipedia page itself, so I did not restrict myself to sources that are acceptable under the Wikipedia editorial policy. Josefpktd (talk) 05:40, 25 February 2014 (UTC)

There's no difference. A source that cannot be cited cannot establish notability either; in both cases they need to pass the criteria in WP:RS, and for establishing notability the sources need to additionally provide significant coverage. I've cited a few of the suggested sources in the article because I'm willing to help you and I feel statsmodels could deserve its article; I hope other editors can get involved to see if they find the current sources good enough.

I'm changing my opinion to neutral because of the additional sources. Third-party coverage of statsmodels is on the verge of significant; the question is if the slack given by WP:NSOFT is enough. QVVERTYVS (hm?) 14:08, 25 February 2014 (UTC)

Thank you for your consideration. What criteria for significance are being applied to the statistical software packages listed at List_of_statistical_packages? Wes Turner (talk) 01:43, 1 March 2014 (UTC)

Delete. Passing mentions, few citations; that's what the above WP:TLDR comes down to. Someone not using his real name (talk) 18:47, 26 February 2014 (UTC)

Which mentions are "passing"? Which WP:TLDR are you referring to? Wes Turner (talk) 01:45, 1 March 2014 (UTC)

All of them. Someone not using his real name (talk) 12:34, 1 March 2014 (UTC)

You're entitled to your opinion. Wes Turner (talk) 18:07, 2 March 2014 (UTC)

Maybe the references don't spend a lot of time on statsmodels; however, John Stachurski, the co-author of Tom Sargent, is giving a pre-conference workshop on Python including statsmodels at the conference of The Society for Computational Economics http://comp-econ.org/CEF_2014/PreConf.htm (although statsmodels is only mentioned in parentheses as a scientific library). Josefpktd (talk) 23:38, 26 February 2014 (UTC)

Regarding the AfD criteria: "The minimum search expected is a Google Books search and a Google News archive search; Google Scholar is suggested for academic subjects. Such searches should in most cases take only a minute or two to perform." http://scholar.google.com/scholar?q=statsmodels lists 130 results for mentions of **statsmodels** (a fairly unique term). Statsmodels is a standard SciPy ecosystem package. It is also included with Continuum Anaconda (DARPA) and Enthought Canopy by default. Wes Turner (talk) 01:55, 1 March 2014 (UTC)
- ~100 citations is too low for an academic project to be listed on Wikipedia, in my opinion of course. Someone not using his real name (talk) 12:34, 1 March 2014 (UTC)

- - Keep. With 130 Google Scholar scholastic mentions, support from both primary scientific Python ecosystem groups, and packages in the standard MacPorts, FreeBSD, NetBSD, Debian, Ubuntu, Arch, and Gentoo package repositories (according to "whohas statsmodels"), Statsmodels is notable and noteworthy and deserves a page in Wikipedia. Wes Turner (talk) 18:07, 2 March 2014 (UTC)

- - - There isn't any minimal number of citations required for academic work, or academic software for that matter. CS topics relying on a few dozen citations tend to get removed or merged, but there is precedent: scikit-learn was kept because of a JMLR paper with (then) some 100-200 citations. QVVERTYVS (hm?) 17:09, 3 March 2014 (UTC)

- - - - I suppose prestigious journal citation is one criterion for success and notability. The exact BibTeX citation for a paper presented at the 9th Python in Science Conference (2010) regarding a collection of scipy/scikit statistics routines that had collectively been around and (clearly) utilized for many years in both academia and industry has not been listed in the statsmodels documentation; URIs and (DOI) URN still seem to be a mystery to the PDF community. As it stands, statsmodels meets and exceeds the precedent applied to many of the other statistical computer science software pages. (See Comparison_of_statistical_packages). If you are attempting to create demand for more citations, please consider adding "citation needed" where appropriate. As referenced in the commit log of this article, I utilized the general format of the scikit-learn article (infobox, headings, "is largely written in Python, with some core algorithms written in Cython to achieve performance" (should I mention the Fortran in NumPy and SciPy?)). Is there a particular reason why you have attempted to delete statsmodels specifically? I suggest here that peer review supported by journal advertising is essential to science; and peer production and testing through open source forges such as GitHub are strong indicators of community notability. Furthermore, there are 221 forks of statsmodels hosted by GitHub. While there were, as you mention, 163 forks of scikit-learn at the time the AfD resolution for scikit-learn was Keep, there are now 1,332 forks of scikit-learn hosted by GitHub. This frequency-statistic discussion supports the premise that statsmodels is notable enough for Wikipedia. There are many tests for statsmodels methods (in 'test_<name>.py' files of './tests' directories); there could always be more. Wes Turner (talk) 00:54, 4 March 2014 (UTC)

- - - - It's not the venue that matters, it's the citation count. If that's too low, WP:TOOSOON applies. I wasn't pointing to my own reasoning there, more to User:Gaijin42's: "usage, community, ecosystem, support, etc are irrelevant for the purposes of establishing WP:N". QVVERTYVS (hm?) 14:04, 4 March 2014 (UTC)

I don't seem to have commented on this thread; are you referring to a comment I have made elsewhere? I am not taking a !vote on this article, as I have not read the article or reviewed the sources; I am just replying to the ping with general information. In any case, the opening paragraph of WP:N is fairly explicit: "Determining notability does not necessarily depend on things such as fame, importance, or popularity—although those may enhance the acceptability of a subject that meets the guidelines explained below." See also Wikipedia:Subjective_importance. See also WP:NRV: "The common theme in the notability guidelines is that there must be verifiable, objective evidence that the subject has received significant attention from independent sources to support a claim of notability. [...] No subject is automatically or inherently notable merely because it exists: The evidence must show the topic has gained significant independent coverage or recognition". More importantly, if a source fails WP:N it is also probably going to fail WP:RS and WP:V (and therefore WP:OR): if the sources documenting a topic are primary and non-independent sources, they aren't nearly as useful for backing facts. As stated above, blogs and crowd-sourced info are not reliable by Wikipedia standards, and the sites of the authors, or of people selling/distributing/supporting the software, are not neutral and objective voices about the topic. This is all summarized quite nicely in WP:GNG. There used to be text somewhere, but I can't find it now, saying something to the effect of "notability is not the same thing as important. unimportant things can be notable. important things can be non-notable". Gaijin42 (talk) 15:36, 4 March 2014 (UTC)

User:Gaijin42, I was referring to Wikipedia:Articles for deletion/Scikit-learn, which I cited as precedent for keeping CS-related articles based on citation counts of ca. 100-200. QVVERTYVS (hm?) 17:13, 4 March 2014 (UTC)

Ah thanks. One clarification from above: I think both count and venue are important. A small handful of cites is more than sufficient if it's, say, the New York Times and Time magazine, or something. If it's a bunch of personal blogs, press releases, or places that document every release of every package, or just minor in-passing references, an infinite amount could still not be enough. Optimally you would have completely independent sources doing in-depth coverage and analysis and commentary. Merely saying "Package X has feature Y, and is run with command-line Z" is not really helpful as an encyclopedic source. Who should use the package, what are the alternatives, why is this package better: those are sources that show notability. That everyone who uses distro-Z gets a copy automatically isn't really notability, even if distro-Z is the most notable thing in the world. Notability isn't inherited. ~~Again, no commentary on this particular package or its sources, as I have not read them.~~ Gaijin42 (talk) 17:23, 4 March 2014 (UTC)

https://github.com/statsmodels/statsmodels/wiki/Users-and-Citations Wes Turner (talk) 18:25, 2 March 2014 (UTC)

delete The Seabold journal article is an excellent source, but it's the only one. Everything else is just primary sources (docs) or in-passing references, some just saying "we used statsmodels". The GitHub list of cites is a prime example of this problem: none of the papers is about statsmodels, or discusses it at any length. Gaijin42 (talk) 17:29, 4 March 2014 (UTC)

I feel the same could be said for scikit-learn, where the only direct academic cites 3 and 10 are about scikit-learn (by the same authors). I added the links to documentation not to demonstrate notability but to source the claims made in the article. How does Google Scholar read citations, anyway? BibTeX? Wes Turner (talk) 22:54, 4 March 2014 (UTC)

keep Anything mentioned in Sargent and Stachurski's new book (http://quant-econ.net/_static/pdfs/quant-econ.pdf) is surely significant, and statsmodels meets that criterion. Cerberus (talk) 19:13, 4 March 2014 (UTC)

The entirety of the content from the book about statsmodels is "Other Useful Statistics Libraries * statsmodels— various statistical routines" and "There are already functions available that will do this for us — an example is statsmodels.tsa.stattools.periodogram in the statsmodels package". This does not qualify as WP:SIGCOV. Gaijin42 (talk) 19:17, 4 March 2014 (UTC)

If I could summarize my exasperation: if someone were to spend time writing an additional paper advertising statsmodels, format it as a PDF (which doesn't re-flow accessibly, doesn't have #fragment-ids, and doesn't have structured RDFa citations) and then encourage citation proliferation among the communities of users already utilizing this statistical package in production, this WP:N debate would be over and no time would have been spent improving the actual methods or routines in the package. What a wastefully inconsistent application of notability criteria. I would hope that the same criteria are being applied to other statistical packages which are in production use. Again, keep. This time would've been better spent on actual tests than paragraphs of prose. That's all I have to say. Wes Turner (talk) 23:06, 4 March 2014 (UTC)

It seems to me that this discussion has somewhat ignored the WP:RAP distinction that should be applied when accepting or deleting articles. It states there that the principles and rules are "solely intended towards creating and distributing a free encyclopedia of the highest possible quality." It seems to me that if we are able to write an encyclopedic article about statsmodels (I don't think this is in question), then there is no doubt that Wikipedia as an encyclopedia would be improved. There are pages and pages of statistical tests and estimators that have inter-wiki links to software which implements these methods. Here's one I just picked off the top of my head: Autoregressive–moving-average_model#Implementations_in_statistics_packages. Statsmodels is notable in that it is the *only* comprehensive library that provides these statistical methods in the Python programming language. Being able to point the interested reader to the Wikipedia page about statsmodels within these articles would, it seems to me, without question create a better encyclopedia. — Preceding unsigned comment added by Jseabold (talk • contribs) 23:32, 4 March 2014 (UTC)

I think this article may have been overlooked. It is another peer-reviewed, published conference paper about statsmodels in addition to mine mentioned above. I was a contributing author, but not the primary one, and I did not give the presentation: [3] — Preceding unsigned comment added by Jseabold (talk • contribs) 04:15, 5 March 2014 (UTC)

keep People don't cite packages they use, especially fundamental ones. In the past I have used many tools or packages and didn't always know how to cite them. One criterion that, in my humble opinion, should be taken into account by WP for evaluating the notability of an open source package is the presence of a tutorial on that package at a major conference. I was the tutorial co-chair of the biggest conference on Scientific Python in 2012, and Statsmodels was selected as one of the 8 packages we deemed the most important to teach newcomers to the Scientific Python community, together with Numpy, IPython or matplotlib (https://conference.scipy.org/scipy2012/tutorials.php). It was recorded and viewed almost 2000 times on YouTube (http://www.youtube.com/watch?v=RWRsxhUzpxk). Jonathanrocher (talk) 04:21, 5 March 2014 (UTC)
