Norconex Web Crawler: Difference between revisions

From Wikipedia, the free encyclopedia
BattyBot (talk | contribs)
→‎top: Expanded Template:Notability and General fixes, replaced: {{notability|date=October 2023}} → {{notability|Product|date=October 2023}}
Tag: New redirect
 
#REDIRECT [[Web crawler#Open-source crawlers]]
{{Short description|Free and open-source Java web crawler}}
{{notability|Product|date=October 2023}}


{{Rcat shell|
<!-- Note: The following pages were redirects to [[Norconex_Web_Crawler]] before draftification:
{{R to related topic}}
*[[Draft:Norconex Web Crawler]]
-->

{{Infobox software
| title =
| other_names = Norconex HTTP Collector
| developer = {{URL | https://norconex.com/ | Norconex Inc.}}
| released = 2016
| latest release version = 3.0.2
| latest release date = 2022-01-05
| repo = {{URL | https://github.com/Norconex/collector-http | GitHub Repository}}
| programming language = [[Java (programming language)|Java]]
| operating system = [[Cross-platform software|Cross-platform]]
| license = [[Apache License]]
| website = {{URL | https://opensource.norconex.com/crawlers/web/ | Norconex Web Crawler}}
}}
}}

'''Norconex Web Crawler''' is [[Free and open-source software|free and open-source]] [[web crawling]] and [[web scraping]] software written in [[Java (programming language)|Java]] and released under the [[Apache License]]. It can export crawled data to repositories such as [[Apache Solr]], [[Elasticsearch]], [[Azure Cognitive Search|Microsoft Azure Cognitive Search]], and [[Amazon CloudSearch]].<ref>{{cite web |title=Committers |url=https://opensource.norconex.com/committers/ |website=opensource.norconex.com}}</ref><ref>{{cite web |last1=Hoppa |first1=Jocelyn |title=Importing Data from the Web with Norconex & Neo4j |url=https://neo4j.com/blog/importing-data-from-the-web-norconex-neo4j/ |website=Graph Database & Analytics |language=en |date=10 February 2020}}</ref><ref>{{cite web |title=Deploy a Norconex HTTP Collector Indexer Plugin {{!}} Cloud Search |url=https://developers.google.com/cloud-search/docs/guides/norconex-http-connector |website=Google for Developers |language=en}}</ref>

The crawler can run as a standalone application or be embedded in another [[Java (programming language)|Java]] application.<ref>{{cite web |last1=Valcheva |first1=Silvia |title=10 Best Open Source Web Crawlers: Web Data Extraction Software |url=https://www.intellspot.com/open-source-web-crawlers/ |website=Blog For Data-Driven Business |date=11 February 2018}}</ref><ref>{{cite web |title=Norconex HTTP Collector |url=https://www.softpedia.com/get/Internet/Other-Internet-Related/Norconex-HTTP-Collector.shtml |website=Softpedia |access-date=25 September 2023}}</ref>

Key features include:
* Multi-threaded crawling
* Text extraction from a variety of file formats (HTML, PDF, Word, etc.)
* Extraction of metadata associated with documents
* Support for pages rendered with JavaScript
* Incremental crawls
* Support for external commands that parse or manipulate documents
* Export of extracted data to a variety of repositories
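
An incremental crawl, as listed above, is commonly implemented by storing a checksum of each document and comparing it on the next run. The following Java sketch illustrates that general technique only. It is not the Norconex implementation or API, and the class name <code>IncrementalCrawlSketch</code>, the <code>Status</code> values, and the <code>process</code> method are hypothetical names chosen for this example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Illustrative only: a generic checksum comparison, not part of the Norconex API. */
class IncrementalCrawlSketch {

    enum Status { NEW, MODIFIED, UNMODIFIED }

    // Checksums recorded on earlier visits, keyed by URL (assumed to persist between crawls).
    private final Map<String, String> checksums = new HashMap<>();

    /** Compares the content checksum with the stored one, then records the new value. */
    Status process(String url, String content) {
        String current = sha256(content);
        String previous = checksums.put(url, current);
        if (previous == null) {
            return Status.NEW;
        }
        return previous.equals(current) ? Status.UNMODIFIED : Status.MODIFIED;
    }

    /** Returns the SHA-256 digest of the text as a lowercase hexadecimal string. */
    static String sha256(String text) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            byte[] hash = digest.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is mandatory on every Java platform, so this is not expected.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

In a design along these lines, only documents reported as <code>NEW</code> or <code>MODIFIED</code> would need to be re-sent to the target repository, which is what makes later crawls cheaper than the first.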

Organizations and products listed as using Norconex Web Crawler include the Apache Solr ecosystem, the Department of National Defence, Universities Canada, and the U.S. Department of Education.<ref>{{cite web |title=SolrEcosystem - Solr - Apache Software Foundation |url=https://cwiki.apache.org/confluence/display/solr/SolrEcosystem |website=cwiki.apache.org}}</ref><ref>{{cite web |title=Norconex Crawler Users |url=https://opensource.norconex.com/crawlers/usedby |website=opensource.norconex.com}}</ref>

== History ==
Norconex Web Crawler was released as [[free and open-source software]] in 2013.<ref>{{Cite web |title=Norconex Gives Back to Open-Source – Norconex Inc |url=https://norconex.com/norconex-gives-back-to-open-source/ |access-date=2023-09-25 |language=en-US}}</ref>

== References ==
<references />

== Mentions in academic research ==
* {{cite journal |last1=Kancherla |first1=Vinay |title=A Smart Web Crawler for a Concept Based Semantic Search Engine (pg. 18) |url=https://scholarworks.sjsu.edu/etd_projects/380/ |journal=Master's Projects |access-date=28 September 2023 |doi=10.31979/etd.ubfy-s3es |date=1 December 2014|doi-access=free }}
* {{cite journal |last1=Horváth |first1=Balázs |title=Recommendation Techniques for smart cities (pg. 12) |url=https://aaltodoc.aalto.fi/handle/123456789/27974 |website=Aalto University |access-date=28 September 2023 |language=en |date=28 August 2017}}
* {{cite arXiv |last1=Wani |first1=Mudasir Ahmad |last2=Agarwal |first2=Nancy |last3=Jabin |first3=Suraiya |last4=Hussain |first4=Syed Zesahn |title=Design of iMacros-based Data Crawler and the Behavioral Analysis of Facebook Users |date=2018 |class=cs.SI |eprint=1802.09566 }}
* {{cite web |last1=Abbasi |first1=Vahid |title=Phonetic Analysis and Searching with Google Glass API |url=https://uub.primo.exlibrisgroup.com/discovery/fulldisplay?docid=alma991018494504807596&context=L&vid=46LIBRIS_UUB:UUB&lang=en&search_scope=MyInst_and_CI&adaptor=Local%20Search%20Engine&tab=Everything&query=creator,contains,vahid%20abbasi&offset=0 |website=uub.primo.exlibrisgroup.com |access-date=28 September 2023 |language=en}}

== See also ==
* {{cite web |last1=Mitchell |first1=Pete |title=25 Best Free Web Crawler Tools |url=https://techcult.com/best-free-web-crawler-tools/ |access-date=2023-09-05 |website=TechCult |date=8 April 2022}}

[[Category:Web crawlers]]

Latest revision as of 21:15, 21 May 2024