BIGCRAWLER

BIGCRAWLER is the brain of our business and is used in many of our services. It is developed by Big 2 Great to collect and extract data from various online resources.

BIGCRAWLER uses the latest Big Data technology to collect data and extract useful information.

BIGCRAWLER can collect data from both ordinary static webpages and dynamic, asynchronously loaded pages that traditional crawlers are unable to collect content from. This lets us gather information that is out of reach for traditional crawlers. BIGCRAWLER can be configured to crawl the web responsibly, meaning it does not put heavy load on the websites it visits, and it can be scheduled to avoid collecting data during a website's peak hours, when the site has fewer resources to spare. If an area of a website is irrelevant, BIGCRAWLER can exclude those pages and thereby make data collection faster and more efficient.
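BIGCRAWLER's own configuration format is not reproduced here; the sketch below is only a rough illustration of what responsible-crawling settings can look like for a generic crawler, using invented option names (delay_seconds, allowed_hours, exclude_paths).

```python
# Hypothetical crawl settings -- the option names are illustrative,
# not BIGCRAWLER's real configuration format.
crawl_config = {
    "start_url": "https://example.com/",
    "delay_seconds": 2.0,             # pause between requests to limit load
    "allowed_hours": range(1, 6),     # only crawl 01:00-05:59 (off-peak)
    "exclude_paths": ["/forum/", "/archive/"],  # irrelevant areas to skip
    "max_depth": 3,
}

def may_fetch(url: str, hour: int, config: dict) -> bool:
    """Return True if the URL may be fetched at the given hour under the config."""
    if hour not in config["allowed_hours"]:
        return False                  # outside the scheduled off-peak window
    if any(path in url for path in config["exclude_paths"]):
        return False                  # page belongs to an excluded area
    return True
```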

BIGCRAWLER works day and night and can retrieve information from multiple sources at the same time. It extracts data from gathered pages using artificial intelligence and pattern recognition. Unlike other crawlers, BIGCRAWLER does not need to know the structure of a website to extract data from it, because it is trained to recognize the data across different pages and domains. Traditional crawlers require the user to create a recipe for each type of page and source, which demands a lot of manual labor. BIGCRAWLER, by contrast, only needs a few example pages containing the information, and it then learns how to extract the information itself.
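How BIGCRAWLER's trained extraction works internally is not described here. The toy sketch below only illustrates the example-driven idea in principle: instead of writing a recipe per site, a generic rule is derived from a few labeled example pages and then applied to a page the program has never seen. The HTML snippets and values are invented for the illustration.

```python
import re

def strip_tags(html):
    """Remove HTML tags, leaving plain text."""
    return re.sub(r"<[^>]+>", " ", html)

# A few labeled example pages: an HTML snippet and the value to extract from it.
examples = [
    ('<span class="label">Price:</span> <b>199 DKK</b>', "199 DKK"),
    ('<div>Price: <strong>49 DKK</strong></div>',        "49 DKK"),
]

def learn_label(examples):
    """Find a word that appears before the wanted value in every example."""
    candidates = None
    for html, value in examples:
        before = strip_tags(html).split(value)[0]
        words = set(re.findall(r"[A-Za-z]+", before))
        candidates = words if candidates is None else candidates & words
    return candidates.pop() if candidates else None   # toy: pick any shared word

label = learn_label(examples)                          # -> "Price"

# Apply the learned rule to a page the program has never seen before.
new_page = '<p>Special offer! Price: <em>299 DKK</em></p>'
match = re.search(label + r"\D*(\d+ DKK)", strip_tags(new_page))
print(match.group(1))                                  # -> "299 DKK"
```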

                                      Traditional crawlers   BIGCRAWLER
Concurrent pages                      1                      Many
Collection from asynchronous pages    ✗                      ✓
Responsible crawling                  ✗                      ✓
Exclusion of irrelevant areas         ✗                      ✓
Extraction method                     Manual                 Intelligent


What is a crawler?

A crawler, or data collector, is a program that downloads webpages. The program starts at a given web address and visits all links on that page. The visited pages are downloaded and the links on those new pages are visited in turn. This continues until all pages on the domain are downloaded or the program has reached a specified depth.
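As a hedged illustration of that description (not BIGCRAWLER's implementation), the sketch below downloads pages breadth-first from a start address, follows links within the same domain, and stops at a chosen depth. It uses only Python's standard library; the start address and depth are placeholders.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collect the href targets of all <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_depth=2):
    """Download pages from start_url, following same-domain links up to max_depth."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}                                   # url -> downloaded HTML
    while queue:
        url, depth = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", errors="replace")
        except OSError:
            continue                             # skip pages that fail to download
        pages[url] = html
        if depth >= max_depth:
            continue                             # specified depth reached
        parser = LinkParser()
        parser.feed(html)
        for link in parser.links:
            absolute = urljoin(url, link)
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append((absolute, depth + 1))
    return pages

# Example use (placeholder address):
# pages = crawl("https://example.com/", max_depth=2)
```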

What is information extraction?

Information extraction is when a program extracts relevant information from a downloaded page. For example, a downloaded page may contain a table with some information you want extracted. The program extracts the content of the table, keeps it, and throws the rest of the page away. Information extraction is the most difficult part of automatic data collection, because webpages are structured very differently, and because of this, traditional crawlers produce many errors.
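As a hedged illustration of the table example above (again, not BIGCRAWLER's extraction method), the sketch below keeps only the cell contents of a table in a downloaded page and discards the rest, using Python's standard-library HTML parser. The sample page is invented.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Keep only the cell text of table rows; discard the rest of the page."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_cell = False
    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])                 # start a new table row
        elif tag in ("td", "th"):
            self.in_cell = True
    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
    def handle_data(self, data):
        if self.in_cell and self.rows:
            self.rows[-1].append(data.strip())   # keep cell text only

page = """
<html><body>
  <p>Some text we do not care about.</p>
  <table>
    <tr><th>Product</th><th>Price</th></tr>
    <tr><td>Widget</td><td>199 DKK</td></tr>
  </table>
</body></html>
"""

extractor = TableExtractor()
extractor.feed(page)
print(extractor.rows)   # [['Product', 'Price'], ['Widget', '199 DKK']]
```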