Friday, August 14, 2015
How to Read Data off the Web the Easy Way
This was the problem GrabzIt was founded to solve. Our first priority was to let people take an exact copy of a web page, so we built a service that screenshots websites as images or PDF documents, allowing a copy of an entire web page to be captured instantly.
Our next priority was to extract data from within a web page. To do this we created a service that converts web pages containing HTML tables into CSV or Excel documents. This allows computer software to read HTML tables easily and lets users capture snapshots of them, which is useful for building up historical data on subjects like football scores.
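To give a feel for what converting an HTML table to CSV involves, here is a minimal sketch in Python using only the standard library. It is not GrabzIt's implementation or API; the names TableParser and html_table_to_csv are made up for illustration, and it assumes a simple table without nested tables or merged cells.

```python
import csv
import io
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Hypothetical helper: collects the cell text of every <tr> it sees."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def html_table_to_csv(html):
    """Return the table rows found in html as CSV text."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


sample = (
    "<table><tr><th>Team</th><th>Goals</th></tr>"
    "<tr><td>Leeds</td><td>2</td></tr></table>"
)
print(html_table_to_csv(sample))
```

Running a conversion like this on a schedule and keeping each CSV file is one way to build up the kind of historical record mentioned above.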
However, this didn't provide the flexibility that many users wanted, so we created the Web Scraper. This is a highly flexible tool that can extract data from any web page or PDF document by crawling over a website and extracting data as it goes. Not only can it extract text, images, links and files from websites, it can also extract text from images, check that a link is valid or take screenshots of every page on a website.
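As a rough illustration of one small piece of this, the sketch below pulls the links out of a page and resolves them against the page's address, which is the first step a crawler takes before visiting further pages. It is an assumption-laden example in standard-library Python, not how the Web Scraper itself works; LinkExtractor, extract_links and the example.com addresses are invented for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Hypothetical helper: gathers the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the address of the page.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return every link found in html, resolved against base_url."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    parser.close()
    return parser.links


page = '<a href="/about">About</a> <a href="https://example.org/x.pdf">File</a>'
print(extract_links(page, "https://example.com/index.html"))
```

A crawler would repeat this for each newly discovered link on the same site, keeping track of the pages it has already visited.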
To do this a user must specify what data to extract, most of which can be done through an online wizard. Once the Web Scraper has extracted the data, it puts it into a structured format, which is crucial for a computer to be able to read it. The available formats range from CSV, Excel and HTML documents to a SQL script, which allows the data to be loaded directly into a database.
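To show why a SQL script is a convenient output format, here is a small Python sketch that turns extracted rows into INSERT statements. The table name, column names and functions are hypothetical, and the actual script the Web Scraper produces may look different. The sketch assumes the table and column names are trusted and quotes every value as text.

```python
def sql_literal(value):
    """Quote a value as a SQL string literal, doubling embedded single quotes."""
    if value is None:
        return "NULL"
    return "'" + str(value).replace("'", "''") + "'"


def rows_to_sql(table, columns, rows):
    """Build one INSERT statement per row (table and column names assumed trusted)."""
    cols = ", ".join(columns)
    lines = []
    for row in rows:
        values = ", ".join(sql_literal(v) for v in row)
        lines.append(f"INSERT INTO {table} ({cols}) VALUES ({values});")
    return "\n".join(lines)


script = rows_to_sql(
    "scores",
    ["team", "goals"],
    [("Leeds", "2"), ("Queen's Park", "1")],
)
print(script)
```

A script like this can be run by a database client against an existing table, so the scraped data ends up in the database without any further parsing.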
While many are impressed by what we do, GrabzIt wants to go further and aims to make the web fully machine readable, just like any other data source.