Nils Mulvad, editor at Kaas & Mulvad and associate professor at the Danish School of Media and Journalism. Session: When to Scrape – Tools and Techniques. Saturday, 1 March 2014, Baltimore, NICAR.
Scraping data on a daily basis on foreign workers in Denmark – the purpose is to help control that fair payment is paid to the workers. Photo: Sidse Buch
Big data is here, meaning we have access to more and more data from different sources. We also have more and more tools to extract, match, analyze and present the stories in the data.
Here are 11 tips for what to be aware of. We use Kapow software for our scrapers and run them on a Linux server; we collect the data in MySQL databases. We have two servers on Rackspace Cloud.
1. Don’t be scared of big data
There are so many different definitions, normally from the perspective of either job type or the possibilities in the data.
Data experts define it by the tools they need to handle the data. Data providers see it as new sources of data. Analysts describe it by what you can do with the data.
Look at this as a big new area to gather and combine your material from. Take your data from sensors, GPS, authority data, social media, corporate data and scientific data.
2. Data journalists are the kings
For years we have been working with data with a single purpose. Can we extract stories from them? This skill of looking for content in data is perhaps the most important to add to the possibilities right now.
Data analysts are too narrow-minded – they look at all data as equal. Journalists are too narrow-minded as well – they look at data as an incomprehensible part of life.
We operate in between. Very few can do that well. Your time has come. Go for the stories in the data.
3. Combine scraping with other tools
Often you need to scrape the same source on a schedule. If it's an official website, you can use the content directly.
Negotiating with the authority might get you the material as an XML file or via API access.
This makes it a lot easier for both you and them, saves server time on both sides, and ensures higher data quality.
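To illustrate why a structured feed beats scraping HTML, here is a minimal sketch of parsing a hypothetical XML delivery with Python's standard library. The payload and field names are invented for the example, not from the actual Danish registry.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML payload, as an authority might deliver it
# instead of an HTML page that would have to be scraped.
XML_SAMPLE = """
<companies>
  <company>
    <name>Example Byg ApS</name>
    <rut>RUT123456</rut>
  </company>
  <company>
    <name>Sample Anlaeg A/S</name>
    <rut>RUT654321</rut>
  </company>
</companies>
"""

def parse_companies(xml_text):
    """Extract (name, rut) pairs from the structured feed."""
    root = ET.fromstring(xml_text)
    return [(c.findtext("name"), c.findtext("rut"))
            for c in root.findall("company")]

records = parse_companies(XML_SAMPLE)
```

With a feed like this there is no fragile HTML parsing to break when the authority redesigns its website.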
4. Matching data
Sometimes the scraping combines material from different sources to add more context. Sometimes the best solution combines different tools – if so, use them.
WHERE (rutOK.IdCompany LIKE CONCAT('%', rutCompany.RUTNumber, '%') OR rutCompany.RUTNumber LIKE CONCAT('%', rutOK.IdCompany, '%') OR rutCompany.ForeignCompanyRegistration LIKE CONCAT('%', rutOK.IdCompany, '%') OR rutOK.RutNumber = rutCompany.RUTNumber)
The code above is the main part of a script combining our scraper language with MySQL to match fields in the extraction against fields in a table – either as a perfect match or where the content of one field is part of another. Code can do a lot.
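The same either-contains-the-other logic as in the SQL clause can be sketched in Python. This is a simplified stand-in, not the actual matching script; the field names follow the SQL snippet above.

```python
def ids_match(id_company, rut_number, foreign_reg=None):
    """Mirror of the SQL WHERE clause: match if either identifier
    contains the other, if the foreign company registration contains
    the company id, or if the two values are identical."""
    if rut_number and rut_number in id_company:
        return True
    if id_company and id_company in rut_number:
        return True
    if foreign_reg and id_company and id_company in foreign_reg:
        return True
    return id_company == rut_number
```

Substring matching like this catches records where one source stores an identifier with a prefix or suffix that the other source omits.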
5. The perfect scraper doesn’t exist – log errors
You can do many things to optimize your scraper, but there will always be a risk of errors. In some cases the authority builds – and later rebuilds – its website slightly differently, and you couldn't foresee that when you built your scraper.
You need an instinct for reverse engineering: finding patterns on the website and spotting possible sources of errors. You also need to keep records of your scraping, so you have a warning system for errors.
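A minimal sketch of that warning system: wrap each record's extraction in a try/except, log the failure, and keep going. The row format and parser here are invented for illustration.

```python
import logging

logging.basicConfig(level=logging.WARNING)
log = logging.getLogger("scraper")

def extract_record(row):
    """Parse one scraped row (hypothetical 'name;amount' format)."""
    name, amount = row.split(";")
    return {"name": name, "amount": int(amount)}

def run_scrape(rows):
    """Process all rows, logging failures instead of crashing.
    The error log doubles as an alarm for site layout changes."""
    records, errors = [], []
    for i, row in enumerate(rows):
        try:
            records.append(extract_record(row))
        except ValueError as exc:
            log.warning("row %d failed: %r (%s)", i, row, exc)
            errors.append((i, row))
    return records, errors
```

A sudden spike in logged errors is usually the first sign that the source website has been rebuilt.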
6. Always include metadata in your scraping
Scraping data into the same table on a schedule requires you to keep track of each record, meaning you have to include at least the date and time of the scrape and the URL of the source.
This should be in every record.
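A sketch of stamping every record with that metadata before it goes into the table. The field names (`scraped_at`, `source_url`, `batch_id`) are illustrative, not from the original setup.

```python
from datetime import datetime, timezone

def with_metadata(record, source_url, batch_id):
    """Attach the metadata every scraped record should carry:
    a scrape timestamp, the source URL, and a run/batch id.
    (Field names are assumptions for this example.)"""
    stamped = dict(record)
    stamped.update({
        "scraped_at": datetime.now(timezone.utc).isoformat(),
        "source_url": source_url,
        "batch_id": batch_id,
    })
    return stamped

row = with_metadata({"company": "Example ApS"},
                    "https://example.org/registry", 42)
```

With a timestamp and batch id on every row, you can later tell which scheduled run produced which records and roll back a bad batch.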
7. Scraping thousands of websites a day, several levels down
Monitoring changes on big websites means you have to go several levels down. With, for instance, 20 websites to follow and 50 URLs on each page, you have 20 URLs on level 1, 1,000 URLs on level 2, 50,000 URLs on level 3 and, in the end, 2,500,000 URLs on level 4.
You then need to build a system so you only open a URL once at each level and never open a URL on a given level if it has been opened before. A clear structure is the answer.
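The bookkeeping behind that structure can be sketched as a per-level set of already-opened URLs. This is a simplified illustration without any actual fetching.

```python
from collections import defaultdict

class LevelCrawler:
    """Track which URLs have been opened at each crawl level,
    so no URL is fetched twice on the same level (sketch only)."""

    def __init__(self):
        self.seen = defaultdict(set)  # level -> set of opened URLs

    def should_open(self, url, level):
        """Return True and record the URL if it is new at this level."""
        if url in self.seen[level]:
            return False
        self.seen[level].add(url)
        return True
```

At 250,000-plus URLs, skipping duplicates this way is the difference between a crawl that finishes overnight and one that never finishes.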
8. Tuning scrapers for loading URLs
If you can edit the way your scraper handles URLs, turn off everything you don't need – for instance loading of images and CSS, and JavaScript execution.
It's simply the most effective way to minimize server time and speed up the scraping. Sometimes you need some of it switched on to load the data – then it is necessary.
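In a browser-based scraper such as Kapow, asset loading is switched off in the tool's settings. For a hand-rolled scraper, the same effect can be approximated by filtering out asset URLs before fetching. The extension list is an assumption for this sketch.

```python
# Assumed asset extensions to skip; adjust to the sites you scrape.
SKIP_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".css", ".js")

def worth_fetching(url):
    """Only spend server time on URLs likely to contain data,
    not on images, stylesheets or scripts."""
    path = url.split("?", 1)[0].lower()
    return not path.endswith(SKIP_EXTENSIONS)
```

If the data itself is rendered by JavaScript, this filter would exclude the very responses you need – that is the "sometimes you need to have it on" case above.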
9. Always focus on the story – the context
As data journalists we love data – we simply can't get enough. But never be satisfied with the data alone. Always think of the user, the viewer, anybody who will interact with your material.
Make it as simple and easy to use as possible. Rethink and rethink. Mobile means simple and clearly focused.
10. Know the limits of the scraping and machine-generated content
In some situations the scraped material will feed an output – either an email or a presentation. But be very keen on the finish.
Sometimes it demands a human touch at the end to make it better and even more focused on the end user. Follow your products closely.
11. Make scraper-operations scalable
If you begin scheduling daily scraping jobs, make sure you have a system that is easy to scale up.
We run all scraping on scalable cloud servers, so we can upscale everything on the fly. Make expanding easy.