** If you are a cool professional Ruby programmer you might find all of this just a bit greenish.
We all have our pet web apps we work on whenever our day job does not get in the way. I have mine too. Many of these apps integrate various services over the web – like albums on Flickr. Well, every now and then I get a craving to actually put some usable data in my apps. I go around looking for that data on the net, which is followed by a moral debate with myself when I find some. Actually, it is not as bad as it sounds.
There is a lot of data that you can pick up legally – like the nutrient data that the Australian government publishes on one of its websites.
Data comes in various shapes and sizes. Tabular data can usually be converted into CSV, which is often the simplest format to load. We have all done it a hundred times in projects, but here is how I did it.
It’s a simple approach, and the script is fast enough for me. There are 4000 rows in the file and it takes about 20 seconds to load them. That works for me.
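A minimal sketch of a loader along these lines, not the original script. It assumes Ruby's standard CSV library; the `Food` struct, the column layout and the sample rows are all hypothetical stand-ins for whatever model and file your app uses.

```ruby
require "csv"

# Hypothetical stand-in for the app's model; a Rails app would
# create ActiveRecord rows here instead.
Food = Struct.new(:name, :energy_kj, :protein_g)

# Hypothetical sample data; the real file has about 4000 rows.
sample_csv = <<~ROWS
  name,energy_kj,protein_g
  Apple,218,0.3
  Almond,2503,19.5
ROWS

# Parse every line (header included) and build one record per row.
def load_foods(csv_text)
  CSV.parse(csv_text).map do |name, energy, protein|
    Food.new(name, energy, protein)
  end
end

foods = load_foods(sample_csv)
puts "Loaded #{foods.length} rows"
```

Note that this naive version also turns the header line into a record, so the first entry in `foods` holds the column names rather than real data.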
You can of course skip the first line using
You can also use FasterCSV and then avoid the first line using
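Two hedged ways of skipping the header row are sketched below. FasterCSV became Ruby's standard CSV library in Ruby 1.9, so the sketch uses `require "csv"` rather than the separate gem; the sample data and column names are hypothetical.

```ruby
require "csv"

# Hypothetical sample data with a header line.
csv_text = <<~ROWS
  name,energy_kj
  Apple,218
  Almond,2503
ROWS

# Option 1: parse everything, then drop the first row.
rows = CSV.parse(csv_text)
data_rows = rows.drop(1)

# Option 2: let the parser treat the first row as headers, so each
# remaining row can be read by column name.
table = CSV.parse(csv_text, headers: true)
names = table.map { |row| row["name"] }

puts data_rows.inspect
puts names.inspect
```

The second option is usually the more readable one, since the code no longer depends on column positions.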
That brings us to our next topic
Other alternatives to upload CSV
Here are some of the other alternatives…
- Ruby String#split (slow)
- Ruby CSV (slow)
- FasterCSV (slow)
- ccsv (fast & recommended if you have control over CSV format)
- CSVScan (fast & recommended if you have control over CSV format)
- Excelsior (fast & recommended if you have control over CSV format)
This data can be found here. That website also has some benchmarks of how fast each of the parsers is.
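If you want to check the numbers on your own machine, a rough comparison can be sketched with Ruby's `Benchmark` module. This covers only the first two options from the list and uses hypothetical generated data with no quoted fields; timings depend on your Ruby version and hardware, so treat it as a starting point rather than a verdict.

```ruby
require "benchmark"
require "csv"

# Hypothetical data: 4000 simple rows without quotes or embedded commas,
# so String#split and CSV produce identical results.
lines = Array.new(4000) { |i| "food#{i},#{i},#{i * 2}" }
csv_text = lines.join("\n")

split_rows = nil
csv_rows = nil

Benchmark.bm(14) do |bm|
  bm.report("String#split") do
    split_rows = csv_text.each_line.map { |line| line.chomp.split(",") }
  end
  bm.report("CSV.parse") do
    csv_rows = CSV.parse(csv_text)
  end
end
```

Keep in mind that `String#split` only works while you control the format: a quoted field containing a comma will break it.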
I like the simplicity of
What about other non-tabular data?
Well, there is much to be had over the web. If a website publishes an RSS feed, you can do a lot with it. If not, well, there is screen scraping! Ever heard of that beast?! This is how I parsed and uploaded some RSS data some time ago.
Some months later … and with a few more Railscasts under my belt, I would use Feedzirra.
Do have a look at the GitHub page here. You can then set up a call to FeedEntry.update_from_feed("feed url") in your cron jobs, or use Whenever.
By the way, you are probably wasting a bit of bandwidth if you do this, because you download the whole feed when there are means to fetch just what has changed. You can do this using…
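A minimal sketch of parsing a feed with Ruby's bundled `rss` library, not the original script. The feed is inlined so the example runs without network access; the feed contents, the `Entry` struct and the `entries_from` helper are all hypothetical.

```ruby
require "rss"

# Hypothetical RSS 2.0 feed, inlined instead of downloaded.
feed_xml = <<~XML
  <?xml version="1.0" encoding="UTF-8"?>
  <rss version="2.0">
    <channel>
      <title>Example feed</title>
      <link>http://example.com/</link>
      <description>A hypothetical feed</description>
      <item>
        <title>First post</title>
        <link>http://example.com/1</link>
        <pubDate>Mon, 01 Jun 2009 10:00:00 GMT</pubDate>
      </item>
      <item>
        <title>Second post</title>
        <link>http://example.com/2</link>
        <pubDate>Tue, 02 Jun 2009 10:00:00 GMT</pubDate>
      </item>
    </channel>
  </rss>
XML

# Hypothetical stand-in for the model the entries would be saved to.
Entry = Struct.new(:title, :url, :published_at)

# Turn each <item> in the feed into an Entry.
def entries_from(xml)
  feed = RSS::Parser.parse(xml)
  feed.items.map { |item| Entry.new(item.title, item.link, item.pubDate) }
end

entries = entries_from(feed_xml)
entries.each { |entry| puts "#{entry.title} (#{entry.url})" }
```

In a real app you would fetch the XML from the feed URL and save each entry to the database instead of printing it.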
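One general mechanism for this, independent of any feed library, is an HTTP conditional GET: you send back the `ETag` and `Last-Modified` values from your previous download, and the server answers `304 Not Modified` with no body if nothing has changed. The sketch below uses only Ruby's standard `net/http`; the helper names and the example URL are hypothetical, and it is not necessarily the approach the tools above take.

```ruby
require "net/http"
require "uri"

# Build a GET request that asks the server to skip the body if the feed
# has not changed since the values saved from the previous response.
def conditional_feed_request(feed_url, etag: nil, last_modified: nil)
  request = Net::HTTP::Get.new(URI(feed_url))
  request["If-None-Match"] = etag if etag
  request["If-Modified-Since"] = last_modified if last_modified
  request
end

# Returns the feed body, or nil when the server answers 304 Not Modified.
def fetch_if_changed(feed_url, etag: nil, last_modified: nil)
  uri = URI(feed_url)
  request = conditional_feed_request(feed_url, etag: etag, last_modified: last_modified)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
  response.is_a?(Net::HTTPNotModified) ? nil : response.body
end

# Build (but do not send) a request for a hypothetical feed URL.
request = conditional_feed_request("http://example.com/feed.xml", etag: '"abc123"')
puts request["If-None-Match"]
```

This only saves bandwidth if the server actually supports these headers; otherwise it simply returns the full feed every time.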








