Apache Storm
Common Crawl, a project powered by Apache Storm, publishes its data freely and openly for the greater good. One of its datasets is a collection of news archives produced by StormCrawler, an open source distributed web crawler built on Apache Storm and released under the Apache License.