[Elsnet-list] ANN: Corpus of 147 million quasi-relational Web tables released for public download

Robert Meusel robert at informatik.uni-mannheim.de
Thu Mar 6 16:43:16 CET 2014


Hi all,

The Web Data Commons team [1] is happy to announce the release of a
corpus containing 147 million quasi-relational Web tables.

The Web contains vast amounts of HTML tables. Most of these tables are
used for layout purposes, but a fraction of them are quasi-relational,
meaning that they contain structured data describing a set of entities.

A corpus of Web tables can be useful for research and applications in
areas such as data search, table augmentation, knowledge base
construction, and various NLP tasks.

The WDC Web Tables corpus has been extracted from the 2012 version of 
the Common Crawl [2], the largest Web crawl that is available to the 
public. The corpus contains the subset of the 11 billion HTML tables 
found in the Common Crawl that are likely quasi-relational.

There are similar corpora at Google and Microsoft, but our corpus is the
only one of this size available to the public.

More information about the corpus and its application domains, as well
as instructions for downloading it, can be found at
http://webdatacommons.org/webtables/

Besides being a good test bed for your Search Join engine and a great
resource for enriching DBpedia, the tables corpus might also be useful
for people in the group working on NLP tasks.

Cheers,
Robert

[1] http://webdatacommons.org
[2] http://commoncrawl.org

-- 
Robert Meusel
Chair of Information Systems V
Web-based Systems Group
Universität Mannheim
B6, 26, Room C1.04
D-68159 Mannheim
Phone: +49 621 181 2648
Mail: robert at informatik.uni-mannheim.de
Web: dws.informatik.uni-mannheim.de


