One advantage of using official APIs is that they are usually compliant with the terms of service (ToS) of the service that researchers are looking to gather data from. However, third-party libraries or packages which claim to provide more throughput than the official APIs (rate limits, number of requests/sec) generally operate in a gray area, as they tend to violate the ToS. Always be sure to read their documentation thoroughly. Here we focus on the latter approach and will use a Python library (a wrapper) called wptools, built around the original MediaWiki API.

Let's say we want to gather some additional data about the Fortune 500 companies. Since Wikipedia is a rich source of data, we decide to use the MediaWiki API to scrape it. One very good place to start would be to look at the infoboxes (as Wikipedia defines them) of the articles corresponding to each company on the list. Infoboxes essentially contain a wealth of metadata about the particular entity an article belongs to, which in our case is a company.

For example, consider the Wikipedia article for Walmart, which includes the following infobox.

As we can see from above, infoboxes can provide us with a lot of valuable information. Although we expect this data to be fairly organized, it will require some post-processing, which we will tackle in the next section.