Article summarization in Wikipedia

Wikipedia provides article summaries for most articles when using its search feature (see screenshot below). I have looked at various articles and couldn't find the text in the original article, neither in the rendered page nor in the metadata inside the Edit section. Now, I have two questions: How does Wikipedia show these summarizations? Are these precurated texts entered by the community, or is there an underlying ML algorithm that summarizes articles? In the case of the former, can you point me to the location where these data are sourced? In ca...

web crawler - Mining Groups of people from Wikipedia

I am trying to get the list of people from http://en.wikipedia.org/wiki/Category:People_by_occupation . I have to go through all the sections and get the people from each section. How should I go about it? Should I use a crawler to fetch the pages and search through them using BeautifulSoup? Or is there another way to get the same data from Wikipedia?...
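One possible alternative to crawling rendered pages is the MediaWiki API's `list=categorymembers` query, which returns the pages and subcategories of a category. The sketch below is only an illustration: `categorymembers_url`, `walk_category`, and the `fetch_members` callback are hypothetical names, the depth limit is an arbitrary choice, and the network call is deliberately left to the caller. Not every page found under such a category tree is necessarily a person, so the result would still need filtering.

```python
from collections import deque
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"
CATEGORY_NAMESPACE = 14  # MediaWiki namespace number for "Category:" pages


def categorymembers_url(category, cmcontinue=None):
    """Build a list=categorymembers request URL for one category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": category,
        "cmtype": "page|subcat",
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue:
        # Continuation token taken from the previous response, if any.
        params["cmcontinue"] = cmcontinue
    return API + "?" + urlencode(params)


def walk_category(root, fetch_members, max_depth=2):
    """Breadth-first walk over a category tree.

    fetch_members(category) is a caller-supplied (hypothetical) function that
    returns a list of dicts with "ns" and "title" keys, e.g. the
    "categorymembers" entries of the API response.
    """
    seen = {root}
    pages = set()
    queue = deque([(root, 0)])
    while queue:
        category, depth = queue.popleft()
        for member in fetch_members(category):
            if member["ns"] == CATEGORY_NAMESPACE:
                # Descend into subcategories, guarding against cycles.
                if depth < max_depth and member["title"] not in seen:
                    seen.add(member["title"])
                    queue.append((member["title"], depth + 1))
            else:
                pages.add(member["title"])
    return sorted(pages)
```

Passing the fetch step in as a callback keeps the traversal testable offline and leaves rate limiting and continuation handling to whoever performs the requests.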

How to query Wikipedia full text dump using Google's BigQuery

I am testing out BigQuery in Google API, and would like to run some queries on a Wikipedia full text dump. Google's sample data doesn't include a full text dump (only revision history). There are a few sources for Wikipedia dumps, such as this one on Amazon: http://aws.amazon.com/datasets/2506 My questions are: Is there a way to query these datasets without transferring them to a Google BigQuery project? Equivalently, is there a way for BigQuery to communicate with one of these datasets directly? If it is not possible for BigQuery, then is there an equival...

Can regular expressions be used with the Wikipedia API?

I have to do a lot of queries in several Wikipedia editions to find information that is not in DBpedia (for example, number of words in the Wikipedia page). From a word, such as "Paris", I would like to get with certainty the Wikipedia page of the French capital, not that of Paris in Ontario. I am therefore obliged to use the categories in my queries, like this: https://en.wikipedia.org/w/api.php?action=query&list=search&format=json&srsearch=paris+incategory:Departments_of_France Problem: the Wikipedia categories are inconsistent. Th...

Wikipedia API Last Editor

I'm accessing Wikipedia by its API to get the parsed version of an article. https://en.wikipedia.org/w/api.php?action=parse&section=0&format=json&prop=text&page=Stack_Overflow Everything is working fine, except I'm trying to find who the last editor on the page was. Without making a separate request, is it possible to find this out? I've tried consulting Wikipedia's documentation, but I'm not able to find an answer to something that would seem pretty basic: https://www.mediawiki.org/wiki/API:Main_page...
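For comparison, the separate request the question hopes to avoid could look roughly like this, using `action=query` with `prop=revisions`. This is a sketch, not a confirmed answer to the single-request question: `last_editor_url` and `parse_last_editor` are hypothetical helper names, and the sample response in the usage note is made up to mirror the `formatversion=2` layout.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def last_editor_url(title):
    """Build a query for the most recent revision's user and timestamp."""
    params = {
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "user|timestamp",
        "rvlimit": "1",
        "format": "json",
        "formatversion": "2",  # "pages" is returned as a list in this format
    }
    return API + "?" + urlencode(params)


def parse_last_editor(response):
    """Return (user, timestamp) from a decoded JSON response, or None."""
    pages = response.get("query", {}).get("pages", [])
    if not pages:
        return None
    revisions = pages[0].get("revisions", [])
    if not revisions:
        # Missing page or no revision data in the response.
        return None
    latest = revisions[0]
    return latest.get("user"), latest.get("timestamp")
```

Usage with an invented response: `parse_last_editor({"query": {"pages": [{"title": "Stack Overflow", "revisions": [{"user": "ExampleUser", "timestamp": "2020-01-01T00:00:00Z"}]}]}})` yields the pair `("ExampleUser", "2020-01-01T00:00:00Z")`.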

Get depicted people from a Wikipedia file

How can I get the list of depicted people from a Wikipedia file? Example: I have a file with article ID 5457009. The Wikipedia link is http://commons.wikimedia.org/wiki/File:Bundesarchiv_B_145_Bild-F048807-0025,_Bonn,_Neubau_Kanzleramt,_Schmidt_im_Arbeitszimmer.jpg What would the API request look like to extract the people metadata from this file (Schmidt, Helmut: Bundeskanzler, Verteidigungsminister, SPD, Bundesrepublik Deutschland)? Here is another example with 3 depicted people: http://commons.wikimedia.org/wiki/File:Bundesarchiv_B_145_Bild-F00...

Is there a way to access Wikipedia tree for disambiguation pages and categories?

1) For example, try visiting http://en.wikipedia.org/wiki/Architecture_(disambiguation) --> Is there any Wikipedia API through which I can access this?
2) For example, try visiting http://en.wikipedia.org/wiki/Architecture and go right to the bottom of the page, where there is a line "Categories: Architecture, Architectural design, Greek loanwords" --> Is there a way to get this data? Preferably in the form of a tree?...

How to get Titles from a Wikipedia Page

Is there a direct API call with which I can get the section titles from a Wikipedia page? For example, from http://en.wikipedia.org/wiki/Chicago, I want to retrieve the following:
1 History
  1.1 Rapid growth and development
  1.2 20th and 21st centuries
2 Geography
  2.1 Topography
  2.2 Climate
3 Cityscape
  3.1 Architecture
and so on.
I have looked at http://www.mediawiki.org/wiki/API:Lists/All, but couldn't find an action which gives me the above list from a wiki page....
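One candidate for this is `action=parse` with `prop=sections`, which returns the table of contents of a page as a list of section entries. The sketch below assumes each entry carries `number`, `line`, and `toclevel` fields; `sections_url` and `format_outline` are hypothetical helper names, and the sample data in the usage note is hand-written from the outline in the question rather than fetched.

```python
from urllib.parse import urlencode

API = "https://en.wikipedia.org/w/api.php"


def sections_url(page):
    """Build an action=parse request that asks only for the section list."""
    params = {
        "action": "parse",
        "page": page,
        "prop": "sections",
        "format": "json",
    }
    return API + "?" + urlencode(params)


def format_outline(sections):
    """Render section entries as an indented outline, one heading per line."""
    lines = []
    for section in sections:
        # toclevel 1 is a top-level heading; deeper levels get more indent.
        indent = "  " * (int(section["toclevel"]) - 1)
        lines.append(f"{indent}{section['number']} {section['line']}")
    return "\n".join(lines)
```

Usage: given the decoded JSON in `data`, `print(format_outline(data["parse"]["sections"]))`. With the hand-written entries `{"toclevel": 1, "number": "1", "line": "History"}` and `{"toclevel": 2, "number": "1.1", "line": "Rapid growth and development"}`, the second heading is indented by two spaces under the first.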

Finding and downloading images within the Wikipedia Dump

I'm trying to find a comprehensive list of all images on Wikipedia, which I can then filter down to the public domain ones. I've downloaded the SQL dumps from here: http://dumps.wikimedia.org/enwiki/latest/ And studied the DB schema: http://upload.wikimedia.org/wikipedia/commons/thumb/4/42/MediaWiki_1.20_%2844edaa2%29_database_schema.svg/2193px-MediaWiki_1.20_%2844edaa2%29_database_schema.svg.png I think I understand it, but when I pick a sample image from a Wikipedia page I can't find it anywhere in the dumps. For example: http://en.wikipedia.org/wi...

Number of links between two Wikipedia pages

I have successfully used the Wikipedia pagelinks SQL dump to obtain hyperlinks between Wikipedia pages for a specific revision time. However, there are cases where multiple instances of such links exist, e.g. between the https://en.wikipedia.org/wiki/Wikipedia page and https://en.wikipedia.org/wiki/Wikimedia_Foundation. I'm interested in finding the number of links between pairs of pages for a specific revision. Ideal solutions would involve dump files other than pagelinks (which I'm not aware of), or using the MediaWiki API....
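If the wikitext of the source revision is available, one rough approach is to count `[[...]]` wikilinks that point at the target title. The following is a minimal sketch under strong assumptions: it only sees links written literally in the wikitext, so links produced by templates are missed and redirects are not resolved. `count_links` and `normalize_title` are hypothetical names, and the normalization (underscores to spaces, first letter uppercased) is a simplification of MediaWiki's title rules.

```python
import re

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]];
# group 1 captures the target title without anchor or label.
WIKILINK = re.compile(
    r"\[\[\s*([^\[\]|#]+?)\s*(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]"
)


def normalize_title(title):
    """Simplified title normalization: underscores to spaces, first letter upper."""
    title = title.replace("_", " ").strip()
    return title[:1].upper() + title[1:]


def count_links(wikitext, target):
    """Count literal wikilinks in wikitext that point at the target title."""
    wanted = normalize_title(target)
    return sum(
        1
        for match in WIKILINK.finditer(wikitext)
        if normalize_title(match.group(1)) == wanted
    )
```

Usage: `count_links(source_wikitext, "Wikimedia_Foundation")`, where `source_wikitext` is the raw text of the linking page at the revision of interest.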

How to extract information (e.g. types and subtypes) from Wikipedia?

I want to extract type information from Wikipedia. For example, I want to find: a list of all "Carolina Panthers players", a list of all "colors", a list of all "NFL teams", a list of all "month". Any ideas if there is a clean way of doing this? Clearly one alternative is using the API, but as far as I'm aware, it's not trivial to use the existing API to extract such information from Wiki....

Get Wikipedia raw tags directly?

Is there a URL to get the raw tags of a Wikipedia entry directly, not the HTML? That is, I want to get only things like this: {{about|the cat species that is commonly kept as a pet|the cat family|Felidae|other uses|Cat (disambiguation)|and|Cats (disambiguation)}} .... and not this: <!DOCTYPE html><html class="client-nojs" lang="en" dir="ltr"><head><meta charset="UTF-8"/><title>Cat - Wikipedia</title> .... I have searched Google but could not find the answer. The closest URL I could find is something like https://en.wikipedi...
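One option that may fit is MediaWiki's `action=raw` on `index.php`, which serves the wikitext of a page instead of rendered HTML. A small sketch follows; `raw_url` and `fetch_raw` are hypothetical helper names, and the User-Agent string is a placeholder to be replaced with something identifying the actual client.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen

INDEX = "https://en.wikipedia.org/w/index.php"


def raw_url(title):
    """Build the URL that returns a page's wikitext rather than HTML."""
    return INDEX + "?" + urlencode({"title": title, "action": "raw"})


def fetch_raw(title, user_agent="example-script/0.1 (placeholder contact)"):
    """Download the wikitext of a page (performs a network request)."""
    request = Request(raw_url(title), headers={"User-Agent": user_agent})
    with urlopen(request) as response:
        return response.read().decode("utf-8")
```

Usage: `raw_url("Cat")` gives `https://en.wikipedia.org/w/index.php?title=Cat&action=raw`, and `fetch_raw("Cat")` would return text containing template markup such as the `{{about|...}}` line quoted in the question.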