IND Dataset

Introduction

The Indian News Dataset (IND) is intended for the task of online news popularity prediction. IND contains news articles from the ten most rated Indian news websites (i.e., India Times, Firstpost, NDTV, The Indian Express, Times Now, One India, Hindustan Times, India TV, News18, and Zee News). These websites were chosen because their articles attract a large number of views or shares, which is a good indicator of news popularity among readers. The articles are drawn from five news categories common to all the websites: technology, election, sports, entertainment, and lifestyle. In total, 1,000 news articles were gathered, i.e., 100 articles from each website, with 20 articles in each of the five categories. The dataset was then labeled based on the number of shares: articles with the fewest shares are labeled ‘Unpopular’, whereas those with a large number of shares are labeled ‘Popular’. Each article is also accompanied by additional information such as the date of publication, the news category, and the name of the news portal. Compared with existing datasets such as Mashable, IND provides the title and content of the news rather than only associated statistics or URLs to the news, and can therefore be considered a more ready-to-use dataset.
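The composition described above can be checked with a short Python sketch. This is only an illustration under stated assumptions: the function names are hypothetical, the share counts in the usage line are made up, and the median split is an assumed stand-in, since the exact share threshold used to label IND is not stated here.

```python
from itertools import product
from statistics import median

# Websites and categories as listed in the dataset description.
WEBSITES = [
    "India Times", "Firstpost", "NDTV", "The Indian Express", "Times Now",
    "One India", "Hindustan Times", "India TV", "News18", "Zee News",
]
CATEGORIES = ["technology", "election", "sports", "entertainment", "lifestyle"]
ARTICLES_PER_CELL = 20  # articles per (website, category) pair


def expected_layout():
    """Return {(website, category): article count} for the balanced design."""
    return {(w, c): ARTICLES_PER_CELL for w, c in product(WEBSITES, CATEGORIES)}


def label_by_shares(share_counts, threshold=None):
    """Label each share count as 'Popular' or 'Unpopular'.

    ASSUMPTION: a median split is used when no threshold is given.
    The actual labeling rule for IND is not specified in this document.
    """
    if threshold is None:
        threshold = median(share_counts)
    return ["Popular" if s > threshold else "Unpopular" for s in share_counts]


layout = expected_layout()
total = sum(layout.values())            # 10 websites * 5 categories * 20 articles
per_website = total // len(WEBSITES)    # articles contributed by each website
print(total, per_website)

# Made-up share counts, purely for illustration (not taken from IND).
print(label_by_shares([12, 850, 40, 3100]))
```

Passing an explicit threshold to label_by_shares replaces the assumed median split with whatever cut-off suits the analysis.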

People

Suharshala R, University of Calicut, Kerala, India. (suharshala@gmail.com)
Anoop K, University of Calicut, Kerala, India. (anoopk_dcs@uoc.ac.in)
Manjary P Gangan, University of Calicut, Kerala, India. (manjaryp_dcs@uoc.ac.in)
Lajish V L, University of Calicut, Kerala, India. (lajish@uoc.ac.in)

Related Publication

Suharshala R., Anoop K., Manjary P. Gangan, Lajish V. L., "Online news popularity prediction before publication: effect of readability, emotion, psycholinguistics features", IAES International Journal of Artificial Intelligence (IJ-AI), ISSN: 2252-8938, Vol. 11, No. 2, June 2022, pp. 539-545, DOI: http://doi.org/10.11591/ijai.v11.i2.pp539-545

Abstract: The development of world wide web with easy access to massive information sources anywhere and anytime paves way for more people to rely on online news media rather than print media. The scenario expedites rapid growth of online news industries and leads to substantial competitive pressure. In this work, we propose a set of hybrid features for online news popularity prediction before publication. Two categories of features extracted from news articles, the first being conventional features comprising metadata, temporal, contextual, and embedding vector features, and the second being enhanced features comprising readability, emotion, and psycholinguistics features are extracted from the articles. Apart from analyzing the effectiveness of conventional and enhanced features, we combine these features to come up with a set of hybrid features. We curate an Indian news dataset consisting of news articles from the most rated Indian news websites for the study and also contribute the dataset for future research. Evaluations are performed over the Indian news dataset (IND) and compared with the performance over the benchmark mashable dataset using various supervised machine learning models. Our results indicate that the proposed hybrid of enhanced features with conventional features are highly effective for online news popularity prediction before publication.

IND Dataset Download

Dataset Download Request Form