Publication Type:

Conference Paper

Source:

2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI) (2016)

URL:

http://ieeexplore.ieee.org/abstract/document/7732034/

Keywords:

Data mining, EICD, entropy, Feature extraction, HTML, Information entropy, Text Density, vegetation, Web Content Extraction, Web pages, Weighted DOM

Abstract:

Web content extraction is a popular technique for extracting the main content from web pages while discarding the irrelevant content. Extracting only the relevant content is challenging because it is difficult to determine which parts of a web page are relevant and which are not. Among existing web content extraction methods, density-based content extraction is a popular one. However, density-based methods suffer from poor efficiency, especially when pages contain little information and long stretches of noise. We propose a web content extraction technique built on the Entropy based Informative Content Density (EICD) algorithm. The proposed EICD algorithm first analyses content with higher text density. An entropy-based analysis is then performed on the selected features. The key idea of EICD is to use information entropy to represent the knowledge that correlates with the amount of informative content in a page. The proposed method is validated through simulation, and the results are promising.
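
The two quantities the abstract names, text density and information entropy, can be illustrated with a minimal Python sketch. This is not the EICD algorithm from the paper: the function names, the characters-per-tag definition of density, and the word-level entropy are assumptions chosen only to show the general idea.

```python
import math
from collections import Counter


def text_density(text: str, tag_count: int) -> float:
    """Characters of visible text per HTML tag in a block.

    Assumed definition for illustration; the paper may define density differently.
    """
    return len(text) / max(tag_count, 1)


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits, of the word distribution in a block."""
    words = text.lower().split()
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Hypothetical blocks: a repetitive navigation bar and a short article sentence.
nav_block = "home home home home"
article_block = "entropy measures how varied the words in a block are"

print(shannon_entropy(nav_block), shannon_entropy(article_block))
print(text_density(article_block, tag_count=2))
```

In this sketch a repetitive block scores zero entropy while varied prose scores higher, which is one plausible way entropy could separate noise from informative content once high-density blocks have been selected.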

Cite this Research Publication

M. Annam and Dr. Sajeev G. P., “Entropy based informative content density approach for efficient web content extraction”, in 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2016.
