Publication Type:

Conference Paper

Source:

Fourth International Symposium on Women in Computing and Informatics (WCI-2016), International Conference on Advances in Computing, Communications and Informatics (ICACCI) (2016)

URL:

http://ieeexplore.ieee.org/abstract/document/7732034/

Keywords:

Data mining, EICD, entropy, Feature extraction, HTML, Information entropy, Text Density, vegetation, Web Content Extraction, Web pages, Weighted DOM

Abstract:

Web content extraction is a popular technique for extracting the main content from web pages and discarding the irrelevant content. Extracting only the relevant content is a challenging task, since it is difficult to determine which parts of a web page are relevant and which are not. Among the existing web content extraction methods, density-based content extraction is a popular one. However, density-based methods suffer from poor efficiency, especially when pages contain little information and long stretches of noise. We propose a web content extraction technique built on the Entropy based Informative Content Density (EICD) algorithm. The proposed EICD algorithm first analyses content with higher text density. An entropy-based analysis is then performed on the selected features. The key idea of EICD is to use information entropy to represent the knowledge that correlates with the amount of informative content in a page. The proposed method is validated through simulation, and the results are promising.
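
The two-stage idea in the abstract (a text density pass followed by an entropy-based analysis) can be illustrated with a minimal sketch. This is not the authors' EICD implementation: the abstract gives no formulas, so the sketch assumes text density is characters of text per HTML tag and that entropy is the Shannon entropy of a block's word frequencies. The function names, the pre-segmented `fragments` input, and both thresholds are hypothetical.

```python
import math
from collections import Counter
from html.parser import HTMLParser


def shannon_entropy(text):
    """Shannon entropy (in bits) of the word-frequency distribution of text."""
    words = text.lower().split()
    if not words:
        return 0.0
    total = len(words)
    counts = Counter(words)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


class _FragmentStats(HTMLParser):
    """Counts opening tags and collects text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.tags = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        self.tags += 1

    def handle_data(self, data):
        self.chunks.append(data)


def fragment_stats(fragment):
    """Return (whitespace-normalised text, number of opening tags)."""
    parser = _FragmentStats()
    parser.feed(fragment)
    parser.close()
    text = " ".join(" ".join(parser.chunks).split())
    return text, parser.tags


def text_density(fragment):
    """Assumed density measure: characters of text per tag (at least one tag)."""
    text, tags = fragment_stats(fragment)
    return len(text) / max(tags, 1)


def select_informative(fragments, density_threshold, entropy_threshold):
    """Keep text of fragments that pass the density filter, then the entropy filter."""
    kept = []
    for fragment in fragments:
        text, _ = fragment_stats(fragment)
        if text_density(fragment) < density_threshold:
            continue  # stage 1: drop tag-heavy, text-poor blocks
        if shannon_entropy(text) < entropy_threshold:
            continue  # stage 2: drop repetitive, low-information blocks
        kept.append(text)
    return kept


nav = '<ul><li><a href="/">Home</a></li><li><a href="/a">About</a></li></ul>'
article = "<p>Entropy measures the average information carried by each word in a block.</p>"
main_content = select_informative([nav, article], density_threshold=10, entropy_threshold=2.0)
```

In this toy input the navigation list is rejected by the density filter because it has many tags and little text, while the paragraph passes both filters. Real pages would need DOM-level segmentation and tuned thresholds, which the abstract does not specify.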

Cite this Research Publication

M. Annam and Dr. Sajeev G. P., “Entropy based informative content density approach for efficient web content extraction”, in Fourth International Symposium on Women in Computing and Informatics (WCI-2016), International Conference on Advances in Computing, Communications and Informatics (ICACCI), 2016.