News Webpage Classification

A multiclass classifier to identify the topics of news articles.

This system classifies web-based articles as news, gadgets, or sports using Naive Bayes, a probabilistic learning algorithm, implemented in Python with scikit-learn.

First, a web crawler built with the Beautiful Soup Python module scrapes data from different news sites to generate a dataset for training the model. The Python NLTK module is then used to preprocess this data: tokenization, stop-word removal, lemmatization, etc.
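The scraping and preprocessing steps might look roughly like the sketch below. It is an illustration under stated assumptions, not the project's actual code. The sample HTML, the `extract_text` and `preprocess` helpers, and the short stop list are all hypothetical. To stay runnable without downloading NLTK data, it uses NLTK's `RegexpTokenizer` and `PorterStemmer` (stemming, not lemmatization); with the `stopwords` and `wordnet` corpora downloaded, NLTK's stop-word list and `WordNetLemmatizer` could be used in their place.

```python
from bs4 import BeautifulSoup
from nltk.stem import PorterStemmer
from nltk.tokenize import RegexpTokenizer

# Hypothetical sample page; a real crawler would download this HTML first.
SAMPLE_HTML = """
<html><head><title>Demo</title><script>var x = 1;</script></head>
<body><h1>Team wins the final</h1>
<p>The players were celebrating after the matches.</p></body></html>
"""

# Tiny illustrative stop list (assumption); NLTK's stopwords corpus is larger.
STOP_WORDS = {"the", "a", "an", "were", "after", "and", "of", "in", "to", "is"}

_tokenizer = RegexpTokenizer(r"[a-z]+")  # keep runs of lowercase letters only
_stemmer = PorterStemmer()               # stand-in for a lemmatizer


def extract_text(html):
    """Return the visible text of an HTML page, without script/style content."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):
        tag.decompose()
    return soup.get_text(separator=" ", strip=True)


def preprocess(text):
    """Lowercase, tokenize, drop stop words, and stem the remaining tokens."""
    tokens = _tokenizer.tokenize(text.lower())
    return [_stemmer.stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess(extract_text(SAMPLE_HTML))
print(tokens)
```

The resulting token list is what a bag-of-words step would consume, typically after joining the tokens back into a single string per article.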

A bag-of-words model is then generated from the preprocessed data, which is split into a training set and a test set to evaluate the model's performance. A GaussianNB classifier is trained on the training set and used to predict labels for the test set. A dynamic training option is also available: the model is retrained when it misclassifies an article.
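As a minimal sketch of this stage, the example below builds a bag-of-words matrix with scikit-learn's `CountVectorizer`, splits it, trains a `GaussianNB`, and retrains when a prediction is wrong. The toy documents, the split ratio, and the `retrain_on_mistake` helper are assumptions made for illustration. In particular, refitting on the enlarged training set is only one plausible reading of "dynamic training".

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy, made-up documents (already preprocessed); the real dataset is crawled.
docs = [
    "government pass new budget law", "minister announce election date",
    "parliament vote tax reform", "court rule city policy",
    "phone launch bigger camera battery", "laptop review fast processor",
    "tablet screen update software", "smartwatch sensor battery review",
    "team win football final", "striker score goal match",
    "coach name squad tournament", "player injury miss match",
]
labels = ["news"] * 4 + ["gadgets"] * 4 + ["sports"] * 4

# Bag of words: one column per vocabulary word, one row per document.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs).toarray()  # GaussianNB needs dense input

X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.25, stratify=labels, random_state=0
)

model = GaussianNB()
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)  # fraction of test articles labeled correctly


def retrain_on_mistake(model, x, true_label, X_train, y_train):
    """Predict one example; if wrong, add it to the training set and refit."""
    predicted = model.predict(x)[0]
    if predicted != true_label:
        X_train = np.vstack([X_train, x])
        y_train = list(y_train) + [true_label]
        model.fit(X_train, y_train)
    return predicted, X_train, y_train


x_new = vectorizer.transform(["goal match final"]).toarray()
first_guess, X_train, y_train = retrain_on_mistake(
    model, x_new, "sports", X_train, y_train
)
print(accuracy, first_guess)
```

Note that the vocabulary is fixed when the vectorizer is fitted, so words that first appear in a later article are ignored unless the vectorizer is refitted as well.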