The scope of the project for my machine learning class was strictly to use text-learning processes to build algorithms that could classify the ranks/types/indices associated with certain text files. For my project, I chose two datasets from Kaggle (https://www.kaggle.com/mrisdal/fake-news and https://www.kaggle.com/c/fake-news/data), which I joined into a roughly 20,000-row dataset with columns for the news 'text', its 'type' (the classification of the news, e.g. reliable, unreliable, conspiracy, state, hate, satire, bias), 'date_published', and 'author'. Text and Type were the only two columns used in the project for text learning. The final project does not constitute a research project, as I did not read through all the articles and label them myself as fake, hate, reliable, etc. Rather, I trusted the datasets provided by Kaggle and used them for academic purposes, to practice coding and building text-learning algorithms.
The brunt of the run time in this project came from cleaning all of the data provided. Above is the code used in the project, which goes through the text cell by cell, cleaning and filtering it. Along with filtering out gibberish and other miscellaneous characters, the cleaning function also removes all 'stop words' in the English language (to learn more about stop words and what they are, please click here). This gives more 'importance' to the words that survive the filter and are written into the array FNclean[] in the Python code. The 100 most-used words across all 20,000 entries are listed above in the header.
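As a stand-alone illustration of that cleaning step, here is a minimal sketch in Python. The DataFrame name df and the use of NLTK's stop-word list are assumptions on my part; only the output array FNclean is named in the write-up above.

```python
import re

from nltk.corpus import stopwords  # run nltk.download('stopwords') once beforehand

STOP_WORDS = set(stopwords.words('english'))

def clean_text(raw):
    """Lowercase the text, strip non-letter characters, drop English stop words."""
    letters_only = re.sub(r'[^a-zA-Z]', ' ', str(raw))
    words = letters_only.lower().split()
    return ' '.join(w for w in words if w not in STOP_WORDS)

# FNclean is the cleaned-text array named in the write-up;
# df is an assumed pandas DataFrame holding the joined Kaggle data.
FNclean = [clean_text(article) for article in df['text']]
```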
After the data was filtered, a vectorizer.transform function was called to turn the array of words into an array of integers. These arrays were then randomly divided into a training set (80% of the original data) and a testing set (20% of the original data). Once the arrays were transformed into integers, I could call multiple Naive Bayes algorithms ("probabilistic classifiers" based on applying Bayes' theorem to the dataset of features). Below are the results of two Naive Bayes algorithms (Multinomial and Bernoulli) and of a Support Vector Machine algorithm, which yielded a significantly better result.
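For context, here is a hedged sketch of that vectorize-split-classify pipeline. The choice of CountVectorizer and the random_state value are assumptions; the project itself names only the vectorizer.transform call, the 80/20 split, and the two Naive Bayes variants.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Learn the vocabulary from the cleaned text, then map each article
# to a vector of word counts (a sparse matrix of integers).
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(FNclean)
y = df['type']  # the news-type labels

# Random 80/20 train/test split, as described above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The two Naive Bayes variants used in the project.
for clf in (MultinomialNB(), BernoulliNB()):
    clf.fit(X_train, y_train)
    print(type(clf).__name__, 'accuracy:', clf.score(X_test, y_test))
```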
The first two results were discouraging to an extent: the Naive Bayes algorithms were only slightly better at classifying types of news than a theoretical coin flip. But once the Support Vector Machine's run finished and its result was printed, my eyebrows shot up. Inspired by the accuracy gained from the SVM, I then ran a Principal Component Analysis (a statistical procedure that converts a set of observations of possibly correlated variables into a smaller set of dimensions that still represent the original dataset) on the news data.
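A sketch of the SVM step follows, using sklearn.svm.SVC per the documentation cited in the sources. The linear kernel and C value are assumptions, since the project does not state which settings it used; the stats.stackexchange link in the sources discusses how C trades off margin width against misclassification for a linear kernel.

```python
from sklearn.svm import SVC

# Linear-kernel SVC, per the sklearn.svm.SVC docs cited in the sources.
# C=1.0 is sklearn's default regularization strength; the value actually
# used in the project is not stated.
svm = SVC(kernel='linear', C=1.0)
svm.fit(X_train, y_train)
print('SVM accuracy:', svm.score(X_test, y_test))
```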
Once the PCA transformation had run, I fit a Random Forest Classifier with the parameters n_estimators=50 and max_depth=25. This yielded jaw-dropping performance: the Random Forest's test accuracy was 99.6%. Below is the confusion matrix created to better show just how well the Random Forest classified every type of news in the dataset.
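Here is a hedged sketch of the PCA-plus-Random-Forest step. The n_components value is an assumption (the project does not state it), and note that sklearn's PCA needs a dense matrix, so the sparse count vectors are densified below; with a large vocabulary, TruncatedSVD is the usual memory-friendly alternative.

```python
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# sklearn's PCA requires dense input, so the sparse count vectors are
# densified here; n_components=100 is an assumption, not stated in the project.
pca = PCA(n_components=100)
X_train_pca = pca.fit_transform(X_train.toarray())
X_test_pca = pca.transform(X_test.toarray())

# Random Forest with the parameters given above.
rf = RandomForestClassifier(n_estimators=50, max_depth=25)
rf.fit(X_train_pca, y_train)
print('Random Forest accuracy:', rf.score(X_test_pca, y_test))

# Rows are true labels, columns are predicted labels (sklearn's convention).
cm = confusion_matrix(y_test, rf.predict(X_test_pca))
```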
The x-axis represents the predicted classification of each article in the test set, while the y-axis represents its actual/true classification.
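For readers who want to reproduce such a plot, here is a hypothetical matplotlib sketch that renders the confusion matrix cm from the previous snippet with that same axis orientation; it is not the plotting code used in the project.

```python
import matplotlib.pyplot as plt
import numpy as np

# Plot cm with predicted labels along the x-axis and true labels along the y-axis.
labels = sorted(set(y_test))  # matches confusion_matrix's default label order
fig, ax = plt.subplots()
ax.imshow(cm, cmap='Blues')
ax.set_xticks(np.arange(len(labels)))
ax.set_xticklabels(labels, rotation=45, ha='right')
ax.set_yticks(np.arange(len(labels)))
ax.set_yticklabels(labels)
ax.set_xlabel('Predicted type')
ax.set_ylabel('True type')
plt.tight_layout()
plt.show()
```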
The 'bs' label of news comes from Kaggle's 'Getting Real About Fake News' dataset, listed in the background of this article. You can read more about how Kaggle's community developed this 'BS detector'/identifier here.
Have we beaten fake news? No, absolutely not. The algorithm has real limitations.
Then what does this project tell us? It taught me that text classification is a viable way to monitor news content. However, it is the 'real-time' issue that limits the project's application: social platforms would need to vet content at an extremely fast and efficient rate to properly monitor what is posted. Since these platforms emphasize live posts and interaction, classifying posts quickly and efficiently becomes that much harder.
Furthermore, the proliferation of unreliable ("fake") news has proven more challenging than previously expected. With the rise of international "fake news"/trolling data centers and warehouses, content providers need to ensure, more than ever, that the content being created is legitimate.
I hope this article was interesting to the readers who made it all the way to the end, and that it showed you that news can indeed be classified successfully with text-learning algorithms. The issues that arise from this successful project are how to implement it in a way that is both efficient for the platform and does not undermine legitimate news sources.
Thank you!
All the best,
BSBA Marketing Research | Minors Data Analytics & Spanish
Sources
To verify that this project is not itself "FAKE", please visit the sources below, which I used to research and code the assignment. Also, feel free to message me with any questions or comments about this text-learning program.
https://www.kaggle.com/mrisdal/fake-news
https://www.kaggle.com/c/fake-news/data
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
https://stackoverflow.com/questions/22216076/unicodedecodeerror-utf8-codec-cantdecode-byte-0xa5-in-position-0-invalid-s
https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC
https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
https://stats.stackexchange.com/questions/31066/what-is-the-influence-of-c-in-svms-with-linear-kernel