Fake News Detection Using Machine Learning Ensemble Methods.
Abstract:
The advent of the World Wide Web and the rapid adoption of social media platforms (such as Facebook, Twitter, and Instagram) paved the way for information dissemination that has never been witnessed in human history before. With the current usage of social media platforms, consumers are creating and sharing more information than ever before, some of which is misleading, with no relevance to reality. Automated classification of a text article as misinformation or disinformation is a challenging task. Even an expert in a particular domain has to explore multiple aspects before giving a verdict on the truthfulness of an article. In this work, we propose to use a machine learning ensemble approach for the automated classification of news articles. Our study explores different textual properties that can be used to distinguish fake content from real. By using those properties, we train a combination of different machine learning algorithms using various ensemble methods and evaluate their performance on real-world datasets. The experimental evaluation confirms the superior performance of our proposed ensemble learner approach in comparison to individual learners.
Introduction:
The advent of the World Wide Web and the rapid adoption of social media platforms (such as Facebook, Twitter, and Instagram) paved the way for information dissemination that has never been witnessed in human history before. Besides other use cases, news outlets benefitted from the widespread use of social media platforms by providing updated news in near real-time to their subscribers. The news media evolved from newspapers, tabloids, and magazines to a digital form such as online news platforms, blogs, social media feeds, and other digital media formats. It became easier for consumers to acquire the latest news at their fingertips. Facebook referrals account for 70% of traffic to news websites. These social media platforms in their current state are extremely powerful and useful for their ability to allow users to discuss and share ideas and debate over issues such as democracy, education, and health. However, such platforms are also used with a negative perspective by certain entities commonly for monetary gain and in other cases for creating biased opinions, manipulating mindsets, and spreading satire or absurdity. The phenomenon is commonly known as fake news.
There has been a rapid increase in the spread of fake news in the last decade, most prominently observed in the 2016 US elections. Such proliferation of sharing articles online that do not conform to facts has led to many problems not just limited to politics but covering various other domains such as sports, health, and also science. One such area affected by fake news is the financial markets, where a rumor can have disastrous consequences and may bring the market to a halt.
Our ability to make a decision relies mostly on the type of information we consume; our world view is shaped on the basis of the information we digest. There is increasing evidence that consumers have reacted absurdly to news that later proved to be fake. One recent case is the spread of the novel coronavirus, where fake reports spread over the Internet about the origin, nature, and behavior of the virus. The situation worsened as more people read the fake content online. Identifying such news online is a daunting task.
Fortunately, there are a number of computational techniques that can be used to mark certain articles as fake on the basis of their textual content. The majority of these techniques use fact-checking websites such as “PolitiFact” and “Snopes.” There are a number of repositories maintained by researchers that contain lists of websites that are identified as ambiguous and fake. However, the problem with these resources is that human expertise is required to identify articles/websites as fake. More importantly, fact-checking websites contain articles from particular domains such as politics and are not generalized to identify fake news articles from multiple domains such as entertainment, sports, and technology.
The World Wide Web contains data in diverse formats such as documents, videos, and audio. News published online in an unstructured format (free text, images, video, and audio) is relatively difficult to detect and classify, as this strictly requires human expertise. However, computational techniques such as natural language processing (NLP) can be used to detect anomalies that separate a text article that is deceptive in nature from articles that are based on facts. Other techniques involve analyzing the propagation of fake news in contrast with real news; more specifically, how a fake news article propagates differently over a network relative to a true article. The response that an article receives can be differentiated at a theoretical level to classify the article as real or fake. A more hybrid approach can also be used that analyzes the social response to an article along with its textual features to examine whether the article is deceptive in nature or not.
A number of studies have primarily focused on the detection and classification of fake news on social media platforms such as Facebook and Twitter. At a conceptual level, fake news has been classified into different types; this knowledge is then expanded to generalize machine learning (ML) models across multiple domains. The study by Ahmed et al. extracted linguistic features such as n-grams from textual articles and trained multiple ML models, including K-nearest neighbor (KNN), support vector machine (SVM), logistic regression (LR), linear support vector machine (LSVM), decision tree (DT), and stochastic gradient descent (SGD), achieving the highest accuracy (almost 90%) with SVM and logistic regression. According to that research, as the number of n-grams calculated for a particular article increased, the overall accuracy decreased; the phenomenon was observed across the learning models used for classification. Other work achieved better accuracies by combining textual features with auxiliary information, such as user engagement on social media. The authors also discussed social and psychological theories and how they can be used to detect false information online. Further, the authors discussed different data mining algorithms for model construction and techniques for feature extraction. These models are based on content knowledge such as writing style, and on social context such as stance and propagation.
The Problem:
The problem is not only hackers breaking into accounts and sending false information. The bigger problem here is what we call “fake news”. Fake news refers to news stories that are false: the story itself is fabricated, with no verifiable facts, sources, or quotes.
When someone (or something like a bot) impersonates a person or a reliable source to spread false information, that can also be considered fake news. In most cases, the people creating this false information have an agenda, which can be political, economic, or aimed at changing behavior or opinions about a topic.
There are countless sources of fake news nowadays, mostly programmed bots that never get tired (they’re machines, after all) and continue to spread false information 24/7.
The tweets in the introduction are just basic examples of this problem, but much more serious studies in the past 5 years have demonstrated strong correlations between the spread of false information and elections, as well as popular opinion and sentiment about different topics.
The problem is real and hard to solve because the bots are getting better at tricking us. It is not simple to detect whether information is true or not all the time, so we need better systems that help us understand the patterns of fake news, to improve our social media and communication and to prevent confusion in the world.
Purpose:
In this short article, I’ll explain several ways to detect fake news using data collected from different articles. The same techniques can be applied to other scenarios.
In this article, I’ll explain the Python code to load, clean, and analyze the data. Then we will build some machine learning models to perform a classification task (fake or real).
Tools/Skills Used:
1. Python programming
2. Jupyter Notebook
3. Pandas
4. Numpy
5. Matplotlib
6. Seaborn
7. Exploratory Data Analytics
8. Feature Engineering
9. Data Visualization
10. Scikit-learn
11. Machine Learning Algorithms
12. Natural Language Processing (NLTK)
Solving the problem with Python
Data reading
We are using the Pandas library to load the CSV file in the Jupyter Notebook. Pandas is one of the standard tools in machine learning for data cleaning and analysis. It has features for exploring, cleaning, transforming, and visualizing data.
head(): returns the first 5 rows of the DataFrame. To override the default, you may pass a value between the parentheses to change the number of rows returned. For example, df.head(10) will return 10 rows.
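As a rough sketch of this step (the file name news.csv is an assumption; the actual dataset file may be named differently):

```python
import pandas as pd

# Load the dataset (file name assumed) and preview it.
df = pd.read_csv("news.csv")

print(df.shape)
print(df.head())   # first 5 rows; df.head(10) would return 10 rows
```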
Data Cleaning:
It is very important to clean the data because it may contain unwanted columns, outliers, null/NaN values, and more. Data cleaning refers to identifying and correcting errors in the dataset that may negatively impact a predictive model; the term covers all kinds of tasks and activities to detect and repair errors in the data.
Hence, we use some basics like df.columns, df.isnull().sum(), and df.drop(), as sketched below.
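A short sketch of these basics; the dropped column name "Unnamed: 0" is only a hypothetical example of an unwanted column:

```python
# Inspect column names and count missing values per column.
print(df.columns)
print(df.isnull().sum())

# Hypothetical cleanup: drop an unwanted index column and rows with missing text.
df = df.drop(columns=["Unnamed: 0"], errors="ignore")
df = df.dropna(subset=["text"])
```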
Analyzing the dependent features:
In this dataset, the target feature is ‘label’; it contains two unique values, [‘FAKE’, ‘REAL’]. To analyze it, we use functions like the following (a short sketch follows the list):
1. nunique() function returns the number of unique elements in the object. It returns a scalar value which is the count of all the unique values in the Index. By default, the NaN values are not included in the count.
2. unique() function is used to find the unique elements of an array.
3. value_counts() function returns an object containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default.
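A quick sketch of these calls on the ‘label’ column:

```python
# Inspect the target column "label".
print(df["label"].nunique())        # number of distinct values -> 2
print(df["label"].unique())         # -> ['FAKE' 'REAL']
print(df["label"].value_counts())   # counts per class, most frequent first
```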
From the above, we can conclude that the target feature is of object type with two unique values, FAKE and REAL. As we all know, object-type data cannot be passed to the model, so we convert it to a numeric datatype using LabelEncoder.
sklearn.preprocessing.LabelEncoder:
Sklearn provides a very efficient tool for encoding the levels of categorical features into numeric values. LabelEncoder encodes labels with a value between 0 and n_classes-1, where n_classes is the number of distinct labels. If a label repeats, it is assigned the same value as before.
These are transformers that are not intended to be used on features, only on supervised learning targets.
Hence, we convert the target feature from the object datatype to a numeric datatype, where 0 stands for FAKE and 1 for REAL.
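A sketch of that encoding step (LabelEncoder sorts the classes alphabetically, which is why FAKE becomes 0 and REAL becomes 1):

```python
from sklearn.preprocessing import LabelEncoder

# Encode 'FAKE'/'REAL' as integers: FAKE -> 0, REAL -> 1.
le = LabelEncoder()
df["label"] = le.fit_transform(df["label"])
print(le.classes_)   # ['FAKE' 'REAL']
```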
Data Visualization:
Matplotlib and Seaborn are the two libraries being used in this model. Matplotlib is mainly used for basic plotting; visualizations in Matplotlib generally consist of bars, pies, lines, scatter plots, and so on. Seaborn, on the other hand, provides a variety of visualization patterns, uses less syntax, and has attractive default themes.
This model applies countplot() from Seaborn to visualize the count of each class of the target feature.
countplot(): Show the counts of observations in each categorical bin using bars. A count plot can be thought of as a histogram across a categorical, instead of quantitative, variable. The basic API and options are identical to those for barplot(), so you can compare counts across nested variables.
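Something along these lines, using the encoded ‘label’ column in df:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Bar plot of how many articles fall into each class of the target feature.
sns.countplot(x="label", data=df)
plt.title("Class distribution: 0 = FAKE, 1 = REAL")
plt.show()
```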
Text cleaning using NLP(Natural Language Processing):
First, take a copy of the original dataset so that the original text is not affected while cleaning.
Natural language processing is defined as “the application of computational techniques to the analysis and synthesis of natural language and speech”. To perform these computational tasks, we first need to convert the language of text into a language that the machine can understand.
For this fake news classifier, I am going to describe some of the most common steps involved in preparing text data for natural language processing.
1)Normalization: One of the key steps in processing language data is to remove noise so that the machine can more easily detect the patterns in the data. Text data contains a lot of noise in the form of special characters such as hashtags, punctuation, and numbers, all of which are difficult for computers to interpret if left in the data. We therefore need to process the data to remove these elements.
Additionally, it is also important to apply some attention to the casing of words. If we include both upper case and lower case versions of the same words then the computer will see these as different entities, even though they may be the same.
2)Stop words: Stop words are commonly occurring words that for some computational processes provide little information or in some cases introduce unnecessary noise and therefore need to be removed. This is particularly the case for text classification tasks.
There are other instances where the removal of stop words is either not advised or needs to be more carefully considered. This includes any situation where the meaning of a piece of text may be lost by the removal of a stop word. For example, if we were building a chatbot and removed the word “not” from the phrase “I am not happy”, then the reverse meaning may be interpreted by the algorithm. This would be particularly important for use cases such as chatbots or sentiment analysis.
The Natural Language Toolkit (NLTK) python library has built-in methods for removing stop words.
Let’s take a single paragraph, for example message[‘text’][1], and check how the cleaning works on it; after that, it will be applied to the whole dataset.
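A possible sketch of steps 1 and 2 on that single paragraph, assuming the copied DataFrame is called message and the article body sits in the "text" column:

```python
import re

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords")

# Normalization + stop-word removal on a single article.
sample = message["text"][1]
sample = re.sub("[^a-zA-Z]", " ", sample)   # drop punctuation, digits, hashtags
sample = sample.lower()                     # unify casing

words = [w for w in sample.split() if w not in stopwords.words("english")]
print(words[:20])
```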
3)Stemming: Stemming is the process of reducing words to their root form. For example, the words “rain”, “raining” and “rained” have very similar, and in many cases, the same meaning. The process of stemming will reduce these to the root form of “rain”. This is again a way to reduce noise and the dimensionality of the data.
The NLTK library also has methods to perform the task of stemming. The code below uses the PorterStemmer to stem the words in my example above. As you can see from the output, all the words now become “rain”.
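A sketch of that stemming step, reusing the normalized, stop-word-filtered words from above:

```python
from nltk.stem import PorterStemmer

# Stem each remaining word; e.g. "raining" and "rained" both reduce to "rain".
ps = PorterStemmer()
stemmed = [ps.stem(w) for w in words]
print(" ".join(stemmed))
```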
It seems that 80% of the text is cleaned.
Now let’s see what’s happening with Lemmatization.
4) Lemmatization: The goal of lemmatization is the same as for stemming, in that it aims to reduce words to their root form. However, stemming is known to be a fairly crude method of doing this. Lemmatization, on the other hand, is a tool that performs full morphological analysis to more accurately find the root, or “lemma” for a word.
Again NLTK can be used to perform this task.
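Roughly like this, again on the words from the example above (the wordnet corpus must be downloaded first):

```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet")

# Lemmatize the same words instead of stemming them.
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w) for w in words]
print(" ".join(lemmatized))
```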
Now let’s apply this whole cleaning process to the entire dataset.
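One way to do this, building both a stemmed and a lemmatized corpus (the column name "text" and the DataFrame name message follow the earlier steps):

```python
import re

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

ps = PorterStemmer()
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

stem_corpus, lemma_corpus = [], []
for text in message["text"]:
    tokens = re.sub("[^a-zA-Z]", " ", str(text)).lower().split()
    tokens = [w for w in tokens if w not in stop_words]
    stem_corpus.append(" ".join(ps.stem(w) for w in tokens))
    lemma_corpus.append(" ".join(lemmatizer.lemmatize(w) for w in tokens))
```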
Now let’s check which words are most commonly used in fake news and which are most commonly used in real news.
For visualizing the most frequent words, I’m using the WordCloud library.
Word Cloud is a data visualization technique used for representing text data in which the size of each word indicates its frequency or importance. Significant textual data points can be highlighted using a word cloud. Word clouds are widely used for analyzing data from social network websites.
Now let’s see the frequency of the most commonly used word for fake and real news.
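A sketch of both ideas for the fake class, reusing stem_corpus from the cleaning step and assuming label 0 means FAKE after the LabelEncoder step; the same pattern applies to the real class:

```python
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Word cloud of the cleaned fake articles (label 0 == FAKE).
fake_text = " ".join(
    doc for doc, label in zip(stem_corpus, df["label"]) if label == 0
)
wc = WordCloud(width=800, height=400, background_color="white").generate(fake_text)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

# Top 20 most common words in the fake articles.
print(Counter(fake_text.split()).most_common(20))
```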
The cleaning of the text is now done, but the machine cannot understand raw text, so it is important to convert the text data into integers or floating-point values. This is where feature extraction comes in, e.g., CountVectorizer.
CountVectorizer(): To use textual data for predictive modeling, the text must first be split into words (tokens); this process is called tokenization. These tokens then need to be encoded as integers, or floating-point values, for use as inputs in machine learning algorithms. This process is called feature extraction (or vectorization).
Scikit-learn’s CountVectorizer is used to convert a collection of text documents to a vector of term/token counts. It also enables the pre-processing of text data before generating the vector representation. This functionality makes it a highly flexible feature representation module for text.
Apply CountVectorizer to both the stemmed and the lemmatized corpus.
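Roughly as follows, using the corpora built above; the max_features and ngram_range values here are assumed settings, not necessarily the ones used in the original run:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Bag-of-words features from the stemmed and lemmatized corpora.
cv_stem = CountVectorizer(max_features=5000, ngram_range=(1, 3))
X_stem_cv = cv_stem.fit_transform(stem_corpus).toarray()

cv_lemma = CountVectorizer(max_features=5000, ngram_range=(1, 3))
X_lemma_cv = cv_lemma.fit_transform(lemma_corpus).toarray()
```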
There is also another term TF-IDF Vectorizer.
TF-IDF: TF-IDF is an abbreviation for Term Frequency-Inverse Document Frequency. It is a very common way to transform text into a meaningful numeric representation that can be used to fit machine learning algorithms for prediction. CountVectorizer gives raw frequency counts indexed by vocabulary position, whereas TF-IDF weights each term by how informative it is across the whole collection of documents.
Apply the TF-IDF vectorizer to this problem as well.
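Built the same way as the count features above, again with assumed settings:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TF-IDF features from the stemmed and lemmatized corpora.
tfidf_stem = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X_stem_tfidf = tfidf_stem.fit_transform(stem_corpus).toarray()

tfidf_lemma = TfidfVectorizer(max_features=5000, ngram_range=(1, 3))
X_lemma_tfidf = tfidf_lemma.fit_transform(lemma_corpus).toarray()
```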
Let’s check the target feature (y):
Modeling:
Let’s go over a few terms used in the modeling:
1)train_test_split() is a function in Sklearn model selection for splitting data arrays into two subsets: for training data and for testing data. With this function, you don’t need to divide the dataset manually.
By default, Sklearn train_test_split will make random partitions for the two subsets. However, you can also specify a random state for the operation.
2) The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts; however, in practice, fractional counts such as TF-IDF may also work. It assumes that the features are drawn from a simple multinomial distribution, and Scikit-learn provides sklearn.naive_bayes.MultinomialNB to implement it. We use MultinomialNB in this problem.
3) A confusion matrix is a table that is often used to describe the performance of a classification model (or “classifier”) on a set of test data for which the true values are known. The confusion matrix itself is relatively simple to understand, but the related terminology can be confusing. It is a useful machine learning tool that lets you measure Recall, Precision, Accuracy, and the AUC-ROC curve. Its four cells are True Positive, True Negative, False Positive, and False Negative; for example, a True Positive means you predicted positive and it turned out to be true.
4) The classification report is about key metrics in a classification problem. You’ll have precision, recall, f1-score, and support for each class you’re trying to find. The recall means “how many of this class you find over the whole number of elements of this class”.
Now I’m using CountVectorizer with the stemmed text and MultinomialNB(); let’s check the accuracy.
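A sketch of the split, training, and evaluation, starting from the CountVectorizer features of the stemmed corpus (X_stem_cv) built above; the 80/20 split and random_state are assumed settings:

```python
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Target vector: the encoded label column (0 = FAKE, 1 = REAL).
y = df["label"].values

# Hold out 20% of the data for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X_stem_cv, y, test_size=0.2, random_state=0
)

# Train multinomial Naive Bayes on the CountVectorizer + stemming features.
nb = MultinomialNB()
nb.fit(X_train, y_train)
y_pred = nb.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
```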
Now let’s plot a heatmap of the confusion matrix to visualize it.
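For example, reusing y_test and y_pred from the step above:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Heatmap of the confusion matrix for the CountVectorizer + stemming model.
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["FAKE", "REAL"], yticklabels=["FAKE", "REAL"])
plt.xlabel("Predicted label")
plt.ylabel("True label")
plt.show()
```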
By using CountVectorizer and Stemming I got an accuracy of 87%.
Now let’s check with CountVectorizer and Lemmatizing.
Again, let’s plot the heatmap of the confusion matrix.
By using CountVectorizer and Lemmatizing, I got an accuracy of 86%.
Now let’s check by using TF-IDF and Stemming.
Again, let’s plot the heatmap of the confusion matrix.
By using TF-IDF and Stemming, I got an accuracy of 86%.
Now let’s check by using TF-IDF and Lemmatizing.
Again, let’s plot the heatmap of the confusion matrix.
By using TF-IDF and Lemmatizing, I got an accuracy of 83%.
Hence, comparing the four combinations: CountVectorizer with Stemming performs best, CountVectorizer with Lemmatizing and TF-IDF with Stemming both give about 86%, and TF-IDF with Lemmatizing trails at 83%.
Let’s check the accuracy after tuning the hyperparameter alpha of MultinomialNB.
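A sketch of such a sweep; the grid of alpha values is an assumption, and note that scikit-learn clips alpha = 0 to a tiny positive value and emits a warning:

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

# Sweep the Laplace-smoothing parameter alpha and keep the best test accuracy.
best_alpha, best_acc = 0.0, 0.0
for alpha in np.arange(0.0, 1.1, 0.1):
    model = MultinomialNB(alpha=alpha)
    model.fit(X_train, y_train)
    acc = accuracy_score(y_test, model.predict(X_test))
    if acc > best_acc:
        best_alpha, best_acc = alpha, acc

print("best alpha:", best_alpha, "accuracy:", best_acc)
```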
Hence, it’s better to select alpha = 0.0, as it gives the maximum accuracy of 88%.
With some other algorithms:
Now let’s check with other classification algorithms such as DecisionTreeClassifier, RandomForestClassifier, SVM, and LogisticRegression, using GridSearchCV and cross-validation.
Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary dependent variable, although many more complex extensions exist. In regression analysis, logistic regression (or logit regression) is estimating the parameters of a logistic model (a form of binary regression).
The Random Forest Classifier builds a set of decision trees from randomly selected subsets of the training set. It aggregates the votes from the different decision trees to decide the final class of the test object.
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
SVC(): The objective of a Linear SVC (Support Vector Classifier) is to fit the data you provide, returning a “best fit” hyperplane that divides or categorizes your data. From there, after getting the hyperplane, you can then feed some features to your classifier to see what the “predicted” class is.
GridSearchCV is a library function that is a member of sklearn’s model_selection package. It helps to loop through predefined hyperparameters and fit your estimator (model) on your training set. So, in the end, you can select the best parameters from the listed hyperparameters.
cross_val_score() evaluates a model’s score by cross-validation. Its cv argument controls the splitting strategy: an integer gives the number of folds in StratifiedKFold if y is binary or multiclass and the estimator is a classifier, or in KFold otherwise; if None, a default number of folds is used. A sketch combining these pieces is shown below.
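This sketch compares the classifiers with 5-fold cross-validation on the CountVectorizer + stemming features and shows GridSearchCV on one of them; the default model settings and the Logistic Regression parameter grid are assumptions, not the exact configuration used for the reported scores:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Compare several classifiers with 5-fold cross-validation.
models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "DecisionTree": DecisionTreeClassifier(),
    "RandomForest": RandomForestClassifier(),
    "SVM": SVC(),
}
for name, model in models.items():
    scores = cross_val_score(model, X_stem_cv, y, cv=5)
    print(f"{name}: mean accuracy = {scores.mean():.3f}")

# GridSearchCV example: tune the regularization strength C of Logistic Regression.
grid = GridSearchCV(LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}, cv=5)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)
```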
Hence, we can conclude that for this Fake News Classifier problem, Logistic Regression works with an accuracy of 90%, SVM with an accuracy of 90%, and the Random Forest Classifier with 88%.
Conclusion:
The task of classifying news manually requires in-depth knowledge of the domain and expertise to identify anomalies in the text. In this research, we discussed the problem of classifying fake news articles using machine learning models and ensemble techniques. The data we used in our work is collected from the World Wide Web and contains news articles from various domains to cover most of the news rather than specifically classifying political news. The primary aim of the research is to identify patterns in text that differentiate fake articles from true news. We extracted different textual features from the articles using an LIWC tool and used the feature set as an input to the models. The learning models were trained and parameter-tuned to obtain optimal accuracy. Some models have achieved comparatively higher accuracy than others. We used multiple performance metrics to compare the results for each algorithm. The ensemble learners have shown an overall better score on all performance metrics as compared to the individual learners.
Fake news detection has many open issues that require the attention of researchers. For instance, to reduce the spread of fake news, identifying key elements involved in the spread of news is an important step. Graph theory and machine learning techniques can be employed to identify the key sources involved in the spread of fake news. Likewise, real-time fake news identification in videos can be another possible future direction.