Increasing attention needs to be paid to the kind of content that proliferates on the internet and to the effects it has on viewers. Therapists and psychiatrists need to be aware of the content their patients consume, while viewers (on YouTube in particular) should be aware of the consequences of their consumption. In this post, I discuss our analysis of YouTube video content based on textual features extracted from the transcripts, which we used to build classifiers that identify whether a video can be termed depressive or not. Understanding the content of a video gives us a zoomed-in picture of the kind of videos a viewer has been watching over a given period of time.
The main objectives of this research were:
The data for the analysis of the video content was gathered by retrieving videos with various search keywords and extracting the transcript of each. There were around 3,000 transcripts in total, adding up to 1,409,719 words. The videos were divided into two categories: depressing and not-depressing. Of the data collected, 1,427 (48%) transcripts are considered depressing and account for nearly 754,883 words; the remaining 1,573 (52%) non-depressing transcripts account for the other 832,117 words. To label our data, we considered a video depressing if it appeared as a search result when depressing keywords were used. While looking for depressing videos, we focused mostly on videos tagged "self-harm", "suicidal", "triggering", etc.
Following is a sample list of the keywords used for this purpose:
Most search terms revolved around these keywords, targeting depression, self-harm, etc. We extracted the video IDs, filtered them, and kept only the unique ones to maintain the quality of the data.
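The de-duplication step can be sketched as follows. This is a hypothetical, order-preserving filter; the sample IDs and the function name are illustrative, not from our actual pipeline.

```python
# Keep the first occurrence of each video ID while preserving the
# order in which search results were returned.
def unique_video_ids(results):
    """Return video IDs in their original order, dropping duplicates."""
    seen = set()
    unique = []
    for video_id in results:
        if video_id not in seen:
            seen.add(video_id)
            unique.append(video_id)
    return unique

# The same video often appears under several depression-related queries.
ids = unique_video_ids(["vid_a", "vid_b", "vid_a"])  # -> ["vid_a", "vid_b"]
```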
For classification, we experimented with three models. We used the CES-D questionnaire (refer to the terminologies section for a detailed description) as a reference to create features to be fed to the models. The description of each model follows.
a) First, we built a Multinomial Naive Bayes (NB) classifier using features extracted with the Empath model. Empath is a living lexicon mined from modern text on the web for analyzing text across lexical categories (similar to LIWC). The features were the normalized scores for a set of lexical categories. Apart from the pre-defined categories (negative emotion, positive emotion), Empath also gives the flexibility to create custom categories, so we created categories seeded with terms for the symptoms covered by the CES-D questionnaire and included them in the feature set.
Following is a sample of the seed terms corresponding to each questionnaire symptom:

| CES-D item | Seed terms |
|---|---|
| I did not feel like eating | Appetite |
| I felt that I could not shake off my blues | Sad |
| I had trouble keeping my mind on what I was doing | Distraction, ADHD |
| I felt depressed | Depression |
| My sleep was restless | Insomnia, Nightmare |
| I felt my life had been a failure | Failure, Self-Doubt, Self-hate |
| I felt lonely / People were unfriendly | Lonely |
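The feature extraction can be sketched as below. Since Empath's real `create_category` call expands seed terms against its web-mined lexicon (and needs network access), this toy stand-in simply counts seed-term hits per category and normalizes by transcript length; the category names and seed sets are illustrative, not our full list.

```python
# Toy stand-in for Empath's normalized category scores. The real Empath
# library (lexicon.create_category / lexicon.analyze) expands each seed
# list into a larger lexicon; here a category is just its seed terms.
CATEGORIES = {
    "appetite":   {"eating", "appetite", "hungry", "food"},
    "insomnia":   {"sleep", "restless", "awake", "nightmare"},
    "depression": {"depressed", "hopeless", "sad", "blues"},
}

def category_scores(text):
    """Per-category hit counts, normalized by transcript length."""
    tokens = text.lower().split()
    n = len(tokens) or 1  # avoid division by zero on empty transcripts
    return {cat: sum(t in seeds for t in tokens) / n
            for cat, seeds in CATEGORIES.items()}
```

One such score vector per transcript is what gets fed to the Naive Bayes classifier.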
We didn't achieve satisfactory results from this model (accuracy hardly crossed the baseline), so we decided to take another approach and combine textual features with these category scores.
b) So, to obtain better results, we took TF-IDF-weighted word n-grams from the transcripts and used them along with the features from the previous model to train the Naive Bayes classifier. This led to much better accuracy (the best among the three, as we'll see later in this post).
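The combination can be sketched with scikit-learn as follows. The four toy transcripts, their labels, and the two-column stand-in for the Empath scores are made up for illustration; the key point is concatenating the sparse TF-IDF matrix with the dense category scores before fitting.

```python
# Sketch of model (b): TF-IDF word n-grams concatenated with
# Empath-style category scores, feeding Multinomial Naive Bayes.
import numpy as np
from scipy.sparse import hstack, csr_matrix
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

texts = [
    "i felt depressed and hopeless all week",
    "my sleep was restless and i could not eat",
    "fun highlights from the football game",
    "a cheerful cooking tutorial with friends",
]
labels = [1, 1, 0, 0]  # 1 = depressing, 0 = not-depressing

# TF-IDF-weighted word uni- and bi-grams from the transcripts
tfidf = TfidfVectorizer(ngram_range=(1, 2))
X_text = tfidf.fit_transform(texts)

# stand-in for the normalized Empath scores (one column per category)
empath_scores = np.array([[0.30, 0.0], [0.25, 0.1], [0.0, 0.0], [0.0, 0.2]])

# MultinomialNB requires non-negative features; both blocks qualify
X = hstack([X_text, csr_matrix(empath_scores)])
clf = MultinomialNB().fit(X, labels)
preds = clf.predict(X)
```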
c) For the third model, we trained a 1-D Convolutional Neural Network with an embedding layer initialized from GloVe word embeddings. The values of the hyperparameters are shown in the table below.
The architecture of the CNN is as follows. We use a simple stack of Conv1D layers with 64 and 128 filters respectively, each with kernel size 5 and ReLU non-linearities. The output layer uses sigmoid as the activation function. To prevent over-fitting, we use Dropout layers with rate 0.5 after both convolutional layers.
| Layer | Output units |
|---|---|
| Global Max Pooling | 128 |
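The forward pass of this architecture can be illustrated with a minimal numpy sketch: two Conv1D layers (64 then 128 filters, kernel size 5, ReLU), global max pooling, and a sigmoid output unit. The weights here are random and dropout (a training-time regularizer) is omitted, so this only demonstrates the shapes and operations, not the trained model.

```python
# Shape-level sketch of the described CNN; weights are random.
import numpy as np

rng = np.random.default_rng(0)
relu = lambda x: np.maximum(x, 0.0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

def conv1d(x, w):
    """x: (steps, in_ch), w: (kernel, in_ch, out_ch) -> (steps-kernel+1, out_ch)."""
    k = w.shape[0]
    return np.stack([np.tensordot(x[i:i + k], w, axes=([0, 1], [0, 1]))
                     for i in range(x.shape[0] - k + 1)])

seq = rng.normal(size=(200, 100))           # 200 tokens of 100-d GloVe vectors
w1 = rng.normal(size=(5, 100, 64)) * 0.05   # Conv1D: 64 filters, kernel 5
w2 = rng.normal(size=(5, 64, 128)) * 0.05   # Conv1D: 128 filters, kernel 5
w_out = rng.normal(size=(128,)) * 0.05      # sigmoid output unit

h = relu(conv1d(seq, w1))                   # (196, 64)
h = relu(conv1d(h, w2))                     # (192, 128)
pooled = h.max(axis=0)                      # global max pooling -> (128,)
prob = sigmoid(pooled @ w_out)              # P(depressing)
```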
A comparison of the results obtained is shown in the table below. From the table, it can be inferred that combining Empath features with TF-IDF features resulted in the best classification accuracy, 85.1%. The sensitivity and specificity of classification were 72% and 95.3% respectively, with an AUC score of 94.6%.
| Model | Accuracy (%) | Training time (s) |
|---|---|---|
| Empath + NB | 52 | 5 |
| TF-IDF + Empath + NB | 85 | 8 |
| CNN + GloVe | 78 | 840 |
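For readers unfamiliar with the reported metrics, they follow directly from the confusion-matrix counts. The counts below are invented for illustration (chosen only to be in the right ballpark), not the study's actual numbers.

```python
# How sensitivity, specificity, and accuracy derive from a
# confusion matrix. Counts are illustrative, not the study's.
tp, fn = 72, 28    # depressing videos correctly caught / missed
tn, fp = 95, 5     # non-depressing videos correctly kept / misflagged

sensitivity = tp / (tp + fn)            # recall on the depressing class
specificity = tn / (tn + fp)            # recall on the non-depressing class
accuracy = (tp + tn) / (tp + tn + fp + fn)
```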
As the results show, the second model gives the highest accuracy while taking very little time compared to the CNN model, which takes around 15 minutes on a CPU. Since our actual research focuses on predicting the affect of the video, we decided not to make the model more complex and to validate our analysis using the second model itself (since it gives good enough accuracy).
The evaluation techniques and the proposed methodology for real-life validation of our model will be discussed in the next part.
Problem explanation and solution proposal. The path to be followed is explained, and pipelines etc. are discussed.
In this post, the CES-D score calculation and the analysis of comments are discussed.
The calculation of arousal-valence values from a video is discussed here.