Help Shakira classify her YouTube comments

Spam detection is a classic machine learning problem. A group of researchers have written a paper on spam detection for YouTube and have created a tool that automatically detects spam in the comments. In this post I will build a simple model that classifies comments as spam or non-spam. In the end I got 94% accuracy with my model (on a test set where spam and non-spam appear in roughly equal proportions).

The Data

I will use the same data as the researchers mentioned above (which you can find here). The dataset is made of 5 CSV files. Each CSV contains comments from a different YouTube video, with approximately 50% spam and 50% non-spam.

Dataset     Spam   Non-Spam   Total
Psy         175    175        350
KatyPerry   175    175        350
LMFAO       236    202        438
Eminem      245    203        448
Shakira     174    196        370

The researchers created one model per video, but I have chosen to create a single model that should be generic to any video on YouTube. I will train my model on the Psy, KatyPerry, LMFAO and Eminem videos and test it on Shakira's video.

The advantages are that:

  • I train my model on more data, so the training should be better and I can be more confident in the results.
  • The model is more useful: you can feed it any comment and it will predict whether it is spam or not. We are not limited to a particular type of video.

The drawback is that some comments may be typical of a particular type of video, so the model may have more difficulty generalizing to Shakira's comments.

The dataset looks like this: each row contains the comment text (CONTENT), some metadata (AUTHOR, DATE) and the spam/non-spam label (CLASS).

I want to keep my model simple, so I am only using the CONTENT column and not the metadata such as AUTHOR or DATE. This could be an area for improvement.
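For reference, here is a minimal sketch of how the data could be loaded and split. The CSV file names are assumptions based on the video names above, and the validation split is my assumption as well (a validation set is used when fitting the model later).

import pandas as pd
from sklearn.model_selection import train_test_split

# One CSV per video, as described above (file names assumed)
train_files = ['Youtube01-Psy.csv', 'Youtube02-KatyPerry.csv',
               'Youtube03-LMFAO.csv', 'Youtube04-Eminem.csv']

# Concatenate the four training videos into a single DataFrame
train_df = pd.concat([pd.read_csv(f) for f in train_files], ignore_index=True)

# Shakira's comments are kept aside as the test set
test_df = pd.read_csv('Youtube05-Shakira.csv')

# Keep only the comment text (CONTENT) and the label (CLASS)
train_df = train_df[['CONTENT', 'CLASS']]
test_df = test_df[['CONTENT', 'CLASS']]

# Hold out part of the training data for validation (assumption)
train_df, valid_df = train_test_split(train_df, test_size=0.2, random_state=42)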

Preprocessing

Flatten the phrases

For the kind of model I want to use, I have to convert each word to a number. First I need to split each sentence into a list of words.

import re

def format_phrase(phrase):
    # Replace every non-word character with a space, split into words, lowercase
    words = re.sub(r"[^\w]", " ", phrase).split()
    return [w.lower() for w in words]

So format_phrase('Agnes Blog is reaLLY awesome :) !!!!') will return ['agnes', 'blog', 'is', 'really', 'awesome']

By doing this I am losing some information, but because the dataset is quite small (around 2,000 samples) I prefer to merge similar words (for example, totaLLy and totally will be mapped to the same number).

Associate each word with a number

import numpy as np

def get_unique_words(phrases):
    # Concatenate all the word lists into one list, then deduplicate
    words_list = phrases.sum()
    return np.unique(np.array(words_list))

unique_words = get_unique_words(train_df.CONTENT_WORDS)
word2idx = {v: k for k, v in enumerate(unique_words)}

def words2idxs(phrase):
    # Words never seen in the training set get a dedicated index
    words_count = len(word2idx)
    return [word2idx[word] if word in word2idx else words_count for word in phrase]

get_unique_words returns a list of all the words in the training set, which is then converted to a dictionary that looks like this:

{'asian': 552,
 'four': 1518,
 'hating': 1744,
 'moneyz': 2405,
 'personally': 2700,
 'protest': 2834,
 'sleep': 3196,
 'vidios': 3696,
 'woods': 3843,
 'yellow': 3901}

Finally, words2idxs converts each phrase to a list of indices corresponding to its words.

The DataFrame now contains the original CONTENT column plus the derived CONTENT_WORDS and CONTENT_IDX columns.
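A minimal sketch of how these two columns can be derived with pandas' apply (the CONTENT_WORDS column has to exist before building word2idx above; the use of apply here is my assumption):

# Split each raw comment into a normalized list of words
for df in (train_df, valid_df, test_df):
    df['CONTENT_WORDS'] = df.CONTENT.apply(format_phrase)

# Map each word list to a list of integer indices (after word2idx is built)
for df in (train_df, valid_df, test_df):
    df['CONTENT_IDX'] = df.CONTENT_WORDS.apply(words2idxs)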

Standardize the size of the comments

The comments need to have the same length, because the input shape of the model is fixed. I look for the maximum length of a comment in the training set and apply Keras' `pad_sequences`:

from keras.preprocessing import sequence

# Longest comment (in words) in the training set
maxlen = train_df.CONTENT_IDX.map(len).max()
# Pad every comment to exactly maxlen indices (-1 marks padding)
train_content_idx = sequence.pad_sequences(train_df.CONTENT_IDX, maxlen=maxlen, value=-1)

Each comment is now transformed into a vector of length maxlen.
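The validation and test sets are padded the same way, reusing the training maxlen so every input has the same shape (a sketch, reusing the names from the loading step above):

# Reuse the training maxlen so all sets share the same input shape
valid_content_idx = sequence.pad_sequences(valid_df.CONTENT_IDX, maxlen=maxlen, value=-1)
test_content_idx = sequence.pad_sequences(test_df.CONTENT_IDX, maxlen=maxlen, value=-1)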

The model

For the model I used Keras' Sequential() API with an Embedding() layer. I created a CNN (a bit inspired by VGG16!):

from keras.models import Sequential
from keras.layers import (Embedding, Conv1D, MaxPooling1D, Flatten,
                          Dense, Dropout, BatchNormalization)

# vocab_size is not shown above; a natural choice is one entry per
# training word plus one index for unknown words
vocab_size = len(word2idx) + 1

vgg_model = Sequential([

    # Learn a 64-dimensional embedding for each word index
    Embedding(vocab_size, 64, input_length=maxlen),

    # Conv Block 1
    Conv1D(64, 5, padding='same', activation='relu'),
    Conv1D(64, 3, padding='same', activation='relu'),
    MaxPooling1D(),

    # Conv Block 2
    Conv1D(128, 3, padding='same', activation='relu'),
    Conv1D(128, 3, padding='same', activation='relu'),
    MaxPooling1D(),

    # FC layers with BatchNorm
    Flatten(),
    Dense(100, activation='relu'),
    Dropout(0.7),
    Dense(100, activation='relu'),
    BatchNormalization(),
    Dropout(0.7),
    Dense(1, activation='sigmoid')])

 

from keras.optimizers import Adam

vgg_model.compile(loss='binary_crossentropy', optimizer=Adam(), metrics=['accuracy'])

vgg_model.optimizer.lr = 10e-3
vgg_model.fit(train_content_idx, train_df.CLASS,
              validation_data=(valid_content_idx, valid_df.CLASS),
              epochs=10, batch_size=64)
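The accuracy reported below corresponds to the held-out Shakira comments; a minimal sketch of that evaluation step, assuming the test set was preprocessed and padded like the training set:

# Evaluate the trained model on Shakira's comments (never seen during training)
loss, accuracy = vgg_model.evaluate(test_content_idx, test_df.CLASS, batch_size=64)
print('Test accuracy: {:.2%}'.format(accuracy))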

At the end of the training, I got 94% accuracy. The entire code is here. This is the result of a simple model trained in a few seconds, so you could expect much better results with some improvements:

  • Find more data (2,000 comments is a small set)
  • Use the metadata such as the date, the author, whether the comment contains a URL, etc.
  • Use a finer way to split comments into words
