DATASHACK 2018
Emporio Sirenuse

Instagram Data Analysis
Community Detection and Content Recommendation

Kimia Mavon
Francesca Morini
Karan Rajesh Motwani
Moreno Raimondo Vendra


INTRODUCTION AND SCOPE

Le Sirenuse Positano is a luxury five-star hotel located in Positano, on the Amalfi Coast. The Emporio Sirenuse store stands in front of the hotel. Here, tourists and guests can find a collection of dresses and swimwear aligned with the Mediterranean style of the hotel. The collection, curated by the hotel owner Mrs. Carla Sersale, is also sold online on a dedicated website and through various e-commerce channels. Both the hotel and the store have an official Instagram account, though the hotel has a much wider influence. The goal of this project is to improve the online presence of the clothing brand by leveraging social media platforms, while protecting its heritage.

In this project, we address how a luxury brand may utilize a growing digital economy using big data, while maintaining its heritage. Our proposed tool, Frank, uses a human-in-the-loop model to identify relevant communities and suggest a post’s text, hashtags, and images. This system is housed in an interactive visualization interface designed for intuitive analysis.

USERS CLUSTERING

Premise

We were given access to a dataset composed of Instagram data from the accounts of Emporio Sirenuse and its followers, along with the accounts of 9 other competitors and their followers. We want to:

Identify methods for feature detection from Social Media text and image data.
Cluster Instagram posts of the client’s followers using these features.
Detect latent communities based on the distribution of user posts in different clusters.

This is a difficult task: social media data is complex, composed of heterogeneous components (text, images, geo-tags, hashtags, mentions, etc.), and tends to be abstract and hard to interpret. Moreover, this heterogeneity calls for completely different feature detection methods, which are hard to integrate and validate in an unsupervised learning setting like ours.

Clustering

We intend to cluster users into specific communities based on their Instagram activity. To do that, we first cluster all their Instagram posts into general clusters, which can be thought of as standard units, and then cluster the users based on their representation in these standard units. This allows us to find communities of users that share interests and post about the same topics in the same proportions.
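As a minimal sketch of the second step, a user can be represented by the share of their posts that falls into each post cluster. The function name, the cluster ids, and the two sample users are hypothetical and are not taken from the project's code.

```python
from collections import Counter

def user_cluster_profile(post_clusters, n_clusters):
    """Return the share of a user's posts that fall in each post cluster."""
    counts = Counter(post_clusters)
    total = len(post_clusters)
    if total == 0:
        # A user with no posts gets an all-zero profile.
        return [0.0] * n_clusters
    return [counts.get(c, 0) / total for c in range(n_clusters)]

# Hypothetical cluster ids assigned to each post of two users.
posts_by_user = {
    "user_a": [0, 0, 1, 2],
    "user_b": [2, 2, 2, 1],
}
profiles = {
    user: user_cluster_profile(clusters, n_clusters=3)
    for user, clusters in posts_by_user.items()
}
print(profiles)
```

Users whose profiles are close to each other post about the same topics in similar proportions, so these vectors can be fed to a second clustering step.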

This clustering system hinges on finding a latent space that accurately separates the posts based on their features and characteristics. Therefore, extracting the right features from images and text to create a large vector representation of an Instagram post is the most significant and complex task.

Naive Implementation - Images

The initial implementation of image clustering used basic image features such as:

Color Histogram
Detail Estimation using Canny Edge Detector
Feature Extraction using Corner Detector (Harris)

A large vector representation of the image was thus constructed, and K-Means was performed with K=5.

Naive Implementation - Text

The initial implementation of text clustering involved:

One-Hot-Encoding the text of each caption.
Frequency of occurrence of each word is used as a feature.
The caption is thus reduced to a bag of words.

A vector representation of the text is thus a count vector of the entire observed vocabulary.
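The bag-of-words encoding described above can be sketched with the standard library alone. The tokenizer and the two sample captions are illustrative assumptions; the project's actual preprocessing is not documented here.

```python
import re
from collections import Counter

def tokenize(caption):
    """Lowercase a caption and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", caption.lower())

def build_vocabulary(captions):
    """Collect every distinct token observed across all captions."""
    return sorted({tok for caption in captions for tok in tokenize(caption)})

def count_vector(caption, vocabulary):
    """Count how often each vocabulary word occurs in one caption."""
    counts = Counter(tokenize(caption))
    return [counts[word] for word in vocabulary]

captions = ["She can fly !!!", "Fly, fly away"]
vocabulary = build_vocabulary(captions)
vectors = [count_vector(c, vocabulary) for c in captions]
print(vocabulary)
print(vectors)
```

Each caption becomes a vector as long as the vocabulary, which is why this representation grows quickly and carries no information about word order or style.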

These features were easy to extract and very interpretable. However, they did not include any information about the objects or context in the image, or about style in the text; they did not account for correlation between text and image; and they were too simple to handle the stochasticity of social media data.

Intermediate Trials - Object Detection

To capture the contents of the image, we tried Mask R-CNN, the object detection model designed by the Facebook AI Research team. This model builds on a regular Faster R-CNN implementation and adds a Fully Convolutional Network (FCN) branch that performs pixel-by-pixel segmentation of each Region of Interest (RoI). The model detects 70 classes of objects, so the image is represented by a vector of dimension 1x70.

Intermediate Trials - Scene Detection

To add information about image context, we trained a Scene Detection model on the Flickr dataset, which consists of images and corresponding captions. This model uses a CNN for image feature extraction combined with an LSTM for text embedding, and is trained to learn the relationship between the two vectors. Given an input image, the model outputs a caption string. For clustering, this caption was encoded into a count vector using Bag of Words.

Object Detection

Person	Ski Blades	Hill
1	1	1

Scene Detection

“A person skis off a ramp to perform a stunt”

Post

Caption : “She can fly !!!”

This approach managed to find the following clusters: Cosmetics, Modelling, Lifestyle.

Given the lack of ground truth at our disposal, we curated a dataset to help quantitatively evaluate the results of clustering. We chose 4 hashtags and scraped 10,000 posts from Instagram for each of them. The hashtags were:

Food
Cosmetics
Rock Climbing
Nightlife

Note : Although the topics are well defined, only ~70% of the dataset scraped was relevant. The remaining posts were either sarcastic, non-contextual or abstract.

Autoencoder

To obtain a Latent Representation of the Image, we trained a Neural Network to reproduce the original input using a symmetric architecture i.e. an Autoencoder.

An autoencoder neural network is an unsupervised Machine learning algorithm that applies backpropagation, setting the target values to be equal to the inputs. An autoencoder is trained to attempt to copy its input to its output. Internally, it has a hidden layer that describes a code used to represent the input.

The parameters of this implementation were:

Image Size : 64x64
3x2 layers of 64 Convolutional Filters, Kernel Size = (2,2)
Max Pooling Window = (2,2)
Loss : Mean Squared Error
Epochs : 50

The images produced were blurry and did not represent the input well. This can be attributed to the degree of variability in the dataset and overall complexity of the network.

We can clearly observe that the clusters are not well separated and the cluster purity when matched to the ground truth is not high.
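Cluster purity is not defined in the text. A common definition, assumed here, assigns each cluster its majority ground-truth label and reports the fraction of posts that carry their cluster's majority label. The sample labels below are made up for illustration.

```python
from collections import Counter, defaultdict

def cluster_purity(cluster_ids, true_labels):
    """Fraction of items whose label matches the majority label of their cluster."""
    members = defaultdict(list)
    for cluster_id, label in zip(cluster_ids, true_labels):
        members[cluster_id].append(label)
    # For each cluster, count how many items carry its most common label.
    majority_total = sum(
        Counter(labels).most_common(1)[0][1] for labels in members.values()
    )
    return majority_total / len(true_labels)

# Hypothetical cluster assignments and scraped-hashtag labels.
cluster_ids = [0, 0, 0, 1, 1, 1]
true_labels = ["food", "food", "nightlife", "cosmetics", "cosmetics", "cosmetics"]
purity = cluster_purity(cluster_ids, true_labels)
print(round(purity, 3))
```

A purity of 1.0 means every cluster contains posts from a single hashtag; values near 1/4 would be expected for a random assignment over four balanced topics.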

CNN

To correctly classify the input images into topic classes by identifying distinct features of an image, we trained a neural network with convolutional layers i.e. Convolutional Neural Network.

A Convolutional Neural Network (CNN) is comprised of one or more convolutional layers (often with a subsampling step) and then followed by one or more fully connected layers as in a standard multilayer neural network. The architecture of a CNN is designed to take advantage of the 2D structure of an input image (or other 2D input such as a speech signal). This is achieved with local connections and tied weights followed by some form of pooling which results in translation invariant features. Another benefit of CNNs is that they are easier to train and have many fewer parameters than fully connected networks with the same number of hidden units.

The parameters of this implementation were:

Image Dimension : 64x64
3 layers of 256, 128 and 64 Convolutional Filters respectively.
Kernel and Max Pooling Window = (2,2)
Loss : Categorical Cross-Entropy
Epochs : 100

The classification accuracy obtained was 70%. The class which was hardest to predict was ‘Rock Climbing’.

We can clearly observe that the clusters are better separated than in the autoencoder implementation. The cluster purity when matched to the ground truth is also relatively better.

LSTM for Text Feature Extraction

To correctly classify the input text into topic classes by identifying distinct features such as style and grammar, we trained a neural network with a non-preset memory window i.e. a Long Short Term Memory based Recurrent Neural Network.

Long short-term memory (LSTM) units (or blocks) are a building unit for layers of a recurrent neural network (RNN). An RNN composed of LSTM units is often called an LSTM network. A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate. The cell is responsible for "remembering" values over arbitrary time intervals; hence the word "memory" in LSTM. Each of the three gates can be thought of as a "conventional" artificial neuron, as in a multi-layer (or feedforward) neural network: that is, they compute an activation (using an activation function) of a weighted sum.

The parameters of this implementation were:

Google’s word-embedding corpus was used to preprocess the text input. It converts every word to a vector of dimension 1x300
1 LSTM layer of 64 cells
Loss : Categorical Cross Entropy
Epochs : 100

The classification accuracy obtained was 87%. The class which was hardest to predict was ‘Nightlife’.

We can clearly observe that the clusters are well separated. The cluster purity when matched to the ground truth is very impressive.

Model Selection

After observing the results, we chose a CNN for image embedding and an LSTM for text embedding.

At this point we need to find the optimal cluster count using the elbow and average silhouette methods; we can clearly observe that the best cluster count is between 4 and 5.

Results and Conclusions

Feature extraction is the most important step for clustering data efficiently: naive feature extraction is simple, but not accurate enough to find the more distinctive features of online communities. More specifically, when dealing with image data, Convolutional Neural Networks are very powerful feature extractors, whereas for text LSTMs are more robust than standard RNNs because they mitigate the problems of vanishing and exploding gradients. Combining CNN and LSTM features provides an accurate representation of an Instagram post, and this allowed us to create meaningful user clusters.

CONTENT RECOMMENDATION

Ranking

From the clustering, we have an understanding of the underlying communities of users.

The intended user, a social media manager for example, will upload his or her pictures (from a photoshoot) to the tool. His/Her goal is to create a post that best appeals to a chosen community. In other words, the social media manager wants to optimize his/her image, caption, and hashtag for a given community.

A post can be optimized by two metrics:

Whether a post will be popular to a community
Whether a post will be similar to a community

The former requires an “engagement metric,” a score that reflects the likelihood that a post will be received positively by the community. We address both metrics below.

Maximizing Engagement

In business, social media engagement measures the public shares, likes and comments for an online business' social media efforts. Engagement is a common metric for evaluating social media performance. While maximizing engagement is best, companies like Facebook, Twitter, and e-commerce retailers differ in how they calculate standardized engagement scores. Engagement scores typically include likes, comments, number of followers, and other metrics like views, mentions, and more.

We wanted to predict engagement from the quality of a photo:

Here the dependent variable, y, is the engagement score and the independent variables are the image features. Thus, this became a supervised machine learning problem.

We calculated an engagement score for every post within each community. The team debated at length how to calculate an engagement score. We wanted to normalize by the number of followers so that popular people (“influencers”) did not overwhelm the metric, while also crediting comments, which are often outweighed by the number of likes. We experimented with z-scores and linear combinations of these variables.

Candidate engagement scores were refined iteratively: we adjusted the composition and linear combination of the score, then evaluated each variant with several supervised machine learning algorithms:

Decision Tree Regressor with polynomial (varying max depths), with normalizer, MinMaxScaler
Ridge with CV, Polynomial Ridge with CV (varying alpha), with normalizer, MinMaxScaler
Lasso with CV, Polynomial Lasso with CV (varying alpha), with normalizer, MinMaxScaler
SVM, Polynomial SVM

We began by using the engagement score as a regression target, but ultimately converted the score into a binary label and chose a logistic regression, which outperformed all other methods on the test set. The 0/1 label was set by whether the engagement score was above or below the median. The final model was a logistic regression SVM, as this supervised machine learning algorithm had the highest test score, with an R-squared of .64.
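The exact engagement formula is not given above. The sketch below assumes one plausible linear combination (likes plus weighted comments, divided by followers) and then applies the median split described in the text. The `comment_weight` value and the sample posts are hypothetical.

```python
from statistics import median

def engagement_score(likes, comments, followers, comment_weight=2.0):
    """Follower-normalized score; comment_weight is an assumed value."""
    if followers <= 0:
        return 0.0
    return (likes + comment_weight * comments) / followers

def binarize_by_median(scores):
    """Label a post 1 if its score is above the median, otherwise 0."""
    cutoff = median(scores)
    return [1 if score > cutoff else 0 for score in scores]

# Hypothetical posts from one community.
posts = [
    {"likes": 120, "comments": 4, "followers": 1000},
    {"likes": 900, "comments": 10, "followers": 50000},
    {"likes": 45, "comments": 12, "followers": 300},
    {"likes": 10, "comments": 0, "followers": 800},
]
scores = [engagement_score(**post) for post in posts]
labels = binarize_by_median(scores)
print(labels)
```

Note how the second post has by far the most likes but ends up with label 0 once the score is normalized by its 50,000 followers, which is the effect the normalization is meant to achieve.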

One reason linear combinations with z-scores did not work is that outliers, while useful for the compilation of communities, severely affected engagement scores based on z-scores. As a line was being fit in a multidimensional space, all of the features were clustered into only a few points (a score of -8 or 40, for example). This caused very poor performance on the test set. We will return to the engagement metric later.

Suggesting Photos using Community Similarity while Maximizing Engagement

Looking back at the tool’s pipeline, a social media manager has uploaded pictures and would like suggested content. Each uploaded picture is passed through the clustering algorithm and receives, for each community, a probability score that the picture belongs to it. For a given picture, these probability scores sum to one. Each community also has its own distribution: the compilation of images that have been passed through the clustering algorithm.

Thus, we have probability distributions whose similarity we can measure using the Kullback-Leibler divergence (often shortened to KL divergence).

The KL divergence measures the difference between two probability distributions p(x) and q(x):
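The formula did not survive in this copy of the page. The standard definition for discrete distributions, which is presumably what was shown, is:

```latex
D_{\mathrm{KL}}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}
```

It is zero when p and q are identical and grows as they diverge. It is not symmetric: swapping p and q generally changes the value.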

Therefore, a lower KL divergence score suggests the two distributions are more similar.

We can thus quantitatively measure the similarity or difference between an uploaded picture and the community. For each uploaded picture, a KL divergence score is computed by comparing the picture’s distribution with the target community’s distribution. The KL scores are then ranked, from lowest to highest, to suggest which pictures best match the target community’s images.
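A minimal sketch of this ranking step follows. It assumes the picture's distribution plays the role of p and the community's distribution the role of q; the text does not state the direction. The file names and probabilities are invented for illustration.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for two discrete distributions over the same communities."""
    # Terms with p(x) = 0 contribute nothing; eps guards against q(x) = 0.
    return sum(
        pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0
    )

def rank_by_kl(picture_dists, community_dist):
    """Sort pictures from most to least similar to the target community."""
    scored = [
        (name, kl_divergence(dist, community_dist))
        for name, dist in picture_dists.items()
    ]
    return sorted(scored, key=lambda item: item[1])

# Hypothetical distributions over three communities.
community = [0.7, 0.2, 0.1]
pictures = {
    "beach.jpg": [0.6, 0.3, 0.1],
    "dinner.jpg": [0.1, 0.2, 0.7],
    "pool.jpg": [0.7, 0.2, 0.1],
}
ranking = rank_by_kl(pictures, community)
for name, score in ranking:
    print(f"{name}: {score:.4f}")
```

The picture whose distribution matches the community exactly gets a score of zero and is listed first.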

Now, we would like to use our trained and pickled logistic SVM model to predict whether an incoming image will receive a 0 or a 1 (that is, whether the post is predicted to have ‘engagement’). Once a label is predicted, we use a linear combination of these labels (0/1) and the KL divergence scores. Because a low KL divergence score is best, we penalized pictures with a predicted poor engagement label (0) by increasing their final score: we added 1 to the KL score of pictures that were predicted to be less popular. This was a conservative decision, as many of the images uploaded by the social media manager are expected to be high quality, professional photos with strong lighting and composition. This was qualitatively tested on a sample of previous Le Sirenuse professional photographs.
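The combination rule described above (add 1 to the KL score when predicted engagement is 0) can be sketched as follows. The KL scores and predicted labels are hypothetical inputs standing in for the outputs of the earlier steps.

```python
def final_score(kl_score, predicted_engaging):
    """Lower is better: KL score plus a penalty of 1 for predicted label 0."""
    penalty = 0.0 if predicted_engaging == 1 else 1.0
    return kl_score + penalty

# (picture, KL score against the target community, predicted 0/1 label)
candidates = [
    ("pool.jpg", 0.00, 0),
    ("beach.jpg", 0.03, 1),
    ("dinner.jpg", 1.17, 1),
]
ranked = sorted(candidates, key=lambda c: final_score(c[1], c[2]))
for name, kl_score, label in ranked:
    print(name, final_score(kl_score, label))
```

With these inputs the closest match by KL alone drops to second place because it is predicted to be unpopular, while a picture with a very high KL score still ranks last despite a positive prediction.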

Suggesting Text and Hashtags using Community Similarity while Maximizing Engagement

Now, we return to community characterization. To best target a community, one should know how to speak like its members and talk about similar themes. Once an engagement metric was established (a standardized ‘likes’ score), an engagement score was calculated for each post per community. This score then signals which accounts in the community have the highest engagement. Our goal is to collect the text (topics and hashtags) used across the entire community’s posts, but weight the words by the posts’ engagement scores.

The text of every post in the corpus was split into caption and hashtags. Each of these was then transformed into numerical form using Word2vec, a two-layer neural net that processes text and outputs a set of vectors. Next, the words used in each post were counted, and the counts were multiplied by the engagement score for that post. These weighted counts were then summed per word across posts to identify the most “popular” words used within a community. Therefore, words from more popular posts rank higher than words from posts with little engagement. Now, we have a compilation of popular words and hashtags used by each community.
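The counting and weighting step can be sketched as below. The Word2vec transformation is left out, and the tokenizer, the sample posts, and their engagement scores are illustrative assumptions.

```python
import re
from collections import Counter, defaultdict

def weighted_word_scores(posts):
    """Sum word counts across posts, each weighted by the post's engagement."""
    totals = defaultdict(float)
    for text, engagement in posts:
        # Keep hashtags as distinct tokens by allowing a leading '#'.
        counts = Counter(re.findall(r"#?\w+", text.lower()))
        for word, n in counts.items():
            totals[word] += n * engagement
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical (post text, engagement score) pairs for one community.
community_posts = [
    ("Sunset in Positano #amalfi", 0.9),
    ("Positano lunch #amalfi #food", 0.4),
    ("Rainy day", 0.1),
]
top_words = weighted_word_scores(community_posts)
for word, score in top_words[:4]:
    print(word, round(score, 2))
```

Words that appear in several well-received posts accumulate the highest totals, while words that only appear in low-engagement posts fall to the bottom of the list.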

FOLLOWERS CLASSIFICATION

Why detect brands and retailers

We’re performing clustering and recommendation based on Instagram content (captions and pictures). Therefore, content must be relevant for the client.

Brands and retailers produce on average more than double the content of consumers. Moreover, their posts usually target specific topics. This may introduce biases in our training datasets of Instagram accounts, hiding more relevant insights about consumers' habits.

Available Data

We were given a list of 685 hand labeled users, of which:

538 consumers
146 brands and retailers

More specifically, we were given Instagram data about the users' accounts:

Following count
Followers count
Profile picture
Number of posts
Biography

And of their posts:

Tags
Caption
Likes count
Comments count

Features Engineering

How can we distinguish brand and retailer accounts from consumer accounts? Which features and characteristics discriminate between the two classes?

We're looking for different habits or strategies for posting content, such as differences in the number of posts per day, the distribution of posts over the time of day, or the usage of tags and mentions. The underlying assumption of this analysis is that, regardless of the actual content of the posts, the online behaviour of the user can be a good indicator of whether or not we are dealing with a consumer. Further indications about the user's nature can be found in the content the user chooses to present herself on Instagram, such as the profile picture and the biography.

To find the right features to distinguish users of the two classes, we started from the simplest, rawest data we had: account metadata. Some preliminary analyses were performed representing each user with all of the account metadata available, such as:

Following count
Followers count
Number of posts

The results of these preliminary analyses were quite disappointing, so we started investigating which features could be relevant and which could not.

Posts count

Based on the data we had, we could compute two different post counts: the total post count, which is the one indicated in a user's profile, or the count of posts for which we had metadata, which are the posts that the user uploaded during 2017. The second choice seemed much more relevant, as it introduces a kind of normalization, telling us how many posts each user had uploaded in a given time window.

Average number of posts per day for consumers: 0.45
Average number of posts per day for brands and retailers: 0.92

The results were already interesting: brands and retailers produce double the number of posts compared to consumers, so this feature might be relevant for our task.

Posting hours

Keeping in mind our assumption we looked for a first very simple posting strategy: we tried to compare the times at which the two classes of users uploaded their posts. Brands and Retailers may try to concentrate their posts in specific times of the day in order to maximize the exposure of their content to consumers. These were the resulting distributions:

Number of posts per hour from consumers

Number of posts per hour from brands and retailers

In this case we can't observe a meaningful difference in the distribution of posts over time, so it is not going to serve us as a feature to distinguish the two classes.

Tags and Mentions

As we stated before, we're not looking directly at content, but rather trying to define a common behaviour. In this case we look for regularity in the usage of tags and mentions. Brands and retailers tend to use tags and mentions strategically in order to target a given audience, while normal consumers usually don’t. A very simple approach is the following: build a vector counting the number of times each tag was used by a user, then compute the mean and variance of that vector. Using this same approach with mentions and adding these 4 features to the representation of a user brought slight improvements to the classifier.
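The four features can be sketched as follows. The text does not say whether the variance is the population or the sample variance; the population variance is assumed here, and the tags and mentions are made up.

```python
from collections import Counter
from statistics import mean, pvariance

def usage_stats(items_per_post):
    """Mean and population variance of how often each distinct item is used."""
    counts = Counter(item for post in items_per_post for item in post)
    values = list(counts.values())
    if not values:
        # A user who never uses tags (or mentions) gets zeros.
        return 0.0, 0.0
    return mean(values), pvariance(values)

# Hypothetical tags and mentions for the posts of one user.
tags_per_post = [["#amalfi", "#positano"], ["#amalfi"], ["#amalfi", "#sea"]]
mentions_per_post = [["@lesirenuse"], [], ["@lesirenuse"]]

# Four features: tag mean, tag variance, mention mean, mention variance.
features = [*usage_stats(tags_per_post), *usage_stats(mentions_per_post)]
print(features)
```

A high mean with low variance suggests a small, fixed set of tags reused on every post, which is the strategic pattern the text attributes to brands and retailers.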

Profile Pictures

In the case of profile pictures, there were two completely different behaviours that could easily be recognized just by looking at the users' profiles: brands and retailers have a logo or text in their profile picture, whereas consumers have a picture of themselves, usually their face. Using the OpenCV library, more specifically Haar Cascades, we were able to perform face detection on the users' profile pictures and add the result as a feature. Moreover, we used Tesseract OCR to detect the presence of text in the picture. These two features were highly relevant and improved the classifier's performance.

Biographies

As with profile pictures, bios are used by users to describe themselves, so the vocabulary used by the two classes of users can be expected to differ substantially. Moreover, Instagram bios are usually not complete sentences, but rather a sequence of words that describe the user. For this reason we used a Count Vectorizer, which takes the bio text as input and returns a vector counting the occurrences of each word. As expected, the results differed quite a lot!

Occurrences per word in consumers' bios

Occurrences per word in brands' and retailers' bios

Adding the presence of words from these vocabularies in the bios of the users showed improvements in the results of the classifier.

Results

The following results were obtained using 3-fold cross-validation and grid search for parameter optimization.

Support Vector Classifier

Recall mean: 0.8904
Precision mean: 0.5383
F1 mean: 0.6707
Accuracy mean: 0.8131

Logistic Regression

Recall mean: 0.7602
Precision mean: 0.6241
F1 mean: 0.6845
Accuracy mean: 0.8511

Multi Layer Perceptron Classifier

Recall mean: 0.6369
Precision mean: 0.7153
F1 mean: 0.6689
Accuracy mean: 0.8642

Random Forest Classifier

Recall mean: 0.6918
Precision mean: 0.7511
F1 mean: 0.7195
Accuracy mean: 0.8847

Since the purpose of this classifier is to detect brand and retailer profiles in order to remove their posts from the clustering, we aim for the highest possible recall, as it removes the highest number of unwanted posts. False negatives in this case have a much higher cost than false positives (brands and retailers post 335 times per year on average, versus 160 for consumers). Thus, in this setting, the Support Vector Classifier is the best-performing classifier for the task.
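For reference, the four metrics reported above can be computed from predictions as sketched below, with brands and retailers as the positive class. The labels are a toy example, not the project's data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 and accuracy for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}

# Toy labels: 1 = brand or retailer, 0 = consumer.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Recall is the share of true brand and retailer accounts that the classifier catches, so a missed account (a false negative) leaves all of its posts in the clustering data.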
