Work at MIT: Recommending Skincare Products w/ Natural Language Processing (NLP) + Collaborative Filtering


Some of you might be wondering what I’ve been up to here at MIT. Short answer: a lot! I just finished a project where we built a skincare product recommendation algorithm using Natural Language Processing and Collaborative Filtering methods taught at MIT. We worked in collaboration with my teammate’s startup, Atolla Skin Lab.

Using Text Analytics to Create a Recommendation Algorithm for Skincare Products: Leveraging Reddit comments to recommend skincare products by skin type and concern

Jason Kwok, Sid Salvi, Alejandro Fernandez del Castillo


According to Vaseline, the average American woman throws away about $100,000 in partially used personal care products over her lifetime. In an MIT study, 65% of women found the skincare shopping process overwhelming, and 50% of millennial women try a new beauty product every month (McKinsey, 2018). To simplify the skincare selection process, people are crowdsourcing advice, turning to online forums and social media to discuss skin concerns, recommend products and receive suggestions. Two root problems underpin the frustrating and painful trial-and-error process in skincare (and the personal care industry more broadly): 1) people don’t understand their skin, and 2) people don’t know how to match products / ingredients to their skin. We aim to create a recommendation algorithm that addresses the second root problem: given a specific skin type and/or skin concern, what are the best products and ingredients to use?

Summary of Approach and Findings:

To develop the logic for our recommendation algorithm, we devised a three-step approach: 1) collect and label unstructured data from Reddit (comments about skin and skincare), 2) test three different classification and sentiment analysis techniques, and 3) use the classification and sentiment results to give ranked product recommendations by skin type and concern. The three analysis techniques we tested are CART, Neural Networks and Collaborative Filtering. Neural Networks had poor classification accuracy (20%). While CART had a robust accuracy of 74%, its results were difficult to generalize into a recommendation algorithm. Calculating the cosine similarity of products from the collaborative filtering results provided a useful and robust basis for a skincare recommendation algorithm.

Using collaborative filtering with cosine similarity, we predicted the highest rated skincare products for specific skin types and concerns, with an average cosine similarity of 0.22 for the top 5 related products. This model provides the basis for product recommendations when combined with the archetypes that group customers by different skin conditions and historical use. For example, below are three of the archetypal groups found using collaborative filtering, with the highest ranked product for those customers shown in parentheses.

  • A-1: Besides acne, skin is healthy, and prefers topical products (Tretinoin)

  • A-2: Failed topical solution, switched to medicine (Accutane)

  • A-4: Millennial consumer that is focused on specific active ingredients (Niacinamide), instead of medicine (Accutane)

In addition, we can use the cosine similarity scores to recommend a portfolio of complementary products / ingredients to a particular customer.  Below is one such portfolio.

While this is a strong start to a recommendation algorithm, we would want to strengthen this work by expanding the data used by incorporating more comments and leveraging unsupervised learning techniques to flag skin concerns, products and ingredients in the data and to do sentiment analysis.

In the following pages, we describe in more detail 1) data used and how we cleaned/transformed the data, 2) the collaborative filtering model and results, and 3) recommendations and next steps.  We discuss other techniques tested in the appendices (Appendix A – Neural Network, Appendix B – CART).


To create the recommendation algorithm, we have gathered a vast amount of skin-related comments from Reddit (sub-reddit Skincare Addiction). These ~2,000 threads and ~70,000 comments and replies include positive and negative product reviews, skincare routines and help requests followed by recommendations. Users frequently list their skin-care routines and products, and in great detail outline their satisfaction, or lack thereof.

The unstructured and unorganized nature of Reddit comments makes this analysis especially challenging. We had three primary pieces of information to glean from each comment: (1) the skin type or skin concern (e.g., dry skin, acne), (2) the products or ingredients in discussion, and (3) the user’s sentiment about the products or ingredients (positive or negative).

  1. Skin Type and/or Condition: We experimented unsuccessfully with TF-IDF (term frequency, inverse document frequency) and LDA (Latent Dirichlet Allocation) models to identify these topics automatically, so we landed on simple keyword matches against a dictionary of 34 types and concerns. If a comment mentioned any of these keywords, we tagged it with that type or concern. For example, the dictionary included: Discoloration, Hyperpigmentation, Congestion, Acne

  2. Product and/or Ingredients: We also used keyword matches against a dictionary of almost 300 products and ingredients. Examples from this dictionary include Clinical Vitamin C, hyaluronic acid, niacinamide, retinol, Accutane, and more

  3. Sentiment: Sentiment posed the greatest complexity, simply because very long comments expressed very complex thoughts. The sentiment score was based on the presence of words from a public dictionary of 4K positive keywords and 4K negative keywords. These sentiment scores were then normalized to a scale of 1-10.
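The three tagging steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the tiny dictionaries, the function names, the "positive minus negative word count" scoring, and the min-max rescaling onto 1-10 are all assumptions made for the example.

```python
import re

# Hypothetical, tiny stand-ins for the real dictionaries
# (34 skin types/concerns, ~300 products/ingredients, ~4K positive and ~4K negative words).
SKIN_TERMS = {"acne", "oily skin", "dry skin", "hyperpigmentation"}
PRODUCT_TERMS = {"niacinamide", "retinol", "accutane", "salicylic acid"}
POSITIVE_WORDS = {"happy", "love", "great", "helps"}
NEGATIVE_WORDS = {"worse", "severe", "irritated", "hate"}


def find_terms(comment, terms):
    """Return the dictionary terms that appear in the comment as whole words or phrases."""
    text = comment.lower()
    return sorted(t for t in terms if re.search(r"\b" + re.escape(t) + r"\b", text))


def raw_sentiment(comment):
    """Count positive words minus negative words (an assumed, non-normalized score)."""
    words = re.findall(r"[a-z']+", comment.lower())
    return sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)


def normalize_1_to_10(scores):
    """Min-max rescale raw scores onto a 1-10 range (an assumed normalization)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [5.5 for _ in scores]  # every comment scored the same: use the midpoint
    return [1 + 9 * (s - lo) / (hi - lo) for s in scores]


# Made-up comments for illustration only.
comments = [
    "I'm so happy with niacinamide, it really helps my oily skin and acne.",
    "Accutane made my dry skin worse and I had severe side effects.",
]
tagged = [
    {
        "skin": find_terms(c, SKIN_TERMS),
        "products": find_terms(c, PRODUCT_TERMS),
        "raw": raw_sentiment(c),
    }
    for c in comments
]
normalized = normalize_1_to_10([t["raw"] for t in tagged])
```

A plain keyword match like this is easy to audit, but it only finds terms that are already in the dictionaries, which is the limitation discussed under next steps.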

Below are samples of the reclassified comments, tagged with (1) skin type or concern, (2) product or ingredient, and (3) sentiment score (not normalized):

“I’m *so* happy with my routine and I want to share! With a lot more patience than I thought I would need, hopefully this helps anyone with similar skin woes. I have moderately oily skin (used to be a lot worse) and deal with cystic acne and PIE/PIH…”

  • Skin Type or Concern: oily skin, acne, PIE

  • Product or Ingredient: Salicylic acid, AHA, BHA


“Went off of it before my treatment was supposed to be officially over because I was having anxiety attacks daily, had severe suicidal ideation and my liver enzymes were through the roof. I’m now on a crazy restrictive diet to manage my condition – no dairy, no gluten and no meat. If I have one speck of gluten it tears apart my insides. I was completely healthy before I took accutane…”




Approach and Model:

We applied collaborative filtering to group products using cosine similarity. Much as R² summarizes the fit of a model in a single number, the cosine similarity metric summarizes how similarly two products were rated as a number between 0 and 1, computed by comparing the products’ user ratings vectors. The more similarly a large enough group of customers rated two products, the closer the score is to 1. For example, if 10 people who love Aquafor also like Cetaphil, we will see a score close to 1. In contrast, if none of these users like Hempseed, we will see a score closer to 0. Refer to Appendix C for a more thorough explanation of cosine similarity.

Cosine Similarity (Product A, Product B) = Dot_Product(A, B) / [ √(∑A^2) * √(∑B^2) ]

where A and B are the user ratings vectors of Product A and Product B.

For the top 5 related products, we found an average cosine similarity score of 0.22. Broken out by rank, we see an average cosine similarity of 0.325 for the first related product, which decreases to close to 0.15 for the 5th related product. For example, we can see the high similarity of Aquafor to four other products, with a cosine similarity score of 0.76 with Cetaphil Hydrating Lotion, La Roche Gentle Face Cleanser, Neutrogena Sunscreen, and Uncle Harry’s Clay Mask Bar. Intuitively, for a person with dry skin that is prone to redness/irritation and who is concerned with sun damage and anti-aging, these products would work well together.
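To make the "top 5 related products" idea concrete, here is a small sketch of product-to-product cosine similarity and the average similarity by rank. The ratings are made up, the function names are hypothetical, and treating a missing rating as 0 is an assumption of this sketch rather than something the write-up specifies.

```python
import math

# Hypothetical ratings: product -> {user: normalized sentiment score}.
ratings = {
    "Aquafor": {"u1": 9, "u2": 8, "u3": 7},
    "Cetaphil": {"u1": 8, "u2": 9, "u3": 6},
    "Hempseed": {"u4": 9},
}


def cosine_similarity(a, b):
    """Cosine similarity of two sparse user-rating vectors; a missing rating counts as 0."""
    dot = sum(score * b[user] for user, score in a.items() if user in b)
    norm_a = math.sqrt(sum(s * s for s in a.values()))
    norm_b = math.sqrt(sum(s * s for s in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_related(product, ratings, k=5):
    """Return up to k (other_product, similarity) pairs, most similar first."""
    scores = [
        (other, cosine_similarity(ratings[product], ratings[other]))
        for other in ratings
        if other != product
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]


def mean_similarity_by_rank(ratings, k=5):
    """Average similarity of the 1st, 2nd, ... k-th related product across all products."""
    per_rank = [[] for _ in range(k)]
    for product in ratings:
        for rank, (_, sim) in enumerate(top_related(product, ratings, k)):
            per_rank[rank].append(sim)
    return [sum(v) / len(v) for v in per_rank if v]


related = top_related("Aquafor", ratings)
by_rank = mean_similarity_by_rank(ratings)
```

In this toy data, Cetaphil comes out as the closest match to Aquafor because the same users rated both highly, while Hempseed shares no raters and scores 0.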

Once we understand whether a user likes a product, we can use the collaborative filtering customer groups / product paths to recommend other products. While a product-to-product recommendation is a valuable first step, it is only usable when the user has expressed previous product preferences. In addition, some products, such as Accutane, can yield positive results for some users but very negative results for others, depending on the skin type.

To thoughtfully recommend products, we recommend linking the score to user characteristics: overlaying the product-to-product recommendations on skin type and skin concern. For example, we can see that Accutane tends to produce positive results for people who have acne. However, when users’ skin is dry, damaged, and sensitive, the ranking of Accutane drops quickly because it painfully dries out skin. In another case, we see that Vitamin A can be recommended across skin types and concerns to tone skin.
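One simple way to picture this overlay is to rank products separately within each skin concern. The sketch below does that with mean sentiment; the observations, the function name, and the choice of a plain average are illustrative assumptions, not the method used in the project.

```python
from collections import defaultdict

# Hypothetical tagged comments: (skin concern, product, normalized sentiment 1-10).
observations = [
    ("acne", "Accutane", 9),
    ("acne", "Accutane", 8),
    ("acne", "Vitamin A", 7),
    ("dry skin", "Accutane", 2),
    ("dry skin", "Vitamin A", 7),
    ("dry skin", "Vitamin A", 8),
]


def rank_products_by_concern(observations):
    """Map each skin concern to its products, sorted by mean sentiment (best first)."""
    scores = defaultdict(lambda: defaultdict(list))
    for concern, product, sentiment in observations:
        scores[concern][product].append(sentiment)
    rankings = {}
    for concern, by_product in scores.items():
        means = {p: sum(v) / len(v) for p, v in by_product.items()}
        rankings[concern] = sorted(means.items(), key=lambda kv: kv[1], reverse=True)
    return rankings


rankings = rank_products_by_concern(observations)
```

With these made-up numbers, Accutane ranks first for acne but last for dry skin, mirroring the pattern described above.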

Looking at the archetypes from collaborative filtering provides high-level insight into groupings of skin types, skin concerns, and product recommendations. From the ~1K users and 68 products in our dataset, there are 6 major skincare archetypes (product/ingredient recommendations in parentheses):

  • 1: Has acne, but skin is healthy (Tretinoin)

  • 2: Failed topical solution user, switched to medicine (Accutane)

  • 3: Fairly oily skin; overall neutral to semi-positive on many products (Avoid Vaseline)

  • 4: Millennial consumer that is focused on specific active ingredients (Niacinamide)

  • 5: Focused on anti-aging (AHA)

  • 6: Has acne, sensitive skin, seen success with specific OTC topical solutions (Differin)

Recommendations and Next Steps:

We recommend a skincare recommendation algorithm that uses product-to-product cosine similarity scores to help customers discover their perfect skincare portfolio. For example, if a user likes Cetaphil’s hydrating lotion, we can confidently recommend they try La Roche Hydrating Cleanser and Neutrogena Sunscreen.

To build upon this model, we suggest the following:

  1. Data Expansion:

  • Access more reviews: To build a comprehensive product graph of recommendations, we need more data. We accessed 1,800 threads and 70,000 comments, but after cleansing and API failures, only a little over 1K remained to build product recommendations. Downloading the entire subreddit would give us access to thousands if not millions more comments.

  • Expand dictionary: We found products, ingredients, skin types, and skin conditions through a brute-force dictionary match; our sample set is thus limited by the keywords we pre-identified. We could improve this both by expanding our dictionary and by refining unsupervised learning methods to automatically pull out the parts of speech and come up with new nouns.

  2. Model Expansion:

  • Unsupervised natural language analysis: Instead of relying on personally tagged dictionaries to manually flag words, automatically identifying parts of speech and products, as well as properly attributing products and skin-types/conditions to vague comments would expand the data.

  • Unsupervised sentiment analysis: As opposed to relying on hard-coded dictionaries, creating a proper neural network that could classify based on the actual comment writing would prove more accurate.

  • Ensemble Model: Test multiple elements of collaborative filtering beyond the model we used.

Appendix A: Neural Network Result

While we were optimistic about our Neural Network, the model’s recall was simply too low. We could only predict 6% of the test data with high enough accuracy. If we relaxed the threshold to ~70% accuracy, we would still only cover 18% of the data, which is again insufficient. One of the major roadblocks was that we trained on only 535 comments, while our Bag of Words corpus had ~2.4K words. Given this complexity, we were advised to match the amount of training data to the number of words. Because of this, we were forced to abandon the Neural Network. Our initial Neural Network had 3 layers (1 input layer, 1 hidden layer, and 1 output layer), and we tested between 20 and 200 hidden neurons, finding that the best performance came at 93 hidden neurons. Improvements would involve expanding our training data to over 3K comments, and re-testing parameters such as the number of layers, the number of hidden layer neurons, and alpha.

Table: Confidence from Sigmoid Function | Accuracy % | % of Total (first row: Not confident). Calculated from a test sample of 147.
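The trade-off between confidence, accuracy and share of data covered can be sketched as below. This is only an illustration under assumptions: "confidence" is taken to be how far the sigmoid output sits from 0.5 (computed as max(p, 1 - p)), the function name is hypothetical, and the probabilities and labels are made up.

```python
def coverage_and_accuracy(probabilities, labels, threshold):
    """Return (share of comments predicted confidently, accuracy on that share).

    A prediction counts as confident when max(p, 1 - p) >= threshold,
    where p is the sigmoid output for the positive class.
    """
    confident = [
        (p >= 0.5, bool(y))
        for p, y in zip(probabilities, labels)
        if max(p, 1 - p) >= threshold
    ]
    coverage = len(confident) / len(probabilities)
    if not confident:
        return coverage, None  # nothing passed the threshold, accuracy is undefined
    accuracy = sum(pred == truth for pred, truth in confident) / len(confident)
    return coverage, accuracy


# Made-up sigmoid outputs and true labels (1 = positive review).
probs = [0.95, 0.10, 0.60, 0.45, 0.85, 0.30]
labels = [1, 0, 0, 1, 0, 0]

strict = coverage_and_accuracy(probs, labels, 0.8)
loose = coverage_and_accuracy(probs, labels, 0.5)
```

Raising the threshold keeps only the predictions the network is most sure of, which typically raises accuracy but shrinks the share of comments that can be scored at all.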

Appendix B: CART

We used a CART model to predict the sentiment of a skincare comment. For that we stemmed and sparsified (lowfreq = 1000) the 605 manually tagged comments, and we divided them  into a training set of 400 and a test set of 205.

The resulting CART tree had an accuracy of 74%, a true positive rate (TPR) of 91% and a false positive rate (FPR) of 56%.
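For reference, the three metrics can be computed from predictions as in the sketch below. The labels are toy values rather than the project's test set, and the function name is hypothetical.

```python
def classification_metrics(predicted, actual):
    """Return (accuracy, TPR, FPR) for binary labels where 1 means a positive review."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == 1 and a == 1 for p, a in pairs)  # positive reviews correctly flagged
    tn = sum(p == 0 and a == 0 for p, a in pairs)  # non-positive comments correctly rejected
    fp = sum(p == 1 and a == 0 for p, a in pairs)  # non-positive comments flagged as positive
    fn = sum(p == 0 and a == 1 for p, a in pairs)  # positive reviews that were missed
    accuracy = (tp + tn) / len(pairs)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return accuracy, tpr, fpr


# Toy example, not the project's data.
actual = [1, 1, 1, 1, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 1, 0, 0]
metrics = classification_metrics(predicted, actual)
```

By these definitions, an FPR of 56% means that more than half of the comments that were not positive reviews were still classified as positive, which is worth keeping in mind alongside the 74% accuracy.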

The metrics of the resulting tree are not bad, but it takes a bit of effort to extract the logic of the sentiment analysis:

  • All comments and replies that have the word “thank” are not positive reviews of products

  • Comments that include “look” and don’t include “get” or “cerav” are positive reviews

  • Mentions of categories of products like “cleanser” or “moistur” are associated with positive reviews

  • Words like “get” and “know” are also indicators that the comments are not positive reviews

We felt that these rules alone would not be enough to properly categorize the rest of the comments, so we decided to discard this analysis.

Appendix C: Collaborative Filtering with Cosine Similarity

By taking the Cosine Similarity, we are capturing whether products are liked by the same grouping of users. If a user (or group of users) thinks similarly about a group of products, then those products are likely very similar (or complementary).

For example, here, UserA likes Product1 and Product2. This hints that Products 1 and 2 are similar! UserC also likes Product1 and Product2! This further reinforces that the two products are similar for these users’ use cases. Meanwhile, UserB dislikes Product1 and Product2, but has a clear preference for Product3 (this is clearly different from UserA and UserC’s preferences).

Collaborative filtering through Cosine Similarity compares each product’s ratings vector (one entry per user) against the others to mathematically compute the similarity described above.

Ex 1

Product1: [2, 0, 2]

Product2: [2, 0, 2]

Cosine_Similarity(Product1, Product2)

= Dot_Product(A,B)/ [ √(∑A^2) * √(∑B^2) ]

= (2*2 + 0*0 + 2*2) / [√(2^2 + 0 + 2^2) * √(2^2 + 0 + 2^2)]

= (8) / (2.83 * 2.83)

= 1 ::: these two vectors are identical (all customers feel the same about these two products), so they have the highest possible Cosine Similarity Score of 1. Products 1 and 2 are similar!

Ex 2

Product2: [2, 0, 2]

Product3: [0, 2, 0]

Cosine_Similarity(Product2, Product3)

= Dot_Product(A,B)/ [ √(∑A^2) * √(∑B^2) ]

= (2*0 + 0*2 + 2*0) / [√(2^2 + 0 + 2^2) * √(0 + 2^2 + 0)]

= (0) / (2.83 * 2)

= 0 ::: these two vectors are completely orthogonal (90 degrees) to each other (customers tend to think very differently about these two products), so they have the lowest Cosine Similarity Score of 0. Products 2 and 3 are very different.
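The two worked examples can be checked with a few lines of code. The sketch below starts from a user-by-product table matching the UserA / UserB / UserC story (2 = likes, 0 = does not, as in the vectors of Ex 1 and Ex 2) and transposes it into one ratings vector per product; the variable and function names are hypothetical.

```python
import math

# Rows are users, columns are Product1, Product2, Product3.
user_ratings = {
    "UserA": [2, 2, 0],
    "UserB": [0, 0, 2],
    "UserC": [2, 2, 0],
}
products = ["Product1", "Product2", "Product3"]

# Transpose: one ratings vector per product, ordered UserA, UserB, UserC.
vectors = {
    name: [user_ratings[user][i] for user in ("UserA", "UserB", "UserC")]
    for i, name in enumerate(products)
}


def cosine_similarity(a, b):
    """Dot product divided by the product of the two vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


ex1 = cosine_similarity(vectors["Product1"], vectors["Product2"])  # Ex 1
ex2 = cosine_similarity(vectors["Product2"], vectors["Product3"])  # Ex 2
```

Because of floating-point rounding, `ex1` may come out a hair below 1 rather than exactly 1, so comparing it with `math.isclose(ex1, 1.0)` is safer than testing for equality.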