Text mining seeks to derive meaning from text.
For example, we can analyse customer reviews and identify the most commonly used words to find possible reasons for low sales.
Before counting repeated words, clean the text data:
- Remove white space
- Tokenize (split the text into words)
- Convert to lowercase
- Remove punctuation
- Remove stop words (is, and, the, has, his)
- Remove numbers
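The cleaning steps above can be sketched with the standard library alone. This is a minimal illustration, not the NLTK-based code used later: the function name clean_text and the tiny STOP_WORDS set are hypothetical, and a plain split() stands in for a real tokenizer.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"is", "and", "the", "has", "his"}

def clean_text(text, stop_words=STOP_WORDS):
    """Return a list of cleaned tokens from raw text."""
    text = text.lower()                          # lowercase
    text = re.sub(r"\d+", "", text)              # remove numbers
    text = text.translate(
        str.maketrans("", "", string.punctuation)
    )                                            # remove punctuation
    tokens = text.split()                        # tokenize; also drops extra white space
    return [t for t in tokens if t not in stop_words]  # remove stop words

tokens = clean_text("The phone is great, and the battery has lasted 2 days!")
print(tokens)
```

Lowercasing before the stop-word check matters here: otherwise "The" would not match "the" in the stop-word set.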
You can present your findings in a word cloud, which draws the most frequent words in larger text and the least frequent words in smaller text.
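To make the idea of "more frequent means larger" concrete, here is a small sketch that maps word counts to font sizes with simple linear scaling. It is only an illustration of the principle and is not how the wordcloud library computes sizes; the function font_sizes and its min_size and max_size parameters are hypothetical.

```python
from collections import Counter

def font_sizes(words, min_size=10, max_size=60):
    """Map each word to a font size that grows linearly with its count."""
    counts = Counter(words)
    if not counts:
        return {}
    lo, hi = min(counts.values()), max(counts.values())
    sizes = {}
    for word, n in counts.items():
        if hi == lo:
            # All words are equally frequent, so give them the same size.
            sizes[word] = max_size
        else:
            sizes[word] = min_size + (n - lo) * (max_size - min_size) / (hi - lo)
    return sizes

sizes = font_sizes(["slow", "slow", "slow", "battery", "battery", "price"])
print(sizes)
```

The most frequent word gets max_size, the least frequent gets min_size, and everything else falls in between.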
First, install wordcloud (the code below also uses nltk and matplotlib). Go to your terminal and type:
pip install wordcloud
Now go to your code and do the following:
import re
import string

a = "Hello hello world 123 34, ?? make Hiiii"
a = a.lower()

# \d+ matches one or more digits.
# Here we replace every run of digits with an empty string.
a = re.sub(r"\d+", '', a)

# Remove punctuation.
s = str.maketrans('', '', string.punctuation)
a = a.translate(s)

# Strip leading and trailing white space before tokenizing.
a = a.strip()

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time download of the NLTK data used below.
# Newer NLTK releases may ask for 'punkt_tab' instead of 'punkt'.
nltk.download('stopwords')
nltk.download('punkt')

st = stopwords.words('english')
st.append('the')  # you can add any stop words you need.

# Separate the text word by word.
lst = word_tokenize(a)

# Drop the stop words. Build a new list rather than calling remove()
# while looping over the same list, which skips elements.
lst = [i for i in lst if i not in st]

# Here we find the frequency of each word.
from nltk.probability import FreqDist
fdist1 = FreqDist()
for i in lst:
    fdist1[i] += 1

a = ' '.join(lst)

from wordcloud import WordCloud
cloud = WordCloud().generate(a)

import matplotlib.pyplot as plt
plt.imshow(cloud)
plt.axis('off')  # hide the axes around the image
plt.show()
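Before drawing the cloud, it can help to look at the counts directly. The sketch below uses collections.Counter from the standard library as a stand-in for FreqDist (in NLTK, FreqDist is a Counter subclass, so it offers the same most_common method). The token list is made up for illustration.

```python
from collections import Counter

# Illustrative tokens, as they might look after cleaning.
tokens = ["hello", "hello", "world", "make", "hiiii"]

freq = Counter(tokens)

# most_common(n) returns the n most frequent (word, count) pairs.
top = freq.most_common(2)
print(top)
```

The first pair is the most frequent word, which is the one the word cloud would draw largest.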