NLP Walkthrough
Table of Contents
- 1. Introduction to Pandas
- 1.1. Load the given CSV file containing text and label columns into a Pandas DataFrame.
- 1.2. Display the first three records of the dataset and print the total number of rows and columns.
- 1.3. Print the column names and data types of each column.
- 1.4. Check whether the dataset contains any missing values and display the count of missing values per column.
- 1.5. Find the number of unique labels present in the dataset.
- 1.6. Display the count of records for each label.
- 1.7. Filter and display all rows where the label is “Negative”.
- 1.8. Create a new column that stores the number of characters present in each text.
- 1.9. Create another column that stores the number of words in each text.
- 1.10. Convert all text entries to lowercase and store the result in a new column without modifying the original text column.
- 1.11. Standardize the label column by converting all labels to lowercase.
- 1.12. Create a new column in which the labels are encoded as follows: positive → 1, neutral → 0, negative → −1.
- 1.13. Sort the DataFrame in descending order based on the char count column.
- 1.14. Display the top three longest text entries based on word count.
- 1.15. Filter and display all texts that contain more than six words and belong to the “positive” label.
- 1.16. Group the data by label and compute the average word count for each group.
- 1.17. Identify the label that has the highest average word count.
- 1.18. Count the number of text entries with fewer than five words for each label.
- 1.19. Change the DataFrame index to start from 1 instead of 0.
- 1.20. Rename the columns text to review_text and label to sentiment.
- 1.21. Update the sentiment of a specific record (given index) from “neutral” to “positive”.
- 1.22. Create a new DataFrame that contains only the rows with positive sentiment.
- 1.23. Save the cleaned and updated DataFrame to a new CSV file.
- 1.24. Reload the saved file and verify that the changes are correctly reflected.
1. Introduction to Pandas
import pandas as pd
import sys
import random
1.1. Load the given CSV file containing text and label columns into a Pandas DataFrame.
df = pd.read_csv("reviews_dataset.csv")
1.2. Display the first three records of the dataset and print the total number of rows and columns.
print(type(df.head(3)))
print(df.head(3))
<class 'pandas.core.frame.DataFrame'>
text label
0 This product is amazing and works perfectly Positive
1 Terrible experience, would not recommend Negative
2 It's okay, nothing special Neutral
If you don’t give any parameter to df.head(), it returns the first 5 records by
default (0, 1, 2, 3, 4).
print(type(df.shape))
print(df.shape)
<class 'tuple'>
(100, 2)
df.shape is a tuple containing the number of rows followed by the number of columns.
1.3. Print the column names and data types of each column.
print(type(df.columns))
print(df.columns)
<class 'pandas.core.indexes.base.Index'>
Index(['text', 'label'], dtype='object')
To get the column names one at a time, you can iterate over df.columns:
for name in df.columns:
    print(name)
text
label
This is how you access the value in the first row of the column ’text’:
print(df.at[0, 'text'])
print(type(df.at[0, 'text']))
This product is amazing and works perfectly
<class 'str'>
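The task also asks for the data type of each column. A minimal sketch of how that could be printed, using a small made-up DataFrame as a stand-in for the reviews file (the rows are illustrative, not the real data):

```python
import pandas as pd

# Made-up stand-in for the reviews dataset (assumption, not the real file).
sample = pd.DataFrame({
    "text": ["Great value", "Poor design"],
    "label": ["Positive", "Negative"],
})

column_names = list(sample.columns)  # plain Python list of column names
column_dtypes = sample.dtypes        # Series mapping column name -> dtype

print(column_names)
print(column_dtypes)
```

On the real DataFrame the same call would be print(df.dtypes). How string columns are reported depends on the pandas version; the reloaded output in 1.24 shows them as object.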
1.4. Check whether the dataset contains any missing values and display the count of missing values per column.
print(df.isnull())
print(type(df.isnull()))
     text  label
0   False  False
1   False  False
2   False  False
3   False  False
4   False  False
..    ...    ...
95  False  False
96  False  False
97  False  False
98  False  False
99  False  False

[100 rows x 2 columns]
<class 'pandas.core.frame.DataFrame'>
- df.isnull() returns a Pandas DataFrame of the same shape as df, holding True (for a missing value such as NaN) or False for each cell of the text and label columns.
- You can call the sum() method on this DataFrame.
- DataFrame.sum() takes an optional parameter axis. With the default axis=0 it adds up the values in each column (one total per column); with sum(axis=1) it adds up the values in each row (one total per row).
print(df.isnull().sum())
text     0
label    0
dtype: int64
- This is the sum of the boolean values in the column text, and the sum of the boolean values in the column label. In Python, bool is a subclass of int, so True behaves as the number 1 and False as the number 0. That’s how the sum() function is able to work.
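Since the real dataset has no missing values, the effect of axis is easier to see on a frame that does. The sketch below uses a made-up three-row DataFrame with two deliberately missing cells; the data and variable names are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up data with one missing text and one missing label.
sample = pd.DataFrame({
    "text": ["Great value", None, "Poor design"],
    "label": ["Positive", "Neutral", np.nan],
})

missing_per_column = sample.isnull().sum()     # axis=0: one total per column
missing_per_row = sample.isnull().sum(axis=1)  # axis=1: one total per row
has_missing = bool(sample.isnull().any().any())  # True if any cell is missing

print(missing_per_column)
print(missing_per_row)
print(has_missing)
```

The any().any() chain first reduces each column to a single boolean and then reduces those to one, which answers the yes/no part of the task directly.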
1.5. Find the number of unique labels present in the dataset.
df.label or df['label'] returns a Pandas Series (1 Dimensional), corresponding
to the column label.
print(df.label)  # or print(df['label'])
print(type(df.label))
0 Positive
1 Negative
2 Neutral
3 Positive
4 Negative
...
95 Neutral
96 Positive
97 Negative
98 Neutral
99 Positive
Name: label, Length: 100, dtype: object
<class 'pandas.core.series.Series'>
The .unique() method returns a NumPy array consisting of all the unique values
in the Series.
print(df['label'].unique())
print(type(df['label'].unique()))
['Positive' 'Negative' 'Neutral']
<class 'numpy.ndarray'>
To find the count of unique values, you could either take the length of .unique():
print(len(df['label'].unique()))
3
Or you could use the .nunique() method:
print(df['label'].nunique())
3
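The two approaches agree here because the label column has no missing values. When a column does contain them, they can differ: unique() keeps the missing value as an entry, while nunique() skips it unless dropna=False is passed. A small sketch on a made-up Series (not the real data):

```python
import pandas as pd

# Made-up labels with one missing entry.
labels = pd.Series(["Positive", "Negative", None, "Positive"])

count_via_unique = len(labels.unique())            # missing value is counted
count_via_nunique = labels.nunique()               # missing value is skipped
count_with_missing = labels.nunique(dropna=False)  # missing value is counted

print(count_via_unique, count_via_nunique, count_with_missing)
```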
1.6. Display the count of records for each label.
print(df["label"].value_counts())
label
Positive    35
Negative    33
Neutral     32
Name: count, dtype: int64
And to get the number of unique values you could do this too:
print(len(df["label"].value_counts()))
3
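value_counts() can also report proportions instead of raw counts through its normalize parameter, which is handy for checking class balance. A minimal sketch on a made-up Series of labels:

```python
import pandas as pd

# Made-up labels, used only to illustrate the call.
labels = pd.Series(["Positive", "Negative", "Neutral", "Positive"])

counts = labels.value_counts()                # absolute number of records per label
shares = labels.value_counts(normalize=True)  # fraction of records per label

print(counts)
print(shares)
```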
1.7. Filter and display all rows where the label is “Negative”.
Comparing a column with a value evaluates a boolean expression for every record.
The expression df['label'] == "Negative" returns a Pandas Series of boolean values
(with the same length as df), where True means the record satisfies the condition.
The comparison is case-sensitive, so the value must exactly match one of the labels
listed by .unique() or .value_counts().
print(type(df['label'] == "Negative"))
print(df['label'] == "Negative")
<class 'pandas.core.series.Series'>
0 False
1 True
2 False
3 False
4 True
...
95 False
96 False
97 True
98 False
99 False
Name: label, Length: 100, dtype: bool
You can index a Pandas DataFrame by a list/Series of boolean values. The list should have the same length as the DataFrame, and each boolean value tells you whether that record should appear in the returned DataFrame.
print(df[[i % 25 == 0 for i in range(len(df))]])  # Gives only records 0, 25, 50, 75.
                                           text     label
0   This product is amazing and works perfectly  Positive
25                  Failed to meet expectations  Negative
50                  Passable but not impressive   Neutral
75                Superb engineering and design  Positive
So you can use the df['label'] == 'Negative' Series to index the Pandas DataFrame.
print(df[df['label'] == 'Negative'])
    text  label
1  Terrible experience, would not recommend  Negative
4  Worst product I've ever bought  Negative
7  Disappointing and overpriced  Negative
10  Complete waste of money  Negative
13  Poor quality, broke after one use  Negative
16  Not worth the investment  Negative
19  Subpar quality, very unhappy  Negative
22  Horrible customer service  Negative
25  Failed to meet expectations  Negative
28  Completely dissatisfied with purchase  Negative
31  Awful, regret buying this  Negative
34  Disastrous purchase, avoid at all costs  Negative
37  Unacceptable quality standards  Negative
40  Rubbish, total disappointment  Negative
43  Defective and poorly made  Negative
46  Appalling quality control  Negative
49  Dreadful experience from start to finish  Negative
52  Unpleasant surprise, very poor  Negative
55  Horrible quality, fell apart quickly  Negative
58  Pathetic excuse for a product  Negative
61  Miserable failure of product  Negative
64  Ghastly quality, returned immediately  Negative
67  Atrocious build quality  Negative
70  Abysmal product quality  Negative
73  Lamentable purchase decision  Negative
76  Woeful performance throughout  Negative
79  Disgraceful quality for price  Negative
82  Regrettable purchase choice  Negative
85  Unfortunate experience overall  Negative
88  Frustrating to use daily  Negative
91  Terrible build materials  Negative
94  Disappointing performance noted  Negative
97  Poor design choices  Negative
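The same boolean mask can be combined with a column selection through .loc, and isin() builds a mask that matches several labels at once. A brief sketch on four rows copied from the outputs above (the variable names are illustrative):

```python
import pandas as pd

# Four rows taken from the dataset preview shown earlier.
sample = pd.DataFrame({
    "text": [
        "This product is amazing and works perfectly",
        "Terrible experience, would not recommend",
        "It's okay, nothing special",
        "Worst product I've ever bought",
    ],
    "label": ["Positive", "Negative", "Neutral", "Negative"],
})

is_negative = sample["label"] == "Negative"       # boolean Series, one value per row
negative_texts = sample.loc[is_negative, "text"]  # rows by mask, then one column

# isin() matches any of several values in one mask.
not_positive = sample[sample["label"].isin(["Negative", "Neutral"])]

print(negative_texts)
print(not_positive)
```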
1.8. Create a new column that stores the number of characters present in each text.
The .apply() method of a Pandas Series passes each element of the Series to a
function and collects the results in a new Series. The argument of .apply() is the
function itself (here the built-in len), not a call to it.
df["char_count"] = df["text"].apply(len)
print(df.head())
    text  label  char_count
0  This product is amazing and works perfectly  Positive  43
1  Terrible experience, would not recommend  Negative  40
2  It's okay, nothing special  Neutral  26
3  Absolutely love it, best purchase ever  Positive  38
4  Worst product I've ever bought  Negative  30
1.9. Create another column that stores the number of words in each text.
df["word_count"] = df["text"].apply(lambda x: len(x.split()))
print(df)
    text  label  char_count  word_count
0  This product is amazing and works perfectly  Positive  43  7
1  Terrible experience, would not recommend  Negative  40  5
2  It's okay, nothing special  Neutral  26  4
3  Absolutely love it, best purchase ever  Positive  38  6
4  Worst product I've ever bought  Negative  30  5
..  ...  ...  ...  ...
95  Standard expectations met  Neutral  25  3
96  Impressive results achieved  Positive  27  3
97  Poor design choices  Negative  19  3
98  Adequate functionality provided  Neutral  31  3
99  Exceptional value delivered  Positive  27  3

[100 rows x 4 columns]
1.10. Convert all text entries to lowercase and store the result in a new column without modifying the original text column.
df["text_lower"] = df["text"].apply(lambda x: x.lower())
print(df)
Pandas also has built-in string methods, available through the .str accessor. They express the same operations without a lambda and propagate missing values instead of raising an error:
df["char_count"] = df["text"].str.len()
df["word_count"] = df["text"].str.split().str.len()
print(df)
    text  label  char_count  word_count
0  This product is amazing and works perfectly  positive  43  7
1  Terrible experience, would not recommend  negative  40  5
2  It's okay, nothing special  neutral  26  4
3  Absolutely love it, best purchase ever  positive  38  6
4  Worst product I've ever bought  negative  30  5
..  ...  ...  ...  ...
95  Standard expectations met  neutral  25  3
96  Impressive results achieved  positive  27  3
97  Poor design choices  negative  19  3
98  Adequate functionality provided  neutral  31  3
99  Exceptional value delivered  positive  27  3

[100 rows x 4 columns]
.str.split() returns a Series where each element is a list of strings. Calling
.str.len() on this Series gives the length of each list (hence the number of
words in each text).
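One practical difference between the two styles shows up with missing text. The sketch below, on a made-up Series with one missing entry, illustrates that the .str methods return NaN for the missing row while apply(len) raises a TypeError. The data is an assumption; the real dataset has no missing values.

```python
import pandas as pd

# Two texts from the dataset plus one deliberately missing entry.
texts = pd.Series(["It's okay, nothing special", None, "Mediocre at best"])

# .str methods skip the missing entry and return NaN for it.
word_counts = texts.str.split().str.len()

# apply(len) calls len() on the missing entry too, which fails.
try:
    texts.apply(len)
    apply_failed = False
except TypeError:
    apply_failed = True

print(word_counts)
print(apply_failed)
```

Because of the NaN, the resulting counts are stored as floats rather than integers.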
1.11. Standardize the label column by converting all labels to lowercase.
df["label"] = df["label"].apply(lambda x: x.lower())
Once again, the .str accessor gives a more idiomatic way:
df["label"] = df["label"].str.lower()
print(df)
    text  label  char_count  word_count
0  This product is amazing and works perfectly  positive  43  7
1  Terrible experience, would not recommend  negative  40  5
2  It's okay, nothing special  neutral  26  4
3  Absolutely love it, best purchase ever  positive  38  6
4  Worst product I've ever bought  negative  30  5
..  ...  ...  ...  ...
95  Standard expectations met  neutral  25  3
96  Impressive results achieved  positive  27  3
97  Poor design choices  negative  19  3
98  Adequate functionality provided  neutral  31  3
99  Exceptional value delivered  positive  27  3

[100 rows x 4 columns]
1.12. Create a new column in which the labels are encoded as follows: positive → 1, neutral → 0, negative → −1.
label_encoding = {"positive": 1, "neutral": 0, "negative": -1}
df["label_encoded"] = df["label"].map(label_encoding)
print(df.head(3))
    text  label  char_count  word_count  text_lower  label_encoded
0  This product is amazing and works perfectly  positive  43  7  this product is amazing and works perfectly  1
1  Terrible experience, would not recommend  negative  40  5  terrible experience, would not recommend  -1
2  It's okay, nothing special  neutral  26  4  it's okay, nothing special  0
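Series.map() with a dictionary returns NaN for any value that is not a key, which is why the labels were lowercased in 1.11 before encoding. A quick sketch of how unmapped labels could be detected, using a made-up Series in which one label was left capitalized on purpose:

```python
import pandas as pd

label_encoding = {"positive": 1, "neutral": 0, "negative": -1}

# Made-up labels; "Negative" is intentionally not lowercased.
labels = pd.Series(["positive", "neutral", "Negative", "negative"])

encoded = labels.map(label_encoding)  # values missing from the dict become NaN
unmapped = labels[encoded.isna()]     # the labels that failed to map

print(encoded)
print(unmapped)
```

An empty unmapped Series is a cheap sanity check that every label was covered by the dictionary.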
1.13. Sort the DataFrame in descending order based on the char count column.
df_sorted = df.sort_values('char_count', ascending=False)
print(df_sorted)
    text  label  char_count  word_count  text_lower  label_encoded
0  This product is amazing and works perfectly  positive  43  7  this product is amazing and works perfectly  1
1  Terrible experience, would not recommend  negative  40  5  terrible experience, would not recommend  -1
49  Dreadful experience from start to finish  negative  40  6  dreadful experience from start to finish  -1
34  Disastrous purchase, avoid at all costs  negative  39  6  disastrous purchase, avoid at all costs  -1
3  Absolutely love it, best purchase ever  positive  38  6  absolutely love it, best purchase ever  1
..  ...  ...  ...  ...  ...  ...
38  Fine for basic needs  neutral  20  4  fine for basic needs  0
42  Perfect in every way  positive  20  4  perfect in every way  1
97  Poor design choices  negative  19  3  poor design choices  -1
56  So-so performance  neutral  17  2  so-so performance  0
53  Mediocre at best  neutral  16  3  mediocre at best  0

[100 rows x 6 columns]
1.14. Display the top three longest text entries based on word count.
Note that the calls in this step sort by char_count, so they rank the texts by characters. To rank by words, as the task asks, pass 'word_count' to sort_values instead.
print(df.sort_values('char_count', ascending=False))
    text  label  char_count  word_count  text_lower  label_encoded
0  This product is amazing and works perfectly  positive  43  7  this product is amazing and works perfectly  1
1  Terrible experience, would not recommend  negative  40  5  terrible experience, would not recommend  -1
49  Dreadful experience from start to finish  negative  40  6  dreadful experience from start to finish  -1
34  Disastrous purchase, avoid at all costs  negative  39  6  disastrous purchase, avoid at all costs  -1
3  Absolutely love it, best purchase ever  positive  38  6  absolutely love it, best purchase ever  1
..  ...  ...  ...  ...  ...  ...
38  Fine for basic needs  neutral  20  4  fine for basic needs  0
42  Perfect in every way  positive  20  4  perfect in every way  1
97  Poor design choices  negative  19  3  poor design choices  -1
56  So-so performance  neutral  17  2  so-so performance  0
53  Mediocre at best  neutral  16  3  mediocre at best  0

[100 rows x 6 columns]
print(df.sort_values('char_count', ascending=False)['text'].head(3))
0     This product is amazing and works perfectly
1        Terrible experience, would not recommend
49       Dreadful experience from start to finish
Name: text, dtype: object
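Ranking by word_count rather than char_count could be sketched with DataFrame.nlargest, which sorts and truncates in one call. The five rows below are copied from the outputs above; the frame itself is a reduced stand-in for df.

```python
import pandas as pd

# Five texts taken from the outputs above.
sample = pd.DataFrame({
    "text": [
        "This product is amazing and works perfectly",
        "Terrible experience, would not recommend",
        "So-so performance",
        "Absolutely love it, best purchase ever",
        "Mediocre at best",
    ],
})
sample["word_count"] = sample["text"].str.split().str.len()

# Top three rows by word count, largest first.
top3 = sample.nlargest(3, "word_count")
print(top3)
```

On the full DataFrame the equivalent would be df.nlargest(3, 'word_count'), or df.sort_values('word_count', ascending=False).head(3).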
1.15. Filter and display all texts that contain more than six words and belong to the “positive” label.
Recreating the word count column (already computed in 1.9):
df['word_count'] = df['text'].apply(lambda x: x.split()).apply(len)
print(df)
    text  label  char_count  word_count  text_lower  label_encoded
0  This product is amazing and works perfectly  positive  43  7  this product is amazing and works perfectly  1
1  Terrible experience, would not recommend  negative  40  5  terrible experience, would not recommend  -1
2  It's okay, nothing special  neutral  26  4  it's okay, nothing special  0
3  Absolutely love it, best purchase ever  positive  38  6  absolutely love it, best purchase ever  1
4  Worst product I've ever bought  negative  30  5  worst product i've ever bought  -1
..  ...  ...  ...  ...  ...  ...
95  Standard expectations met  neutral  25  3  standard expectations met  0
96  Impressive results achieved  positive  27  3  impressive results achieved  1
97  Poor design choices  negative  19  3  poor design choices  -1
98  Adequate functionality provided  neutral  31  3  adequate functionality provided  0
99  Exceptional value delivered  positive  27  3  exceptional value delivered  1

[100 rows x 6 columns]
print(df[(df['word_count']>6) & (df['label'] == 'positive')])
    text  label  char_count  text_lower  label_encoded  word_count
0  This product is amazing and works perfectly  positive  43  this product is amazing and works perfectly  1  7
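In Pandas, you combine boolean Series with the bitwise operators & (and), | (or) and ~ (not) rather than Python’s and, or and not. Each comparison must be wrapped in parentheses, because & and | bind more tightly than comparison operators such as > and ==.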
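The need for parentheses can be seen by leaving them out. In the sketch below (made-up rows, illustrative names), the unparenthesized expression is parsed as a chained comparison around 6 & sample["label"], which raises an exception instead of filtering:

```python
import pandas as pd

# Made-up rows with the two columns used by the filter.
sample = pd.DataFrame({
    "word_count": [7, 5, 8, 3],
    "label": ["positive", "negative", "neutral", "positive"],
})

# Correct: each comparison is wrapped before combining with &.
mask = (sample["word_count"] > 6) & (sample["label"] == "positive")
long_positive = sample[mask]

# Without parentheses, & is evaluated before > and ==, so this fails.
try:
    sample["word_count"] > 6 & sample["label"] == "positive"
    unparenthesized_ok = True
except (TypeError, ValueError):
    unparenthesized_ok = False

print(long_positive)
print(unparenthesized_ok)
```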
1.16. Group the data by label and compute the average word count for each group.
avg_word_count = df.groupby('label')['word_count'].mean()
print(avg_word_count)
label
negative    3.878788
neutral     3.562500
positive    3.885714
Name: word_count, dtype: float64
1.17. Identify the label that has the highest average word count.
label_max_avg = avg_word_count.idxmax()
print(f"Highest average: {label_max_avg}")
Highest average: positive
1.18. Count the number of text entries with fewer than five words for each label.
short_texts = df[df['word_count'] < 5].groupby('label').size()
print(short_texts)
label
negative    26
neutral     28
positive    31
dtype: int64
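Filtering before grouping drops any label that has no short texts at all. That does not happen with this dataset, but an alternative that keeps such labels with a count of 0 is to group the boolean condition itself and sum it. A sketch on made-up rows where one label has no short texts:

```python
import pandas as pd

# Made-up rows: "negative" has no text with fewer than five words.
sample = pd.DataFrame({
    "label": ["positive", "positive", "negative", "negative", "neutral"],
    "word_count": [7, 3, 5, 6, 4],
})

# Filter first, then group: labels without short texts disappear.
filtered_counts = sample[sample["word_count"] < 5].groupby("label").size()

# Group the boolean condition and sum it: every label is kept.
is_short = sample["word_count"] < 5
short_counts = is_short.groupby(sample["label"]).sum()

print(filtered_counts)
print(short_counts)
```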
1.19. Change the DataFrame index to start from 1 instead of 0.
df.index = range(1, len(df) + 1)
1.20. Rename the columns text to review_text and label to sentiment.
df = df.rename(columns={'text': 'review_text', 'label': 'sentiment'})
1.21. Update the sentiment of a specific record (given index) from “neutral” to “positive”.
df.loc[5, 'sentiment'] = 'positive'
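In the output of 1.22, record 5 reads "Worst product I've ever bought" and still carries label_encoded -1, which suggests the updated row was originally negative and the encoded column was not refreshed. A hedged sketch of a guarded update that only changes a neutral record and then re-derives the encoding; the three-row frame and target_index are hypothetical stand-ins:

```python
import pandas as pd

label_encoding = {"positive": 1, "neutral": 0, "negative": -1}

# Made-up frame with a 1-based index, mirroring the layout after 1.19 and 1.20.
sample = pd.DataFrame(
    {
        "sentiment": ["positive", "negative", "neutral"],
        "label_encoded": [1, -1, 0],
    },
    index=range(1, 4),
)

target_index = 3  # hypothetical index of the record to update

# Only update when the record is actually neutral.
if sample.loc[target_index, "sentiment"] == "neutral":
    sample.loc[target_index, "sentiment"] = "positive"
    # Keep the encoded column consistent with the new sentiment.
    sample["label_encoded"] = sample["sentiment"].map(label_encoding)

print(sample)
```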
1.22. Create a new DataFrame that contains only the rows with positive sentiment.
df_positive = df[df['sentiment'] == 'positive'].copy()
print(df_positive)
    review_text  sentiment  char_count  word_count  text_lower  label_encoded
1  This product is amazing and works perfectly  positive  43  7  this product is amazing and works perfectly  1
4  Absolutely love it, best purchase ever  positive  38  6  absolutely love it, best purchase ever  1
5  Worst product I've ever bought  positive  30  5  worst product i've ever bought  -1
7  Outstanding service and great quality  positive  37  5  outstanding service and great quality  1
10  Exceeded all my expectations  positive  28  4  exceeded all my expectations  1
13  Highly recommend to everyone  positive  28  4  highly recommend to everyone  1
16  Fantastic results, very satisfied  positive  33  4  fantastic results, very satisfied  1
18  Good value for money  positive  20  4  good value for money  1
19  Incredible performance and durability  positive  37  4  incredible performance and durability  1
22  Exceptional product, five stars  positive  31  4  exceptional product, five stars  1
25  Best in class, truly outstanding  positive  32  5  best in class, truly outstanding  1
28  Wonderful experience overall  positive  28  3  wonderful experience overall  1
31  Superb craftsmanship and design  positive  31  4  superb craftsmanship and design  1
34  Brilliant product, works flawlessly  positive  35  4  brilliant product, works flawlessly  1
37  Marvelous quality and finish  positive  28  4  marvelous quality and finish  1
40  Exemplary service and product  positive  29  4  exemplary service and product  1
43  Perfect in every way  positive  20  4  perfect in every way  1
46  Stellar performance throughout  positive  30  3  stellar performance throughout  1
49  Magnificent product design  positive  26  3  magnificent product design  1
52  Supreme quality and value  positive  25  4  supreme quality and value  1
55  Phenomenal results every time  positive  29  4  phenomenal results every time  1
58  Remarkable innovation and quality  positive  33  4  remarkable innovation and quality  1
61  Exquisite attention to detail  positive  29  4  exquisite attention to detail  1
64  Glorious product, highly effective  positive  34  4  glorious product, highly effective  1
67  Splendid craftsmanship evident  positive  30  3  splendid craftsmanship evident  1
70  Delightful experience using this  positive  32  4  delightful experience using this  1
73  Fabulous innovation here  positive  24  3  fabulous innovation here  1
76  Superb engineering and design  positive  29  4  superb engineering and design  1
79  Gorgeous aesthetics and function  positive  32  4  gorgeous aesthetics and function  1
82  Wonderful addition to collection  positive  32  4  wonderful addition to collection  1
85  Excellent durability tested  positive  27  3  excellent durability tested  1
88  Amazing value proposition  positive  25  3  amazing value proposition  1
91  Premium quality evident  positive  23  3  premium quality evident  1
94  Outstanding innovation shown  positive  28  3  outstanding innovation shown  1
97  Impressive results achieved  positive  27  3  impressive results achieved  1
100  Exceptional value delivered  positive  27  3  exceptional value delivered  1
1.23. Save the cleaned and updated DataFrame to a new CSV file.
df.to_csv('cleaned_reviews.csv', index=False)
1.24. Reload the saved file and verify that the changes are correctly reflected.
df_reloaded = pd.read_csv('cleaned_reviews.csv')
print(df_reloaded.head())
print(df_reloaded.dtypes)
    review_text  sentiment  char_count  word_count  text_lower  label_encoded
0  This product is amazing and works perfectly  positive  43  7  this product is amazing and works perfectly  1
1  Terrible experience, would not recommend  negative  40  5  terrible experience, would not recommend  -1
2  It's okay, nothing special  neutral  26  4  it's okay, nothing special  0
3  Absolutely love it, best purchase ever  positive  38  6  absolutely love it, best purchase ever  1
4  Worst product I've ever bought  positive  30  5  worst product i've ever bought  -1
review_text      object
sentiment        object
char_count        int64
word_count        int64
text_lower       object
label_encoded     int64
dtype: object
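Because the file was saved with index=False, the 1-based index from 1.19 is not written, and the reloaded frame gets a fresh index starting at 0, as the output above shows. A minimal sketch of a programmatic round-trip check, using an in-memory buffer in place of cleaned_reviews.csv and a made-up three-row frame:

```python
import io

import pandas as pd

# Made-up frame with a 1-based index, standing in for the cleaned df.
sample = pd.DataFrame(
    {
        "review_text": [
            "This product is amazing and works perfectly",
            "Terrible experience, would not recommend",
            "It's okay, nothing special",
        ],
        "sentiment": ["positive", "negative", "neutral"],
        "word_count": [7, 5, 4],
    },
    index=range(1, 4),
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
sample.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# Compare column names and values; the index is expected to differ.
same_columns = list(reloaded.columns) == list(sample.columns)
same_sentiment = reloaded["sentiment"].tolist() == sample["sentiment"].tolist()
same_counts = reloaded["word_count"].tolist() == sample["word_count"].tolist()

print(same_columns, same_sentiment, same_counts)
print(list(reloaded.index))
```

To keep the 1-based index across the round trip, one option is to omit index=False when saving and pass index_col=0 to read_csv.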