Find Most Common Words and Length of Words
- Word counting in a file
Your goal is to write a function that takes a file handle as input and returns the number of words in the file that are a single letter, 2 letters, 3 letters, and so on up to the longest word length. Assume that words are separated by spaces.
Test this function with a program that takes a filename as input and writes the word-length distribution to a file called "<file name>_sizeDistribution".
Example: for the input file test.txt, the output file is test_sizeDistribution.txt.
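To make the expected output concrete, here is a minimal sketch that uses an in-memory file (io.StringIO) in place of test.txt. The helper name length_distribution and the sample sentences are illustrative assumptions, not part of the assignment. The sketch also reports a zero for lengths that never occur, which is one possible reading of "and so on up to the longest word length".

```python
import io
from collections import Counter

def length_distribution(handle):
    """Hypothetical helper: map each word length 1..longest to its count."""
    counts = Counter(len(word) for line in handle for word in line.split())
    longest = max(counts, default=0)
    # Include lengths with no words so the distribution has no gaps.
    return {n: counts.get(n, 0) for n in range(1, longest + 1)}

# Assumed sample text standing in for the contents of test.txt.
sample = io.StringIO("a cat sat on a mat\nI see three birds\n")
for size, count in length_distribution(sample).items():
    print("size %d: %d" % (size, count))
```

For this sample the sketch reports three 1-letter words, one 2-letter word, four 3-letter words, no 4-letter words, and two 5-letter words.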
- Most common words
Your goal is to write a Python function that takes a file handle as input and returns the most common words in the text file. Your program should first build a Python dictionary that tracks the number of occurrences of every word in the book. Assume that words are separated by spaces.
Test this with a program that takes a filename as input and prints i) the 5 most common words and ii) the 5 most common words of length greater than 5. Your code should print the results for the 3 sample files.
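One possible way to approach this task is with collections.Counter, which is a dictionary subclass that counts occurrences. The sketch below is only an illustration: the helper name most_common_words, its parameters n and min_length, and the sample sentence are assumptions made for the example.

```python
import io
from collections import Counter

def most_common_words(handle, n=5, min_length=0):
    """Hypothetical helper: the n most frequent words longer than min_length."""
    counts = Counter(word for line in handle for word in line.split())
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(word, count) for word, count in ranked if len(word) > min_length][:n]

# Assumed sample text standing in for one of the sample files.
text = "the garden is quiet the garden is green and the flowers bloom\n"
print(most_common_words(io.StringIO(text), n=2))
print(most_common_words(io.StringIO(text), n=2, min_length=5))
```

Breaking ties alphabetically is a design choice of this sketch; the assignment does not say how words with equal counts should be ordered.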
- Uniqueness of each book
We will use a very simple definition of uniqueness of a book – the number of unique words that occur in the book that do not occur in any of the other books, as a percentage of total number of words in the book.
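As a small worked example of this definition, the sketch below reads "total number of words" as every occurrence of every word in the book. Counting only distinct words would be another reasonable reading. The helper name uniqueness and the two tiny "books" are assumptions made for illustration.

```python
def uniqueness(books, index):
    """Hypothetical helper: uniqueness percentage of books[index].

    Each book is given as a list of words, repeats included.
    """
    own = set(books[index])
    others = set()
    for j, words in enumerate(books):
        if j != index:
            others.update(words)
    total = len(books[index])  # every occurrence counts toward the total
    if total == 0:
        return 0.0
    # Distinct words found in this book only, as a percentage of all its words.
    return 100.0 * len(own - others) / total

# Assumed sample data: two very short "books".
books = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
]
print("%.2f%%" % uniqueness(books, 0))
```

In the first book only "cat" and "mat" are absent from the second, so 2 of its 6 words are unique, which gives about 33.33%.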
Solution
p1.py
def count_len_freq(handle):
    """Return the number of words of each length in the file."""
    freq = {}
    for line in handle:
        words = line.split()
        for word in words:
            # increase the count of the corresponding length
            freq[len(word)] = freq.get(len(word), 0) + 1
    return freq

if __name__ == '__main__':
    # test program: ask the user for the input file name
    # and save the output to a file
    filename = input('Enter the name of the input file (.txt): ')
    # split the file name into base and extension and construct
    # the name of the output file
    import os
    base, extension = os.path.splitext(filename)
    output = base + "_sizeDistribution" + extension
    # open the files
    handle1 = open(filename, 'r', encoding='ISO-8859-2')
    handle2 = open(output, 'w')
    counts = count_len_freq(handle1)
    # write the counts in increasing order of word length
    for length, count in sorted(counts.items()):
        handle2.write("size %d: %d\n" % (length, count))
    handle1.close()
    handle2.close()
    print("output is saved into file", output)

p2.py
def count_word_freq(handle):
    """Return the count of words in the file. Words are case-sensitive.
    The results are converted into a list of (count, word) tuples in decreasing order of the counts."""
    freq = {}
    for line in handle:
        words = line.split()
        for word in words:
            # increase the count of the corresponding word
            freq[word] = freq.get(word, 0) + 1
    # convert the dictionary into a list so we can sort it by the counts
    freq = [(v, k) for k, v in freq.items()]
    freq.sort(reverse=True)
    return freq

if __name__ == '__main__':
    # test program: ask the user for the input file name
    import io
    filename = input('Enter the name of the input file (.txt): ')
    handle = io.open(filename, 'r', encoding='ISO-8859-2')
    freq = count_word_freq(handle)
    # display the 5 most common words
    print("\n\n5 most common words in file %s\n" % (filename))
    for i in range(5):
        if i < len(freq):
            print("%-4d %s" % (freq[i][0], freq[i][1]))
    # display the 5 most common words of length greater than 5
    print("\n\n5 most common words of length greater than 5\n")
    k = 0
    for i in range(len(freq)):
        if k < 5:
            if len(freq[i][1]) > 5:  # the word's length is more than 5
                print("%-4d %s" % (freq[i][0], freq[i][1]))
                k = k + 1
        else:
            break
    handle.close()

p3.py
def get_word_set(handle):
    """Return the set of words in the file. Words are case-sensitive."""
    value = set()
    for line in handle:
        words = line.split()
        for word in words:
            # add the word into the word set
            value.add(word)
    return value

def get_percent_unique(i, sets):
    """Return the fraction (0 to 1) of the words in sets[i] that occur in none of the other sets."""
    words = sets[i]  # the words to check
    allwords = set()  # all words from the other sets (everything except sets[i])
    for j in range(len(sets)):
        if j != i:
            allwords |= sets[j]
    # now count the number of words that are not in allwords (unique words)
    count = 0
    for word in words:
        if word not in allwords:
            count = count + 1
    # calculate and return the fraction of unique words; words is a set,
    # so len(words) is the number of distinct words, and the count > 0
    # check also avoids dividing by zero when the set is empty
    if count > 0:
        return count / len(words)
    return 0

if __name__ == '__main__':
    # test program: ask the user for the input file names
    import io
    count = int(input("Enter the number of files to check: "))
    # read the file names and collect the unique words for them
    filenames = []
    wordsets = []
    for i in range(count):
        filename = input("Enter the file name (.txt): ")
        # open the file and get the set of words from it
        handle = io.open(filename, 'r', encoding='ISO-8859-2')
        words = get_word_set(handle)
        handle.close()
        # save the file name and its word set into the lists
        filenames.append(filename)
        wordsets.append(words)
    # display the uniqueness of each book
    print("\n\nUniqueness of each book:\n")
    for i in range(len(wordsets)):
        uniqueness = get_percent_unique(i, wordsets)
        print("%-30s: %.2f%%" % (filenames[i], uniqueness * 100))