Word Frequency Analyzer2 - With Noise Reduction

Page history last edited by Dorai Thodla 16 years, 3 months ago

Goal: Given an input file and an optional noise word file, print a frequency count of non-noise words in the input file


 

1. Change the name of the program to WordAnalyzer

2. Accept two parameters - input file and noise file

3. Input file is the same as the first version

4. Noise file contains a set of noise words separated by white space (one or more spaces, newlines, tabs)

5. Open both files - raise an exception and exit the program if either of them does not exist

6. Read noise file and store all the words in memory

7. Read input file, line by line

8. For each line, tokenize (separate into words)

9. Eliminate any punctuation characters (period, comma, semicolon, colon, question mark, exclamation point and other non-alphanumeric characters)

10. Increment the total word count for every input word

11. Check the word against the noise words -

 - if it is a noise word, increment the noise-word count (so that we know how many noise words are in the text)

 - if it is not a noise word, increment the valid-word count and that word's frequency count

12. At the end of the input file, produce the following output:

13. Print the count of input words, the count of noise words in the input file, and the count of valid words

14. Write an output file in the following format (sorted in descending order of frequency)

word, frequencycount
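
The steps above can be sketched in Python roughly as follows. This is one possible reading of the exercise, not a reference solution. The function names (`read_noise_words`, `analyze`, `write_report`, `main`) and the output file name `frequency.txt` are hypothetical, and lower-casing words before comparing them is an assumption, since the steps do not say how case should be handled. Tokens that become empty once punctuation is stripped are not counted as words.

```python
import re
import sys
from collections import Counter


def read_noise_words(noise_path):
    """Return the set of noise words from a whitespace-separated file."""
    with open(noise_path, encoding="utf-8") as f:
        # Lower-casing is an assumption; the steps do not mention case.
        return {w.lower() for w in f.read().split()}


def analyze(input_path, noise_path):
    """Return (total, noise_count, valid_count, frequencies)."""
    noise = read_noise_words(noise_path)
    total = noise_count = 0
    freq = Counter()
    with open(input_path, encoding="utf-8") as f:
        for line in f:
            for token in line.split():
                # Strip every character that is not a letter or digit.
                word = re.sub(r"[^A-Za-z0-9]", "", token).lower()
                if not word:
                    continue  # the token was punctuation only
                total += 1
                if word in noise:
                    noise_count += 1
                else:
                    freq[word] += 1
    return total, noise_count, sum(freq.values()), freq


def write_report(freq, output_path):
    """Write 'word, frequencycount' lines, most frequent first."""
    # Ties are broken alphabetically (an assumption, for stable output).
    ordered = sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))
    with open(output_path, "w", encoding="utf-8") as out:
        for word, count in ordered:
            out.write(f"{word}, {count}\n")


def main(argv):
    """Command-line entry point: argv is [program, input_file, noise_file]."""
    if len(argv) != 3:
        print("usage: WordAnalyzer INPUT_FILE NOISE_FILE", file=sys.stderr)
        return 2
    try:
        total, noise_count, valid, freq = analyze(argv[1], argv[2])
    except OSError as err:
        # A missing file raises FileNotFoundError, a subclass of OSError.
        print(f"cannot open file: {err}", file=sys.stderr)
        return 1
    print(f"input words: {total}, noise words: {noise_count}, valid words: {valid}")
    write_report(freq, "frequency.txt")  # hypothetical output file name
    return 0
```

To run it as a script, call `sys.exit(main(sys.argv))` at the bottom of the file. A `Counter` keeps the frequency bookkeeping short, and a `set` makes each noise-word lookup a constant-time check.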

 


Perform the following tests

 

1. Invalid input file, invalid noise word file

2. Valid input, invalid noise word file

3. Valid input, valid noise word file

4. Valid input, empty noise word file (the noise word file exists but does not contain any words)

5. Empty input - input file exists but does not contain any data

6. Input with a single word

7. Input with only punctuation characters (no valid words)
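
The seven tests above could be driven by a small harness along these lines. It is a sketch under several assumptions: an "invalid" file is taken to mean a path that does not exist, `run_cases` and `count_words` are hypothetical names, and `count_words` is only a compact stand-in so that the example runs on its own. Any function that takes an input path and a noise path can be passed to `run_cases` instead.

```python
import os
import re
import tempfile
from collections import Counter


def count_words(input_path, noise_path):
    """Stand-in analyzer: returns (total, noise_count, frequencies)."""
    with open(noise_path, encoding="utf-8") as f:
        noise = set(f.read().lower().split())
    with open(input_path, encoding="utf-8") as f:
        # Strip every character that is not a letter or digit from each token.
        words = [re.sub(r"[^a-z0-9]", "", t) for t in f.read().lower().split()]
    words = [w for w in words if w]  # drop punctuation-only tokens
    freq = Counter(w for w in words if w not in noise)
    return len(words), len(words) - sum(freq.values()), freq


def run_cases(analyzer):
    """Run the seven cases; return {case number: result or exception name}."""
    results = {}
    with tempfile.TemporaryDirectory() as tmp:
        def make(name, text):
            # Create a fixture file in the temporary directory.
            path = os.path.join(tmp, name)
            with open(path, "w", encoding="utf-8") as f:
                f.write(text)
            return path

        missing = os.path.join(tmp, "does_not_exist.txt")
        text = make("input.txt", "The quick fox. The lazy dog!\n")
        noise = make("noise.txt", "the a an\n")
        cases = {
            1: (missing, missing),                       # both files invalid
            2: (text, missing),                          # invalid noise file
            3: (text, noise),                            # both valid
            4: (text, make("empty_noise.txt", "")),      # empty noise file
            5: (make("empty_input.txt", ""), noise),     # empty input
            6: (make("one_word.txt", "hello\n"), noise), # single word
            7: (make("punct.txt", "... !!! ?;:,\n"), noise),  # punctuation only
        }
        for number, (input_path, noise_path) in cases.items():
            try:
                results[number] = analyzer(input_path, noise_path)
            except OSError as err:
                results[number] = type(err).__name__
    return results
```

Calling `run_cases(count_words)` returns one entry per test, so the missing-file cases and the empty or punctuation-only cases can be checked side by side.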

