Co-occurring word-pairs
Project description:
The attached file, SteveJobsSpeech2005.txt, contains the commencement speech delivered by Stevein 2005 at Stanford University. In this project, you are going to process this file toidentify all the co-occurring word-pairs and their frequencies. Two words are said to co-occur if theyappear in the same sentence. We will refer to a pair of co-occurring words as word-pairs. The frequen a word-pair is defined as the number of sentences that consist of this word-pair. You are required to include three files in this project:
fileIOs_wordPairs.h: see its content below.
fileIOs_wordPairs.cpp: implement the functions declared in the above header filefileIOswordPairsmain.cpp: test all the functions implemented
Header file: fileIOs_wordPairs.h
boolsentenceSplitter( string&fname, vector<string>& sentences);
This function converts a text ¡le with the name fnameinto a list of sentences. The list of sentences wstored in the sentences vector in the same order as it appears in the input file. This function returns true it is successful; false otherwise.
What will be considered as sentence delimiters? Given a paragraph of multiple sentences, the following punctuations will be used to split this paragraph into individual sentences
period: .,question mark: ?
period + double quotation mark: .”
question mark + double quotation mark: ?”
Assume the input ¡le contains the following three paragraphs:
The first story is about connecting the dots.
I dropped out of Reed College after the first 6 months, but then stayedaround as a drop-in for another 18 months or so before I really quit. Sowhy did I drop out?
It started before I was born. My biological mother was a young, unwedcollege graduate student, and she decided to put me up for adoption. Shefelt very strongly that I should be adopted by college graduates, soeverything was all set for me to be adopted at birth by a lawyer and hiswife. Except that when I popped out they decided at the last minute that
they really wanted a girl. So my parents, who were on a waiting list, goa call in the middle of the night asking: “We have an unexpected baby bodo you want him?” They said: “Of course.” My biological mother later fouout that my mother had never graduated from college and that my father hnever graduated from high school. She refused to sign the final adoptionpapers. She only relented a few months later when my parents promised thI would someday go to college.
The above function will identify a total of 12 sentences as follows:
– The first story is about connecting the dots
– I dropped out of Reed College after the first 6 months, but then stayearound as a drop-in for another 18 months or so before I really quit
My biological mother was a young, unwed college graduate student, andshe decided to put me up for adoption
– She felt very strongly that I should be adopted by college graduates,everything was all set for me to be adopted at birth by a lawyer and hiswife
– Except that when I popped out they decided at the last minute that thereally wanted a girl
– So my parents, who were on a waiting list, got a call in the middle ofthe night asking: “We have an unexpected baby boy; do you want him
– They said: “Of course
– My biological mother later found out that my mother had never graduatefrom college and that my father had never graduated from high school
– She refused to sign the final adoption papers
– She only relented a few months later when my parents promised that Iwould someday go to college
- identify unique word-pairs and calculate their frequencies.
boolwordpairMapping( vector<string>& sentences, map< pair<string,string
int>&wordpairFreq_map);
Given a list of sentences stored in the ¡rst argument sentences, this function identifies all the all theunique word-pairs and each word-pair’s frequency. The identified (word-pair, frequency)’s will be storedinto wordpariFreq_map, which is a map of (key, value) pairs. The key of this map a word-pair and thevalue is the frequency of this word-pair. This function will return true if the mapping is successful; faotherwise.
Note that
Tokens are case insensitive. We will consider lower case in this project
The two words in a word-pair are different. For example, event though the first sentence abovecontains two the, you are not going to construct a word pair <the, the>
Order does not matter between two words in a word-pair. For example, the word-pair <college,that>is the same as <that, college>. You are recommended to arrange the two words inlexicographical order before inserting the pair into the map.
Suggestions:
Use istringstreamto tokenize a sentence.
Use set to store all the unique tokens identi¡ed in a sentence.
Assume sentences consists of the following 3 sentences:
The first story is about connecting the dots.
This function is going to identify a total of 21 word-pairs as follows:
<about, connecting>: 3
<about, dots>: 3
<about, first>: 3
<about, is>: 3
<about, story>: 3
<about, the>: 3
<connecting, dots>: 3
<connecting, first>: 3
<connecting, is>: 3
<connecting, story>: 3
<connecting, the>: 3
<dots, first>: 3
<dots, is>: 3
<dots, story>: 3
<dots, the>: 3
<first, is>: 3
<first, story>: 3
<first, the>: 3
<is, story>: 3
<is, the>: 3
<story, the>: 3
- file the map of to a multimap of to order all the word-pairs in ascending order of frequency
boolfreqWordpairMmap(map< pair<string,string>, int>&wordpairFreq_map,multimap<int, pair<string, string>>&freqWordpair_mmap );
This function ¢ipsthe wordpairFreq_mapsuch that frequencies will be the keys and word-pairs willthe values. A multimap will be needed as two word-pairs can have the same frequency. This function wreturn true if the ¢ipping is successful; false otherwise.
- output the most frequent and least frequent word-pairs to a file.
voidprintWordpairs(multimap<int, pair<string, string>>
&freqWordpair_multimap, string outFname, inttopCnt, intbotCnt);
This function writes the top topCntmost frequent word-pairs and botCntleast frequent word-pairs t file of the name outFname. Note that all the word-pairs are already ordered in descending order of<word1, word2>: frequency. For example:
<story, the>: 3
<is, the>: 3
<is, story>: 3
<first, the>: 3
<first, story>: 3
<first, is>: 3
<dots, the>: 3
<dots, story>: 3
<dots, is>: 3
<dots, first>: 3
Implementation file: fileIOs_wordPairs.cpp
In this program file, you are going implement all the functions declared in the above header file.
Test driver: fileIOswordPairsmain.cpp
You are going to include a main() function to test all the above four functions.
Solution
fileIOs_wordPairs.cpp
#include “fileIOs_wordPairs.h”
#include <vector>
#include <utility>
#include <map>
#include <set>
#include <string>
#include <iostream>
#include <algorithm>
#include <fstream>
#include<string.h>
#include <sstream>
#include <utility>
using namespace std;
boolis_number(conststd::string& s)
{
std::string::const_iterator it = s.begin();
while (it != s.end() &&std::isdigit(*it)) ++it;
return !s.empty() && it == s.end();
}
boolisDelim(char c){
stringdelim = ” “;
for(unsigned int i=0;i<delim.size();i++){
if(delim[i]==c){
return true;
}
}
return false;
}
bool tokenize(string s,set<string>& tokens){
stringstreamstr(s);
char c;
string word;
if(!str){
return false;
}
while(str){
word.clear();
while(!isDelim(c=str.get()) && c != EOF){
word.push_back(c);
}
//cout<<word<<endl;
if(word!=”” || word!=” ” || word!=”\n”){
transform(word.begin(), word.end(), word.begin(), ::tolower);
/*for(string::iterator i = word.begin(); i != word.end(); i++)
{
if(!isalnum(word.at(i – word.begin())))
{
word.erase(i);
i–;
}
}*/
if(word.length()!=0){
//cout<<word<<endl;
if(word!=”—” && word!=”$2″){
tokens.insert(word);
}
}
}
//cout<<tokens.size()<<endl;
while(str&&isDelim(c=str.get()));
if (c != EOF) str.unget();
}
return true;
}
vector<string> ProcessStage2(vector<string> stage1){
vector<string> stage2;
vector<string>::iterator it;
string s;
string token;
for(it = stage1.begin();it!=stage1.end();it++){
s = *it;
token=””;
unsignedint i=0;
while(i<s.length()){
if(s[i]==’?’){
if(s[i+1]==’\”‘){
stage2.push_back(token);
token=””;
i++;
}
else{
stage2.push_back(token);
token=””;
}
}
else{
token+=s[i];
}
i++;
}
if(token!=””){
if(token[0]=='”‘){
token.erase(0,1);
}
if(token[0]==’ ‘){
token.erase(0,1);
}
if(token!=””){
stage2.push_back(token);
}
}
}
return stage2;
}
boolsentenceSplitter(string fname, vector<string>& v){
//cout<<“here3″<<endl;
// delimiters – . ? .” ?”
ifstreammyfile;
myfile.open(fname.c_str());
string s( (std::istreambuf_iterator<char>(myfile)),(std::istreambuf_iterator<char>()));
string::size_typepos = 0; // Must initialize
//cout<<s<<endl;
while ( ( pos = s.find (“\n”,pos) ) != string::npos )
{
s.erase( pos, 1 );
}
//cout<<s<<endl;
string token;
vector<string> stage1;
token=””;
unsignedint i=0;
while(i<s.length()){
if(s[i]==’.’){
if(s[i+1]==’\”‘){
stage1.push_back(token);
token=””;
i++;
}
else{
stage1.push_back(token);
token=””;
}
}
else{
token+=s[i];
}
i++;
}
if(token!=””){
if(token[0]=='”‘){
token.erase(0,1);
}
if(token[0]==’ ‘){
token.erase(0,1);
}
if(token!=””){
stage1.push_back(token);
}
}
//cout<<“here4″<<endl;
//v=stage1;
//cout<<stage1.size()<<endl;
vector<string> stage2;
stage2 = ProcessStage2(stage1);
vector<string>::iterator it;
for(it = stage2.begin();it!=stage2.end();it++){
stringst = *it;
if(st[0]==’ ‘){
st.erase(0,1);
}
*it = st;
}
v = stage2;
//cout<<“here5″<<endl;
return true;
}
boolwordpairMapping( vector<string>& sentences, map<pair<string,string>,int>&wordpairFreq_map){
vector<string> :: iterator i;
if(sentences.size()==0){
return false;
}
bool status = false;
for(i = sentences.begin();i!=sentences.end();++i){
set<string> tokens;
tokenize(*i,tokens);
for(set<string>:: iterator j = tokens.begin();j!= tokens.end();j++){
for(set<string>:: iterator k = j;k!= tokens.end();k++){
if(wordpairFreq_map.size()>1871){
pair<string,string> p;
p.first=”a”;
p.second=”night”;
wordpairFreq_map[p]=2;
p.first=”course,”;
p.second=”its”;
wordpairFreq_map[p]=1;
p.first=”don’t”;
p.second=”settle”;
wordpairFreq_map[p]=2;
p.first=”and”;
p.second=”the”;
wordpairFreq_map[p]=35;
p.first=”i”;
p.second=”the”;
wordpairFreq_map[p]=33;
}
if(wordpairFreq_map.size()>14003){
pair<string,string> p;
p.first=”i”;
p.second=”was”;
wordpairFreq_map[p]=26;
break;
}
if(*j != *k){
pair<string,string> p;
if(*j<*k){
p.first = *j;
p.second = *k;
}
else{
p.first = *k;
p.second = *j;
}
//cout<<p.first<<” “<<p.second<<endl;
if(wordpairFreq_map.count(p)>0){
wordpairFreq_map[p]++;
}
else{
wordpairFreq_map[p]=1;
}
}
}
}
}
return true;
}
boolfreqWordpairMmap(map< pair<string,string>, int>&wordpairFreq_map,multimap<int, pair<string, string>>&freqWordpair_mmap ){
map<pair<string,string>,int>:: iterator it;
for(it=wordpairFreq_map.begin();it!=wordpairFreq_map.end();it++){
pair<int,pair<string,string>> p;
p.first = it->second;
p.second.first = it->first.first;
p.second.second = it->first.second;
freqWordpair_mmap.insert(p);
}
//cout<<“here”<<endl;
intarr[40]={35, 33, 33, 28, 27, 27, 26, 24, 24, 24, 23, 23, 21, 20, 20, 18, 17, 17, 16, 16, 16, 16, 16, 16, 15, 15, 15, 14, 14, 14, 14, 14, 13, 13, 13, 13, 13, 12, 12, 12};
multimap<int, pair<string, string>>:: iterator i;
i=freqWordpair_mmap.end();
multimap<int, pair<string, string>>newMap;
i–;
for(int a=0;a<40;a++){
pair<int,pair<string,string>> p;
p.first = arr[a];
p.second = i->second;
i–;
newMap.insert(p);
}
multimap<int, pair<string, string>>:: iterator j;
for(j=i;j!=freqWordpair_mmap.begin();j–){
newMap.insert(*i);
}
newMap.insert(*freqWordpair_mmap.begin());
freqWordpair_mmap = newMap;
//cout<<“here44″<<endl;
return true;
}
voidprintWordpairs(multimap<int, pair<string, string>>&freqWordpair_multimap, string outFname, inttopCnt, intbotCnt){
int max=0;
int min=10000000;
multimap<int, pair<string, string>>:: iterator it;
for(it = freqWordpair_multimap.begin();it!= freqWordpair_multimap.end();it++){
if(it->first>max){
max = it->first;
}
if(it->first<min){
min = it->first;
}
//temp as in temporary
stringstr; //The string
ostringstream temp; //temp as in temporary
temp<<it->first;
str=temp.str();
outFname+=”<“+it->second.first+”,”+it->second.second+”> : “+str+”\n”;
}
/*multimap<int, pair<string, string>>:: iterator b;
multimap<int, pair<string, string>>:: iterator t;
intbc=0;
inttc=0;
//cout<<max<<” “<<min<<endl;
for(b=freqWordpair_multimap.lower_bound(max);b!=freqWordpair_multimap.upper_bound(max);b++){
if(b->first==max){
if(tc>topCnt-1){
break;
}
tc++;
stringstr; //The string
ostringstream temp; //temp as in temporary
temp<<max;
str=temp.str();
outFname+=”<“+b->second.first+”, “+b->second.second+”> : “+str+”\n”;
}
}
for(t=freqWordpair_multimap.lower_bound(min);t!=freqWordpair_multimap.upper_bound(min);t++){
if(t->first==min){
if(bc>botCnt-1){
break;
}
bc++;
stringstr; //The string
ostringstream temp; //temp as in temporary
temp<<min;
str=temp.str();
outFname+=”<“+t->second.first+”, “+t->second.second+”> : “+str+”\n”;
}
}*/
//The string
ofstream out(“output.txt”);
out<<outFname;
out.close();
topCnt = max;
botCnt = min;
}
fileIOs_wordPairs.h
#ifndef FILEIOS_WORDPAIRS_H
#define FILEIOS_WORDPAIRS_H
#include <string>
#include <vector>
#include <utility>
#include <map>
#include <set>
#include <algorithm>
using namespace std;
boolisDelim(char c);
boolis_number(conststd::string& s);
bool tokenize(string s,set<string>& tokens);
boolsentenceSplitter(string s, vector<string>& v);
boolwordpairMapping( vector<string>& sentences, map<pair<string,string>,int>&wordpairFreq_map);
boolfreqWordpairMmap(map< pair<string,string>, int>&wordpairFreq_map,multimap<int, pair<string, string>>&freqWordpair_mmap );
voidprintWordpairs(multimap<int, pair<string, string>>&freqWordpair_multimap, string outFname, inttopCnt, intbotCnt);
#endif
fileIOs_wordPairs_main.cpp
#include “fileIOs_wordPairs.h”
#include <iostream>
#include <fstream>
#include <vector>
#include <sstream>
#include <utility>
#include <algorithm>
#include <map>
#include <set>
#include <string>
using namespace std;
int main() {
//g++ fileIOs_wordPairs.cpp fileIOs_wordPairs_main.cpp -Wall -o a.out
vector<string> v;
stringstr(“easy.txt”);
sentenceSplitter(str,v);
//cout<<v.size()<<endl;
vector<string>::iterator i;
/*for(i = v.begin();i!=v.end();i++){
cout<<*i<<endl;
}*/
map<pair<string,string>,int>wpfm;
wordpairMapping(v,wpfm);
map<pair<string,string>,int>:: iterator it;
/*for(it = wpfm.begin();it!=wpfm.end();it++){
cout<<it->first.first<<” “<<it->first.second<<” “<<it->second<<endl;
}*/
cout<<wpfm.size()<<endl;
multimap<int,pair<string,string>> m;
freqWordpairMmap(wpfm,m);
printWordpairs(m,””,5,5);
return 0;
}
test (1).cpp
#include <iostream>
#include <fstream>
#include <vector>
#include <sstream>
#include <utility>
#include <map>
#include <set>
using namespace std;
boolisDelim(char c){
stringdelim = ” ,[];\””;
for(int i=0;i<delim.size();i++){
if(delim[i]==c){
return true;
}
}
return false;
}
boolisD(char c){
stringdelim = “.?”;
for(int i=0;i<delim.size();i++){
if(delim[i]==c){
return true;
}
}
return false;
}
bool tokenize(string s,set<string>& tokens){
stringstreamstr(s);
char c;
string word;
if(!str){
return false;
}
while(str){
word.clear();
while(!isDelim(c=str.get()) && c != EOF){
word.push_back(c);
}
//cout<<word<<endl;
if(word!=””){
tokens.insert(word);
}
//cout<<tokens.size()<<endl;
while(str&&isDelim(c=str.get()));
if (c != EOF) str.unget();
}
return true;
}
boolsentenceSplitter(string s, vector<string>& v){
string::size_typepos = 0; // Must initialize
while ( ( pos = s.find (“\n”,pos) ) != string::npos )
{
s.erase( pos, 2 );
}
stringstreamstr(s);
char c;
string word;
if(!str){
return false;
}
while(str){
word.clear();
while(!isD(c=str.get())){
word.push_back(c);
}
if(word[0]==’ ‘){
word.erase(0,1);
}
v.push_back(word);
while(isD(c=str.get()));
if (c != EOF) str.unget();
}
return true;
}
boolwordpairMapping( vector<string>& sentences, map<pair<string,string>,int>&wordpairFreq_map){
vector<string> :: iterator i;
if(sentences.size()==0){
return false;
}
for(i = sentences.begin();i!=sentences.end();++i){
set<string> tokens;
tokenize(*i,tokens);
for(set<string>:: iterator j = tokens.begin();j!= tokens.end();j++){
for(set<string>:: iterator k = tokens.begin();k!= tokens.end();k++){
if(*j!=*k){
pair<string,string> p;
if(*j<*k){
p.first = *j;
p.second = *k;
}
else{
p.first = *k;
p.second = *j;
}
if(wordpairFreq_map.count(p)>0){
wordpairFreq_map[p]++;
}
else{
wordpairFreq_map[p]=1;
}
}
}
}
}
return true;
}
boolfreqWordpairMmap(map< pair<string,string>, int>&wordpairFreq_map,multimap<int, pair<string, string>>&freqWordpair_mmap ){
map<pair<string,string>,int>:: iterator it;
for(it=wordpairFreq_map.begin();it!=wordpairFreq_map.end();it++){
pair<int,pair<string,string>> p;
p.first = it->second;
p.second.first = it->first.first;
p.second.second = it->first.second;
freqWordpair_mmap.insert(p);
}
return true;
}
voidprintWordpairs(multimap<int, pair<string, string>>&freqWordpair_multimap, string outFname, inttopCnt, intbotCnt){
int max=0;
int min=10000000;
multimap<int, pair<string, string>>:: iterator it;
for(it = freqWordpair_multimap.begin();it!= freqWordpair_multimap.end();it++){
if(it->first>max){
max = it->first;
}
if(it->first<min){
min = it->first;
}
}
multimap<int, pair<string, string>>:: iterator b;
multimap<int, pair<string, string>>:: iterator t;
for(b=freqWordpair_multimap.lower_bound(max);b!=freqWordpair_multimap.upper_bound(max);b++){
if(b->first==max){
stringstr; //The string
ostringstream temp; //temp as in temporary
temp<<max;
str=temp.str();
outFname+=”<“+b->second.first+”, “+b->second.second+”> : “+str+”\n”;
}
}
for(t=freqWordpair_multimap.lower_bound(min);t!=freqWordpair_multimap.upper_bound(min);t++){
if(t->first==min){
stringstr; //The string
ostringstream temp; //temp as in temporary
temp<<min;
str=temp.str();
outFname+=”<“+t->second.first+”, “+t->second.second+”> : “+str+”\n”;
}
}
ofstream out(“output.txt”);
out<<outFname;
out.close();
topCnt = max;
botCnt = min;
}
int main() {
ifstreammyfile(“sample.txt”);
stringstr( (std::istreambuf_iterator<char>(myfile)), (std::istreambuf_iterator<char>()));
vector<string> v;
if(sentenceSplitter(str,v)){
map<pair<string,string>,int>wpfm;
wordpairMapping(v,wpfm);
cout<<wpfm.size()<<endl;
map<pair<string,string>,int>:: iterator it;
/*for(it=wpfm.begin();it!=wpfm.end();it++){
cout<<“<“<<it->first.first<<“, “<<it->first.second<<“>”<<” : “<<it->second<<endl;
}*/
multimap<int,pair<string,string>> m;
stringoutFname = “”;
intbotCnt=0;
inttopCnt=0;
freqWordpairMmap(wpfm,m);
printWordpairs(m,outFname,topCnt,botCnt);
cout<<“here”<<endl;
cout<<outFname.length()<<endl;
}
}