Natural Language Processing with NLTK, Regular Expressions
Natural Language Engineering
Part 1: Tokenization, Part-of-Speech Tagging
Identify all tokens in the following text (which appeared on the Private Eye Web site). Use one of the NLTK tokenizers to perform this task.
A Message From The Headmaster
Kow tow!
That’s the Chinese for “hello”, as I’ve learnt this week, because those were the first words said to me by the Head of our Chinese sister academy, the Tiananmen Not At All Free School.
As you’ve probably gathered, we were rolling out the school “red”(!) carpet this week for Mr Xi Sho-pping, who was on a visit to try and buy as much of the school as we could sell him.
…
Mr Sho-pping then visited the old boiler room and we signed an agreement for him to build not only a new nuclear boiler to replace it, but to build another two boilers, just in case the other one blows up, which it won’t of course. Even better, all the boilers will be run entirely by himself and members of the Beijing Nuclear Intelligence Services Department, who are delighted to be given the chance of experimenting with untried nuclear technology in someone else’s school. How exciting is that! I can’t have been the only one to feel a warm glow at the thought of so much radioactivity at the very heart of the school. Who knows, by this time next year I could be the Two-head-master! (Finkelstein, D., you’re on fire – as is the boiler room!)
D.C.
Assign part-of-speech tags to all tokens in the text used above.
Use one of the implemented taggers in NLTK to do this.
Identify possible tagging errors.
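Before reaching for NLTK, it can help to see what a tokenizer actually has to do. The sketch below is a deliberately naive regex tokenizer of my own, not a substitute for `nltk.word_tokenize` (which, for instance, splits contractions such as "That's" into "That" and "'s"); it only illustrates the word/punctuation split the task is about.

```python
import re

def naive_tokenize(text):
    # Words (keeping internal apostrophes attached) or single punctuation
    # marks. NLTK's tokenizers handle many more cases than this.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

print(naive_tokenize('Kow tow! That\'s the Chinese for "hello".'))
# ['Kow', 'tow', '!', "That's", 'the', 'Chinese', 'for', '"', 'hello', '"', '.']
```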
Part 2: Regular Expressions, FSAs, and FSTs
In this part of the assignment you will do some simple information extraction, namely the identification of amounts of money in text.
• Write a regular expression that can find all amounts of money in a text. Your expression should be able to deal with different formats and currencies, for example £50,000 and £117.3m as well as 30p, 500m euro, 338bn euros, $15bn and $92.88. Make sure that you can at least detect amounts in Pounds, Dollars and Euros. For full marks: include the output of a Python program that applies your regular expression to the following BBC News Web site:
http://www.bbc.co.uk/news/business-41779341
The output should clearly identify what currency and what amount has been recognised, e.g., if your input text contains the string $20m, then your output could look like this:
Found a match!
Currency: Dollar
Amount: 20m
• Write an FSA equivalent to the regular expression you just wrote. You can either use a drawing program or write down a transition table.
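If a transition table is preferred over a drawing, it can also be prototyped directly in Python. The sketch below covers only a simplified fragment of the full pattern (a currency symbol, one or more digits, then an optional one-letter suffix); the state names q0–q3 and the input classes sym/dig/suf are my own labels, and extending the table to commas, decimals and the bn suffix follows the same scheme.

```python
# States: q0 start, q1 after symbol, q2 after digits (accepting),
# q3 after suffix (accepting). Missing entries mean the FSA rejects.
transitions = {
    ("q0", "sym"): "q1",
    ("q1", "dig"): "q2",
    ("q2", "dig"): "q2",
    ("q2", "suf"): "q3",
}
accepting = {"q2", "q3"}

def classify(ch):
    # Map an input character onto one of the three input classes.
    if ch in "£$€":
        return "sym"
    if ch.isdigit():
        return "dig"
    if ch in "mp":
        return "suf"
    return None

def accepts(s):
    state = "q0"
    for ch in s:
        state = transitions.get((state, classify(ch)))
        if state is None:
            return False
    return state in accepting

print(accepts("£50"))   # True
print(accepts("$20m"))  # True
print(accepts("20m"))   # False (no currency symbol)
```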
Part 3: Write your own version of ELIZA
Implement an ELIZA-like program, using substitutions as used in the lecture notes. Your version of ELIZA should not be a psychologist but a system that generates answers to commonly asked questions related to flights to Paris and Berlin. An example conversation could go like this:
User: I want to fly to Paris next month.
ELIZA: We apologise, but all our services to Paris next month have been cancelled.
User: How about to Berlin on December 31st?
ELIZA: We are sorry, but all our flights on December 31st to Berlin are fully booked.
User: What would be the first available flight to Paris next year?
ELIZA: There are no available flights next year to Paris. Please accept our apologies …
You can have a simple command-line script that reads a line at a time and responds to it. It then waits for the next line.
Solution
part1.py
import nltk

def main():
    # the sample text is saved locally as a UTF-8 file
    fo = open("sample.txt", "r", encoding="utf-8", errors="ignore")
    data = fo.read()
    fo.close()
    tokens = nltk.word_tokenize(data)
    print("\n tokens\n")
    print(tokens)
    print("\n pos tags \n")
    posTagged = nltk.pos_tag(tokens)
    print(posTagged)

main()
# POS tagger errors
# Sample input 1: "The quick brown fox jumps over the lazy dog"
# The POS tagger returned [('The', 'DT'), ('quick', 'NN'), ('brown', 'NN'), ('fox', 'NN'),
# ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'NN'), ('dog', 'NN')]
# ('brown', 'NN') is wrong; ('brown', 'JJ') is the correct tag.
# Sample input 2: "a woman needs a man like a fish needs a bicycle"
# Output: [('a', 'DT'), ('woman', 'NN'), ('needs', 'VBZ'), ('a', 'DT'), ('man', 'NN'), ('like', 'IN'), ('a', 'DT'),
# ('fish', 'JJ'), ('needs', 'VBZ'), ('a', 'DT'), ('bicycle', 'NN')]
# ('fish', 'JJ') should be ('fish', 'NN').
part2.py
import nltk
import re

def any_curr(s, curr="¥$€£"):
    return any(c in s for c in curr)

def main():
    fo = open("sample.txt")
    st = fo.read()
    st = st.lower()
    tok = nltk.word_tokenize(st)
    for i in range(0, len(tok)):
        # scan each token with a regex to detect a currency pattern
        se = re.search(r'(?P<currency>[£\$€])?(?P<value>\d+([,]\d+)?(\.\d+)?)(?P<abb>([pm](bn)?)|[pm]?(bn))?', tok[i])
        # each match yields a dictionary with 1. currency symbol,
        # 2. numeric value and 3. suffix as separate named groups
        if se:
            d = se.groupdict()
            # branch on the different combinations of the three groups
            if d["currency"] is not None or d["abb"] is not None:
                if not d["currency"]:
                    # no symbol attached to the token itself
                    if i > 0 and tok[i-1] in ['£', '$', '€']:
                        # the symbol was tokenized separately, one token back
                        if tok[i-1] == '£':
                            curr = "Pounds"
                        elif tok[i-1] == '$':
                            curr = "Dollars"
                        else:
                            curr = "Euros"
                        print("Found a match!")
                        print("Currency: " + curr)
                        print("Amount: " + d["value"] + d["abb"])
                    elif d["abb"] == 'p':
                        # pence amounts such as 30p imply Pounds
                        print("Found a match!")
                        print("Currency: Pounds")
                        print("Amount: " + d["value"] + d["abb"])
                    elif i < len(tok) - 1:
                        # the currency may be spelled out in the next token
                        s = re.search(r'[eE]uro(s)?|[pP]ound(s)?|[dD]ollar(s)?', tok[i+1])
                        if s:
                            curr = s.group()
                            print("Found a match!")
                            print("Currency: " + curr)
                            print("Amount: " + d["value"] + d["abb"])
                elif d["abb"] is None:
                    # currency symbol but no suffix, e.g. $92.88
                    if d["currency"] == '£':
                        curr = "Pounds"
                    elif d["currency"] == '$':
                        curr = "Dollars"
                    else:
                        curr = "Euros"
                    print("Found a match!")
                    print("Currency: " + curr)
                    print("Amount: " + d["value"])
                else:
                    # currency symbol and suffix, e.g. $15bn
                    if d["currency"] == '£':
                        curr = "Pounds"
                    elif d["currency"] == '$':
                        curr = "Dollars"
                    else:
                        curr = "Euros"
                    print("Found a match!")
                    print("Currency: " + curr)
                    print("Amount: " + d["value"] + d["abb"])
            else:
                # bare number: check whether the previous token is a symbol
                if i > 0 and tok[i-1] in ['£', '$', '€']:
                    if tok[i-1] == '£':
                        curr = "Pounds"
                    elif tok[i-1] == '$':
                        curr = "Dollars"
                    else:
                        curr = "Euros"
                    print("Found a match!")
                    print("Currency: " + curr)
                    print("Amount: " + d["value"])

main()
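The token-by-token scan above has to stitch the currency symbol or word back together from neighbouring tokens. An alternative sketch (my own, not part of the original solution) runs one regex straight over the raw text with `re.finditer`; the group names mirror those used in part2.py.

```python
import re

PATTERN = re.compile(
    r"(?P<currency>[£$€])?(?P<value>\d+(?:,\d+)?(?:\.\d+)?)"
    r"(?P<abb>p|m|bn)?"
    r"(?:\s*(?P<word>[Ee]uros?|[Pp]ounds?|[Dd]ollars?))?"
)

SYMBOLS = {"£": "Pounds", "$": "Dollars", "€": "Euros"}

def find_money(text):
    # Yield (currency, amount) pairs for money mentions in raw text.
    for m in PATTERN.finditer(text):
        amount = m.group("value") + (m.group("abb") or "")
        if m.group("currency"):
            yield SYMBOLS[m.group("currency")], amount
        elif m.group("word"):
            yield m.group("word").capitalize(), amount
        elif m.group("abb") == "p":
            # pence amounts such as 30p imply Pounds
            yield "Pounds", amount

for curr, amount in find_money("It cost £117.3m, about 30p per share, or 500m euros."):
    print("Found a match!")
    print("Currency:", curr)
    print("Amount:", amount)
```

Because the whole string is scanned at once, cases such as "500m euros" need no look-ahead to the next token.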
part3.py
import re
import random
reflections = {
    "am": "are",
    "was": "were",
    "i": "you",
    "i'd": "you would",
    "i've": "you have",
    "i'll": "you will",
    "my": "your",
    "are": "am",
    "you've": "I have",
    "you'll": "I will",
    "your": "my",
    "yours": "mine",
    "you": "me",
    "me": "you"
}
slaybot = [
    (r'((?P<nmo>next month)|(?P<nye>next year)|(?P<nwe>next week)|(?P<daft>day after)|(?P<tod>today)|(?P<tom>tomorrow))',
     "We apologise, but all our services to Paris {} have been cancelled."),
    (r'((?P<jan>[jJ]anuary)|(?P<feb>[fF]ebruary)|(?P<mar>[mM]arch)|(?P<apr>[aA]pril)|(?P<may>[mM]ay)|(?P<jun>[jJ]une)|(?P<jul>[jJ]uly)|(?P<aug>[aA]ugust)|(?P<sep>[sS]eptember)|(?P<oct>[oO]ctober)|(?P<nov>[nN]ovember)|(?P<dec>[dD]ecember)) (((?P<second>[1-3])(?P<third>[0-9]))|(?P<first>[1-9]))',
     "We apologise, but all our services to Paris {} {}{} have been cancelled."),
    (r'(?P<quit>quit)',
     "Thank you for talking with me.",
     "Good-bye.",
     "Thank you, that will be $150. Have a good day!"),
    (r'(.*)',
     "Sorry, I can't understand you. Please enter an enquiry.")
]
def analyze(statement):
    for i in range(0, len(slaybot)):
        pattern = slaybot[i][0]
        response = slaybot[i][1]
        match = re.search(pattern, statement.rstrip(".!"))
        if match:
            d = match.groupdict()
            if i == 0:
                # relative dates: return whichever named group matched
                for key in ("nwe", "daft", "tom", "nmo", "tod", "nye"):
                    if d[key] is not None:
                        return response.format(d[key])
            elif i == 1:
                # absolute dates: find the matched month, then format the
                # day from the <second>/<third> groups (two digits) or the
                # <first> group (single digit, zero-padded)
                for key in ("jan", "feb", "mar", "apr", "may", "jun",
                            "jul", "aug", "sep", "oct", "nov", "dec"):
                    if d[key] is not None:
                        if d["first"] is not None:
                            return response.format(d[key], 0, d["first"])
                        return response.format(d[key], d["second"], d["third"])
            elif i == 2:
                # pick one of the three farewell responses at random
                return random.choice(slaybot[i][1:])
            elif i == 3:
                return response
def main():
    print("Hello. Please type your enquiry.")
    while True:
        statement = input("> ")
        print(analyze(statement))
        if statement == "quit":
            break

if __name__ == "__main__":
    main()
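One loose end: the reflections table in part3.py is defined but never used. In a classic ELIZA it is applied to the fragment of user input that gets echoed back, swapping first- and second-person forms. A minimal sketch of that step, using an abbreviated copy of the table so it runs standalone:

```python
# Abbreviated copy of the reflections table from part3.py.
reflections = {"i": "you", "my": "your", "am": "are", "you": "me", "your": "my"}

def reflect(fragment):
    # Lower-case the fragment, then swap each token found in the table.
    tokens = fragment.lower().split()
    return " ".join(reflections.get(tok, tok) for tok in tokens)

print(reflect("I want my flight"))  # prints: you want your flight
```

A fuller version would plug `reflect` into the response templates, e.g. echoing the destination phrase the user typed.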