====== Júlia Raíces's Final Project ======
First, here are the files: \\
- the function: {{:bie5782:01_curso_atual:alunos:trabalho_final:julia.raices:funcao_julia.r|funcao_final.R}}\\
- the help text: {{:bie5782:01_curso_atual:alunos:trabalho_final:julia.raices:help_julia.txt|help_funcao_final.txt}}\\
- and the possible input files: \\
* the list of words and scores, which can be downloaded [[ https://drive.google.com/file/d/0B-WbKmjtS19Hd3JlbzE4SEFFTnM/view?usp=sharing | here]] \\
* a real text to be analysed: {{:bie5782:01_curso_atual:alunos:trabalho_final:julia.raices:stoya.txt|stoya.txt}}\\
* a made-up text to be analysed (a "text" built from scored words that makes no sense): {{:bie5782:01_curso_atual:alunos:trabalho_final:julia.raices:test.txt|test.txt}}\\
===== The Function =====
########## Function to perform a sentiment analysis of a given text ##########
####### Author: Julia Raices - noUSP: 6802291 #######
sentiment.analysis <- function(your.text, word.list.and.scores="WordScores.txt"){ # this line defines the function name and the files it will use.
unclean.text <- scan(your.text, character(0), quote="") # separates each word from the file (according to the R wiki ^^), and it actually does so! text[n] stores the Nth word of the text =) quote="" keeps quotation marks from gluing several words into one
text <- gsub("[^[:alpha:][:space:]'’]", "", unclean.text) # Vivi helped me on this one... I had to remove all punctuation except for the apostrophes, so here I only keep the alphabet characters, the spaces and the apostrophes
text <- tolower(text) # makes everything lower case, so there is no mismatch because of case
score.text=0 # sets the text score to zero before counting starts
positive=0 # sets the number of positive words to zero before counting starts
negative=0 # sets the number of negative words to zero before counting starts
neutral=0 # sets the number of neutral words to zero before counting starts
leng <- length(text) # gets and stores the size (in words) of your file
LIST <- read.table(word.list.and.scores, sep="\t", header=FALSE, strip.white=TRUE, blank.lines.skip=TRUE, col.names=c("words", "scores"), as.is=TRUE, quote="") # opens your score file if you gave one, and the default file if you didn't; quote="" makes apostrophes in entries such as "can't stand" be read as plain characters
len <- length(LIST$words) # gets the length (number of words) of the word score file
LIST$word <- tolower(LIST$words) # also makes sure the words in the list are all lower case, so there is no mismatch because of case.
for(j in seq_len(leng)){ # for each word position in the text do the following:
hit <- NA # row of the list matched by the word/expression starting at position j (NA while nothing has matched)
for(n in 4:1){ # tries the longest expression first (four words), then three, two and finally the single word
if(is.na(hit) && (j+n-1) <= leng){ # only goes on if nothing matched yet and the expression fits inside the text
words.n <- text[j:(j+n-1)] # the word at position j plus the n-1 words that follow it in the text
candidates <- c(paste(words.n, collapse=" "), paste(words.n, collapse="-")) # the expression with its words separated by spaces and by dashes
hit <- which(LIST$word %in% candidates)[1] # first row of the list that holds the expression (NA if there is none)
}# closes bracket from the "if nothing matched yet"
}# closes bracket from the "for each expression length"
if(!is.na(hit)){ # if the word/expression starting at position j is in the list...
score.text <- score.text + LIST$scores[hit] # increases the text score by the word/expression score.
if(LIST$scores[hit] > 0){ # checks if the word/expression is positive
positive <- positive + 1 # if it's positive, adds one to the positive-words counter
}# closes bracket from if the word is positive
else if(LIST$scores[hit] < 0){ # checks if the word/expression is negative
negative <- negative + 1 # if it's negative, adds one to the negative-words counter
}# closes bracket from if the word is negative
}# closes bracket from the "if the word/expression is in the list"
}# closes bracket from the "for each word position in the text"
vetor <- c(rep("positive", positive), rep("negative", negative))#creates a vector with the positive and negative counts
vetor <- factor(vetor, levels=c("negative", "positive")) # makes this vector into a factor vector
dev.new() # opens new graphic device
barplot(prop.table(table(vetor)), main="Relative frequency of categorized\n words in each sentiment category", xlab="Sentiment category", ylab="Relative frequency", col=c("indianred1", "lightskyblue3")) # creates a barplot with the frequencies of each category (positive or negative)
neutral <- leng - (positive+negative) # since all the words that are neither positive nor negative are neutral (or uncategorized, which will be treated as neutral), here we have all the neutral words =)
vetorn <- c(rep("positive", positive), rep("negative", negative), rep("neutral", neutral))#creates a vector with the positive, negative and neutral counts
vetorn <- factor(vetorn, levels=c("negative", "neutral", "positive")) # makes this vector into a factor vector
dev.new() #opens new graphic device
barplot(prop.table(table(vetorn)), main="Relative frequency of all words\n in each sentiment category", xlab="Sentiment category", ylab="Relative frequency", col=c("indianred1", "lightgreen", "lightskyblue3")) # creates a barplot with the frequencies of each category (positive or negative or neutral)
return(score.text) # returns the sum of the scores from the words of the text
} # closes the brackets of the function
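The cleaning step at the top of the function can be tried on a string held in memory instead of a file. The sketch below is only an illustration: the sample sentence and the names ''sample.text'' and ''tokens'' are made up, and for simplicity the pattern keeps only the plain apostrophe (the function also keeps the typographic one).

```r
# Made-up sample sentence (an assumption, not taken from stoya.txt or test.txt)
sample.text <- "I LOVE this, but I can't stand that!"

# scan() splits on white space; text= reads from a string instead of a file,
# and quote = "" makes quotation marks be read as ordinary characters
tokens <- scan(text = sample.text, what = character(0), quote = "", quiet = TRUE)

# Keep only letters, spaces and apostrophes, then lower-case everything
tokens <- tolower(gsub("[^[:alpha:][:space:]']", "", tokens))

print(tokens)
```

Punctuation glued to a word is dropped, while the apostrophe inside "can't" survives, which is what lets entries with apostrophes in the word list still match.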
===== The Help Text =====
sentiment.analysis package:none R Documentation
Gives a text a sentiment score equal to the sum of the scores of all
scored words/expressions found in it.
Description:
sentiment.analysis receives a text and a table of words
and their scores (if none is provided, the function will
look for the "WordScores.txt" file provided with this
function). Each word and each expression of up to 4 words
from the text is looked up in the table, and the scores of
all words/expressions found are added to get the text score.
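A minimal sketch of the expression search described above, assuming an invented three-entry score table and an invented, already cleaned text. The helper ''expressions'' is a hypothetical name, and this sketch simply adds every match; how overlapping matches (such as "good" inside "not good") should be treated is a design choice.

```r
# Invented mini score table (assumption: not the real WordScores.txt contents)
scores <- c("good" = 3, "not good" = -2, "bad" = -3)

# A made-up text, already cleaned and lower-cased
text <- c("this", "is", "not", "good", "but", "not", "bad")

# Hypothetical helper: every run of n consecutive words, joined by a space
expressions <- function(words, n) {
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts, function(j) paste(words[j:(j + n - 1)], collapse = " "),
         character(1))
}

# All expressions of 1 to 4 words found in the text
candidates <- unlist(lapply(1:4, function(n) expressions(text, n)))

# Keep the ones present in the table and add up their scores
found <- candidates[candidates %in% names(scores)]
text.score <- sum(scores[found])

print(found)
print(text.score)
```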
Usage:
sentiment.analysis(your.text, word.list.and.scores)
## Default:
sentiment.analysis(your.text, word.list.and.scores="WordScores.txt")
Arguments:
your.text: character. The name of a file holding the text
to be analysed, in UTF-8 encoding and in
English. It is usually a .txt file, but the
address of a web page can also be given here.
word.list.and.scores: character. The name of a file holding
a table with the words/expressions in
the first column and their scores in
the second column. The separator must
be a tab ("\t") and there should
be no header in the file.
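To make the expected file layout concrete, here is a small sketch that writes an invented word list to a temporary file and reads it back. The entries and scores are made up, ''list.file'' and ''word.list'' are illustrative names, and ''quote = ""'' is used here so that the apostrophe in an entry is read as a plain character.

```r
# Write an invented three-entry word list to a temporary file
# (assumption: these entries and scores are made up for the example)
list.file <- tempfile(fileext = ".txt")
writeLines(c("good\t3", "can't stand\t-3", "not good\t-2"), list.file)

# Read it back as a tab-separated table without a header
word.list <- read.table(list.file, sep = "\t", header = FALSE,
                        col.names = c("words", "scores"),
                        as.is = TRUE, strip.white = TRUE, quote = "")

str(word.list)

# Remove the temporary file
unlink(list.file)
```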
Value:
The function returns the value of the text score (sum of all
word/expression scores) and draws two bar plots: one with the
relative frequency of positive and negative words (among the
categorized words only) and one with the relative frequency of
positive, negative and neutral words (among all words from the
text, where uncategorized words are considered neutral).
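The two sets of relative frequencies can be computed without drawing anything. The sketch below assumes invented counts (they are not results from any real text) and uses illustrative names such as ''total.words'' and ''prop.all''.

```r
# Invented counts (assumption: not results from any real text)
positive <- 6
negative <- 2
total.words <- 20
neutral <- total.words - (positive + negative)  # everything left is neutral

# One label per word, with a fixed order of categories
categories <- factor(c(rep("positive", positive),
                       rep("negative", negative),
                       rep("neutral", neutral)),
                     levels = c("negative", "neutral", "positive"))

# Relative frequencies among the categorized words only
categorized <- droplevels(categories[categories != "neutral"])
prop.categorized <- prop.table(table(categorized))

# Relative frequencies among all words of the text
prop.all <- prop.table(table(categories))

print(prop.categorized)
print(prop.all)
```

Passing either table to ''barplot()'' gives the corresponding plot.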
Warning:
The function is not really fast: the bigger the text and
your word list, the longer it will take. With the
default word list even small texts may take a while to be
processed. Remember that R demands quite a bit of your
RAM, so it may be a good idea to make a camomile
tea while you wait for the function to run, especially if you
are using large data.
Also, with little RAM and large texts a few warnings
may appear, but the program usually works
out just fine. Just don't forget the camomile tea.
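Comparing every word of the text with every word of the list one pair at a time is a likely source of the waiting. As a hedged alternative covering single words only (not multi-word expressions), ''match()'' does the whole lookup in one call. The word list and text below are invented for the example.

```r
# Invented word list and text (assumption: made up for this example)
words  <- c("good", "bad", "happy")
scores <- c(3, -3, 3)
text   <- c("a", "good", "day", "after", "a", "bad", "week", "good")

# Position of each text word in the list; NA where the word is not listed
idx <- match(text, words)

# Score of each text word (NA for unlisted words)
word.scores <- scores[idx]

# Totals, ignoring the unlisted words
text.score <- sum(word.scores, na.rm = TRUE)
positive   <- sum(word.scores > 0, na.rm = TRUE)
negative   <- sum(word.scores < 0, na.rm = TRUE)

print(c(score = text.score, positive = positive, negative = negative))
```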
Author:
Júlia Beck Raíces
nºUSP: 6802291
julia.raices@gmail.com
juliar@riseup.net
fingerprint: BF75 AF9A 1232 DFF6 0189 5D72 7877 3E81 1433 5F11
Thanks:
Special thanks to Viviane Santos, who helped me with the idea
of the function and with the references, and to Chalom, who
always helps me with the constant despair of computer programming.
References:
- The word list and scores were obtained (and slightly modified) from:
"Lars Kai Hansen, Adam Arvidsson, Finn Årup Nielsen, Elanor Colleoni,
Michael Etter, "Good Friends, Bad News - Affect and Virality in
Twitter", The 2011 International Workshop on Social Computing,
Network, and Services (SocialComNet 2011)."
- The "stoya.txt" test file is the text "Sigh" by Stoya,
obtained from her blog at: http://graphicdescriptions.com/11-sigh
See Also:
- Bo Pang and Lillian Lee, "Opinion Mining and Sentiment Analysis",
Foundations and Trends in Information Retrieval, Vol 2 (2008).
- SentiWordNet ( http://sentiwordnet.isti.cnr.it/ )
- Stanford's Sentiment Analysis website
( http://nlp.stanford.edu/sentiment/ )
Examples:
# Download both the "WordScores.txt" file and the "stoya.txt"
## file at http://tinyurl.com/q353stg
sentiment.analysis("stoya.txt", "WordScores.txt")
# gives the text score and the graphs
# Download both the "WordScores.txt" file and the "test.txt"
## file at http://tinyurl.com/q353stg
sentiment.analysis("test.txt")
# gives the score and graphs for another text (in this case
## a made-up text of scored words). Notice that when
## word.list.and.scores is not given the function automatically
## uses the "WordScores.txt" file.