What you need to master before sentiment analysis
Regex
Introduction to the stringr package
Text mining
Introduction to NLP
Introduction to transformers
What’s the point?
The topics discussed today are the foundation for a LOT of cool and relevant work these days, such as text classification, sentiment analysis, and language models (ChatGPT is a large language model).
Since drawing conclusions and insights from an organized table is called data mining, we will treat text mining as data mining applied to text.
Image credit: GeeksforGeeks.
It is worth highlighting the differences between a tibble, which we will use here, and a data frame.
A tibble is a tidier and safer version of a data frame, with features that prevent errors, such as:
More differences here.
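A minimal sketch of two tibble safeguards, using a hypothetical two-column table (the column names `value` and `name` are invented for illustration and are not from the HPD data):

```r
library(tibble)

# Hypothetical toy data, built both ways
df <- data.frame(value = 1:3, name = c("a", "b", "c"))
tb <- tibble(value = 1:3, name = c("a", "b", "c"))

# Single-column subsetting: a data.frame silently drops to a vector,
# while a tibble always stays a tibble.
class(df[, "value"])
class(tb[, "value"])

# Partial name matching: `$` on a data.frame completes "val" to "value",
# while a tibble returns NULL (with a warning) instead of guessing.
df$val
suppressWarnings(tb$val)
```

Both behaviours make typos and unexpected type changes surface earlier when working with tibbles.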
Today's dataset is called Harry Potter Dialogue (HPD), a dataset with 1042 dialogues from the Harry Potter book series. This corresponds to the training set provided by the site; 149 dialogues are in the test set and will not be used here. The site, HPD, is maintained by Nuo Chen and Yan Wang.
The stringr library is already included in the tidyverse, so we can go straight to the code and start understanding what the following are:
a,5,#,&, etc
.
\x
()
+
*
?
A metacharacter that can stand for any character
# credit for the inspiration behind this example: https://www.devsv.com.br/regex/2017/03/31/falando-sobre-expressoes-regulares-primeiros-metacaracteres.html
acentos <- c(
'eu nao curto nao.',
'eu acho inconcebível não gostar de acentos!'
)
str_view(acentos,'n.o')
[1] │ eu <nao> curto <nao>.
[2] │ eu acho i<nco>ncebível <não> gostar de ace<nto>s!
Two interesting things happen here. The '.' in the regex acts as a wildcard rather than matching the literal '.' in the sentence, and the matches 'nco' and 'nto' are not what we are after. How can we match a literal '.', or make sure we match only 'nao' or 'não'?
Let's park the first problem and focus on matching 'nao' and 'não':
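One way to restrict the match, sketched under the assumption that the original slide used something similar (the exact pattern is not shown in the source):

```r
library(stringr)

acentos <- c(
  "eu nao curto nao.",
  "eu acho inconcebível não gostar de acentos!"
)

# Alternation inside a group: "n", then either "a" or "ã", then "o".
# This no longer matches "nco" or "nto".
str_extract(acentos, "n(a|ã)o")

# A character class is an equivalent way to accept exactly one of the listed characters.
str_extract(acentos, "n[aã]o")
```

Note that `str_extract()` returns at most one match per string.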
But wait, something is off. The first sentence has more than one 'nao'. This is a pitfall that brings us to:
Notice anything odd about the 'nao' match? Many stringr functions have a version ending in _all, and you NEED to choose it if you want to find every match in the string.
Try looking up 'all' under stringr:: to see which functions can spring this trap on you.
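A short sketch of the difference, plus one possible way to list the `_all` variants programmatically (using base R's `getNamespaceExports()`, which is a choice made here and not necessarily what was shown live):

```r
library(stringr)

frase <- "eu nao curto nao."

# str_extract() stops at the first match...
primeiro <- str_extract(frase, "nao")
primeiro

# ...while str_extract_all() returns every match
# (a list with one character vector per input string).
todos <- str_extract_all(frase, "nao")
todos

# Exported stringr functions whose names end in "_all"
funcoes_all <- grep("_all$", getNamespaceExports("stringr"), value = TRUE)
sort(funcoes_all)
```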
\
Remember the problem of matching the period? Here is the solution
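A minimal sketch of escaping, reusing the two sentences from the earlier example. In an R string the backslash itself must be escaped, so the regex `\.` is written `"\\."`:

```r
library(stringr)

acentos <- c(
  "eu nao curto nao.",
  "eu acho inconcebível não gostar de acentos!"
)

# Escaped: matches only a literal period
tem_ponto <- str_detect(acentos, "\\.")
tem_ponto

# fixed() is an alternative that treats the whole pattern literally
tem_ponto_fixed <- str_detect(acentos, fixed("."))

# Unescaped: "." is a wildcard, so any non-empty string matches
qualquer <- str_detect(acentos, ".")
qualquer
```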
Digit
and the one-or-more quantifier
Non-digit
and the one-or-more quantifier
?Non-digit
and the one-or-more quantifier
?[[1]]
[,1]
[1,] "My"
[2,] "favorite"
[3,] "is"
[4,] "42"
[5,] "or"
[6,] "12"
[[2]]
[,1]
[1,] "I"
[2,] "like"
[3,] "10"
[4,] "and"
[5,] "15"
[[3]]
[,1]
[1,] "Umm"
[2,] "33"
[3,] "and"
[4,] "35"
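The digit and non-digit classes can be sketched as follows. The input sentences are an assumption, loosely reconstructed from the output above, and the patterns are illustrative rather than the ones used on the slides:

```r
library(stringr)

# Hypothetical answers (assumed, not taken verbatim from the source)
respostas <- c("My favorite is 42 or 12", "I like 10 and 15")

# \d+ : one or more digits in a row
numeros <- str_extract_all(respostas, "\\d+")
numeros

# \D+ : one or more non-digit characters in a row (spaces included)
nao_numeros <- str_extract_all(respostas, "\\D+")
nao_numeros
```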
On to the first exercise! What would the regex look like?
You may be wondering what all of this is for. Here is one example: tokenizing text into smaller units. These smaller units are often words, so we can do the following:
Validating user input.
What happens if we replace 's' with 'S'?
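A minimal tokenization sketch, assuming the 's' and 'S' above refer to the whitespace class `\s` and its negation `\S` (the original code is not shown, so this is a guess at its shape):

```r
library(stringr)

frase <- "Harry woke with a start."

# Split on runs of whitespace: each piece is a token
tokens_split <- str_split(frase, "\\s+")[[1]]
tokens_split

# Swapping \s for \S flips the logic: extract runs of non-whitespace instead
tokens_extract <- str_extract_all(frase, "\\S+")[[1]]
tokens_extract
```

Both routes give the same tokens here; punctuation stays attached to the word ("start.") because only whitespace is used as the separator.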
Wildcard
and the one-or-more quantifier
Literals
+ wildcard
+ one or more
one or more
and zero or more
zero or one
[1] │ <organize>eeeee, <organização>, <organiz>
[1] │ <organizeeeeee>, <organização>, organiz
[1] │ <organizeeeeee>, <organização>, <organiz>
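The three quantifiers can be compared side by side on a small hypothetical vector (the patterns below are illustrative and not necessarily the ones that produced the output above):

```r
library(stringr)

palavras <- c("organiz", "organize", "organizeee")

# + : the preceding "e" must appear one or more times
um_ou_mais <- str_detect(palavras, "organize+")
um_ou_mais

# * : the preceding "e" may appear zero or more times
zero_ou_mais <- str_detect(palavras, "organize*")
zero_ou_mais

# ? : the preceding "e" may appear zero or one time
zero_ou_um <- str_extract(palavras, "organize?")
zero_ou_um
```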
^
[,1]
[1,] "caterpillar"
[2,] "catapult"
[3,] "cattle"
[4,] "cat"
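A small sketch of anchoring on a hypothetical word list: `^` ties the match to the start of the string, and its counterpart `$` ties it to the end.

```r
library(stringr)

animais <- c("cat", "cattle", "bobcat", "concatenate")

# Starts with "cat"
comeca <- str_detect(animais, "^cat")
comeca

# Ends with "cat"
termina <- str_detect(animais, "cat$")
termina

# Both anchors together: the whole string must be exactly "cat"
exato <- str_detect(animais, "^cat$")
exato
```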
$
- end of the sentence
Here are a few more, slightly more complex regex examples.
pattern = '.*(\\d{3}).*(\\d{3}).*(\\d{4})'
phone_numbers = c(
"(541) 471 3918",
"(603)281-0308",
"Home: 425-707-7220",
"(814)-462-8074",
"9704443106",
"I don't have a phone."
)
str_match(phone_numbers, pattern)
[,1] [,2] [,3] [,4]
[1,] "(541) 471 3918" "541" "471" "3918"
[2,] "(603)281-0308" "603" "281" "0308"
[3,] "Home: 425-707-7220" "425" "707" "7220"
[4,] "(814)-462-8074" "814" "462" "8074"
[5,] "9704443106" "970" "444" "3106"
[6,] NA NA NA NA
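To read the matrix above: `str_match()` puts the full match in the first column and one capture group per remaining column, and a string with no match yields a row of `NA`. A stripped-down sketch with a stricter, hypothetical pattern that only accepts dash-separated numbers:

```r
library(stringr)

entrada <- c("Home: 425-707-7220", "I don't have a phone.")

# Three capture groups: area code, prefix, line number
m <- str_match(entrada, "(\\d{3})-(\\d{3})-(\\d{4})")
m

# Column 1 is the whole match; columns 2 to 4 are the groups
m[1, 2:4]
```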
Now let's move on to the preprocessing steps that are often needed before a text dataset can be fed into more complex models.
stopwords_getsources returns the available sources. Use length to compare the size of the stopword lists and setdiff to compare their composition.
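A sketch of that comparison, assuming the stopwords package is installed and picking "stopwords-iso" as the second source (that choice is an assumption; any source that supports Portuguese would do):

```r
library(stopwords)

# Which sources does the package offer?
fontes <- stopwords_getsources()
fontes

# Portuguese stopwords from two different sources
sw_snowball <- stopwords(language = "pt", source = "snowball")
sw_iso <- stopwords(language = "pt", source = "stopwords-iso")

# Compare the sizes of the two lists
length(sw_snowball)
length(sw_iso)

# Compare the composition: words in one list but not in the other
head(setdiff(sw_iso, sw_snowball))
head(setdiff(sw_snowball, sw_iso))
```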
stopwords_pt <- stopwords::stopwords(language = 'pt', source = 'snowball')
sample_text <- c("O coda Amazônia está incrível!",
"Análise de sentimentos é uma parte importante da NLP.",
"Vamos remover as stopwords deste texto em português.")
# Create a Corpus
corpus <- Corpus(VectorSource(sample_text))
inspect(corpus)
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 3
[1] O coda Amazônia está incrível!
[2] Análise de sentimentos é uma parte importante da NLP.
[3] Vamos remover as stopwords deste texto em português.
remove_pt_stopwords <- function(text, stopwords) {
tokens <- unlist(strsplit(text, " "))
tokens <- tokens[!tokens %in% stopwords]
paste(tokens, collapse = " ")
}
# Apply function to the corpus
corpus_clean <- tm_map(corpus, content_transformer(remove_pt_stopwords), stopwords_pt)
inspect(corpus_clean)
<<SimpleCorpus>>
Metadata: corpus specific: 1, document level (indexed): 0
Content: documents: 1
[1] O coda Amazônia incrível! Análise sentimentos é parte importante NLP. Vamos remover stopwords deste texto português.
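Note that the cleaned corpus above reports a single document: the helper calls `unlist()` on the split tokens, so the three texts are merged into one string. It also leaves the capitalized "O" in place because the comparison is case-sensitive. A possible variant that keeps one string per document and compares in lower case is sketched below; the function name and the tiny stopword vector are hypothetical, chosen only for illustration.

```r
# Remove stopwords document by document (base R only)
remove_stopwords_each <- function(texts, stopwords) {
  vapply(
    strsplit(texts, " ", fixed = TRUE),
    function(tokens) {
      # compare in lower case so "O" is treated like "o"
      paste(tokens[!tolower(tokens) %in% stopwords], collapse = " ")
    },
    character(1)
  )
}

# Hypothetical mini stopword list and texts
mini_stopwords <- c("o", "de", "as")
textos <- c("O gato de casa", "Vamos remover as stopwords")

limpos <- remove_stopwords_each(textos, mini_stopwords)
limpos
```

Passed through `content_transformer()`, a vectorized helper like this should return one cleaned string per input document instead of a single merged one.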
Sometimes only the “root” of a word matters, for example if you are building a search model like Google. This also brings a processing advantage, since it reduces the number of distinct words in the training set.
exemplo<-"sê eu comesse açai com peixe, eu estaria comendo açai, mas comeria achando que é outra coisa, pois o açai que to acostumado não é comestível com peixe."
exemplo
[1] "sê eu comesse açai com peixe, eu estaria comendo açai, mas comeria achando que é outra coisa, pois o açai que to acostumado não é comestível com peixe."
[1] "se eu comesse açai com peixe, eu estaria comendo açai, mas comeria achando que e outra coisa, pois o açai que to acostumado não e comestivel com peixe."
exemplo_corpus <- VCorpus(VectorSource(exemplo))
exemplo_corpus <- tm_map(exemplo_corpus, content_transformer(tolower))
exemplo_corpus <- tm_map(exemplo_corpus, removePunctuation)
exemplo_corpus <- tm_map(exemplo_corpus, removeNumbers)
exemplo_corpus <- tm_map(exemplo_corpus, removeWords, stopwords("portuguese"))
exemplo_corpus <- tm_map(exemplo_corpus, stripWhitespace)
exemplo_corpus <- TermDocumentMatrix(exemplo_corpus)
exemplo_corpus_matrix <- as.matrix(exemplo_corpus)
palavras<-row.names(exemplo_corpus_matrix)
# source
#lemma_dic <- read.delim(file = "https://raw.githubusercontent.com/michmech/lemmatization-lists/master/lemmatization-pt.txt", header = FALSE, stringsAsFactors = FALSE)
lemma_dic <- readRDS("lemma.RDS")
names(lemma_dic) <- c("stem", "term")
[1] "açai" "achando" "acostumado" "coisa" "comendo"
[6] "comeria" "comesse" "comestível" "estaria" "outra"
[11] "peixe" "pois"
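# replace each word with its lemma when the dictionary has exactly one entry for it;
# words with zero or several entries are left unchanged
for (j in seq_along(palavras)) {
  comparacao <- palavras[j] == lemma_dic$term
  if (sum(comparacao) == 1) {
    palavras[j] <- as.character(lemma_dic$stem[comparacao])
  }
}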
palavras
[1] "açai" "achar" "acostumar" "coisa" "comer"
[6] "comer" "comer" "comestível" "estar" "outro"
[11] "peixe" "pois"
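The dictionary lookup can also be written without a loop. The sketch below uses a hypothetical three-row dictionary in place of `lemma.RDS`. Unlike the loop above, `match()` takes the first dictionary entry when a term appears more than once, so the two approaches can differ on ambiguous words.

```r
# Hypothetical mini dictionary with the same column names as lemma_dic
lemma_toy <- data.frame(
  stem = c("comer", "comer", "achar"),
  term = c("comendo", "comeria", "achando"),
  stringsAsFactors = FALSE
)

palavras_toy <- c("comendo", "peixe", "achando", "comeria")

# Position of each word in the dictionary (NA when absent)
idx <- match(palavras_toy, lemma_toy$term)

# Use the lemma when found, otherwise keep the original word
lemas <- ifelse(is.na(idx), palavras_toy, lemma_toy$stem[idx])
lemas
```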
Image credit: www.nomidl.com.
$`Session-1`
$`Session-1`$position
[1] "Book1-chapter2"
$`Session-1`$speakers
[1] "Petunia" "Harry"
$`Session-1`$scene
[1] "“Up! Get up! Now!”"
[2] "Harry woke with a start. His aunt rapped on the door again."
[3] "“Up!” she screeched. Harry heard her walking toward the kitchen and then the sound of the frying pan being put on the stove. He rolled onto his back and tried to remember the dream he had been having. It had been a good one. There had been a flying motorcycle in it. He had a funny feeling he’d had the same dream before."
[4] "His aunt was back outside the door."
[5] "“Are you up yet?” she demanded."
[6] "“Nearly,” said Harry."
[7] "“Well, get a move on, I want you to look after the bacon. And don’t you dare let it burn, I want everything perfect on Duddy’s birthday.”"
[8] "Harry groaned."
[9] "“What did you say?” his aunt snapped through the door."
[10] "“Nothing, nothing . . .”"
$`Session-1`$dialogue
[1] "Petunia: Up! Get up! Now! Up! Up! Are you up yet?"
[2] "Harry: Nearly,"
[3] "Petunia: Well, get a move on, I want you to look after the bacon. What did you say?"
[4] "Harry: Nothing, nothing . . ."
$`Session-1`$attributes
$`Session-1`$attributes$Harry
$`Session-1`$attributes$Harry$name
[1] "Harry"
$`Session-1`$attributes$Harry$nickname
[1] "The boy who lived"
$`Session-1`$attributes$Harry$gender
[1] "male"
$`Session-1`$attributes$Harry$age
[1] "age 11"
$`Session-1`$attributes$Harry$looks
[1] "Very thin, black hair, emerald green eyes, wearing glasses, knife injury with lightning shape at the forehead"
$`Session-1`$attributes$Harry$hobbies
[1] "None"
$`Session-1`$attributes$Harry$character
[1] "None"
$`Session-1`$attributes$Harry$talents
[1] "None"
$`Session-1`$attributes$Harry$export
[1] "None"
$`Session-1`$attributes$Harry$belongings
[1] "None"
$`Session-1`$attributes$Harry$affiliation
[1] "None"
$`Session-1`$attributes$Harry$lineage
[1] "None"
$`Session-1`$attributes$Harry$title
[1] "The boy who lived"
$`Session-1`$attributes$Harry$spells
[1] ""
$`Session-1`$attributes$Petunia
$`Session-1`$attributes$Petunia$name
[1] "Petunia"
$`Session-1`$attributes$Petunia$nickname
[1] "None"
$`Session-1`$attributes$Petunia$gender
[1] "Female"
$`Session-1`$attributes$Petunia$age
[1] "Adult"
$`Session-1`$attributes$Petunia$looks
[1] "Slender, blond hair, long neck"
$`Session-1`$attributes$Petunia$hobbies
[1] "None"
$`Session-1`$attributes$Petunia$character
[1] "Message, gossip"
$`Session-1`$attributes$Petunia$talents
[1] "None"
$`Session-1`$attributes$Petunia$export
[1] "None"
$`Session-1`$attributes$Petunia$belongings
[1] "None"
$`Session-1`$attributes$Petunia$affiliation
[1] "None"
$`Session-1`$attributes$Petunia$lineage
[1] "Maculogy"
$`Session-1`$attributes$Petunia$title
[1] "None"
$`Session-1`$attributes$Petunia$spells
[1] ""
$`Session-1`$`relations with Harry`
$`Session-1`$`relations with Harry`$Petunia
$`Session-1`$`relations with Harry`$Petunia$name
[1] "Petunia"
$`Session-1`$`relations with Harry`$Petunia$friend
[1] 0
$`Session-1`$`relations with Harry`$Petunia$classmate
[1] 0
$`Session-1`$`relations with Harry`$Petunia$teacher
[1] 0
$`Session-1`$`relations with Harry`$Petunia$family
[1] 1
$`Session-1`$`relations with Harry`$Petunia$`immediate family`
[1] 0
$`Session-1`$`relations with Harry`$Petunia$lover
[1] 0
$`Session-1`$`relations with Harry`$Petunia$opponent
[1] 0
$`Session-1`$`relations with Harry`$Petunia$colleague
[1] 0
$`Session-1`$`relations with Harry`$Petunia$teammate
[1] 0
$`Session-1`$`relations with Harry`$Petunia$enemy
[1] 0
$`Session-1`$`relations with Harry`$Petunia$`Harry's affection to him`
[1] -4
$`Session-1`$`relations with Harry`$Petunia$`Harry's familiarity with him`
[1] 8
$`Session-1`$`relations with Harry`$Petunia$`His affection to Harry`
[1] -4
$`Session-1`$`relations with Harry`$Petunia$`His familiarity with Harry`
[1] 6
$`Session-2`
$`Session-2`$position
[1] "Book1-chapter2"
$`Session-2`$speakers
[1] "Petunia" "Vernon" "Harry"
$`Session-2`$scene
[1] "“Bad news, Vernon,” she said. “Mrs. Figg’s broken her leg. She can’t take him.” She jerked her head in Harry’s direction."
[2] "Dudley’s mouth fell open in horror, but Harry’s heart gave a leap. Every year on Dudley’s birthday, his parents took him and a friend out for the day, to adventure parks, hamburger restaurants, or the movies. Every year, Harry was left behind with Mrs. Figg, a mad old lady who lived two streets away. Harry hated it there. The whole house smelled of cabbage and Mrs. Figg made him look at photographs of all the cats she’d ever owned."
[3] "“Now what?” said Aunt Petunia, looking furiously at Harry as though he’d planned this. Harry knew he ought to feel sorry that Mrs. Figg had broken her leg, but it wasn’t easy when he reminded himself it would be a whole year before he had to look at Tibbles, Snowy, Mr. Paws, and Tufty again."
[4] "“We could phone Marge,” Uncle Vernon suggested."
[5] "“Don’t be silly, Vernon, she hates the boy.”"
[6] "The Dursleys often spoke about Harry like this, as though he wasn’t there — or rather, as though he was something very nasty that couldn’t understand them, like a slug."
[7] "“What about what’s-her-name, your friend — Yvonne?”"
[8] "“On vacation in Majorca,” snapped Aunt Petunia."
[9] "“You could just leave me here,” Harry put in hopefully (he’d be able to watch what he wanted on television for a change and maybe even have a go on Dudley’s computer)."
[10] "Aunt Petunia looked as though she’d just swallowed a lemon."
[11] "“And come back and find the house in ruins?” she snarled."
[12] "“I won’t blow up the house,” said Harry, but they weren’t listening."
[13] "“I suppose we could take him to the zoo,” said Aunt Petunia slowly, “. . . and leave him in the car. . . .”"
[14] "“That car’s new, he’s not sitting in it alone. . . .”"
[15] "Dudley began to cry loudly. In fact, he wasn’t really crying — it had been years since he’d really cried — but he knew that if he screwed up his face and wailed, his mother would give him anything he wanted."
[16] "“Dinky Duddydums, don’t cry, Mummy won’t let him spoil your special day!” she cried, flinging her arms around him."
[17] "“I . . . don’t . . . want . . . him . . . t-t-to come!” Dudley yelled between huge, pretend sobs. “He always sp-spoils everything!” He shot Harry a nasty grin through the gap in his mother’s arms."
[18] "Just then, the doorbell rang —“Oh, good Lord, they’re here!” said Aunt Petunia frantically — and a moment later, Dudley’s best friend, Piers Polkiss, walked in with his mother. Piers was a scrawny boy with a face like a rat. He was usually the one who held people’s arms behind their backs while Dudley hit them. Dudley stopped pretending to cry at once."
[19] "Half an hour later, Harry, who couldn’t believe his luck, was sitting in the back of the Dursleys’ car with Piers and Dudley, on the way to the zoo for the first time in his life. His aunt and uncle hadn’t been able to think of anything else to do with him, but before they’d left, Uncle Vernon had taken Harry aside."
[20] "“I’m warning you,” he had said, putting his large purple face right up close to Harry’s, “I’m warning you now, boy — any funny business, anything at all — and you’ll be in that cupboard from now until Christmas.”"
$`Session-2`$dialogue
[1] "Petunia: Bad news, Vernon, Mrs. Figg’s broken her leg. She can’t take him. Now what? "
[2] "Vernon: We could phone Marge,"
[3] "Petunia: Don’t be silly, Vernon, she hates the boy."
[4] "Vernon: What about what’s-her-name, your friend — Yvonne?"
[5] "Petunia: On vacation in Majorca,"
[6] "Harry: You could just leave me here,"
[7] "Petunia: And come back and find the house in ruins? "
[8] "Harry: I won’t blow up the house,"
[9] "Petunia: I suppose we could take him to the zoo,"
[10] "Vernon: That car’s new, he’s not sitting in it alone. . . ."
[11] "Petunia: Dinky Duddydums, don’t cry, Mummy won’t let him spoil your special day! Oh, good Lord, they’re here!"
[12] "Vernon: I’m warning you, I’m warning you now, boy — any funny business, anything at all — and you’ll be in that cupboard from now until Christmas."
$`Session-2`$attributes
$`Session-2`$attributes$Harry
$`Session-2`$attributes$Harry$name
[1] "Harry"
$`Session-2`$attributes$Harry$nickname
[1] "The boy who lived"
$`Session-2`$attributes$Harry$gender
[1] "male"
$`Session-2`$attributes$Harry$age
[1] "age 11"
$`Session-2`$attributes$Harry$looks
[1] "Very thin, black hair, emerald green eyes, wearing glasses, knife injury with lightning shape at the forehead"
$`Session-2`$attributes$Harry$hobbies
[1] "None"
$`Session-2`$attributes$Harry$character
[1] "None"
$`Session-2`$attributes$Harry$talents
[1] "None"
$`Session-2`$attributes$Harry$export
[1] "None"
$`Session-2`$attributes$Harry$belongings
[1] "None"
$`Session-2`$attributes$Harry$affiliation
[1] "None"
$`Session-2`$attributes$Harry$lineage
[1] "None"
$`Session-2`$attributes$Harry$title
[1] "The boy who lived"
$`Session-2`$attributes$Harry$spells
[1] ""
$`Session-2`$attributes$Petunia
$`Session-2`$attributes$Petunia$name
[1] "Petunia"
$`Session-2`$attributes$Petunia$nickname
[1] "None"
$`Session-2`$attributes$Petunia$gender
[1] "Female"
$`Session-2`$attributes$Petunia$age
[1] "Adult"
$`Session-2`$attributes$Petunia$looks
[1] "Slender, blond hair, long neck"
$`Session-2`$attributes$Petunia$hobbies
[1] "None"
$`Session-2`$attributes$Petunia$character
[1] "Message, gossip"
$`Session-2`$attributes$Petunia$talents
[1] "None"
$`Session-2`$attributes$Petunia$export
[1] "None"
$`Session-2`$attributes$Petunia$belongings
[1] "None"
$`Session-2`$attributes$Petunia$affiliation
[1] "None"
$`Session-2`$attributes$Petunia$lineage
[1] "Maculogy"
$`Session-2`$attributes$Petunia$title
[1] "None"
$`Session-2`$attributes$Petunia$spells
[1] ""
$`Session-2`$attributes$Vernon
$`Session-2`$attributes$Vernon$name
[1] "Vernon Dursley"
$`Session-2`$attributes$Vernon$nickname
[1] "None"
$`Session-2`$attributes$Vernon$gender
[1] "male"
$`Session-2`$attributes$Vernon$age
[1] "Adult"
$`Session-2`$attributes$Vernon$looks
[1] "Very fat, tall and burly, accumulating beard"
$`Session-2`$attributes$Vernon$hobbies
[1] "None"
$`Session-2`$attributes$Vernon$character
[1] "mean"
$`Session-2`$attributes$Vernon$talents
[1] "None"
$`Session-2`$attributes$Vernon$export
[1] "None"
$`Session-2`$attributes$Vernon$belongings
[1] "car"
$`Session-2`$attributes$Vernon$affiliation
[1] "Grandine Company"
$`Session-2`$attributes$Vernon$lineage
[1] "Maculogy"
$`Session-2`$attributes$Vernon$title
[1] "Company supervisor"
$`Session-2`$attributes$Vernon$spells
[1] ""
$`Session-2`$`relations with Harry`
$`Session-2`$`relations with Harry`$Petunia
$`Session-2`$`relations with Harry`$Petunia$name
[1] "Petunia"
$`Session-2`$`relations with Harry`$Petunia$friend
[1] 0
$`Session-2`$`relations with Harry`$Petunia$classmate
[1] 0
$`Session-2`$`relations with Harry`$Petunia$teacher
[1] 0
$`Session-2`$`relations with Harry`$Petunia$family
[1] 1
$`Session-2`$`relations with Harry`$Petunia$`immediate family`
[1] 0
$`Session-2`$`relations with Harry`$Petunia$lover
[1] 0
$`Session-2`$`relations with Harry`$Petunia$opponent
[1] 0
$`Session-2`$`relations with Harry`$Petunia$colleague
[1] 0
$`Session-2`$`relations with Harry`$Petunia$teammate
[1] 0
$`Session-2`$`relations with Harry`$Petunia$enemy
[1] 0
$`Session-2`$`relations with Harry`$Petunia$`Harry's affection to him`
[1] -4
$`Session-2`$`relations with Harry`$Petunia$`Harry's familiarity with him`
[1] 8
$`Session-2`$`relations with Harry`$Petunia$`His affection to Harry`
[1] -4
$`Session-2`$`relations with Harry`$Petunia$`His familiarity with Harry`
[1] 6
$`Session-2`$`relations with Harry`$Vernon
$`Session-2`$`relations with Harry`$Vernon$name
[1] "Vernon Dursley"
$`Session-2`$`relations with Harry`$Vernon$friend
[1] 0
$`Session-2`$`relations with Harry`$Vernon$classmate
[1] 0
$`Session-2`$`relations with Harry`$Vernon$teacher
[1] 0
$`Session-2`$`relations with Harry`$Vernon$family
[1] 1
$`Session-2`$`relations with Harry`$Vernon$`immediate family`
[1] 0
$`Session-2`$`relations with Harry`$Vernon$lover
[1] 0
$`Session-2`$`relations with Harry`$Vernon$opponent
[1] 0
$`Session-2`$`relations with Harry`$Vernon$colleague
[1] 0
$`Session-2`$`relations with Harry`$Vernon$teammate
[1] 0
$`Session-2`$`relations with Harry`$Vernon$enemy
[1] 0
$`Session-2`$`relations with Harry`$Vernon$`Harry's affection to him`
[1] -4
$`Session-2`$`relations with Harry`$Vernon$`Harry's familiarity with him`
[1] 8
$`Session-2`$`relations with Harry`$Vernon$`His affection to Harry`
[1] -4
$`Session-2`$`relations with Harry`$Vernon$`His familiarity with Harry`
[1] 6
[1] "Petunia: Up! Get up! Now! Up! Up! Are you up yet?"
[2] "Harry: Nearly,"
[3] "Petunia: Well, get a move on, I want you to look after the bacon. What did you say?"
[4] "Harry: Nothing, nothing . . ."
# we want session, speakers and dialogue
session_names <- rep(names(extracted_dialogue),
times = sapply(extracted_dialogue, length))
extracted_dialogue$`Session-1`
[1] "Petunia: Up! Get up! Now! Up! Up! Are you up yet?"
[2] "Harry: Nearly,"
[3] "Petunia: Well, get a move on, I want you to look after the bacon. What did you say?"
[4] "Harry: Nothing, nothing . . ."
tibble_hp <- tibble(dialogue = unlist(extracted_dialogue)
)
# before wrapping up the preprocessing, let's understand str_split_fixed
str_split_fixed(string = tibble_hp$dialogue,
pattern = ':',n=2) |> head()
[,1]
[1,] "Petunia"
[2,] "Harry"
[3,] "Petunia"
[4,] "Harry"
[5,] "Petunia"
[6,] "Vernon"
[,2]
[1,] " Up! Get up! Now! Up! Up! Are you up yet?"
[2,] " Nearly,"
[3,] " Well, get a move on, I want you to look after the bacon. What did you say?"
[4,] " Nothing, nothing . . ."
[5,] " Bad news, Vernon, Mrs. Figg’s broken her leg. She can’t take him. Now what? "
[6,] " We could phone Marge,"
# memo: go through all the code step by step,
# explaining why we use str_trim
dialogo_tb <- str_split_fixed(string = tibble(
dialogue = unlist(extracted_dialogue)
)$dialogue,
pattern = ':',n=2) |>
as_tibble() |>
mutate(session = session_names,
V1 = str_trim(V1)) |>
select(session, personagem = V1, dialogo = V2)
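A toy sketch of the two functions used above, on hypothetical lines. With `n = 2` only the first colon splits the string, so colons inside the speech stay intact. The split leaves a leading space on the second piece, which is the kind of whitespace `str_trim()` removes.

```r
library(stringr)

# Hypothetical lines in the "speaker: speech" format
falas <- c("Harry: Nearly,", "Vernon: Listen: any funny business")

partes <- str_split_fixed(falas, pattern = ":", n = 2)
partes

# Column 1 is the speaker, column 2 the speech (with a leading space)
locutor <- partes[, 1]
fala <- str_trim(partes[, 2])
fala
```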
character_mentions <- sapply( # apply a function to each element
unique(dialogo_tb$personagem), # unique characters
function(char) { # counts the dialogues that mention a given character
sum(
str_detect(dialogo_tb$dialogo, fixed(char)) # fixed() = literal match, no regex
)
}
)
mentions_df <- data.frame(character = unique(dialogo_tb$personagem),
mentions = character_mentions)|>
filter(character != "hat")
# greetings
greetings <- c("Hello", "Hi", "Greetings", "Hey")
# extracting them from the dialogue
greetings_found <- sapply(greetings, function(greet) {
unlist(str_extract_all(dialogo_tb$dialogo, fixed(greet)))
})
# results
greetings_found
$Hello
[1] "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello"
[10] "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello"
[19] "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello"
[28] "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello" "Hello"
[37] "Hello" "Hello" "Hello" "Hello" "Hello"
$Hi
[1] "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi"
[16] "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi"
[31] "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi"
[46] "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi"
[61] "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi"
[76] "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi"
[91] "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi"
[106] "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi"
[121] "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi"
[136] "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi" "Hi"
$Greetings
character(0)
$Hey
[1] "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey"
[13] "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey"
[25] "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey" "Hey"
[37] "Hey" "Hey" "Hey" "Hey" "Hey"
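One caveat about these counts: `fixed("Hi")` matches the letters anywhere, so words such as "His" or "Hide" would also count. Whether that inflates the numbers above has not been checked against the data; the sketch below only illustrates the effect on hypothetical lines and shows word boundaries (`\b`) as one possible remedy.

```r
library(stringr)

# Hypothetical lines
falas <- c("Hi, Harry!", "His aunt rapped on the door.", "Hide it! Hi there.")

# Literal substring match: also counts "His" and "Hide"
n_fixed <- str_count(falas, fixed("Hi"))
n_fixed

# Word boundaries: only the standalone word "Hi"
n_palavra <- str_count(falas, "\\bHi\\b")
n_palavra
```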
# length of each dialogue
dialogo_tb$length <- str_length(dialogo_tb$dialogo)
# identifying the longest dialogue
longest_dialogue <- dialogo_tb |>
arrange(desc(length)) |>
slice(1)
# and the shortest
mini_dialog <- dialogo_tb |>
arrange(length) |>
slice(1)
# results
longest_dialogue |> View()
mini_dialog |> View()
# count how many times Ron mentions Harry (str_detect flags each line at most once)
ron_mentions_potter <- sum(str_detect(
dialogo_tb$dialogo[dialogo_tb$personagem == "Ron"], "Harry"))
# count how many times Hermione mentions Harry
hermione_mentions_harry <- sum(str_detect(
dialogo_tb$dialogo[dialogo_tb$personagem == "Hermione"], "Harry"))
# Displaying the results
cat("Ron mentions 'Harry':", ron_mentions_potter, "times\n")
cat("Hermione mentions 'Harry':", hermione_mentions_harry, "times\n")
Ron mentions 'Harry': 149 times
Hermione mentions 'Harry': 390 times
# Example text data
texto <- dialogo_tb$dialogo[50:300]
# Process the text and build a word frequency table
corpus <- Corpus(VectorSource(texto))
dtm <- TermDocumentMatrix(corpus)
matriz <- as.matrix(dtm)
frequencia <- sort(rowSums(matriz), decreasing = TRUE)
dados <- data.frame(word = names(frequencia), freq = frequencia) |>
anti_join(stop_words, by = "word") |>
filter(word != "—")
wordcloud2(dados,shape = 'star',size = 0.5)
Bing <- get_sentiments("bing")
# note: a Portuguese lexicon also exists: https://www.inf.pucrs.br/linatural/wordpress/recursos-e-ferramentas/oplexicon/
Sentiment <- dialogo_tb |>
unnest_tokens(output = word, input = dialogo) |>
left_join(Bing, "word") |>
filter(!is.na(sentiment))
Sentiment |>
group_by(word, sentiment) |>
summarise(count = n(), .groups = 'drop') |>
arrange(desc(count)) |>
slice(1:20) |>
ggplot( aes(reorder(word, +count), count, fill = sentiment))+
geom_bar(stat = "identity", width = 0.62, alpha = 0.9)+
scale_fill_brewer(palette = "Set1")+
coord_flip()+
labs(title = "Most frequent words with an associated sentiment",
subtitle = "Top 20",
x = "Word", y = "Frequency", fill = "Sentiment",
caption = "© Made by Ían, inspired by Michau96/Kaggle")+
guides(fill = guide_legend(reverse = T))+
theme_minimal()+
theme(legend.title.align = 0.5, legend.position = "right", legend.direction = "vertical")
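The lexicon join behind this chart can be sketched on a toy example. The four-word lexicon and the two lines below are hypothetical stand-ins for the Bing lexicon and the HPD dialogues; `inner_join()` is used here as a shortcut for the left join followed by the NA filter.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical mini lexicon with the same columns as get_sentiments("bing")
toy_lexicon <- tibble(
  word = c("love", "wonderful", "hate", "awful"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Hypothetical dialogue table with the same column names as dialogo_tb
falas <- tibble(
  personagem = c("Ron", "Hermione"),
  dialogo = c("I love this wonderful castle", "I hate that awful potion")
)

contagem <- falas |>
  unnest_tokens(output = word, input = dialogo) |> # one lower-cased word per row
  inner_join(toy_lexicon, by = "word") |>          # keep only words found in the lexicon
  count(personagem, sentiment)

contagem
```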
Today we learned the basics of text analysis,
we understood regex, its most important elements and quantifiers
we got to know the stringr library
We did an exploratory analysis of the dialogues from the Harry Potter series
we introduced sentiment analysis