[Question] About text mining

Board: R_Language  Author: abcggg (小雞逼逼)  Time: 2015/08/18 22:40  Score: 0 (0 push, 0 boo, 1 →)


[Question type]: Programming help

[Software familiarity]: Beginner (I have written other programs, I'm just unfamiliar with the syntax)

[Problem description]:

Hi everyone. I recently started with text mining and followed 陳嘉葳's article "用R進行中文 text Mining" (Chinese text mining with R) (http://goo.gl/3mTrDg). I worked through it and modified a few parts myself because of updates since the article was written.

Question 1: When I print the TermDocumentMatrix, it looks like the output below. I don't know how to get rid of the \n, and the output in the original article doesn't look like this.

                              Docs
    Terms                      1 2 3 4 5
      一生\n志願               0 0 0 0 0
      一生\n味道               0 0 0 0 0
      一家\n嘴\n罐子           1 0 0 0 0
      一連串                   0 0 0 0 0
      人\n                     0 0 0 0 0
      人\n人\n人\n強者\n同學   0 0 0 0 0
      人\n人\n小學             0 0 0 0 0
      人\n人\n事\n課\n精神     0 0 0 0 0
      人\n人\n東西             0 0 0 0 0
      人\n山\n雙手             0 0 0 0 0

Question 2: The word cloud I get looks like this:
http://i.imgur.com/W6Bo2Tk.png

Even though the code is the same, mine comes out square for some reason and isn't very dense. I'd like to know where the problem is.

Question 3: The original article keeps only nouns:

    d.corpus <- tm_map(d.corpus[1:100], segmentCN, nature = TRUE)
    d.corpus <- tm_map(d.corpus, function(sentence) {
      noun <- lapply(sentence, function(w) {
        w[names(w) == "n"]
      })
      unlist(noun)
    })

If I want to keep all parts of speech, what should I do? I tried it myself, but got an error when printing the tdm:

    Error in `[.simple_triplet_matrix`(tdm, 1:10, 1:5) :
      subscript out of bounds

[Code example]:

The code is roughly as follows:

    d.corpus0 <- Corpus(DirSource('doc'), list(language = NA))  # corpus
    d.corpus_clean <- tm_map(d.corpus0, removePunctuation)
    d.corpus_clean <- tm_map(d.corpus_clean, removeNumbers)
    d.corpus_clean <- tm_map(d.corpus_clean, function(word) {
      gsub("[A-Za-z0-9]", "", word)
    })
    d.corpus_seg <- tm_map(d.corpus_clean[1:100], segmentCN, nature = TRUE)
    d.corpus_seg2 <- tm_map(d.corpus_seg, function(sentence) {
      noun <- lapply(sentence, function(w) {
        w[names(w) == "n"]
      })
      unlist(noun)
    })
    #d.corpus_vec <- Corpus(VectorSource(d.corpus_seg))  # fails to run
    d.corpus_stop <- tm_map(d.corpus_seg2, removeWords, myStopWords)

    # build the TermDocumentMatrix (my own modification)
    corpus_clean <- tm_map(d.corpus_stop, PlainTextDocument)
    d.corpus_vec <- Corpus(VectorSource(corpus_clean))
    tdm <- TermDocumentMatrix(d.corpus_vec,
                              control = list(wordLengths = c(2, Inf)))

    # word cloud
    m1 <- as.matrix(tdm)
    v <- sort(rowSums(m1), decreasing = TRUE)
    d <- data.frame(word = names(v), freq = v)
    wordcloud(d$word, d$freq, min.freq = 2, random.order = F,
              ordered.colors = F, colors = rainbow(length(row.names(m1))))

This has been bugging me for several days. I've thought about it for a long time and searched through a lot of material without finding a solution, so I'm asking the experts here. This is my first post on this board, so please bear with me if anything is out of line :)

[Keywords]: 文字探勘, text mining

--
※ Sent from: 批踢踢實業坊 (ptt.cc), from: 1.34.144.35
※ Article URL: https://www.ptt.cc/bbs/R_Language/M.1439908857.A.A29.html
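For Question 3, here is a minimal base-R sketch of relaxing the noun filter. It assumes, as the noun-only snippet in the post implies, that `segmentCN(..., nature = TRUE)` yields a character vector whose names are part-of-speech tags. The sample vector `seg` and the helper `keep_pos` are hypothetical stand-ins for illustration and are not part of tm or Rwordseg.

```r
# Hypothetical stand-in for one segmented document:
# values are tokens, names are part-of-speech tags (assumed layout).
seg <- c(n = "同學", v = "學習", a = "快樂", n = "精神")

# Keep only tokens whose tag is in `tags`; tags = NULL keeps every token.
keep_pos <- function(w, tags = NULL) {
  if (!is.null(tags)) {
    w <- w[names(w) %in% tags]
  }
  unname(w)  # drop the POS tags so only the words remain
}

nouns_only  <- keep_pos(seg, tags = "n")          # same filter as in the post
nouns_verbs <- keep_pos(seg, tags = c("n", "v"))  # widen to several tags
all_words   <- keep_pos(seg)                      # no filter at all
```

Inside the `tm_map` call this would roughly correspond to returning the tokens without the `names(w) == "n"` subset. Whether that change alone avoids the `subscript out of bounds` error is not something this sketch can confirm.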

→ celestialgod  08/18 22:59

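You're right! Sorry, I'll go back and try the older version as well... I'd still like to know whether anyone else has run into this problem and managed to solve it.

※ Edited: abcggg (1.34.144.23), 08/19/2015 02:17:14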
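For Question 1 (the \n inside the terms), one possible workaround is sketched below in base R only, without tm. It assumes the \n appears because each document's tokens end up joined by newlines and are then treated as a single term; that cause is a guess and is not confirmed anywhere in the thread. The sample data `docs_raw` and all variable names are hypothetical. The last step also bounds the preview indices by the matrix size, since `subscript out of bounds` on `tdm[1:10, 1:5]` usually means the matrix has fewer than 10 terms or fewer than 5 documents.

```r
# Hypothetical documents whose tokens are separated by newlines,
# mimicking the terms shown in Question 1.
docs_raw <- c("一生\n志願\n味道", "一家\n嘴\n罐子\n一生")

# Split on spaces, newlines, and tabs so each token becomes its own term.
tokens <- strsplit(docs_raw, "[ \n\t]+")

# Build a small term-document count matrix by hand.
terms <- unique(unlist(tokens))
tdm <- vapply(
  tokens,
  function(tok) as.integer(table(factor(tok, levels = terms))),
  integer(length(terms))
)
dimnames(tdm) <- list(Terms = terms, Docs = seq_along(tokens))

# Preview at most 10 terms and 5 documents without going out of bounds.
preview <- tdm[seq_len(min(10, nrow(tdm))),
               seq_len(min(5, ncol(tdm))),
               drop = FALSE]
print(preview)
```

With tm itself, the analogous idea would be to make sure each document is a single whitespace-separated string before calling `TermDocumentMatrix`, and to check `dim(tdm)` before indexing.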


Article ID (AID): #1LqqFvef (R_Language)