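ML / NB: Using the Naive Bayes (NB) algorithm (TfidfVectorizer, stop words not removed) to classify the 20-category news text dataset, with prediction and evaluation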

Posted by 一个处女座的程序猿 on 2021/03/26 23:38:22

Contents

Output

Design approach

Core code

Output

Design approach
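
The pipeline named in the title (TF-IDF features with stop words kept, a Naive Bayes classifier, then prediction and evaluation) can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the author's original code: `MultinomialNB` is assumed as the Naive Bayes variant, the helper name `train_and_evaluate`, the 25% test split, and `random_state=33` are hypothetical, and a tiny toy corpus stands in for the 20-category news data so the example runs offline.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB


def train_and_evaluate(texts, labels, test_size=0.25, random_state=33):
    """Split the data, build TF-IDF features (stop words kept), fit NB, score."""
    X_train, X_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size,
        random_state=random_state, stratify=labels)

    # stop_words=None is the default, so no stop-word filtering is applied.
    vectorizer = TfidfVectorizer(stop_words=None)
    X_train_tfidf = vectorizer.fit_transform(X_train)  # learn vocabulary and idf
    X_test_tfidf = vectorizer.transform(X_test)        # reuse them on test data

    classifier = MultinomialNB()  # assumed NB variant
    classifier.fit(X_train_tfidf, y_train)
    y_pred = classifier.predict(X_test_tfidf)

    accuracy = accuracy_score(y_test, y_pred)
    report = classification_report(y_test, y_pred, zero_division=0)
    return accuracy, report


# Toy stand-in corpus (hypothetical data, two classes).
toy_texts = [
    "the team won the hockey game last night",
    "the goalie made a great save in the game",
    "the players skated hard in the final period",
    "the coach praised the team after the match",
    "the rocket reached orbit after the launch",
    "the probe sent images from the planet",
    "the telescope observed a distant galaxy",
    "the astronauts returned from the space station",
]
toy_labels = ["sport"] * 4 + ["space"] * 4

accuracy, report = train_and_evaluate(toy_texts, toy_labels)
print("accuracy:", accuracy)
print(report)

# With network access, the real dataset could be swapped in like this:
#   from sklearn.datasets import fetch_20newsgroups
#   news = fetch_20newsgroups(subset='all')
#   accuracy, report = train_and_evaluate(news.data, news.target)
```

The vectorizer is fitted on the training split only and then applied to the test split with `transform`, so the vocabulary and idf weights never see test documents.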

Core code

# class TfidfVectorizer, found at: sklearn.feature_extraction.text
class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.

        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.

        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        a direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.

        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.
        Only applies if ``analyzer == 'word'``.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate stop
        list is returned. 'english' is currently the only supported string
        value.

        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.
        Only applies if ``analyzer == 'word'``.

        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if ``analyzer == 'word'``. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold (corpus-specific
        stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is also
        called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only considers the top
        max_features ordered by term frequency across the corpus.

        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, default=False
        If True, all non-zero term counts are set to 1. This does not mean
        outputs will have only 0/1 values, only that the tf term in tf-idf
        is binary. (Set idf and normalization to False to get 0/1 outputs.)

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.

    idf_ : array, shape = [n_features], or None
        The learned idf vector (global term weights)
        when ``use_idf`` is set to True, None otherwise.

    stop_words_ : set
        Terms that were ignored because they either:

          - occurred in too many documents (`max_df`)
          - occurred in too few documents (`min_df`)
          - were cut off by feature selection (`max_features`).

        This is only available if no vocabulary was given.

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of token and return
        them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model size
    when pickling. This attribute is provided only for introspection and can
    be safely removed using delattr or set to None before pickling.
    """

    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):

        super(TfidfVectorizer, self).__init__(
            input=input, encoding=encoding, decode_error=decode_error,
            strip_accents=strip_accents, lowercase=lowercase,
            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,
            stop_words=stop_words, token_pattern=token_pattern,
            ngram_range=ngram_range, max_df=max_df, min_df=min_df,
            max_features=max_features, vocabulary=vocabulary, binary=binary,
            dtype=dtype)

        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                                       smooth_idf=smooth_idf,
                                       sublinear_tf=sublinear_tf)

    # Broadcast the TF-IDF parameters to the underlying transformer instance
    # for easy grid search and repr

    @property
    def norm(self):
        return self._tfidf.norm

    @norm.setter
    def norm(self, value):
        self._tfidf.norm = value

    @property
    def use_idf(self):
        return self._tfidf.use_idf

    @use_idf.setter
    def use_idf(self, value):
        self._tfidf.use_idf = value

    @property
    def smooth_idf(self):
        return self._tfidf.smooth_idf

    @smooth_idf.setter
    def smooth_idf(self, value):
        self._tfidf.smooth_idf = value

    @property
    def sublinear_tf(self):
        return self._tfidf.sublinear_tf

    @sublinear_tf.setter
    def sublinear_tf(self, value):
        self._tfidf.sublinear_tf = value

    @property
    def idf_(self):
        return self._tfidf.idf_

    def fit(self, raw_documents, y=None):
        """Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        self : TfidfVectorizer
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        return self

    def fit_transform(self, raw_documents, y=None):
        """Learn vocabulary and idf, return term-document matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        # X is already a transformed view of raw_documents so
        # we set copy to False
        return self._tfidf.transform(X, copy=False)

    def transform(self, raw_documents, copy=True):
        """Transform documents to document-term matrix.

        Uses the vocabulary and document frequencies (df) learned by fit (or
        fit_transform).

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        copy : boolean, default True
            Whether to copy X and operate on the copy or perform in-place
            operations.

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')

        X = super(TfidfVectorizer, self).transform(raw_documents)
        return self._tfidf.transform(X, copy=False)
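
To make the `smooth_idf`, `sublinear_tf`, and `norm='l2'` parameters in the docstring concrete, here is a small pure-Python sketch of the weighting. It is an illustration only, not the library's implementation: the function name `tfidf_rows` is hypothetical, tokens come from a plain lowercase whitespace split rather than the `token_pattern` regular expression, and the result is a dense list of lists instead of a sparse matrix. The idf used is `ln((1 + n) / (1 + df)) + 1` when smoothing is on and `ln(n / df) + 1` otherwise.

```python
import math
from collections import Counter


def tfidf_rows(docs, smooth_idf=True, sublinear_tf=False):
    """Return (vocabulary, rows) where each row is an L2-normalized tf-idf vector."""
    # Simplified tokenization: lowercase + whitespace split (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    n_docs = len(tokenized)

    # Document frequency: number of documents that contain each term.
    df = Counter(token for tokens in tokenized for token in set(tokens))

    # Smoothing adds one to both the document count and each document frequency.
    s = 1 if smooth_idf else 0
    idf = {t: math.log((n_docs + s) / (df[t] + s)) + 1.0 for t in vocab}

    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = []
        for term in vocab:
            tf = counts[term]
            if sublinear_tf and tf > 0:
                tf = 1.0 + math.log(tf)  # dampen repeated occurrences
            weights.append(tf * idf[term])
        # L2 normalization, matching norm='l2'.
        length = math.sqrt(sum(w * w for w in weights))
        rows.append([w / length if length else 0.0 for w in weights])
    return vocab, rows


docs = ["the cat sat", "the dog ran", "the cat ran fast"]
vocab, rows = tfidf_rows(docs)
for term, weight in zip(vocab, rows[0]):
    print(term, round(weight, 4))
```

In the first document, "the" occurs in every document and so gets the smallest non-zero weight, while "sat" occurs in only one document and gets the largest. This is why a TF-IDF representation can remain usable even when stop words are not removed.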

Source: yunyaniu.blog.csdn.net. Author: 一个处女座的程序猿. Copyright belongs to the original author; please contact the author before reposting.

Original link: yunyaniu.blog.csdn.net/article/details/88086335
