ML — NB: Classification, prediction, and evaluation on the 20-category news text dataset (20 Newsgroups) with the naive Bayes (NB) algorithm (TfidfVectorizer, stop words not removed)
Table of Contents
Output
Design approach
Core code

Core code
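Before the library source quoted below, here is a minimal, self-contained sketch of the workflow the title describes, using the standard scikit-learn APIs (fetch_20newsgroups, TfidfVectorizer with its default stop_words=None, i.e. no stop-word removal, and MultinomialNB). The split ratio and random seed are illustrative assumptions, not necessarily the original post's exact settings:

from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Download the 20 Newsgroups corpus (news posts in 20 categories).
news = fetch_20newsgroups(subset='all')

# Hold out 25% of the documents for evaluation (illustrative choice;
# random_state is fixed for reproducibility).
X_train, X_test, y_train, y_test = train_test_split(
    news.data, news.target, test_size=0.25, random_state=33)

# TF-IDF features; stop_words defaults to None, so no stop words are
# removed -- the configuration the title refers to.
vec = TfidfVectorizer()
X_train_tfidf = vec.fit_transform(X_train)
X_test_tfidf = vec.transform(X_test)   # reuse the fitted vocabulary and idf

# Multinomial naive Bayes suits non-negative TF-IDF features.
clf = MultinomialNB()
clf.fit(X_train_tfidf, y_train)
y_pred = clf.predict(X_test_tfidf)

print('Accuracy:', clf.score(X_test_tfidf, y_test))
print(classification_report(y_test, y_pred, target_names=news.target_names))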
class TfidfVectorizer Found at: sklearn.feature_extraction.text

class TfidfVectorizer(CountVectorizer):
    """Convert a collection of raw documents to a matrix of TF-IDF features.

    Equivalent to CountVectorizer followed by TfidfTransformer.

    Read more in the :ref:`User Guide <text_feature_extraction>`.

    Parameters
    ----------
    input : string {'filename', 'file', 'content'}
        If 'filename', the sequence passed as an argument to fit is
        expected to be a list of filenames that need reading to fetch
        the raw content to analyze.

        If 'file', the sequence items must have a 'read' method (file-like
        object) that is called to fetch the bytes in memory.

        Otherwise the input is expected to be the sequence strings or
        bytes items are expected to be analyzed directly.

    encoding : string, 'utf-8' by default.
        If bytes or files are given to analyze, this encoding is used to
        decode.

    decode_error : {'strict', 'ignore', 'replace'}
        Instruction on what to do if a byte sequence is given to analyze that
        contains characters not of the given `encoding`. By default, it is
        'strict', meaning that a UnicodeDecodeError will be raised. Other
        values are 'ignore' and 'replace'.

    strip_accents : {'ascii', 'unicode', None}
        Remove accents during the preprocessing step.
        'ascii' is a fast method that only works on characters that have
        a direct ASCII mapping.
        'unicode' is a slightly slower method that works on any characters.
        None (default) does nothing.

    analyzer : string, {'word', 'char'} or callable
        Whether the feature should be made of word or character n-grams.

        If a callable is passed it is used to extract the sequence of features
        out of the raw, unprocessed input.

    preprocessor : callable or None (default)
        Override the preprocessing (string transformation) stage while
        preserving the tokenizing and n-grams generation steps.

    tokenizer : callable or None (default)
        Override the string tokenization step while preserving the
        preprocessing and n-grams generation steps.
        Only applies if ``analyzer == 'word'``.

    ngram_range : tuple (min_n, max_n)
        The lower and upper boundary of the range of n-values for different
        n-grams to be extracted. All values of n such that min_n <= n <= max_n
        will be used.

    stop_words : string {'english'}, list, or None (default)
        If a string, it is passed to _check_stop_list and the appropriate stop
        list is returned. 'english' is currently the only supported string
        value.

        If a list, that list is assumed to contain stop words, all of which
        will be removed from the resulting tokens.
        Only applies if ``analyzer == 'word'``.

        If None, no stop words will be used. max_df can be set to a value
        in the range [0.7, 1.0) to automatically detect and filter stop
        words based on intra corpus document frequency of terms.

    lowercase : boolean, default True
        Convert all characters to lowercase before tokenizing.

    token_pattern : string
        Regular expression denoting what constitutes a "token", only used
        if ``analyzer == 'word'``. The default regexp selects tokens of 2
        or more alphanumeric characters (punctuation is completely ignored
        and always treated as a token separator).

    max_df : float in range [0.0, 1.0] or int, default=1.0
        When building the vocabulary ignore terms that have a document
        frequency strictly higher than the given threshold (corpus-specific
        stop words).
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    min_df : float in range [0.0, 1.0] or int, default=1
        When building the vocabulary ignore terms that have a document
        frequency strictly lower than the given threshold. This value is also
        called cut-off in the literature.
        If float, the parameter represents a proportion of documents, integer
        absolute counts.
        This parameter is ignored if vocabulary is not None.

    max_features : int or None, default=None
        If not None, build a vocabulary that only consider the top
        max_features ordered by term frequency across the corpus.

        This parameter is ignored if vocabulary is not None.

    vocabulary : Mapping or iterable, optional
        Either a Mapping (e.g., a dict) where keys are terms and values are
        indices in the feature matrix, or an iterable over terms. If not
        given, a vocabulary is determined from the input documents.

    binary : boolean, default=False
        If True, all non-zero term counts are set to 1. This does not mean
        outputs will have only 0/1 values, only that the tf term in tf-idf
        is binary. (Set idf and normalization to False to get 0/1 outputs.)

    dtype : type, optional
        Type of the matrix returned by fit_transform() or transform().

    norm : 'l1', 'l2' or None, optional
        Norm used to normalize term vectors. None for no normalization.

    use_idf : boolean, default=True
        Enable inverse-document-frequency reweighting.

    smooth_idf : boolean, default=True
        Smooth idf weights by adding one to document frequencies, as if an
        extra document was seen containing every term in the collection
        exactly once. Prevents zero divisions.

    sublinear_tf : boolean, default=False
        Apply sublinear tf scaling, i.e. replace tf with 1 + log(tf).

    Attributes
    ----------
    vocabulary_ : dict
        A mapping of terms to feature indices.

    idf_ : array, shape = [n_features], or None
        The learned idf vector (global term weights)
        when ``use_idf`` is set to True, None otherwise.

    stop_words_ : set
        Terms that were ignored because they either:

        - occurred in too many documents (`max_df`)
        - occurred in too few documents (`min_df`)
        - were cut off by feature selection (`max_features`).

        This is only available if no vocabulary was given.

    See also
    --------
    CountVectorizer
        Tokenize the documents and count the occurrences of token and return
        them as a sparse matrix

    TfidfTransformer
        Apply Term Frequency Inverse Document Frequency normalization to a
        sparse matrix of occurrence counts.

    Notes
    -----
    The ``stop_words_`` attribute can get large and increase the model size
    when pickling. This attribute is provided only for introspection and can
    be safely removed using delattr or set to None before pickling.
    """

    def __init__(self, input='content', encoding='utf-8',
                 decode_error='strict', strip_accents=None, lowercase=True,
                 preprocessor=None, tokenizer=None, analyzer='word',
                 stop_words=None, token_pattern=r"(?u)\b\w\w+\b",
                 ngram_range=(1, 1), max_df=1.0, min_df=1,
                 max_features=None, vocabulary=None, binary=False,
                 dtype=np.int64, norm='l2', use_idf=True, smooth_idf=True,
                 sublinear_tf=False):
        super(TfidfVectorizer, self).__init__(
            input=input, encoding=encoding, decode_error=decode_error,
            strip_accents=strip_accents, lowercase=lowercase,
            preprocessor=preprocessor, tokenizer=tokenizer, analyzer=analyzer,
            stop_words=stop_words, token_pattern=token_pattern,
            ngram_range=ngram_range, max_df=max_df, min_df=min_df,
            max_features=max_features, vocabulary=vocabulary, binary=binary,
            dtype=dtype)
        self._tfidf = TfidfTransformer(norm=norm, use_idf=use_idf,
                                       smooth_idf=smooth_idf,
                                       sublinear_tf=sublinear_tf)

    # Broadcast the TF-IDF parameters to the underlying transformer instance
    # for easy grid search and repr
    @property
    def norm(self):
        return self._tfidf.norm

    @norm.setter
    def norm(self, value):
        self._tfidf.norm = value

    @property
    def use_idf(self):
        return self._tfidf.use_idf

    @use_idf.setter
    def use_idf(self, value):
        self._tfidf.use_idf = value

    @property
    def smooth_idf(self):
        return self._tfidf.smooth_idf

    @smooth_idf.setter
    def smooth_idf(self, value):
        self._tfidf.smooth_idf = value

    @property
    def sublinear_tf(self):
        return self._tfidf.sublinear_tf

    @sublinear_tf.setter
    def sublinear_tf(self, value):
        self._tfidf.sublinear_tf = value

    @property
    def idf_(self):
        return self._tfidf.idf_

    def fit(self, raw_documents, y=None):
        """Learn vocabulary and idf from training set.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        self : TfidfVectorizer
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        return self

    def fit_transform(self, raw_documents, y=None):
        """Learn vocabulary and idf, return term-document matrix.

        This is equivalent to fit followed by transform, but more efficiently
        implemented.

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        X = super(TfidfVectorizer, self).fit_transform(raw_documents)
        self._tfidf.fit(X)
        # X is already a transformed view of raw_documents so
        # we set copy to False
        return self._tfidf.transform(X, copy=False)

    def transform(self, raw_documents, copy=True):
        """Transform documents to document-term matrix.

        Uses the vocabulary and document frequencies (df) learned by fit (or
        fit_transform).

        Parameters
        ----------
        raw_documents : iterable
            an iterable which yields either str, unicode or file objects

        copy : boolean, default True
            Whether to copy X and operate on the copy or perform in-place
            operations.

        Returns
        -------
        X : sparse matrix, [n_samples, n_features]
            Tf-idf-weighted document-term matrix.
        """
        check_is_fitted(self, '_tfidf', 'The tfidf vector is not fitted')
        X = super(TfidfVectorizer, self).transform(raw_documents)
        return self._tfidf.transform(X, copy=False)
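As a quick illustration of the interface quoted above, the following uses a tiny made-up corpus (the exact weights and vocabulary ordering can vary with the installed scikit-learn version):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'the cat sat on the mat',
    'the dog sat on the log',
    'cats and dogs',
]

# Defaults: stop_words=None (no stop-word removal), norm='l2', use_idf=True.
vec = TfidfVectorizer()
X = vec.fit_transform(corpus)   # sparse matrix of shape (3, n_features)

print(sorted(vec.vocabulary_))  # learned terms mapped to column indices
print(X.shape)
print(vec.idf_)                 # global idf weight per feature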
Source: yunyaniu.blog.csdn.net, author: 一个处女座的程序猿. Copyright belongs to the original author; contact the author for reprint permission.
Original link: yunyaniu.blog.csdn.net/article/details/88086335