
How to Get Started with NLP? 6 Methods for Tokenization

admin · 2019-09-07 · 226 views · 0 comments

Introduction

Are you fascinated by the sheer amount of text data available on the internet? Are you looking for ways to work with this text data but aren't sure where to begin? After all, machines recognize numbers, not the letters of our language. In machine learning, that can be a tricky problem.

So how do we manipulate and clean this text data to build a model? The answer lies in the wonderful world of Natural Language Processing (NLP).

Solving an NLP problem is a multi-stage process. Before we reach the modeling stage, we first need to clean the unstructured text data. Preparing the data involves a few key steps:

• Tokenization
• Predicting the part of speech of each word
• Lemmatization
• Identifying and removing stop words, and so on

In this article, we will cover the first step: tokenization. We will first see what tokenization is and why it is required in NLP. Then, we will look at six unique ways of performing tokenization in Python.

What is tokenization in NLP?

Tokenization is one of the most common tasks when working with text data. But what does tokenization actually mean?

Tokenization is essentially splitting a phrase, sentence, paragraph, or entire text document into smaller units, such as individual words or terms. Each of these smaller units is called a token.

Take a look at the image below to visualize this definition:

Tokens can be words, numbers, or punctuation marks. In tokenization, smaller units are created by locating word boundaries. Wait, what are word boundaries?

A word boundary is the ending point of one word and the beginning of the next. These tokens are considered a first step toward stemming and lemmatization.
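
The idea of a word boundary maps directly onto the \b anchor in regular expressions, which matches the zero-width position between a word character and a non-word character. A minimal sketch using Python's built-in re module:

```python
import re

sentence = "Tokens mark word boundaries."
# \b matches the boundary between a word character and a non-word
# character, so \b\w+\b captures each word lying between boundaries.
words = re.findall(r"\b\w+\b", sentence)
print(words)  # ['Tokens', 'mark', 'word', 'boundaries']
```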

Why is tokenization required in NLP?

I want you to think about the English language here. Think of any English sentence you can come up with and keep it in mind as you read through this section; it will help you appreciate the importance of tokenization.

Before we can process a natural language, we need to identify the words that make up a string of characters. That is why tokenization is the most basic step in working with text data in NLP. It matters because the meaning of a text can easily be interpreted by analyzing the words present in it.

Let's take an example. Consider the string below:

            “This is a cat.”

What do you think will happen after we tokenize this string? That's right, we get ['This', 'is', 'a', 'cat.'].

There are numerous uses for this. We can use the tokenized form to:

• Count the total number of words in the text
• Count the frequency of each word, that is, the number of times a particular word appears

And so on. We can extract much more information, which we will discuss in detail in future articles. For now, it is time to dive into the meat of this article: the different methods of performing tokenization in NLP.
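
Both counts fall out almost for free once the text is tokenized. A quick sketch using the standard library's collections.Counter (the token list is hardcoded for illustration):

```python
from collections import Counter

tokens = ['this', 'is', 'a', 'cat', 'and', 'this', 'is', 'a', 'dog']
total_words = len(tokens)    # total number of tokens in the text
word_freq = Counter(tokens)  # number of times each word appears
print(total_words)               # 9
print(word_freq['this'])         # 2
print(word_freq.most_common(1))  # [('this', 2)]
```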

Methods to perform tokenization in Python

We are going to look at six unique ways of performing tokenization on English text data. I have provided Python code for each method, so you can run the examples on your own machine and follow along.

1. Tokenization using Python's split() function

Let's start with the split() method, as it is the most basic one. It returns a list of strings after breaking the given string at the specified separator. By default, split() breaks a string at one or more whitespace characters. We can change the separator to anything we like. Let's take a look.

Word tokenization

            text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
            species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
            liquid-fuel launch vehicle to orbit the Earth."""
            # Split at whitespace
            text.split()
            Output : ['Founded', 'in', '2002,', 'SpaceX’s', 'mission', 'is', 'to', 'enable', 'humans',
            'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi-planet',
            'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on', 'Mars.', 'In',
            '2008,', 'SpaceX’s', 'Falcon', '1', 'became', 'the', 'first', 'privately',
            'developed', 'liquid-fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth.']

Sentence tokenization:

This is similar to word tokenization. Here, we study the structure of sentences in our analysis. A sentence usually ends with a period (.), so we can use "." as the separator to split the string:

            text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet 
            species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
            liquid-fuel launch vehicle to orbit the Earth."""
            # 以"."作为切割符进行切割
            text.split('. ')
            Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring
            civilization and a multi-planet \nspecies by building a self-sustaining city on
            Mars',
            'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel
            launch vehicle to orbit the Earth.']

One major drawback of Python's split() method is that it can only use one separator at a time. Another thing to note: in word tokenization, split() did not treat punctuation marks as separate tokens.

2. Tokenization using regular expressions (RegEx)

First things first, what is a regular expression? It is essentially a special sequence of characters that, when used as a pattern, helps you match or find other strings or sets of strings.

We can use the re library in Python to work with regular expressions. It comes pre-installed with Python.

Now, let's put regular expressions to work and perform word tokenization and sentence tokenization.

Word tokenization

            import re
            text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
            species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
            liquid-fuel launch vehicle to orbit the Earth."""
            tokens = re.findall(r"[\w']+", text)
            tokens
            Output : ['Founded', 'in', '2002', 'SpaceX', 's', 'mission', 'is', 'to', 'enable',
            'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a',
            'multi', 'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining',
            'city', 'on', 'Mars', 'In', '2008', 'SpaceX', 's', 'Falcon', '1', 'became',
            'the', 'first', 'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle',
            'to', 'orbit', 'the', 'Earth']

The re.findall() function finds all the substrings that match the pattern passed to it and stores them in a list. Here, \w represents any word character, which usually means letters, digits, and the underscore (_), and + means one or more occurrences. So the pattern [\w']+ tells the code to collect runs of word characters (and apostrophes) until any other kind of character is encountered.
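
The pattern is easy to adapt. For instance, if we would rather keep punctuation marks as separate tokens than drop them (similar to what NLTK does in a later section), one option is to alternate a word pattern with a single-punctuation pattern (a sketch, not part of the original article's code):

```python
import re

text = "This is a cat."
# \w+ matches runs of word characters; [^\w\s] matches any single
# character that is neither a word character nor whitespace, so each
# punctuation mark survives as its own token.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['This', 'is', 'a', 'cat', '.']
```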

Sentence tokenization

To perform sentence tokenization, we can compile a pattern and use its split() method to break the text into sentences:

            import re
            text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
            species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
            liquid-fuel launch vehicle to orbit the Earth."""
            sentences = re.compile('[.!?] ').split(text)
            sentences
            Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring
            civilization and a multi-planet \nspecies by building a self-sustaining city on
            Mars.',
            'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel
            launch vehicle to orbit the Earth.']

Here we have an advantage over the split() method, since we can pass multiple separators at the same time. In the code above, we used the re.compile() function and passed the pattern '[.!?] ', which means the text is split as soon as any of these characters (followed by a space) is encountered.

3. Tokenization using NLTK

NLTK, short for Natural Language ToolKit, is a library written in Python for symbolic and statistical natural language processing.

You can install NLTK with the following command:

pip install --user -U nltk

NLTK contains a module called tokenize, which can be further divided into two sub-categories:

• Word tokenize: we use the word_tokenize() method to split a sentence into tokens
• Sentence tokenize: we use the sent_tokenize() method to split a document or paragraph into sentences

Let's look at them one by one.

Word tokenization

            from nltk.tokenize import word_tokenize 
            text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
            species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
            liquid-fuel launch vehicle to orbit the Earth."""
            word_tokenize(text)
            Output: ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', 'is', 'to', 'enable',
            'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a',
            'multi-planet', 'species', 'by', 'building', 'a', 'self-sustaining', 'city', 'on',
            'Mars', '.', 'In', '2008', ',', 'SpaceX', '’', 's', 'Falcon', '1', 'became',
            'the', 'first', 'privately', 'developed', 'liquid-fuel', 'launch', 'vehicle',
            'to', 'orbit', 'the', 'Earth', '.']

Notice how NLTK treats punctuation marks as tokens? For downstream tasks, we may want to remove these punctuation marks from the initial list.
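
One simple way to drop those punctuation tokens afterwards is to keep only the tokens containing at least one alphanumeric character. A sketch (the token list is hardcoded here so the snippet runs without NLTK installed):

```python
# A token list of the kind word_tokenize() returns, hardcoded for illustration
tokens = ['Founded', 'in', '2002', ',', 'SpaceX', '’', 's', 'mission', '.']
# Keep a token only if it contains at least one letter or digit;
# pure-punctuation tokens such as ',' and '.' are filtered out.
words = [t for t in tokens if any(ch.isalnum() for ch in t)]
print(words)  # ['Founded', 'in', '2002', 'SpaceX', 's', 'mission']
```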

Sentence tokenization

            from nltk.tokenize import sent_tokenize
            text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
            species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
            liquid-fuel launch vehicle to orbit the Earth."""
            sent_tokenize(text)
            Output: ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring
            civilization and a multi-planet \nspecies by building a self-sustaining city on
            Mars.',
            'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel
            launch vehicle to orbit the Earth.']

4. Tokenization using the `spaCy` library

I love the spaCy library. I can hardly remember the last time I worked on an NLP project without using it. Yes, it is that useful.

spaCy is an open-source library for advanced Natural Language Processing (NLP). It supports more than 49 languages and offers top-tier computation speed.

The commands to install spaCy on Linux:

pip install -U spacy

python -m spacy download en

To install it on other operating systems, check the link below:

https://spacy.io/usage

So, let's see how to leverage the magic of spaCy for tokenization. We will use spacy.lang.en, which supports English.

Word tokenization

            from spacy.lang.en import English
            # Load the English tokenizer, tagger, parser, NER, and word vectors
            nlp = English()
            text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
            species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
            liquid-fuel launch vehicle to orbit the Earth."""
            # The "nlp" object is used to create documents with linguistic annotations
            my_doc = nlp(text)
            # Create a list of word tokens
            token_list = []
            for token in my_doc:
                token_list.append(token.text)
            token_list
            Output : ['Founded', 'in', '2002', ',', 'SpaceX', '’s', 'mission', 'is', 'to', 'enable',
            'humans', 'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a',
            'multi', '-', 'planet', '\n', 'species', 'by', 'building', 'a', 'self', '-',
            'sustaining', 'city', 'on', 'Mars', '.', 'In', '2008', ',', 'SpaceX', '’s',
            'Falcon', '1', 'became', 'the', 'first', 'privately', 'developed', '\n',
            'liquid', '-', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the', 'Earth', '.']

Sentence tokenization

            from spacy.lang.en import English
            # Load the English tokenizer, tagger, parser, NER, and word vectors
            nlp = English()
            # Create the 'sentencizer' pipeline component
            sbd = nlp.create_pipe('sentencizer')
            # Add the component to the pipeline
            nlp.add_pipe(sbd)
            text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
            species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
            liquid-fuel launch vehicle to orbit the Earth."""
            # The "nlp" object is used to create documents with linguistic annotations
            doc = nlp(text)
            # Create a list of sentence tokens
            sents_list = []
            for sent in doc.sents:
                sents_list.append(sent.text)
            sents_list
            Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring
            civilization and a multi-planet \nspecies by building a self-sustaining city on
            Mars.',
            'In 2008, SpaceX’s Falcon 1 became the first privately developed \nliquid-fuel
            launch vehicle to orbit the Earth.']

While performing NLP tasks, spaCy is remarkably fast compared to other libraries (yes, even NLTK).

5. Tokenization using Keras

Keras is one of the hottest deep learning frameworks in the industry right now. It is an open-source neural network library for Python. Keras is extremely easy to use and can run on top of TensorFlow.

In the NLP context, we can use Keras to prepare the unstructured text data we typically collect.

Installing Keras on your machine takes just one line of code:

pip install Keras

Let's start experimenting. To perform word tokenization with Keras, we use the text_to_word_sequence method from the keras.preprocessing.text module.

Let's see how Keras does it.

Word tokenization

            from keras.preprocessing.text import text_to_word_sequence
            # Text data
            text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
            species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
            liquid-fuel launch vehicle to orbit the Earth."""
            # Tokenize the text
            result = text_to_word_sequence(text)
            result
            Output : ['founded', 'in', '2002', 'spacex’s', 'mission', 'is', 'to', 'enable', 'humans',
            'to', 'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi',
            'planet', 'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on',
            'mars', 'in', '2008', 'spacex’s', 'falcon', '1', 'became', 'the', 'first',
            'privately', 'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit',
            'the', 'earth']

Notice that Keras lowercases all the alphabetic characters before tokenizing them. As you can imagine, that can save us quite a lot of time!
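
That lowercase-and-filter behavior is simple enough to sketch in pure Python. The function below illustrates the output we observed above; it is not Keras's actual implementation:

```python
import string

def word_sequence(text, filters=string.punctuation, lower=True):
    # Lowercase, replace filtered characters with spaces, then split
    # on whitespace, mimicking the tokenized output seen above.
    if lower:
        text = text.lower()
    table = str.maketrans(filters, ' ' * len(filters))
    return text.translate(table).split()

print(word_sequence("Founded in 2002, SpaceX launched Falcon 1."))
# ['founded', 'in', '2002', 'spacex', 'launched', 'falcon', '1']
```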

6. Tokenization using Gensim

The final tokenization method we will cover is the Gensim library. It is an open-source library for unsupervised topic modeling and natural language processing, designed to automatically extract semantic topics from a given document.

Here's how you can install Gensim on your machine:

pip install gensim

We can import the tokenize method from the gensim.utils module to perform word tokenization.

Word tokenization

            from gensim.utils import tokenize
            text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
            species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
            liquid-fuel launch vehicle to orbit the Earth."""
            list(tokenize(text))
            Output : ['Founded', 'in', 'SpaceX', 's', 'mission', 'is', 'to', 'enable', 'humans', 'to',
            'become', 'a', 'spacefaring', 'civilization', 'and', 'a', 'multi', 'planet',
            'species', 'by', 'building', 'a', 'self', 'sustaining', 'city', 'on', 'Mars',
            'In', 'SpaceX', 's', 'Falcon', 'became', 'the', 'first', 'privately',
            'developed', 'liquid', 'fuel', 'launch', 'vehicle', 'to', 'orbit', 'the',
            'Earth']

Sentence tokenization

            from gensim.summarization.textcleaner import split_sentences
            text = """Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring civilization and a multi-planet
            species by building a self-sustaining city on Mars. In 2008, SpaceX’s Falcon 1 became the first privately developed
            liquid-fuel launch vehicle to orbit the Earth."""
            result = split_sentences(text)
            result
            Output : ['Founded in 2002, SpaceX’s mission is to enable humans to become a spacefaring
            civilization and a multi-planet ',
            'species by building a self-sustaining city on Mars.',
            'In 2008, SpaceX’s Falcon 1 became the first privately developed ',
            'liquid-fuel launch vehicle to orbit the Earth.']

You may have noticed that Gensim is quite strict with punctuation: it splits whenever a punctuation mark is encountered. During sentence splitting, Gensim also splits the text whenever it encounters \n, while the other libraries ignore it.

Summary

Tokenization is a critical step in the overall NLP pipeline. We cannot simply jump to the model-building part without first cleaning the text.

In this article, we saw six different methods of tokenization (word-level and sentence-level) for a given piece of English text. There are other methods as well, but these are good enough to get you started.
