.. camphr documentation master file, created by
   sphinx-quickstart on Wed Jan 29 22:55:04 2020.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

.. include:: replaces.txt

Camphr
==================================

Camphr is a *Natural Language Processing* library that helps you seamlessly integrate a wide variety of techniques, from state-of-the-art models to conventional ones. You can use `Transformers <https://huggingface.co/transformers>`_, `Udify <https://github.com/Hyperparticle/udify>`_, `ELMo <https://allennlp.org/elmo>`_, etc. on spaCy_.

.. _spaCy: https://spacy.io/

Features
~~~~~~~~

* A spaCy_ plugin - Easy integration of a wide variety of methods
* `Transformers <https://huggingface.co/transformers>`_ with spaCy_ - :doc:`Fine-tuning <notes/finetune_transformers>`, :doc:`Embedding vectors <notes/transformers>`
* `Udify <https://github.com/Hyperparticle/udify>`_ - BERT-based multitask model in 75 languages
* `ELMo <https://allennlp.org/elmo>`_ - Deep contextualized word representations
* Rule-based matching with `Aho-Corasick <https://en.wikipedia.org/wiki/Aho-Corasick_algorithm>`_ and Regex
* (for Japanese) `KNP <http://nlp.ist.i.kyoto-u.ac.jp/index.php?KNP>`_

Installation
~~~~~~~~~~~~

Just pip install:

.. parsed-literal::

    |install-camphr|

Camphr requires Python 3.6 or newer.

Quick tour
~~~~~~~~~~

.. testsetup:: *

    import camphr
    import spacy

    cfg = """
    lang:
        name: en
    pipeline:
        pretrained: ../tests/fixtures/xlnet
    """
    nlp = camphr.load(cfg)

:doc:`Transformers for text embedding <notes/transformers>`
----------------------------------------------------------------------

>>> doc = nlp("BERT converts text to vector")
>>> doc.tensor  # doctest: +ELLIPSIS
tensor([[-0.4646,  0.6749, -3.6471,  1.9478,  0.2647, -0.5829, -1.0046, -0.4127,
...

>>> doc[0].vector  # token vector  # doctest: +ELLIPSIS +NORMALIZE_WHITESPACE
array([-0.46461838,  0.6748918 , -3.647077  ,  1.9477932 ,  0.26473868,
       -0.5829216 , -1.004647  , -0.41271996,  0.99519366,  1.7323551 ,
...

>>> doc2 = nlp("Doc simlarity can be computed based on doc.tensor")
>>> doc.similarity(doc2)  # doctest: +ELLIPSIS
-0.1252622...
>>> doc[0].similarity(doc2[0])  # tokens similarity  # doctest: +ELLIPSIS
-0.049367390...

:doc:`Fine-tune Transformers for NER and text classification <notes/finetune_transformers>`
----------------------------------------------------------------------------------------------------

Camphr provides a training CLI built on `Hydra <https://hydra.cc>`_:

.. code-block:: console

    $ camphr train train.data.path="./train.jsonl" \
                   textcat_label="./label.json" \
                   pretrained=bert-base-cased \
                   lang=en

>>> import spacy
>>> nlp = spacy.load("./outputs/2020-01-30/19-31-23/models/0")
>>> doc = nlp("Fine-tune Transformers and use it as a spaCy pipeline")
>>> print(doc.ents)
(Transformers, spaCy)

:doc:`Udify - BERT-based dependency parser for 75 languages <notes/udify>`
-------------------------------------------------------------------------------------

.. testsetup:: udify

    import spacy
    nlp = spacy.load("en_udify")

.. doctest:: udify

    >>> nlp = spacy.load("en_udify")
    >>> doc = nlp("Udify is a BERT based dependency parser")
    >>> spacy.displacy.render(doc)  # doctest: +SKIP

.. image:: notes/udify_dep_en.png

.. doctest:: udify

    >>> doc = nlp("Deutsch kann so wie es ist analysiert werden")
    >>> spacy.displacy.render(doc)  # doctest: +SKIP

.. image:: notes/udify_dep_de.png

:doc:`ELMo - Deep contextualized word representations <notes/elmo>`
------------------------------------------------------------------------------

>>> nlp = spacy.load("en_elmo_medium")  # doctest: +SKIP
>>> doc = nlp("One can deposit money at the bank")
>>> doc.tensor
tensor([[ 0.4673, -1.7633,  0.6011,  1.0225, -0.6563,  0.2700, -0.6024, -1.5284,
...
        [ 0.7888,  1.5784,  0.8037, -0.5507, -0.9697,  2.5356, -0.0293,  1.1222,
          2.8126, -0.2315,  0.5175, -1.4777, -2.8232, -3.0741, -0.8167, -0.1859]])
>>> doc[0].vector  # doctest: +ELLIPSIS
array([ 0.46731022, -1.763341  ,  0.6010663 ,  1.0225006 , -0.65628755,
...
        0.13352573], dtype=float32)
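The Transformers pipelines in this tour are ordinary spaCy pipelines created by ``camphr.load`` from a YAML configuration (pre-packaged models such as ``en_udify`` and ``en_elmo_medium`` are loaded with ``spacy.load`` instead). The snippet below is a minimal sketch of that pattern; the ``lang``/``pipeline`` keys mirror the configuration used for the examples above, ``./path/to/pretrained`` is a placeholder for your own model directory, and the full configuration schema is described in :doc:`notes/camphr_load`.

.. code-block:: python

    import camphr

    # Minimal sketch: build an English pipeline around a pretrained
    # Transformers model. "./path/to/pretrained" is a placeholder for a
    # local pretrained model directory.
    cfg = """
    lang:
        name: en
    pipeline:
        pretrained: ./path/to/pretrained
    """
    nlp = camphr.load(cfg)

    doc = nlp("Camphr turns pretrained models into spaCy pipelines")
    print(doc.tensor.shape)  # document embedding computed by the pretrained model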
See the tutorials below for more details.

Tutorials
~~~~~~~~~

.. toctree::
   :maxdepth: 1

   notes/transformers
   notes/finetune_transformers
   notes/udify
   notes/elmo
   notes/sentencepiece
   notes/rule_base_match
   notes/camphr_load
   notes/knp