Welcome to WEBERT’s documentation!

This toolkit automatically computes word embeddings using Bidirectional Encoder Representations from Transformers (BERT), with support for base and large, cased and lower-cased models in Spanish and English. BERT embeddings are computed using Transformers (https://github.com/huggingface/transformers). The project is ongoing.

The code for this project is available at https://github.com/PauPerezT/WEBERT

Guide

Installation

From source:

git clone https://github.com/PauPerezT/WEBERT
cd WEBERT

To install the requirements, please run:

install.sh

Executing commands

Run it automatically from a Linux terminal

To compute BERT embeddings automatically, run get_embeddings.py with the following optional arguments:

Optional arguments	Optional values	Description
-h		Show this help message and exit.
-f		Path to the folder containing the txt documents (txt format only). By default ‘./texts’.
-sv		Path to save the embeddings. By default ‘./bert_embeddings’.
-bm	Bert, Beto, SciBert	Choose between three different BERT models. By default Bert.
-d	True, False	Boolean value; True computes dynamic features. By default True.
-st	True, False	Boolean value; True computes static features from the embeddings: mean, standard deviation, kurtosis, skewness, min and max. By default False.
-l	english, spanish	Chosen language (only available for the Bert model). By default english.
-sw	True, False	Boolean value; set True to remove stopwords. By default False.
-m	base, large	BERT model size, base or large. By default base.
-ca	True, False	Boolean value; True selects cased models, False lower-cased models. Not available for SciBert. By default False.
-cu	True, False	Boolean value; True uses CUDA to compute the embeddings. By default False.

Usage example:

python get_embeddings.py -f ./texts/ -sv ./bert_embs -bm Bert -d True -st True -l english -sw True -m base -ca True -cu True
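
When many runs with different options are needed, the same command line can be assembled from Python. The following is a minimal sketch, not part of WEBERT: `build_command` and `DEFAULTS` are hypothetical names, and the default values are simply copied from the options table.

```python
import shlex
import sys

# Defaults copied from the options table in this guide.
DEFAULTS = {
    "-f": "./texts",
    "-sv": "./bert_embeddings",
    "-bm": "Bert",
    "-d": "True",
    "-st": "False",
    "-l": "english",
    "-sw": "False",
    "-m": "base",
    "-ca": "False",
    "-cu": "False",
}


def build_command(overrides=None, script="get_embeddings.py"):
    """Return the argument list for one get_embeddings.py run.

    `build_command` is a hypothetical helper, not part of WEBERT.
    """
    options = dict(DEFAULTS)
    for flag, value in (overrides or {}).items():
        if flag not in DEFAULTS:
            raise ValueError(f"unknown flag: {flag}")
        options[flag] = str(value)
    command = [sys.executable, script]
    for flag, value in options.items():
        command += [flag, value]
    return command


command = build_command({"-st": True, "-sw": True})
# Print the command without the interpreter path, for inspection.
print(shlex.join(command[1:]))
# To actually run it from inside the WEBERT checkout:
# import subprocess; subprocess.run(command, check=True)
```

The sketch only builds and prints the command; the `subprocess.run` call is left as a comment because it needs the WEBERT checkout and its requirements.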

Methods

class WEBERT.BERT(inputs, file, language='english', stopwords=False, model='base', cased=False, cuda=False)

WEBERT-BERT computes static or dynamic BERT embeddings using Transformers (https://github.com/huggingface/transformers). It supports an English model and a Spanish (multilingual) model, cased or lower-cased variants, and optional stopword removal.

Parameters:

inputs – input data.
file – name of the document.
language – input language (by default: english).
stopwords – boolean variable for removing stopwords (by default: False).
model – base or large model (by default: base).
cased – boolean variable to select the cased (True) or lower-cased (False) model (by default: False).
cuda – boolean value for using CUDA to compute the embeddings; True enables it (by default: False).

Returns:

WEBERT object
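
As a rough illustration of how the constructor and get_bert_embeddings might be combined from Python, here is a minimal sketch. It assumes the module is importable as `WEBERT` from the repository checkout; `embed_document` is a hypothetical wrapper, and `inputs` is passed through unchanged because this documentation does not specify its type.

```python
def embed_document(inputs, file_name, out_dir, static=True):
    """Compute embeddings for one document with WEBERT's BERT class.

    `embed_document` is a hypothetical wrapper, not part of WEBERT.
    """
    # Imported lazily so this sketch can be loaded without WEBERT installed.
    # The import path is an assumption based on the name `WEBERT.BERT`.
    from WEBERT import BERT

    bert = BERT(
        inputs,
        file_name,
        language="english",
        stopwords=False,
        model="base",
        cased=False,
        cuda=False,
    )
    # Per the documentation, static embeddings are returned if static=True.
    return bert.get_bert_embeddings(out_dir, dynamic=True, static=static)
```

All keyword values shown are the documented defaults, written out only to make the available options visible.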

get_bert_embeddings(path, dynamic=True, static=False)

BERT embeddings computation using Transformers. It transforms the texts into BERT embeddings and stores them in CSV files.

Parameters:

path – path to save the embeddings.
dynamic – boolean variable to compute the dynamic embeddings (by default: True).
static – boolean variable to compute the static embeddings (by default: False).

Returns:

static embeddings if static=True
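
To make the relation between dynamic and static embeddings concrete, the sketch below collapses a tokens x dimensions matrix into the six statistics listed for the -st option (mean, standard deviation, kurtosis, skewness, min and max). It is an illustration in plain Python, not WEBERT's implementation: the use of population moments, excess kurtosis, and the output layout are all assumptions, and `column_stats` and `static_features` are hypothetical names.

```python
import math


def column_stats(values):
    """Return mean, std, kurtosis, skewness, min and max of one column."""
    n = len(values)
    mean = sum(values) / n
    # Population (biased) central moments; an assumption of this sketch.
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    std = math.sqrt(m2)
    # Skewness and excess kurtosis are undefined for a constant column;
    # this sketch reports 0.0 in that case.
    skewness = m3 / std ** 3 if std > 0 else 0.0
    kurtosis = m4 / m2 ** 2 - 3.0 if std > 0 else 0.0
    return [mean, std, kurtosis, skewness, min(values), max(values)]


def static_features(dynamic):
    """Collapse a tokens x dimensions matrix into one fixed-length vector."""
    columns = list(zip(*dynamic))  # one tuple of values per dimension
    per_stat = zip(*(column_stats(col) for col in columns))
    # Layout: all means, then all stds, kurtoses, skewnesses, mins, maxes.
    return [value for stat in per_stat for value in stat]


# Toy "dynamic" embeddings: 4 tokens, 2 dimensions (BERT base uses 768).
dynamic = [[1.0, 0.0], [2.0, 0.0], [3.0, 4.0], [4.0, 4.0]]
features = static_features(dynamic)
print(len(features))  # 6 statistics x 2 dimensions = 12
```

Whatever the exact definitions used, the key property is that the static vector has a fixed length regardless of how many tokens the document contains.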

preprocessing(inputs)

Text pre-processing.

Parameters:

inputs – input data.

Returns:

preprocessed text
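
This documentation does not describe what the pre-processing step does beyond the optional stopword removal. The sketch below shows one way such a step could look. It is not WEBERT's code: `preprocess` is a hypothetical function and the tiny stopword list is purely illustrative.

```python
import re

# Tiny illustrative stopword list; a real run would use a full list for the
# chosen language. This list is an assumption of the sketch.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to"}


def preprocess(text, remove_stopwords=False):
    """Strip punctuation and digits, and optionally drop stopwords."""
    # Keep runs of letters only; case is preserved for cased models.
    tokens = re.findall(r"[^\W\d_]+", text)
    if remove_stopwords:
        tokens = [t for t in tokens if t.lower() not in STOPWORDS]
    return " ".join(tokens)


print(preprocess("The toolkit computes embeddings of the texts.", True))
# prints: toolkit computes embeddings texts
```

Stopwords are matched case-insensitively but the surviving tokens keep their original case, so the output remains usable with cased models.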

class WEBERT.BETO(inputs, file, stopwords=False, model='base', cased=False, cuda=False)

WEBERT-BETO computes static or dynamic BETO embeddings. BETO is a BERT model pretrained on a Spanish corpus (https://github.com/dccuchile/beto). BETO uses Transformers (https://github.com/huggingface/transformers). Only the Spanish model is available. It supports cased or lower-cased variants and optional stopword removal.

Parameters:

inputs – input data.
file – name of the document.
stopwords – boolean variable for removing stopwords (by default: False).
model – base or large model (by default: base).
cased – boolean variable to select the cased (True) or lower-cased (False) model (by default: False).
cuda – boolean value for using CUDA to compute the embeddings; True enables it (by default: False).

Returns:

WEBERT object

get_bert_embeddings(path, dynamic=True, static=False)

BETO embeddings computation using Transformers. It transforms the texts into BETO embeddings and stores them in CSV files.

Parameters:

path – path to save the embeddings.
dynamic – boolean variable to compute the dynamic embeddings (by default: True).
static – boolean variable to compute the static embeddings (by default: False).

Returns:

static embeddings if static=True

preprocessing(inputs)

Text pre-processing.

Parameters:

inputs – input data.

Returns:

preprocessed text

class WEBERT.SciBERT(inputs, file, stopwords=False, cased=False, cuda=False)

WEBERT-SCIBERT computes static or dynamic SciBERT embeddings. SciBERT is a model pretrained on English scientific text (https://github.com/allenai/scibert). It uses Transformers (https://github.com/huggingface/transformers). This toolkit only considers the scivocab model. It supports cased or lower-cased variants and optional stopword removal.

Parameters:

inputs – input data.
file – name of the document.
stopwords – boolean variable for removing stopwords (by default: False).
cased – boolean variable to select the cased (True) or lower-cased (False) model (by default: False).
cuda – boolean value for using CUDA to compute the embeddings; True enables it (by default: False).

Returns:

WEBERT object
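
The three constructors accept different keyword arguments: only BERT takes language, and SciBERT takes neither language nor model. When one set of options is shared across models, it can help to filter it per class. The sketch below does that using only the signatures documented on this page; `constructor_kwargs` and `ACCEPTED` are hypothetical names, and the keys follow the values of the -bm option.

```python
# Keyword arguments accepted by each constructor, read off the signatures
# documented on this page.
ACCEPTED = {
    "Bert": {"language", "stopwords", "model", "cased", "cuda"},
    "Beto": {"stopwords", "model", "cased", "cuda"},
    "SciBert": {"stopwords", "cased", "cuda"},
}


def constructor_kwargs(model_name, options):
    """Keep only the options that the chosen class's signature accepts.

    `constructor_kwargs` is a hypothetical helper, not part of WEBERT.
    """
    if model_name not in ACCEPTED:
        raise ValueError(f"unknown model: {model_name}")
    return {k: v for k, v in options.items() if k in ACCEPTED[model_name]}


options = {"language": "spanish", "model": "large", "cased": True, "cuda": False}
print(constructor_kwargs("SciBert", options))
# prints: {'cased': True, 'cuda': False}
```

Silently dropping unsupported options is one design choice; raising an error instead would make mismatched configurations more visible.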

get_bert_embeddings(path, dynamic=True, static=False)

SciBERT embeddings computation using Transformers. It transforms the texts into SciBERT embeddings and stores them in CSV files.

Parameters:

path – path to save the embeddings.
dynamic – boolean variable to compute the dynamic embeddings (by default: True).
static – boolean variable to compute the static embeddings (by default: False).

Returns:

static embeddings if static=True

preprocessing(inputs)

Text pre-processing.

Parameters:

inputs – input data.

Returns:

preprocessed text
