lexicons_builder documentation

The lexicons_builder package aims to provide a basic API to create lexicons related to specific words.

Key principle: Given the input words, the main algorithm looks for synonyms and neighbors in the synonym dictionaries, in the NLP model(s) provided and in WordNet. For each newly retrieved term, it looks again for neighbors or synonyms, and so on, until the requested depth is reached.
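The key principle can be pictured as a breadth-first expansion. The sketch below is not the package's implementation: `expand`, `toy_lookup` and `TOY_SYNONYMS` are hypothetical names, and the toy table stands in for a real dictionary, NLP model or WordNet lookup.

```python
from collections import deque

# Hypothetical toy synonym table standing in for a dictionary, an NLP model or WordNet.
TOY_SYNONYMS = {
    "eat": ["consume", "dine"],
    "consume": ["devour", "eat"],
    "dine": ["feast"],
    "devour": ["gobble"],
}


def toy_lookup(word):
    """Return the direct neighbors of a word in the toy table."""
    return TOY_SYNONYMS.get(word, [])


def expand(root_words, lookup, max_depth):
    """Return {word: depth} for every term reached within max_depth lookups."""
    depths = {word: 0 for word in root_words}
    queue = deque(root_words)
    while queue:
        word = queue.popleft()
        if depths[word] >= max_depth:
            continue  # do not look further than the requested depth
        for neighbor in lookup(word):
            if neighbor not in depths:  # keep only the shallowest depth
                depths[neighbor] = depths[word] + 1
                queue.append(neighbor)
    return depths


print(expand(["eat"], toy_lookup, 2))
# {'eat': 0, 'consume': 1, 'dine': 1, 'devour': 2, 'feast': 2}
```

With a depth of 2, 'gobble' is never reached because it sits three lookups away from 'eat'.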

The general method is implemented on 3 different supports:

  1. Synonyms dictionaries (see the section "Complete list of synonyms dictionaries")

  2. NLP language models (Word2Vec or FastText format)

  3. WordNet (or WOLF)

The output can be a text file, a Turtle file or a Graph object. See the QuickStart section for examples.

Note

The synonyms coming from the web are retrieved by scraping each web page, which means that a change in a site's HTML might return wrong results.
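To illustrate why scraping is fragile, here is a minimal sketch using only the standard library. The markup (`<li class="synonym">`) and the names `SynonymParser` and `scrape` are hypothetical and are not taken from the package or from any real website.

```python
from html.parser import HTMLParser


class SynonymParser(HTMLParser):
    """Collect the text of every <li class="synonym"> element (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.synonyms = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "synonym") in attrs:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.synonyms.append(data.strip())


def scrape(html):
    """Return the synonyms found in an HTML string."""
    parser = SynonymParser()
    parser.feed(html)
    return parser.synonyms


old_page = '<ul><li class="synonym">novel</li><li class="synonym">volume</li></ul>'
new_page = '<ul><li class="syn">novel</li><li class="syn">volume</li></ul>'

print(scrape(old_page))  # ['novel', 'volume']
print(scrape(new_page))  # [] : the class name changed, so nothing is found
```

A renamed class is enough to make the scraper return an empty or wrong result without raising any error.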

Contents

License

The MIT License (MIT)

Copyright (c) 2021 GLNB

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Contributors

Changelog

Version 0.1

Installation

With pip

It is recommended to use a virtual environment.

$ python -m venv env
$ source env/bin/activate
$ pip install lexicons-builder

From source

To install the module from source:

$ pip install git+https://github.com/GuillaumeLNB/lexicons_builder

Download NLP models (optional)

Here’s a non-exhaustive list of websites where you can download NLP models manually. The models should be in word2vec or fasttext format.

Link                                                       Language(s)
https://fauconnier.github.io/#data                         French
https://wikipedia2vec.github.io/wikipedia2vec/pretrained/  Multilingual
http://vectors.nlpl.eu/repository/                         Multilingual
https://github.com/alexandres/lexvec#pre-trained-vectors   Multilingual
https://fasttext.cc/docs/en/english-vectors.html           English / Multilingual
https://github.com/mmihaltz/word2vec-GoogleNews-vectors    English
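For orientation, the text variant of the word2vec format (also used by fastText `.vec` files) starts with a header line giving the vocabulary size and the vector dimension, followed by one word and its vector per line. The sketch below reads such data and ranks neighbors by cosine similarity. The sample vectors are made up, and `load_vectors`, `cosine` and `nearest` are hypothetical helpers, not functions of lexicons_builder.

```python
import io
import math

# Made-up sample in word2vec text format: header "<vocab size> <dimension>".
SAMPLE_VEC = """4 3
book 0.9 0.1 0.0
novel 0.8 0.2 0.1
read 0.5 0.5 0.0
apple 0.0 0.1 0.9
"""


def load_vectors(handle):
    """Parse a word2vec text stream into {word: [float, ...]}."""
    _count, dim = map(int, handle.readline().split())
    vectors = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        if len(parts) != dim + 1:
            continue  # skip malformed or empty lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def nearest(vectors, word, topn=2):
    """Return the topn words closest to `word`, most similar first."""
    target = vectors[word]
    scored = [(cosine(target, vec), other) for other, vec in vectors.items() if other != word]
    scored.sort(reverse=True)
    return [other for _, other in scored[:topn]]


vectors = load_vectors(io.StringIO(SAMPLE_VEC))
print(nearest(vectors, "book"))  # ['novel', 'read']
```

Real models hold hundreds of thousands of vectors, so a dedicated library is a better fit than this pure-Python loop.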

Download WordNet

>>> import nltk
>>> nltk.download()

Download WOLF (French WordNet) (optional)

$ # download WOLF (French WordNet), if needed
$ wget https://gforge.inria.fr/frs/download.php/file/33496/wolf-1.0b4.xml.bz2
$ # (and extract it)
$ bzip2 -d wolf-1.0b4.xml.bz2

QuickStart

Command line interface (CLI)

To get related words from input words through the CLI, run:

$ python -m lexicons_builder <words>  \
      --lang <LANG>                 \
      --out-file <OUTFILE>          \
      --format <FORMAT>             \
      --depth <DEPTH>               \
      --nlp-model <NLP_MODEL_PATHS> \
      --web                         \
      --wordnet                     \
      --wolf-path <WOLF_PATH>       \
      --strict
With:
  • <words> The word(s) we want to get synonyms from

  • <LANG> The word language (eg: fr, en, nl, …)

  • <DEPTH> The depth we want to dig in the models, websites, …

  • <OUTFILE> The file where the results will be stored

  • <FORMAT> The wanted output format (txt with indentation, ttl or xlsx)

At least ONE of the following options is needed:
  • --nlp-model <NLP_MODEL_PATHS> The path to the nlp model(s)

  • --web Search online for synonyms

  • --wordnet Search on WordNet using nltk

  • --wolf-path <WOLF_PATH> The path to WOLF (French wordnet)

Optional:
  • --strict Remove non-relevant words
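The option rules above can be sketched as a small helper that assembles the command line and refuses to run without at least one source. `build_cli_args` is a hypothetical helper, not part of the package, and passing several paths after a single --nlp-model flag is an assumption.

```python
import sys


def build_cli_args(words, lang, out_file, fmt, depth,
                   nlp_models=(), web=False, wordnet=False,
                   wolf_path=None, strict=False):
    """Build the argument list for `python -m lexicons_builder` (hypothetical helper)."""
    # At least one source of related words must be selected.
    if not (nlp_models or web or wordnet or wolf_path):
        raise ValueError(
            "select at least one of --nlp-model, --web, --wordnet, --wolf-path"
        )
    args = [sys.executable, "-m", "lexicons_builder", *words,
            "--lang", lang,
            "--out-file", out_file,
            "--format", fmt,
            "--depth", str(depth)]
    if nlp_models:
        # Assumption: several model paths follow one --nlp-model flag.
        args += ["--nlp-model", *nlp_models]
    if web:
        args.append("--web")
    if wordnet:
        args.append("--wordnet")
    if wolf_path:
        args += ["--wolf-path", wolf_path]
    if strict:
        args.append("--strict")
    return args


args = build_cli_args(["eat", "drink"], "en", "test_en.txt", "txt", 1, wordnet=True)
print(" ".join(args[1:]))
```

The resulting list can be handed to `subprocess.run(args, check=True)` once the package is installed.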

Eg: to look for terms related to ‘eat’ and ‘drink’ on WordNet at a depth of 1, execute:

$ python -m lexicons_builder eat drink  \
      --lang        en                  \
      --out-file    test_en.txt         \
      --format      txt                 \
      --depth       1                   \
      --wordnet
$ # Note: the indentation reflects the depth at which the word was found
$ head test_en.txt
  drink
  eat
    absorb
    ade
    aerophagia
    alcohol
    alcoholic_beverage
    alcoholic_drink
    banquet
    bar_hop
    belt_down
    beverage
    bi
  ...
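The indented text output can be read back into (depth, word) pairs. This is a minimal sketch assuming two spaces per depth level, as in the listing above. `parse_indented` is a hypothetical helper, not part of the package.

```python
def parse_indented(lines, indent_width=2):
    """Turn indented lines into (depth, word) pairs.

    Depth is measured relative to the least indented line, assuming
    `indent_width` spaces per level.
    """
    rows = [line.rstrip("\n") for line in lines if line.strip()]
    if not rows:
        return []
    indents = [len(row) - len(row.lstrip(" ")) for row in rows]
    base = min(indents)
    return [((indent - base) // indent_width, row.strip())
            for indent, row in zip(indents, rows)]


sample = ["  drink\n", "  eat\n", "    absorb\n", "    ade\n"]
print(parse_indented(sample))
# [(0, 'drink'), (0, 'eat'), (1, 'absorb'), (1, 'ade')]
```

To parse a real output file, pass an open file object: `parse_indented(open("test_en.txt", encoding="utf-8"))`.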

Python

To get related terms interactively through Python, run

>>> from lexicons_builder import build_lexicon
>>> # search for related terms of 'book' and 'read' in English at depth 1 online
>>> output = build_lexicon(["book", "read"], 'en', 1, web=True)
...
>>> # we then get a graph object
>>> # output as a list
>>> output.to_list()
['PS', 'accept', 'accommodate', 'according to the rules', 'account book', 'accountability', 'accountancy', 'accountant', 'accounting', 'accounts', 'accuse', 'acquire', 'act', 'adjudge', 'admit', 'adopt', 'afl', 'agree', 'aim', "al-qur'an", 'album', 'allege', 'allocate', 'allow', 'analyse', 'analyze', 'annuaire', 'anthology', 'appear in reading', 'apply', 'appropriate', 'arrange', 'arrange for', 'arrest', 'articulate', 'ascertain' ...
>>> # output as rdf/turtle
>>> print(output)
@prefix ns1: <http://taxref.mnhn.fr/lod/property/> .
@prefix ns2: <urn:default:baseUri:#> .
@prefix ns3: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns2:PS ns1:isSynonymOf ns2:root_word_uri ;
    ns3:prefLabel "PS" ;
    ns2:comesFrom <synonyms.com> ;
    ns2:depth 1 .

ns2:accept ns1:isSynonymOf ns2:root_word_uri ;
    ns3:prefLabel "accept" ;
    ns2:comesFrom <synonyms.com> ;
    ns2:depth 1 .
...

>>> # output to an indented file
>>> output.to_text_file("filename.txt")
>>> with open("filename.txt") as f:
...     print(f.read(1000))
...
read
book
  PS
  accept
  accommodate
  according to the rules
  account book
  accountability
...
>>> # output to xlsx file
>>> output.to_xlsx_file("results.xlsx")

>>> # full search with 2 nlp models, wordnet and on the web
>>> # download and extract google word2vec model
>>> # from https://github.com/mmihaltz/word2vec-GoogleNews-vectors
>>>
>>> # download and extract FastText models
>>> # from https://fasttext.cc/docs/en/english-vectors.html
>>>
>>> nlp_models = ["GoogleNews-vectors-negative300.bin", "wiki-news-300d-1M.vec"]
>>> output = build_lexicon(["book", "letter"], "en", 1, web=True, wordnet=True, nlp_model_paths=nlp_models)
>>> # can take a while
>>> len(output.to_list())
614
>>> # deleting non-relevant words
>>> output.pop_non_relevant_words()
>>> len(output.to_list())
57

Note

If the depth parameter is too high (higher than 3), the words found may seem unrelated to the root words, and the computation can take a long time.
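A back-of-the-envelope estimate shows why deep searches get slow. The sketch assumes, purely for illustration, that every word yields 20 new terms and that no term is found twice, so the figures are an upper bound and not a measurement of the package. `max_terms` is a hypothetical helper.

```python
def max_terms(branching, depth):
    """Upper bound on the terms found per root word, without duplicates."""
    return sum(branching ** level for level in range(1, depth + 1))


# Hypothetical branching factor of 20 new terms per word.
for depth in range(1, 5):
    print(depth, max_terms(20, depth))
# 1 20
# 2 420
# 3 8420
# 4 168420
```

Each extra level multiplies the number of lookups by roughly the branching factor, which is costly when every lookup is a web request.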

Note

All word senses are treated equally, which means that you might get terms that seem unrelated to the input word. Eg: looking for the word ‘test’ might give you words linked to sea urchins, as a ‘test’ is also a type of shell (https://en.wikipedia.org/wiki/Test_(biology)).

lexicons_builder

lexicons_builder package

Subpackages
  lexicons_builder.graphs package
    Submodules
      lexicons_builder.graphs.graphs module
  lexicons_builder.nlp_model_explorer package
    Submodules
      lexicons_builder.nlp_model_explorer.explorer module
  lexicons_builder.scrappers package
    Submodules
      lexicons_builder.scrappers.scrappers module
  lexicons_builder.touch_file package
    Submodules
      lexicons_builder.touch_file.touch_file module
  lexicons_builder.wordnet_explorer package
    Submodules
      lexicons_builder.wordnet_explorer.explorer module

Complete list of synonyms dictionaries

Here’s the list of the synonyms dictionaries the scrapers look up.

Website                   Language(s)
synonymes.com             French
les-synonymes.com         French
leconjugueur.lefigaro.fr  French
crisco2.unicaen.fr        French
synonyms.reverso.net      English, French, Spanish, Italian, German
lexico.com                English
synonyms.com              English
mijnwoordenboek.nl        Dutch
synonyme.de               German
sapere.virgilio.it        Italian
sinonim.org               Russian
synonymonline.ru          Russian

Potential issues

As this package is still under development, it is likely you will run into some issues.

Please report any issue at https://github.com/GuillaumeLNB/lexicons_builder/issues.

Note

If you encounter an issue, feel free to raise it on GitHub.