QuickStart
Command line interface (CLI)
To get related words for one or more input words through the CLI, run:
$ python -m lexicons_builder <words> \
    --lang <LANG> \
    --out-file <OUTFILE> \
    --format <FORMAT> \
    --depth <DEPTH> \
    --nlp-model <NLP_MODEL_PATHS> \
    --web \
    --wordnet \
    --wolf-path <WOLF_PATH> \
    --strict
- With:
<words>
    The word(s) we want to get synonyms from
<LANG>
    The word language (eg: fr, en, nl, …)
<DEPTH>
    The depth we want to dig in the models, websites, …
<OUTFILE>
    The file where the results will be stored
<FORMAT>
    The wanted output format (txt with indentation, ttl or xlsx)
- At least ONE of the following options is needed:
--nlp-model <NLP_MODEL_PATHS>
    The path to the nlp model(s)
--web
    Search online for synonyms
--wordnet
    Search on WordNet using nltk
--wolf-path <WOLF_PATH>
    The path to WOLF (French wordnet)
- Optional:
--strict
    Remove non relevant words
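The option rules above can be sketched in Python as a small helper that assembles the command line and checks that at least one source of related terms is selected. This is only an illustration: `build_cli_args` is a hypothetical name that is not part of lexicons_builder, and passing several model paths after a single `--nlp-model` flag is an assumption.

```python
from typing import List, Optional, Sequence


def build_cli_args(
    words: Sequence[str],
    lang: str,
    out_file: str,
    fmt: str = "txt",
    depth: int = 1,
    nlp_model_paths: Optional[Sequence[str]] = None,
    web: bool = False,
    wordnet: bool = False,
    wolf_path: Optional[str] = None,
    strict: bool = False,
) -> List[str]:
    """Hypothetical helper: assemble the argument list for the CLI shown above."""
    if not words:
        raise ValueError("at least one input word is required")
    # The documentation requires at least one source of related terms.
    if not (nlp_model_paths or web or wordnet or wolf_path):
        raise ValueError(
            "pick at least one of: nlp_model_paths, web, wordnet, wolf_path"
        )

    args = ["python", "-m", "lexicons_builder", *words]
    args += ["--lang", lang, "--out-file", out_file]
    args += ["--format", fmt, "--depth", str(depth)]
    if nlp_model_paths:
        # Assumption: several model paths follow a single --nlp-model flag.
        args += ["--nlp-model", *nlp_model_paths]
    if web:
        args.append("--web")
    if wordnet:
        args.append("--wordnet")
    if wolf_path:
        args += ["--wolf-path", wolf_path]
    if strict:
        args.append("--strict")
    return args


cli_args = build_cli_args(["eat", "drink"], "en", "test_en.txt", wordnet=True)
print(" ".join(cli_args))
# To actually run the command:
#   import subprocess; subprocess.run(cli_args, check=True)
```

Building the arguments as a list rather than a single string avoids shell quoting problems when a word or path contains spaces.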
E.g., to look for terms related to ‘eat’ and ‘drink’ on WordNet at a depth of 1, execute:
$ python -m lexicons_builder eat drink \
    --lang en \
    --out-file test_en.txt \
    --format txt \
    --depth 1 \
    --wordnet
$ # Note: the indentation reflects the depth at which the word was found
$ head test_en.txt
drink
eat
  absorb
  ade
  aerophagia
  alcohol
  alcoholic_beverage
  alcoholic_drink
  banquet
  bar_hop
  belt_down
  beverage
  bi
...
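Since the txt format encodes depth through indentation, the file can be read back and grouped by depth. The following is a minimal sketch, assuming only that a deeper word has more leading whitespace than a shallower one; `group_words_by_depth` is a hypothetical helper, and the sample text is a hand-written excerpt in the style of the output above, not a real file.

```python
from collections import defaultdict
from typing import Dict, List


def group_words_by_depth(text: str) -> Dict[int, List[str]]:
    """Group the words of an indented lexicon by depth.

    Assumption: deeper words have strictly more leading whitespace.
    The indentation unit itself is not assumed; widths are ranked.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        expanded = line.expandtabs(4)
        indent = len(expanded) - len(expanded.lstrip())
        entries.append((indent, line.strip()))

    # Rank the distinct indentation widths: smallest width -> depth 0.
    widths = sorted({indent for indent, _ in entries})
    depth_of = {width: depth for depth, width in enumerate(widths)}

    grouped: Dict[int, List[str]] = defaultdict(list)
    for indent, word in entries:
        grouped[depth_of[indent]].append(word)
    return dict(grouped)


# Illustrative sample only; with a real file, use open("test_en.txt").read().
sample = "drink\neat\n  absorb\n  ade\n  aerophagia\n"
by_depth = group_words_by_depth(sample)
print(by_depth)
```

Ranking the indentation widths only works when every depth level is present in the file, which holds for output produced level by level.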
Python
To get related terms interactively through Python, run
>>> from lexicons_builder import build_lexicon
>>> # search for related terms of 'book' and 'read' in English at depth 1 online
>>> output = build_lexicon(["book", "read"], 'en', 1, web=True)
...
>>> # we then get a graph object
>>> # output as a list
>>> output.to_list()
['PS', 'accept', 'accommodate', 'according to the rules', 'account book',
 'accountability', 'accountancy', 'accountant', 'accounting', 'accounts',
 'accuse', 'acquire', 'act', 'adjudge', 'admit', 'adopt', 'afl', 'agree',
 'aim', "al-qur'an", 'album', 'allege', 'allocate', 'allow', 'analyse',
 'analyze', 'annuaire', 'anthology', 'appear in reading', 'apply',
 'appropriate', 'arrange', 'arrange for', 'arrest', 'articulate',
 'ascertain' ...
>>> # output as rdf/turtle
>>> print(output)
@prefix ns1: <http://taxref.mnhn.fr/lod/property/> .
@prefix ns2: <urn:default:baseUri:#> .
@prefix ns3: <http://www.w3.org/2004/02/skos/core#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ns2:PS ns1:isSynonymOf ns2:root_word_uri ;
    ns3:prefLabel "PS" ;
    ns2:comesFrom <synonyms.com> ;
    ns2:depth 1 .

ns2:accept ns1:isSynonymOf ns2:root_word_uri ;
    ns3:prefLabel "accept" ;
    ns2:comesFrom <synonyms.com> ;
    ns2:depth 1 .
...
>>> # output to an indented file
>>> output.to_text_file("filename.txt")
>>> with open("filename.txt") as f:
...     print(f.read(1000))
...
read
book
  PS
  accept
  accommodate
  according to the rules
  account book
  accountability
...
>>> # output to xlsx file
>>> output.to_xlsx_file("results.xlsx")
>>> # full search with 2 nlp models, wordnet and on the web
>>> # download and extract the Google word2vec model
>>> # from https://github.com/mmihaltz/word2vec-GoogleNews-vectors
>>>
>>> # download and extract the FastText models
>>> # from https://fasttext.cc/docs/en/english-vectors.html
>>>
>>> nlp_models = ["GoogleNews-vectors-negative300.bin", "wiki-news-300d-1M.vec"]
>>> output = build_lexicon(["book", "letter"], "en", 1, web=True, wordnet=True, nlp_model_paths=nlp_models)
>>> # can take a while
>>> len(output.to_list())
614
>>> # deleting non relevant words
>>> output.pop_non_relevant_words()
>>> len(output.to_list())
57
Note
If the depth parameter is too high (higher than 3), the words found may seem unrelated to the root words. The search can also take a long time to compute.
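Why a high depth drifts away from the root words can be illustrated with a toy breadth-first expansion. This is not the library's algorithm, only a sketch of the general idea: `TOY_SYNONYMS` and `expand` are invented for illustration, and the synonym table is made up.

```python
from typing import Dict, List

# Hypothetical toy synonym table, invented for illustration only.
TOY_SYNONYMS: Dict[str, List[str]] = {
    "book": ["volume", "reserve"],
    "volume": ["loudness", "quantity"],
    "reserve": ["stockpile", "shyness"],
    "loudness": ["noise"],
    "quantity": ["amount"],
    "stockpile": ["hoard"],
    "shyness": ["timidity"],
}


def expand(roots: List[str], max_depth: int) -> Dict[str, int]:
    """Breadth-first expansion: map each word to the depth where it was first found."""
    found = {root: 0 for root in roots}
    frontier = list(roots)
    for depth in range(1, max_depth + 1):
        next_frontier = []
        for word in frontier:
            for neighbour in TOY_SYNONYMS.get(word, []):
                if neighbour not in found:
                    found[neighbour] = depth
                    next_frontier.append(neighbour)
        frontier = next_frontier
    return found


for depth in (1, 2, 3):
    words = expand(["book"], depth)
    newly_found = sorted(w for w, d in words.items() if d == depth)
    print(depth, len(words), newly_found)
```

With this toy table the lexicon grows from 3 to 7 to 11 words as the depth goes from 1 to 3, and the depth-3 words (such as "timidity") have little to do with "book", because each step follows a synonym of a synonym.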
Note
All word senses are treated equally, so you might get terms that seem unrelated to the input word. For example, looking up ‘test’ might return words linked to sea urchins, because a ‘test’ is also a type of shell (https://en.wikipedia.org/wiki/Test_(biology)).