🏃♂️ Quickstart
This guide takes you through all the steps needed to set up a Planchet instance and run a simple NER job over sample text using a spaCy worker script.
On the server
On the server, we need to install Planchet and download some news headlines data into an accessible directory. We then duplicate the data 200 times to make the file large enough for the example.
git clone https://github.com/savkov/planchet.git
cd planchet
mkdir data
wget https://raw.githubusercontent.com/explosion/prodigy-recipes/master/example-datasets/news_headlines.jsonl -O data/news_headlines.jsonl
python -c "news=open('data/news_headlines.jsonl').read();open('data/news_headlines.jsonl', 'w').write(''.join([news for _ in range(200)]))"
export PLANCHET_REDIS_PWD='my-super-secret-password-%$^@'
make install
make install-redis
make run
Note that the service will run at 0.0.0.0:5005 on your host machine. If you want to use a different host or port, use the make parameters:
make run HOST=my.host.com PORT=6000
Note: this guide will not work as written if you run Planchet in a Docker container. If you want to do that, you will need to alter the client script as indicated in its comments.
On the client
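Before moving on to the client, it can help to confirm that the service is accepting connections. The following is a minimal sketch that uses only the Python standard library; the helper name `is_reachable` is hypothetical and not part of Planchet. It only checks that a TCP connection can be opened and says nothing about whether the service behind the port is healthy.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection raises an OSError subclass on refusal or timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

With the default `make run` settings, calling `is_reachable("127.0.0.1", 5005)` on the server machine should return `True` once the service is up. From another machine, use the server's hostname or IP address instead.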
On the client side we need to install the Planchet client and spaCy.
pip install planchet spacy tqdm
python -m spacy download en_core_web_sm
export PLANCHET_REDIS_PWD=<your-redis-password>
Then we write the following script in a file called spacy_ner.py, making sure to fill in the placeholders.
from planchet import PlanchetClient
import spacy
from tqdm import tqdm
nlp = spacy.load("en_core_web_sm")
PLANCHET_HOST = '0.0.0.0' # <--- CHANGE IF NEEDED
PLANCHET_PORT = 5005
url = f'http://{PLANCHET_HOST}:{PLANCHET_PORT}'
client = PlanchetClient(url)
job_name = 'spacy-ner-job'
metadata = {  # NOTE: this assumes planchet has access to this path
    'input_file_path': './data/news_headlines.jsonl',  # <--- change to /data/[...] if using docker
    'output_file_path': './data/entities.jsonl'  # <--- change to /data/[...] if using docker
}
# make sure you don't use the clean_start option here
client.start_job(job_name, metadata, 'JsonlReader', writer_name='JsonlWriter')
# make sure the number of items is large enough to avoid blocking the server
n_items = 100
headlines = client.get(job_name, n_items)
while headlines:
    ents = []
    print('Processing headlines batch...')
    for id_, item in tqdm(headlines):
        item['ents'] = [ent.text for ent in nlp(item['text']).ents]
        ents.append((id_, item))
    client.send(job_name, ents)
    headlines = client.get(job_name, n_items)
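The script follows a simple pull-process-push pattern: request a batch, process each item, send the results back, and stop once an empty batch comes back. The sketch below isolates that loop so it can be tried without a running server. `InMemoryClient`, `run_worker` and `count_words` are hypothetical names introduced for illustration only. The stand-in client mimics just the `get` and `send` calls used in the script, and assumes, as the `while headlines:` condition implies, that `get` returns an empty list when no items remain.

```python
from typing import Callable, Dict, List, Tuple

Item = Tuple[int, Dict]


class InMemoryClient:
    """Hypothetical stand-in for PlanchetClient; not part of Planchet."""

    def __init__(self, records: List[Dict]):
        # Pair every record with an integer id, like the (id_, item) tuples above.
        self._pending: List[Item] = list(enumerate(records))
        self.written: List[Item] = []

    def get(self, job_name: str, n_items: int) -> List[Item]:
        # Hand out the next n_items records; an empty list means the job is done.
        batch, self._pending = self._pending[:n_items], self._pending[n_items:]
        return batch

    def send(self, job_name: str, items: List[Item]) -> None:
        self.written.extend(items)


def run_worker(client, job_name: str, n_items: int,
               process: Callable[[Dict], Dict]) -> int:
    """Pull batches until an empty one arrives; return the number of batches."""
    n_batches = 0
    batch = client.get(job_name, n_items)
    while batch:
        results = [(id_, process(item)) for id_, item in batch]
        client.send(job_name, results)
        n_batches += 1
        batch = client.get(job_name, n_items)
    return n_batches


def count_words(item: Dict) -> Dict:
    # Placeholder for the spaCy entity extraction step in the script above.
    return {**item, "n_words": len(item["text"].split())}
```

For example, `run_worker(InMemoryClient([{"text": "a b"}, {"text": "c"}]), "demo", 1, count_words)` processes two batches of one item each. Because each worker only asks for the next batch, several copies of the same loop can run side by side against one server.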
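Finally, we want to do some parallel processing with 8 processes. We can start each process manually, or we can use the GNU parallel tool to start them all. The sequence number passed to each process only serves to spawn one process per number; the script itself ignores it.
seq -w 0 7 | parallel -j 8 python spacy_ner.py {}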
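If the parallel tool is not installed, the same effect can be approximated from Python. This is a minimal sketch using only the standard library; `launch_workers` is a hypothetical helper, not part of Planchet. It starts several copies of a script with the current interpreter and waits for all of them to finish.

```python
import subprocess
import sys
from typing import List


def launch_workers(script: str, n_workers: int) -> List[int]:
    """Start n_workers copies of script and return their exit codes."""
    # Start all processes first so that they run concurrently.
    procs = [
        subprocess.Popen([sys.executable, script])
        for _ in range(n_workers)
    ]
    # Then block until every worker has exited.
    return [proc.wait() for proc in procs]
```

Calling `launch_workers("spacy_ner.py", 8)` from the directory containing the script would start the 8 workers; an exit code of 0 for each indicates that the worker finished without an error.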