Updated 30 June 2020.

TWSO runs on commdev, in /greenstone/greenstone3/web/sites/twso. All modifications are contained in the site.

Setting up TWSO
###################

In a recent Greenstone 3 installation, check out twso into the sites folder:
svn co http://svn.greenstone.org/main/trunk/model-sites-dev/twso

cd into the collection:
cd twso/collect/twso

Populate the import folder by copying the contents of import/Programmes from either the existing collection on commdev (/greenstone/greenstone3/web/sites/twso/collect/twso) or from storage (/nzdl-storage/TWSO-Backup/twso-site/collect/twso).

Populate the videos folder from either of the same two places.

Updating TWSO collection
##########################

I have Greenstone 3 installed locally, with twso site and collection.

Ian sends PDF files of the programmes, plus text/word doc for the list of players, and the pieces metadata.

Players list: this should be a text file in the following form. Edit it if it is not.

conductor Lastname, Firstname
soloist Lastname, Firstname
Lastname, Firstname (all the players listed here)
Lastname, Firstname
....

(If you don't get this text file, see instructions at the end of this file for how to generate it.)
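
A minimal sketch of how a name list in this form could be sanity-checked before running the scripts. This is not the code in masternamecreator2.py or metadatacreator.py. The function names parse_name_list and badly_formed are made up for illustration, and the rule that a role is given by the first word on the line is an assumption based on the layout shown above.

```python
def parse_name_list(text):
    """Split a name list into conductors, soloists and players.

    Assumption: a line starting with the word 'conductor' or 'soloist'
    gives that role; every other non-blank line is a player.
    """
    result = {"conductor": [], "soloist": [], "player": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        first, _, rest = line.partition(" ")
        role = first.lower()
        if role in ("conductor", "soloist") and rest.strip():
            result[role].append(rest.strip())
        else:
            result["player"].append(line)
    return result


def badly_formed(names):
    """Return the names that are not in 'Lastname, Firstname' form."""
    bad = []
    for name in names:
        parts = name.split(",")
        # Exactly one comma, with something on both sides of it.
        if len(parts) != 2 or not all(p.strip() for p in parts):
            bad.append(name)
    return bad
```

Read the name list file into a string, pass it to parse_name_list, then run badly_formed over the "player" list to see any lines that still need editing by hand.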

Once you have the name list file (eg Aug_2019_name_list.txt), check the names against master_name_list.txt, to make sure that we use consistent spelling and notation across all programmes.

The easiest way to do this:

Make a backup of the master_name_list.txt - copy to master_name_list.backup

Run

python masternamecreator2.py Aug_2019_name_list.txt

This will add any new names from Aug_2019_name_list into the master_name_list.

Now, do a diff:
diff master_name_list.txt master_name_list.backup

The differences will be any new names. Check the master for these new names and make sure that they are not alternate spellings of names already in the list. Also, some names include maiden or former names,
eg Oliver, Bev and Oliver (Formerly Nation), Beverley

NOTE: master_list_notes.txt has some info about people and different spellings etc. Look through this first before you check the differences - helps you to know what to do when you find some.

If there is a new variant which is wrong, remove the new variant from the master list, and change the name in the programme name list.
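
A rough sketch of how the variant check could be partly automated, as a helper alongside the manual diff. It is not part of the existing scripts. The function name possible_variants and the 0.8 similarity cutoff are assumptions chosen for illustration. It flags master names that are spelled similarly to a new name, or that share its surname, so a person can decide whether they are the same player.

```python
import difflib


def surname_of(name):
    """Surname part of 'Lastname (formerly X), Firstname', lower-cased."""
    return name.split(",")[0].split("(")[0].strip().lower()


def possible_variants(new_names, master_names, cutoff=0.8):
    """Map each new name to existing master names that look similar.

    The cutoff of 0.8 is an arbitrary starting point, not a tuned value.
    """
    variants = {}
    for name in new_names:
        others = [m for m in master_names if m != name]
        # Near-identical spellings, eg a hyphen versus a space.
        close = difflib.get_close_matches(name, others, n=3, cutoff=cutoff)
        # Same surname, which catches short versus long first names.
        same_surname = [m for m in others if surname_of(m) == surname_of(name)]
        hits = sorted(set(close) | set(same_surname))
        if hits:
            variants[name] = hits
    return variants
```

Pass the new names from the diff as new_names and the lines of master_name_list.txt as master_names. Anything it reports still needs checking against master_list_notes.txt; it only suggests candidates.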

Once you have all the names listed correctly, then generate the metadata.xml file:

python metadatacreator.py Aug_2019_name_list.txt Aug_2019
The last argument is the name of the PDF file, without the .pdf file extension.

This will create a metadata.xml file called Aug_2019-metadata.xml.

It will list all the players, plus conductor and soloist if these were included in the name_list file.

Open up this metadata file and add the extra metadata:
Copy and paste a list of empty elements from metadata-skeleton.xml, to save on typing.

The first three may have been done for you depending on how much of the programme text was in the text file.
pd.Player - format LastName, FirstName. or LastName (nee MaidenName), FirstName. or LastName (formerly PreviousName), FirstName (for change of name that is not due to marriage).
pd.Soloist (same format) include orchestral and vocal soloists, narrators (but not MC)
pd.Conductor (same format as player name)

pd.Location  format Location 1 &amp; Location 2 &amp; Location 3...
pd.Date  format yyyymmdd. Add multiple dates separately
pd.formatDate  if there are multiple dates, then add this, format like 21/22 November 2014
pd.Composer - surname only, unless there are composers with same surname.
pd.Piece - format: composer - piece title, opus number
pd.Title - concert title
pd.SubTitle - if concert has a subtitle
pd.CoPerformer - if other groups are part of the concert. eg Cantando Choir.

If the composer has done an arrangement, for Piece put 
Composer (Arranged) - Piece name
Or if both composer and arranger are listed, put eg
"Narro arr. Isaac" for both composer and piece.

MCs are not added.
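
A small sketch of the pd.Date and pd.formatDate formats described above, in case the dates are being typed in for several programmes. The function names pd_dates and pd_format_date are made up for illustration. The display format is assumed from the single example "21/22 November 2014", so it only handles dates within one month.

```python
from datetime import date

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def pd_dates(dates):
    """Return one yyyymmdd string per concert date, in date order."""
    return [d.strftime("%Y%m%d") for d in sorted(dates)]


def pd_format_date(dates):
    """Return a display string in the style '21/22 November 2014'.

    Only handles dates in the same month and year; anything else
    needs writing by hand.
    """
    ordered = sorted(dates)
    first = ordered[0]
    if any((d.year, d.month) != (first.year, first.month) for d in ordered):
        raise ValueError("dates span more than one month; format by hand")
    days = "/".join(str(d.day) for d in ordered)
    return "{} {} {}".format(days, MONTHS[first.month - 1], first.year)
```

Each string from pd_dates goes into its own pd.Date element, and the string from pd_format_date goes into a single pd.formatDate element.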

Add these two files (pdf and metadata.xml) to the import/Programmes/year folder. Create the year folder if it is a new year.
Add them into the collection using
incremental-rebuild.pl -site twso twso
Or you can rebuild the entire collection using 
full-rebuild.pl -site twso twso


Notes:
* All the scripts are doing is trying to identify player names. If you can't get these to work properly, you can just manually create the metadata.xml file from scratch, and add all the players in by hand.
* The collection uses UnknownPlugin to import the PDF files, so no conversion is done. A full rebuild therefore doesn't take very long.


Uploading to commdev
##########################

 * ssh to commdev

ssh commdev.nzdl.org

 * sudo to nzdl-gs3 user.

sudo su - nzdl-gs3

 * update import folder:
 
cd /greenstone/greenstone3/web/sites/twso/collect/twso/import/Programmes

rsync -pavHt kjdon@toetoe.cms.waikato.ac.nz:/Scratch/kjdon/gs3-pei-jones-plus-twso/web/sites/twso/collect/twso/import/Programmes/ .
(use appropriate user and paths)

 * update index
 
Delete the old backup index (eg index.jun2016).
Rename the current index to a backup, eg index.may2018

rsync the new one:

rsync -pavHt kjdon@toetoe.cms.waikato.ac.nz:/Scratch/kjdon/gs3-pei-jones-plus-twso/web/sites/twso/collect/twso/index .

  * restart tomcat
  
Log out of the nzdl-gs3 user. As yourself:

restart tomcat:
sudo systemctl restart greenstone3

  * backup the new import files

Back as nzdl-gs3 user, backup the new import files to /nzdl-storage/TWSO-Backup/twso-site/collect/twso/import

If you need to go back and change all occurrences of a name
###########################################################
Sometimes it turns out that the master name variant you have been using is no longer correct, eg if someone changes their name, so all the old occurrences now need to be changed to X (formerly Y), etc.

You can make a backup of import folder first if you like (in case you muck up the commands).
cp -r import import.save

* cd into Programmes folder:
cd import/Programmes

* find all the metadata files:
find . -name '*metadata.xml'

* test with a grep
find . -name '*metadata.xml' -exec grep Garcia-Gil {} \;
  - this should list all the lines that you want changing

* do the replacement
find . -name '*metadata.xml' -exec sed -i 's/Garcia-Gil/Garcia Gil/g' {} \;

* list all the tilde (~) backup files
find . -name '*metadata.xml~'

* then remove them
find . -name '*metadata.xml~' -exec rm {} \;

* check that you have only changed the bits you wanted
cd ../../
diff -r import import.save
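
The same test-then-replace steps can also be sketched in Python, which avoids shell quoting problems and gives a count per file. This is an alternative sketch, not an existing script in the collection. The function name replace_in_metadata is made up, and it assumes the metadata files are UTF-8. It does not make its own backups, so still copy import to import.save first.

```python
import pathlib


def replace_in_metadata(root, old, new, dry_run=True):
    """Replace old with new in every *metadata.xml file under root.

    Returns a list of (path, number_of_occurrences). With dry_run=True
    nothing is written, which mirrors the grep test step.
    Assumption: the files are UTF-8 encoded.
    """
    changed = []
    for path in sorted(pathlib.Path(root).rglob("*metadata.xml")):
        text = path.read_text(encoding="utf-8")
        count = text.count(old)
        if count:
            changed.append((path, count))
            if not dry_run:
                path.write_text(text.replace(old, new), encoding="utf-8")
    return changed
```

Run it once with the default dry_run=True from the collection folder, with root set to import/Programmes, and check the list of files. Then run it again with dry_run=False and finish with the diff -r check above.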

*****************************
Legacy instructions:
*****************************

Extracting the name list
###########################

If you don't get the players list in the right form, here's how to extract them:

Generate a text file of the PDF. This is used to extract player names. The easiest way is to copy and paste the list of players from the programme into a text file.

If the PDF is old and you can't cut and paste, then you will either need to type the players' names out manually, or you can OCR the file and copy out the list of players.

   Scan the file using ABBYY FineReader on Katherine's laptop. PDF -> Word.
   Save the Word file, then in Word do a Save As plain text, Unicode encoding.
   Edit the text file so that just the section with players' names is left.
   Or, you can open up the Word doc, and copy and paste the list of players section into a text file.

You may need to modify the formatting.
It should look like
Instrument
player
player

Instrument
player
player

etc.

Notes on format:
* there must be a blank line between each section of instrument + players
* the actual instrument doesn't matter, as long as it is recognised as an instrument so that the list of names gets added correctly. In fact, if you are having trouble with the formatting, just put all under a single instrument name. This actually makes the manual part of namefinder processing below much easier.
* it won't like things like harp / keyboard - just change to a single instrument
* if you get a new instrument name/format, you can add it into the roles file for next time.
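
A short sketch of how a file in this layout can be checked for sections that will not be recognised. It is not how namefinder1.py works internally; the function name parse_sections is made up, and the roles are passed in as a plain list because the format of the roles file is not described here.

```python
def parse_sections(text, roles):
    """Split blank-line-separated blocks into (instrument, [players]).

    A block whose first line is not in roles is returned with None as
    the instrument, so that missed sections are easy to spot.
    """
    known = {r.lower() for r in roles}
    sections = []
    block = []
    # The extra "" at the end flushes the last block.
    for raw in text.splitlines() + [""]:
        line = raw.strip()
        if line:
            block.append(line)
        elif block:
            if block[0].lower() in known:
                sections.append((block[0], block[1:]))
            else:
                sections.append((None, block))
            block = []
    return sections
```

Any section that comes back with None needs its heading changed to a single known instrument, or the heading added to the roles file, before running the name finder.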

Once you have this text version of the players, in the OCRPrograms folder, run:

python namefinder1.py nameoftxtfile

This will prompt you to add names.
 * y = yes, 
 * e = edit, if you need to change the format, 
 * space = don't add + move to the next name. (Don't use n = no as it will ignore any more in that section). Have the pdf programme open as you go so you can check off names.
Keep an eye out for missed sections, eg if it doesn't recognise the instrument.
And names with more than two words will get processed wrongly.

The output is _name_list.txt. Rename it to match the input PDF file, eg Aug_2019_name_list.txt